📚Study Guide: Inference for Categorical Data: Chi-Square
Unit 8: Inference for Categorical Data: Chi-Square
Chi-square procedures extend inference to categorical variables with multiple categories. The chi-square goodness-of-fit test compares observed counts to the counts expected under a hypothesized distribution for a single categorical variable. The chi-square test for independence assesses whether two categorical variables are associated, using a two-way table built from one sample, while the chi-square test for homogeneity compares the distribution of one categorical variable across several populations or treatment groups. All chi-square tests rely on a statistic that measures the discrepancy between observed and expected counts. Three conditions must be verified before testing: Random, Large Counts (all expected counts >= 5), and Independence. Degrees of freedom are the number of categories minus 1 for goodness-of-fit, and (rows - 1)*(columns - 1) for two-way tables.
Key Concepts
- Chi-Square Statistic: Chi-square = sum [(Observed - Expected)^2 / Expected]. Larger values indicate greater discrepancy.
- Goodness-of-Fit: Tests whether a single categorical variable follows a specified distribution. df = number of categories - 1.
- Test for Independence: Tests whether two categorical variables are associated in one population. df = (rows - 1)*(columns - 1).
- Test for Homogeneity: Tests whether distributions of a categorical variable are the same across several populations. Same procedure as independence but different design and hypotheses.
- Expected Counts: (row total * column total) / grand total for two-way tables; n * p_i for goodness-of-fit.
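- Conditions: Random sample or randomized experiment, all expected counts >= 5, and independent observations (when sampling without replacement, the sample should be no more than 10% of the population).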
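The goodness-of-fit calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a full test procedure: the function name goodness_of_fit and the die-roll counts are hypothetical, chosen only to show how n * p_i, the chi-square sum, and df = k - 1 fit together.

```python
def goodness_of_fit(observed, null_proportions):
    """Return (statistic, df, expected) for a chi-square goodness-of-fit test.

    observed: list of observed counts, one per category.
    null_proportions: hypothesized proportions p_i (should sum to 1).
    """
    n = sum(observed)
    # Expected count for each category is n * p_i.
    expected = [n * p for p in null_proportions]
    # Chi-square = sum of (O - E)^2 / E over all categories.
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Degrees of freedom = number of categories - 1.
    df = len(observed) - 1
    return statistic, df, expected


# Hypothetical data: 120 rolls of a die, H0: all six faces are equally likely.
observed_rolls = [25, 17, 15, 23, 24, 16]
stat, df, expected = goodness_of_fit(observed_rolls, [1 / 6] * 6)
print("expected counts:", expected)
print("chi-square:", round(stat, 2), "df:", df)
```

With these made-up counts every expected count is 20, so the Large Counts condition is met, and the statistic would then be compared to a chi-square distribution with 5 degrees of freedom.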
Vocabulary
- Observed count: The actual number of observations in a category from the sample data.
- Expected count: The number of observations expected in a category if the null hypothesis is true.
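- Chi-square distribution: A right-skewed distribution of non-negative values whose shape is determined by its degrees of freedom; p-values for chi-square tests are upper-tail areas under this curve.
- Goodness-of-fit: A test comparing the observed distribution of one categorical variable to a hypothesized distribution.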
- Two-way table: A table displaying counts for combinations of two categorical variables.
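To make the chi-square distribution entry concrete, here is a rough sketch of how an upper-tail area (the p-value) could be computed for a given statistic and degrees of freedom using only the Python standard library. The function name chi_square_p_value is hypothetical. The sketch assumes integer degrees of freedom and uses a standard recurrence for the upper incomplete gamma function; on the AP exam this value comes from a calculator or table instead.

```python
import math


def chi_square_p_value(statistic, df):
    """Upper-tail area P(X >= statistic) for a chi-square distribution.

    Assumes df is a positive integer and statistic is non-negative.
    """
    if df < 1 or statistic < 0:
        raise ValueError("df must be >= 1 and statistic must be >= 0")
    h = statistic / 2.0
    # Starting values: df = 1 uses the complementary error function,
    # df = 2 is a simple exponential tail.
    if df % 2 == 1:
        a, tail = 0.5, math.erfc(math.sqrt(h))
    else:
        a, tail = 1.0, math.exp(-h)
    # Each step raises the degrees of freedom by 2:
    # Q(a + 1, h) = Q(a, h) + h^a * e^(-h) / Gamma(a + 1)
    while a < df / 2.0:
        tail += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return tail


# The familiar 5% critical values should give areas close to 0.05.
for crit, df in [(3.841, 1), (5.991, 2), (11.070, 5)]:
    print(df, round(chi_square_p_value(crit, df), 3))
```

Because the distribution is right-skewed and the test statistic grows with discrepancy, only the upper tail is used; a larger statistic always gives a smaller p-value for the same df.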
Formulas
- Chi-square = sum [ (O - E)^2 / E ]
- Expected = (row total * column total) / grand total
- Goodness-of-fit df = k - 1
- Independence/Homogeneity df = (r - 1)*(c - 1)
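The two-way table formulas above can be combined into one short calculation. The sketch below is illustrative only: the function name two_way_chi_square and the 2-by-3 table of counts are hypothetical, and the same arithmetic applies whether the design calls for a test of independence or a test of homogeneity.

```python
def two_way_chi_square(table):
    """Return (statistic, df, expected) for a two-way table of observed counts.

    table: list of rows, each row a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected = (row total * column total) / grand total for every cell.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    # Chi-square = sum of (O - E)^2 / E over all cells.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # df = (r - 1) * (c - 1)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df, expected


# Hypothetical counts: 2 rows (groups) by 3 columns (response categories).
observed = [
    [30, 20, 10],
    [20, 30, 40],
]
stat, df, expected = two_way_chi_square(observed)
print("expected counts:", expected)
print("chi-square:", round(stat, 3), "df:", df)
```

Here the row totals are 60 and 90 and each column total is 50, so the expected counts are 20 across the first row and 30 across the second, with df = (2 - 1)*(3 - 1) = 2.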
Common Mistakes
- Using proportions instead of counts in chi-square calculations; the test requires whole-number observed counts, so convert any proportions or percentages back to counts first.
- Confusing independence and homogeneity; independence uses one sample classified by two variables, while homogeneity uses separate samples (or treatment groups) compared on one variable.
- Checking observed counts >= 5 instead of expected counts >= 5; the condition applies to expected frequencies.
- Interpreting a significant result as proving causation; chi-square tests show association, not cause.
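The Large Counts mistake above is easy to see in a small check. The sketch below uses hypothetical helper names (expected_from_null, large_counts_met) and made-up counts; it shows that the condition is evaluated on expected counts, so a small observed count does not by itself violate it.

```python
def expected_from_null(n, null_proportions):
    """Expected counts n * p_i under the null hypothesis."""
    return [n * p for p in null_proportions]


def large_counts_met(expected_counts, minimum=5):
    """True when every expected count is at least `minimum`."""
    return all(e >= minimum for e in expected_counts)


# Hypothetical case 1: one observed count is below 5, but all expected counts are 10.
observed = [3, 12, 15]
expected = expected_from_null(sum(observed), [1 / 3, 1 / 3, 1 / 3])
print(large_counts_met(expected))  # condition is met despite the observed 3

# Hypothetical case 2: observed counts look large, but one expected count is about 3.
observed = [48, 45, 7]
expected = expected_from_null(sum(observed), [0.50, 0.47, 0.03])
print(large_counts_met(expected))  # condition fails because an expected count is < 5
```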
AP Exam Strategies
- Always calculate and display expected counts in a table before computing the chi-square statistic on FRQs.
- State the hypotheses in terms of the population or variables, not the sample counts.
- For two-way tables, clearly specify whether you are testing independence or homogeneity and why.
- When the result is significant, describe the nature of the association by comparing observed and expected counts in the cells that contribute most to the chi-square statistic.
Real-World Applications
- Genetics: Goodness-of-fit tests verify whether offspring ratios match Mendelian predictions.
- Market Research: Independence tests determine if product preference is associated with age group.
- Public Health: Homogeneity tests compare disease incidence distributions across different hospitals or regions.