Feb 26

Chi-Square Tests

Mindli Team

AI-Generated Content

In a world overflowing with categorical data—survey responses, A/B test outcomes, product preferences—you need robust tools to move from observing patterns to confirming relationships. The chi-square test family provides these tools, allowing you to objectively test whether observed frequencies in your data significantly differ from what you would expect by random chance. Mastering these tests is fundamental for data science, enabling you to validate hypotheses about proportions, dependencies between variables, and the fit of data to theoretical models without relying on parametric assumptions about means and variances.

The Foundation: Categorical Data and the Chi-Square Statistic

All chi-square tests analyze categorical data, where observations are classified into mutually exclusive groups or categories (e.g., "Yes/No/Maybe," "Product A/B/C"). The core logic compares what you observe in your sample to what you would theoretically expect if a specific null hypothesis were true. The measure of this discrepancy is the chi-square statistic (χ²).

The formula for the chi-square statistic is universally expressed as:

χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ

Here, Oᵢ represents the observed frequency (the actual count in your data), and Eᵢ represents the expected frequency (the count you would anticipate if the null hypothesis were correct). You calculate this term for every cell in your table (e.g., every category or every combination of categories), sum the values, and obtain your χ² statistic. A value of zero indicates a perfect match between observed and expected counts. As the discrepancy grows, the χ² value increases. You then compare this calculated value to a critical value from the chi-square distribution, which is defined by its degrees of freedom (df). The degrees of freedom essentially represent the number of independent pieces of information used to calculate the expected frequencies.

Chi-Square Goodness-of-Fit Test

The chi-square goodness-of-fit test answers a simple but powerful question: "How well does the distribution of my sample data fit a hypothesized or theoretical distribution?" You use this test when you have one categorical variable with two or more levels. For instance, you might test if the colors of cars in a parking lot follow a uniform distribution (equal proportions), or if the market share of four smartphone brands matches the national averages.

The procedure is straightforward. First, state your null hypothesis (H₀): the observed frequencies follow the specified expected distribution. The alternative (H₁) is that they do not. Second, calculate the expected frequencies. If testing for uniformity across k categories with n total observations, the expected frequency for each category is E = n / k. If testing against a specific set of proportions pᵢ, you calculate Eᵢ = n × pᵢ.

Consider a simple example. A dice manufacturer wants to test if a die is fair (a uniform distribution where each face has a 1/6 probability). You roll it 60 times. Under H₀, the expected count for each face is 60 × (1/6) = 10. If your observed counts were [8, 12, 9, 11, 10, 10], you would calculate χ² using the formula (here, χ² = 1.0). The degrees of freedom for a goodness-of-fit test is df = k − 1, where k is the number of categories. Here, df = 6 − 1 = 5. You would then compare your calculated χ² to a critical value from the chi-square distribution with 5 df to determine if the die is likely biased.
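The die example can be sketched in plain Python. This is a minimal illustration: the counts come from the example above, and the 5% critical value (11.070 for 5 df) is hard-coded here, where in practice it would come from a chi-square table or a statistics library.

```python
# Goodness-of-fit test for a fair die: 60 rolls, uniform expectation under H0.
observed = [8, 12, 9, 11, 10, 10]
n = sum(observed)                      # 60 total rolls
k = len(observed)                      # 6 categories (die faces)
expected = [n / k] * k                 # 10 expected rolls per face under H0

# Chi-square statistic: sum of (O - E)^2 / E over all categories.
chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = k - 1                             # 5 degrees of freedom

# Critical value for alpha = 0.05 with 5 df (from a chi-square table).
critical_value = 11.070
reject_h0 = chi_sq > critical_value

print(f"chi-square = {chi_sq:.3f}, df = {df}, reject H0: {reject_h0}")
```

With χ² = 1.0 far below 11.070, the observed counts are entirely consistent with a fair die.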

Chi-Square Test of Independence

The chi-square test of independence is arguably the more common application in data science. It assesses whether there is a statistically significant association between two categorical variables. It asks: "Are these two variables related, or are they independent of each other?" Common examples include testing if gender is associated with voting preference, if treatment type is associated with recovery outcome, or if website design (A/B) is associated with conversion (Yes/No).

The data is organized into a contingency table, an R x C matrix where R is the number of rows (categories of Variable A) and C is the number of columns (categories of Variable B). Each cell contains the observed frequency for that combination.

The null hypothesis (H₀) is that the two variables are independent. The expected frequency for each cell under independence is calculated as:

E = (row total × column total) / grand total

This formula derives directly from the probability rule for independent events: P(A and B) = P(A) * P(B). You estimate P(A) with the row proportion and P(B) with the column proportion, then multiply by the grand total to get an expected count.

After calculating the expected frequency for every cell (calculate these values precisely rather than eyeballing them), you compute the χ² statistic using the same core formula. The degrees of freedom for a test of independence is df = (R − 1)(C − 1). A significant χ² value leads you to reject the null hypothesis of independence, concluding there is evidence of an association between the variables. It’s crucial to remember that this test indicates association, not causation.
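A minimal sketch of these steps, using a hypothetical 2x2 table for the website-design A/B example mentioned above (the counts are invented for illustration):

```python
# Chi-square test of independence on a 2x2 contingency table.
# Hypothetical data: website design (A/B) vs. conversion (Yes/No).
observed = [[30, 10],   # design A: 30 converted, 10 did not
            [20, 40]]   # design B: 20 converted, 40 did not

row_totals = [sum(row) for row in observed]         # [40, 60]
col_totals = [sum(col) for col in zip(*observed)]   # [50, 50]
grand_total = sum(row_totals)                       # 100

# Expected frequency per cell: (row total * column total) / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_sq = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
             for i in range(2) for j in range(2))
df = (2 - 1) * (2 - 1)                              # (R-1)(C-1) = 1

print(f"chi-square = {chi_sq:.3f}, df = {df}")
```

Here χ² ≈ 16.67 with 1 df, well above the 5% critical value of 3.841, so you would reject independence and conclude the design is associated with conversion.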

Assumptions, Corrections, and Interpretation

For your chi-square test results to be valid, key assumptions must be met. First, the data must be in raw frequency counts, not percentages or proportions. Second, categories must be mutually exclusive (each observation fits in only one cell). Third, observations must be independent. The most critical assumption concerns expected frequencies. As a general rule, no more than 20% of the expected cells should have a value less than 5, and all expected cell counts should be 1 or greater. Violations of this assumption increase the risk of a Type I error (falsely rejecting the null hypothesis).
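The expected-frequency rule is easy to check programmatically before trusting a result. A small sketch (the helper name and the example table are hypothetical):

```python
# Flag tables that violate the usual expected-frequency rule:
# all expected counts >= 1, and no more than 20% of cells below 5.
def check_expected_frequencies(expected):
    cells = [e for row in expected for e in row]
    below_one = [e for e in cells if e < 1]
    below_five = [e for e in cells if e < 5]
    ok = not below_one and len(below_five) / len(cells) <= 0.20
    return ok, below_five

# Two of four expected counts fall below 5 (50% > 20%): assumption violated.
ok, small_cells = check_expected_frequencies([[2.5, 7.5], [4.0, 16.0]])
print(ok, small_cells)
```

When such a check fails, the remedies discussed below (more data, collapsed categories, or Fisher's Exact Test) apply.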

When you have a 2x2 contingency table (each variable has two categories) and small sample sizes lead to expected frequencies between 5 and 10, applying Yates correction for continuity is often recommended. This correction, also called the continuity correction, adjusts the formula to:

χ² = Σ (|Oᵢ − Eᵢ| − 0.5)² / Eᵢ

This adjustment makes the approximation to the theoretical chi-square distribution more conservative, reducing the likelihood of an inflated Type I error. For larger tables or when expected frequencies are adequately large, Yates correction is not necessary.
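The effect of the correction is easy to see side by side. A sketch on a hypothetical small 2x2 table (counts invented so that all expected frequencies land between 5 and 10):

```python
# Compare uncorrected and Yates-corrected chi-square on a small 2x2 table.
observed = [[6, 8],
            [10, 4]]
row_totals = [sum(r) for r in observed]
col_totals = [sum(c) for c in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]  # 8, 6, 8, 6

# Standard statistic: (O - E)^2 / E summed over all four cells.
plain = sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
            for i in range(2) for j in range(2))

# Yates correction: subtract 0.5 from each |O - E| before squaring.
yates = sum((abs(observed[i][j] - expected[i][j]) - 0.5) ** 2 / expected[i][j]
            for i in range(2) for j in range(2))

print(f"uncorrected = {plain:.4f}, Yates-corrected = {yates:.4f}")
```

The corrected statistic (1.3125) is noticeably smaller than the uncorrected one (2.3333), which is exactly the conservative shrinkage described above.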

Common Pitfalls

  1. Ignoring the Expected Frequency Assumption: The most frequent error is running the test when many expected cell counts are below 5. This can produce misleading p-values. The solution is to either collect more data, collapse categories (if logically justified), or use an alternative test like Fisher's Exact Test, which is designed for small samples.
  2. Misinterpreting a Significant Result: A significant chi-square test of independence tells you an association exists, but not its strength or direction. A very weak association can be statistically significant with a large enough sample size. The solution is to always compute a measure of effect size alongside the test, such as Cramér's V or the phi coefficient (φ), to quantify the strength of the relationship.
  3. Treating the Test as a Test of Proportions: While related, the chi-square test of independence is not identical to the z-test for the difference between two proportions. The hypotheses and underlying models differ slightly. For a 2x2 table comparing two independent proportions, both approaches are often equivalent (the chi-square statistic equals the squared z statistic), but understanding the distinction is important for proper application.
  4. Using Percentages in Calculations: You must always perform calculations on the raw frequency counts, not converted percentages or decimals. Inputting percentages into the formula will produce an incorrect statistic and invalid results.
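The effect-size follow-up from pitfall 2 can be sketched directly; Cramér's V is defined as V = sqrt(χ² / (n × min(R − 1, C − 1))). The inputs below reuse the hypothetical design/conversion table from earlier (χ² ≈ 16.667, n = 100):

```python
import math

# Cramér's V: effect size for an R x C chi-square test of independence.
#   V = sqrt(chi_sq / (n * min(R - 1, C - 1)))
def cramers_v(chi_sq, n, n_rows, n_cols):
    return math.sqrt(chi_sq / (n * min(n_rows - 1, n_cols - 1)))

# Hypothetical 2x2 example: chi-square = 16.667 on n = 100 observations.
v = cramers_v(16.667, 100, 2, 2)
print(f"Cramér's V = {v:.3f}")
```

For a 2x2 table, V reduces to the absolute value of the phi coefficient; values near 0 indicate a negligible association and values near 1 a very strong one.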

Summary

  • Chi-square tests are the primary tools for hypothesis testing with categorical data, comparing observed frequencies to expected frequencies under a null hypothesis.
  • The goodness-of-fit test evaluates how well a single categorical variable's distribution matches a hypothesized distribution, with df = k − 1.
  • The test of independence evaluates the association between two categorical variables in an R x C contingency table, with expected frequencies calculated as E = (row total × column total) / grand total and df = (R − 1)(C − 1).
  • Valid inference depends on meeting key assumptions, most importantly that expected frequencies are sufficiently large (typically >5). For 2x2 tables with small samples, Yates correction provides a more conservative estimate.
  • Always follow a significant test with an analysis of standardized residuals and an effect size measure (like Cramér's V) to interpret the nature and strength of any discovered association, avoiding the pitfall of equating statistical significance with practical importance.
