Feb 26

Hypergeometric and Multinomial Distributions

Mindli Team

AI-Generated Content


In the world of data science and statistics, not every process is a simple coin flip. You often need to model scenarios where the odds change with each draw from a finite pool or where outcomes fall into more than two categories. Understanding the hypergeometric and multinomial distributions provides the precise mathematical tools for these real-world situations, from quality assurance in manufacturing to analyzing survey data with multiple responses.

The Hypergeometric Distribution: Sampling Without Replacement

The hypergeometric distribution models the probability of drawing a specific number of successes in a fixed number of draws from a finite population, without replacement. This "without replacement" condition is crucial—each draw changes the composition of the population for the next draw, making the trials dependent. This contrasts sharply with the binomial distribution, where each trial is independent (sampling with replacement).

A classic scenario is quality control. Imagine you have a shipment of 100 items, 10 of which are defective. If you randomly select 5 items for inspection, what is the probability that exactly 2 are defective? This is a perfect application for the hypergeometric distribution because you are sampling from a finite lot without putting items back.

The probability mass function (PMF) for the hypergeometric distribution is defined by four parameters:

  • N: The total population size.
  • K: The total number of success states in the population (e.g., defective items).
  • n: The number of draws (sample size).
  • k: The number of observed successes.

The PMF is given by:

P(X = k) = C(K, k) × C(N − K, n − k) / C(N, n)

Here, C(a, b) denotes the binomial coefficient "a choose b." The numerator counts the ways to choose k successes from the K available and n − k failures from the N − K available failures. The denominator counts all possible ways to draw any n items from the population of N.

Let's solve the quality inspection example step-by-step:

  1. Parameters: N = 100, K = 10, n = 5, k = 2.
  2. Plug into the PMF:
     P(X = 2) = C(10, 2) × C(90, 3) / C(100, 5) = (45 × 117,480) / 75,287,520 ≈ 0.0702
  3. Interpretation: There is approximately a 7.02% chance of finding exactly 2 defective items in a random sample of 5.
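This calculation can be sketched in a few lines of Python using only the standard library (`math.comb` requires Python 3.8+; the function name is illustrative):

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): probability of k successes in n draws, without
    replacement, from a population of N containing K successes."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Quality-inspection example: 100 items, 10 defective, sample of 5
p = hypergeom_pmf(2, N=100, K=10, n=5)
print(round(p, 4))  # → 0.0702
```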

The expected value (mean) and variance of a hypergeometric random variable are:

  • Expected value: E[X] = n × (K / N)
  • Variance: Var(X) = n × (K / N) × (1 − K / N) × (N − n) / (N − 1)

The term (N − n) / (N − 1) is the finite population correction (FPC) factor. It adjusts the variance downward compared to a binomial distribution, reflecting the reduced uncertainty that comes from sampling without replacement from a finite group.
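A quick numeric check of both formulas for the inspection example (a sketch; the variable names are illustrative):

```python
N, K, n = 100, 10, 5
p = K / N

mean = n * p                       # n * (K/N)
fpc = (N - n) / (N - 1)            # finite population correction
var_hyper = n * p * (1 - p) * fpc  # hypergeometric variance
var_binom = n * p * (1 - p)        # binomial variance, no FPC

print(mean)                 # → 0.5
print(round(var_hyper, 4))  # → 0.4318, strictly below the binomial 0.45
```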

The Multinomial Distribution: Multiple Outcome Categories

The multinomial distribution is a generalization of the binomial distribution. While a binomial trial has only two possible outcomes (success/failure), a multinomial trial has k possible mutually exclusive categories. Examples abound: the roll of a fair die (6 categories), the political affiliation of a voter (Democrat/Republican/Independent/Other), or the classification of a manufactured part (Excellent/Good/Acceptable/Defective).

The multinomial distribution describes the probabilities of counts for each category across a fixed number of independent trials. The key parameters are:

  • n: The fixed number of independent trials.
  • k: The number of possible categories.
  • p_i: The probability of outcome i occurring on any given trial, where p_i ≥ 0 and p_1 + p_2 + ⋯ + p_k = 1.

The probability mass function gives the joint probability that category 1 occurs x_1 times, category 2 occurs x_2 times, and so on, with x_1 + x_2 + ⋯ + x_k = n:

P(X_1 = x_1, …, X_k = x_k) = [n! / (x_1! × x_2! × ⋯ × x_k!)] × p_1^(x_1) × p_2^(x_2) × ⋯ × p_k^(x_k)

The multinomial coefficient n! / (x_1! × x_2! × ⋯ × x_k!) counts the number of ways to arrange n objects where there are x_1 of type 1, x_2 of type 2, etc. The product of powers of the p_i gives the probability of any one specific sequence yielding those counts.

Consider a concrete example in categorical data analysis. Suppose in a certain city, the proportions of voters favoring Candidates A, B, and C are 0.50, 0.30, and 0.20, respectively. If you randomly poll 10 voters, what is the probability that 5 favor A, 3 favor B, and 2 favor C?

  1. Parameters: n = 10, p_A = 0.50, p_B = 0.30, p_C = 0.20, x_A = 5, x_B = 3, x_C = 2.
  2. Apply the PMF:
     P(5, 3, 2) = [10! / (5! × 3! × 2!)] × 0.50^5 × 0.30^3 × 0.20^2 = 2,520 × 0.03125 × 0.027 × 0.04 ≈ 0.0851
  3. There is about an 8.51% chance of obtaining this specific breakdown in a sample of 10.
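A minimal sketch of the multinomial PMF in Python (standard library only; `math.prod` requires Python 3.8+, and the function name is illustrative):

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Joint probability of the given category counts."""
    n = sum(counts)
    coef = factorial(n) // prod(factorial(x) for x in counts)
    return coef * prod(p ** x for p, x in zip(probs, counts))

# Voter poll: P(5 for A, 3 for B, 2 for C) with p = (0.5, 0.3, 0.2)
p = multinomial_pmf([5, 3, 2], [0.5, 0.3, 0.2])
print(p)  # ≈ 0.08505, i.e., about 8.51%
```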

For a multinomial distribution, the expected value for the count in category i is straightforward: E[X_i] = n × p_i. The variance for each category is Var(X_i) = n × p_i × (1 − p_i), which looks identical to the binomial variance. However, the categories are not independent; they are negatively correlated because if one category occurs more, others must occur less to keep the total fixed at n. The covariance between counts for two different categories i and j is Cov(X_i, X_j) = −n × p_i × p_j.
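The negative covariance can be checked exhaustively for a small case by summing over every possible count vector (a self-contained sketch; n = 4 keeps the enumeration tiny):

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    n = sum(counts)
    coef = factorial(n) // prod(factorial(x) for x in counts)
    return coef * prod(p ** x for p, x in zip(probs, counts))

n, probs = 4, [0.5, 0.3, 0.2]

# Every (x1, x2, x3) with x1 + x2 + x3 = n
outcomes = [(a, b, n - a - b)
            for a in range(n + 1) for b in range(n + 1 - a)]

e1  = sum(x[0] * multinomial_pmf(x, probs) for x in outcomes)
e2  = sum(x[1] * multinomial_pmf(x, probs) for x in outcomes)
e12 = sum(x[0] * x[1] * multinomial_pmf(x, probs) for x in outcomes)

cov = e12 - e1 * e2
print(cov, -n * probs[0] * probs[1])  # both ≈ -0.6
```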

Common Pitfalls

1. Confusing the Hypergeometric and Binomial Distributions. The most frequent error is using the binomial distribution for a "without replacement" scenario. Pitfall: You have a deck of 52 cards and draw 5. Using the binomial to find the probability of 2 Aces incorrectly assumes the probability of drawing an Ace (4/52) stays constant across draws. Correction: The trials are dependent. The probability changes after each card is removed. You must use the hypergeometric distribution with N = 52, K = 4, n = 5.
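The two models give visibly different answers for the card example (a sketch using `math.comb`):

```python
from math import comb

# Exact: hypergeometric with N = 52, K = 4 Aces, n = 5 draws
p_hyper = comb(4, 2) * comb(48, 3) / comb(52, 5)

# Wrong model: binomial pretends each draw keeps a constant 4/52 chance
p_binom = comb(5, 2) * (4 / 52) ** 2 * (48 / 52) ** 3

print(round(p_hyper, 4))  # → 0.0399
print(round(p_binom, 4))  # → 0.0465
```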

2. Misapplying the Multinomial Distribution to Non-Exclusive Categories. The multinomial model requires that each trial results in exactly one of the k possible outcomes. Pitfall: Analyzing survey data where respondents can "select all that apply" from a list of interests. Here, a single trial (person) can belong to multiple categories, violating the mutual exclusivity assumption. Correction: This structure requires different models, such as modeling each category with its own Bernoulli distribution, leading to a multivariate binary outcome.

3. Ignoring the Finite Population Correction (FPC) in Approximations. When the population size N is very large relative to the sample size n, the hypergeometric distribution can be approximated by the binomial distribution. Pitfall: Using the binomial approximation when sampling a significant fraction of the population (a common rule of thumb is n/N greater than about 5–10%). This overestimates variance. Correction: Always check the sampling fraction. If it's large, use the hypergeometric or apply the FPC term to the binomial variance.
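The effect of the sampling fraction on the variance gap can be seen directly (an illustrative sketch; the lot numbers are made up):

```python
N, K = 100, 30          # hypothetical lot: 100 items, 30 "successes"
p = K / N

for n in (5, 20, 50):   # sampling fractions of 5%, 20%, 50%
    fpc = (N - n) / (N - 1)
    var_hyper = n * p * (1 - p) * fpc
    var_binom = n * p * (1 - p)      # ignores the FPC
    print(f"n/N = {n/N:.2f}: hypergeometric {var_hyper:.3f} "
          f"vs binomial {var_binom:.3f}")
```

As n/N grows, the binomial variance increasingly overstates the true (hypergeometric) uncertainty.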

4. Overlooking the Dependent Nature of Multinomial Category Counts. While the expected value formula E[X_i] = n × p_i is simple, analysts sometimes forget that the counts X_i and X_j are not independent. Pitfall: Performing separate binomial tests on each category of a multinomial outcome as if they were independent experiments. This misses the covariance structure and can lead to incorrect conclusions. Correction: Use statistical tests designed for multinomial data, like the chi-squared goodness-of-fit test, which accounts for the joint distribution of all categories.

Summary

  • The hypergeometric distribution is the correct model for counting successes when sampling without replacement from a finite population. Its variance includes a finite population correction factor that reduces uncertainty compared to the binomial model.
  • The multinomial distribution generalizes the binomial to scenarios with multiple categorical outcomes. It models the joint distribution of counts across all categories in a fixed number of independent trials.
  • A key application of the hypergeometric distribution is in quality inspection and auditing, where lot sizes are finite and sampling is destructive or permanent.
  • The multinomial distribution is fundamental for analyzing categorical data, such as survey responses, genetic traits, or product preference studies, where results fall into multiple buckets.
  • Avoid the critical mistake of using the binomial distribution for sampling without replacement. Always check if your trials are independent and if the population proportion remains constant.
  • Remember that in the multinomial setting, the counts for different categories are negatively correlated, as they must sum to the fixed total number of trials n.
