Feb 27

Multinomial Distribution

Mindli Team

AI-Generated Content


When you need to model an experiment where each trial has more than two possible outcomes, the binomial distribution falls short. The multinomial distribution is the essential generalization that handles multiple categories, making it a cornerstone for analyzing everything from dice games and voter surveys to text documents and consumer choices. Mastering it provides the probabilistic foundation for modeling categorical data, which is fundamental to fields like data science, genetics, and market research.

From Binomial to Multinomial

The binomial distribution models a single process with two outcomes: success and failure. The multinomial distribution extends this to $k$ mutually exclusive and exhaustive outcomes per trial. Imagine rolling a standard six-sided die: each roll (trial) results in one of six possible faces (categories). This is a classic multinomial scenario.

Formally, consider an experiment with $k$ possible outcomes. Let $p_i$ represent the probability of outcome $i$, where $0 \le p_i \le 1$ and $\sum_{i=1}^{k} p_i = 1$. If you conduct $n$ independent, identical trials, you can ask: "What is the probability of seeing outcome 1 exactly $x_1$ times, outcome 2 exactly $x_2$ times, ..., and outcome $k$ exactly $x_k$ times?" The vector $X = (X_1, \dots, X_k)$, where $X_i$ counts the occurrences of category $i$, follows a multinomial distribution. Note that $\sum_{i=1}^{k} X_i = n$.
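This setup can be illustrated by simulation: run $n$ independent categorical trials and tally the outcomes to obtain one count vector. The sketch below uses only Python's standard library; the function name and the fixed seed are illustrative choices, not part of any particular API.

```python
import random
from collections import Counter

def sample_multinomial(n, p, seed=42):
    """Draw one multinomial count vector by running n categorical trials."""
    rng = random.Random(seed)
    k = len(p)
    # random.choices performs n independent draws weighted by p
    outcomes = rng.choices(range(k), weights=p, k=n)
    tally = Counter(outcomes)
    return [tally[i] for i in range(k)]

counts = sample_multinomial(10, [1/6] * 6)  # ten rolls of a fair die
print(counts)       # one realization of (X_1, ..., X_6)
print(sum(counts))  # the counts always sum to n = 10
```

In practice, libraries such as NumPy provide a vectorized sampler for this, but the tally-of-trials view above is exactly the definition.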

The Probability Mass Function (PMF)

The probability mass function (PMF) gives the probability of observing a specific combination of counts $(x_1, x_2, \dots, x_k)$. It is a direct generalization of the binomial PMF:

$$P(X_1 = x_1, \dots, X_k = x_k) = \frac{n!}{x_1! \, x_2! \cdots x_k!} \; p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k},$$

provided $x_1 + x_2 + \cdots + x_k = n$ and each $x_i \ge 0$.

Let's break down the components:

  • $\frac{n!}{x_1! \, x_2! \cdots x_k!}$: This is the multinomial coefficient. It counts the number of distinct sequences of $n$ trials that yield $x_1$ outcomes of type 1, $x_2$ of type 2, and so on. For a binomial ($k = 2$), this reduces to the familiar $\binom{n}{x_1}$.
  • $p_1^{x_1} p_2^{x_2} \cdots p_k^{x_k}$: This is the probability of any one specific sequence of trials that has the exact count profile $(x_1, \dots, x_k)$. Multiplying these two parts gives the total probability of that profile across all possible sequences.

Worked Example: Dice Rolls. Suppose you roll a fair die 10 times ($n = 10$, $k = 6$, $p_i = 1/6$ for every face). What is the probability of rolling exactly two 1s, three 2s, one 3, zero 4s, two 5s, and two 6s? Here, $x = (2, 3, 1, 0, 2, 2)$. Applying the PMF: first, calculate the multinomial coefficient: $\frac{10!}{2! \, 3! \, 1! \, 0! \, 2! \, 2!} = \frac{3{,}628{,}800}{48} = 75{,}600$. Then, calculate the probability term: $(1/6)^{10} = 1/60{,}466{,}176$. The final probability is $75{,}600 / 60{,}466{,}176 \approx 0.00125$, or about 0.125%.
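The dice calculation can be reproduced in a few lines. This is a minimal sketch using only the standard library (libraries such as SciPy offer an equivalent `multinomial` distribution object); the function name is our own.

```python
from math import factorial, prod

def multinomial_pmf(x, p):
    """Multinomial PMF: n!/(x_1!...x_k!) * p_1^x_1 * ... * p_k^x_k."""
    n = sum(x)
    coef = factorial(n)
    for xi in x:
        coef //= factorial(xi)  # divide out each x_i! (always exact)
    return coef * prod(pi ** xi for pi, xi in zip(p, x))

x = [2, 3, 1, 0, 2, 2]   # two 1s, three 2s, one 3, no 4s, two 5s, two 6s
p = [1/6] * 6            # fair die
prob = multinomial_pmf(x, p)
print(prob)              # 75600 / 6**10, i.e. about 0.00125
```

Note that the coefficient is computed in exact integer arithmetic; only the final product involves floating point.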

Properties: Expectations and Dependencies

The marginal distribution of any single count $X_i$ is simply binomial with parameters $n$ and $p_i$. This insight allows us to immediately state key properties:

  • Expected Value: $E[X_i] = n p_i$. If you survey 1000 voters where 40% support Candidate A, you expect $1000 \times 0.4 = 400$ "A" responses.
  • Variance: $\mathrm{Var}(X_i) = n p_i (1 - p_i)$. This is identical to the variance of a binomial random variable.
  • Covariance: For two different categories $i \neq j$, $\mathrm{Cov}(X_i, X_j) = -n p_i p_j$.

The negative covariance is a crucial feature. It captures the constraint that the counts must sum to $n$. If category $i$ occurs more frequently than expected, it "uses up" trials, making it less likely for other categories (like $j$) to achieve high counts. In the voter survey, observing more responses for Candidate A directly reduces the possible number of responses for Candidate B, creating a negative relationship between their counts.
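These formulas are easy to evaluate directly. The sketch below assumes a hypothetical three-candidate survey with $n = 1000$ and support $p = (0.40, 0.35, 0.25)$ (the second and third probabilities are invented to complete the example from the text):

```python
n = 1000
p = [0.40, 0.35, 0.25]  # hypothetical support for candidates A, B, C

means = [n * pi for pi in p]                 # E[X_i] = n * p_i  -> ~400, 350, 250
variances = [n * pi * (1 - pi) for pi in p]  # Var(X_i) = n * p_i * (1 - p_i)
cov_ab = -n * p[0] * p[1]                    # Cov(X_A, X_B) = -n * p_A * p_B -> ~-140

print(means)
print(variances)
print(cov_ab)  # negative: more "A" responses leave fewer trials for "B"
```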

Applications in Data Science and Modeling

The multinomial distribution is not just for theoretical probability; it is a workhorse for modeling real-world categorical data.

  1. Categorical Data Modeling: It is the sampling model behind chi-squared goodness-of-fit tests. You compare observed category counts against the expected counts $n p_i$ to test whether a hypothesized probability vector fits the data.
  2. Natural Language Processing (NLP): In simple text classification models (like Naive Bayes), a document can be modeled as a "bag of words" drawn from a vocabulary of size $k$. The vector of word counts in a document is often assumed to follow a multinomial distribution, where $p_i$ is the probability of word $i$ appearing. This is foundational for spam filters and topic models.
  3. Survey Analysis and A/B Testing: When a survey question has multiple-choice answers or an A/B/C... test has variants, the response counts are multinomial. Analyzing these counts lets you determine if response distributions differ between groups.
  4. Genetics: In population genetics, the distribution of genotypes in a sample follows a multinomial distribution based on allele frequencies, underpinning models of inheritance and evolution.
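The goodness-of-fit idea from the first application can be sketched with Pearson's chi-squared statistic. The observed counts below are invented for illustration; in practice a library routine such as SciPy's `chisquare` would also supply the p-value.

```python
def chi_squared_statistic(observed, p):
    """Pearson's chi-squared statistic for multinomial counts vs. hypothesized p."""
    n = sum(observed)
    expected = [n * pi for pi in p]
    # sum over categories of (observed - expected)^2 / expected
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 60 rolls of a die we suspect is loaded toward six (illustrative data)
observed = [5, 8, 9, 8, 10, 20]
stat = chi_squared_statistic(observed, [1/6] * 6)
print(stat)  # compare against a chi-squared distribution with k - 1 = 5 degrees of freedom
```

A large statistic relative to the chi-squared distribution with $k - 1$ degrees of freedom is evidence against the hypothesized probability vector.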

Common Pitfalls

  1. Assuming Outcomes are Equally Likely: The formula is often introduced with fair dice or coins, leading to the misconception that each $p_i$ must equal $1/k$. In practice, the probabilities can be any non-negative values summing to 1. Always explicitly define or estimate your probability vector $p = (p_1, \dots, p_k)$.
  2. Treating Components as Independent: A major error is analyzing $X_i$ and $X_j$ as if they were independent. Remember their covariance is negative. Statistical tests or models that ignore this dependence will be invalid. For example, performing separate binomial tests on each category without adjustment inflates false discovery rates.
  3. Confusing with the Multinoulli (Categorical) Distribution: The multinoulli distribution (a.k.a. categorical distribution) describes the outcome of a single trial ($n = 1$). The multinomial describes the counts over $n$ trials. They are related but distinct: the multinomial is the sum of $n$ independent multinoulli trials.
  4. Numerical Underflow with Large n and k: Calculating the PMF directly for large $n$ and $k$ can cause numerical underflow due to tiny probabilities and huge factorials. In practice, data scientists work with log-probabilities, using the log-PMF: $\log P(x) = \log n! - \sum_{i=1}^{k} \log x_i! + \sum_{i=1}^{k} x_i \log p_i$, and employ functions like lgamma for stable computation of log-factorials.
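The log-space computation from pitfall 4 can be sketched with `math.lgamma`, using the identity $\log(m!) = \mathrm{lgamma}(m + 1)$ for non-negative integers; the function name is our own.

```python
from math import lgamma, log, exp

def multinomial_log_pmf(x, p):
    """Log of the multinomial PMF, computed stably via log-gamma."""
    n = sum(x)
    # log n! - sum_i log x_i!  (the log multinomial coefficient)
    log_coef = lgamma(n + 1) - sum(lgamma(xi + 1) for xi in x)
    # sum_i x_i * log p_i; skip zero counts so p_i = 0 is tolerated there
    log_prob = sum(xi * log(pi) for xi, pi in zip(x, p) if xi > 0)
    return log_coef + log_prob

lp = multinomial_log_pmf([2, 3, 1, 0, 2, 2], [1/6] * 6)
print(lp)  # exp(lp) recovers the dice-example probability of about 0.00125
```

For large $n$ this never forms the huge factorials or tiny products directly, so it avoids both overflow and underflow.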

Summary

  • The multinomial distribution extends the binomial to model counts across $k$ possible outcomes per trial, defined by a fixed number of trials $n$ and a probability vector $p = (p_1, \dots, p_k)$.
  • Its PMF, $P(X = x) = \frac{n!}{x_1! \cdots x_k!} \prod_{i=1}^{k} p_i^{x_i}$, consists of a counting component (the multinomial coefficient) and a probability component for individual sequences.
  • Key properties derive from its connection to the binomial: the marginal expectation is $n p_i$, the variance is $n p_i (1 - p_i)$, and the covariance between different components is negative, $\mathrm{Cov}(X_i, X_j) = -n p_i p_j$, reflecting the constraint that the counts sum to $n$.
  • It is critically applied in goodness-of-fit tests, categorical data analysis, text modeling (bag-of-words), and survey/multivariate testing, forming the backbone of many machine learning algorithms for discrete data.
  • Avoid common mistakes by remembering outcomes need not be equally likely, components are not independent, and practical implementation requires log-space computations to ensure numerical stability.
