Point Estimation and Maximum Likelihood
How do you take a collection of data and draw a meaningful conclusion about the larger, unseen world it came from? In data science, statistics, and machine learning, we constantly face this challenge. Point estimation provides the foundational answer, giving us a single "best guess"—an estimator—for an unknown population parameter based on sample data. Among all methods for constructing these estimators, Maximum Likelihood Estimation (MLE) stands out for its power, versatility, and theoretical elegance, forming the backbone of many modern algorithms from logistic regression to advanced neural network training.
What is a Point Estimator?
A point estimator is a statistic—a function of your sample data—used to infer the value of an unknown parameter in a population. It produces a single numerical value as the estimate. Formally, if you have a sample of independent observations, $X_1, X_2, \ldots, X_n$, a point estimator for a parameter $\theta$ is a function $\hat{\theta} = g(X_1, X_2, \ldots, X_n)$. Crucially, $\hat{\theta}$ is a random variable because it depends on the random sample; the specific numerical value you get from a particular dataset is called the point estimate.
For example, the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is a point estimator for the population mean $\mu$. Before you collect your data, $\bar{X}$ is a rule (average the values). After you collect data, say $\{2, 5, 11\}$, the point estimate is $\bar{x} = (2 + 5 + 11)/3 = 6$. The goal is to find estimators with good statistical properties.
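The estimator-versus-estimate distinction can be made concrete in a few lines of Python. This is an illustrative sketch using only the standard library; the population parameters and seed are arbitrary choices for the demonstration.

```python
import random

# The estimator is the *rule* "average the values"; applying it to an
# observed sample produces a single number: the point estimate.
def sample_mean(xs):
    return sum(xs) / len(xs)

# Draw a sample from a population whose mean we pretend not to know
# (here we secretly use mu = 10, sigma = 3).
random.seed(42)
sample = [random.gauss(10, 3) for _ in range(100)]

estimate = sample_mean(sample)
print(round(estimate, 2))  # a single number near the true mean of 10
```

A different random sample would yield a different estimate, which is exactly why $\hat{\theta}$ is treated as a random variable.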
Desirable Properties of Estimators
Not all estimators are created equal. We evaluate them based on their long-run behavior across many hypothetical samples from the same population. Three key properties are unbiasedness, consistency, and efficiency.
An estimator is unbiased if its expected value equals the true parameter value: $E[\hat{\theta}] = \theta$. On average, across infinite repeated samples, an unbiased estimator gets it right. The sample mean is unbiased for the population mean. However, unbiasedness alone is insufficient; an estimator could be unbiased but have enormous variance, making any single estimate unreliable.
Consistency is a more fundamental large-sample property. A consistent estimator converges in probability to the true parameter as the sample size grows: $\hat{\theta}_n \xrightarrow{p} \theta$ as $n \to \infty$. This means that with enough data, your estimate will be arbitrarily close to the truth. Consistency is often a minimal requirement for a good estimator.
Efficiency compares the precision of unbiased estimators. If $\hat{\theta}_1$ and $\hat{\theta}_2$ are both unbiased for $\theta$, $\hat{\theta}_1$ is more efficient if it has a smaller variance: $\mathrm{Var}(\hat{\theta}_1) < \mathrm{Var}(\hat{\theta}_2)$. The benchmark here is the Cramér-Rao lower bound, which defines the theoretical minimum variance any unbiased estimator can achieve; an unbiased estimator that attains this bound is called efficient.
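Efficiency can be seen directly by simulation. The following sketch (standard library only; sample size, seed, and trial count are arbitrary choices) compares two unbiased estimators of the center of a standard normal population, the sample mean and the sample median. Both are unbiased here, but the mean has the smaller sampling variance, so it is the more efficient of the two.

```python
import random
import statistics

# Repeatedly draw samples from N(0, 1) and record both estimators, then
# compare the variance of each estimator across the repetitions.
random.seed(0)
n, trials = 50, 2000
means, medians = [], []
for _ in range(trials):
    sample = [random.gauss(0, 1) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

var_mean = statistics.pvariance(means)      # ~ 1/n = 0.02
var_median = statistics.pvariance(medians)  # ~ pi/(2n) ~ 0.031
print(var_mean < var_median)  # the mean is more efficient
```

The theoretical ratio of these variances, $\pi/2 \approx 1.57$, is the classic asymptotic relative efficiency of the median versus the mean under normality.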
The Method of Moments
The method of moments is an intuitive, older technique for constructing estimators. The core idea is simple: equate sample moments (like the sample mean, variance) with their corresponding theoretical population moments (which are functions of the unknown parameters), then solve the resulting system of equations.
The procedure is straightforward:
- Compute the first $k$ sample moments. For example, the first sample moment is the mean $m_1 = \frac{1}{n}\sum_{i=1}^{n} X_i$, the second is $m_2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2$, and so on.
- Express the first $k$ population moments as functions of the unknown parameters. For a normal distribution $N(\mu, \sigma^2)$, the first population moment is $E[X] = \mu$, and the second is $E[X^2] = \sigma^2 + \mu^2$.
- Set the population moments equal to the sample moments and solve for the parameters.
Solving $m_1 = \mu$ and $m_2 = \sigma^2 + \mu^2$ gives the method of moments estimators: $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = m_2 - m_1^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$. Notice this variance estimator uses $n$ in the denominator, not $n - 1$, making it biased but consistent.
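The three steps above can be sketched in a few lines of Python (standard library only; the simulated population and seed are arbitrary choices for illustration):

```python
import random

# Method-of-moments estimates for a normal sample: equate the first two
# sample moments to E[X] = mu and E[X^2] = sigma^2 + mu^2, then solve.
def mom_normal(xs):
    n = len(xs)
    m1 = sum(xs) / n                  # first sample moment
    m2 = sum(x * x for x in xs) / n   # second sample moment
    mu_hat = m1
    sigma2_hat = m2 - m1 ** 2         # note: n in the denominator (biased)
    return mu_hat, sigma2_hat

random.seed(1)
data = [random.gauss(4, 2) for _ in range(10_000)]
mu_hat, sigma2_hat = mom_normal(data)
print(round(mu_hat, 1), round(sigma2_hat, 1))  # close to (4, 4)
```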
While simple, method of moments estimators are not always the most efficient. They serve as excellent starting points and are particularly useful for initializing more complex procedures like MLE.
Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation (MLE) is a predominant method for parameter estimation due to its optimal asymptotic properties. The principle is beautifully intuitive: choose the parameter values that make the observed data most probable.
The process begins by defining the likelihood function. For a sample of $n$ independent and identically distributed (i.i.d.) observations, this is the joint probability density (or mass) function, viewed as a function of the parameter $\theta$:

$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$$
Because products are difficult to differentiate, we almost always work with the log-likelihood function, $\ell(\theta) = \ln L(\theta) = \sum_{i=1}^{n} \ln f(x_i; \theta)$. The logarithm is a monotonic transformation, so the value of $\theta$ that maximizes $\ell(\theta)$ also maximizes $L(\theta)$.
To find the Maximum Likelihood Estimator (MLE), $\hat{\theta}_{\text{MLE}}$, we take the derivative of the log-likelihood with respect to $\theta$, set it equal to zero, and solve:

$$\frac{d\ell(\theta)}{d\theta} = 0$$

This is known as the likelihood equation. You must also verify you've found a maximum, typically by checking that the second derivative is negative.
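When the likelihood equation has no convenient closed form, the log-likelihood can be maximized numerically. The sketch below uses a deliberately crude grid search for the rate $\lambda$ of a Poisson sample (the data values and grid resolution are arbitrary choices); in practice you would hand the negative log-likelihood to a proper optimizer such as `scipy.optimize.minimize_scalar`.

```python
import math

data = [3, 1, 4, 2, 2, 5, 3, 0, 2, 3]  # hypothetical Poisson counts

def log_likelihood(lam, xs):
    # Sum of log Poisson pmfs: x*ln(lam) - lam - ln(x!)
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

# Coarse grid search over candidate lambda values in (0, 10].
grid = [0.01 * k for k in range(1, 1001)]
lam_hat = max(grid, key=lambda lam: log_likelihood(lam, data))
print(lam_hat, sum(data) / len(data))  # both are approximately 2.5
```

For the Poisson, the likelihood equation solves in closed form to $\hat{\lambda} = \bar{x}$, so the numerical maximizer landing on the sample mean is a useful sanity check.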
Deriving MLEs for Common Distributions
Example: MLE for a Normal Distribution Mean ($\sigma^2$ known). Assume $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ with $\sigma^2$ known. The PDF is $f(x; \mu) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. The log-likelihood is:

$$\ell(\mu) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$$

Dropping constants, we have $\ell(\mu) \propto -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2$. Differentiating and setting to zero:

$$\frac{d\ell}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0$$

Solving gives $\hat{\mu}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$. This aligns with our intuitive estimator.
Example: MLE for a Bernoulli Probability. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$, where $x_i \in \{0, 1\}$. The likelihood is $L(p) = \prod_{i=1}^{n} p^{x_i}(1-p)^{1-x_i} = p^{\sum x_i}(1-p)^{n - \sum x_i}$. The log-likelihood is:

$$\ell(p) = \left(\sum_{i=1}^{n} x_i\right)\ln p + \left(n - \sum_{i=1}^{n} x_i\right)\ln(1-p)$$

Differentiating and setting to zero:

$$\frac{d\ell}{dp} = \frac{\sum_{i} x_i}{p} - \frac{n - \sum_{i} x_i}{1-p} = 0$$

Solving yields $\hat{p}_{\text{MLE}} = \frac{1}{n}\sum_{i=1}^{n} x_i$, the sample proportion of successes.
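The Bernoulli derivation can be sanity-checked numerically: the closed-form MLE (the sample proportion) should score at least as high on the log-likelihood as any nearby candidate. A minimal sketch, with a hypothetical dataset of 7 successes in 10 trials:

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # 7 successes in 10 trials

def log_likelihood(p, xs):
    s = sum(xs)
    return s * math.log(p) + (len(xs) - s) * math.log(1 - p)

p_hat = sum(data) / len(data)  # closed-form MLE: the sample proportion

# Spot-check: the log-likelihood at p_hat beats nearby values of p.
assert all(log_likelihood(p_hat, data) >= log_likelihood(p, data)
           for p in (0.5, 0.6, 0.65, 0.75, 0.8))
print(p_hat)  # 0.7
```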
MLEs possess compelling properties: they are consistent, asymptotically normal (converging to a normal distribution as $n$ grows), and asymptotically efficient (achieving the Cramér-Rao lower bound for large samples). They are also invariant: if $\hat{\theta}$ is the MLE of $\theta$, then for any function $g$, the MLE of $g(\theta)$ is $g(\hat{\theta})$.
Fisher Information and The Variance of MLEs
The Fisher information, denoted $I(\theta)$, quantifies the amount of information a random sample carries about an unknown parameter $\theta$. A higher Fisher information implies the parameter can be estimated with greater precision (lower variance). For a single observation from a distribution with PDF/PMF $f(x; \theta)$, the Fisher information is defined as:

$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta} \ln f(X; \theta)\right)^{2}\right]$$

Under regularity conditions, an equivalent form is $I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2} \ln f(X; \theta)\right]$, which is the negative expected value of the second derivative (curvature) of the log-likelihood. Steeper curvature means the log-likelihood is sharply peaked, indicating the data provides strong information about $\theta$.
For an i.i.d. sample of size $n$, the Fisher information is $I_n(\theta) = n\,I(\theta)$. A key asymptotic result states that the variance of the MLE, $\hat{\theta}$, achieves the Cramér-Rao lower bound:

$$\mathrm{Var}(\hat{\theta}) \approx \frac{1}{n\,I(\theta)}$$

This provides a practical way to approximate the standard error of your MLE, which is essential for constructing confidence intervals. For example, a 95% asymptotic confidence interval is $\hat{\theta} \pm 1.96 / \sqrt{n\,I(\hat{\theta})}$.
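For the Bernoulli case, the Fisher information for one observation is $I(p) = \frac{1}{p(1-p)}$, so the asymptotic standard error of $\hat{p}$ is $\sqrt{\hat{p}(1-\hat{p})/n}$. A minimal sketch of the resulting (Wald-style) 95% interval, using hypothetical counts of 70 successes in 100 trials:

```python
import math

def wald_ci(successes, n, z=1.96):
    # 95% asymptotic CI for a Bernoulli p: p_hat +/- z / sqrt(n * I(p_hat)),
    # where I(p) = 1 / (p * (1 - p)), so the standard error is
    # sqrt(p_hat * (1 - p_hat) / n).
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

lo, hi = wald_ci(70, 100)
print(round(lo, 3), round(hi, 3))  # roughly 0.61 0.79
```

Note that this interval is only as good as the normal approximation behind it; it degrades for small $n$ or for $\hat{p}$ near 0 or 1.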
Common Pitfalls
- Confusing the Likelihood Function with the Probability Density Function (PDF): A PDF, $f(x; \theta)$, is a function of the data $x$ given a fixed parameter $\theta$. The likelihood function, $L(\theta; x)$, is the same expression but viewed as a function of the parameter $\theta$ given fixed, observed data $x$. They are mathematically identical but have fundamentally different interpretations.
- Ignoring the Support of the Distribution: The MLE must lie within the parameter space. For instance, when estimating a variance $\sigma^2$ or a probability $p$, the MLE must be non-negative or between 0 and 1, respectively. Sometimes the interior calculus solution is invalid or does not exist, and the MLE lies on the boundary of the parameter space (e.g., $\hat{p} = 0$ when a Bernoulli sample contains no successes).
- Not Checking the Second Derivative: Setting the first derivative to zero finds a critical point, which could be a minimum, maximum, or saddle point. Always verify that the second derivative is negative at the candidate $\hat{\theta}$ to confirm it's a maximum of the likelihood function.
- Applying MLE Blindly to Non-I.I.D. Data: The standard likelihood formulation assumes independence. For correlated data (e.g., time series, spatial data), the product-form likelihood $\prod_{i} f(x_i; \theta)$ is incorrect and will lead to biased estimates. You must model the joint dependence structure correctly in the likelihood.
Summary
- Point estimation provides a single best guess for an unknown population parameter using a function of sample data called an estimator. Key properties to seek are unbiasedness, consistency, and efficiency.
- The method of moments constructs estimators by equating sample moments to population moments. It is intuitive and provides consistent estimators, though they may not be efficient.
- Maximum Likelihood Estimation (MLE) is a powerful, general method that chooses parameter values to maximize the likelihood function, which represents the probability of observing the collected data.
- In practice, we maximize the log-likelihood to simplify calculations. MLEs are derived by solving the likelihood equation and possess excellent asymptotic properties: consistency, normality, and efficiency.
- The Fisher information measures the expected curvature of the log-likelihood and inversely relates to the minimum achievable variance of an unbiased estimator. The asymptotic variance of the MLE is the inverse of the Fisher information, enabling the construction of confidence intervals.