Probability Density and Cumulative Distribution Functions

Working with continuous data requires a fundamental shift in thinking from counting discrete outcomes to measuring areas under curves. For any aspiring data scientist or statistician, mastering the probability density function (PDF) and the cumulative distribution function (CDF) is non-negotiable. These are the core mathematical tools that allow you to model, analyze, and make probabilistic predictions about everything from customer wait times to stock price fluctuations.

The Probability Density Function: Modeling Likelihood as Density

For a continuous random variable $X$, it is not useful to ask for the probability that $X$ equals one specific value, such as $P(X = 5.000...)$. In a continuous space, this probability is exactly zero. Instead, we model likelihood using a probability density function (PDF), denoted $f_{X}(x)$. The PDF is a function whose output represents density, not probability.

Think of density as similar to mass per unit volume. To find the actual mass (probability), you need to integrate the density over a volume (interval). This leads to the most critical rule: The probability that $X$ falls in an interval $[a, b]$ is the area under the PDF curve over that interval.

$P (a \leq X \leq b) = \int_{a}^{b} f_{X} (x) d x$

The PDF itself has two defining properties that stem from this area interpretation. First, it is non-negative everywhere: $f_{X} (x) \geq 0$ for all $x$ . A negative density would imply a negative probability, which is impossible. Second, the total area under the entire PDF curve must equal 1, representing the certainty that $X$ takes on some value: $\int_{- \infty}^{\infty} f_{X} (x) d x = 1$ .

For example, consider a random variable $T$ modeling the time (in hours) until a server fails, with a PDF given by $f_{T} (t) = 0.5 e^{- 0.5 t}$ for $t \geq 0$ . This is an exponential distribution. The probability the server fails between 1 and 3 hours is calculated as an area: $P (1 < T < 3) = \int_{1}^{3} 0.5 e^{- 0.5 t} d t = e^{- 0.5} - e^{- 1.5} \approx 0.383.$
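
The area interpretation can be checked numerically. The following is a minimal sketch, assuming a simple midpoint rule is accurate enough for this smooth density; the helper names `pdf` and `integrate` and the upper cutoff of 60 hours are illustrative choices, not part of the original example.

```python
import math

RATE = 0.5  # failure rate per hour, taken from f_T(t) = 0.5 * exp(-0.5 t)


def pdf(t: float) -> float:
    """Exponential density for the server example; zero for negative times."""
    return RATE * math.exp(-RATE * t) if t >= 0 else 0.0


def integrate(f, a: float, b: float, n: int = 100_000) -> float:
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width


# Probability of failure between 1 and 3 hours, numerically and in closed form.
p_numeric = integrate(pdf, 1.0, 3.0)
p_exact = math.exp(-0.5) - math.exp(-1.5)

# Total area: the tail beyond 60 hours is negligible (assumed cutoff).
total_area = integrate(pdf, 0.0, 60.0)

print(round(p_numeric, 3), round(p_exact, 3))
print(round(total_area, 6))
```

Both properties of a PDF show up here: the density is never negative, and the total area comes out at 1 up to numerical error.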

The Cumulative Distribution Function: Accumulating Probability

While the PDF gives a "snapshot" of density, the cumulative distribution function (CDF) provides a running total of probability. Denoted $F_{X} (x)$ , the CDF is defined as the probability that the random variable $X$ is less than or equal to a specific value $x$ : $F_{X} (x) = P (X \leq x) .$

For a continuous variable, this is precisely the area under the PDF from negative infinity up to $x$ : $F_{X} (x) = \int_{- \infty}^{x} f_{X} (t) d t .$

The CDF is a powerful tool for probability calculations. To find $P (a < X \leq b)$ , you can simply subtract CDF values: $P (a < X \leq b) = F_{X} (b) - F_{X} (a) .$

The CDF has characteristic properties: it is a non-decreasing function that starts at 0 and approaches 1. Formally, $\lim_{x \to -\infty} F_{X}(x) = 0$ and $\lim_{x \to \infty} F_{X}(x) = 1$. Furthermore, because $X$ is continuous, $P(X \leq x) = P(X < x)$, so the CDF is a continuous function.

Returning to our server example, the CDF for the exponential distribution is $F_{T} (t) = 1 - e^{- 0.5 t}$ for $t \geq 0$ . The probability of failure before 3 hours is $F_{T} (3) = 1 - e^{- 1.5} \approx 0.777$ . The probability of failure between 1 and 3 hours is, as before, $F_{T} (3) - F_{T} (1) \approx 0.383$ .
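
The same numbers can be reproduced by evaluating the CDF directly. This is a small sketch using only the closed form given above; the function name `cdf` is an illustrative choice.

```python
import math

RATE = 0.5  # failure rate per hour from the server example


def cdf(t: float) -> float:
    """F_T(t) = 1 - exp(-0.5 t) for t >= 0, and 0 for negative t."""
    return 1.0 - math.exp(-RATE * t) if t >= 0 else 0.0


p_before_3 = cdf(3.0)             # P(T <= 3)
p_between = cdf(3.0) - cdf(1.0)   # P(1 < T <= 3) as a difference of CDF values

print(round(p_before_3, 3))
print(round(p_between, 3))
```

No integration is needed once the CDF is known: an interval probability is one subtraction.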

The Quantile Function: The Inverse of the CDF

Often, you need to answer the inverse question: "What value of $x$ corresponds to a given cumulative probability?" This is the job of the quantile function, also called the inverse CDF. If $F_{X}(x)$ is the CDF, then the quantile function $Q(p)$ is defined for a probability $p$ (where $0 \leq p \leq 1$) as: $Q(p) = \inf \{x : F_{X}(x) \geq p\}.$ For a continuous, strictly increasing CDF, this simplifies to the inverse function: $Q(p) = F_{X}^{-1}(p)$.
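
When the CDF has no convenient closed-form inverse, the definition itself suggests a numerical approach: search for the smallest $x$ with $F_{X}(x) \geq p$. The sketch below uses bisection and assumes the CDF is non-decreasing and that the caller supplies a bracket `[lo, hi]` with `cdf(lo) < p <= cdf(hi)`. The function name `quantile_bisect` and the bracket `[0, 100]` are hypothetical choices for illustration.

```python
import math


def cdf(t: float) -> float:
    """Exponential CDF with rate 0.5, used here only as a test case."""
    return 1.0 - math.exp(-0.5 * t) if t >= 0 else 0.0


def quantile_bisect(cdf_fn, p: float, lo: float, hi: float, tol: float = 1e-10) -> float:
    """Approximate inf{x : cdf_fn(x) >= p} on [lo, hi] by bisection.

    Assumes cdf_fn is non-decreasing and cdf_fn(lo) < p <= cdf_fn(hi).
    """
    if not (cdf_fn(lo) < p <= cdf_fn(hi)):
        raise ValueError("bracket must satisfy cdf(lo) < p <= cdf(hi)")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if cdf_fn(mid) >= p:
            hi = mid  # mid already reaches probability p, so the answer is at or below mid
        else:
            lo = mid  # mid falls short of p, so the answer is above mid
    return hi


median_numeric = quantile_bisect(cdf, 0.5, 0.0, 100.0)
print(round(median_numeric, 3))
```

Bisection is slow compared with a closed-form inverse, but it relies only on monotonicity, which every CDF has.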

The quantile function is used to find percentiles, critical values, and confidence intervals. The value $Q(0.5)$ is the median, the point where there is a 50% chance that $X$ falls below it. The values $Q(0.25)$ and $Q(0.75)$ are the first and third quartiles. In our exponential distribution, to find the median server failure time, we solve $F_{T}(t) = 0.5$: $1 - e^{-0.5 t} = 0.5 \implies e^{-0.5 t} = 0.5 \implies t = -2 \ln(0.5) \approx 1.386 \text{ hours}.$ Thus, $Q(0.5) \approx 1.386$.
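
The same algebra works for any probability $p$ with $0 \leq p < 1$: solving $1 - e^{-0.5 t} = p$ gives $t = -\ln(1 - p) / 0.5$. A minimal sketch of that closed form, with illustrative function names, is:

```python
import math

RATE = 0.5  # failure rate per hour from the server example


def cdf(t: float) -> float:
    """F_T(t) = 1 - exp(-0.5 t) for t >= 0."""
    return 1.0 - math.exp(-RATE * t) if t >= 0 else 0.0


def quantile(p: float) -> float:
    """Closed-form inverse of the exponential CDF, valid for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -math.log(1.0 - p) / RATE


first_quartile = quantile(0.25)
median = quantile(0.5)
third_quartile = quantile(0.75)

print(round(first_quartile, 3), round(median, 3), round(third_quartile, 3))
print(round(cdf(median), 6))  # plugging the median back into the CDF recovers 0.5
```

The case $p = 1$ is excluded because the exponential CDF only approaches 1 and never reaches it at a finite time.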

The Survival Function and Hazard Function

Two related functions offer complementary perspectives, especially in fields like reliability engineering and survival analysis. The survival function, denoted $S_{X} (x)$ , gives the probability that the random variable exceeds a value $x$ : $S_{X} (x) = P (X > x) = 1 - F_{X} (x) .$

It's called the "survival" function because in time-to-event analysis, it represents the probability of "surviving" past time $x$ . For the exponential server, $S_{T} (t) = e^{- 0.5 t}$ , meaning the probability the server survives beyond 3 hours is $e^{- 1.5} \approx 0.223$ .

Closely related is the hazard function or hazard rate, $h_{X}(x)$. It measures the instantaneous risk of an event occurring at time $x$, given that it has not yet occurred. It is defined as the ratio of the PDF to the survival function: $h_{X}(x) = \frac{f_{X}(x)}{S_{X}(x)}.$ For the exponential distribution, the exponential factors cancel and the hazard function is constant: $h_{T}(t) = \frac{0.5 e^{-0.5 t}}{e^{-0.5 t}} = 0.5$. This constant hazard rate is a defining (and often unrealistic) property of the exponential model, implying the server's failure risk does not change with age.
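
A short sketch makes the constant hazard visible by evaluating the ratio at several ages. The function names and the list of sample times are illustrative choices.

```python
import math

RATE = 0.5  # failure rate per hour from the server example


def pdf(t: float) -> float:
    """f_T(t) = 0.5 * exp(-0.5 t) for t >= 0."""
    return RATE * math.exp(-RATE * t) if t >= 0 else 0.0


def survival(t: float) -> float:
    """S_T(t) = P(T > t) = exp(-0.5 t) for t >= 0."""
    return math.exp(-RATE * t) if t >= 0 else 1.0


def hazard(t: float) -> float:
    """h_T(t) = f_T(t) / S_T(t)."""
    return pdf(t) / survival(t)


sample_times = [0.5, 1.0, 3.0, 10.0]
hazards = [hazard(t) for t in sample_times]

print(round(survival(3.0), 3))  # probability of surviving beyond 3 hours
print(hazards)                  # the same value at every age
```

A model whose hazard changed with age would show different values across the list; the exponential model cannot represent that.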

The Fundamental Relationships

The PDF, CDF, survival function, and quantile function are not isolated concepts; they are intimately connected through calculus and logic. Understanding these relationships is key to fluid problem-solving.

PDF to CDF: The CDF is the integral of the PDF: $F_{X} (x) = \int_{- \infty}^{x} f_{X} (t) d t$ .
CDF to PDF: The PDF is the derivative of the (absolutely continuous) CDF: $f_{X} (x) = \frac{d}{d x} F_{X} (x)$ . This is the Fundamental Theorem of Calculus applied to probability.
CDF to Survival Function: They are complements: $S_{X} (x) = 1 - F_{X} (x)$ .
CDF to Quantile Function: They are inverse functions (for strictly increasing CDFs): $Q (p) = F_{X}^{- 1} (p)$ and $F_{X} (Q (p)) = p$ .

These relationships form a closed loop. You can start with a PDF, integrate to get the CDF, and then differentiate to return to the PDF. You can take the CDF, invert it to get the quantile function, and plug a quantile back into the CDF to retrieve the probability.
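
Both loops can be checked numerically for the server example. The sketch below approximates the derivative of the CDF with a central difference (the step size `h = 1e-5` is an assumed choice) and compares it with the PDF, then confirms that the CDF and quantile function undo each other.

```python
import math

RATE = 0.5  # failure rate per hour from the server example


def pdf(t: float) -> float:
    return RATE * math.exp(-RATE * t) if t >= 0 else 0.0


def cdf(t: float) -> float:
    return 1.0 - math.exp(-RATE * t) if t >= 0 else 0.0


def quantile(p: float) -> float:
    """Inverse of cdf for 0 <= p < 1."""
    return -math.log(1.0 - p) / RATE


def cdf_slope(t: float, h: float = 1e-5) -> float:
    """Central-difference approximation to dF/dt at t."""
    return (cdf(t + h) - cdf(t - h)) / (2.0 * h)


# Loop 1: differentiating the CDF returns the PDF (up to numerical error).
slope_gap = max(abs(cdf_slope(t) - pdf(t)) for t in (0.5, 1.0, 2.0, 5.0))

# Loop 2: the CDF and the quantile function are inverses of each other.
round_trip_gap = max(abs(cdf(quantile(p)) - p) for p in (0.1, 0.5, 0.9))

print(slope_gap < 1e-6, round_trip_gap < 1e-12)
```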

Common Pitfalls

Mistake 1: Interpreting the PDF as a probability. This is the most frequent and critical error. Remember, for a continuous variable, $f_{X} (5)$ is not $P (X = 5)$ . The value $f_{X} (5)$ is a density. Probability is only obtained by integrating the PDF over an interval, no matter how small.

Correction: Always think of probability as an area. The PDF gives the height of the curve; you must multiply by a width (via integration) to get an area (probability).
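
Two quick checks illustrate the point. First, over a narrow interval the probability is approximately height times width. Second, a density value can exceed 1, which a probability never can. The second check uses a hypothetical exponential with rate 4 per hour, which is not part of the server example and is included only to show a density above 1.

```python
import math


def exp_pdf(t: float, rate: float) -> float:
    """Exponential density with the given rate."""
    return rate * math.exp(-rate * t) if t >= 0 else 0.0


def exp_cdf(t: float, rate: float) -> float:
    """Exponential CDF with the given rate."""
    return 1.0 - math.exp(-rate * t) if t >= 0 else 0.0


# Check 1: server example (rate 0.5), narrow interval [1, 1.01].
SERVER_RATE = 0.5
width = 0.01
exact = exp_cdf(1.0 + width, SERVER_RATE) - exp_cdf(1.0, SERVER_RATE)
approx = exp_pdf(1.0, SERVER_RATE) * width  # height times width

# Check 2: hypothetical faster process (rate 4), not from the original text.
FAST_RATE = 4.0
density_at_zero = exp_pdf(0.0, FAST_RATE)   # a density, allowed to exceed 1
prob_first_tenth = exp_cdf(0.1, FAST_RATE)  # a probability, always within [0, 1]

print(exact, approx)
print(density_at_zero, prob_first_tenth)
```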

Mistake 2: Misusing the CDF for non-inclusive intervals. The CDF gives $P (X \leq x)$ . To find $P (X \geq x)$ or $P (X > x)$ , you must use the complement: $P (X \geq x) = 1 - F_{X} (x)$ for continuous variables. Similarly, $P (a < X < b) = F_{X} (b) - F_{X} (a)$ .

Correction: Sketch the PDF and shade the area you want. Map that shaded area onto operations with the CDF. For a continuous variable, $P (X < b) = P (X \leq b) = F_{X} (b)$ , so the equality sign does not change the probability.

Mistake 3: Treating the quantile calculation as simple arithmetic. Finding the $p$-th quantile $Q(p)$ means solving $F_{X}(x) = p$ for $x$. This usually requires inverting the CDF with the appropriate inverse function (such as the natural logarithm), not just moving constants from one side of the equation to the other.

Correction: Write down the equation $F_{X} (Q (p)) = p$ explicitly. Isolate the term containing $Q (p)$ and apply the appropriate inverse function. For the exponential, you don't just "subtract 1"; you take the natural logarithm.

Summary

The Probability Density Function (PDF), $f_{X} (x)$ , models density. Probability is calculated as the area under the PDF curve over a given interval via integration: $P (a \leq X \leq b) = \int_{a}^{b} f_{X} (x) d x$ .
The Cumulative Distribution Function (CDF), $F_{X} (x) = P (X \leq x)$ , is the area under the PDF from $- \infty$ to $x$ . It is a non-decreasing function ranging from 0 to 1, and it provides the most direct way to calculate interval probabilities: $P (a < X \leq b) = F_{X} (b) - F_{X} (a)$ .
The Quantile Function, $Q(p)$, is the inverse of the CDF when the CDF is strictly increasing. It answers the question: "What value $x$ has cumulative probability $p$?" It is essential for finding percentiles and critical values.
The Survival Function, $S_{X} (x) = 1 - F_{X} (x) = P (X > x)$ , and the Hazard Function, $h_{X} (x) = f_{X} (x) / S_{X} (x)$ , offer complementary views, crucial for time-to-event analysis.
These functions are deeply interconnected through differentiation and integration: $f_{X} (x)$ is the derivative of $F_{X} (x)$ , and $F_{X} (x)$ is the integral of $f_{X} (x)$ .
