Probability Axioms and Rules

Probability is the language of uncertainty, and in data science, it's the foundational grammar for everything from A/B testing and machine learning to risk analysis and statistical inference. Mastering its core axioms and rules transforms vague intuition into a precise, systematic toolkit for quantifying chance and making predictions under uncertainty.

The Foundational Axioms: Kolmogorov's Framework

All modern probability theory is built upon Kolmogorov's axioms, a set of three simple rules that define a consistent mathematical system. Think of them as the "rules of the game" that every other probability rule must follow. They are defined for a sample space $S$ (the set of all possible outcomes) and assign probabilities to events, which are subsets of $S$ .

Non-Negativity: For any event $A$ , its probability is never negative: $P (A) \geq 0$ . This aligns with our intuition that chance can't be less than zero.
Normalization: The probability of the sample space itself is 1: $P (S) = 1$ . This means something from all possible outcomes must happen.
Additivity: If events $A_{1}, A_{2}, A_{3}, ...$ are pairwise mutually exclusive (meaning no two of them can happen at the same time), then the probability that at least one of them occurs is the sum of their individual probabilities:

$P (A_{1} \cup A_{2} \cup ...) = P (A_{1}) + P (A_{2}) + ...$

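These axioms, while abstract, are powerful. Every other rule in this section, including the ones you'll use daily in data science, follows from them together with the definition of conditional probability. For example, $S$ and the empty set $\emptyset$ are mutually exclusive and $S \cup \emptyset = S$, so additivity gives $P (S) + P (\emptyset) = P (S)$, which forces the probability of an impossible event to be 0. Similarly, because $P (A) + P (A^{c}) = P (S) = 1$ and neither term can be negative, probabilities always lie between 0 and 1.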
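As a sanity check, the axioms can be verified exhaustively on a small example. The sketch below assumes a fair six-sided die with equally likely outcomes (an illustrative choice, not part of the axioms themselves), and the helper `prob` is a hypothetical name. It checks only finite additivity for pairs of disjoint events, which is all a finite sample space needs.

```python
from fractions import Fraction
from itertools import combinations

# Sample space for one roll of a fair six-sided die (illustrative assumption).
S = frozenset(range(1, 7))

def prob(event):
    """P(event) under equally likely outcomes; event is a subset of S."""
    return Fraction(len(event & S), len(S))

# Every event is a subset of S; enumerate all 2^6 = 64 of them.
events = [frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)]

# Axiom 1 (non-negativity), plus the derived bound P(A) <= 1.
assert all(0 <= prob(A) <= 1 for A in events)

# Axiom 2 (normalization), plus the derived fact P(empty set) = 0.
assert prob(S) == 1
assert prob(frozenset()) == 0

# Axiom 3 (additivity), checked for every pair of disjoint events.
assert all(
    prob(A | B) == prob(A) + prob(B)
    for A in events
    for B in events
    if not (A & B)
)
```

Using `Fraction` keeps the arithmetic exact, so the equality checks are not affected by floating-point rounding.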

The Addition Rules: Combining "OR" Probabilities

A common task is finding the probability that either event $A$ or event $B$ occurs, denoted $P (A \cup B)$ . The correct rule depends on whether the events can co-occur.

For mutually exclusive events (e.g., rolling a 1 or a 4 on a single die roll), the probability of their union is simply the sum of their probabilities, directly from Kolmogorov's third axiom: $P (A \cup B) = P (A) + P (B)$

However, most events in data science are not mutually exclusive. For instance, a customer could be both a "subscriber" (Event $A$) and "have made a purchase this month" (Event $B$). If we simply add $P (A) + P (B)$, we double-count the overlapping customers who are in both groups. The general addition rule corrects for this: $P (A \cup B) = P (A) + P (B) - P (A \cap B)$. Here, $P (A \cap B)$ is the probability both $A$ and $B$ occur. Subtracting this intersection removes the double-counted area. Imagine analyzing user behavior: if 30% of users are subscribers ($P (A) = 0.3$) and 20% made a purchase ($P (B) = 0.2$), with 10% being subscribers who purchased ($P (A \cap B) = 0.1$), then the probability a user is either a subscriber or a purchaser is $0.3 + 0.2 - 0.1 = 0.4$, or 40%.
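The subscriber example can be reproduced with plain sets. This is a minimal sketch: the ten user IDs are made up so that the group sizes match the 30%, 20%, and 10% figures in the text, and `prob` is a hypothetical helper.

```python
from fractions import Fraction

# Hypothetical population of 10 users chosen to match the text:
# 3 subscribers (30%), 2 purchasers (20%), 1 user in both groups (10%).
users = set(range(10))
subscribers = {0, 1, 2}
purchasers = {2, 3}

def prob(event):
    """Share of users that belong to the event."""
    return Fraction(len(event), len(users))

# General addition rule: subtract the overlap that was counted twice.
p_union_rule = prob(subscribers) + prob(purchasers) - prob(subscribers & purchasers)

# Direct count of users in either group, for comparison.
p_union_direct = prob(subscribers | purchasers)

# Naive sum that ignores the overlap and therefore overestimates.
naive_sum = prob(subscribers) + prob(purchasers)

print(p_union_rule, p_union_direct, naive_sum)
```

The rule and the direct count agree at 2/5 (40%), while the naive sum gives 1/2 because user 2 is counted twice.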

The Multiplication Rules: Combining "AND" Probabilities

Finding the probability that both event $A$ and event $B$ occur, $P (A \cap B)$ , requires understanding dependence.

For independent events, where the occurrence of one does not influence the other (e.g., flipping a fair coin and rolling a die), the joint probability is the product of their individual probabilities: $P (A \cap B) = P (A) \cdot P (B)$
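For the coin-and-die example, the product rule can be confirmed by enumerating all outcomes. The sketch assumes a fair coin and a fair die, so the 12 combined outcomes are equally likely; `prob` is a hypothetical helper that takes a predicate over outcomes.

```python
from fractions import Fraction
from itertools import product

# All (coin, die) pairs; assumed equally likely for a fair coin and a fair die.
outcomes = list(product("HT", range(1, 7)))

def prob(predicate):
    """Fraction of outcomes for which the predicate holds."""
    return Fraction(sum(1 for o in outcomes if predicate(o)), len(outcomes))

p_heads = prob(lambda o: o[0] == "H")
p_six = prob(lambda o: o[1] == 6)
p_both = prob(lambda o: o[0] == "H" and o[1] == 6)

# For independent events the joint probability equals the product.
assert p_both == p_heads * p_six
```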

In data science, true independence is rare. More often, events are dependent. For example, the probability of a user clicking an ad ($B$) depends on whether they are in the target demographic ($A$). Here, we use the general multiplication rule, which introduces the concept of conditional probability, $P (B ∣ A)$, read as "the probability of $B$ given that $A$ occurred" and defined whenever $P (A) > 0$: $P (A \cap B) = P (A) \cdot P (B ∣ A)$. This rule states: to find the probability both events happen, first find the probability $A$ happens, then multiply by the updated probability that $B$ happens *given that $A$ has already happened*.

For example, suppose 5% of patients have a condition ($P (A) = 0.05$). A test for it has a 90% true positive rate ($P (B ∣ A) = 0.9$). The probability that a randomly selected patient has the condition AND tests positive is $0.05 \cdot 0.9 = 0.045$, or 4.5%. Notice how this differs from the probability of testing positive overall, which also includes false positives among patients without the condition and is therefore higher whenever the false positive rate is above zero.
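The patient example can be worked through in a few lines. The prevalence and true positive rate come from the text. The false positive rate of 8% is an assumption introduced here purely for illustration, since the text does not give one, and the variable names are hypothetical.

```python
from fractions import Fraction

p_condition = Fraction(5, 100)         # P(A), from the text
true_positive_rate = Fraction(9, 10)   # P(B | A), from the text
false_positive_rate = Fraction(8, 100) # P(B | not A): ASSUMED, not in the text

# General multiplication rule: P(A and B) = P(A) * P(B | A).
p_condition_and_positive = p_condition * true_positive_rate

# Overall positive rate adds false positives among patients without the condition.
p_no_condition = 1 - p_condition
p_positive = p_condition_and_positive + p_no_condition * false_positive_rate

print(float(p_condition_and_positive))  # joint probability
print(float(p_positive))                # overall positive rate under the assumption
```

Under the assumed false positive rate, the overall positive rate is noticeably larger than the 4.5% joint probability, which is the gap the text points to.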

The Complement Rule and Inclusion-Exclusion Principle

Two more powerful tools round out the essential toolkit.

The complement rule stems from the axioms. The complement of event $A$, denoted $A^{c}$, is the event that "$A$ does not happen." Since $A$ and $A^{c}$ are mutually exclusive and their union is the entire sample space ($A \cup A^{c} = S$), additivity and normalization give $P (A) + P (A^{c}) = 1$, or equivalently $P (A^{c}) = 1 - P (A)$. This is deceptively useful, because the probability that an event does not happen is often easier to calculate. For instance, in quality control, finding the probability that at least one item in a batch is defective directly means adding up the probabilities of exactly one defect, exactly two defects, and so on. It's simpler to find the probability that no items are defective and subtract it from 1.
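The quality-control idea can be sketched as follows. The batch size of 20, the 2% defect rate, and the assumption that items are defective independently of one another are all hypothetical values chosen for illustration.

```python
from fractions import Fraction
from math import comb

n = 20               # hypothetical batch size
p = Fraction(1, 50)  # hypothetical per-item defect probability (2%)

# Complement route: P(at least one defect) = 1 - P(no defects).
# Assumes items are defective independently of each other.
p_none = (1 - p) ** n
p_at_least_one = 1 - p_none

# Direct route: sum P(exactly k defects) over k = 1..n (binomial terms).
p_direct = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(1, n + 1))

# Both routes give exactly the same answer; the complement needs one term.
assert p_at_least_one == p_direct
print(float(p_at_least_one))
```

The direct route needs twenty terms here, while the complement route needs a single power, which is why it is usually the first thing to try for "at least one" questions.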

The inclusion-exclusion principle generalizes the addition rule for more than two events. For three events $A, B, C$, the probability of their union is: $P (A \cup B \cup C) = P (A) + P (B) + P (C) - P (A \cap B) - P (A \cap C) - P (B \cap C) + P (A \cap B \cap C)$. You alternately include and exclude intersections to correct for all over- and under-counting. Imagine a survey on platform use: some users use only one platform ($A$, $B$, or $C$), some use two, and some use all three. To count the total proportion of unique users from the raw overlap data, you apply this principle systematically.
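A short sketch of the principle for any number of events is given below. The survey data (20 respondents and the three platform sets) is invented for illustration, and `union_prob` is a hypothetical helper name.

```python
from fractions import Fraction
from itertools import combinations

def union_prob(sets, n_total):
    """P(union of sets) by inclusion-exclusion over all k-way intersections."""
    total = Fraction(0)
    for k in range(1, len(sets) + 1):
        sign = 1 if k % 2 == 1 else -1  # include odd-sized, exclude even-sized
        for combo in combinations(sets, k):
            overlap = set.intersection(*combo)
            total += sign * Fraction(len(overlap), n_total)
    return total

# Hypothetical survey of 20 respondents (IDs 0..19) and three platforms.
n_respondents = 20
A = set(range(0, 10))
B = set(range(5, 15))
C = {0, 1, 5, 6, 12, 13, 15, 16}

p_union = union_prob([A, B, C], n_respondents)

# Cross-check against a direct count of unique users.
assert p_union == Fraction(len(A | B | C), n_respondents)
print(p_union)
```

With these sets the signed sum is 10 + 10 + 8 - 5 - 4 - 4 + 2 = 17 users out of 20, matching the direct count of the union.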

Common Pitfalls

Assuming Independence: The most frequent error is applying $P (A \cap B) = P (A) P (B)$ to events that are clearly dependent (e.g., "rains today" and "rains tomorrow"). Correction: Always ask if knowing one event occurred changes the likelihood of the other. If yes, you must use the conditional rule $P (A \cap B) = P (A) P (B ∣ A)$ .

Confusing "OR" with "AND": Misinterpreting a problem's wording can lead to using the addition rule instead of the multiplication rule, or vice versa. Correction: Translate "either A or B" to $A \cup B$ (addition rule) and "both A and B" to $A \cap B$ (multiplication rule). Pay close attention to language like "at least one" vs. "all."

Misapplying the Addition Rule: Adding probabilities for non-exclusive events without subtracting the intersection ( $P (A \cap B)$ ) will overestimate the true probability. Correction: For any "OR" scenario, your first question should be: "Can both events happen together?" If the answer is yes, you must use the general addition rule.

Overlooking the Complement: Tackling a complex probability like $P (\text{at least one success})$ directly can be needlessly complex. Correction: Default to checking if the complement ($P (\text{no successes})$) is easier to calculate. If it is, use the rule $P (\text{event}) = 1 - P (\text{event}^{c})$.
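Returning to the independence pitfall above, the question "does knowing one event change the likelihood of the other?" can be checked numerically by comparing $P (B ∣ A)$ with $P (B)$. The day counts below are made up for illustration, with $A$ as "rains today" and $B$ as "rains tomorrow".

```python
from fractions import Fraction

# Hypothetical counts over 100 days (invented for illustration).
days = 100
rain_today = 30     # days on which A occurred
rain_tomorrow = 30  # days on which B occurred
rain_both = 18      # days on which both A and B occurred

p_a = Fraction(rain_today, days)
p_b = Fraction(rain_tomorrow, days)
p_ab = Fraction(rain_both, days)

# Conditional probability: P(B | A) = P(A and B) / P(A).
p_b_given_a = p_ab / p_a

# Independence would require P(B | A) == P(B).
independent = p_b_given_a == p_b

naive_product = p_a * p_b                # wrong here: assumes independence
conditional_product = p_a * p_b_given_a  # general multiplication rule

print(independent, naive_product, conditional_product)
```

In this made-up data, rain today doubles the chance of rain tomorrow, so the naive product understates the joint probability by half. With real sampled data an exact equality check is too strict, and a statistical test of independence would be the more appropriate tool.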

Summary

Kolmogorov's Axioms provide the non-negotiable foundation: probabilities are non-negative, the probability of the sample space is 1, and probabilities of mutually exclusive events add.
Use the Addition Rule for "OR" scenarios: $P (A \cup B) = P (A) + P (B) - P (A \cap B)$ . For mutually exclusive events, this simplifies to $P (A) + P (B)$ .
Use the Multiplication Rule for "AND" scenarios: $P (A \cap B) = P (A) \cdot P (B ∣ A)$ . For independent events, this simplifies to $P (A) \cdot P (B)$ .
The Complement Rule, $P (A^{c}) = 1 - P (A)$ , is a powerful simplifying tool for problems involving "at least one" or complex event structures.
The Inclusion-Exclusion Principle provides a systematic, stepwise method for calculating the probability of the union of multiple, potentially overlapping events.
