IB Mathematics: Statistical Analysis

Statistical analysis is the backbone of making informed decisions in a data-driven world. For your IB Mathematics course, mastering this topic is not just about passing exams; it equips you with the tools to transform raw data into credible, evidence-based conclusions that are vital in fields from science to economics.

Describing Data: Measures, Spread, and Representation

Before jumping to conclusions, you must accurately describe what your data shows. Descriptive statistics summarize and organize data so patterns become clear. The measures of central tendency—the mean, median, and mode—tell you about the data's center. The mean is the arithmetic average, the median is the middle value when the data is ordered, and the mode is the most frequent value. For instance, in a dataset of test scores: 65, 70, 70, 80, 90, the mean is $\frac{65 + 70 + 70 + 80 + 90}{5} = 75$, the median is 70, and the mode is 70.
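As a quick check, here is a minimal sketch using Python's built-in statistics module to reproduce these values for the test-score example:

```python
import statistics

scores = [65, 70, 70, 80, 90]

print(statistics.mean(scores))    # arithmetic average: 75
print(statistics.median(scores))  # middle value of the ordered data: 70
print(statistics.mode(scores))    # most frequent value: 70
```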

Central tendency alone is misleading without understanding variability. Measures of dispersion quantify how spread out the data is. The range is the difference between the maximum and minimum values. More sophisticated measures include the interquartile range (IQR), which is the range of the middle 50% of the data and resists outliers, and the standard deviation, which measures the typical distance of data points from the mean. For a dataset with mean $\bar{x}$, the standard deviation is calculated as $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$ for a sample. A small standard deviation indicates data points are clustered closely around the mean.
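The spread of the same dataset can be computed in a similar way; note that quartile conventions differ between calculators and libraries, so IQR values may not match a GDC exactly. A minimal sketch:

```python
import statistics

scores = [65, 70, 70, 80, 90]

data_range = max(scores) - min(scores)         # 90 - 65 = 25
sample_sd = statistics.stdev(scores)           # sample standard deviation (n - 1 denominator)
q1, _, q3 = statistics.quantiles(scores, n=4)  # quartiles; the default "exclusive" method
iqr = q3 - q1                                  # may differ from your calculator's convention

print(data_range, sample_sd, iqr)
```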

Data representation through visuals makes analysis intuitive. Histograms show the frequency distribution of continuous data, box plots visually summarize the median, quartiles, and potential outliers using the IQR, and scatter plots display the relationship between two quantitative variables. Imagine plotting heights versus weights: a scatter plot can instantly suggest if a correlation exists, guiding further analysis.
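As a rough illustration of these three plot types (the height and weight values below are invented purely for demonstration), matplotlib can draw them side by side:

```python
import matplotlib.pyplot as plt

# Hypothetical data, for illustration only
heights = [160, 165, 170, 172, 175, 178, 180, 183, 185, 190]
weights = [55, 60, 65, 68, 70, 74, 77, 80, 83, 90]

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].hist(heights, bins=5)      # frequency distribution of a continuous variable
axes[0].set_title("Histogram of heights")

axes[1].boxplot(heights)           # median, quartiles, and potential outliers
axes[1].set_title("Box plot of heights")

axes[2].scatter(heights, weights)  # relationship between two quantitative variables
axes[2].set_title("Heights vs weights")

plt.tight_layout()
plt.show()
```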

Probability Distributions: Modeling Randomness

Probability provides the language for uncertainty, and probability distributions are functions that describe the likelihood of different outcomes for a random variable. A discrete random variable takes on specific, countable values, like the number of heads in ten coin tosses, often modeled by a binomial distribution. A continuous random variable can take any value within an interval, like height or time, described by functions such as the normal distribution.

Every distribution is defined by its parameters and properties. For a discrete distribution, you work with a probability mass function, which gives $P(X = x)$ for each possible value $x$. For a continuous one, you use a probability density function (PDF), where probability is found by calculating the area under the curve for an interval. The key is to know which distribution applies to a given scenario. For example, if you're counting successes in a fixed number of independent trials with a constant success probability, the binomial distribution with parameters $n$ (the number of trials) and $p$ (the success probability) is appropriate. Its mean is $np$ and its variance is $np(1 - p)$.
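As a minimal sketch, scipy.stats can evaluate a binomial model; the choices $n = 10$ and $p = 0.5$ below (ten fair coin tosses) are illustrative, not taken from a specific exam question:

```python
from scipy.stats import binom

n, p = 10, 0.5                            # e.g. number of heads in ten fair coin tosses

print(binom.pmf(6, n, p))                 # P(X = 6)
print(binom.cdf(3, n, p))                 # P(X <= 3)
print(n * p, n * p * (1 - p))             # mean np and variance np(1 - p)
print(binom.mean(n, p), binom.var(n, p))  # the same values from the library
```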

The Normal Distribution: The Bell-Shaped Curve

The normal distribution is arguably the most important continuous distribution in statistics due to its natural occurrence and role in inference. It's symmetric, bell-shaped, and completely defined by its mean $\mu$ and standard deviation $\sigma$. The standard normal distribution has $\mu = 0$ and $\sigma = 1$, and any normally distributed variable can be transformed into a standard normal variable using the z-score: $z = \frac{x - \mu}{\sigma}$. This tells you how many standard deviations $x$ is from the mean.

In practice, you use z-score tables or calculators to find probabilities. For example, if adult heights are normally distributed with $\mu = 170$ cm and $\sigma = 10$ cm, the probability that a randomly selected person is taller than 185 cm is $P(X > 185)$. First, find the z-score: $z = \frac{185 - 170}{10} = 1.5$. Then, $P(Z > 1.5) \approx 0.0668$, or 6.68%. The normal distribution is foundational because of the Central Limit Theorem, which states that the sampling distribution of the sample mean approaches normality as sample size increases, regardless of the population's distribution shape.
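A minimal sketch of the height calculation with scipy.stats, using the $\mu = 170$ cm and $\sigma = 10$ cm values assumed above:

```python
from scipy.stats import norm

mu, sigma = 170, 10

z = (185 - mu) / sigma  # z = 1.5
p_tall = norm.sf(z)     # upper-tail probability P(Z > 1.5), about 0.0668
print(z, p_tall)

# Equivalently, without standardising first:
print(norm.sf(185, loc=mu, scale=sigma))
```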

Making Inferences: Hypothesis Testing

Hypothesis testing is a core method of inferential statistics, allowing you to use sample data to draw conclusions about a population. You start with two opposing hypotheses: the null hypothesis ($H_0$) represents a default position of no effect or no difference, and the alternative hypothesis ($H_1$) is what you aim to support. For instance, $H_0: \mu = 100$ versus $H_1: \mu > 100$ for testing if a population mean exceeds 100.

The process involves calculating a test statistic from your sample data, which measures how compatible the data is with $H_0$. You then find the p-value, the probability of observing your results (or more extreme) if $H_0$ is true. Compare the p-value to a predetermined significance level ($\alpha$), often 0.05. If $p \leq \alpha$, you reject $H_0$ in favor of $H_1$; otherwise, you fail to reject it. A step-by-step example: Test if a new teaching method improves scores, where the historical mean is 75. A sample of 30 students has mean 78 with standard deviation 8. Assume scores are normal.

  1. $H_0: \mu = 75$, $H_1: \mu > 75$ (one-tailed test).
  2. Calculate the test statistic: $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{78 - 75}{8/\sqrt{30}} \approx 2.05$.
  3. Using a t-distribution with $n - 1 = 29$ degrees of freedom (as the population variance is unknown), find the p-value for $t \approx 2.05$: $p \approx 0.025$.
  4. Since $p \approx 0.025 < 0.05$, reject $H_0$; the evidence suggests the method improves scores (a computational check is sketched below).
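A minimal sketch that checks this calculation from the summary statistics alone (scipy's built-in one-sample t-test expects raw data, so here the test statistic comes directly from the formula and the p-value from the t-distribution's upper tail):

```python
from math import sqrt
from scipy.stats import t

n, x_bar, s, mu0 = 30, 78, 8, 75

t_stat = (x_bar - mu0) / (s / sqrt(n))  # about 2.05
p_value = t.sf(t_stat, df=n - 1)        # one-tailed p-value, about 0.025

print(t_stat, p_value)
print(p_value < 0.05)                   # True: reject H0 at the 5% level
```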

Be aware of Type I error (rejecting a true $H_0$) and Type II error (failing to reject a false $H_0$). The significance level $\alpha$ controls the probability of a Type I error.

The Chi-Squared Test for Association

The chi-squared ($\chi^2$) test assesses relationships between categorical variables, such as testing if smoking status is independent of lung disease diagnosis. There are two main types: the goodness-of-fit test (comparing observed frequencies to a theoretical distribution) and the test for independence (comparing observed frequencies in a contingency table to expected frequencies under independence).

For a test of independence in an $r \times c$ contingency table, the chi-squared test statistic is calculated as $\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$, where $O_{ij}$ is the observed frequency and $E_{ij}$ is the expected frequency for cell $(i, j)$, computed as $E_{ij} = \frac{\text{row total} \times \text{column total}}{\text{grand total}}$. This statistic follows a chi-squared distribution with $(r - 1)(c - 1)$ degrees of freedom.

Consider a survey testing if gender (Male, Female) is associated with preference for a new product (Yes, No). The observed data forms a 2×2 table. You calculate expected frequencies assuming no association, then compute $\chi^2$. Compare this to a critical value from the chi-squared distribution with $(2 - 1)(2 - 1) = 1$ degree of freedom at your chosen significance level. If $\chi^2$ exceeds the critical value, you reject the null hypothesis of independence, concluding an association exists. This test enables evidence-based conclusions about relationships in categorical data without assuming normality.
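A minimal sketch using scipy's chi2_contingency; the 2×2 counts below are hypothetical, invented only to show the mechanics, and correction=False matches the plain formula above (no Yates continuity correction):

```python
from scipy.stats import chi2_contingency

# Hypothetical observed frequencies: rows = gender (Male, Female),
# columns = preference (Yes, No)
observed = [[30, 20],
            [25, 25]]

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(chi2, p_value)  # test statistic and p-value
print(dof)            # (2 - 1)(2 - 1) = 1 degree of freedom
print(expected)       # expected frequencies under independence
```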

Common Pitfalls

  1. Confusing Correlation with Causation: A significant correlation or test result does not prove that one variable causes another. There may be lurking variables or coincidence. Correction: Always consider experimental design and potential confounding factors before making causal claims. Use controlled experiments when possible.
  2. Misinterpreting the p-value: The p-value is not the probability that the null hypothesis is true. It is the probability of the observed data (or more extreme) given that $H_0$ is true. Correction: Frame conclusions carefully: "There is sufficient evidence to reject $H_0$" rather than "There is a 5% probability that $H_0$ is false."
  3. Overlooking Assumptions of Tests: Each statistical test relies on assumptions. For example, the t-test assumes approximate normality and independence of data; the chi-squared test requires that expected frequencies are sufficiently large (typically >5). Correction: Always check assumptions before performing a test. If violated, use non-parametric alternatives or transform the data.
  4. Using the Wrong Measure of Central Tendency for Skewed Data: The mean is sensitive to outliers. In skewed distributions, like income data, the median is a more robust measure of center. Correction: Always examine the shape of your data distribution (e.g., using a histogram) before reporting a summary statistic.

Summary

  • Descriptive statistics like mean, median, mode, range, IQR, and standard deviation summarize a dataset's central tendency and dispersion, while visuals like histograms and scatter plots reveal patterns.
  • Probability distributions, including binomial and normal, model random phenomena, with the normal distribution's properties and z-scores being central to inference.
  • Hypothesis testing uses p-values and significance levels to make decisions about population parameters based on sample data, guarding against Type I and Type II errors.
  • The chi-squared test evaluates associations between categorical variables by comparing observed and expected frequencies in contingency tables.
  • All statistical methods require careful application of their assumptions and correct interpretation to draw valid, evidence-based conclusions from data.
  • Mastering these concepts enables you to analyze data critically, a skill assessed across IB Mathematics papers and essential for real-world problem-solving.
