IB AA: Descriptive Statistics

Descriptive statistics provide the essential vocabulary and tools for summarizing, organizing, and interpreting raw data. Whether you're analyzing experimental results, survey responses, or economic indicators, these techniques transform lists of numbers into meaningful insights about central tendencies, variability, and distribution shape. Mastering them is the critical first step in any data analysis, forming the foundation upon which more advanced inferential statistics are built.

Measures of Central Tendency: Finding the Center

The mean, median, and mode are known as measures of central tendency because they each describe a "typical" or central value in a dataset. The choice of which to use depends on the data's characteristics and what you want to emphasize.

The mean ($\bar{x}$) is the arithmetic average. For ungrouped data, you sum all values and divide by the number of values: $\bar{x} = \frac{\sum x}{n}$. For grouped data presented in a frequency table, you estimate the mean by using the midpoint ($m_{i}$) of each class interval: $\bar{x} \approx \frac{\sum f_{i} m_{i}}{\sum f_{i}}$, where $f_{i}$ is the frequency of the $i$th class. The mean uses all data points but is sensitive to extreme values, or outliers.
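As a minimal sketch of the two mean formulas above, the following Python computes the mean of ungrouped data and the estimated mean of grouped data. The function names and the sample numbers are illustrative assumptions, not part of the syllabus.

```python
def mean(values):
    """Arithmetic mean of ungrouped data: sum of x divided by n."""
    return sum(values) / len(values)


def grouped_mean(midpoints, frequencies):
    """Estimated mean of grouped data: sum(f_i * m_i) / sum(f_i)."""
    weighted_total = sum(f * m for f, m in zip(frequencies, midpoints))
    return weighted_total / sum(frequencies)


scores = [4, 7, 7, 9, 13]  # made-up ungrouped data
print(mean(scores))  # 8.0

# Made-up classes 0-10, 10-20, 20-30 with midpoints 5, 15, 25
print(grouped_mean([5, 15, 25], [2, 5, 3]))  # 16.0
```

The grouped result is only an estimate, because every value in a class is treated as if it sat at the class midpoint.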

The median is the middle value when data is ordered from smallest to largest. For an odd number of data points ($n$), it's the value at position $\frac{n + 1}{2}$. For an even number, it's the average of the values at positions $\frac{n}{2}$ and $\frac{n}{2} + 1$. For grouped data, the median is estimated graphically from a cumulative frequency curve or calculated using interpolation within the median class. The median is robust; it is not skewed by outliers.
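The position rules for odd and even $n$ can be sketched in Python as follows. The function name and the data are assumptions chosen for illustration; note that Python lists are 0-indexed, so position $\frac{n + 1}{2}$ becomes index `n // 2`.

```python
def median(values):
    """Median of ungrouped data using the odd/even position rules."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the value at 1-indexed position (n + 1) / 2
        return ordered[mid]
    # Even n: average of the values at 1-indexed positions n/2 and n/2 + 1
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([3, 1, 9, 5, 7]))    # 5
print(median([3, 1, 9, 5]))       # 4.0
print(median([1, 3, 5, 7, 900]))  # 5, unchanged by the extreme value 900
```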

The mode is the most frequently occurring value. A dataset can have one mode (unimodal), two (bimodal), or more. For grouped data, we speak of the modal class—the class interval with the highest frequency. The mode is useful for categorical data (e.g., the most common car color) and can highlight a peak in a distribution that the mean and median might miss.
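A short sketch of finding the mode, including the bimodal case, using `collections.Counter` from the Python standard library. The function name `modes` and the example data are hypothetical.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency, sorted."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([2, 3, 3, 5, 5, 8]))               # [3, 5] (bimodal)
print(modes(["red", "blue", "red", "grey"]))   # ['red'] (categorical data)
```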

Measures of Dispersion: Understanding the Spread

Knowing the center of your data is not enough. Two datasets can have the same mean but wildly different spreads. Measures of dispersion quantify this variability.

Variance and standard deviation are the most common measures for quantifying spread around the mean. Variance ($\sigma^{2}$ for a population, $s^{2}$ for a sample) is the average of the squared differences from the mean. For ungrouped data, the sample variance formula is $s^{2} = \frac{\sum (x_{i} - \bar{x})^{2}}{n - 1}$. The denominator $n - 1$ is used for a sample to provide an unbiased estimate of the population variance, a key point in IB AA.

The standard deviation ($\sigma$ or $s$) is the square root of the variance: $s = \sqrt{s^{2}}$. It is expressed in the original units of the data, making it more interpretable. A low standard deviation indicates data points are clustered near the mean, while a high one indicates they are spread out. For grouped data, the calculation adjusts to use midpoints and frequencies: $s \approx \sqrt{\frac{\sum f_{i} (m_{i} - \bar{x})^{2}}{\left(\sum f_{i}\right) - 1}}$.
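The sample variance and standard deviation formulas, for both ungrouped and grouped data, might be sketched like this. The function names and datasets are assumptions made for illustration; both functions use the $n - 1$ denominator described above.

```python
import math


def sample_variance(values):
    """Sample variance: sum of squared deviations divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def grouped_sample_std(midpoints, frequencies):
    """Estimated sample standard deviation from class midpoints and frequencies."""
    n = sum(frequencies)
    x_bar = sum(f * m for f, m in zip(frequencies, midpoints)) / n
    squared = sum(f * (m - x_bar) ** 2 for f, m in zip(frequencies, midpoints))
    return math.sqrt(squared / (n - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up sample with mean 5
s2 = sample_variance(data)        # 32 / 7
s = math.sqrt(s2)                 # standard deviation, same units as the data
print(round(s2, 3), round(s, 3))  # 4.571 2.138

# Made-up classes with midpoints 5, 15, 25 and frequencies 2, 5, 3
print(round(grouped_sample_std([5, 15, 25], [2, 5, 3]), 3))  # 7.379
```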

Quartiles divide ordered data into four equal parts. The first quartile ($Q_{1}$) is the median of the lower half of the data (the 25th percentile). The second quartile ($Q_{2}$) is the median (the 50th percentile). The third quartile ($Q_{3}$) is the median of the upper half (the 75th percentile). The interquartile range (IQR) is the spread of the middle 50% of the data: $\text{IQR} = Q_{3} - Q_{1}$. Unlike standard deviation, the IQR is not affected by extreme values, making it a robust measure of dispersion.
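One way to sketch the "median of each half" method in Python is shown below. It assumes the convention that, for odd $n$, the central value is left out of both halves; calculators and software may use other quartile conventions and give slightly different answers. The function names are hypothetical, and the sketch needs at least two data values.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # values below the median position
    upper = ordered[(n + 1) // 2 :]  # values above it (middle skipped if n is odd)
    return median(lower), median(ordered), median(upper)


q1, q2, q3 = quartiles([1, 3, 4, 6, 7, 8, 10])  # made-up data
print(q1, q2, q3)  # 3 6 8
print(q3 - q1)     # 5, the IQR
```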

Data Representation: Cumulative Frequency and Box Plots

Visual summaries powerfully communicate the distribution described by statistics.

A cumulative frequency diagram (or ogive) plots cumulative frequency against the upper class boundary for grouped data. To construct one, you first build a cumulative frequency table by adding frequencies as you go. Plotting points at each upper boundary and joining them with a smooth curve creates the diagram. This graph allows you to read off estimates for the median (at 50% of total frequency), quartiles (at 25% and 75%), and percentiles directly. The steepness of the curve indicates where data is concentrated.
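The table-building step and the "read off" step can be approximated in code. This is a sketch under a stated assumption: it joins the plotted points with straight lines (linear interpolation within a class) rather than a smooth curve, so values read from a hand-drawn curve may differ slightly. The function names, class boundaries, and frequencies are made up for illustration.

```python
from itertools import accumulate


def cumulative_frequencies(frequencies):
    """Running totals of the class frequencies."""
    return list(accumulate(frequencies))


def estimate_percentile(boundaries, frequencies, p):
    """Estimate the p-th percentile (0 <= p <= 100) by linear interpolation.

    boundaries lists the class boundaries in order, so it has one more
    entry than frequencies.
    """
    cumulative = cumulative_frequencies(frequencies)
    target = p / 100 * cumulative[-1]
    previous = 0
    for i, running in enumerate(cumulative):
        # Skip empty classes so we never divide by a zero frequency
        if frequencies[i] > 0 and target <= running:
            lower, upper = boundaries[i], boundaries[i + 1]
            fraction = (target - previous) / frequencies[i]
            return lower + fraction * (upper - lower)
        previous = running


boundaries = [0, 10, 20, 30, 40]  # classes 0-10, 10-20, 20-30, 30-40
frequencies = [4, 10, 12, 4]      # total frequency 30

print(cumulative_frequencies(frequencies))  # [4, 14, 26, 30]
print(estimate_percentile(boundaries, frequencies, 25))            # 13.5
print(round(estimate_percentile(boundaries, frequencies, 50), 2))  # 20.83
```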

A box plot (or box-and-whisker plot) displays the five-number summary of a dataset: the minimum, $Q_{1}$, the median ($Q_{2}$), $Q_{3}$, and the maximum. The "box" is drawn from $Q_{1}$ to $Q_{3}$, with a line inside at the median. The "whiskers" extend from the box to the minimum and maximum values or, when outliers are plotted separately, to the smallest and largest values that are not outliers. Box plots are ideal for visually comparing the center, spread, and skewness of different distributions at a glance.

Identifying Outliers Using Statistical Criteria

Not every extreme value is meaningful; sometimes, it's an error or an anomaly. We need an objective, statistical method to flag potential outliers. The most common method in IB uses the IQR.

An outlier is defined as a data point that lies more than $1.5 \times \text{IQR}$ below $Q_{1}$ or above $Q_{3}$.

Lower boundary: $Q_{1} - 1.5 \times \text{IQR}$
Upper boundary: $Q_{3} + 1.5 \times \text{IQR}$

Any data point less than the lower boundary or greater than the upper boundary is considered an outlier. On a box plot, outliers are often plotted as individual points (e.g., dots or stars) beyond the whiskers, which then extend only to the smallest and largest non-outlier values. Identifying outliers is crucial before deciding whether to exclude them, investigate them, or use robust statistics like the median and IQR.
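The boundary calculation and the resulting whisker ends can be sketched as below. The helper names, the returned dictionary keys, and the data are illustrative assumptions, and the quartiles use the "median of each half" convention (central value excluded when $n$ is odd), so other quartile methods may give different fences.

```python
def _median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def outlier_report(values):
    """Apply the 1.5 x IQR rule and report fences, outliers, and whisker ends."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = _median(ordered[: n // 2])
    q3 = _median(ordered[(n + 1) // 2 :])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    inliers = [x for x in ordered if lower_fence <= x <= upper_fence]
    return {
        "fences": (lower_fence, upper_fence),
        "outliers": outliers,
        # Whiskers stop at the most extreme values that are not outliers
        "whiskers": (inliers[0], inliers[-1]),
    }


report = outlier_report([2, 10, 11, 12, 13, 14, 15, 16, 30])  # made-up data
print(report["fences"])    # (3.0, 23.0)
print(report["outliers"])  # [2, 30]
print(report["whiskers"])  # (10, 16)
```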

Common Pitfalls

Confusing Population and Sample Formulas: Using $n$ instead of $n - 1$ in the denominator when calculating the sample variance or standard deviation is a frequent error. Remember: use $n$ for a population's parameters, use $n - 1$ for a sample's statistics (unless explicitly told otherwise).

Misinterpreting the IQR and Whiskers on a Box Plot: The IQR is the length of the box ($Q_{3} - Q_{1}$), not the length of the whiskers. When outliers are plotted as separate points, the whiskers show the range of the "typical" data, excluding those outliers. Assuming the whisker endpoints are the minimum and maximum of the entire dataset will lead to incorrect conclusions if outliers are present.

Incorrect Quartile Calculation for Discrete Data: When finding $Q_{1}$ and $Q_{3}$ manually from a list, ensure you are finding the medians of the correct lower and upper halves. If $n$ is odd, the median (the central value) is typically excluded from both halves before finding $Q_{1}$ and $Q_{3}$. Consistent application of a method is key.

Overreliance on the Mean: Automatically using the mean as the sole measure of center is a mistake with skewed data. For example, reporting the mean income in a region with high wealth inequality can be misleading. Always consider the shape of the distribution and use the median for skewed data.
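To illustrate the last pitfall, here is a small sketch using Python's standard `statistics` module on made-up income figures (in thousands). The numbers are invented purely to show how a single extreme value pulls the mean away from the median.

```python
import statistics

# Hypothetical incomes in thousands; one very high earner skews the data
incomes = [30, 32, 35, 38, 40, 45, 500]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

print(round(mean_income, 1))  # 102.9, higher than six of the seven incomes
print(median_income)          # 38, a more typical value
```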

Summary

Central tendency is summarized by the mean (average), median (middle), and mode (most frequent). The mean is sensitive to outliers, while the median is robust.
Dispersion is measured by variance and standard deviation (sensitive to outliers) and by the interquartile range (IQR), which describes the spread of the middle 50% of data and is robust.
Cumulative frequency diagrams allow for the graphical estimation of medians, quartiles, and percentiles from grouped data.
Box plots provide a visual five-number summary (min, $Q_{1}$, median, $Q_{3}$, max) and are excellent for comparing distributions and identifying skew.
Outliers can be identified statistically using the $1.5 \times \text{IQR}$ rule, which defines fences beyond which data points are considered exceptional.
