Feb 26

Frequency Distributions and Histograms

Mindli Team

AI-Generated Content


Raw data is just noise until it's organized. Frequency distributions and histograms are the foundational tools that transform lists of numbers into clear, understandable pictures, revealing the underlying patterns, central tendencies, and variability within a dataset. Whether you're exploring customer behavior, analyzing sensor readings, or preparing data for machine learning, mastering these techniques is the first critical step in any data science workflow, turning raw observations into actionable insights.

Organizing Data: The Frequency Distribution

The journey from chaos to clarity begins with the frequency table. This is a simple tabular summary that shows the number of observations (frequency) that fall into each predefined category or class.

For discrete data—data that can only take on specific, separate values like the number of customer complaints (0, 1, 2, etc.)—construction is straightforward. You list each possible value and count how many times it occurs. For example, a survey of 20 households on the number of pets might yield:

| Pets (x) | Frequency (f) |
|----------|---------------|
| 0        | 6             |
| 1        | 8             |
| 2        | 4             |
| 3        | 2             |

This table immediately shows that the most common outcome is one pet.
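Building such a table is a one-liner in Python. The sketch below reconstructs the 20 hypothetical survey responses behind the table and tallies them with `collections.Counter`:

```python
from collections import Counter

# The 20 hypothetical survey responses behind the table above
pets = [0]*6 + [1]*8 + [2]*4 + [3]*2

freq = Counter(pets)
for value in sorted(freq):
    print(value, freq[value])
# 0 6
# 1 8
# 2 4
# 3 2
```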

Two powerful extensions of the basic frequency table are the relative frequency distribution and the cumulative frequency distribution. Relative frequency is the proportion or percentage of the total observations belonging to a class, calculated as f / n, where f is the class frequency and n is the total sample size. Adding this to our pet table allows for direct comparison between datasets of different sizes.

| Pets (x) | Frequency (f) | Relative Frequency |
|----------|---------------|--------------------|
| 0        | 6             | 0.30               |
| 1        | 8             | 0.40               |
| 2        | 4             | 0.20               |
| 3        | 2             | 0.10               |

Cumulative frequency is the running total of frequencies up to and including a given class. It answers questions like "How many households have two or fewer pets?" The cumulative frequency for the class "2" would be 6 + 8 + 4 = 18.
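Both extensions take only a few lines; this sketch reuses the hypothetical pet counts from the tables above:

```python
from itertools import accumulate

values = [0, 1, 2, 3]          # number of pets
freqs = [6, 8, 4, 2]           # class frequencies from the table above
n = sum(freqs)                 # total sample size, 20

rel = [f / n for f in freqs]   # relative frequency: f / n
cum = list(accumulate(freqs))  # cumulative frequency: running total

print(rel)  # [0.3, 0.4, 0.2, 0.1]
print(cum)  # [6, 14, 18, 20]
```

The third entry of `cum` is 18, matching the "two or fewer pets" count worked out above.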

Handling Continuous Data: Bins and Histograms

Discrete data is simple to tabulate, but continuous data—like height, weight, or time—can take on any value within an interval. You cannot list every possible height (e.g., 170.1 cm, 170.1001 cm, etc.). Instead, you group the data into intervals called bins or classes.

Creating a frequency distribution for continuous data involves three key decisions:

  1. Number of Bins (k): Too few bins oversimplify the data; too many bins create a jagged, noisy picture. A common starting point is Sturges' rule: k = 1 + log₂(n), where n is the number of observations.
  2. Bin Width: Once you choose the number of bins, the bin width is approximately (maximum − minimum) / k. It's best to round this to a simple, interpretable number.
  3. Bin Limits: Bins should be mutually exclusive (no overlap) and exhaustive (cover all data). Clear rules prevent an observation from falling into two bins; for example, defining bins as the half-open intervals [10, 20) and [20, 30) ensures a value of exactly 20 belongs to only the second bin.
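The first two decisions can be sketched directly in code; the sample size and data range below are made-up values for illustration:

```python
import math

# Made-up sample size and data range, for illustration only
n = 200
data_min, data_max = 12.4, 87.9

k = math.ceil(1 + math.log2(n))        # Sturges' rule, rounded up
raw_width = (data_max - data_min) / k  # approximate bin width
width = round(raw_width)               # round to an interpretable number

print(k, width)  # 9 8
```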

A histogram is the graphical representation of this binned frequency distribution. It looks like a bar chart, but with critical differences: the bars touch each other, emphasizing the continuous nature of the data, and the horizontal axis represents a numerical scale. The height of each bar corresponds to the frequency (or relative frequency) of observations within that bin. For example, a histogram of exam scores (binned as 50-59, 60-69, etc.) visually shows where most students clustered and how spread out the results were.
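NumPy can apply Sturges' rule and do the binning in one call; the "exam scores" below are synthetic, generated only for illustration:

```python
import numpy as np

# Synthetic "exam scores", generated only for illustration
rng = np.random.default_rng(42)
scores = rng.normal(loc=70, scale=10, size=200)

# bins='sturges' lets NumPy pick the bin count via Sturges' rule
counts, edges = np.histogram(scores, bins='sturges')

print(counts.sum())  # 200: every observation lands in exactly one bin
```

Passing the same data and `bins` argument to `matplotlib.pyplot.hist` draws the familiar touching-bar histogram.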

Alternative Visualizations: Stem-and-Leaf and Ogives

While histograms are ubiquitous, other visual tools offer unique advantages. A stem-and-leaf plot is a hybrid display that preserves the raw data while showing the distribution shape. Each number is split into a "stem" (the leading digit(s)) and a "leaf" (the trailing digit). For the data set {23, 25, 32, 34, 34, 41}, a stem-and-leaf plot would be:

2 | 3 5
3 | 2 4 4
4 | 1

This quick plot shows the shape, central value, and spread, and you can recover the original data from it.
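A basic stem-and-leaf plot for two-digit data is easy to build by hand; this sketch assumes the six-value dataset above:

```python
from collections import defaultdict

data = [23, 25, 32, 34, 34, 41]

stems = defaultdict(list)
for x in sorted(data):
    stems[x // 10].append(x % 10)  # stem = tens digit, leaf = units digit

for stem in sorted(stems):
    leaves = ' '.join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")
# 2 | 3 5
# 3 | 2 4 4
# 4 | 1
```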

An ogive (pronounced "oh-jive") is a line graph that plots the cumulative frequency (or cumulative relative frequency) against the upper class boundary of each bin. It is the graphical counterpart to the cumulative frequency distribution. Its primary use is to easily answer percentile questions, such as "What score did 80% of students achieve or fall below?" You would find 80% on the vertical axis, trace over to the ogive curve, and then trace down to the corresponding score on the horizontal axis.
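Reading a value off an ogive amounts to linear interpolation along the cumulative curve. The sketch below assumes hypothetical exam-score bins and frequencies:

```python
import numpy as np

# Hypothetical exam scores binned as 50-59, 60-69, ... with these frequencies
upper_bounds = np.array([59.5, 69.5, 79.5, 89.5, 99.5])
freqs = np.array([4, 10, 18, 12, 6])

cum_rel = np.cumsum(freqs) / freqs.sum()  # cumulative relative frequency

# "What score did 80% of students achieve or fall below?"
# Trace across from 0.80 on the vertical axis, down to the score axis:
score_at_80 = np.interp(0.80, cum_rel, upper_bounds)
print(round(float(score_at_80), 1))  # 86.2
```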

Interpreting the Shape of a Distribution

The ultimate goal of creating these tables and graphs is to understand the distribution shape. The shape tells a story about the process that generated the data. Key characteristics to identify are:

  • Modality: The number of prominent peaks. A single peak is unimodal; a distribution with two clear peaks is bimodal, often suggesting two distinct groups are combined in the data (e.g., heights of adult men and women).
  • Symmetry: A distribution is symmetric if the left and right sides are mirror images. The classic "bell curve" (normal distribution) is symmetric and unimodal.
  • Skewness: Lack of symmetry. Right-skewed (positively skewed) distributions have a long tail extending to the right; the mean is typically greater than the median. Left-skewed (negatively skewed) distributions have a long tail to the left.
  • Central Tendency and Spread: The shape visually indicates where the data is centered and how variable it is. A tall, narrow histogram suggests low variability; a short, wide one suggests high variability.

Recognizing these shapes guides your next analytical steps. For instance, many statistical tests assume a roughly symmetric, unimodal distribution. Finding strong skewness signals that you may need to transform the data or use different, non-parametric methods.
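The mean-versus-median relationship for skewed data can be checked directly. The synthetic incomes below are drawn from an exponential distribution, which is right-skewed by construction:

```python
import numpy as np

# Synthetic incomes from an exponential distribution,
# which is right-skewed by construction
rng = np.random.default_rng(0)
incomes = rng.exponential(scale=40_000, size=1_000)

mean, median = incomes.mean(), np.median(incomes)
print(mean > median)  # True: the long right tail pulls the mean above the median
```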

Common Pitfalls

  1. Using the Wrong Number of Bins in a Histogram: Using a default setting (like 10 bins in software) for every dataset is a major error. A dataset with 1000 points needs more bins than one with 50 points to reveal meaningful detail. Always experiment with different bin widths. If the shape of the histogram changes dramatically with a small change in bin width, your data may be too sparse for a stable histogram, or you may need to use a different plot like a kernel density estimate.
  2. Ignoring the Impact of Outliers on Bin Selection: Extreme values can stretch your horizontal axis, making the bulk of the data cram into a few bins. For example, if most incomes fall between $0 and $120k but one is $5M, the stretched axis will make the main data cluster invisible. The correction is to either use a trimmed range for the initial visualization (and note the outlier separately) or use a transformation (like a log scale) to bring the data onto a more comparable scale.
  3. Confusing Histograms with Bar Charts: Treating them as interchangeable leads to misinterpretation. Remember: Bar charts are for categorical data (e.g., sales per product category); the order of bars can be changed. Histograms are for quantitative, continuous data; the order of bins is fixed along a number line, and the bars touch. Putting space between the bars in a histogram incorrectly implies the data is discrete.
  4. Misreading Distribution Shape: Calling a distribution "normal" based on a quick glance is risky. Many unimodal, roughly symmetric distributions are not normal. Look for characteristics like ~68% of data within one standard deviation of the mean, which a true normal distribution possesses. Don't rely on visual assessment alone; always use statistical tests (like Shapiro-Wilk) to check for normality when it's a required assumption for your analysis.
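As a sketch of that last point, SciPy's `shapiro` function runs the Shapiro-Wilk test on a sample (here a synthetic normal sample, for illustration only):

```python
import numpy as np
from scipy import stats

# Synthetic sample from a normal distribution, for illustration
rng = np.random.default_rng(1)
sample = rng.normal(loc=0.0, scale=1.0, size=100)

stat, p_value = stats.shapiro(sample)
# A large p-value means the test found no evidence against normality;
# it does not prove the data are normal.
print(f"W = {stat:.3f}, p = {p_value:.3f}")
```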

Summary

  • Frequency tables are the essential first step for organizing both discrete and continuous data, with relative and cumulative versions providing proportional and running-total perspectives.
  • For continuous data, histograms are created by grouping data into bins; careful bin width selection is crucial for an honest representation of the underlying distribution.
  • Stem-and-leaf plots preserve raw data while showing distribution shape, and ogives are used to easily determine percentiles and cumulative proportions.
  • The primary goal of these tools is to reveal the shape of the distribution—its modality, symmetry, skewness, and spread—which informs all subsequent statistical analysis and modeling decisions.
  • Avoid critical mistakes like using inappropriate bin counts, letting outliers distort the visualization, confusing histograms with bar charts, and over-interpreting distribution shape without statistical validation.
