Mar 7

Multi-Armed Bandit Testing

Mindli Team

AI-Generated Content


In a world where speed is a competitive advantage, waiting weeks for a traditional A/B test result can mean missed opportunities. Multi-armed bandit testing is a dynamic experimentation approach that learns as it runs, automatically shifting more users to better-performing variants to maximize a key metric while the test is still live. This method is particularly powerful for optimization problems where the goal is not just to learn but to perform better during the learning process, such as improving conversion rates or engagement in real-time.

The Core Bandit Analogy and Dynamic Allocation

The name derives from a classic gambling scenario: imagine a casino visitor facing a row of slot machines, each with an unknown payout probability. Each machine is a "one-armed bandit," and the row is a "multi-armed bandit." The gambler's challenge is to devise a strategy that maximizes total winnings by deciding which machines to play and how often, balancing the exploration of uncertain machines with the exploitation of machines known to pay well.

This directly translates to digital experimentation. In a traditional A/B/n test (or fixed-split test), traffic is divided into static groups (e.g., 50%/50%) for the entire duration. You gather data, analyze it after the fact, and then implement the winner. A bandit test, however, treats each variant (A, B, C) as a slot machine. It dynamically reallocates the incoming user traffic based on real-time performance data. If variant B starts showing a higher conversion rate early on, the bandit algorithm will automatically send it a larger share of subsequent visitors. This creates a fundamental trade-off: the algorithm must explore all variants to gather reliable data, while also exploiting the current best-known variant to maximize cumulative gains.

Key Bandit Algorithms: Epsilon-Greedy and Thompson Sampling

Two primary algorithms power most bandit tests, each handling the explore-exploit trade-off differently.

The epsilon-greedy (ε-greedy) algorithm is conceptually simple. You define a small exploration probability ε (e.g., 5% or 10%). For each new user, the algorithm takes a "greedy" action with probability 1 − ε, serving the variant with the highest observed performance so far. With probability ε, it takes an "exploratory" action, choosing any variant uniformly at random. This ensures that even a variant performing poorly early on still receives a trickle of traffic, allowing the algorithm to discover if its performance improves. While easy to implement, its exploration is random and not very efficient.
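The selection rule above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the function name and the per-variant success/trial counters are my own choices for the example.

```python
import random

def epsilon_greedy_choice(successes, trials, epsilon=0.1):
    """Return the index of the variant to serve next.

    With probability epsilon, explore: pick any variant uniformly.
    Otherwise, exploit: pick the variant with the highest observed
    conversion rate (variants with no trials yet count as 0).
    """
    if random.random() < epsilon:
        return random.randrange(len(trials))  # exploratory action
    rates = [s / t if t > 0 else 0.0 for s, t in zip(successes, trials)]
    return max(range(len(rates)), key=lambda i: rates[i])  # greedy action
```

Each user interaction would then increment the chosen variant's trial count, and its success count if the user converted.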

A more sophisticated and statistically sound approach is Thompson sampling. This is a Bayesian algorithm that models the uncertainty about each variant's true performance. For a conversion-rate problem, it typically assumes a Beta distribution for each variant. The algorithm starts with a prior belief (e.g., Beta(1, 1), representing total uncertainty). As data comes in (s successes and f failures), the belief updates to a posterior distribution, Beta(1 + s, 1 + f). For each user, the algorithm samples a random value from each variant's current posterior distribution and serves the variant that produced the highest sample. This elegantly balances exploration and exploitation: a variant with high uncertainty (a wide posterior distribution) has a good chance of producing a high sample, even if its current mean is low.
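The sample-and-compare step translates almost directly into code. A minimal sketch, assuming per-variant success and failure counts are tracked elsewhere (the function name is illustrative):

```python
import random

def thompson_choice(successes, failures):
    """Return the index of the variant to serve next.

    Draw one sample from each variant's Beta(1 + s, 1 + f) posterior
    and serve the variant whose draw is highest. Wide posteriors
    (little data) naturally get explored; confident winners get
    exploited.
    """
    samples = [random.betavariate(1 + s, 1 + f)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])
```

Because the draws are random, a lagging but uncertain variant still wins some users, which is exactly the built-in exploration described above.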

Implementing a Bandit Test for Optimization

To run a bandit test, you follow a structured process focused on a single, clearly defined objective metric, such as click-through rate, conversion rate, or average revenue per user.

  1. Define your goal precisely. Are you purely optimizing for cumulative reward during the campaign (e.g., donations during a fundraiser)? Or are you willing to sacrifice some short-term gains to gather robust data for a long-term decision? Bandits excel at the former.
  2. Choose your algorithm. Thompson sampling is generally preferred for its efficiency.
  3. Set up the infrastructure. You need a system that can update each variant's performance data (successes and trials) after every user interaction and use that data to probabilistically assign the next user. Many modern experimentation platforms offer this capability.
  4. Monitor differently. Instead of waiting for statistical significance, you monitor the traffic allocation. The variant dominating traffic (e.g., receiving 80% of users) is the de facto winner being exploited. You can let the test run indefinitely for continuous optimization or stop it once allocations stabilize.

When to Choose a Bandit Over a Traditional A/B Test

The choice between a bandit and a fixed-split A/B test hinges on your primary goal. Use a multi-armed bandit test when your objective is real-time optimization and minimizing opportunity cost. Classic scenarios include:

  • Promotional campaigns with a short timeline: Maximizing sign-ups or sales during the campaign itself.
  • Personalizing user experiences: Adapting website content or recommendations in real-time.
  • Testing with low traffic: Bandits efficiently focus traffic on promising variants faster than an A/B test.
  • Situations where the "environment" is non-stationary (user preferences change): Bandits can adapt, while a static A/B test's results may become outdated.

Use a traditional A/B/n test when your primary goal is causal learning and making a confident, high-stakes decision. This is necessary when:

  • You need a definitive, statistically rigorous answer about why a change worked or didn't.
  • You are testing a major feature change with long-term implications.
  • You must understand interactions between multiple variables (requiring factorial designs).
  • The cost of a false positive (implementing a loser) is very high. Bandits can sometimes lock onto a suboptimal variant early due to bad luck.

Common Pitfalls

Pitfall 1: Treating a bandit like an A/B test for learning. The most common mistake is running a bandit for a set period and then declaring a "winner" with a p-value. Bandits are not designed for rigorous hypothesis testing at a snapshot in time. Their output is the dynamic allocation strategy and the cumulative reward, not a static statistical comparison.

Pitfall 2: Over-exploitation with poor exploration. Setting an ε-greedy algorithm's ε value too low (e.g., 1%) can cause the bandit to exploit the initial best variant too aggressively. If that variant's early lead was due to chance, the bandit may get stuck in a suboptimal state, starving other variants of the traffic needed to prove they are better.

Pitfall 3: Using bandits with insufficient traffic or weak signal. While bandits can work with low traffic, if the performance difference between variants is very small and traffic is minimal, the algorithm may never conclusively shift allocation. It will continue to explore at a high rate, negating the benefit of dynamic optimization. Bandits work best when there are meaningful differences to discover.

Pitfall 4: Ignoring the changing context. A bandit optimizing for midday clicks might converge on a variant that performs terribly at night. If your user population or context shifts dramatically, the bandit's learned model may become outdated. It's crucial to monitor key metrics over time and consider resetting or re-initializing the test if the underlying environment changes.

Summary

  • Multi-armed bandit testing is a dynamic approach that reallocates traffic to better-performing variants in real-time, balancing exploration (gathering data) with exploitation (maximizing a metric).
  • Core algorithms include the simple epsilon-greedy method and the more efficient Thompson sampling, which uses Bayesian probability to model uncertainty.
  • Bandits are ideal for optimization problems where minimizing opportunity cost during the test is the primary goal, such as short-term campaigns or continuous personalization.
  • Traditional fixed-split A/B tests remain superior for causal learning and high-stakes decisions where rigorous statistical validation is required.
  • Successful implementation requires a clear objective, appropriate algorithm selection, and an understanding that bandits are tools for optimization, not for definitive, snapshot hypothesis testing.
