Mar 1

Multi-Armed Bandit Algorithms

Mindli Team

AI-Generated Content

The Multi-Armed Bandit problem is a fundamental framework for making sequential decisions under uncertainty, and mastering it is essential for anyone in data science, marketing, or any field that runs experiments. At its heart, it’s the challenge of how to optimally allocate limited resources—like user traffic in an A/B test or capital in research projects—when you must learn about your options as you go. Unlike traditional, static A/B testing, bandit algorithms dynamically adapt, continuously balancing the need to explore unknown options with the need to exploit the current best one to maximize cumulative reward. This makes them powerful tools for recommendation systems, online advertising, and even clinical trials, where static experimentation is inefficient or even unethical.

The Core Concepts: Exploration, Exploitation, and Regret

To understand bandit algorithms, you must first grasp the core dilemma they solve. Imagine a row of slot machines (one-armed bandits) in a casino, each with an unknown probability of paying out. You have a limited number of plays. Your goal is to maximize your total winnings. Every time you choose a machine, you perform an action and receive a reward (or not). The tension is between exploration, trying machines you know less about to gather data, and exploitation, playing the machine that seems best based on current data to secure rewards.

The performance of a bandit algorithm is measured by regret. Cumulative regret is the total difference between the rewards you would have earned if you had always chosen the single best arm (the optimal one) and the rewards you actually earned. The goal of regret minimization is to design algorithms whose regret grows as slowly as possible over time, ideally logarithmically. An alternative goal is best arm identification, where the aim is purely to find the optimal arm with high confidence, often sacrificing short-term rewards. This distinction is crucial: most standard bandit algorithms are designed for regret minimization, but you might choose a different strategy if your sole objective is to identify a winner for a future, larger-scale rollout.
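To make cumulative regret concrete, here is a minimal Python sketch; the arm means and the sequence of pulls are hypothetical values, chosen only to illustrate the calculation:

```python
# Cumulative regret of a pull sequence, given hypothetical true arm means.
true_means = [0.03, 0.05, 0.02]   # assumed click-through rates for arms 0, 1, 2
best_mean = max(true_means)       # the optimal arm's mean (arm 1)

pulls = [0, 1, 1, 2, 1, 1]        # arms chosen over six rounds
# Each round contributes (best_mean - chosen arm's mean) to expected regret.
regret = sum(best_mean - true_means[arm] for arm in pulls)
print(round(regret, 2))           # → 0.05  (0.02 + 0 + 0 + 0.03 + 0 + 0)
```

Every pull of the optimal arm adds zero regret; every pull of a suboptimal arm adds its gap to the best arm, which is why regret-minimizing algorithms try to cap the number of suboptimal pulls.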

Foundational Bandit Algorithms

The three most fundamental algorithms provide distinct philosophies for managing the exploration-exploitation trade-off.

Epsilon-Greedy: Simple and Robust

The epsilon-greedy algorithm is the most intuitive entry point. With probability ε (e.g., 0.1), you explore by selecting an arm uniformly at random. With probability 1 − ε, you exploit by selecting the arm with the highest current empirical mean reward. This simplicity is its strength. For example, if you are testing three website headlines (A, B, C) with current click-through rates of 3%, 5%, and 2%, an epsilon-greedy algorithm with ε = 0.1 will choose the best headline (B) 90% of the time and randomly pick among all three 10% of the time to gather new data.

Its weakness is its naivete. It explores completely at random, wasting exploration pulls on arms that are clearly inferior. A common improvement is epsilon-decay, where ε starts high and gradually decreases over time, allowing for heavy exploration early and more exploitation later.
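A minimal epsilon-greedy sketch with multiplicative epsilon-decay; the decay rate, round count, and click-through rates are illustrative assumptions, not canonical values:

```python
import random

def epsilon_greedy(n_arms, reward_fn, rounds=10_000, eps0=1.0, decay=0.999):
    """Epsilon-greedy with multiplicative epsilon decay (parameters are illustrative)."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    eps = eps0
    for _ in range(rounds):
        if random.random() < eps:
            arm = random.randrange(n_arms)                    # explore uniformly
        else:
            arm = max(range(n_arms), key=lambda a: means[a])  # exploit current best
        r = reward_fn(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]          # incremental mean update
        eps *= decay                                          # decay exploration over time
    return means, counts

# Hypothetical headline click-through rates from the example above.
ctr = [0.03, 0.05, 0.02]
means, counts = epsilon_greedy(3, lambda a: 1 if random.random() < ctr[a] else 0)
```

With eps0 = 1.0 and decay = 0.999, the algorithm spends roughly its first thousand pulls mostly exploring, then shifts almost entirely to exploitation, which is the behavior a fixed ε cannot provide.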

UCB1: Optimism in the Face of Uncertainty

The Upper Confidence Bound (UCB1) algorithm takes a more principled, optimistic approach. The core idea is to construct a statistical confidence interval for the true mean reward of each arm. Instead of choosing the arm with the highest sample mean, UCB1 chooses the arm with the highest upper confidence bound. This bound is calculated as:

UCB_i = x̄_i + √(2 · ln n / n_i)

where x̄_i is the current sample mean reward for arm i, n is the total number of rounds played so far, and n_i is the number of times arm i has been pulled. The term under the square root is the exploration bonus. Arms with high uncertainty (low n_i) or high potential (high x̄_i) get a boost. UCB1 automatically balances exploration and exploitation: arms that are rarely played see their exploration bonus grow, making them more likely to be selected until their uncertainty shrinks. It provides strong theoretical guarantees for logarithmic regret.
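A sketch of UCB1 under the same Bernoulli-reward setup; the round count and the simulated click-through rates are assumptions for illustration:

```python
import math
import random

def ucb1(n_arms, reward_fn, rounds=5_000):
    """UCB1: play each arm once, then always pick the highest upper confidence bound."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, rounds + 1):
        if t <= n_arms:
            arm = t - 1                 # initialization: pull every arm once
        else:
            # mean + exploration bonus sqrt(2 ln n / n_i) for each arm
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        r = reward_fn(arm)
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
    return means, counts

# Hypothetical Bernoulli arms, as in the headline example.
ctr = [0.03, 0.05, 0.02]
ucb_means, ucb_counts = ucb1(3, lambda a: 1 if random.random() < ctr[a] else 0)
```

Note there is no tunable exploration probability here: the √(2 ln n / n_i) bonus shrinks on its own as an arm accumulates pulls.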

Thompson Sampling: A Bayesian Approach

Thompson Sampling is a probabilistic and often exceptionally effective method. It is a Bayesian algorithm that maintains a probability distribution (a posterior) over the possible reward rate for each arm. For Bernoulli rewards (click/no-click), a Beta distribution is the natural choice. You start with a prior (e.g., Beta(1,1) for uniform belief). After each round, you update the posterior for the chosen arm based on the observed reward.

The algorithm is elegantly simple: in each round, you sample a single plausible reward value from the current posterior distribution of each arm. You then select the arm for which this sampled value is highest. This naturally balances exploration and exploitation: an arm with a wide posterior (high uncertainty) has a chance of yielding a high sampled value, even if its current mean is low. Over time, the posteriors for suboptimal arms narrow around their lower true means, and they are sampled less frequently. Thompson Sampling often outperforms both epsilon-greedy and UCB1 in practice, and its randomized choices make it notably robust in settings with delayed or batched feedback.
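A sketch of Thompson Sampling for Bernoulli rewards with Beta(1, 1) priors, as described above; the click-through rates and round count are hypothetical:

```python
import random

def thompson_bernoulli(n_arms, reward_fn, rounds=5_000):
    """Thompson Sampling for Bernoulli rewards, Beta(1, 1) priors on each arm."""
    alpha = [1] * n_arms   # successes + 1
    beta = [1] * n_arms    # failures + 1
    for _ in range(rounds):
        # Draw one plausible reward rate per arm from its current posterior.
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n_arms)]
        arm = samples.index(max(samples))   # play the arm with the highest draw
        if reward_fn(arm):
            alpha[arm] += 1                 # success: shift posterior upward
        else:
            beta[arm] += 1                  # failure: shift posterior downward
    # Posterior mean reward rate for each arm.
    return [alpha[a] / (alpha[a] + beta[a]) for a in range(n_arms)]

# Hypothetical Bernoulli arms, as in the headline example.
ctr = [0.03, 0.05, 0.02]
post_means = thompson_bernoulli(3, lambda a: 1 if random.random() < ctr[a] else 0)
```

The entire decision rule is the single `betavariate` draw per arm; there is no explicit exploration parameter, because uncertainty in the posterior does the exploring.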

Advanced Topics: Contextual Bandits and A/B Testing Comparison

From Bandits to Contextual Bandits

A standard bandit treats each user identically, which is a severe limitation. Contextual bandits solve this by incorporating features (context) about each decision point. For example, when choosing which ad to show, the context might be user demographics, browsing history, and time of day. Instead of learning a single best arm, a contextual bandit learns a mapping from context to the best arm. Algorithms like LinUCB (a linear variant of UCB) or contextual Thompson sampling are used. This enables true personalization: the algorithm might choose Arm A for young users in the evening and Arm B for older users in the morning, dramatically improving performance over a one-size-fits-all bandit.
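To make the mapping from context to arm concrete, here is a pure-Python sketch of disjoint LinUCB restricted to 2-dimensional contexts (so the 2×2 matrix inverse can be written by hand). The context features and the reward rule are invented for the example; a real implementation would use a linear-algebra library and higher-dimensional features:

```python
import math
import random

def linucb_2d(contexts, reward_fn, n_arms=2, alpha=1.0):
    """Disjoint LinUCB sketch for 2-D contexts.
    Each arm keeps A = I + sum(x x^T) and b = sum(r x); theta = A^-1 b."""
    A = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(n_arms)]   # per-arm 2x2 matrix
    b = [[0.0, 0.0] for _ in range(n_arms)]                 # per-arm reward vector
    for x in contexts:
        best, best_ucb = 0, float("-inf")
        for a in range(n_arms):
            (p, q), (r2, s) = A[a]
            det = p * s - q * r2
            inv = [[s / det, -q / det], [-r2 / det, p / det]]       # A^-1
            theta = [inv[0][0] * b[a][0] + inv[0][1] * b[a][1],     # A^-1 b
                     inv[1][0] * b[a][0] + inv[1][1] * b[a][1]]
            mean = theta[0] * x[0] + theta[1] * x[1]                # predicted reward
            Ax = [inv[0][0] * x[0] + inv[0][1] * x[1],
                  inv[1][0] * x[0] + inv[1][1] * x[1]]
            # x^T A^-1 x measures this arm's uncertainty in this context.
            ucb = mean + alpha * math.sqrt(x[0] * Ax[0] + x[1] * Ax[1])
            if ucb > best_ucb:
                best, best_ucb = a, ucb
        r = reward_fn(best, x)
        for i in range(2):                  # rank-one update of A and b
            b[best][i] += r * x[i]
            for j in range(2):
                A[best][i][j] += x[i] * x[j]
    return A, b

# Invented example: arm 1 pays off for high second feature, arm 0 for low.
random.seed(0)
contexts = [[1.0, random.random()] for _ in range(200)]
def reward(arm, x):
    p = 0.7 if (arm == 1) == (x[1] > 0.5) else 0.2
    return 1.0 if random.random() < p else 0.0
A, b = linucb_2d(contexts, reward)
```

The key difference from UCB1 is that the confidence bonus depends on the context x, so the same arm can look well-explored for some users and highly uncertain for others.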

Bandits vs. Fixed Horizon A/B Testing

It's critical to understand when to use a bandit algorithm instead of a traditional, fixed-horizon A/B test.

  • Objective: A/B testing is designed for best arm identification with statistical confidence. It aims for a clear, unbiased declaration of a winner. Bandits are designed for regret minimization, maximizing cumulative reward during the experiment itself.
  • Allocation: A/B tests typically use a fixed, equal split (50/50) for the entire experiment. Bandits dynamically shift traffic toward the better-performing variant, minimizing opportunity cost.
  • When to Use:
  • Use A/B testing when you need high-confidence, incontrovertible evidence for a major, one-time decision (e.g., a permanent website redesign). The cost of running the test is acceptable.
  • Use bandit algorithms when you are running many ongoing experiments, when the cost of showing a suboptimal option is high (e.g., in clinical trials or ad revenue), or when you need to adapt quickly to changing environments (non-stationary rewards).

Common Pitfalls

  1. Using Epsilon-Greedy with a Fixed, High Epsilon: A constant high ε (e.g., 0.3) guarantees that you will waste a large fraction of your traffic on random exploration forever, incurring high perpetual regret. Correction: Use epsilon-decay or switch to a more adaptive algorithm like UCB1 or Thompson Sampling.
  2. Ignoring Non-Stationarity: Standard bandit algorithms assume the reward distribution of each arm is static. In the real world, user preferences change. A bandit that has converged to an arm may never re-explore if that arm's performance later degrades. Correction: Use algorithms designed for non-stationary environments, such as those incorporating discounting or sliding windows, or periodically reset the exploration parameters.
  3. Confusing Minimizing Regret with Finding the Best Arm: Deploying a regret-minimizing bandit and then declaring the most-played arm the "winner" can be statistically flawed. The algorithm's goal was to give you good rewards during the test, not to provide an unbiased estimate of each arm's true value. Correction: If your primary goal is inference and best arm identification, use dedicated algorithms for that purpose (like successive elimination) or run a traditional A/B test after using a bandit for an initial screening phase.
  4. Applying Standard Bandits to Complex, High-Dimensional Contexts: Trying to use a basic UCB1 or Thompson sampler for a problem with rich user context will fail. You'll essentially be treating every unique context combination as a separate bandit, leading to a massive cold-start problem. Correction: Use a proper contextual bandit algorithm (e.g., LinUCB, Neural Thompson Sampling) that can generalize learning across similar contexts.
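For the non-stationarity pitfall, a sliding-window reward tracker is one simple way to forget stale observations. This sketch uses an illustrative window size; in practice the window is tuned to how fast the environment drifts:

```python
from collections import deque

class SlidingWindowArm:
    """Per-arm reward tracker that forgets old observations (window size is illustrative)."""

    def __init__(self, window=100):
        self.rewards = deque(maxlen=window)   # oldest rewards fall off automatically

    def update(self, r):
        self.rewards.append(r)

    def mean(self):
        # Empirical mean over the window only, so old behavior stops mattering.
        return sum(self.rewards) / len(self.rewards) if self.rewards else 0.0

# An arm that used to pay off but has degraded: only recent pulls count.
arm = SlidingWindowArm(window=3)
for r in [1, 1, 1, 0, 0, 0]:
    arm.update(r)
print(arm.mean())   # → 0.0, the early successes have aged out
```

Plugging such a tracker into epsilon-greedy or UCB1 in place of the all-time empirical mean lets the algorithm re-explore an arm whose recent performance has slipped.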

Summary

  • Multi-armed bandit algorithms provide a dynamic framework for sequential decision-making by optimally balancing exploration (gathering new information) and exploitation (leveraging known information).
  • The three foundational algorithms are epsilon-greedy (simple but naive), UCB1 (optimistically chooses arms with high upper confidence bounds), and Thompson Sampling (Bayesian sampling-based, often the most performant).
  • Contextual bandits incorporate feature vectors about each decision point, enabling personalized decisions in applications like recommendations and ad placement.
  • Bandits are designed for regret minimization during an experiment, unlike fixed A/B testing, which is designed for best arm identification after an experiment. Choose the framework based on your primary goal: cumulative performance or statistical inference.
  • Key practical applications include adaptive clinical trials (minimizing patient exposure to inferior treatments), real-time ad and content optimization, and efficient product feature testing.
