Algo: Reservoir Sampling
Imagine you're processing a real-time feed of social media posts, server logs, or scientific sensor data—a stream so vast you don't know its total length. How do you select a truly random, representative sample of k items from this endless flow? Loading everything into memory is impossible, and you can't go back. Reservoir sampling is the elegant algorithmic solution to this fundamental problem in data engineering and stream processing. It allows you to maintain a fair, statistically valid sample on the fly, ensuring every element seen has an equal chance of being in your final sample, even when the total count is unknown.
The Core Challenge and Algorithm R
The core challenge is selecting k items uniformly at random from a data stream of unknown, possibly infinite, length n. You must process each item once, in order, and decide immediately whether to include it in the sample without knowing what comes later. The classic solution is Algorithm R, also known as the reservoir algorithm.
The algorithm maintains a reservoir—an array of size k—which holds the current candidate sample. It works in two distinct phases:
- Initialization: For the first k elements of the stream, simply place them into the reservoir. At this point, each has a 100% probability of being selected.
- Inductive Replacement: For each subsequent element arriving at position i (where i starts from k+1), generate a random integer j uniformly between 1 and i (inclusive). If j ≤ k, you replace the j-th item in the reservoir with the new incoming element. Otherwise, the new element is discarded.
Here is a straightforward Python implementation:
```python
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, element in enumerate(stream):
        if i < k:
            reservoir.append(element)  # Initial fill: first k items go straight in
        else:
            j = random.randrange(i + 1)  # Random int from 0 to i (inclusive)
            if j < k:  # Corresponds to j + 1 ≤ k in 1-indexed terms
                reservoir[j] = element  # Replace the j-th reservoir item
    return reservoir
```

The beauty lies in the probability used for replacement: k/i for the i-th element. This decreasing probability balances the increasing opportunity an element has to be selected as the stream grows.
Proof of Uniform Selection Probability
Understanding why Algorithm R works is crucial. The proof relies on induction and probability. We must prove that after processing n elements (where n ≥ k), the probability any given element is in the reservoir is exactly k/n.
- Base Case: For the first k elements (n = k), they are all in the reservoir. The probability for each is k/k = 1, which holds.
- Inductive Step: Assume the property holds after processing n elements, i.e., each element is in the reservoir with probability k/n. Now, element n+1 arrives.
- For the new element (n+1): It is selected into the reservoir if our random j ≤ k. The probability of this is k/(n+1).
- For an old element (from the first n elements): To survive in the reservoir, two independent events must occur:
- It must already be in the reservoir after n steps (probability = k/n, by our inductive hypothesis).
- It must not be replaced when element n+1 is processed. Replacement only occurs if n+1 is selected and it specifically replaces this old element's slot. The probability n+1 is selected is k/(n+1). The probability it replaces this specific slot, given it is selected, is 1/k. Therefore, the probability the old element is replaced is k/(n+1) × 1/k = 1/(n+1). Thus, the probability it survives is 1 − 1/(n+1) = n/(n+1).
The probability the old element remains is therefore: (k/n) × (n/(n+1)) = k/(n+1).
This completes the induction, proving all elements have equal probability of being in the final sample. This uniform sampling guarantee is the foundation of the algorithm's correctness.
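The induction can also be checked empirically. The sketch below restates reservoir_sample so it runs on its own, then verifies that each of n = 10 elements lands in a size-3 reservoir roughly k/n = 3/10 of the time over many trials (the seed and tolerance are arbitrary choices for reproducibility):

```python
import random
from collections import Counter

def reservoir_sample(stream, k):
    """Algorithm R: uniform random sample of k items from a stream."""
    reservoir = []
    for i, element in enumerate(stream):
        if i < k:
            reservoir.append(element)
        else:
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = element
    return reservoir

# Each of n = 10 elements should appear with probability k/n = 3/10.
random.seed(0)
trials = 20_000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(10), 3))

for element in range(10):
    freq = counts[element] / trials
    assert abs(freq - 0.3) < 0.02, (element, freq)
```

With 20,000 trials the standard error of each frequency is about 0.003, so a 0.02 tolerance is a comfortable margin; a biased implementation (e.g. `randrange(k)` instead of `randrange(i + 1)`) fails this check immediately.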
Extension to Weighted Reservoir Sampling
In many real-world scenarios, not all stream items are equally important. You might want to sample tweets proportionally to their engagement, or log entries based on error severity. This requires weighted reservoir sampling, where each incoming item has an associated weight w_i, and the probability of selection should be proportional to its weight.
Algorithm R's simple random integer selection no longer works. A common and efficient approach is the key-based method of Efraimidis and Spirakis (A-Res, with the jump-optimized variant A-ExpJ). The core idea is to assign each item a random key k_i = u_i^(1/w_i), where u_i is a uniform random number between 0 and 1. You then maintain the reservoir as the k items with the largest keys. The exponentiation ensures that items with higher weights have a higher probability of generating a larger key. In practice, you can work in log space and avoid explicit exponentiation by calculating a score ln(u_i)/w_i and selecting the k items with the largest scores: since ln(u_i) is negative, every score is negative, and the score closest to zero corresponds to the largest original key.
```python
import random, math, heapq

def weighted_reservoir_sample(stream_weights, k):
    reservoir = []  # Min-heap of (score, element) tuples; largest k scores survive
    for element, weight in stream_weights:
        # 1 - random() lies in (0, 1], keeping the argument of log strictly positive
        score = math.log(1.0 - random.random()) / weight  # log-space key ln(u)/w
        if len(reservoir) < k:
            heapq.heappush(reservoir, (score, element))
        elif score > reservoir[0][0]:  # New score beats the smallest kept score
            heapq.heapreplace(reservoir, (score, element))
    return [item for _, item in reservoir]
```

This technique is vital for modern data systems that need to bias samples without full knowledge of the data distribution.
Applications in Large-Scale Data Processing
Reservoir sampling is not just a theoretical curiosity; it's a workhorse in distributed systems and big data frameworks. Its primary utility is in approximate query processing and online analytics where examining a full dataset is prohibitively expensive.
For instance, when running a SELECT ... ORDER BY random() LIMIT k query on a terabyte-scale table in a database like PostgreSQL (without special extensions), the system might need to sort the entire table, which is massively inefficient. A query planner using reservoir sampling can return a statistically random sample by scanning the data once. In distributed frameworks like Apache Spark, the sample() operation on RDDs often uses a variant of reservoir sampling to pull a random fraction of data from each partition, enabling quick exploratory data analysis on massive datasets. Similarly, monitoring and observability tools use reservoir sampling to capture a representative subset of trace data from high-throughput microservices, keeping storage costs manageable while preserving the ability to diagnose systemic issues.
Common Pitfalls
- Off-By-One Errors in the Random Range: The most common implementation error is generating a random integer from 1 to k (or 0 to k-1) instead of 1 to i. This breaks the probability math and biases the sample towards earlier elements. Always ensure the random range grows with the stream index i. In the provided code, `random.randrange(i + 1)` correctly gives a number from 0 to i.
- Assuming the Reservoir is Sorted: The reservoir does not preserve the original order of the stream from which items were selected. The final array is a random subset in arbitrary order (often the order of insertion/replacement). If you need the sample in its original sequence, you must track the original indices alongside the elements.
- Misapplying to Known-Length Data: If you know the total number of items n in advance, simpler algorithms exist. For example, you can generate k distinct random indices and select those items directly. Reservoir sampling's power is specifically for the streaming/unknown-n context. Using it for a static array adds unnecessary complexity.
- Ignoring Weighted Sampling Requirements: Attempting to hack uniform reservoir sampling for weighted data by duplicating elements or other heuristics is inefficient and often incorrect. For weighted streams, always implement a dedicated algorithm like the key-based method to ensure correct proportionality.
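To illustrate the third pitfall: when the total length n is known up front, the standard library already solves the problem directly, with no streaming machinery needed. Python's random.sample draws k distinct items uniformly without replacement (the data and seed below are arbitrary):

```python
import random

random.seed(7)
data = list(range(1000))         # static data of known length n = 1000
sample = random.sample(data, 5)  # k = 5 distinct items, uniformly at random
assert len(set(sample)) == 5 and all(x in data for x in sample)
print(sample)
```

Reserve Algorithm R for the case random.sample cannot handle: data that arrives incrementally and cannot be held in memory or revisited.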
Summary
- Reservoir sampling (Algorithm R) solves the critical problem of maintaining a uniform random sample of fixed size k from a data stream of unknown or infinite length.
- Its correctness is proven by induction, showing that for any stream length n, the probability any element is in the sample is exactly k/n.
- The algorithm can be extended to weighted reservoir sampling using a key-based approach (key k_i = u_i^(1/w_i)), allowing samples to be drawn proportionally to item importance.
- It is a foundational tool in large-scale data processing, enabling efficient approximate queries, online analytics, and trace sampling in distributed systems like Spark and monitoring platforms.
- Key implementation pitfalls include off-by-one errors in the random index generation and misapplying the algorithm to static data where simpler methods exist.