Advanced Algorithms: Online and Streaming
Modern systems rarely get the luxury of a complete dataset sitting quietly on disk. Instead, data arrives sequentially: clicks in a web session, packets in a network, trades in a market, readings from sensors, and events in a log. Algorithms designed for this reality fall into two closely related families. Online algorithms make decisions as inputs arrive, without knowing the future. Streaming algorithms process a continuous flow of data under tight memory and time constraints, typically with a small number of passes (often one).
These settings change what “good” means. A solution might be judged not only by accuracy or optimality, but by how quickly it reacts, how little memory it uses, and how well it performs against an adversarial sequence.
What makes online and streaming problems different?
Online computation: decisions now, consequences later
An online algorithm receives inputs one at a time and must commit to actions before seeing what comes next. This commitment is the defining challenge. Once you assign a job to a server, accept a request, or pick a price, you may not be able to revise that decision later.
Classic online problems include:
- Caching (paging): which items to keep in a limited cache to minimize future misses.
- Scheduling: assigning tasks to machines as tasks arrive to minimize completion time or lateness.
- Matching: pairing arriving requests to resources without knowing future requests.
- Admission control: accepting or rejecting requests under capacity constraints.
In many of these tasks, the offline optimum (with full knowledge of the future) sets an unattainable benchmark. Online algorithm design is about narrowing the gap.
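To make the "commit now" constraint concrete, here is a minimal sketch of greedy online scheduling on identical machines (Graham's list scheduling): each arriving job goes to the currently least-loaded machine, and the placement is never revised. The job sizes and machine count are illustrative.

```python
import heapq

def greedy_online_schedule(jobs, num_machines):
    """Assign each arriving job to the currently least-loaded machine.

    Decisions are committed as jobs arrive and never revised; Graham's
    analysis shows this rule is (2 - 1/m)-competitive for makespan.
    """
    loads = [(0.0, m) for m in range(num_machines)]    # (current load, machine id)
    heapq.heapify(loads)
    for size in jobs:
        load, machine = heapq.heappop(loads)           # least-loaded machine
        heapq.heappush(loads, (load + size, machine))  # irrevocable assignment
    return max(load for load, _ in loads)              # makespan

# Jobs revealed one at a time; the offline optimum for this input is 8.
print(greedy_online_schedule([2, 3, 7, 1, 4, 6], num_machines=3))   # -> 9.0
```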
Streaming computation: huge scale, tiny memory
Streaming algorithms often do not need to “decide” in the same irreversible sense, but they must process data quickly using memory far smaller than the input. The constraints are practical: a stream may be too large to store, too fast to revisit, or both.
Typical streaming tasks include:
- Frequency estimation: which items are most common?
- Distinct counting: how many unique elements appeared?
- Quantiles: what is the median or 95th percentile?
- Similarity and set operations: approximate Jaccard similarity, detect near-duplicates.
- Graph streams: approximate properties of a graph as edges arrive.
Here, approximation is a feature, not a flaw. The goal is to compute a useful summary with guarantees.
Competitive analysis: measuring online performance
The standard tool for evaluating online algorithms is competitive analysis. Informally, an online algorithm is compared to the best possible offline algorithm on the same input sequence. The competitive ratio is the worst-case ratio (or difference, depending on the objective) between the online cost and the offline optimum.
If an algorithm has competitive ratio c, then for every input sequence its cost is at most c times the offline optimal cost (up to lower-order additive constants, depending on the model). This worst-case guarantee is meaningful when inputs can be adversarial or when you need reliability independent of distributional assumptions.
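A classic worked example is the ski-rental problem: renting costs 1 per day, buying costs a fixed price, and you do not know how many days you will ski. The break-even rule sketched below, with an illustrative price of 10, rents until the rent paid would match the purchase price and then buys; its cost never exceeds twice the offline optimum, so it is 2-competitive.

```python
def ski_rental_cost(days_skied, buy_price):
    """Break-even rule: rent (at 1 per day) until the rent paid would reach
    the purchase price, then buy. Cost never exceeds 2 * min(days, buy_price)."""
    cost = 0
    for day in range(1, days_skied + 1):
        if day < buy_price:
            cost += 1             # keep renting while total rent < buy_price
        else:
            cost += buy_price     # break-even point reached: buy and stop paying
            break
    return cost

for days in (3, 10, 100):
    online, opt = ski_rental_cost(days, buy_price=10), min(days, 10)
    print(days, online, opt, online / opt)   # the ratio never exceeds 2
```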
Example: caching and the value of randomness
Caching illustrates both the power and the limits of determinism. If you always evict the “least recently used” item, you can perform well in practice, but competitive analysis asks for worst-case protection. Deterministic strategies can be forced into poor behavior by a carefully chosen request sequence.
Randomized online algorithms often achieve better competitive ratios because they prevent an adversary from predicting exact choices. The key idea is not that randomness “improves” the future, but that it limits how precisely worst-case sequences can exploit your policy.
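As a concrete point of comparison, here is a sketch of the randomized marking algorithm for paging, the standard example of this effect: any deterministic eviction policy (including LRU) can be forced to a competitive ratio of k on a cache of size k, while marking achieves O(log k) against an oblivious adversary. The request pattern in the usage line is illustrative.

```python
import random

def randomized_marking(requests, cache_size):
    """Randomized marking algorithm for paging.

    Pages are marked when requested. On a miss with a full cache, evict a
    uniformly random unmarked page; when every cached page is marked, clear
    all marks and begin a new phase. Returns the number of misses.
    """
    cache, marked, misses = set(), set(), 0
    for page in requests:
        if page not in cache:
            misses += 1
            if len(cache) >= cache_size:
                if not cache - marked:       # every page is marked: new phase
                    marked.clear()
                victim = random.choice(tuple(cache - marked))
                cache.discard(victim)
                marked.discard(victim)
            cache.add(page)
        marked.add(page)
    return misses

# A cyclic pattern of k+1 pages is the classic bad case for deterministic LRU.
print(randomized_marking([i % 5 for i in range(1000)], cache_size=4))
```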
Competitive analysis also shapes design principles:
- Favor local decisions with provable bounds over heuristics that only work on friendly inputs.
- Use potential functions to track progress and amortize costs over time.
- Consider resource augmentation (slightly faster machines or larger buffers) as a realistic way to model how systems gain headroom.
Online learning: adapting from feedback
Online learning addresses a different, but related, sequential setting: at each round, the algorithm chooses an action (such as a prediction), then receives feedback (a loss), and updates its strategy. The goal is to minimize regret, the gap between the learner’s cumulative loss and that of the best fixed comparator in hindsight.
A typical framing is:
- Choose a hypothesis or action.
- Observe the outcome or loss.
- Update the model.
Regret guarantees are attractive because they do not require the data to be i.i.d. They can hold even when the sequence is adversarial, as long as the feedback follows the rules of the model.
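The canonical algorithm behind this loop is multiplicative weights (Hedge) over a fixed set of experts: play experts in proportion to their weights, then exponentially shrink each expert's weight according to the loss it just incurred. A minimal sketch, assuming losses in [0, 1] and an illustrative fixed learning rate:

```python
import math

def hedge(loss_rounds, eta=0.1):
    """Multiplicative-weights (Hedge) update over a fixed set of experts.

    loss_rounds: iterable of per-round loss lists, one loss in [0, 1] per
    expert. Returns the learner's expected cumulative loss; regret against
    the best single expert grows like sqrt(T * log(num_experts)).
    """
    weights, total = None, 0.0
    for losses in loss_rounds:
        if weights is None:
            weights = [1.0] * len(losses)
        z = sum(weights)
        probs = [w / z for w in weights]                   # play experts by weight
        total += sum(p * l for p, l in zip(probs, losses))
        weights = [w * math.exp(-eta * l)                  # shrink by observed loss
                   for w, l in zip(weights, losses)]
    return total

# Two experts: one always loses 0.1, the other always loses 0.9.
# The learner's total is much closer to the best expert's 10 than to the
# uniform-average 50, because weight shifts to the better expert over time.
print(hedge([[0.1, 0.9]] * 100))
```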
Why regret matters in real systems
In recommendation, ad selection, or dynamic pricing, the environment is nonstationary and feedback arrives continuously. A regret bound gives a form of stability: even if today looks different from yesterday, the system’s long-run performance will not lag too far behind the best single policy you could have chosen after seeing the entire history.
Online learning also connects to classical online algorithms. Many online decision problems can be treated as choosing among “experts” (policies) and updating weights based on observed costs. This creates bridges between prediction, control, and resource allocation.
Sketching: small summaries with strong guarantees
Sketching is the art of compressing a data stream into a compact structure that supports approximate queries. A sketch is usually:
- Small: memory sublinear in the stream length.
- Fast: constant or logarithmic time per update.
- Mergeable: two sketches can be combined, useful in distributed systems.
- Provable: approximation error bounds with high probability.
Sketches are central to telemetry and analytics pipelines because they allow near real-time monitoring without storing raw events.
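The Count-Min sketch is a canonical structure with all four properties: a small grid of counters, constant-time updates, merging by elementwise addition, and a one-sided error bound that holds with high probability. A minimal sketch with illustrative width and depth:

```python
import hashlib

class CountMinSketch:
    """Count-Min sketch: a small, fast, mergeable frequency estimator.

    Estimates never undercount; with width w and depth d, the overcount is
    at most (e / w) * total_count with probability about 1 - exp(-d).
    """

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.rows = [[0] * width for _ in range(depth)]

    def _col(self, item, row):
        digest = hashlib.blake2b(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._col(item, r)] += count

    def estimate(self, item):
        return min(self.rows[r][self._col(item, r)] for r in range(self.depth))

    def merge(self, other):
        """Combine two sketches of identical shape by elementwise addition."""
        for r in range(self.depth):
            for c in range(self.width):
                self.rows[r][c] += other.rows[r][c]

cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.estimate("a"))   # >= 3, and exactly 3 unless hash collisions occur
```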
Common sketching goals
- Heavy hitters: find elements with high frequency relative to the stream.
- Frequency moments: estimate quantities like the second moment F_2 = sum_i f_i^2 (useful for measuring skew), where f_i is the count of item i.
- Distinct elements: estimate the number of unique items.
- Norms and projections: approximate vector norms or similarities after dimensionality reduction.
The critical mindset is to replace exactness with well-calibrated approximation. For example, exact distinct counting requires memory proportional to the number of distinct items, but approximate methods can use far less while providing a bounded relative error.
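A HyperLogLog-flavoured estimator illustrates the point: each register remembers only the largest "rank" of the hashes routed to it, so memory stays at a few hundred small integers no matter how many items arrive. The hash choice and register count are illustrative, and HyperLogLog's small- and large-range corrections are omitted for brevity.

```python
import hashlib

def _rank(h):
    """1-indexed position of the lowest set bit of h (0 if h == 0)."""
    return (h & -h).bit_length()

def approx_distinct(stream, num_registers=256):
    """HyperLogLog-flavoured distinct-count estimate in constant memory.

    Each register stores only the largest rank seen among the hashes routed
    to it. Range corrections are omitted, so treat this as a rough sketch
    rather than a production implementation.
    """
    registers = [0] * num_registers
    for item in stream:
        h = int(hashlib.blake2b(str(item).encode()).hexdigest(), 16)
        bucket, rest = h % num_registers, h // num_registers
        registers[bucket] = max(registers[bucket], _rank(rest))
    alpha = 0.7213 / (1 + 1.079 / num_registers)     # HLL bias constant (m >= 128)
    harmonic = sum(2.0 ** -r for r in registers)
    return int(alpha * num_registers ** 2 / harmonic)

# One million events with only 10,000 distinct values, in constant memory.
print(approx_distinct(i % 10_000 for i in range(1_000_000)))
```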
Streaming algorithms: working under one-pass constraints
Streaming algorithms typically assume one pass over the data and strict memory limits. That forces careful choices about what information is worth preserving.
Approximating statistics and quantiles
A basic need in operations is tracking percentiles: latency p95, p99, and so on. Exact quantiles require storing large portions of data; streaming quantile algorithms maintain summaries that approximate the rank of values. The win is operational: you can monitor tails continuously without retaining every measurement.
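Production systems typically use dedicated quantile summaries (GK, KLL, t-digest). As a simpler illustration of the same trade-off, the sketch below keeps a fixed-size uniform sample via reservoir sampling and answers percentile queries from the sample; the sample size and synthetic latency distribution are illustrative.

```python
import random

def reservoir_quantile_sketch(stream, sample_size=1000):
    """Keep a fixed-size uniform sample (reservoir sampling) and answer
    quantile queries from it. Memory is constant regardless of stream length."""
    reservoir = []
    for n, value in enumerate(stream, start=1):
        if len(reservoir) < sample_size:
            reservoir.append(value)
        else:
            j = random.randrange(n)        # uniform index in [0, n)
            if j < sample_size:
                reservoir[j] = value       # replace a random retained item
    def quantile(q):
        ordered = sorted(reservoir)
        return ordered[min(int(q * len(ordered)), len(ordered) - 1)]
    return quantile

# Track tail latency without retaining every measurement (synthetic latencies).
q = reservoir_quantile_sketch(random.expovariate(1 / 20) for _ in range(1_000_000))
print(round(q(0.95), 1))   # roughly 60 for an exponential with mean 20
```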
Graph streams and dynamic networks
When edges arrive over time, you might want to estimate properties such as connectivity patterns, triangle counts, or approximate matchings. The difficulty is that graphs are combinatorial objects; a naive representation can blow up quickly. Streaming approaches often rely on sampling, sparsification, or sketches tailored to linear algebraic representations.
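A common compromise is the semi-streaming model, which allows memory proportional to the number of vertices but not the number of edges. The sketch below tracks connected components with union-find as edges arrive, discarding each edge after processing it; it assumes an insertion-only stream, since handling deletions requires linear-sketch techniques.

```python
class StreamingConnectivity:
    """Connected components over an insertion-only edge stream (semi-streaming).

    Memory is proportional to the number of vertices, not edges; each edge
    is processed once with union-find and then discarded.
    """

    def __init__(self):
        self.parent = {}

    def _find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]   # path halving
            v = self.parent[v]
        return v

    def add_edge(self, u, v):
        ru, rv = self._find(u), self._find(v)
        if ru != rv:
            self.parent[ru] = rv            # merge the two components

    def num_components(self):
        return len({self._find(v) for v in self.parent})

cc = StreamingConnectivity()
for edge in [(1, 2), (2, 3), (4, 5)]:
    cc.add_edge(*edge)
print(cc.num_components())   # -> 2
```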
Sliding windows and time decay
Many real streams are not “from the beginning of time.” You care about the last 5 minutes, last hour, or last day. Sliding window models modify the problem: old elements expire. Algorithms must support deletions or time decay, which complicates sketches and can change achievable guarantees.
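As a small illustration of the windowed model, the counter below groups events into coarse time buckets and drops whole buckets once they fall outside the window; finer accuracy costs more memory, which is exactly the trade that structures such as exponential histograms formalize. The bucket width, window length, and nondecreasing timestamps are assumptions.

```python
from collections import deque

class SlidingWindowCounter:
    """Count events in a trailing time window using coarse buckets.

    Memory is about window_seconds / bucket_seconds counters, independent of
    the event rate; the answer can be off by the contents of the oldest,
    partially expired bucket. Timestamps are assumed to be nondecreasing.
    """

    def __init__(self, window_seconds=300, bucket_seconds=10):
        self.window, self.bucket = window_seconds, bucket_seconds
        self.buckets = deque()              # (bucket start time, count), oldest first

    def _expire(self, now):
        while self.buckets and self.buckets[0][0] <= now - self.window:
            self.buckets.popleft()          # drop buckets that left the window

    def add(self, timestamp):
        self._expire(timestamp)
        start = timestamp - (timestamp % self.bucket)
        if self.buckets and self.buckets[-1][0] == start:
            self.buckets[-1] = (start, self.buckets[-1][1] + 1)
        else:
            self.buckets.append((start, 1))

    def count(self, now):
        self._expire(now)
        return sum(c for _, c in self.buckets)

# Events every 3 seconds for 10 minutes; only the last 5 minutes are counted.
counter = SlidingWindowCounter()
for t in range(0, 600, 3):
    counter.add(t)
print(counter.count(600))   # about 100, give or take one bucket's worth
```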
Practical guidance: choosing the right technique
Online and streaming methods are not interchangeable; they address different constraints.
When competitive analysis is the right lens
Use competitive analysis when decisions are irreversible and costs are real and immediate: eviction policies, scheduling, routing, and resource allocation. If failures are expensive and inputs may be adversarial (or simply unpredictable), worst-case guarantees provide insurance.
When online learning is the right tool
Use online learning when you can adjust behavior based on feedback and you care about long-run performance: personalization, ranking, adaptive thresholds, and control policies. Regret-style analysis is especially helpful when the data distribution shifts.
When sketches and streaming algorithms shine
Use sketches when the bottleneck is memory and throughput: real-time analytics, large-scale logging, and distributed monitoring. If exact answers are not required, sketches can reduce storage and network overhead dramatically while keeping errors quantifiable.
Closing perspective
Advanced algorithms for online and streaming settings reflect a broader shift in computing: from batch, static datasets to continuous, high-velocity data and immediate decisions. Competitive analysis provides a rigorous way to design algorithms that hold up under uncertainty. Online learning turns sequential feedback into performance guarantees over time. Sketching and streaming algorithms deliver compact, mergeable summaries that make large-scale measurement feasible.
The unifying principle is disciplined approximation, whether that approximation is to an offline optimum, a best-in-hindsight benchmark, or an exact statistic. In a world where data does not stop arriving, these techniques are less a specialty and more a foundation.