Feb 28

System Design Estimation Techniques

Mindli Team

AI-Generated Content


Back-of-envelope estimation is the unsung hero of scalable system design. It’s the skill that separates a plausible architecture from a fantasy, allowing you to quickly gauge feasibility, identify bottlenecks, and make informed trade-offs before a single line of code is written. In system design interviews, your ability to perform these rapid calculations demonstrates engineering intuition and shows you think critically about real-world constraints like cost, performance, and growth. Mastering this art turns abstract requirements into concrete numbers you can reason about.

The Foundation: Orders of Magnitude and Powers of Two

Before diving into specific calculations, you must internalize the language of scale. Back-of-envelope estimation is a method for obtaining approximate calculations by rounding inputs to the nearest power of ten and using simplified assumptions. The goal is not pinpoint accuracy but arriving at the correct order of magnitude—whether something is in the thousands, millions, or billions. This rapid feasibility check prevents you from proposing a solution that is technically impossible or economically absurd.

This process relies heavily on familiarity with powers of two, as digital systems are built upon binary logic. Knowing that 2^10 (1,024) is approximately 10^3 (1,000) is a fundamental shortcut. From this, you can derive: 2^20 ≈ one million, 2^30 ≈ one billion, and 2^40 ≈ one trillion. When estimating storage, these conversions are essential. For instance, if a user record requires 1 KB (10^3 bytes), then one million records need about 1 GB (10^9 bytes) of storage, not 1,000,000 KB, which is a more cumbersome figure to manipulate mentally.
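The 2^10 ≈ 10^3 shortcut above can be sketched in a few lines. This is a minimal illustration using decimal approximations for KB/GB/TB; the 1 KB record size and one-million-record count come from the example in the text.

```python
# Rough storage conversions using the 2^10 ≈ 10^3 shortcut.
KB, MB, GB = 10**3, 10**6, 10**9  # decimal approximations

record_size = 1 * KB   # 1 KB per user record
records = 1_000_000    # one million records

total_bytes = record_size * records
print(total_bytes / GB)  # → 1.0 (about 1 GB, much easier to reason about than 1,000,000 KB)
```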

From Users to Load: Estimating Throughput

The most common starting point is a total user base. Your first task is to translate that into actionable load metrics, primarily Queries Per Second (QPS). You begin by defining your Daily Active Users (DAU). This might be a percentage of the total user base (e.g., 10% of 100 million users = 10 million DAU). Not all DAU generate equal load, so you often identify a core action—like posting a photo or reading a feed—and estimate how many times a user performs that action per day.

The formula to reach peak QPS is straightforward: average QPS = (DAU × actions per user per day) / 86,400 seconds. For example, with 10 million DAU performing 5 core actions daily: 50,000,000 / 86,400 ≈ 580 QPS on average. Traffic isn't uniform; it peaks during certain hours. Applying a peak factor of, say, 2x or 3x adjusts for this. Thus, your system might need to handle ~1,500 QPS at peak. This number immediately informs your needs for web servers, cache clusters, and database read capacity.
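As a quick sketch of this calculation, here is the DAU-to-peak-QPS arithmetic with the section's figures (10 million DAU, 5 actions per day); the 2.5x peak factor is an assumed midpoint of the 2x-3x range mentioned.

```python
SECONDS_PER_DAY = 86_400  # 24 * 60 * 60

dau = 10_000_000          # daily active users
actions_per_user = 5      # core actions per user per day
peak_factor = 2.5         # assumed peak-to-average traffic ratio (2x-3x is typical)

avg_qps = dau * actions_per_user / SECONDS_PER_DAY
peak_qps = avg_qps * peak_factor
print(round(avg_qps), round(peak_qps))  # → 579 1447 (≈580 average, ≈1,500 peak)
```

Note that the answer is quoted as "~1,500 QPS", not 1,447: the peak factor is a guess, so extra precision would be false precision.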

Sizing Storage and Bandwidth

With throughput estimated, you must plan for data persistence and movement. Storage growth estimation requires knowing what data is created per action and its retention policy. Assume each of our 10 million DAU uploads a 2 MB photo daily. The daily storage growth is: 10,000,000 × 2 MB = 20 TB per day. If you keep data for 10 years, you'd plan for approximately 20 TB × 365 × 10 ≈ 73 PB of total storage, not accounting for compression or replication. This highlights why data lifecycle policies (archival, deletion) are critical.
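The storage math above can be sketched as follows, using the section's assumptions (10 million daily uploaders, 2 MB per photo, 10-year retention) and decimal units:

```python
TB, PB = 10**12, 10**15

dau = 10_000_000
photo_size = 2 * 10**6        # 2 MB per upload, in bytes

daily_growth = dau * photo_size           # bytes added per day
ten_year_total = daily_growth * 365 * 10  # ignoring compression and replication

print(daily_growth / TB)      # → 20.0 (TB per day)
print(ten_year_total / PB)    # → 73.0 (PB over 10 years)
```

Multiplying the 73 PB by a replication factor of 3 (a common default, discussed under pitfalls below) would push the real footprint past 200 PB, which is why retention and archival policies matter.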

Bandwidth needs are calculated similarly. If your system serves 1,500 QPS at peak and each response is 100 KB (including images, HTML, JSON), your outgoing bandwidth requirement is: 1,500 × 100 KB = 150 MB/s, or roughly 1.2 Gbps. You must consider both ingress (data coming in, like photo uploads) and egress (data going out to users). Cloud costs are heavily influenced by egress bandwidth, making this a vital cost-driver estimation.
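A quick sketch of the egress calculation, using the peak figures from this section (1,500 QPS, 100 KB average response):

```python
peak_qps = 1_500
response_size = 100 * 10**3   # 100 KB average response, in bytes

egress_bytes_per_sec = peak_qps * response_size

print(egress_bytes_per_sec / 10**6)      # → 150.0 (MB/s)
print(egress_bytes_per_sec * 8 / 10**9)  # → 1.2 (Gbps; ×8 converts bytes to bits)
```

The bytes-to-bits conversion (×8) is an easy step to forget; network link capacity is quoted in bits per second, while payload sizes are in bytes.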

The Canonical Reference: Latency Numbers Every Programmer Should Know

No discussion on estimation is complete without referencing Jeff Dean's "Latency Numbers Every Programmer Should Know." While the exact numbers change with hardware, the relative orders of magnitude remain timelessly instructive. They provide the backbone for performance reasoning:

  • L1 cache reference: ~1 ns
  • Main memory reference: ~100 ns
  • Round-trip within same datacenter: ~500 µs
  • Read 1 MB sequentially from SSD: ~1 ms
  • Disk seek: ~10 ms
  • Read 1 MB sequentially from HDD: ~30 ms
  • Packet round-trip (continental): ~150 ms

These numbers teach a crucial lesson: a network call to another service is orders of magnitude slower than a memory access. If your design requires 10 sequential database disk seeks to serve a request, you can instantly estimate a baseline latency of at least 100 ms, setting an upper bound on your requests per second per thread. This framework helps you justify using caches (to avoid disk/network calls) and guides decisions on synchronous versus asynchronous operations.
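The 10-disk-seek example above can be turned into a small latency-budget sketch. The figures are the rounded table values; treating a request-handling thread as fully blocked during I/O is a simplifying assumption to get an upper bound.

```python
# Latency budget for one request, using rounded figures from the table above.
DISK_SEEK_MS = 10.0         # one random disk seek
DATACENTER_RTT_MS = 0.5     # one round-trip to the database host

seeks = 10                  # sequential seeks needed to serve the request
baseline_ms = seeks * DISK_SEEK_MS + DATACENTER_RTT_MS

# If a thread blocks for the full request, this bounds its throughput:
max_rps_per_thread = 1000 / baseline_ms

print(baseline_ms, round(max_rps_per_thread))  # → 100.5 10 (≈100 ms, ~10 req/s per thread)
```

The same arithmetic justifies caching: replacing the disk seeks with a ~100 ns memory reference collapses the budget to roughly the network round-trip alone.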

Common Pitfalls

  1. Misapplying Average vs. Peak Load: Designing a system only for average QPS is a classic mistake. Traffic spikes—during product launches, marketing events, or peak hours—can be 5-10x the daily average. Your estimation must include a reasonable peak factor and discuss how the architecture handles it (e.g., with auto-scaling or load shedding).
  2. Ignoring Data Replication and Overhead: Estimating raw storage for user data but forgetting the 3x overhead for replication (e.g., storing three copies in a distributed file system like HDFS) or backup strategies will severely undercount actual storage needs. Always clarify your replication factor in your assumptions.
  3. Overly Precise Calculations: The point is not to calculate that you need 1,234 servers. Getting bogged down in exact arithmetic wastes time and misses the forest for the trees. Round numbers to powers of ten, use approximations (like 1 million seconds ≈ 12 days), and focus on the magnitude. Saying "a few thousand QPS" or "tens of terabytes per day" is often sufficient.
  4. Forgetting Non-Functional Requirements: Estimation isn't just about load and storage. You must factor in availability (how many servers are needed to survive a failure?), durability (how is data backed up?), and cost. A design that meets all technical requirements but costs $10 million per month is not a viable solution.

Summary

  • Back-of-envelope estimation is a foundational skill for assessing system feasibility by focusing on orders of magnitude rather than precise figures.
  • Start with user counts, derive Daily Active Users (DAU), and calculate peak Queries Per Second (QPS) using a peak factor to account for traffic spikes.
  • Estimate storage growth by understanding data size per action and retention policies, and calculate bandwidth needs for both data ingress and egress.
  • Internalize the relative latency numbers for different hardware and network operations to make informed design trade-offs and justify caching strategies.
  • Avoid pitfalls by designing for peak load, accounting for replication overhead, staying high-level in calculations, and incorporating non-functional requirements like cost and availability into your estimates.
