ML System Design Patterns
Building a machine learning model is only the first step; deploying it reliably, maintaining its performance, and scaling it to serve millions of users is where the real engineering challenge begins. The core architectural design patterns—reusable templates for solving common system design problems—transform a promising algorithm into a robust, scalable, and valuable ML-powered service. Mastering these patterns is essential for navigating the trade-offs between latency, throughput, accuracy, and reliability in production environments.
Serving Patterns: Batch vs. Real-Time
The choice between batch and real-time serving defines your system's capabilities and constraints. Batch serving involves making predictions on a large, static dataset at scheduled intervals (e.g., hourly, daily). The results are then stored and served from a database or cache. This pattern is ideal for use cases where predictions don't need to be instantly available, such as generating daily product recommendation lists, calculating customer churn scores, or creating email marketing segments. Its key advantage is high computational efficiency, as jobs can be optimized for throughput on large datasets using frameworks like Apache Spark.
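The batch pattern above can be sketched as a scheduled job that scores every entity in one pass and writes the results to a lookup store. This is a minimal illustration; the churn model and the dictionary-backed store are simplified stand-ins:

```python
from datetime import date

def score(user):
    # Stand-in model: churn risk grows with days since last purchase.
    return min(1.0, user["days_since_purchase"] / 90)

def run_batch_job(users, prediction_store):
    # Score every user in one pass and persist results for later lookup.
    run_date = date.today().isoformat()
    for user in users:
        prediction_store[user["id"]] = {"score": score(user), "as_of": run_date}
    return prediction_store

store = {}
users = [{"id": "u1", "days_since_purchase": 45},
         {"id": "u2", "days_since_purchase": 120}]
run_batch_job(users, store)
print(store["u1"]["score"])  # 0.5
```

At serving time the application only reads from `store`; no model code runs on the request path.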
In contrast, real-time serving (or online inference) requires generating a prediction immediately in response to a user request, typically within tens or hundreds of milliseconds. Examples include fraud detection during a credit card transaction, auto-complete suggestions in a search bar, or real-time object detection in a video stream. This demands a low-latency serving architecture, often involving a dedicated model server (like TensorFlow Serving, TorchServe, or a custom REST/gRPC API) that hosts the trained model. The primary trade-off is resource utilization; keeping a model loaded in memory to serve sporadic requests can be inefficient compared to batched processing. A hybrid approach, sometimes called "synchronous on-demand," uses a real-time API to trigger small, immediate batch jobs when user tolerance allows for slightly higher latency.
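A real-time path, by contrast, handles one request at a time against a model kept in memory. This minimal sketch uses a hypothetical fraud-check handler with an illustrative latency budget; the feature logic and threshold are invented for the example:

```python
import time

MODEL = {"threshold": 0.8}  # loaded once at startup, kept in memory

def extract_features(txn):
    # Hypothetical feature: amount relative to the user's typical spend.
    return txn["amount"] / max(txn["avg_amount"], 1.0)

def handle_request(txn, budget_ms=100):
    # Synchronous path: one transaction in, one decision out.
    start = time.perf_counter()
    ratio = extract_features(txn)
    decision = "flag" if ratio > MODEL["threshold"] * 5 else "approve"
    latency_ms = (time.perf_counter() - start) * 1000
    return {"decision": decision, "latency_ms": latency_ms,
            "within_budget": latency_ms <= budget_ms}

print(handle_request({"amount": 5000, "avg_amount": 100}))
```

In a real system this function would sit behind a REST or gRPC endpoint; the point here is that the model stays resident and every request pays the full feature-plus-inference cost.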
Feature Pipelines: The Foundation of Consistency
A feature pipeline is the engineered process that transforms raw data into the features (model inputs) used for both training and inference. A robust feature pipeline is arguably more critical than the model itself, as most production failures stem from data issues. The core design pattern here is the feature store, a centralized repository that manages the definition, storage, and serving of features.
The pipeline operates in two main modes: offline and online. The offline pipeline computes complex, aggregating features (e.g., "user's average purchase value over the last 30 days") on large historical datasets for model training. The online pipeline serves the latest feature values for a specific entity (e.g., a user ID) with minimal latency during real-time inference. The cardinal rule is to ensure feature consistency: the feature generation logic used during training must be identical to the logic used during inference. A feature store enforces this by making the same feature calculation code available to both the training workflow and the online serving application, preventing a common failure mode known as training-serving skew.
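The consistency rule can be made concrete by routing both paths through a single shared feature function, as this sketch does (the feature definition and data shapes are hypothetical):

```python
def purchase_features(purchases, window=30):
    # Single source of truth for the feature definition, used by both paths.
    recent = [p["value"] for p in purchases if p["days_ago"] <= window]
    return {"avg_purchase_30d": sum(recent) / len(recent) if recent else 0.0}

def offline_training_rows(history_by_user):
    # Offline path: compute features over historical data for training.
    return {uid: purchase_features(ps) for uid, ps in history_by_user.items()}

def online_lookup(uid, history_by_user):
    # Online path: same logic, invoked per entity at inference time.
    return purchase_features(history_by_user[uid])

history = {"u1": [{"value": 10.0, "days_ago": 5},
                  {"value": 30.0, "days_ago": 40}]}
# Both paths must agree feature-for-feature, by construction.
assert offline_training_rows(history)["u1"] == online_lookup("u1", history)
```

A feature store generalizes this idea: the definition lives in one place, and both the training workflow and the serving application consume it rather than re-implementing it.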
Model Ensembles and Composition
An ensemble is a pattern that combines predictions from multiple models to produce a final, often more accurate and robust prediction. The simplest form is a voting classifier for categorical tasks or an averaging regressor for numerical ones. More sophisticated techniques include stacking, where a meta-model learns to combine the base models' predictions, and boosting, where models are trained sequentially to correct the errors of their predecessors.
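A minimal sketch of the two simplest ensemble forms, using trivial stand-in base models:

```python
def majority_vote(predictions):
    # Voting classifier: the most common label across base models wins.
    return max(set(predictions), key=predictions.count)

def average(predictions):
    # Averaging regressor: the mean of the base models' outputs.
    return sum(predictions) / len(predictions)

# Stand-in base models (real ones would be trained estimators).
classifiers = [lambda x: "spam", lambda x: "ham", lambda x: "spam"]
regressors = [lambda x: 1.0, lambda x: 3.0]

label = majority_vote([m("msg") for m in classifiers])
value = average([m(42) for m in regressors])
print(label, value)  # spam 2.0
```

Stacking replaces `majority_vote`/`average` with a learned meta-model over the same base predictions.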
Beyond pure ensembles, system design often involves model composition, where the output of one model becomes the input to another in a pipeline. For instance, a text processing service might chain a language detection model, a translation model, and then a sentiment analysis model. This pattern increases complexity but enables sophisticated capabilities. The key system design consideration is managing the latency and error propagation. If the models are independent, they can be run in parallel; if sequential, the end-to-end latency is the sum of each component's latency, and a failure in any component breaks the entire chain.
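The sequential composition described above might look like the following sketch, where each stage is a trivial stand-in for a real model:

```python
def detect_language(text):
    # Stand-in language detector.
    return "fr" if "bonjour" in text.lower() else "en"

def translate(text, lang):
    # Stand-in translator; passes English through unchanged.
    return "hello world" if lang == "fr" else text

def sentiment(text):
    # Stand-in sentiment model.
    return "positive" if "hello" in text else "neutral"

def pipeline(text):
    # Sequential composition: each stage consumes the previous stage's
    # output, so end-to-end latency is the sum of the stages and a
    # failure in any stage aborts the whole chain.
    lang = detect_language(text)
    translated = translate(text, lang)
    return {"lang": lang, "sentiment": sentiment(translated)}

print(pipeline("Bonjour tout le monde"))
```

Because each call depends on the previous output, none of these stages can be parallelized; independent models (say, sentiment and toxicity on the same text) could be fanned out concurrently instead.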
Deployment and Evaluation Patterns
Deploying a new model version is risky. Two patterns mitigate this risk: A/B testing infrastructure and shadow deployments.
A/B testing (or champion/challenger) for ML involves routing a small portion of live traffic to the new model (the challenger) while the majority continues to use the current model (the champion). The system then compares business metrics (e.g., click-through rate, conversion rate) between the two groups to determine whether the challenger delivers a statistically significant improvement. This requires careful infrastructure for traffic splitting, metric collection, and statistical analysis.
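A common way to implement the traffic split is a deterministic hash of the user ID, so the same user always sees the same variant across requests. A minimal sketch (the 10% challenger fraction is illustrative):

```python
import hashlib

def assign_variant(user_id, challenger_fraction=0.1):
    # Deterministic hash-based split: the same user_id always maps to
    # the same bucket, and buckets are uniformly distributed.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 1000
    return "challenger" if bucket < challenger_fraction * 1000 else "champion"

assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
share = assignments.count("challenger") / len(assignments)
print(round(share, 2))  # close to 0.10
```

Hash-based assignment avoids storing per-user state and keeps the experiment stable even if the routing service restarts.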
A shadow deployment is a safer, more exploratory pattern. Here, the new model is deployed alongside the production model but does not affect any user-facing decisions. It receives the same live input data, makes predictions, and logs its outputs. These predictions are compared offline to the production model's outputs and the eventual real-world outcomes. This allows you to evaluate the new model's performance on real data, check for computational or stability issues, and calibrate its outputs without any user impact. It's an essential final check before an A/B test.
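A shadow deployment can be sketched as a serving function that returns only the champion's decision while logging both models' outputs for offline comparison (both models here are stand-ins):

```python
def champion(features):
    # Current production model (stand-in).
    return 1 if features["amount"] > 1000 else 0

def shadow(features):
    # Candidate model: same interface, but its output is never served.
    return 1 if features["amount"] > 800 else 0

shadow_log = []

def serve(features):
    # Only the champion's output reaches the user; the shadow's output
    # is recorded for later comparison against real-world outcomes.
    live = champion(features)
    shadow_log.append({"champion": live, "shadow": shadow(features)})
    return live

decisions = [serve({"amount": a}) for a in (500, 900, 1500)]
agreement = sum(e["champion"] == e["shadow"] for e in shadow_log) / len(shadow_log)
print(decisions, agreement)
```

Analyzing `shadow_log` offline reveals where the two models disagree, which cases to investigate, and whether the candidate is stable under production traffic, all with zero user impact.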
Navigating Trade-offs: Latency, Throughput, Accuracy, and Cost
Every architectural decision involves balancing competing system qualities. Your design patterns must be chosen with these trade-offs in mind.
- Latency vs. Throughput vs. Accuracy: A real-time ensemble of five complex models may yield high accuracy but will have high latency and low throughput (predictions per second). To reduce latency, you might switch to a single, simpler model (sacrificing some accuracy) or implement aggressive caching of frequent predictions. For batch systems, throughput is paramount, and you can afford to use more computationally expensive models that boost accuracy.
- Cost vs. Everything: More complex patterns (large ensembles, real-time feature computation from multiple sources) require more computational resources, directly increasing cost. The design process must justify these costs through measurable gains in accuracy, user experience, or revenue. Techniques like model distillation (training a small "student" model to mimic a large "teacher" ensemble) are design patterns specifically aimed at reducing cost and latency while preserving accuracy.
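One of the latency levers mentioned above, caching frequent predictions, can be sketched with a standard memoization decorator (the model call is a stand-in):

```python
from functools import lru_cache

CALLS = {"count": 0}

@lru_cache(maxsize=1024)
def predict(query):
    # Stand-in for an expensive model call; results are cached by
    # exact input, so repeated queries skip the model entirely.
    CALLS["count"] += 1
    return len(query) % 2  # dummy prediction

for q in ["shoes", "shoes", "laptop", "shoes"]:
    predict(q)
print(CALLS["count"])  # 2 — repeated queries hit the cache
```

Caching only helps when inputs repeat and the model's answer for a given input is stable; time-sensitive features call for a cache with expiry rather than `lru_cache`.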
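Model distillation can be illustrated in miniature: a one-parameter student is fit by least squares to the soft labels of a stand-in "teacher" ensemble (the models and data are invented for the example):

```python
def teacher(x):
    # Expensive "ensemble": average of several stand-in base models.
    return (2 * x + (2 * x + 0.2) + (2 * x - 0.2)) / 3  # ≈ 2x

# Distillation: fit a one-parameter student y = w * x to the teacher's
# outputs (soft labels) by closed-form least squares.
xs = [1.0, 2.0, 3.0, 4.0]
soft_labels = [teacher(x) for x in xs]
w = sum(x * y for x, y in zip(xs, soft_labels)) / sum(x * x for x in xs)
print(round(w, 3))  # ≈ 2.0 — the cheap student mimics the ensemble
```

At serving time only the student (a single multiply here) runs, trading a small accuracy loss for a large cut in latency and cost.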
Common Pitfalls
- Ignoring Training-Serving Skew: Using different data preprocessing or feature calculation code during training and inference is a top cause of model performance decay. Correction: Implement a feature store or a shared, versioned code library for feature transformation to guarantee consistency across environments.
- Over-Engineering for Real-Time: Not every application needs sub-100ms latency. Correction: Evaluate the user's actual tolerance for delay. If a recommendation can be generated asynchronously and appear when the user refreshes the page, a batched or near-real-time pattern is simpler, cheaper, and more robust.
- Treating the Model as a Monolith: Packaging the model, its massive dependencies, and feature logic into one immutable artifact creates a deployment nightmare. Correction: Decouple the model artifact (e.g., a .pb or .onnx file) from the serving code. The serving application should be a separate service that loads the model artifact and calls the feature store API.
- Neglecting Data Pipeline Reliability: An exquisite model served by a flawless API is useless if its feature pipeline breaks. Correction: Apply standard data engineering best practices: monitor upstream data sources, build idempotent and retryable pipeline jobs, and implement data quality checks (e.g., detecting sudden shifts in feature distributions).
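The data quality check mentioned in the last pitfall, detecting sudden shifts in feature distributions, can be sketched as a simple mean-shift alert (the threshold of three baseline standard deviations is an illustrative choice):

```python
def drift_alert(baseline, current, threshold=3.0):
    # Flag a feature whose mean has shifted more than `threshold`
    # baseline standard deviations — a crude but useful check.
    n = len(baseline)
    mean_b = sum(baseline) / n
    std_b = (sum((x - mean_b) ** 2 for x in baseline) / n) ** 0.5 or 1e-9
    mean_c = sum(current) / len(current)
    return abs(mean_c - mean_b) / std_b > threshold

baseline = [10.0, 11.0, 9.0, 10.5, 9.5]
print(drift_alert(baseline, [10.2, 9.8, 10.1]))   # False — stable
print(drift_alert(baseline, [50.0, 52.0, 48.0]))  # True — sudden shift
```

Production systems typically use richer statistics (quantiles, population stability index), but even this simple check catches the upstream breakages that silently degrade models.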
Summary
- The choice between batch and real-time serving is fundamental, dictated by your application's latency requirements and trade-offs with computational efficiency.
- A robust, consistent feature pipeline, often built around a feature store, is the bedrock of a reliable production ML system, preventing training-serving skew.
- Model ensembles and composition can boost accuracy and enable complex capabilities but add latency and system complexity that must be managed.
- Safe deployment requires patterns like A/B testing for statistical validation and shadow deployment for risk-free real-world evaluation.
- All design decisions involve balancing the core trade-offs between latency, throughput, accuracy, and cost; the optimal pattern is the one that best aligns this balance with your business objective.