Mar 2

Feature Store Architecture with Feast

Mindli Team

AI-Generated Content

Machine learning models are only as good as the data they are trained on, but managing that data—especially features—across training and production environments is a notorious source of failure. Inconsistent features between these two stages silently degrade model performance, leading to what’s known as training-serving skew. A feature store solves this by acting as a centralized platform to define, store, and serve features consistently. Feast (Feature Store) is an open-source framework that provides a scalable architecture to bridge the gap between data engineering and machine learning operations, or MLOps, ensuring that the features used to train your models are identical to those served during real-time inference.

Core Concepts: The Two-Store Architecture

At its heart, Feast’s power comes from its decoupled storage architecture, designed for different access patterns. This architecture consists of an offline store and an online store.

The offline store is your source of truth for historical feature data. It is typically backed by a data warehouse like BigQuery, Snowflake, or Amazon Redshift, or by a data lake queried through engines such as Apache Spark or Trino. Its primary role is to support the creation of massive, point-in-time-correct training datasets by joining features from various source tables. Because it prioritizes throughput over latency, queries may take seconds or minutes, which is perfectly acceptable for batch training jobs.

In contrast, the online store is a low-latency database (e.g., Redis, DynamoDB, or PostgreSQL) optimized for real-time serving. It holds the latest values of features for specific entities (like a user or product) and is designed to return feature vectors for hundreds of thousands of predictions per second with millisecond latency. You never train a model directly from the online store; instead, you populate it with features for serving.

The key workflow is materialization: periodically or continuously moving the latest feature values from the offline store to the online store. This dual-store design allows data scientists to query years of historical data for training while enabling engineers to serve the freshest features for online predictions.

Defining Features: Entities and Feature Views

You don’t just dump raw data into Feast. Instead, you define your features declaratively using Feast’s core abstractions: Entities and Feature Views.

An Entity is a domain object that your features describe. It is defined by a unique join key. Common entities include user_id, driver_id, product_id, or location_id. Entities provide the primary key for joining features across different Feature Views and for looking up features in the online store.

A Feature View is the central abstraction that binds a set of features to a specific data source and an entity. It defines the schema of your features (e.g., user_avg_transaction, driver_rating_7d_avg) and points to the table or stream in your offline store where the raw data lives. Crucially, a Feature View also contains a timestamp column, which is essential for ensuring point-in-time correctness. By grouping related features together, Feature Views allow for modular and reusable feature definitions.

Here is a simplified example definition for a user-centric Feature View:

from feast import BigQuerySource, Entity, FeatureView, Field
from feast.types import Float32, Int64

user = Entity(name="user", join_keys=["user_id"])

user_transaction_stats = BigQuerySource(
    table="project.dataset.user_transactions",
    timestamp_field="event_timestamp",
)

user_stats_fv = FeatureView(
    name="user_transaction_stats",
    entities=[user],
    ttl=None,  # no expiry; set a timedelta to bound online retention and point-in-time lookback
    schema=[
        Field(name="avg_transaction_30d", dtype=Float32),
        Field(name="transaction_count_7d", dtype=Int64),
    ],
    source=user_transaction_stats,
)

Generating Point-in-Time-Correct Training Data

One of the most critical and complex challenges in ML is avoiding data leakage. This occurs when you accidentally use information from the future to predict the past, making your model performance unrealistically optimistic. Feast prevents this by enabling point-in-time-correct dataset generation.

The process requires an entity dataframe. This is not your raw data, but a listing of the entities you want to train on, along with the precise timestamps for which you need feature values. For each row in this dataframe (e.g., user_id=123, event_timestamp=2023-10-05 14:30:00), Feast queries the offline store. It retrieves the latest feature values for that entity that were available at or before that specific timestamp. It effectively performs a temporal join, ensuring the training dataset reflects what would have been known in a real-time scenario at each moment in time.

You use the get_historical_features() method for this operation. Feast orchestrates the complex SQL joins against your offline store to build this leakage-free dataset, which you can then use to train your model.
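The semantics of that temporal join can be sketched in a few lines of plain Python (a toy illustration with hypothetical data, not Feast's implementation): for each entity and timestamp in the entity dataframe, pick the latest feature row stamped at or before that timestamp.

```python
from datetime import datetime

# Toy feature history for one entity (normally rows in the offline store).
feature_rows = [
    {"user_id": 123, "event_timestamp": datetime(2023, 10, 1), "avg_transaction_30d": 42.0},
    {"user_id": 123, "event_timestamp": datetime(2023, 10, 4), "avg_transaction_30d": 55.0},
    {"user_id": 123, "event_timestamp": datetime(2023, 10, 6), "avg_transaction_30d": 90.0},
]

def point_in_time_lookup(entity_id, as_of, rows):
    """Return the latest feature row at or before `as_of` -- never from the future."""
    eligible = [
        r for r in rows
        if r["user_id"] == entity_id and r["event_timestamp"] <= as_of
    ]
    return max(eligible, key=lambda r: r["event_timestamp"], default=None)

row = point_in_time_lookup(123, datetime(2023, 10, 5, 14, 30), feature_rows)
# The Oct 6 value (90.0) is ignored: it lies in the future relative to Oct 5.
print(row["avg_transaction_30d"])  # 55.0
```

In production, Feast compiles this logic into SQL against the offline store; the sketch only shows why the Oct 6 value must never reach a training row dated Oct 5.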

Orchestrating Feature Freshness: Materialization

For features to be available for real-time inference, they must reside in the low-latency online store. The process of populating and updating the online store is called materialization. This is a scheduled job (e.g., using Apache Airflow, Prefect, or Feast’s built-in CLI) that runs the materialize() or materialize_incremental() command.

A materialize() run takes two core parameters, a start date and an end date (materialize_incremental() needs only an end date). The job queries the offline store for all feature values that have changed within that time window for the specified Feature Views and writes them to the online store. For batch features, this might run every hour. For streaming features, you can configure a streaming materialization job to continuously update the online store from a stream like Kafka. The Time-To-Live (TTL) setting on a Feature View determines how long features persist in the online store before being automatically evicted, which is crucial for managing storage costs and data privacy.
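The core semantics of a materialization run can be modeled in plain Python (a toy sketch with invented data, not Feast's actual code): take every row that landed inside the window and upsert the latest value per entity into a key-value store.

```python
from datetime import datetime

# Toy offline-store rows for one feature.
offline_rows = [
    {"user_id": 1, "event_timestamp": datetime(2023, 10, 5, 9),  "transaction_count_7d": 3},
    {"user_id": 1, "event_timestamp": datetime(2023, 10, 5, 11), "transaction_count_7d": 4},
    {"user_id": 2, "event_timestamp": datetime(2023, 10, 5, 10), "transaction_count_7d": 7},
    {"user_id": 2, "event_timestamp": datetime(2023, 10, 6, 8),  "transaction_count_7d": 9},  # outside window
]

def materialize_window(rows, start, end, online_store):
    """Upsert the latest in-window value per entity into the online store."""
    in_window = [r for r in rows if start <= r["event_timestamp"] < end]
    # Apply in timestamp order so the newest value per entity wins.
    for r in sorted(in_window, key=lambda r: r["event_timestamp"]):
        online_store[r["user_id"]] = r["transaction_count_7d"]
    return online_store

online = materialize_window(
    offline_rows, datetime(2023, 10, 5), datetime(2023, 10, 6), {}
)
print(online)  # {1: 4, 2: 7}
```

Note that only the latest value per entity survives: the online store is a lookup table for serving, not a history.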

Ensuring Feature Consistency

The ultimate value of a feature store is the guarantee of feature consistency between training and serving. Feast enforces this through a single source of truth: the Feature View definition. For standard Feature Views, the same materialized values that get_historical_features() returns for training are the values the online store serves at inference; for on-demand Feature Views, the same Python transformation code runs in both the training and serving paths.

This architecture eliminates training-serving skew caused by:

  1. Different computation logic: The transformation is defined once in the Feature View.
  2. Different data sources: The offline store for training and the online store for serving are both populated from the same source or through synchronized materialization.
  3. Timing issues: Point-in-time-correct generation for training aligns with the materialization cadence for serving, ensuring the model is trained on data that mirrors the live environment.
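The single-definition principle can be illustrated with a sketch (hypothetical helper names, not a Feast API): when the transformation is one function, the batch path that feeds the offline store and the serving path apply byte-for-byte identical logic, so skew from divergent reimplementations cannot occur.

```python
def avg_transaction(amounts):
    """Single definition of the feature, shared by both paths."""
    return round(sum(amounts) / len(amounts), 2) if amounts else 0.0

# Offline/training path: computed over a historical batch.
training_feature = avg_transaction([10.0, 20.0, 30.0])

# Online/serving path: computed with the same logic at request time.
serving_feature = avg_transaction([10.0, 20.0, 30.0])

assert training_feature == serving_feature == 20.0
```

The anti-pattern this guards against is a data engineer reimplementing the "same" aggregation in SQL for training and a backend engineer reimplementing it in application code for serving, with the two drifting apart over time.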

Common Pitfalls

Pitfall 1: Ignoring the Timestamp in Your Source Data. Feast relies on the timestamp_field in your source to ensure point-in-time correctness. If this field represents the processing time instead of the actual event time, you will introduce subtle data leakage. Always use the true event timestamp from your source systems.

Correction: Audit your data pipelines to ensure the timestamp column mapped in your BigQuerySource or FileSource reflects when the event occurred in the real world, not when it arrived in your data warehouse.
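Mislabeled timestamps break point-in-time correctness in either direction. One concrete leakage case (toy numbers, hypothetical field names): an end-of-day aggregate stamped with the start of the day becomes visible to mid-day training examples, even though it summarizes transactions that had not happened yet.

```python
from datetime import datetime

# An end-of-day aggregate mistakenly stamped with the START of the day.
mislabeled = {"event_timestamp": datetime(2023, 10, 5, 0, 0), "daily_total": 300.0}
# The same row stamped with the time the aggregate was actually complete.
correct = {"event_timestamp": datetime(2023, 10, 5, 23, 59), "daily_total": 300.0}

as_of = datetime(2023, 10, 5, 12, 0)  # a mid-day training example

def visible(row, as_of):
    """Point-in-time rule: a row is usable only if stamped at or before as_of."""
    return row["event_timestamp"] <= as_of

print(visible(mislabeled, as_of))  # True  -> the full-day total leaks into training
print(visible(correct, as_of))     # False -> correctly hidden until day's end
```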

Pitfall 2: Over-Materializing All Historical Data. Running a full materialize() job over years of historical data to populate your online store is wasteful and slow. The online store only needs the latest values for real-time inference.

Correction: Use materialize_incremental() for scheduled jobs. This command automatically tracks the last successful materialization and only processes new data since that point, making updates efficient and cost-effective.

Pitfall 3: Treating the Feature Store as a Raw Data Lake. A feature store is for curated, model-ready features, not raw logs or unstructured data. Dumping raw tables into Feast without proper transformation via Feature Views misses the point of centralized governance and calculation.

Correction: Define your feature transformation logic explicitly within Feast using on-demand Feature Views or within your upstream data pipelines, ensuring the Feature View contains clean, aggregated, and business-meaningful features.

Pitfall 4: Neglecting Monitoring and Versioning. Deploying a feature store does not automatically guarantee feature quality. Features can become stale, experience drift, or have upstream pipeline breaks.

Correction: Implement monitoring for feature freshness (is data arriving on time?), completeness (are there unexpected nulls?), and distributional drift. Use Feast’s project versioning and feature registry to safely test and roll back changes to feature definitions.
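A minimal freshness-and-completeness check can be sketched in plain Python (hypothetical thresholds and field names; a real deployment would wire this into a monitoring system): alert when the newest row is too old or the null rate exceeds a budget.

```python
from datetime import datetime, timedelta

def check_feature_health(rows, field, now,
                         max_age=timedelta(hours=2), max_null_rate=0.05):
    """Return freshness/completeness alerts for one feature column."""
    newest = max(r["event_timestamp"] for r in rows)
    nulls = sum(1 for r in rows if r[field] is None)
    return {
        "stale": now - newest > max_age,
        "too_many_nulls": nulls / len(rows) > max_null_rate,
    }

now = datetime(2023, 10, 5, 12, 0)
rows = [
    {"event_timestamp": datetime(2023, 10, 5, 11, 30), "avg_transaction_30d": 42.0},
    {"event_timestamp": datetime(2023, 10, 5, 11, 45), "avg_transaction_30d": None},
]
health = check_feature_health(rows, "avg_transaction_30d", now)
print(health)  # {'stale': False, 'too_many_nulls': True} -- 1 of 2 rows is null
```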

Summary

  • A feature store like Feast provides a central system to manage, serve, and govern ML features, directly addressing the critical problem of training-serving skew.
  • Its core architecture uses a dual-store model: an offline store (e.g., BigQuery) for historical, point-in-time-correct training data generation and an online store (e.g., Redis) for low-latency feature serving during inference.
  • Features are defined declaratively through Entities (the join keys) and Feature Views (which bind features to a data source and schema), creating a single source of truth.
  • The get_historical_features() method creates training datasets by performing a temporal join against the offline store, guaranteeing that no future data leaks into past predictions.
  • Scheduled materialization jobs move the latest feature values from the offline to the online store, and the same Feature View definitions ensure feature consistency across both training and production environments.
