Feature Engineering: Lag and Lead Features
Feature engineering is the art of transforming raw data into informative inputs for machine learning models, and nowhere is this more crucial than in time series forecasting. Lag and lead features are fundamental temporal features that encode the past and future context of your data, allowing models to recognize patterns like trends, seasonality, and autoregressive dependencies. Mastering their creation and proper application is what separates a naive model from one that genuinely understands the flow of time.
The Core Mechanics: Shift, Lag, and Lead
At its heart, creating a lag or lead feature involves shifting a sequence of values forward or backward in time. In Python's pandas library, this is accomplished primarily with the .shift() method. The direction of the shift defines the feature type.
A lag feature (or backward shift) uses past values to predict the current or future target. For instance, to predict today's sales, you might use sales from yesterday (lag 1) or last week (lag 7). You create it by shifting the series forward, which moves past values into the current row. Mathematically, for a series y_t, the k-th lag feature is y_{t-k}.
import pandas as pd
# Create a lag-1 feature (yesterday's value)
df['sales_lag1'] = df['sales'].shift(periods=1)

Conversely, a lead feature (or forward shift) uses future values. This is less common for prediction but invaluable in analysis, such as calculating day-over-day change. You create it by shifting the series backward. The k-th lead is y_{t+k}.
# Create a lead-1 feature (tomorrow's value)
df['sales_lead1'] = df['sales'].shift(periods=-1)
# Calculate next-day change (for analysis, not future prediction)
df['next_day_change'] = df['sales_lead1'] - df['sales']

The critical intuition is that a lag pulls historical information into the present row for a model to use, while a lead pulls in future information, which is typically not available at prediction time.
Building Multi-Step Lags and Autoregressive Patterns
Single lags are useful, but many time series exhibit autoregressive patterns where the current value depends on a window of recent past values, not just the immediate prior one. This is where multi-step lag features come into play.
Creating a set of lag features—for example, lags 1 through 7—allows a model to learn weekly cycles, short-term momentum, or decay effects. This is a direct implementation of an autoregressive model framework within a feature-based machine learning approach (like Gradient Boosted Trees or Linear Regression).
# Create a set of lag features for the past week
for lag in range(1, 8):
    df[f'sales_lag_{lag}'] = df['sales'].shift(lag)

The choice of which lags to include is domain-specific. For hourly data, lags 1, 2, 24, and 168 might capture hourly, daily, and weekly patterns. For financial data, lags 1, 5, and 20 might represent the previous day, week, and month. This feature set empowers the model to discover complex temporal dependencies without you having to pre-specify the exact relationship.
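As a concrete sketch of domain-specific lag selection, here is how the hourly example above might look in code. The series and the column name `load` are illustrative, not from the original text:

```python
import pandas as pd
import numpy as np

# Hypothetical hourly series; 'load' is an assumed column name
rng = pd.date_range("2023-01-01", periods=24 * 14, freq="h")
df = pd.DataFrame({"load": np.arange(len(rng), dtype=float)}, index=rng)

# Lags chosen for hourly data: previous hours, previous day, previous week
for lag in (1, 2, 24, 168):
    df[f"load_lag_{lag}"] = df["load"].shift(lag)
```

Each lag column has exactly as many leading NaNs as its lag order, so the weekly lag (168) costs a full week of usable rows at the start of the series.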
Expanding the Toolkit: Rolling Statistics and Cross-Entity Lags
Beyond raw past values, you can engineer more sophisticated features by applying aggregation functions over a rolling window of lagged values. Rolling lag statistics summarize recent history, providing signals about trends and volatility.
Common rolling features include the mean (recent average level), standard deviation (recent volatility), min/max (recent range), and sum (recent cumulative effect). These are particularly powerful for smoothing noise or highlighting emerging trends.
# Rolling average of the last 3 lagged values
df['sales_rollmean_3'] = df['sales'].shift(1).rolling(window=3).mean()
# Rolling maximum of the last 7 days
df['sales_rollmax_7'] = df['sales'].shift(1).rolling(window=7).max()

Another advanced technique involves lagged cross-entity features. In datasets with multiple parallel time series (e.g., sales for different store IDs or metrics for different server clusters), you can create features that lag within each group independently. This is achieved using .shift() in combination with .groupby().
# For each 'store_id', create a lag of its sales
df['sales_lag1_by_store'] = df.groupby('store_id')['sales'].shift(1)

This ensures the lag correctly references only the historical data from the same entity, preventing information leakage from one store to another and respecting the natural grouping of the data.
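A minimal, self-contained sketch of the per-group shift (with made-up store IDs and sales figures) shows that the first row of each store gets NaN rather than borrowing a value from the other store:

```python
import pandas as pd

# Toy data: two stores, interleaved rows, already sorted by date within store
df = pd.DataFrame({
    "store_id": ["A", "B", "A", "B", "A", "B"],
    "sales":    [10,  100, 11,  101, 12,  102],
})

# Lag within each store independently; no cross-store leakage
df["sales_lag1_by_store"] = df.groupby("store_id")["sales"].shift(1)
# The first row of store A and the first row of store B both get NaN
```

A plain df["sales"].shift(1) here would wrongly give store A's row 2 the value 100 from store B; the grouped shift gives it 10, store A's own prior value.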
Handling Missing Values at Series Boundaries
The .shift() operation inevitably creates missing values at the boundaries of your time series. A lag-k feature will have NaN (Not a Number) in the first k rows of your dataset, as there is no prior data to shift into them. Similarly, a lead-k feature will have NaN in the last k rows.
How you handle these missing values is a consequential decision. Simply dropping rows with NaN wastes data, especially for multi-step lags where the first n rows would be lost. Common strategies include:
- Forward Fill (Imputation): For lag features, you might fill the initial NaNs with the first available value (or a global mean). This is acceptable only if it doesn't distort the series start. For example, filling the first lag-7 value with the global mean is often better than dropping a week of data.
- Indicator Features: Create a binary feature (e.g., is_lag1_missing) that flags imputed rows, allowing the model to adjust its interpretation.
- Model Compatibility: Some tree-based models (like XGBoost) can natively handle NaN values by learning a default direction for "missing" during splits. Check your model's documentation.
The key is to apply the same handling logic during model training and future inference to maintain consistency.
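The imputation-plus-indicator strategy described above can be sketched as follows (the series and column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"sales": [5.0, 6.0, 7.0, 8.0, 9.0]})
df["sales_lag1"] = df["sales"].shift(1)

# Flag the rows where the lag is missing, before imputing
df["is_lag1_missing"] = df["sales_lag1"].isna().astype(int)

# Impute the missing lag with the mean of the raw series
df["sales_lag1"] = df["sales_lag1"].fillna(df["sales"].mean())
```

In production, the fill value (here, the mean) should be computed from the training data only and reused at inference time, per the consistency rule above.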
The Critical Step: Preventing Data Leakage in Train-Test Splits
This is the most important and frequently mishandled aspect of using lag features. Standard random train-test splitting is catastrophic for time series data. If a randomly selected "test" row from April has a lag feature populated by a value from a "training" row in June, you have data leakage—your model is effectively peeking into the future during training, leading to wildly optimistic and invalid performance estimates.
You must respect the temporal order. The proper method is time-based splitting:
- Sort your data chronologically.
- Choose a cutoff date. All data before this date is your training set, and all data on or after is your test/validation set.
- Apply .shift() within each set separately, or on the full dataset before splitting.
The safest practice is to split first, then calculate features. However, you must ensure the lag features for the first row of the test set are calculated using the last rows of the training set, not from other test rows. This simulates a live prediction scenario.
# Correct Procedure
train = df[df['date'] < '2023-07-01'].copy()
test = df[df['date'] >= '2023-07-01'].copy()
# Calculate lag on training data
train['sales_lag1'] = train['sales'].shift(1)
# For the test set, the lag should come from the end of the training set
test['sales_lag1'] = test['sales'].shift(1)  # This will be NaN for the first row
# Better: use the last training value to fill the first test lag
test.loc[test.index[0], 'sales_lag1'] = train['sales'].iloc[-1]

Automating this process for robust backtesting often involves using specialized time series cross-validation splitters, like TimeSeriesSplit from scikit-learn, which maintain the temporal order across folds.
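A minimal sketch of chronological cross-validation with scikit-learn's TimeSeriesSplit (the DataFrame here is a stand-in for your own data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Illustrative daily series, already sorted chronologically
df = pd.DataFrame({"sales": np.arange(100, dtype=float)})
df["sales_lag1"] = df["sales"].shift(1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(df))
for train_idx, test_idx in splits:
    # Every training index precedes every test index: no temporal leakage
    assert train_idx.max() < test_idx.min()
```

Each successive fold extends the training window forward in time, mimicking how a deployed model would be retrained as new data arrives.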
Common Pitfalls
- Leakage from Random Splitting: As detailed above, randomly shuffling time series data before creating lags or splitting guarantees leakage. Correction: Always sort by time and use chronological splits. Validate your split logic by checking that no test row date is earlier than any training row date used for its lag features.
- Ignoring the Missing Value Implications: Dropping all rows with NaN from lags can discard a significant portion of your dataset, especially with long lag windows. This can bias your model by removing all early-period data. Correction: Develop a strategy for the initial rows: consider imputation with an indicator flag, or use a model that handles NaNs. Document your choice.
- Creating Lags After Scaling/Normalization: If you scale your entire dataset (e.g., using StandardScaler) before generating lag features, you are inadvertently using future global statistics (mean, std) to transform past data points, which is a form of leakage. Correction: Always perform scaling after creating temporal features, and fit the scaler only on training data, then apply it to the test set.
- Overlooking Non-Stationarity: Lag features of a highly non-stationary series (e.g., one with a strong, unmodeled trend) can be misleading. The relationship between y_t and y_{t-k} may change drastically over time. Correction: Consider differencing the series (creating features from y_t - y_{t-1}) or using detrended residuals for lagging, especially for linear models. Tree-based models are more robust to this but not immune.
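Differencing before lagging, as suggested in the last pitfall, can be sketched like this (a small trending toy series; column names are illustrative):

```python
import pandas as pd

# Toy series with a strong upward trend
df = pd.DataFrame({"sales": [100.0, 103.0, 105.0, 110.0, 112.0, 118.0]})

# First difference removes the level/trend: y_t - y_{t-1}
df["sales_diff"] = df["sales"].diff()

# Lag the differenced series instead of the raw one
df["sales_diff_lag1"] = df["sales_diff"].shift(1)
```

Note that differencing and lagging each consume a row at the start of the series, so the combined feature has two leading NaNs to handle with the strategies above.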
Summary
- Lag features (y_{t-k}) are created using .shift(k) and provide historical context, while lead features (y_{t+k}) use .shift(-k) and are primarily for analytical purposes.
- Multi-step lag features enable models to learn complex autoregressive patterns and seasonal cycles directly from the data.
- Rolling statistics (mean, max, std) over lag windows create powerful summary features that capture trends and volatility.
- Use .groupby().shift() to create correct lagged cross-entity features when dealing with multiple parallel time series.
- Missing values at series boundaries are unavoidable; handle them with strategic imputation or model-aware methods, never by ignoring the issue.
- The cardinal rule: Always use time-based train-test splits to prevent catastrophic data leakage. Calculate lags in a way that simulates a live prediction environment, where the model only sees historical information.