Mar 1

Time Series Forecasting with Machine Learning

Mindli Team

AI-Generated Content

Time series forecasting is a critical skill for predicting everything from stock prices and energy demand to product sales and website traffic. While traditional statistical methods have long been the standard, machine learning offers powerful new tools that can model complex, non-linear relationships and incorporate diverse features.

Transforming Time Series into a Supervised Problem

The fundamental shift in using machine learning for forecasting is reframing the problem. A raw time series is a sequence of observations ordered by time. Most ML algorithms, however, require a standard tabular dataset of independent features and a target variable. We achieve this by engineering features from the past to predict the future.

The core idea is to use lag features, which are past values of the series itself. For instance, to predict tomorrow's value (y[t+1]), you might use the values from the last three days (y[t], y[t-1], y[t-2]) as your input features (X). The value you want to predict (y[t+1]) becomes your target (y). This creates a supervised learning problem where each row in your new dataset is a time step, the features are historical lags, and the target is a future observation. The number of lags you create defines the look-back window and is a crucial hyperparameter.
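This reframing can be sketched in a few lines of pandas. The series values and column names below are illustrative; the pattern is simply shifting the series to create one column per lag, then dropping the rows where the look-back window is undefined.

```python
import pandas as pd

# A short hypothetical daily series (values are illustrative).
series = pd.Series([10.0, 12.0, 13.0, 12.5, 14.0, 15.0, 14.5], name="value")

# Build a supervised dataset: three lag features and the current value as target.
df = pd.DataFrame({"y": series})
for lag in (1, 2, 3):
    df[f"lag_{lag}"] = df["y"].shift(lag)

# The first rows have undefined lags, so drop them.
df = df.dropna()

X = df[["lag_1", "lag_2", "lag_3"]]  # features: the look-back window
y = df["y"]                          # target: the value to predict
```

Each remaining row is now an independent training example for any standard regressor, with the temporal dependence encoded entirely in the lag columns.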

Feature Engineering for Temporal Patterns

Beyond simple lags, effective forecasting requires features that capture various temporal patterns. This engineering step is where you inject domain knowledge and help the model understand the structure of time.

  • Rolling Statistics: These are computed over a sliding window of recent observations and capture local trends and volatility. A rolling mean (e.g., the average of the last 7 periods) smooths out short-term noise to reveal the underlying trend. A rolling standard deviation helps the model understand periods of high and low volatility, which is essential in fields like finance.
  • Expanding Statistics: Similar to rolling features, but the window includes all data from the start of the series up to the current point (e.g., an expanding mean). These capture the global evolution of the series.
  • Calendar Features: Time has inherent structure that models can't infer from the index alone. You must explicitly extract features like the hour of the day, day of the week, month, quarter, and whether a day is a holiday or part of a weekend. For hourly data, sine/cosine transformations of the hour can effectively model the cyclical nature of days.
  • Target Encoding of Time Cycles: Advanced engineering might include encoding the average historical value for a given time period (e.g., the typical sales for a specific Monday in December). This must be done carefully to avoid data leakage.

The final feature set is often a combination of lags, rolling features, and calendar-derived variables, giving the model a rich representation of the past and the temporal context to make an informed prediction.
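The feature types above can be combined in one pass. A minimal sketch, assuming an hourly pandas DatetimeIndex; note the `shift(1)` before each rolling or expanding calculation, which keeps every row restricted to data available before that time step:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly data (values are illustrative).
idx = pd.date_range("2024-01-01", periods=48, freq="h")
df = pd.DataFrame({"y": np.arange(48, dtype=float)}, index=idx)

# Rolling statistics over the last 7 observations, shifted one step
# so each row sees only past data (no leakage of the current value).
df["roll_mean_7"] = df["y"].shift(1).rolling(7).mean()
df["roll_std_7"] = df["y"].shift(1).rolling(7).std()

# Expanding mean: all history up to, but not including, the current row.
df["expanding_mean"] = df["y"].shift(1).expanding().mean()

# Calendar features, with sine/cosine encodings for the daily cycle.
df["hour"] = df.index.hour
df["dayofweek"] = df.index.dayofweek
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
```

The sine/cosine pair places hour 23 next to hour 0, which a raw integer hour column cannot express.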

Multi-Step Forecasting Strategies

When you need to predict multiple time steps into the future (e.g., the next 7 days), you have two primary strategies: direct and recursive.

  • Direct Multi-Step Forecasting: This approach trains a separate model for each future horizon you wish to predict. To forecast the next 3 days, you would train Model A to predict y[t+1], Model B for y[t+2], and Model C for y[t+3]. Each model uses the same set of lagged features. The advantage is that error accumulation is minimized, as each prediction is made independently. The disadvantage is maintaining multiple models, and the models are unaware of the predictions made for other horizons, which may lead to inconsistent forecasts.
  • Recursive Multi-Step Forecasting (Iterative): This method trains a single one-step-ahead model. You use it to predict y[t+1]. Then, you feed that prediction back into the model as a lagged feature to predict y[t+2], and so on. This is computationally simpler and allows the model to condition its future predictions on its own previous forecasts. However, prediction errors from earlier steps compound as they are fed back into the model, which can lead to rapidly degrading forecast quality for longer horizons.

The choice depends on your data, horizon length, and tolerance for error propagation. A hybrid "direct-recursive" approach, where you train a model for each horizon but allow it to use previously predicted values as inputs, is also common.
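The recursive loop is simple enough to sketch concretely. Below, a one-step-ahead model on three lags is trained on a toy series, and the hypothetical helper `recursive_forecast` feeds each prediction back in as the newest lag:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy series and a one-step-ahead model on 3 lags (illustrative setup).
y = np.arange(20, dtype=float)  # a simple linear trend
n_lags = 3
X_train = np.array([y[i:i + n_lags] for i in range(len(y) - n_lags)])
y_train = y[n_lags:]
model = LinearRegression().fit(X_train, y_train)

def recursive_forecast(model, history, horizon, n_lags=3):
    """Predict `horizon` steps ahead by feeding forecasts back as lags."""
    window = list(history[-n_lags:])
    preds = []
    for _ in range(horizon):
        next_val = model.predict(np.array(window[-n_lags:]).reshape(1, -1))[0]
        preds.append(next_val)
        window.append(next_val)  # the forecast becomes the newest "lag"
    return preds

preds = recursive_forecast(model, y, horizon=3)
```

On noisy real data, any error in the first prediction is baked into the inputs of the second, which is exactly the compounding effect described above.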

Modeling Multiple Related Time Series

Often, you don't have just one series but many related ones—like sales for hundreds of different products or energy load for thousands of households. Training a separate model for each can be inefficient and ignores potential shared patterns.

Machine learning excels here through global models. You create a single, unified dataset by stacking all the individual time series. A crucial feature you must add is a series identifier (e.g., a product ID or store number). This allows the model to learn both general patterns across all series and specific behaviors for individual series. You can enrich this further with static features about each series (e.g., product category, store location). Gradient boosting frameworks like XGBoost and LightGBM handle this mixed data type scenario very effectively, often outperforming individual models by learning from the broader pool of data.
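A minimal sketch of building such a global dataset with pandas; the product names, values, and metadata are all hypothetical. The key details are the series identifier column and computing lags per group so that one series' history never leaks into another's:

```python
import pandas as pd

# Two hypothetical product series stacked into one "global" dataset.
sales_a = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=5),
                        "sales": [10, 11, 12, 13, 14]})
sales_a["product_id"] = "A"
sales_b = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=5),
                        "sales": [100, 90, 95, 97, 99]})
sales_b["product_id"] = "B"

panel = pd.concat([sales_a, sales_b], ignore_index=True)

# Lags must be computed per series, never across a series boundary.
panel = panel.sort_values(["product_id", "date"])
panel["lag_1"] = panel.groupby("product_id")["sales"].shift(1)

# Static features about each series can be merged in as extra columns.
meta = pd.DataFrame({"product_id": ["A", "B"],
                     "category": ["toys", "food"]})
panel = panel.merge(meta, on="product_id")
```

A tree-based model trained on `panel` can then split on `product_id` or `category` to learn series-specific behavior while still pooling data across all series.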

Model Development and Benchmarking

With your features engineered and strategy chosen, you can proceed to model training. Tree-based ensembles like gradient boosting (XGBoost, LightGBM, CatBoost) are particularly popular due to their robustness, ability to handle non-linearities, and built-in handling of missing values. Neural networks, especially architectures like LSTMs and Transformers, are powerful for capturing very long-term dependencies but require more data and careful tuning.

A critical, non-negotiable step is benchmarking your ML models against statistical baselines. The ARIMA (AutoRegressive Integrated Moving Average) model is the classic benchmark for univariate series. It is a strong, parsimonious model that defines the minimum performance bar. Your complex ML model must consistently outperform a well-tuned ARIMA or Exponential Smoothing model to justify its added complexity. Use time-series-aware cross-validation (e.g., expanding window CV) for evaluation and compare metrics like MAPE, RMSE, or MAE. The goal is not to discard traditional methods but to understand when ML provides a genuine advantage.
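Scikit-learn's TimeSeriesSplit implements this forward-chaining evaluation: training folds grow over time and every test fold lies strictly in the future. A small sketch on 12 ordered observations (the data is illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 time-ordered observations (illustrative).
y = np.arange(12)

tscv = TimeSeriesSplit(n_splits=3)
folds = []
for train_idx, test_idx in tscv.split(y):
    # Every training index precedes every test index: no future leakage.
    folds.append((train_idx.tolist(), test_idx.tolist()))
```

Fit your candidate model and the statistical baseline on each training fold, score both on the matching test fold, and compare the averaged metrics.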

Common Pitfalls

  1. Data Leakage in Feature Engineering: This is the most dangerous mistake. Using future information to predict the past invalidates your model. When creating rolling statistics or encoded features, you must ensure the calculation for each row uses only data available up to that point in time. Scikit-learn's Pipeline with a custom transformer or dedicated libraries can help enforce this.
  2. Ignoring Stationarity: While ML models are more flexible than ARIMA, many features (like lags) are still most effective if the underlying series is stationary (its statistical properties like mean and variance are constant over time). Non-stationarity can lead to spurious relationships. Applying differencing or transformations to stabilize the series before feature engineering often improves model performance.
  3. Overlooking Simple Baselines: Before building a complex LSTM network, always check the performance of a simple naive forecast (e.g., predicting that tomorrow's value equals today's) or a seasonal naive forecast (e.g., predicting this week's sales equal last week's). If you cannot beat these, your model has fundamental issues.
  4. Misapplying Standard Cross-Validation: Using random k-fold CV on time series destroys the temporal order, allowing the model to train on future data to predict the past. Always use forward-chaining methods like TimeSeriesSplit which respect the sequence of time.
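The naive and seasonal naive baselines from pitfall 3 take only a few lines. On the hypothetical weekly-periodic series below, the seasonal naive forecast is exact while the plain naive forecast is not, which is exactly the kind of gap a complex model must also beat:

```python
import numpy as np

# Hypothetical series with a perfect weekly cycle (period 7).
y = np.array([5., 6., 7., 8., 7., 6., 5.] * 4)

# Naive forecast: tomorrow's value equals today's.
naive_pred = y[:-1]
naive_mae = np.mean(np.abs(y[1:] - naive_pred))

# Seasonal naive forecast: this period equals the same period one cycle ago.
season = 7
snaive_pred = y[:-season]
snaive_mae = np.mean(np.abs(y[season:] - snaive_pred))
```

If an ML model's error on held-out data is not clearly below both of these numbers, the extra machinery is not earning its keep.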

Summary

  • The core of ML forecasting is transforming a sequence into a supervised learning problem using engineered features from the past, primarily lag features.
  • Effective feature engineering combines lags, rolling statistics (for local trends), and calendar features (for seasonality) to create a rich representation for the model.
  • For multi-step forecasts, choose between the direct method (separate models per horizon) and the recursive method (one model iterates), understanding the trade-off between complexity and error compounding.
  • Global models that learn from multiple related series by incorporating a series identifier often outperform individual models by pooling information.
  • Always benchmark your ML models against established statistical baselines like ARIMA and simple naive forecasts to ensure the added complexity provides a measurable improvement in predictive accuracy.
