
Recurrent Neural Networks and LSTM
Mindli AI

To understand language, predict stock trends, or recognize a melody, you need models that can remember context and process information in sequence. Traditional neural networks fail here because they treat each input as independent. Recurrent Neural Networks (RNNs) were designed to solve this, creating a dynamic internal memory that makes them uniquely powerful for sequential data processing.

The Foundation: Vanilla RNN Architecture

At its core, a Recurrent Neural Network (RNN) is any network that contains a cyclic connection, allowing information to persist. Imagine reading a sentence word by word; your understanding of each new word depends on the words you've just read. An RNN operates on a similar principle.

The key mechanism is the hidden state, a vector that acts as the network's memory. For each step in a sequence, the RNN takes two inputs: the current data point $x_t$ (e.g., a word or a stock price) and the hidden state $h_{t-1}$ from the previous step. It combines these to produce a new hidden state $h_t$ and an output $y_t$. The process is governed by these equations:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
$$y_t = W_{hy} h_t + b_y$$

Here, the weight matrices $W_{xh}$, $W_{hh}$, $W_{hy}$ and the bias vectors $b_h$, $b_y$ are the learnable parameters. The tanh activation function squashes values to between -1 and 1, aiding stability. This cyclical propagation of the hidden state allows the network to pass information from one step to the next, capturing temporal dependencies. A simple analogy is a conveyor belt feeding into a machine; the machine's setting (the hidden state) is adjusted by each new item ($x_t$) and its own previous setting, producing a modified output at each stage.
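The update rules can be sketched in a few lines of NumPy. This is a minimal illustration, not a trainable implementation: the dimensions and random initialization are arbitrary assumptions chosen only to show the shapes involved.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h, W_hy, b_y):
    """One step of a vanilla RNN: update the hidden state, emit an output."""
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)  # new hidden state
    y_t = W_hy @ h_t + b_y                           # output at this step
    return h_t, y_t

# Toy dimensions: 3-dim inputs, 4-dim hidden state, 2-dim outputs.
rng = np.random.default_rng(0)
W_xh, W_hh = rng.standard_normal((4, 3)), rng.standard_normal((4, 4))
b_h = np.zeros(4)
W_hy, b_y = rng.standard_normal((2, 4)), np.zeros(2)

h = np.zeros(4)                            # initial hidden state
for x in rng.standard_normal((5, 3)):      # a sequence of 5 inputs
    h, y = rnn_step(x, h, W_xh, W_hh, b_h, W_hy, b_y)
```

Note that the same weights are reused at every step; only the hidden state changes, which is exactly the "conveyor belt" behavior described above.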

The Vanishing Gradient Problem

While elegant in theory, the standard or "vanilla" RNN fails with long sequences due to the vanishing gradient problem. To train an RNN, we use Backpropagation Through Time (BPTT), which is essentially backpropagation applied to the "unrolled" sequence of RNN cells.

The gradients (error signals) used to update the network's weights are calculated via the chain rule. In a long sequence, this involves multiplying many derivatives together—specifically, the derivative of the hidden state at each step. Since the derivative of tanh is at most 1 (and the sigmoid's at most 0.25), repeated multiplication causes these gradients to shrink exponentially as they are propagated backward through time. A gradient that vanishes to near-zero means the early time steps receive virtually no update signal during training. Consequently, the network cannot learn long-range dependencies; it effectively "forgets" what it saw more than 10-20 steps ago.
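A rough numerical illustration of the effect: if we ignore the weight matrices (which in practice can dampen or amplify the effect) and multiply just one tanh-derivative factor per time step, the product collapses toward zero within a few dozen steps. The pre-activation values here are arbitrary samples, chosen only to make the point.

```python
import numpy as np

# The derivative of tanh(z) is 1 - tanh(z)^2, which is at most 1 and
# typically much smaller. BPTT multiplies one such factor per time step
# via the chain rule, shrinking the gradient signal exponentially.
rng = np.random.default_rng(42)
grad = 1.0
for t in range(50):                    # 50 time steps
    z = rng.standard_normal()          # an illustrative pre-activation
    grad *= 1.0 - np.tanh(z) ** 2      # one chain-rule factor per step

print(grad)  # a tiny number: early steps get almost no learning signal
```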

This flaw rendered vanilla RNNs impractical for many real-world tasks involving long sequences, such as document translation or analyzing lengthy time series data. The search for a solution led to a more robust architecture.

LSTM: Long Short-Term Memory Networks

The Long Short-Term Memory (LSTM) network, introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997, explicitly addresses the vanishing gradient problem. It does this by introducing a cell state, $C_t$, which acts as a dedicated "memory highway" running through the entire sequence. Information can flow along this highway with minimal interference, enabling long-term memory. The flow of information into, out of, and within this cell state is regulated by three specialized gates, which are neural network layers that output values between 0 and 1, controlling how much information passes through.

  1. Forget Gate ($f_t$): This gate decides what information to discard from the cell state. It looks at the current input $x_t$ and the previous hidden state $h_{t-1}$, computing $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$, and outputs a number between 0 (completely forget) and 1 (completely keep) for each number in the cell state $C_{t-1}$.

  2. Input Gate ($i_t$) & Candidate Cell State ($\tilde{C}_t$): This step decides what new information to store. The input gate $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ determines which values to update. A tanh layer creates a vector of candidate values, $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$, that could be added.

  3. Update Cell State ($C_t$): We now combine the actions of the forget and input gates to update the long-term memory:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

This is the core equation. The old state $C_{t-1}$ is multiplied element-wise by the forget gate $f_t$, selectively removing information. Then we add the candidate values $\tilde{C}_t$, scaled by how much we decided to update each one ($i_t$).

  4. Output Gate ($o_t$) and Hidden State ($h_t$): Finally, we produce the output hidden state, which is a filtered version of the cell state. The output gate $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ decides what parts of the cell state to output. The cell state is passed through tanh (to push values between -1 and 1) and multiplied element-wise by the output gate's signal: $h_t = o_t \odot \tanh(C_t)$.

This gated architecture allows LSTMs to learn what information to store, what to forget, and what to output over very long sequences, solving the vanishing gradient problem that plagued vanilla RNNs.
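The four steps above can be condensed into a single cell function. This is a minimal NumPy sketch under illustrative assumptions: the four gate layers are stacked into one weight matrix `W` acting on the concatenated `[h_prev, x_t]`, and the sizes and initialization are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step. W stacks the four gate layers (forget, input,
    candidate, output) along its first axis; b is the stacked bias."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[0:H])                # forget gate
    i_t = sigmoid(z[H:2*H])              # input gate
    C_tilde = np.tanh(z[2*H:3*H])        # candidate cell state
    o_t = sigmoid(z[3*H:4*H])            # output gate
    C_t = f_t * C_prev + i_t * C_tilde   # additive long-term memory update
    h_t = o_t * np.tanh(C_t)             # filtered short-term output
    return h_t, C_t

rng = np.random.default_rng(0)
H, D = 4, 3                              # hidden size, input size
W = rng.standard_normal((4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
h, C = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((6, D)):    # a sequence of 6 inputs
    h, C = lstm_step(x, h, C, W, b)
```

Notice that the cell state update is additive (`f_t * C_prev + i_t * C_tilde`), not a repeated squashing; this is what keeps gradients from vanishing along the memory highway.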

GRU: A Simplified Alternative

The Gated Recurrent Unit (GRU) is a popular, more streamlined alternative to the LSTM. It combines the forget and input gates into a single update gate ($z_t$). It also merges the cell state and hidden state. The GRU has two gates:

  • Update Gate ($z_t$): Controls how much of the previous hidden state to carry forward.
  • Reset Gate ($r_t$): Controls how much of the past information to forget when computing the new candidate hidden state.

While simpler and often faster to train than LSTMs, GRUs can be equally effective, especially on smaller datasets or when computational efficiency is a priority. The choice between LSTM and GRU is often empirical and depends on the specific task and dataset.
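For comparison with the LSTM, here is a minimal GRU step in NumPy. Conventions vary between references (some swap the roles of $z_t$ and $1 - z_t$); this sketch uses one common form, with arbitrary illustrative sizes and initialization.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU step: two gates, one hidden state (no separate cell state)."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ hx + bz)   # update gate: how much to rewrite the state
    r_t = sigmoid(Wr @ hx + br)   # reset gate: how much past to use below
    h_tilde = np.tanh(Wh @ np.concatenate([r_t * h_prev, x_t]) + bh)
    # Interpolate between the previous state and the candidate.
    return (1 - z_t) * h_prev + z_t * h_tilde

rng = np.random.default_rng(0)
H, D = 4, 3                              # hidden size, input size
Wz, Wr, Wh = (rng.standard_normal((H, H + D)) * 0.1 for _ in range(3))
bz = br = bh = np.zeros(H)
h = np.zeros(H)
for x in rng.standard_normal((6, D)):    # a sequence of 6 inputs
    h = gru_step(x, h, Wz, Wr, Wh, bz, br, bh)
```

The parameter savings are visible directly: three weight matrices instead of the LSTM's four, and a single state vector to carry between steps.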

Applications: Time Series Forecasting and Sequence Modeling

The power of LSTMs and GRUs is realized in their applications. In time series forecasting—like predicting electricity demand, stock prices, or equipment failure—these models excel at learning complex temporal patterns from historical data. They can incorporate multiple variables (e.g., past prices, trading volume, news sentiment) and model seasonality and trends.
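In practice, forecasting with these models starts by framing the series as supervised learning: slice the history into fixed-length input windows, each paired with the value that follows. A minimal sketch on a synthetic series (the window length and sine-wave data are arbitrary choices for illustration):

```python
import numpy as np

def make_windows(series, window):
    """Turn a 1-D series into (input window, next value) training pairs."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

series = np.sin(np.linspace(0, 10, 200))   # stand-in for real history
X, y = make_windows(series, window=20)
# Each row of X is 20 consecutive values; y holds the value to predict.
```

The resulting `X` can be fed to a recurrent model one window at a time, with the hidden (and, for LSTMs, cell) state reset between windows.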

In sequence modeling, they are the foundational blocks for:

  • Natural Language Processing (NLP): Machine translation, text generation, and sentiment analysis. (Note: While LSTMs were state-of-the-art, they are now often superseded by Transformer models for these tasks).
  • Speech Recognition: Converting audio waveforms into text.
  • Video Analysis: Understanding actions across a sequence of frames.

Common Pitfalls

  1. Using RNNs/LSTMs for Non-Sequential Data: These architectures are computationally expensive and offer no benefit over simpler feedforward networks if your data points are truly independent. Always assess if your problem has a genuine temporal or sequential structure.
  2. Ignoring Input Sequence Preprocessing: Feeding raw, unnormalized sequential data (like unscaled time series) or poorly indexed text will lead to poor training. Proper normalization, tokenization, and padding are essential preprocessing steps.
  3. Defaulting to LSTM Without Considering GRU or Simpler Models: While powerful, LSTMs have many parameters. For shorter sequences or smaller datasets, a GRU or even a carefully regularized simple RNN might achieve similar results faster and with less risk of overfitting.
  4. Misunderstanding the Cell State and Hidden State: Confusing these two vectors is a frequent conceptual error. Remember: the cell state ($C_t$) is the long-term memory, updated additively. The hidden state ($h_t$) is the short-term, filtered output derived from the cell state, used for predictions and passed to the next step.
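On pitfall 2, the normalization step can be very small in code, but one detail matters: compute the scaling statistics on the training split only, so information from the future never leaks into training. A sketch with a stand-in series and an arbitrary 80/20 split:

```python
import numpy as np

# Standardize a series using statistics from the training split only.
series = np.arange(100, dtype=float)   # stand-in for a raw time series
split = 80
mean, std = series[:split].mean(), series[:split].std()
train = (series[:split] - mean) / std
test = (series[split:] - mean) / std   # same parameters, never refit
```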

Summary

  • Vanilla RNNs process sequences by maintaining a propagating hidden state, but they suffer from the vanishing gradient problem, preventing them from learning long-range dependencies.
  • LSTM networks solve this with a gated architecture, featuring a dedicated cell state and three gates (forget, input, output) that regulate information flow, enabling effective long-term memory.
  • The GRU is a simplified, often effective alternative that combines the forget and input gates and merges the cell and hidden states.
  • These recurrent architectures are fundamental for tasks involving sequential data, most notably time series forecasting and sequence modeling in fields like NLP and speech recognition.
  • Successful application requires choosing the right architecture for the task complexity and ensuring proper sequential data preprocessing.
