Feb 27

Positional Encoding in Transformers

Mindli Team

AI-Generated Content


Transformers have revolutionized machine learning by processing entire sequences in parallel, but this very strength creates a fundamental weakness: without a mechanism to understand order, the model sees "the dog chased the cat" and "the cat chased the dog" as identical. Positional encoding is the critical solution that injects sequence order information into these otherwise order-agnostic models, enabling them to understand language, code, and time-series data. Mastering its various implementations is essential for building and interpreting modern architectures like GPT and BERT.

The Problem of Permutation Invariance

At its core, the self-attention mechanism that powers transformers has no intrinsic notion of order. It is often described as permutation invariant; strictly speaking, it is permutation equivariant: if you shuffle the input tokens, the attention operation produces the same set of output embeddings, shuffled in exactly the same way. This is a problem because meaning in sequences is intrinsically tied to order. The sentence "John saw Mary" has a different subject and object than "Mary saw John." Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) inherently handle order by processing tokens one after another. Transformers, which process all tokens simultaneously for massive efficiency gains, lack this recurrence. Therefore, we must explicitly add information about each token's position in the sequence. The positional encoding vectors are added to the input token embeddings before the first attention layer, giving the model a frame of reference for "where" each token is.
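This property is easy to verify directly. The sketch below runs a bare single-head self-attention step (plain NumPy, identity query/key/value projections for brevity) and shows that permuting the input tokens simply permutes the outputs identically, so no order information survives:

```python
import numpy as np

def self_attention(X):
    """Bare single-head self-attention with identity Q/K/V projections."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # 5 tokens, 8-dim embeddings
perm = rng.permutation(5)

# Shuffling the inputs just shuffles the outputs in the same way:
assert np.allclose(self_attention(X)[perm], self_attention(X[perm]))
```

Adding position-dependent vectors to `X` before attention breaks this symmetry, which is exactly what positional encoding does.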

Sinusoidal Positional Encoding

The original Transformer paper introduced a fixed, deterministic method called sinusoidal positional encoding. This approach encodes the absolute position of a token using sine and cosine waves of varying frequencies. For a position pos in the sequence and a dimension index i of the embedding vector, the encoding is calculated as:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Here, d_model is the dimensionality of the embedding. The choice of a geometric progression of wavelengths (from 2π to 10000 · 2π) ensures that every dimension of the positional encoding corresponds to a different frequency. This design has two key properties. First, it is unique for each absolute position. Second, it allows the model to easily learn to attend to relative positions, as for any fixed offset k, PE(pos + k) can be represented as a linear function of PE(pos). This is a useful inductive bias. The sinusoidal pattern is unbounded, which theoretically allows the model to generalize to sequence lengths longer than those seen during training, though this generalization is often limited in practice.
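A minimal NumPy implementation of the sinusoidal scheme (the function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    freqs = 10000.0 ** (-np.arange(0, d_model, 2) / d_model)   # geometric frequency progression
    angles = positions * freqs                                 # (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(seq_len=128, d_model=64)
# Every row (absolute position) is a unique vector with values in [-1, 1].
```

These vectors are simply added element-wise to the token embeddings before the first attention layer.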

Learned Positional Embeddings

An alternative to the fixed sinusoidal scheme is to use learned positional embeddings. Here, a trainable lookup table of size max_sequence_length × d_model is created. Each position index (0, 1, 2, ...) is associated with a unique vector that the model learns during training, just like it learns embeddings for words or subwords. This approach is simpler to implement and is used in models like BERT. Its main advantage is flexibility; the model can discover the most useful positional representation for its task. However, it has a significant limitation: it is constrained by the max_sequence_length defined during training. The model cannot naturally handle sequences longer than this pre-defined limit because it has never learned embeddings for those new positions.
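The idea can be sketched in a few lines of NumPy. In a real model the table is a trainable parameter updated by backpropagation; here it is just randomly initialized, and the names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
max_sequence_length, d_model = 512, 64

# Lookup table: one learnable vector per position index (randomly initialized
# here; training would update these rows exactly like word embeddings).
position_table = rng.normal(scale=0.02, size=(max_sequence_length, d_model))

def add_positions(token_embeddings):
    seq_len = token_embeddings.shape[0]
    if seq_len > max_sequence_length:
        # The hard limit discussed above: no rows exist for unseen positions.
        raise ValueError(f"length {seq_len} exceeds maximum {max_sequence_length}")
    return token_embeddings + position_table[:seq_len]

tokens = rng.normal(size=(300, d_model))
embedded = add_positions(tokens)   # fine: 300 <= 512
```

Calling `add_positions` on a 600-token sequence raises the error, which is the length limitation in concrete form.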

Relative Positional Encoding

Both absolute encoding methods have drawbacks. Sinusoidal encodings have limited length generalization, and learned embeddings are rigidly fixed to a maximum length. This led to the development of relative positional encoding methods, which directly model the distance between tokens in the attention calculation. Instead of saying "this token is at position 5," they say "this token is 3 positions before that token."

Two prominent variants are ALiBi and RoPE. ALiBi (Attention with Linear Biases) adds a static, non-learned bias to the attention scores based on the distance between the query and key tokens. The further apart two tokens are, the more negative the bias becomes, which discourages the attention mechanism from attending to distant positions. This simple method proves highly effective for extrapolating to longer sequences at inference time. RoPE (Rotary Position Embedding) incorporates relative position information by rotating the query and key vectors using a rotation matrix that depends on the absolute position. This elegant mathematical formulation encodes relative positional differences in the angle between vectors, which is then naturally measured by the dot-product attention. RoPE has become the standard in models like LLaMA and GPT-NeoX due to its performance and flexibility.
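Both ideas can be sketched in a few lines of NumPy. The ALiBi helper below uses the geometric slope schedule from the paper but a symmetric distance for brevity (ALiBi as published is applied causally); the RoPE helper rotates consecutive (even, odd) dimension pairs by position-dependent angles and then checks the defining property: the dot product depends only on the relative offset, not the absolute positions. Function names are illustrative.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Static distance-based bias added to attention scores, one matrix per head."""
    # Geometric slope schedule: head h gets slope 2^(-8h / num_heads).
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    distance = np.abs(i - j)                  # symmetric sketch; the paper masks causally
    return -slopes[:, None, None] * distance  # more negative the further apart

def rope_rotate(x, pos):
    """Rotate consecutive (even, odd) dimension pairs of x by angles pos * theta_i."""
    d = x.shape[-1]
    theta = 10000.0 ** (-np.arange(0, d, 2) / d)  # per-pair frequencies
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 64))
# Defining RoPE property: the score depends only on the offset (4 in both
# cases below), not on the absolute positions.
assert np.isclose(rope_rotate(q, 5) @ rope_rotate(k, 9),
                  rope_rotate(q, 105) @ rope_rotate(k, 109))
```

Because rotations preserve norms and compose by angle addition, shifting both positions by the same amount leaves every query-key dot product unchanged, which is why RoPE behaves as a relative scheme despite using absolute positions in the rotation.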

How Position Information Affects Attention Patterns

The choice of positional encoding fundamentally shapes the model's attention patterns. Absolute positional encodings (sinusoidal or learned) allow the model to develop "positional heuristics," such as focusing on the beginning of a sentence for certain tasks. You can visualize this by examining attention heads; some may consistently attend to the first token (a "[CLS]" token or period), a pattern enabled by absolute position. Relative encodings, by design, encourage local, windowed attention. With ALiBi, the linear bias penalizes distant positions, making it easier for a head to place weight on nearby tokens and fostering patterns of local dependency capture. RoPE enables dynamic attention patterns where the model can learn to attend based on relative distance. The encoding method is not just an implementation detail; it directly influences the inductive biases of the network, steering it toward certain types of linguistic or structural interpretations, such as syntactic relationships, which often depend on relative distance.

Common Pitfalls

  1. Ignoring Sequence Length Limits: Using learned positional embeddings with a model trained on sequences of length 512 and then feeding it a document of length 600 will fail or produce nonsense. You must always ensure your input is truncated or padded to the model's trained maximum length, or use a model with relative positional encoding designed for length extrapolation.
  2. Misindexing Positions: A subtle but critical implementation error is misaligning positional indices with token indices, especially when handling batched sequences with varying lengths and padding. For absolute encodings, the padding tokens (e.g., at the end of a shorter sequence) must receive a positional encoding, but the attention mask should prevent the model from attending to them. Incorrect masking can leak positional information from padding.
  3. Assuming Perfect Length Generalization: While sinusoidal and relative encodings are designed for better generalization, models still often perform worse on sequences significantly longer than those seen during training. Don't assume a model with sinusoidal encodings will flawlessly handle a sequence 10x longer than its training data; performance degradation is common and must be tested.
  4. Treating Encoding as an Afterthought: Choosing a positional encoding is a key architectural decision. For tasks with very long-range dependencies (e.g., code, narratives), a relative encoding like ALiBi or RoPE is likely superior. For tasks where absolute start/end position is highly informative, learned embeddings might suffice. The choice impacts model capability.
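Pitfall 2 is worth making concrete. In the sketch below (plain NumPy, illustrative shapes), padded positions in a batch still line up with positional indices, but the attention mask pushes the scores on padded keys toward negative infinity before the softmax, so no weight leaks to them:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len = 2, 5
lengths = np.array([5, 3])   # second sequence is padded from length 3 to 5

# True where the token is real, False where it is padding.
key_mask = np.arange(seq_len)[None, :] < lengths[:, None]   # (batch, seq_len)

scores = rng.normal(size=(batch, seq_len, seq_len))          # raw attention scores
# Send padded *keys* to a large negative value so softmax assigns them ~zero weight.
scores = np.where(key_mask[:, None, :], scores, -1e9)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# No query in the padded sequence attends to the padded positions 3 and 4.
assert np.all(weights[1, :, 3:] < 1e-6)
```

The key point is that masking happens on the scores, not by skipping positional encodings: positions 3 and 4 of the short sequence still get encodings, and only the mask keeps them out of the attention distribution.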

Summary

  • Transformers require positional encoding because their self-attention mechanism is permutation invariant and lacks the innate sense of order found in RNNs.
  • Sinusoidal encoding uses fixed sine/cosine waves of different frequencies to provide absolute position data and offers a useful inductive bias for learning relative positions.
  • Learned embeddings are a simple, trainable alternative to sinusoidal encodings but are rigidly limited to a pre-defined maximum sequence length.
  • Relative positional encodings like ALiBi and RoPE model the distance between tokens directly, often leading to better performance on long sequences and more flexible attention patterns.
  • The type of encoding directly shapes the model's attention patterns, influencing whether it relies on absolute positions or learns to operate based on the relative distances between tokens.
