Mar 2

Transformer Encoder Architecture in Detail

Mindli Team

AI-Generated Content


The transformer encoder is the architectural backbone behind models like BERT and the encoding half of the original Transformer. It revolutionized natural language processing by replacing sequential operations with parallelizable self-attention, enabling unprecedented scalability and contextual understanding. Mastering its construction is essential for implementing state-of-the-art models and adapting them to new domains.

Foundational Building Blocks: Self-Attention and Position

The core innovation of the transformer is the self-attention mechanism. It allows each token in a sequence to directly attend to all other tokens, computing a weighted sum of their values. This weight, or attention score, determines how much focus to place on other parts of the sequence when encoding a specific token. For a sequence of input vectors X, the operation is performed by first projecting X into three matrices: Queries (Q = XW^Q), Keys (K = XW^K), and Values (V = XW^V). The output is computed as:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V

The scaling factor 1/√d_k, where d_k is the dimension of the key vectors, prevents the softmax gradients from becoming too small.
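As a minimal sketch, scaled dot-product attention can be written in a few lines of NumPy (the function and variable names here are illustrative, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq_len, seq_len) logits
    scores -= scores.max(axis=-1, keepdims=True) # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy self-attention: Q = K = V = X, 3 tokens of dimension 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(X, X, X)
```

Each row of `w` is a probability distribution over the sequence, so the output for each token is a convex combination of all value vectors.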

Since a single attention head has a limited perspective, the architecture uses multi-head self-attention. This involves performing the self-attention operation h times in parallel, each head with its own learned projection matrices. This allows the model to jointly attend to information from different representation subspaces. The outputs of all heads are concatenated and linearly projected to form the final multi-head attention output.
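A hedged sketch of the split-attend-concat-project pipeline, assuming d_model is divisible by the number of heads (the weight names W_q, W_k, W_v, W_o are my own notation):

```python
import numpy as np

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Project X, split into heads, attend per head, concatenate, project."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    def split(M):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    heads = w @ Vh                               # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                          # final linear projection

d_model, num_heads, seq_len = 8, 2, 5
rng = np.random.default_rng(1)
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
X = rng.normal(size=(seq_len, d_model))
out = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
```

Note that the per-head dimension is d_model / h, so the total computation is comparable to a single full-dimension head.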

However, self-attention is permutation-invariant; it has no inherent notion of word order. To inject sequential information, we use positional encoding. The most common method is the fixed, sinusoidal encoding from the original paper, which uses sine and cosine functions of different frequencies. Position pos and dimension pair index i are encoded as:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

These encodings are simply added to the input token embeddings before the first encoder layer. Alternatively, learned positional embeddings are also widely used, where each position index is associated with a trainable vector.
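A short sketch of the sinusoidal table (assuming an even d_model so sine and cosine channels interleave cleanly):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Build the fixed sin/cos positional encoding table (d_model must be even)."""
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
```

In use, `embeddings + pe[:seq_len]` is the input to the first encoder layer. Because the table is a fixed function of position, it can be precomputed once and extended to any length.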

Constructing the Encoder Layer

A single transformer encoder layer is composed of two main sub-layers, each surrounded by critical stabilizing components. Understanding this stack is key to building a functional encoder.

The first sub-layer is the multi-head self-attention mechanism described above. Its output is integrated via a residual connection (or skip connection). The residual connection adds the sub-layer's input directly to its output (x + Sublayer(x)). This mitigates the vanishing gradient problem in deep networks, allowing gradients to flow directly backward through the addition operation. Following this addition, layer normalization is applied. Layer normalization stabilizes the activations by normalizing each position's vector to zero mean and unit variance across its feature dimension, independently of the other positions.

The second sub-layer is the position-wise feedforward network (FFN). This is a simple two-layer neural network (typically with a ReLU activation in between) applied independently and identically to each position in the sequence. While it operates per position, it allows for mixing and transforming the features produced by the attention mechanism. Its formula is FFN(x) = max(0, xW₁ + b₁)W₂ + b₂. A residual connection and layer normalization also surround this FFN.

Crucially, dropout is applied as a regularization technique at several points: to the sums computed by the residual connections and often within the feedforward network (applied to the ReLU activation output). This prevents overfitting by randomly "dropping out" a fraction of neurons during training.
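Putting the pieces together, a post-norm encoder layer can be sketched as below. This is an inference-time sketch, so dropout is omitted, and the attention sub-layer is passed in as a callable (here stubbed with the identity) rather than implemented inline:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector across the feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def encoder_layer(X, attn_fn, W1, b1, W2, b2):
    """Post-norm encoder layer: each sub-layer is wrapped as
    LayerNorm(x + Sublayer(x)); dropout omitted for clarity."""
    X = layer_norm(X + attn_fn(X))                 # sub-layer 1: self-attention
    ffn = np.maximum(0, X @ W1 + b1) @ W2 + b2     # position-wise FFN with ReLU
    return layer_norm(X + ffn)                     # sub-layer 2: FFN

rng = np.random.default_rng(2)
d_model, d_ff, seq_len = 8, 32, 4
X = rng.normal(size=(seq_len, d_model))
W1 = rng.normal(size=(d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)); b2 = np.zeros(d_model)
out = encoder_layer(X, lambda x: x, W1, b1, W2, b2)  # identity attention stub
```

A full encoder simply stacks N such layers, each with its own parameters; the learnable scale and shift of layer norm are also dropped here for brevity.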

Advanced Architectural Variations and Design Choices

As the architecture has evolved, significant variations have emerged that impact training stability and final performance. The most debated is pre-norm versus post-norm layer ordering.

The original Transformer uses a post-norm configuration: the order of operations is Self-Attention → Residual Add → Layer Norm → Feedforward → Residual Add → Layer Norm. In contrast, many modern implementations like GPT-2 and T5 use pre-norm: Layer Norm → Self-Attention → Residual Add → Layer Norm → Feedforward → Residual Add. Pre-norm places layer normalization before the sub-layer, not after. This often leads to more stable training, especially in very deep models, as it ensures the input to each sub-layer is normalized, mitigating gradient issues.
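The difference is easiest to see side by side. A minimal sketch, with the sub-layer passed in as a callable:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def post_norm_block(x, sublayer):
    # Original Transformer: normalize AFTER the residual addition
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Modern variant: normalize the sub-layer input; the residual
    # path itself is never normalized, so gradients flow through unchanged
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(5)
x = rng.normal(size=(3, 8))
sub = lambda v: 0.5 * v                  # toy stand-in for attention or FFN
a = post_norm_block(x, sub)
b = pre_norm_block(x, sub)
```

The key structural point is in `pre_norm_block`: the identity path from input to output is untouched by normalization, which is a common explanation for pre-norm's stability in deep stacks.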

The choice and implementation of positional encoding also present a design space. Beyond sinusoidal and learned embeddings, alternatives include relative positional encodings (where attention scores are modified based on the relative distance between tokens) and rotary positional embeddings (RoPE), which have become popular in newer models. The choice here can significantly affect a model's ability to generalize to sequences longer than those seen during training.

How Encoder Representations Power Downstream Tasks

The output of the final encoder layer is a sequence of contextualized embeddings—each token's representation is informed by the entire input context. These representations are the engine for tasks like classification and information extraction.

For sequence classification (e.g., sentiment analysis), a special token, typically [CLS], is prepended to the input. The final hidden state corresponding to this token is used as the aggregate sequence representation, which is then fed into a classifier head (a small neural network). The model learns to pool the necessary information into this single vector.
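A hedged sketch of this pooling step, assuming [CLS] sits at position 0 and using a single linear layer plus softmax as the classifier head (names are illustrative):

```python
import numpy as np

def classify_from_cls(encoder_output, W_cls, b_cls):
    """Pool via the first position (the prepended [CLS] token) and classify."""
    cls_vec = encoder_output[0]          # (d_model,) aggregate representation
    logits = cls_vec @ W_cls + b_cls
    z = logits - logits.max()            # numerically stable softmax
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(3)
hidden = rng.normal(size=(6, 8))         # 6 tokens ([CLS] + 5 words), d_model = 8
W_cls = rng.normal(size=(8, 3)); b_cls = np.zeros(3)   # e.g. 3 sentiment classes
probs = classify_from_cls(hidden, W_cls, b_cls)
```

During fine-tuning, the gradient from this head flows back through the [CLS] position, which is what trains the encoder to concentrate sequence-level information there.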

For token-level tasks like named entity recognition or question answering, the final hidden state for each input token is used directly. Each token's representation is passed to a classifier to predict a label for that position (e.g., B-PERSON, I-PERSON, O). For extraction tasks, this allows the model to leverage the rich contextual information built by the encoder's self-attention layers to disambiguate word senses and relationships.
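For token-level prediction the head is even simpler: one shared linear layer applied to every position. A sketch with an invented label map:

```python
import numpy as np

def token_labels(encoder_output, W_tag, b_tag, id2label):
    """Predict one label per position from its final hidden state."""
    logits = encoder_output @ W_tag + b_tag      # (seq_len, num_labels)
    ids = logits.argmax(axis=-1)
    return [id2label[i] for i in ids]

rng = np.random.default_rng(4)
hidden = rng.normal(size=(5, 8))                 # 5 tokens, d_model = 8
W_tag = rng.normal(size=(8, 3)); b_tag = np.zeros(3)
labels = token_labels(hidden, W_tag, b_tag,
                      {0: "B-PERSON", 1: "I-PERSON", 2: "O"})
```

Real NER systems often add a CRF or constrained decoding on top so that label transitions (e.g. I-PERSON cannot follow O) stay valid; the argmax here is the bare minimum.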

Common Pitfalls

  1. Misunderstanding the "Position-wise" in FFN: A common mistake is to think the feedforward network mixes information across positions. It does not; it operates independently on each position's vector. The mixing across the sequence happens exclusively in the multi-head attention sub-layer. Confusing this can lead to incorrect assumptions about model capacity.
  2. Improper Masking in the Encoder: While the encoder typically uses a "full" attention mask (all tokens attend to all tokens), padding masks are still essential. Failing to apply a padding mask to the attention scores (by adding a large negative number to positions corresponding to padding tokens before the softmax) means the model will incorporate meaningless padding into its contextual representations, harming performance.
  3. Neglecting Gradient Norm with Pre-norm: Pre-norm architectures are more stable but can sometimes lead to larger gradient magnitudes in the early layers. While less common than in post-norm, not monitoring gradient norms or omitting gradient clipping can still cause training instability in very deep pre-norm stacks.
  4. Incorrect Parameter Sharing Assumptions: Within a layer, the multi-head attention and FFN weights are shared across positions: the same matrices are applied independently at every position. Distinct layers in the stack, however, each have their own parameters. This differs from a recurrent network, where a single cell is reused across time steps and processes positions sequentially rather than in parallel.
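Pitfall 2 is worth making concrete. A minimal sketch of padding-masked attention weights (the `-1e9` sentinel is a common convention, not a fixed standard):

```python
import numpy as np

def masked_attention_weights(scores, pad_mask):
    """Apply a padding mask before softmax.
    scores: (seq_len, seq_len) raw attention logits.
    pad_mask: bool (seq_len,), True for real tokens, False for padding.
    Padded key positions get a large negative logit, so their
    post-softmax weight is effectively zero."""
    scores = np.where(pad_mask[None, :], scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                        # uniform logits over 4 positions
pad_mask = np.array([True, True, True, False])   # last position is padding
w = masked_attention_weights(scores, pad_mask)
```

With the mask applied, every query distributes its attention only over the three real tokens; without it, a quarter of each token's representation would come from padding.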

Summary

  • The transformer encoder is built from stacked layers, each containing a multi-head self-attention sub-layer and a position-wise feedforward network, stabilized by residual connections and layer normalization.
  • Positional encoding, either fixed or learned, is added to token embeddings to provide the model with sequence order information, which the self-attention mechanism lacks.
  • The pre-norm variant (Layer Norm before the sub-layer) is often preferred over the original post-norm for improved training stability in deep networks.
  • Encoder output representations power tasks by using a special [CLS] token's state for sequence classification or the states of all tokens for token-level extraction and labeling.
  • Key implementation details include applying dropout to residual sums and using correct padding masks during attention calculation to ignore irrelevant tokens.
