Multi-Head Self-Attention Implementation
Understanding the mechanics of self-attention is the single most important step in mastering transformer architectures, which power modern LLMs, vision models, and more. This article builds the mechanism from the ground up, moving from the fundamental scaled dot-product operation to the parallel efficiency of multi-head attention, and finally to modern optimizations like Flash Attention that make processing long sequences feasible. By the end, you will not only grasp the theory but also know how to implement it efficiently, understanding the critical trade-offs in computation and memory.
Scaled Dot-Product Attention: The Core Operation
At its heart, self-attention is a mechanism that allows a model to weigh the importance of different parts of its input when producing an output. The fundamental operation is Scaled Dot-Product Attention. Imagine you have a sequence of input vectors (e.g., word embeddings). For each position in the sequence, the model learns three new linear projections: a Query, a Key, and a Value.
The Query (Q) represents what the current position is "looking for." The Key (K) represents what information each position "contains." The Value (V) is the actual content that will be aggregated. The attention score between position i and position j is computed as the dot product of q_i and k_j, which measures their compatibility. In matrix form for the entire sequence, we compute QK^T.
These raw scores are then scaled by the square root of the key dimension (sqrt(d_k)) to prevent the dot products from growing too large in magnitude, which would push the softmax function into regions with extremely small gradients. After scaling, a softmax is applied row-wise to convert the scores into a probability distribution (attention weights). Finally, these weights are used to take a weighted sum of the Value vectors, producing the output for each position.
The full equation is:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
A basic Python implementation looks like this:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V shapes: (seq_len, d_k) or (batch_size, seq_len, d_k)
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
```

Multi-Head Parallelism and Projected Subspaces
A single attention head operates on the full dimensionality of the model. Multi-head attention runs multiple, independent scaled dot-product attention operations in parallel. This is powerful because it allows the model to jointly attend to information from different representation subspaces at different positions. One head might learn to focus on local grammatical dependencies, while another captures long-range thematic connections.
Technically, this is achieved by projecting the original input into different sets of Query, Key, and Value matrices, each with a reduced dimension. Typically, if the model's embedding dimension is d_model and there are h heads, then each head uses d_k = d_v = d_model / h. The input is linearly projected using learned weight matrices W_i^Q, W_i^K, and W_i^V for each head i. After the attention computation is performed in parallel for all heads, their outputs are concatenated and projected once more with a final weight matrix W^O to produce the multi-head output.
The process is:
- For each head i: head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
- Concatenate: Concat(head_1, ..., head_h)
- Final linear projection: MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
This design maintains a similar total computational cost to a single-head attention layer with full dimensionality while dramatically increasing representational capacity and parallelizability.
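The head-splitting pipeline described above can be sketched from scratch in a few lines. This is a minimal illustration, assuming bias-free projections and omitting masking and dropout, which production implementations include; it also uses one fused d_model x d_model projection per role, which is equivalent to stacking the per-head W_i matrices.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    # Minimal sketch: no masking, no dropout, bias-free projections.
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must divide evenly across heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # One fused projection per role is equivalent to per-head W_i matrices
        self.W_q = nn.Linear(d_model, d_model, bias=False)
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        batch_size, seq_len, d_model = x.shape

        # Project, then reshape to (batch, heads, seq_len, d_k)
        def split_heads(t):
            return t.view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        Q = split_heads(self.W_q(x))
        K = split_heads(self.W_k(x))
        V = split_heads(self.W_v(x))

        # Scaled dot-product attention, computed for all heads in parallel
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (self.d_k ** 0.5)
        weights = torch.softmax(scores, dim=-1)
        out = torch.matmul(weights, V)

        # Concatenate heads back to (batch, seq_len, d_model), then apply W^O
        out = out.transpose(1, 2).contiguous().view(batch_size, seq_len, d_model)
        return self.W_o(out)
```

Note that the "concatenation" falls out of a single reshape, since the heads already live side by side in the channel dimension after the transpose.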
Implementing Attention Masks: Causal and Padding
In practice, sequences are batched and often padded to the same length. Furthermore, for autoregressive tasks like language modeling, a position must not attend to future positions. Attention masks are essential tools to handle these constraints.
A padding mask prevents the model from attending to padding tokens (which contain no real information). It is typically a boolean matrix of shape (batch_size, seq_len) that is True for padding positions. This mask is unsqueezed and expanded to match the attention score shape (batch_size, num_heads, seq_len, seq_len). Before the softmax, we add a large negative value (e.g., -1e9) to the attention scores at masked positions, ensuring the softmax assigns them zero weight.
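The unsqueeze-and-broadcast step is a frequent source of shape bugs, so here is a small self-contained check of it (with arbitrary illustrative sizes), verifying that masked key positions end up with essentially zero attention weight:

```python
import torch
import torch.nn.functional as F

batch_size, num_heads, seq_len = 2, 4, 5
scores = torch.randn(batch_size, num_heads, seq_len, seq_len)

# True marks padding; here the last two tokens of each sequence are padding
padding_mask = torch.zeros(batch_size, seq_len, dtype=torch.bool)
padding_mask[:, -2:] = True

# (batch, seq_len) -> (batch, 1, 1, seq_len): broadcasts over heads and query positions
expanded = padding_mask.unsqueeze(1).unsqueeze(2)
weights = F.softmax(scores.masked_fill(expanded, -1e9), dim=-1)

print(weights[0, 0, 0])  # last two entries are (numerically) zero
```

Each row still sums to 1; the probability mass is simply redistributed over the non-padding positions.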
A causal mask (or autoregressive mask) ensures the autoregressive property. It is a triangular matrix where the upper triangular part (representing connections to future tokens) is masked. For a sequence length of 4, the additive causal mask looks like:

```
[ [0,    -inf, -inf, -inf],
  [0,    0,    -inf, -inf],
  [0,    0,    0,    -inf],
  [0,    0,    0,    0   ] ]
```

Implementing this involves combining the mask with the scores before the softmax:
```python
def attention_with_masking(Q, K, V, padding_mask=None, causal=False):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if causal:
        seq_len = Q.size(-2)
        # Upper-triangular (future) positions are True and get masked out
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, device=Q.device), diagonal=1).bool()
        scores = scores.masked_fill(causal_mask, -1e9)
    if padding_mask is not None:
        # padding_mask shape: (batch_size, seq_len), True at padding positions
        # Expand to (batch_size, 1, 1, seq_len) so it broadcasts over all heads
        padding_mask = padding_mask.unsqueeze(1).unsqueeze(2)
        scores = scores.masked_fill(padding_mask, -1e9)
    attention_weights = F.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output
```

Computational Complexity and the Bottleneck
The computational complexity of self-attention is a critical consideration. The dominant operation is the matrix multiplication QK^T. For a sequence of length n and embedding dimension d, this multiplication has a complexity of O(n^2 · d). The O(n^2) term is the bottleneck because it grows quadratically with sequence length. This makes processing very long documents, high-resolution images, or long audio sequences prohibitively expensive in terms of both computation and memory (as the n × n attention matrix must be stored).
The O(n · d^2) cost from the linear projections is typically less significant for long sequences. This quadratic bottleneck is why traditional transformers struggle with extremely long contexts and has driven the development of efficient attention variants like Flash Attention.
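A back-of-envelope calculation makes the quadratic blow-up concrete. The numbers below are illustrative, not from any particular model: fp16 activations (2 bytes), batch size 1, and 16 heads, each materializing an n × n score matrix in a single layer.

```python
# Memory for one layer's materialized attention score matrices:
# num_heads matrices of shape (n, n), fp16 (2 bytes per element).
def attention_matrix_gib(seq_len, num_heads=16, bytes_per_elem=2):
    return num_heads * seq_len * seq_len * bytes_per_elem / 2**30

for n in (1_024, 8_192, 65_536):
    print(f"n={n:>6}: {attention_matrix_gib(n):8.2f} GiB")
# n=  1024:     0.03 GiB
# n=  8192:     2.00 GiB
# n= 65536:   128.00 GiB
```

Doubling the sequence length quadruples the score-matrix memory, which is why 64k-token contexts are infeasible with naive attention even before counting activations elsewhere in the network.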
Flash Attention: Memory-Efficient Exact Attention
Flash Attention is a groundbreaking algorithm that provides a much more memory-efficient way to compute exact attention, without approximation. The standard attention implementation materializes the large attention score matrix in GPU memory (High Bandwidth Memory, or HBM), leading to O(n^2) memory reads/writes. Flash Attention tackles this by restructuring the computation using classical techniques from online softmax and block-based matrix multiplication.
Its core innovation is to compute the attention output in small blocks, loading only chunks of Q, K, and V from slow HBM into fast SRAM (on-chip memory) at a time. It performs the softmax rescaling in a numerically stable, iterative manner across these blocks. This results in two major benefits:
- Dramatically Reduced Memory Reads/Writes: The algorithm achieves an O(n) memory footprint with respect to sequence length, as it never stores the full n × n score matrix. This leads to 2-4x faster training and enables much longer context lengths.
- Exact Attention: Unlike methods like sparse or linear attention, Flash Attention computes the exact same result as the standard implementation, just more efficiently.
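The running-softmax bookkeeping behind these benefits can be illustrated with a toy, single-query version. This is a sketch of the online-softmax idea only, not the actual tiled GPU kernel: it streams over blocks of scores, maintaining a running maximum m, normalizer l, and un-normalized output accumulator, rescaling them whenever a new block raises the maximum.

```python
import torch

def online_softmax_weighted_sum(scores, V, block_size=2):
    # Toy online softmax for a single query row: process key blocks one at a
    # time, keeping a running max (m), running normalizer (l), and an
    # un-normalized output accumulator (acc).
    m = torch.tensor(float("-inf"))
    l = torch.tensor(0.0)
    acc = torch.zeros(V.size(-1))
    for start in range(0, scores.numel(), block_size):
        s = scores[start:start + block_size]
        v = V[start:start + block_size]
        m_new = torch.maximum(m, s.max())
        # Rescale previous accumulators to the new max before adding this block
        scale = torch.exp(m - m_new)
        p = torch.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / l

scores = torch.randn(6)   # attention logits for one query over 6 keys
V = torch.randn(6, 4)
reference = torch.softmax(scores, dim=-1) @ V
assert torch.allclose(online_softmax_weighted_sum(scores, V), reference, atol=1e-6)
```

The result matches the ordinary softmax exactly (up to floating point); the full algorithm applies the same rescaling per block of queries while tiling K and V through SRAM.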
Conceptually, you can think of it as streaming the K and V matrices and incrementally updating a running softmax and output for each block of Q. While the implementation is complex, libraries like flash-attn allow you to use it as a drop-in replacement:
```python
# Instead of: output = F.scaled_dot_product_attention(Q, K, V, is_causal=True)
from flash_attn import flash_attn_func

# Note: flash-attn expects half-precision tensors of shape
# (batch_size, seq_len, num_heads, head_dim) on a CUDA device.
output = flash_attn_func(Q, K, V, causal=True)
```

Common Pitfalls
- Forgetting to Scale the Dot Product: Omitting the division by sqrt(d_k) is a common error. This causes the variance of the attention logits to grow with d_k, pushing the softmax into a region where it outputs extremely sharp distributions (one-hot-like), which severely dampens gradients during training.
- Incorrect Masking Application: Applying the mask after the softmax is ineffective. The softmax will have already distributed probability mass to the masked positions. The large negative value must be added to the scores before the softmax to zero out those positions correctly.
- Misaligning Projection Dimensions in Multi-Head Attention: When implementing multi-head attention from scratch, it's easy to mismatch the dimensions for the head projections and the final concatenation. A reliable check is to ensure that num_heads * d_k == d_model; the final projection W^O should have shape (d_model, d_model).
- Ignoring the Memory Bottleneck with Long Sequences: When prototyping, standard attention on short sequences works fine. However, moving to long-sequence tasks without considering the memory blow-up will quickly lead to "out-of-memory" errors. Planning for memory-efficient attention (like Flash Attention) from the start is crucial for scaling.
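The first pitfall is easy to demonstrate empirically. Below is a quick experiment with an arbitrary d_k = 512 and random vectors: unscaled logits have variance on the order of d_k, so the softmax typically collapses toward a one-hot distribution, while the scaled version stays much flatter.

```python
import torch

torch.manual_seed(0)
d_k = 512
q = torch.randn(d_k)        # one query vector
k = torch.randn(10, d_k)    # ten key vectors

logits = q @ k.T                                  # variance grows with d_k
unscaled = torch.softmax(logits, dim=-1)          # typically near one-hot
scaled = torch.softmax(logits / d_k**0.5, dim=-1) # much flatter distribution

print(unscaled.max(), scaled.max())
```

Dividing the logits by sqrt(d_k) acts like a temperature greater than 1, so the maximum probability of the scaled distribution is strictly lower whenever the logits are not all equal.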
Summary
- Scaled Dot-Product Attention computes compatibility scores between queries and keys, scales them by 1/sqrt(d_k) for stable gradients, applies softmax to get weights, and uses these to sum values.
- Multi-Head Attention runs multiple attention operations in parallel on linearly projected subspaces of the input, allowing the model to capture diverse types of relationships before recombining the information.
- Attention Masks are essential: padding masks ignore invalid tokens, and causal masks enforce the autoregressive property in decoders by preventing attention to future positions.
- The primary computational bottleneck is the O(n^2) complexity in sequence length, stemming from the QK^T operation, which limits context length.
- Flash Attention is an IO-aware algorithm that recomputes attention on-the-fly in blocks, providing exact attention with an O(n) memory footprint, enabling faster training and longer contexts.