Multi-Head Attention

The baseline attention design that gives every query head its own key-value head pair.

Transformer blocks need each token position to gather context from other positions. Multi-head attention (MHA) runs that lookup in parallel: it splits the projected queries, keys, and values into H independent heads, lets each head specialize, and concatenates the head outputs back together.

At a glance

Optimizes

  • Expressiveness

Practical benefits

  • per-head query specialization

Example models

None listed yet.

What It Is

Multi-head attention is the attention mechanism introduced with the original Transformer. It runs scaled dot-product attention in H parallel heads: each head computes its own attention distribution over its keys and values, then the head outputs are merged before the next sublayer.

What It Optimizes

MHA targets representational breadth: distinct query heads can attend to different relationships in the same sequence without sharing key-value parameters.

Practical Benefit

Models keep full per-head expressiveness at training and inference time. MHA remains the reference design when quality matters more than KV-cache size, and it is the baseline against which multi-query attention and grouped-query attention are compared.

How It Works

Linear projections of the input produce H query heads, H key heads, and H value heads. Each query head i attends only to its matching key head i and value head i. The resulting H head outputs are concatenated and passed through a final output projection.
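
The steps above can be sketched in NumPy. This is a minimal, unmasked, single-sequence illustration rather than a reference implementation: the function name, the weight names (W_q, W_k, W_v, W_o), the random weights, and the sizes are all assumptions chosen for the example, and it assumes the model width divides evenly by H.

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, H):
    """x: (T, d_model); each W_*: (d_model, d_model); returns (T, d_model)."""
    T, d_model = x.shape
    d_k = d_model // H  # per-head width (assumes d_model is divisible by H)

    def split_heads(t):
        # (T, d_model) -> (H, T, d_k)
        return t.reshape(T, H, d_k).transpose(1, 0, 2)

    Q = split_heads(x @ W_q)
    K = split_heads(x @ W_k)
    V = split_heads(x @ W_v)

    # Head i only ever sees Q[i], K[i], V[i]: the one-to-one query-to-KV mapping.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (H, T, T)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    heads = weights @ V                                # (H, T, d_k)

    # Concatenate the H head outputs, then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(T, d_model)
    return concat @ W_o

# Illustrative sizes and random weights (hypothetical, not from any model).
rng = np.random.default_rng(0)
T, d_model, H = 5, 16, 4
x = rng.standard_normal((T, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))
y = multi_head_attention(x, W_q, W_k, W_v, W_o, H)
```

The heads are handled as a leading batch axis, so no head reads another head's keys or values. A decoder would additionally mask future positions before the softmax, which this sketch leaves out.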

Math Or Compute Schema

The formula below is the per-head attention computation used inside multi-head attention. Each head index i pairs its own query, key, and value tensors before outputs are merged.
Multi-head attention (MHA)

$$\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$

  • Q_i: Query vectors for head i.
  • K_i: Key vectors for head i.
  • V_i: Value vectors for head i.
  • H: Number of query heads.
  • d_k: Key dimension per head.
  • i: Query head index.
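
As a hedged illustration of the per-head formula, the sketch below evaluates it for a single head in NumPy. The helper names (softmax, head_attention) and the sizes are assumptions for the example. It also assumes the value width equals d_k and applies no causal mask.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_attention(Q_i, K_i, V_i):
    """Attention(Q_i, K_i, V_i) for one head; each input has shape (T, d_k)."""
    d_k = Q_i.shape[-1]
    scores = Q_i @ K_i.T / np.sqrt(d_k)   # (T, T) scaled dot products
    weights = softmax(scores, axis=-1)    # each row is a distribution over keys
    return weights @ V_i, weights

# Illustrative sizes only.
rng = np.random.default_rng(0)
T, d_k = 4, 8
Q_i, K_i, V_i = (rng.standard_normal((T, d_k)) for _ in range(3))
out, weights = head_attention(Q_i, K_i, V_i)
```

Each row of weights sums to 1, so every output row is a weighted average of the value vectors for that head.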

Compared To Nearby Modules

Compared with multi-query attention, MHA stores one KV pair per query head, so cache size scales with H. Compared with grouped-query attention, MHA avoids shared KV groups and keeps the largest KV footprint in the head-sharing family.

| Comparison dimension | Multi-Head Attention | Multi-Query Attention | Grouped-Query Attention |
| --- | --- | --- | --- |
| KV head count | H key heads and H value heads | 1 key head and 1 value head | G key heads and G value heads |
| Query-head flexibility | H independent query/KV head pairs | H query heads share one KV pair | H distinct query heads grouped into G shared KV pairs |
| Cache footprint per token | 2H tensors (full multi-head attention cache) | 2 tensors (single shared KV cache) | 2G tensors (G keys + G values) |

How MHA compares with nearby attention variants
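
One way to see the three variants side by side is to compute which KV head each query head reads from. The sketch below is an illustration under an assumption: query heads are grouped contiguously, which is a common convention and is not specified on this page. The function name kv_head_for_query is hypothetical.

```python
def kv_head_for_query(i, num_query_heads, num_kv_heads):
    """Index of the KV head that query head i reads from (contiguous grouping assumed)."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return i // group_size

H = 8  # illustrative query-head count
for name, G in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    mapping = [kv_head_for_query(i, H, G) for i in range(H)]
    # 2 * G counts the cached key and value tensors per token, as in the table.
    print(f"{name}: query -> KV head {mapping}, cached tensors per token = {2 * G}")
```

With G = H the mapping is the identity, which is MHA. With G = 1 every query head maps to KV head 0, which is MQA.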

Example Architectures

MHA appears in the original Transformer encoder-decoder, early GPT-style decoder-only language models, and many vision and speech transformers that adopt the standard attention block.

Limitations And Tradeoffs

Autoregressive inference must cache the key and value vectors of every head for every past token, so KV-cache memory and the bandwidth needed to read it grow linearly with both context length and head count. Serving long contexts often motivates MQA or GQA instead.
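
A rough back-of-the-envelope estimate makes the scaling concrete. The sketch below assumes the cache holds one key and one value vector of width d_k per KV head, per layer, per token, stored at 2 bytes per value. The layer count, head count, and head width are a hypothetical configuration, not taken from any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, d_k, context_len, bytes_per_value=2):
    """Approximate bytes needed to cache keys and values for one sequence."""
    # The leading 2 accounts for storing both a key and a value vector per head.
    return 2 * num_layers * num_kv_heads * d_k * context_len * bytes_per_value

# Hypothetical configuration for illustration only.
layers, H, d_k = 32, 32, 128
for context_len in (2_048, 32_768):
    mha = kv_cache_bytes(layers, H, d_k, context_len)   # H KV heads
    mqa = kv_cache_bytes(layers, 1, d_k, context_len)   # one shared KV head
    print(f"T={context_len}: MHA {mha / 2**30:.1f} GiB, MQA {mqa / 2**30:.3f} GiB")
```

Under these assumptions the MHA cache is exactly H times the MQA cache at every context length, and both grow in proportion to the number of cached tokens.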

Why It Still Matters

MHA is the conceptual baseline for every head-sharing variant: understanding its one-to-one query-to-KV mapping makes MQA and GQA tradeoffs easier to reason about.

References

  1. Vaswani, Ashish, et al. "Attention Is All You Need." arXiv, 2017, https://arxiv.org/abs/1706.03762.