Multi-Head Attention

The baseline attention design that gives every query head its own key-value head pair.

Transformer blocks need each token position to gather context from others; multi-head attention (MHA) runs that lookup in parallel by splitting projected queries, keys, and values into H independent heads so each head can specialize while outputs are concatenated back together.

At a glance

Optimizes

Expressiveness

Practical benefits

per-head query specialization

Example models

None listed yet.

What It Is

Multi-head attention is the original scaled dot-product attention used in the Transformer. Each head computes its own attention distribution over keys and values, then the head outputs are merged before the next sublayer.

What It Optimizes

MHA targets representational breadth: distinct query heads can attend to different relationships in the same sequence without sharing key-value parameters.

Practical Benefit

Models keep full per-head expressiveness at training and inference time. MHA remains the reference design when quality matters more than KV-cache size, and it is the baseline against which multi-query attention and grouped-query attention are compared.

How It Works

Linear projections produce H query heads, H key heads, and H value heads. Each query head i attends to its matching key head i and value head i, producing H outputs that are concatenated and projected again.

Toggle MHA and MQA to compare query-head count against KV-head count on one canvas

Math Or Compute Schema

The formula below is the per-head attention computation used inside multi-head attention. Each head index i pairs its own query, key, and value tensors before outputs are merged.

Multi-head attention (MHA)

\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i

Q: Query vectors for head i.
K: Key vectors for head i.
V: Value vectors for head i.
H: Number of query heads.
d_k: Key dimension per head.
i: Query head index.

Compared To Nearby Modules

Compared with multi-query attention, MHA stores one KV pair per query head, so cache size scales with H. Compared with grouped-query attention, MHA avoids shared KV groups and keeps the largest KV footprint in the head-sharing family.

Comparison dimension	Multi-Head Attention	Multi-Query Attention	Grouped-Query Attention
KV head count	H key heads and H value heads	1 key head and 1 value head	G key heads and G value heads
Query-head flexibility	H independent query/KV head pairs	H query heads share one KV pair	H distinct query heads grouped into G shared KV pairs
Cache footprint per token	2H tensors (full multi-head attention cache)	2 tensors (single shared KV cache)	2G tensors (G keys + G values)

How MHA compares with nearby attention variants

Example Architectures

MHA appears in the original Transformer encoder-decoder, early GPT-style decoder-only language models, and many vision and speech transformers that adopt the standard attention block.

Example model links will render from registry usage in a later story.

Limitations And Tradeoffs

Autoregressive inference must cache every key and value head, so memory and bandwidth grow with context length and head count. Serving long contexts often motivates MQA or GQA instead.

Why It Still Matters

MHA is the conceptual baseline for every head-sharing variant: understanding its one-to-one query-to-KV mapping makes MQA and GQA tradeoffs easier to reason about.

References

Vaswani, Ashish, et al. "Attention Is All You Need." arXiv, 2017, https://arxiv.org/abs/1706.03762.

Multi-Head Attention

At a glance

Optimizes

Practical benefits

Example models

What It Is

What It Optimizes

Practical Benefit

How It Works

Math Or Compute Schema

Compared To Nearby Modules

Example Architectures

Limitations And Tradeoffs

Why It Still Matters

Tags

References

On this page

Multi-Head Attention

At a glance

Optimizes

Practical benefits

Example models

What It Is

What It Optimizes

Practical Benefit

How It Works

Math Or Compute Schema

Compared To Nearby Modules

Example Architectures

Limitations And Tradeoffs

Why It Still Matters

Related

Tags

References

On this page