Multi-Query Attention

An attention variant that shares one key-value head across all query heads to minimize KV-cache memory.

Autoregressive inference must store keys and values for every past token, and that KV cache grows with context length and KV-head count. Multi-query attention (MQA) keeps H query heads but has all of them attend over a single shared key head and value head, so the cache stores only one key head and one value head per layer instead of H of each.

At a glance

Optimizes

KV Cache
Memory Bandwidth

Practical benefits

lower KV-cache memory

Example models

None listed yet.

What It Is

Multi-query attention is an attention variant derived from multi-head attention. Every query head still computes its own attention scores, but all heads read from the same key head and value head.

What It Optimizes

MQA targets KV-cache footprint and memory bandwidth during autoregressive decoding. Sharing one KV pair across all query heads minimizes the number of tensors stored per token.
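To make the footprint concrete, here is a minimal back-of-the-envelope sketch. The function name `kv_cache_bytes` and the configuration (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2 bytes per element) are hypothetical values chosen for illustration, not figures from any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for a batch of sequences."""
    # Leading factor of 2: one key tensor and one value tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration, chosen only for illustration.
layers, heads, head_dim, seq_len = 32, 32, 128, 4096

mha = kv_cache_bytes(layers, heads, head_dim, seq_len)  # H KV heads
mqa = kv_cache_bytes(layers, 1, head_dim, seq_len)      # 1 shared KV head

print(f"MHA: {mha / 2**20:.0f} MiB")   # MHA: 2048 MiB
print(f"MQA: {mqa / 2**20:.0f} MiB")   # MQA: 64 MiB
print(f"ratio: {mha // mqa}x")         # ratio: 32x
```

Under these assumptions the saving is exactly a factor of H, because the number of KV heads is the only term that changes between the two variants.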

Practical Benefit

Serving stacks adopt MQA when long-context inference would otherwise multiply cache size by head count. MQA delivers the smallest KV footprint in the head-sharing family while keeping distinct query projections.

How It Works

Linear projections still produce H query heads, but key and value projections collapse to a single head each. Every query head i attends against the shared K and V tensors, then outputs are concatenated and projected as usual.
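The steps above can be sketched in NumPy as follows. This is an illustrative single-sequence implementation, not a reference one: the function name `mqa_forward`, the weight names, the causal mask, and the assumption that values use the same per-head dimension as keys are all choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mqa_forward(x, w_q, w_k, w_v, w_o, num_heads):
    """Causal multi-query attention for one sequence.

    x:   (T, d_model) input tokens
    w_q: (d_model, H * d_k) query projection, one slice per head
    w_k: (d_model, d_k) single shared key projection
    w_v: (d_model, d_k) single shared value projection
    w_o: (H * d_k, d_model) output projection
    """
    T = x.shape[0]
    d_k = w_k.shape[1]
    # H distinct query heads: (H, T, d_k)
    q = (x @ w_q).reshape(T, num_heads, d_k).transpose(1, 0, 2)
    # One shared key head and one shared value head: (T, d_k) each.
    # These two arrays are what a KV cache would store for this layer.
    k = x @ w_k
    v = x @ w_v
    # Every query head scores against the same keys: (H, T, T)
    scores = q @ k.T / np.sqrt(d_k)
    # Causal mask: position t may not attend to positions after t.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                      # (H, T, d_k)
    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(T, num_heads * d_k)
    return concat @ w_o


# Usage with random weights (sizes are arbitrary).
rng = np.random.default_rng(0)
T, d_model, H, d_k = 5, 16, 4, 8
x = rng.standard_normal((T, d_model))
w_q = rng.standard_normal((d_model, H * d_k))
w_k = rng.standard_normal((d_model, d_k))
w_v = rng.standard_normal((d_model, d_k))
w_o = rng.standard_normal((H * d_k, d_model))
out = mqa_forward(x, w_q, w_k, w_v, w_o, num_heads=H)
print(out.shape)  # (5, 16)
```

Note that `k` and `v` carry no head axis at all; the head dimension appears only on the query side and in the attention weights.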


Math Or Compute Schema

The formulas below contrast how multi-head attention pairs every query head with its own KV heads versus multi-query attention, which routes all query heads through one shared KV pair.

Multi-head attention (MHA)

\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i

Q_i: Query vectors for head i.
K_i: Key vectors for head i.
V_i: Value vectors for head i.
H: Number of query heads.
d_k: Key dimension per head.
i: Query head index.

Multi-query attention (MQA)

\text{Attention}(Q_i, K, V) = \mathrm{softmax}\!\left(\frac{Q_i K^{\top}}{\sqrt{d_k}}\right) V

Q_i: Query vectors for head i.
K: Shared key vectors (single KV head).
V: Shared value vectors (single KV head).
H: Number of query heads.
d_k: Key dimension per head.
i: Query head index.
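As a sketch of where the shared K and V come from, the MQA formula can be expanded with explicit projections. The symbols X (layer input), W_i^Q, W^K, W^V, W^O (projection matrices), and d_model are standard Transformer notation assumed here; they are not defined elsewhere on this page.

Q_i = X W_i^{Q}, \qquad K = X W^{K}, \qquad V = X W^{V}

\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K^{\top}}{\sqrt{d_k}}\right) V, \qquad i = 1, \dots, H

\mathrm{MQA}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_H)\, W^{O}

Under this notation, and assuming values use the same per-head dimension as keys, W^K and W^V each have shape d_model × d_k. Multi-head attention would instead use H such matrices for keys and H for values, one pair per head.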

Compared To Nearby Modules

Compared with multi-head attention, MQA stores one KV pair instead of H. Compared with grouped-query attention, which shares one KV head within each of G groups of query heads, MQA compresses further by sharing a single KV head across all query heads (the G = 1 case).

Comparison dimension	Multi-Query Attention	Multi-Head Attention	Grouped-Query Attention
KV head count	1 key head and 1 value head	H key heads and H value heads	G key heads and G value heads
Query-head flexibility	H query heads share one KV pair	H independent query/KV head pairs	H distinct query heads grouped into G shared KV pairs
Cache footprint per token, per layer	2 head-sized vectors (1 key + 1 value)	2H head-sized vectors (H keys + H values)	2G head-sized vectors (G keys + G values)

How MQA compares with nearby attention variants
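The three variants in the table can be viewed as one scheme with a different number of KV heads. The sketch below maps each query head to the KV head it reads from. The helper name `kv_head_for_query` is hypothetical, and assigning contiguous blocks of query heads to each KV head is an assumed convention, not something this page specifies.

```python
def kv_head_for_query(query_head, num_query_heads, num_kv_heads):
    """Index of the KV head that a given query head attends over."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be divisible by num_kv_heads")
    # Assumed convention: contiguous blocks of query heads share one KV head.
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


H = 8  # query heads (illustrative)
for name, G in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    mapping = [kv_head_for_query(i, H, G) for i in range(H)]
    print(f"{name}: {mapping}")
# MHA: [0, 1, 2, 3, 4, 5, 6, 7]
# GQA: [0, 0, 0, 0, 1, 1, 1, 1]
# MQA: [0, 0, 0, 0, 0, 0, 0, 0]
```

Setting the KV-head count to H recovers multi-head attention, and setting it to 1 recovers MQA, which is why the cache footprint scales as 2H, 2G, and 2 respectively.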

Example Architectures

MQA appears in decoder-only language models that prioritize inference throughput and lean KV caches, including some large-scale serving configurations that accept the representational tradeoff.


Limitations And Tradeoffs

Sharing one KV head across all queries can reduce representational flexibility versus full multi-head attention or grouped-query attention. Quality-sensitive workloads may prefer GQA or MHA.

Why It Still Matters

MQA established the head-sharing design space. It marks the extreme end of KV compression, with one KV head for all query heads, and grouped-query attention offers a middle ground between cache size and query diversity.

References

Shazeer, Noam. "Fast Transformer Decoding: One Write-Head is All You Need." arXiv, 2019, https://arxiv.org/abs/1911.02150.
