Multi-Query Attention

An attention variant that shares one key-value head across all query heads to minimize KV-cache memory.

Autoregressive inference must store keys and values for every past token, and that KV cache grows with context length and head count. Multi-query attention (MQA) keeps H query heads but has all of them attend to a single shared key head and value head, so the cache stores only one K tensor and one V tensor per layer.

At a glance

Optimizes

  • KV Cache
  • Memory Bandwidth

Practical benefits

  • lower KV-cache memory

What It Is

Multi-query attention is an attention variant derived from multi-head attention. Every query head still computes its own attention scores, but all heads read from the same key head and value head.

What It Optimizes

MQA targets KV-cache footprint and memory bandwidth during autoregressive decoding. Sharing one KV pair across all query heads cuts the keys and values cached per token from H heads to one, which also reduces the bytes read from memory at each decoding step.

Practical Benefit

Serving stacks adopt MQA when long-context inference would otherwise multiply cache size by head count. MQA delivers the smallest KV footprint in the head-sharing family while keeping distinct query projections.
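
To make the head-count multiplier concrete, the sketch below computes KV-cache size for a hypothetical decoder. The layer count, context length, head dimension, 16-bit storage, the group count G = 8, and the `kv_cache_bytes` helper are all assumptions chosen for illustration, not figures from any particular model.

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to cache keys and values for one sequence batch.

    The leading factor 2 counts one K tensor and one V tensor per layer.
    """
    return (2 * batch_size * num_layers * seq_len
            * num_kv_heads * head_dim * bytes_per_elem)


# Hypothetical configuration (assumed values, 16-bit elements).
cfg = dict(num_layers=32, seq_len=8192, head_dim=128)
H, G = 32, 8  # query heads, and an assumed GQA group count

mha = kv_cache_bytes(num_kv_heads=H, **cfg)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=G, **cfg)  # G shared KV heads
mqa = kv_cache_bytes(num_kv_heads=1, **cfg)  # single shared KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
# MHA: 4096 MiB
# GQA: 1024 MiB
# MQA: 128 MiB
```

Under these assumptions the MQA cache is H times smaller than the MHA cache, since only the KV-head count changes between the three calls.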

How It Works

Linear projections still produce H query heads, but key and value projections collapse to a single head each. Every query head i attends against the shared K and V tensors, then outputs are concatenated and projected as usual.
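
The projection and concatenation steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name, weight shapes, and random inputs are assumptions, there is no batching, and causal masking is omitted for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_query_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Single-sequence MQA sketch.

    x:   (T, d_model)
    w_q: (d_model, H * d_k)   H query heads
    w_k: (d_model, d_k)       one shared key head
    w_v: (d_model, d_k)       one shared value head
    w_o: (H * d_k, d_model)   output projection
    """
    T = x.shape[0]
    d_k = w_k.shape[1]
    # H separate query heads: (H, T, d_k).
    q = (x @ w_q).reshape(T, num_heads, d_k).transpose(1, 0, 2)
    # A single K and V, shared by every query head: (T, d_k) each.
    k = x @ w_k
    v = x @ w_v
    # Broadcasting applies the same K to all H heads: (H, T, T).
    weights = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
    heads = weights @ v  # (H, T, d_k)
    # Concatenate heads along the feature axis, then project.
    concat = heads.transpose(1, 0, 2).reshape(T, num_heads * d_k)
    return concat @ w_o, k, v


rng = np.random.default_rng(0)
T, d_model, H, d_k = 5, 16, 4, 8
x = rng.normal(size=(T, d_model))
w_q = rng.normal(size=(d_model, H * d_k))
w_k = rng.normal(size=(d_model, d_k))
w_v = rng.normal(size=(d_model, d_k))
w_o = rng.normal(size=(H * d_k, d_model))

out, k_cache, v_cache = multi_query_attention(x, w_q, w_k, w_v, w_o, num_heads=H)
print(out.shape, k_cache.shape, v_cache.shape)  # (5, 16) (5, 8) (5, 8)
```

The returned K and V have no head axis, which is what a decoder would cache: their size depends on sequence length and d_k, not on the number of query heads.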

Math Or Compute Schema

The formulas below contrast multi-head attention, which pairs every query head with its own key and value head, with multi-query attention, which routes all query heads through one shared KV pair.

Multi-head attention (MHA)

\[
\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i
\]

  • Q_i: Query vectors for head i.
  • K_i: Key vectors for head i.
  • V_i: Value vectors for head i.
  • H: Number of query heads.
  • d_k: Key dimension per head.
  • i: Query head index.

Multi-query attention (MQA)

\[
\text{Attention}(Q_i, K, V) = \mathrm{softmax}\!\left(\frac{Q_i K^{\top}}{\sqrt{d_k}}\right) V
\]

  • Q_i: Query vectors for head i.
  • K: Shared key vectors (single KV head).
  • V: Shared value vectors (single KV head).
  • H: Number of query heads.
  • d_k: Key dimension per head.
  • i: Query head index.

Compared To Nearby Modules

Compared with multi-head attention, MQA stores one KV pair instead of H. Compared with grouped-query attention, MQA compresses further by sharing a single KV head across all query heads rather than G groups.

| Comparison dimension | Multi-Query Attention | Multi-Head Attention | Grouped-Query Attention |
| --- | --- | --- | --- |
| KV head count | 1 key head and 1 value head | H key heads and H value heads | G key heads and G value heads |
| Query-head flexibility | H query heads share one KV pair | H independent query/KV head pairs | H distinct query heads grouped into G shared KV pairs |
| Cache footprint per token | 2 tensors (single shared KV cache) | 2H tensors (full multi-head attention cache) | 2G tensors (G keys + G values) |

How MQA compares with nearby attention variants
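
The three variants can be read as one mechanism parameterized by the number of KV heads. The sketch below is an illustration under assumptions: the `grouped_attention` name is hypothetical, inputs are already projected into heads, and each KV head is assigned to H / G consecutive query heads (one possible grouping convention). Passing G = 1 gives MQA and G = H gives MHA.

```python
import numpy as np


def grouped_attention(q, k, v):
    """q: (H, T, d_k); k, v: (G, T, d_k), with H divisible by G."""
    H, T, d_k = q.shape
    G = k.shape[0]
    assert H % G == 0, "query heads must split evenly across KV heads"
    # Expand each KV head so it serves H // G consecutive query heads.
    k = np.repeat(k, H // G, axis=0)  # (H, T, d_k)
    v = np.repeat(v, H // G, axis=0)  # (H, T, d_k)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # (H, T, T)
    # Numerically stable softmax over the key axis.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (H, T, d_k)


rng = np.random.default_rng(0)
H, T, d_k = 4, 3, 8
q = rng.normal(size=(H, T, d_k))

k1, v1 = rng.normal(size=(1, T, d_k)), rng.normal(size=(1, T, d_k))
mqa_out = grouped_attention(q, k1, v1)  # G = 1: multi-query attention

k2, v2 = rng.normal(size=(2, T, d_k)), rng.normal(size=(2, T, d_k))
gqa_out = grouped_attention(q, k2, v2)  # G = 2: grouped-query attention

kH, vH = rng.normal(size=(H, T, d_k)), rng.normal(size=(H, T, d_k))
mha_out = grouped_attention(q, kH, vH)  # G = H: multi-head attention
```

The `np.repeat` expansion is only for clarity. The output shape is the same in all three cases, and what differs is the leading dimension of the K and V tensors that would be cached.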

Example Architectures

MQA appears in decoder-only language models that prioritize inference throughput and lean KV caches, including some large-scale serving configurations that accept the representational tradeoff.

Limitations And Tradeoffs

Sharing one KV head across all queries can reduce representational flexibility versus full multi-head attention or grouped-query attention. Quality-sensitive workloads may prefer GQA or MHA.

Why It Still Matters

MQA established the head-sharing design space. It marks the extreme end of KV compression, a single KV head, and grouped-query attention offers a middle ground between that cache size and the per-head keys and values of multi-head attention.

References

  1. Shazeer, Noam. "Fast Transformer Decoding: One Write-Head is All You Need." arXiv, 2019, https://arxiv.org/abs/1911.02150.