Modules
Multi-Query Attention
An attention variant that shares one key-value head across all query heads to minimize KV-cache memory.
At a glance
Optimizes
- Kv Cache
- Memory Bandwidth
Practical benefits
- lower KV-cache memory
Example models
None listed yet.
What It Is
Multi-query attention is an attention variant derived from multi-head attention. Every query head still computes its own attention scores, but all heads read from the same key head and value head.What It Optimizes
MQA targets KV-cache footprint and memory bandwidth during autoregressive decoding. Sharing one KV pair across all query heads minimizes the number of tensors stored per token.Practical Benefit
Serving stacks adopt MQA when long-context inference would otherwise multiply cache size by head count. MQA delivers the smallest KV footprint in the head-sharing family while keeping distinct query projections.How It Works
Linear projections still produce H query heads, but key and value projections collapse to a single head each. Every query head i attends against the shared K and V tensors, then outputs are concatenated and projected as usual.H query heads1 shared KV pairAll H query heads read from one K and V
Math Or Compute Schema
The formulas below contrast how multi-head attention pairs every query head with its own KV heads versus multi-query attention, which routes all query heads through one shared KV pair.Compared To Nearby Modules
Compared with multi-head attention, MQA stores one KV pair instead of H. Compared with grouped-query attention, MQA compresses further by sharing a single KV head across all query heads rather than G groups.| Comparison dimension | Multi-Query Attention | Multi-Head Attention | Grouped-Query Attention |
|---|---|---|---|
| KV head count | 1 key head and 1 value head | H key heads and H value heads | G key heads and G value heads |
| Query-head flexibility | H query heads share one KV pair | H independent query/KV head pairs | H distinct query heads grouped into G shared KV pairs |
| Cache footprint per token | 2 tensors (single shared KV cache) | 2H tensors (full multi-head attention cache) | 2G tensors (G keys + G values) |
Example Architectures
MQA appears in decoder-only language models that prioritize inference throughput and lean KV caches, including some large-scale serving configurations that accept the representational tradeoff.Example model links will render from registry usage in a later story.