Grouped-Query Attention
An attention variant that reduces KV cache memory by sharing key-value heads across query groups.
At a glance
Optimizes
- KV Cache
- Memory Bandwidth
- Long Context Inference
Practical benefits
- lower KV-cache memory
- reduced memory bandwidth during inference
Example models
None listed yet.
What It Is
Grouped-query attention is an attention variant derived from multi-head attention. It keeps multiple query heads but groups them so each group reads from the same key-value head pair.
What It Optimizes
GQA targets KV-cache footprint, memory bandwidth during autoregressive decoding, and the cost of serving long contexts when head count would otherwise multiply cache size.
Practical Benefit
Teams adopt GQA when they need multi-head expressiveness at training time but want leaner inference caches and less KV traffic at deployment time. GQA keeps more query diversity than multi-query attention while using a smaller cache than full multi-head attention.
How It Works
Query heads are partitioned into groups. Each group shares one key head and one value head. Attention scores are computed per query head against the shared KV pair for that group, then the per-head outputs are concatenated and passed through the output projection, as in multi-head attention.
Diagram: H query heads, G shared KV pairs; H/G query heads share each KV pair.
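The grouping described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `grouped_query_attention`, the tensor layout `(heads, tokens, dim)`, and the contiguous head-to-group assignment (query head `h` reads KV head `h // (H // G)`) are assumptions made for the example. Causal masking, batching, and the output projection are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(q, k, v):
    """q: (H, T, d); k, v: (G, S, d) with H divisible by G. Returns (H, T, d)."""
    H, T, d = q.shape
    G = k.shape[0]
    if H % G != 0:
        raise ValueError("number of query heads must be divisible by number of KV heads")
    group_size = H // G

    # Expand each KV head so that query head h sees KV head h // group_size.
    # Only the G original heads would need to live in the KV cache.
    k_full = np.repeat(k, group_size, axis=0)  # (H, S, d)
    v_full = np.repeat(v, group_size, axis=0)  # (H, S, d)

    scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d)  # (H, T, S)
    weights = softmax(scores, axis=-1)
    return weights @ v_full  # (H, T, d)


# Hypothetical sizes: 8 query heads sharing 2 KV heads.
rng = np.random.default_rng(0)
H, G, T, d = 8, 2, 5, 16
q = rng.standard_normal((H, T, d))
k = rng.standard_normal((G, T, d))
v = rng.standard_normal((G, T, d))
out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 5, 16)
```

In this sketch, setting G equal to H gives ordinary multi-head attention and setting G to 1 gives multi-query attention. The expansion with `np.repeat` is for clarity only; an efficient kernel would index the shared KV heads without materializing copies.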
Math Or Compute Schema
With H query heads and G KV groups, each group contains H/G query heads, so H must be divisible by G. Multi-head attention pairs every query head with its own key and value heads, whereas grouped-query attention routes the query heads of each group through one shared KV pair.
Compared To Nearby Modules
Compared with multi-head attention, GQA reduces the number of cached key and value heads from H each to G each. Compared with multi-query attention, GQA gives the query heads G distinct KV pairs to read from instead of one, while still sharing KV heads within each group.
| Comparison dimension | Grouped-Query Attention | Multi-Head Attention | Multi-Query Attention |
|---|---|---|---|
| KV head count | G key heads and G value heads | H key heads and H value heads | 1 key head and 1 value head |
| Query-head flexibility | H distinct query heads grouped into G shared KV pairs | H independent query/KV head pairs | H query heads share one KV pair |
| Cache footprint per token | 2G tensors (G keys + G values) | 2H tensors (full multi-head attention cache) | 2 tensors (single shared KV cache) |
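The contrast summarized in the table can also be written out as formulas. What follows is a sketch in standard scaled dot-product notation, not taken from the source: the symbols $Q_h$, $K_j$, $V_j$, the per-head dimension $d_k$, and the group index $g(h)$ are assumptions introduced here, with query heads indexed from 0 and assigned to groups contiguously.

```latex
% Multi-head attention: every query head h has its own key and value head.
\text{MHA:}\quad
\mathrm{head}_h = \operatorname{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d_k}}\right) V_h,
\qquad h = 0, \dots, H-1

% Grouped-query attention: query head h reads the KV pair of its group g(h).
\text{GQA:}\quad
\mathrm{head}_h = \operatorname{softmax}\!\left(\frac{Q_h K_{g(h)}^{\top}}{\sqrt{d_k}}\right) V_{g(h)},
\qquad g(h) = \left\lfloor \frac{h}{H/G} \right\rfloor \in \{0, \dots, G-1\}

% Cached key/value tensors per token, relative to multi-head attention.
\frac{\text{GQA cache}}{\text{MHA cache}} = \frac{2G}{2H} = \frac{G}{H}
```

Under these definitions, G = H recovers multi-head attention and G = 1 recovers multi-query attention. As a worked case with hypothetical sizes, H = 32 and G = 8 put 4 query heads in each group and shrink the per-token KV cache to one quarter of the multi-head size.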
Example Architectures
GQA appears in large decoder-only language models that prioritize efficient long-context inference while retaining multi-head training dynamics.