Grouped-Query Attention

An attention variant that reduces KV cache memory by sharing key-value heads across query groups.

KV caches grow with context length and head count, raising memory use during long-context inference; grouped-query attention (GQA) lets several query heads share fewer key-value heads so the cache stores fewer KV tensors without collapsing to one shared head.

At a glance

Optimizes

  • KV Cache
  • Memory Bandwidth
  • Long-Context Inference

Practical benefits

  • lower KV-cache memory
  • reduced memory bandwidth during inference

Example models

None listed yet.

What It Is

Grouped-query attention is an attention variant derived from multi-head attention. It keeps multiple query heads but groups them so each group reads from the same key-value head pair.

What It Optimizes

GQA targets KV-cache footprint, memory bandwidth during autoregressive decoding, and the cost of serving long contexts when head count would otherwise multiply cache size.
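The effect on cache size can be estimated with simple arithmetic. The sketch below is illustrative only: the function name `kv_cache_bytes`, the layer count, head dimension, sequence length, and 2-byte element size are all assumptions chosen for the example, not values from any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate the bytes needed to cache keys and values for all layers and tokens."""
    # Factor of 2: one key tensor and one value tensor per KV head.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical configuration, chosen only for illustration.
layers, query_heads, head_dim, seq_len = 32, 32, 128, 8192

mha = kv_cache_bytes(layers, num_kv_heads=query_heads, head_dim=head_dim, seq_len=seq_len)
gqa = kv_cache_bytes(layers, num_kv_heads=8, head_dim=head_dim, seq_len=seq_len)
mqa = kv_cache_bytes(layers, num_kv_heads=1, head_dim=head_dim, seq_len=seq_len)

for name, size in [("MHA (32 KV heads)", mha), ("GQA (8 KV heads)", gqa), ("MQA (1 KV head)", mqa)]:
    print(f"{name}: {size / 2**30:.3f} GiB")
```

Under these assumed numbers the cache shrinks in direct proportion to the KV-head count, since every other factor in the product is unchanged.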

Practical Benefit

Teams adopt GQA when they want to keep multiple query heads for expressiveness but need leaner inference caches and less KV traffic at deployment time. GQA keeps more distinct key-value heads than multi-query attention (G instead of 1) while using a smaller cache than full multi-head attention (G instead of H).

How It Works

Queries are partitioned into groups. Each group shares one key head and one value head. Attention scores are computed per query head against the shared KV pair for that group, then outputs are projected as usual.
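The steps above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the function name `grouped_query_attention`, the tensor layout `(heads, tokens, dim)`, the contiguous assignment of query heads to groups, and the optional causal mask are all assumptions made for the example. The final output projection is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def grouped_query_attention(q, k, v, causal=True):
    """q: (H, T, d); k, v: (G, T, d) with H divisible by G. Returns (H, T, d)."""
    H, T, d = q.shape
    G = k.shape[0]
    if H % G != 0:
        raise ValueError("query-head count must be divisible by KV-head count")
    group_size = H // G

    # Repeat each KV head so that the group_size consecutive query heads
    # in a group all read the same keys and values.
    k_exp = np.repeat(k, group_size, axis=0)  # (H, T, d)
    v_exp = np.repeat(v, group_size, axis=0)  # (H, T, d)

    scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(d)  # (H, T, T)
    if causal:
        # Block attention to future positions (strict upper triangle).
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)

    weights = softmax(scores, axis=-1)
    return weights @ v_exp  # per-head outputs, before any output projection


rng = np.random.default_rng(0)
H, G, T, d = 8, 2, 5, 16
q = rng.standard_normal((H, T, d))
k = rng.standard_normal((G, T, d))
v = rng.standard_normal((G, T, d))
out = grouped_query_attention(q, k, v)
print(out.shape)
```

Only `k` and `v` with `G` heads would need to be cached. The repeated copies `k_exp` and `v_exp` are a convenience for this sketch, and an optimized kernel would typically avoid materializing them.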

Math Or Compute Schema

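With H query heads and G KV groups, each group contains H/G query heads, so H must be divisible by G. Setting G = H recovers multi-head attention, and G = 1 recovers multi-query attention. The formulas below contrast multi-head attention, which pairs every query head with its own key and value heads, with grouped-query attention, which routes each query head through the KV pair shared by its group.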
Multi-head attention (MHA)

\[
\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i
\]

  • Q_i: Query vectors for head i.
  • K_i: Key vectors for head i.
  • V_i: Value vectors for head i.
  • H: Number of query heads.
  • d_k: Key dimension per head.
  • i: Query head index.

Grouped-query attention (GQA)

\[
\text{Attention}(Q_i, K_{g(i)}, V_{g(i)}) = \mathrm{softmax}\!\left(\frac{Q_i K_{g(i)}^{\top}}{\sqrt{d_k}}\right) V_{g(i)}
\]

  • Q_i: Query vectors for head i.
  • K_{g(i)}: Key vectors for group g(i).
  • V_{g(i)}: Value vectors for group g(i).
  • H: Number of query heads.
  • G: Number of shared KV groups.
  • d_k: Key dimension per head.
  • i: Query head index.
  • g(i): KV group index for query head i.
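
The text does not say how g(i) is defined. One common choice, assumed here purely for illustration, assigns contiguous blocks of H/G query heads to each group, using zero-based head indices. The helper name `kv_group_index` is hypothetical.

```python
def kv_group_index(i, num_query_heads, num_kv_groups):
    """Map zero-based query head i to its KV group, assuming contiguous blocks."""
    if num_query_heads % num_kv_groups != 0:
        raise ValueError("H must be divisible by G")
    if not 0 <= i < num_query_heads:
        raise IndexError("query head index out of range")
    group_size = num_query_heads // num_kv_groups
    return i // group_size


H, G = 8, 2
mapping = [kv_group_index(i, H, G) for i in range(H)]
print(mapping)  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With G = H the mapping is the identity, so every query head gets its own KV head as in MHA. With G = 1 every head maps to group 0, as in MQA.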

Compared To Nearby Modules

Compared with multi-head attention, GQA caches fewer KV tensors: G key heads and G value heads instead of H of each. Compared with multi-query attention, GQA keeps the same H query heads but gives them G distinct KV heads to read from rather than a single shared pair.

| Comparison dimension | Grouped-Query Attention | Multi-Head Attention | Multi-Query Attention |
| --- | --- | --- | --- |
| KV head count | G key heads and G value heads | H key heads and H value heads | 1 key head and 1 value head |
| Query-head flexibility | H distinct query heads grouped into G shared KV pairs | H independent query/KV head pairs | H query heads share one KV pair |
| Cache footprint per token | 2G tensors (G keys + G values) | 2H tensors (full multi-head attention cache) | 2 tensors (single shared KV cache) |

Table: How GQA compares with nearby attention variants

Example Architectures

GQA appears in large decoder-only language models that prioritize efficient long-context inference while retaining multi-head training dynamics.

Limitations And Tradeoffs

Sharing KV heads can reduce representational flexibility versus full multi-head attention. The optimal group count depends on model scale and quality targets.
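
To make the choice of group count concrete, the sketch below lists every G that divides a given H, along with the resulting group size and the KV-cache size relative to MHA. The function name `group_count_options` and the example value H = 32 are assumptions for illustration. The code says nothing about model quality, which has to be measured empirically for each candidate G.

```python
def group_count_options(num_query_heads):
    """Return (G, heads_per_group, cache_fraction_vs_mha) for each valid G."""
    options = []
    for g in range(1, num_query_heads + 1):
        if num_query_heads % g == 0:  # groups must be equally sized
            options.append((g, num_query_heads // g, g / num_query_heads))
    return options


# Hypothetical head count, chosen only for illustration.
for g, size, frac in group_count_options(32):
    print(f"G={g:>2}  heads per group={size:>2}  KV cache vs MHA={frac:.1%}")
```

The first row (G = 1) corresponds to MQA and the last row (G = H) to MHA. The rows in between are the GQA settings a team would compare against its quality targets.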

Why It Still Matters

As context windows grow, KV-cache efficiency remains a first-order serving constraint, so GQA stays a practical default between MHA fidelity and MQA compression.

References

  1. Ainslie, Joshua, et al. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." arXiv, 2023, https://arxiv.org/abs/2305.13245.