Grouped-Query Attention

An attention variant that reduces KV cache memory by sharing key-value heads across query groups.

KV caches grow with context length and head count, raising memory use during long-context inference. Grouped-query attention (GQA) lets several query heads share a smaller number of key-value heads, so the cache stores fewer KV tensors without collapsing to a single shared head.

At a glance

Optimizes

KV Cache
Memory Bandwidth
Long Context Inference

Practical benefits

lower KV-cache memory
reduced memory bandwidth during inference

Example models

None listed yet.

What It Is

Grouped-query attention is an attention variant derived from multi-head attention. It keeps multiple query heads but partitions them into groups, and all query heads in a group attend over one shared key head and one shared value head.

What It Optimizes

GQA targets KV-cache footprint, memory bandwidth during autoregressive decoding, and the cost of serving long contexts when head count would otherwise multiply cache size.
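To make the footprint claim concrete, here is a minimal sketch of the cache-size arithmetic. The function name `kv_cache_bytes`, the model shape (32 layers, 32 query heads, head dimension 128, 8192-token context), and the 2-byte element size are all assumptions chosen for illustration, not figures from any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor of 2: one key vector and one value vector per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


# Hypothetical model shape, chosen only for illustration.
LAYERS, QUERY_HEADS, HEAD_DIM, CONTEXT = 32, 32, 128, 8192

for name, kv_heads in [("MHA", QUERY_HEADS), ("GQA (G=8)", 8), ("MQA", 1)]:
    size = kv_cache_bytes(LAYERS, kv_heads, HEAD_DIM, CONTEXT)
    print(f"{name:10s} {size / 2**20:8.0f} MiB")
```

Because the cache size is linear in the number of KV heads, moving from H = 32 to G = 8 shrinks it by a factor of H/G = 4 under these assumptions, and the bytes read per decoding step shrink by the same factor.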

Practical Benefit

Teams adopt GQA when they need multi-head expressiveness at training time but want leaner inference caches and less KV traffic at deployment time. GQA keeps more key-value diversity than multi-query attention while using a smaller cache than full multi-head attention.

How It Works

Query heads are partitioned into groups. Each group shares one key head and one value head. Attention scores are computed per query head against the shared KV pair for its group, then the per-head outputs are concatenated and projected as in multi-head attention.
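The steps above can be sketched for a single decoding step in plain Python. This is an illustrative sketch, not a production kernel: the function names, the list-of-lists data layout, and the contiguous assignment of query heads to groups are assumptions, and the final concatenation and output projection are left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def grouped_query_attention(queries, kv_keys, kv_values):
    """Attention for one token with H query heads and G shared KV heads.

    queries:   H query vectors for the current token.
    kv_keys:   G lists of cached key vectors, one list per KV head.
    kv_values: G lists of cached value vectors, one list per KV head.
    Returns H output vectors (before concatenation and output projection).
    """
    num_q, num_kv = len(queries), len(kv_keys)
    if num_q % num_kv != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    group_size = num_q // num_kv
    outputs = []
    for i, query in enumerate(queries):
        g = i // group_size  # assumed contiguous grouping of query heads
        outputs.append(attend(query, kv_keys[g], kv_values[g]))
    return outputs


# H = 4 query heads, G = 2 KV heads, 2 cached positions, head dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
kv_keys = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [-1.0, 1.0]]]
kv_values = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
for head, out in enumerate(grouped_query_attention(queries, kv_keys, kv_values)):
    print(head, [round(x, 3) for x in out])
```

Only the G key lists and G value lists need to be cached between steps; the H query vectors are recomputed for each new token.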


Math Or Compute Schema

With H query heads and G KV groups, each group contains H/G query heads (H is assumed to be divisible by G). The formulas below contrast multi-head attention, which pairs every query head with its own key and value heads, with grouped-query attention, which routes the query heads in each group through one shared KV pair.

Multi-head attention (MHA)

\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i

Q_i: Query vectors for head i.
K_i: Key vectors for head i.
V_i: Value vectors for head i.
H: Number of query heads, which in MHA is also the number of KV heads.
d_k: Key dimension per head.
i: Query head index.

Grouped-query attention (GQA)

\text{Attention}(Q_i, K_{g(i)}, V_{g(i)}) = \mathrm{softmax}\!\left(\frac{Q_i K_{g(i)}^{\top}}{\sqrt{d_k}}\right) V_{g(i)}

Q_i: Query vectors for head i.
K_{g(i)}: Key vectors shared by group g(i).
V_{g(i)}: Value vectors shared by group g(i).
H: Number of query heads.
G: Number of shared KV groups.
d_k: Key dimension per head.
i: Query head index.
g(i): KV group index for query head i.
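The text does not fix a particular form for g(i). One common convention, assumed here purely for illustration, assigns contiguous blocks of H/G query heads to each group, so g(i) = floor(i / (H/G)) with zero-based head indices. The helper name `kv_group_for_head` is hypothetical.

```python
def kv_group_for_head(i, num_query_heads, num_kv_groups):
    """Return g(i) under a contiguous-block assignment (an assumed convention)."""
    if num_query_heads % num_kv_groups != 0:
        raise ValueError("H must be divisible by G")
    if not 0 <= i < num_query_heads:
        raise IndexError("query head index out of range")
    group_size = num_query_heads // num_kv_groups
    return i // group_size


H = 8
for G in (8, 2, 1):
    print(G, [kv_group_for_head(i, H, G) for i in range(H)])
# 8 [0, 1, 2, 3, 4, 5, 6, 7]   every head has its own KV pair (MHA)
# 2 [0, 0, 0, 0, 1, 1, 1, 1]   two groups of four heads (GQA)
# 1 [0, 0, 0, 0, 0, 0, 0, 0]   all heads share one KV pair (MQA)
```

Under this mapping the two limiting cases fall out directly: G = H reduces to multi-head attention and G = 1 reduces to multi-query attention.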

Compared To Nearby Modules

Compared with multi-head attention, GQA stores G key-value heads instead of H. Compared with multi-query attention, which shares a single KV pair across all H query heads, GQA keeps G distinct KV pairs, one per group. All three variants keep H query heads.

Comparison dimension	Grouped-Query Attention	Multi-Head Attention	Multi-Query Attention
KV head count	G key heads and G value heads	H key heads and H value heads	1 key head and 1 value head
Query-head flexibility	H distinct query heads grouped into G shared KV pairs	H independent query/KV head pairs	H query heads share one KV pair
Cache footprint per token, per layer	2G tensors (G keys + G values)	2H tensors (full multi-head attention cache)	2 tensors (single shared KV cache)

How GQA compares with nearby attention variants

Example Architectures

GQA appears in large decoder-only language models that prioritize efficient long-context inference while retaining multi-head training dynamics.


Limitations And Tradeoffs

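Sharing KV heads can reduce representational flexibility versus full multi-head attention. The group count G sets the tradeoff: G = H recovers multi-head attention, G = 1 recovers multi-query attention, and the best intermediate value depends on model scale and quality targets.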

Why It Still Matters

As context windows grow, KV-cache efficiency remains a first-order serving constraint, so GQA stays a practical default between MHA fidelity and MQA compression.

References

Ainslie, Joshua, et al. "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." arXiv, 2023, https://arxiv.org/abs/2305.13245.
