Multi-Head Latent Attention

An attention variant that compresses key-value cache storage into a low-rank latent space while keeping distinct query heads.

KV caches grow with context length, head count, and head dimension, raising memory use during long-context inference. Multi-head latent attention (MLA) stores a compact latent KV representation and reconstructs per-head keys and values on demand, so the cache holds fewer bytes without collapsing to a single shared KV head.

At a glance

Optimizes

KV Cache
Memory Bandwidth
Long Context Inference

Practical benefits

lower KV-cache memory
compressed latent KV representation

Example models

DeepSeek-V2 (see References)

What It Is

Multi-head latent attention is an attention variant derived from multi-head attention. It keeps multiple query heads but projects keys and values into a shared low-rank latent space that is cached during autoregressive decoding.

What It Optimizes

MLA targets KV-cache footprint, memory bandwidth during autoregressive decoding, and the cost of serving long contexts when full per-head KV tensors would otherwise dominate memory.

Practical Benefit

Teams adopt MLA when they need multi-head query expressiveness but want a smaller inference cache than full multi-head attention. Grouped-query and multi-query attention shrink the cache by sharing KV heads; MLA instead caches a low-rank latent representation, so the per-token cache size is set by the latent rank r rather than by the number of KV heads.

How It Works

Each token's hidden state is down-projected to a latent vector of dimension r, and only that latent vector is written to the KV cache. During attention, the cached latent vectors are up-projected back to per-head key and value spaces. Queries remain head-specific, so each head attends against reconstructed KV pairs derived from the shared latent cache.
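
The flow above can be sketched as a single decode step in NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the weight names (W_down, W_q, W_uk, W_uv), the sizes, and the random weights are all hypothetical, and positional encoding and the output projection are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from any specific model).
d_model, n_heads, d_head, rank = 64, 4, 16, 8

# Hypothetical weight names; random values stand in for trained weights.
W_down = rng.standard_normal((rank, d_model)) / np.sqrt(d_model)          # hidden -> latent
W_q = rng.standard_normal((n_heads, d_head, d_model)) / np.sqrt(d_model)  # per-head queries
W_uk = rng.standard_normal((n_heads, d_head, rank)) / np.sqrt(rank)       # latent -> per-head keys
W_uv = rng.standard_normal((n_heads, d_head, rank)) / np.sqrt(rank)       # latent -> per-head values


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def mla_decode_step(h_t, latent_cache):
    """Attend from the newest token over all cached tokens.

    h_t: (d_model,) hidden state of the newest token.
    latent_cache: (T, rank) latent vectors of earlier tokens.
    Returns per-head outputs (n_heads, d_head) and the updated cache.
    """
    c_t = W_down @ h_t                                 # compress to (rank,)
    latent_cache = np.vstack([latent_cache, c_t])      # only the latent is stored
    q = W_q @ h_t                                      # (n_heads, d_head)
    k = np.einsum("hdr,tr->htd", W_uk, latent_cache)   # reconstructed keys
    v = np.einsum("hdr,tr->htd", W_uv, latent_cache)   # reconstructed values
    scores = np.einsum("hd,htd->ht", q, k) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)                 # (n_heads, T + 1)
    out = np.einsum("ht,htd->hd", weights, v)          # (n_heads, d_head)
    return out, latent_cache


cache = np.zeros((0, rank))
for _ in range(5):
    out, cache = mla_decode_step(rng.standard_normal(d_model), cache)
print(out.shape, cache.shape)  # (4, 16) (5, 8)
```

After five steps the cache holds five rows of length rank, while a full multi-head cache for the same sizes would hold 2 * n_heads * d_head values per token.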

Math Or Compute Schema

With H query heads and latent rank r, MLA caches one compact latent vector per token and reconstructs per-head keys and values through low-rank up-projections. The formulas below contrast full multi-head attention with MLA, in which every head attends against keys and values reconstructed from the shared latent cache.

Multi-head attention (MHA)

\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i

Q_i: Query vectors for head i.
K_i: Key vectors for head i.
V_i: Value vectors for head i.
H: Number of heads.
d_k: Key dimension per head.
i: Head index, from 1 to H.

Multi-head latent attention (MLA)

\text{Attention}(Q_i, \hat{K}_i, \hat{V}_i) = \mathrm{softmax}\!\left(\frac{Q_i \hat{K}_i^{\top}}{\sqrt{d_k}}\right) \hat{V}_i,\quad \hat{K}_i = W^K_i c,\; \hat{V}_i = W^V_i c

Q_i: Query vectors for head i.
\hat{K}_i: Key vectors for head i, reconstructed from the latent cache c.
\hat{V}_i: Value vectors for head i, reconstructed from the latent cache c.
W^K_i, W^V_i: Up-projection matrices for head i that map c to keys and values.
H: Number of query heads.
r: Latent rank of the compressed KV cache (the dimension of c).
d_k: Key dimension per head after up-projection.
i: Query head index, from 1 to H.
c: Shared latent KV vector cached per token.
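
Because the reconstructed key is linear in c, the score for head i can be computed either by reconstructing keys first or by folding the up-projection into the query and scoring directly against the cached latents. The NumPy check below is a small sketch of that identity; the variable names, sizes, and random weights are assumptions for illustration, and it says nothing about how any particular implementation arranges the computation.

```python
import numpy as np

rng = np.random.default_rng(1)
d_head, rank, T = 16, 8, 6   # illustrative sizes (assumptions)

W_k_i = rng.standard_normal((d_head, rank))   # stands in for W^K_i
q_i = rng.standard_normal(d_head)             # query vector for head i
C = rng.standard_normal((T, rank))            # cached latents, one row per token

# Route 1: reconstruct per-head keys, then take dot products with the query.
K_hat = C @ W_k_i.T                  # (T, d_head)
scores_reconstructed = K_hat @ q_i   # (T,)

# Route 2: fold the up-projection into the query and score against latents.
q_latent = W_k_i.T @ q_i             # (rank,)
scores_absorbed = C @ q_latent       # (T,)

print(np.allclose(scores_reconstructed, scores_absorbed))  # True
```

Both routes give the same pre-softmax scores; the second never materializes the (T, d_head) key matrix.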

Compared To Nearby Modules

Compared with multi-head attention, MLA stores a compressed latent cache instead of full per-head KV tensors. Compared with multi-query and grouped-query attention, MLA keeps distinct query heads while compressing KV state through low-rank projection rather than only sharing heads.

Comparison dimension	Multi-Head Latent Attention	Multi-Head Attention	Multi-Query Attention	Grouped-Query Attention
KV representation	Low-rank latent KV vectors (rank r)	H key heads and H value heads	1 key head and 1 value head	G key heads and G value heads
Query-head flexibility	H distinct query heads with latent KV reconstruction	H independent query/KV head pairs	H query heads share one KV pair	H distinct query heads grouped into G shared KV pairs
Cache footprint per token (per layer)	One r-dimensional latent vector (compressed KV cache)	2H per-head vectors (H keys + H values)	2 per-head vectors (1 key + 1 value)	2G per-head vectors (G keys + G values)

How MLA compares with nearby attention variants
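
The cache-footprint row can be turned into rough arithmetic. The helper below is a sketch that follows the simplified accounting in the table (keys and values share one per-head size, and MLA stores only the latent vector); the function name and the example configuration are hypothetical and not taken from any specific model.

```python
def kv_cache_values_per_token(variant, n_heads, d_head, n_groups=1, rank=0):
    """Cached scalars per token per layer under the table's simplified accounting."""
    if variant == "mha":
        return 2 * n_heads * d_head    # H keys + H values
    if variant == "gqa":
        return 2 * n_groups * d_head   # G keys + G values
    if variant == "mqa":
        return 2 * d_head              # 1 key + 1 value
    if variant == "mla":
        return rank                    # one latent vector
    raise ValueError(f"unknown variant: {variant}")


# Illustrative configuration (assumed numbers).
cfg = dict(n_heads=32, d_head=128, n_groups=8, rank=512)
for name in ("mha", "gqa", "mqa", "mla"):
    print(name, kv_cache_values_per_token(name, **cfg))
# mha 8192
# gqa 2048
# mqa 256
# mla 512
```

With these assumed numbers the MLA cache is 16 times smaller than the MHA cache and larger than the MQA cache, which shows that the outcome depends on the chosen rank r. Multiplying by bytes per element, layer count, and context length gives a total cache estimate.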

Example Architectures

MLA appears in large decoder-only language models that prioritize efficient long-context inference while retaining multi-head query pathways. DeepSeek-V2, the model whose report is listed under References, introduced it.

Limitations And Tradeoffs

Low-rank KV compression can limit representational flexibility versus storing full per-head tensors. The optimal latent rank depends on model scale and quality targets.

Why It Still Matters

As context windows grow, KV-cache efficiency remains a first-order serving constraint, so MLA offers a path between MHA fidelity and aggressive cache compression through latent storage.

References

DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv, 2024, https://arxiv.org/abs/2405.04434.
