Modules

Multi-Head Latent Attention

An attention variant that compresses key-value cache storage into a low-rank latent space while keeping distinct query heads.

KV caches grow with context length, head count, and head dimension, raising memory use during long-context inference. Multi-head latent attention (MLA) stores a compact latent KV representation and reconstructs per-head keys and values on demand, so the cache holds fewer bytes without collapsing to a single shared head.

At a glance

Optimizes

  • KV Cache
  • Memory Bandwidth
  • Long Context Inference

Practical benefits

  • lower KV-cache memory
  • compressed latent KV representation

Example models

None listed yet.

What It Is

Multi-head latent attention is an attention variant derived from multi-head attention. It keeps multiple query heads but projects keys and values into a shared low-rank latent space that is cached during autoregressive decoding.

What It Optimizes

MLA targets KV-cache footprint, memory bandwidth during autoregressive decoding, and the cost of serving long contexts when full per-head KV tensors would otherwise dominate memory.
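
To make the footprint concrete, the following back-of-the-envelope sketch counts the bytes a full multi-head KV cache needs for one sequence. The function name, layer count, head count, head dimension, and 2-byte element size are illustrative assumptions, not figures from any particular model.

```python
def mha_kv_cache_bytes(seq_len, n_layers, n_heads, d_head, bytes_per_elem=2):
    """Bytes for a full per-head KV cache (keys + values) for one sequence."""
    # Each layer stores one key vector and one value vector per head, per token.
    per_token_per_layer = 2 * n_heads * d_head
    return seq_len * n_layers * per_token_per_layer * bytes_per_elem


# Hypothetical configuration chosen only for illustration.
for seq_len in (4_096, 32_768, 131_072):
    gib = mha_kv_cache_bytes(seq_len, n_layers=32, n_heads=32, d_head=128) / 2**30
    print(f"{seq_len:>7} tokens: {gib:.1f} GiB")
```

Under these assumptions the cache costs 0.5 MiB per token, so it grows linearly from 2 GiB at 4,096 tokens to 64 GiB at 131,072 tokens.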

Practical Benefit

Teams adopt MLA when they need multi-head query expressiveness but want a smaller inference cache than full multi-head attention. Where grouped-query and multi-query attention shrink the cache only by sharing KV heads, MLA caches a low-rank latent representation instead of full head tensors, so its cache size is set by the latent rank r rather than by the number of KV heads.

How It Works

Keys and values are down-projected into a latent KV cache with rank r. During attention, latent vectors are up-projected back to per-head key and value spaces. Queries remain head-specific, so each head attends against reconstructed KV pairs derived from the shared latent cache.
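
The flow described above can be sketched as a single-sequence decode loop in NumPy. This is a minimal illustration under stated assumptions: all sizes, weight names (W_down, W_q, W_uk, W_uv), and the decode_step helper are hypothetical, the weights are random, and positional encoding, the output projection, and batching are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_k, r = 64, 4, 16, 8  # hypothetical sizes; r is the latent rank

W_down = rng.normal(size=(r, d_model)) / np.sqrt(d_model)         # shared KV down-projection
W_q = rng.normal(size=(n_heads, d_k, d_model)) / np.sqrt(d_model)  # per-head query projections
W_uk = rng.normal(size=(n_heads, d_k, r)) / np.sqrt(r)            # per-head key up-projections
W_uv = rng.normal(size=(n_heads, d_k, r)) / np.sqrt(r)            # per-head value up-projections


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def decode_step(x_t, latent_cache):
    """Append the new token's latent to the cache and return per-head outputs."""
    # Only r numbers per token are stored, regardless of the head count.
    latent_cache = np.vstack([latent_cache, W_down @ x_t])
    outputs = []
    for i in range(n_heads):
        q = W_q[i] @ x_t                   # head-specific query, shape (d_k,)
        K_hat = latent_cache @ W_uk[i].T   # reconstructed keys, shape (T, d_k)
        V_hat = latent_cache @ W_uv[i].T   # reconstructed values, shape (T, d_k)
        weights = softmax(q @ K_hat.T / np.sqrt(d_k))
        outputs.append(weights @ V_hat)
    return np.stack(outputs), latent_cache


cache = np.empty((0, r))
for _ in range(5):
    out, cache = decode_step(rng.normal(size=d_model), cache)

print(out.shape)    # (4, 16): one output vector per head
print(cache.shape)  # (5, 8): five cached tokens, r numbers each
```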

Math Or Compute Schema

With H query heads and latent rank r, MLA caches compact latent KV vectors and reconstructs per-head keys and values through low-rank projections. The formulas below contrast full multi-head attention with MLA, in which each head's queries attend to keys and values reconstructed from a shared latent cache.
Multi-head attention (MHA)
\[
\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i
\]

  • Q_i: Query vectors for head i.
  • K_i: Key vectors for head i.
  • V_i: Value vectors for head i.
  • H: Number of query heads.
  • d_k: Key dimension per head.
  • i: Query head index.

Multi-head latent attention (MLA)
\[
\text{Attention}(Q_i, \hat{K}_i, \hat{V}_i) = \mathrm{softmax}\!\left(\frac{Q_i \hat{K}_i^{\top}}{\sqrt{d_k}}\right) \hat{V}_i,\quad \hat{K}_i = W^K_i c,\; \hat{V}_i = W^V_i c
\]

  • Q_i: Query vectors for head i.
  • \hat{K}_i: Reconstructed key vectors from latent cache c.
  • \hat{V}_i: Reconstructed value vectors from latent cache c.
  • W^K_i, W^V_i: Per-head up-projection matrices that map the latent c to head i's key and value spaces.
  • H: Number of query heads.
  • r: Latent rank of the compressed KV cache.
  • d_k: Key dimension per head after up-projection.
  • i: Query head index.
  • c: Shared latent KV cache vector stored per token.
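
One algebraic consequence of \hat{K}_i = W^K_i c is worth checking numerically: the score q · (W^K_i c) equals (W^K_i^T q) · c, so attention scores can in principle be computed against the latent cache directly, without materializing the reconstructed keys. The sketch below verifies that identity for one head. The sizes and variable names are hypothetical, and whether a given implementation exploits this is an assumption, not something the formulas above state.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_k, r = 6, 16, 8             # hypothetical: cached tokens, head dim, latent rank

C = rng.normal(size=(T, r))      # latent cache, one row c per token
W_k = rng.normal(size=(d_k, r))  # key up-projection W^K_i for one head
q = rng.normal(size=d_k)         # query vector for that head

# Route 1: reconstruct per-head keys, then score against them.
K_hat = C @ W_k.T                # (T, d_k)
scores_reconstructed = K_hat @ q / np.sqrt(d_k)

# Route 2: fold the up-projection into the query and score in latent space.
q_latent = W_k.T @ q             # (r,)
scores_latent = C @ q_latent / np.sqrt(d_k)

print(np.allclose(scores_reconstructed, scores_latent))  # True
```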

Compared To Nearby Modules

Compared with multi-head attention, MLA stores a compressed latent cache instead of full per-head KV tensors. Compared with multi-query and grouped-query attention, MLA keeps distinct query heads while compressing KV state through low-rank projection rather than only sharing heads.
How MLA compares with nearby attention variants

Comparison dimension: KV representation
  • Multi-Head Latent Attention: Low-rank latent KV vectors (rank r)
  • Multi-Head Attention: H key heads and H value heads
  • Multi-Query Attention: 1 key head and 1 value head
  • Grouped-Query Attention: G key heads and G value heads

Comparison dimension: Query-head flexibility
  • Multi-Head Latent Attention: H distinct query heads with latent KV reconstruction
  • Multi-Head Attention: H independent query/KV head pairs
  • Multi-Query Attention: H query heads share one KV pair
  • Grouped-Query Attention: H distinct query heads grouped into G shared KV pairs

Comparison dimension: Cache footprint per token
  • Multi-Head Latent Attention: r-dimensional latent tensors (compressed KV cache)
  • Multi-Head Attention: 2H tensors (full multi-head attention cache)
  • Multi-Query Attention: 2 tensors (single shared KV cache)
  • Grouped-Query Attention: 2G tensors (G keys + G values)
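
The cache-footprint comparison can be turned into a small calculator. This is a sketch under assumptions: the function name and the example sizes (H = 32, d_head = 128, G = 8, r = 512) are hypothetical, each key or value tensor is taken to hold d_head numbers, and the MLA entry counts only the single latent vector c per token.

```python
def cache_elems_per_token(variant, n_heads, d_head, n_groups=None, rank=None):
    """Cached numbers per token per layer for each attention variant."""
    if variant == "mha":
        return 2 * n_heads * d_head   # H keys + H values
    if variant == "gqa":
        return 2 * n_groups * d_head  # G keys + G values
    if variant == "mqa":
        return 2 * d_head             # 1 key + 1 value
    if variant == "mla":
        return rank                   # one r-dimensional latent vector
    raise ValueError(f"unknown variant: {variant}")


# Hypothetical sizes chosen only for illustration.
H, d, G, r = 32, 128, 8, 512
for name in ("mha", "gqa", "mqa", "mla"):
    print(f"{name}: {cache_elems_per_token(name, H, d, n_groups=G, rank=r)}")
```

With these assumed sizes the counts are 8192 (MHA), 2048 (GQA), 256 (MQA), and 512 (MLA). The latent cache lands between MQA and GQA here, and it drops below MQA only when r < 2 * d_head.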

Example Architectures

MLA appears in large decoder-only language models, such as DeepSeek-V2 [1], that prioritize efficient long-context inference while retaining multi-head query pathways.

Limitations And Tradeoffs

Low-rank KV compression can limit representational flexibility versus storing full per-head tensors. The optimal latent rank depends on model scale and quality targets.
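
The flexibility limit can be seen directly in the rank of the reconstructed keys. In the sketch below, which uses hypothetical sizes and random weights, the keys for all heads are produced from one r-dimensional latent per token, so the stacked key matrix can never exceed rank r, while unconstrained per-head keys can use the full dimension.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n_heads, d_k, r = 64, 4, 16, 8             # hypothetical sizes

C = rng.normal(size=(T, r))                   # latent cache for T tokens
W_k = rng.normal(size=(n_heads * d_k, r))     # all heads' key up-projections, stacked
K_latent = C @ W_k.T                          # (T, n_heads * d_k) reconstructed keys

K_full = rng.normal(size=(T, n_heads * d_k))  # unconstrained per-head keys for comparison

rank_latent = np.linalg.matrix_rank(K_latent)  # bounded above by r
rank_full = np.linalg.matrix_rank(K_full)      # can reach min(T, n_heads * d_k)
print(rank_latent, rank_full)
```

Raising r loosens the bound at the cost of a larger cache, which is the tradeoff the latent rank controls.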

Why It Still Matters

As context windows grow, KV-cache efficiency remains a first-order serving constraint, so MLA offers a path between MHA fidelity and aggressive cache compression through latent storage.

References

  1. DeepSeek-AI. "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv, 2024, https://arxiv.org/abs/2405.04434.