Multi-Head Latent Attention
An attention variant that compresses key-value cache storage into a low-rank latent space while keeping distinct query heads.
At a glance
Optimizes
- KV Cache
- Memory Bandwidth
- Long Context Inference
Practical benefits
- lower KV-cache memory
- compressed latent KV representation
Example models
None listed yet.
What It Is
Multi-head latent attention is an attention variant derived from multi-head attention. It keeps multiple query heads but projects keys and values into a shared low-rank latent space that is cached during autoregressive decoding.
What It Optimizes
MLA targets KV-cache footprint, memory bandwidth during autoregressive decoding, and the cost of serving long contexts when full per-head KV tensors would otherwise dominate memory.
Practical Benefit
Teams adopt MLA when they need multi-head query expressiveness but want a smaller inference cache than full multi-head attention. Where grouped-query and multi-query attention shrink the cache only by sharing KV heads, MLA shrinks it by caching a latent representation instead of full head tensors; how much it saves depends on the latent rank r.
How It Works
Keys and values are down-projected into a latent KV cache with rank r. During attention, latent vectors are up-projected back to per-head key and value spaces. Queries remain head-specific, so each head attends against reconstructed KV pairs derived from the shared latent cache.
[Diagram: H query heads; latent KV cache (rank r); up-project latent cache to per-head K and V]
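The down-project, cache, and up-project flow described above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the sizes, the weight names (`W_q`, `W_dkv`, `W_uk`, `W_uv`), and the function `mla_decode_step` are hypothetical, the weights are random rather than learned, and the sketch assumes the down-projection is applied to each token's hidden state. It also rebuilds every key and value on each step for clarity, which a tuned implementation would likely avoid.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
# Illustrative sizes (assumptions): tokens, model width, heads, head dim, latent rank.
T, d_model, H, d_head, r = 5, 32, 4, 8, 6

# Hypothetical projection weights; random here, learned in a real model.
W_q = rng.standard_normal((d_model, H * d_head)) / np.sqrt(d_model)
W_dkv = rng.standard_normal((d_model, r)) / np.sqrt(d_model)  # down-projection to the latent
W_uk = rng.standard_normal((r, H * d_head)) / np.sqrt(r)      # up-projection to per-head keys
W_uv = rng.standard_normal((r, H * d_head)) / np.sqrt(r)      # up-projection to per-head values

x = rng.standard_normal((T, d_model))  # hidden states of the tokens seen so far

# The latent cache holds r numbers per token instead of 2 * H * d_head.
latent_cache = x @ W_dkv  # shape (T, r)

def mla_decode_step(x_t, latent_cache):
    """Attend the newest token's H query heads against the shared latent cache."""
    q = (x_t @ W_q).reshape(H, d_head)                          # (H, d_head)
    k = (latent_cache @ W_uk).reshape(-1, H, d_head)            # (T, H, d_head)
    v = (latent_cache @ W_uv).reshape(-1, H, d_head)            # (T, H, d_head)
    scores = np.einsum("hd,thd->ht", q, k) / np.sqrt(d_head)    # (H, T)
    weights = softmax(scores, axis=-1)                          # each head's weights sum to 1
    out = np.einsum("ht,thd->hd", weights, v)                   # (H, d_head)
    return out.reshape(H * d_head), weights

out, weights = mla_decode_step(x[-1], latent_cache)
```

Each head gets its own attention distribution in `weights` even though all heads read from the same `latent_cache`, which is the property the section describes.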
Math Or Compute Schema
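The contrast this section draws between full multi-head attention and MLA can be sketched as follows. The notation is assumed here rather than taken from the source: h_t is the hidden state of token t, d_h is the per-head dimension, i indexes the H heads, c_t is the cached latent vector, and the W matrices are learned projections with illustrative names.

```latex
% Full multi-head attention: cache k_{s,i} and v_{s,i} for every head i
q_{t,i} = W^{Q}_{i} h_t, \qquad
k_{t,i} = W^{K}_{i} h_t, \qquad
v_{t,i} = W^{V}_{i} h_t
\quad\Rightarrow\quad \text{cache per token: } 2 H d_h \text{ values}

% MLA: cache only the latent c_t and rebuild per-head keys and values from it
c_t = W^{DKV} h_t \in \mathbb{R}^{r}, \qquad
k_{t,i} = W^{UK}_{i} c_t, \qquad
v_{t,i} = W^{UV}_{i} c_t
\quad\Rightarrow\quad \text{cache per token: } r \text{ values}

% Both variants: head i of token t attends over positions s \le t
o_{t,i} = \sum_{s \le t}
  \operatorname{softmax}_{s}\!\left( \frac{q_{t,i}^{\top} k_{s,i}}{\sqrt{d_h}} \right) v_{s,i}
```

Under these assumptions the attention formula is unchanged; only what is stored per token differs.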
With H query heads and latent rank r, MLA caches compact latent KV vectors and reconstructs per-head keys and values through low-rank projections. Full multi-head attention, by contrast, caches separate keys and values for every head, whereas MLA routes all query heads through one shared latent cache.
Compared To Nearby Modules
Compared with multi-head attention, MLA stores a compressed latent cache instead of full per-head KV tensors. Compared with multi-query and grouped-query attention, MLA keeps distinct query heads while compressing KV state through low-rank projection rather than only sharing heads.

| Comparison dimension | Multi-Head Latent Attention | Multi-Head Attention | Multi-Query Attention | Grouped-Query Attention |
|---|---|---|---|---|
| KV representation | Low-rank latent KV vectors (rank r) | H key heads and H value heads | 1 key head and 1 value head | G key heads and G value heads |
| Query-head flexibility | H distinct query heads with latent KV reconstruction | H independent query/KV head pairs | H query heads share one KV pair | H distinct query heads grouped into G shared KV pairs |
| Cache footprint per token | r-dimensional latent tensors (compressed KV cache) | 2H tensors (full multi-head attention cache) | 2 tensors (single shared KV cache) | 2G tensors (G keys + G values) |
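To make the cache-footprint row concrete, the sketch below counts cached values per token per layer for each variant. It assumes each "tensor" in the table is a d_head-dimensional vector; the function name `kv_cache_values_per_token` and the example sizes are hypothetical and not drawn from any particular model.

```python
def kv_cache_values_per_token(variant, H, d_head, G=None, r=None):
    """Count cached scalars per token per layer (assumes d_head-sized head vectors)."""
    if variant == "mha":
        return 2 * H * d_head   # H keys + H values
    if variant == "gqa":
        return 2 * G * d_head   # G keys + G values
    if variant == "mqa":
        return 2 * d_head       # 1 key + 1 value
    if variant == "mla":
        return r                # one rank-r latent vector
    raise ValueError(f"unknown variant: {variant}")

# Illustrative sizes (assumptions): 32 heads of width 128, 8 KV groups, latent rank 512.
H, d_head, G, r = 32, 128, 8, 512
footprints = {
    "mha": kv_cache_values_per_token("mha", H, d_head),        # 8192
    "gqa": kv_cache_values_per_token("gqa", H, d_head, G=G),   # 2048
    "mqa": kv_cache_values_per_token("mqa", H, d_head),        # 256
    "mla": kv_cache_values_per_token("mla", H, d_head, r=r),   # 512
}
```

With these particular numbers the latent cache is 16 times smaller than the full multi-head cache and 4 times smaller than the grouped-query cache, but larger than the multi-query cache, so the saving depends on how r is chosen relative to d_head and G.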
Example Architectures
MLA appears in large decoder-only language models that prioritize efficient long-context inference while retaining multi-head query pathways.