KV Cache
Key-value caching for faster autoregressive inference: keys and values of earlier tokens are stored and reused instead of being recomputed at every decoding step.
Inference
Module
- Grouped-Query Attention
An attention variant that reduces KV cache memory by sharing key-value heads across query groups.
- Multi-Head Latent Attention
An attention variant that compresses key-value cache storage into a low-rank latent space while keeping distinct query heads.
- Multi-Query Attention
An attention variant that shares one key-value head across all query heads to minimize KV cache memory.
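
The memory savings these variants describe can be made concrete with a back-of-the-envelope calculation. The sketch below is an illustration only, not a reference implementation: the function names, the model dimensions (32 layers, 32 query heads, head dimension 128, 4096 cached tokens, 2-byte elements), and the latent width of 512 are hypothetical values chosen for the example. The latent estimate is a simplification that counts one compressed vector per token per layer and ignores any positional components a real design might cache separately.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and cached token."""
    # Factor 2: one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


def latent_cache_bytes(n_layers, latent_dim, seq_len, batch=1, bytes_per_elem=2):
    """Simplified estimate: one shared latent vector per token per layer."""
    return n_layers * latent_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape, assumed for illustration only.
cfg = dict(n_layers=32, head_dim=128, seq_len=4096)

sizes = {
    "mha": kv_cache_bytes(n_kv_heads=32, **cfg),  # one KV head per query head
    "gqa": kv_cache_bytes(n_kv_heads=8, **cfg),   # 4 query heads share each KV head
    "mqa": kv_cache_bytes(n_kv_heads=1, **cfg),   # all query heads share one KV head
    "latent": latent_cache_bytes(n_layers=32, latent_dim=512, seq_len=4096),
}

for name, size in sizes.items():
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumed numbers the cache for one sequence drops from 2048 MiB with full multi-head attention to 512 MiB with eight key-value groups, 64 MiB with a single shared key-value head, and 128 MiB for the simplified latent estimate. The ratios depend only on how many key-value values are stored per token, so they carry over to other sequence lengths and batch sizes.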