Attention
Self-attention and other attention mechanisms used in transformers, with module pages and related reference pages.
Module type
Module
- Attention
How transformer blocks mix information across token positions using query, key, and value projections.
- Grouped-Query Attention
An attention variant that reduces KV cache memory by sharing key-value heads across query groups.
- Linear Attention
An attention variant that replaces explicit softmax dot-product attention with kernel or feature-map formulations that scale near-linearly with sequence length.
- Multi-Head Attention
The baseline attention design that gives every query head its own key head and value head.
- Multi-Head Latent Attention
An attention variant that compresses key-value cache storage into a low-rank latent space while keeping distinct query heads.
- Multi-Query Attention
An attention variant that shares one key-value head across all query heads to minimize KV-cache memory.
- Sliding-Window Attention
An attention variant that restricts each query to a fixed local window of key positions instead of the full sequence.
- Sparse Attention
An attention variant that computes scores only for a subset of token pairs instead of every query-key combination.
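The key-value sharing that separates Multi-Head, Grouped-Query, and Multi-Query Attention can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the model shape (32 layers, 32 query heads, head dimension 128, 4096 cached positions, 2 bytes per value) are assumptions chosen for the example, not figures from any particular model.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Return the index of the key-value head that a query head reads from.

    Hypothetical helper: assumes query heads are split into equal,
    contiguous groups, one group per key-value head.
    """
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate KV-cache size for one sequence.

    The leading factor of 2 counts keys and values separately.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed example shape (not taken from a real model).
NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096

variants = {
    "multi-head": NUM_QUERY_HEADS,  # one KV head per query head
    "grouped-query": 8,             # 4 query heads share each KV head
    "multi-query": 1,               # all query heads share one KV head
}

for name, num_kv_heads in variants.items():
    size = kv_cache_bytes(NUM_LAYERS, num_kv_heads, HEAD_DIM, SEQ_LEN)
    shared_head = kv_head_for_query_head(5, NUM_QUERY_HEADS, num_kv_heads)
    print(f"{name}: {size / 2**20:.0f} MiB, query head 5 -> KV head {shared_head}")
```

Under these assumed numbers the cache shrinks in direct proportion to the number of key-value heads: 2048 MiB for multi-head, 512 MiB for grouped-query with 8 key-value heads, and 64 MiB for multi-query.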
Glossary
- Autoregressive Generation
A generation paradigm that produces outputs one discrete step at a time, each step conditioned on everything generated so far.
- Token
The smallest unit of text a language model reads and predicts, usually a word piece rather than a whole word. Before attention runs, each token ID is mapped through an embedding table to a dense vector whose width is the model's hidden size.
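The two glossary entries fit together in a short sketch: token IDs are looked up in an embedding table, and generation appends one token at a time, feeding everything produced so far back in. Everything here is a toy stand-in under stated assumptions: `embed`, `generate`, and `toy_next_token` are hypothetical names, the vocabulary has 5 tokens, the hidden size is 4, and the "model" is a deterministic rule rather than a trained network.

```python
VOCAB_SIZE = 5   # assumed toy vocabulary size
HIDDEN_SIZE = 4  # assumed toy hidden size

# Embedding table: one dense vector of length HIDDEN_SIZE per token ID.
# Values are arbitrary placeholders, not learned weights.
EMBEDDING_TABLE = [
    [0.1 * (token_id + 1) * (dim + 1) for dim in range(HIDDEN_SIZE)]
    for token_id in range(VOCAB_SIZE)
]


def embed(token_ids, embedding_table):
    """Map each token ID to its dense vector by table lookup."""
    return [embedding_table[token_id] for token_id in token_ids]


def toy_next_token(token_ids):
    """Stand-in for a model: picks the next token from the full history."""
    return sum(token_ids) % VOCAB_SIZE


def generate(prompt_ids, next_token_fn, max_new_tokens, eos_id=None):
    """Autoregressive loop: each step sees every token generated so far."""
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = next_token_fn(token_ids)
        token_ids.append(next_id)
        if next_id == eos_id:
            break  # stop early once the end-of-sequence token appears
    return token_ids


vectors = embed([1, 2], EMBEDDING_TABLE)
print(len(vectors), len(vectors[0]))
print(generate([1, 2], toy_next_token, max_new_tokens=3))
```

In a real transformer the call to `toy_next_token` would be replaced by a forward pass whose attention layers read the embedded history, but the outer loop keeps the same shape.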