Attention

Self-attention and other attention mechanisms used in transformers, with related module and reference pages.

Module type

Module

Attention
How transformer blocks mix information across token positions using query, key, and value projections.
Grouped-Query Attention
An attention variant that reduces KV-cache memory by sharing each key-value head across a group of query heads.
Linear Attention
An attention variant that replaces explicit softmax dot-product attention with kernel or feature-map formulations that scale near-linearly with sequence length.
Multi-Head Attention
The baseline attention design that gives every query head its own key head and value head.
Multi-Head Latent Attention
An attention variant that compresses key-value cache storage into a low-rank latent space while keeping distinct query heads.
Multi-Query Attention
An attention variant that shares one key-value head across all query heads to minimize KV-cache memory.
Sliding-Window Attention
An attention variant that restricts each query to a fixed local window of key positions instead of the full sequence.
Sparse Attention
An attention variant that computes scores only for a subset of token pairs instead of every query-key combination.
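
The entries above can be made concrete with a small sketch. It is a single-head illustration in plain Python, not a reference implementation: the function names (`attention`, `kv_head_for`), the causal masking, and the contiguous grouping of query heads are assumptions chosen for illustration. The `window` argument shows the sliding-window restriction, and `kv_head_for` shows how multi-head, grouped-query, and multi-query attention differ only in how many key-value heads the query heads share.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(q, k, v, window=None):
    """Causal scaled dot-product attention for one head.

    q, k, v: lists of vectors, one per token position.
    window:  if set, each query attends only to the most recent
             `window` positions (itself included); None means the
             full causal prefix. Causal masking is an assumption here.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        start = 0 if window is None else max(0, i - window + 1)
        positions = range(start, i + 1)
        scores = [dot(qi, k[j]) / math.sqrt(d) for j in positions]
        weights = softmax(scores)
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, positions))
            for c in range(len(v[0]))
        ])
    return out


def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the key-value head used by a given query head.

    n_kv_heads == n_query_heads -> multi-head attention (no sharing)
    1 < n_kv_heads < n_query_heads -> grouped-query attention
    n_kv_heads == 1 -> multi-query attention
    Contiguous grouping is an illustrative assumption.
    """
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be divisible by n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size
```

For example, `[kv_head_for(h, 8, 2) for h in range(8)]` maps the first four query heads to key-value head 0 and the last four to head 1, so only two key-value heads need to be cached instead of eight. Setting `window=1` in `attention` makes each position attend only to itself, so the output equals `v`.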

Glossary

Autoregressive Generation
A generation paradigm that produces outputs one discrete step at a time, each step conditioned on everything generated so far.
Token
The smallest unit of text a language model reads and predicts, usually a word piece rather than a whole word. Before attention runs, an embedding lookup maps each token ID to a dense vector whose length is the model's hidden size.
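
The two glossary entries fit together as a loop, sketched below under stated assumptions. The names `embed`, `generate`, and `toy_next_token`, the tiny vocabulary, and the embedding values are all hypothetical; `toy_next_token` is a stand-in for a real model and exists only to make the loop runnable.

```python
VOCAB_SIZE = 5    # hypothetical toy vocabulary
HIDDEN_SIZE = 3   # hypothetical model hidden size

# One dense vector of length HIDDEN_SIZE per token ID (toy values).
embedding_table = [[float(t)] * HIDDEN_SIZE for t in range(VOCAB_SIZE)]


def embed(token_ids, table):
    """Look up one dense vector per token ID."""
    return [table[t] for t in token_ids]


def toy_next_token(ids):
    """Stand-in for a model: depends on the whole prefix, not just the last token."""
    return sum(ids) % VOCAB_SIZE


def generate(next_token, prompt_ids, max_new_tokens, eos_id=None):
    """Autoregressive loop: each new token is conditioned on all tokens so far."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        new_id = next_token(ids)
        ids.append(new_id)
        if eos_id is not None and new_id == eos_id:
            break
    return ids


generated = generate(toy_next_token, [1, 2], max_new_tokens=3)
vectors = embed(generated, embedding_table)
```

In a real model, `next_token` would embed the prefix, run the attention layers over it, and pick a token from the predicted distribution; the loop structure stays the same.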