Glossary
Transformer
A model family built from stacked attention and feed-forward blocks that mix information across token positions in parallel.
What It Is
A transformer stacks blocks that pair self-attention with a position-wise feed-forward network; in encoder–decoder setups, each decoder block also applies cross-attention over the encoder output. Token embeddings enter the stack, and each block updates the hidden representation at every position. Decoder-only stacks use causal attention masks, so a position attends only to itself and earlier positions; encoder stacks can attend bidirectionally. The family name describes this block recipe, not any single checkpoint.
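The block recipe above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: it assumes a single attention head, omits layer normalization and the attention output projection, and uses a ReLU feed-forward network. The names `init_block`, `block`, and the parameter keys are hypothetical, chosen only for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h, Wq, Wk, Wv, causal):
    """Single-head self-attention over hidden states h of shape (seq_len, d_model)."""
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Mask the strict upper triangle: position i may only see positions j <= i.
        seq_len = h.shape[0]
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def feed_forward(h, W1, b1, W2, b2):
    """Position-wise MLP: the same weights are applied to each position independently."""
    return np.maximum(h @ W1 + b1, 0.0) @ W2 + b2

def block(h, p, causal):
    """One block: attention then feed-forward, each wrapped in a residual connection."""
    attn_out, weights = self_attention(h, p["Wq"], p["Wk"], p["Wv"], causal)
    h = h + attn_out
    h = h + feed_forward(h, p["W1"], p["b1"], p["W2"], p["b2"])
    return h, weights

def init_block(rng, d_model, d_ff):
    """Random parameters for one block (hypothetical helper, illustration only)."""
    def w(*shape):
        return rng.normal(0.0, 0.1, size=shape)
    return {"Wq": w(d_model, d_model), "Wk": w(d_model, d_model), "Wv": w(d_model, d_model),
            "W1": w(d_model, d_ff), "b1": np.zeros(d_ff),
            "W2": w(d_ff, d_model), "b2": np.zeros(d_model)}

rng = np.random.default_rng(0)
params = init_block(rng, d_model=8, d_ff=32)
hidden = rng.normal(size=(5, 8))           # 5 positions, width 8
_, causal_w = block(hidden, params, causal=True)
_, bidir_w = block(hidden, params, causal=False)
print(np.round(causal_w, 2))               # lower-triangular attention weights
print(np.round(bidir_w, 2))                # dense attention weights
```

The only difference between the decoder-style and encoder-style call is the `causal` flag: with the mask on, every weight above the diagonal is zero, so changing a later token cannot alter the output at an earlier position.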
Why It Matters
Transformers dominate language, code, and many multimodal systems because attention handles variable-length sequences, lets any position draw on any other in a single step, and parallelizes well on accelerators, although its cost grows quadratically with sequence length. Knowing the family separates architecture choices (decoder-only versus encoder–decoder, or grouped-query attention to shrink the key-value cache during decoding) from training data or product packaging. The architecture page situates transformers among other families; this page is the dedicated reference for the block stack itself.
Simple Example
A decoder-only transformer for chat embeds each prompt token, runs twelve structurally identical blocks (each with its own weights) of causal self-attention plus MLP, then projects the hidden state at the last position to vocabulary logits for the next token. GPT-2, Llama, and many instruction-tuned models share that loop even when layer counts and widths differ.
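A toy version of that loop might look like the sketch below. It runs with random, untrained weights, so the generated ids are meaningless; the point is the data flow. Several details are assumptions made for illustration: a single attention head, no layer normalization, a learned position table added to the token embeddings (GPT-2 uses learned position embeddings; Llama uses rotary embeddings instead), and greedy argmax decoding. All function names and sizes here are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_block(h, p):
    """One block: causal self-attention, then a position-wise MLP, each with a residual."""
    seq_len, d_model = h.shape
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    scores = q @ k.T / np.sqrt(d_model)
    # Hide future positions so each token only sees itself and earlier tokens.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    h = h + softmax(scores) @ v
    return h + np.maximum(h @ p["W1"], 0.0) @ p["W2"]

def init_model(rng, vocab_size, d_model, n_layers, max_len):
    """Random weights; every block has the same shapes but its own parameters."""
    def w(*shape):
        return rng.normal(0.0, 0.02, size=shape)
    blocks = [
        {"Wq": w(d_model, d_model), "Wk": w(d_model, d_model), "Wv": w(d_model, d_model),
         "W1": w(d_model, 4 * d_model), "W2": w(4 * d_model, d_model)}
        for _ in range(n_layers)
    ]
    return {"embed": w(vocab_size, d_model), "pos": w(max_len, d_model),
            "blocks": blocks, "unembed": w(d_model, vocab_size)}

def next_token_logits(token_ids, model):
    ids = np.asarray(token_ids)
    h = model["embed"][ids] + model["pos"][: len(ids)]   # embed each prompt token
    for p in model["blocks"]:                            # run the block stack
        h = causal_block(h, p)
    return h[-1] @ model["unembed"]                      # last position -> vocabulary logits

def greedy_generate(token_ids, model, n_new):
    """Append the argmax token n_new times, re-running the stack on the growing sequence."""
    out = list(token_ids)
    for _ in range(n_new):
        out.append(int(np.argmax(next_token_logits(out, model))))
    return out

model = init_model(np.random.default_rng(0), vocab_size=50, d_model=16, n_layers=12, max_len=32)
print(greedy_generate([3, 14, 15], model, n_new=4))
```

Changing `n_layers` or `d_model` alters only the shapes passed to `init_model`; the loop in `next_token_logits` stays the same, which is the sense in which models of different sizes share one recipe. Production implementations also cache keys and values between steps rather than recomputing the full sequence each time.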