Transformer architecture
How attention, feed-forward layers, normalization, residual connections, and position information fit together in the repeating decoder-style transformer block.
What It Is
A transformer architecture stacks many structurally identical blocks on top of token embeddings. Each block usually runs attention first, so every position can read from other positions, then a position-wise feed-forward network that applies the same small MLP to every token independently. Normalization and residual connections wrap those sublayers so activations stay stable and gradients can flow through depth. Position information is added before or inside attention so that order matters: on its own, unmasked self-attention is permutation-equivariant, meaning that shuffling the input tokens only shuffles the outputs in the same way.
Why It Matters
Most modern language models share this block pattern even when widths, depths, and attention variants differ. Knowing the loop helps you map unfamiliar model cards to the right module pages: attention for context mixing, feed-forward for per-token refinement, normalization for training stability, residuals for depth, and positional schemes for how token order is represented across the context window.
Simple Example
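As a hedged illustration of the block pattern described under What It Is, the following is a minimal NumPy sketch, not a reference implementation. It assumes a pre-norm layout, a single attention head, a ReLU feed-forward layer, learned absolute position embeddings, and an output head tied to the embedding matrix. Every function name, size, and random weight here is a hypothetical choice made for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale and shift are omitted to keep the sketch short).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_self_attention(x, w_q, w_k, w_v, w_o):
    # Single-head attention: each position reads itself and earlier positions.
    seq_len, d_model = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(d_model)
    # Mask out future positions (strictly above the diagonal).
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ v) @ w_o

def feed_forward(x, w_in, w_out):
    # Position-wise MLP: the same two-layer network is applied to every token.
    return np.maximum(x @ w_in, 0.0) @ w_out

def block(x, p):
    # Pre-norm layout (an assumption): normalize, run the sublayer,
    # then add the result back onto the residual stream.
    x = x + causal_self_attention(layer_norm(x), p["w_q"], p["w_k"], p["w_v"], p["w_o"])
    x = x + feed_forward(layer_norm(x), p["w_in"], p["w_out"])
    return x

def init_block(rng, d_model, d_ff, scale=0.02):
    # Random weights stand in for trained parameters.
    return {
        "w_q": rng.normal(0.0, scale, size=(d_model, d_model)),
        "w_k": rng.normal(0.0, scale, size=(d_model, d_model)),
        "w_v": rng.normal(0.0, scale, size=(d_model, d_model)),
        "w_o": rng.normal(0.0, scale, size=(d_model, d_model)),
        "w_in": rng.normal(0.0, scale, size=(d_model, d_ff)),
        "w_out": rng.normal(0.0, scale, size=(d_ff, d_model)),
    }

def forward(token_ids, embed, pos_embed, blocks):
    # Token embeddings plus position embeddings form the initial residual stream.
    x = embed[token_ids] + pos_embed[: len(token_ids)]
    for p in blocks:
        x = block(x, p)  # same recipe in every block, different weights
    x = layer_norm(x)    # final norm
    return x @ embed.T   # tied language-model head: one score per vocabulary entry

rng = np.random.default_rng(0)
vocab_size, d_model, d_ff, max_len = 11, 8, 32, 16
embed = rng.normal(0.0, 0.02, size=(vocab_size, d_model))
pos_embed = rng.normal(0.0, 0.02, size=(max_len, d_model))
blocks = [init_block(rng, d_model, d_ff) for _ in range(2)]

logits = forward([1, 2, 3, 4], embed, pos_embed, blocks)
print(logits.shape)  # (4, 11)
```

The last row of `logits` holds the scores for the token that would follow the input. Production models typically differ in the details assumed here, for example multi-head attention, RMS Norm, gated feed-forward layers, or RoPE in place of learned position embeddings.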
Picture a decoder-only stack for text generation. Token embeddings enter block one: causal self-attention lets each token attend to itself and earlier tokens, the feed-forward layer expands and filters each vector, layer normalization keeps scales steady, and a residual add writes each sublayer's update back onto the running stream. Block two repeats the same recipe on the updated vectors, and the stack continues until a final norm and language-model head produce scores for the next token.
Where It Appears
Decoder-only transformers in chat and completion models, encoder stacks in embedding models, and encoder–decoder stacks in translation systems all use variations of this loop. Width, depth, attention variant, feed-forward size, and normalization choice change between families, but the repeating block layout is the shared skeleton.
Shared tag
- activations
- ALiBi
- Alignment
- Autoregressive generation
- backprop
- computation graph
- Conditioning
- context extension
- context length
- data modality
- Decoder
- Denoising generation
- diffusion models
- discriminative models
- embeddings
- Emergent behavior
- Encoder
- FFN
- foundation models
- Generalization
- generative models
- gradients
- hidden size
- image patch
- latent code
- latent manifold
- LayerNorm
- learned representation
- logits
- loss
- ML model
- model architecture
- Model capacity
- model component
- model module
- model parameters
- MoE
- multimodal models
- normalization layer
- optimizer state
- output entropy
- Overfitting
- Perplexity
- positional encoding
- RMS Norm
- RoPE
- sampling temperature
- Scaling law
- seq2seq
- skip connection
- softmax function
- tensors
- token
- Transformers
- vector
- why long context is hard
- world models
Same concept type
- activations
- computation graph
- Decoder
- diffusion models
- embeddings
- Encoder
- multimodal models
- seq2seq
- token
- Transformers
- world models
Curated related
- attention
- FFN
- normalization layer
- skip connection
- positional encoding