Transformer architecture

How attention, feed-forward layers, normalization, residual connections, and position information fit together in the repeating decoder-style transformer block.

A transformer block is the unit that repeats through the stack: each token vector passes through attention and a dense feed-forward transform, with normalization and a residual add around each, so the model can mix context and refine features layer by layer.

What It Is

A transformer architecture stacks many identical blocks on top of token embeddings. Each block usually runs attention first, so every position can read from other positions, then a position-wise feed-forward network that applies the same small MLP to every token independently. Normalization and residual connections wrap both sublayers so activations stay stable and gradients can flow through depth. Position information is added to the embeddings or injected inside attention (as RoPE and ALiBi do), because attention on its own is permutation-equivariant: shuffle the input tokens and the outputs are shuffled the same way, so order carries no signal.
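The block described above can be sketched in a few dozen lines of plain Python. This is only an illustration under stated assumptions, not any particular model's implementation: it uses a single attention head, the pre-norm ordering (normalize, run the sublayer, then add the residual), a ReLU feed-forward layer, and a layer norm without learned scale or shift. The helper names (`init_block`, `transformer_block`, and so on) and the toy inputs are hypothetical.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(xs, Wq, Wk, Wv, Wo):
    """Single-head attention in which position i reads positions 0..i only."""
    d = len(xs[0])
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    out = []
    for i, q in enumerate(qs):
        scores = [sum(a * b for a, b in zip(q, ks[j])) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        mixed = [sum(w * vs[j][c] for j, w in enumerate(weights)) for c in range(d)]
        out.append(matvec(Wo, mixed))
    return out

def feed_forward(x, W1, W2):
    """Position-wise MLP: expand, apply ReLU, project back to the model width."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)

def transformer_block(xs, p):
    """Pre-norm block: x + Attn(Norm(x)), then x + FFN(Norm(x))."""
    normed = [layer_norm(x) for x in xs]
    attn = causal_self_attention(normed, p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    xs = [[a + b for a, b in zip(x, u)] for x, u in zip(xs, attn)]    # residual add 1
    ffn = [feed_forward(layer_norm(x), p["W1"], p["W2"]) for x in xs]
    return [[a + b for a, b in zip(x, u)] for x, u in zip(xs, ffn)]   # residual add 2

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def init_block(d_model, d_ff, seed=0):
    """Random (untrained) weights, for shape-checking only."""
    rng = random.Random(seed)
    p = {name: random_matrix(d_model, d_model, rng) for name in ("Wq", "Wk", "Wv", "Wo")}
    p["W1"] = random_matrix(d_ff, d_model, rng)
    p["W2"] = random_matrix(d_model, d_ff, rng)
    return p

params = init_block(d_model=4, d_ff=16)
tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, -0.1, 0.0, 0.2], [-0.3, 0.4, 0.1, 0.0]]
out = transformer_block(tokens, params)  # same shape as the input: 3 tokens, 4 features each
```

Because the output has the same shape as the input, the same function can be applied again with a fresh set of weights, which is what stacking blocks amounts to. Post-norm variants instead apply normalization after each residual add.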

Why It Matters

Most modern language models share this block pattern even when widths, depths, and attention variants differ. Knowing the loop helps you map unfamiliar model cards to the right module pages: attention for context mixing, feed-forward for per-token refinement, normalization for training stability, residuals for trainable depth, and positional schemes for how token order is represented and how well it holds up over long contexts.

Simple Example

Picture a decoder-only stack for text generation. Token embeddings enter block one: causal self-attention lets each token attend to itself and to earlier tokens, and a residual add writes that update back onto the running stream. The feed-forward layer then expands and filters each vector, followed by a second residual add, while layer normalization around each sublayer keeps scales steady. Block two repeats the same recipe on the updated vectors, and the stack continues until a final norm and language-model head predict the next token.
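The outer loop of that walk-through (embed, run every block, apply a final norm, score the vocabulary) might look like the sketch below. To stay short it does not implement real attention or feed-forward layers: `toy_block` is a hypothetical stand-in that adds a causal running mean to the stream, the embedding table is made up, and the language-model head is assumed to reuse the embedding table (weight tying), which is one common choice but not the only one.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_block(xs):
    """Stand-in for a real block: each position adds the mean of itself and earlier positions."""
    out = []
    for i, x in enumerate(xs):
        prefix = xs[: i + 1]                       # causal: never look ahead
        mean = [sum(col) / len(prefix) for col in zip(*prefix)]
        out.append([a + b for a, b in zip(x, mean)])  # residual add onto the stream
    return out

def next_token_logits(token_ids, embedding, blocks):
    """Embed tokens, run every block in order, then score each vocabulary entry."""
    stream = [list(embedding[t]) for t in token_ids]  # token embeddings
    for block in blocks:                              # same recipe, repeated per layer
        stream = block(stream)
    last = layer_norm(stream[-1])                     # final norm on the last position
    # Language-model head, assumed here to be tied to the embedding table.
    return [sum(a * b for a, b in zip(last, row)) for row in embedding]

def greedy_next_token(token_ids, embedding, blocks):
    """Pick the highest-scoring vocabulary id (greedy decoding)."""
    logits = next_token_logits(token_ids, embedding, blocks)
    return max(range(len(logits)), key=logits.__getitem__)

# Made-up 4-entry vocabulary with 3-dimensional embeddings.
embedding = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.5, 0.5, 0.0]]
blocks = [toy_block, toy_block]
logits = next_token_logits([0, 1, 3], embedding, blocks)
next_id = greedy_next_token([0, 1, 3], embedding, blocks)
```

Generation repeats this step: append `next_id` to the input ids and call the function again.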

Where It Appears

Decoder-only transformers in chat and completion models, encoder stacks in embedding models, and encoder–decoder stacks in translation systems all use variations of this loop. Width, depth, attention variant, feed-forward size, and normalization choice change between families, but the repeating block layout is the shared skeleton.
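One way to see how the same skeleton yields very different model sizes is a rough weight count. The sketch below assumes standard multi-head attention with full query, key, value, and output projections, a two-matrix feed-forward layer (no gating), and an output head tied to the token embeddings. It ignores biases, normalization parameters, and any positional tables. `BlockConfig` and both example configurations are hypothetical and do not describe any named model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockConfig:
    d_model: int     # hidden size (width)
    n_layers: int    # number of stacked blocks (depth)
    d_ff: int        # feed-forward inner size
    vocab_size: int  # number of token embeddings

def approx_param_count(cfg: BlockConfig) -> int:
    """Rough weight count under the assumptions stated in the lead-in."""
    attention = 4 * cfg.d_model * cfg.d_model     # Q, K, V, and output projections
    feed_forward = 2 * cfg.d_model * cfg.d_ff     # up-projection and down-projection
    embeddings = cfg.vocab_size * cfg.d_model     # token table, shared with the head
    return cfg.n_layers * (attention + feed_forward) + embeddings

small = BlockConfig(d_model=512, n_layers=6, d_ff=2048, vocab_size=32000)
large = BlockConfig(d_model=1024, n_layers=24, d_ff=4096, vocab_size=32000)

small_params = approx_param_count(small)
large_params = approx_param_count(large)
```

Both configurations use the identical block layout. Only the numbers plugged into it change, and the per-block cost grows with the square of the width.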

Related Concepts And Modules

attention
FFN
normalization layer
skip connection
positional encoding

Common Confusions

A transformer block is not the same as an attention module: attention is one sublayer inside the block. The architecture is also not a single trained checkpoint, since many models share the layout at different parameter counts. Finally, positional encoding is part of how the block represents order; it is not a replacement for attention or feed-forward work.
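The last point can be checked with a toy experiment: without any position signal, unmasked self-attention treats its input as an unordered set, so shuffling the tokens merely shuffles the outputs. The sketch below assumes a single head with identity projections, and `add_positions` is a deliberately crude, hypothetical position signal used only to show that order starts to matter once positions are mixed in.

```python
import math

def self_attention(xs):
    """Unmasked single-head attention with identity Q/K/V projections (toy setup)."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, xs)) for c in range(d)])
    return out

def add_positions(xs):
    """Hypothetical position signal: add 0.1 * position index to every feature."""
    return [[v + 0.1 * i for v in x] for i, x in enumerate(xs)]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

# Without positions: output j of the shuffled run equals output perm[j] of the original run.
plain = self_attention(tokens)
moved = self_attention(shuffled)
same_up_to_order = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, row in zip(perm, moved)
    for a, b in zip(plain[i], row)
)

# With positions: the same comparison fails, so token order now affects the result.
with_pos = self_attention(add_positions(tokens))
with_pos_moved = self_attention(add_positions(shuffled))
order_matters = any(
    not math.isclose(a, b, abs_tol=1e-6)
    for i, row in zip(perm, with_pos_moved)
    for a, b in zip(with_pos[i], row)
)
```

Here `same_up_to_order` and `order_matters` both come out true: attention does the mixing, and the positional signal is what lets that mixing depend on order.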
