Normalization

What It Is

Normalization layers rescale each hidden vector so that its values stay in a predictable range before the next sublayer runs. In transformers, this usually means a layer-style norm applied across the features of a single token vector: the statistics (mean and variance for layer norm, root mean square for RMSNorm) are computed per position rather than across the whole minibatch. A learned per-feature scale, plus a learned shift in standard layer norm, then lets the model restore whatever magnitude and offset are useful after the rescale.
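
As a sketch, the per-token computation can be written out as equations. The symbols are notation assumed here rather than taken from the text: x is one token vector with d features, gamma and beta are the learned scale and shift, and epsilon is a small constant for numerical stability. The second line shows RMSNorm in its usual form, which drops the mean and the shift.

```latex
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \qquad
y_i = \gamma_i \, \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i

y_i^{\mathrm{RMS}} = \gamma_i \, \frac{x_i}{\sqrt{\frac{1}{d}\sum_{j=1}^{d} x_j^2 + \epsilon}}
```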

Why It Matters

Deep stacks chain many matrix operations; without normalization, activation magnitudes can drift, vanish, or explode, and training becomes fragile. Transformers favor layer norm or RMSNorm because their statistics are computed per token, so they are unaffected when sequence length and batch size change at inference time; batch norm depends on batch-wide statistics, which are unstable for variable-length text. Locating the normalization slot in each block also tells you whether a model uses pre-norm placement (normalize the input to attention and the feed-forward network, then add the residual) or post-norm placement (add the residual, then normalize).
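
The two placements can be compared in a minimal sketch. Everything named here is illustrative: `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network, and the layer norm omits the learned gain and bias for brevity.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector across its features (gain=1, bias=0 for brevity)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for attention or the FFN: scales every feature by 3."""
    return [3.0 * v for v in x]

def pre_norm_step(x, sublayer):
    # Pre-norm: normalize the sublayer input; the residual path stays un-normalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm_step(x, sublayer):
    # Post-norm: add the residual first, then normalize the sum.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [10.0, 20.0, 30.0, 40.0]
pre_out = pre_norm_step(x, toy_sublayer)
post_out = post_norm_step(x, toy_sublayer)
```

In this sketch the pre-norm output keeps the input's mean, because the skip path carries the raw vector, while the post-norm output is itself normalized.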

Simple Example

Picture one token vector leaving self-attention with uneven feature magnitudes. A layer norm step subtracts the mean across that vector's dimensions, divides by the standard deviation (with a small epsilon added for numerical stability), and applies the learned gain and bias. The feed-forward network then sees a well-scaled input. The same recipe repeats at the next normalization slot: in a pre-norm block that is the input to the next block's attention, and in a post-norm block it follows the residual add after the FFN.
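
The example can be traced with concrete numbers. This is a minimal sketch using only the standard library; the token values, the gain of 1, and the bias of 0 are made up for illustration, whereas in a trained model gain and bias are learned.

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    """Layer norm over the features of one token vector."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)  # eps keeps the division stable
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gain, bias)]

# Hypothetical attention output with very uneven feature magnitudes.
token = [0.5, -12.0, 3.0, 40.0]
gain = [1.0] * len(token)
bias = [0.0] * len(token)

normed = layer_norm(token, gain, bias)

# After the step, the features have mean close to 0 and variance close to 1.
normed_mean = sum(normed) / len(normed)
normed_var = sum((v - normed_mean) ** 2 for v in normed) / len(normed)
```

The relative ordering of the features is preserved; only their offset and scale change before the vector reaches the feed-forward network.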

Common Confusions

Normalization is not the residual add: a residual preserves a skip path by adding a sublayer's input to its output, while a norm layer actively rescales activations. Layer norm and RMSNorm are the common transformer choices; they differ in that RMSNorm skips the mean subtraction and usually the bias. Batch norm is rare in decoder stacks because per-batch mean and variance are a poor fit for batches of variable-length sequences. Finally, normalization does not mix tokens: each position is normalized using only its own feature vector.
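
The last point can be checked with a small sketch: changing one token leaves the normalized output of every other token untouched. The RMSNorm variant below is a simplified, assumed form (scale only, no bias), and the sequences are made-up values.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm sketch: divide by the root mean square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]

def normalize_sequence(tokens, gain):
    # Each position is normalized on its own; no statistic is shared across tokens.
    return [rms_norm(t, gain) for t in tokens]

gain = [1.0, 1.0, 1.0]
seq_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
seq_b = [[1.0, 2.0, 3.0], [400.0, -500.0, 600.0]]  # only the second token differs

out_a = normalize_sequence(seq_a, gain)
out_b = normalize_sequence(seq_b, gain)

# The first token's output is identical in both sequences.
same_first_token = out_a[0] == out_b[0]
```

A batch-norm style layer would not pass this check in training mode, because its statistics would be pooled over other vectors in the batch.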
