Layer norm

Per-token mean-and-variance normalization that rescales each hidden vector before the next sublayer in a transformer block.

What It Is

Layer normalization (LayerNorm) rescales one token's hidden vector by subtracting the mean across its features, dividing by a stabilized standard deviation, then applying learned scale and shift parameters. Statistics are computed per position, not across the batch or sequence, so the result for a token does not depend on batch size, sequence length, or any other token. That independence suits decoder transformers, which process variable-length text.
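The description above can be sketched in NumPy. This is a minimal illustration, not a library implementation: the function name `layer_norm`, the tensor sizes, and the default `eps=1e-5` are assumptions chosen for the demo, and the gain and bias are set to identity values so the normalization itself is visible.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize the last axis (features) of x, then apply gain and bias."""
    mu = x.mean(axis=-1, keepdims=True)   # one mean per token position
    var = x.var(axis=-1, keepdims=True)   # one population variance per position
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
batch, seq_len, d_model = 2, 3, 8          # hypothetical sizes
x = rng.normal(loc=4.0, scale=3.0, size=(batch, seq_len, d_model))
gamma = np.ones(d_model)                   # identity gain for the demo
beta = np.zeros(d_model)                   # zero shift for the demo

y = layer_norm(x, gamma, beta)
per_token_mean = y.mean(axis=-1)   # shape (batch, seq_len), each close to 0
per_token_var = y.var(axis=-1)     # shape (batch, seq_len), each close to 1
```

Reducing over `axis=-1` is what makes the statistics per position: the batch and sequence axes are never pooled.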

Why It Matters

The original Transformer applied layer norm to the output of each attention and feed-forward sublayer, after the residual add, to keep activations stable as depth grows. Many encoder-decoder and early decoder models still rely on it. Understanding the mean-centering step helps when you compare layer norm with RMSNorm, which drops the mean subtraction but keeps per-feature scaling.
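To make the comparison with RMSNorm concrete, here is a small NumPy sketch. The function names, the example vector, and the placement of `eps` inside the square root are assumptions for illustration. The RMSNorm variant shown has a gain but no bias, which is a common formulation but not the only one.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # No mean subtraction: divide by the root mean square of the features.
    mean_sq = np.mean(x * x, axis=-1, keepdims=True)
    return gamma * x / np.sqrt(mean_sq + eps)

x = np.array([1.0, 2.0, 3.0, 6.0])   # made-up feature vector with mean 3
gamma = np.ones_like(x)
beta = np.zeros_like(x)

ln_out = layer_norm(x, gamma, beta)   # centered: features average to about 0
rms_out = rms_norm(x, gamma)          # not centered: the offset survives

# On an input that is already zero-mean, the two coincide.
rms_on_centered = rms_norm(x - x.mean(), gamma)
```

The only difference between the two functions is the `x - mu` term, so they agree exactly when the input already has zero mean across its features.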

Simple Example

After self-attention, one token vector might have features ranging from -3 to 12. Layer norm measures the mean and spread across those features, subtracts the mean, divides by the stabilized standard deviation, then multiplies by the learned gain γ and adds the learned bias β. The feed-forward network receives a vector whose feature magnitudes sit in a predictable band.
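The same steps with concrete numbers, as a worked sketch. The six feature values, the gain of 2.0, and the bias of 0.5 are made up so the arithmetic comes out cleanly; real learned parameters differ per feature.

```python
import numpy as np

x = np.array([-3.0, -3.0, 2.0, 2.0, 2.0, 12.0])  # made-up features in [-3, 12]
eps = 1e-5

mu = x.mean()                            # 2.0
var = x.var()                            # 25.0 (population variance)
x_hat = (x - mu) / np.sqrt(var + eps)    # roughly [-1, -1, 0, 0, 0, 2]

gamma = np.full_like(x, 2.0)   # hypothetical learned gain
beta = np.full_like(x, 0.5)    # hypothetical learned bias
y = gamma * x_hat + beta       # roughly [-1.5, -1.5, 0.5, 0.5, 0.5, 4.5]
```

The values are "roughly" those shown because ε makes the divisor slightly larger than the true standard deviation of 5.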

Formula And Symbols

The rescale below applies to one token position. μ and σ² are computed from that position's vector alone, not from batch-wide statistics, while γ and β are learned parameters shared across all positions.

Layer normalization

\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta

x: Hidden feature vector at one token position.
μ: Mean of the features in x.
σ²: Variance of the features in x.
ε: Small positive constant for numerical stability.
γ: Learned per-feature scale vector.
β: Learned per-feature shift vector.
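For completeness, the two statistics can be written out explicitly. The symbol d for the number of features in x, and the index i, are notation introduced here and not part of the list above. The variance divides by d (the population form), and ⊙ in the formula denotes the element-wise product.

```latex
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
```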

Common Confusions

Layer norm is not batch norm: it never pools statistics across examples in a minibatch. It is also not the residual add: the residual path carries the skip stream forward unchanged, while the norm actively rescales the vector it is applied to. RMSNorm is a close cousin that omits mean centering; both compute their statistics over the features of a single token rather than across tokens.
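The batch norm distinction comes down to which axis the statistics are taken over. The sketch below is deliberately simplified: `normalize` is a hypothetical helper with no learned gain or bias, and the batch-style branch uses only training-time batch statistics, leaving out the running averages a real batch norm layer keeps.

```python
import numpy as np

def normalize(x, axis, eps=1e-5):
    """Zero-mean, unit-variance rescale along the given axis (no gain or bias)."""
    mu = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # (examples, features), hypothetical sizes

x_changed = x.copy()
x_changed[1:] += 10.0            # alter every example except the first

# Layer-norm style: statistics over features, separately for each example.
ln_before = normalize(x, axis=-1)[0]
ln_after = normalize(x_changed, axis=-1)[0]

# Batch-norm style: statistics per feature, pooled over the examples.
bn_before = normalize(x, axis=0)[0]
bn_after = normalize(x_changed, axis=0)[0]

ln_unchanged = np.allclose(ln_before, ln_after)   # first example unaffected
bn_unchanged = np.allclose(bn_before, bn_after)   # first example shifts with the batch
```

Changing the other examples leaves the layer-norm output for the first example untouched, while the batch-style output for that same example moves, because its statistics were pooled across the minibatch.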
