Glossary

Layer norm

Per-token mean-and-variance normalization that rescales each hidden vector before the next sublayer in a transformer block.

What It Is

Layer normalization (LayerNorm) rescales one token's hidden vector by subtracting the mean across its features, dividing by a stabilized standard deviation, then applying learned scale and shift parameters. Statistics are computed per position, not across the batch or sequence, so the result for a token does not depend on batch size or sequence length. That independence suits decoder transformers, which process variable-length text.
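The description above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `layer_norm`, the `(batch, sequence, features)` layout, and the default `eps=1e-5` are assumptions chosen for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last (feature) axis, independently for each token."""
    mean = x.mean(axis=-1, keepdims=True)       # one mean per token position
    var = x.var(axis=-1, keepdims=True)         # population variance per token
    x_hat = (x - mean) / np.sqrt(var + eps)     # center, then scale
    return gamma * x_hat + beta                 # learned per-feature scale and shift

# Hypothetical activations: 2 sequences, 3 tokens each, 8 features per token.
rng = np.random.default_rng(0)
hidden = rng.normal(loc=2.0, scale=5.0, size=(2, 3, 8))
gamma = np.ones(8)    # identity scale, so the raw normalization is visible
beta = np.zeros(8)    # no shift

out = layer_norm(hidden, gamma, beta)
token_means = out.mean(axis=-1)   # close to 0 for every token
token_vars = out.var(axis=-1)     # close to 1 for every token
```

Because the statistics are taken over the last axis only, adding or removing sequences in the batch leaves each token's output unchanged.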

Why It Matters

Original transformer stacks used layer norm around attention and feed-forward sublayers to keep activations stable as depth grows. Many encoder-decoder and early decoder models still rely on it. Understanding the mean-centering step helps when you compare layer norm with RMSNorm, which drops the mean subtraction but keeps per-feature scaling.
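To make the comparison with RMSNorm concrete, here is a hedged side-by-side sketch. The function names, the input vector, and the placement of `eps` inside the square root are assumptions for illustration; real libraries may differ in those details.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # No mean subtraction: divide by the root mean square of the features.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([1.0, 2.0, 3.0, 10.0])   # hypothetical token vector with mean 4
gamma = np.ones(4)
beta = np.zeros(4)

ln_out = layer_norm(x, gamma, beta)    # centered: output mean is about 0
rms_out = rms_norm(x, gamma)           # not centered: output keeps a positive mean

# When the input already has zero mean, the two normalizations coincide.
centered = x - x.mean()
ln_centered = layer_norm(centered, gamma, beta)
rms_centered = rms_norm(centered, gamma)
```

The only difference in this sketch is the mean-centering step, which is why the outputs match once the input is centered beforehand.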

Simple Example

After self-attention, one token vector might have features ranging from -3 to 12. Layer norm measures the mean and spread across those features, subtracts the mean, divides by the stabilized standard deviation, then multiplies by the learned gain γ and adds the learned bias β. The feed-forward network receives a vector whose feature magnitudes sit in a predictable band.
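The same steps can be traced with numbers. The four-feature vector and the values of γ and β below are made up for illustration; only the range from -3 to 12 comes from the example above.

```python
import numpy as np

x = np.array([-3.0, 0.0, 3.0, 12.0])   # hypothetical features spanning -3 to 12
eps = 1e-5                              # assumed stability constant

mu = x.mean()                           # 3.0
var = x.var()                           # 31.5 (population variance)
x_hat = (x - mu) / np.sqrt(var + eps)   # roughly [-1.069, -0.535, 0.0, 1.604]

gamma = np.full(4, 2.0)                 # hypothetical learned gain
beta = np.full(4, 0.5)                  # hypothetical learned bias
y = gamma * x_hat + beta                # roughly [-1.638, -0.569, 0.5, 3.707]

print(np.round(x_hat, 3))
print(np.round(y, 3))
```

The raw features span 15 units, while the normalized vector has zero mean and unit variance before γ and β are applied.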

Formula And Symbols

The rescale below applies to one token position. Each symbol names a quantity in that vector, not a batch-wide statistic.

Layer normalization:

$$\text{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

x: Hidden feature vector at one token position.
μ: Mean of the features in x.
σ²: Variance of the features in x.
ε: Small positive constant for numerical stability.
γ: Learned per-feature scale vector.
β: Learned per-feature shift vector.
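The mean and variance can also be written out explicitly. The feature count d and the index i are notation introduced here for illustration; the variance is shown in its population (divide-by-d) form, which is the usual choice in layer norm implementations.

```latex
\mu = \frac{1}{d} \sum_{i=1}^{d} x_i,
\qquad
\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2
```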

Common Confusions

Layer norm is not batch norm: it never pools statistics across examples in a minibatch. It is also not the residual add. The residual path carries the skip stream forward unchanged, while the norm actively rescales the vector it receives. RMSNorm is a close cousin that omits mean centering; both compute their statistics over the features of a single token rather than across tokens.
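The difference from batch norm comes down to which axis the statistics are taken over. The sketch below is a simplified illustration: `batch_stat_norm` is a hypothetical helper that mimics only the training-time statistics of batch norm (no running averages, no learned parameters), and the shapes are made up.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Statistics over the feature axis: each example is normalized on its own.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_stat_norm(x, eps=1e-5):
    # Statistics over the batch axis: each feature is pooled across examples.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch_a = rng.normal(size=(4, 6))                 # 4 examples, 6 features
batch_b = batch_a.copy()
batch_b[1:] = rng.normal(loc=5.0, size=(3, 6))    # change every example except the first

# Does the first example's output survive a change in its batch neighbors?
ln_same = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
bn_same = np.allclose(batch_stat_norm(batch_a)[0], batch_stat_norm(batch_b)[0])
print(ln_same, bn_same)   # True False
```

The first example's layer norm output is identical in both batches, while its batch-statistic output shifts because the pooled mean and variance changed.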
