Per-token mean-and-variance normalization that rescales each hidden vector before the next sublayer in a transformer block.
What It Is
Layer normalization (LayerNorm) rescales one token's hidden vector by subtracting the mean across its features, dividing by a stabilized standard deviation (the square root of the variance plus a small constant), then applying learned scale and shift parameters. Statistics are computed per position, not across the batch or sequence, which matches how decoder transformers process variable-length text.
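The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function names `layer_norm` and `layer_norm_sequence`, the default `eps=1e-5`, and the use of the population variance are assumptions made for the example.

```python
import math
from statistics import fmean, pvariance

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token's hidden vector across its features."""
    mean = fmean(x)                        # mean over the features of this token
    var = pvariance(x, mu=mean)            # population variance over the same features
    inv_std = 1.0 / math.sqrt(var + eps)   # eps stabilizes the division (assumed value)
    # Learned scale (gamma) and shift (beta) are applied per feature.
    return [g * (xi - mean) * inv_std + b for xi, g, b in zip(x, gamma, beta)]

def layer_norm_sequence(tokens, gamma, beta, eps=1e-5):
    """Apply the same rescale independently at every position."""
    return [layer_norm(x, gamma, beta, eps) for x in tokens]
```

Because each position is handled on its own, the result for one token does not depend on how many other tokens or examples are present.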
Why It Matters
Original transformer stacks used layer norm around attention and feed-forward sublayers to keep activations stable as depth grows. Many encoder-decoder and early decoder models still rely on it. Understanding the mean-centering step helps when you compare layer norm with RMSNorm, which drops the mean subtraction but keeps per-feature scaling.
Simple Example
After self-attention, one token vector might have features ranging from -3 to 12. Layer norm measures the mean and spread across those features, divides by the stabilized scale, then multiplies by learned gain γ and adds learned bias β. The feed-forward network receives a vector whose feature magnitudes sit in a predictable band.
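To make the example concrete, the sketch below works through the arithmetic for a four-feature vector. The specific values, the identity gain, the zero bias, and `eps = 1e-5` are all hypothetical choices for illustration.

```python
import math

x = [-3.0, 0.0, 3.0, 12.0]      # hypothetical features after self-attention
gamma = [1.0, 1.0, 1.0, 1.0]    # learned gain (identity here, for illustration)
beta = [0.0, 0.0, 0.0, 0.0]     # learned bias (zero here, for illustration)
eps = 1e-5                      # small stabilizer (assumed value)

mean = sum(x) / len(x)                              # 3.0
var = sum((xi - mean) ** 2 for xi in x) / len(x)    # 31.5
scale = math.sqrt(var + eps)                        # stabilized standard deviation
y = [g * (xi - mean) / scale + b for xi, g, b in zip(x, gamma, beta)]

print([round(v, 2) for v in y])
```

The features that spanned -3 to 12 now sit within roughly two units of zero, which is the predictable band the feed-forward network sees.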
Formula And Symbols
The rescale below applies to one token position. Each symbol names a quantity in that vector, not a batch-wide statistic.
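A standard way to write this rescale is given here as a reference form; the symbol names x, d, μ, σ², and ε are notation assumed for this entry, while γ and β are the learned gain and bias from the example above.

$$
\mu = \frac{1}{d}\sum_{i=1}^{d} x_i, \qquad
\sigma^2 = \frac{1}{d}\sum_{i=1}^{d} (x_i - \mu)^2, \qquad
y_i = \gamma_i \, \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta_i
$$

Here x is the hidden vector at one token position and d is its number of features. μ and σ² are the mean and variance across those d features. ε is a small positive constant that keeps the denominator away from zero. γ and β are learned per-feature vectors, and y is the normalized output passed to the next sublayer.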
Layer norm is not batch norm: it never pools statistics across examples in a minibatch. It is also not the residual add: the residual path carries the skip stream forward unchanged, while the norm actively rescales its input. RMSNorm is a close cousin that omits mean centering; both normalize across the features of a single token rather than across tokens.
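The distinctions above can be seen in a small side-by-side sketch. It is simplified under stated assumptions: learned gain and bias are omitted, `eps=1e-5` is an assumed default, and `batch_style_norm` is a hypothetical helper that only mimics training-mode batch statistics (no running averages), not a full batch norm layer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Center and scale one vector using its own mean and variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """RMSNorm: divide by the root mean square, with no mean subtraction."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]

def batch_style_norm(batch, eps=1e-5):
    """Batch-style statistics: one mean and variance per feature, pooled across examples."""
    columns = [list(col) for col in zip(*batch)]           # regroup values by feature
    normed = [layer_norm(col, eps) for col in columns]     # same arithmetic, batch axis
    return [list(row) for row in zip(*normed)]             # back to one row per example

# Two hypothetical minibatches that share their first example.
batch_a = [[1.0, 2.0, 6.0], [0.0, 4.0, 8.0]]
batch_b = [[1.0, 2.0, 6.0], [9.0, -5.0, 2.0]]

# Layer norm of the shared example ignores the rest of the batch.
print(layer_norm(batch_a[0]) == layer_norm(batch_b[0]))
# Batch-style statistics change when the other example changes.
print(batch_style_norm(batch_a)[0] == batch_style_norm(batch_b)[0])
```

For a vector whose features already average to zero, `rms_norm` and `layer_norm` give the same result in this sketch; they differ only when the mean is nonzero.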