Glossary

RMSNorm

Per-token root-mean-square scaling that rescales each hidden vector without mean centering, as used in many modern decoder transformers.

What It Is

Root mean square normalization (RMSNorm) rescales one token's hidden vector by dividing each feature by the root mean square of those features, then applying a learned per-feature scale. Unlike layer norm, it skips subtracting the mean across features. Statistics are still computed per position, which fits decoder transformers that process one token vector at a time inside each block.

Why It Matters

Many recent large language models, including Llama-class stacks, use RMSNorm instead of layer norm because dropping mean centering saves some compute while keeping activations in a stable range. The choice matters when you read model cards or compare training recipes: RMSNorm is not a different placement strategy; it is a lighter rescaling rule that slots into the same pre-norm block pattern.

Simple Example

After self-attention, one token vector might have features with very different magnitudes. RMSNorm computes the root mean square of those features (with a small constant added for numerical stability), divides each feature by that value, then multiplies by a learned per-feature gain. The feed-forward network receives a vector whose overall magnitude sits in a predictable band, without the feature mean being subtracted first.
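The steps in this example can be traced with plain arithmetic. The sketch below is illustrative only: the token values, the gain of 1, and the `eps` value of 1e-6 are assumptions chosen to keep the numbers easy to follow, not values from any particular model.

```python
import math

# Hypothetical post-attention features for one token (chosen for easy arithmetic).
token = [3.0, -4.0, 0.0, 5.0]
gain = [1.0, 1.0, 1.0, 1.0]  # learned in practice; set to 1 so only the rescale shows
eps = 1e-6                   # assumed stability constant

# Mean of the squared features: (9 + 16 + 0 + 25) / 4 = 12.5
mean_square = sum(v * v for v in token) / len(token)
rms = math.sqrt(mean_square + eps)

# Divide each feature by the RMS, then apply the per-feature gain.
normalized = [g * v / rms for g, v in zip(gain, token)]

# Check the result: the RMS is pulled to about 1, but the mean is not moved to 0.
rms_after = math.sqrt(sum(v * v for v in normalized) / len(normalized))
mean_after = sum(normalized) / len(normalized)

print([round(v, 3) for v in normalized])
print(round(rms_after, 3), round(mean_after, 3))
```

With these inputs the rescaled vector has a root mean square of about 1 while its mean stays near 0.283, which illustrates that the magnitude is controlled without centering.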

Formula And Symbols

The rescale below applies to one token position. Each symbol names a quantity in that vector, not a batch-wide statistic.
RMS normalization
$$\text{RMSNorm}(x) = \gamma \odot \frac{x}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}}$$
x: Hidden feature vector at one token position.
d: Number of features in x.
ε: Small positive constant for numerical stability.
γ: Learned per-feature scale vector.

Common Confusions

RMSNorm is not layer norm with a different name: layer norm subtracts the feature mean and uses both a learned scale and a learned shift, while RMSNorm only divides by the root mean square and typically applies a learned scale alone. RMSNorm is also not batch norm: it never pools statistics across examples. When a model card says RMSNorm, expect the lighter no-centering rescale rather than full mean-and-variance normalization.
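The difference from layer norm can be seen by running both on the same vector. This is a minimal sketch with assumed inputs: the function names, the `eps` value, and the choice of scale 1 and shift 0 are illustrative, so that only the centering step distinguishes the two outputs.

```python
import math

def rms_norm(x, gamma, eps=1e-6):
    # Divide by the root mean square only; no mean subtraction, no shift.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

def layer_norm(x, gamma, beta, eps=1e-6):
    # Subtract the feature mean, divide by the standard deviation, then scale and shift.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)
    return [g * (v - mean) / std + b for g, b, v in zip(gamma, beta, x)]

x = [2.0, 4.0, 6.0, 8.0]   # hypothetical features with a clearly nonzero mean
ones = [1.0] * len(x)      # scale of 1 for both norms
zeros = [0.0] * len(x)     # shift of 0 for layer norm

rms_out = rms_norm(x, ones)
ln_out = layer_norm(x, ones, zeros)

rms_mean = sum(rms_out) / len(rms_out)
ln_mean = sum(ln_out) / len(ln_out)

print("RMSNorm output mean:", round(rms_mean, 3))
print("LayerNorm output mean:", round(ln_mean, 3))
```

Layer norm's output averages to zero, whereas the RMSNorm output keeps a positive mean because the input's offset is rescaled rather than removed.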
