Glossary

Optimizer State

Per-parameter buffers that a stateful optimizer keeps alongside the model weights, such as momentum and variance estimates.

What It Is

Model parameters hold the learned weights, but stateful optimizers also keep auxiliary tensors with the same shape as each parameter. Adam tracks exponential moving averages of gradients and squared gradients (often called m and v). Momentum methods keep velocity buffers. These tensors are optimizer state: they are updated every step and shape the size and direction of each parameter's update.

Why It Matters

Saving only model weights and reloading into a fresh optimizer resets training dynamics: you lose accumulated momentum and adaptive scaling. Large-language-model training checkpoints therefore include optimizer state. For Adam that is two buffers per parameter, so roughly twice the parameter memory when weights and state share a precision, and more when the weights are stored in a lower precision than the state. Inference-only deployments usually discard optimizer state because weights alone are enough to run forward passes.
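The size relationship can be checked with simple arithmetic. The sketch below is illustrative only: the function name, the default byte widths, and the one-billion-parameter model are assumptions chosen for the example, and it counts only Adam's two buffers (no master weight copies, gradients, or activations).

```python
def training_memory_bytes(num_params, weight_bytes=4, state_bytes=4,
                          buffers_per_param=2):
    """Return (weight bytes, optimizer-state bytes) for a model.

    buffers_per_param=2 corresponds to Adam's m and v; the byte widths
    are assumptions (4 = fp32, 2 = fp16/bf16).
    """
    weights = num_params * weight_bytes
    optimizer_state = num_params * buffers_per_param * state_bytes
    return weights, optimizer_state


# Hypothetical 1B-parameter model, everything in fp32.
weights_fp32, state_fp32 = training_memory_bytes(10**9)
ratio_fp32 = state_fp32 / weights_fp32

# Same model with 2-byte weights but fp32 optimizer buffers.
weights_mixed, state_mixed = training_memory_bytes(10**9, weight_bytes=2)
ratio_mixed = state_mixed / weights_mixed

print(ratio_fp32, ratio_mixed)
```

Under these assumptions the state is twice the weight memory in the all-fp32 case and four times in the mixed case, which is why a training checkpoint is so much larger than an inference-only export of the same model.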

Simple Example

With Adam, each weight w might have companion buffers m_w and v_w. On step t the optimizer reads gradient g_w, updates m_w and v_w with decay factors β₁ and β₂, then moves w by a step proportional to m_w / (√v_w + ε) after bias correction. Those buffers, not the gradient from the current batch alone, determine how large the update is.
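A minimal sketch of that update for a single scalar weight, following the update rule in Kingma and Ba. The function name adam_step, the default hyperparameters, and the gradient values are assumptions made for illustration. Real implementations operate on whole tensors.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; returns the new (w, m, v)."""
    m = beta1 * m + (1 - beta1) * g        # moving average of gradients
    v = beta2 * v + (1 - beta2) * g * g    # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Two steps with gradient +2.0, carrying m and v forward.
w, m, v = 1.0, 0.0, 0.0
for t in (1, 2):
    w, m, v = adam_step(w, 2.0, m, v, t)

# Third step: the gradient flips sign to -2.0.
w_carried, _, _ = adam_step(w, -2.0, m, v, t=3)    # state preserved
w_fresh, _, _ = adam_step(w, -2.0, 0.0, 0.0, t=1)  # state reset, as after a weights-only reload

print(w, w_carried, w_fresh)
```

With the carried state, the accumulated m is still positive after the sign flip, so w keeps moving down. With reset state, the same gradient moves w up. The same weight and the same gradient produce different updates depending on the buffers.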

Common Confusions

Optimizer state is not the same as model parameters or gradients: parameters are what you learn, gradients are ephemeral signals from the current backward pass, and optimizer state is long-lived memory across steps. It also differs from activation checkpoints or KV cache used during inference. EMA shadow weights (used in some eval setups) are a further separate item: they are an averaged copy of the parameters rather than an optimizer buffer, so count them separately when comparing checkpoint sizes.

References

  1. Kingma, Diederik P., and Jimmy Ba. "Adam: A Method for Stochastic Optimization." arXiv, 2014, https://arxiv.org/abs/1412.6980.