Glossary

Scaling Law

An empirical power-law relationship between scale (parameters, data, or compute) and loss or downstream capability, used to forecast training budgets and benchmark trends.

What It Is

A scaling law is an empirical fit relating scale (parameters, training tokens, or compute) to a target such as validation loss or perplexity. Kaplan et al. (2020) found that language-model loss falls smoothly as a power law in each of these quantities, with some trends spanning more than seven orders of magnitude, provided the other two are not the bottleneck. The relationship is descriptive: it does not guarantee that every architecture or objective follows the same exponents.
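
As a compact way to write the fit described above, the parameter-count law in Kaplan et al. (2020) takes the form below. Here L is test loss, N is the parameter count (the paper counts non-embedding parameters), and N_c and \alpha_N are fitted constants. Reported exponents are small, roughly 0.05 to 0.1 depending on which quantity is scaled, and should be treated as specific to that setup rather than universal.

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}
\quad\Longleftrightarrow\quad
\log L = \alpha_N \log N_c - \alpha_N \log N
```

In log-log coordinates this is a straight line with slope -\alpha_N, which is why such fits are usually inspected on log-log plots.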

Why It Matters

Scaling laws help forecast whether a larger run is worth the budget and how much data a given parameter count needs. They also tie evaluation metrics to capacity: model capacity and parameter count set the achievable frontier, generalization and perplexity show whether a run landed on it, and emergent behavior covers capabilities that may not follow the same smooth curve.

Simple Example

If each doubling of parameters cuts validation loss by the same fixed fraction across several model sizes, the points fall on a straight line on a log-log plot. You can then extrapolate the expected loss, and hence perplexity, of the next model size before paying for the full training run. The extrapolation holds only until a data-limited regime flattens the trend.
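
The extrapolation described above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended forecasting pipeline: the parameter counts and losses are synthetic (generated from an assumed law loss = 20 * N^-0.1), the helper names `fit_power_law` and `predict_loss` are hypothetical, and the perplexity conversion assumes the loss is per-token cross-entropy in nats.

```python
import numpy as np

# Hypothetical model sizes, each double the previous one.
params = np.array([1e7, 2e7, 4e7, 8e7, 1.6e8])
# Synthetic losses that follow loss = 20 * N**-0.1 exactly (assumed for
# illustration only; measurements from real runs would be noisy).
losses = 20.0 * params ** -0.1


def fit_power_law(n, loss):
    """Fit loss ~= a * n**(-alpha) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(n), np.log(loss), deg=1)
    return float(np.exp(intercept)), float(-slope)


def predict_loss(n, a, alpha):
    """Evaluate the fitted power law at parameter count n."""
    return a * n ** (-alpha)


a, alpha = fit_power_law(params, losses)

# Fraction of the loss that remains after each doubling of parameters.
per_doubling = 2.0 ** (-alpha)

# Extrapolate one doubling beyond the largest fitted size.
next_loss = predict_loss(3.2e8, a, alpha)
# Perplexity, assuming the loss is per-token cross-entropy in nats.
next_perplexity = float(np.exp(next_loss))

print(f"alpha={alpha:.3f}, loss ratio per doubling={per_doubling:.3f}")
print(f"predicted loss at 3.2e8 params: {next_loss:.3f}")
print(f"predicted perplexity: {next_perplexity:.2f}")
```

Fitting in log-log space turns the power law into a straight line, so an ordinary linear least-squares fit recovers the exponent. With real measurements, it is worth checking the residuals at the largest sizes, since a flattening trend there signals that the extrapolation is leaving the regime where the fit applies.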

Common Confusions

A scaling law is not the same as emergent behavior: many metrics improve smoothly while specific capabilities appear abruptly. Laws are fit to historical runs; they do not remove the need for held-out evaluation or alignment checks on new tasks.

References

  1. Kaplan, Jared, et al. "Scaling Laws for Neural Language Models." arXiv, 2020, https://arxiv.org/abs/2001.08361.