Glossary

Model Capacity

How rich a set of functions a model can represent, as determined by parameter count, width, and depth, before data, optimization, and regularization constrain what it actually learns.

What It Is

Model capacity describes expressive power: wider layers, more heads, and more parameters increase the space of functions the network can represent. It is distinct from compute budget (how much training you can afford) and from measured generalization (how well the model works on new inputs).
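
As a rough illustration of how width and depth change parameter count, the sketch below counts weights and biases in a plain fully connected network. The helper name mlp_param_count and the layer sizes are hypothetical, chosen only for this example, and parameter count is a proxy for capacity rather than a direct measure of it.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input first, output last.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total


# Hypothetical configurations: same input and output sizes, different hidden layers.
narrow = mlp_param_count([32, 64, 64, 10])
wide = mlp_param_count([32, 256, 256, 10])
deep = mlp_param_count([32, 64, 64, 64, 64, 10])

print(narrow, wide, deep)  # 6922 76810 15242
```

In this toy setup, quadrupling the hidden width adds far more parameters than doubling the number of hidden layers, because the hidden-to-hidden weight matrices grow with the square of the width.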

Why It Matters

Capacity sits between architecture choices and training outcomes. Too little capacity underfits; too much capacity without enough data or regularization invites overfitting. Scaling law and perplexity pages connect capacity trends to measurable evaluation signals.
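
The underfitting and overfitting trade-off can be sketched with a stand-in model that is much simpler than a neural network: a piecewise-constant regressor whose number of bins plays the role of capacity. Everything here (the sine target, the noise level, the sample sizes, and the helper names) is an assumption made for illustration, not a setup taken from this glossary.

```python
import math
import random


def make_data(n, rng, noise=0.3):
    """Sample noisy points from sin(2*pi*x) on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, noise) for x in xs]
    return xs, ys


def fit_bins(xs, ys, n_bins):
    """Fit one mean per equal-width bin; more bins means more capacity."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    global_mean = sum(ys) / len(ys)
    # Bins with no training points fall back to the global mean.
    return [s / c if c else global_mean for s, c in zip(sums, counts)]


def mse(means, xs, ys):
    """Mean squared error of the binned predictor on (xs, ys)."""
    n_bins = len(means)
    errs = [(means[min(int(x * n_bins), n_bins - 1)] - y) ** 2
            for x, y in zip(xs, ys)]
    return sum(errs) / len(errs)


rng = random.Random(0)
train_x, train_y = make_data(40, rng)
val_x, val_y = make_data(400, rng)

results = {}
for n_bins in (1, 4, 16, 64):
    means = fit_bins(train_x, train_y, n_bins)
    results[n_bins] = (mse(means, train_x, train_y), mse(means, val_x, val_y))
    print(n_bins, results[n_bins])  # (training MSE, validation MSE)
```

Training error can only fall as the bins are refined, since each finer partition nests inside the coarser one. Validation error typically falls at first and then rises once bins hold only a point or two each, which is the pattern the paragraph above describes.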

Simple Example

A small two-layer network may memorize a thousand training points yet fail on a held-out set, while a wider network trained with dropout and more diverse data achieves lower validation loss. The second model's advantage comes from capacity that is used well, not from more parameters for their own sake.

Common Confusions

Capacity is not the same as context length or inference speed. More parameters do not guarantee better generalization if the data or the optimization setup is a poor match for the model's size. Adapter parameters add some capacity but do not replace the expressiveness of the frozen base weights they modulate.
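
To put the adapter point in numbers, the sketch below assumes one common adapter design, a low-rank update to a single frozen weight matrix. The glossary does not specify an adapter type, so the low-rank form, the helper names, and the sizes are all assumptions for illustration.

```python
def dense_params(d_in, d_out):
    """Parameters in a full weight matrix of shape (d_out, d_in)."""
    return d_in * d_out


def low_rank_adapter_params(d_in, d_out, rank):
    """Trainable parameters in a low-rank update B @ A.

    B has shape (d_out, rank) and A has shape (rank, d_in),
    so the update itself has rank at most `rank`.
    """
    return rank * (d_in + d_out)


d = 4096  # hypothetical hidden size
base = dense_params(d, d)
adapter = low_rank_adapter_params(d, d, rank=8)

print(base, adapter, adapter / base)  # 16777216 65536 0.00390625
```

Under these assumptions the adapter trains well under one percent as many parameters as the matrix it modifies, and its update is restricted to rank 8, so most of the expressiveness still comes from the frozen base weights.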
