How well a model performs on data it did not see during training. Validation sets, new domains, and live traffic are the practical tests.
What It Is
Generalization is the degree to which learned patterns transfer beyond the exact training examples. It is measured with validation splits, out-of-domain tests, or online A/B metrics. It depends on model capacity, data diversity, optimization, and regularization; alignment and filtering change behavior but do not replace statistical generalization.
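One way to make the measurement concrete is to compare accuracy on the training data with accuracy on held-out data. The sketch below is a minimal illustration under stated assumptions, not a standard API: the helpers `accuracy` and `generalization_gap`, the toy threshold `predict` standing in for a trained model, and all three small datasets are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# (feature, label) pairs; a hypothetical toy format for illustration only.
Example = Tuple[float, int]


def accuracy(predict: Callable[[float], int], data: Sequence[Example]) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    if not data:
        raise ValueError("data must be non-empty")
    correct = sum(1 for x, y in data if predict(x) == y)
    return correct / len(data)


def generalization_gap(
    predict: Callable[[float], int],
    train: Sequence[Example],
    held_out: Sequence[Example],
) -> float:
    """Training accuracy minus held-out accuracy; larger means worse transfer."""
    return accuracy(predict, train) - accuracy(predict, held_out)


def predict(x: float) -> int:
    """Toy threshold classifier standing in for a trained model (assumption)."""
    return 1 if x >= 0.5 else 0


train = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
# Held-out data drawn from the same pattern as the training set.
validation = [(0.3, 0), (0.4, 0), (0.6, 1), (0.7, 1)]
# Shifted data: labels near the decision boundary no longer follow the threshold.
out_of_domain = [(0.45, 1), (0.55, 0), (0.9, 1), (0.1, 0)]

gap_val = generalization_gap(predict, train, validation)
gap_ood = generalization_gap(predict, train, out_of_domain)
print(f"validation gap: {gap_val:.2f}, out-of-domain gap: {gap_ood:.2f}")
```

In this toy setup the gap is zero on the in-distribution validation split and grows on the shifted set, which is why a single validation number can overstate how well a model will hold up on new domains.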
Why It Matters
Customers care about reliability on new prompts, users, and modalities, not training-set accuracy. Naming generalization separately from overfitting and capacity clarifies why bigger models can generalize better yet still fail on adversarial or out-of-scope inputs.
Simple Example
A multilingual language model fine-tuned on English helpdesk logs may answer Spanish product questions reasonably well because it learned the task structure, not ticket IDs. Once the domain shift becomes large enough, though, even good generalization breaks down.
Common Confusions
Generalization is not guaranteed by scale alone; data and training objectives matter too. It is also not the same as instruction-following quality, which mixes capability with alignment. Perfect training accuracy combined with poor test accuracy indicates failed generalization, most often because of overfitting.
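The contrast between perfect training accuracy and poor test accuracy can be reproduced with a deliberately extreme toy case. The sketch below assumes a hypothetical task (predict the parity of an integer) and compares a lookup-table "memorizer" with a model that captured the underlying rule; `make_memorizer`, `parity_rule`, and the data ranges are illustrative names and choices, not taken from any library.

```python
from typing import Callable, List, Tuple

Example = Tuple[int, int]  # (input integer, label); hypothetical toy format


def make_memorizer(train: List[Example]) -> Callable[[int], int]:
    """Build a model that only stores training pairs (pure memorization)."""
    table = dict(train)

    def predict(x: int) -> int:
        # Unseen inputs fall back to a fixed guess of 0.
        return table.get(x, 0)

    return predict


def parity_rule(x: int) -> int:
    """A model that learned the actual pattern rather than the examples."""
    return x % 2


def accuracy(predict: Callable[[int], int], data: List[Example]) -> float:
    """Fraction of examples predicted correctly."""
    return sum(predict(x) == y for x, y in data) / len(data)


train = [(x, x % 2) for x in range(10)]      # inputs 0..9
test = [(x, x % 2) for x in range(10, 20)]   # inputs 10..19, never seen in training

memorizer = make_memorizer(train)
mem_train = accuracy(memorizer, train)
mem_test = accuracy(memorizer, test)
rule_test = accuracy(parity_rule, test)

print(f"memorizer: train={mem_train:.2f}, test={mem_test:.2f}")
print(f"rule:      test={rule_test:.2f}")
```

The memorizer scores perfectly on its training inputs but only matches the test labels where its fixed fallback happens to be right, while the rule-based model transfers unchanged. Real overfitting is rarely this clean, but the train/test asymmetry is the same signal.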