Post-training steps that steer a model toward helpful, honest, and policy-compliant behavior—often via preference data, reward models, or safety filters.
What It Is
Alignment is the family of post-training methods—supervised fine-tuning on demonstrations, reinforcement learning from human feedback (RLHF), direct preference optimization, and constitutional or rule-based critics—that nudge model behavior without replacing the base architecture. Safety classifiers and system prompts are inference-time alignment tools that complement weight updates.
Why It Matters
Capacity and generalization describe what a model can learn from data; alignment describes whose preferences win after that learning. Product teams need both stories: a high-capacity model can still violate policy if alignment is weak, and strong filters cannot fix a base model that never learned the task.
Simple Example
A chat model is pretrained on web text, then fine-tuned on helpful assistant transcripts, then updated with a reward model that prefers concise, safe answers over rambling or toxic ones. At serve time, a moderation API blocks categories the reward model never saw.
Common Confusions
Alignment is not the same as architecture design or tokenizer choice. It is also not identical to evaluation metrics such as perplexity—those measure prediction quality, not policy fit. RLHF is one alignment method, not the whole field.