Alignment

Post-training steps that steer a model toward helpful, honest, and policy-compliant behavior—often via preference data, reward models, or safety filters.

What It Is

Alignment is the family of post-training methods—supervised fine-tuning on demonstrations, reinforcement learning from human feedback (RLHF), direct preference optimization, and constitutional or rule-based critics—that nudge model behavior without replacing the base architecture. Safety classifiers and system prompts are inference-time alignment tools that complement weight updates.

Why It Matters

Capacity and generalization describe what a model can learn from data; alignment describes whose preferences win after that learning. Product teams need both stories: a high-capacity model can still violate policy if alignment is weak, and strong filters cannot fix a base model that never learned the task.

Simple Example

A chat model is pretrained on web text, then fine-tuned on helpful assistant transcripts, then updated with a reward model that prefers concise, safe answers over rambling or toxic ones. At serve time, a moderation API blocks categories the reward model never saw.

Common Confusions

Alignment is not the same as architecture design or tokenizer choice. It is also not identical to evaluation metrics such as perplexity—those measure prediction quality, not policy fit. RLHF is one alignment method, not the whole field.

Alignment

What It Is

Why It Matters

Simple Example

Common Confusions

Tags

References

On this page

Alignment

What It Is

Why It Matters

Simple Example

Common Confusions

Related Concepts And Modules

Tags

References

On this page