Perplexity

A standard language-model evaluation metric: the exponentiated average cross-entropy over tokens. Lower is better when comparing models on the same corpus.

What It Is

Given a test corpus, perplexity takes the negative log-likelihood the model assigns to each token (from a softmax over the logits), averages it over tokens, and exponentiates the result. It is not a task accuracy score; it measures probabilistic fit. Tokenization, vocabulary, and dataset domain all affect the number, so comparisons require matched setups.
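Written out, the definition above might look like the following. The symbols are assumptions introduced here for illustration: N is the number of evaluated tokens, x_i the i-th token, z_{i,v} the logit for vocabulary item v at position i, and log the natural logarithm.

```latex
\mathrm{PPL}(x_{1:N})
  = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i}) \right),
\qquad
p_\theta(x_i = v \mid x_{<i})
  = \frac{\exp(z_{i,v})}{\sum_{v'} \exp(z_{i,v'})}
```

The exponent is the average cross-entropy in nats per token, so a model that spreads probability uniformly over V tokens at every position has perplexity V.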

Why It Matters

Perplexity is the workhorse metric for pretraining and compression benchmarks because it tracks how well next-token distributions match data. It follows directly from the model's output pipeline: logits become probabilities through softmax, and cross-entropy (closely related to entropy) aggregates those probabilities into one scalar that can be tracked on held-out data to monitor generalization.
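The logits-to-scalar pipeline described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not any particular framework's implementation: the function names `log_softmax` and `perplexity` and the toy logits are hypothetical, and real evaluations would typically operate on batched tensors.

```python
import math

def log_softmax(logits):
    """Log-probabilities for one position's logits (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def perplexity(logits_per_position, target_ids):
    """Exponentiated mean negative log-likelihood (natural log) over tokens."""
    if len(logits_per_position) != len(target_ids) or not target_ids:
        raise ValueError("need one target id per position and at least one position")
    nll = 0.0
    for logits, target in zip(logits_per_position, target_ids):
        # Cross-entropy contribution: minus the log-probability of the true token.
        nll -= log_softmax(logits)[target]
    return math.exp(nll / len(target_ids))

# Toy data (hypothetical): 3 positions, vocabulary of 4 tokens.
toy_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 0.0, 0.0],
    [-0.5, 3.0, 0.2, 0.1],
]
toy_targets = [0, 2, 1]
print(perplexity(toy_logits, toy_targets))
```

Working in log space avoids underflow on long sequences, which is why the sketch sums log-probabilities rather than multiplying probabilities.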

Simple Example

If a model assigns probability 0.5 to every correct next token, cross-entropy is ln 2, about 0.69 nats per token, and perplexity is 2: the model is as uncertain as if it were choosing uniformly between two tokens. Halving perplexity lowers cross-entropy by that same 0.69 nats (1 bit) per token, which means the model assigns substantially higher probability to the evaluation text.
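The arithmetic in this example can be checked with a short sketch. The helper `perplexity_from_probs` is a hypothetical name, and the probability lists are made-up inputs. The second case illustrates that the average involved is over log-probabilities, so perplexity reflects the geometric mean of the per-token probabilities rather than their arithmetic mean.

```python
import math

def perplexity_from_probs(correct_token_probs):
    """Perplexity given the probability assigned to each correct token."""
    nll = [-math.log(p) for p in correct_token_probs]
    return math.exp(sum(nll) / len(nll))

# Probability 0.5 on every correct token: ln 2 nats, perplexity about 2.
cross_entropy = -math.log(0.5)
uniform_case = perplexity_from_probs([0.5, 0.5, 0.5, 0.5])

# Probabilities 0.25 and 1.0 have arithmetic mean 0.625 but geometric
# mean 0.5, so the perplexity is again about 2.
mixed_case = perplexity_from_probs([0.25, 1.0])

# Halving perplexity (8 -> 4) equals subtracting ln 2 from cross-entropy.
halved = math.exp(math.log(8.0) - math.log(2.0))

print(round(cross_entropy, 3), round(uniform_case, 3),
      round(mixed_case, 3), round(halved, 3))
```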

Common Confusions

Lower perplexity does not guarantee better instruction following or safety; those depend on alignment and task design. Perplexity on training data can look excellent while test perplexity reveals overfitting. Per-token perplexity also depends on how the tokenizer segments the text, so numbers obtained with different tokenizers are not directly comparable.
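The tokenizer caveat can be made concrete with hypothetical numbers. In the sketch below, two models assign the same total negative log-likelihood to the same text but split it into different numbers of tokens, so their per-token perplexities differ. Normalizing by bytes instead (bits per byte) is one common way to remove that dependence; it is shown here as an assumption about how one might compare, not as a method prescribed by this page. All numbers and function names are illustrative.

```python
import math

def per_token_perplexity(total_nll_nats, num_tokens):
    """Perplexity with the average taken over tokens."""
    return math.exp(total_nll_nats / num_tokens)

def bits_per_byte(total_nll_nats, num_bytes):
    """Total negative log-likelihood in bits, normalized by text length in bytes."""
    return total_nll_nats / (num_bytes * math.log(2))

# Hypothetical evaluation: the same 2000-byte text and the same total NLL,
# but tokenizer A yields 500 tokens and tokenizer B yields 400.
total_nll = 1000.0
num_bytes = 2000

ppl_a = per_token_perplexity(total_nll, 500)  # exp(2.0), about 7.39
ppl_b = per_token_perplexity(total_nll, 400)  # exp(2.5), about 12.18
bpb = bits_per_byte(total_nll, num_bytes)     # identical for both tokenizers

print(round(ppl_a, 2), round(ppl_b, 2), round(bpb, 3))
```

The two models fit the text equally well in total, yet the one with fewer, longer tokens reports the higher per-token perplexity.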
