Glossary
Perplexity
A standard language-model evaluation metric: the exponentiated average cross-entropy per token. Lower is better when comparing models on the same corpus with the same tokenization.
What It Is
Given a test corpus, perplexity is computed by averaging the negative log-likelihood the model assigns to each observed token (the probabilities come from a softmax over the logits) and then exponentiating that average. It is not a task accuracy score; it measures probabilistic fit. Tokenization, vocabulary, and dataset domain all affect the number, so comparisons require matched setups.
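The computation described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not any particular library's implementation: the function names (`log_softmax`, `perplexity`) and the toy logits are assumptions made for the example, and natural logarithms (nats) are assumed throughout.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def perplexity(logits_per_position, target_ids):
    """exp(mean negative log-likelihood) of the target tokens, in nats."""
    if not target_ids or len(logits_per_position) != len(target_ids):
        raise ValueError("need one target id per position, and at least one position")
    # Negative log-probability the model assigns to each observed token.
    nll = [-log_softmax(logits)[t]
           for logits, t in zip(logits_per_position, target_ids)]
    return math.exp(sum(nll) / len(nll))

# Hypothetical toy data: 3 positions, vocabulary of 4 tokens.
toy_logits = [
    [2.0, 0.5, 0.1, -1.0],
    [0.0, 0.0, 0.0, 0.0],
    [-0.5, 3.0, 0.2, 0.1],
]
toy_targets = [0, 2, 1]
print(perplexity(toy_logits, toy_targets))
```

A useful sanity check: with uniform logits over a vocabulary of V tokens, every token gets probability 1/V, so the perplexity is V.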
Why It Matters
Perplexity is the workhorse metric for pretraining and compression benchmarks because it tracks how well next-token distributions match data. It follows directly from the model's output pipeline: logits become probabilities through softmax, and cross-entropy (closely related to entropy) aggregates the probabilities of the observed tokens into one scalar for monitoring generalization.
Simple Example
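If a model assigns probability 0.5 to every correct next token (more precisely, if the geometric mean of those probabilities is 0.5), cross-entropy is ln 2, about 0.69 nats, and perplexity is 2: the model is as uncertain as if it were choosing uniformly between two tokens. Halving perplexity lowers cross-entropy by ln 2 (about 0.69 nats, or one bit) per token, which is the same as doubling the geometric-mean probability the model assigns to the correct tokens.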
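The arithmetic in this example can be checked numerically. The sketch below is illustrative only (the helper name `perplexity_from_probs` and the probability lists are made up for the example); it also shows that what matters is the geometric mean of the correct-token probabilities, not their arithmetic mean.

```python
import math

def perplexity_from_probs(correct_token_probs):
    """Perplexity from the probabilities assigned to the correct tokens."""
    nll = [-math.log(p) for p in correct_token_probs]
    return math.exp(sum(nll) / len(nll))

# Every correct token gets probability 0.5: cross-entropy is ln 2, perplexity 2.
constant = perplexity_from_probs([0.5, 0.5, 0.5, 0.5])

# Same arithmetic mean (0.5), but the geometric mean is 0.3,
# so perplexity is 1 / 0.3, noticeably worse than 2.
mixed = perplexity_from_probs([0.9, 0.1, 0.9, 0.1])

print(math.log(constant), constant, mixed)
```

Confident mistakes are penalized heavily: a few low-probability correct tokens raise perplexity more than an equal number of high-probability ones lower it.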