Glossary

Entropy

A measure of how spread out or uncertain a probability distribution is over next-token choices.

What It Is

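For a discrete distribution p over vocabulary indices, Shannon entropy measures the uncertainty in the next-token prediction, in nats when the logarithm is natural or in bits when it is base 2. It is zero only when a single token has probability 1, and it reaches its maximum, log |V| for a vocabulary of size |V|, when every token is equally likely.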
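
A minimal sketch of the definition, using only the Python standard library. The function name `shannon_entropy` and the `base` parameter are illustrative choices, not part of any particular library.

```python
import math

def shannon_entropy(probs, base=math.e):
    """Entropy of a discrete distribution; base=math.e gives nats, base=2 gives bits."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 are skipped: the limit of p * log(p) as p -> 0 is 0.
    return sum(-p * math.log(p, base) for p in probs if p > 0)

one_hot = [1.0, 0.0, 0.0, 0.0]
uniform = [0.25, 0.25, 0.25, 0.25]

print(shannon_entropy(one_hot))          # no uncertainty: one token has probability 1
print(shannon_entropy(uniform, base=2))  # the maximum for four outcomes, log2(4) bits
```

Switching between nats and bits only rescales the value by a constant factor, so comparisons between distributions come out the same in either unit.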

Why It Matters

Monitoring entropy during decoding shows when the model is hedging versus committing. Researchers compare entropy across layers, prompts, or safety interventions. It pairs naturally with temperature, which deliberately reshapes the same distribution.
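
The link to temperature can be sketched as follows, assuming the common convention of dividing logits by the temperature before softmax. The logits are made-up values for illustration, and the helper names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_nats(probs):
    """Shannon entropy in nats, skipping zero-probability terms."""
    return -sum(p * math.log(p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
for t in (0.5, 1.0, 2.0):
    h = entropy_nats(softmax(logits, temperature=t))
    print(f"temperature={t}: entropy={h:.3f} nats")
```

Lower temperatures sharpen the distribution and reduce entropy, while higher temperatures flatten it toward uniform and raise entropy.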

Simple Example

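If softmax yields [0.7, 0.2, 0.1], entropy is lower than for [0.34, 0.33, 0.33] because the first vector is peaked while the second is nearly uniform. With the natural logarithm, the Shannon formula gives about 0.80 nats for the first and about 1.10 nats for the second, just under the three-outcome maximum of ln 3.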
H(p) = -\sum_i p_i \log p_i
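
Expanding the Shannon sum term by term for the two example vectors, using the natural logarithm so the results are in nats (intermediate values are rounded to four decimal places):

```latex
\begin{aligned}
H([0.7, 0.2, 0.1]) &= -(0.7 \ln 0.7 + 0.2 \ln 0.2 + 0.1 \ln 0.1) \\
&\approx 0.2497 + 0.3219 + 0.2303 \approx 0.802 \text{ nats} \\
H([0.34, 0.33, 0.33]) &= -(0.34 \ln 0.34 + 2 \cdot 0.33 \ln 0.33) \\
&\approx 0.3668 + 0.7317 \approx 1.0985 \text{ nats}
\end{aligned}
```

The second value sits just below \ln 3 \approx 1.0986, the entropy of an exactly uniform distribution over three outcomes.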

Common Confusions

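Entropy is not the same as cross-entropy loss during training: loss compares predictions to labels, while entropy describes a single distribution on its own. Large logit magnitudes do not by themselves imply low entropy. Softmax depends only on the differences between logits, so adding the same constant to every logit leaves the distribution, and therefore its entropy, unchanged.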
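
Both points can be checked with a short sketch. The logits and helper names below are hypothetical, and the cross-entropy shown is the simple one-hot-label case (negative log-probability of the labelled index), which is an assumption about the training setup.

```python
import math

def softmax(logits):
    """Softmax with the max subtracted first for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_nats(probs):
    """Shannon entropy in nats; depends only on the distribution itself."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cross_entropy_nats(probs, label):
    """Loss against a one-hot label: negative log-probability of the labelled index."""
    return -math.log(probs[label])

small = [2.0, 1.0, 0.0]
large = [102.0, 101.0, 100.0]  # same gaps between logits, much larger magnitudes

# Same gaps give the same distribution, so the two entropies match.
print(entropy_nats(softmax(small)), entropy_nats(softmax(large)))

# Entropy never looks at a label; cross-entropy changes with it.
p = softmax(small)
print(cross_entropy_nats(p, label=0), cross_entropy_nats(p, label=2))
```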
