Glossary
Softmax
A function that turns a vector of logits into a probability distribution that sums to one.
What It Is
Softmax is the usual final normalization for multiclass outputs in language models. Given a logits vector z, it returns probabilities p_i proportional to exp(z_i), divided by the sum across all classes.
Why It Matters
Next-token sampling reads softmax probabilities. Cross-entropy training compares softmax outputs to target tokens. Any discussion of model confidence, entropy, or temperature assumes you understand this mapping from logits to a distribution.
Simple Example
With logits [2.0, 1.0, 0.0], softmax assigns the largest mass to index 0 because exp(2) dominates, but all three entries receive strictly positive probability. Dividing logits by temperature before softmax sharpens or flattens that distribution.softmax(zi)=∑jezjezi
Common Confusions
Softmax is not the same as a single activation inside a hidden layer—those may be ReLU or GELU. Softmax also differs from sigmoid, which handles binary probabilities. Applying softmax twice does not generally recover the original logits.