Glossary

Softmax

A function that turns a vector of logits into a probability distribution that sums to one.

What It Is

Softmax is the usual final normalization for multiclass outputs in language models. Given a logits vector z, it returns probabilities p_i = exp(z_i) / sum_j exp(z_j). Each p_i is therefore proportional to exp(z_i), and dividing by the sum across all classes makes the probabilities add up to one.
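
The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `softmax` and the use of Python lists are assumptions made for the example, and subtracting the maximum logit is a common numerical-stability step that leaves the result unchanged.

```python
import math

def softmax(logits):
    """Map a list of logits to probabilities that sum to one."""
    # Subtract the largest logit so exp() never overflows;
    # the shift cancels out in the ratio below.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.0])
print(probs)
```

The returned list has the same length as the input, every entry is strictly positive, and the ordering of the logits is preserved.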

Why It Matters

Next-token sampling reads softmax probabilities. Cross-entropy training compares softmax outputs to target tokens. Any discussion of model confidence, entropy, or temperature assumes you understand this mapping from logits to a distribution.
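
To make the two uses above concrete, here is a hedged sketch of sampling from softmax probabilities and of the cross-entropy loss for a single target. The helper names `cross_entropy` and `sample_index` are hypothetical, chosen for this example only; real training code typically works on batches of tensors rather than Python lists.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log of the probability softmax assigns to the target."""
    return -math.log(softmax(logits)[target_index])

def sample_index(logits, rng):
    """Draw one index with probability given by softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]
loss_likely = cross_entropy(logits, 0)    # target has the largest logit
loss_unlikely = cross_entropy(logits, 2)  # target has the smallest logit
token = sample_index(logits, random.Random(0))
```

The loss is small when the target already has high probability and grows as the target's probability shrinks, which is the sense in which training "compares" the softmax output with the target.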

Simple Example

With logits [2.0, 1.0, 0.0], softmax returns roughly [0.665, 0.245, 0.090]. Index 0 gets the largest mass because exp(2) dominates, but all three entries receive strictly positive probability. Dividing the logits by a temperature T before softmax reshapes that distribution: T < 1 sharpens it toward the largest logit, and T > 1 flattens it toward uniform.
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
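
The temperature effect on this example can be checked with a short sketch. The function name `softmax_with_temperature` and the specific temperatures 0.5 and 2.0 are assumptions picked for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, 0.5)   # T < 1: more peaked
plain = softmax_with_temperature(logits, 1.0)   # ordinary softmax
flat = softmax_with_temperature(logits, 2.0)    # T > 1: closer to uniform
```

Under these assumptions the top probability moves from about 0.67 at T = 1 to about 0.87 at T = 0.5 and about 0.51 at T = 2, while the ranking of the three indices stays the same.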

Common Confusions

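Softmax is not the same as an activation inside a hidden layer; those are typically ReLU or GELU. Softmax also differs from sigmoid, which maps a single logit to one probability independently of any other output, although a two-class softmax equals a sigmoid applied to the difference of the two logits. Softmax is neither idempotent nor invertible: applying it a second time to its own output yields a different, generally flatter distribution, and the original logits cannot be recovered from the probabilities because adding the same constant to every logit leaves the output unchanged.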
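
A few of these points can be verified numerically. The sketch below is illustrative only; the variable names and the test values (a shift of 5.0, a logit gap of 1.5) are arbitrary choices for the example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

logits = [2.0, 1.0, 0.0]
once = softmax(logits)
twice = softmax(once)                          # softmax applied to probabilities
shifted = softmax([z + 5.0 for z in logits])   # same logits plus a constant

two_class = softmax([1.5, 0.0])[0]             # two-class softmax, first entry
binary = sigmoid(1.5)                          # sigmoid of the logit difference
```

Here `twice` differs from both `logits` and `once`, `shifted` matches `once`, and `two_class` matches `binary`.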
