Softmax

A function that turns a vector of logits into a probability distribution that sums to one.

What It Is

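Softmax is the usual final normalization for multiclass outputs in language models. Given a logits vector z, it returns probabilities p_i = exp(z_i) / sum_j exp(z_j): each logit is exponentiated, then divided by the sum of the exponentials across all classes. Every p_i is therefore strictly positive, and the p_i sum to one.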
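The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `softmax` and the max-subtraction step are choices made here. Subtracting the largest logit does not change the result, because softmax is unaffected by adding a constant to every logit, but it keeps `exp` from overflowing on large inputs.

```python
import math

def softmax(logits):
    """Map a list of logits to probabilities that sum to one."""
    # Shift by the max logit: the output is unchanged, but exp() can
    # no longer overflow when the logits are large.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.0])
print([round(p, 3) for p in probs])  # [0.665, 0.245, 0.09]
```

Production frameworks typically ship their own vectorized softmax; the loop version here is only meant to make each step of the formula visible.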

Why It Matters

Next-token sampling reads softmax probabilities. Cross-entropy training compares softmax outputs to target tokens. Any discussion of model confidence, entropy, or temperature assumes you understand this mapping from logits to a distribution.
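To make these three uses concrete, the sketch below samples a token, computes a cross-entropy loss, and measures entropy from one softmax output. The three-entry logits list, the target index, and the fixed random seed are hypothetical values chosen for illustration; real models work over vocabularies of many thousands of tokens.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for a 3-token vocabulary
probs = softmax(logits)

# Sampling: draw the next token index in proportion to its probability.
rng = random.Random(0)  # seeded only so the sketch is repeatable
next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Cross-entropy for one example: the negative log-probability that
# softmax assigns to the target token.
target = 0  # hypothetical correct token index
loss = -math.log(probs[target])

# Entropy: 0 when all mass sits on one token, log(n) when uniform.
entropy = -sum(p * math.log(p) for p in probs)

print(next_token, round(loss, 4), round(entropy, 4))
```

A lower loss means the model put more probability on the target token, and a lower entropy means the distribution is more concentrated.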

Simple Example

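With logits [2.0, 1.0, 0.0], softmax returns approximately [0.665, 0.245, 0.090]. Index 0 gets the largest mass because exp(2) dominates, but all three entries receive strictly positive probability. Dividing the logits by a temperature T before softmax reshapes that distribution: T below 1 sharpens it toward the largest logit, and T above 1 flattens it toward uniform.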

\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}
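The effect of temperature on the example logits [2.0, 1.0, 0.0] can be checked directly. The helper name `softmax_with_temperature` and the specific temperatures 0.5 and 2.0 are assumptions picked for this sketch; the values in the comments are rounded.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
sharp = softmax_with_temperature(logits, temperature=0.5)  # ~[0.867, 0.117, 0.016]
base = softmax_with_temperature(logits, temperature=1.0)   # ~[0.665, 0.245, 0.090]
flat = softmax_with_temperature(logits, temperature=2.0)   # ~[0.506, 0.307, 0.186]

for name, dist in (("T=0.5", sharp), ("T=1.0", base), ("T=2.0", flat)):
    print(name, [round(p, 3) for p in dist])
```

The ranking of the three entries never changes with temperature; only how much mass the top entry takes from the others.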

Common Confusions

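Softmax is not the same as an elementwise activation inside a hidden layer, such as ReLU or GELU: those transform each value independently, while softmax normalizes across the whole vector. Softmax also differs from sigmoid, which maps a single logit to one probability for the binary case; for two classes, softmax over [z, 0] gives the same probability as sigmoid(z). Applying softmax twice neither recovers the original logits nor leaves the probabilities unchanged; it produces a different, flatter distribution. The logits cannot be recovered exactly from the probabilities at all, because adding the same constant to every logit yields the same output.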
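A few of these points can be verified numerically. The sketch below is illustrative only: the helper names and the test value 1.5 are assumptions. It compares two-class softmax with sigmoid, applies softmax twice, and shifts every logit by a constant.

```python
import math

def softmax(logits):
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Two-class softmax over [x, 0] matches sigmoid(x).
x = 1.5
two_class = softmax([x, 0.0])
print(abs(two_class[0] - sigmoid(x)) < 1e-12)  # True

# Softmax applied twice: not the logits, and not the first output either.
logits = [2.0, 1.0, 0.0]
once = softmax(logits)
twice = softmax(once)
print([round(p, 3) for p in once])
print([round(p, 3) for p in twice])  # flatter than `once`

# Adding a constant to every logit leaves the distribution unchanged,
# which is why logits cannot be uniquely recovered from probabilities.
shifted = softmax([v + 10.0 for v in logits])
print(all(abs(a - b) < 1e-12 for a, b in zip(once, shifted)))  # True
```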
