Glossary

Loss Function

A scalar score that measures how wrong the model's predictions are, giving training a direction in which to improve.

What It Is

The loss (also called the objective or training loss) aggregates per-example errors into one differentiable scalar. For language models, cross-entropy between predicted next-token probabilities and the true token id is standard. Regression tasks might use mean squared error. The loss sits at the root of the computational graph so backpropagation can reach every parameter that influenced the prediction.
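As a rough illustration of how per-example errors collapse into one scalar, the sketch below averages cross-entropy and squared error over a tiny batch. The function names and the batch values are made up for this example, and plain Python stands in for a real autodiff framework, so nothing here is differentiated.

```python
import math

def mean_cross_entropy(probs, targets):
    """Average negative log-probability assigned to each true token id."""
    per_example = [-math.log(p[t]) for p, t in zip(probs, targets)]
    return sum(per_example) / len(per_example)

def mean_squared_error(preds, targets):
    """Average squared difference between predictions and true values."""
    per_example = [(y_hat - y) ** 2 for y_hat, y in zip(preds, targets)]
    return sum(per_example) / len(per_example)

# Hypothetical batch: two next-token distributions over a three-token vocabulary.
batch_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
batch_targets = [0, 1]
ce = mean_cross_entropy(batch_probs, batch_targets)

# Hypothetical regression batch: two predictions and their true values.
mse = mean_squared_error([2.5, 0.0], [3.0, -0.5])

print(f"cross-entropy: {ce:.3f}")  # one scalar for the whole batch
print(f"mse: {mse:.3f}")
```

Averaging (rather than summing) keeps the scale of the loss independent of batch size, which is a common but not universal choice.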

Why It Matters

Optimizers do not read raw logits or activations; they minimize the loss. The choice of loss shapes what the model learns to prioritize: cross-entropy penalizes confident wrong tokens heavily, while other objectives emphasize ranking, contrast, or auxiliary tasks. Loss curves are usually the first signal of whether training is converging, diverging, or stuck.

Simple Example

Suppose a three-token vocabulary and the model assigns probabilities [0.7, 0.2, 0.1] while the correct token is index 0. Using the natural logarithm, cross-entropy loss is −log(0.7) ≈ 0.36, a small positive number. If the model instead gave 0.01 probability to the correct token, the loss jumps to −log(0.01) ≈ 4.61, about 13 times larger.
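The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch: the helper name cross_entropy is hypothetical, and the second distribution [0.01, 0.9, 0.09] is an assumed way of spreading the remaining probability, since the text only fixes the 0.01 on the correct token.

```python
import math

def cross_entropy(probs, target):
    """Negative natural log of the probability given to the correct token."""
    return -math.log(probs[target])

# Model puts 0.7 on the correct token (index 0).
mostly_right = cross_entropy([0.7, 0.2, 0.1], 0)

# Assumed distribution that puts only 0.01 on the correct token.
confident_wrong = cross_entropy([0.01, 0.9, 0.09], 0)

print(f"{mostly_right:.3f}")     # 0.357
print(f"{confident_wrong:.3f}")  # 4.605
```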

Common Confusions

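Loss is not the same as accuracy: accuracy is a non-differentiable evaluation metric computed from the model's top predictions, while the loss is the quantity whose gradient drives training. Perplexity is closely tied to the loss (it is the exponential of the mean cross-entropy) but is reported for evaluation rather than optimized directly. A model can have low training loss but poor generalization (overfitting). Loss also differs from regularization terms such as weight decay, which are sometimes added to the objective but penalize model complexity rather than fit-to-labels error.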
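To make the distinction between loss and accuracy concrete, the sketch below scores two hypothetical sets of predictions that pick the same tokens but with different confidence. The evaluate helper and both prediction sets are invented for illustration; perplexity is computed as the exponential of the mean cross-entropy.

```python
import math

def evaluate(probs, targets):
    """Return (mean cross-entropy, accuracy, perplexity) for a small batch."""
    n = len(targets)
    loss = sum(-math.log(p[t]) for p, t in zip(probs, targets)) / n
    # Accuracy only looks at which index has the highest probability.
    correct = sum(
        max(range(len(p)), key=p.__getitem__) == t
        for p, t in zip(probs, targets)
    )
    accuracy = correct / n
    perplexity = math.exp(loss)
    return loss, accuracy, perplexity

targets = [0, 1]
# Both prediction sets rank the correct token first, with different confidence.
sharp = [[0.9, 0.05, 0.05], [0.05, 0.9, 0.05]]
hedged = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]

sharp_loss, sharp_acc, sharp_ppl = evaluate(sharp, targets)
hedged_loss, hedged_acc, hedged_ppl = evaluate(hedged, targets)

print(sharp_acc == hedged_acc)    # same accuracy
print(sharp_loss < hedged_loss)   # different loss
```

Both sets reach the same accuracy, yet the loss (and therefore the perplexity) separates them, which is why the loss gives a training signal even when accuracy does not move.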

References

  1. Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016, https://www.deeplearningbook.org/.