Token-to-probability chain
Beginner glossary path from tokens through embeddings, tensors, logits, softmax, and training fundamentals.
Architecture
Glossary
- Activation
The numeric output of a layer after its linear transform and nonlinearity, passed forward through the network.
- Backpropagation
The backward pass that applies the chain rule over a computational graph to compute the gradient of the loss with respect to every trainable parameter.
- Computational Graph
A record of which operations produced each tensor so automatic differentiation can run backward.
- Embedding
A dense vector that represents a token or other discrete item so the model can run continuous math on it.
- Entropy
A measure of how spread out or uncertain a probability distribution is over next-token choices.
- Gradient
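The vector of partial derivatives of the training loss with respect to each parameter. It points in the direction that increases the loss fastest, so training moves each parameter a small step in the opposite direction to reduce the loss.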
- Hidden Size
The width of a model's internal vectors: the number of dimensions in each token embedding and in each position's hidden state before the vocabulary projection.
- Logit
A raw, unnormalized score for each vocabulary item before softmax turns scores into probabilities.
- Loss Function
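A function that compares the model's predictions with the targets and returns a single scalar measuring how wrong the predictions are. The gradient of that scalar tells training which direction to move the parameters.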
- Optimizer State
Extra memory an adaptive optimizer keeps beyond model weights, such as momentum and variance estimates for each parameter.
- Parameter
A learnable numeric value stored in the model weights that training updates to fit data.
- Softmax
A function that turns a vector of logits into a probability distribution that sums to one.
- Temperature
A positive number that logits are divided by before softmax. Values below 1 sharpen the next-token distribution, values above 1 flatten it, and 1 leaves it unchanged.
- Tensor
A multi-dimensional array of numbers that carries activations, weights, or gradients through the network.
- Token
The smallest unit of text a language model reads and predicts, usually a word piece rather than a whole word. Each token ID is mapped by an embedding lookup to a dense vector of the model's hidden size before attention runs.
- Vector
An ordered list of numbers that represents a point or direction in continuous space. Embeddings and activations are both vectors, at different stages of the model.
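
The terms above link into one chain: token ID, embedding, logits, softmax with temperature, entropy, loss, gradient. The following is a minimal sketch of that chain in plain Python, not a real model. The vocabulary, the embedding table, the output projection `W_OUT`, and all helper names are hypothetical values chosen for illustration, and the many layers a real network would apply between the embedding and the logits are skipped.

```python
import math

# Hypothetical toy setup: a 4-item vocabulary and a hidden size of 3.
VOCAB = ["the", "cat", "sat", "."]
HIDDEN_SIZE = 3

# Embedding table: one dense vector of length HIDDEN_SIZE per token ID.
# The numbers are made up; in a real model they are learned parameters.
EMBEDDINGS = [
    [0.1, 0.3, -0.2],
    [0.4, -0.1, 0.2],
    [-0.3, 0.2, 0.5],
    [0.0, -0.4, 0.1],
]

# Vocabulary projection: one weight vector per vocabulary item (also made up).
W_OUT = [
    [0.2, -0.1, 0.4],
    [0.5, 0.3, -0.2],
    [-0.1, 0.6, 0.1],
    [0.3, 0.0, -0.5],
]


def embed(token_id):
    """Look up the dense vector for a token ID."""
    return EMBEDDINGS[token_id]


def logits_from_hidden(hidden):
    """Project a hidden vector to one raw score (logit) per vocabulary item."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in W_OUT]


def softmax(logits, temperature=1.0):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cross_entropy(probs, target_id):
    """Loss for one prediction: negative log-probability of the correct token."""
    return -math.log(probs[target_id])


def grad_wrt_logits(probs, target_id):
    """Gradient of cross-entropy w.r.t. the logits at temperature 1: probs - one_hot."""
    return [p - (1.0 if i == target_id else 0.0) for i, p in enumerate(probs)]


hidden = embed(1)                         # token ID 1 ("cat") -> vector
logits = logits_from_hidden(hidden)       # vector -> raw scores
probs = softmax(logits)                   # raw scores -> probabilities
sharp = softmax(logits, temperature=0.5)  # lower temperature: sharper
flat = softmax(logits, temperature=2.0)   # higher temperature: flatter
loss = cross_entropy(probs, target_id=2)  # pretend "sat" is the correct next token
grad = grad_wrt_logits(probs, target_id=2)
```

In this sketch the gradient entry for the correct token is negative and the others are positive, so stepping the logits against the gradient raises the correct token's score and lowers the rest. A real training loop would propagate that signal further back through the computational graph to the weights and embeddings.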