Getting Started

Token-to-probability chain

Beginner glossary path from tokens through embeddings, tensors, logits, softmax, and training fundamentals.

Architecture

Glossary

Activation
The numeric output of a layer after its linear transform and nonlinearity, passed forward through the network.
Backpropagation
The backward pass that applies the chain rule on a computational graph to compute gradients for every differentiable parameter.
Computational Graph
A record of which operations produced each tensor so automatic differentiation can run backward.
Embedding
A dense vector that represents a token or other discrete item so the model can run continuous math on it.
Entropy
A measure of how spread out or uncertain a probability distribution is over next-token choices.
Gradient
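The vector of partial derivatives of the training loss with respect to each parameter. It points in the direction of steepest increase, so training moves each parameter the opposite way to reduce the loss.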
Hidden Size
The width of a model's internal vectors—the number of dimensions in each token embedding and each token's per-position hidden state before the vocabulary projection.
Logit
A raw, unnormalized score for each vocabulary item before softmax turns scores into probabilities.
Loss Function
A scalar score that measures how wrong the model's predictions are; its gradient tells training which direction to adjust each parameter.
Optimizer State
Extra memory an adaptive optimizer keeps beyond model weights, such as momentum and variance estimates for each parameter.
Parameter
A learnable numeric value stored in the model weights that training updates to fit data.
Softmax
A function that turns a vector of logits into a probability distribution that sums to one.
Temperature
A positive value that logits are divided by before softmax. Values below 1 make the next-token distribution sharper; values above 1 make it flatter.
Tensor
A multi-dimensional array of numbers that carries activations, weights, or gradients through the network.
Token
The smallest unit of text a language model reads and predicts, usually a word piece rather than a whole word. The embedding table maps each token ID to a dense vector whose length is the model's hidden size before attention runs.
Vector
An ordered list of numbers that represents a point or direction in continuous space—embeddings and activations are vectors at different stages of the model.
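
The token-to-probability chain described by the entries above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how any particular model is implemented: the three-token vocabulary, the hidden size of 2, the weight values, and the helper names (`next_token_logits`, `embedding_table`, `output_weights`) are all hypothetical, and the toy skips attention entirely.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; logits are divided by temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def cross_entropy(probs, target):
    """Loss for one prediction: negative log-probability of the correct token."""
    return -math.log(probs[target])

# Hypothetical toy model: vocabulary of 3 tokens, hidden size 2.
embedding_table = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]  # one vector per token ID
output_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # one row per vocabulary item

def next_token_logits(token_id):
    """Token ID -> embedding vector -> one raw score per vocabulary item."""
    hidden = embedding_table[token_id]  # embedding lookup (no attention in this toy)
    return [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]

logits = next_token_logits(1)
probs = softmax(logits)                   # temperature 1.0
sharp = softmax(logits, temperature=0.5)  # lower temperature, more peaked
flat = softmax(logits, temperature=2.0)   # higher temperature, more spread out

target = 2  # assumed "correct" next token for this example
loss = cross_entropy(probs, target)

# Gradient of the cross-entropy loss with respect to the logits:
# predicted probabilities minus the one-hot target vector.
grad_logits = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

print("logits:", logits)
print("probs:", probs)
print("entropy sharp/base/flat:", entropy(sharp), entropy(probs), entropy(flat))
print("loss:", loss)
print("grad wrt logits:", grad_logits)
```

In this sketch the gradient entry for the target token is negative, so a gradient-descent step (moving against the gradient) raises that token's logit and lowers the others. A real model would continue backpropagation from the logits into the weights and embeddings.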