Glossary
Machine-learning terms with short definitions. Each entry links to its canonical glossary page once that page is published.
- Activation
The numeric output of a layer after its linear transform and nonlinearity, passed forward through the network.
- ALiBi
Attention with linear biases adds distance-based penalties to attention logits so models can extrapolate to longer sequences without explicit position embeddings.
- Alignment
Post-training steps that steer a model toward helpful, honest, and policy-compliant behavior—often via preference data, reward models, or safety filters.
- Architecture
The blueprint that defines how a model's layers and modules connect and how data flows through them, in training as well as inference.
- Autoregressive Generation
A generation paradigm that produces outputs one discrete step at a time, each step conditioned on everything generated so far.
- Backpropagation
The backward pass that applies the chain rule on a computational graph to compute gradients for every differentiable parameter.
- Component
A finer-grained part inside a module, such as an attention head or projection layer.
- Computational Graph
A record of which operations produced each tensor so automatic differentiation can run backward.
- Conditioning
Extra inputs—text prompts, class labels, images, or guidance weights—that steer a generative model without retraining the full parameter stack each time.
- Context Window
The maximum number of tokens a model can attend over in one forward pass, distinct from the training sequence length and from how many tokens it may emit in a session.
- Decoder
A network stack that turns internal representations into outputs—tokens, pixels, or structured predictions—often one step at a time with attention to prior context.
- Denoising Generation
A generation paradigm that starts from noisy latents and iteratively removes noise—or adds structure—until a clean sample emerges.
- Diffusion Model
A generative model family that learns to reverse a noise corruption process through many iterative denoising steps.
- Discriminative Model
A model trained to score, rank, or classify inputs into labels or preferences rather than synthesize new content.
- Embedding
A dense vector that represents a token or other discrete item so the model can run continuous math on it.
- Emergent Behavior
Qualitative capability jumps that appear only past certain model scale or training thresholds—distinct from smooth metric improvements tracked by scaling laws.
- Encoder
A network stack that maps raw or patched inputs into internal representations—tokens, latents, or context vectors—for downstream decoders or heads.
- Encoder-Decoder
An architecture that compresses input into representations with an encoder, then generates outputs with a decoder—often via cross-attention between the two stacks.
- Entropy
A measure of how spread out or uncertain a probability distribution is over next-token choices.
- Feed-Forward Network
The position-wise dense transform that refines each token's hidden vector after attention in a transformer block.
- Foundation Model
A large pretrained model intended as a shared starting point for many downstream tasks via fine-tuning or prompting.
- Generalization
How well a model performs on data it was not explicitly trained to memorize—validation sets, new domains, and live traffic are the practical tests.
- Generative Model
A model trained to produce new outputs such as text, images, or audio rather than only assign labels to fixed categories.
- Gradient
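The vector of partial derivatives of the training loss with respect to each parameter. It points toward the steepest increase in loss, so optimizers step in the opposite direction, scaled by its size, to reduce the loss.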
- Hidden Size
The width of a model's internal vectors—the number of dimensions in each token embedding and each token's per-position hidden state before the vocabulary projection.
- Latent
A compressed internal code that summarizes input structure for reconstruction, generation, or downstream tasks.
- Latent Space
The coordinate system in which latent codes live, where distance and direction correspond to semantic or structural variation in the data.
- Layer Norm
Per-token mean-and-variance normalization that rescales each hidden vector before the next sublayer in a transformer block.
- Logit
A raw, unnormalized score for each vocabulary item before softmax turns scores into probabilities.
- Loss Function
A scalar score that measures how wrong the model's predictions are so training knows which direction to improve.
- Mixture of Experts
A sparse feed-forward layer that routes each token to a small subset of expert MLPs instead of one dense block.
- Modality
The kind of input or output data a model is built to handle, such as text, images, audio, or video.
- Model
A trained machine learning system that maps inputs to outputs using learned parameters.
- Model Capacity
How rich a family of functions a model can represent, driven by parameter count, width, and depth. Training data, optimization, and regularization then constrain what it actually learns.
- Module
A reusable building block inside a model architecture, such as attention or a feed-forward network.
- Multimodal Model
A model family that accepts or produces more than one data modality, such as text paired with images or audio.
- Normalization
Layers that rescale hidden vectors inside each transformer block so activations stay stable as depth and batch composition change.
- Optimizer State
Extra memory an adaptive optimizer keeps beyond model weights, such as momentum and variance estimates for each parameter.
- Overfitting
When a model fits training noise so closely that its performance on held-out or production data gets worse: the classic gap between memorization and useful generalization.
- Parameter
A learnable numeric value stored in the model weights that training updates to fit data.
- Patch
A fixed-size chunk of input data—pixels, spectrogram frames, or grouped tokens—that an encoder turns into a representation vector.
- Perplexity
A standard language-model evaluation metric: the exponential of the average per-token cross-entropy. Lower is better when comparing models on the same corpus and tokenization.
- Representation
An internal encoding of input data that downstream layers or tasks use instead of raw features.
- Residual Connection
Skip paths that add each sublayer's output back onto the running hidden stream so deep transformer stacks stay trainable.
- RMSNorm
Per-token root-mean-square scaling that rescales each hidden vector without mean centering, as used in many modern decoder transformers.
- RoPE
Rotary position embedding rotates query and key vectors by angles proportional to token position so attention scores depend on relative distance, as used in Llama-class decoder models.
- Scaling Law
Empirical power-law relationships between model scale, compute, data, and downstream loss or capability—used to forecast training budgets and benchmark trends.
- Softmax
A function that turns a vector of logits into a probability distribution that sums to one.
- Temperature
A positive scaling factor applied to logits before softmax to make the next-token distribution sharper or flatter.
- Tensor
A multi-dimensional array of numbers that carries activations, weights, or gradients through the network.
- Token
The smallest unit of text a language model reads and predicts—usually a word piece, not always a whole word. Each token ID maps through an embedding table to a dense vector of the model's hidden size before attention runs.
- Transformer
A model family built from stacked attention and feed-forward blocks that mix information across token positions in parallel.
- Vector
An ordered list of numbers that represents a point or direction in continuous space—embeddings and activations are vectors at different stages of the model.
- World Model
A model family that learns environment dynamics or state transitions so it can predict or simulate what happens next.
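
Several entries above (Logit, Temperature, Softmax, Entropy, Perplexity) describe steps in one small pipeline, so a worked sketch may help tie them together. The Python below is a minimal illustration using only the standard library. The function names and the example logits are made up for this sketch and do not come from any particular framework.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities, dividing by a positive temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats: higher means a more spread-out distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def perplexity(target_probs):
    """Exponential of the average negative log-probability of the correct tokens."""
    nll = [-math.log(p) for p in target_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical logits for a three-item vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.1]

sharp = softmax(logits, temperature=0.5)  # lower temperature: sharper distribution
base = softmax(logits)                    # temperature 1.0: plain softmax
flat = softmax(logits, temperature=2.0)   # higher temperature: flatter distribution

for name, probs in [("sharp", sharp), ("base", base), ("flat", flat)]:
    print(name, [round(p, 3) for p in probs], "entropy:", round(entropy(probs), 3))

# A model that assigns probability 0.25 to every correct token has
# perplexity of about 4, i.e. it is as uncertain as a uniform 4-way guess.
print("perplexity:", perplexity([0.25, 0.25, 0.25, 0.25]))
```

Entropy here is measured in nats because the sketch uses the natural logarithm. Switching to `math.log2` would report bits instead and leave the ordering of the three distributions unchanged.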