Model Atlas

Machine-learning terms defined with short summaries. Each entry links to a canonical glossary page once that page is published.

Activation
The numeric output of a layer after its linear transform and nonlinearity, passed forward through the network.
ALiBi
Attention with linear biases adds distance-based penalties to attention logits so models can extrapolate to longer sequences without explicit position embeddings.
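The distance-based penalty can be sketched as a matrix of additive biases. This is a minimal illustration, assuming a single attention head, causal masking, and a hypothetical per-head constant `slope`; `alibi_bias` is an illustrative name, not a library function.

```python
def alibi_bias(seq_len, slope):
    """Build a seq_len x seq_len matrix of additive attention biases.

    Entry [i][j] is -slope * (i - j) when key j is at or before query i,
    so more distant keys receive a larger penalty. Future keys get -inf
    (causal masking, an assumption of this sketch).
    """
    bias = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            if j <= i:
                row.append(-slope * (i - j))
            else:
                row.append(float("-inf"))
        bias.append(row)
    return bias
```

The resulting matrix would be added to the query-key logits before softmax; no position embedding is involved.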
Alignment
Post-training steps that steer a model toward helpful, honest, and policy-compliant behavior—often via preference data, reward models, or safety filters.
Architecture
The blueprint that defines how a model's layers, modules, and data flow connect during training and inference.
Autoregressive Generation
A generation paradigm that produces outputs one discrete step at a time, each step conditioned on everything generated so far.
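The step-by-step loop can be sketched as follows. `generate` and `toy_next_token` are hypothetical names; the toy function stands in for a real model and simply returns the sum of the last two tokens.

```python
def generate(next_token, prompt, max_new_tokens, stop_token=None):
    """Append one token at a time, each chosen from the full history so far."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)  # conditioned on everything generated so far
        tokens.append(tok)
        if tok == stop_token:
            break
    return tokens


def toy_next_token(tokens):
    # Hypothetical stand-in for a model: sum of the last two tokens.
    if len(tokens) >= 2:
        return tokens[-1] + tokens[-2]
    return tokens[-1] + 1
```

For example, `generate(toy_next_token, [1, 1], 4)` extends the prompt by four tokens, each computed from the sequence produced so far.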
Backpropagation
The backward pass that applies the chain rule on a computational graph to compute gradients for every differentiable parameter.
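A hand-written sketch of the chain rule on a tiny graph, assuming a hypothetical one-weight model `z = w * x + b` with squared-error loss. Real frameworks automate these steps; the function name is illustrative.

```python
def forward_backward(w, b, x, target):
    """Return (loss, dloss/dw, dloss/db) for loss = (w * x + b - target) ** 2."""
    # Forward pass: compute and keep the intermediate values.
    z = w * x + b
    err = z - target
    loss = err ** 2

    # Backward pass: multiply local derivatives along each path to the loss.
    dloss_derr = 2 * err
    derr_dz = 1.0
    dz_dw = x
    dz_db = 1.0
    grad_w = dloss_derr * derr_dz * dz_dw
    grad_b = dloss_derr * derr_dz * dz_db
    return loss, grad_w, grad_b
```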
Component
A finer-grained part inside a module, such as an attention head or projection layer.
Computational Graph
A record of which operations produced each tensor so automatic differentiation can run backward.
Conditioning
Extra inputs—text prompts, class labels, images, or guidance weights—that steer a generative model without retraining the full parameter stack each time.
Context Window
The maximum number of tokens a model can attend over in one forward pass, distinct from training sequence length and how many tokens it may emit in a session.
Decoder
A network stack that turns internal representations into outputs—tokens, pixels, or structured predictions—often one step at a time with attention to prior context.
Denoising Generation
A generation paradigm that starts from noisy latents and iteratively removes noise—or adds structure—until a clean sample emerges.
Diffusion Model
A generative model family that learns to reverse a noise corruption process through many iterative denoising steps.
Discriminative Model
A model trained to score, rank, or classify inputs into labels or preferences rather than synthesize new content.
Embedding
A dense vector that represents a token or other discrete item so the model can run continuous math on it.
Emergent Behavior
Qualitative capability jumps that appear only past certain model scale or training thresholds—distinct from smooth metric improvements tracked by scaling laws.
Encoder
A network stack that maps raw or patched inputs into internal representations—tokens, latents, or context vectors—for downstream decoders or heads.
Encoder-Decoder
An architecture that compresses input into representations with an encoder, then generates outputs with a decoder—often via cross-attention between the two stacks.
Entropy
A measure of how spread out or uncertain a probability distribution is over next-token choices.
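As a minimal sketch, Shannon entropy in nats for a next-token distribution given as a list of probabilities (the function name is illustrative):

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A uniform distribution over two choices gives `log(2)` nats, while a distribution that puts all mass on one token gives zero.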
Feed-Forward Network
The position-wise dense transform that refines each token's hidden vector after attention in a transformer block.
Foundation Model
A large pretrained model intended as a shared starting point for many downstream tasks via fine-tuning or prompting.
Generalization
How well a model performs on data it was not explicitly trained to memorize—validation sets, new domains, and live traffic are the practical tests.
Generative Model
A model trained to produce new outputs such as text, images, or audio rather than only assign labels to fixed categories.
Gradient
The vector of partial derivatives of the training loss with respect to each parameter. It points in the direction of steepest increase, so optimizers step against it to reduce the loss.
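A minimal sketch of how a gradient drives an update, assuming a hypothetical one-parameter loss `(w - 3) ** 2` and plain gradient descent with learning rate `lr`. Each step moves the parameter against the derivative, which lowers the loss.

```python
def loss(w):
    # Hypothetical quadratic loss with its minimum at w = 3.
    return (w - 3.0) ** 2


def grad(w):
    # Derivative of the loss with respect to w.
    return 2.0 * (w - 3.0)


def descend(w, lr, steps):
    """Run `steps` gradient-descent updates starting from w."""
    for _ in range(steps):
        w = w - lr * grad(w)  # step against the gradient
    return w
```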
Hidden Size
The width of a model's internal vectors—the number of dimensions in each token embedding and each token's per-position hidden state before the vocabulary projection.
Latent
A compressed internal code that summarizes input structure for reconstruction, generation, or downstream tasks.
Latent Space
The coordinate system in which latent codes live, where distance and direction correspond to semantic or structural variation in the data.
Layer Norm
Per-token mean-and-variance normalization that rescales each hidden vector before the next sublayer in a transformer block.
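A minimal sketch of the per-token computation on one hidden vector. The learned gain and bias that usually follow are omitted, and the `eps` value is an assumption.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one hidden vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]
```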
Logit
A raw, unnormalized score for each vocabulary item before softmax turns scores into probabilities.
Loss Function
A scalar score that measures how wrong the model's predictions are so training knows which direction to improve.
Mixture of Experts
A sparse feed-forward layer that routes each token to a small subset of expert MLPs instead of one dense block.
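The routing step can be sketched as top-k selection over router scores. This is an illustration under stated assumptions: experts are plain Python callables on a scalar input, and the selected logits are renormalized with a softmax, which is one common choice rather than the only one. All names are hypothetical.

```python
import math


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    m = max(router_logits[i] for i in top)  # subtract max for stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    out = 0.0
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)
    return out
```

Only `k` experts run per token, so compute stays roughly constant as the number of experts grows.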
Modality
The kind of input or output data a model is built to handle, such as text, images, audio, or video.
Model
A trained machine learning system that maps inputs to outputs using learned parameters.
Model Capacity
How much function a model can represent, driven by parameter count, width, and depth, before optimization, regularization, and training data constrain what it actually learns.
Module
A reusable building block inside a model architecture, such as attention or a feed-forward network.
Multimodal Model
A model family that accepts or produces more than one data modality, such as text paired with images or audio.
Normalization
Layers that rescale hidden vectors inside each transformer block so activations stay stable as depth and batch composition change.
Optimizer State
Extra memory an adaptive optimizer keeps beyond model weights, such as momentum and variance estimates for each parameter.
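As an illustration, a single-parameter Adam-style update keeps two running estimates plus a step counter per parameter. This is a sketch with commonly used default hyperparameters, not any library's API; `adam_step` and the `state` dictionary layout are assumptions.

```python
import math


def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update one scalar parameter; `state` holds the optimizer's extra memory."""
    state["t"] += 1
    # Momentum: running mean of gradients.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    # Variance estimate: running mean of squared gradients.
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction for the zero-initialized estimates.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)
```

Here `m` and `v` are stored for every parameter, which is why optimizer state can take about twice the memory of the weights themselves.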
Overfitting
When a model fits training noise so closely that it performs worse on held-out or production data, the classic gap between memorization and useful generalization.
Parameter
A learnable numeric value stored in the model weights that training updates to fit data.
Patch
A fixed-size chunk of input data—pixels, spectrogram frames, or grouped tokens—that an encoder turns into a representation vector.
Perplexity
A standard language-model evaluation metric: the exponentiated average cross-entropy over tokens. Lower is better when comparing models on the same corpus with the same tokenizer.
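A minimal sketch of the computation, assuming the input is the list of probabilities a model assigned to each correct token (the function name is illustrative):

```python
import math


def perplexity(token_probs):
    """Exponentiated mean negative log-likelihood over the tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))
```

A model that assigns probability 0.25 to every correct token has a perplexity of about 4, as if it were choosing uniformly among four options.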
Representation
An internal encoding of input data that downstream layers or tasks use instead of raw features.
Residual Connection
Skip paths that add each sublayer's output back onto the running hidden stream so deep transformer stacks stay trainable.
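The skip path is a single addition. In this sketch, `sublayer` is any hypothetical callable that maps a hidden vector to a vector of the same length.

```python
def residual_block(x, sublayer):
    """Add the sublayer's output back onto its input, element by element."""
    return [xi + si for xi, si in zip(x, sublayer(x))]
```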
RMSNorm
Per-token root-mean-square scaling that rescales each hidden vector without mean centering, as used in many modern decoder transformers.
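A minimal sketch on one hidden vector, with the learned gain omitted and an assumed `eps`. Unlike mean-and-variance normalization, the mean is not subtracted.

```python
import math


def rms_norm(x, eps=1e-6):
    """Divide a hidden vector by its root-mean-square; no mean centering."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]
```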
RoPE
Rotary position embedding rotates query and key vectors by an angle proportional to token position so attention scores depend on relative distance, as used in Llama-class decoder models.
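The relative-distance property can be checked with a small sketch. It assumes an even-length vector, adjacent dimensions paired together (implementations differ on the pairing layout), and a frequency base of 10000; `rope` and `dot` are illustrative names.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))
```

Shifting both positions by the same amount leaves the score unchanged: `dot(rope(q, 5), rope(k, 2))` matches `dot(rope(q, 13), rope(k, 10))` up to floating-point error, because only the offset of 3 matters.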
Scaling Law
Empirical power-law relationships between model scale, compute, data, and downstream loss or capability—used to forecast training budgets and benchmark trends.
Softmax
A function that turns a vector of logits into a probability distribution that sums to one.
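A minimal sketch in plain Python. Subtracting the maximum logit first is a standard trick to avoid overflow and does not change the result.

```python
import math


def softmax(logits):
    """Map raw scores to probabilities that sum to one."""
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```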
Temperature
A positive scaling factor applied to logits before softmax to make the next-token distribution sharper or flatter.
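The effect can be sketched by dividing logits by the temperature before normalizing. The function name is illustrative; values below 1 sharpen the distribution and values above 1 flatten it.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # shift for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```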
Tensor
A multi-dimensional array of numbers that carries activations, weights, or gradients through the network.
Token
The smallest unit of text a language model reads and predicts, usually a word piece rather than a whole word. Each token ID maps through an embedding table to a dense vector of the model's hidden size before attention runs.
Transformer
A model family built from stacked attention and feed-forward blocks that mix information across token positions in parallel.
Vector
An ordered list of numbers that represents a point or direction in continuous space—embeddings and activations are vectors at different stages of the model.
World Model
A model family that learns environment dynamics or state transitions so it can predict or simulate what happens next.
