Model Atlas

Architecture-related concepts, model families, and component patterns. Each entry links to a canonical docs page when published.

Activation
The numeric output of a layer after its linear transform and nonlinearity, passed forward through the network.
ALiBi
Attention with linear biases adds distance-based penalties to attention logits so models can extrapolate to longer sequences without explicit position embeddings.
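A minimal sketch of the bias described above, in plain Python. The slope schedule follows the geometric sequence commonly used for power-of-two head counts; the function names `alibi_slopes` and `alibi_bias` are illustrative, not taken from any library.

```python
def alibi_slopes(num_heads):
    # One slope per head: 2^(-8/n), 2^(-16/n), ..., 2^(-8).
    # Assumes num_heads is a power of two.
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len, slope):
    # bias[i][j] = -slope * (i - j) for keys j at or before query i.
    # Future keys get -inf so a causal softmax would ignore them.
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
        for i in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])  # added to one head's attention logits
```

The penalty grows linearly with distance and involves no learned position table, which is why the same bias rule can be applied at sequence lengths longer than those seen in training.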
Alignment
Post-training steps that steer a model toward helpful, honest, and policy-compliant behavior—often via preference data, reward models, or safety filters.
Architecture
The blueprint that defines how a model's layers, modules, and data flow connect during inference.
Autoregressive Generation
A generation paradigm that produces outputs one discrete step at a time, each step conditioned on everything generated so far.
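The loop below sketches that step-by-step conditioning. `generate` and `toy_next_token` are hypothetical names, and the toy function stands in for a real model's next-token prediction.

```python
def generate(next_token, prompt, max_new_tokens, eos=None):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # Each step sees the full sequence produced so far.
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == eos:
            break
    return tokens


def toy_next_token(tokens):
    # Stand-in for a model: "predicts" the sum of the last two tokens.
    return tokens[-1] + tokens[-2]


out = generate(toy_next_token, [1, 1], max_new_tokens=5)
# out == [1, 1, 2, 3, 5, 8, 13]
```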
Component
A finer-grained part inside a module, such as an attention head or projection layer.
Computational Graph
A record of which operations produced each tensor so automatic differentiation can run backward.
Conditioning
Extra inputs—text prompts, class labels, images, or guidance weights—that steer a generative model without retraining the full parameter stack each time.
Context extension
Common techniques that stretch usable context beyond a model's original training window without rebuilding the entire architecture from scratch.
Context window
The maximum number of tokens a model can attend over in one forward pass, distinct from the training sequence length and from how many tokens it may emit in a session.
Decoder
A network stack that turns internal representations into outputs—tokens, pixels, or structured predictions—often one step at a time with attention to prior context.
Denoising Generation
A generation paradigm that starts from noisy latents and iteratively removes noise—or adds structure—until a clean sample emerges.
Diffusion Model
A generative model family that learns to reverse a noise corruption process through many iterative denoising steps.
Discriminative Model
A model trained to score, rank, or classify inputs into labels or preferences rather than synthesize new content.
Embedding
A dense vector that represents a token or other discrete item so the model can run continuous math on it.
Emergent Behavior
Qualitative capability jumps that appear only past certain model scale or training thresholds—distinct from smooth metric improvements tracked by scaling laws.
Encoder
A network stack that maps raw or patched inputs into internal representations—tokens, latents, or context vectors—for downstream decoders or heads.
Encoder-Decoder
An architecture that compresses input into representations with an encoder, then generates outputs with a decoder—often via cross-attention between the two stacks.
Feed-forward network
The position-wise dense transform that refines each token's hidden vector after attention in a transformer block.
Foundation Model
A large pretrained model intended as a shared starting point for many downstream tasks via fine-tuning or prompting.
Generalization
How well a model performs on data it was not explicitly trained to memorize—validation sets, new domains, and live traffic are the practical tests.
Generative Model
A model trained to produce new outputs such as text, images, or audio rather than only assign labels to fixed categories.
Latent
A compressed internal code that summarizes input structure for reconstruction, generation, or downstream tasks.
Latent Space
The coordinate system in which latent codes live, where distance and direction correspond to semantic or structural variation in the data.
Layer norm
Per-token mean-and-variance normalization that rescales each hidden vector before the next sublayer in a transformer block.
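As a rough illustration, the normalization for a single token's hidden vector might look like the following. The parameter names `gamma`, `beta`, and `eps` are assumptions chosen for this sketch.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    # Learned scale and shift; identity values are used when none are given.
    if gamma is None:
        gamma = [1.0] * n
    if beta is None:
        beta = [0.0] * n
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


y = layer_norm([1.0, 2.0, 3.0, 4.0])  # roughly zero mean, unit variance
```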
Mixture of experts
A sparse feed-forward layer that routes each token to a small subset of expert MLPs instead of one dense block.
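A toy version of top-k routing for one token, under simplifying assumptions: the router logits are passed in directly (in a real layer they would come from a learned projection of the token), and the experts are hypothetical scaling functions rather than MLPs.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, experts, top_k=2):
    # Keep only the top_k experts with the highest router scores.
    order = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)
    top = order[:top_k]
    # Renormalize gate weights over the selected experts only.
    gates = softmax([router_logits[i] for i in top])
    out = [0.0] * len(x)
    for g, i in zip(gates, top):
        y = experts[i](x)
        out = [o + g * v for o, v in zip(out, y)]
    return out, top


# Hypothetical experts: each one scales its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
out, chosen = moe_forward([1.0, 1.0], [0.0, 5.0, 5.0, -1.0], experts)
```

Only the chosen experts run for this token, so compute per token stays close to that of a small dense block even as the total number of experts grows.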
Modality
The kind of input or output data a model is built to handle, such as text, images, audio, or video.
Model
A trained machine learning system that maps inputs to outputs using learned parameters.
Model Capacity
The range of functions a model can represent, set mainly by parameter count, width, and depth, before training data, optimization, and regularization constrain what it actually learns.
Module
A reusable building block inside a model architecture, such as attention or a feed-forward network.
Multimodal Model
A model family that accepts or produces more than one data modality, such as text paired with images or audio.
Normalization
Layers that rescale hidden vectors inside each transformer block so activations stay stable as depth and batch composition change.
Overfitting
When a model fits training noise so closely that its performance on held-out or production data suffers. This is the classic gap between memorization and useful generalization.
Patch
A fixed-size chunk of input data—pixels, spectrogram frames, or grouped tokens—that an encoder turns into a representation vector.
Perplexity
A standard language-model evaluation metric: the exponentiated average cross-entropy over tokens. Lower is better when comparing models on the same corpus with the same tokenizer.
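The definition can be worked through in a few lines. The function name `perplexity` and its input format (the probability the model assigned to each actual token) are assumptions for this sketch.

```python
import math


def perplexity(token_probs):
    # Average negative log-likelihood (cross-entropy in nats) per token...
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # ...then exponentiate.
    return math.exp(nll)


ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

A model that assigns probability 0.25 to every correct token has a perplexity of about 4, as if it were choosing uniformly among four options at each step.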
Positional encodings
How transformers inject order into otherwise order-agnostic (permutation-equivariant) attention through absolute, relative, or rotary position schemes.
Representation
An internal encoding of input data that downstream layers or tasks use instead of raw features.
Residual connection
Skip paths that add each sublayer's output back onto the running hidden stream so deep transformer stacks stay trainable.
RMSNorm
Per-token root-mean-square scaling that rescales each hidden vector without mean centering, as used in many modern decoder transformers.
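A small sketch for one hidden vector. The names `rms_norm`, `weight`, and `eps` are illustrative assumptions.

```python
import math


def rms_norm(x, weight=None, eps=1e-6):
    # Root mean square of the vector; the mean is not subtracted.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    if weight is None:
        weight = [1.0] * len(x)  # learned per-dimension scale in practice
    return [w * v / rms for v, w in zip(x, weight)]


y = rms_norm([3.0, 4.0])
```

The output keeps a nonzero mean, since the vector is only rescaled so that its mean squared value is close to 1.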
RoPE
Rotary position embedding rotates pairs of query and key dimensions by angles proportional to token index so attention scores depend on relative distance, as used in Llama-class decoder models.
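The relative-distance property can be checked numerically with a short sketch. It assumes adjacent dimensions are paired (some implementations pair the first half of the vector with the second half instead) and uses a frequency base of 10000; `rope` and `dot` are hypothetical helper names.

```python
import math


def rope(vec, position, base=10000.0):
    # Rotate each pair (vec[i], vec[i+1]) by an angle that grows with position.
    # Assumes an even number of dimensions.
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]
s1 = dot(rope(q, 5), rope(k, 3))    # positions 5 and 3, offset 2
s2 = dot(rope(q, 12), rope(k, 10))  # positions 12 and 10, offset 2
```

Here `s1` and `s2` agree up to floating-point error: the score depends on the offset between positions, not on their absolute values.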
Scaling Law
Empirical power-law relationships between model scale, compute, data, and downstream loss or capability—used to forecast training budgets and benchmark trends.
Token
The smallest unit of text a language model reads and predicts, usually a word piece rather than a whole word. Each token ID maps to a dense embedding vector of the model's hidden size before attention runs.
Transformer
A model family built from stacked attention and feed-forward blocks that mix information across token positions in parallel.
Transformer architecture
How attention, feed-forward layers, normalization, residuals, and position information repeat inside each decoder-style transformer block.
Why long context is hard
Why stretching sequence length strains attention compute, KV-cache memory, position schemes, and retrieval quality even when a model advertises a large context window.
World Model
A model family that learns environment dynamics or state transitions so it can predict or simulate what happens next.