Architecture
Architecture-related concepts, model families, and component patterns. Each entry links to a canonical docs page when published.
- Activation
The numeric output of a layer after its linear transform and nonlinearity, passed forward through the network.
- ALiBi
Attention with linear biases adds distance-based penalties to attention logits so models can extrapolate to longer sequences without explicit position embeddings.
- Alignment
Post-training steps that steer a model toward helpful, honest, and policy-compliant behavior—often via preference data, reward models, or safety filters.
- Architecture
The blueprint that defines how a model's layers, modules, and data flow connect during inference.
- Autoregressive Generation
A generation paradigm that produces outputs one discrete step at a time, each step conditioned on everything generated so far.
- Component
A finer-grained part inside a module, such as an attention head or projection layer.
- Computational Graph
A record of which operations produced each tensor so automatic differentiation can run backward.
- Conditioning
Extra inputs—text prompts, class labels, images, or guidance weights—that steer a generative model without retraining the full parameter stack each time.
- Context extension
Common techniques that stretch usable context beyond a model's original training window without rebuilding the entire architecture from scratch.
- Context window
The maximum number of tokens a model can attend over in one forward pass, distinct from both the training sequence length and the number of tokens it may emit in a session.
- Decoder
A network stack that turns internal representations into outputs—tokens, pixels, or structured predictions—often one step at a time with attention to prior context.
- Denoising Generation
A generation paradigm that starts from noisy latents and iteratively removes noise—or adds structure—until a clean sample emerges.
- Diffusion Model
A generative model family that learns to reverse a noise corruption process through many iterative denoising steps.
- Discriminative Model
A model trained to score, rank, or classify inputs into labels or preferences rather than synthesize new content.
- Embedding
A dense vector that represents a token or other discrete item so the model can run continuous math on it.
- Emergent Behavior
Qualitative capability jumps that appear only past certain model scale or training thresholds—distinct from smooth metric improvements tracked by scaling laws.
- Encoder
A network stack that maps raw or patched inputs into internal representations—tokens, latents, or context vectors—for downstream decoders or heads.
- Encoder-Decoder
An architecture that compresses input into representations with an encoder, then generates outputs with a decoder—often via cross-attention between the two stacks.
- Feed-forward network
The position-wise dense transform that refines each token's hidden vector after attention in a transformer block.
- Foundation Model
A large pretrained model intended as a shared starting point for many downstream tasks via fine-tuning or prompting.
- Generalization
How well a model performs on data it was not trained on; validation sets, new domains, and live traffic are the practical tests.
- Generative Model
A model trained to produce new outputs such as text, images, or audio rather than only assign labels to fixed categories.
- Latent
A compressed internal code that summarizes input structure for reconstruction, generation, or downstream tasks.
- Latent Space
The coordinate system in which latent codes live, where distance and direction correspond to semantic or structural variation in the data.
- Layer norm
Per-token mean-and-variance normalization that rescales each hidden vector before the next sublayer in a transformer block.
- Mixture of experts
A sparse feed-forward layer that routes each token to a small subset of expert MLPs instead of one dense block.
- Modality
The kind of input or output data a model is built to handle, such as text, images, audio, or video.
- Model
A trained machine learning system that maps inputs to outputs using learned parameters.
- Model Capacity
How rich a family of functions a model can represent, driven by parameter count, width, and depth, before training data, optimization, and regularization constrain what it actually learns.
- Module
A reusable building block inside a model architecture, such as attention or a feed-forward network.
- Multimodal Model
A model family that accepts or produces more than one data modality, such as text paired with images or audio.
- Normalization
Layers that rescale hidden vectors inside each transformer block so activations stay stable as depth and batch composition change.
- Overfitting
When a model fits training noise so closely that its performance on held-out or production data degrades: the classic gap between memorization and useful generalization.
- Patch
A fixed-size chunk of input data—pixels, spectrogram frames, or grouped tokens—that an encoder turns into a representation vector.
- Perplexity
A standard language-model evaluation metric: the exponentiated average cross-entropy over tokens. Lower is better when comparing models on the same corpus.
- Positional encodings
How transformers inject order into otherwise order-agnostic attention through absolute, relative, or rotary position schemes.
- Representation
An internal encoding of input data that downstream layers or tasks use instead of raw features.
- Residual connection
Skip paths that add each sublayer's output back onto the running hidden stream so deep transformer stacks stay trainable.
- RMSNorm
Per-token root-mean-square scaling that rescales each hidden vector without mean centering, as used in many modern decoder transformers.
- RoPE
Rotary position embedding rotates query and key vectors by angles proportional to token index so attention scores depend on relative distance, as used in Llama-class decoder models.
- Scaling Law
Empirical power-law relationships between model scale, compute, data, and downstream loss or capability—used to forecast training budgets and benchmark trends.
- Token
The smallest unit of text a language model reads and predicts: usually a word piece, not always a whole word. Each token ID maps through an embedding table to a dense vector of the model's hidden size before attention runs.
- Transformer
A model family built from stacked attention and feed-forward blocks that mix information across token positions in parallel.
- Transformer architecture
How attention, feed-forward layers, normalization, residuals, and position information repeat inside each decoder-style transformer block.
- Why long context is hard
Why stretching sequence length strains attention compute, KV-cache memory, position schemes, and retrieval quality even when a model advertises a large context window.
- World Model
A model family that learns environment dynamics or state transitions so it can predict or simulate what happens next.
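
Three of the entries above (Perplexity, Layer norm, and RMSNorm) describe short formulas, so a small worked sketch may help make them concrete. The Python below is illustrative only and uses nothing beyond the standard library. The function names, the `eps` default, and the choice to omit learned scale and shift parameters are assumptions made for this example, not part of any particular model or library.

```python
import math


def perplexity(token_log_probs):
    """Exponentiated average cross-entropy over tokens.

    Assumes `token_log_probs` holds the natural-log probability the model
    assigned to each observed token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


def layer_norm(x, eps=1e-5):
    """Per-token mean-and-variance normalization of one hidden vector.

    Learned scale and shift parameters are left out to keep the sketch short.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """Per-token root-mean-square scaling, with no mean centering."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


if __name__ == "__main__":
    # A model that spreads probability evenly over 4 choices at every step
    # has a perplexity of roughly 4.
    print(perplexity([math.log(0.25)] * 4))

    hidden = [1.0, 2.0, 3.0, 4.0]
    print(layer_norm(hidden))  # centered: values sum to roughly zero
    print(rms_norm(hidden))    # rescaled only: all values keep their sign
```

The last two calls show the practical difference between the two normalizers: `layer_norm` subtracts the mean before scaling, while `rms_norm` only divides by the root-mean-square, which is why it is described above as working without mean centering.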