Context extension
Common techniques that stretch usable context beyond a model's original training window without rebuilding the entire architecture from scratch.
What It Is
Context extension is the family of techniques that push a model's usable context window past the length it saw during pretraining. The starting point is the context window limit: how many tokens attention can cover at once. Extension methods then adjust how position is represented, how attention is computed, or how the model is further trained so that longer inputs remain stable. Major families include position scaling and interpolation for RoPE models, bias-based extrapolation for schemes like ALiBi, sliding-window or chunked attention patterns that trade full pairwise access for local windows, and continued fine-tuning on longer documents. Memory-saving designs such as grouped-query attention shrink the KV cache, so very long runs stay practical on real hardware even when the position scheme already supports a large window.
Why It Matters
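One practical reason is memory. As a rough sketch, KV-cache size grows linearly with sequence length and with the number of key-value heads, which is where grouped-query attention helps. The function name `kv_cache_bytes` and the layer count, head counts, head dimension, and 2-byte precision below are illustrative assumptions, not figures from any model card.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV-cache size: keys and values for every layer and token."""
    # The leading factor of 2 covers the separate key and value tensors.
    return (2 * batch_size * n_layers * n_kv_heads
            * head_dim * seq_len * bytes_per_value)


# Hypothetical configuration, chosen only to make the arithmetic easy to follow.
cfg = dict(n_layers=32, head_dim=128)
seq_len = 32_768

mha = kv_cache_bytes(seq_len, n_kv_heads=32, **cfg)  # one KV head per query head
gqa = kv_cache_bytes(seq_len, n_kv_heads=8, **cfg)   # 4 query heads share each KV head

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")
# prints "MHA: 16.0 GiB, GQA: 4.0 GiB"
```

Under these assumptions, sharing key-value heads cuts cache size by the sharing ratio without changing which tokens attention can reach, so it affects cost rather than correctness.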
A model card may advertise 128k or 1M tokens, but that headline often reflects extension work layered on a shorter training recipe. Knowing the technique family helps you judge whether quality will hold at your actual prompt length, whether memory will fit on your hardware, and which module pages to read next. RoPE scaling changes how rotation angles behave at unseen indices; windowed attention changes which tokens can see each other; KV-cache sharing changes cost more than correctness. Extension is also separate from simply raising an API limit: without the underlying position or attention support, longer inputs can degrade silently.
Simple Example
Suppose a decoder model trained on 4k tokens uses RoPE inside attention. To serve 32k prompts, engineers might interpolate RoPE positions, scaling each index by 4k/32k = 1/8 so that rotation angles stay within the range seen in training, then lightly fine-tune on long documents. At inference, grouped-query attention reduces how many key-value heads are stored per layer, making the longer cache affordable. If quality still drifts on the farthest tokens, a sliding-window attention variant might limit each position to a local neighborhood while preserving a large nominal window. Each lever addresses a different bottleneck: position extrapolation, memory, or attention reach.
Where It Appears
Long-context releases for Llama-class, Mistral, and Qwen families typically combine RoPE scaling rules with memory-saving attention head layouts. Research models marketed for million-token contexts sometimes pair ALiBi or modified biases with hardware-aware kernels. Product teams usually publish extension recipes in model cards (YaRN, NTK-aware scaling, or progressive fine-tuning schedules), so this page does not re-derive attention variants. When a technique changes how attention mixes tokens across distance, the detailed module pages remain the right place for head layouts and compute patterns.
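The RoPE scaling rules named in such recipes can be illustrated with a minimal sketch. It assumes the standard RoPE frequency form `base ** (-2i/d)`, linear position interpolation (scaling indices by the training-to-target length ratio), and the commonly cited NTK-aware base adjustment `base * factor ** (d / (d - 2))`. The function names and the 4k/32k/128 values are hypothetical; actual releases may use different formulas or add further corrections, as YaRN does.

```python
import math


def rope_angles(position, head_dim, base=10000.0, position_scale=1.0):
    """Rotation angle for each frequency pair at one position (standard RoPE form)."""
    effective_pos = position * position_scale
    return [effective_pos * base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]


def ntk_aware_base(base, factor, head_dim):
    """Assumed NTK-aware rule: enlarge the base instead of rescaling positions."""
    return base * factor ** (head_dim / (head_dim - 2))


train_len, target_len, head_dim = 4096, 32_768, 128  # illustrative values
factor = target_len / train_len                       # 8.0
last_pos = target_len - 1

plain = rope_angles(last_pos, head_dim)

# Position interpolation: every index is squeezed back into the trained range.
interpolated = rope_angles(last_pos, head_dim, position_scale=1 / factor)

# NTK-aware scaling: the fastest frequency is left alone, the slowest is
# stretched by the full factor, and those in between fall somewhere in the middle.
ntk = rope_angles(last_pos, head_dim,
                  base=ntk_aware_base(10000.0, factor, head_dim))

print("effective position after interpolation:", interpolated[0])
print("fastest-frequency ratio (ntk / plain):", ntk[0] / plain[0])
print("slowest-frequency ratio (ntk / plain):", ntk[-1] / plain[-1])
```

In this sketch, interpolation compresses all frequencies uniformly, while the NTK-aware variant preserves high-frequency (local) detail and only slows the low-frequency components. Either is usually followed by some long-document fine-tuning.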
Curated related
- context length
- RoPE
- why long context is hard
- GQA
- attention