Context extension

Common techniques that stretch usable context beyond a model's original training window without rebuilding the entire architecture from scratch.

Models are trained on finite sequence lengths. Context extension covers the techniques applied after pretraining (position scaling, windowed attention, memory-saving attention heads, and continued training) that let them handle longer prompts.

What It Is

Context extension is the family of techniques that push a model's usable context window past the length it saw during pretraining. The starting point is the context window limit: how many tokens attention can cover at once. Extension methods then adjust how position is represented, how attention is computed, or how the model is further trained so longer inputs remain stable. Major families include rotary position scaling and interpolation for RoPE models, bias or embedding extrapolation for schemes like ALiBi, sliding or chunked attention patterns that trade full pairwise access for local windows, and continued fine-tuning on longer documents. Inference optimizations such as grouped-query attention shrink KV-cache memory, so a window that the position scheme can support also remains affordable to run.
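The trade between full pairwise access and local windows can be made concrete with a mask. The sketch below is a minimal illustration, not any particular model's implementation: the function name `sliding_window_causal_mask` and the `window` parameter are assumptions chosen for this example.

```python
import numpy as np

def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where entry [q, k] is True if query q may attend to key k."""
    q = np.arange(seq_len)[:, None]   # query positions as a column
    k = np.arange(seq_len)[None, :]   # key positions as a row
    causal = k <= q                   # no attention to future tokens
    local = (q - k) < window          # only the most recent `window` tokens
    return causal & local

mask = sliding_window_causal_mask(seq_len=6, window=3)
print(mask.astype(int))
```

Each row has at most `window` allowed keys, so attention cost per token stops growing with sequence length, at the price of losing direct access to tokens outside the window.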

Why It Matters

A model card may advertise 128k or 1M tokens, but that headline often reflects extension work layered on a shorter training recipe. Knowing the technique family helps you judge whether quality will hold on your actual prompt length, whether memory will fit on your hardware, and which module pages to read next. RoPE scaling changes how angles behave at unseen indices; windowed attention changes which tokens can see each other; KV-cache sharing changes cost more than correctness. Extension is also separate from simply raising an API limit: without the underlying position or attention support, longer inputs can degrade silently.
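Whether memory will fit can be estimated with simple arithmetic. The sketch below assumes a hypothetical model (32 layers, head dimension 128, 16-bit cache values) and compares storing 32 key-value heads per layer against 8 shared key-value heads. The function `kv_cache_bytes` and all of the numbers are illustrative assumptions, not figures from a specific model card.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Approximate KV-cache size: one key and one value vector per token, layer, and KV head."""
    keys_and_values = 2
    return (keys_and_values * batch_size * seq_len * n_layers
            * n_kv_heads * head_dim * bytes_per_value)

# Hypothetical configuration, 32k-token prompt, fp16 cache.
full = kv_cache_bytes(seq_len=32_768, n_layers=32, n_kv_heads=32, head_dim=128)
gqa = kv_cache_bytes(seq_len=32_768, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"one KV head per query head: {full / 2**30:.1f} GiB")
print(f"8 shared KV heads:          {gqa / 2**30:.1f} GiB")
```

Under these assumptions the cache shrinks from 16 GiB to 4 GiB. The saving changes cost only; it says nothing about whether the model uses the extra tokens well.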

Simple Example

Suppose a decoder model trained on 4k tokens uses RoPE inside attention. To serve 32k prompts, engineers might interpolate RoPE frequencies so rotation angles grow more slowly at new positions, then lightly fine-tune on long documents. At inference, grouped-query attention reduces how many key-value heads are stored per layer, making the longer cache affordable. If quality still drifts on the farthest tokens, a sliding-window attention variant might limit each position to a local neighborhood while preserving a large nominal window. Each lever addresses a different bottleneck: position extrapolation, memory, or attention reach.
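The interpolation step in this example can be sketched as follows. This is a minimal illustration of linear position interpolation, assuming the common RoPE frequency layout with base 10000. The helper `rope_angles` and its `scale` parameter are names invented for this sketch, and real implementations differ in details.

```python
import numpy as np

def rope_angles(positions, head_dim, base=10000.0, scale=1.0):
    """Rotation angles per (position, frequency); scale > 1 compresses positions."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    scaled_positions = np.asarray(positions, dtype=np.float64) / scale
    return np.outer(scaled_positions, inv_freq)

train_len, target_len = 4096, 32768
scale = target_len / train_len  # 8.0 for the 4k -> 32k example

last = [target_len - 1]
plain = rope_angles(last, head_dim=64)                 # extrapolates past training
interp = rope_angles(last, head_dim=64, scale=scale)   # mapped back into trained range

print(plain[0, 0], interp[0, 0])
```

With `scale=8`, the last position of a 32k prompt is treated like a fractional position just under 4096, so every angle stays inside the range seen in training. Neighboring tokens end up closer together in angle, which is one reason the light fine-tuning pass is still used.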

Where It Appears

Long-context releases for Llama-class, Mistral, and Qwen families typically combine RoPE scaling rules with memory-saving attention heads. Some research models marketed for million-token contexts pair ALiBi or modified biases with hardware-aware kernels. Model cards usually name the extension recipe (YaRN, NTK-aware scaling, or a progressive fine-tuning schedule), and this page does not re-derive those variants. When a technique changes how attention mixes tokens across distance, the detailed module pages remain the right place for head layouts and compute patterns.
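As a rough illustration of what an NTK-aware recipe does, the sketch below uses one commonly cited form: raise the RoPE base by `scale ** (head_dim / (head_dim - 2))`. This is an assumption about the general idea, not the exact rule from any specific model card, and the helper names are hypothetical.

```python
import numpy as np

def ntk_scaled_base(base, scale, head_dim):
    """One commonly cited NTK-aware adjustment of the RoPE base."""
    return base * scale ** (head_dim / (head_dim - 2))

def inv_freq(base, head_dim):
    """Per-pair rotation frequencies, from fastest (index 0) to slowest."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

head_dim, scale = 64, 8.0
orig = inv_freq(10000.0, head_dim)
ntk = inv_freq(ntk_scaled_base(10000.0, scale, head_dim), head_dim)

print("fastest frequency ratio:", ntk[0] / orig[0])
print("slowest frequency ratio:", ntk[-1] / orig[-1])
```

Under this form the fastest-rotating pair is left unchanged while the slowest pair is slowed by the full scale factor, so short-range position detail is preserved and only the long-range components are stretched.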

Related Concepts And Modules

context length
RoPE
why long context is hard
GQA
attention

Common Confusions

Context extension is not the same as the context window definition: the window states how many tokens can be attended over; extension explains how that limit is stretched beyond training. It is also not a guarantee of perfect recall on very long inputs—position schemes can extrapolate poorly, and distant tokens may receive less effective attention. Finally, inference memory optimizations like grouped-query attention improve feasibility at long lengths but do not by themselves teach the model new long-range dependencies; they complement rather than replace position and training extensions.
