Why long context is hard
Why stretching sequence length strains attention compute, KV-cache memory, position schemes, and retrieval quality even when a model advertises a large context window.
What It Is
Long-context difficulty is the set of bottlenecks that appear when a model must attend over many tokens at once. Standard self-attention compares every position to every other position, so with full attention the number of scores computed grows with the square of sequence length, as does memory whenever the score matrix is materialized. During generation, the model caches key and value vectors for every token at every layer, so KV-cache footprint grows linearly with context length and can exhaust GPU memory before compute becomes the limit. Position schemes such as RoPE or ALiBi must extrapolate to indices beyond those seen in pretraining, which can weaken distance cues on the farthest tokens. Finally, even when compute and memory allow a long window, useful retrieval can drown in noise: distant tokens compete for limited attention mass, and models may not reliably surface the one fact buried thousands of tokens back.
Why It Matters
Product limits and model cards quote large windows, but feasibility on your hardware and quality at your prompt length depend on these mechanisms. Quadratic attention cost explains why very long prompts slow down or require specialized kernels and windowed variants. KV-cache growth explains why grouped-query attention and other memory-saving head layouts matter for deployment. Position extrapolation risk explains why context extension recipes exist separately from simply raising a limit. Retrieval and noise effects explain why a model can nominally accept 128k tokens yet still miss a needle in the haystack. Understanding the constraints helps you choose extension techniques, attention modules, and serving configurations without treating context length as a single dial.
Simple Example
Imagine a 32k-token prompt on a decoder model with full self-attention. Attention must score on the order of 32k by 32k token pairs per head in each layer, roughly 64 times the pairs of a 4k prompt. Prefill writes keys and values for all 32k positions into the cache at every layer, and each generated token adds one more entry, so memory climbs with every position. If the model was pretrained on shorter sequences, RoPE angles at index 30k may sit far outside the range seen in pretraining, so relative-distance behavior can degrade. If the prompt hides one critical sentence near the middle, attention must spread probability mass across tens of thousands of competing positions, and the model may overweight recent or repeated phrases instead. Each issue has a different remedy (windowed attention, KV sharing, position scaling, or retrieval augmentation), which is why long context is a bundle of problems, not one switch.
Where It Appears
Long-document QA, repo-scale coding assistants, and multi-hour chat sessions all hit the same structural limits. Serving stacks report out-of-memory errors when KV caches exceed device memory; research models adopt sliding windows or sparse patterns to tame quadratic cost; release notes pair RoPE scaling with continued training when windows jump from 8k to 128k. Teams evaluating context extension should read how position and attention modules interact rather than assuming a larger advertised window guarantees uniform quality across every token index.
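As a rough way to see when a KV cache might outgrow device memory, the sketch below estimates full-attention score pairs and KV-cache size at a few prompt lengths. The `ModelShape` values (32 layers, 32 query heads, head dimension 128, 2-byte values) are hypothetical and do not describe any particular model; the helper names are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    """Hypothetical decoder shape; these numbers are assumptions, not a real model card."""
    num_layers: int = 32
    num_query_heads: int = 32
    num_kv_heads: int = 32    # lower than num_query_heads models grouped-query attention
    head_dim: int = 128
    bytes_per_value: int = 2  # fp16 or bf16 storage


def attention_score_pairs(seq_len: int, causal: bool = True) -> int:
    """Query-key pairs scored per head per layer under full attention."""
    if causal:
        # Each position attends to itself and every earlier position.
        return seq_len * (seq_len + 1) // 2
    return seq_len * seq_len


def kv_cache_bytes(shape: ModelShape, seq_len: int, batch_size: int = 1) -> int:
    """Bytes needed to cache keys and values for every layer at seq_len tokens."""
    # Factor of 2 covers one key vector plus one value vector per KV head.
    per_token = (2 * shape.num_layers * shape.num_kv_heads
                 * shape.head_dim * shape.bytes_per_value)
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    mha = ModelShape()                # one KV head per query head
    gqa = ModelShape(num_kv_heads=8)  # four query heads share each KV head
    for n in (4096, 32768, 131072):
        print(f"{n:>7} tokens: {attention_score_pairs(n):.3e} causal pairs/head/layer, "
              f"KV cache {kv_cache_bytes(mha, n) / 2**30:.1f} GiB (MHA), "
              f"{kv_cache_bytes(gqa, n) / 2**30:.1f} GiB (GQA)")
```

Under these assumed values, a single 32k-token sequence needs about 16 GiB of cache with one KV head per query head and about 4 GiB when four query heads share each KV head. Real serving stacks add overheads this estimate ignores, such as allocator fragmentation, activations, and the weights themselves.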
Curated related
- context extension
- RoPE
- attention