Positional encodings
How transformers inject order into otherwise permutation-equivariant attention through absolute, relative, and rotary position schemes.
What It Is
Positional encodings tell a transformer where each token sits in the sequence. Self-attention scores pairs of tokens by content alone, so reordering the input only reorders the output; nothing in the scores changes unless position is injected. Early transformers added fixed or learned absolute position vectors to token embeddings. Later families use relative schemes that encode the distance between positions, rotary position embedding (RoPE), which rotates query and key features by an angle tied to the token index, or Attention with Linear Biases (ALiBi), which penalizes far-apart keys without any position embedding table.
Why It Matters
The position scheme shapes how order information reaches attention, how well a model extrapolates past its training length, and which context-extension methods apply to it. Absolute embeddings are simple but can struggle when sequences grow longer than those seen in training. Relative and rotary methods build distance into the attention computation itself, which many modern decoder models prefer. Knowing the family helps you read model cards and follow links to RoPE or ALiBi variant pages without re-deriving attention mechanics.
Simple Example
Imagine three word embeddings with no position added: attention could not tell whether the input was "cat sat mat" or "mat sat cat". Add absolute position vectors and each token carries both meaning and index. Rotary position embedding instead rotates query and key vectors so that their dot product depends on the relative distance between them. ALiBi skips extra embeddings and adds a distance penalty directly to the attention logits. Each approach restores the order sensitivity that the base attention operation lacks.
Where It Appears
The original transformer used fixed sinusoidal absolute positions; many BERT-style models learn absolute position tables. GPT-style decoders historically used learned absolute embeddings at the input. Llama-class and many recent open-weight models apply RoPE inside attention. Some long-context research models ship with ALiBi biases instead of rotary angles. The choice is declared per architecture and is tied to how that model handles context extension.
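The "cat sat mat" example and the three schemes named above can be sketched in plain Python. This is a toy illustration under stated assumptions, not any model's actual implementation: the four-dimensional embeddings are made up, attention uses the inputs directly as queries, keys, and values (no learned projections), RoPE rotates consecutive feature pairs (some codebases pair features differently), and the ALiBi slope of 0.5 is a single illustrative value where real models assign a different slope to each head.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def close(a, b, tol=1e-9):
    return all(abs(x - y) <= tol for x, y in zip(a, b))

def attention(xs):
    # Simplification: queries, keys, and values are the inputs themselves.
    out = []
    for q in xs:
        scores = [dot(q, k) / math.sqrt(len(q)) for k in xs]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[d] for w, v in zip(weights, xs))
                    for d in range(len(q))])
    return out

def sinusoidal(pos, dim, base=10000.0):
    # Fixed absolute position vector of interleaved sin/cos pairs (dim must be even).
    vec = []
    for i in range(0, dim, 2):
        angle = pos / base ** (i / dim)
        vec += [math.sin(angle), math.cos(angle)]
    return vec

def add_positions(xs):
    # Absolute scheme: add the position vector to each token embedding.
    return [[x + p for x, p in zip(v, sinusoidal(i, len(v)))]
            for i, v in enumerate(xs)]

def rope(vec, pos, base=10000.0):
    # Rotary scheme: rotate each consecutive feature pair by an angle
    # proportional to the token's position.
    out = []
    for i in range(0, len(vec), 2):
        theta = pos / base ** (i / len(vec))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def alibi_bias(q_pos, k_pos, slope=0.5):
    # ALiBi-style penalty added to the attention logit; grows with distance.
    # Assumes a causal setup where k_pos <= q_pos.
    return -slope * (q_pos - k_pos)

# Hypothetical toy embeddings, chosen only for illustration.
tokens = {
    "cat": [1.0, 0.0, 0.5, 0.2],
    "sat": [0.0, 1.0, 0.3, 0.7],
    "mat": [0.6, 0.4, 0.0, 1.0],
}
forward = [tokens[w] for w in ("cat", "sat", "mat")]
reverse = forward[::-1]

# No positions: "cat" gets the same output whether it comes first or last.
cat_same = close(attention(forward)[0], attention(reverse)[2])

# Absolute positions: the output for "cat" now depends on where it sits.
cat_differs = not close(attention(add_positions(forward))[0],
                        attention(add_positions(reverse))[2])

# RoPE: the query-key score depends only on the offset (3 in both cases).
q, k = tokens["cat"], tokens["sat"]
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 105), rope(k, 102))
rope_relative = abs(near - far) < 1e-9

print(cat_same, cat_differs, rope_relative, alibi_bias(5, 2))
```

The first two checks mirror the paragraph in Simple Example: without positions the two orderings are indistinguishable to attention, and adding absolute vectors breaks that symmetry. The RoPE check shifts both positions by 100 and gets the same score, which is the relative-distance property that rotary schemes rely on.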
Related
- transformer block
- attention
- RoPE
- ALiBi