Positional encodings

How transformers inject order into otherwise permutation-invariant attention through absolute, relative, and rotary position schemes.

Attention treats tokens as an unordered set unless you add position information—absolute embeddings, relative offsets, rotary angles, or attention biases each answer where a token sits in the sequence.

What It Is

Positional encodings tell a transformer where each token sits in the sequence. Self-attention scores pairs of positions without built-in order: swapping token order would leave the same pairwise relationships unless position is injected. Early transformers added fixed or learned absolute position vectors to token embeddings. Later families use relative schemes that encode distance between positions, rotary position embedding (RoPE) that rotates query and key features by index, or attention linear biases (ALiBi) that penalize far-apart keys without explicit embedding tables.

Why It Matters

The position scheme shapes how far order is remembered, how well models extrapolate past their training length, and how attention cost scales with context. Absolute embeddings are simple but can struggle when sequences grow longer than training. Relative and rotary methods bake distance into attention math, which many modern decoder models prefer. Knowing the family helps you read model cards and follow links to RoPE or ALiBi variant pages without re-deriving attention mechanics.

Simple Example

Imagine three word embeddings with no position added: attention could not tell whether "cat sat mat" or "mat sat cat" was the input. Add absolute position vectors and each token carries both meaning and index. Rotary position embedding instead rotates query and key vectors so dot products depend on relative distance. ALiBi skips extra embeddings and adds a distance penalty directly to attention logits. Each approach restores order sensitivity the base attention operation lacks.

Where It Appears

Original transformer encoder stacks used sinusoidal absolute positions; many BERT-style models learned absolute tables. GPT-style decoders historically used learned absolute embeddings at the input. Llama-class and many recent open-weight models adopt RoPE inside attention. Some long-context research models ship with ALiBi biases instead of rotary angles. The choice is declared per architecture and pairs with how that model handles context extension.

activationsShared tag
ALiBiShared tag
AlignmentShared tag
Autoregressive generationShared tag
backpropShared tag
computation graphShared tag
ConditioningShared tag
context extensionShared tag
context lengthShared tag
data modalityShared tag
DecoderShared tag
Denoising generationShared tag
discriminative modelsShared tag
embeddingsShared tag
Emergent behaviorShared tag
EncoderShared tag
FFNShared tag
foundation modelsShared tag
GeneralizationShared tag
generative modelsShared tag
gradientsShared tag
hidden sizeShared tag
image patchShared tag
latent codeShared tag
latent manifoldShared tag
LayerNormShared tag
learned representationShared tag
logitsShared tag
lossShared tag
ML modelShared tag
model architectureShared tag
Model capacityShared tag
model componentShared tag
model moduleShared tag
model parametersShared tag
MoEShared tag
normalization layerShared tag
optimizer stateShared tag
output entropyShared tag
OverfittingShared tag
PerplexityShared tag
RMS NormShared tag
RoPEShared tag
sampling temperatureShared tag
Scaling lawShared tag
seq2seqShared tag
skip connectionShared tag
softmax functionShared tag
tensorsShared tag
tokenShared tag
transformer blockShared tag
vectorShared tag
why long context is hardShared tag

ALiBiSame concept type
Autoregressive generationSame concept type
ConditioningSame concept type
context extensionSame concept type
context lengthSame concept type
data modalitySame concept type
Denoising generationSame concept type
discriminative modelsSame concept type
FFNSame concept type
foundation modelsSame concept type
generative modelsSame concept type
image patchSame concept type
latent codeSame concept type
latent manifoldSame concept type
LayerNormSame concept type
learned representationSame concept type
ML modelSame concept type
model architectureSame concept type
model componentSame concept type
model moduleSame concept type
model parametersSame concept type
MoESame concept type
normalization layerSame concept type
optimizer stateSame concept type
RMS NormSame concept type
RoPESame concept type
sampling temperatureSame concept type
skip connectionSame concept type
why long context is hardSame concept type

transformer blockCurated related
attentionCurated related
RoPECurated related
ALiBiCurated related

Common Confusions

Positional encoding is not the same as token embeddings: embeddings carry vocabulary identity while position schemes carry index or distance. RoPE and ALiBi are not attention variants themselves—they modify how position enters the attention sublayer. Absolute position tables also differ from context-window limits: a model may advertise a large window while its position scheme still degrades on very long spans.

Positional encodings

What It Is

Why It Matters

Simple Example

Where It Appears

Common Confusions

Tags

References

On this page

Positional encodings

What It Is

Why It Matters

Simple Example

Where It Appears

Common Confusions

Related Concepts And Modules

Tags

References

On this page