Attention

How transformer blocks mix information across token positions using query, key, and value projections.

A sequence model must decide which earlier tokens matter for each position without hard-coding fixed windows or hand-built alignment rules. Attention scores how much each position should read from every other position, then blends value vectors according to those scores, so context flows dynamically along the sequence.

At a glance

Optimizes

Context Mixing
Sequence Modeling

Practical benefits

routes information between positions in a sequence
enables long-range dependencies without fixed receptive fields

Example models

None listed yet.

What It Is

Attention is the core mixing mechanism in transformer blocks. Each position's hidden state is projected into a query, a key, and a value. Queries are compared with keys to form weights, and each position's updated representation is the weighted sum of the value vectors.
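The steps above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not a reference implementation: the projection matrices `w_q`, `w_k`, and `w_v` are hypothetical names, and the scaling by the square root of the key dimension and the softmax normalization are assumptions taken from the common scaled dot-product form, which the text above does not spell out. No causal mask is applied, so every position can read from every other.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(hidden, w_q, w_k, w_v):
    """Single-head attention over `hidden` (one row per position).

    Returns (weights, mixed), where weights[i][j] is how much
    position i reads from position j.
    """
    q = matmul(hidden, w_q)  # queries
    k = matmul(hidden, w_k)  # keys
    v = matmul(hidden, w_v)  # values
    scale = 1.0 / math.sqrt(len(k[0]))  # assumed scaled dot-product form
    k_t = [list(col) for col in zip(*k)]  # transpose of the key matrix
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    mixed = matmul(weights, v)  # weighted sum of value vectors
    return weights, mixed

# Toy input: three positions, two features, identity projections.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, mixed = attention(hidden, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output row is a convex combination of the value vectors.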

Why It Matters

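The feed-forward layers in a transformer block act on each position independently, so without attention a deep stack of them could not route information between positions; attention lets any position read from any other in a single step. Head-sharing variants keep fewer distinct key-value heads at inference time, saving memory and bandwidth at the cost of less diversity among heads.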
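A rough way to see the memory side of that trade-off is to count the keys and values cached during autoregressive decoding. The sketch below assumes the memory in question is that key-value cache; `kv_cache_bytes` and the model shape (32 layers, head dimension 128, 4096 positions, 2 bytes per value) are hypothetical and chosen only for round numbers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor of 2: one cached tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Hypothetical model shape, not taken from any specific model.
shape = dict(n_layers=32, head_dim=128, seq_len=4096)
for name, n_kv_heads in [("multi-head", 32), ("grouped-query", 8), ("multi-query", 1)]:
    mib = kv_cache_bytes(n_kv_heads=n_kv_heads, **shape) / 2**20
    print(f"{name:>14}: {mib:.0f} MiB")
```

Under these assumptions the cache comes to 2048 MiB with 32 key-value heads, 512 MiB with 8, and 64 MiB with 1. The size scales linearly with the number of key-value heads, while the number of query heads does not enter the formula.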

Attention Variants

Transformer blocks implement attention through parallel heads. Multi-head attention keeps a separate key-value (KV) head for each query head. Multi-query attention shares a single KV head across all query heads. Grouped-query attention sits between the two: query heads are split into groups, and each group shares one KV head, which keeps more key-value diversity than multi-query attention.
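The three variants differ only in which KV head each query head reads from, which can be illustrated with a small mapping. This is a sketch under one assumption: query heads are grouped contiguously, which is a common convention but not something the text above specifies. The function names are hypothetical.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the KV head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads  # query heads per KV head
    return query_head // group_size           # assumes contiguous groups

def head_map(n_query_heads, n_kv_heads):
    """KV head index for every query head, in order."""
    return [kv_head_for(h, n_query_heads, n_kv_heads) for h in range(n_query_heads)]

print(head_map(8, 8))  # multi-head: one KV head per query head
print(head_map(8, 2))  # grouped-query: two groups of four query heads
print(head_map(8, 1))  # multi-query: every query head shares KV head 0
```

With eight query heads, the multi-head case maps each query head to its own KV head, the grouped-query case with two KV heads maps the first four query heads to KV head 0 and the rest to KV head 1, and the multi-query case maps all of them to KV head 0.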
