Attention

How transformer blocks mix information across token positions using query, key, and value projections.

A sequence model must decide which earlier tokens matter for each position without hard-coding fixed windows or hand-built alignment rules. Attention scores how much each position should read from every other position, then blends value vectors according to those scores, so context flows dynamically along the sequence.

At a glance

Optimizes

  • Context Mixing
  • Sequence Modeling

Practical benefits

  • routes information between positions in a sequence
  • enables long-range dependencies without fixed receptive fields

What It Is

Attention is the core mixing mechanism in transformer blocks. Each position projects its hidden state into a query, a key, and a value. The query is compared against the keys of the other positions to form weights, and the updated representation for that position is the weighted sum of the corresponding value vectors.
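
The compare-then-blend step can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not a reference implementation: it assumes the queries, keys, and values have already been projected, and it assumes the common scaled dot-product form (dot-product scores divided by the square root of the key size, then a softmax), which this page does not spell out. The function names and the `causal` flag are hypothetical.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Single-head attention over per-position vectors (lists of floats).

    queries, keys: one vector of size d_k per position.
    values: one vector per position (any fixed size).
    causal=True lets position i read only positions 0..i.
    """
    d_k = len(keys[0])
    scale = 1.0 / math.sqrt(d_k)  # assumption: scaled dot-product convention
    outputs = []
    for i, q in enumerate(queries):
        visible = range(i + 1) if causal else range(len(keys))
        # Compare this position's query with every visible key.
        scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in visible]
        weights = softmax(scores)
        # Blend the visible value vectors according to the weights.
        out = [0.0] * len(values[0])
        for w, j in zip(weights, visible):
            for c, v in enumerate(values[j]):
                out[c] += w * v
        outputs.append(out)
    return outputs
```

When all keys are identical, every position receives equal weight and each output is simply the mean of the visible value vectors. With `causal=True`, the first position can see only itself, so its output equals its own value vector.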

Why It Matters

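Without attention, the position-wise feed-forward layers in a transformer block process each token independently, so no stack of them could route information between positions, let alone across arbitrary distances in one step. Head-sharing variants trade memory and bandwidth against head diversity: the fewer distinct key-value heads a model keeps at inference time, the less key-value state it must store and read for each generated token.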
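
To make the memory side of that trade-off concrete, here is a rough back-of-the-envelope sketch of how much key-value state an inference cache would hold. The model shape (32 layers, 32 query heads, head size 128, 4096 cached tokens, 2 bytes per stored value) is a hypothetical example, not a description of any particular model, and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and position."""
    per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = keys + values
    return per_token * seq_len * batch_size * bytes_per_value

# Hypothetical model shape; only the number of KV heads changes below.
shape = dict(num_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(num_kv_heads=32, **shape)  # one KV head per query head
gqa = kv_cache_bytes(num_kv_heads=8, **shape)   # 4 query heads share each KV head
mqa = kv_cache_bytes(num_kv_heads=1, **shape)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**20:.0f} MiB")
# MHA: 2048 MiB
# GQA: 512 MiB
# MQA: 64 MiB
```

Under these assumptions the cache shrinks in direct proportion to the number of key-value heads, which is why the head count is the main knob the sharing variants turn.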

Attention Variants

Transformer blocks implement attention through parallel heads. Multi-head attention gives every query head its own key-value (KV) head. Multi-query attention shares a single KV head across all query heads. Grouped-query attention sits between the two: each KV head is shared within a group of query heads, which keeps more KV diversity than multi-query attention while storing fewer KV heads than multi-head attention.
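
The three variants differ only in which KV head each query head reads from, which the following sketch makes explicit. It assumes query heads are assigned to KV heads in contiguous, equal-sized groups; that grouping convention and the helper names `kv_head_for` and `head_mapping` are illustrative assumptions rather than something this page specifies.

```python
def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the KV head that a given query head reads from."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads  # query heads per KV head
    return query_head // group_size

def head_mapping(num_query_heads, num_kv_heads):
    """KV head index for every query head, in order."""
    return [kv_head_for(h, num_query_heads, num_kv_heads)
            for h in range(num_query_heads)]

print(head_mapping(8, 8))  # multi-head:    [0, 1, 2, 3, 4, 5, 6, 7]
print(head_mapping(8, 2))  # grouped-query: [0, 0, 0, 0, 1, 1, 1, 1]
print(head_mapping(8, 1))  # multi-query:   [0, 0, 0, 0, 0, 0, 0, 0]
```

Seen this way, multi-head and multi-query attention are the two extremes of grouped-query attention: one query head per group, and all query heads in a single group.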
