Attention
How transformer blocks mix information across token positions using query, key, and value projections.
A sequence model must decide which earlier tokens matter for each position without hard-coding fixed windows or hand-built alignment rules. Attention scores how much each position should read from every other position, then blends value vectors according to those scores, so context flows dynamically along the sequence.
At a glance
Optimizes
- Context Mixing
- Sequence Modeling
Practical benefits
- routes information between positions in a sequence
- enables long-range dependencies without fixed receptive fields
Example models
None listed yet.
What It Is
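Attention is the core mixing mechanism in transformer blocks. Each position projects its hidden state into a query, a key, and a value. It then compares its query against every key to get scores, normalizes those scores into weights (typically with a scaled dot product followed by a softmax), and takes the weighted sum of the value vectors to produce an updated representation.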
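The project, compare, and blend steps can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a reference implementation. The function names, the random weight matrices, the toy dimensions, and the causal mask (each position reads only itself and earlier positions) are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(hidden, w_q, w_k, w_v, causal=True):
    """Single-head scaled dot-product attention over one sequence.

    hidden: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns (output, weights) with shapes (seq_len, d_head), (seq_len, seq_len).
    """
    q = hidden @ w_q  # queries: what each position is looking for
    k = hidden @ w_k  # keys: what each position offers to be matched on
    v = hidden @ w_v  # values: the content that actually gets blended
    # Compare every query with every key; scale by sqrt(d_head).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Assumed mask: block each position from reading later positions.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights         # weighted sum of value vectors

# Hypothetical toy sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
hidden = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
out, weights = attention(hidden, w_q, w_k, w_v)
```

Row i of `weights` says how much position i reads from each position, and row i of `out` is the resulting blend of value vectors. A full transformer block would run several such heads in parallel and apply an output projection, both omitted here.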
Why It Matters
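Feed-forward layers in a transformer act on each position independently, so without attention even a deep stack of them could not route information between positions, let alone across arbitrary distances in one step. Head-sharing variants such as multi-query and grouped-query attention let several query heads share one key-value head, trading the number of distinct key-value heads against the memory and bandwidth the key-value cache consumes at inference time.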
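To make the head-sharing trade-off concrete, here is a rough NumPy sketch of grouped-query attention, in which each key-value head serves a group of query heads. The function names, head counts, and the per-layer cache count are illustrative assumptions. Masking and the learned projections are left out for brevity.

```python
import numpy as np

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq_len, d_head); k, v: (n_kv_heads, seq_len, d_head).

    n_kv_heads == n_q_heads gives ordinary multi-head attention;
    n_kv_heads == 1 gives multi-query attention.
    """
    n_q_heads, seq_len, d_head = q.shape
    n_kv_heads = k.shape[0]
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group = n_q_heads // n_kv_heads
    # Each key-value head is reused by `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def kv_cache_elements(n_kv_heads, seq_len, d_head):
    # Keys and values are both cached, hence the factor of 2 (per layer).
    return 2 * n_kv_heads * seq_len * d_head

# Hypothetical sizes: 8 query heads sharing 2 key-value heads.
rng = np.random.default_rng(0)
n_q_heads, n_kv_heads, seq_len, d_head = 8, 2, 4, 16
q = rng.normal(size=(n_q_heads, seq_len, d_head))
k = rng.normal(size=(n_kv_heads, seq_len, d_head))
v = rng.normal(size=(n_kv_heads, seq_len, d_head))
out = grouped_query_attention(q, k, v)

full = kv_cache_elements(n_q_heads, seq_len, d_head)     # one KV head per query head
shared = kv_cache_elements(n_kv_heads, seq_len, d_head)  # grouped KV heads
```

The output still has one slice per query head, but only `n_kv_heads` sets of keys and values need to be stored, so with these sizes the cache holds a quarter as many elements as the unshared layout.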