Multi-Head Attention
The baseline attention design that gives every query head its own key-value head pair.
At a glance
Optimizes
- Expressiveness
Practical benefits
- per-head query specialization
Example models
None listed yet.
What It Is
Multi-head attention (MHA) is the attention design introduced with the original Transformer: several scaled dot-product attention heads run in parallel. Each head computes its own attention distribution over keys and values, then the head outputs are merged before the next sublayer.
What It Optimizes
MHA targets representational breadth: distinct query heads can attend to different relationships in the same sequence without sharing key-value parameters.
Practical Benefit
Models keep full per-head expressiveness at training and inference time. MHA remains the reference design when quality matters more than KV-cache size, and it is the baseline against which multi-query attention and grouped-query attention are compared.
How It Works
Linear projections produce H query heads, H key heads, and H value heads. Each query head i attends to its matching key head i and value head i, producing H outputs that are concatenated and projected again.
[Diagram: H query heads mapped 1:1 to H KV head pairs. Each query head has its own K and V.]
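
The project, attend, concatenate, and re-project steps described above can be sketched in a few lines. This is a minimal illustration in standard-library Python, not a reference implementation: it assumes a single unbatched sequence, no masking, no biases, and a model width divisible by H. The helper names (`matmul`, `softmax`, `split_heads`, `multi_head_attention`) and the random weights are hypothetical and chosen only for this example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(x, num_heads):
    """Slice a (T, d_model) matrix column-wise into num_heads (T, d_head) matrices."""
    d_head = len(x[0]) // num_heads
    return [[row[h * d_head:(h + 1) * d_head] for row in x] for h in range(num_heads)]


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # Assumption: no mask, no bias, no batching; d_model % num_heads == 0.
    # Project once, then slice into H query heads, H key heads, H value heads.
    q_heads = split_heads(matmul(x, w_q), num_heads)
    k_heads = split_heads(matmul(x, w_k), num_heads)
    v_heads = split_heads(matmul(x, w_v), num_heads)
    d_head = len(q_heads[0][0])

    head_outputs = []
    for q, k, v in zip(q_heads, k_heads, v_heads):
        # Query head i is paired only with key head i and value head i.
        k_t = [list(col) for col in zip(*k)]
        scores = matmul(q, k_t)
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, v))

    # Concatenate the H head outputs per token, then apply the output projection.
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    return matmul(concat, w_o)


def rand_matrix(rows, cols):
    """Hypothetical stand-in for learned weights or input activations."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
T, d_model, H = 3, 4, 2  # illustrative sizes only
x = rand_matrix(T, d_model)
w_q, w_k, w_v, w_o = (rand_matrix(d_model, d_model) for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, H)
print(len(out), len(out[0]))  # one d_model-wide output row per input token
```

The loop makes the 1:1 pairing explicit: nothing is shared between heads until the final concatenation. Production implementations typically fuse the per-head loop into batched tensor operations.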
Math Or Compute Schema
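The formula below is the per-head attention computation used inside multi-head attention. Each head index i pairs its own query, key, and value tensors before outputs are merged.
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i, \qquad i = 1, \dots, H$$
Here Q_i, K_i, and V_i are the query, key, and value tensors of head i, and d_k is the per-head key dimension.
Compared To Nearby Modules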
Compared with multi-query attention, MHA stores one KV pair per query head, so cache size scales with H. Compared with grouped-query attention, MHA avoids shared KV groups and keeps the largest KV footprint in the head-sharing family.

| Comparison dimension | Multi-Head Attention | Multi-Query Attention | Grouped-Query Attention |
|---|---|---|---|
| KV head count | H key heads and H value heads | 1 key head and 1 value head | G key heads and G value heads |
| Query-head flexibility | H independent query/KV head pairs | H query heads share one KV pair | H distinct query heads grouped into G shared KV pairs |
| Cache footprint per token | 2H tensors (full multi-head attention cache) | 2 tensors (single shared KV cache) | 2G tensors (G keys + G values) |
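
To make the cache-footprint row concrete, the sketch below estimates KV-cache size for the three designs. The function `kv_cache_bytes` and every number in the configuration (32 query heads, 8 KV groups, head dimension 128, 32 layers, 4096 tokens, 2 bytes per element) are hypothetical values chosen for illustration and do not describe any particular model. The estimate covers a single sequence and ignores batching and any cache compression.

```python
def kv_cache_bytes(num_kv_heads, head_dim, seq_len, num_layers, bytes_per_element=2):
    """Estimate bytes needed to cache keys and values for one sequence."""
    # Factor of 2: each KV head stores one key tensor and one value tensor.
    return 2 * num_kv_heads * head_dim * seq_len * num_layers * bytes_per_element


H, G = 32, 8  # hypothetical: 32 query heads, 8 KV groups for the GQA variant
config = dict(head_dim=128, seq_len=4096, num_layers=32)  # assumed sizes

sizes = {
    "MHA": kv_cache_bytes(H, **config),  # H KV heads, one per query head
    "GQA": kv_cache_bytes(G, **config),  # G shared KV heads
    "MQA": kv_cache_bytes(1, **config),  # a single shared KV head
}
for name, size in sizes.items():
    print(f"{name}: {size / 2**20:.0f} MiB")
# MHA: 2048 MiB
# GQA: 512 MiB
# MQA: 64 MiB
```

Under these assumptions the MHA cache is H/G = 4 times the GQA cache and H = 32 times the MQA cache, which is the scaling the table expresses as 2H, 2G, and 2 tensors per token.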
Example Architectures
MHA appears in the original Transformer encoder-decoder, early GPT-style decoder-only language models, and many vision and speech transformers that adopt the standard attention block.