Sparse Attention
An attention variant that computes scores only for a subset of token pairs instead of every query-key combination.
At a glance
Optimizes
- Attention Compute
- Memory Bandwidth
- Long Context Inference
Practical benefits
- lower attention compute from skipped token pairs
- reduced memory traffic for attention scores
Example models
None listed yet.
What It Is
Sparse attention is an attention variant that restricts which query-key pairs participate in the softmax. Instead of a full n-by-n attention matrix, only entries allowed by a sparsity pattern receive non-zero weight.
What It Optimizes
Sparse attention targets attention compute, memory bandwidth for score materialization, and the cost of serving long sequences when dense all-pairs attention would dominate runtime.
Practical Benefit
Teams adopt sparse attention when they need long-context modeling but cannot afford dense quadratic attention. Skipping disallowed pairs reduces multiply-accumulate work and the memory traffic needed to form and store full attention maps.
How It Works
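Each query position attends only to keys permitted by a sparsity mask or fixed pattern. Disallowed pairs are masked out before softmax normalization (their scores are treated as negative infinity, so they receive exactly zero weight), and efficient implementations skip computing them altogether. The model therefore spends compute on selected connections rather than the full token grid.
Diagram: H query heads, then sparsity mask (allowed pairs only), then attention computed on the selected query-key pairs.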
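The masking step described above can be sketched in plain Python. This is a minimal single-head illustration, not a reference implementation: the function names `sliding_window_mask` and `sparse_attention`, the sliding-window pattern, and the toy inputs are all assumptions introduced here. Skipping a disallowed pair gives the same result as assigning it a score of negative infinity before softmax.

```python
import math


def sliding_window_mask(n, window):
    """Hypothetical pattern: query i may attend to key j when |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]


def sparse_attention(q, k, v, mask):
    """Single-head attention that scores only the (i, j) pairs allowed by mask.

    q, k, v are lists of float vectors; mask[i][j] is True when query i may
    attend to key j. Every query must have at least one allowed key.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        allowed = [j for j in range(len(k)) if mask[i][j]]
        # Scaled dot-product scores for allowed pairs only. Skipped pairs
        # behave as if their score were negative infinity.
        scores = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d) for j in allowed
        ]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the allowed value vectors.
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, allowed))
            for c in range(len(v[0]))
        ])
    return out


# Toy usage: 4 tokens, each attending to itself and its direct neighbours.
q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]
out = sparse_attention(q, k, v, sliding_window_mask(4, window=1))
```

The loop touches only allowed pairs, so the work per query is proportional to the number of permitted keys rather than to the sequence length. Production kernels get the same effect with block-sparse layouts instead of Python loops.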
Math Or Compute Schema
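The contrast this section describes can be written out as follows. This is a sketch in standard scaled dot-product notation; the symbols Q, K, V (per-head query, key, and value matrices with n rows), d_k (key dimension), and the additive bias B derived from the mask M are notation assumed here rather than taken from the original page.

```latex
% Dense attention: every query normalizes over all n keys.
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

% Sparse attention: a mask M \in \{0,1\}^{n \times n} removes disallowed pairs.
\mathrm{SparseAttention}(Q, K, V; M)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}} + B\right) V,
\qquad
B_{ij} =
\begin{cases}
  0       & \text{if } M_{ij} = 1, \\
  -\infty & \text{if } M_{ij} = 0.
\end{cases}
```

Because exp(-∞) = 0, every disallowed pair contributes zero weight after the row-wise softmax, and the normalization runs only over the keys that M permits for each query.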
With sequence length n and sparsity mask M, sparse attention applies M to the score matrix before softmax. Dense multi-head attention normalizes each query over all n keys, whereas masked sparse attention normalizes only over the keys that M allows, so only those pairs receive weight.
Compared To Nearby Modules
Compared with multi-head attention, sparse attention skips most token pairs rather than scoring all of them. Multi-query and grouped-query attention remain dense over all positions but reduce KV head count; sparse attention changes which positions connect at all.
| Comparison dimension | Sparse Attention | Multi-Head Attention | Multi-Query Attention | Grouped-Query Attention |
|---|---|---|---|---|
| Attention connectivity | Mask-limited subset of query-key pairs | All n-by-n query-key pairs per head | All n-by-n pairs with a single KV head shared across query heads | All n-by-n pairs with grouped KV heads |
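| Compute scaling with sequence length | Subquadratic when the number of keys per query grows slower than n (for example O(n·w) for a fixed window w); a fixed fraction of pairs stays O(n²) with a smaller constant | Quadratic O(n²) per head | Quadratic O(n²) with fewer KV projections | Quadratic O(n²) with reduced KV head count |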
| Global token reach | Depends on sparsity pattern; often local or block-structured | Any position can attend to any other position | Any position can attend to any other position | Any position can attend to any other position |
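The compute-scaling row can be checked with a rough count of scored query-key pairs. The sketch below is illustrative only: the helper names, the window size of 64, and the one-in-ten keep rate are assumptions chosen for the example. It compares dense attention, a fixed local window, and a pattern that keeps a fixed fraction of all pairs.

```python
def dense_pairs(n):
    """Dense attention scores every query-key pair."""
    return n * n


def window_pairs(n, window):
    """Pairs scored when query i sees only keys within `window` positions of i."""
    return sum(
        min(n - 1, i + window) - max(0, i - window) + 1 for i in range(n)
    )


def fraction_pairs(n, keep_one_in):
    """Pairs scored when a fixed fraction (1 / keep_one_in) of all pairs is kept."""
    return n * n // keep_one_in


# Doubling n roughly doubles the windowed count but quadruples the other two.
for n in (1000, 2000, 4000):
    print(n, dense_pairs(n), window_pairs(n, 64), fraction_pairs(n, 10))
```

With a fixed window the count grows roughly linearly in n, while keeping a constant fraction of pairs still grows quadratically and only lowers the constant factor.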
Example Architectures
Sparse attention appears in long-context decoder models that combine local windows, block patterns, or learned sparsity with standard projection layers.