Sparse Attention

An attention variant that computes scores only for a subset of token pairs instead of every query-key combination.

Dense attention scores every query against every key, so compute and memory grow with the square of sequence length. Sparse attention applies a mask or pattern that skips most token pairs and computes attention only where the pattern allows, trading full connectivity for lower cost on long contexts.

At a glance

Optimizes

  • Attention Compute
  • Memory Bandwidth
  • Long Context Inference

Practical benefits

  • lower attention compute from skipped token pairs
  • reduced memory traffic for attention scores

Example models

None listed yet.

What It Is

Sparse attention is an attention variant that restricts which query-key pairs participate in the softmax. Instead of a full n-by-n attention matrix, only entries allowed by a sparsity pattern receive non-zero weight.

What It Optimizes

Sparse attention targets attention compute, memory bandwidth for score materialization, and the cost of serving long sequences when dense all-pairs attention would dominate runtime.

Practical Benefit

Teams adopt sparse attention when they need long-context modeling but cannot afford dense quadratic attention. Skipping disallowed pairs reduces multiply-accumulate work and the memory traffic needed to form and store full attention maps.

How It Works

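Each query position attends only to keys permitted by a sparsity mask or fixed pattern. Disallowed pairs are masked out before softmax normalization, typically by setting their scores to −∞ rather than 0, because a score of 0 would still receive weight exp(0) = 1. Masked pairs therefore end up with zero weight, and the model spends compute on selected connections rather than the full token grid.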
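The masking step can be sketched in NumPy for a single head. This is a minimal illustration, not a reference implementation: the function name `sparse_attention`, the shapes, and the "self plus previous key" pattern are assumptions made for the example. It also still builds the full n-by-n score matrix, so it shows the semantics of masking rather than the compute savings, which require a kernel that skips disallowed pairs entirely.

```python
import numpy as np

def sparse_attention(q, k, v, mask):
    """Single-head attention where mask[i, j] == True lets query i see key j.

    Shapes: q (n, d_k), k (n, d_k), v (n, d_v), mask (n, n) boolean.
    Assumes every query row allows at least one key.
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    # Disallowed pairs get -inf so that exp() maps them to exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability before exponentiating.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

# Hypothetical pattern: each query sees itself and the immediately preceding key.
idx = np.arange(n)
mask = (idx[None, :] <= idx[:, None]) & (idx[None, :] >= idx[:, None] - 1)

out, weights = sparse_attention(q, k, v, mask)
```

Each row of `weights` sums to 1 over the allowed keys only, and every disallowed entry is exactly zero.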

Math Or Compute Schema

With sequence length n and sparsity mask M, sparse attention applies M to the score matrix before softmax. The formulas below contrast dense multi-head attention against masked sparse attention that limits which pairs receive weight.

Multi-head attention (MHA)

$$\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$

  • Q_i: Query vectors for head i.
  • K_i: Key vectors for head i.
  • V_i: Value vectors for head i.
  • H: Number of query heads.
  • d_k: Key dimension per head.
  • i: Query head index.

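Sparse attention

$$\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}} + \log M\right) V_i$$

The mask enters additively as the elementwise log M, with log 0 = −∞. This is equivalent to multiplying the exponentiated scores by M before normalization. Multiplying the raw scores by M would not remove a pair, because a score of 0 still receives weight exp(0) = 1.

  • Q_i: Query vectors for head i.
  • K_i: Key vectors for head i.
  • V_i: Value vectors for head i.
  • H: Number of query heads.
  • s: Fraction of allowed query-key pairs in the sparsity pattern.
  • d_k: Key dimension per head.
  • i: Query head index.
  • M: Binary or weighted n-by-n mask; pairs with M = 0 are disallowed and receive zero attention weight.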
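
As a rough worked estimate of what the fraction s buys, the block below counts multiply-accumulates for the two n-by-n products (scores, then weights times values). It assumes the value dimension equals d_k, ignores softmax and projection costs, and assumes the kernel truly skips disallowed pairs. The numbers n = 4096, d_k = 64, and s = 0.05 are illustrative, not taken from any particular model.

```latex
\begin{aligned}
C_{\text{dense}}  &\approx 2\,H\,n^{2}\,d_k \\
C_{\text{sparse}} &\approx s \cdot C_{\text{dense}} = 2\,s\,H\,n^{2}\,d_k \\[4pt]
\text{Per head, } n = 4096,\ d_k = 64:\quad
2\,n^{2}\,d_k &= 2 \cdot 16{,}777{,}216 \cdot 64 = 2{,}147{,}483{,}648 \\
\text{With } s = 0.05:\quad
2\,s\,n^{2}\,d_k &\approx 1.07 \times 10^{8}
\end{aligned}
```

The score matrix shrinks the same way: n² entries per head in the dense case versus roughly s·n² when only allowed pairs are stored.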

Compared To Nearby Modules

Compared with multi-head attention, sparse attention skips most token pairs rather than scoring all of them. Multi-query and grouped-query attention still score every query-key pair but share key and value heads across query heads; sparse attention changes which positions connect at all.

| Comparison dimension | Sparse Attention | Multi-Head Attention | Multi-Query Attention | Grouped-Query Attention |
| --- | --- | --- | --- | --- |
| Attention connectivity | Mask-limited subset of query-key pairs | All n-by-n query-key pairs per head | All n-by-n pairs with shared KV heads | All n-by-n pairs with grouped KV heads |
| Compute scaling with sequence length | O(s·n²) per head; subquadratic only when the allowed keys per query grow slower than n (for example, a fixed window) | Quadratic O(n²) per head | Quadratic O(n²) with fewer KV projections | Quadratic O(n²) with reduced KV head count |
| Global token reach | Depends on sparsity pattern; often local or block-structured | Any position can attend to any other position | Any position can attend to any other position | Any position can attend to any other position |

How sparse attention compares with nearby attention variants

Example Architectures

Sparse attention appears in long-context decoder models that combine local windows, block patterns, or learned sparsity with standard projection layers.
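
The local-window and block patterns mentioned above can be written as boolean masks. This is a sketch under assumptions: the helper names, the causal restriction (chosen because the text refers to decoder models), and the sizes n = 8, w = 2, block = 4 are all illustrative and not drawn from a specific architecture.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Query i may see keys j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def local_window_mask(n: int, w: int) -> np.ndarray:
    """Query i may see the w most recent keys, itself included."""
    idx = np.arange(n)
    dist = idx[:, None] - idx[None, :]  # dist[i, j] = i - j
    return (dist >= 0) & (dist < w)

def block_mask(n: int, block: int) -> np.ndarray:
    """Query i may see keys in the same contiguous block of size `block`."""
    blk = np.arange(n) // block
    return blk[:, None] == blk[None, :]

n = 8
local = local_window_mask(n, w=2)
blocks = block_mask(n, block=4)
# Combine patterns: a pair is allowed if either pattern allows it, subject to causality.
combined = causal_mask(n) & (local | blocks)

print(combined.astype(int))
print("allowed fraction:", combined.sum() / (n * n))
```

Taking the union of patterns is one common way to mix local detail with coarser structure; the allowed fraction printed at the end corresponds to s in the formulas above.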

Limitations And Tradeoffs

Restricting connectivity can miss long-range dependencies that dense attention would capture. Pattern choice and mask quality strongly affect what information flows across the sequence.
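
One way to see the long-range cost is to count how many stacked layers are needed before one position can influence another. The sketch below is a simplified reachability model that treats each layer as one hop along allowed pairs and assumes the same mask on every layer. The function name `layers_to_reach` and the sizes are hypothetical.

```python
import numpy as np

def layers_to_reach(mask: np.ndarray, src: int, dst: int, max_layers: int = 64):
    """Smallest number of layers after which position dst can depend on input src.

    mask[i, j] == True means query i may attend to key j.
    Returns None if src never reaches dst within max_layers.
    """
    n = mask.shape[0]
    hop = mask.astype(int)
    reach = np.eye(n, dtype=bool)  # reach[i, j]: state at i depends on input j
    for layer in range(1, max_layers + 1):
        # After one more layer, i depends on whatever its attended keys depended on.
        reach = (hop @ reach.astype(int)) > 0
        if reach[dst, src]:
            return layer
    return None

n, w = 16, 4
idx = np.arange(n)
dist = idx[:, None] - idx[None, :]
window = (dist >= 0) & (dist < w)            # causal sliding window of width 4
dense = np.tril(np.ones((n, n), dtype=bool))  # dense causal attention

print(layers_to_reach(dense, src=0, dst=15))   # 1
print(layers_to_reach(window, src=0, dst=15))  # 5
```

With a window of width 4, each layer extends reach by 3 positions, so information from position 0 needs 5 layers to arrive at position 15, where dense causal attention needs 1. Patterns that add a few global or strided connections are one way to shorten these paths.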

Why It Still Matters

As sequence lengths grow, quadratic attention cost motivates sparsity patterns that preserve useful signal while avoiding a full n-by-n score matrix on every layer.

Tags

References