Sparse Attention

An attention variant that computes scores only for a subset of token pairs instead of every query-key combination.

Dense attention scores every query against every key, so compute and memory grow with the square of sequence length. Sparse attention applies a mask or pattern that skips most token pairs and computes attention only where the pattern allows, trading full connectivity for lower cost on long contexts.

At a glance

Optimizes

Attention Compute
Memory Bandwidth
Long Context Inference

Practical benefits

lower attention compute from skipped token pairs
reduced memory traffic for attention scores

Example models

None listed yet.

What It Is

Sparse attention is an attention variant that restricts which query-key pairs participate in the softmax. Instead of a full n-by-n attention matrix, only entries allowed by a sparsity pattern receive non-zero weight.

What It Optimizes

Sparse attention targets attention compute, memory bandwidth for score materialization, and the cost of serving long sequences when dense all-pairs attention would dominate runtime.

Practical Benefit

Teams adopt sparse attention when they need long-context modeling but cannot afford dense quadratic attention. Skipping disallowed pairs reduces multiply-accumulate work and the memory traffic needed to form and store full attention maps.
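As a rough back-of-the-envelope illustration of the memory side, the sketch below estimates the size of one head's score matrix. The sequence length, window size, and 2-byte score width are illustrative assumptions, and the estimate only applies when scores are actually materialized; fused kernels may never store the full map.

```python
def score_bytes(n, keys_per_query, bytes_per_score=2):
    """Bytes needed to hold one head's attention scores for n queries."""
    return n * keys_per_query * bytes_per_score

n = 32_768      # sequence length (illustrative)
window = 512    # keys kept per query under a hypothetical local pattern

dense = score_bytes(n, n)        # every query scores all n keys
sparse = score_bytes(n, window)  # every query scores only `window` keys

print(f"dense : {dense / 2**20:.0f} MiB")
print(f"sparse: {sparse / 2**20:.0f} MiB")
print(f"ratio : {dense // sparse}x")
```

Under these assumptions the dense map is n / window times larger than the sparse one, so the saving grows with sequence length when the window stays fixed.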

How It Works

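Each query position attends only to keys permitted by a sparsity mask or fixed pattern. Disallowed pairs are masked out before softmax normalization, typically by setting their scores to negative infinity so that they receive exactly zero weight; a score of 0 would not be enough, because exp(0) = 1 still contributes to the softmax. Efficient implementations skip the disallowed pairs entirely, so the model spends compute on selected connections rather than the full token grid.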
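The masking step for a single query can be sketched as follows. This is a minimal illustration in plain Python, not a production kernel; `masked_softmax` and the sample scores are hypothetical, and the sketch assumes at least one key is allowed per query. It also shows why disallowed scores are set to negative infinity rather than 0: a score of 0 still receives weight exp(0) = 1 before normalization.

```python
import math

def masked_softmax(scores, allowed):
    """Softmax over `scores`, giving zero weight where `allowed` is False."""
    # Disallowed scores become -inf so exp() maps them to exactly 0.
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0]         # one query against four keys
allowed = [True, True, False, False]  # the pattern permits keys 0 and 1 only

# Correct masking: keys 2 and 3 end up with weight 0.
weights = masked_softmax(scores, allowed)

# Setting disallowed scores to 0 instead still leaks weight to keys 2 and 3.
zeroed_scores = [s if ok else 0.0 for s, ok in zip(scores, allowed)]
leaky = masked_softmax(zeroed_scores, [True] * 4)

print(weights)
print(leaky)
```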


Math Or Compute Schema

With sequence length n and sparsity mask M, sparse attention applies M to the score matrix before softmax. The formulas below contrast dense multi-head attention against masked sparse attention that limits which pairs receive weight.

Multi-head attention (MHA)

\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i

Q: Query vectors for head i.
K: Key vectors for head i.
V: Value vectors for head i.
H: Number of query heads.
d_k: Key dimension per head.
i: Query head index.

Sparse attention

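\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}} + \log M\right) V_i

Q: Query vectors for head i.
K: Key vectors for head i.
V: Value vectors for head i.
H: Number of query heads.
s: Fraction of allowed query-key pairs in the sparsity pattern.
d_k: Key dimension per head.
i: Query head index.
M: Binary or weighted mask over query-key pairs, applied elementwise as \log M. An entry of 1 leaves the score unchanged (\log 1 = 0); an entry of 0 sends it to -\infty, so that pair receives zero weight after softmax. Multiplying the scores by M would not remove a pair, because a score of 0 still contributes \exp(0) = 1 to the normalization.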
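A minimal single-head sketch of the two formulas in plain Python, assuming a binary mask where `M[j][k]` is 1 when query j may attend to key k. The function name `attention` and the toy inputs are hypothetical. Rather than filling disallowed scores with negative infinity, the sketch never computes them, which yields the same weights and reflects where the savings come from. It assumes every query has at least one allowed key.

```python
import math

def attention(Q, K, V, M=None):
    """Single-head scaled dot-product attention with an optional binary mask."""
    d_k = len(K[0])
    out = []
    for j, q in enumerate(Q):
        # Keys this query may attend to; all keys when no mask is given.
        keys = [k for k in range(len(K)) if M is None or M[j][k]]
        # Scaled scores for allowed keys only; skipped pairs are never computed.
        scores = [sum(a * b for a, b in zip(q, K[k])) / math.sqrt(d_k) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        # Weighted sum of the value rows for the allowed keys.
        out.append([sum(e / total * V[k][c] for e, k in zip(exps, keys))
                    for c in range(len(V[0]))])
    return out

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
M = [[1, 0, 0], [1, 1, 0], [1, 1, 1]]  # causal pattern used as a small example

dense = attention(Q, K, V)
sparse = attention(Q, K, V, M)
s = sum(map(sum, M)) / (len(Q) * len(K))  # fraction of allowed pairs

print(sparse)
print(f"s = {s:.3f}")
```

With an all-ones mask the function reduces to dense attention, and the first query, which sees only key 0, returns that key's value row unchanged.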

Compared To Nearby Modules

Compared with multi-head attention, sparse attention skips most token pairs rather than scoring all of them. Multi-query and grouped-query attention remain dense over all positions but reduce KV head count; sparse attention changes which positions connect at all.

Comparison dimension	Sparse Attention	Multi-Head Attention	Multi-Query Attention	Grouped-Query Attention
Attention connectivity	Mask-limited subset of query-key pairs	All n-by-n query-key pairs per head	All n-by-n pairs with shared KV heads	All n-by-n pairs with grouped KV heads
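Compute scaling with sequence length	Subquadratic when allowed keys per query grow slower than n, e.g. O(n·w) for a fixed window w; a fixed fraction s of pairs still costs s·n², which is quadratic	Quadratic O(n²) per head	Quadratic O(n²) with fewer KV projections	Quadratic O(n²) with reduced KV head count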
Global token reach	Depends on sparsity pattern; often local or block-structured	Any position can attend to any other position	Any position can attend to any other position	Any position can attend to any other position

How sparse attention compares with nearby attention variants
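How cost scales depends on how the pattern grows with n. The sketch below counts scored pairs under two assumed patterns: a causal sliding window with a fixed number of keys per query, and a pattern that keeps a constant fraction of the n-by-n grid. The function names, window size, and fraction are illustrative choices.

```python
def pairs_fixed_window(n, w):
    """Scored pairs when each query sees at most w keys (causal sliding window)."""
    return sum(min(j + 1, w) for j in range(n))

def pairs_fixed_fraction(n, s):
    """Scored pairs when a constant fraction s of the n*n grid is kept."""
    return int(s * n * n)

print("n, dense, fixed window, fixed fraction")
for n in (1024, 2048, 4096):
    print(n, n * n, pairs_fixed_window(n, 128), pairs_fixed_fraction(n, 0.125))
```

Doubling n roughly doubles the fixed-window count but quadruples the fixed-fraction count, so only the first pattern scales subquadratically.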

Example Architectures

Sparse attention appears in long-context decoder models that combine local windows, block patterns, or learned sparsity with standard projection layers.
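Local windows and block patterns can be expressed as boolean masks and combined. The sketch below is one hedged way to build them for a causal decoder; the helper names (`local_window`, `causal_block`, `union`, `show`) and the sizes are hypothetical, and learned sparsity is not covered.

```python
def local_window(n, w):
    """Causal sliding window: query j sees keys j-w+1 .. j."""
    return [[0 <= j - k < w for k in range(n)] for j in range(n)]

def causal_block(n, b):
    """Block pattern: query j sees earlier-or-equal keys in its own block of size b."""
    return [[j // b == k // b and k <= j for k in range(n)] for j in range(n)]

def union(*masks):
    """A pair is allowed if any component pattern allows it."""
    return [[any(cells) for cells in zip(*rows)] for rows in zip(*masks)]

def show(mask):
    """Render a mask as text: 'x' for allowed pairs, '.' for skipped ones."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

mask = union(local_window(8, 2), causal_block(8, 4))
print(show(mask))
```

Each row of the printed grid is a query and each column a key; the window lets the first query of a block still see the last key of the previous block.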


Limitations And Tradeoffs

Restricting connectivity can miss long-range dependencies that dense attention would capture. Pattern choice and mask quality strongly affect what information flows across the sequence.
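One way to make the long-range limitation concrete is to count how many stacked layers are needed before information from one position can reach another. The sketch below assumes the same mask is used in every layer and that each position keeps its own state between layers (as residual connections do); `layers_to_reach` is a hypothetical helper, not part of any library.

```python
def layers_to_reach(mask, src, dst):
    """Layers needed before position dst can receive information from src.

    mask[j][k] is True if query j may attend to key k.
    Returns None if dst can never be influenced by src.
    """
    n = len(mask)
    reached = {src}  # positions whose state already depends on src
    for layer in range(1, n + 1):
        # A position picks up src's information if it attends to any reached key.
        new = reached | {j for j in range(n) if any(mask[j][k] for k in reached)}
        if dst in new:
            return layer
        if new == reached:  # no further spread is possible
            return None
        reached = new
    return None

n, w = 16, 4
local = [[0 <= j - k < w for k in range(n)] for j in range(n)]  # causal window
dense = [[True] * n for _ in range(n)]

print(layers_to_reach(dense, 0, n - 1))
print(layers_to_reach(local, 0, n - 1))
print(layers_to_reach(local, n - 1, 0))
```

Under these assumptions a window of 4 moves information at most 3 positions per layer, so a dependency spanning the whole sequence needs several layers where dense attention needs one, and a causal pattern never carries information backward.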

Why It Still Matters

As sequence lengths grow, quadratic attention cost motivates sparsity patterns that preserve useful signal while avoiding a full n-by-n score matrix on every layer.
