Linear Attention

An attention variant that replaces explicit softmax dot-product attention with kernel or feature-map formulations that scale near-linearly with sequence length.

Standard attention materializes an n-by-n score matrix before softmax, so cost grows quadratically with sequence length. Linear attention instead applies a feature map to queries and keys so that attention can be computed through associative updates or a recurrence, trading exact softmax dot-product scoring for near-linear scaling on long sequences.

At a glance

Optimizes

Sequence Scaling
Attention Compute
Memory Bandwidth

Practical benefits

near-linear sequence scaling for long contexts
avoids quadratic attention cost via kernel or recurrence formulations

Example models

None listed yet.

What It Is

Linear attention is an attention variant that reformulates attention without building the full softmax score matrix. Feature maps on queries and keys let the model accumulate key-value statistics in linear time rather than scoring every query-key pair explicitly.

What It Optimizes

Linear attention targets sequence-length scaling, attention compute cost, and memory bandwidth when dense quadratic attention would dominate runtime on long contexts.

Practical Benefit

Teams adopt linear attention when they need to process long sequences but cannot afford quadratic attention cost. Kernel or recurrence views let each new token update running statistics instead of revisiting the full score matrix.
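
The running-statistics view can be sketched for a single head as follows. This is a minimal illustration, not a reference implementation: the class name `RunningLinearAttention` is hypothetical, and the feature map φ(x) = elu(x) + 1 is assumed as one common positive choice. The state is a d_k-by-d_v matrix plus a d_k vector, so its size does not depend on how many tokens have been seen.

```python
import numpy as np


def elu_plus_one(x):
    """Assumed feature map phi(x) = elu(x) + 1, which is strictly positive."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


class RunningLinearAttention:
    """Hypothetical helper: causal linear attention for one head, one token at a time."""

    def __init__(self, d_k, d_v):
        self.S = np.zeros((d_k, d_v))  # running sum of outer(phi(k_j), v_j)
        self.z = np.zeros(d_k)         # running sum of phi(k_j)

    def step(self, q, k, v):
        fk = elu_plus_one(k)
        self.S += np.outer(fk, v)      # fold the new key/value pair into the state
        self.z += fk
        fq = elu_plus_one(q)
        # Weighted average of all values seen so far, including this token.
        return (fq @ self.S) / (fq @ self.z)


rng = np.random.default_rng(0)
n, d_k, d_v = 6, 4, 3
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

attn = RunningLinearAttention(d_k, d_v)
streamed = np.stack([attn.step(Q[t], K[t], V[t]) for t in range(n)])
```

Each call to `step` costs the same regardless of position, which is what makes this form attractive for autoregressive decoding.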

How It Works

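Queries and keys pass through a feature map φ, and the similarity between a query and a key is taken as the inner product of their feature vectors instead of an exponentiated dot product. Because no softmax sits between the two matrix products, they can be regrouped by associativity: the model accumulates φ(K)ᵀV across positions and combines the result with φ(Q) at each step, avoiding explicit materialization of the full n-by-n attention matrix.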
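
The regrouping can be checked numerically. The sketch below assumes φ(x) = elu(x) + 1 and a single head without masking. The function names are illustrative, and `quadratic_reference` exists only to confirm that both orderings give the same output.

```python
import numpy as np


def feature_map(x):
    # Assumed choice: elu(x) + 1, which keeps every feature positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(Q, K, V):
    """Non-causal linear attention for one head; never forms an n-by-n matrix."""
    fq, fk = feature_map(Q), feature_map(K)   # (n, d_k) each
    kv = fk.T @ V                             # (d_k, d_v) key-value summary
    z = fk.sum(axis=0)                        # (d_k,) summed feature-mapped keys
    return (fq @ kv) / (fq @ z)[:, None]      # (n, d_v)


def quadratic_reference(Q, K, V):
    """Same quantity computed the expensive way, for checking only."""
    scores = feature_map(Q) @ feature_map(K).T          # (n, n)
    return (scores @ V) / scores.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 4))
K = rng.normal(size=(8, 4))
V = rng.normal(size=(8, 5))

out = linear_attention(Q, K, V)
same = np.allclose(out, quadratic_reference(Q, K, V))
```

The largest intermediate in `linear_attention` has shape (d_k, d_v), so memory for the attention step no longer grows with the square of the sequence length.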

Math Or Compute Schema

With feature map φ applied to queries and keys, linear attention computes output through associative products rather than softmax over all pairs. The formulas below contrast dense multi-head attention against linear attention that routes through φ and a normalization term.

Multi-head attention (MHA)

\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i

Q_i: Query vectors for head i.
K_i: Key vectors for head i.
V_i: Value vectors for head i.
H: Number of query heads.
d_k: Key dimension per head.
i: Query head index, ranging over the H heads.

Linear attention

\text{Attention}(Q_i, K_i, V_i) = \frac{\phi(Q_i)\big(\phi(K_i)^{\top} V_i\big)}{Z_i}, \qquad Z_i = \phi(Q_i) \sum_j \phi(K_{i,j})^{\top}

Q_i: Query vectors for head i.
K_i: Key vectors for head i.
K_{i,j}: Row j of K_i, the key vector at position j.
V_i: Value vectors for head i.
H: Number of query heads.
φ: Feature map applied to queries and keys to enable associative accumulation.
d_k: Key dimension per head.
i: Query head index, ranging over the H heads.
Z_i: Normalization denominator from summed feature-mapped keys, one scalar per query position; the division is applied row by row.
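
The factored form follows from writing attention for one query position with a generic similarity function and then choosing a similarity that is an inner product of features. In the short derivation below, the lowercase symbols q_t, k_j, v_j (column vectors for the rows of Q_i, K_i, V_i) and the output o_t are notation introduced here for illustration. Softmax attention corresponds to sim(q, k) = exp(qᵀk / √d_k), which does not factor this way.

```latex
\begin{aligned}
o_t^{\top}
  &= \frac{\sum_{j} \mathrm{sim}(q_t, k_j)\, v_j^{\top}}{\sum_{j} \mathrm{sim}(q_t, k_j)},
  \qquad \mathrm{sim}(q, k) = \phi(q)^{\top}\phi(k) \\
  &= \frac{\sum_{j} \phi(q_t)^{\top}\phi(k_j)\, v_j^{\top}}{\sum_{j} \phi(q_t)^{\top}\phi(k_j)} \\
  &= \frac{\phi(q_t)^{\top} \Big(\sum_{j} \phi(k_j)\, v_j^{\top}\Big)}{\phi(q_t)^{\top} \Big(\sum_{j} \phi(k_j)\Big)}
\end{aligned}
```

The two sums in the last line do not depend on t, so they are computed once and reused for every query. Stacking the rows over t recovers the matrix form, with the numerator sum equal to φ(K_i)ᵀV_i.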

Compared To Nearby Modules

Compared with multi-head attention, linear attention changes the compute mechanism itself, whereas multi-query and grouped-query attention only change how KV heads are shared. Those two variants remain dense over allowed positions but reduce KV head count; linear attention pursues subquadratic sequence scaling through kernel or recurrence formulations.

Comparison dimension	Linear Attention	Multi-Head Attention	Multi-Query Attention	Grouped-Query Attention
Sequence scaling	Near-linear O(n) per head with fixed feature-map dimension	Quadratic O(n²) per head	Quadratic O(n²) with fewer KV projections	Quadratic O(n²) with reduced KV head count
Compute mechanism	Feature-map φ on queries and keys; associative KV accumulation	Explicit softmax over all query-key dot products	Dense softmax with shared KV heads	Dense softmax with grouped KV heads
Expressiveness	Approximates softmax dot-product attention via chosen kernel	Full pairwise softmax attention	Full pairwise softmax attention with shared KV	Full pairwise softmax attention with grouped KV

How linear attention compares with nearby attention variants
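
The sequence-scaling row can be made concrete with a rough operation count. The sketch below is a back-of-the-envelope estimate under stated assumptions: it counts only the two large matrix products per head, assumes the feature-map dimension equals d_k, and ignores softmax, the feature map, normalization, and memory traffic. The function name is hypothetical.

```python
def attention_multiply_adds(n, d_k, d_v):
    """Rough multiply-add counts per head for the two dominant matrix products."""
    # Dense: Q K^T is (n, d_k) x (d_k, n), then scores @ V is (n, n) x (n, d_v).
    dense = n * n * d_k + n * n * d_v
    # Linear: phi(K)^T V is (d_k, n) x (n, d_v), then phi(Q) @ that is (n, d_k) x (d_k, d_v).
    linear = n * d_k * d_v + n * d_k * d_v
    return dense, linear


for n in (1_024, 16_384, 131_072):
    dense, linear = attention_multiply_adds(n, d_k=64, d_v=64)
    print(f"n={n}: dense/linear ratio = {dense / linear:.0f}")
```

With d_k = d_v = 64 the ratio works out to n / 64, so the gap widens in direct proportion to sequence length. For sequences shorter than the head dimension the linear form does more arithmetic than the dense one under this count.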

Example Architectures

Linear attention appears in long-context models and research architectures that combine subquadratic attention layers with standard dense blocks to balance efficiency and expressiveness.

Limitations And Tradeoffs

Feature-map linear attention is not identical to softmax dot-product attention. Approximation quality depends on the chosen kernel, and some formulations sacrifice the exact pairwise scoring that dense attention provides.
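
The gap can be seen by materializing the attention weights each method implies on the same inputs. This is an illustrative sketch only: it assumes φ(x) = elu(x) + 1, uses random data, and builds the n-by-n weights purely for inspection, which a real linear attention layer would not do. The function names are hypothetical.

```python
import numpy as np


def feature_map(x):
    # Assumed choice: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def softmax_weights(Q, K):
    """Row-normalized weights of standard scaled dot-product attention."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)


def linear_weights(Q, K):
    """Weights implied by linear attention, materialized for inspection only."""
    s = feature_map(Q) @ feature_map(K).T
    return s / s.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
Q = rng.normal(size=(8, 4))
K = rng.normal(size=(8, 4))
V = rng.normal(size=(8, 5))

w_soft = softmax_weights(Q, K)
w_lin = linear_weights(Q, K)
gap = np.abs(w_soft @ V - w_lin @ V).max()  # largest output difference
```

Both weight matrices have rows that sum to one, but they distribute mass differently, so `gap` is nonzero. How much that matters for model quality depends on the kernel and the task, and this snippet does not measure that.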

Why It Still Matters

As context lengths grow, subquadratic attention options remain a practical design axis for models that must serve long sequences without quadratic attention cost on every layer.

References

Katharopoulos, Angelos, et al. "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." arXiv, 2020, https://arxiv.org/abs/2006.16236.
