Modules

Linear Attention

An attention variant that replaces explicit softmax dot-product attention with kernel or feature-map formulations that scale near-linearly with sequence length.

Standard attention materializes an n-by-n score matrix before softmax, so cost grows quadratically with sequence length. Linear attention instead applies a feature map to queries and keys so that attention can be computed through associative updates or a recurrence, trading exact softmax dot-product scoring for near-linear scaling on long sequences.

At a glance

Optimizes

  • Sequence Scaling
  • Attention Compute
  • Memory Bandwidth

Practical benefits

  • near-linear sequence scaling for long contexts
  • avoids quadratic attention cost via kernel or recurrence formulations

Example models

None listed yet.

What It Is

Linear attention is an attention variant that reformulates attention without building the full softmax score matrix. Feature maps on queries and keys let the model accumulate key-value statistics in linear time rather than scoring every query-key pair explicitly.

What It Optimizes

Linear attention targets sequence-length scaling, attention compute cost, and memory bandwidth when dense quadratic attention would dominate runtime on long contexts.

Practical Benefit

Teams adopt linear attention when they need to process long sequences but cannot afford quadratic attention cost. Kernel or recurrence views let each new token update running statistics instead of revisiting the full score matrix.
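The running-statistics view can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes the feature map φ(x) = elu(x) + 1 (the page itself only names φ), a single head, and causal processing where each token sees itself and earlier tokens. The names `phi`, `causal_linear_attention`, `S`, and `z` are chosen here for illustration.

```python
import math

def phi(x):
    # Assumed feature map: elu(x) + 1, which keeps every feature positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def causal_linear_attention(queries, keys, values):
    """Process tokens one at a time, keeping only running statistics."""
    m = len(phi(keys[0]))                  # feature dimension
    d_v = len(values[0])                   # value dimension
    S = [[0.0] * d_v for _ in range(m)]    # running sum of outer(phi(k), v)
    z = [0.0] * m                          # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fk = phi(k)
        # Fold the new token into the state; no earlier token is revisited.
        for a in range(m):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(m))
        outputs.append(
            [sum(fq[a] * S[a][b] for a in range(m)) / denom for b in range(d_v)]
        )
    return outputs

queries = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.8]]
keys = [[0.1, 0.4], [-0.7, 1.2], [0.9, -0.5]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out = causal_linear_attention(queries, keys, values)
```

The state `S` and `z` has a fixed size that does not depend on how many tokens have been seen, which is what makes the per-token cost constant in this sketch.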

How It Works

Queries and keys pass through a feature map φ that takes the place of the softmax, so the attention product can be regrouped by associativity: (φ(Q)φ(K)ᵀ)V becomes φ(Q)(φ(K)ᵀV). The model accumulates φ(K)ᵀV across positions and combines the result with φ(Q) at each step, avoiding explicit materialization of the full n-by-n attention matrix.
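As a small numerical check of that regrouping, the sketch below computes non-causal kernel attention in both orders and compares the results. It assumes φ(x) = elu(x) + 1 and a single head; the helper names (`matmul`, `transpose`, `attention_quadratic_order`, `attention_linear_order`) are hypothetical and exist only for this example.

```python
import math

def phi(x):
    # Assumed feature map elu(x) + 1; the text only names it φ.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def matmul(A, B):
    # Plain-list matrix product: rows of A against columns of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def attention_quadratic_order(Q, K, V):
    # (φ(Q) φ(K)ᵀ) V: builds the full n-by-n matrix of kernel scores first.
    fQ, fK = [phi(q) for q in Q], [phi(k) for k in K]
    scores = matmul(fQ, transpose(fK))              # n x n
    numer = matmul(scores, V)                       # n x d_v
    return [[x / sum(s_row) for x in n_row] for n_row, s_row in zip(numer, scores)]

def attention_linear_order(Q, K, V):
    # φ(Q) (φ(K)ᵀ V): only m-by-d_v and length-m summaries of the keys are kept.
    fQ, fK = [phi(q) for q in Q], [phi(k) for k in K]
    kv = matmul(transpose(fK), V)                   # m x d_v
    k_sum = [sum(col) for col in zip(*fK)]          # length m
    numer = matmul(fQ, kv)                          # n x d_v
    denom = [sum(a * b for a, b in zip(fq, k_sum)) for fq in fQ]
    return [[x / d for x in row] for row, d in zip(numer, denom)]

Q = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.8]]
K = [[0.1, 0.4], [-0.7, 1.2], [0.9, -0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

dense = attention_quadratic_order(Q, K, V)
linear = attention_linear_order(Q, K, V)
max_gap = max(abs(a - b) for r1, r2 in zip(dense, linear) for a, b in zip(r1, r2))
```

Both orders should agree up to floating-point rounding; only the second avoids ever holding an n-by-n intermediate.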

Math Or Compute Schema

With feature map φ applied to queries and keys, linear attention computes output through associative products rather than softmax over all pairs. The formulas below contrast dense multi-head attention against linear attention that routes through φ and a normalization term.
Multi-head attention (MHA)

$$\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$

  • Q: Query vectors for head i.
  • K: Key vectors for head i.
  • V: Value vectors for head i.
  • H: Number of query heads.
  • d_k: Key dimension per head.
  • i: Query head index.

Linear attention

$$\text{Attention}(Q_i, K_i, V_i) = \frac{\phi(Q_i)\big(\phi(K_i)^{\top} V_i\big)}{\phi(Q_i)\big(\sum_j \phi(K_{i,j})^{\top}\big)}$$

  • Q: Query vectors for head i.
  • K: Key vectors for head i.
  • V: Value vectors for head i.
  • H: Number of query heads.
  • φ: Feature map applied to queries and keys to enable associative accumulation.
  • d_k: Key dimension per head.
  • i: Query head index.
  • Z: Normalization denominator from summed feature-mapped keys, i.e. the denominator of the fraction; the division is applied per query row.
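
Written for a single query position t within one head, the linear attention formula is a weighted average of value vectors whose weights are proportional to φ(q_t)φ(k_j)ᵀ. The short derivation below makes the regrouping explicit; the row symbols q_t, k_j, v_j and the output o_t are notation introduced here for illustration and stand for rows of Q_i, K_i, V_i and of the result.

```latex
\begin{aligned}
o_t &= \frac{\sum_{j} \big(\phi(q_t)\,\phi(k_j)^{\top}\big)\, v_j}
            {\sum_{j} \phi(q_t)\,\phi(k_j)^{\top}} \\[4pt]
    &= \frac{\phi(q_t)\,\sum_{j} \phi(k_j)^{\top} v_j}
            {\phi(q_t)\,\sum_{j} \phi(k_j)^{\top}}
\end{aligned}
```

Because φ(q_t) does not depend on j, it factors out of both sums. The two sums over j can then be computed once and shared by every query, which is where the saving over pairwise scoring comes from.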

Compared To Nearby Modules

Linear attention changes the compute mechanism itself, whereas multi-query and grouped-query attention only change how KV heads are shared. Multi-head, multi-query, and grouped-query attention all remain dense over allowed positions, with the latter two reducing KV head count; linear attention pursues subquadratic sequence scaling through kernel or recurrence formulations.
| Comparison dimension | Linear Attention | Multi-Head Attention | Multi-Query Attention | Grouped-Query Attention |
| --- | --- | --- | --- | --- |
| Sequence scaling | Near-linear O(n) per head with fixed feature-map dimension | Quadratic O(n²) per head | Quadratic O(n²) with fewer KV projections | Quadratic O(n²) with reduced KV head count |
| Compute mechanism | Feature-map φ on queries and keys; associative KV accumulation | Explicit softmax over all query-key dot products | Dense softmax with shared KV heads | Dense softmax with grouped KV heads |
| Expressiveness | Approximates softmax dot-product attention via chosen kernel | Full pairwise softmax attention | Full pairwise softmax attention with shared KV | Full pairwise softmax attention with grouped KV |

How linear attention compares with nearby attention variants
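
To make the sequence-scaling comparison concrete, here is a rough per-head operation count. It is a back-of-the-envelope sketch under stated assumptions: only the matrix products are counted as multiply-accumulates, while projections, the softmax, the feature map, and the key sum are ignored. The function names and the feature dimension `m` are hypothetical.

```python
def softmax_attention_macs(n, d_k, d_v):
    # Q Kᵀ costs n * n * d_k; multiplying the n-by-n weights by V costs n * n * d_v.
    return n * n * d_k + n * n * d_v

def linear_attention_macs(n, m, d_v):
    # φ(K)ᵀ V costs n * m * d_v; φ(Q) times that summary costs n * m * d_v;
    # the per-query normalizer adds one length-m dot product per row (n * m).
    return 2 * n * m * d_v + n * m

# Assumed illustrative sizes: head and feature dimensions of 64.
for n in (1024, 4096, 16384):
    dense = softmax_attention_macs(n, 64, 64)
    linear = linear_attention_macs(n, 64, 64)
    print(f"n={n}: dense={dense} linear={linear} ratio={dense / linear:.1f}")
```

Under these assumptions, quadrupling n multiplies the dense count by 16 and the linear count by 4, so the gap widens as contexts grow. Real kernels are also shaped by memory traffic and hardware utilization, which this count does not capture.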

Example Architectures

Linear attention appears in long-context models and research architectures that combine subquadratic attention layers with standard dense blocks to balance efficiency and expressiveness.

Limitations And Tradeoffs

Feature-map linear attention is not identical to softmax dot-product attention. Approximation quality depends on the chosen kernel, and some formulations sacrifice the exact pairwise scoring that dense attention provides.
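
The gap between the two weightings is easy to see on a toy input. The sketch below compares softmax attention weights with kernel weights for one query against three keys, assuming φ(x) = elu(x) + 1; other feature maps would give different numbers. The query, keys, and function names are made up for illustration.

```python
import math

def phi(x):
    # Assumed feature map elu(x) + 1; one possible kernel choice among many.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def softmax_weights(q, keys):
    # Scaled dot-product scores followed by a numerically stable softmax.
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kernel_weights(q, keys):
    # Feature-map similarities normalized to sum to one.
    fq = phi(q)
    sims = [sum(a * b for a, b in zip(fq, phi(k))) for k in keys]
    total = sum(sims)
    return [s / total for s in sims]

q = [2.0, 0.0]
keys = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]

w_softmax = softmax_weights(q, keys)
w_kernel = kernel_weights(q, keys)
largest_gap = max(abs(a - b) for a, b in zip(w_softmax, w_kernel))
```

For this input both weightings rank the keys in the same order, but the kernel weights come out noticeably flatter than the softmax weights, so the best-matching key receives a smaller share of the attention.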

Why It Still Matters

As context lengths grow, subquadratic attention options remain a practical design axis for models that must serve long sequences without quadratic attention cost on every layer.

References

  1. Katharopoulos, Angelos, et al. "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." arXiv, 2020, https://arxiv.org/abs/2006.16236.