Linear Attention
An attention variant that replaces explicit softmax dot-product attention with kernel or feature-map formulations that scale near-linearly with sequence length.
At a glance
Optimizes
- Sequence Scaling
- Attention Compute
- Memory Bandwidth
Practical benefits
- near-linear sequence scaling for long contexts
- avoids the quadratic cost of dense attention via kernel or recurrence formulations
Example models
None listed yet.
What It Is
Linear attention reformulates attention so that the full softmax score matrix is never built. Feature maps on queries and keys let the model accumulate key-value statistics in linear time rather than scoring every query-key pair explicitly.
What It Optimizes
Linear attention targets sequence-length scaling, attention compute cost, and memory bandwidth when dense quadratic attention would dominate runtime on long contexts.
Practical Benefit
Teams adopt linear attention when they need to process long sequences but cannot afford quadratic attention cost. Kernel or recurrence views let each new token update running statistics instead of revisiting the full score matrix.
How It Works
Queries and keys pass through a feature map φ, so each attention weight becomes an inner product φ(q)ᵀφ(k) instead of a softmax score. Because matrix multiplication is associative, (φ(Q)φ(K)ᵀ)V can be regrouped as φ(Q)(φ(K)ᵀV): the model accumulates φ(K)ᵀV across positions and combines the result with φ(Q) at each step, avoiding explicit materialization of the full n-by-n attention matrix.
[Diagram labels: H query heads; feature map φ on queries and keys; associative KV accumulation without full score matrix]
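The accumulation described above can be sketched for a single causal head in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the feature map φ(x) = elu(x) + 1 is one common choice that the source does not specify, and the function names `phi`, `linear_attention_causal`, and `quadratic_reference` are hypothetical. The second function computes the same kernel attention pairwise, only to check that the running-state version gives the same result.

```python
import math


def phi(x):
    """Assumed feature map elu(x) + 1, elementwise; keeps every feature positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def linear_attention_causal(Q, K, V):
    """Causal linear attention for one head, written as a recurrence.

    Q, K: lists of n vectors of length d. V: list of n vectors of length d_v.
    S (d x d_v) accumulates phi(k_j) v_j^T and z (d) accumulates phi(k_j),
    so the per-token work does not depend on the sequence length n.
    """
    d, d_v = len(K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(d)]
    z = [0.0] * d
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
        fq = phi(q)
        # Normalization term: phi(q)^T z, positive because phi is positive.
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(d_v)])
    return out


def quadratic_reference(Q, K, V):
    """Same kernel attention computed pair by pair in O(n^2), for checking only."""
    d_v = len(V[0])
    out = []
    for i, q in enumerate(Q):
        fq = phi(q)
        # Unnormalized weights phi(q_i)^T phi(k_j) for every j <= i.
        w = [sum(x * y for x, y in zip(fq, phi(K[j]))) for j in range(i + 1)]
        total = sum(w)
        out.append([sum(w[j] * V[j][b] for j in range(i + 1)) / total
                    for b in range(d_v)])
    return out
```

The two functions should agree up to floating-point rounding. The recurrent version keeps only the d-by-d_v state `S` and the length-d vector `z`, whatever the sequence length.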
Math Or Compute Schema
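As a hedged sketch, the dense and linear formulations can be written side by side. The notation is assumed rather than taken from the source: $q_i$, $k_j$, $v_j$ are the per-position query, key, and value vectors of one head, $d_k$ is the key dimension, and the sums run over all positions $j$ (or over $j \le i$ in the causal case).

$$
\text{Dense:}\quad o_i^\top = \sum_j \frac{\exp\!\left(q_i^\top k_j / \sqrt{d_k}\right)}{\sum_l \exp\!\left(q_i^\top k_l / \sqrt{d_k}\right)}\, v_j^\top
$$

$$
\text{Linear:}\quad o_i^\top = \frac{\phi(q_i)^\top \sum_j \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_j \phi(k_j)}
$$

In the linear form, the sums $\sum_j \phi(k_j)\, v_j^\top$ and $\sum_j \phi(k_j)$ do not depend on $i$ in the non-causal case, and grow by one term per step in the causal case. That is why they can be accumulated once rather than recomputed for every query-key pair.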
With feature map φ applied to queries and keys, linear attention computes its output through associative matrix products rather than a softmax over all query-key pairs. The contrast is between dense multi-head attention, which normalizes pairwise scores with softmax, and linear attention, which routes through φ and a normalization term.
Compared To Nearby Modules
Relative to multi-head attention, linear attention changes the compute mechanism itself, whereas multi-query and grouped-query attention only change how KV heads are shared. Those two variants remain dense over allowed positions but reduce KV head count; linear attention pursues subquadratic sequence scaling through kernel or recurrence formulations.

| Comparison dimension | Linear Attention | Multi-Head Attention | Multi-Query Attention | Grouped-Query Attention |
|---|---|---|---|---|
| Sequence scaling | Near-linear O(n) per head with fixed feature-map dimension | Quadratic O(n²) per head | Quadratic O(n²) with fewer KV projections | Quadratic O(n²) with reduced KV head count |
| Compute mechanism | Feature-map φ on queries and keys; associative KV accumulation | Explicit softmax over all query-key dot products | Dense softmax with shared KV heads | Dense softmax with grouped KV heads |
| Expressiveness | Replaces softmax dot-product attention with the chosen kernel, which may approximate it or differ from it | Full pairwise softmax attention | Full pairwise softmax attention with shared KV | Full pairwise softmax attention with grouped KV |
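The sequence-scaling row of the table can be made concrete with a rough operation count. This is a back-of-the-envelope sketch, not a benchmark. It assumes one head, counts only multiply-accumulates in the main matrix products, and ignores the cost of softmax and of applying φ. The function names and the feature dimension `r` are hypothetical.

```python
def dense_attention_macs(n, d, d_v):
    """Rough multiply-accumulate count for one dense softmax head:
    n*n*d for the Q K^T scores plus n*n*d_v for weights times V."""
    return n * n * d + n * n * d_v


def linear_attention_macs(n, r, d_v):
    """Rough count for one linear-attention head with feature dimension r:
    per token, update the r x d_v state and the length-r normalizer,
    then read both out with the mapped query."""
    return n * (2 * r * d_v + 2 * r)


if __name__ == "__main__":
    for n in (1_024, 8_192, 65_536):
        dense = dense_attention_macs(n, d=64, d_v=64)
        linear = linear_attention_macs(n, r=64, d_v=64)
        print(f"n={n:>6}  dense={dense:.3e}  linear={linear:.3e}  "
              f"ratio={dense / linear:.1f}")
```

Under these assumptions, doubling `n` quadruples the dense count and doubles the linear one, so the gap widens as contexts get longer. For short sequences or a large feature dimension the advantage can shrink or disappear.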
Example Architectures
Linear attention appears in long-context models and research architectures that combine subquadratic attention layers with standard dense blocks to balance efficiency and expressiveness.