Sliding-Window Attention
An attention variant that restricts each query to a fixed local window of key positions instead of the full sequence.
At a glance
Optimizes
- Attention Compute
- Memory Bandwidth
- Local Context Inference
Practical benefits
- lower attention cost from fixed local windows
- predictable memory use for long sequences with bounded reach
Example models
None listed yet.
What It Is
Sliding-window attention is an attention variant that limits each query position to keys within a fixed neighborhood. Instead of a full n-by-n attention matrix, only entries inside the window receive non-zero weight.
What It Optimizes
Sliding-window attention targets attention compute, memory bandwidth for score materialization, and the cost of serving long sequences when dense all-pairs attention would dominate runtime.
Practical Benefit
Teams adopt sliding-window attention when they need long-context modeling but cannot afford dense quadratic attention. Bounding the attended neighborhood reduces multiply-accumulate work and keeps attention maps within a predictable local footprint.
How It Works
Each query position attends only to keys within W tokens to its left (and optionally to its right, depending on the pattern). Scores for keys outside the window are masked before softmax normalization, typically by setting them to negative infinity, so those keys receive zero weight and the model spends compute on local context rather than the full token grid.
Figure: H query heads, each with a local window of W tokens, attend only to keys inside the window.
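The masking step can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a reference implementation: the function names are hypothetical, and it assumes the window of W tokens counts the current token plus the W - 1 tokens to its left. It also builds the full n-by-n score matrix for clarity, so it shows the masking logic but not the compute savings a banded kernel would deliver.

```python
import numpy as np


def sliding_window_mask(n: int, window: int, causal: bool = True) -> np.ndarray:
    """Boolean (n, n) mask; True where query i may attend to key j."""
    i = np.arange(n)[:, None]  # query positions as a column
    j = np.arange(n)[None, :]  # key positions as a row
    if causal:
        # Assumption: the window holds the current token and window - 1 to its left.
        return (j <= i) & (i - j < window)
    # Symmetric variant: up to window - 1 tokens on either side.
    return np.abs(i - j) < window


def sliding_window_attention(q, k, v, window: int, causal: bool = True):
    """Single-head attention with a band mask applied before softmax."""
    n, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)
    mask = sliding_window_mask(n, window, causal)
    # Masked scores become -inf, so softmax assigns them exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; every row keeps
    # its diagonal entry, so the maximum is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
    out, weights = sliding_window_attention(q, k, v, window=3)
    print(np.round(weights, 2))  # non-zero entries form a band of width 3
```

Masking with negative infinity rather than zero matters: a score of zero would still receive positive weight after softmax.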
Math Or Compute Schema
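A sketch of the dense and window-limited formulations for a single head, under assumed notation that this page does not itself fix: Q, K, and V each have n rows, d_k is the key dimension, M is the band mask, and the window is causal and counts the current token. The windowed cost estimate assumes an implementation that skips masked entries rather than computing and discarding them.

```latex
\begin{aligned}
\text{Dense:}\quad
  & A = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,
  & \text{cost} &\sim O(n^{2} d_k) \\
\text{Windowed:}\quad
  & A_W = \operatorname{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right) V,
  & \text{cost} &\sim O(n W d_k) \\
  & M_{ij} =
    \begin{cases}
      0 & \text{if } 0 \le i - j < W \\
      -\infty & \text{otherwise}
    \end{cases}
\end{aligned}
```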
With sequence length n and window size W, sliding-window attention applies a band mask to the attention scores before softmax. The formulas in this section contrast dense multi-head attention with window-limited attention, which restricts the keys each query can reach.
Compared To Nearby Modules
Compared with multi-head attention, sliding-window attention limits reach to a local neighborhood rather than scoring all positions. Multi-query and grouped-query attention remain dense over allowed positions but reduce KV head count; sliding-window attention changes how far each query can look.

| Comparison dimension | Sliding-Window Attention | Multi-Head Attention | Multi-Query Attention | Grouped-Query Attention |
|---|---|---|---|---|
| Attention locality | Fixed local window of W tokens per query | Full sequence; every query sees every key | Full sequence with shared KV heads | Full sequence with grouped KV heads |
| Compute scaling with sequence length | Linear O(n·W) per head when W is fixed | Quadratic O(n²) per head | Quadratic O(n²) with fewer KV projections | Quadratic O(n²) with reduced KV head count |
| Global token reach | Distant tokens require stacked layers or companion patterns | Any position can attend to any other position | Any position can attend to any other position | Any position can attend to any other position |
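
The scaling and reach rows above can be made concrete with a little arithmetic. The sketch below uses hypothetical helper names and assumes causal attention in both cases, with a window that counts the current token. It counts query-key score entries per head and estimates how far back information can travel once windowed layers are stacked.

```python
def dense_score_entries(n: int) -> int:
    """Query-key scores per head for dense causal attention over n tokens."""
    # Query i (0-based) sees keys 0..i, so the total is 1 + 2 + ... + n.
    return n * (n + 1) // 2


def windowed_score_entries(n: int, window: int) -> int:
    """Query-key scores per head for a causal window of `window` tokens."""
    # Query i (0-based) sees min(i + 1, window) keys.
    return sum(min(i + 1, window) for i in range(n))


def max_reach(num_layers: int, window: int) -> int:
    """Upper bound on how many tokens back information can propagate.

    Each layer lets a position look back at most window - 1 tokens,
    so stacking layers extends the reach additively.
    """
    return num_layers * (window - 1)


if __name__ == "__main__":
    n, window = 8192, 512  # illustrative sizes, not taken from any listed model
    dense = dense_score_entries(n)
    local = windowed_score_entries(n, window)
    print(f"dense:    {dense:,} scores per head")
    print(f"windowed: {local:,} scores per head ({dense / local:.1f}x fewer)")
    print(f"reach after 4 layers: {max_reach(4, window):,} tokens")
```

The reach figure is only an upper bound on propagation distance. It says nothing about how well a trained model actually carries information across that span, which is one reason local windows are often paired with global layers.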
Example Architectures
Sliding-window attention appears in long-context decoder models that combine local windows with periodic global layers or other patterns to recover distant dependencies.