
Sliding-Window Attention

An attention variant that restricts each query to a fixed local window of key positions instead of the full sequence.

Dense attention lets every query attend to every key, so cost grows with the square of sequence length. Sliding-window attention instead restricts each position to nearby keys inside a fixed local window, trading global reach for bounded compute and memory on long contexts.

At a glance

Optimizes

  • Attention Compute
  • Memory Bandwidth
  • Local Context Inference

Practical benefits

  • lower attention cost from fixed local windows
  • predictable memory use for long sequences with bounded reach

Example models

None listed yet.

What It Is

Sliding-window attention is an attention variant that limits each query position to keys within a fixed neighborhood. Instead of a full n-by-n attention matrix, only entries inside the window receive non-zero weight.
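To make the "entries inside the window" idea concrete, here is a minimal sketch that builds the band pattern as a boolean grid. The function name `band_mask` and the convention that a query may reach keys at distance 0 through W (so the current token always counts) are assumptions for illustration; real implementations differ on whether W includes the current token.

```python
def band_mask(n, window, causal=True):
    """Return an n-by-n grid of booleans: True where query i may attend to key j.

    Assumed convention: a key is inside the window when its distance from the
    query is at most `window`. With causal=True only keys to the left
    (and the query's own position) are allowed.
    """
    mask = []
    for i in range(n):
        row = []
        for j in range(n):
            distance = i - j
            if causal:
                allowed = 0 <= distance <= window
            else:
                allowed = abs(distance) <= window
            row.append(allowed)
        mask.append(row)
    return mask


# Render the causal pattern for 5 positions and a window of 2.
for row in band_mask(5, 2):
    print("".join("x" if allowed else "." for allowed in row))
# x....
# xx...
# xxx..
# .xxx.
# ..xxx
```

Only the marked entries are ever scored, so the number of marked cells per row stays bounded by W + 1 however long the sequence gets.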

What It Optimizes

Sliding-window attention targets attention compute, memory bandwidth for score materialization, and the cost of serving long sequences when dense all-pairs attention would dominate runtime.

Practical Benefit

Teams adopt sliding-window attention when they need long-context modeling but cannot afford dense quadratic attention. Bounding the attended neighborhood reduces multiply-accumulate work and keeps attention maps within a predictable local footprint.
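The reduction in multiply-accumulate work can be estimated by counting how many query-key scores each scheme computes. The sketch below is a rough counting model, not a benchmark; the function names and the causal, distance-at-most-W window convention are assumptions for illustration.

```python
def dense_score_count(n, causal=True):
    """Number of query-key scores in dense attention over n positions."""
    # Causal dense attention scores the lower triangle, otherwise the full grid.
    return n * (n + 1) // 2 if causal else n * n


def window_score_count(n, window):
    """Number of scores when query i sees keys at distance 0..window to its left."""
    # Early positions have fewer than `window` predecessors available.
    return sum(min(i, window) + 1 for i in range(n))


n, window = 8192, 512
dense = dense_score_count(n)
local = window_score_count(n, window)
print(dense, local, dense / local)
```

For n = 8192 and W = 512 this counts 33,558,528 dense causal scores against 4,071,168 windowed ones, roughly an 8x reduction, and the gap widens as n grows while W stays fixed.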

How It Works

Each query position attends only to keys within W tokens to its left (and optionally to its right, depending on the pattern). Scores for keys outside the window are masked out (set to −∞) before softmax normalization, so those keys receive exactly zero weight and the model spends compute on local context rather than the full token grid.
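The mechanism can be sketched for a single head in plain Python. This is an illustrative reference version under stated assumptions, not an optimized kernel: the function name `sliding_window_attention` is hypothetical, the window is causal (left only), and a query reaches keys at distance 0 through W. Instead of building a full score matrix and masking it, the loop only visits in-window keys, which gives the same result as masking out-of-window scores to −∞.

```python
import math


def sliding_window_attention(q, k, v, window):
    """Single-head causal sliding-window attention over lists of vectors.

    q, k, v are lists of equal length n; q[i] and k[i] have d_k entries.
    """
    n, d_k = len(q), len(q[0])
    d_v = len(v[0])
    out = []
    for i in range(n):
        lo = max(0, i - window)      # oldest key still inside the window
        keys = range(lo, i + 1)      # keys at distance 0..window to the left
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_k)
            for j in keys
        ]
        m = max(scores)              # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the in-window value vectors only.
        out.append([
            sum(w * v[j][c] for w, j in zip(weights, keys))
            for c in range(d_v)
        ])
    return out
```

With `window=0` each position attends only to itself and the output equals `v`; with `window >= n - 1` the function reduces to ordinary causal dense attention.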

Math Or Compute Schema

With sequence length n and window size W, sliding-window attention applies a band mask before softmax. The formulas below contrast dense multi-head attention against window-limited attention that restricts which keys each query can reach.
Multi-head attention (MHA)
$$
\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i
$$

  • Q: Query vectors for head i.
  • K: Key vectors for head i.
  • V: Value vectors for head i.
  • H: Number of query heads.
  • d_k: Key dimension per head.
  • i: Query head index.
Sliding-window attention
$$
\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}} + B_W\right) V_i
$$

  • Q: Query vectors for head i.
  • K: Key vectors for head i.
  • V: Value vectors for head i.
  • H: Number of query heads.
  • W: Window size: maximum token distance each query may attend to.
  • d_k: Key dimension per head.
  • i: Query head index.
  • B_W: Band mask added to the scores: 0 for key positions inside the sliding window and −∞ for positions outside it, so out-of-window keys receive zero weight after softmax. (Multiplying scores by zero would not work, because a zero score still contributes exp(0) = 1 to the softmax.)
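Written out per entry, the windowed softmax normalizes only over in-window keys. The following is a sketch for the causal (left-only) case with 1-indexed positions; the lowercase symbols q_t, k_j, v_j (rows of Q_i, K_i, V_i), the scaled score s_{t,j}, the weight a_{t,j}, and the output o_t are notation introduced here for illustration.

```latex
s_{t,j} = \frac{q_t \cdot k_j}{\sqrt{d_k}}, \qquad
a_{t,j} =
\begin{cases}
\dfrac{\exp(s_{t,j})}{\sum_{m=\max(1,\,t-W)}^{t} \exp(s_{t,m})} & \text{if } 0 \le t - j \le W, \\[1ex]
0 & \text{otherwise,}
\end{cases}
\qquad
o_t = \sum_{j=\max(1,\,t-W)}^{t} a_{t,j}\, v_j
```

Each query therefore sums over at most W + 1 keys, which is where the O(n·W) cost per head comes from. Excluding a key from the sum in the denominator has the same effect as setting its score to −∞ before the softmax.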

Compared To Nearby Modules

Compared with multi-head attention, sliding-window attention limits reach to a local neighborhood rather than scoring all positions. Multi-query and grouped-query attention remain dense over allowed positions but reduce KV head count; sliding-window attention changes how far each query can look.
| Comparison dimension | Sliding-Window Attention | Multi-Head Attention | Multi-Query Attention | Grouped-Query Attention |
| --- | --- | --- | --- | --- |
| Attention locality | Fixed local window of W tokens per query | Full sequence; every query sees every key | Full sequence with shared KV heads | Full sequence with grouped KV heads |
| Compute scaling with sequence length | Linear O(n·W) per head when W is fixed | Quadratic O(n²) per head | Quadratic O(n²) with fewer KV projections | Quadratic O(n²) with reduced KV head count |
| Global token reach | Distant tokens require stacked layers or companion patterns | Any position can attend to any other position | Any position can attend to any other position | Any position can attend to any other position |
How sliding-window attention compares with nearby attention variants

Example Architectures

Sliding-window attention appears in long-context decoder models that combine local windows with periodic global layers or other patterns to recover distant dependencies.


Limitations And Tradeoffs

A fixed window cannot directly connect distant tokens in a single layer. Models that rely only on local windows may need stacked layers or companion patterns to move information across the full sequence.
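The stacked-layer workaround can be quantified with a simple bound: if each layer lets information travel at most W positions, then L layers allow a theoretical reach of about L·W positions. The sketch below computes that bound; the function names are hypothetical, and the figures are upper limits on reach rather than a claim about how much distant signal a trained model actually retains.

```python
def reach_after_layers(window, layers):
    """Upper bound on how many positions information can travel after `layers` layers."""
    return window * layers


def layers_to_span(n, window):
    """Smallest layer count whose theoretical reach covers a sequence of n tokens."""
    distance = n - 1                   # first token to last token
    return -(-distance // window)      # integer ceiling division


print(reach_after_layers(512, 4))
print(layers_to_span(8192, 512))
```

Under this bound, a window of 512 needs 16 stacked layers before the last token of an 8192-token sequence can be influenced by the first, which is one reason local windows are often paired with global layers or other companion patterns.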

Why It Still Matters

As sequence lengths grow, quadratic attention cost motivates locality patterns that preserve useful nearby signal while avoiding a full n-by-n score matrix on every layer.
