Sliding-Window Attention

An attention variant that restricts each query to a fixed local window of key positions instead of the full sequence.

Dense attention lets every query attend to every key across the sequence, so cost grows with the square of length; sliding-window attention applies a fixed local window so each position attends only to nearby keys, trading global reach for bounded compute and memory on long contexts.

At a glance

Optimizes

Attention Compute
Memory Bandwidth
Local Context Inference

Practical benefits

lower attention cost from fixed local windows
predictable memory use for long sequences with bounded reach

Example models

None listed yet.

What It Is

Sliding-window attention is an attention variant that limits each query position to keys within a fixed neighborhood. Instead of a full n-by-n attention matrix, only entries inside the window receive non-zero weight.

What It Optimizes

Sliding-window attention targets attention compute, memory bandwidth for score materialization, and the cost of serving long sequences when dense all-pairs attention would dominate runtime.

Practical Benefit

Teams adopt sliding-window attention when they need long-context modeling but cannot afford dense quadratic attention. Bounding the attended neighborhood reduces multiply-accumulate work and keeps attention maps within a predictable local footprint.
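One place the bounded footprint can show up is the key/value cache during autoregressive decoding. The sketch below is illustrative only: the class name `RollingKVCache` is hypothetical, and it assumes a causal window in which each query sees itself plus the previous W tokens, so at most W + 1 entries need to be retained.

```python
from collections import deque


class RollingKVCache:
    """Hypothetical cache that keeps only the most recent window + 1 key/value pairs."""

    def __init__(self, window):
        # Assumption: the query attends to itself plus `window` earlier tokens.
        self.keys = deque(maxlen=window + 1)
        self.values = deque(maxlen=window + 1)

    def append(self, key, value):
        # Once the deque is full, the oldest entry is evicted automatically.
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


cache = RollingKVCache(window=2)
for step in range(10):
    cache.append(f"k{step}", f"v{step}")
```

After ten steps the cache still holds three entries, so memory stays constant no matter how long decoding runs.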

How It Works

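Each query position attends only to keys within W tokens to its left (and optionally to its right, depending on the pattern). Keys outside the window are masked before softmax normalization: their scores are set to negative infinity so they receive exactly zero weight. Setting a score to zero would not exclude a key, because exp(0) = 1 still contributes to the softmax. The model therefore spends compute on local context rather than the full token grid.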
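The window rule can be written as a boolean mask over query and key positions. This is a minimal sketch, not a reference implementation: the function name `band_mask` and the `causal` flag are illustrative, and it assumes W counts the maximum token distance, so a causal query sees itself plus W earlier tokens.

```python
def band_mask(seq_len, window, causal=True):
    """Return mask[i][j] == True when query i may attend to key j."""
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            # Key j is at or before the query, at most `window` tokens away.
            left_ok = 0 <= i - j <= window
            # Key j is after the query; only allowed in the bidirectional pattern.
            right_ok = (not causal) and 0 < j - i <= window
            row.append(left_ok or right_ok)
        mask.append(row)
    return mask


for row in band_mask(6, 2):
    print("".join("x" if allowed else "." for allowed in row))
# x.....
# xx....
# xxx...
# .xxx..
# ..xxx.
# ...xxx
```

The printed band makes the locality visible: each row has at most W + 1 allowed keys, and the band slides one position to the right per query.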

Math Or Compute Schema

With sequence length n and window size W, sliding-window attention applies a band mask before softmax. The formulas below contrast dense multi-head attention against window-limited attention that restricts which keys each query can reach.

Multi-head attention (MHA)

\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i

Q: Query vectors for head i.
K: Key vectors for head i.
V: Value vectors for head i.
H: Number of query heads.
d_k: Key dimension per head.
i: Query head index.

Sliding-window attention

\text{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}} + B_W\right) V_i

Q: Query vectors for head i.
K: Key vectors for head i.
V: Value vectors for head i.
H: Number of query heads.
W: Window size: maximum token distance each query may attend to.
d_k: Key dimension per head.
i: Query head index.
B_W: Additive band mask: 0 for key positions inside the sliding window and -\infty for positions outside it, so masked keys receive zero weight after softmax.
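The sliding-window formula can be sketched for a single head in plain Python. This is an illustrative sketch under stated assumptions: the function name is hypothetical, the window is causal (the query itself plus W earlier tokens), and out-of-window keys are simply left out of the softmax, which has the same effect as giving them zero attention weight.

```python
import math


def sliding_window_attention(q, k, v, window):
    """Single-head causal sliding-window attention over lists of float vectors."""
    d_k = len(q[0])
    outputs = []
    for i in range(len(q)):
        lo = max(0, i - window)
        positions = range(lo, i + 1)
        # Scaled dot-product scores, computed only for keys inside the window.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d_k)
            for j in positions
        ]
        # Numerically stable softmax over the window.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted sum of the value vectors inside the window.
        out = [0.0] * len(v[0])
        for w, j in zip(weights, positions):
            for c, value in enumerate(v[j]):
                out[c] += (w / total) * value
        outputs.append(out)
    return outputs


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [2.0], [3.0]]
local = sliding_window_attention(q, k, v, window=1)
```

With `window=1` the last query mixes only the values at positions 1 and 2, so its output lies between 2.0 and 3.0 and position 0 contributes nothing.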

Compared To Nearby Modules

Compared with multi-head attention, sliding-window attention limits reach to a local neighborhood rather than scoring all positions. Multi-query and grouped-query attention remain dense over allowed positions but reduce KV head count; sliding-window attention changes how far each query can look.

Comparison dimension	Sliding-Window Attention	Multi-Head Attention	Multi-Query Attention	Grouped-Query Attention
Attention locality	Fixed local window of W tokens per query	Full sequence; every query sees every key	Full sequence with shared KV heads	Full sequence with grouped KV heads
Compute scaling with sequence length	Linear O(n·W) per head when W is fixed	Quadratic O(n²) per head	Quadratic O(n²) with fewer KV projections	Quadratic O(n²) with reduced KV head count
Global token reach	Distant tokens require stacked layers or companion patterns	Any position can attend to any other position	Any position can attend to any other position	Any position can attend to any other position

How sliding-window attention compares with nearby attention variants
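To make the scaling row concrete, the number of scored query-key pairs per head can be counted directly. The helper names below are illustrative, and both counts assume causal attention, with the window covering the query itself plus W earlier tokens.

```python
def dense_causal_pairs(n):
    """Query-key pairs scored by dense causal attention: n(n + 1) / 2."""
    return n * (n + 1) // 2


def window_causal_pairs(n, window):
    """Query-key pairs scored when each query sees itself plus `window` earlier keys."""
    return sum(min(i, window) + 1 for i in range(n))


n, window = 4096, 512
ratio = dense_causal_pairs(n) / window_causal_pairs(n, window)
```

The window count is bounded by n·(W + 1), so doubling n roughly doubles the work, whereas the dense count roughly quadruples. When W is at least n - 1 the two counts coincide and the window gives no saving.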

Example Architectures

Sliding-window attention appears in long-context decoder models that combine local windows with periodic global layers or other patterns to recover distant dependencies.

Limitations And Tradeoffs

A fixed window cannot directly connect distant tokens in a single layer. Models that rely only on local windows may need stacked layers or companion patterns to move information across the full sequence.
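The stacked-layer argument can be turned into simple arithmetic. Each layer lets information travel at most W positions, so after L layers the theoretical reach is L·W. The helper names below are illustrative, and the result is an upper bound on reach, not a guarantee that a trained model actually uses it.

```python
def reach_after_layers(window, layers):
    """Upper bound on how many positions back information can travel."""
    return window * layers


def layers_to_span(distance, window):
    """Minimum number of stacked local layers needed to cover `distance` tokens."""
    # Ceiling division using integers only.
    return -(-distance // window)


needed = layers_to_span(distance=4095, window=512)  # 8
```

Under these assumptions, connecting the first and last tokens of a 4096-token sequence with W = 512 takes at least eight local layers, which is why such models are often paired with global layers or other companion patterns.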

Why It Still Matters

As sequence lengths grow, quadratic attention cost motivates locality patterns that preserve useful nearby signal while avoiding a full n-by-n score matrix on every layer.
