Residual connection

Skip paths that add each sublayer's output back onto the running hidden stream so deep transformer stacks stay trainable.

What It Is

A residual connection, also called a skip connection or residual add, carries the input of a sublayer forward and adds it to that sublayer's output. In a transformer, a single stream of hidden vectors runs through the whole stack: attention, feed-forward, and normalization layers all operate on it, while the residual adds preserve a direct path from earlier layers to later ones. Each block typically applies two residual adds: one around attention and one around the feed-forward network.
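The two adds in one block can be written out as a short sketch. The symbols are assumptions made for illustration: h is the stream entering the block, Attn and FFN stand for the two sublayers, and normalization is left out to keep the adds visible.

```latex
\begin{aligned}
h' &= h + \mathrm{Attn}(h) \\
h_{\text{out}} &= h' + \mathrm{FFN}(h')
\end{aligned}
```

Both sums are elementwise, so each sublayer has to return vectors of the same width as the stream it reads.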

Why It Matters

Deep stacks would be hard to train if every layer had to learn the full transformation from scratch. Residuals let each sublayer learn a small update on top of what already flows through the stream. Because the add passes gradients through unchanged, backpropagation always has a direct route to earlier layers along the skip path, which helps keep gradients from vanishing. Placement relative to normalization also shapes training: pre-norm applies layer norm to the sublayer's input and adds the sublayer's output onto the unnormalized stream, while post-norm adds first and then normalizes the sum. Most modern decoder language models use pre-norm because it tends to stabilize optimization at large depth.
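The difference between the two placements is easier to see side by side. The following is a minimal sketch in plain Python, not a reference implementation: `layer_norm` omits the learned gain and bias, and `toy_sublayer`, `pre_norm_step`, and `post_norm_step` are hypothetical names introduced only for this illustration.

```python
import math

def layer_norm(v, eps=1e-5):
    # Rescale one vector to zero mean and unit variance (learned gain and bias omitted).
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add(u, v):
    # Elementwise residual add; both vectors share the stream width.
    return [a + b for a, b in zip(u, v)]

def pre_norm_step(x, sublayer):
    # The sublayer reads a normalized copy; the raw stream x reaches the add unchanged.
    return add(x, sublayer(layer_norm(x)))

def post_norm_step(x, sublayer):
    # The sublayer reads the raw stream; the sum is normalized afterward.
    return layer_norm(add(x, sublayer(x)))

def toy_sublayer(v):
    # Hypothetical stand-in for attention or an MLP.
    return [0.1 * math.tanh(a) for a in v]

x = [2.0, -1.0, 0.5, 4.0]
pre_out = pre_norm_step(x, toy_sublayer)
post_out = post_norm_step(x, toy_sublayer)
```

If the sublayer returned all zeros, `pre_norm_step` would hand back `x` unchanged, whereas `post_norm_step` would still return a rescaled vector. In the post-norm form, normalization sits directly on the skip path.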

Simple Example

Imagine one token vector entering a block. Pre-norm attention reads a normalized copy of the stream, mixes context from other positions, and writes its output back with a residual add so the original stream is never fully replaced. The feed-forward sublayer repeats the pattern: normalize, apply the dense MLP, then add onto the same residual stream before the next block receives the vector.
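The walk-through above can be sketched for a short sequence of token vectors. Everything named here is an assumption made for illustration: `toy_attention` has no learned projections and no causal mask, `toy_mlp` is a fixed stand-in for the dense MLP, and `layer_norm` omits its learned parameters.

```python
import math

def layer_norm(v, eps=1e-5):
    # Rescale one vector to zero mean and unit variance (learned gain and bias omitted).
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]

def add(u, v):
    # Elementwise residual add.
    return [a + b for a, b in zip(u, v)]

def toy_attention(seq):
    # Hypothetical single-head attention without learned projections:
    # each position takes a softmax-weighted average over all positions.
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in seq]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * k[i] for w, k in zip(weights, seq)) / total
                    for i in range(len(q))])
    return out

def toy_mlp(v):
    # Hypothetical position-wise feed-forward stand-in that keeps the stream width.
    return [0.5 * max(a, 0.0) for a in v]

def pre_norm_block(stream):
    # Sublayer 1: attention reads normalized copies, and its output is added onto the stream.
    normed = [layer_norm(v) for v in stream]
    stream = [add(v, a) for v, a in zip(stream, toy_attention(normed))]
    # Sublayer 2: normalize, apply the MLP, add onto the same stream.
    return [add(v, toy_mlp(layer_norm(v))) for v in stream]

tokens = [[1.0, 0.0, -1.0, 2.0], [0.5, 0.5, 0.0, -1.0], [2.0, -2.0, 1.0, 0.0]]
out = pre_norm_block(tokens)
```

The output has the same number of positions and the same width as the input, which is what lets the next block read it as the same stream.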

Common Confusions

A residual add is not the same as normalization: a residual add sums a sublayer's output with its input and leaves that input intact, while a norm layer rescales activations. The residual stream is also not a second hidden size; it is the main model width that every sublayer reads from and writes to. Finally, skip connections here live inside one model stack; they are not shortcuts between separate ensemble members.
