Backpropagation

The backward pass that applies the chain rule on a computational graph to compute gradients for every differentiable parameter.

What It Is

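Backpropagation (often shortened to backprop) is reverse-mode automatic differentiation applied to neural networks. Starting at the loss node, the framework walks the computational graph backward and, at each operation, multiplies the incoming gradient by that operation's local Jacobian (in practice a vector-Jacobian product, so full Jacobians are rarely materialized). In this way earlier parameters receive credit for how they influenced the final error.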
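To make the graph walk concrete, here is a minimal sketch of scalar reverse-mode differentiation in plain Python. The `Value` class and its fields are hypothetical names invented for illustration, not any framework's API; real frameworks operate on tensors and are far more elaborate.

```python
import math


class Value:
    """A scalar graph node that remembers how it was computed (illustrative only)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one entry per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Build a topological order: every node appears after all of its parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # dL/dL
        # Walk from the output back toward the inputs, applying the chain rule.
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += node.grad * local


# Tiny graph: y = tanh(w * x + b), with w * x + b = 0 so tanh'(0) = 1
w, x, b = Value(0.5), Value(2.0), Value(-1.0)
y = (w * x + b).tanh()
y.backward()
print(w.grad, x.grad, b.grad)  # 2.0 0.5 1.0
```

Each node stores only its local derivatives; the backward loop combines them with the upstream gradient, which is the "multiply along edges" step described above.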

Why It Matters

It makes deep networks trainable without hand-derived derivatives for every layer. Once the forward pass has computed logits, softmax, and cross-entropy loss, a single backward call populates gradients on embeddings, attention projections, and output weights consistently.

Simple Example

Consider logits → softmax → cross-entropy. Backpropagation first yields ∂L/∂logits, then continues through earlier matmuls and activations. Each step applies the chain rule so ∂L/∂W for a weight matrix W combines upstream loss sensitivity with the forward input that multiplied W.

\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}

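import torch.nn.functional as F  # assumes PyTorch

optimizer.zero_grad()                    # clear gradients left from the previous step
loss = F.cross_entropy(logits, targets)  # applies log-softmax to raw logits internally
loss.backward()                          # fills parameter .grad via the graph
optimizer.step()                         # separate from backprop: applies the update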
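The chain-rule steps in this example can also be worked by hand. The sketch below assumes a single linear layer (logits = x · W) and one training example, with made-up numbers. It uses the standard result that the gradient of softmax cross-entropy with respect to the logits is softmax(logits) minus the one-hot target, then checks one weight gradient against a finite difference. All function names here are hypothetical helpers.

```python
import math


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(W, x, target):
    """Loss for one example: logits = x @ W, then softmax cross-entropy."""
    n_out = len(W[0])
    logits = [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(n_out)]
    probs = softmax(logits)
    return -math.log(probs[target]), probs


def backward(x, probs, target):
    """Analytic gradients obtained with the chain rule."""
    # dL/dlogits = softmax(logits) - onehot(target)
    dlogits = [p - (1.0 if j == target else 0.0) for j, p in enumerate(probs)]
    # dL/dW[i][j] = x[i] * dL/dlogits[j]: forward input times upstream gradient
    dW = [[xi * dj for dj in dlogits] for xi in x]
    return dlogits, dW


x = [1.0, -2.0]                            # illustrative input
W = [[0.1, 0.2, -0.3], [0.4, -0.5, 0.6]]   # illustrative 2x3 weight matrix
target = 2

loss, probs = forward(W, x, target)
dlogits, dW = backward(x, probs, target)

# Finite-difference check on a single weight entry, W[1][0].
eps = 1e-6
W_plus = [row[:] for row in W]
W_plus[1][0] += eps
W_minus = [row[:] for row in W]
W_minus[1][0] -= eps
numeric = (forward(W_plus, x, target)[0] - forward(W_minus, x, target)[0]) / (2 * eps)

print(abs(numeric - dW[1][0]) < 1e-6)  # True
```

The finite-difference check needs two extra forward passes per weight, whereas the analytic backward pass produces every entry of dW at once.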

Common Confusions

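Backpropagation is not the optimizer step: it only computes gradients. It is also distinct from forward inference, which produces outputs without computing any gradients. No weights change during the backward pass; updates happen only when you explicitly call an optimizer. Reverse-mode autodiff suits the usual language-model training setup of many parameters and one scalar loss, because a single backward pass yields the gradient for every parameter at a cost of roughly a small constant multiple of the forward pass.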
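The split between computing gradients and applying them can be seen in a few lines of plain Python. This is a sketch under simple assumptions: a one-layer linear model with squared error, plain SGD as the update rule, and hypothetical names (`loss_fn`, `grad_fn`, `lr`) chosen only for illustration.

```python
def loss_fn(w, x, t):
    """Squared error of a linear model on one example (illustrative)."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return (pred - t) ** 2


def grad_fn(w, x, t):
    """Gradient of loss_fn with respect to w; reads w but never modifies it."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    upstream = 2.0 * (pred - t)          # dL/dpred
    return [upstream * xi for xi in x]   # dL/dw_i = dL/dpred * dpred/dw_i


w = [0.5, -1.0]
x = [1.0, 2.0]
t = 1.0
lr = 0.05  # hypothetical learning rate

w_before = list(w)
grads = grad_fn(w, x, t)       # the "backward" part: gradients only
unchanged = (w == w_before)    # weights are still identical at this point

loss_before = loss_fn(w, x, t)
w = [wi - lr * gi for wi, gi in zip(w, grads)]  # the "optimizer step": plain SGD
loss_after = loss_fn(w, x, t)

print(unchanged, grads, loss_before, loss_after)  # True [-5.0, -10.0] 6.25 1.5625
```

Only the final assignment to `w` changes the model; everything before it is gradient bookkeeping.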
