Gradient

How the training loss changes with respect to each parameter: its sign and magnitude tell the optimizer which direction, and roughly how far, to move that parameter to reduce the loss.

What It Is

For each trainable parameter, the gradient records how much the scalar loss would change if that parameter increased slightly. Frameworks such as PyTorch store gradients in `.grad` tensors whose shapes match the parameters. During backpropagation, gradients flow backward from the loss through the computational graph, with the chain rule applied at each differentiable op.
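To make the chain-rule step concrete, here is a minimal sketch in plain Python rather than a real framework. The model (`pred = w * x + b` with a squared-error loss), the helper names `backward` and `finite_difference`, and all numeric values are assumptions chosen for illustration. It computes each gradient by hand and compares it against the "increase the parameter slightly" definition.

```python
# Tiny model: pred = w * x + b, loss = (pred - y) ** 2.
# All names and values are illustrative assumptions, not a framework API.

def loss_fn(w, b, x, y):
    return (w * x + b - y) ** 2

def backward(w, b, x, y):
    """Apply the chain rule from the loss back to each parameter."""
    pred = w * x + b                 # forward pass
    dloss_dpred = 2.0 * (pred - y)   # d(loss)/d(pred)
    grad_w = dloss_dpred * x         # d(pred)/d(w) = x
    grad_b = dloss_dpred * 1.0       # d(pred)/d(b) = 1
    return grad_w, grad_b

def finite_difference(f, value, eps=1e-6):
    """Estimate df/dvalue by nudging the input slightly in both directions."""
    return (f(value + eps) - f(value - eps)) / (2.0 * eps)

w, b, x, y = 0.5, 0.1, 2.0, 3.0
grad_w, grad_b = backward(w, b, x, y)

# Numerical estimates of the same two derivatives.
approx_w = finite_difference(lambda v: loss_fn(v, b, x, y), w)
approx_b = finite_difference(lambda v: loss_fn(w, v, x, y), b)

print(grad_w, approx_w)
print(grad_b, approx_b)
```

Both gradients are negative here, meaning the loss would fall if `w` or `b` increased. Automatic differentiation performs the same bookkeeping across every op in the graph.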

Why It Matters

Without gradients, you cannot run stochastic gradient descent, Adam, or any other gradient-based optimizer. Debugging training often starts by checking whether gradients are zero, exploding, or detached from the graph. Only parameters that lie on the path from the inputs through the logits to the loss can receive non-zero gradients.
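The debugging checks mentioned above can be sketched as a small helper. This is an illustration only: `diagnose_gradients`, the parameter names, and the `explode_threshold` value are hypothetical, and gradients are represented as plain lists of floats (or `None` when no gradient was computed) rather than framework tensors.

```python
import math

def diagnose_gradients(grads, explode_threshold=1e3):
    """Label each gradient as 'missing', 'non-finite', 'zero', 'exploding', or 'ok'.

    grads maps a parameter name to a flat list of floats, or to None when
    no gradient was computed (for example, the parameter is detached).
    The threshold is an arbitrary assumption; tune it per model.
    """
    report = {}
    for name, g in grads.items():
        if g is None:
            report[name] = "missing"
            continue
        if any(not math.isfinite(v) for v in g):
            report[name] = "non-finite"
            continue
        norm = math.sqrt(sum(v * v for v in g))  # L2 norm of the gradient
        if norm == 0.0:
            report[name] = "zero"
        elif norm > explode_threshold:
            report[name] = "exploding"
        else:
            report[name] = "ok"
    return report

# Hypothetical gradients collected after one backward pass.
example = {
    "encoder.weight": [0.0, 0.0, 0.0],
    "decoder.weight": [0.02, -0.01, 0.03],
    "head.weight": [4.0e4, -2.5e4],
    "detached.bias": None,
}
report = diagnose_gradients(example)
for name, status in report.items():
    print(f"{name}: {status}")
```

In a real training loop the same idea would be applied to each parameter's stored gradient right after the backward pass and before the optimizer step.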

Simple Example

Imagine a single weight w in a tiny linear layer. If increasing w by a tiny amount raises the loss, the gradient ∂L/∂w is positive and an optimizer subtracts a small multiple of that value from w on the next step.

$$\nabla_\theta L(\theta) = \frac{\partial L}{\partial \theta}$$
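The single-weight case can be sketched in a few lines of plain Python. The loss `(w * x - y) ** 2`, the starting weight, and the learning rate are assumptions picked so that the gradient is positive; none of them come from a specific model.

```python
# One gradient-descent step on a single weight.
# Loss, data, and learning rate are illustrative assumptions.

def loss(w, x=1.0, y=0.0):
    return (w * x - y) ** 2

def grad(w, x=1.0, y=0.0):
    # Derivative of the loss with respect to w.
    return 2.0 * (w * x - y) * x

w = 2.0
learning_rate = 0.1

g = grad(w)                    # positive: increasing w would raise the loss
w_new = w - learning_rate * g  # so the optimizer moves w downward

print(g, w_new, loss(w), loss(w_new))
```

Because the gradient is positive, subtracting a small multiple of it lowers `w`, and the loss at `w_new` is smaller than at `w`.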

Common Confusions

Gradients are not the same as parameter values or activations; they are update hints computed from the current batch. A zero gradient on a parameter often means that the parameter did not affect the loss (it is frozen, disconnected, or saturated). Gradients also differ from momentum buffers and other optimizer state, which accumulate across steps.
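The difference between a gradient and a momentum buffer can be illustrated with a short sketch. It assumes one common formulation of SGD with momentum (`buffer = momentum * buffer + grad`, then `w -= lr * buffer`). The per-step gradient values are made up for illustration, not produced by a real model.

```python
# Gradient vs. momentum buffer over three steps.
# The update rule is one common momentum formulation (an assumption here);
# the per-step gradients are invented for illustration.

momentum = 0.9
learning_rate = 0.1

step_grads = [1.0, 1.0, 0.0]  # fresh gradient computed from each batch
w = 0.0
buffer = 0.0                  # optimizer state that persists across steps
buffers = []
weights = []

for g in step_grads:
    buffer = momentum * buffer + g   # accumulates history
    w = w - learning_rate * buffer   # update uses the buffer, not g alone
    buffers.append(buffer)
    weights.append(w)

print(buffers)
print(weights)
```

On the third step the gradient is zero, yet the buffer is still non-zero and the weight keeps moving. The gradient describes only the current batch, while optimizer state carries information from earlier steps.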
