For each parameter, the direction and rate at which the training loss changes as that parameter increases; optimizers step against it to reduce the loss.
What It Is
For each trainable parameter, the gradient records how much the scalar loss would change if that parameter increased slightly. Frameworks such as PyTorch store each gradient in a `.grad` tensor with the same shape as its parameter. Backpropagation computes these values by walking backward from the loss through the computational graph and applying the chain rule at each differentiable operation.
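The backward pass described above can be sketched by hand for a very small graph. This is a minimal, framework-free illustration, not how any particular library implements autograd; the function names (`forward`, `backward`, `finite_difference`) and the squared-error loss are assumptions chosen for the example. A finite-difference estimate is included as a sanity check on the chain-rule result.

```python
def forward(w, b, x, target):
    """Tiny graph: prediction = w * x + b, loss = (prediction - target) ** 2."""
    prediction = w * x + b
    loss = (prediction - target) ** 2
    return prediction, loss


def backward(w, b, x, target):
    """Apply the chain rule from the loss back to each parameter."""
    prediction, _ = forward(w, b, x, target)
    dloss_dprediction = 2.0 * (prediction - target)  # derivative of the squared error
    dloss_dw = dloss_dprediction * x                 # d(prediction)/dw = x
    dloss_db = dloss_dprediction * 1.0               # d(prediction)/db = 1
    return {"w": dloss_dw, "b": dloss_db}


def finite_difference(w, b, x, target, eps=1e-6):
    """Numerically estimate dL/dw by nudging w slightly in both directions."""
    _, loss_plus = forward(w + eps, b, x, target)
    _, loss_minus = forward(w - eps, b, x, target)
    return (loss_plus - loss_minus) / (2.0 * eps)


# Illustrative values only.
grads = backward(w=0.5, b=0.1, x=2.0, target=3.0)
numeric_dw = finite_difference(w=0.5, b=0.1, x=2.0, target=3.0)
print(grads, numeric_dw)
```

Both gradients here are negative, so increasing `w` or `b` slightly would lower the loss. The analytic and numeric values for `w` should agree closely, which is the same check that gradient-testing utilities perform at larger scale.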
Why It Matters
Without gradients, you cannot run stochastic gradient descent, Adam, or any other gradient-based optimizer. Debugging training often starts by checking whether gradients are zero, exploding, or detached from the graph. Only parameters that lie on the path from the inputs through the logits to the loss can receive non-zero gradients.
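A first-pass gradient check along these lines might look like the sketch below. It is framework-free: each gradient is represented as a flat list of floats, with `None` standing in for a parameter that never received a gradient (in PyTorch, such a parameter's `.grad` is `None`). The helper name `grad_status`, the parameter names, and the `explode_threshold` cutoff are all hypothetical choices for illustration.

```python
import math


def grad_status(grad, explode_threshold=1e3):
    """Classify one parameter's gradient, given as a flat list of floats or None.

    explode_threshold is an arbitrary illustrative cutoff, not a standard value.
    """
    if grad is None:
        return "missing"  # never received a gradient, e.g. detached from the graph
    if any(math.isnan(g) or math.isinf(g) for g in grad):
        return "non-finite"
    norm = math.sqrt(sum(g * g for g in grad))  # L2 norm of the gradient
    if norm == 0.0:
        return "zero"
    if norm > explode_threshold:
        return "exploding"
    return "ok"


# Hypothetical gradients keyed by parameter name.
grads = {
    "encoder.weight": [0.02, -0.01, 0.005],
    "encoder.bias": [0.0, 0.0, 0.0],
    "head.weight": [4.0e4, -2.5e4],
    "unused.weight": None,
}
report = {name: grad_status(g) for name, g in grads.items()}
for name, status in report.items():
    print(f"{name}: {status}")
```

A sensible threshold depends on the model and the loss scale, so in practice it is usually more informative to track gradient norms over time than to compare against a fixed number.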
Simple Example
Imagine a single weight w in a tiny linear layer. If increasing w by a tiny amount raises the loss, the gradient ∂L/∂w is positive and an optimizer subtracts a small multiple of that value from w on the next step.
$$\nabla_\theta L(\theta) = \frac{\partial L}{\partial \theta}$$
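The single-weight case can be worked through numerically. The sketch below assumes a squared-error loss on one input, with an illustrative learning rate and starting weight; none of these values come from a real model.

```python
def loss_fn(w, x=2.0, target=1.0):
    """Squared error of a one-weight linear layer: (w * x - target) ** 2."""
    return (w * x - target) ** 2


def grad_fn(w, x=2.0, target=1.0):
    """Analytic derivative dL/dw = 2 * (w * x - target) * x."""
    return 2.0 * (w * x - target) * x


learning_rate = 0.05  # illustrative step size
w = 1.0
loss_before = loss_fn(w)
grad = grad_fn(w)                 # positive here: increasing w would raise the loss
w_new = w - learning_rate * grad  # so the update moves w downward
loss_after = loss_fn(w_new)
print(grad, w_new, loss_before, loss_after)
```

Because the gradient is positive, subtracting a small multiple of it decreases `w`, and the loss at the new weight is lower than before.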
Common Confusions
Gradients are not the same as parameter values or activations; they are update hints computed from the current batch. A zero or missing gradient usually means the parameter did not affect the loss on that batch: it was frozen, disconnected from the graph, or sitting behind a saturated activation. Gradients also differ from momentum buffers and other optimizer state, which accumulate across steps.
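The difference between a gradient and a momentum buffer shows up clearly over a few steps. The sketch below assumes a toy loss L(w) = w² and a classical momentum update (buffer = momentum × buffer + gradient); the variable names and hyperparameters are illustrative, and real optimizers differ in details such as dampening.

```python
def grad_fn(w):
    """Gradient of the toy loss L(w) = w ** 2."""
    return 2.0 * w


learning_rate = 0.1
momentum = 0.9
w = 1.0
velocity = 0.0  # optimizer state: persists across steps
history = []

for step in range(3):
    grad = grad_fn(w)                      # recomputed from the current w only
    velocity = momentum * velocity + grad  # blends in every earlier gradient
    w = w - learning_rate * velocity
    history.append((grad, velocity))

for step, (grad, velocity) in enumerate(history):
    print(f"step {step}: grad={grad:.3f} velocity={velocity:.3f}")
```

On the first step the two values coincide, since the buffer starts empty. From the second step on, the gradient shrinks as `w` approaches the minimum while the buffer still carries the earlier, larger gradients.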