Glossary

Computational Graph

A record of which operations produced each tensor so automatic differentiation can run backward.

What It Is

During a forward pass, an autograd framework builds a directed acyclic graph whose nodes are operations (matrix multiply, softmax, loss) and whose edges carry tensors. The graph is the bookkeeping structure autograd uses, not something you usually draw by hand for a full transformer.
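To make the bookkeeping concrete, here is a minimal sketch in plain Python of how a forward pass can record a graph as a side effect. The names `Node`, `mul`, `add`, and `trace` are hypothetical and written for this illustration. Real frameworks store far more per node, and scalars stand in for tensors here.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """One value plus a record of the operation that produced it."""
    value: float
    op: str = "leaf"       # name of the producing operation
    parents: tuple = ()    # input nodes, i.e. the incoming edges

def mul(a, b):
    # Compute the result and remember which nodes fed into it
    return Node(a.value * b.value, "mul", (a, b))

def add(a, b):
    return Node(a.value + b.value, "add", (a, b))

def trace(node):
    """Return op names in topological order (inputs before outputs)."""
    seen, order = set(), []

    def visit(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for p in n.parents:
            visit(p)
        order.append(n.op)

    visit(node)
    return order

w, x, b = Node(2.0), Node(3.0), Node(1.0)
y = add(mul(w, x), b)   # the forward pass builds the graph as it computes
ops = trace(y)          # the recorded ops, inputs first
```

Nothing in the forward line mentions a graph. The record exists only because each operation keeps references to its inputs.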

Why It Matters

Without the graph, you would hand-derive a gradient for every layer. With it, the same forward code that produced logits and loss also enables `.backward()` to populate parameter gradients consistently.
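As a rough illustration of how a recorded graph lets one backward call fill in every gradient, the sketch below implements reverse-mode differentiation over scalars in plain Python. The `Value` class and its `backward_fn` field are assumptions made for this example, not the internals of any particular framework, and the squared quantity used as a loss is a toy stand-in.

```python
class Value:
    """A scalar that remembers how it was computed."""

    def __init__(self, data, parents=(), backward_fn=None):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        # Maps the gradient of this node to one gradient per parent
        self.backward_fn = backward_fn or (lambda g: ())

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     lambda g: (g * other.data, g * self.data))

    def __add__(self, other):
        return Value(self.data + other.data, (self, other),
                     lambda g: (g, g))

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            for parent, g in zip(v.parents, v.backward_fn(v.grad)):
                parent.grad += g   # accumulate over every path to the parent

w, x, b = Value(2.0), Value(3.0), Value(1.0)
out = w * x + b      # forward code only, no hand-written derivatives
loss = out * out     # toy loss: the squared output
loss.backward()      # w.grad, x.grad, and b.grad are now populated
```

Each operation supplies only its own local derivative. The graph traversal composes them, which is what removes the need to derive a gradient per layer by hand.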

Simple Example

Multiply embedding weights by a one-hot input, add a bias, and apply GELU: each step becomes a node. The loss node at the end connects backward to every parameter node that influenced the prediction.
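The example above can be sketched with plain Python lists, writing each forward node on its own line and then walking the same nodes in reverse. The sizes, the numeric values, and the sum-of-squares loss are all assumptions chosen for illustration, since the entry does not specify them. The backward steps are written by hand here to show what an autograd graph would do automatically.

```python
import math

def gelu(z):
    # Exact GELU: z * Phi(z), where Phi is the standard normal CDF
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_grad(z):
    # d/dz [z * Phi(z)] = Phi(z) + z * phi(z), with phi the normal PDF
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return cdf + z * pdf

# Hypothetical toy sizes: 3 tokens in the vocabulary, embedding width 2
W = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.6]]   # embedding weights
b = [0.05, -0.05]                            # bias
one_hot = [0.0, 1.0, 0.0]                    # selects token 1

# Forward pass: each line corresponds to one node in the graph
emb = [sum(one_hot[i] * W[i][j] for i in range(3)) for j in range(2)]  # matmul
pre = [emb[j] + b[j] for j in range(2)]                                # add bias
act = [gelu(p) for p in pre]                                           # GELU
loss = sum(a * a for a in act)                                         # toy loss

# Backward pass: visit the same nodes in reverse order
d_act = [2.0 * a for a in act]                                # through the loss
d_pre = [d_act[j] * gelu_grad(pre[j]) for j in range(2)]      # through GELU
d_b = d_pre[:]                                                # through the add
d_W = [[one_hot[i] * d_pre[j] for j in range(2)] for i in range(3)]  # matmul
```

In this sketch only row 1 of `d_W` is nonzero. The other embedding rows were multiplied by zero in the forward pass, so they did not influence the prediction and receive no gradient.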

Common Confusions

The computational graph is not the same as the model architecture diagram in a paper. The architecture is the human design, while the graph is the runtime trace of differentiable ops for one forward pass. It is also not the optimizer state, which stores per-parameter update history (for example, momentum buffers) rather than forward connectivity.
