A record of which operations produced each tensor so automatic differentiation can run backward.
What It Is
During a forward pass, frameworks build a directed acyclic graph whose nodes are operations (matrix multiply, softmax, loss) and whose edges carry the tensors passed between them. The graph is the bookkeeping structure autograd uses, not something you would usually draw by hand for a full transformer.
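The recording step described above can be sketched in plain Python. This is a toy illustration, not any framework's implementation: the `Node` class and `describe` helper are hypothetical names, scalars stand in for tensors, and each `Node` stores a value together with the op that produced it, which is one simple way to encode the same graph.

```python
class Node:
    """A tensor stand-in that remembers which op produced it and from what."""

    def __init__(self, value, op="leaf", parents=()):
        self.value = value
        self.op = op            # name of the operation that produced this value
        self.parents = parents  # input nodes of that operation

    def __add__(self, other):
        return Node(self.value + other.value, "add", (self, other))

    def __mul__(self, other):
        return Node(self.value * other.value, "mul", (self, other))


def describe(node, depth=0):
    """Return the recorded graph as indented lines, output node first."""
    lines = ["  " * depth + f"{node.op} -> {node.value}"]
    for parent in node.parents:
        lines.extend(describe(parent, depth + 1))
    return lines


w, x, b = Node(2.0), Node(3.0), Node(1.0)
y = w * x + b  # ordinary forward code; the graph is built as a side effect
print("\n".join(describe(y)))
```

Nothing in the line `y = w * x + b` mentions a graph, yet after it runs `y` can be walked back to `w`, `x`, and `b`. That link from outputs to inputs is what a backward pass needs.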
Why It Matters
Without the graph, you would hand-derive a gradient for every layer. With it, the same forward code that produced the logits and loss also lets `.backward()` apply the chain rule through every recorded op and populate each parameter's gradient.
Simple Example
Multiply embedding weights by a one-hot input, add a bias, and apply GELU; each step becomes a node. The loss node at the end connects backward to every parameter that influenced the prediction.
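A scalar version of this example can be written out as a minimal sketch. Everything here is an assumption made for illustration: the `Value` class and `forward` function are hypothetical, the embedding table is reduced to two scalar rows `w0` and `w1`, GELU uses its tanh approximation, and the loss is a squared error against a target of 1.0.

```python
import math


class Value:
    """Scalar that records its producing op for reverse-mode autodiff."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self._backward = lambda: None  # leaves have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def gelu(self):
        # tanh approximation of GELU and its derivative
        x = self.data
        c = math.sqrt(2.0 / math.pi)
        t = math.tanh(c * (x + 0.044715 * x ** 3))
        out = Value(0.5 * x * (1.0 + t), (self,))

        def _backward():
            dt = (1.0 - t * t) * c * (1.0 + 3 * 0.044715 * x * x)
            self.grad += (0.5 * (1.0 + t) + 0.5 * x * dt) * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Order nodes so inputs come before outputs, then walk output-first.
        order, seen = [], set()

        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    visit(p)
                order.append(v)

        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()


def forward(w0, w1, b):
    x0, x1 = Value(0.0), Value(1.0)  # one-hot input selecting row 1
    h = w0 * x0 + w1 * x1 + b        # embedding lookup plus bias
    a = h.gelu()
    diff = a + Value(-1.0)           # prediction minus target (target = 1.0)
    return diff * diff               # squared-error loss node


w0, w1, b = Value(0.3), Value(-0.5), Value(0.8)
loss = forward(w0, w1, b)
loss.backward()
print(w0.grad, w1.grad, b.grad)
```

Because the one-hot input zeroes out row 0, `w0` receives a zero gradient, while `w1` and `b` receive the same nonzero gradient. Only parameters that influenced the prediction get a signal from the loss node.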
Common Confusions
The computational graph is not the same as the model architecture diagram in a paper. The architecture is the human design, while the graph is the runtime trace of differentiable ops for one forward pass. It is also not the optimizer state, which stores per-parameter update history (for example, momentum) rather than forward connectivity.
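The difference between architecture and runtime trace shows up most clearly when the forward code branches on data, as it can in define-by-run frameworks. The sketch below is a toy tracer, not a framework API: `Traced`, `model`, and `trace` are hypothetical names, and the "graph" is simplified to a flat list of op names.

```python
class Traced:
    """Wraps a float and appends each op it takes part in to a shared tape."""

    def __init__(self, value, tape):
        self.value = value
        self.tape = tape

    def __mul__(self, other):
        self.tape.append("mul")
        return Traced(self.value * other, self.tape)

    def __add__(self, other):
        self.tape.append("add")
        return Traced(self.value + other, self.tape)


def model(x):
    # One fixed "architecture": scale, then shift only if the result is positive.
    h = x * 2.0
    if h.value > 0:  # data-dependent branch
        h = h + 1.0
    return h


def trace(x0):
    """Run one forward pass and return the ops that actually executed."""
    tape = []
    model(Traced(x0, tape))
    return tape


print(trace(3.0))   # ['mul', 'add']
print(trace(-3.0))  # ['mul']
```

The `model` function never changes, but the two forward passes record different op sequences. The architecture is the code, and the graph is what one particular run of it produced.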