Glossary

Activation

The numeric output of a layer after its linear transform and nonlinearity, passed forward through the network.

What It Is

After a layer applies its weights to an input tensor, the result (often after GELU, ReLU, or another nonlinearity) is that layer’s activation. In a transformer block, the tensors produced after the attention sublayer and after the MLP sublayer are typical examples.
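As a rough illustration, the sketch below computes one layer's activation in plain Python: a linear transform followed by ReLU. The helper names (linear, relu, layer_activation) and the tiny weight matrix are made up for this example and do not come from any particular library.

```python
def relu(values):
    """Elementwise ReLU: negative entries become 0.0."""
    return [max(v, 0.0) for v in values]

def linear(x, weight, bias):
    """Compute x @ weight + bias for a single input vector x.

    weight is a list of rows with shape (len(x), len(bias)).
    """
    return [
        sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]

def layer_activation(x, weight, bias):
    """The layer's activation: the linear output passed through ReLU."""
    return relu(linear(x, weight, bias))

# Illustrative values only.
x = [1.0, -2.0]
weight = [[1.0, 0.0, -1.0],
          [0.0, 1.0, 1.0]]
bias = [0.0, 0.0, 0.0]

pre_activation = linear(x, weight, bias)        # [1.0, -2.0, -3.0]
activation = layer_activation(x, weight, bias)  # [1.0, 0.0, 0.0]
```

The value handed to the next layer is activation, not pre_activation. Swapping relu for GELU or another nonlinearity changes the numbers but not the role the tensor plays.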

Why It Matters

Activations determine what information reaches the next layer, and backpropagation needs them to compute gradients. Memory planners track activation tensors during training because they are large but temporary: they are held only until the backward pass has used them, unlike parameters, which persist across steps and are saved in checkpoints.
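To give a sense of scale, the following back-of-the-envelope sketch estimates the memory for a single activation tensor of shape (batch, sequence, hidden). The function name and all sizes are hypothetical, chosen only to make the arithmetic concrete.

```python
def activation_bytes(batch_size, sequence_length, hidden_size, bytes_per_element):
    """Bytes needed to hold one (batch, sequence, hidden) activation tensor."""
    return batch_size * sequence_length * hidden_size * bytes_per_element

# Hypothetical sizes; 2 bytes per element corresponds to a 16-bit float.
size = activation_bytes(
    batch_size=8,
    sequence_length=2048,
    hidden_size=4096,
    bytes_per_element=2,
)
print(size / 2**20, "MiB")  # 128.0 MiB
```

A real training run keeps many such tensors alive at once, so the total is typically much larger than this single-tensor figure.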

Simple Example

For a batch of one sequence, a hidden layer might output a tensor of shape (sequence_length, hidden_size). That tensor is the activation handed to the next sublayer.
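The shape claim can be checked with a small plain-Python sketch. The sizes, the shape helper, and the hidden_layer function are assumptions made for illustration. The same weights are applied to every token position, so the output has one row per token and one column per hidden unit.

```python
def shape(matrix):
    """Return (rows, columns) of a nested-list matrix."""
    return (len(matrix), len(matrix[0]))

def hidden_layer(tokens, weight):
    """Apply the same weights, then ReLU, to every token vector."""
    hidden_size = len(weight[0])
    return [
        [
            max(sum(t[i] * weight[i][j] for i in range(len(t))), 0.0)
            for j in range(hidden_size)
        ]
        for t in tokens
    ]

# Illustrative sizes for a single sequence.
sequence_length, input_size, hidden_size = 4, 3, 5

tokens = [
    [0.1 * (p + i) for i in range(input_size)]
    for p in range(sequence_length)
]
weight = [
    [1.0 if (i + j) % 2 == 0 else -1.0 for j in range(hidden_size)]
    for i in range(input_size)
]

activation = hidden_layer(tokens, weight)
print(shape(activation))  # (4, 5), i.e. (sequence_length, hidden_size)
```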

Common Confusions

A hidden activation is not the same as softmax. Softmax turns vocabulary logits into a probability vector at the output head, while activations are internal layer outputs that may never be normalized across the vocabulary. The phrase “ReLU activation” refers to the nonlinearity applied inside a layer, not to the softmax step at decode time.
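The difference shows up numerically in a minimal sketch, assuming the same three made-up numbers are treated once as a hidden pre-activation and once as vocabulary logits. The ReLU output is an arbitrary non-negative vector, while the softmax output sums to 1.

```python
import math

def relu(values):
    """Elementwise ReLU, as used inside a layer."""
    return [max(v, 0.0) for v in values]

def softmax(logits):
    """Turn logits into probabilities; subtracting the max avoids overflow."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

values = [2.0, -1.0, 0.5]  # illustrative numbers only

hidden = relu(values)    # internal activation: [2.0, 0.0, 0.5]
probs = softmax(values)  # probability vector at the output head

print(sum(hidden))                   # 2.5, not normalized
print(abs(sum(probs) - 1.0) < 1e-9)  # True
```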
