Glossary

Hidden Size

The width of a model's internal vectors—the number of dimensions in each token embedding and each token's per-position hidden state before the vocabulary projection.

What It Is

Hidden size (also called model width or hidden dimension) is the length of the vectors flowing through a transformer's residual stream. Every token position carries a vector of that width after embedding lookup and through each block until the language-model head.

Why It Matters

Shape notation such as batch×sequence×768 uses hidden size as the last axis for activations inside the stack. Parameter counts, memory use, and FLOPs scale with width, so papers and configs often list hidden size alongside layer count and head count when comparing model sizes.

Simple Example

GPT-2 small uses hidden size 768: each token embedding is a 768-dimensional vector, attention projects queries and keys within that space, and the MLP expands and contracts around the same width unless a config explicitly sets a different intermediate size.

Common Confusions

Hidden size is not vocabulary size—the embedding table has vocab_size rows but each row has hidden_size columns. It is also not the same as parameter count: two models can share width but differ in depth, head layout, or MLP expansion ratio.

Tags

References