The width of a model's internal vectors: the number of dimensions in each token embedding and in each position's hidden state before the vocabulary projection.
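Shape notation such as batch×sequence×768 puts hidden size on the last axis of activations inside the stack. Parameter counts, memory use, and FLOPs all grow with width: at a fixed MLP expansion ratio, the per-layer weight count and the FLOPs of the weight matrix multiplies grow roughly with the square of hidden size, while activation memory grows linearly. Papers and configs therefore often list hidden size alongside layer count and head count when comparing model sizes.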
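To make the scaling concrete, here is a rough sketch of how per-layer weight count depends on width. It assumes a standard transformer block with four square attention projections (query, key, value, output) and a two-matrix MLP, and it ignores biases and layer norms. The function name and the mlp_ratio parameter are illustrative, not taken from any particular library.

```python
def block_weight_params(hidden_size, mlp_ratio=4):
    """Approximate weight count of one transformer block (biases and norms ignored)."""
    # Attention: query, key, value, and output projections, each hidden_size x hidden_size.
    attention = 4 * hidden_size * hidden_size
    # MLP: one matrix up to the intermediate size and one matrix back down.
    intermediate_size = mlp_ratio * hidden_size
    mlp = 2 * hidden_size * intermediate_size
    return attention + mlp


if __name__ == "__main__":
    for width in (768, 1536):
        print(width, block_weight_params(width))
```

Under these assumptions, doubling the width quadruples the per-block weight count, which is why width is such a strong lever on model size.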
Simple Example
GPT-2 small uses hidden size 768: each token embedding is a 768-dimensional vector, attention projects queries, keys, and values within that space (split across 12 heads of 64 dimensions each), and the MLP expands to an intermediate size of 3072 (4×768) before contracting back to 768. Other configs may set a different intermediate size explicitly.
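The shapes in this example can be traced with a short sketch. It only does shape arithmetic, with no real tensors, and the function and key names are hypothetical. The defaults assume GPT-2 small's 12 heads and a 4x MLP expansion.

```python
def trace_shapes(batch, seq, hidden_size=768, num_heads=12, mlp_ratio=4):
    """Return the activation shapes seen at each stage of one transformer block."""
    # Each head works on an equal slice of the hidden dimension.
    head_dim = hidden_size // num_heads
    return {
        "embeddings": (batch, seq, hidden_size),
        "queries_per_head": (batch, num_heads, seq, head_dim),
        "attention_output": (batch, seq, hidden_size),
        "mlp_intermediate": (batch, seq, mlp_ratio * hidden_size),
        "mlp_output": (batch, seq, hidden_size),
    }


if __name__ == "__main__":
    for name, shape in trace_shapes(batch=2, seq=10).items():
        print(f"{name:18s} {shape}")
```

Note that every stage except the per-head view and the MLP intermediate returns to hidden size on the last axis.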
Common Confusions
Hidden size is not vocabulary size—the embedding table has vocab_size rows but each row has hidden_size columns. It is also not the same as parameter count: two models can share width but differ in depth, head layout, or MLP expansion ratio.
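Both confusions can be checked with simple arithmetic. The sketch below assumes GPT-2's vocabulary of 50257 tokens, a standard block with four square attention projections and a two-matrix MLP, and ignores biases, layer norms, and position embeddings. All function names are illustrative.

```python
def embedding_table_shape(vocab_size, hidden_size):
    """One row per vocabulary entry, one column per hidden dimension."""
    return (vocab_size, hidden_size)


def approx_weight_params(vocab_size, hidden_size, num_layers, mlp_ratio=4):
    """Rough weight count: embedding table plus num_layers identical blocks."""
    embedding = vocab_size * hidden_size
    # Per block: 4 attention projections plus the MLP up and down matrices.
    per_block = 4 * hidden_size**2 + 2 * hidden_size * (mlp_ratio * hidden_size)
    return embedding + num_layers * per_block


if __name__ == "__main__":
    print(embedding_table_shape(50257, 768))
    # Same width, different depth: the counts differ.
    print(approx_weight_params(50257, 768, num_layers=12))
    print(approx_weight_params(50257, 768, num_layers=24))
```

The two calls share hidden size 768 but differ in depth, so their approximate weight counts differ even though every activation has the same width.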