Glossary
Autoregressive Generation
A generation paradigm that produces outputs one discrete step at a time, each step conditioned on everything generated so far.
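In probabilistic terms, this definition corresponds to the chain-rule factorization of a sequence's joint probability. The symbols below (x_t for the unit at step t, T for the sequence length) are notation introduced here for illustration and do not come from the entry itself:

```latex
p(x_1, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1})
```

Each factor is the distribution the model produces at one generation step, conditioned on the prefix generated so far.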
What It Is
At inference, the model holds a growing prefix of tokens (or other discrete units). At each step the decoder computes logits over the vocabulary, softmax turns them into probabilities, and one unit is chosen (for example by taking the most likely unit or by sampling) and appended to the prefix. KV caches store the keys and values of past positions so they are not recomputed at every step. Encoder–decoder setups still autoregress on the decoder side while reading a fixed encoder memory.
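The loop described here can be sketched in a few lines of Python. This is a minimal illustration rather than a real decoder: the vocabulary, the `toy_logits` function, and the greedy selection rule are all assumptions made for the example, and no KV cache is modeled.

```python
import math

VOCAB = ["<eos>", "a", "b", "c"]  # hypothetical toy vocabulary


def toy_logits(prefix):
    """Hypothetical stand-in for a decoder: one logit per vocabulary entry.

    It favors whichever entry follows the last token in VOCAB order,
    wrapping around to "<eos>" after "c".
    """
    last = prefix[-1] if prefix else "<eos>"
    favored = (VOCAB.index(last) + 1) % len(VOCAB)
    return [3.0 if i == favored else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    """Turn logits into probabilities (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prefix, max_new_tokens=10):
    """Greedy autoregressive loop: predict, pick, append, repeat."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(tokens))  # conditioned on the whole prefix
        best = max(range(len(VOCAB)), key=probs.__getitem__)  # greedy choice
        if VOCAB[best] == "<eos>":  # stop on end-of-sequence
            break
        tokens.append(VOCAB[best])
    return tokens
```

Calling `generate(["a"])` extends the prefix one token per iteration until the toy model favors `<eos>` or the length limit is hit. Replacing the greedy choice with a random draw from `probs` would give a sampling decoder instead.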
Simple Example
Given the prefix "The capital of France is", the model outputs logits for the next token, and softmax puts most of the probability on "Paris". That token is appended, and the process repeats until an end-of-sequence token is produced or a length limit is reached.
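To make the softmax step of this example concrete, the sketch below applies it to a handful of candidate next tokens. The candidate list and the logit values are invented for illustration; a real model would score its entire vocabulary.

```python
import math

# Hypothetical candidates and made-up logits for the prefix
# "The capital of France is"; not taken from any real model.
candidates = ["Paris", "Lyon", "London", "the"]
logits = [9.0, 4.0, 3.0, 2.0]


def softmax(values):
    """Turn logits into probabilities (max subtracted for numerical stability)."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(logits)
best = candidates[probs.index(max(probs))]  # greedy pick: highest probability

for token, p in zip(candidates, probs):
    print(f"{token:>7}: {p:.4f}")
print("chosen:", best)
```

With these particular logits the gap between "Paris" and the runners-up is large, so almost all of the probability mass lands on "Paris" and a greedy decoder appends it.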
Common Confusions
Autoregressive generation is not a synonym for "decoder block": a decoder is an architecture component, while "autoregressive" describes the step-by-step generation loop. It is also not identical to teacher-forced training, where ground-truth prefixes are fed to the model during optimization even though inference is autoregressive. Diffusion sampling also iterates, but over noise timesteps that refine the whole output rather than appending one unit at a time, so it is not token-autoregressive unless the model is additionally built to emit tokens in order, each conditioned on the previous ones.
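The difference between teacher-forced training and autoregressive inference comes down to where each prefix comes from. The sketch below contrasts the two; the function names, the `<bos>` start token, and the deliberately poor `always_dog` predictor are hypothetical choices made for the example.

```python
def teacher_forced_pairs(target):
    """Training view: every prefix is cut from the ground-truth sequence."""
    return [(target[:t], target[t]) for t in range(1, len(target))]


def free_running_prefixes(first_token, predict_next, steps):
    """Inference view: each prefix is built from the model's own predictions."""
    prefix = [first_token]
    seen = []
    for _ in range(steps):
        seen.append(list(prefix))            # prefix the model conditions on
        prefix.append(predict_next(prefix))  # its output becomes future input
    return seen


def always_dog(prefix):
    """Hypothetical, deliberately bad predictor that ignores its input."""
    return "dog"


truth = ["<bos>", "The", "cat", "sat"]
training_pairs = teacher_forced_pairs(truth)
inference_prefixes = free_running_prefixes("<bos>", always_dog, steps=3)
```

In `training_pairs` every prefix is correct regardless of what the model would have predicted, whereas in `inference_prefixes` an early mistake stays in the prefix for every later step.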