Autoregressive Generation

A generation paradigm that produces outputs one discrete step at a time, each step conditioned on everything generated so far.
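The definition above corresponds to the standard chain-rule factorization of a sequence probability. The notation here is an assumption for illustration: x_t is the unit produced at step t, x_{<t} is everything generated before it, and T is the final length.

```latex
p(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p\left(x_t \mid x_{<t}\right),
\qquad x_{<t} = (x_1, \dots, x_{t-1})
```

Each factor on the right is one generation step: the model scores the next unit given the prefix, and the chosen unit then becomes part of the prefix for the following factor.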

What It Is

At inference, the model holds a growing prefix of tokens (or other discrete units). At each step the decoder computes logits over the vocabulary, softmax turns them into probabilities, one unit is chosen from that distribution (for example by taking the most probable unit or by sampling), and the chosen unit is appended to the prefix. KV caches store the keys and values of past positions so they are not recomputed at every step. Encoder–decoder setups still autoregress on the decoder side while attending to a fixed encoder memory.
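A single step of this loop can be sketched in plain Python. This is a minimal illustration, not any library's API: `decode_step`, `logits_fn`, and `toy_logits` are hypothetical names, the "model" is a toy function over a 4-token vocabulary, and the choice between greedy selection and sampling is an assumption added here.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_step(prefix, logits_fn, rng=None):
    """One autoregressive step: score, normalize, choose, append.

    logits_fn is a hypothetical stand-in for the model. It maps the
    current prefix (a list of token IDs) to one logit per vocabulary
    entry. With rng=None the choice is greedy (argmax); otherwise the
    next ID is sampled from the softmax distribution.
    """
    probs = softmax(logits_fn(prefix))
    if rng is None:
        next_id = max(range(len(probs)), key=probs.__getitem__)
    else:
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return prefix + [next_id], probs

def toy_logits(prefix):
    """Toy model: strongly prefers (last token + 1) mod 4."""
    target = (prefix[-1] + 1) % 4
    return [3.0 if i == target else 0.0 for i in range(4)]

# Greedy step from the prefix [0]: the toy model prefers token 1.
prefix, probs = decode_step([0], toy_logits)

# Sampled step: same distribution, but the choice is random.
sampled_prefix, _ = decode_step([0], toy_logits, rng=random.Random(0))
```

Calling `decode_step` repeatedly on its own output gives the full generation loop. A real decoder would also reuse cached keys and values between calls rather than rescoring the whole prefix.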

Why It Matters

Autoregressive generation is the dominant paradigm for text and code, and it is the output mode that most decoder and encoder–decoder pages reference. It connects the token chain (token IDs, embeddings, logits, softmax) to architecture choices such as causal attention, KV-cache sizing, and grouped-query attention for efficient decoding.
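To make the KV-cache sizing point concrete, here is a rough back-of-the-envelope sketch. It assumes the usual accounting of one key and one value vector per layer, per KV head, per cached position. The function name and the configuration numbers are hypothetical and do not describe any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV-cache size in bytes.

    The factor 2 counts keys and values. bytes_per_value=2 assumes
    16-bit storage; other precisions change the total proportionally.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration: 32 layers, head_dim 128, 4096 cached positions.
# Full multi-head attention keeps one KV head per query head (32 here).
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)

# Grouped-query attention shares each KV head across several query heads,
# so fewer KV heads (8 here) need to be cached.
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

print(mha / 2**30, gqa / 2**30)  # sizes in GiB
```

Under these assumed numbers the cache shrinks in proportion to the number of KV heads, which is why grouped-query attention matters for long-prefix decoding.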

Simple Example

Given the prefix "The capital of France is", the model outputs logits for the next token, softmax peaks on "Paris", that token is appended, and the process repeats until an end-of-sequence token is produced or a length limit is reached.
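The example can be traced with a toy greedy loop. Everything model-like here is made up for illustration: `TOY_LOGITS` is a hypothetical lookup table keyed only by the last token (a real model conditions on the whole prefix), and `<eos>` stands in for the end-of-sequence token.

```python
import math

EOS = "<eos>"

# Hypothetical next-token logits, keyed by the last token only.
TOY_LOGITS = {
    "is": {"Paris": 5.0, "Lyon": 1.0, EOS: 0.0},
    "Paris": {"Paris": 0.0, "Lyon": 0.0, EOS: 4.0},
    "Lyon": {"Paris": 0.0, "Lyon": 0.0, EOS: 4.0},
}

def softmax(scores):
    """Turn a token -> logit mapping into a token -> probability mapping."""
    m = max(scores.values())
    exps = {tok: math.exp(s - m) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def generate(prefix, max_new_tokens=10):
    """Greedy autoregressive loop: stop at EOS or at the length limit."""
    tokens = list(prefix)
    for _ in range(max_new_tokens):
        probs = softmax(TOY_LOGITS[tokens[-1]])
        next_token = max(probs, key=probs.get)  # pick the softmax peak
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

output = generate(["The", "capital", "of", "France", "is"])
print(output)
```

With this table the loop appends "Paris" and then the end-of-sequence token. Lowering `max_new_tokens` to 1 shows the length limit ending generation instead.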

Common Confusions

Autoregressive generation is not the same thing as a decoder block: decoders are architecture components, while "autoregressive" describes the step-by-step sampling loop. It is also not identical to teacher-forced training, where ground-truth prefixes are fed during optimization even though inference is autoregressive. Diffusion sampling loops over noise timesteps but is not token-autoregressive unless the model emits discrete tokens each step.
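The teacher-forcing distinction comes down to which prefix the model sees at each step. The sketch below contrasts the two and is illustrative only: the function names are hypothetical, and `always_a` is a deliberately imperfect stand-in for a model.

```python
def teacher_forced_pairs(target):
    """Training view: the prefix at step t is the ground-truth target[:t]."""
    return [(target[:t], target[t]) for t in range(1, len(target))]

def free_running_prefixes(start, next_token_fn, steps):
    """Inference view: each prefix contains the model's own earlier outputs."""
    tokens = list(start)
    prefixes = []
    for _ in range(steps):
        prefixes.append(list(tokens))
        tokens.append(next_token_fn(tokens))
    return prefixes

def always_a(prefix):
    """Hypothetical imperfect model that predicts "a" regardless of prefix."""
    return "a"

target = ["<bos>", "a", "b", "c"]

# Teacher forcing: prefixes always come from the reference sequence.
pairs = teacher_forced_pairs(target)

# Autoregressive inference: the model's mistake at step 2 stays in the prefix.
prefixes = free_running_prefixes(["<bos>"], always_a, steps=3)
```

In the teacher-forced pairs the third prefix ends in the reference token "b", while in the free-running case it ends in the model's own "a". The loop structure looks similar, but only the second is autoregressive generation.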
