A network stack that turns internal representations into outputs (tokens, pixels, or structured predictions), often one step at a time while attending to prior context.
What It Is
Decoders include causal transformer stacks for language modeling, transposed convolutions or U-Net upsampling paths in image models, and cross-attention layers that query encoder memory in seq2seq systems. They consume representations: they do not embed or patchify raw inputs themselves unless the architecture is decoder-only end to end, in which case the same stack also ingests the embedded input.
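The causal constraint in a transformer decoder can be sketched in a few lines. This is a minimal illustration, assuming single-head scaled dot-product attention with identity projections and no learned weights; the function name `causal_self_attention` and the toy vectors are hypothetical, not taken from any library.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of equal-length float vectors, one per position.
    """
    d = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Causal mask: position i only scores positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        out = [
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

# Toy sequence of three 2-d token vectors (assumed values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
full = causal_self_attention(x, x, x)

# Replace only the last token: earlier outputs must not change.
x_changed = x[:2] + [[5.0, -3.0]]
changed = causal_self_attention(x_changed, x_changed, x_changed)
```

Because each position ignores everything to its right, `full[0]` and `full[1]` are identical to `changed[0]` and `changed[1]`; only the last output differs.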
Why It Matters
Naming decoders separately from encoders shows where generation happens, why KV caches (the stored keys and values of already-processed positions) save recomputation during step-by-step inference, and why encoder–decoder pairs differ from decoder-only GPT-style models. It also sets up autoregressive generation as the paradigm most decoders follow in language settings.
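To make the KV-cache idea concrete, here is a small sketch under simplifying assumptions: one layer, one head, and identity projections so each token vector serves as its own query, key, and value. The `KVCache` class and `attend` helper are hypothetical names for illustration only.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Attention output for one query over the given keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [
        sum(w * v[c] for w, v in zip(weights, values))
        for c in range(len(values[0]))
    ]

class KVCache:
    """Hypothetical single-layer, single-head cache of past keys and values."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Append only the newest position; earlier entries are reused as-is.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Toy token vectors (assumed values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Incremental decoding: one new token per step, reusing the cache.
cache = KVCache()
incremental = [cache.step(t, t, t) for t in tokens]

# Baseline: rebuild the full prefix from scratch at every step.
recomputed = [
    attend(tokens[i], tokens[: i + 1], tokens[: i + 1])
    for i in range(len(tokens))
]
```

Both paths produce the same outputs, but the cached path handles each past position once instead of once per later step. In a real model the saving comes from not re-running the key and value projections for the whole prefix.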
Simple Example
A machine-translation decoder attends to encoder memory while emitting target tokens left to right. A VAE decoder maps a latent vector back to a 64×64 image. A GPT-style stack is decoder-only: the same blocks both represent context and predict the next token.
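The left-to-right emission in the translation example can be sketched as a greedy decoding loop. Everything model-like here is a stand-in: `toy_decoder_step` and `LEXICON` are hypothetical and replace a trained network with a lookup, so the sketch shows only the control flow of autoregressive generation.

```python
BOS, EOS = "<bos>", "<eos>"

# Hypothetical toy lexicon standing in for learned weights.
LEXICON = {"le": "the", "chat": "cat", "dort": "sleeps"}

def toy_decoder_step(memory, prefix):
    """Stand-in for one decoder forward pass: returns a score per token.

    A real decoder would cross-attend over `memory`; this toy simply reads
    the source position aligned with the number of tokens emitted so far.
    """
    position = len(prefix) - 1  # prefix always starts with BOS
    target = LEXICON[memory[position]] if position < len(memory) else EOS
    vocab = list(LEXICON.values()) + [EOS]
    return {tok: (1.0 if tok == target else 0.0) for tok in vocab}

def greedy_decode(memory, max_len=10):
    """Emit at most `max_len` tokens, feeding each choice back as context."""
    prefix = [BOS]
    while len(prefix) <= max_len:
        scores = toy_decoder_step(memory, prefix)
        best = max(scores, key=scores.get)  # greedy: take the top score
        if best == EOS:
            break
        prefix.append(best)
    return prefix[1:]  # drop BOS

translation = greedy_decode(["le", "chat", "dort"])
```

Each iteration conditions on everything emitted so far, which is what makes the loop autoregressive. Swapping the `max` for sampling or beam search changes the search strategy but not this structure.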
Common Confusions
A decoder is not always autoregressive—some decoders reconstruct in one shot—but language-model decoders almost always are. A decoder is also not the softmax head alone: the head sits on top of the decoder stack. Denoising generation is a separate paradigm that may reuse decoder-like U-Nets without autoregressive token steps.
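The split between the decoder stack and the softmax head can be illustrated with a small sketch. The hidden state, vocabulary, and weight values below are made up for illustration, and `output_head` is a hypothetical name; a real model would learn the projection and feed it the stack's final hidden state.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def output_head(hidden, unembedding):
    """Project one hidden vector to vocabulary logits, then normalize.

    `unembedding` holds one weight row per vocabulary entry.
    """
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in unembedding]
    return softmax(logits)

vocab = ["cat", "dog", "<eos>"]
unembedding = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]  # assumed weights

# Stand-in for what the decoder stack would output at the last position.
hidden = [1.0, 0.5]

probs = output_head(hidden, unembedding)
predicted = vocab[probs.index(max(probs))]
```

The head is a single projection plus normalization. All of the context handling happens earlier, in the stack that produced `hidden`.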