An architecture in which an encoder maps the input into intermediate representations and a decoder generates the output from them, often via cross-attention between the two stacks.
What It Is
Classic transformer translation uses a bidirectional encoder over the source and a causal decoder over the target with cross-attention into encoder states. Variants appear in speech (encoder over audio, decoder over text), multimodal captioning, and T5-style text-to-text frameworks where both sides are token sequences.
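The cross-attention step can be illustrated with a minimal sketch in plain Python. It assumes a single attention head and skips the learned query, key, and value projections a real transformer would apply; the names `cross_attention`, `encoder_states`, and `decoder_states` are illustrative, not taken from any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """For each decoder position, return a weighted average of encoder states.

    Assumption: decoder states act directly as queries and encoder states
    act as both keys and values (no learned projections).
    """
    dim = len(encoder_states[0])
    contexts = []
    for query in decoder_states:
        # Scaled dot-product score between this query and every encoder state.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in encoder_states
        ]
        weights = softmax(scores)
        # Weighted sum of encoder states, one output dimension at a time.
        context = [
            sum(w * value[i] for w, value in zip(weights, encoder_states))
            for i in range(dim)
        ]
        contexts.append(context)
    return contexts

# Three source positions, two target positions (toy numbers).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[2.0, 0.0], [0.0, 2.0]]
contexts = cross_attention(decoder_states, encoder_states)
```

The result has one context vector per decoder position, regardless of how many encoder positions there are, which is why source and target lengths can differ.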
Why It Matters
The split lets each stack do a different job: the encoder attends bidirectionally over the full input, while the decoder stays causal so that it can generate one token at a time. This makes autoregressive generation the natural output mode for the decoder, and it contrasts with decoder-only models, which fold encoding and decoding into a single causal stack.
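The difference between the two stacks shows up in their self-attention masks. The sketch below builds both as boolean matrices, where entry `[i][j]` says whether position `i` may attend to position `j`; the helper names `encoder_mask` and `causal_mask` are hypothetical.

```python
def encoder_mask(n):
    """Bidirectional mask: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Causal mask: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

enc = encoder_mask(3)
dec = causal_mask(3)
```

With `n = 3`, the encoder mask is all `True`, while the causal mask is lower-triangular, so the first decoder position sees only itself.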
Simple Example
English→French translation encodes the English sentence into memory vectors once, then the decoder generates French tokens one at a time while attending to those vectors at each step. The encoder never emits French tokens, and the decoder never reads the raw English tokens directly; it sees the source only through the encoder's output vectors.
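The control flow of this example (encode once, then decode step by step) can be sketched with a toy stand-in. Everything here is an assumption made for illustration: `SRC_VOCAB` and `TGT_BY_ID` replace learned parameters, the "memory" is a list of integer ids rather than real vectors, and the decoder uses a hard one-to-one alignment instead of learned attention.

```python
# Hypothetical lookup tables standing in for learned model parameters.
SRC_VOCAB = {"the": 0, "cat": 1, "sleeps": 2}
TGT_BY_ID = {0: "le", 1: "chat", 2: "dort"}
EOS = "<eos>"

def encode(source_tokens):
    """Run once: map source tokens to opaque memory entries."""
    return [SRC_VOCAB[tok] for tok in source_tokens]

def decode_step(memory, generated):
    """Choose the next target token from memory and the tokens produced so far."""
    step = len(generated)
    if step >= len(memory):
        return EOS
    # Toy "attention": look only at the memory entry aligned with this step.
    return TGT_BY_ID[memory[step]]

def translate(source_tokens, max_len=10):
    memory = encode(source_tokens)  # the encoder runs exactly once
    generated = []
    for _ in range(max_len):  # the decoder runs once per output token
        next_token = decode_step(memory, generated)
        if next_token == EOS:
            break
        generated.append(next_token)
    return generated

result = translate(["the", "cat", "sleeps"])
print(result)  # ['le', 'chat', 'dort']
```

Note that `decode_step` receives only `memory` and the previously generated tokens, never the source strings, which mirrors the separation described above.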
Common Confusions
Encoder–decoder is an architecture pattern, not a training objective; you still need a loss and a data pipeline. It is also distinct from encoder-only BERT models (no decoder) and decoder-only GPT models (no separate encoder). T5 is encoder–decoder even when input and output are both text.