Encoder-Decoder

An architecture in which an encoder maps the input to intermediate representations and a decoder generates the output from them, usually by cross-attending to the encoder's states.

What It Is

The classic transformer translation model pairs a bidirectional encoder over the source with a causal decoder over the target; the decoder reaches the source through cross-attention into the encoder's output states. Variants appear in speech (encoder over audio, decoder over text), multimodal captioning, and T5-style text-to-text frameworks where both sides are token sequences.
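The three attention patterns described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real transformer layer: the `attention` and `softmax` helpers, the toy `source` and `target` vectors, and the absence of learned projections, multiple heads, and residual connections are all simplifying assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of vectors.

    With causal=True, query i may only attend to keys 0..i.
    Returns (outputs, weights).
    """
    dim = len(keys[0])
    outputs, weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # masked: future position
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(dim))
        row = softmax(scores)
        weights.append(row)
        outputs.append(
            [sum(w * v[c] for w, v in zip(row, values)) for c in range(len(values[0]))]
        )
    return outputs, weights

# Toy embeddings (hypothetical values): 3 source positions, 2 target positions.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [[0.5, 0.5], [1.0, 0.0]]

# Encoder: bidirectional self-attention, every source position sees all others.
enc_out, enc_w = attention(source, source, source)

# Decoder: causal self-attention, target position i sees only positions <= i.
dec_self, dec_w = attention(target, target, target, causal=True)

# Cross-attention: decoder states query the encoder's output states.
cross_out, cross_w = attention(dec_self, enc_out, enc_out)

print(dec_w[0])  # [1.0, 0.0]: the first target position cannot see the second
```

Each row of `cross_w` has one weight per source position, which is how the decoder reads the whole input while still generating causally.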

Why It Matters

The split determines what each half of the model can see: the encoder attends over the full input at once, while the decoder stays causal so that it can generate one token at a time. That makes autoregressive generation the natural output mode for the decoder, and it contrasts with decoder-only models, which fold encoding and decoding into a single causal stack.
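The contrast with decoder-only models can be made concrete by writing out which positions may attend to which. The sketch below assumes a plain causal mask for the decoder-only case (some models use other masking schemes), and the helper names `causal_mask` and `full_mask` are invented for this illustration.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def full_mask(rows, cols):
    """Every row position may attend to every column position."""
    return [[True] * cols for _ in range(rows)]

src_len, tgt_len = 3, 2

# Encoder-decoder: three separate visibility patterns.
encoder_self = full_mask(src_len, src_len)   # source sees all of the source
decoder_self = causal_mask(tgt_len)          # target sees only earlier target
cross = full_mask(tgt_len, src_len)          # target sees all of the source

# Decoder-only: one causal mask over the concatenated source + target.
decoder_only = causal_mask(src_len + tgt_len)

# The first source token can look ahead in an encoder, but not in a causal stack.
print(encoder_self[0][2])   # True
print(decoder_only[0][2])   # False
```

In both designs the target positions can see the entire source; the difference is that source positions in a plain causal stack cannot see later source positions.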

Simple Example

English→French translation encodes the English sentence into memory vectors, then the decoder generates French tokens one at a time while attending to those vectors at each step. The encoder never emits French tokens; the decoder never reads the raw English tokens, only the encoder's output vectors.
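The control flow of this example (encode once, then decode step by step) can be sketched as follows. Everything here is a stand-in: `TOY_LEXICON`, `encode`, `decode_step`, and `translate` are hypothetical names, and the word-for-word lookup replaces a trained model purely to keep the loop runnable. Real translation is not a monotonic dictionary lookup.

```python
# Hypothetical three-word lexicon standing in for a trained model.
TOY_LEXICON = {"the": "le", "cat": "chat", "sleeps": "dort"}

def encode(source_tokens):
    """Run once per input: produce the 'memory' the decoder will read.

    A real encoder returns one vector per source position; this toy
    version just keeps (position, token) pairs.
    """
    return [(i, tok) for i, tok in enumerate(source_tokens)]

def decode_step(memory, generated):
    """Choose the next target token from the memory and the target prefix."""
    step = len(generated) - 1          # tokens emitted so far, excluding <bos>
    if step >= len(memory):
        return "<eos>"
    _, src_tok = memory[step]          # stand-in for cross-attention into memory
    return TOY_LEXICON.get(src_tok, "<unk>")

def translate(source_tokens, max_len=10):
    memory = encode(source_tokens)     # the encoder runs exactly once
    generated = ["<bos>"]
    for _ in range(max_len):           # the decoder runs once per output token
        next_token = decode_step(memory, generated)
        if next_token == "<eos>":
            break
        generated.append(next_token)
    return generated[1:]

print(translate(["the", "cat", "sleeps"]))  # ['le', 'chat', 'dort']
```

Note that `decode_step` receives only `memory` and the target prefix, never `source_tokens` itself, which mirrors the separation between the two stacks.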

Common Confusions

Encoder-decoder is an architecture pattern, not a training objective; you still need a loss and a data pipeline. It is also distinct from encoder-only BERT models (no decoder) and decoder-only GPT models (no separate encoder). T5 is encoder-decoder even when input and output are both text.
