Feed-forward network

What It Is

A feed-forward network (FFN), also called an MLP block, is a small two-layer perceptron applied independently to every token position. After attention mixes information across positions, the FFN takes each position's hidden vector, expands it through an intermediate width, applies a nonlinearity, and projects back to the model width. The same weights run at every position—only the input vector changes.
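As a minimal sketch of the description above, the following NumPy code expands each position's vector, applies a ReLU, and projects back to the model width. The sizes, the random weights, and the names `ffn`, `w_up`, and `w_down` are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

def ffn(x, w_up, b_up, w_down, b_down):
    """Position-wise FFN: expand, apply ReLU, project back.

    x has shape (seq_len, d_model); every row is one token position.
    """
    hidden = np.maximum(x @ w_up + b_up, 0.0)  # (seq_len, d_ff)
    return hidden @ w_down + b_down            # (seq_len, d_model)

# Illustrative sizes and random weights (assumptions for this sketch).
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
x = rng.normal(size=(seq_len, d_model))
w_up = rng.normal(size=(d_model, d_ff)) * 0.1
b_up = np.zeros(d_ff)
w_down = rng.normal(size=(d_ff, d_model)) * 0.1
b_down = np.zeros(d_model)

y = ffn(x, w_up, b_up, w_down, b_down)

# Running one position on its own gives the same result as the batched call,
# because the same weights are applied to each row independently.
y_single = ffn(x[2:3], w_up, b_up, w_down, b_down)[0]
print(y.shape, np.allclose(y[2], y_single))
```

The matrix product handles all positions in one call, but that is only a convenience: each output row depends on its own input row and the shared weights.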

Why It Matters

Attention decides what each position can read from the sequence; the FFN decides how to transform what was read into richer features. Most transformer blocks alternate these two steps, so recognizing the FFN slot helps you read architecture diagrams and spot when a model swaps a dense MLP for a mixture-of-experts layer while keeping the block shape.

Simple Example

Imagine a decoder block after self-attention has updated every token vector. The FFN runs the same recipe on each vector: multiply by a weight matrix that widens it to the intermediate size, apply a nonlinearity (ReLU-style, or a gated variant that uses an additional up-projection), then multiply by a second matrix that returns it to the hidden size. Position three's vector never mixes with position seven's inside the FFN; that mixing already happened in attention.
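One way to check the claim about positions three and seven is to perturb one position's input and see which outputs move. The sketch below assumes a bias-free ReLU FFN with random weights and uses zero-based indices 3 and 7; all names and sizes are hypothetical.

```python
import numpy as np

def ffn(x, w_up, w_down):
    """Bias-free ReLU FFN applied row by row to x of shape (seq_len, d_model)."""
    return np.maximum(x @ w_up, 0.0) @ w_down

# Illustrative sizes and random weights (assumptions for this sketch).
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 8, 4, 16
x = rng.normal(size=(seq_len, d_model))
w_up = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))

y = ffn(x, w_up, w_down)

# Change only the input at position 7 and rerun the FFN.
x_changed = x.copy()
x_changed[7] += 10.0
y_changed = ffn(x_changed, w_up, w_down)

pos3_unchanged = np.allclose(y[3], y_changed[3])
pos7_changed = not np.allclose(y[7], y_changed[7])
print("position 3 unchanged:", pos3_unchanged)
print("position 7 changed:", pos7_changed)
```

The same experiment run through an attention layer would generally change other positions too, which is the practical difference between the two sublayers.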

Common Confusions

The FFN is not attention: it does not look at other tokens. It is also not the language-model head at the stack top—that head maps final hidden states to vocabulary logits. A mixture-of-experts layer replaces the dense FFN with routed expert MLPs but still sits in the same block slot after attention.
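The shape differences behind these confusions can be sketched in a few lines. The code below is a simplified illustration under stated assumptions: the router uses plain top-1 selection with no gate weighting or load balancing, and every name (`dense_ffn`, `moe_ffn`, `router_w`, `lm_head_w`) and size is hypothetical. Real mixture-of-experts layers typically weight expert outputs by router probabilities and often route each token to more than one expert.

```python
import numpy as np

def dense_ffn(x, w_up, w_down):
    """Dense ReLU FFN: (n, d_model) -> (n, d_model)."""
    return np.maximum(x @ w_up, 0.0) @ w_down

def moe_ffn(x, router_w, experts):
    """Simplified top-1 routing: each position is sent to one expert MLP."""
    choice = np.argmax(x @ router_w, axis=-1)  # (seq_len,) expert index per position
    out = np.zeros_like(x)
    for e, (w_up, w_down) in enumerate(experts):
        mask = choice == e
        if mask.any():
            out[mask] = dense_ffn(x[mask], w_up, w_down)
    return out

# Illustrative sizes and random weights (assumptions for this sketch).
rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_experts, vocab_size = 6, 8, 32, 4, 50
x = rng.normal(size=(seq_len, d_model))
experts = [
    (rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
    for _ in range(n_experts)
]
router_w = rng.normal(size=(d_model, n_experts))
lm_head_w = rng.normal(size=(d_model, vocab_size))

dense_out = dense_ffn(x, *experts[0])    # stays at model width
moe_out = moe_ffn(x, router_w, experts)  # same slot, same output shape
logits = x @ lm_head_w                   # LM head maps to vocabulary size

print(dense_out.shape, moe_out.shape, logits.shape)
```

The dense and routed versions return tensors of the same shape, which is why one can replace the other without changing the rest of the block, while the language-model head changes the last dimension from model width to vocabulary size.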
