Glossary

Patch

A fixed-size chunk of input data—pixels, spectrogram frames, or grouped tokens—that an encoder turns into a representation vector.

What It Is

In vision transformers, an image is split into a grid of non-overlapping or lightly overlapping patches; each patch is flattened and linearly projected into a token-like vector, the patch embedding. In audio, windows of spectrogram frames play the same role; in video, spatiotemporal blocks do. Patches sit between raw modality data and the internal representation tensors that deeper blocks consume.
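The splitting and projection steps described above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular model's implementation: the helper name `patchify`, the embedding width `embed_dim = 64`, and the random projection matrix (a stand-in for learned weights) are all assumptions made for the example, and only the non-overlapping case is covered.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping, flattened patches.

    Returns an array of shape (num_patches, patch * patch * C), with
    patches ordered row by row across the grid.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # (H, W, C) -> (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C)
    grid = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return grid.reshape(gh * gw, patch * patch * c)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # stand-in for a real RGB image

patches = patchify(image, 16)              # one row per patch

embed_dim = 64                             # hypothetical embedding width
projection = rng.normal(size=(patches.shape[1], embed_dim))  # stand-in for learned weights
patch_embeddings = patches @ projection    # one token-like vector per patch

print(patches.shape, patch_embeddings.shape)
```

Each row of `patch_embeddings` depends only on the pixels of its own patch; mixing information across patches is left to the deeper blocks.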

Why It Matters

Patch granularity controls compute and inductive bias: smaller patches preserve detail but increase sequence length (halving the patch side quadruples the number of patches), while larger patches shorten the sequence but fold more of the input into each token. Naming patches separately from latents and from full representations keeps vision, diffusion, and multimodal docs aligned with how encoder diagrams are drawn.
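To make the trade-off concrete, the sketch below counts patches for a few patch sizes on a 224×224 input. The helper `patch_count` is hypothetical, and the pairwise column assumes full self-attention over the patch sequence, which is one common source of cost but not the only one.

```python
def patch_count(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping patches of side `patch` in a height x width input."""
    return (height // patch) * (width // patch)


rows = []
for patch in (32, 16, 8):
    n = patch_count(224, 224, patch)
    # Full self-attention compares every token with every token: n * n pairs.
    rows.append((patch, n, n * n))

for patch, n, pairs in rows:
    print(f"patch={patch:>2}  tokens={n:>4}  attention pairs={pairs}")
```

Going from 16×16 to 8×8 patches multiplies the token count by 4 and the number of attention pairs by 16.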

Simple Example

A 224×224 image with 16×16 pixel patches yields a 14×14 grid, that is, 196 patch tokens (197 once a class token is prepended). Each patch embedding is a representation of that local region, not the whole image.
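The count in this example can be checked with a short sketch. The embedding width `embed_dim = 64` and the zero-filled arrays are placeholders chosen for illustration; in a trained model both the patch embeddings and the class token would hold learned values.

```python
import numpy as np

image_side, patch_side = 224, 16
grid_side = image_side // patch_side        # patches along each axis
num_patches = grid_side * grid_side         # patch tokens in the grid

embed_dim = 64                              # hypothetical embedding width
patch_embeddings = np.zeros((num_patches, embed_dim))  # placeholder values
cls_token = np.zeros((1, embed_dim))                   # placeholder for a learned vector

# Prepend the class token to the patch sequence.
sequence = np.concatenate([cls_token, patch_embeddings], axis=0)

print(grid_side, num_patches, sequence.shape)
```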

Common Confusions

A patch is not a latent: patches are input-side chunks, while a latent is a compressed internal code produced by encoding. A patch is also not a token in the language-model sense unless the architecture explicitly equates patch embeddings with discrete token IDs. Patch size is not model capacity; it is an input tiling choice, although it does change sequence length and therefore compute.
