The kind of input or output data a model is built to handle, such as text, images, audio, or video.
What It Is
A modality is a data type the model consumes or produces. Unimodal models specialize in one channel; multimodal models align two or more, such as image patches with text tokens. Modality sits above representation: it names what is encoded, not the internal vector format.
Why It Matters
Modality tells you which encoders, tokenizers, and fusion modules to expect before you open a family page. It also separates "handles images" from "is a diffusion architecture" or "is a foundation model."
Simple Example
A speech-to-text system pairs an audio waveform modality with a text output modality. A text-only LLM has a single text modality end to end even when users paste images in a chat UI—the product may be multimodal while the core model is not.
Common Confusions
Modality is not architecture: a transformer can be text-only or multimodal depending on inputs. Modality is also not representation—a modality is the raw channel, while representation is the learned internal form after encoding.