Glossary

Modality

The kind of input or output data a model is built to handle, such as text, images, audio, or video.

What It Is

A modality is a data type the model consumes or produces. Unimodal models specialize in one channel; multimodal models align two or more, such as image patches with text tokens. Modality sits above representation: it names what is encoded, not the internal vector format.

Why It Matters

Modality tells you which encoders, tokenizers, and fusion modules to expect before you open a family page. It also separates "handles images" from "is a diffusion architecture" or "is a foundation model."

Simple Example

A speech-to-text system takes an audio waveform as its input modality and produces text as its output modality. A text-only LLM has a single text modality end to end, even when users paste images into a chat UI: the product may be multimodal while the core model is not.
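One way to make the two cases concrete is to record modalities as plain metadata, separate from architecture. The sketch below is illustrative only: `Modality`, `ModelSpec`, and `chat_ui_accepts` are hypothetical names, and the architecture strings are placeholder assumptions rather than descriptions of any real system.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical record that keeps modality separate from architecture."""
    name: str
    architecture: str      # placeholder label, assumed for illustration
    inputs: frozenset      # modalities the model consumes
    outputs: frozenset     # modalities the model produces

    def modalities(self) -> frozenset:
        # Every modality the model touches, on either side.
        return self.inputs | self.outputs


# Audio in, text out.
speech_to_text = ModelSpec(
    name="speech-to-text",
    architecture="transformer",
    inputs=frozenset({Modality.AUDIO}),
    outputs=frozenset({Modality.TEXT}),
)

# Text in, text out: a single modality end to end.
text_llm = ModelSpec(
    name="text-only LLM",
    architecture="transformer",
    inputs=frozenset({Modality.TEXT}),
    outputs=frozenset({Modality.TEXT}),
)

# What the surrounding product accepts is tracked apart from the core model.
chat_ui_accepts = frozenset({Modality.TEXT, Modality.IMAGE})
images_reach_core_model = Modality.IMAGE in text_llm.inputs
```

Here both specs share the same architecture label yet differ in modality, and the chat UI accepts images that the core model does not list among its inputs.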

Common Confusions

Modality is not architecture: a transformer can be text-only or multimodal depending on its inputs. Modality is also not representation: a modality is the raw channel, while a representation is the learned internal form after encoding.
