Modality

The kind of input or output data a model is built to handle, such as text, images, audio, or video.

What It Is

A modality is a data type the model consumes or produces. Unimodal models specialize in one channel; multimodal models align two or more, such as image patches with text tokens. Modality sits above representation: it names what is encoded, not the internal vector format.

Why It Matters

Modality tells you which encoders, tokenizers, and fusion modules to expect before you open a family page. It also separates "handles images" from "is a diffusion architecture" or "is a foundation model."

Simple Example

A speech-to-text system pairs an audio input modality (a waveform) with a text output modality. A text-only LLM has a single text modality end to end, even when users paste images into a chat UI: the product may be multimodal, for example if a separate component converts the image to text first, while the core model is not.
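The two systems in this example can be written down as a small data structure. The following is a minimal sketch, not an established API: `Modality`, `ModelSpec`, and the method names are hypothetical and exist only to make the input/output distinction concrete.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Modality(Enum):
    """Data types a model may consume or produce (illustrative list)."""
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical record of what a model takes in and what it emits."""
    name: str
    inputs: FrozenSet[Modality]
    outputs: FrozenSet[Modality]

    def modalities(self) -> FrozenSet[Modality]:
        # Every modality the model touches, on either side.
        return self.inputs | self.outputs

    def is_single_modality(self) -> bool:
        # True when one modality covers both input and output.
        return len(self.modalities()) == 1


speech_to_text = ModelSpec(
    name="speech-to-text",
    inputs=frozenset({Modality.AUDIO}),
    outputs=frozenset({Modality.TEXT}),
)
text_llm = ModelSpec(
    name="text-only LLM",
    inputs=frozenset({Modality.TEXT}),
    outputs=frozenset({Modality.TEXT}),
)

print(sorted(m.value for m in speech_to_text.modalities()))  # ['audio', 'text']
print(text_llm.is_single_modality())                         # True
```

Note that the spec describes the core model only. A chat product that accepts pasted images would need its own description, separate from the `text_llm` entry.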

Common Confusions

Modality is not architecture: a transformer can be text-only or multimodal depending on the inputs it is built to accept. Modality is also not representation: a modality is the raw data channel, while a representation is the learned internal form after encoding.
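The modality/representation split can be illustrated with a toy sketch. The two encoders below are deliberately simplistic stand-ins, not how real text or audio encoders work, and `EMBED_DIM`, `encode_text`, and `encode_audio` are assumed names. The point is only that inputs from different modalities can end up in the same internal vector format.

```python
from typing import List

EMBED_DIM = 4  # assumed toy embedding width


def encode_text(text: str) -> List[float]:
    """Toy text encoder: folds character codes into EMBED_DIM slots."""
    vec = [0.0] * EMBED_DIM
    for i, ch in enumerate(text):
        vec[i % EMBED_DIM] += ord(ch) / 1000.0
    return vec


def encode_audio(samples: List[float]) -> List[float]:
    """Toy audio encoder: averages waveform samples into EMBED_DIM slots."""
    sums = [0.0] * EMBED_DIM
    counts = [0] * EMBED_DIM
    for i, sample in enumerate(samples):
        sums[i % EMBED_DIM] += sample
        counts[i % EMBED_DIM] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


# Two different modalities go in...
text_repr = encode_text("hello")
audio_repr = encode_audio([0.1, -0.2, 0.3, 0.0, 0.05, -0.1])

# ...and both come out in the same representation format.
print(len(text_repr), len(audio_repr))  # 4 4
```

Knowing that a vector has four floats says nothing about whether it came from text or audio, which is why the modality has to be named separately from the representation.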
