Token

What It Is

A token is one entry in the model's discrete vocabulary: the unit of text the model reads and predicts. Subword tokenizers split rare words into smaller pieces, which keeps the vocabulary finite while still covering open-vocabulary text.
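As a rough illustration of subword splitting, the sketch below uses a greedy longest-match rule over a tiny hand-written vocabulary. The vocabulary and the `tokenize` helper are hypothetical. Production tokenizers learn their vocabularies from data, and BPE in particular applies learned merges rather than this lookup.

```python
# Toy vocabulary: hypothetical, chosen only for illustration.
VOCAB = {"token", "ization", "un", "believ", "able"}


def tokenize(word, vocab):
    """Greedy longest-match split; falls back to single characters."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: emit one character so any input is covered.
            pieces.append(word[i])
            i += 1
    return pieces


print(tokenize("tokenization", VOCAB))  # ['token', 'ization']
print(tokenize("unbelievable", VOCAB))  # ['un', 'believ', 'able']
print(tokenize("xyz", VOCAB))           # ['x', 'y', 'z']
```

The single-character fallback is what lets a finite vocabulary cover any input string, at the cost of more tokens for unfamiliar text.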

Why It Matters

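Context length is measured in tokens, so tokenization determines how much text fits in a window, shapes cost per request, and decides whether two strings look identical to the model. A tokenizer mismatch between training and serving is a common integration bug: the model receives IDs that mean something different from what it learned.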
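A minimal sketch of the mismatch bug, assuming two hypothetical piece-to-ID tables that contain the same pieces in a different order. Real mismatches are usually subtler (a different vocabulary version, or missing special tokens), but the effect is the same: the IDs no longer mean what the model expects.

```python
# Two hypothetical piece-to-ID tables standing in for a training and a serving tokenizer.
TRAIN_VOCAB = {"hello": 0, " world": 1, "!": 2}
SERVE_VOCAB = {"hello": 0, "!": 1, " world": 2}  # same pieces, different ID order


def encode(pieces, vocab):
    """Map each text piece to its integer ID."""
    return [vocab[p] for p in pieces]


def decode(ids, vocab):
    """Map IDs back to text using the inverse of the given table."""
    id_to_piece = {i: p for p, i in vocab.items()}
    return "".join(id_to_piece[i] for i in ids)


pieces = ["hello", " world", "!"]
ids = encode(pieces, SERVE_VOCAB)

print(ids)                       # [0, 2, 1]
print(decode(ids, SERVE_VOCAB))  # hello world!
print(decode(ids, TRAIN_VOCAB))  # hello! world  (what the trained model effectively sees)
```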

Simple Example

The phrase "language models" might become two or three tokens depending on the tokenizer. Each token is mapped to an integer ID, and each ID is looked up in an embedding table before the attention layers run.
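The lookup step can be sketched as plain row indexing. The IDs, the table values, and the `embed` helper below are made up for illustration. A real model's table has one row per vocabulary entry and far more dimensions.

```python
# Hypothetical IDs for "language" and " models" under some assumed tokenizer.
token_ids = [1, 2]

# Tiny made-up embedding table: 3 vocabulary entries, 4 dimensions each.
embedding_table = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, -0.2, 0.3, 0.4],
    [-0.5, 0.6, 0.7, -0.8],
]


def embed(ids, table):
    """Return one embedding vector (table row) per token ID."""
    return [table[i] for i in ids]


vectors = embed(token_ids, embedding_table)
print(vectors)  # two 4-dimensional vectors, one per token
```

The resulting list of vectors, one per token, is what the attention layers operate on.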

How raw text becomes token IDs before the transformer stack

Common Confusions

Tokens are not always words. Byte-level BPE can emit tokens as small as a single byte, which for non-ASCII text may be only part of a character. Token count therefore differs from word count and from Unicode grapheme count.
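To make the difference between these counts concrete, the sketch below measures one short string three ways using only the standard library. The sample string is an arbitrary choice. Grapheme counting needs Unicode segmentation rules that the standard library does not provide, so that figure appears only as a comment.

```python
# "café" written with a combining accent, followed by a thumbs-up emoji.
text = "cafe\u0301 \U0001F44D"

word_count = len(text.split())          # whitespace-separated words
code_points = len(text)                 # Python str length counts code points
utf8_bytes = len(text.encode("utf-8"))  # upper bound for a byte-level tokenizer with no merges

print(word_count)   # 2
print(code_points)  # 7
print(utf8_bytes)   # 11
# A reader perceives 6 graphemes: c, a, f, é, space, and the emoji.
```

A byte-level tokenizer with few learned merges for this text could emit close to 11 tokens, while a word count would report 2.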
