Token
Plain Explanation
A token is a chunk of text that an LLM uses as its processing unit. One English word can be one token, but long words may split into multiple tokens, and spaces or punctuation can become tokens too. That is why word count and token count are not the same.
A tokenizer converts text into an array of token IDs. The model does not read raw letters directly; it reads those IDs and predicts the next likely token. The output is then decoded back from token IDs into text.
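The encode/decode round-trip can be sketched with a tiny hand-made vocabulary. This is a toy illustration, not a real tokenizer: the vocabulary, the `encode`/`decode` helpers, and the pre-split input are all assumptions for the example; real tokenizers learn their vocabularies from data and handle the splitting themselves.

```python
# Hypothetical toy vocabulary mapping token strings to token IDs.
VOCAB = {"un": 0, "believ": 1, "able": 2}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def encode(tokens):
    """Map an already-split token sequence to token IDs."""
    return [VOCAB[t] for t in tokens]

def decode(ids):
    """Map token IDs back to text."""
    return "".join(ID_TO_TOKEN[i] for i in ids)

ids = encode(["un", "believ", "able"])
print(ids)          # [0, 1, 2]
print(decode(ids))  # unbelievable
```

The model only ever sees the list of IDs; text exists at the boundaries, where the tokenizer encodes the input and decodes the output.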
Examples & Analogies
Think of tokens as Lego pieces for text. A word such as "unbelievable" may be split into pieces like "un", "believ", and "able". Fragments that appear frequently in the training data tend to become larger reusable pieces in the vocabulary.
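The Lego-piece splitting can be approximated with a greedy longest-match rule. This is a simplification of how subword tokenizers such as BPE or WordPiece break rare words into known pieces; the `split_subwords` helper and the tiny vocabulary are made up for illustration.

```python
def split_subwords(word, vocab):
    """Greedily match the longest known piece at each position;
    fall back to single characters for unknown spans."""
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown span: emit one character
            i += 1
    return pieces

vocab = {"un", "believ", "able"}
print(split_subwords("unbelievable", vocab))  # ['un', 'believ', 'able']
```

The single-character fallback is why any input can be tokenized, even words the tokenizer has never seen; unfamiliar text simply costs more tokens.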
The same sentence can have different token counts depending on whether it contains English, Korean, code, emojis, or symbols. Pricing pages use units such as "1M input tokens" because tokens map more directly to model work than characters or words do.
At a Glance
| Unit | Meaning | Relationship to tokens |
|---|---|---|
| Character | A visible writing unit | Often too small for efficient model input |
| Word | A human reading unit | Boundaries vary across languages |
| Token | A model processing unit | Basis for cost, context limits, and latency |
| Token ID | Numeric vocabulary index | The actual input used for embedding lookup |
Where and Why It Matters
Token count directly affects LLM cost. System prompts, conversation history, RAG documents, tool outputs, and user messages all add input tokens. Longer answers add output tokens.
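Because pricing is quoted per million tokens, cost estimation is a small arithmetic exercise. The rates below are made-up placeholders, not real provider prices; check the actual pricing page for the model you use.

```python
# Hypothetical placeholder rates in USD per 1M tokens (not real prices).
PRICE_PER_1M_INPUT = 3.00
PRICE_PER_1M_OUTPUT = 15.00

def estimate_cost(input_tokens, output_tokens):
    """Estimate request cost from token counts and per-1M rates."""
    return (input_tokens / 1_000_000 * PRICE_PER_1M_INPUT
            + output_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT)

# e.g. 2,000 prompt tokens (system + history + RAG) and a 500-token answer
print(round(estimate_cost(2_000, 500), 4))  # 0.0135
```

Note that input and output tokens are usually priced differently, so trimming a verbose system prompt and capping output length are separate levers.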
Context windows are also measured in tokens. A model with a 128k context window gives you a token budget across input and output, not a character budget. Long documents should therefore be split by measured token count, not by rough word count.
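Splitting by measured token count can be sketched as: encode the whole document, slice the ID sequence into fixed-size windows, and decode each window back to text. The whitespace-based `encode`/`decode` stand-ins below are assumptions purely to keep the example self-contained; in practice you would pass the real model tokenizer's functions (e.g. from tiktoken or a Hugging Face tokenizer).

```python
def chunk_by_tokens(text, max_tokens, encode, decode):
    """Split text into chunks of at most max_tokens tokens each,
    using the supplied tokenizer's encode/decode functions."""
    ids = encode(text)
    return [decode(ids[i:i + max_tokens])
            for i in range(0, len(ids), max_tokens)]

# Stand-in tokenizer: one token per whitespace-separated word.
encode = lambda text: text.split()
decode = lambda toks: " ".join(toks)

doc = "one two three four five six seven"
print(chunk_by_tokens(doc, 3, encode, decode))
# ['one two three', 'four five six', 'seven']
```

Chunking on decoded token windows rather than raw character counts keeps each chunk within the model's budget regardless of language or script.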
Common Misconceptions
A token is not always a word. The same text may tokenize differently depending on language, tokenizer, and model family.
Reducing tokens does not automatically improve quality. Removing repetition is useful, but removing necessary context can make the model less grounded.
Tokens are not only the visible user input. System instructions, developer messages, tool schemas, retrieved context, and model output can all count against the token budget.
How It Sounds in Conversation
"This prompt burns too many tokens, so we should trim the RAG chunks."
"Lowering the output-token cap cuts cost, but the answer may get truncated."
"For Korean documents, we should measure token count with the actual tokenizer instead of guessing from character count."
References
- Neural Machine Translation of Rare Words with Subword Units (ACL). Foundational paper that popularized BPE-style subword tokenization.
- SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer (EMNLP). Explains subword tokenizer design for languages with and without explicit spaces.
- Tokenizers (Hugging Face Docs). Official docs for tokenizer pipelines, vocabularies, and encode/decode behavior.
- tiktoken (GitHub). OpenAI's tokenizer implementation and examples for BPE-based model tokenization.