A token is the smallest unit in which a large language model (LLM) processes text. Tokens are often not whole words but word parts, character sequences, or individual punctuation marks. They are the basic building blocks that models read and generate, and the unit in which usage is billed.
How it works
Before an LLM can process text, a tokenizer breaks the input into tokens. Each token is assigned a numerical ID from the model's vocabulary, so the model works with these numbers rather than with letters or words.
Common tokenization methods include Byte-Pair Encoding (BPE), WordPiece, and SentencePiece. They strike a balance between two extremes: individual characters (too granular) and whole words (too large a vocabulary). Common words like "and" usually form a single token. Rare or compound words are broken down into multiple parts.
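To make the idea behind BPE concrete, here is a minimal sketch in Python. It is a toy version that works on characters of a tiny word list, not the byte-level implementation that production tokenizers use, and the function names (learn_bpe_merges, merge_pair, tokenize, build_vocabulary) are chosen for illustration only.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of training words."""
    # Every distinct word starts out as a sequence of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # the most frequent pair is merged next
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def tokenize(word, merges):
    """Split `word` into tokens by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def build_vocabulary(words, merges):
    """Map every base character and every merged symbol to a numerical ID."""
    tokens = sorted({ch for w in words for ch in w}) + [a + b for a, b in merges]
    return {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}


# Toy training data: "and" is frequent, "band" is rare.
corpus = ["and"] * 10 + ["band"] * 2
merges = learn_bpe_merges(corpus, num_merges=2)
token_to_id = build_vocabulary(corpus, merges)

print(tokenize("and", merges))   # ['and']
print(tokenize("band", merges))  # ['b', 'and']
print([token_to_id[t] for t in tokenize("band", merges)])  # the IDs the model would see
```

With only two merges, the frequent word ends up as a single token while the rarer word is split into a leftover character plus the learned fragment. The same effect, at a much larger scale, is what separates common words from rare or compound ones.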
You can count the number of tokens in a specific text directly, for example with the OpenAI Tokenizer or the tiktoken library for GPT models.
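Counting with tiktoken might look like the sketch below. It assumes the package is installed (pip install tiktoken) and that the vocabulary file can be downloaded on first use. The helper names count_tokens, token_pieces, and load_tiktoken_encoding are illustrative and not part of the library; the only library calls used are tiktoken.get_encoding and the encoding's encode and decode methods.

```python
def count_tokens(text, encoder):
    """Return how many tokens `encoder` produces for `text`.

    `encoder` is any object with an `encode(str) -> list[int]` method,
    for example a tiktoken Encoding.
    """
    return len(encoder.encode(text))


def token_pieces(text, encoder):
    """Return the text fragment behind each token ID, in order.

    Tokens that hold only part of a multi-byte character may decode to a
    replacement character.
    """
    return [encoder.decode([token_id]) for token_id in encoder.encode(text)]


def load_tiktoken_encoding(name="cl100k_base"):
    """Load a tiktoken encoding by name."""
    import tiktoken  # imported lazily so the helpers above work without it

    return tiktoken.get_encoding(name)


# Usage (requires tiktoken):
# enc = load_tiktoken_encoding("cl100k_base")
# print(count_tokens("Tokens are the building blocks of LLMs.", enc))
# print(token_pieces("tokenization", enc))
```

Passing the encoder in as an argument keeps the counting logic independent of any one tokenizer, so the same helpers work with a different encoding name or another library that offers encode and decode.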
Examples
- The English word "tokenization" is split into token + ization: two tokens.
- The German word "Donaudampfschifffahrtskapitän" breaks down into about a dozen tokens. The cuts do not follow the word's components (Donau, Dampf, Schiff, ...) but rather statistical fragments.
- Punctuation, spaces, and emojis also count toward the total, either as tokens of their own or as part of a neighboring token.
As a rough estimate, 1,000 tokens correspond to about 750 English words. In German, with its long compound words, the same 1,000 tokens usually cover fewer words, typically 500 to 600. The exact values depend on the text and the tokenizer.
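The rule of thumb can be turned into a quick estimate. The sketch below uses the ratios given above as defaults; the function name estimate_tokens and the figure of 550 words for German (the midpoint of the stated range) are assumptions made for illustration, and the result is only a ballpark, not a real token count.

```python
def estimate_tokens(word_count, words_per_1000_tokens=750):
    """Estimate the token count of a text from its word count.

    The default of 750 words per 1,000 tokens is the rough figure for English;
    pass a lower value (for example 550) for German text.
    """
    return round(word_count * 1000 / words_per_1000_tokens)


print(estimate_tokens(3000))                             # English: 4000
print(estimate_tokens(3000, words_per_1000_tokens=550))  # German: 5455
```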
Practical implications
Tokens are important for three reasons:
- Context window: Each model has a maximum number of tokens it can process at one time (e.g., 8,000, 128,000, or 1 million). This limit includes both input and output.
- Costs: API providers like OpenAI, Anthropic, or Google charge per token, separately for input and output tokens.
- Performance: More tokens mean higher computational costs and longer response times.
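The first two points come down to simple arithmetic, sketched below. The context window size and the prices are hypothetical placeholders, not the rates of any real provider, and the names ModelLimits, fits_context, and request_cost are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class ModelLimits:
    context_window: int              # maximum tokens, input and output combined
    input_price_per_million: float   # price per 1,000,000 input tokens
    output_price_per_million: float  # price per 1,000,000 output tokens


def fits_context(input_tokens, max_output_tokens, limits):
    """Check whether a request stays within the model's context window."""
    return input_tokens + max_output_tokens <= limits.context_window


def request_cost(input_tokens, output_tokens, limits):
    """Compute the cost of one request, billing input and output separately."""
    return (
        input_tokens * limits.input_price_per_million
        + output_tokens * limits.output_price_per_million
    ) / 1_000_000


# Hypothetical model: 128,000-token window, made-up prices.
example_model = ModelLimits(
    context_window=128_000,
    input_price_per_million=2.50,
    output_price_per_million=10.00,
)

print(fits_context(120_000, 4_000, example_model))  # True
print(fits_context(126_000, 4_000, example_model))  # False: 130,000 exceeds the window
print(request_cost(10_000, 2_000, example_model))
```

Reserving room for the output up front matters because the limit covers input and output together: a prompt that fits on its own can still leave too little space for the answer.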
Distinctions
A token is not the same as a word or a syllable. The splitting follows statistical patterns from the training data, not linguistic rules.
Different models use different tokenizers. Therefore, the same text results in a different token count depending on the model. For example, the German sentence meaning "Artificial intelligence is changing the world of work" results in:
- GPT-4 / GPT-3.5 (tokenizer cl100k_base): 15 tokens
- GPT-4o (tokenizer o200k_base): 11 tokens
The sentence and the provider ecosystem are the same, yet the token count drops by about 27 percent (11 instead of 15), because the newer tokenizer uses a larger vocabulary and groups German word parts more coarsely.
For cost and context window calculations, the tokenizer of the specific model is therefore always decisive, not a blanket rule of thumb.
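A comparison across tokenizers can be scripted along these lines. The sketch assumes a tiktoken version recent enough to ship o200k_base; compare_token_counts, relative_saving, and load_tiktoken_encoders are illustrative helper names, not library functions.

```python
def compare_token_counts(text, encoders):
    """Return the token count of `text` for each named encoder.

    `encoders` maps a label to any object with an `encode(str) -> list[int]` method.
    """
    return {name: len(encoder.encode(text)) for name, encoder in encoders.items()}


def relative_saving(baseline_count, other_count):
    """Fraction of tokens saved when going from `baseline_count` to `other_count`."""
    return (baseline_count - other_count) / baseline_count


def load_tiktoken_encoders(names=("cl100k_base", "o200k_base")):
    """Load several tiktoken encodings by name (requires tiktoken)."""
    import tiktoken  # imported lazily so the helpers above work without it

    return {name: tiktoken.get_encoding(name) for name in names}


# The counts from the example above: 15 tokens versus 11 tokens.
print(f"{relative_saving(15, 11):.0%}")  # 27%

# Usage (requires tiktoken):
# counts = compare_token_counts("your text here", load_tiktoken_encoders())
# print(counts)
```

Running the comparison on a representative sample of your own texts gives a more reliable basis for budgeting than any general ratio, since the gap between tokenizers varies with language and content.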