What is LLM Tokenization?
Answer
LLM Tokenization
Tokenization is the process of converting text into tokens: the discrete numerical units that LLMs actually process. It's the bridge between human text and model computation.
Why Tokenization?
LLMs are mathematical functions that only process numbers. Tokenization creates a discrete vocabulary that maps text fragments to integers.
```text
"Hello world" → tokenize → [9906, 1917] → model → [logits] → detokenize → response text
```
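To make the text-to-integer mapping concrete, here is a minimal sketch built on a hypothetical four-entry vocabulary. The IDs 9906 and 1917 are copied from the pipeline line above; the other entries and the helper names `toy_encode` / `toy_decode` are invented for illustration. Real tokenizers learn their vocabulary from data rather than hard-coding it.

```python
# Hypothetical toy vocabulary: text fragment -> integer ID.
TOY_VOCAB = {"Hello": 9906, " world": 1917, "!": 0, " there": 1070}
ID_TO_TEXT = {token_id: frag for frag, token_id in TOY_VOCAB.items()}


def toy_encode(text: str) -> list[int]:
    """Greedy longest-prefix match against TOY_VOCAB."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        match = max(
            (frag for frag in TOY_VOCAB if text.startswith(frag, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token covers position {pos}: {text[pos:]!r}")
        ids.append(TOY_VOCAB[match])
        pos += len(match)
    return ids


def toy_decode(ids: list[int]) -> str:
    """Map IDs back to their text fragments and concatenate them."""
    return "".join(ID_TO_TEXT[token_id] for token_id in ids)


ids = toy_encode("Hello world")
print(ids)              # [9906, 1917]
print(toy_decode(ids))  # Hello world
```

The model itself only ever sees the integer list; everything before and after it is plain lookup.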
Tokenization Algorithms
| Algorithm | Used By | Approach | Market Share |
|---|---|---|---|
| BPE (Byte Pair Encoding) | GPT-4o, Llama 3/4, Mistral, DeepSeek, Grok | Iteratively merges the most frequent adjacent pairs | ~85-90% |
| WordPiece | BERT, DistilBERT, ELECTRA | Maximizes likelihood of subword splits | ~3-5% |
| SentencePiece | T5, Gemini, Llama 2, Mistral | Treats text as a raw character stream, language-agnostic (supports BPE & Unigram) | ~40-45% (subset of BPE) |
| Unigram | T5, mT5, ALBERT, XLNet | Starts with a large vocab, prunes by likelihood | ~5-8% |
| tiktoken | All OpenAI models (fast BPE) | Optimized BPE in Rust | ~25-30% (subset of BPE) |
| Byte-Level BPE (BBPE) | GPT-2, DeepSeek-V3/R1 | BPE at byte level: no unknown tokens | Subset of BPE |
| Byte Latent Transformer (BLT) | Meta (research, 2024) | Tokenizer-free: operates on raw bytes with dynamic patching | <1% (research only) |
Key insight: BPE dominates the LLM market. Among generative LLMs (not encoder models like BERT), BPE accounts for over 95% of usage. WordPiece is essentially a legacy algorithm confined to BERT-era encoder models.
BPE in Action
```text
Initial words: "low", "lower", "newest", "widest"

Pair counts and merges:
  "l"  + "o"  appears 2x → merge to "lo"
  "lo" + "w"  appears 2x → merge to "low"
  "e"  + "s"  appears 2x → merge to "es"
  "es" + "t"  appears 2x → merge to "est"
  ...

Resulting vocab (after further merges): ["low", "er", "est", "new", "wid"]

"lower"  → ["low", "er"]   (2 tokens)
"newest" → ["new", "est"]  (2 tokens)
```
Algorithm Deep Dive
BPE (Byte Pair Encoding):
- Starts with individual characters/bytes, merges the most frequent adjacent pairs iteratively
- Vocabulary size is a hyperparameter (typically 32K-200K)
- Used by the vast majority of modern LLMs
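The merge loop described above can be sketched in a few lines. This is a toy, character-level version run on the same four-word corpus as the walkthrough, assuming no pre-tokenization and no byte-level fallback; `train_bpe`, `merge_pair`, and `apply_bpe` are illustrative names, not any library's API. Ties between equally frequent pairs are broken by whichever pair was counted first, which is a simplification.

```python
from collections import Counter


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with the joined symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    corpus = Counter(tuple(word) for word in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties: first pair counted wins
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe(["low", "lower", "newest", "widest"], num_merges=4)
print(merges)                       # [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
print(apply_bpe("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the training words, yet it still tokenizes into two known pieces. That is the property that lets BPE handle unseen words without an unknown token.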
WordPiece:
- Similar to BPE but merges are chosen to maximize the language model likelihood, not raw frequency
- Prefixes continuation subwords with `##` (e.g., `"playing"` → `["play", "##ing"]`)
- Limited to BERT-family encoder models
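At inference time, BERT-style WordPiece splits each word with a greedy longest-match-first search. The sketch below shows that lookup on a hypothetical five-entry vocabulary (`WP_VOCAB`); it covers only the splitting step, not how the vocabulary is trained, and the function name is an assumption for illustration.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first split; non-initial pieces carry a '##' prefix."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary.
WP_VOCAB = {"play", "##ing", "##ed", "un", "##play"}
print(wordpiece_tokenize("playing", WP_VOCAB))  # ['play', '##ing']
print(wordpiece_tokenize("xyz", WP_VOCAB))      # ['[UNK]']
```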
Unigram:
- Opposite of BPE: starts with a large vocabulary and prunes the tokens that least affect overall likelihood
- Can produce multiple valid tokenizations and picks the most probable one
- Often used via SentencePiece framework
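Picking the most probable tokenization is a shortest-path problem that can be solved with a Viterbi-style dynamic program. The sketch below assumes a hypothetical table of token log-probabilities (`TOY_LOG_PROBS`, values invented) and shows only the segmentation step, not the vocabulary pruning.

```python
import math


def unigram_segment(text: str, log_probs: dict[str, float]) -> list[str]:
    """Return the highest-probability segmentation of `text` under a unigram model."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability of text[:i]
    best_start = [0] * (n + 1)          # where the last token of that best path begins
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string.
    tokens = []
    end = n
    while end > 0:
        start = best_start[end]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


# Hypothetical token probabilities, invented for illustration.
TOY_LOG_PROBS = {
    "un": math.log(0.10),
    "related": math.log(0.05),
    "rel": math.log(0.02),
    "ated": math.log(0.02),
    "u": math.log(0.01),
    "n": math.log(0.01),
}
print(unigram_segment("unrelated", TOY_LOG_PROBS))  # ['un', 'related']
```

Here `["un", "rel", "ated"]` and `["u", "n", "related"]` are also valid splits, but both have lower total probability, so they lose.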
SentencePiece:
- Not an algorithm itself, but a framework that supports both BPE and Unigram
- Key advantage: language-agnostic. It operates on raw text, treating whitespace as an ordinary symbol, with no language-specific pre-tokenization rules
- Excellent for multilingual models
Byte Latent Transformer (BLT), Meta, 2024:
- A tokenizer-free architecture that eliminates the fixed vocabulary entirely
- Uses an entropy-based model to dynamically group bytes into "patches"
- Complex/surprising text gets shorter patches (more compute), predictable text gets longer patches (less compute)
- Advantages: robust to typos, novel words, and character-level tasks
- Status: research only, with no production models deployed yet
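The patching idea in the bullets above can be illustrated with a simple threshold rule: start a new patch whenever the predicted next-byte entropy is high. This is a loose sketch, not Meta's implementation. The entropy values are made up and hard-coded, whereas a real system would obtain them from a small byte-level language model, and `entropy_patches` is a hypothetical helper name.

```python
def entropy_patches(data: bytes, entropies: list[float], threshold: float) -> list[bytes]:
    """Start a new patch at every byte whose entropy exceeds `threshold`."""
    if len(data) != len(entropies):
        raise ValueError("need one entropy value per byte")
    patches = []
    current = bytearray()
    for byte, entropy in zip(data, entropies):
        # A surprising byte closes the running patch and opens a new one.
        if current and entropy > threshold:
            patches.append(bytes(current))
            current = bytearray()
        current.append(byte)
    if current:
        patches.append(bytes(current))
    return patches


data = b"the cat"
# Made-up entropies: high at the start of each word, low inside it.
fake_entropies = [3.1, 0.4, 0.2, 0.9, 2.8, 0.5, 0.3]
print(entropy_patches(data, fake_entropies, threshold=2.0))  # [b'the ', b'cat']
```

Lowering the threshold produces more, shorter patches, which corresponds to spending more compute on the same input.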
Python Tokenizer Libraries
Comparison Table
| Library | Install | Models Supported | Algorithm | Speed | When to Use |
|---|---|---|---|---|---|
| `tiktoken` | `pip install tiktoken` | GPT-4o, GPT-4.1, all OpenAI models | BPE | Fastest (Rust core) | Token counting for OpenAI API, cost estimation |
| `transformers` (`AutoTokenizer`) | `pip install transformers` | All HuggingFace models (Llama, Mistral, BERT, T5, etc.) | Wraps any (BPE, WordPiece, Unigram) | Moderate (uses Rust backend) | Working with any open-source model, chat templates |
| `sentencepiece` | `pip install sentencepiece` | T5, Llama (raw), ALBERT, Gemini-family | BPE or Unigram | Fast (C++ core) | Training custom tokenizers, multilingual apps |
| `tokenizers` | `pip install tokenizers` | Custom / any (training from scratch) | BPE, WordPiece, Unigram | Very fast (Rust core) | Training custom tokenizers, batch processing |
| `mistral-common` | `pip install mistral-common` | Mistral, Mixtral | SentencePiece BPE | Fast | Mistral models with tool-call formatting |
Speed Comparison
| Library | Relative Speed | Implementation |
|---|---|---|
| `tiktoken` | 1.0x (fastest) | Rust core, Python bindings |
| `tokenizers` | ~0.7-0.9x | Rust core, Python bindings |
| `sentencepiece` | ~0.5-0.7x | C++ core, Python bindings |
| `transformers` (`AutoTokenizer`) | ~0.6-0.8x | Uses `tokenizers` |
| `transformers` (slow tokenizers) | ~0.1-0.2x | Pure Python fallback |
Using `tiktoken` (OpenAI Models)
```python
import tiktoken

# For GPT-4o / GPT-4.1 family
enc = tiktoken.encoding_for_model("gpt-4o")  # Uses o200k_base

text = "Tokenization is key to understanding LLMs!"
token_ids = enc.encode(text)
print(f"Token IDs: {token_ids}")
print(f"Token count: {len(token_ids)}")

# Inspect individual tokens
for tid in token_ids:
    print(f"  {tid}: '{enc.decode([tid])}'")

# Available encodings:
#   cl100k_base: GPT-4, GPT-3.5-turbo
#   o200k_base:  GPT-4o, GPT-4.1 family
```
Using `AutoTokenizer` (HuggingFace: Any Open-Source Model)
```python
from transformers import AutoTokenizer

# Works with any tokenizer on the HuggingFace Hub. Llama repos are gated,
# so accept the license and log in first. An instruction-tuned checkpoint
# is used here because it ships a chat template.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

text = "The quick brown fox"
encoded = tokenizer(text, return_tensors="pt")  # "pt" requires PyTorch
print(f"Input IDs: {encoded['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded['input_ids'][0].tolist())}")
# Llama 3 uses byte-level BPE, so a leading space shows up as 'Ġ', e.g.
# ['<|begin_of_text|>', 'The', 'Ġquick', 'Ġbrown', 'Ġfox'];
# SentencePiece-based models mark it with '▁' instead.

# Chat template support (critical for instruction-tuned models)
messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "What is AI?"},
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
```
Using `sentencepiece` (Direct)
```python
import sentencepiece as spm

# Load a pre-trained model file
sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

text = "Hello, world!"
token_ids = sp.encode(text, out_type=int)
token_strs = sp.encode(text, out_type=str)
print(f"IDs: {token_ids}")
print(f"Tokens: {token_strs}")

# Train a custom tokenizer from scratch
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='my_tokenizer',
    vocab_size=32000,
    model_type='bpe',  # or 'unigram'
)
```
Using `tokenizers` (HuggingFace Rust: Custom Training)
```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Train a BPE tokenizer from scratch
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["<pad>", "<eos>", "<bos>"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Use it
encoded = tokenizer.encode("Hello, world!")
print(encoded.ids)
print(encoded.tokens)
```
Key Differences Between Libraries
| Feature | `tiktoken` | `transformers` | `sentencepiece` | `tokenizers` |
|---|---|---|---|---|
| Best for | OpenAI models | Any HF model | Multilingual / custom | Custom training |
| Chat templates | No | Yes | No | No |
| Train custom | No | No (use `tokenizers`) | Yes | Yes |
| Special tokens | Limited | Full support | Basic | Full control |
| Batch encoding | Yes | Yes | Yes | Yes (parallel) |
| Dependencies | Minimal | Heavy | Minimal | Minimal |
Max Token Limits for Top LLMs (2025-2026)
OpenAI
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| GPT-4o | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4o-mini | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4.1 | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-mini | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-nano | ~1M | 32,768 | tiktoken (`o200k_base`) |
Anthropic
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Claude 3.5 Sonnet | 200K | 8,192 | Custom BPE |
| Claude 3.5 Haiku | 200K | 8,192 | Custom BPE |
| Claude Opus 4 | 200K | 32,000 | Custom BPE |
| Claude Sonnet 4 | 200K | 64,000 | Custom BPE |
Google
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Gemini 1.5 Pro | 2M | 8,192 | SentencePiece BPE |
| Gemini 2.0 Flash | 1M | 8,192 | SentencePiece BPE |
| Gemini 2.5 Pro | 1M | 65,536 | SentencePiece BPE |
Meta (Open Source)
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Llama 3.1 8B | 128K | Deployment-dependent | BPE (tiktoken-based, 128K vocab) |
| Llama 3.1 70B | 128K | Deployment-dependent | BPE (tiktoken-based, 128K vocab) |
| Llama 3.1 405B | 128K | Deployment-dependent | BPE (tiktoken-based, 128K vocab) |
| Llama 4 Scout | 10M | Deployment-dependent | BPE (200K vocab) |
| Llama 4 Maverick | 1M | Deployment-dependent | BPE (200K vocab) |
Others
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Mistral Large | 128K | ~4,096 | SentencePiece BPE |
| Mixtral 8x22B | 64K | ~4,096 | SentencePiece BPE |
| DeepSeek-V3 | 128K | 8,192 | Custom BBPE (128K vocab) |
| DeepSeek-R1 | 128K | 8,192 | Custom BBPE (128K vocab) |
| Grok-2 | 128K | ~8,192 | Custom BPE |
| Grok-3 | 128K | ~16,384 | Custom BPE |
| Command R+ | 128K | 4,096 | Custom BPE |
Vocabulary Size Trend
| Generation | Typical Vocab Size | Examples |
|---|---|---|
| Early (2018-2020) | 30K-50K | BERT (30K), GPT-2 (50K) |
| Mid (2021-2023) | 32K-100K | Llama 2 (32K), GPT-4 (100K) |
| Current (2024-2026) | 128K-200K | GPT-4o (200K), Llama 3 (128K), Llama 4 (200K) |
Larger vocabularies improve multilingual coverage, code tokenization, and compression efficiency (fewer tokens per text).
Tokenization Edge Cases
| Situation | Effect |
|---|---|
| Non-English text | More tokens per word (Chinese, Arabic ~3-4x) |
| Code | Identifiers and syntax split in unexpected ways |
| Numbers | Each digit may be its own token |
| Whitespace | Leading spaces often merged with following word |
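| Case | Upper- and lowercase forms of a word (e.g., `Hello` vs. `hello`) usually map to different tokens |
| Special chars | Emoji and rare symbols may split into several byte-level tokens |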
Special Tokens
Models use reserved tokens for structural purposes:
| Token | Purpose |
|---|---|
| `<s>`, `<bos>` | Beginning of sequence |
| `</s>`, `<eos>` | End of sequence |
| `<pad>` | Padding for batched inference |
| `<unk>` | Unknown token (rare with BPE) |
| `[INST]` | Llama instruction delimiter |
| `<\|im_start\|>` | ChatML message start delimiter |
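As a rough illustration of how these reserved tokens are used, the sketch below wraps each sequence in beginning/end markers and pads a batch to a common length. The IDs `BOS`, `EOS`, and `PAD` are hypothetical placeholders (real values depend on the tokenizer), and `prepare_batch` is an illustrative helper rather than a library function.

```python
BOS, EOS, PAD = 1, 2, 0  # hypothetical IDs; real values depend on the tokenizer


def prepare_batch(sequences: list[list[int]]) -> tuple[list[list[int]], list[list[int]]]:
    """Add BOS/EOS to each sequence, then right-pad to a common length."""
    wrapped = [[BOS, *seq, EOS] for seq in sequences]
    width = max(len(seq) for seq in wrapped)
    input_ids = [seq + [PAD] * (width - len(seq)) for seq in wrapped]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in wrapped]
    return input_ids, attention_mask


ids, mask = prepare_batch([[11, 12, 13], [21]])
print(ids)   # [[1, 11, 12, 13, 2], [1, 21, 2, 0, 0]]
print(mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Special tokens count toward the context window like any other token, so they belong in any token budget.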
Practical Token Counting & Cost Estimation
```python
import tiktoken


def count_tokens_and_cost(
    text: str,
    model: str = "gpt-4o",
    price_per_million: float = 2.5,  # USD per 1M input tokens; check current pricing
) -> dict:
    enc = tiktoken.encoding_for_model(model)
    n = len(enc.encode(text))
    cost = n / 1_000_000 * price_per_million
    return {"tokens": n, "cost": f"${cost:.5f}"}


result = count_tokens_and_cost("Your document text here...")
print(result)
# e.g. {'tokens': 5, 'cost': '$0.00001'}
```
Understanding tokenization is fundamental to: cost estimation, context window management, debugging prompt behavior, choosing the right tokenizer library, and fine-tuning models.
Tokenization Limitations
Tokenization has several inherent limitations that directly affect how LLMs behave in practice.
Core Limitations
| Limitation | Description | Impact |
|---|---|---|
| Fixed vocabulary | Tokenizer has a pre-defined vocab (32K-200K tokens). Words not in the vocab are split into subwords/bytes | Rare words, names, and new terms require more tokens |
| Non-English inefficiency | Tokenizers trained mostly on English are 2-8x less efficient for other languages | Chinese text may use roughly 3-4x more tokens than English for the same meaning |
| Lossy representation | Tokenizers that normalize input (lowercasing, Unicode normalization, whitespace collapsing) can map different texts to identical token sequences | Subtle formatting differences are invisible to the model |
| Context window ceiling | Total tokens (input + output) cannot exceed the model's context window | Long documents must be chunked, summarized, or retrieved via RAG |
| No character awareness | Models see tokens, not characters, so spelling, character counting, and anagram tasks are inherently difficult | Letter-level questions (e.g., counting the `r`s in `strawberry`) are often answered incorrectly |
| Sensitivity to token boundaries | How text is split affects model behavior: prompts with identical meaning but different tokenization may produce different results | Prompt sensitivity and inconsistency |
| Cost scales with tokens | More tokens = higher API cost, regardless of semantic content | Verbose or non-English text costs more |
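For the context-window ceiling, the simplest workaround is fixed-size chunking with a small overlap so that text at a boundary appears in both neighbors. The sketch below works on any pre-tokenized list. The whitespace split in the usage lines is a stand-in for a real tokenizer, and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens: list[str], max_tokens: int, overlap: int = 0) -> list[list[str]]:
    """Split `tokens` into windows of at most `max_tokens`, sharing `overlap` tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the input
        start += step
    return chunks


# Whitespace splitting stands in for a real tokenizer here.
words = "one two three four five six seven".split()
print(chunk_tokens(words, max_tokens=3))
# [['one', 'two', 'three'], ['four', 'five', 'six'], ['seven']]
print(chunk_tokens(words, max_tokens=4, overlap=1))
# [['one', 'two', 'three', 'four'], ['four', 'five', 'six', 'seven']]
```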
Non-English Token Inefficiency
| Language | Tokens for ~100 English Words Equivalent | Multiplier |
|---|---|---|
| English | ~133 tokens | 1.0x |
| Spanish / French | ~160 tokens | 1.2x |
| German | ~180 tokens | 1.4x |
| Hindi / Arabic | ~300 tokens | 2.3x |
| Chinese / Japanese | ~400 tokens | 3.0x |
| Thai / Myanmar | ~500+ tokens | 3.8x+ |
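The table translates directly into a cost estimate. The sketch below copies the token counts from the table above and assumes a hypothetical price of $2.50 per million tokens; `estimate_cost` is an illustrative helper, and the figures are only as accurate as the table's rough estimates.

```python
# Tokens per ~100 English-equivalent words, copied from the table above.
TOKENS_PER_100_WORDS = {
    "English": 133,
    "Spanish / French": 160,
    "German": 180,
    "Hindi / Arabic": 300,
    "Chinese / Japanese": 400,
    "Thai / Myanmar": 500,
}


def estimate_cost(language: str, english_words: int, price_per_million: float) -> float:
    """Estimate the USD cost of text equivalent to `english_words` English words."""
    tokens = TOKENS_PER_100_WORDS[language] * english_words / 100
    return tokens / 1_000_000 * price_per_million


for lang in ("English", "Chinese / Japanese"):
    # price_per_million=2.5 is an assumed price, not a quoted one.
    cost = estimate_cost(lang, english_words=10_000, price_per_million=2.5)
    print(f"{lang}: ${cost:.5f}")
# English: $0.03325
# Chinese / Japanese: $0.10000
```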
Is the Token Limit Per-Prompt or Per-Session?
This is one of the most misunderstood aspects of LLM usage.
API Usage: Per-Request (Stateless)
Every API call is completely independent. The model has zero memory between requests. You must include the entire conversation history in each call:
```python
from openai import OpenAI

client = OpenAI()

# EVERY call must include ALL prior messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "RAG stands for..."},
    {"role": "user", "content": "How is it different from fine-tuning?"},
]

# ALL of the above counts against the context window for THIS request
response = client.chat.completions.create(model="gpt-4o", messages=messages)
```
Chat Interfaces: Per-Conversation (Managed by Provider)
| Interface | How It Handles Limits |
|---|---|
| ChatGPT | Silently truncates or summarizes older messages |
| Claude.ai | Warns user, suggests starting a new conversation |
| Gemini | May summarize older context or return error |
| Local (Ollama) | Truncates from the beginning |
The Key Distinction
| Aspect | API (Stateless) | Chat Interface (Managed) |
|---|---|---|
| Scope | Per individual request | Per conversation session |
| Memory | None: you manage history | Provider manages history |
| When exceeded | Returns error immediately | Truncates/summarizes silently |
| Developer control | Full control over what's included | No control |
Critical point: Via API, the context window applies to each individual request, not the entire session. Each request must fit all messages (system + history + new prompt + response) within the limit.
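The per-request rule is plain arithmetic: everything the request must hold is subtracted from the window. A minimal sketch follows, with made-up token counts and a hypothetical helper name `remaining_budget`.

```python
def remaining_budget(
    context_window: int,
    system: int,
    history: int,
    prompt: int,
    reserved_output: int,
) -> int:
    """Tokens left after accounting for everything one request must hold."""
    return context_window - (system + history + prompt + reserved_output)


# Made-up token counts for a long-running conversation.
left = remaining_budget(128_000, system=200, history=90_000, prompt=1_500, reserved_output=4_096)
print(left)  # 32204

# A negative result means the request will not fit and history must be trimmed.
over = remaining_budget(128_000, system=200, history=125_000, prompt=1_500, reserved_output=4_096)
print(over)  # -2796
```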
Context Window Limits: All Major LLMs (2025-2026)
OpenAI
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| GPT-4o | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4o-mini | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4.1 | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-mini | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-nano | ~1M | 32,768 | tiktoken (`o200k_base`) |
| o1 | 200K | 100,000 | tiktoken (`o200k_base`) |
| o3 | 200K | 100,000 | tiktoken (`o200k_base`) |
| o3-mini | 200K | 100,000 | tiktoken (`o200k_base`) |
| o4-mini | 200K | 100,000 | tiktoken (`o200k_base`) |
Anthropic
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Claude 3.5 Sonnet | 200K | 8,192 | Claude BPE |
| Claude 3.5 Haiku | 200K | 8,192 | Claude BPE |
| Claude Opus 4 | 200K | 32,000 | Claude BPE |
| Claude Sonnet 4 | 200K | 64,000 | Claude BPE |
Google
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Gemini 1.5 Pro | 2M | 8,192 | SentencePiece |
| Gemini 1.5 Flash | 1M | 8,192 | SentencePiece |
| Gemini 2.0 Flash | 1M | 8,192 | SentencePiece |
| Gemini 2.5 Pro | 1M | 65,536 | SentencePiece |
| Gemini 2.5 Flash | 1M | 65,536 | SentencePiece |
Meta (Open Source)
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Llama 3.1 (8B/70B/405B) | 128K | Configurable | BPE (128K vocab) |
| Llama 3.3 70B | 128K | 32,768 | BPE (128K vocab) |
| Llama 4 Scout | 10M | ~32,768 | BPE (200K vocab) |
| Llama 4 Maverick | 1M | ~32,768 | BPE (200K vocab) |
Others
| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Mistral Large 2 | 128K | ~8,192 | SentencePiece BPE |
| Mixtral 8x22B | 64K | ~4,096 | SentencePiece BPE |
| DeepSeek-V3 | 128K | 8,000 | Custom BBPE |
| DeepSeek-R1 | 128K | 32,768 | Custom BBPE |
| Grok-2 | 131K | 131K | Custom BPE |
| Grok-3 | 131K | ~32,768 | Custom BPE |
| Command R+ | 128K | 4,000 | Custom BPE |
| Command A | 256K | 8,000 | Custom BPE |
| Qwen 2.5 Turbo | 1M | ~8,192 | Custom BPE (152K vocab) |
Effective vs. Advertised Context Length
A critical caveat: advertised context ≠ effective context.
| Advertised | Effective (Reliable) | Notes |
|---|---|---|
| 128K | ~80-100K | Most models handle well |
| 200K | ~130-160K | Performance degrades toward limit |
| 1M | ~600-800K | Significant quality drop at extremes |
| 10M | ~2-5M (estimated) | Very new, limited benchmarks |
"Lost in the Middle" Problem: Research shows LLMs recall information best from the beginning and end of the context, with 30%+ degradation for content in the middle. Always place critical information at the start or end of your prompt.
Managing Tokenization Limits in Practice
```python
import tiktoken


def validate_and_manage_context(
    messages: list[dict],
    model: str = "gpt-4o",
    max_context: int = 128_000,
    reserve_output: int = 4_096,
) -> list[dict]:
    """Ensure messages fit within context window."""
    enc = tiktoken.encoding_for_model(model)

    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    budget = max_context - reserve_output
    budget -= sum(len(enc.encode(m["content"])) for m in system)

    # Keep most recent messages that fit
    kept = []
    for msg in reversed(others):
        tokens = len(enc.encode(msg["content"]))
        if budget - tokens < 0:
            break
        kept.insert(0, msg)
        budget -= tokens

    print(f"Kept {len(kept)}/{len(others)} messages, dropped {len(others) - len(kept)}")
    return system + kept
```