What is LLM Tokenization?

Question

Accepted Answer

## LLM Tokenization

Tokenization is the process of converting text into tokens: the discrete units, each mapped to an integer ID, that LLMs actually process. It is the bridge between human text and model computation.

### Why Tokenization?

LLMs are mathematical functions that only process numbers. Tokenization creates a discrete vocabulary that maps text fragments to integers.

```
"Hello world" → tokenize → [9906, 1917] → model → [logits] → sample next token ID → detokenize → response text
```

---

## Tokenization Algorithms

| Algorithm | Used By | Approach | Market Share (rough estimate) |
|-----------|---------|----------|-------------------------------|
| **BPE (Byte Pair Encoding)** | GPT-4o, Llama 3/4, Mistral, DeepSeek, Grok | Iteratively merges the most frequent adjacent symbol pairs | ~85–90% |
| **WordPiece** | BERT, DistilBERT, Electra | Chooses merges that maximize training-data likelihood | ~3–5% |
| **SentencePiece** | T5, Gemini, Llama 2, Mistral | Framework that trains on raw text with no language-specific pre-tokenization (supports BPE & Unigram) | ~40–45% (overlaps with BPE) |
| **Unigram** | T5, mT5, ALBERT, XLNet | Starts with a large vocab, prunes by likelihood | ~5–8% |
| **tiktoken** | All OpenAI models (fast BPE) | A library rather than an algorithm: optimized BPE with a Rust core | ~25–30% (subset of BPE) |
| **Byte-Level BPE (BBPE)** | GPT-2, DeepSeek-V3/R1 | BPE over bytes, so there are no unknown tokens | Subset of BPE |
| **ByteLatent Transformer (BLT)** | Meta (research, 2024) | Tokenizer-free: operates on raw bytes with dynamic patching | <1% (research only) |

> **Key insight:** BPE dominates the LLM market. Among generative LLMs (not encoder models like BERT), BPE accounts for an estimated **95%+** of usage. WordPiece is essentially a legacy algorithm confined to BERT-era encoder models.

---

### BPE in Action

```
Initial corpus: "low", "lower", "newest", "widest"

Pair counts (each word counted once):
  "l" + "o"   appears 2x → merge to "lo"
  "lo" + "w"  appears 2x → merge to "low"
  "e" + "s"   appears 2x → merge to "es"
  "es" + "t"  appears 2x → merge to "est"
  ... (lower-frequency merges such as "e" + "r" follow)

Resulting vocab (simplified): ["low", "er", "est", "new", "wid"]

"lower"  → ["low", "er"]   (2 tokens)
"newest" → ["new", "est"]  (2 tokens)
```

### Algorithm Deep Dive

**BPE (Byte Pair Encoding):**

* Starts with individual characters/bytes, merges the most frequent adjacent pairs iteratively
* Vocabulary size is a hyperparameter (typically 32K–200K)
* Used by the vast majority of modern LLMs

**WordPiece:**

* Similar to BPE, but merges are chosen to maximize the likelihood of the training data, not raw pair frequency
* Prefixes word-internal subwords with `##` (e.g., `"playing"` → `["play", "##ing"]`)
* Mostly limited to BERT-family encoder models

**Unigram:**

* Opposite of BPE: starts with a large vocabulary and prunes the tokens that least affect overall likelihood
* Can produce multiple valid tokenizations and picks the most probable one
* Often used via the SentencePiece framework

**SentencePiece:**

* Not an algorithm itself, but a framework that supports both BPE and Unigram
* Key advantage: language-agnostic. It operates on raw text (whitespace is encoded as `▁`) with no language-specific pre-tokenization rules
* Excellent for multilingual models

**ByteLatent Transformer (BLT), Meta, 2024:**

* A tokenizer-free architecture that eliminates the fixed vocabulary entirely
* Uses an entropy-based model to dynamically group bytes into "patches"
* Complex/surprising text gets shorter patches (more compute), predictable text gets longer patches (less compute)
* Advantages: robust to typos, novel words, and character-level tasks
* Status: research only, with no production models deployed yet

---

## Python Tokenizer Libraries

### Comparison Table

| Library | Install | Models Supported | Algorithm | Speed | When to Use |
|---------|---------|-----------------|-----------|-------|-------------|
| **`tiktoken`** | `pip install tiktoken` | GPT-4o, GPT-4.1, all OpenAI models | BPE | Fastest (Rust core) | Token counting for OpenAI API, cost estimation |
| **`transformers.AutoTokenizer`** | `pip install transformers` | All HuggingFace models (Llama, Mistral, BERT, T5, etc.) | Wraps any (BPE, WordPiece, Unigram) | Moderate (uses Rust backend) | Working with any open-source model, chat templates |
| **`sentencepiece`** | `pip install sentencepiece` | T5, Llama (raw), ALBERT, Gemini-family | BPE or Unigram | Fast (C++ core) | Training custom tokenizers, multilingual apps |
| **`tokenizers`** | `pip install tokenizers` | Custom / any (training from scratch) | BPE, WordPiece, Unigram | Very fast (Rust core) | Training custom tokenizers, batch processing |
| **`mistral-common`** | `pip install mistral-common` | Mistral, Mixtral | SentencePiece BPE | Fast | Mistral models with tool-call formatting |

### Speed Comparison

| Library | Relative Speed (approximate) | Implementation |
|---------|------------------------------|----------------|
| `tiktoken` | 1.0x (fastest) | Rust core, Python bindings |
| `tokenizers` (HF Rust) | ~0.7–0.9x | Rust core, Python bindings |
| `sentencepiece` | ~0.5–0.7x | C++ core, Python bindings |
| `transformers` (fast) | ~0.6–0.8x | Uses `tokenizers` Rust backend |
| `transformers` (slow) | ~0.1–0.2x | Pure Python fallback |

---

### Using `tiktoken` (OpenAI Models)

```python
import tiktoken

# For GPT-4o / GPT-4.1 family
enc = tiktoken.encoding_for_model("gpt-4o")  # Uses o200k_base

text = "Tokenization is key to understanding LLMs!"
token_ids = enc.encode(text)
print(f"Token IDs: {token_ids}")
print(f"Token count: {len(token_ids)}")

# Inspect individual tokens (repr makes leading spaces visible)
for tid in token_ids:
    print(f"  {tid}: {enc.decode([tid])!r}")

# Available encodings:
#   cl100k_base → GPT-4, GPT-3.5-turbo
#   o200k_base  → GPT-4o, GPT-4.1 family
```

### Using `AutoTokenizer` (HuggingFace, Any Open-Source Model)

```python
from transformers import AutoTokenizer

# Works with any model on the HuggingFace Hub
# (gated repos such as Llama require accepting the license and logging in)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

text = "The quick brown fox"
encoded = tokenizer(text, return_tensors="pt")  # "pt" requires PyTorch
print(f"Input IDs: {encoded['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded['input_ids'][0].tolist())}")
# Llama 3 uses byte-level BPE, so a leading space shows up as 'Ġ'
# (e.g. 'The', 'Ġquick', 'Ġbrown', 'Ġfox', usually preceded by a BOS token).
# SentencePiece-based models such as Llama 2 show '▁' instead.

# Chat template support (critical for instruction-tuned models).
# This needs a tokenizer that defines a chat template, typically an "-Instruct" variant.
messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "What is AI?"}
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
```

### Using `sentencepiece` (Direct)

```python
import sentencepiece as spm

# Load a pre-trained model file
sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

text = "Hello, world!"
token_ids = sp.encode(text, out_type=int)
token_strs = sp.encode(text, out_type=str)
print(f"IDs: {token_ids}")
print(f"Tokens: {token_strs}")

# Train a custom tokenizer from scratch
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='my_tokenizer',
    vocab_size=32000,
    model_type='bpe'  # or 'unigram'
)
```

### Using `tokenizers` (HuggingFace Rust, Custom Training)

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Train a BPE tokenizer from scratch
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32000,
    # Illustrative special-token names; use the ones your model expects
    special_tokens=["<pad>", "<s>", "</s>"]
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Use it
encoded = tokenizer.encode("Hello, world!")
print(encoded.ids)
print(encoded.tokens)
```

### Key Differences Between Libraries

| Feature | `tiktoken` | `AutoTokenizer` | `sentencepiece` | `tokenizers` |
|---------|-----------|-----------------|-----------------|-------------|
| **Best for** | OpenAI models | Any HF model | Multilingual / custom | Custom training |
| **Chat templates** | No | Yes | No | No |
| **Train custom** | No | No (use `tokenizers`) | Yes | Yes |
| **Special tokens** | Limited | Full support | Basic | Full control |
| **Batch encoding** | Yes | Yes | Yes | Yes (parallel) |
| **Dependencies** | Minimal | Heavy (`transformers`) | Minimal | Minimal |

---

## Max Token Limits for Top LLMs (2025–2026)

### OpenAI

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| GPT-4o | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4o-mini | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4.1 | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-mini | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-nano | ~1M | 32,768 | tiktoken (`o200k_base`) |

### Anthropic

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| Claude 3.5 Sonnet | 200K | 8,192 | Custom BPE |
| Claude 3.5 Haiku | 200K | 8,192 | Custom BPE |
| Claude Opus 4 | 200K | 32,000 | Custom BPE |
| Claude Sonnet 4 | 200K | 64,000 | Custom BPE |

### Google

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| Gemini 1.5 Pro | 2M | 8,192 | SentencePiece BPE |
| Gemini 2.0 Flash | 1M | 8,192 | SentencePiece BPE |
| Gemini 2.5 Pro | 1M | 65,536 | SentencePiece BPE |

### Meta (Open Source)

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| Llama 3.1 8B | 128K | Deployment-dependent | BPE (128K vocab) |
| Llama 3.1 70B | 128K | Deployment-dependent | BPE (128K vocab) |
| Llama 3.1 405B | 128K | Deployment-dependent | BPE (128K vocab) |
| Llama 4 Scout | 10M | Deployment-dependent | BPE (200K vocab) |
| Llama 4 Maverick | 1M | Deployment-dependent | BPE (200K vocab) |

### Others

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| Mistral Large | 128K | ~4,096 | SentencePiece BPE |
| Mixtral 8x22B | 64K | ~4,096 | SentencePiece BPE |
| DeepSeek-V3 | 128K | 8,192 | Custom BBPE (128K vocab) |
| DeepSeek-R1 | 128K | 8,192 | Custom BBPE (128K vocab) |
| Grok-2 | 128K | ~8,192 | Custom BPE |
| Grok-3 | 128K | ~16,384 | Custom BPE |
| Command R+ | 128K | 4,096 | Custom BPE |

---

### Vocabulary Size Trend

| Generation | Typical Vocab Size | Examples |
|-----------|-------------------|----------|
| Early (2018–2020) | 30K–50K | BERT (30K), GPT-2 (50K) |
| Mid (2021–2023) | 32K–100K | Llama 2 (32K), GPT-4 (100K) |
| Current (2024–2026) | 128K–200K | GPT-4o (200K), Llama 3 (128K), Llama 4 (200K) |

> Larger vocabularies improve multilingual coverage, code tokenization, and compression efficiency (fewer tokens per text).
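The link between vocabulary size and compression can be illustrated with a toy, character-level BPE trainer. This is a minimal sketch in pure Python, not how production tokenizers are implemented: the helper names (`train_bpe`, `apply_merge`, `encode`) and the word frequencies in the corpus are assumptions made for illustration. Each learned merge adds one vocabulary entry, so more merges means a larger vocabulary and, typically, fewer tokens per word.

```python
from collections import Counter


def apply_merge(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: list, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Every word starts as a tuple of single characters, weighted by its frequency.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[apply_merge(symbols, best)] += freq
        words = merged
    return merges


def encode(word: str, merges: list) -> list:
    """Tokenize a word by replaying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Hypothetical word frequencies, chosen only for illustration.
corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3

for n in (0, 2, 5, 10):
    merges = train_bpe(corpus, n)
    tokens = encode("lowest", merges)
    print(f"{n:>2} merges -> {len(tokens)} tokens: {tokens}")
```

With zero merges the word falls back to one token per character; as merges accumulate, the same word needs fewer tokens, which is the compression effect the vocabulary-size table points to.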
---

## Tokenization Edge Cases

| Situation | Effect |
|---------|--------|
| Non-English text | More tokens per word (Chinese, Arabic ~3–4x) |
| Code | Identifiers and syntax split in unexpected ways |
| Numbers | Long numbers are split into short digit groups or single digits, depending on the tokenizer |
| Whitespace | Leading spaces often merged with following word |
| Case | `Hello` and `hello` often different token IDs |
| Special chars | `{`, `}`, `[`, `]` each often a single token |

---

## Special Tokens

Models use reserved tokens for structural purposes:

| Token | Purpose |
|-------|---------|
| `<bos>` / `<s>` | Beginning of sequence |
| `<eos>` / `</s>` | End of sequence |
| `<pad>` | Padding for batched inference |
| `<unk>` | Unknown token (rare with BPE) |
| `[INST]` | Llama 2 / Mistral instruction delimiter |
| `<\|im_start\|>` | ChatML format start |

---

## Practical Token Counting & Cost Estimation

```python
import tiktoken


def count_tokens_and_cost(
    text: str,
    model: str = "gpt-4o",
    price_per_million: float = 2.5  # USD per 1M input tokens; check current pricing
) -> dict:
    enc = tiktoken.encoding_for_model(model)
    n = len(enc.encode(text))
    cost = n / 1_000_000 * price_per_million
    return {"tokens": n, "cost": f"${cost:.5f}"}


result = count_tokens_and_cost("Your document text here...")
print(result)  # e.g. {'tokens': 5, 'cost': '$0.00001'}
```

---

Understanding tokenization is fundamental to: **cost estimation**, **context window management**, **debugging prompt behavior**, **choosing the right tokenizer library**, and **fine-tuning models**.

---

## Tokenization Limitations

Tokenization has several inherent limitations that directly impact how LLMs work in practice.

### Core Limitations

| Limitation | Description | Impact |
|-----------|-------------|--------|
| **Fixed vocabulary** | Tokenizer has a pre-defined vocab (32K–200K tokens). Words not in vocab are split into subwords/bytes | Rare words, names, and new terms require more tokens |
| **Non-English inefficiency** | Tokenizers trained mostly on English are 2–8x less efficient for other languages | Chinese text may use 3–4x more tokens than English for the same meaning |
| **Lossy representation** | Tokenizers that normalize text (Unicode normalization, lowercasing, whitespace collapsing) can map different inputs to the same token sequence; byte-level BPE tokenizers are lossless | With normalizing tokenizers, subtle formatting differences are invisible to the model |
| **Context window ceiling** | Total tokens (input + output) cannot exceed the model's context window | Long documents must be chunked, summarized, or retrieved via RAG |
| **No character awareness** | Models see tokens, not characters, so spelling, counting characters, and anagram tasks are inherently difficult | `"strawberry"` → model can't easily count the `r`s |
| **Boundary sensitivity** | How text is split affects model behavior: prompts with the same meaning but different wording or spacing tokenize differently and may produce different results | Prompt sensitivity and inconsistency |
| **Cost scales with tokens** | More tokens = higher API cost, regardless of semantic content | Verbose or non-English text costs more |

### Non-English Token Inefficiency

Approximate figures; actual ratios depend on the tokenizer and its training data.

| Language | Tokens for ~100 English Words Equivalent | Multiplier |
|----------|----------------------------------------|------------|
| English | ~133 tokens | 1.0x |
| Spanish / French | ~160 tokens | 1.2x |
| German | ~180 tokens | 1.4x |
| Hindi / Arabic | ~300 tokens | 2.3x |
| Chinese / Japanese | ~400 tokens | 3.0x |
| Thai / Myanmar | ~500+ tokens | 3.8x+ |

---

## Is the Token Limit Per-Prompt or Per-Session?

This is one of the most misunderstood aspects of LLM usage.

### API Usage: Per-Request (Stateless)

Every API call is **completely independent**. The model has **zero memory** between requests. You must include the entire conversation history in each call:

```python
from openai import OpenAI

client = OpenAI()

# EVERY call must include ALL prior messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "RAG stands for..."},
    {"role": "user", "content": "How is it different from fine-tuning?"},
]

# ALL of the above counts against the context window for THIS request
response = client.chat.completions.create(model="gpt-4o", messages=messages)
```

### Chat Interfaces: Per-Conversation (Managed by Provider)

| Interface | How It Typically Handles Limits |
|-----------|----------------------|
| **ChatGPT** | Silently truncates or summarizes older messages |
| **Claude.ai** | Warns user, suggests starting a new conversation |
| **Gemini** | May summarize older context or return error |
| **Local (Ollama)** | Truncates the oldest tokens (from the beginning) |

### The Key Distinction

| Aspect | API (Stateless) | Chat Interface (Managed) |
|--------|----------------|-------------------------|
| **Scope** | Per individual request | Per conversation session |
| **Memory** | None: you manage history | Provider manages history |
| **When exceeded** | Returns error immediately | Truncates/summarizes silently |
| **Developer control** | Full control over what's included | No control |

> **Critical point:** Via API, the context window applies to **each individual request**, not the entire session. Each request must fit all messages (system + history + new prompt + response) within the limit.
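Because the full history is re-sent on every call, the input size of each request grows as the conversation gets longer. The sketch below simulates that growth. It is an illustration only: `rough_token_count` is a hypothetical stand-in that assumes roughly four characters per token, and `simulate_conversation` is a name invented here, not part of any SDK. For real counts you would swap in an actual tokenizer such as `tiktoken`.

```python
def rough_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: assumes ~4 characters per token."""
    return max(1, len(text) // 4)


def simulate_conversation(system: str, turns: list) -> list:
    """Return the input-token count of each request in a stateless chat.

    `turns` is a list of (user_message, assistant_reply) pairs.
    """
    history = [system]
    per_request = []
    for user_msg, assistant_reply in turns:
        history.append(user_msg)
        # The whole history so far is sent again with this request.
        per_request.append(sum(rough_token_count(m) for m in history))
        # The reply becomes part of the history for the next request.
        history.append(assistant_reply)
    return per_request


# Made-up message sizes: ~10-token system prompt, ~100-token questions, ~200-token replies.
system_prompt = "s" * 40
turns = [("u" * 400, "a" * 800)] * 3

per_request = simulate_conversation(system_prompt, turns)
for i, n in enumerate(per_request, start=1):
    print(f"Request {i}: ~{n} input tokens")
print(f"Total input tokens billed across the session: ~{sum(per_request)}")
```

Each request must fit the window on its own, but the total billed over a session is the sum of these growing requests, which is why long conversations cost more than their raw text length suggests.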
---

## Context Window Limits — All Major LLMs (2025–2026)

### OpenAI

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| GPT-4o | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4o-mini | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4.1 | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-mini | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-nano | ~1M | 32,768 | tiktoken (`o200k_base`) |
| o1 | 200K | 100,000 | tiktoken (`o200k_base`) |
| o3 | 200K | 100,000 | tiktoken (`o200k_base`) |
| o3-mini | 200K | 100,000 | tiktoken (`o200k_base`) |
| o4-mini | 200K | 100,000 | tiktoken (`o200k_base`) |

### Anthropic

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| Claude 3.5 Sonnet | 200K | 8,192 | Claude BPE |
| Claude 3.5 Haiku | 200K | 8,192 | Claude BPE |
| Claude Opus 4 | 200K | 32,000 | Claude BPE |
| Claude Sonnet 4 | 200K | 64,000 | Claude BPE |

### Google

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| Gemini 1.5 Pro | 2M | 8,192 | SentencePiece |
| Gemini 1.5 Flash | 1M | 8,192 | SentencePiece |
| Gemini 2.0 Flash | 1M | 8,192 | SentencePiece |
| Gemini 2.5 Pro | 1M | 65,536 | SentencePiece |
| Gemini 2.5 Flash | 1M | 65,536 | SentencePiece |

### Meta (Open Source)

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| Llama 3.1 (8B/70B/405B) | 128K | Configurable | BPE (128K vocab) |
| Llama 3.3 70B | 128K | 32,768 | BPE (128K vocab) |
| Llama 4 Scout | 10M | ~32,768 | BPE (200K vocab) |
| Llama 4 Maverick | 1M | ~32,768 | BPE (200K vocab) |

### Others

| Model | Context Window | Max Output | Tokenizer |
|-------|---------------|------------|----------|
| Mistral Large 2 | 128K | ~8,192 | SentencePiece BPE |
| Mixtral 8x22B | 64K | ~4,096 | SentencePiece BPE |
| DeepSeek-V3 | 128K | 8,000 | Custom BBPE |
| DeepSeek-R1 | 128K | 32,768 | Custom BBPE |
| Grok-2 | 131K | 131K | Custom BPE |
| Grok-3 | 131K | ~32,768 | Custom BPE |
| Command R+ | 128K | 4,000 | Custom BPE |
| Command A | 256K | 8,000 | Custom BPE |
| Qwen 2.5 Turbo | 1M | ~8,192 | Custom BPE (152K vocab) |

---

## Effective vs. Advertised Context Length

A critical caveat: **advertised context ≠ effective context**. The effective figures below are rough rules of thumb.

| Advertised | Effective (Reliable) | Notes |
|-----------|---------------------|-------|
| 128K | ~80–100K | Most models handle well |
| 200K | ~130–160K | Performance degrades toward limit |
| 1M | ~600–800K | Significant quality drop at extremes |
| 10M | ~2–5M (estimated) | Very new, limited benchmarks |

> **"Lost in the Middle" Problem:** Research shows LLMs recall information best from the **beginning** and **end** of the context, with degradation that can reach **30% or more** for content in the middle. Place critical information at the start or end of your prompt.

---

## Managing Tokenization Limits in Practice

```python
import tiktoken


def validate_and_manage_context(
    messages: list[dict],
    model: str = "gpt-4o",
    max_context: int = 128_000,
    reserve_output: int = 4_096
) -> list[dict]:
    """Ensure messages fit within context window."""
    enc = tiktoken.encoding_for_model(model)

    # System messages are always kept; everything else competes for the budget
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    # Note: only message content is counted; chat formatting adds a few
    # extra tokens per message, so leave some headroom in reserve_output
    budget = max_context - reserve_output
    budget -= sum(len(enc.encode(m["content"])) for m in system)

    # Keep most recent messages that fit
    kept = []
    for msg in reversed(others):
        tokens = len(enc.encode(msg["content"]))
        if budget - tokens < 0:
            break
        kept.insert(0, msg)
        budget -= tokens

    print(f"Kept {len(kept)}/{len(others)} messages, dropped {len(others) - len(kept)}")
    return system + kept
```
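Dropping old messages helps with chat histories, but not with a single document that is larger than the budget; the limitations table notes that such documents must be chunked, summarized, or retrieved via RAG. A minimal chunking sketch follows. It works on a plain list of token IDs so that it stays tokenizer-agnostic; `chunk_tokens` and its `overlap` parameter are illustrative names chosen here, and the overlap size that works best is application-specific.

```python
def chunk_tokens(token_ids: list, chunk_size: int, overlap: int = 0) -> list:
    """Split token IDs into windows of at most `chunk_size`, sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # this window already reaches the end of the document
    return chunks


# Stand-in token IDs; in practice these would come from a tokenizer's encode() call.
doc_tokens = list(range(10))
for chunk in chunk_tokens(doc_tokens, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap repeats a few tokens between neighbouring chunks so that a sentence cut at a boundary is still seen whole in at least one chunk, at the cost of processing those tokens twice.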




