
What is LLM Tokenization?

#gen-ai#tokens#llm#tokenization#tiktoken#transformers#sentencepiece#context-window#limitations

Answer

LLM Tokenization

Tokenization is the process of converting text into tokens: the discrete numerical units that LLMs actually process. It's the bridge between human text and model computation.

Why Tokenization?

LLMs are mathematical functions that only process numbers. Tokenization creates a discrete vocabulary that maps text fragments to integers.

text
"Hello world" → tokenize → [9906, 1917] → model → [logits] → detokenize → response text
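The encode/decode ends of that pipeline can be sketched with a toy vocabulary. This is a minimal illustration, not a real tokenizer: `TOY_VOCAB`, `encode`, and `decode` are hypothetical names, and the two-entry mapping simply reuses the IDs from the example above.

```python
# Toy vocabulary (assumption): reuses the IDs from the example above.
TOY_VOCAB = {"Hello": 9906, " world": 1917}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def encode(text: str) -> list[int]:
    """Greedy longest-match encoding against the toy vocabulary."""
    ids = []
    pos = 0
    while pos < len(text):
        # Pick the longest vocabulary entry that matches at this position.
        match = max(
            (tok for tok in TOY_VOCAB if text.startswith(tok, pos)),
            key=len,
            default=None,
        )
        if match is None:
            raise ValueError(f"no token covers position {pos}")
        ids.append(TOY_VOCAB[match])
        pos += len(match)
    return ids


def decode(ids: list[int]) -> str:
    """Map IDs back to their text fragments and concatenate them."""
    return "".join(ID_TO_TOKEN[i] for i in ids)


ids = encode("Hello world")
print(ids)          # [9906, 1917]
print(decode(ids))  # Hello world
```

Note that the space belongs to the second token (`" world"`), which is why decoding is a plain concatenation.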

Tokenization Algorithms

| Algorithm | Used By | Approach | Market Share (approx.) |
|---|---|---|---|
| BPE (Byte Pair Encoding) | GPT-4o, Llama 3/4, Mistral, DeepSeek, Grok | Iteratively merges the most frequent adjacent symbol pairs | ~85–90% |
| WordPiece | BERT, DistilBERT, Electra | Maximizes likelihood of subword splits | ~3–5% |
| SentencePiece | T5, Gemini, Llama 2, Mistral | Treats input as a raw character stream (whitespace included), language-agnostic (supports BPE & Unigram) | ~40–45% (subset of BPE) |
| Unigram | T5, mT5, ALBERT, XLNet | Starts with a large vocab, prunes by likelihood | ~5–8% |
| tiktoken | All OpenAI models (fast BPE) | Optimized BPE in Rust | ~25–30% (subset of BPE) |
| Byte-Level BPE (BBPE) | GPT-2, DeepSeek-V3/R1 | BPE at byte level, so no unknown tokens | Subset of BPE |
| Byte Latent Transformer (BLT) | Meta (research, 2024) | Tokenizer-free: operates on raw bytes with dynamic patching | <1% (research only) |

Key insight: BPE dominates the LLM market. Among generative LLMs (not encoder models like BERT), BPE accounts for over 95% of usage. WordPiece is essentially a legacy algorithm confined to BERT-era encoder models.


BPE in Action

text
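Corpus: "low", "lower", "newest", "widest" (each split into characters)

Adjacent-pair counts and merges:
  ("l", "o") appears 2x → merge to "lo"
  ("lo", "w") appears 2x → merge to "low"
  ("e", "s") appears 2x → merge to "es"
  ("es", "t") appears 2x → merge to "est"
  ...

Further merges among the remaining 1x pairs (tie-breaking is implementation-dependent) can yield:
Result vocab: ["low", "er", "est", "new", "wid"]
"lower" → ["low", "er"]      (2 tokens)
"newest" → ["new", "est"]    (2 tokens)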
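The merge loop can be sketched in a few lines of Python. This is a minimal character-level illustration, not any library's implementation: `merge_pair`, `train_bpe`, and `apply_bpe` are hypothetical helper names, each word is counted once, and ties between equally frequent pairs go to the pair seen first (real tokenizers may break ties differently).

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words: list[str], num_merges: int) -> list[tuple]:
    """Learn up to `num_merges` merges, starting from single characters."""
    corpus = Counter(tuple(w) for w in words)  # symbol tuple -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; on ties, the pair encountered first wins (assumption).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def apply_bpe(word: str, merges: list[tuple]) -> list[str]:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe(["low", "lower", "newest", "widest"], num_merges=4)
print(merges)                       # [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't')]
print(apply_bpe("lower", merges))   # ['low', 'e', 'r']
print(apply_bpe("newest", merges))  # ['n', 'e', 'w', 'est']
```

With only four merges on this tiny corpus, "er" and "new" have not been learned yet; a larger corpus and merge budget are what produce vocabularies like the one shown above.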

Algorithm Deep Dive

BPE (Byte Pair Encoding):

  • Starts with individual characters/bytes, merges the most frequent adjacent pairs iteratively
  • Vocabulary size is a hyperparameter (typically 32K–200K)
  • Used by the vast majority of modern LLMs

WordPiece:

  • Similar to BPE but merges are chosen to maximize the language model likelihood, not raw frequency
  • Prefixes word-continuation subwords with `##` (e.g., `"playing"` → `["play", "##ing"]`)
  • Limited to BERT-family encoder models
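At inference time, WordPiece tokenizers are commonly described as using greedy longest-match-first lookup. The sketch below illustrates that lookup only (not the likelihood-based training step); `wordpiece_tokenize` and the toy vocabulary are hypothetical.

```python
def wordpiece_tokenize(word: str, vocab: set[str], unk: str = "[UNK]") -> list[str]:
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is found in the vocab.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"play", "##ing", "##ed", "##er"}  # toy vocabulary (assumption)
print(wordpiece_tokenize("playing", toy_vocab))  # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))      # ['[UNK]']
```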

Unigram:

  • Opposite of BPE: starts with a large vocabulary and prunes the tokens that least affect overall likelihood
  • Can produce multiple valid tokenizations and picks the most probable one
  • Often used via SentencePiece framework
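Picking the most probable tokenization is typically done with Viterbi-style dynamic programming over token log-probabilities. The following is a minimal sketch under that assumption; `unigram_segment` and the probabilities in `toy_probs` are made up for illustration and are not taken from any trained model.

```python
import math


def unigram_segment(text: str, probs: dict[str, float]) -> list[str]:
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-prob of any segmentation of text[:i]
    best_score[0] = 0.0
    back = [0] * (n + 1)                # start index of the last piece ending at i
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs and best_score[start] > -math.inf:
                score = best_score[start] + math.log(probs[piece])
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Toy probabilities (assumption): "unhappy" exists as a whole token but is rare.
toy_probs = {"un": 0.10, "happy": 0.05, "unhappy": 0.001, "u": 0.02, "n": 0.02}
print(unigram_segment("unhappy", toy_probs))  # ['un', 'happy']
```

Here `["un", "happy"]` wins because 0.10 × 0.05 = 0.005 beats the single-token probability of 0.001.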

SentencePiece:

  • Not an algorithm itself, but a framework that supports both BPE and Unigram
  • Key advantage: language-agnostic. It operates on raw text with no language-specific pre-tokenization rules, treating whitespace as an ordinary symbol (shown as `▁`)
  • Excellent for multilingual models
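The whitespace handling is what makes this approach reversible without language-specific rules. The sketch below mimics only that convention in plain Python; `escape_whitespace` and `detokenize` are hypothetical helpers (not the sentencepiece API), and the leading-space behavior is modeled on the library's default dummy-prefix setting.

```python
WS = "\u2581"  # "▁", the visible marker used in place of a space


def escape_whitespace(text: str, add_dummy_prefix: bool = True) -> str:
    """Turn spaces into the marker so they become part of the pieces themselves."""
    if add_dummy_prefix:
        text = " " + text  # so a word at the start looks like any other word
    return text.replace(" ", WS)


def detokenize(pieces: list[str]) -> str:
    """Concatenate pieces and turn markers back into spaces."""
    text = "".join(pieces).replace(WS, " ")
    return text[1:] if text.startswith(" ") else text


print(escape_whitespace("Hello world"))          # ▁Hello▁world
print(detokenize(["▁Hello", "▁wor", "ld"]))      # Hello world
```

Because spaces live inside the pieces, decoding is a simple concatenation regardless of how the text was split.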

Byte Latent Transformer (BLT), Meta, 2024:

  • A tokenizer-free architecture that eliminates the fixed vocabulary entirely
  • Uses an entropy-based model to dynamically group bytes into "patches"
  • Complex/surprising text gets shorter patches (more compute), predictable text gets longer patches (less compute)
  • Advantages: robust to typos, novel words, and character-level tasks
  • Status: research only, with no production models deployed yet
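The patching idea can be illustrated with a simple threshold rule: start a new patch whenever the predicted next-byte entropy is high. This is only a rough sketch of that idea, not Meta's implementation; `entropy_patches` is a hypothetical function, and the entropy values are invented (in BLT they would come from a small byte-level language model).

```python
def entropy_patches(data: bytes, entropies: list[float], threshold: float) -> list[bytes]:
    """Split `data` into patches, opening a new patch at each high-entropy byte."""
    if len(data) != len(entropies):
        raise ValueError("need one entropy value per byte")
    patches = []
    start = 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:  # hard-to-predict byte: begin a new patch here
            patches.append(data[start:i])
            start = i
    if data:
        patches.append(data[start:])
    return patches


data = b"the cat"
# Invented entropies: high at the start of each word, low inside words.
entropies = [3.0, 0.5, 0.3, 0.4, 2.8, 0.6, 0.4]
print(entropy_patches(data, entropies, threshold=2.0))  # [b'the ', b'cat']
```

Lowering the threshold yields more, shorter patches, which corresponds to spending more compute on the same text.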

Python Tokenizer Libraries

Comparison Table

| Library | Install | Models Supported | Algorithm | Speed | When to Use |
|---|---|---|---|---|---|
| `tiktoken` | `pip install tiktoken` | GPT-4o, GPT-4.1, all OpenAI models | BPE | Fastest (Rust core) | Token counting for OpenAI API, cost estimation |
| `transformers.AutoTokenizer` | `pip install transformers` | All HuggingFace models (Llama, Mistral, BERT, T5, etc.) | Wraps any (BPE, WordPiece, Unigram) | Moderate (uses Rust backend) | Working with any open-source model, chat templates |
| `sentencepiece` | `pip install sentencepiece` | T5, Llama (raw), ALBERT, Gemini-family | BPE or Unigram | Fast (C++ core) | Training custom tokenizers, multilingual apps |
| `tokenizers` | `pip install tokenizers` | Custom / any (training from scratch) | BPE, WordPiece, Unigram | Very fast (Rust core) | Training custom tokenizers, batch processing |
| `mistral-common` | `pip install mistral-common` | Mistral, Mixtral | SentencePiece BPE | Fast | Mistral models with tool-call formatting |

Speed Comparison

| Library | Relative Speed (vs. tiktoken) | Implementation |
|---|---|---|
| `tiktoken` | 1.0x (fastest) | Rust core, Python bindings |
| `tokenizers` (HF Rust) | ~0.7–0.9x | Rust core, Python bindings |
| `sentencepiece` | ~0.5–0.7x | C++ core, Python bindings |
| `transformers` (fast) | ~0.6–0.8x | Uses `tokenizers` Rust backend |
| `transformers` (slow) | ~0.1–0.2x | Pure Python fallback |

Using `tiktoken` (OpenAI Models)

python
import tiktoken

# For GPT-4o / GPT-4.1 family
enc = tiktoken.encoding_for_model("gpt-4o")  # Uses o200k_base

text = "Tokenization is key to understanding LLMs!"
token_ids = enc.encode(text)
print(f"Token IDs: {token_ids}")
print(f"Token count: {len(token_ids)}")

# Inspect individual tokens
for tid in token_ids:
    print(f"  {tid}: '{enc.decode([tid])}'")

# Available encodings:
# cl100k_base → GPT-4, GPT-3.5-turbo
# o200k_base  → GPT-4o, GPT-4.1 family

Using `AutoTokenizer` (HuggingFace: Any Open-Source Model)

python
from transformers import AutoTokenizer

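# Works with any model on the HuggingFace Hub (Llama repos are gated and require approved access).
# The Instruct variant is used here because the chat template below targets instruction-tuned models.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

text = "The quick brown fox"
encoded = tokenizer(text, return_tensors="pt")  # return_tensors="pt" requires PyTorch
print(f"Input IDs: {encoded['input_ids']}")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(encoded['input_ids'][0])}")
# Llama 3 uses byte-level BPE, so a leading space shows up as 'Ġ' (e.g. 'Ġquick'),
# not the SentencePiece-style '▁' seen with Llama 2. A BOS token is prepended.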

# Chat template support (critical for instruction-tuned models)
messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "What is AI?"}
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)

Using `sentencepiece` (Direct)

python
import sentencepiece as spm

# Load a pre-trained model file
sp = spm.SentencePieceProcessor(model_file='tokenizer.model')

text = "Hello, world!"
token_ids = sp.encode(text, out_type=int)
token_strs = sp.encode(text, out_type=str)
print(f"IDs: {token_ids}")
print(f"Tokens: {token_strs}")

# Train a custom tokenizer from scratch
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='my_tokenizer',
    vocab_size=32000,
    model_type='bpe'  # or 'unigram'
)

Using `tokenizers` (HuggingFace Rust: Custom Training)

python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Train a BPE tokenizer from scratch
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["<pad>", "<eos>", "<bos>"]
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Use it
encoded = tokenizer.encode("Hello, world!")
print(encoded.ids)
print(encoded.tokens)

Key Differences Between Libraries

| Feature | `tiktoken` | `AutoTokenizer` | `sentencepiece` | `tokenizers` |
|---|---|---|---|---|
| Best for | OpenAI models | Any HF model | Multilingual / custom | Custom training |
| Chat templates | No | Yes | No | No |
| Train custom | No | No (use `tokenizers`) | Yes | Yes |
| Special tokens | Limited | Full support | Basic | Full control |
| Batch encoding | Yes | Yes | Yes | Yes (parallel) |
| Dependencies | Minimal | Heavy (`transformers`) | Minimal | Minimal |

Max Token Limits for Top LLMs (2025–2026)

OpenAI

| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| GPT-4o | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4o-mini | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4.1 | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-mini | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-nano | ~1M | 32,768 | tiktoken (`o200k_base`) |

Anthropic

| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Claude 3.5 Sonnet | 200K | 8,192 | Custom BPE |
| Claude 3.5 Haiku | 200K | 8,192 | Custom BPE |
| Claude Opus 4 | 200K | 32,000 | Custom BPE |
| Claude Sonnet 4 | 200K | 64,000 | Custom BPE |

Google

| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Gemini 1.5 Pro | 2M | 8,192 | SentencePiece BPE |
| Gemini 2.0 Flash | 1M | 8,192 | SentencePiece BPE |
| Gemini 2.5 Pro | 1M | 65,536 | SentencePiece BPE |

Meta (Open Source)

| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Llama 3.1 8B | 128K | Deployment-dependent | BPE, tiktoken-based (128K vocab) |
| Llama 3.1 70B | 128K | Deployment-dependent | BPE, tiktoken-based (128K vocab) |
| Llama 3.1 405B | 128K | Deployment-dependent | BPE, tiktoken-based (128K vocab) |
| Llama 4 Scout | 10M | Deployment-dependent | BPE (200K vocab) |
| Llama 4 Maverick | 1M | Deployment-dependent | BPE (200K vocab) |

Others

| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Mistral Large | 128K | ~4,096 | SentencePiece BPE |
| Mixtral 8x22B | 64K | ~4,096 | SentencePiece BPE |
| DeepSeek-V3 | 128K | 8,192 | Custom BBPE (128K vocab) |
| DeepSeek-R1 | 128K | 8,192 | Custom BBPE (128K vocab) |
| Grok-2 | 128K | ~8,192 | Custom BPE |
| Grok-3 | 128K | ~16,384 | Custom BPE |
| Command R+ | 128K | 4,096 | Custom BPE |

Vocabulary Size Trend

| Generation | Typical Vocab Size | Examples |
|---|---|---|
| Early (2018–2020) | 30K–50K | BERT (30K), GPT-2 (50K) |
| Mid (2021–2023) | 32K–100K | Llama 2 (32K), GPT-4 (100K) |
| Current (2024–2026) | 128K–200K | GPT-4o (200K), Llama 3 (128K), Llama 4 (200K) |

Larger vocabularies improve multilingual coverage, code tokenization, and compression efficiency (fewer tokens per text).


Tokenization Edge Cases

| Situation | Effect |
|---|---|
| Non-English text | More tokens per word (Chinese, Arabic ~3–4x) |
| Code | Identifiers and syntax split in unexpected ways |
| Numbers | Each digit may be its own token |
| Whitespace | Leading spaces often merged with the following word |
| Case | `Hello` and `hello` often have different token IDs |
| Special chars | `{`, `}`, `[`, `]` are each often a single token |

Special Tokens

Models use reserved tokens for structural purposes:

| Token | Purpose |
|---|---|
| `<bos>` / `<s>` | Beginning of sequence |
| `<eos>` / `</s>` | End of sequence |
| `<pad>` | Padding for batched inference |
| `<unk>` | Unknown token (rare with BPE) |
| `[INST]` | Instruction delimiter (Llama 2 and Mistral chat formats) |
`<im_start

Practical Token Counting & Cost Estimation

python
import tiktoken

def count_tokens_and_cost(
    text: str,
    model: str = "gpt-4o",
    price_per_million: float = 2.5
) -> dict:
    enc = tiktoken.encoding_for_model(model)
    n = len(enc.encode(text))
    cost = n / 1_000_000 * price_per_million
    return {"tokens": n, "cost": f"${cost:.5f}"}

result = count_tokens_and_cost("Your document text here...")
print(result)  # {'tokens': 5, 'cost': '$0.00001'}

Understanding tokenization is fundamental to: cost estimation, context window management, debugging prompt behavior, choosing the right tokenizer library, and fine-tuning models.


Tokenization Limitations

Tokenization has several inherent limitations that directly affect how LLMs work in practice.

Core Limitations

| Limitation | Description | Impact |
|---|---|---|
| Fixed vocabulary | Tokenizer has a predefined vocab (32K–200K tokens). Words not in the vocab are split into subwords/bytes | Rare words, names, and new terms require more tokens |
| Non-English inefficiency | Tokenizers trained mostly on English are 2–8x less efficient for other languages | Chinese text may use 4x more tokens than English for the same meaning |
| Lossy representation | Tokenizers that normalize input (lowercasing, Unicode normalization, whitespace collapsing) can map different texts to identical token sequences; byte-level BPE avoids this | Subtle formatting differences can be invisible to the model |
| Context window ceiling | Total tokens (input + output) cannot exceed the model's context window | Long documents must be chunked, summarized, or retrieved via RAG |
| No character awareness | Models see tokens, not characters, so spelling, counting characters, and anagram tasks are inherently difficult | `"strawberry"` → model can't easily count the `r`s |
| Sensitivity to token boundaries | How text is split affects model behavior, so prompts with identical meaning may produce different results | Prompt sensitivity and inconsistency |
| Cost scales with tokens | More tokens = higher API cost, regardless of semantic content | Verbose or non-English text costs more |

Non-English Token Inefficiency

| Language | Tokens for ~100 English Words Equivalent | Multiplier |
|---|---|---|
| English | ~133 tokens | 1.0x |
| Spanish / French | ~160 tokens | 1.2x |
| German | ~180 tokens | 1.4x |
| Hindi / Arabic | ~300 tokens | 2.3x |
| Chinese / Japanese | ~400 tokens | 3.0x |
| Thai / Myanmar | ~500+ tokens | 3.8x+ |
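One contributing factor is visible without any tokenizer at all: byte-level BPE starts from UTF-8 bytes, and non-Latin scripts need several bytes per character before any merges are learned. The sketch below only measures that starting point; `utf8_bytes_per_char` is a hypothetical helper and the sample strings are arbitrary greetings, so it does not reproduce the token counts in the table.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character (code point)."""
    return len(text.encode("utf-8")) / len(text)


samples = {
    "English": "hello",
    "Chinese": "你好",
    "Hindi": "नमस्ते",
}

for language, text in samples.items():
    print(f"{language}: {utf8_bytes_per_char(text):.1f} bytes per character")
# English: 1.0 bytes per character
# Chinese: 3.0 bytes per character
# Hindi: 3.0 bytes per character
```

A tokenizer trained mostly on English has learned many merges for ASCII byte sequences and comparatively few for these 3-byte sequences, which is consistent with the higher multipliers above.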

Is the Token Limit Per-Prompt or Per-Session?

This is one of the most misunderstood aspects of LLM usage.

API Usage: Per-Request (Stateless)

Every API call is completely independent. The model has zero memory between requests. You must include the entire conversation history in each call:

python
from openai import OpenAI
client = OpenAI()

# EVERY call must include ALL prior messages
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is RAG?"},
    {"role": "assistant", "content": "RAG stands for..."},
    {"role": "user", "content": "How is it different from fine-tuning?"},
]
# ALL of the above counts against the context window for THIS request
response = client.chat.completions.create(model="gpt-4o", messages=messages)
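Because the API is stateless, client code usually keeps the history itself and resends it on every turn. The wrapper below is a minimal sketch of that pattern: `Conversation` and `fake_send` are hypothetical, and `fake_send` stands in for a real API call so the example runs offline.

```python
from typing import Callable

Message = dict[str, str]


class Conversation:
    """Keeps the running history and resends all of it with every request."""

    def __init__(self, system_prompt: str, send: Callable[[list[Message]], str]):
        self.messages: list[Message] = [{"role": "system", "content": system_prompt}]
        self._send = send

    def ask(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        reply = self._send(list(self.messages))  # the FULL history goes out each time
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages: list[Message]) -> str:
    """Stand-in for a chat completion call (assumption): reports what it received."""
    return f"(reply to {len(messages)} messages)"


chat = Conversation("You are a helpful assistant.", fake_send)
print(chat.ask("What is RAG?"))                           # (reply to 2 messages)
print(chat.ask("How is it different from fine-tuning?"))  # (reply to 4 messages)
```

The request grows by two messages per turn, which is why long conversations eventually hit the context window even though each individual prompt is short.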

Chat Interfaces: Per-Conversation (Managed by Provider)

| Interface | How It Handles Limits |
|---|---|
| ChatGPT | Silently truncates or summarizes older messages |
| Claude.ai | Warns user, suggests starting a new conversation |
| Gemini | May summarize older context or return error |
| Local (Ollama) | Truncates from the beginning |

The Key Distinction

| Aspect | API (Stateless) | Chat Interface (Managed) |
|---|---|---|
| Scope | Per individual request | Per conversation session |
| Memory | None: you manage history | Provider manages history |
| When exceeded | Returns error immediately | Truncates/summarizes silently |
| Developer control | Full control over what's included | No control |

Critical point: Via the API, the context window applies to each individual request, not the entire session. Each request must fit all messages (system + history + new prompt + response) within the limit.
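The arithmetic behind that rule is simple enough to write down. The helper below is a hypothetical sketch: it assumes the response is limited both by the space left in the context window and by the model's separate max-output cap.

```python
def remaining_output_budget(context_window: int, input_tokens: int, max_output: int) -> int:
    """Tokens available for the response in a single request."""
    free = context_window - input_tokens  # space left after system + history + prompt
    return max(0, min(free, max_output))  # also capped by the model's output limit


# Figures for GPT-4o from the tables in this document: 128K window, 16,384 max output.
print(remaining_output_budget(128_000, 10_000, 16_384))   # 16384
print(remaining_output_budget(128_000, 120_000, 16_384))  # 8000
print(remaining_output_budget(128_000, 130_000, 16_384))  # 0
```

A result of 0 means the request as assembled would be rejected, so history has to be trimmed before sending.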


Context Window Limits: All Major LLMs (2025–2026)

OpenAI

| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| GPT-4o | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4o-mini | 128K | 16,384 | tiktoken (`o200k_base`) |
| GPT-4.1 | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-mini | ~1M | 32,768 | tiktoken (`o200k_base`) |
| GPT-4.1-nano | ~1M | 32,768 | tiktoken (`o200k_base`) |
| o1 | 200K | 100,000 | tiktoken (`o200k_base`) |
| o3 | 200K | 100,000 | tiktoken (`o200k_base`) |
| o3-mini | 200K | 100,000 | tiktoken (`o200k_base`) |
| o4-mini | 200K | 100,000 | tiktoken (`o200k_base`) |

Anthropic

| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Claude 3.5 Sonnet | 200K | 8,192 | Claude BPE |
| Claude 3.5 Haiku | 200K | 8,192 | Claude BPE |
| Claude Opus 4 | 200K | 32,000 | Claude BPE |
| Claude Sonnet 4 | 200K | 64,000 | Claude BPE |

Google

| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Gemini 1.5 Pro | 2M | 8,192 | SentencePiece |
| Gemini 1.5 Flash | 1M | 8,192 | SentencePiece |
| Gemini 2.0 Flash | 1M | 8,192 | SentencePiece |
| Gemini 2.5 Pro | 1M | 65,536 | SentencePiece |
| Gemini 2.5 Flash | 1M | 65,536 | SentencePiece |

Meta (Open Source)

| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Llama 3.1 (8B/70B/405B) | 128K | Configurable | BPE (128K vocab) |
| Llama 3.3 70B | 128K | 32,768 | BPE (128K vocab) |
| Llama 4 Scout | 10M | ~32,768 | BPE (200K vocab) |
| Llama 4 Maverick | 1M | ~32,768 | BPE (200K vocab) |

Others

| Model | Context Window | Max Output | Tokenizer |
|---|---|---|---|
| Mistral Large 2 | 128K | ~8,192 | SentencePiece BPE |
| Mixtral 8x22B | 64K | ~4,096 | SentencePiece BPE |
| DeepSeek-V3 | 128K | 8,000 | Custom BBPE |
| DeepSeek-R1 | 128K | 32,768 | Custom BBPE |
| Grok-2 | 131K | 131K | Custom BPE |
| Grok-3 | 131K | ~32,768 | Custom BPE |
| Command R+ | 128K | 4,000 | Custom BPE |
| Command A | 256K | 8,000 | Custom BPE |
| Qwen 2.5 Turbo | 1M | ~8,192 | Custom BPE (152K vocab) |

Effective vs. Advertised Context Length

A critical caveat: advertised context ≠ effective context.

| Advertised | Effective (Reliable) | Notes |
|---|---|---|
| 128K | ~80–100K | Most models handle well |
| 200K | ~130–160K | Performance degrades toward limit |
| 1M | ~600–800K | Significant quality drop at extremes |
| 10M | ~2–5M (estimated) | Very new, limited benchmarks |

"Lost in the Middle" Problem: Research shows LLMs recall information best from the beginning and end of the context, with degradation of 30% or more reported for content in the middle. Where possible, place critical information at the start or end of your prompt.
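One simple way to act on this when assembling retrieved passages is to alternate them between the front and the back of the prompt, so the weakest material ends up in the middle. This is a heuristic sketch only; `place_at_edges` is a hypothetical helper and assumes the input is already sorted most relevant first.

```python
def place_at_edges(docs_by_relevance: list[str]) -> list[str]:
    """Reorder so the most relevant items sit at the start and end of the context."""
    front, back = [], []
    for rank, doc in enumerate(docs_by_relevance):
        # Even ranks fill the front, odd ranks fill the back.
        (front if rank % 2 == 0 else back).append(doc)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # A is most relevant, E least
print(place_at_edges(ranked))       # ['A', 'C', 'E', 'D', 'B']
```

The two best passages (A and B) land at the edges, while the least relevant one (E) sits in the middle where recall is weakest.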


Managing Tokenization Limits in Practice

python
import tiktoken

def validate_and_manage_context(
    messages: list[dict],
    model: str = "gpt-4o",
    max_context: int = 128_000,
    reserve_output: int = 4_096
) -> list[dict]:
    """Ensure messages fit within context window."""
    enc = tiktoken.encoding_for_model(model)
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]

    budget = max_context - reserve_output
    budget -= sum(len(enc.encode(m["content"])) for m in system)

    # Keep most recent messages that fit
    kept = []
    for msg in reversed(others):
        tokens = len(enc.encode(msg["content"]))
        if budget - tokens < 0:
            break
        kept.insert(0, msg)
        budget -= tokens

    print(f"Kept {len(kept)}/{len(others)} messages, dropped {len(others) - len(kept)}")
    return system + kept
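The function above trims chat history. For a single long document, the chunking approach mentioned earlier applies instead. The sketch below splits a list of token IDs into overlapping windows; `chunk_token_ids` is a hypothetical helper, and the chunk size and overlap are arbitrary illustration values.

```python
def chunk_token_ids(token_ids: list[int], chunk_size: int, overlap: int = 0) -> list[list[int]]:
    """Split token IDs into windows of at most `chunk_size`, sharing `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # the last window already reaches the end
    return chunks


ids = list(range(10))  # stand-in for the output of enc.encode(document)
print(chunk_token_ids(ids, chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Chunking on token IDs rather than characters keeps every chunk within a known token budget; each chunk can be turned back into text with the tokenizer's decode method.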