How to create/train our own tokenizer?
#gen-ai#tokens#llm#tokenization#tokenizers#sentencepiece#transformers#bpe#training
Answer
How to Create/Train Your Own Tokenizer
Training a custom tokenizer lets you optimize how text is split into tokens for your specific domain, language, or use case — improving efficiency, reducing costs, and maximizing context window usage.
Why Train a Custom Tokenizer?
| Use Case | Problem with General Tokenizers | Benefit of Custom |
|---|---|---|
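| Medical/Biomedical | Domain terms are split into many subword fragments | Learns domain terms as 1–2 tokens |
| Legal | Legal terms and boilerplate phrases are split into many tokens | 30–50% fewer tokens for contracts |
| Code | Operators and other multi-character syntax are split inefficiently | Code-aware tokenization |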
| Multilingual | Hindi, Thai, Arabic need 3–8x more tokens than English | Fair compression across languages |
| Cost optimization | More tokens = higher API costs | Fewer tokens = lower cost + faster inference |
Tokenizer Training Libraries
| Library | Install | Training? | Algorithms | Speed | Best For |
|---|---|---|---|---|---|
| `tokenizers` | `pip install tokenizers` | Yes | BPE, WordPiece, Unigram | Very fast (Rust) | Full control, production use |
| `sentencepiece` | `pip install sentencepiece` | Yes | BPE, Unigram | Fast (C++) | Multilingual, raw text, Llama-style |
| `transformers` | `pip install transformers` | Adapt existing | Inherits from base | Fast | Quick domain adaptation |
| `tiktoken` | `pip install tiktoken` | No (read-only) | BPE inference only | Fastest (Rust) | Token counting only |
Important: `tiktoken` cannot train custom tokenizers; OpenAI never released their training code. Use `tokenizers` or `sentencepiece` for training.
Method 1: HuggingFace `tokenizers` BPE (Recommended)
The most flexible and widely used approach for modern LLMs.
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.processors import TemplateProcessing
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.normalizers import NFC

# Step 1: Initialize tokenizer
tokenizer = Tokenizer(BPE(unk_token="<|unk|>"))

# Step 2: Set normalizer and pre-tokenizer
tokenizer.normalizer = NFC()  # Unicode normalization
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
tokenizer.decoder = ByteLevelDecoder()

# Step 3: Configure trainer
trainer = BpeTrainer(
    vocab_size=32000,
    min_frequency=2,
    special_tokens=[
        "<|unk|>", "<|pad|>", "<|bos|>", "<|eos|>",
        "<|system|>", "<|user|>", "<|assistant|>"
    ],
    show_progress=True,
    initial_alphabet=ByteLevel.alphabet(),
)

# Step 4: Train on your corpus
tokenizer.train(["corpus_part1.txt", "corpus_part2.txt"], trainer)

# Step 5: Add post-processing (auto-add BOS/EOS)
tokenizer.post_processor = TemplateProcessing(
    single="<|bos|> $A <|eos|>",
    pair="<|bos|> $A <|eos|> <|bos|> $B <|eos|>",
    special_tokens=[
        ("<|bos|>", tokenizer.token_to_id("<|bos|>")),
        ("<|eos|>", tokenizer.token_to_id("<|eos|>")),
    ],
)

# Step 6: Save and test
tokenizer.save("my-bpe-tokenizer.json")
encoded = tokenizer.encode("Hello, how are you?")
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
print(f"Decoded: {tokenizer.decode(encoded.ids)}")
```
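To make it clearer what a BPE trainer learns, here is a minimal pure-Python sketch of the merge loop. It is a toy illustration only, not the library's implementation: the function name `train_toy_bpe`, the sample words, and the lexicographic tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def train_toy_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words (toy illustration)."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent pair wins; ties are broken lexicographically (an assumption).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word, replacing occurrences of the chosen pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


sample_words = ["low", "low", "lower", "lowest", "new", "newer"]
merges, corpus = train_toy_bpe(sample_words, num_merges=3)
print("Learned merges:", merges)
print("Segmented corpus:", dict(corpus))
```

Real trainers add byte-level alphabets, a `min_frequency` cutoff, and special tokens on top of this loop, but the core idea is the same: repeatedly fuse the most frequent adjacent pair.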
Method 2: HuggingFace `tokenizers` WordPiece
Used for BERT-family encoder models.
```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import Sequence, NFD, Lowercase, StripAccents

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2,
    continuing_subword_prefix="##",
)

tokenizer.train(["training_data.txt"], trainer)
tokenizer.save("my-wordpiece-tokenizer.json")
```
Method 3: HuggingFace `tokenizers` Unigram
Often gives the best compression ratio, which makes it a good fit for multilingual models.
```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Metaspace

tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = Metaspace()  # SentencePiece-style (uses ▁)

trainer = UnigramTrainer(
    vocab_size=32000,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
    unk_token="<unk>",
    shrinking_factor=0.75,  # Pruning aggressiveness
)

tokenizer.train(["training_data.txt"], trainer)
tokenizer.save("my-unigram-tokenizer.json")
```
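Once a Unigram vocabulary has been trained, text is typically segmented by picking the piece sequence with the highest total probability. The sketch below shows that search as a small Viterbi-style dynamic program in pure Python. The function name `viterbi_segment` and the toy probabilities are invented for illustration and do not come from any trained model.

```python
import math


def viterbi_segment(text, log_probs):
    """Return (pieces, score) for the most probable segmentation of `text`."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best log-probability of text[:end]
    best_start = [0] * (n + 1)          # where the last piece of that path starts
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]


# Hypothetical piece probabilities, chosen only to make the example readable.
toy_probs = {
    "u": 0.05, "n": 0.05, "h": 0.05, "a": 0.05, "p": 0.05, "y": 0.05,
    "un": 0.1, "happy": 0.1, "unhappy": 0.001,
}
toy_log_probs = {piece: math.log(p) for piece, p in toy_probs.items()}

pieces, score = viterbi_segment("unhappy", toy_log_probs)
print(pieces, round(score, 3))
```

With these toy numbers, splitting into two frequent pieces scores higher than using the single rare piece, which is the kind of trade-off the Unigram model makes.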
Method 4: Google SentencePiece (Language-Agnostic)
Trains directly on raw text and treats whitespace as an ordinary symbol, so no language-specific pre-tokenization is needed. Used by Llama (1 and 2), T5, and Gemini.
```python
import sentencepiece as spm

# Train BPE tokenizer
spm.SentencePieceTrainer.train(
    input="raw_corpus.txt",
    model_prefix="my_spm_tokenizer",
    vocab_size=32000,
    model_type="bpe",          # "bpe" or "unigram"
    character_coverage=1.0,    # 1.0 for Latin, 0.9995 for CJK
    num_threads=16,
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    user_defined_symbols=["<|system|>", "<|user|>", "<|assistant|>"],
    byte_fallback=True,        # Handle unknown chars via UTF-8 bytes
    split_digits=True,         # Split individual digits
    normalization_rule_name="identity",  # No normalization (common for LLMs)
)

# Load and use
sp = spm.SentencePieceProcessor()
sp.load("my_spm_tokenizer.model")

text = "Machine learning is transforming healthcare."
print(f"Tokens: {sp.encode(text, out_type=str)}")
print(f"IDs: {sp.encode(text, out_type=int)}")
print(f"Vocab size: {sp.get_piece_size()}")
```
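The `byte_fallback=True` option is easier to reason about with a small model of what it does: a character missing from the vocabulary is emitted as one token per UTF-8 byte instead of a single unknown token. The following is a simplified pure-Python sketch of that idea, not SentencePiece's code. The helper names and the character-level toy vocabulary are assumptions, while the `<0xNN>` token spelling mirrors the byte pieces SentencePiece uses.

```python
def encode_with_byte_fallback(text, vocab):
    """Character-level lookup; characters missing from `vocab` become byte tokens."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # One token per UTF-8 byte, e.g. U+00E9 -> <0xC3> <0xA9>
            tokens.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return tokens


def decode_with_byte_fallback(tokens):
    """Inverse of the encoder above: reassemble bytes, then decode as UTF-8."""
    out = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            out.append(int(tok[3:5], 16))
        else:
            out.extend(tok.encode("utf-8"))
    return out.decode("utf-8")


# Toy vocabulary (assumption): lowercase ASCII letters and the space character.
toy_vocab = set("abcdefghijklmnopqrstuvwxyz ")

tokens = encode_with_byte_fallback("caf\u00e9", toy_vocab)
print(tokens)
print(decode_with_byte_fallback(tokens))
```

Because every byte value has a token, nothing is lost on the round trip; the cost is that rare characters take several tokens each.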
Method 5: Adapt an Existing Tokenizer (Domain Adaptation)
Fastest approach — retrain vocabulary from a pretrained tokenizer's settings.
```python
from transformers import AutoTokenizer
from datasets import load_dataset

# Load a pretrained tokenizer as the base
old_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Prepare domain-specific corpus
dataset = load_dataset("your_medical_dataset", split="train")

def batch_iterator(dataset, batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Train new tokenizer (keeps type/settings, learns new vocab)
new_tokenizer = old_tokenizer.train_new_from_iterator(
    text_iterator=batch_iterator(dataset),
    vocab_size=32000,
    new_special_tokens=["<|system|>", "<|user|>", "<|assistant|>"]
)

# Compare efficiency
sample = "Patient presented with acute myocardial infarction."
old_count = len(old_tokenizer.tokenize(sample))
new_count = len(new_tokenizer.tokenize(sample))
print(f"Old: {old_count} tokens | New: {new_count} tokens")

new_tokenizer.save_pretrained("my-domain-tokenizer")
```
Note: `train_new_from_iterator()` only works with "fast" tokenizers (Rust backend). Check with `tokenizer.is_fast`.
Evaluating Tokenizer Quality
```python
def evaluate_tokenizer(tokenizer, test_texts: list[str]):
    """Compute key tokenizer quality metrics."""
    total_tokens = 0
    total_words = 0
    total_chars = 0

    for text in test_texts:
        tokens = tokenizer.encode(text)
        # HuggingFace `tokenizers` returns an Encoding (.ids); others return a list
        token_count = len(tokens.ids) if hasattr(tokens, 'ids') else len(tokens)
        total_tokens += token_count
        total_words += len(text.split())
        total_chars += len(text)

    fertility = total_tokens / total_words
    chars_per_token = total_chars / total_tokens

    print(f"Fertility (tokens/word): {fertility:.2f}")
    print(f"  → Lower is better. English BPE typical: 1.3–1.5")
    print(f"Chars per token: {chars_per_token:.2f}")
    print(f"  → Higher is better. GPT-4: ~4.0 for English")
```
| Metric | What It Measures | Good Value (English) |
|---|---|---|
| Fertility | Tokens per word | 1.3–1.5 |
| Chars/token | Characters per token | 3.5–4.5 |
| Bytes/token | Bytes per token | 3.5–5.0 |
| Unknown rate | % of tokens that are `<unk>` | 0% (with byte fallback) |
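The bytes/token row can be computed alongside fertility and chars/token. Below is a hedged sketch that accepts any `encode` callable returning a list of tokens. The function name `compression_metrics` is an assumption, and the whitespace splitter is only a stand-in so the example runs without a trained tokenizer.

```python
def compression_metrics(encode, texts):
    """Return fertility, chars/token and bytes/token for an `encode` callable."""
    total_tokens = sum(len(encode(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    total_chars = sum(len(t) for t in texts)
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    if total_tokens == 0 or total_words == 0:
        raise ValueError("need at least one token and one word to compute ratios")
    return {
        "fertility": total_tokens / total_words,
        "chars_per_token": total_chars / total_tokens,
        "bytes_per_token": total_bytes / total_tokens,
    }


# Stand-in "tokenizer" for demonstration only: split on whitespace.
sample_texts = ["na\u00efve caf\u00e9", "hello world"]
metrics = compression_metrics(str.split, sample_texts)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Bytes/token and chars/token diverge as soon as the text contains non-ASCII characters, which is why bytes/token is the fairer measure when comparing across languages.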
Algorithm Selection Guide
| Algorithm | Approach | Compression | Deterministic | Best For |
|---|---|---|---|---|
| BPE | Bottom-up merging by frequency | Good | Yes | Default for most LLMs (GPT, Llama, Mistral) |
| WordPiece | Bottom-up merging by likelihood | Moderate | Yes | BERT-family encoder models |
| Unigram | Top-down pruning by loss | Best | Yes by default (sampling optional) | Multilingual, best compression |
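The difference between the BPE and WordPiece rows comes down to how a candidate pair is scored. The sketch below follows the common textbook description: BPE takes the raw pair count, while WordPiece divides the pair count by the product of the individual symbol counts. This is an assumption about the idealized algorithms, and library trainers may implement the selection differently. The function names and toy words are hypothetical.

```python
from collections import Counter


def pair_statistics(words):
    """Count single symbols and adjacent symbol pairs over character-split words."""
    symbol_counts, pair_counts = Counter(), Counter()
    for word in words:
        symbols = list(word)
        symbol_counts.update(symbols)
        pair_counts.update(zip(symbols, symbols[1:]))
    return symbol_counts, pair_counts


def best_pair_bpe(words):
    """BPE-style choice: the most frequent adjacent pair."""
    _, pairs = pair_statistics(words)
    return max(pairs, key=lambda p: (pairs[p], p))


def best_pair_wordpiece(words):
    """WordPiece-style choice: count(ab) / (count(a) * count(b))."""
    symbols, pairs = pair_statistics(words)
    return max(
        pairs,
        key=lambda p: (pairs[p] / (symbols[p[0]] * symbols[p[1]]), p),
    )


toy_words = ["the", "then", "they", "zq"]
print("BPE picks:", best_pair_bpe(toy_words))
print("WordPiece picks:", best_pair_wordpiece(toy_words))
```

On this toy input the frequency rule favors a common pair, while the likelihood-style score favors a pair whose parts almost never occur apart.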
Vocabulary Size Guidelines
| Vocab Size | Use Case | Trade-off |
|---|---|---|
| 8K–16K | Small models, single language | Small embedding matrix but longer sequences |
| 32K | Standard (Llama 2, Mistral) | Good balance for English-dominant models |
| 50K–64K | Multilingual or code-heavy | Better coverage, larger embedding matrix |
| 100K–128K | Highly multilingual (GPT-4, Llama 3) | Excellent coverage |
| 200K+ | Extreme multilingual (GPT-4o, Llama 4) | Diminishing returns beyond 128K |
Trade-off: Larger vocab = fewer tokens per text (faster inference, more context) but larger embedding matrix (more parameters, more memory).
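The embedding side of this trade-off is simple arithmetic: each vocabulary entry costs one row of `hidden_size` parameters, doubled when the output projection is not tied to the input embedding. A small sketch follows; the hidden size of 4096 is an illustrative assumption, not a figure from the table above.

```python
def embedding_params(vocab_size, hidden_size, tied=True):
    """Parameters in the input embedding, plus the output projection if untied."""
    matrices = 1 if tied else 2
    return matrices * vocab_size * hidden_size


HIDDEN_SIZE = 4096  # assumed model width, for illustration only

for vocab_size in (32_000, 128_000):
    params = embedding_params(vocab_size, HIDDEN_SIZE, tied=False)
    print(f"vocab={vocab_size:>7,}  embedding params={params:,}")
```

Under these assumptions, quadrupling the vocabulary quadruples the embedding parameters, which matters most for small models where embeddings are a large share of the total.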
Training Data Requirements
| Scenario | Minimum Data | Recommended |
|---|---|---|
| Domain adaptation | 10–50 MB | 100+ MB |
| Single language from scratch | 100 MB | 1–5 GB |
| Multilingual | 500 MB per language | 1–10 GB per language |
| General-purpose LLM | 10+ GB | 50–100+ GB |
Common Pitfalls
- No byte fallback: without `byte_fallback=True`, unseen characters produce `<unk>` tokens. Always enable it
- Vocab too small: causes very long token sequences, wasting context window
- Vocab too large — wastes embedding parameters on rare tokens
- Mismatched pre-tokenization — the same normalizer and pre-tokenizer must be used at training and inference
- Adding tokens to a pretrained model carelessly — randomly initialized embeddings cause instability. Initialize new token embeddings with the mean of existing embeddings
- Not evaluating on target domain — always compute fertility and compression on representative test data
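The mean-initialization advice from the list above can be sketched without any framework. The example uses plain Python lists to stand in for an embedding matrix, and the function name `extend_embeddings_with_mean` is hypothetical. With a real model the same idea would be applied to the embedding weight tensor after it has been resized.

```python
def extend_embeddings_with_mean(embeddings, num_new_tokens):
    """Append `num_new_tokens` rows, each set to the mean of the existing rows."""
    if not embeddings:
        raise ValueError("need at least one existing embedding row")
    dim = len(embeddings[0])
    count = len(embeddings)
    # Column-wise mean over all existing token embeddings.
    mean_row = [sum(row[j] for row in embeddings) / count for j in range(dim)]
    # Copy the old rows, then add independent copies of the mean for new tokens.
    extended = [list(row) for row in embeddings]
    extended.extend(list(mean_row) for _ in range(num_new_tokens))
    return extended


# Tiny 3-token, 2-dimensional "embedding matrix" (made-up values).
old_embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
new_embeddings = extend_embeddings_with_mean(old_embeddings, num_new_tokens=2)
print(new_embeddings[-2:])
```

Starting new tokens at the mean keeps their initial logits close to those of an average token, which tends to be gentler on a pretrained model than random initialization.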
Key takeaway: For most custom use cases, use HuggingFace `tokenizers` (BPE) for full control or `sentencepiece` for multilingual. Use `train_new_from_iterator()` for quick domain adaptation from existing models. Never use `tiktoken` for training; it's inference-only.