How to create/train our own tokenizer?

Question

Accepted Answer

## How to Create/Train Your Own Tokenizer

Training a custom tokenizer lets you control how text is split into tokens for your specific domain, language, or use case. A good fit means fewer tokens per document, which lowers cost and leaves more of the context window for content.

---

### Why Train a Custom Tokenizer?

| Use Case | Problem with General Tokenizers | Benefit of Custom |
|----------|--------------------------------|-------------------|
| **Medical/Biomedical** | `"cardiomyopathy"` often splits into 4–5 tokens | Learns domain terms as 1–2 tokens |
| **Legal** | `"indemnification"`, `"estoppel"` get fragmented | Potentially 30–50% fewer tokens for contracts |
| **Code** | Operators like `===`, `->`, `::` split oddly | Code-aware tokenization |
| **Multilingual** | Hindi, Thai, Arabic can need 3–8x more tokens than English | Fairer compression across languages |
| **Cost optimization** | More tokens = higher API costs | Fewer tokens = lower cost + faster inference |

---

### Tokenizer Training Libraries

| Library | Install | Training? | Algorithms | Speed | Best For |
|---------|---------|-----------|------------|-------|----------|
| **`tokenizers`** (HuggingFace) | `pip install tokenizers` | Yes | BPE, WordPiece, Unigram | Very fast (Rust) | Full control, production use |
| **`sentencepiece`** (Google) | `pip install sentencepiece` | Yes | BPE, Unigram | Fast (C++) | Multilingual, raw text, Llama 2-style |
| **`transformers`** (HuggingFace) | `pip install transformers` | Adapt existing | Inherits from base | Fast | Quick domain adaptation |
| **`tiktoken`** (OpenAI) | `pip install tiktoken` | **No** (encode/decode only) | BPE inference only | Very fast (Rust) | Token counting only |

> **Important:** `tiktoken` is built for encoding and decoding with existing vocabularies; OpenAI has not released its production training code. Use `tokenizers` or `sentencepiece` for training.

---

### Method 1: HuggingFace `tokenizers` with BPE (Recommended)

The most flexible and widely used approach for modern LLMs.
```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.processors import TemplateProcessing
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.normalizers import NFC

# Step 1: Initialize tokenizer
tokenizer = Tokenizer(BPE(unk_token="<|unk|>"))

# Step 2: Set normalizer, pre-tokenizer, and decoder
tokenizer.normalizer = NFC()  # Unicode normalization
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
tokenizer.decoder = ByteLevelDecoder()

# Step 3: Configure trainer
trainer = BpeTrainer(
    vocab_size=32000,
    min_frequency=2,
    special_tokens=[
        "<|unk|>", "<|pad|>", "<|bos|>", "<|eos|>",
        "<|system|>", "<|user|>", "<|assistant|>",
    ],
    show_progress=True,
    initial_alphabet=ByteLevel.alphabet(),  # start from all 256 byte symbols
)

# Step 4: Train on your corpus
tokenizer.train(["corpus_part1.txt", "corpus_part2.txt"], trainer)

# Step 5: Add post-processing (auto-add BOS/EOS)
tokenizer.post_processor = TemplateProcessing(
    single="<|bos|> $A <|eos|>",
    pair="<|bos|> $A <|eos|> <|bos|> $B <|eos|>",
    special_tokens=[
        ("<|bos|>", tokenizer.token_to_id("<|bos|>")),
        ("<|eos|>", tokenizer.token_to_id("<|eos|>")),
    ],
)

# Step 6: Save and test
tokenizer.save("my-bpe-tokenizer.json")

encoded = tokenizer.encode("Hello, how are you?")
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
print(f"Decoded: {tokenizer.decode(encoded.ids)}")
```

---

### Method 2: HuggingFace `tokenizers` with WordPiece

Used for BERT-family encoder models.
```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import Sequence, NFD, Lowercase, StripAccents

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

# BERT-uncased style normalization: decompose, lowercase, drop accents
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2,
    continuing_subword_prefix="##",
)

tokenizer.train(["training_data.txt"], trainer)
tokenizer.save("my-wordpiece-tokenizer.json")
```

---

### Method 3: HuggingFace `tokenizers` with Unigram

Often compresses well and is a common choice for multilingual models.

```python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
from tokenizers.pre_tokenizers import Metaspace

tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = Metaspace()  # SentencePiece-style (uses ▁)

trainer = UnigramTrainer(
    vocab_size=32000,
    # The token names were lost when this page was extracted; these are the
    # conventional SentencePiece-style names, restored as an assumption.
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
    unk_token="<unk>",
    shrinking_factor=0.75,  # Pruning aggressiveness
)

tokenizer.train(["training_data.txt"], trainer)
tokenizer.save("my-unigram-tokenizer.json")
```

---

### Method 4: Google SentencePiece (Language-Agnostic)

Operates directly on raw text and treats whitespace as an ordinary symbol, so no language-specific pre-tokenization is needed. Used by Llama 2, T5, Gemini.
```python
import sentencepiece as spm

# Train BPE tokenizer
spm.SentencePieceTrainer.train(
    input="raw_corpus.txt",
    model_prefix="my_spm_tokenizer",
    vocab_size=32000,
    model_type="bpe",                     # "bpe" or "unigram"
    character_coverage=1.0,               # 1.0 for small character sets, 0.9995 for CJK
    num_threads=16,
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    user_defined_symbols=["<|system|>", "<|user|>", "<|assistant|>"],
    byte_fallback=True,                   # Handle unknown chars via UTF-8 bytes
    split_digits=True,                    # Split individual digits
    normalization_rule_name="identity",   # No normalization (common for LLMs)
)

# Load and use
sp = spm.SentencePieceProcessor()
sp.load("my_spm_tokenizer.model")

text = "Machine learning is transforming healthcare."
print(f"Tokens: {sp.encode(text, out_type=str)}")
print(f"IDs: {sp.encode(text, out_type=int)}")
print(f"Vocab size: {sp.get_piece_size()}")
```

---

### Method 5: Adapt an Existing Tokenizer (Domain Adaptation)

Fastest approach: keep a pretrained tokenizer's algorithm and settings, and learn a new vocabulary from your own corpus.

```python
from transformers import AutoTokenizer
from datasets import load_dataset

# Load a pretrained tokenizer as the base (this checkpoint is gated on the Hub)
old_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Prepare domain-specific corpus ("your_medical_dataset" is a placeholder name)
dataset = load_dataset("your_medical_dataset", split="train")

def batch_iterator(dataset, batch_size=1000):
    """Yield lists of raw text, one batch at a time."""
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Train new tokenizer (keeps type/settings, learns new vocab)
new_tokenizer = old_tokenizer.train_new_from_iterator(
    text_iterator=batch_iterator(dataset),
    vocab_size=32000,
    new_special_tokens=["<|system|>", "<|user|>", "<|assistant|>"],
)

# Compare efficiency
sample = "Patient presented with acute myocardial infarction."
old_count = len(old_tokenizer.tokenize(sample))
new_count = len(new_tokenizer.tokenize(sample))
print(f"Old: {old_count} tokens | New: {new_count} tokens")

new_tokenizer.save_pretrained("my-domain-tokenizer")
```

> **Note:** `train_new_from_iterator()` only works with "fast" tokenizers (Rust backend). Check with `tokenizer.is_fast`.

---

### Evaluating Tokenizer Quality

```python
def evaluate_tokenizer(tokenizer, test_texts: list[str]) -> None:
    """Print basic tokenizer quality metrics for a list of test texts."""
    total_tokens = 0
    total_words = 0
    total_chars = 0

    for text in test_texts:
        tokens = tokenizer.encode(text)
        # `tokenizers` returns an Encoding with `.ids`; other libraries return a list of IDs
        token_count = len(tokens.ids) if hasattr(tokens, "ids") else len(tokens)
        total_tokens += token_count
        total_words += len(text.split())
        total_chars += len(text)

    if total_tokens == 0 or total_words == 0:
        raise ValueError("test_texts must contain some non-empty text")

    fertility = total_tokens / total_words
    chars_per_token = total_chars / total_tokens

    print(f"Fertility (tokens/word): {fertility:.2f}")
    print("  → Lower is better. English BPE typical: 1.3–1.5")
    print(f"Chars per token: {chars_per_token:.2f}")
    print("  → Higher is better. GPT-4: ~4.0 for English")
```

| Metric | What It Measures | Good Value (English) |
|--------|-----------------|---------------------|
| **Fertility** | Tokens per word | 1.3–1.5 |
| **Chars/token** | Characters per token | 3.5–4.5 |
| **Bytes/token** | Bytes per token | 3.5–5.0 |
| **Unknown rate** | % of `<unk>` tokens | 0% (with byte fallback) |

---

### Algorithm Selection Guide

| Algorithm | Approach | Compression | Deterministic | Best For |
|-----------|----------|-------------|---------------|----------|
| **BPE** | Bottom-up merging by frequency | Good | Yes | Default for most LLMs (GPT, Llama, Mistral) |
| **WordPiece** | Bottom-up merging by likelihood | Moderate | Yes | BERT-family encoder models |
| **Unigram** | Top-down pruning by loss | Often best | Yes by default; can also sample segmentations | Multilingual models, subword regularization |

---

### Vocabulary Size Guidelines

| Vocab Size | Use Case | Trade-off |
|-----------|----------|----------|
| 8K–16K | Small models, single language | Small embedding matrix but longer sequences |
| 32K | Standard (Llama 2, Mistral) | Good balance for English-dominant models |
| 50K–64K | Multilingual or code-heavy | Better coverage, larger embedding matrix |
| 100K–128K | Highly multilingual (GPT-4, Llama 3) | Excellent coverage |
| 200K+ | Extreme multilingual (GPT-4o, Llama 4) | Diminishing returns beyond 128K |

> **Trade-off:** Larger vocab = fewer tokens per text (faster inference, more context) but larger embedding matrix (more parameters, more memory).

---

### Training Data Requirements

| Scenario | Minimum Data | Recommended |
|----------|-------------|-------------|
| Domain adaptation | 10–50 MB | 100+ MB |
| Single language from scratch | 100 MB | 1–5 GB |
| Multilingual | 500 MB per language | 1–10 GB per language |
| General-purpose LLM | 10+ GB | 50–100+ GB |

---

### Common Pitfalls

* **No byte fallback:** without `byte_fallback=True` in SentencePiece (or a byte-level pre-tokenizer in `tokenizers`), unseen characters produce `<unk>` tokens. Always enable one of the two
* **Vocab too small:** causes very long token sequences, wasting context window
* **Vocab too large:** wastes embedding parameters on rare tokens
* **Mismatched pre-tokenization:** the same normalizer and pre-tokenizer must be used at training and inference
* **Adding tokens to a pretrained model carelessly:** randomly initialized embeddings cause instability. Initialize new token embeddings with the **mean of existing embeddings**
* **Not evaluating on target domain:** always compute fertility and compression on representative test data

> **Key takeaway:** For most custom use cases, use HuggingFace `tokenizers` (BPE) for full control or `sentencepiece` for multilingual. Use `train_new_from_iterator()` for quick domain adaptation from existing models. Do not plan on `tiktoken` for training, since it only encodes and decodes.
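
To make the "bottom-up merging by frequency" entry in the Algorithm Selection Guide concrete, here is a minimal pure-Python sketch of a BPE training loop. It is a toy for intuition only, not how `tokenizers` or `sentencepiece` implement training; the function name `train_toy_bpe` and the tiny word list are made up for illustration, and it works on characters within words rather than bytes.

```python
from collections import Counter


def train_toy_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each distinct word becomes a tuple of symbols, starting as single characters.
    vocab = Counter(tuple(word) for word in words)
    merges: list[tuple[str, str]] = []

    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts: Counter = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol

        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab: Counter = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab

    return merges


merges = train_toy_bpe(["low", "low", "lower", "lowest", "newer", "wider"], num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'w'), ('e', 'r')]
```

The learned merges are applied in the same order at encoding time, which is why the merge list (not just the final vocabulary) is part of a saved BPE tokenizer.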

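The embedding-matrix side of the vocabulary size trade-off is simple arithmetic: one row of `hidden_dim` parameters per token. The sketch below assumes a hidden size of 4096 and 2 bytes per parameter (fp16/bf16); both values and the helper name `embedding_cost` are illustrative assumptions, not figures from any particular model.

```python
def embedding_cost(vocab_size: int, hidden_dim: int, bytes_per_param: int = 2) -> tuple[int, float]:
    """Return (parameter count, size in MiB) of a vocab_size x hidden_dim embedding matrix."""
    params = vocab_size * hidden_dim
    mib = params * bytes_per_param / 2**20
    return params, mib


# Assumed hidden size of 4096; compare a 32K and a 128K vocabulary.
for vocab_size in (32_000, 128_000):
    params, mib = embedding_cost(vocab_size, hidden_dim=4096)
    print(f"{vocab_size:>7,} tokens: {params:,} params, {mib:,.0f} MiB")
```

If the model does not tie its input embeddings to the output projection, the same cost is paid a second time for the output layer.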