How to create/train our own tokenizer?

#gen-ai #tokens #llm #tokenization #tokenizers #sentencepiece #transformers #bpe #training

Answer

How to Create/Train Your Own Tokenizer

Training a custom tokenizer lets you control how text is split into tokens for your specific domain, language, or use case. A tokenizer fitted to your data usually produces fewer tokens for the same text, which lowers cost and fits more content into the context window.


Why Train a Custom Tokenizer?

| Use Case | Problem with General Tokenizers | Benefit of Custom |
| --- | --- | --- |
| Medical/Biomedical | `"cardiomyopathy"` splits into 4–5 tokens | Learns domain terms as 1–2 tokens |
| Legal | `"indemnification"`, `"estoppel"` get fragmented | 30–50% fewer tokens for contracts |
| Code | Operators like `===`, `->`, `::` split oddly | Code-aware tokenization |
| Multilingual | Hindi, Thai, Arabic need 3–8x more tokens than English | Fair compression across languages |
| Cost optimization | More tokens = higher API costs | Fewer tokens = lower cost + faster inference |

Tokenizer Training Libraries

| Library | Install | Training? | Algorithms | Speed | Best For |
| --- | --- | --- | --- | --- | --- |
| `tokenizers` (HuggingFace) | `pip install tokenizers` | Yes | BPE, WordPiece, Unigram | Very fast (Rust) | Full control, production use |
| `sentencepiece` (Google) | `pip install sentencepiece` | Yes | BPE, Unigram | Fast (C++) | Multilingual, raw text, Llama-style |
| `transformers` (HuggingFace) | `pip install transformers` | Adapt existing | Inherits from base | Fast | Quick domain adaptation |
| `tiktoken` (OpenAI) | `pip install tiktoken` | No (read-only) | BPE inference only | Fastest (Rust) | Token counting only |

Important: `tiktoken` cannot train custom tokenizers; OpenAI never released their training code. Use `tokenizers` or `sentencepiece` for training.


Method 1: HuggingFace `tokenizers` BPE (Recommended)

The most flexible and widely used approach for modern LLMs.

python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.processors import TemplateProcessing
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.normalizers import NFC

# Step 1: Initialize tokenizer
tokenizer = Tokenizer(BPE(unk_token="<|unk|>"))

# Step 2: Set normalizer and pre-tokenizer
tokenizer.normalizer = NFC()  # Unicode normalization
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
tokenizer.decoder = ByteLevelDecoder()

# Step 3: Configure trainer
trainer = BpeTrainer(
    vocab_size=32000,
    min_frequency=2,
    special_tokens=[
        "<|unk|>", "<|pad|>", "<|bos|>", "<|eos|>",
        "<|system|>", "<|user|>", "<|assistant|>"
    ],
    show_progress=True,
    initial_alphabet=ByteLevel.alphabet(),
)

# Step 4: Train on your corpus
tokenizer.train(["corpus_part1.txt", "corpus_part2.txt"], trainer)

# Step 5: Add post-processing (auto-add BOS/EOS)
tokenizer.post_processor = TemplateProcessing(
    single="<|bos|> $A <|eos|>",
    pair="<|bos|> $A <|eos|> <|bos|> $B <|eos|>",
    special_tokens=[
        ("<|bos|>", tokenizer.token_to_id("<|bos|>")),
        ("<|eos|>", tokenizer.token_to_id("<|eos|>")),
    ],
)

# Step 6: Save and test
tokenizer.save("my-bpe-tokenizer.json")

encoded = tokenizer.encode("Hello, how are you?")
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
print(f"Decoded: {tokenizer.decode(encoded.ids)}")
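
To make clear what the trainer above is learning, here is a minimal pure-Python sketch of the BPE merge loop. It is a toy illustration, not the library's implementation: `train_toy_bpe`, the tiny corpus, and the tie-breaking rule are all assumptions made for this example, and it works on characters rather than bytes.

```python
from collections import Counter

def train_toy_bpe(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (toy, character-level)."""
    # Each distinct word is stored as a tuple of symbols, weighted by its frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties go to the lexicographically largest pair
        # (an arbitrary rule chosen here only to keep the toy deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the chosen pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
toy_merges = train_toy_bpe(corpus, num_merges=3)
print(toy_merges)
```

Each learned merge becomes one vocabulary entry, which is why `vocab_size` in the real trainer effectively sets how many merges are performed.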

Method 2: HuggingFace `tokenizers` WordPiece

Used for BERT-family encoder models.

python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
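from tokenizers.pre_tokenizers import Whitespace
from tokenizers.normalizers import Sequence, NFD, Lowercase, StripAccents
from tokenizers.decoders import WordPiece as WordPieceDecoder

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = Sequence([NFD(), Lowercase(), StripAccents()])  # BERT-uncased style
tokenizer.pre_tokenizer = Whitespace()
tokenizer.decoder = WordPieceDecoder(prefix="##")  # Rejoin "##" continuation pieces on decode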

trainer = WordPieceTrainer(
    vocab_size=30522,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
    min_frequency=2,
    continuing_subword_prefix="##",
)

tokenizer.train(["training_data.txt"], trainer)
tokenizer.save("my-wordpiece-tokenizer.json")
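
Once a WordPiece vocabulary exists, encoding is usually described as greedy longest-match-first. The sketch below illustrates that idea in plain Python; `wordpiece_encode` and `toy_wp_vocab` are hypothetical names for this example, and the real library adds normalization, pre-tokenization, and a maximum word length on top.

```python
def wordpiece_encode(word, vocab, unk="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word maps to the unknown token
        pieces.append(match)
        start = end
    return pieces

toy_wp_vocab = {"token", "##izer", "##ize", "##r", "un", "##known"}
print(wordpiece_encode("tokenizer", toy_wp_vocab))
print(wordpiece_encode("xyz", toy_wp_vocab))
```

Note how `##izer` is preferred over `##ize` plus `##r` because the longest match is tried first.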

Method 3: HuggingFace `tokenizers` Unigram

Often compresses very well and is a common choice for multilingual models.

python
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer
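from tokenizers.pre_tokenizers import Metaspace
from tokenizers.decoders import Metaspace as MetaspaceDecoder

tokenizer = Tokenizer(Unigram())
tokenizer.pre_tokenizer = Metaspace()  # SentencePiece-style (uses ▁)
tokenizer.decoder = MetaspaceDecoder()  # Turn ▁ back into spaces when decoding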

trainer = UnigramTrainer(
    vocab_size=32000,
    special_tokens=["<unk>", "<s>", "</s>", "<pad>"],
    unk_token="<unk>",
    shrinking_factor=0.75,  # Pruning aggressiveness
)

tokenizer.train(["training_data.txt"], trainer)
tokenizer.save("my-unigram-tokenizer.json")
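
A Unigram model assigns each piece a probability and, at encoding time, picks the segmentation with the highest total log-probability. The following is a small Viterbi sketch of that step under assumed, hand-picked probabilities; `viterbi_segment` and `toy_logprobs` are illustrative names, not part of the `tokenizers` API.

```python
import math

def viterbi_segment(text, logprobs):
    """Return the highest-scoring segmentation of text under a unigram piece model."""
    n = len(text)
    best_score = [0.0] + [-math.inf] * n  # best_score[i]: best log-prob of text[:i]
    back = [0] * (n + 1)                  # back[i]: start index of the last piece
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logprobs:
                score = best_score[start] + logprobs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Hypothetical piece probabilities, chosen by hand for the example.
toy_logprobs = {p: math.log(q) for p, q in {
    "un": 0.1, "related": 0.05, "relate": 0.04, "d": 0.02,
    "u": 0.01, "n": 0.01, "rel": 0.01, "ated": 0.01,
}.items()}
print(viterbi_segment("unrelated", toy_logprobs))
```

Training works in the opposite direction: it starts from a large candidate vocabulary and repeatedly prunes the pieces whose removal hurts the corpus likelihood least, which is what `shrinking_factor` controls.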

Method 4: Google SentencePiece (Language-Agnostic)

Operates directly on raw text and treats whitespace as an ordinary symbol (▁), so no language-specific pre-tokenization is needed. Used by Llama 1/2, T5, Gemini.

python
import sentencepiece as spm

# Train BPE tokenizer
spm.SentencePieceTrainer.train(
    input="raw_corpus.txt",
    model_prefix="my_spm_tokenizer",
    vocab_size=32000,
    model_type="bpe",             # "bpe" or "unigram"
    character_coverage=1.0,       # 1.0 for Latin, 0.9995 for CJK
    num_threads=16,
    pad_id=0,
    unk_id=1,
    bos_id=2,
    eos_id=3,
    user_defined_symbols=["<|system|>", "<|user|>", "<|assistant|>"],
    byte_fallback=True,           # Handle unknown chars via UTF-8 bytes
    split_digits=True,            # Split individual digits
    normalization_rule_name="identity",  # No normalization (common for LLMs)
)

# Load and use
sp = spm.SentencePieceProcessor()
sp.load("my_spm_tokenizer.model")

text = "Machine learning is transforming healthcare."
print(f"Tokens: {sp.encode(text, out_type=str)}")
print(f"IDs: {sp.encode(text, out_type=int)}")
print(f"Vocab size: {sp.get_piece_size()}")
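
The effect of `byte_fallback=True` can be pictured with a simplified sketch: anything outside the known vocabulary is emitted as its UTF-8 bytes, written as `<0xNN>` pieces, instead of collapsing to `<unk>`. The function below works per character for clarity, which is an assumption of this example; `encode_with_byte_fallback` and `known_chars` are hypothetical names, and SentencePiece itself operates on learned pieces.

```python
def encode_with_byte_fallback(text, known_chars):
    """Keep known characters as-is; spell unknown ones out as UTF-8 byte pieces."""
    pieces = []
    for ch in text:
        if ch in known_chars:
            pieces.append(ch)
        else:
            # One piece per byte, formatted like <0xC3>.
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces

known_chars = set("abc ")
print(encode_with_byte_fallback("a é", known_chars))
```

Because the byte pieces can always be concatenated and decoded back, no input is lost, at the price of several tokens per unseen character.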

Method 5: Adapt an Existing Tokenizer (Domain Adaptation)

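Fastest approach: retrain the vocabulary while reusing a pretrained tokenizer's settings (algorithm, normalization, pre-tokenization, special-token layout). The new vocabulary no longer lines up with the original model's embedding matrix, so this suits training a new model or re-learning embeddings rather than drop-in use with the existing checkpoint.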

python
from transformers import AutoTokenizer
from datasets import load_dataset

# Load a pretrained tokenizer as the base
old_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Prepare domain-specific corpus
dataset = load_dataset("your_medical_dataset", split="train")

def batch_iterator(dataset, batch_size=1000):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

# Train new tokenizer (keeps type/settings, learns new vocab)
new_tokenizer = old_tokenizer.train_new_from_iterator(
    text_iterator=batch_iterator(dataset),
    vocab_size=32000,
    new_special_tokens=["<|system|>", "<|user|>", "<|assistant|>"]
)

# Compare efficiency
sample = "Patient presented with acute myocardial infarction."
old_count = len(old_tokenizer.tokenize(sample))
new_count = len(new_tokenizer.tokenize(sample))
print(f"Old: {old_count} tokens | New: {new_count} tokens")

new_tokenizer.save_pretrained("my-domain-tokenizer")

Note: `train_new_from_iterator()` only works with "fast" tokenizers (Rust backend). Check with `tokenizer.is_fast`.


Evaluating Tokenizer Quality

python
def evaluate_tokenizer(tokenizer, test_texts: list[str]):
    """Compute key tokenizer quality metrics."""
    total_tokens = 0
    total_words = 0
    total_chars = 0

    for text in test_texts:
        tokens = tokenizer.encode(text)
        token_count = len(tokens.ids) if hasattr(tokens, 'ids') else len(tokens)
        total_tokens += token_count
        total_words += len(text.split())
        total_chars += len(text)

    # Guard against empty input before dividing.
    if total_words == 0 or total_tokens == 0:
        raise ValueError("test_texts produced no words or no tokens")

    fertility = total_tokens / total_words
    chars_per_token = total_chars / total_tokens

    print(f"Fertility (tokens/word): {fertility:.2f}")
    print("  → Lower is better. English BPE typical: 1.3–1.5")
    print(f"Chars per token: {chars_per_token:.2f}")
    print("  → Higher is better. GPT-4: ~4.0 for English")

| Metric | What It Measures | Good Value (English) |
| --- | --- | --- |
| Fertility | Tokens per word | 1.3–1.5 |
| Chars/token | Characters per token | 3.5–4.5 |
| Bytes/token | Bytes per token | 3.5–5.0 |
| Unknown rate | % of `<unk>` tokens | 0% (with byte fallback) |
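
Bytes per token is the one metric in the table that the function above does not compute, and it is the fairest one across scripts because non-Latin characters take more than one byte in UTF-8. A minimal sketch follows, assuming `encode` is any callable that returns a sequence of tokens; `bytes_per_token` is a hypothetical helper and `str.split` is only a stand-in tokenizer.

```python
def bytes_per_token(encode, texts):
    """Average number of UTF-8 bytes covered by each token."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    if total_tokens == 0:
        raise ValueError("no tokens produced")
    return total_bytes / total_tokens

# Whitespace splitting stands in for a real tokenizer's encode function.
sample_ratio = bytes_per_token(str.split, ["héllo world"])
print(sample_ratio)
```

With a HuggingFace `tokenizers` object, something like `lambda t: tokenizer.encode(t).ids` could be passed as `encode`.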

Algorithm Selection Guide

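| Algorithm | Approach | Compression | Deterministic | Best For |
| --- | --- | --- | --- | --- |
| BPE | Bottom-up merging by frequency | Good | Yes | Default for most LLMs (GPT, Llama, Mistral) |
| WordPiece | Bottom-up merging by likelihood | Moderate | Yes | BERT-family encoder models |
| Unigram | Top-down pruning by loss | Good to best (corpus-dependent) | Yes by default (most likely segmentation); sampling is optional | Multilingual models |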
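
The difference between "merging by frequency" and "merging by likelihood" shows up in which pair gets merged first. The sketch below compares the two criteria on a toy corpus, using the commonly cited WordPiece score `freq(pair) / (freq(first) * freq(second))`. Treat that formula and the helper name `first_merge_choices` as assumptions of this example rather than a description of any specific implementation.

```python
from collections import Counter

def first_merge_choices(words):
    """Return the pair BPE would merge first and the pair WordPiece would merge first."""
    symbol_freq, pair_freq = Counter(), Counter()
    for word in words:
        symbol_freq.update(word)               # count individual characters
        pair_freq.update(zip(word, word[1:]))  # count adjacent character pairs

    def wordpiece_score(pair):
        a, b = pair
        # Frequent pairs made of otherwise rare symbols score highest.
        return pair_freq[pair] / (symbol_freq[a] * symbol_freq[b])

    bpe_pair = max(pair_freq, key=lambda p: pair_freq[p])
    wordpiece_pair = max(pair_freq, key=wordpiece_score)
    return bpe_pair, wordpiece_pair

toy_words = ["ab"] * 6 + ["ac"] * 5 + ["bc"] * 4 + ["xy"] * 2
bpe_pair, wordpiece_pair = first_merge_choices(toy_words)
print("BPE merges first:", bpe_pair)
print("WordPiece merges first:", wordpiece_pair)
```

BPE picks the most common pair outright, while the likelihood-style score favors `x` and `y`, which occur rarely but only ever together.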

Vocabulary Size Guidelines

| Vocab Size | Use Case | Trade-off |
| --- | --- | --- |
| 8K–16K | Small models, single language | Small embedding matrix but longer sequences |
| 32K | Standard (Llama 2, Mistral) | Good balance for English-dominant models |
| 50K–64K | Multilingual or code-heavy | Better coverage, larger embedding matrix |
| 100K–128K | Highly multilingual (GPT-4, Llama 3) | Excellent coverage |
| 200K+ | Extreme multilingual (GPT-4o, Llama 4) | Diminishing returns beyond 128K |

Trade-off: Larger vocab = fewer tokens per text (faster inference, more context) but larger embedding matrix (more parameters, more memory).
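
The parameter side of this trade-off is simple arithmetic: each embedding matrix has `vocab_size * hidden_dim` entries, and a model with an untied output projection carries two of them. The sketch below assumes a hidden size of 4096 purely for illustration; `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size, hidden_dim, tied=False):
    """Parameters in the input embedding plus, if untied, the output projection."""
    per_matrix = vocab_size * hidden_dim
    return per_matrix if tied else 2 * per_matrix

# hidden_dim=4096 is an assumed value for a mid-sized model.
for vocab_size in (32_000, 128_000):
    total = embedding_params(vocab_size, hidden_dim=4096)
    print(f"vocab={vocab_size:>7,}  embedding params={total:,}")
```

For a small model this overhead can be a large share of the total parameter count, which is why small models tend to use small vocabularies.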


Training Data Requirements

| Scenario | Minimum Data | Recommended |
| --- | --- | --- |
| Domain adaptation | 10–50 MB | 100+ MB |
| Single language from scratch | 100 MB | 1–5 GB |
| Multilingual | 500 MB per language | 1–10 GB per language |
| General-purpose LLM | 10+ GB | 50–100+ GB |

Common Pitfalls

  • No byte fallback: in SentencePiece, without `byte_fallback=True`, unseen characters produce `<unk>` tokens. Enable it, or use byte-level BPE, which covers every byte by construction
  • Vocab too small: causes very long token sequences, wasting context window
  • Vocab too large: wastes embedding parameters on rare tokens
  • Mismatched pre-tokenization: the same normalizer and pre-tokenizer must be used at training and inference
  • Adding tokens to a pretrained model carelessly: randomly initialized embeddings cause instability. Initialize new token embeddings with the mean of existing embeddings
  • Not evaluating on target domain: always compute fertility and compression on representative test data
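
The mean-initialization advice from the list above can be sketched without any framework. The example uses plain Python lists so it stays self-contained; `extend_embeddings` is a hypothetical helper, and in practice the same operation would be applied to the model's embedding tensor after resizing it.

```python
def extend_embeddings(embeddings, num_new):
    """Append num_new rows, each set to the column-wise mean of the existing rows."""
    dim = len(embeddings[0])
    count = len(embeddings)
    mean_row = [sum(row[j] for row in embeddings) / count for j in range(dim)]
    # Copy the mean for every new token so the rows can later diverge independently.
    return embeddings + [list(mean_row) for _ in range(num_new)]

old_embeddings = [[1.0, 2.0], [3.0, 4.0]]
new_embeddings = extend_embeddings(old_embeddings, num_new=2)
print(new_embeddings)
```

Starting new tokens at the mean keeps their initial outputs close to what the model already produces, which tends to be gentler than random initialization.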

Key takeaway: For most custom use cases, use HuggingFace `tokenizers` (BPE) for full control or `sentencepiece` for multilingual. Use `train_new_from_iterator()` for quick domain adaptation from existing models. Never use `tiktoken` for training; it is inference-only.