When humans read text, we effortlessly decompose continuous streams of characters into meaningful units—words, punctuation, numbers, and symbols. We do this so naturally that we rarely consider the underlying process. Yet for machines, this decomposition is a non-trivial problem that sits at the very foundation of all natural language processing.
Tokenization is the process of segmenting raw text into discrete units called tokens. These tokens become the atomic building blocks upon which all subsequent NLP processing depends. Whether you're building a sentiment classifier, a search engine, a machine translation system, or a large language model, tokenization is invariably the first transformation your text data undergoes.
The quality of your tokenization directly impacts everything downstream. Poor tokenization decisions propagate through your entire pipeline, creating vocabulary mismatches, semantic distortions, and unexpected model behaviors that are difficult to diagnose after the fact.
By the end of this page, you will understand the theoretical foundations of tokenization, master multiple tokenization strategies (word-level, subword, character-level), navigate edge cases and language-specific challenges, and make informed decisions about which tokenization approach suits your specific NLP task.
At its core, tokenization answers a deceptively simple question: What constitutes a meaningful unit of text?
The answer depends on your perspective, your task, and crucially, the language you're processing. Let's build up from first principles.
Text is fundamentally a sequence of characters. Tokenization imposes structure on this sequence by identifying boundaries between meaningful units. The challenge lies in defining 'meaningful' in a way that serves machine learning objectives.
Formal Definition:
Given a text string $T = c_1 c_2 c_3 \ldots c_n$ consisting of $n$ characters, tokenization produces a sequence of tokens $[t_1, t_2, \ldots, t_k]$ where each token $t_i$ is a contiguous subsequence of $T$, and the concatenation of all tokens (potentially with delimiters) reconstructs the original text.
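As a minimal illustration of this definition, the sketch below splits on a single delimiter and checks that joining the tokens reconstructs the original string (the reversibility property referenced above):

```python
# Minimal illustration of the formal definition: a whitespace tokenizer and a
# check that the tokens, joined with the delimiter, reconstruct the input.
text = "The cat sat on the mat."
tokens = text.split(" ")          # each token is a contiguous subsequence of the text

print(tokens)                     # ['The', 'cat', 'sat', 'on', 'the', 'mat.']
assert " ".join(tokens) == text   # lossless reconstruction with the delimiter
```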
Key Properties of Tokenization:

- Reversibility: the token sequence (with delimiters) should allow reconstruction of the original text.
- Coverage: any input, including words never seen during training, should be representable.
- Compactness: the vocabulary should stay small enough to learn and store efficiently.
- Meaningfulness: tokens should correspond to units that carry semantic or morphological signal.
- Efficiency: texts should map to reasonably short token sequences.

The tension between these properties drives much of the complexity in tokenization design: character-level tokenization maximizes coverage but sacrifices meaningfulness and sequence efficiency, while word-level tokenization does the opposite.
| Approach | Granularity | Vocabulary Size | OOV Handling | Use Cases |
|---|---|---|---|---|
| Word-Level | Whole words | Large (50K-500K+) | Poor (explicit OOV token) | Classical NLP, bag-of-words |
| Subword | Morphemes/pieces | Medium (8K-50K) | Good (decomposes unknown) | Transformers, neural MT |
| Character-Level | Individual characters | Small (100-300) | Perfect (no OOV possible) | Spelling correction, OCR |
| Sentence-Level | Full sentences | Very large | Poor | Sentence embeddings |
Word-level tokenization is the most intuitive approach: split text on whitespace and punctuation to extract individual words. This mirrors how humans naturally perceive text and produces tokens that are immediately interpretable.
The Naive Approach:
The simplest tokenizer splits on whitespace:
Input: "Hello, world!"
Output: ["Hello,", "world!"]
But immediately we see problems. Punctuation is attached to words, creating vocabulary entries like "Hello," and "world!" that differ from "Hello" and "world". A corpus might contain "cat", "cat.", "cat,", "cat!", "cat?" as five distinct vocabulary items—all referring to the same concept.
```python
import re
from typing import List

def naive_tokenize(text: str) -> List[str]:
    """
    Naive whitespace-based tokenization.
    PROBLEM: Punctuation attached to words
    """
    return text.split()

def basic_word_tokenize(text: str) -> List[str]:
    """
    Improved tokenization with punctuation handling.
    Uses regex to separate words from punctuation.
    """
    # Pattern explanation:
    #   \w+     : Match sequences of word characters (letters, digits, underscore)
    #   [^\w\s] : Match non-word, non-whitespace characters (punctuation)
    pattern = r"\w+|[^\w\s]"
    return re.findall(pattern, text)

def advanced_word_tokenize(text: str) -> List[str]:
    """
    Advanced tokenization handling common edge cases:
    - Contractions (don't -> do, not)
    - Possessives (John's kept as a single token)
    - Hyphenation (self-aware kept as a single token)
    - Numbers with formatting (1,000.50 -> single token)
    - Abbreviations (U.S.A. -> single token)
    """
    # Handle contractions first
    contractions = {
        "can't": ["can", "not"],
        "won't": ["will", "not"],
        "n't": ["not"],
        "'re": ["are"],
        "'ve": ["have"],
        "'ll": ["will"],
        "'d": ["would"],
        "'m": ["am"],
    }

    # Tokenize with multiple patterns; this regex handles most English edge cases
    pattern = r"""(?x)            # verbose mode
        (?:[A-Z]\.)+              # Abbreviations like U.S.A.
      | \d+(?:[,.]\d+)*           # Numbers with decimal points or thousands separators
      | \w+(?:[-']\w+)*           # Words with internal hyphens or apostrophes
      | \.{2,}                    # Ellipsis
      | [^\w\s]                   # Single punctuation
    """
    tokens = re.findall(pattern, text)

    # Expand contractions
    expanded = []
    for token in tokens:
        token_lower = token.lower()
        if token_lower in contractions:
            expanded.extend(contractions[token_lower])
        elif token_lower.endswith("n't"):
            expanded.extend([token[:-3], "not"])
        else:
            expanded.append(token)

    return expanded

# Demonstration
examples = [
    "Hello, world!",
    "I can't believe it's not butter.",
    "The U.S.A. has 50 states.",
    "The price is $1,000.50.",
    "self-aware robots are coming...",
]

print("=== Tokenization Comparison ===\n")
for text in examples:
    print(f"Input: {text}")
    print(f"  Naive:    {naive_tokenize(text)}")
    print(f"  Basic:    {basic_word_tokenize(text)}")
    print(f"  Advanced: {advanced_word_tokenize(text)}")
    print()
```

Standard Tokenizers in Practice:
While custom regex-based tokenizers work for simple cases, production systems typically rely on battle-tested libraries that handle thousands of edge cases:
```python
import nltk
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
import spacy

# Download required NLTK data (run once)
# nltk.download('punkt')

# NLTK tokenization
text = "I can't believe they've done this! The U.S.A. is amazing."

# NLTK word_tokenize (Penn Treebank standard)
nltk_tokens = word_tokenize(text)
print(f"NLTK word_tokenize: {nltk_tokens}")
# Output: ['I', 'ca', "n't", 'believe', 'they', "'ve", 'done', 'this', '!',
#          'The', 'U.S.A.', 'is', 'amazing', '.']

# TreebankWordTokenizer explicitly
treebank = TreebankWordTokenizer()
treebank_tokens = treebank.tokenize(text)
print(f"TreebankWordTokenizer: {treebank_tokens}")

# spaCy tokenization (requires: pip install spacy && python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
spacy_tokens = [token.text for token in doc]
print(f"spaCy tokens: {spacy_tokens}")
# Output: ['I', 'ca', "n't", 'believe', 'they', "'ve", 'done', 'this', '!',
#          'The', 'U.S.A.', 'is', 'amazing', '.']

# spaCy also provides rich token metadata
print("\nspaCy token details:")
for token in doc[:5]:
    print(f"  {token.text:12} | is_alpha: {token.is_alpha:5} | is_punct: {token.is_punct:5} | is_stop: {token.is_stop}")
```

Word-level tokenization's fundamental weakness is vocabulary management. A fixed vocabulary built from training data cannot represent words never seen before. These 'out-of-vocabulary' (OOV) words get mapped to a special <UNK> token, losing all semantic information. For morphologically rich languages or domains with specialized terminology, OOV rates can exceed 10-20%, severely degrading model performance.
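A tiny sketch of what this OOV mapping looks like in practice (the vocabulary and `<UNK>` handling below are illustrative, not tied to any particular library):

```python
# Hypothetical fixed word-level vocabulary: unseen words collapse to <UNK>.
vocab = {"<UNK>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def encode_words(tokens):
    """Map each word to its ID, falling back to <UNK> for unseen words."""
    return [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]

print(encode_words(["the", "cat", "sat"]))          # [1, 2, 3]
print(encode_words(["the", "ocelot", "pounced"]))   # [1, 0, 0]  <- semantics lost
```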
Subword tokenization represents the modern solution to the OOV problem. Instead of treating words as atomic units, subword methods decompose text into smaller, reusable pieces that can combine to represent any word—including words never seen during training.
The Core Insight:
Words share morphological components. "unhappiness", "unhappy", "happiness", and "happy" all share common subunits. If we tokenize at the subword level, we can:

- represent any word, including unseen ones, by composing known pieces;
- share statistical strength across morphologically related words (e.g., "un-", "happi", "-ness");
- keep the vocabulary at a fixed, manageable size.
Byte Pair Encoding (BPE) in Detail:
BPE is the most widely used subword algorithm. It works by:

1. Initializing the vocabulary with individual characters (plus a word-boundary marker).
2. Counting the frequency of every adjacent symbol pair in the training corpus.
3. Merging the most frequent pair into a single new symbol and adding it to the vocabulary.
4. Repeating steps 2-3 until the target vocabulary size is reached.
The algorithm stores merge rules, which are applied in order during tokenization to convert text into subword sequences.
```python
from collections import Counter
from typing import Dict, List, Tuple

class SimpleBPE:
    """
    A simplified BPE implementation for educational purposes.
    Production systems should use HuggingFace tokenizers or SentencePiece.
    """

    def __init__(self, vocab_size: int = 1000):
        self.vocab_size = vocab_size
        self.merges: List[Tuple[str, str]] = []
        self.vocab: set = set()

    def _get_pair_counts(self, word_freqs: Dict[Tuple[str, ...], int]) -> Counter:
        """Count frequency of adjacent symbol pairs across all words."""
        pairs = Counter()
        for word, freq in word_freqs.items():
            symbols = list(word)
            for i in range(len(symbols) - 1):
                pair = (symbols[i], symbols[i + 1])
                pairs[pair] += freq
        return pairs

    def _merge_pair(
        self, pair: Tuple[str, str], word_freqs: Dict[Tuple[str, ...], int]
    ) -> Dict[Tuple[str, ...], int]:
        """Merge all instances of a pair in the vocabulary."""
        new_word_freqs = {}
        bigram = pair[0] + pair[1]
        for word, freq in word_freqs.items():
            new_word = []
            i = 0
            while i < len(word):
                if i < len(word) - 1 and word[i] == pair[0] and word[i + 1] == pair[1]:
                    new_word.append(bigram)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word_freqs[tuple(new_word)] = freq
        return new_word_freqs

    def fit(self, corpus: List[str]) -> None:
        """
        Learn BPE merges from a corpus.

        Args:
            corpus: List of text documents
        """
        # Step 1: Compute word frequencies from corpus
        word_counter = Counter()
        for text in corpus:
            words = text.lower().split()
            word_counter.update(words)

        # Step 2: Initialize word_freqs with character-level tokenization
        #         Add </w> to mark word boundaries
        word_freqs: Dict[Tuple[str, ...], int] = {}
        for word, freq in word_counter.items():
            chars = tuple(list(word) + ["</w>"])
            word_freqs[chars] = freq

        # Initialize vocabulary with characters
        self.vocab = set()
        for word in word_freqs:
            self.vocab.update(word)

        # Step 3: Iteratively merge most frequent pairs
        num_merges = self.vocab_size - len(self.vocab)
        for i in range(num_merges):
            pair_counts = self._get_pair_counts(word_freqs)
            if not pair_counts:
                break

            # Find most frequent pair
            best_pair = pair_counts.most_common(1)[0][0]

            # Merge this pair
            word_freqs = self._merge_pair(best_pair, word_freqs)

            # Record the merge and update vocabulary
            self.merges.append(best_pair)
            merged_token = best_pair[0] + best_pair[1]
            self.vocab.add(merged_token)

            if (i + 1) % 100 == 0:
                print(f"Completed {i + 1} merges, vocab size: {len(self.vocab)}")

        print(f"Final vocabulary size: {len(self.vocab)}")

    def tokenize(self, text: str) -> List[str]:
        """
        Tokenize text using learned BPE merges.
        """
        words = text.lower().split()
        all_tokens = []

        for word in words:
            # Start with character-level tokenization
            tokens = list(word) + ["</w>"]

            # Apply merges in order
            for merge in self.merges:
                i = 0
                new_tokens = []
                while i < len(tokens):
                    if (i < len(tokens) - 1
                            and tokens[i] == merge[0]
                            and tokens[i + 1] == merge[1]):
                        new_tokens.append(merge[0] + merge[1])
                        i += 2
                    else:
                        new_tokens.append(tokens[i])
                        i += 1
                tokens = new_tokens

            all_tokens.extend(tokens)

        return all_tokens

# Example usage
corpus = [
    "the cat sat on the mat",
    "the cat ate the rat",
    "the rat ran from the cat",
    "cat cat cat dog dog",
    "unhappiness unhappy happiness happy",
    "walking walked walks walker",
]

bpe = SimpleBPE(vocab_size=50)
bpe.fit(corpus)

print("\nMerge rules (first 10):")
for i, merge in enumerate(bpe.merges[:10]):
    print(f"  {i+1}. {merge[0]!r} + {merge[1]!r} -> {merge[0] + merge[1]!r}")

print("\nTokenization examples:")
test_texts = ["the cat", "unhappiness", "walking", "unknown"]
for text in test_texts:
    tokens = bpe.tokenize(text)
    print(f"  '{text}' -> {tokens}")
```

WordPiece Algorithm:
WordPiece, used by BERT, differs from BPE in its merge criterion. Instead of choosing the most frequent pair, WordPiece selects the pair that maximizes the likelihood of the training data when merged:
$$\text{score}(x, y) = \frac{\text{freq}(xy)}{\text{freq}(x) \times \text{freq}(y)}$$
This likelihood-based criterion tends to prefer merges that form meaningful morphological units rather than just frequent character sequences.
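For example, with hypothetical counts freq('un') = 1,000, freq('happy') = 500, and freq('unhappy') = 400, the score is 400 / (1,000 × 500) = 0.0008, whereas a pair like ('e', 's') may co-occur far more often in absolute terms but score lower because its parts are themselves extremely frequent. A quick sketch of the criterion (all counts below are illustrative):

```python
def wordpiece_score(freq_xy: int, freq_x: int, freq_y: int) -> float:
    """WordPiece merge criterion: pair frequency normalized by the parts' frequencies."""
    return freq_xy / (freq_x * freq_y)

# Hypothetical counts for illustration only
print(wordpiece_score(400, 1_000, 500))        # 0.0008      ('un' + 'happy')
print(wordpiece_score(5_000, 80_000, 60_000))  # ~1.04e-06   ('e' + 's'): frequent pair, low score
```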
Unigram Language Model:
The Unigram approach inverts the BPE strategy:

- Start from a large seed vocabulary of candidate subwords (e.g., all frequent substrings).
- Fit a unigram language model that assigns each candidate token a probability.
- Iteratively remove the tokens whose removal least reduces the likelihood of the training corpus.
- Stop when the vocabulary has shrunk to the target size.
Unigram provides probabilistic tokenization—multiple segmentations are possible, and the algorithm selects the one with maximum likelihood.
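The sketch below shows only the selection step: given a hypothetical unigram vocabulary with token probabilities, it uses dynamic programming to pick the segmentation with the highest log-probability. Real implementations (e.g., SentencePiece) also handle the vocabulary-pruning loop and use more refined probability estimates.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical unigram probabilities, for illustration only
probs: Dict[str, float] = {
    "un": 0.08, "happi": 0.02, "ness": 0.05, "happy": 0.03,
    "u": 0.01, "n": 0.02, "h": 0.01, "a": 0.03, "p": 0.02,
    "i": 0.03, "e": 0.04, "s": 0.05,
}

def best_segmentation(word: str) -> Tuple[List[str], float]:
    """Viterbi-style search for the max-likelihood segmentation under a unigram model."""
    n = len(word)
    # best[i] holds (log-prob of best segmentation of word[:i], start index, token ending at i)
    best: List[Optional[Tuple[float, int, str]]] = [None] * (n + 1)
    best[0] = (0.0, 0, "")
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start] is not None:
                score = best[start][0] + math.log(probs[piece])
                if best[end] is None or score > best[end][0]:
                    best[end] = (score, start, piece)
    # Walk the backpointers to recover the tokens (assumes the word is segmentable)
    tokens, i = [], n
    while i > 0:
        _, prev, piece = best[i]
        tokens.append(piece)
        i = prev
    return tokens[::-1], best[n][0]

print(best_segmentation("unhappiness"))
# (['un', 'happi', 'ness'], ~-9.43) with these illustrative probabilities
```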
```python
from transformers import AutoTokenizer

# Load pre-trained tokenizers from different model families

# GPT-2: Uses BPE
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# BERT: Uses WordPiece
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# T5: Uses SentencePiece (Unigram)
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Compare tokenization outputs
text = "I can't believe the unbelievability of this!"

print("=== Tokenization Comparison ===\n")
print(f"Input: {text}\n")

# GPT-2 (BPE)
gpt2_tokens = gpt2_tokenizer.tokenize(text)
gpt2_ids = gpt2_tokenizer.encode(text)
print(f"GPT-2 (BPE):")
print(f"  Tokens: {gpt2_tokens}")
print(f"  IDs: {gpt2_ids}")
print(f"  Vocab size: {len(gpt2_tokenizer)}")

# BERT (WordPiece)
bert_tokens = bert_tokenizer.tokenize(text)
bert_ids = bert_tokenizer.encode(text)
print(f"\nBERT (WordPiece):")
print(f"  Tokens: {bert_tokens}")
print(f"  IDs: {bert_ids}")
print(f"  Vocab size: {len(bert_tokenizer)}")

# T5 (Unigram/SentencePiece)
t5_tokens = t5_tokenizer.tokenize(text)
t5_ids = t5_tokenizer.encode(text)
print(f"\nT5 (SentencePiece):")
print(f"  Tokens: {t5_tokens}")
print(f"  IDs: {t5_ids}")
print(f"  Vocab size: {len(t5_tokenizer)}")

# Handling out-of-vocabulary words
oov_text = "pneumonoultramicroscopicsilicovolcanoconiosis"
print(f"\n=== OOV Handling ===")
print(f"Word: {oov_text}\n")
print(f"GPT-2: {gpt2_tokenizer.tokenize(oov_text)}")
print(f"BERT:  {bert_tokenizer.tokenize(oov_text)}")
print(f"T5:    {t5_tokenizer.tokenize(oov_text)}")
```

Character-level tokenization represents the finest granularity: each character becomes a token. While this eliminates the OOV problem entirely (any text is representable with a fixed character set), it introduces significant challenges that limit its applicability.
Advantages:

- Tiny, fixed vocabulary (roughly 100-300 symbols for most scripts)
- No out-of-vocabulary tokens: any string can be encoded
- Robust to misspellings, typos, and creative spellings
- Naturally handles rare words, codes, and neologisms

Disadvantages:

- Sequences become several times longer than word-level tokenization, increasing compute and memory cost
- Each token carries little semantic content, so models must compose meaning over many more steps
- Long-range dependencies become harder to capture because related content is spread over more positions
```python
from typing import List, Dict, Tuple
import string

class CharacterTokenizer:
    """
    Character-level tokenizer with support for special tokens
    and vocabulary management.
    """

    def __init__(self):
        # Special tokens
        self.PAD = "<PAD>"
        self.UNK = "<UNK>"
        self.BOS = "<BOS>"  # Beginning of sequence
        self.EOS = "<EOS>"  # End of sequence

        # Build vocabulary from printable ASCII + common special chars
        special_tokens = [self.PAD, self.UNK, self.BOS, self.EOS]
        chars = list(string.printable)  # 100 printable ASCII characters

        self.id_to_char = special_tokens + chars
        self.char_to_id = {c: i for i, c in enumerate(self.id_to_char)}
        self.vocab_size = len(self.id_to_char)
        self.unk_id = self.char_to_id[self.UNK]

    def encode(self, text: str, add_special: bool = True) -> List[int]:
        """Convert text to list of token IDs."""
        ids = []
        if add_special:
            ids.append(self.char_to_id[self.BOS])
        for char in text:
            ids.append(self.char_to_id.get(char, self.unk_id))
        if add_special:
            ids.append(self.char_to_id[self.EOS])
        return ids

    def decode(self, ids: List[int], skip_special: bool = True) -> str:
        """Convert token IDs back to text."""
        chars = []
        special_ids = {
            self.char_to_id[self.PAD],
            self.char_to_id[self.UNK],
            self.char_to_id[self.BOS],
            self.char_to_id[self.EOS],
        }
        for id in ids:
            if skip_special and id in special_ids:
                continue
            if 0 <= id < len(self.id_to_char):
                chars.append(self.id_to_char[id])
        return "".join(chars)

    def batch_encode(
        self, texts: List[str], max_length: int = 512, padding: bool = True
    ) -> Tuple[List[List[int]], List[int]]:
        """
        Encode a batch of texts with padding.
        Returns (padded_ids, lengths)
        """
        encoded = [self.encode(text) for text in texts]
        lengths = [len(seq) for seq in encoded]

        if padding:
            max_len = min(max(lengths), max_length)
            pad_id = self.char_to_id[self.PAD]
            padded = []
            for seq in encoded:
                if len(seq) > max_len:
                    padded.append(seq[:max_len])
                else:
                    padded.append(seq + [pad_id] * (max_len - len(seq)))
            return padded, lengths

        return encoded, lengths

# Demonstration
tokenizer = CharacterTokenizer()

text = "Hello, World!"
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"\nInput: {text}")

ids = tokenizer.encode(text)
print(f"Encoded: {ids}")
print(f"Decoded: {tokenizer.decode(ids)}")

# Compare sequence lengths with word-level
print(f"\n=== Sequence Length Comparison ===")
sample = "The quick brown fox jumps over the lazy dog."
char_len = len(tokenizer.encode(sample))
word_len = len(sample.split())
print(f"Text: {sample}")
print(f"Character tokens: {char_len}")
print(f"Word tokens: {word_len}")
print(f"Ratio: {char_len / word_len:.1f}x longer sequences")
```

Character-level approaches excel in specific scenarios: (1) Spelling correction and typo detection, (2) Text generation where character-level control matters, (3) Highly multilingual settings with diverse scripts, (4) Domains with heavy use of neologisms, codes, or non-standard text. Modern byte-level approaches (like GPT-4's tokenizer) combine the robustness of character-level with subword efficiency.
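To make the byte-level point concrete, here is a small sketch assuming the `tiktoken` package is installed; the exact token counts depend on the chosen encoding:

```python
# Byte-level BPE: any string (emoji, other scripts, misspellings) encodes without
# an <UNK>, because the base vocabulary covers all 256 possible byte values.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # byte-level BPE encoding

for text in ["hello world", "héllo wörld", "🙂🙃", "pneumonoultramicroscopic"]:
    ids = enc.encode(text)
    print(f"{text!r:28} -> {len(ids):2d} tokens, round-trips: {enc.decode(ids) == text}")
```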
The tokenization strategies we've discussed assume whitespace-delimited languages like English. This assumption fails dramatically for many of the world's languages, each presenting unique segmentation challenges.
Chinese, Japanese, and Thai (No Whitespace):
These languages write words without spaces between them. Sentence boundaries are marked by punctuation, but word boundaries are left implicit:
Chinese: 我爱自然语言处理 (I love natural language processing)
Japanese: 私は自然言語処理が大好きです
Thai: ฉันรักการประมวลผลภาษาธรรมชาติ
Tokenization therefore requires either word segmentation models trained on annotated corpora or dictionary-based maximum-matching (MaxMatch) algorithms.
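A minimal sketch of the dictionary-based MaxMatch idea follows; the tiny dictionary is illustrative only, and real systems combine large lexicons with statistical or neural disambiguation:

```python
# Greedy forward maximum matching: at each position, take the longest dictionary
# word that matches, falling back to a single character when nothing matches.
def max_match(text: str, dictionary: set, max_len: int = 4) -> list:
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

# Toy dictionary for 我爱自然语言处理 ("I love natural language processing")
dictionary = {"我", "爱", "自然", "语言", "处理", "自然语言"}
print(max_match("我爱自然语言处理", dictionary))
# ['我', '爱', '自然语言', '处理']
```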
| Language | Challenge | Solution Approach |
|---|---|---|
| Chinese | No word boundaries; characters can combine in multiple valid ways | Jieba, Stanford Segmenter, BERT-based segmenters |
| Japanese | Three scripts (kanji, hiragana, katakana); no spaces | MeCab, Sudachi, SentencePiece for script-agnostic subword tokenization |
| German | Compound words (Donaudampfschifffahrt) | Compound splitting, subword tokenization |
| Arabic | Rich morphology, clitics, diacritics optional | Morphological analyzers (MADAMIRA, Farasa) |
| Korean | Syllable blocks (Hangul Jamo), agglutinative morphology | Mecab-ko, subword tokenization |
| Hindi | Compound words, sandhi (sound changes at word boundaries) | iNLTK, specialized tokenizers |
```python
# Multilingual tokenization examples
# Requires: pip install jieba fugashi sentencepiece transformers

# === Chinese Tokenization with Jieba ===
import jieba

chinese_text = "我爱自然语言处理和机器学习"
# Without segmentation: 13 characters, no word boundaries
print(f"Chinese raw: {chinese_text}")
print(f"Character count: {len(chinese_text)}")

# Jieba segmentation
words = list(jieba.cut(chinese_text))
print(f"Jieba segments: {words}")
# Output: ['我', '爱', '自然', '语言', '处理', '和', '机器', '学习']

# === Japanese Tokenization with Fugashi (MeCab wrapper) ===
import fugashi

japanese_text = "私は機械学習が大好きです"
tagger = fugashi.Tagger()

print(f"\nJapanese raw: {japanese_text}")
words = [word.surface for word in tagger(japanese_text)]
print(f"MeCab segments: {words}")
# Output: ['私', 'は', '機械', '学習', 'が', '大好き', 'です']

# === Multilingual Tokenization with mBERT (WordPiece) ===
from transformers import AutoTokenizer

# mBERT handles 104 languages
mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

texts = {
    "English": "Hello, how are you?",
    "Chinese": "你好,你好吗?",
    "Japanese": "こんにちは、お元気ですか?",
    "Arabic": "مرحبا كيف حالك؟",
    "German": "Donaudampfschifffahrtsgesellschaftskapitän",
}

print("\n=== Multilingual BERT Tokenization ===")
for lang, text in texts.items():
    tokens = mbert.tokenize(text)
    print(f"\n{lang}: {text}")
    print(f"  Tokens: {tokens}")

# Note: mBERT uses WordPiece with a shared vocabulary across all languages
# The "##" prefix indicates continuation of a word (not a word start)
```

Languages with complex morphology or non-Latin scripts often get more fragmented tokenization, requiring more tokens to express the same content. This 'tokenization tax' means models process these languages less efficiently, potentially leading to higher costs, slower inference, and reduced quality. This is an active area of research in fair multilingual NLP.
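One quick way to observe this tax is to count tokens for roughly parallel sentences. The sketch below loads the same multilingual BERT tokenizer as above; the translations are approximate and the exact counts depend on the tokenizer version:

```python
from transformers import AutoTokenizer

mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Roughly parallel sentences; per-language token counts reveal fragmentation differences
parallel = {
    "English": "I love natural language processing.",
    "German": "Ich liebe die Verarbeitung natürlicher Sprache.",
    "Hindi": "मुझे प्राकृतिक भाषा प्रसंस्करण पसंद है।",
    "Arabic": "أحب معالجة اللغة الطبيعية.",
}

for lang, sentence in parallel.items():
    n_tokens = len(mbert.tokenize(sentence))
    print(f"{lang:8} -> {n_tokens} tokens")
```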
Moving tokenization from prototype to production introduces additional requirements beyond correctness. Speed, consistency, and maintainability become critical factors.
Performance Optimization:
Tokenization often becomes a bottleneck in NLP pipelines. For real-time applications processing thousands of requests per second, even small inefficiencies compound:
```python
from transformers import AutoTokenizer
from typing import List, Dict
import time
from functools import lru_cache, partial
from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp

# === Batched Tokenization ===
def benchmark_batching(tokenizer, texts: List[str], batch_size: int = 32):
    """Compare individual vs batched tokenization speed."""
    # Individual tokenization
    start = time.perf_counter()
    individual_results = [tokenizer(text) for text in texts]
    individual_time = time.perf_counter() - start

    # Batched tokenization
    start = time.perf_counter()
    batched_results = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )
    batched_time = time.perf_counter() - start

    print(f"Individual: {individual_time:.3f}s")
    print(f"Batched: {batched_time:.3f}s")
    print(f"Speedup: {individual_time / batched_time:.1f}x")

# === Tokenizer Caching ===
class CachedTokenizer:
    """
    Tokenizer wrapper with LRU cache for repeated inputs.
    Useful when the same texts are processed repeatedly.
    """

    def __init__(self, model_name: str, cache_size: int = 10000):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.cache_size = cache_size

        # Create cached encode function
        @lru_cache(maxsize=cache_size)
        def _cached_encode(text: str) -> tuple:
            result = self.tokenizer.encode(text, add_special_tokens=True)
            return tuple(result)  # LRU cache requires hashable return

        self._cached_encode = _cached_encode

    def encode(self, text: str) -> List[int]:
        return list(self._cached_encode(text))

    def cache_info(self):
        return self._cached_encode.cache_info()

# === Parallel Tokenization ===
def _worker_tokenize(text_batch: List[str], model_name: str) -> List[List[int]]:
    """Tokenize one chunk; each worker process loads its own tokenizer."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return [tokenizer.encode(t) for t in text_batch]

def parallel_tokenize(
    texts: List[str], model_name: str, n_workers: int = None
) -> List[List[int]]:
    """
    Tokenize texts in parallel using multiple CPU cores.
    Note: For small batches, overhead may exceed benefits.
    """
    if n_workers is None:
        n_workers = mp.cpu_count()

    # Split texts into chunks
    chunk_size = max(1, len(texts) // n_workers)
    chunks = [
        texts[i:i + chunk_size]
        for i in range(0, len(texts), chunk_size)
    ]

    # Process in parallel (worker must be a module-level function so it can be pickled)
    worker = partial(_worker_tokenize, model_name=model_name)
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        results = list(executor.map(worker, chunks))

    # Flatten results
    return [item for sublist in results for item in sublist]

# === Consistency Guarantees ===
class DeterministicTokenizer:
    """
    Wrapper ensuring deterministic tokenization across versions.
    Stores tokenizer version and validates on load.
    """

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        # Store vocabulary hash for validation
        vocab = self.tokenizer.get_vocab()
        self.vocab_hash = hash(frozenset(vocab.items()))

    def validate_consistency(self, expected_hash: int) -> bool:
        """Check if current tokenizer matches expected vocabulary."""
        return self.vocab_hash == expected_hash

    def encode_with_metadata(self, text: str) -> Dict:
        """Return encoding with reproducibility metadata."""
        return {
            "input_ids": self.tokenizer.encode(text),
            "model_name": self.model_name,
            "vocab_hash": self.vocab_hash,
        }

# Example usage
if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Generate test data
    texts = ["This is a test sentence."] * 1000

    print("=== Batching Benchmark ===")
    benchmark_batching(tokenizer, texts)

    print("\n=== Caching Demo ===")
    cached = CachedTokenizer("bert-base-uncased", cache_size=100)

    # First call - cache miss
    _ = cached.encode("Hello world")
    print(f"After 1 call: {cached.cache_info()}")

    # Repeat same text - cache hit
    for _ in range(100):
        _ = cached.encode("Hello world")
    print(f"After 101 calls: {cached.cache_info()}")
```

With multiple tokenization strategies available, selecting the right approach requires understanding your specific requirements. Here's a decision framework based on common scenarios:
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| Traditional ML (bag-of-words, TF-IDF) | Word-level (NLTK, spaCy) | Interpretable features, works with classical algorithms |
| Transformer fine-tuning | Pre-trained model's tokenizer | Must match pre-training tokenization exactly |
| Training from scratch | BPE/WordPiece (8K-50K vocab) | Balance between coverage and efficiency |
| Multilingual applications | SentencePiece with Unigram | Language-agnostic, handles diverse scripts |
| Spelling/OCR applications | Character-level | Access to character patterns, no OOV |
| Low-resource languages | Subword with smaller vocab | Maximizes coverage with limited data |
| Code/technical content | BPE trained on code | Captures programming language patterns |
When using pre-trained models, ALWAYS use the exact tokenizer that model was trained with. Vocabulary mismatches cause catastrophic performance degradation. The tokenizer is as much a part of the model as the weights themselves.
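A small sketch of why this matters, assuming two BERT checkpoints whose vocabularies differ (cased vs. uncased): the same text maps to different token IDs, so feeding one model IDs produced by the other tokenizer scrambles the input.

```python
from transformers import AutoTokenizer

uncased = AutoTokenizer.from_pretrained("bert-base-uncased")
cased = AutoTokenizer.from_pretrained("bert-base-cased")

text = "Tokenizers are part of the model."
ids_uncased = uncased.encode(text)
ids_cased = cased.encode(text)

print(ids_uncased)
print(ids_cased)
# Same string, different vocabularies -> different IDs. Decoding one tokenizer's IDs
# with the other shows how badly the mapping breaks.
print(uncased.decode(ids_cased))
```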
Vocabulary Size Trade-offs:
The vocabulary size parameter in subword tokenization deserves careful consideration:
Smaller vocabulary (4K-8K):

- Smaller embedding matrix and fewer parameters to train
- Better suited to low-resource settings with limited training data
- Longer token sequences, so slower inference and less text fits in a fixed context window

Larger vocabulary (32K-100K):

- Shorter sequences: more frequent words survive as single tokens
- Larger embedding matrix, and rare tokens receive fewer training updates
- Pays off mainly when a large corpus provides enough evidence for each token
Modern practice typically uses 30K-50K for monolingual models and 100K-250K for multilingual models.
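One concrete cost to keep in mind is the embedding matrix, whose parameter count is vocab_size × hidden_size (and is often duplicated in the output projection when weights aren't tied). A rough back-of-the-envelope sketch, assuming a hidden size of 768:

```python
# Embedding parameters = vocab_size * hidden_size; hidden size of 768 is illustrative
hidden_size = 768
for vocab_size in (8_000, 32_000, 50_000, 250_000):
    params = vocab_size * hidden_size
    print(f"vocab {vocab_size:>7,} -> {params / 1e6:6.1f}M embedding parameters")
# vocab   8,000 ->    6.1M embedding parameters
# vocab  32,000 ->   24.6M embedding parameters
# vocab  50,000 ->   38.4M embedding parameters
# vocab 250,000 ->  192.0M embedding parameters
```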
Tokenization is deceptively simple to describe but critically important to execute correctly. As the first transformation in any text processing pipeline, tokenization decisions propagate through your entire system.
What's Next:
With text segmented into tokens, the next preprocessing step is stop-word removal—identifying and handling high-frequency, low-information words that can obscure meaningful patterns. We'll explore when stop-word removal helps, when it hurts, and how to make domain-appropriate decisions.
You now understand tokenization from first principles through production deployment. You can implement custom tokenizers, leverage standard libraries, handle multilingual text, and make informed decisions about tokenization strategy based on your specific NLP task requirements.