When humans read text, we effortlessly decompose continuous streams of characters into meaningful units—words, punctuation, numbers, and symbols. We do this so naturally that we rarely consider the underlying process. Yet for machines, this decomposition is a non-trivial problem that sits at the very foundation of all natural language processing.
Tokenization is the process of segmenting raw text into discrete units called tokens. These tokens become the atomic building blocks upon which all subsequent NLP processing depends. Whether you're building a sentiment classifier, a search engine, a machine translation system, or a large language model, tokenization is invariably the first transformation your text data undergoes.
The quality of your tokenization directly impacts everything downstream. Poor tokenization decisions propagate through your entire pipeline, creating vocabulary mismatches, semantic distortions, and unexpected model behaviors that are difficult to diagnose after the fact.
By the end of this page, you will understand the theoretical foundations of tokenization, master multiple tokenization strategies (word-level, subword, character-level), navigate edge cases and language-specific challenges, and make informed decisions about which tokenization approach suits your specific NLP task.
At its core, tokenization answers a deceptively simple question: What constitutes a meaningful unit of text?
The answer depends on your perspective, your task, and crucially, the language you're processing. Let's build up from first principles.
Text is fundamentally a sequence of characters. Tokenization imposes structure on this sequence by identifying boundaries between meaningful units. The challenge lies in defining 'meaningful' in a way that serves machine learning objectives.
Formal Definition:
Given a text string $T = c_1 c_2 c_3 \ldots c_n$ consisting of $n$ characters, tokenization produces a sequence of tokens $[t_1, t_2, \ldots, t_k]$ where each token $t_i$ is a contiguous subsequence of $T$, and the concatenation of all tokens (potentially with delimiters) reconstructs the original text.
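As a minimal illustration of this definition, the sketch below splits on a single delimiter and checks that joining the tokens reconstructs the original string (the reversibility property referenced above):

```python
# Minimal illustration of the formal definition: a whitespace tokenizer and a
# check that the tokens, joined with the delimiter, reconstruct the input.
text = "The cat sat on the mat."
tokens = text.split(" ")          # each token is a contiguous subsequence of the text

print(tokens)                     # ['The', 'cat', 'sat', 'on', 'the', 'mat.']
assert " ".join(tokens) == text   # lossless reconstruction with the delimiter
```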
Key Properties of Tokenization:

- Reversibility: the token sequence (with delimiters) should allow reconstruction of the original text.
- Coverage: any input, including words never seen during training, should be representable.
- Compactness: the vocabulary should stay small enough to learn and store efficiently.
- Meaningfulness: tokens should correspond to units that carry semantic or morphological signal.
- Efficiency: texts should map to reasonably short token sequences.

The tension between these properties drives much of the complexity in tokenization design: character-level tokenization maximizes coverage but sacrifices meaningfulness and sequence efficiency, while word-level tokenization does the opposite.
| Approach | Granularity | Vocabulary Size | OOV Handling | Use Cases |
|---|---|---|---|---|
| Word-Level | Whole words | Large (50K-500K+) | Poor (explicit OOV token) | Classical NLP, bag-of-words |
| Subword | Morphemes/pieces | Medium (8K-50K) | Good (decomposes unknown) | Transformers, neural MT |
| Character-Level | Individual characters | Small (100-300) | Perfect (no OOV possible) | Spelling correction, OCR |
| Sentence-Level | Full sentences | Very large | Poor | Sentence embeddings |
Word-level tokenization is the most intuitive approach: split text on whitespace and punctuation to extract individual words. This mirrors how humans naturally perceive text and produces tokens that are immediately interpretable.
The Naive Approach:
The simplest tokenizer splits on whitespace:
Input: "Hello, world!"
Output: ["Hello,", "world!"]
But immediately we see problems. Punctuation is attached to words, creating vocabulary entries like "Hello," and "world!" that differ from "Hello" and "world". A corpus might contain "cat", "cat.", "cat,", "cat!", "cat?" as five distinct vocabulary items—all referring to the same concept.
```python
import re
from typing import List

def naive_tokenize(text: str) -> List[str]:
    """
    Naive whitespace-based tokenization.
    PROBLEM: Punctuation attached to words
    """
    return text.split()

def basic_word_tokenize(text: str) -> List[str]:
    """
    Improved tokenization with punctuation handling.
    Uses regex to separate words from punctuation.
    """
    # Pattern explanation:
    #   \w+     : Match sequences of word characters (letters, digits, underscore)
    #   [^\w\s] : Match non-word, non-whitespace characters (punctuation)
    pattern = r"\w+|[^\w\s]"
    return re.findall(pattern, text)

def advanced_word_tokenize(text: str) -> List[str]:
    """
    Advanced tokenization handling common edge cases:
    - Contractions (don't -> do, not)
    - Possessives (John's kept as a single token)
    - Hyphenation (self-aware kept as a single token)
    - Numbers with formatting (1,000.50 -> single token)
    - Abbreviations (U.S.A. -> single token)
    """
    # Handle contractions first
    contractions = {
        "can't": ["can", "not"],
        "won't": ["will", "not"],
        "n't": ["not"],
        "'re": ["are"],
        "'ve": ["have"],
        "'ll": ["will"],
        "'d": ["would"],
        "'m": ["am"],
    }

    # Tokenize with multiple patterns; this regex handles most English edge cases
    pattern = r"""(?x)            # verbose mode
        (?:[A-Z]\.)+              # Abbreviations like U.S.A.
      | \d+(?:[,.]\d+)*           # Numbers with decimal points or thousands separators
      | \w+(?:[-']\w+)*           # Words with internal hyphens or apostrophes
      | \.{2,}                    # Ellipsis
      | [^\w\s]                   # Single punctuation
    """
    tokens = re.findall(pattern, text)

    # Expand contractions
    expanded = []
    for token in tokens:
        token_lower = token.lower()
        if token_lower in contractions:
            expanded.extend(contractions[token_lower])
        elif token_lower.endswith("n't"):
            expanded.extend([token[:-3], "not"])
        else:
            expanded.append(token)

    return expanded

# Demonstration
examples = [
    "Hello, world!",
    "I can't believe it's not butter.",
    "The U.S.A. has 50 states.",
    "The price is $1,000.50.",
    "self-aware robots are coming...",
]

print("=== Tokenization Comparison ===\n")
for text in examples:
    print(f"Input: {text}")
    print(f"  Naive:    {naive_tokenize(text)}")
    print(f"  Basic:    {basic_word_tokenize(text)}")
    print(f"  Advanced: {advanced_word_tokenize(text)}")
    print()
```

Standard Tokenizers in Practice:
While custom regex-based tokenizers work for simple cases, production systems typically rely on battle-tested libraries that handle thousands of edge cases:
```python
import nltk
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
import spacy

# Download required NLTK data (run once)
# nltk.download('punkt')

# NLTK tokenization
text = "I can't believe they've done this! The U.S.A. is amazing."

# NLTK word_tokenize (Penn Treebank standard)
nltk_tokens = word_tokenize(text)
print(f"NLTK word_tokenize: {nltk_tokens}")
# Output: ['I', 'ca', "n't", 'believe', 'they', "'ve", 'done', 'this', '!',
#          'The', 'U.S.A.', 'is', 'amazing', '.']

# TreebankWordTokenizer explicitly
treebank = TreebankWordTokenizer()
treebank_tokens = treebank.tokenize(text)
print(f"TreebankWordTokenizer: {treebank_tokens}")

# spaCy tokenization (requires: pip install spacy && python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
spacy_tokens = [token.text for token in doc]
print(f"spaCy tokens: {spacy_tokens}")
# Output: ['I', 'ca', "n't", 'believe', 'they', "'ve", 'done', 'this', '!',
#          'The', 'U.S.A.', 'is', 'amazing', '.']

# spaCy also provides rich token metadata
print("\nspaCy token details:")
for token in doc[:5]:
    print(f"  {token.text:12} | is_alpha: {token.is_alpha:5} | is_punct: {token.is_punct:5} | is_stop: {token.is_stop}")
```

Word-level tokenization's fundamental weakness is vocabulary management. A fixed vocabulary built from training data cannot represent words never seen before. These 'out-of-vocabulary' (OOV) words get mapped to a special <UNK> token, losing all semantic information. For morphologically rich languages or domains with specialized terminology, OOV rates can exceed 10-20%, severely degrading model performance.
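A tiny sketch of what this OOV mapping looks like in practice (the vocabulary and `<UNK>` handling below are illustrative, not tied to any particular library):

```python
# Hypothetical fixed word-level vocabulary: unseen words collapse to <UNK>.
vocab = {"<UNK>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def encode_words(tokens):
    """Map each word to its ID, falling back to <UNK> for unseen words."""
    return [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]

print(encode_words(["the", "cat", "sat"]))          # [1, 2, 3]
print(encode_words(["the", "ocelot", "pounced"]))   # [1, 0, 0]  <- semantics lost
```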
Subword tokenization represents the modern solution to the OOV problem. Instead of treating words as atomic units, subword methods decompose text into smaller, reusable pieces that can combine to represent any word—including words never seen during training.
The Core Insight:
Words share morphological components. "unhappiness", "unhappy", "happiness", and "happy" all share common subunits. If we tokenize at the subword level, we can:

- represent any word, including unseen ones, by composing known pieces;
- share statistical strength across morphologically related words (e.g., "un-", "happi", "-ness");
- keep the vocabulary at a fixed, manageable size.
Byte Pair Encoding (BPE) in Detail:
BPE is the most widely used subword algorithm. It works by:

1. Initializing the vocabulary with individual characters (plus a word-boundary marker).
2. Counting the frequency of every adjacent symbol pair in the training corpus.
3. Merging the most frequent pair into a single new symbol and adding it to the vocabulary.
4. Repeating steps 2-3 until the target vocabulary size is reached.
The algorithm stores merge rules, which are applied in order during tokenization to convert text into subword sequences.
```python
from collections import Counter
from typing import Dict, List, Tuple

class SimpleBPE:
    """
    A simplified BPE implementation for educational purposes.
    Production systems should use HuggingFace tokenizers or SentencePiece.
    """

    def __init__(self, vocab_size: int = 1000):
        self.vocab_size = vocab_size
        self.merges: List[Tuple[str, str]] = []
        self.vocab: set = set()

    def _get_pair_counts(self, word_freqs: Dict[Tuple[str, ...], int]) -> Counter:
        """Count frequency of adjacent symbol pairs across all words."""
        pairs = Counter()
        for word, freq in word_freqs.items():
            symbols = list(word)
            for i in range(len(symbols) - 1):
                pair = (symbols[i], symbols[i + 1])
                pairs[pair] += freq
        return pairs

    def _merge_pair(
        self, pair: Tuple[str, str], word_freqs: Dict[Tuple[str, ...], int]
    ) -> Dict[Tuple[str, ...], int]:
        """Merge all instances of a pair in the vocabulary."""
        new_word_freqs = {}
        bigram = pair[0] + pair[1]
        for word, freq in word_freqs.items():
            new_word = []
            i = 0
            while i < len(word):
                if i < len(word) - 1 and word[i] == pair[0] and word[i + 1] == pair[1]:
                    new_word.append(bigram)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            new_word_freqs[tuple(new_word)] = freq
        return new_word_freqs

    def fit(self, corpus: List[str]) -> None:
        """
        Learn BPE merges from a corpus.

        Args:
            corpus: List of text documents
        """
        # Step 1: Compute word frequencies from corpus
        word_counter = Counter()
        for text in corpus:
            words = text.lower().split()
            word_counter.update(words)

        # Step 2: Initialize word_freqs with character-level tokenization
        #         Add </w> to mark word boundaries
        word_freqs: Dict[Tuple[str, ...], int] = {}
        for word, freq in word_counter.items():
            chars = tuple(list(word) + ["</w>"])
            word_freqs[chars] = freq

        # Initialize vocabulary with characters
        self.vocab = set()
        for word in word_freqs:
            self.vocab.update(word)

        # Step 3: Iteratively merge most frequent pairs
        num_merges = self.vocab_size - len(self.vocab)
        for i in range(num_merges):
            pair_counts = self._get_pair_counts(word_freqs)
            if not pair_counts:
                break

            # Find most frequent pair
            best_pair = pair_counts.most_common(1)[0][0]

            # Merge this pair
            word_freqs = self._merge_pair(best_pair, word_freqs)

            # Record the merge and update vocabulary
            self.merges.append(best_pair)
            merged_token = best_pair[0] + best_pair[1]
            self.vocab.add(merged_token)

            if (i + 1) % 100 == 0:
                print(f"Completed {i + 1} merges, vocab size: {len(self.vocab)}")

        print(f"Final vocabulary size: {len(self.vocab)}")

    def tokenize(self, text: str) -> List[str]:
        """
        Tokenize text using learned BPE merges.
        """
        words = text.lower().split()
        all_tokens = []

        for word in words:
            # Start with character-level tokenization
            tokens = list(word) + ["</w>"]

            # Apply merges in order
            for merge in self.merges:
                i = 0
                new_tokens = []
                while i < len(tokens):
                    if (i < len(tokens) - 1
                            and tokens[i] == merge[0]
                            and tokens[i + 1] == merge[1]):
                        new_tokens.append(merge[0] + merge[1])
                        i += 2
                    else:
                        new_tokens.append(tokens[i])
                        i += 1
                tokens = new_tokens

            all_tokens.extend(tokens)

        return all_tokens

# Example usage
corpus = [
    "the cat sat on the mat",
    "the cat ate the rat",
    "the rat ran from the cat",
    "cat cat cat dog dog",
    "unhappiness unhappy happiness happy",
    "walking walked walks walker",
]

bpe = SimpleBPE(vocab_size=50)
bpe.fit(corpus)

print("\nMerge rules (first 10):")
for i, merge in enumerate(bpe.merges[:10]):
    print(f"  {i+1}. {merge[0]!r} + {merge[1]!r} -> {merge[0] + merge[1]!r}")

print("\nTokenization examples:")
test_texts = ["the cat", "unhappiness", "walking", "unknown"]
for text in test_texts:
    tokens = bpe.tokenize(text)
    print(f"  '{text}' -> {tokens}")
```

WordPiece Algorithm:
WordPiece, used by BERT, differs from BPE in its merge criterion. Instead of choosing the most frequent pair, WordPiece selects the pair that maximizes the likelihood of the training data when merged:
$$\text{score}(x, y) = \frac{\text{freq}(xy)}{\text{freq}(x) \times \text{freq}(y)}$$
This likelihood-based criterion tends to prefer merges that form meaningful morphological units rather than just frequent character sequences.
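For example, with hypothetical counts freq('un') = 1,000, freq('happy') = 500, and freq('unhappy') = 400, the score is 400 / (1,000 × 500) = 0.0008, whereas a pair like ('e', 's') may co-occur far more often in absolute terms but score lower because its parts are themselves extremely frequent. A quick sketch of the criterion (all counts below are illustrative):

```python
def wordpiece_score(freq_xy: int, freq_x: int, freq_y: int) -> float:
    """WordPiece merge criterion: pair frequency normalized by the parts' frequencies."""
    return freq_xy / (freq_x * freq_y)

# Hypothetical counts for illustration only
print(wordpiece_score(400, 1_000, 500))        # 0.0008      ('un' + 'happy')
print(wordpiece_score(5_000, 80_000, 60_000))  # ~1.04e-06   ('e' + 's'): frequent pair, low score
```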
Unigram Language Model:
The Unigram approach inverts the BPE strategy:

- Start from a large seed vocabulary of candidate subwords (e.g., all frequent substrings).
- Fit a unigram language model that assigns each candidate token a probability.
- Iteratively remove the tokens whose removal least reduces the likelihood of the training corpus.
- Stop when the vocabulary has shrunk to the target size.
Unigram provides probabilistic tokenization—multiple segmentations are possible, and the algorithm selects the one with maximum likelihood.
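The sketch below shows only the selection step: given a hypothetical unigram vocabulary with token probabilities, it uses dynamic programming to pick the segmentation with the highest log-probability. Real implementations (e.g., SentencePiece) also handle the vocabulary-pruning loop and use more refined probability estimates.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical unigram probabilities, for illustration only
probs: Dict[str, float] = {
    "un": 0.08, "happi": 0.02, "ness": 0.05, "happy": 0.03,
    "u": 0.01, "n": 0.02, "h": 0.01, "a": 0.03, "p": 0.02,
    "i": 0.03, "e": 0.04, "s": 0.05,
}

def best_segmentation(word: str) -> Tuple[List[str], float]:
    """Viterbi-style search for the max-likelihood segmentation under a unigram model."""
    n = len(word)
    # best[i] holds (log-prob of best segmentation of word[:i], start index, token ending at i)
    best: List[Optional[Tuple[float, int, str]]] = [None] * (n + 1)
    best[0] = (0.0, 0, "")
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs and best[start] is not None:
                score = best[start][0] + math.log(probs[piece])
                if best[end] is None or score > best[end][0]:
                    best[end] = (score, start, piece)
    # Walk the backpointers to recover the tokens (assumes the word is segmentable)
    tokens, i = [], n
    while i > 0:
        _, prev, piece = best[i]
        tokens.append(piece)
        i = prev
    return tokens[::-1], best[n][0]

print(best_segmentation("unhappiness"))
# (['un', 'happi', 'ness'], ~-9.43) with these illustrative probabilities
```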
```python
from transformers import AutoTokenizer

# Load pre-trained tokenizers from different model families

# GPT-2: Uses BPE
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

# BERT: Uses WordPiece
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# T5: Uses SentencePiece (Unigram)
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")

# Compare tokenization outputs
text = "I can't believe the unbelievability of this!"

print("=== Tokenization Comparison ===\n")
print(f"Input: {text}\n")

# GPT-2 (BPE)
gpt2_tokens = gpt2_tokenizer.tokenize(text)
gpt2_ids = gpt2_tokenizer.encode(text)
print(f"GPT-2 (BPE):")
print(f"  Tokens: {gpt2_tokens}")
print(f"  IDs: {gpt2_ids}")
print(f"  Vocab size: {len(gpt2_tokenizer)}")

# BERT (WordPiece)
bert_tokens = bert_tokenizer.tokenize(text)
bert_ids = bert_tokenizer.encode(text)
print(f"\nBERT (WordPiece):")
print(f"  Tokens: {bert_tokens}")
print(f"  IDs: {bert_ids}")
print(f"  Vocab size: {len(bert_tokenizer)}")

# T5 (Unigram/SentencePiece)
t5_tokens = t5_tokenizer.tokenize(text)
t5_ids = t5_tokenizer.encode(text)
print(f"\nT5 (SentencePiece):")
print(f"  Tokens: {t5_tokens}")
print(f"  IDs: {t5_ids}")
print(f"  Vocab size: {len(t5_tokenizer)}")

# Handling out-of-vocabulary words
oov_text = "pneumonoultramicroscopicsilicovolcanoconiosis"
print(f"\n=== OOV Handling ===")
print(f"Word: {oov_text}\n")
print(f"GPT-2: {gpt2_tokenizer.tokenize(oov_text)}")
print(f"BERT:  {bert_tokenizer.tokenize(oov_text)}")
print(f"T5:    {t5_tokenizer.tokenize(oov_text)}")
```

Character-level tokenization represents the finest granularity: each character becomes a token. While this eliminates the OOV problem entirely (any text is representable with a fixed character set), it introduces significant challenges that limit its applicability.
Advantages:

- Tiny, fixed vocabulary (roughly 100-300 symbols for most scripts)
- No out-of-vocabulary tokens: any string can be encoded
- Robust to misspellings, typos, and creative spellings
- Naturally handles rare words, codes, and neologisms

Disadvantages:

- Sequences become several times longer than word-level tokenization, increasing compute and memory cost
- Each token carries little semantic content, so models must compose meaning over many more steps
- Long-range dependencies become harder to capture because related content is spread over more positions
```python
from typing import List, Dict, Tuple
import string

class CharacterTokenizer:
    """
    Character-level tokenizer with support for special tokens
    and vocabulary management.
    """

    def __init__(self):
        # Special tokens
        self.PAD = "<PAD>"
        self.UNK = "<UNK>"
        self.BOS = "<BOS>"  # Beginning of sequence
        self.EOS = "<EOS>"  # End of sequence

        # Build vocabulary from printable ASCII + common special chars
        special_tokens = [self.PAD, self.UNK, self.BOS, self.EOS]
        chars = list(string.printable)  # 100 printable ASCII characters

        self.id_to_char = special_tokens + chars
        self.char_to_id = {c: i for i, c in enumerate(self.id_to_char)}
        self.vocab_size = len(self.id_to_char)
        self.unk_id = self.char_to_id[self.UNK]

    def encode(self, text: str, add_special: bool = True) -> List[int]:
        """Convert text to list of token IDs."""
        ids = []
        if add_special:
            ids.append(self.char_to_id[self.BOS])
        for char in text:
            ids.append(self.char_to_id.get(char, self.unk_id))
        if add_special:
            ids.append(self.char_to_id[self.EOS])
        return ids

    def decode(self, ids: List[int], skip_special: bool = True) -> str:
        """Convert token IDs back to text."""
        chars = []
        special_ids = {
            self.char_to_id[self.PAD],
            self.char_to_id[self.UNK],
            self.char_to_id[self.BOS],
            self.char_to_id[self.EOS],
        }
        for id in ids:
            if skip_special and id in special_ids:
                continue
            if 0 <= id < len(self.id_to_char):
                chars.append(self.id_to_char[id])
        return "".join(chars)

    def batch_encode(
        self, texts: List[str], max_length: int = 512, padding: bool = True
    ) -> Tuple[List[List[int]], List[int]]:
        """
        Encode a batch of texts with padding.
        Returns (padded_ids, lengths)
        """
        encoded = [self.encode(text) for text in texts]
        lengths = [len(seq) for seq in encoded]

        if padding:
            max_len = min(max(lengths), max_length)
            pad_id = self.char_to_id[self.PAD]
            padded = []
            for seq in encoded:
                if len(seq) > max_len:
                    padded.append(seq[:max_len])
                else:
                    padded.append(seq + [pad_id] * (max_len - len(seq)))
            return padded, lengths

        return encoded, lengths

# Demonstration
tokenizer = CharacterTokenizer()

text = "Hello, World!"
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"\nInput: {text}")

ids = tokenizer.encode(text)
print(f"Encoded: {ids}")
print(f"Decoded: {tokenizer.decode(ids)}")

# Compare sequence lengths with word-level
print(f"\n=== Sequence Length Comparison ===")
sample = "The quick brown fox jumps over the lazy dog."
char_len = len(tokenizer.encode(sample))
word_len = len(sample.split())
print(f"Text: {sample}")
print(f"Character tokens: {char_len}")
print(f"Word tokens: {word_len}")
print(f"Ratio: {char_len / word_len:.1f}x longer sequences")
```

Character-level approaches excel in specific scenarios: (1) Spelling correction and typo detection, (2) Text generation where character-level control matters, (3) Highly multilingual settings with diverse scripts, (4) Domains with heavy use of neologisms, codes, or non-standard text. Modern byte-level approaches (like GPT-4's tokenizer) combine the robustness of character-level with subword efficiency.
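To make the byte-level point concrete, here is a small sketch assuming the `tiktoken` package is installed; the exact token counts depend on the chosen encoding:

```python
# Byte-level BPE: any string (emoji, other scripts, misspellings) encodes without
# an <UNK>, because the base vocabulary covers all 256 possible byte values.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # byte-level BPE encoding

for text in ["hello world", "héllo wörld", "🙂🙃", "pneumonoultramicroscopic"]:
    ids = enc.encode(text)
    print(f"{text!r:28} -> {len(ids):2d} tokens, round-trips: {enc.decode(ids) == text}")
```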
The tokenization strategies we've discussed assume whitespace-delimited languages like English. This assumption fails dramatically for many of the world's languages, each presenting unique segmentation challenges.
Chinese, Japanese, and Thai (No Whitespace):
These languages write words without spaces between them. Sentence boundaries are marked by punctuation, but word boundaries are left implicit:
Chinese: 我爱自然语言处理 (I love natural language processing)
Japanese: 私は自然言語処理が大好きです
Thai: ฉันรักการประมวลผลภาษาธรรมชาติ
Tokenization therefore requires either word segmentation models trained on annotated corpora or dictionary-based maximum-matching (MaxMatch) algorithms.
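A minimal sketch of the dictionary-based MaxMatch idea follows; the tiny dictionary is illustrative only, and real systems combine large lexicons with statistical or neural disambiguation:

```python
# Greedy forward maximum matching: at each position, take the longest dictionary
# word that matches, falling back to a single character when nothing matches.
def max_match(text: str, dictionary: set, max_len: int = 4) -> list:
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

# Toy dictionary for 我爱自然语言处理 ("I love natural language processing")
dictionary = {"我", "爱", "自然", "语言", "处理", "自然语言"}
print(max_match("我爱自然语言处理", dictionary))
# ['我', '爱', '自然语言', '处理']
```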
| Language | Challenge | Solution Approach |
|---|---|---|
| Chinese | No word boundaries; characters can combine in multiple valid ways | Jieba, Stanford Segmenter, BERT-based segmenters |
| Japanese | Three scripts (kanji, hiragana, katakana); no spaces | MeCab, Sudachi, SentencePiece for script-agnostic subword tokenization |
| German | Compound words (Donaudampfschifffahrt) | Compound splitting, subword tokenization |
| Arabic | Rich morphology, clitics, diacritics optional | Morphological analyzers (MADAMIRA, Farasa) |
| Korean | Syllable blocks (Hangul Jamo), agglutinative morphology | Mecab-ko, subword tokenization |
| Hindi | Compound words, sandhi (sound changes at word boundaries) | iNLTK, specialized tokenizers |
```python
# Multilingual tokenization examples
# Requires: pip install jieba fugashi sentencepiece transformers

# === Chinese Tokenization with Jieba ===
import jieba

chinese_text = "我爱自然语言处理和机器学习"
# Without segmentation: 13 characters, no word boundaries
print(f"Chinese raw: {chinese_text}")
print(f"Character count: {len(chinese_text)}")

# Jieba segmentation
words = list(jieba.cut(chinese_text))
print(f"Jieba segments: {words}")
# Output: ['我', '爱', '自然', '语言', '处理', '和', '机器', '学习']

# === Japanese Tokenization with Fugashi (MeCab wrapper) ===
import fugashi

japanese_text = "私は機械学習が大好きです"
tagger = fugashi.Tagger()

print(f"\nJapanese raw: {japanese_text}")
words = [word.surface for word in tagger(japanese_text)]
print(f"MeCab segments: {words}")
# Output: ['私', 'は', '機械', '学習', 'が', '大好き', 'です']

# === Multilingual Tokenization with mBERT (WordPiece) ===
from transformers import AutoTokenizer

# mBERT handles 104 languages
mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

texts = {
    "English": "Hello, how are you?",
    "Chinese": "你好,你好吗?",
    "Japanese": "こんにちは、お元気ですか?",
    "Arabic": "مرحبا كيف حالك؟",
    "German": "Donaudampfschifffahrtsgesellschaftskapitän",
}

print("\n=== Multilingual BERT Tokenization ===")
for lang, text in texts.items():
    tokens = mbert.tokenize(text)
    print(f"\n{lang}: {text}")
    print(f"  Tokens: {tokens}")

# Note: mBERT uses WordPiece with a shared vocabulary across all languages
# The "##" prefix indicates continuation of a word (not a word start)
```

Languages with complex morphology or non-Latin scripts often get more fragmented tokenization, requiring more tokens to express the same content. This 'tokenization tax' means models process these languages less efficiently, potentially leading to higher costs, slower inference, and reduced quality. This is an active area of research in fair multilingual NLP.
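One quick way to observe this tax is to count tokens for roughly parallel sentences. The sketch below loads the same multilingual BERT tokenizer as above; the translations are approximate and the exact counts depend on the tokenizer version:

```python
from transformers import AutoTokenizer

mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# Roughly parallel sentences; per-language token counts reveal fragmentation differences
parallel = {
    "English": "I love natural language processing.",
    "German": "Ich liebe die Verarbeitung natürlicher Sprache.",
    "Hindi": "मुझे प्राकृतिक भाषा प्रसंस्करण पसंद है।",
    "Arabic": "أحب معالجة اللغة الطبيعية.",
}

for lang, sentence in parallel.items():
    n_tokens = len(mbert.tokenize(sentence))
    print(f"{lang:8} -> {n_tokens} tokens")
```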
Moving tokenization from prototype to production introduces additional requirements beyond correctness. Speed, consistency, and maintainability become critical factors.
Performance Optimization:
Tokenization often becomes a bottleneck in NLP pipelines. For real-time applications processing thousands of requests per second, even small inefficiencies compound:
```python
from transformers import AutoTokenizer
from typing import List, Dict
import time
from functools import lru_cache, partial
from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp

# === Batched Tokenization ===
def benchmark_batching(tokenizer, texts: List[str], batch_size: int = 32):
    """Compare individual vs batched tokenization speed."""
    # Individual tokenization
    start = time.perf_counter()
    individual_results = [tokenizer(text) for text in texts]
    individual_time = time.perf_counter() - start

    # Batched tokenization
    start = time.perf_counter()
    batched_results = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors="pt"
    )
    batched_time = time.perf_counter() - start

    print(f"Individual: {individual_time:.3f}s")
    print(f"Batched: {batched_time:.3f}s")
    print(f"Speedup: {individual_time / batched_time:.1f}x")

# === Tokenizer Caching ===
class CachedTokenizer:
    """
    Tokenizer wrapper with LRU cache for repeated inputs.
    Useful when the same texts are processed repeatedly.
    """

    def __init__(self, model_name: str, cache_size: int = 10000):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.cache_size = cache_size

        # Create cached encode function
        @lru_cache(maxsize=cache_size)
        def _cached_encode(text: str) -> tuple:
            result = self.tokenizer.encode(text, add_special_tokens=True)
            return tuple(result)  # LRU cache requires hashable return

        self._cached_encode = _cached_encode

    def encode(self, text: str) -> List[int]:
        return list(self._cached_encode(text))

    def cache_info(self):
        return self._cached_encode.cache_info()

# === Parallel Tokenization ===
def _worker_tokenize(text_batch: List[str], model_name: str) -> List[List[int]]:
    """Tokenize one chunk; each worker process loads its own tokenizer."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return [tokenizer.encode(t) for t in text_batch]

def parallel_tokenize(
    texts: List[str], model_name: str, n_workers: int = None
) -> List[List[int]]:
    """
    Tokenize texts in parallel using multiple CPU cores.
    Note: For small batches, overhead may exceed benefits.
    """
    if n_workers is None:
        n_workers = mp.cpu_count()

    # Split texts into chunks
    chunk_size = max(1, len(texts) // n_workers)
    chunks = [
        texts[i:i + chunk_size]
        for i in range(0, len(texts), chunk_size)
    ]

    # Process in parallel (worker must be a module-level function so it can be pickled)
    worker = partial(_worker_tokenize, model_name=model_name)
    with ProcessPoolExecutor(max_workers=n_workers) as executor:
        results = list(executor.map(worker, chunks))

    # Flatten results
    return [item for sublist in results for item in sublist]

# === Consistency Guarantees ===
class DeterministicTokenizer:
    """
    Wrapper ensuring deterministic tokenization across versions.
    Stores tokenizer version and validates on load.
    """

    def __init__(self, model_name: str):
        self.model_name = model_name
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        # Store vocabulary hash for validation
        vocab = self.tokenizer.get_vocab()
        self.vocab_hash = hash(frozenset(vocab.items()))

    def validate_consistency(self, expected_hash: int) -> bool:
        """Check if current tokenizer matches expected vocabulary."""
        return self.vocab_hash == expected_hash

    def encode_with_metadata(self, text: str) -> Dict:
        """Return encoding with reproducibility metadata."""
        return {
            "input_ids": self.tokenizer.encode(text),
            "model_name": self.model_name,
            "vocab_hash": self.vocab_hash,
        }

# Example usage
if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Generate test data
    texts = ["This is a test sentence."] * 1000

    print("=== Batching Benchmark ===")
    benchmark_batching(tokenizer, texts)

    print("\n=== Caching Demo ===")
    cached = CachedTokenizer("bert-base-uncased", cache_size=100)

    # First call - cache miss
    _ = cached.encode("Hello world")
    print(f"After 1 call: {cached.cache_info()}")

    # Repeat same text - cache hit
    for _ in range(100):
        _ = cached.encode("Hello world")
    print(f"After 101 calls: {cached.cache_info()}")
```

With multiple tokenization strategies available, selecting the right approach requires understanding your specific requirements. Here's a decision framework based on common scenarios:
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| Traditional ML (bag-of-words, TF-IDF) | Word-level (NLTK, spaCy) | Interpretable features, works with classical algorithms |
| Transformer fine-tuning | Pre-trained model's tokenizer | Must match pre-training tokenization exactly |
| Training from scratch | BPE/WordPiece (8K-50K vocab) | Balance between coverage and efficiency |
| Multilingual applications | SentencePiece with Unigram | Language-agnostic, handles diverse scripts |
| Spelling/OCR applications | Character-level | Access to character patterns, no OOV |
| Low-resource languages | Subword with smaller vocab | Maximizes coverage with limited data |
| Code/technical content | BPE trained on code | Captures programming language patterns |
When using pre-trained models, ALWAYS use the exact tokenizer that model was trained with. Vocabulary mismatches cause catastrophic performance degradation. The tokenizer is as much a part of the model as the weights themselves.
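A small sketch of why this matters, assuming two BERT checkpoints whose vocabularies differ (cased vs. uncased): the same text maps to different token IDs, so feeding one model IDs produced by the other tokenizer scrambles the input.

```python
from transformers import AutoTokenizer

uncased = AutoTokenizer.from_pretrained("bert-base-uncased")
cased = AutoTokenizer.from_pretrained("bert-base-cased")

text = "Tokenizers are part of the model."
ids_uncased = uncased.encode(text)
ids_cased = cased.encode(text)

print(ids_uncased)
print(ids_cased)
# Same string, different vocabularies -> different IDs. Decoding one tokenizer's IDs
# with the other shows how badly the mapping breaks.
print(uncased.decode(ids_cased))
```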
Vocabulary Size Trade-offs:
The vocabulary size parameter in subword tokenization deserves careful consideration:
Smaller vocabulary (4K-8K):

- Smaller embedding matrix and fewer parameters to train
- Better suited to low-resource settings with limited training data
- Longer token sequences, so slower inference and less text fits in a fixed context window

Larger vocabulary (32K-100K):

- Shorter sequences: more frequent words survive as single tokens
- Larger embedding matrix, and rare tokens receive fewer training updates
- Pays off mainly when a large corpus provides enough evidence for each token
Modern practice typically uses 30K-50K for monolingual models and 100K-250K for multilingual models.
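One concrete cost to keep in mind is the embedding matrix, whose parameter count is vocab_size × hidden_size (and is often duplicated in the output projection when weights aren't tied). A rough back-of-the-envelope sketch, assuming a hidden size of 768:

```python
# Embedding parameters = vocab_size * hidden_size; hidden size of 768 is illustrative
hidden_size = 768
for vocab_size in (8_000, 32_000, 50_000, 250_000):
    params = vocab_size * hidden_size
    print(f"vocab {vocab_size:>7,} -> {params / 1e6:6.1f}M embedding parameters")
# vocab   8,000 ->    6.1M embedding parameters
# vocab  32,000 ->   24.6M embedding parameters
# vocab  50,000 ->   38.4M embedding parameters
# vocab 250,000 ->  192.0M embedding parameters
```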
Tokenization is deceptively simple to describe but critically important to execute correctly. As the first transformation in any text processing pipeline, tokenization decisions propagate through your entire system.
What's Next:
With text segmented into tokens, the next preprocessing step is stop-word removal—identifying and handling high-frequency, low-information words that can obscure meaningful patterns. We'll explore when stop-word removal helps, when it hurts, and how to make domain-appropriate decisions.
You now understand tokenization from first principles through production deployment. You can implement custom tokenizers, leverage standard libraries, handle multilingual text, and make informed decisions about tokenization strategy based on your specific NLP task requirements.