Consider these text classification challenges:

- Identifying the language of a short snippet of text.
- Attributing a document to its author from subtle stylistic cues.
- Classifying noisy, typo-ridden user-generated text in which many words never appeared in the training data.
Character n-grams solve all of these problems by operating below the word level. Instead of treating words as atomic units, character n-grams decompose text into sequences of characters, capturing sub-word patterns that reveal morphology, typography, and language-specific characteristics.
This representation is remarkably powerful: character n-grams can achieve state-of-the-art performance on language identification, authorship attribution, and text classification with noisy inputs—all without any linguistic knowledge about the target language.
By the end of this page, you will understand how character n-grams work, when they outperform word n-grams, how to handle the unique implementation challenges they present, and how to combine character and word features for maximum performance. You'll gain practical skills for language identification, typo-robust classification, and authorship analysis.
A character n-gram is a contiguous sequence of n characters from a text string. Unlike word n-grams, which require tokenization to identify word boundaries, character n-grams operate directly on the raw character sequence.
Formal Definition:
Given a text string S = [c₁, c₂, ..., cₖ] of k characters, the set of character n-grams Cₙ(S) is:
Cₙ(S) = {(cᵢ, cᵢ₊₁, ..., cᵢ₊ₙ₋₁) | 1 ≤ i ≤ k - n + 1}
Example:
For the text "hello" with n=3 (trigrams):
| Position | Characters | Trigram |
|---|---|---|
| 1-3 | h, e, l | "hel" |
| 2-4 | e, l, l | "ell" |
| 3-5 | l, l, o | "llo" |
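This definition maps directly onto a sliding window over the string. A minimal sketch in plain Python (the helper name is illustrative) that reproduces the table above:

```python
from typing import List

def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous substrings of length n, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("hello", 3))  # ['hel', 'ell', 'llo']
```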
Note that character n-grams can include: space characters and other whitespace, punctuation marks, digits, and uppercase/lowercase distinctions (when the text is not lowercased).
Whether to include each category is a design decision with significant implications.
A critical decision is whether to include space characters in n-grams. With spaces: "hello world" → ["hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"]. Without spaces (word boundaries): ["hel", "ell", "llo"] + ["wor", "orl", "rld"]. Including spaces captures word boundaries and cross-word patterns; excluding them focuses on within-word structure.
Word-Boundary Aware Character N-grams:
A common variant adds special boundary markers (for example, '#') to each word, so "hello" becomes "#hello#" and its trigrams are "#he", "hel", "ell", "llo", "lo#".
Boundary markers capture:

- Word-initial patterns, i.e. prefixes such as '#un' or '#th'.
- Word-final patterns, i.e. suffixes such as 'ing#' or 'ly#'.
- Short whole words as single features (e.g., '#a#', or '#to#' at higher orders).
This hybrid approach preserves word structure while benefiting from sub-word patterns.
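A minimal sketch of this variant (assuming '#' as the boundary character, matching the fuller extractor later on this page):

```python
from typing import List

def boundary_trigrams(text: str, marker: str = "#") -> List[str]:
    """Wrap each word in boundary markers, then slide a 3-character window."""
    ngrams: List[str] = []
    for word in text.lower().split():
        padded = f"{marker}{word}{marker}"
        ngrams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return ngrams

print(boundary_trigrams("Hello world"))
# ['#he', 'hel', 'ell', 'llo', 'lo#', '#wo', 'wor', 'orl', 'rld', 'ld#']
```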
Character N-gram Orders:
Typical ranges for different applications:
| Order | Name | Best For |
|---|---|---|
| 1 | Unigrams | Character frequency analysis |
| 2-3 | Bi/Trigrams | Common patterns, morphology |
| 3-5 | Tri to 5-grams | Language ID, authorship |
| 5-7 | Higher order | Long patterns, specific phrases |
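As a quick illustration of how order affects the feature space, here is a small sketch (plain Python, illustrative sample text) counting the distinct character n-grams of a single sentence at each order:

```python
text = "the quick brown fox jumps over the lazy dog"

for n in range(1, 8):
    # Collect the distinct n-grams produced by a sliding window of width n
    ngrams = {text[i:i + n] for i in range(len(text) - n + 1)}
    print(f"n={n}: {len(ngrams)} distinct n-grams from {len(text) - n + 1} positions")
```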
Character n-grams capture linguistic patterns that word-level analysis misses. Understanding these patterns explains their power across diverse applications.
1. Morphological Structure:
Languages encode meaning through word structure: prefixes (un-, re-, dis-), suffixes (-ing, -tion, -ly), and inflections that mark tense, number, and case.
Character n-grams capture these patterns without explicit morphological analysis.
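To see this concretely, here is a minimal sketch (hypothetical helper, using '#' boundary markers as above) showing that unrelated verbs share the suffix features 'ing' and 'ng#' without any stemming:

```python
def marked_trigrams(word: str) -> set:
    """Character trigrams of a word wrapped in '#' boundary markers."""
    padded = f"#{word}#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

words = ["running", "jumping", "coding"]
shared = set.intersection(*(marked_trigrams(w) for w in words))
print(sorted(shared))  # ['ing', 'ng#']
```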
2. Letter Frequency Patterns:
Languages have distinctive character distributions: English text is full of 'th', 'the', and 'ing'; German favors 'sch', 'ich', and 'ein'; Spanish shows 'ción', 'll', and 'que'; French has frequent apostrophes and accented vowels.
These distributions enable language identification from character n-grams alone.
3. Spelling Variations:
Similar words share character n-grams: "color" and "colour" share 'col' and 'olo'; "organize" and "organise" differ in only a couple of trigrams; even the typo "amaznig" keeps most of the trigrams of "amazing".
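A small sketch (illustrative helper name) computing that overlap with the Jaccard measure used later in the robustness demo:

```python
def trigrams(word: str) -> set:
    """Set of character trigrams of a bare word (no boundary markers)."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

a, b = trigrams("color"), trigrams("colour")
print(sorted(a & b))                      # shared trigrams: ['col', 'olo']
print(f"{len(a & b) / len(a | b):.2f}")   # Jaccard similarity: 0.40
```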
Use character n-grams when:

- Text is noisy, with typos or non-standard spelling.
- The task is language identification or authorship attribution.
- Morphology is important (e.g., agglutinative languages).
- Training data is limited (a smaller vocabulary generalizes better).
- You are processing user-generated content (social media, reviews).
Efficient character n-gram extraction requires attention to preprocessing decisions and boundary handling.
```python
from typing import List, Dict, Optional
from collections import Counter
import re

from scipy.sparse import csr_matrix


class CharacterNgramExtractor:
    """
    Comprehensive character n-gram extraction with multiple strategies.
    """

    def __init__(
        self,
        min_n: int = 3,
        max_n: int = 5,
        lowercase: bool = True,
        include_spaces: bool = True,
        word_boundaries: bool = True,
        boundary_char: str = '#',
        strip_punctuation: bool = False,
    ):
        self.min_n = min_n
        self.max_n = max_n
        self.lowercase = lowercase
        self.include_spaces = include_spaces
        self.word_boundaries = word_boundaries
        self.boundary_char = boundary_char
        self.strip_punctuation = strip_punctuation

    def preprocess(self, text: str) -> str:
        """Apply preprocessing steps."""
        if self.lowercase:
            text = text.lower()
        if self.strip_punctuation:
            text = re.sub(r'[^\w\s]', '', text)
        return text

    def extract_from_string(self, text: str, n: int) -> List[str]:
        """Extract character n-grams from a continuous string."""
        if len(text) < n:
            return []
        return [text[i:i+n] for i in range(len(text) - n + 1)]

    def extract_with_word_boundaries(self, text: str, n: int) -> List[str]:
        """
        Extract character n-grams with word boundary markers.

        Each word is wrapped: "hello" → "#hello#"
        """
        words = text.split()
        ngrams = []
        for word in words:
            marked_word = f"{self.boundary_char}{word}{self.boundary_char}"
            ngrams.extend(self.extract_from_string(marked_word, n))
        return ngrams

    def extract(self, text: str) -> List[str]:
        """Extract all character n-grams with the configured settings."""
        text = self.preprocess(text)
        ngrams = []
        for n in range(self.min_n, self.max_n + 1):
            if self.word_boundaries:
                ngrams.extend(self.extract_with_word_boundaries(text, n))
            else:
                if not self.include_spaces:
                    # Extract from each word separately
                    for word in text.split():
                        ngrams.extend(self.extract_from_string(word, n))
                else:
                    # Extract from the continuous string
                    ngrams.extend(self.extract_from_string(text, n))
        return ngrams


class CharacterNgramVectorizer:
    """
    Complete character n-gram vectorizer with vocabulary management.
    """

    def __init__(
        self,
        min_n: int = 3,
        max_n: int = 5,
        max_vocab_size: Optional[int] = None,
        min_df: int = 1,
        lowercase: bool = True,
        word_boundaries: bool = True,
    ):
        self.extractor = CharacterNgramExtractor(
            min_n=min_n,
            max_n=max_n,
            lowercase=lowercase,
            word_boundaries=word_boundaries,
        )
        self.max_vocab_size = max_vocab_size
        self.min_df = min_df
        self.vocabulary_: Dict[str, int] = {}
        self.term_freq_: Counter = Counter()
        self.doc_freq_: Counter = Counter()

    def fit(self, documents: List[str]) -> 'CharacterNgramVectorizer':
        """Build vocabulary from the corpus."""
        for doc in documents:
            ngrams = self.extractor.extract(doc)
            self.term_freq_.update(ngrams)
            self.doc_freq_.update(set(ngrams))

        # Filter by document frequency
        candidates = [
            ng for ng, freq in self.doc_freq_.items()
            if freq >= self.min_df
        ]

        # Sort by corpus frequency
        sorted_ngrams = sorted(
            candidates,
            key=lambda ng: self.term_freq_[ng],
            reverse=True
        )

        # Limit vocabulary size
        if self.max_vocab_size:
            sorted_ngrams = sorted_ngrams[:self.max_vocab_size]

        self.vocabulary_ = {
            ng: idx for idx, ng in enumerate(sorted_ngrams)
        }
        return self

    def transform(self, documents: List[str]) -> csr_matrix:
        """Transform documents to a sparse count matrix."""
        rows, cols, data = [], [], []
        for doc_idx, doc in enumerate(documents):
            ngrams = self.extractor.extract(doc)
            ngram_counts = Counter(ngrams)
            for ngram, count in ngram_counts.items():
                if ngram in self.vocabulary_:
                    rows.append(doc_idx)
                    cols.append(self.vocabulary_[ngram])
                    data.append(count)
        return csr_matrix(
            (data, (rows, cols)),
            shape=(len(documents), len(self.vocabulary_))
        )

    def analyze_vocabulary(self) -> Dict:
        """Analyze vocabulary composition."""
        by_length = Counter()
        for ngram in self.vocabulary_:
            by_length[len(ngram)] += 1
        return {
            'total_vocabulary': len(self.vocabulary_),
            'by_length': dict(by_length),
            'top_ngrams': self.term_freq_.most_common(20),
        }


def demonstrate_character_ngrams():
    """Demonstrate character n-gram extraction."""
    print("="*60)
    print("CHARACTER N-GRAM DEMONSTRATION")
    print("="*60)

    text = "Hello World"

    # Different extraction strategies
    strategies = [
        ("Continuous (with spaces)", CharacterNgramExtractor(
            min_n=3, max_n=3, word_boundaries=False, include_spaces=True
        )),
        ("Continuous (no spaces)", CharacterNgramExtractor(
            min_n=3, max_n=3, word_boundaries=False, include_spaces=False
        )),
        ("Word boundaries", CharacterNgramExtractor(
            min_n=3, max_n=3, word_boundaries=True
        )),
    ]

    print(f"\nText: '{text}'")
    for name, extractor in strategies:
        ngrams = extractor.extract(text)
        print(f"\n{name}:")
        print(f"  N-grams: {ngrams}")
        print(f"  Count: {len(ngrams)}")

    # Typo tolerance demonstration
    print("\n" + "="*60)
    print("TYPO TOLERANCE DEMONSTRATION")
    print("="*60)

    correct = "amazing"
    typos = ["amaznig", "amzing", "amazzing", "amaizing"]

    extractor = CharacterNgramExtractor(
        min_n=3, max_n=3, word_boundaries=True
    )
    correct_ngrams = set(extractor.extract(correct))

    print(f"\nCorrect: '{correct}'")
    print(f"  Trigrams: {sorted(correct_ngrams)}")

    for typo in typos:
        typo_ngrams = set(extractor.extract(typo))
        overlap = len(correct_ngrams & typo_ngrams)
        total = len(correct_ngrams | typo_ngrams)
        jaccard = overlap / total if total > 0 else 0
        print(f"\nTypo: '{typo}'")
        print(f"  Trigrams: {sorted(typo_ngrams)}")
        print(f"  Overlap: {overlap}/{len(correct_ngrams)} ({jaccard:.1%} Jaccard)")


def typo_robustness_analysis():
    """Analyze how character n-grams handle various types of errors."""
    print("\n" + "="*60)
    print("ERROR ROBUSTNESS ANALYSIS")
    print("="*60)

    extractor = CharacterNgramExtractor(
        min_n=3, max_n=5, word_boundaries=True
    )

    # Test cases with different error types
    test_cases = [
        ("Transposition", "the", "teh"),
        ("Insertion", "hello", "helllo"),
        ("Deletion", "friend", "frend"),
        ("Substitution", "color", "colour"),
        ("Multiple errors", "beautiful", "beautful"),
    ]

    for error_type, correct, error in test_cases:
        correct_ng = set(extractor.extract(correct))
        error_ng = set(extractor.extract(error))
        overlap = len(correct_ng & error_ng)
        union = len(correct_ng | error_ng)
        jaccard = overlap / union if union > 0 else 0
        print(f"\n{error_type}: '{correct}' → '{error}'")
        print(f"  Preservation: {jaccard:.1%}")


if __name__ == "__main__":
    demonstrate_character_ngrams()
    typo_robustness_analysis()
```

Language identification is the canonical success story for character n-grams. Even with simple 3- to 5-character n-grams, classifiers can achieve over 99% accuracy at distinguishing languages, often from just a few words of text.
Why Character N-grams Excel:

- Every language draws on a small, closed character set, so the feature space stays compact even with limited training text.
- Character sequences are highly distinctive across languages (articles, common suffixes, diacritics).
- No tokenizer, dictionary, or other linguistic resources are required, so the same method works for any language.
- Reliable signal emerges from very short inputs.

Typical Approach:

1. Build a ranked character n-gram frequency profile for each language from training text.
2. Build the same kind of profile for the text to be identified.
3. Compare the test profile against every language profile and predict the closest match.

Simple Distance-Based Method:

The "out-of-place" measure (from the TextCat algorithm of Cavnar & Trenkle, 1994) ranks n-grams by frequency in each language profile. For test text, each n-gram's rank in the test profile is compared with its rank in the language profile; the total distance is the sum of these rank differences, with a fixed maximum penalty for n-grams missing from the language profile. The language with the smallest total distance is predicted.
```python
from collections import Counter
from typing import List, Dict, Tuple


class CharNgramLanguageIdentifier:
    """
    Language identification using character n-gram profiles.

    Based on the TextCat algorithm (Cavnar & Trenkle, 1994).
    """

    def __init__(
        self,
        n_range: Tuple[int, int] = (3, 5),
        profile_size: int = 300,
    ):
        self.n_range = n_range
        self.profile_size = profile_size
        # Language → ranked n-gram list
        self.profiles: Dict[str, List[str]] = {}

    def _extract_ngrams(self, text: str) -> List[str]:
        """Extract character n-grams with word boundary padding."""
        text = text.lower()
        ngrams = []
        for n in range(self.n_range[0], self.n_range[1] + 1):
            # Add word boundaries
            words = text.split()
            for word in words:
                padded = f"_{word}_"
                ngrams.extend(
                    padded[i:i+n] for i in range(len(padded) - n + 1)
                )
        return ngrams

    def _build_profile(self, text: str) -> List[str]:
        """Build a ranked n-gram profile from text."""
        ngrams = self._extract_ngrams(text)
        counter = Counter(ngrams)
        # Return the top n-grams by frequency
        return [ng for ng, _ in counter.most_common(self.profile_size)]

    def fit(self, language_texts: Dict[str, str]) -> 'CharNgramLanguageIdentifier':
        """
        Train on texts for each language.

        Args:
            language_texts: Dict mapping language code to training text
        """
        for language, text in language_texts.items():
            self.profiles[language] = self._build_profile(text)
        return self

    def _out_of_place_distance(
        self,
        test_profile: List[str],
        language_profile: List[str]
    ) -> int:
        """
        Compute the out-of-place distance between profiles.

        For each n-gram in the test profile, find its position in the
        language profile; the sum of position differences is the distance.
        """
        lang_positions = {
            ng: pos for pos, ng in enumerate(language_profile)
        }
        distance = 0
        max_penalty = self.profile_size
        for test_pos, ng in enumerate(test_profile):
            if ng in lang_positions:
                distance += abs(test_pos - lang_positions[ng])
            else:
                # N-gram not in the language profile
                distance += max_penalty
        return distance

    def predict(self, text: str) -> str:
        """Identify the language of a text."""
        test_profile = self._build_profile(text)
        distances = {}
        for language, lang_profile in self.profiles.items():
            distances[language] = self._out_of_place_distance(
                test_profile, lang_profile
            )
        return min(distances, key=distances.get)

    def predict_with_scores(self, text: str) -> Dict[str, float]:
        """Return all language scores (higher = better match)."""
        test_profile = self._build_profile(text)
        distances = {}
        for language, lang_profile in self.profiles.items():
            distances[language] = self._out_of_place_distance(
                test_profile, lang_profile
            )
        # Convert distances to similarities (0-1, higher = better)
        max_dist = max(distances.values())
        similarities = {
            lang: 1 - (dist / max_dist)
            for lang, dist in distances.items()
        }
        return similarities


def demonstrate_language_identification():
    """Demonstrate character n-gram language identification."""
    print("="*60)
    print("LANGUAGE IDENTIFICATION")
    print("="*60)

    # Training data (simplified - real systems use much more)
    training_data = {
        'english': """
            The quick brown fox jumps over the lazy dog.
            Machine learning is transforming how we build software.
            Natural language processing enables text understanding.
            The weather today is quite pleasant and sunny.
            Reading books is a wonderful way to learn new things.
        """,
        'spanish': """
            El rápido zorro marrón salta sobre el perro perezoso.
            El aprendizaje automático está transformando el software.
            El procesamiento del lenguaje natural permite comprender texto.
            El clima hoy es bastante agradable y soleado.
            Leer libros es una forma maravillosa de aprender.
        """,
        'german': """
            Der schnelle braune Fuchs springt über den faulen Hund.
            Maschinelles Lernen verändert die Softwareentwicklung.
            Die Verarbeitung natürlicher Sprache ermöglicht Textverständnis.
            Das Wetter heute ist ziemlich angenehm und sonnig.
            Bücher lesen ist eine wunderbare Art zu lernen.
        """,
        'french': """
            Le rapide renard brun saute par-dessus le chien paresseux.
            L'apprentissage automatique transforme le développement logiciel.
            Le traitement du langage naturel permet la compréhension du texte.
            Le temps aujourd'hui est assez agréable et ensoleillé.
            Lire des livres est une merveilleuse façon d'apprendre.
        """,
    }

    # Train the identifier
    identifier = CharNgramLanguageIdentifier(
        n_range=(2, 4),
        profile_size=200
    )
    identifier.fit(training_data)

    # Test samples
    test_samples = [
        "This is a simple English sentence.",
        "Esta es una oración simple en español.",
        "Dies ist ein einfacher deutscher Satz.",
        "Ceci est une phrase simple en français.",
        "The machine learning model works well.",
        "El modelo de aprendizaje funciona bien.",
    ]

    print("\nLanguage Identification Results:")
    print("-"*50)
    for sample in test_samples:
        prediction = identifier.predict(sample)
        scores = identifier.predict_with_scores(sample)
        print(f"\n'{sample[:40]}...'")
        print(f"  Predicted: {prediction}")
        print(f"  Scores: {', '.join(f'{l}: {s:.2f}' for l, s in sorted(scores.items(), key=lambda x: -x[1]))}")


if __name__ == "__main__":
    demonstrate_language_identification()
```

Authorship attribution—identifying who wrote a text—is another domain where character n-grams excel. Authors have distinctive stylistic fingerprints that emerge at the character level, including preferences for certain word endings, punctuation patterns, and letter combinations.
Why Character N-grams Work for Authorship: authors leave low-level fingerprints that are hard to disguise, including preferred suffixes and word endings, punctuation and capitalization habits, and characteristic function-word sequences. Character n-grams pick up all of these without explicit stylometric feature engineering.
Research Findings:
| Feature Type | Example | What It Captures |
|---|---|---|
| Word-ending n-grams | 'ing#', 'tion#', 'ly#' | Suffix preferences, part-of-speech tendencies |
| Word-beginning n-grams | '#the', '#th', '#wh' | Function word usage, question patterns |
| Space patterns | ' the ', ' a ' | Article and preposition frequency |
| Punctuation n-grams | ', and', '. the' | Sentence structure, list style |
| Cross-word patterns | 's_of_', 'e_the' | Phrase patterns, flow |
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
import numpy as np


def authorship_classification_pipeline():
    """Build an authorship attribution pipeline using character n-grams."""
    # Simulated author samples (in practice, use actual texts)
    authors = {
        'author_a': [
            "The fundamental question we must ask ourselves is whether advancement in technology truly benefits humanity, or if it merely creates new forms of dependency that undermine our autonomy.",
            "I've often pondered the relationship between creativity and constraint. It seems that the greatest artistic achievements emerge not despite limitations, but because of them.",
            "When examining the historical record, we find patterns that repeat across centuries. Human nature, it appears, remains remarkably constant despite technological change.",
        ],
        'author_b': [
            "Look, here's the thing - tech is moving fast and we need to keep up. Not everything new is good, sure, but we can't just ignore progress because it's scary.",
            "So I was thinking about how creativity works, right? And it hit me - you need limits! Without them, you're just staring at a blank page forever.",
            "History repeats itself, they say. And yeah, looking back, it's kind of true. Different tools, same basic human behaviors. Makes you think.",
        ],
        'author_c': [
            "TECHNOLOGICAL ADVANCEMENT: A CRITICAL ANALYSIS. Section 1. Introduction. The question of technology's impact on humanity requires systematic examination.",
            "Creativity and constraint exhibit an inverse relationship initially. However, optimal creative output occurs at moderate constraint levels.",
            "Historical analysis reveals cyclical patterns in human behavior. Technological context changes; fundamental behavioral patterns persist.",
        ],
    }

    # Prepare data
    texts = []
    labels = []
    for author, samples in authors.items():
        texts.extend(samples)
        labels.extend([author] * len(samples))

    # Character n-gram pipeline
    char_pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(
            analyzer='char_wb',  # Character n-grams with word boundaries
            ngram_range=(3, 5),
            min_df=1,
            max_features=5000,
            sublinear_tf=True,
        )),
        ('classifier', LinearSVC(random_state=42, C=1.0))
    ])

    # Word n-gram pipeline for comparison
    word_pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(
            analyzer='word',
            ngram_range=(1, 2),
            min_df=1,
            max_features=5000,
            sublinear_tf=True,
        )),
        ('classifier', LinearSVC(random_state=42, C=1.0))
    ])

    print("="*60)
    print("AUTHORSHIP ATTRIBUTION")
    print("="*60)

    # Fit both pipelines
    char_pipeline.fit(texts, labels)
    word_pipeline.fit(texts, labels)

    # Check vocabulary characteristics
    char_vocab = char_pipeline.named_steps['vectorizer'].vocabulary_
    word_vocab = word_pipeline.named_steps['vectorizer'].vocabulary_
    print(f"\nCharacter n-gram vocabulary size: {len(char_vocab)}")
    print(f"Word n-gram vocabulary size: {len(word_vocab)}")

    # Test on held-out samples
    test_samples = [
        "The analysis of systems requires careful consideration of multiple factors and their interdependencies.",
        "Here's what I think - we're overcomplicating things. Just keep it simple and move forward!",
        "METHODOLOGY SECTION. Systematic analysis follows established protocols. Results indicate consistent patterns.",
    ]

    print("\nPredictions on test samples:")
    print("-"*50)
    for sample in test_samples:
        char_pred = char_pipeline.predict([sample])[0]
        word_pred = word_pipeline.predict([sample])[0]
        print(f"\n'{sample[:60]}...'")
        print(f"  Character n-grams: {char_pred}")
        print(f"  Word n-grams: {word_pred}")

    # Analyze distinctive features per author
    print("\n" + "="*60)
    print("DISTINCTIVE CHARACTER PATTERNS")
    print("="*60)

    vectorizer = char_pipeline.named_steps['vectorizer']
    classifier = char_pipeline.named_steps['classifier']
    feature_names = vectorizer.get_feature_names_out()

    for i, author in enumerate(classifier.classes_):
        # Get coefficients for this author (one-vs-rest)
        coefs = classifier.coef_[i] if len(classifier.classes_) > 2 else classifier.coef_[0]
        # Top positive features for this author
        top_indices = np.argsort(coefs)[-10:]
        print(f"\n{author} distinctive patterns:")
        for idx in reversed(top_indices):
            print(f"  '{feature_names[idx]}' (weight: {coefs[idx]:.3f})")


if __name__ == "__main__":
    authorship_classification_pipeline()
```

Word and character n-grams capture complementary information. Word n-grams encode semantic content and phrases; character n-grams encode morphology and style. Combining them often yields the best results.
Combination Strategies:

- Feature concatenation: extract both feature types and stack them into a single feature matrix (the most common approach, shown below).
- Separate models: train one classifier per feature type and ensemble their predictions.
- Weighted combination: scale one feature block up or down to control its influence on the classifier.
Scikit-learn Implementation:
The FeatureUnion class enables easy feature concatenation:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression


def combined_ngram_pipeline():
    """Build a pipeline combining word and character n-grams."""
    # Sample sentiment dataset
    texts = [
        "This movie was absolutely fantastic and amazing",
        "I loved every moment of this wonderful film",
        "Terrible waste of time completely boring",
        "The worst film I have ever seen awful",
        "Great acting brilliant storyline highly recommend",
        "Disappointing slow and not worth watching",
        "A masterpiece of modern cinema outstanding",
        "Complete garbage do not waste your money",
        "Beautiful cinematography and excellent performances",
        "Boring plot with terrible acting throughout",
    ] * 5
    labels = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0] * 5

    # Feature union combining word and character n-grams
    combined_features = FeatureUnion([
        ('word', TfidfVectorizer(
            analyzer='word',
            ngram_range=(1, 2),
            min_df=2,
            max_features=5000,
            sublinear_tf=True,
        )),
        ('char', TfidfVectorizer(
            analyzer='char_wb',
            ngram_range=(3, 5),
            min_df=2,
            max_features=5000,
            sublinear_tf=True,
        )),
    ])

    # Build pipelines for comparison
    pipelines = {
        'Word only': Pipeline([
            ('vectorizer', TfidfVectorizer(
                analyzer='word',
                ngram_range=(1, 2),
                min_df=2,
                max_features=5000,
            )),
            ('classifier', LogisticRegression(max_iter=1000, random_state=42)),
        ]),
        'Character only': Pipeline([
            ('vectorizer', TfidfVectorizer(
                analyzer='char_wb',
                ngram_range=(3, 5),
                min_df=2,
                max_features=5000,
            )),
            ('classifier', LogisticRegression(max_iter=1000, random_state=42)),
        ]),
        'Combined': Pipeline([
            ('features', combined_features),
            ('classifier', LogisticRegression(max_iter=1000, random_state=42)),
        ]),
    }

    print("="*60)
    print("COMBINED WORD + CHARACTER N-GRAMS")
    print("="*60)

    for name, pipeline in pipelines.items():
        pipeline.fit(texts, labels)
        train_acc = pipeline.score(texts, labels)

        # Get the feature count
        if name == 'Combined':
            word_feat = pipeline.named_steps['features'].transformer_list[0][1]
            char_feat = pipeline.named_steps['features'].transformer_list[1][1]
            n_features = len(word_feat.vocabulary_) + len(char_feat.vocabulary_)
        else:
            n_features = len(pipeline.named_steps['vectorizer'].vocabulary_)

        print(f"\n{name}:")
        print(f"  Features: {n_features}")
        print(f"  Training accuracy: {train_acc:.2%}")

    # Test with noisy input
    print("\n" + "="*60)
    print("ROBUSTNESS TO TYPOS")
    print("="*60)

    clean_test = [
        "This is amazing and wonderful",
        "Terrible and disappointing experience",
    ]
    noisy_test = [
        "Thsi is amzaing and wondurful",         # Typos
        "Terible and disapointing experiance",   # Typos
    ]

    for name, pipeline in pipelines.items():
        clean_preds = pipeline.predict_proba(clean_test)
        noisy_preds = pipeline.predict_proba(noisy_test)
        print(f"\n{name}:")
        for i, (clean, noisy) in enumerate(zip(clean_test, noisy_test)):
            clean_conf = max(clean_preds[i])
            noisy_conf = max(noisy_preds[i])
            diff = abs(clean_conf - noisy_conf)
            print(f"  Sample {i+1}: Clean={clean_conf:.2f}, Noisy={noisy_conf:.2f}, Diff={diff:.3f}")


if __name__ == "__main__":
    combined_ngram_pipeline()
```

When combining word and character n-grams:

1. Use similar vocabulary sizes for each (e.g., 5000 features apiece) to balance their influence.
2. Apply TF-IDF to both; raw counts create scaling issues.
3. Consider feature selection on the combined space to remove redundancy (see the sketch below).
4. Monitor which feature type contributes more to performance—this varies by task.
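One way to act on point (3) is to add a selection step after the FeatureUnion so that redundant word and character features are pruned before classification. This is a hedged sketch using scikit-learn's SelectKBest with the chi-squared test; the k value is an illustrative placeholder, not a recommendation from this page:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

selected_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('word', TfidfVectorizer(analyzer='word', ngram_range=(1, 2), max_features=5000)),
        ('char', TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5), max_features=5000)),
    ])),
    # Keep only the k features most associated with the labels; chi2 requires
    # non-negative inputs, which TF-IDF satisfies. k must not exceed the number
    # of features produced by the union.
    ('select', SelectKBest(chi2, k=2000)),
    ('classifier', LogisticRegression(max_iter=1000)),
])
# Usage: selected_pipeline.fit(texts, labels); selected_pipeline.predict(new_texts)
```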
Character n-grams provide a powerful complement to word-level features, capturing patterns that exist below the word level. We've explored their theory, implementation, and applications:

- Definition and extraction strategies: continuous, space-free, and word-boundary-marked n-grams.
- Robustness to typos and spelling variants, measured through n-gram overlap.
- Language identification with ranked n-gram profiles and the out-of-place distance.
- Authorship attribution from character-level stylistic fingerprints.
- Combining word and character features with FeatureUnion for maximum robustness.
Looking Ahead: N-gram Tradeoffs
We've now covered unigrams, bigrams, general n-grams, and character n-grams. In the final page of this module, we'll synthesize everything by examining the fundamental tradeoffs in n-gram feature engineering: vocabulary size vs. expressiveness, computational cost vs. feature quality, and when different n-gram strategies are most appropriate.
This synthesis will provide a decision framework for choosing the right n-gram configuration for any text classification or NLP task.
You now understand character n-grams as sub-word features that capture morphology, spelling patterns, and stylistic signatures. You can implement character n-gram extraction with various boundary handling strategies, apply them to language identification and authorship attribution, and combine them with word n-grams for maximum robustness. Next, we'll examine the tradeoffs that guide n-gram configuration choices.