Consider these text classification challenges:

- Identifying the language of a short snippet of text.
- Attributing a document to its author from subtle stylistic cues.
- Classifying noisy, typo-ridden user-generated text in which many words never appeared in the training data.
Character n-grams solve all of these problems by operating below the word level. Instead of treating words as atomic units, character n-grams decompose text into sequences of characters, capturing sub-word patterns that reveal morphology, typography, and language-specific characteristics.
This representation is remarkably powerful: character n-grams can achieve state-of-the-art performance on language identification, authorship attribution, and text classification with noisy inputs—all without any linguistic knowledge about the target language.
By the end of this page, you will understand how character n-grams work, when they outperform word n-grams, how to handle the unique implementation challenges they present, and how to combine character and word features for maximum performance. You'll gain practical skills for language identification, typo-robust classification, and authorship analysis.
A character n-gram is a contiguous sequence of n characters from a text string. Unlike word n-grams, which require tokenization to identify word boundaries, character n-grams operate directly on the raw character sequence.
Formal Definition:
Given a text string S = [c₁, c₂, ..., cₖ] of k characters, the set of character n-grams Cₙ(S) is:
Cₙ(S) = {(cᵢ, cᵢ₊₁, ..., cᵢ₊ₙ₋₁) | 1 ≤ i ≤ k - n + 1}
Example:
For the text "hello" with n=3 (trigrams):
| Position | Characters | Trigram |
|---|---|---|
| 1-3 | h, e, l | "hel" |
| 2-4 | e, l, l | "ell" |
| 3-5 | l, l, o | "llo" |
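This definition maps directly onto a sliding window over the string. A minimal sketch in plain Python (the helper name is illustrative) that reproduces the table above:

```python
from typing import List

def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous substrings of length n, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("hello", 3))  # ['hel', 'ell', 'llo']
```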
Note that character n-grams can include: space characters and other whitespace, punctuation marks, digits, and uppercase/lowercase distinctions (when the text is not lowercased).
Whether to include each category is a design decision with significant implications.
A critical decision is whether to include space characters in n-grams. With spaces: "hello world" → ["hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"]. Without spaces (word boundaries): ["hel", "ell", "llo"] + ["wor", "orl", "rld"]. Including spaces captures word boundaries and cross-word patterns; excluding them focuses on within-word structure.
Word-Boundary Aware Character N-grams:
A common variant adds special boundary markers (for example, '#') to each word, so "hello" becomes "#hello#" and its trigrams are "#he", "hel", "ell", "llo", "lo#".
Boundary markers capture:

- Word-initial patterns, i.e. prefixes such as '#un' or '#th'.
- Word-final patterns, i.e. suffixes such as 'ing#' or 'ly#'.
- Short whole words as single features (e.g., '#a#', or '#to#' at higher orders).
This hybrid approach preserves word structure while benefiting from sub-word patterns.
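A minimal sketch of this variant (assuming '#' as the boundary character, matching the fuller extractor later on this page):

```python
from typing import List

def boundary_trigrams(text: str, marker: str = "#") -> List[str]:
    """Wrap each word in boundary markers, then slide a 3-character window."""
    ngrams: List[str] = []
    for word in text.lower().split():
        padded = f"{marker}{word}{marker}"
        ngrams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return ngrams

print(boundary_trigrams("Hello world"))
# ['#he', 'hel', 'ell', 'llo', 'lo#', '#wo', 'wor', 'orl', 'rld', 'ld#']
```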
Character N-gram Orders:
Typical ranges for different applications:
| Order | Name | Best For |
|---|---|---|
| 1 | Unigrams | Character frequency analysis |
| 2-3 | Bi/Trigrams | Common patterns, morphology |
| 3-5 | Tri to 5-grams | Language ID, authorship |
| 5-7 | Higher order | Long patterns, specific phrases |
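As a quick illustration of how order affects the feature space, here is a small sketch (plain Python, illustrative sample text) counting the distinct character n-grams of a single sentence at each order:

```python
text = "the quick brown fox jumps over the lazy dog"

for n in range(1, 8):
    # Collect the distinct n-grams produced by a sliding window of width n
    ngrams = {text[i:i + n] for i in range(len(text) - n + 1)}
    print(f"n={n}: {len(ngrams)} distinct n-grams from {len(text) - n + 1} positions")
```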
Character n-grams capture linguistic patterns that word-level analysis misses. Understanding these patterns explains their power across diverse applications.
1. Morphological Structure:
Languages encode meaning through word structure: prefixes (un-, re-, dis-), suffixes (-ing, -tion, -ly), and inflections that mark tense, number, and case.
Character n-grams capture these patterns without explicit morphological analysis.
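To see this concretely, here is a minimal sketch (hypothetical helper, using '#' boundary markers as above) showing that unrelated verbs share the suffix features 'ing' and 'ng#' without any stemming:

```python
def marked_trigrams(word: str) -> set:
    """Character trigrams of a word wrapped in '#' boundary markers."""
    padded = f"#{word}#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

words = ["running", "jumping", "coding"]
shared = set.intersection(*(marked_trigrams(w) for w in words))
print(sorted(shared))  # ['ing', 'ng#']
```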
2. Letter Frequency Patterns:
Languages have distinctive character distributions: English text is full of 'th', 'the', and 'ing'; German favors 'sch', 'ich', and 'ein'; Spanish shows 'ción', 'll', and 'que'; French has frequent apostrophes and accented vowels.
These distributions enable language identification from character n-grams alone.
3. Spelling Variations:
Similar words share character n-grams: "color" and "colour" share 'col' and 'olo'; "organize" and "organise" differ in only a couple of trigrams; even the typo "amaznig" keeps most of the trigrams of "amazing".
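A small sketch (illustrative helper name) computing that overlap with the Jaccard measure used later in the robustness demo:

```python
def trigrams(word: str) -> set:
    """Set of character trigrams of a bare word (no boundary markers)."""
    return {word[i:i + 3] for i in range(len(word) - 2)}

a, b = trigrams("color"), trigrams("colour")
print(sorted(a & b))                      # shared trigrams: ['col', 'olo']
print(f"{len(a & b) / len(a | b):.2f}")   # Jaccard similarity: 0.40
```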
Use character n-grams when:

- Text is noisy, with typos or non-standard spelling.
- The task is language identification or authorship attribution.
- Morphology is important (e.g., agglutinative languages).
- Training data is limited (a smaller vocabulary generalizes better).
- You are processing user-generated content (social media, reviews).
Efficient character n-gram extraction requires attention to preprocessing decisions and boundary handling.
```python
from typing import List, Dict, Optional
from collections import Counter
import re

from scipy.sparse import csr_matrix


class CharacterNgramExtractor:
    """
    Comprehensive character n-gram extraction with multiple strategies.
    """

    def __init__(
        self,
        min_n: int = 3,
        max_n: int = 5,
        lowercase: bool = True,
        include_spaces: bool = True,
        word_boundaries: bool = True,
        boundary_char: str = '#',
        strip_punctuation: bool = False,
    ):
        self.min_n = min_n
        self.max_n = max_n
        self.lowercase = lowercase
        self.include_spaces = include_spaces
        self.word_boundaries = word_boundaries
        self.boundary_char = boundary_char
        self.strip_punctuation = strip_punctuation

    def preprocess(self, text: str) -> str:
        """Apply preprocessing steps."""
        if self.lowercase:
            text = text.lower()
        if self.strip_punctuation:
            text = re.sub(r'[^\w\s]', '', text)
        return text

    def extract_from_string(self, text: str, n: int) -> List[str]:
        """Extract character n-grams from a continuous string."""
        if len(text) < n:
            return []
        return [text[i:i+n] for i in range(len(text) - n + 1)]

    def extract_with_word_boundaries(self, text: str, n: int) -> List[str]:
        """
        Extract character n-grams with word boundary markers.

        Each word is wrapped: "hello" → "#hello#"
        """
        words = text.split()
        ngrams = []
        for word in words:
            marked_word = f"{self.boundary_char}{word}{self.boundary_char}"
            ngrams.extend(self.extract_from_string(marked_word, n))
        return ngrams

    def extract(self, text: str) -> List[str]:
        """Extract all character n-grams with the configured settings."""
        text = self.preprocess(text)
        ngrams = []
        for n in range(self.min_n, self.max_n + 1):
            if self.word_boundaries:
                ngrams.extend(self.extract_with_word_boundaries(text, n))
            else:
                if not self.include_spaces:
                    # Extract from each word separately
                    for word in text.split():
                        ngrams.extend(self.extract_from_string(word, n))
                else:
                    # Extract from the continuous string
                    ngrams.extend(self.extract_from_string(text, n))
        return ngrams


class CharacterNgramVectorizer:
    """
    Complete character n-gram vectorizer with vocabulary management.
    """

    def __init__(
        self,
        min_n: int = 3,
        max_n: int = 5,
        max_vocab_size: Optional[int] = None,
        min_df: int = 1,
        lowercase: bool = True,
        word_boundaries: bool = True,
    ):
        self.extractor = CharacterNgramExtractor(
            min_n=min_n,
            max_n=max_n,
            lowercase=lowercase,
            word_boundaries=word_boundaries,
        )
        self.max_vocab_size = max_vocab_size
        self.min_df = min_df
        self.vocabulary_: Dict[str, int] = {}
        self.term_freq_: Counter = Counter()
        self.doc_freq_: Counter = Counter()

    def fit(self, documents: List[str]) -> 'CharacterNgramVectorizer':
        """Build vocabulary from the corpus."""
        for doc in documents:
            ngrams = self.extractor.extract(doc)
            self.term_freq_.update(ngrams)
            self.doc_freq_.update(set(ngrams))

        # Filter by document frequency
        candidates = [
            ng for ng, freq in self.doc_freq_.items()
            if freq >= self.min_df
        ]

        # Sort by corpus frequency
        sorted_ngrams = sorted(
            candidates,
            key=lambda ng: self.term_freq_[ng],
            reverse=True
        )

        # Limit vocabulary size
        if self.max_vocab_size:
            sorted_ngrams = sorted_ngrams[:self.max_vocab_size]

        self.vocabulary_ = {
            ng: idx for idx, ng in enumerate(sorted_ngrams)
        }
        return self

    def transform(self, documents: List[str]) -> csr_matrix:
        """Transform documents to a sparse count matrix."""
        rows, cols, data = [], [], []
        for doc_idx, doc in enumerate(documents):
            ngrams = self.extractor.extract(doc)
            ngram_counts = Counter(ngrams)
            for ngram, count in ngram_counts.items():
                if ngram in self.vocabulary_:
                    rows.append(doc_idx)
                    cols.append(self.vocabulary_[ngram])
                    data.append(count)
        return csr_matrix(
            (data, (rows, cols)),
            shape=(len(documents), len(self.vocabulary_))
        )

    def analyze_vocabulary(self) -> Dict:
        """Analyze vocabulary composition."""
        by_length = Counter()
        for ngram in self.vocabulary_:
            by_length[len(ngram)] += 1
        return {
            'total_vocabulary': len(self.vocabulary_),
            'by_length': dict(by_length),
            'top_ngrams': self.term_freq_.most_common(20),
        }


def demonstrate_character_ngrams():
    """Demonstrate character n-gram extraction."""
    print("="*60)
    print("CHARACTER N-GRAM DEMONSTRATION")
    print("="*60)

    text = "Hello World"

    # Different extraction strategies
    strategies = [
        ("Continuous (with spaces)", CharacterNgramExtractor(
            min_n=3, max_n=3, word_boundaries=False, include_spaces=True
        )),
        ("Continuous (no spaces)", CharacterNgramExtractor(
            min_n=3, max_n=3, word_boundaries=False, include_spaces=False
        )),
        ("Word boundaries", CharacterNgramExtractor(
            min_n=3, max_n=3, word_boundaries=True
        )),
    ]

    print(f"\nText: '{text}'")
    for name, extractor in strategies:
        ngrams = extractor.extract(text)
        print(f"\n{name}:")
        print(f"  N-grams: {ngrams}")
        print(f"  Count: {len(ngrams)}")

    # Typo tolerance demonstration
    print("\n" + "="*60)
    print("TYPO TOLERANCE DEMONSTRATION")
    print("="*60)

    correct = "amazing"
    typos = ["amaznig", "amzing", "amazzing", "amaizing"]

    extractor = CharacterNgramExtractor(
        min_n=3, max_n=3, word_boundaries=True
    )
    correct_ngrams = set(extractor.extract(correct))

    print(f"\nCorrect: '{correct}'")
    print(f"  Trigrams: {sorted(correct_ngrams)}")

    for typo in typos:
        typo_ngrams = set(extractor.extract(typo))
        overlap = len(correct_ngrams & typo_ngrams)
        total = len(correct_ngrams | typo_ngrams)
        jaccard = overlap / total if total > 0 else 0
        print(f"\nTypo: '{typo}'")
        print(f"  Trigrams: {sorted(typo_ngrams)}")
        print(f"  Overlap: {overlap}/{len(correct_ngrams)} ({jaccard:.1%} Jaccard)")


def typo_robustness_analysis():
    """Analyze how character n-grams handle various types of errors."""
    print("\n" + "="*60)
    print("ERROR ROBUSTNESS ANALYSIS")
    print("="*60)

    extractor = CharacterNgramExtractor(
        min_n=3, max_n=5, word_boundaries=True
    )

    # Test cases with different error types
    test_cases = [
        ("Transposition", "the", "teh"),
        ("Insertion", "hello", "helllo"),
        ("Deletion", "friend", "frend"),
        ("Substitution", "color", "colour"),
        ("Multiple errors", "beautiful", "beautful"),
    ]

    for error_type, correct, error in test_cases:
        correct_ng = set(extractor.extract(correct))
        error_ng = set(extractor.extract(error))
        overlap = len(correct_ng & error_ng)
        union = len(correct_ng | error_ng)
        jaccard = overlap / union if union > 0 else 0
        print(f"\n{error_type}: '{correct}' → '{error}'")
        print(f"  Preservation: {jaccard:.1%}")


if __name__ == "__main__":
    demonstrate_character_ngrams()
    typo_robustness_analysis()
```

Language identification is the canonical success story for character n-grams. Even with simple 3- to 5-character n-grams, classifiers can achieve over 99% accuracy at distinguishing languages, often from just a few words of text.
Why Character N-grams Excel:

- Every language draws on a small, closed character set, so the feature space stays compact even with limited training text.
- Character sequences are highly distinctive across languages (articles, common suffixes, diacritics).
- No tokenizer, dictionary, or other linguistic resources are required, so the same method works for any language.
- Reliable signal emerges from very short inputs.

Typical Approach:

1. Build a ranked character n-gram frequency profile for each language from training text.
2. Build the same kind of profile for the text to be identified.
3. Compare the test profile against every language profile and predict the closest match.

Simple Distance-Based Method:

The "out-of-place" measure (from the TextCat algorithm of Cavnar & Trenkle, 1994) ranks n-grams by frequency in each language profile. For test text, each n-gram's rank in the test profile is compared with its rank in the language profile; the total distance is the sum of these rank differences, with a fixed maximum penalty for n-grams missing from the language profile. The language with the smallest total distance is predicted.
```python
from collections import Counter
from typing import List, Dict, Tuple


class CharNgramLanguageIdentifier:
    """
    Language identification using character n-gram profiles.

    Based on the TextCat algorithm (Cavnar & Trenkle, 1994).
    """

    def __init__(
        self,
        n_range: Tuple[int, int] = (3, 5),
        profile_size: int = 300,
    ):
        self.n_range = n_range
        self.profile_size = profile_size
        # Language → ranked n-gram list
        self.profiles: Dict[str, List[str]] = {}

    def _extract_ngrams(self, text: str) -> List[str]:
        """Extract character n-grams with word boundary padding."""
        text = text.lower()
        ngrams = []
        for n in range(self.n_range[0], self.n_range[1] + 1):
            # Add word boundaries
            words = text.split()
            for word in words:
                padded = f"_{word}_"
                ngrams.extend(
                    padded[i:i+n] for i in range(len(padded) - n + 1)
                )
        return ngrams

    def _build_profile(self, text: str) -> List[str]:
        """Build a ranked n-gram profile from text."""
        ngrams = self._extract_ngrams(text)
        counter = Counter(ngrams)
        # Return the top n-grams by frequency
        return [ng for ng, _ in counter.most_common(self.profile_size)]

    def fit(self, language_texts: Dict[str, str]) -> 'CharNgramLanguageIdentifier':
        """
        Train on texts for each language.

        Args:
            language_texts: Dict mapping language code to training text
        """
        for language, text in language_texts.items():
            self.profiles[language] = self._build_profile(text)
        return self

    def _out_of_place_distance(
        self,
        test_profile: List[str],
        language_profile: List[str]
    ) -> int:
        """
        Compute the out-of-place distance between profiles.

        For each n-gram in the test profile, find its position in the
        language profile; the sum of position differences is the distance.
        """
        lang_positions = {
            ng: pos for pos, ng in enumerate(language_profile)
        }
        distance = 0
        max_penalty = self.profile_size
        for test_pos, ng in enumerate(test_profile):
            if ng in lang_positions:
                distance += abs(test_pos - lang_positions[ng])
            else:
                # N-gram not in the language profile
                distance += max_penalty
        return distance

    def predict(self, text: str) -> str:
        """Identify the language of a text."""
        test_profile = self._build_profile(text)
        distances = {}
        for language, lang_profile in self.profiles.items():
            distances[language] = self._out_of_place_distance(
                test_profile, lang_profile
            )
        return min(distances, key=distances.get)

    def predict_with_scores(self, text: str) -> Dict[str, float]:
        """Return all language scores (higher = better match)."""
        test_profile = self._build_profile(text)
        distances = {}
        for language, lang_profile in self.profiles.items():
            distances[language] = self._out_of_place_distance(
                test_profile, lang_profile
            )
        # Convert distances to similarities (0-1, higher = better)
        max_dist = max(distances.values())
        similarities = {
            lang: 1 - (dist / max_dist)
            for lang, dist in distances.items()
        }
        return similarities


def demonstrate_language_identification():
    """Demonstrate character n-gram language identification."""
    print("="*60)
    print("LANGUAGE IDENTIFICATION")
    print("="*60)

    # Training data (simplified - real systems use much more)
    training_data = {
        'english': """
            The quick brown fox jumps over the lazy dog.
            Machine learning is transforming how we build software.
            Natural language processing enables text understanding.
            The weather today is quite pleasant and sunny.
            Reading books is a wonderful way to learn new things.
        """,
        'spanish': """
            El rápido zorro marrón salta sobre el perro perezoso.
            El aprendizaje automático está transformando el software.
            El procesamiento del lenguaje natural permite comprender texto.
            El clima hoy es bastante agradable y soleado.
            Leer libros es una forma maravillosa de aprender.
        """,
        'german': """
            Der schnelle braune Fuchs springt über den faulen Hund.
            Maschinelles Lernen verändert die Softwareentwicklung.
            Die Verarbeitung natürlicher Sprache ermöglicht Textverständnis.
            Das Wetter heute ist ziemlich angenehm und sonnig.
            Bücher lesen ist eine wunderbare Art zu lernen.
        """,
        'french': """
            Le rapide renard brun saute par-dessus le chien paresseux.
            L'apprentissage automatique transforme le développement logiciel.
            Le traitement du langage naturel permet la compréhension du texte.
            Le temps aujourd'hui est assez agréable et ensoleillé.
            Lire des livres est une merveilleuse façon d'apprendre.
        """,
    }

    # Train the identifier
    identifier = CharNgramLanguageIdentifier(
        n_range=(2, 4),
        profile_size=200
    )
    identifier.fit(training_data)

    # Test samples
    test_samples = [
        "This is a simple English sentence.",
        "Esta es una oración simple en español.",
        "Dies ist ein einfacher deutscher Satz.",
        "Ceci est une phrase simple en français.",
        "The machine learning model works well.",
        "El modelo de aprendizaje funciona bien.",
    ]

    print("\nLanguage Identification Results:")
    print("-"*50)
    for sample in test_samples:
        prediction = identifier.predict(sample)
        scores = identifier.predict_with_scores(sample)
        print(f"\n'{sample[:40]}...'")
        print(f"  Predicted: {prediction}")
        print(f"  Scores: {', '.join(f'{l}: {s:.2f}' for l, s in sorted(scores.items(), key=lambda x: -x[1]))}")


if __name__ == "__main__":
    demonstrate_language_identification()
```

Authorship attribution—identifying who wrote a text—is another domain where character n-grams excel. Authors have distinctive stylistic fingerprints that emerge at the character level, including preferences for certain word endings, punctuation patterns, and letter combinations.
Why Character N-grams Work for Authorship: authors leave low-level fingerprints that are hard to disguise, including preferred suffixes and word endings, punctuation and capitalization habits, and characteristic function-word sequences. Character n-grams pick up all of these without explicit stylometric feature engineering.
Research Findings:
| Feature Type | Example | What It Captures |
|---|---|---|
| Word-ending n-grams | 'ing#', 'tion#', 'ly#' | Suffix preferences, part-of-speech tendencies |
| Word-beginning n-grams | '#the', '#th', '#wh' | Function word usage, question patterns |
| Space patterns | ' the ', ' a ' | Article and preposition frequency |
| Punctuation n-grams | ', and', '. the' | Sentence structure, list style |
| Cross-word patterns | 's_of_', 'e_the' | Phrase patterns, flow |
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
import numpy as np


def authorship_classification_pipeline():
    """Build an authorship attribution pipeline using character n-grams."""
    # Simulated author samples (in practice, use actual texts)
    authors = {
        'author_a': [
            "The fundamental question we must ask ourselves is whether advancement in technology truly benefits humanity, or if it merely creates new forms of dependency that undermine our autonomy.",
            "I've often pondered the relationship between creativity and constraint. It seems that the greatest artistic achievements emerge not despite limitations, but because of them.",
            "When examining the historical record, we find patterns that repeat across centuries. Human nature, it appears, remains remarkably constant despite technological change.",
        ],
        'author_b': [
            "Look, here's the thing - tech is moving fast and we need to keep up. Not everything new is good, sure, but we can't just ignore progress because it's scary.",
            "So I was thinking about how creativity works, right? And it hit me - you need limits! Without them, you're just staring at a blank page forever.",
            "History repeats itself, they say. And yeah, looking back, it's kind of true. Different tools, same basic human behaviors. Makes you think.",
        ],
        'author_c': [
            "TECHNOLOGICAL ADVANCEMENT: A CRITICAL ANALYSIS. Section 1. Introduction. The question of technology's impact on humanity requires systematic examination.",
            "Creativity and constraint exhibit an inverse relationship initially. However, optimal creative output occurs at moderate constraint levels.",
            "Historical analysis reveals cyclical patterns in human behavior. Technological context changes; fundamental behavioral patterns persist.",
        ],
    }

    # Prepare data
    texts = []
    labels = []
    for author, samples in authors.items():
        texts.extend(samples)
        labels.extend([author] * len(samples))

    # Character n-gram pipeline
    char_pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(
            analyzer='char_wb',  # Character n-grams with word boundaries
            ngram_range=(3, 5),
            min_df=1,
            max_features=5000,
            sublinear_tf=True,
        )),
        ('classifier', LinearSVC(random_state=42, C=1.0))
    ])

    # Word n-gram pipeline for comparison
    word_pipeline = Pipeline([
        ('vectorizer', TfidfVectorizer(
            analyzer='word',
            ngram_range=(1, 2),
            min_df=1,
            max_features=5000,
            sublinear_tf=True,
        )),
        ('classifier', LinearSVC(random_state=42, C=1.0))
    ])

    print("="*60)
    print("AUTHORSHIP ATTRIBUTION")
    print("="*60)

    # Fit both pipelines
    char_pipeline.fit(texts, labels)
    word_pipeline.fit(texts, labels)

    # Check vocabulary characteristics
    char_vocab = char_pipeline.named_steps['vectorizer'].vocabulary_
    word_vocab = word_pipeline.named_steps['vectorizer'].vocabulary_
    print(f"\nCharacter n-gram vocabulary size: {len(char_vocab)}")
    print(f"Word n-gram vocabulary size: {len(word_vocab)}")

    # Test on held-out samples
    test_samples = [
        "The analysis of systems requires careful consideration of multiple factors and their interdependencies.",
        "Here's what I think - we're overcomplicating things. Just keep it simple and move forward!",
        "METHODOLOGY SECTION. Systematic analysis follows established protocols. Results indicate consistent patterns.",
    ]

    print("\nPredictions on test samples:")
    print("-"*50)
    for sample in test_samples:
        char_pred = char_pipeline.predict([sample])[0]
        word_pred = word_pipeline.predict([sample])[0]
        print(f"\n'{sample[:60]}...'")
        print(f"  Character n-grams: {char_pred}")
        print(f"  Word n-grams: {word_pred}")

    # Analyze distinctive features per author
    print("\n" + "="*60)
    print("DISTINCTIVE CHARACTER PATTERNS")
    print("="*60)

    vectorizer = char_pipeline.named_steps['vectorizer']
    classifier = char_pipeline.named_steps['classifier']
    feature_names = vectorizer.get_feature_names_out()

    for i, author in enumerate(classifier.classes_):
        # Get coefficients for this author (one-vs-rest)
        coefs = classifier.coef_[i] if len(classifier.classes_) > 2 else classifier.coef_[0]
        # Top positive features for this author
        top_indices = np.argsort(coefs)[-10:]
        print(f"\n{author} distinctive patterns:")
        for idx in reversed(top_indices):
            print(f"  '{feature_names[idx]}' (weight: {coefs[idx]:.3f})")


if __name__ == "__main__":
    authorship_classification_pipeline()
```

Word and character n-grams capture complementary information. Word n-grams encode semantic content and phrases; character n-grams encode morphology and style. Combining them often yields the best results.
Combination Strategies:

- Feature concatenation: extract both feature types and stack them into a single feature matrix (the most common approach, shown below).
- Separate models: train one classifier per feature type and ensemble their predictions.
- Weighted combination: scale one feature block up or down to control its influence on the classifier.
Scikit-learn Implementation:
The FeatureUnion class enables easy feature concatenation:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegression


def combined_ngram_pipeline():
    """Build a pipeline combining word and character n-grams."""
    # Sample sentiment dataset
    texts = [
        "This movie was absolutely fantastic and amazing",
        "I loved every moment of this wonderful film",
        "Terrible waste of time completely boring",
        "The worst film I have ever seen awful",
        "Great acting brilliant storyline highly recommend",
        "Disappointing slow and not worth watching",
        "A masterpiece of modern cinema outstanding",
        "Complete garbage do not waste your money",
        "Beautiful cinematography and excellent performances",
        "Boring plot with terrible acting throughout",
    ] * 5
    labels = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0] * 5

    # Feature union combining word and character n-grams
    combined_features = FeatureUnion([
        ('word', TfidfVectorizer(
            analyzer='word',
            ngram_range=(1, 2),
            min_df=2,
            max_features=5000,
            sublinear_tf=True,
        )),
        ('char', TfidfVectorizer(
            analyzer='char_wb',
            ngram_range=(3, 5),
            min_df=2,
            max_features=5000,
            sublinear_tf=True,
        )),
    ])

    # Build pipelines for comparison
    pipelines = {
        'Word only': Pipeline([
            ('vectorizer', TfidfVectorizer(
                analyzer='word',
                ngram_range=(1, 2),
                min_df=2,
                max_features=5000,
            )),
            ('classifier', LogisticRegression(max_iter=1000, random_state=42)),
        ]),
        'Character only': Pipeline([
            ('vectorizer', TfidfVectorizer(
                analyzer='char_wb',
                ngram_range=(3, 5),
                min_df=2,
                max_features=5000,
            )),
            ('classifier', LogisticRegression(max_iter=1000, random_state=42)),
        ]),
        'Combined': Pipeline([
            ('features', combined_features),
            ('classifier', LogisticRegression(max_iter=1000, random_state=42)),
        ]),
    }

    print("="*60)
    print("COMBINED WORD + CHARACTER N-GRAMS")
    print("="*60)

    for name, pipeline in pipelines.items():
        pipeline.fit(texts, labels)
        train_acc = pipeline.score(texts, labels)

        # Get the feature count
        if name == 'Combined':
            word_feat = pipeline.named_steps['features'].transformer_list[0][1]
            char_feat = pipeline.named_steps['features'].transformer_list[1][1]
            n_features = len(word_feat.vocabulary_) + len(char_feat.vocabulary_)
        else:
            n_features = len(pipeline.named_steps['vectorizer'].vocabulary_)

        print(f"\n{name}:")
        print(f"  Features: {n_features}")
        print(f"  Training accuracy: {train_acc:.2%}")

    # Test with noisy input
    print("\n" + "="*60)
    print("ROBUSTNESS TO TYPOS")
    print("="*60)

    clean_test = [
        "This is amazing and wonderful",
        "Terrible and disappointing experience",
    ]
    noisy_test = [
        "Thsi is amzaing and wondurful",         # Typos
        "Terible and disapointing experiance",   # Typos
    ]

    for name, pipeline in pipelines.items():
        clean_preds = pipeline.predict_proba(clean_test)
        noisy_preds = pipeline.predict_proba(noisy_test)
        print(f"\n{name}:")
        for i, (clean, noisy) in enumerate(zip(clean_test, noisy_test)):
            clean_conf = max(clean_preds[i])
            noisy_conf = max(noisy_preds[i])
            diff = abs(clean_conf - noisy_conf)
            print(f"  Sample {i+1}: Clean={clean_conf:.2f}, Noisy={noisy_conf:.2f}, Diff={diff:.3f}")


if __name__ == "__main__":
    combined_ngram_pipeline()
```

When combining word and character n-grams:

1. Use similar vocabulary sizes for each (e.g., 5000 features apiece) to balance their influence.
2. Apply TF-IDF to both; raw counts create scaling issues.
3. Consider feature selection on the combined space to remove redundancy (see the sketch below).
4. Monitor which feature type contributes more to performance—this varies by task.
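One way to act on point (3) is to add a selection step after the FeatureUnion so that redundant word and character features are pruned before classification. This is a hedged sketch using scikit-learn's SelectKBest with the chi-squared test; the k value is an illustrative placeholder, not a recommendation from this page:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

selected_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('word', TfidfVectorizer(analyzer='word', ngram_range=(1, 2), max_features=5000)),
        ('char', TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5), max_features=5000)),
    ])),
    # Keep only the k features most associated with the labels; chi2 requires
    # non-negative inputs, which TF-IDF satisfies. k must not exceed the number
    # of features produced by the union.
    ('select', SelectKBest(chi2, k=2000)),
    ('classifier', LogisticRegression(max_iter=1000)),
])
# Usage: selected_pipeline.fit(texts, labels); selected_pipeline.predict(new_texts)
```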
Character n-grams provide a powerful complement to word-level features, capturing patterns that exist below the word level. We've explored their theory, implementation, and applications:

- Definition and extraction strategies: continuous, space-free, and word-boundary-marked n-grams.
- Robustness to typos and spelling variants, measured through n-gram overlap.
- Language identification with ranked n-gram profiles and the out-of-place distance.
- Authorship attribution from character-level stylistic fingerprints.
- Combining word and character features with FeatureUnion for maximum robustness.
Looking Ahead: N-gram Tradeoffs
We've now covered unigrams, bigrams, general n-grams, and character n-grams. In the final page of this module, we'll synthesize everything by examining the fundamental tradeoffs in n-gram feature engineering: vocabulary size vs. expressiveness, computational cost vs. feature quality, and when different n-gram strategies are most appropriate.
This synthesis will provide a decision framework for choosing the right n-gram configuration for any text classification or NLP task.
You now understand character n-grams as sub-word features that capture morphology, spelling patterns, and stylistic signatures. You can implement character n-gram extraction with various boundary handling strategies, apply them to language identification and authorship attribution, and combine them with word n-grams for maximum robustness. Next, we'll examine the tradeoffs that guide n-gram configuration choices.