Consider a search query for "running shoes." Should documents containing "run," "runs," "runner," and "ran" be considered relevant? In most cases, yes—these are all morphological variants of the same underlying concept.
This is the problem stemming addresses: reducing inflected or derived words to their common base or root form, called the stem. By normalizing morphological variants, we shrink the vocabulary, let a query match every surface form of a term, and collapse related words into a single feature for downstream models.
Stemming algorithms accomplish this through rule-based heuristics that strip suffixes according to language-specific patterns. They're fast, deterministic, and require no training data—but they're also approximate, sometimes producing stems that aren't real words and occasionally conflating unrelated terms.
By the end of this page, you will understand the theoretical foundations of stemming, implement and apply the Porter and Snowball algorithms, recognize stemming errors (over-stemming and under-stemming), and make informed decisions about when stemming benefits your NLP pipeline.
Stemming is a form of word normalization that removes morphological affixes (primarily suffixes in English) to reduce words to a common form. Unlike lemmatization (which we'll cover next), stemming doesn't aim to produce linguistically valid words—it aims to produce consistent representations.
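As a quick illustration, here is a minimal sketch (assuming NLTK is installed) of suffix stripping with the Porter stemmer:

```python
# Minimal sketch: suffix stripping with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["run", "runs", "running", "ran", "runner"]:
    print(f"{word:10} -> {stemmer.stem(word)}")

# "runs" and "running" collapse to "run"; the irregular past tense "ran"
# is left untouched because no suffix rule applies to it.
```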
Key Terminology:
- Stem: the common base form a stemmer produces (e.g., 'connect' for 'connection'); it need not be a real word.
- Affix / suffix: material attached to a base ('run' + '-ing'); English stemmers mostly strip suffixes.
- Conflation: mapping several surface forms to one representation.
- Over-stemming: unrelated words reduced to the same stem.
- Under-stemming: related words left with different stems.
Why Not Just Use a Dictionary?
A naive approach might be to build a lookup table mapping every word to its root. This fails because the vocabulary is open-ended (new words and coinages appear constantly), derivational morphology is productive enough that the table is never complete, and any out-of-vocabulary word is simply missed. Rule-based stemmers sidestep these problems by operating on word shape rather than word identity.
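One practical consequence, sketched below (the coinages used are just illustrative choices), is that a rule-based stemmer still behaves sensibly on words no dictionary contains:

```python
# Rule-based stemming needs no lookup table, so novel coinages that would
# never appear in a precompiled dictionary still get normalized.
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["doomscrolling", "doomscrolled", "doomscrolls"]:
    print(f"{word:15} -> {stemmer.stem(word)}")

# The -ing/-ed/-s rules fire regardless of whether the word is "known",
# so all three forms map to the same stem.
```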
| Original | Porter Stem | Snowball Stem | Shared Root? |
|---|---|---|---|
| running, runs, ran, run | run, run, ran, run | run, run, ran, run | Yes (mostly) |
| connection, connected, connecting | connect, connect, connect | connect, connect, connect | Yes |
| happiness, happily, happy | happi, happili, happi | happi, happili, happi | Mostly ('happily' stems differently) |
| university, universal, universe | univers, univers, univers | univers, univers, univers | Conflated! (differ semantically) |
| operate, operational, opera | oper, oper, opera | oper, oper, opera | Partly ('opera' keeps its final 'a' and escapes conflation) |
Stems are often not valid English words. 'happi' for 'happy', 'relat' for 'related', 'studi' for 'studies'—these are computational artifacts, not linguistic entities. This is acceptable for many NLP tasks where we only need consistency, but problematic when human readability matters.
The Porter Stemmer, published by Martin Porter in 1980, is the most widely used English stemming algorithm. It's been cited thousands of times and forms the basis for most subsequent stemmers.
Algorithm Overview:
Porter stemming operates through a sequence of five phases, each containing conditional suffix-stripping rules. The rules are applied in order, and many of them fire only when a quantity called the measure, m, of the remaining stem is large enough.
The Measure (m):
Porter defines words as alternating sequences of consonant clusters (C) and vowel clusters (V), so every word can be written as [C](VC)^m[V]; the exponent m is the measure.
Examples (from Porter's 1980 paper):
- m = 0: TR, EE, TREE, BY
- m = 1: TROUBLE, OATS, TREES, IVY
- m = 2: TROUBLES, PRIVATE, OATEN
"""Porter Stemming Algorithm - Educational Implementation This implementation is for learning purposes. Production systemsshould use NLTK's optimized implementation."""import refrom typing import Tuple, List class PorterStemmerEducational: """ Educational implementation of the Porter Stemmer. Demonstrates the algorithm's structure with detailed comments. """ def __init__(self): # Vowels (y is sometimes a vowel) self.vowels = set('aeiou') def is_consonant(self, word: str, i: int) -> bool: """Check if character at position i is a consonant.""" c = word[i] if c in self.vowels: return False if c == 'y': # y is consonant only at word start or after vowel if i == 0: return True return not self.is_consonant(word, i - 1) return True def measure(self, stem: str) -> int: """ Calculate the measure m of a stem. m is the number of VC (vowel-consonant) sequences. Example: TR -> m=0 (CC) TREE -> m=0 (CCVV) TROUBLE -> m=1 (CCVVCVC) OATS -> m=1 (VCCC) TREES -> m=1 (CCVVCC) IVY -> m=1 (VCCC) TROUBLES -> m=2 (CCVVCVCC) PRIVATE -> m=2 (CCVCVCV) """ cv_pattern = [] for i, char in enumerate(stem): if self.is_consonant(stem, i): # Avoid consecutive C markers if not cv_pattern or cv_pattern[-1] != 'C': cv_pattern.append('C') else: if not cv_pattern or cv_pattern[-1] != 'V': cv_pattern.append('V') # Count VC sequences pattern_str = ''.join(cv_pattern) return pattern_str.count('VC') def contains_vowel(self, stem: str) -> bool: """Check if stem contains at least one vowel.""" for i in range(len(stem)): if not self.is_consonant(stem, i): return True return False def ends_double_consonant(self, word: str) -> bool: """Check if word ends with double consonant.""" if len(word) >= 2: if word[-1] == word[-2] and self.is_consonant(word, len(word) - 1): return True return False def ends_cvc(self, word: str) -> bool: """ Check if word ends consonant-vowel-consonant where final consonant is not w, x, or y. 
""" if len(word) >= 3: c1 = self.is_consonant(word, len(word) - 3) v = not self.is_consonant(word, len(word) - 2) c2 = self.is_consonant(word, len(word) - 1) if c1 and v and c2 and word[-1] not in 'wxy': return True return False def step1a(self, word: str) -> str: """Step 1a: Handle plurals and past tenses (SSES, IES, SS, S).""" if word.endswith('sses'): return word[:-2] if word.endswith('ies'): return word[:-2] if word.endswith('ss'): return word if word.endswith('s'): return word[:-1] return word def step1b(self, word: str) -> str: """Step 1b: Handle -eed, -ed, -ing.""" if word.endswith('eed'): stem = word[:-3] if self.measure(stem) > 0: return word[:-1] # Remove just 'd' return word # Handle -ed and -ing for suffix in ['ed', 'ing']: if word.endswith(suffix): stem = word[:-len(suffix)] if self.contains_vowel(stem): word = stem # Additional transformations if word.endswith('at') or word.endswith('bl') or word.endswith('iz'): word += 'e' elif self.ends_double_consonant(word) and word[-1] not in 'lsz': word = word[:-1] elif self.measure(word) == 1 and self.ends_cvc(word): word += 'e' return word return word + suffix # Restore if no vowel in stem return word def step1c(self, word: str) -> str: """Step 1c: Replace suffix y with i if stem contains vowel.""" if word.endswith('y'): stem = word[:-1] if self.contains_vowel(stem): return stem + 'i' return word def step2(self, word: str) -> str: """Step 2: Handle various suffixes (-ational, -fulness, etc.).""" suffixes = [ ('ational', 'ate'), ('tional', 'tion'), ('enci', 'ence'), ('anci', 'ance'), ('izer', 'ize'), ('abli', 'able'), ('alli', 'al'), ('entli', 'ent'), ('eli', 'e'), ('ousli', 'ous'), ('ization', 'ize'), ('ation', 'ate'), ('ator', 'ate'), ('alism', 'al'), ('iveness', 'ive'), ('fulness', 'ful'), ('ousness', 'ous'), ('aliti', 'al'), ('iviti', 'ive'), ('biliti', 'ble'), ] for old, new in suffixes: if word.endswith(old): stem = word[:-len(old)] if self.measure(stem) > 0: return stem + new return word def step3(self, word: str) -> str: """Step 3: Handle suffixes (-icate, -ful, -ness, etc.).""" suffixes = [ ('icate', 'ic'), ('ative', ''), ('alize', 'al'), ('iciti', 'ic'), ('ical', 'ic'), ('ful', ''), ('ness', ''), ] for old, new in suffixes: if word.endswith(old): stem = word[:-len(old)] if self.measure(stem) > 0: return stem + new return word def step4(self, word: str) -> str: """Step 4: Remove various suffixes if m > 1.""" suffixes = [ 'al', 'ance', 'ence', 'er', 'ic', 'able', 'ible', 'ant', 'ement', 'ment', 'ent', 'ou', 'ism', 'ate', 'iti', 'ous', 'ive', 'ize' ] # Special case for -ion if word.endswith('ion'): stem = word[:-3] if self.measure(stem) > 1 and stem[-1] in 'st': return stem for suffix in suffixes: if word.endswith(suffix): stem = word[:-len(suffix)] if self.measure(stem) > 1: return stem return word def step5a(self, word: str) -> str: """Step 5a: Remove final 'e' based on measure.""" if word.endswith('e'): stem = word[:-1] m = self.measure(stem) if m > 1: return stem if m == 1 and not self.ends_cvc(stem): return stem return word def step5b(self, word: str) -> str: """Step 5b: Remove double 'l' if m > 1.""" if self.ends_double_consonant(word) and word[-1] == 'l': if self.measure(word[:-1]) > 1: return word[:-1] return word def stem(self, word: str) -> str: """Apply all stemming steps to a word.""" word = word.lower() # Handle short words if len(word) <= 2: return word word = self.step1a(word) word = self.step1b(word) word = self.step1c(word) word = self.step2(word) word = self.step3(word) word = self.step4(word) word = 
self.step5a(word) word = self.step5b(word) return word # Test the educational implementationstemmer = PorterStemmerEducational() test_words = [ "running", "runs", "ran", "runner", "happiness", "happily", "happy", "connection", "connected", "connecting", "connective", "caresses", "ponies", "cats", "relational", "conditional", "rational", "troubles", "troubling", "troubled",] print("=== Porter Stemmer Educational Demo ===\n")for word in test_words: stem = stemmer.stem(word) print(f" {word:15} → {stem}")The Five Steps in Detail:
Step 1: Handle simple suffixes
Step 2: Double-suffix reduction
Step 3: Further suffix removal
Step 4: Long suffix removal (requires m>1)
Step 5: Final cleanup
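To see the five phases in action, the short trace below reuses the PorterStemmerEducational class from the listing above and prints the word after each step; 'relational' is just an illustrative choice.

```python
# Trace a single word through the phases of the educational Porter
# implementation defined above (assumes PorterStemmerEducational is in scope).
stemmer = PorterStemmerEducational()

word = "relational"
steps = [
    ("step1a", stemmer.step1a), ("step1b", stemmer.step1b),
    ("step1c", stemmer.step1c), ("step2", stemmer.step2),
    ("step3", stemmer.step3), ("step4", stemmer.step4),
    ("step5a", stemmer.step5a), ("step5b", stemmer.step5b),
]

current = word.lower()
print(f"input:        {current}")
for name, step in steps:
    current = step(current)
    print(f"after {name}: {current}")

# Step 2 rewrites the '-ational' suffix to '-ate' (giving 'relate');
# step 5a then drops the trailing 'e', leaving the stem 'relat'.
```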
The Snowball Stemmer (also called Porter2) is Martin Porter's improved version, developed in 2000. It addresses known issues in the original Porter algorithm and provides a specification language for creating stemmers in multiple languages.
Improvements Over Porter: Snowball's English stemmer (Porter2) adds an explicit exception list for irregular forms (e.g., skies → sky, dying → die, lying → lie), refines the handling of words ending in y, adjusts several rules that Porter handled poorly, and is written in the Snowball specification language, which is also used to generate stemmers for other languages.
Languages Supported: Arabic, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Lithuanian, Nepali, Norwegian, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Tamil, Turkish, and Yiddish.
```python
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.tokenize import word_tokenize

# Initialize stemmers
porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Available languages in Snowball
print("Languages supported by Snowball:")
print(SnowballStemmer.languages)

# Compare Porter and Snowball on English
test_cases = [
    # Words where they differ
    "generous", "generously", "generosity",
    "sky", "skies", "skying",
    "dying", "dye", "dyed",
    "falling", "felled", "fallen",
    "cries", "crying", "cried",
    "lying", "lie", "lied",
    # Irregular forms
    "goes", "going", "went", "gone",
    "analysis", "analyses", "analyzing",
]

print("\n=== Porter vs Snowball Comparison ===\n")
print(f"{'Word':<15} {'Porter':<12} {'Snowball':<12} {'Same?':<6}")
print("-" * 47)

differences = []
for word in test_cases:
    porter_stem = porter.stem(word)
    snowball_stem = snowball.stem(word)
    same = "✓" if porter_stem == snowball_stem else "✗"
    print(f"{word:<15} {porter_stem:<12} {snowball_stem:<12} {same:<6}")
    if porter_stem != snowball_stem:
        differences.append((word, porter_stem, snowball_stem))

print(f"\nDifferences found: {len(differences)}")

# Demonstrate multilingual stemming
print("\n=== Multilingual Snowball Stemming ===\n")

multilingual_tests = [
    ("german", ["laufen", "läuft", "gelaufen", "Fußball", "Fußbälle"]),
    ("french", ["manger", "mange", "mangé", "marchant", "marché"]),
    ("spanish", ["correr", "corriendo", "corrido", "hablando", "hablado"]),
    ("russian", ["бегать", "бегу", "бежал", "работать", "работаю"]),
]

for lang, words in multilingual_tests:
    stemmer = SnowballStemmer(lang)
    print(f"\n{lang.title()}:")
    for word in words:
        stem = stemmer.stem(word)
        print(f"  {word:<15} → {stem}")

# Performance and consistency analysis
print("\n=== Consistency Test ===\n")

# Words that should stem to the same root
semantic_groups = [
    ["connect", "connected", "connecting", "connection", "connective"],
    ["happy", "happiness", "happily", "unhappy", "unhappiness"],
    ["organize", "organized", "organizing", "organization", "organizational"],
    ["create", "created", "creating", "creation", "creative", "creativity"],
]

for group in semantic_groups:
    porter_stems = set(porter.stem(w) for w in group)
    snowball_stems = set(snowball.stem(w) for w in group)
    print(f"Words: {group}")
    print(f"  Porter stems:   {porter_stems}")
    print(f"  Snowball stems: {snowball_stems}")
    print(f"  All same stem? Porter={len(porter_stems)==1}, Snowball={len(snowball_stems)==1}")
    print()
```

For new projects, prefer Snowball over Porter for English. It's more accurate on edge cases, actively maintained, and the differences in output are generally improvements. For multilingual applications, Snowball is the only choice—Porter only supports English.
Because stemmers use heuristic rules without linguistic knowledge, they inevitably make errors. Understanding these error types helps you predict when stemming will help versus hurt.
Over-stemming (False Positives):
Over-stemming occurs when unrelated words are reduced to the same stem. This increases false matches in retrieval and conflates distinct concepts in classification.
Under-stemming (False Negatives):
Under-stemming occurs when related words are not reduced to the same stem. This causes false misses in retrieval and fragments feature space in classification.
```python
from nltk.stem import PorterStemmer, SnowballStemmer
from collections import defaultdict
from typing import List, Tuple, Set

porter = PorterStemmer()
snowball = SnowballStemmer("english")


def analyze_conflation(word_groups: List[List[str]], stemmer) -> dict:
    """
    Analyze stemmer behavior on groups of related and unrelated words.

    Args:
        word_groups: List of word groups, where words in each group
                     should ideally stem to the same root
    """
    results = {
        'over_stemming': [],   # Unrelated words → same stem
        'under_stemming': [],  # Related words → different stems
        'correct': []
    }

    # Check within groups for under-stemming
    for group in word_groups:
        stems = set(stemmer.stem(w.lower()) for w in group)
        if len(stems) > 1:
            results['under_stemming'].append({
                'words': group,
                'stems': list(stems),
                'issue': 'Related words have different stems'
            })
        else:
            results['correct'].append({
                'words': group,
                'stem': list(stems)[0]
            })

    return results


def find_over_stemming(words: List[str], stemmer) -> dict:
    """Find cases where unrelated words get the same stem."""
    stem_to_words = defaultdict(list)
    for word in words:
        stem = stemmer.stem(word.lower())
        stem_to_words[stem].append(word)

    # Filter to stems with multiple words (potential over-stemming)
    conflicts = {
        stem: words for stem, words in stem_to_words.items()
        if len(words) > 1
    }
    return conflicts


# Test word groups (words in each group should stem together)
related_groups = [
    ["run", "running", "runs", "runner", "ran"],
    ["happy", "happily", "happiness", "unhappy"],
    ["connect", "connected", "connection", "connective"],
    ["analyze", "analysis", "analyzed", "analytical"],
    ["go", "going", "went", "gone", "goes"],
    ["absorb", "absorbed", "absorption", "absorbing"],
]

# Words that should NOT stem together
unrelated_words = [
    "university", "universal", "universe",   # Different meanings
    "policy", "police", "polish",            # Different domains
    "operate", "opera", "operator",          # opera ≠ operate
    "wander", "wand", "want",                # Unrelated
    "organ", "organize", "organic",          # Body part vs action vs chemistry
]

print("=== Stemming Error Analysis ===\n")

print("1. Under-stemming (related words → different stems):\n")
for group in related_groups:
    porter_stems = set(porter.stem(w) for w in group)
    snowball_stems = set(snowball.stem(w) for w in group)
    if len(porter_stems) > 1 or len(snowball_stems) > 1:
        print(f"  Words: {group}")
        print(f"    Porter:   {porter_stems} - {'Under-stemmed!' if len(porter_stems) > 1 else 'OK'}")
        print(f"    Snowball: {snowball_stems} - {'Under-stemmed!' if len(snowball_stems) > 1 else 'OK'}")
        print()

print("\n2. Over-stemming (unrelated words → same stem):\n")
porter_conflicts = find_over_stemming(unrelated_words, porter)
snowball_conflicts = find_over_stemming(unrelated_words, snowball)

print("Porter over-stemming:")
for stem, words in porter_conflicts.items():
    if len(words) > 1:
        print(f"  '{stem}' ← {words}")

print("\nSnowball over-stemming:")
for stem, words in snowball_conflicts.items():
    if len(words) > 1:
        print(f"  '{stem}' ← {words}")

# Quantify error types
print("\n=== Error Rate Estimation ===")
print("(Based on test vocabulary)\n")

all_words = []
for group in related_groups:
    all_words.extend(group)

# Count under-stemming
under_stem_count = 0
for group in related_groups:
    if len(set(porter.stem(w) for w in group)) > 1:
        under_stem_count += 1

# Count over-stemming
over_stem_count = len([c for c in porter_conflicts.values() if len(c) > 1])

print(f"Word groups tested: {len(related_groups)}")
print(f"Under-stemming errors: {under_stem_count}/{len(related_groups)}")
print(f"Over-stemming conflicts: {over_stem_count} (in {len(unrelated_words)} test words)")
```

In information retrieval, over-stemming increases recall but hurts precision. Under-stemming hurts recall but preserves precision. The optimal aggressiveness depends on your use case: search engines often prefer aggressive stemming, while classification tasks may prefer conservative stemming or no stemming at all.
While Porter and Snowball dominate English stemming, several alternatives exist with different trade-offs.
Lancaster Stemmer (Paice/Husk):
Developed by Chris Paice at Lancaster University. More aggressive than Porter, applying iterative rules until no more apply. Produces shorter stems.
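A quick hedged sketch of that aggressiveness (exact stems depend on your NLTK version; the test words are illustrative):

```python
# Compare Lancaster's aggressive iterative stripping against Porter
# on a handful of words.
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ["maximum", "organization", "happiness", "connectivity"]:
    print(f"{word:15} porter={porter.stem(word):15} lancaster={lancaster.stem(word)}")

# Lancaster keeps applying rules until none fire, so its stems are usually
# shorter than Porter's and more likely to conflate distinct words.
```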
Lovins Stemmer:
One of the earliest stemmers (1968). Uses a single-pass, longest-match approach with a list of 294 endings. Less accurate than Porter.
Regexp Stemmer:
Applies user-defined regular expression rules. Useful for domain-specific stemming.
Hunspell Stemmer:
Used by spell checkers, based on morphological dictionaries. More accurate but slower.
```python
from nltk.stem import (
    PorterStemmer,
    SnowballStemmer,
    LancasterStemmer,
    RegexpStemmer
)
import time

# Initialize stemmers
stemmers = {
    'Porter': PorterStemmer(),
    'Snowball': SnowballStemmer('english'),
    'Lancaster': LancasterStemmer(),
}

# Test vocabulary
test_words = [
    "running", "happiness", "connection", "generosity",
    "maximum", "organization", "argument", "multiply",
    "troubleshooting", "international", "manufacturing",
    "environmental", "considerably", "unfortunately",
]

print("=== Stemmer Comparison ===\n")
print(f"{'Word':<18}", end="")
for name in stemmers:
    print(f"{name:<12}", end="")
print()
print("-" * 54)

for word in test_words:
    print(f"{word:<18}", end="")
    for name, stemmer in stemmers.items():
        stem = stemmer.stem(word)
        print(f"{stem:<12}", end="")
    print()

# Analyze aggressiveness
print("\n=== Aggressiveness Analysis ===\n")

def measure_aggressiveness(stemmer, words):
    """Measure average character reduction."""
    reductions = []
    for word in words:
        stem = stemmer.stem(word)
        reduction = 1 - (len(stem) / len(word))
        reductions.append(reduction)
    return sum(reductions) / len(reductions)

large_vocab = [
    "international", "organizational", "environmental", "computational",
    "understanding", "manufacturing", "troubleshooting", "administrative",
    "representative", "interpretation", "responsibility", "characteristics",
    "unfortunately", "approximately", "significantly", "professionally",
]

for name, stemmer in stemmers.items():
    agg = measure_aggressiveness(stemmer, large_vocab)
    print(f"{name:12} average reduction: {agg:.1%}")

# Performance comparison
print("\n=== Performance Benchmark ===\n")

# Generate test data
vocab = large_vocab * 1000  # 16,000 words

for name, stemmer in stemmers.items():
    start = time.perf_counter()
    for word in vocab:
        stemmer.stem(word)
    elapsed = time.perf_counter() - start
    rate = len(vocab) / elapsed
    print(f"{name:12} {elapsed:.3f}s ({rate:,.0f} words/sec)")

# Custom regex stemmer example
print("\n=== Custom Regex Stemmer ===\n")

# Remove common suffixes
custom = RegexpStemmer('ing$|s$|ed$|er$|est$|ly$|tion$|ness$')

custom_test = ["running", "cats", "jumped", "faster", "quickly", "happiness"]
for word in custom_test:
    print(f"  {word:<12} → {custom.stem(word)}")
```

| Stemmer | Aggressiveness | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| Porter | Medium | Fast | Good | General purpose, widely used baseline |
| Snowball | Medium | Fast | Better | Improved Porter, multilingual support |
| Lancaster | High | Fast | Lower | When aggressive normalization needed |
| Lovins | High | Fast | Low | Historical interest only |
| Regexp | Configurable | Fast | Varies | Custom domain-specific rules |
Stemming is a design choice, not a mandatory preprocessing step. Understanding when it helps—and when it hurts—is essential for building effective NLP systems.
With BERT, GPT, and modern transformers, explicit stemming is rarely beneficial. These models use subword tokenization (BPE, WordPiece) that implicitly captures morphological patterns: related surface forms are frequently split into a shared root piece plus suffix pieces, so the model can link them without any explicit normalization. Applying stemming on top of this can actually harm performance.
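A small sketch of this idea (it assumes the Hugging Face transformers package is installed and can download the bert-base-uncased tokenizer):

```python
# Inspect how a WordPiece tokenizer splits morphological variants.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["run", "running", "troubleshooting", "unhappiness"]:
    print(f"{word:18} -> {tokenizer.tokenize(word)}")

# Frequent forms are often single tokens, while rarer derived forms are split
# into pieces (a root plus '##'-prefixed continuations), so related words end
# up sharing parameters through their shared subword units.
```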
"""Demonstration of stemming impact on different NLP tasks."""from nltk.stem import SnowballStemmerfrom sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizerfrom sklearn.metrics.pairwise import cosine_similarityimport numpy as np stemmer = SnowballStemmer("english") def stem_text(text: str) -> str: """Stem all words in a text.""" words = text.lower().split() return " ".join(stemmer.stem(w) for w in words) # === Impact on Vocabulary Size ===print("=== Impact on Vocabulary Size ===\n") documents = [ "The runner was running quickly through the running track.", "Happiness comes from being happy and creating happiness.", "The organization was organizing an organized event.", "Connecting the connection connected them together.", "She argued the argument argumentatively.",] # Without stemmingvec_no_stem = CountVectorizer()vec_no_stem.fit(documents)vocab_no_stem = len(vec_no_stem.get_feature_names_out()) # With stemmingdocs_stemmed = [stem_text(d) for d in documents]vec_stemmed = CountVectorizer()vec_stemmed.fit(docs_stemmed)vocab_stemmed = len(vec_stemmed.get_feature_names_out()) print(f"Vocabulary without stemming: {vocab_no_stem}")print(f"Vocabulary with stemming: {vocab_stemmed}")print(f"Reduction: {(1 - vocab_stemmed/vocab_no_stem)*100:.1f}%") # === Impact on Document Similarity ===print("\n=== Impact on Document Similarity ===\n") doc_pairs = [ ("The dog runs quickly.", "The dogs were running."), # Related ("Operational efficiency", "Opera performance"), # Conflated by stemming ("The university professor", "Universal laws of physics"), # Conflated by stemming] for doc1, doc2 in doc_pairs: # Without stemming vec = TfidfVectorizer() tfidf = vec.fit_transform([doc1, doc2]) sim_no_stem = cosine_similarity(tfidf[0], tfidf[1])[0, 0] # With stemming vec_s = TfidfVectorizer() tfidf_s = vec_s.fit_transform([stem_text(doc1), stem_text(doc2)]) sim_stemmed = cosine_similarity(tfidf_s[0], tfidf_s[1])[0, 0] print(f"Doc 1: {doc1}") print(f"Doc 2: {doc2}") print(f" Similarity without stemming: {sim_no_stem:.3f}") print(f" Similarity with stemming: {sim_stemmed:.3f}") print(f" Change: {'+' if sim_stemmed > sim_no_stem else ''}{(sim_stemmed - sim_no_stem):.3f}") print() # === Semantic Corruption Examples ===print("=== Semantic Corruption Examples ===\n") corruption_examples = [ ("The opera singer performed beautifully.", "entertainment"), ("The software operates automatically.", "technology"), ("The university admission process", "education"), ("Understanding universal truths", "philosophy"),] print("After stemming, opera/operate and university/universal become identical:")for text, domain in corruption_examples: stemmed = stem_text(text) print(f" [{domain}] {text}") print(f" → {stemmed}") print()Stemming is a powerful but imprecise tool for text normalization. Understanding its mechanics, limitations, and appropriate use cases is essential for making informed preprocessing decisions.
What's Next:
Stemming is fast and simple but linguistically crude. Lemmatization offers a more sophisticated alternative, using vocabulary lookups and morphological analysis to reduce words to their dictionary forms (lemmas). We'll explore when lemmatization's higher accuracy justifies its computational cost.
You now understand stemming from algorithmic internals to practical application. You can implement Porter and Snowball stemmers, analyze their error patterns, and make informed decisions about when stemming benefits your NLP pipeline versus when it introduces harmful conflation.