Consider a search query for "running shoes." Should documents containing "run," "runs," "runner," and "ran" be considered relevant? In most cases, yes—these are all morphological variants of the same underlying concept.
This is the problem stemming addresses: reducing inflected or derived words to their common base or root form, called the stem. By normalizing morphological variants, we shrink the vocabulary, let a query match every surface form of a term, and collapse related words into a single feature for downstream models.
Stemming algorithms accomplish this through rule-based heuristics that strip suffixes according to language-specific patterns. They're fast, deterministic, and require no training data—but they're also approximate, sometimes producing stems that aren't real words and occasionally conflating unrelated terms.
By the end of this page, you will understand the theoretical foundations of stemming, implement and apply the Porter and Snowball algorithms, recognize stemming errors (over-stemming and under-stemming), and make informed decisions about when stemming benefits your NLP pipeline.
Stemming is a form of word normalization that removes morphological affixes (primarily suffixes in English) to reduce words to a common form. Unlike lemmatization (which we'll cover next), stemming doesn't aim to produce linguistically valid words—it aims to produce consistent representations.
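As a quick illustration, here is a minimal sketch (assuming NLTK is installed) of suffix stripping with the Porter stemmer:

```python
# Minimal sketch: suffix stripping with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["run", "runs", "running", "ran", "runner"]:
    print(f"{word:10} -> {stemmer.stem(word)}")

# "runs" and "running" collapse to "run"; the irregular past tense "ran"
# is left untouched because no suffix rule applies to it.
```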
Key Terminology:
- Stem: the common base form a stemmer produces (e.g., 'connect' for 'connection'); it need not be a real word.
- Affix / suffix: material attached to a base ('run' + '-ing'); English stemmers mostly strip suffixes.
- Conflation: mapping several surface forms to one representation.
- Over-stemming: unrelated words reduced to the same stem.
- Under-stemming: related words left with different stems.
Why Not Just Use a Dictionary?
A naive approach might be to build a lookup table mapping every word to its root. This fails because the vocabulary is open-ended (new words and coinages appear constantly), derivational morphology is productive enough that the table is never complete, and any out-of-vocabulary word is simply missed. Rule-based stemmers sidestep these problems by operating on word shape rather than word identity.
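One practical consequence, sketched below (the coinages used are just illustrative choices), is that a rule-based stemmer still behaves sensibly on words no dictionary contains:

```python
# Rule-based stemming needs no lookup table, so novel coinages that would
# never appear in a precompiled dictionary still get normalized.
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer("english")
for word in ["doomscrolling", "doomscrolled", "doomscrolls"]:
    print(f"{word:15} -> {stemmer.stem(word)}")

# The -ing/-ed/-s rules fire regardless of whether the word is "known",
# so all three forms map to the same stem.
```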
| Original | Porter Stem | Snowball Stem | Shared Root? |
|---|---|---|---|
| running, runs, ran, run | run, run, ran, run | run, run, ran, run | Yes (mostly) |
| connection, connected, connecting | connect, connect, connect | connect, connect, connect | Yes |
| happiness, happily, happy | happi, happili, happi | happi, happili, happi | Mostly ('happily' stems differently) |
| university, universal, universe | univers, univers, univers | univers, univers, univers | Conflated! (differ semantically) |
| operate, operational, opera | oper, oper, opera | oper, oper, opera | Partly ('opera' keeps its final 'a' and escapes conflation) |
Stems are often not valid English words. 'happi' for 'happy', 'relat' for 'related', 'studi' for 'studies'—these are computational artifacts, not linguistic entities. This is acceptable for many NLP tasks where we only need consistency, but problematic when human readability matters.
The Porter Stemmer, published by Martin Porter in 1980, is the most widely used English stemming algorithm. It's been cited thousands of times and forms the basis for most subsequent stemmers.
Algorithm Overview:
Porter stemming operates through a sequence of five phases, each containing conditional suffix-stripping rules. The rules are applied in order, and many of them fire only when a quantity called the measure, m, of the remaining stem is large enough.
The Measure (m):
Porter defines words as alternating sequences of consonant clusters (C) and vowel clusters (V), so every word can be written as [C](VC)^m[V]; the exponent m is the measure.
Examples (from Porter's 1980 paper):
- m = 0: TR, EE, TREE, BY
- m = 1: TROUBLE, OATS, TREES, IVY
- m = 2: TROUBLES, PRIVATE, OATEN
"""Porter Stemming Algorithm - Educational Implementation This implementation is for learning purposes. Production systemsshould use NLTK's optimized implementation."""import refrom typing import Tuple, List class PorterStemmerEducational: """ Educational implementation of the Porter Stemmer. Demonstrates the algorithm's structure with detailed comments. """ def __init__(self): # Vowels (y is sometimes a vowel) self.vowels = set('aeiou') def is_consonant(self, word: str, i: int) -> bool: """Check if character at position i is a consonant.""" c = word[i] if c in self.vowels: return False if c == 'y': # y is consonant only at word start or after vowel if i == 0: return True return not self.is_consonant(word, i - 1) return True def measure(self, stem: str) -> int: """ Calculate the measure m of a stem. m is the number of VC (vowel-consonant) sequences. Example: TR -> m=0 (CC) TREE -> m=0 (CCVV) TROUBLE -> m=1 (CCVVCVC) OATS -> m=1 (VCCC) TREES -> m=1 (CCVVCC) IVY -> m=1 (VCCC) TROUBLES -> m=2 (CCVVCVCC) PRIVATE -> m=2 (CCVCVCV) """ cv_pattern = [] for i, char in enumerate(stem): if self.is_consonant(stem, i): # Avoid consecutive C markers if not cv_pattern or cv_pattern[-1] != 'C': cv_pattern.append('C') else: if not cv_pattern or cv_pattern[-1] != 'V': cv_pattern.append('V') # Count VC sequences pattern_str = ''.join(cv_pattern) return pattern_str.count('VC') def contains_vowel(self, stem: str) -> bool: """Check if stem contains at least one vowel.""" for i in range(len(stem)): if not self.is_consonant(stem, i): return True return False def ends_double_consonant(self, word: str) -> bool: """Check if word ends with double consonant.""" if len(word) >= 2: if word[-1] == word[-2] and self.is_consonant(word, len(word) - 1): return True return False def ends_cvc(self, word: str) -> bool: """ Check if word ends consonant-vowel-consonant where final consonant is not w, x, or y. 
""" if len(word) >= 3: c1 = self.is_consonant(word, len(word) - 3) v = not self.is_consonant(word, len(word) - 2) c2 = self.is_consonant(word, len(word) - 1) if c1 and v and c2 and word[-1] not in 'wxy': return True return False def step1a(self, word: str) -> str: """Step 1a: Handle plurals and past tenses (SSES, IES, SS, S).""" if word.endswith('sses'): return word[:-2] if word.endswith('ies'): return word[:-2] if word.endswith('ss'): return word if word.endswith('s'): return word[:-1] return word def step1b(self, word: str) -> str: """Step 1b: Handle -eed, -ed, -ing.""" if word.endswith('eed'): stem = word[:-3] if self.measure(stem) > 0: return word[:-1] # Remove just 'd' return word # Handle -ed and -ing for suffix in ['ed', 'ing']: if word.endswith(suffix): stem = word[:-len(suffix)] if self.contains_vowel(stem): word = stem # Additional transformations if word.endswith('at') or word.endswith('bl') or word.endswith('iz'): word += 'e' elif self.ends_double_consonant(word) and word[-1] not in 'lsz': word = word[:-1] elif self.measure(word) == 1 and self.ends_cvc(word): word += 'e' return word return word + suffix # Restore if no vowel in stem return word def step1c(self, word: str) -> str: """Step 1c: Replace suffix y with i if stem contains vowel.""" if word.endswith('y'): stem = word[:-1] if self.contains_vowel(stem): return stem + 'i' return word def step2(self, word: str) -> str: """Step 2: Handle various suffixes (-ational, -fulness, etc.).""" suffixes = [ ('ational', 'ate'), ('tional', 'tion'), ('enci', 'ence'), ('anci', 'ance'), ('izer', 'ize'), ('abli', 'able'), ('alli', 'al'), ('entli', 'ent'), ('eli', 'e'), ('ousli', 'ous'), ('ization', 'ize'), ('ation', 'ate'), ('ator', 'ate'), ('alism', 'al'), ('iveness', 'ive'), ('fulness', 'ful'), ('ousness', 'ous'), ('aliti', 'al'), ('iviti', 'ive'), ('biliti', 'ble'), ] for old, new in suffixes: if word.endswith(old): stem = word[:-len(old)] if self.measure(stem) > 0: return stem + new return word def step3(self, word: str) -> str: """Step 3: Handle suffixes (-icate, -ful, -ness, etc.).""" suffixes = [ ('icate', 'ic'), ('ative', ''), ('alize', 'al'), ('iciti', 'ic'), ('ical', 'ic'), ('ful', ''), ('ness', ''), ] for old, new in suffixes: if word.endswith(old): stem = word[:-len(old)] if self.measure(stem) > 0: return stem + new return word def step4(self, word: str) -> str: """Step 4: Remove various suffixes if m > 1.""" suffixes = [ 'al', 'ance', 'ence', 'er', 'ic', 'able', 'ible', 'ant', 'ement', 'ment', 'ent', 'ou', 'ism', 'ate', 'iti', 'ous', 'ive', 'ize' ] # Special case for -ion if word.endswith('ion'): stem = word[:-3] if self.measure(stem) > 1 and stem[-1] in 'st': return stem for suffix in suffixes: if word.endswith(suffix): stem = word[:-len(suffix)] if self.measure(stem) > 1: return stem return word def step5a(self, word: str) -> str: """Step 5a: Remove final 'e' based on measure.""" if word.endswith('e'): stem = word[:-1] m = self.measure(stem) if m > 1: return stem if m == 1 and not self.ends_cvc(stem): return stem return word def step5b(self, word: str) -> str: """Step 5b: Remove double 'l' if m > 1.""" if self.ends_double_consonant(word) and word[-1] == 'l': if self.measure(word[:-1]) > 1: return word[:-1] return word def stem(self, word: str) -> str: """Apply all stemming steps to a word.""" word = word.lower() # Handle short words if len(word) <= 2: return word word = self.step1a(word) word = self.step1b(word) word = self.step1c(word) word = self.step2(word) word = self.step3(word) word = self.step4(word) word = 
self.step5a(word) word = self.step5b(word) return word # Test the educational implementationstemmer = PorterStemmerEducational() test_words = [ "running", "runs", "ran", "runner", "happiness", "happily", "happy", "connection", "connected", "connecting", "connective", "caresses", "ponies", "cats", "relational", "conditional", "rational", "troubles", "troubling", "troubled",] print("=== Porter Stemmer Educational Demo ===\n")for word in test_words: stem = stemmer.stem(word) print(f" {word:15} → {stem}")The Five Steps in Detail:
Step 1: Handle simple suffixes
Step 2: Double-suffix reduction
Step 3: Further suffix removal
Step 4: Long suffix removal (requires m>1)
Step 5: Final cleanup
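To see the five phases in action, the short trace below reuses the PorterStemmerEducational class from the listing above and prints the word after each step; 'relational' is just an illustrative choice.

```python
# Trace a single word through the phases of the educational Porter
# implementation defined above (assumes PorterStemmerEducational is in scope).
stemmer = PorterStemmerEducational()

word = "relational"
steps = [
    ("step1a", stemmer.step1a), ("step1b", stemmer.step1b),
    ("step1c", stemmer.step1c), ("step2", stemmer.step2),
    ("step3", stemmer.step3), ("step4", stemmer.step4),
    ("step5a", stemmer.step5a), ("step5b", stemmer.step5b),
]

current = word.lower()
print(f"input:        {current}")
for name, step in steps:
    current = step(current)
    print(f"after {name}: {current}")

# Step 2 rewrites the '-ational' suffix to '-ate' (giving 'relate');
# step 5a then drops the trailing 'e', leaving the stem 'relat'.
```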
The Snowball Stemmer (also called Porter2) is Martin Porter's improved version, developed in 2000. It addresses known issues in the original Porter algorithm and provides a specification language for creating stemmers in multiple languages.
Improvements Over Porter: Snowball's English stemmer (Porter2) adds an explicit exception list for irregular forms (e.g., skies → sky, dying → die, lying → lie), refines the handling of words ending in y, adjusts several rules that Porter handled poorly, and is written in the Snowball specification language, which is also used to generate stemmers for other languages.
Languages Supported: Arabic, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Lithuanian, Nepali, Norwegian, Portuguese, Romanian, Russian, Serbian, Spanish, Swedish, Tamil, Turkish, and Yiddish.
```python
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.tokenize import word_tokenize

# Initialize stemmers
porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Available languages in Snowball
print("Languages supported by Snowball:")
print(SnowballStemmer.languages)

# Compare Porter and Snowball on English
test_cases = [
    # Words where they differ
    "generous", "generously", "generosity",
    "sky", "skies", "skying",
    "dying", "dye", "dyed",
    "falling", "felled", "fallen",
    "cries", "crying", "cried",
    "lying", "lie", "lied",
    # Irregular forms
    "goes", "going", "went", "gone",
    "analysis", "analyses", "analyzing",
]

print("\n=== Porter vs Snowball Comparison ===\n")
print(f"{'Word':<15} {'Porter':<12} {'Snowball':<12} {'Same?':<6}")
print("-" * 47)

differences = []
for word in test_cases:
    porter_stem = porter.stem(word)
    snowball_stem = snowball.stem(word)
    same = "✓" if porter_stem == snowball_stem else "✗"
    print(f"{word:<15} {porter_stem:<12} {snowball_stem:<12} {same:<6}")
    if porter_stem != snowball_stem:
        differences.append((word, porter_stem, snowball_stem))

print(f"\nDifferences found: {len(differences)}")

# Demonstrate multilingual stemming
print("\n=== Multilingual Snowball Stemming ===\n")

multilingual_tests = [
    ("german", ["laufen", "läuft", "gelaufen", "Fußball", "Fußbälle"]),
    ("french", ["manger", "mange", "mangé", "marchant", "marché"]),
    ("spanish", ["correr", "corriendo", "corrido", "hablando", "hablado"]),
    ("russian", ["бегать", "бегу", "бежал", "работать", "работаю"]),
]

for lang, words in multilingual_tests:
    stemmer = SnowballStemmer(lang)
    print(f"\n{lang.title()}:")
    for word in words:
        stem = stemmer.stem(word)
        print(f"  {word:<15} → {stem}")

# Performance and consistency analysis
print("\n=== Consistency Test ===\n")

# Words that should stem to the same root
semantic_groups = [
    ["connect", "connected", "connecting", "connection", "connective"],
    ["happy", "happiness", "happily", "unhappy", "unhappiness"],
    ["organize", "organized", "organizing", "organization", "organizational"],
    ["create", "created", "creating", "creation", "creative", "creativity"],
]

for group in semantic_groups:
    porter_stems = set(porter.stem(w) for w in group)
    snowball_stems = set(snowball.stem(w) for w in group)
    print(f"Words: {group}")
    print(f"  Porter stems:   {porter_stems}")
    print(f"  Snowball stems: {snowball_stems}")
    print(f"  All same stem? Porter={len(porter_stems)==1}, Snowball={len(snowball_stems)==1}")
    print()
```

For new projects, prefer Snowball over Porter for English. It's more accurate on edge cases, actively maintained, and the differences in output are generally improvements. For multilingual applications, Snowball is the only choice—Porter only supports English.
Because stemmers use heuristic rules without linguistic knowledge, they inevitably make errors. Understanding these error types helps you predict when stemming will help versus hurt.
Over-stemming (False Positives):
Over-stemming occurs when unrelated words are reduced to the same stem. This increases false matches in retrieval and conflates distinct concepts in classification.
Under-stemming (False Negatives):
Under-stemming occurs when related words are not reduced to the same stem. This causes false misses in retrieval and fragments feature space in classification.
```python
from nltk.stem import PorterStemmer, SnowballStemmer
from collections import defaultdict
from typing import List, Tuple, Set

porter = PorterStemmer()
snowball = SnowballStemmer("english")


def analyze_conflation(word_groups: List[List[str]], stemmer) -> dict:
    """
    Analyze stemmer behavior on groups of related and unrelated words.

    Args:
        word_groups: List of word groups, where words in each group
                     should ideally stem to the same root
    """
    results = {
        'over_stemming': [],   # Unrelated words → same stem
        'under_stemming': [],  # Related words → different stems
        'correct': []
    }

    # Check within groups for under-stemming
    for group in word_groups:
        stems = set(stemmer.stem(w.lower()) for w in group)
        if len(stems) > 1:
            results['under_stemming'].append({
                'words': group,
                'stems': list(stems),
                'issue': 'Related words have different stems'
            })
        else:
            results['correct'].append({
                'words': group,
                'stem': list(stems)[0]
            })

    return results


def find_over_stemming(words: List[str], stemmer) -> dict:
    """Find cases where unrelated words get the same stem."""
    stem_to_words = defaultdict(list)
    for word in words:
        stem = stemmer.stem(word.lower())
        stem_to_words[stem].append(word)

    # Filter to stems with multiple words (potential over-stemming)
    conflicts = {
        stem: words for stem, words in stem_to_words.items()
        if len(words) > 1
    }
    return conflicts


# Test word groups (words in each group should stem together)
related_groups = [
    ["run", "running", "runs", "runner", "ran"],
    ["happy", "happily", "happiness", "unhappy"],
    ["connect", "connected", "connection", "connective"],
    ["analyze", "analysis", "analyzed", "analytical"],
    ["go", "going", "went", "gone", "goes"],
    ["absorb", "absorbed", "absorption", "absorbing"],
]

# Words that should NOT stem together
unrelated_words = [
    "university", "universal", "universe",   # Different meanings
    "policy", "police", "polish",            # Different domains
    "operate", "opera", "operator",          # opera ≠ operate
    "wander", "wand", "want",                # Unrelated
    "organ", "organize", "organic",          # Body part vs action vs chemistry
]

print("=== Stemming Error Analysis ===\n")

print("1. Under-stemming (related words → different stems):\n")
for group in related_groups:
    porter_stems = set(porter.stem(w) for w in group)
    snowball_stems = set(snowball.stem(w) for w in group)
    if len(porter_stems) > 1 or len(snowball_stems) > 1:
        print(f"  Words: {group}")
        print(f"    Porter:   {porter_stems} - {'Under-stemmed!' if len(porter_stems) > 1 else 'OK'}")
        print(f"    Snowball: {snowball_stems} - {'Under-stemmed!' if len(snowball_stems) > 1 else 'OK'}")
        print()

print("\n2. Over-stemming (unrelated words → same stem):\n")
porter_conflicts = find_over_stemming(unrelated_words, porter)
snowball_conflicts = find_over_stemming(unrelated_words, snowball)

print("Porter over-stemming:")
for stem, words in porter_conflicts.items():
    if len(words) > 1:
        print(f"  '{stem}' ← {words}")

print("\nSnowball over-stemming:")
for stem, words in snowball_conflicts.items():
    if len(words) > 1:
        print(f"  '{stem}' ← {words}")

# Quantify error types
print("\n=== Error Rate Estimation ===")
print("(Based on test vocabulary)\n")

all_words = []
for group in related_groups:
    all_words.extend(group)

# Count under-stemming
under_stem_count = 0
for group in related_groups:
    if len(set(porter.stem(w) for w in group)) > 1:
        under_stem_count += 1

# Count over-stemming
over_stem_count = len([c for c in porter_conflicts.values() if len(c) > 1])

print(f"Word groups tested: {len(related_groups)}")
print(f"Under-stemming errors: {under_stem_count}/{len(related_groups)}")
print(f"Over-stemming conflicts: {over_stem_count} (in {len(unrelated_words)} test words)")
```

In information retrieval, over-stemming increases recall but hurts precision. Under-stemming hurts recall but preserves precision. The optimal aggressiveness depends on your use case: search engines often prefer aggressive stemming, while classification tasks may prefer conservative stemming or no stemming at all.
While Porter and Snowball dominate English stemming, several alternatives exist with different trade-offs.
Lancaster Stemmer (Paice/Husk):
Developed by Chris Paice at Lancaster University. More aggressive than Porter, applying iterative rules until no more apply. Produces shorter stems.
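A quick hedged sketch of that aggressiveness (exact stems depend on your NLTK version; the test words are illustrative):

```python
# Compare Lancaster's aggressive iterative stripping against Porter
# on a handful of words.
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ["maximum", "organization", "happiness", "connectivity"]:
    print(f"{word:15} porter={porter.stem(word):15} lancaster={lancaster.stem(word)}")

# Lancaster keeps applying rules until none fire, so its stems are usually
# shorter than Porter's and more likely to conflate distinct words.
```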
Lovins Stemmer:
One of the earliest stemmers (1968). Uses a single-pass, longest-match approach with a list of 294 endings. Less accurate than Porter.
Regexp Stemmer:
Applies user-defined regular expression rules. Useful for domain-specific stemming.
Hunspell Stemmer:
Used by spell checkers, based on morphological dictionaries. More accurate but slower.
```python
from nltk.stem import (
    PorterStemmer,
    SnowballStemmer,
    LancasterStemmer,
    RegexpStemmer
)
import time

# Initialize stemmers
stemmers = {
    'Porter': PorterStemmer(),
    'Snowball': SnowballStemmer('english'),
    'Lancaster': LancasterStemmer(),
}

# Test vocabulary
test_words = [
    "running", "happiness", "connection", "generosity",
    "maximum", "organization", "argument", "multiply",
    "troubleshooting", "international", "manufacturing",
    "environmental", "considerably", "unfortunately",
]

print("=== Stemmer Comparison ===\n")
print(f"{'Word':<18}", end="")
for name in stemmers:
    print(f"{name:<12}", end="")
print()
print("-" * 54)

for word in test_words:
    print(f"{word:<18}", end="")
    for name, stemmer in stemmers.items():
        stem = stemmer.stem(word)
        print(f"{stem:<12}", end="")
    print()

# Analyze aggressiveness
print("\n=== Aggressiveness Analysis ===\n")

def measure_aggressiveness(stemmer, words):
    """Measure average character reduction."""
    reductions = []
    for word in words:
        stem = stemmer.stem(word)
        reduction = 1 - (len(stem) / len(word))
        reductions.append(reduction)
    return sum(reductions) / len(reductions)

large_vocab = [
    "international", "organizational", "environmental", "computational",
    "understanding", "manufacturing", "troubleshooting", "administrative",
    "representative", "interpretation", "responsibility", "characteristics",
    "unfortunately", "approximately", "significantly", "professionally",
]

for name, stemmer in stemmers.items():
    agg = measure_aggressiveness(stemmer, large_vocab)
    print(f"{name:12} average reduction: {agg:.1%}")

# Performance comparison
print("\n=== Performance Benchmark ===\n")

# Generate test data
vocab = large_vocab * 1000  # 16,000 words

for name, stemmer in stemmers.items():
    start = time.perf_counter()
    for word in vocab:
        stemmer.stem(word)
    elapsed = time.perf_counter() - start
    rate = len(vocab) / elapsed
    print(f"{name:12} {elapsed:.3f}s ({rate:,.0f} words/sec)")

# Custom regex stemmer example
print("\n=== Custom Regex Stemmer ===\n")

# Remove common suffixes
custom = RegexpStemmer('ing$|s$|ed$|er$|est$|ly$|tion$|ness$')

custom_test = ["running", "cats", "jumped", "faster", "quickly", "happiness"]
for word in custom_test:
    print(f"  {word:<12} → {custom.stem(word)}")
```

| Stemmer | Aggressiveness | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| Porter | Medium | Fast | Good | General purpose, widely used baseline |
| Snowball | Medium | Fast | Better | Improved Porter, multilingual support |
| Lancaster | High | Fast | Lower | When aggressive normalization needed |
| Lovins | High | Fast | Low | Historical interest only |
| Regexp | Configurable | Fast | Varies | Custom domain-specific rules |
Stemming is a design choice, not a mandatory preprocessing step. Understanding when it helps—and when it hurts—is essential for building effective NLP systems.
With BERT, GPT, and modern transformers, explicit stemming is rarely beneficial. These models use subword tokenization (BPE, WordPiece) that implicitly captures morphological patterns: related surface forms are frequently split into a shared root piece plus suffix pieces, so the model can link them without any explicit normalization. Applying stemming on top of this can actually harm performance.
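A small sketch of this idea (it assumes the Hugging Face transformers package is installed and can download the bert-base-uncased tokenizer):

```python
# Inspect how a WordPiece tokenizer splits morphological variants.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["run", "running", "troubleshooting", "unhappiness"]:
    print(f"{word:18} -> {tokenizer.tokenize(word)}")

# Frequent forms are often single tokens, while rarer derived forms are split
# into pieces (a root plus '##'-prefixed continuations), so related words end
# up sharing parameters through their shared subword units.
```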
"""Demonstration of stemming impact on different NLP tasks."""from nltk.stem import SnowballStemmerfrom sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizerfrom sklearn.metrics.pairwise import cosine_similarityimport numpy as np stemmer = SnowballStemmer("english") def stem_text(text: str) -> str: """Stem all words in a text.""" words = text.lower().split() return " ".join(stemmer.stem(w) for w in words) # === Impact on Vocabulary Size ===print("=== Impact on Vocabulary Size ===\n") documents = [ "The runner was running quickly through the running track.", "Happiness comes from being happy and creating happiness.", "The organization was organizing an organized event.", "Connecting the connection connected them together.", "She argued the argument argumentatively.",] # Without stemmingvec_no_stem = CountVectorizer()vec_no_stem.fit(documents)vocab_no_stem = len(vec_no_stem.get_feature_names_out()) # With stemmingdocs_stemmed = [stem_text(d) for d in documents]vec_stemmed = CountVectorizer()vec_stemmed.fit(docs_stemmed)vocab_stemmed = len(vec_stemmed.get_feature_names_out()) print(f"Vocabulary without stemming: {vocab_no_stem}")print(f"Vocabulary with stemming: {vocab_stemmed}")print(f"Reduction: {(1 - vocab_stemmed/vocab_no_stem)*100:.1f}%") # === Impact on Document Similarity ===print("\n=== Impact on Document Similarity ===\n") doc_pairs = [ ("The dog runs quickly.", "The dogs were running."), # Related ("Operational efficiency", "Opera performance"), # Conflated by stemming ("The university professor", "Universal laws of physics"), # Conflated by stemming] for doc1, doc2 in doc_pairs: # Without stemming vec = TfidfVectorizer() tfidf = vec.fit_transform([doc1, doc2]) sim_no_stem = cosine_similarity(tfidf[0], tfidf[1])[0, 0] # With stemming vec_s = TfidfVectorizer() tfidf_s = vec_s.fit_transform([stem_text(doc1), stem_text(doc2)]) sim_stemmed = cosine_similarity(tfidf_s[0], tfidf_s[1])[0, 0] print(f"Doc 1: {doc1}") print(f"Doc 2: {doc2}") print(f" Similarity without stemming: {sim_no_stem:.3f}") print(f" Similarity with stemming: {sim_stemmed:.3f}") print(f" Change: {'+' if sim_stemmed > sim_no_stem else ''}{(sim_stemmed - sim_no_stem):.3f}") print() # === Semantic Corruption Examples ===print("=== Semantic Corruption Examples ===\n") corruption_examples = [ ("The opera singer performed beautifully.", "entertainment"), ("The software operates automatically.", "technology"), ("The university admission process", "education"), ("Understanding universal truths", "philosophy"),] print("After stemming, opera/operate and university/universal become identical:")for text, domain in corruption_examples: stemmed = stem_text(text) print(f" [{domain}] {text}") print(f" → {stemmed}") print()Stemming is a powerful but imprecise tool for text normalization. Understanding its mechanics, limitations, and appropriate use cases is essential for making informed preprocessing decisions.
What's Next:
Stemming is fast and simple but linguistically crude. Lemmatization offers a more sophisticated alternative, using vocabulary lookups and morphological analysis to reduce words to their dictionary forms (lemmas). We'll explore when lemmatization's higher accuracy justifies its computational cost.
You now understand stemming from algorithmic internals to practical application. You can implement Porter and Snowball stemmers, analyze their error patterns, and make informed decisions about when stemming benefits your NLP pipeline versus when it introduces harmful conflation.