Stemming, as we've seen, uses rule-based heuristics to strip suffixes. The result is often not a real word: "happiness" becomes "happi," "running" becomes "run" (or "runn" with aggressive stemmers). For many applications, this is acceptable. But when we need valid dictionary words—for human readability, semantic lookup, or linguistically-informed processing—we turn to lemmatization.
Lemmatization reduces words to their lemma: the canonical dictionary form. For verbs, this is the infinitive ("ran" → "run"). For nouns, the singular ("mice" → "mouse"). For adjectives, the base form ("better" → "good").
Unlike stemming, lemmatization requires a vocabulary to look up canonical forms, knowledge of each word's part of speech to resolve ambiguity, and morphological analysis to handle irregular inflections.
The result is higher accuracy but slower execution.
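To make the contrast concrete, here is a minimal sketch (assuming NLTK and its WordNet data are installed) comparing a Porter stem with a WordNet lemma for a few irregular forms:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word, pos in [("ran", wordnet.VERB), ("mice", wordnet.NOUN), ("better", wordnet.ADJ)]:
    # Stemming applies suffix rules; lemmatization looks the word up with a POS hint
    print(word, "->", stemmer.stem(word), "vs", lemmatizer.lemmatize(word, pos))
# Expected output:
# ran -> ran vs run
# mice -> mice vs mouse
# better -> better vs good
```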
By the end of this page, you will understand the linguistic foundations of lemmatization, implement lemmatizers using NLTK, spaCy, and other tools, appreciate the role of POS tagging in accurate lemmatization, and make informed choices between stemming and lemmatization based on your requirements.
To understand lemmatization, we must understand the linguistic concepts it relies upon.
The Lemma:
A lemma is the canonical or citation form of a word—the form you would find as a dictionary entry. It represents the base from which all inflected forms derive.
Inflection vs. Derivation:
Linguists distinguish two types of morphological variation:
Inflection: Grammatical variation without changing word class ("run" → "runs," "running," "ran")
Derivation: Creating new words, often changing word class ("run" → "runner," "happy" → "happiness")
Lemmatization typically handles inflection but not derivation—"runner" lemmatizes to "runner," not "run."
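A minimal NLTK sketch of this distinction (assuming the WordNet data is downloaded):

```python
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# Inflection is undone: the verb form maps back to the infinitive
print(lemmatizer.lemmatize("running", wordnet.VERB))    # run
# Derivation is preserved: derived words keep their own lemma
print(lemmatizer.lemmatize("runner", wordnet.NOUN))     # runner
print(lemmatizer.lemmatize("happiness", wordnet.NOUN))  # happiness
```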
| Word | POS | Lemma | Relationship |
|---|---|---|---|
| running | Verb | run | Present participle → infinitive |
| ran | Verb | run | Past tense → infinitive |
| better | Adjective | good | Comparative → positive (suppletive) |
| mice | Noun | mouse | Plural → singular (irregular) |
| am, is, are, was, were | Verb | be | Conjugations → infinitive |
| feet | Noun | foot | Plural → singular (irregular) |
| studies | Verb | study | 3rd person singular → infinitive |
| studies | Noun | study | Plural → singular |
Notice that 'studies' appears twice, once as a verb and once as a noun; here both happen to share the lemma 'study,' but other ambiguous forms are less forgiving: 'leaves' lemmatizes to 'leave' as a verb and 'leaf' as a noun, and 'saw' to 'see' as a verb and 'saw' as a noun. Without POS information, lemmatizers must guess—and often guess wrong. This is why accurate lemmatization typically requires POS tagging as a prerequisite step.
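The effect is easy to see with 'leaves', where the POS hint decides the lemma (a small sketch assuming NLTK's WordNet data is available):

```python
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# The same surface form lemmatizes differently depending on the POS hint
print(lemmatizer.lemmatize("leaves", wordnet.VERB))  # leave ("she leaves early")
print(lemmatizer.lemmatize("leaves", wordnet.NOUN))  # leaf  ("autumn leaves")
# With no POS argument, WordNetLemmatizer defaults to noun
print(lemmatizer.lemmatize("leaves"))                # leaf
```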
Both lemmatization and stemming aim to reduce morphological variation, but they differ fundamentally in approach and outcomes.
| Aspect | Stemming | Lemmatization |
|---|---|---|
| Approach | Rule-based suffix stripping | Dictionary lookup + morphological analysis |
| Speed | Very fast (heuristic rules) | Slower (requires POS tagging, vocabulary) |
| Output validity | Often not real words ('happi') | Valid dictionary words (for in-vocabulary forms) |
| Linguistic knowledge | None required | Requires vocabulary and POS |
| Irregular forms | Handles poorly ('ran' ≠ 'run') | Handles correctly ('ran' → 'run') |
| Language support | Easier to develop for new languages | Requires language-specific resources |
| Error types | Over-stemming, under-stemming | POS errors, missing vocabulary |
| Reversibility | Not reversible | Sometimes reversible with context |
```python
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
import nltk

# Initialize
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Download required resources (run once)
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')

# Test cases demonstrating differences
test_cases = [
    # Regular verbs
    ("running", "v"), ("runs", "v"), ("ran", "v"),
    # Irregular verbs
    ("went", "v"), ("goes", "v"), ("gone", "v"),
    # Irregular plurals
    ("mice", "n"), ("feet", "n"), ("children", "n"),
    # Adjectives
    ("better", "a"), ("best", "a"), ("happier", "a"),
    # Verb/Noun ambiguity
    ("studies", "v"), ("studies", "n"), ("leaves", "v"), ("leaves", "n"),
    # Stemming errors
    ("university", "n"), ("universe", "n"), ("organization", "n"),
]

# Map simple POS to WordNet POS
def get_wordnet_pos(simple_pos):
    pos_map = {'v': wordnet.VERB, 'n': wordnet.NOUN, 'a': wordnet.ADJ, 'r': wordnet.ADV}
    return pos_map.get(simple_pos, wordnet.NOUN)

print("=== Stemming vs Lemmatization Comparison ===\n")
print(f"{'Word':<15} {'POS':<5} {'Stem':<12} {'Lemma':<12} {'Same?':<6}")
print("-" * 55)

for word, pos in test_cases:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, get_wordnet_pos(pos))
    same = "✓" if stem == lemma else "✗"
    print(f"{word:<15} {pos:<5} {stem:<12} {lemma:<12} {same:<6}")

# Highlight key differences
print("\n=== Key Differences ===\n")

difference_examples = [
    ("went", "v", "Past irregular verb"),
    ("mice", "n", "Irregular plural"),
    ("better", "a", "Suppletive comparative"),
    ("university", "n", "Unrelated word conflation risk"),
]

for word, pos, description in difference_examples:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word, get_wordnet_pos(pos))
    print(f"{description}:")
    print(f"  Word:  {word}")
    print(f"  Stem:  {stem}")
    print(f"  Lemma: {lemma}")
    print()
```

Choose stemming when speed is critical, output doesn't need to be human-readable, or you're working with informal/noisy text. Choose lemmatization when you need valid dictionary words, are doing semantic analysis, or when accuracy on irregular forms matters.
NLTK's WordNetLemmatizer uses the WordNet lexical database to perform lemmatization. WordNet is a large lexical database of English that groups words into sets of synonyms (synsets) and tracks morphological relationships.
Key Characteristics:
Dictionary-based: words are looked up in WordNet rather than transformed by suffix rules.
Noun by default: without an explicit POS argument, lemmatize() treats every word as a noun.
Conservative with unknown words: anything not found in WordNet is returned unchanged.
English only: WordNet covers English; other languages require different lexical resources.
```python
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import pos_tag, word_tokenize
from typing import List, Tuple

# Download required resources
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('punkt')

class EnhancedWordNetLemmatizer:
    """
    WordNet lemmatizer with automatic POS tagging for improved accuracy.
    """

    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()

    def _get_wordnet_pos(self, treebank_tag: str) -> str:
        """
        Convert Penn Treebank POS tags to WordNet POS tags.

        Penn Treebank uses detailed tags (NN, NNS, VB, VBD, JJ, RB, etc.)
        WordNet uses simple tags (n, v, a, r)
        """
        tag_prefix = treebank_tag[0].upper()
        tag_map = {
            'J': wordnet.ADJ,   # JJ, JJR, JJS
            'V': wordnet.VERB,  # VB, VBD, VBG, VBN, VBP, VBZ
            'N': wordnet.NOUN,  # NN, NNS, NNP, NNPS
            'R': wordnet.ADV    # RB, RBR, RBS
        }
        return tag_map.get(tag_prefix, wordnet.NOUN)  # Default to noun

    def lemmatize_word(self, word: str, pos: str = None) -> str:
        """
        Lemmatize a single word.

        Args:
            word: Word to lemmatize
            pos: Optional POS tag (Penn Treebank format).
                 If not provided, word is treated as noun.
        """
        if pos:
            wn_pos = self._get_wordnet_pos(pos)
        else:
            wn_pos = wordnet.NOUN
        return self.lemmatizer.lemmatize(word.lower(), wn_pos)

    def lemmatize_text(self, text: str) -> List[Tuple[str, str, str]]:
        """
        Lemmatize all words in a text with automatic POS tagging.

        Returns:
            List of (word, pos_tag, lemma) tuples
        """
        tokens = word_tokenize(text)
        tagged = pos_tag(tokens)

        results = []
        for word, tag in tagged:
            lemma = self.lemmatize_word(word, tag)
            results.append((word, tag, lemma))
        return results

    def lemmatize_sentence(self, text: str) -> str:
        """Return lemmatized text as a string."""
        results = self.lemmatize_text(text)
        return " ".join(lemma for _, _, lemma in results)

# Demonstration
lemmatizer = EnhancedWordNetLemmatizer()

# Test sentences
test_sentences = [
    "The children were running and playing in the leaves.",
    "She studies physics and her studies are going well.",
    "The mice ran into better hiding spots.",
    "I am going to the best universities in the world.",
]

print("=== Enhanced WordNet Lemmatization ===\n")

for sentence in test_sentences:
    print(f"Original: {sentence}")
    results = lemmatizer.lemmatize_text(sentence)
    print("Detailed:")
    for word, pos, lemma in results:
        if word.lower() != lemma:  # Show only changed words
            print(f"  {word} ({pos}) → {lemma}")
    print(f"Lemmatized: {lemmatizer.lemmatize_sentence(sentence)}")
    print()

# Demonstrate POS importance
print("=== POS Tag Importance ===\n")

ambiguous_words = [
    ("saw", "VBD", "past tense of see"),
    ("saw", "NN", "cutting tool"),
    ("leaves", "VBZ", "departs"),
    ("leaves", "NNS", "plural of leaf"),
    ("better", "JJR", "more good"),
    ("better", "VB", "to improve"),
]

for word, pos, meaning in ambiguous_words:
    lemma = lemmatizer.lemmatize_word(word, pos)
    print(f"'{word}' as {pos} ({meaning}) → '{lemma}'")
```

WordNetLemmatizer's output is only as good as the POS tags it receives. If POS tagging misclassifies 'saw' as a noun when it's used as a verb, lemmatization will return 'saw' instead of 'see'. For best results, use a high-quality POS tagger or a pipeline that integrates both.
spaCy provides industrial-strength lemmatization as part of its NLP pipeline. Unlike NLTK's standalone lemmatizer, spaCy integrates POS tagging and lemmatization into a unified processing pipeline, ensuring consistency between the two.
Key Characteristics:
Pipeline-integrated: lemmas are assigned using the POS tags produced earlier in the same pipeline, so no manual tag mapping is required.
Rules plus lookup tables: the English pipelines combine POS-keyed rules with exception tables for irregular forms.
Token-level access: every token exposes its lemma through the lemma_ attribute.
Multilingual: pretrained pipelines are available for many languages beyond English.
```python
import spacy
from typing import List, Dict

# Load spaCy model (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Process text through spaCy pipeline
text = "The children were running and playing in the fallen leaves."
doc = nlp(text)

print("=== spaCy Lemmatization ===\n")
print(f"Original: {text}\n")

print(f"{'Token':<12} {'POS':<6} {'Tag':<6} {'Lemma':<12} {'Changed':<8}")
print("-" * 50)

for token in doc:
    changed = "✓" if token.text.lower() != token.lemma_ else ""
    print(f"{token.text:<12} {token.pos_:<6} {token.tag_:<6} {token.lemma_:<12} {changed}")

# Lemmatized text
lemmatized = " ".join([token.lemma_ for token in doc])
print(f"\nLemmatized: {lemmatized}")

# Batch processing for performance
print("\n=== Batch Processing ===\n")

texts = [
    "The mice were running from the cats.",
    "She studies hard and her studies pay off.",
    "The better universities have better facilities.",
    "He went to the store but the store was closed.",
]

# Process in batch (much faster than one at a time)
docs = list(nlp.pipe(texts))

for original, doc in zip(texts, docs):
    lemmas = [t.lemma_ for t in doc if not t.is_punct]
    print(f"Original: {original}")
    print(f"Lemmas:   {' '.join(lemmas)}")
    print()

# Disable unnecessary pipeline components for speed
print("=== Optimized Pipeline ===\n")

# Create pipeline with only necessary components
nlp_lemma = spacy.load("en_core_web_sm", disable=["ner"])
# Or more aggressively:
# nlp_lemma = spacy.load("en_core_web_sm", disable=["ner", "parser"])

import time

large_text = " ".join(texts * 100)  # 400 sentences

# Time full pipeline
start = time.perf_counter()
doc_full = nlp(large_text)
full_time = time.perf_counter() - start

# Time optimized pipeline
start = time.perf_counter()
doc_opt = nlp_lemma(large_text)
opt_time = time.perf_counter() - start

print(f"Full pipeline:      {full_time:.3f}s")
print(f"Optimized (no NER): {opt_time:.3f}s")
print(f"Speedup: {full_time/opt_time:.1f}x")

# Show how irregular forms of "be" are mapped to their lemma
print("\n=== spaCy Lemmatization Rules ===\n")

doc = nlp("am is are was were be being been")
print("Forms of 'be':")
for token in doc:
    print(f"  {token.text:<8} → {token.lemma_}")
```

For production lemmatization, disable pipeline components you don't need. Removing NER and parsing can provide 2-3x speedup while maintaining full lemmatization accuracy. Use nlp.pipe() for batch processing rather than processing documents one at a time.
Beyond NLTK and spaCy, several other libraries provide lemmatization with different strengths.
Stanza (Stanford NLP):
Developed by Stanford NLP Group, Stanza provides neural network-based NLP in 70+ languages. It's particularly strong for academic research and multilingual applications.
Pattern:
A lightweight alternative that includes lemmatization for several European languages.
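If you want to try Pattern, the sketch below uses its lemma() helper for verbs; note that Pattern's maintenance has lagged behind recent Python releases, so treat this as illustrative rather than a production recommendation, and the printed results are what the library is expected to return.

```python
# pip install pattern  (works best on older Python versions)
from pattern.en import lemma

# Pattern exposes a simple lemma() helper, primarily for verbs
for word in ["running", "ran", "studies", "was"]:
    print(word, "->", lemma(word))
# Expected: running -> run, ran -> run, studies -> study, was -> be
```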
Simplemma:
A fast, dictionary-based lemmatizer supporting 50+ languages with minimal dependencies.
```python
# === Stanza (Stanford NLP) ===
import stanza

# Download English model (run once)
# stanza.download('en')

# Initialize pipeline
nlp_stanza = stanza.Pipeline('en', processors='tokenize,pos,lemma')

text = "The children were running and the mice were hiding."
doc = nlp_stanza(text)

print("=== Stanza Lemmatization ===\n")
print(f"Original: {text}\n")

for sentence in doc.sentences:
    for word in sentence.words:
        if word.text.lower() != word.lemma:
            print(f"  {word.text} ({word.upos}) → {word.lemma}")

# Lemmatized output
lemmas = [word.lemma for sent in doc.sentences for word in sent.words]
print(f"\nLemmas: {' '.join(lemmas)}")

# === Simplemma (Fast, Multilingual) ===
# pip install simplemma

from simplemma import lemmatize, text_lemmatizer

# Single word lemmatization
words = ["running", "mice", "better", "went", "studies"]
print("\n=== Simplemma Lemmatization ===\n")
for word in words:
    lemma = lemmatize(word, lang='en')
    print(f"  {word} → {lemma}")

# Text lemmatization
text = "The children were running and playing."
lemmatized = " ".join(text_lemmatizer(text, lang='en'))
print(f"\nOriginal: {text}")
print(f"Lemmatized: {lemmatized}")

# Multilingual support
print("\n=== Simplemma Multilingual ===\n")
multilingual_examples = [
    ('de', ["laufen", "läuft", "gelaufen"]),     # German
    ('fr', ["manger", "mange", "mangé"]),        # French
    ('es', ["correr", "corriendo", "corrido"]),  # Spanish
]

for lang, words in multilingual_examples:
    print(f"{lang.upper()}:", end=" ")
    for word in words:
        lemma = lemmatize(word, lang=lang)
        print(f"{word}→{lemma}", end=" ")
    print()

# === Performance Comparison ===
import time

test_text = "The children were running and playing in the leaves. " * 100

# NLTK
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
nltk_lemmatizer = WordNetLemmatizer()

start = time.perf_counter()
tokens = word_tokenize(test_text)
nltk_lemmas = [nltk_lemmatizer.lemmatize(t) for t in tokens]
nltk_time = time.perf_counter() - start

# spaCy
import spacy
nlp_spacy = spacy.load("en_core_web_sm", disable=["ner"])

start = time.perf_counter()
doc = nlp_spacy(test_text)
spacy_lemmas = [t.lemma_ for t in doc]
spacy_time = time.perf_counter() - start

# Simplemma
start = time.perf_counter()
simplemma_lemmas = list(text_lemmatizer(test_text, lang='en'))
simplemma_time = time.perf_counter() - start

print("\n=== Performance Comparison ===\n")
print(f"NLTK:      {nltk_time:.3f}s")
print(f"spaCy:     {spacy_time:.3f}s")
print(f"Simplemma: {simplemma_time:.3f}s")
```

| Library | Languages | Speed | Accuracy | Dependencies | Best For |
|---|---|---|---|---|---|
| NLTK | English | Medium | Good (with POS) | WordNet | Education, prototyping |
| spaCy | 20+ | Fast | Excellent | Model files | Production systems |
| Stanza | 70+ | Medium | Excellent | Model files (large) | Research, multilingual |
| Simplemma | 50+ | Very fast | Good | Minimal | Simple applications |
| Pattern | 7 European | Fast | Good | Minimal | European languages |
Real-world text presents numerous challenges for lemmatization. Understanding these edge cases helps you build robust preprocessing pipelines.
Suppletive Forms:
Some words have completely unrelated forms for different inflections (suppletive morphology):
These require explicit exception handling in lemmatizers—no rules can derive "go" from "went."
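NLTK exposes this exception handling directly: WordNet's morphy() function consults per-POS exception lists before applying any suffix-detachment rules, which is how 'went' reaches 'go'. A small sketch, assuming the WordNet corpus is downloaded:

```python
from nltk.corpus import wordnet

# morphy() checks WordNet's exception lists (irregular forms) first,
# then falls back to detachment rules for regular suffixes
print(wordnet.morphy("went", wordnet.VERB))  # go
print(wordnet.morphy("worse", wordnet.ADJ))  # bad
print(wordnet.morphy("feet", wordnet.NOUN))  # foot
```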
Compound Words:
Should "toothbrush" lemmatize to "toothbrush" or be decomposed? What about "ex-president" or "well-being"? Most lemmatizers treat compounds as single units.
Proper Nouns:
Names should generally not be lemmatized ("James" should not become "Jame"), but detecting proper nouns requires NER.
Domain-Specific Terms:
Technical vocabulary often isn't in standard dictionaries. "Kubernetes," "microservices," "serverless" won't be recognized.
```python
import spacy
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

nlp = spacy.load("en_core_web_sm")
nltk_lemmatizer = WordNetLemmatizer()

print("=== Edge Cases in Lemmatization ===\n")

# 1. Suppletive forms
print("1. Suppletive Forms:")
suppletive = ["went", "gone", "better", "best", "worse", "worst",
              "am", "is", "are", "was", "were"]
for word in suppletive:
    doc = nlp(word)
    spacy_lemma = doc[0].lemma_
    # NLTK needs POS hint
    nltk_lemma_v = nltk_lemmatizer.lemmatize(word, wordnet.VERB)
    nltk_lemma_a = nltk_lemmatizer.lemmatize(word, wordnet.ADJ)
    print(f"  {word:<8} spaCy: {spacy_lemma:<6} NLTK(v): {nltk_lemma_v:<6} NLTK(a): {nltk_lemma_a}")

# 2. Proper nouns
print("\n2. Proper Nouns:")
proper_nouns = "James and Williams went to Paris and London."
doc = nlp(proper_nouns)
print(f"  Original: {proper_nouns}")
print("  With NER protection:")
for token in doc:
    if token.ent_type_:  # Is named entity
        print(f"    {token.text} ({token.ent_type_}) → {token.text} (preserved)")
    elif token.text.lower() != token.lemma_:
        print(f"    {token.text} → {token.lemma_}")

# 3. Compounds and hyphenated words
print("\n3. Compounds and Hyphenated Words:")
compounds = "toothbrush ex-president well-being mother-in-law data-driven"
doc = nlp(compounds)
for token in doc:
    if not token.is_punct:
        print(f"  {token.text:<15} → {token.lemma_}")

# 4. Contractions
print("\n4. Contractions:")
contractions = "I'm can't won't shouldn't it's they've"
doc = nlp(contractions)
for token in doc:
    print(f"  {token.text:<12} → {token.lemma_}")

# 5. Numbers and symbols
print("\n5. Numbers and Special Cases:")
special = "2nd 3rd 1990s URLs APIs"
doc = nlp(special)
for token in doc:
    print(f"  {token.text:<10} → {token.lemma_}")

# 6. Domain-specific / OOV words
print("\n6. Domain-Specific / OOV Words:")
tech_terms = "Kubernetes microservices APIs serverless containerized"
doc = nlp(tech_terms)
print(f"  Original: {tech_terms}")
for token in doc:
    if token.text.lower() != token.lemma_:
        print(f"    {token.text} → {token.lemma_} (changed)")
    else:
        print(f"    {token.text} → {token.lemma_} (unchanged - likely OOV)")

# 7. Creating a robust lemmatizer with fallbacks
print("\n=== Robust Lemmatization Pipeline ===\n")

def robust_lemmatize(text: str, nlp, preserve_entities: bool = True) -> str:
    """
    Robust lemmatization with entity preservation and fallback handling.
    """
    doc = nlp(text)
    result = []

    for token in doc:
        # Skip punctuation
        if token.is_punct:
            result.append(token.text)
            continue
        # Preserve named entities if requested
        if preserve_entities and token.ent_type_:
            result.append(token.text)
            continue
        # Use lemma if available and different
        if token.lemma_ != "-PRON-" and token.lemma_.strip():
            result.append(token.lemma_)
        else:
            result.append(token.text.lower())

    return " ".join(result)

test = "Dr. Smith and James went to Google's headquarters in New York."
print(f"Original: {test}")
print(f"Lemmatized (preserve entities): {robust_lemmatize(test, nlp, preserve_entities=True)}")
print(f"Lemmatized (no preservation):   {robust_lemmatize(test, nlp, preserve_entities=False)}")
```

For domain-specific applications, maintain exception lists for terms that standard lemmatizers handle incorrectly. This is particularly important for technical vocabulary, brand names, and specialized terminology that may be lemmatized incorrectly or treated as OOV.
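One lightweight way to act on that advice is to check a custom exception dictionary before falling back to the standard lemmatizer. The sketch below is illustrative: DOMAIN_EXCEPTIONS and lemmatize_with_exceptions are hypothetical names, not part of any library.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical domain exception table: surface form (lowercased) -> forced lemma
DOMAIN_EXCEPTIONS = {
    "kubernetes": "kubernetes",      # never strip the trailing "s"
    "microservices": "microservice",
    "apis": "api",
}

def lemmatize_with_exceptions(text: str) -> str:
    """Lemmatize with spaCy, overriding known domain terms."""
    doc = nlp(text)
    out = []
    for token in doc:
        override = DOMAIN_EXCEPTIONS.get(token.text.lower())
        out.append(override if override else token.lemma_)
    return " ".join(out)

print(lemmatize_with_exceptions("Kubernetes manages microservices behind APIs."))
```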
Deploying lemmatization in production systems requires attention to performance, consistency, and maintainability.
Performance Optimization:
Lemmatization is computationally more expensive than stemming. For high-throughput systems, consider:
Batch processing: Process documents with nlp.pipe() rather than calling the pipeline one text at a time.
Component pruning: Disable pipeline components you don't need, such as NER.
Caching: Cache lemmas for frequently occurring words so each is computed only once.
Parallelism: Spread work across multiple processes for very large corpora.
Each of these techniques is demonstrated in the code below.
```python
import spacy
from typing import List, Dict, Iterator
from functools import lru_cache
import time
from concurrent.futures import ProcessPoolExecutor
import multiprocessing as mp

# === Optimized spaCy Pipeline ===

def create_lemmatization_pipeline(disable_components: List[str] = None):
    """
    Create an optimized spaCy pipeline for lemmatization.
    """
    if disable_components is None:
        # Disable components not needed for lemmatization
        disable_components = ["ner"]  # Keep parser for better POS accuracy
    return spacy.load("en_core_web_sm", disable=disable_components)

# === Batched Processing ===

def batch_lemmatize(
    texts: List[str],
    nlp,
    batch_size: int = 1000,
    n_process: int = 1
) -> Iterator[List[str]]:
    """
    Efficiently lemmatize large numbers of texts.

    Args:
        texts: List of texts to lemmatize
        nlp: spaCy pipeline
        batch_size: Batch size for nlp.pipe()
        n_process: Number of processes (1 for single-threaded)

    Yields:
        Lists of lemmatized words for each document
    """
    for doc in nlp.pipe(texts, batch_size=batch_size, n_process=n_process):
        yield [token.lemma_ for token in doc if not token.is_punct]

# === Vocabulary-Based Caching ===

class CachedLemmatizer:
    """
    Lemmatizer with LRU cache for frequently occurring words.
    """

    def __init__(self, cache_size: int = 100000):
        self.nlp = create_lemmatization_pipeline()
        self.cache_size = cache_size

        @lru_cache(maxsize=cache_size)
        def _cached_lemma(word: str) -> str:
            doc = self.nlp(word)
            if len(doc) == 1:
                return doc[0].lemma_
            return word

        self._cached_lemma = _cached_lemma

    def lemmatize_word(self, word: str) -> str:
        """Lemmatize a single word with caching."""
        return self._cached_lemma(word.lower())

    def lemmatize_text(self, text: str) -> str:
        """Lemmatize text, using full pipeline for context."""
        # For sentences, use full pipeline for POS context
        doc = self.nlp(text)
        return " ".join(token.lemma_ for token in doc if not token.is_punct)

    def preload_vocabulary(self, words: List[str]):
        """Pre-populate cache with known vocabulary."""
        for word in words:
            self._cached_lemma(word.lower())

    def cache_stats(self):
        """Return cache statistics."""
        return self._cached_lemma.cache_info()

# === Consistency Guarantees ===

class DeterministicLemmatizer:
    """
    Lemmatizer with versioning for reproducibility.
    """

    def __init__(self, model_name: str = "en_core_web_sm"):
        self.model_name = model_name
        self.nlp = spacy.load(model_name)
        # Store version information
        self.spacy_version = spacy.__version__
        self.model_meta = self.nlp.meta

    def get_version_info(self) -> Dict:
        """Return version information for reproducibility."""
        return {
            "spacy_version": self.spacy_version,
            "model_name": self.model_name,
            "model_version": self.model_meta.get("version", "unknown"),
            "model_lang": self.model_meta.get("lang", "unknown"),
        }

    def lemmatize_with_metadata(self, text: str) -> Dict:
        """Lemmatize with version metadata for provenance tracking."""
        doc = self.nlp(text)
        return {
            "input": text,
            "lemmas": [t.lemma_ for t in doc],
            "version_info": self.get_version_info(),
        }

# === Demonstration ===

if __name__ == "__main__":
    print("=== Production Lemmatization ===\n")

    # Create test data
    texts = [
        "The children were running and playing.",
        "She studies hard and her studies are impressive.",
        "The mice ran into better hiding spots.",
    ] * 100  # 300 documents

    nlp = create_lemmatization_pipeline()

    # Benchmark batch processing
    print("Batch processing benchmark:")
    for batch_size in [10, 50, 100, 500]:
        start = time.perf_counter()
        results = list(batch_lemmatize(texts, nlp, batch_size=batch_size))
        elapsed = time.perf_counter() - start
        rate = len(texts) / elapsed
        print(f"  Batch size {batch_size:4}: {elapsed:.3f}s ({rate:.0f} docs/sec)")

    # Caching demonstration
    print("\nCaching demonstration:")
    cached = CachedLemmatizer(cache_size=1000)

    # First pass - cache misses
    start = time.perf_counter()
    for text in texts[:50]:
        _ = cached.lemmatize_text(text)
    first_pass = time.perf_counter() - start
    print(f"  First pass:  {first_pass:.3f}s")
    print(f"  Cache info: {cached.cache_stats()}")

    # Second pass - cache hits for individual words
    start = time.perf_counter()
    for text in texts[:50]:
        _ = cached.lemmatize_text(text)
    second_pass = time.perf_counter() - start
    print(f"  Second pass: {second_pass:.3f}s")
    print(f"  Cache info: {cached.cache_stats()}")

    # Version info for reproducibility
    print("\nVersion tracking:")
    det_lemmatizer = DeterministicLemmatizer()
    print(f"  Version info: {det_lemmatizer.get_version_info()}")
```

Lemmatization provides linguistically accurate word normalization through dictionary lookup and morphological analysis. While more computationally expensive than stemming, it produces valid dictionary words and handles irregular forms correctly.
What's Next:
With morphological normalization covered (stemming and lemmatization), we turn to simpler but essential preprocessing steps: case normalization and punctuation handling. These seemingly trivial transformations have significant impacts on vocabulary size and model behavior.
You now understand lemmatization from linguistic foundations through production deployment. You can implement lemmatizers using NLTK, spaCy, and other libraries, handle edge cases appropriately, and choose between stemming and lemmatization based on your specific NLP requirements.