Consider the sentence: "The quick brown fox jumps over the lazy dog."
If we're building a document classifier or search engine, which words carry the most semantic weight? Intuitively, "quick," "brown," "fox," "jumps," "lazy," and "dog" convey meaning, while "the" and "over" are grammatical scaffolding—necessary for human readability but carrying little topical information.
These high-frequency, low-information words are called stop words. In many (but not all) NLP applications, removing them improves model performance by reducing noise in the feature space, shrinking vocabulary and index size, and letting the model focus on content-bearing terms.
Stop-word removal is not universally helpful. For sentiment analysis, 'not' is critical. For question answering, 'who', 'what', 'where' carry meaning. For authorship attribution, function word patterns are discriminative. This page teaches you to make informed decisions about when and how to apply stop-word removal.
By the end of this page, you will understand the linguistic foundations of stop words, implement multiple stop-word removal strategies, evaluate domain-specific stop-word lists, and critically assess when stop-word removal improves versus degrades your NLP system.
The concept of stop words emerges from fundamental observations about language structure. To understand stop-word removal, we must first understand why certain words appear frequently yet carry little topical content.
Zipf's Law:
George Kingsley Zipf observed that in any natural language corpus, word frequency follows a power law distribution:
$$f(r) \propto \frac{1}{r^\alpha}$$
where $r$ is the rank of a word (1 = most frequent, 2 = second most frequent, etc.) and $\alpha \approx 1$ for most languages.
This means the most frequent word appears roughly twice as often as the second most frequent, three times as often as the third, and so on. In English corpora, the top 100 words (mostly function words) account for approximately 50% of all word occurrences.
| Rank | Word | Frequency % | Cumulative % |
|---|---|---|---|
| 1 | the | 7.0% | 7.0% |
| 2 | of | 3.5% | 10.5% |
| 3 | and | 2.8% | 13.3% |
| 4 | to | 2.6% | 15.9% |
| 5 | a | 2.3% | 18.2% |
| 6 | in | 2.1% | 20.3% |
| 7 | that | 1.0% | 21.3% |
| 8 | is | 1.0% | 22.3% |
| 9 | was | 0.9% | 23.2% |
| 10 | he | 0.9% | 24.1% |
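You can check this distribution on a real corpus yourself. The following is a minimal sketch, assuming NLTK's Brown corpus has been downloaded (`nltk.download('brown')`); it prints the rank-frequency product for the top words (roughly constant when $\alpha \approx 1$) and the coverage of the 100 most frequent words.

```python
from collections import Counter
from nltk.corpus import brown

# nltk.download('brown')  # run once

# Lowercase alphabetic tokens from the Brown corpus
tokens = [w.lower() for w in brown.words() if w.isalpha()]
freqs = Counter(tokens)
total = sum(freqs.values())

print(f"{'rank':>4} {'word':<8} {'freq %':>7} {'rank x freq':>12}")
for rank, (word, count) in enumerate(freqs.most_common(10), start=1):
    # Under Zipf's law with alpha ~ 1, rank * frequency stays roughly constant
    print(f"{rank:>4} {word:<8} {100*count/total:>6.2f}% {rank*count:>12}")

# Share of all tokens covered by the 100 most frequent words
top100 = sum(c for _, c in freqs.most_common(100))
print(f"\nTop 100 words cover {100*top100/total:.1f}% of all tokens")
```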
Content Words vs. Function Words:
Linguists distinguish between two fundamental word categories:
Content Words (Open Class):
Nouns, verbs, adjectives, and adverbs. They carry the core semantic content of a sentence, and the class is "open" because new members are coined all the time ('blog', 'selfie', 'transformer').
Function Words (Closed Class):
Determiners, prepositions, conjunctions, pronouns, and auxiliary verbs. They express grammatical relationships rather than topical content, and the class is "closed": new function words enter a language extremely rarely.
Stop words are predominantly function words, though the exact boundary depends on your application.
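A quick way to see this split in practice is to inspect part-of-speech tags: function words receive closed-class tags such as DET, ADP, AUX, and PRON. The sketch below assumes spaCy and the en_core_web_sm model are installed; the FUNCTION_POS grouping is illustrative, not a spaCy constant.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Closed-class (function word) POS tags in the Universal Dependencies scheme
FUNCTION_POS = {"DET", "ADP", "AUX", "PRON", "CCONJ", "SCONJ", "PART"}

for token in doc:
    if token.is_punct:
        continue
    kind = "function" if token.pos_ in FUNCTION_POS else "content"
    print(f"{token.text:<6} {token.pos_:<6} {kind:<9} is_stop={token.is_stop}")
```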
From an information theory standpoint, stop words have low information content because they're so common. If $P(w)$ is high, then the information content $I(w) = -\log_2 P(w)$ is low. Words that appear in nearly every document don't help distinguish one document from another; they have low discriminative power.
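For example, plugging in the frequencies from the table above: "the" with $P \approx 0.07$ carries only $-\log_2 0.07 \approx 3.8$ bits per occurrence, whereas a term appearing once in 10,000 tokens ($P = 10^{-4}$) carries $-\log_2 10^{-4} \approx 13.3$ bits, roughly 3.5 times more information each time it appears.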
Multiple standard stop-word lists exist, each with different philosophies about what constitutes a stop word. Understanding these differences helps you make informed choices.
NLTK Stop Words:
NLTK provides curated stop-word lists for multiple languages. The English list contains 179 words, including common function words but also some arguably content-carrying words like "few," "more," "most."
spaCy Stop Words:
spaCy's English stop list contains 326 words—significantly more than NLTK. It includes additional contractions and informal words.
scikit-learn Stop Words:
scikit-learn's English stop list (derived from the Glasgow Information Retrieval Group's stop list) contains 318 words with a focus on information retrieval applications.
```python
import nltk
from nltk.corpus import stopwords
import spacy
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Download NLTK stopwords (run once)
# nltk.download('stopwords')

# NLTK English stopwords
nltk_stops = set(stopwords.words('english'))
print(f"NLTK stopwords count: {len(nltk_stops)}")
print(f"Sample: {sorted(list(nltk_stops))[:20]}")

# spaCy English stopwords
nlp = spacy.load("en_core_web_sm")
spacy_stops = nlp.Defaults.stop_words
print(f"\nspaCy stopwords count: {len(spacy_stops)}")
print(f"Sample: {sorted(list(spacy_stops))[:20]}")

# scikit-learn stopwords
sklearn_stops = set(ENGLISH_STOP_WORDS)
print(f"\nscikit-learn stopwords count: {len(sklearn_stops)}")
print(f"Sample: {sorted(list(sklearn_stops))[:20]}")

# Analyze overlap and differences
all_lists = [nltk_stops, spacy_stops, sklearn_stops]
names = ["NLTK", "spaCy", "sklearn"]

print("\n=== Overlap Analysis ===")
common_to_all = nltk_stops & spacy_stops & sklearn_stops
print(f"Common to all three: {len(common_to_all)} words")

# Words unique to each list
print("\nWords in spaCy but not NLTK or sklearn:")
spacy_unique = spacy_stops - nltk_stops - sklearn_stops
print(f"  {sorted(list(spacy_unique))[:15]}...")

print("\nWords in NLTK but not spaCy or sklearn:")
nltk_unique = nltk_stops - spacy_stops - sklearn_stops
print(f"  {sorted(list(nltk_unique))[:15]}...")

# Important words that might be controversial as stopwords
potentially_meaningful = ['not', 'no', 'never', 'always', 'very', 'more',
                          'most', 'few', 'many', 'much', 'all', 'some']
print("\n=== Potentially Meaningful 'Stop Words' ===")
for word in potentially_meaningful:
    in_nltk = word in nltk_stops
    in_spacy = word in spacy_stops
    in_sklearn = word in sklearn_stops
    print(f"  '{word}': NLTK={in_nltk}, spaCy={in_spacy}, sklearn={in_sklearn}")
```

Implementing stop-word removal involves more than just set membership lookup. Production systems must handle case sensitivity, stemming interaction, Unicode normalization, and performance optimization.
Basic Implementation:
The simplest approach uses a set for O(1) lookup:
```python
from typing import List, Set, Callable
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import re


class StopWordRemover:
    """
    Comprehensive stop-word removal with multiple configuration options.
    """

    def __init__(
        self,
        language: str = "english",
        custom_stops: Set[str] = None,
        preserve_negation: bool = False,
        preserve_question_words: bool = False,
        case_sensitive: bool = False,
        min_word_length: int = 0
    ):
        """
        Initialize stop-word remover with configuration.

        Args:
            language: Language for NLTK stopwords
            custom_stops: Additional words to add as stop words
            preserve_negation: Keep negation words (not, no, never, etc.)
            preserve_question_words: Keep question words (who, what, where, etc.)
            case_sensitive: Whether stop-word matching is case-sensitive
            min_word_length: Remove words shorter than this (0 = no minimum)
        """
        # Load base stop words
        self.stops = set(stopwords.words(language))

        # Add custom stop words
        if custom_stops:
            self.stops.update(custom_stops)

        # Remove negation words if preserving
        if preserve_negation:
            negations = {'not', 'no', 'never', 'neither', 'nor', "n't",
                         'none', 'nothing', 'nowhere', 'nobody'}
            self.stops -= negations

        # Remove question words if preserving
        if preserve_question_words:
            question_words = {'who', 'what', 'where', 'when', 'why',
                              'how', 'which', 'whom', 'whose'}
            self.stops -= question_words

        self.case_sensitive = case_sensitive
        self.min_word_length = min_word_length

        # If case-insensitive, lowercase all stop words
        if not case_sensitive:
            self.stops = {w.lower() for w in self.stops}

    def is_stopword(self, word: str) -> bool:
        """Check if a word is a stop word."""
        check_word = word if self.case_sensitive else word.lower()

        # Check minimum length
        if len(word) < self.min_word_length:
            return True

        return check_word in self.stops

    def remove_stopwords(self, tokens: List[str]) -> List[str]:
        """Remove stop words from a list of tokens."""
        return [t for t in tokens if not self.is_stopword(t)]

    def process_text(self, text: str) -> List[str]:
        """Tokenize and remove stop words from raw text."""
        tokens = word_tokenize(text)
        return self.remove_stopwords(tokens)

    def get_stopwords(self) -> Set[str]:
        """Return the current stop-word set."""
        return self.stops.copy()


# Demonstration
remover = StopWordRemover(preserve_negation=True)
text = "I am not happy with the service, but I have no complaints about the product."

print(f"Original text: {text}")
print(f"Tokens: {word_tokenize(text)}")
print(f"After stop-word removal: {remover.process_text(text)}")
print(f"Negation preserved: 'not' in result = {'not' in remover.process_text(text)}")

# Compare with negation removed
remover_strict = StopWordRemover(preserve_negation=False)
print(f"\nStrict removal: {remover_strict.process_text(text)}")
```

Frequency-Based Stop-word Identification:
Rather than using a fixed list, you can dynamically identify stop words based on corpus statistics. This adapts to your specific domain and dataset.
```python
from collections import Counter
from typing import List, Set
import math


class DynamicStopWordIdentifier:
    """
    Identify stop words based on corpus statistics rather than fixed lists.
    Uses multiple criteria: frequency, document frequency, and TF-IDF.
    """

    def __init__(self, documents: List[List[str]]):
        """
        Analyze corpus to identify potential stop words.

        Args:
            documents: List of tokenized documents
        """
        self.documents = documents
        self.n_docs = len(documents)

        # Compute statistics
        self._compute_frequencies()

    def _compute_frequencies(self):
        """Compute term and document frequencies."""
        # Term frequency across entire corpus
        self.term_freq = Counter()
        for doc in self.documents:
            self.term_freq.update(doc)

        # Document frequency (in how many docs does each term appear?)
        self.doc_freq = Counter()
        for doc in self.documents:
            unique_terms = set(doc)
            self.doc_freq.update(unique_terms)

        # Total tokens
        self.total_tokens = sum(self.term_freq.values())

        # Compute IDF for each term
        self.idf = {}
        for term in self.doc_freq:
            # Smoothed IDF to avoid division by zero
            self.idf[term] = math.log((self.n_docs + 1) / (self.doc_freq[term] + 1)) + 1

    def get_stopwords_by_frequency(
        self,
        top_n: int = None,
        min_frequency: float = None
    ) -> Set[str]:
        """
        Get stop words as the most frequent terms.

        Args:
            top_n: Return top N most frequent words
            min_frequency: Return words appearing in > min_frequency proportion
        """
        if top_n:
            return set([word for word, _ in self.term_freq.most_common(top_n)])
        if min_frequency:
            threshold = self.total_tokens * min_frequency
            return set([word for word, freq in self.term_freq.items()
                        if freq > threshold])
        return set()

    def get_stopwords_by_document_frequency(
        self,
        min_doc_ratio: float = 0.8
    ) -> Set[str]:
        """
        Get stop words as terms appearing in many documents.
        Terms in > min_doc_ratio of documents are considered stop words.
        """
        threshold = self.n_docs * min_doc_ratio
        return set([word for word, freq in self.doc_freq.items()
                    if freq > threshold])

    def get_stopwords_by_idf(
        self,
        max_idf: float = 1.5
    ) -> Set[str]:
        """
        Get stop words as terms with low IDF (appear in many documents).
        Low IDF = high document frequency = likely stop word.
        """
        return set([word for word, idf in self.idf.items() if idf < max_idf])

    def get_combined_stopwords(
        self,
        top_n: int = 50,
        min_doc_ratio: float = 0.5
    ) -> Set[str]:
        """
        Combine multiple criteria for robust stop-word identification.
        A word is a stop word if it meets BOTH criteria.
        """
        freq_stops = self.get_stopwords_by_frequency(top_n=top_n)
        doc_freq_stops = self.get_stopwords_by_document_frequency(min_doc_ratio)

        # Intersection: must be both frequent AND appear in many documents
        return freq_stops & doc_freq_stops

    def analyze_word(self, word: str) -> dict:
        """Get detailed statistics for a specific word."""
        return {
            "word": word,
            "term_frequency": self.term_freq.get(word, 0),
            "document_frequency": self.doc_freq.get(word, 0),
            "doc_frequency_ratio": self.doc_freq.get(word, 0) / self.n_docs,
            "idf": self.idf.get(word, 0),
            "frequency_ratio": self.term_freq.get(word, 0) / self.total_tokens,
        }


# Example usage with sample corpus
documents = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "dog", "barks", "at", "the", "fox", "in", "the", "garden"],
    ["a", "fox", "and", "a", "dog", "are", "in", "the", "park"],
    ["the", "lazy", "cat", "sleeps", "on", "the", "sofa"],
    ["my", "dog", "loves", "to", "run", "in", "the", "park"],
]

identifier = DynamicStopWordIdentifier(documents)

print("=== Dynamic Stop-word Analysis ===\n")

print("Top 10 most frequent words:")
for word, freq in identifier.term_freq.most_common(10):
    stats = identifier.analyze_word(word)
    print(f"  {word:10} freq={freq:3}  doc_ratio={stats['doc_frequency_ratio']:.2f}  idf={stats['idf']:.3f}")

print("\nStop words by frequency (top 5):")
print(f"  {identifier.get_stopwords_by_frequency(top_n=5)}")

print("\nStop words by document frequency (>60% of docs):")
print(f"  {identifier.get_stopwords_by_document_frequency(min_doc_ratio=0.6)}")

print("\nCombined stop words (top 10 AND >40% docs):")
print(f"  {identifier.get_combined_stopwords(top_n=10, min_doc_ratio=0.4)}")
```

Generic stop-word lists may be inappropriate for specialized domains. A word that is a stop word in general text may carry critical meaning in a specific context, and domain-specific terms may need to be added as stop words.
Medical Domain:
Negation and absence markers ('no', 'not', 'without', 'negative', 'absent') are clinically decisive ("no evidence of malignancy" means the opposite of "evidence of malignancy"), while ubiquitous terms such as 'patient', 'treatment', and dosage units ('mg', 'daily') behave like stop words.
Legal Domain:
Modal verbs ('shall', 'may', 'must') encode the difference between obligation and permission and should be preserved, while archaic boilerplate ('hereby', 'whereas', 'thereof', 'herein') carries little topical signal and is a good candidate for removal.
Technical/Scientific:
Discourse connectives ('however', 'therefore', 'thus') signal argument structure and can matter for some tasks, while structural boilerplate ('figure', 'table', 'et al.', 'respectively') appears in nearly every paper and can be added as domain stop words.
| Domain | Standard Stops That May Need Preservation | Domain-Specific Additions |
|---|---|---|
| Medical/Clinical | 'not', 'no', 'without', 'negative' | 'patient', 'treatment', 'mg', 'daily' |
| Legal | 'shall', 'may', 'must' (modal distinctions) | 'hereby', 'whereas', 'thereof', 'herein' |
| E-commerce Reviews | 'not', 'very', 'really', 'extremely' | 'product', 'item', 'bought', 'ordered' |
| Social Media | 'not', 'but', 'so' | 'lol', 'omg', 'tbh', platform-specific terms |
| Scientific Papers | 'however', 'therefore', 'thus' | 'figure', 'table', 'et al.', 'respectively' |
```python
from typing import Set, List
from dataclasses import dataclass


@dataclass
class DomainStopWords:
    """
    Domain-specific stop-word configurations.
    """
    name: str
    additions: Set[str]      # Words to ADD as stop words
    preservations: Set[str]  # Standard stops to PRESERVE (not remove)


# Pre-defined domain configurations
MEDICAL_CONFIG = DomainStopWords(
    name="medical",
    additions={
        "patient", "patients", "treatment", "hospital", "clinical",
        "study", "mg", "ml", "daily", "administered", "presented",
        "history", "examination", "diagnosis", "prognosis"
    },
    preservations={
        "not", "no", "without", "negative", "none", "never",
        "decreased", "increased", "absent", "present"
    }
)

LEGAL_CONFIG = DomainStopWords(
    name="legal",
    additions={
        "hereby", "herein", "hereof", "thereof", "whereas", "aforesaid",
        "heretofore", "notwithstanding", "pursuant", "plaintiff",
        "defendant", "court", "filed"
    },
    preservations={
        "shall", "may", "must", "will", "should",  # Modal distinctions matter
        "not", "no", "neither", "nor"              # Negation critical
    }
)

ECOMMERCE_CONFIG = DomainStopWords(
    name="ecommerce",
    additions={
        "product", "item", "bought", "ordered", "received", "shipping",
        "delivered", "package", "seller", "store", "price", "purchase",
        "return", "customer"
    },
    preservations={
        "not", "never", "very", "really", "extremely", "highly",
        "worst", "best", "great", "terrible", "amazing", "awful",
        "love", "hate", "recommend"  # Sentiment-carrying words
    }
)


class DomainAwareStopWordRemover:
    """
    Stop-word remover that adapts to specific domains.
    """

    def __init__(self, base_stops: Set[str], domain_config: DomainStopWords = None):
        """
        Initialize with base stops and optional domain configuration.
        """
        self.stops = base_stops.copy()

        if domain_config:
            # Add domain-specific stop words
            self.stops.update(domain_config.additions)
            # Preserve domain-critical words (remove from stop list)
            self.stops -= domain_config.preservations
            self.domain = domain_config.name
        else:
            self.domain = "generic"

    def remove_stopwords(self, tokens: List[str]) -> List[str]:
        """Remove stop words from tokenized text."""
        return [t for t in tokens if t.lower() not in self.stops]

    def explain_removal(self, tokens: List[str]) -> dict:
        """Explain what was removed and why."""
        removed = []
        kept = []
        for token in tokens:
            if token.lower() in self.stops:
                removed.append(token)
            else:
                kept.append(token)

        return {
            "domain": self.domain,
            "original_tokens": len(tokens),
            "removed_count": len(removed),
            "kept_count": len(kept),
            "removed_words": removed,
            "result": kept
        }


# Example usage
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

base_stops = set(stopwords.words('english'))

# Generic remover
generic_remover = DomainAwareStopWordRemover(base_stops)

# Medical remover
medical_remover = DomainAwareStopWordRemover(base_stops, MEDICAL_CONFIG)

# Test sentences
sentences = [
    "The patient did not respond to treatment.",
    "Patient presented with no signs of infection.",
    "This product is not what I expected but I love it.",
]

print("=== Domain-Aware Stop-word Removal ===\n")

for sentence in sentences:
    tokens = word_tokenize(sentence.lower())
    print(f"Input: {sentence}")
    print(f"Tokens: {tokens}")

    generic_result = generic_remover.explain_removal(tokens)
    medical_result = medical_remover.explain_removal(tokens)

    print(f"\nGeneric removal:")
    print(f"  Removed: {generic_result['removed_words']}")
    print(f"  Result:  {generic_result['result']}")

    print(f"\nMedical domain removal:")
    print(f"  Removed: {medical_result['removed_words']}")
    print(f"  Result:  {medical_result['result']}")

    print("---\n")
```

Stop-word removal became standard practice in early information retrieval when computational resources were limited and bag-of-words models dominated. Modern NLP, particularly deep learning approaches, often performs better WITHOUT stop-word removal. Understanding when to skip this step is as important as knowing how to do it.
With BERT, GPT, and modern transformers, stop-word removal is generally discouraged. These models were pre-trained on complete sentences and learn to downweight uninformative tokens automatically through attention mechanisms. Removing stop words creates input distributions the model never saw during pre-training, often degrading performance.
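To make the distribution shift concrete, compare what a BERT tokenizer sees with and without stop-word removal. This is a minimal sketch, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

original = "This is not a bad product."
stripped = "bad product"  # what survives aggressive stop-word removal

print(tokenizer.tokenize(original))  # ['this', 'is', 'not', 'a', 'bad', 'product', '.']
print(tokenizer.tokenize(stripped))  # ['bad', 'product']
# The stripped sequence loses the negation and looks nothing like the
# natural sentences the model saw during pre-training.
```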
"""Demonstration of how stop-word removal affects different NLP tasks."""from nltk.tokenize import word_tokenizefrom nltk.corpus import stopwords stops = set(stopwords.words('english')) def remove_stops(text: str) -> str: tokens = word_tokenize(text.lower()) filtered = [t for t in tokens if t not in stops] return " ".join(filtered) # === Sentiment Analysis Example ===print("=== SENTIMENT ANALYSIS ===\n") sentiment_examples = [ ("I love this movie!", "Positive"), ("I do not love this movie.", "Negative"), ("This is not a bad product.", "Positive (double negative)"), ("I have never been so disappointed.", "Negative"),] for text, true_sentiment in sentiment_examples: cleaned = remove_stops(text) print(f"Original: {text}") print(f"True sentiment: {true_sentiment}") print(f"After stop removal: {cleaned}") # Analyze what's lost original_tokens = set(word_tokenize(text.lower())) cleaned_tokens = set(word_tokenize(cleaned)) removed = original_tokens - cleaned_tokens print(f"Removed words: {removed}") print() # === Question Answering Example ===print("\n=== QUESTION ANSWERING ===\n") questions = [ "Who is the president of the United States?", "What is machine learning?", "Where can I find good restaurants?", "When was Python created?", "Why is the sky blue?", "How does neural network work?",] for q in questions: cleaned = remove_stops(q) original_words = set(word_tokenize(q.lower())) cleaned_words = set(word_tokenize(cleaned)) removed = original_words - cleaned_words print(f"Original: {q}") print(f"Cleaned: {cleaned}") print(f"Lost question context: {'who' in removed or 'what' in removed or 'where' in removed or 'when' in removed or 'why' in removed or 'how' in removed}") print() # === Phrase Meaning Changes ===print("\n=== PHRASE MEANING CHANGES ===\n") phrases = [ "to be or not to be", "I think therefore I am", "The more you know", "Less is more", "It is what it is",] for phrase in phrases: cleaned = remove_stops(phrase) print(f"Original: '{phrase}'") print(f"Cleaned: '{cleaned}'") print()| Task | Model Type | Recommendation | Rationale |
|---|---|---|---|
| Document Classification | Bag-of-Words / TF-IDF | ✅ Remove | Reduces noise, improves signal |
| Document Classification | BERT / Transformers | ❌ Keep | Pre-training expects complete text |
| Search / IR | BM25 / Traditional | ✅ Remove | Query efficiency, index size |
| Sentiment Analysis | Any | ⚠️ Preserve negation | Negation words critical for polarity |
| Machine Translation | Any | ❌ Keep | All words needed for translation |
| Summarization | Extractive | ✅ Remove for scoring | But keep in final output |
| Topic Modeling | LDA / LSA | ✅ Remove | Focus on content words |
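One way to operationalize this table is to make stop-word removal a pipeline flag chosen per task and model family rather than a hard-coded step. The sketch below is illustrative only: the task/model keys and defaults mirror the table above and are not any library's API.

```python
from typing import List, Set

# Illustrative defaults mirroring the table: remove stop words for sparse
# bag-of-words pipelines, keep them for transformer-based models.
REMOVE_STOPWORDS_DEFAULTS = {
    ("classification", "bow"): True,
    ("classification", "transformer"): False,
    ("search", "bm25"): True,
    ("sentiment", "bow"): True,   # but preserve negation, handled below
    ("translation", "transformer"): False,
    ("topic_modeling", "bow"): True,
}

NEGATIONS = {"not", "no", "never", "neither", "nor", "n't"}


def preprocess(tokens: List[str], task: str, model_family: str,
               stops: Set[str]) -> List[str]:
    """Apply stop-word removal only where the task/model combination benefits."""
    if not REMOVE_STOPWORDS_DEFAULTS.get((task, model_family), False):
        return tokens
    # Sentiment pipelines keep negation even when removing other stop words
    keep = NEGATIONS if task == "sentiment" else set()
    return [t for t in tokens if t.lower() not in stops or t.lower() in keep]


tokens = ["this", "is", "not", "a", "bad", "product"]
stops = {"this", "is", "not", "a"}
print(preprocess(tokens, "sentiment", "bow", stops))               # ['not', 'bad', 'product']
print(preprocess(tokens, "classification", "transformer", stops))  # unchanged
```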
Stop-word handling varies significantly across languages. Lists that work for English may be inappropriate for other languages due to differences in grammar, morphology, and frequency distributions.
Language-Specific Challenges:
Morphologically rich languages such as German, Turkish, and Finnish express much grammatical information through affixes and compounds rather than separate words, so a word-level stop list captures less of the "function" material. Some languages have no articles at all (Russian, Chinese), while others attach function morphemes directly to content words (Arabic clitics, for example), which makes stop-word removal heavily dependent on tokenization quality. Frequency distributions also differ, so list sizes and cut-offs do not transfer directly between languages.
```python
from nltk.corpus import stopwords
import spacy

# Languages with bundled NLTK stop-word lists (count varies by NLTK version)
print("NLTK supported languages:")
print(stopwords.fileids())

# Compare stop-word list sizes
languages = ['english', 'spanish', 'french', 'german', 'russian',
             'portuguese', 'italian', 'dutch', 'arabic']

print("\n=== Stop-word List Sizes by Language ===\n")
for lang in languages:
    try:
        stops = stopwords.words(lang)
        print(f"{lang:12} {len(stops):4} stop words")
    except Exception:
        print(f"{lang:12} Not available")

# Examples of language-specific stops
print("\n=== Language-Specific Stop-word Examples ===\n")

lang_examples = {
    'english': ['the', 'a', 'is', 'are', 'was', 'were'],
    'spanish': ['el', 'la', 'los', 'las', 'es', 'son'],
    'french': ['le', 'la', 'les', 'est', 'sont', 'un', 'une'],
    'german': ['der', 'die', 'das', 'ist', 'sind', 'ein', 'eine'],
    'arabic': ['من', 'في', 'على', 'إلى', 'هو', 'هي'],
}

for lang, examples in lang_examples.items():
    all_stops = set(stopwords.words(lang)) if lang in stopwords.fileids() else set()
    in_list = [w for w in examples if w in all_stops]
    print(f"{lang}: {examples}")
    print(f"  In NLTK stops: {in_list}\n")

# Using spaCy for multilingual stop words
print("\n=== spaCy Multilingual Stop Words ===\n")

# Load different language models (requires separate installation)
# python -m spacy download es_core_news_sm
# python -m spacy download de_core_news_sm

try:
    nlp_en = spacy.load("en_core_web_sm")
    nlp_es = spacy.load("es_core_news_sm")
    nlp_de = spacy.load("de_core_news_sm")

    print(f"English spaCy stops: {len(nlp_en.Defaults.stop_words)}")
    print(f"Spanish spaCy stops: {len(nlp_es.Defaults.stop_words)}")
    print(f"German spaCy stops: {len(nlp_de.Defaults.stop_words)}")
except OSError as e:
    print(f"Some language models not installed: {e}")
    print("Install with: python -m spacy download <model_name>")
```

For languages without good stop-word lists, compute frequency statistics on a representative corpus and select the top N most frequent words (typically 100-300). Manually review the list to remove content words that happen to be frequent in your domain. This approach adapts to your specific data and language variety.
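A minimal sketch of that approach: build a candidate list from the top-N most frequent tokens of your own corpus, then review it by hand. The toy Spanish corpus and N values below are placeholders; in practice you would use a large, representative corpus.

```python
from collections import Counter
from typing import Iterable, List


def candidate_stopwords(tokenized_docs: Iterable[List[str]], top_n: int = 200) -> List[str]:
    """Return the top_n most frequent tokens as a candidate stop-word list.

    The output is only a starting point: review it manually and strip any
    domain-relevant content words before using it.
    """
    freqs = Counter()
    for doc in tokenized_docs:
        freqs.update(t.lower() for t in doc)
    return [word for word, _ in freqs.most_common(top_n)]


# Toy corpus for illustration
corpus = [
    ["el", "perro", "corre", "en", "el", "parque"],
    ["la", "casa", "de", "el", "vecino", "es", "grande"],
    ["el", "gato", "duerme", "en", "la", "casa"],
]
print(candidate_stopwords(corpus, top_n=5))
```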
In high-throughput systems processing millions of documents, stop-word removal can become a bottleneck. Proper implementation choices are essential for production performance.
Key Optimizations:
Store the stop list in a set or frozenset for O(1) membership tests instead of scanning a list, normalize case once when the remover is initialized rather than per token, prefer list comprehensions over explicit loops or filter with a lambda, and build the remover once and reuse it across documents. The benchmark below quantifies how much these choices matter.
```python
import time
import sys
from typing import List, Set
from nltk.corpus import stopwords

# Prepare test data
stops_list = stopwords.words('english')    # List
stops_set = set(stops_list)                # Set
stops_frozenset = frozenset(stops_list)    # Frozenset

# Generate test tokens
test_tokens = ["the", "quick", "brown", "fox", "jumps"] * 10000


def benchmark(func, tokens, n_iterations=100):
    """Benchmark a function."""
    start = time.perf_counter()
    for _ in range(n_iterations):
        result = func(tokens)
    elapsed = time.perf_counter() - start
    return elapsed / n_iterations


# Different implementations to benchmark

def remove_stops_list(tokens: List[str]) -> List[str]:
    """Using list membership (slow)."""
    return [t for t in tokens if t not in stops_list]


def remove_stops_set(tokens: List[str]) -> List[str]:
    """Using set membership (fast)."""
    return [t for t in tokens if t not in stops_set]


def remove_stops_frozenset(tokens: List[str]) -> List[str]:
    """Using frozenset membership (fast, immutable)."""
    return [t for t in tokens if t not in stops_frozenset]


def remove_stops_explicit_loop(tokens: List[str]) -> List[str]:
    """Using explicit for loop (slower than comprehension)."""
    result = []
    for t in tokens:
        if t not in stops_set:
            result.append(t)
    return result


def remove_stops_filter(tokens: List[str]) -> List[str]:
    """Using filter function."""
    return list(filter(lambda t: t not in stops_set, tokens))


# Run benchmarks
print("=== Stop-word Removal Performance Benchmark ===\n")
print(f"Test data: {len(test_tokens)} tokens\n")

implementations = [
    ("List membership", remove_stops_list),
    ("Set membership", remove_stops_set),
    ("Frozenset membership", remove_stops_frozenset),
    ("Explicit loop + set", remove_stops_explicit_loop),
    ("Filter + lambda + set", remove_stops_filter),
]

times = []
for name, func in implementations:
    elapsed = benchmark(func, test_tokens)
    times.append((name, elapsed))
    print(f"{name:25} {elapsed*1000:8.3f} ms")

# Calculate speedups relative to slowest
base_time = max(t[1] for t in times)
print("\nSpeedup vs slowest:")
for name, elapsed in times:
    speedup = base_time / elapsed
    print(f"  {name:25} {speedup:5.1f}x")

# Memory considerations
print(f"\n=== Memory Usage ===")
print(f"List size:      {sys.getsizeof(stops_list):6} bytes")
print(f"Set size:       {sys.getsizeof(stops_set):6} bytes")
print(f"Frozenset size: {sys.getsizeof(stops_frozenset):6} bytes")
```

Stop-word removal is a nuanced preprocessing decision, not a default step to apply blindly. The choice of whether and how to remove stop words should be driven by your specific task, model architecture, and domain requirements.
What's Next:
With tokenization and stop-word removal covered, we turn to stemming—the process of reducing words to their root forms. We'll explore the Porter and Snowball algorithms, understand their heuristic nature, and learn when this aggressive normalization helps versus when lemmatization is preferred.
You now understand stop-word removal from theoretical foundations through production implementation. You can make informed decisions about whether to remove stop words, which words to preserve for your domain, and how to implement removal efficiently in high-throughput systems.