Consider two search queries: "The Office" and "to be or not to be".
Now consider what happens if we blindly remove common words like "the," "to," "be," "or," and "not": the first query collapses to "Office," and the second disappears entirely.
This is the stopword paradox. The same words that add noise to most queries carry essential meaning in others. A search system that removes "the" helps 99% of users but breaks searches for "The Who," "The Office," or "The New York Times."
Stopword handling is not a simple on/off decision. It requires understanding your content, your users, and the trade-offs between index efficiency and search quality.
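To make the paradox concrete, here is a minimal sketch of naive stopword removal. The stopword list and queries are illustrative, not taken from any production system:

```typescript
// A deliberately naive stopword filter, to show how it breaks certain queries.
const NAIVE_STOPWORDS = new Set(["the", "to", "be", "or", "not", "a", "an", "of", "for"]);

function removeStopwords(query: string): string {
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => !NAIVE_STOPWORDS.has(token))
    .join(" ");
}

console.log(removeStopwords("best laptop for the money")); // "best laptop money" - fine
console.log(removeStopwords("The Who"));                   // "who" - band name mangled
console.log(removeStopwords("to be or not to be"));        // "" - the entire query vanishes
```

The first query degrades gracefully, but "The Who" loses its identity and the Hamlet quote vanishes entirely.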
By the end of this page, you will understand when and why to remove stopwords, the risks of over-aggressive removal, how to build domain-appropriate stopword lists, and advanced techniques for handling stopwords in phrase matching and relevance scoring.
Stopwords are words that appear so frequently in a language that they carry little discriminative value for search. In English, the top 100 most common words account for roughly 50% of all word occurrences in typical text. Including these words in the search index creates significant overhead while providing minimal relevance signal.
Common English stopwords include articles ("a," "an," "the"), prepositions ("in," "on," "of," "to," "for"), conjunctions ("and," "or," "but"), and auxiliary verbs ("is," "are," "was," "be").
These words appear in almost every document, so matching on them provides little information about which documents are more relevant than others.
```typescript
// Zipf's Law: word frequency follows a power-law distribution.
// The most common word appears ~2x as often as the second most common,
// ~3x as often as the third, and so on.

// Example from analyzing 1 million documents:
const wordFrequencies = [
  { word: "the",  frequency: 69_971_528, rank: 1,  percentOfCorpus: 7.0 },
  { word: "of",   frequency: 36_411_928, rank: 2,  percentOfCorpus: 3.6 },
  { word: "and",  frequency: 28_765_432, rank: 3,  percentOfCorpus: 2.9 },
  { word: "to",   frequency: 26_148_764, rank: 4,  percentOfCorpus: 2.6 },
  { word: "a",    frequency: 21_341_218, rank: 5,  percentOfCorpus: 2.1 },
  { word: "in",   frequency: 20_312_987, rank: 6,  percentOfCorpus: 2.0 },
  { word: "is",   frequency: 14_987_234, rank: 7,  percentOfCorpus: 1.5 },
  { word: "that", frequency: 12_384_756, rank: 8,  percentOfCorpus: 1.2 },
  { word: "it",   frequency: 10_124_879, rank: 9,  percentOfCorpus: 1.0 },
  { word: "was",  frequency: 9_876_543,  rank: 10, percentOfCorpus: 1.0 },
];

// The top 10 words alone represent ~25% of all word occurrences!
// The top 100 words represent ~50% of all word occurrences.

// Implication for search:
// If we keep these words in the index, "the" appears in effectively
// 100% of documents. A search for "the" would match everything,
// providing zero discriminative value while consuming massive storage.

interface IndexImpact {
  withStopwords: {
    indexSize: string;
    averageTokensPerDoc: number;
    searchTimeMs: number;
  };
  withoutStopwords: {
    indexSize: string;
    averageTokensPerDoc: number;
    searchTimeMs: number;
  };
}

const realWorldImpact: IndexImpact = {
  withStopwords: {
    indexSize: "42 GB",
    averageTokensPerDoc: 847,
    searchTimeMs: 23
  },
  withoutStopwords: {
    indexSize: "28 GB",       // 33% smaller!
    averageTokensPerDoc: 423, // 50% fewer tokens
    searchTimeMs: 14          // 40% faster search
  }
};
```

The concept of stopwords was introduced by Hans Peter Luhn in the 1950s during his pioneering work on automatic text processing at IBM. The term comes from the idea of "stopping" these words from being indexed.
Originally, stopword removal was motivated primarily by storage constraints—but the principle remains valuable even with modern hardware.
Removing stopwords provides three major benefits: reduced index size, faster queries, and improved relevance. Understanding these benefits helps justify the trade-offs involved.
| Metric | With Stopwords | Without Stopwords | Improvement |
|---|---|---|---|
| Index size | 52.3 GB | 34.8 GB | 33% smaller |
| Indexing time (10M docs) | 47 minutes | 31 minutes | 34% faster |
| Average query latency | 42 ms | 28 ms | 33% faster |
| P99 query latency | 187 ms | 124 ms | 34% faster |
| Memory footprint | 8.2 GB | 5.4 GB | 34% smaller |
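The token-count savings behind these numbers can be sketched directly. The stopword list and sample text below are illustrative; real savings depend on your corpus and analyzer:

```typescript
// Sketch: measure token reduction from stopword filtering on one document.
const COMMON_WORDS = new Set([
  "the", "a", "an", "and", "or", "is", "are", "to", "of", "in", "for", "with", "that", "it", "was",
]);

function tokenize(text: string): string[] {
  // Lowercase and split on non-alphanumeric runs.
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

const doc =
  "The index size is a function of the number of tokens that are stored in the posting lists.";

const allTokens = tokenize(doc);
const keptTokens = allTokens.filter((t) => !COMMON_WORDS.has(t));

console.log(`tokens before: ${allTokens.length}`);  // tokens before: 18
console.log(`tokens after:  ${keptTokens.length}`); // tokens after:  8
const saved = 100 * (1 - keptTokens.length / allTokens.length);
console.log(`reduction: ${saved.toFixed(0)}%`);     // reduction: 56%
```

Over millions of documents, that per-document reduction compounds into the index-size and latency improvements shown in the table.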
```typescript
// Query execution comparison: with and without stopwords

// User query: "what is the best programming language for web development"

// WITH stopwords indexed:
// Query becomes:
//   (what OR is OR the OR best OR programming OR language OR for OR web OR development)
//
// Posting lists to merge:
//   "what"        → 4,892,341 documents
//   "is"          → 9,123,456 documents ← almost every document!
//   "the"         → 9,456,789 documents ← almost every document!
//   "best"        → 1,234,567 documents
//   "programming" →   567,890 documents
//   "language"    →   891,234 documents
//   "for"         → 8,765,432 documents ← almost every document!
//   "web"         → 2,345,678 documents
//   "development" → 1,987,654 documents
//
// Total posting list entries to process: ~39 million
// Most work is wasted on low-value terms

// WITHOUT stopwords:
// Query becomes:
//   (best OR programming OR language OR web OR development)
//
// Posting lists to merge:
//   "best"        → 1,234,567 documents
//   "programming" →   567,890 documents
//   "language"    →   891,234 documents
//   "web"         → 2,345,678 documents
//   "development" → 1,987,654 documents
//
// Total posting list entries to process: ~7 million
// 82% reduction in work for the query engine

// Additionally, relevance scoring focuses on meaningful terms:
// Without stopwords, "programming language web development"
// receives full scoring weight instead of being diluted by
// high-frequency low-value terms.
```

The benefits of stopword removal are compelling, but aggressive removal creates significant problems in specific scenarios. Understanding when stopwords carry meaning is essential for avoiding search failures.
```typescript
// Real-world stopword removal failures

const problemQueries = [
  // Named entities broken by stopword removal
  {
    query: "The Office",
    withStopwords: "The Office", // Correct: finds the TV show
    withoutStopwords: "Office",  // Broken: finds Microsoft Office, generic offices
    failure: "Article is part of proper name"
  },
  // Phrases destroyed
  {
    query: "to be or not to be",
    withStopwords: "to be or not to be",
    withoutStopwords: "",        // Complete disaster: empty query!
    failure: "Entire query is stopwords"
  },
  // Negation inverted
  {
    query: "not working",
    withStopwords: "not working", // Finds bug reports, troubleshooting
    withoutStopwords: "working",  // Finds the opposite: things that ARE working
    failure: "Meaning inverted by removing 'not'"
  },
  // Technical meaning lost
  {
    query: "if else statement",
    withStopwords: "if else statement",
    withoutStopwords: "statement", // How do you search for if/else now?
    failure: "Keywords are programming constructs"
  },
  // Song/media titles
  {
    query: "Let It Be",
    withStopwords: "Let It Be",
    withoutStopwords: "Let",      // Loses the song entirely
    failure: "Song title depends on stopwords"
  },
  // Abbreviation expansion gone wrong
  {
    query: "the university of texas",
    withStopwords: "the university of texas",
    withoutStopwords: "university texas", // Might work, but misses "UT"
    failure: "Prepositions help identify proper institutions"
  }
];

// The key insight: stopwords are only "stop" words when they truly
// provide no discriminative value. Context matters enormously.
```

Removing negation words ("not", "no", "never", "without") is the most dangerous stopword mistake. A search for "laptop not working" that becomes "laptop working" returns the exact opposite of what the user needs. Many production systems specifically exclude negation words from stopword lists even when removing other common words.
Given the trade-offs, there are several strategies for handling stopwords. The right choice depends on your content domain, query patterns, and quality requirements.
| Strategy | Description | Pros | Cons | Best For |
|---|---|---|---|---|
| Full Removal | Remove all stopwords from index and queries | Maximum storage/speed benefits | Breaks phrase search, named entities, negation | Large corpora where phrase search isn't needed |
| Query-time Only | Index all words; remove stopwords from queries only | Queries still get faster; preserves phrase potential; simpler index config | No storage or indexing savings | Systems needing a phrase-search fallback |
| Selective Removal | Remove most stopwords, but keep meaning-bearing ones (negation, programming keywords) | Balances efficiency with quality | Requires thoughtful list curation | Most production search systems |
| High-Frequency Cutoff | Remove only words appearing in >50% of documents | Dynamic; adapts to the corpus | Requires corpus analysis; list drifts as the corpus changes | Specialized collections |
| Position-Based | Remove stopwords but record positions for phrase matching | Enables phrase search despite removal | More complex; position storage overhead | Systems needing both efficiency and exact phrases |
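The query-time-only strategy from the table can be sketched as a small query rewriter; the index keeps every token, and stopwords are stripped from the query instead. The list below is deliberately aggressive to exercise the empty-query guard, and all names are illustrative:

```typescript
// Query-time stopword removal: the index is untouched; only the query shrinks.
const QUERY_STOPWORDS = new Set([
  "the", "a", "an", "is", "of", "to", "be", "or", "not", "for", "and",
]);

function rewriteQuery(rawQuery: string): string {
  const tokens = rawQuery.toLowerCase().split(/\s+/).filter(Boolean);
  const kept = tokens.filter((t) => !QUERY_STOPWORDS.has(t));
  // Guard against the "to be or not to be" disaster: if stripping removes
  // everything, fall back to the original tokens, which can still match
  // because the index kept all words.
  return kept.length > 0 ? kept.join(" ") : tokens.join(" ");
}

console.log(rewriteQuery("what is the best programming language for web development"));
// "what best programming language web development"
console.log(rewriteQuery("to be or not to be"));
// "to be or not to be" - fallback kicks in
```

Because every token is still indexed, the fallback query remains answerable, which is exactly the safety net full index-time removal cannot offer.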
```typescript
// Recommended: Selective stopword removal

// Start with a conservative list and expand based on analysis
const conservativeStopwords = {
  // Articles - generally safe to remove
  articles: ["a", "an", "the"],

  // Prepositions - mostly safe, but watch for named entities
  prepositions: ["in", "on", "at", "to", "for", "with", "by", "from", "up", "down"],

  // Conjunctions - safe except in programming contexts
  conjunctions: ["and", "or", "but", "so", "yet"],

  // Auxiliary verbs - safe for general search
  auxiliaries: ["is", "are", "was", "were", "be", "been", "being",
                "have", "has", "had", "do", "does", "did"],

  // Common adverbs - usually safe
  adverbs: ["very", "just", "also", "even", "still", "already"],

  // Pronouns - safe except in quote/lyric search
  pronouns: ["i", "you", "he", "she", "it", "we", "they",
             "me", "him", "her", "us", "them"],

  // EXPLICITLY EXCLUDED from stopwords (these carry meaning):
  keepThese: [
    "not", "no", "never", "without", // Negation - critical for meaning
    "if", "else", "then",            // Programming keywords
    "null", "true", "false",         // Programming values
    "all", "none", "any", "every",   // Quantifiers often matter
    "this", "that",                  // Often part of names
  ]
};

// Elasticsearch configuration
const selectiveStopwordConfig = {
  "settings": {
    "analysis": {
      "filter": {
        "custom_stop": {
          "type": "stop",
          "stopwords": [
            // Combine safe categories
            ...conservativeStopwords.articles,
            ...conservativeStopwords.prepositions,
            ...conservativeStopwords.conjunctions,
            ...conservativeStopwords.auxiliaries,
            ...conservativeStopwords.adverbs,
            ...conservativeStopwords.pronouns
          ],
          "ignore_case": true
        }
      },
      "analyzer": {
        "content_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "custom_stop", "porter_stem"]
        }
      }
    }
  }
};
```

Begin with a small stopword list and monitor search quality. Add words to the list only when you confirm they're causing noise without contributing value. It's much easier to add stopwords later than to debug why valid searches fail after aggressive removal.
Generic stopword lists work for general content, but specialized domains often need custom lists. A legal search system might treat "court" differently than a basketball search system. Building the right list requires corpus analysis.
```typescript
// Automated stopword list generation from corpus analysis

interface TermStats {
  term: string;
  documentFrequency: number; // How many docs contain this term
  idfScore: number;          // Inverse document frequency
}

// Note: Elasticsearch's _termvectors API works per document, so corpus-wide
// document frequencies are gathered here with a terms aggregation instead.
// Aggregating on a text field's tokens requires fielddata: true on that
// field (memory-heavy; consider running this against a sampled index).
async function analyzeCorpusForStopwords(
  client: ElasticsearchClient,
  index: string,
  minDocFrequencyPercent: number = 30, // Appears in 30%+ of docs
  topN: number = 200
): Promise<string[]> {
  // Get total document count
  const countResult = await client.count({ index });
  const totalDocs = countResult.count;

  // Get the most frequent tokens and their document frequencies
  const aggResult = await client.search({
    index,
    size: 0,
    aggs: {
      frequent_terms: {
        terms: { field: "content", size: topN * 2 }
      }
    }
  });

  // Calculate document frequency for each term
  const termStats: TermStats[] = [];
  for (const bucket of aggResult.aggregations.frequent_terms.buckets) {
    const docFreq = bucket.doc_count;
    const docFreqPercent = (docFreq / totalDocs) * 100;

    // If the term appears in more than the threshold share of documents
    if (docFreqPercent >= minDocFrequencyPercent) {
      termStats.push({
        term: bucket.key,
        documentFrequency: docFreq,
        idfScore: Math.log(totalDocs / docFreq)
      });
    }
  }

  // Sort by document frequency (highest first)
  termStats.sort((a, b) => b.documentFrequency - a.documentFrequency);

  // Return top N candidates
  const candidates = termStats.slice(0, topN).map(t => t.term);

  console.log("Stopword candidates (manual review recommended):");
  console.log("Term            | DocFreq% | IDF Score");
  console.log("-".repeat(40));
  termStats.slice(0, 50).forEach(t => {
    const pct = ((t.documentFrequency / totalDocs) * 100).toFixed(1);
    console.log(`${t.term.padEnd(15)} | ${pct.padStart(7)}% | ${t.idfScore.toFixed(3)}`);
  });

  return candidates;
}

// Example output for an e-commerce product catalog:
// Term            | DocFreq% | IDF Score
// ----------------------------------------
// product         |   89.3%  | 0.113  ← "product" on a product site - true stopword!
// for             |   78.4%  | 0.243
// with            |   71.2%  | 0.340
// and             |   69.8%  | 0.359
// the             |   65.4%  | 0.424
// new             |   45.2%  | 0.794  ← Careful! "New" might be a feature
// free            |   42.1%  | 0.865  ← Careful! "Free shipping" is important
// ...

// Domain-specific stopwords for different contexts:

const domainStopwords = {
  ecommerce: [
    "product", "item", "listing", "sale", "shop", "buy",
    "order", "shipping", "available", "stock"
    // These appear on almost every product page but don't help search
  ],

  legal: [
    "court", "case", "law", "legal", "attorney", "counsel",
    "plaintiff", "defendant", "pursuant"
    // Wait - these might actually be useful for faceting!
  ],

  medical: [
    "patient", "treatment", "condition", "doctor", "medical",
    "hospital", "clinical", "diagnosis"
    // Careful: these help distinguish medical content
  ],

  recipe: [
    "recipe", "ingredients", "minutes", "serves", "cook",
    "preparation", "instructions"
    // Every recipe has these - true stopwords for recipe search
  ]
};
```

Automated analysis identifies candidates, but human judgment determines the final list. A word appearing in 80% of documents might still be essential for disambiguation. For example, "court" on a legal site might be very common but critical for distinguishing court documents from law review articles.
One of the biggest challenges with stopword removal is maintaining phrase matching capability. If "the" is removed, how can we still match the exact phrase "The Lord of the Rings"? Modern search engines use position preservation to solve this problem.
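The mechanism can be sketched in a few lines, assuming a simple whitespace tokenizer. The types, names, and stopword list here are illustrative, not any engine's internals:

```typescript
// Position-preserving stopword removal plus relative-position phrase matching.
interface PositionedToken {
  term: string;
  position: number; // position in the ORIGINAL token stream
}

const STOP_SET = new Set(["the", "of", "a", "an"]);

function indexTokens(text: string): PositionedToken[] {
  return text
    .toLowerCase()
    .split(/\s+/)
    .map((term, position) => ({ term, position })) // number BEFORE filtering
    .filter((t) => !STOP_SET.has(t.term));         // drop stopwords, keep the gaps
}

// Phrase match: every query term must appear at the same relative offset.
function phraseMatches(doc: PositionedToken[], query: PositionedToken[]): boolean {
  if (query.length === 0) return false;
  return doc.some((anchor) =>
    query.every((q) =>
      doc.some(
        (d) =>
          d.term === q.term &&
          d.position - anchor.position === q.position - query[0].position
      )
    )
  );
}

const doc = indexTokens("The Lord of the Rings"); // lord@1, rings@4
const query = indexTokens("Lord of the Rings");   // lord@0, rings@3

console.log(phraseMatches(doc, query)); // true: relative gap of 3 on both sides
```

Because both sides keep original positions, the "lord"/"rings" gap of 3 survives stopword removal and the phrase still matches.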
```typescript
// How position-aware indexing preserves phrase matching

// Original document: "The quick brown fox"
//
// Traditional stopword removal:
//   Tokens:    ["quick", "brown", "fox"]
//   Positions: [0, 1, 2]
//   Problem: Token positions are renumbered. Phrase queries fail.
//
// Position-preserving stopword removal:
//   Tokens:    ["quick", "brown", "fox"]
//   Positions: [1, 2, 3]  ← Original positions preserved!
//   "The" was at position 0 but is not indexed.
//
// Now a phrase query for "brown fox" (adjacent in the original) still works
// because "brown" is at position 2 and "fox" is at position 3.

// Elasticsearch configuration:
const positionAwareConfig = {
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "position_aware_english": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "english_stop"
            // Position increments are preserved by default
          ]
        }
      }
    }
  }
};

// How phrase queries work with stopwords removed:

// Document indexed: "The Lord of the Rings"
// After stopword removal with positions:
//   "lord"  at position 1
//   "rings" at position 4
//   Positions 0, 2, 3 have no tokens (stopwords removed)

// Query: phrase "Lord of the Rings" (exact match)
// After stopword removal with positions:
//   "lord"  at position 0
//   "rings" at position 3
//
// Position difference: 3 (from 0 to 3 in the query, from 1 to 4 in the document)
// The relative positions match! The phrase query succeeds.

// However, there's a subtlety: if position increments are NOT preserved
// (older analyzers, or position increments disabled), both "lord rings"
// and "lord of the rings" collapse to adjacent tokens and match the same
// documents. This can be a feature (flexibility) or a bug (precision loss).

// For exact phrase matching including stopwords:
// Use a separate field that doesn't remove stopwords
const hybridMapping = {
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "position_aware_english", // Stopwords removed
        "fields": {
          "exact": {
            "type": "text",
            "analyzer": "standard" // Keeps stopwords
          }
        }
      }
    }
  }
};

// Query strategy:
// 1. For flexible matching: query the "title" field
// 2. For exact phrase matching: query the "title.exact" field
```

Some search systems use "common grams" instead of stopword removal. Common grams keep stopwords but additionally index bigrams that include them: "the_quick" along with "the" and "quick". This maintains phrase precision while still reducing the impact of common words in scoring. The trade-off is a larger index.
Even when stopwords are indexed, modern relevance algorithms naturally downweight them. Understanding how TF-IDF and BM25 handle high-frequency terms explains why removal is an optimization, not a requirement for relevance.
```typescript
// How BM25 naturally handles stopwords

// BM25 scoring formula (simplified):
//   score(q, d) = Σ IDF(qi) * (tf(qi, d) * (k1 + 1)) / (tf(qi, d) + k1 * (1 - b + b * |d|/avgdl))

// The key is IDF (Inverse Document Frequency):
//   IDF(t) = log((N - df(t) + 0.5) / (df(t) + 0.5))
// Where:
//   N     = total documents
//   df(t) = documents containing term t

// For stopwords:
// - They appear in nearly every document: df(t) ≈ N
// - Therefore: IDF ≈ log(0.5 / (N + 0.5)), a large NEGATIVE number
// - Many implementations floor IDF at 0 for terms appearing in >50% of docs

// Example with real numbers:
const calculateIDF = (totalDocs: number, docFreq: number): number => {
  // Elasticsearch's BM25 IDF formula (always non-negative)
  return Math.log(1 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5));
};

const corpus = {
  totalDocs: 1_000_000
};

// Regular term: "programming"
const programmingIDF = calculateIDF(corpus.totalDocs, 50_000); // In 5% of docs
// IDF ≈ 3.00 - High weight, this term matters!

// Stopword: "the"
const theIDF = calculateIDF(corpus.totalDocs, 950_000); // In 95% of docs
// IDF ≈ 0.05 - Near-zero weight, the term is almost ignored

// Practical implications:
// If the query is "the best programming books"
//   Weights: "the"(0.05) + "best"(1.2) + "programming"(3.0) + "books"(1.8)
//   "The" contributes under 1% of the score.

// So why remove stopwords if BM25 handles them?
//
// 1. SPEED: Even with low weight, we still process the posting list -
//    "the" has 950,000 documents to iterate through
//
// 2. INDEX SIZE: "the" still consumes storage for 950,000 postings
//
// 3. PHRASE QUERIES: Position matching still considers stopwords -
//    without special handling, "the best" won't match "best"
//
// 4. HIGHLIGHTING: We still need to track stopword positions for snippets

// Conclusion: BM25 makes stopword removal an optimization, not a necessity,
// but for large-scale systems that optimization matters significantly
```

An alternative to index-time removal is query-time removal: keep stopwords in the index but strip them from queries. This preserves phrase matching capability (the stopwords are still there for position matching) while still speeding up queries. The trade-off is a larger index.
Let's put everything together with a complete, production-ready stopword configuration that balances efficiency with search quality.
```typescript
// Production-ready stopword configuration

const stopwordConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // Option 1: Built-in English stopword list
        "english_stop_builtin": {
          "type": "stop",
          "stopwords": "_english_"
        },

        // Option 2: Custom conservative list (recommended)
        "conservative_stop": {
          "type": "stop",
          "stopwords": [
            // Articles
            "a", "an", "the",
            // Prepositions
            "in", "on", "at", "to", "for", "with", "by", "from",
            "up", "down", "into", "onto",
            // Conjunctions
            "and", "or", "but", "so", "yet",
            // Auxiliaries
            "is", "am", "are", "was", "were", "be", "been", "being",
            "have", "has", "had", "do", "does", "did",
            "will", "would", "could", "should",
            // Common pronouns
            "i", "you", "he", "she", "it", "we", "they",
            "me", "him", "her", "us", "them",
            "my", "your", "his", "its", "our", "their",
            // Common adverbs
            "very", "just", "also", "even", "still", "already", "only"
          ],
          "ignore_case": true
        },

        // Option 3: Aggressive list for large corpora
        "aggressive_stop": {
          "type": "stop",
          "stopwords_path": "analysis/aggressive_stopwords.txt", // External file
          "ignore_case": true
        },

        // Option 4: Custom domain list loaded from file
        "domain_stop": {
          "type": "stop",
          "stopwords_path": "analysis/domain_stopwords.txt",
          "ignore_case": true
        }
      },
      "analyzer": {
        // Default analyzer with conservative stopwords
        "default_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "conservative_stop", "porter_stem"]
        },
        // Analyzer that keeps stopwords for exact phrase matching
        "phrase_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"] // No stopword filter
        },
        // Domain-optimized analyzer
        "domain_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "conservative_stop",
            "domain_stop", // Additional domain-specific removal
            "porter_stem"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "default_analyzer",
        "fields": {
          // For exact phrase queries that need stopwords
          "phrase": {
            "type": "text",
            "analyzer": "phrase_analyzer"
          }
        }
      },
      // Title: more conservative (stopwords might be in names)
      "title": {
        "type": "text",
        "analyzer": "phrase_analyzer", // Keep stopwords
        "fields": {
          "search": {
            "type": "text",
            "analyzer": "default_analyzer" // Removes stopwords
          }
        }
      }
    }
  }
};

// Query-time strategy:
async function searchWithFallback(query: string, client: ElasticsearchClient) {
  // First: try a phrase match (stopwords preserved)
  const phraseResult = await client.search({
    index: "content",
    body: {
      query: {
        match_phrase: {
          "content.phrase": {
            query,
            slop: 1 // Allow small variations
          }
        }
      }
    }
  });

  if (phraseResult.hits.total.value > 0) {
    return phraseResult;
  }

  // Fallback: regular match (stopwords removed)
  return client.search({
    index: "content",
    body: {
      query: {
        match: {
          content: query // Uses default_analyzer
        }
      }
    }
  });
}
```

Stopword handling is a balancing act between efficiency and completeness. The right approach depends on your content, your queries, and your tolerance for missed matches versus noise.
What's next:
With tokenization, stemming, and stopwords covered, we move to a more complex challenge: language handling. The next page explores how search systems deal with multiple languages, character encoding, transliteration, and the special processing required for non-Latin scripts.
You now understand when and how to remove stopwords, the risks of over-aggressive removal, and strategies for preserving search quality. This knowledge enables you to tune stopword handling for your specific domain and use case.