When you type a query into a search box, something remarkable happens in milliseconds: the search engine must understand what you're looking for, locate relevant documents among potentially billions of candidates, and rank them by relevance. But how does a computer—which fundamentally operates on bytes and numbers—understand the nuanced meaning of human language?
The answer begins with tokenization and analysis.
At its core, full-text search is a translation problem. Human language is fluid, ambiguous, and context-dependent. Computers require structure, precision, and determinism. The text analysis pipeline bridges this gap, transforming raw text into searchable, normalized tokens that enable both efficient retrieval and accurate matching.
By the end of this page, you will understand the complete text analysis pipeline used in production search systems: character filters, tokenizers, and token filters. You'll learn how decisions at each stage impact search quality, performance, and relevance—and why getting this foundation right is critical before considering any other search optimization.
Before text can be searched, it must be transformed through a series of processing stages collectively called the analysis pipeline (or analyzer). This pipeline operates identically at both index time (when documents are stored) and query time (when searches are performed), ensuring that what users search for matches what was indexed.
The standard pipeline consists of three distinct phases: character filters, which clean and normalize the raw character stream; a tokenizer, which splits that stream into discrete terms; and token filters, which modify, remove, or expand those terms.
Each phase is critical, and errors at any stage propagate through the entire search experience. A poorly chosen tokenizer can fragment words incorrectly. An aggressive token filter might remove important signals. A missing character filter might allow invisible characters to pollute the index.
Raw Input: "The Quick-Brown Fox Jumps Over the Lazy Dog!!!" ↓ Character Filter (remove punctuation, lowercase) Cleaned: "the quick-brown fox jumps over the lazy dog" ↓ Tokenizer (whitespace + hyphen splitting) Tokens: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"] ↓ Token Filters (stopword removal, stemming) Final Tokens: ["quick", "brown", "fox", "jump", "lazi", "dog"]Notice how the original 9-word sentence becomes 6 indexed tokens. Common words like "the" and "over" are removed as stopwords. The word "jumps" becomes "jump" through stemming, allowing it to match queries for "jumping," "jumped," or "jump." Even "lazy" becomes "lazi" through the same stemming process.
Why this matters for system design:
Every decision in this pipeline cascades through the entire search experience: aggressive normalization (heavy stemming, broad synonyms, stopword removal) increases recall but risks surfacing irrelevant matches, while conservative analysis preserves precision but misses valid variants.
The art of search engineering lies in tuning this balance for each specific use case.
The same analyzer (or compatible analyzers) must be applied at both index time and query time. If a document contains 'running' indexed as 'run', but the query analyzer doesn't apply the same stemmer, searching for 'running' will fail to match. This is one of the most common causes of search bugs in production systems.
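A minimal sketch of making that contract explicit, assuming Elasticsearch-style settings (the index name, field name, and analyzer name here are hypothetical):

```javascript
// Pin the index-time and search-time analyzers to the same definition so that
// documents and queries are normalized identically. Names are illustrative.
const articlesIndex = {
  "settings": {
    "analysis": {
      "analyzer": {
        "english_text": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "porter_stem"] // "running" → "run" at index time
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "english_text",        // applied when documents are indexed
        "search_analyzer": "english_text"  // applied to query text at search time
        // If the search analyzer skipped stemming, a query for "running" would
        // look up the token "running" while the index only contains "run",
        // and the document would silently fail to match.
      }
    }
  }
};
```

In Elasticsearch, `search_analyzer` defaults to `analyzer`, so the risk is highest when the two are configured separately (as in the e-commerce example later on this page) and then drift apart.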
Character filters operate on the raw input stream before tokenization. They add, remove, or replace characters in the source text. While often overlooked, character filters handle critical preprocessing that tokenizers cannot perform efficiently.
Common character filter use cases include stripping HTML markup, mapping symbols and technical notation to searchable equivalents (for example, "&" to "and" or "C++" to "Cpp"), and regex-based pattern replacement such as collapsing runs of whitespace.
```javascript
// Example: Custom character filter configuration in Elasticsearch
const analyzerConfig = {
  "settings": {
    "analysis": {
      "char_filter": {
        // HTML stripping
        "html_stripper": {
          "type": "html_strip",
          "escaped_tags": ["b", "i"] // Keep bold/italic for emphasis detection
        },
        // Mapping replacements for technical content
        "tech_normalizer": {
          "type": "mapping",
          "mappings": [
            "C++ => Cpp",
            "C# => CSharp",
            "F# => FSharp",
            ".NET => DotNet",
            "& => and",
            "@ => at"
          ]
        },
        // Pattern replacement for code cleanup
        "code_cleanup": {
          "type": "pattern_replace",
          "pattern": "\\s{2,}",  // Multiple whitespace
          "replacement": " "     // Single space
        }
      },
      "analyzer": {
        "content_analyzer": {
          "char_filter": ["html_stripper", "tech_normalizer", "code_cleanup"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "porter_stem"]
        }
      }
    }
  }
};

// Application: Processing technical blog content
const rawContent = `<p>Learning C++ and C# can be valuable.
The .NET framework provides many features.</p>`;

// After character filtering:
// "Learning Cpp and CSharp can be valuable. The DotNet framework provides many features."
```

Character filters execute in the order specified. If you have a pattern filter that normalizes whitespace and an HTML stripper, place the HTML stripper first. Otherwise, the pattern filter might operate on HTML entities that will later be removed, wasting processing time and potentially causing unexpected behavior.
Production considerations for character filters: they run on every document and every query, so expensive regex patterns add CPU cost at scale; they execute in the order listed, so ordering mistakes can silently change what reaches the tokenizer; and because they act before tokenization, their effects are easy to miss unless you inspect the filtered output directly, as sketched below.
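One way to do that inspection is to exercise only the character-filter stage through the _analyze API. A minimal sketch, assuming an Elasticsearch cluster; the inline mapping definition here is illustrative, mirroring the example above:

```javascript
// Run only the character-filter stage through the _analyze API.
// The built-in "keyword" tokenizer keeps the whole text as a single token,
// so the output shows exactly what the character filters produced.
const charFilterCheck = {
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip",
    {
      "type": "mapping",
      "mappings": ["C++ => Cpp", "C# => CSharp", ".NET => DotNet"]
    }
  ],
  "text": "<p>Learning C++ and C# on .NET</p>"
};

// POST /_analyze with this body should return a single token along the lines of
// "Learning Cpp and CSharp on DotNet" (html_strip may leave surrounding
// whitespace or newlines where the tags were).
```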
The tokenizer is the heart of text analysis. It receives the cleaned character stream from character filters and emits a sequence of tokens—discrete units that will be indexed and searched. The choice of tokenizer fundamentally shapes what queries can match which documents.
Different tokenization strategies serve different needs:
| Tokenizer | Splitting Rules | Best For | Example |
|---|---|---|---|
| Standard | Unicode Text Segmentation algorithm; splits on whitespace and punctuation, removes most punctuation | General-purpose text in most languages | "O'Neil's" → ["O'Neil's"] |
| Whitespace | Splits only on whitespace; preserves punctuation | Technical content where punctuation matters | "error_code:404" → ["error_code:404"] |
| Letter | Splits on non-letter characters; only emits sequences of letters | Extracting pure text from noisy data | "Order#12345" → ["Order"] |
| Keyword | Emits entire input as single token | Exact-match fields (IDs, codes, tags) | "user_session_token_abc" → ["user_session_token_abc"] |
| Pattern | Splits based on regex pattern | Custom splitting logic (e.g., splitting on commas) | "a,b,c" → ["a", "b", "c"] (with pattern ",") |
| N-gram | Emits character or word n-grams | Autocomplete, fuzzy matching, typo tolerance | "hello" → ["hel", "ell", "llo"] (trigrams) |
| Edge N-gram | Emits prefixes of tokens | Autocomplete, type-ahead search | "search" → ["s", "se", "sea", "sear", "searc", "search"] |
The Standard Tokenizer in depth:
The Standard Tokenizer is the default choice in most search engines (Elasticsearch, Solr, Lucene) and implements the Unicode Text Segmentation algorithm (UAX #29). This algorithm handles complex cases that simple whitespace splitting would miss: apostrophes inside words ("O'Neil's" stays whole), decimal points and commas inside numbers ("3.14", "1,000"), and word boundaries across the full range of Unicode scripts.
For most Western-language text search, the Standard Tokenizer provides a sensible foundation.
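These behaviors are easy to check directly: the _analyze API accepts a bare tokenizer with no filters, so you can see exactly how UAX #29 segments a string before any normalization happens. A small sketch, assuming an Elasticsearch-style _analyze endpoint:

```javascript
// Inspect raw Standard Tokenizer output, with no character or token filters.
const uaxCheck = {
  "tokenizer": "standard",
  "text": "O'Neil's score was 3.14 on 2024-01-15"
};

// Expected segmentation per UAX #29: apostrophes between letters and periods
// between digits do not split ("O'Neil's" and "3.14" stay whole), while the
// hyphens in the date do split ("2024", "01", "15").
```

The same request shape works for any tokenizer in the table above; the comparison below walks several of them through a more realistic input.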
```javascript
// Demonstrating different tokenizer behaviors

const input = "The user.email is john@example.com (created: 2024-01-15)";

// Standard Tokenizer output:
// ["The", "user.email", "is", "john", "example.com", "created", "2024", "01", "15"]
// Note: @ splits tokens, but . is context-dependent

// Whitespace Tokenizer output:
// ["The", "user.email", "is", "john@example.com", "(created:", "2024-01-15)"]
// Note: Preserves punctuation, good for technical data

// Letter Tokenizer output:
// ["The", "user", "email", "is", "john", "example", "com", "created"]
// Note: Loses numbers entirely, only letters remain

// Pattern Tokenizer (split on non-word characters) output:
// ["The", "user", "email", "is", "john", "example", "com", "created", "2024", "01", "15"]
// Note: More aggressive splitting than Standard

// Combining tokenizers with field mappings
const indexMapping = {
  "mappings": {
    "properties": {
      // Full-text searchable content
      "description": {
        "type": "text",
        "analyzer": "standard" // Standard tokenizer + lowercase (stopword removal optional)
      },
      // Technical identifiers that should match exactly
      "error_code": {
        "type": "text",
        "analyzer": "whitespace" // Preserve special characters
      },
      // Email addresses for exact matching
      "email": {
        "type": "keyword" // No tokenization at all
      },
      // Also searchable by domain
      "email_searchable": {
        "type": "text",
        "analyzer": "email_analyzer" // Custom analyzer that splits on @
      }
    }
  }
};
```

Production search systems often index the same content with multiple analyzers using multi-fields. A product name might be indexed as-is for exact matches, with the standard analyzer for full-text search, and with an edge n-gram analyzer for autocomplete. This provides flexibility at the cost of increased storage.
After tokenization, token filters transform the token stream. They can modify tokens (lowercase, stem), remove tokens (stopwords), or add tokens (synonyms). Multiple filters chain together, each processing the output of the previous filter.
Essential token filter categories: normalization (lowercasing, ASCII folding), removal (stopwords, length limits, de-duplication), expansion (synonyms, word-delimiter splitting), and reduction (stemming). The configuration below exercises each of these.
```javascript
// Complete token filter pipeline example
const analysisSettings = {
  "analysis": {
    "filter": {
      // Custom stopwords list
      "english_stop": {
        "type": "stop",
        "stopwords": "_english_", // Built-in English stopwords
        "ignore_case": true
      },
      // Porter stemmer for English
      "english_stemmer": {
        "type": "stemmer",
        "language": "english"
      },
      // Synonym expansion for e-commerce
      "product_synonyms": {
        "type": "synonym",
        "synonyms": [
          "laptop, notebook, portable computer",
          "phone, smartphone, mobile, cellphone",
          "tv, television, flatscreen",
          "couch, sofa, settee"
        ]
      },
      // ASCII folding for accent insensitivity
      "ascii_folder": {
        "type": "asciifolding",
        "preserve_original": true // Index both "café" and "cafe"
      },
      // Word delimiter for camelCase and compound words
      "word_splitter": {
        "type": "word_delimiter_graph",
        "catenate_all": true, // Create combined token too
        "generate_word_parts": true,
        "generate_number_parts": true,
        "split_on_case_change": true,
        "split_on_numerics": true
        // "PowerPoint2023" → ["Power", "Point", "2023", "PowerPoint2023"]
      },
      // Ensure unique tokens
      "unique_tokens": {
        "type": "unique",
        "only_on_same_position": false
      }
    },
    "analyzer": {
      "product_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "word_splitter", // runs before lowercase so case changes are still visible
          "lowercase",
          "ascii_folder",
          "english_stop",
          "product_synonyms",
          "english_stemmer",
          "unique_tokens"
        ]
      }
    }
  }
};

// Token transformation walkthrough:
// Input: "MacBook Pro Laptop Café Edition"
//
// After tokenizer:        ["MacBook", "Pro", "Laptop", "Café", "Edition"]
// After word_splitter:    ["Mac", "Book", "MacBook", "Pro", "Laptop", "Café", "Edition"]
// After lowercase:        ["mac", "book", "macbook", "pro", "laptop", "café", "edition"]
// After ascii_folder:     ["mac", "book", "macbook", "pro", "laptop", "cafe", "café", "edition"]
// After english_stop:     (no removal - these aren't stopwords)
// After product_synonyms: ["mac", "book", "macbook", "pro", "laptop", "notebook",
//                          "portable", "computer", "cafe", "café", "edition"]
// After english_stemmer:  ["mac", "book", "macbook", "pro", "laptop", "notebook",
//                          "portabl", "comput", "cafe", "café", "edit"]
// After unique_tokens:    (any remaining duplicate tokens are removed)
```

The order of token filters dramatically affects results. Stopword removal before stemming works differently than after, and synonyms applied before lowercasing might miss matches. In the chain above, word splitting runs before lowercasing so that case changes like "MacBook" are still visible to the word delimiter, and lowercasing runs before synonyms so that the synonym mappings don't need to account for case variations.
Real-world search systems rarely use off-the-shelf analyzers. Domain-specific requirements demand custom configurations. Let's build a complete analyzer for an e-commerce product search system.
Requirements: normalize symbols ("&", "+") and measurement formats ("10 oz" → "10oz"), keep well-known brand names intact instead of splitting them, expand product synonyms at search time without reindexing, stem lightly so product names aren't mangled, and support autocomplete via edge n-grams.
```javascript
// Production e-commerce analyzer configuration

const ecommerceAnalyzerConfig = {
  "settings": {
    "analysis": {
      // ========== CHARACTER FILTERS ==========
      "char_filter": {
        // Normalize common symbol patterns
        "symbol_normalizer": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "+ => plus",
            "@ => at",
            "# => number"
          ]
        },
        // Clean up size/measurement formats
        "measurement_normalizer": {
          "type": "pattern_replace",
          "pattern": "(\\d+)\\s*(oz|ml|g|kg|lb|lbs|inch|in|cm|mm)",
          "replacement": "$1$2" // "10 oz" → "10oz"
        }
      },

      // ========== TOKEN FILTERS ==========
      "filter": {
        // ASCII folding that keeps originals
        "ascii_folder": {
          "type": "asciifolding",
          "preserve_original": true
        },
        // Custom brand-aware word delimiter
        "brand_safe_delimiter": {
          "type": "word_delimiter_graph",
          "protected_words": ["iphone", "macbook", "airpods", "playstation"],
          "catenate_all": true,
          "generate_word_parts": true,
          "split_on_case_change": true,
          "preserve_original": true
        },
        // E-commerce specific synonyms
        "ecommerce_synonyms": {
          "type": "synonym_graph",
          "synonyms_path": "synonyms/products.txt",
          "updateable": true // Can update without reindex
        },
        // Minimal stemming (less aggressive for products)
        "light_stemmer": {
          "type": "stemmer",
          "language": "light_english" // Less aggressive than Porter
        },
        // Edge n-grams for autocomplete
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        },
        // Length filter to remove noise
        "length_filter": {
          "type": "length",
          "min": 2,
          "max": 50
        }
      },

      // ========== ANALYZERS ==========
      "analyzer": {
        // Main search analyzer
        "product_search": {
          "type": "custom",
          "char_filter": ["symbol_normalizer", "measurement_normalizer"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ascii_folder",
            "brand_safe_delimiter",
            "ecommerce_synonyms",
            "light_stemmer",
            "length_filter"
          ]
        },
        // Index analyzer (no synonyms at index time)
        "product_index": {
          "type": "custom",
          "char_filter": ["symbol_normalizer", "measurement_normalizer"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ascii_folder",
            "brand_safe_delimiter",
            "light_stemmer",
            "length_filter"
          ]
        },
        // Autocomplete analyzer for type-ahead
        "product_autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ascii_folder",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product_name": {
        "type": "text",
        "analyzer": "product_index",
        "search_analyzer": "product_search",
        "fields": {
          "autocomplete": {
            "type": "text",
            "analyzer": "product_autocomplete",
            "search_analyzer": "standard"
          },
          "exact": {
            "type": "keyword" // For exact matching
          }
        }
      }
    }
  }
};
```

A custom analyzer is only as good as its testing. Before deploying to production, you must verify that the analyzer produces expected tokens for both indexing and searching. Search engines provide analysis APIs for this purpose.
Validation strategy: run representative inputs through the analysis API for each analyzer, assert on the exact token output, and cover edge cases (empty input, stopword-only input, accented text, synonym expansion) in an automated test suite like the one below.
```typescript
// Testing analyzer behavior with the Elasticsearch Analyze API
// (request/response shapes assume the v8 JavaScript client)
import { Client as ElasticsearchClient } from "@elastic/elasticsearch";

// Test the product_search analyzer
const analyzeRequest = {
  analyzer: "product_search",
  text: "Apple MacBook Pro 14-inch Laptop (M3 chip)"
};

// The response shows the token stream:
const exampleResponse = {
  "tokens": [
    {"token": "appl",     "start_offset": 0,  "end_offset": 5,  "position": 0},
    {"token": "macbook",  "start_offset": 6,  "end_offset": 13, "position": 1},
    {"token": "pro",      "start_offset": 14, "end_offset": 17, "position": 2},
    {"token": "14",       "start_offset": 18, "end_offset": 20, "position": 3},
    {"token": "inch",     "start_offset": 21, "end_offset": 25, "position": 4},
    {"token": "laptop",   "start_offset": 26, "end_offset": 32, "position": 5},
    {"token": "notebook", "start_offset": 26, "end_offset": 32, "position": 5}, // Synonym!
    {"token": "m3",       "start_offset": 34, "end_offset": 36, "position": 6},
    {"token": "chip",     "start_offset": 37, "end_offset": 41, "position": 7}
  ]
};

// Automated test suite
interface TokenTest {
  input: string;
  analyzer: string;
  expectedTokens: string[];
}

const analyzerTests: TokenTest[] = [
  // Basic functionality
  {
    input: "MacBook Pro",
    analyzer: "product_index",
    expectedTokens: ["macbook", "pro"]
  },
  // Symbol handling
  {
    input: "Beats & Bose",
    analyzer: "product_search",
    expectedTokens: ["beat", "and", "bose"]
  },
  // Measurements
  {
    input: "Water Bottle 32 oz",
    analyzer: "product_search",
    expectedTokens: ["water", "bottl", "32oz"]
  },
  // Unicode handling
  {
    input: "Café Crème",
    analyzer: "product_search",
    expectedTokens: ["cafe", "café", "cream", "crème"] // Both variants
  },
  // Synonym expansion at search time
  {
    input: "sneakers",
    analyzer: "product_search",
    expectedTokens: ["sneaker", "trainer", "athletic", "shoe"]
  },
  // Edge case: empty string
  {
    input: "",
    analyzer: "product_search",
    expectedTokens: []
  },
  // Edge case: only stopwords
  {
    input: "the a an",
    analyzer: "product_search",
    expectedTokens: [] // All removed as stopwords
  }
];

async function runAnalyzerTests(client: ElasticsearchClient) {
  for (const test of analyzerTests) {
    const result = await client.indices.analyze({
      index: "products",
      analyzer: test.analyzer,
      text: test.input
    });

    const actualTokens = (result.tokens ?? []).map(t => t.token);
    const missing = test.expectedTokens.filter(t => !actualTokens.includes(t));
    const extra = actualTokens.filter(t => !test.expectedTokens.includes(t));

    if (missing.length > 0 || extra.length > 0) {
      console.error(`FAIL: "${test.input}"`);
      console.error(`  Missing: ${missing}`);
      console.error(`  Unexpected: ${extra}`);
    } else {
      console.log(`PASS: "${test.input}"`);
    }
  }
}
```

The most valuable analyzer tests come from production query logs. Extract a sample of real user queries and their expected matching documents, and verify the analyzer produces compatible tokens. Nothing reveals analyzer gaps like real user behavior.
Text analysis runs on every document at index time and every query at search time. Poor analyzer performance directly impacts system throughput and latency. Understanding the performance characteristics of different analysis components is essential for system design.
| Component | Performance Impact | Memory Impact | Mitigation Strategies |
|---|---|---|---|
| Character filters (Regex) | High CPU per character; O(n) where n is text length; complex patterns are expensive | Low | Optimize regex patterns; avoid catastrophic backtracking; limit input length |
| Standard Tokenizer | Low; highly optimized; O(n) | Low | N/A - already optimal for most cases |
| N-gram Tokenizer | Low compute, but high token output; O(n × max_gram) | High output volume | Limit to specific fields; use edge n-grams when possible; set reasonable max_gram |
| Stemmer | Low; algorithmic stemmers (Porter) do constant work per token, dictionary stemmers add a lookup | Low for algorithmic stemmers; moderate for dictionary stemmers (dictionary held in memory) | Prefer algorithmic stemmers (Porter) over dictionary stemmers (e.g., Hunspell) for large vocabularies |
| Synonym Filter | Medium; dictionary lookup per token | High if synonym list is large | Apply at search time only; use updateable synonyms for changes without reindex |
| Phonetic Filter | Medium; algorithmic encoding per token | Low | Apply only to name fields; avoid on high-volume text fields |
```typescript
// Benchmarking analyzer performance
// (request shape assumes the v8 JavaScript client)
import { Client as ElasticsearchClient } from "@elastic/elasticsearch";

interface AnalyzerBenchmark {
  analyzer: string;
  iterations: number;
  totalMs: number;
  avgMsPerAnalysis: number;
  analysesPerSecond: number;
}

async function benchmarkAnalyzer(
  client: ElasticsearchClient,
  analyzer: string,
  sampleTexts: string[],
  iterations: number = 1000
): Promise<AnalyzerBenchmark> {
  const start = performance.now();

  for (let i = 0; i < iterations; i++) {
    const text = sampleTexts[i % sampleTexts.length];
    await client.indices.analyze({
      index: "benchmark_index",
      analyzer,
      text
    });
  }

  const elapsed = performance.now() - start;

  return {
    analyzer,
    iterations,
    totalMs: elapsed,
    avgMsPerAnalysis: elapsed / iterations,
    analysesPerSecond: (iterations / elapsed) * 1000
  };
}

// Example results comparison:
//
// Analyzer: "standard"
// Avg: 0.3ms per analysis
// Throughput: 3,300 analyses/second
//
// Analyzer: "product_search" (with synonyms, stemming)
// Avg: 0.8ms per analysis
// Throughput: 1,250 analyses/second
//
// Analyzer: "ngram_analyzer" (min_gram: 2, max_gram: 10)
// Avg: 2.1ms per analysis (but with 5x more tokens output)
// Throughput: 475 analyses/second

// Impact on bulk indexing:
// If indexing 1 million documents:
// - Standard analyzer: ~5 minutes
// - Complex custom analyzer: ~13 minutes
// - N-gram heavy analyzer: ~35 minutes
```

N-gram analyzers can dramatically increase index size. A 10-character word with trigrams produces 8 tokens. Across millions of documents, this explodes storage costs and impacts search performance. Always profile index size with production-like data before deploying n-gram analyzers.
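The arithmetic behind that warning is easy to sanity-check. The sketch below uses the standard n-gram counting formulas; the postings estimate at the end is a rough illustration under assumed document and token counts, not a measurement:

```typescript
// Rough estimate of the token blow-up caused by n-gram analysis.

// Character n-grams of length `gram` in a token of length `len`.
function ngramCount(len: number, gram: number): number {
  return Math.max(len - gram + 1, 0);
}

// Edge n-grams emit one prefix per length from minGram up to maxGram (capped at len).
function edgeNgramCount(len: number, minGram: number, maxGram: number): number {
  return Math.max(Math.min(len, maxGram) - minGram + 1, 0);
}

console.log(ngramCount(10, 3));         // 8 trigrams for a 10-character word
console.log(edgeNgramCount(10, 2, 15)); // 9 prefixes for the same word

// Illustrative scale: 1M documents × 200 tokens each × ~8 n-grams per token
// is roughly 1.6B postings, versus about 200M with the standard analyzer,
// which is why n-gram analysis should be scoped to the few fields that need it.
```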
Tokenization and analysis form the foundation of every full-text search system. Before considering ranking algorithms or advanced query features, you must get the analysis pipeline right. Poor analysis means good documents won't match user queries—no matter how sophisticated your relevance tuning.
What's next:
With tokenization understood, we move to linguistic processing in the next page. We'll explore stemming and lemmatization—the techniques that allow "running," "runs," and "ran" to all match the same search intent. These linguistic transformations are essential for search systems that understand language, not just characters.
You now understand the complete text analysis pipeline: character filters, tokenizers, and token filters. You've seen how to build custom analyzers for production use cases, test them thoroughly, and consider their performance implications. This foundation enables everything else in full-text search.