When you type a query into a search box, something remarkable happens in milliseconds: the search engine must understand what you're looking for, locate relevant documents among potentially billions of candidates, and rank them by relevance. But how does a computer—which fundamentally operates on bytes and numbers—understand the nuanced meaning of human language?
The answer begins with tokenization and analysis.
At its core, full-text search is a translation problem. Human language is fluid, ambiguous, and context-dependent. Computers require structure, precision, and determinism. The text analysis pipeline bridges this gap, transforming raw text into searchable, normalized tokens that enable both efficient retrieval and accurate matching.
By the end of this page, you will understand the complete text analysis pipeline used in production search systems: character filters, tokenizers, and token filters. You'll learn how decisions at each stage impact search quality, performance, and relevance—and why getting this foundation right is critical before considering any other search optimization.
Before text can be searched, it must be transformed through a series of processing stages collectively called the analysis pipeline (or analyzer). This pipeline operates identically at both index time (when documents are stored) and query time (when searches are performed), ensuring that what users search for matches what was indexed.
The standard pipeline consists of three distinct phases: character filters, which clean and normalize the raw character stream; a tokenizer, which splits that stream into discrete terms; and token filters, which modify, remove, or expand those terms.
Each phase is critical, and errors at any stage propagate through the entire search experience. A poorly chosen tokenizer can fragment words incorrectly. An aggressive token filter might remove important signals. A missing character filter might allow invisible characters to pollute the index.
Raw Input: "The Quick-Brown Fox Jumps Over the Lazy Dog!!!" ↓ Character Filter (remove punctuation, lowercase) Cleaned: "the quick-brown fox jumps over the lazy dog" ↓ Tokenizer (whitespace + hyphen splitting) Tokens: ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"] ↓ Token Filters (stopword removal, stemming) Final Tokens: ["quick", "brown", "fox", "jump", "lazi", "dog"]Notice how the original 9-word sentence becomes 6 indexed tokens. Common words like "the" and "over" are removed as stopwords. The word "jumps" becomes "jump" through stemming, allowing it to match queries for "jumping," "jumped," or "jump." Even "lazy" becomes "lazi" through the same stemming process.
Why this matters for system design:
Every decision in this pipeline cascades through the entire search experience: aggressive normalization (heavy stemming, broad synonyms, stopword removal) increases recall but risks surfacing irrelevant matches, while conservative analysis preserves precision but misses valid variants.
The art of search engineering lies in tuning this balance for each specific use case.
The same analyzer (or compatible analyzers) must be applied at both index time and query time. If a document contains 'running' indexed as 'run', but the query analyzer doesn't apply the same stemmer, searching for 'running' will fail to match. This is one of the most common causes of search bugs in production systems.
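A minimal sketch of making that contract explicit, assuming Elasticsearch-style settings (the index name, field name, and analyzer name here are hypothetical):

```javascript
// Pin the index-time and search-time analyzers to the same definition so that
// documents and queries are normalized identically. Names are illustrative.
const articlesIndex = {
  "settings": {
    "analysis": {
      "analyzer": {
        "english_text": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "porter_stem"] // "running" → "run" at index time
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "analyzer": "english_text",        // applied when documents are indexed
        "search_analyzer": "english_text"  // applied to query text at search time
        // If the search analyzer skipped stemming, a query for "running" would
        // look up the token "running" while the index only contains "run",
        // and the document would silently fail to match.
      }
    }
  }
};
```

In Elasticsearch, `search_analyzer` defaults to `analyzer`, so the risk is highest when the two are configured separately (as in the e-commerce example later on this page) and then drift apart.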
Character filters operate on the raw input stream before tokenization. They add, remove, or replace characters in the source text. While often overlooked, character filters handle critical preprocessing that tokenizers cannot perform efficiently.
Common character filter use cases include stripping HTML markup, mapping symbols and technical notation to searchable equivalents (for example, "&" to "and" or "C++" to "Cpp"), and regex-based pattern replacement such as collapsing runs of whitespace.
```javascript
// Example: Custom character filter configuration in Elasticsearch
const analyzerConfig = {
  "settings": {
    "analysis": {
      "char_filter": {
        // HTML stripping
        "html_stripper": {
          "type": "html_strip",
          "escaped_tags": ["b", "i"] // Keep bold/italic for emphasis detection
        },
        // Mapping replacements for technical content
        "tech_normalizer": {
          "type": "mapping",
          "mappings": [
            "C++ => Cpp",
            "C# => CSharp",
            "F# => FSharp",
            ".NET => DotNet",
            "& => and",
            "@ => at"
          ]
        },
        // Pattern replacement for code cleanup
        "code_cleanup": {
          "type": "pattern_replace",
          "pattern": "\\s{2,}",  // Multiple whitespace
          "replacement": " "     // Single space
        }
      },
      "analyzer": {
        "content_analyzer": {
          "char_filter": ["html_stripper", "tech_normalizer", "code_cleanup"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "porter_stem"]
        }
      }
    }
  }
};

// Application: Processing technical blog content
const rawContent = `<p>Learning C++ and C# can be valuable.
The .NET framework provides many features.</p>`;

// After character filtering:
// "Learning Cpp and CSharp can be valuable. The DotNet framework provides many features."
```

Character filters execute in the order specified. If you have a pattern filter that normalizes whitespace and an HTML stripper, place the HTML stripper first. Otherwise, the pattern filter might operate on HTML entities that will later be removed, wasting processing time and potentially causing unexpected behavior.
Production considerations for character filters: they run on every document and every query, so expensive regex patterns add CPU cost at scale; they execute in the order listed, so ordering mistakes can silently change what reaches the tokenizer; and because they act before tokenization, their effects are easy to miss unless you inspect the filtered output directly, as sketched below.
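One way to do that inspection is to exercise only the character-filter stage through the _analyze API. A minimal sketch, assuming an Elasticsearch cluster; the inline mapping definition here is illustrative, mirroring the example above:

```javascript
// Run only the character-filter stage through the _analyze API.
// The built-in "keyword" tokenizer keeps the whole text as a single token,
// so the output shows exactly what the character filters produced.
const charFilterCheck = {
  "tokenizer": "keyword",
  "char_filter": [
    "html_strip",
    {
      "type": "mapping",
      "mappings": ["C++ => Cpp", "C# => CSharp", ".NET => DotNet"]
    }
  ],
  "text": "<p>Learning C++ and C# on .NET</p>"
};

// POST /_analyze with this body should return a single token along the lines of
// "Learning Cpp and CSharp on DotNet" (html_strip may leave surrounding
// whitespace or newlines where the tags were).
```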
The tokenizer is the heart of text analysis. It receives the cleaned character stream from character filters and emits a sequence of tokens—discrete units that will be indexed and searched. The choice of tokenizer fundamentally shapes what queries can match which documents.
Different tokenization strategies serve different needs:
| Tokenizer | Splitting Rules | Best For | Example |
|---|---|---|---|
| Standard | Unicode Text Segmentation algorithm; splits on whitespace and punctuation, removes most punctuation | General-purpose text in most languages | "O'Neil's" → ["O'Neil's"] |
| Whitespace | Splits only on whitespace; preserves punctuation | Technical content where punctuation matters | "error_code:404" → ["error_code:404"] |
| Letter | Splits on non-letter characters; only emits sequences of letters | Extracting pure text from noisy data | "Order#12345" → ["Order"] |
| Keyword | Emits entire input as single token | Exact-match fields (IDs, codes, tags) | "user_session_token_abc" → ["user_session_token_abc"] |
| Pattern | Splits based on regex pattern | Custom splitting logic (e.g., splitting on commas) | "a,b,c" → ["a", "b", "c"] (with pattern ",") |
| N-gram | Emits character or word n-grams | Autocomplete, fuzzy matching, typo tolerance | "hello" → ["hel", "ell", "llo"] (trigrams) |
| Edge N-gram | Emits prefixes of tokens | Autocomplete, type-ahead search | "search" → ["s", "se", "sea", "sear", "searc", "search"] |
The Standard Tokenizer in depth:
The Standard Tokenizer is the default choice in most search engines (Elasticsearch, Solr, Lucene) and implements the Unicode Text Segmentation algorithm (UAX #29). This algorithm handles complex cases that simple whitespace splitting would miss: apostrophes inside words ("O'Neil's" stays whole), decimal points and commas inside numbers ("3.14", "1,000"), and word boundaries across the full range of Unicode scripts.
For most Western-language text search, the Standard Tokenizer provides a sensible foundation.
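These behaviors are easy to check directly: the _analyze API accepts a bare tokenizer with no filters, so you can see exactly how UAX #29 segments a string before any normalization happens. A small sketch, assuming an Elasticsearch-style _analyze endpoint:

```javascript
// Inspect raw Standard Tokenizer output, with no character or token filters.
const uaxCheck = {
  "tokenizer": "standard",
  "text": "O'Neil's score was 3.14 on 2024-01-15"
};

// Expected segmentation per UAX #29: apostrophes between letters and periods
// between digits do not split ("O'Neil's" and "3.14" stay whole), while the
// hyphens in the date do split ("2024", "01", "15").
```

The same request shape works for any tokenizer in the table above; the comparison below walks several of them through a more realistic input.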
```javascript
// Demonstrating different tokenizer behaviors

const input = "The user.email is john@example.com (created: 2024-01-15)";

// Standard Tokenizer output:
// ["The", "user.email", "is", "john", "example.com", "created", "2024", "01", "15"]
// Note: @ splits tokens, but . is context-dependent

// Whitespace Tokenizer output:
// ["The", "user.email", "is", "john@example.com", "(created:", "2024-01-15)"]
// Note: Preserves punctuation, good for technical data

// Letter Tokenizer output:
// ["The", "user", "email", "is", "john", "example", "com", "created"]
// Note: Loses numbers entirely, only letters remain

// Pattern Tokenizer (split on non-word characters) output:
// ["The", "user", "email", "is", "john", "example", "com", "created", "2024", "01", "15"]
// Note: More aggressive splitting than Standard

// Combining tokenizers with field mappings
const indexMapping = {
  "mappings": {
    "properties": {
      // Full-text searchable content
      "description": {
        "type": "text",
        "analyzer": "standard" // Standard tokenizer + lowercase (stopword removal optional)
      },
      // Technical identifiers that should match exactly
      "error_code": {
        "type": "text",
        "analyzer": "whitespace" // Preserve special characters
      },
      // Email addresses for exact matching
      "email": {
        "type": "keyword" // No tokenization at all
      },
      // Also searchable by domain
      "email_searchable": {
        "type": "text",
        "analyzer": "email_analyzer" // Custom analyzer that splits on @
      }
    }
  }
};
```

Production search systems often index the same content with multiple analyzers using multi-fields. A product name might be indexed as-is for exact matches, with the standard analyzer for full-text search, and with an edge n-gram analyzer for autocomplete. This provides flexibility at the cost of increased storage.
After tokenization, token filters transform the token stream. They can modify tokens (lowercase, stem), remove tokens (stopwords), or add tokens (synonyms). Multiple filters chain together, each processing the output of the previous filter.
Essential token filter categories: normalization (lowercasing, ASCII folding), removal (stopwords, length limits, de-duplication), expansion (synonyms, word-delimiter splitting), and reduction (stemming). The configuration below exercises each of these.
```javascript
// Complete token filter pipeline example
const analysisSettings = {
  "analysis": {
    "filter": {
      // Custom stopwords list
      "english_stop": {
        "type": "stop",
        "stopwords": "_english_", // Built-in English stopwords
        "ignore_case": true
      },
      // Porter stemmer for English
      "english_stemmer": {
        "type": "stemmer",
        "language": "english"
      },
      // Synonym expansion for e-commerce
      "product_synonyms": {
        "type": "synonym",
        "synonyms": [
          "laptop, notebook, portable computer",
          "phone, smartphone, mobile, cellphone",
          "tv, television, flatscreen",
          "couch, sofa, settee"
        ]
      },
      // ASCII folding for accent insensitivity
      "ascii_folder": {
        "type": "asciifolding",
        "preserve_original": true // Index both "café" and "cafe"
      },
      // Word delimiter for camelCase and compound words
      "word_splitter": {
        "type": "word_delimiter_graph",
        "catenate_all": true, // Create combined token too
        "generate_word_parts": true,
        "generate_number_parts": true,
        "split_on_case_change": true,
        "split_on_numerics": true
        // "PowerPoint2023" → ["Power", "Point", "2023", "PowerPoint2023"]
      },
      // Ensure unique tokens
      "unique_tokens": {
        "type": "unique",
        "only_on_same_position": false
      }
    },
    "analyzer": {
      "product_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "word_splitter", // runs before lowercase so case changes are still visible
          "lowercase",
          "ascii_folder",
          "english_stop",
          "product_synonyms",
          "english_stemmer",
          "unique_tokens"
        ]
      }
    }
  }
};

// Token transformation walkthrough:
// Input: "MacBook Pro Laptop Café Edition"
//
// After tokenizer:        ["MacBook", "Pro", "Laptop", "Café", "Edition"]
// After word_splitter:    ["Mac", "Book", "MacBook", "Pro", "Laptop", "Café", "Edition"]
// After lowercase:        ["mac", "book", "macbook", "pro", "laptop", "café", "edition"]
// After ascii_folder:     ["mac", "book", "macbook", "pro", "laptop", "cafe", "café", "edition"]
// After english_stop:     (no removal - these aren't stopwords)
// After product_synonyms: ["mac", "book", "macbook", "pro", "laptop", "notebook",
//                          "portable", "computer", "cafe", "café", "edition"]
// After english_stemmer:  ["mac", "book", "macbook", "pro", "laptop", "notebook",
//                          "portabl", "comput", "cafe", "café", "edit"]
// After unique_tokens:    (any remaining duplicate tokens are removed)
```

The order of token filters dramatically affects results. Stopword removal before stemming works differently than after, and synonyms applied before lowercasing might miss matches. In the chain above, word splitting runs before lowercasing so that case changes like "MacBook" are still visible to the word delimiter, and lowercasing runs before synonyms so that the synonym mappings don't need to account for case variations.
Real-world search systems rarely use off-the-shelf analyzers. Domain-specific requirements demand custom configurations. Let's build a complete analyzer for an e-commerce product search system.
Requirements: normalize symbols ("&", "+") and measurement formats ("10 oz" → "10oz"), keep well-known brand names intact instead of splitting them, expand product synonyms at search time without reindexing, stem lightly so product names aren't mangled, and support autocomplete via edge n-grams.
```javascript
// Production e-commerce analyzer configuration

const ecommerceAnalyzerConfig = {
  "settings": {
    "analysis": {
      // ========== CHARACTER FILTERS ==========
      "char_filter": {
        // Normalize common symbol patterns
        "symbol_normalizer": {
          "type": "mapping",
          "mappings": [
            "& => and",
            "+ => plus",
            "@ => at",
            "# => number"
          ]
        },
        // Clean up size/measurement formats
        "measurement_normalizer": {
          "type": "pattern_replace",
          "pattern": "(\\d+)\\s*(oz|ml|g|kg|lb|lbs|inch|in|cm|mm)",
          "replacement": "$1$2" // "10 oz" → "10oz"
        }
      },

      // ========== TOKEN FILTERS ==========
      "filter": {
        // ASCII folding that keeps originals
        "ascii_folder": {
          "type": "asciifolding",
          "preserve_original": true
        },
        // Custom brand-aware word delimiter
        "brand_safe_delimiter": {
          "type": "word_delimiter_graph",
          "protected_words": ["iphone", "macbook", "airpods", "playstation"],
          "catenate_all": true,
          "generate_word_parts": true,
          "split_on_case_change": true,
          "preserve_original": true
        },
        // E-commerce specific synonyms
        "ecommerce_synonyms": {
          "type": "synonym_graph",
          "synonyms_path": "synonyms/products.txt",
          "updateable": true // Can update without reindex
        },
        // Minimal stemming (less aggressive for products)
        "light_stemmer": {
          "type": "stemmer",
          "language": "light_english" // Less aggressive than Porter
        },
        // Edge n-grams for autocomplete
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        },
        // Length filter to remove noise
        "length_filter": {
          "type": "length",
          "min": 2,
          "max": 50
        }
      },

      // ========== ANALYZERS ==========
      "analyzer": {
        // Main search analyzer
        "product_search": {
          "type": "custom",
          "char_filter": ["symbol_normalizer", "measurement_normalizer"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ascii_folder",
            "brand_safe_delimiter",
            "ecommerce_synonyms",
            "light_stemmer",
            "length_filter"
          ]
        },
        // Index analyzer (no synonyms at index time)
        "product_index": {
          "type": "custom",
          "char_filter": ["symbol_normalizer", "measurement_normalizer"],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ascii_folder",
            "brand_safe_delimiter",
            "light_stemmer",
            "length_filter"
          ]
        },
        // Autocomplete analyzer for type-ahead
        "product_autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ascii_folder",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product_name": {
        "type": "text",
        "analyzer": "product_index",
        "search_analyzer": "product_search",
        "fields": {
          "autocomplete": {
            "type": "text",
            "analyzer": "product_autocomplete",
            "search_analyzer": "standard"
          },
          "exact": {
            "type": "keyword" // For exact matching
          }
        }
      }
    }
  }
};
```

A custom analyzer is only as good as its testing. Before deploying to production, you must verify that the analyzer produces expected tokens for both indexing and searching. Search engines provide analysis APIs for this purpose.
Validation strategy: run representative inputs through the analysis API for each analyzer, assert on the exact token output, and cover edge cases (empty input, stopword-only input, accented text, synonym expansion) in an automated test suite like the one below.
```typescript
// Testing analyzer behavior with the Elasticsearch Analyze API
// (request/response shapes assume the v8 JavaScript client)
import { Client as ElasticsearchClient } from "@elastic/elasticsearch";

// Test the product_search analyzer
const analyzeRequest = {
  analyzer: "product_search",
  text: "Apple MacBook Pro 14-inch Laptop (M3 chip)"
};

// The response shows the token stream:
const exampleResponse = {
  "tokens": [
    {"token": "appl",     "start_offset": 0,  "end_offset": 5,  "position": 0},
    {"token": "macbook",  "start_offset": 6,  "end_offset": 13, "position": 1},
    {"token": "pro",      "start_offset": 14, "end_offset": 17, "position": 2},
    {"token": "14",       "start_offset": 18, "end_offset": 20, "position": 3},
    {"token": "inch",     "start_offset": 21, "end_offset": 25, "position": 4},
    {"token": "laptop",   "start_offset": 26, "end_offset": 32, "position": 5},
    {"token": "notebook", "start_offset": 26, "end_offset": 32, "position": 5}, // Synonym!
    {"token": "m3",       "start_offset": 34, "end_offset": 36, "position": 6},
    {"token": "chip",     "start_offset": 37, "end_offset": 41, "position": 7}
  ]
};

// Automated test suite
interface TokenTest {
  input: string;
  analyzer: string;
  expectedTokens: string[];
}

const analyzerTests: TokenTest[] = [
  // Basic functionality
  {
    input: "MacBook Pro",
    analyzer: "product_index",
    expectedTokens: ["macbook", "pro"]
  },
  // Symbol handling
  {
    input: "Beats & Bose",
    analyzer: "product_search",
    expectedTokens: ["beat", "and", "bose"]
  },
  // Measurements
  {
    input: "Water Bottle 32 oz",
    analyzer: "product_search",
    expectedTokens: ["water", "bottl", "32oz"]
  },
  // Unicode handling
  {
    input: "Café Crème",
    analyzer: "product_search",
    expectedTokens: ["cafe", "café", "cream", "crème"] // Both variants
  },
  // Synonym expansion at search time
  {
    input: "sneakers",
    analyzer: "product_search",
    expectedTokens: ["sneaker", "trainer", "athletic", "shoe"]
  },
  // Edge case: empty string
  {
    input: "",
    analyzer: "product_search",
    expectedTokens: []
  },
  // Edge case: only stopwords
  {
    input: "the a an",
    analyzer: "product_search",
    expectedTokens: [] // All removed as stopwords
  }
];

async function runAnalyzerTests(client: ElasticsearchClient) {
  for (const test of analyzerTests) {
    const result = await client.indices.analyze({
      index: "products",
      analyzer: test.analyzer,
      text: test.input
    });

    const actualTokens = (result.tokens ?? []).map(t => t.token);
    const missing = test.expectedTokens.filter(t => !actualTokens.includes(t));
    const extra = actualTokens.filter(t => !test.expectedTokens.includes(t));

    if (missing.length > 0 || extra.length > 0) {
      console.error(`FAIL: "${test.input}"`);
      console.error(`  Missing: ${missing}`);
      console.error(`  Unexpected: ${extra}`);
    } else {
      console.log(`PASS: "${test.input}"`);
    }
  }
}
```

The most valuable analyzer tests come from production query logs. Extract a sample of real user queries and their expected matching documents, and verify the analyzer produces compatible tokens. Nothing reveals analyzer gaps like real user behavior.
Text analysis runs on every document at index time and every query at search time. Poor analyzer performance directly impacts system throughput and latency. Understanding the performance characteristics of different analysis components is essential for system design.
| Component | Performance Impact | Memory Impact | Mitigation Strategies |
|---|---|---|---|
| Character filters (Regex) | High CPU per character; O(n) where n is text length; complex patterns are expensive | Low | Optimize regex patterns; avoid catastrophic backtracking; limit input length |
| Standard Tokenizer | Low; highly optimized; O(n) | Low | N/A - already optimal for most cases |
| N-gram Tokenizer | Low compute, but high token output; O(n × max_gram) | High output volume | Limit to specific fields; use edge n-grams when possible; set reasonable max_gram |
| Stemmer | Low; algorithmic stemmers (Porter) do constant work per token, dictionary stemmers add a lookup | Low for algorithmic stemmers; moderate for dictionary stemmers (dictionary held in memory) | Prefer algorithmic stemmers (Porter) over dictionary stemmers (e.g., Hunspell) for large vocabularies |
| Synonym Filter | Medium; dictionary lookup per token | High if synonym list is large | Apply at search time only; use updateable synonyms for changes without reindex |
| Phonetic Filter | Medium; algorithmic encoding per token | Low | Apply only to name fields; avoid on high-volume text fields |
```typescript
// Benchmarking analyzer performance
// (request shape assumes the v8 JavaScript client)
import { Client as ElasticsearchClient } from "@elastic/elasticsearch";

interface AnalyzerBenchmark {
  analyzer: string;
  iterations: number;
  totalMs: number;
  avgMsPerAnalysis: number;
  analysesPerSecond: number;
}

async function benchmarkAnalyzer(
  client: ElasticsearchClient,
  analyzer: string,
  sampleTexts: string[],
  iterations: number = 1000
): Promise<AnalyzerBenchmark> {
  const start = performance.now();

  for (let i = 0; i < iterations; i++) {
    const text = sampleTexts[i % sampleTexts.length];
    await client.indices.analyze({
      index: "benchmark_index",
      analyzer,
      text
    });
  }

  const elapsed = performance.now() - start;

  return {
    analyzer,
    iterations,
    totalMs: elapsed,
    avgMsPerAnalysis: elapsed / iterations,
    analysesPerSecond: (iterations / elapsed) * 1000
  };
}

// Example results comparison:
//
// Analyzer: "standard"
// Avg: 0.3ms per analysis
// Throughput: 3,300 analyses/second
//
// Analyzer: "product_search" (with synonyms, stemming)
// Avg: 0.8ms per analysis
// Throughput: 1,250 analyses/second
//
// Analyzer: "ngram_analyzer" (min_gram: 2, max_gram: 10)
// Avg: 2.1ms per analysis (but with 5x more tokens output)
// Throughput: 475 analyses/second

// Impact on bulk indexing:
// If indexing 1 million documents:
// - Standard analyzer: ~5 minutes
// - Complex custom analyzer: ~13 minutes
// - N-gram heavy analyzer: ~35 minutes
```

N-gram analyzers can dramatically increase index size. A 10-character word with trigrams produces 8 tokens. Across millions of documents, this explodes storage costs and impacts search performance. Always profile index size with production-like data before deploying n-gram analyzers.
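The arithmetic behind that warning is easy to sanity-check. The sketch below uses the standard n-gram counting formulas; the postings estimate at the end is a rough illustration under assumed document and token counts, not a measurement:

```typescript
// Rough estimate of the token blow-up caused by n-gram analysis.

// Character n-grams of length `gram` in a token of length `len`.
function ngramCount(len: number, gram: number): number {
  return Math.max(len - gram + 1, 0);
}

// Edge n-grams emit one prefix per length from minGram up to maxGram (capped at len).
function edgeNgramCount(len: number, minGram: number, maxGram: number): number {
  return Math.max(Math.min(len, maxGram) - minGram + 1, 0);
}

console.log(ngramCount(10, 3));         // 8 trigrams for a 10-character word
console.log(edgeNgramCount(10, 2, 15)); // 9 prefixes for the same word

// Illustrative scale: 1M documents × 200 tokens each × ~8 n-grams per token
// is roughly 1.6B postings, versus about 200M with the standard analyzer,
// which is why n-gram analysis should be scoped to the few fields that need it.
```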
Tokenization and analysis form the foundation of every full-text search system. Before considering ranking algorithms or advanced query features, you must get the analysis pipeline right. Poor analysis means good documents won't match user queries—no matter how sophisticated your relevance tuning.
What's next:
With tokenization understood, we move to linguistic processing in the next page. We'll explore stemming and lemmatization—the techniques that allow "running," "runs," and "ran" to all match the same search intent. These linguistic transformations are essential for search systems that understand language, not just characters.
You now understand the complete text analysis pipeline: character filters, tokenizers, and token filters. You've seen how to build custom analyzers for production use cases, test them thoroughly, and consider their performance implications. This foundation enables everything else in full-text search.