When you search for 'running shoes' in an e-commerce application, you expect to find products listed as 'Running Shoes', 'running shoe', 'RUNNING SHOES', and even 'Runner's footwear'. This seemingly simple expectation requires sophisticated text processing that happens at two critical moments: when documents are indexed and when queries are executed.
This text processing pipeline—called analysis—transforms raw text into normalized, searchable terms. The mapping defines how each field in your documents should be analyzed, stored, and made searchable.
Mapping and analysis are where Elasticsearch's search magic happens. Get them right, and your search feels intuitive and comprehensive. Get them wrong, and users search for products that exist but never appear in results.
By the end of this page, you will understand: Elasticsearch data types and when to use each; the analysis pipeline (character filters, tokenizers, token filters); how to create custom analyzers; dynamic vs explicit mapping strategies; and common patterns for multi-language and faceted search.
A mapping defines how a document and its fields are stored and indexed. It specifies:

- The data type of each field (text, keyword, date, nested, and so on)
- How text fields are analyzed into searchable terms
- Per-field indexing options, such as whether a field is searchable, sortable, or stored
Mappings are defined per index. Unlike relational databases, Elasticsearch was originally designed to be schema-flexible—you could index documents with any structure. However, this flexibility came with trade-offs, and modern best practices strongly favor explicit mappings.
```json
GET /products/_mapping

{
  "products": {
    "mappings": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "keyword": { "type": "keyword" }
          }
        },
        "description": { "type": "text", "analyzer": "english" },
        "price": { "type": "float" },
        "category": { "type": "keyword" },
        "tags": { "type": "keyword" },
        "created_at": { "type": "date" },
        "in_stock": { "type": "boolean" },
        "reviews": {
          "type": "nested",
          "properties": {
            "user": { "type": "keyword" },
            "rating": { "type": "integer" },
            "comment": { "type": "text" }
          }
        }
      }
    }
  }
}
```

Mapping immutability:
Once a field is mapped, its type cannot be changed. This is fundamental: changing how data is indexed would invalidate existing indexed terms, creating inconsistent search behavior.
If you need to change a field's mapping, you must:

1. Create a new index with the corrected mapping
2. Reindex all existing data into the new index
3. Switch your application (or an index alias) to point at the new index
This immutability makes upfront mapping design critical. Mistakes are expensive to fix.
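The reindex workflow can be sketched with standard APIs; the index and alias names below are illustrative:

```json
// 1. Create a new index with the corrected mapping
PUT /products_v2
{
  "mappings": {
    "properties": {
      "price": { "type": "double" }
    }
  }
}

// 2. Copy documents from the old index into the new one
POST /_reindex
{
  "source": { "index": "products" },
  "dest":   { "index": "products_v2" }
}

// 3. Repoint the alias in one atomic step
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products",    "alias": "products_live" } },
    { "add":    { "index": "products_v2", "alias": "products_live" } }
  ]
}
```

If your application always queries through the alias, the swap is invisible to users.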
Never index production data with default mappings and 'fix it later.' The fix requires full reindexing, which is time-consuming and disruptive. Invest time in mapping design before your first document.
Elasticsearch provides numerous field types, each optimized for different data patterns and query types. Choosing the correct type affects searchability, storage efficiency, and query performance.
The two most important text-related types:
The multi-field pattern:
Often you need both text search and exact matching on the same field. The multi-field pattern solves this:
```json
{
  "properties": {
    "product_name": {
      "type": "text",                         // Full-text search
      "analyzer": "standard",
      "fields": {
        "keyword": {
          "type": "keyword"                   // Exact matching, aggregations
        },
        "autocomplete": {
          "type": "text",                     // Edge n-gram for typeahead
          "analyzer": "autocomplete_analyzer"
        }
      }
    }
  }
}

// Query examples:
// Full-text:   "query": { "match": { "product_name": "running shoes" }}
// Exact:       "query": { "term": { "product_name.keyword": "Nike Air Max" }}
// Aggregation: "aggs": { "by_name": { "terms": { "field": "product_name.keyword" }}}
```

| Type | Use Case | Query Types | Notes |
|---|---|---|---|
| text | Full-text search | match, match_phrase, fuzzy | Analyzed, tokenized |
| keyword | Exact values, filtering, sorting | term, terms, range | Not analyzed, max 32,766 bytes |
| long / integer / short / byte | Whole numbers | term, range, aggregations | Choose smallest type that fits |
| float / double / half_float | Decimal numbers | term, range, aggregations | half_float saves space for less precision |
| boolean | True/false values | term | Stored as 'true' or 'false' |
| date | Dates and times | range, date histogram | ISO8601, epoch, or custom format |
| object | Nested JSON objects | Flattened dot notation | Fields merged at document level |
| nested | Arrays of objects | Nested queries | Preserves array object boundaries |
| geo_point | Lat/lon coordinates | geo_distance, geo_bounding_box | For location searches |
| ip | IPv4/IPv6 addresses | term, range, CIDR queries | Stored efficiently |
When you index an array of objects as 'object' type, Elasticsearch flattens them—losing the association between fields within each object. Use 'nested' type when you need to query individual objects within an array. This has performance implications, so use sparingly.
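To see why flattening matters, consider a hypothetical document with two reviews (this sketch assumes the nested `reviews` mapping shown earlier):

```json
// Document with two review objects
PUT /products/_doc/1
{
  "reviews": [
    { "user": "alice", "rating": 5 },
    { "user": "bob",   "rating": 1 }
  ]
}

// As 'object' type, the array would be flattened into:
//   reviews.user:   ["alice", "bob"]
//   reviews.rating: [5, 1]
// so a query for user=alice AND rating=1 would wrongly match.

// With 'nested' type, each object is matched as its own unit:
GET /products/_search
{
  "query": {
    "nested": {
      "path": "reviews",
      "query": {
        "bool": {
          "must": [
            { "term": { "reviews.user": "alice" } },
            { "term": { "reviews.rating": 1 } }
          ]
        }
      }
    }
  }
}
// Matches nothing, because no single review has user=alice AND rating=1
```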
When text is analyzed, it passes through a three-stage pipeline that transforms the original text into searchable terms:
1. Character Filters — Pre-process the character stream (HTML stripping, character replacement)
2. Tokenizer — Split text into individual tokens (words, substrings, etc.)
3. Token Filters — Transform tokens (lowercase, stemming, synonyms, stop word removal)
This pipeline runs at two critical moments:
For search to work correctly, the same analysis (or compatible analysis) must be applied at both times.
```
Input Text: "<p>The QUICK Brown Fox!</p>"

Step 1: Character Filters
  └── html_strip: "The QUICK Brown Fox!"

Step 2: Tokenizer
  └── standard: ["The", "QUICK", "Brown", "Fox"]

Step 3: Token Filters
  └── lowercase: ["the", "quick", "brown", "fox"]
  └── stop: ["quick", "brown", "fox"] (removed "the")
  └── porter_stem: ["quick", "brown", "fox"]

Final Indexed Terms: ["quick", "brown", "fox"]
```

Built-in analyzers:
Elasticsearch provides pre-configured analyzers for common use cases:
standard (default) — Unicode text segmentation, lowercase. Good for most Western languages.
simple — Divides on non-letter characters, lowercase. Useful for simple text.
whitespace — Divides on whitespace only. No case conversion. For code or structured text.
keyword — No analysis at all. The entire input becomes one token.
language analyzers (english, french, german, etc.) — Language-specific stemming and stop words.
pattern — Uses regex to split text. Flexible but can be slow.
```json
// Test how an analyzer processes text
POST /_analyze
{
  "analyzer": "english",
  "text": "The quick brown foxes were running quickly"
}

// Response shows produced tokens:
{
  "tokens": [
    { "token": "quick", "position": 1 },
    { "token": "brown", "position": 2 },
    { "token": "fox", "position": 3 },      // "foxes" stemmed to "fox"
    { "token": "were", "position": 4 },
    { "token": "run", "position": 5 },      // "running" stemmed to "run"
    { "token": "quickli", "position": 6 }   // "quickly" stemmed
  ]
}
```

The _analyze API is essential for understanding how text is processed. Before deploying a new analyzer, test it with representative text samples. Stemming and tokenization can produce surprising results that affect search quality.
Built-in analyzers handle common cases, but production search often requires custom analysis chains. Custom analyzers let you combine character filters, tokenizers, and token filters to match your specific needs.
Example: Product search analyzer
For an e-commerce site, you might want:
```json
PUT /products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_cleaner": {
          "type": "html_strip",
          "escaped_tags": ["b", "em"]       // Keep some tags
        }
      },
      "tokenizer": {
        "product_tokenizer": {
          "type": "pattern",
          "pattern": "[\\W_]+",
          "lowercase": true
        }
      },
      "filter": {
        "product_synonyms": {
          "type": "synonym",
          "synonyms": [
            "laptop, notebook, computer",
            "phone, mobile, smartphone",
            "tv, television"
          ]
        },
        "product_stemmer": {
          "type": "stemmer",
          "language": "light_english"       // Minimal stemming
        }
      },
      "analyzer": {
        "product_analyzer": {
          "type": "custom",
          "char_filter": ["html_cleaner"],
          "tokenizer": "product_tokenizer",
          "filter": [
            "lowercase",
            "product_synonyms",
            "product_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "product_analyzer"
      },
      "description": {
        "type": "text",
        "analyzer": "product_analyzer"
      }
    }
  }
}
```

Common token filters for custom analyzers:
lowercase — Converts to lowercase (essential for case-insensitive search)
stop — Removes stop words (the, a, is, etc.). Configurable by language.
stemmer — Reduces words to root form. Multiple algorithms available (porter_stem, snowball, light_english).
synonym — Expands or replaces terms with synonyms. Can be inline or from a file.
ngram / edge_ngram — Creates character n-grams. Essential for autocomplete and partial matching.
asciifolding — Converts accented characters to ASCII (café → cafe).
word_delimiter — Splits on case transitions, hyphens (WiFi → [Wi, Fi]).
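A quick way to see these filters in action is the _analyze API, which accepts an ad-hoc filter chain rather than a named analyzer. For example, combining lowercase with asciifolding:

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "Café Münchner"
}

// Produced tokens: ["cafe", "munchner"]
```

This kind of one-off test is the fastest way to verify a filter chain before committing it to index settings.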
```json
PUT /products
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete_index": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete_index",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

// "MacBook" becomes: ["m", "ma", "mac", "macb", "macbo", "macboo", "macbook"]
// Query "mac" matches because it's in the indexed terms
```

For autocomplete, use different analyzers for indexing (edge_ngram expansion) and searching (no ngram). This prevents 'ma' from expanding to ['m', 'ma'] at search time, which would match too broadly. Use 'analyzer' for index time and 'search_analyzer' for query time.
By default, Elasticsearch uses dynamic mapping to automatically detect field types when you index documents without a predefined mapping. While convenient for prototyping, dynamic mapping has significant implications for production systems.
How dynamic mapping works:
When Elasticsearch encounters a new field, it infers the type from the JSON value:
| JSON Value | Detected Type |
|---|---|
"text" | text + keyword |
123 | long |
123.45 | float |
true/false | boolean |
"2024-01-15" | date (if matches date format) |
{...} | object |
[...] | multi-value field |
```json
// Index a document with no predefined mapping
PUT /test/_doc/1
{
  "title": "Introduction to Elasticsearch",
  "views": 42,
  "rating": 4.8,
  "published": "2024-01-15",
  "is_premium": false
}

// Check the auto-generated mapping
GET /test/_mapping
{
  "test": {
    "mappings": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "views": { "type": "long" },
        "rating": { "type": "float" },
        "published": { "type": "date" },
        "is_premium": { "type": "boolean" }
      }
    }
  }
}
```

The dangers of dynamic mapping:
Type misdetection — A field's type is locked based on the first document. If the first 'price' value is 10, the field is mapped as long, and later values like 10.99 are coerced to 10 at index time (the original value survives in _source, but queries and aggregations see 10).
Mapping explosion — Each new field adds to the cluster state. Untrusted input with varied field names can create thousands of fields, bloating memory.
Inconsistent analysis — Dynamic mapping uses default analyzers. Critical text fields may not be analyzed appropriately.
Inefficient types — Strings that should be keywords (IDs, categories) become text+keyword, wasting index space.
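The misdetection problem is easy to reproduce; the index names here are hypothetical:

```json
// First document: 10 is detected as long, locking the field type
PUT /shop/_doc/1
{ "price": 10 }

// Later documents with decimals still index, but the value is
// coerced to long: range queries and aggregations see 10, not 10.99
PUT /shop/_doc/2
{ "price": 10.99 }

// The fix: map price explicitly before indexing anything
PUT /shop_v2
{
  "mappings": {
    "properties": {
      "price": { "type": "double" }
    }
  }
}
```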
```json
// Disable dynamic mapping (strict mode)
PUT /products
{
  "mappings": {
    "dynamic": "strict",                // Reject documents with unmapped fields
    "properties": {
      "name": { "type": "text" },
      "price": { "type": "float" }
    }
  }
}

// Now this fails:
PUT /products/_doc/1
{
  "name": "Widget",
  "price": 9.99,
  "category": "tools"                   // ERROR: field not mapped
}

// Alternative: dynamic templates for controlled mapping
PUT /products
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"           // All strings become keywords, not text
          }
        }
      },
      {
        "longs_as_integers": {
          "match_mapping_type": "long",
          "mapping": {
            "type": "integer"           // Use smaller integer type
          }
        }
      }
    ]
  }
}
```

If users can submit arbitrary JSON (API payloads, log fields, etc.), dynamic mapping can be weaponized. Attackers can send documents with thousands of unique field names, causing mapping explosion that degrades cluster performance. Always use explicit mappings or strict mode for external data.
Proper mapping design prevents future problems and optimizes for your actual query patterns. Apply these principles when designing production mappings:
```json
PUT /logs
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "@timestamp": { "type": "date" },
      "level": {
        "type": "keyword"               // INFO, WARN, ERROR
      },
      "service": { "type": "keyword" },
      "trace_id": {
        "type": "keyword",
        "doc_values": false             // Only for filtering, not aggregations
      },
      "message": {
        "type": "text",
        "analyzer": "standard",
        "norms": false                  // We don't need relevance scoring for logs
      },
      "host": { "type": "keyword" },
      "response_time_ms": { "type": "integer" },
      "stack_trace": {
        "type": "text",
        "index": false,                 // Only retrieved, never searched
        "store": true
      },
      "metadata": {
        "type": "object",
        "enabled": false                // Store raw JSON, don't index
      }
    }
  }
}
```

Understanding indexing options:
index: true/false — Whether the field is searchable. Setting false saves index space for fields only used for retrieval.
doc_values: true/false — Whether the field supports sorting/aggregations. Setting false saves disk space for fields only used in queries.
store: true/false — Whether the field is stored separately for retrieval. By default, fields are retrieved from _source. Storing separately can be faster for large documents when you only need specific fields.
norms: true/false — Whether length normalization is stored (for relevance scoring). Disable for fields where document length shouldn't affect scoring.
enabled: true/false — For object types, whether the contents are indexed at all. Disabled objects are stored in _source but not searchable.
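With the logs mapping above, stack_trace is stored but not indexed, so it can still be fetched explicitly without loading the full _source of each hit:

```json
GET /logs/_search
{
  "_source": false,
  "stored_fields": ["stack_trace"],
  "query": {
    "term": { "level": "ERROR" }
  }
}
```

Skipping _source retrieval like this can help when documents are large and you only need one stored field.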
For time-based indexes (logs-2024.01.15, logs-2024.01.16), use index templates to ensure consistent mappings across all indexes. Templates apply automatically when matching indexes are created. Combine with component templates for reusable mapping fragments.
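A minimal sketch of this pattern using the composable template APIs; the template and component names are illustrative:

```json
// Reusable mapping fragment
PUT /_component_template/logs-mappings
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "level":      { "type": "keyword" },
        "message":    { "type": "text" }
      }
    }
  }
}

// Template applied to every index matching logs-*
PUT /_index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "composed_of": ["logs-mappings"]
}

// Creating logs-2024.01.17 now picks up the mappings automatically
```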
Global applications must search content in multiple languages, each with different tokenization rules, stemming algorithms, and stop words. There are several approaches to multi-language search in Elasticsearch.
Language-specific analyzers:
Elasticsearch provides built-in analyzers for 30+ languages. Each handles language-specific challenges:
Strategy 1: Per-field language assignment
If you know the language of each field, use language-specific analyzers directly:
```json
PUT /articles
{
  "mappings": {
    "properties": {
      "title_en": { "type": "text", "analyzer": "english" },
      "title_de": { "type": "text", "analyzer": "german" },
      "title_fr": { "type": "text", "analyzer": "french" },
      "language": { "type": "keyword" }
    }
  }
}
```

Strategy 2: Multi-field with multiple analyzers
When the same content might be searched in different language contexts:
```json
PUT /products
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "standard",         // Default: no language-specific processing
        "fields": {
          "en": { "type": "text", "analyzer": "english" },
          "de": { "type": "text", "analyzer": "german" },
          "stemmed": {
            "type": "text",
            // Hypothetical custom analyzer built on the porter_stem
            // token filter (defined in index settings) for generic stemming
            "analyzer": "generic_stemmer"
          }
        }
      }
    }
  }
}

// Query spanning multiple analyzers
GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "laufen",
      "fields": ["description^1", "description.de^2", "description.en"]
    }
  }
}
```

Strategy 3: Separate indexes by language
For large-scale multi-language deployments, consider separate indexes per language:
`articles-en`, `articles-de`, `articles-fr`

Language detection:
For content without known language, Elasticsearch doesn't have built-in language detection. You must detect language in your application (using libraries like langdetect) and either route to appropriate indexes or populate a language field for filtering.
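Once a language field is populated by the application at index time, it can route queries to the matching analyzer-specific field. Using the /articles mapping from Strategy 1:

```json
GET /articles/_search
{
  "query": {
    "bool": {
      "must":   { "match": { "title_de": "laufen" } },
      "filter": { "term":  { "language": "de" } }
    }
  }
}
```

The filter clause keeps results to German documents while the match clause uses the German-analyzed field for stemming.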
Chinese, Japanese, and Korean don't use spaces between words, and the standard tokenizer falls back to splitting such text into single characters, which hurts relevance. Use language-specific analysis plugins (kuromoji for Japanese, nori for Korean, smartcn for Chinese) or n-gram tokenizers for these languages. The icu_analyzer plugin also provides excellent cross-language tokenization and normalization.
Mapping and analysis determine how searchable your data becomes. Poor decisions here cause search results that frustrate users—even when the right documents exist. Let's consolidate the key principles:

- Define explicit mappings before indexing production data; changing a field's type requires a full reindex.
- Use the multi-field pattern when a field needs both full-text search and exact matching or aggregations.
- Test analyzers with the _analyze API before deploying them.
- Apply the same (or compatible) analysis at index time and search time.
- Use strict mode or dynamic templates for any index that receives external data.
What's next:
With data properly mapped and analyzed, we can explore Query DSL—Elasticsearch's powerful query language. You'll learn to construct queries that leverage your mapping design to find exactly what users need, from simple term queries to complex boolean combinations with scoring, filtering, and aggregations.
You now understand Elasticsearch's mapping and analysis systems—the foundation of search relevance. Next, we explore Query DSL to construct powerful, precise queries against your analyzed data.