When you search for 'running shoes' in an e-commerce application, you expect to find products listed as 'Running Shoes', 'running shoe', 'RUNNING SHOES', and even 'Runner's footwear'. This seemingly simple expectation requires sophisticated text processing that happens at two critical moments: when documents are indexed and when queries are executed.
This text processing pipeline—called analysis—transforms raw text into normalized, searchable terms. The mapping defines how each field in your documents should be analyzed, stored, and made searchable.
Mapping and analysis are where Elasticsearch's search magic happens. Get them right, and your search feels intuitive and comprehensive. Get them wrong, and users search for products that exist but never appear in results.
By the end of this page, you will understand: Elasticsearch data types and when to use each; the analysis pipeline (character filters, tokenizers, token filters); how to create custom analyzers; dynamic vs explicit mapping strategies; and common patterns for multi-language and faceted search.
A mapping defines how a document and its fields are stored and indexed. It specifies:

- The data type of each field (text, keyword, date, nested, and so on)
- How text fields are analyzed into searchable terms
- Per-field indexing options, such as whether a field is searchable, sortable, or stored
Mappings are defined per index. Unlike relational databases, Elasticsearch was originally designed to be schema-flexible—you could index documents with any structure. However, this flexibility came with trade-offs, and modern best practices strongly favor explicit mappings.
```json
GET /products/_mapping

{
  "products": {
    "mappings": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer": "standard",
          "fields": {
            "keyword": { "type": "keyword" }
          }
        },
        "description": { "type": "text", "analyzer": "english" },
        "price": { "type": "float" },
        "category": { "type": "keyword" },
        "tags": { "type": "keyword" },
        "created_at": { "type": "date" },
        "in_stock": { "type": "boolean" },
        "reviews": {
          "type": "nested",
          "properties": {
            "user": { "type": "keyword" },
            "rating": { "type": "integer" },
            "comment": { "type": "text" }
          }
        }
      }
    }
  }
}
```

Mapping immutability:
Once a field is mapped, its type cannot be changed. This is fundamental: changing how data is indexed would invalidate existing indexed terms, creating inconsistent search behavior.
If you need to change a field's mapping, you must:

1. Create a new index with the corrected mapping
2. Reindex all existing data into the new index
3. Switch your application (or an index alias) to point at the new index
This immutability makes upfront mapping design critical. Mistakes are expensive to fix.
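The reindex workflow can be sketched with standard APIs; the index and alias names below are illustrative:

```json
// 1. Create a new index with the corrected mapping
PUT /products_v2
{
  "mappings": {
    "properties": {
      "price": { "type": "double" }
    }
  }
}

// 2. Copy documents from the old index into the new one
POST /_reindex
{
  "source": { "index": "products" },
  "dest":   { "index": "products_v2" }
}

// 3. Repoint the alias in one atomic step
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products",    "alias": "products_live" } },
    { "add":    { "index": "products_v2", "alias": "products_live" } }
  ]
}
```

If your application always queries through the alias, the swap is invisible to users.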
Never index production data with default mappings and 'fix it later.' The fix requires full reindexing, which is time-consuming and disruptive. Invest time in mapping design before your first document.
Elasticsearch provides numerous field types, each optimized for different data patterns and query types. Choosing the correct type affects searchability, storage efficiency, and query performance.
The two most important text-related types:
The multi-field pattern:
Often you need both text search and exact matching on the same field. The multi-field pattern solves this:
```json
{
  "properties": {
    "product_name": {
      "type": "text",                         // Full-text search
      "analyzer": "standard",
      "fields": {
        "keyword": {
          "type": "keyword"                   // Exact matching, aggregations
        },
        "autocomplete": {
          "type": "text",                     // Edge n-gram for typeahead
          "analyzer": "autocomplete_analyzer"
        }
      }
    }
  }
}

// Query examples:
// Full-text:   "query": { "match": { "product_name": "running shoes" }}
// Exact:       "query": { "term": { "product_name.keyword": "Nike Air Max" }}
// Aggregation: "aggs": { "by_name": { "terms": { "field": "product_name.keyword" }}}
```

| Type | Use Case | Query Types | Notes |
|---|---|---|---|
| text | Full-text search | match, match_phrase, fuzzy | Analyzed, tokenized |
| keyword | Exact values, filtering, sorting | term, terms, range | Not analyzed, max 32,766 bytes |
| long / integer / short / byte | Whole numbers | term, range, aggregations | Choose smallest type that fits |
| float / double / half_float | Decimal numbers | term, range, aggregations | half_float saves space for less precision |
| boolean | True/false values | term | Stored as 'true' or 'false' |
| date | Dates and times | range, date histogram | ISO8601, epoch, or custom format |
| object | Nested JSON objects | Flattened dot notation | Fields merged at document level |
| nested | Arrays of objects | Nested queries | Preserves array object boundaries |
| geo_point | Lat/lon coordinates | geo_distance, geo_bounding_box | For location searches |
| ip | IPv4/IPv6 addresses | term, range, CIDR queries | Stored efficiently |
When you index an array of objects as 'object' type, Elasticsearch flattens them—losing the association between fields within each object. Use 'nested' type when you need to query individual objects within an array. This has performance implications, so use sparingly.
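To see why flattening matters, consider a hypothetical document with two reviews (this sketch assumes the nested `reviews` mapping shown earlier):

```json
// Document with two review objects
PUT /products/_doc/1
{
  "reviews": [
    { "user": "alice", "rating": 5 },
    { "user": "bob",   "rating": 1 }
  ]
}

// As 'object' type, the array would be flattened into:
//   reviews.user:   ["alice", "bob"]
//   reviews.rating: [5, 1]
// so a query for user=alice AND rating=1 would wrongly match.

// With 'nested' type, each object is matched as its own unit:
GET /products/_search
{
  "query": {
    "nested": {
      "path": "reviews",
      "query": {
        "bool": {
          "must": [
            { "term": { "reviews.user": "alice" } },
            { "term": { "reviews.rating": 1 } }
          ]
        }
      }
    }
  }
}
// Matches nothing, because no single review has user=alice AND rating=1
```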
When text is analyzed, it passes through a three-stage pipeline that transforms the original text into searchable terms:
1. Character Filters — Pre-process the character stream (HTML stripping, character replacement)
2. Tokenizer — Split text into individual tokens (words, substrings, etc.)
3. Token Filters — Transform tokens (lowercase, stemming, synonyms, stop word removal)
This pipeline runs at two critical moments:
For search to work correctly, the same analysis (or compatible analysis) must be applied at both times.
```
Input Text: "<p>The QUICK Brown Fox!</p>"

Step 1: Character Filters
  └── html_strip: "The QUICK Brown Fox!"

Step 2: Tokenizer
  └── standard: ["The", "QUICK", "Brown", "Fox"]

Step 3: Token Filters
  └── lowercase: ["the", "quick", "brown", "fox"]
  └── stop: ["quick", "brown", "fox"] (removed "the")
  └── porter_stem: ["quick", "brown", "fox"]

Final Indexed Terms: ["quick", "brown", "fox"]
```

Built-in analyzers:
Elasticsearch provides pre-configured analyzers for common use cases:
standard (default) — Unicode text segmentation, lowercase. Good for most Western languages.
simple — Divides on non-letter characters, lowercase. Useful for simple text.
whitespace — Divides on whitespace only. No case conversion. For code or structured text.
keyword — No analysis at all. The entire input becomes one token.
language analyzers (english, french, german, etc.) — Language-specific stemming and stop words.
pattern — Uses regex to split text. Flexible but can be slow.
```json
// Test how an analyzer processes text
POST /_analyze
{
  "analyzer": "english",
  "text": "The quick brown foxes were running quickly"
}

// Response shows produced tokens:
{
  "tokens": [
    { "token": "quick", "position": 1 },
    { "token": "brown", "position": 2 },
    { "token": "fox", "position": 3 },      // "foxes" stemmed to "fox"
    { "token": "were", "position": 4 },
    { "token": "run", "position": 5 },      // "running" stemmed to "run"
    { "token": "quickli", "position": 6 }   // "quickly" stemmed
  ]
}
```

The _analyze API is essential for understanding how text is processed. Before deploying a new analyzer, test it with representative text samples. Stemming and tokenization can produce surprising results that affect search quality.
Built-in analyzers handle common cases, but production search often requires custom analysis chains. Custom analyzers let you combine character filters, tokenizers, and token filters to match your specific needs.
Example: Product search analyzer
For an e-commerce site, you might want:
```json
PUT /products
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_cleaner": {
          "type": "html_strip",
          "escaped_tags": ["b", "em"]       // Keep some tags
        }
      },
      "tokenizer": {
        "product_tokenizer": {
          "type": "pattern",
          "pattern": "[\\W_]+",
          "lowercase": true
        }
      },
      "filter": {
        "product_synonyms": {
          "type": "synonym",
          "synonyms": [
            "laptop, notebook, computer",
            "phone, mobile, smartphone",
            "tv, television"
          ]
        },
        "product_stemmer": {
          "type": "stemmer",
          "language": "light_english"       // Minimal stemming
        }
      },
      "analyzer": {
        "product_analyzer": {
          "type": "custom",
          "char_filter": ["html_cleaner"],
          "tokenizer": "product_tokenizer",
          "filter": [
            "lowercase",
            "product_synonyms",
            "product_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "product_analyzer"
      },
      "description": {
        "type": "text",
        "analyzer": "product_analyzer"
      }
    }
  }
}
```

Common token filters for custom analyzers:
lowercase — Converts to lowercase (essential for case-insensitive search)
stop — Removes stop words (the, a, is, etc.). Configurable by language.
stemmer — Reduces words to root form. Multiple algorithms available (porter_stem, snowball, light_english).
synonym — Expands or replaces terms with synonyms. Can be inline or from a file.
ngram / edge_ngram — Creates character n-grams. Essential for autocomplete and partial matching.
asciifolding — Converts accented characters to ASCII (café → cafe).
word_delimiter — Splits on case transitions, hyphens (WiFi → [Wi, Fi]).
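A quick way to see these filters in action is the _analyze API, which accepts an ad-hoc filter chain rather than a named analyzer. For example, combining lowercase with asciifolding:

```json
POST /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding"],
  "text": "Café Münchner"
}

// Produced tokens: ["cafe", "munchner"]
```

This kind of one-off test is the fastest way to verify a filter chain before committing it to index settings.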
```json
PUT /products
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete_index": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete_index",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

// "MacBook" becomes: ["m", "ma", "mac", "macb", "macbo", "macboo", "macbook"]
// Query "mac" matches because it's in the indexed terms
```

For autocomplete, use different analyzers for indexing (edge_ngram expansion) and searching (no ngram). This prevents 'ma' from expanding to ['m', 'ma'] at search time, which would match too broadly. Use 'analyzer' for index time and 'search_analyzer' for query time.
By default, Elasticsearch uses dynamic mapping to automatically detect field types when you index documents without a predefined mapping. While convenient for prototyping, dynamic mapping has significant implications for production systems.
How dynamic mapping works:
When Elasticsearch encounters a new field, it infers the type from the JSON value:
| JSON Value | Detected Type |
|---|---|
"text" | text + keyword |
123 | long |
123.45 | float |
true/false | boolean |
"2024-01-15" | date (if matches date format) |
{...} | object |
[...] | multi-value field |
```json
// Index a document with no predefined mapping
PUT /test/_doc/1
{
  "title": "Introduction to Elasticsearch",
  "views": 42,
  "rating": 4.8,
  "published": "2024-01-15",
  "is_premium": false
}

// Check the auto-generated mapping
GET /test/_mapping
{
  "test": {
    "mappings": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "views": { "type": "long" },
        "rating": { "type": "float" },
        "published": { "type": "date" },
        "is_premium": { "type": "boolean" }
      }
    }
  }
}
```

The dangers of dynamic mapping:
Type misdetection — A field's type is locked based on the first document. If the first 'price' value is 10, the field is mapped as long, and later values like 10.99 are coerced to 10 at index time (the original value survives in _source, but queries and aggregations see 10).
Mapping explosion — Each new field adds to the cluster state. Untrusted input with varied field names can create thousands of fields, bloating memory.
Inconsistent analysis — Dynamic mapping uses default analyzers. Critical text fields may not be analyzed appropriately.
Inefficient types — Strings that should be keywords (IDs, categories) become text+keyword, wasting index space.
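The misdetection problem is easy to reproduce; the index names here are hypothetical:

```json
// First document: 10 is detected as long, locking the field type
PUT /shop/_doc/1
{ "price": 10 }

// Later documents with decimals still index, but the value is
// coerced to long: range queries and aggregations see 10, not 10.99
PUT /shop/_doc/2
{ "price": 10.99 }

// The fix: map price explicitly before indexing anything
PUT /shop_v2
{
  "mappings": {
    "properties": {
      "price": { "type": "double" }
    }
  }
}
```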
```json
// Disable dynamic mapping (strict mode)
PUT /products
{
  "mappings": {
    "dynamic": "strict",                // Reject documents with unmapped fields
    "properties": {
      "name": { "type": "text" },
      "price": { "type": "float" }
    }
  }
}

// Now this fails:
PUT /products/_doc/1
{
  "name": "Widget",
  "price": 9.99,
  "category": "tools"                   // ERROR: field not mapped
}

// Alternative: dynamic templates for controlled mapping
PUT /products
{
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"           // All strings become keywords, not text
          }
        }
      },
      {
        "longs_as_integers": {
          "match_mapping_type": "long",
          "mapping": {
            "type": "integer"           // Use smaller integer type
          }
        }
      }
    ]
  }
}
```

If users can submit arbitrary JSON (API payloads, log fields, etc.), dynamic mapping can be weaponized. Attackers can send documents with thousands of unique field names, causing mapping explosion that degrades cluster performance. Always use explicit mappings or strict mode for external data.
Proper mapping design prevents future problems and optimizes for your actual query patterns. Apply these principles when designing production mappings:
```json
PUT /logs
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "@timestamp": { "type": "date" },
      "level": {
        "type": "keyword"               // INFO, WARN, ERROR
      },
      "service": { "type": "keyword" },
      "trace_id": {
        "type": "keyword",
        "doc_values": false             // Only for filtering, not aggregations
      },
      "message": {
        "type": "text",
        "analyzer": "standard",
        "norms": false                  // We don't need relevance scoring for logs
      },
      "host": { "type": "keyword" },
      "response_time_ms": { "type": "integer" },
      "stack_trace": {
        "type": "text",
        "index": false,                 // Only retrieved, never searched
        "store": true
      },
      "metadata": {
        "type": "object",
        "enabled": false                // Store raw JSON, don't index
      }
    }
  }
}
```

Understanding indexing options:
index: true/false — Whether the field is searchable. Setting false saves index space for fields only used for retrieval.
doc_values: true/false — Whether the field supports sorting/aggregations. Setting false saves disk space for fields only used in queries.
store: true/false — Whether the field is stored separately for retrieval. By default, fields are retrieved from _source. Storing separately can be faster for large documents when you only need specific fields.
norms: true/false — Whether length normalization is stored (for relevance scoring). Disable for fields where document length shouldn't affect scoring.
enabled: true/false — For object types, whether the contents are indexed at all. Disabled objects are stored in _source but not searchable.
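With the logs mapping above, stack_trace is stored but not indexed, so it can still be fetched explicitly without loading the full _source of each hit:

```json
GET /logs/_search
{
  "_source": false,
  "stored_fields": ["stack_trace"],
  "query": {
    "term": { "level": "ERROR" }
  }
}
```

Skipping _source retrieval like this can help when documents are large and you only need one stored field.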
For time-based indexes (logs-2024.01.15, logs-2024.01.16), use index templates to ensure consistent mappings across all indexes. Templates apply automatically when matching indexes are created. Combine with component templates for reusable mapping fragments.
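A minimal sketch of this pattern using the composable template APIs; the template and component names are illustrative:

```json
// Reusable mapping fragment
PUT /_component_template/logs-mappings
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "level":      { "type": "keyword" },
        "message":    { "type": "text" }
      }
    }
  }
}

// Template applied to every index matching logs-*
PUT /_index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "composed_of": ["logs-mappings"]
}

// Creating logs-2024.01.17 now picks up the mappings automatically
```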
Global applications must search content in multiple languages, each with different tokenization rules, stemming algorithms, and stop words. There are several approaches to multi-language search in Elasticsearch.
Language-specific analyzers:
Elasticsearch provides built-in analyzers for 30+ languages. Each handles language-specific challenges:
Strategy 1: Per-field language assignment
If you know the language of each field, use language-specific analyzers directly:
```json
PUT /articles
{
  "mappings": {
    "properties": {
      "title_en": { "type": "text", "analyzer": "english" },
      "title_de": { "type": "text", "analyzer": "german" },
      "title_fr": { "type": "text", "analyzer": "french" },
      "language": { "type": "keyword" }
    }
  }
}
```

Strategy 2: Multi-field with multiple analyzers
When the same content might be searched in different language contexts:
```json
PUT /products
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "standard",         // Default: no language-specific processing
        "fields": {
          "en": { "type": "text", "analyzer": "english" },
          "de": { "type": "text", "analyzer": "german" },
          "stemmed": {
            "type": "text",
            // Hypothetical custom analyzer built on the porter_stem
            // token filter (defined in index settings) for generic stemming
            "analyzer": "generic_stemmer"
          }
        }
      }
    }
  }
}

// Query spanning multiple analyzers
GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "laufen",
      "fields": ["description^1", "description.de^2", "description.en"]
    }
  }
}
```

Strategy 3: Separate indexes by language
For large-scale multi-language deployments, consider separate indexes per language:
`articles-en`, `articles-de`, `articles-fr`

Language detection:
For content without known language, Elasticsearch doesn't have built-in language detection. You must detect language in your application (using libraries like langdetect) and either route to appropriate indexes or populate a language field for filtering.
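Once a language field is populated by the application at index time, it can route queries to the matching analyzer-specific field. Using the /articles mapping from Strategy 1:

```json
GET /articles/_search
{
  "query": {
    "bool": {
      "must":   { "match": { "title_de": "laufen" } },
      "filter": { "term":  { "language": "de" } }
    }
  }
}
```

The filter clause keeps results to German documents while the match clause uses the German-analyzed field for stemming.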
Chinese, Japanese, and Korean don't use spaces between words, and the standard tokenizer falls back to splitting such text into single characters, which hurts relevance. Use language-specific analysis plugins (kuromoji for Japanese, nori for Korean, smartcn for Chinese) or n-gram tokenizers for these languages. The icu_analyzer plugin also provides excellent cross-language tokenization and normalization.
Mapping and analysis determine how searchable your data becomes. Poor decisions here cause search results that frustrate users—even when the right documents exist. Let's consolidate the key principles:

- Define explicit mappings before indexing production data; changing a field's type requires a full reindex.
- Use the multi-field pattern when a field needs both full-text search and exact matching or aggregations.
- Test analyzers with the _analyze API before deploying them.
- Apply the same (or compatible) analysis at index time and search time.
- Use strict mode or dynamic templates for any index that receives external data.
What's next:
With data properly mapped and analyzed, we can explore Query DSL—Elasticsearch's powerful query language. You'll learn to construct queries that leverage your mapping design to find exactly what users need, from simple term queries to complex boolean combinations with scoring, filtering, and aggregations.
You now understand Elasticsearch's mapping and analysis systems—the foundation of search relevance. Next, we explore Query DSL to construct powerful, precise queries against your analyzed data.