When GitHub needed to search across billions of lines of code in milliseconds, they turned to Elasticsearch. When Wikipedia required full-text search across articles in 300+ languages, Elasticsearch powered the solution. When Netflix needed to analyze petabytes of logs to understand service behavior, Elasticsearch became their operational intelligence backbone.
Elasticsearch has evolved from a simple search server into one of the most widely-deployed distributed systems in the world. It powers search boxes, log analytics platforms, security information systems, and real-time recommendation engines at organizations ranging from startups to Fortune 500 enterprises.
But Elasticsearch's apparent simplicity—just send JSON, get results—masks a sophisticated distributed architecture that orchestrates data across clusters of machines, handles node failures transparently, and delivers sub-second query responses over terabytes of data.
By the end of this page, you will understand Elasticsearch's distributed architecture from first principles: how clusters self-organize, how nodes specialize into roles, how data flows through the system, and how the architecture enables both horizontal scaling and fault tolerance. This foundation is essential for every subsequent topic in this module.
To truly understand Elasticsearch's architecture, we must first understand what problems it was designed to solve—and why traditional databases fall short.
The full-text search challenge:
Traditional relational databases are optimized for exact matches and structured queries. When you search for WHERE id = 42, the database uses B-tree indexes to locate the row in O(log n) time. But what happens when you need to find all documents containing 'distributed systems' regardless of word order, case, or even spelling variations?
SQL's LIKE '%distributed%' forces a full table scan—O(n) complexity that collapses at scale. Even with full-text indexes (like PostgreSQL's tsvector), relational databases weren't architected for the access patterns that search requires: high read throughput, complex text analysis, relevance scoring, and faceted aggregations.
| Requirement | Traditional RDBMS | Elasticsearch |
|---|---|---|
| Full-text search | LIKE operator (slow) or limited FTS | Native, highly optimized |
| Relevance scoring | Not native, requires application logic | Built-in TF-IDF, BM25 |
| Fuzzy matching | Complex regex or extensions | Native fuzzy queries |
| Faceted search | Multiple GROUP BY queries | Single aggregation query |
| Schema flexibility | Fixed schema, migrations required | Dynamic mapping |
| Horizontal scaling | Complex sharding, read replicas | Native distributed architecture |
| Real-time analytics | OLTP/OLAP separation | Near real-time on same cluster |
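To make the first row of that comparison concrete, here is a hedged sketch of a full-text query in Elasticsearch's Query DSL. The `articles` index and its `body` field are hypothetical; `match` analyzes the query text and scores results by relevance, and `fuzziness` tolerates small spelling variations.

```
GET /articles/_search
{
  "query": {
    "match": {
      "body": {
        "query": "distributed systems",
        "fuzziness": "AUTO"
      }
    }
  }
}
```

The equivalent SQL `LIKE '%distributed systems%'` would scan every row and still miss reordered words or near-misspellings.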
Elasticsearch's origin story:
Elasticsearch emerged in 2010, created by Shay Banon as a distributed layer over Apache Lucene—the powerful but single-node search library. Banon's insight was that while Lucene solved the algorithmic problem of full-text search (inverted indexes, text analysis, scoring), it didn't solve the systems problem of distributed search.
Elasticsearch added:
- A RESTful JSON API over HTTP, so any client that can send HTTP requests can index and search
- Automatic sharding and replication to distribute data across nodes
- Cluster coordination: node discovery, master election, and failure handling
- Index, mapping, and lifecycle management on top of raw Lucene indexes
The result was a system that made Lucene's power accessible to any developer who could write HTTP requests.
Think of Lucene as the engine and Elasticsearch as the car. Lucene provides the indexing and search algorithms—it's incredibly fast and sophisticated. Elasticsearch wraps Lucene with networking, distribution, coordination, and management features that transform a library into a production-ready distributed system.
At its core, Elasticsearch operates as a cluster—a collection of nodes that work together to store data, answer queries, and maintain system health. Understanding cluster mechanics is fundamental to designing robust Elasticsearch deployments.
What is a cluster?
A cluster is one or more nodes (servers) that collectively hold your data and provide indexing and search capabilities. Every cluster has a unique name—by default, 'elasticsearch'—and nodes join clusters by sharing the same cluster name.
Cluster membership is dynamic: nodes can join and leave at any time. When nodes join, the cluster automatically redistributes data to balance load. When nodes fail, the cluster detects the failure and recovers by promoting replicas and redistributing workload.
```
// GET /_cluster/health
{
  "cluster_name": "production-search",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 12,
  "number_of_data_nodes": 9,
  "active_primary_shards": 150,
  "active_shards": 450,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100.0
}
```
Cluster health states:
Cluster health is reported as a simple color code, but these colors represent critical operational states:
🟢 Green — All primary and replica shards are allocated. The cluster is fully operational and fault-tolerant. This is your target state in production.
🟡 Yellow — All primary shards are allocated, but some replicas are missing. The cluster functions normally, but fault tolerance is reduced. A node failure could cause data unavailability.
🔴 Red — Some primary shards are unallocated. Data is missing and queries will return incomplete results. This requires immediate attention.
Production clusters should maintain green status. Yellow is acceptable for development environments or briefly during rebalancing operations. Red indicates a serious problem requiring immediate investigation.
Many teams tolerate yellow clusters because 'everything still works.' This is dangerous. Yellow means you're one node failure away from data loss or unavailability. Always investigate why replicas aren't allocated—usually insufficient nodes, disk space issues, or allocation filtering rules.
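When a cluster goes yellow (or red), the allocation explain API reports why a shard is unassigned. A minimal sketch: called with no body, it explains the first unassigned shard it finds; the response fields shown in the comments are abridged and illustrative.

```
GET /_cluster/allocation/explain

// Abridged response fields you would typically inspect:
// "current_state": "unassigned",
// "unassigned_info": { "reason": "NODE_LEFT", ... },
// "can_allocate": "no",
// "allocate_explanation": "cannot allocate because ..."
```

The `allocate_explanation` and per-node decisions usually point directly at the cause: too few nodes, disk watermarks, or allocation filtering rules.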
Node discovery and cluster formation:
When an Elasticsearch node starts, it must either form a new cluster or join an existing one. This process involves several mechanisms:
Seed hosts — Each node is configured with a list of 'seed hosts'—addresses of existing cluster members. The new node contacts these hosts to discover the current cluster state.
Master election — If no existing cluster is found (or the node is the first), master election occurs. Elasticsearch uses a consensus algorithm to elect a single master node that manages cluster-wide operations.
Cluster state propagation — Once connected, the master broadcasts the cluster state to all nodes. This state includes index mappings, shard allocations, and node membership—everything nodes need to route requests correctly.
The discovery process is designed to be resilient. Nodes can join and leave without disrupting ongoing operations, and the system handles network partitions gracefully (though with important constraints we'll discuss later).
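A minimal elasticsearch.yml sketch of these discovery settings, with hypothetical hosts and node names. `discovery.seed_hosts` is how a starting node finds existing members; `cluster.initial_master_nodes` is only used when bootstrapping a brand-new cluster and should be removed afterwards.

```
# elasticsearch.yml (illustrative values)
cluster.name: production-search

# Addresses this node contacts to discover an existing cluster
discovery.seed_hosts:
  - 10.0.0.101
  - 10.0.0.102
  - 10.0.0.103

# Only consulted the very first time the cluster forms
cluster.initial_master_nodes:
  - master-node-1
  - master-node-2
  - master-node-3
```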
In small deployments, every Elasticsearch node does everything: stores data, answers queries, and participates in cluster coordination. But as clusters grow, role specialization becomes essential for performance, stability, and cost efficiency.
Elasticsearch supports several distinct node roles, each optimized for specific responsibilities:
```
# Master-eligible node (no data storage)
node.roles: [ master ]

# Data-only node (no cluster management)
node.roles: [ data ]

# Dedicated coordinating node (routes requests only)
node.roles: [ ]

# Multi-role node (common for smaller clusters)
node.roles: [ master, data, ingest ]

# Dedicated ingest node
node.roles: [ ingest ]

# Hot-tier data node (for time-series data)
node.roles: [ data_hot, data_content ]

# Warm-tier data node (older, less-accessed data)
node.roles: [ data_warm ]
```
Why specialize?
Role specialization solves several production challenges:
Stability — Master nodes manage cluster state. If a data node becomes overwhelmed with heavy queries, it shouldn't affect cluster coordination. Dedicated masters with modest resources (but stable network) keep the cluster stable even during data node stress.
Performance isolation — Ingest pipelines can consume significant CPU (e.g., running NLP models). Dedicated ingest nodes prevent indexing operations from competing with query performance on data nodes.
Cost optimization — Master nodes need fast networking but modest storage. Data nodes need massive storage but moderate CPU. By specializing, you right-size hardware for each role instead of over-provisioning everything.
Scaling dimensions — If query throughput is your bottleneck, add coordinating nodes. If storage is full, add data nodes. Role separation enables targeted scaling.
A common mistake is running masters on tiny instances. While masters don't store data, they do handle cluster state—which includes mappings for every field in every index. Clusters with thousands of fields can have cluster states exceeding 1GB. Allocate at least 4GB heap to master nodes, and ensure they have stable, low-latency networking.
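To see how roles are actually distributed in a running cluster, the _cat/nodes API lists each node's roles and marks the elected master. The request is standard; the output below is illustrative, with hypothetical node names (roles are abbreviated, e.g. `m` for master, `d` for data, `-` for coordinating-only).

```
GET /_cat/nodes?v&h=name,node.role,master,heap.percent,disk.used_percent

name           node.role master heap.percent disk.used_percent
master-node-1  m         *      31           12.1
master-node-2  m         -      28           11.8
data-node-1    d         -      64           71.4
coord-node-1   -         -      22            8.0
```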
The master node is arguably the most critical component of an Elasticsearch cluster. It doesn't store data or answer search queries, but it makes decisions that affect every operation in the cluster.
Master responsibilities:
Cluster state management — The master maintains the authoritative view of the entire cluster: which nodes are alive, which indexes exist, how shards are distributed, what mappings are defined. This 'cluster state' is propagated to every node.
Index operations — Creating, updating, and deleting indexes. Defining mappings and settings. These structural changes must be coordinated cluster-wide.
Shard allocation — Deciding which node holds which shard. When nodes join or leave, the master orchestrates data redistribution.
Node membership — Detecting node failures (via heartbeats) and triggering recovery procedures.
```
// GET /_cat/master?v
id                     host       ip         node
YkMF4Hw8RWaHU_lTGUxP8g 10.0.0.101 10.0.0.101 master-node-1

// GET /_cluster/state/master_node?pretty
{
  "cluster_name": "production-search",
  "cluster_uuid": "abc123...",
  "master_node": "YkMF4Hw8RWaHU_lTGUxP8g"
}
```
Master election and quorum:
While only one master is active at a time, clusters should have multiple master-eligible nodes. If the active master fails, the remaining master-eligible nodes elect a new one.
Election requires a quorum—a majority of master-eligible nodes must agree. This prevents 'split-brain' scenarios where network partitions could lead to two masters making conflicting decisions.
For a cluster with N master-eligible nodes, a quorum is ⌊N/2⌋ + 1:
- N = 3 → quorum of 2, tolerates the loss of 1 master-eligible node
- N = 4 → quorum of 3, still tolerates only 1 loss
- N = 5 → quorum of 3, tolerates 2 losses
This is why production clusters should have an odd number of master-eligible nodes (typically 3 or 5). Even numbers offer no additional fault tolerance and can cause ties during elections.
Split-brain occurs when network partitions cause a cluster to split into independent sub-clusters, each electing its own master. Both sub-clusters accept writes, creating divergent data that cannot be reconciled. This is why quorum-based election is non-negotiable—it's better for a minority partition to become unavailable than for two partitions to diverge.
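In recent Elasticsearch versions (7.x and later) quorum handling is automatic, but you can inspect which master-eligible nodes currently form the voting configuration. A sketch, assuming hypothetical node names; the exact response shape can vary slightly by version.

```
// Node IDs currently in the voting configuration
GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config

// Before decommissioning a master-eligible node, exclude it from voting
POST /_cluster/voting_config_exclusions?node_names=master-node-3
```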
Cluster state and its costs:
Every node holds a copy of the cluster state. This state includes:
- Node membership: which nodes are in the cluster and which roles they hold
- Index metadata: settings and mappings for every index
- The routing table: which shards are allocated to which nodes
- Cluster-wide settings and index templates
Cluster state size grows with the number of indexes and the complexity of mappings. Clusters with thousands of indexes or mappings with thousands of fields can have cluster states of several gigabytes.
Large cluster states create operational challenges:
- Every state change must be published to every node, so propagation slows as the state grows
- The state is held in heap on every node—most critically on the master—increasing memory pressure
- Structural changes (index creation, mapping updates) queue behind state publication and become slower
This is a key reason to avoid index explosion—having thousands of small indexes instead of fewer larger ones can destabilize the cluster due to cluster state overhead.
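To gauge how much of the cluster state is mappings, you can filter the state down or pull aggregate counts from cluster stats. A sketch using the standard `filter_path` parameter; the exact paths shown are the parts of the response worth watching.

```
// Just the mapping portion of the cluster state, per index
GET /_cluster/state?filter_path=metadata.indices.*.mappings

// Aggregate counts: number of indexes and total shards
GET /_cluster/stats?filter_path=indices.count,indices.shards.total
```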
Understanding how Elasticsearch organizes data is essential for designing effective schemas and understanding query behavior. The hierarchy moves from abstract containers to individual data units.
The data hierarchy:
```
Cluster
  └── Index (logical namespace)
        └── Shard (distributed unit, Lucene index)
              └── Segment (immutable file)
                    └── Document (JSON record)
                          └── Field (name-value pair)
```
Index: The logical container
An index is a logical namespace that holds a collection of documents with similar characteristics. Think of an index like a database table—it groups related data and defines how that data is analyzed and stored.
Unlike relational tables, Elasticsearch indexes are:
- Schema-flexible: new fields can be added on the fly via dynamic mapping
- Physically split into shards that are distributed across nodes
- Replicated, so the same logical index survives node failures
Indexes have settings (number of shards, replica count, refresh interval) and mappings (field types, analyzers, relationships).
```
PUT /products
{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 1,
    "refresh_interval": "1s",
    "analysis": {
      "analyzer": {
        "product_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "porter_stem"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name":       { "type": "text", "analyzer": "product_analyzer" },
      "price":      { "type": "float" },
      "category":   { "type": "keyword" },
      "created_at": { "type": "date" }
    }
  }
}
```
Shard: The unit of distribution
A shard is the fundamental unit of distribution in Elasticsearch. Each shard is a self-contained Lucene index—it can be indexed, searched, and managed independently.
Shards come in two flavors:
- Primary shards hold the original copy of the data; every document belongs to exactly one primary
- Replica shards are copies of primaries that provide redundancy and serve read traffic
The number of primary shards is fixed at index creation and cannot be changed without reindexing. This makes shard count one of the most important decisions in Elasticsearch design—get it wrong, and you'll need to rebuild your indexes.
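A sketch of that asymmetry, reusing the `products` index from above: replica count can be changed on a live index, while changing the primary count means creating a new index (the hypothetical `products_v2`) and copying the data over.

```
// Replicas can be adjusted at any time
PUT /products/_settings
{
  "index": { "number_of_replicas": 2 }
}

// Primaries cannot: to go from 6 to 12 primaries, create a new index...
PUT /products_v2
{
  "settings": { "number_of_shards": 12, "number_of_replicas": 1 }
}

// ...and reindex the data into it
POST /_reindex
{
  "source": { "index": "products" },
  "dest":   { "index": "products_v2" }
}
```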
Segment: The immutable file
Within each shard, data is stored in segments—immutable files that contain indexed documents. When you index a document, it's first written to an in-memory buffer, then periodically flushed to a new segment on disk (the 'refresh' operation).
Segments are immutable by design:
- Reads never need locks, so searches and writes don't block each other
- The operating system can cache segment files aggressively because they never change
- On-disk structures can be written once in a compact, compressed form and never rewritten in place
When you 'update' a document in Elasticsearch, the old version is marked deleted and a new version is indexed. This is why update-heavy workloads can accumulate deleted documents, requiring background merge operations to reclaim space. Design your data model with this in mind.
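You can watch deleted documents accumulate per segment with the _cat/segments API; the `docs.deleted` column counts documents marked deleted but not yet merged away. The request is standard; the output below is illustrative for the `products` index.

```
GET /_cat/segments/products?v&h=index,shard,segment,docs.count,docs.deleted,size

index    shard segment docs.count docs.deleted size
products 0     _0          120453         8312 412.5mb
products 0     _1            3020           95  11.2mb
```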
Understanding how requests flow through Elasticsearch is essential for debugging performance issues and designing optimal data access patterns. Let's trace both indexing and search operations.
Indexing flow (writing documents):
```
1. Client sends document to any node (coordinating node)
   POST /products/_doc
   {"name": "Laptop", "price": 999}

2. Coordinating node determines target shard:
   shard_id = hash(_routing) % number_of_primary_shards
   (routing defaults to _id)

3. Request forwarded to primary shard's node

4. Primary shard indexes document:
   a. Document added to in-memory buffer and translog
   b. Translog flushed to disk (durability)
   c. On refresh: buffer written to new segment

5. Primary forwards to replica shards in parallel

6. Success returned when replicas acknowledge
   (behavior controlled by wait_for_active_shards)
```
Key insights from indexing flow:
- Any node can accept the request and act as coordinator—clients don't need to know where data lives
- The routing formula depends on the number of primary shards, which is why that number cannot change without reindexing
- Durability (translog) and searchability (refresh into a segment) are separate steps
- Every write costs one operation on the primary plus one per replica
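Because the routing formula hashes `_routing` (the `_id` by default), you can supply a custom routing value to co-locate related documents on one shard. A sketch using a hypothetical `store-42` routing key; `wait_for_active_shards` controls how many shard copies must be available before the write proceeds.

```
PUT /products/_doc/1?routing=store-42&wait_for_active_shards=2
{
  "name": "Laptop",
  "price": 999
}

// Searches that pass the same routing value only touch that one shard
GET /products/_search?routing=store-42
{
  "query": { "match": { "name": "laptop" } }
}
```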
Search flow (reading documents):
```
1. Client sends query to any node (coordinating node)
   GET /products/_search
   {"query": {"match": {"name": "laptop"}}}

2. Coordinating node identifies relevant shards:
   - For this index: shards 0-5
   - Query phase: scatter request to one copy of each shard
     (primary or replica, load-balanced)

3. QUERY PHASE: each shard executes the query locally:
   a. Parse and optimize query
   b. Execute against local segments
   c. Return top N document IDs + scores to coordinator

4. MERGE: coordinating node merges results:
   - Global top N from all shards' results
   - If aggregations: partial aggregation merge

5. FETCH PHASE: coordinator retrieves full documents:
   - Request full documents for the final result set
   - Only for documents that made the final cut

6. Return results to client
```
Two-phase search explained:
Elasticsearch uses a two-phase search strategy to minimize data transfer:
Query phase (scatter): Each shard finds its top N matching documents and returns only IDs and scores. This is lightweight because full document content isn't transferred.
Fetch phase (gather): The coordinator identifies the global top N from all shards' results, then fetches the complete documents only for those final results.
This design dramatically reduces network traffic. If you request 10 results from an index with 6 shards, the fetch phase retrieves 10 documents, not 60.
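The same example as a request, annotated with what each phase moves over the network for the 6-shard `products` index:

```
GET /products/_search
{
  "size": 10,
  "query": { "match": { "name": "laptop" } }
}

// Query phase: each of the 6 shards returns up to 10 (doc ID, score) pairs → at most 60 candidates
// Fetch phase: the coordinator fetches full _source for only the global top 10
```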
Implications for system design:
- Query fan-out grows with shard count: every search touches one copy of each relevant shard, so oversharding inflates per-query work
- The coordinating node performs the merge, so heavy aggregations and large result sets concentrate CPU and memory there
- Deep pagination is expensive: page P forces every shard to return roughly P × page_size candidates, as the warning below explains
Avoid allowing users to paginate deep into result sets (page 1000+). Each page requires holding larger result sets from every shard. Use search_after or scroll APIs for deep traversal. Better yet, design UX that encourages refined searches over endless pagination.
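A hedged sketch of deep traversal with search_after: sort by a stable key, then pass the sort values of the last hit from the previous page into the next request. `created_at` comes from the earlier mapping example; in practice you also want a unique tiebreaker (for example a point-in-time with its implicit `_shard_doc` sort).

```
// Page 1
GET /products/_search
{
  "size": 100,
  "sort": [ { "created_at": "asc" } ],
  "query": { "match": { "name": "laptop" } }
}

// Page 2: search_after takes the sort values returned with the last hit of page 1
GET /products/_search
{
  "size": 100,
  "sort": [ { "created_at": "asc" } ],
  "search_after": [ 1714558500000 ],
  "query": { "match": { "name": "laptop" } }
}
```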
Each Elasticsearch shard is a Lucene index. While Elasticsearch handles distribution and coordination, Lucene handles the actual indexing and search algorithms. Understanding Lucene's role clarifies many Elasticsearch behaviors.
What Lucene provides:
- The inverted index data structure that maps terms to the documents containing them
- Text analysis: tokenization, lowercasing, stemming, and related transformations
- Relevance scoring (TF-IDF, BM25)
- Segment-based storage and the machinery to search across segments
```
Document 1: "The quick brown fox"
Document 2: "The lazy brown dog"
Document 3: "A quick fox jumps"

After analysis (lowercase, tokenization):

Term Dictionary:   Posting Lists:
----------------   --------------
"a"        →       [3]
"brown"    →       [1, 2]
"dog"      →       [2]
"fox"      →       [1, 3]
"jumps"    →       [3]
"lazy"     →       [2]
"quick"    →       [1, 3]
"the"      →       [1, 2]

Query: "quick fox"
1. Look up "quick" → [1, 3]
2. Look up "fox"   → [1, 3]
3. Intersect/score → Documents 1 and 3 match
```
Segment lifecycle:
Lucene's segment-based architecture has important implications:
1. Segments are immutable. Once written, segments never change. Updates are handled by marking old documents as deleted and writing new versions to new segments. This immutability enables:
- Lock-free concurrent searches
- Aggressive filesystem caching of segment files
- Compact, write-once on-disk data structures
2. Segments accumulate over time. Frequent indexing creates many small segments. Too many segments slow searches because each query must check all segments.
3. The merge process consolidates segments. Background merge operations combine small segments into larger ones, removing deleted documents in the process. This is essential for sustained performance but consumes I/O and CPU.
4. Refresh creates visibility. New documents aren't searchable until they're in a segment. The 'refresh' operation writes the in-memory buffer to a new segment, making documents searchable. By default, this happens every second—hence 'near real-time' search.
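A sketch of the two knobs this implies, reusing the `products` index: slowing refresh down during heavy bulk indexing, and asking an individual write to wait until it is searchable.

```
// Refresh less often (or set to -1 to disable) during heavy bulk indexing
PUT /products/_settings
{
  "index": { "refresh_interval": "30s" }
}

// Make this one write visible before the request returns
PUT /products/_doc/2?refresh=wait_for
{
  "name": "Mechanical Keyboard",
  "price": 129
}
```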
For indexes that receive no further writes (historical data, completed imports), use force merge to consolidate into a single segment. This optimizes search performance and reclaims space from deleted documents. Never force merge active indexes—it competes with indexing and can cause massive I/O spikes.
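A sketch of force merging a read-only index, using a hypothetical `old-logs-2023` index; adding a write block first is a common precaution so no new segments appear mid-merge.

```
// Optional: block writes before merging
PUT /old-logs-2023/_settings
{
  "index": { "blocks.write": true }
}

// Consolidate down to a single segment and purge deleted documents
POST /old-logs-2023/_forcemerge?max_num_segments=1
```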
We've covered the foundational architecture of Elasticsearch. Let's consolidate the key principles that should guide your deployment and data modeling decisions:
- A cluster is a set of cooperating nodes; keep it green, and treat yellow as a warning rather than a steady state
- Run an odd number of master-eligible nodes (typically 3) so quorum-based election can prevent split-brain
- Specialize node roles as the cluster grows: dedicated masters for stability, data nodes for storage, coordinating and ingest nodes for isolation
- Primary shard count is fixed at index creation, so plan it deliberately; replicas can be adjusted at any time
- Data lives in immutable Lucene segments: updates are delete-plus-reindex, refresh makes writes searchable, and merges reclaim space
- Searches are two-phase scatter/gather, so shard count, aggregations, and pagination depth all shape query cost
What's next:
Now that we understand Elasticsearch's distributed architecture, we'll dive deeper into shards and replicas—the mechanism that enables both horizontal scaling and fault tolerance. We'll explore shard sizing strategies, replica topology, and the critical decisions that determine long-term cluster health.
You now understand Elasticsearch's distributed architecture: clusters, nodes, roles, data organization, and request flows. This foundation is essential for every subsequent topic—sharding, mapping, Query DSL, and scaling. Next, we examine shards and replicas in detail.