Every distributed data system faces the same fundamental challenge: how do you spread data across multiple machines while maintaining coherent query behavior and surviving failures? Elasticsearch's answer is sharding with replication—a design that enables both horizontal scaling and fault tolerance.
Sharding isn't unique to Elasticsearch. PostgreSQL has table partitioning. MongoDB has sharded clusters. Kafka has partitions. Redis has cluster slots. But each system implements partitioning differently, with distinct trade-offs.
What makes Elasticsearch's sharding model distinctive is its tight coupling with Lucene indexes. Each shard is a complete, self-contained Lucene index. This design choice has profound implications for sizing, allocation, and operational behavior that we'll explore in depth.
By the end of this page, you will understand: the distinction between primary and replica shards; how to determine optimal shard count and size; replication mechanics and consistency guarantees; shard allocation strategies and awareness; and the operational implications of sharding decisions.
When you create an Elasticsearch index, you specify the number of primary shards. This number determines how your data will be partitioned and has permanent implications—you cannot change the primary shard count without reindexing.
What is a primary shard?
A primary shard is the authoritative copy of a portion of an index's data. When a document is indexed, Elasticsearch uses a routing formula to determine which primary shard should hold it:
shard_id = hash(_routing) % number_of_primary_shards
By default, _routing equals the document's _id. This ensures deterministic placement—given a document ID, any node can calculate which shard contains that document without consulting a lookup table.
This hash-based routing is elegant but inflexible. Once shard count is set, changing it would alter the hash results, making existing documents unfindable. Hence the immutability.
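As a concrete sketch (the index, document, and routing values here are hypothetical), the _routing value can also be supplied explicitly, which co-locates related documents on one shard and lets a search target only that shard:

// Index with an explicit routing value instead of the default _id
PUT /products/_doc/1?routing=user-42
{
  "name": "Widget",
  "price": 19
}

// A search carrying the same routing value only queries the shard that user-42 hashes to
GET /products/_search?routing=user-42
{
  "query": { "match": { "name": "widget" } }
}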
// GET /_cat/shards/products?v&h=index,shard,prirep,state,docs,store,node

index    shard prirep state   docs  store  node
products 0     p      STARTED 12847 25.1mb data-node-1
products 0     r      STARTED 12847 25.1mb data-node-3
products 1     p      STARTED 12923 25.4mb data-node-2
products 1     r      STARTED 12923 25.4mb data-node-4
products 2     p      STARTED 12830 25.0mb data-node-3
products 2     r      STARTED 12830 25.0mb data-node-1
products 3     p      STARTED 12901 25.3mb data-node-4
products 3     r      STARTED 12901 25.3mb data-node-2
products 4     p      STARTED 12756 24.9mb data-node-1
products 4     r      STARTED 12756 24.9mb data-node-3
products 5     p      STARTED 12843 25.2mb data-node-2
products 5     r      STARTED 12843 25.2mb data-node-4

// p = primary, r = replica

Each shard is a Lucene index:
Here's the crucial point: every primary shard is a complete Lucene index with its own inverted indexes, stored fields, and segment files. This means:
- Each shard can execute searches independently and be moved between nodes as a single unit.
- Each shard carries its own memory buffers, file handles, and background segment merging.
- Shard count multiplies both the parallelism available to queries and the fixed overhead the cluster must carry.
The 'shard as Lucene index' design enables distribution but creates a floor on resource usage. Even an empty shard consumes memory for buffers, file handles for open segments, and CPU for background operations. This overhead accumulates with shard count.
Each shard consumes approximately 0.5-1 MB of heap memory just for data structures, plus file descriptors for segments. A cluster with 10,000 shards might use 10GB+ of heap purely for shard overhead—before any document data. This is why 'too many shards' is one of the most common Elasticsearch performance problems.
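One quick, read-only way to gauge how much shard overhead a cluster is carrying (the filter_path shown is just one way to trim the response):

// Total index and shard counts, plus JVM heap usage, from the cluster stats API
GET /_cluster/stats?filter_path=indices.count,indices.shards.total,nodes.jvm.mem

// Per-index primary/replica counts at a glance
GET /_cat/indices?v&h=index,pri,rep,docs.count,store.size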
While primary shards handle write operations, replica shards provide two critical capabilities: data redundancy for fault tolerance and read distribution for query scaling.
What is a replica shard?
A replica shard is a copy of a primary shard. When documents are indexed to a primary, they're replicated to all associated replicas. If the node holding a primary shard fails, one of its replicas is automatically promoted to primary.
Unlike primary shard count, the replica count can be changed dynamically at any time:
PUT /products/_settings
{
"number_of_replicas": 2
}
Increasing replicas triggers data copying to new shards. Decreasing removes shards immediately (data still exists on primaries). This flexibility lets you tune redundancy based on changing requirements.
| Characteristic | Primary Shards | Replica Shards |
|---|---|---|
| Count mutability | Fixed at index creation | Adjustable anytime |
| Write operations | Receive all writes first | Receive writes from primaries |
| Read operations | Can serve searches | Can serve searches (load balanced) |
| Failure handling | If lost, replica promoted | If lost, recreated from primary |
| Placement constraint | Must be on different node from its replicas | Must be on different node from its primary |
How replication works:
When you index a document, the following sequence occurs:
1. The coordinating node hashes the routing value and forwards the request to the node holding the target primary shard.
2. The primary validates the operation and indexes the document locally.
3. The primary forwards the operation to its replica shards in parallel.
4. Once the required copies acknowledge the write, the primary reports success back through the coordinating node to the client.
The 'enough replicas' threshold is controlled by wait_for_active_shards. By default, it's 1 (just the primary). Setting it higher (all, or a specific number) provides stronger durability guarantees at the cost of latency.
// Wait for all active copies (primary + replicas)
PUT /products/_doc/1?wait_for_active_shards=all
{
  "name": "Critical Product",
  "price": 999
}

// Wait for at least 2 active copies (primary + one replica)
PUT /products/_doc/1?wait_for_active_shards=2
{
  "name": "Important Product",
  "price": 499
}

// Default: wait for the primary only
PUT /products/_doc/1
{
  "name": "Standard Product",
  "price": 99
}

Read distribution:
During search operations, queries can be served by either primary or replica shards. The coordinating node selects which copy to query using a round-robin or adaptive algorithm. This means:
- Adding replicas adds copies that can answer the same query, increasing total search throughput.
- Query load spreads across more nodes, smoothing out hotspots on any single machine.
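To make that selection behavior concrete, here is a small sketch of the relevant knobs; adaptive replica selection is already the default in recent versions, and the preference value below is purely illustrative:

// Adaptive replica selection can be toggled cluster-wide (it defaults to true)
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.use_adaptive_replica_selection": true
  }
}

// A per-request preference can steer which copies serve a search,
// for example shards held by the coordinating node itself
GET /products/_search?preference=_local
{
  "query": { "match_all": {} }
}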
However, replicas incur costs:
- Storage: each replica is a full copy of its primary's data.
- Indexing work: every write must be indexed on the primary and again on every replica.
- Heap and file handles: each replica is its own Lucene index, with the same per-shard overhead as a primary.
For most production workloads, 1 replica (2 total copies) provides a good balance. This survives single-node failures and doubles read capacity. Increase to 2 replicas for critical data or when read scaling is paramount. Going higher is rarely beneficial—you typically hit other bottlenecks first.
Shard sizing is one of the most consequential decisions in Elasticsearch deployments. Get it wrong, and you'll face performance problems, scaling constraints, or operational headaches. Unfortunately, there's no universal answer—optimal shard size depends on hardware, workload patterns, and query characteristics.
The competing pressures:
- Fewer, larger shards minimize per-shard overhead, but they limit query parallelism and make recovery and rebalancing slow, since every shard move copies more data.
- More, smaller shards relocate quickly and parallelize well, but they multiply heap usage, file handles, and cluster-state size.
The sizing guidelines:
Elastic's official guidance has evolved over time, but the current recommendations are:
Target shard size: 10GB to 50GB
This range balances Lucene efficiency with operational flexibility. Below 10GB, the overhead-to-data ratio is poor. Above 50GB, recovery and rebalancing take too long.
Shards per node: 20 per GB of heap (rough guide)
For a node with 30GB heap, aim for under 600 shards. This is a ceiling, not a target—fewer is usually better.
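To see how close a cluster already is to that ceiling, the _cat/allocation API lists shard counts and disk usage per node (the columns shown are a subset):

// Shards and disk usage per data node
GET /_cat/allocation?v&h=node,shards,disk.used,disk.avail,disk.percent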
Calculate before creating:
If you expect 200GB of data, don't create an index with 100 shards (2GB each). Instead, use 4-6 shards (33-50GB each) and add replicas for redundancy and read scaling.
| Expected Data Size | Recommended Primary Shards | Rationale |
|---|---|---|
| 10GB | 1 | Single shard is efficient for small data |
| 50GB | 1-2 | Still manageable in single shard, 2 for faster recovery |
| 200GB | 4-6 | ~30-50GB per shard, good parallelism |
| 1TB | 20-30 | ~30-50GB per shard, distributes across cluster |
| 10TB | 200-300 | Requires larger cluster, consider time-based indexes |
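As a sketch of the 200GB row above (the index name is hypothetical), the primary shard count is fixed in the settings at creation time, while the replica count stays adjustable:

// ~200GB expected: 6 primaries (~33GB each), each with 1 replica
PUT /products-catalog
{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 1
  }
}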
Time-based indexing for scaling:
For logging, metrics, and other time-series data, a common pattern is time-based indexing:
- Create a new index per time period (for example, a daily index such as logs-2024.01.15).
- Write only to the current index.
- Delete or archive entire indexes as they age out, instead of deleting individual documents.

This pattern avoids the static sizing problem: you're not guessing future data volume, just sizing for predictable time periods. Index lifecycle management (ILM) automates this workflow.
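A minimal ILM policy sketch, assuming a rollover-based workflow; the policy name and thresholds are illustrative, not recommendations:

// Roll over the write index at ~50GB per primary shard or 30 days; delete indexes after 90 days
PUT /_ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}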
Don't create indexes (with multiple shards each) for every 'type' of document. Hundreds of small indexes cause cluster state bloat and resource waste. Instead, put related documents in the same index and use a 'type' field for filtering. One index can hold millions of documents efficiently.
Shard allocation is the process by which the master node decides where each shard should live. This seemingly simple task involves complex decisions balancing disk space, node capacity, failure domains, and administrative policies.
Default allocation behavior:
Without manual intervention, Elasticsearch allocates shards to:
- Balance shard counts evenly across eligible data nodes.
- Keep a primary and its replicas on different nodes.
- Respect disk watermarks, awareness attributes, and any allocation filters.
The allocation process runs continuously—whenever a new index is created, a node joins/leaves, or a recovery is needed.
// Why is a shard unassigned?
GET /_cluster/allocation/explain
{
  "index": "products",
  "shard": 0,
  "primary": true
}

// Response reveals the reason:
{
  "index": "products",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2024-01-15T10:23:45.123Z"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because all nodes are too full",
  "node_allocation_decisions": [
    {
      "node_name": "data-node-1",
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node has 85% disk used"
        }
      ]
    }
  ]
}

Disk-based allocation:
Elasticsearch applies watermarks to prevent disk exhaustion:
Low watermark (default 85%): Elasticsearch avoids allocating new shards to nodes above this threshold.
High watermark (default 90%): Elasticsearch attempts to relocate shards away from nodes above this threshold.
Flood stage (default 95%): Elasticsearch blocks writes to indexes with shards on nodes above this threshold. This is a safety mechanism—better to reject writes than corrupt data.
These thresholds are configurable but should be changed carefully. At 95% disk usage, a sudden spike in segment creation (during heavy indexing) could fill the remaining space instantly.
// View current watermark settings
GET /_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk*

// Adjust watermarks (persistent setting)
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "80%",
    "cluster.routing.allocation.disk.watermark.high": "85%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "90%"
  }
}

// Alternatively, use absolute byte values (these specify minimum free space remaining):
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "100gb",
    "cluster.routing.allocation.disk.watermark.high": "50gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "20gb"
  }
}

While you can disable disk thresholds, doing so in production is dangerous. A full disk can corrupt indexes, crash nodes, and cause data loss. If you're hitting watermarks, add storage or delete data—don't disable the safety net.
Default shard allocation ensures primaries and replicas aren't on the same node. But what if that node's rack loses power? Or that cloud availability zone goes down? Both copies are lost simultaneously.
Allocation awareness extends failure domain thinking beyond individual nodes. You define attributes that categorize nodes (zone, rack, data center), and Elasticsearch ensures replicas are spread across these failure domains.
Configuring zone awareness:
# On nodes in zone-a:
node.attr.zone: zone-a

# On nodes in zone-b:
node.attr.zone: zone-b

# On nodes in zone-c:
node.attr.zone: zone-c

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}

// Now Elasticsearch ensures replica shards are in different zones
// Primary in zone-a → Replica must be in zone-b or zone-c

How awareness affects allocation:
With zone awareness enabled, Elasticsearch's allocation logic changes:
- Copies of the same shard are spread across different zones rather than concentrated in one.
- Shard balancing considers the zone attribute, not just individual nodes.
This means a 2-zone cluster with 1 replica achieves true redundancy: losing an entire zone doesn't lose any data.
| Setup | Zone Failure Impact | Recommendation |
|---|---|---|
| 3 zones, 2 replicas | Full availability maintained | Ideal for critical data |
| 3 zones, 1 replica | Reduced redundancy, still available | Good balance for most uses |
| 2 zones, 1 replica | Data stays available, but with no remaining redundancy until the zone returns | Minimum for zone tolerance |
| 2 zones, 0 replicas | Shards in the failed zone become unavailable (lost if the nodes never return) | Not recommended for production |
Forced awareness (strict zone distribution):
By default, if a zone is unavailable, Elasticsearch can still allocate replicas to remaining zones. Forced awareness prevents this—it requires all configured zones to be present before allocating replicas.
This is useful when you want to guarantee data doesn't concentrate in a single zone during temporary outages.
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.awareness.force.zone.values": "zone-a,zone-b,zone-c"
  }
}

// With this setting:
// - If zone-c nodes are all down, replicas that should be in zone-c remain unassigned
// - Cluster stays yellow until zone-c nodes return
// - Prevents over-concentration in available zones

In AWS, use availability zones (us-east-1a, us-east-1b). In GCP, use zones (us-central1-a, us-central1-b). In Azure, use availability zones (1, 2, 3). Map these to your node.attr.zone values. Losing a cloud availability zone should lose only one copy of each shard.
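To confirm that every node is actually advertising the zone attribute it was configured with, the node attributes can be listed directly:

// Custom attributes (zone, tier, and so on) reported by each node
GET /_cat/nodeattrs?v&h=node,attr,value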
Beyond automated allocation, Elasticsearch provides allocation filtering—rules that explicitly include, exclude, or require nodes for certain indexes. This enables sophisticated multi-tier architectures and operational control.
Use cases for allocation filtering:
- Hot-warm-cold tiering: keep fresh, frequently queried indexes on fast hardware and move older data to cheaper nodes.
- Node maintenance: drain shards off a node before taking it offline.
- Dedicated hardware: pin specific indexes to specific machines or node groups.
# Hot-tier nodes (fast SSDs, high CPU)
node.attr.tier: hot

# Warm-tier nodes (larger but slower storage)
node.attr.tier: warm

# Cold-tier nodes (high-capacity, slow storage)
node.attr.tier: cold

// Route an index to hot-tier nodes only
PUT /recent-logs/_settings
{
  "index.routing.allocation.require.tier": "hot"
}

// Route an index to warm OR cold tier
PUT /old-logs/_settings
{
  "index.routing.allocation.include.tier": "warm,cold"
}

// Exclude shards from a specific node (for maintenance)
PUT /products/_settings
{
  "index.routing.allocation.exclude._name": "data-node-5"
}

// Drain all shards from a node
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "10.0.0.105"
  }
}

Filter types:
require — All specified conditions must be met. The shard can only allocate to nodes matching ALL criteria.
include — At least one specified condition must be met. The shard can allocate to nodes matching ANY of the criteria.
exclude — Specified conditions must NOT be met. The shard avoids nodes matching ANY of the criteria.
Node attributes you can filter on:
- _name — Node name
- _ip — Node IP address
- _host — Node hostname
- _id — Node ID

The most powerful patterns combine custom attributes with lifecycle policies. As data ages, ILM automatically moves indexes through tiers by updating allocation filters.
When multiple filters are set, Elasticsearch evaluates: require first (must match all), then include (must match one), then exclude (must match none). If any stage fails, the shard cannot allocate to that node. Use allocation explain API to debug complex filtering rules.
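A hypothetical combination for illustration (the index and node names are made up): require a tier while excluding one node, then ask the allocation explain API why a copy is still unassigned:

// Keep the index on hot-tier nodes, but off a node that is being serviced
PUT /recent-logs/_settings
{
  "index.routing.allocation.require.tier": "hot",
  "index.routing.allocation.exclude._name": "data-node-7"
}

// If a copy still will not allocate, ask why
GET /_cluster/allocation/explain
{
  "index": "recent-logs",
  "shard": 0,
  "primary": false
}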
A healthy Elasticsearch cluster is in constant motion. Shards relocate to balance load, replicas recover after failures, and new shards are allocated as indexes grow. Understanding these processes helps diagnose performance issues and plan capacity.
Rebalancing:
Rebalancing moves shards between nodes to maintain even distribution. This happens when:
- Nodes join or leave the cluster.
- Indexes are created or deleted.
- Disk usage crosses watermarks or allocation settings change.
Rebalancing is automatic but configurable. You can control:
- Whether rebalancing runs at all (cluster.routing.rebalance.enable).
- How many shard relocations run concurrently across the cluster.
- Which nodes and indexes participate, through allocation filters and awareness settings.
// View rebalancing status
GET /_cat/shards?v&s=state,node

// Temporarily disable rebalancing (for maintenance)
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "none"
  }
}

// Re-enable rebalancing
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "all"
  }
}

// Control rebalancing concurrency
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 4
  }
}

Recovery:
Recovery is the process of restoring shard data after failures or when initializing replicas. There are several recovery types:
Peer recovery — Copying data from a primary to a replica, or from one node to another during relocation. This involves streaming segment files over the network.
Translog replay — After a node restart, replaying the transaction log to recover uncommitted operations.
Snapshot restore — Restoring from a snapshot repository for disaster recovery or data migration.
Recovery can be I/O and network intensive. Too aggressive recovery can impact production traffic; too conservative extends time at risk.
// View ongoing recoveries
GET /_cat/recovery?v&active_only=true

// Control recovery speed (bytes per second per node)
PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}

// Control concurrent recoveries per node
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}

// Delay allocation after node disconnect (allows for restart)
PUT /_all/_settings
{
  "index.unassigned.node_left.delayed_timeout": "5m"
}

When a node disconnects, Elasticsearch waits 1 minute by default before starting recovery. This prevents unnecessary recovery for short restarts. For planned maintenance, increase this delay or relocate shards beforehand. For unplanned failures, you might want faster recovery—decrease the timeout.
Sharding and replication are the heart of Elasticsearch's distributed nature. Every performance, availability, and scaling decision traces back to how you've configured shards. Let's consolidate the key principles:
- Primary shard count is fixed at index creation; size it from expected data volume, targeting roughly 10-50GB per shard.
- Replica count is adjustable at any time; replicas provide redundancy and read scaling, and one replica suits most workloads.
- Keep the total shard count per node bounded, because per-shard overhead accumulates quickly.
- Use allocation awareness to spread copies across failure domains, and allocation filtering with ILM to move data through tiers as it ages.
- Mind disk watermarks and recovery settings; they govern how the cluster behaves under pressure.
What's next:
With distribution mechanics understood, we'll explore mapping and analysis—how Elasticsearch processes, stores, and makes documents searchable. Proper mapping design is as critical as shard sizing for query performance and storage efficiency.
You now understand Elasticsearch's sharding model in depth: primary and replica shards, sizing strategies, allocation mechanics, zone awareness, and recovery behavior. Next, we examine mapping and analysis—how data is processed for search.