Every distributed data system faces the same fundamental challenge: how do you spread data across multiple machines while maintaining coherent query behavior and surviving failures? Elasticsearch's answer is sharding with replication—a design that enables both horizontal scaling and fault tolerance.
Sharding isn't unique to Elasticsearch. PostgreSQL has table partitioning. MongoDB has sharded clusters. Kafka has partitions. Redis has cluster slots. But each system implements partitioning differently, with distinct trade-offs.
What makes Elasticsearch's sharding model distinctive is its tight coupling with Lucene indexes. Each shard is a complete, self-contained Lucene index. This design choice has profound implications for sizing, allocation, and operational behavior that we'll explore in depth.
By the end of this page, you will understand: the distinction between primary and replica shards; how to determine optimal shard count and size; replication mechanics and consistency guarantees; shard allocation strategies and awareness; and the operational implications of sharding decisions.
When you create an Elasticsearch index, you specify the number of primary shards. This number determines how your data will be partitioned and has permanent implications—you cannot change the primary shard count without reindexing.
What is a primary shard?
A primary shard is the authoritative copy of a portion of an index's data. When a document is indexed, Elasticsearch uses a routing formula to determine which primary shard should hold it:
shard_id = hash(_routing) % number_of_primary_shards
By default, _routing equals the document's _id. This ensures deterministic placement—given a document ID, any node can calculate which shard contains that document without consulting a lookup table.
This hash-based routing is elegant but inflexible. Once shard count is set, changing it would alter the hash results, making existing documents unfindable. Hence the immutability.
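As a concrete sketch (the index, document, and routing values here are hypothetical), the _routing value can also be supplied explicitly, which co-locates related documents on one shard and lets a search target only that shard:

// Index with an explicit routing value instead of the default _id
PUT /products/_doc/1?routing=user-42
{
  "name": "Widget",
  "price": 19
}

// A search carrying the same routing value only queries the shard that user-42 hashes to
GET /products/_search?routing=user-42
{
  "query": { "match": { "name": "widget" } }
}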
// GET /_cat/shards/products?v&h=index,shard,prirep,state,docs,store,node

index    shard prirep state   docs  store  node
products 0     p      STARTED 12847 25.1mb data-node-1
products 0     r      STARTED 12847 25.1mb data-node-3
products 1     p      STARTED 12923 25.4mb data-node-2
products 1     r      STARTED 12923 25.4mb data-node-4
products 2     p      STARTED 12830 25.0mb data-node-3
products 2     r      STARTED 12830 25.0mb data-node-1
products 3     p      STARTED 12901 25.3mb data-node-4
products 3     r      STARTED 12901 25.3mb data-node-2
products 4     p      STARTED 12756 24.9mb data-node-1
products 4     r      STARTED 12756 24.9mb data-node-3
products 5     p      STARTED 12843 25.2mb data-node-2
products 5     r      STARTED 12843 25.2mb data-node-4

// p = primary, r = replica

Each shard is a Lucene index:
Here's the crucial point: every primary shard is a complete Lucene index with its own inverted indexes, stored fields, and segment files. This means:
- Each shard can execute searches independently and be moved between nodes as a single unit.
- Each shard carries its own memory buffers, file handles, and background segment merging.
- Shard count multiplies both the parallelism available to queries and the fixed overhead the cluster must carry.
The 'shard as Lucene index' design enables distribution but creates a floor on resource usage. Even an empty shard consumes memory for buffers, file handles for open segments, and CPU for background operations. This overhead accumulates with shard count.
Each shard consumes approximately 0.5-1 MB of heap memory just for data structures, plus file descriptors for segments. A cluster with 10,000 shards might use 10GB+ of heap purely for shard overhead—before any document data. This is why 'too many shards' is one of the most common Elasticsearch performance problems.
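One quick, read-only way to gauge how much shard overhead a cluster is carrying (the filter_path shown is just one way to trim the response):

// Total index and shard counts, plus JVM heap usage, from the cluster stats API
GET /_cluster/stats?filter_path=indices.count,indices.shards.total,nodes.jvm.mem

// Per-index primary/replica counts at a glance
GET /_cat/indices?v&h=index,pri,rep,docs.count,store.size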
While primary shards handle write operations, replica shards provide two critical capabilities: data redundancy for fault tolerance and read distribution for query scaling.
What is a replica shard?
A replica shard is a copy of a primary shard. When documents are indexed to a primary, they're replicated to all associated replicas. If the node holding a primary shard fails, one of its replicas is automatically promoted to primary.
Unlike primary shard count, the replica count can be changed dynamically at any time:
PUT /products/_settings
{
"number_of_replicas": 2
}
Increasing replicas triggers data copying to new shards. Decreasing removes shards immediately (data still exists on primaries). This flexibility lets you tune redundancy based on changing requirements.
| Characteristic | Primary Shards | Replica Shards |
|---|---|---|
| Count mutability | Fixed at index creation | Adjustable anytime |
| Write operations | Receive all writes first | Receive writes from primaries |
| Read operations | Can serve searches | Can serve searches (load balanced) |
| Failure handling | If lost, replica promoted | If lost, recreated from primary |
| Placement constraint | Must be on different node from its replicas | Must be on different node from its primary |
How replication works:
When you index a document, the following sequence occurs:
1. The coordinating node hashes the routing value and forwards the request to the node holding the target primary shard.
2. The primary validates the operation and indexes the document locally.
3. The primary forwards the operation to its replica shards in parallel.
4. Once the required copies acknowledge the write, the primary reports success back through the coordinating node to the client.
The 'enough replicas' threshold is controlled by wait_for_active_shards. By default, it's 1 (just the primary). Setting it higher (all, or a specific number) provides stronger durability guarantees at the cost of latency.
// Wait for all active copies (primary + replicas)
PUT /products/_doc/1?wait_for_active_shards=all
{
  "name": "Critical Product",
  "price": 999
}

// Wait for at least 2 active copies (primary + one replica)
PUT /products/_doc/1?wait_for_active_shards=2
{
  "name": "Important Product",
  "price": 499
}

// Default: wait for the primary only
PUT /products/_doc/1
{
  "name": "Standard Product",
  "price": 99
}

Read distribution:
During search operations, queries can be served by either primary or replica shards. The coordinating node selects which copy to query using a round-robin or adaptive algorithm. This means:
- Adding replicas adds copies that can answer the same query, increasing total search throughput.
- Query load spreads across more nodes, smoothing out hotspots on any single machine.
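To make that selection behavior concrete, here is a small sketch of the relevant knobs; adaptive replica selection is already the default in recent versions, and the preference value below is purely illustrative:

// Adaptive replica selection can be toggled cluster-wide (it defaults to true)
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.use_adaptive_replica_selection": true
  }
}

// A per-request preference can steer which copies serve a search,
// for example shards held by the coordinating node itself
GET /products/_search?preference=_local
{
  "query": { "match_all": {} }
}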
However, replicas incur costs:
- Storage: each replica is a full copy of its primary's data.
- Indexing work: every write must be indexed on the primary and again on every replica.
- Heap and file handles: each replica is its own Lucene index, with the same per-shard overhead as a primary.
For most production workloads, 1 replica (2 total copies) provides a good balance. This survives single-node failures and doubles read capacity. Increase to 2 replicas for critical data or when read scaling is paramount. Going higher is rarely beneficial—you typically hit other bottlenecks first.
Shard sizing is one of the most consequential decisions in Elasticsearch deployments. Get it wrong, and you'll face performance problems, scaling constraints, or operational headaches. Unfortunately, there's no universal answer—optimal shard size depends on hardware, workload patterns, and query characteristics.
The competing pressures:
- Fewer, larger shards minimize per-shard overhead, but they limit query parallelism and make recovery and rebalancing slow, since every shard move copies more data.
- More, smaller shards relocate quickly and parallelize well, but they multiply heap usage, file handles, and cluster-state size.
The sizing guidelines:
Elastic's official guidance has evolved over time, but the current recommendations are:
Target shard size: 10GB to 50GB
This range balances Lucene efficiency with operational flexibility. Below 10GB, the overhead-to-data ratio is poor. Above 50GB, recovery and rebalancing take too long.
Shards per node: 20 per GB of heap (rough guide)
For a node with 30GB heap, aim for under 600 shards. This is a ceiling, not a target—fewer is usually better.
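To see how close a cluster already is to that ceiling, the _cat/allocation API lists shard counts and disk usage per node (the columns shown are a subset):

// Shards and disk usage per data node
GET /_cat/allocation?v&h=node,shards,disk.used,disk.avail,disk.percent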
Calculate before creating:
If you expect 200GB of data, don't create an index with 100 shards (2GB each). Instead, use 4-6 shards (33-50GB each) and add replicas for redundancy and read scaling.
| Expected Data Size | Recommended Primary Shards | Rationale |
|---|---|---|
| 10GB | 1 | Single shard is efficient for small data |
| 50GB | 1-2 | Still manageable in single shard, 2 for faster recovery |
| 200GB | 4-6 | ~30-50GB per shard, good parallelism |
| 1TB | 20-30 | ~30-50GB per shard, distributes across cluster |
| 10TB | 200-300 | Requires larger cluster, consider time-based indexes |
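As a sketch of the 200GB row above (the index name is hypothetical), the primary shard count is fixed in the settings at creation time, while the replica count stays adjustable:

// ~200GB expected: 6 primaries (~33GB each), each with 1 replica
PUT /products-catalog
{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 1
  }
}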
Time-based indexing for scaling:
For logging, metrics, and other time-series data, a common pattern is time-based indexing:
- Create a new index per time period (for example, a daily index such as logs-2024.01.15).
- Write only to the current index.
- Delete or archive entire indexes as they age out, instead of deleting individual documents.

This pattern avoids the static sizing problem: you're not guessing future data volume, just sizing for predictable time periods. Index lifecycle management (ILM) automates this workflow.
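A minimal ILM policy sketch, assuming a rollover-based workflow; the policy name and thresholds are illustrative, not recommendations:

// Roll over the write index at ~50GB per primary shard or 30 days; delete indexes after 90 days
PUT /_ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}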
Don't create indexes (with multiple shards each) for every 'type' of document. Hundreds of small indexes cause cluster state bloat and resource waste. Instead, put related documents in the same index and use a 'type' field for filtering. One index can hold millions of documents efficiently.
Shard allocation is the process by which the master node decides where each shard should live. This seemingly simple task involves complex decisions balancing disk space, node capacity, failure domains, and administrative policies.
Default allocation behavior:
Without manual intervention, Elasticsearch allocates shards to:
- Balance shard counts evenly across eligible data nodes.
- Keep a primary and its replicas on different nodes.
- Respect disk watermarks, awareness attributes, and any allocation filters.
The allocation process runs continuously—whenever a new index is created, a node joins/leaves, or a recovery is needed.
// Why is a shard unassigned?
GET /_cluster/allocation/explain
{
  "index": "products",
  "shard": 0,
  "primary": true
}

// Response reveals the reason:
{
  "index": "products",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2024-01-15T10:23:45.123Z"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because all nodes are too full",
  "node_allocation_decisions": [
    {
      "node_name": "data-node-1",
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node has 85% disk used"
        }
      ]
    }
  ]
}

Disk-based allocation:
Elasticsearch applies watermarks to prevent disk exhaustion:
Low watermark (default 85%): Elasticsearch avoids allocating new shards to nodes above this threshold.
High watermark (default 90%): Elasticsearch attempts to relocate shards away from nodes above this threshold.
Flood stage (default 95%): Elasticsearch blocks writes to indexes with shards on nodes above this threshold. This is a safety mechanism—better to reject writes than corrupt data.
These thresholds are configurable but should be changed carefully. At 95% disk usage, a sudden spike in segment creation (during heavy indexing) could fill the remaining space instantly.
// View current watermark settings
GET /_cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk*

// Adjust watermarks (persistent setting)
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "80%",
    "cluster.routing.allocation.disk.watermark.high": "85%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "90%"
  }
}

// Alternatively, use absolute byte values (these specify minimum free space remaining):
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "100gb",
    "cluster.routing.allocation.disk.watermark.high": "50gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "20gb"
  }
}

While you can disable disk thresholds, doing so in production is dangerous. A full disk can corrupt indexes, crash nodes, and cause data loss. If you're hitting watermarks, add storage or delete data—don't disable the safety net.
Default shard allocation ensures primaries and replicas aren't on the same node. But what if that node's rack loses power? Or that cloud availability zone goes down? Both copies are lost simultaneously.
Allocation awareness extends failure domain thinking beyond individual nodes. You define attributes that categorize nodes (zone, rack, data center), and Elasticsearch ensures replicas are spread across these failure domains.
Configuring zone awareness:
# On nodes in zone-a:
node.attr.zone: zone-a

# On nodes in zone-b:
node.attr.zone: zone-b

# On nodes in zone-c:
node.attr.zone: zone-c

PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}

// Now Elasticsearch ensures replica shards are in different zones
// Primary in zone-a → Replica must be in zone-b or zone-c

How awareness affects allocation:
With zone awareness enabled, Elasticsearch's allocation logic changes:
- Copies of the same shard are spread across different zones rather than concentrated in one.
- Shard balancing considers the zone attribute, not just individual nodes.
This means a 2-zone cluster with 1 replica achieves true redundancy: losing an entire zone doesn't lose any data.
| Setup | Zone Failure Impact | Recommendation |
|---|---|---|
| 3 zones, 2 replicas | Full availability maintained | Ideal for critical data |
| 3 zones, 1 replica | Reduced redundancy, still available | Good balance for most uses |
| 2 zones, 1 replica | Data stays available, but with no remaining redundancy until the zone returns | Minimum for zone tolerance |
| 2 zones, 0 replicas | Shards in the failed zone become unavailable (lost if the nodes never return) | Not recommended for production |
Forced awareness (strict zone distribution):
By default, if a zone is unavailable, Elasticsearch can still allocate replicas to remaining zones. Forced awareness prevents this—it requires all configured zones to be present before allocating replicas.
This is useful when you want to guarantee data doesn't concentrate in a single zone during temporary outages.
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone",
    "cluster.routing.allocation.awareness.force.zone.values": "zone-a,zone-b,zone-c"
  }
}

// With this setting:
// - If zone-c nodes are all down, replicas that should be in zone-c remain unassigned
// - Cluster stays yellow until zone-c nodes return
// - Prevents over-concentration in available zones

In AWS, use availability zones (us-east-1a, us-east-1b). In GCP, use zones (us-central1-a, us-central1-b). In Azure, use availability zones (1, 2, 3). Map these to your node.attr.zone values. Losing a cloud availability zone should lose only one copy of each shard.
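To confirm that every node is actually advertising the zone attribute it was configured with, the node attributes can be listed directly:

// Custom attributes (zone, tier, and so on) reported by each node
GET /_cat/nodeattrs?v&h=node,attr,value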
Beyond automated allocation, Elasticsearch provides allocation filtering—rules that explicitly include, exclude, or require nodes for certain indexes. This enables sophisticated multi-tier architectures and operational control.
Use cases for allocation filtering:
- Hot-warm-cold tiering: keep fresh, frequently queried indexes on fast hardware and move older data to cheaper nodes.
- Node maintenance: drain shards off a node before taking it offline.
- Dedicated hardware: pin specific indexes to specific machines or node groups.
# Hot-tier nodes (fast SSDs, high CPU)
node.attr.tier: hot

# Warm-tier nodes (larger but slower storage)
node.attr.tier: warm

# Cold-tier nodes (high-capacity, slow storage)
node.attr.tier: cold

// Route an index to hot-tier nodes only
PUT /recent-logs/_settings
{
  "index.routing.allocation.require.tier": "hot"
}

// Route an index to warm OR cold tier
PUT /old-logs/_settings
{
  "index.routing.allocation.include.tier": "warm,cold"
}

// Exclude shards from a specific node (for maintenance)
PUT /products/_settings
{
  "index.routing.allocation.exclude._name": "data-node-5"
}

// Drain all shards from a node
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "10.0.0.105"
  }
}

Filter types:
require — All specified conditions must be met. The shard can only allocate to nodes matching ALL criteria.
include — At least one specified condition must be met. The shard can allocate to nodes matching ANY of the criteria.
exclude — Specified conditions must NOT be met. The shard avoids nodes matching ANY of the criteria.
Node attributes you can filter on:
- _name — Node name
- _ip — Node IP address
- _host — Node hostname
- _id — Node ID

The most powerful patterns combine custom attributes with lifecycle policies. As data ages, ILM automatically moves indexes through tiers by updating allocation filters.
When multiple filters are set, Elasticsearch evaluates: require first (must match all), then include (must match one), then exclude (must match none). If any stage fails, the shard cannot allocate to that node. Use allocation explain API to debug complex filtering rules.
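A hypothetical combination for illustration (the index and node names are made up): require a tier while excluding one node, then ask the allocation explain API why a copy is still unassigned:

// Keep the index on hot-tier nodes, but off a node that is being serviced
PUT /recent-logs/_settings
{
  "index.routing.allocation.require.tier": "hot",
  "index.routing.allocation.exclude._name": "data-node-7"
}

// If a copy still will not allocate, ask why
GET /_cluster/allocation/explain
{
  "index": "recent-logs",
  "shard": 0,
  "primary": false
}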
A healthy Elasticsearch cluster is in constant motion. Shards relocate to balance load, replicas recover after failures, and new shards are allocated as indexes grow. Understanding these processes helps diagnose performance issues and plan capacity.
Rebalancing:
Rebalancing moves shards between nodes to maintain even distribution. This happens when:
- Nodes join or leave the cluster.
- Indexes are created or deleted.
- Disk usage crosses watermarks or allocation settings change.
Rebalancing is automatic but configurable. You can control:
- Whether rebalancing runs at all (cluster.routing.rebalance.enable).
- How many shard relocations run concurrently across the cluster.
- Which nodes and indexes participate, through allocation filters and awareness settings.
// View rebalancing status
GET /_cat/shards?v&s=state,node

// Temporarily disable rebalancing (for maintenance)
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "none"
  }
}

// Re-enable rebalancing
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "all"
  }
}

// Control rebalancing concurrency
PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 4
  }
}

Recovery:
Recovery is the process of restoring shard data after failures or when initializing replicas. There are several recovery types:
Peer recovery — Copying data from a primary to a replica, or from one node to another during relocation. This involves streaming segment files over the network.
Translog replay — After a node restart, replaying the transaction log to recover uncommitted operations.
Snapshot restore — Restoring from a snapshot repository for disaster recovery or data migration.
Recovery can be I/O and network intensive. Too aggressive recovery can impact production traffic; too conservative extends time at risk.
// View ongoing recoveries
GET /_cat/recovery?v&active_only=true

// Control recovery speed (bytes per second per node)
PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}

// Control concurrent recoveries per node
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}

// Delay allocation after node disconnect (allows for restart)
PUT /_all/_settings
{
  "index.unassigned.node_left.delayed_timeout": "5m"
}

When a node disconnects, Elasticsearch waits 1 minute by default before starting recovery. This prevents unnecessary recovery for short restarts. For planned maintenance, increase this delay or relocate shards beforehand. For unplanned failures, you might want faster recovery—decrease the timeout.
Sharding and replication are the heart of Elasticsearch's distributed nature. Every performance, availability, and scaling decision traces back to how you've configured shards. Let's consolidate the key principles:
- Primary shard count is fixed at index creation; size it from expected data volume, targeting roughly 10-50GB per shard.
- Replica count is adjustable at any time; replicas provide redundancy and read scaling, and one replica suits most workloads.
- Keep the total shard count per node bounded, because per-shard overhead accumulates quickly.
- Use allocation awareness to spread copies across failure domains, and allocation filtering with ILM to move data through tiers as it ages.
- Mind disk watermarks and recovery settings; they govern how the cluster behaves under pressure.
What's next:
With distribution mechanics understood, we'll explore mapping and analysis—how Elasticsearch processes, stores, and makes documents searchable. Proper mapping design is as critical as shard sizing for query performance and storage efficiency.
You now understand Elasticsearch's sharding model in depth: primary and replica shards, sizing strategies, allocation mechanics, zone awareness, and recovery behavior. Next, we examine mapping and analysis—how data is processed for search.