Elasticsearch is distributed by design, but distribution doesn't automatically equal scalability. A poorly-configured 100-node cluster can be slower than a well-tuned 5-node cluster. Scaling Elasticsearch effectively requires understanding the bottlenecks, applying appropriate configurations, and monitoring the right metrics.
The journey from development (single node, gigabytes of data) to production (multi-cluster, petabytes of data) involves navigating many decisions: hardware selection, shard sizing, JVM tuning, index lifecycle management, and operational automation. Each decision affects performance, reliability, and cost.
This page provides the comprehensive guide to scaling Elasticsearch, consolidating lessons from organizations running some of the world's largest Elasticsearch deployments.
By the end of this page, you will understand: capacity planning methodology; hardware selection and sizing; JVM and OS tuning; index lifecycle management; performance monitoring; and operational best practices for production Elasticsearch clusters.
Capacity planning is the foundation of reliable Elasticsearch deployments. Under-provisioning causes outages during traffic spikes; over-provisioning wastes money. The goal is right-sizing—having enough capacity for expected load plus headroom for growth and failures.
The capacity planning methodology:
1. Quantify the workload: daily ingestion volume, retention period, replica count, query rate, and peak concurrency.
2. Calculate storage requirements: raw data × retention × (1 + replicas), plus overhead for index structures and free-space headroom.
3. Determine memory requirements: JVM heap for cluster data structures plus OS file cache for Lucene segments on each node.
4. Calculate throughput capacity: indexing and search throughput per node, then the node count needed to absorb peak load with headroom.
| Resource | Guideline | Notes |
|---|---|---|
| Storage per node | 2-10TB | SSD strongly recommended |
| JVM heap | 50% of RAM, max 31GB | Above 31GB, pointer compression disabled |
| Shards per node | < 600, ideally < 200 | More shards = more overhead |
| Shard size | 10-50GB | Smaller = more parallelism, larger = less overhead |
| Shards per GB heap | < 100 | For stable cluster state |
| CPU cores | 8-32 per node | More for aggregation-heavy workloads |
| Network | 10Gbps+ | For fast recovery and rebalancing |
```
Workload:
- 500GB/day ingestion
- 90 days retention = 45TB total data
- 1 replica = 90TB storage (2 copies)
- 50% overhead for indexes = 135TB storage needed

Storage sizing:
- If using 8TB disks at 75% utilization (6TB usable)
- Need: 135TB / 6TB = 23 data nodes minimum

Memory sizing:
- 23 nodes × 64GB RAM = 1.5TB cluster memory
- 23 nodes × 30GB heap = 690GB heap
- 23 nodes × 34GB OS cache = 782GB for file cache

Shard sizing:
- 45TB data / 30GB per shard = 1500 primary shards
- With replica = 3000 total shards
- 3000 shards / 23 nodes = 130 shards per node ✓

Add headroom:
- 30% for growth/spikes = 30 data nodes
- 3 dedicated master nodes = 33 total nodes
```
Capacity calculations are estimates. Before production launch, run load tests with realistic data and query patterns. Tools like Rally (Elasticsearch's benchmarking tool) simulate realistic workloads and report throughput, latency, and resource utilization.
Different node roles have different hardware requirements. Optimizing hardware by role reduces cost while improving performance.
Master nodes:
Master nodes manage cluster state but don't store data. They need modest CPU and memory, minimal storage, and stable, low-latency network connectivity; reliability matters more than raw capacity.
Dedicated masters are essential for production. Use 3 (or 5 for larger clusters) to maintain quorum during failures.
Storage considerations:
SSD vs HDD: SSDs are strongly recommended for all Elasticsearch workloads. Lucene's segment-based architecture involves random I/O during searches and significant sequential writes during indexing. SSDs provide far higher random-read IOPS, faster segment merges, and quicker shard recovery.
HDDs may be acceptable only for cold/frozen tiers where access is infrequent.
Local vs Network storage: Local storage (direct-attached) is preferred for performance. Network-attached storage (NFS, EBS gp3) adds latency and introduces shared failure domains. If using cloud block storage, provision IOPS appropriately—default IOPS is often insufficient.
Cloud instance selection:
| Provider | Master Nodes | Hot Data Nodes | Warm/Cold Data Nodes |
|---|---|---|---|
| AWS | m6i.xlarge | i3.2xlarge, i3en.2xlarge | d3.2xlarge |
| GCP | e2-medium | n2-highmem-8 + local SSD | n2-highmem-8 + pd-balanced |
| Azure | Standard_D4s_v3 | Standard_L8s_v2 | Standard_E8s_v4 |
Network bandwidth is often overlooked. During recovery or rebalancing, nodes stream segment files at maximum speed. A 1Gbps network can take hours to recover a 500GB shard. Use 10Gbps+ for production clusters. In cloud environments, choose instances with high network bandwidth.
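If the network can sustain it, recovery bandwidth can be raised above the defaults with dynamic cluster settings; a minimal sketch (the values shown are illustrative, not recommendations):

```
PUT /_cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "250mb",              // default is 40mb
    "cluster.routing.allocation.node_concurrent_recoveries": 4  // default is 2
  }
}
```

Raising these trades steady-state headroom for faster recovery, so they are worth testing under load before relying on them.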
Elasticsearch runs on the JVM, and proper JVM configuration is critical for stable performance. Additionally, operating system settings affect I/O performance and process limits.
JVM Heap sizing:
The most important JVM setting is heap size. The golden rules:
Rule 1: 50% of physical memory for heap, 50% for OS cache Elasticsearch uses both JVM heap (for data structures) and OS file cache (for Lucene segments). Both need adequate memory.
Rule 2: Never exceed 31GB heap Above 31GB, the JVM disables compressed ordinary object pointers (compressed OOPs), meaning object references use 8 bytes instead of 4. A 32GB heap actually has less usable space than 31GB.
Rule 3: Set Xms and Xmx to the same value This prevents heap resizing during operation, which can cause pauses.
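To confirm Rule 2 actually took effect, the node info API reports whether compressed object pointers are in use (Elasticsearch also logs this at startup); a quick check:

```
// Should report "true" on every node if the heap is under the threshold
GET /_nodes/jvm?filter_path=**.using_compressed_ordinary_object_pointers
```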
```
# Heap size - must be same for Xms and Xmx
-Xms30g
-Xmx30g

# G1GC is now the default (Elasticsearch 8+)
# No need to specify unless customizing

# Disable explicit GC calls (from application code)
-XX:+DisableExplicitGC

# Pre-touch memory pages at startup (faster startup)
-XX:+AlwaysPreTouch

# GC logging (useful for diagnosing issues)
-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
```
Operating system settings:
Elasticsearch requires several OS settings adjusted from defaults:
```
# /etc/security/limits.conf
elasticsearch - nofile 65535   # Open file descriptors
elasticsearch - nproc 4096     # Process/thread limit

# /etc/sysctl.conf
vm.max_map_count = 262144      # Memory mapping limit (required!)
vm.swappiness = 1              # Minimize swapping

# Disable swap entirely (or use mlockall)
sudo swapoff -a

# For Elasticsearch user, in /etc/security/limits.conf
elasticsearch - memlock unlimited

# In elasticsearch.yml, enable memory locking
bootstrap.memory_lock: true
```
Why disable swap?
Swapping JVM memory to disk causes severe performance degradation. Garbage collection operates on heap memory; if heap pages are swapped out, GC becomes I/O-bound. A GC pause that should take 100ms can take seconds.
bootstrap.memory_lock: true tells Elasticsearch to lock its memory in RAM, preventing the OS from swapping it. This requires the memlock ulimit to be set to unlimited.
If memory locking fails (due to insufficient ulimits), Elasticsearch may still start but without memory lock. Check node info API: GET /_nodes?filter_path=**.mlockall to verify memory is locked. Unlocked memory in production risks swap-induced freezes.
For time-series data (logs, metrics, events), data has different value and access patterns over time. Recent data is queried frequently and needs fast access. Older data is rarely accessed but may need retention for compliance.
Index Lifecycle Management (ILM) automates moving data through phases based on age or size, reducing costs while maintaining appropriate performance.
ILM phases:
Hot — Active indexing and frequent queries. Use fast SSDs, high-replica count.
Warm — No new indexing, occasional queries. Reduce replicas, move to slower storage.
Cold — Rare queries, archival access. Freeze indexes, searchable snapshots.
Frozen — Very rare access. Use searchable snapshots (data in object storage).
Delete — Remove data after retention period.
```
PUT /_ilm/policy/logs_policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_size": "50gb", "max_age": "1d" },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "allocate": {
            "require": { "tier": "warm" },
            "number_of_replicas": 0
          },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": { "require": { "tier": "cold" } },
          "freeze": {},
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```
Key ILM actions:
rollover — Create a new index when conditions met (size, age, doc count). Essential for managing shard count over time.
shrink — Reduce primary shard count. Useful for warm phase where queries are less frequent.
forcemerge — Merge segments into fewer files. Optimizes storage and query performance for static indexes.
freeze — Mark index read-only and release memory. Frozen indexes have higher query latency but minimal resource usage.
allocate — Move index to different tier nodes using allocation filtering. The policy above filters on a custom node attribute; see the sketch after this list.
delete — Remove index entirely when retention expires.
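The "require": {"tier": "warm"} and {"tier": "cold"} clauses in the example policy assume a custom node attribute named tier, which each data node would have to declare; a minimal sketch of the corresponding elasticsearch.yml entries:

```
# On a warm-tier data node (attribute name must match the ILM policy)
node.attr.tier: warm

# On a cold-tier data node
node.attr.tier: cold
```

Newer Elasticsearch versions can instead use the built-in data tiers (node.roles such as data_warm and data_cold together with the index's _tier_preference), which removes the need for a custom attribute.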
```
// Create index template that uses ILM policy
PUT /_index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs_policy",
      "index.lifecycle.rollover_alias": "logs"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  }
}

// Bootstrap the alias
PUT /logs-000001
{
  "aliases": {
    "logs": { "is_write_index": true }
  }
}

// Now write to 'logs' alias - ILM handles rollover automatically
```
Elasticsearch 7.9+ introduced Data Streams, which simplify time-series workflows. Data streams automatically handle rollover, naming, and alias management. For new deployments, prefer data streams over manual rollover alias patterns.
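For comparison, a data stream needs only a template with a data_stream block and no alias bootstrapping; a minimal sketch (the template name and index pattern are illustrative):

```
PUT /_index_template/logs_ds_template
{
  "index_patterns": ["logs-ds-*"],
  "data_stream": {},
  "template": {
    "settings": { "index.lifecycle.name": "logs_policy" },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  }
}

// The first write creates the data stream; rollover and backing-index naming are automatic
POST /logs-ds-app/_doc
{ "@timestamp": "2024-01-01T00:00:00Z", "message": "hello" }
```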
Effective monitoring is essential for maintaining healthy Elasticsearch clusters. The right metrics reveal problems before they cause outages, while wrong metrics create noise without insight.
Essential metrics to monitor:
```
// Cluster health
GET /_cluster/health

// Node statistics
GET /_nodes/stats/jvm,os,fs,indices

// Index statistics
GET /_stats/indexing,search,merge

// Thread pool status (queued/rejected)
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected

// Hot threads (what's consuming CPU)
GET /_nodes/hot_threads

// Shard allocation explanation
GET /_cluster/allocation/explain
```
Query performance metrics:
Search latency — P50, P90, P99 query latencies. Track trends over time.
Query rate — Queries per second. Correlate with latency changes.
Slow queries — Enable slow log to capture queries exceeding thresholds:
```
PUT /products/_settings
{
  "index.search.slowlog.threshold.query.warn": "2s",
  "index.search.slowlog.threshold.query.info": "1s"
}
```
Search thread pool queues — If search queue grows, queries are waiting for resources. May need more nodes or query optimization.
Indexing performance metrics:
Indexing rate — Documents per second. Track against expected throughput.
Indexing latency — Time to index documents. Increases may indicate resource contention.
Bulk rejections — If bulk queue overflows, requests are rejected. Need slower ingestion or more capacity.
Merge time — Background merges are I/O intensive. Excessive merging indicates too many segments.
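To see whether merging is falling behind, per-shard segment counts and per-node merge stats are both visible through existing APIs; for example (the logs-* index pattern is illustrative):

```
// Segment count and size per shard
GET /_cat/segments/logs-*?v&h=index,shard,segment,size

// Cumulative merge activity per node
GET /_nodes/stats/indices/merge?filter_path=nodes.*.indices.merges
```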
Elastic provides free monitoring features in Kibana. Metricbeat can ship Elasticsearch metrics to a monitoring cluster. For production, always monitor Elasticsearch from a separate cluster—you don't want monitoring to fail when your primary cluster has issues.
Production Elasticsearch clusters encounter recurring performance patterns. Recognizing these patterns accelerates diagnosis and resolution.
Problem: Slow searches / high latency
Symptoms: Query latencies increasing, users complaining about slow results.
Common causes: heap pressure and long GC pauses; expensive queries (leading wildcards, deep pagination, heavy aggregations); searching too many shards per request; a cold or undersized filesystem cache.
Diagnosis:
```
// Check if heap is the problem
GET /_nodes/stats/jvm?filter_path=**.heap_used_percent

// Look for GC pressure
GET /_nodes/stats/jvm?filter_path=**.gc.collectors

// Enable slow query log
PUT /products/_settings
{
  "index.search.slowlog.threshold.query.warn": "2s",
  "index.search.slowlog.level": "info"
}

// Profile a specific query
POST /products/_search
{
  "profile": true,
  "query": { "match": { "description": "slow query" } }
}

// Check cache statistics
GET /_nodes/stats/indices/query_cache,request_cache
```
Problem: Indexing rejections / bulk failures
Symptoms: Bulk API returns 429 errors, documents not appearing.
Common causes: write thread pool saturation (the bulk queue overflows), bulk requests that are too large or too concurrent for the cluster, and disks that cannot keep up with segment merges.
Solutions: reduce bulk request size and client concurrency, add data nodes or faster storage, and spread writes across more shards or indexes. A quick way to confirm write-queue saturation is shown below.
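One way to check per-node write-queue pressure is the cat thread pool API, restricted to the write pool; a minimal check:

```
// Non-zero "rejected" counts confirm bulk requests are being turned away
GET /_cat/thread_pool/write?v&h=node_name,active,queue,rejected,completed
```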
Problem: Cluster state too large / master instability
Symptoms: Master nodes struggling, cluster state updates slow, frequent master elections.
Common causes: too many indexes and shards, very large mappings (field explosion from dynamic mapping), and frequent mapping or settings updates that force repeated cluster state publication.
Solutions: use dedicated master nodes, reduce index and shard counts (ILM rollover and shrink help here), keep mappings bounded, and batch settings changes. The request below is one way to gauge the scale of the problem.
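To gauge how many indexes and shards the cluster state is carrying, the cluster stats API gives a quick count:

```
GET /_cluster/stats?filter_path=indices.count,indices.shards
```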
| Symptom | Likely Cause | First Action |
|---|---|---|
| Yellow cluster | Missing replicas | Check node count, disk space |
| Red cluster | Missing primaries | Check node health, shard allocation |
| High GC time | Heap pressure | Reduce shards, add nodes |
| Slow queries | Large result sets | Profile queries, add filters |
| Bulk rejections | Thread pool saturation | Reduce bulk size, add nodes |
| Frequent master elections | Cluster state issues | Add dedicated masters, reduce indexes |
| High disk I/O | Merge storms | Reduce indexing rate, force merge old indexes |
Most Elasticsearch problems trace to insufficient resources (heap, disk, CPU) or excessive load (too many shards, too many indexes, too complex queries). Before investigating exotic causes, verify basics: is the cluster under-provisioned for its workload?
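A quick pass over per-node resources often answers that question; for example:

```
// Heap, memory, CPU, and load per node
GET /_cat/nodes?v&h=name,node.role,heap.percent,ram.percent,cpu,load_1m

// Disk usage and shard count per node
GET /_cat/allocation?v
```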
Beyond configuration nuances, operational practices determine long-term cluster reliability. These patterns emerge from running Elasticsearch at scale.
1. Always use dedicated master nodes
Master stability affects the entire cluster. Never run masters on nodes with heavy data operations. Use 3 dedicated master nodes for most clusters; 5 for larger deployments. Configure voting configuration to handle failures.
```
# Master-only node configuration
node.roles: [ master ]
node.name: master-${HOSTNAME}

# Initial master nodes for bootstrapping
cluster.initial_master_nodes:
  - master-1
  - master-2
  - master-3

# Network settings
network.host: _site_
discovery.seed_hosts:
  - master-1:9300
  - master-2:9300
  - master-3:9300
```
2. Implement backup and recovery
Snapshots are critical for disaster recovery. Configure automated snapshots to a repository (S3, GCS, Azure Blob, or shared filesystem).
```
// Register S3 repository
PUT /_snapshot/backup_repo
{
  "type": "s3",
  "settings": {
    "bucket": "elasticsearch-backups",
    "region": "us-east-1",
    "base_path": "production"
  }
}

// Create snapshot lifecycle policy
PUT /_slm/policy/daily-snapshots
{
  "schedule": "0 0 0 * * ?",   // Daily at midnight UTC
  "name": "<daily-snap-{now/d}>",
  "repository": "backup_repo",
  "config": {
    "indices": ["*"],
    "include_global_state": true
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}
```
3. Have a rolling restart procedure
For maintenance (upgrades, configuration changes), restart nodes one at a time with proper allocation management:
```
# 1. Disable shard allocation (prevent unnecessary recovery)
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.enable": "primaries"
  }
}'

# 2. Stop indexing if possible, then flush (the synced-flush API was removed in 8.x)
curl -X POST "localhost:9200/_flush"

# 3. Stop the node
sudo systemctl stop elasticsearch

# 4. Perform maintenance (upgrade, config change)

# 5. Start the node
sudo systemctl start elasticsearch

# 6. Wait for node to rejoin
curl "localhost:9200/_cat/nodes?v"

# 7. Re-enable allocation
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "cluster.routing.allocation.enable": null
  }
}'

# 8. Wait for green status
curl "localhost:9200/_cluster/health?wait_for_status=green"

# 9. Repeat for next node
```
A backup you haven't tested is not a backup. Periodically restore to a test cluster and verify data integrity. Practice your disaster recovery runbook before you need it in a crisis.
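A sketch of that verification, reusing the repository name from above and a hypothetical snapshot name; restoring into renamed indexes avoids clobbering live data if the test cluster is shared:

```
// List available snapshots in the repository
GET /_snapshot/backup_repo/_all

// Restore one snapshot into renamed indexes (snapshot name is hypothetical)
POST /_snapshot/backup_repo/daily-snap-2024.01.01/_restore
{
  "indices": "logs-*",
  "rename_pattern": "logs-(.+)",
  "rename_replacement": "restored-logs-$1",
  "include_global_state": false
}

// Then spot-check document counts against the source cluster
GET /restored-logs-*/_count
```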
Scaling Elasticsearch is an ongoing process of capacity planning, configuration tuning, monitoring, and operational excellence. The key principles: plan capacity from the measured workload and validate with load tests; match hardware to node roles; follow the heap and OS tuning rules; automate data tiering and retention with ILM; monitor the metrics that predict failure, from a separate cluster; and rely on dedicated masters, tested snapshots, and rolling restart procedures for day-to-day operations.
Module Complete:
You now have a comprehensive understanding of Elasticsearch—from distributed architecture and sharding, through mapping and Query DSL, to production scaling. This knowledge enables you to design, implement, and operate Elasticsearch deployments that power search at scale.
Elasticsearch continues to evolve rapidly. Stay current with release notes, participate in the community, and continuously refine your deployments based on real-world performance data.
Congratulations! You've completed the Elasticsearch module, covering architecture, shards/replicas, mapping/analysis, Query DSL, and scaling. You're now equipped to design and operate production Elasticsearch clusters that deliver fast, relevant search at scale.