System Design (HLD)Logging at Scale

Logging at Scale: Production-Grade Observability

LevelIntermediate

Duration60 mins

TopicLogging at Scale

3 / 5

Log Aggregation (ELK, OpenSearch)

From Scattered Files to Centralized Intelligence

In monolithic applications, logs lived in /var/log/application.log. When something broke, you SSH'd into the server and ran grep. This doesn't scale.

Modern distributed systems span hundreds of microservices across thousands of containers that may live for minutes. Logs are ephemeral—when a container dies, its logs vanish. Debugging requires correlating events across dozens of services simultaneously. Manual log inspection becomes impossible.

Log aggregation solves this by collecting logs from all sources into a centralized, searchable repository. Query logs from 500 services with a single search. Reconstruct request flows across microservices. Alert on patterns across your entire fleet. This is the foundation of operational visibility at scale.

What You Will Learn

By the end of this page, you'll understand log aggregation architecture patterns, master the ELK Stack (Elasticsearch, Logstash, Kibana), know when to choose OpenSearch over Elasticsearch, understand Grafana Loki's different approach, and be able to design production-ready log aggregation systems.

Log Aggregation Architecture

All log aggregation systems share a common architectural pattern, regardless of which specific technologies you choose:

The Universal Pipeline:

Log Generation: Applications emit structured logs (stdout, files, syslog)
Collection: Agents on each host/container collect and forward logs
Transport: Message queue or streaming platform buffers logs
Processing: Parsing, enrichment, transformation before storage
Storage: Write to indexed search engine or object store
Query/Visualization: Search, analyze, and visualize log data

Log Aggregation Pipeline Components
Stage	Purpose	Common Technologies
Generation	Create log entries	Application code, frameworks, system services
Collection	Gather and forward logs from sources	Filebeat, Fluent Bit, Fluentd, Vector, Promtail
Transport	Buffer and reliable delivery	Kafka, Redis, RabbitMQ, direct HTTP/gRPC
Processing	Parse, enrich, filter, transform	Logstash, Fluentd, Vector, Kafka Streams
Storage	Index for fast querying	Elasticsearch, OpenSearch, Loki, ClickHouse
Visualization	Query UI and dashboards	Kibana, Grafana, OpenSearch Dashboards

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
┌─────────────────────────────────────────────────────────────────────────┐
│                        APPLICATION LAYER                                  │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ Service A│  │ Service B│  │ Service C│  │ Service D│  │ Service E│  │
│  │ (stdout) │  │ (stdout) │  │ (stdout) │  │ (stdout) │  │ (stdout) │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘  │
└───────┼─────────────┼─────────────┼─────────────┼─────────────┼────────┘
        │             │             │             │             │
        ▼             ▼             ▼             ▼             ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        COLLECTION LAYER                                   │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  Log Collectors (Filebeat/Fluent Bit per node)                    │  │
│  │  - Tail container stdout/stderr                                   │  │
│  │  - Add metadata (pod, namespace, container ID)                    │  │
│  │  - Buffer locally, forward with backpressure handling             │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        BUFFERING LAYER (Optional)                        │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  Kafka / Redis / Message Queue                                    │  │
│  │  - Decouples collection from processing                           │  │
│  │  - Handles traffic spikes                                         │  │
│  │  - Enables replay for reprocessing                                │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        PROCESSING LAYER                                   │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  Logstash / Vector / Stream Processor                             │  │
│  │  - Parse and validate JSON                                        │  │
│  │  - Enrich with additional context                                 │  │
│  │  - Route to appropriate indexes                                   │  │
│  │  - Filter sensitive data                                          │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        STORAGE LAYER                                      │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  Elasticsearch / OpenSearch Cluster                               │  │
│  │  - Sharded for horizontal scale                                   │  │
│  │  - Replicated for durability                                      │  │
│  │  - Time-based indexes (logs-2024.01.15)                          │  │
│  │  - Hot/warm/cold tiering for cost optimization                    │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        VISUALIZATION LAYER                                │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  Kibana / Grafana / OpenSearch Dashboards                         │  │
│  │  - Full-text search                                               │  │
│  │  - Saved queries and dashboards                                   │  │
│  │  - Alert definition                                               │  │
│  │  - Access control                                                 │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘

The Buffering Decision

Many architectures skip the buffering layer for simplicity, sending directly from collectors to Elasticsearch. This works until a traffic spike or Elasticsearch maintenance window causes log loss. Kafka as a buffer enables: replay if processing fails, multiple consumers for different purposes, and graceful handling of downstream outages.

The ELK Stack Deep Dive

The ELK Stack (Elasticsearch, Logstash, Kibana) is the most widely deployed log aggregation solution. Originally open source, Elastic changed licensing in 2021, spawning OpenSearch (covered later). Understanding ELK is foundational regardless of which variant you use.

Three Pillars of ELK:

Elasticsearch — Distributed search and analytics engine built on Apache Lucene. Stores and indexes logs for fast full-text search. Scales horizontally through sharding.

Logstash — Data processing pipeline. Collects, transforms, and routes data from various sources to destinations. Plugin-based architecture supports hundreds of inputs/outputs.

Kibana — Visualization layer. Web UI for searching, visualizing, and analyzing Elasticsearch data. Dashboards, saved searches, and alerting.

ELK Stack Component Details
Component	Role	Key Capabilities	Resource Profile
Elasticsearch	Storage + Search	Full-text search, aggregations, distributed clustering	Memory-intensive (heap), SSD-optimized
Logstash	Processing	Input plugins, filters, output plugins, codecs	CPU-intensive, memory for buffering
Kibana	Visualization	Dashboards, Discover, alerts, RBAC	Low resource, stateless
Beats	Collection	Lightweight shippers (Filebeat, Metricbeat)	Minimal footprint per node

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# elasticsearch.yml - Data node configuration
cluster.name: production-logs
node.name: es-data-01
 
# Node roles (Elasticsearch 7.9+)
node.roles: [ data, data_content, data_hot ]
 
# Network
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
 
# Discovery
discovery.seed_hosts:
  - es-master-01.internal:9300
  - es-master-02.internal:9300
  - es-master-03.internal:9300
 
# Memory: Heap should be ~50% of RAM, max 32GB
# Set via ES_JAVA_OPTS: -Xms16g -Xmx16g
 
# Disk space thresholds
cluster.routing.allocation.disk.watermark.low: 85%
cluster.routing.allocation.disk.watermark.high: 90%
cluster.routing.allocation.disk.watermark.flood_stage: 95%
 
# Index lifecycle management
action.destructive_requires_name: true
 
# Security (X-Pack)
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
# /etc/logstash/conf.d/logs-pipeline.conf
 
input {
  beats {
    port => 5044
    ssl => true
    ssl_certificate => "/etc/logstash/ssl/logstash.crt"
    ssl_key => "/etc/logstash/ssl/logstash.key"
  }
  
  # Alternative: Direct Kafka consumption
  kafka {
    bootstrap_servers => "kafka-1:9092,kafka-2:9092,kafka-3:9092"
    topics => ["application-logs"]
    group_id => "logstash-consumers"
    codec => json
  }
}
 
filter {
  # Parse JSON body if not already parsed
  if [message] =~ /^\{/ {
    json {
      source => "message"
      target => "parsed"
    }
    
    # Move parsed fields to top level if successful
    if "_jsonparsefailure" not in [tags] {
      mutate {
        rename => { "[parsed][timestamp]" => "timestamp" }
        rename => { "[parsed][level]" => "level" }
        rename => { "[parsed][service]" => "service" }
        remove_field => [ "message", "parsed" ]
      }
    }
  }
  
  # Parse timestamp
  date {
    match => [ "timestamp", "ISO8601" ]
    target => "@timestamp"
    remove_field => [ "timestamp" ]
  }
  
  # Add processing metadata
  mutate {
    add_field => {
      "processed_at" => "%{+ISO8601}"
      "pipeline_version" => "2.1.0"
    }
  }
  
  # GeoIP enrichment for client IPs
  if [client_ip] {
    geoip {
      source => "client_ip"
      target => "geoip"
    }
  }
  
  # Drop DEBUG logs in production (optional)
  if [level] == "DEBUG" {
    drop { }
  }
  
  # Redact sensitive fields
  mutate {
    gsub => [
      "password", ".+", "[REDACTED]",
      "credit_card", "\d{12}(\d{4})", "************\1"
    ]
  }
}
 
output {
  elasticsearch {
    hosts => ["https://es-data-01:9200", "https://es-data-02:9200"]
    index => "logs-%{[service]}-%{+YYYY.MM.dd}"
    user => "logstash_writer"
    password => "${LOGSTASH_ES_PASSWORD}"
    ssl => true
    cacert => "/etc/logstash/ssl/ca.crt"
  }
  
  # Dead letter queue for failed documents
  if "_jsonparsefailure" in [tags] {
    file {
      path => "/var/log/logstash/dlq/%{+YYYY-MM-dd}.json"
    }
  }
}

Logstash vs Ingest Pipelines

For simple transformations, Elasticsearch's built-in Ingest Pipelines can replace Logstash, reducing operational complexity. Use Logstash when you need complex routing, external enrichment lookups, or multi-destination output. For straightforward JSON log parsing, Ingest Pipelines are often sufficient.

Elasticsearch Indexing Strategies

Elasticsearch indexing configuration dramatically impacts query performance, storage costs, and operational complexity. Production log systems require careful index design.

Time-Based Indices

The dominant pattern for logs is time-based indexing: logs-2024.01.15. Each day (or hour, for high volume) gets a new index. This enables:

Efficient deletion: Drop old indices entirely instead of deleting documents
Tiered storage: Move older indexes to cheaper storage
Query optimization: Searches typically filter by time range
Independent scaling: Recent data gets more resources

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
{
  "index_patterns": ["logs-*"],
  "priority": 100,
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "codec": "best_compression",
      "refresh_interval": "30s",
      
      "index.lifecycle.name": "logs-lifecycle-policy",
      "index.lifecycle.rollover_alias": "logs-write",
      
      "index.translog.durability": "async",
      "index.translog.sync_interval": "5s"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "level": { "type": "keyword" },
        "logger": { "type": "keyword" },
        "service": { "type": "keyword" },
        "version": { "type": "keyword" },
        "environment": { "type": "keyword" },
        "trace_id": { "type": "keyword" },
        "span_id": { "type": "keyword" },
        "message": { 
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "error": {
          "properties": {
            "type": { "type": "keyword" },
            "message": { "type": "text" },
            "stack_trace": { "type": "text", "index": false }
          }
        },
        "http": {
          "properties": {
            "method": { "type": "keyword" },
            "status_code": { "type": "short" },
            "path": { "type": "keyword" },
            "latency_ms": { "type": "integer" }
          }
        },
        "user_id": { "type": "keyword" },
        "request_id": { "type": "keyword" }
      },
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      ]
    }
  }
}

Index Template Best Practices

•Use keyword type for filterable strings — level, service, user_id should be keyword type for exact matching and aggregations. Reserve text for fields needing full-text search.
•Don't index stack traces — Set index: false for large, rarely-searched fields like stack traces. Reduces index size significantly.
•Set explicit mappings — Disable dynamic mapping or carefully configure dynamic templates. Uncontrolled dynamic mapping causes mapping explosions.
•Configure refresh interval — Default 1-second refresh is expensive. 30 seconds is usually fine for logs; near-real-time search is rarely needed.
•Consider doc values — For high-cardinality aggregations, ensure doc values are enabled (default for keyword).
•Use ILM (Index Lifecycle Management) — Automate rollover, shrink, and deletion based on age or size.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "allocate": {
            "require": {
              "data": "warm"
            }
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": {
              "data": "cold"
            }
          },
          "freeze": {},
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Shard Sizing Critical

Elasticsearch performance degrades with too many small shards or too few large shards. Target 20-50GB per shard. At 1TB/day ingest, you need ~25 primary shards across daily indices. Calculate: (daily_volume ÷ target_shard_size) = shards_per_day.

OpenSearch: The Fork That Changed Everything

In January 2021, Elastic changed Elasticsearch's license from Apache 2.0 to Server Side Public License (SSPL). AWS and the open-source community responded by forking Elasticsearch 7.10.2 into OpenSearch, maintained by AWS under Apache 2.0.

Why This Matters:

The license change means cloud providers cannot offer Elasticsearch as a managed service without restrictions. OpenSearch provides a truly open-source alternative with continued development and AWS-backed support.

OpenSearch vs Elasticsearch:

OpenSearch vs Elasticsearch Comparison
Aspect	Elasticsearch	OpenSearch
License	SSPL (Elastic License 2.0)	Apache 2.0 (truly open source)
Cloud Offerings	Elastic Cloud (vendor-managed)	AWS OpenSearch, self-managed
API Compatibility	Original API, diverging	Maintains ES 7.x compatibility, diverging
Security Features	X-Pack (some features require licensing)	Security built-in and free
Alerting	Elastic Alerting (requires license)	Alerting plugin included free
Community	Elastic-controlled	Community-governed, AWS-supported
Feature Development	Proprietary roadmap	Open roadmap, community input

When to Choose OpenSearch

•Cloud-native deployment — AWS OpenSearch Service is fully managed. If you're on AWS, it's the path of least resistance.
•License compliance concerns — SSPL is controversial for organizations with redistribution requirements. Apache 2.0 has no such concerns.
•Cost sensitivity — OpenSearch includes security, alerting, and anomaly detection free. Elasticsearch charges for equivalent features.
•Portability requirements — Apache 2.0 allows moving between cloud providers without license concerns.
•Existing ES 7.x workloads — OpenSearch is API-compatible; migration is often straightforward.

When to Choose Elasticsearch

•Elastic-specific features — Machine learning, APM, SIEM capabilities are more mature in Elastic's offerings.
•Vendor support preference — Elastic provides commercial support; some organizations prefer this.
•ES 8.x features — OpenSearch is diverging from Elasticsearch; ES 8.x features may not appear in OpenSearch.
•Existing Elastic ecosystem — If deeply integrated with Elastic Beats, Agent, APM, migration is complex.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
# AWS CDK TypeScript - OpenSearch Domain
const domain = new opensearch.Domain(this, 'LogsDomain', {
  version: opensearch.EngineVersion.OPENSEARCH_2_11,
  
  capacity: {
    dataNodes: 3,
    dataNodeInstanceType: 'r6g.large.search',
    masterNodes: 3,
    masterNodeInstanceType: 'm6g.large.search',
  },
  
  ebs: {
    volumeSize: 500,
    volumeType: ec2.EbsDeviceVolumeType.GP3,
  },
  
  nodeToNodeEncryption: true,
  encryptionAtRest: {
    enabled: true,
  },
  
  vpc: vpc,
  vpcSubnets: [{ subnetType: ec2.SubnetType.PRIVATE_ISOLATED }],
  
  zoneAwareness: {
    enabled: true,
    availabilityZoneCount: 3,
  },
  
  logging: {
    slowSearchLogEnabled: true,
    appLogEnabled: true,
    slowIndexLogEnabled: true,
  },
  
  fineGrainedAccessControl: {
    masterUserName: 'admin',
    masterUserPassword: secretsManager.Secret.fromSecretNameV2(...),
  },
});

Migration Path

Migrating from Elasticsearch 7.x to OpenSearch is generally straightforward—snapshot and restore works across the fork. ES 8.x to OpenSearch requires more effort due to diverging APIs. Plan migration during the 7.x window if possible.

Grafana Loki: A Different Approach

Grafana Loki takes a fundamentally different approach to log aggregation. While Elasticsearch indexes the full content of every log line (enabling arbitrary searches), Loki indexes only labels (metadata) and stores log content as compressed chunks. This design trades some flexibility for dramatic cost reduction.

The Loki Philosophy:

"Logs are like metrics, but have rich content."

Loki is designed for Kubernetes-native environments where structured labels (namespace, pod, container) provide sufficient filtering for most queries. You query by label selectors, then filter log content within matching streams.

Loki Advantages

•10-50x lower storage costs than Elasticsearch for equivalent data
•Simpler operations — fewer components, smaller clusters
•Native Grafana integration — same UI for metrics and logs
•Prometheus-like labels — familiar for teams using Prometheus
•Efficient for high-cardinality labels (within limits)

Loki Limitations

•No full-text indexing — can't search for arbitrary strings not in labels
•Label cardinality limits — millions of unique label values cause issues
•Regex on large streams is slow — must narrow with labels first
•Less mature than ELK ecosystem
•LogQL learning curve vs familiar Elasticsearch queries

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# Basic label selector (like Prometheus)
{namespace="production", app="payment-service"}
 
# Filter by log level using label
{namespace="production", level="error"}
 
# Regex filter on log content (after label selection)
{namespace="production", app="payment-service"} |= "payment_failed"
 
# Negative filter (exclude lines)
{namespace="production"} != "healthcheck"
 
# JSON parsing for structured logs
{namespace="production", app="order-service"} 
  | json 
  | user_id="usr_a1b2c3"
 
# Aggregation (like Prometheus)
sum(rate({namespace="production", level="error"}[5m])) by (app)
 
# Pattern extraction for metrics
{namespace="production"} 
  | pattern "<timestamp> <level> <_> latency=<latency>ms"
  | latency > 500

Loki vs Elasticsearch: When to Use Each
Scenario	Loki	Elasticsearch
Kubernetes-native environment	✅ Excellent fit	✅ Works but heavier
Need arbitrary full-text search	❌ Not indexed	✅ Built for this
Cost is primary concern	✅ 10-50x cheaper	❌ Expensive at scale
Already have Grafana+Prometheus	✅ Perfect integration	⚠️ Requires Kibana or additional config
Need complex aggregations	⚠️ Limited to label aggregations	✅ Powerful aggregation DSL
Log analytics/ML workloads	❌ Not designed for this	✅ Strong analytics capabilities
Compliance requiring full indexing	❌ Doesn't index content	✅ Full indexing

Hybrid Approach

Many organizations use both: Loki for high-volume operational logs (DEBUG, INFO) where label-based querying suffices, and Elasticsearch for audit logs and low-volume ERROR logs requiring full-text search. Route different log types to different backends based on requirements.

Operational Considerations

Log aggregation systems are critical infrastructure—when they fail, debugging becomes blind. Operating these systems at scale requires attention to specific challenges:

Capacity Planning:

Log volume grows with application traffic, new services, and debugging efforts. Plan for 20-50% annual growth plus spikes during incidents (when everyone adds logging).

Capacity Planning Reference
Scale	Daily Ingest	Elasticsearch Cluster	Monthly Cost (Rough)
Small	50GB/day	3 data nodes, 3 masters	$1,000-2,000
Medium	500GB/day	6-9 data nodes, 3 masters	$5,000-10,000
Large	5TB/day	20-30 data nodes, 3 dedicated masters	$25,000-50,000
Enterprise	50TB/day	100+ data nodes, multi-cluster	$200,000+

Critical Operational Practices

•Monitor the log system with a separate system — If your logging infrastructure uses itself for monitoring, you can't detect its own failures. Use a separate metrics system (Prometheus) to monitor Elasticsearch health.
•Set up cardinality alerts — New high-cardinality fields (like UUIDs used as field values) cause mapping explosions. Alert when field count grows unexpectedly.
•Test retention policy regularly — Verify that ILM policies are executing and deleting old data. Storage exhaustion is a common failure mode.
•Maintain cold storage backups — For compliance requirements, snapshot to S3 before deletion. Cheaper than keeping hot indices.
•Plan for cluster upgrades — Rolling upgrades require careful coordination. Document and practice the procedure.
•Rate limit at ingestion — Runaway logging (debug accidentally enabled everywhere) can overwhelm the cluster. Set ingestion rate limits.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
groups:
- name: elasticsearch
  rules:
  # Cluster health not green
  - alert: ElasticsearchClusterHealthYellow
    expr: elasticsearch_cluster_health_status{color="yellow"} == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch cluster health is yellow"
      description: "Some replicas are not allocated. Check node health."
 
  - alert: ElasticsearchClusterHealthRed
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch cluster health is RED"
      description: "Primary shards are unassigned. Data may be unavailable."
 
  # Disk space
  - alert: ElasticsearchDiskSpaceLow
    expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes < 0.15
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch disk space low on {{ $labels.node }}"
 
  # Ingest rate drop (may indicate collector failure)
  - alert: ElasticsearchIngestRateDrop
    expr: rate(elasticsearch_indices_indexing_index_total[5m]) < 0.5 * avg_over_time(rate(elasticsearch_indices_indexing_index_total[5m])[1h:5m])
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch ingest rate dropped significantly"
      description: "May indicate log collector failures or network issues."
 
  # JVM heap pressure
  - alert: ElasticsearchJVMHeapHigh
    expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch JVM heap usage high on {{ $labels.node }}"

The Mapping Explosion Disaster

Elasticsearch limits fields per index (default: 1000). Dynamic mapping + untrusted input = disaster. If an application logs user-provided data as field names (JSON body with arbitrary keys), field count explodes. The cluster rejects documents, logs are lost. Prevent this with strict schemas and input validation.

Query Patterns and Performance

Efficient querying is essential for production incident response. Slow queries during outages compound the problem. Understanding query performance helps you design for fast debugging.

Query Performance Principles:

Query Optimization Strategies

•Always filter by time range first — Time-based indices mean specifying time range excludes most indices from the search entirely.
•Use keyword fields for filters — Filtering on keyword fields is O(1) through term queries. text fields require analysis and are slower.
•Avoid leading wildcards — *error* scans all terms. error* uses the term index efficiently.
•Limit result size — Retrieve only what you need. size: 10 is vastly faster than size: 10000.
•Use aggregations instead of fetching all — To count error types, use terms aggregation rather than fetching all documents.
•Leverage doc_values for sorting — Sorting on non-doc-values fields is expensive. Ensure sort fields have doc_values enabled.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
// ❌ SLOW: Full text search without filters
{
  "query": {
    "match": { "message": "payment failed" }
  }
}
 
// ✅ FAST: Time + keyword filters narrow scope first
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "payment failed" } }
      ],
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1h" } } },
        { "term": { "level": "error" } },
        { "term": { "service": "payment-service" } }
      ]
    }
  },
  "size": 100,
  "sort": [{ "@timestamp": "desc" }]
}
 
// Aggregation for error breakdown (no document fetch)
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-24h" } } },
        { "term": { "level": "error" } }
      ]
    }
  },
  "aggs": {
    "errors_by_service": {
      "terms": { "field": "service", "size": 20 },
      "aggs": {
        "error_types": {
          "terms": { "field": "error.type", "size": 10 }
        }
      }
    }
  }
}
 
// Get specific trace (fast: keyword exact match)
{
  "query": {
    "term": { "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736" }
  },
  "sort": [{ "@timestamp": "asc" }]
}

Query Type Performance Comparison
Query Type	Relative Speed	Use Case
term (keyword)	Fastest	Exact ID lookup, service/level filtering
range (timestamp)	Very Fast	Time-based filtering (always use)
bool with filters	Fast	Combining multiple criteria
match (analyzed text)	Moderate	Full-text search within narrowed scope
wildcard (prefix)	Slow	Pattern matching on log content
wildcard (leading)	Very Slow	Avoid: scans all terms
regex	Slowest	Use only as last resort on small result sets

Prepare Debug Queries in Advance

During calm periods, create and save queries you'll need during incidents: recent errors by service, trace lookup by ID, error rate over time. When the incident hits, you'll have fast, tested queries ready instead of constructing them under pressure.

Summary: Log Aggregation Mastery

Log aggregation transforms scattered, ephemeral logs into a centralized, searchable source of system truth. The choice of technology impacts cost, capability, and operational burden.

Key Takeaways:

Key Takeaways

•All aggregation follows the same architecture — Collection → Transport → Processing → Storage → Visualization. Technology choices fill each slot.
•ELK Stack is the industry standard — Elasticsearch provides powerful search; Logstash transforms data; Kibana visualizes. Mature ecosystem with extensive documentation.
•OpenSearch is the open-source fork — API-compatible with ES 7.x, Apache 2.0 licensed, security included free. Choose for cloud-native deployments and cost sensitivity.
•Loki trades flexibility for cost — 10-50x cheaper by indexing only labels. Perfect for Kubernetes environments with Grafana already deployed.
•Index strategy is critical — Time-based indexes, ILM policies, appropriate shard sizing directly impact performance and cost.
•Monitor the monitor — Use separate metrics to watch your logging infrastructure. It can't report its own failures.
•Query design matters during incidents — Prepare efficient queries in advance. Time filters + keyword filters + limited results = fast debugging.

What's next:

Log aggregation stores your logs, but without thoughtful retention policies, storage costs explode. The next page covers log retention and cost management—balancing debugging capability, compliance requirements, and budget constraints.

Page Complete

You now understand log aggregation architecture and can evaluate ELK, OpenSearch, and Loki for your needs. You can design Elasticsearch indexing strategies and query patterns for efficient debugging. Next, we'll tackle the business side: retention policies and cost optimization.

3 / 5

Loading learning content...

System Design (HLD)Logging at Scale

Logging at Scale: Production-Grade Observability

LevelIntermediate

Duration60 mins

TopicLogging at Scale

3 / 5

Log Aggregation (ELK, OpenSearch)

From Scattered Files to Centralized Intelligence

In monolithic applications, logs lived in /var/log/application.log. When something broke, you SSH'd into the server and ran grep. This doesn't scale.

What You Will Learn

Log Aggregation Architecture

All log aggregation systems share a common architectural pattern, regardless of which specific technologies you choose:

The Universal Pipeline:

Log Generation: Applications emit structured logs (stdout, files, syslog)
Collection: Agents on each host/container collect and forward logs
Transport: Message queue or streaming platform buffers logs
Processing: Parsing, enrichment, transformation before storage
Storage: Write to indexed search engine or object store
Query/Visualization: Search, analyze, and visualize log data

Log Aggregation Pipeline Components
Stage	Purpose	Common Technologies
Generation	Create log entries	Application code, frameworks, system services
Collection	Gather and forward logs from sources	Filebeat, Fluent Bit, Fluentd, Vector, Promtail
Transport	Buffer and reliable delivery	Kafka, Redis, RabbitMQ, direct HTTP/gRPC
Processing	Parse, enrich, filter, transform	Logstash, Fluentd, Vector, Kafka Streams
Storage	Index for fast querying	Elasticsearch, OpenSearch, Loki, ClickHouse
Visualization	Query UI and dashboards	Kibana, Grafana, OpenSearch Dashboards

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
┌─────────────────────────────────────────────────────────────────────────┐
│                        APPLICATION LAYER                                  │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │ Service A│  │ Service B│  │ Service C│  │ Service D│  │ Service E│  │
│  │ (stdout) │  │ (stdout) │  │ (stdout) │  │ (stdout) │  │ (stdout) │  │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬─────┘  │
└───────┼─────────────┼─────────────┼─────────────┼─────────────┼────────┘
        │             │             │             │             │
        ▼             ▼             ▼             ▼             ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        COLLECTION LAYER                                   │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  Log Collectors (Filebeat/Fluent Bit per node)                    │  │
│  │  - Tail container stdout/stderr                                   │  │
│  │  - Add metadata (pod, namespace, container ID)                    │  │
│  │  - Buffer locally, forward with backpressure handling             │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        BUFFERING LAYER (Optional)                        │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  Kafka / Redis / Message Queue                                    │  │
│  │  - Decouples collection from processing                           │  │
│  │  - Handles traffic spikes                                         │  │
│  │  - Enables replay for reprocessing                                │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        PROCESSING LAYER                                   │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  Logstash / Vector / Stream Processor                             │  │
│  │  - Parse and validate JSON                                        │  │
│  │  - Enrich with additional context                                 │  │
│  │  - Route to appropriate indexes                                   │  │
│  │  - Filter sensitive data                                          │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        STORAGE LAYER                                      │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  Elasticsearch / OpenSearch Cluster                               │  │
│  │  - Sharded for horizontal scale                                   │  │
│  │  - Replicated for durability                                      │  │
│  │  - Time-based indexes (logs-2024.01.15)                          │  │
│  │  - Hot/warm/cold tiering for cost optimization                    │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                        VISUALIZATION LAYER                                │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │  Kibana / Grafana / OpenSearch Dashboards                         │  │
│  │  - Full-text search                                               │  │
│  │  - Saved queries and dashboards                                   │  │
│  │  - Alert definition                                               │  │
│  │  - Access control                                                 │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘

The Buffering Decision

The ELK Stack Deep Dive

Three Pillars of ELK:

Elasticsearch — Distributed search and analytics engine built on Apache Lucene. Stores and indexes logs for fast full-text search. Scales horizontally through sharding.

Logstash — Data processing pipeline. Collects, transforms, and routes data from various sources to destinations. Plugin-based architecture supports hundreds of inputs/outputs.

Kibana — Visualization layer. Web UI for searching, visualizing, and analyzing Elasticsearch data. Dashboards, saved searches, and alerting.

ELK Stack Component Details
Component	Role	Key Capabilities	Resource Profile
Elasticsearch	Storage + Search	Full-text search, aggregations, distributed clustering	Memory-intensive (heap), SSD-optimized
Logstash	Processing	Input plugins, filters, output plugins, codecs	CPU-intensive, memory for buffering
Kibana	Visualization	Dashboards, Discover, alerts, RBAC	Low resource, stateless
Beats	Collection	Lightweight shippers (Filebeat, Metricbeat)	Minimal footprint per node

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# elasticsearch.yml - Data node configuration
cluster.name: production-logs
node.name: es-data-01
 
# Node roles (Elasticsearch 7.9+)
node.roles: [ data, data_content, data_hot ]
 
# Network
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300
 
# Discovery
discovery.seed_hosts:
  - es-master-01.internal:9300
  - es-master-02.internal:9300
  - es-master-03.internal:9300
 
# Memory: Heap should be ~50% of RAM, max 32GB
# Set via ES_JAVA_OPTS: -Xms16g -Xmx16g
 
# Disk space thresholds
cluster.routing.allocation.disk.watermark.low: 85%
cluster.routing.allocation.disk.watermark.high: 90%
cluster.routing.allocation.disk.watermark.flood_stage: 95%
 
# Index lifecycle management
action.destructive_requires_name: true
 
# Security (X-Pack)
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
# /etc/logstash/conf.d/logs-pipeline.conf
 
input {
  beats {
    port => 5044
    ssl => true
    ssl_certificate => "/etc/logstash/ssl/logstash.crt"
    ssl_key => "/etc/logstash/ssl/logstash.key"
  }
  
  # Alternative: Direct Kafka consumption
  kafka {
    bootstrap_servers => "kafka-1:9092,kafka-2:9092,kafka-3:9092"
    topics => ["application-logs"]
    group_id => "logstash-consumers"
    codec => json
  }
}
 
filter {
  # Parse JSON body if not already parsed
  if [message] =~ /^\{/ {
    json {
      source => "message"
      target => "parsed"
    }
    
    # Move parsed fields to top level if successful
    if "_jsonparsefailure" not in [tags] {
      mutate {
        rename => { "[parsed][timestamp]" => "timestamp" }
        rename => { "[parsed][level]" => "level" }
        rename => { "[parsed][service]" => "service" }
        remove_field => [ "message", "parsed" ]
      }
    }
  }
  
  # Parse timestamp
  date {
    match => [ "timestamp", "ISO8601" ]
    target => "@timestamp"
    remove_field => [ "timestamp" ]
  }
  
  # Add processing metadata
  mutate {
    add_field => {
      "processed_at" => "%{+ISO8601}"
      "pipeline_version" => "2.1.0"
    }
  }
  
  # GeoIP enrichment for client IPs
  if [client_ip] {
    geoip {
      source => "client_ip"
      target => "geoip"
    }
  }
  
  # Drop DEBUG logs in production (optional)
  if [level] == "DEBUG" {
    drop { }
  }
  
  # Redact sensitive fields
  mutate {
    gsub => [
      "password", ".+", "[REDACTED]",
      "credit_card", "\d{12}(\d{4})", "************\1"
    ]
  }
}
 
output {
  elasticsearch {
    hosts => ["https://es-data-01:9200", "https://es-data-02:9200"]
    index => "logs-%{[service]}-%{+YYYY.MM.dd}"
    user => "logstash_writer"
    password => "${LOGSTASH_ES_PASSWORD}"
    ssl => true
    cacert => "/etc/logstash/ssl/ca.crt"
  }
  
  # Dead letter queue for failed documents
  if "_jsonparsefailure" in [tags] {
    file {
      path => "/var/log/logstash/dlq/%{+YYYY-MM-dd}.json"
    }
  }
}

Logstash vs Ingest Pipelines

Elasticsearch Indexing Strategies

Elasticsearch indexing configuration dramatically impacts query performance, storage costs, and operational complexity. Production log systems require careful index design.

Time-Based Indices

The dominant pattern for logs is time-based indexing: logs-2024.01.15. Each day (or hour, for high volume) gets a new index. This enables:

Efficient deletion: Drop old indices entirely instead of deleting documents
Tiered storage: Move older indexes to cheaper storage
Query optimization: Searches typically filter by time range
Independent scaling: Recent data gets more resources

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
{
  "index_patterns": ["logs-*"],
  "priority": 100,
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "codec": "best_compression",
      "refresh_interval": "30s",
      
      "index.lifecycle.name": "logs-lifecycle-policy",
      "index.lifecycle.rollover_alias": "logs-write",
      
      "index.translog.durability": "async",
      "index.translog.sync_interval": "5s"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "level": { "type": "keyword" },
        "logger": { "type": "keyword" },
        "service": { "type": "keyword" },
        "version": { "type": "keyword" },
        "environment": { "type": "keyword" },
        "trace_id": { "type": "keyword" },
        "span_id": { "type": "keyword" },
        "message": { 
          "type": "text",
          "fields": {
            "keyword": { "type": "keyword", "ignore_above": 256 }
          }
        },
        "error": {
          "properties": {
            "type": { "type": "keyword" },
            "message": { "type": "text" },
            "stack_trace": { "type": "text", "index": false }
          }
        },
        "http": {
          "properties": {
            "method": { "type": "keyword" },
            "status_code": { "type": "short" },
            "path": { "type": "keyword" },
            "latency_ms": { "type": "integer" }
          }
        },
        "user_id": { "type": "keyword" },
        "request_id": { "type": "keyword" }
      },
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      ]
    }
  }
}

Index Template Best Practices

•Use keyword type for filterable strings — level, service, user_id should be keyword type for exact matching and aggregations. Reserve text for fields needing full-text search.
•Don't index stack traces — Set index: false for large, rarely-searched fields like stack traces. Reduces index size significantly.
•Set explicit mappings — Disable dynamic mapping or carefully configure dynamic templates. Uncontrolled dynamic mapping causes mapping explosions.
•Configure refresh interval — Default 1-second refresh is expensive. 30 seconds is usually fine for logs; near-real-time search is rarely needed.
•Consider doc values — For high-cardinality aggregations, ensure doc values are enabled (default for keyword).
•Use ILM (Index Lifecycle Management) — Automate rollover, shrink, and deletion based on age or size.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          },
          "allocate": {
            "require": {
              "data": "warm"
            }
          },
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": {
              "data": "cold"
            }
          },
          "freeze": {},
          "set_priority": {
            "priority": 0
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}

Shard Sizing Critical

OpenSearch: The Fork That Changed Everything

Why This Matters:

OpenSearch vs Elasticsearch:

OpenSearch vs Elasticsearch Comparison
Aspect	Elasticsearch	OpenSearch
License	SSPL (Elastic License 2.0)	Apache 2.0 (truly open source)
Cloud Offerings	Elastic Cloud (vendor-managed)	AWS OpenSearch, self-managed
API Compatibility	Original API, diverging	Maintains ES 7.x compatibility, diverging
Security Features	X-Pack (some features require licensing)	Security built-in and free
Alerting	Elastic Alerting (requires license)	Alerting plugin included free
Community	Elastic-controlled	Community-governed, AWS-supported
Feature Development	Proprietary roadmap	Open roadmap, community input

When to Choose OpenSearch

•Cloud-native deployment — AWS OpenSearch Service is fully managed. If you're on AWS, it's the path of least resistance.
•License compliance concerns — SSPL is controversial for organizations with redistribution requirements. Apache 2.0 has no such concerns.
•Cost sensitivity — OpenSearch includes security, alerting, and anomaly detection free. Elasticsearch charges for equivalent features.
•Portability requirements — Apache 2.0 allows moving between cloud providers without license concerns.
•Existing ES 7.x workloads — OpenSearch is API-compatible; migration is often straightforward.

When to Choose Elasticsearch

•Elastic-specific features — Machine learning, APM, SIEM capabilities are more mature in Elastic's offerings.
•Vendor support preference — Elastic provides commercial support; some organizations prefer this.
•ES 8.x features — OpenSearch is diverging from Elasticsearch; ES 8.x features may not appear in OpenSearch.
•Existing Elastic ecosystem — If deeply integrated with Elastic Beats, Agent, APM, migration is complex.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
# AWS CDK TypeScript - OpenSearch Domain
const domain = new opensearch.Domain(this, 'LogsDomain', {
  version: opensearch.EngineVersion.OPENSEARCH_2_11,
  
  capacity: {
    dataNodes: 3,
    dataNodeInstanceType: 'r6g.large.search',
    masterNodes: 3,
    masterNodeInstanceType: 'm6g.large.search',
  },
  
  ebs: {
    volumeSize: 500,
    volumeType: ec2.EbsDeviceVolumeType.GP3,
  },
  
  nodeToNodeEncryption: true,
  encryptionAtRest: {
    enabled: true,
  },
  
  vpc: vpc,
  vpcSubnets: [{ subnetType: ec2.SubnetType.PRIVATE_ISOLATED }],
  
  zoneAwareness: {
    enabled: true,
    availabilityZoneCount: 3,
  },
  
  logging: {
    slowSearchLogEnabled: true,
    appLogEnabled: true,
    slowIndexLogEnabled: true,
  },
  
  fineGrainedAccessControl: {
    masterUserName: 'admin',
    masterUserPassword: secretsManager.Secret.fromSecretNameV2(...),
  },
});

Migration Path

Grafana Loki: A Different Approach

The Loki Philosophy:

"Logs are like metrics, but have rich content."

Loki Advantages

•10-50x lower storage costs than Elasticsearch for equivalent data
•Simpler operations — fewer components, smaller clusters
•Native Grafana integration — same UI for metrics and logs
•Prometheus-like labels — familiar for teams using Prometheus
•Efficient for high-cardinality labels (within limits)

Loki Limitations

•No full-text indexing — can't search for arbitrary strings not in labels
•Label cardinality limits — millions of unique label values cause issues
•Regex on large streams is slow — must narrow with labels first
•Less mature than ELK ecosystem
•LogQL learning curve vs familiar Elasticsearch queries

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# Basic label selector (like Prometheus)
{namespace="production", app="payment-service"}
 
# Filter by log level using label
{namespace="production", level="error"}
 
# Regex filter on log content (after label selection)
{namespace="production", app="payment-service"} |= "payment_failed"
 
# Negative filter (exclude lines)
{namespace="production"} != "healthcheck"
 
# JSON parsing for structured logs
{namespace="production", app="order-service"} 
  | json 
  | user_id="usr_a1b2c3"
 
# Aggregation (like Prometheus)
sum(rate({namespace="production", level="error"}[5m])) by (app)
 
# Pattern extraction for metrics
{namespace="production"} 
  | pattern "<timestamp> <level> <_> latency=<latency>ms"
  | latency > 500

Loki vs Elasticsearch: When to Use Each
Scenario	Loki	Elasticsearch
Kubernetes-native environment	✅ Excellent fit	✅ Works but heavier
Need arbitrary full-text search	❌ Not indexed	✅ Built for this
Cost is primary concern	✅ 10-50x cheaper	❌ Expensive at scale
Already have Grafana+Prometheus	✅ Perfect integration	⚠️ Requires Kibana or additional config
Need complex aggregations	⚠️ Limited to label aggregations	✅ Powerful aggregation DSL
Log analytics/ML workloads	❌ Not designed for this	✅ Strong analytics capabilities
Compliance requiring full indexing	❌ Doesn't index content	✅ Full indexing

Hybrid Approach

Operational Considerations

Log aggregation systems are critical infrastructure—when they fail, debugging becomes blind. Operating these systems at scale requires attention to specific challenges:

Capacity Planning:

Log volume grows with application traffic, new services, and debugging efforts. Plan for 20-50% annual growth plus spikes during incidents (when everyone adds logging).

Capacity Planning Reference
Scale	Daily Ingest	Elasticsearch Cluster	Monthly Cost (Rough)
Small	50GB/day	3 data nodes, 3 masters	$1,000-2,000
Medium	500GB/day	6-9 data nodes, 3 masters	$5,000-10,000
Large	5TB/day	20-30 data nodes, 3 dedicated masters	$25,000-50,000
Enterprise	50TB/day	100+ data nodes, multi-cluster	$200,000+

Critical Operational Practices

•Monitor the log system with a separate system — If your logging infrastructure uses itself for monitoring, you can't detect its own failures. Use a separate metrics system (Prometheus) to monitor Elasticsearch health.
•Set up cardinality alerts — New high-cardinality fields (like UUIDs used as field values) cause mapping explosions. Alert when field count grows unexpectedly.
•Test retention policy regularly — Verify that ILM policies are executing and deleting old data. Storage exhaustion is a common failure mode.
•Maintain cold storage backups — For compliance requirements, snapshot to S3 before deletion. Cheaper than keeping hot indices.
•Plan for cluster upgrades — Rolling upgrades require careful coordination. Document and practice the procedure.
•Rate limit at ingestion — Runaway logging (debug accidentally enabled everywhere) can overwhelm the cluster. Set ingestion rate limits.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
groups:
- name: elasticsearch
  rules:
  # Cluster health not green
  - alert: ElasticsearchClusterHealthYellow
    expr: elasticsearch_cluster_health_status{color="yellow"} == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch cluster health is yellow"
      description: "Some replicas are not allocated. Check node health."
 
  - alert: ElasticsearchClusterHealthRed
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch cluster health is RED"
      description: "Primary shards are unassigned. Data may be unavailable."
 
  # Disk space
  - alert: ElasticsearchDiskSpaceLow
    expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes < 0.15
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch disk space low on {{ $labels.node }}"
 
  # Ingest rate drop (may indicate collector failure)
  - alert: ElasticsearchIngestRateDrop
    expr: rate(elasticsearch_indices_indexing_index_total[5m]) < 0.5 * avg_over_time(rate(elasticsearch_indices_indexing_index_total[5m])[1h:5m])
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch ingest rate dropped significantly"
      description: "May indicate log collector failures or network issues."
 
  # JVM heap pressure
  - alert: ElasticsearchJVMHeapHigh
    expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch JVM heap usage high on {{ $labels.node }}"

The Mapping Explosion Disaster

Query Patterns and Performance

Efficient querying is essential for production incident response. Slow queries during outages compound the problem. Understanding query performance helps you design for fast debugging.

Query Performance Principles:

Query Optimization Strategies

•Always filter by time range first — Time-based indices mean specifying time range excludes most indices from the search entirely.
•Use keyword fields for filters — Filtering on keyword fields is O(1) through term queries. text fields require analysis and are slower.
•Avoid leading wildcards — *error* scans all terms. error* uses the term index efficiently.
•Limit result size — Retrieve only what you need. size: 10 is vastly faster than size: 10000.
•Use aggregations instead of fetching all — To count error types, use terms aggregation rather than fetching all documents.
•Leverage doc_values for sorting — Sorting on non-doc-values fields is expensive. Ensure sort fields have doc_values enabled.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
// ❌ SLOW: Full text search without filters
{
  "query": {
    "match": { "message": "payment failed" }
  }
}
 
// ✅ FAST: Time + keyword filters narrow scope first
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "payment failed" } }
      ],
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1h" } } },
        { "term": { "level": "error" } },
        { "term": { "service": "payment-service" } }
      ]
    }
  },
  "size": 100,
  "sort": [{ "@timestamp": "desc" }]
}
 
// Aggregation for error breakdown (no document fetch)
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-24h" } } },
        { "term": { "level": "error" } }
      ]
    }
  },
  "aggs": {
    "errors_by_service": {
      "terms": { "field": "service", "size": 20 },
      "aggs": {
        "error_types": {
          "terms": { "field": "error.type", "size": 10 }
        }
      }
    }
  }
}
 
// Get specific trace (fast: keyword exact match)
{
  "query": {
    "term": { "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736" }
  },
  "sort": [{ "@timestamp": "asc" }]
}

Query Type Performance Comparison
Query Type	Relative Speed	Use Case
term (keyword)	Fastest	Exact ID lookup, service/level filtering
range (timestamp)	Very Fast	Time-based filtering (always use)
bool with filters	Fast	Combining multiple criteria
match (analyzed text)	Moderate	Full-text search within narrowed scope
wildcard (prefix)	Slow	Pattern matching on log content
wildcard (leading)	Very Slow	Avoid: scans all terms
regex	Slowest	Use only as last resort on small result sets

Prepare Debug Queries in Advance

Summary: Log Aggregation Mastery

Log aggregation transforms scattered, ephemeral logs into a centralized, searchable source of system truth. The choice of technology impacts cost, capability, and operational burden.

Key Takeaways:

Key Takeaways

•All aggregation follows the same architecture — Collection → Transport → Processing → Storage → Visualization. Technology choices fill each slot.
•ELK Stack is the industry standard — Elasticsearch provides powerful search; Logstash transforms data; Kibana visualizes. Mature ecosystem with extensive documentation.
•OpenSearch is the open-source fork — API-compatible with ES 7.x, Apache 2.0 licensed, security included free. Choose for cloud-native deployments and cost sensitivity.
•Loki trades flexibility for cost — 10-50x cheaper by indexing only labels. Perfect for Kubernetes environments with Grafana already deployed.
•Index strategy is critical — Time-based indexes, ILM policies, appropriate shard sizing directly impact performance and cost.
•Monitor the monitor — Use separate metrics to watch your logging infrastructure. It can't report its own failures.
•Query design matters during incidents — Prepare efficient queries in advance. Time filters + keyword filters + limited results = fast debugging.

What's next:

Page Complete

3 / 5