In monolithic applications, logs lived in /var/log/application.log. When something broke, you SSH'd into the server and ran grep. This doesn't scale.
Modern distributed systems span hundreds of microservices across thousands of containers that may live for minutes. Logs are ephemeral—when a container dies, its logs vanish. Debugging requires correlating events across dozens of services simultaneously. Manual log inspection becomes impossible.
Log aggregation solves this by collecting logs from all sources into a centralized, searchable repository. Query logs from 500 services with a single search. Reconstruct request flows across microservices. Alert on patterns across your entire fleet. This is the foundation of operational visibility at scale.
By the end of this page, you'll understand log aggregation architecture patterns, master the ELK Stack (Elasticsearch, Logstash, Kibana), know when to choose OpenSearch over Elasticsearch, understand Grafana Loki's different approach, and be able to design production-ready log aggregation systems.
All log aggregation systems share a common architectural pattern, regardless of which specific technologies you choose:
The Universal Pipeline:
| Stage | Purpose | Common Technologies |
|---|---|---|
| Generation | Create log entries | Application code, frameworks, system services |
| Collection | Gather and forward logs from sources | Filebeat, Fluent Bit, Fluentd, Vector, Promtail |
| Transport | Buffer logs and deliver them reliably | Kafka, Redis, RabbitMQ, direct HTTP/gRPC |
| Processing | Parse, enrich, filter, transform | Logstash, Fluentd, Vector, Kafka Streams |
| Storage | Index for fast querying | Elasticsearch, OpenSearch, Loki, ClickHouse |
| Visualization | Query UI and dashboards | Kibana, Grafana, OpenSearch Dashboards |
```text
APPLICATION LAYER
  Services A-E emitting structured logs to stdout/stderr
        │
        ▼
COLLECTION LAYER: Filebeat / Fluent Bit agent per node
  - Tail container stdout/stderr
  - Add metadata (pod, namespace, container ID)
  - Buffer locally, forward with backpressure handling
        │
        ▼
BUFFERING LAYER (optional): Kafka / Redis / message queue
  - Decouples collection from processing
  - Handles traffic spikes
  - Enables replay for reprocessing
        │
        ▼
PROCESSING LAYER: Logstash / Vector / stream processor
  - Parse and validate JSON
  - Enrich with additional context
  - Route to appropriate indexes
  - Filter sensitive data
        │
        ▼
STORAGE LAYER: Elasticsearch / OpenSearch cluster
  - Sharded for horizontal scale
  - Replicated for durability
  - Time-based indexes (logs-2024.01.15)
  - Hot/warm/cold tiering for cost optimization
        │
        ▼
VISUALIZATION LAYER: Kibana / Grafana / OpenSearch Dashboards
  - Full-text search
  - Saved queries and dashboards
  - Alert definition
  - Access control
```

Many architectures skip the buffering layer for simplicity, sending directly from collectors to Elasticsearch. This works until a traffic spike or an Elasticsearch maintenance window causes log loss. Kafka as a buffer enables replay if processing fails, multiple consumers for different purposes, and graceful handling of downstream outages.
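To make the collection and buffering stages concrete, here is a minimal Filebeat sketch that tails container logs, attaches Kubernetes metadata, and publishes to a Kafka topic. The broker hostnames and the `application-logs` topic mirror the Logstash pipeline shown later on this page; treat the rest as illustrative defaults rather than a tuned production config.

```yaml
# filebeat.yml - ship container logs into the Kafka buffering layer
filebeat.inputs:
  - type: filestream
    id: container-logs
    paths:
      - /var/log/containers/*.log
    parsers:
      - container: ~            # decode Docker/CRI container log format

processors:
  - add_kubernetes_metadata: ~  # attach pod, namespace, and container fields

output.kafka:
  hosts: ["kafka-1:9092", "kafka-2:9092", "kafka-3:9092"]
  topic: "application-logs"     # consumed by the Logstash kafka input shown below
  required_acks: 1              # wait for the partition leader to acknowledge
  compression: gzip
```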
The ELK Stack (Elasticsearch, Logstash, Kibana) is the most widely deployed log aggregation solution. Originally open source, Elastic changed licensing in 2021, spawning OpenSearch (covered later). Understanding ELK is foundational regardless of which variant you use.
Three Pillars of ELK:
Elasticsearch — Distributed search and analytics engine built on Apache Lucene. Stores and indexes logs for fast full-text search. Scales horizontally through sharding.
Logstash — Data processing pipeline. Collects, transforms, and routes data from various sources to destinations. Plugin-based architecture supports hundreds of inputs/outputs.
Kibana — Visualization layer. Web UI for searching, visualizing, and analyzing Elasticsearch data. Dashboards, saved searches, and alerting.
| Component | Role | Key Capabilities | Resource Profile |
|---|---|---|---|
| Elasticsearch | Storage + Search | Full-text search, aggregations, distributed clustering | Memory-intensive (heap), SSD-optimized |
| Logstash | Processing | Input plugins, filters, output plugins, codecs | CPU-intensive, memory for buffering |
| Kibana | Visualization | Dashboards, Discover, alerts, RBAC | Low resource, stateless |
| Beats | Collection | Lightweight shippers (Filebeat, Metricbeat) | Minimal footprint per node |
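If you want to experiment with these components before designing a production cluster, a single-node lab stack is enough. The sketch below uses the official Elastic images (the version tag is an assumption) and deliberately disables security, which you would never do outside a sandbox.

```yaml
# docker-compose.yml - throwaway single-node ELK lab, not a production layout
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.13.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false   # lab only; keep security enabled in production
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
    ports:
      - "9200:9200"

  kibana:
    image: docker.elastic.co/kibana/kibana:8.13.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

  logstash:
    image: docker.elastic.co/logstash/logstash:8.13.0
    volumes:
      - ./pipeline:/usr/share/logstash/pipeline   # mount pipeline configs like the one below
    depends_on:
      - elasticsearch
```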
```yaml
# elasticsearch.yml - Data node configuration
cluster.name: production-logs
node.name: es-data-01

# Node roles (Elasticsearch 7.9+)
node.roles: [ data, data_content, data_hot ]

# Network
network.host: 0.0.0.0
http.port: 9200
transport.port: 9300

# Discovery
discovery.seed_hosts:
  - es-master-01.internal:9300
  - es-master-02.internal:9300
  - es-master-03.internal:9300

# Memory: heap should be ~50% of RAM, max 32GB
# Set via ES_JAVA_OPTS: -Xms16g -Xmx16g

# Disk space thresholds
cluster.routing.allocation.disk.watermark.low: 85%
cluster.routing.allocation.disk.watermark.high: 90%
cluster.routing.allocation.disk.watermark.flood_stage: 95%

# Index lifecycle management
action.destructive_requires_name: true

# Security (X-Pack)
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
```
```
# /etc/logstash/conf.d/logs-pipeline.conf

input {
  beats {
    port => 5044
    ssl => true
    ssl_certificate => "/etc/logstash/ssl/logstash.crt"
    ssl_key => "/etc/logstash/ssl/logstash.key"
  }

  # Alternative: direct Kafka consumption
  kafka {
    bootstrap_servers => "kafka-1:9092,kafka-2:9092,kafka-3:9092"
    topics => ["application-logs"]
    group_id => "logstash-consumers"
    codec => json
  }
}

filter {
  # Parse JSON body if not already parsed
  if [message] =~ /^\{/ {
    json {
      source => "message"
      target => "parsed"
    }

    # Move parsed fields to top level if parsing succeeded
    if "_jsonparsefailure" not in [tags] {
      mutate {
        rename => {
          "[parsed][timestamp]" => "timestamp"
          "[parsed][level]"     => "level"
          "[parsed][service]"   => "service"
        }
        remove_field => [ "message", "parsed" ]
      }
    }
  }

  # Parse timestamp
  date {
    match => [ "timestamp", "ISO8601" ]
    target => "@timestamp"
    remove_field => [ "timestamp" ]
  }

  # Add processing metadata
  mutate {
    add_field => {
      "processed_at"     => "%{+ISO8601}"
      "pipeline_version" => "2.1.0"
    }
  }

  # GeoIP enrichment for client IPs
  if [client_ip] {
    geoip {
      source => "client_ip"
      target => "geoip"
    }
  }

  # Drop DEBUG logs in production (optional)
  if [level] == "DEBUG" {
    drop { }
  }

  # Redact sensitive fields
  mutate {
    gsub => [
      "password", ".+", "[REDACTED]",
      "credit_card", "\d{12}(\d{4})", "************\1"
    ]
  }
}

output {
  elasticsearch {
    hosts => ["https://es-data-01:9200", "https://es-data-02:9200"]
    index => "logs-%{[service]}-%{+YYYY.MM.dd}"
    user => "logstash_writer"
    password => "${LOGSTASH_ES_PASSWORD}"
    ssl => true
    cacert => "/etc/logstash/ssl/ca.crt"
  }

  # Dead letter queue for failed documents
  if "_jsonparsefailure" in [tags] {
    file {
      path => "/var/log/logstash/dlq/%{+YYYY-MM-dd}.json"
    }
  }
}
```

For simple transformations, Elasticsearch's built-in Ingest Pipelines can replace Logstash, reducing operational complexity. Use Logstash when you need complex routing, external enrichment lookups, or multi-destination output. For straightforward JSON log parsing, Ingest Pipelines are often sufficient.
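As a point of comparison, here is a sketch of an Ingest Pipeline that covers the core of the Logstash filter above (JSON parsing plus timestamp handling). The pipeline name is arbitrary; you would reference it via the `pipeline` request parameter at index time or the `index.default_pipeline` setting.

```
PUT _ingest/pipeline/app-logs-json
{
  "description": "Parse JSON log lines without Logstash",
  "processors": [
    { "json":   { "field": "message", "add_to_root": true } },
    { "date":   { "field": "timestamp", "formats": ["ISO8601"], "target_field": "@timestamp" } },
    { "remove": { "field": ["message", "timestamp"], "ignore_missing": true } }
  ]
}
```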
Elasticsearch indexing configuration dramatically impacts query performance, storage costs, and operational complexity. Production log systems require careful index design.
Time-Based Indices
The dominant pattern for logs is time-based indexing: `logs-2024.01.15`. Each day (or hour, for high volume) gets a new index. This enables cheap retention enforcement (drop an entire index instead of deleting documents), lifecycle management keyed to index age, and queries that touch only the indices covering the requested time range, as sketched below.
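A brief illustration of why whole-index expiry matters: dropping a daily index is a near-instant metadata operation, while deleting documents out of a shared index rewrites segments. The index name here follows the `logs-<service>-<date>` pattern used by the Logstash output shown earlier.

```
# Expire old data by dropping the whole daily index (cheap, near-instant)
DELETE /logs-payment-service-2024.01.15

# Versus deleting documents out of a shared index (slow, heavy merge cost)
POST /logs/_delete_by_query
{
  "query": { "range": { "@timestamp": { "lt": "now-90d" } } }
}
```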
{ "index_patterns": ["logs-*"], "priority": 100, "template": { "settings": { "number_of_shards": 3, "number_of_replicas": 1, "codec": "best_compression", "refresh_interval": "30s", "index.lifecycle.name": "logs-lifecycle-policy", "index.lifecycle.rollover_alias": "logs-write", "index.translog.durability": "async", "index.translog.sync_interval": "5s" }, "mappings": { "properties": { "@timestamp": { "type": "date" }, "level": { "type": "keyword" }, "logger": { "type": "keyword" }, "service": { "type": "keyword" }, "version": { "type": "keyword" }, "environment": { "type": "keyword" }, "trace_id": { "type": "keyword" }, "span_id": { "type": "keyword" }, "message": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "error": { "properties": { "type": { "type": "keyword" }, "message": { "type": "text" }, "stack_trace": { "type": "text", "index": false } } }, "http": { "properties": { "method": { "type": "keyword" }, "status_code": { "type": "short" }, "path": { "type": "keyword" }, "latency_ms": { "type": "integer" } } }, "user_id": { "type": "keyword" }, "request_id": { "type": "keyword" } }, "dynamic_templates": [ { "strings_as_keywords": { "match_mapping_type": "string", "mapping": { "type": "keyword", "ignore_above": 256 } } } ] } }}level, service, user_id should be keyword type for exact matching and aggregations. Reserve text for fields needing full-text search.index: false for large, rarely-searched fields like stack traces. Reduces index size significantly.123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657
{ "policy": { "phases": { "hot": { "min_age": "0ms", "actions": { "rollover": { "max_primary_shard_size": "50gb", "max_age": "1d" }, "set_priority": { "priority": 100 } } }, "warm": { "min_age": "7d", "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 }, "allocate": { "require": { "data": "warm" } }, "set_priority": { "priority": 50 } } }, "cold": { "min_age": "30d", "actions": { "allocate": { "require": { "data": "cold" } }, "freeze": {}, "set_priority": { "priority": 0 } } }, "delete": { "min_age": "90d", "actions": { "delete": {} } } } }}Elasticsearch performance degrades with too many small shards or too few large shards. Target 20-50GB per shard. At 1TB/day ingest, you need ~25 primary shards across daily indices. Calculate: (daily_volume ÷ target_shard_size) = shards_per_day.
In January 2021, Elastic changed Elasticsearch's license from Apache 2.0 to Server Side Public License (SSPL). AWS and the open-source community responded by forking Elasticsearch 7.10.2 into OpenSearch, maintained by AWS under Apache 2.0.
Why This Matters:
The license change means cloud providers cannot offer Elasticsearch as a managed service without restrictions. OpenSearch provides a truly open-source alternative with continued development and AWS-backed support.
OpenSearch vs Elasticsearch:
| Aspect | Elasticsearch | OpenSearch |
|---|---|---|
| License | Dual-licensed: SSPL / Elastic License 2.0 (source-available) | Apache 2.0 (truly open source) |
| Cloud Offerings | Elastic Cloud (vendor-managed) | AWS OpenSearch, self-managed |
| API Compatibility | Original API, evolving independently | Compatible with ES 7.x APIs, diverging over time |
| Security Features | X-Pack (some features require licensing) | Security built-in and free |
| Alerting | Elastic Alerting (requires license) | Alerting plugin included free |
| Community | Elastic-controlled | Community-governed, AWS-supported |
| Feature Development | Proprietary roadmap | Open roadmap, community input |
```typescript
// AWS CDK (TypeScript) - OpenSearch domain for logs
const domain = new opensearch.Domain(this, 'LogsDomain', {
  version: opensearch.EngineVersion.OPENSEARCH_2_11,
  capacity: {
    dataNodes: 3,
    dataNodeInstanceType: 'r6g.large.search',
    masterNodes: 3,
    masterNodeInstanceType: 'm6g.large.search',
  },
  ebs: {
    volumeSize: 500,
    volumeType: ec2.EbsDeviceVolumeType.GP3,
  },
  nodeToNodeEncryption: true,
  encryptionAtRest: {
    enabled: true,
  },
  vpc: vpc,
  vpcSubnets: [{ subnetType: ec2.SubnetType.PRIVATE_ISOLATED }],
  zoneAwareness: {
    enabled: true,
    availabilityZoneCount: 3,
  },
  logging: {
    slowSearchLogEnabled: true,
    appLogEnabled: true,
    slowIndexLogEnabled: true,
  },
  fineGrainedAccessControl: {
    masterUserName: 'admin',
    // masterUserPassword expects a SecretValue, hence .secretValue
    masterUserPassword: secretsManager.Secret.fromSecretNameV2(...).secretValue,
  },
});
```

Migrating from Elasticsearch 7.x to OpenSearch is generally straightforward: snapshot and restore works across the fork. ES 8.x to OpenSearch requires more effort due to diverging APIs. Plan migration during the 7.x window if possible.
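A rough sketch of that migration path, assuming a shared S3 snapshot repository (the repository and bucket names are placeholders, and the S3 repository plugin must be available on both clusters): snapshot on the Elasticsearch 7.x cluster, register the same repository on OpenSearch, then restore.

```
# On the Elasticsearch 7.x cluster: register the repository and snapshot the log indices
PUT _snapshot/migration_repo
{
  "type": "s3",
  "settings": { "bucket": "logs-cluster-snapshots" }
}

PUT _snapshot/migration_repo/pre-migration?wait_for_completion=true
{
  "indices": "logs-*",
  "include_global_state": false
}

# On the OpenSearch cluster: register the same repository, then restore
POST _snapshot/migration_repo/pre-migration/_restore
{
  "indices": "logs-*",
  "include_global_state": false
}
```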
Grafana Loki takes a fundamentally different approach to log aggregation. While Elasticsearch indexes the full content of every log line (enabling arbitrary searches), Loki indexes only labels (metadata) and stores log content as compressed chunks. This design trades some flexibility for dramatic cost reduction.
The Loki Philosophy:
"Logs are like metrics, but have rich content."
Loki is designed for Kubernetes-native environments where structured labels (namespace, pod, container) provide sufficient filtering for most queries. You query by label selectors, then filter log content within matching streams.
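Those labels come from the collector. The trimmed Promtail sketch below shows Kubernetes service discovery mapping pod metadata onto the `namespace`, `app`, and `container` labels that the LogQL queries that follow select on; the Loki endpoint is a placeholder, and positions/path-discovery boilerplate is omitted.

```yaml
# promtail-config.yaml - attach the Kubernetes labels that LogQL selects on
clients:
  - url: http://loki-gateway:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_pod_container_name]
        target_label: container
      # __path__ discovery and positions file omitted for brevity
```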
```logql
# Basic label selector (like Prometheus)
{namespace="production", app="payment-service"}

# Filter by log level using a label
{namespace="production", level="error"}

# Line filter on log content (substring match, after label selection)
{namespace="production", app="payment-service"} |= "payment_failed"

# Negative filter (exclude matching lines)
{namespace="production"} != "healthcheck"

# JSON parsing for structured logs
{namespace="production", app="order-service"} | json | user_id="usr_a1b2c3"

# Aggregation (like Prometheus)
sum(rate({namespace="production", level="error"}[5m])) by (app)

# Pattern extraction for metrics
{namespace="production"} | pattern "<timestamp> <level> <_> latency=<latency>ms" | latency > 500
```

| Scenario | Loki | Elasticsearch |
|---|---|---|
| Kubernetes-native environment | ✅ Excellent fit | ✅ Works but heavier |
| Need arbitrary full-text search | ❌ Not indexed | ✅ Built for this |
| Cost is primary concern | ✅ 10-50x cheaper | ❌ Expensive at scale |
| Already have Grafana+Prometheus | ✅ Perfect integration | ⚠️ Requires Kibana or additional config |
| Need complex aggregations | ⚠️ Limited to label aggregations | ✅ Powerful aggregation DSL |
| Log analytics/ML workloads | ❌ Not designed for this | ✅ Strong analytics capabilities |
| Compliance requiring full indexing | ❌ Doesn't index content | ✅ Full indexing |
Many organizations use both: Loki for high-volume operational logs (DEBUG, INFO) where label-based querying suffices, and Elasticsearch for audit logs and low-volume ERROR logs requiring full-text search. Route different log types to different backends based on requirements.
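One way to implement that split, sketched with Fluent Bit (the field name, hosts, and the ERROR-matching rule are assumptions about your log schema): re-tag error-level records and point each tag at a different backend.

```
[INPUT]
    Name      tail
    Path      /var/log/containers/*.log
    Tag       kube.*
    Parser    cri

# Lift parsed JSON fields (such as level) into the record
[FILTER]
    Name       kubernetes
    Match      kube.*
    Merge_Log  On

# Re-tag error records so they leave the kube.* stream (keep=false)
[FILTER]
    Name       rewrite_tag
    Match      kube.*
    Rule       $level ^(ERROR|error)$ audit.$TAG false

# High-volume DEBUG/INFO stays in Loki
[OUTPUT]
    Name       loki
    Match      kube.*
    Host       loki-gateway
    Port       3100
    Labels     job=fluent-bit

# Low-volume errors get full-text indexing in Elasticsearch
[OUTPUT]
    Name                es
    Match               audit.*
    Host                elasticsearch
    Port                9200
    Index               logs-error
    Suppress_Type_Name  On
```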
Log aggregation systems are critical infrastructure—when they fail, debugging becomes blind. Operating these systems at scale requires attention to specific challenges:
Capacity Planning:
Log volume grows with application traffic, new services, and debugging efforts. Plan for 20-50% annual growth plus spikes during incidents (when everyone adds logging).
| Scale | Daily Ingest | Elasticsearch Cluster | Monthly Cost (Rough) |
|---|---|---|---|
| Small | 50GB/day | 3 data nodes, 3 masters | $1,000-2,000 |
| Medium | 500GB/day | 6-9 data nodes, 3 masters | $5,000-10,000 |
| Large | 5TB/day | 20-30 data nodes, 3 dedicated masters | $25,000-50,000 |
| Enterprise | 50TB/day | 100+ data nodes, multi-cluster | $200,000+ |
```yaml
groups:
- name: elasticsearch
  rules:
  # Cluster health not green
  - alert: ElasticsearchClusterHealthYellow
    expr: elasticsearch_cluster_health_status{color="yellow"} == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch cluster health is yellow"
      description: "Some replicas are not allocated. Check node health."

  - alert: ElasticsearchClusterHealthRed
    expr: elasticsearch_cluster_health_status{color="red"} == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Elasticsearch cluster health is RED"
      description: "Primary shards are unassigned. Data may be unavailable."

  # Disk space
  - alert: ElasticsearchDiskSpaceLow
    expr: elasticsearch_filesystem_data_available_bytes / elasticsearch_filesystem_data_size_bytes < 0.15
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch disk space low on {{ $labels.node }}"

  # Ingest rate drop (may indicate collector failure)
  - alert: ElasticsearchIngestRateDrop
    expr: rate(elasticsearch_indices_indexing_index_total[5m]) < 0.5 * avg_over_time(rate(elasticsearch_indices_indexing_index_total[5m])[1h:5m])
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch ingest rate dropped significantly"
      description: "May indicate log collector failures or network issues."

  # JVM heap pressure
  - alert: ElasticsearchJVMHeapHigh
    expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.85
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Elasticsearch JVM heap usage high on {{ $labels.node }}"
```

Elasticsearch limits fields per index (default: 1000). Dynamic mapping plus untrusted input is a recipe for disaster: if an application logs user-provided data as field names (a JSON body with arbitrary keys), the field count explodes, the cluster rejects documents, and logs are lost. Prevent this with strict schemas and input validation.
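One sketch of those guardrails in mapping terms (the template name and priority are placeholders): cap the field count, reject unknown top-level fields, and park free-form payloads in a sub-object that is stored but not dynamically mapped.

```
PUT _index_template/logs-guardrails
{
  "index_patterns": ["logs-*"],
  "priority": 200,
  "template": {
    "settings": {
      "index.mapping.total_fields.limit": 1000
    },
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "payload": { "type": "object", "dynamic": false }
      }
    }
  }
}
```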
Efficient querying is essential for production incident response. Slow queries during outages compound the problem. Understanding query performance helps you design for fast debugging.
Query Performance Principles:
- Exact matches on `keyword` fields are effectively O(1) via term queries; `text` fields require analysis and are slower.
- Leading wildcards like `*error*` scan all terms; prefix patterns like `error*` use the term index efficiently.
- Keep result sets small: `size: 10` is vastly faster than `size: 10000`.
- For breakdowns, use a `terms` aggregation rather than fetching all matching documents.
```
// ❌ SLOW: Full text search without filters
{
  "query": {
    "match": { "message": "payment failed" }
  }
}

// ✅ FAST: Time + keyword filters narrow scope first
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "payment failed" } }
      ],
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1h" } } },
        { "term": { "level": "error" } },
        { "term": { "service": "payment-service" } }
      ]
    }
  },
  "size": 100,
  "sort": [{ "@timestamp": "desc" }]
}

// Aggregation for error breakdown (no document fetch)
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "now-24h" } } },
        { "term": { "level": "error" } }
      ]
    }
  },
  "aggs": {
    "errors_by_service": {
      "terms": { "field": "service", "size": 20 },
      "aggs": {
        "error_types": {
          "terms": { "field": "error.type", "size": 10 }
        }
      }
    }
  }
}

// Get a specific trace (fast: keyword exact match)
{
  "query": {
    "term": { "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736" }
  },
  "sort": [{ "@timestamp": "asc" }]
}
```

| Query Type | Relative Speed | Use Case |
|---|---|---|
| term (keyword) | Fastest | Exact ID lookup, service/level filtering |
| range (timestamp) | Very Fast | Time-based filtering (always use) |
| bool with filters | Fast | Combining multiple criteria |
| match (analyzed text) | Moderate | Full-text search within narrowed scope |
| wildcard (prefix) | Slow | Pattern matching on log content |
| wildcard (leading) | Very Slow | Avoid: scans all terms |
| regex | Slowest | Use only as last resort on small result sets |
During calm periods, create and save queries you'll need during incidents: recent errors by service, trace lookup by ID, error rate over time. When the incident hits, you'll have fast, tested queries ready instead of constructing them under pressure.
Log aggregation transforms scattered, ephemeral logs into a centralized, searchable source of system truth. The choice of technology impacts cost, capability, and operational burden.
Key Takeaways:
- Every log aggregation pipeline follows the same stages: generation, collection, transport, processing, storage, and visualization.
- The ELK Stack remains the most widely deployed option; OpenSearch is an Apache 2.0 fork that keeps ES 7.x compatibility.
- Grafana Loki indexes only labels, trading arbitrary full-text search for dramatically lower cost.
- Index design (keyword vs text mappings, time-based indices, ILM, 20-50GB shards) drives both query performance and storage cost.
- Filter by time and keyword fields before full-text matching, and prepare saved queries before incidents happen.
What's next:
Log aggregation stores your logs, but without thoughtful retention policies, storage costs explode. The next page covers log retention and cost management—balancing debugging capability, compliance requirements, and budget constraints.
You now understand log aggregation architecture and can evaluate ELK, OpenSearch, and Loki for your needs. You can design Elasticsearch indexing strategies and query patterns for efficient debugging. Next, we'll tackle the business side: retention policies and cost optimization.