Design a centralized logging and log aggregation system (in the spirit of the ELK stack of Elasticsearch, Logstash, and Kibana, or of Grafana Loki) that collects logs from thousands of services, indexes them for fast full-text and structured search, supports real-time log tailing, provides aggregation analytics, implements tiered storage for cost-effective retention, and integrates with the broader observability ecosystem (metrics and traces).
| Metric | Value |
|---|---|
| Log lines ingested per second | 1+ million |
| Raw log volume per day | ~10 TB (at ~120 bytes/line) |
| Services / hosts emitting logs | 10,000+ |
| Elasticsearch index size (7-day hot) | ~70–100 TB |
| Total stored data (30-day warm) | ~300 TB |
| Long-term archive (S3, 1 year) | ~3.8 PB (uncompressed) / ~550 TB (compressed, ~7:1) |
| Search queries per second | 10,000 |
| Live tail concurrent sessions | 500 |
| Search latency (recent data, p95) | < 2 seconds |
| Search latency (30-day historical) | < 10 seconds |
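As a sanity check on these targets, a quick back-of-the-envelope calculation, assuming an average log line of about 120 bytes (an assumed figure; real per-line sizes vary widely by service):

```python
# Back-of-the-envelope sizing, assuming ~120 bytes per log line on average
# (an assumption; adjust for your own measured line sizes).
LINES_PER_SECOND = 1_000_000
AVG_LINE_BYTES = 120
SECONDS_PER_DAY = 86_400
COMPRESSION_RATIO = 7            # assumed for the S3/Glacier archive tier

raw_tb_per_day = LINES_PER_SECOND * AVG_LINE_BYTES * SECONDS_PER_DAY / 1e12

print(f"raw per day:            ~{raw_tb_per_day:.0f} TB")        # ~10 TB
print(f"7-day hot tier (raw):   ~{raw_tb_per_day * 7:.0f} TB")    # ~73 TB before index overhead
print(f"30-day warm tier (raw): ~{raw_tb_per_day * 30:.0f} TB")   # ~311 TB
print(f"1-year archive:         ~{raw_tb_per_day * 365:.0f} TB "
      f"(~{raw_tb_per_day * 365 / COMPRESSION_RATIO:.0f} TB compressed)")
```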
Log ingestion: accept structured and unstructured logs from thousands of services/hosts at a rate of 1+ million log lines per second; each log has: timestamp, service_name, severity (DEBUG/INFO/WARN/ERROR/FATAL), message, and arbitrary key-value metadata
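A minimal sketch of what a normalized ingest record could look like; the field names and validation rules below are illustrative, not a fixed wire format:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class LogEvent:
    """One normalized log line as accepted by the ingest API (illustrative schema)."""
    timestamp_ms: int                      # event time, epoch milliseconds
    service_name: str                      # e.g. "payments"
    severity: str                          # DEBUG | INFO | WARN | ERROR | FATAL
    message: str                           # raw or parsed log message
    metadata: dict[str, Any] = field(default_factory=dict)  # arbitrary key-value pairs

ALLOWED_SEVERITIES = {"DEBUG", "INFO", "WARN", "ERROR", "FATAL"}

def validate(event: LogEvent) -> None:
    """Reject malformed events at the edge so bad data never reaches the indexers."""
    if event.severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {event.severity}")
    if not event.service_name:
        raise ValueError("service_name is required")
```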
Full-text search: given a query string, find all matching log lines across all services and time ranges; support keyword search, phrase search, wildcard, regex, and boolean operators (AND/OR/NOT); results returned within seconds over TB-scale data
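Assuming the hot tier is backed by Elasticsearch, one way to express keyword, phrase, wildcard, and boolean search in a single request is the query_string query; the endpoint, index pattern, and field names below are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Phrase, wildcard, and boolean operators in one query_string expression.
resp = es.search(
    index="logs-*",
    query={
        "query_string": {
            "query": '("connection refused" OR timeout*) AND NOT healthcheck',
            "default_field": "message",
        }
    },
    sort=[{"@timestamp": "desc"}],
    size=50,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))
```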
Structured field filtering: filter logs by structured fields (service=auth, severity=ERROR, region=us-east, request_id=abc123); combine filters with full-text search; fast filtering via indexed fields
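Structured filters can ride alongside full-text search as non-scoring filter clauses of a bool query. A sketch under the same Elasticsearch assumption, with illustrative field names; the dict would be passed as the query body of a search request:

```python
# Exact-match filters on indexed fields combined with a full-text clause.
query = {
    "bool": {
        "must": [
            {"match": {"message": "deadline exceeded"}}        # full-text part
        ],
        "filter": [                                            # indexed, cacheable filters
            {"term": {"service_name": "auth"}},
            {"term": {"severity": "ERROR"}},
            {"term": {"region": "us-east"}},
            {"term": {"request_id": "abc123"}},
        ],
    }
}
```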
Time-range queries: all queries scoped to a time range (e.g., 'last 1 hour', 'March 15 2pm–3pm'); time is the primary query dimension; recent data (< 24h) queries return in < 2 seconds; historical queries (30+ days) acceptable at 5–10 seconds
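The "last 1 hour" scope then becomes an explicit range clause attached to every query; a sketch assuming each indexed document carries an @timestamp date field:

```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)

# "last 1 hour" expressed as the range clause appended to every query's filters.
time_range_filter = {
    "range": {
        "@timestamp": {
            "gte": (now - timedelta(hours=1)).isoformat(),
            "lte": now.isoformat(),
        }
    }
}
```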
Log aggregation and analytics: compute aggregations over log data — count of ERROR logs per service per hour, top 10 most frequent error messages, time-series chart of log volume; support GROUP BY, COUNT, AVG, percentile on log fields
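Under the same Elasticsearch assumption, the example aggregations map onto a date_histogram bucketed by service plus a terms aggregation; note that the top-messages part assumes a keyword-indexed (or fingerprint-normalized) copy of the message field, which is an assumption about the mapping:

```python
# ERROR volume per hour broken down by service, plus the 10 most frequent
# error messages; sent as the query/aggs body of a search with size=0.
error_query = {"term": {"severity": "ERROR"}}
aggs = {
    "errors_per_hour": {
        "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"},
        "aggs": {
            "per_service": {"terms": {"field": "service_name", "size": 50}}
        },
    },
    # Assumes a keyword (or fingerprinted) copy of the message field exists.
    "top_error_messages": {"terms": {"field": "message.keyword", "size": 10}},
}
```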
Log tailing (live tail): stream new logs matching a filter in real-time to the user's browser; like 'tail -f' across all services; useful for debugging in-progress issues; < 3 second latency from log emission to display
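A deliberately simple, poll-based sketch of live tail; a production version would more likely push matching events from the ingest stream over WebSockets rather than re-query the index:

```python
import time

def live_tail(es, filters: list[dict], poll_seconds: float = 1.0):
    """Yield new log documents matching `filters` as they arrive.

    Poll-based sketch against the search index; field names and index
    pattern are assumptions carried over from the earlier examples.
    """
    last_seen_ms = int(time.time() * 1000)
    while True:
        query = {
            "bool": {
                "filter": filters + [{"range": {"@timestamp": {"gt": last_seen_ms}}}]
            }
        }
        resp = es.search(index="logs-*", query=query,
                         sort=[{"@timestamp": "asc"}], size=1000)
        for hit in resp["hits"]["hits"]:
            last_seen_ms = hit["sort"][0]   # date sort values come back as epoch millis
            yield hit["_source"]
        time.sleep(poll_seconds)
```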
Alerting on logs: define alert rules on log patterns (e.g., 'alert if > 100 ERROR logs in 5 minutes from service=payments'); alert evaluation on streaming log data; notify via Slack/PagerDuty/email
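A streaming threshold rule can be evaluated with a sliding window over matching events; the class below is a sketch of that evaluation, not a full rules engine:

```python
import time
from collections import deque

class ThresholdRule:
    """Fire when more than `threshold` matching logs arrive within `window_seconds`.

    Sketch of streaming evaluation; a real evaluator would run inside the
    ingest pipeline and de-duplicate notifications before paging anyone.
    """

    def __init__(self, service: str, severity: str, threshold: int, window_seconds: int):
        self.service = service
        self.severity = severity
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._hits: deque = deque()          # arrival times of matching events

    def observe(self, event: dict) -> bool:
        """Feed one log event; return True if the rule should fire now."""
        now = time.time()
        if event.get("service_name") == self.service and event.get("severity") == self.severity:
            self._hits.append(now)
        while self._hits and self._hits[0] < now - self.window_seconds:
            self._hits.popleft()
        return len(self._hits) > self.threshold

# 'alert if > 100 ERROR logs in 5 minutes from service=payments'
rule = ThresholdRule(service="payments", severity="ERROR", threshold=100, window_seconds=300)
```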
Log retention and tiering: configurable retention per service/team — hot tier (fast SSD, 7 days), warm tier (lower cost, 30 days), cold/archive tier (S3/Glacier, 1+ years); automatic data lifecycle management
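If the hot and warm tiers live in Elasticsearch, the lifecycle can be expressed as an ILM policy; the phase boundaries and snapshot repository name below are illustrative:

```python
# Illustrative Elasticsearch ILM policy: 7 days hot (SSD), warm until day 30,
# then S3-backed searchable snapshots, with deletion after one year.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_size": "50gb", "max_age": "1d"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    "allocate": {"require": {"data": "warm"}},
                    "forcemerge": {"max_num_segments": 1},
                },
            },
            "cold": {
                "min_age": "30d",
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "s3-log-archive"}
                },
            },
            "delete": {
                "min_age": "365d",
                "actions": {"delete": {}}
            },
        }
    }
}
```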
Log parsing and enrichment: parse unstructured log lines (e.g., Apache access logs, application stack traces) into structured fields using configurable parsers (grok, regex, JSON); enrich with metadata (hostname → data centre, IP → geolocation)
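A sketch of a regex-based parser for Apache combined access logs plus a simple hostname-to-data-centre enrichment step; the regex and lookup table are illustrative, and GeoIP enrichment is omitted:

```python
import re

# Apache "combined" access log parser (regex is illustrative, not exhaustive).
APACHE_COMBINED = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_and_enrich(raw_line: str, host_to_dc: dict[str, str], hostname: str) -> dict:
    """Turn an unstructured access-log line into structured fields, then enrich
    with deployment metadata (hostname -> data centre)."""
    match = APACHE_COMBINED.match(raw_line)
    fields = match.groupdict() if match else {"message": raw_line, "parse_error": True}
    fields["hostname"] = hostname
    fields["datacentre"] = host_to_dc.get(hostname, "unknown")
    return fields

line = '203.0.113.7 - - [15/Mar/2024:14:02:11 +0000] "GET /api/v1/users HTTP/1.1" 500 1042'
print(parse_and_enrich(line, {"web-42": "us-east-1a"}, "web-42"))
```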
Multi-tenancy and access control: multiple teams share the platform; each team sees only their services' logs; per-team storage quotas and rate limits; RBAC for log access (some logs contain PII — restricted to authorised personnel)
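One common enforcement pattern is to rewrite every query server-side so tenant and PII filters cannot be bypassed; a sketch with hypothetical team metadata:

```python
def scope_query_to_tenant(user_query: dict, team: dict) -> dict:
    """Wrap the caller's query with non-negotiable tenant filters (sketch).

    `team` is assumed to carry the services the team owns and whether it is
    authorised to read logs tagged as containing PII.
    """
    mandatory_filters = [{"terms": {"service_name": team["services"]}}]
    if not team.get("pii_access", False):
        # Exclude documents tagged as PII at ingest time.
        mandatory_filters.append({"bool": {"must_not": {"term": {"tags": "pii"}}}})
    return {"bool": {"must": [user_query], "filter": mandatory_filters}}

payments_team = {"services": ["payments", "billing"], "pii_access": False}
scoped = scope_query_to_tenant({"match": {"message": "card declined"}}, payments_team)
```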
Non-functional requirements define the system qualities critical to your users. Frame them as 'The system should be able to...' statements. These will guide your deep dives later.
Think about CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
Frame NFRs for this specific system. 'P95 search latency under 2 seconds over the last 24 hours of logs' is far more valuable than just 'low latency'.
Add concrete numbers: 'P99 response time < 500ms', '99.9% availability', '10M DAU'. This drives architectural decisions.
Choose the 3-5 most critical NFRs. Every system should be 'scalable', but what makes THIS system's scaling uniquely challenging?