Design a centralized logging and log aggregation system (in the spirit of the ELK stack of Elasticsearch, Logstash, and Kibana, or of Grafana Loki) that collects logs from thousands of services, indexes them for fast full-text and structured search, supports real-time log tailing, provides aggregation analytics, implements tiered storage for cost-effective retention, and integrates with the broader observability ecosystem (metrics and traces).
| Metric | Value |
|---|---|
| Log lines ingested per second | 1+ million |
| Raw log volume per day | ~10 TB (at ~120 bytes/line) |
| Services / hosts emitting logs | 10,000+ |
| Elasticsearch index size (7-day hot) | ~70–100 TB |
| Total stored data (30-day warm) | ~300 TB |
| Long-term archive (S3, 1 year) | ~3.8 PB (uncompressed) / ~550 TB (compressed, ~7:1) |
| Search queries per second | 10,000 |
| Live tail concurrent sessions | 500 |
| Search latency (recent data, p95) | < 2 seconds |
| Search latency (30-day historical) | < 10 seconds |
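As a sanity check on these targets, a quick back-of-the-envelope calculation, assuming an average log line of about 120 bytes (an assumed figure; real per-line sizes vary widely by service):

```python
# Back-of-the-envelope sizing, assuming ~120 bytes per log line on average
# (an assumption; adjust for your own measured line sizes).
LINES_PER_SECOND = 1_000_000
AVG_LINE_BYTES = 120
SECONDS_PER_DAY = 86_400
COMPRESSION_RATIO = 7            # assumed for the S3/Glacier archive tier

raw_tb_per_day = LINES_PER_SECOND * AVG_LINE_BYTES * SECONDS_PER_DAY / 1e12

print(f"raw per day:            ~{raw_tb_per_day:.0f} TB")        # ~10 TB
print(f"7-day hot tier (raw):   ~{raw_tb_per_day * 7:.0f} TB")    # ~73 TB before index overhead
print(f"30-day warm tier (raw): ~{raw_tb_per_day * 30:.0f} TB")   # ~311 TB
print(f"1-year archive:         ~{raw_tb_per_day * 365:.0f} TB "
      f"(~{raw_tb_per_day * 365 / COMPRESSION_RATIO:.0f} TB compressed)")
```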
Log ingestion: accept structured and unstructured logs from thousands of services/hosts at a rate of 1+ million log lines per second; each log has: timestamp, service_name, severity (DEBUG/INFO/WARN/ERROR/FATAL), message, and arbitrary key-value metadata
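A minimal sketch of what a normalized ingest record could look like; the field names and validation rules below are illustrative, not a fixed wire format:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class LogEvent:
    """One normalized log line as accepted by the ingest API (illustrative schema)."""
    timestamp_ms: int                      # event time, epoch milliseconds
    service_name: str                      # e.g. "payments"
    severity: str                          # DEBUG | INFO | WARN | ERROR | FATAL
    message: str                           # raw or parsed log message
    metadata: dict[str, Any] = field(default_factory=dict)  # arbitrary key-value pairs

ALLOWED_SEVERITIES = {"DEBUG", "INFO", "WARN", "ERROR", "FATAL"}

def validate(event: LogEvent) -> None:
    """Reject malformed events at the edge so bad data never reaches the indexers."""
    if event.severity not in ALLOWED_SEVERITIES:
        raise ValueError(f"unknown severity: {event.severity}")
    if not event.service_name:
        raise ValueError("service_name is required")
```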
Full-text search: given a query string, find all matching log lines across all services and time ranges; support keyword search, phrase search, wildcard, regex, and boolean operators (AND/OR/NOT); results returned within seconds over TB-scale data
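Assuming the hot tier is backed by Elasticsearch, one way to express keyword, phrase, wildcard, and boolean search in a single request is the query_string query; the endpoint, index pattern, and field names below are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Phrase, wildcard, and boolean operators in one query_string expression.
resp = es.search(
    index="logs-*",
    query={
        "query_string": {
            "query": '("connection refused" OR timeout*) AND NOT healthcheck',
            "default_field": "message",
        }
    },
    sort=[{"@timestamp": "desc"}],
    size=50,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("@timestamp"), hit["_source"].get("message"))
```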
Structured field filtering: filter logs by structured fields (service=auth, severity=ERROR, region=us-east, request_id=abc123); combine filters with full-text search; fast filtering via indexed fields
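Structured filters can ride alongside full-text search as non-scoring filter clauses of a bool query. A sketch under the same Elasticsearch assumption, with illustrative field names; the dict would be passed as the query body of a search request:

```python
# Exact-match filters on indexed fields combined with a full-text clause.
query = {
    "bool": {
        "must": [
            {"match": {"message": "deadline exceeded"}}        # full-text part
        ],
        "filter": [                                            # indexed, cacheable filters
            {"term": {"service_name": "auth"}},
            {"term": {"severity": "ERROR"}},
            {"term": {"region": "us-east"}},
            {"term": {"request_id": "abc123"}},
        ],
    }
}
```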
Time-range queries: all queries scoped to a time range (e.g., 'last 1 hour', 'March 15 2pm–3pm'); time is the primary query dimension; recent data (< 24h) queries return in < 2 seconds; historical queries (30+ days) acceptable at 5–10 seconds
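The "last 1 hour" scope then becomes an explicit range clause attached to every query; a sketch assuming each indexed document carries an @timestamp date field:

```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)

# "last 1 hour" expressed as the range clause appended to every query's filters.
time_range_filter = {
    "range": {
        "@timestamp": {
            "gte": (now - timedelta(hours=1)).isoformat(),
            "lte": now.isoformat(),
        }
    }
}
```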
Log aggregation and analytics: compute aggregations over log data — count of ERROR logs per service per hour, top 10 most frequent error messages, time-series chart of log volume; support GROUP BY, COUNT, AVG, percentile on log fields
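Under the same Elasticsearch assumption, the example aggregations map onto a date_histogram bucketed by service plus a terms aggregation; note that the top-messages part assumes a keyword-indexed (or fingerprint-normalized) copy of the message field, which is an assumption about the mapping:

```python
# ERROR volume per hour broken down by service, plus the 10 most frequent
# error messages; sent as the query/aggs body of a search with size=0.
error_query = {"term": {"severity": "ERROR"}}
aggs = {
    "errors_per_hour": {
        "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"},
        "aggs": {
            "per_service": {"terms": {"field": "service_name", "size": 50}}
        },
    },
    # Assumes a keyword (or fingerprinted) copy of the message field exists.
    "top_error_messages": {"terms": {"field": "message.keyword", "size": 10}},
}
```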
Log tailing (live tail): stream new logs matching a filter in real-time to the user's browser; like 'tail -f' across all services; useful for debugging in-progress issues; < 3 second latency from log emission to display
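A deliberately simple, poll-based sketch of live tail; a production version would more likely push matching events from the ingest stream over WebSockets rather than re-query the index:

```python
import time

def live_tail(es, filters: list[dict], poll_seconds: float = 1.0):
    """Yield new log documents matching `filters` as they arrive.

    Poll-based sketch against the search index; field names and index
    pattern are assumptions carried over from the earlier examples.
    """
    last_seen_ms = int(time.time() * 1000)
    while True:
        query = {
            "bool": {
                "filter": filters + [{"range": {"@timestamp": {"gt": last_seen_ms}}}]
            }
        }
        resp = es.search(index="logs-*", query=query,
                         sort=[{"@timestamp": "asc"}], size=1000)
        for hit in resp["hits"]["hits"]:
            last_seen_ms = hit["sort"][0]   # date sort values come back as epoch millis
            yield hit["_source"]
        time.sleep(poll_seconds)
```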
Alerting on logs: define alert rules on log patterns (e.g., 'alert if > 100 ERROR logs in 5 minutes from service=payments'); alert evaluation on streaming log data; notify via Slack/PagerDuty/email
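A streaming threshold rule can be evaluated with a sliding window over matching events; the class below is a sketch of that evaluation, not a full rules engine:

```python
import time
from collections import deque

class ThresholdRule:
    """Fire when more than `threshold` matching logs arrive within `window_seconds`.

    Sketch of streaming evaluation; a real evaluator would run inside the
    ingest pipeline and de-duplicate notifications before paging anyone.
    """

    def __init__(self, service: str, severity: str, threshold: int, window_seconds: int):
        self.service = service
        self.severity = severity
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._hits: deque = deque()          # arrival times of matching events

    def observe(self, event: dict) -> bool:
        """Feed one log event; return True if the rule should fire now."""
        now = time.time()
        if event.get("service_name") == self.service and event.get("severity") == self.severity:
            self._hits.append(now)
        while self._hits and self._hits[0] < now - self.window_seconds:
            self._hits.popleft()
        return len(self._hits) > self.threshold

# 'alert if > 100 ERROR logs in 5 minutes from service=payments'
rule = ThresholdRule(service="payments", severity="ERROR", threshold=100, window_seconds=300)
```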
Log retention and tiering: configurable retention per service/team — hot tier (fast SSD, 7 days), warm tier (lower cost, 30 days), cold/archive tier (S3/Glacier, 1+ years); automatic data lifecycle management
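If the hot and warm tiers live in Elasticsearch, the lifecycle can be expressed as an ILM policy; the phase boundaries and snapshot repository name below are illustrative:

```python
# Illustrative Elasticsearch ILM policy: 7 days hot (SSD), warm until day 30,
# then S3-backed searchable snapshots, with deletion after one year.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_size": "50gb", "max_age": "1d"}
                }
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    "allocate": {"require": {"data": "warm"}},
                    "forcemerge": {"max_num_segments": 1},
                },
            },
            "cold": {
                "min_age": "30d",
                "actions": {
                    "searchable_snapshot": {"snapshot_repository": "s3-log-archive"}
                },
            },
            "delete": {
                "min_age": "365d",
                "actions": {"delete": {}}
            },
        }
    }
}
```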
Log parsing and enrichment: parse unstructured log lines (e.g., Apache access logs, application stack traces) into structured fields using configurable parsers (grok, regex, JSON); enrich with metadata (hostname → data centre, IP → geolocation)
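A sketch of a regex-based parser for Apache combined access logs plus a simple hostname-to-data-centre enrichment step; the regex and lookup table are illustrative, and GeoIP enrichment is omitted:

```python
import re

# Apache "combined" access log parser (regex is illustrative, not exhaustive).
APACHE_COMBINED = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_and_enrich(raw_line: str, host_to_dc: dict[str, str], hostname: str) -> dict:
    """Turn an unstructured access-log line into structured fields, then enrich
    with deployment metadata (hostname -> data centre)."""
    match = APACHE_COMBINED.match(raw_line)
    fields = match.groupdict() if match else {"message": raw_line, "parse_error": True}
    fields["hostname"] = hostname
    fields["datacentre"] = host_to_dc.get(hostname, "unknown")
    return fields

line = '203.0.113.7 - - [15/Mar/2024:14:02:11 +0000] "GET /api/v1/users HTTP/1.1" 500 1042'
print(parse_and_enrich(line, {"web-42": "us-east-1a"}, "web-42"))
```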
Multi-tenancy and access control: multiple teams share the platform; each team sees only their services' logs; per-team storage quotas and rate limits; RBAC for log access (some logs contain PII — restricted to authorised personnel)
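One common enforcement pattern is to rewrite every query server-side so tenant and PII filters cannot be bypassed; a sketch with hypothetical team metadata:

```python
def scope_query_to_tenant(user_query: dict, team: dict) -> dict:
    """Wrap the caller's query with non-negotiable tenant filters (sketch).

    `team` is assumed to carry the services the team owns and whether it is
    authorised to read logs tagged as containing PII.
    """
    mandatory_filters = [{"terms": {"service_name": team["services"]}}]
    if not team.get("pii_access", False):
        # Exclude documents tagged as PII at ingest time.
        mandatory_filters.append({"bool": {"must_not": {"term": {"tags": "pii"}}}})
    return {"bool": {"must": [user_query], "filter": mandatory_filters}}

payments_team = {"services": ["payments", "billing"], "pii_access": False}
scoped = scope_query_to_tenant({"match": {"message": "card declined"}}, payments_team)
```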
Non-functional requirements define the system qualities critical to your users. Frame them as 'The system should be able to...' statements. These will guide your deep dives later.
Think about CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
Frame NFRs for this specific system. 'P95 search latency under 2 seconds over the last 24 hours of logs' is far more valuable than just 'low latency'.
Add concrete numbers: 'P99 response time < 500ms', '99.9% availability', '10M DAU'. This drives architectural decisions.
Choose the 3-5 most critical NFRs. Every system should be 'scalable', but what makes THIS system's scaling uniquely challenging?