Design a metrics monitoring and alerting platform (like Prometheus + Grafana + PagerDuty) that ingests millions of time-series data points per second from thousands of services, stores them in a purpose-built time-series database with compression and downsampling, supports a powerful query language (PromQL) for dashboards, evaluates alert rules in real time, and dispatches notifications through multiple channels with escalation policies.
| Quantity | Estimate |
|---|---|
| Active time series | 100 million |
| Data points ingested per second | 10 million |
| Data points stored (raw, 15 days) | 13 trillion |
| Services/hosts monitored | 100,000 |
| Scrape targets (pull model) | 500,000 |
| Dashboard queries per second | 100,000 |
| Alert rules | 50,000 |
| Alert rule evaluations per second | 3,000 |
| Notification channels | 1,000+ (Slack, PagerDuty, email) |
| Storage (raw, per month) | 500 TB |
| Storage (1hr rollups, per year) | 24 TB |
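
A quick sanity check on the derived rows, assuming a flat ingest rate and roughly 19 bytes per raw point on disk (the per-point size is an assumption used to back out the table's storage figure, not a stated number):

$$
\begin{aligned}
\text{raw points (15 days)} &\approx 10^7\ \tfrac{\text{points}}{\text{s}} \times 86{,}400\ \tfrac{\text{s}}{\text{day}} \times 15\ \text{days} \approx 1.3 \times 10^{13} \approx 13\ \text{trillion} \\
\text{raw storage (30 days)} &\approx 10^7 \times 86{,}400 \times 30 \times 19\ \text{B} \approx 5 \times 10^{14}\ \text{B} \approx 500\ \text{TB}
\end{aligned}
$$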
**Metric ingestion:** accept time-series metrics from thousands of services/hosts (CPU, memory, disk, request latency, error rate, custom business metrics); each data point is a (metric_name, value, timestamp, tags/labels) tuple; ingest at 10+ million data points per second
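
A minimal sketch of what a single ingested sample and a batch ingestion endpoint could look like, assuming a JSON push API; the `DataPoint` shape and the `/api/v1/ingest` path are illustrative, not a fixed contract:

```go
// Sketch of the wire model for one ingested sample and a minimal batch
// ingestion endpoint. All names (DataPoint, /api/v1/ingest) are illustrative
// assumptions, not a fixed API.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// DataPoint mirrors the (metric_name, value, timestamp, labels) tuple.
type DataPoint struct {
	Name      string            `json:"name"`      // e.g. "http_request_duration_seconds"
	Value     float64           `json:"value"`
	Timestamp int64             `json:"timestamp"` // Unix milliseconds
	Labels    map[string]string `json:"labels"`    // e.g. {"service":"auth","env":"prod"}
}

func ingestHandler(w http.ResponseWriter, r *http.Request) {
	var batch []DataPoint
	if err := json.NewDecoder(r.Body).Decode(&batch); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	// In a real system the batch would be validated, rate-limited per tenant,
	// and appended to a write-ahead log or message queue here.
	log.Printf("accepted %d points", len(batch))
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/api/v1/ingest", ingestHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```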
**Time-series storage:** durably store all ingested metrics in a purpose-built time-series database; support high write throughput and efficient range queries (e.g., 'CPU usage for host X from 2pm–3pm'); automatic downsampling and retention policies
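
To make the range-query requirement concrete, here is a stripped-down in-memory chunk for one series with an append and a time-range scan. Compression (delta-of-delta timestamps and XOR-encoded values, as in Gorilla-style TSDBs) and persistence are deliberately left out, and all names are assumptions:

```go
// Minimal in-memory chunk sketch: append-only samples for one series plus a
// time-range scan, the core read pattern ("CPU for host X from 2pm-3pm").
package main

import (
	"fmt"
	"sort"
)

type Sample struct {
	TS    int64 // Unix milliseconds
	Value float64
}

// Chunk holds samples for a single time series, ordered by timestamp.
type Chunk struct {
	samples []Sample
}

// Append assumes mostly in-order writes; out-of-order samples are rejected.
func (c *Chunk) Append(s Sample) error {
	if n := len(c.samples); n > 0 && s.TS <= c.samples[n-1].TS {
		return fmt.Errorf("out-of-order sample at %d", s.TS)
	}
	c.samples = append(c.samples, s)
	return nil
}

// Range returns all samples with start <= TS <= end using binary search.
func (c *Chunk) Range(start, end int64) []Sample {
	lo := sort.Search(len(c.samples), func(i int) bool { return c.samples[i].TS >= start })
	hi := sort.Search(len(c.samples), func(i int) bool { return c.samples[i].TS > end })
	return c.samples[lo:hi]
}

func main() {
	var c Chunk
	for i := int64(0); i < 10; i++ {
		c.Append(Sample{TS: i * 15_000, Value: float64(i)}) // one sample every 15s
	}
	fmt.Println(c.Range(30_000, 90_000)) // samples between t=30s and t=90s
}
```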
**Querying and dashboards:** users define queries using a query language (e.g., PromQL); query results rendered as line charts, bar charts, heatmaps, gauges; dashboards with multiple panels; auto-refresh at configurable intervals (5s–5min)
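
One way to picture the dashboard model: each panel binds a query-language expression to a chart type and a refresh interval. The struct shapes and the example expressions below are assumptions for illustration, not a fixed schema:

```go
// Illustrative dashboard model: a panel binds a query-language expression to a
// visualization and a refresh interval.
package main

import (
	"fmt"
	"time"
)

type Panel struct {
	Title   string
	Expr    string        // query in the platform's query language (PromQL-style here)
	Chart   string        // "line", "bar", "heatmap", "gauge"
	Refresh time.Duration // 5s to 5min per the requirements
}

type Dashboard struct {
	Name   string
	Panels []Panel
}

func main() {
	d := Dashboard{
		Name: "checkout-service",
		Panels: []Panel{
			{
				Title:   "p99 request latency",
				Expr:    `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))`,
				Chart:   "line",
				Refresh: 30 * time.Second,
			},
			{
				Title:   "error rate",
				Expr:    `sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))`,
				Chart:   "line",
				Refresh: 30 * time.Second,
			},
		},
	}
	fmt.Printf("%+v\n", d)
}
```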
**Alerting rules:** define alert rules as threshold conditions on metrics (e.g., 'alert if p99 latency > 500ms for 5 minutes'); support for rate-of-change alerts, anomaly detection, absence-of-data alerts; each rule evaluated periodically (every 15–60 seconds)
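
The "for 5 minutes" clause is the subtle part of rule evaluation: a rule should pass through a pending state before firing, so a single noisy sample doesn't page anyone (similar in spirit to the pending/firing distinction Prometheus uses). A sketch of that state machine, with illustrative names and the condition check externalized:

```go
// Sketch of "threshold held for N minutes" semantics: a rule only fires after
// its condition has been continuously true for the configured duration.
package main

import (
	"fmt"
	"time"
)

type RuleState int

const (
	Inactive RuleState = iota
	Pending            // condition true, but not yet for the full "for" duration
	Firing
)

type AlertRule struct {
	Name         string
	For          time.Duration // e.g. 5 * time.Minute
	state        RuleState
	pendingSince time.Time
}

// Evaluate is called on each evaluation tick (every 15-60s) with the result of
// the rule's query, e.g. "p99 latency > 500ms".
func (r *AlertRule) Evaluate(conditionTrue bool, now time.Time) RuleState {
	switch {
	case !conditionTrue:
		r.state = Inactive
	case r.state == Inactive:
		r.state, r.pendingSince = Pending, now
	case r.state == Pending && now.Sub(r.pendingSince) >= r.For:
		r.state = Firing
	}
	return r.state
}

func main() {
	rule := &AlertRule{Name: "HighP99Latency", For: 5 * time.Minute}
	t := time.Now()
	for i := 0; i < 25; i++ { // simulate 15s evaluation ticks with the condition always true
		if rule.Evaluate(true, t.Add(time.Duration(i)*15*time.Second)) == Firing {
			fmt.Printf("firing after %d ticks\n", i)
			break
		}
	}
}
```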
**Alert notification:** when an alert triggers, send notifications via multiple channels: email, Slack, PagerDuty, OpsGenie, webhook; configurable escalation policies (page on-call engineer → escalate to manager after 15 min if unacknowledged)
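
A sketch of how an escalation policy could be modeled as an ordered list of steps with delays. Channel identifiers and the policy shape are assumptions, and a production dispatcher would use durable timers and re-check acknowledgement state between steps rather than printing a schedule:

```go
// Escalation-policy sketch: an ordered list of notification steps, each sent
// only if the alert is still unacknowledged after the step's delay.
package main

import (
	"fmt"
	"time"
)

type EscalationStep struct {
	Channel string        // e.g. "pagerduty:primary-oncall", "slack:#team-alerts"
	After   time.Duration // delay after the previous step if still unacknowledged
}

var policy = []EscalationStep{
	{Channel: "pagerduty:primary-oncall", After: 0},
	{Channel: "pagerduty:secondary-oncall", After: 15 * time.Minute},
	{Channel: "email:engineering-manager", After: 15 * time.Minute},
}

func main() {
	// Print the notification schedule as offsets from the moment the alert fires.
	offset := time.Duration(0)
	for _, step := range policy {
		offset += step.After
		fmt.Printf("T+%v: notify %s (skipped if acknowledged earlier)\n", offset, step.Channel)
	}
}
```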
**Metric aggregation and functions:** support aggregation functions (sum, avg, min, max, count, percentiles p50/p95/p99) across time windows and tag dimensions; support rate(), irate(), histogram_quantile(), mathematical operations (+, -, /, *)
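
Two of the building blocks behind these functions, sketched naively: a per-second rate over a counter window and a nearest-rank percentile. Real implementations also handle counter resets and typically compute high percentiles from histogram buckets; the helper names here are assumptions:

```go
// Naive versions of two query building blocks: rate over a counter window and
// a nearest-rank percentile over raw values.
package main

import (
	"fmt"
	"math"
	"sort"
)

type Sample struct {
	TS    int64 // seconds
	Value float64
}

// rate: increase of a monotonically increasing counter divided by the covered time span.
func rate(window []Sample) float64 {
	if len(window) < 2 {
		return 0
	}
	first, last := window[0], window[len(window)-1]
	return (last.Value - first.Value) / float64(last.TS-first.TS)
}

// percentile: nearest-rank percentile (p in [0,1]) over raw values.
func percentile(values []float64, p float64) float64 {
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	idx := int(math.Ceil(p*float64(len(sorted)))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

func main() {
	window := []Sample{{TS: 0, Value: 100}, {TS: 60, Value: 400}, {TS: 120, Value: 1000}}
	fmt.Printf("rate over 2m window: %.1f req/s\n", rate(window)) // (1000-100)/120 = 7.5
	latencies := []float64{12, 15, 18, 22, 30, 45, 80, 120, 250, 900}
	fmt.Printf("p99 latency: %.0fms\n", percentile(latencies, 0.99))
}
```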
**Tagging and label-based filtering:** each metric has key-value labels (service=auth, env=prod, region=us-east); query by any combination of labels; high-cardinality label support (up to 10K unique values per label)
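
Label filtering is usually answered by an inverted index: each `name=value` pair maps to a posting list of series IDs, and a multi-label query intersects the lists. A toy version with made-up data:

```go
// Sketch of a label inverted index answering "which series match
// service=auth AND env=prod".
package main

import "fmt"

type SeriesID uint64

// postings: "name=value" -> sorted posting list of series IDs carrying that label.
var postings = map[string][]SeriesID{
	"service=auth":   {1, 2, 5},
	"service=cart":   {3, 4},
	"env=prod":       {1, 3, 5},
	"env=staging":    {2, 4},
	"region=us-east": {1, 2, 3, 4, 5},
}

// intersect merges two sorted posting lists.
func intersect(a, b []SeriesID) []SeriesID {
	var out []SeriesID
	for i, j := 0, 0; i < len(a) && j < len(b); {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

func main() {
	// series matching {service="auth", env="prod"}
	fmt.Println(intersect(postings["service=auth"], postings["env=prod"])) // [1 5]
}
```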
**Downsampling and retention:** raw data retained for 15 days; 1-minute rollups for 90 days; 1-hour rollups for 2 years; downsampling runs automatically as a background process; configurable per metric/namespace
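
A sketch of the rollup step: raw samples are collapsed into fixed buckets that keep sum/count/min/max, so later avg/min/max queries over rollups stay exact. The bucket layout and names are assumptions:

```go
// Downsampling sketch: collapse raw samples into fixed 1-minute buckets.
package main

import "fmt"

type Sample struct {
	TS    int64 // Unix seconds
	Value float64
}

type Rollup struct {
	BucketStart   int64
	Sum, Min, Max float64
	Count         int64
}

func downsample(samples []Sample, bucketSeconds int64) []Rollup {
	byBucket := map[int64]*Rollup{}
	var order []int64 // preserve bucket order for output
	for _, s := range samples {
		start := s.TS - s.TS%bucketSeconds
		r, ok := byBucket[start]
		if !ok {
			r = &Rollup{BucketStart: start, Min: s.Value, Max: s.Value}
			byBucket[start] = r
			order = append(order, start)
		}
		r.Sum += s.Value
		r.Count++
		if s.Value < r.Min {
			r.Min = s.Value
		}
		if s.Value > r.Max {
			r.Max = s.Value
		}
	}
	out := make([]Rollup, 0, len(order))
	for _, start := range order {
		out = append(out, *byBucket[start])
	}
	return out
}

func main() {
	var raw []Sample
	for i := int64(0); i < 8; i++ { // one sample every 15s for 2 minutes
		raw = append(raw, Sample{TS: i * 15, Value: float64(10 + i)})
	}
	for _, r := range downsample(raw, 60) {
		fmt.Printf("bucket %d: avg=%.1f min=%.0f max=%.0f\n", r.BucketStart, r.Sum/float64(r.Count), r.Min, r.Max)
	}
}
```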
**Alert silencing and inhibition:** silence alerts for a time window (maintenance); inhibit dependent alerts (if cluster is down, suppress all per-service alerts for that cluster); deduplication (same alert condition doesn't fire repeatedly)
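
Silencing can be modeled as label matchers with a time window: an alert is suppressed if any active silence matches all of its matcher labels. A minimal version with an assumed `Silence` shape:

```go
// Silence-matching sketch: an alert is suppressed if any active silence's
// label matchers are all satisfied by the alert's labels.
package main

import (
	"fmt"
	"time"
)

type Silence struct {
	Matchers map[string]string // all must match the alert's labels
	StartsAt time.Time
	EndsAt   time.Time
}

func silenced(alertLabels map[string]string, silences []Silence, now time.Time) bool {
	for _, s := range silences {
		if now.Before(s.StartsAt) || now.After(s.EndsAt) {
			continue // silence not active right now
		}
		matched := true
		for k, v := range s.Matchers {
			if alertLabels[k] != v {
				matched = false
				break
			}
		}
		if matched {
			return true
		}
	}
	return false
}

func main() {
	now := time.Now()
	maintenance := []Silence{{
		Matchers: map[string]string{"cluster": "us-east-1a"},
		StartsAt: now.Add(-time.Hour),
		EndsAt:   now.Add(time.Hour),
	}}
	alert := map[string]string{"alertname": "HighErrorRate", "cluster": "us-east-1a", "service": "auth"}
	fmt.Println(silenced(alert, maintenance, now)) // true: suppressed during maintenance
}
```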
**Multi-tenancy:** multiple teams/services share the platform; tenant-level isolation for metrics namespace, dashboards, alert rules, and notification channels; per-tenant rate limits and storage quotas
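
A sketch of per-tenant admission control at the write path, enforcing an ingest rate limit and an active-series quota. The limits, counters, and in-memory bookkeeping are illustrative assumptions; a real deployment shares this state across ingesters:

```go
// Per-tenant admission sketch: enforce an ingest rate limit and an active
// series quota before accepting a write.
package main

import (
	"fmt"
	"sync"
	"time"
)

type TenantLimits struct {
	MaxPointsPerSec int
	MaxActiveSeries int
}

type tenantUsage struct {
	windowStart    time.Time
	pointsInWindow int
	activeSeries   int
}

type Admitter struct {
	mu     sync.Mutex
	limits map[string]TenantLimits
	usage  map[string]*tenantUsage
}

// Admit applies a fixed one-second window rate limit plus a series quota.
func (a *Admitter) Admit(tenant string, points, newSeries int, now time.Time) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	lim := a.limits[tenant]
	u, ok := a.usage[tenant]
	if !ok {
		u = &tenantUsage{windowStart: now}
		a.usage[tenant] = u
	}
	if now.Sub(u.windowStart) >= time.Second {
		u.windowStart, u.pointsInWindow = now, 0
	}
	if u.pointsInWindow+points > lim.MaxPointsPerSec {
		return fmt.Errorf("tenant %s over ingest rate limit", tenant)
	}
	if u.activeSeries+newSeries > lim.MaxActiveSeries {
		return fmt.Errorf("tenant %s over active-series quota", tenant)
	}
	u.pointsInWindow += points
	u.activeSeries += newSeries
	return nil
}

func main() {
	a := &Admitter{
		limits: map[string]TenantLimits{"team-auth": {MaxPointsPerSec: 100_000, MaxActiveSeries: 1_000_000}},
		usage:  map[string]*tenantUsage{},
	}
	fmt.Println(a.Admit("team-auth", 50_000, 10, time.Now())) // <nil>: accepted
	fmt.Println(a.Admit("team-auth", 80_000, 0, time.Now()))  // error: over rate limit
}
```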
Non-functional requirements define the system qualities critical to your users. Frame them as 'The system should be able to...' statements. These will guide your deep dives later.
- Think about CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
- Frame NFRs for this specific system: 'P99 query latency under 100ms' is far more valuable than just 'low latency'.
- Add concrete numbers: 'P99 response time < 500ms', '99.9% availability', '10M DAU'. This drives architectural decisions.
- Choose the 3-5 most critical NFRs. Every system should be 'scalable', but what makes THIS system's scaling uniquely challenging?