Design a metrics monitoring and alerting platform (like Prometheus + Grafana + PagerDuty) that ingests millions of time-series data points per second from thousands of services, stores them in a purpose-built time-series database with compression and downsampling, supports a powerful query language (PromQL) for dashboards, evaluates alert rules in real time, and dispatches notifications through multiple channels with escalation policies.
| Quantity | Estimate |
|---|---|
| Active time series | 100 million |
| Data points ingested per second | 10 million |
| Data points stored (raw, 15 days) | 13 trillion |
| Services/hosts monitored | 100,000 |
| Scrape targets (pull model) | 500,000 |
| Dashboard queries per second | 100,000 |
| Alert rules | 50,000 |
| Alert rule evaluations per second | 3,000 |
| Notification channels | 1,000+ (Slack, PagerDuty, email) |
| Storage (raw, per month) | 500 TB |
| Storage (1hr rollups, per year) | 24 TB |
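
A quick sanity check on the derived rows, assuming a flat ingest rate and roughly 19 bytes per raw point on disk (the per-point size is an assumption used to back out the table's storage figure, not a stated number):

$$
\begin{aligned}
\text{raw points (15 days)} &\approx 10^7\ \tfrac{\text{points}}{\text{s}} \times 86{,}400\ \tfrac{\text{s}}{\text{day}} \times 15\ \text{days} \approx 1.3 \times 10^{13} \approx 13\ \text{trillion} \\
\text{raw storage (30 days)} &\approx 10^7 \times 86{,}400 \times 30 \times 19\ \text{B} \approx 5 \times 10^{14}\ \text{B} \approx 500\ \text{TB}
\end{aligned}
$$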
**Metric ingestion:** accept time-series metrics from thousands of services/hosts (CPU, memory, disk, request latency, error rate, custom business metrics); each data point is a (metric_name, value, timestamp, tags/labels) tuple; ingest at 10+ million data points per second
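
A minimal sketch of what a single ingested sample and a batch ingestion endpoint could look like, assuming a JSON push API; the `DataPoint` shape and the `/api/v1/ingest` path are illustrative, not a fixed contract:

```go
// Sketch of the wire model for one ingested sample and a minimal batch
// ingestion endpoint. All names (DataPoint, /api/v1/ingest) are illustrative
// assumptions, not a fixed API.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// DataPoint mirrors the (metric_name, value, timestamp, labels) tuple.
type DataPoint struct {
	Name      string            `json:"name"`      // e.g. "http_request_duration_seconds"
	Value     float64           `json:"value"`
	Timestamp int64             `json:"timestamp"` // Unix milliseconds
	Labels    map[string]string `json:"labels"`    // e.g. {"service":"auth","env":"prod"}
}

func ingestHandler(w http.ResponseWriter, r *http.Request) {
	var batch []DataPoint
	if err := json.NewDecoder(r.Body).Decode(&batch); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	// In a real system the batch would be validated, rate-limited per tenant,
	// and appended to a write-ahead log or message queue here.
	log.Printf("accepted %d points", len(batch))
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	http.HandleFunc("/api/v1/ingest", ingestHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```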
**Time-series storage:** durably store all ingested metrics in a purpose-built time-series database; support high write throughput and efficient range queries (e.g., 'CPU usage for host X from 2pm–3pm'); automatic downsampling and retention policies
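
To make the range-query requirement concrete, here is a stripped-down in-memory chunk for one series with an append and a time-range scan. Compression (delta-of-delta timestamps and XOR-encoded values, as in Gorilla-style TSDBs) and persistence are deliberately left out, and all names are assumptions:

```go
// Minimal in-memory chunk sketch: append-only samples for one series plus a
// time-range scan, the core read pattern ("CPU for host X from 2pm-3pm").
package main

import (
	"fmt"
	"sort"
)

type Sample struct {
	TS    int64 // Unix milliseconds
	Value float64
}

// Chunk holds samples for a single time series, ordered by timestamp.
type Chunk struct {
	samples []Sample
}

// Append assumes mostly in-order writes; out-of-order samples are rejected.
func (c *Chunk) Append(s Sample) error {
	if n := len(c.samples); n > 0 && s.TS <= c.samples[n-1].TS {
		return fmt.Errorf("out-of-order sample at %d", s.TS)
	}
	c.samples = append(c.samples, s)
	return nil
}

// Range returns all samples with start <= TS <= end using binary search.
func (c *Chunk) Range(start, end int64) []Sample {
	lo := sort.Search(len(c.samples), func(i int) bool { return c.samples[i].TS >= start })
	hi := sort.Search(len(c.samples), func(i int) bool { return c.samples[i].TS > end })
	return c.samples[lo:hi]
}

func main() {
	var c Chunk
	for i := int64(0); i < 10; i++ {
		c.Append(Sample{TS: i * 15_000, Value: float64(i)}) // one sample every 15s
	}
	fmt.Println(c.Range(30_000, 90_000)) // samples between t=30s and t=90s
}
```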
**Querying and dashboards:** users define queries using a query language (e.g., PromQL); query results rendered as line charts, bar charts, heatmaps, gauges; dashboards with multiple panels; auto-refresh at configurable intervals (5s–5min)
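
One way to picture the dashboard model: each panel binds a query-language expression to a chart type and a refresh interval. The struct shapes and the example expressions below are assumptions for illustration, not a fixed schema:

```go
// Illustrative dashboard model: a panel binds a query-language expression to a
// visualization and a refresh interval.
package main

import (
	"fmt"
	"time"
)

type Panel struct {
	Title   string
	Expr    string        // query in the platform's query language (PromQL-style here)
	Chart   string        // "line", "bar", "heatmap", "gauge"
	Refresh time.Duration // 5s to 5min per the requirements
}

type Dashboard struct {
	Name   string
	Panels []Panel
}

func main() {
	d := Dashboard{
		Name: "checkout-service",
		Panels: []Panel{
			{
				Title:   "p99 request latency",
				Expr:    `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))`,
				Chart:   "line",
				Refresh: 30 * time.Second,
			},
			{
				Title:   "error rate",
				Expr:    `sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))`,
				Chart:   "line",
				Refresh: 30 * time.Second,
			},
		},
	}
	fmt.Printf("%+v\n", d)
}
```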
**Alerting rules:** define alert rules as threshold conditions on metrics (e.g., 'alert if p99 latency > 500ms for 5 minutes'); support for rate-of-change alerts, anomaly detection, absence-of-data alerts; each rule evaluated periodically (every 15–60 seconds)
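
The "for 5 minutes" clause is the subtle part of rule evaluation: a rule should pass through a pending state before firing, so a single noisy sample doesn't page anyone (similar in spirit to the pending/firing distinction Prometheus uses). A sketch of that state machine, with illustrative names and the condition check externalized:

```go
// Sketch of "threshold held for N minutes" semantics: a rule only fires after
// its condition has been continuously true for the configured duration.
package main

import (
	"fmt"
	"time"
)

type RuleState int

const (
	Inactive RuleState = iota
	Pending            // condition true, but not yet for the full "for" duration
	Firing
)

type AlertRule struct {
	Name         string
	For          time.Duration // e.g. 5 * time.Minute
	state        RuleState
	pendingSince time.Time
}

// Evaluate is called on each evaluation tick (every 15-60s) with the result of
// the rule's query, e.g. "p99 latency > 500ms".
func (r *AlertRule) Evaluate(conditionTrue bool, now time.Time) RuleState {
	switch {
	case !conditionTrue:
		r.state = Inactive
	case r.state == Inactive:
		r.state, r.pendingSince = Pending, now
	case r.state == Pending && now.Sub(r.pendingSince) >= r.For:
		r.state = Firing
	}
	return r.state
}

func main() {
	rule := &AlertRule{Name: "HighP99Latency", For: 5 * time.Minute}
	t := time.Now()
	for i := 0; i < 25; i++ { // simulate 15s evaluation ticks with the condition always true
		if rule.Evaluate(true, t.Add(time.Duration(i)*15*time.Second)) == Firing {
			fmt.Printf("firing after %d ticks\n", i)
			break
		}
	}
}
```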
**Alert notification:** when an alert triggers, send notifications via multiple channels: email, Slack, PagerDuty, OpsGenie, webhook; configurable escalation policies (page on-call engineer → escalate to manager after 15 min if unacknowledged)
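
A sketch of how an escalation policy could be modeled as an ordered list of steps with delays. Channel identifiers and the policy shape are assumptions, and a production dispatcher would use durable timers and re-check acknowledgement state between steps rather than printing a schedule:

```go
// Escalation-policy sketch: an ordered list of notification steps, each sent
// only if the alert is still unacknowledged after the step's delay.
package main

import (
	"fmt"
	"time"
)

type EscalationStep struct {
	Channel string        // e.g. "pagerduty:primary-oncall", "slack:#team-alerts"
	After   time.Duration // delay after the previous step if still unacknowledged
}

var policy = []EscalationStep{
	{Channel: "pagerduty:primary-oncall", After: 0},
	{Channel: "pagerduty:secondary-oncall", After: 15 * time.Minute},
	{Channel: "email:engineering-manager", After: 15 * time.Minute},
}

func main() {
	// Print the notification schedule as offsets from the moment the alert fires.
	offset := time.Duration(0)
	for _, step := range policy {
		offset += step.After
		fmt.Printf("T+%v: notify %s (skipped if acknowledged earlier)\n", offset, step.Channel)
	}
}
```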
**Metric aggregation and functions:** support aggregation functions (sum, avg, min, max, count, percentiles p50/p95/p99) across time windows and tag dimensions; support rate(), irate(), histogram_quantile(), mathematical operations (+, -, /, *)
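
Two of the building blocks behind these functions, sketched naively: a per-second rate over a counter window and a nearest-rank percentile. Real implementations also handle counter resets and typically compute high percentiles from histogram buckets; the helper names here are assumptions:

```go
// Naive versions of two query building blocks: rate over a counter window and
// a nearest-rank percentile over raw values.
package main

import (
	"fmt"
	"math"
	"sort"
)

type Sample struct {
	TS    int64 // seconds
	Value float64
}

// rate: increase of a monotonically increasing counter divided by the covered time span.
func rate(window []Sample) float64 {
	if len(window) < 2 {
		return 0
	}
	first, last := window[0], window[len(window)-1]
	return (last.Value - first.Value) / float64(last.TS-first.TS)
}

// percentile: nearest-rank percentile (p in [0,1]) over raw values.
func percentile(values []float64, p float64) float64 {
	sorted := append([]float64(nil), values...)
	sort.Float64s(sorted)
	idx := int(math.Ceil(p*float64(len(sorted)))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

func main() {
	window := []Sample{{TS: 0, Value: 100}, {TS: 60, Value: 400}, {TS: 120, Value: 1000}}
	fmt.Printf("rate over 2m window: %.1f req/s\n", rate(window)) // (1000-100)/120 = 7.5
	latencies := []float64{12, 15, 18, 22, 30, 45, 80, 120, 250, 900}
	fmt.Printf("p99 latency: %.0fms\n", percentile(latencies, 0.99))
}
```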
**Tagging and label-based filtering:** each metric has key-value labels (service=auth, env=prod, region=us-east); query by any combination of labels; high-cardinality label support (up to 10K unique values per label)
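
Label filtering is usually answered by an inverted index: each `name=value` pair maps to a posting list of series IDs, and a multi-label query intersects the lists. A toy version with made-up data:

```go
// Sketch of a label inverted index answering "which series match
// service=auth AND env=prod".
package main

import "fmt"

type SeriesID uint64

// postings: "name=value" -> sorted posting list of series IDs carrying that label.
var postings = map[string][]SeriesID{
	"service=auth":   {1, 2, 5},
	"service=cart":   {3, 4},
	"env=prod":       {1, 3, 5},
	"env=staging":    {2, 4},
	"region=us-east": {1, 2, 3, 4, 5},
}

// intersect merges two sorted posting lists.
func intersect(a, b []SeriesID) []SeriesID {
	var out []SeriesID
	for i, j := 0, 0; i < len(a) && j < len(b); {
		switch {
		case a[i] == b[j]:
			out = append(out, a[i])
			i++
			j++
		case a[i] < b[j]:
			i++
		default:
			j++
		}
	}
	return out
}

func main() {
	// series matching {service="auth", env="prod"}
	fmt.Println(intersect(postings["service=auth"], postings["env=prod"])) // [1 5]
}
```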
**Downsampling and retention:** raw data retained for 15 days; 1-minute rollups for 90 days; 1-hour rollups for 2 years; downsampling runs automatically as a background process; configurable per metric/namespace
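
A sketch of the rollup step: raw samples are collapsed into fixed buckets that keep sum/count/min/max, so later avg/min/max queries over rollups stay exact. The bucket layout and names are assumptions:

```go
// Downsampling sketch: collapse raw samples into fixed 1-minute buckets.
package main

import "fmt"

type Sample struct {
	TS    int64 // Unix seconds
	Value float64
}

type Rollup struct {
	BucketStart   int64
	Sum, Min, Max float64
	Count         int64
}

func downsample(samples []Sample, bucketSeconds int64) []Rollup {
	byBucket := map[int64]*Rollup{}
	var order []int64 // preserve bucket order for output
	for _, s := range samples {
		start := s.TS - s.TS%bucketSeconds
		r, ok := byBucket[start]
		if !ok {
			r = &Rollup{BucketStart: start, Min: s.Value, Max: s.Value}
			byBucket[start] = r
			order = append(order, start)
		}
		r.Sum += s.Value
		r.Count++
		if s.Value < r.Min {
			r.Min = s.Value
		}
		if s.Value > r.Max {
			r.Max = s.Value
		}
	}
	out := make([]Rollup, 0, len(order))
	for _, start := range order {
		out = append(out, *byBucket[start])
	}
	return out
}

func main() {
	var raw []Sample
	for i := int64(0); i < 8; i++ { // one sample every 15s for 2 minutes
		raw = append(raw, Sample{TS: i * 15, Value: float64(10 + i)})
	}
	for _, r := range downsample(raw, 60) {
		fmt.Printf("bucket %d: avg=%.1f min=%.0f max=%.0f\n", r.BucketStart, r.Sum/float64(r.Count), r.Min, r.Max)
	}
}
```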
**Alert silencing and inhibition:** silence alerts for a time window (maintenance); inhibit dependent alerts (if cluster is down, suppress all per-service alerts for that cluster); deduplication (same alert condition doesn't fire repeatedly)
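
Silencing can be modeled as label matchers with a time window: an alert is suppressed if any active silence matches all of its matcher labels. A minimal version with an assumed `Silence` shape:

```go
// Silence-matching sketch: an alert is suppressed if any active silence's
// label matchers are all satisfied by the alert's labels.
package main

import (
	"fmt"
	"time"
)

type Silence struct {
	Matchers map[string]string // all must match the alert's labels
	StartsAt time.Time
	EndsAt   time.Time
}

func silenced(alertLabels map[string]string, silences []Silence, now time.Time) bool {
	for _, s := range silences {
		if now.Before(s.StartsAt) || now.After(s.EndsAt) {
			continue // silence not active right now
		}
		matched := true
		for k, v := range s.Matchers {
			if alertLabels[k] != v {
				matched = false
				break
			}
		}
		if matched {
			return true
		}
	}
	return false
}

func main() {
	now := time.Now()
	maintenance := []Silence{{
		Matchers: map[string]string{"cluster": "us-east-1a"},
		StartsAt: now.Add(-time.Hour),
		EndsAt:   now.Add(time.Hour),
	}}
	alert := map[string]string{"alertname": "HighErrorRate", "cluster": "us-east-1a", "service": "auth"}
	fmt.Println(silenced(alert, maintenance, now)) // true: suppressed during maintenance
}
```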
**Multi-tenancy:** multiple teams/services share the platform; tenant-level isolation for metrics namespace, dashboards, alert rules, and notification channels; per-tenant rate limits and storage quotas
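
A sketch of per-tenant admission control at the write path, enforcing an ingest rate limit and an active-series quota. The limits, counters, and in-memory bookkeeping are illustrative assumptions; a real deployment shares this state across ingesters:

```go
// Per-tenant admission sketch: enforce an ingest rate limit and an active
// series quota before accepting a write.
package main

import (
	"fmt"
	"sync"
	"time"
)

type TenantLimits struct {
	MaxPointsPerSec int
	MaxActiveSeries int
}

type tenantUsage struct {
	windowStart    time.Time
	pointsInWindow int
	activeSeries   int
}

type Admitter struct {
	mu     sync.Mutex
	limits map[string]TenantLimits
	usage  map[string]*tenantUsage
}

// Admit applies a fixed one-second window rate limit plus a series quota.
func (a *Admitter) Admit(tenant string, points, newSeries int, now time.Time) error {
	a.mu.Lock()
	defer a.mu.Unlock()
	lim := a.limits[tenant]
	u, ok := a.usage[tenant]
	if !ok {
		u = &tenantUsage{windowStart: now}
		a.usage[tenant] = u
	}
	if now.Sub(u.windowStart) >= time.Second {
		u.windowStart, u.pointsInWindow = now, 0
	}
	if u.pointsInWindow+points > lim.MaxPointsPerSec {
		return fmt.Errorf("tenant %s over ingest rate limit", tenant)
	}
	if u.activeSeries+newSeries > lim.MaxActiveSeries {
		return fmt.Errorf("tenant %s over active-series quota", tenant)
	}
	u.pointsInWindow += points
	u.activeSeries += newSeries
	return nil
}

func main() {
	a := &Admitter{
		limits: map[string]TenantLimits{"team-auth": {MaxPointsPerSec: 100_000, MaxActiveSeries: 1_000_000}},
		usage:  map[string]*tenantUsage{},
	}
	fmt.Println(a.Admit("team-auth", 50_000, 10, time.Now())) // <nil>: accepted
	fmt.Println(a.Admit("team-auth", 80_000, 0, time.Now()))  // error: over rate limit
}
```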
Non-functional requirements define the system qualities critical to your users. Frame them as 'The system should be able to...' statements. These will guide your deep dives later.
- Think about CAP theorem trade-offs, scalability limits, latency targets, durability guarantees, security requirements, fault tolerance, and compliance needs.
- Frame NFRs for this specific system: 'P99 query latency under 100ms' is far more valuable than just 'low latency'.
- Add concrete numbers: 'P99 response time < 500ms', '99.9% availability', '10M DAU'. This drives architectural decisions.
- Choose the 3-5 most critical NFRs. Every system should be 'scalable', but what makes THIS system's scaling uniquely challenging?