At 2:47 AM, a payment processing service begins rejecting transactions. Within 60 seconds, an on-call engineer receives a PagerDuty alert. By 2:49 AM, they've identified the root cause: database connection pool exhaustion triggered by a memory leak in a deployment three hours earlier. By 2:52 AM, the faulty deployment is rolled back. Total customer impact: five minutes of partial outage.
This response—identification in under 2 minutes, resolution in under 5—is only possible because of metrics and monitoring infrastructure. Behind the scenes, thousands of metrics per second flow from application servers to collection agents, through processing pipelines, into time-series databases, and finally to alerting engines. This observability infrastructure is as critical as the application it monitors.
Time-series databases are the beating heart of this infrastructure. They store the billions of data points that describe system health, enable the queries that power dashboards and alerts, and retain the historical data that enables incident postmortems and capacity planning.
By the end of this page, you will understand the complete metrics and monitoring pipeline: from metric types and collection patterns, through the observability stack architecture, to real-world deployment patterns used by companies processing trillions of data points daily.
Modern observability rests on three complementary data types, each providing a different lens into system behavior. Understanding how these pillars relate to time-series databases is fundamental to architectural decisions.
1. Metrics:
Metrics are numerical measurements collected at regular intervals. They answer questions like "What is the current CPU usage?" or "How many requests per second are we processing?" Metrics are compact, cheap to store at high resolution, and easy to aggregate, which makes them the natural fit for a time-series database.
2. Logs:
Logs are unstructured or semi-structured text records of discrete events. They answer questions like "What error message did this request produce?" Logs are verbose and very high in cardinality, so they are typically kept in a dedicated log aggregation system, with only derived metrics flowing into the time-series database.
3. Traces:
Traces follow individual requests across distributed services. They answer questions like "Where did this slow request spend its time?" Traces are also high in cardinality and usually sampled, so they live in a dedicated trace store, again with only derived metrics landing in the time-series database.
| Pillar | Primary Storage | TSDB Role | Cardinality |
|---|---|---|---|
| Metrics | Time-Series Database | Primary | Low-Medium (millions) |
| Logs | Log aggregation system | Derived metrics only | Very High (billions) |
| Traces | Trace storage system | Derived metrics only | Very High (billions) |
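To make the contrast concrete, here is an illustrative sketch of what each pillar might capture for the same failing request. The identifiers and formats are invented for this example rather than taken from any particular tool:

```python
# Illustrative only: what each observability pillar might record for one failing HTTP request.
# Names, IDs, and formats are assumptions made for this sketch, not any vendor's schema.

# METRIC: one aggregated sample in Prometheus exposition style (low cardinality).
metric_sample = 'http_requests_total{method="GET", status="500", handler="/api/users"} 90'

# LOG: one discrete event with request-specific detail (high cardinality).
log_line = (
    '2024-03-14T02:47:13Z ERROR request_id=7f3a handler=/api/users '
    'msg="timeout acquiring database connection from pool"'
)

# TRACE SPAN: where this particular request spent its time across services.
trace_span = {
    "trace_id": "7f3a",           # ties the span to the same request as the log line
    "span_name": "SELECT users",
    "service": "payments-db",
    "duration_ms": 5003,          # the slow step this request spent its time on
    "status": "error",
}
```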
In incident response, metrics typically provide the first indication that something is wrong—alerting on latency spikes, error rate increases, or resource exhaustion. Logs and traces are then used for root cause analysis. This is why investing in robust metrics infrastructure pays immediate dividends in operational reliability.
Not all metrics are created equal. Different types of measurements require different handling, storage, and query patterns. Understanding metric types is essential for schema design and efficient querying.
Core Metric Types:
```
# COUNTER: Total HTTP requests (only increases)
# TYPE http_requests_total counter
http_requests_total{method="GET", status="200", handler="/api/users"} 145678
http_requests_total{method="POST", status="201", handler="/api/users"} 23456
http_requests_total{method="GET", status="500", handler="/api/users"} 89

# Query: Request rate per second over last 5 minutes
# rate(http_requests_total[5m])

# GAUGE: Current memory usage (can go up or down)
# TYPE memory_usage_bytes gauge
memory_usage_bytes{instance="web-01"} 1073741824
memory_usage_bytes{instance="web-02"} 2147483648

# Query: Average memory across instances
# avg(memory_usage_bytes)

# HISTOGRAM: Request latency distribution
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.01"} 10234   # ≤10ms
http_request_duration_seconds_bucket{le="0.05"} 45678   # ≤50ms
http_request_duration_seconds_bucket{le="0.1"} 52341    # ≤100ms
http_request_duration_seconds_bucket{le="0.5"} 54123    # ≤500ms
http_request_duration_seconds_bucket{le="1.0"} 54567    # ≤1s
http_request_duration_seconds_bucket{le="+Inf"} 54890   # Total count
http_request_duration_seconds_sum 4521.3                # Sum of all durations
http_request_duration_seconds_count 54890                # Total observations

# Query: 95th percentile latency
# histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# SUMMARY: Pre-computed percentiles (computed client-side)
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.5"} 0.042    # P50 = 42ms
rpc_duration_seconds{quantile="0.9"} 0.089    # P90 = 89ms
rpc_duration_seconds{quantile="0.99"} 0.234   # P99 = 234ms
rpc_duration_seconds_sum 8723.4
rpc_duration_seconds_count 194532
```

Histograms require more storage (one series per bucket) but allow server-side percentile calculation and aggregation across instances. Summaries are more compact, but percentiles are computed client-side and cannot be aggregated. For most use cases, histograms are preferred for their flexibility.
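As a concrete example of how these types are produced in application code, here is a minimal sketch using the Python prometheus_client library. The metric names and bucket boundaries mirror the exposition above; the values and call sites are illustrative:

```python
# Minimal sketch of the four metric types with the Python prometheus_client library.
# Values, labels, and the request loop are illustrative; buckets mirror the example above.
import time
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "status", "handler"],
)
MEMORY = Gauge("memory_usage_bytes", "Current memory usage", ["instance"])
LATENCY = Histogram(
    "http_request_duration_seconds", "HTTP request latency",
    buckets=(0.01, 0.05, 0.1, 0.5, 1.0),
)
RPC_LATENCY = Summary("rpc_duration_seconds", "RPC latency")

def handle_request() -> None:
    # Counter: only ever increments.
    REQUESTS.labels(method="GET", status="200", handler="/api/users").inc()
    # Gauge: set to the current value; can go up or down.
    MEMORY.labels(instance="web-01").set(1_073_741_824)
    # Histogram: each observation increments the matching bucket plus _sum and _count.
    LATENCY.observe(0.042)
    # Summary: tracked client-side (this Python client exports count and sum, not quantiles).
    RPC_LATENCY.observe(0.042)

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for a Prometheus server to scrape
    while True:
        handle_request()
        time.sleep(1)
```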
Metrics don't magically appear in your time-series database. A sophisticated pipeline collects, transforms, and routes data from thousands of sources to centralized storage. Understanding this pipeline is essential for designing reliable observability infrastructure.
Complete Metrics Collection Pipeline:

```
SOURCES (where metrics originate)
  Application metrics (SDKs) | System metrics (node_exporter) | Network devices (SNMP) | Cloud APIs (AWS, GCP)
        │
        ▼
COLLECTION (gathering and initial processing)
  Collection agents, per-host or centralized: Telegraf / Prometheus / OpenTelemetry Collector
    ├── Scrape metrics from endpoints (pull model)
    ├── Receive pushed metrics (push model)
    ├── Parse and validate
    ├── Add metadata (host, region, environment)
    └── Buffer during downstream outages
        │
        ▼
AGGREGATION (optional: reduce cardinality and volume)
  Pre-aggregation before storage (examples: Prometheus federation, M3 Aggregator, Cortex)
    ├── Reduce scrape frequency (15s → 1m)
    ├── Drop unnecessary labels
    ├── Compute rollups (sum, avg, max per minute)
    └── Filter high-cardinality series
        │
        ▼
STORAGE (time-series database cluster)
  InfluxDB / TimescaleDB / Prometheus / M3DB / VictoriaMetrics
    ├── High-throughput ingestion
    ├── Compression and retention
    ├── Query engine for dashboards
    └── Long-term storage (tiering to S3)
        │
        ▼
CONSUMPTION (visualization and alerting)
  Dashboards (Grafana) | Alerting (Alertmanager, PagerDuty) | Analytics/Reporting (Business Intelligence)
```

Pull vs Push: Collection Models
Two fundamental paradigms exist for metrics collection:
Pull Model (Prometheus-style): the monitoring server periodically scrapes an HTTP endpoint (typically /metrics) exposed by each target. The server controls the schedule, can tell immediately when a target stops responding, and relies on service discovery to know what to scrape.
Push Model (InfluxDB/StatsD-style): each application or agent actively sends measurements to a collector endpoint. This suits short-lived batch jobs and sources that cannot be scraped, but the backend must absorb bursts and cannot easily distinguish a silent source from a dead one.
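A minimal sketch contrasting the two models, assuming the Python prometheus_client library on the pull side and a local StatsD agent listening on UDP port 8125 on the push side (both assumptions for illustration):

```python
# Pull vs push, side by side. The StatsD address and port are illustrative assumptions.
import socket
import time
from prometheus_client import Counter, start_http_server

# PULL: the application only *exposes* metrics; a Prometheus server scrapes
# http://<this-host>:8000/metrics on its own schedule.
SCRAPED_REQUESTS = Counter("http_requests_total", "Total HTTP requests")
start_http_server(8000)

# PUSH: the application actively sends each measurement to a collector,
# here using the plain-text StatsD line protocol over UDP.
statsd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
STATSD_ADDR = ("127.0.0.1", 8125)   # assumed local StatsD agent

def record_request(duration_ms: float) -> None:
    SCRAPED_REQUESTS.inc()                                        # pull side: update local state
    statsd.sendto(b"http_requests_total:1|c", STATSD_ADDR)        # push side: counter increment
    statsd.sendto(f"http_request_duration:{duration_ms}|ms".encode(), STATSD_ADDR)  # timing sample

while True:
    record_request(duration_ms=42.0)
    time.sleep(1)
```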
Deciding what to measure is as important as how to measure it. Industry experience has converged on standard metrics that provide comprehensive visibility into system health.
The RED Method (for request-driven services): Rate (requests per second), Errors (failed requests per second), and Duration (the distribution of request latency).
The USE Method (for resources like CPU, memory, disk): Utilization (how busy the resource is), Saturation (how much work is queued waiting for it), and Errors (error counts).
```
# INFRASTRUCTURE METRICS (USE Method)
# CPU
system_cpu_usage_percent{cpu="0", mode="user|system|idle|iowait"}
system_cpu_usage_percent{aggregate="total"}

# Memory
system_memory_used_bytes
system_memory_available_bytes
system_memory_cached_bytes
system_swap_used_bytes

# Disk
system_disk_used_bytes{device="/dev/sda1"}
system_disk_io_read_bytes_total{device="/dev/sda1"}
system_disk_io_write_bytes_total{device="/dev/sda1"}
system_disk_io_time_seconds_total{device="/dev/sda1"}

# Network
system_network_bytes_total{interface="eth0", direction="rx|tx"}
system_network_packets_total{interface="eth0", direction="rx|tx"}
system_network_errors_total{interface="eth0", direction="rx|tx"}

# APPLICATION METRICS (RED Method)
# HTTP Server
http_requests_total{method, status, handler}
http_request_duration_seconds{method, handler}   # histogram
http_requests_in_flight{handler}                 # gauge

# Database Connections
db_connections_open{database="postgres"}
db_connections_max{database="postgres"}
db_query_duration_seconds{operation="select|insert|update|delete"}

# Queue Processing
queue_messages_total{queue="orders", status="received|processed|failed"}
queue_depth{queue="orders"}
queue_oldest_message_age_seconds{queue="orders"}

# External Dependencies
external_request_duration_seconds{service="payment-api", method, status}
external_request_total{service="payment-api", method, status}
circuit_breaker_state{service="payment-api"}   # 0=closed, 1=open, 2=half-open

# BUSINESS METRICS
orders_created_total{product_type, region}
revenue_cents_total{product_type, region}
user_registrations_total{source}
payment_processing_duration_seconds
```

Google's SRE book defines four golden signals: Latency, Traffic, Errors, and Saturation. These overlap significantly with RED and USE. The key insight is that a small set of well-chosen metrics provides the majority of observability value; start with these fundamentals before adding specialized metrics.
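In practice, the RED metrics above are usually recorded by shared middleware rather than hand-written in every handler. Here is a minimal sketch using the Python prometheus_client library; the decorator name and the simplified 200/500 status handling are assumptions made for the example:

```python
# Sketch of RED-method instrumentation as a reusable decorator.
# Metric names follow the conventions listed above; status handling is deliberately simplified.
import time
from functools import wraps
from prometheus_client import Counter, Histogram

REQUESTS = Counter("http_requests_total", "Total requests", ["handler", "status"])
DURATION = Histogram("http_request_duration_seconds", "Request latency", ["handler"])

def red_instrumented(handler_name: str):
    """Record Rate, Errors, and Duration for the wrapped request handler."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                REQUESTS.labels(handler=handler_name, status="200").inc()   # Rate
                return result
            except Exception:
                REQUESTS.labels(handler=handler_name, status="500").inc()   # Errors
                raise
            finally:
                DURATION.labels(handler=handler_name).observe(
                    time.perf_counter() - start                             # Duration
                )
        return wrapper
    return decorator

@red_instrumented("/api/users")
def list_users():
    return ["alice", "bob"]
```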
As organizations grow, metrics infrastructure faces severe scaling challenges. A 100-server deployment might generate 50,000 metrics per second; a 10,000-server deployment generates 5 million. Naive architectures collapse under this load.
Scaling Strategies: the architecture below combines the main techniques in use today, federating many scraping clusters behind a global query layer, running Prometheus in HA pairs and deduplicating their data at query time, tiering long-term data to cheap object storage, and downsampling and compacting old blocks to keep storage and query costs bounded.
Enterprise-Scale Metrics Architecture:

```
Global Query Layer (Thanos Query / Cortex Querier)
  ├── Queries across all regional clusters
  └── Deduplicates data from HA pairs
        │
        ▼
Regional Clusters
  US-EAST: Prometheus HA pair (scrapes 1000 targets, 2M active series, 48h local retention)
           └── Thanos Sidecar uploads blocks to S3
  US-WEST: Prometheus HA pair (scrapes 800 targets, 1.5M active series, 48h local retention)
           └── Thanos Sidecar uploads blocks to S3
        │
        ▼
Object Storage (S3/GCS)
  ├── 2-year retention
  ├── Compacted and downsampled blocks
  └── Cost: ~$0.02/GB/month
        │
        ▼
Thanos Compactor
  ├── Compacts small blocks
  ├── Downsamples old data
  └── Manages retention/deletion

Scale numbers:
  - 10 clusters × 500K series = 5M series globally
  - 15-second scrape interval ≈ 333K samples/sec
  - Raw storage: ~50GB/day compressed
  - After downsampling: ~5GB/day for historical data
```
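The scale numbers above follow from simple arithmetic. A quick back-of-the-envelope check, where the bytes-per-sample figure is an assumption in the range commonly cited for compressed time-series storage:

```python
# Back-of-the-envelope check of the scale numbers above.
# bytes_per_sample is an assumed average; ~1-2 bytes is typical for compressed TSDB samples.

clusters = 10
series_per_cluster = 500_000
scrape_interval_s = 15
bytes_per_sample = 1.7           # assumption

active_series = clusters * series_per_cluster           # 5,000,000 series globally
samples_per_sec = active_series / scrape_interval_s     # ~333,333 samples/sec
raw_bytes_per_day = samples_per_sec * 86_400 * bytes_per_sample

print(f"{active_series:,} active series")
print(f"{samples_per_sec:,.0f} samples/sec")
print(f"{raw_bytes_per_day / 1e9:.0f} GB/day raw (compressed)")   # ≈ 49 GB/day
```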
Metrics are only valuable if they trigger action. The alerting layer transforms passive data collection into active incident response. A well-designed alerting architecture minimizes noise while ensuring critical issues never go unnoticed.

Alerting Pipeline Components:
Alerting Data Flow:

```
ALERT RULES (define conditions)
  Prometheus Alert Rules / InfluxDB Tasks
    ├── Threshold: cpu > 90% for 5 minutes
    ├── Rate of change: error_rate increase > 10x in 1 minute
    ├── Absence: no heartbeat for 2 minutes
    └── Anomaly: response_time > 3σ from baseline
        │  (firing/resolved events)
        ▼
ALERT MANAGER (deduplication, routing, inhibition)
  Alertmanager / PagerDuty Event Orchestration
    ├── GROUP: combine similar alerts
    │     "100 CPU alerts" → "CPU alert (100 instances)"
    ├── DEDUPLICATE: suppress duplicate notifications
    ├── INHIBIT: suppress child alerts when the parent fires
    │     "datacenter-down" inhibits individual "server-down"
    ├── SILENCE: temporarily mute known issues
    └── ROUTE: send to the appropriate team/channel
        │  (routed notifications)
        ▼
NOTIFICATION CHANNELS
  PagerDuty (critical) | Slack (warning) | Email (info) | Webhooks (automation)
```
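To make the rule-evaluation stage concrete, here is a minimal conceptual sketch of how a threshold rule with a "for" duration behaves: poll the TSDB's query API on a fixed interval and fire only once the condition has held continuously. This illustrates the idea rather than how Prometheus or Alertmanager is actually implemented; the localhost URL, rule expression, and thresholds are assumptions:

```python
# Conceptual sketch of threshold alert evaluation with a "for" duration.
# Not the actual Prometheus/Alertmanager implementation; URL and rule are illustrative.
import time
import requests

PROMETHEUS_URL = "http://localhost:9090"   # assumes a Prometheus server on the default port
RULE_EXPR = 'avg(system_cpu_usage_percent{aggregate="total"})'
THRESHOLD = 90.0
FOR_SECONDS = 300                          # condition must hold for 5 minutes before firing

def query_instant(expr: str) -> float:
    """Evaluate a PromQL expression via the standard /api/v1/query endpoint."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

pending_since = None
while True:
    value = query_instant(RULE_EXPR)
    if value > THRESHOLD:
        pending_since = pending_since or time.time()
        if time.time() - pending_since >= FOR_SECONDS:
            print(f"FIRING: CPU at {value:.1f}% for over {FOR_SECONDS}s")  # hand off to Alertmanager
    else:
        if pending_since is not None:
            print("RESOLVED")
        pending_since = None
    time.sleep(15)   # evaluation interval
```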
Alert Design Best Practices:

The most dangerous failure mode isn't missing alerts; it's alert fatigue. When engineers receive 50 alerts per day, they stop reading them. Critical alerts get lost in noise. Invest heavily in alert quality: every alert should be actionable, and every page should be urgent.
Understanding how leading companies implement metrics infrastructure provides invaluable patterns for your own architecture.
Case Study 1: Netflix (Custom InfluxDB-based System)
Netflix ingests over 2 billion metrics per minute from its global streaming infrastructure. Their system, Atlas, is a custom in-memory time-series database optimized for their specific requirements: recent data is held in memory for fast dashboard and alerting queries, while older data is tiered out to S3.
Case Study 2: Uber (M3DB)
Uber built M3DB, an open-source distributed time-series database, to handle the scale of its metrics platform: roughly 500 million active series and on the order of 10 billion data points written per day.
M3DB uses a sharded architecture with consistent hashing, strong write availability, and configurable consistency for reads.
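Consistent hashing is what allows storage nodes to join or leave while remapping only a small fraction of series. The following is an illustrative hash ring, not M3DB's actual implementation; the hash function and virtual-node count are arbitrary choices for the sketch:

```python
# Illustrative consistent-hash ring for sharding series across storage nodes.
# Not M3DB's actual implementation; hash choice and virtual-node count are arbitrary.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each physical node gets many virtual positions on the ring.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, series_id: str) -> str:
        """Map a series to the first node clockwise from its hash position."""
        idx = bisect.bisect(self.keys, self._hash(series_id)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["m3db-node-1", "m3db-node-2", "m3db-node-3"])
print(ring.node_for('http_requests_total{method="GET",handler="/api/users"}'))
```

Virtual nodes smooth out the distribution so that each physical node owns many small arcs of the ring, which keeps shard sizes roughly even as nodes are added or removed.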
Case Study 3: GitLab (Prometheus + Thanos)
GitLab runs one of the largest public Prometheus deployments: more than 10 million active series, federated behind Thanos and backed by object storage for long-term retention.
| Company | Scale | Technology | Key Innovation |
|---|---|---|---|
| Netflix | 2B metrics/min | Atlas (custom) | In-memory + S3 tiering |
| Uber | 500M series, 10B points/day | M3DB | Distributed, high availability |
| GitLab | 10M+ series | Prometheus + Thanos | Federation + object storage |
| Cloudflare | 72M HTTP req/sec | ClickHouse | OLAP for metrics analytics |
| Datadog | Trillions of points/day | Custom TSDB | Multi-tenant SaaS |
We've explored the complete lifecycle of metrics data, from collection to alerting. The key insights: metrics are the first-response pillar of observability, with logs and traces reserved for root cause analysis; the metric type you choose (counter, gauge, histogram, summary) determines what you can query later; the RED and USE methods plus a handful of business metrics cover most monitoring needs; scaling beyond a single cluster relies on federation, aggregation, and object-storage tiering with downsampling; and alert quality matters far more than alert quantity.
What's Next:
Having explored metrics and monitoring, we'll turn to Retention Policies—the strategies for managing the lifecycle of time-series data, balancing storage costs against query requirements, and implementing intelligent data aging.
You now understand the complete metrics and monitoring lifecycle—from metric types and collection to scaling strategies and alerting. This knowledge forms the operational foundation for running reliable, observable systems at any scale.