At 2:47 AM, a payment processing service begins rejecting transactions. Within 60 seconds, an on-call engineer receives a PagerDuty alert. By 2:49 AM, they've identified the root cause: database connection pool exhaustion triggered by a memory leak in a deployment three hours earlier. By 2:52 AM, the faulty deployment is rolled back. Total customer impact: five minutes of partial outage.
This response—identification in under 2 minutes, resolution in under 5—is only possible because of metrics and monitoring infrastructure. Behind the scenes, thousands of metrics per second flow from application servers to collection agents, through processing pipelines, into time-series databases, and finally to alerting engines. This observability infrastructure is as critical as the application it monitors.
Time-series databases are the beating heart of this infrastructure. They store the billions of data points that describe system health, enable the queries that power dashboards and alerts, and retain the historical data that enables incident postmortems and capacity planning.
By the end of this page, you will understand the complete metrics and monitoring pipeline: from metric types and collection patterns, through the observability stack architecture, to real-world deployment patterns used by companies processing trillions of data points daily.
Modern observability rests on three complementary data types, each providing a different lens into system behavior. Understanding how these pillars relate to time-series databases is fundamental to architectural decisions.
1. Metrics:
Metrics are numerical measurements collected at regular intervals. They answer questions like "What is the current CPU usage?" or "How many requests per second are we processing?" Metrics are compact, cheap to store at high resolution, and easy to aggregate, which makes them the natural fit for a time-series database.
2. Logs:
Logs are unstructured or semi-structured text records of discrete events. They answer questions like "What error message did this request produce?" Logs are verbose and very high in cardinality, so they are typically kept in a dedicated log aggregation system, with only derived metrics flowing into the time-series database.
3. Traces:
Traces follow individual requests across distributed services. They answer questions like "Where did this slow request spend its time?" Traces are also high in cardinality and usually sampled, so they live in a dedicated trace store, again with only derived metrics landing in the time-series database.
| Pillar | Primary Storage | TSDB Role | Cardinality |
|---|---|---|---|
| Metrics | Time-Series Database | Primary | Low-Medium (millions) |
| Logs | Log aggregation system | Derived metrics only | Very High (billions) |
| Traces | Trace storage system | Derived metrics only | Very High (billions) |
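To make the contrast concrete, here is an illustrative sketch of what each pillar might capture for the same failing request. The identifiers and formats are invented for this example rather than taken from any particular tool:

```python
# Illustrative only: what each observability pillar might record for one failing HTTP request.
# Names, IDs, and formats are assumptions made for this sketch, not any vendor's schema.

# METRIC: one aggregated sample in Prometheus exposition style (low cardinality).
metric_sample = 'http_requests_total{method="GET", status="500", handler="/api/users"} 90'

# LOG: one discrete event with request-specific detail (high cardinality).
log_line = (
    '2024-03-14T02:47:13Z ERROR request_id=7f3a handler=/api/users '
    'msg="timeout acquiring database connection from pool"'
)

# TRACE SPAN: where this particular request spent its time across services.
trace_span = {
    "trace_id": "7f3a",           # ties the span to the same request as the log line
    "span_name": "SELECT users",
    "service": "payments-db",
    "duration_ms": 5003,          # the slow step this request spent its time on
    "status": "error",
}
```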
In incident response, metrics typically provide the first indication that something is wrong—alerting on latency spikes, error rate increases, or resource exhaustion. Logs and traces are then used for root cause analysis. This is why investing in robust metrics infrastructure pays immediate dividends in operational reliability.
Not all metrics are created equal. Different types of measurements require different handling, storage, and query patterns. Understanding metric types is essential for schema design and efficient querying.
Core Metric Types:
```
# COUNTER: Total HTTP requests (only increases)
# TYPE http_requests_total counter
http_requests_total{method="GET", status="200", handler="/api/users"} 145678
http_requests_total{method="POST", status="201", handler="/api/users"} 23456
http_requests_total{method="GET", status="500", handler="/api/users"} 89

# Query: Request rate per second over last 5 minutes
# rate(http_requests_total[5m])

# GAUGE: Current memory usage (can go up or down)
# TYPE memory_usage_bytes gauge
memory_usage_bytes{instance="web-01"} 1073741824
memory_usage_bytes{instance="web-02"} 2147483648

# Query: Average memory across instances
# avg(memory_usage_bytes)

# HISTOGRAM: Request latency distribution
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.01"} 10234   # ≤10ms
http_request_duration_seconds_bucket{le="0.05"} 45678   # ≤50ms
http_request_duration_seconds_bucket{le="0.1"} 52341    # ≤100ms
http_request_duration_seconds_bucket{le="0.5"} 54123    # ≤500ms
http_request_duration_seconds_bucket{le="1.0"} 54567    # ≤1s
http_request_duration_seconds_bucket{le="+Inf"} 54890   # Total count
http_request_duration_seconds_sum 4521.3                # Sum of all durations
http_request_duration_seconds_count 54890                # Total observations

# Query: 95th percentile latency
# histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# SUMMARY: Pre-computed percentiles (computed client-side)
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.5"} 0.042    # P50 = 42ms
rpc_duration_seconds{quantile="0.9"} 0.089    # P90 = 89ms
rpc_duration_seconds{quantile="0.99"} 0.234   # P99 = 234ms
rpc_duration_seconds_sum 8723.4
rpc_duration_seconds_count 194532
```

Histograms require more storage (one series per bucket) but allow server-side percentile calculation and aggregation across instances. Summaries are more compact, but percentiles are computed client-side and cannot be aggregated. For most use cases, histograms are preferred for their flexibility.
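As a concrete example of how these types are produced in application code, here is a minimal sketch using the Python prometheus_client library. The metric names and bucket boundaries mirror the exposition above; the values and call sites are illustrative:

```python
# Minimal sketch of the four metric types with the Python prometheus_client library.
# Values, labels, and the request loop are illustrative; buckets mirror the example above.
import time
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "status", "handler"],
)
MEMORY = Gauge("memory_usage_bytes", "Current memory usage", ["instance"])
LATENCY = Histogram(
    "http_request_duration_seconds", "HTTP request latency",
    buckets=(0.01, 0.05, 0.1, 0.5, 1.0),
)
RPC_LATENCY = Summary("rpc_duration_seconds", "RPC latency")

def handle_request() -> None:
    # Counter: only ever increments.
    REQUESTS.labels(method="GET", status="200", handler="/api/users").inc()
    # Gauge: set to the current value; can go up or down.
    MEMORY.labels(instance="web-01").set(1_073_741_824)
    # Histogram: each observation increments the matching bucket plus _sum and _count.
    LATENCY.observe(0.042)
    # Summary: tracked client-side (this Python client exports count and sum, not quantiles).
    RPC_LATENCY.observe(0.042)

if __name__ == "__main__":
    start_http_server(8000)   # exposes /metrics for a Prometheus server to scrape
    while True:
        handle_request()
        time.sleep(1)
```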
Metrics don't magically appear in your time-series database. A sophisticated pipeline collects, transforms, and routes data from thousands of sources to centralized storage. Understanding this pipeline is essential for designing reliable observability infrastructure.
Complete Metrics Collection Pipeline:

```
SOURCES (where metrics originate)
  Application metrics (SDKs) | System metrics (node_exporter) | Network devices (SNMP) | Cloud APIs (AWS, GCP)
        │
        ▼
COLLECTION (gathering and initial processing)
  Collection agents, per-host or centralized: Telegraf / Prometheus / OpenTelemetry Collector
    ├── Scrape metrics from endpoints (pull model)
    ├── Receive pushed metrics (push model)
    ├── Parse and validate
    ├── Add metadata (host, region, environment)
    └── Buffer during downstream outages
        │
        ▼
AGGREGATION (optional: reduce cardinality and volume)
  Pre-aggregation before storage (examples: Prometheus federation, M3 Aggregator, Cortex)
    ├── Reduce scrape frequency (15s → 1m)
    ├── Drop unnecessary labels
    ├── Compute rollups (sum, avg, max per minute)
    └── Filter high-cardinality series
        │
        ▼
STORAGE (time-series database cluster)
  InfluxDB / TimescaleDB / Prometheus / M3DB / VictoriaMetrics
    ├── High-throughput ingestion
    ├── Compression and retention
    ├── Query engine for dashboards
    └── Long-term storage (tiering to S3)
        │
        ▼
CONSUMPTION (visualization and alerting)
  Dashboards (Grafana) | Alerting (Alertmanager, PagerDuty) | Analytics/Reporting (Business Intelligence)
```

Pull vs Push: Collection Models
Two fundamental paradigms exist for metrics collection:
Pull Model (Prometheus-style): the monitoring server periodically scrapes an HTTP endpoint (typically /metrics) exposed by each target. The server controls the schedule, can tell immediately when a target stops responding, and relies on service discovery to know what to scrape.
Push Model (InfluxDB/StatsD-style): each application or agent actively sends measurements to a collector endpoint. This suits short-lived batch jobs and sources that cannot be scraped, but the backend must absorb bursts and cannot easily distinguish a silent source from a dead one.
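A minimal sketch contrasting the two models, assuming the Python prometheus_client library on the pull side and a local StatsD agent listening on UDP port 8125 on the push side (both assumptions for illustration):

```python
# Pull vs push, side by side. The StatsD address and port are illustrative assumptions.
import socket
import time
from prometheus_client import Counter, start_http_server

# PULL: the application only *exposes* metrics; a Prometheus server scrapes
# http://<this-host>:8000/metrics on its own schedule.
SCRAPED_REQUESTS = Counter("http_requests_total", "Total HTTP requests")
start_http_server(8000)

# PUSH: the application actively sends each measurement to a collector,
# here using the plain-text StatsD line protocol over UDP.
statsd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
STATSD_ADDR = ("127.0.0.1", 8125)   # assumed local StatsD agent

def record_request(duration_ms: float) -> None:
    SCRAPED_REQUESTS.inc()                                        # pull side: update local state
    statsd.sendto(b"http_requests_total:1|c", STATSD_ADDR)        # push side: counter increment
    statsd.sendto(f"http_request_duration:{duration_ms}|ms".encode(), STATSD_ADDR)  # timing sample

while True:
    record_request(duration_ms=42.0)
    time.sleep(1)
```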
Deciding what to measure is as important as how to measure it. Industry experience has converged on standard metrics that provide comprehensive visibility into system health.
The RED Method (for request-driven services): Rate (requests per second), Errors (failed requests per second), and Duration (the distribution of request latency).
The USE Method (for resources like CPU, memory, disk): Utilization (how busy the resource is), Saturation (how much work is queued waiting for it), and Errors (error counts).
```
# INFRASTRUCTURE METRICS (USE Method)
# CPU
system_cpu_usage_percent{cpu="0", mode="user|system|idle|iowait"}
system_cpu_usage_percent{aggregate="total"}

# Memory
system_memory_used_bytes
system_memory_available_bytes
system_memory_cached_bytes
system_swap_used_bytes

# Disk
system_disk_used_bytes{device="/dev/sda1"}
system_disk_io_read_bytes_total{device="/dev/sda1"}
system_disk_io_write_bytes_total{device="/dev/sda1"}
system_disk_io_time_seconds_total{device="/dev/sda1"}

# Network
system_network_bytes_total{interface="eth0", direction="rx|tx"}
system_network_packets_total{interface="eth0", direction="rx|tx"}
system_network_errors_total{interface="eth0", direction="rx|tx"}

# APPLICATION METRICS (RED Method)
# HTTP Server
http_requests_total{method, status, handler}
http_request_duration_seconds{method, handler}   # histogram
http_requests_in_flight{handler}                 # gauge

# Database Connections
db_connections_open{database="postgres"}
db_connections_max{database="postgres"}
db_query_duration_seconds{operation="select|insert|update|delete"}

# Queue Processing
queue_messages_total{queue="orders", status="received|processed|failed"}
queue_depth{queue="orders"}
queue_oldest_message_age_seconds{queue="orders"}

# External Dependencies
external_request_duration_seconds{service="payment-api", method, status}
external_request_total{service="payment-api", method, status}
circuit_breaker_state{service="payment-api"}   # 0=closed, 1=open, 2=half-open

# BUSINESS METRICS
orders_created_total{product_type, region}
revenue_cents_total{product_type, region}
user_registrations_total{source}
payment_processing_duration_seconds
```

Google's SRE book defines four golden signals: Latency, Traffic, Errors, and Saturation. These overlap significantly with RED and USE. The key insight is that a small set of well-chosen metrics provides the majority of observability value; start with these fundamentals before adding specialized metrics.
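In practice, the RED metrics above are usually recorded by shared middleware rather than hand-written in every handler. Here is a minimal sketch using the Python prometheus_client library; the decorator name and the simplified 200/500 status handling are assumptions made for the example:

```python
# Sketch of RED-method instrumentation as a reusable decorator.
# Metric names follow the conventions listed above; status handling is deliberately simplified.
import time
from functools import wraps
from prometheus_client import Counter, Histogram

REQUESTS = Counter("http_requests_total", "Total requests", ["handler", "status"])
DURATION = Histogram("http_request_duration_seconds", "Request latency", ["handler"])

def red_instrumented(handler_name: str):
    """Record Rate, Errors, and Duration for the wrapped request handler."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                REQUESTS.labels(handler=handler_name, status="200").inc()   # Rate
                return result
            except Exception:
                REQUESTS.labels(handler=handler_name, status="500").inc()   # Errors
                raise
            finally:
                DURATION.labels(handler=handler_name).observe(
                    time.perf_counter() - start                             # Duration
                )
        return wrapper
    return decorator

@red_instrumented("/api/users")
def list_users():
    return ["alice", "bob"]
```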
As organizations grow, metrics infrastructure faces severe scaling challenges. A 100-server deployment might generate 50,000 metrics per second; a 10,000-server deployment generates 5 million. Naive architectures collapse under this load.
Scaling Strategies: the architecture below combines the main techniques in use today, federating many scraping clusters behind a global query layer, running Prometheus in HA pairs and deduplicating their data at query time, tiering long-term data to cheap object storage, and downsampling and compacting old blocks to keep storage and query costs bounded.
Enterprise-Scale Metrics Architecture:

```
Global Query Layer (Thanos Query / Cortex Querier)
  ├── Queries across all regional clusters
  └── Deduplicates data from HA pairs
        │
        ▼
Regional Clusters
  US-EAST: Prometheus HA pair (scrapes 1000 targets, 2M active series, 48h local retention)
           └── Thanos Sidecar uploads blocks to S3
  US-WEST: Prometheus HA pair (scrapes 800 targets, 1.5M active series, 48h local retention)
           └── Thanos Sidecar uploads blocks to S3
        │
        ▼
Object Storage (S3/GCS)
  ├── 2-year retention
  ├── Compacted and downsampled blocks
  └── Cost: ~$0.02/GB/month
        │
        ▼
Thanos Compactor
  ├── Compacts small blocks
  ├── Downsamples old data
  └── Manages retention/deletion

Scale numbers:
  - 10 clusters × 500K series = 5M series globally
  - 15-second scrape interval ≈ 333K samples/sec
  - Raw storage: ~50GB/day compressed
  - After downsampling: ~5GB/day for historical data
```
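The scale numbers above follow from simple arithmetic. A quick back-of-the-envelope check, where the bytes-per-sample figure is an assumption in the range commonly cited for compressed time-series storage:

```python
# Back-of-the-envelope check of the scale numbers above.
# bytes_per_sample is an assumed average; ~1-2 bytes is typical for compressed TSDB samples.

clusters = 10
series_per_cluster = 500_000
scrape_interval_s = 15
bytes_per_sample = 1.7           # assumption

active_series = clusters * series_per_cluster           # 5,000,000 series globally
samples_per_sec = active_series / scrape_interval_s     # ~333,333 samples/sec
raw_bytes_per_day = samples_per_sec * 86_400 * bytes_per_sample

print(f"{active_series:,} active series")
print(f"{samples_per_sec:,.0f} samples/sec")
print(f"{raw_bytes_per_day / 1e9:.0f} GB/day raw (compressed)")   # ≈ 49 GB/day
```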
Metrics are only valuable if they trigger action. The alerting layer transforms passive data collection into active incident response. A well-designed alerting architecture minimizes noise while ensuring critical issues never go unnoticed.

Alerting Pipeline Components:
Alerting Data Flow:

```
ALERT RULES (define conditions)
  Prometheus Alert Rules / InfluxDB Tasks
    ├── Threshold: cpu > 90% for 5 minutes
    ├── Rate of change: error_rate increase > 10x in 1 minute
    ├── Absence: no heartbeat for 2 minutes
    └── Anomaly: response_time > 3σ from baseline
        │  (firing/resolved events)
        ▼
ALERT MANAGER (deduplication, routing, inhibition)
  Alertmanager / PagerDuty Event Orchestration
    ├── GROUP: combine similar alerts
    │     "100 CPU alerts" → "CPU alert (100 instances)"
    ├── DEDUPLICATE: suppress duplicate notifications
    ├── INHIBIT: suppress child alerts when the parent fires
    │     "datacenter-down" inhibits individual "server-down"
    ├── SILENCE: temporarily mute known issues
    └── ROUTE: send to the appropriate team/channel
        │  (routed notifications)
        ▼
NOTIFICATION CHANNELS
  PagerDuty (critical) | Slack (warning) | Email (info) | Webhooks (automation)
```
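To make the rule-evaluation stage concrete, here is a minimal conceptual sketch of how a threshold rule with a "for" duration behaves: poll the TSDB's query API on a fixed interval and fire only once the condition has held continuously. This illustrates the idea rather than how Prometheus or Alertmanager is actually implemented; the localhost URL, rule expression, and thresholds are assumptions:

```python
# Conceptual sketch of threshold alert evaluation with a "for" duration.
# Not the actual Prometheus/Alertmanager implementation; URL and rule are illustrative.
import time
import requests

PROMETHEUS_URL = "http://localhost:9090"   # assumes a Prometheus server on the default port
RULE_EXPR = 'avg(system_cpu_usage_percent{aggregate="total"})'
THRESHOLD = 90.0
FOR_SECONDS = 300                          # condition must hold for 5 minutes before firing

def query_instant(expr: str) -> float:
    """Evaluate a PromQL expression via the standard /api/v1/query endpoint."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

pending_since = None
while True:
    value = query_instant(RULE_EXPR)
    if value > THRESHOLD:
        pending_since = pending_since or time.time()
        if time.time() - pending_since >= FOR_SECONDS:
            print(f"FIRING: CPU at {value:.1f}% for over {FOR_SECONDS}s")  # hand off to Alertmanager
    else:
        if pending_since is not None:
            print("RESOLVED")
        pending_since = None
    time.sleep(15)   # evaluation interval
```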
Alert Design Best Practices:

The most dangerous failure mode isn't missing alerts; it's alert fatigue. When engineers receive 50 alerts per day, they stop reading them. Critical alerts get lost in noise. Invest heavily in alert quality: every alert should be actionable, and every page should be urgent.
Understanding how leading companies implement metrics infrastructure provides invaluable patterns for your own architecture.
Case Study 1: Netflix (Custom InfluxDB-based System)
Netflix ingests over 2 billion metrics per minute from its global streaming infrastructure. Their system, Atlas, is a custom in-memory time-series database optimized for their specific requirements: recent data is held in memory for fast dashboard and alerting queries, while older data is tiered out to S3.
Case Study 2: Uber (M3DB)
Uber built M3DB, an open-source distributed time-series database, to handle the scale of its metrics platform: roughly 500 million active series and on the order of 10 billion data points written per day.
M3DB uses a sharded architecture with consistent hashing, strong write availability, and configurable consistency for reads.
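Consistent hashing is what allows storage nodes to join or leave while remapping only a small fraction of series. The following is an illustrative hash ring, not M3DB's actual implementation; the hash function and virtual-node count are arbitrary choices for the sketch:

```python
# Illustrative consistent-hash ring for sharding series across storage nodes.
# Not M3DB's actual implementation; hash choice and virtual-node count are arbitrary.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each physical node gets many virtual positions on the ring.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, series_id: str) -> str:
        """Map a series to the first node clockwise from its hash position."""
        idx = bisect.bisect(self.keys, self._hash(series_id)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["m3db-node-1", "m3db-node-2", "m3db-node-3"])
print(ring.node_for('http_requests_total{method="GET",handler="/api/users"}'))
```

Virtual nodes smooth out the distribution so that each physical node owns many small arcs of the ring, which keeps shard sizes roughly even as nodes are added or removed.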
Case Study 3: GitLab (Prometheus + Thanos)
GitLab runs one of the largest public Prometheus deployments: more than 10 million active series, federated behind Thanos and backed by object storage for long-term retention.
| Company | Scale | Technology | Key Innovation |
|---|---|---|---|
| Netflix | 2B metrics/min | Atlas (custom) | In-memory + S3 tiering |
| Uber | 500M series, 10B points/day | M3DB | Distributed, high availability |
| GitLab | 10M+ series | Prometheus + Thanos | Federation + object storage |
| Cloudflare | 72M HTTP req/sec | ClickHouse | OLAP for metrics analytics |
| Datadog | Trillions of points/day | Custom TSDB | Multi-tenant SaaS |
We've explored the complete lifecycle of metrics data, from collection to alerting. The key insights: metrics are the first-response pillar of observability, with logs and traces reserved for root cause analysis; the metric type you choose (counter, gauge, histogram, summary) determines what you can query later; the RED and USE methods plus a handful of business metrics cover most monitoring needs; scaling beyond a single cluster relies on federation, aggregation, and object-storage tiering with downsampling; and alert quality matters far more than alert quantity.
What's Next:
Having explored metrics and monitoring, we'll turn to Retention Policies—the strategies for managing the lifecycle of time-series data, balancing storage costs against query requirements, and implementing intelligent data aging.
You now understand the complete metrics and monitoring lifecycle—from metric types and collection to scaling strategies and alerting. This knowledge forms the operational foundation for running reliable, observable systems at any scale.