Imagine you're a physician examining a patient. You don't rely on the patient saying 'I feel okay.' You measure—heart rate, blood pressure, temperature, oxygen saturation. These numerical measurements tell an objective story that symptoms alone cannot convey. They reveal patterns, indicate trends, and provide early warning signs before a crisis manifests.
Metrics serve the same purpose for software systems. They are the vital signs of your applications and infrastructure—quantitative measurements that capture the state, behavior, and performance of your systems over time. Without metrics, operating a distributed system is like flying blind: you're guessing at what's happening rather than knowing.
On this page, we explore metrics as the first and perhaps most fundamental pillar of observability. We'll examine what metrics are, why they matter, the different types of metrics, how to collect and store them, and the principles that guide effective metrics design at scale.
By the end of this page, you will understand metrics fundamentally—not just as numbers emitted by monitoring tools, but as a carefully designed language for expressing system behavior. You'll learn the taxonomy of metric types, internalize best practices for naming and labeling, understand cardinality dangers, and appreciate how metrics form the quantitative backbone of observability.
At their core, metrics are numerical measurements collected at regular intervals over time. Each metric captures a specific aspect of system behavior—how many requests were processed, how long operations took, how much memory is being used, how many errors occurred.
Formally, a metric consists of:
- A name that identifies what is being measured (e.g., http_requests_total, memory_usage_bytes)
- A numeric value (e.g., 42, 1024.5)
- A set of labels, key-value pairs that add dimensions (e.g., method="GET", status="200", service="checkout")
- A timestamp marking when the value was observed

Together, these elements form a time series: a sequence of values ordered by time, identified by a unique combination of name and labels.
```
# Anatomy of a metric data point
# ==============================

# General structure:
# <metric_name>{<label_name>=<label_value>, ...} <value> [<timestamp>]

# Example: HTTP request counter
http_requests_total{method="GET", status="200", service="api-gateway"} 142857 1704672000000

# Breakdown:
# - Metric name: http_requests_total
# - Labels: method="GET", status="200", service="api-gateway"
# - Value: 142857 (total requests matching these labels)
# - Timestamp: 1704672000000 (Unix milliseconds)

# This creates a unique time series. Different label combinations
# create different time series:

http_requests_total{method="GET", status="200", service="api-gateway"} 142857
http_requests_total{method="GET", status="404", service="api-gateway"} 532
http_requests_total{method="POST", status="201", service="api-gateway"} 89421
http_requests_total{method="POST", status="500", service="api-gateway"} 17

# Each line above is a separate time series, even though they share
# the same metric name.
```

Metrics are typically stored in Time Series Databases (TSDBs) optimized for append-heavy workloads with time-based queries. Systems like Prometheus, InfluxDB, VictoriaMetrics, and TimescaleDB are designed specifically for this access pattern—writing millions of data points per second and querying across large time ranges efficiently.
Why numerical measurements matter:
Metrics provide something that other observability signals cannot: aggregation and mathematical analysis. You can compute averages, percentiles, rates of change, standard deviations, and correlations. You can set thresholds and trigger alerts when values cross boundaries. You can visualize trends over days, weeks, or months.
This quantitative nature makes metrics ideal for alerting on thresholds, building dashboards, tracking trends over time, and planning capacity.
Not all metrics behave the same way. The type of measurement you're making determines how the metric should be collected, stored, aggregated, and queried. Understanding metric types is essential for effective observability.
There are four fundamental metric types, each with distinct semantics:
| Type | What It Measures | Key Characteristic | Example Use Cases |
|---|---|---|---|
| Counter | Cumulative totals that only increase | Monotonically increasing (only goes up, never down) | Request count, bytes transferred, errors, completed jobs |
| Gauge | Current value that can go up or down | Point-in-time snapshot, can increase or decrease | Temperature, memory usage, queue depth, active connections |
| Histogram | Distribution of values in buckets | Records observations in configurable ranges | Request latency, response sizes, batch sizes |
| Summary | Client-calculated percentiles | Computes quantiles client-side before exposition | Latency percentiles (p50, p95, p99) |
Let's explore each type in detail:
Counters: the raw value of a counter is rarely meaningful on its own; query it with rate() or increase() to understand 'how many per second?' For example, http_requests_total might show 1,000,000 requests. But what matters is whether that's 100/sec (normal) or 10,000/sec (traffic spike). By convention, counter names end in _total (e.g., errors_total, bytes_received_total).

Histograms: each histogram exposes per-bucket counters (_bucket), the total sum of observations (_sum), and the count of observations (_count), as shown in the example below.
```
# Histogram for HTTP request duration (in seconds)
# The _bucket counters show how many requests fell into each bucket

http_request_duration_seconds_bucket{le="0.005"} 24054    # ≤ 5ms
http_request_duration_seconds_bucket{le="0.01"} 33445     # ≤ 10ms
http_request_duration_seconds_bucket{le="0.025"} 100392   # ≤ 25ms
http_request_duration_seconds_bucket{le="0.05"} 129389    # ≤ 50ms
http_request_duration_seconds_bucket{le="0.1"} 133988     # ≤ 100ms
http_request_duration_seconds_bucket{le="0.25"} 134890    # ≤ 250ms
http_request_duration_seconds_bucket{le="0.5"} 135085     # ≤ 500ms
http_request_duration_seconds_bucket{le="1"} 135121       # ≤ 1s
http_request_duration_seconds_bucket{le="+Inf"} 135123    # All requests

http_request_duration_seconds_sum 4503.2     # Total seconds
http_request_duration_seconds_count 135123   # Total requests

# From this, you can compute:
# - Average latency: _sum / _count = 33.3ms
# - p50 (median): falls in the 10-25ms bucket (interpolates to ≈18ms)
# - p99: falls in the 50-100ms bucket (interpolates to ≈98ms)
```

Well-designed metrics are self-documenting. A well-named metric with appropriate labels tells you exactly what it measures without requiring external documentation. Poor naming leads to confusion, misinterpretation, and eventually, metrics that nobody trusts.
Naming best practices:
- Use a consistent prefix (namespace) such as myapp_, http_, db_pool_. This prevents collisions and groups related metrics.
- Name the thing being measured, not the mechanism: request_duration_seconds is better than request_timer. The former is clear; the latter requires context.
- Include the unit in the name: response_size_bytes, request_duration_seconds, temperature_celsius. Never make users guess the unit.
- Suffix counters with _total (e.g., http_requests_total, errors_total). This signals the metric type to readers.
- Use snake_case throughout: http_request_duration_seconds, not httpRequestDurationSeconds or http-request-duration-seconds.

Labels multiply the number of time series. A metric with labels for service, method, status, and endpoint creates a unique time series for every combination: 10 services × 5 methods × 20 status codes × 100 endpoints = 100,000 time series from a single metric. High cardinality destroys database performance and increases costs exponentially.
Label design principles:
Labels add dimensionality to metrics, enabling powerful queries like 'show error rates by service and endpoint.' But with this power comes responsibility:
```
# GOOD: Bounded cardinality labels
http_requests_total{
  service="checkout",       # Limited number of services
  method="POST",            # ~10 HTTP methods
  status="201",             # ~50 common status codes
  endpoint="/api/orders"    # Known, enumerable endpoints
}

# BAD: Unbounded cardinality labels - DON'T DO THIS
http_requests_total{
  user_id="user_12345",                           # DANGER: Millions of users
  request_id="abc-123-xyz",                       # DANGER: Unique per request
  client_ip="192.168.1.42",                       # DANGER: Many IPs
  error_message="Connection refused to host..."   # DANGER: Free text
}

# The bad example would create a new time series for every
# unique combination, quickly overwhelming your TSDB.
```

There are two fundamental models for collecting metrics: push and pull. Each has distinct trade-offs that influence architecture decisions.
Pull-based collection (Prometheus model):
In the pull model, applications expose a metrics endpoint (typically /metrics), and a central collector periodically scrapes these endpoints. This is the model pioneered by Prometheus and widely adopted in cloud-native environments.
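A minimal pull-side sketch in Go (assuming the github.com/prometheus/client_golang library; the port is an arbitrary choice for this example) shows how little the application has to do:

```go
// Minimal pull-model instrumentation: serve /metrics and let the
// collector scrape it. The default registry already includes Go runtime
// and process collectors, so useful metrics appear with no extra code.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```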
The application's only responsibility is to expose /metrics with current values; the collector handles scheduling, target discovery, and storage.

Push-based collection (StatsD/OpenTelemetry model):
In the push model, applications actively send metrics to a central collector or aggregator. This is common with StatsD, Graphite, and increasingly with OpenTelemetry collectors.
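To illustrate the push side, the sketch below emits metrics in the plain StatsD text protocol over UDP using only the standard library; the agent address and metric names are assumptions made for this example:

```go
// Push-model sketch: fire-and-forget StatsD lines over UDP.
// The StatsD text format is "<name>:<value>|<type>" - counters (c),
// gauges (g), and timings in milliseconds (ms).
package main

import (
	"fmt"
	"net"
)

func main() {
	// Address of a local StatsD-compatible agent (assumed for the example).
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	fmt.Fprintf(conn, "checkout.http_requests:1|c\n")      // increment a counter
	fmt.Fprintf(conn, "checkout.queue_depth:42|g\n")       // set a gauge
	fmt.Fprintf(conn, "checkout.request_duration:87|ms\n") // record a timing
}
```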
| Aspect | Pull (Prometheus) | Push (StatsD/OTLP) |
|---|---|---|
| Target discovery | Collector discovers targets | Targets know collector address |
| Firewall-friendliness | Collector must reach targets | Targets push out (easier) |
| Short-lived jobs | Difficult (Pushgateway needed) | Natural fit |
| Failure detection | Missing scrape = target down | Requires separate health checks |
| Network traffic | Predictable intervals | Can be bursty |
| Application complexity | Expose endpoint only | Push logic + retry + buffering |
OpenTelemetry is emerging as the standard for metrics (and traces, and logs) collection. It provides a vendor-neutral SDK and collector that can export to any backend (Prometheus, Datadog, New Relic, etc.). OTel supports both push and pull semantics, giving you flexibility without vendor lock-in.
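As a sketch of what vendor-neutral instrumentation looks like (assuming the OpenTelemetry Go API; the global meter is a no-op until a MeterProvider and exporter are configured elsewhere, and the instrument and attribute names here are illustrative):

```go
// OpenTelemetry metrics sketch: the instrument is created against the
// global MeterProvider, so the same code can export to Prometheus, an
// OTLP collector, or a vendor backend depending on SDK configuration.
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	meter := otel.Meter("checkout-service")

	requests, err := meter.Int64Counter(
		"http.server.requests",
		metric.WithDescription("Total HTTP requests handled"),
	)
	if err != nil {
		panic(err)
	}

	// Record a measurement with bounded attributes (labels).
	requests.Add(context.Background(), 1,
		metric.WithAttributes(
			attribute.String("http.method", "GET"),
			attribute.Int("http.status_code", 200),
		),
	)
}
```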
What to measure:
The RED and USE methods provide frameworks for deciding what metrics to collect:
RED Method (for services): Rate (requests per second), Errors (failed requests per second), and Duration (the distribution of request latency).
USE Method (for resources): Utilization (how busy the resource is), Saturation (how much work is queued waiting for it), and Errors (count of error events).
```
# RED Method - Service Metrics
# ============================

# Rate: Requests per second
http_requests_total{service="api", endpoint="/users"}

# Errors: Failed requests
http_requests_total{service="api", endpoint="/users", status=~"5.."}

# Duration: Latency distribution
http_request_duration_seconds_bucket{service="api", endpoint="/users"}

# USE Method - Resource Metrics
# =============================

# Utilization: CPU busy percentage
node_cpu_seconds_total{mode!="idle"}  # Then calculate ratio

# Saturation: Runnable processes waiting for CPU
node_load1  # 1-minute load average

# Errors: Hardware/resource errors
node_disk_io_errors_total
```

Metrics generate enormous volumes of data. A typical microservices deployment might produce millions of data points per minute. Storing and querying this efficiently requires specialized Time Series Databases (TSDBs).
TSDB characteristics: append-optimized writes, aggressive compression of timestamps and values, fast queries over time ranges, and built-in retention and downsampling policies.
Popular TSDBs and their characteristics:
| TSDB | Key Strengths | Query Language | Architecture |
|---|---|---|---|
| Prometheus | Pull-based, excellent Kubernetes integration, alerting built-in | PromQL | Single-node (federation/Thanos for scale) |
| VictoriaMetrics | PromQL-compatible, lower resource usage, long-term storage | MetricsQL (PromQL superset) | Single-node or cluster |
| InfluxDB | High write performance, SQL-like queries, enterprise features | InfluxQL/Flux | Single-node or cluster (Enterprise) |
| TimescaleDB | PostgreSQL extension, SQL interface, relational data + time series | SQL | PostgreSQL-based |
| M3DB | Ultra-high cardinality, distributed, Uber-developed | PromQL | Distributed cluster |
Query patterns:
Understanding common query patterns helps you design metrics that are easy to analyze:
```
# PromQL Query Examples
# =====================

# Current request rate (requests per second over last 5 minutes)
rate(http_requests_total[5m])

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# 99th percentile latency from histogram
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Memory usage across all pods for a service
sum(container_memory_usage_bytes{container="myapp"}) by (pod)

# Top 5 endpoints by request volume
topk(5, sum(rate(http_requests_total[1h])) by (endpoint))

# Alert expression: Error rate exceeds 5%
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
```

Store high-resolution data (every 15 seconds) for recent time periods (7-30 days), then downsample older data to lower resolution (1-hour averages) for longer retention (1+ years). This balances detail for debugging recent issues against storage costs for trend analysis.
Operating metrics infrastructure for large-scale systems introduces unique challenges. What works for a dozen services fails spectacularly at hundreds or thousands.
Challenge 1: Cardinality explosion
As discussed earlier, high cardinality is the primary enemy of metrics scalability. With Kubernetes deployments, you might have metrics labeled with pod names, which change on every deployment. Suddenly your time series count explodes every time you deploy.
A well-known incident at a major tech company involved adding a 'request_id' label to metrics for debugging. Within hours, databases were overwhelmed with billions of unique time series. The monitoring system itself went down, blindsiding the team right when they needed visibility most. Recovery required removing the label and purging data—during an active incident.
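One common mitigation is to sanitize label values before they reach the metrics library. The sketch below is a hypothetical helper (not from the source) that collapses dynamic URL segments into a bounded set of route templates:

```go
// Hypothetical label-sanitizing helper: map raw request paths onto a small,
// known set of route templates so a "path" label stays bounded.
package main

import (
	"fmt"
	"regexp"
)

// Patterns for known routes; anything unrecognized is bucketed as "other".
var routePatterns = []struct {
	re       *regexp.Regexp
	template string
}{
	{regexp.MustCompile(`^/api/orders/[0-9]+$`), "/api/orders/{id}"},
	{regexp.MustCompile(`^/api/users/[^/]+$`), "/api/users/{id}"},
	{regexp.MustCompile(`^/healthz$`), "/healthz"},
}

// normalizePath returns a bounded label value for an arbitrary request path.
func normalizePath(path string) string {
	for _, p := range routePatterns {
		if p.re.MatchString(path) {
			return p.template
		}
	}
	return "other" // never let unknown paths create new time series
}

func main() {
	fmt.Println(normalizePath("/api/orders/98231")) // /api/orders/{id}
	fmt.Println(normalizePath("/api/users/alice"))  // /api/users/{id}
	fmt.Println(normalizePath("/favicon.ico"))      // other
}
```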
Challenge 2: Federation and global aggregation
A single Prometheus server works for small deployments, but at scale you need multiple instances. Federating data across these instances while maintaining query performance requires careful architecture: typically each regional instance scrapes its own targets, and a global instance federates only pre-aggregated series from them (see the configuration example below).
Challenge 3: Cost management
Metrics storage isn't free. Cloud providers charge per active time series or per million data points ingested. Organizations often find metrics costs growing faster than their actual infrastructure costs.
```yaml
# Prometheus Federation Configuration Example
# ===========================================

# Global Prometheus that federates from regional instances
scrape_configs:
  # Federate specific metrics from regional Prometheus instances
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      # Only federate aggregated metrics, not raw high-cardinality data
      'match[]':
        - '{job="api-server", __name__=~"job:.*"}'  # Recording rules
        - 'up'                                      # Health checks
        - 'http_requests_total:rate5m'              # Pre-aggregated rates
    static_configs:
      - targets:
          - 'prometheus-us-east.internal:9090'
          - 'prometheus-us-west.internal:9090'
          - 'prometheus-eu-west.internal:9090'
```

Effective metrics don't appear by accident—they result from intentional instrumentation. Here are battle-tested patterns for instrumenting applications:
```go
// Go example: HTTP handler instrumentation with Prometheus
package main

import (
	"log"
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter for total requests
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "path", "status"},
	)

	// Histogram for request duration
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
		},
		[]string{"method", "path"},
	)

	// Gauge for in-flight requests
	httpInFlightRequests = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "http_in_flight_requests",
			Help: "Current number of in-flight HTTP requests",
		},
	)
)

// statusRecorder wraps http.ResponseWriter so the handler's status code
// can be used as a (bounded) label value.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func instrumentHandler(path string, handler http.HandlerFunc) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		httpInFlightRequests.Inc()
		defer httpInFlightRequests.Dec()

		timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method, path))
		defer timer.ObserveDuration()

		// Wrap ResponseWriter to capture status code
		wrapped := &statusRecorder{ResponseWriter: w, status: 200}
		handler(wrapped, r)

		httpRequestsTotal.WithLabelValues(r.Method, path, strconv.Itoa(wrapped.status)).Inc()
	})
}

func main() {
	// Register the instruments and expose them for scraping.
	prometheus.MustRegister(httpRequestsTotal, httpRequestDuration, httpInFlightRequests)

	http.Handle("/api/orders", instrumentHandler("/api/orders", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusCreated)
	}))
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Google's SRE book identifies four essential signals for any user-facing system: Latency (duration of requests), Traffic (demand on your system), Errors (rate of failed requests), and Saturation (how full your service is). Ensure every service exposes metrics for all four.
We've covered metrics comprehensively, from fundamental concepts to practical instrumentation patterns. Let's consolidate the key takeaways:

- Metrics are numerical time series, identified by a name plus a unique combination of labels.
- The four core types (counter, gauge, histogram, summary) have distinct collection and query semantics.
- Clear names with explicit units and bounded label cardinality keep metrics self-documenting and affordable.
- Collection follows either a pull model (Prometheus scraping /metrics) or a push model (StatsD/OTLP).
- The RED and USE methods and the Four Golden Signals define what every service and resource should expose.
What's next:
Metrics tell you what is happening—how many requests, how long they took, how much memory is used. But they don't tell you why. When you see a latency spike in your metrics dashboard, you need more detail.
Next, we'll explore Logs—the second pillar of observability. Logs provide the narrative detail that metrics lack: the actual events, errors, and context that explain the numbers.
You now have a comprehensive understanding of metrics—the quantitative backbone of observability. You understand metric types, naming conventions, collection strategies, and instrumentation patterns. Next, we'll complement this quantitative view with the qualitative richness of logs.