Imagine you're a physician examining a patient. You don't rely on the patient saying 'I feel okay.' You measure—heart rate, blood pressure, temperature, oxygen saturation. These numerical measurements tell an objective story that symptoms alone cannot convey. They reveal patterns, indicate trends, and provide early warning signs before a crisis manifests.
Metrics serve the same purpose for software systems. They are the vital signs of your applications and infrastructure—quantitative measurements that capture the state, behavior, and performance of your systems over time. Without metrics, operating a distributed system is like flying blind: you're guessing at what's happening rather than knowing.
On this page, we explore metrics as the first and perhaps most fundamental pillar of observability. We'll examine what metrics are, why they matter, the different types of metrics, how to collect and store them, and the principles that guide effective metrics design at scale.
By the end of this page, you will understand metrics fundamentally—not just as numbers emitted by monitoring tools, but as a carefully designed language for expressing system behavior. You'll learn the taxonomy of metric types, internalize best practices for naming and labeling, understand cardinality dangers, and appreciate how metrics form the quantitative backbone of observability.
At their core, metrics are numerical measurements collected at regular intervals over time. Each metric captures a specific aspect of system behavior—how many requests were processed, how long operations took, how much memory is being used, how many errors occurred.
Formally, a metric consists of:
- A name that identifies what is being measured (e.g., http_requests_total, memory_usage_bytes)
- A numeric value (e.g., 42, 1024.5)
- A set of labels, key-value pairs that add dimensions (e.g., method="GET", status="200", service="checkout")
- A timestamp marking when the value was observed

Together, these elements form a time series: a sequence of values ordered by time, identified by a unique combination of name and labels.
```
# Anatomy of a metric data point
# ==============================

# General structure:
# <metric_name>{<label_name>=<label_value>, ...} <value> [<timestamp>]

# Example: HTTP request counter
http_requests_total{method="GET", status="200", service="api-gateway"} 142857 1704672000000

# Breakdown:
# - Metric name: http_requests_total
# - Labels: method="GET", status="200", service="api-gateway"
# - Value: 142857 (total requests matching these labels)
# - Timestamp: 1704672000000 (Unix milliseconds)

# This creates a unique time series. Different label combinations
# create different time series:

http_requests_total{method="GET", status="200", service="api-gateway"} 142857
http_requests_total{method="GET", status="404", service="api-gateway"} 532
http_requests_total{method="POST", status="201", service="api-gateway"} 89421
http_requests_total{method="POST", status="500", service="api-gateway"} 17

# Each line above is a separate time series, even though they share
# the same metric name.
```

Metrics are typically stored in Time Series Databases (TSDBs) optimized for append-heavy workloads with time-based queries. Systems like Prometheus, InfluxDB, VictoriaMetrics, and TimescaleDB are designed specifically for this access pattern—writing millions of data points per second and querying across large time ranges efficiently.
Why numerical measurements matter:
Metrics provide something that other observability signals cannot: aggregation and mathematical analysis. You can compute averages, percentiles, rates of change, standard deviations, and correlations. You can set thresholds and trigger alerts when values cross boundaries. You can visualize trends over days, weeks, or months.
This quantitative nature makes metrics ideal for alerting on thresholds, building dashboards, tracking trends over time, and planning capacity.
Not all metrics behave the same way. The type of measurement you're making determines how the metric should be collected, stored, aggregated, and queried. Understanding metric types is essential for effective observability.
There are four fundamental metric types, each with distinct semantics:
| Type | What It Measures | Key Characteristic | Example Use Cases |
|---|---|---|---|
| Counter | Cumulative totals that only increase | Monotonically increasing (only goes up, never down) | Request count, bytes transferred, errors, completed jobs |
| Gauge | Current value that can go up or down | Point-in-time snapshot, can increase or decrease | Temperature, memory usage, queue depth, active connections |
| Histogram | Distribution of values in buckets | Records observations in configurable ranges | Request latency, response sizes, batch sizes |
| Summary | Client-calculated percentiles | Computes quantiles client-side before exposition | Latency percentiles (p50, p95, p99) |
Let's explore each type in detail:
Counters: the raw value of a counter is rarely meaningful on its own; query it with rate() or increase() to understand 'how many per second?' For example, http_requests_total might show 1,000,000 requests. But what matters is whether that's 100/sec (normal) or 10,000/sec (traffic spike). By convention, counter names end in _total (e.g., errors_total, bytes_received_total).

Histograms: each histogram exposes per-bucket counters (_bucket), the total sum of observations (_sum), and the count of observations (_count), as shown in the example below.
```
# Histogram for HTTP request duration (in seconds)
# The _bucket counters show how many requests fell into each bucket

http_request_duration_seconds_bucket{le="0.005"} 24054    # ≤ 5ms
http_request_duration_seconds_bucket{le="0.01"} 33445     # ≤ 10ms
http_request_duration_seconds_bucket{le="0.025"} 100392   # ≤ 25ms
http_request_duration_seconds_bucket{le="0.05"} 129389    # ≤ 50ms
http_request_duration_seconds_bucket{le="0.1"} 133988     # ≤ 100ms
http_request_duration_seconds_bucket{le="0.25"} 134890    # ≤ 250ms
http_request_duration_seconds_bucket{le="0.5"} 135085     # ≤ 500ms
http_request_duration_seconds_bucket{le="1"} 135121       # ≤ 1s
http_request_duration_seconds_bucket{le="+Inf"} 135123    # All requests

http_request_duration_seconds_sum 4503.2     # Total seconds
http_request_duration_seconds_count 135123   # Total requests

# From this, you can compute:
# - Average latency: _sum / _count = 33.3ms
# - p50 (median): falls in the 10-25ms bucket (interpolates to ≈18ms)
# - p99: falls in the 50-100ms bucket (interpolates to ≈98ms)
```

Well-designed metrics are self-documenting. A well-named metric with appropriate labels tells you exactly what it measures without requiring external documentation. Poor naming leads to confusion, misinterpretation, and eventually, metrics that nobody trusts.
Naming best practices:
- Use a consistent prefix (namespace) such as myapp_, http_, db_pool_. This prevents collisions and groups related metrics.
- Name the thing being measured, not the mechanism: request_duration_seconds is better than request_timer. The former is clear; the latter requires context.
- Include the unit in the name: response_size_bytes, request_duration_seconds, temperature_celsius. Never make users guess the unit.
- Suffix counters with _total (e.g., http_requests_total, errors_total). This signals the metric type to readers.
- Use snake_case throughout: http_request_duration_seconds, not httpRequestDurationSeconds or http-request-duration-seconds.

Labels multiply the number of time series. A metric with labels for service, method, status, and endpoint creates a unique time series for every combination: 10 services × 5 methods × 20 status codes × 100 endpoints = 100,000 time series from a single metric. High cardinality destroys database performance and increases costs exponentially.
Label design principles:
Labels add dimensionality to metrics, enabling powerful queries like 'show error rates by service and endpoint.' But with this power comes responsibility:
```
# GOOD: Bounded cardinality labels
http_requests_total{
  service="checkout",       # Limited number of services
  method="POST",            # ~10 HTTP methods
  status="201",             # ~50 common status codes
  endpoint="/api/orders"    # Known, enumerable endpoints
}

# BAD: Unbounded cardinality labels - DON'T DO THIS
http_requests_total{
  user_id="user_12345",                           # DANGER: Millions of users
  request_id="abc-123-xyz",                       # DANGER: Unique per request
  client_ip="192.168.1.42",                       # DANGER: Many IPs
  error_message="Connection refused to host..."   # DANGER: Free text
}

# The bad example would create a new time series for every
# unique combination, quickly overwhelming your TSDB.
```

There are two fundamental models for collecting metrics: push and pull. Each has distinct trade-offs that influence architecture decisions.
Pull-based collection (Prometheus model):
In the pull model, applications expose a metrics endpoint (typically /metrics), and a central collector periodically scrapes these endpoints. This is the model pioneered by Prometheus and widely adopted in cloud-native environments.
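A minimal pull-side sketch in Go (assuming the github.com/prometheus/client_golang library; the port is an arbitrary choice for this example) shows how little the application has to do:

```go
// Minimal pull-model instrumentation: serve /metrics and let the
// collector scrape it. The default registry already includes Go runtime
// and process collectors, so useful metrics appear with no extra code.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```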
The application's only responsibility is to expose /metrics with current values; the collector handles scheduling, target discovery, and storage.

Push-based collection (StatsD/OpenTelemetry model):
In the push model, applications actively send metrics to a central collector or aggregator. This is common with StatsD, Graphite, and increasingly with OpenTelemetry collectors.
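To illustrate the push side, the sketch below emits metrics in the plain StatsD text protocol over UDP using only the standard library; the agent address and metric names are assumptions made for this example:

```go
// Push-model sketch: fire-and-forget StatsD lines over UDP.
// The StatsD text format is "<name>:<value>|<type>" - counters (c),
// gauges (g), and timings in milliseconds (ms).
package main

import (
	"fmt"
	"net"
)

func main() {
	// Address of a local StatsD-compatible agent (assumed for the example).
	conn, err := net.Dial("udp", "127.0.0.1:8125")
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	fmt.Fprintf(conn, "checkout.http_requests:1|c\n")      // increment a counter
	fmt.Fprintf(conn, "checkout.queue_depth:42|g\n")       // set a gauge
	fmt.Fprintf(conn, "checkout.request_duration:87|ms\n") // record a timing
}
```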
| Aspect | Pull (Prometheus) | Push (StatsD/OTLP) |
|---|---|---|
| Target discovery | Collector discovers targets | Targets know collector address |
| Firewall-friendliness | Collector must reach targets | Targets push out (easier) |
| Short-lived jobs | Difficult (Pushgateway needed) | Natural fit |
| Failure detection | Missing scrape = target down | Requires separate health checks |
| Network traffic | Predictable intervals | Can be bursty |
| Application complexity | Expose endpoint only | Push logic + retry + buffering |
OpenTelemetry is emerging as the standard for metrics (and traces, and logs) collection. It provides a vendor-neutral SDK and collector that can export to any backend (Prometheus, Datadog, New Relic, etc.). OTel supports both push and pull semantics, giving you flexibility without vendor lock-in.
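As a sketch of what vendor-neutral instrumentation looks like (assuming the OpenTelemetry Go API; the global meter is a no-op until a MeterProvider and exporter are configured elsewhere, and the instrument and attribute names here are illustrative):

```go
// OpenTelemetry metrics sketch: the instrument is created against the
// global MeterProvider, so the same code can export to Prometheus, an
// OTLP collector, or a vendor backend depending on SDK configuration.
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	meter := otel.Meter("checkout-service")

	requests, err := meter.Int64Counter(
		"http.server.requests",
		metric.WithDescription("Total HTTP requests handled"),
	)
	if err != nil {
		panic(err)
	}

	// Record a measurement with bounded attributes (labels).
	requests.Add(context.Background(), 1,
		metric.WithAttributes(
			attribute.String("http.method", "GET"),
			attribute.Int("http.status_code", 200),
		),
	)
}
```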
What to measure:
The RED and USE methods provide frameworks for deciding what metrics to collect:
RED Method (for services): Rate (requests per second), Errors (failed requests per second), and Duration (the distribution of request latency).
USE Method (for resources): Utilization (how busy the resource is), Saturation (how much work is queued waiting for it), and Errors (count of error events).
```
# RED Method - Service Metrics
# ============================

# Rate: Requests per second
http_requests_total{service="api", endpoint="/users"}

# Errors: Failed requests
http_requests_total{service="api", endpoint="/users", status=~"5.."}

# Duration: Latency distribution
http_request_duration_seconds_bucket{service="api", endpoint="/users"}

# USE Method - Resource Metrics
# =============================

# Utilization: CPU busy percentage
node_cpu_seconds_total{mode!="idle"}  # Then calculate ratio

# Saturation: Runnable processes waiting for CPU
node_load1  # 1-minute load average

# Errors: Hardware/resource errors
node_disk_io_errors_total
```

Metrics generate enormous volumes of data. A typical microservices deployment might produce millions of data points per minute. Storing and querying this efficiently requires specialized Time Series Databases (TSDBs).
TSDB characteristics: append-optimized writes, aggressive compression of timestamps and values, fast queries over time ranges, and built-in retention and downsampling policies.
Popular TSDBs and their characteristics:
| TSDB | Key Strengths | Query Language | Architecture |
|---|---|---|---|
| Prometheus | Pull-based, excellent Kubernetes integration, alerting built-in | PromQL | Single-node (federation/Thanos for scale) |
| VictoriaMetrics | PromQL-compatible, lower resource usage, long-term storage | MetricsQL (PromQL superset) | Single-node or cluster |
| InfluxDB | High write performance, SQL-like queries, enterprise features | InfluxQL/Flux | Single-node or cluster (Enterprise) |
| TimescaleDB | PostgreSQL extension, SQL interface, relational data + time series | SQL | PostgreSQL-based |
| M3DB | Ultra-high cardinality, distributed, Uber-developed | PromQL | Distributed cluster |
Query patterns:
Understanding common query patterns helps you design metrics that are easy to analyze:
```
# PromQL Query Examples
# =====================

# Current request rate (requests per second over last 5 minutes)
rate(http_requests_total[5m])

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# 99th percentile latency from histogram
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Memory usage across all pods for a service
sum(container_memory_usage_bytes{container="myapp"}) by (pod)

# Top 5 endpoints by request volume
topk(5, sum(rate(http_requests_total[1h])) by (endpoint))

# Alert expression: Error rate exceeds 5%
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
> 0.05
```

Store high-resolution data (every 15 seconds) for recent time periods (7-30 days), then downsample older data to lower resolution (1-hour averages) for longer retention (1+ years). This balances detail for debugging recent issues against storage costs for trend analysis.
Operating metrics infrastructure for large-scale systems introduces unique challenges. What works for a dozen services fails spectacularly at hundreds or thousands.
Challenge 1: Cardinality explosion
As discussed earlier, high cardinality is the primary enemy of metrics scalability. With Kubernetes deployments, you might have metrics labeled with pod names, which change on every deployment. Suddenly your time series count explodes every time you deploy.
A well-known incident at a major tech company involved adding a 'request_id' label to metrics for debugging. Within hours, databases were overwhelmed with billions of unique time series. The monitoring system itself went down, blindsiding the team right when they needed visibility most. Recovery required removing the label and purging data—during an active incident.
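One common mitigation is to sanitize label values before they reach the metrics library. The sketch below is a hypothetical helper (not from the source) that collapses dynamic URL segments into a bounded set of route templates:

```go
// Hypothetical label-sanitizing helper: map raw request paths onto a small,
// known set of route templates so a "path" label stays bounded.
package main

import (
	"fmt"
	"regexp"
)

// Patterns for known routes; anything unrecognized is bucketed as "other".
var routePatterns = []struct {
	re       *regexp.Regexp
	template string
}{
	{regexp.MustCompile(`^/api/orders/[0-9]+$`), "/api/orders/{id}"},
	{regexp.MustCompile(`^/api/users/[^/]+$`), "/api/users/{id}"},
	{regexp.MustCompile(`^/healthz$`), "/healthz"},
}

// normalizePath returns a bounded label value for an arbitrary request path.
func normalizePath(path string) string {
	for _, p := range routePatterns {
		if p.re.MatchString(path) {
			return p.template
		}
	}
	return "other" // never let unknown paths create new time series
}

func main() {
	fmt.Println(normalizePath("/api/orders/98231")) // /api/orders/{id}
	fmt.Println(normalizePath("/api/users/alice"))  // /api/users/{id}
	fmt.Println(normalizePath("/favicon.ico"))      // other
}
```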
Challenge 2: Federation and global aggregation
A single Prometheus server works for small deployments, but at scale you need multiple instances. Federating data across these instances while maintaining query performance requires careful architecture: typically each regional instance scrapes its own targets, and a global instance federates only pre-aggregated series from them (see the configuration example below).
Challenge 3: Cost management
Metrics storage isn't free. Cloud providers charge per active time series or per million data points ingested. Organizations often find metrics costs growing faster than their actual infrastructure costs.
```yaml
# Prometheus Federation Configuration Example
# ===========================================

# Global Prometheus that federates from regional instances
scrape_configs:
  # Federate specific metrics from regional Prometheus instances
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      # Only federate aggregated metrics, not raw high-cardinality data
      'match[]':
        - '{job="api-server", __name__=~"job:.*"}'  # Recording rules
        - 'up'                                      # Health checks
        - 'http_requests_total:rate5m'              # Pre-aggregated rates
    static_configs:
      - targets:
          - 'prometheus-us-east.internal:9090'
          - 'prometheus-us-west.internal:9090'
          - 'prometheus-eu-west.internal:9090'
```

Effective metrics don't appear by accident—they result from intentional instrumentation. Here are battle-tested patterns for instrumenting applications:
```go
// Go example: HTTP handler instrumentation with Prometheus
package main

import (
	"log"
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Counter for total requests
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "path", "status"},
	)

	// Histogram for request duration
	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "HTTP request duration in seconds",
			Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
		},
		[]string{"method", "path"},
	)

	// Gauge for in-flight requests
	httpInFlightRequests = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "http_in_flight_requests",
			Help: "Current number of in-flight HTTP requests",
		},
	)
)

// statusRecorder wraps http.ResponseWriter so the handler's status code
// can be used as a (bounded) label value.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

func instrumentHandler(path string, handler http.HandlerFunc) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		httpInFlightRequests.Inc()
		defer httpInFlightRequests.Dec()

		timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(r.Method, path))
		defer timer.ObserveDuration()

		// Wrap ResponseWriter to capture status code
		wrapped := &statusRecorder{ResponseWriter: w, status: 200}
		handler(wrapped, r)

		httpRequestsTotal.WithLabelValues(r.Method, path, strconv.Itoa(wrapped.status)).Inc()
	})
}

func main() {
	// Register the instruments and expose them for scraping.
	prometheus.MustRegister(httpRequestsTotal, httpRequestDuration, httpInFlightRequests)

	http.Handle("/api/orders", instrumentHandler("/api/orders", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusCreated)
	}))
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Google's SRE book identifies four essential signals for any user-facing system: Latency (duration of requests), Traffic (demand on your system), Errors (rate of failed requests), and Saturation (how full your service is). Ensure every service exposes metrics for all four.
We've covered metrics comprehensively, from fundamental concepts to practical instrumentation patterns. Let's consolidate the key takeaways:

- Metrics are numerical time series, identified by a name plus a unique combination of labels.
- The four core types (counter, gauge, histogram, summary) have distinct collection and query semantics.
- Clear names with explicit units and bounded label cardinality keep metrics self-documenting and affordable.
- Collection follows either a pull model (Prometheus scraping /metrics) or a push model (StatsD/OTLP).
- The RED and USE methods and the Four Golden Signals define what every service and resource should expose.
What's next:
Metrics tell you what is happening—how many requests, how long they took, how much memory is used. But they don't tell you why. When you see a latency spike in your metrics dashboard, you need more detail.
Next, we'll explore Logs—the second pillar of observability. Logs provide the narrative detail that metrics lack: the actual events, errors, and context that explain the numbers.
You now have a comprehensive understanding of metrics—the quantitative backbone of observability. You understand metric types, naming conventions, collection strategies, and instrumentation patterns. Next, we'll complement this quantitative view with the qualitative richness of logs.