Modern systems generate thousands of metrics. A single Kubernetes cluster might emit tens of thousands of distinct metric series. An application with detailed instrumentation could produce hundreds of metrics per service. Database servers, message queues, load balancers—each component adds to the deluge.

Faced with this abundance, engineers make one of two mistakes: they either display everything (creating overwhelming dashboards that communicate nothing) or they choose metrics arbitrarily (displaying whatever was easy to add rather than what matters).

The result is dashboards that fail at their fundamental purpose: communicating system health.

This page addresses the critical question: of all the metrics we could display, which ones should we actually put on our dashboards? The answer isn't arbitrary—it's grounded in decades of operational experience distilled into principled methodologies.
By the end of this page, you will understand proven frameworks for metric selection: the RED method for services, the USE method for resources, and the Four Golden Signals. You'll learn how to adapt these frameworks to different system types and understand the relationship between metrics and the questions they answer.
Before diving into specific frameworks, let's establish the principles that guide metric selection for dashboards.

Metrics Exist to Answer Questions

Every metric on a dashboard should answer a specific, actionable question. If you cannot articulate what question a metric answers, it doesn't belong on an operational dashboard.

The questions that matter fall into a hierarchy:
```
Question Hierarchy (in order of operational priority):

1. "Is the service working for users?"
   └─ Success rate, availability, functional correctness
2. "How well is the service performing for users?"
   └─ Latency, throughput, quality of experience
3. "Can the service continue to work under current conditions?"
   └─ Capacity, saturation, headroom
4. "Where are problems occurring?"
   └─ Service breakdowns, endpoint-level metrics
5. "Why is this happening?"
   └─ Resource utilization, dependency health, debug data
```

Dashboards should answer questions 1-3 immediately. Questions 4-5 are answered through drill-down.

User-Centricity as North Star

The most important metrics are those that reflect user experience. A database might show 100% healthy on all its internal metrics while queries time out at the application layer. Internal metrics matter, but user-facing metrics are authoritative.

This principle determines dashboard hierarchy:

1. User-facing metrics (errors users see, latency users experience) get prime position
2. Service-level metrics (internal but directly impacting user experience) get secondary position
3. Resource metrics (CPU, memory, network) support investigation
4. Infrastructure metrics (kernel stats, hardware) are for deep debugging

Fewer Metrics, Better Understanding

Cognitive research shows that humans make better decisions with less information—if that information is the right information. A dashboard with 5 carefully chosen metrics outperforms one with 50 random metrics.

The discipline of metric selection is primarily about what to exclude. Every metric added to a dashboard:

- Competes for attention with other metrics
- Increases cognitive load
- May distract from more important signals

Each metric must earn its place by answering a question that matters.
For every metric on your dashboard, ask: 'If this metric changed significantly, would we take action?' If the answer is no—or if the action would be to investigate other metrics first—the metric doesn't belong on the primary dashboard. It might belong in a drill-down view or investigation dashboard instead.
The RED method, developed by Tom Wilkie while at Weaveworks, provides a simple framework for monitoring request-driven services. For every service, measure:

R — Rate: How many requests per second is this service handling?
E — Errors: How many of those requests are failing?
D — Duration: How long do those requests take?
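As a rough sketch, the three signals map onto three PromQL queries, assuming Prometheus-style HTTP instrumentation with an `http_requests_total` counter (carrying `service` and `status` labels) and an `http_request_duration_seconds` histogram; the metric names and the `service="checkout"` selector are illustrative, not prescribed by the method.

```promql
# Rate: requests per second handled by the service
sum(rate(http_requests_total{service="checkout"}[5m]))

# Errors: fraction of requests that are failing (5xx responses)
sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))
  / sum(rate(http_requests_total{service="checkout"}[5m]))

# Duration: p95 latency derived from the request-duration histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))
```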
Why RED Works

RED captures what users care about in a service:

- Rate tells you if the service is being used (and how much)
- Errors tell you if the service is working
- Duration tells you if the service is working well

These three metrics together provide a complete picture of service health from the user's perspective. They're also easy to collect—most HTTP servers and RPC frameworks can emit these automatically.

RED in Practice
| Metric | Visualization | What to Show | Alert Threshold Example |
|---|---|---|---|
| Request Rate | Line chart, Single stat | Requests/second, trend over time, comparison to baseline | Anomaly detection: >3 std dev from normal |
| Error Rate | Line chart with threshold, Single stat with color | Error percentage, absolute error count, breakdown by type | >1% error rate for >5 minutes |
| Duration (p50) | Line chart | Median latency, shows typical user experience | Informational, rarely alert on p50 |
| Duration (p95) | Line chart with threshold | 95th percentile, most users experience this or better | >500ms for >10 minutes |
| Duration (p99) | Line chart with threshold | 99th percentile, worst-case for most users | >2s for >5 minutes |
| Latency Heatmap | Heatmap | Full distribution over time, reveals bimodality | Visual inspection for pattern changes |
RED Method Variations

By Endpoint: Track RED metrics for each significant endpoint, not just service aggregate. Login latency matters differently than search latency.

By Customer Tier: If you have premium customers, track their experience separately. A 1% error rate might mean 100% of enterprise customers are affected.

By Geography: Users in different regions may experience different conditions. Track RED per region for global services.
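These variations are just extra label dimensions on the same RED queries. A sketch, assuming the request metrics carry `handler`, `region`, and `customer_tier` labels (the label names are illustrative):

```promql
# Error rate by endpoint: a failing login handler matters differently than a slow search
sum(rate(http_requests_total{status=~"5.."}[5m])) by (handler)
  / sum(rate(http_requests_total[5m])) by (handler)

# p95 latency by region for a global service
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, region))

# Error rate for enterprise customers only
sum(rate(http_requests_total{customer_tier="enterprise", status=~"5.."}[5m]))
  / sum(rate(http_requests_total{customer_tier="enterprise"}[5m]))
```

Note that these labels must stay bounded (a handful of tiers and regions, templated endpoint paths); the cardinality pitfalls discussed later apply here.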
RED is designed for request-driven services—those that respond to incoming requests. It doesn't fit well for batch processing, streaming pipelines, or background workers. For these, consider per-item metrics (items processed/failed/duration) or use the USE method for the underlying resources.
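For a batch or background worker, the per-item equivalents might look like the sketch below. All metric names here (`worker_items_processed_total`, `worker_items_failed_total`, `worker_item_duration_seconds`, `worker_last_success_timestamp_seconds`) are hypothetical examples of instrumentation you would add yourself, as is the `job="report-generator"` selector.

```promql
# Throughput: items processed per second
sum(rate(worker_items_processed_total{job="report-generator"}[5m]))

# Failure ratio: failed items relative to items processed
sum(rate(worker_items_failed_total{job="report-generator"}[5m]))
  / sum(rate(worker_items_processed_total{job="report-generator"}[5m]))

# Per-item processing time: p95 from a histogram
histogram_quantile(0.95,
  sum(rate(worker_item_duration_seconds_bucket{job="report-generator"}[5m])) by (le))

# Freshness: seconds since the last successful run completed
time() - worker_last_success_timestamp_seconds{job="report-generator"}
```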
The USE method, developed by Brendan Gregg, provides a framework for analyzing resource performance. For every resource, measure:

U — Utilization: What percentage of the resource is being used?
S — Saturation: How much extra work is queued, waiting for the resource?
E — Errors: How many error events are occurring on this resource?
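On a Linux host scraped by node_exporter, the three USE dimensions might be approximated as in this sketch (node_exporter metric names, aggregated per `instance`; adjust to your environment):

```promql
# CPU utilization: fraction of time the CPUs are not idle
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# CPU saturation: 1-minute load average relative to the number of CPUs
node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"})

# Memory utilization: fraction of memory in use
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Disk saturation: average I/O queue depth per device
rate(node_disk_io_time_weighted_seconds_total[5m])

# Network errors: receive and transmit errors per second
rate(node_network_receive_errs_total[5m]) + rate(node_network_transmit_errs_total[5m])
```

The table below maps the same three dimensions onto other common resources.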
Applying USE to Common Resources
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | CPU busy percentage (user + system) | Run queue length (load average) | — |
| Memory | Used memory / Total memory | Swap usage, OOM kills | ECC errors, memory failures |
| Disk I/O | Disk busy percentage | I/O wait percentage, queue depth | Read/write errors, SMART warnings |
| Disk Capacity | Used space / Total space | Inodes used (for some filesystems) | Filesystem errors, corruption |
| Network Interface | Bytes sent/received vs. capacity | Packet transmit queue length | Drops, overruns, CRC errors |
| Network Sockets | Open connections / Max connections | Listen queue backlog, TIME_WAIT count | Connection refused, reset |
| Thread Pool | Active threads / Max threads | Queue depth, rejected tasks | Failed task executions |
| Connection Pool | In-use connections / Pool size | Wait time for connection | Connection timeouts, exhaustion |
The Saturation Signal

Saturation is often the most actionable USE metric. High utilization alone doesn't guarantee problems—a CPU running at 90% utilization might be perfectly healthy if there's no queue building. But non-zero saturation means work is waiting, which directly impacts user experience.

Prioritize saturation visibility:

- Run queue length (CPU saturation) is more actionable than CPU percentage
- I/O wait indicates disk saturation better than disk utilization
- Connection wait time reveals database saturation before pool exhaustion

Saturation metrics often predict problems before utilization metrics reveal them.
RED measures workload from the user's perspective. USE measures resources from the system's perspective. A healthy dashboard includes both: RED for 'is the service working?' and USE for 'why might it stop working?' RED metrics tell you there's a problem; USE metrics help you find the cause.
Google's Site Reliability Engineering book introduced the Four Golden Signals—a framework used internally at Google for monitoring user-facing systems. The signals are:

1. Latency — The time it takes to service a request
2. Traffic — The demand placed on your system
3. Errors — The rate of failed requests
4. Saturation — How 'full' your service is
Latency: Time to Serve

Latency measures the time between request receipt and response delivery. Critical nuances:

- Distinguish successful vs. failed request latency — A fast error isn't a success. Track latency for successful requests separately.
- Track distributions, not averages — The p99 tells you what the worst-off users experience. An average of 100ms might hide a p99 of 3 seconds.
- Consider user-perceived latency — Time from user action to visual response matters more than server processing time.

Traffic: Demand on the System

Traffic measures workload in domain-appropriate terms:

- Web services: HTTP requests per second
- Databases: Queries per second, transactions per second
- Streaming: Messages per second, bytes per second
- Storage: IOPS, bytes written per second

Traffic provides context for other signals. High error rate at 10x normal traffic tells a different story than high error rate at normal traffic.

Errors: Failed Requests

Error tracking must be comprehensive:

- Explicit errors: HTTP 5xx responses, exception throws
- Implicit errors: Wrong data returned, timeouts
- Policy violations: Responses that succeed but violate quality standards (e.g., latency SLO violations)

Saturation: Capacity Utilization

Saturation measures how close the service is to capacity:

- Resource saturation: CPU, memory, disk approaching limits
- Throughput capacity: Requests per second vs. maximum tested
- Queue depth: Work waiting to be processed
The Four Golden Signals overlap with RED and USE. Latency ≈ Duration in RED. Traffic ≈ Rate in RED. Errors appears in all three. Saturation appears in both Golden Signals and USE. The frameworks are complementary views of the same fundamental concerns. Choose the framework that resonates with your team—the specific labels matter less than covering the core concepts.
```promql
# LATENCY: p50, p95, p99 for successful requests
# Shows the response time distribution users experience
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le))

# TRAFFIC: Request rate, broken down by endpoint
sum(rate(http_requests_total[5m])) by (handler)

# ERRORS: Error rate as percentage of total traffic
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# SATURATION: Various indicators
# CPU saturation via container throttling
sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (pod)

# Thread pool saturation
threadpool_queue_size / threadpool_max_queue_size * 100

# Memory approaching OOM
(container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100
```

Different system types require different metric emphases. While RED/USE/Golden Signals provide frameworks, application to specific systems requires domain knowledge.

API Services and Web Applications
Databases
Message Queues and Streaming Systems
Caches
Don't limit dashboards to technical metrics. Include business-level metrics that tie technical health to business outcomes: orders per minute, successful logins, payments processed. These bridge the gap between engineering reality and business impact, making dashboards relevant to non-technical stakeholders.
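As a sketch, such business signals are usually just ordinary counters emitted by the application; the metric names below (`orders_completed_total`, `logins_total`, `payments_total`) are hypothetical placeholders for whatever your domain exposes:

```promql
# Orders completed per minute
sum(rate(orders_completed_total[5m])) * 60

# Login success rate
sum(rate(logins_total{result="success"}[5m])) / sum(rate(logins_total[5m]))

# Failed payments per minute, by payment provider
sum(rate(payments_total{status="failed"}[5m])) by (provider) * 60
```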
Metric selection involves avoiding common traps that reduce dashboard effectiveness.
The Cardinality Problem

Metric cardinality—the number of unique time series generated by a metric—can explode when labels have high variability:

| Label Pattern | Risk | Example |
|---------------|------|---------|
| User ID as label | Extreme | request_count{user_id="12345"} — Millions of series |
| Request ID as label | Extreme | request_duration{request_id="abc123"} — Unbounded |
| URL path with variables | High | /users/12345/posts/67890 — Many unique paths |
| Reasonable labels | Safe | status_code, method, service, region — Bounded |

High-cardinality metrics consume memory, slow queries, and increase storage costs. Design metrics with bounded label values.
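If you suspect cardinality is already getting out of hand in a Prometheus setup, a few exploratory queries can point at the offenders. The `user_id` label below stands in for whatever unbounded label you are checking, and the topk query can itself be expensive on very large servers:

```promql
# Total number of active series held in memory (Prometheus self-monitoring metric)
prometheus_tsdb_head_series

# Top 10 metric names by number of active series
topk(10, count by (__name__) ({__name__=~".+"}))

# How many distinct values a suspect label contributes on one metric
count(count by (user_id) (http_requests_total))
```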
Adding more metrics doesn't improve observability—it often degrades it. Every additional metric is noise that obscures the signal. After an incident, the instinct to add metrics that 'would have caught this' leads to dashboard creep. Resist. Instead, refine existing metrics and improve their visibility.
SLO-based dashboards require specific metrics that differ from general operational metrics. The focus shifts from instantaneous values to budget consumption.
Essential SLO Metrics

SLI (Service Level Indicator) — The raw measurement underlying the SLO. For an availability SLO, this might be successful_requests / total_requests. For a latency SLO, this might be requests_under_threshold / total_requests.

Current SLO Compliance — Whether the SLI currently meets the objective. Displayed as percentage with color coding against target.

Error Budget Remaining — How much failure is acceptable before breaching the SLO. Expressed as percentage remaining or time remaining at current burn rate.

Burn Rate — How fast the error budget is being consumed. A burn rate of 1.0x means pace exactly matches the budget; 2.0x means consuming budget twice as fast as sustainable.

Time Window Performance — SLO compliance over the measurement window (30 days, quarter).
```yaml
# SLI: Availability (successful requests / total requests)
- record: sli:availability:ratio
  expr: |
    sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
    /
    sum(rate(http_requests_total[5m])) by (service)

# SLI: Latency (requests under 200ms / total)
- record: sli:latency:ratio
  expr: |
    sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m])) by (service)
    /
    sum(rate(http_request_duration_seconds_count[5m])) by (service)

# Error budget consumed (30-day window, 99.9% SLO, i.e. a 0.1% error budget)
- record: slo:error_budget_consumed:ratio
  expr: |
    (1 - avg_over_time(sli:availability:ratio[30d])) / 0.001

# Burn rate (how fast budget is being consumed)
# Burn rate of 1.0 = exactly using the allowed budget
# Burn rate of 14.4 = will exhaust a 30-day budget in ~50 hours
- record: slo:burn_rate:5m
  expr: |
    (1 - sli:availability:ratio) / 0.001
```

SLO Dashboard Layout

A typical SLO dashboard might display:

Top Row: Status Summary
- SLO compliance status (green/yellow/red) for each service
- Error budget remaining percentage
- Days until budget exhaustion at current rate

Second Row: Current Performance
- Current SLI value vs. target
- Burn rate indicator
- Trend compared to previous period

Third Row: Time Series
- SLI over time with SLO threshold line
- Error budget consumption over the window
- Burn rate over time

Detail Section: Breakdown
- SLI by endpoint, region, or customer tier
- Major contributing factors to budget consumption
- Incident markers correlated with budget consumption
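The burn-rate signal above also commonly drives alerting via the multi-window, multi-burn-rate pattern described in the Google SRE Workbook: page only when both a long and a short window show the budget burning fast. A sketch, assuming the same http_requests_total instrumentation as the recording rules above and a 30-day 99.9% SLO; the 14.4x threshold and the 1h/5m windows are illustrative choices, not fixed requirements:

```yaml
# Page when the error rate over both 1h and 5m exceeds 14.4x the sustainable
# rate for a 99.9% SLO (14.4 * 0.1% = 1.44% errors), a pace that would exhaust
# the 30-day budget in roughly 50 hours if sustained.
- alert: HighErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
        / sum(rate(http_requests_total[1h]))
      > (14.4 * 0.001)
    )
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m]))
      > (14.4 * 0.001)
    )
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Error budget burning at more than 14.4x the sustainable rate"
```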
When error budget is healthy, teams can move fast—shipping features and accepting some risk. When budget is exhausted, teams focus on reliability. SLO dashboards don't just monitor; they guide engineering prioritization by making the reliability/velocity tradeoff visible.
We've covered the frameworks and principles that guide effective metric selection for dashboards: question-driven, user-centric selection; the RED method for request-driven services; the USE method for resources; the Four Golden Signals for user-facing systems; and SLO-focused metrics for tracking error budgets and burn rates.
What's Next:

With an understanding of which metrics to display, we need to consider who is viewing the dashboard and what they need. The next page explores service-level dashboards—dashboards designed for engineering teams operating specific services.
You now have frameworks for selecting the metrics that belong on operational dashboards. The key insight: metric selection is an exercise in discipline and focus. The goal isn't to display every available metric—it's to display the metrics that enable understanding and action.