Modern systems generate thousands of metrics. A single Kubernetes cluster might emit tens of thousands of distinct metric series. An application with detailed instrumentation could produce hundreds of metrics per service. Database servers, message queues, load balancers—each component adds to the deluge.
Faced with this abundance, engineers make one of two mistakes: they either display everything (creating overwhelming dashboards that communicate nothing) or they choose metrics arbitrarily (displaying whatever was easy to add rather than what matters).
The result is dashboards that fail at their fundamental purpose: communicating system health.
This page addresses the critical question: Of all the metrics we could display, which ones should we actually put on our dashboards? The answer isn't arbitrary—it's grounded in decades of operational experience distilled into principled methodologies.
By the end of this page, you will understand proven frameworks for metric selection: the RED method for services, the USE method for resources, and the Four Golden Signals. You'll learn how to adapt these frameworks to different system types and understand the relationship between metrics and the questions they answer.
Before diving into specific frameworks, let's establish the principles that guide metric selection for dashboards.
Metrics Exist to Answer Questions
Every metric on a dashboard should answer a specific, actionable question. If you cannot articulate what question a metric answers, it doesn't belong on an operational dashboard.
The questions that matter fall into a hierarchy:
```
Question Hierarchy (in order of operational priority):

1. "Is the service working for users?"
   └─ Success rate, availability, functional correctness
2. "How well is the service performing for users?"
   └─ Latency, throughput, quality of experience
3. "Can the service continue to work under current conditions?"
   └─ Capacity, saturation, headroom
4. "Where are problems occurring?"
   └─ Service breakdowns, endpoint-level metrics
5. "Why is this happening?"
   └─ Resource utilization, dependency health, debug data

Dashboards should answer questions 1-3 immediately.
Questions 4-5 are answered through drill-down.
```
User-Centricity as North Star
The most important metrics are those that reflect user experience. A database might show 100% healthy on all its internal metrics while queries time out at the application layer. Internal metrics matter, but user-facing metrics are authoritative.
This principle determines dashboard hierarchy: user-facing metrics belong at the top of the dashboard and drive alerting, while internal metrics belong in drill-down views that support diagnosis.
Fewer Metrics, Better Understanding
Cognitive research shows that humans make better decisions with less information—if that information is the right information. A dashboard with 5 carefully chosen metrics outperforms one with 50 random metrics.
The discipline of metric selection is primarily about what to exclude. Every metric added to a dashboard:

- Consumes screen space and viewer attention
- Adds cognitive load, especially during incidents
- Dilutes the signal of the metrics that actually matter
Each metric must earn its place by answering a question that matters.
For every metric on your dashboard, ask: 'If this metric changed significantly, would we take action?' If the answer is no—or if the action would be to investigate other metrics first—the metric doesn't belong on the primary dashboard. It might belong in a drill-down view or investigation dashboard instead.
The RED method, developed by Tom Wilkie at Grafana Labs, provides a simple framework for monitoring request-driven services. For every service, measure:
R — Rate: How many requests per second is this service handling?
E — Errors: How many of those requests are failing?
D — Duration: How long do those requests take?
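Expressed as queries, and assuming Prometheus-style HTTP metrics (`http_requests_total` and an `http_request_duration_seconds` histogram, the same names used in the Golden Signals queries later on this page), a minimal RED sketch might look like the following; the `service="checkout"` label is hypothetical:

```promql
# R (Rate): requests per second, averaged over 5 minutes
sum(rate(http_requests_total{service="checkout"}[5m]))

# E (Errors): failing requests per second (HTTP 5xx here)
sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))

# D (Duration): p95 latency from the request-duration histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))
```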
Why RED Works
RED captures what users care about in a service:

- Rate: is the service handling the traffic users send it?
- Errors: are users' requests succeeding?
- Duration: are users getting responses quickly?
These three metrics together provide a complete picture of service health from the user's perspective. They're also easy to collect—most HTTP servers and RPC frameworks can emit these automatically.
RED in Practice
| Metric | Visualization | What to Show | Alert Threshold Example |
|---|---|---|---|
| Request Rate | Line chart, Single stat | Requests/second, trend over time, comparison to baseline | Anomaly detection: >3 std dev from normal |
| Error Rate | Line chart with threshold, Single stat with color | Error percentage, absolute error count, breakdown by type | 1% error rate for >5 minutes |
| Duration (p50) | Line chart | Median latency, shows typical user experience | Informational, rarely alert on p50 |
| Duration (p95) | Line chart with threshold | 95th percentile, most users experience this or better | 500ms for >10 minutes |
| Duration (p99) | Line chart with threshold | 99th percentile, worst-case for most users | 2s for >5 minutes |
| Latency Heatmap | Heatmap | Full distribution over time, reveals bimodality | Visual inspection for pattern changes |
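To make the table's thresholds concrete, here is a sketch of a Prometheus alerting rule for the error-rate row, assuming the same `http_requests_total` metric as above; the threshold and window should be tuned to your own service:

```yaml
groups:
  - name: red-alerts
    rules:
      - alert: HighErrorRate
        # Fires when more than 1% of requests fail for over 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
```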
RED Method Variations
By Endpoint: Track RED metrics for each significant endpoint, not just service aggregate. Login latency matters differently than search latency.
By Customer Tier: If you have premium customers, track their experience separately. A 1% error rate might mean 100% of enterprise customers are affected.
By Geography: Users in different regions may experience different conditions. Track RED per region for global services.
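Each variation is just an extra grouping label on the same RED queries. A sketch, assuming labels named `handler`, `customer_tier`, and `region` (hypothetical; use whatever labels your instrumentation actually carries):

```promql
# By endpoint: request rate per handler
sum(rate(http_requests_total[5m])) by (handler)

# By customer tier: error percentage per tier
sum(rate(http_requests_total{status=~"5.."}[5m])) by (customer_tier)
  / sum(rate(http_requests_total[5m])) by (customer_tier) * 100

# By geography: p95 latency per region
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, region))
```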
RED is designed for request-driven services—those that respond to incoming requests. It doesn't fit well for batch processing, streaming pipelines, or background workers. For these, consider per-item metrics (items processed/failed/duration) or use the USE method for the underlying resources.
The USE method, developed by Brendan Gregg, provides a framework for analyzing resource performance. For every resource, measure:
U — Utilization: What percentage of the resource is being used?
S — Saturation: How much extra work is queued, waiting for the resource?
E — Errors: How many error events are occurring on this resource?
Applying USE to Common Resources
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | CPU busy percentage (user + system) | Run queue length (load average) | — |
| Memory | Used memory / Total memory | Swap usage, OOM kills | ECC errors, memory failures |
| Disk I/O | Disk busy percentage | I/O wait percentage, queue depth | Read/write errors, SMART warnings |
| Disk Capacity | Used space / Total space | Inodes used (for some filesystems) | Filesystem errors, corruption |
| Network Interface | Bytes sent/received vs. capacity | Packet transmit queue length | Drops, overruns, CRC errors |
| Network Sockets | Open connections / Max connections | Listen queue backlog, TIME_WAIT count | Connection refused, reset |
| Thread Pool | Active threads / Max threads | Queue depth, rejected tasks | Failed task executions |
| Connection Pool | In-use connections / Pool size | Wait time for connection | Connection timeouts, exhaustion |
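As a sketch of the CPU and memory rows using standard node_exporter metric names (`node_cpu_seconds_total`, `node_load1`, `node_memory_*`, `node_vmstat_pswpin`; adjust for whatever exporter you run):

```promql
# CPU utilization: percentage of time the CPUs are not idle
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

# CPU saturation: 1-minute load average relative to CPU count
node_load1 / count(count(node_cpu_seconds_total) by (cpu))

# Memory utilization: fraction of memory in use
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Memory saturation: swap-in activity indicates memory pressure
rate(node_vmstat_pswpin[5m])
```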
The Saturation Signal
Saturation is often the most actionable USE metric. High utilization alone doesn't guarantee problems—a CPU running at 90% utilization might be perfectly healthy if there's no queue building. But non-zero saturation means work is waiting, which directly impacts user experience.
Prioritize saturation visibility: saturation metrics often predict problems before utilization metrics reveal them.
RED measures workload from the user's perspective. USE measures resources from the system's perspective. A healthy dashboard includes both: RED for 'is the service working?' and USE for 'why might it stop working?' RED metrics tell you there's a problem; USE metrics help you find the cause.
Google's Site Reliability Engineering book introduced the Four Golden Signals—a framework used internally at Google for monitoring user-facing systems. The signals are:
1. Latency — The time it takes to service a request
2. Traffic — The demand placed on your system
3. Errors — The rate of failed requests
4. Saturation — How 'full' your service is
Latency: Time to Serve
Latency measures the time between request receipt and response delivery. Critical nuances:

- Track the latency of successful requests and failed requests separately; a fast error page is not a good experience, and slow errors can silently drag down averages.
- Track distributions (percentiles, histograms) rather than averages, which hide the slow tail.
Traffic: Demand on the System
Traffic measures workload in domain-appropriate terms: requests per second for a web service, transactions per second for a database, messages per second for a queue, or concurrent sessions for a streaming system.
Traffic provides context for other signals. High error rate at 10x normal traffic tells a different story than high error rate at normal traffic.
Errors: Failed Requests
Error tracking must be comprehensive: count explicit failures (HTTP 500s), implicit failures (a 200 response carrying the wrong content), and policy failures (responses that are technically correct but slower than the committed threshold).
Saturation: Capacity Utilization
Saturation measures how close the service is to capacity: focus on the most constrained resource (memory for a memory-bound service, I/O for an I/O-bound one), and remember that many systems degrade well before reaching 100% utilization.
The Four Golden Signals overlap with RED and USE. Latency ≈ Duration in RED. Traffic ≈ Rate in RED. Errors appears in all three. Saturation appears in both Golden Signals and USE. The frameworks are complementary views of the same fundamental concerns. Choose the framework that resonates with your team—the specific labels matter less than covering the core concepts.
```promql
# LATENCY: p50, p95, p99 for successful requests
# Shows the response time distribution users experience
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le))

# TRAFFIC: Request rate, broken down by endpoint
sum(rate(http_requests_total[5m])) by (handler)

# ERRORS: Error rate as percentage of total traffic
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# SATURATION: Various indicators
# CPU saturation via container throttling
sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (pod)

# Thread pool saturation
threadpool_queue_size / threadpool_max_queue_size * 100

# Memory approaching OOM
(container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100
```

Different system types require different metric emphases. While RED/USE/Golden Signals provide frameworks, applying them to specific systems requires domain knowledge.
API Services and Web Applications

RED per endpoint is the natural fit: request rate, error rate, and latency percentiles, broken down by handler and status code.

Databases

Emphasize query latency and throughput, connection pool utilization, replication lag, lock contention, and cache or buffer hit rates.

Message Queues and Streaming Systems

Emphasize publish and consume rates, consumer lag (how far behind consumers are), queue depth, and redelivery or dead-letter rates.

Caches

Emphasize hit rate, eviction rate, latency, and memory utilization; a falling hit rate often predicts load spikes on the systems behind the cache. A sketch of queue and cache queries follows below.
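For queues and caches in particular, exporter-specific metrics make this concrete. A sketch using metric names from the commonly used redis_exporter and kafka_exporter (verify the names your exporters actually emit):

```promql
# Cache hit rate (redis_exporter metric names)
rate(redis_keyspace_hits_total[5m])
  / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))

# Consumer lag per topic (kafka_exporter metric names)
sum(kafka_consumergroup_lag) by (consumergroup, topic)
```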
Don't limit dashboards to technical metrics. Include business-level metrics that tie technical health to business outcomes: orders per minute, successful logins, payments processed. These bridge the gap between engineering reality and business impact, making dashboards relevant to non-technical stakeholders.
Metric selection involves avoiding common traps that reduce dashboard effectiveness.
The Cardinality Problem
Metric cardinality—the number of unique time series generated by a metric—can explode when labels have high variability:
| Label Pattern | Risk | Example |
|---|---|---|
| User ID as label | Extreme | request_count{user_id="12345"} — Millions of series |
| Request ID as label | Extreme | request_duration{request_id="abc123"} — Unbounded |
| URL path with variables | High | /users/12345/posts/67890 — Many unique paths |
| Reasonable labels | Safe | status_code, method, service, region — Bounded |
High-cardinality metrics consume memory, slow queries, and increase storage costs. Design metrics with bounded label values.
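One way to audit cardinality in Prometheus is to count active series per metric name. A sketch (these queries scan every series, so run them ad hoc rather than putting them on a dashboard):

```promql
# Top 10 metric names by number of active series
topk(10, count by (__name__)({__name__=~".+"}))

# Series count for one suspect metric
count(http_requests_total)

# Number of distinct values a single label contributes
count(count by (handler) (http_requests_total))
```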
Adding more metrics doesn't improve observability—it often degrades it. Every additional metric is noise that obscures the signal. After an incident, the instinct to add metrics that 'would have caught this' leads to dashboard creep. Resist. Instead, refine existing metrics and improve their visibility.
SLO-based dashboards require specific metrics that differ from general operational metrics. The focus shifts from instantaneous values to budget consumption.
Essential SLO Metrics
SLI (Service Level Indicator) — The raw measurement underlying the SLO. For an availability SLO, this might be successful_requests / total_requests. For a latency SLO, this might be requests_under_threshold / total_requests.
Current SLO Compliance — Whether the SLI currently meets the objective. Displayed as percentage with color coding against target.
Error Budget Remaining — How much failure is acceptable before breaching the SLO. Expressed as percentage remaining or time remaining at current burn rate.
Burn Rate — How fast the error budget is being consumed. A burn rate of 1.0x means pace exactly matches the budget; 2.0x means consuming budget twice as fast as sustainable.
Time Window Performance — SLO compliance over the measurement window (30 days, quarter).
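A worked example ties these together: a 99.9% availability SLO over 30 days tolerates 0.1% failures, which for a total outage equals 0.001 × 30 × 24 × 60 ≈ 43.2 minutes of downtime. Time to exhaustion is window ÷ burn rate: at 1.0x the budget lasts exactly 30 days, at 2.0x it lasts 15 days, and at 14.4x it lasts 30 ÷ 14.4 ≈ 2.08 days, roughly 50 hours, matching the comment in the recording rules below.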
```yaml
# SLI: Availability (successful requests / total requests)
- record: sli:availability:ratio
  expr: |
    sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
    /
    sum(rate(http_requests_total[5m])) by (service)

# SLI: Latency (requests under 200ms / total)
- record: sli:latency:ratio
  expr: |
    sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m])) by (service)
    /
    sum(rate(http_request_duration_seconds_count[5m])) by (service)

# Error budget consumed (30 day window, 99.9% SLO)
- record: slo:error_budget_consumed:ratio
  expr: |
    1 - (
      sum_over_time(sli:availability:ratio[30d])
      / (30 * 24)  # hours in window
    ) / 0.001      # 0.1% error budget

# Burn rate (how fast budget is being consumed)
# Burn rate of 1.0 = exactly using allowed budget
# Burn rate of 14.4 = will exhaust 30-day budget in ~50 hours
- record: slo:burn_rate:5m
  expr: |
    (1 - sli:availability:ratio) / 0.001
```

SLO Dashboard Layout
A typical SLO dashboard might display:
Top Row: Status Summary (current SLO compliance and error budget remaining, color-coded against target)
Second Row: Current Performance (the live SLI value, burn rate, and time to exhaustion at the current burn rate)
Third Row: Time Series (SLI and error budget consumption plotted over the measurement window)
Detail Section: Breakdown (the SLI split by service, endpoint, or region for drill-down)
When error budget is healthy, teams can move fast—shipping features and accepting some risk. When budget is exhausted, teams focus on reliability. SLO dashboards don't just monitor; they guide engineering prioritization by making the reliability/velocity tradeoff visible.
We've covered the frameworks and principles that guide effective metric selection for dashboards. Let's consolidate the key insights:

- Every metric on a dashboard should answer a specific, actionable question, with user-facing metrics taking priority.
- The RED method (Rate, Errors, Duration) covers request-driven services; the USE method (Utilization, Saturation, Errors) covers resources; the Four Golden Signals combine both perspectives.
- Fewer, well-chosen metrics beat exhaustive dashboards; every metric must earn its place.
- Keep label cardinality bounded, and resist post-incident metric creep.
- SLO dashboards shift the focus from instantaneous values to error budget consumption and burn rate.
What's Next:
With an understanding of which metrics to display, we need to consider who is viewing the dashboard and what they need. The next page explores service-level dashboards—dashboards designed for engineering teams operating specific services.
You now have frameworks for selecting the metrics that belong on operational dashboards. The key insight: metric selection is an exercise in discipline and focus. The goal isn't to display every available metric—it's to display the metrics that enable understanding and action.