Modern systems generate thousands of metrics. A single Kubernetes cluster might emit tens of thousands of distinct metric series. An application with detailed instrumentation could produce hundreds of metrics per service. Database servers, message queues, load balancers—each component adds to the deluge.
Faced with this abundance, engineers make one of two mistakes: they either display everything (creating overwhelming dashboards that communicate nothing) or they choose metrics arbitrarily (displaying whatever was easy to add rather than what matters).
The result is dashboards that fail at their fundamental purpose: communicating system health.
This page addresses the critical question: Of all the metrics we could display, which ones should we actually put on our dashboards? The answer isn't arbitrary—it's grounded in decades of operational experience distilled into principled methodologies.
By the end of this page, you will understand proven frameworks for metric selection: the RED method for services, the USE method for resources, and the Four Golden Signals. You'll learn how to adapt these frameworks to different system types and understand the relationship between metrics and the questions they answer.
Before diving into specific frameworks, let's establish the principles that guide metric selection for dashboards.
Metrics Exist to Answer Questions
Every metric on a dashboard should answer a specific, actionable question. If you cannot articulate what question a metric answers, it doesn't belong on an operational dashboard.
The questions that matter fall into a hierarchy:
```
Question Hierarchy (in order of operational priority):

1. "Is the service working for users?"
   └─ Success rate, availability, functional correctness
2. "How well is the service performing for users?"
   └─ Latency, throughput, quality of experience
3. "Can the service continue to work under current conditions?"
   └─ Capacity, saturation, headroom
4. "Where are problems occurring?"
   └─ Service breakdowns, endpoint-level metrics
5. "Why is this happening?"
   └─ Resource utilization, dependency health, debug data

Dashboards should answer questions 1-3 immediately.
Questions 4-5 are answered through drill-down.
```
User-Centricity as North Star
The most important metrics are those that reflect user experience. A database might show 100% healthy on all its internal metrics while queries time out at the application layer. Internal metrics matter, but user-facing metrics are authoritative.
This principle determines dashboard hierarchy: user-facing metrics belong at the top of the dashboard and drive alerting, while internal metrics belong in drill-down views that support diagnosis.
Fewer Metrics, Better Understanding
Cognitive research shows that humans make better decisions with less information—if that information is the right information. A dashboard with 5 carefully chosen metrics outperforms one with 50 random metrics.
The discipline of metric selection is primarily about what to exclude. Every metric added to a dashboard:

- Consumes screen space and viewer attention
- Adds cognitive load, especially during incidents
- Dilutes the signal of the metrics that actually matter
Each metric must earn its place by answering a question that matters.
For every metric on your dashboard, ask: 'If this metric changed significantly, would we take action?' If the answer is no—or if the action would be to investigate other metrics first—the metric doesn't belong on the primary dashboard. It might belong in a drill-down view or investigation dashboard instead.
The RED method, developed by Tom Wilkie at Grafana Labs, provides a simple framework for monitoring request-driven services. For every service, measure:
R — Rate: How many requests per second is this service handling?
E — Errors: How many of those requests are failing?
D — Duration: How long do those requests take?
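Expressed as queries, and assuming Prometheus-style HTTP metrics (`http_requests_total` and an `http_request_duration_seconds` histogram, the same names used in the Golden Signals queries later on this page), a minimal RED sketch might look like the following; the `service="checkout"` label is hypothetical:

```promql
# R (Rate): requests per second, averaged over 5 minutes
sum(rate(http_requests_total{service="checkout"}[5m]))

# E (Errors): failing requests per second (HTTP 5xx here)
sum(rate(http_requests_total{service="checkout", status=~"5.."}[5m]))

# D (Duration): p95 latency from the request-duration histogram
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))
```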
Why RED Works
RED captures what users care about in a service:

- Rate: is the service handling the traffic users send it?
- Errors: are users' requests succeeding?
- Duration: are users getting responses quickly?
These three metrics together provide a complete picture of service health from the user's perspective. They're also easy to collect—most HTTP servers and RPC frameworks can emit these automatically.
RED in Practice
| Metric | Visualization | What to Show | Alert Threshold Example |
|---|---|---|---|
| Request Rate | Line chart, Single stat | Requests/second, trend over time, comparison to baseline | Anomaly detection: >3 std dev from normal |
| Error Rate | Line chart with threshold, Single stat with color | Error percentage, absolute error count, breakdown by type | 1% error rate for >5 minutes |
| Duration (p50) | Line chart | Median latency, shows typical user experience | Informational, rarely alert on p50 |
| Duration (p95) | Line chart with threshold | 95th percentile, most users experience this or better | 500ms for >10 minutes |
| Duration (p99) | Line chart with threshold | 99th percentile, worst-case for most users | 2s for >5 minutes |
| Latency Heatmap | Heatmap | Full distribution over time, reveals bimodality | Visual inspection for pattern changes |
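To make the table's thresholds concrete, here is a sketch of a Prometheus alerting rule for the error-rate row, assuming the same `http_requests_total` metric as above; the threshold and window should be tuned to your own service:

```yaml
groups:
  - name: red-alerts
    rules:
      - alert: HighErrorRate
        # Fires when more than 1% of requests fail for over 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
```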
RED Method Variations
By Endpoint: Track RED metrics for each significant endpoint, not just service aggregate. Login latency matters differently than search latency.
By Customer Tier: If you have premium customers, track their experience separately. A 1% error rate might mean 100% of enterprise customers are affected.
By Geography: Users in different regions may experience different conditions. Track RED per region for global services.
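Each variation is just an extra grouping label on the same RED queries. A sketch, assuming labels named `handler`, `customer_tier`, and `region` (hypothetical; use whatever labels your instrumentation actually carries):

```promql
# By endpoint: request rate per handler
sum(rate(http_requests_total[5m])) by (handler)

# By customer tier: error percentage per tier
sum(rate(http_requests_total{status=~"5.."}[5m])) by (customer_tier)
  / sum(rate(http_requests_total[5m])) by (customer_tier) * 100

# By geography: p95 latency per region
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, region))
```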
RED is designed for request-driven services—those that respond to incoming requests. It doesn't fit well for batch processing, streaming pipelines, or background workers. For these, consider per-item metrics (items processed/failed/duration) or use the USE method for the underlying resources.
The USE method, developed by Brendan Gregg, provides a framework for analyzing resource performance. For every resource, measure:
U — Utilization: What percentage of the resource is being used?
S — Saturation: How much extra work is queued, waiting for the resource?
E — Errors: How many error events are occurring on this resource?
Applying USE to Common Resources
| Resource | Utilization | Saturation | Errors |
|---|---|---|---|
| CPU | CPU busy percentage (user + system) | Run queue length (load average) | — |
| Memory | Used memory / Total memory | Swap usage, OOM kills | ECC errors, memory failures |
| Disk I/O | Disk busy percentage | I/O wait percentage, queue depth | Read/write errors, SMART warnings |
| Disk Capacity | Used space / Total space | Inodes used (for some filesystems) | Filesystem errors, corruption |
| Network Interface | Bytes sent/received vs. capacity | Packet transmit queue length | Drops, overruns, CRC errors |
| Network Sockets | Open connections / Max connections | Listen queue backlog, TIME_WAIT count | Connection refused, reset |
| Thread Pool | Active threads / Max threads | Queue depth, rejected tasks | Failed task executions |
| Connection Pool | In-use connections / Pool size | Wait time for connection | Connection timeouts, exhaustion |
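As a sketch of the CPU and memory rows using standard node_exporter metric names (`node_cpu_seconds_total`, `node_load1`, `node_memory_*`, `node_vmstat_pswpin`; adjust for whatever exporter you run):

```promql
# CPU utilization: percentage of time the CPUs are not idle
100 * (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])))

# CPU saturation: 1-minute load average relative to CPU count
node_load1 / count(count(node_cpu_seconds_total) by (cpu))

# Memory utilization: fraction of memory in use
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Memory saturation: swap-in activity indicates memory pressure
rate(node_vmstat_pswpin[5m])
```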
The Saturation Signal
Saturation is often the most actionable USE metric. High utilization alone doesn't guarantee problems—a CPU running at 90% utilization might be perfectly healthy if there's no queue building. But non-zero saturation means work is waiting, which directly impacts user experience.
Prioritize saturation visibility: saturation metrics often predict problems before utilization metrics reveal them.
RED measures workload from the user's perspective. USE measures resources from the system's perspective. A healthy dashboard includes both: RED for 'is the service working?' and USE for 'why might it stop working?' RED metrics tell you there's a problem; USE metrics help you find the cause.
Google's Site Reliability Engineering book introduced the Four Golden Signals—a framework used internally at Google for monitoring user-facing systems. The signals are:
1. Latency — The time it takes to service a request
2. Traffic — The demand placed on your system
3. Errors — The rate of failed requests
4. Saturation — How 'full' your service is
Latency: Time to Serve
Latency measures the time between request receipt and response delivery. Critical nuances:

- Track the latency of successful requests and failed requests separately; a fast error page is not a good experience, and slow errors can silently drag down averages.
- Track distributions (percentiles, histograms) rather than averages, which hide the slow tail.
Traffic: Demand on the System
Traffic measures workload in domain-appropriate terms: requests per second for a web service, transactions per second for a database, messages per second for a queue, or concurrent sessions for a streaming system.
Traffic provides context for other signals. High error rate at 10x normal traffic tells a different story than high error rate at normal traffic.
Errors: Failed Requests
Error tracking must be comprehensive: count explicit failures (HTTP 500s), implicit failures (a 200 response carrying the wrong content), and policy failures (responses that are technically correct but slower than the committed threshold).
Saturation: Capacity Utilization
Saturation measures how close the service is to capacity: focus on the most constrained resource (memory for a memory-bound service, I/O for an I/O-bound one), and remember that many systems degrade well before reaching 100% utilization.
The Four Golden Signals overlap with RED and USE. Latency ≈ Duration in RED. Traffic ≈ Rate in RED. Errors appears in all three. Saturation appears in both Golden Signals and USE. The frameworks are complementary views of the same fundamental concerns. Choose the framework that resonates with your team—the specific labels matter less than covering the core concepts.
```promql
# LATENCY: p50, p95, p99 for successful requests
# Shows the response time distribution users experience
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le))
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le))
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le))

# TRAFFIC: Request rate, broken down by endpoint
sum(rate(http_requests_total[5m])) by (handler)

# ERRORS: Error rate as percentage of total traffic
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

# SATURATION: Various indicators
# CPU saturation via container throttling
sum(rate(container_cpu_cfs_throttled_seconds_total[5m])) by (pod)

# Thread pool saturation
threadpool_queue_size / threadpool_max_queue_size * 100

# Memory approaching OOM
(container_memory_usage_bytes / container_spec_memory_limit_bytes) * 100
```

Different system types require different metric emphases. While RED/USE/Golden Signals provide frameworks, applying them to specific systems requires domain knowledge.
API Services and Web Applications

RED per endpoint is the natural fit: request rate, error rate, and latency percentiles, broken down by handler and status code.

Databases

Emphasize query latency and throughput, connection pool utilization, replication lag, lock contention, and cache or buffer hit rates.

Message Queues and Streaming Systems

Emphasize publish and consume rates, consumer lag (how far behind consumers are), queue depth, and redelivery or dead-letter rates.

Caches

Emphasize hit rate, eviction rate, latency, and memory utilization; a falling hit rate often predicts load spikes on the systems behind the cache. A sketch of queue and cache queries follows below.
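For queues and caches in particular, exporter-specific metrics make this concrete. A sketch using metric names from the commonly used redis_exporter and kafka_exporter (verify the names your exporters actually emit):

```promql
# Cache hit rate (redis_exporter metric names)
rate(redis_keyspace_hits_total[5m])
  / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))

# Consumer lag per topic (kafka_exporter metric names)
sum(kafka_consumergroup_lag) by (consumergroup, topic)
```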
Don't limit dashboards to technical metrics. Include business-level metrics that tie technical health to business outcomes: orders per minute, successful logins, payments processed. These bridge the gap between engineering reality and business impact, making dashboards relevant to non-technical stakeholders.
Metric selection involves avoiding common traps that reduce dashboard effectiveness.
The Cardinality Problem
Metric cardinality—the number of unique time series generated by a metric—can explode when labels have high variability:
| Label Pattern | Risk | Example |
|---|---|---|
| User ID as label | Extreme | request_count{user_id="12345"} — Millions of series |
| Request ID as label | Extreme | request_duration{request_id="abc123"} — Unbounded |
| URL path with variables | High | /users/12345/posts/67890 — Many unique paths |
| Reasonable labels | Safe | status_code, method, service, region — Bounded |
High-cardinality metrics consume memory, slow queries, and increase storage costs. Design metrics with bounded label values.
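One way to audit cardinality in Prometheus is to count active series per metric name. A sketch (these queries scan every series, so run them ad hoc rather than putting them on a dashboard):

```promql
# Top 10 metric names by number of active series
topk(10, count by (__name__)({__name__=~".+"}))

# Series count for one suspect metric
count(http_requests_total)

# Number of distinct values a single label contributes
count(count by (handler) (http_requests_total))
```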
Adding more metrics doesn't improve observability—it often degrades it. Every additional metric is noise that obscures the signal. After an incident, the instinct to add metrics that 'would have caught this' leads to dashboard creep. Resist. Instead, refine existing metrics and improve their visibility.
SLO-based dashboards require specific metrics that differ from general operational metrics. The focus shifts from instantaneous values to budget consumption.
Essential SLO Metrics
SLI (Service Level Indicator) — The raw measurement underlying the SLO. For an availability SLO, this might be successful_requests / total_requests. For a latency SLO, this might be requests_under_threshold / total_requests.
Current SLO Compliance — Whether the SLI currently meets the objective. Displayed as percentage with color coding against target.
Error Budget Remaining — How much failure is acceptable before breaching the SLO. Expressed as percentage remaining or time remaining at current burn rate.
Burn Rate — How fast the error budget is being consumed. A burn rate of 1.0x means pace exactly matches the budget; 2.0x means consuming budget twice as fast as sustainable.
Time Window Performance — SLO compliance over the measurement window (30 days, quarter).
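A worked example ties these together: a 99.9% availability SLO over 30 days tolerates 0.1% failures, which for a total outage equals 0.001 × 30 × 24 × 60 ≈ 43.2 minutes of downtime. Time to exhaustion is window ÷ burn rate: at 1.0x the budget lasts exactly 30 days, at 2.0x it lasts 15 days, and at 14.4x it lasts 30 ÷ 14.4 ≈ 2.08 days, roughly 50 hours, matching the comment in the recording rules below.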
```yaml
# SLI: Availability (successful requests / total requests)
- record: sli:availability:ratio
  expr: |
    sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
    /
    sum(rate(http_requests_total[5m])) by (service)

# SLI: Latency (requests under 200ms / total)
- record: sli:latency:ratio
  expr: |
    sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m])) by (service)
    /
    sum(rate(http_request_duration_seconds_count[5m])) by (service)

# Error budget consumed (30 day window, 99.9% SLO)
- record: slo:error_budget_consumed:ratio
  expr: |
    1 - (
      sum_over_time(sli:availability:ratio[30d])
      / (30 * 24)  # hours in window
    ) / 0.001      # 0.1% error budget

# Burn rate (how fast budget is being consumed)
# Burn rate of 1.0 = exactly using allowed budget
# Burn rate of 14.4 = will exhaust 30-day budget in ~50 hours
- record: slo:burn_rate:5m
  expr: |
    (1 - sli:availability:ratio) / 0.001
```

SLO Dashboard Layout
A typical SLO dashboard might display:
Top Row: Status Summary (current SLO compliance and error budget remaining, color-coded against target)
Second Row: Current Performance (the live SLI value, burn rate, and time to exhaustion at the current burn rate)
Third Row: Time Series (SLI and error budget consumption plotted over the measurement window)
Detail Section: Breakdown (the SLI split by service, endpoint, or region for drill-down)
When error budget is healthy, teams can move fast—shipping features and accepting some risk. When budget is exhausted, teams focus on reliability. SLO dashboards don't just monitor; they guide engineering prioritization by making the reliability/velocity tradeoff visible.
We've covered the frameworks and principles that guide effective metric selection for dashboards. Let's consolidate the key insights:

- Every metric on a dashboard should answer a specific, actionable question, with user-facing metrics taking priority.
- The RED method (Rate, Errors, Duration) covers request-driven services; the USE method (Utilization, Saturation, Errors) covers resources; the Four Golden Signals combine both perspectives.
- Fewer, well-chosen metrics beat exhaustive dashboards; every metric must earn its place.
- Keep label cardinality bounded, and resist post-incident metric creep.
- SLO dashboards shift the focus from instantaneous values to error budget consumption and burn rate.
What's Next:
With an understanding of which metrics to display, we need to consider who is viewing the dashboard and what they need. The next page explores service-level dashboards—dashboards designed for engineering teams operating specific services.
You now have frameworks for selecting the metrics that belong on operational dashboards. The key insight: metric selection is an exercise in discipline and focus. The goal isn't to display every available metric—it's to display the metrics that enable understanding and action.