It's 2 AM and your phone buzzes with an alert: 'Payment service error rate elevated.' You stumble to your laptop, open the dashboard, and... confusion. Where's the error happening? Is it getting worse? Is the database slow? Is a downstream dependency failing?
The dashboard shows overall metrics but doesn't answer the questions you're desperately asking. You start opening multiple tools, running ad-hoc queries, correlating timestamps manually. By the time you understand the problem, 20 minutes have passed and customers have abandoned their carts.
This is the failure mode of poorly designed service dashboards.
Service-level dashboards are the primary interface between engineering teams and their running services. They must support two distinct modes: routine monitoring (glancing to confirm everything is healthy) and incident investigation (rapidly diagnosing problems under pressure). A dashboard that serves only one mode fails engineers when they need it most.
By the end of this page, you will understand how to design service-level dashboards that support both operational modes. You'll learn the anatomy of an effective service dashboard, how to structure information for rapid understanding, and patterns that accelerate incident diagnosis.
Service dashboards serve the team responsible for operating a specific service. Unlike executive dashboards (covered later), service dashboards prioritize technical depth over breadth, operational utility over simplicity.
The Two Operational Modes
Engineers interact with service dashboards in two fundamentally different contexts:
Design for Both Modes
The challenge is designing a single dashboard that serves both modes effectively. The solution is progressive disclosure:
Questions the Dashboard Must Answer
A service dashboard should rapidly answer these questions:
Design each service dashboard as if you'll be woken at 3 AM and must use it with half-functioning cognition. Can you determine 'is this real?' and 'where do I look?' within 60 seconds? If not, the dashboard will fail you when you need it most.
An effective service dashboard follows a consistent structure that teams can learn once and apply to any service. This consistency accelerates incident response—engineers don't waste time figuring out where to look.
```
╔═══════════════════════════════════════════════════════════════════════════════╗
║ HEADER: Service Name | Environment | Time Range Selector | Refresh Status     ║
╠═══════════════════════════════════════════════════════════════════════════════╣
║ ROW 1: HEALTH SUMMARY (5-second comprehension)                                ║
║ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐    ║
║ │ STATUS     │ │ ERROR      │ │ LATENCY    │ │ TRAFFIC    │ │ ALERTS     │    ║
║ │ ● OK       │ │ RATE       │ │ P99        │ │ QPS        │ │ COUNT      │    ║
║ │ 99.98%     │ │ 0.02%      │ │ 145ms      │ │ 12.5k      │ │ 0          │    ║
║ │ ▲ 0.01%    │ │ ▼ 15%      │ │ ▲ 12ms     │ │ ▲ 5%       │ │            │    ║
║ └────────────┘ └────────────┘ └────────────┘ └────────────┘ └────────────┘    ║
╠═══════════════════════════════════════════════════════════════════════════════╣
║ ROW 2: KEY METRICS OVER TIME (30-second assessment)                           ║
║ ┌───────────────────────────────────┐ ┌───────────────────────────────────┐   ║
║ │ Request Rate (4h)  ▲ Deploy       │ │ Error Rate (4h)                   │   ║
║ │   __/‾‾‾\___         marker       │ │  ___/‾‾‾\____ SLO threshold line  │   ║
║ └───────────────────────────────────┘ └───────────────────────────────────┘   ║
║ ┌───────────────────────────────────┐ ┌───────────────────────────────────┐   ║
║ │ Latency Percentiles (4h)          │ │ Latency Heatmap (4h)              │   ║
║ │ p99: __/\__  p95: ____  p50: __   │ │ ░░░▒▒▒▓▓██▓▓▒▒░░░░░░░░░░░░░░░     │   ║
║ └───────────────────────────────────┘ └───────────────────────────────────┘   ║
╠═══════════════════════════════════════════════════════════════════════════════╣
║ ROW 3: DEPENDENCY HEALTH (Where is the problem?)                              ║
║ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐           ║
║ │ Database     │ │ Cache        │ │ Auth Service │ │ Payment API  │           ║
║ │ ⚠ Degraded   │ │ ● Healthy    │ │ ● Healthy    │ │ ● Healthy    │           ║
║ │ 45ms → 340ms │ │ 99.9% hit    │ │ 12ms p99     │ │ 89ms p99     │           ║
║ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘           ║
╠═══════════════════════════════════════════════════════════════════════════════╣
║ ROW 4: ENDPOINT BREAKDOWN (Where specifically?)                               ║
║ ┌───────────────────────────┬─────────┬─────────┬────────┬────────┐           ║
║ │ Endpoint                  │ Rate    │ Errors  │ P99    │ Status │           ║
║ │ POST /api/v1/checkout     │ 2.4k/s  │ 0.8%    │ 890ms  │ ⚠      │           ║
║ │ GET  /api/v1/products     │ 8.1k/s  │ 0.01%   │ 45ms   │ ●      │           ║
║ │ POST /api/v1/cart         │ 1.2k/s  │ 0.02%   │ 67ms   │ ●      │           ║
║ │ GET  /api/v1/user         │ 3.8k/s  │ 0.01%   │ 23ms   │ ●      │           ║
║ └───────────────────────────┴─────────┴─────────┴────────┴────────┘           ║
╠═══════════════════════════════════════════════════════════════════════════════╣
║ ROW 5: INFRASTRUCTURE (Resource health)                                       ║
║ ┌───────────────────────────────────┐ ┌───────────────────────────────────┐   ║
║ │ CPU Usage (4h)                    │ │ Memory Usage (4h)                 │   ║
║ │ By pod, with aggregate            │ │ By pod, with limits shown         │   ║
║ └───────────────────────────────────┘ └───────────────────────────────────┘   ║
║ ┌───────────────────────────────────┐ ┌───────────────────────────────────┐   ║
║ │ Pod Count (4h)                    │ │ Restarts & Errors (4h)            │   ║
║ │ Ready vs Desired, HPA activity    │ │ CrashLoops, OOMKills              │   ║
║ └───────────────────────────────────┘ └───────────────────────────────────┘   ║
╠═══════════════════════════════════════════════════════════════════════════════╣
║ ROW 6: QUICK LINKS                                                            ║
║ [ Logs ] [ Traces ] [ Runbook ] [ Recent Deploys ] [ Alerts Config ]          ║
╚═══════════════════════════════════════════════════════════════════════════════╝
```

Row-by-Row Explanation
Header Row: Service identification and controls
Row 1 (Health Summary): Instant status comprehension
Row 2 (Key Metrics Over Time): Recent history and patterns
Row 3 (Dependencies): Upstream/downstream health
Row 4 (Endpoint Breakdown): Localize problems
Row 5 (Infrastructure): Resource visibility
Row 6 (Quick Links): Investigation pathways
Use the same dashboard template for all services in your organization. Engineers on-call for unfamiliar services should immediately know where to look. Template-based dashboards also simplify maintenance—improvements benefit all services at once.
The health summary row is the most important section of the dashboard—it's what engineers see first and what they check during routine monitoring. Every element must be optimized for instant comprehension.
| Panel | Primary Value | Supporting Info | Thresholds |
|---|---|---|---|
| Overall Status | Health score or SLO compliance % | Trend arrow, sparkline | Green: >99.9%, Yellow: 99-99.9%, Red: <99% |
| Error Rate | Current error % (5m window) | Absolute error count, trend | Green: <0.1%, Yellow: 0.1-1%, Red: >1% |
| Latency P99 | 99th percentile response time | P50 for context, trend | Service-specific, typically <500ms |
| Traffic | Requests per second | Comparison to baseline, trend | Anomaly-based (deviation from expected) |
| Active Alerts | Count of firing alerts | Highest severity indicator | Green: 0, Yellow: warnings only, Red: any critical |
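As a sketch, the error-rate and traffic stat panels above could be backed by queries like the following. The metric name `http_requests_total` and the `service` and `status` labels are assumptions from typical HTTP instrumentation; substitute whatever your own services emit.

```promql
# Error-rate stat panel: percentage of 5xx responses over a 5m window
sum(rate(http_requests_total{service="payment-service", status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="payment-service"}[5m])) * 100

# Traffic stat panel: current requests per second
sum(rate(http_requests_total{service="payment-service"}[5m]))
```

The 5-minute rate window trades responsiveness for stability: short enough that a real spike shows within minutes, long enough that a single slow scrape doesn't flap the panel's color.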
The Overall Status Panel
Consider including an aggregated 'overall status' panel that synthesizes multiple signals into a single indicator. This could be:
The aggregation logic must be transparent. Engineers should understand why the overall status is yellow—clicking should reveal which components contribute.
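One transparent way to build such an aggregate, sketched below, is to express each component as a boolean check and let the panel take the worst of them. Metric names and thresholds here are illustrative assumptions, not a prescribed implementation:

```promql
# 1 if the 5m error rate is under 1%, else 0
(
  sum(rate(http_requests_total{service="payment-service", status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{service="payment-service"}[5m]))
) < bool 0.01

# 1 if p99 latency is under 500ms, else 0
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="payment-service"}[5m])) by (le)
) < bool 0.5
```

Because each check is a separate, named query, clicking through from the aggregate panel can show exactly which check returned 0 and turned the status yellow or red.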
A dashboard that shows green when problems exist is worse than useless—it breeds complacency. Tune thresholds conservatively. It's better to investigate a false positive than to miss a real problem. Regularly review: 'Were there incidents where the dashboard showed healthy?'
Most service problems originate outside the service itself—in databases, downstream services, or infrastructure. A service dashboard that only shows internal metrics forces engineers to hunt for causes elsewhere.
What to Show for Each Dependency
Dependency Categories
Direct Dependencies (Critical Path)
These dependencies directly impact request success. Problems here immediately affect users.
Async Dependencies (Background)
These don't block requests but failures may cause delayed effects or functionality degradation.
Infrastructure Dependencies
Often overlooked, infrastructure problems can manifest as mysterious service failures.
```promql
# Latency to database (client-side measured)
histogram_quantile(0.99,
  sum(rate(db_client_request_duration_seconds_bucket{
    service="payment-service"
  }[5m])) by (le, database)
)

# Error rate to downstream service
sum(rate(http_client_requests_total{
  service="payment-service",
  target_service="inventory-service",
  status=~"5.."
}[5m]))
/
sum(rate(http_client_requests_total{
  service="payment-service",
  target_service="inventory-service"
}[5m])) * 100

# Connection pool utilization
db_pool_connections_active{service="payment-service"}
/
db_pool_connections_max{service="payment-service"} * 100

# Circuit breaker state (1=open, 0=closed)
circuit_breaker_state{service="payment-service", dependency="payment-gateway"}
```

Always measure dependency health from your service's perspective (client-side), not the dependency's perspective (server-side). A database might report 100% health while your service experiences timeouts due to network issues. Client-side metrics reflect actual user impact.
Aggregate service metrics can hide localized problems. A 0.5% overall error rate might represent 50% errors on a critical endpoint masked by healthy high-volume endpoints. Breakdowns localize problems.
What to Break Down By
| Dimension | What It Reveals | When to Use |
|---|---|---|
| Endpoint/Route | Which API paths are affected | Almost always—primary breakdown |
| HTTP Method | GET vs POST behavior differences | When methods have different characteristics |
| Status Code | Types of errors (4xx vs 5xx) | During error investigation |
| Region/Datacenter | Geographic distribution of problems | Multi-region services |
| Customer/Tenant | Which customers are affected | Multi-tenant services, enterprise focus |
| Version/Canary | Old vs new code behavior | During rollouts |
| Host/Pod | Instance-specific issues | Debugging specific instances |
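The primary breakdown can be expressed by grouping the standard error-rate and latency queries by route. As elsewhere, the metric names and the `route`/`method` labels are assumptions from common HTTP instrumentation:

```promql
# Per-endpoint error rate for the breakdown table
sum(rate(http_requests_total{service="payment-service", status=~"5.."}[5m]))
  by (route, method)
/
sum(rate(http_requests_total{service="payment-service"}[5m]))
  by (route, method) * 100

# Per-endpoint p99 latency for the table's P99 column
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="payment-service"}[5m]))
    by (le, route, method)
)
```

Grouping by both `route` and `method` keeps GET and POST to the same path distinct, which matters when the write path is orders of magnitude slower than the read path.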
Endpoint Table Design
The endpoint breakdown should be a sortable, scannable table:
┌─────────────────────────────────────────────────────────────────────────────┐
│ ▼ Sort by: Error Rate Filter: [ ] │
├─────────────────────────────────────────────────────────────────────────────┤
│ Endpoint │ Rate/s │ Errors │ P50 │ P99 │ Status│
├───────────────────────────────┼─────────┼─────────┼────────┼────────┼───────┤
│ POST /api/v1/checkout │ 2,412 │ 0.82% │ 234ms │ 890ms │ ⚠ │
│ POST /api/v1/payment/process │ 847 │ 0.45% │ 567ms │ 1.2s │ ⚠ │
│ GET /api/v1/cart │ 5,234 │ 0.02% │ 23ms │ 89ms │ ● │
│ GET /api/v1/products/{id} │ 12,456 │ 0.01% │ 12ms │ 45ms │ ● │
│ POST /api/v1/cart/add │ 3,234 │ 0.01% │ 34ms │ 123ms │ ● │
└─────────────────────────────────────────────────────────────────────────────┘
Key Design Elements:
Handling High-Cardinality Endpoints
RESTful APIs often include IDs in paths: /users/12345/orders/67890. If tracked literally, this creates unbounded cardinality. Solutions:
`/users/{userId}/orders/{orderId}`

Most observability platforms support path normalization. Configure this during instrumentation.
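If instrumentation-time normalization isn't available, a query-time fallback sketch is to collapse numeric IDs with `label_replace`. The raw `path` label and the example regex are assumptions; this only papers over cardinality already stored, so fixing the instrumentation remains the real solution:

```promql
# Rewrite concrete paths like /users/12345/orders/67890 into one
# route template so all such series aggregate together
sum by (route) (
  label_replace(
    rate(http_requests_total{service="payment-service"}[5m]),
    "route", "/users/{userId}/orders/{orderId}",
    "path", "/users/[0-9]+/orders/[0-9]+"
  )
)
```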
Not all endpoints deserve equal attention. Identify your service's critical paths—the endpoints that must work for the business to function. Highlight these in the breakdown or create a separate critical path panel. An error on checkout is more important than an error on recently-viewed-items.
While service dashboards should prioritize user-facing metrics, infrastructure visibility helps diagnose resource-related issues. The goal isn't comprehensive infrastructure monitoring—it's providing enough context to correlate service problems with resource constraints.
Essential Infrastructure Panels
Container-Specific Considerations
Containerized services running in Kubernetes require specific metrics:
| Metric | What It Reveals | Red Flag |
|---|---|---|
| CPU throttle rate | Resource limits too low | >5% throttled time |
| Memory vs. limit | OOM risk | >80% of limit |
| Pod restart count | Stability issues | Any restarts in 24h |
| Ready pod ratio | Deployment health | Ready < desired |
| Evicted pods | Resource pressure | Any evictions |
| ImagePullErrors | Container registry issues | Any errors |
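The first two red flags in the table can be computed from the standard cAdvisor and kube-state-metrics series. The metric names below are the conventional ones those exporters emit; the `namespace` value is a placeholder assumption:

```promql
# Fraction of CPU scheduling periods in which the container was throttled
sum(rate(container_cpu_cfs_throttled_periods_total{namespace="shop"}[5m])) by (pod)
/
sum(rate(container_cpu_cfs_periods_total{namespace="shop"}[5m])) by (pod)

# Memory working set as a fraction of the configured limit (OOM risk as it nears 1)
sum(container_memory_working_set_bytes{namespace="shop"}) by (pod)
/
sum(kube_pod_container_resource_limits{namespace="shop", resource="memory"}) by (pod)
```

Working-set bytes, not RSS or cache, is what the kernel's OOM killer effectively evaluates against the limit, which is why it's the right numerator here.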
Avoiding Infrastructure Clutter
A common mistake is showing too much infrastructure detail. Remember: the service dashboard serves the service team, not the platform team. Include infrastructure metrics that:
Deeper infrastructure investigation belongs in platform-team dashboards, not service dashboards.
Low CPU usage doesn't mean the service is healthy. I/O-bound services, services waiting on slow dependencies, and services stuck in garbage collection all show low CPU while experiencing severe problems. CPU is context, not health verdict.
Metrics without temporal context are difficult to interpret. Is 200ms latency good or bad? Is 10,000 requests/second normal traffic or an attack? Time range controls and comparisons provide this essential context.
Time Range Selection
Comparison Modes
Modern dashboards should support temporal comparisons:
Hour-over-Hour: Compare current hour to same hour yesterday. Useful for detecting anomalies in daily patterns.
Day-over-Day: Compare today to same day last week. Accounts for weekly patterns (Monday vs. Sunday traffic).
Week-over-Week: Compare this week to last week. Useful for trending and growth analysis.
Comparison to Baseline: Compare to a defined 'normal' period. Useful during events (Black Friday vs. normal Friday).
Visual Comparison Techniques
```promql
# Current error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# Error rate 1 week ago (for comparison overlay)
sum(rate(http_requests_total{status=~"5.."}[5m] offset 1w))
/ sum(rate(http_requests_total[5m] offset 1w)) * 100

# Percentage change in request rate vs. 24h ago
(
  sum(rate(http_requests_total[5m]))
  - sum(rate(http_requests_total[5m] offset 24h))
)
/ sum(rate(http_requests_total[5m] offset 24h)) * 100

# Anomaly detection: current value vs. weekly average
sum(rate(http_requests_total[5m]))
/ avg_over_time(sum(rate(http_requests_total[5m]))[7d:1h]) - 1
```

Don't make engineers select comparison periods—show the most useful comparison by default. For daily patterns, week-over-week comparison (same day last week) is often most informative. Include the comparison as part of the standard dashboard, not as an option engineers must remember to enable.
We've covered the structure and design patterns that make service dashboards effective operational tools. Let's consolidate the key insights:
What's Next:
Service dashboards serve engineering teams operating individual services. But leadership needs different information: aggregate health across many services, business impact, and high-level trends. The next page explores executive dashboards—designed for non-technical stakeholders and cross-organizational visibility.
You now understand how to design service dashboards that actually work during incidents. The key insight: every dashboard element should accelerate understanding. If a panel doesn't help answer 'is it broken?', 'where is it broken?', or 'why is it broken?', it doesn't belong on the service dashboard.