You understand dashboard design principles. You know which metrics to display. You've designed layouts for both service teams and executives. But now comes the practical challenge: which tools should you use? How do you implement these dashboards in the real world? How do you maintain them as systems evolve?

The observability tooling landscape is vast and rapidly evolving. Grafana, Datadog, New Relic, Kibana, Splunk, CloudWatch—each promises comprehensive visualization capabilities. Choosing among them, and then using them effectively, requires understanding their strengths, limitations, and the operational practices that determine long-term success.

This page bridges the gap between dashboard design theory and practical implementation. We'll explore the major tooling options, implementation patterns that scale, and the operational practices that keep dashboards useful as teams and systems grow.
By the end of this page, you will understand the major dashboard tooling categories and their tradeoffs, implementation patterns for dashboard-as-code, operational practices for dashboard maintenance, and common pitfalls to avoid in dashboard operations.
Modern observability platforms can be categorized by their architecture and positioning in the market. Understanding these categories helps make informed tooling decisions.

Category 1: Open Source Visualization Layers

These tools focus on visualization, connecting to various data sources:

Grafana
- The dominant open-source visualization platform
- Connects to 100+ data sources (Prometheus, InfluxDB, Elasticsearch, cloud metrics)
- Highly customizable with plugins and theming
- Strong community with abundant pre-built dashboards
- Self-hosted or Grafana Cloud SaaS

Kibana (OpenSearch Dashboards)
- Native visualization for Elasticsearch/OpenSearch
- Strong log visualization and exploration
- Limited compared to Grafana for general metrics
- Best paired with ELK/OpenSearch stack

Category 2: Integrated Observability Platforms

These provide metrics, logs, traces, AND visualization in a unified platform:
| Platform | Strengths | Considerations | Best For |
|---|---|---|---|
| Datadog | Unified platform, excellent UX, strong APM | Expensive at scale, vendor lock-in | Well-funded teams wanting all-in-one solution |
| New Relic | Strong APM heritage, full-stack observability | Pricing complexity, data ingest costs | Teams transitioning from APM focus |
| Splunk/SignalFx | Enterprise-grade, powerful search | Complex, expensive, steep learning curve | Enterprise environments with log-heavy needs |
| Dynatrace | AI-powered analysis, auto-discovery | High cost, complex pricing | Large enterprises with auto-instrumentation needs |
| Honeycomb | Query-centric approach, high cardinality | Newer, smaller ecosystem | Teams prioritizing exploration over fixed dashboards |
Category 3: Cloud-Native Solutions

Cloud provider-native tooling:

AWS CloudWatch
- Native to AWS, automatic for AWS services
- Limited customization compared to dedicated tools
- Cost-effective for AWS-only environments

Azure Monitor
- Integrated with Azure ecosystem
- Strong for Azure-native workloads
- Workbooks provide flexible visualization

Google Cloud Monitoring
- Prometheus-compatible query language
- Strong for GCP workloads
- MQL adds powerful analysis capabilities

Choosing Your Tool

Consider these factors:

| Factor | What to Evaluate |
|--------|------------------|
| Data Sources | Does it connect to all your metric sources? |
| Query Language | Is the query language learnable by your team? |
| Collaboration | Can teams easily share and collaborate on dashboards? |
| Alerting Integration | Does it integrate with your alerting workflow? |
| Cost | What's the total cost including infrastructure and team time? |
| Ecosystem | Are there pre-built dashboards and community resources? |
| Scalability | Does it handle your data volume and query load? |
Many organizations use multiple tools: Prometheus + Grafana for infrastructure metrics, Elasticsearch + Kibana for logs, and a cloud-native solution for vendor-specific services. Accept this complexity rather than forcing a single tool that does everything poorly.
Grafana has become the default choice for Kubernetes and cloud-native observability. Its ubiquity makes it worth understanding in depth.

Key Grafana Capabilities
Grafana Panel Types

Understanding panel types is essential for effective visualization:
| Panel Type | Best For | Example Use Case |
|---|---|---|
| Time Series | Metrics changing over time | Request rate, latency percentiles, error rate trends |
| Stat | Single current value with optional sparkline | Current QPS, error rate, overall health score |
| Gauge | Value against min/max thresholds | CPU utilization percentage, SLO compliance |
| Bar Gauge | Comparing values across categories | Response codes distribution, traffic by region |
| Table | Tabular data display | Endpoint breakdown, top-N queries, error details |
| Heatmap | Distribution over time | Latency distribution, request timing patterns |
| State Timeline | Status changes over time | Service health states, deployment status |
| Logs | Log line display with search | Error logs, specific trace logs |
| Alert List | Active alerts display | Current firing alerts for the service |
| Text/Markdown | Static information and links | Dashboard documentation, runbook links |
{ "title": "Payment Service Overview", "uid": "payment-service-prod", "tags": ["service", "payment", "production"], "timezone": "utc", "refresh": "30s", "templating": { "list": [ { "name": "environment", "type": "query", "query": "label_values(up, environment)", "current": { "value": "production" } }, { "name": "time_range", "type": "interval", "options": ["5m", "15m", "1h", "6h", "24h"], "current": { "value": "1h" } } ] }, "panels": [ { "title": "Error Rate", "type": "stat", "gridPos": { "x": 0, "y": 0, "w": 6, "h": 4 }, "targets": [ { "expr": "sum(rate(http_requests_total{status=~"5..", env="$environment"}[5m])) / sum(rate(http_requests_total{env="$environment"}[5m])) * 100", "legendFormat": "Error Rate %" } ], "fieldConfig": { "defaults": { "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": 0 }, { "color": "yellow", "value": 0.1 }, { "color": "red", "value": 1 } ] }, "unit": "percent" } } } ], "annotations": { "list": [ { "name": "Deployments", "datasource": "prometheus", "expr": "deployment_timestamp{service="payment"}" } ] }}Grafana Labs offers a full observability stack: Prometheus for metrics (Mimir for scale), Loki for logs, Tempo for traces, and Grafana for visualization. This integration provides unified querying and correlation across signals with open standards (OTLP support).
Manually created dashboards suffer from drift, inconsistency, and loss when someone accidentally deletes them. Dashboards as code applies software engineering practices—version control, code review, testing—to dashboard management.

Why Dashboards as Code?
Implementation Approaches

1. Grafana Provisioning (Native)

Grafana can automatically load dashboards from YAML or JSON files on startup:
```yaml
# /etc/grafana/provisioning/dashboards/all.yaml
apiVersion: 1
providers:
  - name: 'service-dashboards'
    orgId: 1
    folder: 'Services'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards/services

  - name: 'executive-dashboards'
    orgId: 1
    folder: 'Executive'
    type: file
    options:
      path: /var/lib/grafana/dashboards/executive

# Dashboard files in those directories are automatically loaded
```

2. Grafonnet (Jsonnet for Grafana)

Jsonnet is a data templating language. Grafonnet is a Jsonnet library for generating Grafana dashboards:
```jsonnet
local grafana = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';
local dashboard = grafana.dashboard;
local row = grafana.panel.row;
local timeSeries = grafana.panel.timeSeries;
local prometheus = grafana.query.prometheus;

local errorRatePanel =
  timeSeries.new('Error Rate')
  + timeSeries.queryOptions.withTargets([
    prometheus.new(
      'prometheus',
      'sum(rate(http_requests_total{status=~"5..",service="$service"}[5m])) / sum(rate(http_requests_total{service="$service"}[5m])) * 100'
    )
    + prometheus.withLegendFormat('Error %'),
  ])
  + timeSeries.standardOptions.withUnit('percent');

local latencyPanel =
  timeSeries.new('P99 Latency')
  + timeSeries.queryOptions.withTargets([
    prometheus.new(
      'prometheus',
      'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="$service"}[5m])) by (le))'
    )
    + prometheus.withLegendFormat('P99'),
  ])
  + timeSeries.standardOptions.withUnit('s');

dashboard.new('Service Overview')
+ dashboard.withUid('service-overview')
+ dashboard.withTags(['generated', 'service'])
+ dashboard.withPanels([
  row.new('Key Metrics'),
  errorRatePanel + { gridPos: { x: 0, y: 1, w: 12, h: 8 } },
  latencyPanel + { gridPos: { x: 12, y: 1, w: 12, h: 8 } },
])
```

3. Terraform/Pulumi Integration

Infrastructure-as-code tools can manage dashboards:
resource "grafana_dashboard" "payment_service" { folder = grafana_folder.services.id config_json = file("dashboards/payment-service.json") # Or generate dynamically config_json = templatefile("dashboards/service-template.json", { service_name = "payment" environment = "production" slo_target = 99.9 })} resource "grafana_folder" "services" { title = "Service Dashboards"} # Auto-generate dashboards for all servicesresource "grafana_dashboard" "services" { for_each = toset(["payment", "inventory", "shipping", "user"]) folder = grafana_folder.services.id config_json = templatefile("dashboards/service-template.json", { service_name = each.key environment = var.environment slo_target = var.service_slos[each.key] })}You don't need fancy tooling to start. Export existing dashboards as JSON, commit them to git, and use Grafana's provisioning to load them. Add Jsonnet/Terraform later when you need templating and automation.
Consistency across dashboards accelerates incident response—engineers know where to look regardless of which service is affected. Templates enforce this consistency.

Building a Dashboard Template System
Template Components

1. Standard Variables

Every dashboard should include consistent variable definitions:

```yaml
variables:
  - name: environment
    description: 'Production, staging, or development'
    values: [production, staging, development]

  - name: time_range
    description: 'Dashboard time window'
    values: [15m, 1h, 6h, 24h, 7d]

  - name: service
    description: 'Service to display (populated dynamically)'
    query: 'label_values(up, service)'
```

2. Standard Row Organization

Define row structure as a template:

```yaml
rows:
  - name: 'Health Summary'
    type: single-stat-row
    panels: [status, error_rate, latency_p99, traffic, alerts]

  - name: 'Key Metrics'
    type: time-series-row
    panels: [request_rate, error_rate_chart, latency_percentiles]

  - name: 'Dependencies'
    type: dependency-row
    panels: [database, cache, downstream_services]

  - name: 'Infrastructure'
    type: resource-row
    panels: [cpu, memory, pods, restarts]

  - name: 'Quick Links'
    type: link-row
    links: [logs, traces, runbook, deployments]
```

3. Standard Panel Definitions

Create reusable panel definitions:
```jsonnet
// lib/panels.libsonnet
{
  // Error rate stat panel - reusable across all service dashboards
  errorRateStat(service, environment):: {
    title: 'Error Rate',
    type: 'stat',
    targets: [{
      expr: 'sum(rate(http_requests_total{status=~"5..", service="%s", env="%s"}[5m])) / sum(rate(http_requests_total{service="%s", env="%s"}[5m])) * 100' % [service, environment, service, environment],
      legendFormat: 'Error %',
    }],
    fieldConfig: {
      defaults: {
        thresholds: {
          mode: 'absolute',
          steps: [
            { color: 'green', value: 0 },
            { color: 'yellow', value: 0.1 },
            { color: 'red', value: 1 },
          ],
        },
        unit: 'percent',
        decimals: 2,
      },
    },
  },

  // Latency percentiles panel - standard across services
  latencyPercentiles(service, environment):: {
    title: 'Latency Percentiles',
    type: 'timeseries',
    targets: [
      {
        expr: 'histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service="%s", env="%s"}[5m])) by (le))' % [service, environment],
        legendFormat: 'P50',
      },
      {
        expr: 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="%s", env="%s"}[5m])) by (le))' % [service, environment],
        legendFormat: 'P95',
      },
      {
        expr: 'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="%s", env="%s"}[5m])) by (le))' % [service, environment],
        legendFormat: 'P99',
      },
    ],
    fieldConfig: {
      defaults: { unit: 's' },
    },
  },
}
```

Usage in a service dashboard:

```jsonnet
local panels = import 'lib/panels.libsonnet';

[
  panels.errorRateStat('payment', 'production'),
  panels.latencyPercentiles('payment', 'production'),
]
```

Enforcing Template Usage

1. Automated generation — Teams don't manually create dashboards; CI/CD generates them from service metadata
2. Linting — Dashboard linters check that required panels exist and follow naming conventions (a sketch follows below)
3. Code review — Dashboard changes go through pull requests where reviewers check template adherence
4. Documentation — Clear guidelines on when to use templates vs. custom dashboards
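There is no single standard linter for this, so as an illustration of the linting step above, here is a minimal Python sketch. The required panel titles, the tag rule, and the `dashboards/` path are assumptions for this example, not an established convention.

```python
"""Lint generated dashboard JSON for required panels and naming conventions.

A minimal sketch: the required titles, tag rule, and dashboards/ path are
illustrative assumptions, not a standard.
"""
import json
import pathlib
import sys

REQUIRED_PANEL_TITLES = {"Error Rate", "Latency Percentiles"}   # hypothetical baseline
REQUIRED_TAGS = {"service"}                                      # hypothetical tag rule

failures = []
for path in pathlib.Path("dashboards").glob("*.json"):
    dashboard = json.loads(path.read_text())
    titles = {panel.get("title", "") for panel in dashboard.get("panels", [])}
    tags = set(dashboard.get("tags", []))

    missing_panels = REQUIRED_PANEL_TITLES - titles
    missing_tags = REQUIRED_TAGS - tags

    if missing_panels:
        failures.append(f"{path}: missing required panels {sorted(missing_panels)}")
    if missing_tags:
        failures.append(f"{path}: missing required tags {sorted(missing_tags)}")

if failures:
    print("\n".join(failures))
    sys.exit(1)  # fail the CI job so the change cannot merge
print("all dashboards pass template checks")
```

Wired into the same pipeline that generates the dashboards, a check like this keeps the standard rows from silently disappearing.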
Templates define the baseline, not the ceiling. Teams can add service-specific panels below the standard rows. The key is that the standard information is always present and in a predictable location.
Dashboards require ongoing maintenance. Without active lifecycle management, organizations accumulate outdated, broken, and redundant dashboards that erode trust in the observability platform.

The Dashboard Lifecycle
```
┌────────────────────────────────────────────────────────────┐
│                    DASHBOARD LIFECYCLE                     │
└────────────────────────────────────────────────────────────┘

  ┌─────────┐      ┌──────────┐      ┌─────────┐      ┌──────────┐
  │ CREATE  │──────│  REVIEW  │──────│ DEPLOY  │──────│ OPERATE  │
  └─────────┘      └──────────┘      └─────────┘      └────┬─────┘
                                                           │
                                                           ▼
  ┌──────────────────────────────────────────────────────────┐
  │ MAINTAIN                                                 │
  │  • Update queries for schema changes                     │
  │  • Fix broken panels                                     │
  │  • Adjust thresholds based on new baselines              │
  │  • Add new metrics as instrumentation improves           │
  └────┬─────────────────────────────────────────────────────┘
       │
       ▼
  ┌─────────┐
  │ RETIRE  │   • Service deprecated
  └─────────┘   • Replaced by newer dashboard
                • No longer accessed
```

Maintenance Practices

Regular Review Cadence

Schedule periodic dashboard reviews:

| Frequency | Review Focus |
|-----------|--------------|
| Weekly | Alert-linked dashboards: Are queries working? Are thresholds appropriate? |
| Monthly | Service dashboards: Do they reflect current architecture? |
| Quarterly | All dashboards: Which are unused? Which need updates? |
| Post-incident | Dashboards used during incident: What was missing? What was confusing? |

Usage Analytics

Track which dashboards are actually used:

- View counts over time
- Unique viewers
- Time spent on dashboard
- Search queries that lead to dashboard

Dashboards with zero views in 90 days are candidates for archival.

Ownership Clarity

Every dashboard should have a clear owner:

- Team or individual responsible for maintenance
- Contact information for questions
- Last reviewed date

Orphan dashboards without owners accumulate and eventually break.
Once per quarter, dedicate a day to dashboard cleanup. Delete unused dashboards, fix broken panels, update stale documentation, and ensure ownership is current. This prevents gradual decay that eventually makes the entire dashboard system untrustworthy.
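Part of that cleanup day is easy to automate. As a rough sketch, assuming the (purely conventional) practice of recording owners as dashboard tags like `owner:platform-team`, a script against Grafana's `/api/search` endpoint can produce a worklist of dashboards with no owner:

```python
"""List dashboards that have no owner tag, as a cleanup-day worklist.

A rough sketch: assumes a conventional "owner:<team>" tag on each dashboard
(not a Grafana feature), plus GRAFANA_URL and GRAFANA_TOKEN env variables.
"""
import os

import requests

GRAFANA_URL = os.environ["GRAFANA_URL"]
HEADERS = {"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"}

resp = requests.get(f"{GRAFANA_URL}/api/search", params={"type": "dash-db"}, headers=HEADERS)
resp.raise_for_status()

# A dashboard is an orphan if none of its tags declares an owner
orphans = [
    d for d in resp.json()
    if not any(tag.startswith("owner:") for tag in d.get("tags", []))
]

print(f"{len(orphans)} dashboards have no owner tag:")
for d in sorted(orphans, key=lambda d: d.get("folderTitle", "")):
    print(f"  {d.get('folderTitle', 'General')}/{d['title']}  (uid={d['uid']})")
```

Pair the output with whatever usage data your platform exposes (view counts, last viewed) to decide which dashboards get an owner and which get archived.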
Dashboards that take 30 seconds to load fail their purpose. Performance optimization is essential, especially for dashboards used during incidents when every second counts.

Common Performance Problems
Optimization Techniques

1. Recording Rules

Pre-compute frequently used aggregations:
```yaml
# Instead of computing on every dashboard load:
#   sum(rate(http_requests_total[5m])) by (service)

# Pre-compute with recording rules:
groups:
  - name: service_metrics
    rules:
      # Request rate by service
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)

      # Error rate by service
      - record: service:http_errors:rate5m
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

      # Error percentage by service
      - record: service:http_error_percentage:rate5m
        expr: |
          service:http_errors:rate5m
            /
          service:http_requests:rate5m
            * 100

      # Latency percentiles by service
      - record: service:http_latency:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          )

# Dashboard queries now simple and fast:
#   service:http_requests:rate5m{service="payment"}
```

2. Query Optimization

| Problem | Solution |
|---------|----------|
| High cardinality in label | Add label filter before aggregation |
| Unnecessary precision | Use larger rate windows (5m instead of 1m) |
| Full range scans | Use recording rules for aggregations |
| Many small queries | Combine into fewer queries with multiple series |
| Raw data for long ranges | Use downsampled data for historical views |

3. Dashboard Structure Optimization

- Lazy loading — Panels below the fold load on scroll
- Collapsed rows — Rows that aren't immediately needed start collapsed
- Appropriate refresh — 30s or 1m refresh is usually sufficient; 5s is rarely needed
- Panel consolidation — Multiple small stats can become one multi-stat panel
Most observability platforms provide query performance metrics. Monitor dashboard load times and slow queries. Set performance budgets: 'No dashboard should take more than 3 seconds to fully load.' Treat performance regressions as bugs.
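If your platform doesn't expose per-panel query timings, you can approximate a budget check yourself. Below is a rough Python sketch that times each Prometheus expression in a dashboard JSON file and flags slow ones; the dashboard path, the per-query budget, the `PROMETHEUS_URL`, and the template-variable substitutions are assumptions, and instant queries are only an approximation of what the dashboard actually issues.

```python
"""Time the Prometheus queries behind a dashboard and flag budget breakers.

A rough sketch: the dashboard path, per-query budget, Prometheus URL, and
template-variable substitutions are illustrative assumptions.
"""
import json
import pathlib
import time

import requests

PROMETHEUS_URL = "http://prometheus:9090"   # adjust for your environment (assumed)
QUERY_BUDGET_SECONDS = 1.0                  # per-panel budget, illustrative
SUBSTITUTIONS = {"$environment": "production", "$service": "payment"}

dashboard = json.loads(pathlib.Path("dashboards/payment-service-prod.json").read_text())

for panel in dashboard.get("panels", []):
    for target in panel.get("targets", []):
        expr = target.get("expr")
        if not expr:
            continue
        # Replace Grafana template variables with concrete values for the test
        for var, value in SUBSTITUTIONS.items():
            expr = expr.replace(var, value)

        start = time.monotonic()
        resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": expr})
        resp.raise_for_status()
        elapsed = time.monotonic() - start

        status = "SLOW" if elapsed > QUERY_BUDGET_SECONDS else "ok"
        print(f"[{status}] {elapsed:.2f}s  {panel.get('title', 'untitled')}: {expr[:80]}")
```

A check like this can run nightly and turn "the dashboard feels slow" into a concrete list of queries to move into recording rules.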
Technical excellence in dashboard design means nothing if organizational practices don't support their effective use. These practices determine whether dashboards remain valuable over time.

Dashboard Governance
Naming and Tagging Conventions

Adopt a consistent dashboard naming pattern such as [Team]-[Service]-[Type] (e.g., platform-payment-service, exec-business-health), and apply standard tags such as team:platform, type:service, env:production so dashboards are easy to search and filter.

Documentation Standards

Every dashboard should include the following (one way to attach it to the dashboard itself is sketched after this section):

1. Purpose Statement
- Who is this dashboard for?
- What questions does it answer?
- When should someone use this dashboard?

2. Panel Documentation
- What does each panel show?
- What do the thresholds mean?
- How to interpret abnormal values?

3. Links to Related Resources
- Runbooks for common issues
- Related dashboards for drill-down
- Metric definitions and instrumentation documentation

4. Owner Information
- Owning team and contact
- Last reviewed date
- How to request changes

Training and Onboarding

Dashboard effectiveness depends on people knowing how to use them:

- Onboarding sessions — New team members get dashboard walkthrough
- Incident reviews — Include 'How did dashboards help/hinder?'
- Documentation — Written guides for dashboard navigation
- Office hours — Regular sessions where teams can ask dashboard questions
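One lightweight way to keep that documentation attached to the dashboard itself is to inject a standard Text/Markdown panel at the top of each dashboard JSON before provisioning. The following is a sketch of one possible convention; the per-dashboard `metadata.json` file and its fields are hypothetical, not a Grafana standard.

```python
"""Inject a standard documentation panel into a dashboard JSON file.

A sketch of one possible convention: the metadata.json file (purpose, owner,
last_reviewed, runbook_url) is a hypothetical companion file, not a standard.
"""
import json
import pathlib

dash_path = pathlib.Path("dashboards/payment-service-prod.json")
meta = json.loads(pathlib.Path("dashboards/payment-service-prod.metadata.json").read_text())

doc_panel = {
    "type": "text",                      # Grafana Text/Markdown panel
    "title": "About this dashboard",
    "gridPos": {"x": 0, "y": 0, "w": 24, "h": 3},
    "options": {
        "mode": "markdown",
        "content": (
            f"**Purpose:** {meta['purpose']}\n\n"
            f"**Owner:** {meta['owner']} | **Last reviewed:** {meta['last_reviewed']}\n\n"
            f"**Runbook:** {meta['runbook_url']}"
        ),
    },
}

dashboard = json.loads(dash_path.read_text())
# Prepend the documentation panel so it renders at the top of the dashboard
dashboard["panels"] = [doc_panel] + dashboard.get("panels", [])
dash_path.write_text(json.dumps(dashboard, indent=2) + "\n")
```

Because the panel is generated, the purpose statement, owner, and runbook link stay current as long as the metadata file is reviewed alongside the dashboard.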
Designate a 'dashboard champion' or working group responsible for dashboard standards, template maintenance, and helping teams create effective dashboards. Without clear ownership, dashboard quality is nobody's job and therefore nobody's priority.
We've covered the practical side of dashboard implementation—from tooling selection to organizational practices. Let's consolidate the key insights:
Module Complete:

You've now completed the comprehensive guide to Dashboards and Visualization. From design principles and metric selection through service and executive dashboards to practical tooling and best practices, you have the knowledge to create dashboards that actually work—dashboards that communicate system health clearly, accelerate incident response, and serve stakeholders from engineers to executives.
You now have a comprehensive understanding of dashboard design and implementation. The key insight across all topics: dashboards exist to translate data into understanding and action. Every design decision, metric choice, and operational practice should serve this translation. Great dashboards don't just display metrics—they tell the story of your system's health in a way that enables confident, rapid decision-making.