Dashboards transform raw metrics into actionable intelligence. While alerts tell you something is wrong, dashboards show you what's happening—providing the context needed to understand problems, verify fixes, and spot trends before they become incidents.
A well-designed dashboard serves multiple audiences: on-call engineers triaging an incident, DBAs tuning performance, and leadership tracking long-term trends each need a view shaped to their questions.
This page explores dashboard design principles, essential database dashboards every team needs, and best practices for visualization that genuinely aids decision-making.
By the end of this page, you will understand dashboard design principles and effective visualization techniques, build essential database monitoring dashboards for different use cases, apply information hierarchy to guide users to relevant data, and create dashboards that support both proactive monitoring and reactive troubleshooting.
Effective dashboards aren't created by placing random metrics on a screen. They require intentional design based on how humans perceive and process visual information.
The Five-Second Rule:
When someone opens a dashboard, they should understand the overall status within five seconds. If they have to study panels to figure out 'is everything okay?', the dashboard has failed its primary purpose.
Key Design Principles:
| Principle | Description | Implementation |
|---|---|---|
| Information Hierarchy | Most important information should be most prominent | Red/green status at top; details below; drill-down for specifics |
| Progressive Disclosure | Start simple, reveal complexity on demand | Overview dashboard → Category dashboards → Detail dashboards |
| Consistent Layout | Same metric types in same positions across dashboards | Resource metrics always top-right; query metrics always center |
| Meaningful Defaults | Time range and filters that make sense for the use case | Last 6 hours for operations; last 30 days for capacity planning (see the sketch after this table) |
| Visual Hierarchy | Use color, size, and position intentionally | Errors in red; warnings in yellow; healthy in green; larger = more important |
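To make "meaningful defaults" concrete, the fragment below (referenced in the table above) pins a dashboard's default time range and refresh interval at the dashboard level. It is a minimal sketch; the six-hour window and one-minute refresh are assumptions for an operations-focused dashboard, not requirements.

```json
{
  "title": "Database Operations Overview",
  "description": "Defaults assume an operations use case: last 6 hours, refreshed every minute",
  "time": { "from": "now-6h", "to": "now" },
  "refresh": "1m",
  "timezone": "browser"
}
```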
Anti-Patterns to Avoid:
Your dashboard will be used at 3 AM by someone who was just woken up, is stressed, and needs to quickly understand what's wrong. Optimize for that moment—not for demonstrations to management. Clarity over comprehensiveness.
A single dashboard cannot serve all purposes. Effective monitoring uses a hierarchy of dashboards, each designed for specific use cases.
Three-Tier Dashboard Architecture:
Tier Descriptions:
| Tier | Purpose | Typical Time Range | Audience |
|---|---|---|---|
| Tier 1: Overview | Is everything healthy? Where are problems? | Last 24 hours | Everyone; first stop during incidents |
| Tier 2: Category | Database health summary; all instances at once | Last 6 hours | DBAs, on-call engineers |
| Tier 3: Detail | Deep dive into single instance/component | Last 1-6 hours | DBAs during active troubleshooting |
Essential Database Dashboard Set: at minimum, every team needs a fleet overview, an instance detail dashboard, and a query performance dashboard, plus specialized views for replication, storage, maintenance operations, and backup status. Each is covered in the sections that follow.
Each dashboard should link to related dashboards. An alert fires, the on-call engineer clicks to the fleet overview, spots the red database, clicks through to the instance detail, identifies the query problem, drills into query performance. This navigation flow should be seamless.
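In Grafana, one way to make that flow seamless is with dashboard links that carry the current time range and template variables along as you click through. The sketch below assumes your dashboards are tagged (the tag names here are illustrative); `keepTime` and `includeVars` ensure the engineer lands on the next dashboard already scoped to the same window and instance.

```json
{
  "links": [
    {
      "title": "Drill down: instance detail",
      "type": "dashboards",
      "tags": ["database", "instance-detail"],
      "asDropdown": true,
      "keepTime": true,
      "includeVars": true
    },
    {
      "title": "Drill down: query performance",
      "type": "dashboards",
      "tags": ["database", "query-performance"],
      "asDropdown": true,
      "keepTime": true,
      "includeVars": true
    }
  ]
}
```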
The fleet overview is your command center—showing health status across all database instances at a glance. It answers: 'Are all my databases healthy, and if not, which ones need attention?'
Essential Fleet Overview Components:
| Section | Visualization | Purpose |
|---|---|---|
| Health Status Grid | Colored tiles per instance | Immediate visual: green=healthy, yellow=warning, red=critical |
| Active Alerts Panel | List with severity/time | Current problems requiring attention |
| Key Metrics Summary | Stat panels/gauges | QPS, active connections, replication lag aggregates |
| Resource Utilization | Bar gauges per instance | CPU, memory, disk at-a-glance for all instances |
| Recent Events | Timeline/log panel | Failovers, restarts, configuration changes |
Grafana Panel Configuration Examples:
{ "title": "Database Fleet Health", "panels": [ { "title": "Instance Health Status", "type": "stat", "gridPos": { "x": 0, "y": 0, "w": 24, "h": 4 }, "targets": [ { "expr": "pg_up", "legendFormat": "{{ instance }}" } ], "options": { "colorMode": "background", "graphMode": "none" }, "fieldConfig": { "defaults": { "mappings": [ { "type": "value", "options": { "0": { "text": "DOWN", "color": "red" } } }, { "type": "value", "options": { "1": { "text": "UP", "color": "green" } } } ], "thresholds": { "mode": "absolute", "steps": [ { "value": 0, "color": "red" }, { "value": 1, "color": "green" } ] } } } }, { "title": "Queries Per Second - All Instances", "type": "timeseries", "gridPos": { "x": 0, "y": 4, "w": 12, "h": 8 }, "targets": [ { "expr": "sum(rate(pg_stat_database_xact_commit[5m])) by (instance)", "legendFormat": "{{ instance }}" } ] }, { "title": "Connection Utilization", "type": "bargauge", "gridPos": { "x": 12, "y": 4, "w": 12, "h": 8 }, "targets": [ { "expr": "(pg_stat_activity_count / pg_settings_max_connections) * 100", "legendFormat": "{{ instance }}" } ], "fieldConfig": { "defaults": { "unit": "percent", "max": 100, "thresholds": { "steps": [ { "value": 0, "color": "green" }, { "value": 70, "color": "yellow" }, { "value": 90, "color": "red" } ] } } } }, { "title": "Replication Lag", "type": "stat", "gridPos": { "x": 0, "y": 12, "w": 12, "h": 4 }, "targets": [ { "expr": "pg_replication_lag_seconds", "legendFormat": "{{ instance }}" } ], "fieldConfig": { "defaults": { "unit": "s", "thresholds": { "steps": [ { "value": 0, "color": "green" }, { "value": 30, "color": "yellow" }, { "value": 60, "color": "red" } ] } } } } ]}Grafana template variables let users filter by environment, cluster, or instance without duplicating dashboards. One well-designed dashboard with variables serves better than ten nearly-identical dashboards.
Once you've identified a problematic instance from the fleet view, the instance detail dashboard provides deep visibility into that specific database.
Instance Dashboard Layout:
Key Visualizations for Instance Dashboard:
| Metric Category | Chart Type | Why This Type |
|---|---|---|
| QPS / TPS over time | Time series line chart | Shows trends, spikes, correlations with time |
| Latency percentiles | Time series with p50/p95/p99 lines | Shows distribution and tail latency |
| Current connections | Stacked bar or stat | Categorical breakdown (active/idle/waiting) |
| CPU usage | Time series or gauge | Time series for correlation; gauge for current state |
| Disk usage | Gauge with thresholds | Immediately visible capacity status |
| Top queries | Table with sortable columns | Details needed; charts insufficient |
| Lock waits | Heatmap | Shows time-based patterns in contention |
{ "title": "Query Latency Percentiles", "type": "timeseries", "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }, "targets": [ { "expr": "histogram_quantile(0.50, sum(rate(pg_query_duration_seconds_bucket{instance=~"$instance"}[5m])) by (le))", "legendFormat": "p50" }, { "expr": "histogram_quantile(0.95, sum(rate(pg_query_duration_seconds_bucket{instance=~"$instance"}[5m])) by (le))", "legendFormat": "p95" }, { "expr": "histogram_quantile(0.99, sum(rate(pg_query_duration_seconds_bucket{instance=~"$instance"}[5m])) by (le))", "legendFormat": "p99" } ], "fieldConfig": { "defaults": { "unit": "s", "custom": { "lineWidth": 2, "fillOpacity": 10 } }, "overrides": [ { "matcher": { "id": "byName", "options": "p99" }, "properties": [{ "id": "color", "value": { "fixedColor": "red" } }] }, { "matcher": { "id": "byName", "options": "p95" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow" } }] }, { "matcher": { "id": "byName", "options": "p50" }, "properties": [{ "id": "color", "value": { "fixedColor": "green" } }] } ] }, "options": { "thresholdsStyle": { "mode": "line" } }}Grafana annotations can overlay deployment events, configuration changes, and incident markers on your time series. When investigating a problem, seeing 'deployment at 14:32' right before metrics degraded immediately provides context that would otherwise require manual correlation.
Query performance is often where database problems manifest. This dashboard helps identify problematic queries and understand query patterns.
Query Performance Dashboard Components:
| Panel | Content | Actionable Insight |
|---|---|---|
| Top Queries by Total Time | Ranked query fingerprints | Optimize these first for maximum impact |
| Top Queries by Execution Count | Most frequent queries | Even small improvements multiply significantly |
| Slow Query Trend | Slow query count over time | Correlate with deployments, traffic patterns |
| Query Duration Heatmap | Time vs duration buckets | Visualize when slow queries occur |
| Rows Examined vs Returned | Ratio trend | High ratio indicates index issues |
| Temp Table Usage | Queries creating temp tables | Memory pressure indicators |
pg_stat_statements Integration:
For PostgreSQL, the pg_stat_statements extension provides the data for query analysis:
```sql
-- Top queries by total execution time
-- Use this as the query for a table panel
SELECT
    left(query, 80) AS query_preview,
    calls,
    round(total_exec_time::numeric, 2) AS total_time_ms,
    round(mean_exec_time::numeric, 2) AS avg_time_ms,
    round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct_total,
    rows,
    round(rows::numeric / calls, 2) AS rows_per_call
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```

```yaml
# For Grafana/Prometheus, use postgres_exporter with pg_stat_statements
# Query configured in postgres_exporter's queries.yaml:
pg_stat_statements:
  query: |
    SELECT queryid, calls, total_exec_time, mean_exec_time, rows
    FROM pg_stat_statements
    WHERE query NOT LIKE '%pg_stat_statements%'
    ORDER BY total_exec_time DESC
    LIMIT 50
  metrics:
    - queryid:
        usage: "LABEL"
        description: "Query ID"
    - calls:
        usage: "COUNTER"
        description: "Number of times executed"
    - total_exec_time:
        usage: "COUNTER"
        description: "Total execution time (ms)"
    - mean_exec_time:
        usage: "GAUGE"
        description: "Mean execution time (ms)"
    - rows:
        usage: "COUNTER"
        description: "Total rows returned"
```

Query Analysis Workflow: identify the top queries by total time, run EXPLAIN ANALYZE on the worst offenders, apply an index or query rewrite, then verify the improvement on this dashboard.
Query text may contain sensitive data (customer IDs, emails in WHERE clauses). Use query fingerprints (normalized queries with parameters replaced) in dashboards. Full query text should require explicit access to the database.
Specialized dashboards focus on specific subsystems that require dedicated monitoring attention.
Replication Dashboard:
| Panel | Metrics | Alert Correlation |
|---|---|---|
| Replication Topology | Primary/replica relationships, network diagram | Visual understanding of replication architecture |
| Replication Lag Time Series | Lag in seconds over time per replica | ReplicationLag alert context |
| Bytes Behind Primary | WAL position difference | Understanding severity of lag |
| Replay Rate | How fast replica is applying changes | Is replica catching up or falling further behind? |
| Replication Slots | Slot sizes, inactive slots | Growing slots can bloat WAL |
{ "panels": [ { "title": "Replication Lag (seconds)", "type": "timeseries", "targets": [ { "expr": "pg_replication_lag_seconds", "legendFormat": "{{ instance }} → {{ upstream }}" } ], "fieldConfig": { "defaults": { "unit": "s", "thresholds": { "steps": [ { "value": 0, "color": "green" }, { "value": 30, "color": "yellow" }, { "value": 60, "color": "orange" }, { "value": 300, "color": "red" } ] } } } }, { "title": "WAL Generation Rate", "type": "timeseries", "targets": [ { "expr": "rate(pg_wal_lsn{type="insert"}[5m]) * 8", "legendFormat": "WAL bytes/sec" } ], "fieldConfig": { "defaults": { "unit": "Bps" } } }, { "title": "Replication Slot Sizes", "type": "bargauge", "targets": [ { "expr": "pg_replication_slot_disk_usage_bytes / 1024 / 1024 / 1024", "legendFormat": "{{ slot_name }}" } ], "fieldConfig": { "defaults": { "unit": "GB", "thresholds": { "steps": [ { "value": 0, "color": "green" }, { "value": 10, "color": "yellow" }, { "value": 50, "color": "red" } ] } } } } ]}Storage Dashboard:
For capacity metrics like disk space, overlay a predict_linear() line showing projected exhaustion. This transforms a storage dashboard from reactive ('we ran out of space') to proactive ('we'll run out in 30 days').
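A sketch of that overlay, assuming the standard node_exporter filesystem metrics and a hypothetical mountpoint for the PostgreSQL data directory:

```json
{
  "title": "Disk Space Available and 30-Day Projection",
  "type": "timeseries",
  "description": "Assumes node_exporter metrics; adjust the mountpoint to match your data directory",
  "targets": [
    {
      "expr": "node_filesystem_avail_bytes{mountpoint=\"/var/lib/postgresql\"}",
      "legendFormat": "available now"
    },
    {
      "expr": "predict_linear(node_filesystem_avail_bytes{mountpoint=\"/var/lib/postgresql\"}[6h], 30 * 24 * 3600)",
      "legendFormat": "projected in 30 days (linear)"
    }
  ],
  "fieldConfig": { "defaults": { "unit": "bytes" } }
}
```

When the projected series trends toward zero, you get a capacity conversation weeks before you get an incident.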
Beyond real-time performance, DBAs need visibility into database maintenance operations that run in the background.
Vacuum and Maintenance Dashboard (PostgreSQL):
| Operation | Metrics | Concerns |
|---|---|---|
| Autovacuum Activity | Workers running, tables vacuumed/day | Not enough vacuuming → table bloat |
| Oldest Transaction Age | xid_age, datfrozenxid | Approaching wraparound is critical |
| Dead Tuple Accumulation | n_dead_tup per table | Growing dead tuples indicate vacuum falling behind |
| Checkpoint Frequency | Checkpoints per hour, duration | Too frequent = performance impact; too rare = recovery time |
| WAL Generation | WAL bytes per second | Capacity planning for storage and replication |
```sql
-- Tables needing vacuum attention
SELECT
    schemaname || '.' || relname AS table_name,
    n_dead_tup,
    n_live_tup,
    ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Transaction ID age (wraparound monitoring)
SELECT
    datname,
    age(datfrozenxid) AS xid_age,
    ROUND(age(datfrozenxid)::numeric / 2147483647 * 100, 2) AS pct_to_wraparound
FROM pg_database
WHERE datname NOT LIKE 'template%'
ORDER BY xid_age DESC;

-- Autovacuum progress
SELECT
    p.pid,
    p.datname,
    p.relid::regclass AS table_name,
    p.phase,
    p.heap_blks_total,
    p.heap_blks_scanned,
    ROUND(p.heap_blks_scanned::numeric / NULLIF(p.heap_blks_total, 0) * 100, 2) AS pct_complete
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON p.pid = a.pid;
```

Backup Status Dashboard:
Backup monitoring is critical but often overlooked:
A backup that's 72 hours old means up to 72 hours of potential data loss in a disaster. Display backup age prominently with clear thresholds; exceeding them should be treated as a P1 alert condition.
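A stat panel for backup age might look like the sketch below. The metric name `backup_last_success_timestamp_seconds` is hypothetical; substitute whatever last-success timestamp your backup tooling or exporter actually publishes.

```json
{
  "title": "Hours Since Last Successful Backup",
  "type": "stat",
  "description": "backup_last_success_timestamp_seconds is a placeholder metric name; use the one your backup exporter provides",
  "targets": [
    {
      "expr": "(time() - backup_last_success_timestamp_seconds) / 3600",
      "legendFormat": "{{ instance }}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "h",
      "thresholds": {
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 26, "color": "yellow" },
          { "value": 48, "color": "red" }
        ]
      }
    }
  }
}
```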
Dashboards are the visual interface to your database's health. Well-designed dashboards accelerate incident response, enable proactive optimization, and provide the situational awareness that distinguishes excellent operations from reactive firefighting.
What's Next:
With metrics collected, alerts configured, and dashboards built, the final page of this module addresses Proactive Management—using monitoring data not just to respond to problems, but to anticipate and prevent them through capacity planning, trend analysis, and continuous improvement.
You now understand how to design and build database monitoring dashboards that serve real operational needs—from fleet overviews to deep-dive troubleshooting. Next, we'll explore how to use these tools proactively, preventing problems before they impact users.