Dashboards transform raw metrics into actionable intelligence. While alerts tell you something is wrong, dashboards show you what's happening—providing the context needed to understand problems, verify fixes, and spot trends before they become incidents.
A well-designed dashboard serves multiple audiences: on-call engineers triaging an incident, DBAs tuning performance, and leadership tracking long-term trends each need a view shaped to their questions.
This page explores dashboard design principles, essential database dashboards every team needs, and best practices for visualization that genuinely aids decision-making.
By the end of this page, you will understand dashboard design principles and effective visualization techniques, build essential database monitoring dashboards for different use cases, apply information hierarchy to guide users to relevant data, and create dashboards that support both proactive monitoring and reactive troubleshooting.
Effective dashboards aren't created by placing random metrics on a screen. They require intentional design based on how humans perceive and process visual information.
The Five-Second Rule:
When someone opens a dashboard, they should understand the overall status within five seconds. If they have to study panels to figure out 'is everything okay?', the dashboard has failed its primary purpose.
Key Design Principles:
| Principle | Description | Implementation |
|---|---|---|
| Information Hierarchy | Most important information should be most prominent | Red/green status at top; details below; drill-down for specifics |
| Progressive Disclosure | Start simple, reveal complexity on demand | Overview dashboard → Category dashboards → Detail dashboards |
| Consistent Layout | Same metric types in same positions across dashboards | Resource metrics always top-right; query metrics always center |
| Meaningful Defaults | Time range and filters that make sense for the use case | Last 6 hours for operations; last 30 days for capacity planning (see the sketch after this table) |
| Visual Hierarchy | Use color, size, and position intentionally | Errors in red; warnings in yellow; healthy in green; larger = more important |
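To make "meaningful defaults" concrete, the fragment below (referenced in the table above) pins a dashboard's default time range and refresh interval at the dashboard level. It is a minimal sketch; the six-hour window and one-minute refresh are assumptions for an operations-focused dashboard, not requirements.

```json
{
  "title": "Database Operations Overview",
  "description": "Defaults assume an operations use case: last 6 hours, refreshed every minute",
  "time": { "from": "now-6h", "to": "now" },
  "refresh": "1m",
  "timezone": "browser"
}
```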
Anti-Patterns to Avoid:
Your dashboard will be used at 3 AM by someone who was just woken up, is stressed, and needs to quickly understand what's wrong. Optimize for that moment—not for demonstrations to management. Clarity over comprehensiveness.
A single dashboard cannot serve all purposes. Effective monitoring uses a hierarchy of dashboards, each designed for specific use cases.
Three-Tier Dashboard Architecture:
Tier Descriptions:
| Tier | Purpose | Typical Time Range | Audience |
|---|---|---|---|
| Tier 1: Overview | Is everything healthy? Where are problems? | Last 24 hours | Everyone; first stop during incidents |
| Tier 2: Category | Database health summary; all instances at once | Last 6 hours | DBAs, on-call engineers |
| Tier 3: Detail | Deep dive into single instance/component | Last 1-6 hours | DBAs during active troubleshooting |
Essential Database Dashboard Set: at minimum, every team needs a fleet overview, an instance detail dashboard, and a query performance dashboard, plus specialized views for replication, storage, maintenance operations, and backup status. Each is covered in the sections that follow.
Each dashboard should link to related dashboards. An alert fires, the on-call engineer clicks to the fleet overview, spots the red database, clicks through to the instance detail, identifies the query problem, drills into query performance. This navigation flow should be seamless.
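In Grafana, one way to make that flow seamless is with dashboard links that carry the current time range and template variables along as you click through. The sketch below assumes your dashboards are tagged (the tag names here are illustrative); `keepTime` and `includeVars` ensure the engineer lands on the next dashboard already scoped to the same window and instance.

```json
{
  "links": [
    {
      "title": "Drill down: instance detail",
      "type": "dashboards",
      "tags": ["database", "instance-detail"],
      "asDropdown": true,
      "keepTime": true,
      "includeVars": true
    },
    {
      "title": "Drill down: query performance",
      "type": "dashboards",
      "tags": ["database", "query-performance"],
      "asDropdown": true,
      "keepTime": true,
      "includeVars": true
    }
  ]
}
```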
The fleet overview is your command center—showing health status across all database instances at a glance. It answers: 'Are all my databases healthy, and if not, which ones need attention?'
Essential Fleet Overview Components:
| Section | Visualization | Purpose |
|---|---|---|
| Health Status Grid | Colored tiles per instance | Immediate visual: green=healthy, yellow=warning, red=critical |
| Active Alerts Panel | List with severity/time | Current problems requiring attention |
| Key Metrics Summary | Stat panels/gauges | QPS, active connections, replication lag aggregates |
| Resource Utilization | Bar gauges per instance | CPU, memory, disk at-a-glance for all instances |
| Recent Events | Timeline/log panel | Failovers, restarts, configuration changes |
Grafana Panel Configuration Examples:
{ "title": "Database Fleet Health", "panels": [ { "title": "Instance Health Status", "type": "stat", "gridPos": { "x": 0, "y": 0, "w": 24, "h": 4 }, "targets": [ { "expr": "pg_up", "legendFormat": "{{ instance }}" } ], "options": { "colorMode": "background", "graphMode": "none" }, "fieldConfig": { "defaults": { "mappings": [ { "type": "value", "options": { "0": { "text": "DOWN", "color": "red" } } }, { "type": "value", "options": { "1": { "text": "UP", "color": "green" } } } ], "thresholds": { "mode": "absolute", "steps": [ { "value": 0, "color": "red" }, { "value": 1, "color": "green" } ] } } } }, { "title": "Queries Per Second - All Instances", "type": "timeseries", "gridPos": { "x": 0, "y": 4, "w": 12, "h": 8 }, "targets": [ { "expr": "sum(rate(pg_stat_database_xact_commit[5m])) by (instance)", "legendFormat": "{{ instance }}" } ] }, { "title": "Connection Utilization", "type": "bargauge", "gridPos": { "x": 12, "y": 4, "w": 12, "h": 8 }, "targets": [ { "expr": "(pg_stat_activity_count / pg_settings_max_connections) * 100", "legendFormat": "{{ instance }}" } ], "fieldConfig": { "defaults": { "unit": "percent", "max": 100, "thresholds": { "steps": [ { "value": 0, "color": "green" }, { "value": 70, "color": "yellow" }, { "value": 90, "color": "red" } ] } } } }, { "title": "Replication Lag", "type": "stat", "gridPos": { "x": 0, "y": 12, "w": 12, "h": 4 }, "targets": [ { "expr": "pg_replication_lag_seconds", "legendFormat": "{{ instance }}" } ], "fieldConfig": { "defaults": { "unit": "s", "thresholds": { "steps": [ { "value": 0, "color": "green" }, { "value": 30, "color": "yellow" }, { "value": 60, "color": "red" } ] } } } } ]}Grafana template variables let users filter by environment, cluster, or instance without duplicating dashboards. One well-designed dashboard with variables serves better than ten nearly-identical dashboards.
Once you've identified a problematic instance from the fleet view, the instance detail dashboard provides deep visibility into that specific database.
Instance Dashboard Layout:
Key Visualizations for Instance Dashboard:
| Metric Category | Chart Type | Why This Type |
|---|---|---|
| QPS / TPS over time | Time series line chart | Shows trends, spikes, correlations with time |
| Latency percentiles | Time series with p50/p95/p99 lines | Shows distribution and tail latency |
| Current connections | Stacked bar or stat | Categorical breakdown (active/idle/waiting) |
| CPU usage | Time series or gauge | Time series for correlation; gauge for current state |
| Disk usage | Gauge with thresholds | Immediately visible capacity status |
| Top queries | Table with sortable columns | Details needed; charts insufficient |
| Lock waits | Heatmap | Shows time-based patterns in contention |
{ "title": "Query Latency Percentiles", "type": "timeseries", "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 }, "targets": [ { "expr": "histogram_quantile(0.50, sum(rate(pg_query_duration_seconds_bucket{instance=~"$instance"}[5m])) by (le))", "legendFormat": "p50" }, { "expr": "histogram_quantile(0.95, sum(rate(pg_query_duration_seconds_bucket{instance=~"$instance"}[5m])) by (le))", "legendFormat": "p95" }, { "expr": "histogram_quantile(0.99, sum(rate(pg_query_duration_seconds_bucket{instance=~"$instance"}[5m])) by (le))", "legendFormat": "p99" } ], "fieldConfig": { "defaults": { "unit": "s", "custom": { "lineWidth": 2, "fillOpacity": 10 } }, "overrides": [ { "matcher": { "id": "byName", "options": "p99" }, "properties": [{ "id": "color", "value": { "fixedColor": "red" } }] }, { "matcher": { "id": "byName", "options": "p95" }, "properties": [{ "id": "color", "value": { "fixedColor": "yellow" } }] }, { "matcher": { "id": "byName", "options": "p50" }, "properties": [{ "id": "color", "value": { "fixedColor": "green" } }] } ] }, "options": { "thresholdsStyle": { "mode": "line" } }}Grafana annotations can overlay deployment events, configuration changes, and incident markers on your time series. When investigating a problem, seeing 'deployment at 14:32' right before metrics degraded immediately provides context that would otherwise require manual correlation.
Query performance is often where database problems manifest. This dashboard helps identify problematic queries and understand query patterns.
Query Performance Dashboard Components:
| Panel | Content | Actionable Insight |
|---|---|---|
| Top Queries by Total Time | Ranked query fingerprints | Optimize these first for maximum impact |
| Top Queries by Execution Count | Most frequent queries | Even small improvements multiply significantly |
| Slow Query Trend | Slow query count over time | Correlate with deployments, traffic patterns |
| Query Duration Heatmap | Time vs duration buckets | Visualize when slow queries occur |
| Rows Examined vs Returned | Ratio trend | High ratio indicates index issues |
| Temp Table Usage | Queries creating temp tables | Memory pressure indicators |
pg_stat_statements Integration:
For PostgreSQL, the pg_stat_statements extension provides the data for query analysis:
```sql
-- Top queries by total execution time
-- Use this as the query for a table panel
SELECT
    left(query, 80) AS query_preview,
    calls,
    round(total_exec_time::numeric, 2) AS total_time_ms,
    round(mean_exec_time::numeric, 2) AS avg_time_ms,
    round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 2) AS pct_total,
    rows,
    round(rows::numeric / calls, 2) AS rows_per_call
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```

```yaml
# For Grafana/Prometheus, use postgres_exporter with pg_stat_statements
# Query configured in postgres_exporter's queries.yaml:
pg_stat_statements:
  query: |
    SELECT queryid, calls, total_exec_time, mean_exec_time, rows
    FROM pg_stat_statements
    WHERE query NOT LIKE '%pg_stat_statements%'
    ORDER BY total_exec_time DESC
    LIMIT 50
  metrics:
    - queryid:
        usage: "LABEL"
        description: "Query ID"
    - calls:
        usage: "COUNTER"
        description: "Number of times executed"
    - total_exec_time:
        usage: "COUNTER"
        description: "Total execution time (ms)"
    - mean_exec_time:
        usage: "GAUGE"
        description: "Mean execution time (ms)"
    - rows:
        usage: "COUNTER"
        description: "Total rows returned"
```

Query Analysis Workflow: identify the top queries by total time, run EXPLAIN ANALYZE on the worst offenders, apply an index or query rewrite, then verify the improvement on this dashboard.
Query text may contain sensitive data (customer IDs, emails in WHERE clauses). Use query fingerprints (normalized queries with parameters replaced) in dashboards. Full query text should require explicit access to the database.
Specialized dashboards focus on specific subsystems that require dedicated monitoring attention.
Replication Dashboard:
| Panel | Metrics | Alert Correlation |
|---|---|---|
| Replication Topology | Primary/replica relationships, network diagram | Visual understanding of replication architecture |
| Replication Lag Time Series | Lag in seconds over time per replica | ReplicationLag alert context |
| Bytes Behind Primary | WAL position difference | Understanding severity of lag |
| Replay Rate | How fast replica is applying changes | Is replica catching up or falling further behind? |
| Replication Slots | Slot sizes, inactive slots | Growing slots can bloat WAL |
{ "panels": [ { "title": "Replication Lag (seconds)", "type": "timeseries", "targets": [ { "expr": "pg_replication_lag_seconds", "legendFormat": "{{ instance }} → {{ upstream }}" } ], "fieldConfig": { "defaults": { "unit": "s", "thresholds": { "steps": [ { "value": 0, "color": "green" }, { "value": 30, "color": "yellow" }, { "value": 60, "color": "orange" }, { "value": 300, "color": "red" } ] } } } }, { "title": "WAL Generation Rate", "type": "timeseries", "targets": [ { "expr": "rate(pg_wal_lsn{type="insert"}[5m]) * 8", "legendFormat": "WAL bytes/sec" } ], "fieldConfig": { "defaults": { "unit": "Bps" } } }, { "title": "Replication Slot Sizes", "type": "bargauge", "targets": [ { "expr": "pg_replication_slot_disk_usage_bytes / 1024 / 1024 / 1024", "legendFormat": "{{ slot_name }}" } ], "fieldConfig": { "defaults": { "unit": "GB", "thresholds": { "steps": [ { "value": 0, "color": "green" }, { "value": 10, "color": "yellow" }, { "value": 50, "color": "red" } ] } } } } ]}Storage Dashboard:
For capacity metrics like disk space, overlay a predict_linear() line showing projected exhaustion. This transforms a storage dashboard from reactive ('we ran out of space') to proactive ('we'll run out in 30 days').
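A sketch of that overlay, assuming the standard node_exporter filesystem metrics and a hypothetical mountpoint for the PostgreSQL data directory:

```json
{
  "title": "Disk Space Available and 30-Day Projection",
  "type": "timeseries",
  "description": "Assumes node_exporter metrics; adjust the mountpoint to match your data directory",
  "targets": [
    {
      "expr": "node_filesystem_avail_bytes{mountpoint=\"/var/lib/postgresql\"}",
      "legendFormat": "available now"
    },
    {
      "expr": "predict_linear(node_filesystem_avail_bytes{mountpoint=\"/var/lib/postgresql\"}[6h], 30 * 24 * 3600)",
      "legendFormat": "projected in 30 days (linear)"
    }
  ],
  "fieldConfig": { "defaults": { "unit": "bytes" } }
}
```

When the projected series trends toward zero, you get a capacity conversation weeks before you get an incident.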
Beyond real-time performance, DBAs need visibility into database maintenance operations that run in the background.
Vacuum and Maintenance Dashboard (PostgreSQL):
| Operation | Metrics | Concerns |
|---|---|---|
| Autovacuum Activity | Workers running, tables vacuumed/day | Not enough vacuuming → table bloat |
| Oldest Transaction Age | xid_age, datfrozenxid | Approaching wraparound is critical |
| Dead Tuple Accumulation | n_dead_tup per table | Growing dead tuples indicate vacuum falling behind |
| Checkpoint Frequency | Checkpoints per hour, duration | Too frequent = performance impact; too rare = recovery time |
| WAL Generation | WAL bytes per second | Capacity planning for storage and replication |
```sql
-- Tables needing vacuum attention
SELECT
    schemaname || '.' || relname AS table_name,
    n_dead_tup,
    n_live_tup,
    ROUND(n_dead_tup::numeric / NULLIF(n_live_tup + n_dead_tup, 0) * 100, 2) AS dead_pct,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Transaction ID age (wraparound monitoring)
SELECT
    datname,
    age(datfrozenxid) AS xid_age,
    ROUND(age(datfrozenxid)::numeric / 2147483647 * 100, 2) AS pct_to_wraparound
FROM pg_database
WHERE datname NOT LIKE 'template%'
ORDER BY xid_age DESC;

-- Autovacuum progress
SELECT
    p.pid,
    p.datname,
    p.relid::regclass AS table_name,
    p.phase,
    p.heap_blks_total,
    p.heap_blks_scanned,
    ROUND(p.heap_blks_scanned::numeric / NULLIF(p.heap_blks_total, 0) * 100, 2) AS pct_complete
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON p.pid = a.pid;
```

Backup Status Dashboard:
Backup monitoring is critical but often overlooked:
A backup that's 72 hours old means up to 72 hours of potential data loss in a disaster. Display backup age prominently with clear thresholds; exceeding them should be treated as a P1 alert condition.
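A stat panel for backup age might look like the sketch below. The metric name `backup_last_success_timestamp_seconds` is hypothetical; substitute whatever last-success timestamp your backup tooling or exporter actually publishes.

```json
{
  "title": "Hours Since Last Successful Backup",
  "type": "stat",
  "description": "backup_last_success_timestamp_seconds is a placeholder metric name; use the one your backup exporter provides",
  "targets": [
    {
      "expr": "(time() - backup_last_success_timestamp_seconds) / 3600",
      "legendFormat": "{{ instance }}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "h",
      "thresholds": {
        "steps": [
          { "value": 0, "color": "green" },
          { "value": 26, "color": "yellow" },
          { "value": 48, "color": "red" }
        ]
      }
    }
  }
}
```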
Dashboards are the visual interface to your database's health. Well-designed dashboards accelerate incident response, enable proactive optimization, and provide the situational awareness that distinguishes excellent operations from reactive firefighting.
What's Next:
With metrics collected, alerts configured, and dashboards built, the final page of this module addresses Proactive Management—using monitoring data not just to respond to problems, but to anticipate and prevent them through capacity planning, trend analysis, and continuous improvement.
You now understand how to design and build database monitoring dashboards that serve real operational needs—from fleet overviews to deep-dive troubleshooting. Next, we'll explore how to use these tools proactively, preventing problems before they impact users.