A database without monitoring is a ship without instruments—you may be sailing smoothly, or you may be heading toward disaster, but you won't know until it's too late. Performance monitoring is the discipline that transforms database operation from reactive firefighting into proactive management.
Every production database generates a continuous stream of information about its health and behavior. Queries execute, transactions commit, connections open and close, buffers fill and empty. Each of these activities produces signals that, when properly collected and interpreted, reveal the complete operational state of the system.
The challenge isn't collecting data—modern databases can produce gigabytes of metrics daily. The challenge is knowing which metrics matter, understanding what they mean, and establishing systems that surface problems before users notice.
By the end of this page, you will understand comprehensive database monitoring strategies, including key performance indicators, monitoring tools and architectures, metric collection and analysis, baseline establishment, anomaly detection, and proactive issue identification. You'll learn the systematic approaches used by experienced DBAs to maintain optimal database performance.
Database performance issues are insidious. Unlike application crashes that immediately alert users, database degradation often happens gradually—queries that took 100ms start taking 200ms, then 500ms, then 2 seconds. Users may not consciously notice individual slowdowns, but they experience the cumulative effect as a sluggish, frustrating application.
The Cost of Unmonitored Databases:
The Proactive vs. Reactive Spectrum:
Monitoring transforms database operations along a maturity spectrum:
| Level | Approach | Typical Discovery | Impact |
|---|---|---|---|
| 1 - Chaotic | No monitoring | User complaints, system crashes | Extended outages, data loss risk |
| 2 - Reactive | Basic logs reviewed occasionally | Hours after problem starts | Significant user impact |
| 3 - Active | Metrics collected, dashboards exist | When checking dashboards | Moderate impact, faster recovery |
| 4 - Proactive | Alerting on thresholds | Before user impact | Minimal impact, preventive action |
| 5 - Predictive | Trend analysis, capacity planning | Before problem develops | Problems prevented entirely |
The goal isn't just alerting when something breaks—it's predicting when something will break and preventing it. Capacity planning, trend analysis, and proactive tuning distinguish excellent database operations from merely adequate ones.
Not all metrics are equally important. Effective monitoring focuses on Key Performance Indicators (KPIs) that directly reflect database health and user experience. These fall into several categories:
1. Query Performance Metrics:
These metrics directly reflect the user experience:
| Metric | Description | Healthy Range | Alert Threshold |
|---|---|---|---|
| Query Response Time (avg) | Average time to execute queries | < 100ms | > 500ms |
| Query Response Time (p95) | 95th percentile query time | < 500ms | > 2s |
| Query Response Time (p99) | 99th percentile (tail latency) | < 2s | > 5s |
| Queries Per Second (QPS) | Throughput measurement | Within capacity | > 80% of max tested |
| Slow Query Count | Queries exceeding threshold | < 1% of total | > 5% of total |
| Query Error Rate | Percentage of failed queries | < 0.1% | > 1% |
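As a rough illustration, several of these KPIs can be approximated from PostgreSQL's pg_stat_statements view (covered in detail later). The numbers are cumulative since the last statistics reset rather than a rolling window, the column names assume PostgreSQL 13 or later, and the 500 ms cutoff simply echoes the example threshold above:

```sql
-- Approximate query-performance KPIs from cumulative statistics.
-- Figures are since the last reset, not a rolling window.
SELECT sum(calls)                                        AS total_calls,
       round((sum(total_exec_time) / nullif(sum(calls), 0))::numeric, 2)
                                                         AS avg_exec_time_ms,
       count(*) FILTER (WHERE mean_exec_time > 500)      AS statements_avg_over_500ms,
       round(100.0 * sum(calls) FILTER (WHERE mean_exec_time > 500)
             / nullif(sum(calls), 0), 2)                 AS pct_calls_in_slow_statements
FROM pg_stat_statements;
```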
2. Resource Utilization Metrics:
These indicate how effectively the database uses available resources:
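As a sketch of where such figures come from, PostgreSQL reports several database-level resource signals through its statistics views; OS-level CPU and memory numbers typically come from an agent or exporter rather than SQL:

```sql
-- Database-level resource signals from PostgreSQL's statistics collector
SELECT datname,
       numbackends,                  -- Connections currently open to this database
       xact_commit, xact_rollback,   -- Cumulative transaction counts
       temp_files, temp_bytes,       -- Spills to disk (a sign of work_mem pressure)
       deadlocks
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1');

-- Connection usage versus the configured limit
SELECT count(*) AS current_connections,
       current_setting('max_connections')::int AS max_connections
FROM pg_stat_activity;
```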
3. Storage and I/O Metrics:
Storage I/O is often the primary bottleneck in a database system, which makes these metrics especially important:
```sql
-- PostgreSQL disk I/O statistics
SELECT datname,
       blks_read,   -- Blocks read from disk
       blks_hit,    -- Blocks found in buffer cache
       round(100.0 * blks_hit / nullif(blks_hit + blks_read, 0), 2)
           AS cache_hit_ratio       -- Target: > 99%
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1');

-- Table I/O statistics
SELECT schemaname,
       relname,
       heap_blks_read,     -- Table blocks from disk
       heap_blks_hit,      -- Table blocks from cache
       idx_blks_read,      -- Index blocks from disk
       idx_blks_hit        -- Index blocks from cache
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;
```

4. Connection and Concurrency Metrics:
These show how the database handles concurrent users:
| Metric | Description | Warning Sign |
|---|---|---|
| Active Connections | Currently open connections | Approaching max_connections |
| Waiting Connections | Connections waiting for resources | Any waiting during normal operation |
| Connection Rate | New connections per second | High rate indicates pool misconfiguration |
| Idle Connections | Open but unused connections | Many idle connections waste memory |
| Lock Waits | Queries waiting for locks | Increasing trend indicates contention |
| Deadlocks | Deadlock occurrences | Any deadlock requires investigation |
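For PostgreSQL, most of the signals in the table above can be sampled directly from pg_stat_activity and pg_stat_database; a minimal sketch:

```sql
-- Connections grouped by state
SELECT state, count(*) AS sessions
FROM pg_stat_activity
GROUP BY state
ORDER BY sessions DESC;

-- Sessions currently waiting on locks
SELECT count(*) AS lock_waiters
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

-- Cumulative deadlocks per database
SELECT datname, deadlocks
FROM pg_stat_database
WHERE deadlocks > 0;
```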
While a high buffer pool hit ratio (>99%) is generally good, don't optimize exclusively for this metric. A 99.9% hit ratio on a read-heavy workload with poor query patterns is worse than a 98% hit ratio with efficient queries. Always consider hit ratio alongside query response times.
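One way to keep that balance visible is to view hit ratio and response time side by side for each statement; a minimal sketch using pg_stat_statements (covered in detail later in this page):

```sql
-- Cache hit ratio viewed alongside average execution time, per statement
SELECT substring(query, 1, 60) AS query_preview,
       calls,
       mean_exec_time::numeric(10,2) AS avg_ms,
       round(100.0 * shared_blks_hit
             / nullif(shared_blks_hit + shared_blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
```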
Effective monitoring requires three components: collection (gathering metrics from databases), storage (retaining metrics for analysis), and visualization/alerting (presenting data and triggering notifications).
Monitoring Architecture:
Metric Collection Approaches:
| Method | Description | Pros | Cons |
|---|---|---|---|
| Agent-Based | Software agent runs on database host, pushes metrics | Full access to OS and DB metrics, reliable | Resource overhead, agent maintenance |
| Agentless/Pull | Central collector polls databases via SQL/API | No agent deployment, centralized control | Network dependency, connection overhead |
| Log Shipping | Parse database logs for metrics | Non-intrusive, detailed query info | Log parsing complexity, delay |
| Native Integration | Cloud provider managed monitoring | Zero setup, integrated dashboards | Limited customization, vendor lock-in |
Popular Monitoring Stack Components:
```yaml
# Prometheus configuration for database monitoring
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate rules

scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets:
          - 'db-primary:9187'   # postgres_exporter
          - 'db-replica:9187'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+).*'
        replacement: '${1}'

  - job_name: 'mysql'
    static_configs:
      - targets:
          - 'mysql-primary:9104'  # mysqld_exporter
    params:
      collect[]:
        - global_status
        - slave_status
        - innodb_metrics
```

Your monitoring system is itself infrastructure that can fail. Ensure the monitoring stack has redundancy, and implement monitoring of the monitoring system (meta-monitoring). A dead alerting system won't tell you it's dead.
Each database system exposes metrics through different mechanisms. Understanding your specific database's instrumentation is essential for effective monitoring.
PostgreSQL Monitoring:
PostgreSQL provides extensive visibility through system views:
```sql
-- Active queries and their state
SELECT pid,
       usename,
       application_name,
       client_addr,
       state,
       wait_event_type,
       wait_event,
       query_start,
       now() - query_start AS duration,
       left(query, 100) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

-- Table statistics (sequential scans indicate missing indexes)
SELECT schemaname,
       relname,
       seq_scan, seq_tup_read,          -- Sequential scan stats
       idx_scan, idx_tup_fetch,         -- Index scan stats
       n_live_tup, n_dead_tup,          -- Tuple counts
       last_vacuum, last_autovacuum,    -- Vacuum timing
       last_analyze, last_autoanalyze   -- Statistics timing
FROM pg_stat_user_tables
ORDER BY seq_scan DESC
LIMIT 20;

-- Lock contention analysis
SELECT blocked.pid AS blocked_pid,
       blocked.usename AS blocked_user,
       blocking.pid AS blocking_pid,
       blocking.usename AS blocking_user,
       blocked.query AS blocked_query,
       blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_locks blocked_locks ON blocked.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
  ON blocked_locks.locktype = blocking_locks.locktype
 AND blocked_locks.database IS NOT DISTINCT FROM blocking_locks.database
 AND blocked_locks.relation IS NOT DISTINCT FROM blocking_locks.relation
 AND blocked_locks.page IS NOT DISTINCT FROM blocking_locks.page
 AND blocked_locks.tuple IS NOT DISTINCT FROM blocking_locks.tuple
 AND blocked_locks.transactionid IS NOT DISTINCT FROM blocking_locks.transactionid
 AND blocked_locks.classid IS NOT DISTINCT FROM blocking_locks.classid
 AND blocked_locks.objid IS NOT DISTINCT FROM blocking_locks.objid
 AND blocked_locks.objsubid IS NOT DISTINCT FROM blocking_locks.objsubid
JOIN pg_stat_activity blocking ON blocking.pid = blocking_locks.pid
WHERE blocked_locks.granted = false
  AND blocking_locks.granted = true
  AND blocked_locks.pid != blocking_locks.pid;  -- exclude locks the waiting session itself holds

-- Replication lag monitoring
SELECT client_addr,
       state,
       sent_lsn,
       write_lsn,
       flush_lsn,
       replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;
```

MySQL/MariaDB Monitoring:
MySQL provides the Performance Schema and InnoDB-specific metrics:
```sql
-- Global status overview
SHOW GLOBAL STATUS LIKE 'Threads_%';
SHOW GLOBAL STATUS LIKE 'Questions';
SHOW GLOBAL STATUS LIKE 'Slow_queries';
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_%';

-- InnoDB buffer pool efficiency
SELECT (1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)) * 100
           AS buffer_pool_hit_ratio
FROM (
    SELECT VARIABLE_VALUE AS Innodb_buffer_pool_reads
    FROM performance_schema.global_status
    WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads'
) reads,
(
    SELECT VARIABLE_VALUE AS Innodb_buffer_pool_read_requests
    FROM performance_schema.global_status
    WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests'
) requests;

-- Currently running queries
SELECT id, user, host, db, command, time, state,
       LEFT(info, 100) AS query_preview
FROM information_schema.processlist
WHERE command != 'Sleep'
ORDER BY time DESC;

-- Table lock waits (Performance Schema)
SELECT * FROM sys.innodb_lock_waits\G

-- Query digest (most resource-intensive queries)
SELECT DIGEST_TEXT,
       COUNT_STAR,
       SUM_TIMER_WAIT/1000000000000 AS total_latency_sec,
       AVG_TIMER_WAIT/1000000000 AS avg_latency_ms,
       SUM_ROWS_EXAMINED,
       SUM_ROWS_SENT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```

MySQL's Performance Schema must be enabled (default in MySQL 8.0+) for detailed query analytics. The overhead is typically 5-10% but provides invaluable visibility. Configure events_statements_history_size and other consumers based on your analysis needs.
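As a quick sketch of that configuration step (assuming MySQL 8.0 with the Performance Schema available), you can inspect and enable the statement consumers at runtime; changes made this way do not persist across restarts:

```sql
-- Confirm the Performance Schema is enabled
SHOW VARIABLES LIKE 'performance_schema';

-- Inspect statement-history consumers
SELECT NAME, ENABLED
FROM performance_schema.setup_consumers
WHERE NAME LIKE 'events_statements%';

-- Enable them at runtime (not persistent across restarts)
UPDATE performance_schema.setup_consumers
SET ENABLED = 'YES'
WHERE NAME LIKE 'events_statements%';
```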
SQL Server Monitoring:
SQL Server provides Dynamic Management Views (DMVs) for comprehensive monitoring:
```sql
-- Current activity
SELECT r.session_id, r.request_id, r.status, r.command,
       r.wait_type, r.wait_time, r.cpu_time, r.reads, r.writes,
       SUBSTRING(t.text, 1, 100) AS query_preview
FROM sys.dm_exec_requests r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.status != 'background';

-- Wait statistics (where is time being spent?)
SELECT wait_type, wait_time_ms, waiting_tasks_count,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0
ORDER BY wait_time_ms DESC;

-- Buffer cache hit ratio
SELECT (a.cntr_value * 1.0 / b.cntr_value) * 100 AS buffer_cache_hit_ratio
FROM sys.dm_os_performance_counters a
JOIN sys.dm_os_performance_counters b ON a.object_name = b.object_name
WHERE a.counter_name = 'Buffer cache hit ratio'
  AND b.counter_name = 'Buffer cache hit ratio base';
```
Baselines transform raw metrics into actionable intelligence. Without knowing what "normal" looks like, you can't identify "abnormal." Baseline establishment is the process of characterizing typical database behavior.
What to Baseline:
Capture baselines for all KPIs across different time dimensions:
| Metric | Weekday Peak | Weekday Low | Weekend Avg | Month-End Peak |
|---|---|---|---|---|
| Query Response Time (avg) | 150ms | 50ms | 40ms | 300ms |
| CPU Utilization | 60% | 20% | 15% | 80% |
| Active Connections | 200 | 50 | 30 | 350 |
| IOPS (Read) | 5,000 | 1,000 | 800 | 10,000 |
| Buffer Hit Ratio | 99.2% | 99.8% | 99.9% | 98.5% |
| Transactions/sec | 500 | 100 | 80 | 800 |
Baseline Collection Strategy:
```sql
-- PostgreSQL: Create baseline snapshots table
CREATE TABLE IF NOT EXISTS dba_baseline_snapshots (
    snapshot_id SERIAL PRIMARY KEY,
    snapshot_time TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    snapshot_type VARCHAR(20),            -- 'hourly', 'daily', 'pre-change'

    -- Query performance
    total_queries BIGINT,
    avg_query_time_ms NUMERIC(10,2),
    p95_query_time_ms NUMERIC(10,2),
    slow_queries_count INTEGER,

    -- Resource utilization
    active_connections INTEGER,
    idle_connections INTEGER,
    cpu_percent NUMERIC(5,2),
    memory_used_mb BIGINT,

    -- I/O metrics
    buffer_hit_ratio NUMERIC(6,3),
    blocks_read BIGINT,
    blocks_hit BIGINT,

    -- Transaction metrics
    commits_per_sec NUMERIC(10,2),
    rollbacks_per_sec NUMERIC(10,2),

    -- Replication (if applicable)
    max_replication_lag_bytes BIGINT,

    -- Context
    notes TEXT
);

-- Capture hourly baseline
INSERT INTO dba_baseline_snapshots (
    snapshot_type, total_queries, active_connections,
    buffer_hit_ratio, blocks_read, blocks_hit
)
SELECT
    'hourly',
    sum(calls),
    (SELECT count(*) FROM pg_stat_activity WHERE state = 'active'),
    (SELECT round(100.0 * sum(blks_hit) / nullif(sum(blks_hit + blks_read), 0), 3)
     FROM pg_stat_database),
    (SELECT sum(blks_read) FROM pg_stat_database),
    (SELECT sum(blks_hit) FROM pg_stat_database)
FROM pg_stat_statements;
```

Gradually worsening performance can go unnoticed if baselines slowly drift. Compare current metrics not just to yesterday, but to 30, 60, and 90 days ago. Trend analysis reveals gradual degradation that day-over-day comparison misses.
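Building on the dba_baseline_snapshots table above, a sketch of that longer-horizon comparison might place the latest hourly snapshot next to one captured roughly 30 days earlier (the time window is illustrative):

```sql
-- Compare the most recent hourly snapshot to one from ~30 days ago
WITH latest AS (
    SELECT * FROM dba_baseline_snapshots
    WHERE snapshot_type = 'hourly'
    ORDER BY snapshot_time DESC
    LIMIT 1
),
month_ago AS (
    SELECT * FROM dba_baseline_snapshots
    WHERE snapshot_type = 'hourly'
      AND snapshot_time BETWEEN now() - interval '30 days 1 hour'
                            AND now() - interval '30 days'
    ORDER BY snapshot_time DESC
    LIMIT 1
)
SELECT latest.buffer_hit_ratio      AS hit_ratio_now,
       month_ago.buffer_hit_ratio   AS hit_ratio_30d_ago,
       latest.active_connections    AS connections_now,
       month_ago.active_connections AS connections_30d_ago
FROM latest, month_ago;
```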
Alerting is the active component of monitoring—it notifies humans when intervention is needed. However, alerting is a double-edged sword: too few alerts miss problems; too many cause alert fatigue and ignored notifications.
Alert Design Principles:
| Alert Name | Condition | Severity | Action |
|---|---|---|---|
| High Query Latency | p95 response > 2s for 5 min | Warning | Investigate slow queries, check resources |
| Disk Space Critical | < 10% disk free | Critical | Immediate cleanup or expansion required |
| Replication Lag | Lag > 60s for 5 min | Warning | Check replica, network, long transactions |
| Connection Exhaustion | 90% connections used | Critical | Scale connections, investigate leaks |
| Deadlock Detected | Any deadlock occurrence | Warning | Review application logic, locking order |
| Backup Failed | No successful backup in 25h | Critical | Immediate investigation, data loss risk |
```yaml
groups:
  - name: postgresql_alerts
    rules:
      - alert: PostgreSQLHighQueryLatency
        expr: |
          histogram_quantile(0.95,
            rate(pg_stat_statements_seconds_bucket[5m])
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High query latency on {{ $labels.instance }}"
          description: "95th percentile query latency is {{ $value | humanizeDuration }}"

      - alert: PostgreSQLDiskSpaceCritical
        expr: |
          (pg_database_size_bytes / pg_tablespace_size_bytes) * 100 > 90
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critical on {{ $labels.instance }}"
          description: "Database {{ $labels.datname }} is at {{ $value }}% capacity"

      - alert: PostgreSQLReplicationLag
        expr: pg_stat_replication_replay_lag_seconds > 60
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag on {{ $labels.instance }}"
          description: "Replica lag is {{ $value }} seconds"

      - alert: PostgreSQLConnectionsHigh
        expr: |
          pg_stat_activity_count / pg_settings_max_connections > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Connection count approaching limit"
          description: "{{ $value | humanizePercentage }} of max connections in use"
```

For every critical-severity alert, ask: 'Is this worth waking someone up at 2am?' If the answer is no, it shouldn't be critical. If the answer is 'it depends on context,' the alert needs more conditions. Respect on-call engineers' sleep—it improves long-term team sustainability.
Beyond aggregate metrics, understanding individual query performance is essential for troubleshooting and optimization. Query analysis identifies the specific statements consuming resources.
Query Analysis Tools:
Most databases provide facilities for query performance tracking:
```sql
-- Enable pg_stat_statements (in postgresql.conf)
-- shared_preload_libraries = 'pg_stat_statements'

-- Create extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Top queries by total execution time
SELECT substring(query, 1, 80) AS query_preview,
       calls,
       total_exec_time::numeric(12,2) AS total_time_ms,
       mean_exec_time::numeric(10,2) AS avg_time_ms,
       stddev_exec_time::numeric(10,2) AS stddev_ms,
       rows,
       shared_blks_hit + shared_blks_read AS total_blocks,
       CASE WHEN shared_blks_hit + shared_blks_read > 0
            THEN round(100.0 * shared_blks_hit / (shared_blks_hit + shared_blks_read), 2)
            ELSE 100
       END AS cache_hit_pct
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Identify queries with high variance (unpredictable performance)
SELECT substring(query, 1, 60) AS query_preview,
       calls,
       mean_exec_time::numeric(10,2) AS avg_ms,
       stddev_exec_time::numeric(10,2) AS stddev_ms,
       (stddev_exec_time / nullif(mean_exec_time, 0))::numeric(6,2) AS cv,
       min_exec_time::numeric(10,2) AS min_ms,
       max_exec_time::numeric(10,2) AS max_ms
FROM pg_stat_statements
WHERE calls > 100  -- Sufficient sample size
ORDER BY (stddev_exec_time / nullif(mean_exec_time, 0)) DESC NULLS LAST
LIMIT 10;

-- Queries scanning most rows (potential for optimization)
SELECT substring(query, 1, 60) AS query_preview,
       calls,
       rows AS total_rows_returned,
       (rows::numeric / nullif(calls, 0))::numeric(12,2) AS rows_per_call,
       shared_blks_read / nullif(calls, 0) AS disk_reads_per_call
FROM pg_stat_statements
WHERE calls > 10
ORDER BY rows DESC
LIMIT 10;
```

Execution Plan Capture:
For problematic queries, execution plans reveal exactly how the database processes them. Capturing plans for later analysis:
```sql
-- PostgreSQL: Explain with all details
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.order_date > '2024-01-01';

-- auto_explain logs slow query plans automatically
-- In postgresql.conf:
-- auto_explain.log_min_duration = 1000   -- Log plans for queries > 1s
-- auto_explain.log_analyze = on
```
Identifying Optimization Opportunities:
Query analysis typically reveals:
Statistics views like pg_stat_statements can be reset with pg_stat_statements_reset(). Do this after major changes to get clean measurements, but avoid resetting during active troubleshooting—you'll lose historical data needed for analysis.
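If a reset is unavoidable mid-investigation, one hedged approach is to copy the current counters into a scratch table first; the pg_stat_statements_history table below is illustrative, not a built-in object:

```sql
-- Preserve current statistics before resetting them
CREATE TABLE IF NOT EXISTS pg_stat_statements_history AS
    SELECT now() AS captured_at, * FROM pg_stat_statements WHERE false;

INSERT INTO pg_stat_statements_history
SELECT now(), * FROM pg_stat_statements;

-- Then reset counters for a clean measurement window
SELECT pg_stat_statements_reset();
```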
Dashboards are the primary interface for monitoring visibility. Well-designed dashboards provide instant situational awareness; poorly designed ones hide problems in visual noise.
Dashboard Organization:
Structure dashboards in layers of detail:
| Level | Purpose | Audience | Refresh Rate |
|---|---|---|---|
| Overview/SLI | Health status at a glance | Everyone, NOC displays | 15-30 seconds |
| Service | All databases for a service | On-call engineers | 1 minute |
| Instance | Deep metrics for one database | DBAs, troubleshooting | 15 seconds |
| Query/Session | Individual query analysis | DBAs, developers | Real-time |
Overview Dashboard Components:
The top-level dashboard should enable quick assessment:
Visualization Best Practices:
A well-designed overview dashboard should allow an experienced operator to assess system health in 30 seconds or less. If it takes longer, the dashboard needs simplification. Detailed investigation comes after initial triage.
Performance monitoring transforms database operation from reactive firefighting into proactive management. With proper monitoring, problems are detected before users notice, capacity issues are anticipated, and troubleshooting becomes guided by data rather than guesswork.
Key Takeaways:
What's Next:
With monitoring in place to detect issues, the next DBA responsibility is Security Management—protecting the database from unauthorized access, data breaches, and malicious attacks while maintaining compliance with regulatory requirements.
You now understand comprehensive database performance monitoring, from key performance indicators through monitoring infrastructure, baseline establishment, alerting strategies, query analysis, and dashboard design. These skills enable proactive database management that prevents problems before they impact users.