Reactive teams fight fires. Proactive teams prevent them.
The monitoring infrastructure we've built—metrics collection, alerting systems, and dashboards—can serve two fundamentally different purposes:
Reactive Use — Wait for alerts, respond to incidents, restore service, move on.
Proactive Use — Analyze trends, predict problems, optimize before degradation, prevent incidents from occurring.
Mature database operations teams spend significantly more time on proactive work. They use monitoring data not just to detect problems, but to anticipate them—addressing capacity limits before exhaustion, optimizing queries before they cause bottlenecks, and planning infrastructure changes before they become urgent.
This page synthesizes everything we've learned into a proactive database management practice that reduces incidents, improves reliability, and creates space for improvement work rather than constant firefighting.
By the end of this page, you will be able to plan capacity using monitoring data, implement trend analysis to predict issues, build automated remediation for common problems, establish continuous improvement practices, and create operational runbooks driven by monitoring insights.
Capacity planning transforms monitoring from incident detection to incident prevention. By projecting resource consumption into the future, you can procure resources, optimize systems, or archive data before hitting limits.
The Capacity Planning Cycle:
Key Resources to Plan:
| Resource | Metrics to Track | Planning Horizon | Remediation Options |
|---|---|---|---|
| Storage | Disk usage, growth rate, table sizes | 3-6 months | Add storage, archive data, compress, partition |
| Connections | Connection count, pool utilization | 1-3 months | Increase limits, add pooler, optimize app |
| Memory | Buffer pool hit ratio, swap usage | 3-6 months | Add RAM, tune cache settings, optimize queries |
| CPU | Utilization, query processing time | 1-3 months | Scale up, optimize queries, read replicas |
| IOPS | Read/write operations, queue depth | 3-6 months | Faster storage, caching, query optimization |
Growth Rate Calculation:
```promql
# Daily growth rate (GB/day)
deriv(pg_database_size_bytes[7d]) * 86400 / 1024 / 1024 / 1024

# Days until disk exhaustion at the current growth rate
node_filesystem_avail_bytes
  / deriv((node_filesystem_size_bytes - node_filesystem_avail_bytes)[7d:1h])
  / 86400

# Projected disk usage in 30 days
pg_database_size_bytes + (deriv(pg_database_size_bytes[30d]) * 30 * 86400)

# Connection pool exhaustion projection:
# projected connection utilization (%) 30 days out
sum(predict_linear(pg_stat_activity_count[7d], 86400 * 30))
  / max(pg_settings_max_connections) * 100
```

```sql
-- Storage capacity planning query for reporting
-- Assumes a database_metrics table that tracks historical sizes
SELECT
    datname,
    current_size_gb,
    (current_size_gb - size_gb_30_days_ago) / 30 AS avg_daily_growth_gb,
    disk_total_gb - current_size_gb AS remaining_gb,
    CASE
        WHEN current_size_gb > size_gb_30_days_ago
        THEN (disk_total_gb - current_size_gb)
             / ((current_size_gb - size_gb_30_days_ago) / 30)
        ELSE NULL
    END AS days_until_full
FROM database_metrics;
```

Your database doesn't experience average load; it experiences peak load. Black Friday, month-end closing, viral content events. Capacity planning must account for peak demand, not just current usage trends. If you can, analyze year-over-year patterns to identify seasonal peaks.
Day-to-day metrics can mask gradual degradation. A 1% weekly increase in query latency doesn't trigger any alert, but it compounds to roughly 68% degradation over a year (1.01^52 ≈ 1.68). Trend analysis catches these slow-motion problems.
What to Trend:
Week-over-Week and Month-over-Month Comparison:
```promql
# Week-over-week query latency comparison
# Current hour's p95 latency
histogram_quantile(0.95, sum(rate(query_duration_seconds_bucket[1h])) by (le))
# vs. the same hour last week
histogram_quantile(0.95, sum(rate(query_duration_seconds_bucket[1h] offset 7d)) by (le))

# Percentage change in transaction rate week-over-week
(
  sum(rate(pg_stat_database_xact_commit[1h]))
  - sum(rate(pg_stat_database_xact_commit[1h] offset 7d))
)
  / sum(rate(pg_stat_database_xact_commit[1h] offset 7d)) * 100

# Month-over-month storage growth (GB)
(pg_database_size_bytes - pg_database_size_bytes offset 30d) / 1024 / 1024 / 1024

# Trend of the rollback (error) ratio over 30 days
# Alert if it is trending upward
deriv(
  (
    sum(rate(pg_stat_database_xact_rollback[1d]))
      /
    (sum(rate(pg_stat_database_xact_commit[1d])) + sum(rate(pg_stat_database_xact_rollback[1d])))
  )[30d:1d]
) > 0
```

Anomaly Detection Approaches:
| Method | How It Works | Best For |
|---|---|---|
| Static Thresholds | Alert when value exceeds fixed limit | Absolute limits (disk space, connection count) |
| Standard Deviation | Alert when value deviates N sigmas from mean | Metrics with normal distribution |
| Percentage Change | Alert when value changes > X% from baseline | Detecting sudden shifts |
| Seasonal Decomposition | Compare to same time period historically | Metrics with daily/weekly patterns |
| Machine Learning | Model learns normal patterns, flags deviations | Complex, multi-dimensional patterns |
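As a concrete illustration of the standard-deviation approach in the table above, here is a minimal sketch. It assumes you already have a list of historical samples for a metric (for example, hourly transaction rates pulled from your metrics store); the function and sample values are illustrative, not part of any library.

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, n_sigmas: float = 3.0) -> bool:
    """Flag `latest` if it deviates more than n_sigmas from the mean of `history`.

    Works best for metrics that are roughly normally distributed; for metrics
    with strong daily/weekly patterns, compare against the same hour in
    previous weeks instead (seasonal decomposition).
    """
    if len(history) < 2:
        return False  # not enough data to estimate variance
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # flat baseline: any change counts as a deviation
    z_score = abs(latest - mu) / sigma
    return z_score > n_sigmas

# Example: hourly transaction-rate samples for the past half day (hypothetical numbers)
hourly_tps = [410, 395, 402, 388, 415, 407, 399, 420, 405, 398, 412, 401]
print(is_anomalous(hourly_tps, latest=780))  # True: far outside 3 sigmas
```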
Schedule a weekly database health review. Compare key metrics to the previous week. Look at growth trends. Review any near-miss incidents. This 30-minute ritual often catches problems weeks before they become incidents.
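To make that ritual cheap to run, a short script can pull this week's and last week's values for a handful of key metrics and print a comparison. The following is a minimal sketch, assuming a Prometheus server at the hypothetical URL below and the prometheus_api_client library (also used by the maintenance scheduler later on this page); the PromQL expressions are illustrative.

```python
import time

from prometheus_api_client import PrometheusConnect

# Hypothetical Prometheus endpoint; adjust to your environment
prom = PrometheusConnect(url="http://prometheus.internal:9090")

# Key health metrics and the PromQL used for each (expressions are illustrative)
CHECKS = {
    "p95 query latency (s)":
        'histogram_quantile(0.95, sum(rate(query_duration_seconds_bucket[1h])) by (le))',
    "transactions/sec":
        'sum(rate(pg_stat_database_xact_commit[1h]))',
    "database size (GB)":
        'sum(pg_database_size_bytes) / 1024 / 1024 / 1024',
}

WEEK_SECONDS = 7 * 24 * 3600


def instant_value(query, at=None):
    """Run an instant query, optionally at a past timestamp, and return one value."""
    params = {"time": at} if at is not None else None
    result = prom.custom_query(query, params=params)
    return float(result[0]["value"][1]) if result else float("nan")


print(f"{'metric':<26}{'this week':>12}{'last week':>12}{'change %':>10}")
for name, query in CHECKS.items():
    current = instant_value(query)
    previous = instant_value(query, at=time.time() - WEEK_SECONDS)
    change = (current - previous) / previous * 100 if previous else float("nan")
    print(f"{name:<26}{current:>12.2f}{previous:>12.2f}{change:>10.1f}")
```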
Some problems have known, safe remediation steps. Automating these responses reduces mean time to recovery (MTTR) and frees engineers for higher-value work.
Candidates for Automation:
| Problem | Automated Response | Risk Level | Human Override |
|---|---|---|---|
| Connection pool exhaustion | Kill long-running idle connections | Low | Configurable threshold |
| Table bloat > 50% | Schedule VACUUM FULL during maintenance window | Medium | Requires approval for large tables |
| Replication lag spike | Pause non-critical replication consumers | Medium | Alert on action taken |
| Query timeout threshold | Cancel query and log for review | Low | Configurable timeout |
| Disk > 90% | Trigger log rotation, delete old exports | Medium | Never delete data files automatically |
| Primary failure | Promote replica to primary (failover) | High | Usually requires human validation |
Example: Automated Idle Connection Cleanup:
```python
#!/usr/bin/env python3
"""
Automated cleanup of idle database connections.
Triggered when connection pool utilization exceeds threshold.
"""

import logging

import psycopg2

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Configuration
MAX_IDLE_TIME_MINUTES = 30
CONNECTION_THRESHOLD_PERCENT = 85
DRY_RUN = True  # Set to False to actually terminate connections


def get_idle_connections(conn):
    """Find connections idle longer than threshold."""
    query = """
        SELECT pid, usename, datname, application_name, client_addr,
               state, now() - state_change AS idle_time, query
        FROM pg_stat_activity
        WHERE state = 'idle'
          AND state_change < now() - make_interval(mins => %s)
          AND pid != pg_backend_pid()
          AND usename NOT IN ('replication_user', 'monitoring_user')
        ORDER BY idle_time DESC;
    """
    with conn.cursor() as cur:
        cur.execute(query, (MAX_IDLE_TIME_MINUTES,))
        return cur.fetchall()


def get_connection_utilization(conn):
    """Get current connection pool utilization percentage."""
    query = """
        SELECT
            (SELECT count(*) FROM pg_stat_activity) AS current,
            (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max
    """
    with conn.cursor() as cur:
        cur.execute(query)
        current, max_conn = cur.fetchone()
        return (current / max_conn) * 100


def terminate_connection(conn, pid, reason):
    """Safely terminate a database connection."""
    if DRY_RUN:
        logger.info(f"[DRY RUN] Would terminate PID {pid}: {reason}")
        return True
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_terminate_backend(%s);", (pid,))
            result = cur.fetchone()[0]
            if result:
                logger.info(f"Terminated PID {pid}: {reason}")
            else:
                logger.warning(f"Failed to terminate PID {pid}")
            return result
    except Exception as e:
        logger.error(f"Error terminating PID {pid}: {e}")
        return False


def main():
    conn = psycopg2.connect(
        host="localhost",
        database="postgres",
        user="admin"
    )
    conn.autocommit = True

    try:
        utilization = get_connection_utilization(conn)
        logger.info(f"Current connection utilization: {utilization:.1f}%")

        if utilization < CONNECTION_THRESHOLD_PERCENT:
            logger.info("Utilization below threshold, no action needed")
            return

        logger.warning(
            f"Utilization {utilization:.1f}% exceeds threshold {CONNECTION_THRESHOLD_PERCENT}%"
        )

        idle_connections = get_idle_connections(conn)
        logger.info(f"Found {len(idle_connections)} idle connections")

        terminated = 0
        for pid, user, db, app, addr, state, idle_time, query in idle_connections:
            reason = f"Idle for {idle_time}, user={user}, db={db}, app={app}"
            if terminate_connection(conn, pid, reason):
                terminated += 1

        logger.info(f"Terminated {terminated} idle connections")

        # Log final state
        new_utilization = get_connection_utilization(conn)
        logger.info(f"New connection utilization: {new_utilization:.1f}%")
    finally:
        conn.close()


if __name__ == "__main__":
    main()
```

Proactive management includes scheduled maintenance that prevents degradation. Monitoring data guides the timing and scope of these activities.
Regular Maintenance Tasks:
| Task | Frequency | Monitoring Trigger | Impact |
|---|---|---|---|
| Statistics Update | Daily or after large data changes | Query plan degradation detected | Brief CPU spike, improves query planning |
| Routine VACUUM | Continuous (autovacuum) | Dead tuple ratio > 20% | Background I/O, reclaims space |
| VACUUM FULL / CLUSTER | Monthly or as needed | Table bloat > 50% | Table lock, significant I/O; schedule during maintenance window |
| REINDEX | Quarterly or as triggered | Index bloat > 40%, index scans slowing | Lock on index, CPU intensive |
| Query Plan Analysis | Weekly | Top query latency increasing | No direct impact; guides optimization |
| Backup Verification | Monthly | Backup age > threshold | Restore to a separate environment to verify recoverability |
| Security Audit | Quarterly | Compliance requirements | Review permissions, audit logs |
Monitoring-Driven Maintenance Scheduling:
"""Maintenance scheduler that uses monitoring data to prioritize tasks.Integrates with Prometheus for metrics and schedules maintenance windows.""" from dataclasses import dataclassfrom datetime import datetime, timedeltaimport prometheus_clientfrom prometheus_api_client import PrometheusConnect @dataclassclass MaintenanceTask: name: str table: str priority: int # 1 = highest estimated_duration: timedelta requires_lock: bool class MaintenanceScheduler: def __init__(self, prometheus_url: str): self.prom = PrometheusConnect(url=prometheus_url) def get_tables_needing_vacuum(self, dead_tuple_threshold: float = 0.2) -> list: """Find tables where dead tuples exceed threshold of total tuples.""" query = f""" pg_stat_user_tables_n_dead_tup / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup) > {dead_tuple_threshold} """ result = self.prom.custom_query(query) return [ MaintenanceTask( name="vacuum", table=r['metric']['relname'], priority=1 if float(r['value'][1]) > 0.5 else 2, estimated_duration=timedelta(minutes=30), requires_lock=False ) for r in result ] def get_tables_needing_reindex(self, bloat_threshold: float = 0.4) -> list: """Find indexes with high bloat ratio.""" # This would query index bloat metrics # Implementation depends on your bloat estimation approach pass def get_stale_statistics(self, age_threshold_days: int = 7) -> list: """Find tables with statistics older than threshold.""" query = f""" (time() - pg_stat_user_tables_last_analyze) / 86400 > {age_threshold_days} """ result = self.prom.custom_query(query) return [ MaintenanceTask( name="analyze", table=r['metric']['relname'], priority=3, estimated_duration=timedelta(minutes=5), requires_lock=False ) for r in result ] def schedule_maintenance(self, window_start: datetime, window_duration: timedelta): """ Schedule maintenance tasks within a maintenance window. Prioritizes by urgency, respects lock requirements, estimates timing. """ tasks = [] tasks.extend(self.get_tables_needing_vacuum()) tasks.extend(self.get_stale_statistics()) # Sort by priority tasks.sort(key=lambda t: t.priority) # Schedule within window current_time = window_start window_end = window_start + window_duration scheduled = [] for task in tasks: if current_time + task.estimated_duration <= window_end: scheduled.append({ 'task': task, 'scheduled_time': current_time }) current_time += task.estimated_duration else: # Window full, remaining tasks go to next window break return scheduledDefine regular maintenance windows (e.g., Sunday 2-6 AM) when heavier operations can run. Use monitoring to identify the lowest-activity period. During the window, run operations prioritized by monitoring data—highest-bloat tables first, most-stale statistics first.
Proactive performance management uses monitoring data to identify and prioritize optimization opportunities before they become problems.
Optimization Workflow:
Query Optimization Priority Matrix:
| Factor | High Priority | Low Priority |
|---|---|---|
| Total Time | Top 10 by cumulative time | Tail of distribution |
| Frequency | Runs 1000s of times/day | Runs once per week |
| Growth | Latency increasing week-over-week | Stable or improving |
| Rows Ratio | Examines 100x more rows than it returns | Efficient row access |
| User Impact | Critical path (checkout, login) | Background jobs |
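The matrix can be turned into a rough scoring function over pg_stat_statements-style data. Below is a minimal sketch; the field names, weights, and sample values are illustrative assumptions, not a standard formula.

```python
from dataclasses import dataclass

@dataclass
class QueryStats:
    query_id: str
    total_time_ms: float       # cumulative execution time
    calls_per_day: float       # frequency
    latency_growth_pct: float  # week-over-week change in mean latency
    rows_examined: float
    rows_returned: float
    on_critical_path: bool     # e.g., checkout, login

def priority_score(q: QueryStats) -> float:
    """Higher score = optimize sooner. Weights are illustrative."""
    rows_ratio = q.rows_examined / max(q.rows_returned, 1)
    return (
        q.total_time_ms / 1000                 # total time dominates
        + q.calls_per_day / 1000               # very frequent queries matter more
        + max(q.latency_growth_pct, 0) * 10    # penalize worsening trends
        + min(rows_ratio, 1000) / 10           # inefficient row access
        + (500 if q.on_critical_path else 0)   # user-facing paths jump the queue
    )

candidates = [
    QueryStats("checkout_lookup", 90_000, 50_000, 4.0, 2_000_000, 1_200, True),
    QueryStats("weekly_report", 400_000, 7, 0.0, 80_000_000, 90_000, False),
]
for q in sorted(candidates, key=priority_score, reverse=True):
    print(f"{q.query_id}: score={priority_score(q):,.0f}")
```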
Index Recommendation from Monitoring:
```sql
-- Find queries doing sequential scans on large tables
-- Candidates for new indexes
SELECT
    schemaname,
    relname AS table_name,
    seq_scan,
    seq_tup_read,
    idx_scan,
    ROUND(CASE WHEN seq_scan > 0
               THEN seq_tup_read::numeric / seq_scan
               ELSE 0 END, 2) AS avg_rows_per_seq_scan,
    pg_size_pretty(pg_relation_size(relid)) AS table_size
FROM pg_stat_user_tables
WHERE seq_scan > 100                                  -- More than 100 sequential scans
  AND pg_relation_size(relid) > 10 * 1024 * 1024      -- Table > 10MB
  AND (idx_scan < seq_scan OR idx_scan IS NULL)       -- More seq scans than index scans
ORDER BY seq_tup_read DESC
LIMIT 20;

-- Find unused indexes (candidates for removal)
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelname NOT LIKE '%_pkey'     -- Don't suggest removing primary keys
  AND indexrelname NOT LIKE '%_unique%'  -- Be careful with unique constraints
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;

-- Find duplicate/redundant indexes
SELECT
    a.indexrelid::regclass AS redundant_index,
    b.indexrelid::regclass AS covering_index,
    pg_size_pretty(pg_relation_size(a.indexrelid)) AS redundant_size
FROM pg_index a
JOIN pg_index b ON (
    a.indrelid = b.indrelid
    AND a.indexrelid != b.indexrelid
    AND (
        -- b has the same leading columns as a plus more, so b covers a
        (b.indkey::text LIKE a.indkey::text || ' %')
        -- or the two are exact duplicates (keep the lower OID)
        OR (a.indkey::text = b.indkey::text AND a.indexrelid > b.indexrelid)
    )
)
WHERE pg_relation_size(a.indexrelid) > 1024 * 1024  -- > 1MB
ORDER BY pg_relation_size(a.indexrelid) DESC;
```

Document each optimization: what changed, expected improvement, actual improvement. This creates institutional knowledge and helps justify time spent on proactive work: "We reduced query X latency from 500ms to 50ms, saving $Y in compute costs and improving user experience."
Runbooks transform monitoring insights into actionable procedures. Every alert should link to a runbook that guides responders through diagnosis and remediation.
Runbook Structure:
````markdown
# Runbook: High Replication Lag

## Overview
**Alert Name:** PostgreSQLReplicationLagCritical
**Severity:** P2 (High)
**Service Impact:** Read replicas serving stale data; failover would lose recent transactions

## Symptoms
- pg_replication_lag_seconds > 60 for 5+ minutes
- Dashboard: Replication > Lag shows elevated values
- Possible: Increased read latency if clients hit stale replicas

## Quick Diagnosis

### 1. Check Replication Status
```sql
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes_behind
FROM pg_stat_replication;
```

### 2. Check Replica Resources
- **CPU**: Is replica CPU-bound replaying WAL?
- **Disk I/O**: Is replica disk saturated?
- **Network**: Is there latency between primary and replica?

Dashboard: Infrastructure > Node Metrics > [replica-hostname]

### 3. Check for Blocking Queries on Replica
```sql
-- On replica, check for conflicts
SELECT * FROM pg_stat_database_conflicts WHERE datname = current_database();

-- Check for long-running queries blocking replay
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND query_start < now() - interval '1 minute';
```

## Common Causes and Remediation

### Cause: High Write Load on Primary
**Diagnosis:** Primary WAL generation rate spiked
**Resolution:**
1. Identify cause of write spike (deployment, batch job, attack)
2. Wait for spike to subside, replica will catch up
3. If persistent, scale replica resources

### Cause: Long-Running Query Blocking Replay
**Diagnosis:** pg_stat_database_conflicts shows conflicts
**Resolution:**
```sql
-- Cancel long queries on replica (safe)
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active' AND query_start < now() - interval '5 minutes';
```

### Cause: Replica Resource Exhaustion
**Diagnosis:** Replica CPU/disk/memory at limits
**Resolution:** Scale replica instance or optimize queries

### Cause: Network Issues
**Diagnosis:** Network metrics show latency/packet loss
**Resolution:** Escalate to network team

## Escalation
- If lag > 300 seconds: Escalate to DBA lead
- If approaching max_slot_wal_keep_size: URGENT - risk of slot invalidation
- If primary disk filling due to WAL retention: CRITICAL

## Post-Incident
1. Document root cause in incident report
2. Add specific monitoring if new failure mode discovered
3. Update this runbook with learnings
````

Runbook Maintenance:
Consider embedding runbooks in your monitoring configuration. Alertmanager annotations can include runbook URLs. Some teams go further, generating runbooks from code or even making runbooks executable with embedded scripts for common diagnostics.
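One lightweight way to make a runbook "executable" is to keep its diagnostic queries alongside a small script that runs them and prints the results, so responders start with data instead of a blank terminal. The following is a minimal sketch using psycopg2 and the diagnostic queries from the replication-lag runbook above; the connection settings are placeholders, and in practice the two queries run against different hosts (primary vs. replica).

```python
#!/usr/bin/env python3
"""Run the diagnostic queries from the replication-lag runbook and print results."""

import psycopg2

# Diagnostic steps lifted from the runbook above
DIAGNOSTICS = {
    "Replication status (run on primary)": """
        SELECT client_addr, state,
               pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes_behind
        FROM pg_stat_replication;
    """,
    "Long-running queries (run on replica)": """
        SELECT pid, now() - query_start AS duration, left(query, 60) AS query
        FROM pg_stat_activity
        WHERE state = 'active'
          AND query_start < now() - interval '1 minute';
    """,
}

def main():
    # Placeholder connection settings; point at the instance named by the alert
    conn = psycopg2.connect(host="localhost", dbname="postgres", user="admin")
    try:
        with conn.cursor() as cur:
            for title, sql in DIAGNOSTICS.items():
                print(f"\n== {title} ==")
                cur.execute(sql)
                for row in cur.fetchall():
                    print(row)
    finally:
        conn.close()

if __name__ == "__main__":
    main()
```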
Proactive database management is not a destination but a journey. Establish practices that drive ongoing improvement.
Regular Review Cadences:
| Review | Frequency | Participants | Outcomes |
|---|---|---|---|
| On-Call Handoff | Daily/Weekly | Outgoing and incoming on-call | Alert summary, known issues, pending actions |
| Incident Review | After each incident | Involved responders | Timeline, root cause, action items |
| Database Health Review | Weekly | DBA team | Trends, capacity, optimization opportunities |
| Alert Quality Review | Monthly | DBA team | False positive rate, missing alerts, threshold tuning |
| Capacity Planning | Quarterly | DBA + Engineering leads | Growth projections, resource requests |
| Architecture Review | Quarterly | DBA + Architecture team | Strategic improvements, technology evaluation |
Metrics for Operational Excellence:
Building a Culture of Proactivity:
Google's SRE book suggests that operational teams should spend no more than 50% of their time on operational toil (reactive work). The remaining 50% should go to engineering improvements. If you're above 50% toil, prioritize automation and reliability improvements until you get back below that line.
Proactive database management transforms the DBA role from firefighter to engineer—from responding to problems to preventing them. The monitoring infrastructure we've built throughout this module provides the foundation; proactive practices turn that data into action.
Module Complete:
You've now completed the Monitoring and Alerting module. From understanding key metrics, through monitoring tools and alert configuration, to building dashboards and establishing proactive practices—you have the knowledge to implement world-class database monitoring.
The work doesn't end here. Monitoring systems require ongoing attention: thresholds need tuning, dashboards need refining, new failure modes need detection, and proactive practices need nurturing. But with this foundation, you can build and operate databases that your users and business can depend on.
Congratulations! You've completed the Monitoring and Alerting module. You now have the knowledge to implement comprehensive database monitoring, configure intelligent alerting, build insightful dashboards, and establish proactive management practices that prevent incidents before they impact users.