Reactive teams fight fires. Proactive teams prevent them.
The monitoring infrastructure we've built—metrics collection, alerting systems, and dashboards—can serve two fundamentally different purposes:
Reactive Use — Wait for alerts, respond to incidents, restore service, move on.
Proactive Use — Analyze trends, predict problems, optimize before degradation, prevent incidents from occurring.
Mature database operations teams spend significantly more time on proactive work. They use monitoring data not just to detect problems, but to anticipate them—addressing capacity limits before exhaustion, optimizing queries before they cause bottlenecks, and planning infrastructure changes before they become urgent.
This page synthesizes everything we've learned into a proactive database management practice that reduces incidents, improves reliability, and creates space for improvement work rather than constant firefighting.
By the end of this page, you will be able to plan capacity using monitoring data, implement trend analysis to predict issues, build automated remediation for common problems, establish continuous improvement practices, and create operational runbooks driven by monitoring insights.
Capacity planning transforms monitoring from incident detection to incident prevention. By projecting resource consumption into the future, you can procure resources, optimize systems, or archive data before hitting limits.
The Capacity Planning Cycle:
Key Resources to Plan:
| Resource | Metrics to Track | Planning Horizon | Remediation Options |
|---|---|---|---|
| Storage | Disk usage, growth rate, table sizes | 3-6 months | Add storage, archive data, compress, partition |
| Connections | Connection count, pool utilization | 1-3 months | Increase limits, add pooler, optimize app |
| Memory | Buffer pool hit ratio, swap usage | 3-6 months | Add RAM, tune cache settings, optimize queries |
| CPU | Utilization, query processing time | 1-3 months | Scale up, optimize queries, read replicas |
| IOPS | Read/write operations, queue depth | 3-6 months | Faster storage, caching, query optimization |
Growth Rate Calculation:
```promql
# Daily growth rate (GB/day)
deriv(pg_database_size_bytes[7d]) * 86400 / 1024 / 1024 / 1024

# Days until disk exhaustion at the current growth rate
node_filesystem_avail_bytes
  / deriv((node_filesystem_size_bytes - node_filesystem_avail_bytes)[7d:1h])
  / 86400

# Projected disk usage in 30 days
pg_database_size_bytes + (deriv(pg_database_size_bytes[30d]) * 30 * 86400)

# Connection pool exhaustion projection:
# projected connection utilization (%) 30 days out
sum(predict_linear(pg_stat_activity_count[7d], 86400 * 30))
  / max(pg_settings_max_connections) * 100
```

```sql
-- Storage capacity planning query for reporting
-- Assumes a database_metrics table that tracks historical sizes
SELECT
    datname,
    current_size_gb,
    (current_size_gb - size_gb_30_days_ago) / 30 AS avg_daily_growth_gb,
    disk_total_gb - current_size_gb AS remaining_gb,
    CASE
        WHEN current_size_gb > size_gb_30_days_ago
        THEN (disk_total_gb - current_size_gb)
             / ((current_size_gb - size_gb_30_days_ago) / 30)
        ELSE NULL
    END AS days_until_full
FROM database_metrics;
```

Your database doesn't experience average load; it experiences peak load. Black Friday, month-end closing, viral content events. Capacity planning must account for peak demand, not just current usage trends. If you can, analyze year-over-year patterns to identify seasonal peaks.
Day-to-day metrics can mask gradual degradation. A 1% weekly increase in query latency doesn't trigger any alert, but it compounds to roughly 68% degradation over a year (1.01^52 ≈ 1.68). Trend analysis catches these slow-motion problems.
What to Trend:
Week-over-Week and Month-over-Month Comparison:
```promql
# Week-over-week query latency comparison
# Current hour's p95 latency
histogram_quantile(0.95, sum(rate(query_duration_seconds_bucket[1h])) by (le))
# vs. the same hour last week
histogram_quantile(0.95, sum(rate(query_duration_seconds_bucket[1h] offset 7d)) by (le))

# Percentage change in transaction rate week-over-week
(
  sum(rate(pg_stat_database_xact_commit[1h]))
  - sum(rate(pg_stat_database_xact_commit[1h] offset 7d))
)
  / sum(rate(pg_stat_database_xact_commit[1h] offset 7d)) * 100

# Month-over-month storage growth (GB)
(pg_database_size_bytes - pg_database_size_bytes offset 30d) / 1024 / 1024 / 1024

# Trend of the rollback (error) ratio over 30 days
# Alert if it is trending upward
deriv(
  (
    sum(rate(pg_stat_database_xact_rollback[1d]))
      /
    (sum(rate(pg_stat_database_xact_commit[1d])) + sum(rate(pg_stat_database_xact_rollback[1d])))
  )[30d:1d]
) > 0
```

Anomaly Detection Approaches:
| Method | How It Works | Best For |
|---|---|---|
| Static Thresholds | Alert when value exceeds fixed limit | Absolute limits (disk space, connection count) |
| Standard Deviation | Alert when value deviates N sigmas from mean | Metrics with normal distribution |
| Percentage Change | Alert when value changes > X% from baseline | Detecting sudden shifts |
| Seasonal Decomposition | Compare to same time period historically | Metrics with daily/weekly patterns |
| Machine Learning | Model learns normal patterns, flags deviations | Complex, multi-dimensional patterns |
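As a concrete illustration of the standard-deviation approach in the table above, here is a minimal sketch. It assumes you already have a list of historical samples for a metric (for example, hourly transaction rates pulled from your metrics store); the function and sample values are illustrative, not part of any library.

```python
from statistics import mean, stdev

def is_anomalous(history: list[float], latest: float, n_sigmas: float = 3.0) -> bool:
    """Flag `latest` if it deviates more than n_sigmas from the mean of `history`.

    Works best for metrics that are roughly normally distributed; for metrics
    with strong daily/weekly patterns, compare against the same hour in
    previous weeks instead (seasonal decomposition).
    """
    if len(history) < 2:
        return False  # not enough data to estimate variance
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest != mu  # flat baseline: any change counts as a deviation
    z_score = abs(latest - mu) / sigma
    return z_score > n_sigmas

# Example: hourly transaction-rate samples for the past half day (hypothetical numbers)
hourly_tps = [410, 395, 402, 388, 415, 407, 399, 420, 405, 398, 412, 401]
print(is_anomalous(hourly_tps, latest=780))  # True: far outside 3 sigmas
```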
Schedule a weekly database health review. Compare key metrics to the previous week. Look at growth trends. Review any near-miss incidents. This 30-minute ritual often catches problems weeks before they become incidents.
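To make that ritual cheap to run, a short script can pull this week's and last week's values for a handful of key metrics and print a comparison. The following is a minimal sketch, assuming a Prometheus server at the hypothetical URL below and the prometheus_api_client library (also used by the maintenance scheduler later on this page); the PromQL expressions are illustrative.

```python
import time

from prometheus_api_client import PrometheusConnect

# Hypothetical Prometheus endpoint; adjust to your environment
prom = PrometheusConnect(url="http://prometheus.internal:9090")

# Key health metrics and the PromQL used for each (expressions are illustrative)
CHECKS = {
    "p95 query latency (s)":
        'histogram_quantile(0.95, sum(rate(query_duration_seconds_bucket[1h])) by (le))',
    "transactions/sec":
        'sum(rate(pg_stat_database_xact_commit[1h]))',
    "database size (GB)":
        'sum(pg_database_size_bytes) / 1024 / 1024 / 1024',
}

WEEK_SECONDS = 7 * 24 * 3600


def instant_value(query, at=None):
    """Run an instant query, optionally at a past timestamp, and return one value."""
    params = {"time": at} if at is not None else None
    result = prom.custom_query(query, params=params)
    return float(result[0]["value"][1]) if result else float("nan")


print(f"{'metric':<26}{'this week':>12}{'last week':>12}{'change %':>10}")
for name, query in CHECKS.items():
    current = instant_value(query)
    previous = instant_value(query, at=time.time() - WEEK_SECONDS)
    change = (current - previous) / previous * 100 if previous else float("nan")
    print(f"{name:<26}{current:>12.2f}{previous:>12.2f}{change:>10.1f}")
```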
Some problems have known, safe remediation steps. Automating these responses reduces mean time to recovery (MTTR) and frees engineers for higher-value work.
Candidates for Automation:
| Problem | Automated Response | Risk Level | Human Override |
|---|---|---|---|
| Connection pool exhaustion | Kill long-running idle connections | Low | Configurable threshold |
| Table bloat > 50% | Schedule VACUUM FULL during maintenance window | Medium | Requires approval for large tables |
| Replication lag spike | Pause non-critical replication consumers | Medium | Alert on action taken |
| Query timeout threshold | Cancel query and log for review | Low | Configurable timeout |
| Disk > 90% | Trigger log rotation, delete old exports | Medium | Never delete data files automatically |
| Primary failure | Promote replica to primary (failover) | High | Usually requires human validation |
Example: Automated Idle Connection Cleanup:
```python
#!/usr/bin/env python3
"""
Automated cleanup of idle database connections.
Triggered when connection pool utilization exceeds threshold.
"""

import logging

import psycopg2

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Configuration
MAX_IDLE_TIME_MINUTES = 30
CONNECTION_THRESHOLD_PERCENT = 85
DRY_RUN = True  # Set to False to actually terminate connections


def get_idle_connections(conn):
    """Find connections idle longer than threshold."""
    query = """
        SELECT pid, usename, datname, application_name, client_addr,
               state, now() - state_change AS idle_time, query
        FROM pg_stat_activity
        WHERE state = 'idle'
          AND state_change < now() - make_interval(mins => %s)
          AND pid != pg_backend_pid()
          AND usename NOT IN ('replication_user', 'monitoring_user')
        ORDER BY idle_time DESC;
    """
    with conn.cursor() as cur:
        cur.execute(query, (MAX_IDLE_TIME_MINUTES,))
        return cur.fetchall()


def get_connection_utilization(conn):
    """Get current connection pool utilization percentage."""
    query = """
        SELECT
            (SELECT count(*) FROM pg_stat_activity) AS current,
            (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max
    """
    with conn.cursor() as cur:
        cur.execute(query)
        current, max_conn = cur.fetchone()
        return (current / max_conn) * 100


def terminate_connection(conn, pid, reason):
    """Safely terminate a database connection."""
    if DRY_RUN:
        logger.info(f"[DRY RUN] Would terminate PID {pid}: {reason}")
        return True
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT pg_terminate_backend(%s);", (pid,))
            result = cur.fetchone()[0]
            if result:
                logger.info(f"Terminated PID {pid}: {reason}")
            else:
                logger.warning(f"Failed to terminate PID {pid}")
            return result
    except Exception as e:
        logger.error(f"Error terminating PID {pid}: {e}")
        return False


def main():
    conn = psycopg2.connect(
        host="localhost",
        database="postgres",
        user="admin"
    )
    conn.autocommit = True

    try:
        utilization = get_connection_utilization(conn)
        logger.info(f"Current connection utilization: {utilization:.1f}%")

        if utilization < CONNECTION_THRESHOLD_PERCENT:
            logger.info("Utilization below threshold, no action needed")
            return

        logger.warning(
            f"Utilization {utilization:.1f}% exceeds threshold {CONNECTION_THRESHOLD_PERCENT}%"
        )

        idle_connections = get_idle_connections(conn)
        logger.info(f"Found {len(idle_connections)} idle connections")

        terminated = 0
        for pid, user, db, app, addr, state, idle_time, query in idle_connections:
            reason = f"Idle for {idle_time}, user={user}, db={db}, app={app}"
            if terminate_connection(conn, pid, reason):
                terminated += 1

        logger.info(f"Terminated {terminated} idle connections")

        # Log final state
        new_utilization = get_connection_utilization(conn)
        logger.info(f"New connection utilization: {new_utilization:.1f}%")
    finally:
        conn.close()


if __name__ == "__main__":
    main()
```

Proactive management includes scheduled maintenance that prevents degradation. Monitoring data guides the timing and scope of these activities.
Regular Maintenance Tasks:
| Task | Frequency | Monitoring Trigger | Impact |
|---|---|---|---|
| Statistics Update | Daily or after large data changes | Query plan degradation detected | Brief CPU spike, improves query planning |
| Routine VACUUM | Continuous (autovacuum) | Dead tuple ratio > 20% | Background I/O, reclaims space |
| VACUUM FULL / CLUSTER | Monthly or as needed | Table bloat > 50% | Table lock, significant I/O; schedule during maintenance window |
| REINDEX | Quarterly or as triggered | Index bloat > 40%, index scans slowing | Lock on index, CPU intensive |
| Query Plan Analysis | Weekly | Top query latency increasing | No direct impact; guides optimization |
| Backup Verification | Monthly | Backup age > threshold | Restore to a separate environment to verify recoverability |
| Security Audit | Quarterly | Compliance requirements | Review permissions, audit logs |
Monitoring-Driven Maintenance Scheduling:
"""Maintenance scheduler that uses monitoring data to prioritize tasks.Integrates with Prometheus for metrics and schedules maintenance windows.""" from dataclasses import dataclassfrom datetime import datetime, timedeltaimport prometheus_clientfrom prometheus_api_client import PrometheusConnect @dataclassclass MaintenanceTask: name: str table: str priority: int # 1 = highest estimated_duration: timedelta requires_lock: bool class MaintenanceScheduler: def __init__(self, prometheus_url: str): self.prom = PrometheusConnect(url=prometheus_url) def get_tables_needing_vacuum(self, dead_tuple_threshold: float = 0.2) -> list: """Find tables where dead tuples exceed threshold of total tuples.""" query = f""" pg_stat_user_tables_n_dead_tup / (pg_stat_user_tables_n_live_tup + pg_stat_user_tables_n_dead_tup) > {dead_tuple_threshold} """ result = self.prom.custom_query(query) return [ MaintenanceTask( name="vacuum", table=r['metric']['relname'], priority=1 if float(r['value'][1]) > 0.5 else 2, estimated_duration=timedelta(minutes=30), requires_lock=False ) for r in result ] def get_tables_needing_reindex(self, bloat_threshold: float = 0.4) -> list: """Find indexes with high bloat ratio.""" # This would query index bloat metrics # Implementation depends on your bloat estimation approach pass def get_stale_statistics(self, age_threshold_days: int = 7) -> list: """Find tables with statistics older than threshold.""" query = f""" (time() - pg_stat_user_tables_last_analyze) / 86400 > {age_threshold_days} """ result = self.prom.custom_query(query) return [ MaintenanceTask( name="analyze", table=r['metric']['relname'], priority=3, estimated_duration=timedelta(minutes=5), requires_lock=False ) for r in result ] def schedule_maintenance(self, window_start: datetime, window_duration: timedelta): """ Schedule maintenance tasks within a maintenance window. Prioritizes by urgency, respects lock requirements, estimates timing. """ tasks = [] tasks.extend(self.get_tables_needing_vacuum()) tasks.extend(self.get_stale_statistics()) # Sort by priority tasks.sort(key=lambda t: t.priority) # Schedule within window current_time = window_start window_end = window_start + window_duration scheduled = [] for task in tasks: if current_time + task.estimated_duration <= window_end: scheduled.append({ 'task': task, 'scheduled_time': current_time }) current_time += task.estimated_duration else: # Window full, remaining tasks go to next window break return scheduledDefine regular maintenance windows (e.g., Sunday 2-6 AM) when heavier operations can run. Use monitoring to identify the lowest-activity period. During the window, run operations prioritized by monitoring data—highest-bloat tables first, most-stale statistics first.
Proactive performance management uses monitoring data to identify and prioritize optimization opportunities before they become problems.
Optimization Workflow:
Query Optimization Priority Matrix:
| Factor | High Priority | Low Priority |
|---|---|---|
| Total Time | Top 10 by cumulative time | Tail of distribution |
| Frequency | Runs 1000s of times/day | Runs once per week |
| Growth | Latency increasing week-over-week | Stable or improving |
| Rows Ratio | Examines 100x more rows than it returns | Efficient row access |
| User Impact | Critical path (checkout, login) | Background jobs |
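The matrix can be turned into a rough scoring function over pg_stat_statements-style data. Below is a minimal sketch; the field names, weights, and sample values are illustrative assumptions, not a standard formula.

```python
from dataclasses import dataclass

@dataclass
class QueryStats:
    query_id: str
    total_time_ms: float       # cumulative execution time
    calls_per_day: float       # frequency
    latency_growth_pct: float  # week-over-week change in mean latency
    rows_examined: float
    rows_returned: float
    on_critical_path: bool     # e.g., checkout, login

def priority_score(q: QueryStats) -> float:
    """Higher score = optimize sooner. Weights are illustrative."""
    rows_ratio = q.rows_examined / max(q.rows_returned, 1)
    return (
        q.total_time_ms / 1000                 # total time dominates
        + q.calls_per_day / 1000               # very frequent queries matter more
        + max(q.latency_growth_pct, 0) * 10    # penalize worsening trends
        + min(rows_ratio, 1000) / 10           # inefficient row access
        + (500 if q.on_critical_path else 0)   # user-facing paths jump the queue
    )

candidates = [
    QueryStats("checkout_lookup", 90_000, 50_000, 4.0, 2_000_000, 1_200, True),
    QueryStats("weekly_report", 400_000, 7, 0.0, 80_000_000, 90_000, False),
]
for q in sorted(candidates, key=priority_score, reverse=True):
    print(f"{q.query_id}: score={priority_score(q):,.0f}")
```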
Index Recommendation from Monitoring:
```sql
-- Find queries doing sequential scans on large tables
-- Candidates for new indexes
SELECT
    schemaname,
    relname AS table_name,
    seq_scan,
    seq_tup_read,
    idx_scan,
    ROUND(CASE WHEN seq_scan > 0
               THEN seq_tup_read::numeric / seq_scan
               ELSE 0 END, 2) AS avg_rows_per_seq_scan,
    pg_size_pretty(pg_relation_size(relid)) AS table_size
FROM pg_stat_user_tables
WHERE seq_scan > 100                                  -- More than 100 sequential scans
  AND pg_relation_size(relid) > 10 * 1024 * 1024      -- Table > 10MB
  AND (idx_scan < seq_scan OR idx_scan IS NULL)       -- More seq scans than index scans
ORDER BY seq_tup_read DESC
LIMIT 20;

-- Find unused indexes (candidates for removal)
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
  AND indexrelname NOT LIKE '%_pkey'     -- Don't suggest removing primary keys
  AND indexrelname NOT LIKE '%_unique%'  -- Be careful with unique constraints
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;

-- Find duplicate/redundant indexes
SELECT
    a.indexrelid::regclass AS redundant_index,
    b.indexrelid::regclass AS covering_index,
    pg_size_pretty(pg_relation_size(a.indexrelid)) AS redundant_size
FROM pg_index a
JOIN pg_index b ON (
    a.indrelid = b.indrelid
    AND a.indexrelid != b.indexrelid
    AND (
        -- b has the same leading columns as a plus more, so b covers a
        (b.indkey::text LIKE a.indkey::text || ' %')
        -- or the two are exact duplicates (keep the lower OID)
        OR (a.indkey::text = b.indkey::text AND a.indexrelid > b.indexrelid)
    )
)
WHERE pg_relation_size(a.indexrelid) > 1024 * 1024  -- > 1MB
ORDER BY pg_relation_size(a.indexrelid) DESC;
```

Document each optimization: what changed, expected improvement, actual improvement. This creates institutional knowledge and helps justify time spent on proactive work: "We reduced query X latency from 500ms to 50ms, saving $Y in compute costs and improving user experience."
Runbooks transform monitoring insights into actionable procedures. Every alert should link to a runbook that guides responders through diagnosis and remediation.
Runbook Structure:
````markdown
# Runbook: High Replication Lag

## Overview
**Alert Name:** PostgreSQLReplicationLagCritical
**Severity:** P2 (High)
**Service Impact:** Read replicas serving stale data; failover would lose recent transactions

## Symptoms
- pg_replication_lag_seconds > 60 for 5+ minutes
- Dashboard: Replication > Lag shows elevated values
- Possible: Increased read latency if clients hit stale replicas

## Quick Diagnosis

### 1. Check Replication Status
```sql
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes_behind
FROM pg_stat_replication;
```

### 2. Check Replica Resources
- **CPU**: Is replica CPU-bound replaying WAL?
- **Disk I/O**: Is replica disk saturated?
- **Network**: Is there latency between primary and replica?

Dashboard: Infrastructure > Node Metrics > [replica-hostname]

### 3. Check for Blocking Queries on Replica
```sql
-- On replica, check for conflicts
SELECT * FROM pg_stat_database_conflicts WHERE datname = current_database();

-- Check for long-running queries blocking replay
SELECT pid, now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND query_start < now() - interval '1 minute';
```

## Common Causes and Remediation

### Cause: High Write Load on Primary
**Diagnosis:** Primary WAL generation rate spiked
**Resolution:**
1. Identify cause of write spike (deployment, batch job, attack)
2. Wait for spike to subside, replica will catch up
3. If persistent, scale replica resources

### Cause: Long-Running Query Blocking Replay
**Diagnosis:** pg_stat_database_conflicts shows conflicts
**Resolution:**
```sql
-- Cancel long queries on replica (safe)
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active' AND query_start < now() - interval '5 minutes';
```

### Cause: Replica Resource Exhaustion
**Diagnosis:** Replica CPU/disk/memory at limits
**Resolution:** Scale replica instance or optimize queries

### Cause: Network Issues
**Diagnosis:** Network metrics show latency/packet loss
**Resolution:** Escalate to network team

## Escalation
- If lag > 300 seconds: Escalate to DBA lead
- If approaching max_slot_wal_keep_size: URGENT - risk of slot invalidation
- If primary disk filling due to WAL retention: CRITICAL

## Post-Incident
1. Document root cause in incident report
2. Add specific monitoring if new failure mode discovered
3. Update this runbook with learnings
````

Runbook Maintenance:
Consider embedding runbooks in your monitoring configuration. Alertmanager annotations can include runbook URLs. Some teams go further, generating runbooks from code or even making runbooks executable with embedded scripts for common diagnostics.
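One lightweight way to make a runbook "executable" is to keep its diagnostic queries alongside a small script that runs them and prints the results, so responders start with data instead of a blank terminal. The following is a minimal sketch using psycopg2 and the diagnostic queries from the replication-lag runbook above; the connection settings are placeholders, and in practice the two queries run against different hosts (primary vs. replica).

```python
#!/usr/bin/env python3
"""Run the diagnostic queries from the replication-lag runbook and print results."""

import psycopg2

# Diagnostic steps lifted from the runbook above
DIAGNOSTICS = {
    "Replication status (run on primary)": """
        SELECT client_addr, state,
               pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes_behind
        FROM pg_stat_replication;
    """,
    "Long-running queries (run on replica)": """
        SELECT pid, now() - query_start AS duration, left(query, 60) AS query
        FROM pg_stat_activity
        WHERE state = 'active'
          AND query_start < now() - interval '1 minute';
    """,
}

def main():
    # Placeholder connection settings; point at the instance named by the alert
    conn = psycopg2.connect(host="localhost", dbname="postgres", user="admin")
    try:
        with conn.cursor() as cur:
            for title, sql in DIAGNOSTICS.items():
                print(f"\n== {title} ==")
                cur.execute(sql)
                for row in cur.fetchall():
                    print(row)
    finally:
        conn.close()

if __name__ == "__main__":
    main()
```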
Proactive database management is not a destination but a journey. Establish practices that drive ongoing improvement.
Regular Review Cadences:
| Review | Frequency | Participants | Outcomes |
|---|---|---|---|
| On-Call Handoff | Daily/Weekly | Outgoing and incoming on-call | Alert summary, known issues, pending actions |
| Incident Review | After each incident | Involved responders | Timeline, root cause, action items |
| Database Health Review | Weekly | DBA team | Trends, capacity, optimization opportunities |
| Alert Quality Review | Monthly | DBA team | False positive rate, missing alerts, threshold tuning |
| Capacity Planning | Quarterly | DBA + Engineering leads | Growth projections, resource requests |
| Architecture Review | Quarterly | DBA + Architecture team | Strategic improvements, technology evaluation |
Metrics for Operational Excellence:
Building a Culture of Proactivity:
Google's SRE book suggests that operational teams should spend no more than 50% of their time on operational toil (reactive work). The remaining 50% should go to engineering improvements. If you're above 50% toil, prioritize automation and reliability improvements until you get back below that line.
Proactive database management transforms the DBA role from firefighter to engineer—from responding to problems to preventing them. The monitoring infrastructure we've built throughout this module provides the foundation; proactive practices turn that data into action.
Module Complete:
You've now completed the Monitoring and Alerting module. From understanding key metrics, through monitoring tools and alert configuration, to building dashboards and establishing proactive practices—you have the knowledge to implement world-class database monitoring.
The work doesn't end here. Monitoring systems require ongoing attention: thresholds need tuning, dashboards need refining, new failure modes need detection, and proactive practices need nurturing. But with this foundation, you can build and operate databases that your users and business can depend on.
Congratulations! You've completed the Monitoring and Alerting module. You now have the knowledge to implement comprehensive database monitoring, configure intelligent alerting, build insightful dashboards, and establish proactive management practices that prevent incidents before they impact users.