You cannot optimize what you cannot measure. Database monitoring provides the visibility needed to detect problems before they impact users, diagnose root causes when issues occur, and verify that tuning efforts deliver expected improvements.
Effective monitoring transforms reactive firefighting into proactive performance management. Instead of waiting for users to complain, you see degradation as it begins. Instead of guessing at causes, you pinpoint bottlenecks with data.
By the end of this page, you will understand key database metrics to monitor, built-in monitoring capabilities in major databases, external monitoring tools and their integration, dashboard design principles, and alerting strategies that prevent alert fatigue while catching real issues.
Effective monitoring focuses on actionable metrics—numbers that indicate problems and guide solutions. These fall into categories that map to the tuning areas we've covered.
| Category | Key Metrics | Warning Signs |
|---|---|---|
| Performance | Query latency, throughput (TPS), response time percentiles (p95, p99) | p99 > 10× median; throughput decline |
| Resource Utilization | CPU%, memory usage, disk I/O, network I/O | Sustained >80%; sudden spikes |
| Buffer/Cache | Hit ratio, page reads, cache misses | Hit ratio <95%; rising cache misses |
| Connections | Active connections, waiting connections, pool utilization | Near max_connections; many waiting |
| Locks/Waits | Lock waits, deadlocks, blocked queries | Rising lock time; any deadlocks |
| Replication | Lag seconds, replication status, apply rate | Lag > acceptable threshold |
| Storage | Disk space, table bloat, index fragmentation | <20% free space; high fragmentation |
A useful framework here is the USE method: for every resource (CPU, memory, disk, network), track Utilization (percent busy), Saturation (queue length), and Errors (count). High utilization without saturation is fine; saturation means requests are waiting; errors indicate failures.
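As a concrete illustration, here is a minimal USE-style check for a single resource (connections) in PostgreSQL—a sketch built on the pg_stat_activity view covered below:

```sql
-- Sketch: USE check for the connection resource in PostgreSQL.
-- Utilization: share of max_connections currently in use.
-- Saturation: active sessions waiting on a lock, I/O, or other event.
-- Errors for this resource would be connection failures, visible in the server log.
SELECT
    count(*) AS connections_in_use,
    current_setting('max_connections')::int AS max_connections,
    round(100.0 * count(*) / current_setting('max_connections')::int, 1)
        AS utilization_pct,
    count(*) FILTER (WHERE state = 'active' AND wait_event IS NOT NULL)
        AS saturated_sessions
FROM pg_stat_activity;
```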
Every major database includes statistics views and monitoring capabilities. These are your primary diagnostic tools—learn them thoroughly.
```sql
-- POSTGRESQL BUILT-IN MONITORING

-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Current activity (who's doing what)
SELECT pid, usename, state, query_start,
       now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

-- Database-level statistics
SELECT datname,
       numbackends AS connections,
       xact_commit AS commits,
       xact_rollback AS rollbacks,
       blks_hit, blks_read,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0) * 100, 2) AS cache_hit_pct
FROM pg_stat_database
WHERE datname NOT LIKE 'template%';

-- Table-level statistics
SELECT schemaname, relname, seq_scan, idx_scan,
       n_tup_ins, n_tup_upd, n_tup_del,
       n_live_tup, n_dead_tup, last_vacuum, last_analyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Top queries by time (requires pg_stat_statements)
SELECT round(total_exec_time::numeric, 2) AS total_ms,
       calls,
       round(mean_exec_time::numeric, 2) AS avg_ms,
       query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Lock monitoring
SELECT blocked.pid AS blocked_pid,
       blocked.query AS blocked_query,
       blocking.pid AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.state = 'active';
```
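The same information is available in MySQL and MariaDB; a brief sketch (assuming performance_schema is enabled, which is the default in modern versions):

```sql
-- MYSQL BUILT-IN MONITORING (sketch)

-- Current activity (who's doing what)
SELECT id, user, command, time AS seconds, state, LEFT(info, 80) AS query
FROM information_schema.processlist
WHERE command != 'Sleep'
ORDER BY time DESC;

-- Top statements by total latency (performance_schema timers are in picoseconds)
SELECT LEFT(digest_text, 80) AS query,
       count_star AS calls,
       ROUND(sum_timer_wait / 1e12, 2) AS total_s,
       ROUND(avg_timer_wait / 1e9, 2) AS avg_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC
LIMIT 20;
```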
Built-in monitoring requires querying the database. External monitoring tools collect metrics continuously, store historical data, visualize trends, and alert on thresholds—essential for production operations.

| Tool | Type | Best For |
|---|---|---|
| Prometheus + Grafana | Open source metrics/visualization | Custom dashboards, Kubernetes, flexible alerting |
| Datadog | Commercial APM | Full-stack observability, managed service |
| pgAdmin / MySQL Workbench | Database-specific GUI | Development, ad-hoc administration |
| Percona Monitoring (PMM) | Open source | MySQL, PostgreSQL, MongoDB deep monitoring |
| SolarWinds DPA | Commercial | Multi-database wait analysis, query tuning |
| AWS RDS Performance Insights | Cloud-native | RDS/Aurora users, integrated experience |
| Azure SQL Analytics | Cloud-native | Azure SQL Database, managed insights |
```yaml
# PROMETHEUS + GRAFANA SETUP

# PostgreSQL Exporter configuration
# (postgres_exporter for Prometheus)
---
# prometheus.yml scrape config:
scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets: ['localhost:9187']
  - job_name: 'mysql'
    static_configs:
      - targets: ['localhost:9104']
  - job_name: 'sqlserver'
    static_configs:
      - targets: ['localhost:4000']

# Key metrics exposed by postgres_exporter:
# - pg_stat_activity_count (connections by state)
# - pg_stat_database_blks_* (buffer cache metrics)
# - pg_stat_statements_* (query statistics)
# - pg_replication_lag (replication delay)

# Alerting rules example:
groups:
  - name: database_alerts
    rules:
      - alert: HighConnectionCount
        expr: pg_stat_activity_count > 180
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database connection count high"
      - alert: ReplicationLag
        expr: pg_replication_lag > 30
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Replication lag exceeds 30 seconds"
      - alert: LowCacheHitRatio
        expr: pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) < 0.95
        for: 10m
        labels:
          severity: warning
```
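Once Prometheus is scraping the exporter, dashboards and alert expressions are plain PromQL over these series. Two illustrative queries (metric names assume postgres_exporter defaults):

```promql
# Connections by state, summed across all databases
sum by (state) (pg_stat_activity_count)

# Transactions committed per second, averaged over the last 5 minutes
rate(pg_stat_database_xact_commit[5m])
```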
You don't need every monitoring tool from day one. Start with database built-in statistics and slow query logs. Add Prometheus/Grafana for visualization. Evolve as needs grow.

Slow query logs capture queries exceeding a time threshold—the most direct path to identifying performance problems. Every database supports slow query logging; enable it in production.
```sql
-- POSTGRESQL SLOW QUERY LOGGING

-- Configuration (postgresql.conf)
/*
# Log queries exceeding threshold
log_min_duration_statement = 1000   # ms (1 second)

# Or log all statements (high overhead)
# log_statement = 'all'

# Include useful context
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a '
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on

# Auto-explain for slow queries
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '3s'
auto_explain.log_analyze = true
*/

-- Using pg_stat_statements (preferred approach)
SELECT round(total_exec_time::numeric, 0) AS total_ms,
       round(mean_exec_time::numeric, 0) AS avg_ms,
       round(stddev_exec_time::numeric, 0) AS stddev_ms,
       calls,
       rows,
       query
FROM pg_stat_statements
WHERE mean_exec_time > 1000   -- over 1 second average
ORDER BY total_exec_time DESC
LIMIT 20;

-- Reset statistics periodically
SELECT pg_stat_statements_reset();
```
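The equivalent in MySQL is the slow query log; a minimal sketch (the log path is illustrative, and note that long_query_time is in seconds, not milliseconds):

```sql
-- Enable the slow query log at runtime (also settable in my.cnf)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;                    -- seconds
SET GLOBAL log_queries_not_using_indexes = 'ON';   -- optional, can be noisy
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';  -- illustrative path
```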
Alerts must balance sensitivity (catching real issues) with specificity (not crying wolf). Too many alerts cause alert fatigue; too few mean problems go unnoticed.

| Alert | Threshold Example | Severity |
|---|---|---|
| Connection count > 90% of max | 180 of 200 for 5 min | Warning |
| Replication lag > 30 seconds | 30s for 2 min | Critical |
| Buffer cache hit ratio < 95% | <95% for 10 min | Warning |
| Disk space < 10% | <10% free | Critical |
| Long-running query > 10 min | 10 min active | Warning |
| Deadlock detected | Any deadlock | Warning |
| Database connection failure | Any failure from app | Critical |
If alerts fire constantly without requiring action, people ignore them. Every alert should be actionable—when it fires, something must be done. If an alert fires but requires no action, raise the threshold or delete it.
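One common way to enforce this discipline in the Prometheus stack is Alertmanager routing: group related alerts and rate-limit re-notification so an unresolved issue pages once, not hourly. A minimal sketch (receiver names and intervals are illustrative):

```yaml
# Alertmanager routing sketch: reduce noise by grouping and rate-limiting.
route:
  receiver: dba-team
  group_by: ['alertname', 'instance']   # batch alerts that fire together
  group_wait: 30s                       # wait before first notification of a group
  group_interval: 5m                    # wait before notifying about new alerts in a group
  repeat_interval: 4h                   # don't re-notify an unresolved alert more often
  routes:
    - matchers: ['severity="critical"']
      receiver: on-call-pager
receivers:
  - name: dba-team       # real configs would add email/chat integrations here
  - name: on-call-pager
```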
Module Complete:
You have now completed the Performance Tuning module, covering query tuning, index tuning, configuration tuning, memory tuning, and monitoring tools. These skills form a comprehensive toolkit for ensuring database systems perform optimally at any scale. Apply these techniques iteratively: monitor → identify bottleneck → tune → verify improvement → repeat.
Congratulations! You've mastered database performance tuning from multiple angles. These skills are essential for every DBA and senior engineer working with data at scale. Practice on real systems, build monitoring dashboards, and never stop measuring.