You cannot optimize what you cannot measure. Database monitoring provides the visibility needed to detect problems before they impact users, diagnose root causes when issues occur, and verify that tuning efforts deliver expected improvements.
Effective monitoring transforms reactive firefighting into proactive performance management. Instead of waiting for users to complain, you see degradation as it begins. Instead of guessing at causes, you pinpoint bottlenecks with data.
By the end of this page, you will understand key database metrics to monitor, built-in monitoring capabilities in major databases, external monitoring tools and their integration, dashboard design principles, and alerting strategies that prevent alert fatigue while catching real issues.
Effective monitoring focuses on actionable metrics—numbers that indicate problems and guide solutions. These fall into categories that map to the tuning areas we've covered.
| Category | Key Metrics | Warning Signs |
|---|---|---|
| Performance | Query latency, throughput (TPS), response time percentiles (p95, p99) | p99 > 10× median; throughput decline |
| Resource Utilization | CPU%, memory usage, disk I/O, network I/O | Sustained >80%; sudden spikes |
| Buffer/Cache | Hit ratio, page reads, cache misses | Hit ratio <95%; rising cache misses |
| Connections | Active connections, waiting connections, pool utilization | Near max_connections; many waiting |
| Locks/Waits | Lock waits, deadlocks, blocked queries | Rising lock time; any deadlocks |
| Replication | Lag seconds, replication status, apply rate | Lag > acceptable threshold |
| Storage | Disk space, table bloat, index fragmentation | <20% free space; high fragmentation |
A useful framework here is the USE method: for every resource (CPU, memory, disk, network), track Utilization (percent busy), Saturation (queue length), and Errors (count). High utilization without saturation is fine; saturation means requests are waiting; errors indicate failures.
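As a concrete illustration, here is a minimal USE-style check for a single resource (connections) in PostgreSQL—a sketch built on the pg_stat_activity view covered below:

```sql
-- Sketch: USE check for the connection resource in PostgreSQL.
-- Utilization: share of max_connections currently in use.
-- Saturation: active sessions waiting on a lock, I/O, or other event.
-- Errors for this resource would be connection failures, visible in the server log.
SELECT
    count(*) AS connections_in_use,
    current_setting('max_connections')::int AS max_connections,
    round(100.0 * count(*) / current_setting('max_connections')::int, 1)
        AS utilization_pct,
    count(*) FILTER (WHERE state = 'active' AND wait_event IS NOT NULL)
        AS saturated_sessions
FROM pg_stat_activity;
```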
Every major database includes statistics views and monitoring capabilities. These are your primary diagnostic tools—learn them thoroughly.
```sql
-- POSTGRESQL BUILT-IN MONITORING

-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Current activity (who's doing what)
SELECT pid, usename, state, query_start,
       now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY duration DESC;

-- Database-level statistics
SELECT datname,
       numbackends AS connections,
       xact_commit AS commits,
       xact_rollback AS rollbacks,
       blks_hit, blks_read,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0) * 100, 2) AS cache_hit_pct
FROM pg_stat_database
WHERE datname NOT LIKE 'template%';

-- Table-level statistics
SELECT schemaname, relname, seq_scan, idx_scan,
       n_tup_ins, n_tup_upd, n_tup_del,
       n_live_tup, n_dead_tup, last_vacuum, last_analyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

-- Top queries by time (requires pg_stat_statements)
SELECT round(total_exec_time::numeric, 2) AS total_ms,
       calls,
       round(mean_exec_time::numeric, 2) AS avg_ms,
       query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

-- Lock monitoring
SELECT blocked.pid AS blocked_pid,
       blocked.query AS blocked_query,
       blocking.pid AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE blocked.state = 'active';
```
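The same information is available in MySQL and MariaDB; a brief sketch (assuming performance_schema is enabled, which is the default in modern versions):

```sql
-- MYSQL BUILT-IN MONITORING (sketch)

-- Current activity (who's doing what)
SELECT id, user, command, time AS seconds, state, LEFT(info, 80) AS query
FROM information_schema.processlist
WHERE command != 'Sleep'
ORDER BY time DESC;

-- Top statements by total latency (performance_schema timers are in picoseconds)
SELECT LEFT(digest_text, 80) AS query,
       count_star AS calls,
       ROUND(sum_timer_wait / 1e12, 2) AS total_s,
       ROUND(avg_timer_wait / 1e9, 2) AS avg_ms
FROM performance_schema.events_statements_summary_by_digest
ORDER BY sum_timer_wait DESC
LIMIT 20;
```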
Built-in monitoring requires querying the database. External monitoring tools collect metrics continuously, store historical data, visualize trends, and alert on thresholds—essential for production operations.

| Tool | Type | Best For |
|---|---|---|
| Prometheus + Grafana | Open source metrics/visualization | Custom dashboards, Kubernetes, flexible alerting |
| Datadog | Commercial APM | Full-stack observability, managed service |
| pgAdmin / MySQL Workbench | Database-specific GUI | Development, ad-hoc administration |
| Percona Monitoring (PMM) | Open source | MySQL, PostgreSQL, MongoDB deep monitoring |
| SolarWinds DPA | Commercial | Multi-database wait analysis, query tuning |
| AWS RDS Performance Insights | Cloud-native | RDS/Aurora users, integrated experience |
| Azure SQL Analytics | Cloud-native | Azure SQL Database, managed insights |
```yaml
# PROMETHEUS + GRAFANA SETUP

# PostgreSQL Exporter configuration
# (postgres_exporter for Prometheus)
---
# prometheus.yml scrape config:
scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets: ['localhost:9187']
  - job_name: 'mysql'
    static_configs:
      - targets: ['localhost:9104']
  - job_name: 'sqlserver'
    static_configs:
      - targets: ['localhost:4000']

# Key metrics exposed by postgres_exporter:
# - pg_stat_activity_count (connections by state)
# - pg_stat_database_blks_* (buffer cache metrics)
# - pg_stat_statements_* (query statistics)
# - pg_replication_lag (replication delay)

# Alerting rules example:
groups:
  - name: database_alerts
    rules:
      - alert: HighConnectionCount
        expr: pg_stat_activity_count > 180
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database connection count high"
      - alert: ReplicationLag
        expr: pg_replication_lag > 30
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Replication lag exceeds 30 seconds"
      - alert: LowCacheHitRatio
        expr: pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) < 0.95
        for: 10m
        labels:
          severity: warning
```
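Once Prometheus is scraping the exporter, dashboards and alert expressions are plain PromQL over these series. Two illustrative queries (metric names assume postgres_exporter defaults):

```promql
# Connections by state, summed across all databases
sum by (state) (pg_stat_activity_count)

# Transactions committed per second, averaged over the last 5 minutes
rate(pg_stat_database_xact_commit[5m])
```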
You don't need every monitoring tool from day one. Start with database built-in statistics and slow query logs. Add Prometheus/Grafana for visualization. Evolve as needs grow.

Slow query logs capture queries exceeding a time threshold—the most direct path to identifying performance problems. Every database supports slow query logging; enable it in production.
```sql
-- POSTGRESQL SLOW QUERY LOGGING

-- Configuration (postgresql.conf)
/*
# Log queries exceeding threshold
log_min_duration_statement = 1000   # ms (1 second)

# Or log all statements (high overhead)
# log_statement = 'all'

# Include useful context
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a '
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on

# Auto-explain for slow queries
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '3s'
auto_explain.log_analyze = true
*/

-- Using pg_stat_statements (preferred approach)
SELECT round(total_exec_time::numeric, 0) AS total_ms,
       round(mean_exec_time::numeric, 0) AS avg_ms,
       round(stddev_exec_time::numeric, 0) AS stddev_ms,
       calls,
       rows,
       query
FROM pg_stat_statements
WHERE mean_exec_time > 1000   -- over 1 second average
ORDER BY total_exec_time DESC
LIMIT 20;

-- Reset statistics periodically
SELECT pg_stat_statements_reset();
```
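The equivalent in MySQL is the slow query log; a minimal sketch (the log path is illustrative, and note that long_query_time is in seconds, not milliseconds):

```sql
-- Enable the slow query log at runtime (also settable in my.cnf)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;                    -- seconds
SET GLOBAL log_queries_not_using_indexes = 'ON';   -- optional, can be noisy
SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';  -- illustrative path
```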
Alerts must balance sensitivity (catching real issues) with specificity (not crying wolf). Too many alerts cause alert fatigue; too few mean problems go unnoticed.

| Alert | Threshold Example | Severity |
|---|---|---|
| Connection count > 90% of max | 180 of 200 for 5 min | Warning |
| Replication lag > 30 seconds | 30s for 2 min | Critical |
| Buffer cache hit ratio < 95% | <95% for 10 min | Warning |
| Disk space < 10% | <10% free | Critical |
| Long-running query > 10 min | 10 min active | Warning |
| Deadlock detected | Any deadlock | Warning |
| Database connection failure | Any failure from app | Critical |
If alerts fire constantly without requiring action, people ignore them. Every alert should be actionable—when it fires, something must be done. If an alert fires but requires no action, raise the threshold or delete it.
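One common way to enforce this discipline in the Prometheus stack is Alertmanager routing: group related alerts and rate-limit re-notification so an unresolved issue pages once, not hourly. A minimal sketch (receiver names and intervals are illustrative):

```yaml
# Alertmanager routing sketch: reduce noise by grouping and rate-limiting.
route:
  receiver: dba-team
  group_by: ['alertname', 'instance']   # batch alerts that fire together
  group_wait: 30s                       # wait before first notification of a group
  group_interval: 5m                    # wait before notifying about new alerts in a group
  repeat_interval: 4h                   # don't re-notify an unresolved alert more often
  routes:
    - matchers: ['severity="critical"']
      receiver: on-call-pager
receivers:
  - name: dba-team       # real configs would add email/chat integrations here
  - name: on-call-pager
```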
Module Complete:
You have now completed the Performance Tuning module, covering query tuning, index tuning, configuration tuning, memory tuning, and monitoring tools. These skills form a comprehensive toolkit for ensuring database systems perform optimally at any scale. Apply these techniques iteratively: monitor → identify bottleneck → tune → verify improvement → repeat.
Congratulations! You've mastered database performance tuning from multiple angles. These skills are essential for every DBA and senior engineer working with data at scale. Practice on real systems, build monitoring dashboards, and never stop measuring.