Monitoring And Alerting - Learning Module

Monitoring Tools

The Monitoring Ecosystem

Understanding database metrics is only half the battle. Collecting, storing, visualizing, and acting upon those metrics requires specialized tooling—a monitoring stack that operates continuously, scales with your infrastructure, and remains reliable precisely when things break.

Database monitoring has evolved dramatically. Early DBAs relied on command-line queries run manually or via cron scripts. Today's production environments demand real-time metric collection, historical trend analysis, intelligent alerting, and integration with incident management systems. The tools you choose fundamentally shape your ability to maintain database health.

This page surveys the monitoring tool landscape—from native database utilities every DBA must master, through specialized database monitors, to comprehensive observability platforms that unify metrics across entire infrastructures.

What You Will Learn

By the end of this page, you will be able to describe the native monitoring capabilities of the major database systems, evaluate specialized database monitoring solutions, architect comprehensive monitoring stacks from open-source and commercial tools, and apply best practices for metric collection and retention.

Native Database Monitoring Capabilities

Every major database system includes built-in monitoring facilities. These native tools provide the deepest visibility into database internals, serving as the foundation for all other monitoring.

PostgreSQL Monitoring:

PostgreSQL exposes extensive monitoring data through its statistics views (pg_stat_* and pg_statio_*) and through extensions such as pg_stat_statements:

postgresql_monitoring.sql
-- Active session monitoring
SELECT pid, usename, datname, state, 
       query_start, now() - query_start AS duration,
       query
FROM pg_stat_activity
WHERE state = 'active' AND pid <> pg_backend_pid();  -- exclude this monitoring session
 
-- Table I/O statistics
SELECT schemaname, relname, 
       heap_blks_read, heap_blks_hit,
       ROUND(100.0 * heap_blks_hit / NULLIF(heap_blks_hit + heap_blks_read, 0), 2) AS hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC LIMIT 10;
 
-- Index usage analysis
SELECT schemaname, relname, indexrelname,
       idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;
 
-- Query statistics (requires pg_stat_statements extension;
-- before PostgreSQL 13 the timing columns were total_time and mean_time)
SELECT query, calls, total_exec_time, mean_exec_time,
       rows, shared_blks_hit, shared_blks_read
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;
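
The hit-ratio arithmetic in the table I/O query can also be applied client-side once the counters have been fetched. The following is a minimal Python sketch, assuming rows have already been read from pg_statio_user_tables; the function name and sample rows are hypothetical. Note that heap_blks_read counts blocks missing from PostgreSQL's shared buffers, which may still have been served from the operating system's page cache.

```python
from typing import Optional


def cache_hit_ratio(blks_hit: int, blks_read: int) -> Optional[float]:
    """Return the buffer hit percentage, or None when nothing was requested."""
    total = blks_hit + blks_read
    if total == 0:
        # Same role as NULLIF(..., 0) in the SQL: avoid dividing by zero.
        return None
    return round(100.0 * blks_hit / total, 2)


# Hypothetical rows shaped like (relname, heap_blks_read, heap_blks_hit).
sample_rows = [
    ("orders", 1200, 98800),
    ("audit_log", 5000, 5000),
    ("new_table", 0, 0),
]

for relname, blks_read, blks_hit in sample_rows:
    print(relname, cache_hit_ratio(blks_hit, blks_read))
```

Returning None for untouched tables keeps them from being reported as either 0% or 100%, which would both be misleading.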

Key PostgreSQL System Views
View	Purpose	Key Metrics
pg_stat_activity	Current session state	Active queries, wait events, connection state
pg_stat_database	Per-database statistics	Transaction counts, block reads, cache hits
pg_stat_user_tables	Table access statistics	Sequential scans, index scans, row operations
pg_stat_user_indexes	Index usage statistics	Index scans, tuples read/fetched
pg_stat_bgwriter	Background writer activity	Buffers written, checkpoints
pg_stat_replication	Replication status	Replica lag, write/flush/replay LSN
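
Replica lag from pg_stat_replication is often expressed in bytes by subtracting two LSNs. A textual pg_lsn is two hexadecimal numbers separated by a slash (the high and low 32 bits of a 64-bit position), so the subtraction can be sketched in Python as below. The function names are hypothetical; on the server, pg_wal_lsn_diff() performs the same calculation.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual pg_lsn such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica still has to process."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn)


# Hypothetical values: current primary position vs. a replica's replay_lsn.
print(lag_bytes("16/B374D848", "16/B374D000"))
```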

MySQL / MariaDB Monitoring:

MySQL provides monitoring through SHOW commands and the Performance Schema:

mysql_monitoring.sql
-- Global status variables
SHOW GLOBAL STATUS WHERE Variable_name IN (
    'Queries', 'Threads_connected', 'Threads_running',
    'Innodb_buffer_pool_read_requests', 'Innodb_buffer_pool_reads',
    'Innodb_row_lock_waits', 'Innodb_row_lock_time'
);
 
-- Process list with query text
-- (on MySQL 8.0.22 and later, performance_schema.processlist is the preferred source)
SELECT id, user, host, db, command, time, state, info
FROM information_schema.processlist
WHERE command != 'Sleep';
 
-- InnoDB buffer pool hit ratio
SELECT 
    (1 - (Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests)) * 100 
    AS buffer_pool_hit_ratio
FROM (
    SELECT 
        Variable_value AS Innodb_buffer_pool_reads 
    FROM performance_schema.global_status 
    WHERE Variable_name = 'Innodb_buffer_pool_reads'
) reads,
(
    SELECT 
        Variable_value AS Innodb_buffer_pool_read_requests 
    FROM performance_schema.global_status 
    WHERE Variable_name = 'Innodb_buffer_pool_read_requests'
) requests;
 
-- Slow query digest from Performance Schema
SELECT DIGEST_TEXT, COUNT_STAR, 
       SUM_TIMER_WAIT/1000000000000 AS total_time_sec,
       AVG_TIMER_WAIT/1000000000 AS avg_time_ms,
       SUM_ROWS_EXAMINED, SUM_ROWS_SENT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC LIMIT 10;
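
Most values returned by SHOW GLOBAL STATUS, such as Queries or Innodb_buffer_pool_reads, are counters that accumulate from server start, so a monitoring agent usually samples them twice and reports the difference per second. The sketch below shows that calculation in Python; the function name and sample values are hypothetical, and gauges such as Threads_connected should be reported as-is rather than passed through it.

```python
from typing import Dict


def counter_rates(prev: Dict[str, int], curr: Dict[str, int],
                  interval_s: float) -> Dict[str, float]:
    """Per-second rates for counters present in both samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates: Dict[str, float] = {}
    for name, value in curr.items():
        if name not in prev:
            continue
        delta = value - prev[name]
        if delta < 0:
            # The counter went backwards (restart or FLUSH STATUS): skip it.
            continue
        rates[name] = delta / interval_s
    return rates


# Two hypothetical samples taken 60 seconds apart.
before = {"Queries": 1000, "Innodb_buffer_pool_reads": 50}
after = {"Queries": 1600, "Innodb_buffer_pool_reads": 80}
print(counter_rates(before, after, 60.0))
```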

SQL Server Monitoring:

SQL Server includes Dynamic Management Views (DMVs) and Extended Events:

sqlserver_monitoring.sql
-- Currently executing queries
SELECT r.session_id, r.status, r.command,
       r.cpu_time, r.total_elapsed_time,
       r.logical_reads, r.writes,
       t.text AS query_text
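FROM sys.dm_exec_requests r
JOIN sys.dm_exec_sessions s ON s.session_id = r.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE s.is_user_process = 1     -- more reliable than session_id > 50
  AND r.session_id <> @@SPID;   -- exclude this monitoring query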
 
-- Wait statistics
SELECT wait_type, waiting_tasks_count,
       wait_time_ms, max_wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE waiting_tasks_count > 0
ORDER BY wait_time_ms DESC;
 
-- Buffer pool usage by database
SELECT DB_NAME(database_id) AS database_name,
       COUNT(*) * 8 / 1024 AS buffer_pool_mb
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY buffer_pool_mb DESC;
 
-- Top queries by CPU
SELECT TOP 10 
    qs.total_worker_time/1000000 AS total_cpu_sec,
    qs.execution_count,
    qs.total_worker_time/qs.execution_count/1000 AS avg_cpu_ms,
    SUBSTRING(st.text, (qs.statement_start_offset/2)+1, 
        ((CASE qs.statement_end_offset 
            WHEN -1 THEN DATALENGTH(st.text) 
            ELSE qs.statement_end_offset 
        END - qs.statement_start_offset)/2)+1) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) st
ORDER BY qs.total_worker_time DESC;  -- sort on the raw counter, not the rounded alias
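
Raw wait statistics are easier to read as shares of total wait time. In sys.dm_os_wait_stats, wait_time_ms includes signal_wait_time_ms, so subtracting the two separates time spent waiting for the resource from time spent waiting for a CPU afterwards. A minimal Python sketch of that post-processing follows; the function name and sample rows are hypothetical.

```python
from typing import Dict, List


def summarize_waits(rows: List[Dict], top_n: int = 3) -> List[Dict]:
    """Rank wait types by their share of total wait time."""
    total_ms = sum(row["wait_time_ms"] for row in rows)
    if total_ms == 0:
        return []
    ranked = sorted(rows, key=lambda row: row["wait_time_ms"], reverse=True)
    summary = []
    for row in ranked[:top_n]:
        signal_ms = row["signal_wait_time_ms"]
        summary.append({
            "wait_type": row["wait_type"],
            "pct_of_total": round(100.0 * row["wait_time_ms"] / total_ms, 1),
            # Resource wait = total wait minus time queued for a scheduler.
            "resource_ms": row["wait_time_ms"] - signal_ms,
            "signal_ms": signal_ms,
        })
    return summary


# Hypothetical rows shaped like the wait statistics query result.
sample_waits = [
    {"wait_type": "CXPACKET", "wait_time_ms": 3000, "signal_wait_time_ms": 300},
    {"wait_type": "PAGEIOLATCH_SH", "wait_time_ms": 6000, "signal_wait_time_ms": 100},
    {"wait_type": "WRITELOG", "wait_time_ms": 1000, "signal_wait_time_ms": 50},
]
for entry in summarize_waits(sample_waits):
    print(entry)
```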

Master Your Database's Native Tools

Third-party monitoring tools ultimately query these native interfaces. Understanding the underlying system views lets you debug monitoring issues, write custom queries for specific investigations, and validate what monitoring tools report.

Command-Line Monitoring Utilities

Beyond SQL queries, databases and operating systems provide command-line utilities for real-time monitoring. These tools are invaluable for interactive troubleshooting and quick health checks.

Database-Specific CLI Tools:

Database CLI Monitoring Tools
Database	Tool	Purpose
PostgreSQL	pg_top	Real-time query activity viewer (top-like interface)
PostgreSQL	pgbadger	Log analyzer generating HTML reports
PostgreSQL	pg_stat_monitor	Query statistics extension from Percona (queried via SQL rather than run as a CLI)
MySQL	mysqladmin	Server status and process management
MySQL	mytop	Real-time query monitoring (top-like)
MySQL	pt-query-digest	Slow query log analyzer from Percona Toolkit
SQL Server	sqlcmd	Command-line query execution
Oracle	SQL*Plus	Interactive SQL and PL/SQL execution
Oracle	adrci	Automatic Diagnostic Repository CLI

Operating System Monitoring:

Database performance depends on OS resources. These tools monitor the host system:

os_monitoring_commands.sh
# CPU, memory, and process overview
top -c         # Interactive process viewer
htop           # Enhanced interactive viewer (if installed)
 
# Virtual memory statistics
vmstat 1       # Per-second memory/CPU stats
 
# Disk I/O monitoring
iostat -xz 1   # Extended disk statistics per second
iotop          # Per-process I/O activity
 
# Network monitoring
netstat -an    # All connections
ss -s          # Socket statistics summary
iftop          # Per-connection bandwidth
 
# Memory usage breakdown
free -h        # Human-readable memory summary
cat /proc/meminfo  # Detailed memory information
 
# File system usage
df -h          # Disk space by mount point
du -sh /var/lib/postgresql  # Database directory size
 
# Database-specific file activity
lsof -p "$(pgrep -d, postgres)"  # Files open by PostgreSQL (lsof expects comma-separated PIDs)

Essential OS Metrics for Database Health

•Load Average: Average number of runnable tasks (on Linux, also tasks in uninterruptible I/O wait). Sustained values above the number of CPU cores indicate saturation.
•Memory Usage: Watch for swap activity; a database host should not be swapping.
•Disk I/O Wait: High iowait indicates a storage bottleneck.
•Network Bandwidth: Ensure sufficient capacity for client traffic and replication.
•File Descriptor Usage: Databases need many open files; check the configured limits.
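
The load-average rule of thumb is easiest to apply when the value is normalized by core count. The sketch below assumes the inputs come from something like Python's os.getloadavg() and os.cpu_count(); the function names and the 1.0 threshold are illustrative choices, not fixed standards.

```python
def load_per_core(load_avg: float, cores: int) -> float:
    """Normalize a load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def is_saturated(load_avg: float, cores: int, threshold: float = 1.0) -> bool:
    """True when demand exceeds the chosen per-core threshold."""
    return load_per_core(load_avg, cores) > threshold


# Hypothetical readings for an 8-core host.
print(load_per_core(6.0, 8), is_saturated(6.0, 8))
print(load_per_core(12.0, 8), is_saturated(12.0, 8))
```

A single reading above the threshold is rarely meaningful; comparing the 1-, 5-, and 15-minute averages shows whether the pressure is sustained.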

Swap Is the Enemy

If your database server is actively swapping, performance is already severely degraded. Databases rely on predictable memory access latency, and swapping destroys it. Monitor swap usage, and investigate immediately if swap-in or swap-out activity (the si and so columns in vmstat) is nonzero.
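
A swap check can be scripted from the SwapTotal and SwapFree fields of /proc/meminfo on Linux. The following Python sketch parses text in that format; the function names and the sample text are hypothetical, and on a real host the input would be read from /proc/meminfo.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse 'Key:   value kB' lines into a dict of integers."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo: Dict[str, int]) -> int:
    """Swap in use, in kB; 0 when the host has no swap configured."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


# Hypothetical excerpt in /proc/meminfo format.
SAMPLE_MEMINFO = """MemTotal:       16384000 kB
MemFree:         1024000 kB
SwapTotal:       2097148 kB
SwapFree:        2000000 kB
"""
print(swap_used_kb(parse_meminfo(SAMPLE_MEMINFO)))
```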

Dedicated Database Monitoring Solutions

While native tools provide raw data, dedicated database monitoring solutions add critical capabilities: historical storage, trend analysis, intelligent alerting, and correlation across instances.

Open-Source Database Monitors:

Open-Source Database Monitoring Tools
Tool	Target Databases	Key Features
Prometheus + PostgreSQL Exporter	PostgreSQL	Time-series metrics, PromQL queries, Grafana integration
Prometheus + MySQL Exporter	MySQL/MariaDB	Comprehensive MySQL metrics, Performance Schema integration
PMM (Percona Monitoring and Management)	MySQL, PostgreSQL, MongoDB	Full-stack monitoring, query analytics, free and open-source
pgwatch2	PostgreSQL	Metrics collection, pre-built dashboards, alerting
pg_monitor	PostgreSQL	Built-in role (PostgreSQL 10+) that grants read access to the monitoring views and settings that exporters need
VividCortex (SolarWinds DPM)	Multi-database	Query-level insights, ML-driven analysis (commercial SaaS, listed here for comparison)

Percona Monitoring and Management (PMM):

PMM deserves special attention as a comprehensive, free solution:

PMM Capabilities

•Query Analytics (QAN) — Normalized query fingerprints, execution statistics, wait analysis
•Node and Service Metrics — CPU, memory, disk, network for both OS and database
•Pre-built Dashboards — InnoDB details, PostgreSQL specifics, replication status
•Multi-Database Support — MySQL, PostgreSQL, MongoDB in one platform
•Alerting Integration — Built-in alerting with email, PagerDuty, Slack

Commercial Database Monitoring:

Enterprise environments often require commercial solutions with support contracts:

Commercial Database Monitoring Platforms
Solution	Strengths	Best For
Datadog Database Monitoring	Deep integrations, APM correlation, cloud-native	Multi-cloud, microservices environments
New Relic Database	Entity mapping, distributed tracing, AI analysis	Large-scale production with APM needs
SolarWinds DPA	Wait-time analysis, query tuning advisors	SQL Server and Oracle environments
Redgate SQL Monitor	SQL Server focus, detailed wait analysis	Microsoft SQL Server shops
Oracle Enterprise Manager	Deep Oracle integration, lifecycle management	Oracle database environments
SentryOne (now SolarWinds)	Query Plan analysis, workload comparison	SQL Server performance optimization

Start with PMM

If you're building a monitoring stack from scratch, Percona Monitoring and Management provides exceptional value. It's free, well-documented, actively maintained, and covers the vast majority of monitoring needs for MySQL and PostgreSQL.

The Prometheus Monitoring Stack

Prometheus has emerged as the de facto standard for time-series metric collection in cloud-native environments. Understanding its architecture is essential for modern database monitoring.

Prometheus Architecture:

Prometheus Components

•Prometheus Server — Scrapes and stores metrics, evaluates rules, triggers alerts
•Exporters — Agents that expose metrics in Prometheus format (postgres_exporter, mysqld_exporter)
•Alertmanager — Routes, groups, and manages alert notifications
•Grafana — Visualization layer with dashboards and ad-hoc queries
•Service Discovery — Automatic target discovery (Kubernetes, Consul, DNS)

Database Exporters Configuration:

PostgreSQL exporter setup example:

postgres_exporter.yml
# Docker Compose setup for PostgreSQL monitoring
version: '3.8'
services:
  postgres_exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://monitor:password@postgres:5432/postgres?sslmode=disable"
    ports:
      - "9187:9187"
    
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
 
# The following belongs in a separate file, prometheus.yml,
# which is mounted into the prometheus service above.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
 
scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets: ['postgres_exporter:9187']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'prod-db-1'

PromQL for Database Metrics:

PromQL (Prometheus Query Language) enables powerful metric analysis:

promql_examples.txt
# Current active connections
pg_stat_activity_count{state="active"}
 
# Committed transactions per second, averaged over 5 minutes
rate(pg_stat_database_xact_commit[5m])
 
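# Buffer cache hit ratio (%) over the last 5 minutes
# (rate() is needed because blks_hit and blks_read are cumulative counters)
sum(rate(pg_stat_database_blks_hit[5m])) / 
(sum(rate(pg_stat_database_blks_hit[5m])) + sum(rate(pg_stat_database_blks_read[5m]))) * 100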
 
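# 95th percentile query latency (histogram)
# Assumes a custom histogram named pg_query_duration_seconds, for example from
# application instrumentation; postgres_exporter does not provide one by default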
histogram_quantile(0.95, sum(rate(pg_query_duration_seconds_bucket[5m])) by (le))
 
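# Connections approaching limit (>80%)
# Sum across databases and states first; on (instance) matches the differing label sets
sum by (instance) (pg_stat_activity_count)
  / on (instance) pg_settings_max_connections * 100 > 80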
 
# Disk space growth rate (MB/hour); deriv() returns bytes per second, so scale by 3600
deriv(pg_database_size_bytes[1h]) * 3600 / 1024 / 1024
 
# Replication lag in seconds
pg_replication_lag_seconds > 30
 
# Transaction rollback ratio
rate(pg_stat_database_xact_rollback[5m]) / 
(rate(pg_stat_database_xact_commit[5m]) + rate(pg_stat_database_xact_rollback[5m])) * 100
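
The unit handling in the disk growth query is worth checking by hand. PromQL's deriv() fits a simple linear regression to the samples in the range and returns the slope per second, so an hourly figure needs a factor of 3600. The Python sketch below reproduces that arithmetic on hypothetical samples; it illustrates the calculation and is not Prometheus's own code.

```python
from typing import List, Tuple


def slope_per_second(samples: List[Tuple[float, float]]) -> float:
    """Least-squares slope of (timestamp_seconds, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("timestamps must differ")
    return covariance / variance


def mb_per_hour(samples: List[Tuple[float, float]]) -> float:
    """Convert a bytes-per-second slope to MB per hour."""
    return slope_per_second(samples) * 3600 / 1024 / 1024


# Hypothetical database size growing by 1 MiB every 60 seconds.
size_samples = [(0.0, 0.0), (60.0, 1048576.0), (120.0, 2097152.0)]
print(mb_per_hour(size_samples))
```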

Cardinality Considerations

Prometheus stores each unique label combination as a separate time series. High-cardinality labels (like individual query texts or session IDs) can explode storage requirements. Use normalized query fingerprints, not raw SQL, when labeling query metrics.
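
A query fingerprint replaces literal values with placeholders so that queries differing only in their parameters collapse into one label value. The sketch below is a deliberately simplified, regex-based normalizer in Python, assuming plain SQL with single-quoted strings; production tools such as pg_stat_statements or pt-query-digest use more robust normalization, and the function name here is hypothetical.

```python
import re

_STRING = re.compile(r"'(?:[^']|'')*'")           # 'text', including '' escapes
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")         # standalone numeric literals
_IN_LIST = re.compile(r"\(\s*\?(?:\s*,\s*\?)+\s*\)")  # (?, ?, ?) lists
_WHITESPACE = re.compile(r"\s+")


def fingerprint(sql: str) -> str:
    """Return a normalized form of a SQL statement with literals removed."""
    text = _STRING.sub("?", sql)
    text = _NUMBER.sub("?", text)
    text = _IN_LIST.sub("(?+)", text)   # collapse variable-length IN lists
    return _WHITESPACE.sub(" ", text).strip().lower()


print(fingerprint("SELECT * FROM users WHERE id = 42"))
print(fingerprint("SELECT * FROM users  WHERE id = 7"))
print(fingerprint("SELECT name FROM t1 WHERE a IN (1, 2, 3)"))
```

With this approach the number of time series grows with the number of distinct query shapes rather than with the number of distinct parameter values.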

Log-Based Monitoring

While metrics provide numerical measurements, logs capture events, errors, and detailed query information. A complete monitoring strategy integrates both.

Database Log Types:

Critical Database Logs
Log Type	Content	Monitoring Value
Error Log	Server errors, startup/shutdown, crashes	Immediate incident detection
Slow Query Log	Queries exceeding time threshold	Performance optimization candidates
General Query Log	All executed queries (high overhead)	Debugging, auditing (not production)
Audit Log	Security-relevant events, access patterns	Compliance, security monitoring
Transaction Log	All database modifications	Recovery, replication (not for monitoring)

Configuring Slow Query Logging:

# postgresql.conf
log_destination = 'stderr'
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%Y-%m-%d.log'
 
# Log slow queries
log_min_duration_statement = 1000  # Log queries > 1 second
 
# Optional: Log all statements (high overhead)
# log_statement = 'all'
 
# Include useful context
log_line_prefix = '%t [%p]: db=%d,user=%u,app=%a,client=%h '
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0  # Log all temp file usage
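
With this log_line_prefix, each slow-query entry carries a timestamp, process ID, and connection context ahead of the duration and statement text. The Python sketch below parses a line of that shape into structured fields, the step a log shipper's parser would perform. The sample line is illustrative and was written to match the prefix above; the regex covers only simple "statement:" entries, so treat it as a starting point.

```python
import re
from typing import Dict, Optional

SLOW_QUERY_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \S+) "
    r"\[(?P<pid>\d+)\]: "
    r"db=(?P<db>[^,]*),user=(?P<user>[^,]*),app=(?P<app>.*?),client=(?P<client>\S*) "
    r"LOG:\s+duration: (?P<duration_ms>\d+(?:\.\d+)?) ms\s+statement: (?P<query>.*)$",
    re.DOTALL,
)


def parse_slow_query(line: str) -> Optional[Dict[str, object]]:
    """Return the parsed fields, or None if the line is not a slow-query entry."""
    match = SLOW_QUERY_RE.match(line)
    if match is None:
        return None
    record: Dict[str, object] = match.groupdict()
    record["pid"] = int(match.group("pid"))
    record["duration_ms"] = float(match.group("duration_ms"))
    return record


# Illustrative line in the format produced by the settings above.
SAMPLE = (
    "2024-01-15 10:23:45 UTC [12345]: db=app,user=web,app=psql,client=10.0.0.5 "
    "LOG:  duration: 1523.456 ms  statement: SELECT pg_sleep(1.5)"
)
print(parse_slow_query(SAMPLE))
```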

Log Aggregation Architecture:

Centralized log management is essential for multi-server environments:

Log Aggregation Stack Components

•Collector (Filebeat, Fluentd, Vector) — Ships logs from database servers to central storage
•Parser — Extracts structured fields from log lines (timestamp, user, query, duration)
•Storage (Elasticsearch, Loki) — Indexed storage for fast search
•Visualization (Kibana, Grafana) — Search interface and dashboards
•Alerting — Pattern-based alerts on log content (error spikes, specific messages)

Grafana Loki for Logs

If you're already using Prometheus + Grafana, consider Grafana Loki for logs. It uses the same label-based approach as Prometheus, stores logs efficiently without full indexing, and integrates seamlessly into existing Grafana dashboards alongside your metrics.

Cloud-Managed Monitoring

Cloud database services include built-in monitoring that integrates with the cloud provider's observability ecosystem. Understanding these tools is essential for cloud deployments.

AWS RDS Monitoring:

AWS RDS Monitoring Features

•CloudWatch Metrics — CPU, connections, IOPS, read/write latency, freeable memory
•Enhanced Monitoring — OS-level metrics at 1-second granularity (additional cost)
•Performance Insights — Wait analysis, top SQL, database load visualization
•Event Subscriptions — Notifications for failovers, maintenance, configuration changes
•Slow Query Logs → CloudWatch Logs — Centralized log access

Azure SQL Database Monitoring:

Azure SQL Monitoring Capabilities

•Azure Monitor: Unified metrics, logs, and alerts across Azure services
•Query Performance Insight: Top resource-consuming queries, recommendations
•Automatic Tuning: Automatically creates and drops indexes and forces the last known good query plan
•SQL Analytics: Log Analytics solution for Azure SQL telemetry
•Intelligent Insights: ML-powered performance anomaly detection

Google Cloud SQL Monitoring:

Google Cloud Monitoring for SQL

•Cloud Monitoring — Metrics dashboards, alerting policies
•Query Insights — Top queries by execution time, lock time, rows examined
•Cloud Logging — Slow query logs, connection logs, error logs
•Operations Suite — Unified logging, monitoring, tracing

Cloud Monitoring Feature Comparison
Feature	AWS RDS	Azure SQL	Cloud SQL
Wait Analysis	Performance Insights	Query Store	Query Insights
Index Recommendations	Limited	Automatic Tuning	Limited
OS-Level Metrics	Enhanced Monitoring	Built-in	Built-in
Query Normalization	✅	✅	✅
Default Query History Retention	7 days (Performance Insights free tier)	30 days (Query Store default)	7 days (Query Insights)

Don't Rely Solely on Cloud Monitoring

Cloud-provided monitoring is convenient but often limited in retention, customization, and cross-cloud visibility. Most production environments supplement cloud monitoring with their own observability stack (Prometheus, Datadog, etc.) for consistent monitoring across all environments.

Architecting Your Monitoring Stack

A production monitoring stack requires thoughtful architecture. Consider collection frequency, storage requirements, visualization needs, and operational overhead.

Reference Architecture:

[Diagram: reference monitoring stack architecture]

Key Architectural Decisions:

Monitoring Stack Design Considerations
Decision	Considerations	Recommendation
Scrape Interval	Granularity vs storage cost; 15s is standard	15s for metrics; 1s for troubleshooting only
Retention Period	Disk space vs historical analysis needs	30-90 days raw; 2 years downsampled
High Availability	Monitoring must survive when databases fail	Separate infrastructure; multiple Prometheus replicas
Remote Storage	Long-term storage and global queries	Thanos, Cortex, or Mimir for Prometheus
Security	Credential management, network isolation	Exporters inside network; encrypted transport
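
The trade-off between scrape interval and retention can be estimated before deployment. The Prometheus storage documentation gives a rough sizing formula: retention time in seconds, multiplied by ingested samples per second, multiplied by bytes per sample, with roughly 1 to 2 bytes per sample after compression. The Python sketch below applies it; the function name and the example series count are assumptions for illustration.

```python
def estimate_tsdb_bytes(active_series: int, scrape_interval_s: float,
                        retention_days: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough local storage estimate for a Prometheus server."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 86400
    return samples_per_second * retention_seconds * bytes_per_sample


# Hypothetical fleet: 10,000 active series scraped every 15s, kept for 30 days.
estimate = estimate_tsdb_bytes(10_000, 15, 30)
print(f"{estimate / 1e9:.2f} GB")
```

Halving the scrape interval doubles the estimate, which is why 1-second scraping is best reserved for short troubleshooting windows.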

Monitor the Monitoring

Your monitoring system is itself critical infrastructure. If Prometheus goes down, you lose visibility during an incident. Implement health checks on monitoring components, keep them on separate infrastructure from monitored databases, and have fallback procedures for when monitoring fails.
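
One simple health check for monitoring components is a staleness test: each component reports a heartbeat, and an independent watcher flags any that have gone quiet. The Python sketch below shows only the comparison logic; the component names, the 60-second threshold, and the way heartbeats are collected are all assumptions. The watcher itself should run outside the infrastructure it observes.

```python
from typing import Dict, List


def stale_targets(last_seen: Dict[str, float], now: float,
                  max_age_s: float = 60.0) -> List[str]:
    """Names of components whose last heartbeat is older than max_age_s."""
    return sorted(
        name for name, timestamp in last_seen.items()
        if now - timestamp > max_age_s
    )


# Hypothetical heartbeat timestamps (seconds since the epoch).
heartbeats = {"prometheus": 1000.0, "alertmanager": 940.0, "grafana": 900.0}
print(stale_targets(heartbeats, now=1010.0))
```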

Summary: Selecting Your Tools

We've surveyed the monitoring tool landscape from native utilities to enterprise platforms. The right choice depends on your scale, budget, and existing infrastructure.

Key Takeaways

•Master native tools first: pg_stat_* views, Performance Schema, and DMVs provide the foundation for all monitoring.
•CLI tools for interactive troubleshooting: top, iostat, htop, and database-specific tools support real-time investigation.
•Prometheus stack for modern infrastructure: Exporters, PromQL, and Grafana create a flexible, powerful monitoring foundation.
•Dedicated database monitors for depth: Tools such as PMM and Datadog provide query-level insights beyond basic metrics.
•Logs complement metrics: Slow query logs and error logs provide event context that metrics cannot.
•Cloud monitoring as baseline, not ceiling: Use provider tools but supplement them with portable monitoring.
•Architect for reliability: Your monitoring must survive database failures.

What's Next:

With metrics collected and dashboards visualized, the next page addresses Alert Configuration—how to translate monitoring data into actionable notifications that wake you up for real problems while letting you sleep through noise.