Imagine a seasoned cardiologist examining a patient. They don't just ask 'How do you feel?'—they measure heart rate, blood pressure, oxygen saturation, and dozens of other vital signs. These metrics tell a story that subjective experience cannot. A patient might feel fine while their blood pressure silently climbs to dangerous levels.
Database monitoring operates on the same principle. Your application might appear healthy—queries returning results, users logging in—while underneath, connection pools silently exhaust, disk I/O creeps toward saturation, and query latencies slowly degrade. By the time these problems manifest as user-visible failures, the damage is often severe.
This page establishes the vital signs of database health: the key metrics that every database administrator must understand, collect, and interpret. These aren't arbitrary numbers—they're windows into the internal dynamics of your database engine, revealing bottlenecks, predicting failures, and guiding optimization efforts.
By the end of this page, you will understand the taxonomy of database metrics, master the interpretation of resource utilization indicators, comprehend throughput and latency measurements, analyze query-level statistics, and develop the diagnostic intuition to translate metric patterns into actionable insights.
Database metrics can be overwhelming in their volume and variety. Modern database systems expose hundreds, sometimes thousands, of measurable values. To navigate this complexity, we must first establish a conceptual framework that organizes metrics into meaningful categories.
The Four Pillars of Database Observability:
Database metrics fundamentally divide into four interconnected domains, each revealing different aspects of system health:
| Pillar | Focus | Key Questions Answered | Example Metrics |
|---|---|---|---|
| Resource Utilization | Hardware and system resources consumed by the database | Is the database running out of CPU, memory, disk, or network capacity? | CPU usage, memory allocation, disk I/O, network throughput |
| Throughput | Work completed by the database per unit time | How much work is the database accomplishing? Is capacity sufficient? | Queries per second, transactions per second, rows read/written |
| Latency | Time required to complete operations | How responsive is the database? Are users experiencing delays? | Query response time, lock wait time, I/O latency |
| Errors and Saturation | Failures and capacity limits | What's breaking? Where are the bottlenecks? | Error rates, connection pool exhaustion, lock timeouts |
The USE Method:
Brendan Gregg's USE Method provides a systematic approach to resource analysis. For every resource, it asks three questions:
- Utilization: what fraction of time was the resource busy doing work?
- Saturation: how much work is queued, waiting for the resource?
- Errors: how many error events occurred?
For every database resource (CPU, memory, disk, network, locks, connections), we ask all three questions. This structured approach ensures we don't miss critical bottlenecks.
The RED Method (for Services):
For query-level and transaction-level analysis, the RED method tracks three signals per service or query class:
- Rate: requests served per second
- Errors: requests failing per second
- Duration: the distribution of time each request takes
Together, USE and RED provide comprehensive coverage of both infrastructure and application-layer database behavior.
Effective monitoring requires thinking at multiple layers simultaneously. A slow query (high latency) might stem from poor indexing (logical layer), memory pressure causing disk reads (caching layer), slow disks (storage layer), or CPU contention (compute layer). Correlating metrics across layers reveals the true root cause.
Resource utilization metrics measure how database operations consume underlying computational resources. While seemingly straightforward, interpreting these metrics correctly requires understanding database-specific nuances.
CPU Utilization:
CPU usage in databases deserves careful interpretation. Sustained high CPU may simply mean the database is doing useful work, or it may signal inefficient query plans burning cycles on avoidable computation. Distinguish user time (query execution) from system time and I/O wait, and correlate CPU spikes with query activity before concluding you need more cores.
Memory Utilization:
Database memory architecture is complex. Key memory areas to monitor:
| Memory Area | Purpose | When to Worry | Remediation |
|---|---|---|---|
| Buffer Pool / Cache | Caches data pages in memory to avoid disk reads | Hit ratio < 95%, frequent evictions | Increase cache size, optimize queries to access less data |
| Sort/Work Memory | Temporary memory for sorting, hashing, joins | Disk-based sorts occurring frequently | Increase work memory or optimize queries |
| Connection Memory | Memory allocated per database connection | Growing with connection count | Use connection pooling, right-size connection limits |
| Log Buffer | Buffers transaction log writes | Log buffer waits occurring | Increase log buffer size or improve disk performance |
Buffer Pool Hit Ratio:
This fundamental metric measures how often requested data is found in memory versus requiring disk access:
Buffer Pool Hit Ratio = (Buffer Pool Reads from Memory) / (Total Buffer Pool Requests) × 100%
Target: > 95% for OLTP workloads, > 90% for mixed workloads, > 80% for analytical workloads.
Caution: A high hit ratio doesn't guarantee good performance. If the working set fits in memory, you'll have 99.9% hit ratio even with terrible query plans. Always correlate with latency metrics.
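The hit-ratio formula above can be expressed as a small helper. A minimal Python sketch; the counter sources vary by database, and the function name and example values here are illustrative:

```python
def buffer_pool_hit_ratio(memory_reads: int, total_requests: int) -> float:
    """Percentage of page requests served from the buffer pool (memory)."""
    if total_requests == 0:
        return 100.0  # no requests yet: treat as fully cached
    return memory_reads / total_requests * 100.0

# Example: 9,850 of 10,000 page requests served from memory -> 98.5%
ratio = buffer_pool_hit_ratio(9_850, 10_000)
```

In practice the two counters come from cumulative statistics views, so you would compute the ratio over a sampling interval (the delta between two readings), not over all-time totals.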
Disk I/O Metrics:
Disk performance remains critical despite SSD adoption. The core indicators are IOPS (read and write operations per second), throughput (MB/s transferred), I/O latency, and queue depth. Watch latency and queue depth together: rising queue depth accompanied by rising latency signals a saturated device.
Average I/O latency can hide problems. A 2ms average might include 99% of operations at 0.5ms and 1% at 200ms. Those slow outliers cause query timeouts. Always monitor p95 and p99 latencies, not just averages.
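To see how an average hides the tail, here is a minimal sketch using an illustrative latency sample (2% slow I/Os) and a nearest-rank percentile. The numbers are made up for demonstration:

```python
import math

# 98% of I/Os at 0.5 ms, 2% at 200 ms: the average looks fine, the tail does not
latencies_ms = [0.5] * 98 + [200.0] * 2

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

average = sum(latencies_ms) / len(latencies_ms)  # 4.49 ms: looks harmless
p95 = percentile(latencies_ms, 95)               # 0.5 ms
p99 = percentile(latencies_ms, 99)               # 200.0 ms: the real story
```

The average suggests everything is fine, while p99 exposes the 200 ms outliers that cause timeouts.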
Throughput metrics quantify the actual work accomplished by the database. While resource utilization tells you how hard the system is working, throughput tells you how much useful work results from that effort.
Query Throughput:
The most fundamental throughput measure:
| Metric | Description | Significance |
|---|---|---|
| Queries Per Second (QPS) | Total queries executed per second across all types | Overall database activity level; capacity planning baseline |
| Transactions Per Second (TPS) | Complete transactions committed per second | True business throughput; includes both reads and writes |
| SELECT Queries/sec | Read-only queries per second | Read load indicator; cache effectiveness matters most here |
| INSERT/UPDATE/DELETE/sec | Write operations per second | Write load; affects log throughput, replication lag, cache invalidation |
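Databases typically expose these figures as cumulative counters rather than rates, so throughput is derived from the delta between two samples. A sketch, with a hypothetical counter and illustrative values:

```python
def rate_per_second(prev_count: int, curr_count: int, interval_s: float) -> float:
    """Convert two samples of a cumulative counter into a per-second rate."""
    if interval_s <= 0:
        raise ValueError("sampling interval must be positive")
    return (curr_count - prev_count) / interval_s

# Two samples of a (hypothetical) cumulative query counter, taken 10 s apart
qps = rate_per_second(1_204_000, 1_216_500, 10.0)  # 1250.0 queries/sec
```

Counter resets (after a restart) produce a negative delta; production collectors detect and discard those samples.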
Data Movement Throughput:
Beyond query counts, measuring data volume provides deeper insight: rows read and written per second, and bytes moved to and from storage and clients. A stable query count can hide a data-volume explosion if individual queries begin touching far more rows.
Transaction Log Throughput:
Write-ahead logging (WAL) is critical for durability: every committed transaction must first be recorded in the log. Monitor log bytes written per second and log flush latency.
Log throughput directly limits write capacity. If log writes saturate disk I/O, no amount of CPU or memory helps.
Replication Throughput (for replicated systems):
Critical for high-availability deployments: monitor replication lag (how far each replica trails the primary, in bytes or seconds) and each replica's apply rate. If the primary's write throughput exceeds a replica's apply rate, lag grows without bound.
Higher throughput often increases latency due to queuing. A database handling 100 QPS with 10ms latency might hit 500 QPS with 50ms latency due to internal contention. This relationship is non-linear—latency often spikes dramatically near capacity limits.
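The non-linear blow-up near capacity can be illustrated with the textbook M/M/1 queueing model, where mean response time is W = 1/(mu - lambda). This is an idealization, not a database-specific result, but it captures the shape of the curve:

```python
def mm1_response_time_ms(arrival_qps: float, service_qps: float) -> float:
    """Mean response time (ms) in an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_qps >= service_qps:
        raise ValueError("system is saturated: latency grows without bound")
    return 1000.0 / (service_qps - arrival_qps)

# A server that can process 1000 QPS in isolation (illustrative capacity):
low = mm1_response_time_ms(100, 1000)   # ~1.1 ms at 10% load
high = mm1_response_time_ms(900, 1000)  # 10 ms at 90% load
```

Going from 10% to 90% of capacity multiplies mean latency by nine in this model; the last few percent of headroom cost far more than the first.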
Latency metrics measure time—the dimension most directly felt by users. A query taking 2 seconds versus 20 milliseconds is the difference between a usable application and an abandoned one.
Query Execution Time:
The most user-visible latency metric:
| Phase | Description | Typical Cause of Delays |
|---|---|---|
| Parse Time | SQL text parsing and validation | Complex queries, excessive dynamic SQL |
| Plan Time | Query optimization and plan selection | Many table joins, missing statistics |
| Execution Time | Actual query execution | Full scans, poor indexes, lock waits, disk I/O |
| Network Time | Result transmission to client | Large result sets, network latency |
Measuring Latency Correctly:
Latency measurement requires statistical sophistication: averages mislead, so track percentiles (p50, p95, p99) and, where possible, full distributions.
A user who visits 10 pages experiences p90 latency on at least one page. Heavy users experience p99 regularly. Optimizing for average while ignoring p99 leaves your most engaged users with the worst experience.
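Assuming independent page loads (a simplification), the chance that a session sees at least one above-percentile page is 1 - p^n. A small sketch of that arithmetic:

```python
def prob_at_least_one_slow(pages: int, percentile: float) -> float:
    """Probability that at least one of `pages` independent page loads
    exceeds the given latency percentile (e.g. 0.90 for p90)."""
    return 1.0 - percentile ** pages

# A 10-page visit: ~65% chance of experiencing at least one above-p90 page
p = prob_at_least_one_slow(10, 0.90)
```

The same formula shows why tails compound: a page composed of many backend queries hits a slow query far more often than any single query's percentile suggests.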
Wait Event Analysis:
Modern databases track time spent waiting for various resources:
| Wait Category | Examples | Indicates |
|---|---|---|
| CPU | Compute waits | CPU contention, complex queries |
| I/O | Buffer reads, log writes | Disk bottlenecks, insufficient caching |
| Lock | Row locks, table locks | Contention between transactions |
| Latch | Buffer pool latches | Internal data structure contention |
| Network | Client communication waits | Network latency, large result sets |
Wait event analysis reveals where queries spend time waiting rather than working. A query consuming 10ms CPU but waiting 500ms for locks isn't CPU-bound—it's contention-bound.
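A sketch of turning raw wait-time counters into a percentage profile, which makes the dominant wait category obvious at a glance. The categories and millisecond values are hypothetical:

```python
def wait_profile(waits_ms: dict) -> dict:
    """Express each wait category as a share of total elapsed time (percent)."""
    total = sum(waits_ms.values())
    return {category: round(ms / total * 100, 1) for category, ms in waits_ms.items()}

# Hypothetical breakdown for one slow query: mostly lock waits, barely any CPU
profile = wait_profile({"cpu": 10, "io": 40, "lock": 500, "network": 25})
```

For this sample, locks account for 87% of elapsed time, so adding CPU or faster disks would change almost nothing.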
Lock and Contention Metrics:
Critical for concurrent workloads: track lock waits per second, average and maximum lock wait time, and deadlock frequency. Rising lock wait time under stable throughput usually means transactions are holding locks longer, not that load has increased.
While system-level metrics reveal overall health, query-level metrics pinpoint exactly which queries need attention. Modern databases maintain detailed query statistics.
Query Performance Statistics:
| Metric | Description | Optimization Guidance |
|---|---|---|
| Total Execution Time | Cumulative time across all executions | Optimize queries with highest total time for maximum impact |
| Execution Count | Number of times query executed | High-frequency queries amplify any inefficiency |
| Rows Examined | Total rows read to produce results | High ratio to rows returned indicates missing indexes |
| Rows Returned | Result set size | Very large results may need pagination or filtering |
| Temp Tables Created | On-disk temporary tables used | Memory constraints or complex sorts/joins |
| Full Scans | Times query required table scans | Missing or unusable indexes |
Query Normalization and Fingerprinting:
Raw query texts aren't directly comparable—parameter values differ. Query normalization replaces literals with placeholders:
-- Original queries
SELECT * FROM orders WHERE customer_id = 12345;
SELECT * FROM orders WHERE customer_id = 67890;
-- Normalized fingerprint
SELECT * FROM orders WHERE customer_id = ?;
This fingerprinting allows aggregating statistics across logically identical queries.
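Production tools (pg_stat_statements, pt-query-digest, and similar) normalize from the parse tree or a real tokenizer; a toy regex-based sketch nonetheless conveys the idea:

```python
import re

def fingerprint(sql: str) -> str:
    """Very simplified query normalization: replace string and numeric
    literals with '?' placeholders, then collapse whitespace."""
    sql = re.sub(r"'(?:[^']|'')*'", "?", sql)    # single-quoted string literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)   # integer and decimal literals
    return re.sub(r"\s+", " ", sql).strip()

a = fingerprint("SELECT * FROM orders WHERE customer_id = 12345;")
b = fingerprint("SELECT * FROM orders WHERE customer_id = 67890;")
# both normalize to: SELECT * FROM orders WHERE customer_id = ?;
```

Regex normalization breaks on edge cases (comments, IN-lists of varying length, quoted identifiers), which is exactly why real implementations work on tokens rather than text.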
Identifying Problem Queries:
Sort query statistics by different criteria to find different problem types:
In most workloads, the top 10 queries by total time account for 80%+ of database resource consumption. Focusing optimization efforts on these queries yields disproportionate returns. Don't optimize randomly—prioritize by impact.
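A sketch of that prioritization over hypothetical per-fingerprint statistics, ranking by cumulative time while also deriving per-call latency, since the two views expose different problems:

```python
# Hypothetical per-fingerprint statistics, as a monitoring system might store them
stats = [
    {"query": "SELECT ... FROM orders ...",   "calls": 50_000,    "total_ms": 900_000},
    {"query": "UPDATE inventory ...",         "calls": 200,       "total_ms": 400_000},
    {"query": "SELECT ... FROM sessions ...", "calls": 1_000_000, "total_ms": 150_000},
]

def top_by_total_time(stats, n=10):
    """Rank query fingerprints by cumulative execution time and attach
    mean latency per call for each."""
    ranked = sorted(stats, key=lambda s: s["total_ms"], reverse=True)
    for s in ranked:
        s["mean_ms"] = s["total_ms"] / s["calls"]
    return ranked[:n]

worst = top_by_total_time(stats)
```

Note how the rankings diverge: the orders query wins on total time (optimize it first for aggregate impact), while the inventory update is the slowest per call (2 seconds each) and may be what individual users actually complain about.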
Database connections are finite resources. Each connection consumes memory, file descriptors, and management overhead. Monitoring connection behavior prevents both resource exhaustion and application availability issues.
Connection Pool Metrics:
| Metric | Healthy State | Warning Signs |
|---|---|---|
| Active Connections | Well below maximum limit | Approaching or at max_connections |
| Idle Connections | Stable, reasonable count | Growing unbounded, indicating leaks |
| Connection Rate | Steady, expected level | Spikes indicating connection storm |
| Wait Queue Length | Zero or near-zero | Connections waiting for pool slots |
| Average Wait Time | < 1ms | Growing wait times signal exhaustion |
Session State Distribution:
Understanding what connections are doing provides operational insight. Typical states include active (executing a query), idle (connected but waiting on the client), idle in transaction (inside an open transaction but not executing), and waiting (blocked on a lock). A healthy distribution is mostly active or idle; watch the rest closely.
Red Flag: Idle in Transaction
Connections stuck 'idle in transaction' are particularly problematic: they hold locks and snapshots open, block maintenance and cleanup (such as vacuum in PostgreSQL), and tie up pool slots indefinitely. They usually indicate application code that opened a transaction and never committed or rolled back.
Connection Leak Detection:
Connection leaks—connections acquired but never released—eventually exhaust the pool:
# Signs of connection leak:
- Active + Idle connections grow over time
- Connection count never decreases
- Eventually: connection refused errors
- Pattern correlates with specific application actions
Monitor connection count over time. Healthy applications have stable, bounded connection usage. Growing counts indicate leaks requiring application code fixes.
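One naive heuristic for the "count never decreases" pattern, sketched in Python. Real systems would use longer windows, smoothing, and trend analysis; this only shows the shape of the check:

```python
def looks_like_leak(samples: list) -> bool:
    """Flag a connection-count time series that only ever grows.
    `samples` is a sequence of connection-count readings over time."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    return all(d >= 0 for d in deltas) and samples[-1] > samples[0]

healthy = [40, 55, 38, 60, 42, 41]  # rises and falls with load
leaking = [40, 44, 47, 47, 53, 58]  # monotonically creeps upward
```

A healthy series drops back down when load subsides; a leaking one ratchets upward until the pool or max_connections is exhausted.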
If using external connection poolers (PgBouncer, ProxySQL), monitor both the pooler and the database. A healthy pooler might hide database connection limits from applications, but the database itself still has connection overhead to manage.
Running out of disk space is a catastrophic database failure. Unlike performance degradation that annoys users, disk exhaustion causes hard failures, potential data corruption, and emergency recovery procedures. Proactive storage monitoring is non-negotiable.
Space Utilization Metrics:
Track used versus available space on every filesystem the database touches, broken down by data files, indexes, transaction logs, and temporary space.
Growth Rate Analysis:
Static size matters less than growth trajectory:
Projected Days Until Full = (Available Space) / (Average Daily Growth)
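The projection can be sketched directly from the formula; the units and values below are illustrative:

```python
def days_until_full(available_gb: float, daily_growth_gb: float) -> float:
    """Project days of storage runway from free space and average daily growth."""
    if daily_growth_gb <= 0:
        return float("inf")  # flat or shrinking: no projected exhaustion
    return available_gb / daily_growth_gb

# 300 GB free, growing 12 GB/day -> 25 days of runway
runway = days_until_full(available_gb=300.0, daily_growth_gb=12.0)
```

Use a trailing average (for example, 30 days) for the growth rate so that one-off spikes don't whipsaw the projection, and alert when runway drops below your provisioning lead time.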
Track growth rates by component (tables, indexes, logs), and map current utilization to action using thresholds like these:
| Utilization | Status | Action |
|---|---|---|
| < 70% | ✅ Healthy | Normal monitoring |
| 70-80% | ⚠️ Attention | Plan for expansion, review growth rates |
| 80-90% | 🔶 Warning | Active capacity planning required |
| > 90% | 🔴 Critical | Immediate action—expand storage or archive data |
Table Bloat and Fragmentation:
Databases don't always reclaim space efficiently:
Monitor bloat ratios:
Bloat Ratio = (Actual Size - Ideal Size) / Ideal Size × 100%
Tables with > 50% bloat deserve attention—maintenance operations can reclaim this space.
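The bloat formula as a small helper, using illustrative sizes:

```python
def bloat_ratio_pct(actual_bytes: int, ideal_bytes: int) -> float:
    """Bloat = wasted space relative to the ideal (fully compacted) size."""
    if ideal_bytes <= 0:
        raise ValueError("ideal size must be positive")
    return (actual_bytes - ideal_bytes) / ideal_bytes * 100.0

# A table occupying 15 GB whose live rows would fit in 10 GB: 50% bloat
ratio = bloat_ratio_pct(15 * 2**30, 10 * 2**30)
```

Estimating the "ideal" size is the hard part in practice; database-specific tooling derives it from row counts, average row width, and page fill factor.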
Many file systems and databases behave erratically above 90% capacity. Free space management becomes fragmented, performance degrades, and you lose the safety margin for unexpected spikes. Treat 90% as an absolute ceiling requiring immediate action.
We've surveyed the essential metrics that form the foundation of database monitoring. These aren't just numbers—they're a language for understanding database behavior.
What's Next:
Understanding which metrics matter is the first step. The next page explores monitoring tools—the systems that collect, aggregate, visualize, and act upon these metrics. We'll examine database-native monitoring, third-party solutions, and the architecture of production-grade monitoring infrastructure.
You now understand the essential metrics for database monitoring—from resource utilization to query statistics to capacity planning. These form the foundation for the monitoring systems, alerting strategies, and dashboards we'll explore next.