Imagine a seasoned cardiologist examining a patient. They don't just ask 'How do you feel?'—they measure heart rate, blood pressure, oxygen saturation, and dozens of other vital signs. These metrics tell a story that subjective experience cannot. A patient might feel fine while their blood pressure silently climbs to dangerous levels.
Database monitoring operates on the same principle. Your application might appear healthy—queries returning results, users logging in—while underneath, connection pools silently exhaust, disk I/O creeps toward saturation, and query latencies slowly degrade. By the time these problems manifest as user-visible failures, the damage is often severe.
This page establishes the vital signs of database health: the key metrics that every database administrator must understand, collect, and interpret. These aren't arbitrary numbers—they're windows into the internal dynamics of your database engine, revealing bottlenecks, predicting failures, and guiding optimization efforts.
By the end of this page, you will understand the taxonomy of database metrics, master the interpretation of resource utilization indicators, comprehend throughput and latency measurements, analyze query-level statistics, and develop the diagnostic intuition to translate metric patterns into actionable insights.
Database metrics can be overwhelming in their volume and variety. Modern database systems expose hundreds, sometimes thousands, of measurable values. To navigate this complexity, we must first establish a conceptual framework that organizes metrics into meaningful categories.
The Four Pillars of Database Observability:
Database metrics fundamentally divide into four interconnected domains, each revealing different aspects of system health:
| Pillar | Focus | Key Questions Answered | Example Metrics |
|---|---|---|---|
| Resource Utilization | Hardware and system resources consumed by the database | Is the database running out of CPU, memory, disk, or network capacity? | CPU usage, memory allocation, disk I/O, network throughput |
| Throughput | Work completed by the database per unit time | How much work is the database accomplishing? Is capacity sufficient? | Queries per second, transactions per second, rows read/written |
| Latency | Time required to complete operations | How responsive is the database? Are users experiencing delays? | Query response time, lock wait time, I/O latency |
| Errors and Saturation | Failures and capacity limits | What's breaking? Where are the bottlenecks? | Error rates, connection pool exhaustion, lock timeouts |
The USE Method:
Brendan Gregg's USE Method provides a systematic approach to resource analysis. For every resource, it asks three questions:
- Utilization: what fraction of time was the resource busy doing work?
- Saturation: how much work is queued, waiting for the resource?
- Errors: how many error events occurred?
For every database resource (CPU, memory, disk, network, locks, connections), we ask all three questions. This structured approach ensures we don't miss critical bottlenecks.
The RED Method (for Services):
For query-level and transaction-level analysis, the RED method tracks three signals per service or query class:
- Rate: requests served per second
- Errors: requests failing per second
- Duration: the distribution of time each request takes
Together, USE and RED provide comprehensive coverage of both infrastructure and application-layer database behavior.
Effective monitoring requires thinking at multiple layers simultaneously. A slow query (high latency) might stem from poor indexing (logical layer), memory pressure causing disk reads (caching layer), slow disks (storage layer), or CPU contention (compute layer). Correlating metrics across layers reveals the true root cause.
Resource utilization metrics measure how database operations consume underlying computational resources. While seemingly straightforward, interpreting these metrics correctly requires understanding database-specific nuances.
CPU Utilization:
CPU usage in databases deserves careful interpretation. Sustained high CPU may simply mean the database is doing useful work, or it may signal inefficient query plans burning cycles on avoidable computation. Distinguish user time (query execution) from system time and I/O wait, and correlate CPU spikes with query activity before concluding you need more cores.
Memory Utilization:
Database memory architecture is complex. Key memory areas to monitor:
| Memory Area | Purpose | When to Worry | Remediation |
|---|---|---|---|
| Buffer Pool / Cache | Caches data pages in memory to avoid disk reads | Hit ratio < 95%, frequent evictions | Increase cache size, optimize queries to access less data |
| Sort/Work Memory | Temporary memory for sorting, hashing, joins | Disk-based sorts occurring frequently | Increase work memory or optimize queries |
| Connection Memory | Memory allocated per database connection | Growing with connection count | Use connection pooling, right-size connection limits |
| Log Buffer | Buffers transaction log writes | Log buffer waits occurring | Increase log buffer size or improve disk performance |
Buffer Pool Hit Ratio:
This fundamental metric measures how often requested data is found in memory versus requiring disk access:
Buffer Pool Hit Ratio = (Buffer Pool Reads from Memory) / (Total Buffer Pool Requests) × 100%
Target: > 95% for OLTP workloads, > 90% for mixed workloads, > 80% for analytical workloads.
Caution: A high hit ratio doesn't guarantee good performance. If the working set fits in memory, you'll have 99.9% hit ratio even with terrible query plans. Always correlate with latency metrics.
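The hit-ratio formula above can be expressed as a small helper. A minimal Python sketch; the counter sources vary by database, and the function name and example values here are illustrative:

```python
def buffer_pool_hit_ratio(memory_reads: int, total_requests: int) -> float:
    """Percentage of page requests served from the buffer pool (memory)."""
    if total_requests == 0:
        return 100.0  # no requests yet: treat as fully cached
    return memory_reads / total_requests * 100.0

# Example: 9,850 of 10,000 page requests served from memory -> 98.5%
ratio = buffer_pool_hit_ratio(9_850, 10_000)
```

In practice the two counters come from cumulative statistics views, so you would compute the ratio over a sampling interval (the delta between two readings), not over all-time totals.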
Disk I/O Metrics:
Disk performance remains critical despite SSD adoption. The core indicators are IOPS (read and write operations per second), throughput (MB/s transferred), I/O latency, and queue depth. Watch latency and queue depth together: rising queue depth accompanied by rising latency signals a saturated device.
Average I/O latency can hide problems. A 2ms average might include 99% of operations at 0.5ms and 1% at 200ms. Those slow outliers cause query timeouts. Always monitor p95 and p99 latencies, not just averages.
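To see how an average hides the tail, here is a minimal sketch using an illustrative latency sample (2% slow I/Os) and a nearest-rank percentile. The numbers are made up for demonstration:

```python
import math

# 98% of I/Os at 0.5 ms, 2% at 200 ms: the average looks fine, the tail does not
latencies_ms = [0.5] * 98 + [200.0] * 2

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

average = sum(latencies_ms) / len(latencies_ms)  # 4.49 ms: looks harmless
p95 = percentile(latencies_ms, 95)               # 0.5 ms
p99 = percentile(latencies_ms, 99)               # 200.0 ms: the real story
```

The average suggests everything is fine, while p99 exposes the 200 ms outliers that cause timeouts.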
Throughput metrics quantify the actual work accomplished by the database. While resource utilization tells you how hard the system is working, throughput tells you how much useful work results from that effort.
Query Throughput:
The most fundamental throughput measure:
| Metric | Description | Significance |
|---|---|---|
| Queries Per Second (QPS) | Total queries executed per second across all types | Overall database activity level; capacity planning baseline |
| Transactions Per Second (TPS) | Complete transactions committed per second | True business throughput; includes both reads and writes |
| SELECT Queries/sec | Read-only queries per second | Read load indicator; cache effectiveness matters most here |
| INSERT/UPDATE/DELETE/sec | Write operations per second | Write load; affects log throughput, replication lag, cache invalidation |
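Databases typically expose these figures as cumulative counters rather than rates, so throughput is derived from the delta between two samples. A sketch, with a hypothetical counter and illustrative values:

```python
def rate_per_second(prev_count: int, curr_count: int, interval_s: float) -> float:
    """Convert two samples of a cumulative counter into a per-second rate."""
    if interval_s <= 0:
        raise ValueError("sampling interval must be positive")
    return (curr_count - prev_count) / interval_s

# Two samples of a (hypothetical) cumulative query counter, taken 10 s apart
qps = rate_per_second(1_204_000, 1_216_500, 10.0)  # 1250.0 queries/sec
```

Counter resets (after a restart) produce a negative delta; production collectors detect and discard those samples.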
Data Movement Throughput:
Beyond query counts, measuring data volume provides deeper insight: rows read and written per second, and bytes moved to and from storage and clients. A stable query count can hide a data-volume explosion if individual queries begin touching far more rows.
Transaction Log Throughput:
Write-ahead logging (WAL) is critical for durability: every committed transaction must first be recorded in the log. Monitor log bytes written per second and log flush latency.
Log throughput directly limits write capacity. If log writes saturate disk I/O, no amount of CPU or memory helps.
Replication Throughput (for replicated systems):
Critical for high-availability deployments: monitor replication lag (how far each replica trails the primary, in bytes or seconds) and each replica's apply rate. If the primary's write throughput exceeds a replica's apply rate, lag grows without bound.
Higher throughput often increases latency due to queuing. A database handling 100 QPS with 10ms latency might hit 500 QPS with 50ms latency due to internal contention. This relationship is non-linear—latency often spikes dramatically near capacity limits.
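The non-linear blow-up near capacity can be illustrated with the textbook M/M/1 queueing model, where mean response time is W = 1/(mu - lambda). This is an idealization, not a database-specific result, but it captures the shape of the curve:

```python
def mm1_response_time_ms(arrival_qps: float, service_qps: float) -> float:
    """Mean response time (ms) in an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_qps >= service_qps:
        raise ValueError("system is saturated: latency grows without bound")
    return 1000.0 / (service_qps - arrival_qps)

# A server that can process 1000 QPS in isolation (illustrative capacity):
low = mm1_response_time_ms(100, 1000)   # ~1.1 ms at 10% load
high = mm1_response_time_ms(900, 1000)  # 10 ms at 90% load
```

Going from 10% to 90% of capacity multiplies mean latency by nine in this model; the last few percent of headroom cost far more than the first.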
Latency metrics measure time—the dimension most directly felt by users. A query taking 2 seconds versus 20 milliseconds is the difference between a usable application and an abandoned one.
Query Execution Time:
The most user-visible latency metric:
| Phase | Description | Typical Cause of Delays |
|---|---|---|
| Parse Time | SQL text parsing and validation | Complex queries, excessive dynamic SQL |
| Plan Time | Query optimization and plan selection | Many table joins, missing statistics |
| Execution Time | Actual query execution | Full scans, poor indexes, lock waits, disk I/O |
| Network Time | Result transmission to client | Large result sets, network latency |
Measuring Latency Correctly:
Latency measurement requires statistical sophistication: averages mislead, so track percentiles (p50, p95, p99) and, where possible, full distributions.
A user who visits 10 pages experiences p90 latency on at least one page. Heavy users experience p99 regularly. Optimizing for average while ignoring p99 leaves your most engaged users with the worst experience.
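Assuming independent page loads (a simplification), the chance that a session sees at least one above-percentile page is 1 - p^n. A small sketch of that arithmetic:

```python
def prob_at_least_one_slow(pages: int, percentile: float) -> float:
    """Probability that at least one of `pages` independent page loads
    exceeds the given latency percentile (e.g. 0.90 for p90)."""
    return 1.0 - percentile ** pages

# A 10-page visit: ~65% chance of experiencing at least one above-p90 page
p = prob_at_least_one_slow(10, 0.90)
```

The same formula shows why tails compound: a page composed of many backend queries hits a slow query far more often than any single query's percentile suggests.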
Wait Event Analysis:
Modern databases track time spent waiting for various resources:
| Wait Category | Examples | Indicates |
|---|---|---|
| CPU | Compute waits | CPU contention, complex queries |
| I/O | Buffer reads, log writes | Disk bottlenecks, insufficient caching |
| Lock | Row locks, table locks | Contention between transactions |
| Latch | Buffer pool latches | Internal data structure contention |
| Network | Client communication waits | Network latency, large result sets |
Wait event analysis reveals where queries spend time waiting rather than working. A query consuming 10ms CPU but waiting 500ms for locks isn't CPU-bound—it's contention-bound.
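A sketch of turning raw wait-time counters into a percentage profile, which makes the dominant wait category obvious at a glance. The categories and millisecond values are hypothetical:

```python
def wait_profile(waits_ms: dict) -> dict:
    """Express each wait category as a share of total elapsed time (percent)."""
    total = sum(waits_ms.values())
    return {category: round(ms / total * 100, 1) for category, ms in waits_ms.items()}

# Hypothetical breakdown for one slow query: mostly lock waits, barely any CPU
profile = wait_profile({"cpu": 10, "io": 40, "lock": 500, "network": 25})
```

For this sample, locks account for 87% of elapsed time, so adding CPU or faster disks would change almost nothing.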
Lock and Contention Metrics:
Critical for concurrent workloads: track lock waits per second, average and maximum lock wait time, and deadlock frequency. Rising lock wait time under stable throughput usually means transactions are holding locks longer, not that load has increased.
While system-level metrics reveal overall health, query-level metrics pinpoint exactly which queries need attention. Modern databases maintain detailed query statistics.
Query Performance Statistics:
| Metric | Description | Optimization Guidance |
|---|---|---|
| Total Execution Time | Cumulative time across all executions | Optimize queries with highest total time for maximum impact |
| Execution Count | Number of times query executed | High-frequency queries amplify any inefficiency |
| Rows Examined | Total rows read to produce results | High ratio to rows returned indicates missing indexes |
| Rows Returned | Result set size | Very large results may need pagination or filtering |
| Temp Tables Created | On-disk temporary tables used | Memory constraints or complex sorts/joins |
| Full Scans | Times query required table scans | Missing or unusable indexes |
Query Normalization and Fingerprinting:
Raw query texts aren't directly comparable—parameter values differ. Query normalization replaces literals with placeholders:
-- Original queries
SELECT * FROM orders WHERE customer_id = 12345;
SELECT * FROM orders WHERE customer_id = 67890;
-- Normalized fingerprint
SELECT * FROM orders WHERE customer_id = ?;
This fingerprinting allows aggregating statistics across logically identical queries.
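Production tools (pg_stat_statements, pt-query-digest, and similar) normalize from the parse tree or a real tokenizer; a toy regex-based sketch nonetheless conveys the idea:

```python
import re

def fingerprint(sql: str) -> str:
    """Very simplified query normalization: replace string and numeric
    literals with '?' placeholders, then collapse whitespace."""
    sql = re.sub(r"'(?:[^']|'')*'", "?", sql)    # single-quoted string literals
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)   # integer and decimal literals
    return re.sub(r"\s+", " ", sql).strip()

a = fingerprint("SELECT * FROM orders WHERE customer_id = 12345;")
b = fingerprint("SELECT * FROM orders WHERE customer_id = 67890;")
# both normalize to: SELECT * FROM orders WHERE customer_id = ?;
```

Regex normalization breaks on edge cases (comments, IN-lists of varying length, quoted identifiers), which is exactly why real implementations work on tokens rather than text.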
Identifying Problem Queries:
Sort query statistics by different criteria to find different problem types:
In most workloads, the top 10 queries by total time account for 80%+ of database resource consumption. Focusing optimization efforts on these queries yields disproportionate returns. Don't optimize randomly—prioritize by impact.
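A sketch of that prioritization over hypothetical per-fingerprint statistics, ranking by cumulative time while also deriving per-call latency, since the two views expose different problems:

```python
# Hypothetical per-fingerprint statistics, as a monitoring system might store them
stats = [
    {"query": "SELECT ... FROM orders ...",   "calls": 50_000,    "total_ms": 900_000},
    {"query": "UPDATE inventory ...",         "calls": 200,       "total_ms": 400_000},
    {"query": "SELECT ... FROM sessions ...", "calls": 1_000_000, "total_ms": 150_000},
]

def top_by_total_time(stats, n=10):
    """Rank query fingerprints by cumulative execution time and attach
    mean latency per call for each."""
    ranked = sorted(stats, key=lambda s: s["total_ms"], reverse=True)
    for s in ranked:
        s["mean_ms"] = s["total_ms"] / s["calls"]
    return ranked[:n]

worst = top_by_total_time(stats)
```

Note how the rankings diverge: the orders query wins on total time (optimize it first for aggregate impact), while the inventory update is the slowest per call (2 seconds each) and may be what individual users actually complain about.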
Database connections are finite resources. Each connection consumes memory, file descriptors, and management overhead. Monitoring connection behavior prevents both resource exhaustion and application availability issues.
Connection Pool Metrics:
| Metric | Healthy State | Warning Signs |
|---|---|---|
| Active Connections | Well below maximum limit | Approaching or at max_connections |
| Idle Connections | Stable, reasonable count | Growing unbounded, indicating leaks |
| Connection Rate | Steady, expected level | Spikes indicating connection storm |
| Wait Queue Length | Zero or near-zero | Connections waiting for pool slots |
| Average Wait Time | < 1ms | Growing wait times signal exhaustion |
Session State Distribution:
Understanding what connections are doing provides operational insight. Typical states include active (executing a query), idle (connected but waiting on the client), idle in transaction (inside an open transaction but not executing), and waiting (blocked on a lock). A healthy distribution is mostly active or idle; watch the rest closely.
Red Flag: Idle in Transaction
Connections stuck 'idle in transaction' are particularly problematic: they hold locks and snapshots open, block maintenance and cleanup (such as vacuum in PostgreSQL), and tie up pool slots indefinitely. They usually indicate application code that opened a transaction and never committed or rolled back.
Connection Leak Detection:
Connection leaks—connections acquired but never released—eventually exhaust the pool:
# Signs of connection leak:
- Active + Idle connections grow over time
- Connection count never decreases
- Eventually: connection refused errors
- Pattern correlates with specific application actions
Monitor connection count over time. Healthy applications have stable, bounded connection usage. Growing counts indicate leaks requiring application code fixes.
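One naive heuristic for the "count never decreases" pattern, sketched in Python. Real systems would use longer windows, smoothing, and trend analysis; this only shows the shape of the check:

```python
def looks_like_leak(samples: list) -> bool:
    """Flag a connection-count time series that only ever grows.
    `samples` is a sequence of connection-count readings over time."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    return all(d >= 0 for d in deltas) and samples[-1] > samples[0]

healthy = [40, 55, 38, 60, 42, 41]  # rises and falls with load
leaking = [40, 44, 47, 47, 53, 58]  # monotonically creeps upward
```

A healthy series drops back down when load subsides; a leaking one ratchets upward until the pool or max_connections is exhausted.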
If using external connection poolers (PgBouncer, ProxySQL), monitor both the pooler and the database. A healthy pooler might hide database connection limits from applications, but the database itself still has connection overhead to manage.
Running out of disk space is a catastrophic database failure. Unlike performance degradation that annoys users, disk exhaustion causes hard failures, potential data corruption, and emergency recovery procedures. Proactive storage monitoring is non-negotiable.
Space Utilization Metrics:
Track used versus available space on every filesystem the database touches, broken down by data files, indexes, transaction logs, and temporary space.
Growth Rate Analysis:
Static size matters less than growth trajectory:
Projected Days Until Full = (Available Space) / (Average Daily Growth)
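The projection can be sketched directly from the formula; the units and values below are illustrative:

```python
def days_until_full(available_gb: float, daily_growth_gb: float) -> float:
    """Project days of storage runway from free space and average daily growth."""
    if daily_growth_gb <= 0:
        return float("inf")  # flat or shrinking: no projected exhaustion
    return available_gb / daily_growth_gb

# 300 GB free, growing 12 GB/day -> 25 days of runway
runway = days_until_full(available_gb=300.0, daily_growth_gb=12.0)
```

Use a trailing average (for example, 30 days) for the growth rate so that one-off spikes don't whipsaw the projection, and alert when runway drops below your provisioning lead time.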
Track growth rates by component (tables, indexes, logs), and map current utilization to action using thresholds like these:
| Utilization | Status | Action |
|---|---|---|
| < 70% | ✅ Healthy | Normal monitoring |
| 70-80% | ⚠️ Attention | Plan for expansion, review growth rates |
| 80-90% | 🔶 Warning | Active capacity planning required |
| > 90% | 🔴 Critical | Immediate action—expand storage or archive data |
Table Bloat and Fragmentation:
Databases don't always reclaim space efficiently:
Monitor bloat ratios:
Bloat Ratio = (Actual Size - Ideal Size) / Ideal Size × 100%
Tables with > 50% bloat deserve attention—maintenance operations can reclaim this space.
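The bloat formula as a small helper, using illustrative sizes:

```python
def bloat_ratio_pct(actual_bytes: int, ideal_bytes: int) -> float:
    """Bloat = wasted space relative to the ideal (fully compacted) size."""
    if ideal_bytes <= 0:
        raise ValueError("ideal size must be positive")
    return (actual_bytes - ideal_bytes) / ideal_bytes * 100.0

# A table occupying 15 GB whose live rows would fit in 10 GB: 50% bloat
ratio = bloat_ratio_pct(15 * 2**30, 10 * 2**30)
```

Estimating the "ideal" size is the hard part in practice; database-specific tooling derives it from row counts, average row width, and page fill factor.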
Many file systems and databases behave erratically above 90% capacity. Free space management becomes fragmented, performance degrades, and you lose the safety margin for unexpected spikes. Treat 90% as an absolute ceiling requiring immediate action.
We've surveyed the essential metrics that form the foundation of database monitoring. These aren't just numbers—they're a language for understanding database behavior.
What's Next:
Understanding which metrics matter is the first step. The next page explores monitoring tools—the systems that collect, aggregate, visualize, and act upon these metrics. We'll examine database-native monitoring, third-party solutions, and the architecture of production-grade monitoring infrastructure.
You now understand the essential metrics for database monitoring—from resource utilization to query statistics to capacity planning. These form the foundation for the monitoring systems, alerting strategies, and dashboards we'll explore next.