"Is the database fast enough?" This seemingly simple question is surprisingly difficult to answer without clearly defined performance goals. Fast compared to what? Fast enough for whom? Fast under what conditions?
Performance goals transform vague aspirations ("make it faster") into measurable targets ("P99 query latency under 100ms at 1,000 concurrent users"). They provide the foundation for prioritizing optimization efforts, justifying infrastructure changes, and communicating progress to stakeholders. Without goals, performance work is endless and unfocused. With them, it becomes engineering.
By the end of this page, you will understand how to define meaningful performance goals, establish baselines, create Service Level Objectives (SLOs), align technical metrics with business requirements, and build a framework for continuous performance improvement.
Performance goals serve multiple critical functions in database management:
1. Focus Optimization Efforts
Without goals, every query seems equally important. Goals identify which performance characteristics matter most, enabling targeted optimization. Is it more important to be consistently fast (low variance) or to maximize throughput (high volume)?
2. Enable Objective Measurement
"The database is slow" is a complaint. "The order lookup query exceeds our 200ms P95 target" is actionable. Goals convert subjective perception into objective measurement.
3. Justify Investment
Performance improvements often require investment—hardware, engineering time, architectural changes. Goals connect these investments to measurable outcomes, enabling cost-benefit analysis.
4. Detect Regressions
New deployments, schema changes, and increased load can degrade performance. Continuous monitoring against goals catches regressions before users notice.
5. Align Stakeholders
Developers, DBAs, operations, and business stakeholders often have different performance intuitions. Explicit goals create shared understanding and expectations.
Performance goals define acceptable levels, not aspirational ideals. Meeting goals means the system is functioning correctly—not that optimization is complete. Goals should be achievable with reasonable effort, not theoretical perfection.
Database performance goals fall into several categories, each measuring different aspects of system behavior.
Latency Goals
Latency measures response time—how long a query takes from submission to completion. Latency goals typically use percentiles:
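As a minimal sketch (the latency samples below are illustrative, not measurements), percentile targets can be computed from observed query durations with a nearest-rank calculation:

```python
# Sketch: computing latency percentiles from observed samples.
# `latencies_ms` is an assumed, illustrative sample from a query log.
def percentile(samples, pct):
    """Nearest-rank percentile: smallest value >= pct% of samples."""
    ordered = sorted(samples)
    # ceil(pct/100 * n) via negative floor division, converted to 0-based index
    rank = max(0, -(-pct * len(ordered) // 100) - 1)
    return ordered[rank]

latencies_ms = [12, 15, 18, 22, 25, 30, 35, 48, 95, 480]  # illustrative
p50 = percentile(latencies_ms, 50)   # the typical request
p99 = percentile(latencies_ms, 99)   # the tail request

print(f"P50={p50}ms, P99={p99}ms")
print("P99 goal (<500ms):", "PASS" if p99 < 500 else "BREACH")
```

Note how one outlier (480ms) dominates P99 while leaving P50 untouched: this is why latency goals state percentiles rather than averages.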
Throughput Goals
Throughput measures volume—how many queries/transactions the system handles per unit time:
| Goal Type | What It Measures | Example Target | When to Prioritize |
|---|---|---|---|
| Latency (P50) | Typical response time | < 50ms | User-facing interactive queries |
| Latency (P99) | Tail latency | < 500ms | Applications needing consistently fast responses |
| Throughput (QPS) | Query handling capacity | 10,000 QPS | High-volume read workloads |
| Throughput (TPS) | Transaction completion rate | 1,000 TPS | Transactional processing |
| Availability | Uptime percentage | 99.9% | Business-critical systems |
| Error Rate | Failed query percentage | < 0.1% | Data integrity scenarios |
| Resource Usage | CPU, memory, I/O limits | CPU < 70% | Capacity planning, cost control |
The Latency-Throughput Tradeoff
Latency and throughput aren't independent. As load increases (throughput rises), latency typically increases too due to queuing, contention, and resource saturation. Goals must account for this relationship:
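The shape of this relationship can be illustrated with the classic M/M/1 queueing approximation, where mean latency grows without bound as utilization approaches 100%. The service time and load figures below are assumed for illustration, not measurements:

```python
# Sketch: latency explodes as throughput pushes utilization toward saturation.
# M/M/1 mean response time: W = S / (1 - rho), S = service time, rho = utilization.
service_time_ms = 10.0                  # assumed mean time to execute one query
capacity_qps = 1000 / service_time_ms   # ~100 QPS for this single-server model

for offered_qps in (50, 80, 90, 95, 99):
    rho = offered_qps / capacity_qps            # fraction of capacity in use
    mean_latency = service_time_ms / (1 - rho)  # includes queueing delay
    print(f"{offered_qps:>3} QPS (util {rho:.0%}): mean latency {mean_latency:.0f}ms")
```

Doubling load from 50 to 99 QPS multiplies mean latency by 50x in this model, which is why a latency target quoted without a load level is meaningless.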
This tradeoff is why goals should specify conditions, not just targets.
Availability (is the system up?) and performance (is it fast?) are related but distinct. A system can be available but unusably slow, or fast but frequently crashing. Good goals cover both: "99.9% availability with P99 latency < 200ms during available periods."
Before setting goals, you must understand current performance. Baselines document existing system behavior under normal conditions, providing the reference point for goal-setting and regression detection.
What to Baseline
Baseline Collection Strategy
Baselines should capture representative behavior:
```sql
-- Create baseline table for historical storage
CREATE TABLE query_baselines (
    baseline_id SERIAL PRIMARY KEY,
    captured_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    period_type VARCHAR(20) NOT NULL,  -- 'hourly', 'daily', 'weekly'
    queryid BIGINT NOT NULL,
    query_signature TEXT NOT NULL,
    -- Volume metrics
    total_calls BIGINT NOT NULL,
    -- Latency percentiles (in milliseconds)
    p50_latency DECIMAL(10,2),
    p95_latency DECIMAL(10,2),
    p99_latency DECIMAL(10,2),
    max_latency DECIMAL(10,2),
    -- Resource metrics
    avg_rows_returned DECIMAL(10,2),
    total_shared_blks_hit BIGINT,
    total_shared_blks_read BIGINT,
    buffer_hit_ratio DECIMAL(5,2)
);

-- Capture hourly baseline (run via pg_cron or external scheduler)
INSERT INTO query_baselines (
    period_type, queryid, query_signature, total_calls,
    p50_latency, p95_latency, p99_latency, max_latency,
    avg_rows_returned, total_shared_blks_hit, total_shared_blks_read,
    buffer_hit_ratio
)
SELECT
    'hourly',
    queryid,
    LEFT(query, 200),
    calls,
    -- Note: pg_stat_statements doesn't store percentiles directly.
    -- This shows mean; for true percentiles, use auto_explain + log parsing.
    mean_exec_time,          -- Approximation
    mean_exec_time * 1.5,    -- Rough P95 estimate
    mean_exec_time * 2.5,    -- Rough P99 estimate
    max_exec_time,
    rows / NULLIF(calls, 0),
    shared_blks_hit,
    shared_blks_read,
    100.0 * shared_blks_hit / NULLIF(shared_blks_hit + shared_blks_read, 0)
FROM pg_stat_statements
WHERE calls > 100  -- Focus on significant queries
ORDER BY total_exec_time DESC
LIMIT 50;

-- Trend analysis: compare current to baseline
SELECT
    current_stats.queryid,
    baseline.p99_latency AS baseline_p99,
    current_stats.mean_exec_time AS current_mean,
    CASE WHEN current_stats.mean_exec_time > baseline.p99_latency * 1.5
         THEN 'REGRESSION' ELSE 'OK' END AS status
FROM pg_stat_statements current_stats
JOIN query_baselines baseline ON current_stats.queryid = baseline.queryid
WHERE baseline.period_type = 'daily'
  AND baseline.captured_at > now() - INTERVAL '7 days';
```

Collecting baselines during outages, attacks, or unusual traffic skews the reference. Exclude anomalous periods or clearly mark them. A baseline captured during a database migration isn't representative of normal operation.
Service Level Objectives (SLOs) and Service Level Agreements (SLAs) formalize performance goals into commitments.
Terminology

- SLI (Service Level Indicator): the metric you actually measure, e.g., P99 read latency or query success rate.
- SLO (Service Level Objective): the internal target for an SLI, e.g., "P99 read latency < 200ms".
- SLA (Service Level Agreement): a contractual commitment to customers, typically looser than the internal SLO and often carrying penalties for violation.
SLO Design Principles
| SLI | SLO Target | Error Budget | Violation Response |
|---|---|---|---|
| Read query latency | P99 < 200ms | 0.1% may exceed | Page on-call engineer |
| Write transaction latency | P99 < 500ms | 0.05% may exceed | Immediate investigation |
| Query success rate | 99.9% | 1 in 1000 may fail | Automated retry + alert |
| Connection success rate | 99.95% | 0.05% connection failures | Page + capacity review |
| Replication lag | < 10 seconds | 1 minute of delay/hour | Failover consideration |
Error Budgets
Error budgets acknowledge that perfection is impossible and expensive. Instead of demanding 100% compliance, error budgets allow a defined margin for violations:
Using Error Budgets:
This approach balances reliability against development velocity.
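As a back-of-the-envelope sketch (the volume and violation figures are assumed for illustration), an error budget is simply the complement of the SLO target applied to expected volume:

```python
# Sketch: turning an SLO target into a concrete monthly error budget.
slo_target = 0.999             # 99.9% of reads must meet the latency goal
monthly_queries = 50_000_000   # assumed monthly read volume

budget = (1 - slo_target) * monthly_queries   # violations we can "afford"
print(f"Error budget: {budget:,.0f} slow queries/month")

consumed = 32_000              # assumed violations so far this month
remaining = budget - consumed
print(f"Budget remaining: {remaining:,.0f} ({remaining / budget:.0%})")
```

With most of the budget already spent mid-month, a team following this model would slow feature rollouts and prioritize reliability work until the budget recovers.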
```sql
-- Calculate SLO compliance from query statistics
-- Assuming slow_query_log or similar captures durations

-- Read latency SLO: P99 < 200ms
WITH query_buckets AS (
    SELECT
        DATE_TRUNC('hour', captured_at) AS hour,
        COUNT(*) AS total_queries,
        COUNT(*) FILTER (WHERE duration_ms <= 200) AS within_slo,
        PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) AS actual_p99
    FROM query_logs
    WHERE query_type = 'SELECT'
      AND captured_at > CURRENT_DATE - INTERVAL '7 days'
    GROUP BY DATE_TRUNC('hour', captured_at)
)
SELECT
    hour,
    total_queries,
    within_slo,
    100.0 * within_slo / total_queries AS compliance_pct,
    actual_p99,
    CASE WHEN actual_p99 <= 200 THEN 'PASS' ELSE 'BREACH' END AS slo_status
FROM query_buckets
ORDER BY hour DESC;

-- Error budget calculation (monthly)
WITH monthly_stats AS (
    SELECT
        DATE_TRUNC('month', captured_at) AS month,
        COUNT(*) AS total_queries,
        COUNT(*) FILTER (WHERE duration_ms > 200) AS violations,
        -- Error budget: 0.1% of queries can exceed
        0.001 * COUNT(*) AS allowed_violations
    FROM query_logs
    WHERE query_type = 'SELECT'
    GROUP BY DATE_TRUNC('month', captured_at)
)
SELECT
    month,
    total_queries,
    violations AS actual_violations,
    allowed_violations::int,
    violations - allowed_violations AS budget_status,
    CASE WHEN violations <= allowed_violations THEN 'BUDGET_OK'
         ELSE 'BUDGET_EXCEEDED' END AS status
FROM monthly_stats
ORDER BY month DESC;
```

Initially set SLOs that you can easily meet (e.g., 110% of current baseline performance). Consistently meeting SLOs builds confidence. Gradually tighten targets as you optimize and as the organization matures in SLO management.
Technical performance metrics must connect to business value. Database engineers can easily fall into the trap of optimizing for technical perfection when business needs are more pragmatic.
The Business-Technical Connection
| Business Need | Technical Goal |
|---|---|
| Users don't abandon checkout | Payment queries < 500ms |
| Reports ready by 9 AM meeting | Batch ETL completes by 8:30 AM |
| Support can answer calls quickly | Customer lookup < 100ms |
| Mobile app feels responsive | API queries < 200ms |
| No data loss in disasters | RPO < 1 minute, RTO < 15 minutes |
Discovering Business Requirements
Business requirements often aren't expressed in milliseconds. Translation is needed:
Prioritizing Based on Impact
Not all queries are equal. Prioritize optimization efforts based on:
Example Prioritization:
| Priority | Query Type | Rationale |
|---|---|---|
| 1 | Checkout/Payment | Revenue-critical, abandonment risk |
| 2 | Search/Browse | User experience, frequent |
| 3 | User Dashboard | Retain users, daily usage |
| 4 | Admin Reports | Internal, less time-sensitive |
| 5 | Batch Jobs | Can run off-peak, retry on failure |
Achieving P99 < 10ms might require 10x the infrastructure of P99 < 100ms. Business stakeholders should understand these tradeoffs. Sometimes 'good enough' performance at lower cost is the right business decision. Present options with cost implications.
Performance goals are only useful if you monitor against them continuously. Goal-driven monitoring means configuring dashboards, alerts, and reports around your defined SLOs rather than generic metrics.
Dashboard Design Principles
Alerting Strategy
```yaml
# SLO-based alerting rules
groups:
  - name: database_slos
    rules:
      # Read latency SLO: P99 < 200ms
      # Alert when error budget depletes too fast (burn rate)
      - alert: ReadLatencySLOBudgetBurn
        # 5% of monthly error budget burned in 1 hour = critical
        expr: |
          (
            sum(rate(query_duration_bucket{le="0.2",type="read"}[1h]))
            /
            sum(rate(query_duration_count{type="read"}[1h]))
          ) < 0.999 - (0.001 * 0.05 * 720)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Read latency SLO error budget burning fast"
          description: "At current rate, monthly error budget will be exhausted in < 20 hours"

      # Absolute threshold: SLO breached now
      - alert: ReadLatencyP99Breached
        expr: |
          histogram_quantile(0.99,
            sum(rate(query_duration_bucket{type="read"}[5m])) by (le)
          ) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Read query P99 latency exceeds 200ms SLO"
          description: "Current P99: {{ $value | humanizeDuration }}"

      # Connection availability SLO
      - alert: ConnectionAvailabilitySLOBreached
        expr: |
          rate(connection_failures_total[5m])
          / rate(connection_attempts_total[5m]) > 0.0005
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Connection success rate below 99.95% SLO"

  - name: capacity_warnings
    rules:
      # Proactive capacity: approaching limits
      - alert: QueryLatencyTrending
        expr: |
          deriv(query_duration_p99[1h]) > 0.001  # Increasing 1ms/minute
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Query latency trending upward"
          description: "P99 latency is consistently increasing - investigate before SLO breach"
```

| Severity | Condition | Response | Example |
|---|---|---|---|
| Critical (Page) | SLO breached, error budget exhausted | Immediate investigation, may wake on-call | P99 > 200ms for 10+ minutes |
| Warning | Approaching SLO, budget burning fast | Investigate during business hours | P99 at 180ms, trending up |
| Info | Anomaly detected, no SLO impact | Review in regular maintenance | Query count spike, performance stable |
Rather than alerting when P99 exceeds 200ms, alert when error budget is burning faster than sustainable. Brief spikes might not deplete budget; sustained moderate issues will. Burn rate alerting catches the latter before crisis.
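The burn-rate idea reduces to one ratio: observed violation rate divided by the rate the budget allows. A minimal sketch (thresholds follow the common fast-burn/slow-burn pattern; the request counts are assumed, not real telemetry):

```python
# Sketch: burn-rate alerting on an error budget.
ERROR_BUDGET = 0.001  # SLO allows 0.1% of requests to violate the latency goal

def burn_rate(bad, total):
    """Ratio of observed violation rate to the sustainable rate.
    1.0 = budget lasts exactly one SLO period; higher = exhausted sooner."""
    return (bad / total) / ERROR_BUDGET

# One hour of traffic: 720,000 reads, 10,800 over the latency goal (assumed)
rate_1h = burn_rate(10_800, 720_000)   # 1.5% observed vs 0.1% allowed -> 15x

# Commonly used thresholds: page on a fast burn, ticket on a slow one
if rate_1h >= 14.4:
    print("PAGE: budget exhausted in roughly 2 days at this rate")
elif rate_1h >= 3.0:
    print("TICKET: sustained slow burn, investigate in business hours")
else:
    print("OK: burn rate sustainable")
```

A 10-minute latency spike barely moves this ratio over an hour-long window, while a persistent 1.5% violation rate trips it immediately, which is exactly the behavior the paragraph above describes.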
Performance optimization isn't a one-time project but an ongoing practice. Continuous improvement means regularly reviewing performance, identifying opportunities, implementing changes, and validating results.
The Performance Improvement Cycle
┌──────────────────────────────────────────────────────────┐
│ │
│ 1. MEASURE → 2. ANALYZE │
│ (Current state) (Find bottlenecks) │
│ │
│ ↑ ↓ │
│ │
│ 4. VALIDATE ← 3. IMPROVE │
│ (Did it work?) (Apply changes) │
│ │
└──────────────────────────────────────────────────────────┘
1. Measure — Collect current SLO compliance, top-N query performance, resource utilization
2. Analyze — Identify biggest opportunities using bottleneck identification techniques
3. Improve — Implement changes: indexes, query rewrites, configuration tuning, scaling
4. Validate — Confirm improvement achieved expected results; detect unintended regressions
Regression Prevention
As important as improvement is preventing regression:
Small weekly improvements compound. Improving one query by 10% each week doesn't seem dramatic, but consistently applied across months, the system becomes dramatically faster. More importantly, the habit of measurement and optimization becomes embedded in engineering culture.
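The compounding effect is easy to quantify (a sketch with an assumed starting latency):

```python
# Sketch: a 10% weekly improvement compounds multiplicatively.
latency_ms = 400.0               # assumed starting P99 latency
for week in range(26):           # half a year of small wins
    latency_ms *= 0.90           # each week shaves 10% off what remains
print(f"After 26 weeks: {latency_ms:.0f}ms")  # a small fraction of the original
```

Twenty-six consecutive 10% wins leave roughly 0.9^26 ≈ 6.5% of the starting latency, a ~15x improvement from changes that individually looked minor.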
Performance goals transform database optimization from endless tuning into managed engineering. Let's consolidate the essential concepts:
Module Complete
You have completed Module 1: Query Performance Basics. You now understand:
These foundations prepare you for the subsequent modules, where we'll dive deeper into EXPLAIN plans, index optimization, query tuning techniques, and advanced performance topics.
Congratulations! You have mastered the foundational concepts of SQL query performance. You can now measure execution time, understand resource consumption, identify bottlenecks systematically, use measurement tools effectively, and define goals that drive continuous improvement. These skills form the essential base for all advanced database performance work.