Query performance is not a problem you solve once. It's a characteristic you maintain continuously. Data grows, workloads evolve, code changes, and infrastructure shifts. A query that performed well last month may become a bottleneck next month. A new feature deployment may introduce slow queries that didn't exist before.
Continuous improvement transforms performance work from reactive firefighting into proactive engineering. Rather than waiting for user complaints or system alerts, you establish processes that detect, prevent, and remediate performance issues as part of normal development and operations.
This page establishes the practices, processes, and infrastructure that sustain query performance over time—ensuring the optimizations you make today remain effective tomorrow.
By the end of this page, you will understand how to establish performance baselines and monitoring, integrate query performance into development workflows, implement regression detection systems, create performance review processes, and build organizational capability for sustained database performance.
Database performance improvement follows a cyclical pattern: Measure → Analyze → Improve → Validate → Monitor → Repeat. Each phase builds on the previous, creating a feedback loop that sustains and improves performance over time.
Manual execution of this cycle doesn't scale. Each phase should be automated where possible—automated collection, automated regression detection, automated alerting, automated validation. Human judgment directs the process; automation executes it.
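To make "automation executes it" concrete, here is a minimal Python sketch of the Measure and Analyze phases as composable functions. Everything in it is illustrative (the function names, the toy workload, and the 2x threshold are assumptions, not a prescribed API); a real pipeline would pull timings from pg_stat_statements or Query Store.

```python
# Illustrative sketch of the Measure -> Analyze steps of the cycle.
# A real pipeline would collect timings from pg_stat_statements or
# Query Store instead of an in-memory dict.

def measure(workload):
    """Reduce raw latency samples (ms) to a mean per query."""
    return {qid: sum(times) / len(times) for qid, times in workload.items()}

def analyze(current, baseline, threshold=2.0):
    """Flag queries whose mean latency regressed past `threshold` x baseline."""
    return [qid for qid, mean_ms in current.items()
            if qid in baseline and mean_ms > baseline[qid] * threshold]

# Toy example: 'q2' has tripled in mean latency since the baseline
baseline = {"q1": 10.0, "q2": 15.0}
current = measure({"q1": [9.0, 11.0], "q2": [40.0, 50.0]})
regressed = analyze(current, baseline)
```

The remaining phases (Improve, Validate, Monitor) hang off the same loop: each flagged query becomes an action item, and the next measurement validates the fix.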
Baselines are reference points that define "normal" performance. Without baselines, you can't detect regression, measure improvement, or set meaningful SLAs. Baselines should capture:

- Per-query execution statistics: call counts, mean/min/max execution time, and variance
- Rows returned and I/O behavior (cache hits vs. disk reads)
- System-level indicators such as connection counts and overall cache hit ratio
Baselines must be established under representative conditions—typical workload, normal data volume, standard concurrency levels.
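Under those representative conditions, a captured sample of per-query latencies reduces to the same summary statistics the baseline tables store. A hedged sketch in Python (the helper name and the nearest-rank p95 are illustrative choices, not part of any standard API):

```python
import statistics

def baseline_stats(latencies_ms):
    """Summarize a latency sample (ms) into baseline metrics.
    Uses the nearest-rank method for p95; one of several valid definitions."""
    ordered = sorted(latencies_ms)
    p95_index = max(0, round(0.95 * len(ordered)) - 1)
    return {
        "calls": len(ordered),
        "mean_ms": statistics.fmean(ordered),
        "stddev_ms": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
        "p95_ms": ordered[p95_index],
    }

# One outlier dominates both max and p95 in a small sample
stats = baseline_stats([12.0, 15.0, 11.0, 14.0, 90.0])
```

Note how a single 90ms outlier pulls the mean far from the median; this is why baselines store stddev alongside the mean.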
```sql
-- ================================================================
-- Baseline Collection Infrastructure
-- ================================================================

-- =========================
-- PostgreSQL: Baseline Snapshot System
-- =========================

-- Create baseline storage table
CREATE TABLE performance_baselines (
    baseline_id SERIAL PRIMARY KEY,
    baseline_name VARCHAR(100) NOT NULL,
    captured_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    capture_reason VARCHAR(255),  -- "weekly snapshot", "pre-deployment", etc.
    CONSTRAINT unique_baseline UNIQUE (baseline_name)
);

-- Query-level baselines
CREATE TABLE baseline_query_stats (
    baseline_id INT REFERENCES performance_baselines(baseline_id),
    queryid BIGINT,
    query_text TEXT,
    calls BIGINT,
    total_exec_time_ms DOUBLE PRECISION,
    mean_exec_time_ms DOUBLE PRECISION,
    min_exec_time_ms DOUBLE PRECISION,
    max_exec_time_ms DOUBLE PRECISION,
    stddev_exec_time_ms DOUBLE PRECISION,
    rows_returned BIGINT,
    shared_blks_hit BIGINT,
    shared_blks_read BIGINT,
    PRIMARY KEY (baseline_id, queryid)
);

-- Procedure to capture baseline
CREATE OR REPLACE PROCEDURE capture_baseline(
    p_baseline_name VARCHAR,
    p_reason VARCHAR DEFAULT NULL
)
LANGUAGE plpgsql
AS $$
DECLARE
    v_baseline_id INT;
BEGIN
    -- Create baseline record
    INSERT INTO performance_baselines (baseline_name, capture_reason)
    VALUES (p_baseline_name, p_reason)
    ON CONFLICT (baseline_name)
    DO UPDATE SET captured_at = NOW(), capture_reason = p_reason
    RETURNING baseline_id INTO v_baseline_id;

    -- Clear previous stats for this baseline
    DELETE FROM baseline_query_stats WHERE baseline_id = v_baseline_id;

    -- Capture current query statistics
    INSERT INTO baseline_query_stats
    SELECT
        v_baseline_id,
        queryid,
        query,
        calls,
        total_exec_time,
        mean_exec_time,
        min_exec_time,
        max_exec_time,
        stddev_exec_time,
        rows,
        shared_blks_hit,
        shared_blks_read
    FROM pg_stat_statements
    WHERE calls > 10;  -- Filter noise

    RAISE NOTICE 'Baseline % captured with % queries',
        p_baseline_name,
        (SELECT COUNT(*) FROM baseline_query_stats
         WHERE baseline_id = v_baseline_id);
END;
$$;

-- Capture baselines at key moments
CALL capture_baseline('weekly_2024_w03', 'Weekly baseline capture');
CALL capture_baseline('pre_release_v2.5', 'Before v2.5 deployment');
CALL capture_baseline('post_index_optimization', 'After adding composite indexes');

-- =========================
-- SQL Server: Query Store Baseline
-- =========================

-- Query Store automatically maintains historical data
-- Define baseline period for comparison

-- Get baseline metrics for a specific time window
SELECT
    q.query_id,
    qt.query_sql_text,
    AVG(rs.avg_duration) / 1000.0 AS baseline_avg_ms,
    AVG(rs.avg_cpu_time) / 1000.0 AS baseline_cpu_ms,
    AVG(rs.avg_logical_io_reads) AS baseline_io,
    SUM(rs.count_executions) AS baseline_executions
FROM sys.query_store_query q
JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan p ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
JOIN sys.query_store_runtime_stats_interval rsi
    ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
WHERE rsi.start_time BETWEEN '2024-01-01' AND '2024-01-31'  -- Baseline period
GROUP BY q.query_id, qt.query_sql_text;

-- Create a persistent baseline table for long-term storage
CREATE TABLE dbo.QueryBaselines (
    baseline_id INT IDENTITY PRIMARY KEY,
    baseline_name NVARCHAR(100),
    created_at DATETIME2 DEFAULT GETUTCDATE(),
    query_id BIGINT,
    query_hash BINARY(8),
    baseline_avg_ms DECIMAL(18,4),
    baseline_p95_ms DECIMAL(18,4),
    baseline_cpu_ms DECIMAL(18,4),
    baseline_io BIGINT,
    baseline_executions BIGINT
);
```

| Baseline Type | Frequency | Purpose | Retention |
|---|---|---|---|
| Rolling | Daily | Detect day-over-day changes | 7-14 days |
| Weekly Snapshot | Weekly | Trend analysis, capacity planning | 3-6 months |
| Pre-Deployment | Per release | Regression detection after deployments | Per major version |
| Post-Optimization | After major changes | Verify improvement, set new baseline | Permanent |
| Incident Reference | On-demand | Document state during/after incidents | Permanent |
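The cadence column above translates naturally into a scheduler rule. A small sketch of deciding which captures are due on a given day (the naming convention mirrors the earlier `weekly_2024_w03` example; the Monday trigger and the function itself are assumptions for illustration):

```python
import datetime as dt

def baselines_due(today):
    """Return baseline names due for capture on `today`, per the cadence table.
    Rolling baselines run daily; weekly snapshots run on Mondays."""
    due = [f"rolling_{today.isoformat()}"]
    if today.weekday() == 0:  # Monday
        year, week, _ = today.isocalendar()
        due.append(f"weekly_{year}_w{week:02d}")
    return due

# A Monday in ISO week 3 of 2024 triggers both captures
print(baselines_due(dt.date(2024, 1, 15)))
```

Pre-deployment and post-optimization baselines are event-driven rather than scheduled, so they stay as explicit `CALL capture_baseline(...)` steps in the release pipeline.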
Regression detection automatically identifies when query performance degrades compared to baseline. This is essential for catching problems early—ideally before users notice.
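The detection logic pairs a ratio threshold with a variance check so that noisy queries don't trigger false positives. The statistical part, in isolation, is just this comparison (a sketch: the 2x-stddev rule mirrors the SQL CASE expression used in this page, not a formal hypothesis test):

```python
def is_significant_regression(current_mean, baseline_mean,
                              current_stddev, baseline_stddev):
    """True when the mean shifted by more than twice the larger stddev.
    A rough variance check, not a formal t-test."""
    return abs(current_mean - baseline_mean) > 2 * max(current_stddev,
                                                       baseline_stddev)
```

A mean that moves from 10ms to 50ms with stddevs around 4-5ms is flagged; a move from 10ms to 12ms stays within variance.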
```sql
-- ================================================================
-- Automated Regression Detection
-- ================================================================

-- =========================
-- PostgreSQL: Regression Detection Query
-- =========================

-- Compare current performance to baseline
WITH current_stats AS (
    SELECT queryid, query, calls, mean_exec_time, stddev_exec_time, total_exec_time
    FROM pg_stat_statements
    WHERE calls > 100  -- Sufficient sample
),
baseline_stats AS (
    SELECT
        queryid,
        mean_exec_time_ms AS baseline_mean,
        stddev_exec_time_ms AS baseline_stddev,
        calls AS baseline_calls
    FROM baseline_query_stats
    WHERE baseline_id = (
        SELECT baseline_id FROM performance_baselines
        WHERE baseline_name = 'pre_release_v2.5'
    )
)
SELECT
    cs.queryid,
    LEFT(cs.query, 100) AS query_preview,
    -- Current vs baseline
    ROUND(bs.baseline_mean::NUMERIC, 2) AS baseline_ms,
    ROUND(cs.mean_exec_time::NUMERIC, 2) AS current_ms,
    ROUND((cs.mean_exec_time / NULLIF(bs.baseline_mean, 0))::NUMERIC, 2) AS regression_factor,
    -- Statistical significance
    CASE
        WHEN ABS(cs.mean_exec_time - bs.baseline_mean) >
             2 * GREATEST(cs.stddev_exec_time, bs.baseline_stddev)
        THEN 'SIGNIFICANT'
        ELSE 'Within variance'
    END AS significance,
    -- Impact assessment
    cs.calls AS current_calls,
    ROUND(((cs.mean_exec_time - bs.baseline_mean) * cs.calls / 1000)::NUMERIC, 2)
        AS added_time_seconds
FROM current_stats cs
JOIN baseline_stats bs ON cs.queryid = bs.queryid
WHERE cs.mean_exec_time > bs.baseline_mean * 2  -- 2x regression threshold
ORDER BY added_time_seconds DESC
LIMIT 20;

-- =========================
-- SQL Server: Query Store Regression Detection
-- =========================

-- Find regressed queries compared to previous period
WITH CurrentPeriod AS (
    SELECT
        p.query_id,
        AVG(rs.avg_duration) AS avg_duration,
        SUM(rs.count_executions) AS executions
    FROM sys.query_store_plan p
    JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
    JOIN sys.query_store_runtime_stats_interval rsi
        ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
    WHERE rsi.start_time > DATEADD(day, -1, GETUTCDATE())  -- Last 24 hours
    GROUP BY p.query_id
),
BaselinePeriod AS (
    SELECT
        p.query_id,
        AVG(rs.avg_duration) AS avg_duration,
        SUM(rs.count_executions) AS executions
    FROM sys.query_store_plan p
    JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
    JOIN sys.query_store_runtime_stats_interval rsi
        ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
    WHERE rsi.start_time BETWEEN DATEADD(day, -8, GETUTCDATE())
                             AND DATEADD(day, -1, GETUTCDATE())  -- Previous week
    GROUP BY p.query_id
)
SELECT
    qt.query_sql_text,
    c.query_id,
    b.avg_duration / 1000.0 AS baseline_ms,
    c.avg_duration / 1000.0 AS current_ms,
    c.avg_duration / NULLIF(b.avg_duration, 0) AS regression_factor,
    c.executions AS current_executions,
    (c.avg_duration - b.avg_duration) * c.executions / 1000000.0 AS added_time_sec,
    -- Categorize severity
    CASE
        WHEN c.avg_duration > b.avg_duration * 10 THEN 'CRITICAL'
        WHEN c.avg_duration > b.avg_duration * 5 THEN 'HIGH'
        WHEN c.avg_duration > b.avg_duration * 2 THEN 'MEDIUM'
        ELSE 'LOW'
    END AS severity
FROM CurrentPeriod c
JOIN BaselinePeriod b ON c.query_id = b.query_id
JOIN sys.query_store_query q ON c.query_id = q.query_id
JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
WHERE c.avg_duration > b.avg_duration * 1.5  -- 50% regression
  AND c.executions > 100  -- Sufficient activity
  AND b.avg_duration > 10000  -- Non-trivial queries (10ms+)
ORDER BY added_time_sec DESC;

-- =========================
-- Automated Alerting Job
-- =========================

-- SQL Server Agent job or PostgreSQL pg_cron job
-- that runs regression detection and sends alerts

/*
Implementation pattern:
1. Run regression detection query hourly
2. Insert results into regression_alerts table
3. For severity = 'CRITICAL' or 'HIGH', trigger alert:
   - Email DBA team
   - Post to Slack/Teams channel
   - Create on-call incident

Alert content should include:
- Query text/hash
- Baseline vs current metrics
- Impact in total time
- Suggested investigation starting points
*/

-- PostgreSQL: Create alert logging table
CREATE TABLE regression_alerts (
    alert_id SERIAL PRIMARY KEY,
    detected_at TIMESTAMPTZ DEFAULT NOW(),
    queryid BIGINT,
    query_preview TEXT,
    baseline_ms DECIMAL(10,2),
    current_ms DECIMAL(10,2),
    regression_factor DECIMAL(10,2),
    severity VARCHAR(20),
    acknowledged BOOLEAN DEFAULT FALSE,
    acknowledged_by VARCHAR(100),
    resolution_notes TEXT
);
```

Too many alerts desensitize recipients. Set thresholds that balance early detection with actionable signal. Start conservative (high thresholds), then lower as you build confidence. Consider alert deduplication—don't alert on the same regression repeatedly.
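Alert deduplication can be as simple as a per-query cooldown. A minimal sketch (the function, the severity gate, and the 6-hour window are illustrative choices, not part of any alerting product's API):

```python
import datetime as dt

def should_alert(queryid, severity, last_sent,
                 cooldown=dt.timedelta(hours=6)):
    """Gate alerts: only CRITICAL/HIGH, and at most once per cooldown window.
    `last_sent` maps queryid -> timestamp of the last alert dispatched."""
    if severity not in ("CRITICAL", "HIGH"):
        return False
    now = dt.datetime.now(dt.timezone.utc)
    previous = last_sent.get(queryid)
    if previous is not None and now - previous < cooldown:
        return False  # Deduplicated: same regression already alerted recently
    last_sent[queryid] = now
    return True
```

In a database-resident implementation, the same state lives in the `regression_alerts` table: before inserting a new alert, check for an unacknowledged row with the same queryid inside the window.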
The best time to catch performance problems is before deployment. Development integration embeds query performance awareness into the software development lifecycle.
````sql
-- ================================================================
-- Development Integration: Query Performance Testing
-- ================================================================

-- =========================
-- Code Review: EXPLAIN Annotation
-- =========================

/*
Require EXPLAIN output in pull request descriptions
for any changed queries. Reviewers check for:

✓ Index usage on filtered columns
✓ Reasonable row estimates
✓ Appropriate join strategies
✗ Sequential scans on large tables
✗ Nested loops with high loop counts
✗ Sort operations on large result sets
*/

-- Example PR description format:
/*
## Query Changes in This PR

### Modified: GetCustomerOrders

EXPLAIN ANALYZE output:
```
Hash Join  (cost=125.00..890.00 rows=1000 width=80)
           (actual time=12.5..45.2 rows=987 loops=1)
  Hash Cond: (o.customer_id = c.customer_id)
  ->  Index Scan using idx_orders_date on orders o
        (actual time=0.05..15.3 rows=10000 loops=1)
        Index Cond: (order_date >= '2024-01-01')
  ->  Hash  (actual time=8.2..8.2 rows=50 loops=1)
        ->  Index Scan using idx_customers_segment on customers c
              (actual time=0.03..4.1 rows=50 loops=1)
              Index Cond: (segment = 'Enterprise')
```

✓ Uses indexes on both tables
✓ Estimates match actuals reasonably
✓ Hash join appropriate for this data volume
*/

-- =========================
-- Automated Test: Performance Assertions
-- =========================

/*
Example: pytest with performance assertions (Python)

import pytest
import time

def test_customer_orders_performance(db_connection):
    """GetCustomerOrders should complete under 100ms for typical case"""
    customer_id = 12345

    start = time.perf_counter()
    result = db_connection.execute(
        "SELECT * FROM orders WHERE customer_id = %s", [customer_id]
    )
    elapsed_ms = (time.perf_counter() - start) * 1000

    assert elapsed_ms < 100, f"Query took {elapsed_ms}ms, expected < 100ms"
    assert len(result) > 0, "Expected results for test customer"

def test_order_summary_uses_index(db_connection):
    """Order summary query should use idx_orders_status"""
    plan = db_connection.execute("""
        EXPLAIN (FORMAT JSON)
        SELECT status, COUNT(*) FROM orders GROUP BY status
    """)
    plan_text = str(plan)
    assert 'Seq Scan' not in plan_text, "Query should not perform sequential scan"
    assert 'idx_orders_status' in plan_text or 'Index' in plan_text
*/

-- =========================
-- CI/CD Pipeline: Plan Comparison
-- =========================

-- Store expected EXPLAIN output in repository
-- Compare against actual during CI build

/*
Example GitHub Actions workflow:

name: Query Performance Check
on: [pull_request]

jobs:
  check-queries:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Start test database
        run: docker-compose up -d postgres
      - name: Load schema and test data
        run: psql -f schema.sql && psql -f test_data.sql
      - name: Run query checks
        run: |
          for query_file in queries/*.sql; do
            echo "Checking $query_file"

            # Get EXPLAIN output for the query file's contents
            psql -c "EXPLAIN (ANALYZE, FORMAT JSON) $(cat $query_file)" > plan.json

            # Check for sequential scans on large tables
            if grep -q '"Node Type": "Seq Scan"' plan.json; then
              table=$(grep -A5 '"Node Type": "Seq Scan"' plan.json | grep "Relation Name")
              echo "WARNING: Sequential scan detected: $table"
              exit 1
            fi

            # Check for excessive rows examined
            rows=$(jq '.[0]."Plan"."Actual Rows"' plan.json)
            if [ "$rows" -gt 100000 ]; then
              echo "WARNING: Query examines $rows rows"
              exit 1
            fi
          done
*/

-- =========================
-- Staging Validation Checklist
-- =========================

/*
Before production deployment:

□ Load staging with production-equivalent data volume
□ Run all modified queries
□ Compare EXPLAIN output to development environment
□ Verify execution times within acceptable thresholds
□ Check for plan changes due to data volume differences
□ Monitor staging for 24-48 hours under simulated load
□ Document any performance concerns and mitigations
*/
````

The earlier you catch performance issues, the cheaper they are to fix. A developer writing a query can add an index in seconds. Fixing the same issue in production with locked tables and urgent timelines takes hours and creates risk.
Production monitoring provides ongoing visibility into query performance, enabling early detection of issues and trend analysis over time.
```sql
-- ================================================================
-- Production Monitoring Queries for Dashboards
-- ================================================================

-- =========================
-- Real-Time Health Dashboard
-- =========================

-- PostgreSQL: Current system health
SELECT
    -- Active connections
    (SELECT COUNT(*) FROM pg_stat_activity
     WHERE state = 'active') AS active_queries,

    -- Waiting queries (locked)
    (SELECT COUNT(*) FROM pg_stat_activity
     WHERE wait_event_type = 'Lock') AS waiting_queries,

    -- Long-running queries
    (SELECT COUNT(*) FROM pg_stat_activity
     WHERE state = 'active'
       AND NOW() - query_start > INTERVAL '10 seconds') AS slow_queries,

    -- Cache hit ratio
    (SELECT ROUND(100.0 * SUM(blks_hit) /
                  NULLIF(SUM(blks_hit) + SUM(blks_read), 0), 2)
     FROM pg_stat_database) AS cache_hit_pct,

    -- Deadlocks (cumulative since last stats reset)
    (SELECT deadlocks FROM pg_stat_database
     WHERE datname = current_database()) AS total_deadlocks,

    -- Commits (cumulative; sample twice to derive commits per second)
    (SELECT xact_commit FROM pg_stat_database
     WHERE datname = current_database()) AS total_commits;

-- =========================
-- Time-Series Metrics Collection (for Prometheus/Grafana)
-- =========================

-- PostgreSQL: Metrics endpoint query (run every 15s)
SELECT 'pg_stat_statements_total_time' AS metric,
       SUM(total_exec_time) AS value,
       NOW() AS timestamp
FROM pg_stat_statements
UNION ALL
SELECT 'pg_stat_statements_calls', SUM(calls), NOW()
FROM pg_stat_statements
UNION ALL
SELECT 'pg_active_connections', COUNT(*), NOW()
FROM pg_stat_activity WHERE state = 'active';

-- =========================
-- Daily Performance Summary Report
-- =========================

-- PostgreSQL: Generate daily summary
WITH daily_stats AS (
    SELECT
        DATE_TRUNC('hour', NOW()) AS report_time,
        COUNT(DISTINCT queryid) AS unique_queries,
        SUM(calls) AS total_calls,
        SUM(total_exec_time) / 1000 AS total_time_seconds,
        AVG(mean_exec_time) AS avg_query_time_ms,
        MAX(max_exec_time) AS slowest_query_ms,
        SUM(shared_blks_read) AS total_blocks_read,
        SUM(shared_blks_hit) AS total_blocks_hit
    FROM pg_stat_statements
)
SELECT
    'Daily Summary' AS report_type,
    unique_queries,
    total_calls,
    ROUND(total_time_seconds::NUMERIC, 2) AS total_time_sec,
    ROUND(avg_query_time_ms::NUMERIC, 2) AS avg_time_ms,
    ROUND(slowest_query_ms::NUMERIC, 2) AS max_time_ms,
    ROUND(100.0 * total_blocks_hit /
          NULLIF(total_blocks_hit + total_blocks_read, 0), 2) AS cache_hit_pct
FROM daily_stats;

-- =========================
-- Alerting Thresholds
-- =========================

/*
Recommended alert thresholds (adjust for your workload):

CRITICAL (page immediately):
- p99 latency > 5x baseline
- Active queries > 80% connection limit
- Deadlock rate > 1/hour
- Cache hit ratio < 80%
- Disk queue depth > 100

HIGH (respond within 1 hour):
- p95 latency > 3x baseline
- Slow query count > 10/minute
- CPU utilization > 90% sustained
- Memory pressure events

MEDIUM (review daily):
- p50 latency increasing trend
- Query volume change > 50%
- New query patterns detected
- Index usage changes

LOW (weekly review):
- Statistics updates pending
- Table bloat increasing
- Configuration recommendations
*/

-- =========================
-- SQL Server: Monitoring Views
-- =========================

-- Real-time dashboard query
SELECT
    (SELECT COUNT(*) FROM sys.dm_exec_requests
     WHERE session_id > 50 AND status = 'running') AS running_queries,
    (SELECT COUNT(*) FROM sys.dm_exec_requests
     WHERE session_id > 50 AND blocking_session_id > 0) AS blocked_queries,
    (SELECT COUNT(*) FROM sys.dm_exec_requests
     WHERE session_id > 50 AND total_elapsed_time > 10000) AS slow_queries,
    (SELECT ROUND(100.0 * SUM(CASE WHEN is_modified = 0 THEN 1 ELSE 0 END) / COUNT(*), 2)
     FROM sys.dm_os_buffer_descriptors) AS clean_buffer_pct,
    (SELECT cntr_value FROM sys.dm_os_performance_counters
     WHERE counter_name = 'Page life expectancy'
       AND object_name LIKE '%Buffer Manager%') AS page_life_expectancy;
```

Most organizations use external monitoring platforms (Prometheus/Grafana, Datadog, New Relic, etc.). Export database metrics using pg_exporter (PostgreSQL), sql_exporter, or native integrations. This enables unified dashboards, historical analysis, and cross-system correlation.
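Whichever platform receives the metrics, the suggested thresholds are worth encoding once so every alerting job applies the same rules. A partial sketch covering only the latency, cache, and deadlock rules (cutoffs copied from the recommended thresholds; the function and its signature are hypothetical):

```python
def alert_level(p99_ms, p95_ms, baseline_p99_ms, baseline_p95_ms,
                cache_hit_pct, deadlocks_per_hour):
    """Map a few key metrics to an alert level, per the suggested thresholds."""
    if (p99_ms > 5 * baseline_p99_ms
            or cache_hit_pct < 80
            or deadlocks_per_hour > 1):
        return "CRITICAL"  # Page immediately
    if p95_ms > 3 * baseline_p95_ms:
        return "HIGH"  # Respond within 1 hour
    return "OK"
```

Keeping the thresholds in one function (or one config file) means "start conservative, then lower" is a one-line change rather than an audit of every dashboard.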
Beyond automated monitoring, regular performance reviews ensure that trends are analyzed, priorities are set, and improvements are tracked systematically.
| Review Type | Frequency | Participants | Focus Areas |
|---|---|---|---|
| Incident Review | Per incident | DBA, affected team | Root cause, prevention, action items |
| Weekly Triage | Weekly | DBA, dev leads | Top slow queries, regressions, capacity |
| Sprint Planning | Every sprint | DBA, product team | Performance debt prioritization |
| Monthly Analysis | Monthly | DBA, engineering lead | Trends, forecasting, infrastructure |
| Quarterly Review | Quarterly | DBA, CTO/VP | Budget, roadmap, major investments |
```sql
-- ================================================================
-- Weekly Performance Review Report Queries
-- ================================================================

-- =========================
-- Top 10 Slow Queries This Week vs Last Week
-- =========================

WITH this_week AS (
    SELECT queryid, mean_exec_time, total_exec_time, calls
    FROM pg_stat_statements
),
last_week AS (
    SELECT queryid, mean_exec_time_ms AS mean_exec_time
    FROM baseline_query_stats
    WHERE baseline_id = (
        SELECT baseline_id FROM performance_baselines
        WHERE baseline_name = 'weekly_snapshot_prev'
    )
)
SELECT
    RANK() OVER (ORDER BY tw.total_exec_time DESC) AS this_week_rank,
    tw.queryid,
    ROUND(tw.mean_exec_time::NUMERIC, 2) AS current_avg_ms,
    ROUND(lw.mean_exec_time::NUMERIC, 2) AS last_week_avg_ms,
    ROUND((tw.mean_exec_time / NULLIF(lw.mean_exec_time, 0))::NUMERIC, 2) AS change_factor,
    CASE
        WHEN tw.mean_exec_time > lw.mean_exec_time * 1.5 THEN '⬆️ SLOWER'
        WHEN tw.mean_exec_time < lw.mean_exec_time * 0.67 THEN '⬇️ FASTER'
        ELSE '➡️ STABLE'
    END AS trend,
    tw.calls AS weekly_calls
FROM this_week tw
LEFT JOIN last_week lw ON tw.queryid = lw.queryid
ORDER BY tw.total_exec_time DESC
LIMIT 10;

-- =========================
-- New Query Patterns (First seen this week)
-- =========================

SELECT
    pss.queryid,
    LEFT(pss.query, 100) AS query_preview,
    pss.calls,
    ROUND(pss.mean_exec_time::NUMERIC, 2) AS avg_ms,
    ROUND(pss.total_exec_time::NUMERIC, 2) AS total_ms
FROM pg_stat_statements pss
LEFT JOIN baseline_query_stats bqs
    ON pss.queryid = bqs.queryid
   AND bqs.baseline_id = (
        SELECT baseline_id FROM performance_baselines
        WHERE baseline_name = 'weekly_snapshot_prev'
   )
WHERE bqs.queryid IS NULL  -- Not in last week's baseline
  AND pss.calls > 100      -- Significant usage
ORDER BY pss.total_exec_time DESC
LIMIT 10;

-- =========================
-- Capacity Trending
-- =========================

SELECT 'Database Size' AS metric,
       pg_size_pretty(pg_database_size(current_database())) AS current_value,
       NULL AS week_over_week_change
UNION ALL
SELECT 'Total Queries/Week', SUM(calls)::TEXT, NULL
FROM pg_stat_statements
UNION ALL
SELECT 'Avg Query Time (ms)', ROUND(AVG(mean_exec_time)::NUMERIC, 2)::TEXT, NULL
FROM pg_stat_statements
WHERE calls > 100;

-- =========================
-- Action Items Dashboard
-- =========================

/*
Maintain a table of optimization action items:

CREATE TABLE performance_action_items (
    item_id SERIAL PRIMARY KEY,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    queryid BIGINT,
    description TEXT,
    priority VARCHAR(20),  -- P1, P2, P3
    status VARCHAR(20),    -- Open, In Progress, Done, Wont Fix
    assigned_to VARCHAR(100),
    target_date DATE,
    completed_date DATE,
    resolution_notes TEXT
);

-- Report for weekly review:
SELECT
    priority,
    status,
    COUNT(*) AS items,
    STRING_AGG(item_id::TEXT, ', ' ORDER BY created_at) AS item_ids
FROM performance_action_items
WHERE status NOT IN ('Done', 'Wont Fix')
GROUP BY priority, status
ORDER BY priority, status;
*/
```

Performance knowledge must be captured, shared, and retained. When the one expert who understands a complex optimization leaves, that knowledge shouldn't leave with them.
```sql
-- ================================================================
-- Knowledge Documentation Examples
-- ================================================================

-- =========================
-- In-Database Documentation
-- =========================

-- PostgreSQL: Use COMMENT for index documentation
COMMENT ON INDEX idx_orders_customer_date IS
'Supports: GetCustomerOrders, OrderHistoryReport
Created: 2024-01-15
Reason: Customer order lookups were doing sequential scans
Impact: Reduced GetCustomerOrders from 450ms to 12ms
Review: Check usage quarterly; remove if no longer needed';

COMMENT ON INDEX idx_orders_status_partial IS
'Partial index on status = ''pending''
Supports: PendingOrdersDashboard
Only indexes pending orders (5% of table)
Much smaller than full index, faster updates';

-- =========================
-- Query Runbook Template (Wiki/README)
-- =========================

/*
# Query: MonthlyRevenueReport

## Purpose
Generates monthly revenue summary by product category for finance dashboard.

## Execution Profile
- Typical execution time: 2-5 seconds
- Resource intensity: HIGH (full scan of order_items)
- Frequency: Daily at 6:00 UTC, plus ad-hoc
- User impact: None (background job)

## Dependencies
- Tables: orders, order_items, products, categories
- Indexes: idx_orders_date (critical for date filtering)

## Known Issues
- Slow on first Monday of month due to cold cache
- If > 30s, check for lock contention from inventory jobs

## Optimization History
| Date       | Change               | Before | After |
|------------|----------------------|--------|-------|
| 2023-06-15 | Added covering index | 45s    | 5s    |
| 2023-09-01 | Partitioned orders   | 5s     | 2s    |
| 2024-01-20 | Materialized view    | 2s     | 50ms  |

## Contacts
- Owner: Data Platform Team
- Escalation: #data-platform-oncall
*/

-- =========================
-- Incident Postmortem Template
-- =========================

/*
# Performance Incident: Checkout Latency Spike

## Summary
On 2024-03-15 14:00-14:45 UTC, checkout completion times increased
from 200ms to 8000ms, causing cart abandonment.

## Timeline (UTC)
- 14:00: Deployment of order-service v2.4.1
- 14:05: First automated alert (p95 > 1s)
- 14:12: On-call acknowledged, began investigation
- 14:20: Identified new query pattern from deployment
- 14:35: Rolled back to v2.4.0
- 14:45: Latency returned to normal

## Root Cause
New ORM-generated query in v2.4.1 changed:
FROM: SELECT * FROM orders WHERE id = ?
TO:   SELECT * FROM orders WHERE id = ? ORDER BY created_at

The ORDER BY prevented index-only scan, caused sequential scan
on 10M row orders table.

## Impact
- Duration: 45 minutes
- Affected checkouts: ~2000
- Estimated revenue impact: $50,000

## Resolution
Immediate: Rollback
Permanent: Added index on (id, created_at), modified ORM mapping

## Prevention
- Add query plan regression check to CI/CD pipeline
- Review ORM-generated queries before release
- Add query latency monitoring at application layer
*/
```

Documentation rots quickly. Schedule periodic reviews (quarterly) to update runbooks, verify index usage, and archive obsolete information. Stale documentation is worse than no documentation—it creates false confidence.
Query performance is not a problem to solve once but a characteristic to maintain continuously. The practices covered in this page transform performance work from reactive firefighting into proactive engineering—ensuring your database remains performant as data grows, workloads evolve, and systems change.
Module Complete:
You've now completed the Query Profiling module. You understand how to use profiler tools, interpret execution statistics, identify slow queries, rewrite inefficient SQL, and establish processes for continuous performance improvement. These skills transform you from someone who can optimize individual queries into someone who can build and maintain performant database systems at scale.
Congratulations! You've mastered the complete query profiling discipline—from understanding profiler architecture through establishing organizational practices for sustained performance. You're now equipped to approach database performance as an engineering practice, not just a troubleshooting exercise.