Query performance is not a problem you solve once. It's a characteristic you maintain continuously. Data grows, workloads evolve, code changes, and infrastructure shifts. A query that performed well last month may become a bottleneck next month. A new feature deployment may introduce slow queries that didn't exist before.
Continuous improvement transforms performance work from reactive firefighting into proactive engineering. Rather than waiting for user complaints or system alerts, you establish processes that detect, prevent, and remediate performance issues as part of normal development and operations.
This page establishes the practices, processes, and infrastructure that sustain query performance over time—ensuring the optimizations you make today remain effective tomorrow.
By the end of this page, you will understand how to establish performance baselines and monitoring, integrate query performance into development workflows, implement regression detection systems, create performance review processes, and build organizational capability for sustained database performance.
Database performance improvement follows a cyclical pattern: Measure → Analyze → Improve → Validate → Monitor → Repeat. Each phase builds on the previous, creating a feedback loop that sustains and improves performance over time.
Manual execution of this cycle doesn't scale. Each phase should be automated where possible—automated collection, automated regression detection, automated alerting, automated validation. Human judgment directs the process; automation executes it.
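To make "automation executes it" concrete, here is a minimal Python sketch of the Measure and Analyze phases as composable functions. Everything in it is illustrative (the function names, the toy workload, and the 2x threshold are assumptions, not a prescribed API); a real pipeline would pull timings from pg_stat_statements or Query Store.

```python
# Illustrative sketch of the Measure -> Analyze steps of the cycle.
# A real pipeline would collect timings from pg_stat_statements or
# Query Store instead of an in-memory dict.

def measure(workload):
    """Reduce raw latency samples (ms) to a mean per query."""
    return {qid: sum(times) / len(times) for qid, times in workload.items()}

def analyze(current, baseline, threshold=2.0):
    """Flag queries whose mean latency regressed past `threshold` x baseline."""
    return [qid for qid, mean_ms in current.items()
            if qid in baseline and mean_ms > baseline[qid] * threshold]

# Toy example: 'q2' has tripled in mean latency since the baseline
baseline = {"q1": 10.0, "q2": 15.0}
current = measure({"q1": [9.0, 11.0], "q2": [40.0, 50.0]})
regressed = analyze(current, baseline)
```

The remaining phases (Improve, Validate, Monitor) hang off the same loop: each flagged query becomes an action item, and the next measurement validates the fix.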
Baselines are reference points that define "normal" performance. Without baselines, you can't detect regression, measure improvement, or set meaningful SLAs. Baselines should capture:

- Per-query execution statistics: call counts, mean/min/max execution time, and variance
- Rows returned and I/O behavior (cache hits vs. disk reads)
- System-level indicators such as connection counts and overall cache hit ratio
Baselines must be established under representative conditions—typical workload, normal data volume, standard concurrency levels.
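Under those representative conditions, a captured sample of per-query latencies reduces to the same summary statistics the baseline tables store. A hedged sketch in Python (the helper name and the nearest-rank p95 are illustrative choices, not part of any standard API):

```python
import statistics

def baseline_stats(latencies_ms):
    """Summarize a latency sample (ms) into baseline metrics.
    Uses the nearest-rank method for p95; one of several valid definitions."""
    ordered = sorted(latencies_ms)
    p95_index = max(0, round(0.95 * len(ordered)) - 1)
    return {
        "calls": len(ordered),
        "mean_ms": statistics.fmean(ordered),
        "stddev_ms": statistics.stdev(ordered) if len(ordered) > 1 else 0.0,
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
        "p95_ms": ordered[p95_index],
    }

# One outlier dominates both max and p95 in a small sample
stats = baseline_stats([12.0, 15.0, 11.0, 14.0, 90.0])
```

Note how a single 90ms outlier pulls the mean far from the median; this is why baselines store stddev alongside the mean.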
```sql
-- ================================================================
-- Baseline Collection Infrastructure
-- ================================================================

-- =========================
-- PostgreSQL: Baseline Snapshot System
-- =========================

-- Create baseline storage table
CREATE TABLE performance_baselines (
    baseline_id SERIAL PRIMARY KEY,
    baseline_name VARCHAR(100) NOT NULL,
    captured_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    capture_reason VARCHAR(255),  -- "weekly snapshot", "pre-deployment", etc.
    CONSTRAINT unique_baseline UNIQUE (baseline_name)
);

-- Query-level baselines
CREATE TABLE baseline_query_stats (
    baseline_id INT REFERENCES performance_baselines(baseline_id),
    queryid BIGINT,
    query_text TEXT,
    calls BIGINT,
    total_exec_time_ms DOUBLE PRECISION,
    mean_exec_time_ms DOUBLE PRECISION,
    min_exec_time_ms DOUBLE PRECISION,
    max_exec_time_ms DOUBLE PRECISION,
    stddev_exec_time_ms DOUBLE PRECISION,
    rows_returned BIGINT,
    shared_blks_hit BIGINT,
    shared_blks_read BIGINT,
    PRIMARY KEY (baseline_id, queryid)
);

-- Procedure to capture baseline
CREATE OR REPLACE PROCEDURE capture_baseline(
    p_baseline_name VARCHAR,
    p_reason VARCHAR DEFAULT NULL
)
LANGUAGE plpgsql
AS $$
DECLARE
    v_baseline_id INT;
BEGIN
    -- Create baseline record
    INSERT INTO performance_baselines (baseline_name, capture_reason)
    VALUES (p_baseline_name, p_reason)
    ON CONFLICT (baseline_name)
    DO UPDATE SET captured_at = NOW(), capture_reason = p_reason
    RETURNING baseline_id INTO v_baseline_id;

    -- Clear previous stats for this baseline
    DELETE FROM baseline_query_stats WHERE baseline_id = v_baseline_id;

    -- Capture current query statistics
    INSERT INTO baseline_query_stats
    SELECT
        v_baseline_id,
        queryid,
        query,
        calls,
        total_exec_time,
        mean_exec_time,
        min_exec_time,
        max_exec_time,
        stddev_exec_time,
        rows,
        shared_blks_hit,
        shared_blks_read
    FROM pg_stat_statements
    WHERE calls > 10;  -- Filter noise

    RAISE NOTICE 'Baseline % captured with % queries',
        p_baseline_name,
        (SELECT COUNT(*) FROM baseline_query_stats
         WHERE baseline_id = v_baseline_id);
END;
$$;

-- Capture baselines at key moments
CALL capture_baseline('weekly_2024_w03', 'Weekly baseline capture');
CALL capture_baseline('pre_release_v2.5', 'Before v2.5 deployment');
CALL capture_baseline('post_index_optimization', 'After adding composite indexes');

-- =========================
-- SQL Server: Query Store Baseline
-- =========================

-- Query Store automatically maintains historical data
-- Define baseline period for comparison

-- Get baseline metrics for a specific time window
SELECT
    q.query_id,
    qt.query_sql_text,
    AVG(rs.avg_duration) / 1000.0 AS baseline_avg_ms,
    AVG(rs.avg_cpu_time) / 1000.0 AS baseline_cpu_ms,
    AVG(rs.avg_logical_io_reads) AS baseline_io,
    SUM(rs.count_executions) AS baseline_executions
FROM sys.query_store_query q
JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
JOIN sys.query_store_plan p ON q.query_id = p.query_id
JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
JOIN sys.query_store_runtime_stats_interval rsi
    ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
WHERE rsi.start_time BETWEEN '2024-01-01' AND '2024-01-31'  -- Baseline period
GROUP BY q.query_id, qt.query_sql_text;

-- Create a persistent baseline table for long-term storage
CREATE TABLE dbo.QueryBaselines (
    baseline_id INT IDENTITY PRIMARY KEY,
    baseline_name NVARCHAR(100),
    created_at DATETIME2 DEFAULT GETUTCDATE(),
    query_id BIGINT,
    query_hash BINARY(8),
    baseline_avg_ms DECIMAL(18,4),
    baseline_p95_ms DECIMAL(18,4),
    baseline_cpu_ms DECIMAL(18,4),
    baseline_io BIGINT,
    baseline_executions BIGINT
);
```

| Baseline Type | Frequency | Purpose | Retention |
|---|---|---|---|
| Rolling | Daily | Detect day-over-day changes | 7-14 days |
| Weekly Snapshot | Weekly | Trend analysis, capacity planning | 3-6 months |
| Pre-Deployment | Per release | Regression detection after deployments | Per major version |
| Post-Optimization | After major changes | Verify improvement, set new baseline | Permanent |
| Incident Reference | On-demand | Document state during/after incidents | Permanent |
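The cadence column above translates naturally into a scheduler rule. A small sketch of deciding which captures are due on a given day (the naming convention mirrors the earlier `weekly_2024_w03` example; the Monday trigger and the function itself are assumptions for illustration):

```python
import datetime as dt

def baselines_due(today):
    """Return baseline names due for capture on `today`, per the cadence table.
    Rolling baselines run daily; weekly snapshots run on Mondays."""
    due = [f"rolling_{today.isoformat()}"]
    if today.weekday() == 0:  # Monday
        year, week, _ = today.isocalendar()
        due.append(f"weekly_{year}_w{week:02d}")
    return due

# A Monday in ISO week 3 of 2024 triggers both captures
print(baselines_due(dt.date(2024, 1, 15)))
```

Pre-deployment and post-optimization baselines are event-driven rather than scheduled, so they stay as explicit `CALL capture_baseline(...)` steps in the release pipeline.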
Regression detection automatically identifies when query performance degrades compared to baseline. This is essential for catching problems early—ideally before users notice.
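The detection logic pairs a ratio threshold with a variance check so that noisy queries don't trigger false positives. The statistical part, in isolation, is just this comparison (a sketch: the 2x-stddev rule mirrors the SQL CASE expression used in this page, not a formal hypothesis test):

```python
def is_significant_regression(current_mean, baseline_mean,
                              current_stddev, baseline_stddev):
    """True when the mean shifted by more than twice the larger stddev.
    A rough variance check, not a formal t-test."""
    return abs(current_mean - baseline_mean) > 2 * max(current_stddev,
                                                       baseline_stddev)
```

A mean that moves from 10ms to 50ms with stddevs around 4-5ms is flagged; a move from 10ms to 12ms stays within variance.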
```sql
-- ================================================================
-- Automated Regression Detection
-- ================================================================

-- =========================
-- PostgreSQL: Regression Detection Query
-- =========================

-- Compare current performance to baseline
WITH current_stats AS (
    SELECT queryid, query, calls, mean_exec_time, stddev_exec_time, total_exec_time
    FROM pg_stat_statements
    WHERE calls > 100  -- Sufficient sample
),
baseline_stats AS (
    SELECT
        queryid,
        mean_exec_time_ms AS baseline_mean,
        stddev_exec_time_ms AS baseline_stddev,
        calls AS baseline_calls
    FROM baseline_query_stats
    WHERE baseline_id = (
        SELECT baseline_id FROM performance_baselines
        WHERE baseline_name = 'pre_release_v2.5'
    )
)
SELECT
    cs.queryid,
    LEFT(cs.query, 100) AS query_preview,
    -- Current vs baseline
    ROUND(bs.baseline_mean::NUMERIC, 2) AS baseline_ms,
    ROUND(cs.mean_exec_time::NUMERIC, 2) AS current_ms,
    ROUND((cs.mean_exec_time / NULLIF(bs.baseline_mean, 0))::NUMERIC, 2) AS regression_factor,
    -- Statistical significance
    CASE
        WHEN ABS(cs.mean_exec_time - bs.baseline_mean) >
             2 * GREATEST(cs.stddev_exec_time, bs.baseline_stddev)
        THEN 'SIGNIFICANT'
        ELSE 'Within variance'
    END AS significance,
    -- Impact assessment
    cs.calls AS current_calls,
    ROUND(((cs.mean_exec_time - bs.baseline_mean) * cs.calls / 1000)::NUMERIC, 2)
        AS added_time_seconds
FROM current_stats cs
JOIN baseline_stats bs ON cs.queryid = bs.queryid
WHERE cs.mean_exec_time > bs.baseline_mean * 2  -- 2x regression threshold
ORDER BY added_time_seconds DESC
LIMIT 20;

-- =========================
-- SQL Server: Query Store Regression Detection
-- =========================

-- Find regressed queries compared to previous period
WITH CurrentPeriod AS (
    SELECT
        p.query_id,
        AVG(rs.avg_duration) AS avg_duration,
        SUM(rs.count_executions) AS executions
    FROM sys.query_store_plan p
    JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
    JOIN sys.query_store_runtime_stats_interval rsi
        ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
    WHERE rsi.start_time > DATEADD(day, -1, GETUTCDATE())  -- Last 24 hours
    GROUP BY p.query_id
),
BaselinePeriod AS (
    SELECT
        p.query_id,
        AVG(rs.avg_duration) AS avg_duration,
        SUM(rs.count_executions) AS executions
    FROM sys.query_store_plan p
    JOIN sys.query_store_runtime_stats rs ON p.plan_id = rs.plan_id
    JOIN sys.query_store_runtime_stats_interval rsi
        ON rs.runtime_stats_interval_id = rsi.runtime_stats_interval_id
    WHERE rsi.start_time BETWEEN DATEADD(day, -8, GETUTCDATE())
                             AND DATEADD(day, -1, GETUTCDATE())  -- Previous week
    GROUP BY p.query_id
)
SELECT
    qt.query_sql_text,
    c.query_id,
    b.avg_duration / 1000.0 AS baseline_ms,
    c.avg_duration / 1000.0 AS current_ms,
    c.avg_duration / NULLIF(b.avg_duration, 0) AS regression_factor,
    c.executions AS current_executions,
    (c.avg_duration - b.avg_duration) * c.executions / 1000000.0 AS added_time_sec,
    -- Categorize severity
    CASE
        WHEN c.avg_duration > b.avg_duration * 10 THEN 'CRITICAL'
        WHEN c.avg_duration > b.avg_duration * 5 THEN 'HIGH'
        WHEN c.avg_duration > b.avg_duration * 2 THEN 'MEDIUM'
        ELSE 'LOW'
    END AS severity
FROM CurrentPeriod c
JOIN BaselinePeriod b ON c.query_id = b.query_id
JOIN sys.query_store_query q ON c.query_id = q.query_id
JOIN sys.query_store_query_text qt ON q.query_text_id = qt.query_text_id
WHERE c.avg_duration > b.avg_duration * 1.5  -- 50% regression
  AND c.executions > 100  -- Sufficient activity
  AND b.avg_duration > 10000  -- Non-trivial queries (10ms+)
ORDER BY added_time_sec DESC;

-- =========================
-- Automated Alerting Job
-- =========================

-- SQL Server Agent job or PostgreSQL pg_cron job
-- that runs regression detection and sends alerts

/*
Implementation pattern:
1. Run regression detection query hourly
2. Insert results into regression_alerts table
3. For severity = 'CRITICAL' or 'HIGH', trigger alert:
   - Email DBA team
   - Post to Slack/Teams channel
   - Create on-call incident

Alert content should include:
- Query text/hash
- Baseline vs current metrics
- Impact in total time
- Suggested investigation starting points
*/

-- PostgreSQL: Create alert logging table
CREATE TABLE regression_alerts (
    alert_id SERIAL PRIMARY KEY,
    detected_at TIMESTAMPTZ DEFAULT NOW(),
    queryid BIGINT,
    query_preview TEXT,
    baseline_ms DECIMAL(10,2),
    current_ms DECIMAL(10,2),
    regression_factor DECIMAL(10,2),
    severity VARCHAR(20),
    acknowledged BOOLEAN DEFAULT FALSE,
    acknowledged_by VARCHAR(100),
    resolution_notes TEXT
);
```

Too many alerts desensitize recipients. Set thresholds that balance early detection with actionable signal. Start conservative (high thresholds), then lower as you build confidence. Consider alert deduplication—don't alert on the same regression repeatedly.
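Alert deduplication can be as simple as a per-query cooldown. A minimal sketch (the function, the severity gate, and the 6-hour window are illustrative choices, not part of any alerting product's API):

```python
import datetime as dt

def should_alert(queryid, severity, last_sent,
                 cooldown=dt.timedelta(hours=6)):
    """Gate alerts: only CRITICAL/HIGH, and at most once per cooldown window.
    `last_sent` maps queryid -> timestamp of the last alert dispatched."""
    if severity not in ("CRITICAL", "HIGH"):
        return False
    now = dt.datetime.now(dt.timezone.utc)
    previous = last_sent.get(queryid)
    if previous is not None and now - previous < cooldown:
        return False  # Deduplicated: same regression already alerted recently
    last_sent[queryid] = now
    return True
```

In a database-resident implementation, the same state lives in the `regression_alerts` table: before inserting a new alert, check for an unacknowledged row with the same queryid inside the window.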
The best time to catch performance problems is before deployment. Development integration embeds query performance awareness into the software development lifecycle.
````sql
-- ================================================================
-- Development Integration: Query Performance Testing
-- ================================================================

-- =========================
-- Code Review: EXPLAIN Annotation
-- =========================

/*
Require EXPLAIN output in pull request descriptions
for any changed queries. Reviewers check for:

✓ Index usage on filtered columns
✓ Reasonable row estimates
✓ Appropriate join strategies
✗ Sequential scans on large tables
✗ Nested loops with high loop counts
✗ Sort operations on large result sets
*/

-- Example PR description format:
/*
## Query Changes in This PR

### Modified: GetCustomerOrders

EXPLAIN ANALYZE output:
```
Hash Join  (cost=125.00..890.00 rows=1000 width=80)
           (actual time=12.5..45.2 rows=987 loops=1)
  Hash Cond: (o.customer_id = c.customer_id)
  ->  Index Scan using idx_orders_date on orders o
        (actual time=0.05..15.3 rows=10000 loops=1)
        Index Cond: (order_date >= '2024-01-01')
  ->  Hash  (actual time=8.2..8.2 rows=50 loops=1)
        ->  Index Scan using idx_customers_segment on customers c
              (actual time=0.03..4.1 rows=50 loops=1)
              Index Cond: (segment = 'Enterprise')
```

✓ Uses indexes on both tables
✓ Estimates match actuals reasonably
✓ Hash join appropriate for this data volume
*/

-- =========================
-- Automated Test: Performance Assertions
-- =========================

/*
Example: pytest with performance assertions (Python)

import pytest
import time

def test_customer_orders_performance(db_connection):
    """GetCustomerOrders should complete under 100ms for typical case"""
    customer_id = 12345

    start = time.perf_counter()
    result = db_connection.execute(
        "SELECT * FROM orders WHERE customer_id = %s", [customer_id]
    )
    elapsed_ms = (time.perf_counter() - start) * 1000

    assert elapsed_ms < 100, f"Query took {elapsed_ms}ms, expected < 100ms"
    assert len(result) > 0, "Expected results for test customer"

def test_order_summary_uses_index(db_connection):
    """Order summary query should use idx_orders_status"""
    plan = db_connection.execute("""
        EXPLAIN (FORMAT JSON)
        SELECT status, COUNT(*) FROM orders GROUP BY status
    """)
    plan_text = str(plan)
    assert 'Seq Scan' not in plan_text, "Query should not perform sequential scan"
    assert 'idx_orders_status' in plan_text or 'Index' in plan_text
*/

-- =========================
-- CI/CD Pipeline: Plan Comparison
-- =========================

-- Store expected EXPLAIN output in repository
-- Compare against actual during CI build

/*
Example GitHub Actions workflow:

name: Query Performance Check
on: [pull_request]

jobs:
  check-queries:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Start test database
        run: docker-compose up -d postgres
      - name: Load schema and test data
        run: psql -f schema.sql && psql -f test_data.sql
      - name: Run query checks
        run: |
          for query_file in queries/*.sql; do
            echo "Checking $query_file"

            # Get EXPLAIN output for the query file's contents
            psql -c "EXPLAIN (ANALYZE, FORMAT JSON) $(cat $query_file)" > plan.json

            # Check for sequential scans on large tables
            if grep -q '"Node Type": "Seq Scan"' plan.json; then
              table=$(grep -A5 '"Node Type": "Seq Scan"' plan.json | grep "Relation Name")
              echo "WARNING: Sequential scan detected: $table"
              exit 1
            fi

            # Check for excessive rows examined
            rows=$(jq '.[0]."Plan"."Actual Rows"' plan.json)
            if [ "$rows" -gt 100000 ]; then
              echo "WARNING: Query examines $rows rows"
              exit 1
            fi
          done
*/

-- =========================
-- Staging Validation Checklist
-- =========================

/*
Before production deployment:

□ Load staging with production-equivalent data volume
□ Run all modified queries
□ Compare EXPLAIN output to development environment
□ Verify execution times within acceptable thresholds
□ Check for plan changes due to data volume differences
□ Monitor staging for 24-48 hours under simulated load
□ Document any performance concerns and mitigations
*/
````

The earlier you catch performance issues, the cheaper they are to fix. A developer writing a query can add an index in seconds. Fixing the same issue in production with locked tables and urgent timelines takes hours and creates risk.
Production monitoring provides ongoing visibility into query performance, enabling early detection of issues and trend analysis over time.
```sql
-- ================================================================
-- Production Monitoring Queries for Dashboards
-- ================================================================

-- =========================
-- Real-Time Health Dashboard
-- =========================

-- PostgreSQL: Current system health
SELECT
    -- Active connections
    (SELECT COUNT(*) FROM pg_stat_activity
     WHERE state = 'active') AS active_queries,

    -- Waiting queries (locked)
    (SELECT COUNT(*) FROM pg_stat_activity
     WHERE wait_event_type = 'Lock') AS waiting_queries,

    -- Long-running queries
    (SELECT COUNT(*) FROM pg_stat_activity
     WHERE state = 'active'
       AND NOW() - query_start > INTERVAL '10 seconds') AS slow_queries,

    -- Cache hit ratio
    (SELECT ROUND(100.0 * SUM(blks_hit) /
                  NULLIF(SUM(blks_hit) + SUM(blks_read), 0), 2)
     FROM pg_stat_database) AS cache_hit_pct,

    -- Deadlocks (cumulative since last stats reset)
    (SELECT deadlocks FROM pg_stat_database
     WHERE datname = current_database()) AS total_deadlocks,

    -- Commits (cumulative; sample twice to derive commits per second)
    (SELECT xact_commit FROM pg_stat_database
     WHERE datname = current_database()) AS total_commits;

-- =========================
-- Time-Series Metrics Collection (for Prometheus/Grafana)
-- =========================

-- PostgreSQL: Metrics endpoint query (run every 15s)
SELECT 'pg_stat_statements_total_time' AS metric,
       SUM(total_exec_time) AS value,
       NOW() AS timestamp
FROM pg_stat_statements
UNION ALL
SELECT 'pg_stat_statements_calls', SUM(calls), NOW()
FROM pg_stat_statements
UNION ALL
SELECT 'pg_active_connections', COUNT(*), NOW()
FROM pg_stat_activity WHERE state = 'active';

-- =========================
-- Daily Performance Summary Report
-- =========================

-- PostgreSQL: Generate daily summary
WITH daily_stats AS (
    SELECT
        DATE_TRUNC('hour', NOW()) AS report_time,
        COUNT(DISTINCT queryid) AS unique_queries,
        SUM(calls) AS total_calls,
        SUM(total_exec_time) / 1000 AS total_time_seconds,
        AVG(mean_exec_time) AS avg_query_time_ms,
        MAX(max_exec_time) AS slowest_query_ms,
        SUM(shared_blks_read) AS total_blocks_read,
        SUM(shared_blks_hit) AS total_blocks_hit
    FROM pg_stat_statements
)
SELECT
    'Daily Summary' AS report_type,
    unique_queries,
    total_calls,
    ROUND(total_time_seconds::NUMERIC, 2) AS total_time_sec,
    ROUND(avg_query_time_ms::NUMERIC, 2) AS avg_time_ms,
    ROUND(slowest_query_ms::NUMERIC, 2) AS max_time_ms,
    ROUND(100.0 * total_blocks_hit /
          NULLIF(total_blocks_hit + total_blocks_read, 0), 2) AS cache_hit_pct
FROM daily_stats;

-- =========================
-- Alerting Thresholds
-- =========================

/*
Recommended alert thresholds (adjust for your workload):

CRITICAL (page immediately):
- p99 latency > 5x baseline
- Active queries > 80% connection limit
- Deadlock rate > 1/hour
- Cache hit ratio < 80%
- Disk queue depth > 100

HIGH (respond within 1 hour):
- p95 latency > 3x baseline
- Slow query count > 10/minute
- CPU utilization > 90% sustained
- Memory pressure events

MEDIUM (review daily):
- p50 latency increasing trend
- Query volume change > 50%
- New query patterns detected
- Index usage changes

LOW (weekly review):
- Statistics updates pending
- Table bloat increasing
- Configuration recommendations
*/

-- =========================
-- SQL Server: Monitoring Views
-- =========================

-- Real-time dashboard query
SELECT
    (SELECT COUNT(*) FROM sys.dm_exec_requests
     WHERE session_id > 50 AND status = 'running') AS running_queries,
    (SELECT COUNT(*) FROM sys.dm_exec_requests
     WHERE session_id > 50 AND blocking_session_id > 0) AS blocked_queries,
    (SELECT COUNT(*) FROM sys.dm_exec_requests
     WHERE session_id > 50 AND total_elapsed_time > 10000) AS slow_queries,
    (SELECT ROUND(100.0 * SUM(CASE WHEN is_modified = 0 THEN 1 ELSE 0 END) / COUNT(*), 2)
     FROM sys.dm_os_buffer_descriptors) AS clean_buffer_pct,
    (SELECT cntr_value FROM sys.dm_os_performance_counters
     WHERE counter_name = 'Page life expectancy'
       AND object_name LIKE '%Buffer Manager%') AS page_life_expectancy;
```

Most organizations use external monitoring platforms (Prometheus/Grafana, Datadog, New Relic, etc.). Export database metrics using pg_exporter (PostgreSQL), sql_exporter, or native integrations. This enables unified dashboards, historical analysis, and cross-system correlation.
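Whichever platform receives the metrics, the suggested thresholds are worth encoding once so every alerting job applies the same rules. A partial sketch covering only the latency, cache, and deadlock rules (cutoffs copied from the recommended thresholds; the function and its signature are hypothetical):

```python
def alert_level(p99_ms, p95_ms, baseline_p99_ms, baseline_p95_ms,
                cache_hit_pct, deadlocks_per_hour):
    """Map a few key metrics to an alert level, per the suggested thresholds."""
    if (p99_ms > 5 * baseline_p99_ms
            or cache_hit_pct < 80
            or deadlocks_per_hour > 1):
        return "CRITICAL"  # Page immediately
    if p95_ms > 3 * baseline_p95_ms:
        return "HIGH"  # Respond within 1 hour
    return "OK"
```

Keeping the thresholds in one function (or one config file) means "start conservative, then lower" is a one-line change rather than an audit of every dashboard.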
Beyond automated monitoring, regular performance reviews ensure that trends are analyzed, priorities are set, and improvements are tracked systematically.
| Review Type | Frequency | Participants | Focus Areas |
|---|---|---|---|
| Incident Review | Per incident | DBA, affected team | Root cause, prevention, action items |
| Weekly Triage | Weekly | DBA, dev leads | Top slow queries, regressions, capacity |
| Sprint Planning | Every sprint | DBA, product team | Performance debt prioritization |
| Monthly Analysis | Monthly | DBA, engineering lead | Trends, forecasting, infrastructure |
| Quarterly Review | Quarterly | DBA, CTO/VP | Budget, roadmap, major investments |
```sql
-- ================================================================
-- Weekly Performance Review Report Queries
-- ================================================================

-- =========================
-- Top 10 Slow Queries This Week vs Last Week
-- =========================

WITH this_week AS (
    SELECT queryid, mean_exec_time, total_exec_time, calls
    FROM pg_stat_statements
),
last_week AS (
    SELECT queryid, mean_exec_time_ms AS mean_exec_time
    FROM baseline_query_stats
    WHERE baseline_id = (
        SELECT baseline_id FROM performance_baselines
        WHERE baseline_name = 'weekly_snapshot_prev'
    )
)
SELECT
    RANK() OVER (ORDER BY tw.total_exec_time DESC) AS this_week_rank,
    tw.queryid,
    ROUND(tw.mean_exec_time::NUMERIC, 2) AS current_avg_ms,
    ROUND(lw.mean_exec_time::NUMERIC, 2) AS last_week_avg_ms,
    ROUND((tw.mean_exec_time / NULLIF(lw.mean_exec_time, 0))::NUMERIC, 2) AS change_factor,
    CASE
        WHEN tw.mean_exec_time > lw.mean_exec_time * 1.5 THEN '⬆️ SLOWER'
        WHEN tw.mean_exec_time < lw.mean_exec_time * 0.67 THEN '⬇️ FASTER'
        ELSE '➡️ STABLE'
    END AS trend,
    tw.calls AS weekly_calls
FROM this_week tw
LEFT JOIN last_week lw ON tw.queryid = lw.queryid
ORDER BY tw.total_exec_time DESC
LIMIT 10;

-- =========================
-- New Query Patterns (First seen this week)
-- =========================

SELECT
    pss.queryid,
    LEFT(pss.query, 100) AS query_preview,
    pss.calls,
    ROUND(pss.mean_exec_time::NUMERIC, 2) AS avg_ms,
    ROUND(pss.total_exec_time::NUMERIC, 2) AS total_ms
FROM pg_stat_statements pss
LEFT JOIN baseline_query_stats bqs
    ON pss.queryid = bqs.queryid
   AND bqs.baseline_id = (
        SELECT baseline_id FROM performance_baselines
        WHERE baseline_name = 'weekly_snapshot_prev'
   )
WHERE bqs.queryid IS NULL  -- Not in last week's baseline
  AND pss.calls > 100      -- Significant usage
ORDER BY pss.total_exec_time DESC
LIMIT 10;

-- =========================
-- Capacity Trending
-- =========================

SELECT 'Database Size' AS metric,
       pg_size_pretty(pg_database_size(current_database())) AS current_value,
       NULL AS week_over_week_change
UNION ALL
SELECT 'Total Queries/Week', SUM(calls)::TEXT, NULL
FROM pg_stat_statements
UNION ALL
SELECT 'Avg Query Time (ms)', ROUND(AVG(mean_exec_time)::NUMERIC, 2)::TEXT, NULL
FROM pg_stat_statements
WHERE calls > 100;

-- =========================
-- Action Items Dashboard
-- =========================

/*
Maintain a table of optimization action items:

CREATE TABLE performance_action_items (
    item_id SERIAL PRIMARY KEY,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    queryid BIGINT,
    description TEXT,
    priority VARCHAR(20),  -- P1, P2, P3
    status VARCHAR(20),    -- Open, In Progress, Done, Wont Fix
    assigned_to VARCHAR(100),
    target_date DATE,
    completed_date DATE,
    resolution_notes TEXT
);

-- Report for weekly review:
SELECT
    priority,
    status,
    COUNT(*) AS items,
    STRING_AGG(item_id::TEXT, ', ' ORDER BY created_at) AS item_ids
FROM performance_action_items
WHERE status NOT IN ('Done', 'Wont Fix')
GROUP BY priority, status
ORDER BY priority, status;
*/
```

Performance knowledge must be captured, shared, and retained. When the one expert who understands a complex optimization leaves, that knowledge shouldn't leave with them.
```sql
-- ================================================================
-- Knowledge Documentation Examples
-- ================================================================

-- =========================
-- In-Database Documentation
-- =========================

-- PostgreSQL: Use COMMENT for index documentation
COMMENT ON INDEX idx_orders_customer_date IS
'Supports: GetCustomerOrders, OrderHistoryReport
Created: 2024-01-15
Reason: Customer order lookups were doing sequential scans
Impact: Reduced GetCustomerOrders from 450ms to 12ms
Review: Check usage quarterly; remove if no longer needed';

COMMENT ON INDEX idx_orders_status_partial IS
'Partial index on status = ''pending''
Supports: PendingOrdersDashboard
Only indexes pending orders (5% of table)
Much smaller than full index, faster updates';

-- =========================
-- Query Runbook Template (Wiki/README)
-- =========================

/*
# Query: MonthlyRevenueReport

## Purpose
Generates monthly revenue summary by product category for finance dashboard.

## Execution Profile
- Typical execution time: 2-5 seconds
- Resource intensity: HIGH (full scan of order_items)
- Frequency: Daily at 6:00 UTC, plus ad-hoc
- User impact: None (background job)

## Dependencies
- Tables: orders, order_items, products, categories
- Indexes: idx_orders_date (critical for date filtering)

## Known Issues
- Slow on first Monday of month due to cold cache
- If > 30s, check for lock contention from inventory jobs

## Optimization History
| Date       | Change               | Before | After |
|------------|----------------------|--------|-------|
| 2023-06-15 | Added covering index | 45s    | 5s    |
| 2023-09-01 | Partitioned orders   | 5s     | 2s    |
| 2024-01-20 | Materialized view    | 2s     | 50ms  |

## Contacts
- Owner: Data Platform Team
- Escalation: #data-platform-oncall
*/

-- =========================
-- Incident Postmortem Template
-- =========================

/*
# Performance Incident: Checkout Latency Spike

## Summary
On 2024-03-15 14:00-14:45 UTC, checkout completion times increased
from 200ms to 8000ms, causing cart abandonment.

## Timeline (UTC)
- 14:00: Deployment of order-service v2.4.1
- 14:05: First automated alert (p95 > 1s)
- 14:12: On-call acknowledged, began investigation
- 14:20: Identified new query pattern from deployment
- 14:35: Rolled back to v2.4.0
- 14:45: Latency returned to normal

## Root Cause
New ORM-generated query in v2.4.1 changed:
FROM: SELECT * FROM orders WHERE id = ?
TO:   SELECT * FROM orders WHERE id = ? ORDER BY created_at

The ORDER BY prevented index-only scan, caused sequential scan
on 10M row orders table.

## Impact
- Duration: 45 minutes
- Affected checkouts: ~2000
- Estimated revenue impact: $50,000

## Resolution
Immediate: Rollback
Permanent: Added index on (id, created_at), modified ORM mapping

## Prevention
- Add query plan regression check to CI/CD pipeline
- Review ORM-generated queries before release
- Add query latency monitoring at application layer
*/
```

Documentation rots quickly. Schedule periodic reviews (quarterly) to update runbooks, verify index usage, and archive obsolete information. Stale documentation is worse than no documentation—it creates false confidence.
Query performance is not a problem to solve once but a characteristic to maintain continuously. The practices covered in this page transform performance work from reactive firefighting into proactive engineering—ensuring your database remains performant as data grows, workloads evolve, and systems change.
Module Complete:
You've now completed the Query Profiling module. You understand how to use profiler tools, interpret execution statistics, identify slow queries, rewrite inefficient SQL, and establish processes for continuous performance improvement. These skills transform you from someone who can optimize individual queries into someone who can build and maintain performant database systems at scale.
Congratulations! You've mastered the complete query profiling discipline—from understanding profiler architecture through establishing organizational practices for sustained performance. You're now equipped to approach database performance as an engineering practice, not just a troubleshooting exercise.