In the film The Matrix, Neo learns to see through the cascading green symbols to perceive the underlying reality of the simulated world. For database engineers, reading execution plans is a similar skill—looking past the rows and tables of EXPLAIN output to perceive the actual flow of computation, the bottlenecks, the inefficiencies, and the opportunities.
An execution plan is not just output to glance at; it's a map of computational work. Every node in the plan represents operations consuming CPU cycles, reading storage pages, and allocating memory. Learning to read this map fluently is perhaps the single most valuable skill for SQL performance engineering.
Most developers look at EXPLAIN output and feel overwhelmed by unfamiliar terms and nested structures. By the end of this page, you'll have a systematic methodology for reading any execution plan—regardless of database vendor—with confidence and precision.
Specifically, you'll learn the tree structure of execution plans, the technique of reading plans from inner to outer operations, how to trace data flow through a plan, the key metrics to focus on at each node, and how to develop intuition for spotting problems at a glance.
Every execution plan is fundamentally a tree data structure. This isn't just an implementation detail—it's the conceptual model you need to internalize for effective plan reading.
The Tree Analogy:
Data flows upward through the tree. Leaf nodes fetch or generate rows, which propagate up through intermediate nodes that filter, join, sort, or aggregate them, until the root node produces the final output.
```sql
-- Query: Find all orders with their customer names for orders over $100
-- EXPLAIN (COSTS OFF) for clarity
EXPLAIN (COSTS OFF)
SELECT c.name, o.order_date, o.total
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.total > 100
ORDER BY o.order_date DESC;

-- Output (simplified):

                      QUERY PLAN
----------------------------------------------------------
 Sort                                    -- ROOT (Level 0)
   Sort Key: o.order_date DESC
   ->  Hash Join                         -- Level 1 (intermediate)
         Hash Cond: (o.customer_id = c.id)
         ->  Seq Scan on orders o        -- Level 2 (leaf)
               Filter: (total > 100)
         ->  Hash                        -- Level 2 (intermediate)
               ->  Seq Scan on customers c   -- Level 3 (leaf)

-- Tree representation:
--            Sort (root)
--             │
--          Hash Join
--          /       \
--     Seq Scan     Hash
--     (orders)      │
--               Seq Scan
--              (customers)
```

Execution Order vs. Display Order:
A critical insight: the display order in EXPLAIN output does NOT match execution order.
Visually, the root appears at the top. But execution proceeds bottom-up:
1. Seq Scan on customers c runs and feeds into Hash
2. Seq Scan on orders o runs with its filter
3. Hash Join combines results from both branches
4. Sort orders the output

The indentation level indicates tree depth: more indented operations are children of less indented operations. The arrows (->) indicate the direction of data flow, from children to parent.
Always read execution plans from the innermost (most indented) operations outward toward the root. This matches the actual execution order: leaf operations run first, feeding their results upward through the tree.
The relationships between operations in an execution plan follow specific patterns. Understanding these patterns is essential for tracing how data moves through the plan.
Unary Operations (One Child): These operations take input from exactly one child and transform it; examples include Filter, Sort, and Aggregate.
Binary Operations (Two Children): These operations combine data from two child operations; the join operators (Nested Loop, Hash Join, Merge Join) are the common examples.
N-ary Operations (Multiple Children):
Some operations (like Append in PostgreSQL) can have many children, typically for partitioned table access or UNION ALL operations; the sketch after the table below shows a simple case.
| Operation | Children | Data Flow Pattern |
|---|---|---|
| Seq Scan (Table Scan) | 0 (leaf) | Reads rows directly from table storage |
| Index Scan | 0 (leaf) | Reads rows via index navigation |
| Index Only Scan | 0 (leaf) | Reads data from index without table access |
| Filter | 1 | Passes through rows matching condition |
| Sort | 1 | Buffers all input, outputs in sorted order |
| Aggregate | 1 | Groups input into summary rows |
| Nested Loop | 2 | For each outer row, scans inner child |
| Hash Join | 2 | Hashes one input, probes with other |
| Merge Join | 2 | Merges two pre-sorted inputs |
| Append | N | Concatenates rows from all children |
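As a concrete illustration of an N-ary operation, the sketch below (hypothetical table names, abbreviated output) shows how PostgreSQL typically plans a UNION ALL as an Append node with one child per branch:

```sql
-- Hypothetical tables; output abbreviated.
EXPLAIN (COSTS OFF)
SELECT id, amount FROM invoices_2023
UNION ALL
SELECT id, amount FROM invoices_2024;

--  Append
--    ->  Seq Scan on invoices_2023
--    ->  Seq Scan on invoices_2024
```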
Join Operations and Inner/Outer Terminology:
In join operations, outer and inner have specific meanings:
For Nested Loop joins: the outer (first) child is scanned once, and the inner (second) child is re-executed for every row the outer child produces.
For Hash Join operations:
- The build side provides rows for hash table construction.
- The probe side is scanned afterward, looking up matches in the hash table.

Understanding which table is in which role is crucial for understanding join performance. If a huge table becomes the inner table of a nested loop, you get quadratic behavior.
Watch for 'Nested Loop' operations where the inner child performs a Seq Scan on a large table. This pattern is acceptable when the loop count is small, but catastrophic when the outer side returns many rows. Each outer row triggers a full scan of the inner table.
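The fragment below is an illustrative sketch of that anti-pattern (invented numbers and table names, not real output), along with one common way to address it:

```sql
-- Illustrative plan fragment: the inner Seq Scan runs once per outer row.
--  Nested Loop  (actual time=0.05..78211.4 rows=19876 loops=1)
--    ->  Index Scan using idx_orders_date on orders
--          (actual rows=20000 loops=1)
--    ->  Seq Scan on order_items            -- full scan, executed 20,000 times
--          Filter: (order_id = orders.id)
--          (actual rows=1 loops=20000)

-- One common fix: index the inner table's join key so the inner side can use
-- an Index Scan instead of repeated full scans.
CREATE INDEX idx_order_items_order_id ON order_items (order_id);
```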
Once you understand the tree structure, the next skill is tracing how data flows through it. This means following the row counts and data transformations from leaves to root.
The Mental Execution Trace:
When reading a plan, mentally execute it: start at the deepest (most indented) leaf operation, then follow the rows upward, node by node, until you reach the root.

Along this path, ask how many rows each node produces, how that count changes relative to its children, and how much work each transformation costs.
```sql
EXPLAIN ANALYZE
SELECT c.name, COUNT(*) as order_count
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.status = 'completed'
GROUP BY c.name
HAVING COUNT(*) > 5;

                              QUERY PLAN
---------------------------------------------------------------
 Filter  (cost=... rows=10) (actual time=... rows=8 loops=1)
   Filter: (count(*) > 5)
   ->  HashAggregate  (cost=... rows=100) (actual time=... rows=92 loops=1)
         Group Key: c.name
         ->  Hash Join  (cost=... rows=5000) (actual time=... rows=4872 loops=1)
               Hash Cond: (o.customer_id = c.id)
               ->  Seq Scan on orders o  (cost=... rows=5000) (actual time=... rows=5127 loops=1)
                     Filter: (status = 'completed'::text)
                     Rows Removed by Filter: 9873
               ->  Hash  (cost=... rows=200) (actual time=... rows=200 loops=1)
                     Buckets: 256
                     ->  Seq Scan on customers c  (cost=... rows=200) (actual time=... rows=200 loops=1)

-- DATA FLOW TRACE:
-- 1. customers scan  → 200 rows (all customers)
-- 2. orders scan     → 15000 rows total, 5127 after filter (status = 'completed')
-- 3. Hash Join       → 4872 rows (matched customer-order pairs)
-- 4. HashAggregate   → 92 rows (92 distinct customer names with orders)
-- 5. Filter (HAVING) → 8 rows (only those with count > 5)
```

Key Metrics to Track at Each Node:
| Metric | What It Tells You | Red Flag Threshold |
|---|---|---|
| Rows | How much data flows through | Large row counts in nested operations |
| Width | Bytes per row | Very wide rows (>500 bytes) through many operations |
| Loops | Times operation executed | High loop counts on expensive operations |
| Rows Removed | Filter effectiveness | Filters removing <10% of rows (cost with little benefit); removing >99% is ideal |
| Actual vs Estimated | Statistics accuracy | >10x difference indicates stale statistics |
The Volume Formula:
Total work at a node ≈ rows × loops × cost_per_row
Even a cheap operation becomes expensive if looped 100,000 times. Even trivial per-row work adds up when there are 100 million rows. Always consider the multiplicative effect.
When EXPLAIN ANALYZE shows actual rows dramatically different from estimated rows, the optimizer made its plan based on wrong assumptions. This is the #1 cause of poor plan selection. Update statistics (ANALYZE in PostgreSQL, ANALYZE TABLE in MySQL) and re-check the plan.
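As a rough sketch of that workflow (PostgreSQL syntax, hypothetical table and column names):

```sql
-- Refresh the planner's statistics for the suspect table ...
ANALYZE orders;

-- ... then re-run the plan and compare estimated rows to actual rows again.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'completed';

-- If estimates are still far off for a skewed column, a larger statistics
-- target gives the planner a more detailed histogram for that column.
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
```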
Execution plans include both cost estimates and (with ANALYZE) actual timing. Understanding how to interpret these metrics correctly is essential for effective performance analysis.
Understanding Cost:
Cost in execution plans is measured in arbitrary units calibrated to sequential page reads. Key points:
```
(cost=startup..total rows=N width=W)
```

- Startup Cost: Work done before the first row is produced
- Total Cost: Work done to produce all rows
For most operations, total cost is what matters. But for operations like Sort or Hash, startup cost is significant because all input must be processed before any output is available.
```sql
-- Notice the startup..total cost format
EXPLAIN ANALYZE
SELECT * FROM large_table
WHERE indexed_column = 'rare_value'
ORDER BY created_at;

                              QUERY PLAN
-----------------------------------------------------------------------
 Sort  (cost=1250.44..1250.69 rows=100 width=200)
       (actual time=24.832..24.951 rows=87 loops=1)
   Sort Key: created_at
   Sort Method: quicksort  Memory: 38kB
   ->  Index Scan using idx_column on large_table
         (cost=0.42..1247.08 rows=100 width=200)
         (actual time=0.028..24.217 rows=87 loops=1)
         Index Cond: (indexed_column = 'rare_value'::text)
 Planning Time: 0.156 ms
 Execution Time: 25.042 ms

-- Analysis:
-- Index Scan:
--   startup=0.42 (minimal), total=1247.08
--   actual time: 0.028ms to first row, 24.217ms for all rows
--   Estimated 100 rows, got 87 (close enough)
--
-- Sort:
--   startup=1250.44 (includes feeding from child + sorting)
--   total=1250.69 (almost the same - output is fast once sorted)
--   actual: 24.832ms startup (had to wait for all input)
--   Memory-based quicksort (good - didn't spill to disk)
```

| Metric | Source | Units | Use For |
|---|---|---|---|
| Startup Cost | Optimizer estimate | Arbitrary (seq page reads) | Understanding blocking operations |
| Total Cost | Optimizer estimate | Arbitrary (seq page reads) | Comparing alternative plans |
| Actual Time (first) | Real execution | Milliseconds | Measuring latency to first row |
| Actual Time (total) | Real execution | Milliseconds | Measuring full execution time |
| Planning Time | Real execution | Milliseconds | Optimizer overhead |
| Execution Time | Real execution | Milliseconds | Total time including all operations |
Interpreting Actual Time:
The actual time shown in EXPLAIN ANALYZE can be confusing: it is reported as `time=startup..total`, where startup is the time to the first row and total is the time to produce all rows, and both values are averages per loop.

Critical Pattern: Identifying Where Time Is Spent:
```
Parent (actual time=0.500..100.000 rows=1000)
  ->  Child (actual time=0.100..98.000 rows=10000)
```
Here, the parent's time (100ms) includes the child's time (98ms). The parent itself only added 2ms. The child is the bottleneck.
To find where time is actually consumed:
Parent-only time = Parent total - sum(Children totals)
If a node shows loops=1000 and time=0.1ms, the actual total time is 100ms. Always multiply time by loops for the true cost. Nested loop joins are the most common source of this multiplication effect.
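A short illustrative fragment (invented numbers) shows why the multiplication matters:

```sql
-- Per-loop time looks negligible until it is multiplied by the loop count.
--   ->  Index Scan using idx_items_order_id on order_items
--         (actual time=0.004..0.100 rows=3 loops=50000)
--
-- Reported actual time is per loop: ~0.100 ms
-- True time in this node ≈ 0.100 ms × 50000 loops = 5000 ms (5 seconds)
```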
With EXPLAIN (ANALYZE, BUFFERS), you get detailed information about memory buffer usage and I/O operations. This is crucial for understanding whether performance problems are CPU-bound or I/O-bound.
PostgreSQL Buffer Statistics:
| Metric | Meaning | Performance Implication |
|---|---|---|
| `shared hit` | Pages found in shared buffer cache | Good: data was in memory; no disk I/O |
| `shared read` | Pages read from disk into cache | Potentially slow: required disk I/O |
| `shared dirtied` | Pages modified in cache | Indicates write activity (updates/inserts) |
| `shared written` | Pages written to disk | I/O-intensive if frequent |
| `local hit`/`read` | Temporary table buffer access | Tracks temp table I/O separately |
| `temp read`/`written` | Explicit temp file access | Bad: sort/hash spilled to disk |
```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT department_id, AVG(salary)
FROM employees
GROUP BY department_id
ORDER BY AVG(salary) DESC;

                              QUERY PLAN
-----------------------------------------------------------------------
 Sort  (cost=450.52..451.27 rows=50 width=40)
       (actual time=12.847..12.891 rows=50 loops=1)
   Sort Key: (avg(salary)) DESC
   Sort Method: quicksort  Memory: 27kB
   Buffers: shared hit=284 read=32      -- Most data cached, some disk reads
   ->  HashAggregate  (cost=447.00..448.25 rows=50 width=40)
                      (actual time=12.156..12.243 rows=50 loops=1)
         Group Key: department_id
         Batches: 1  Memory Usage: 40kB
         Buffers: shared hit=284 read=32
         ->  Seq Scan on employees  (cost=0.00..372.00 rows=15000 width=12)
                                    (actual time=0.015..5.432 rows=15000 loops=1)
               Buffers: shared hit=284 read=32
 Planning:
   Buffers: shared hit=12
 Planning Time: 0.284 ms
 Execution Time: 12.991 ms

-- INTERPRETATION:
-- shared hit=284: 284 pages found in buffer cache (good)
-- shared read=32: 32 pages required disk I/O (some cache misses)
-- No temp read/written: sort and aggregate fit in memory (good)
-- If temp reads appeared, work_mem should be increased
```

Diagnosing I/O Problems:
- High `shared read` relative to `shared hit`: data was not in the buffer cache; either the cache is cold or the working set is larger than shared_buffers, so the query pays for disk I/O.
- Any `temp read`/`temp written` presence: a sort or hash exceeded work_mem and spilled to temporary files on disk.
- High buffer counts on small row counts: the operation touches far more pages than the rows it returns, often a sign of table or index bloat, or of poor data locality.
The Cache Hit Ratio:
Cache hit ratio = shared hit / (shared hit + shared read)
For OLTP workloads, aim for >95% hit ratio. For analytics or cold queries, lower ratios may be acceptable.
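For a database-wide view of the same ratio, PostgreSQL's cumulative counters can be queried directly; this is a minimal sketch using the pg_stat_database view (counters accumulate since the last statistics reset, so it reflects the overall workload rather than any single query):

```sql
-- Database-wide cache hit ratio from cumulative buffer counters.
SELECT datname,
       blks_hit,
       blks_read,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
```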
When you see 'temp read' or 'temp written' in EXPLAIN output, it means the operation ran out of work_mem and spilled to disk. This dramatically slows performance. Either increase work_mem or restructure the query to reduce intermediate result sizes.
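A minimal sketch of that workflow (table name, sizes, and timings are hypothetical):

```sql
-- Before: the sort spills to disk (external merge, temp buffers present).
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events ORDER BY created_at;
--   Sort Method: external merge  Disk: 104448kB
--   Buffers: shared hit=1810 read=9210, temp read=13056 written=13062

-- Raise work_mem for this session only, then re-check the plan.
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events ORDER BY created_at;
--   Sort Method: quicksort  Memory: 98310kB
--   (no temp read/written: the sort now fits in memory)
```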
Rather than randomly scanning execution plans, use a systematic methodology to quickly identify issues:
The 6-Step Plan Reading Process:

1. Bottom line: check total execution time. Is the query slow enough to be worth investigating?
2. Identify the bottleneck: find the node(s) where actual time is concentrated.
3. Cardinality estimates: compare estimated rows against actual rows at each node.
4. Data access: check how each table is read (sequential scan vs. index).
5. Join strategy: verify the join algorithms match the actual cardinalities.
6. Resource usage: review buffers, memory, and any disk spills.
```sql
-- Example of applying the 6-step process
EXPLAIN (ANALYZE, BUFFERS)
SELECT p.name, SUM(oi.quantity * oi.price) as revenue
FROM products p
JOIN order_items oi ON p.id = oi.product_id
JOIN orders o ON oi.order_id = o.id
WHERE o.order_date >= '2024-01-01'
GROUP BY p.name
ORDER BY revenue DESC
LIMIT 10;

-- Step 1: Bottom line
--   Execution Time: 2847.234 ms  -- SLOW! Worth investigating

-- Step 2: Identify bottleneck
--   Looking at actual times shows:
--   Nested Loop: actual time=2400ms  <- BOTTLENECK
--   Index Scan on orders: actual time=200ms (loops=50000)

-- Step 3: Cardinality estimates
--   Nested Loop Join shows: rows=1000 expected, 50000 actual
--   Optimizer underestimated join output by 50x!

-- Step 4: Data access
--   orders: Index Scan using idx_order_date (good)
--   order_items: Seq Scan with filter (suspicious - 50000 loops!)
--   products: Index Scan (good)

-- Step 5: Join strategy
--   Nested Loop chosen, but inner table scanned 50000 times
--   Should be Hash Join instead for this cardinality

-- Step 6: Resource usage
--   Buffers: shared read=45322  -- Heavy disk I/O
--   This correlates with the repeated scanning

-- DIAGNOSIS: Stale statistics led to a bad join-order estimate
-- FIX: ANALYZE order_items; or add index on order_items(order_id)
```

Typically 1-2 operations consume 80%+ of query time. Don't get lost in analyzing every node. Find the bottleneck first, address it, then re-run EXPLAIN. Often fixing one issue dramatically changes the entire plan.
Text-based EXPLAIN output can be challenging to read for complex queries with many joins and subqueries. Development teams often use visual plan analysis tools:
Popular Plan Visualization Tools:
What Visual Tools Reveal:
Good visualization tools provide:
Reading Text Output Efficiently:
Even when visual tools aren't available, you can improve text plan readability:
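One low-tech option is to trim the output itself: EXPLAIN accepts options that suppress detail you don't need on a first pass (a sketch reusing the orders table from earlier examples; option availability varies by PostgreSQL version):

```sql
-- Structure only: no cost estimates, so the tree shape is easier to scan.
EXPLAIN (COSTS OFF)
SELECT * FROM orders WHERE total > 100;

-- Real execution without per-node timing overhead; row counts and buffer
-- usage are often enough to locate the bottleneck.
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM orders WHERE total > 100;
```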
JSON Format for Automation:
```sql
EXPLAIN (FORMAT JSON, ANALYZE) SELECT ...;
```
JSON output is ideal for programmatic analysis, storing plans for later comparison, and feeding plan visualization tools.
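An abbreviated sketch of the JSON form (hypothetical query and index name; the exact field set varies by PostgreSQL version and by which options are enabled):

```sql
EXPLAIN (FORMAT JSON, COSTS OFF)
SELECT name FROM customers WHERE id = 42;

-- Output (abbreviated):
-- [
--   {
--     "Plan": {
--       "Node Type": "Index Scan",
--       "Relation Name": "customers",
--       "Index Name": "customers_pkey"
--     }
--   }
-- ]
-- Child operators, when present, appear in a nested "Plans" array under their
-- parent, mirroring the tree structure of the text format.
```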
Like reading code, plan reading is a skill that improves with practice. Start with visual tools to build intuition, then graduate to text output for speed. The goal is to glance at a plan and immediately see the structure and bottlenecks.
We've developed a comprehensive framework for reading and interpreting execution plans. Let's consolidate the essential insights:
Use EXPLAIN for the optimizer's estimated plan, and EXPLAIN (ANALYZE) for measuring real performance.

What's Next:
Now that you can read execution plans fluently, the next page examines plan operators in detail—the specific operations like Seq Scan, Index Scan, Hash Join, Nested Loop, and Sort. Understanding what each operator does, when it's appropriate, and when it indicates a problem is essential for effective query optimization.
You now have a systematic approach to reading any execution plan. You understand the tree structure, can trace data flow, interpret timing and cost metrics, and analyze buffer statistics. Next, we'll examine the specific operators that appear in plans and what each one means for performance.