In the film The Matrix, Neo learns to see through the cascading green symbols to perceive the underlying reality of the simulated world. For database engineers, reading execution plans is a similar skill—looking past the rows and tables of EXPLAIN output to perceive the actual flow of computation, the bottlenecks, the inefficiencies, and the opportunities.
An execution plan is not just output to glance at; it's a map of computational work. Every node in the plan represents operations consuming CPU cycles, reading storage pages, and allocating memory. Learning to read this map fluently is perhaps the single most valuable skill for SQL performance engineering.
Most developers look at EXPLAIN output and feel overwhelmed by unfamiliar terms and nested structures. By the end of this page, you'll have a systematic methodology for reading any execution plan—regardless of database vendor—with confidence and precision.
Specifically, you'll learn the tree structure of execution plans, the technique of reading plans from inner to outer operations, how to trace data flow through a plan, the key metrics to focus on at each node, and how to develop intuition for spotting problems at a glance.
Every execution plan is fundamentally a tree data structure. This isn't just an implementation detail—it's the conceptual model you need to internalize for effective plan reading.
The Tree Analogy:
Data flows upward through the tree. Leaf nodes fetch or generate rows, which propagate up through intermediate nodes that filter, join, sort, or aggregate them, until the root node produces the final output.
```sql
-- Query: Find all orders with their customer names for orders over $100
-- EXPLAIN (COSTS OFF) for clarity
EXPLAIN (COSTS OFF)
SELECT c.name, o.order_date, o.total
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.total > 100
ORDER BY o.order_date DESC;

-- Output (simplified):

                      QUERY PLAN
----------------------------------------------------------
 Sort                                    -- ROOT (Level 0)
   Sort Key: o.order_date DESC
   ->  Hash Join                         -- Level 1 (intermediate)
         Hash Cond: (o.customer_id = c.id)
         ->  Seq Scan on orders o        -- Level 2 (leaf)
               Filter: (total > 100)
         ->  Hash                        -- Level 2 (intermediate)
               ->  Seq Scan on customers c   -- Level 3 (leaf)

-- Tree representation:
--            Sort (root)
--             │
--          Hash Join
--          /       \
--     Seq Scan     Hash
--     (orders)      │
--               Seq Scan
--              (customers)
```

Execution Order vs. Display Order:
A critical insight: the display order in EXPLAIN output does NOT match execution order.
Visually, the root appears at the top. But execution proceeds bottom-up:
1. Seq Scan on customers c runs and feeds into Hash
2. Seq Scan on orders o runs with its filter
3. Hash Join combines results from both branches
4. Sort orders the output

The indentation level indicates tree depth: more indented operations are children of less indented operations. The arrows (->) indicate the direction of data flow, from children to parent.
Always read execution plans from the innermost (most indented) operations outward toward the root. This matches the actual execution order: leaf operations run first, feeding their results upward through the tree.
The relationships between operations in an execution plan follow specific patterns. Understanding these patterns is essential for tracing how data moves through the plan.
Unary Operations (One Child): These operations take input from exactly one child and transform it; examples include Filter, Sort, and Aggregate.
Binary Operations (Two Children): These operations combine data from two child operations; the join operators (Nested Loop, Hash Join, Merge Join) are the common examples.
N-ary Operations (Multiple Children):
Some operations (like Append in PostgreSQL) can have many children, typically for partitioned table access or UNION ALL operations; the sketch after the table below shows a simple case.
| Operation | Children | Data Flow Pattern |
|---|---|---|
| Seq Scan (Table Scan) | 0 (leaf) | Reads rows directly from table storage |
| Index Scan | 0 (leaf) | Reads rows via index navigation |
| Index Only Scan | 0 (leaf) | Reads data from index without table access |
| Filter | 1 | Passes through rows matching condition |
| Sort | 1 | Buffers all input, outputs in sorted order |
| Aggregate | 1 | Groups input into summary rows |
| Nested Loop | 2 | For each outer row, scans inner child |
| Hash Join | 2 | Hashes one input, probes with other |
| Merge Join | 2 | Merges two pre-sorted inputs |
| Append | N | Concatenates rows from all children |
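As a concrete illustration of an N-ary operation, the sketch below (hypothetical table names, abbreviated output) shows how PostgreSQL typically plans a UNION ALL as an Append node with one child per branch:

```sql
-- Hypothetical tables; output abbreviated.
EXPLAIN (COSTS OFF)
SELECT id, amount FROM invoices_2023
UNION ALL
SELECT id, amount FROM invoices_2024;

--  Append
--    ->  Seq Scan on invoices_2023
--    ->  Seq Scan on invoices_2024
```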
Join Operations and Inner/Outer Terminology:
In join operations, outer and inner have specific meanings:
For Nested Loop joins: the outer (first) child is scanned once, and the inner (second) child is re-executed for every row the outer child produces.
For Hash Join operations:
- The build side provides rows for hash table construction.
- The probe side is scanned afterward, looking up matches in the hash table.

Understanding which table is in which role is crucial for understanding join performance. If a huge table becomes the inner table of a nested loop, you get quadratic behavior.
Watch for 'Nested Loop' operations where the inner child performs a Seq Scan on a large table. This pattern is acceptable when the loop count is small, but catastrophic when the outer side returns many rows. Each outer row triggers a full scan of the inner table.
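The fragment below is an illustrative sketch of that anti-pattern (invented numbers and table names, not real output), along with one common way to address it:

```sql
-- Illustrative plan fragment: the inner Seq Scan runs once per outer row.
--  Nested Loop  (actual time=0.05..78211.4 rows=19876 loops=1)
--    ->  Index Scan using idx_orders_date on orders
--          (actual rows=20000 loops=1)
--    ->  Seq Scan on order_items            -- full scan, executed 20,000 times
--          Filter: (order_id = orders.id)
--          (actual rows=1 loops=20000)

-- One common fix: index the inner table's join key so the inner side can use
-- an Index Scan instead of repeated full scans.
CREATE INDEX idx_order_items_order_id ON order_items (order_id);
```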
Once you understand the tree structure, the next skill is tracing how data flows through it. This means following the row counts and data transformations from leaves to root.
The Mental Execution Trace:
When reading a plan, mentally execute it: start at the deepest (most indented) leaf operation, then follow the rows upward, node by node, until you reach the root.

Along this path, ask how many rows each node produces, how that count changes relative to its children, and how much work each transformation costs.
```sql
EXPLAIN ANALYZE
SELECT c.name, COUNT(*) as order_count
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE o.status = 'completed'
GROUP BY c.name
HAVING COUNT(*) > 5;

                              QUERY PLAN
---------------------------------------------------------------
 Filter  (cost=... rows=10) (actual time=... rows=8 loops=1)
   Filter: (count(*) > 5)
   ->  HashAggregate  (cost=... rows=100) (actual time=... rows=92 loops=1)
         Group Key: c.name
         ->  Hash Join  (cost=... rows=5000) (actual time=... rows=4872 loops=1)
               Hash Cond: (o.customer_id = c.id)
               ->  Seq Scan on orders o  (cost=... rows=5000) (actual time=... rows=5127 loops=1)
                     Filter: (status = 'completed'::text)
                     Rows Removed by Filter: 9873
               ->  Hash  (cost=... rows=200) (actual time=... rows=200 loops=1)
                     Buckets: 256
                     ->  Seq Scan on customers c  (cost=... rows=200) (actual time=... rows=200 loops=1)

-- DATA FLOW TRACE:
-- 1. customers scan  → 200 rows (all customers)
-- 2. orders scan     → 15000 rows total, 5127 after filter (status = 'completed')
-- 3. Hash Join       → 4872 rows (matched customer-order pairs)
-- 4. HashAggregate   → 92 rows (92 distinct customer names with orders)
-- 5. Filter (HAVING) → 8 rows (only those with count > 5)
```

Key Metrics to Track at Each Node:
| Metric | What It Tells You | Red Flag Threshold |
|---|---|---|
| Rows | How much data flows through | Large row counts in nested operations |
| Width | Bytes per row | Very wide rows (>500 bytes) through many operations |
| Loops | Times operation executed | High loop counts on expensive operations |
| Rows Removed | Filter effectiveness | Filters removing <10% of rows (cost with little benefit); removing >99% is ideal |
| Actual vs Estimated | Statistics accuracy | >10x difference indicates stale statistics |
The Volume Formula:
Total work at a node ≈ rows × loops × cost_per_row
Even a cheap operation becomes expensive if looped 100,000 times. Even trivial per-row work adds up when there are 100 million rows. Always consider the multiplicative effect.
When EXPLAIN ANALYZE shows actual rows dramatically different from estimated rows, the optimizer made its plan based on wrong assumptions. This is the #1 cause of poor plan selection. Update statistics (ANALYZE in PostgreSQL, ANALYZE TABLE in MySQL) and re-check the plan.
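As a rough sketch of that workflow (PostgreSQL syntax, hypothetical table and column names):

```sql
-- Refresh the planner's statistics for the suspect table ...
ANALYZE orders;

-- ... then re-run the plan and compare estimated rows to actual rows again.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'completed';

-- If estimates are still far off for a skewed column, a larger statistics
-- target gives the planner a more detailed histogram for that column.
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;
```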
Execution plans include both cost estimates and (with ANALYZE) actual timing. Understanding how to interpret these metrics correctly is essential for effective performance analysis.
Understanding Cost:
Cost in execution plans is measured in arbitrary units calibrated to sequential page reads. Key points:
```
(cost=startup..total rows=N width=W)
```

- Startup Cost: Work done before the first row is produced
- Total Cost: Work done to produce all rows
For most operations, total cost is what matters. But for operations like Sort or Hash, startup cost is significant because all input must be processed before any output is available.
```sql
-- Notice the startup..total cost format
EXPLAIN ANALYZE
SELECT * FROM large_table
WHERE indexed_column = 'rare_value'
ORDER BY created_at;

                              QUERY PLAN
-----------------------------------------------------------------------
 Sort  (cost=1250.44..1250.69 rows=100 width=200)
       (actual time=24.832..24.951 rows=87 loops=1)
   Sort Key: created_at
   Sort Method: quicksort  Memory: 38kB
   ->  Index Scan using idx_column on large_table
         (cost=0.42..1247.08 rows=100 width=200)
         (actual time=0.028..24.217 rows=87 loops=1)
         Index Cond: (indexed_column = 'rare_value'::text)
 Planning Time: 0.156 ms
 Execution Time: 25.042 ms

-- Analysis:
-- Index Scan:
--   startup=0.42 (minimal), total=1247.08
--   actual time: 0.028ms to first row, 24.217ms for all rows
--   Estimated 100 rows, got 87 (close enough)
--
-- Sort:
--   startup=1250.44 (includes feeding from child + sorting)
--   total=1250.69 (almost the same - output is fast once sorted)
--   actual: 24.832ms startup (had to wait for all input)
--   Memory-based quicksort (good - didn't spill to disk)
```

| Metric | Source | Units | Use For |
|---|---|---|---|
| Startup Cost | Optimizer estimate | Arbitrary (seq page reads) | Understanding blocking operations |
| Total Cost | Optimizer estimate | Arbitrary (seq page reads) | Comparing alternative plans |
| Actual Time (first) | Real execution | Milliseconds | Measuring latency to first row |
| Actual Time (total) | Real execution | Milliseconds | Measuring full execution time |
| Planning Time | Real execution | Milliseconds | Optimizer overhead |
| Execution Time | Real execution | Milliseconds | Total time including all operations |
Interpreting Actual Time:
The actual time shown in EXPLAIN ANALYZE can be confusing: it is reported as `time=startup..total`, where startup is the time to the first row and total is the time to produce all rows, and both values are averages per loop.

Critical Pattern: Identifying Where Time Is Spent:
```
Parent (actual time=0.500..100.000 rows=1000)
  ->  Child (actual time=0.100..98.000 rows=10000)
```
Here, the parent's time (100ms) includes the child's time (98ms). The parent itself only added 2ms. The child is the bottleneck.
To find where time is actually consumed:
Parent-only time = Parent total - sum(Children totals)
If a node shows loops=1000 and time=0.1ms, the actual total time is 100ms. Always multiply time by loops for the true cost. Nested loop joins are the most common source of this multiplication effect.
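A short illustrative fragment (invented numbers) shows why the multiplication matters:

```sql
-- Per-loop time looks negligible until it is multiplied by the loop count.
--   ->  Index Scan using idx_items_order_id on order_items
--         (actual time=0.004..0.100 rows=3 loops=50000)
--
-- Reported actual time is per loop: ~0.100 ms
-- True time in this node ≈ 0.100 ms × 50000 loops = 5000 ms (5 seconds)
```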
With EXPLAIN (ANALYZE, BUFFERS), you get detailed information about memory buffer usage and I/O operations. This is crucial for understanding whether performance problems are CPU-bound or I/O-bound.
PostgreSQL Buffer Statistics:
| Metric | Meaning | Performance Implication |
|---|---|---|
| `shared hit` | Pages found in shared buffer cache | Good: data was in memory; no disk I/O |
| `shared read` | Pages read from disk into cache | Potentially slow: required disk I/O |
| `shared dirtied` | Pages modified in cache | Indicates write activity (updates/inserts) |
| `shared written` | Pages written to disk | I/O-intensive if frequent |
| `local hit`/`read` | Temporary table buffer access | Tracks temp table I/O separately |
| `temp read`/`written` | Explicit temp file access | Bad: sort/hash spilled to disk |
```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT department_id, AVG(salary)
FROM employees
GROUP BY department_id
ORDER BY AVG(salary) DESC;

                              QUERY PLAN
-----------------------------------------------------------------------
 Sort  (cost=450.52..451.27 rows=50 width=40)
       (actual time=12.847..12.891 rows=50 loops=1)
   Sort Key: (avg(salary)) DESC
   Sort Method: quicksort  Memory: 27kB
   Buffers: shared hit=284 read=32      -- Most data cached, some disk reads
   ->  HashAggregate  (cost=447.00..448.25 rows=50 width=40)
                      (actual time=12.156..12.243 rows=50 loops=1)
         Group Key: department_id
         Batches: 1  Memory Usage: 40kB
         Buffers: shared hit=284 read=32
         ->  Seq Scan on employees  (cost=0.00..372.00 rows=15000 width=12)
                                    (actual time=0.015..5.432 rows=15000 loops=1)
               Buffers: shared hit=284 read=32
 Planning:
   Buffers: shared hit=12
 Planning Time: 0.284 ms
 Execution Time: 12.991 ms

-- INTERPRETATION:
-- shared hit=284: 284 pages found in buffer cache (good)
-- shared read=32: 32 pages required disk I/O (some cache misses)
-- No temp read/written: sort and aggregate fit in memory (good)
-- If temp reads appeared, work_mem should be increased
```

Diagnosing I/O Problems:
- High `shared read` relative to `shared hit`: data was not in the buffer cache; either the cache is cold or the working set is larger than shared_buffers, so the query pays for disk I/O.
- Any `temp read`/`temp written` presence: a sort or hash exceeded work_mem and spilled to temporary files on disk.
- High buffer counts on small row counts: the operation touches far more pages than the rows it returns, often a sign of table or index bloat, or of poor data locality.
The Cache Hit Ratio:
Cache hit ratio = shared hit / (shared hit + shared read)
For OLTP workloads, aim for >95% hit ratio. For analytics or cold queries, lower ratios may be acceptable.
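For a database-wide view of the same ratio, PostgreSQL's cumulative counters can be queried directly; this is a minimal sketch using the pg_stat_database view (counters accumulate since the last statistics reset, so it reflects the overall workload rather than any single query):

```sql
-- Database-wide cache hit ratio from cumulative buffer counters.
SELECT datname,
       blks_hit,
       blks_read,
       round(blks_hit::numeric / nullif(blks_hit + blks_read, 0), 4) AS cache_hit_ratio
FROM pg_stat_database
WHERE datname = current_database();
```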
When you see 'temp read' or 'temp written' in EXPLAIN output, it means the operation ran out of work_mem and spilled to disk. This dramatically slows performance. Either increase work_mem or restructure the query to reduce intermediate result sizes.
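A minimal sketch of that workflow (table name, sizes, and timings are hypothetical):

```sql
-- Before: the sort spills to disk (external merge, temp buffers present).
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events ORDER BY created_at;
--   Sort Method: external merge  Disk: 104448kB
--   Buffers: shared hit=1810 read=9210, temp read=13056 written=13062

-- Raise work_mem for this session only, then re-check the plan.
SET work_mem = '256MB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events ORDER BY created_at;
--   Sort Method: quicksort  Memory: 98310kB
--   (no temp read/written: the sort now fits in memory)
```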
Rather than randomly scanning execution plans, use a systematic methodology to quickly identify issues:
The 6-Step Plan Reading Process:

1. Bottom line: check total execution time. Is the query slow enough to be worth investigating?
2. Identify the bottleneck: find the node(s) where actual time is concentrated.
3. Cardinality estimates: compare estimated rows against actual rows at each node.
4. Data access: check how each table is read (sequential scan vs. index).
5. Join strategy: verify the join algorithms match the actual cardinalities.
6. Resource usage: review buffers, memory, and any disk spills.
```sql
-- Example of applying the 6-step process
EXPLAIN (ANALYZE, BUFFERS)
SELECT p.name, SUM(oi.quantity * oi.price) as revenue
FROM products p
JOIN order_items oi ON p.id = oi.product_id
JOIN orders o ON oi.order_id = o.id
WHERE o.order_date >= '2024-01-01'
GROUP BY p.name
ORDER BY revenue DESC
LIMIT 10;

-- Step 1: Bottom line
--   Execution Time: 2847.234 ms  -- SLOW! Worth investigating

-- Step 2: Identify bottleneck
--   Looking at actual times shows:
--   Nested Loop: actual time=2400ms  <- BOTTLENECK
--   Index Scan on orders: actual time=200ms (loops=50000)

-- Step 3: Cardinality estimates
--   Nested Loop Join shows: rows=1000 expected, 50000 actual
--   Optimizer underestimated join output by 50x!

-- Step 4: Data access
--   orders: Index Scan using idx_order_date (good)
--   order_items: Seq Scan with filter (suspicious - 50000 loops!)
--   products: Index Scan (good)

-- Step 5: Join strategy
--   Nested Loop chosen, but inner table scanned 50000 times
--   Should be Hash Join instead for this cardinality

-- Step 6: Resource usage
--   Buffers: shared read=45322  -- Heavy disk I/O
--   This correlates with the repeated scanning

-- DIAGNOSIS: Stale statistics led to a bad join-order estimate
-- FIX: ANALYZE order_items; or add index on order_items(order_id)
```

Typically 1-2 operations consume 80%+ of query time. Don't get lost in analyzing every node. Find the bottleneck first, address it, then re-run EXPLAIN. Often fixing one issue dramatically changes the entire plan.
Text-based EXPLAIN output can be challenging to read for complex queries with many joins and subqueries. Development teams often use visual plan analysis tools:
Popular Plan Visualization Tools:
What Visual Tools Reveal:
Good visualization tools provide:
Reading Text Output Efficiently:
Even when visual tools aren't available, you can improve text plan readability:
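One low-tech option is to trim the output itself: EXPLAIN accepts options that suppress detail you don't need on a first pass (a sketch reusing the orders table from earlier examples; option availability varies by PostgreSQL version):

```sql
-- Structure only: no cost estimates, so the tree shape is easier to scan.
EXPLAIN (COSTS OFF)
SELECT * FROM orders WHERE total > 100;

-- Real execution without per-node timing overhead; row counts and buffer
-- usage are often enough to locate the bottleneck.
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM orders WHERE total > 100;
```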
JSON Format for Automation:
```sql
EXPLAIN (FORMAT JSON, ANALYZE) SELECT ...;
```
JSON output is ideal for programmatic analysis, storing plans for later comparison, and feeding plan visualization tools.
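An abbreviated sketch of the JSON form (hypothetical query and index name; the exact field set varies by PostgreSQL version and by which options are enabled):

```sql
EXPLAIN (FORMAT JSON, COSTS OFF)
SELECT name FROM customers WHERE id = 42;

-- Output (abbreviated):
-- [
--   {
--     "Plan": {
--       "Node Type": "Index Scan",
--       "Relation Name": "customers",
--       "Index Name": "customers_pkey"
--     }
--   }
-- ]
-- Child operators, when present, appear in a nested "Plans" array under their
-- parent, mirroring the tree structure of the text format.
```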
Like reading code, plan reading is a skill that improves with practice. Start with visual tools to build intuition, then graduate to text output for speed. The goal is to glance at a plan and immediately see the structure and bottlenecks.
We've developed a comprehensive framework for reading and interpreting execution plans. Let's consolidate the essential insights:
Use EXPLAIN for the optimizer's estimated plan, and EXPLAIN (ANALYZE) for measuring real performance.

What's Next:
Now that you can read execution plans fluently, the next page examines plan operators in detail—the specific operations like Seq Scan, Index Scan, Hash Join, Nested Loop, and Sort. Understanding what each operator does, when it's appropriate, and when it indicates a problem is essential for effective query optimization.
You now have a systematic approach to reading any execution plan. You understand the tree structure, can trace data flow, interpret timing and cost metrics, and analyze buffer statistics. Next, we'll examine the specific operators that appear in plans and what each one means for performance.