Real-world queries rarely filter on a single column. Most selection predicates combine multiple conditions: "Find orders placed by customer X in the last 30 days with status Y." These conjunctive selections—predicates connected by AND—present unique implementation challenges and optimization opportunities.
The challenge arises because multiple filtering conditions can be satisfied through various strategies: using a composite index that covers all predicates, using separate indexes and intersecting results, or using one index followed by in-memory filtering. Each approach has distinct performance characteristics, and the optimal choice depends on selectivity, available indexes, and data distribution.
Mastering conjunctive selection implementation is essential because it represents the most common pattern in production query workloads, and poor handling of multi-predicate filters is a leading cause of query performance problems.
By the end of this page, you will understand how databases evaluate conjunctive predicates, master the strategies for combining multiple index accesses, analyze cost models for index intersection operations, and learn optimal filter ordering for residual predicate evaluation.
A conjunctive selection is a selection operation where the predicate consists of multiple conditions combined with logical AND. Formally:
σ(P₁ ∧ P₂ ∧ ... ∧ Pₙ)(R)
where P₁, P₂, ..., Pₙ are individual predicates. A tuple satisfies the conjunctive selection if and only if it satisfies ALL individual predicates.
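This definition translates directly into code. Below is a minimal sketch in Python, assuming tuples are represented as dicts and predicates as boolean functions (all names illustrative):

```python
# Minimal sketch of conjunctive selection: a tuple qualifies
# iff it satisfies ALL predicates. Python's all() short-circuits,
# mirroring the short-circuit property discussed below.

def conjunctive_select(relation, predicates):
    """Return tuples satisfying every predicate (P1 AND P2 AND ... AND Pn)."""
    return [t for t in relation if all(p(t) for p in predicates)]

orders = [
    {"customer_id": 12345, "status": "shipped", "amount": 250},
    {"customer_id": 12345, "status": "pending", "amount": 900},
    {"customer_id": 99999, "status": "shipped", "amount": 120},
]

predicates = [
    lambda t: t["customer_id"] == 12345,
    lambda t: t["status"] == "shipped",
]

print(conjunctive_select(orders, predicates))
# [{'customer_id': 12345, 'status': 'shipped', 'amount': 250}]
```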
Mathematical Properties:
Conjunctive selections exhibit important properties that enable optimization:
Selectivity Multiplication: If predicates are independent, combined selectivity = s₁ × s₂ × ... × sₙ. For 10% selectivity on each of 3 predicates: 0.1 × 0.1 × 0.1 = 0.1% (1 in 1000 tuples match).
Commutativity and Associativity: Predicates can be reordered and regrouped freely without changing the result (though the order affects performance).
Short-Circuit Potential: Once a tuple fails any predicate, remaining predicates need not be evaluated.
Decomposition: σ(P₁ ∧ P₂)(R) ≡ σ(P₁)(σ(P₂)(R)) ≡ σ(P₂)(σ(P₁)(R)). We can evaluate predicates in any order.
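These properties are easy to check numerically. A small sketch of selectivity multiplication under the independence assumption (synthetic numbers, purely illustrative):

```python
from functools import reduce

# Combined selectivity under the independence assumption:
# s = s1 * s2 * ... * sn
def combined_selectivity(selectivities):
    return reduce(lambda a, b: a * b, selectivities, 1.0)

sels = [0.1, 0.1, 0.1]            # three predicates, 10% selectivity each
s = combined_selectivity(sels)
print(f"combined selectivity = {s:.4f}")                      # 0.0010 -> 0.1%
print(f"expected matches in 1M rows = {s * 1_000_000:.0f}")   # ~1000
```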
```sql
-- Conjunctive Selection Examples

-- Example 1: Three-predicate conjunction
SELECT * FROM orders
WHERE customer_id = 12345              -- P1: Customer filter
  AND order_date > '2024-01-01'        -- P2: Date range
  AND status = 'shipped';              -- P3: Status filter

-- Selectivity analysis (assume independence):
-- P1: 1 customer out of 100,000 = 0.001%
-- P2: Last year out of 10 years = 10%
-- P3: Shipped = 1 of 5 statuses = 20%
-- Combined: 0.001% × 10% × 20% = 0.00002%
-- On 100M orders: ~20 matching rows

-- Example 2: Range conjunctions
SELECT * FROM products
WHERE category = 'Electronics'
  AND price BETWEEN 100 AND 500
  AND rating >= 4.0
  AND in_stock = true;

-- Example 3: Expression conjunctions
SELECT * FROM employees
WHERE department_id = 10
  AND EXTRACT(YEAR FROM hire_date) = 2020
  AND salary * 1.1 > 50000;

-- Note: Expressions involving functions/calculations
-- typically cannot use indexes directly
```

The selectivity multiplication formula assumes predicate independence, which often doesn't hold. Correlated predicates (e.g., 'city' and 'zip_code') have actual selectivity different from the multiplicative estimate. Modern optimizers use multi-column histograms and correlation statistics to improve accuracy, but estimation errors for conjunctive predicates remain a significant source of poor query plans.
Database systems employ several strategies for implementing conjunctive selections, each with distinct trade-offs. The optimizer evaluates available options and chooses based on estimated cost.
Strategy 1: Composite Index Scan
The most efficient approach when a composite index exists on multiple predicate columns:
```sql
-- Strategy 1: Composite Index Scan

-- Composite index covering multiple predicates:
CREATE INDEX idx_cust_date_status
ON orders(customer_id, order_date, status);

-- Query with matching predicates:
SELECT * FROM orders
WHERE customer_id = 12345
  AND order_date > '2024-01-01'
  AND status = 'shipped';

-- Execution:
-- 1. Navigate B+tree to (customer_id=12345, order_date>'2024-01-01')
-- 2. Scan leaf entries, filtering on status='shipped' (ICP if supported)
-- 3. Fetch matching tuples from heap
-- 4. Return results

-- Cost analysis:
-- Index traversal: ~4 pages (tree height)
-- Leaf scan: Proportional to (customer's orders × date selectivity)
-- Heap access: Only for fully matching entries

-- This is optimal because:
-- + All predicates evaluated using index
-- + Minimal index pages scanned
-- + Minimal heap accesses

-- Limitation: Composite index must exist with columns in prefix-compatible order
```

Strategy 2: Single Index + Filter
When only one predicate has an index, use that index and filter the rest:
```sql
-- Strategy 2: Single Index + Filter

-- Only customer_id is indexed:
CREATE INDEX idx_customer ON orders(customer_id);

-- Query with multiple predicates:
SELECT * FROM orders
WHERE customer_id = 12345              -- Index condition
  AND order_date > '2024-01-01'        -- Filter condition 1
  AND status = 'shipped';              -- Filter condition 2

-- Execution:
-- 1. Index scan: Find all entries for customer_id=12345
--    (Returns all orders by this customer, maybe 500 rows)
-- 2. Heap access: Fetch each matching tuple
-- 3. Filter: Apply order_date and status predicates
-- 4. Return only tuples passing all filters

-- PostgreSQL EXPLAIN output:
-- Index Scan using idx_customer on orders
--   Index Cond: (customer_id = 12345)
--   Filter: (order_date > '2024-01-01' AND status = 'shipped')
--   Rows Removed by Filter: 450
-- (shows 500 fetched, 450 filtered out, 50 returned)

-- Cost analysis:
-- Index traversal: ~4 pages
-- Leaf scan: Small for single customer
-- Heap access: 500 random page reads (all customer orders)
-- Filter CPU: 500 predicate evaluations
-- Result: Only 50 rows (if ~10% match additional filters)
```

Strategy 3: Index Intersection (Bitmap AND)
When multiple single-column indexes exist, combine their results:
```sql
-- Strategy 3: Index Intersection / Bitmap AND

-- Separate indexes:
CREATE INDEX idx_customer ON orders(customer_id);
CREATE INDEX idx_date ON orders(order_date);
CREATE INDEX idx_status ON orders(status);

-- Query:
SELECT * FROM orders
WHERE customer_id = 12345
  AND order_date > '2024-01-01'
  AND status = 'shipped';

-- Execution (Bitmap AND approach):
-- 1. Bitmap Index Scan on idx_customer:
--    -> Build bitmap1: Pages containing customer 12345's orders
-- 2. Bitmap Index Scan on idx_date:
--    -> Build bitmap2: Pages with orders after 2024-01-01
-- 3. Bitmap Index Scan on idx_status:
--    -> Build bitmap3: Pages with shipped orders
-- 4. BitmapAnd: Intersect bitmap1 ∩ bitmap2 ∩ bitmap3
--    -> Result bitmap = pages containing potential matches
-- 5. Bitmap Heap Scan: Fetch pages in result bitmap
-- 6. Recheck: Verify all three conditions on each tuple

-- PostgreSQL EXPLAIN output:
-- Bitmap Heap Scan on orders
--   Recheck Cond: (customer_id = 12345 AND order_date > ... AND status = ...)
--   -> BitmapAnd
--        -> Bitmap Index Scan on idx_customer
--             Index Cond: (customer_id = 12345)
--        -> Bitmap Index Scan on idx_date
--             Index Cond: (order_date > '2024-01-01')
--        -> Bitmap Index Scan on idx_status
--             Index Cond: (status = 'shipped')

-- Cost analysis:
-- + Uses all three indexes to narrow candidates
-- + Heap access limited to intersection pages
-- + Sequential heap access (sorted bitmap)
-- - Three index scans required
-- - Bitmap operations have memory/CPU cost
```

| Strategy | When Applicable | Advantages | Disadvantages |
|---|---|---|---|
| Composite Index | Index exists on all/most columns | Single index traversal, minimal I/O | Requires specific index, column order matters |
| Single Index + Filter | Index on most selective column | Uses best available index | Heap access for all index matches |
| Index Intersection | Separate indexes on each column | Combines multiple indexes | Multiple scans, bitmap overhead |
| Sequential + Filter | No useful indexes exist | Predictable sequential I/O | Reads entire table |
Generally, prefer composite indexes when available. Use single index + filter when one predicate is highly selective. Consider index intersection when no single index is selective enough but combining two or more significantly narrows candidates. Fall back to sequential scan when combined selectivity is still high (>20% of table).
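That guidance can be summarized as a rough decision heuristic. The sketch below uses illustrative thresholds and inputs, not values taken from any real optimizer:

```python
def choose_strategy(has_composite_index, best_single_selectivity,
                    combined_selectivity, has_multiple_indexes):
    """Rough heuristic mirroring the guidance above (thresholds illustrative)."""
    if combined_selectivity > 0.20:
        return "sequential scan + filter"    # too many matches for indexes to help
    if has_composite_index:
        return "composite index scan"        # preferred when available
    if best_single_selectivity < 0.01:
        return "single index + filter"       # one predicate is highly selective
    if has_multiple_indexes:
        return "bitmap index intersection"   # combine several weak indexes
    return "sequential scan + filter"

# No composite index, weak single indexes, small combined selectivity:
print(choose_strategy(False, 0.20, 0.05, True))  # bitmap index intersection
```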
The column order in a composite index profoundly affects its usefulness for conjunctive selections. Understanding the principles of column ordering enables designing indexes that serve multiple query patterns efficiently.
The Prefix Rule:
A composite index on (A, B, C) can be used for predicates involving:

- A alone
- A and B
- A, B, and C
But NOT for:

- B alone
- C alone
- B and C (skipping the leading column A)
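The prefix rule itself is mechanical. A small sketch, assuming equality predicates only (range-predicate subtleties ignored):

```python
def usable_prefix(index_columns, predicate_columns):
    """Return the index-column prefix usable for a set of equality predicates.

    The index helps only for the longest leading run of columns that
    all appear in the predicate set (the prefix rule).
    """
    prefix = []
    for col in index_columns:
        if col not in predicate_columns:
            break  # gap in the prefix: remaining columns are unusable
        prefix.append(col)
    return prefix

index_cols = ["A", "B", "C"]
print(usable_prefix(index_cols, {"A", "B"}))  # ['A', 'B'] -> index usable
print(usable_prefix(index_cols, {"B", "C"}))  # []         -> index not usable
```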
Column Ordering Principles:
```sql
-- Column Ordering Impact Analysis

-- Scenario: Queries on (department_id, job_title, salary)

-- Query patterns:
-- Q1: WHERE department_id = 10 AND job_title = 'Engineer'
-- Q2: WHERE department_id = 10 AND salary > 80000
-- Q3: WHERE department_id = 10 AND job_title = 'Engineer' AND salary > 80000

-- Option A: Index on (department_id, job_title, salary)
-- Q1: ✓ Uses (dept, job_title) as equality prefix - excellent
-- Q2: ✓ Uses (dept) prefix, salary as filter - good
-- Q3: ✓ Uses full prefix (dept, job_title), range on salary - excellent

-- Option B: Index on (department_id, salary, job_title)
-- Q1: ✓ Uses (dept), job_title as filter - good (not great)
-- Q2: ✓ Uses (dept) prefix, range on salary - good
-- Q3: ✓ Uses (dept), salary range, job_title as filter - moderate

-- Analysis:
-- Q1 is most frequent -> Option A is better
-- Q1 has two equality predicates -> maximize prefix usage

-- Column ordering for Q3:
-- "dept = 10 AND job_title = 'Engineer' AND salary > 80000"
--
-- With index (dept, job_title, salary):
--   Navigate to (10, 'Engineer', _)
--   Scan forward where salary > 80000
--   Index entries are in salary order → efficient range scan
--
-- With index (dept, salary, job_title):
--   Navigate to (10, 80000, _)
--   Scan forward through all salaries > 80000
--   Filter on job_title = 'Engineer' during scan
--   Much larger scan range!

-- Rule: Equality columns BEFORE range columns in index
```

Covering Index Extensions:
When designing composite indexes for conjunctive selections, consider adding columns that don't participate in filtering but are needed in the SELECT clause. This creates a covering index that eliminates heap access entirely:
```sql
-- Covering Index Design for Conjunctive Selections

-- Query to optimize:
SELECT order_id, total, ship_date
FROM orders
WHERE customer_id = 12345
  AND order_date > '2024-01-01'
  AND status = 'shipped';

-- Basic composite index (filter columns only):
CREATE INDEX idx_filter ON orders(customer_id, order_date, status);
-- Requires heap access for order_id, total, ship_date

-- Covering composite index (includes SELECT columns):
CREATE INDEX idx_covering ON orders(
    customer_id,
    order_date,
    status,
    order_id,   -- Additional for SELECT
    total,      -- Additional for SELECT
    ship_date   -- Additional for SELECT
);
-- Result: Index-Only Scan - no heap access!

-- Tradeoff:
-- + Eliminates all heap I/O for this query
-- + Can be 10-100x faster for large result sets
-- - Larger index on disk
-- - Higher maintenance cost (more columns to update)
-- - May exceed index size limits (some DBs)

-- PostgreSQL INCLUDE syntax (cleaner approach):
CREATE INDEX idx_covering_v2 ON orders(customer_id, order_date, status)
    INCLUDE (order_id, total, ship_date);
-- INCLUDE columns are stored in index but not used for searching
```

Composite indexes accelerate specific query patterns but increase storage requirements and write overhead. Each additional index column increases the per-entry size, reducing entries per page. Write operations must update all affected indexes. The optimal design balances query performance against maintenance cost, typically prioritizing the most frequent and performance-critical queries.
When no single index is sufficiently selective but multiple indexes exist, index intersection combines their results. This technique is particularly powerful for conjunctive selections where each predicate independently matches many rows, but their intersection is small.
How Bitmap AND Works:
The bitmap approach represents matching tuples as bitmaps (one bit per tuple or page), then performs logical AND to find the intersection:
```sql
-- Bitmap AND Internals

-- Table: 1 million rows across 10,000 pages (~100 rows/page)
-- Index 1: idx_status (status = 'active' matches 200,000 rows = 20%)
-- Index 2: idx_region (region = 'EAST' matches 250,000 rows = 25%)

-- Query: WHERE status = 'active' AND region = 'EAST'
-- Expected matches: 20% × 25% = 5% = 50,000 rows (if independent)

-- Bitmap Index Scan 1 (idx_status):
--   Build bitmap where bit[i] = 1 if page i contains 'active' rows
--   Approximately 4,000 pages have at least one 'active' row

-- Bitmap Index Scan 2 (idx_region):
--   Build bitmap where bit[i] = 1 if page i contains 'EAST' rows
--   Approximately 5,000 pages have at least one 'EAST' row

-- BitmapAnd operation:
--   result[i] = bitmap1[i] AND bitmap2[i]
--   Result: ~2,500 pages may contain matching rows

-- Bitmap Heap Scan:
--   Fetch 2,500 pages (in physical order)
--   Recheck both conditions on all tuples in fetched pages
--   Return ~50,000 tuples that satisfy both conditions

-- Memory representation (lossy vs exact):
--   Exact bitmap: 1 bit per tuple (1M bits = 125KB)
--   Lossy bitmap: 1 bit per page (10K bits = 1.25KB)
--   PostgreSQL uses work_mem to decide exact vs lossy
```
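The walkthrough above can be simulated with page-level bitmaps. A sketch using Python sets as lossy page bitmaps (the page numbers are made up):

```python
# Page-level (lossy) bitmaps represented as sets of page ids.
# Each index scan yields the set of pages containing at least one
# entry matching its predicate; BitmapAnd is set intersection.

pages_status = {0, 2, 3, 5, 7, 8}   # pages with at least one 'active' row
pages_region = {1, 2, 3, 6, 7, 9}   # pages with at least one 'EAST' row

candidate_pages = pages_status & pages_region   # BitmapAnd
print(sorted(candidate_pages))                  # [2, 3, 7]

# Bitmap Heap Scan: fetch candidate pages in physical order, then
# RECHECK both predicates on every tuple in those pages, because a
# page-level bitmap only says "some row on this page matched".
for page_id in sorted(candidate_pages):
    pass  # fetch page, re-evaluate status AND region for each tuple
```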
Cost Model for Index Intersection:

| Component | Cost Formula | Notes |
|---|---|---|
| Index Scan 1 | Height₁ + LeafPages₁ | First index traversal + leaf scan |
| Index Scan 2 | Height₂ + LeafPages₂ | Second index traversal + leaf scan |
| Bitmap Construction | O(matches₁) + O(matches₂) | CPU cost to build bitmaps |
| Bitmap AND | O(pages) | Fast bitwise AND operation |
| Heap Scan | PagesInIntersection | Sequential read of matching pages |
| Tuple Recheck | TuplesInPages × PredicateCost | Re-evaluate all predicates |
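These components combine into a simple additive cost function. The sketch below is a toy model with illustrative constants and inputs, not a real optimizer's formula:

```python
def intersection_cost(height1, leaf_pages1, matches1,
                      height2, leaf_pages2, matches2,
                      pages_in_intersection, tuples_in_pages,
                      predicate_cost=0.01, bitmap_cpu_per_match=0.001):
    """Toy cost model for bitmap index intersection. Units are page reads,
    with CPU terms scaled to page-read equivalents (illustrative only)."""
    index_scans = (height1 + leaf_pages1) + (height2 + leaf_pages2)
    bitmap_build = bitmap_cpu_per_match * (matches1 + matches2)
    heap_scan = pages_in_intersection          # sequential fetch of candidates
    recheck = tuples_in_pages * predicate_cost # re-evaluate all predicates
    return index_scans + bitmap_build + heap_scan + recheck

# Numbers loosely based on the two-index scenario discussed below:
cost = intersection_cost(3, 47, 10_000, 3, 37, 8_000,
                         pages_in_intersection=200, tuples_in_pages=2_000)
print(f"estimated cost ≈ {cost:.0f} page-read equivalents")  # ≈ 328
```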
When Index Intersection Wins:
Index intersection is most beneficial when:
Intersection is significantly smaller than individual sets: If idx_A matches 20% and idx_B matches 25%, but intersection is 5%, we reduce heap access by 4-5×.
No composite index exists: If we only have single-column indexes, intersection combines them without requiring a new index.
Heap clustering is poor: Random heap access is expensive; reducing the number of pages fetched has high value.
Bitmaps fit in memory: Exact (tuple-level) bitmaps keep recheck cost low; when memory pressure forces bitmaps to become lossy (page-level rather than tuple-level), every tuple on each candidate page must be rechecked, increasing CPU cost.
```sql
-- When to use Intersection vs Single Index + Filter

-- Scenario: Two predicates, two single-column indexes
-- idx_A: covers predicate A (matches 10,000 rows)
-- idx_B: covers predicate B (matches 8,000 rows)
-- Intersection: ~800 rows (assuming some correlation)

-- Option 1: Use idx_A + Filter B
SELECT * FROM table WHERE A = ... AND B = ...;
-- Cost:
--   Index scan A: ~50 pages
--   Heap access: 10,000 rows → ~2,000 pages (scattered)
--   Filter B: 10,000 predicate evaluations
--   Result: ~800 rows

-- Option 2: Use idx_B + Filter A
-- Cost:
--   Index scan B: ~40 pages
--   Heap access: 8,000 rows → ~1,600 pages (scattered)
--   Filter A: 8,000 predicate evaluations
--   Result: ~800 rows

-- Option 3: Bitmap Index Intersection
-- Cost:
--   Index scan A: ~50 pages
--   Index scan B: ~40 pages
--   Bitmap AND: ~1 page worth of CPU
--   Heap access: ~200 pages (intersection)
--   Recheck: ~2,000 tuples on those pages
--   Result: ~800 rows

-- Analysis:
-- Option 1: 50 + 2000 = 2050 pages
-- Option 2: 40 + 1600 = 1640 pages
-- Option 3: 50 + 40 + 200 = 290 pages

-- Intersection wins dramatically when both indexes combined
-- narrow down to far fewer pages than either alone!
```

The optimizer automatically considers index intersection when it estimates lower cost than alternatives. You can hint toward intersection by setting enable_indexscan = off (forcing bitmap approach) or adjusting random_page_cost to reflect actual I/O characteristics. PostgreSQL tends to prefer bitmap intersection when matching rows are scattered across many pages.
When predicates cannot all be satisfied by indexes, the remaining filter predicates must be evaluated on each candidate tuple. The order in which these filters are evaluated affects CPU cost significantly through short-circuit evaluation.
The Short-Circuit Principle:
For a conjunction P₁ AND P₂ AND P₃: as soon as a tuple fails P₁, there is no need to evaluate P₂ or P₃—the tuple is rejected immediately. The expected evaluation cost per tuple therefore depends on which predicates run first, making filter order a genuine optimization decision.
Optimal Filter Ordering:
The optimal order balances selectivity and evaluation cost. The principle is to minimize expected evaluation cost per tuple:
```sql
-- Filter Predicate Ordering Optimization

-- Three filter predicates to apply:
-- P1: status = 'active'      (selectivity: 20%, cost: 1 unit)
-- P2: amount > 1000          (selectivity: 30%, cost: 1 unit)
-- P3: name LIKE '%smith%'    (selectivity: 5%, cost: 50 units)

-- Evaluation Order Analysis (1000 candidate tuples):

-- Order A: P1 → P2 → P3
--   P1: 1000 × 1 = 1000 cost, 200 pass
--   P2: 200 × 1 = 200 cost, 60 pass
--   P3: 60 × 50 = 3000 cost, 3 pass
--   Total: 4200 cost units

-- Order B: P3 → P1 → P2
--   P3: 1000 × 50 = 50000 cost, 50 pass
--   P1: 50 × 1 = 50 cost, 10 pass
--   P2: 10 × 1 = 10 cost, 3 pass
--   Total: 50060 cost units

-- Order C: P2 → P1 → P3
--   P2: 1000 × 1 = 1000 cost, 300 pass
--   P1: 300 × 1 = 300 cost, 60 pass
--   P3: 60 × 50 = 3000 cost, 3 pass
--   Total: 4300 cost units

-- Optimal ordering formula:
-- For predicates Pᵢ with selectivity sᵢ and cost cᵢ,
-- order by (1 - sᵢ) / cᵢ descending.
-- This maximizes "elimination power per unit cost"

-- P1: (1 - 0.2) / 1 = 0.8      ← First (most efficient)
-- P2: (1 - 0.3) / 1 = 0.7      ← Second
-- P3: (1 - 0.05) / 50 = 0.019  ← Last (expensive and selective)
```

The Ranking Formula:
Rank = (1 - selectivity) / cost
Evaluate predicates in descending rank order. This formula captures:

- Elimination power: (1 - selectivity) is the fraction of tuples the predicate rejects.
- Cost normalization: dividing by evaluation cost yields elimination per unit of CPU work.
A predicate that eliminates 80% of tuples (selectivity 0.2) with cost 1 has rank 0.8. A predicate that eliminates 95% (selectivity 0.05) but costs 50 has rank 0.019.
This evaluation order correctly places cheap eliminators before expensive ones.
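The rule is easy to apply programmatically. The sketch below reproduces the numbers from the example above (selectivities and costs as assumed there):

```python
def order_filters(predicates):
    """Sort predicates by rank = (1 - selectivity) / cost, descending."""
    return sorted(predicates,
                  key=lambda p: (1 - p["sel"]) / p["cost"],
                  reverse=True)

def expected_cost(ordered, n_tuples):
    """Expected CPU cost with short-circuit evaluation."""
    total, survivors = 0.0, n_tuples
    for p in ordered:
        total += survivors * p["cost"]   # evaluate only on surviving tuples
        survivors *= p["sel"]            # only this fraction passes
    return total

preds = [
    {"name": "P1 status='active'",     "sel": 0.20, "cost": 1},
    {"name": "P2 amount>1000",         "sel": 0.30, "cost": 1},
    {"name": "P3 name LIKE '%smith%'", "sel": 0.05, "cost": 50},
]

best = order_filters(preds)
print([p["name"] for p in best])   # P1, P2, P3 — matching the analysis
print(expected_cost(best, 1000))   # 4200.0, matching Order A
```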
| Predicate Type | Relative Cost | Ranking Impact |
|---|---|---|
| Integer equality/comparison | 1× | High rank (cheap + often selective) |
| String equality (short) | 2-3× | Good rank if selective |
| Date/timestamp comparison | 2-4× | Moderate rank |
| String prefix LIKE | 5-10× | Lower rank despite selectivity |
| String contains LIKE | 20-100× | Very low rank, evaluate last |
| Regular expression | 50-200× | Almost always evaluate last |
| User-defined function | 100-1000× | Last unless extremely selective |
Most query optimizers automatically reorder filter predicates based on estimated selectivity and cost. However, the optimizer's estimates may be inaccurate, especially for complex predicates or UDFs. If you observe slow query execution with filters, examine the predicate evaluation order in the execution plan and consider rewriting the query or providing better statistics.
Real-world queries often exhibit complex patterns that go beyond simple multi-predicate AND. Understanding these patterns enables both query writing and index design optimization.
Pattern 1: Mixed Equality and Range
```sql
-- Pattern 1: Mixed Equality and Range Predicates

-- Query: Find recent active premium users
SELECT * FROM users
WHERE account_type = 'premium'                  -- Equality
  AND status = 'active'                         -- Equality
  AND last_login > NOW() - INTERVAL '30 days'   -- Range
  AND subscription_end > NOW();                 -- Range

-- Optimal index design:
-- Put ALL equality columns first, then range columns
CREATE INDEX idx_type_status_login_sub ON users(
    account_type,       -- Equality (first)
    status,             -- Equality (second)
    last_login,         -- Range (third)
    subscription_end    -- Cannot use this column efficiently!
);

-- Why subscription_end isn't useful:
-- After scanning range on last_login, we're already reading
-- non-contiguous index entries. Adding another range column
-- doesn't further narrow the scan.

-- Solution: Evaluate subscription_end as filter, or use separate index:
CREATE INDEX idx_sub_end ON users(subscription_end)
    WHERE account_type = 'premium' AND status = 'active';
-- Partial index for the specific pattern
```

Pattern 2: Multi-Column Range (Spatial-like)
```sql
-- Pattern 2: Multiple Range Predicates (Non-Spatial)

-- Query: Products in price AND rating range
SELECT * FROM products
WHERE price BETWEEN 50 AND 200
  AND rating BETWEEN 4.0 AND 5.0;

-- B+tree limitation:
-- Can efficiently use ONE range predicate, not both
-- With index (price, rating):
--   Scan price range 50-200, filter rating in memory
-- With index (rating, price):
--   Scan rating range 4.0-5.0, filter price in memory
-- Neither perfectly indexes both ranges

-- Solution approaches:

-- 1. Choose index on more selective range:
--    If price range matches 20% and rating range matches 10%,
--    index on (rating) first reduces scan to 10% of table

-- 2. Use index intersection:
CREATE INDEX idx_price ON products(price);
CREATE INDEX idx_rating ON products(rating);
-- BitmapAnd combines both ranges

-- 3. For true 2D ranges, consider spatial index (R-tree):
--    Treat (price, rating) as 2D coordinates
-- CREATE INDEX idx_price_rating_gist ON products
--     USING GIST (point(price, rating));
-- SELECT * FROM products
--     WHERE point(price, rating) <@ box(point(50,4.0), point(200,5.0));
```

Pattern 3: Partial Index for Common Filters
```sql
-- Pattern 3: Partial Indexes for Frequent Conjunctions

-- Common query pattern: active customers only
SELECT * FROM orders
WHERE customer_id = 12345
  AND is_deleted = false;
-- 99% of rows have is_deleted = false

-- Problem with full index:
CREATE INDEX idx_cust ON orders(customer_id);
-- Index includes both deleted and non-deleted orders
-- Must filter is_deleted after index access

-- Solution: Partial index excluding deleted rows
CREATE INDEX idx_cust_active ON orders(customer_id)
    WHERE is_deleted = false;

-- Benefits:
-- + Smaller index (excludes the 1% of deleted rows)
-- + No filtering needed for common case
-- + Index condition automatically satisfied
-- + Query: WHERE customer_id = 12345 AND is_deleted = false
--   → Uses idx_cust_active directly

-- PostgreSQL EXPLAIN shows:
-- Index Scan using idx_cust_active on orders
--   Index Cond: (customer_id = 12345)
-- Note: is_deleted = false is implicit in index definition

-- Combine with additional columns:
CREATE INDEX idx_cust_date_active ON orders(customer_id, order_date)
    WHERE is_deleted = false AND status != 'cancelled';
```

Partial indexes are powerful when queries consistently apply the same filter condition AND that condition rarely changes (like soft-delete flags, status='active', or date ranges for recent data). They reduce index size and maintenance cost while perfectly serving the common query pattern.
Accurate cost estimation for conjunctive selections is challenging because it requires estimating the combined selectivity of multiple predicates. The optimizer must handle predicate correlation, complex expressions, and uncertain statistics.
The Independence Assumption Revisited:
```sql
-- Selectivity Estimation for Conjunctions

-- Given predicates:
-- P1: country = 'USA'        (selectivity s1 = 0.30)
-- P2: state = 'California'   (selectivity s2 = 0.03)

-- Independence assumption:
-- P(P1 AND P2) = s1 × s2 = 0.30 × 0.03 = 0.009 = 0.9%

-- Reality check:
-- IF state = 'California', THEN country MUST = 'USA'
-- P(P1 | P2) = 1.0 (California implies USA)
-- Actual P(P1 AND P2) = s2 = 0.03 = 3%

-- The independence estimate (0.9%) is 3× too low!
-- This causes the optimizer to underestimate result cardinality,
-- potentially choosing wrong access methods.

-- Multi-column statistics can capture correlation:
-- PostgreSQL:
CREATE STATISTICS stat_country_state ON country, state
    FROM customers;
ANALYZE customers;

-- Now the optimizer uses the joint distribution instead of independence

-- View estimated vs actual:
EXPLAIN ANALYZE
SELECT * FROM customers
WHERE country = 'USA' AND state = 'California';
-- Compare 'rows' (estimated) vs 'actual rows'
```
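The same effect can be reproduced on synthetic data. A sketch comparing the independence estimate to the true joint selectivity, using a made-up correlated distribution:

```python
import random

random.seed(42)

# Synthetic, deliberately correlated data: every 'California' row is 'USA'.
rows = []
for _ in range(100_000):
    if random.random() < 0.03:
        rows.append(("USA", "California"))
    elif random.random() < 0.28:            # remaining USA rows, other states
        rows.append(("USA", "Texas"))
    else:
        rows.append(("Canada", "Ontario"))

n = len(rows)
s1 = sum(1 for c, s in rows if c == "USA") / n          # ~0.30
s2 = sum(1 for c, s in rows if s == "California") / n   # ~0.03
actual = sum(1 for c, s in rows if c == "USA" and s == "California") / n

print(f"independence estimate: {s1 * s2:.4f}")   # ~0.009
print(f"actual selectivity:    {actual:.4f}")    # ~0.030 (about 3x higher)
```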
| Method | Description | Accuracy | Overhead |
|---|---|---|---|
| Independence | Multiply individual selectivities | Poor for correlated | None |
| Histogram intersection | Combine single-column histograms | Moderate | Low |
| Multi-column stats | Joint distribution histograms | Good | Storage/analyze cost |
| Sampling | Sample data to estimate combined selectivity | High | Query-time cost |
| Machine learning | Trained model predicts cardinality | Varies | Training cost |
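Of the methods in the table above, sampling is the most direct to illustrate: evaluate the whole conjunction on a random sample and use the observed match fraction. A sketch with synthetic data (all names illustrative):

```python
import random

def estimate_conjunctive_selectivity(table, predicates, sample_size=1000):
    """Sampling-based estimate: run the full conjunction on a random
    sample and return the observed match fraction (illustrative)."""
    sample = random.sample(table, min(sample_size, len(table)))
    matches = sum(1 for row in sample if all(p(row) for p in predicates))
    return matches / len(sample)

# Synthetic rows (dicts) with two independent ~50% attributes:
table = [{"status": random.choice(["active", "inactive"]),
          "region": random.choice(["EAST", "WEST"])} for _ in range(50_000)]

sel = estimate_conjunctive_selectivity(
    table,
    [lambda r: r["status"] == "active", lambda r: r["region"] == "EAST"],
)
print(f"estimated combined selectivity ≈ {sel:.3f}")   # ~0.25
```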
Practical Impact of Estimation Errors:
Poor selectivity estimates for conjunctions cause several problems:
Wrong access method: If estimated selectivity is too low, optimizer may choose index scan when sequential scan would be cheaper.
Wrong index: With multiple index options, incorrect estimates may favor a less efficient index.
Wrong join order: In multi-table queries, conjunction selectivity affects join cardinality estimates, cascading into poor join ordering.
Memory allocation: Hash operations size memory based on estimated cardinality; underestimates cause spilling to disk.
```sql
-- Debugging Cardinality Estimation Issues

-- PostgreSQL: Compare estimates to actuals
EXPLAIN (ANALYZE, VERBOSE)
SELECT * FROM orders
WHERE customer_id = 12345
  AND order_date > '2024-01-01'
  AND status = 'shipped';

-- Look for large differences:
-- Seq Scan on orders (cost=... rows=10 ...) (actual ... rows=500 ...)
--                              ↑                           ↑
--                          estimated              actual (50× difference!)

-- Possible causes:
-- 1. Stale statistics: Run ANALYZE
-- 2. Correlated predicates: Create extended statistics
-- 3. Expression predicates: Optimizer can't estimate functions
-- 4. Parameter sniffing: Plan was optimized for different values

-- Fix attempts:

-- 1. Fresh statistics
ANALYZE orders;

-- 2. Extended statistics for correlation
CREATE STATISTICS order_stats ON customer_id, order_date, status
    FROM orders;
ANALYZE orders;

-- 3. Check n_distinct accuracy
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats WHERE tablename = 'orders';

-- 4. If estimates still wrong, consider:
--    - More frequent ANALYZE (especially for volatile tables)
--    - Larger statistics target:
--      ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;
--    - Query hints to override optimizer (last resort)
```

Estimation errors compound in conjunctions. If each of 3 predicates has a 2× error, the combined error can be 8× or more. This is why conjunctive selections are particularly susceptible to optimization failures. Always validate estimates with EXPLAIN ANALYZE for critical queries.
We have explored conjunctive selection implementation comprehensively, covering implementation strategies, index design, intersection algorithms, and cost estimation. Let us consolidate the essential knowledge:

- Conjunctive selections (predicates joined by AND) are the most common filter pattern; under independence, selectivities multiply, so the combined result is often far smaller than any single predicate suggests.
- Four implementation strategies compete: composite index scan, single index + filter, bitmap index intersection, and sequential scan + filter. The optimizer chooses based on available indexes and estimated selectivity.
- Composite index design follows the prefix rule: place equality columns before range columns, and consider covering (INCLUDE) columns to eliminate heap access.
- Index intersection pays for multiple index scans to shrink heap access to the pages in the intersection, winning when no single index is selective enough.
- Residual filter predicates should be evaluated in descending rank = (1 - selectivity) / cost order to exploit short-circuit evaluation.
- Selectivity estimation for conjunctions is fragile under correlation; extended statistics and EXPLAIN ANALYZE validation catch the worst errors.
What's Next:
Having mastered conjunctive (AND) selections, we proceed to Disjunctive Selection—implementing selections with multiple OR-ed predicates. Disjunctions present fundamentally different optimization challenges, often requiring union strategies rather than intersection.
You now possess expert-level understanding of conjunctive selection implementation—the most common pattern in production query workloads. This knowledge enables designing effective indexes, diagnosing query plan issues, and understanding optimizer decision-making for complex predicates.