Multi Column Indexes - Learning Module

Loading content...

0/252

Prefix Matching

One Index, Many Queries

One of the most powerful aspects of composite indexes is their ability to serve multiple query patterns through a single index structure. A well-designed composite index on (A, B, C, D) doesn't just accelerate queries filtering on all four columns—it can also efficiently serve queries on (A), (A, B), and (A, B, C).

This capability is called prefix matching, and mastering it allows database designers to minimize index proliferation while maximizing query coverage. Instead of creating four separate indexes, one strategic composite index can serve all the queries efficiently.

This page explores prefix matching in depth: how it works, how to design indexes to leverage it, and how to analyze which prefixes your queries actually require.

What You Will Learn

By the end of this page, you will understand how prefix matching works at the B+-tree level, how to design indexes that serve multiple query patterns, techniques for analyzing query prefix requirements, and the trade-offs between prefix-based index consolidation and specialized indexes.

Understanding Prefix Matching

Prefix matching refers to the database engine's ability to use a composite index even when the query doesn't filter on all indexed columns—as long as the query uses a contiguous prefix of the index's column list.

Formal Definition:

For a composite index I on columns (C₁, C₂, C₃, ..., Cₙ), any query that filters on columns (C₁) or (C₁, C₂) or (C₁, C₂, C₃) or ... or (C₁, C₂, ..., Cₖ) where k ≤ n can utilize the index efficiently.

The key requirement is contiguity from the left—no gaps are allowed in the column sequence.

Converting Mermaid diagram...

Prefix Matching Examples - Index on (customer_id, order_date, product_id, quantity)
Query WHERE Clause	Prefix Used	Columns Leveraged	Index Usable?
customer_id = 100	(customer_id)	1 of 4	✅ Yes
customer_id = 100 AND order_date = '2023-01-15'	(customer_id, order_date)	2 of 4	✅ Yes
customer_id = 100 AND order_date = '2023-01-15' AND product_id = 50	(..., product_id)	3 of 4	✅ Yes
customer_id = 100 AND order_date BETWEEN ... AND product_id = 50	First 2 only	2 of 4 (range stops)	⚠️ Partial
order_date = '2023-01-15'	None (skips customer_id)	0 of 4	❌ No
customer_id = 100 AND product_id = 50	(customer_id) only	1 of 4 (gap)	⚠️ Partial
product_id = 50 AND quantity > 10	None	0 of 4	❌ No

Why Prefix Matching Works

Prefix matching works because of the hierarchical lexicographic ordering in B+-tree composite indexes. Understanding this mechanism reveals why prefixes work and gaps don't.

The Hierarchical Sort Structure:

In a B+-tree for index (A, B, C), entries are sorted as:

(1, 'a', 100) < (1, 'a', 200) < (1, 'b', 50) < (1, 'b', 150) < 
(2, 'a', 75) < (2, 'a', 300) < (2, 'b', 25) < ...

All entries with A=1 form a contiguous block. Within that block, entries with (A=1, B='a') form another contiguous sub-block. This hierarchical contiguity enables prefix matching.

prefix_matching_visualization.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
// Index (department_id, hire_year, employee_id) - Sorted entries
 
// All dept_id = 10 entries are CONTIGUOUS:
(10, 2020, 1001)
(10, 2020, 1002)
(10, 2021, 1003)
(10, 2021, 1004)
(10, 2022, 1005)
// ← Query "WHERE dept_id = 10" scans this contiguous range
 
// All dept_id = 20 entries are CONTIGUOUS:
(20, 2019, 1010)
(20, 2020, 1011)
(20, 2021, 1012)
// ← Query "WHERE dept_id = 20" scans this contiguous range
 
// All (dept_id = 10, hire_year = 2021) entries are CONTIGUOUS:
(10, 2021, 1003)
(10, 2021, 1004)
// ← Query "WHERE dept_id = 10 AND hire_year = 2021" scans this range
 
// BUT hire_year = 2021 entries are NOT CONTIGUOUS (scattered):
(10, 2021, 1003)  // dept 10
(10, 2021, 1004)  // dept 10
(20, 2021, 1012)  // dept 20 - in a different part of the index
(30, 2021, 1020)  // dept 30 - even further away
// ← Query "WHERE hire_year = 2021" CANNOT use this index efficiently

The Contiguity Principle

A prefix (C₁, C₂, ..., Cₖ) defines a contiguous range in the index. Any column predicate beyond the prefix breaks contiguity because the remaining columns are only sorted WITHIN each prefix combination, not globally across the index.

Range Scan Behavior:

When a query uses a k-column prefix, the database:

Navigates to the first matching entry using B+-tree descent
Scans forward through contiguous entries matching the prefix
Stops when it encounters an entry that no longer matches

The efficiency comes from the guarantee that all matching entries are adjacent—no scattered lookups are needed.

What Happens with Gaps:

For a query with gaps (like A=1 AND C=100 on index (A, B, C)):

The database navigates to A=1 (using the 1-column prefix)
Within the A=1 range, C values are scattered across different B values
The C=100 predicate must be checked for EVERY entry in the A=1 range
This is a filter, not an index range scan on C

Designing for Maximum Prefix Coverage

Strategic index design aims to maximize the number of query patterns served by each composite index through prefix matching. This reduces index count, storage overhead, and maintenance burden.

The Query Pattern Analysis Process:

Inventory your queries: List all WHERE clause patterns for the table
Identify column groups: Find which columns frequently appear together
Order by frequency: Determine which prefixes are most commonly needed
Design encompassing indexes: Create indexes whose prefixes cover multiple patterns

Multi-Pattern Index DesignOrders table with various query patterns:

Input

Output

prefix_design_strategy.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
-- Naive approach: Create separate indexes for each pattern
CREATE INDEX idx_q1 ON orders (customer_id);                              -- Query 1
CREATE INDEX idx_q2 ON orders (customer_id, order_date);                  -- Query 2
CREATE INDEX idx_q3 ON orders (customer_id, order_date, status);          -- Query 3
CREATE INDEX idx_q4 ON orders (customer_id, status);                      -- Query 4
-- Result: 4 indexes, significant storage and maintenance overhead
 
-- Strategic approach: Use prefix matching
CREATE INDEX idx_orders_main ON orders (customer_id, order_date, status);
 
-- This single index serves:
-- Query 1: Uses (customer_id) prefix ✅
-- Query 2: Uses (customer_id, order_date) prefix ✅
-- Query 3: Uses full index ✅
-- Query 4: Uses (customer_id) prefix, then filters status ⚠️
 
-- For Query 4 to be fully optimized, we might add:
CREATE INDEX idx_orders_status ON orders (customer_id, status);
-- Now Query 4 uses (customer_id, status) directly ✅
 
-- Result: 2 indexes instead of 4, with full coverage

The Prefix Subsumption Rule

An index on (A, B, C) makes a separate index on (A) or (A, B) redundant—the composite index subsumes the shorter prefixes. When designing indexes, check if existing composite indexes already cover the prefix you need.

Prefix Length and Performance

The length of the prefix used directly impacts query performance. Longer prefixes (using more index columns) generally provide better selectivity and fewer rows examined.

Performance Hierarchy:

Prefix Length	Selectivity	Rows Examined	I/O Cost
1 column	Low	Many	Higher
2 columns	Medium	Fewer	Medium
3 columns	High	Fewest	Lower
Full index	Highest	Minimal	Lowest

prefix_length_impact.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
-- Table: orders (10 million rows)
-- Index: (customer_id, order_date, product_category, status)
-- Column cardinalities:
--   customer_id: 100,000 distinct values
--   order_date: 2,000 distinct values  
--   product_category: 50 distinct values
--   status: 5 distinct values
 
-- Query 1: 1-column prefix
SELECT * FROM orders WHERE customer_id = 12345;
-- Rows examined: ~100 (10M / 100K customers)
-- Selectivity: 0.001%
 
-- Query 2: 2-column prefix
SELECT * FROM orders 
WHERE customer_id = 12345 AND order_date = '2023-06-15';
-- Rows examined: ~5 (100 / 20 dates per customer)
-- Selectivity: 0.00005%
 
-- Query 3: 3-column prefix
SELECT * FROM orders 
WHERE customer_id = 12345 
  AND order_date = '2023-06-15' 
  AND product_category = 'Electronics';
-- Rows examined: ~1 (5 / 5 categories)
-- Selectivity: 0.000001%
 
-- Query 4: Full index (4 columns)
SELECT * FROM orders 
WHERE customer_id = 12345 
  AND order_date = '2023-06-15' 
  AND product_category = 'Electronics'
  AND status = 'shipped';
-- Rows examined: ~0.2 (essentially point lookup)
-- Selectivity: 0.00000002%
 
-- The multiplicative selectivity improvement is dramatic!

Prefix Length Trade-offs
Consideration	Shorter Prefix	Longer Prefix
Query Flexibility	More queries can use it	Fewer queries use all columns
Selectivity	Lower (more rows examined)	Higher (fewer rows examined)
Index Maintenance	NA (same index)	NA (same index)
Query Plan Stability	More predictable	Depends on statistics accuracy
Range Query Impact	Stops at first range	Stops at first range

Diminishing Returns

Beyond 3-4 columns, the additional selectivity benefit often diminishes. Very long composite indexes (5+ columns) rarely provide proportional performance improvement and increase storage/maintenance overhead. Consider covering index techniques instead for additional columns.

Prefix Matching with ORDER BY

Prefix matching extends to ORDER BY clauses. A composite index can satisfy sorting requirements if the sort columns form a prefix of the index—or if they follow the filter prefix.

The Extended Prefix Rule for Sorting:

If a query filters on columns (C₁, ..., Cₖ) and sorts on columns (Cₖ₊₁, ..., Cₘ), the index can satisfy both if (C₁, ..., Cₘ) forms a prefix of the index definition.

ORDER BY Prefix Matching - Index on (A, B, C, D)
Query Pattern	Filter Prefix	Sort Prefix	Can Avoid Filesort?
WHERE A=? ORDER BY B	(A)	(A, B)	✅ Yes
WHERE A=? ORDER BY B, C	(A)	(A, B, C)	✅ Yes
WHERE A=? AND B=? ORDER BY C	(A, B)	(A, B, C)	✅ Yes
WHERE A=? ORDER BY C	(A)	(A, C) gap!	❌ No - gap in sort
WHERE A=? ORDER BY D	(A)	(A, D) gap!	❌ No - skips B, C
ORDER BY A, B	None	(A, B)	✅ Yes
ORDER BY B, C	None	(B, C) no prefix	❌ No - skips A

order_by_prefix.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
-- Index: orders (customer_id, order_date, order_id)
 
-- Scenario 1: Filter + Sort on consecutive columns
SELECT * FROM orders
WHERE customer_id = 100
ORDER BY order_date DESC;
-- Filter prefix: (customer_id)
-- Extended prefix: (customer_id, order_date)
-- Result: Index scan in order, NO filesort ✅
 
-- Scenario 2: Filter + Sort with gap
SELECT * FROM orders
WHERE customer_id = 100
ORDER BY order_id;
-- Filter prefix: (customer_id)
-- Sort column: order_id (skips order_date!)
-- Result: Filesort required ❌
 
-- Scenario 3: Multi-column sort
SELECT * FROM orders
WHERE customer_id = 100
ORDER BY order_date, order_id;
-- Filter prefix: (customer_id)
-- Extended prefix: (customer_id, order_date, order_id)
-- Result: Full index order match, NO filesort ✅
 
-- Scenario 4: Pure sorting (no filter)
SELECT * FROM orders
ORDER BY customer_id, order_date
LIMIT 100;
-- No filter prefix, but sort matches index prefix
-- Result: Index scan in order, NO filesort ✅
 
-- Scenario 5: Sort starting from wrong column
SELECT * FROM orders
ORDER BY order_date
LIMIT 100;
-- Sort starts with order_date (2nd column)
-- Result: Filesort required (or different index needed) ❌

The Filter-Sort Handoff

Think of it as a 'handoff' between filtering and sorting. If filters use columns (A, B) with equality predicates, the sort can 'pick up' with column C without needing to specify A, B in the ORDER BY. The index guarantees ordering on C within each (A, B) combination.

Skip Scan Optimization

Modern database optimizers sometimes employ a technique called skip scan (or loose index scan) that can partially bypass the strict prefix requirement—but only under specific conditions.

Skip Scan Concept:

When the leading column of a composite index has few distinct values, the optimizer may scan through each leading column value separately, essentially converting one range scan into multiple point lookups.

How Skip Scan Works:

For index (A, B) where A has few distinct values and query only filters on B:

Find first distinct A value (A=1)
Within A=1, use index range scan for B condition
Skip to next distinct A value (A=2)
Within A=2, use index range scan for B condition
Repeat until all A values processed

skip_scan_example.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
-- Table: products
-- Index: (category, price)
-- category has 10 distinct values
-- price has 10,000 distinct values
 
-- Query that normally can't use the index (skips category):
SELECT * FROM products WHERE price BETWEEN 100 AND 200;
 
-- Without skip scan:
-- Full table scan required (index unusable)
 
-- With skip scan (Oracle, MySQL 8.0+, PostgreSQL 14+):
-- Execution plan becomes:
--   1. Seek to (category='Books', price=100)
--   2. Scan until price > 200 within 'Books'
--   3. Seek to (category='Clothing', price=100)
--   4. Scan until price > 200 within 'Clothing'
--   5. ... repeat for all 10 categories
 
-- MySQL example with EXPLAIN showing skip scan:
EXPLAIN SELECT * FROM products WHERE price BETWEEN 100 AND 200;
-- Extra: Using index for skip scan
 
-- When skip scan is beneficial:
-- - Leading column has LOW cardinality (few distinct values)
-- - Non-leading column predicate is highly selective
-- - No better index alternative exists
 
-- When skip scan is NOT beneficial:
-- - Leading column has HIGH cardinality (many distinct values)
-- - Would require too many 'skips'
-- - Full table scan is cheaper

Skip Scan Applicability
Database	Skip Scan Support	Notes
Oracle	✅ Full support	Called 'INDEX SKIP SCAN'
MySQL 8.0+	✅ Supported	Automatic for suitable indexes
PostgreSQL 14+	⚠️ Limited via Memoize	Not full skip scan, uses memoization
SQL Server	⚠️ Partial	Via query rewriting or UNION ALL
SQLite	❌ Not supported	Requires manual query rewriting

Don't Rely on Skip Scan

Skip scan is an optimization for edge cases, not a substitute for proper index design. If queries frequently skip the leading column, create an appropriate index. Skip scan has overhead and is only efficient when the leading column has very few distinct values (typically <100).

Analyzing Prefix Usage

Understanding which prefixes your queries actually use helps optimize index design and identify opportunities for consolidation.

mysql_prefix_analysis.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
-- Check index usage with EXPLAIN
EXPLAIN FORMAT=JSON SELECT * FROM orders
WHERE customer_id = 100 AND order_date = '2023-01-15';
 
-- Look for:
-- "key": "idx_customer_order"
-- "key_length": "8"  -- Shows how many bytes of index used
-- If key_length matches first column only, only prefix is used
 
-- Calculate expected key_length:
-- INT = 4 bytes (+ 1 if nullable)
-- DATE = 3 bytes (+ 1 if nullable)
-- VARCHAR(n) = n * character_size + 2 length bytes (+ 1 if nullable)
 
-- Enable optimizer trace for deep analysis
SET optimizer_trace = 'enabled=on';
SELECT * FROM orders WHERE customer_id = 100;
SELECT * FROM information_schema.optimizer_trace;
 
-- Check index statistics
SELECT 
    index_name,
    column_name,
    seq_in_index,
    cardinality
FROM information_schema.statistics
WHERE table_name = 'orders'
ORDER BY index_name, seq_in_index;

Summary: Prefix Matching Mastery

Prefix matching is the mechanism that makes composite indexes versatile—one index serving many query patterns. Here are the essential concepts:

Key Takeaways

•Prefix matching allows a composite index (A, B, C) to also serve queries on (A) and (A, B) efficiently.
•Contiguity is key — Matching entries form contiguous blocks because of lexicographic ordering. Gaps break contiguity.
•Index subsumption — A composite index makes separate indexes on its prefixes redundant. Consolidate to reduce overhead.
•Longer prefixes = better selectivity — Each additional column in the prefix multiplicatively reduces rows examined.
•ORDER BY benefits — Sorting on columns following the filter prefix avoids expensive filesort operations.
•Skip scan is a workaround — It can help when leading columns have low cardinality, but proper index design is preferred.
•Analyze actual usage — Use EXPLAIN and system views to verify which prefixes are actually leveraged.

What's Next:

The next page explores index intersection—when and how the database combines multiple indexes to satisfy complex queries. This technique complements composite indexes and helps handle queries that can't be efficiently served by a single index's prefix.

Page Complete

You now understand how prefix matching enables one composite index to serve multiple query patterns. You know why contiguity matters, how to design for maximum prefix coverage, and how to analyze prefix usage in production systems. This knowledge is essential for efficient index design.