Loading content...
One of the most powerful aspects of composite indexes is their ability to serve multiple query patterns through a single index structure. A well-designed composite index on (A, B, C, D) doesn't just accelerate queries filtering on all four columns—it can also efficiently serve queries on (A), (A, B), and (A, B, C).
This capability is called prefix matching, and mastering it allows database designers to minimize index proliferation while maximizing query coverage. Instead of creating four separate indexes, one strategic composite index can serve all the queries efficiently.
This page explores prefix matching in depth: how it works, how to design indexes to leverage it, and how to analyze which prefixes your queries actually require.
By the end of this page, you will understand how prefix matching works at the B+-tree level, how to design indexes that serve multiple query patterns, techniques for analyzing query prefix requirements, and the trade-offs between prefix-based index consolidation and specialized indexes.
Prefix matching refers to the database engine's ability to use a composite index even when the query doesn't filter on all indexed columns—as long as the query uses a contiguous prefix of the index's column list.
Formal Definition:
For a composite index I on columns (C₁, C₂, C₃, ..., Cₙ), any query that filters on columns (C₁) or (C₁, C₂) or (C₁, C₂, C₃) or ... or (C₁, C₂, ..., Cₖ) where k ≤ n can utilize the index efficiently.
The key requirement is contiguity from the left—no gaps are allowed in the column sequence.
| Query WHERE Clause | Prefix Used | Columns Leveraged | Index Usable? |
|---|---|---|---|
| customer_id = 100 | (customer_id) | 1 of 4 | ✅ Yes |
| customer_id = 100 AND order_date = '2023-01-15' | (customer_id, order_date) | 2 of 4 | ✅ Yes |
| customer_id = 100 AND order_date = '2023-01-15' AND product_id = 50 | (..., product_id) | 3 of 4 | ✅ Yes |
| customer_id = 100 AND order_date BETWEEN ... AND product_id = 50 | First 2 only | 2 of 4 (range stops) | ⚠️ Partial |
| order_date = '2023-01-15' | None (skips customer_id) | 0 of 4 | ❌ No |
| customer_id = 100 AND product_id = 50 | (customer_id) only | 1 of 4 (gap) | ⚠️ Partial |
| product_id = 50 AND quantity > 10 | None | 0 of 4 | ❌ No |
Prefix matching works because of the hierarchical lexicographic ordering in B+-tree composite indexes. Understanding this mechanism reveals why prefixes work and gaps don't.
The Hierarchical Sort Structure:
In a B+-tree for index (A, B, C), entries are sorted as:
(1, 'a', 100) < (1, 'a', 200) < (1, 'b', 50) < (1, 'b', 150) <
(2, 'a', 75) < (2, 'a', 300) < (2, 'b', 25) < ...
All entries with A=1 form a contiguous block. Within that block, entries with (A=1, B='a') form another contiguous sub-block. This hierarchical contiguity enables prefix matching.
123456789101112131415161718192021222324252627
// Index (department_id, hire_year, employee_id) - Sorted entries // All dept_id = 10 entries are CONTIGUOUS:(10, 2020, 1001)(10, 2020, 1002)(10, 2021, 1003)(10, 2021, 1004)(10, 2022, 1005)// ← Query "WHERE dept_id = 10" scans this contiguous range // All dept_id = 20 entries are CONTIGUOUS:(20, 2019, 1010)(20, 2020, 1011)(20, 2021, 1012)// ← Query "WHERE dept_id = 20" scans this contiguous range // All (dept_id = 10, hire_year = 2021) entries are CONTIGUOUS:(10, 2021, 1003)(10, 2021, 1004)// ← Query "WHERE dept_id = 10 AND hire_year = 2021" scans this range // BUT hire_year = 2021 entries are NOT CONTIGUOUS (scattered):(10, 2021, 1003) // dept 10(10, 2021, 1004) // dept 10(20, 2021, 1012) // dept 20 - in a different part of the index(30, 2021, 1020) // dept 30 - even further away// ← Query "WHERE hire_year = 2021" CANNOT use this index efficientlyA prefix (C₁, C₂, ..., Cₖ) defines a contiguous range in the index. Any column predicate beyond the prefix breaks contiguity because the remaining columns are only sorted WITHIN each prefix combination, not globally across the index.
Range Scan Behavior:
When a query uses a k-column prefix, the database:
The efficiency comes from the guarantee that all matching entries are adjacent—no scattered lookups are needed.
What Happens with Gaps:
For a query with gaps (like A=1 AND C=100 on index (A, B, C)):
Strategic index design aims to maximize the number of query patterns served by each composite index through prefix matching. This reduces index count, storage overhead, and maintenance burden.
The Query Pattern Analysis Process:
123456789101112131415161718192021
-- Naive approach: Create separate indexes for each patternCREATE INDEX idx_q1 ON orders (customer_id); -- Query 1CREATE INDEX idx_q2 ON orders (customer_id, order_date); -- Query 2CREATE INDEX idx_q3 ON orders (customer_id, order_date, status); -- Query 3CREATE INDEX idx_q4 ON orders (customer_id, status); -- Query 4-- Result: 4 indexes, significant storage and maintenance overhead -- Strategic approach: Use prefix matchingCREATE INDEX idx_orders_main ON orders (customer_id, order_date, status); -- This single index serves:-- Query 1: Uses (customer_id) prefix ✅-- Query 2: Uses (customer_id, order_date) prefix ✅-- Query 3: Uses full index ✅-- Query 4: Uses (customer_id) prefix, then filters status ⚠️ -- For Query 4 to be fully optimized, we might add:CREATE INDEX idx_orders_status ON orders (customer_id, status);-- Now Query 4 uses (customer_id, status) directly ✅ -- Result: 2 indexes instead of 4, with full coverageAn index on (A, B, C) makes a separate index on (A) or (A, B) redundant—the composite index subsumes the shorter prefixes. When designing indexes, check if existing composite indexes already cover the prefix you need.
The length of the prefix used directly impacts query performance. Longer prefixes (using more index columns) generally provide better selectivity and fewer rows examined.
Performance Hierarchy:
| Prefix Length | Selectivity | Rows Examined | I/O Cost |
|---|---|---|---|
| 1 column | Low | Many | Higher |
| 2 columns | Medium | Fewer | Medium |
| 3 columns | High | Fewest | Lower |
| Full index | Highest | Minimal | Lowest |
12345678910111213141516171819202122232425262728293031323334353637
-- Table: orders (10 million rows)-- Index: (customer_id, order_date, product_category, status)-- Column cardinalities:-- customer_id: 100,000 distinct values-- order_date: 2,000 distinct values -- product_category: 50 distinct values-- status: 5 distinct values -- Query 1: 1-column prefixSELECT * FROM orders WHERE customer_id = 12345;-- Rows examined: ~100 (10M / 100K customers)-- Selectivity: 0.001% -- Query 2: 2-column prefixSELECT * FROM orders WHERE customer_id = 12345 AND order_date = '2023-06-15';-- Rows examined: ~5 (100 / 20 dates per customer)-- Selectivity: 0.00005% -- Query 3: 3-column prefixSELECT * FROM orders WHERE customer_id = 12345 AND order_date = '2023-06-15' AND product_category = 'Electronics';-- Rows examined: ~1 (5 / 5 categories)-- Selectivity: 0.000001% -- Query 4: Full index (4 columns)SELECT * FROM orders WHERE customer_id = 12345 AND order_date = '2023-06-15' AND product_category = 'Electronics' AND status = 'shipped';-- Rows examined: ~0.2 (essentially point lookup)-- Selectivity: 0.00000002% -- The multiplicative selectivity improvement is dramatic!| Consideration | Shorter Prefix | Longer Prefix |
|---|---|---|
| Query Flexibility | More queries can use it | Fewer queries use all columns |
| Selectivity | Lower (more rows examined) | Higher (fewer rows examined) |
| Index Maintenance | NA (same index) | NA (same index) |
| Query Plan Stability | More predictable | Depends on statistics accuracy |
| Range Query Impact | Stops at first range | Stops at first range |
Beyond 3-4 columns, the additional selectivity benefit often diminishes. Very long composite indexes (5+ columns) rarely provide proportional performance improvement and increase storage/maintenance overhead. Consider covering index techniques instead for additional columns.
Prefix matching extends to ORDER BY clauses. A composite index can satisfy sorting requirements if the sort columns form a prefix of the index—or if they follow the filter prefix.
The Extended Prefix Rule for Sorting:
If a query filters on columns (C₁, ..., Cₖ) and sorts on columns (Cₖ₊₁, ..., Cₘ), the index can satisfy both if (C₁, ..., Cₘ) forms a prefix of the index definition.
| Query Pattern | Filter Prefix | Sort Prefix | Can Avoid Filesort? |
|---|---|---|---|
| WHERE A=? ORDER BY B | (A) | (A, B) | ✅ Yes |
| WHERE A=? ORDER BY B, C | (A) | (A, B, C) | ✅ Yes |
| WHERE A=? AND B=? ORDER BY C | (A, B) | (A, B, C) | ✅ Yes |
| WHERE A=? ORDER BY C | (A) | (A, C) gap! | ❌ No - gap in sort |
| WHERE A=? ORDER BY D | (A) | (A, D) gap! | ❌ No - skips B, C |
| ORDER BY A, B | None | (A, B) | ✅ Yes |
| ORDER BY B, C | None | (B, C) no prefix | ❌ No - skips A |
123456789101112131415161718192021222324252627282930313233343536373839
-- Index: orders (customer_id, order_date, order_id) -- Scenario 1: Filter + Sort on consecutive columnsSELECT * FROM ordersWHERE customer_id = 100ORDER BY order_date DESC;-- Filter prefix: (customer_id)-- Extended prefix: (customer_id, order_date)-- Result: Index scan in order, NO filesort ✅ -- Scenario 2: Filter + Sort with gapSELECT * FROM ordersWHERE customer_id = 100ORDER BY order_id;-- Filter prefix: (customer_id)-- Sort column: order_id (skips order_date!)-- Result: Filesort required ❌ -- Scenario 3: Multi-column sortSELECT * FROM ordersWHERE customer_id = 100ORDER BY order_date, order_id;-- Filter prefix: (customer_id)-- Extended prefix: (customer_id, order_date, order_id)-- Result: Full index order match, NO filesort ✅ -- Scenario 4: Pure sorting (no filter)SELECT * FROM ordersORDER BY customer_id, order_dateLIMIT 100;-- No filter prefix, but sort matches index prefix-- Result: Index scan in order, NO filesort ✅ -- Scenario 5: Sort starting from wrong columnSELECT * FROM ordersORDER BY order_dateLIMIT 100;-- Sort starts with order_date (2nd column)-- Result: Filesort required (or different index needed) ❌Think of it as a 'handoff' between filtering and sorting. If filters use columns (A, B) with equality predicates, the sort can 'pick up' with column C without needing to specify A, B in the ORDER BY. The index guarantees ordering on C within each (A, B) combination.
Modern database optimizers sometimes employ a technique called skip scan (or loose index scan) that can partially bypass the strict prefix requirement—but only under specific conditions.
Skip Scan Concept:
When the leading column of a composite index has few distinct values, the optimizer may scan through each leading column value separately, essentially converting one range scan into multiple point lookups.
How Skip Scan Works:
For index (A, B) where A has few distinct values and query only filters on B:
1234567891011121314151617181920212223242526272829303132
-- Table: products-- Index: (category, price)-- category has 10 distinct values-- price has 10,000 distinct values -- Query that normally can't use the index (skips category):SELECT * FROM products WHERE price BETWEEN 100 AND 200; -- Without skip scan:-- Full table scan required (index unusable) -- With skip scan (Oracle, MySQL 8.0+, PostgreSQL 14+):-- Execution plan becomes:-- 1. Seek to (category='Books', price=100)-- 2. Scan until price > 200 within 'Books'-- 3. Seek to (category='Clothing', price=100)-- 4. Scan until price > 200 within 'Clothing'-- 5. ... repeat for all 10 categories -- MySQL example with EXPLAIN showing skip scan:EXPLAIN SELECT * FROM products WHERE price BETWEEN 100 AND 200;-- Extra: Using index for skip scan -- When skip scan is beneficial:-- - Leading column has LOW cardinality (few distinct values)-- - Non-leading column predicate is highly selective-- - No better index alternative exists -- When skip scan is NOT beneficial:-- - Leading column has HIGH cardinality (many distinct values)-- - Would require too many 'skips'-- - Full table scan is cheaper| Database | Skip Scan Support | Notes |
|---|---|---|
| Oracle | ✅ Full support | Called 'INDEX SKIP SCAN' |
| MySQL 8.0+ | ✅ Supported | Automatic for suitable indexes |
| PostgreSQL 14+ | ⚠️ Limited via Memoize | Not full skip scan, uses memoization |
| SQL Server | ⚠️ Partial | Via query rewriting or UNION ALL |
| SQLite | ❌ Not supported | Requires manual query rewriting |
Skip scan is an optimization for edge cases, not a substitute for proper index design. If queries frequently skip the leading column, create an appropriate index. Skip scan has overhead and is only efficient when the leading column has very few distinct values (typically <100).
Understanding which prefixes your queries actually use helps optimize index design and identify opportunities for consolidation.
12345678910111213141516171819202122232425262728
-- Check index usage with EXPLAINEXPLAIN FORMAT=JSON SELECT * FROM ordersWHERE customer_id = 100 AND order_date = '2023-01-15'; -- Look for:-- "key": "idx_customer_order"-- "key_length": "8" -- Shows how many bytes of index used-- If key_length matches first column only, only prefix is used -- Calculate expected key_length:-- INT = 4 bytes (+ 1 if nullable)-- DATE = 3 bytes (+ 1 if nullable)-- VARCHAR(n) = n * character_size + 2 length bytes (+ 1 if nullable) -- Enable optimizer trace for deep analysisSET optimizer_trace = 'enabled=on';SELECT * FROM orders WHERE customer_id = 100;SELECT * FROM information_schema.optimizer_trace; -- Check index statisticsSELECT index_name, column_name, seq_in_index, cardinalityFROM information_schema.statisticsWHERE table_name = 'orders'ORDER BY index_name, seq_in_index;Prefix matching is the mechanism that makes composite indexes versatile—one index serving many query patterns. Here are the essential concepts:
What's Next:
The next page explores index intersection—when and how the database combines multiple indexes to satisfy complex queries. This technique complements composite indexes and helps handle queries that can't be efficiently served by a single index's prefix.
You now understand how prefix matching enables one composite index to serve multiple query patterns. You know why contiguity matters, how to design for maximum prefix coverage, and how to analyze prefix usage in production systems. This knowledge is essential for efficient index design.