Imagine searching for a specific word in a 1,000-page book. Without an index, you'd have to scan every page—a process that could take hours. With a properly constructed index at the back of the book, you can find the exact page in seconds. This simple analogy captures the essence of database indexes: they transform exhaustive searches into surgical lookups.
In the world of database management systems, indexes are not merely a nice-to-have optimization—they are the single most impactful tool for query performance. A well-designed index can transform a query that takes 30 minutes into one that completes in 30 milliseconds. Conversely, missing or poorly designed indexes are the root cause of the vast majority of database performance problems in production systems.
By the end of this page, you will understand the fundamental benefits indexes provide, the mathematical basis for their performance advantages, how they reduce I/O operations, and the specific scenarios where indexes deliver the greatest performance improvements. This knowledge forms the foundation for all advanced indexing strategies.
To appreciate why indexes are so valuable, we must first understand what happens without them. When a database receives a query with a WHERE clause and no applicable index exists, it has only one option: scan every row in the table.
Consider a table with 10 million customer records. Without an index, finding a single customer by email requires the database engine to:

- read every data page of the table from disk into memory,
- examine every row on every page and compare its email value against the search value, and
- continue all the way to the last row, because without an index it cannot know whether additional matches exist.
This is called a full table scan or sequential scan, and it has devastating performance implications at scale.
| Table Size | Rows to Scan | Estimated I/O Operations | Approximate Time |
|---|---|---|---|
| 1,000 rows | 1,000 | ~10-50 pages | < 100ms |
| 100,000 rows | 100,000 | ~1,000-5,000 pages | ~1-5 seconds |
| 10,000,000 rows | 10,000,000 | ~100,000+ pages | ~30 seconds - 5 minutes |
| 1,000,000,000 rows | 1,000,000,000 | ~10,000,000+ pages | ~30 minutes - hours |
The key insight: Full table scan performance degrades linearly with table size. If your table doubles in size, your query takes twice as long. This O(n) behavior is acceptable for small tables but becomes catastrophic for production databases with millions or billions of rows.
Moreover, full table scans create additional problems:

- they flood the buffer pool with pages that are read once and never reused, evicting genuinely hot data;
- they consume disk bandwidth and CPU that concurrent queries need; and
- they tie up connections and worker processes for the full duration of the scan.
Full table scans are particularly insidious because they often work fine in development with small datasets. The query that runs in 50ms against 10,000 test rows becomes a 10-minute nightmare when deployed against 10 million production rows. This is why performance testing with realistic data volumes is essential.
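One way to catch this early is to look at the query plan rather than just the timing. A minimal sketch using PostgreSQL-style EXPLAIN (the customers table and the plan text are illustrative; other systems use EXPLAIN PLAN, SET SHOWPLAN, or similar, and the output format differs):

```sql
-- Ask the planner how it would execute the lookup (PostgreSQL syntax)
EXPLAIN SELECT * FROM customers WHERE email = 'jane.doe@example.com';

-- Without an index on email, the plan reports a full (sequential) scan, e.g.:
--   Seq Scan on customers  (cost=0.00..245000.00 rows=1 width=120)
--     Filter: (email = 'jane.doe@example.com'::text)
--
-- After an index is added on email, the same query shows an Index Scan
-- instead, regardless of how many rows the table holds.
```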
A database index is a separate data structure that maintains a sorted, searchable copy of selected columns along with pointers back to the actual table rows. The most common index structure in relational databases is the B-tree (or its variant, the B+tree), which provides logarithmic search performance.
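Concretely, an index is created with a single DDL statement. A minimal sketch with illustrative table and column names (most relational engines build a B-tree index by default):

```sql
-- Secondary index on the column used for lookups
CREATE INDEX idx_customers_email ON customers (email);

-- The lookup from the earlier example can now traverse the B-tree
-- instead of scanning every row
SELECT customer_id, name
FROM customers
WHERE email = 'jane.doe@example.com';
```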
The B-tree Advantage
B-trees organize data in a hierarchical tree structure where:

- the root and internal nodes hold key values and pointers to child nodes,
- the leaf nodes hold the indexed values along with pointers to the corresponding table rows, and
- every path from the root to a leaf has the same length, so the tree stays balanced as data is inserted and deleted.
```
B-tree Index Structure (Conceptual)
===================================

                     [Root: 50, 100]
                    /       |        \
            [25, 40]     [75, 90]     [125, 150]
            /  |  \      /  |  \       /  |  \
         Leaf Leaf Leaf Leaf Leaf Leaf Leaf Leaf Leaf
        Nodes Nodes Nodes Nodes Nodes Nodes Nodes Nodes Nodes

Each leaf node contains:
- Indexed column value(s)
- Pointer to actual row (row ID / physical address)

Search for value 87:
1. Start at root: 87 > 50, 87 < 100 → go to middle child
2. At [75, 90]: 87 > 75, 87 < 90 → go to middle child
3. At leaf: scan for 87, find row pointer
4. Fetch actual row using pointer

Total nodes visited: 3 (regardless of table having millions of rows!)
```

The Mathematical Breakthrough
The power of B-trees comes from their logarithmic height. For a B-tree with branching factor b (typically 100-500 for database indexes), the height is approximately:
$$h = \log_b(n)$$
Where n is the number of indexed values. Because b is large, the tree stays remarkably shallow even for enormous tables:
To find any single value, the database only needs to read 3-5 pages even in tables with billions of rows!
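As a quick sanity check of that claim, take a branching factor of b = 100 and a billion indexed values:

$$h \approx \log_{100}\left(10^{9}\right) = \frac{\log_{10} 10^{9}}{\log_{10} 100} = \frac{9}{2} = 4.5$$

So four to five levels of the tree are enough, and each level costs at most one page read.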
Consider the implications: finding a row in a billion-row table requires reading ~5 index pages plus 1 data page, totaling ~6 I/O operations. A full table scan of the same table might require 10 million+ I/O operations. The index provides a million-fold improvement in I/O efficiency.
In database systems, I/O is almost always the bottleneck. While modern CPUs can perform billions of operations per second, reading data from disk (even SSDs) is orders of magnitude slower. This fundamental asymmetry means that any technique reducing I/O operations delivers massive performance benefits.
Understanding the I/O Hierarchy
| Storage Type | Random Read Latency | Latency vs. CPU L1 Cache |
|---|---|---|
| CPU L1 Cache | ~1 nanosecond | 1x (baseline) |
| CPU L3 Cache | ~10 nanoseconds | 10x slower |
| RAM | ~100 nanoseconds | 100x slower |
| NVMe SSD | ~100 microseconds | 100,000x slower |
| SATA SSD | ~500 microseconds | 500,000x slower |
| HDD (spinning disk) | ~10 milliseconds | 10,000,000x slower |
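Combining this latency table with the earlier page-count estimates (roughly 100,000 pages for a full scan of the 10-million-row table versus about 6 pages for an index lookup, at ~100 microseconds per random SSD read) gives a feel for the gap:

$$100{,}000 \text{ pages} \times 100\ \mu\text{s} = 10\ \text{s} \quad\text{vs.}\quad 6 \text{ pages} \times 100\ \mu\text{s} = 0.6\ \text{ms}$$

Real scans benefit from faster sequential reads, but the orders-of-magnitude difference remains.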
How Indexes Minimize I/O
Indexes reduce I/O through several mechanisms:
1. Smaller Data Volume
Index pages contain only the indexed columns and row pointers, not the full row data. An index on a single integer column might be 100x smaller than the full table, meaning 100x fewer pages to read for scans.

2. Ordered Access Patterns
B-tree indexes maintain sorted order, enabling efficient range scans. Finding all orders from January requires reading contiguous leaf pages rather than random pages scattered across the table.

3. Pointer-Based Row Access
After locating matching index entries, the database uses row pointers to fetch only the specific data pages needed, avoiding reads of irrelevant data.

4. Index-Only Queries
When all required columns exist in the index (covered queries), the database never needs to read the main table at all, eliminating table I/O entirely (see the sketch below).
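To illustrate the last point, some engines (PostgreSQL 11+ and SQL Server, for example) let you attach non-key payload columns to an index with INCLUDE so a query can be answered from the index alone. A sketch; the orders columns shown are illustrative:

```sql
-- Covering index: key column plus included payload columns
CREATE INDEX idx_orders_customer_covering
    ON orders (customer_id)
    INCLUDE (order_date, total);

-- Can be answered entirely from the index (an index-only scan),
-- subject to the engine's visibility rules
SELECT order_date, total
FROM orders
WHERE customer_id = 5001;
```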
Indexes compound their benefits through the buffer pool (in-memory cache). Frequently-accessed index pages remain in memory, eliminating disk I/O entirely for repeated queries. A hot index on customer_id might serve millions of lookups without a single disk read after initial warming.
Indexes dramatically accelerate different types of search operations. Understanding which searches benefit from indexes helps you design effective indexing strategies.
Equality Searches (Exact Match)
The most common and most optimized use case. Finding a specific value requires traversing the B-tree from root to leaf, then following the row pointer.
```sql
-- Equality search example
SELECT * FROM employees WHERE employee_id = 10042;

-- With primary key index: O(log n) complexity
-- 4 billion rows → ~32 comparisons → < 1 millisecond

-- Without index: O(n) complexity
-- 4 billion rows → 4 billion comparisons → potentially hours
```

Range Searches
B-tree indexes excel at range queries because leaves are linked, allowing efficient sequential scans after locating the range start.
```sql
-- Range search examples
SELECT * FROM orders
WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31';

SELECT * FROM products
WHERE price >= 100 AND price < 500;

SELECT * FROM log_entries
WHERE timestamp > '2024-01-15 00:00:00';

-- Index usage pattern:
-- 1. Traverse B-tree to find range start (January 1st)
-- 2. Scan linked leaf nodes until range end (January 31st)
-- 3. Fetch data rows for matching entries
--
-- Performance: O(log n + k) where k = matching rows
```

Prefix Searches
For string columns, B-trees support efficient prefix matching because strings are sorted lexicographically.
```sql
-- Prefix search - can use index
SELECT * FROM customers
WHERE last_name LIKE 'Smith%';  -- Efficient: uses index range scan

-- Suffix/infix search - CANNOT use standard index
SELECT * FROM customers
WHERE last_name LIKE '%Smith';  -- Full scan required

SELECT * FROM customers
WHERE last_name LIKE '%mit%';   -- Full scan required

-- Index sorts strings: ['Adams', 'Brown', 'Smith', 'Smithson', 'Wilson']
-- 'Smith%' means: find where 'Smith' <= value < 'Smiti' (range scan)
-- '%Smith' requires checking every value (no ordering helps)
```

In summary, B-tree indexes accelerate these search conditions:

- `column = value` — optimal B-tree usage
- `column > value`, `column BETWEEN a AND b` — efficient range scans
- `column IN (v1, v2, v3)` — multiple equality lookups
- `column LIKE 'prefix%'` — treated as a range condition

Beyond search acceleration, indexes provide a second major benefit: pre-sorted data access. This eliminates the need for expensive in-memory or disk-based sorting operations.
The Cost of Sorting
Without an index, sort operations require:

- reading all matching rows into memory,
- spilling to temporary disk files when the result set exceeds available sort memory, and
- performing an O(n log n) comparison sort before the first row can be returned.
For large result sets, sorting can dominate query execution time and consume enormous memory/temp space.
```sql
-- Query requiring sorted output
SELECT * FROM transactions
WHERE account_id = 5001
ORDER BY transaction_date DESC
LIMIT 100;

-- Scenario 1: No index on (account_id, transaction_date)
-- 1. Full scan to find account_id = 5001 rows (e.g., 500,000 matches)
-- 2. Load all 500,000 rows into memory
-- 3. Sort by transaction_date (O(n log n) = millions of comparisons)
-- 4. Return first 100 rows
-- 5. Discard remaining 499,900 sorted rows
-- Time: several seconds, high memory usage

-- Scenario 2: Composite index on (account_id, transaction_date DESC)
-- 1. Index lookup for account_id = 5001
-- 2. Data is ALREADY sorted by transaction_date DESC
-- 3. Read first 100 matching entries from index
-- 4. Fetch corresponding rows
-- Time: < 100 milliseconds, minimal memory

-- The index didn't just find the data faster—it eliminated sorting entirely!
```

ORDER BY Optimization
Indexes can eliminate sorting for ORDER BY clauses when:

- the index columns (and their directions) match the ORDER BY column order, and
- any WHERE conditions use the leading index columns with equality.

GROUP BY Optimization
Similarly, indexes can accelerate GROUP BY operations by providing pre-grouped data access, eliminating the need for hash aggregation or sort-based grouping.
Most databases store indexes in ascending order by default. For queries with ORDER BY ... DESC, consider creating a descending index. For composite indexes, the combination of directions must match your query pattern (or its exact reverse, which a backward index scan can satisfy). An index on (a ASC, b DESC) can serve ORDER BY a ASC, b DESC and ORDER BY a DESC, b ASC, but it cannot efficiently serve ORDER BY a ASC, b ASC or ORDER BY a DESC, b DESC.
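Using the transactions example from above, the matching composite index can be declared with an explicit direction (supported by PostgreSQL, MySQL 8.0+, and SQL Server, among others):

```sql
-- Serves: WHERE account_id = ? ORDER BY transaction_date DESC LIMIT ...
CREATE INDEX idx_transactions_account_date_desc
    ON transactions (account_id, transaction_date DESC);
```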
In relational databases, joins are fundamental operations that combine data from multiple tables. Indexes are absolutely critical for join performance—without them, even simple two-table joins can become catastrophically slow.
Join Algorithms and Index Usage
Databases typically employ three join algorithms, each with different index requirements:
1. Nested Loop Join
For each row in the outer table, look up matching rows in the inner table. Indexes on the inner table's join column make this efficient.
```sql
-- Nested loop join example
SELECT c.name, o.order_id, o.total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.country = 'USA';

-- Execution with index on orders.customer_id:
--   For each USA customer (outer loop):
--     Use index to find their orders in O(log n) time
--   Total: O(m × log n) where m = USA customers

-- Execution WITHOUT index on orders.customer_id:
--   For each USA customer (outer loop):
--     Scan ALL orders to find matches in O(n) time
--   Total: O(m × n) — potentially millions × millions = disaster
```

2. Merge Join
Sort both tables on the join column, then merge them in a single pass. Indexes providing sorted access eliminate the sort step.
```sql
-- Merge join is efficient when both sides are sorted
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id;

-- With indexes on both e.department_id and d.department_id:
-- Both tables accessed in sorted order via index scans
-- Single pass merge: O(m + n)

-- Without indexes:
-- Sort employees by department_id: O(m log m)
-- Sort departments by department_id: O(n log n)
-- Then merge: O(m + n)
-- The sorts can be extremely expensive for large tables
```

3. Hash Join
Build a hash table from the smaller table, then probe with the larger table. Indexes aren't directly used, but this is typically chosen when indexes are unavailable.
| Algorithm | Index Requirement | Best For | Performance |
|---|---|---|---|
| Nested Loop | Index on inner table's join column | Small outer table, indexed inner table | O(m × log n) with index |
| Merge Join | Indexes or sorted input on both tables | Both tables large and sortable | O(m + n) after sorting |
| Hash Join | None required | No indexes, one table fits in memory | O(m + n) but memory-intensive |
Always create indexes on foreign key columns. These columns are inherently used in joins, and without indexes, every join involving that relationship suffers. Some databases auto-create foreign key indexes; others (like PostgreSQL) do not—check your database's behavior.
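A minimal sketch, reusing the customers/orders relationship from the join example above:

```sql
-- Declaring the foreign key does NOT create this index in PostgreSQL;
-- only the referenced primary key (customers.customer_id) is indexed
CREATE INDEX idx_orders_customer_id ON orders (customer_id);
```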
Indexes provide significant benefits for aggregate operations (COUNT, SUM, AVG, MIN, MAX) and grouping (GROUP BY), though the benefits depend on query structure and index design.
MIN/MAX Optimization
Finding minimum or maximum values is trivially fast with an index—just read the first or last entry from the B-tree leaf level.
```sql
-- Finding extremes with indexes
SELECT MIN(created_at) FROM events;   -- Read leftmost leaf entry
SELECT MAX(price) FROM products;      -- Read rightmost leaf entry

-- Without index: Scan millions of rows
-- With index: Read exactly 1 leaf page + small tree traversal

-- Even with WHERE clauses:
SELECT MAX(order_total) FROM orders WHERE status = 'completed';

-- With index on (status, order_total):
-- Navigate to status = 'completed' section
-- Read rightmost entry in that section
-- Near-instant regardless of table size
```

COUNT Optimization
Simple counts can use the index directly without accessing the table:
```sql
-- Counting with indexes
SELECT COUNT(*) FROM orders WHERE status = 'pending';

-- With index on status:
-- Index-only scan of 'pending' entries
-- Count entries without touching main table
-- Much faster than full table scan

-- Counting distinct values:
SELECT COUNT(DISTINCT customer_id) FROM orders;

-- With index on customer_id:
-- Scan index (much smaller than table)
-- Track unique values during scan
```

GROUP BY Optimization
Indexes can accelerate GROUP BY by providing pre-sorted access, enabling stream-based aggregation instead of hash-based aggregation.
```sql
-- Grouping with indexes
SELECT department_id, COUNT(*), AVG(salary)
FROM employees
GROUP BY department_id;

-- With index on department_id:
-- Read index in order: all employees of dept 1, then dept 2, etc.
-- Compute aggregates as you go (streaming)
-- No need to hash or sort

-- Without index:
-- Build hash table of department_id → accumulated values
-- Or: sort all rows by department_id, then aggregate
-- Both require processing entire table before output
```

Some databases support a 'loose index scan' or 'skip scan' for GROUP BY—jumping between distinct values in the index rather than reading every entry. This is especially efficient when grouping on low-cardinality columns (few distinct values) over large tables.
We've explored the fundamental benefits that make indexes the most important performance tool in database systems. Let's consolidate these insights:

- Full table scans cost O(n) and degrade linearly with table size; B-tree lookups cost O(log n) and stay shallow even for billions of rows.
- Because I/O dominates query cost, reading a handful of index pages instead of millions of table pages yields orders-of-magnitude speedups.
- Indexes deliver data in sorted order, eliminating explicit sorts for ORDER BY and GROUP BY.
- Indexes on join columns (especially foreign keys) make nested loop and merge joins feasible at scale.
- Aggregates such as MIN, MAX, and COUNT can often be answered from the index alone, without touching the table.
What's Next
Understanding index benefits is the foundation. The next page explores how the query optimizer selects which indexes to use—because having indexes isn't enough; the database must choose to use them correctly. We'll examine the optimizer's decision-making process and the factors that influence index selection.
You now understand why indexes are essential for database performance. They transform O(n) full table scans into O(log n) lookups, reduce I/O by orders of magnitude, eliminate sorting overhead, and make joins feasible at scale. Next, we'll explore how the optimizer decides which indexes to use.