We've established that multi-level index trees provide O(log_f n) search complexity, where f is the fanout and n is the number of records. But what does this mean in practice? How many milliseconds does a search actually take? How does caching affect performance? How do range queries differ from point queries?
This page moves from theoretical complexity to practical performance analysis. Understanding these details is crucial for database tuning, capacity planning, and troubleshooting performance issues. The difference between a well-tuned and poorly-tuned database often comes down to understanding exactly how search operations interact with the index structure.
By the end of this page, you will understand: (1) the exact cost of point queries in terms of I/O operations, (2) how range queries traverse the tree and scan leaves, (3) the mathematics of I/O cost with and without caching, (4) how buffer management affects search performance, and (5) how B+-tree search compares with alternative access methods (full scan, hash index).
A point query retrieves a single record based on an exact key match. This is the most common index operation and the one for which B+-trees are optimized. Let's analyze the cost precisely.
The search path:
A point query traverses from the root to a leaf, reading one block at each level, then accesses the data block containing the actual record (unless it's a covering index).
Total I/O = (Tree Height) + (Data Access) = h + 1
For typical parameters:
```
Point Query Cost Analysis
=========================

Scenario: SELECT * FROM customers WHERE customer_id = 5000000

Index structure:
- 10 million records
- Fanout: 250 entries per block
- Tree height: 3 levels
- Index type: B+-tree (data pointers in leaves only)

Search path:

Step 1: Read root block (Level 3)
  - 1 I/O operation
  - Find entry where K < 5,000,000 <= K'
  - Binary search within block: ~8 comparisons (log2 250)
  - CPU time: ~1 microsecond
  - Disk time: 0.1 ms (SSD) or 10 ms (HDD)

Step 2: Read internal block (Level 2)
  - Follow pointer from Step 1
  - 1 I/O operation
  - Binary search: ~8 comparisons
  - Disk time: 0.1 ms (SSD) or 10 ms (HDD)

Step 3: Read leaf block (Level 1)
  - Follow pointer from Step 2
  - 1 I/O operation
  - Binary search: ~8 comparisons
  - Find exact entry for customer_id = 5,000,000
  - Disk time: 0.1 ms (SSD) or 10 ms (HDD)

Step 4: Read data block (Level 0)
  - Follow data pointer from leaf entry
  - 1 I/O operation
  - Retrieve actual customer record
  - Disk time: 0.1 ms (SSD) or 10 ms (HDD)

TOTAL COST:        Cold Cache   Warm Cache (upper levels cached)
I/O operations:    4            2 (Levels 3 and 2 cached)
SSD latency:       0.4 ms       0.2 ms
HDD latency:       40 ms        20 ms
CPU time:          ~4 us        ~4 us (negligible)

The dominant cost is ALWAYS disk I/O. CPU time for in-block
binary search is <0.01% of total time.
```

| Records | Tree Height | I/O (Cold) | I/O (Warm) | SSD Latency (Cold) | SSD Latency (Warm) |
|---|---|---|---|---|---|
| 10,000 | 2 | 3 | 2 | 0.3 ms | 0.2 ms |
| 1,000,000 | 3 | 4 | 2 | 0.4 ms | 0.2 ms |
| 100,000,000 | 4 | 5 | 2 | 0.5 ms | 0.2 ms |
| 10,000,000,000 | 5 | 6 | 2 | 0.6 ms | 0.2 ms |
| 1,000,000,000,000 | 6 | 7 | 2-3 | 0.7 ms | 0.2-0.3 ms |
Notice that warm cache latency is almost constant (0.2-0.3 ms) regardless of database size. Upper index levels occupy little memory and stay cached. Only leaf and data blocks vary. This is why properly sized buffer pools are critical—they convert logarithmic I/O into near-constant I/O.
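The relationship between table size, tree height, and I/O count in the table above can be reproduced with a short sketch. This is a simplified model, assuming a fixed fanout and that all internal (non-leaf) levels stay cached in the warm case; the function name and return shape are illustrative, not a real database API.

```python
import math

def point_query_ios(records: int, fanout: int = 250) -> dict:
    """Estimate point-query I/O for a B+-tree of the given size.

    Cold cache: one read per index level plus one data-block read.
    Warm cache: all internal levels assumed resident, so only the
    leaf block and the data block hit disk.
    """
    height = math.ceil(math.log(records, fanout))  # tree height in levels
    return {
        "height": height,
        "cold_ios": height + 1,  # h index reads + 1 data read
        "warm_ios": 2,           # leaf + data only
    }

for n in (10_000, 1_000_000, 100_000_000, 10_000_000_000):
    c = point_query_ios(n)
    print(f"{n:>14,}  height={c['height']}  cold={c['cold_ios']}  warm={c['warm_ios']}")
```

Running this reproduces the first four table rows: height grows logarithmically, cold I/O grows with it, and warm I/O stays flat at 2.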
A range query retrieves all records where the key falls within a specified range. B+-trees excel at range queries due to their leaf-level structure: all leaf nodes are linked, allowing sequential traversal after finding the starting point.
The range query algorithm:
The cost has two components: the initial tree traversal (h I/Os) and the leaf/data scan (proportional to result size).
```python
import math

def range_query_cost(
    start_key: int,
    end_key: int,
    total_records: int,
    fanout: int,
    records_per_data_block: int,
    selectivity: float | None = None,  # Fraction of records in range
) -> dict:
    """
    Calculate I/O cost for a range query.

    Parameters:
    - start_key, end_key: Range boundaries
    - total_records: Total records in table
    - fanout: Index entries per block
    - records_per_data_block: Data records per block
    - selectivity: Fraction of records matching (if known)

    Returns:
    - Dictionary with cost breakdown
    """
    # Tree height
    height = math.ceil(math.log(total_records) / math.log(fanout))

    # Calculate selectivity if not provided
    if selectivity is None:
        key_range = end_key - start_key
        total_key_range = total_records  # Assume keys are 1 to n
        selectivity = key_range / total_key_range

    # Records in range
    records_in_range = int(total_records * selectivity)

    # Leaf blocks to scan: each leaf block holds 'fanout' entries
    leaf_blocks_scanned = math.ceil(records_in_range / fanout)

    # Data blocks to access
    # Best case: data is clustered, read sequentially
    # Worst case: each record lives in a different block
    clustered_data_blocks = math.ceil(records_in_range / records_per_data_block)
    unclustered_data_blocks = records_in_range  # One I/O per record

    return {
        'tree_traversal_ios': height,
        'leaf_scan_ios': leaf_blocks_scanned,
        'data_access_ios_clustered': clustered_data_blocks,
        'data_access_ios_unclustered': unclustered_data_blocks,
        'total_ios_clustered': height + leaf_blocks_scanned + clustered_data_blocks,
        'total_ios_unclustered': height + leaf_blocks_scanned + unclustered_data_blocks,
        'records_returned': records_in_range,
    }

# Example: Find orders from Jan 2024 (assume 5% of 1 million orders)
cost = range_query_cost(
    start_key=20240101,
    end_key=20240131,
    total_records=1_000_000,
    fanout=250,
    records_per_data_block=20,
    selectivity=0.05,
)

print(f"""Range Query: Orders from January 2024
=====================================
Records in range: {cost['records_returned']:,}
Tree traversal: {cost['tree_traversal_ios']} I/Os

Clustered index (orders sorted by date):
  Leaf scan:   {cost['leaf_scan_ios']} I/Os
  Data access: {cost['data_access_ios_clustered']} I/Os
  TOTAL:       {cost['total_ios_clustered']} I/Os

Non-clustered index (orders NOT sorted by date):
  Leaf scan:   {cost['leaf_scan_ios']} I/Os
  Data access: {cost['data_access_ios_unclustered']:,} I/Os
  TOTAL:       {cost['total_ios_unclustered']:,} I/Os""")

# Output:
# Records in range: 50,000
# Tree traversal: 3 I/Os
#
# Clustered index:     TOTAL:  2,703 I/Os
# Non-clustered index: TOTAL: 50,203 I/Os (~19x worse!)
```

The clustering advantage:
The example above highlights a critical factor: whether the data is physically clustered by the index key.
For large range queries on non-clustered indexes, the optimizer may choose a full table scan instead, as it can be faster than thousands of random I/Os.
Non-clustered range queries have a break-even point. If the range selects more than ~5-15% of the table, a full table scan (sequential I/O) is often faster than index access (random I/O). Query optimizers detect this and switch strategies. Understanding this helps explain why the same query might use an index scan OR a table scan depending on the data.
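A back-of-the-envelope model makes the break-even point concrete. This is a sketch that counts block I/Os only and ignores tree traversal, leaf scanning, and caching; the cost-ratio parameter is an assumption used to model random I/O being slower than sequential I/O.

```python
def break_even_selectivity(total_records: int, records_per_block: int,
                           random_to_seq_cost: float = 1.0) -> float:
    """Selectivity at which a non-clustered index scan costs as much
    as a full table scan.

    Full scan: total_records / records_per_block sequential block reads.
    Non-clustered index scan: ~one random read per matching record.
    random_to_seq_cost > 1 models random I/O being slower than
    sequential I/O, which pushes the break-even point lower.
    """
    scan_blocks = total_records / records_per_block
    # Solve: selectivity * total_records * random_to_seq_cost == scan_blocks
    return scan_blocks / (total_records * random_to_seq_cost)

# 20 records per 4 KB block -> break-even near 5% selectivity
print(break_even_selectivity(1_000_000, 20))        # 0.05
# If random I/O costs 4x a sequential read, break-even drops to ~1%
print(break_even_selectivity(1_000_000, 20, 4.0))   # 0.0125
```

With 20 records per block, the break-even falls at 5% selectivity under equal I/O costs, and lower still once random I/O is penalized: consistent with the ~5-15% rule of thumb above.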
The buffer pool (or buffer cache) is memory allocated to cache frequently-accessed disk blocks. Its size and management policy dramatically affect index search performance. Understanding this interaction is crucial for database tuning.
The caching hierarchy for index searches:
```
Buffer Pool Impact Analysis
===========================

Database: 10 million records, B+-tree index
Block size: 4 KB
Fanout: 250

Index structure:
  Level 3 (Root):    1 block        = 4 KB
  Level 2:           160 blocks     = 640 KB
  Level 1 (Leaves):  40,000 blocks  = 160 MB
  Data:              500,000 blocks = 2 GB

SCENARIO 1: Tiny buffer pool (1 MB)
-----------------------------------
Can cache: Root + Level 2 + ~90 hot leaf blocks
Hit rate estimate:
  - Root: 100% (always cached)
  - Level 2: 100% (fits entirely)
  - Level 1: ~0.2% (90/40,000 if uniform access)
  - Data: ~0% (essentially no caching)
Average I/Os per point query: ~3 (leaf + data + occasional upper-level misses)

SCENARIO 2: Moderate buffer pool (200 MB)
-----------------------------------------
Can cache: Root + Level 2 + all leaves + ~10,000 hot data blocks
Hit rate estimate:
  - Root: 100%
  - Level 2: 100%
  - Level 1: 100% (all 160 MB cached!)
  - Data: ~2% uniform, or higher with locality
Average I/Os per point query: ~1-1.5 (mostly just data access)

SCENARIO 3: Large buffer pool (500 MB)
--------------------------------------
Can cache: Entire index + ~17% of data
Hit rate estimate:
  - All index levels: 100%
  - Data: ~17% uniform, higher with locality
Average I/Os per point query: ~0.8 (many queries fully cached)

SCENARIO 4: Huge buffer pool (3 GB)
-----------------------------------
Can cache: Everything (index + all data)
Hit rate estimate: 100%
Average I/Os per point query: 0 (all in memory)

KEY INSIGHT: The marginal value of buffer pool memory is non-linear.
  - First 1 MB: Critical (caches upper index levels)
  - 1-200 MB: High value (caches all index leaves)
  - 200 MB - 2 GB: Moderate value (caches data)
  - Beyond data size: No additional value

Prioritize buffer pool sizing for:
  1. Upper index levels (mandatory)
  2. Leaf levels of hot indexes
  3. Hot data pages
```

| Buffer Pool Size | % of Index Cached | % of Data Cached | Avg I/Os per Query | Relative Performance |
|---|---|---|---|---|
| 1 MB | 0.5% | 0% | 3.5 | 1x (baseline) |
| 10 MB | 5% | 0.5% | 2.8 | 1.25x |
| 100 MB | 50% | 3% | 1.8 | 1.9x |
| 200 MB | 100% | 5% | 1.0 | 3.5x |
| 500 MB | 100% | 20% | 0.8 | 4.4x |
| 2 GB | 100% | 80% | 0.2 | 17.5x |
| 3 GB (all) | 100% | 100% | 0 | ∞ (in-memory) |
If you have limited memory, prioritize caching index pages over data pages. A fully cached index with uncached data gives 1 I/O per query (data access). A partially cached index with cached data still requires 2-3 I/Os per query (index traversal). The index is smaller and more heavily accessed—cache it first.
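This prioritization argument can be quantified with a simple expected-I/O model. It is a sketch, assuming uniform access and independent per-block hit rates rather than simulating an actual buffer pool; the function name and parameters are illustrative.

```python
def expected_ios_per_query(index_hit_rate: float, data_hit_rate: float,
                           index_levels: int = 3) -> float:
    """Expected physical I/Os for one point query.

    index_levels: index blocks touched per query (the tree height).
    Each block read costs a physical I/O only on a buffer-pool miss.
    """
    index_misses = index_levels * (1 - index_hit_rate)
    data_misses = 1 - data_hit_rate
    return index_misses + data_misses

# Fully cached index, uncached data: exactly 1 physical I/O per query
print(expected_ios_per_query(1.0, 0.0))
# Mostly uncached index, even with fully cached data: still worse
print(expected_ios_per_query(0.2, 1.0))
```

The second case shows why caching data pages before index pages is backwards: every query pays for the index misses on the way down, regardless of how well the data is cached.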
Multi-level tree indexes aren't the only way to access data. Understanding how they compare to alternatives helps justify their near-universal use and identifies scenarios where other approaches might be better.
Alternative access methods:
| Access Method | Point Query I/O | Range Query I/O (5%) | Space Overhead | Update Cost |
|---|---|---|---|---|
| Full Table Scan | 500,000 blocks | 500,000 blocks | 0 bytes | 1 block |
| Hash Index | 1-2 blocks | N/A (no range support) | ~150 MB | 1-2 blocks |
| Single-Level Index | 16 blocks | 16 + 2,000 leaf + data | ~140 MB | Multiple blocks |
| B+-Tree Index | 3-4 blocks | 3 + 2,000 leaf + data | ~160 MB | 2-4 blocks |
| Bitmap Index | ~10 blocks | Depends on conjunction | Varies greatly | Expensive |
```sql
-- Access Method Comparison: Concrete Examples
-- Table: orders (10 million rows, 200 bytes each)
-- Total data: ~2 GB (500,000 blocks at 4 KB)

-- QUERY 1: Point lookup
-- SELECT * FROM orders WHERE order_id = 5000000

-- Access Method        I/O Operations     Time (SSD)   Time (HDD)
-- -------------------  -----------------  -----------  -----------
-- Full Scan            500,000            50 sec       83 min
-- Hash Index           1-2                0.2 ms       20 ms
-- B+-Tree Index        4                  0.4 ms       40 ms

-- QUERY 2: Range query (5% selectivity)
-- SELECT * FROM orders WHERE order_date BETWEEN '2024-01-01' AND '2024-01-31'

-- Access Method        I/O Operations     Time (SSD)   Time (HDD)
-- -------------------  -----------------  -----------  -----------
-- Full Scan            500,000            50 sec       83 min
-- Hash Index           N/A (no range!)    Cannot use   Cannot use
-- B+-Tree (clustered)  27,000             2.7 sec      4.5 min
-- B+-Tree (non-clust.) 502,000            50 sec       84 min

-- QUERY 3: High-selectivity range (90% of table)
-- SELECT * FROM orders WHERE order_date >= '2023-01-01'

-- Access Method        I/O Operations     Time (SSD)   Time (HDD)
-- -------------------  -----------------  -----------  -----------
-- Full Scan            500,000            50 sec       83 min
-- B+-Tree (clustered)  450,000            45 sec       75 min
-- B+-Tree (non-clust.) 9,000,000+         15 min+      25 hours+

-- For high selectivity, full scan wins over non-clustered index!
-- The optimizer should (and usually does) choose full scan here.

-- QUERY 4: AND of multiple conditions
-- SELECT * FROM orders WHERE status = 'shipped' AND region = 'west'

-- Access Method        I/O Operations (if each predicate selects 20%)
-- -------------------  ----------------------------------------------
-- Full Scan            500,000 (but can evaluate both predicates)
-- Single B+-Tree       100,000 (one index) + filter in memory
-- Bitmap Index         ~500 (if low cardinality, bitmap AND is fast)
-- Two B+-Tree indexes  Intersection: ~20,000 (if supported)
```

B+-trees provide the best general-purpose performance: fast point queries, efficient range queries, reasonable update costs, and predictable behavior.
While hash indexes are faster for pure point queries, losing range query capability is rarely acceptable. This versatility makes B+-trees the universal choice for primary database indexes.
The theoretical analysis assumes ideal conditions. Real-world performance is affected by multiple additional factors. Understanding these helps bridge the gap between textbook knowledge and production troubleshooting.
Factors affecting actual search performance:
```sql
-- Demonstrating index fragmentation impact

-- Check fragmentation in SQL Server
SELECT
    object_name(ips.object_id) AS table_name,
    i.name AS index_name,
    ips.avg_fragmentation_in_percent,
    ips.page_count,
    ips.avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(
    DB_ID(), NULL, NULL, NULL, 'DETAILED') ips
JOIN sys.indexes i
    ON ips.object_id = i.object_id
    AND ips.index_id = i.index_id
WHERE ips.avg_fragmentation_in_percent > 10;

-- Example output:
-- table_name   index_name       fragmentation%   pages     space_used%
-- -----------  ---------------  ---------------  --------  ------------
-- orders       ix_order_date    45.2             125000    62.3
-- customers    pk_customer_id   8.7              5000      91.4

-- The orders index is heavily fragmented:
-- - 45% fragmentation means pages are not contiguous
-- - 62% space used means 38% wasted space (from splits/deletes)
--
-- Impact:
-- - Range scans read 1.6x more blocks (1/0.62)
-- - Sequential scans become random I/O
--
-- Solution: Rebuild or reorganize the index
ALTER INDEX ix_order_date ON orders REBUILD;

-- PostgreSQL equivalent: list index sizes
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes;

-- Check bloat (fragmentation) with the pgstattuple extension
SELECT * FROM pgstattuple('orders_pkey');
-- Returns: dead_tuple_percent, free_space, etc.
```

Indexes degrade over time. Without maintenance (rebuild, reorganize), a 4-I/O query can become a 6-8 I/O query due to fragmentation and bloat. Regular index maintenance is not optional for production databases: it's essential for maintaining baseline performance.
Armed with understanding of how search works, we can derive practical optimization strategies. These are the techniques used by database administrators and performance engineers to maximize index efficiency.
Optimization strategies:
```sql
-- Covering Index Example: Eliminating Data Block Access

-- Original query
SELECT customer_id, order_date, total_amount
FROM orders
WHERE customer_id = 12345
ORDER BY order_date DESC
LIMIT 10;

-- Without covering index:
-- 1. Search B+-tree on customer_id (3-4 I/Os)
-- 2. For each matching row, access data block to get order_date, total_amount
-- 3. If customer has 100 orders: potentially 100 additional I/Os!

-- With covering index:
CREATE INDEX ix_orders_covering
ON orders (customer_id, order_date DESC)
INCLUDE (total_amount);

-- Now:
-- 1. Search B+-tree on customer_id (3-4 I/Os)
-- 2. All needed columns are IN THE INDEX - no data block access needed
-- 3. Range scan within index for LIMIT 10: maybe 1 additional leaf block
--
-- Result: 4-5 I/Os instead of 100+ I/Os
--
-- Trade-off: Index is larger (includes total_amount)
-- But for frequently-executed queries, this is almost always worth it

-- EXPLAIN ANALYZE will show "Index Only Scan" when covering works
EXPLAIN ANALYZE
SELECT customer_id, order_date, total_amount
FROM orders
WHERE customer_id = 12345
ORDER BY order_date DESC
LIMIT 10;

-- Good output indicates:
-- "Index Only Scan using ix_orders_covering"
-- "Heap Fetches: 0" (no data block access needed)
```

Covering indexes are one of the highest-impact optimizations available. By including all columns needed by a query in the index, you eliminate the data block access, often the most expensive part of a query. For hot queries, always consider whether a covering index is practical.
Search efficiency in multi-level index trees is the product of algorithmic design, physical storage characteristics, and system configuration. Understanding all three dimensions is essential for database performance engineering.
What's next:
Searches are only half the story. The final page examines index maintenance—how insertions, deletions, and updates are handled while maintaining tree balance and search efficiency. Understanding maintenance costs is essential for write-heavy workloads and helps explain the trade-offs inherent in index design.
You now have a rigorous understanding of search efficiency in multi-level indexes: point query costs, range query dynamics, buffer pool impact, comparison with alternatives, and optimization strategies. This practical knowledge enables informed decisions about index design and database tuning.