Multi-level index trees achieve their remarkable search efficiency through careful structural organization: balanced height, sorted keys, and predictable fanout. But databases aren't static—records are constantly inserted, updated, and deleted. Every modification potentially disrupts the tree's carefully maintained structure.
Index maintenance is the hidden cost of indexing. While we've seen that search requires 3-4 I/Os, maintaining the tree through modifications can require significantly more. Understanding these costs is essential for database design, especially for write-heavy workloads like OLTP systems, streaming data pipelines, and high-frequency trading platforms.
By the end of this page, you will understand: (1) how insertions work, including node splits when nodes overflow, (2) how deletions work, including underflow handling through redistribution and merging, (3) the I/O cost of maintenance operations, (4) how updates affect indexes, and (5) strategies for minimizing maintenance overhead in production systems.
Inserting a new record requires adding an entry to the index tree. The process starts by finding the correct leaf node (just like a search), then inserting the new key-pointer pair. The complexity arises when the target leaf is already full.
The insertion algorithm:
```python
def insert_into_btree(root: BTreeNode, key, data_pointer) -> BTreeNode:
    """
    Insert a key-pointer pair into a B+-tree.
    Returns the (possibly new) root node.

    This implementation uses recursive descent with split propagation.
    Real databases often use iterative approaches with parent pointers.
    """
    def insert_recursive(node: BTreeNode, key, data_pointer):
        """
        Returns: (new_key, new_child) if a split occurred, else (None, None)
          new_key:   the separator key to insert into the parent
          new_child: pointer to the new sibling node created by the split
        """
        if node.is_leaf:
            # Insert into leaf node
            insert_position = find_insert_position(node.keys, key)
            node.keys.insert(insert_position, key)
            node.data_pointers.insert(insert_position, data_pointer)

            if len(node.keys) > node.max_keys:
                # Overflow! Must split this leaf
                return split_leaf(node)
            else:
                return (None, None)  # No split needed
        else:
            # Internal node: find child and recurse
            child_index = find_child_index(node.keys, key)
            child = node.children[child_index]

            # Recursive insert into the appropriate subtree
            (promoted_key, new_child) = insert_recursive(child, key, data_pointer)

            if promoted_key is not None:
                # A split occurred below - must insert separator here
                node.keys.insert(child_index, promoted_key)
                node.children.insert(child_index + 1, new_child)

                if len(node.keys) > node.max_keys:
                    # This internal node also overflows - split it too
                    return split_internal(node)
            return (None, None)

    # Start insertion from the root
    (promoted_key, new_child) = insert_recursive(root, key, data_pointer)

    if promoted_key is not None:
        # Root was split - create a new root
        new_root = BTreeNode(order=root.order, is_leaf=False)
        new_root.keys = [promoted_key]
        new_root.children = [root, new_child]
        return new_root
    return root  # Root unchanged


def split_leaf(node: BTreeNode):
    """
    Split an overflowing leaf node into two nodes.
    Returns: (separator_key, new_sibling)
    """
    mid = len(node.keys) // 2

    # Create new sibling with the upper half of the entries
    sibling = BTreeNode(order=node.order, is_leaf=True)
    sibling.keys = node.keys[mid:]
    sibling.data_pointers = node.data_pointers[mid:]

    # Original node keeps the lower half
    node.keys = node.keys[:mid]
    node.data_pointers = node.data_pointers[:mid]

    # Maintain the leaf sibling chain (for B+-tree range scans)
    sibling.next_leaf = node.next_leaf
    node.next_leaf = sibling

    # Separator key: first key of the new sibling (copied up)
    separator = sibling.keys[0]
    return (separator, sibling)


def split_internal(node: BTreeNode):
    """
    Split an overflowing internal node.
    Returns: (separator_key, new_sibling)

    Key difference from a leaf split: the separator is pushed up, not copied.
    """
    mid = len(node.keys) // 2

    # The middle key is pushed up to the parent (kept in neither child)
    separator = node.keys[mid]

    # Create new sibling with the upper half
    sibling = BTreeNode(order=node.order, is_leaf=False)
    sibling.keys = node.keys[mid + 1:]  # Skip the separator
    sibling.children = node.children[mid + 1:]

    # Original keeps the lower half
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]

    return (separator, sibling)
```

Visualizing a node split:
When a leaf node with capacity 4 receives a 5th entry:
```text
Node Split Example
==================

Before insertion of key 35:

    Parent Node: [20 | 40 | 60]
                 /    |    \
                v     v     v
          [10,15]  [25,30,32,38]  [45,50]
                   ^^^^^^^^^^^^^
                   Target leaf (already full with 4 keys)

After inserting 35 (causes overflow → split):

  Step 1: Insert 35 in sorted position
          [25, 30, 32, 35, 38]   ← 5 entries, overflow!

  Step 2: Split leaf at midpoint
          [25, 30]  |  [32, 35, 38]
             ↑             ↑
          Original    New sibling

  Step 3: Promote separator (32) to parent

    Parent Node: [20 | 32 | 40 | 60]
                 /    |    |    \
                v     v    v     v
          [10,15] [25,30] [32,35,38] [45,50]

Note:
- Separator 32 is COPIED to parent (in B+-tree)
- Separator 32 remains in leaf (data pointer needed)
- Parent now has 4 keys (if max is 4, parent would also split)
```

A single insertion can trigger splits all the way up to the root. If the root splits, a new root is created, increasing tree height by 1. This cascade is rare (its probability decreases exponentially with each level) but can cause significant I/O when it happens. This is why B-trees 'grow from the bottom up.'
Let's quantify the I/O cost of insertions. The cost depends on whether splits occur, which depends on how full the tree nodes are.
Base cost (no split): h reads to descend to the target leaf, plus 1 write for the modified leaf. For our 10-million-record example (h=3): 3 reads + 1 write = 4 I/Os.

Cost with leaf split: 3 reads + 3 writes (the split leaf, its new sibling, and the updated parent) = 6 I/Os.

Cost with cascading split: if the parent also splits, add 2 more writes. If the cascade reaches the root, up to 2h+2 writes in total (two nodes per level, plus a new root).
| Scenario | Probability | Read I/Os | Write I/Os | Total I/Os |
|---|---|---|---|---|
| No split | ~65-70% | 3 | 1 | 4 |
| Leaf split only | ~25-30% | 3 | 3 | 6 |
| Leaf + parent split | ~3-5% | 3 | 5 | 8 |
| Split to root | <1% | 3 | 2h+1 | 2h+4 |
| New root created | <<1% | 3 | 2h+2 | 2h+5 |
```text
Expected Insertion I/O Cost Calculation
========================================

Assumptions:
- Tree height: h = 3
- Fanout: f = 250
- Node fill factor after splits: ~70% (empirical average)

Split probability at each level:

  Naive estimate (full nodes):
    P(leaf split) ≈ 1/f (one split per f insertions on average)
                  ≈ 1/250 ≈ 0.4% per insertion

  But nodes aren't always full. After many random insertions,
  nodes average ~69% full (ln 2 ≈ 0.693), so:

    Average keys per leaf: 0.69 × f ≈ 173 (for f = 250)

    P(leaf split | insertion into this leaf) depends on current
    fullness; averaged over the long term:
      P(split) ≈ 1 / (keys per leaf) ≈ 1/173 ≈ 0.6%

  More accurately, considering that splits create two ~50% full nodes:
    Average split rate ≈ 1 / (0.5 × f) = 1/125 ≈ 0.8% per insertion

  P(parent split | leaf split) ≈ same logic ≈ 0.8%
  P(root split | parent split) ≈ 0.8%

Expected I/O cost:

  E[I/O] = P(no split) × 4 + P(leaf split only) × 6 + P(cascade) × (6+)

  With p = 0.008 (0.8% split rate):

  E[I/O] ≈ (1-p) × 4 + p × (1-p) × 6 + p² × 8 + ...
         ≈ 0.992 × 4 + 0.00794 × 6 + 0.000064 × 8
         ≈ 3.968 + 0.048 + 0.0005
         ≈ 4.02 I/Os per insertion

RESULT: Expected insertion cost is only slightly above search cost!
- Search: 3-4 I/Os (read only)
- Insert: ~4 I/Os (mostly no split; occasional split overhead averages out)

This is remarkably efficient. The B-tree design minimizes overhead.
```

Despite the potential for cascading splits, the expected insertion cost is close to the search cost. B-tree insertions have low write amplification (~1.3x) on average. This is vastly better than maintaining a sorted array (which would require moving half the data on average) and is one reason B-trees are so efficient for OLTP workloads.
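The closed-form estimate above is easy to check in a few lines of Python. The 0.8% per-level split rate and the 4/6/8 I/O costs are the assumptions carried over from the calculation; the truncation after the two-level cascade term matches it as well.

```python
def expected_insert_ios(p=0.008):
    """Expected I/Os per insertion, mirroring the calculation above:
    4 I/Os with no split, 6 with a leaf split, 8 with a two-level
    cascade. p is the assumed per-level split rate (~0.8%)."""
    return (1 - p) * 4 + p * (1 - p) * 6 + p ** 2 * 8

print(round(expected_insert_ios(), 2))  # 4.02
```

With p = 0, the expression collapses to the 4-I/O no-split case, which is a quick sanity check on the model.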
Deleting a record requires removing its entry from the index tree. This is more complex than insertion because we must handle underflow—when a node falls below the minimum occupancy requirement.
A worked example of deletion with underflow handling:
```text
Deletion Example: Handling Underflow
=====================================

Tree parameters: minimum 2 keys per leaf (max 4)

Initial state:

    Parent: [30 | 50]
            /   |   \
           v    v    v
      [10,20] [35,40] [55,60,70,80]
                 ↑
        Delete key 40 from this leaf

Step 1: Delete 40
    [35]  ← only 1 key remaining, minimum is 2 → UNDERFLOW!

Step 2: Check siblings for redistribution

  Option A: Borrow from right sibling (has 4 keys, can donate)
    Before: [35] and [55,60,70,80]
    After:  [35,55] and [60,70,80]
    Parent separator changes: 50 → 60

    Result:
        Parent: [30 | 60]
                /   |   \
               v    v    v
          [10,20] [35,55] [60,70,80]

    I/O: 1 read (sibling) + 3 writes (leaf, sibling, parent) = 4 I/Os

  Option B: Borrow from left sibling (has 2 keys, can donate 1)
    Before: [10,20] and [35]
    After:  [10] and [20,35]
    Parent separator changes: 30 → 20

    But wait: [10] now has only 1 key → left sibling underflows!
    Would need to borrow from ITS sibling, or merge.
    In this case, borrowing from the right is better (more keys available).

What if a sibling can't donate?
===============================

Scenario: delete from [35,40] when the right sibling holds only [55,60]

  After deleting 40: [35]  ← underflow
  Right sibling [55,60] has exactly the minimum and can't donate.

  Must MERGE: [35] + separator(50) + [55,60] → [35,50,55,60]

  Before merge:
      Parent: [30 | 50]
              /   |   \
             v    v    v
        [10,20] [35] [55,60]

  After merge:
      Parent: [30]
              /   \
             v     v
        [10,20] [35,50,55,60]

  I/O: 1 read + 3 writes (leaf removed, sibling modified, parent modified)

Note: the parent lost a key. If the parent underflows, the merge
cascades upward. If the root ends up with 0 keys, tree height
decreases by 1.
```

Many databases use 'lazy deletion': marking entries as deleted without immediately removing them. This avoids underflow handling during the delete operation. The tree is later reorganized during maintenance windows or when a threshold of deleted entries is reached. This trades storage space for reduced deletion overhead.
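The example above can be condensed into code. This is a minimal single-level sketch under the example's assumptions (minimum 2 keys per leaf): the function name `delete_from_leaf` and the list-based node model are illustrative, cascading parent underflow is omitted, and the merge pulls the separator down into the merged node exactly as the worked example does (a pure B+-tree would discard the leaf separator instead).

```python
MIN_KEYS = 2  # minimum occupancy assumed in the example (max 4)

def delete_from_leaf(parent_keys, leaves, i, key):
    """Delete `key` from leaves[i], then repair any underflow.

    parent_keys: the len(leaves) - 1 separator keys of the parent.
    Mirrors the worked example: try redistribution (right sibling,
    then left), merge as a last resort. Mutates the lists in place.
    """
    leaves[i].remove(key)
    if len(leaves[i]) >= MIN_KEYS:
        return  # no underflow, done

    # Option A: borrow the smallest key of the right sibling
    if i + 1 < len(leaves) and len(leaves[i + 1]) > MIN_KEYS:
        leaves[i].append(leaves[i + 1].pop(0))
        parent_keys[i] = leaves[i + 1][0]  # separator: new first key on right
        return

    # Option B: borrow the largest key of the left sibling
    if i > 0 and len(leaves[i - 1]) > MIN_KEYS:
        borrowed = leaves[i - 1].pop()
        leaves[i].insert(0, borrowed)
        parent_keys[i - 1] = borrowed  # separator: the moved key
        return

    # Merge: combine with a sibling, pulling the separator down
    # (as in the example; parent underflow handling omitted here)
    j = i + 1 if i + 1 < len(leaves) else i - 1
    left, right = min(i, j), max(i, j)
    leaves[left] = leaves[left] + [parent_keys[left]] + leaves[right]
    del leaves[right]
    del parent_keys[left]
```

Running the example's two scenarios: with right sibling `[55,60,70,80]`, deleting 40 borrows 55 and the separator becomes 60; with right sibling `[55,60]`, deleting 40 merges into `[35,50,55,60]` and the parent drops to `[30]`.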
Updates come in two flavors, with very different index maintenance costs:
Non-key update: Changing a column that is NOT part of the index key. Example: updating a customer's email address when the index is on customer_id.
Key update: Changing a column that IS part of the index key. Example: correcting a customer_id value.
The difference is profound.
| Update Type | Index Impact | I/O Cost | Complexity |
|---|---|---|---|
| Non-key (covered column) | Update leaf entry value | h reads + 1 write | Low—modify in place |
| Non-key (non-covered) | No index change | 0 index I/Os | None—index unaffected |
| Key update | Delete + Insert | 2× (delete + insert cost) | High—full rebalancing possible |
```sql
-- Update Operation Examples
-- Table: customers with clustered index on customer_id
-- Non-clustered index on (last_name, first_name)

-- SCENARIO 1: Non-key update on non-indexed column
UPDATE customers SET email = 'new@email.com' WHERE customer_id = 12345;
-- Index impact: NONE
-- Clustered index used to find row (3-4 I/Os)
-- Only data page modified
-- last_name index: unchanged
-- Total: 4 I/Os (same as point query + 1 write)

-- SCENARIO 2: Non-key update on indexed column (not key)
-- Assume covering index: (customer_id) INCLUDE (status)
UPDATE customers SET status = 'active' WHERE customer_id = 12345;
-- Index impact: Update index entry to change status value
-- Find entry: 3-4 I/Os
-- Modify leaf entry in place: 1 write
-- Total: 4-5 I/Os

-- SCENARIO 3: Key update (changing primary key - rare!)
UPDATE customers SET customer_id = 99999 WHERE customer_id = 12345;
-- Index impact: MAJOR
-- 1. Delete old entry for customer_id = 12345
--    - Search: 3 I/Os
--    - Delete from leaf: 1 write
--    - Potential underflow handling: 0-4 more I/Os
--
-- 2. Insert new entry for customer_id = 99999
--    - Search: 3 I/Os
--    - Insert into leaf: 1 write
--    - Potential split handling: 0-4 more I/Os
--
-- 3. Data record must also move (clustered index)
--
-- Total: 8-20+ I/Os
-- This is why primary keys should be immutable!

-- SCENARIO 4: Update on secondary index key
UPDATE customers SET last_name = 'Smith' WHERE customer_id = 12345;
-- Secondary index on (last_name, first_name) affected:
-- 1. Delete old entry for (old_last_name, first_name)
-- 2. Insert new entry for ('Smith', first_name)
--
-- Clustered index (customer_id): just data update, no key change
--
-- Total: ~8-12 I/Os (search + delete + insert + data update)

-- Every indexed column update costs more than non-indexed updates!
```

Every index on a table adds maintenance overhead for inserts, deletes, and updates to indexed columns. A table with 5 indexes requires roughly 5x the write I/O of an unindexed table.
This 'index tax' is why good index design balances read performance gains against write performance costs.
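The 'index tax' can be put into a rough cost model. This is a hypothetical no-split estimate, assuming each index has height 3 and each insert touches one heap page plus one leaf page per index; real costs vary with splits, caching, and logging.

```python
def insert_io_estimate(num_indexes, height=3):
    """Rough I/O estimate for inserting one row, with no splits:
    1 heap-page write for the row itself, plus, per index,
    `height` reads to find the leaf and 1 write to modify it.
    Returns (reads, writes)."""
    reads = num_indexes * height
    writes = 1 + num_indexes  # heap page + one leaf per index
    return reads, writes

# An unindexed table vs. a table with 5 indexes:
print(insert_io_estimate(0))  # (0, 1)
print(insert_io_estimate(5))  # (15, 6)
```

Even in this best case the five-index table does 6 writes and 15 reads per insert, which is why every proposed index should justify its read benefit against this per-write cost.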
Row-by-row operations are optimized for transactional workloads. But what about loading millions of rows during data migration, ETL processes, or initial database population? The row-by-row approach becomes prohibitively expensive.
Bulk loading strategies:
```text
Bulk Load Performance Comparison
================================

Scenario: Load 10 million records into a table with a B+-tree index
Record size: 200 bytes
Index entry: 14 bytes
Block size:  4 KB

METHOD 1: Row-by-row INSERT with index maintenance
--------------------------------------------------
Per-row cost: ~4-6 I/Os (search + insert + occasional split)
Total I/Os:   10,000,000 × 5 = 50,000,000 I/Os
SSD time:     50M × 0.1ms = 5,000 seconds ≈ 83 minutes
HDD time:     50M × 10ms = 500,000 seconds ≈ 139 hours (!)

Problem: Completely impractical on HDD, slow even on SSD.

METHOD 2: Sorted bulk load (bottom-up construction)
---------------------------------------------------
Step 1: Sort data by index key
  - External sort: O(n log n) comparisons
  - I/O: ~2-4 passes over data = 2-4 × (10M × 200 bytes / 4 KB)
  - I/O: ~2 million block reads/writes

Step 2: Build leaf level
  - Write full leaf blocks sequentially
  - 10M entries / 250 per leaf = 40,000 leaf blocks
  - I/O: 40,000 sequential writes

Step 3: Build internal levels
  - Level 2: 40,000 / 250 = 160 blocks
  - Level 3: 160 / 250 → 1 block (root)
  - I/O: ~200 writes

Total I/Os: ~2,000,000 + 40,200 ≈ 2,040,000 I/Os
SSD time:   2M × 0.1ms = 200 seconds ≈ 3.3 minutes
HDD time:   2M × 0.1ms (sequential!) ≈ 200 seconds ≈ 3.3 minutes

SPEEDUP: ~25x faster than row-by-row!

METHOD 3: Drop index, load, rebuild
-----------------------------------
Step 1: Drop index
  - Instant

Step 2: Load data (no index maintenance)
  - Simple append to heap: 10M × 200 bytes = 2 GB
  - Sequential writes: 500,000 block writes
  - I/O: 500,000 writes

Step 3: Rebuild index (same as sorted bulk load)
  - But the data is not sorted! Must sort first.
  - Index build: ~2,040,000 I/Os (same as Method 2)

Total: ~2,540,000 I/Os
Slightly slower than Method 2, but simpler to implement.

CONCLUSION: For large loads, NEVER do row-by-row with index maintenance.
Either bulk load with sorted data, or drop and rebuild indexes.
```

When loading large datasets: (1) sort data by primary key before loading, (2) load without secondary indexes, (3) build secondary indexes after the load completes. This can reduce load times from hours to minutes. Most database bulk-loading utilities implement these optimizations automatically.
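The bottom-up build of Method 2 is straightforward to sketch. Assuming completely full nodes (real loaders usually leave fill-factor headroom), the per-level node counts come out exactly as in the comparison above; the function names here are illustrative.

```python
import math

def build_leaves(sorted_keys, fanout):
    """Pack already-sorted keys into full leaf nodes, in the order
    they would be written sequentially to disk (Method 2, step 2)."""
    return [sorted_keys[i:i + fanout]
            for i in range(0, len(sorted_keys), fanout)]

def level_sizes(num_entries, fanout):
    """Node count per level of the bottom-up build,
    leaf level first, root last (Method 2, steps 2-3)."""
    sizes = [math.ceil(num_entries / fanout)]
    while sizes[-1] > 1:
        sizes.append(math.ceil(sizes[-1] / fanout))
    return sizes

print(level_sizes(10_000_000, 250))  # [40000, 160, 1]
```

With 10 million entries and fanout 250 this reproduces the 40,000 leaf blocks, 160 level-2 blocks, and single root from the comparison, so the index-build portion really is only ~40,200 sequential writes.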
Index maintenance isn't just about individual operations—it's an ongoing operational concern. Over time, indexes can develop problems that degrade performance: fragmentation, bloat, and suboptimal structure. Understanding ongoing maintenance is essential for production systems.
Common index degradation issues:
```sql
-- Index Maintenance Operations

-- SQL Server: Check fragmentation
SELECT
    OBJECT_NAME(i.object_id) AS TableName,
    i.name AS IndexName,
    ips.avg_fragmentation_in_percent AS Fragmentation,
    ips.avg_page_space_used_in_percent AS PageDensity
FROM sys.indexes i
JOIN sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
    ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 10
ORDER BY Fragmentation DESC;

-- SQL Server: Reorganize (light maintenance, online)
ALTER INDEX ix_customer_name ON customers REORGANIZE;
-- Defragments leaf level, compacts pages
-- Low overhead, can run during business hours

-- SQL Server: Rebuild (heavy maintenance, may block)
ALTER INDEX ix_customer_name ON customers REBUILD;
-- Drops and recreates entire index
-- Full defragmentation, resets fill factor
-- Can block queries (unless ONLINE option used)

-- PostgreSQL: Check index sizes (a starting point for bloat checks)
SELECT
    schemaname,
    relname AS tablename,
    indexrelname AS indexname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes;

-- PostgreSQL: Reindex (equivalent to rebuild)
REINDEX INDEX ix_customer_name;
-- Or for all indexes on a table:
REINDEX TABLE customers;

-- MySQL/InnoDB: Analyze to update statistics
ANALYZE TABLE customers;

-- MySQL: Optimize to rebuild and defragment
OPTIMIZE TABLE customers;
-- Recreates table and indexes, expensive but thorough

/*
Maintenance Schedule Guidelines:
================================
- Daily: Update statistics for tables with heavy DML
- Weekly: Reorganize indexes with 10-30% fragmentation
- Monthly: Rebuild indexes with >30% fragmentation
- After bulk loads: Always rebuild/reorganize affected indexes

Automation:
- Use maintenance plans (SQL Server)
- Use pg_repack extension (PostgreSQL) for online rebuilds
- Schedule during low-activity windows
- Monitor index health with regular fragmentation checks
*/
```

Proactive index maintenance (scheduled reorganization and statistics updates) is far cheaper than reactive troubleshooting. A fragmented index might slow queries by 2-3x, but diagnosing 'why is the database slow' takes hours. Regular maintenance prevents the crisis before it happens.
Index maintenance is the other side of the indexing coin. While indexes dramatically accelerate reads, they impose ongoing costs for modifications. Understanding these costs is essential for balanced database design.
Module complete:
You've now completed Module 5: Multi-Level Index, covering logarithmic search, tree balance properties, I/O cost analysis, and index maintenance.
This knowledge prepares you for the detailed study of B-trees and B+-trees in the next chapter, where you'll see exactly how these concepts are implemented in production database systems.
Congratulations! You now have a comprehensive understanding of multi-level indexing: from the theoretical foundations (logarithmic search, balance properties) to practical concerns (I/O costs, maintenance overhead, bulk loading). This knowledge is directly applicable to database design, performance tuning, and troubleshooting in production systems.