The B+-tree described in textbooks is an elegant abstraction—clean, mathematical, and focused on asymptotic complexity. Production database implementations are different. They're filled with engineering compromises, platform-specific optimizations, and decades of accumulated wisdom about real-world workloads.
Understanding how major databases actually implement B+-trees reveals where textbook theory and engineering practice diverge.
This page surveys B+-tree implementations across PostgreSQL, MySQL/InnoDB, SQL Server, Oracle, and SQLite—examining their storage formats, locking strategies, and unique features.
By the end of this page, you will understand the implementation details of B+-trees in major database systems, their design philosophies, page formats, locking mechanisms, and distinctive features. This knowledge is essential for database performance tuning and system selection.
Before diving into specific systems, let's understand why implementations vary:
Design Priorities Differ:
Key Implementation Dimensions:
| Aspect | PostgreSQL | InnoDB | SQL Server | Oracle | SQLite |
|---|---|---|---|---|---|
| Page size | 8 KB | 16 KB | 8 KB | 8 KB (configurable) | 4 KB default |
| Max key size | ~2.7 KB | ~3.5 KB | 900 bytes | ~75% of block | No hard limit |
| Row storage | Heap separate | Clustered PK | Optional clustered | Heap or IOT | B-tree rowid |
| MVCC approach | Heap versioning | Undo log | Versioning | Undo segments | Journal |
| Locking | Page + row | Row + gap | Row + key-range | Row + TM | Table/Page |
These differences have practical implications:
No implementation is universally "best." Each represents a coherent set of trade-offs optimized for specific use cases. Understanding these trade-offs helps you select the right database and configure it appropriately for your workload.
PostgreSQL's B-tree implementation, known as nbtree, is the default and most commonly used index type. It's a sophisticated implementation with features accumulated over 25+ years of development.
Key Characteristics:
PostgreSQL Page Layout:
```
┌────────────────────────────────────────────────────┐
│ Page Header (24 bytes)                             │
│ - pd_lsn, pd_flags, pd_lower, pd_upper, pd_special │
├────────────────────────────────────────────────────┤
│ Line Pointer Array                                 │
│ - Fixed 4-byte slots pointing to tuples            │
│ - Grows downward from header                       │
├────────────────────────────────────────────────────┤
│                                                    │
│ Free Space                                         │
│                                                    │
├────────────────────────────────────────────────────┤
│ Index Tuples                                       │
│ - Variable-length (key + TID)                      │
│ - Grows upward from bottom                         │
├────────────────────────────────────────────────────┤
│ Special Area (16 bytes)                            │
│ - btpo_prev, btpo_next: sibling pointers           │
│ - btpo_level: tree level (0 = leaf)                │
│ - btpo_flags: page flags                           │
└────────────────────────────────────────────────────┘
```

PostgreSQL-Specific Features:
High-Key: Each page stores the upper bound key for its contents, enabling lock-free searches during concurrent modifications
Kill Bits: Dead index entries marked with "kill hints" during scans, enabling efficient cleanup
INCLUDE Columns: Non-key columns stored in leaf pages for index-only scans
Parallel Index Scans: B-tree scans can use multiple workers for large indexes
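The pd_lower/pd_upper fields in the page header make free-space accounting a simple subtraction. A minimal Python sketch, parsing a synthetic 24-byte header whose field layout follows PostgreSQL's PageHeaderData (the sample byte values below are made up for illustration):

```python
import struct

# Field layout of PostgreSQL's PageHeaderData (24 bytes total):
# pd_lsn (8), pd_checksum (2), pd_flags (2), pd_lower (2),
# pd_upper (2), pd_special (2), pd_pagesize_version (2), pd_prune_xid (4)
HEADER_FMT = "<QHHHHHHI"

def free_space(header_bytes: bytes) -> int:
    (_lsn, _cksum, _flags, pd_lower, pd_upper,
     _special, _ver, _prune) = struct.unpack(HEADER_FMT, header_bytes)
    # Line pointers grow down to pd_lower; tuples grow up to pd_upper.
    # The hole between them is the page's free space.
    return pd_upper - pd_lower

# Synthetic 8 KB page: 24-byte header + 10 line pointers (40 bytes),
# index tuples start at offset 7800, special area at 8176.
hdr = struct.pack(HEADER_FMT, 0, 0, 0, 24 + 40, 7800, 8176, 8192 | 4, 0)
print(free_space(hdr))  # 7800 - 64 = 7736
```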
```sql
-- Basic B-tree index
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- Covering index with INCLUDE (PostgreSQL 11+)
CREATE INDEX idx_orders_covering ON orders(order_date)
    INCLUDE (total_amount, status);

-- Partial index
CREATE INDEX idx_orders_pending ON orders(created_at)
    WHERE status = 'pending';

-- Index with specific fill factor
CREATE INDEX idx_products ON products(sku) WITH (fillfactor = 80);

-- Deduplication control (PostgreSQL 13+)
CREATE INDEX idx_logs_timestamp ON logs(timestamp)
    WITH (deduplicate_items = on);

-- Check index statistics
SELECT * FROM pgstatindex('idx_orders_customer');
```

For columns with many duplicate values, PostgreSQL 13+ deduplication can reduce index size by 50-90%. Entries like (timestamp, TID1), (timestamp, TID2) become a single posting-list entry (timestamp, [TID1, TID2]). This dramatically shrinks indexes on columns like status, category, or date.
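The space saving from deduplication is easy to model. A rough Python sketch, assuming illustrative per-tuple sizes (the byte constants below are assumptions for the sake of the arithmetic, not exact PostgreSQL overheads):

```python
# Instead of one index tuple per (key, TID) pair, deduplication stores
# each distinct key once with a posting list of TIDs.
KEY_BYTES = 8        # assumed key width (e.g. a timestamp)
TID_BYTES = 6        # heap tuple identifier
TUPLE_OVERHEAD = 8   # assumed per-tuple header cost

def plain_size(n_rows: int) -> int:
    # One index tuple per row.
    return n_rows * (TUPLE_OVERHEAD + KEY_BYTES + TID_BYTES)

def dedup_size(n_keys: int, dups_per_key: int) -> int:
    # One tuple per distinct key, carrying a posting list of TIDs.
    return n_keys * (TUPLE_OVERHEAD + KEY_BYTES + dups_per_key * TID_BYTES)

rows, dups = 1_000_000, 100
before = plain_size(rows)
after = dedup_size(rows // dups, dups)
print(f"{before} -> {after} bytes ({1 - after / before:.0%} smaller)")
```

With 100 duplicates per key the model lands in the 50-90% savings range the text describes; the exact figure depends on key width and duplication factor.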
InnoDB uses B+-trees not just for indexes but as the fundamental storage structure for tables. Every InnoDB table is stored as a clustered index organized by the primary key.
Core Design:
InnoDB Page Structure:
```
┌────────────────────────────────────────────────────┐
│ FIL Header (38 bytes)                              │
│ - Space ID, page number, checksums, LSN            │
├────────────────────────────────────────────────────┤
│ Page Header (56 bytes)                             │
│ - Record counts, heap info, level, index ID        │
├────────────────────────────────────────────────────┤
│ Infimum Record (virtual minimum)                   │
├────────────────────────────────────────────────────┤
│                                                    │
│ User Records (B+-tree nodes)                       │
│ - Variable-length, stored in heap order            │
│ - Each record has: header + key + payload          │
│                                                    │
├────────────────────────────────────────────────────┤
│ Supremum Record (virtual maximum)                  │
├────────────────────────────────────────────────────┤
│                                                    │
│ Free Space                                         │
│                                                    │
├────────────────────────────────────────────────────┤
│ Page Directory (slot array)                        │
│ - Sparse directory, ~4-8 records per slot          │
├────────────────────────────────────────────────────┤
│ FIL Trailer (8 bytes)                              │
│ - Checksum verification                            │
└────────────────────────────────────────────────────┘
```

Key InnoDB Features:
Clustered Table Organization: The primary key IS the table. Secondary indexes are always "covering" for the PK.
Change Buffer: Buffer updates to secondary index pages that aren't in memory, merging later.
Adaptive Hash Index: Automatically builds hash index on frequently accessed B+-tree pages.
Page Splitting: Uses heuristics to split pages at optimal points for sequential inserts.
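The sparse page directory is what makes lookups fast even though records sit in heap order: a binary search over the directory slots narrows the search to a handful of records, which are then scanned linearly. A toy Python model (the slot stride and the sorted-list representation are simplifications; real InnoDB records form a heap-ordered linked list, which is exactly why the directory exists):

```python
import bisect

SLOT_STRIDE = 4  # assumed records per directory slot (InnoDB uses ~4-8)

def build_directory(keys):
    # One directory slot for every SLOT_STRIDE-th record.
    return list(range(0, len(keys), SLOT_STRIDE))

def lookup(keys, directory, target):
    # Step 1: binary search the sparse slots for the last slot key <= target.
    slot_keys = [keys[i] for i in directory]
    s = bisect.bisect_right(slot_keys, target) - 1
    if s < 0:
        return None
    # Step 2: linear scan within that slot's small group of records.
    start = directory[s]
    for i in range(start, min(start + SLOT_STRIDE, len(keys))):
        if keys[i] == target:
            return i
    return None

keys = list(range(0, 200, 5))        # 40 sorted keys: 0, 5, ..., 195
directory = build_directory(keys)    # 10 sparse slots
print(lookup(keys, directory, 135))  # 27
```

The design trades a little per-lookup linear scanning for cheap inserts: adding a record only occasionally requires adjusting a directory slot, rather than shifting a dense offset array.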
Secondary Index Implications:
```
Table: orders (id INT PRIMARY KEY, customer_id INT, order_date DATE, ...)

Clustered Index (Primary Key):
┌─────────────────────────────────────────────────────┐
│ Leaf pages contain FULL ROW DATA                    │
│ Key: id → Value: (id, customer_id, order_date, ...) │
└─────────────────────────────────────────────────────┘

Secondary Index on customer_id:
┌─────────────────────────────────────────────────────┐
│ Leaf pages contain (secondary_key, primary_key)     │
│ Key: customer_id → Value: id                        │
│                                                     │
│ Lookup requires TWO B+-tree traversals:             │
│ 1. Secondary index → find PK value                  │
│ 2. Clustered index → find row data                  │
└─────────────────────────────────────────────────────┘

This "double lookup" is why InnoDB covering indexes are important!
```

In InnoDB, the primary key is stored in every secondary index entry. A 16-byte UUID PK adds 16 bytes to every entry of every secondary index. For a table with 100M rows and 5 secondary indexes, that's 8 GB of additional index storage. Use compact primary keys (INT AUTO_INCREMENT) when possible.
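The storage arithmetic behind that tip, as a quick sketch:

```python
# Every InnoDB secondary index entry carries a copy of the primary key,
# so the PK width is multiplied by rows x number of secondary indexes.
def pk_overhead_bytes(pk_bytes: int, rows: int, secondary_indexes: int) -> int:
    return pk_bytes * rows * secondary_indexes

uuid_cost = pk_overhead_bytes(16, 100_000_000, 5)  # 16-byte UUID PK
int_cost = pk_overhead_bytes(4, 100_000_000, 5)    # 4-byte INT PK
print(f"UUID: {uuid_cost / 1e9:.0f} GB, INT: {int_cost / 1e9:.0f} GB")
# UUID: 8 GB, INT: 2 GB
```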
SQL Server provides flexible B+-tree options: tables can be stored as heaps (no clustered index) or clustered on any indexed column.
SQL Server Design:
SQL Server Page Layout:
```
┌────────────────────────────────────────────────────┐
│ Page Header (96 bytes)                             │
│ - Page type, page ID, LSN, free space info         │
├────────────────────────────────────────────────────┤
│                                                    │
│ Row Data                                           │
│ - Fixed-length columns first                       │
│ - Variable-length columns after                    │
│ - NULL bitmap                                      │
│                                                    │
├────────────────────────────────────────────────────┤
│                                                    │
│ Free Space                                         │
│                                                    │
├────────────────────────────────────────────────────┤
│ Slot Array (Row Offset Array)                      │
│ - 2 bytes per row, grows backward                  │
│ - Points to row start within page                  │
└────────────────────────────────────────────────────┘
```

SQL Server-Specific Features:
Included Columns: Non-key columns in leaf level only (like PostgreSQL INCLUDE)
Filtered Indexes: Indexes with WHERE clauses, stored only for matching rows
Columnstore Indexes: Columnar storage with B+-tree integration for hybrid queries
Online Operations: Most index operations can proceed while table is in use
Compression: Page and row-level compression built into B+-tree storage
```sql
-- Clustered index (table becomes the index)
CREATE CLUSTERED INDEX ix_orders_cluster ON orders(order_date);

-- Non-clustered with included columns
CREATE NONCLUSTERED INDEX ix_orders_cust ON orders(customer_id)
    INCLUDE (total_amount, status);

-- Filtered index
CREATE NONCLUSTERED INDEX ix_orders_pending ON orders(created_at)
    WHERE status = 'pending'
    WITH (FILLFACTOR = 90);

-- Online, resumable index creation (Enterprise)
CREATE INDEX ix_products_sku ON products(sku)
    WITH (ONLINE = ON, RESUMABLE = ON);

-- Compressed index
CREATE INDEX ix_logs_compressed ON logs(timestamp)
    WITH (DATA_COMPRESSION = PAGE);

-- Pause and resume a long-running rebuild
-- (ALTER INDEX requires the table name; big_table is a placeholder)
ALTER INDEX ix_large ON big_table REBUILD
    WITH (ONLINE = ON, RESUMABLE = ON);
-- Later...
ALTER INDEX ix_large ON big_table RESUME;
```

SQL Server lets you choose: heaps are faster for bulk inserts and updates; clustered indexes are faster for range scans. For OLTP workloads with point queries and range scans, a clustered index on the date or primary key usually wins. For ETL staging tables with bulk operations, heaps may be faster.
Oracle's B+-tree implementation is optimized for enterprise-scale workloads with features for high concurrency, massive data volumes, and complex deployment topologies.
Oracle Design:
Key Oracle Features:
```sql
-- Standard B-tree index
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- Reverse key to reduce contention
CREATE INDEX idx_orders_seq ON orders(order_seq) REVERSE;

-- Key compression
CREATE INDEX idx_orders_composite
    ON orders(region, category, product)
    COMPRESS 2;  -- Compress first 2 columns

-- Index-Organized Table (clustered)
CREATE TABLE order_items (
    order_id   NUMBER,
    line_num   NUMBER,
    product_id NUMBER,
    quantity   NUMBER,
    PRIMARY KEY (order_id, line_num)
) ORGANIZATION INDEX;

-- Invisible index for testing
CREATE INDEX idx_test ON orders(new_column) INVISIBLE;
ALTER SESSION SET optimizer_use_invisible_indexes = TRUE;
-- Test query plans, then make visible if helpful
ALTER INDEX idx_test VISIBLE;

-- Monitor index usage
ALTER INDEX idx_orders_customer MONITORING USAGE;
-- Check usage later
SELECT * FROM v$object_usage WHERE index_name = 'IDX_ORDERS_CUSTOMER';
```

The reverse key index is Oracle's solution to "hot spot" contention. When many sessions insert sequentially-increasing keys (like sequences), they all contend for the rightmost leaf block. Reversing the key bytes distributes inserts across the entire index, eliminating the bottleneck—at the cost of disabling range scans on the column.
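The byte-reversal trick can be illustrated in a few lines of Python (a sketch of the idea, not Oracle's actual on-disk key encoding):

```python
# Reversing a key's bytes turns adjacent sequence values into widely
# separated index keys, so concurrent inserters stop piling onto a
# single rightmost leaf block.
def reverse_key(n: int, width: int = 4) -> bytes:
    return n.to_bytes(width, "big")[::-1]

for n in [1000, 1001, 1002, 1003]:
    print(n, reverse_key(n).hex())
# Consecutive values now differ in their FIRST byte, so as index keys
# they sort far apart and land on different leaf blocks.
```

The same property is what breaks range scans: values that were adjacent in the original domain are scattered across the whole key space.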
SQLite's B+-tree implementation prioritizes simplicity and reliability over raw performance. As an embedded database, it makes different trade-offs than server databases.
SQLite Design:
SQLite B+-Tree Structure:
```
SQLite uses two types of B+-trees:

1. TABLE B+-TREE (for rowid tables):
   ┌─────────────────────────────────────────┐
   │ Internal pages: rowid ranges            │
   │ Leaf pages: rowid → complete row data   │
   └─────────────────────────────────────────┘
   Key: integer rowid (8 bytes max)
   Value: all column data for the row

2. INDEX B+-TREE:
   ┌─────────────────────────────────────────┐
   │ Internal pages: index key ranges        │
   │ Leaf pages: index key → rowid           │
   └─────────────────────────────────────────┘
   Key: indexed column(s)
   Value: rowid of the row

Key insight: Every SQLite table IS a B+-tree. A "rowid" lookup is an
O(log n) tree traversal, not O(1)!
```

SQLite-Specific Features:
WITHOUT ROWID Tables: Store record by PRIMARY KEY instead of hidden rowid (like InnoDB)
Partial Indexes: WHERE clause on index definition
Expression Indexes: Index on computed values
ANALYZE: Collect statistics for query planner
Incremental Vacuum: Reclaim space without full rebuild
```sql
-- Standard index
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- WITHOUT ROWID table (clustered by PK)
CREATE TABLE sessions (
    session_id TEXT PRIMARY KEY,
    user_id    INTEGER,
    expires_at INTEGER
) WITHOUT ROWID;

-- Partial index
CREATE INDEX idx_active_users ON users(last_login) WHERE is_active = 1;

-- Expression index
CREATE INDEX idx_orders_year ON orders(strftime('%Y', order_date));

-- Covering index (all columns in index)
CREATE INDEX idx_orders_cover ON orders(customer_id, order_date, total);

-- Analyze for statistics
ANALYZE;          -- All tables
ANALYZE orders;   -- Specific table

-- Check index usage
EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 123;
```

For tables where the PRIMARY KEY is the main access pattern (like key-value stores or session tables), WITHOUT ROWID eliminates one B+-tree traversal. Instead of index → rowid → table, the lookup goes directly to the table B+-tree organized by PK. This can improve performance significantly for PK-heavy workloads.
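Because SQLite is embedded, this trade-off is easy to inspect from Python's built-in sqlite3 module. A small runnable sketch (the exact EXPLAIN QUERY PLAN wording varies across SQLite versions):

```python
import sqlite3

# A WITHOUT ROWID table is stored as a single B+-tree keyed by the
# primary key, so a PK lookup is one tree descent.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sessions (
        session_id TEXT PRIMARY KEY,
        user_id    INTEGER,
        expires_at INTEGER
    ) WITHOUT ROWID
""")
conn.execute("INSERT INTO sessions VALUES ('abc123', 42, 1700000000)")

# EXPLAIN QUERY PLAN shows the search going straight through the
# primary key, with no separate index-then-rowid step.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT user_id FROM sessions WHERE session_id = 'abc123'"
).fetchall()
print(plan[0][3])  # e.g. "SEARCH sessions USING PRIMARY KEY (session_id=?)"
```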
B+-tree concurrency control varies significantly across databases. The choice of locking granularity affects both concurrency and overhead.
Locking Approaches:
| Database | Read Strategy | Write Strategy | Phantom Protection |
|---|---|---|---|
| PostgreSQL | MVCC snapshot | Page locks during modification | SERIALIZABLE mode |
| InnoDB | MVCC + consistent read | Row locks + next-key locking | Gap locks |
| SQL Server | Row/page locks | Key-range locks | Key-range locking |
| Oracle | MVCC (multi-version) | Row locks only | SERIALIZABLE mode |
| SQLite | Table-level (default) | Table-level | SERIALIZABLE default |
InnoDB's Next-Key Locking:
InnoDB is notable for its next-key locks—locking both the index record AND the gap before it. This prevents phantom reads without table-level locks.
```
Index values: 10, 15, 20, 25

Query: SELECT * FROM t WHERE id = 15 FOR UPDATE;

Standard row lock (what you might expect):
- Lock on record with id=15 only

InnoDB next-key lock (what actually happens):
- Lock on record 15
- Gap lock on (10, 15) - prevents insertion of 11, 12, 13, 14

Query: SELECT * FROM t WHERE id > 15 AND id < 25 FOR UPDATE;

InnoDB next-key locks:
- Lock on record 20
- Gap lock on (15, 20)
- Gap lock on (20, 25)

This prevents:
- Insertions of 16, 17, 18, 19 (phantom would appear)
- Insertions of 21, 22, 23, 24 (phantom would appear)
```

InnoDB's gap locks can cause surprising contention. A SELECT ... FOR UPDATE on a range locks gaps, blocking inserts from other transactions—even if those inserts are for non-overlapping values. Understanding this is essential for debugging lock wait issues in InnoDB.
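A toy Python model of the range-lock example makes the blocking behavior concrete. This deliberately simplifies InnoDB's lock modes down to "locked gaps"; the exclusive bounds and lock granularity here are modeling assumptions, not InnoDB internals:

```python
# Model: a range SELECT ... FOR UPDATE locks each matching record plus
# the gap before it, and the gap up to the next record past the range.
index_keys = [10, 15, 20, 25]

def next_key_locks(lo, hi):
    """Gaps locked by `WHERE id > lo AND id < hi FOR UPDATE`."""
    locked_gaps = []
    for i, k in enumerate(index_keys):
        if lo < k < hi:  # record k matches the range
            prev = index_keys[i - 1] if i else float("-inf")
            locked_gaps.append((prev, k))        # gap before the record
    nxt = next((k for k in index_keys if k >= hi), float("inf"))
    last = max((k for k in index_keys if lo < k < hi), default=lo)
    locked_gaps.append((last, nxt))              # gap up to the next key
    return locked_gaps

def insert_blocked(value, gaps):
    return any(a < value < b for a, b in gaps)

gaps = next_key_locks(15, 25)    # [(15, 20), (20, 25)]
print(insert_blocked(17, gaps))  # True  - phantom would appear in (15, 20)
print(insert_blocked(23, gaps))  # True  - phantom would appear in (20, 25)
print(insert_blocked(8, gaps))   # False - outside the locked gaps
```

Note that an insert of 23 is blocked even though the query matched only record 20: that is exactly the "surprising contention" described above.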
Modern databases apply various optimizations to B+-tree page formats to improve cache efficiency, reduce I/O, and increase effective fanout.
Common Optimizations:
InnoDB Page Compression Example:
```sql
-- InnoDB table with page compression
CREATE TABLE logs (
    id BIGINT PRIMARY KEY,
    timestamp DATETIME,
    message TEXT
) ROW_FORMAT=COMPRESSED  -- Legacy compressed format
  KEY_BLOCK_SIZE=8;      -- Compress to 8KB target

-- Modern transparent page compression (MySQL 5.7+)
CREATE TABLE logs_modern (
    id BIGINT PRIMARY KEY,
    timestamp DATETIME,
    message TEXT
) COMPRESSION='lz4';  -- Use LZ4 algorithm

-- Enable compression for an existing table
-- (InnoDB page compression supports 'zlib' and 'lz4')
ALTER TABLE existing_logs COMPRESSION='zlib';
OPTIMIZE TABLE existing_logs;  -- Rebuild with compression
```

Page compression reduces I/O at the cost of CPU. For I/O-bound workloads (traditional HDDs, slow storage), compression is almost always beneficial. For CPU-bound workloads on fast NVMe, the CPU overhead may hurt more than the I/O savings help. Benchmark with realistic workloads.
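The ratio side of that trade-off is easy to measure. A sketch using Python's stdlib zlib on a synthetic, highly repetitive 16 KB "page" (real InnoDB compresses whole pages with zlib or LZ4; actual ratios depend entirely on your data):

```python
import zlib

# Build a 16 KB "page" of repetitive log text - the best case for
# compression, chosen to show the upper end of the savings.
line = b"2024-01-01 12:00:00 INFO request handled status=200 path=/api/orders\n"
page = (line * 240)[:16384]

compressed = zlib.compress(page, level=6)
ratio = len(compressed) / len(page)
print(f"{len(page)} -> {len(compressed)} bytes ({ratio:.0%} of original)")
```

Running the same measurement on samples of your own pages (or already-compressed blobs, where the ratio collapses to ~100%) is a cheap first estimate before benchmarking the CPU cost.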
Understanding B+-tree implementation differences helps inform database selection for specific use cases.
| Use Case | Best Fit | Reason |
|---|---|---|
| Read-heavy OLTP | PostgreSQL, Oracle | Excellent MVCC, read/write separation |
| Write-heavy OLTP | InnoDB (MySQL) | Change buffer, compact storage |
| Mixed OLTP | SQL Server | Flexible clustering, online ops |
| High-concurrency PK inserts | Oracle + reverse key | Eliminates insert hot spots |
| Embedded/Edge | SQLite | Single file, zero config |
| Time-series append | InnoDB, PostgreSQL | Efficient rightmost-leaf appends |
| Large text keys | PostgreSQL | Best prefix compression, dedup |
Key Questions When Selecting:
What's your primary key pattern?
Read/write ratio?
Key characteristics?
Concurrency pattern?
Each database represents coherent engineering trade-offs. PostgreSQL's heap-separate-from-index design differs fundamentally from InnoDB's clustered approach, and each excels in different scenarios. The "best" database depends entirely on your workload characteristics.
Production B+-tree implementations go far beyond textbook descriptions, incorporating decades of engineering optimizations and design decisions.
What's Next:
We've now surveyed how production databases implement B+-trees. In the final page of this module, we'll explore optimization techniques—the advanced strategies databases use to squeeze maximum performance from B+-tree indexes, including cache-oblivious algorithms, write-optimized structures, and modern hardware adaptations.
You now understand how major databases implement B+-trees in practice, their page formats, locking strategies, and unique features. This knowledge enables you to interpret database-specific documentation, tune for your workload, and make informed system selection decisions.