The B+-tree described in textbooks is an elegant abstraction—clean, mathematical, and focused on asymptotic complexity. Production database implementations are different. They're filled with engineering compromises, platform-specific optimizations, and decades of accumulated wisdom about real-world workloads.
Understanding how major databases actually implement B+-trees reveals where textbook theory and engineering practice diverge.
This page surveys B+-tree implementations across PostgreSQL, MySQL/InnoDB, SQL Server, Oracle, and SQLite—examining their storage formats, locking strategies, and unique features.
By the end of this page, you will understand the implementation details of B+-trees in major database systems, their design philosophies, page formats, locking mechanisms, and distinctive features. This knowledge is essential for database performance tuning and system selection.
Before diving into specific systems, let's understand why implementations vary:
Design Priorities Differ:
Key Implementation Dimensions:
| Aspect | PostgreSQL | InnoDB | SQL Server | Oracle | SQLite |
|---|---|---|---|---|---|
| Page size | 8 KB | 16 KB | 8 KB | 8 KB (configurable) | 4 KB default |
| Max key size | ~2.7 KB | ~3.5 KB | 900 bytes | ~75% of block | No hard limit |
| Row storage | Heap separate | Clustered PK | Optional clustered | Heap or IOT | B-tree rowid |
| MVCC approach | Heap versioning | Undo log | Versioning | Undo segments | Journal |
| Locking | Page + row | Row + gap | Row + key-range | Row + TM | Table/Page |
These differences have practical implications:
No implementation is universally "best." Each represents a coherent set of trade-offs optimized for specific use cases. Understanding these trade-offs helps you select the right database and configure it appropriately for your workload.
PostgreSQL's B-tree implementation, known as nbtree, is the default and most commonly used index type. It's a sophisticated implementation with features accumulated over 25+ years of development.
Key Characteristics:
PostgreSQL Page Layout:
```
┌────────────────────────────────────────────────────┐
│ Page Header (24 bytes)                             │
│ - pd_lsn, pd_flags, pd_lower, pd_upper, pd_special │
├────────────────────────────────────────────────────┤
│ Line Pointer Array                                 │
│ - Fixed 4-byte slots pointing to tuples            │
│ - Grows downward from header                       │
├────────────────────────────────────────────────────┤
│                                                    │
│ Free Space                                         │
│                                                    │
├────────────────────────────────────────────────────┤
│ Index Tuples                                       │
│ - Variable-length (key + TID)                      │
│ - Grows upward from bottom                         │
├────────────────────────────────────────────────────┤
│ Special Area (16 bytes)                            │
│ - btpo_prev, btpo_next: sibling pointers           │
│ - btpo_level: tree level (0 = leaf)                │
│ - btpo_flags: page flags                           │
└────────────────────────────────────────────────────┘
```

PostgreSQL-Specific Features:
High-Key: Each page stores the upper bound key for its contents, enabling lock-free searches during concurrent modifications
Kill Bits: Dead index entries marked with "kill hints" during scans, enabling efficient cleanup
INCLUDE Columns: Non-key columns stored in leaf pages for index-only scans
Parallel Index Scans: B-tree scans can use multiple workers for large indexes
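The pd_lower/pd_upper fields in the page header make free-space accounting a simple subtraction. A minimal Python sketch, parsing a synthetic 24-byte header whose field layout follows PostgreSQL's PageHeaderData (the sample byte values below are made up for illustration):

```python
import struct

# Field layout of PostgreSQL's PageHeaderData (24 bytes total):
# pd_lsn (8), pd_checksum (2), pd_flags (2), pd_lower (2),
# pd_upper (2), pd_special (2), pd_pagesize_version (2), pd_prune_xid (4)
HEADER_FMT = "<QHHHHHHI"

def free_space(header_bytes: bytes) -> int:
    (_lsn, _cksum, _flags, pd_lower, pd_upper,
     _special, _ver, _prune) = struct.unpack(HEADER_FMT, header_bytes)
    # Line pointers grow down to pd_lower; tuples grow up to pd_upper.
    # The hole between them is the page's free space.
    return pd_upper - pd_lower

# Synthetic 8 KB page: 24-byte header + 10 line pointers (40 bytes),
# index tuples start at offset 7800, special area at 8176.
hdr = struct.pack(HEADER_FMT, 0, 0, 0, 24 + 40, 7800, 8176, 8192 | 4, 0)
print(free_space(hdr))  # 7800 - 64 = 7736
```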
```sql
-- Basic B-tree index
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- Covering index with INCLUDE (PostgreSQL 11+)
CREATE INDEX idx_orders_covering ON orders(order_date)
    INCLUDE (total_amount, status);

-- Partial index
CREATE INDEX idx_orders_pending ON orders(created_at)
    WHERE status = 'pending';

-- Index with specific fill factor
CREATE INDEX idx_products ON products(sku) WITH (fillfactor = 80);

-- Deduplication control (PostgreSQL 13+)
CREATE INDEX idx_logs_timestamp ON logs(timestamp)
    WITH (deduplicate_items = on);

-- Check index statistics
SELECT * FROM pgstatindex('idx_orders_customer');
```

For columns with many duplicate values, PostgreSQL 13+ deduplication can reduce index size by 50-90%. Entries like (timestamp, TID1), (timestamp, TID2) become a single posting-list entry (timestamp, [TID1, TID2]). This dramatically shrinks indexes on columns like status, category, or date.
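The space saving from deduplication is easy to model. A rough Python sketch, assuming illustrative per-tuple sizes (the byte constants below are assumptions for the sake of the arithmetic, not exact PostgreSQL overheads):

```python
# Instead of one index tuple per (key, TID) pair, deduplication stores
# each distinct key once with a posting list of TIDs.
KEY_BYTES = 8        # assumed key width (e.g. a timestamp)
TID_BYTES = 6        # heap tuple identifier
TUPLE_OVERHEAD = 8   # assumed per-tuple header cost

def plain_size(n_rows: int) -> int:
    # One index tuple per row.
    return n_rows * (TUPLE_OVERHEAD + KEY_BYTES + TID_BYTES)

def dedup_size(n_keys: int, dups_per_key: int) -> int:
    # One tuple per distinct key, carrying a posting list of TIDs.
    return n_keys * (TUPLE_OVERHEAD + KEY_BYTES + dups_per_key * TID_BYTES)

rows, dups = 1_000_000, 100
before = plain_size(rows)
after = dedup_size(rows // dups, dups)
print(f"{before} -> {after} bytes ({1 - after / before:.0%} smaller)")
```

With 100 duplicates per key the model lands in the 50-90% savings range the text describes; the exact figure depends on key width and duplication factor.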
InnoDB uses B+-trees not just for indexes but as the fundamental storage structure for tables. Every InnoDB table is stored as a clustered index organized by the primary key.
Core Design:
InnoDB Page Structure:
```
┌────────────────────────────────────────────────────┐
│ FIL Header (38 bytes)                              │
│ - Space ID, page number, checksums, LSN            │
├────────────────────────────────────────────────────┤
│ Page Header (56 bytes)                             │
│ - Record counts, heap info, level, index ID        │
├────────────────────────────────────────────────────┤
│ Infimum Record (virtual minimum)                   │
├────────────────────────────────────────────────────┤
│                                                    │
│ User Records (B+-tree nodes)                       │
│ - Variable-length, stored in heap order            │
│ - Each record has: header + key + payload          │
│                                                    │
├────────────────────────────────────────────────────┤
│ Supremum Record (virtual maximum)                  │
├────────────────────────────────────────────────────┤
│                                                    │
│ Free Space                                         │
│                                                    │
├────────────────────────────────────────────────────┤
│ Page Directory (slot array)                        │
│ - Sparse directory, ~4-8 records per slot          │
├────────────────────────────────────────────────────┤
│ FIL Trailer (8 bytes)                              │
│ - Checksum verification                            │
└────────────────────────────────────────────────────┘
```

Key InnoDB Features:
Clustered Table Organization: The primary key IS the table. Secondary indexes are always "covering" for the PK.
Change Buffer: Buffer updates to secondary index pages that aren't in memory, merging later.
Adaptive Hash Index: Automatically builds hash index on frequently accessed B+-tree pages.
Page Splitting: Uses heuristics to split pages at optimal points for sequential inserts.
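The sparse page directory is what makes lookups fast even though records sit in heap order: a binary search over the directory slots narrows the search to a handful of records, which are then scanned linearly. A toy Python model (the slot stride and the sorted-list representation are simplifications; real InnoDB records form a heap-ordered linked list, which is exactly why the directory exists):

```python
import bisect

SLOT_STRIDE = 4  # assumed records per directory slot (InnoDB uses ~4-8)

def build_directory(keys):
    # One directory slot for every SLOT_STRIDE-th record.
    return list(range(0, len(keys), SLOT_STRIDE))

def lookup(keys, directory, target):
    # Step 1: binary search the sparse slots for the last slot key <= target.
    slot_keys = [keys[i] for i in directory]
    s = bisect.bisect_right(slot_keys, target) - 1
    if s < 0:
        return None
    # Step 2: linear scan within that slot's small group of records.
    start = directory[s]
    for i in range(start, min(start + SLOT_STRIDE, len(keys))):
        if keys[i] == target:
            return i
    return None

keys = list(range(0, 200, 5))        # 40 sorted keys: 0, 5, ..., 195
directory = build_directory(keys)    # 10 sparse slots
print(lookup(keys, directory, 135))  # 27
```

The design trades a little per-lookup linear scanning for cheap inserts: adding a record only occasionally requires adjusting a directory slot, rather than shifting a dense offset array.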
Secondary Index Implications:
```
Table: orders (id INT PRIMARY KEY, customer_id INT, order_date DATE, ...)

Clustered Index (Primary Key):
┌─────────────────────────────────────────────────────┐
│ Leaf pages contain FULL ROW DATA                    │
│ Key: id → Value: (id, customer_id, order_date, ...) │
└─────────────────────────────────────────────────────┘

Secondary Index on customer_id:
┌─────────────────────────────────────────────────────┐
│ Leaf pages contain (secondary_key, primary_key)     │
│ Key: customer_id → Value: id                        │
│                                                     │
│ Lookup requires TWO B+-tree traversals:             │
│ 1. Secondary index → find PK value                  │
│ 2. Clustered index → find row data                  │
└─────────────────────────────────────────────────────┘

This "double lookup" is why InnoDB covering indexes are important!
```

In InnoDB, the primary key is stored in every secondary index entry. A 16-byte UUID PK adds 16 bytes to every entry of every secondary index. For a table with 100M rows and 5 secondary indexes, that's 8 GB of additional index storage. Use compact primary keys (INT AUTO_INCREMENT) when possible.
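The storage arithmetic behind that tip, as a quick sketch:

```python
# Every InnoDB secondary index entry carries a copy of the primary key,
# so the PK width is multiplied by rows x number of secondary indexes.
def pk_overhead_bytes(pk_bytes: int, rows: int, secondary_indexes: int) -> int:
    return pk_bytes * rows * secondary_indexes

uuid_cost = pk_overhead_bytes(16, 100_000_000, 5)  # 16-byte UUID PK
int_cost = pk_overhead_bytes(4, 100_000_000, 5)    # 4-byte INT PK
print(f"UUID: {uuid_cost / 1e9:.0f} GB, INT: {int_cost / 1e9:.0f} GB")
# UUID: 8 GB, INT: 2 GB
```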
SQL Server provides flexible B+-tree options: tables can be stored as heaps (no clustered index) or clustered on any indexed column.
SQL Server Design:
SQL Server Page Layout:
```
┌────────────────────────────────────────────────────┐
│ Page Header (96 bytes)                             │
│ - Page type, page ID, LSN, free space info         │
├────────────────────────────────────────────────────┤
│                                                    │
│ Row Data                                           │
│ - Fixed-length columns first                       │
│ - Variable-length columns after                    │
│ - NULL bitmap                                      │
│                                                    │
├────────────────────────────────────────────────────┤
│                                                    │
│ Free Space                                         │
│                                                    │
├────────────────────────────────────────────────────┤
│ Slot Array (Row Offset Array)                      │
│ - 2 bytes per row, grows backward                  │
│ - Points to row start within page                  │
└────────────────────────────────────────────────────┘
```

SQL Server-Specific Features:
Included Columns: Non-key columns in leaf level only (like PostgreSQL INCLUDE)
Filtered Indexes: Indexes with WHERE clauses, stored only for matching rows
Columnstore Indexes: Columnar storage with B+-tree integration for hybrid queries
Online Operations: Most index operations can proceed while table is in use
Compression: Page and row-level compression built into B+-tree storage
```sql
-- Clustered index (table becomes the index)
CREATE CLUSTERED INDEX ix_orders_cluster ON orders(order_date);

-- Non-clustered with included columns
CREATE NONCLUSTERED INDEX ix_orders_cust ON orders(customer_id)
    INCLUDE (total_amount, status);

-- Filtered index
CREATE NONCLUSTERED INDEX ix_orders_pending ON orders(created_at)
    WHERE status = 'pending'
    WITH (FILLFACTOR = 90);

-- Online, resumable index creation (Enterprise)
CREATE INDEX ix_products_sku ON products(sku)
    WITH (ONLINE = ON, RESUMABLE = ON);

-- Compressed index
CREATE INDEX ix_logs_compressed ON logs(timestamp)
    WITH (DATA_COMPRESSION = PAGE);

-- Pause and resume a long-running rebuild
-- (ALTER INDEX requires the table name; big_table is a placeholder)
ALTER INDEX ix_large ON big_table REBUILD
    WITH (ONLINE = ON, RESUMABLE = ON);
-- Later...
ALTER INDEX ix_large ON big_table RESUME;
```

SQL Server lets you choose: heaps are faster for bulk inserts and updates; clustered indexes are faster for range scans. For OLTP workloads with point queries and range scans, a clustered index on the date or primary key usually wins. For ETL staging tables with bulk operations, heaps may be faster.
Oracle's B+-tree implementation is optimized for enterprise-scale workloads with features for high concurrency, massive data volumes, and complex deployment topologies.
Oracle Design:
Key Oracle Features:
```sql
-- Standard B-tree index
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- Reverse key to reduce contention
CREATE INDEX idx_orders_seq ON orders(order_seq) REVERSE;

-- Key compression
CREATE INDEX idx_orders_composite
    ON orders(region, category, product)
    COMPRESS 2;  -- Compress first 2 columns

-- Index-Organized Table (clustered)
CREATE TABLE order_items (
    order_id   NUMBER,
    line_num   NUMBER,
    product_id NUMBER,
    quantity   NUMBER,
    PRIMARY KEY (order_id, line_num)
) ORGANIZATION INDEX;

-- Invisible index for testing
CREATE INDEX idx_test ON orders(new_column) INVISIBLE;
ALTER SESSION SET optimizer_use_invisible_indexes = TRUE;
-- Test query plans, then make visible if helpful
ALTER INDEX idx_test VISIBLE;

-- Monitor index usage
ALTER INDEX idx_orders_customer MONITORING USAGE;
-- Check usage later
SELECT * FROM v$object_usage WHERE index_name = 'IDX_ORDERS_CUSTOMER';
```

The reverse key index is Oracle's solution to "hot spot" contention. When many sessions insert sequentially-increasing keys (like sequences), they all contend for the rightmost leaf block. Reversing the key bytes distributes inserts across the entire index, eliminating the bottleneck—at the cost of disabling range scans on the column.
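The byte-reversal trick can be illustrated in a few lines of Python (a sketch of the idea, not Oracle's actual on-disk key encoding):

```python
# Reversing a key's bytes turns adjacent sequence values into widely
# separated index keys, so concurrent inserters stop piling onto a
# single rightmost leaf block.
def reverse_key(n: int, width: int = 4) -> bytes:
    return n.to_bytes(width, "big")[::-1]

for n in [1000, 1001, 1002, 1003]:
    print(n, reverse_key(n).hex())
# Consecutive values now differ in their FIRST byte, so as index keys
# they sort far apart and land on different leaf blocks.
```

The same property is what breaks range scans: values that were adjacent in the original domain are scattered across the whole key space.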
SQLite's B+-tree implementation prioritizes simplicity and reliability over raw performance. As an embedded database, it makes different trade-offs than server databases.
SQLite Design:
SQLite B+-Tree Structure:
```
SQLite uses two types of B+-trees:

1. TABLE B+-TREE (for rowid tables):
   ┌─────────────────────────────────────────┐
   │ Internal pages: rowid ranges            │
   │ Leaf pages: rowid → complete row data   │
   └─────────────────────────────────────────┘
   Key: integer rowid (8 bytes max)
   Value: all column data for the row

2. INDEX B+-TREE:
   ┌─────────────────────────────────────────┐
   │ Internal pages: index key ranges        │
   │ Leaf pages: index key → rowid           │
   └─────────────────────────────────────────┘
   Key: indexed column(s)
   Value: rowid of the row

Key insight: Every SQLite table IS a B+-tree. A "rowid" lookup is an
O(log n) tree traversal, not O(1)!
```

SQLite-Specific Features:
WITHOUT ROWID Tables: Store record by PRIMARY KEY instead of hidden rowid (like InnoDB)
Partial Indexes: WHERE clause on index definition
Expression Indexes: Index on computed values
ANALYZE: Collect statistics for query planner
Incremental Vacuum: Reclaim space without full rebuild
```sql
-- Standard index
CREATE INDEX idx_orders_customer ON orders(customer_id);

-- WITHOUT ROWID table (clustered by PK)
CREATE TABLE sessions (
    session_id TEXT PRIMARY KEY,
    user_id    INTEGER,
    expires_at INTEGER
) WITHOUT ROWID;

-- Partial index
CREATE INDEX idx_active_users ON users(last_login) WHERE is_active = 1;

-- Expression index
CREATE INDEX idx_orders_year ON orders(strftime('%Y', order_date));

-- Covering index (all columns in index)
CREATE INDEX idx_orders_cover ON orders(customer_id, order_date, total);

-- Analyze for statistics
ANALYZE;          -- All tables
ANALYZE orders;   -- Specific table

-- Check index usage
EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 123;
```

For tables where the PRIMARY KEY is the main access pattern (like key-value stores or session tables), WITHOUT ROWID eliminates one B+-tree traversal. Instead of index → rowid → table, the lookup goes directly to the table B+-tree organized by PK. This can improve performance significantly for PK-heavy workloads.
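Because SQLite is embedded, this trade-off is easy to inspect from Python's built-in sqlite3 module. A small runnable sketch (the exact EXPLAIN QUERY PLAN wording varies across SQLite versions):

```python
import sqlite3

# A WITHOUT ROWID table is stored as a single B+-tree keyed by the
# primary key, so a PK lookup is one tree descent.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sessions (
        session_id TEXT PRIMARY KEY,
        user_id    INTEGER,
        expires_at INTEGER
    ) WITHOUT ROWID
""")
conn.execute("INSERT INTO sessions VALUES ('abc123', 42, 1700000000)")

# EXPLAIN QUERY PLAN shows the search going straight through the
# primary key, with no separate index-then-rowid step.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT user_id FROM sessions WHERE session_id = 'abc123'"
).fetchall()
print(plan[0][3])  # e.g. "SEARCH sessions USING PRIMARY KEY (session_id=?)"
```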
B+-tree concurrency control varies significantly across databases. The choice of locking granularity affects both concurrency and overhead.
Locking Approaches:
| Database | Read Strategy | Write Strategy | Phantom Protection |
|---|---|---|---|
| PostgreSQL | MVCC snapshot | Page locks during modification | SERIALIZABLE mode |
| InnoDB | MVCC + consistent read | Row locks + next-key locking | Gap locks |
| SQL Server | Row/page locks | Key-range locks | Key-range locking |
| Oracle | MVCC (multi-version) | Row locks only | SERIALIZABLE mode |
| SQLite | Table-level (default) | Table-level | SERIALIZABLE default |
InnoDB's Next-Key Locking:
InnoDB is notable for its next-key locks—locking both the index record AND the gap before it. This prevents phantom reads without table-level locks.
```
Index values: 10, 15, 20, 25

Query: SELECT * FROM t WHERE id = 15 FOR UPDATE;

Standard row lock (what you might expect):
- Lock on record with id=15 only

InnoDB next-key lock (what actually happens):
- Lock on record 15
- Gap lock on (10, 15) - prevents insertion of 11, 12, 13, 14

Query: SELECT * FROM t WHERE id > 15 AND id < 25 FOR UPDATE;

InnoDB next-key locks:
- Lock on record 20
- Gap lock on (15, 20)
- Gap lock on (20, 25)

This prevents:
- Insertions of 16, 17, 18, 19 (phantom would appear)
- Insertions of 21, 22, 23, 24 (phantom would appear)
```

InnoDB's gap locks can cause surprising contention. A SELECT ... FOR UPDATE on a range locks gaps, blocking inserts from other transactions—even if those inserts are for non-overlapping values. Understanding this is essential for debugging lock wait issues in InnoDB.
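A toy Python model of the range-lock example makes the blocking behavior concrete. This deliberately simplifies InnoDB's lock modes down to "locked gaps"; the exclusive bounds and lock granularity here are modeling assumptions, not InnoDB internals:

```python
# Model: a range SELECT ... FOR UPDATE locks each matching record plus
# the gap before it, and the gap up to the next record past the range.
index_keys = [10, 15, 20, 25]

def next_key_locks(lo, hi):
    """Gaps locked by `WHERE id > lo AND id < hi FOR UPDATE`."""
    locked_gaps = []
    for i, k in enumerate(index_keys):
        if lo < k < hi:  # record k matches the range
            prev = index_keys[i - 1] if i else float("-inf")
            locked_gaps.append((prev, k))        # gap before the record
    nxt = next((k for k in index_keys if k >= hi), float("inf"))
    last = max((k for k in index_keys if lo < k < hi), default=lo)
    locked_gaps.append((last, nxt))              # gap up to the next key
    return locked_gaps

def insert_blocked(value, gaps):
    return any(a < value < b for a, b in gaps)

gaps = next_key_locks(15, 25)    # [(15, 20), (20, 25)]
print(insert_blocked(17, gaps))  # True  - phantom would appear in (15, 20)
print(insert_blocked(23, gaps))  # True  - phantom would appear in (20, 25)
print(insert_blocked(8, gaps))   # False - outside the locked gaps
```

Note that an insert of 23 is blocked even though the query matched only record 20: that is exactly the "surprising contention" described above.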
Modern databases apply various optimizations to B+-tree page formats to improve cache efficiency, reduce I/O, and increase effective fanout.
Common Optimizations:
InnoDB Page Compression Example:
```sql
-- InnoDB table with page compression
CREATE TABLE logs (
    id BIGINT PRIMARY KEY,
    timestamp DATETIME,
    message TEXT
) ROW_FORMAT=COMPRESSED  -- Legacy compressed format
  KEY_BLOCK_SIZE=8;      -- Compress to 8KB target

-- Modern transparent page compression (MySQL 5.7+)
CREATE TABLE logs_modern (
    id BIGINT PRIMARY KEY,
    timestamp DATETIME,
    message TEXT
) COMPRESSION='lz4';  -- Use LZ4 algorithm

-- Enable compression for an existing table
-- (InnoDB page compression supports 'zlib' and 'lz4')
ALTER TABLE existing_logs COMPRESSION='zlib';
OPTIMIZE TABLE existing_logs;  -- Rebuild with compression
```

Page compression reduces I/O at the cost of CPU. For I/O-bound workloads (traditional HDDs, slow storage), compression is almost always beneficial. For CPU-bound workloads on fast NVMe, the CPU overhead may hurt more than the I/O savings help. Benchmark with realistic workloads.
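The ratio side of that trade-off is easy to measure. A sketch using Python's stdlib zlib on a synthetic, highly repetitive 16 KB "page" (real InnoDB compresses whole pages with zlib or LZ4; actual ratios depend entirely on your data):

```python
import zlib

# Build a 16 KB "page" of repetitive log text - the best case for
# compression, chosen to show the upper end of the savings.
line = b"2024-01-01 12:00:00 INFO request handled status=200 path=/api/orders\n"
page = (line * 240)[:16384]

compressed = zlib.compress(page, level=6)
ratio = len(compressed) / len(page)
print(f"{len(page)} -> {len(compressed)} bytes ({ratio:.0%} of original)")
```

Running the same measurement on samples of your own pages (or already-compressed blobs, where the ratio collapses to ~100%) is a cheap first estimate before benchmarking the CPU cost.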
Understanding B+-tree implementation differences helps inform database selection for specific use cases.
| Use Case | Best Fit | Reason |
|---|---|---|
| Read-heavy OLTP | PostgreSQL, Oracle | Excellent MVCC, read/write separation |
| Write-heavy OLTP | InnoDB (MySQL) | Change buffer, compact storage |
| Mixed OLTP | SQL Server | Flexible clustering, online ops |
| High-concurrency PK inserts | Oracle + reverse key | Eliminates insert hot spots |
| Embedded/Edge | SQLite | Single file, zero config |
| Time-series append | InnoDB, PostgreSQL | Efficient rightmost-leaf appends |
| Large text keys | PostgreSQL | Best prefix compression, dedup |
Key Questions When Selecting:
What's your primary key pattern?
Read/write ratio?
Key characteristics?
Concurrency pattern?
Each database represents coherent engineering trade-offs. PostgreSQL's heap-separate-from-index design differs fundamentally from InnoDB's clustered approach, and each excels in different scenarios. The "best" database depends entirely on your workload characteristics.
Production B+-tree implementations go far beyond textbook descriptions, incorporating decades of engineering optimizations and design decisions.
What's Next:
We've now surveyed how production databases implement B+-trees. In the final page of this module, we'll explore optimization techniques—the advanced strategies databases use to squeeze maximum performance from B+-tree indexes, including cache-oblivious algorithms, write-optimized structures, and modern hardware adaptations.
You now understand how major databases implement B+-trees in practice, their page formats, locking strategies, and unique features. This knowledge enables you to interpret database-specific documentation, tune for your workload, and make informed system selection decisions.