Fanout—the number of child pointers per internal node—is the single most important factor determining B+-tree efficiency. Because tree height grows only logarithmically with record count, even modest fanout improvements can remove an entire level from the tree, and each level removed eliminates one disk I/O from every lookup.
While page size sets an upper bound on fanout, numerous optimization techniques can dramatically increase the actual number of entries that fit in each node. Understanding and applying these techniques separates adequate indexes from highly optimized ones that deliver exceptional performance.
By the end of this page, you will understand how fanout affects performance mathematically, master key compression techniques (prefix, suffix, dictionary), learn structural optimizations used in production databases, and apply these techniques to maximize index efficiency.
The impact of fanout on tree characteristics follows a simple law: a tree of height h can address roughly f^h records, so capacity grows exponentially with height and height grows only logarithmically with record count. This makes even small fanout improvements highly valuable.
The Fanout-Height Relationship:
For a B+-tree with N records and average fanout f (assuming leaf pages also hold roughly f entries):
$$h \approx \lceil\log_f(N)\rceil$$
This logarithmic relationship is easiest to see with concrete numbers. For N = 1 billion records:
| Fanout | Tree Height | Total Nodes | Internal Nodes | I/O per Lookup |
|---|---|---|---|---|
| 100 | 5 | ~10.1M | ~101K | 5 |
| 200 | 4 | ~5.03M | ~25K | 4 |
| 500 | 4 | ~2.01M | ~4K | 4 |
| 1000 | 3 | ~1.00M | ~1K | 3 |
| 2000 | 3 | ~500K | ~250 | 3 |
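As a sanity check, here are two of the table's height values worked by hand, under the same assumption that leaf pages hold about f entries:

$$f = 500:\quad h = \lceil\log_{500}(10^9)\rceil = \left\lceil \frac{9\ln 10}{\ln 500} \right\rceil = \lceil 3.33 \rceil = 4$$

$$f = 1000:\quad h = \lceil\log_{1000}(10^9)\rceil = \lceil 3.00 \rceil = 3$$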
The Multiplicative Effect:

Fanout improvements provide multiplicative benefits: each level removed from the tree is one less page read for every single lookup. For a database serving millions of queries per day, reducing height from 5 to 4 eliminates millions of disk I/O operations.
```python
import math

def analyze_fanout_impact(record_count, fanout_values):
    """
    Analyze the impact of different fanout values on tree characteristics.
    """
    print(f"Analysis for {record_count:,} records")
    print("=" * 70)
    print(f"{'Fanout':<10} {'Height':<8} {'Leaves':<15} {'Internal':<12} {'Total Size':<12}")
    print("-" * 70)

    for f in fanout_values:
        # Calculate tree structure bottom-up
        leaves = math.ceil(record_count / (f - 1))  # leaf capacity = f-1
        height = 1
        level_nodes = leaves
        internal_nodes = 0
        while level_nodes > 1:
            parent_nodes = math.ceil(level_nodes / f)
            internal_nodes += parent_nodes
            level_nodes = parent_nodes
            height += 1
        total_nodes = leaves + internal_nodes
        # Assume 8KB pages
        total_size_gb = total_nodes * 8 / (1024 * 1024)
        print(f"{f:<10} {height:<8} {leaves:<15,} {internal_nodes:<12,} {total_size_gb:<12.2f} GB")

# Analyze for different fanout values
record_count = 1_000_000_000  # 1 billion
fanout_values = [100, 200, 300, 500, 750, 1000]
analyze_fanout_impact(record_count, fanout_values)

# Impact of 20% fanout improvement
print("\n" + "=" * 70)
print("Impact of 20% fanout improvement:")
base_fanout = 400
improved_fanout = 480

for f, label in [(base_fanout, "Baseline"), (improved_fanout, "Optimized")]:
    leaves = math.ceil(record_count / (f - 1))
    height = math.ceil(math.log(leaves, f)) + 1
    print(f"{label}: Fanout {f} → Height {height}, ~{leaves//1000}K leaf pages")
```

In relative terms, a 20% fanout improvement shrinks height only slightly (height scales as 1/log f). But because height is a small integer, a modest gain can tip a tree across a level boundary, say from 4 levels to 3, which is a 25% reduction in I/O per lookup. Small optimizations compound into significant performance gains.
Prefix compression exploits the fact that adjacent keys in a sorted B+-tree often share common prefixes. By storing the shared prefix once and recording only the unique suffixes, we can dramatically reduce key storage requirements.
How Prefix Compression Works:
Consider a page containing these keys:
products/electronics/laptops/dell-xps-15
products/electronics/laptops/lenovo-thinkpad
products/electronics/laptops/macbook-pro
products/electronics/phones/iphone-15
products/electronics/phones/pixel-8
Without compression: 5 keys × ~40 bytes ≈ 200 bytes

With prefix compression: store the shared prefix "products/electronics/" once (21 bytes) plus the five unique suffixes (~91 bytes), roughly halving storage.

With hierarchical compression: even more savings, because the "laptops/" prefix shared by the first three entries is also factored out (front coding, sketched below, is one way to achieve this).
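One standard way to realize those hierarchical savings is front coding, where each key is delta-encoded against its predecessor instead of against a single page-wide prefix. This is a minimal sketch of the idea, not any particular engine's on-disk format:

```python
def front_code(keys):
    """Front coding: represent each sorted key as (shared, suffix), where
    `shared` counts the leading characters in common with the previous key."""
    encoded, prev = [], ""
    for key in keys:
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded

keys = [
    "products/electronics/laptops/dell-xps-15",
    "products/electronics/laptops/lenovo-thinkpad",
    "products/electronics/laptops/macbook-pro",
    "products/electronics/phones/iphone-15",
    "products/electronics/phones/pixel-8",
]
for shared, suffix in front_code(keys):
    print(f"shared={shared:2d}  suffix={suffix!r}")
```

On this page, only 89 suffix bytes (plus a small per-entry length field) remain of the original 196, and the "laptops/" and "phones/" components are shared automatically. The trade-off is that decoding entry k requires walking forward from the start of the page or from a periodic restart point.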
```python
def find_common_prefix(keys):
    """Find the longest common prefix among a list of keys."""
    if not keys:
        return ""
    prefix = keys[0]
    for key in keys[1:]:
        while not key.startswith(prefix):
            prefix = prefix[:-1]
            if not prefix:
                return ""
    return prefix

def prefix_compress_page(keys):
    """
    Simulate prefix compression on a B+-tree page.
    Returns compression statistics.
    """
    if not keys:
        return {"original_size": 0, "compressed_size": 0, "ratio": 0}

    # Original size
    original_size = sum(len(k) for k in keys)

    # Find common prefix
    prefix = find_common_prefix(keys)
    prefix_len = len(prefix)

    # Compressed storage:
    # - 2 bytes for prefix length
    # - prefix itself
    # - For each key: 1 byte suffix length + suffix
    compressed_size = 2 + prefix_len
    for key in keys:
        suffix = key[prefix_len:]
        compressed_size += 1 + len(suffix)

    ratio = (original_size - compressed_size) / original_size * 100
    return {
        "original_size": original_size,
        "compressed_size": compressed_size,
        "savings": original_size - compressed_size,
        "ratio": ratio,
        "prefix": prefix,
        "prefix_len": prefix_len,
    }

# Example: URL index
urls = [
    "https://example.com/products/category/electronics/item/12345",
    "https://example.com/products/category/electronics/item/12346",
    "https://example.com/products/category/electronics/item/12347",
    "https://example.com/products/category/electronics/item/12348",
    "https://example.com/products/category/electronics/item/12349",
]

result = prefix_compress_page(urls)
print("URL Index Prefix Compression")
print(f"Original size: {result['original_size']} bytes")
print(f"Compressed size: {result['compressed_size']} bytes")
print(f"Savings: {result['savings']} bytes ({result['ratio']:.1f}%)")
print(f"Common prefix: '{result['prefix']}'")

# Example: UUID index (worst case - random, no common prefix)
import uuid
uuids = [str(uuid.uuid4()) for _ in range(5)]
uuid_result = prefix_compress_page(uuids)
print(f"\nUUID Compression: {uuid_result['ratio']:.1f}% (minimal savings expected)")
```

| Key Type | Typical Prefix Sharing | Compression Ratio | Fanout Improvement |
|---|---|---|---|
| URLs / Paths | Very High | 60-80% | 2.5-5× |
| Sequential IDs | High | 40-60% | 1.7-2.5× |
| Timestamps | High | 50-70% | 2-3× |
| Email addresses | Medium | 30-50% | 1.4-2× |
| Random UUIDs | None | 0% | 1× (no benefit) |
| Hash values | None | 0% | 1× (no benefit) |
Prefix compression requires decompression during search (CPU overhead), may complicate concurrent access (prefix changes require page restructuring), and provides no benefit for random keys like UUIDs. Modern CPUs handle decompression efficiently, but the complexity must be considered.
Suffix truncation (also called separator key compression or fence key compression) is a powerful technique specific to internal nodes of B+-trees. The key insight: internal node keys only need to distinguish between child subtrees—they don't need to be complete keys.
The Insight:
In a B+-tree internal node, a key K separates children: all keys < K go left, all keys ≥ K go right. We only need enough of K to make this distinction—not the full key value.
Consider two adjacent child pages whose keys straddle midnight on January 16: the left page ends with transaction_2024_01_15_... keys and the right page begins with "transaction_2024_01_16_def456uvw".

Full separator: "transaction_2024_01_16_def456uvw"
Minimal separator: "transaction_2024_01_16"

The truncated version is sufficient to route searches correctly: every key in the left subtree compares less than it, and every key in the right subtree compares greater than or equal to it.
```python
def compute_minimal_separator(left_max, right_min):
    """
    Compute the minimal separator key between two adjacent child ranges.

    The separator must be:
    - > left_max (strictly greater)
    - <= right_min (less than or equal)

    We find the shortest prefix of right_min that satisfies these conditions.
    """
    # Find the first character position where they differ
    min_len = min(len(left_max), len(right_min))
    for i in range(min_len):
        if left_max[i] < right_min[i]:
            # We can truncate here - prefix up to i (inclusive) is sufficient
            return right_min[:i+1]
        elif left_max[i] > right_min[i]:
            # This shouldn't happen in a valid B+-tree (left_max < right_min)
            raise ValueError("Invalid ordering: left_max >= right_min")

    # If we get here, one is a prefix of the other
    # right_min must be longer (since left_max < right_min)
    return right_min[:min_len + 1]

def analyze_suffix_truncation(separators):
    """Analyze savings from suffix truncation on separator keys."""
    total_full = 0
    total_truncated = 0

    for left_max, right_min in separators:
        minimal = compute_minimal_separator(left_max, right_min)
        full_len = len(right_min)
        truncated_len = len(minimal)
        total_full += full_len
        total_truncated += truncated_len

        savings = (full_len - truncated_len) / full_len * 100
        print(f"Full:    '{right_min[:30]}...' ({full_len})")
        print(f"Minimal: '{minimal}' ({truncated_len})")
        print(f"Savings: {savings:.1f}%\n")

    overall_savings = (total_full - total_truncated) / total_full * 100
    print(f"Overall: {total_full} → {total_truncated} bytes ({overall_savings:.1f}% reduction)")

# Example: Log entry keys (timestamp + ID)
separators = [
    ("2024-01-15T23:59:59.999Z_id_99999", "2024-01-16T00:00:00.001Z_id_00001"),
    ("2024-01-16T00:59:59.999Z_id_99999", "2024-01-16T01:00:00.001Z_id_00001"),
    ("2024-01-16T01:59:59.999Z_id_99999", "2024-01-16T02:00:00.001Z_id_00001"),
]

print("Suffix Truncation Analysis: Time-Series Index")
print("=" * 60)
analyze_suffix_truncation(separators)
```

PostgreSQL 12+ implements suffix truncation for B-tree internal pages. For indexes on long text columns, this can reduce internal index size by 70% or more, significantly improving buffer pool efficiency and reducing tree height.
Key normalization transforms keys into a canonical binary format optimized for comparison and storage. This technique enables simpler comparison operations and can reduce key sizes.
Why Normalize Keys?
Native data types have varying comparison rules:

- Signed integers: two's-complement byte patterns do not sort correctly as raw bytes (negative values look "large")
- Little-endian integers: the least-significant byte comes first, so bytewise order disagrees with numeric order
- Floats: the IEEE 754 sign/exponent/mantissa layout needs a transform before bytes sort numerically
- Strings: collation rules (case, accents, locale) differ from raw byte order

Normalized keys allow simple byte-for-byte comparison (memcmp) regardless of original type.
```python
def normalize_integer(value, signed=True, width=8):
    """
    Normalize an integer for ordered byte comparison.

    For memcmp to work correctly:
    - Flip sign bit for signed integers (makes negative < positive)
    - Use big-endian byte order (MSB first for proper ordering)
    """
    if signed:
        # Flip the sign bit so memcmp gives correct ordering
        # For 8-byte signed: XOR with 0x8000000000000000
        flip_mask = 1 << (width * 8 - 1)
        adjusted = value ^ flip_mask
        # Handle negative wraparound
        if adjusted < 0:
            adjusted += (1 << (width * 8))
    else:
        adjusted = value
    return adjusted.to_bytes(width, byteorder='big')

def normalize_string(s, max_len=8):
    """
    Normalize a string for ordered byte comparison.

    Simple approach: truncate/pad to fixed length.
    Production systems use more sophisticated encoding.
    """
    # Encode as UTF-8 and truncate/pad
    encoded = s.encode('utf-8')[:max_len]
    return encoded.ljust(max_len, b'\x00')

def normalize_composite_key(timestamp, user_id, txn_id):
    """
    Normalize a composite key: (timestamp, user_id, txn_id)
    All components concatenated in order for single memcmp comparison.
    """
    ts_bytes = normalize_integer(int(timestamp * 1000000), signed=False, width=8)
    user_bytes = normalize_integer(user_id, signed=False, width=4)
    txn_bytes = normalize_integer(txn_id, signed=False, width=8)
    return ts_bytes + user_bytes + txn_bytes

# Demonstrate that normalized comparison matches semantic comparison
test_integers = [-100, -1, 0, 1, 100]
print("Integer Normalization Test")
print("=" * 50)
normalized = [(n, normalize_integer(n)) for n in test_integers]

# Sort by normalized bytes
sorted_by_bytes = sorted(normalized, key=lambda x: x[1])
sorted_values = [x[0] for x in sorted_by_bytes]
print(f"Original order: {test_integers}")
print(f"After byte-sort: {sorted_values}")
print(f"Correct: {sorted_values == sorted(test_integers)}")
```

| Data Type | Normalization Strategy | Size Impact | Example |
|---|---|---|---|
| INTEGER | Flip sign bit, big-endian | No change | -100 → 0x7FFFFF9C |
| FLOAT | Sign/exponent encoding | No change | IEEE 754 transform |
| VARCHAR | Collation encoding | May increase | ICU collation keys |
| TIMESTAMP | Epoch nanoseconds | Fixed 8 bytes | Consistent comparison |
| UUID | Binary form | 16 bytes | Remove hyphens |
| COMPOSITE | Concatenate normalized | Sum of parts | See code example |
String normalization for collation (locale-aware sorting) can actually increase key size: German 'ß' collates like 'ss', and accented characters can expand to multi-byte collation weights. This is a trade-off between correct ordering and storage efficiency.
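Python's `locale.strxfrm` offers a quick, environment-dependent way to observe this. The sketch below assumes the de_DE.UTF-8 locale is installed and falls back to the byte-order "C" locale otherwise; the exact key lengths vary by platform:

```python
import locale

# Assumes de_DE.UTF-8 is available; fall back to plain byte ordering if not.
try:
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
except locale.Error:
    locale.setlocale(locale.LC_COLLATE, "C")

for word in ["Strasse", "Straße", "résumé"]:
    key = locale.strxfrm(word)  # sort key honoring the active collation
    print(f"{word!r}: {len(word)} chars -> collation key of {len(key)} units")
```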
Dictionary compression replaces repeated key values or key components with shorter integer codes. This is particularly effective for low-cardinality columns (few distinct values) or keys with repetitive patterns.
How Dictionary Compression Works:
Instead of storing full values, maintain a dictionary mapping:
Dictionary:
1 → "North America"
2 → "Europe"
3 → "Asia Pacific"
4 → "Latin America"
Original keys: ["North America", "Europe", "North America", "Asia Pacific", ...]
Compressed: [1, 2, 1, 3, ...]
Each key shrinks from ~11 bytes on average to a small integer code: 1 byte suffices for the four regions here, and 1-4 bytes in general, depending on cardinality.
```python
import random

class DictionaryEncoder:
    """
    Simple dictionary encoder for B+-tree keys.
    Used for low-cardinality column values.
    """
    def __init__(self):
        self.value_to_code = {}
        self.code_to_value = {}
        self.next_code = 1

    def encode(self, value):
        """Encode a value to its dictionary code."""
        if value not in self.value_to_code:
            self.value_to_code[value] = self.next_code
            self.code_to_value[self.next_code] = value
            self.next_code += 1
        return self.value_to_code[value]

    def decode(self, code):
        """Decode a dictionary code to its value."""
        return self.code_to_value.get(code)

    def bytes_per_code(self):
        """Number of bytes needed for codes."""
        if self.next_code <= 256:
            return 1
        elif self.next_code <= 65536:
            return 2
        else:
            return 4

def analyze_dictionary_compression(values):
    """Analyze space savings from dictionary compression."""
    encoder = DictionaryEncoder()

    # Original size
    original_size = sum(len(str(v).encode('utf-8')) for v in values)

    # Encode all values
    codes = [encoder.encode(v) for v in values]

    # Dictionary size
    dict_size = sum(len(str(v).encode('utf-8')) + 4
                    for v in encoder.value_to_code.keys())

    # Compressed size
    bytes_per = encoder.bytes_per_code()
    compressed_size = len(codes) * bytes_per + dict_size

    cardinality = len(encoder.value_to_code)

    return {
        "original_size": original_size,
        "compressed_size": compressed_size,
        "dict_size": dict_size,
        "cardinality": cardinality,
        "bytes_per_code": bytes_per,
        "ratio": (original_size - compressed_size) / original_size * 100,
    }

# Example: Country column (low cardinality)
countries = ["United States", "United Kingdom", "Germany", "France", "Japan",
             "Canada", "Australia", "Brazil", "India", "China"]

# 1 million rows with random countries
values = [random.choice(countries) for _ in range(1_000_000)]

result = analyze_dictionary_compression(values)
print("Dictionary Compression: Country Column (1M rows)")
print(f"Cardinality: {result['cardinality']} distinct values")
print(f"Original: {result['original_size']:,} bytes")
print(f"Compressed: {result['compressed_size']:,} bytes")
print(f"Dictionary overhead: {result['dict_size']:,} bytes")
print(f"Savings: {result['ratio']:.1f}%")
```

| Column Type | Cardinality | Original Size | Code Size | Typical Savings |
|---|---|---|---|---|
| Status (Active/Inactive) | 2 | 7 bytes avg | 1 byte | 85% |
| Country | ~200 | 10 bytes avg | 1 byte | 90% |
| State/Province | ~5000 | 8 bytes avg | 2 bytes | 75% |
| Product Category | ~100 | 15 bytes avg | 1 byte | 93% |
| Email Domain | ~1M | 12 bytes avg | 3 bytes | 75% |
| User ID (high card) | Millions | 8 bytes | 4 bytes | 50% |
Dictionary compression requires maintaining the dictionary in memory, adds encoding/decoding overhead, and doesn't help with high-cardinality columns. It's most effective for columns with <10,000 distinct values appearing in millions of rows.
Beyond key compression, structural modifications to the B+-tree itself can improve fanout and overall efficiency.
1. Variable-Length Internal Node Format:
Instead of fixed-size slots, internal nodes can use variable-length entries:
```c
/* Fixed-format internal node (traditional) */
struct FixedInternalNode {
    uint16_t num_keys;
    uint32_t child_pointers[MAX_FANOUT];
    char keys[MAX_FANOUT - 1][MAX_KEY_SIZE];  /* Wastes space! */
};

/* Variable-format internal node (optimized) */
struct VarInternalNode {
    uint16_t num_keys;
    uint16_t free_space_offset;
    /* Slot array: offsets to actual key data */
    uint16_t key_offsets[1];  /* Dynamic size */
    /* After key_offsets array: */
    /* - Child pointers (packed) */
    /* - Key data (variable length, grows from end of page) */
};

/* Benefits:
 * - Short keys use less space
 * - More entries fit per page
 * - Suffix-truncated separators pack efficiently
 *
 * Trade-offs:
 * - More complex slot management
 * - Insertion requires shifting
 * - Must track free space
 */
```

2. Pointer Swizzling:
Replace on-disk page pointers with direct in-memory pointers for cached nodes: while a child page is resident in the buffer pool, the parent's reference is "swizzled" into the child's memory address, and it is "unswizzled" back to a page ID before the child can be evicted (see the sketch below).

This eliminates buffer pool lookup overhead for frequently accessed nodes.
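A toy model of the mechanism, in Python for readability (real engines do this with tagged pointers in C/C++; `buffer_pool` and its `get` method are hypothetical stand-ins):

```python
class ChildRef:
    """A child pointer that holds a disk page ID and, when the child is
    cached, a direct ("swizzled") in-memory reference to the node."""

    def __init__(self, page_id):
        self.page_id = page_id
        self.node = None  # direct reference once swizzled

def follow(ref, buffer_pool):
    """Resolve a child reference to a node object."""
    if ref.node is not None:
        return ref.node                   # swizzled: no buffer pool lookup
    node = buffer_pool.get(ref.page_id)   # hash lookup, possibly a disk read
    ref.node = node                       # swizzle for next time
    return node

def on_evict(ref):
    """Before the child page is evicted, the reference must be unswizzled."""
    ref.node = None
```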
3. Hybrid Node Formats:

Some systems use different formats for different node types, for example:

- Internal nodes: compressed, suffix-truncated separators packed for maximum fanout
- Leaf nodes: a layout optimized for range scans and in-place updates
- Root and upper levels: pinned in memory, where pointer swizzling pays off most
PostgreSQL 13+ implements B-tree deduplication: when a key appears multiple times, it stores the key once with a list of TIDs. This is similar to dictionary compression but at the tree level. For non-unique indexes with many duplicates, this can reduce index size by 50% or more.
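The idea is easy to see in miniature. This sketch groups duplicate keys into posting lists; it illustrates the concept only, not PostgreSQL's actual on-disk layout:

```python
from itertools import groupby
from operator import itemgetter

def deduplicate_leaf(entries):
    """Collapse sorted (key, tid) pairs into (key, [tid, ...]) posting lists."""
    return [(key, [tid for _, tid in group])
            for key, group in groupby(sorted(entries), key=itemgetter(0))]

entries = [("blue", (7, 2)), ("red", (3, 1)), ("blue", (7, 5)),
           ("red", (4, 8)), ("blue", (9, 1))]
for key, tids in deduplicate_leaf(entries):
    print(f"{key}: {len(tids)} TIDs, key stored once instead of {len(tids)} times")
```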
Understanding your actual index fanout is essential for optimization. Here's how to measure it in production databases.
```sql
-- PostgreSQL: Estimate index fanout from statistics
SELECT
    indexrelname AS index_name,
    pg_relation_size(indexrelid) AS index_bytes,
    idx_scan AS index_scans,
    idx_tup_read AS tuples_via_index,
    idx_tup_fetch AS tuples_fetched,
    -- Estimate pages and entries
    (pg_relation_size(indexrelid) / 8192)::int AS index_pages,
    (SELECT reltuples FROM pg_class WHERE oid = indexrelid)::bigint AS estimated_entries
FROM pg_stat_user_indexes
WHERE indexrelname LIKE 'idx_%'
ORDER BY pg_relation_size(indexrelid) DESC;

-- PostgreSQL: Detailed B-tree statistics (requires pageinspect extension)
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Get B-tree meta page info
SELECT * FROM bt_metap('idx_users_email');

-- Get statistics for internal pages (level > 0)
SELECT blkno, type, live_items, dead_items, avg_item_size, free_size
FROM bt_page_stats('idx_users_email', 1);  -- Page 1 (root)

-- MySQL InnoDB: Index statistics
SELECT
    table_name,
    index_name,
    stat_name,
    stat_value,
    stat_description
FROM mysql.innodb_index_stats
WHERE database_name = 'mydb'
  AND table_name = 'users'
  AND index_name = 'idx_email';

-- Oracle: Index statistics
SELECT
    index_name,
    blevel,                    -- B-tree level (height - 1)
    leaf_blocks,               -- Number of leaf pages
    distinct_keys,
    avg_leaf_blocks_per_key,
    avg_data_blocks_per_key
FROM user_indexes
WHERE index_name = 'IDX_USERS_EMAIL';
```

| Check | Good | Needs Attention | Action |
|---|---|---|---|
| Average entries/page | ≥70% of theoretical max | <50% of max | Check fill factor, fragmentation |
| Index height | ≤ log_f(N) + 1 | Higher than expected | Consider key compression |
| Key size | Match actual data | Significant padding | Use appropriate types |
| Duplicate handling | Deduplication enabled | Many duplicate keys | Enable deduplication |
| Page utilization | ≥80% | <60% | REINDEX or VACUUM |
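To interpret those numbers, a rough calculation like the following (a sketch, assuming uniform key distribution and similar leaf/internal fanout) compares the measured height against what the measured occupancy predicts:

```python
import math

def fanout_health(num_entries, leaf_pages, measured_height):
    """Compare measured tree height with the height implied by occupancy."""
    entries_per_leaf = num_entries / leaf_pages          # measured fanout proxy
    # Levels above the leaves, plus the leaf level itself
    expected_height = math.ceil(math.log(leaf_pages, entries_per_leaf)) + 1
    status = "OK" if measured_height <= expected_height else "investigate"
    return entries_per_leaf, expected_height, status

# Example figures of the kind reported by bt_metap / user_indexes
per_leaf, expected, status = fanout_health(
    num_entries=50_000_000, leaf_pages=180_000, measured_height=4)
print(f"~{per_leaf:.0f} entries/leaf, expected height {expected}: {status}")
```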
Schedule regular index health assessments. Fanout can degrade over time due to fragmentation, bloat, and changing data patterns. Many production issues stem from indexes that have degraded significantly from their initial optimal state.
We've explored the key techniques for maximizing B+-tree fanout, each of which directly improves index performance.
What's Next:
With fanout optimization understood, we'll move to I/O cost analysis—the comprehensive framework for understanding and predicting the disk I/O costs of B+-tree operations.
You now understand the techniques used by database systems to maximize B+-tree fanout, from key compression to structural optimizations. This knowledge enables you to design efficient indexes and diagnose performance issues. Next, we'll analyze the I/O costs of B+-tree operations in detail.