As computing evolved from 32-bit to 64-bit architectures, a fundamental challenge emerged in virtual memory management. A 64-bit address space theoretically spans 18 exabytes (2⁶⁴ bytes)—more than a billion times larger than typical physical memory. Traditional page tables, which allocate entries proportional to the virtual address space, become grotesquely inefficient when the address space dwarfs actual memory usage.
Hashed page tables represent a paradigm shift in address translation design. Rather than organizing page table entries by virtual page number in a linear or hierarchical structure, hashed page tables use a hash function to map virtual page numbers directly to table entries. This approach decouples the page table size from the virtual address space size, making it proportional instead to the number of actually mapped pages.
This architectural innovation is critical for modern 64-bit systems, sparse address spaces, and operating systems that must handle processes with wildly varying memory footprints efficiently.
By the end of this page, you will understand the fundamental motivation for hashed page tables, their internal structure and organization, how hash functions are designed for address translation, the complete lookup mechanism, and the performance trade-offs that make hashed tables essential for 64-bit system design.
Before understanding why hashed page tables exist, we must appreciate the limitations of traditional page table designs—linear page tables and multi-level hierarchical page tables.
The Linear Page Table Problem:
A linear (single-level) page table maintains one entry per virtual page in the address space. For a 64-bit architecture with 4KB pages:
Virtual address bits: 64
Page offset bits: 12 (for 4KB pages)
Virtual page number bits: 52
Number of page entries: 2⁵² = 4,503,599,627,370,496 (4.5 quadrillion)
Entry size: 8 bytes (typical for 64-bit systems)
Total page table size: 2⁵² × 8 = 32 PB (petabytes)
This is obviously impossible. No system can allocate 32 petabytes just for a single process's page table. Even for 32-bit systems, a linear page table requires 4MB per process—significant overhead that motivated multi-level designs.
64-bit address spaces are 2³² = 4 billion times larger than 32-bit spaces. Solutions that work for 32-bit systems often become untenable for 64-bit architectures. This exponential growth drives the need for fundamentally different approaches.
The Multi-Level Page Table Approach:
Hierarchical page tables address the space explosion by only allocating table levels for actually-used portions of the address space. A typical x86-64 four-level page table structure:
Level 4 (PML4): 512 entries × 8 bytes = 4KB
Level 3 (PDPT): Up to 512 tables × 4KB each
Level 2 (PD): Up to 512² tables × 4KB each
Level 1 (PT): Up to 512³ tables × 4KB each
For sparse address spaces (most pages unmapped), multi-level tables are efficient—only populated regions require table entries. However, multi-level tables have significant drawbacks:
Multiple Memory Accesses: Each address translation requires traversing multiple levels (up to 4 for x86-64), costing up to 4 dependent memory accesses on a TLB miss.
Pointer Chasing: Each level requires following a pointer to the next, causing poor cache behavior due to scattered memory locations.
Fixed Depth: The hierarchy depth is fixed by architecture, potentially over-provisioning for small processes or under-serving unusual access patterns.
Complex Management: Creating, destroying, and maintaining multi-level structures adds OS complexity.
| Approach | Space Complexity | Time Complexity | Scalability | Sparsity Handling |
|---|---|---|---|---|
| Linear Table | O(address space) | O(1) | Poor for 64-bit | None |
| Multi-Level | O(mapped pages) | O(levels) | Good | Excellent |
| Hashed Table | O(mapped pages) | O(1 + chain length) | Excellent | Excellent |
The Insight Behind Hashed Tables:
The key observation driving hashed page tables is that processes typically use only a tiny fraction of their virtual address space. A process with 2GB of actual memory usage in a 64-bit address space is using:
2GB / 2⁶⁴ bytes = 2³¹ / 2⁶⁴ = 2⁻³³ ≈ 0.00000001%
Why maintain any structure proportional to the address space when utilization is infinitesimally small? Hashed page tables organize entries by actual mappings, not potential addresses.
This insight leads to a structure where the table size tracks the number of mapped pages, lookups take expected constant time, and unmapped regions of the address space cost nothing.
A hashed page table is fundamentally a hash table specialized for address translation. It consists of an array of buckets, where each bucket can contain one or more page table entries. The virtual page number is hashed to select a bucket, and the bucket contains the translation to the physical frame.
Core Components:
1. Hash Table Array (Bucket Array)
The primary structure is an array of buckets. Each bucket is either empty (no mapped page hashes to it) or the head of a chain of one or more page table entries.
The number of buckets is chosen based on expected memory usage and acceptable collision rates—not virtual address space size.
2. Page Table Entry (Hash Table Entry)
Each entry contains the full virtual page number (the lookup key), the physical frame number, control bits, and a chain pointer for collision handling.
3. Hash Function
A function that maps virtual page numbers to bucket indices:
bucket_index = hash(virtual_page_number) mod table_size
Entry Structure in Detail:
A typical hashed page table entry for a 64-bit system:
┌─────────────────────────────────────────────────────────────────┐
│ Hashed Page Table Entry (24 bytes) │
├─────────────────────────────────────────────────────────────────┤
│ Virtual Page Number (52 bits) │ Pad (12 bits)│
│ 0 63 │
├─────────────────────────────────────────────────────────────────┤
│ Physical Frame Number (40 bits) │ Control Bits (24 bits) │
│ 0 63 │
├─────────────────────────────────────────────────────────────────┤
│ Chain Pointer (64 bits - pointer to next entry) │
│ 0 63 │
└─────────────────────────────────────────────────────────────────┘
Control Bits:
- Valid (V): 1 bit - Entry contains valid translation
- Read (R): 1 bit - Page is readable
- Write (W): 1 bit - Page is writable
- Execute (X): 1 bit - Page is executable
- User (U): 1 bit - Accessible in user mode
- Dirty (D): 1 bit - Page has been modified
- Accessed (A): 1 bit - Page has been accessed
- Cache Disable: 1 bit - Bypass CPU caches
- Write-Through: 1 bit - Write-through caching policy
- Reserved: 15 bits - Future use
Why Store the Complete VPN?
Unlike array-indexed page tables, where the array index implicitly encodes the VPN, hashed tables must explicitly store the VPN because multiple VPNs can hash to the same bucket; the stored VPN is the key that disambiguates chained entries during lookup.
```c
// Hashed page table data structures

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>  // for size_t

// Control bits for page table entries
typedef struct {
    uint32_t valid : 1;          // Entry is valid
    uint32_t read : 1;           // Page is readable
    uint32_t write : 1;          // Page is writable
    uint32_t execute : 1;        // Page is executable
    uint32_t user : 1;           // User-mode accessible
    uint32_t dirty : 1;          // Page modified
    uint32_t accessed : 1;       // Page accessed
    uint32_t cache_disable : 1;  // Disable caching
    uint32_t write_through : 1;  // Write-through policy
    uint32_t reserved : 23;      // Reserved for future use
} PageControlBits;

// Single entry in the hashed page table
typedef struct HashedPageTableEntry {
    uint64_t virtual_page_number;       // Full VPN (52 bits used)
    uint64_t physical_frame_number;     // PFN (40 bits used)
    PageControlBits control;            // Protection and status bits
    struct HashedPageTableEntry* next;  // Chain pointer for collisions
} HashedPageTableEntry;

// The hashed page table structure
typedef struct {
    HashedPageTableEntry** buckets;  // Array of bucket pointers
    size_t num_buckets;              // Number of buckets
    size_t entry_count;              // Total entries in table
    uint64_t hash_seed;              // Seed for hash function
} HashedPageTable;

// Page table operations
HashedPageTable* hpt_create(size_t num_buckets, uint64_t seed);
void hpt_destroy(HashedPageTable* table);
bool hpt_insert(HashedPageTable* table, uint64_t vpn, uint64_t pfn,
                PageControlBits control);
bool hpt_lookup(HashedPageTable* table, uint64_t vpn, uint64_t* pfn_out,
                PageControlBits* control_out);
bool hpt_remove(HashedPageTable* table, uint64_t vpn);
void hpt_update_control(HashedPageTable* table, uint64_t vpn,
                        PageControlBits new_control);
```

For a process with 10,000 mapped pages: a linear 64-bit page table needs 32 PB (impossible); a 4-level table needs ~160KB minimum; a hashed table with 16,384 buckets and 10,000 entries needs roughly 368KB (128KB of bucket pointers plus ~240KB of 24-byte entries). The hashed table trades slightly more memory for O(1) expected lookup without multi-level traversal.
The hash function is the heart of a hashed page table. Its quality directly determines lookup performance—a poor hash function leads to clustering, long chains, and degraded O(n) performance. Designing hash functions for virtual page numbers presents unique challenges.
Requirements for Address Translation Hash Functions:
1. Uniform Distribution
The function must distribute VPNs uniformly across buckets. Real-world VPN patterns are NOT uniform: code segments sit at low addresses, the stack near the top of the address space, and heap growth produces long runs of sequential VPNs.
A good hash function must transform these clustered patterns into uniform bucket distribution.
2. Fast Computation
Address translation is on the critical path of EVERY memory access. The hash function is computed millions of times per second. Complex cryptographic hashes (SHA, MD5) are too slow—we need hardware-friendly operations.
3. Low Collision Rate
Collisions require chain traversal. Each collision adds a memory access. Target: average chain length < 2 entries.
4. Deterministic and Reproducible
The same VPN must always hash to the same bucket within a process lifecycle. No randomness per-lookup (though the table can be initialized with random seeds).
Common Hash Function Approaches:
Simple Modulo (Poor Choice):
hash(vpn) = vpn mod table_size
Problem: only the VPN's low-order bits influence the bucket, so sequential VPNs fill sequential buckets and strided or aligned patterns (e.g., VPNs that are multiples of the table size) collide heavily.
Multiplicative Hashing (Better):
hash(vpn) = ((vpn × A) mod 2ʷ) >> (w - log₂(table_size))
Where A is a carefully chosen odd constant (e.g., 2654435769 for 32-bit).
This spreads sequential numbers across the table by exploiting the mathematical properties of multiplication.
XOR-Folding (Good for VPNs):
hash(vpn) = (vpn ^ (vpn >> 16) ^ (vpn >> 32)) mod table_size
Combines high and low bits, effective when address patterns are predictable.
MurmurHash/xxHash (Excellent): Modern non-cryptographic hash functions designed for speed and distribution. These are often used in production systems.
```c
// Hash function implementations for page tables

#include <stdint.h>

// Simple multiplicative hash - fast but moderate distribution
static inline uint64_t hash_multiplicative(uint64_t vpn, uint64_t table_size) {
    // Golden ratio constant for 64-bit
    const uint64_t GOLDEN_RATIO = 0x9E3779B97F4A7C15ULL;
    return ((vpn * GOLDEN_RATIO) >> 32) % table_size;
}

// XOR-folding hash - good for clustered VPNs
static inline uint64_t hash_xor_fold(uint64_t vpn, uint64_t table_size) {
    uint64_t h = vpn;
    h ^= (h >> 23);
    h *= 0x2127599BF4325C37ULL;
    h ^= (h >> 47);
    return h % table_size;
}

// Fibonacci hash - excellent distribution, very fast
static inline uint64_t hash_fibonacci(uint64_t vpn, uint64_t num_buckets_log2) {
    const uint64_t FIBONACCI = 11400714819323198485ULL;  // 2^64 / φ
    return (vpn * FIBONACCI) >> (64 - num_buckets_log2);
}

// MurmurHash3 finalizer - production quality
static inline uint64_t hash_murmur3_finalizer(uint64_t vpn, uint64_t table_size) {
    uint64_t h = vpn;
    h ^= h >> 33;
    h *= 0xFF51AFD7ED558CCDULL;
    h ^= h >> 33;
    h *= 0xC4CEB9FE1A85EC53ULL;
    h ^= h >> 33;
    return h % table_size;
}

// Recommended: Combined approach for page tables
static inline uint64_t hash_vpn(uint64_t vpn, uint64_t seed, uint64_t table_size) {
    // Mix with seed for per-table randomization (security)
    uint64_t h = vpn ^ seed;
    // Murmur3 finalizer for excellent distribution
    h ^= h >> 33;
    h *= 0xFF51AFD7ED558CCDULL;
    h ^= h >> 33;
    h *= 0xC4CEB9FE1A85EC53ULL;
    h ^= h >> 33;
    // Use Fibonacci for final bucket selection (power-of-2 table size)
    // This avoids an expensive modulo operation
    return (h * 11400714819323198485ULL) >> (64 - __builtin_ctzll(table_size));
}
```

Table Size Selection:
The number of buckets in the hash table is a critical design decision:
Too Few Buckets: the load factor climbs, collision chains lengthen, and every lookup pays extra memory accesses.
Too Many Buckets: memory is wasted on empty bucket slots, and the sparse bucket array exhibits poor cache locality.
Optimal Sizing:
A common heuristic is to size the table such that the load factor (entries / buckets) stays between 0.5 and 0.75:
load_factor = num_entries / num_buckets
Recommended: 0.5 ≤ load_factor ≤ 0.75
For 10,000 mapped pages:
Minimum buckets: 10,000 / 0.75 ≈ 13,333
Maximum buckets: 10,000 / 0.5 = 20,000
Practical choice: 16,384 (power of 2 for fast modulo)
Power-of-2 Table Sizes:
Using power-of-2 bucket counts allows replacing expensive modulo operations with fast bit masking:
bucket = hash(vpn) & (table_size - 1) // Only works when table_size is power of 2
This is significantly faster than hash(vpn) % table_size and is used universally in performance-critical implementations.
Without randomization, attackers could craft input patterns that cause pathological hash collisions (hash flooding attacks). Modern systems initialize hash tables with random seeds, making collision attacks computationally infeasible. The seed changes per-process or per-table instantiation, preventing adversarial exploitation.
The lookup process in a hashed page table involves computing the hash, locating the correct bucket, and traversing the collision chain to find the matching entry. Understanding this process in detail reveals the performance characteristics and design trade-offs.
Complete Lookup Algorithm:
PHYSICAL_ADDRESS translate(VIRTUAL_ADDRESS va):
1. Extract virtual page number: vpn = va >> PAGE_OFFSET_BITS
2. Extract page offset: offset = va & PAGE_OFFSET_MASK
3. Compute bucket index: bucket = hash(vpn) mod table_size
4. Get bucket head: entry = table.buckets[bucket]
5. Traverse chain:
while entry != NULL:
if entry.vpn == vpn AND entry.valid:
CHECK PERMISSIONS(entry.control)
UPDATE accessed bit
return (entry.pfn << PAGE_OFFSET_BITS) | offset
entry = entry.next
6. Entry not found: TRIGGER PAGE FAULT
Step-by-Step Analysis:
| Step | Operation | Time Cost | Notes |
|---|---|---|---|
| 1-2 | VPN/Offset Extraction | 1-2 cycles | Simple bit operations, pipelined |
| 3 | Hash Computation | 3-10 cycles | Depends on hash function complexity |
| 4 | Bucket Access | ~100 cycles (cache miss) | Memory load; cached if recently used |
| 5 | Chain Traversal | ~100 cycles per entry | Memory load per chain link |
| 6 | Permission Check | 1-2 cycles | Bit comparisons |
Performance Characteristics:
Best Case (No Collision): the hash lands on a single-entry bucket; one bucket load plus one entry load, roughly two dependent memory accesses after the hash.
Average Case: with a load factor below 0.75, expected chain length stays under 2, so a lookup takes about 2-3 memory accesses.
Worst Case (Collision Chain): many VPNs hash to the same bucket and the lookup degenerates into an O(n) chain walk, a symptom of a poor hash function or an overloaded table.
Comparison with Multi-Level Tables: a hashed lookup averages about two dependent memory accesses regardless of address-space size, whereas an x86-64 four-level walk always pays four; the hashed table wins on average but has a worse, unbounded tail.
```c
// Complete lookup implementation for hashed page table

#include "hashed_page_table.h"
#include <stdio.h>

// Translation result codes
typedef enum {
    TRANSLATE_SUCCESS,
    TRANSLATE_PAGE_FAULT,        // No mapping exists
    TRANSLATE_PROTECTION_FAULT,  // Permission denied
    TRANSLATE_INVALID_ADDRESS    // Address out of range
} TranslateResult;

// Full translation structure
typedef struct {
    TranslateResult result;
    uint64_t physical_address;
    PageControlBits permissions;
    int chain_depth;  // For performance monitoring
} TranslationResult;

// Main translation function
TranslationResult translate_address(HashedPageTable* table,
                                    uint64_t virtual_address,
                                    bool is_write,
                                    bool is_execute,
                                    bool is_user_mode) {
    TranslationResult result = {0};

    // Step 1 & 2: Extract VPN and offset
    const uint64_t PAGE_OFFSET_BITS = 12;  // 4KB pages
    const uint64_t PAGE_OFFSET_MASK = (1ULL << PAGE_OFFSET_BITS) - 1;
    uint64_t vpn = virtual_address >> PAGE_OFFSET_BITS;
    uint64_t offset = virtual_address & PAGE_OFFSET_MASK;

    // Step 3: Compute hash and bucket index
    uint64_t bucket_idx = hash_vpn(vpn, table->hash_seed, table->num_buckets);

    // Step 4: Access bucket
    HashedPageTableEntry* entry = table->buckets[bucket_idx];
    result.chain_depth = 0;

    // Step 5: Traverse chain
    while (entry != NULL) {
        result.chain_depth++;
        if (entry->virtual_page_number == vpn) {
            // Found matching entry; check if valid
            if (!entry->control.valid) {
                result.result = TRANSLATE_PAGE_FAULT;
                return result;
            }
            // Permission checks
            if (is_user_mode && !entry->control.user) {
                result.result = TRANSLATE_PROTECTION_FAULT;
                return result;
            }
            if (is_write && !entry->control.write) {
                result.result = TRANSLATE_PROTECTION_FAULT;
                return result;
            }
            if (is_execute && !entry->control.execute) {
                result.result = TRANSLATE_PROTECTION_FAULT;
                return result;
            }
            // Update accessed bit (atomically in a real implementation)
            entry->control.accessed = 1;
            // Update dirty bit on write
            if (is_write) {
                entry->control.dirty = 1;
            }
            // Compute physical address
            result.physical_address =
                (entry->physical_frame_number << PAGE_OFFSET_BITS) | offset;
            result.permissions = entry->control;
            result.result = TRANSLATE_SUCCESS;
            return result;
        }
        entry = entry->next;
    }

    // Step 6: Entry not found - page fault
    result.result = TRANSLATE_PAGE_FAULT;
    return result;
}
```

In practice, the vast majority of translations hit the TLB (Translation Lookaside Buffer) and never touch the page table. The hashed table is only accessed on TLB misses. A good TLB hit rate (>99%) means page table design affects only ~1% of translations—but that 1% can still be millions of lookups per second, making hash table efficiency crucial.
Standard hashed page tables store one entry per page mapping. Clustered page tables extend this concept by storing multiple contiguous page mappings in each hash table entry. This optimization exploits spatial locality—processes typically map contiguous virtual pages to related physical frames.
The Clustering Insight:
When a process allocates memory, it typically receives contiguous virtual addresses: a 1MB heap allocation or memory-mapped file, for instance, spans 256 consecutive 4KB pages.
Instead of hashing each page individually (creating many entries), we can hash the base page and store multiple translations in one entry.
Clustered Entry Structure:
┌─────────────────────────────────────────────────────────────────┐
│ Clustered Page Table Entry (variable size) │
├─────────────────────────────────────────────────────────────────┤
│ Base VPN (52 bits) │
├─────────────────────────────────────────────────────────────────┤
│ Cluster Size (4 bits: 1-16 pages) │ Control Bits │
├─────────────────────────────────────────────────────────────────┤
│ PFN[0]: Frame for VPN+0 │
│ PFN[1]: Frame for VPN+1 │
│ PFN[2]: Frame for VPN+2 │
│ PFN[3]: Frame for VPN+3 │
│ ... (up to 16 PFNs based on cluster size) │
├─────────────────────────────────────────────────────────────────┤
│ Chain Pointer │
└─────────────────────────────────────────────────────────────────┘
Lookup Modification for Clustered Tables:
LOOKUP(vpn):
1. Compute cluster base: base_vpn = vpn & ~(CLUSTER_SIZE - 1)
2. Compute offset within cluster: cluster_offset = vpn & (CLUSTER_SIZE - 1)
3. Hash using base_vpn: bucket = hash(base_vpn)
4. Search for entry with matching base_vpn
5. Extract PFN[cluster_offset] from entry
Benefits of Clustering: one hash computation and one chain traversal can resolve up to 16 pages; fewer, larger entries lower the effective load factor; and neighboring translations share cache lines.
Drawbacks of Clustering: partially filled clusters waste PFN slots, and control bits shared across the cluster coarsen per-page protection.
When Clustering Works Well: sequential heap and stack growth, memory-mapped files, and other large contiguous allocations.
When Clustering is Suboptimal: fragmented address spaces dominated by isolated single-page mappings, where most cluster slots stay empty.
Real-World Usage:
IA-64 (Itanium) architecture used clustered hashed page tables natively. The PowerPC architecture also supports hashed page tables with clustering. These implementations demonstrated that clustering provides significant benefits for typical workloads while gracefully handling suboptimal cases.
```c
// Clustered hashed page table implementation

#define CLUSTER_BITS 4
#define CLUSTER_SIZE (1 << CLUSTER_BITS)  // 16 pages per cluster
#define CLUSTER_MASK (CLUSTER_SIZE - 1)

typedef struct ClusteredEntry {
    uint64_t base_vpn;            // Base VPN of cluster
    uint64_t pfn[CLUSTER_SIZE];   // Physical frames for each page
    uint16_t valid_bitmap;        // Which of the 16 slots are valid
    PageControlBits control;      // Shared control for cluster
    struct ClusteredEntry* next;  // Chain pointer
} ClusteredEntry;

// Table of clustered entries (mirrors HashedPageTable)
typedef struct {
    ClusteredEntry** buckets;
    size_t num_buckets;
    uint64_t hash_seed;
} ClusteredPageTable;

// Lookup in clustered page table; the caller ORs in the page offset
TranslationResult clustered_lookup(ClusteredPageTable* table, uint64_t vpn) {
    TranslationResult result = {0};

    // Compute cluster base and offset
    uint64_t base_vpn = vpn & ~((uint64_t)CLUSTER_MASK);
    uint64_t cluster_offset = vpn & CLUSTER_MASK;

    // Hash on base VPN only
    uint64_t bucket = hash_vpn(base_vpn, table->hash_seed, table->num_buckets);
    ClusteredEntry* entry = table->buckets[bucket];

    while (entry != NULL) {
        if (entry->base_vpn == base_vpn) {
            // Check if this specific slot in the cluster is valid
            if (!(entry->valid_bitmap & (1u << cluster_offset))) {
                result.result = TRANSLATE_PAGE_FAULT;
                return result;
            }
            // Get PFN from cluster and form the frame's base physical address
            uint64_t pfn = entry->pfn[cluster_offset];
            result.physical_address = pfn << 12;
            result.result = TRANSLATE_SUCCESS;
            return result;
        }
        entry = entry->next;
    }

    result.result = TRANSLATE_PAGE_FAULT;
    return result;
}
```

Achieving optimal performance from hashed page tables requires careful attention to several interacting factors. Understanding these considerations is essential for system designers and kernel developers.
Load Factor Management:
The load factor (entries / buckets) is the primary determinant of lookup performance:
Expected chain length ≈ load_factor (for uniform hashing), so a lookup touches roughly 1 + load_factor entries:

| Load Factor | Expected Lookups | Recommendation |
|---|---|---|
| 0.25 | 1.25 | Oversized; wastes memory |
| 0.50 | 1.50 | Good balance |
| 0.75 | 1.75 | Acceptable |
| 1.00 | 2.00 | Consider resizing |
| 2.00 | 3.00 | Performance degrading |
Dynamic Resizing:
As processes map more pages, the hash table may need to grow. Common resizing strategies are doubling the bucket array and rehashing every entry at once, or incremental rehashing that migrates a few buckets on each operation.
Resizing is Expensive: every entry must be rehashed and relinked, and translations must be synchronized against the move, so practical implementations size the table generously up front.
Cache Behavior Analysis:
Memory access patterns critically affect hash table performance:
Bucket Array Caching: the bucket array is compact and contiguous, so frequently used buckets tend to stay resident in the L1/L2 caches.
Entry Caching: chain entries are individually allocated and scattered through memory, so each link followed risks a fresh cache miss.
Workload-Dependent Behavior:
| Access Type | Bucket | Entry | Total Cycles (approx) |
|---|---|---|---|
| L1 Hit (hot) | 4 cycles | 4 cycles | ~20 cycles |
| L2 Hit (warm) | 12 cycles | 12 cycles | ~40 cycles |
| L3 Hit (recent) | 40 cycles | 40 cycles | ~100 cycles |
| Memory (cold) | 100 cycles | 100+ cycles | ~250+ cycles |
Hardware Considerations:
TLB Miss Penalty: The hash table is only accessed on TLB misses, and on modern CPUs a miss already costs tens to hundreds of cycles, so the table walk dominates miss latency.
Memory Prefetching: Modern CPUs have hardware prefetchers that can hide memory latency for sequential scans, but pointer chasing defeats them, since the next chain address is unknown until the current load completes.
Simultaneous Multi-Threading (SMT/Hyperthreading): while one hardware thread stalls on a page-table walk, the sibling thread can keep the core busy, partially hiding the miss cost.
In practice, hashed page tables often outperform multi-level tables for 64-bit systems despite the collision overhead. The key is that average-case O(1+ε) with ε small beats guaranteed O(4) for common workloads. However, absolute worst-case latency matters for real-time systems—multi-level tables may be preferred when bounded latency is critical.
Hashed page tables represent a fundamental rethinking of address translation structure. By using hash functions to map virtual page numbers to table entries, they achieve translation lookup that scales with actual memory usage rather than potential address space size—a critical advantage for 64-bit architectures.
What's Next:
Hashed page tables must handle the case when multiple virtual page numbers hash to the same bucket—collisions. The next page explores collision handling strategies in depth, examining chaining, open addressing, and the trade-offs each approach presents for address translation workloads.
You now understand hashed page tables comprehensively—from the motivation behind their design through the internal structure, hash function requirements, lookup mechanisms, and performance considerations. This foundation prepares you to understand collision handling, inverted page tables, and their role in modern operating system design.