The buffer manager is the database component responsible for managing the buffer pool and mediating all access between higher-level database components and disk storage. It serves as the gatekeeper between the volatile, high-performance realm of main memory and the durable, slower world of persistent storage.
Every database operation that reads or writes data passes through the buffer manager. Query execution requests pages; the buffer manager fetches them. Transactions commit; the buffer manager coordinates with the log manager. Background maintenance runs; the buffer manager provides the mechanisms. Understanding the buffer manager's role is essential for comprehending database architecture.
By the end of this page, you will understand the buffer manager's responsibilities, its API design, how it interacts with query execution and transaction processing, and the concurrency control mechanisms it employs to support high-throughput workloads.
The buffer manager handles a wide range of responsibilities, from simple page access to complex coordination with other database subsystems: pinning and unpinning pages, selecting eviction victims, flushing dirty pages in coordination with the write-ahead log, prefetching pages ahead of demand, and supporting checkpointing and recovery.
The abstraction the buffer manager provides:
To higher-level components (query execution, index managers, etc.), the buffer manager presents an abstraction where all pages appear to be in memory. The caller simply requests a page by ID, and the buffer manager returns a pointer to the page contents in memory. The complexities of disk I/O, caching, and replacement are entirely hidden.
This abstraction is powerful because callers never deal with disk I/O, caching, or replacement decisions; they can write code as if the entire database were resident in memory.
The buffer manager's most important contract with callers is the pin/unpin protocol. When a caller pins a page, the buffer manager guarantees the page will remain valid in memory (at the same address) until it's unpinned. Violating this contract (using a page without pinning, or accessing after unpinning) leads to memory corruption.
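As a quick illustration, here is the caller's side of the contract in a minimal sketch (it uses the pinPage/unpinPage API defined in the next section; Page::getTuple follows the accessor style used in later examples, and error handling is omitted):

```cpp
// Pin: the buffer manager guarantees the page stays resident at this
// address until the matching unpin.
Page* page = buffer_mgr->pinPage(page_id, AccessMode::READ);

// Safe: the frame cannot be evicted or reused while the pin is held.
Tuple* tuple = page->getTuple(slot);

// Unpin: the guarantee ends here. is_dirty=false because we only read.
buffer_mgr->unpinPage(page_id, /*is_dirty=*/false);

// From this point on, dereferencing `page` is undefined behavior.
```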
The buffer manager exposes a well-defined API that other database components use to access pages. While implementations vary, the core operations are consistent across database systems:
```cpp
// Buffer Manager API definition
class BufferManager {
public:
    // ================================================
    // Core Page Access Operations
    // ================================================

    /**
     * Fetch a page and pin it for use.
     * If the page is not in the buffer pool, it will be read from disk.
     * The caller MUST call unpinPage() when done with the page.
     *
     * @param page_id The page to fetch
     * @param mode    Access mode: READ or WRITE
     * @return Pointer to the page contents, or nullptr if failed
     */
    Page* pinPage(PageId page_id, AccessMode mode);

    /**
     * Release a pin on a page.
     * After this call, the page may be evicted at any time.
     * Do NOT access the page after calling unpinPage().
     *
     * @param page_id  The page to unpin
     * @param is_dirty Set to true if the page was modified
     */
    void unpinPage(PageId page_id, bool is_dirty);

    /**
     * Allocate a new page on disk and pin it.
     * The page is initialized with zeros.
     *
     * @return PageId of the newly allocated page
     */
    PageId newPage();

    /**
     * Delete a page from the database.
     * The page must not be pinned by any transaction.
     *
     * @param page_id The page to delete
     * @return true if successful
     */
    bool deletePage(PageId page_id);

    // ================================================
    // Flushing and Synchronization
    // ================================================

    /**
     * Flush a specific page to disk if it's dirty.
     * This is a synchronous operation.
     */
    void flushPage(PageId page_id);

    /**
     * Flush all dirty pages to disk.
     * Used during checkpoint or shutdown.
     */
    void flushAllPages();

    // ================================================
    // Advanced Operations
    // ================================================

    /**
     * Prefetch a range of pages asynchronously.
     * Pages are loaded into the buffer pool but not pinned.
     */
    void prefetch(PageId start_page, size_t count);

    /**
     * Get current buffer pool statistics.
     */
    BufferStats getStats();

    /**
     * Resize the buffer pool (if supported).
     * May require draining current pages.
     */
    void resize(size_t new_size);
};
```

For reference, here is how production systems name these operations:

| Operation | PostgreSQL | MySQL InnoDB | SQLite |
|---|---|---|---|
| Pin/Read page | ReadBuffer() | buf_page_get() | sqlite3PagerGet() |
| Unpin page | ReleaseBuffer() | mtr_commit() | sqlite3PagerUnref() |
| Mark dirty | MarkBufferDirty() | mtr_memo_modify() | sqlite3PagerWrite() |
| Allocate page | ReadBuffer(P_NEW) | fsp_page_create() | sqlite3PagerWrite() |
| Flush page | FlushBuffer() | buf_flush_single_page_from_LRU() | sqlite3PagerSync() |
The pin/unpin protocol is the foundation of safe buffer pool usage. Let's examine it in detail:
```cpp
// Detailed pin implementation
Page* BufferManager::pinPage(PageId page_id, AccessMode mode) {
    // Step 1: Acquire buffer pool latch
    lock_guard<mutex> guard(pool_latch);

    // Step 2: Check if page is already in buffer pool
    auto it = page_table.find(page_id);
    if (it != page_table.end()) {
        // Page hit - found in buffer pool
        FrameId frame_id = it->second;
        FrameDescriptor& desc = descriptors[frame_id];

        // Increment pin count
        desc.pin_count++;

        // Update replacement algorithm state
        replacer->recordAccess(frame_id);

        // Acquire appropriate latch on the frame
        if (mode == AccessMode::WRITE) {
            desc.latch.writeLock();
        } else {
            desc.latch.readLock();
        }

        stats.page_hits++;
        return &frames[frame_id];
    }

    // Step 3: Page miss - need to load from disk
    stats.page_misses++;

    // Step 4: Find a frame to use
    FrameId frame_id;
    if (!free_list.empty()) {
        // Use a free frame
        frame_id = free_list.back();
        free_list.pop_back();
    } else {
        // Need to evict a page
        frame_id = findVictim();
        if (frame_id == INVALID_FRAME_ID) {
            throw BufferPoolExhaustedException(
                "No unpinned pages available for eviction"
            );
        }
        // Evict the current occupant
        evictPage(frame_id);
    }

    // Step 5: Load the page from disk
    disk_manager->readPage(page_id, &frames[frame_id]);

    // Step 6: Update metadata
    FrameDescriptor& desc = descriptors[frame_id];
    desc.page_id = page_id;
    desc.pin_count = 1;
    desc.is_dirty = false;
    desc.ref_bit = true;

    // Step 7: Add to page table
    page_table[page_id] = frame_id;

    // Step 8: Acquire latch
    if (mode == AccessMode::WRITE) {
        desc.latch.writeLock();
    } else {
        desc.latch.readLock();
    }

    return &frames[frame_id];
}

void BufferManager::unpinPage(PageId page_id, bool is_dirty) {
    lock_guard<mutex> guard(pool_latch);

    auto it = page_table.find(page_id);
    if (it == page_table.end()) {
        // Page not in buffer pool - error or already evicted
        return;
    }

    FrameId frame_id = it->second;
    FrameDescriptor& desc = descriptors[frame_id];

    // Release latch
    desc.latch.unlock();

    // Update dirty status
    if (is_dirty) {
        desc.is_dirty = true;
        if (!dirty_page_table.contains(page_id)) {
            dirty_page_table.addDirtyPage(page_id, log_manager->getCurrentLSN());
        }
    }

    // Decrement pin count
    if (desc.pin_count > 0) {
        desc.pin_count--;
    }

    // If fully unpinned, make available for replacement
    if (desc.pin_count == 0) {
        replacer->add(frame_id);
    }
}
```

Common pin-related bugs:
Double pin without double unpin: Pinning the same page twice without unpinning twice leads to artificially elevated pin counts.
Using page after unpin: After unpinPage(), the page may be evicted. Accessing it is undefined behavior.
Forgetting to unpin: Pin count never reaches zero; page is never evictable. Eventually causes buffer pool exhaustion.
Unpinning wrong page: Decrements pin count on wrong frame, causing premature eviction.
Incorrect dirty flag: Modifying a page but passing is_dirty=false to unpin leads to data loss.
Production code typically uses RAII wrappers (like C++ smart pointers or Rust's ownership system) to ensure pages are always unpinned. The wrapper's destructor calls unpin, preventing leaks even when exceptions occur.
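A minimal sketch of such a wrapper in C++, assuming the BufferManager API above (the PageGuard name and markDirty helper are illustrative, not from a specific system):

```cpp
// RAII page guard: the destructor always unpins, even if an
// exception unwinds the stack.
class PageGuard {
    BufferManager* mgr_;
    PageId page_id_;
    Page* page_;
    bool dirty_ = false;

public:
    PageGuard(BufferManager* mgr, PageId page_id, AccessMode mode)
        : mgr_(mgr), page_id_(page_id),
          page_(mgr->pinPage(page_id, mode)) {}

    // Non-copyable: exactly one owner per pin
    PageGuard(const PageGuard&) = delete;
    PageGuard& operator=(const PageGuard&) = delete;

    Page* operator->() { return page_; }
    void markDirty() { dirty_ = true; }

    ~PageGuard() {
        if (page_ != nullptr) {
            mgr_->unpinPage(page_id_, dirty_);  // Always runs
        }
    }
};

// Usage: the pin is released when the guard goes out of scope,
// even if insertTuple() throws.
//   PageGuard guard(buffer_mgr, page_id, AccessMode::WRITE);
//   guard->insertTuple(tuple);
//   guard.markDirty();
```

With a wrapper like this, forgetting to unpin and unpinning the wrong page both become much harder to write.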
Query execution operators rely heavily on the buffer manager for access to data. Let's examine how common operations interact with the buffer manager:
```cpp
// Sequential scan operator using buffer manager
class SeqScanOperator {
    TableId table_id;
    BufferManager* buffer_mgr;
    size_t current_page;          // Page number within the table
    SlotId current_slot;
    Page* pinned_page = nullptr;  // Page currently pinned by this scan

public:
    Tuple* next() {
        while (current_page < getTablePageCount(table_id)) {
            // Pin the current page for reading (once per page, so
            // repeated next() calls don't inflate the pin count)
            if (pinned_page == nullptr) {
                pinned_page = buffer_mgr->pinPage(
                    PageId{table_id, current_page},
                    AccessMode::READ
                );
            }

            // Scan tuples on this page
            while (current_slot < pinned_page->getSlotCount()) {
                Tuple* tuple = pinned_page->getTuple(current_slot);
                current_slot++;

                if (tuple != nullptr && !tuple->isDeleted()) {
                    // Found a valid tuple. We DON'T unpin yet - the page
                    // stays pinned, so the tuple remains valid until the
                    // scan advances past it.
                    return tuple;
                }
            }

            // Done with this page - unpin and move to the next
            buffer_mgr->unpinPage(PageId{table_id, current_page}, false);
            pinned_page = nullptr;
            current_page++;
            current_slot = 0;
        }
        return nullptr; // End of table
    }
};

// Index scan using buffer manager
class IndexScanOperator {
    BPlusTree* index;
    BufferManager* buffer_mgr;

public:
    Tuple* lookupByKey(Key key) {
        // Search index - this pins multiple index pages internally
        RID rid = index->search(key);
        if (!rid.isValid()) {
            return nullptr;
        }

        // Pin the data page
        Page* data_page = buffer_mgr->pinPage(
            rid.getPageId(),
            AccessMode::READ
        );

        // Get the tuple
        Tuple* tuple = data_page->getTuple(rid.getSlotId());

        // Caller must unpin when done with the tuple
        return tuple;
    }
};

// Insert operation using buffer manager
class InsertOperator {
    TableId table_id;
    BufferManager* buffer_mgr;

public:
    RID insert(Tuple* tuple) {
        // Find a page with space (or allocate new)
        PageId target_page = findPageWithSpace(tuple->getSize());

        // Pin for writing
        Page* page = buffer_mgr->pinPage(target_page, AccessMode::WRITE);

        // Insert the tuple
        SlotId slot = page->insertTuple(tuple);

        // Unpin and mark dirty
        buffer_mgr->unpinPage(target_page, true); // is_dirty = true

        return RID{target_page, slot};
    }
};
```

Each operator exhibits a distinct page access pattern: sequential scans pin pages in order and release each before moving on, index lookups pin scattered pages on demand, and inserts take short write pins on a single target page.
Smart query optimizers consider buffer pool size when planning. A nested-loop join is cost-effective if the inner table fits in the buffer pool. If not, a hash join may be preferred despite higher CPU cost. This is why buffer pool sizing affects not just performance but query plan selection.
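To make that concrete, here is a back-of-the-envelope comparison using standard textbook I/O cost formulas (the table sizes and pool size are made up for illustration):

```cpp
// Illustrative I/O cost model for join selection.
#include <cstdio>

int main() {
    const long b_outer = 1'000;   // pages in outer table
    const long b_inner = 50'000;  // pages in inner table
    const long pool    = 10'000;  // buffer pool size in pages

    // Nested-loop join: if the inner table fits in the pool, each inner
    // page is read from disk once. If not, every outer page forces a
    // full re-scan of the inner table.
    long nlj_cost = (b_inner <= pool)
        ? b_outer + b_inner             // would be 51,000 page reads
        : b_outer + b_outer * b_inner;  // ~50 million page reads here

    // Partitioned hash join: roughly read + write both tables once to
    // build partitions, then read the partitions back.
    long hash_cost = 3 * (b_outer + b_inner);  // 153,000 page I/Os

    printf("nested-loop: %ld, hash: %ld\n", nlj_cost, hash_cost);
    // With this pool size, hash join wins by a wide margin; with a
    // 60,000-page pool the nested-loop join would win instead.
}
```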
The buffer manager operates in a highly concurrent environment. Multiple queries execute simultaneously, each potentially accessing the same pages. Proper concurrency control is essential for correctness and performance.
Levels of concurrency control:
Buffer pool metadata: Protected by the pool_latch (often a mutex or spinlock). Guards the page table, free list, and frame descriptors.
Individual frames: Each frame has its own latch (typically a reader-writer lock). Multiple readers can access a frame simultaneously; writers require exclusive access.
Page contents: Separate from buffer manager; handled by lock manager at the tuple/row level.
The two-level latch pattern:
```cpp
// Two-level latch pattern in buffer manager
Page* BufferManager::pinPage(PageId page_id, AccessMode mode) {
    // Level 1: Acquire pool latch to find/allocate frame
    pool_latch.lock(); // Short duration

    auto it = page_table.find(page_id);
    FrameId frame_id;

    if (it != page_table.end()) {
        frame_id = it->second;
        descriptors[frame_id].pin_count++;
    } else {
        // ... allocate frame, load page ...
        frame_id = allocateAndLoadPage(page_id);
    }

    pool_latch.unlock(); // Release pool latch ASAP

    // Level 2: Acquire frame latch (can be held longer)
    FrameDescriptor& desc = descriptors[frame_id];
    if (mode == AccessMode::WRITE) {
        desc.latch.writeLock(); // May block waiting for readers
    } else {
        desc.latch.readLock();  // May share with other readers
    }

    return &frames[frame_id];
}

// Why two levels? Consider this scenario:
// 1. Thread A pins page P1 for writing (long operation)
// 2. Thread B wants page P2 (different page)
//
// With single lock: B waits for A even though they access different pages
// With two levels: B can acquire pool latch, find P2's frame, release pool
// latch, then acquire P2's frame latch - independent of A
```

| Latch Type | Scope | Duration | Contention Level |
|---|---|---|---|
| Pool latch (mutex) | Entire page table | Very short | High in multi-core systems |
| Frame latch (RW) | Single frame | Duration of page access | Varies by access pattern |
| Dirty list latch | Dirty page tracking | Short | Moderate in write-heavy workloads |
| Free list latch | Free frame management | Very short | Low (separate from pool latch) |
Buffer manager latches are not the same as database locks. Latches are held for short durations and follow strict ordering to prevent deadlock (e.g., always acquire pool latch before frame latch). Database locks are held for transaction duration and use deadlock detection.
Intelligent buffer managers don't just passively respond to page requests—they proactively fetch pages that will likely be needed soon. This prefetching hides I/O latency by loading pages before they're requested.
Prefetching strategies:
Sequential prefetch: When accessing pages sequentially (e.g., table scan), prefetch the next N pages. High accuracy for sequential access patterns.
Index-guided prefetch: During index range scans, prefetch data pages pointed to by index entries before they're accessed.
Query-plan-guided prefetch: The query executor informs the buffer manager of upcoming page needs based on the execution plan.
Prediction-based prefetch: Machine learning models predict future accesses based on historical patterns (experimental in modern systems).
```cpp
// Prefetching implementation
class BufferManager {
    // Asynchronous I/O thread pool for prefetch
    ThreadPool io_pool{4}; // 4 I/O threads

public:
    // Sequential prefetch hint from table scan
    void prefetchSequential(PageId start, size_t count) {
        for (size_t i = 0; i < count; i++) {
            PageId target = {start.file_id, start.page_num + i};

            // Skip if already in buffer pool
            if (isInBufferPool(target)) continue;

            // Submit async I/O request
            io_pool.submit([this, target]() {
                // Acquire frame and load page
                // But don't pin - just make it available
                loadPageAsync(target);
            });
        }
    }

    // Index scan prefetch
    void prefetchByRIDs(const vector<RID>& rids) {
        set<PageId> pages_to_fetch;
        for (const RID& rid : rids) {
            PageId page_id = rid.getPageId();
            if (!isInBufferPool(page_id)) {
                pages_to_fetch.insert(page_id);
            }
        }

        for (PageId page_id : pages_to_fetch) {
            io_pool.submit([this, page_id]() {
                loadPageAsync(page_id);
            });
        }
    }

private:
    void loadPageAsync(PageId page_id) {
        lock_guard<mutex> guard(pool_latch);

        // Double-check not loaded by another thread
        if (isInBufferPool(page_id)) return;

        // Same as regular page load, but without incrementing pin_count
        FrameId frame_id = allocateFrame();
        disk_manager->readPage(page_id, &frames[frame_id]);

        descriptors[frame_id].page_id = page_id;
        descriptors[frame_id].pin_count = 0; // Not pinned
        descriptors[frame_id].is_dirty = false;

        page_table[page_id] = frame_id;
        replacer->add(frame_id); // Available for replacement immediately
    }
};
```

Prefetching improves latency but can waste I/O bandwidth and pollute the cache if predictions are wrong. Effective prefetching requires: (1) accurate prediction of future accesses, (2) sufficient buffer pool space to hold prefetched pages, and (3) asynchronous I/O to avoid blocking.
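As a usage example, a table scan might drive the sequential prefetcher a fixed distance ahead of its read position so that I/O overlaps with tuple processing (a sketch; the PREFETCH_DISTANCE constant and the nextWithPrefetch hook are assumptions, not part of the API above):

```cpp
// How far ahead of the scan cursor to request pages (tunable).
constexpr size_t PREFETCH_DISTANCE = 32;

Tuple* SeqScanOperator::nextWithPrefetch() {
    // When entering a new page, hint the buffer manager about the
    // pages this scan will need next.
    if (current_slot == 0) {
        buffer_mgr->prefetchSequential(
            PageId{table_id, current_page + 1}, PREFETCH_DISTANCE);
    }
    return next(); // Regular scan logic from the earlier example
}
```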
The buffer manager plays a crucial role in database recovery and checkpoint operations. It maintains the information needed for crash recovery and participates in checkpoint execution.
Buffer manager's role in checkpointing:
Provide dirty page information: The buffer manager maintains the Dirty Page Table, which the checkpoint writes to the log.
Flush dirty pages: During a checkpoint, the buffer manager flushes dirty pages to disk. This ensures modifications are persisted.
Coordinate with WAL: Before flushing any dirty page, ensure the corresponding log records are durable.
Support fuzzy checkpoints: Continue normal operations during checkpoint; track which pages were dirty at checkpoint start.
```cpp
// Buffer manager checkpoint integration
class CheckpointManager {
    BufferManager* buffer_mgr;
    LogManager* log_mgr;
    TransactionManager* txn_mgr;

public:
    void performCheckpoint() {
        // Step 1: Get current state from buffer manager
        auto active_txns = txn_mgr->getActiveTransactions();
        auto dirty_pages = buffer_mgr->getDirtyPageTable();

        // Step 2: Write checkpoint BEGIN to log
        LSN checkpoint_begin_lsn = log_mgr->writeCheckpointBegin();

        // Step 3: Write transaction table to log
        log_mgr->writeTransactionTable(active_txns);

        // Step 4: Write dirty page table to log
        log_mgr->writeDirtyPageTable(dirty_pages);

        // Step 5: Request buffer manager to flush dirty pages
        // This is the expensive part - lots of I/O
        buffer_mgr->flushDirtyPagesForCheckpoint();

        // Step 6: Write checkpoint END to log
        LSN checkpoint_end_lsn = log_mgr->writeCheckpointEnd();

        // Step 7: Flush checkpoint records
        log_mgr->flushToLSN(checkpoint_end_lsn);

        // Step 8: Update master record with checkpoint location
        updateMasterRecord(checkpoint_begin_lsn);
    }
};

// Buffer manager support for checkpoint
void BufferManager::flushDirtyPagesForCheckpoint() {
    // Get snapshot of dirty pages at checkpoint start
    vector<PageId> dirty_at_checkpoint = getDirtyPageList();

    // Flush each, respecting WAL constraint
    for (PageId page_id : dirty_at_checkpoint) {
        lock_guard<mutex> guard(pool_latch);

        auto it = page_table.find(page_id);
        if (it == page_table.end()) continue; // Already evicted

        FrameId frame_id = it->second;
        FrameDescriptor& desc = descriptors[frame_id];
        if (!desc.is_dirty) continue; // Already flushed

        // Ensure WAL records are durable
        log_manager->flushToLSN(desc.page_lsn);

        // Write page to disk
        disk_manager->writePage(page_id, &frames[frame_id]);

        // Mark clean
        desc.is_dirty = false;
        dirty_page_table->removeDirtyPage(page_id);
    }
}
```

A 'sharp' checkpoint blocks all transactions until dirty pages are flushed. A 'fuzzy' checkpoint allows transactions to continue, using the Dirty Page Table snapshot to track what was dirty at checkpoint start. Modern databases use fuzzy checkpoints to minimize impact on running queries.
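For contrast, a sharp checkpoint fits in a few lines, at the cost of stalling the entire workload (a sketch; blockNewTransactions, waitForQuiesce, and resumeTransactions are hypothetical helpers, not APIs from a specific system):

```cpp
// Sharp checkpoint: simple, but blocks all transactions while flushing.
void CheckpointManager::performSharpCheckpoint() {
    txn_mgr->blockNewTransactions(); // No new work may start
    txn_mgr->waitForQuiesce();       // Wait for in-flight transactions
    buffer_mgr->flushAllPages();     // Flush while nothing dirties pages
    log_mgr->writeCheckpointEnd();   // All pages clean: recovery can
                                     // start from this record
    txn_mgr->resumeTransactions();   // Resume normal processing
}
```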
The buffer manager is the central coordinator for all page access in the database. It provides the abstraction that makes higher-level components oblivious to the complexities of disk I/O while ensuring correctness, durability, and performance.
What's next:
We've examined the buffer manager's API and integration points. The final page dives deep into LRU and Clock algorithms—the specific replacement algorithms most commonly used in production database systems, with implementation details and tuning considerations.
You now understand the buffer manager's role as the central coordinator for page access, its API design, concurrency control mechanisms, and integration with query execution and recovery subsystems.