Here's a number that should startle you: accessing data from disk takes approximately 100,000 times longer than accessing data from memory. An access that takes 100 nanoseconds in RAM takes roughly 10 milliseconds from a spinning disk—the difference between instantaneous and noticeable.
The Buffer Manager exists to hide this astronomical performance gap. It maintains a pool of database pages in main memory, satisfying most data requests from fast RAM instead of slow disk. When configured properly, a buffer manager can achieve hit rates exceeding 99%, making database operations feel nearly as fast as in-memory data structures.
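To see why hit rate matters so much, consider the expected cost of one page access. This small sketch uses illustrative latencies (100 ns for a buffer hit, 10 ms for a disk read—the function name and constants are assumptions for the example, not from any real system):

```typescript
// Back-of-the-envelope effective access time for one page request.
function effectiveAccessNs(hitRatio: number): number {
  const HIT_NS = 100;          // Assumed buffer pool hit latency.
  const MISS_NS = 10_000_000;  // Assumed disk read latency (10 ms).
  return hitRatio * HIT_NS + (1 - hitRatio) * MISS_NS;
}

effectiveAccessNs(0.99);  // ~100,099 ns — the 1% of misses dominates
effectiveAccessNs(0.999); // ~10,100 ns — one more "nine" is ~10x faster
```

The striking part: even at a 99% hit rate, misses account for almost all of the average latency, which is why well-tuned systems chase hit rates well above 99%.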
This page explores the Buffer Manager in depth—the architecture, algorithms, and trade-offs that make efficient database caching possible.
By the end of this page, you will understand buffer pool architecture, page replacement algorithms (LRU, Clock, LRU-K), the relationship between buffer management and WAL, and how modern databases optimize for different workload patterns.
The buffer pool is a region of main memory divided into fixed-size frames, each capable of holding one database page. The Buffer Manager maintains metadata about each frame and coordinates access between the DBMS components and disk storage.
Core components of the buffer pool:
- Frames: fixed-size memory slots, each capable of holding one disk page
- Page table: a hash table mapping PageIds to the frames that currently hold them
- Per-frame metadata: pin count, dirty flag, and the pageLSN of the most recent modification
- Free list: frames not currently holding any page, available for new reads
The Buffer Manager API:
Other DBMS components interact with the Buffer Manager through a simple interface:
```typescript
interface BufferManager {
  /**
   * Fetch a page from buffer pool. If not present, read from disk.
   * Increments pin count—caller must unpin when done.
   *
   * @param pageId - The page to fetch
   * @param exclusive - If true, acquire exclusive latch for writes
   * @returns Pointer to the in-memory page frame
   */
  fetchPage(pageId: PageId, exclusive: boolean): Page;

  /**
   * Release a pin on a page. When pinCount reaches 0,
   * the page becomes eligible for eviction.
   *
   * @param pageId - The page to unpin
   * @param dirty - If true, mark page as modified (must be written before eviction)
   */
  unpinPage(pageId: PageId, dirty: boolean): void;

  /**
   * Allocate a new page on disk and bring it into buffer pool.
   * Returns an empty page ready for writing.
   */
  newPage(): Page;

  /**
   * Delete a page from both buffer pool and disk.
   */
  deletePage(pageId: PageId): void;

  /**
   * Force all dirty pages to disk. Used for checkpoints.
   */
  flushAllPages(): void;
}

// Usage example:
const page = bufferManager.fetchPage(42, /*exclusive=*/ true);
page.data[100] = 0xFF; // Modify the page
bufferManager.unpinPage(42, /*dirty=*/ true);
// Page will be written to disk eventually
```

A 'pinned' page cannot be evicted—the caller is actively using it. The Buffer Manager tracks pin counts; when a page reaches pinCount=0, it becomes a candidate for eviction. Forgetting to unpin is a classic bug that exhausts the buffer pool, causing all requests to block.
When the buffer pool is full and a new page is needed, the Buffer Manager must evict an existing page. The replacement policy determines which page to evict. The goal is to evict pages that won't be needed soon—maximizing future cache hits.
The challenge: We can't predict the future, so replacement policies use past access patterns to estimate future value.
Optimal policy (Bélády's algorithm): Evict the page that will be needed furthest in the future. This is optimal but impossible—we don't know future accesses. It serves as a theoretical benchmark.
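Although Bélády's policy cannot run online, it is easy to simulate offline against a recorded reference string, which is how replacement policies are benchmarked. A minimal sketch (the function name and reference string are invented for illustration):

```typescript
// Offline simulation of Bélády's optimal policy: with the full
// (normally unknowable) future reference string, evict the cached
// page whose next use lies furthest in the future.
function simulateOptimal(refs: number[], capacity: number): number {
  const cache = new Set<number>();
  let hits = 0;
  for (let i = 0; i < refs.length; i++) {
    const page = refs[i];
    if (cache.has(page)) { hits++; continue; }
    if (cache.size >= capacity) {
      // Choose the victim: the cached page referenced furthest ahead
      // (or never again, which counts as infinitely far).
      let victim = -1, furthest = -1;
      for (const p of cache) {
        const next = refs.indexOf(p, i + 1);
        const dist = next === -1 ? Infinity : next;
        if (dist > furthest) { furthest = dist; victim = p; }
      }
      cache.delete(victim);
    }
    cache.add(page);
  }
  return hits;
}

// Hot page 1 interleaved with one-off pages; with 2 frames the
// optimal policy keeps page 1 resident and hits on all 3 re-reads.
const hits = simulateOptimal([1, 2, 3, 1, 4, 1, 5, 1], 2); // 3 hits
```

Running a real policy's hit count against this ceiling shows how much room for improvement remains.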
Least Recently Used (LRU) evicts the page that hasn't been accessed for the longest time. The intuition: if a page hasn't been used recently, it probably won't be used soon.
Implementation: maintain a doubly linked list of pages ordered by recency, plus a hash map from PageId to list node; on each access, move the page's node to the head of the list in O(1).

Advantages:
- Simple to implement and reason about
- Captures temporal locality well, which suits typical OLTP access patterns

Disadvantages:
- Every page access mutates the shared list, creating latch contention under high concurrency
- Vulnerable to sequential flooding: a single large scan can evict the entire hot set
The Sequential Scan Problem:
A major challenge for buffer management is sequential scans. When a query scans an entire table:
- Every page is accessed exactly once and never again
- Under strict LRU, each scanned page is nonetheless inserted as most-recently-used
- The scan's one-shot pages push out the hot pages that will actually be reused

This 'cache pollution' can devastate performance. Solutions include:
- MRU for scans: evict the page the scan just finished with, since it won't be needed again
- Ring buffers: PostgreSQL gives large scans a small private buffer ring instead of the main pool
- Midpoint insertion: InnoDB inserts newly read pages partway down its LRU list and promotes them only on a second access
- History-based policies such as LRU-K, which consider the last K accesses rather than just the most recent one
```sql
-- PostgreSQL buffer pool settings
SHOW shared_buffers; -- Total buffer pool size
-- Typical: 25-40% of available RAM

-- PostgreSQL uses Clock with some LRU-like behavior
-- and special handling for scan-resistant access

-- View buffer pool statistics
SELECT
  c.relname as table_name,
  count(*) as buffers,
  pg_size_pretty(count(*) * 8192) as size,
  round(100.0 * count(*) / (
    SELECT setting::int FROM pg_settings
    WHERE name = 'shared_buffers'
  ), 2) as pct_of_pool
FROM pg_buffercache b
JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;

-- MySQL InnoDB buffer pool
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
-- Typical: 70-80% of available RAM for dedicated servers

-- MySQL buffer pool statistics
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool%';
```

When a page is modified in the buffer pool, it becomes dirty—its memory contents differ from disk. The Buffer Manager must eventually write dirty pages back to disk, but the timing and ordering of these writes is critical for both performance and correctness.
The write-back challenge:
The WAL Constraint:
The Buffer Manager must respect the Write-Ahead Logging protocol:
A dirty page cannot be written to disk until all log records that describe its modifications have been flushed to the WAL.
This is enforced by comparing each page's pageLSN (the LSN of the most recent modification) against the flushedLSN (the highest LSN flushed to the WAL). If pageLSN > flushedLSN, the Buffer Manager must wait for the WAL flush before writing the page.
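The check itself is a one-line comparison. Here is a sketch of that rule in TypeScript (the `Frame` shape and function name are illustrative, not any specific engine's API):

```typescript
// WAL rule: a dirty frame may be written back only once the log is
// durable up to that frame's pageLSN.
interface Frame {
  pageId: number;
  dirty: boolean;
  pageLSN: number; // LSN of the most recent modification to this page.
}

function canWriteBack(frame: Frame, flushedLSN: number): boolean {
  // Clean pages need no WAL flush; dirty pages require that every
  // log record describing their changes (LSN <= pageLSN) is on disk.
  return !frame.dirty || frame.pageLSN <= flushedLSN;
}

const frame: Frame = { pageId: 42, dirty: true, pageLSN: 1700 };
canWriteBack(frame, 1650); // false: must flush WAL through LSN 1700 first
canWriteBack(frame, 1700); // true: safe to write the page
```

When the check fails, the Buffer Manager either picks a different victim or forces a WAL flush before proceeding with the eviction.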
Background Writers:
Modern databases use background processes to write dirty pages to disk, spreading I/O over time instead of bursting during eviction:
- PostgreSQL: the bgwriter process wakes periodically and writes the oldest dirty pages
- MySQL InnoDB: page_cleaner threads flush dirty pages in batches

```sql
-- PostgreSQL background writer settings
SHOW bgwriter_delay;          -- Sleep between rounds (200ms default)
SHOW bgwriter_lru_maxpages;   -- Max pages to write per round
SHOW bgwriter_lru_multiplier; -- How many to write based on recent need

-- Background writer statistics
SELECT * FROM pg_stat_bgwriter;
-- checkpoints_timed: Scheduled checkpoints
-- checkpoints_req: Requested checkpoints (WAL full, etc.)
-- buffers_checkpoint: Pages written during checkpoints
-- buffers_clean: Pages written by bgwriter
-- buffers_backend: Pages written by backends (bad—means eviction I/O)

-- Goal: buffers_backend should be near zero
-- If high, increase shared_buffers or tune bgwriter

-- MySQL InnoDB page cleaner settings
SHOW VARIABLES LIKE 'innodb_page_cleaners';   -- Number of cleaner threads
SHOW VARIABLES LIKE 'innodb_io_capacity';     -- IOPS budget for flushing
SHOW VARIABLES LIKE 'innodb_io_capacity_max'; -- Max IOPS for urgent flushing
```

Sizing the buffer pool correctly is one of the most impactful tuning decisions. Too small, and the cache miss rate cripples performance. Too large, and the OS starts swapping, which is even worse.
General guidelines:
| DBMS | Recommendation | Notes |
|---|---|---|
| PostgreSQL | 25-40% of RAM | OS filesystem cache handles rest; double caching avoided |
| MySQL InnoDB | 70-80% of RAM | InnoDB uses O_DIRECT, bypassing OS cache; needs more buffer pool |
| Oracle | 40-80% of RAM | Depends on SGA configuration and other components |
| SQL Server | 60-80% of RAM | Maximum server memory setting; dynamic adjustment |
Measuring Buffer Pool Effectiveness:
The key metric is cache hit ratio—the percentage of page requests satisfied from memory without disk I/O. For OLTP workloads, aim for 95%+ hit ratio.
```sql
-- PostgreSQL: Cache hit ratio
SELECT
  sum(heap_blks_hit) as cache_hits,
  sum(heap_blks_read) as disk_reads,
  round(
    100.0 * sum(heap_blks_hit) /
    nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0),
    2
  ) as cache_hit_ratio
FROM pg_statio_user_tables;

-- Per-table cache hit ratio
SELECT
  relname,
  heap_blks_hit,
  heap_blks_read,
  round(
    100.0 * heap_blks_hit /
    nullif(heap_blks_hit + heap_blks_read, 0),
    2
  ) as hit_ratio
FROM pg_statio_user_tables
ORDER BY heap_blks_read DESC
LIMIT 10;

-- MySQL InnoDB cache hit ratio
SELECT (1 - (
  (SELECT VARIABLE_VALUE FROM performance_schema.global_status
   WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads') /
  (SELECT VARIABLE_VALUE FROM performance_schema.global_status
   WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests')
)) * 100 AS hit_ratio;
-- Should be > 99% for warm, well-sized buffer pool
```

The 'working set' is the set of pages actively used by current queries. If your working set fits in the buffer pool, you'll see excellent hit ratios. If it exceeds the buffer pool, performance degrades rapidly. The working set size depends on query patterns and data distribution—not just total database size.
The Buffer Manager is accessed concurrently by many threads—query executors, background writers, checkpoint processes. Protecting shared data structures requires careful concurrency control without becoming a bottleneck.
Two levels of protection:
Buffer Pool Latches: Protect the buffer pool's data structures (page table, free list). These are short-held locks for pool operations.
Page Latches: Protect individual page contents. Read latches allow concurrent readers; write latches provide exclusive access.
Buffer Pool Partitioning:
A single buffer pool with one latch becomes a contention bottleneck for high-concurrency workloads. Modern systems partition the buffer pool:
- MySQL: innodb_buffer_pool_instances creates multiple independent buffer pools
- PostgreSQL: the buffer mapping hash table is split into partitions, each guarded by its own lightweight lock

```sql
-- MySQL: Multiple buffer pool instances
SHOW VARIABLES LIKE 'innodb_buffer_pool_instances';
-- Recommended: 8-16 instances for large buffer pools
-- Pages are distributed by hash(PageID)

-- innodb_buffer_pool_instances is read-only at runtime;
-- set it in my.cnf and restart:
--   [mysqld]
--   innodb_buffer_pool_instances = 8

-- Per-instance statistics
SELECT
  POOL_ID, POOL_SIZE, FREE_BUFFERS, DATABASE_PAGES,
  PAGES_MADE_YOUNG, PAGES_NOT_MADE_YOUNG
FROM information_schema.INNODB_BUFFER_POOL_STATS;

-- PostgreSQL: Check buffer pool latch contention
SELECT * FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND wait_event LIKE 'buffer%';
```

The Buffer Manager follows strict latch ordering: always acquire the buffer pool latch before page latches, and release in reverse order. When needing multiple page latches, acquire in PageID order. This global ordering makes deadlocks impossible within the buffer manager, eliminating the need for expensive deadlock detection.
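The ordering rule is easy to state in code. A tiny sketch (the helper is invented for illustration; real engines bake this ordering into their latch-acquisition paths):

```typescript
// Deadlock-free latch acquisition: every thread sorts the PageIds it
// needs and latches them in ascending order, so no cycle of waiters
// can form. Release happens in the reverse order.
function latchAcquisitionOrder(pageIds: number[]): number[] {
  return [...pageIds].sort((a, b) => a - b);
}

// Two threads needing pages {7, 3} and {3, 7} both latch 3 before 7,
// so neither can hold 7 while waiting on 3.
latchAcquisitionOrder([7, 3]); // [3, 7]
latchAcquisitionOrder([3, 7]); // [3, 7]
```

Because both threads agree on the same global order, the classic hold-and-wait cycle (each holding what the other wants) cannot arise.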
Smart Buffer Managers anticipate future page requests and prefetch data before it's needed, hiding disk latency. This is especially important for sequential operations like table scans and index range scans.
Prefetching strategies:
- Sequential read-ahead: when pages are requested in increasing order, fetch the next several pages before they are asked for
- Index-driven prefetching: during an index range scan, issue reads for the pages the index entries point to before the executor reaches them
- Hinted prefetching: the query executor tells the Buffer Manager which pages an operator will need next
```sql
-- PostgreSQL: effective_io_concurrency
-- Controls how many concurrent I/O operations can be issued
SHOW effective_io_concurrency; -- Default: 1 (HDD), set 200+ for SSD

-- Higher values enable more prefetching
SET effective_io_concurrency = 200; -- For SSD arrays

-- PostgreSQL 14+: Also used for maintenance operations
SHOW maintenance_io_concurrency; -- For VACUUM, CREATE INDEX

-- MySQL InnoDB read-ahead
SHOW VARIABLES LIKE 'innodb_read_ahead%';
-- innodb_read_ahead_threshold: pages accessed to trigger read-ahead
-- Linear read-ahead: sequential page access triggers prefetch

-- MySQL random read-ahead (for certain patterns)
SHOW VARIABLES LIKE 'innodb_random_read_ahead';
```

With HDDs, aggressive sequential prefetching is critical—seek time dominates. With SSDs, seek time is near-zero, so the benefit shifts to parallelism (issuing many concurrent requests to saturate the SSD's internal parallelism). Increase effective_io_concurrency significantly for NVMe SSDs.
The Buffer Manager is the performance multiplier that makes databases usable by hiding the massive speed gap between memory and disk through intelligent caching.
Key takeaways:
- The buffer pool caches fixed-size pages in memory frames; pin counts keep in-use pages from being evicted
- Replacement policies such as LRU, Clock, and LRU-K approximate Bélády's unattainable optimum, with special handling for sequential scans
- The WAL constraint (pageLSN ≤ flushedLSN) governs when dirty pages may be written back to disk
- Buffer pool sizing, background writers, partitioning, and prefetching are the main performance levers
What's next:
We've now explored all four core DBMS components: Query Processor, Storage Manager, Transaction Manager, and Buffer Manager. The final page synthesizes these components, showing how they interact to process a complete query from submission to result delivery—a unified view of the DBMS architecture in action.
You now understand how the Buffer Manager bridges memory and disk, from page replacement algorithms to dirty page management and prefetching. You can appreciate why buffer pool tuning is one of the most impactful performance optimizations in database administration.