Understanding what logs contain is only half the story. The how of log storage—the physical organization, buffering strategies, file management, and retention policies—is equally critical. A well-designed log storage system must achieve seemingly contradictory goals: it must accept writes extremely quickly, guarantee durability across crashes, and keep storage consumption bounded.
Achieving these goals requires sophisticated engineering in log buffer management, file organization, and lifecycle policies. This page examines the practical architecture that makes reliable logging possible at scale.
By the end of this page, you will understand how logs are physically organized on disk, how the log buffer optimizes write performance, how log files grow and are managed, the mechanisms for log archival and truncation, and the engineering considerations for log storage in production systems.
Transaction logs are stored in one or more dedicated files on stable storage. The physical organization of these files directly impacts both performance and recoverability.
Single Log File vs. Multiple Log Files:
While conceptually the log is a single continuous sequence, most production systems use multiple physical log files for practical reasons: fixed-size segments are easier to size, archive, recycle, and delete than one ever-growing file.
Common Log File Organizations:
| Organization | Description | Used By |
|---|---|---|
| Circular Log | Fixed number of files; oldest overwritten when full | SQL Server (default mode) |
| Segmented Log | New segments created as needed; old segments archived/deleted | PostgreSQL, MySQL |
| Single Growing File | One file that grows; truncated periodically | Some embedded databases |
| Log Groups | Multiple identical copies written in parallel | Oracle (multiplexed redo log groups) |
Segmented Log Architecture:
The most common modern approach is segmented logging. Here's how it works:
Active Segments: The current segment being written to, plus recently written segments that may still be needed for rollback or active transactions.
Archivable Segments: Segments containing only completed transactions, eligible for backup and removal from primary storage.
Segment Lifecycle:
```
// Typical segmented log layout

Log Directory: /data/pg_wal/
├── 000000010000000A00000001          # Oldest active segment
├── 000000010000000A00000002          # Active (contains running txns)
├── 000000010000000A00000003          # Active (contains running txns)
├── 000000010000000A00000004          # Current write segment
└── archive_status/
    ├── 000000010000000A00000001.done   # Marked as archived
    └── 000000010000000A00000002.ready  # Ready for archival

// Segment naming convention: TimelineID + LogFileNumber + SegmentNumber
// Each segment is typically 16MB (PostgreSQL) or configurable

// Log file internal structure:
// +------------------+------------------+------------------+
// | Record 1         | Record 2         | Record 3         |
// | LSN: 0xA0000001  | LSN: 0xA0000089  | LSN: 0xA0000102  |
// +------------------+------------------+------------------+
// | Record 4         | Record 5         | [Free Space]     |
// | LSN: 0xA0000150  | LSN: 0xA0000201  |                  |
// +------------------+------------------+------------------+

// LSN encodes file location:
// LSN = (FileNumber * FileSize) + OffsetWithinFile
// Given an LSN, the engine can seek directly to the record's location
```

Segment size is a trade-off: larger segments mean fewer files to manage but slower archival and recovery; smaller segments mean faster archival but more file-management overhead. PostgreSQL defaults to 16MB segments; SQL Server divides each physical log file into virtual log files (VLFs) whose sizes depend on how the file was created and grown. Choose based on your workload and recovery requirements.
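To make the LSN arithmetic above concrete, here is a small Python sketch (illustrative only, not any engine's code) that maps an LSN to a segment number and byte offset, assuming fixed 16MB segments and a simplified naming scheme:

```python
# Minimal sketch: mapping an LSN to a segment file and offset.
# Assumes fixed-size 16MB segments (as in PostgreSQL's default); the
# naming scheme below is illustrative, not PostgreSQL's exact format.

SEGMENT_SIZE = 16 * 1024 * 1024  # 16 MB per segment


def locate_record(lsn: int) -> tuple[int, int]:
    """Return (segment_number, byte_offset) for a given LSN."""
    segment_number = lsn // SEGMENT_SIZE
    offset = lsn % SEGMENT_SIZE
    return segment_number, offset


def segment_file_name(timeline: int, segment_number: int) -> str:
    """Build an illustrative segment file name from timeline + segment number."""
    return f"{timeline:08X}{segment_number:016X}"


if __name__ == "__main__":
    lsn = 0xA0000150
    seg, off = locate_record(lsn)
    print(f"LSN {lsn:#x} -> segment {segment_file_name(1, seg)}, offset {off}")
```

Because the mapping is pure arithmetic, recovery can jump straight to the segment and offset holding any LSN without scanning earlier files.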
While log records must reach stable storage to provide durability, writing each record individually to disk would be prohibitively slow. The log buffer (also called the write-ahead log buffer or redo log buffer) is an in-memory buffer that collects log records before flushing them to disk in batches.
How the Log Buffer Works:

Incoming log records are appended to the in-memory buffer, and the buffer is flushed to disk when one of the following triggers fires:
| Trigger | Description | Priority |
|---|---|---|
| Transaction Commit | COMMIT requires all that transaction's log records to be durable | Highest |
| Buffer Full | No space for new records until flush occurs | High |
| Checkpoint | All log records through checkpoint must be durable | High |
| Timeout | Periodic flush even without other triggers (e.g., every 1 second) | Medium |
| Page Write | Log records for a page must be flushed before that page is written (WAL rule) | High |
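The timeout and buffer-full triggers are typically serviced by a dedicated background writer (PostgreSQL's WAL writer plays this role). The Python sketch below is a hypothetical illustration of that loop, with `flush_fn` standing in for the actual flush routine:

```python
# Hypothetical background log-writer loop: flush the log buffer on a short
# timer (the "timeout" trigger) or earlier when writers report the buffer is
# filling up (the "buffer full" trigger). `flush_fn` is an assumed callable.

import threading


class BackgroundLogWriter:
    def __init__(self, flush_fn, interval_s: float = 0.2, fill_threshold: float = 0.5):
        self.flush_fn = flush_fn              # Flushes whatever is in the log buffer
        self.interval_s = interval_s          # Timeout trigger, e.g. every 200 ms
        self.fill_threshold = fill_threshold  # Wake early past this fill ratio
        self._wakeup = threading.Event()
        self._stopped = False

    def notify_fill(self, fill_ratio: float) -> None:
        # Called by transaction threads after appending records.
        if fill_ratio >= self.fill_threshold:
            self._wakeup.set()

    def stop(self) -> None:
        self._stopped = True
        self._wakeup.set()

    def run(self) -> None:
        while not self._stopped:
            # Sleep until the timeout elapses or a writer signals pressure.
            self._wakeup.wait(timeout=self.interval_s)
            self._wakeup.clear()
            self.flush_fn()   # Commit-triggered flushes happen elsewhere
```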
Log Buffer Architecture:
```
// Log Buffer Architecture

struct LogBuffer {
    byte[] buffer;                     // Circular buffer in memory
    size_t capacity;                   // Total buffer size (e.g., 16MB)
    atomic<uint64_t> write_position;   // Where the next record goes
    atomic<uint64_t> flush_position;   // Up to where the buffer has been flushed
    Mutex write_mutex;                 // Protects concurrent writes
    Condition space_available;         // Signals when space is freed
    Condition flush_complete;          // Signals when a flush finishes
}

// Writing a log record:
function writeLogRecord(record):
    acquire(log_buffer.write_mutex)

    // Wait if the buffer is full
    while (write_position - flush_position >= capacity):
        wait(space_available)

    // Copy the record into the buffer
    offset = write_position % capacity
    memcpy(buffer + offset, record, record.size)

    // Advance the write position
    write_position += record.size

    release(log_buffer.write_mutex)
    return record.lsn

// Flushing the log buffer:
function flushLogBuffer(through_lsn):
    // Calculate what needs to be written
    start = flush_position
    end   = min(write_position, through_lsn)

    if (end <= start):
        return                           // Nothing to flush

    // Write to disk (may wrap around the circular buffer)
    bytes_to_write = end - start
    disk_offset    = flush_position      // File offset where this flush begins

    // Use O_DIRECT and fdatasync for durability
    write(log_file, buffer + (start % capacity), bytes_to_write)
    fdatasync(log_file)                  // Ensure on stable storage!

    // Update the flush position
    flush_position = end
    signal(space_available)
    signal(flush_complete)
```

When multiple transactions commit simultaneously, their commit records can be flushed together in a single I/O operation. This 'group commit' amortizes the flush overhead across multiple transactions, dramatically improving throughput under concurrent load. A single fsync can make the commits of dozens of transactions durable.
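A minimal sketch of the group-commit idea, in Python with invented names (`flush_to` stands in for the write-plus-fsync routine): each committer either waits for an in-progress flush or becomes the leader that performs one fsync on behalf of every commit at or below its LSN.

```python
# Illustrative group-commit sketch: many committers, one fsync.
# `flush_to(lsn)` is an assumed callable that writes and fsyncs the log
# through `lsn` and returns the LSN actually made durable.

import threading


class GroupCommitLog:
    def __init__(self, flush_to):
        self.flush_to = flush_to           # write + fsync the log through an LSN
        self.flushed_lsn = 0               # highest LSN known durable
        self.flush_in_progress = False
        self.lock = threading.Lock()
        self.flush_done = threading.Condition(self.lock)

    def commit(self, commit_lsn: int) -> None:
        self.lock.acquire()
        try:
            while self.flushed_lsn < commit_lsn:
                if self.flush_in_progress:
                    # A leader is already flushing; wait and re-check.
                    self.flush_done.wait()
                    continue
                # Become the leader for this group of commits.
                self.flush_in_progress = True
                self.lock.release()
                try:
                    # One fsync covers every commit at or below this LSN.
                    durable = self.flush_to(commit_lsn)
                finally:
                    self.lock.acquire()
                    self.flush_in_progress = False
                self.flushed_lsn = max(self.flushed_lsn, durable)
                self.flush_done.notify_all()
        finally:
            self.lock.release()
```

A real engine would have the leader flush through the highest pending LSN; in this sketch any waiter whose LSN is still not covered simply becomes the next leader.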
The log's reliability depends on ensuring writes actually reach stable storage. This is more complex than it sounds due to multiple layers of caching between the database and the disk platters:
The Write Path:
Database Log Buffer → OS Page Cache → Disk Controller Cache → Disk Platters
Data isn't truly durable until it reaches the disk platters. Data in any cache layer can be lost on power failure. The database must force data through all caching layers.
Durability Mechanisms:
| Mechanism | What It Does | Considerations |
|---|---|---|
| fsync() / fdatasync() | Forces OS to flush file data to disk controller | Most common; depends on disk honoring flush |
| O_DIRECT | Bypasses OS page cache; writes directly to disk | Reduces double-buffering; requires aligned writes |
| O_SYNC / O_DSYNC | Every write() is synchronous to disk | Simple but may reduce batching opportunities |
| Battery-Backed Cache | Disk controller cache survives power loss | Allows treating controller cache as durable |
| Write Barriers | Ensures ordering of writes to disk | Important for journaling correctness |
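As a Linux-oriented sketch of the most common mechanism in the table, the snippet below appends a record and then calls fdatasync, and shows O_DSYNC as the every-write-synchronous alternative. It is illustrative, not a production log writer (no alignment handling for O_DIRECT, minimal error handling):

```python
# Linux-oriented sketch: append a log record, then force it past the OS page
# cache with fdatasync(). O_DSYNC (shown separately) makes every write()
# synchronous instead. Illustrative only; no O_DIRECT alignment handling.

import os


def append_durably(path: str, record: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, record)
        os.fdatasync(fd)   # Flush file data (not all metadata) to stable storage
    finally:
        os.close(fd)


def open_every_write_synchronous(path: str) -> int:
    # Alternative: each write() returns only after the data reaches the device.
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC, 0o600)
```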
The fsync Controversy:
For decades, databases relied on fsync() to ensure durability. However, research and real-world incidents have revealed problems:
Error handling: If fsync() fails, the state of the file is undefined. On some kernels, a subsequent fsync() can even report success although earlier writes were lost, so the database may never learn that data is missing.
Disk write reordering: Some disk controllers reorder writes for performance. Without proper barriers, log records might reach disk out of order.
Write-back caching: Aggressive caching on disks or storage controllers can delay actual platter writes.
Modern Best Practices:
Use fdatasync() instead of fsync() when file metadata changes aren't critical to recovery.

Database durability is only as strong as the weakest link in the storage chain. A disk that ignores flush commands, a controller with a volatile write cache, or a virtualized environment with lazy persistence can all silently violate durability guarantees. Verify your entire storage stack.
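One defensive pattern that follows from these findings (PostgreSQL, for example, now treats fsync failure on critical files as a reason to crash and recover) is to treat a failed log flush as fatal rather than retrying, because a later fsync may falsely report success. A hedged sketch:

```python
# Defensive pattern: do not retry a failed log flush and carry on. On some
# kernels a failed fsync() clears the error, so a retry can "succeed" even
# though earlier writes were lost. Crash instead and let recovery replay the
# log from the last known-durable point. Illustrative sketch only.

import os
import sys


def flush_log_or_die(log_fd: int) -> None:
    try:
        os.fdatasync(log_fd)
    except OSError as err:
        sys.stderr.write(f"FATAL: log flush failed: {err}\n")
        os._exit(1)   # State of the file is unknown; recovery is the safe path
```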
While active log files enable crash recovery, archived logs serve a broader purpose: enabling point-in-time recovery (PITR), supporting replication, and providing a complete audit trail. Log archival is the process of copying completed log segments to secondary storage before they're removed from primary storage.
Why Archive Logs?

Archived logs make point-in-time recovery possible, allow replicas that have fallen behind to catch up, and preserve a complete change history for audit and compliance.
Archival Process:
```
// Typical Log Archival Workflow

// 1. Segment becomes eligible for archival
//    - All transactions spanning this segment have completed
//    - Segment is no longer needed for crash recovery

// 2. Archival command triggered (example: PostgreSQL archive_command)
archive_command = 'cp %p /backup/wal_archive/%f'
// %p = full path to segment file
// %f = segment file name

// 3. Success verification
//    - Archival command must return success (exit code 0)
//    - Segment marked as archived only on success
//    - Retry on failure with backoff

// 4. Segment management after archival
//    - Primary copy may be deleted if space is needed
//    - Or retained for faster recovery (reduced need to fetch from archive)

// PostgreSQL archival status tracking:
// pg_wal/archive_status/
//     000000010000000A00000001.ready   # Ready for archival
//     000000010000000A00000002.done    # Successfully archived
//     000000010000000A00000003.ready   # Archival in progress or pending

// Example archive workflow:
function archiveSegment(segment_path):
    try:
        // Copy to multiple destinations for redundancy
        copy(segment_path, local_archive_path)
        copy(segment_path, cloud_storage_uri)

        // Verify copies are complete and correct
        verify_checksum(local_archive_path)
        verify_checksum(cloud_storage_uri)

        // Mark as successfully archived
        mark_archived(segment_path)
        return SUCCESS
    catch error:
        log("Archive failed for " + segment_path + ": " + error)
        schedule_retry(segment_path, delay=60)
        return FAILURE
```

Log archiving is continuous—segments are archived as soon as they're complete. This differs from periodic base backups. The combination of periodic base backups plus continuous log archiving enables true point-in-time recovery: restore the base backup, then replay archived logs up to the desired moment.
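To illustrate how a base backup plus archived segments yields point-in-time recovery, here is a simplified, self-contained Python sketch. Real engines drive this internally (PostgreSQL, for instance, uses restore_command and recovery_target_time); the record shape below is invented for the example.

```python
# Simplified point-in-time recovery sketch: start from a base backup, then
# replay archived segments in order, stopping at the recovery target.
# The LogRecord shape is invented for illustration; real engines drive this
# through their own archive-fetch and redo machinery.

from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Optional


@dataclass
class LogRecord:
    lsn: int
    commit_time: Optional[datetime]   # Set on commit records only
    apply: Callable[[], None]         # Redo action for this record


def recover_to(target_time: datetime,
               archived_segments: Iterable[Iterable[LogRecord]]) -> int:
    """Replay archived segments in LSN order; return the last LSN applied."""
    last_lsn = 0
    for segment in archived_segments:            # Base backup already restored
        for record in segment:
            if record.commit_time and record.commit_time > target_time:
                return last_lsn                  # Reached the recovery target
            record.apply()                       # Redo the logged change
            last_lsn = record.lsn
    return last_lsn
```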
Without management, transaction logs grow forever. Log truncation is the process of removing or reusing log space that's no longer needed. This is distinct from archival—truncation reclaims space; archival preserves data elsewhere.
When Can Log Records Be Truncated?
A log record can be truncated (its space reused) only when:
No active transaction needs it: All transactions that might need to rollback using this record have completed
Crash recovery doesn't need it: The corresponding data pages are durably on disk (no redo needed)
Archival is complete: If archival is enabled, the segment must be archived first
Replication is caught up: All standby servers have received and applied the record
| Blocker | Why It Blocks | Resolution |
|---|---|---|
| Long-running transaction | May need log records for rollback | Wait for completion or kill transaction |
| Dirty pages in buffer | Log needed if crash before page flush | Wait for checkpoint |
| Slow archival | Segment not yet safely archived | Speed up archival; add storage |
| Slow replica | Replica needs log for catch-up | Wait for replica or disconnect it |
| PITR retention | Log needed for recovery target | Reduce retention or add storage |
Checkpoint's Role in Truncation:
Checkpoints are the primary enabler of log truncation. A checkpoint flushes dirty data pages to disk and records in the log the position from which crash recovery would need to begin.
After a checkpoint completes, all log records before the checkpoint LSN (subject to other constraints) can be truncated—they're no longer needed for crash recovery.
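One way to picture the "subject to other constraints" caveat is that the truncation point is the minimum of several independent horizons, each an LSN below which some consumer no longer needs the log. A small illustrative sketch (the names are not any engine's actual bookkeeping):

```python
# Sketch: the log may be truncated only up to the minimum of several horizons.
# Each argument is an LSN below which that consumer no longer needs the log;
# the names are illustrative, not a specific engine's bookkeeping.

def truncation_point(checkpoint_lsn: int,
                     oldest_active_txn_lsn: int,
                     oldest_unarchived_lsn: int,
                     slowest_replica_lsn: int) -> int:
    """Everything strictly below the returned LSN may be discarded or reused."""
    return min(checkpoint_lsn,
               oldest_active_txn_lsn,
               oldest_unarchived_lsn,
               slowest_replica_lsn)


# A long-running transaction (low oldest_active_txn_lsn) pins the log even
# though checkpointing, archival, and replication have all moved on.
print(truncation_point(checkpoint_lsn=9_000,
                       oldest_active_txn_lsn=1_200,
                       oldest_unarchived_lsn=8_500,
                       slowest_replica_lsn=7_000))   # -> 1200
```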
Log Space Management Strategies:
```
// Log space management approaches

// 1. CIRCULAR LOG (SQL Server, Oracle)
//    Fixed number of virtual log files (VLFs)
//    Oldest VLF reused when no longer needed

struct CircularLog {
    VLF vlfs[NUM_VLFS];        // Array of virtual log files
    int active_vlf_start;      // Oldest VLF still needed
    int current_vlf;           // VLF currently being written
}

// When current_vlf reaches active_vlf_start, writers must wait!
// Solution: checkpoint to advance active_vlf_start

// 2. SEGMENTED LOG (PostgreSQL)
//    New segments created as needed
//    Old segments deleted after archival + checkpoint

function manageLogSegments():
    for segment in log_segments:
        if segment.max_lsn < last_checkpoint_lsn
           and segment.archived
           and segment.replicated_to_all_standbys:
            if num_segments > min_segments:
                delete(segment)
            else:
                recycle(segment)   // Keep the file for reuse

// 3. AUTOMATIC GROWTH (with limits)
//    Log grows as needed up to a configured maximum
//    Alert/block when approaching the limit

max_log_size = 100GB
current_log_size = getLogSize()

if current_log_size > max_log_size * 0.8:
    alert("Log approaching size limit")
    trigger_checkpoint()

if current_log_size >= max_log_size:
    // Block new transactions until space is available
    block_new_transactions()

// 4. MANUAL TRUNCATION
//    DBA explicitly backs up the log, which marks its space reusable
//    (Common with the SQL Server full recovery model)

BACKUP LOG database_name TO DISK = 'backup.trn'
-- After the log backup, log space is marked for reuse
```

A single long-running transaction can prevent log truncation indefinitely, causing log files to consume all available disk space. This is one of the most common production database issues. Monitor for long-running transactions and set timeout policies.
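For the monitoring advice above, here is a sketch against PostgreSQL's pg_stat_activity view using psycopg2; the connection string and the ten-minute threshold are placeholders to adapt to your environment.

```python
# Sketch: flag transactions that have been open longer than a threshold,
# since they can pin the log and block truncation. Queries PostgreSQL's
# pg_stat_activity view via psycopg2; the DSN and threshold are placeholders.

import psycopg2

LONG_TXN_THRESHOLD = "10 minutes"

QUERY = """
    SELECT pid, usename, state, now() - xact_start AS txn_age
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
      AND now() - xact_start > %s::interval
    ORDER BY xact_start
"""


def find_long_transactions(dsn: str = "dbname=mydb"):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY, (LONG_TXN_THRESHOLD,))
            return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for pid, user, state, age in find_long_transactions():
        print(f"pid={pid} user={user} state={state} open for {age}")
```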
Log storage hardware decisions significantly impact both performance and reliability. Unlike data files that benefit from SSDs for random reads, log writes have unique characteristics that inform hardware choices:
Log I/O Characteristics:

Log writes are sequential and append-only, arrive as many small, latency-sensitive flushes on the commit path, and are rarely read back outside recovery. This makes write latency, not random-read IOPS, the figure of merit when choosing storage:
| Option | Advantages | Disadvantages | Best For |
|---|---|---|---|
| Enterprise SSD | Low latency; high IOPS; durable | Higher cost per GB | High-throughput OLTP |
| NVMe SSD | Extremely low latency; parallel queues | Premium cost | Ultra-low-latency requirements |
| Enterprise HDD with BBU | Lower cost; battery protects cache | Higher latency than SSD | Cost-sensitive; large logs |
| RAID-10 HDD | Redundancy + reasonable performance | Latency varies | Reliability-focused |
| Separate Log Volume | Isolates log I/O from data I/O | Additional hardware | Avoiding I/O contention |
Key Hardware Recommendations:
1. Separate Log Storage from Data Storage
Log writes and data reads/writes have different I/O patterns. Placing them on the same storage causes contention: random data I/O breaks up the log's sequential write stream, and heavy data traffic queues ahead of latency-critical commit flushes.
Isolating log I/O to a dedicated volume or drive array prevents data I/O from delaying commits.
2. Use Enterprise-Grade Storage
Consumer-grade SSDs and HDDs may acknowledge writes that are still sitting in a volatile on-device cache, or silently ignore flush commands, so "durable" log records can vanish on power failure.
Enterprise storage includes capacitors or batteries to flush caches on power loss.
3. Consider RAID for Redundancy
RAID-1 (mirroring) or RAID-10 for log files provides protection against a single-drive failure without giving up sequential write performance, so the log survives the loss of any one disk.
In cloud environments, use provisioned IOPS storage for logs (AWS io1/io2, Azure Premium SSD). Standard or GP storage may have variable latency that causes commit time spikes. Also consider storage-optimized instances with local NVMe for lowest latency.
Beyond hardware, several software and configuration optimizations can dramatically improve log storage performance:
```
-- PostgreSQL log performance settings

-- Log buffer size (default 16MB, increase for high throughput)
wal_buffers = 64MB

-- Group commit delay (microseconds to wait for other commits)
commit_delay = 10        -- Wait up to 10us if ≥5 concurrent txns
commit_siblings = 5      -- Only delay if this many other active txns

-- Retain recent segments
wal_keep_size = 1GB      -- Keep at least this much recent WAL available

-- Async commit (faster but risks up to 3 * wal_writer_delay of data loss)
-- synchronous_commit = off    -- NOT recommended for important data

-- MySQL/InnoDB log performance settings

-- Log buffer size
innodb_log_buffer_size = 64M

-- Log file size (larger = fewer switches, longer recovery)
innodb_log_file_size = 1G

-- Flush behavior
innodb_flush_log_at_trx_commit = 1   -- Full durability (recommended)
-- = 0: Flush every second (data loss risk)
-- = 2: Flush to OS cache per commit (faster, some risk)

-- SQL Server log performance settings

-- Use a dedicated drive for the log
ALTER DATABASE mydb MODIFY FILE
    (NAME = mydb_log, FILENAME = 'L:\logs\mydb.ldf')

-- Instant file initialization (faster file growth)
-- Requires the 'Perform Volume Maintenance Tasks' privilege
-- Note: historically applies to data files; log files are zero-initialized on growth
```

Asynchronous commit can improve throughput 10x or more, but at a risk: the last few milliseconds of transactions may be lost on crash. Use it only for data that can tolerate loss (sessions, caches) or where application-level idempotency handles retry.
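If you do relax durability, PostgreSQL allows synchronous_commit to be set per transaction rather than instance-wide, so only explicitly chosen writes take the risk. A sketch using psycopg2, with a placeholder DSN and table:

```python
# Sketch: relax durability for one low-value transaction only, instead of
# setting synchronous_commit = off instance-wide. PostgreSQL honors the
# setting per transaction; the DSN and table name are placeholders.

import psycopg2


def record_page_view(dsn: str, session_id: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn:                    # Commits on success, rolls back on error
            with conn.cursor() as cur:
                cur.execute("SET LOCAL synchronous_commit TO OFF")  # This txn only
                cur.execute(
                    "INSERT INTO page_views (session_id, viewed_at) VALUES (%s, now())",
                    (session_id,),
                )
    finally:
        conn.close()
    # The commit returns without waiting for the WAL flush: a few milliseconds
    # of loss window traded for lower latency, for this transaction only.
```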
We've examined the physical infrastructure that makes transaction logging reliable and performant: segmented log files addressed by LSN, an in-memory log buffer with group commit, durability forced through every cache layer, continuous archival for point-in-time recovery, checkpoint-driven truncation, and dedicated low-latency storage.
What's Next:
Now that we understand how logs are stored and managed, we'll complete our study of log-based recovery by examining why logs are so important—the critical role they play not just in crash recovery but in the broader database ecosystem including replication, auditing, and change data capture.
You now understand the physical infrastructure of log storage—how logs are organized, buffered, archived, and truncated. Next, we'll explore the broader importance of logs beyond just crash recovery.