Understanding what logs contain is only half the story. The how of log storage—the physical organization, buffering strategies, file management, and retention policies—is equally critical. A well-designed log storage system must achieve seemingly contradictory goals: it must accept writes extremely quickly, guarantee durability across crashes, and keep storage consumption bounded.
Achieving these goals requires sophisticated engineering in log buffer management, file organization, and lifecycle policies. This page examines the practical architecture that makes reliable logging possible at scale.
By the end of this page, you will understand how logs are physically organized on disk, how the log buffer optimizes write performance, how log files grow and are managed, the mechanisms for log archival and truncation, and the engineering considerations for log storage in production systems.
Transaction logs are stored in one or more dedicated files on stable storage. The physical organization of these files directly impacts both performance and recoverability.
Single Log File vs. Multiple Log Files:
While conceptually the log is a single continuous sequence, most production systems use multiple physical log files for practical reasons: fixed-size segments are easier to size, archive, recycle, and delete than one ever-growing file.
Common Log File Organizations:
| Organization | Description | Used By |
|---|---|---|
| Circular Log | Fixed number of files; oldest overwritten when full | SQL Server (default mode) |
| Segmented Log | New segments created as needed; old segments archived/deleted | PostgreSQL, MySQL |
| Single Growing File | One file that grows; truncated periodically | Some embedded databases |
| Log Groups | Multiple identical copies written in parallel | Oracle (multiplexed redo log groups) |
Segmented Log Architecture:
The most common modern approach is segmented logging. Here's how it works:
Active Segments: The current segment being written to, plus recently written segments that may still be needed for rollback or active transactions.
Archivable Segments: Segments containing only completed transactions, eligible for backup and removal from primary storage.
Segment Lifecycle:
```
// Typical segmented log layout

Log Directory: /data/pg_wal/
├── 000000010000000A00000001          # Oldest active segment
├── 000000010000000A00000002          # Active (contains running txns)
├── 000000010000000A00000003          # Active (contains running txns)
├── 000000010000000A00000004          # Current write segment
└── archive_status/
    ├── 000000010000000A00000001.done   # Marked as archived
    └── 000000010000000A00000002.ready  # Ready for archival

// Segment naming convention: TimelineID + LogFileNumber + SegmentNumber
// Each segment is typically 16MB (PostgreSQL) or configurable

// Log file internal structure:
// +------------------+------------------+------------------+
// | Record 1         | Record 2         | Record 3         |
// | LSN: 0xA0000001  | LSN: 0xA0000089  | LSN: 0xA0000102  |
// +------------------+------------------+------------------+
// | Record 4         | Record 5         | [Free Space]     |
// | LSN: 0xA0000150  | LSN: 0xA0000201  |                  |
// +------------------+------------------+------------------+

// LSN encodes file location:
// LSN = (FileNumber * FileSize) + OffsetWithinFile
// Given an LSN, the engine can seek directly to the record's location
```

Segment size is a trade-off: larger segments mean fewer files to manage but slower archival and recovery; smaller segments mean faster archival but more file-management overhead. PostgreSQL defaults to 16MB segments; SQL Server divides each physical log file into virtual log files (VLFs) whose sizes depend on how the file was created and grown. Choose based on your workload and recovery requirements.
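To make the LSN arithmetic above concrete, here is a small Python sketch (illustrative only, not any engine's code) that maps an LSN to a segment number and byte offset, assuming fixed 16MB segments and a simplified naming scheme:

```python
# Minimal sketch: mapping an LSN to a segment file and offset.
# Assumes fixed-size 16MB segments (as in PostgreSQL's default); the
# naming scheme below is illustrative, not PostgreSQL's exact format.

SEGMENT_SIZE = 16 * 1024 * 1024  # 16 MB per segment


def locate_record(lsn: int) -> tuple[int, int]:
    """Return (segment_number, byte_offset) for a given LSN."""
    segment_number = lsn // SEGMENT_SIZE
    offset = lsn % SEGMENT_SIZE
    return segment_number, offset


def segment_file_name(timeline: int, segment_number: int) -> str:
    """Build an illustrative segment file name from timeline + segment number."""
    return f"{timeline:08X}{segment_number:016X}"


if __name__ == "__main__":
    lsn = 0xA0000150
    seg, off = locate_record(lsn)
    print(f"LSN {lsn:#x} -> segment {segment_file_name(1, seg)}, offset {off}")
```

Because the mapping is pure arithmetic, recovery can jump straight to the segment and offset holding any LSN without scanning earlier files.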
While log records must reach stable storage to provide durability, writing each record individually to disk would be prohibitively slow. The log buffer (also called the write-ahead log buffer or redo log buffer) is an in-memory buffer that collects log records before flushing them to disk in batches.
How the Log Buffer Works:

Incoming log records are appended to the in-memory buffer, and the buffer is flushed to disk when one of the following triggers fires:
| Trigger | Description | Priority |
|---|---|---|
| Transaction Commit | COMMIT requires all that transaction's log records to be durable | Highest |
| Buffer Full | No space for new records until flush occurs | High |
| Checkpoint | All log records through checkpoint must be durable | High |
| Timeout | Periodic flush even without other triggers (e.g., every 1 second) | Medium |
| Page Write | Log records for a page must be flushed before that page is written (WAL rule) | High |
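The timeout and buffer-full triggers are typically serviced by a dedicated background writer (PostgreSQL's WAL writer plays this role). The Python sketch below is a hypothetical illustration of that loop, with `flush_fn` standing in for the actual flush routine:

```python
# Hypothetical background log-writer loop: flush the log buffer on a short
# timer (the "timeout" trigger) or earlier when writers report the buffer is
# filling up (the "buffer full" trigger). `flush_fn` is an assumed callable.

import threading


class BackgroundLogWriter:
    def __init__(self, flush_fn, interval_s: float = 0.2, fill_threshold: float = 0.5):
        self.flush_fn = flush_fn              # Flushes whatever is in the log buffer
        self.interval_s = interval_s          # Timeout trigger, e.g. every 200 ms
        self.fill_threshold = fill_threshold  # Wake early past this fill ratio
        self._wakeup = threading.Event()
        self._stopped = False

    def notify_fill(self, fill_ratio: float) -> None:
        # Called by transaction threads after appending records.
        if fill_ratio >= self.fill_threshold:
            self._wakeup.set()

    def stop(self) -> None:
        self._stopped = True
        self._wakeup.set()

    def run(self) -> None:
        while not self._stopped:
            # Sleep until the timeout elapses or a writer signals pressure.
            self._wakeup.wait(timeout=self.interval_s)
            self._wakeup.clear()
            self.flush_fn()   # Commit-triggered flushes happen elsewhere
```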
Log Buffer Architecture:
```
// Log Buffer Architecture

struct LogBuffer {
    byte[] buffer;                     // Circular buffer in memory
    size_t capacity;                   // Total buffer size (e.g., 16MB)
    atomic<uint64_t> write_position;   // Where the next record goes
    atomic<uint64_t> flush_position;   // Up to where the buffer has been flushed
    Mutex write_mutex;                 // Protects concurrent writes
    Condition space_available;         // Signals when space is freed
    Condition flush_complete;          // Signals when a flush finishes
}

// Writing a log record:
function writeLogRecord(record):
    acquire(log_buffer.write_mutex)

    // Wait if the buffer is full
    while (write_position - flush_position >= capacity):
        wait(space_available)

    // Copy the record into the buffer
    offset = write_position % capacity
    memcpy(buffer + offset, record, record.size)

    // Advance the write position
    write_position += record.size

    release(log_buffer.write_mutex)
    return record.lsn

// Flushing the log buffer:
function flushLogBuffer(through_lsn):
    // Calculate what needs to be written
    start = flush_position
    end   = min(write_position, through_lsn)

    if (end <= start):
        return                           // Nothing to flush

    // Write to disk (may wrap around the circular buffer)
    bytes_to_write = end - start
    disk_offset    = flush_position      // File offset where this flush begins

    // Use O_DIRECT and fdatasync for durability
    write(log_file, buffer + (start % capacity), bytes_to_write)
    fdatasync(log_file)                  // Ensure on stable storage!

    // Update the flush position
    flush_position = end
    signal(space_available)
    signal(flush_complete)
```

When multiple transactions commit simultaneously, their commit records can be flushed together in a single I/O operation. This 'group commit' amortizes the flush overhead across multiple transactions, dramatically improving throughput under concurrent load. A single fsync can make the commits of dozens of transactions durable.
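A minimal sketch of the group-commit idea, in Python with invented names (`flush_to` stands in for the write-plus-fsync routine): each committer either waits for an in-progress flush or becomes the leader that performs one fsync on behalf of every commit at or below its LSN.

```python
# Illustrative group-commit sketch: many committers, one fsync.
# `flush_to(lsn)` is an assumed callable that writes and fsyncs the log
# through `lsn` and returns the LSN actually made durable.

import threading


class GroupCommitLog:
    def __init__(self, flush_to):
        self.flush_to = flush_to           # write + fsync the log through an LSN
        self.flushed_lsn = 0               # highest LSN known durable
        self.flush_in_progress = False
        self.lock = threading.Lock()
        self.flush_done = threading.Condition(self.lock)

    def commit(self, commit_lsn: int) -> None:
        self.lock.acquire()
        try:
            while self.flushed_lsn < commit_lsn:
                if self.flush_in_progress:
                    # A leader is already flushing; wait and re-check.
                    self.flush_done.wait()
                    continue
                # Become the leader for this group of commits.
                self.flush_in_progress = True
                self.lock.release()
                try:
                    # One fsync covers every commit at or below this LSN.
                    durable = self.flush_to(commit_lsn)
                finally:
                    self.lock.acquire()
                    self.flush_in_progress = False
                self.flushed_lsn = max(self.flushed_lsn, durable)
                self.flush_done.notify_all()
        finally:
            self.lock.release()
```

A real engine would have the leader flush through the highest pending LSN; in this sketch any waiter whose LSN is still not covered simply becomes the next leader.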
The log's reliability depends on ensuring writes actually reach stable storage. This is more complex than it sounds due to multiple layers of caching between the database and the disk platters:
The Write Path:
Database Log Buffer → OS Page Cache → Disk Controller Cache → Disk Platters
Data isn't truly durable until it reaches the disk platters. Data in any cache layer can be lost on power failure. The database must force data through all caching layers.
Durability Mechanisms:
| Mechanism | What It Does | Considerations |
|---|---|---|
| fsync() / fdatasync() | Forces OS to flush file data to disk controller | Most common; depends on disk honoring flush |
| O_DIRECT | Bypasses OS page cache; writes directly to disk | Reduces double-buffering; requires aligned writes |
| O_SYNC / O_DSYNC | Every write() is synchronous to disk | Simple but may reduce batching opportunities |
| Battery-Backed Cache | Disk controller cache survives power loss | Allows treating controller cache as durable |
| Write Barriers | Ensures ordering of writes to disk | Important for journaling correctness |
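As a Linux-oriented sketch of the most common mechanism in the table, the snippet below appends a record and then calls fdatasync, and shows O_DSYNC as the every-write-synchronous alternative. It is illustrative, not a production log writer (no alignment handling for O_DIRECT, minimal error handling):

```python
# Linux-oriented sketch: append a log record, then force it past the OS page
# cache with fdatasync(). O_DSYNC (shown separately) makes every write()
# synchronous instead. Illustrative only; no O_DIRECT alignment handling.

import os


def append_durably(path: str, record: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, record)
        os.fdatasync(fd)   # Flush file data (not all metadata) to stable storage
    finally:
        os.close(fd)


def open_every_write_synchronous(path: str) -> int:
    # Alternative: each write() returns only after the data reaches the device.
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_DSYNC, 0o600)
```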
The fsync Controversy:
For decades, databases relied on fsync() to ensure durability. However, research and real-world incidents have revealed problems:
Error handling: If fsync() fails, the state of the file is undefined. On some kernels, a subsequent fsync() can even report success although earlier writes were lost, so the database may never learn that data is missing.
Disk write reordering: Some disk controllers reorder writes for performance. Without proper barriers, log records might reach disk out of order.
Write-back caching: Aggressive caching on disks or storage controllers can delay actual platter writes.
Modern Best Practices:
Use fdatasync() instead of fsync() when file metadata changes aren't critical to recovery.

Database durability is only as strong as the weakest link in the storage chain. A disk that ignores flush commands, a controller with a volatile write cache, or a virtualized environment with lazy persistence can all silently violate durability guarantees. Verify your entire storage stack.
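One defensive pattern that follows from these findings (PostgreSQL, for example, now treats fsync failure on critical files as a reason to crash and recover) is to treat a failed log flush as fatal rather than retrying, because a later fsync may falsely report success. A hedged sketch:

```python
# Defensive pattern: do not retry a failed log flush and carry on. On some
# kernels a failed fsync() clears the error, so a retry can "succeed" even
# though earlier writes were lost. Crash instead and let recovery replay the
# log from the last known-durable point. Illustrative sketch only.

import os
import sys


def flush_log_or_die(log_fd: int) -> None:
    try:
        os.fdatasync(log_fd)
    except OSError as err:
        sys.stderr.write(f"FATAL: log flush failed: {err}\n")
        os._exit(1)   # State of the file is unknown; recovery is the safe path
```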
While active log files enable crash recovery, archived logs serve a broader purpose: enabling point-in-time recovery (PITR), supporting replication, and providing a complete audit trail. Log archival is the process of copying completed log segments to secondary storage before they're removed from primary storage.
Why Archive Logs?

Archived logs make point-in-time recovery possible, allow replicas that have fallen behind to catch up, and preserve a complete change history for audit and compliance.
Archival Process:
```
// Typical Log Archival Workflow

// 1. Segment becomes eligible for archival
//    - All transactions spanning this segment have completed
//    - Segment is no longer needed for crash recovery

// 2. Archival command triggered (example: PostgreSQL archive_command)
archive_command = 'cp %p /backup/wal_archive/%f'
// %p = full path to segment file
// %f = segment file name

// 3. Success verification
//    - Archival command must return success (exit code 0)
//    - Segment marked as archived only on success
//    - Retry on failure with backoff

// 4. Segment management after archival
//    - Primary copy may be deleted if space is needed
//    - Or retained for faster recovery (reduced need to fetch from archive)

// PostgreSQL archival status tracking:
// pg_wal/archive_status/
//     000000010000000A00000001.ready   # Ready for archival
//     000000010000000A00000002.done    # Successfully archived
//     000000010000000A00000003.ready   # Archival in progress or pending

// Example archive workflow:
function archiveSegment(segment_path):
    try:
        // Copy to multiple destinations for redundancy
        copy(segment_path, local_archive_path)
        copy(segment_path, cloud_storage_uri)

        // Verify copies are complete and correct
        verify_checksum(local_archive_path)
        verify_checksum(cloud_storage_uri)

        // Mark as successfully archived
        mark_archived(segment_path)
        return SUCCESS
    catch error:
        log("Archive failed for " + segment_path + ": " + error)
        schedule_retry(segment_path, delay=60)
        return FAILURE
```

Log archiving is continuous—segments are archived as soon as they're complete. This differs from periodic base backups. The combination of periodic base backups plus continuous log archiving enables true point-in-time recovery: restore the base backup, then replay archived logs up to the desired moment.
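To illustrate how a base backup plus archived segments yields point-in-time recovery, here is a simplified, self-contained Python sketch. Real engines drive this internally (PostgreSQL, for instance, uses restore_command and recovery_target_time); the record shape below is invented for the example.

```python
# Simplified point-in-time recovery sketch: start from a base backup, then
# replay archived segments in order, stopping at the recovery target.
# The LogRecord shape is invented for illustration; real engines drive this
# through their own archive-fetch and redo machinery.

from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, Optional


@dataclass
class LogRecord:
    lsn: int
    commit_time: Optional[datetime]   # Set on commit records only
    apply: Callable[[], None]         # Redo action for this record


def recover_to(target_time: datetime,
               archived_segments: Iterable[Iterable[LogRecord]]) -> int:
    """Replay archived segments in LSN order; return the last LSN applied."""
    last_lsn = 0
    for segment in archived_segments:            # Base backup already restored
        for record in segment:
            if record.commit_time and record.commit_time > target_time:
                return last_lsn                  # Reached the recovery target
            record.apply()                       # Redo the logged change
            last_lsn = record.lsn
    return last_lsn
```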
Without management, transaction logs grow forever. Log truncation is the process of removing or reusing log space that's no longer needed. This is distinct from archival—truncation reclaims space; archival preserves data elsewhere.
When Can Log Records Be Truncated?
A log record can be truncated (its space reused) only when:
No active transaction needs it: All transactions that might need to rollback using this record have completed
Crash recovery doesn't need it: The corresponding data pages are durably on disk (no redo needed)
Archival is complete: If archival is enabled, the segment must be archived first
Replication is caught up: All standby servers have received and applied the record
| Blocker | Why It Blocks | Resolution |
|---|---|---|
| Long-running transaction | May need log records for rollback | Wait for completion or kill transaction |
| Dirty pages in buffer | Log needed if crash before page flush | Wait for checkpoint |
| Slow archival | Segment not yet safely archived | Speed up archival; add storage |
| Slow replica | Replica needs log for catch-up | Wait for replica or disconnect it |
| PITR retention | Log needed for recovery target | Reduce retention or add storage |
Checkpoint's Role in Truncation:
Checkpoints are the primary enabler of log truncation. A checkpoint flushes dirty data pages to disk and records in the log the position from which crash recovery would need to begin.
After a checkpoint completes, all log records before the checkpoint LSN (subject to other constraints) can be truncated—they're no longer needed for crash recovery.
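One way to picture the "subject to other constraints" caveat is that the truncation point is the minimum of several independent horizons, each an LSN below which some consumer no longer needs the log. A small illustrative sketch (the names are not any engine's actual bookkeeping):

```python
# Sketch: the log may be truncated only up to the minimum of several horizons.
# Each argument is an LSN below which that consumer no longer needs the log;
# the names are illustrative, not a specific engine's bookkeeping.

def truncation_point(checkpoint_lsn: int,
                     oldest_active_txn_lsn: int,
                     oldest_unarchived_lsn: int,
                     slowest_replica_lsn: int) -> int:
    """Everything strictly below the returned LSN may be discarded or reused."""
    return min(checkpoint_lsn,
               oldest_active_txn_lsn,
               oldest_unarchived_lsn,
               slowest_replica_lsn)


# A long-running transaction (low oldest_active_txn_lsn) pins the log even
# though checkpointing, archival, and replication have all moved on.
print(truncation_point(checkpoint_lsn=9_000,
                       oldest_active_txn_lsn=1_200,
                       oldest_unarchived_lsn=8_500,
                       slowest_replica_lsn=7_000))   # -> 1200
```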
Log Space Management Strategies:
```
// Log space management approaches

// 1. CIRCULAR LOG (SQL Server, Oracle)
//    Fixed number of virtual log files (VLFs)
//    Oldest VLF reused when no longer needed

struct CircularLog {
    VLF vlfs[NUM_VLFS];        // Array of virtual log files
    int active_vlf_start;      // Oldest VLF still needed
    int current_vlf;           // VLF currently being written
}

// When current_vlf reaches active_vlf_start, writers must wait!
// Solution: checkpoint to advance active_vlf_start

// 2. SEGMENTED LOG (PostgreSQL)
//    New segments created as needed
//    Old segments deleted after archival + checkpoint

function manageLogSegments():
    for segment in log_segments:
        if segment.max_lsn < last_checkpoint_lsn
           and segment.archived
           and segment.replicated_to_all_standbys:
            if num_segments > min_segments:
                delete(segment)
            else:
                recycle(segment)   // Keep the file for reuse

// 3. AUTOMATIC GROWTH (with limits)
//    Log grows as needed up to a configured maximum
//    Alert/block when approaching the limit

max_log_size = 100GB
current_log_size = getLogSize()

if current_log_size > max_log_size * 0.8:
    alert("Log approaching size limit")
    trigger_checkpoint()

if current_log_size >= max_log_size:
    // Block new transactions until space is available
    block_new_transactions()

// 4. MANUAL TRUNCATION
//    DBA explicitly backs up the log, which marks its space reusable
//    (Common with the SQL Server full recovery model)

BACKUP LOG database_name TO DISK = 'backup.trn'
-- After the log backup, log space is marked for reuse
```

A single long-running transaction can prevent log truncation indefinitely, causing log files to consume all available disk space. This is one of the most common production database issues. Monitor for long-running transactions and set timeout policies.
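For the monitoring advice above, here is a sketch against PostgreSQL's pg_stat_activity view using psycopg2; the connection string and the ten-minute threshold are placeholders to adapt to your environment.

```python
# Sketch: flag transactions that have been open longer than a threshold,
# since they can pin the log and block truncation. Queries PostgreSQL's
# pg_stat_activity view via psycopg2; the DSN and threshold are placeholders.

import psycopg2

LONG_TXN_THRESHOLD = "10 minutes"

QUERY = """
    SELECT pid, usename, state, now() - xact_start AS txn_age
    FROM pg_stat_activity
    WHERE xact_start IS NOT NULL
      AND now() - xact_start > %s::interval
    ORDER BY xact_start
"""


def find_long_transactions(dsn: str = "dbname=mydb"):
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(QUERY, (LONG_TXN_THRESHOLD,))
            return cur.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for pid, user, state, age in find_long_transactions():
        print(f"pid={pid} user={user} state={state} open for {age}")
```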
Log storage hardware decisions significantly impact both performance and reliability. Unlike data files that benefit from SSDs for random reads, log writes have unique characteristics that inform hardware choices:
Log I/O Characteristics:

Log writes are sequential and append-only, arrive as many small, latency-sensitive flushes on the commit path, and are rarely read back outside recovery. This makes write latency, not random-read IOPS, the figure of merit when choosing storage:
| Option | Advantages | Disadvantages | Best For |
|---|---|---|---|
| Enterprise SSD | Low latency; high IOPS; durable | Higher cost per GB | High-throughput OLTP |
| NVMe SSD | Extremely low latency; parallel queues | Premium cost | Ultra-low-latency requirements |
| Enterprise HDD with BBU | Lower cost; battery protects cache | Higher latency than SSD | Cost-sensitive; large logs |
| RAID-10 HDD | Redundancy + reasonable performance | Latency varies | Reliability-focused |
| Separate Log Volume | Isolates log I/O from data I/O | Additional hardware | Avoiding I/O contention |
Key Hardware Recommendations:
1. Separate Log Storage from Data Storage
Log writes and data reads/writes have different I/O patterns. Placing them on the same storage causes contention: random data I/O breaks up the log's sequential write stream, and heavy data traffic queues ahead of latency-critical commit flushes.
Isolating log I/O to a dedicated volume or drive array prevents data I/O from delaying commits.
2. Use Enterprise-Grade Storage
Consumer-grade SSDs and HDDs may acknowledge writes that are still sitting in a volatile on-device cache, or silently ignore flush commands, so "durable" log records can vanish on power failure.
Enterprise storage includes capacitors or batteries to flush caches on power loss.
3. Consider RAID for Redundancy
RAID-1 (mirroring) or RAID-10 for log files provides protection against a single-drive failure without giving up sequential write performance, so the log survives the loss of any one disk.
In cloud environments, use provisioned IOPS storage for logs (AWS io1/io2, Azure Premium SSD). Standard or GP storage may have variable latency that causes commit time spikes. Also consider storage-optimized instances with local NVMe for lowest latency.
Beyond hardware, several software and configuration optimizations can dramatically improve log storage performance:
```
-- PostgreSQL log performance settings

-- Log buffer size (default 16MB, increase for high throughput)
wal_buffers = 64MB

-- Group commit delay (microseconds to wait for other commits)
commit_delay = 10        -- Wait up to 10us if ≥5 concurrent txns
commit_siblings = 5      -- Only delay if this many other active txns

-- Retain recent segments
wal_keep_size = 1GB      -- Keep at least this much recent WAL available

-- Async commit (faster but risks up to 3 * wal_writer_delay of data loss)
-- synchronous_commit = off    -- NOT recommended for important data

-- MySQL/InnoDB log performance settings

-- Log buffer size
innodb_log_buffer_size = 64M

-- Log file size (larger = fewer switches, longer recovery)
innodb_log_file_size = 1G

-- Flush behavior
innodb_flush_log_at_trx_commit = 1   -- Full durability (recommended)
-- = 0: Flush every second (data loss risk)
-- = 2: Flush to OS cache per commit (faster, some risk)

-- SQL Server log performance settings

-- Use a dedicated drive for the log
ALTER DATABASE mydb MODIFY FILE
    (NAME = mydb_log, FILENAME = 'L:\logs\mydb.ldf')

-- Instant file initialization (faster file growth)
-- Requires the 'Perform Volume Maintenance Tasks' privilege
-- Note: historically applies to data files; log files are zero-initialized on growth
```

Asynchronous commit can improve throughput 10x or more, but at a risk: the last few milliseconds of transactions may be lost on crash. Use it only for data that can tolerate loss (sessions, caches) or where application-level idempotency handles retry.
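If you do relax durability, PostgreSQL allows synchronous_commit to be set per transaction rather than instance-wide, so only explicitly chosen writes take the risk. A sketch using psycopg2, with a placeholder DSN and table:

```python
# Sketch: relax durability for one low-value transaction only, instead of
# setting synchronous_commit = off instance-wide. PostgreSQL honors the
# setting per transaction; the DSN and table name are placeholders.

import psycopg2


def record_page_view(dsn: str, session_id: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn:                    # Commits on success, rolls back on error
            with conn.cursor() as cur:
                cur.execute("SET LOCAL synchronous_commit TO OFF")  # This txn only
                cur.execute(
                    "INSERT INTO page_views (session_id, viewed_at) VALUES (%s, now())",
                    (session_id,),
                )
    finally:
        conn.close()
    # The commit returns without waiting for the WAL flush: a few milliseconds
    # of loss window traded for lower latency, for this transaction only.
```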
We've examined the physical infrastructure that makes transaction logging reliable and performant: segmented log files addressed by LSN, an in-memory log buffer with group commit, durability forced through every cache layer, continuous archival for point-in-time recovery, checkpoint-driven truncation, and dedicated low-latency storage.
What's Next:
Now that we understand how logs are stored and managed, we'll complete our study of log-based recovery by examining why logs are so important—the critical role they play not just in crash recovery but in the broader database ecosystem including replication, auditing, and change data capture.
You now understand the physical infrastructure of log storage—how logs are organized, buffered, archived, and truncated. Next, we'll explore the broader importance of logs beyond just crash recovery.