Every transaction that modifies your database leaves a trace in the Write-Ahead Log (WAL). Under normal operation, these logs are temporary—the database recycles old log segments once checkpoint processing confirms the changes are safely written to data files. But for Point-in-Time Recovery, those 'temporary' logs are the only bridge between your last backup and any recovery target.
Log archiving is the process of preserving that bridge.
Without proper log archiving, PITR capability exists only in theory. The moment a log segment is recycled before archiving, a gap appears in your recovery timeline—a permanent blind spot where the database's history is lost forever.
This page examines log archiving comprehensively: how it works, what can go wrong, how to monitor it, and the operational practices that ensure your PITR capability remains intact through the daily operation of your database systems.
Log archiving is the single point of failure for PITR. A missed archive, a corrupted segment, or a gap in the log sequence can render days of PITR capability useless. This page will teach you how to prevent, detect, and respond to archiving failures.
Before we can properly configure log archiving, we must understand the complete lifecycle of Write-Ahead Log segments. This lifecycle determines when archiving must occur and what happens if it doesn't.
WAL segments progress through several distinct stages:
1. Creation (Allocation): when the current WAL segment fills up, the database allocates a new segment, either by creating a fresh file or by reusing a recycled one.
2. Active Writing: the segment receives new log records as transactions execute; the wal_sync_method parameter controls the durability guarantees for these writes.
3. Full: when the segment reaches capacity, it is closed and no further records are written to it.
4. Archival Pending: the complete segment waits for archive processing; in PostgreSQL it is flagged with a .ready marker until the archive command succeeds.
5. Archived and Retained: after successful archival, the segment is kept until checkpoint processing confirms it is no longer needed for crash recovery.
6. Recycled: the physical file is reassigned a new name and reused for a future segment (or removed outright).
If archive storage fills up or the archive command fails repeatedly, the database faces a dilemma: stop accepting new transactions (preserving logs but halting operations) or recycle unarchived segments (continuing operations but losing PITR capability). PostgreSQL takes the first path: with archive_mode enabled it never recycles a segment that has not been archived, so persistent failures cause WAL to accumulate until the disk fills and the server halts.
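On PostgreSQL, the archival queue is visible on disk: each completed segment gets a .ready marker file in pg_wal/archive_status, which is renamed to .done once the archive command reports success. A quick way to see whether archiving is keeping up (a sketch; paths assume a standard data directory layout):

# Segments waiting for archival vs. already archived (PostgreSQL)
ls "$PGDATA/pg_wal/archive_status/"
# 000000010000000A00000041.done    -> archived successfully
# 000000010000000A00000042.ready   -> still waiting for archive_command
# A growing pile of .ready files means the archiver is falling behind.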
Proper archive configuration is foundational to reliable PITR. The specific parameters vary by database system, but the concepts are universal. We'll use PostgreSQL's terminology as a reference model, then map to other systems.
archive_mode
Controls whether WAL archiving is enabled:
off: No archiving (default)
on: Enable archiving
always: Enable archiving even on standby servers
-- Enable archiving (requires restart)
alter system set archive_mode = 'on';
archive_command
The shell command executed to archive each WAL segment:
%p — Path to the WAL segment to archive
%f — Filename of the WAL segment
-- Basic local archive
alter system set archive_command =
'cp %p /archive/wal/%f';
-- Archive to S3 with compression
alter system set archive_command =
'gzip -c %p | aws s3 cp - s3://db-archive/wal/%f.gz';
-- Archive with verification
alter system set archive_command =
'test ! -f /archive/wal/%f && cp %p /archive/wal/%f';
archive_timeout
Forces an archive switch even if the current segment isn't full:
-- Archive at least every 5 minutes
alter system set archive_timeout = '300s';
| Concept | PostgreSQL | MySQL/MariaDB | Oracle | SQL Server |
|---|---|---|---|---|
| Enable Archiving | archive_mode = on | log_bin = ON | ALTER DATABASE ARCHIVELOG | Full/Bulk-Logged Recovery |
| Archive Command | archive_command | N/A (binlog rotation) | LOG_ARCHIVE_DEST_n | BACKUP LOG (scheduled log backup) |
| Archive Timeout | archive_timeout | FLUSH BINARY LOGS (scheduled) | ARCHIVE_LAG_TARGET | Log backup frequency |
| Archive Location | Command parameter | binlog directory | LOG_ARCHIVE_DEST | Backup destination |
The archive command must satisfy specific requirements to ensure reliable operation:
1. Atomic Success/Failure The command must exit with status 0 only if the archive succeeded completely. Any non-zero exit prevents recycling of the segment.
2. Idempotency The command must succeed if called multiple times for the same file (the segment may already exist in the archive from a previous interrupted attempt).
3. Verification The command should verify the archived copy is complete and uncorrupted before exiting successfully.
4. Destination Check The command should fail if the destination is unreachable, full, or experiencing issues.
5. No Modification of Source The command must not modify or delete the source WAL file.
6. Reasonable Timeout The command should complete in reasonable time (usually seconds). Long-running commands block archival of subsequent segments.
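One simple way to enforce such a bound is to wrap the command with the coreutils timeout utility. A sketch, assuming the archiving script shown next is installed as /usr/local/bin/archive_wal.sh:

-- Fail the archive attempt if it runs longer than 60 seconds;
-- the non-zero exit makes the archiver retry the segment later
alter system set archive_command =
  'timeout 60 /usr/local/bin/archive_wal.sh %p %f';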
#!/bin/bash
# Robust WAL archiving script with verification
# Usage: archive_wal.sh <source_path> <filename>

SOURCE_PATH="$1"
FILENAME="$2"
ARCHIVE_DIR="/archive/wal"
S3_BUCKET="s3://db-archive/wal"

# Function to log with timestamp
log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" >> /var/log/wal_archive.log
}

# Check source file exists
if [ ! -f "$SOURCE_PATH" ]; then
    log "ERROR: Source file not found: $SOURCE_PATH"
    exit 1
fi

# Calculate source checksum
SOURCE_CHECKSUM=$(sha256sum "$SOURCE_PATH" | awk '{print $1}')

# Archive to local storage first (fast, reliable)
LOCAL_DEST="$ARCHIVE_DIR/$FILENAME"
if [ -f "$LOCAL_DEST" ]; then
    # File exists - verify it matches (idempotency)
    EXISTING_CHECKSUM=$(sha256sum "$LOCAL_DEST" | awk '{print $1}')
    if [ "$SOURCE_CHECKSUM" = "$EXISTING_CHECKSUM" ]; then
        log "INFO: $FILENAME already archived correctly (idempotent success)"
        exit 0
    else
        log "ERROR: $FILENAME exists but checksum mismatch!"
        exit 1
    fi
fi

# Copy to local archive with fsync
cp "$SOURCE_PATH" "$LOCAL_DEST.tmp" && sync "$LOCAL_DEST.tmp"
if [ $? -ne 0 ]; then
    log "ERROR: Failed to copy $FILENAME to local archive"
    rm -f "$LOCAL_DEST.tmp"
    exit 1
fi

# Verify local copy
LOCAL_CHECKSUM=$(sha256sum "$LOCAL_DEST.tmp" | awk '{print $1}')
if [ "$SOURCE_CHECKSUM" != "$LOCAL_CHECKSUM" ]; then
    log "ERROR: Checksum mismatch after local copy for $FILENAME"
    rm -f "$LOCAL_DEST.tmp"
    exit 1
fi

# Atomic rename to final location
mv "$LOCAL_DEST.tmp" "$LOCAL_DEST"
if [ $? -ne 0 ]; then
    log "ERROR: Failed to rename $FILENAME in local archive"
    exit 1
fi

log "INFO: Successfully archived $FILENAME to local storage"

# Async upload to S3 (non-blocking for archive command)
# This is handled by a separate process that monitors local archive
nohup aws s3 cp "$LOCAL_DEST" "$S3_BUCKET/$FILENAME" >> /var/log/s3_upload.log 2>&1 &

exit 0

The choice of archive storage profoundly impacts PITR reliability, recovery speed, and total cost of ownership. A well-designed archive strategy typically combines multiple storage tiers to balance these factors.
Enterprise deployments commonly implement multiple storage tiers:
Hot Tier (Recent Archives)
Warm Tier (Recent History)
Cold Tier (Long-term Archive)
Offsite Tier (Disaster Recovery)
| Tier | Typical Storage | Retention | Access Time | Cost/GB/Month |
|---|---|---|---|---|
| Hot | Local NVMe/SSD | 24-72 hours | < 10ms | $0.15-0.30 |
| Warm | Object Storage (Standard) | 7-30 days | 50-200ms | $0.02-0.03 |
| Cold | Object Storage (IA) | 30-365 days | 1-5 seconds | $0.01-0.015 |
| Archive | Glacier/Deep Archive | Years | Minutes-Hours | $0.001-0.004 |
| Offsite DR | Cross-region replicated | Mirrors warm | Variable | 2x base cost |
WAL files compress effectively—typical compression ratios range from 3:1 to 10:1 depending on workload. Compression strategies must balance storage savings against recovery speed:
Inline Compression
Compress during archival:
archive_command = 'gzip -c %p > /archive/%f.gz'
Deferred Compression
Archive uncompressed, compress later:
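A minimal sketch of the deferred approach, assuming the archive command writes uncompressed segments to /archive/wal and a periodic cron job compresses anything older than a few hours (the restore_command must then handle both compressed and uncompressed files):

# Deferred compression: compress archived segments older than 6 hours,
# skipping files still being written by the archive script
find /archive/wal -type f -name '0000*' ! -name '*.gz' ! -name '*.tmp' \
    -mmin +360 -exec gzip -9 {} \;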
Compression Algorithm Selection
| Algorithm | Speed | Ratio | Best For |
|---|---|---|---|
| gzip -1 | Fast | 3-4x | Balance speed/ratio |
| gzip -9 | Slow | 4-5x | Cold tier storage |
| lz4 | Very Fast | 2-3x | Hot tier, fast recovery |
| zstd | Fast | 4-6x | Modern systems, best balance |
| xz | Very Slow | 6-8x | Long-term archive only |
For most modern deployments, zstd (Zstandard) offers the best balance of compression ratio and speed: it matches or exceeds gzip's ratio while operating at speeds much closer to lz4. PostgreSQL 15 and later can use zstd natively for WAL compression and base backups.
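A zstd-based archive command might look like the following (a sketch; zstd keeps the source file by default, which the archiver requires, and it refuses to overwrite an existing destination, which acts as a crude idempotency guard):

-- Compress each segment with zstd during archival (level 3)
alter system set archive_command =
  'zstd -q -3 %p -o /archive/wal/%f.zst';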
Archived WAL files contain complete transaction data, including every inserted, updated, and deleted value, so the archive needs at least the same protection as the database itself.
Encryption Approaches:
1. Storage-Level Encryption: encrypt the archive volume or bucket at rest (for example, filesystem encryption or S3 server-side encryption).
2. Client-Side Encryption: encrypt each segment before it leaves the database host, for example with GPG as part of the archive command:
gpg -e -r backup@company.com %f
3. Database-Native Encryption: rely on the database's own at-rest encryption (such as transparent data encryption, where it covers the WAL).
Key Management Requirements: keys must be stored separately from the archives, retained at least as long as the archives themselves, and accessible during disaster recovery; losing the key is equivalent to losing the archive.
Traditional segment-based archiving introduces latency between transaction commit and archival. For systems requiring near-zero RPO, continuous archiving mechanisms stream changes in real-time.
Segment-Based Archiving
The traditional model archives complete segments:
RPO Implication: Maximum data loss = segment size worth of transactions
With a 16MB segment and a high-throughput database, this might be only seconds of exposure; with a low-activity database, a segment might take hours to fill, which is why archive_timeout (or its equivalent) is used to bound the window.
Continuous (Streaming) Archiving
An alternative approach streams WAL records continuously rather than waiting for a segment to fill:
Implementation Methods: either a custom streaming receiver (sketched below) or the database's native streaming client, such as PostgreSQL's pg_receivewal.
#!/usr/bin/env python3
"""
Continuous WAL Archive Receiver
Receives streaming WAL records and persists to durable storage
"""

import asyncio
import hashlib
import os
from datetime import datetime
from pathlib import Path
from dataclasses import dataclass
from typing import Optional

import aiofiles
import boto3


@dataclass
class WALRecord:
    """Represents a single WAL record from the stream"""
    lsn: str
    timeline: int
    xid: Optional[int]
    timestamp: datetime
    data: bytes

    @property
    def checksum(self) -> str:
        return hashlib.sha256(self.data).hexdigest()[:16]


class StreamingArchiveReceiver:
    """
    Receives continuous WAL stream and archives to multiple storage
    tiers with configurable durability
    """

    def __init__(
        self,
        local_archive_path: Path,
        s3_bucket: str,
        buffer_size: int = 1024 * 1024,  # 1MB buffer
        flush_interval_seconds: float = 1.0,
        sync_to_s3_interval: int = 60
    ):
        self.local_path = local_archive_path
        self.s3_bucket = s3_bucket
        self.buffer_size = buffer_size
        self.flush_interval = flush_interval_seconds
        self.s3_sync_interval = sync_to_s3_interval
        self.buffer: list[WALRecord] = []
        self.buffer_bytes = 0
        self.last_flushed_lsn: Optional[str] = None
        self.s3_client = boto3.client('s3')

    async def receive_record(self, record: WALRecord):
        """Process incoming WAL record from the stream"""
        self.buffer.append(record)
        self.buffer_bytes += len(record.data)

        # Flush buffer when it reaches size threshold
        if self.buffer_bytes >= self.buffer_size:
            await self.flush_buffer()

    async def flush_buffer(self):
        """Persist buffered records to local storage"""
        if not self.buffer:
            return

        # Determine filename based on LSN range
        first_lsn = self.buffer[0].lsn
        last_lsn = self.buffer[-1].lsn
        timestamp = datetime.utcnow().strftime('%Y%m%d_%H%M%S')
        filename = f"wal_stream_{first_lsn}_{last_lsn}_{timestamp}.wal"
        filepath = self.local_path / filename

        # Write with fsync for durability
        async with aiofiles.open(filepath, 'wb') as f:
            for record in self.buffer:
                # Write record with length prefix and checksum
                header = f"{len(record.data)}:{record.checksum}:".encode()
                await f.write(header + record.data + b'\n')
            await f.flush()
            # Force sync to disk
            os.fsync(f.fileno())

        # Update tracking
        self.last_flushed_lsn = last_lsn
        self.buffer.clear()
        self.buffer_bytes = 0
        print(f"Flushed WAL records to {filename}, "
              f"last LSN: {last_lsn}")

    async def run_periodic_tasks(self):
        """Background tasks for time-based flushing and S3 sync"""
        while True:
            await asyncio.sleep(self.flush_interval)
            # Time-based buffer flush
            if self.buffer:
                await self.flush_buffer()

PostgreSQL provides pg_receivewal for streaming archive reception:
# One-time setup: create a replication slot so the primary retains any WAL
# the receiver has not yet consumed
pg_receivewal -h primary-db -U replication --create-slot --slot=archive_receiver
# Basic streaming archive (flushes each record to disk as it arrives)
pg_receivewal -h primary-db -U replication -D /archive/wal \
    --slot=archive_receiver --synchronous
# With compression (gzip shown; lz4 is also supported on recent versions)
pg_receivewal -h primary-db -U replication -D /archive/wal \
    --slot=archive_receiver --synchronous --compress=gzip:5
Advantages of pg_receivewal: WAL is captured as it is generated rather than one segment at a time, and with --synchronous the RPO approaches zero; the replication slot guarantees the primary keeps any WAL the receiver has not yet flushed.
Considerations: the receiver is another critical service that must be monitored, an abandoned replication slot will cause WAL to pile up on the primary, and the segment currently being streamed exists only as a .partial file that recovery tooling must account for.
Archive failures can silently accumulate, creating PITR gaps that aren't discovered until a recovery is attempted. Proactive monitoring is essential for maintaining PITR capability.
1. Archive Lag
The delay between WAL generation and archival:
-- PostgreSQL: Check archive lag
SELECT
    pg_walfile_name(pg_current_wal_lsn()) AS current_wal,
    last_archived_wal,
    last_archived_time,
    now() - last_archived_time AS time_since_last_archive
FROM pg_stat_archiver;
Comparing current_wal against last_archived_wal shows how many segments the archiver is behind; with the default 16MB segment size, each segment of lag represents up to 16MB of unarchived WAL.
Alert Thresholds: warn at around 10 unarchived segments or 5 minutes since the last successful archive; treat 50 segments or 30 minutes as critical (see the consolidated table below).
2. Archive Failure Count
-- PostgreSQL: Check recent archive failures
SELECT
failed_count,
last_failed_wal,
last_failed_time
FROM pg_stat_archiver;
Alert Thresholds: any archive failure within the last hour warrants investigation; a failure count that keeps increasing is critical.
3. Archive Storage Capacity
Monitor archive destination for space:
# Local archive storage
df -h /archive/wal
# S3 bucket size (via CloudWatch or CLI)
aws s3 ls s3://db-archive/wal/ --summarize
Alert Thresholds: warn when less than 20% of archive storage remains free; below 5% is critical.
| Metric | Description | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Archive Lag (segments) | Unarchived complete segments | 10 segments | 50 segments |
| Archive Lag (time) | Time since last successful archive | 5 minutes | 30 minutes |
| Failed Archive Count | Cumulative archive command failures | > 0 in the last hour | Count still increasing |
| Archive Storage Free | Available space at destination | < 20% | < 5% |
| Archive Completeness | No gaps in archive sequence | N/A | Any gap detected |
| Archive Command Duration | Time per archive operation | 30 seconds | 120 seconds |
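These thresholds are easy to wire into a cron-based check. A minimal sketch, assuming local psql access to the monitored instance and the warning thresholds above:

#!/bin/bash
# Minimal archiver health check: warn on failures or a stalled archiver
MAX_MINUTES=5   # time-based warning threshold from the table above

read -r FAILED MINUTES <<EOF
$(psql -At -F ' ' -c "
    SELECT failed_count,
           coalesce((extract(epoch FROM now() - last_archived_time) / 60)::int, 9999)
    FROM pg_stat_archiver;")
EOF

if [ "${FAILED:-0}" -gt 0 ]; then
    echo "WARNING: archive_command has failed $FAILED times since stats reset"
fi
if [ "${MINUTES:-9999}" -gt "$MAX_MINUTES" ]; then
    echo "WARNING: no successful archive in the last $MINUTES minutes"
fi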
Beyond monitoring active archival, periodic verification ensures archive integrity:
1. Sequence Continuity Check
Verify no gaps exist in the WAL sequence:
#!/bin/bash
# Verify WAL archive sequence continuity by computing each expected successor.
# Assumes the default 16MB segment size (256 segments per log file).
ARCHIVE_DIR="/archive/wal"
SEGS_PER_LOG=256
next_segment() {
    # Derive the expected next segment name from a 24-hex-character name
    local name="$1"
    local tli=${name:0:8}
    local log=$((16#${name:8:8}))
    local seg=$((16#${name:16:8}))
    seg=$((seg + 1))
    if [ "$seg" -ge "$SEGS_PER_LOG" ]; then seg=0; log=$((log + 1)); fi
    printf '%s%08X%08X' "$tli" "$log" "$seg"
}
prev_segment=""
for file in $(ls -1 "$ARCHIVE_DIR"/0000*.gz | sort); do
    segment=$(basename "$file" .gz)
    [ ${#segment} -eq 24 ] || continue   # skip .backup/.history files
    if [ -n "$prev_segment" ]; then
        expected=$(next_segment "$prev_segment")
        if [ "$segment" != "$expected" ]; then
            echo "GAP DETECTED: Expected $expected after $prev_segment, found $segment"
            exit 1
        fi
    fi
    prev_segment=$segment
done
echo "Archive sequence verified: no gaps"
2. Archive Checksum Verification
# Verify archive checksums
for file in /archive/wal/*.gz; do
gunzip -t "$file" 2>/dev/null || echo "CORRUPT: $file"
done
3. Sample Restore Testing
Periodic end-to-end recovery tests:
The most reliable way to ensure PITR works is to actually perform recoveries regularly. Automate weekly or monthly recovery tests to a non-production environment. This validates the complete recovery chain: base backup + archives + recovery process.
Archive retention determines how far back in time PITR can reach. Retention policies must balance recovery capability against storage costs and compliance requirements.
Recovery Window Retention
Archives needed to support PITR within the operational recovery window.
Compliance Retention
Archives required by regulatory frameworks.
Base Backup Correlation
Archives must be retained at least as long as there's a corresponding base backup:
#!/bin/bash
# WAL archive retention automation script
ARCHIVE_DIR="/archive/wal"
HOT_RETENTION_DAYS=3
WARM_RETENTION_DAYS=30
COLD_RETENTION_DAYS=365
# Get current date for comparison
NOW=$(date +%s)
for archive_file in "$ARCHIVE_DIR"/*.gz; do
    [ -f "$archive_file" ] || continue
    # Get file modification time
    FILE_TIME=$(stat -c %Y "$archive_file")
    AGE_DAYS=$(( (NOW - FILE_TIME) / 86400 ))
    # Determine action based on age
    if [ $AGE_DAYS -gt $COLD_RETENTION_DAYS ]; then
        # Beyond cold retention: delete the local copy
        echo "Deleting expired archive: $archive_file (age: $AGE_DAYS days)"
        rm "$archive_file"
    elif [ $AGE_DAYS -gt $WARM_RETENTION_DAYS ]; then
        # Copy to cold tier (Glacier) if not already there
        BASENAME=$(basename "$archive_file")
        if ! aws s3api head-object --bucket db-archive-cold \
            --key "wal/$BASENAME" >/dev/null 2>&1; then
            echo "Moving to cold tier: $archive_file"
            aws s3 cp "$archive_file" "s3://db-archive-cold/wal/$BASENAME" \
                --storage-class GLACIER
        fi
    elif [ $AGE_DAYS -gt $HOT_RETENTION_DAYS ]; then
        # Copy to warm tier (S3 Standard-IA) if not already there
        BASENAME=$(basename "$archive_file")
        if ! aws s3api head-object --bucket db-archive-warm \
            --key "wal/$BASENAME" >/dev/null 2>&1; then
            echo "Moving to warm tier: $archive_file"
            aws s3 cp "$archive_file" "s3://db-archive-warm/wal/$BASENAME" \
                --storage-class STANDARD_IA
        fi
    fi
done
The cardinal rule of archive retention: never delete WAL archives until the oldest base backup depending on them has also been deleted. The sequence must be: delete old base backup → delete WAL archives that only that backup needed → never the reverse.
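PostgreSQL ships a helper, pg_archivecleanup, that enforces exactly this rule: given the backup history file (or first required segment) of the oldest base backup you intend to keep, it deletes every older WAL file in the archive. A sketch, assuming backup history (*.backup) files are archived alongside the WAL:

# Remove WAL files that predate the oldest retained base backup
# (add "-x .gz" if the archived segments are gzip-compressed)
OLDEST_BACKUP_FILE=$(ls -1 /archive/wal/*.backup 2>/dev/null | sort | head -1)
if [ -n "$OLDEST_BACKUP_FILE" ]; then
    pg_archivecleanup /archive/wal "$(basename "$OLDEST_BACKUP_FILE")"
fi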
Even well-configured archive systems encounter problems. Rapid diagnosis and resolution is critical to minimizing PITR gaps.
Problem: Archive Command Hangs
Symptoms: last_archived_wal stops advancing, .ready files accumulate in pg_wal/archive_status, and pg_wal disk usage grows steadily.
Diagnosis:
-- Check archiver status
SELECT * FROM pg_stat_archiver;
# Check PostgreSQL logs
tail -100 /var/log/postgresql/postgresql-*.log | grep -i archive
Common Causes: an unreachable or hung network destination (NFS mount, S3 endpoint), a command waiting for input or credentials, or a destination that is full or heavily throttled.
Resolution:
# Identify and terminate the stuck archive command process
pkill -9 -f 'archive_wal.sh'    # match whatever your archive_command invokes
# PostgreSQL's archiver retries the segment automatically; confirm progress with:
#   SELECT * FROM pg_stat_archiver;
Problem: Archive Destination Full
Symptoms: the archive command starts failing on every segment, failed_count climbs in pg_stat_archiver, and unarchived WAL accumulates on the primary.
Immediate Actions:
# Emergency space cleanup (carefully!)
# Only delete archives older than oldest needed backup
find /archive/wal -mtime +30 -name "*.gz" -delete
Problem: Archive Gap Discovered
A gap in the archive sequence renders PITR impossible across that gap.
Discovery:
# List the archive in order and inspect for missing segment names
ls -1 /archive/wal/ | sort
# The sequence-continuity script from the verification section above
# automates this by computing each expected successor name.
Recovery Options: PITR remains possible up to the last archived segment before the gap, and from any base backup taken after the gap; the history in between cannot be recovered. The most important immediate action is to take a fresh base backup so that full PITR capability resumes from that point forward.
Log archiving is the often-overlooked foundation of PITR capability. The key points to carry forward: archive every segment exactly once with a command that fails loudly, monitor the archiver so problems surface in minutes rather than at recovery time, tier and retain archives in step with your base backups, and verify the chain regularly by actually restoring from it.
What's next:
With the archive foundation established, we'll examine the Recovery Process itself. The next page walks through the complete PITR recovery workflow—from incident detection through recovery target selection, the actual recovery procedure, and post-recovery validation steps that ensure a successful restoration.
You now understand the critical role of log archiving in PITR. You've learned how WAL segments progress through their lifecycle, how to configure robust archive commands, storage tiering strategies, monitoring requirements, and retention policies. Next, we'll explore how these archived logs are used during the actual recovery process.