In the modern digital economy, where databases power everything from global financial systems to real-time e-commerce platforms, the notion of shutting down a database for backup has become increasingly untenable. Online backup, also known as hot backup, represents the gold standard in enterprise data protection—enabling organizations to capture consistent, recoverable database snapshots while the system continues to serve thousands or millions of concurrent users without interruption.
This capability isn't merely a convenience; it's a fundamental business requirement. Consider a global banking system processing 10,000 transactions per second, an e-commerce platform during peak holiday shopping, or a healthcare system where milliseconds can impact patient care. For these systems, even a 30-minute maintenance window for backup represents millions of dollars in lost revenue, severe customer impact, or potentially life-threatening service interruptions.
By the end of this page, you will understand the fundamental principles of online backup, master the techniques used to capture consistent snapshots from active databases, recognize the challenges and trade-offs involved, and be able to design online backup strategies for production database systems.
Defining Online Backup
An online backup (or hot backup) is a backup operation performed while the database remains fully operational, accepting reads and writes from connected applications. Unlike cold backups that require database shutdown, online backups leverage sophisticated mechanisms to capture a consistent point-in-time snapshot of data without blocking production workloads.
The Fundamental Challenge
To appreciate what online backup accomplishes, consider what happens when you simply copy database files from a running database:

- Pages can be caught mid-write, leaving torn, corrupt pages in the copy
- In-flight transactions are captured partially, breaking transaction integrity
- Files copied minutes apart reflect different moments in time, so the set as a whole is internally inconsistent
Online backup mechanisms must solve all these problems while minimizing impact on production performance—a technically demanding proposition.
| Aspect | Simple File Copy (Running DB) | Online Backup |
|---|---|---|
| Consistency | Inconsistent—captures mixed states | Point-in-time consistent |
| Transaction Integrity | Broken—partial transactions captured | Preserved—complete transactions only |
| Corruption Risk | High—partial page writes | None—complete pages only |
| Recovery Guarantee | Uncertain—may be unusable | Guaranteed—tested recovery path |
| Application Availability | Unchanged but backup is worthless | Unchanged and backup is valid |
Many junior DBAs have learned the hard way that copying database files from a running system doesn't produce a valid backup. The backup might appear complete, but when recovery is attempted during an actual disaster, the corrupted or inconsistent state renders it unusable. This is why proper online backup mechanisms are essential—not optional.
Online backup relies on several foundational database mechanisms working in concert. Understanding these mechanisms is crucial for both implementing and troubleshooting online backup systems.
Write-Ahead Logging (WAL)
The cornerstone of online backup is the Write-Ahead Log (also called the redo log or transaction log). Before any data modification is written to the actual data files, it is first recorded in the WAL. This principle ensures:

- Durability: a committed change survives a crash because its log record reaches stable storage first
- Recoverability: data files can always be brought to a consistent state by replaying the log
- Backup consistency: a file copy taken while pages are changing can be repaired by replaying the WAL generated during the copy
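As a concrete illustration, here is a minimal sketch of enabling WAL archiving in PostgreSQL so that completed segments are preserved for backup and recovery. The paths and version number are illustrative, not prescriptive:

```bash
# Append archiving settings to postgresql.conf (paths illustrative)
cat >> /etc/postgresql/14/main/postgresql.conf <<'EOF'
wal_level = replica                        # enough WAL detail for physical backup
archive_mode = on                          # archive completed WAL segments
archive_command = 'cp %p /archive/wal/%f'  # %p = segment path, %f = file name
EOF

# archive_mode requires a restart to take effect
sudo systemctl restart postgresql
```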
Checkpoint Mechanism
Checkpoints periodically flush modified pages from memory to disk and record a consistency point in the WAL. During online backup:

- A checkpoint is taken (or requested) at backup start, so the data files on disk correspond to a known WAL position
- That WAL position is recorded with the backup as the point from which recovery must begin replaying
- Checkpoint aggressiveness is tunable: a fast checkpoint starts the backup sooner, while a spread checkpoint smooths the I/O cost
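A small PostgreSQL-flavored sketch of this idea: force a checkpoint, then read back the recorded positions from the control file (pg_control_checkpoint() is available in PostgreSQL 10+):

```bash
# Flush dirty pages to disk right now
sudo -u postgres psql -c "CHECKPOINT;"

# Inspect the checkpoint and redo positions recovery would start from
sudo -u postgres psql -c \
  "SELECT checkpoint_lsn, redo_lsn FROM pg_control_checkpoint();"
```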
Buffer Pool Coordination
The buffer pool (or buffer cache) holds frequently accessed data pages in memory. During online backup:

- Recently modified (dirty) pages may exist only in memory, so the files on disk lag the true database state
- The backup copies whatever is on disk; WAL replay at restore time brings those stale pages forward
- The backup's large sequential reads can evict hot pages from the cache, which is one source of backup-time performance impact
Page-Level Consistency
Modern databases ensure page-level atomicity—a page is either completely written or not written at all (using techniques like double-write buffers or torn page detection). Online backups leverage this:

- A page caught mid-write during the copy (a torn or fractured page) is detected via checksums or torn-page flags
- At recovery time the damaged page is repaired from a full-page image in the WAL or from the double-write buffer
- PostgreSQL's full_page_writes and InnoDB's doublewrite buffer are the canonical implementations
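To see these protections on a live system, the following sketch checks the relevant settings in PostgreSQL and MySQL (connection details omitted for brevity):

```bash
# PostgreSQL: full-page images are written on first change after a checkpoint
sudo -u postgres psql -c "SHOW full_page_writes;"

# MySQL/InnoDB: the doublewrite buffer guards against torn pages
mysql -e "SHOW VARIABLES LIKE 'innodb_doublewrite';"
```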
Different online backup strategies offer varying trade-offs between complexity, performance impact, and recovery flexibility. Understanding each approach enables selection of the optimal strategy for specific requirements.
Physical Online Backup
Physical backups copy the raw database files (data files, control files, WAL segments) at the file system level. This is the most common enterprise approach:
Advantages:

- Backup and restore run at raw file-copy speed, with no SQL-layer overhead
- Byte-for-byte fidelity, including indexes and internal structures
- Combined with WAL archiving, provides the foundation for point-in-time recovery

Considerations:

- Tied to the same database version, platform, and usually the same architecture
- Captures everything; selectively restoring a single table or schema is difficult
- Requires WAL/redo coordination to yield a consistent result
Logical Online Backup
Logical backups export database objects as SQL statements or structured data formats (like pg_dump or mysqldump):
Advantages:

- Portable across database versions, platforms, and even different engines
- Output is human-readable, and individual objects can be restored selectively

Considerations:

- Much slower than physical copies on large datasets; restore must replay SQL and rebuild every index
- The export holds a long-running snapshot (and sometimes locks), which can interact with vacuum and DDL on busy systems
Filesystem Snapshot Backup
Leverage storage-level snapshot capabilities (LVM, ZFS, SAN snapshots, cloud volume snapshots) for near-instantaneous backups:
How it works:

1. The database is briefly placed in a backup-safe state (a checkpoint or explicit backup mode)
2. The storage layer takes a copy-on-write snapshot, typically in seconds
3. Normal operation resumes immediately; files are copied off the snapshot at leisure
4. WAL generated during the process is retained so recovery can reconcile the snapshot

Advantages:

- Near-instant capture regardless of database size
- The backup-mode window is seconds long; the slow copy happens against the frozen snapshot

Considerations:

- Copy-on-write overhead degrades write performance while the snapshot exists
- The snapshot must cover every volume the database writes to, atomically
- Without database coordination the result is crash-consistent at best (see the sketch below)
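A minimal LVM-based sketch, assuming the PostgreSQL data directory lives on the illustrative volume /dev/vg0/pgdata and the script runs as root. Note that PostgreSQL 15+ spells the backup-mode calls pg_backup_start/pg_backup_stop and requires both to run in the same session, hence the \! shell escape inside psql:

```bash
# Enter backup mode, snapshot, leave backup mode, all in one session
psql -U postgres <<'EOF'
SELECT pg_backup_start('lvm-snapshot');
\! lvcreate --snapshot --size 10G --name pgdata_snap /dev/vg0/pgdata
SELECT pg_backup_stop();
EOF

# Mount the frozen snapshot read-only, copy it off, then release it
mount -o ro /dev/vg0/pgdata_snap /mnt/pgdata_snap
rsync -a /mnt/pgdata_snap/ /backup/pgdata/
umount /mnt/pgdata_snap
lvremove -f /dev/vg0/pgdata_snap
```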
Continuous/Streaming Backup
Modern databases support continuous backup where WAL segments are shipped to backup storage in real-time:
How it works:

1. Every completed WAL segment is copied (via an archive command) or streamed to backup storage as it is produced
2. A periodic base backup anchors the chain
3. Restore combines the base backup with replay of archived WAL up to any chosen point

Advantages:

- Recovery point objective (RPO) measured in seconds rather than hours
- Enables point-in-time recovery (PITR) to any moment within the retained window

Considerations:

- Archive storage grows with write volume and must be actively managed
- A gap in the archive chain silently invalidates PITR from that point on; archiving failures must be monitored (the restore side is sketched below)
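Building on the archiving sketch earlier, here is a hedged sketch of the restore side in PostgreSQL: point recovery at the archive and name a target. The paths, version, and target timestamp are illustrative:

```bash
# Recovery settings read at startup (PostgreSQL 12+)
cat >> /var/lib/postgresql/14/main/postgresql.auto.conf <<'EOF'
restore_command = 'cp /archive/wal/%f %p'      # fetch archived segments
recovery_target_time = '2024-06-01 12:00:00'   # illustrative PITR target
recovery_target_action = 'promote'             # open for writes when reached
EOF

# The presence of recovery.signal triggers archive recovery on startup
touch /var/lib/postgresql/14/main/recovery.signal
systemctl start postgresql
```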
Each major database system implements online backup with its own tooling, terminology, and specific behaviors. Let's examine the approaches used by leading databases.
PostgreSQL: pg_basebackup and Continuous Archiving
PostgreSQL's online backup architecture is elegant and well-integrated:
- pg_basebackup: creates a physical backup via the streaming replication protocol
- Continuous WAL archiving: segments preserved via archive_command or streaming
- pg_start_backup() / pg_stop_backup(): the low-level API for custom backup scripts
```bash
#!/bin/bash
# PostgreSQL Online Backup with pg_basebackup

# Environment setup
BACKUP_DIR="/backup/postgresql/$(date +%Y%m%d_%H%M%S)"
PG_HOST="localhost"
PG_PORT="5432"
PG_USER="backup_user"

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Perform streaming base backup with compression
# -D: Target directory
# -F: Format (plain directory or tar)
# -X: Include WAL (stream method)
# -z: Compress output
# -P: Show progress
# --checkpoint: Checkpoint mode (fast vs spread)
pg_basebackup \
  -h "$PG_HOST" \
  -p "$PG_PORT" \
  -U "$PG_USER" \
  -D "$BACKUP_DIR" \
  -F tar \
  -X stream \
  -z \
  -P \
  --checkpoint=fast \
  --label="Daily Backup $(date +%Y-%m-%d)" \
  --manifest-checksums=SHA256

# Check backup success
if [ $? -eq 0 ]; then
  echo "Backup completed successfully: $BACKUP_DIR"
  # Create backup manifest
  echo "Backup completed: $(date)" > "$BACKUP_DIR/backup_info.txt"
  echo "PostgreSQL version: $(psql --version)" >> "$BACKUP_DIR/backup_info.txt"
  echo "Backup size: $(du -sh "$BACKUP_DIR")" >> "$BACKUP_DIR/backup_info.txt"
else
  echo "Backup FAILED!" >&2
  exit 1
fi
```

MySQL: Enterprise Backup and Percona XtraBackup
MySQL offers multiple online backup approaches:
```bash
#!/bin/bash
# MySQL Online Backup with Percona XtraBackup

BACKUP_BASE="/backup/mysql"
FULL_BACKUP_DIR="$BACKUP_BASE/full_$(date +%Y%m%d)"
MYSQL_USER="backup_user"
MYSQL_PASS="secure_password"   # prefer a login-path or option file in production

# Full backup with Percona XtraBackup
# --backup: Perform backup
# --target-dir: Backup destination
# --compress: Enable compression
# --compress-threads: Parallel compression
# --parallel: Parallel file copy threads
xtrabackup \
  --backup \
  --user="$MYSQL_USER" \
  --password="$MYSQL_PASS" \
  --target-dir="$FULL_BACKUP_DIR" \
  --compress \
  --compress-threads=4 \
  --parallel=4

# Decompress first: --compress leaves .qp files that --prepare cannot read
# (requires the qpress utility on PATH)
xtrabackup \
  --decompress \
  --target-dir="$FULL_BACKUP_DIR"

# Prepare the backup (apply logs for consistency)
# This step is REQUIRED before restore
xtrabackup \
  --prepare \
  --target-dir="$FULL_BACKUP_DIR"

echo "MySQL backup complete: $FULL_BACKUP_DIR"

# For incremental backup (runs after full backup)
# INCR_BACKUP_DIR="$BACKUP_BASE/incr_$(date +%Y%m%d_%H%M%S)"
# xtrabackup --backup --target-dir="$INCR_BACKUP_DIR" \
#   --incremental-basedir="$FULL_BACKUP_DIR" \
#   --user="$MYSQL_USER" --password="$MYSQL_PASS"
```

Oracle: RMAN (Recovery Manager)
Oracle's RMAN is the most sophisticated enterprise backup tool:
```sql
-- Oracle RMAN Online Backup Script

-- Connect to RMAN
-- $ rman target / catalog rman_user/password@catalog_db

-- Configure backup settings
CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 30 DAYS;
CONFIGURE CONTROLFILE AUTOBACKUP ON;
CONFIGURE DEVICE TYPE DISK PARALLELISM 4;
CONFIGURE COMPRESSION ALGORITHM 'MEDIUM';
CONFIGURE ENCRYPTION FOR DATABASE ON;

-- Full database backup with compression and parallelism
RUN {
  ALLOCATE CHANNEL c1 DEVICE TYPE DISK;
  ALLOCATE CHANNEL c2 DEVICE TYPE DISK;
  ALLOCATE CHANNEL c3 DEVICE TYPE DISK;
  ALLOCATE CHANNEL c4 DEVICE TYPE DISK;

  -- Backup the database
  BACKUP AS COMPRESSED BACKUPSET
    DATABASE PLUS ARCHIVELOG
    TAG 'DAILY_FULL_BACKUP';

  -- Backup control file and SPFILE
  BACKUP CURRENT CONTROLFILE
    SPFILE
    TAG 'CONTROLFILE_BACKUP';

  -- Validate backup integrity
  RESTORE DATABASE VALIDATE;

  RELEASE CHANNEL c1;
  RELEASE CHANNEL c2;
  RELEASE CHANNEL c3;
  RELEASE CHANNEL c4;
}

-- Crosscheck and delete obsolete backups
CROSSCHECK BACKUP;
DELETE NOPROMPT OBSOLETE;

-- Report backup status
LIST BACKUP SUMMARY;
```

SQL Server: Native Backup with Compression
Microsoft SQL Server provides built-in online backup capabilities:
- BACKUP DATABASE ... WITH COPY_ONLY: a non-disruptive backup that leaves the differential baseline untouched
```sql
-- SQL Server Online Full Backup

-- Full database backup with compression
BACKUP DATABASE [Production]
TO DISK = 'D:\Backups\Production_Full.bak'
WITH
    COMPRESSION,        -- Native compression
    CHECKSUM,           -- Verify backup integrity
    STATS = 10,         -- Progress reporting every 10%
    NAME = 'Production-Full Database Backup',
    DESCRIPTION = 'Daily full backup with compression',
    COPY_ONLY;          -- Don't affect differential baseline

-- Backup to multiple files (striped backup for performance)
BACKUP DATABASE [Production]
TO  DISK = 'D:\Backups\Production_Stripe1.bak',
    DISK = 'D:\Backups\Production_Stripe2.bak',
    DISK = 'D:\Backups\Production_Stripe3.bak',
    DISK = 'D:\Backups\Production_Stripe4.bak'
WITH
    COMPRESSION,
    CHECKSUM,
    MAXTRANSFERSIZE = 4194304,  -- 4MB per transfer
    BUFFERCOUNT = 20,           -- Memory buffers
    STATS = 5;

-- Verify backup is readable
RESTORE VERIFYONLY
FROM DISK = 'D:\Backups\Production_Full.bak'
WITH CHECKSUM;

-- Query backup history
SELECT
    database_name,
    backup_start_date,
    backup_finish_date,
    DATEDIFF(MINUTE, backup_start_date, backup_finish_date) AS duration_minutes,
    backup_size / 1024 / 1024 AS backup_size_mb,
    compressed_backup_size / 1024 / 1024 AS compressed_size_mb,
    CAST(100 - (compressed_backup_size * 100.0 / backup_size) AS DECIMAL(5,2)) AS compression_ratio_pct
FROM msdb.dbo.backupset
WHERE database_name = 'Production'
ORDER BY backup_start_date DESC;
```

While online backups minimize disruption, they are not zero-impact operations. Understanding and managing performance impact is essential for production deployments.
Resource Consumption During Backup
Online backups consume several system resources:

- Disk I/O: reading every data file competes directly with production reads and writes
- CPU: compression, encryption, and checksum calculation are CPU-intensive
- Network: streaming the backup to remote storage consumes bandwidth
- Memory/cache: large sequential reads can evict hot pages from the OS and buffer caches
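A sketch of watching that pressure from a second terminal while a backup runs, using standard Linux tools (the process name assumes the pg_basebackup example above):

```bash
# Per-device IOPS, throughput, utilization, and queue depth, every 5 s
iostat -x 5

# I/O generated by the backup process itself
pidstat -d -p "$(pgrep -f pg_basebackup)" 5

# Network throughput, relevant when streaming off-host
sar -n DEV 5
```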
Measuring Backup Impact
Before deploying online backup, measure baseline performance and impact; a pgbench sketch for doing so follows the table:
| Metric Category | Specific Metrics | Acceptable Impact |
|---|---|---|
| Query Performance | Average query latency, P95/P99 latency | 10-30% degradation |
| Throughput | Transactions per second, queries per second | 5-20% reduction |
| Disk I/O | IOPS, throughput MB/s, queue depth | 50-100% increase acceptable |
| CPU | User %, system %, iowait % | 20-40% increase if capacity exists |
| Replication Lag | Seconds behind primary (if replicated) | Minimal increase expected |
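One simple way to put numbers in that table, sketched here with pgbench (assumes a pgbench-initialized database named bench): run once as a baseline, run again mid-backup, and compare the reported latency and TPS:

```bash
# 16 clients, 4 worker threads, 2 minutes, progress report every 10 s
pgbench -c 16 -j 4 -T 120 -P 10 bench
```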
Mitigation Strategies
1. Schedule During Low-Traffic Periods
Even for 24/7 systems, traffic patterns vary. Schedule full backups during:

- Overnight or regional low-traffic hours identified from your monitoring data
- Weekends or known seasonal lulls
- Windows clear of batch jobs, ETL runs, and other maintenance (see the cron sketch below)
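For example, a minimal cron entry (path and schedule illustrative) that runs a backup script like the pg_basebackup one above at 02:30:

```
# m  h  dom mon dow  command
30   2  *   *   *    /usr/local/bin/pg_online_backup.sh >> /var/log/pg_backup.log 2>&1
```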
2. Throttle Backup I/O
Many backup tools support I/O limiting:
```bash
# PostgreSQL: pg_basebackup with rate limiting
pg_basebackup -D /backup -r 100M   # Limit to 100 MB/s

# XtraBackup: I/O rate limiting
xtrabackup --backup --throttle=40 --target-dir=/backup
# Limits to 40 I/O operations per second

# Linux: Use ionice for I/O scheduling priority
ionice -c 3 pg_basebackup -D /backup   # Idle I/O class (lowest priority)

# Linux: Use cpulimit to cap CPU usage
cpulimit -l 50 -- pg_basebackup -D /backup   # Limit to 50% CPU

# Combination approach
ionice -c 2 -n 7 nice -n 19 pg_basebackup -D /backup -r 50M
```

3. Use Replica for Backup Source
A powerful pattern is to perform backups from a replica rather than the primary:
Advantages:

- Zero backup I/O on the primary; production latency is untouched
- The backup can run longer, or with heavier compression, without user-facing impact

Considerations:

- The replica must be healthy and caught up, or the backup captures stale data
- Backup tooling must support running against a replica (pg_basebackup, for example, works against a hot standby)
- A backup-dedicated replica adds hardware and operational cost
4. Leverage Storage-Level Features
Modern storage systems offer backup-friendly features:

- Array- and cloud-level snapshots and clones, as in the filesystem snapshot strategy above
- Changed-block tracking, so incremental backups copy only modified blocks
- Offloading backup reads to a storage replica or a mounted snapshot (see the cloud sketch below)
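For instance, a hedged sketch of a cloud volume snapshot via the AWS CLI; the volume ID and tags are illustrative, and the database should be checkpointed or placed in backup mode first, as in the LVM example earlier:

```bash
aws ec2 create-snapshot \
  --volume-id vol-0123456789abcdef0 \
  --description "pgdata nightly snapshot $(date +%F)" \
  --tag-specifications 'ResourceType=snapshot,Tags=[{Key=app,Value=postgres}]'
```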
5. Optimize Backup Duration
Shorter backups mean shorter impact windows:

- Use incremental or differential backups so full copies are needed less often
- Parallelize file copy and compression (as in the XtraBackup example above)
- Direct the backup stream at fast storage and uncongested network paths
Always establish a performance baseline before deploying online backup in production. Run test backups during various load conditions and measure impact on key metrics. This data enables informed decisions about backup scheduling, throttling, and architecture. A well-tuned online backup should cause <15% degradation in production performance during the backup window.
The ultimate test of any backup is successful recovery. Online backups must provide clear, guaranteed recovery paths with well-defined consistency semantics.
Crash Consistency vs. Application Consistency
Online backups typically provide crash consistency—the backup represents a state equivalent to abruptly powering off the server at a specific moment:

- All transactions committed before that moment are recoverable
- Transactions in flight at that moment are rolled back during recovery, exactly as after a power failure
- The database's normal crash-recovery machinery (WAL replay) performs the repair

Application consistency goes further, ensuring:

- Application-level buffers and caches are flushed before the capture point
- In-flight application work is quiesced rather than rolled back
- Operations spanning multiple databases or systems are captured at the same logical point
Most online backups provide crash consistency. Application consistency requires additional coordination (quiescing applications, coordinating multi-system backups).
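As one illustration of that extra coordination, a hedged MySQL sketch: hold a global read lock just long enough to take a storage snapshot, then release it (volume names illustrative, reusing the LVM pattern from earlier):

```bash
mysql <<'EOF'
FLUSH TABLES WITH READ LOCK;  -- quiesce writes and flush table state
\! lvcreate --snapshot --size 10G --name mysql_snap /dev/vg0/mysqldata
UNLOCK TABLES;
EOF
```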
The Recovery Process
Recovering from an online backup involves several steps:
1. Restore Data Files: copy the base backup into the data directory, replacing the damaged cluster

2. Apply Redo Logs (WAL): replay all WAL from the backup's checkpoint forward to the desired recovery target

3. Open Database: in-flight transactions are rolled back and the database begins accepting connections

4. Validate Recovery: verify integrity, object counts, and application-level sanity before returning to service

The PostgreSQL walkthrough below makes these steps concrete:
```bash
# PostgreSQL: Restore from pg_basebackup

# 1. Stop the running PostgreSQL (if any)
sudo systemctl stop postgresql

# 2. Clear the data directory (CAUTION: destroys current data)
sudo rm -rf /var/lib/postgresql/14/main/*

# 3. Restore the backup
sudo tar -xzf /backup/base.tar.gz -C /var/lib/postgresql/14/main/
sudo tar -xzf /backup/pg_wal.tar.gz -C /var/lib/postgresql/14/main/pg_wal/

# 4. Set correct ownership
sudo chown -R postgres:postgres /var/lib/postgresql/14/main

# 5. Create recovery signal file (PostgreSQL 12+)
# Only needed to replay archived WAL beyond what the backup carries in
# pg_wal, and then restore_command must also be set; for a self-contained
# -X stream backup, skip this step and let crash recovery do the work.
sudo touch /var/lib/postgresql/14/main/recovery.signal

# 6. Start PostgreSQL - recovery happens automatically
sudo systemctl start postgresql

# 7. Monitor recovery progress
sudo -u postgres psql -c "SELECT pg_is_in_recovery();"
# Returns 'f' (false) when recovery is complete

# 8. Validate recovery
sudo -u postgres pg_isready
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_user_tables;"
```

Many organizations have discovered during actual disasters that their backups were unusable—corrupted, incomplete, or with undocumented dependencies. Regular recovery testing is essential. The only proven backup is one that has been successfully restored. Test your recovery procedures at least monthly, with documented runbooks and timing metrics.
Implementing online backup successfully requires attention to numerous details. The following best practices and checklist ensure robust, production-ready backup systems.
| Category | Item | Status |
|---|---|---|
| Infrastructure | Dedicated backup storage sized appropriately | ☐ |
| Infrastructure | Network bandwidth for backup traffic calculated | ☐ |
| Infrastructure | Backup retention storage estimated (30-90 days) | ☐ |
| Configuration | Backup user/role with appropriate permissions | ☐ |
| Configuration | WAL archiving configured and tested | ☐ |
| Configuration | Backup compression enabled | ☐ |
| Configuration | Backup encryption configured | ☐ |
| Scheduling | Backup schedule aligned with traffic patterns | ☐ |
| Scheduling | Backup window sized appropriately for data volume | ☐ |
| Monitoring | Backup success/failure alerting configured | ☐ |
| Monitoring | Backup duration monitoring in place | ☐ |
| Monitoring | Backup size trending tracked | ☐ |
| Validation | Automated backup integrity checks enabled | ☐ |
| Validation | Regular restore tests scheduled (monthly) | ☐ |
| Documentation | Recovery runbook documented | ☐ |
| Documentation | Recovery tested by multiple team members | ☐ |
You now understand the principles, mechanisms, and implementation strategies for online (hot) backup. This knowledge enables you to design and deploy backup systems that protect production data without impacting availability. Next, we'll explore offline (cold) backup strategies for scenarios where system shutdown is acceptable or required.