Mirroring—maintaining identical copies of data on multiple independent storage systems—is perhaps the most intuitive approach to data protection. If one copy fails, the other remains. The concept is simple, but the implementation details are profound.
In the previous page, we explored RAID 1 as basic disk mirroring. But mirroring extends far beyond single-system disk arrays. Modern database systems implement mirroring at multiple levels: storage mirroring within a server, database log mirroring across multiple devices, synchronous replication to standby servers, and asynchronous replication to remote sites.
Each level of mirroring addresses different failure scenarios and involves distinct tradeoffs between protection, performance, and complexity.
By the end of this page, you will understand the different levels of mirroring in database systems, the critical distinction between synchronous and asynchronous replication, how database systems implement log and data mirroring, and the CAP theorem implications of strong consistency mirroring.
At its core, mirroring ensures that every write operation is applied to multiple copies of the data before being acknowledged as complete. This creates redundancy that survives component failures.
Key Mirroring Properties:
1. Write Ordering
Mirrors must maintain consistent write ordering. If writes A, B, and C are applied to the primary in that order, they must be applied to mirrors in the same order. Inconsistent ordering can create states that never existed on the primary (see the LSN sketch after this list).
2. Failure Independence
For mirroring to provide redundancy, the mirrors must fail independently. Mirrors on the same RAID controller or same power circuit don't provide independent failure modes.
3. Consistency Point
After a primary failure, the mirror must be usable as a consistent replacement. This requires careful handling of in-flight writes during failure scenarios.
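In PostgreSQL, write ordering falls out of the write-ahead log: every WAL record carries a monotonically increasing Log Sequence Number (LSN), and standbys replay records strictly in LSN order. A minimal sketch for observing this, assuming a primary/standby pair:

```sql
-- On the primary: the current end-of-WAL position
SELECT pg_current_wal_lsn();

-- On a standby: the last LSN received and the last LSN replayed.
-- Replay never runs ahead of receive, and both advance monotonically,
-- which is precisely the write-ordering property.
SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();
```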
| Level | What's Mirrored | Protection Against | Example |
|---|---|---|---|
| Disk/Storage | Physical blocks | Drive failure | RAID 1, SAN mirroring |
| Log File | Transaction log | Log corruption/loss | Oracle multiplexed redo logs |
| Database Instance | Entire database | Server failure | PostgreSQL streaming replication |
| Geographic | Site copy | Site disaster | Cross-datacenter replication |
The Mirror Synchronization Spectrum:
Mirroring implementations exist on a spectrum from fully synchronous to fully asynchronous, with significant implications for both protection guarantees and performance:
← More Protection                                  Less Protection →
← Higher Latency                                    Lower Latency →
|──────────────────────────────────────────────────────────────────|
│                                                                  │
│   Synchronous       Semi-Sync       Async w/ACK       Async      │
│   (Wait for         (Wait for       (Periodic         (Best-     │
│    mirror write)     receive)        confirm)          effort    │
│                                                        send)     │
│                                                                  │
|──────────────────────────────────────────────────────────────────|
The choice of synchronization mode determines what failures the system can survive without data loss.
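Where a system sits on this spectrum need not be a global choice. PostgreSQL, for example, lets the commit mode be set per role or even per transaction, so different workloads can occupy different points on the spectrum. A minimal sketch, assuming hypothetical role and table names:

```sql
-- Per-role defaults (role names are hypothetical):
ALTER ROLE payments_app  SET synchronous_commit = remote_apply; -- strongest
ALTER ROLE reporting_etl SET synchronous_commit = local;        -- local flush only

-- Per-transaction override within a session:
BEGIN;
SET LOCAL synchronous_commit = off;  -- best-effort for this transaction only
INSERT INTO staging_events VALUES (1, now());  -- hypothetical low-value write
COMMIT;
```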
Synchronous mirroring is the gold standard for data protection: a write is not acknowledged to the application until it has been durably written to all mirrors. This guarantees that committed data exists on multiple independent storage systems.
How Synchronous Mirroring Works:
Application                Primary                   Mirror
     │                        │                         │
     │──── Write Request ────▶│                         │
     │                        │                         │
     │                        │──── Sync Write ────────▶│
     │                        │── Write to local disk   │── Write to disk
     │                        │   (in parallel)         │
     │                        │                         │
     │                        │◀──── Write ACK ─────────│
     │                        │                         │
     │◀── Write Complete ─────│                         │
     │                        │                         │
Critical: Application waits for BOTH writes to complete
The Durability Guarantee:
With synchronous mirroring, any write acknowledged to the application is guaranteed to exist on every mirror. The failure of any single copy, including the primary itself, therefore loses no committed data: the recovery point objective (RPO) is zero.
The Performance Cost:
Every write must wait for the slowest mirror. If the mirror is across a network, latency includes round-trip network time plus the mirror's write time. For local mirrors, overhead is typically 1-5 ms; for remote mirrors across continents, it can be 50-150 ms. This directly impacts transaction commit latency.
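Because PostgreSQL exposes this tradeoff as a session setting, the overhead is easy to measure directly. A minimal sketch (run in psql with \timing enabled; the table t is hypothetical):

```sql
-- Compare commit latency with and without waiting for the standby
SET synchronous_commit = on;      -- wait for the synchronous standby's flush
INSERT INTO t VALUES (1);         -- commit time includes the network round trip

SET synchronous_commit = local;   -- wait only for the local WAL flush
INSERT INTO t VALUES (2);         -- commit time reflects local disk latency alone
```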
Synchronous Mirroring in Practice:
1. PostgreSQL Synchronous Replication
PostgreSQL's streaming replication can be configured for synchronous operation:
```sql
-- Primary server: postgresql.conf
-- Require at least one synchronous standby before commit returns
synchronous_commit = on
synchronous_standby_names = 'standby1'  -- Or list: 'FIRST 2 (s1, s2, s3)'

-- Different synchronous commit modes:
-- on           = wait for WAL flush on primary and standby
-- remote_apply = wait for standby to apply WAL (replay)
-- remote_write = wait for standby to receive (but not necessarily flush)
-- local        = wait for local flush only (async replication)
-- off          = async even locally (dangerous!)

-- Check replication status:
SELECT client_addr, state, sync_state,
       sent_lsn, write_lsn, flush_lsn
FROM pg_stat_replication;

-- sync_state values:
-- 'sync'   = synchronous standby
-- 'async'  = asynchronous standby
-- 'quorum' = participating in quorum commit
```
2. MySQL Semisynchronous Replication
MySQL's semisync replication waits for acknowledgment that the replica received the binlog, but not necessarily that it was written to disk:
```sql
-- Primary server: Enable semi-synchronous replication
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;

-- Timeout before falling back to async (in ms)
SET GLOBAL rpl_semi_sync_master_timeout = 10000;

-- Wait point: AFTER_SYNC (safer) or AFTER_COMMIT
SET GLOBAL rpl_semi_sync_master_wait_point = 'AFTER_SYNC';

-- Replica server: Enable semi-sync
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;

-- Monitor semi-sync status
SHOW STATUS LIKE 'Rpl_semi_sync%';
```
Asynchronous mirroring decouples the primary write acknowledgment from mirror updates. The primary acknowledges writes immediately after local durability, and mirror updates happen in the background.
How Asynchronous Mirroring Works:
Application                Primary                   Mirror
     │                        │                         │
     │──── Write Request ────▶│                         │
     │                        │── Write to local disk   │
     │                        │                         │
     │◀── Write Complete ─────│                         │
     │                        │                         │
     │   (Application         │──── Async Send ────────▶│
     │    continues           │                         │
     │    immediately)        │                         │
     │                        │◀──── ACK ───────────────│
     │                        │      (later)            │
The Data Loss Window:
With asynchronous mirroring, there's always a window of potential data loss—the replication lag. If the primary fails, any writes that were acknowledged but not yet replicated are lost.
Replication lag in async systems can range from milliseconds under normal conditions to minutes or hours during high load or network issues. The amount of potential data loss equals the replication lag at the time of failure. Monitoring lag is critical for understanding your actual data loss exposure.
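In MySQL, for example, the lag is visible in the replica status (a sketch; MySQL 8.0.22+ uses SHOW REPLICA STATUS, older versions SHOW SLAVE STATUS with Seconds_Behind_Master):

```sql
-- On the replica: check how far behind the primary it is
SHOW REPLICA STATUS\G
-- Key field: Seconds_Behind_Source
--   0     = caught up
--   N > 0 = roughly N seconds of acknowledged-but-unreplicated writes at risk
--   NULL  = replication stopped; the loss window is unbounded
```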
Asynchronous Replication in Practice:
PostgreSQL Async Streaming Replication:
```sql
-- Primary: Configure async streaming replication
-- postgresql.conf

wal_level = replica          -- Required for replication
max_wal_senders = 10         -- Max concurrent senders
wal_keep_size = 1GB          -- Keep WAL for slow standbys
synchronous_commit = local   -- Don't wait for standby

-- Standby: recovery.conf / standby.signal + postgresql.conf
-- Create standby.signal file (PostgreSQL 12+)
-- Or use recovery.conf for older versions

primary_conninfo = 'host=primary port=5432 user=repl password=xxx'
restore_command = 'cp /archive/%f %p'

-- Monitor replication lag on primary:
SELECT client_addr, state,
       pg_wal_lsn_diff(sent_lsn, write_lsn)  AS send_lag_bytes,
       pg_wal_lsn_diff(sent_lsn, replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;

-- On standby, check how far behind:
SELECT pg_last_wal_receive_lsn(),
       pg_last_wal_replay_lsn(),
       pg_last_xact_replay_timestamp(),
       EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds;
```
When to Use Asynchronous Mirroring:
Async mirroring is appropriate when:
- The mirror is geographically distant, so synchronous round trips would make commit latency unacceptable
- The business can tolerate a small, bounded data-loss window (a nonzero RPO)
- Write throughput matters more than zero-loss guarantees, as with reporting replicas or analytics copies
Database transaction logs require the highest level of protection because they're the foundation of recovery. Many database systems provide log-specific mirroring independent of overall data replication.
Oracle Multiplexed Redo Logs:
Oracle's redo log system allows multiple copies of each log group on different storage devices. Oracle writes to all members of a log group synchronously—a commit doesn't complete until all copies are written.
```sql
-- Oracle: View current redo log configuration
SELECT group#, member, type, status
FROM v$logfile
ORDER BY group#;

-- Add a mirror member to an existing log group
ALTER DATABASE ADD LOGFILE MEMBER
  '/u02/oracle/redo/redo01b.log' TO GROUP 1;

-- Create a new log group with multiple members (mirrored)
ALTER DATABASE ADD LOGFILE GROUP 4 (
  '/u01/oracle/redo/redo04a.log',  -- First copy
  '/u02/oracle/redo/redo04b.log'   -- Mirror copy
) SIZE 500M;

-- Recommended: At least 2 copies of each log group
-- Place on different disks/controllers/paths

-- Monitor log write performance
SELECT group#, sequence#, bytes/1024/1024 MB, status, archived
FROM v$log;
```
SQL Server Transaction Log Mirroring:
SQL Server doesn't have built-in log file mirroring at the database level, but provides several protection mechanisms:
```sql
-- SQL Server: Log protection options

-- 1. Database Mirroring (deprecated but still used)
-- Creates synchronous copy of entire database including log
ALTER DATABASE MyDB SET PARTNER = 'TCP://mirror:5022';

-- 2. Always On Availability Groups (modern approach)
-- Synchronous replicas keep the log synchronized
-- Configure via SSMS or:
ALTER AVAILABILITY GROUP MyAG ADD DATABASE MyDB;

-- 3. Log shipping (async, for DR)
-- Backs up and restores the log to a secondary periodically
-- Not real-time, but simple and reliable

-- 4. Place the log on mirrored storage
-- Use RAID 1 or SAN mirroring for transaction log files

-- Verify log file location:
SELECT name, physical_name, type_desc
FROM sys.master_files
WHERE database_id = DB_ID('MyDB');
```
For critical databases, implement log protection at multiple levels: RAID 1 or RAID 10 for the log devices, multiplexed/mirrored log files if supported, and synchronous replication to a standby server. This defense-in-depth approach protects against disk failures, controller failures, and entire server failures.
When mirroring to more than two copies, quorum-based systems provide a balance between protection and availability. Instead of requiring all copies to acknowledge a write, the system requires a majority (quorum) of copies.
Quorum Mathematics:
For N copies, choose a write quorum W and a read quorum R such that:
- W + R > N — every read set overlaps every write set, so a read always sees the latest acknowledged write
- W > N/2 — two concurrent writes cannot both reach quorum on disjoint node sets, preventing conflicting committed writes
Example: 3 copies with W=2, R=2. Here W + R = 4 > 3 and W = 2 > 1.5, so the system stays consistent while tolerating the failure of any single copy: writes succeed on the two surviving nodes, and any two nodes read together always include at least one current copy.
Quorum systems directly implement the tradeoffs described by the CAP theorem. Requiring W > N/2 for writes ensures consistency but means writes fail if a majority of nodes are unavailable (reduced availability). Allowing W ≤ N/2 improves availability but risks split-brain scenarios where conflicting writes succeed on different partitions.
PostgreSQL Quorum Commit:
PostgreSQL supports quorum-based synchronous replication where transactions wait for a configurable number of standbys:
```sql
-- PostgreSQL: Quorum-based synchronous replication

-- Wait for ANY 2 of 3 standbys to confirm
synchronous_standby_names = 'ANY 2 (standby1, standby2, standby3)'

-- Wait for the FIRST 2 standbys in priority order
synchronous_standby_names = 'FIRST 2 (standby1, standby2, standby3)'

-- Both provide a 2/3 quorum, but:
-- ANY:   waits for the fastest 2 of the 3
-- FIRST: always waits for standby1 and standby2 specifically

-- Monitor which standbys are in the quorum:
SELECT client_addr, application_name, sync_state, sync_priority
FROM pg_stat_replication;

-- sync_state = 'quorum' indicates participation in quorum commit
-- sync_priority shows the order for FIRST N configurations
```
Distributed Databases and Quorum:
Distributed databases like CockroachDB, TiDB, and Spanner use Raft or Paxos consensus protocols that are fundamentally quorum-based. A write is committed when a majority of replicas acknowledge it:
| Replicas (N) | Write Quorum (W) | Max Failures Tolerated | Notes |
|---|---|---|---|
| 3 | 2 | 1 | Common configuration; good balance |
| 5 | 3 | 2 | Higher availability; more resources |
| 7 | 4 | 3 | Rare; for extreme requirements |
| 2 | 2 | 0 | Not fault-tolerant; both required |
Quorum systems typically use odd numbers of replicas (3, 5, 7) because adding one replica to an odd configuration doesn't improve fault tolerance: both 3 and 4 replicas tolerate exactly 1 failure (requiring write quorums of 2 and 3 respectively). Since 4 replicas cost more than 3 without adding protection, odd configurations are preferred.
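As one concrete illustration (hedged; the exact syntax varies across CockroachDB versions), CockroachDB controls the replication factor through zone configurations:

```sql
-- CockroachDB: raise the default replication factor from 3 to 5
-- (requires at least 5 nodes; writes then need a 3-of-5 Raft quorum)
ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5;

-- Inspect the resulting configuration
SHOW ZONE CONFIGURATION FOR RANGE default;
```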
When a mirror fails and is subsequently repaired or replaced, it must be resynchronized with the primary. The resynchronization process is critical for restoring full redundancy efficiently.
Resynchronization Methods:
1. Full Resync (Complete Copy)
Copy the entire dataset from primary to mirror. Simple but time-consuming for large databases.
Primary Mirror (Empty/Stale)
│ │
│── Copy All Data ──────▶│
│ │
│── Continue Operations ─│
│── Track Changes ──────▶│ (While copy in progress)
│ │
│── Copy Complete ───────│
│── Apply Tracked Changes│
│── Mirror Synchronized ─│
2. Incremental Resync (Delta Copy)
Copy only blocks that changed since the mirror failed. Requires tracking which blocks were modified.
Primary Mirror (Stale)
│ │
│── Changed Blocks Only ▶│
│ │
│── Mirror Synchronized ─│
Much faster when relatively few blocks have changed.
3. Log-Based Resync (WAL Shipping)
Apply transaction log entries that the mirror missed. Most efficient when mirror was only briefly unavailable.
```bash
#!/bin/bash
# PostgreSQL: Resynchronizing a standby server

# Method 1: pg_basebackup (full copy)
# Use when the standby is completely out of sync or new
pg_basebackup -h primary -D /var/lib/postgresql/data \
    --checkpoint=fast --wal-method=stream -R

# Method 2: pg_rewind (incremental)
# Use when the standby diverged but has overlapping history
pg_rewind --target-pgdata=/var/lib/postgresql/data \
    --source-server='host=primary user=repl'
# This copies only the blocks changed since the divergence point

# Method 3: WAL archive catch-up
# If the standby is slightly behind but connected,
# simply restart streaming replication and let it catch up;
# the standby will automatically apply the missing WAL.
# Recovery uses restore_command to fetch archived WAL
# if the streaming position is too far behind.

# Monitor resync progress:
psql -c "SELECT pg_last_wal_receive_lsn(),
                pg_last_wal_replay_lsn(),
                pg_is_in_recovery();"
```
During resynchronization, the system has reduced redundancy. If another failure occurs before resync completes, data loss is possible. For critical systems, use three or more replicas so that a single failure still leaves at least two copies, allowing the failed replica to be rebuilt without running unprotected.
When a primary fails, the mirror must take over—a process called failover. Failover complexity depends on whether the system uses automatic or manual failover and whether the standby was synchronous or asynchronous.
Failover Considerations:
1. Synchronous Standby Failover
With synchronous replication, the standby has all committed transactions, so failover is conceptually simple:
- Confirm the primary is actually down
- Promote the standby to primary
- Redirect application connections to the new primary
No committed data is lost (RPO = 0); the cost is only the downtime during detection and promotion.
2. Asynchronous Standby Failover
With async replication, the standby may be behind:
- Measure the replication lag at the moment of failure to bound the loss
- Decide whether losing the unreplicated transactions is acceptable
- If possible, salvage any remaining WAL from the failed primary before promoting
The procedure below walks through these checks.
```sql
-- PostgreSQL: Standby failover procedure

-- 1. Verify the primary is truly down (avoid split-brain!)
--    Check from multiple network paths if possible

-- 2. Check standby replication status
SELECT pg_is_in_recovery(),
       pg_last_wal_receive_lsn(),
       pg_last_wal_replay_lsn(),
       pg_last_xact_replay_timestamp();

-- 3. Verify the standby has applied all received WAL
--    (for sync replication, this should be up to date;
--     for async, decide whether the data loss is acceptable)

-- 4. Promote the standby to primary
--    Option A: pg_ctl promote
--    $ pg_ctl promote -D /var/lib/postgresql/data

--    Option B: SQL function (PostgreSQL 12+)
SELECT pg_promote(wait => true, wait_seconds => 60);

-- 5. Verify the promotion succeeded
SELECT pg_is_in_recovery();  -- Should return FALSE

-- 6. Update application connection strings to the new primary

-- 7. Ensure the old primary cannot start as a primary
--    (fence it, or ensure it starts in standby mode)

-- 8. Eventually rebuild the old primary as a new standby
```
If the old primary isn't actually down (network partition, not failure), promoting the standby creates two primaries accepting different writes. This 'split-brain' scenario causes data divergence that may be impossible to reconcile. Always ensure the old primary is truly unavailable or fenced before failover.
Automatic vs. Manual Failover:
| Aspect | Automatic Failover | Manual Failover |
|---|---|---|
| RTO (Recovery Time) | Seconds to minutes | Minutes to hours |
| Split-brain risk | Higher (false positives) | Lower (human verification) |
| Operational overhead | Lower (no pager alerts) | Higher (requires staff) |
| Complexity | Higher (automation logic) | Lower (documented procedure) |
| Recommended for | High availability requirements | Critical data where a longer RTO is acceptable |
Tools for Automatic Failover:
Common choices include Patroni and repmgr for PostgreSQL, Orchestrator and InnoDB Cluster for MySQL, and generic cluster managers such as Pacemaker/Corosync. All combine health checking, leader election, and fencing to contain the split-brain risk described above.
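Whatever the tool, the core loop is the same: probe each node's health and replication position, and promote only after confident failure detection and fencing. A simplified sketch of such a probe, assuming PostgreSQL (real tools also compare timelines and quorum state):

```sql
-- Simplified health probe a failover manager might run against each node
SELECT pg_is_in_recovery()                     AS is_standby,
       pg_last_wal_replay_lsn()                AS replayed_lsn, -- NULL on a primary
       now() - pg_last_xact_replay_timestamp() AS apparent_lag; -- NULL on a primary
```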
Mirroring is a fundamental technique for achieving stable storage and high availability in database systems. Let's consolidate the key concepts:
- Mirroring operates at multiple levels: disk blocks, transaction logs, whole database instances, and geographic sites, each addressing different failure scenarios
- Synchronous mirroring guarantees zero data loss (RPO = 0) but adds the slowest mirror's latency to every commit; asynchronous mirroring keeps commits fast but opens a data-loss window equal to the replication lag
- Quorum-based replication (W + R > N, W > N/2) balances consistency and availability across three or more copies
- Failed mirrors must be resynchronized (full, incremental, or log-based), and failover requires fencing the old primary to avoid split-brain
What's Next:
Mirroring within a single site protects against component failures but not site-level disasters. The next page explores remote backup—replicating data to geographically distant locations to survive fires, floods, earthquakes, and other events that could destroy an entire data center.
You now understand the principles of database mirroring, from simple disk mirrors to synchronous and asynchronous replication. This knowledge is essential for designing database systems that maintain high availability while protecting data integrity.