A single PostgreSQL server, no matter how powerful, represents a single point of failure and a scalability ceiling. Replication addresses both concerns by maintaining copies of your data on multiple servers. When implemented correctly, replication provides high availability (automatic failover when primary fails), read scaling (distribute read load across replicas), and disaster recovery (survive datacenter failures).
PostgreSQL offers sophisticated replication options that have evolved significantly over its history. Understanding the differences between physical and logical replication, synchronous and asynchronous modes, and various failover architectures is essential for designing resilient systems.
This page covers PostgreSQL's replication mechanisms in depth: Write-Ahead Log (WAL) streaming for physical replication, logical replication for selective data distribution, synchronous and asynchronous configurations, cascading replicas, and high availability architectures including Patroni and pgpool-II.
PostgreSQL provides two fundamentally different approaches to replication, each suited for different use cases:
Physical Replication (Streaming Replication):
Physical replication transmits the Write-Ahead Log (WAL)—the binary transaction log—to standby servers. Standbys apply WAL records to their data files, resulting in byte-for-byte identical copies of the primary. This is a complete database copy at the storage level.
Logical Replication:
Logical replication transmits changes as logical operations (INSERT, UPDATE, DELETE on specific tables). The standby reconstructs changes from these logical operations. This allows selective table replication, cross-version replication, and even schema differences between publisher and subscriber.
| Aspect | Physical Replication | Logical Replication |
|---|---|---|
| Data Transmitted | Binary WAL records | Logical row changes (INSERT/UPDATE/DELETE) |
| Replication Scope | Entire cluster (all databases) | Selected tables within a database |
| Standby Writability | Read-only (hot standby) | Writable (can have local tables) |
| Version Compatibility | Same major version required | Cross-version replication possible |
| Schema Requirements | Identical schemas | Compatible schemas (flexible) |
| DDL Replication | Automatic (in WAL) | Not replicated (must apply manually) |
| Initial Setup | pg_basebackup (full copy) | Table-by-table sync + streaming |
| Performance Overhead | Minimal (WAL shipping) | Higher (decoding + apply) |
| Use Cases | HA failover, read replicas | Data integration, subset replication |
Choose physical replication for: high availability with automatic failover, read replicas serving the same application, and disaster recovery to a remote site. Choose logical replication for: replicating specific tables to analytics systems, migrating between PostgreSQL versions, consolidating data from multiple sources into one target, and enabling writes on the replica for local data.
Streaming replication is PostgreSQL's primary mechanism for high availability. The primary server streams WAL records to standby servers in near-real-time, maintaining synchronized copies of the entire database cluster.
How Streaming Replication Works:
```
            Primary Server                                 Standby Server
┌─────────────────────────────────────┐      ┌─────────────────────────────────────┐
│ Backend Process (handling queries)  │      │ Startup Process (applies WAL)       │
│              │                      │      │              ▲                      │
│              ▼                      │      │              │                      │
│     ┌─────────────────┐             │      │     ┌────────┴────────┐             │
│     │   WAL Buffer    │             │      │     │    WAL Files    │             │
│     │   (in memory)   │             │      │     │    (pg_wal/)    │             │
│     └────────┬────────┘             │      │     └────────▲────────┘             │
│              │                      │      │              │                      │
│              ▼                      │      │              │  Hot Standby Queries │
│     ┌─────────────────┐             │      │              │  ┌────────────────┐  │
│     │    WAL Files    │             │      │              │  │  Client Conns  │  │
│     │    (pg_wal/)    │             │      │              │  │  (read-only)   │  │
│     └────────┬────────┘             │      │              │  └────────────────┘  │
│              │                      │      │              │                      │
│              ▼                      │      │     ┌────────┴────────┐             │
│     ┌─────────────────┐             │      │     │  WAL Receiver   │             │
│     │   WAL Sender    │─────────────┼TCP/IP┼────►│     Process     │             │
│     │     Process     │             │      │     └─────────────────┘             │
│     └─────────────────┘             │      │                                     │
└──────────────▲──────────────────────┘      └──────────────┬──────────────────────┘
               │                                            │
               └─ Feedback: flush/apply positions ──────────┘
```
```bash
# === On Primary Server ===

# 1. Configure postgresql.conf
wal_level = replica        # Enable replication info in WAL
max_wal_senders = 10       # Max number of standbys
wal_keep_size = 1GB        # Retain WAL for slow standbys
hot_standby = on           # Allow queries on standby

# 2. Create replication user
psql -c "CREATE USER replicator WITH REPLICATION ENCRYPTED PASSWORD 'secure_password';"

# 3. Configure pg_hba.conf (allow replication connections)
# host  replication  replicator  standby_ip/32  scram-sha-256

# 4. Restart PostgreSQL
pg_ctl restart -D /var/lib/postgresql/data

# === On Standby Server ===

# 1. Stop PostgreSQL if running
pg_ctl stop -D /var/lib/postgresql/data

# 2. Clear data directory (IMPORTANT: this deletes existing data)
rm -rf /var/lib/postgresql/data/*

# 3. Take base backup from primary
pg_basebackup -h primary_ip -D /var/lib/postgresql/data -U replicator -P -R
# The -R flag creates standby.signal and configures recovery settings

# 4. Start standby
pg_ctl start -D /var/lib/postgresql/data
# Standby is now in recovery and accepting read-only queries
```

Monitoring Replication Lag:
Replication lag—the delay between a transaction committing on primary and being visible on standby—is the critical metric for replica health.
```sql
-- On Primary: check connected standbys
SELECT client_addr, state,
       sent_lsn, write_lsn, flush_lsn, replay_lsn,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, replay_lsn)) AS replay_lag_bytes,
       sync_state   -- 'async', 'sync', 'potential', 'quorum'
FROM pg_stat_replication;

-- On Primary: replication lag as time intervals (PostgreSQL 10+)
SELECT client_addr, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- On Standby: check recovery status and time lag
SELECT pg_is_in_recovery() AS is_standby,
       pg_last_wal_receive_lsn() AS last_received,
       pg_last_wal_replay_lsn() AS last_applied,
       pg_last_xact_replay_timestamp() AS last_tx_time,
       CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
            THEN interval '0 seconds'
            ELSE now() - pg_last_xact_replay_timestamp()
       END AS replication_lag;

-- Count WAL files currently in pg_wal/ (a rough capacity check)
SELECT count(*) AS wal_files FROM pg_ls_waldir();
```

Replication slots (created with pg_create_physical_replication_slot(), or automatically by tools like pg_basebackup) ensure WAL is retained until the standby confirms receipt, preventing "requested WAL segment has already been removed" errors. However, if a standby goes offline while its slot remains, WAL accumulates indefinitely on the primary, potentially filling the disk. Monitor slot lag via pg_replication_slots, and cap retention with max_slot_wal_keep_size (PostgreSQL 13+) for standbys you can't monitor closely.
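As a concrete sketch of slot-based physical replication (the slot name is illustrative, and the cap value is a policy choice, not a recommendation):

```sql
-- On primary: create a physical replication slot for a standby
SELECT pg_create_physical_replication_slot('standby1_slot');

-- On standby: point at the slot (postgresql.auto.conf; pg_basebackup -R
-- can write this for you when given --slot)
-- primary_slot_name = 'standby1_slot'

-- On primary (PostgreSQL 13+): cap how much WAL any slot may retain,
-- as a disk-fill safety valve for unmonitored standbys
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();

-- Drop the slot when a standby is decommissioned so WAL can be recycled
SELECT pg_drop_replication_slot('standby1_slot');
```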
By default, streaming replication is asynchronous—the primary doesn't wait for standbys before confirming commits. This maximizes throughput but means committed transactions might be lost if the primary fails before standbys receive the WAL.
Synchronous replication addresses this by requiring the primary to wait for one or more standbys to confirm WAL receipt before returning success to the client.
Confirmation Levels:
| Level | Wait For | Durability | Performance Impact | Use Case |
|---|---|---|---|---|
| off | Nothing (async) | Data only in primary WAL buffer; up to 3 × wal_writer_delay of commits can be lost | Lowest latency | Logs, metrics, ephemeral data |
| local | Primary disk flush | Data survives primary crash | Low latency | Standard operations |
| remote_write | Standby OS cache | Data in standby memory; lost only if primary fails and standby OS crashes | Moderate latency | Good performance / safety balance |
| on (remote_flush) | Standby disk flush | Data on standby disk; survives primary loss | Higher latency | Important transactions |
| remote_apply | Standby has applied | Data queryable on standby immediately after commit | Highest latency | Read-after-write consistency needed |
```bash
# postgresql.conf on Primary

# Define the synchronous standby policy
# FIRST: wait for the first N standbys in priority order
# ANY:   wait for any N from the list (quorum-based)

synchronous_standby_names = 'FIRST 1 (standby1, standby2)'
# - Wait for standby1 (or standby2 if standby1 is unavailable)
# - standby2 becomes sync if standby1 disconnects

# Quorum-based alternative: require any 2 of 3 standbys
synchronous_standby_names = 'ANY 2 (standby1, standby2, standby3)'

# Combined with the commit level:
synchronous_commit = on    # Wait for sync standbys to flush

# Priority-based with two sync standbys: the first 2 reachable standbys
# in list order must confirm
synchronous_standby_names = 'FIRST 2 (standby1, standby2, standby3)'
```
```sql
-- Override for specific transactions (less critical data)
BEGIN;
SET LOCAL synchronous_commit = off;
INSERT INTO audit_log (event, timestamp) VALUES ('user_clicked_button', NOW());
COMMIT;  -- Returns immediately, doesn't wait for standby

-- Override for critical transactions (even if the default is weaker)
BEGIN;
SET LOCAL synchronous_commit = remote_apply;
UPDATE accounts SET balance = balance - 1000 WHERE id = 123;
COMMIT;  -- Waits until the standby has applied and can serve reads

-- Check the current setting
SHOW synchronous_commit;
```

If all synchronous standbys become unavailable while synchronous_commit = on, the primary blocks every commit until at least one standby reconnects. This protects data but can cause a complete application outage. Some architectures fall back to synchronous_commit = local when standbys are unavailable, trading durability for availability in failure scenarios.
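A minimal sketch of that fallback, applied cluster-wide by an operator or automation when all sync standbys are down (the trade-off should be a deliberate policy decision, not a default):

```sql
-- Temporarily accept local-only durability so the application keeps working
ALTER SYSTEM SET synchronous_commit = 'local';
SELECT pg_reload_conf();

-- Restore full synchronous durability once a standby reconnects
ALTER SYSTEM RESET synchronous_commit;
SELECT pg_reload_conf();
```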
Logical replication, introduced in PostgreSQL 10, replicates changes at the row level using a publish/subscribe model. It's more flexible than physical replication but requires more configuration and has different characteristics.
Key Concepts:
- Publication: a named set of tables (or all tables) on the source database whose changes are streamed out
- Subscription: the receiving side; it connects to a publication, performs the initial table sync, then applies ongoing changes
- Each subscription creates a logical replication slot on the publisher to track its progress
```sql
-- === On Publisher (source database) ===

-- Ensure wal_level supports logical decoding
-- postgresql.conf: wal_level = logical

-- Create a publication for specific tables
CREATE PUBLICATION sales_pub FOR TABLE orders, customers, products;

-- Or publish all tables
CREATE PUBLICATION all_tables_pub FOR ALL TABLES;

-- Add/remove tables from a publication
ALTER PUBLICATION sales_pub ADD TABLE order_items;
ALTER PUBLICATION sales_pub DROP TABLE products;

-- View publications
SELECT pubname, puballtables, pubinsert, pubupdate, pubdelete
FROM pg_publication;

-- === On Subscriber (target database) ===

-- Tables must exist with a compatible schema (create manually or via pg_dump -s)
CREATE TABLE orders (...);
CREATE TABLE customers (...);

-- Create the subscription (starts initial sync automatically)
CREATE SUBSCRIPTION sales_sub
  CONNECTION 'host=publisher_ip dbname=source user=replicator password=...'
  PUBLICATION sales_pub;

-- View subscription status
SELECT subname, subenabled, subslotname FROM pg_subscription;

-- View replication state per table
SELECT * FROM pg_subscription_rel;

-- Check replication lag (on the publisher)
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_type = 'logical';
```

Logical Replication Use Cases:
- Feeding selected tables to analytics or reporting systems
- Migrating between PostgreSQL major versions with minimal downtime
- Consolidating data from multiple sources into a single target
- Replicas that also accept local writes alongside replicated tables
Logical replication does NOT replicate: DDL changes (schema changes must be applied manually), sequence values (they don't sync automatically), large objects, or TRUNCATE (on versions before PostgreSQL 11). Tables must have a primary key or a REPLICA IDENTITY for UPDATE/DELETE operations to replicate. Consider these limitations in your replication strategy.
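For example, UPDATEs and DELETEs on a published table without a primary key fail on the publisher until a replica identity exists. A sketch with a hypothetical `page_views` table:

```sql
-- Option 1: use the full old row as the identity
-- (works with no key at all, but bloats WAL and slows apply)
ALTER TABLE page_views REPLICA IDENTITY FULL;

-- Option 2 (preferred): identify rows via a unique index
-- (the index must be unique, non-partial, and cover NOT NULL columns)
CREATE UNIQUE INDEX page_views_uidx ON page_views (session_id, viewed_at);
ALTER TABLE page_views REPLICA IDENTITY USING INDEX page_views_uidx;
```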
PostgreSQL supports advanced replication topologies beyond simple primary-standby configurations:
Cascading Replication:
Standbys can themselves act as sources for other standbys, creating a replication chain. This reduces load on the primary and enables complex topologies for geographic distribution.
```
                  ┌─────────────────┐
                  │     Primary     │
                  │    (US-East)    │
                  └────────┬────────┘
                           │ WAL Stream
            ┌──────────────┴──────────────┐
            │                             │
   ┌────────▼────────┐           ┌────────▼────────┐
   │   Standby #1    │           │   Standby #2    │
   │  (US-East DC2)  │           │    (EU-West)    │
   └────────┬────────┘           └────────┬────────┘
            │ Cascade                     │ Cascade
   ┌────────▼────────┐           ┌────────▼────────┐
   │   Standby #3    │           │   Standby #4    │
   │    (US-West)    │           │   (Asia-Pac)    │
   └─────────────────┘           └─────────────────┘
```

Benefits:
- Standby #1/#2 absorb the cascade load; the primary focuses on writes
- Regional standbys serve local reads with low latency
- Reduces cross-ocean bandwidth from the primary
```bash
# On the Cascading Source Standby (Standby #1)
# postgresql.conf additions:
hot_standby = on
max_wal_senders = 5    # Allow this standby to send WAL

# On the Downstream Standby (Standby #3)
# pg_basebackup from Standby #1, not the Primary:
pg_basebackup -h standby1_ip -D /var/lib/postgresql/data -U replicator -P -R
```

Delayed Standbys:
A delayed standby intentionally applies WAL with a time delay. This provides a recovery point for human errors—if someone accidentally drops a table, you have a window to recover from the delayed standby before it applies the destructive command.
```sql
-- On the Delayed Standby's postgresql.conf
-- (or postgresql.auto.conf via ALTER SYSTEM)
recovery_min_apply_delay = '1 hour'

-- This standby will always be 1 hour behind the primary.
-- If disaster strikes, you have 1 hour to:
--   1. Stop replication
--   2. Promote the delayed standby
--   3. Extract the data before the destructive operation

-- Common delay values:
--   15 minutes: quick recovery from operational errors
--   1 hour:     buffer for detection and response
--   24 hours:   overnight protection for major changes

-- Check the delay on the standby
SELECT pg_last_wal_replay_lsn(),
       pg_last_xact_replay_timestamp(),
       NOW() - pg_last_xact_replay_timestamp() AS actual_delay;
```

When recovering from a delayed standby: (1) pause replay at the right point using pg_wal_replay_pause(), (2) advance to just before the problematic transaction, (3) promote the standby to recover the data. This requires knowing approximately when the bad transaction occurred; combine it with good monitoring and alerting for effective disaster recovery.
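A sketch of that recovery flow (the timestamp is illustrative; pg_promote() requires PostgreSQL 12+):

```sql
-- 1. On the delayed standby: stop replay immediately, before the bad
--    transaction is applied
SELECT pg_wal_replay_pause();
SELECT pg_is_wal_replay_paused();   -- verify replay is paused

-- 2. If finer positioning is needed, stop the server and restart it with a
--    recovery target just before the destructive change (postgresql.conf):
-- recovery_target_time   = '2024-01-15 14:29:00'
-- recovery_target_action = 'pause'

-- 3. Extract the lost data (e.g., pg_dump the affected tables), or promote
--    the standby to take over as the new primary:
SELECT pg_promote();
```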
Replication alone doesn't provide high availability—you need automated failover to promote a standby when the primary fails. Several tools and architectures enable this:
Common HA Solutions:
| Solution | Approach | Pros | Cons |
|---|---|---|---|
| Patroni + etcd/Consul | Cluster manager with consensus store | Industry standard, well-documented | Requires external consensus cluster |
| Stolon | Cloud-native, Kubernetes-focused | Great for K8s, automatic healing | Steeper learning curve for non-K8s |
| repmgr | Traditional cluster manager | Simple setup, SSH-based | Less robust failure detection |
| pg_auto_failover | Citus-backed, built-in monitor | Simple architecture, no external deps | Less flexible topology options |
| pgpool-II | Connection pooler with HA | Combined pooling and failover | Can be complex, legacy code concerns |
| Cloud Managed (RDS, Cloud SQL) | Provider-managed HA | Zero operational burden | Limited customization, vendor lock-in |
```yaml
# patroni.yml - Patroni configuration for high availability

scope: postgres-cluster
namespace: /service/
name: postgresql-node1

restapi:
  listen: 0.0.0.0:8008
  connect_address: node1:8008

etcd3:
  hosts: etcd1:2379,etcd2:2379,etcd3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576  # 1MB - don't fail over to a far-behind replica
    postgresql:
      use_pg_rewind: true  # Enable fast rejoin after failover
      parameters:
        max_connections: 200
        shared_buffers: 2GB
        wal_level: replica
        hot_standby: on
        max_wal_senders: 10
        synchronous_commit: on
        synchronous_standby_names: '*'  # Require a sync replica
  initdb:
    - encoding: UTF8
    - data-checksums

postgresql:
  listen: 0.0.0.0:5432
  connect_address: node1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: secret
    superuser:
      username: postgres
      password: secret
```
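Day-to-day cluster operations go through `patronictl`. A brief sketch, assuming the configuration above (cluster name `postgres-cluster`, member `postgresql-node1`):

```bash
# Show cluster members, roles, and replication lag
patronictl -c patroni.yml list

# Planned, graceful role change (e.g., before maintenance on the primary)
patronictl -c patroni.yml switchover postgres-cluster

# Restart a single member without changing roles
patronictl -c patroni.yml restart postgres-cluster postgresql-node1
```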
```
              ┌──────────────────────────────────────────┐
              │         Load Balancer (HAProxy)          │
              │      Reads:  port 5001 → all nodes       │
              │      Writes: port 5000 → primary only    │
              └──────────────────┬───────────────────────┘
                                 │
         ┌───────────────────────┼───────────────────────┐
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│     Node 1      │     │     Node 2      │     │     Node 3      │
│    (Primary)    │     │    (Replica)    │     │    (Replica)    │
│                 │     │                 │     │                 │
│ ┌─────────────┐ │     │ ┌─────────────┐ │     │ ┌─────────────┐ │
│ │ PostgreSQL  │ │     │ │ PostgreSQL  │ │     │ │ PostgreSQL  │ │
│ └─────────────┘ │     │ └─────────────┘ │     │ └─────────────┘ │
│ ┌─────────────┐ │     │ ┌─────────────┐ │     │ ┌─────────────┐ │
│ │   Patroni   │ │     │ │   Patroni   │ │     │ │   Patroni   │ │
│ └──────┬──────┘ │     │ └──────┬──────┘ │     │ └──────┬──────┘ │
└────────┼────────┘     └────────┼────────┘     └────────┼────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                ┌────────────────▼────────────────┐
                │    etcd Cluster (Consensus)     │
                │    Stores leader lock, config   │
                └─────────────────────────────────┘

  WAL streaming: Node 1 (primary) ──► Node 2 and Node 3 (replicas)
```

Failover Process:
1. The primary becomes unreachable
2. Patroni on the replicas detects this via etcd leader-lock expiry
3. The most up-to-date replica acquires the leader lock
4. Patroni promotes that replica to primary
5. The other replicas reconfigure to follow the new primary
6. HAProxy health checks detect the change and route writes to the new primary

With well-tuned Patroni, failover typically completes in 10-30 seconds. Most of this time is failure detection (waiting for timeouts to confirm the primary is truly down). Aggressive timeouts speed up failover but increase the risk of unnecessary failovers from transient network issues; balance them carefully against your availability requirements.
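The HAProxy front end in the diagram is typically wired to Patroni's REST API, whose `/primary` and `/replica` endpoints return HTTP 200 only on the leader and on healthy replicas, respectively. A minimal sketch (hostnames, ports, and timeouts are illustrative):

```
# haproxy.cfg excerpt — route writes to the leader, reads to replicas

listen postgres_write
    bind *:5000
    mode tcp
    option httpchk GET /primary          # 200 only on the current leader
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 node1:5432 check port 8008
    server node2 node2:5432 check port 8008
    server node3 node3:5432 check port 8008

listen postgres_read
    bind *:5001
    mode tcp
    option httpchk GET /replica          # 200 on healthy replicas
    http-check expect status 200
    server node1 node1:5432 check port 8008
    server node2 node2:5432 check port 8008
    server node3 node3:5432 check port 8008
```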
Effective replication requires attention to operational practices beyond initial setup:
```sql
-- Comprehensive replication health check (run on the primary)
SELECT client_addr AS standby_ip,
       state,
       sync_state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS send_lag_bytes,
       pg_size_pretty(pg_wal_lsn_diff(sent_lsn, flush_lsn)) AS flush_lag_bytes,
       pg_size_pretty(pg_wal_lsn_diff(flush_lsn, replay_lsn)) AS replay_lag_bytes,
       write_lag, flush_lag, replay_lag   -- time-based lag columns (PostgreSQL 10+)
FROM pg_stat_replication;

-- Check replication slot health
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal,
       CASE
         WHEN active THEN 'Healthy'
         WHEN pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1073741824
           THEN 'CRITICAL: >1GB retained'
         ELSE 'Inactive'
       END AS status
FROM pg_replication_slots;

-- Approximate WAL generation rate (for capacity planning)
SELECT pg_size_pretty(sum(size)) AS wal_24h,
       pg_size_pretty(sum(size) / 24) AS wal_per_hour
FROM pg_ls_waldir()
WHERE modification > now() - interval '24 hours';
```

The worst replication failure mode is split-brain: two nodes both accepting writes as primary. This usually happens when failover promotes a standby but the old primary recovers and resumes accepting writes. Prevention requires proper fencing (STONITH, "Shoot The Other Node In The Head") and consensus-based leader election. Never manually force a standby to primary without confirming the old primary is truly down.
We've explored PostgreSQL's comprehensive replication capabilities:
- Physical streaming replication for byte-identical standbys and HA failover
- Logical replication's publish/subscribe model for selective, cross-version replication
- Synchronous commit levels, from off through remote_apply, and their durability trade-offs
- Cascading and delayed standbys for geographic distribution and human-error recovery
- Automated failover architectures built on Patroni, etcd, and HAProxy
What's Next:
Now that we understand PostgreSQL's replication capabilities, the final page explores when to choose PostgreSQL—decision criteria, use case fit, and comparison with alternatives.
You now understand PostgreSQL's replication options in depth—from basic streaming replication to sophisticated HA architectures. These capabilities enable building systems that survive failures, scale reads, and recover from disasters. Next, we'll synthesize everything into guidance on when PostgreSQL is the right choice.