Consider a global e-commerce platform serving customers across six continents. During peak shopping events, traffic can spike 100x normal levels within minutes. A single active server—even with a standby ready—cannot handle this load. The company needs every server working simultaneously, sharing the traffic, each capable of handling any request from any user.
This is the domain of active-active redundancy: an architectural pattern where multiple nodes simultaneously serve production traffic, each fully capable of handling the complete workload. Unlike active-passive systems where standby nodes sit idle, active-active configurations maximize resource utilization while providing inherent fault tolerance.
But this power comes with complexity. When multiple nodes can modify the same data simultaneously, you introduce the possibility of conflicts. When traffic can arrive at any node, you need mechanisms to maintain consistency. When failures occur, remaining nodes must absorb additional load without dedicated standby capacity.
Active-active redundancy is essential for systems that must scale horizontally while maintaining high availability. Understanding its patterns, tradeoffs, and implementation strategies is critical for any engineer designing systems at scale.
By the end of this page, you will understand active-active architecture patterns, conflict resolution strategies, load distribution mechanisms, and the fundamental tradeoffs between consistency and availability. You'll learn when to choose active-active over active-passive and how to implement it correctly for different use cases.
Active-active redundancy, also known as multi-master, multi-primary, or peer-to-peer replication, is a high availability pattern where all nodes actively handle production traffic simultaneously. Each node maintains the full dataset and can serve both read and write requests.
The Core Principle:
Instead of designating one node as primary and others as standby, active-active treats all nodes as equals. Traffic is distributed across all nodes, and any node can process any request. When one node fails, the others continue serving traffic without interruption—there's no failover delay because there's no failover.
Key Characteristics:
| Aspect | Active-Passive | Active-Active |
|---|---|---|
| Resource Utilization | 50% (standby idle) | 100% (all nodes active) |
| Write Handling | Single node only | All nodes |
| Failover Required | Yes (detection + promotion) | No (immediate redistribution) |
| Implementation Complexity | Moderate | High |
| Conflict Potential | None (single writer) | High (multiple writers) |
| Consistency Model | Strong (single source) | Eventually consistent or complex coordination |
| Geographic Scaling | Limited by replication lag | Natural fit for global distribution |
When to Choose Active-Active:
High throughput requirements: When a single node cannot handle write volume, distributing writes across nodes becomes necessary.
Global user base: When users are distributed globally/regionally, active-active allows each region to have a local primary, reducing latency.
Maximum resource utilization: When cost efficiency demands that all provisioned capacity serves production traffic.
Zero-downtime requirements: When even brief failover windows are unacceptable, active-active's instant redistribution is essential.
When NOT to Choose Active-Active:
Strong consistency requirements: Financial transactions, inventory counts, and other scenarios requiring strict consistency are challenging with active-active.
Simple architecture preference: When operational simplicity is prioritized over performance optimization.
Limited engineering resources: Active-active systems require more expertise to design, implement, and operate correctly.
Active-active systems can be arranged in various topologies, each with distinct characteristics for replication, failure handling, and operational complexity.
1. Mesh Topology (Peer-to-Peer)
Every node connects directly to every other node. Changes propagate directly from source to all peers.
2. Ring Topology
Each node connects to adjacent nodes in a circular arrangement. Changes propagate around the ring.
3. Star Topology (Hub-and-Spoke)
A central coordinator receives all changes and distributes to all peers.
4. Hybrid Topology
Combines patterns—for example, mesh within regions with ring between regions.
```
┌──────────────────────────────────────────────────────────────┐
│                        GLOBAL TRAFFIC                         │
└───────────────────────────────┬──────────────────────────────┘
                                │
            ┌───────────────────┼───────────────────┐
            │                   │                   │
            ▼                   ▼                   ▼
  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐
  │     NODE A      │  │     NODE B      │  │     NODE C      │
  │    (US-EAST)    │  │    (EU-WEST)    │  │ (ASIA-PACIFIC)  │
  │ ┌─────────────┐ │  │ ┌─────────────┐ │  │ ┌─────────────┐ │
  │ │ Application │ │  │ │ Application │ │  │ │ Application │ │
  │ │  (Active)   │ │  │ │  (Active)   │ │  │ │  (Active)   │ │
  │ └─────────────┘ │  │ └─────────────┘ │  │ └─────────────┘ │
  │ ┌─────────────┐ │  │ ┌─────────────┐ │  │ ┌─────────────┐ │
  │ │  Database   │◄┼──┼►│  Database   │◄┼──┼►│  Database   │ │
  │ │ (Read/Write)│ │  │ │ (Read/Write)│ │  │ │ (Read/Write)│ │
  │ └─────────────┘ │  │ └─────────────┘ │  │ └─────────────┘ │
  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘
           │                    │                    │
           └────────────────────┼────────────────────┘
                                │
                   Bidirectional Replication
               (All nodes sync with all others)
```

Choose mesh for small clusters requiring the lowest latency. Use star when you need centralized conflict resolution. Consider hybrid topologies for multi-region deployments where intra-region latency differs significantly from inter-region latency. The right topology depends on your geographic distribution and consistency requirements.
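To make the topology differences concrete, here is a minimal TypeScript sketch (illustrative names only, not from any particular library) that computes each node's direct replication peers under the mesh, ring, and star arrangements described above:

```typescript
type Topology = "mesh" | "ring" | "star";

// Given a topology and the cluster membership, return the peers
// that a node pushes its changes to directly.
function replicationPeers(
  topology: Topology,
  nodes: string[],
  self: string,
  hub?: string // only used by the star topology
): string[] {
  const idx = nodes.indexOf(self);
  if (idx === -1) throw new Error(`unknown node: ${self}`);

  switch (topology) {
    case "mesh":
      // Every node pushes directly to every other node.
      return nodes.filter((n) => n !== self);
    case "ring":
      // Each node pushes only to its neighbor; changes travel around the ring.
      return [nodes[(idx + 1) % nodes.length]];
    case "star":
      // Spokes push to the hub; the hub fans out to every spoke.
      if (!hub) throw new Error("star topology requires a hub");
      if (self === hub) return nodes.filter((n) => n !== hub);
      return [hub];
  }
}

const cluster = ["us-east", "eu-west", "ap-southeast"];
console.log(replicationPeers("mesh", cluster, "us-east"));            // ["eu-west", "ap-southeast"]
console.log(replicationPeers("ring", cluster, "us-east"));            // ["eu-west"]
console.log(replicationPeers("star", cluster, "eu-west", "us-east")); // ["us-east"]
```

The tradeoff is visible in the peer counts: mesh propagates every change in one hop but needs O(N²) links, ring needs only O(N) links but a change may take N-1 hops to reach every node, and star keeps propagation at two hops while concentrating load and failure risk on the hub.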
The defining challenge of active-active systems is conflict resolution. When multiple nodes can modify the same data simultaneously, conflicting changes are inevitable. Consider this scenario:
Time T1: User in NYC updates product price to $99 (Node A)
Time T1: User in London updates same product price to £85 (Node B)
Time T2: Both changes replicate to all nodes
Question: What should the price be?
Without a deterministic conflict resolution strategy, different nodes could end up with different values, violating consistency. Active-active systems must define clear rules for handling conflicts.
Conflict Resolution Strategies:
1. Last-Write-Wins (LWW)
The most recent change (by timestamp) overwrites earlier changes. It is simple to implement but has significant drawbacks: clock skew between nodes can let the "wrong" write win, and the losing write is silently discarded with no record that data was lost.
2. First-Write-Wins (FWW)
The earliest change persists; later conflicting changes are rejected. Useful when the first value has special significance.
3. Merge Resolution
Conflicting changes are combined rather than one overwriting the other. Examples include merging both items into a shopping cart, summing concurrent counter increments, and taking the union of two tag sets.
4. Application-Specific Resolution
Business logic determines the winner. For example, a pricing system might keep the higher of two conflicting prices, or an order system might prefer a change made by the customer over one made by a batch job.
```typescript
interface ConflictedValue<T> {
  nodeId: string;
  timestamp: number;
  value: T;
  vector_clock: Record<string, number>;
}

type ConflictResolver<T> = (
  conflicts: ConflictedValue<T>[]
) => T;

// Returns true if clock `a` happened strictly before clock `b`:
// every entry in `a` is <= the matching entry in `b`, and at least one is smaller.
function happensBefore(
  a: Record<string, number>,
  b: Record<string, number>
): boolean {
  const keys = new Set([...Object.keys(a), ...Object.keys(b)]);
  let strictlyLess = false;
  for (const k of keys) {
    const av = a[k] ?? 0;
    const bv = b[k] ?? 0;
    if (av > bv) return false;
    if (av < bv) strictlyLess = true;
  }
  return strictlyLess;
}

// Last-Write-Wins resolver
const lastWriteWins: ConflictResolver<any> = (conflicts) => {
  return conflicts.reduce((latest, current) =>
    current.timestamp > latest.timestamp ? current : latest
  ).value;
};

// Merge resolver for numeric counters (sum deltas)
const sumCounters: ConflictResolver<number> = (conflicts) => {
  return conflicts.reduce((sum, c) => sum + c.value, 0);
};

// Business logic resolver (e.g., higher price wins)
const higherPriceWins: ConflictResolver<number> = (conflicts) => {
  return Math.max(...conflicts.map(c => c.value));
};

// Vector clock based resolver (causally latest)
const vectorClockResolver: ConflictResolver<any> = (conflicts) => {
  return conflicts.reduce((latest, current) => {
    if (happensBefore(latest.vector_clock, current.vector_clock)) {
      return current;
    }
    if (happensBefore(current.vector_clock, latest.vector_clock)) {
      return latest;
    }
    // Concurrent - fall back to timestamp
    return current.timestamp > latest.timestamp ? current : latest;
  }).value;
};
```

The best conflict resolution strategy is avoiding conflicts entirely. Techniques like data partitioning (each node owns different data), conflict-free replicated data types (CRDTs), and consensus protocols eliminate conflicts by design. Consider whether your data model can be restructured to make conflicts impossible rather than merely resolvable.
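As an illustration of the CRDT approach mentioned above, here is a minimal grow-only counter (G-Counter) sketch in TypeScript; the class and method names are hypothetical, not from a specific library:

```typescript
// A grow-only counter CRDT: each node increments only its own slot,
// so concurrent updates never conflict and merging is an element-wise max.
class GCounter {
  private counts: Record<string, number> = {};

  constructor(private nodeId: string) {}

  // Local update: only this node's slot is touched.
  increment(by: number = 1): void {
    this.counts[this.nodeId] = (this.counts[this.nodeId] ?? 0) + by;
  }

  // The counter's value is the sum of every node's slot.
  value(): number {
    return Object.values(this.counts).reduce((sum, n) => sum + n, 0);
  }

  // Merging replicas takes the max of each slot; the result is identical
  // no matter what order replicas are merged in, or how often.
  merge(other: GCounter): void {
    for (const [node, count] of Object.entries(other.counts)) {
      this.counts[node] = Math.max(this.counts[node] ?? 0, count);
    }
  }
}

// Two nodes increment concurrently; merging in either order converges to 5.
const a = new GCounter("node-a");
const b = new GCounter("node-b");
a.increment(3);
b.increment(2);
a.merge(b);
b.merge(a);
console.log(a.value(), b.value()); // 5 5
```

Because the merge is commutative, associative, and idempotent, replicas converge regardless of how replication messages are reordered, delayed, or duplicated, which is exactly why CRDTs sidestep conflict resolution.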
Active-active systems introduce fundamental questions about consistency. When data can be modified on multiple nodes simultaneously, what guarantees can we provide about the visibility and ordering of those modifications?
Eventual Consistency
The most common model for active-active systems. All replicas will eventually converge to the same state, but at any given moment, different replicas may return different values.
Causal Consistency
Operations that are causally related (one depends on the other) are seen in the same order by all nodes. Concurrent operations may be seen in different orders.
Strong Consistency (Linearizability)
All operations appear to occur instantaneously at a single point in time. Reads always return the most recent write.
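One common way to get stronger guarantees in an active-active cluster is quorum replication: with N replicas, require acknowledgements from W replicas on write and consult R replicas on read, choosing R + W > N so every read quorum overlaps every write quorum. The sketch below is a simplified, in-memory illustration of that rule (hypothetical names, no real networking):

```typescript
interface Versioned<T> {
  value: T;
  version: number; // monotonically increasing write version
}

// A toy quorum store over in-memory "replicas". With R + W > N, any read
// quorum intersects any write quorum, so reads see the latest acknowledged write.
class QuorumStore<T> {
  private replicas: Array<Map<string, Versioned<T>>>;
  private version = 0;

  constructor(private n: number, private w: number, private r: number) {
    if (r + w <= n) throw new Error("choose R + W > N for strong reads");
    this.replicas = Array.from({ length: n }, () => new Map());
  }

  write(key: string, value: T): void {
    const entry = { value, version: ++this.version };
    // Acknowledge once W replicas have the write (here: the first W).
    for (let i = 0; i < this.w; i++) {
      this.replicas[i].set(key, entry);
    }
  }

  read(key: string): T | undefined {
    // Query R replicas and keep the highest version seen.
    let latest: Versioned<T> | undefined;
    for (let i = this.n - this.r; i < this.n; i++) {
      const candidate = this.replicas[i].get(key);
      if (candidate && (!latest || candidate.version > latest.version)) {
        latest = candidate;
      }
    }
    return latest?.value;
  }
}

const store = new QuorumStore<number>(3, 2, 2); // N=3, W=2, R=2
store.write("price", 99);
console.log(store.read("price")); // 99 — the read quorum overlaps the write quorum
```

Raising W and R buys freshness at the cost of latency and availability; lowering them buys speed at the cost of potentially stale reads, which is the same tradeoff summarized in the table below.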
| Property | Eventual | Causal | Strong |
|---|---|---|---|
| Read Latency | Lowest (local) | Low | Higher (coordination) |
| Write Latency | Lowest (local) | Low | Higher (consensus) |
| Availability | Highest | High | Lower during partitions |
| Conflict Handling | Required | Reduced | None (prevented) |
| Implementation | Simple | Moderate | Complex |
| Data Freshness | May be stale | Causally fresh | Always fresh |
Choosing the Right Consistency Model:
The choice depends on your application's tolerance for inconsistency:
Use eventual consistency when: temporary staleness is acceptable, such as product catalogs, social feeds, analytics counters, and caches, and when availability and low latency matter more than always reading the latest value.
Use causal consistency when: operations build on one another and must appear in order, such as comment threads where a reply must never be visible before the comment it answers, or a profile update followed by a post that references it.
Use strong consistency when: acting on stale data causes real harm, such as account balances, inventory decrements at checkout, or uniqueness constraints like reserving a username.
The CAP theorem says that when a network partition occurs, a distributed system must choose between consistency and availability. Active-active systems typically choose availability (AP), accepting weaker consistency during partitions. If you need strong consistency (CP), an active-passive design built on consensus may be more appropriate.
In active-active systems, distributing traffic across nodes is critical for both performance and failure resilience. The distribution strategy affects latency, resource utilization, and conflict probability.
Geographic Routing
Route users to the nearest datacenter based on their geographic location, typically via GeoDNS or anycast. This minimizes latency and keeps most of a user's traffic, and therefore most of their writes, inside a single region.
Key-Based Routing (Sharding)
Route requests for specific data to specific nodes, for example by hashing a user ID or order ID. Because each key has a single home node, writes to the same record never race across nodes; a consistent-hashing sketch follows this list.
Round-Robin Distribution
Distribute requests evenly across all nodes regardless of content. This is simple and effective for stateless or read-heavy traffic, but it maximizes the chance that related writes land on different nodes.
Workload-Aware Distribution
Route based on request characteristics and node capacity, for example sending expensive analytical queries to the least-loaded node while keeping latency-sensitive requests in the caller's region.
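Here is a minimal consistent-hashing sketch (hypothetical helper names, not a production implementation) showing how key-based routing pins each key to one node while limiting reshuffling when nodes join or leave:

```typescript
import { createHash } from "crypto";

// Map a string to a position on a 32-bit hash ring.
function ringPosition(input: string): number {
  return createHash("md5").update(input).digest().readUInt32BE(0);
}

class ConsistentHashRouter {
  // Sorted list of (position, node) entries; each node gets several virtual points.
  private ring: Array<{ pos: number; node: string }> = [];

  constructor(nodes: string[], vnodes = 64) {
    for (const node of nodes) {
      for (let i = 0; i < vnodes; i++) {
        this.ring.push({ pos: ringPosition(`${node}#${i}`), node });
      }
    }
    this.ring.sort((a, b) => a.pos - b.pos);
  }

  // A key is owned by the first virtual point clockwise from its hash.
  nodeForKey(key: string): string {
    const pos = ringPosition(key);
    for (const entry of this.ring) {
      if (entry.pos >= pos) return entry.node;
    }
    return this.ring[0].node; // wrap around the ring
  }
}

const router = new ConsistentHashRouter(["us-east", "eu-west", "ap-southeast"]);
console.log(router.nodeForKey("user:12345")); // always the same node for this key
```

Virtual nodes spread each physical node around the ring, so removing a node redistributes only that node's share of keys instead of reshuffling everything.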
```typescript
interface RoutingConfig {
  regions: RegionConfig[];
  affinityRules: AffinityRule[];
}

class ActiveActiveRouter {
  constructor(private config: RoutingConfig) {}

  routeRequest(request: Request): Node {
    // Step 1: Check for key-based affinity
    const affinityNode = this.checkAffinity(request);
    if (affinityNode) {
      return affinityNode;
    }

    // Step 2: Route to user's nearest healthy region
    const userRegion = this.detectUserRegion(request.clientIP);
    const regionalNodes = this.getHealthyNodes(userRegion);
    if (regionalNodes.length > 0) {
      return this.selectLeastLoaded(regionalNodes);
    }

    // Step 3: Fall back to any healthy region
    const allHealthyNodes = this.config.regions
      .flatMap(r => this.getHealthyNodes(r.id));
    if (allHealthyNodes.length === 0) {
      throw new Error('No healthy nodes available');
    }
    return this.selectNearestHealthy(allHealthyNodes, userRegion);
  }

  private checkAffinity(request: Request): Node | null {
    for (const rule of this.config.affinityRules) {
      if (rule.matches(request)) {
        const affinityKey = rule.extractKey(request);
        return this.getNodeForKey(affinityKey);
      }
    }
    return null;
  }
}
```

Smart routing can dramatically reduce conflicts. If you route all requests for a specific user to the same node, that user's operations won't conflict. Combine geographic routing with key-based affinity to get low latency AND minimal conflicts. This transforms many active-active challenges into single-node operations.
One of active-active's primary advantages is simplified failure handling. Without failover, node failures cause traffic redistribution rather than promotion ceremonies.
Immediate Traffic Redistribution
When a node fails in active-active: health checks detect the failure, the load balancer or routing layer removes the failed node from rotation, and new requests are redistributed across the remaining healthy nodes.
This process typically completes in seconds versus the minutes required for active-passive failover.
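A minimal sketch of that detection-and-removal loop (hypothetical names; in practice this logic lives in your load balancer or service mesh): a node leaves rotation after a few consecutive failed health probes and returns as soon as a probe succeeds.

```typescript
interface TrackedNode {
  id: string;
  healthy: boolean;
  consecutiveFailures: number;
}

class HealthTracker {
  private nodes = new Map<string, TrackedNode>();

  constructor(nodeIds: string[], private failureThreshold = 3) {
    for (const id of nodeIds) {
      this.nodes.set(id, { id, healthy: true, consecutiveFailures: 0 });
    }
  }

  // Called after every periodic health probe.
  recordProbe(nodeId: string, succeeded: boolean): void {
    const node = this.nodes.get(nodeId);
    if (!node) return;
    if (succeeded) {
      // Any success resets the counter and restores the node to rotation.
      node.consecutiveFailures = 0;
      node.healthy = true;
    } else {
      node.consecutiveFailures++;
      // Only remove after several consecutive failures to avoid flapping.
      if (node.consecutiveFailures >= this.failureThreshold) {
        node.healthy = false;
      }
    }
  }

  // The router distributes traffic over this set only.
  healthyNodes(): string[] {
    return [...this.nodes.values()].filter(n => n.healthy).map(n => n.id);
  }
}

const tracker = new HealthTracker(["node-a", "node-b", "node-c"]);
tracker.recordProbe("node-b", false);
tracker.recordProbe("node-b", false);
tracker.recordProbe("node-b", false);
console.log(tracker.healthyNodes()); // ["node-a", "node-c"]
```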
Capacity Planning for Failures
Because active-active has no dedicated standby capacity, you must plan for failures:
N+1 Capacity: Provision one extra node so that if any single node fails, the remaining nodes can handle full load.
N+X Capacity: Provision X extra nodes to handle X simultaneous failures. The value of X depends on your availability requirements.
Automatic Scaling: Cloud environments can auto-scale to replace failed nodes, but this takes time. You need buffer capacity to handle load during scale-up.
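The arithmetic behind that buffer is worth making explicit. If N nodes share the load and one fails, the surviving nodes must absorb the failed node's share, so steady-state utilization has to leave that much headroom. A small sketch under that assumption (names are illustrative):

```typescript
// Given a cluster size and the number of simultaneous failures to survive,
// return the maximum steady-state utilization each node can safely run at.
function maxSafeUtilization(totalNodes: number, toleratedFailures: number): number {
  const survivors = totalNodes - toleratedFailures;
  if (survivors <= 0) throw new Error("cannot tolerate that many failures");
  // Total work stays the same, so per-survivor load grows by totalNodes / survivors.
  return survivors / totalNodes;
}

// With 4 nodes tolerating 1 failure, each node should stay under 75% busy;
// with 6 nodes tolerating 2 failures, under roughly 67%.
console.log(maxSafeUtilization(4, 1)); // 0.75
console.log(maxSafeUtilization(6, 2)); // 0.666...
```

Running hotter than this threshold means a single failure pushes the survivors past 100% utilization, turning a node loss into a cascading overload.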
Handling Network Partitions
Network partitions are particularly challenging for active-active: nodes on both sides of the partition keep accepting writes, the two sides diverge independently (split-brain), and the accumulated conflicts must be reconciled when connectivity returns.
Strategies for handling partitions: require a quorum (majority) of nodes for writes so the minority side degrades to read-only, designate one region as authoritative for each piece of data, or deliberately accept divergence and reconcile with your conflict resolution rules once the partition heals. A quorum-gated sketch follows.
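A minimal sketch of the quorum-gated approach, assuming each node knows how many peers it can currently reach (names are illustrative):

```typescript
// During a partition, only the side that can still reach a majority of the
// cluster keeps accepting writes; the minority side degrades to read-only.
function canAcceptWrites(clusterSize: number, reachablePeers: number): boolean {
  const reachableMembers = reachablePeers + 1; // include this node itself
  const majority = Math.floor(clusterSize / 2) + 1;
  return reachableMembers >= majority;
}

// A 5-node cluster split 3 / 2: the 3-node side keeps writing, the 2-node side does not.
console.log(canAcceptWrites(5, 2)); // true  (3 of 5 members reachable)
console.log(canAcceptWrites(5, 1)); // false (2 of 5 members reachable)
```

The cost of this strategy is availability: the minority side refuses writes until the partition heals, which is exactly the CP choice described earlier.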
CockroachDB
CockroachDB implements active-active with strong consistency using the Raft consensus protocol: data is split into ranges, each range is replicated across nodes and coordinated by its own Raft group, and any node can accept reads and writes while transactions remain serializable.
Amazon DynamoDB Global Tables
DynamoDB Global Tables provide active-active across AWS regions: every replica table accepts both reads and writes, changes replicate asynchronously between regions, and concurrent updates to the same item are resolved with last-writer-wins.
Cassandra Multi-Datacenter
Apache Cassandra supports active-active across datacenters: any node can coordinate a write, replication factors are configured per datacenter, consistency is tunable per query (for example, LOCAL_QUORUM for region-local latency), and conflicting writes are resolved last-write-wins using timestamps.
MySQL Group Replication (Multi-Primary)
MySQL Group Replication can operate in multi-primary mode: every member accepts writes, transactions are certified through the group communication layer before commit, and a transaction that conflicts with a concurrent transaction on another member is rolled back.
Managed active-active services like DynamoDB Global Tables, Cosmos DB with multi-region writes, and Spanner handle much of the complexity automatically. When available for your use case, these services can dramatically reduce operational burden while providing battle-tested implementations.
Active-active redundancy maximizes resource utilization and enables global scale, but requires careful handling of conflicts, consistency, and capacity planning.
Next Steps:
Active-active provides maximum utilization but assumes homogeneous failure probability. In the next page, we'll explore N+1 redundancy, a capacity planning pattern that explicitly quantifies how much extra capacity to maintain for failure tolerance.
You now understand active-active redundancy comprehensively—from topology patterns and conflict resolution through consistency models and failure handling. This pattern is essential for globally distributed systems requiring horizontal scalability and maximum resource utilization.