Understanding availability zones conceptually is only the first step. The real challenge lies in translating that understanding into production architectures that actually survive AZ failures. This page bridges theory and practice, providing you with concrete patterns, reference architectures, and implementation guidance for multi-AZ deployments.
Multi-AZ deployment isn't just about checking a box or deploying 'to multiple zones.' It requires thoughtful design across every layer of your stack—compute, storage, networking, and application logic. A single oversight can create a hidden single point of failure that nullifies your entire multi-AZ investment.
We'll examine reference architectures for common deployment patterns, walk through the critical decisions at each layer, and highlight the subtle mistakes that lead to failures despite ostensibly correct multi-AZ configurations.
By the end of this page, you will be able to design and implement multi-AZ architectures for stateless services, stateful services, and hybrid workloads. You'll understand reference architectures for web applications, databases, message queues, and caches—and know how to avoid the common pitfalls that undermine availability.
Before diving into specific architectures, let's establish the foundational principles that guide multi-AZ design. These principles apply regardless of the specific technologies or cloud provider you're using.
Principle 1: Every Component Must Have a Multi-AZ Story
For each component in your architecture, you must be able to answer: "What happens to this component when an AZ fails?" If the answer is "the application stops working," that component is a single point of failure.
This applies to every layer of the stack: compute, storage and databases, load balancers and networking, caches, queues, and supporting services such as DNS and service discovery.
The Multi-AZ Readiness Checklist:
For each component in your architecture, verify:
| Question | Good Answer | Red Flag |
|---|---|---|
| Where does this run? | Instances in 2-3+ AZs | Single AZ only |
| What happens if one instance fails? | Traffic routes elsewhere | Service degrades |
| What happens if an entire AZ fails? | Automatic failover to other AZs | Manual intervention needed |
| Is capacity sufficient with one AZ down? | Yes, remaining AZs handle full load | No, remaining AZs would be overloaded |
| How is state managed during failover? | Replicated or externalized | Stored locally, lost on failure |
| Is failover automated? | Yes, typically <5 minutes | No, requires human action |
| Has failover been tested? | Yes, within last quarter | No / don't know |
Simply deploying instances to multiple AZs doesn't guarantee availability. If your load balancer, database, or any other critical component remains single-AZ, you've just created an expensive multi-AZ deployment with a single point of failure. True multi-AZ requires end-to-end thinking.
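The checklist above lends itself to automation. As a minimal sketch (component names and AZ identifiers are illustrative), an inventory of each component's AZ placement can be scanned for hidden single points of failure:

```python
# Hypothetical audit: given each component's AZ placement, flag the
# single points of failure the checklist above is designed to catch.
def find_single_az_components(placements: dict[str, set[str]]) -> list[str]:
    """Return component names that run in fewer than two AZs."""
    return sorted(name for name, azs in placements.items() if len(azs) < 2)

placements = {
    "web-asg":  {"us-east-1a", "us-east-1b", "us-east-1c"},
    "database": {"us-east-1a", "us-east-1b"},
    "nat-gw":   {"us-east-1a"},  # hidden single point of failure
}
print(find_single_az_components(placements))  # ['nat-gw']
```

In practice you would feed this from infrastructure-as-code state or cloud inventory APIs rather than a hand-written dict.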
Stateless web applications are the simplest case for multi-AZ deployment because application instances can be freely replaced and traffic rerouted without session loss. Let's examine a production-grade reference architecture.
Architecture Components:
Internet
│
┌──────┴──────┐
│ Route 53 │ (Global DNS, health-checked)
│ (or equiv) │
└──────┬──────┘
│
┌──────┴──────┐
│Application │ (Multi-AZ, regional)
│Load Balancer│
└──────┬──────┘
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ AZ-A │ │ AZ-B │ │ AZ-C │
├─────────┤ ├─────────┤ ├─────────┤
│ Web x3 │ │ Web x3 │ │ Web x3 │
│ (ASG) │ │ (ASG) │ │ (ASG) │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└───────────────┼───────────────┘
│
┌──────┴──────┐
│ Multi-AZ │ (Primary in AZ-A,
│ Database │ Standby in AZ-B)
└─────────────┘
Layer-by-Layer Analysis:
1. DNS Layer (Route 53 / Cloud DNS): global, health-checked DNS that steers clients to the regional load balancer endpoint.
2. Load Balancer Layer: a regional, multi-AZ load balancer with nodes in every AZ, health-checking targets and routing traffic across zones.
3. Compute Layer (Auto Scaling Group): stateless web instances spread evenly across three AZs, automatically replaced on failure.
4. Database Layer: a multi-AZ database with the primary in one AZ and a standby in another, reached through a stable endpoint.
Capacity Planning Example:
| Total Required Capacity | Number of AZs | Capacity per AZ | Rationale |
|---|---|---|---|
| 100 requests/second | 3 AZs | ~50 rps each | 2 AZs can handle 100% load |
| 10 instances needed | 3 AZs | 5 per AZ | Any 2 AZs still provide the 10 instances needed |
Instead of fixed capacity per AZ, use target tracking scaling (e.g., maintain 60% average CPU) combined with sufficient minimum capacity. This allows the system to automatically scale up in remaining AZs during an AZ failure rather than requiring pre-provisioned headroom.
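The N-1 sizing rule from the table reduces to simple arithmetic: provision each AZ so that the surviving zones can absorb the full load. A small sketch:

```python
import math

def per_az_capacity(total_required: float, num_azs: int, az_failures: int = 1) -> float:
    """Capacity each AZ must provision so the surviving AZs still
    carry the full load after `az_failures` zones are lost."""
    surviving = num_azs - az_failures
    if surviving < 1:
        raise ValueError("cannot survive losing all AZs")
    return total_required / surviving

# 100 rps across 3 AZs, tolerating one AZ failure:
print(per_az_capacity(100, 3))        # 50.0 rps per AZ
# 10 instances across 3 AZs (round up to whole instances):
print(math.ceil(per_az_capacity(10, 3)))  # 5 instances per AZ
```

With target tracking scaling, this figure becomes the minimum headroom check rather than a fixed provisioning target.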
Stateful services—databases, caches, message queues—require more careful multi-AZ design because they hold data that must survive failures. The key question is: how do you replicate state across AZs while maintaining performance and consistency?
Pattern: Synchronous Replication with Automatic Failover
┌─────────────────────────────────────────────────────────┐
│ Application Tier │
│ (connects to database via endpoint/DNS that │
│ automatically points to current primary) │
└──────────────────────────┬──────────────────────────────┘
│
┌──────┴──────┐
│ RDS/DB │
│ Endpoint │ (virtual, follows primary)
└──────┬──────┘
│
┌───────────────┴───────────────┐
│ │
┌────┴────┐ ┌────┴────┐
│ AZ-A │ │ AZ-B │
├─────────┤ ├─────────┤
│ PRIMARY │─────sync write─────→│ STANDBY │
│ (R/W) │ │ (DR) │
└─────────┘ └─────────┘
Behavior:
Implementation Notes:
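One note worth making concrete: the endpoint follows the primary after failover, but in-flight connections break and clients must retry. A minimal reconnection sketch, where `connect` is a hypothetical callable that raises `ConnectionError` while the standby is being promoted:

```python
import time

def connect_with_retry(connect, endpoint, retries=5, base_delay=0.5):
    """Reconnect through the stable DB endpoint after a failover.
    `connect` is a placeholder for your driver's connect call."""
    for attempt in range(retries):
        try:
            return connect(endpoint)
        except ConnectionError:
            if attempt == retries - 1:
                raise
            # exponential backoff while the endpoint flips to the new primary
            time.sleep(base_delay * 2 ** attempt)
```

Real drivers and connection pools often provide this behavior; the point is that the application layer must expect and absorb the brief outage rather than crash.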
Pattern: Quorum-Based Replication Across AZs
┌─────────────────────────────────────────────────────────┐
│ Application Tier │
│ (SDK/driver handles node discovery │
│ and routes to appropriate replicas) │
└──────────────────────────┬──────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ AZ-A │ │ AZ-B │ │ AZ-C │
├─────────┤ ├─────────┤ ├─────────┤
│ Node 1 │←───→│ Node 2 │←───→│ Node 3 │
│ Replica │ │ Replica │ │ Replica │
└─────────┘ └─────────┘ └─────────┘
│ │ │
└───────────────┴───────────────┘
Gossip Protocol
Behavior:
DynamoDB Specifics:
Cassandra/Scylla Specifics:
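The quorum arithmetic behind this pattern is worth making explicit: with one replica per AZ, strong consistency requires read quorum + write quorum > replica count, and a single AZ loss is survivable only if both quorums fit within the remaining replicas. A small sketch:

```python
def tolerates_one_az_loss(n: int, w: int, r: int) -> bool:
    """n replicas (one per AZ), writes need w acks, reads need r.
    Strong consistency requires r + w > n; one AZ loss is tolerated
    only if the surviving n - 1 replicas satisfy both quorums."""
    return r + w > n and w <= n - 1 and r <= n - 1

print(tolerates_one_az_loss(3, 2, 2))  # True: QUORUM/QUORUM with RF=3
print(tolerates_one_az_loss(3, 3, 1))  # False: writes require every AZ
```

This is why RF=3 with QUORUM reads and writes is the canonical three-AZ Cassandra configuration.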
Pattern: Clustered Cache with Multi-AZ Replication
┌─────────────────────────────────────────────────────────┐
│ Application Tier │
│ (client library handles cluster node discovery) │
└──────────────────────────┬──────────────────────────────┘
│
┌──────┴──────┐
│ ElastiCache │
│ Cluster │ (configuration endpoint)
└──────┬──────┘
│
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ AZ-A │ │ AZ-B │ │ AZ-C │
├─────────┤ ├─────────┤ ├─────────┤
│ Primary │ │ Replica │ │ Replica │
│ Shard 1 │────→│ Shard 1 │ │ │
│ │ │ │ │ │
│ Replica │ │ Primary │ │ Replica │
│ Shard 2 │ │ Shard 2 │────→│ Shard 2 │
└─────────┘ └─────────┘ └─────────┘
Behavior:
ElastiCache Redis Specifics:
Design Consideration:
An alternative to multi-AZ stateful services is externalizing state from your application tier. Instead of managing session state in your application, store it in a multi-AZ cache or database. This makes your application tier stateless and simplifies multi-AZ deployment.
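Externalized sessions can be sketched with a thin store abstraction. Here the dict backend stands in for a multi-AZ store such as ElastiCache Redis (names and TTL defaults are illustrative):

```python
import json, time, uuid

# Sketch of externalized session state: the app tier keeps nothing
# locally, so any instance in any AZ can serve any request.
class SessionStore:
    def __init__(self, backend=None):
        # backend: dict here; a Redis client in production
        self.backend = backend if backend is not None else {}

    def create(self, data: dict, ttl: int = 3600) -> str:
        sid = uuid.uuid4().hex
        self.backend[sid] = (json.dumps(data), time.time() + ttl)
        return sid

    def get(self, sid: str):
        entry = self.backend.get(sid)
        if entry is None or entry[1] < time.time():
            return None  # missing or expired
        return json.loads(entry[0])

store = SessionStore()
sid = store.create({"user": "alice"})
print(store.get(sid))  # {'user': 'alice'}
```

Swapping the dict for a Redis client (with `SETEX`-style TTLs) keeps the interface identical while making the state survive instance and AZ loss.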
Message queues and event buses are critical infrastructure for decoupled, event-driven architectures. Their multi-AZ deployment is essential because queue unavailability can cascade to producers (backpressure) and consumers (starvation).
Pattern: Managed Multi-AZ Queues
Managed queue services like SQS are inherently multi-AZ:
┌─────────────────────────────────────────────────────────┐
│ Producers │
│ (any AZ, sends to SQS endpoint) │
└──────────────────────────┬──────────────────────────────┘
│
┌──────┴──────┐
│ SQS │ (regional service,
│ Queue │ replicated across AZs)
└──────┬──────┘
│
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ AZ-A │ │ AZ-B │ │ AZ-C │
├─────────┤ ├─────────┤ ├─────────┤
│Consumer │ │Consumer │ │Consumer │
│ x3 │ │ x3 │ │ x3 │
└─────────┘ └─────────┘ └─────────┘
SQS Behavior:
Multi-AZ Consumer Deployment:
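Because SQS is regional, consumers in any surviving AZ simply keep polling. A minimal consumer-loop sketch, with the client injected so the same loop runs against boto3's SQS client or a test stub (queue URL and handler are placeholders):

```python
def drain_queue(sqs, queue_url, handler, max_batches=1):
    """Poll and process one or more batches. Handlers must be
    idempotent: if a consumer's AZ dies mid-processing, the message
    reappears after its visibility timeout and is redelivered."""
    processed = 0
    for _ in range(max_batches):
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling reduces empty receives
        )
        for msg in resp.get("Messages", []):
            handler(msg["Body"])
            # delete only after successful processing
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
            processed += 1
    return processed
```

Run identical consumer groups in every AZ; the queue itself needs no failover logic.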
Pattern: Kafka Multi-AZ Cluster
┌─────────────────────────────────────────────────────────┐
│ Producers │
│ (Kafka clients with broker discovery) │
└──────────────────────────┬──────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ AZ-A │ │ AZ-B │ │ AZ-C │
├─────────┤ ├─────────┤ ├─────────┤
│Broker 1 │◄───►│Broker 2 │◄───►│Broker 3 │
│Zookeeper│ │Zookeeper│ │Zookeeper│
└─────────┘ └─────────┘ └─────────┘
│ │ │
└───────────────┼───────────────┘
Inter-broker
replication
Kafka Configuration for Multi-AZ:
# Broker configuration
broker.rack=az-a # (az-b, az-c for other brokers)
# Topic configuration
min.insync.replicas=2 # Require 2 AZs to acknowledge
default.replication.factor=3 # Replicate to all 3 AZs
# Producer configuration
acks=all # Wait for all in-sync replicas
Key Points:
- Set broker.rack to inform Kafka of AZ topology
- min.insync.replicas=2 ensures writes survive one AZ failure
RabbitMQ Configuration for Multi-AZ:
Self-managed message brokers (Kafka, RabbitMQ) require significant operational expertise to run reliably across AZs. Consider managed alternatives (Amazon MSK, Amazon MQ, Confluent Cloud) that handle multi-AZ replication for you—unless you have specific requirements that mandate self-management.
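The fault-tolerance math for the Kafka settings above is direct: with rack-aware placement putting one replica per AZ, producers using acks=all keep writing as long as min.insync.replicas replicas remain. A quick check:

```python
def writable_az_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """With one replica per AZ, acks=all writes continue while at
    least min.insync.replicas replicas are alive; the difference is
    the number of simultaneous AZ losses writes can survive."""
    return max(replication_factor - min_insync_replicas, 0)

print(writable_az_failures(3, 2))  # 1: the config above tolerates one AZ down
print(writable_az_failures(3, 3))  # 0: any AZ loss halts writes
```

This is why raising min.insync.replicas to equal the replication factor trades away availability for durability.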
Service discovery—how components find and communicate with each other—is often overlooked in multi-AZ design. If your service discovery mechanism is single-AZ, your entire service mesh can fail when that AZ goes down.
Service Discovery Patterns:
1. DNS-Based Service Discovery
┌──────────────────────────────────────────┐
│ Client │
│ Resolves: api.internal.example.com │
└────────────────────┬─────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Private Hosted Zone │
│ (Route 53, Cloud DNS - multi-AZ) │
│ │
│ api.internal.example.com │
│ → 10.0.1.10 (AZ-A) │
│ → 10.0.2.10 (AZ-B) │
│ → 10.0.3.10 (AZ-C) │
│ │
│ (Health checks remove failed IPs) │
└──────────────────────────────────────────┘
Pros: Simple, works everywhere, no client changes required
Cons: DNS caching can cause stale records, slow failover
Best Practices:
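The main DNS-discovery risk is failover latency, which can be bounded roughly as health-check detection time plus one TTL for cached answers to expire. A back-of-the-envelope sketch (ignores resolvers that disrespect TTLs):

```python
def worst_case_dns_failover(ttl_s: int, check_interval_s: int,
                            failure_threshold: int) -> int:
    """Rough upper bound on how long clients may keep resolving a
    dead AZ's endpoint: time for health checks to declare failure
    plus one full TTL for cached answers to age out."""
    return check_interval_s * failure_threshold + ttl_s

# 30s TTL, checks every 10s, 3 consecutive failures required:
print(worst_case_dns_failover(30, 10, 3))  # 60 seconds
```

This is why low TTLs (30-60s) and aggressive health checks are standard practice for DNS-based discovery, despite the extra query load.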
2. Load Balancer-Based Discovery
┌──────────────────────────────────────────┐
│ Client │
│ Connects to: internal-api-lb.local │
└────────────────────┬─────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Internal Load Balancer │
│ (Multi-AZ) │
│ │
│ Routes to healthy instances │
│ across all AZs │
└──────────────────────────────────────────┘
Pros: Built-in health checks, automatic failover, cross-zone balancing
Cons: Adds latency and cost, potential bottleneck
Best Practices:
3. Service Mesh / Sidecar Discovery
┌─────────────────────────────────────────────────────────┐
│ Service A Pod Service B Pod │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Application │ │ Application │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ┌──────┴──────┐ ┌──────┴──────┐ │
│ │ Sidecar │←───mTLS─────────→│ Sidecar │ │
│ │ (Envoy) │ │ (Envoy) │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ Control Plane │ │
│ └──────────────┬───────────────────┘ │
│ ┌─────┴─────┐ │
│ │ Istiod │ (Multi-AZ, HA) │
│ └───────────┘ │
└─────────────────────────────────────────────────────────┘
Pros: Fine-grained routing, observability, mTLS out of the box
Cons: Complexity, resource overhead, learning curve
Multi-AZ Considerations for Service Mesh:
4. Kubernetes-Native Discovery
┌─────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ (Nodes across 3 AZs) │
│ │
│ ┌─────────────┐ ┌──────────────────────────────┐ │
│ │ Service │───│ Endpoints │ │
│ │ (ClusterIP)│ │ Pod 1 (AZ-A): 10.244.1.5 │ │
│ │ │ │ Pod 2 (AZ-B): 10.244.2.3 │ │
│ └─────────────┘ │ Pod 3 (AZ-C): 10.244.3.7 │ │
│ └──────────────────────────────┘ │
│ │
│ kube-dns (CoreDNS) multi-AZ, resolves service names │
└─────────────────────────────────────────────────────────┘
Pros: Native to Kubernetes, automatic endpoint management
Cons: Only works within Kubernetes cluster
Multi-AZ Best Practices:
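In Kubernetes, the diagram's one-pod-per-AZ spread is typically enforced with topology spread constraints. A sketch of the relevant pod-spec fragment (the `app: web` label is illustrative):

```yaml
# Pod spec fragment: spread replicas evenly across zones so no single
# AZ holds a disproportionate share of the deployment.
topologySpreadConstraints:
- maxSkew: 1                                  # at most 1 pod imbalance between zones
  topologyKey: topology.kubernetes.io/zone    # well-known zone label on nodes
  whenUnsatisfiable: DoNotSchedule            # hard constraint; use ScheduleAnyway for soft
  labelSelector:
    matchLabels:
      app: web
```

Combined with a PodDisruptionBudget, this keeps the Service's endpoint list populated in every AZ even during node drains.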
Combine multiple service discovery mechanisms as fallback. For example, use service mesh as primary with DNS fallback. If your service mesh control plane has issues, DNS can still route traffic. Never have a single point of failure in discovery infrastructure.
Even experienced teams make multi-AZ deployment mistakes. Understanding these common pitfalls helps you avoid them in your own architectures.
Before: Single NAT Gateway (Broken)
VPC
┌──────────────────┐
│ AZ-A AZ-B │
│ ┌───┐ ┌───┐ │
│ │Sub│ │Sub│ │
│ │ A │ │ B │ │
│ └─┬─┘ └─┬─┘ │
│ │ │ │
│ └──┬──┘ │
│ │ │
│ ┌────┴────┐ │
│ │ NAT │ │
│ │ (AZ-A) │ │
│ └─────────┘ │
│ ↓ │
│ [internet] │
└──────────────────┘
If AZ-A fails, AZ-B has
no internet access
After: Per-AZ NAT Gateway (Correct)
VPC
┌──────────────────┐
│ AZ-A AZ-B │
│ ┌───┐ ┌───┐ │
│ │Sub│ │Sub│ │
│ │ A │ │ B │ │
│ └─┬─┘ └─┬─┘ │
│ │ │ │
│ ┌─┴─┐ ┌─┴─┐ │
│ │NAT│ │NAT│ │
│ │ A │ │ B │ │
│ └───┘ └───┘ │
│ ↓ ↓ │
│ [internet] │
└──────────────────┘
Each AZ routes through
its own NAT Gateway
Most multi-AZ mistakes hide in network configuration—NAT Gateways, VPC Endpoints, Route Tables, Security Groups. Conduct a thorough review of your VPC topology to ensure every AZ has independent network paths. Use infrastructure as code to enforce consistent patterns.
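A VPC review like the one described can be partially automated. As a minimal sketch over hypothetical subnet/NAT inventories (in practice populated from the EC2 API or Terraform state), flag any subnet whose default route exits through a NAT gateway in a different AZ:

```python
def cross_az_nat_routes(subnet_az: dict, subnet_nat: dict, nat_az: dict) -> list:
    """Return subnets whose default route uses a NAT in another AZ,
    i.e. subnets that lose internet access if that other AZ fails."""
    return sorted(
        s for s, nat in subnet_nat.items()
        if nat_az[nat] != subnet_az[s]
    )

subnet_az  = {"sub-a": "az-a", "sub-b": "az-b"}
nat_az     = {"nat-a": "az-a"}
subnet_nat = {"sub-a": "nat-a", "sub-b": "nat-a"}  # the "before" diagram
print(cross_az_nat_routes(subnet_az, subnet_nat, nat_az))  # ['sub-b']
```

Running a check like this in CI against your infrastructure-as-code output turns the "hidden in network configuration" class of mistakes into a failing build.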
A multi-AZ architecture is only as good as your ability to verify it works under failure conditions. Regular testing—chaos engineering—is essential to validate your design and surface hidden dependencies.
Chaos Engineering Principles for Multi-AZ:
Simulating AZ Failure:
| Technique | How It Works | Realism | Safety |
|---|---|---|---|
| Terminate all instances in AZ | Use AWS CLI/API to terminate EC2 instances | Medium | Start in staging |
| Block network traffic to AZ | Security groups/NACLs block AZ CIDR ranges | High | Can cause cascades |
| DNS manipulation | Remove AZ-specific endpoints from DNS | Low | Safe but incomplete |
| Load balancer target removal | De-register all targets in one AZ | Medium | Safe, easy rollback |
| AWS Fault Injection Service | Managed chaos experiments in AWS | High | Controlled experiments |
| Gremlin/Chaos Toolkit | Third-party chaos engineering platforms | High | Purpose-built tooling |
Testing Checklist:
| System Component | What to Test | Expected Behavior |
|---|---|---|
| Load Balancer | Remove all targets in one AZ | Traffic routes to remaining AZs |
| Auto Scaling | Terminate instances in one AZ | New instances launch (any AZ) |
| Database | Primary AZ goes down | Automatic failover to standby |
| Cache | Primary cache node fails | Promotion of replica without data loss |
| Message Queue | Consumer AZ unavailable | Messages remain queued, other consumers process |
| DNS | Health check detects failure | Failed endpoints removed from responses |
| Application | Connection to DB primary lost | Retry and reconnect to new primary |
AWS Fault Injection Service Example:
{
  "description": "Stop all EC2 instances in AZ-A",
  "targets": {
    "ec2-instances": {
      "resourceType": "aws:ec2:instance",
      "resourceArns": ["*"],
      "selectionMode": "ALL",
      "filters": [
        {
          "path": "Placement.AvailabilityZone",
          "values": ["us-east-1a"]
        }
      ]
    }
  },
  "actions": {
    "stop-instances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {},
      "targets": {
        "Instances": "ec2-instances"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:...:alarm:high-error-rate"
    }
  ]
}
Schedule regular 'Game Days' where teams intentionally inject AZ failures in production-like environments. These exercises build operational muscle memory and uncover issues before real incidents. Start with staging, graduate to production with guard rails (automatic rollback triggers).
Multi-AZ deployment is the foundation of highly available cloud architecture. It requires end-to-end thinking—every component must have a multi-AZ story, and the interactions between components must be resilient to partial failures.
What's Next:
With multi-AZ deployment patterns mastered, we'll expand our scope to Cross-Region Deployments—how to design systems that survive regional failures, not just AZ failures. You'll learn about active-active multi-region, disaster recovery patterns, and the complexities of global data consistency.
You now have practical knowledge to implement multi-AZ deployments for stateless services, databases, caches, message queues, and service discovery. You understand the common mistakes to avoid and how to validate your architecture through chaos engineering. Next, we'll extend these concepts to cross-region architectures.