Understanding availability zones conceptually is only the first step. The real challenge lies in translating that understanding into production architectures that actually survive AZ failures. This page bridges theory and practice, providing you with concrete patterns, reference architectures, and implementation guidance for multi-AZ deployments.
Multi-AZ deployment isn't just about checking a box or deploying 'to multiple zones.' It requires thoughtful design across every layer of your stack—compute, storage, networking, and application logic. A single oversight can create a hidden single point of failure that nullifies your entire multi-AZ investment.
We'll examine reference architectures for common deployment patterns, walk through the critical decisions at each layer, and highlight the subtle mistakes that lead to failures despite ostensibly correct multi-AZ configurations.
By the end of this page, you will be able to design and implement multi-AZ architectures for stateless services, stateful services, and hybrid workloads. You'll understand reference architectures for web applications, databases, message queues, and caches—and know how to avoid the common pitfalls that undermine availability.
Before diving into specific architectures, let's establish the foundational principles that guide multi-AZ design. These principles apply regardless of the specific technologies or cloud provider you're using.
Principle 1: Every Component Must Have a Multi-AZ Story
For each component in your architecture, you must be able to answer: "What happens to this component when an AZ fails?" If the answer is "the application stops working," that component is a single point of failure.
This applies to every layer of the stack: compute, storage and databases, load balancers and networking, caches, queues, and supporting services such as DNS and service discovery.
The Multi-AZ Readiness Checklist:
For each component in your architecture, verify:
| Question | Good Answer | Red Flag |
|---|---|---|
| Where does this run? | Instances in 2-3+ AZs | Single AZ only |
| What happens if one instance fails? | Traffic routes elsewhere | Service degrades |
| What happens if an entire AZ fails? | Automatic failover to other AZs | Manual intervention needed |
| Is capacity sufficient with one AZ down? | Yes, remaining AZs handle full load | No, remaining AZs would be overloaded |
| How is state managed during failover? | Replicated or externalized | Stored locally, lost on failure |
| Is failover automated? | Yes, typically <5 minutes | No, requires human action |
| Has failover been tested? | Yes, within last quarter | No / don't know |
Simply deploying instances to multiple AZs doesn't guarantee availability. If your load balancer, database, or any other critical component remains single-AZ, you've just created an expensive multi-AZ deployment with a single point of failure. True multi-AZ requires end-to-end thinking.
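The checklist above lends itself to automation. As a minimal sketch (component names and AZ identifiers are illustrative), an inventory of each component's AZ placement can be scanned for hidden single points of failure:

```python
# Hypothetical audit: given each component's AZ placement, flag the
# single points of failure the checklist above is designed to catch.
def find_single_az_components(placements: dict[str, set[str]]) -> list[str]:
    """Return component names that run in fewer than two AZs."""
    return sorted(name for name, azs in placements.items() if len(azs) < 2)

placements = {
    "web-asg":  {"us-east-1a", "us-east-1b", "us-east-1c"},
    "database": {"us-east-1a", "us-east-1b"},
    "nat-gw":   {"us-east-1a"},  # hidden single point of failure
}
print(find_single_az_components(placements))  # ['nat-gw']
```

In practice you would feed this from infrastructure-as-code state or cloud inventory APIs rather than a hand-written dict.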
Stateless web applications are the simplest case for multi-AZ deployment because application instances can be freely replaced and traffic rerouted without session loss. Let's examine a production-grade reference architecture.
Architecture Components:
Internet
│
┌──────┴──────┐
│ Route 53 │ (Global DNS, health-checked)
│ (or equiv) │
└──────┬──────┘
│
┌──────┴──────┐
│Application │ (Multi-AZ, regional)
│Load Balancer│
└──────┬──────┘
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ AZ-A │ │ AZ-B │ │ AZ-C │
├─────────┤ ├─────────┤ ├─────────┤
│ Web x3 │ │ Web x3 │ │ Web x3 │
│ (ASG) │ │ (ASG) │ │ (ASG) │
└────┬────┘ └────┬────┘ └────┬────┘
│ │ │
└───────────────┼───────────────┘
│
┌──────┴──────┐
│ Multi-AZ │ (Primary in AZ-A,
│ Database │ Standby in AZ-B)
└─────────────┘
Layer-by-Layer Analysis:
1. DNS Layer (Route 53 / Cloud DNS): global, health-checked DNS that steers clients to the regional load balancer endpoint.
2. Load Balancer Layer: a regional, multi-AZ load balancer with nodes in every AZ, health-checking targets and routing traffic across zones.
3. Compute Layer (Auto Scaling Group): stateless web instances spread evenly across three AZs, automatically replaced on failure.
4. Database Layer: a multi-AZ database with the primary in one AZ and a standby in another, reached through a stable endpoint.
Capacity Planning Example:
| Total Required Capacity | Number of AZs | Capacity per AZ | Rationale |
|---|---|---|---|
| 100 requests/second | 3 AZs | ~50 rps each | 2 AZs can handle 100% load |
| 10 instances needed | 3 AZs | 5 per AZ | Any 2 AZs still provide the 10 instances needed |
Instead of fixed capacity per AZ, use target tracking scaling (e.g., maintain 60% average CPU) combined with sufficient minimum capacity. This allows the system to automatically scale up in remaining AZs during an AZ failure rather than requiring pre-provisioned headroom.
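The N-1 sizing rule from the table reduces to simple arithmetic: provision each AZ so that the surviving zones can absorb the full load. A small sketch:

```python
import math

def per_az_capacity(total_required: float, num_azs: int, az_failures: int = 1) -> float:
    """Capacity each AZ must provision so the surviving AZs still
    carry the full load after `az_failures` zones are lost."""
    surviving = num_azs - az_failures
    if surviving < 1:
        raise ValueError("cannot survive losing all AZs")
    return total_required / surviving

# 100 rps across 3 AZs, tolerating one AZ failure:
print(per_az_capacity(100, 3))        # 50.0 rps per AZ
# 10 instances across 3 AZs (round up to whole instances):
print(math.ceil(per_az_capacity(10, 3)))  # 5 instances per AZ
```

With target tracking scaling, this figure becomes the minimum headroom check rather than a fixed provisioning target.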
Stateful services—databases, caches, message queues—require more careful multi-AZ design because they hold data that must survive failures. The key question is: how do you replicate state across AZs while maintaining performance and consistency?
Pattern: Synchronous Replication with Automatic Failover
┌─────────────────────────────────────────────────────────┐
│ Application Tier │
│ (connects to database via endpoint/DNS that │
│ automatically points to current primary) │
└──────────────────────────┬──────────────────────────────┘
│
┌──────┴──────┐
│ RDS/DB │
│ Endpoint │ (virtual, follows primary)
└──────┬──────┘
│
┌───────────────┴───────────────┐
│ │
┌────┴────┐ ┌────┴────┐
│ AZ-A │ │ AZ-B │
├─────────┤ ├─────────┤
│ PRIMARY │─────sync write─────→│ STANDBY │
│ (R/W) │ │ (DR) │
└─────────┘ └─────────┘
Behavior:
Implementation Notes:
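One note worth making concrete: the endpoint follows the primary after failover, but in-flight connections break and clients must retry. A minimal reconnection sketch, where `connect` is a hypothetical callable that raises `ConnectionError` while the standby is being promoted:

```python
import time

def connect_with_retry(connect, endpoint, retries=5, base_delay=0.5):
    """Reconnect through the stable DB endpoint after a failover.
    `connect` is a placeholder for your driver's connect call."""
    for attempt in range(retries):
        try:
            return connect(endpoint)
        except ConnectionError:
            if attempt == retries - 1:
                raise
            # exponential backoff while the endpoint flips to the new primary
            time.sleep(base_delay * 2 ** attempt)
```

Real drivers and connection pools often provide this behavior; the point is that the application layer must expect and absorb the brief outage rather than crash.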
Pattern: Quorum-Based Replication Across AZs
┌─────────────────────────────────────────────────────────┐
│ Application Tier │
│ (SDK/driver handles node discovery │
│ and routes to appropriate replicas) │
└──────────────────────────┬──────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ AZ-A │ │ AZ-B │ │ AZ-C │
├─────────┤ ├─────────┤ ├─────────┤
│ Node 1 │←───→│ Node 2 │←───→│ Node 3 │
│ Replica │ │ Replica │ │ Replica │
└─────────┘ └─────────┘ └─────────┘
│ │ │
└───────────────┴───────────────┘
Gossip Protocol
Behavior:
DynamoDB Specifics:
Cassandra/Scylla Specifics:
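The quorum arithmetic behind this pattern is worth making explicit: with one replica per AZ, strong consistency requires read quorum + write quorum > replica count, and a single AZ loss is survivable only if both quorums fit within the remaining replicas. A small sketch:

```python
def tolerates_one_az_loss(n: int, w: int, r: int) -> bool:
    """n replicas (one per AZ), writes need w acks, reads need r.
    Strong consistency requires r + w > n; one AZ loss is tolerated
    only if the surviving n - 1 replicas satisfy both quorums."""
    return r + w > n and w <= n - 1 and r <= n - 1

print(tolerates_one_az_loss(3, 2, 2))  # True: QUORUM/QUORUM with RF=3
print(tolerates_one_az_loss(3, 3, 1))  # False: writes require every AZ
```

This is why RF=3 with QUORUM reads and writes is the canonical three-AZ Cassandra configuration.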
Pattern: Clustered Cache with Multi-AZ Replication
┌─────────────────────────────────────────────────────────┐
│ Application Tier │
│ (client library handles cluster node discovery) │
└──────────────────────────┬──────────────────────────────┘
│
┌──────┴──────┐
│ ElastiCache │
│ Cluster │ (configuration endpoint)
└──────┬──────┘
│
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ AZ-A │ │ AZ-B │ │ AZ-C │
├─────────┤ ├─────────┤ ├─────────┤
│ Primary │ │ Replica │ │ Replica │
│ Shard 1 │────→│ Shard 1 │ │ │
│ │ │ │ │ │
│ Replica │ │ Primary │ │ Replica │
│ Shard 2 │ │ Shard 2 │────→│ Shard 2 │
└─────────┘ └─────────┘ └─────────┘
Behavior:
ElastiCache Redis Specifics:
Design Consideration:
An alternative to multi-AZ stateful services is externalizing state from your application tier. Instead of managing session state in your application, store it in a multi-AZ cache or database. This makes your application tier stateless and simplifies multi-AZ deployment.
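Externalized sessions can be sketched with a thin store abstraction. Here the dict backend stands in for a multi-AZ store such as ElastiCache Redis (names and TTL defaults are illustrative):

```python
import json, time, uuid

# Sketch of externalized session state: the app tier keeps nothing
# locally, so any instance in any AZ can serve any request.
class SessionStore:
    def __init__(self, backend=None):
        # backend: dict here; a Redis client in production
        self.backend = backend if backend is not None else {}

    def create(self, data: dict, ttl: int = 3600) -> str:
        sid = uuid.uuid4().hex
        self.backend[sid] = (json.dumps(data), time.time() + ttl)
        return sid

    def get(self, sid: str):
        entry = self.backend.get(sid)
        if entry is None or entry[1] < time.time():
            return None  # missing or expired
        return json.loads(entry[0])

store = SessionStore()
sid = store.create({"user": "alice"})
print(store.get(sid))  # {'user': 'alice'}
```

Swapping the dict for a Redis client (with `SETEX`-style TTLs) keeps the interface identical while making the state survive instance and AZ loss.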
Message queues and event buses are critical infrastructure for decoupled, event-driven architectures. Their multi-AZ deployment is essential because queue unavailability can cascade to producers (backpressure) and consumers (starvation).
Pattern: Managed Multi-AZ Queues
Managed queue services like SQS are inherently multi-AZ:
┌─────────────────────────────────────────────────────────┐
│ Producers │
│ (any AZ, sends to SQS endpoint) │
└──────────────────────────┬──────────────────────────────┘
│
┌──────┴──────┐
│ SQS │ (regional service,
│ Queue │ replicated across AZs)
└──────┬──────┘
│
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ AZ-A │ │ AZ-B │ │ AZ-C │
├─────────┤ ├─────────┤ ├─────────┤
│Consumer │ │Consumer │ │Consumer │
│ x3 │ │ x3 │ │ x3 │
└─────────┘ └─────────┘ └─────────┘
SQS Behavior:
Multi-AZ Consumer Deployment:
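Because SQS is regional, consumers in any surviving AZ simply keep polling. A minimal consumer-loop sketch, with the client injected so the same loop runs against boto3's SQS client or a test stub (queue URL and handler are placeholders):

```python
def drain_queue(sqs, queue_url, handler, max_batches=1):
    """Poll and process one or more batches. Handlers must be
    idempotent: if a consumer's AZ dies mid-processing, the message
    reappears after its visibility timeout and is redelivered."""
    processed = 0
    for _ in range(max_batches):
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling reduces empty receives
        )
        for msg in resp.get("Messages", []):
            handler(msg["Body"])
            # delete only after successful processing
            sqs.delete_message(QueueUrl=queue_url,
                               ReceiptHandle=msg["ReceiptHandle"])
            processed += 1
    return processed
```

Run identical consumer groups in every AZ; the queue itself needs no failover logic.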
Pattern: Kafka Multi-AZ Cluster
┌─────────────────────────────────────────────────────────┐
│ Producers │
│ (Kafka clients with broker discovery) │
└──────────────────────────┬──────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │ │
┌────┴────┐ ┌────┴────┐ ┌────┴────┐
│ AZ-A │ │ AZ-B │ │ AZ-C │
├─────────┤ ├─────────┤ ├─────────┤
│Broker 1 │◄───►│Broker 2 │◄───►│Broker 3 │
│Zookeeper│ │Zookeeper│ │Zookeeper│
└─────────┘ └─────────┘ └─────────┘
│ │ │
└───────────────┼───────────────┘
Inter-broker
replication
Kafka Configuration for Multi-AZ:
# Broker configuration
broker.rack=az-a # (az-b, az-c for other brokers)
# Topic configuration
min.insync.replicas=2 # Require 2 AZs to acknowledge
default.replication.factor=3 # Replicate to all 3 AZs
# Producer configuration
acks=all # Wait for all in-sync replicas
Key Points:
- Set broker.rack to inform Kafka of AZ topology
- min.insync.replicas=2 ensures writes survive one AZ failure
RabbitMQ Configuration for Multi-AZ:
Self-managed message brokers (Kafka, RabbitMQ) require significant operational expertise to run reliably across AZs. Consider managed alternatives (Amazon MSK, Amazon MQ, Confluent Cloud) that handle multi-AZ replication for you—unless you have specific requirements that mandate self-management.
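The fault-tolerance math for the Kafka settings above is direct: with rack-aware placement putting one replica per AZ, producers using acks=all keep writing as long as min.insync.replicas replicas remain. A quick check:

```python
def writable_az_failures(replication_factor: int, min_insync_replicas: int) -> int:
    """With one replica per AZ, acks=all writes continue while at
    least min.insync.replicas replicas are alive; the difference is
    the number of simultaneous AZ losses writes can survive."""
    return max(replication_factor - min_insync_replicas, 0)

print(writable_az_failures(3, 2))  # 1: the config above tolerates one AZ down
print(writable_az_failures(3, 3))  # 0: any AZ loss halts writes
```

This is why raising min.insync.replicas to equal the replication factor trades away availability for durability.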
Service discovery—how components find and communicate with each other—is often overlooked in multi-AZ design. If your service discovery mechanism is single-AZ, your entire service mesh can fail when that AZ goes down.
Service Discovery Patterns:
1. DNS-Based Service Discovery
┌──────────────────────────────────────────┐
│ Client │
│ Resolves: api.internal.example.com │
└────────────────────┬─────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Private Hosted Zone │
│ (Route 53, Cloud DNS - multi-AZ) │
│ │
│ api.internal.example.com │
│ → 10.0.1.10 (AZ-A) │
│ → 10.0.2.10 (AZ-B) │
│ → 10.0.3.10 (AZ-C) │
│ │
│ (Health checks remove failed IPs) │
└──────────────────────────────────────────┘
Pros: Simple, works everywhere, no client changes required
Cons: DNS caching can cause stale records, slow failover
Best Practices:
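The main DNS-discovery risk is failover latency, which can be bounded roughly as health-check detection time plus one TTL for cached answers to expire. A back-of-the-envelope sketch (ignores resolvers that disrespect TTLs):

```python
def worst_case_dns_failover(ttl_s: int, check_interval_s: int,
                            failure_threshold: int) -> int:
    """Rough upper bound on how long clients may keep resolving a
    dead AZ's endpoint: time for health checks to declare failure
    plus one full TTL for cached answers to age out."""
    return check_interval_s * failure_threshold + ttl_s

# 30s TTL, checks every 10s, 3 consecutive failures required:
print(worst_case_dns_failover(30, 10, 3))  # 60 seconds
```

This is why low TTLs (30-60s) and aggressive health checks are standard practice for DNS-based discovery, despite the extra query load.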
2. Load Balancer-Based Discovery
┌──────────────────────────────────────────┐
│ Client │
│ Connects to: internal-api-lb.local │
└────────────────────┬─────────────────────┘
│
▼
┌──────────────────────────────────────────┐
│ Internal Load Balancer │
│ (Multi-AZ) │
│ │
│ Routes to healthy instances │
│ across all AZs │
└──────────────────────────────────────────┘
Pros: Built-in health checks, automatic failover, cross-zone balancing
Cons: Adds latency and cost, potential bottleneck
Best Practices:
3. Service Mesh / Sidecar Discovery
┌─────────────────────────────────────────────────────────┐
│ Service A Pod Service B Pod │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Application │ │ Application │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ┌──────┴──────┐ ┌──────┴──────┐ │
│ │ Sidecar │←───mTLS─────────→│ Sidecar │ │
│ │ (Envoy) │ │ (Envoy) │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ Control Plane │ │
│ └──────────────┬───────────────────┘ │
│ ┌─────┴─────┐ │
│ │ Istiod │ (Multi-AZ, HA) │
│ └───────────┘ │
└─────────────────────────────────────────────────────────┘
Pros: Fine-grained routing, observability, mTLS out of the box
Cons: Complexity, resource overhead, learning curve
Multi-AZ Considerations for Service Mesh:
4. Kubernetes-Native Discovery
┌─────────────────────────────────────────────────────────┐
│ Kubernetes Cluster │
│ (Nodes across 3 AZs) │
│ │
│ ┌─────────────┐ ┌──────────────────────────────┐ │
│ │ Service │───│ Endpoints │ │
│ │ (ClusterIP)│ │ Pod 1 (AZ-A): 10.244.1.5 │ │
│ │ │ │ Pod 2 (AZ-B): 10.244.2.3 │ │
│ └─────────────┘ │ Pod 3 (AZ-C): 10.244.3.7 │ │
│ └──────────────────────────────┘ │
│ │
│ kube-dns (CoreDNS) multi-AZ, resolves service names │
└─────────────────────────────────────────────────────────┘
Pros: Native to Kubernetes, automatic endpoint management
Cons: Only works within Kubernetes cluster
Multi-AZ Best Practices:
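In Kubernetes, the diagram's one-pod-per-AZ spread is typically enforced with topology spread constraints. A sketch of the relevant pod-spec fragment (the `app: web` label is illustrative):

```yaml
# Pod spec fragment: spread replicas evenly across zones so no single
# AZ holds a disproportionate share of the deployment.
topologySpreadConstraints:
- maxSkew: 1                                  # at most 1 pod imbalance between zones
  topologyKey: topology.kubernetes.io/zone    # well-known zone label on nodes
  whenUnsatisfiable: DoNotSchedule            # hard constraint; use ScheduleAnyway for soft
  labelSelector:
    matchLabels:
      app: web
```

Combined with a PodDisruptionBudget, this keeps the Service's endpoint list populated in every AZ even during node drains.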
Combine multiple service discovery mechanisms as fallback. For example, use service mesh as primary with DNS fallback. If your service mesh control plane has issues, DNS can still route traffic. Never have a single point of failure in discovery infrastructure.
Even experienced teams make multi-AZ deployment mistakes. Understanding these common pitfalls helps you avoid them in your own architectures.
Before: Single NAT Gateway (Broken)
VPC
┌──────────────────┐
│ AZ-A AZ-B │
│ ┌───┐ ┌───┐ │
│ │Sub│ │Sub│ │
│ │ A │ │ B │ │
│ └─┬─┘ └─┬─┘ │
│ │ │ │
│ └──┬──┘ │
│ │ │
│ ┌────┴────┐ │
│ │ NAT │ │
│ │ (AZ-A) │ │
│ └─────────┘ │
│ ↓ │
│ [internet] │
└──────────────────┘
If AZ-A fails, AZ-B has
no internet access
After: Per-AZ NAT Gateway (Correct)
VPC
┌──────────────────┐
│ AZ-A AZ-B │
│ ┌───┐ ┌───┐ │
│ │Sub│ │Sub│ │
│ │ A │ │ B │ │
│ └─┬─┘ └─┬─┘ │
│ │ │ │
│ ┌─┴─┐ ┌─┴─┐ │
│ │NAT│ │NAT│ │
│ │ A │ │ B │ │
│ └───┘ └───┘ │
│ ↓ ↓ │
│ [internet] │
└──────────────────┘
Each AZ routes through
its own NAT Gateway
Most multi-AZ mistakes hide in network configuration—NAT Gateways, VPC Endpoints, Route Tables, Security Groups. Conduct a thorough review of your VPC topology to ensure every AZ has independent network paths. Use infrastructure as code to enforce consistent patterns.
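A VPC review like the one described can be partially automated. As a minimal sketch over hypothetical subnet/NAT inventories (in practice populated from the EC2 API or Terraform state), flag any subnet whose default route exits through a NAT gateway in a different AZ:

```python
def cross_az_nat_routes(subnet_az: dict, subnet_nat: dict, nat_az: dict) -> list:
    """Return subnets whose default route uses a NAT in another AZ,
    i.e. subnets that lose internet access if that other AZ fails."""
    return sorted(
        s for s, nat in subnet_nat.items()
        if nat_az[nat] != subnet_az[s]
    )

subnet_az  = {"sub-a": "az-a", "sub-b": "az-b"}
nat_az     = {"nat-a": "az-a"}
subnet_nat = {"sub-a": "nat-a", "sub-b": "nat-a"}  # the "before" diagram
print(cross_az_nat_routes(subnet_az, subnet_nat, nat_az))  # ['sub-b']
```

Running a check like this in CI against your infrastructure-as-code output turns the "hidden in network configuration" class of mistakes into a failing build.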
A multi-AZ architecture is only as good as your ability to verify it works under failure conditions. Regular testing—chaos engineering—is essential to validate your design and surface hidden dependencies.
Chaos Engineering Principles for Multi-AZ:
Simulating AZ Failure:
| Technique | How It Works | Realism | Safety |
|---|---|---|---|
| Terminate all instances in AZ | Use AWS CLI/API to terminate EC2 instances | Medium | Start in staging |
| Block network traffic to AZ | Security groups/NACLs block AZ CIDR ranges | High | Can cause cascades |
| DNS manipulation | Remove AZ-specific endpoints from DNS | Low | Safe but incomplete |
| Load balancer target removal | De-register all targets in one AZ | Medium | Safe, easy rollback |
| AWS Fault Injection Service | Managed chaos experiments in AWS | High | Controlled experiments |
| Gremlin/Chaos Toolkit | Third-party chaos engineering platforms | High | Purpose-built tooling |
Testing Checklist:
| System Component | What to Test | Expected Behavior |
|---|---|---|
| Load Balancer | Remove all targets in one AZ | Traffic routes to remaining AZs |
| Auto Scaling | Terminate instances in one AZ | New instances launch (any AZ) |
| Database | Primary AZ goes down | Automatic failover to standby |
| Cache | Primary cache node fails | Promotion of replica without data loss |
| Message Queue | Consumer AZ unavailable | Messages remain queued, other consumers process |
| DNS | Health check detects failure | Failed endpoints removed from responses |
| Application | Connection to DB primary lost | Retry and reconnect to new primary |
AWS Fault Injection Service Example:
{
  "description": "Stop all EC2 instances in AZ-A",
  "targets": {
    "ec2-instances": {
      "resourceType": "aws:ec2:instance",
      "resourceArns": ["*"],
      "selectionMode": "ALL",
      "filters": [
        {
          "path": "Placement.AvailabilityZone",
          "values": ["us-east-1a"]
        }
      ]
    }
  },
  "actions": {
    "stop-instances": {
      "actionId": "aws:ec2:stop-instances",
      "parameters": {},
      "targets": {
        "Instances": "ec2-instances"
      }
    }
  },
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:...:alarm:high-error-rate"
    }
  ]
}
Schedule regular 'Game Days' where teams intentionally inject AZ failures in production-like environments. These exercises build operational muscle memory and uncover issues before real incidents. Start with staging, graduate to production with guard rails (automatic rollback triggers).
Multi-AZ deployment is the foundation of highly available cloud architecture. It requires end-to-end thinking—every component must have a multi-AZ story, and the interactions between components must be resilient to partial failures.
What's Next:
With multi-AZ deployment patterns mastered, we'll expand our scope to Cross-Region Deployments—how to design systems that survive regional failures, not just AZ failures. You'll learn about active-active multi-region, disaster recovery patterns, and the complexities of global data consistency.
You now have practical knowledge to implement multi-AZ deployments for stateless services, databases, caches, message queues, and service discovery. You understand the common mistakes to avoid and how to validate your architecture through chaos engineering. Next, we'll extend these concepts to cross-region architectures.