If regions are the continents of cloud infrastructure, Availability Zones (AZs) are the cities within them. Understanding AZs is essential because they are the fundamental unit of fault isolation in cloud architecture—the mechanism that allows you to survive infrastructure failures without losing service.
When a data center experiences a power outage, a network partition, or a cooling failure, what happens to your application? If you've deployed correctly across availability zones, the answer should be nothing visible to users. Traffic seamlessly routes to healthy infrastructure, and your service continues operating.
This isn't theoretical. AZ failures happen regularly in production environments. In April 2011, an AWS outage in one AZ of US-East-1 took down major services including Reddit, Quora, and Foursquare—not because AWS failed entirely, but because those services hadn't properly designed for multi-AZ resilience. The services that survived understood AZ architecture. This page will ensure you do too.
By the end of this page, you will understand what availability zones are at a physical and logical level, how cloud providers implement fault isolation between zones, what guarantees they provide, and how to reason about AZ architecture when designing your systems. You'll learn the critical distinctions between AZ failure modes and how to design infrastructure that remains available even when entire zones become unavailable.
An Availability Zone is a logically isolated section of a cloud provider's infrastructure within a geographic region. Each AZ consists of one or more discrete data centers, each with independent power, cooling, networking, and physical security. The key principle is fault isolation: failures in one AZ should not propagate to other AZs in the same region.
Physical Characteristics of Availability Zones:
While cloud providers keep the exact details confidential for security reasons, the general architecture follows consistent patterns:
Physical Separation: AZs are physically separated by a meaningful distance—typically 1-100 kilometers apart. This distance is carefully calibrated: far enough that a localized disaster such as a fire, flood, or power event cannot take out more than one zone, yet close enough that round-trip latency between zones still supports synchronous replication.
Independent Infrastructure:
| Isolation Dimension | Typical Implementation | Failure Mode Protected |
|---|---|---|
| Geographic Separation | 1-100 km between AZ data centers | Localized natural disasters, fires, explosions |
| Power Grid Independence | Different substations, redundant feeds | Grid failures, substation outages |
| Network Path Diversity | Multiple fiber routes, different carriers | Fiber cuts, network equipment failure |
| Cooling Independence | Separate HVAC per AZ | Cooling system failures |
| Physical Security | Separate buildings, access controls | Physical intrusion (theoretical) |
Logical Constructs:
From the customer perspective, AZs appear as abstract identifiers (e.g., us-east-1a, us-east-1b, us-east-1c). These identifiers map to physical infrastructure, but the mapping is randomized per AWS account. This means:
- Your us-east-1a might be a completely different physical data center than another customer's us-east-1a.
- To reliably coordinate AZ assignments across accounts, use AZ IDs (e.g., use1-az1, use1-az2), which are consistent identifiers that map to the same physical infrastructure regardless of account.
When coordinating deployments across multiple AWS accounts (common in enterprise environments), always use AZ IDs, not AZ names. If your production account's 'us-east-1a' and your DR account's 'us-east-1a' map to different physical AZs, you may have less disaster isolation than you intended.
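As a concrete illustration, the sketch below (assuming boto3 is installed and AWS credentials are configured) prints the AZ name to AZ ID mapping for the current account; comparing the output across accounts shows which names actually point at the same physical zone.

```python
# A minimal sketch: list each AZ name alongside its account-independent AZ ID.
# Assumes boto3 and configured AWS credentials; region is an example.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.describe_availability_zones(
    Filters=[{"Name": "zone-type", "Values": ["availability-zone"]}]
)

for az in response["AvailabilityZones"]:
    # ZoneName (e.g., us-east-1a) is randomized per account;
    # ZoneId (e.g., use1-az1) refers to the same physical AZ in every account.
    print(f"{az['ZoneName']} -> {az['ZoneId']}")
```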
To properly design for availability zones, you need to understand fault domain theory—the systematic study of how failures propagate through infrastructure and how to contain their impact.
What Is a Fault Domain?
A fault domain is a set of infrastructure components that can fail together due to a single root cause. Identifying fault domains helps you understand the blast radius of potential failures—how much of your infrastructure is affected when something goes wrong.
Hierarchy of Fault Domains in Cloud Infrastructure: from smallest to largest blast radius, the levels are the individual instance, the rack or host, the availability zone, the region, and the cloud provider itself (summarized in the table below).
The Blast Radius Containment Principle:
Effective high availability design is about containing blast radius at each level. The question isn't if failures will occur, but when, and whether your architecture limits their impact.
| Fault Domain Level | Expected Frequency | Typical Duration | Design Response |
|---|---|---|---|
| Instance | Multiple times daily | Seconds-minutes | Auto-scaling, health checks |
| Rack/Host | Weekly | Minutes-hours | Spread instances across racks |
| Availability Zone | 1-2 times per year | Hours | Multi-AZ deployment |
| Region | Very rare | Hours-days | Multi-region architecture |
| Provider | Extremely rare | Hours | Multi-cloud (if critical) |
Correlated vs. Independent Failures:
AZs are designed to have independent failure modes—a power grid issue in AZ-A shouldn't affect AZ-B's power. However, some failures can be correlated across zones: region-wide control plane problems, a bad application deployment rolled out to every zone at once, or capacity pressure when surviving zones must absorb a failed zone's traffic.
AZs protect against infrastructure failures, not application failures or operational errors.
Understanding fault domains is the theoretical foundation of chaos engineering. Tools like Netflix's Chaos Monkey deliberately inject failures at each fault domain level to verify your systems remain available. If you haven't tested your AZ failure response, you don't actually know if multi-AZ deployment works for your specific architecture.
While the concept of availability zones is universal, each major cloud provider implements them with slightly different characteristics and naming conventions.
AWS pioneered the modern AZ concept and provides the most explicit AZ model:
- AZ naming: {region}{az-letter} (e.g., us-east-1a, us-east-1b)
- AZ IDs (e.g., use1-az1) for cross-account coordination

AWS AZ Guarantees: AZs are physically separated by a meaningful distance (up to roughly 100 km apart), each with independent power, cooling, and networking, and are interconnected by private fiber with single-digit-millisecond round-trip latency.
Azure's approach to availability zones has evolved over time: zones arrived later than AWS's and were rolled out region by region, with the older Availability Set construct (fault and update domains within a single facility) remaining as a legacy alternative.

Azure-Specific Considerations: Azure distinguishes between zonal services (pinned to a zone you choose) and zone-redundant services (automatically replicated across zones), so check which model each service supports before assuming zone resilience.
GCP uses the term "zones" (not availability zones) with a slightly different model:
- Zone naming: {region}-{zone-letter} (e.g., us-central1-a, us-central1-b)

GCP-Specific Guidance: treat each zone as a single failure domain and prefer regional (multi-zone) resources where available; Compute Engine's live migration reduces maintenance-related disruption but is not a substitute for spreading workloads across zones.
| Feature | AWS | Azure | GCP |
|---|---|---|---|
| AZ Terminology | Availability Zone | Availability Zone | Zone |
| AZs per Region | 3-6 (minimum 3) | Usually 3 | 3-4 typical |
| AZ Naming | us-east-1a, 1b, 1c | Zone 1, 2, 3 | us-central1-a, b, c |
| Cross-Account ID | AZ IDs (use1-az1) | Zone ID | Zone names consistent |
| Legacy Alternative | N/A | Availability Sets | N/A |
| Auto Multi-Zone | Load balancers, RDS Multi-AZ | Zone-redundant services | Regional resources |
| Live Migration | Limited | Yes (some services) | Yes (Compute Engine) |
While implementation details differ, the core principle is consistent: spread critical workloads across multiple fault-isolated domains within a region. Whether you call them Availability Zones or just Zones, the design patterns are the same.
Understanding the network characteristics between availability zones is crucial for designing multi-AZ architectures. While AZs provide fault isolation, they're connected by high-speed, low-latency private networks that enable synchronous operations.
Latency Characteristics:
Within a region, inter-AZ latency is designed to support synchronous replication, typically on the order of 1-2 ms round trip. This low latency enables patterns that wouldn't be practical across regions, such as synchronous database replication and quorum writes:
| Communication Type | Typical Latency | Bandwidth | Cost |
|---|---|---|---|
| Same AZ (same VPC) | 0.1-0.5ms | Very high (10-100+ Gbps) | Free (within VPC) |
| Cross-AZ (same region) | 1-2ms | High (up to 25 Gbps) | $0.01-0.02/GB typically |
| Cross-Region (same continent) | 20-100ms | Limited by network | $0.02-0.09/GB |
| Cross-Region (intercontinental) | 100-300ms | Limited by network | $0.02-0.20/GB |
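If you want to sanity-check these figures in your own environment, a simple probe from an instance in one zone to a peer in another gives a rough estimate. The sketch below is illustrative only; the peer address and port are placeholders for an endpoint you control.

```python
# Rough inter-AZ latency probe: time TCP connection setup to a peer in another AZ.
import socket
import statistics
import time

PEER_HOST = "10.0.2.15"   # hypothetical private IP of an instance in another AZ
PEER_PORT = 22            # any open TCP port works for a connect-time probe

samples = []
for _ in range(20):
    start = time.perf_counter()
    with socket.create_connection((PEER_HOST, PEER_PORT), timeout=2):
        pass  # handshake completed; connect time approximates round-trip latency
    samples.append((time.perf_counter() - start) * 1000)

print(f"median connect time: {statistics.median(samples):.2f} ms")
```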
Data Transfer Costs:
Inter-AZ data transfer is not free on most cloud providers. While intra-AZ traffic within a VPC is typically free, crossing AZ boundaries incurs charges, commonly on the order of $0.01 to $0.02 per GB.
Cost Implications for Chatty Architectures:
Consider a microservices architecture where services communicate frequently:
Scenario: 100 services, each making 1,000 requests/second to other services
Average payload: 10 KB per request/response
Cross-AZ traffic per day:
= 100 services × 1,000 req/sec × 10 KB × 2 (bidirectional) × 86,400 sec/day
= 172.8 TB/day
At $0.01/GB:
= $1,728/day = ~$52,000/month just for inter-AZ transfer
This cost can be reduced through zone-aware (same-zone) routing for chatty service-to-service calls, caching frequently accessed data locally, and compressing or trimming large payloads.
Multi-AZ deployment is essential for availability, but it can significantly increase costs for data-intensive workloads. Always model your expected cross-AZ traffic and include it in your architecture cost estimates. The availability benefits justify the cost for critical systems, but you should understand the trade-off.
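As a starting point for that modeling, here is a small sketch that reproduces the arithmetic from the scenario above. The traffic profile, the $0.01/GB rate, and the fraction of traffic that actually crosses zones are all assumptions to replace with your own numbers.

```python
# A simple cost model mirroring the worked example above.
def monthly_cross_az_cost(services: int,
                          requests_per_sec: float,
                          payload_kb: float,
                          price_per_gb: float = 0.01,
                          cross_az_fraction: float = 1.0) -> float:
    """Estimate monthly inter-AZ data transfer cost in dollars."""
    gb_per_day = (services * requests_per_sec * payload_kb * 2   # bidirectional
                  * 86_400) / 1_000_000                          # KB -> GB
    return gb_per_day * cross_az_fraction * price_per_gb * 30

# The scenario above: 100 services, 1,000 req/s each, 10 KB payloads, all cross-AZ.
print(f"${monthly_cross_az_cost(100, 1_000, 10):,.0f}/month")    # ~ $51,840
```

In practice not every request crosses a zone boundary; with traffic spread evenly across three AZs, roughly two-thirds of randomly routed calls do, which is what the cross_az_fraction parameter is for.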
Understanding how availability zones fail is crucial for designing systems that survive those failures. AZ failures are not binary—they occur in various modes with different characteristics.
Failure Mode Classification: zone failures range from complete outages (power loss taking down everything in the zone) to partial failures (a single service impaired in one zone) to gray failures (elevated latency and error rates that never trip a clean "down" signal and are often the hardest to detect).
Recovery Patterns:
1. Automatic Failover (Preferred)
For stateless services behind a load balancer, health checks detect the failed targets, traffic shifts to instances in the healthy zones, and auto-scaling replaces the lost capacity without operator intervention.
2. Stateful Service Failover
For databases and stateful services, a standby replica in another AZ is promoted to primary and the service endpoint is updated, ideally by an automated failover mechanism rather than a manual runbook.
3. Workload Redistribution
When one AZ fails, the remaining AZs must absorb all of its traffic, so they need spare capacity provisioned in advance.
The N+1 AZ Rule:
Design your multi-AZ architecture such that losing any single AZ leaves sufficient capacity to handle full production load. For critical systems, consider N+2 (can lose two zones).
| Number of AZs | Each AZ Capacity | Survives Losing | Total Capacity Required |
|---|---|---|---|
| 2 AZs | 100% each | 1 AZ | 200% of peak load |
| 3 AZs | 50% each | 1 AZ | 150% of peak load |
| 3 AZs | 100% each | 2 AZs (N+2) | 300% of peak load |
| 4 AZs | ~33% each | 1 AZ | ~133% of peak load |
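The sizing rule behind this table reduces to a one-line formula: each AZ must carry peak load divided by the number of zones that survive. A quick sketch:

```python
# N+1 / N+2 sizing arithmetic: per-AZ capacity = peak / (zones - tolerated_failures).
def az_capacity(zones: int, tolerated_failures: int, peak_load: float = 1.0):
    surviving = zones - tolerated_failures
    if surviving < 1:
        raise ValueError("must keep at least one AZ")
    per_az = peak_load / surviving
    return per_az, per_az * zones   # (capacity per AZ, total provisioned capacity)

for zones, failures in [(2, 1), (3, 1), (3, 2), (4, 1)]:
    per_az, total = az_capacity(zones, failures)
    print(f"{zones} AZs, survive {failures}: {per_az:.0%} per AZ, {total:.0%} total")
```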
Theory is no substitute for testing. Regularly conduct chaos engineering exercises where you simulate AZ failures—terminate all instances in one AZ, block network traffic to an AZ, or failover databases. Verify that your monitoring detects the failure, your automation responds correctly, and user impact stays within acceptable bounds.
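One way to run such an exercise on AWS is sketched below. It is deliberately destructive, so it assumes boto3, a test-only environment, and a hypothetical opt-in tag (chaos-eligible) that marks instances as safe to terminate.

```python
# AZ-failure game day sketch: terminate every running, opt-in instance in one AZ
# and observe whether traffic shifts cleanly. Run only against a test environment.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
TARGET_AZ = "us-east-1a"

reservations = ec2.describe_instances(
    Filters=[
        {"Name": "availability-zone", "Values": [TARGET_AZ]},
        {"Name": "instance-state-name", "Values": ["running"]},
        {"Name": "tag:chaos-eligible", "Values": ["true"]},  # hypothetical opt-in tag
    ]
)["Reservations"]

instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
if instance_ids:
    ec2.terminate_instances(InstanceIds=instance_ids)
    print(f"terminated {len(instance_ids)} instances in {TARGET_AZ}")
```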
Compute instances are relatively straightforward to spread across AZs—they're ephemeral by design. Data persistence is more complex because you must balance availability, durability, consistency, and performance across AZ boundaries.
Storage Classes and AZ Behavior:
| Storage Type | AZ Behavior | Use Case | Tradeoff |
|---|---|---|---|
| Instance Storage | Single AZ only, ephemeral | Temporary data, caches | Data lost on instance termination |
| Single-AZ Block Storage (EBS) | Replicated within one AZ | Standard persistent storage | Lost if AZ fails |
| Multi-AZ Block Storage | Synchronous replication to second AZ | Critical databases | Higher cost, slight latency |
| Object Storage (S3/GCS) | Automatically replicated across 3+ AZs | Durable object storage | Higher latency than block storage; object semantics only |
| Zone-Redundant File Systems | Replicated across AZs | Shared file storage | Higher cost |
Database Multi-AZ Patterns:
1. Synchronous Replication (Strong Consistency)
Write Request → Primary (AZ-A)
│
├──sync write──→ Replica (AZ-B)
│ │
│←── ACK ────────────┘
│
← ACK to client (write committed)
2. Asynchronous Replication (Eventual Consistency)
Write Request → Primary (AZ-A)
│
← ACK to client (immediate)
│
└──async──→ Replica (AZ-B)
(may lag seconds to minutes)
3. Quorum-Based Systems
For distributed databases like Cassandra, DynamoDB, or CockroachDB:
Write Request → Coordinator
│
├──write──→ Node (AZ-A) ✓
├──write──→ Node (AZ-B) ✓
└──write──→ Node (AZ-C) (may or may not complete)
If 2 of 3 ACKs received → Return success to client
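To make the quorum idea concrete, here is a toy simulation (not any particular database's client API) in which a write commits as soon as two of three simulated replicas acknowledge it, so one slow or failed AZ does not block the request.

```python
# Toy quorum write: succeed once a majority of simulated replicas acknowledge.
import concurrent.futures
import random
import time

REPLICAS = ["node-az-a", "node-az-b", "node-az-c"]   # hypothetical replica names
WRITE_QUORUM = 2                                      # majority of 3

def replicate(node: str, key: str, value: str) -> str:
    time.sleep(random.uniform(0.001, 0.05))           # simulated per-AZ write latency
    if random.random() < 0.1:
        raise TimeoutError(f"{node} did not respond") # simulated AZ trouble
    return node

def quorum_write(key: str, value: str) -> bool:
    acks = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(REPLICAS)) as pool:
        futures = [pool.submit(replicate, n, key, value) for n in REPLICAS]
        for f in concurrent.futures.as_completed(futures):
            try:
                f.result()
                acks += 1
            except TimeoutError:
                continue
            if acks >= WRITE_QUORUM:
                return True                           # commit without waiting for all
    return False

print("committed" if quorum_write("user:42", "hello") else "failed")
```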
Choosing a Pattern:
| Requirement | Pattern | Example |
|---|---|---|
| Zero data loss at any cost | Synchronous Multi-AZ | RDS Multi-AZ |
| High write throughput, some lag acceptable | Async replication | Read replicas |
| Massive scale, tunable consistency | Quorum-based | DynamoDB, Cassandra |
| Simple operations, moderate scale | Managed sync | Aurora, Cloud Spanner |
Multi-AZ databases protect against data loss (durability) AND reduce downtime (availability). Synchronous replication to another AZ means you have a current copy even if the primary AZ fails. But having that copy is only useful if you can fail over to it quickly—hence the importance of automated failover mechanisms.
Load balancers are the traffic directors of multi-AZ architectures, responsible for distributing requests across healthy instances in multiple zones. Understanding how load balancers interact with AZs is essential for achieving true high availability.
Cross-Zone Load Balancing:
Depending on the provider and load balancer type, each load balancer node may by default distribute traffic only to targets in its own AZ. Cross-zone load balancing enables distribution across all registered targets in all AZs.
Without Cross-Zone Load Balancing:
   AZ-A (60% of DNS)                AZ-B (40% of DNS)
          │                                │
  Load Balancer Node A             Load Balancer Node B
          │                                │
     ┌────┴────┐                      ┌────┴────┐
     │         │                      │         │
Instance 1  Instance 2           Instance 3  Instance 4
     │         │                      │         │
 Gets 30%   Gets 30%              Gets 20%   Gets 20%
Pattern Problems: because DNS weights traffic by AZ rather than by instance count, each instance in AZ-A receives 30% of requests while each instance in AZ-B receives only 20%, and the imbalance grows worse whenever the zones have unequal numbers of healthy instances.
With Cross-Zone Load Balancing:
            Unified Load Balancer
                      │
      ┌─────────┬─────────┬─────────┐
      │         │         │         │
      ▼         ▼         ▼         ▼
 Instance 1  Instance 2  Instance 3  Instance 4
      │         │         │         │
  Gets 25%   Gets 25%   Gets 25%   Gets 25%
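On AWS, for example, cross-zone load balancing is a load balancer attribute you can toggle via the API. The sketch below assumes boto3 and uses a placeholder Network Load Balancer ARN; Application Load Balancers enable the setting by default.

```python
# Enable cross-zone load balancing on an AWS Network Load Balancer (sketch).
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/net/my-nlb/abc123",  # placeholder
    Attributes=[
        # Distribute traffic across registered targets in every enabled AZ,
        # not just the AZ of the load balancer node that received the request.
        {"Key": "load_balancing.cross_zone.enabled", "Value": "true"},
    ],
)
```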
Health Checks and AZ Awareness:
Load balancer health checks verify that targets can serve traffic. When targets in an AZ fail health checks, the load balancer stops routing requests to them and shifts traffic to healthy targets in the remaining zones; once the failed targets pass health checks again, they are automatically returned to rotation.
Configuring Health Checks for AZ Resilience:
| Parameter | Recommendation | Rationale |
|---|---|---|
| Health check interval | 5-10 seconds | Balance between detection speed and load |
| Healthy threshold | 2-3 checks | Avoid flapping on transient issues |
| Unhealthy threshold | 2-3 checks | Fast removal of truly failed targets |
| Timeout | 2-5 seconds | Account for AZ latency |
| Health check path | /health endpoint | Verify application functionality, not just port open |
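As one AWS-flavored example, these recommendations map onto a target group's health check parameters. The sketch below assumes boto3 and uses a placeholder ARN; tune the exact values to your application.

```python
# Apply the health check recommendations above to an AWS target group (sketch).
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app/abc123",  # placeholder
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/health",          # verify application functionality, not just an open port
    HealthCheckIntervalSeconds=10,
    HealthCheckTimeoutSeconds=5,        # leaves headroom for cross-AZ latency
    HealthyThresholdCount=2,            # avoid flapping on transient issues
    UnhealthyThresholdCount=2,          # remove truly failed targets quickly
)
```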
Zonal Affinity (AZ-Aware Routing):
For latency-sensitive or cost-conscious applications, you can configure load balancers to prefer same-zone targets. This keeps most traffic off the metered cross-AZ links and avoids the extra 1-2 ms of inter-zone latency.

Providers and platforms use different names for this capability, but the behavior is the same: route to a healthy target in the caller's zone when one exists, and fall back to other zones when it does not.
When removing targets (during deployments or AZ failures), enable connection draining to allow in-flight requests to complete before cutting connections. A typical draining timeout of 30-300 seconds prevents abrupt disconnects that cause errors for users mid-request.
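On AWS, connection draining is configured as the target group's deregistration delay. A minimal sketch, assuming boto3 and a placeholder ARN:

```python
# Configure connection draining (deregistration delay) on an AWS target group.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-app/abc123",  # placeholder
    Attributes=[
        # Give in-flight requests up to 120 seconds to complete before a
        # deregistered target stops receiving traffic entirely.
        {"Key": "deregistration_delay.timeout_seconds", "Value": "120"},
    ],
)
```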
Availability zones are the fundamental building blocks of resilient cloud architecture. Understanding their characteristics, failure modes, and design patterns is essential for building systems that remain available when infrastructure fails.
What's Next:
With a solid understanding of individual AZ architecture, we'll move to Multi-AZ Deployments—the practical patterns for deploying applications across availability zones. You'll learn reference architectures, best practices, and common mistakes to avoid when building multi-AZ systems.
You now understand availability zone architecture at both conceptual and practical levels. You can reason about fault domains, provider implementations, inter-AZ communication, failure modes, data persistence patterns, and load balancing strategies. Next, we'll build on this foundation with concrete multi-AZ deployment patterns.