Before a user's request touches your application code, before it queries your database or hits your cache, a critical decision has already been made: which region will handle this request? This decision—made in milliseconds by DNS servers, load balancers, and routing policies—determines the user's latency experience, affects your system's load distribution, and enables regional failover.
Traffic routing is the front door of multi-region architecture. A user in São Paulo types your URL; within 20 milliseconds, routing infrastructure has determined whether their request goes to US-East, EU-West, or a South American edge location. This invisible choreography happens billions of times daily, yet most users never know it exists.
This page explores the technologies and strategies that make intelligent traffic routing possible—from DNS fundamentals to sophisticated health-aware load balancers—giving you the tools to direct traffic with precision and resilience.
By the end of this page, you will be able to explain DNS-based geographic routing, implement global load balancing strategies, design health check systems that enable automatic failover, and choose appropriate routing policies for different multi-region configurations.
The Domain Name System (DNS) is the internet's phone book, translating human-readable domains into IP addresses. For multi-region systems, DNS becomes a routing layer, directing users to different IP addresses based on geography, health, or other factors.
How DNS Routing Works
When a user requests app.example.com:
1. The browser checks its local DNS cache (and the operating system's).
2. If no cached answer exists, the query goes to a recursive resolver, typically run by the ISP or a public DNS provider.
3. The resolver queries the authoritative name servers for example.com, which return an IP address for app.example.com.
4. The browser connects to that IP address.

With geographic DNS routing, step 3 becomes intelligent: the authoritative server considers the resolver's location (or uses EDNS Client Subnet for the user's actual location) and returns the IP address of the nearest/best region.
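You can observe geographic answers directly by attaching an EDNS Client Subnet option to a query and comparing the records returned for different client subnets. A minimal sketch using the dnspython library; the resolver, domain, and subnets are placeholders, and the resolver you query must forward ECS for the difference to show up:

```python
# Query a resolver with different EDNS Client Subnet hints and compare answers.
# Requires: pip install dnspython. The domain and subnets below are placeholders.
import dns.edns
import dns.message
import dns.query
import dns.rdatatype

RESOLVER = "8.8.8.8"         # a resolver that forwards EDNS Client Subnet
DOMAIN = "app.example.com"   # hypothetical geo-routed domain

def resolve_as_if_from(client_subnet: str, prefix_len: int = 24) -> list[str]:
    """Return the A records as seen by a client in the given subnet."""
    ecs = dns.edns.ECSOption(client_subnet, prefix_len)
    query = dns.message.make_query(DOMAIN, "A", use_edns=0, options=[ecs])
    response = dns.query.udp(query, RESOLVER, timeout=5.0)
    return [
        item.address
        for rrset in response.answer
        if rrset.rdtype == dns.rdatatype.A
        for item in rrset
    ]

if __name__ == "__main__":
    # RFC 5737 documentation subnets standing in for "a user in region X"
    for subnet in ("203.0.113.0", "198.51.100.0"):
        print(subnet, "->", resolve_as_if_from(subnet))
```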
DNS Record Types for Routing
A Record: Maps domain to IPv4 address
app.example.com. 60 IN A 54.192.1.1
AAAA Record: Maps domain to IPv6 address
app.example.com. 60 IN AAAA 2600:9000:5306:6f00::1
CNAME Record: Aliases one domain to another (used for cloud load balancer integration)
app.example.com. 60 IN CNAME d123abc.cloudfront.net.
Alias Record (AWS Route 53): AWS-specific, points to AWS resources without CNAME limitations
TTL (Time-To-Live) Considerations
TTL controls how long DNS responses are cached by resolvers and clients. For multi-region systems, TTL is a tradeoff: low TTLs let routing changes and failover take effect quickly but increase query load on your DNS infrastructure; high TTLs reduce load but delay failover until cached answers expire.
Best Practice: Use 60-300 second TTL for production services. This balances failover speed with DNS infrastructure load. For critical services with sub-minute failover requirements, consider supplementing DNS with IP-layer failover.
Limitations of DNS Routing

DNS routing is constrained by caching you don't control (resolvers and operating systems may hold answers even past the TTL), by decisions based on the resolver's location rather than the user's, and by failover that can only be as fast as cached answers expire.
Before planned failovers or infrastructure changes, lower your TTL well in advance: at least 24 hours before the change, set the TTL to its target value. This ensures cached answers expire before the change, enabling a faster cutover. Restore the higher TTL once stability is confirmed.
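One way to operationalize this is a small script that drops the record's TTL ahead of the window and restores it after the change has proven stable. A sketch using boto3's Route 53 API; the hosted zone ID, record name, and addresses are placeholders:

```python
# Lower a record's TTL ahead of a planned failover, then restore it later.
# Zone ID, record name, and addresses below are placeholders.
import boto3

route53 = boto3.client("route53")
HOSTED_ZONE_ID = "Z123456789"
RECORD_NAME = "app.example.com"

def set_ttl(ttl_seconds: int, addresses: list[str]) -> None:
    """UPSERT the A record with a new TTL, keeping the addresses you pass in."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": f"Set TTL to {ttl_seconds}s ahead of planned change",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "TTL": ttl_seconds,
                    "ResourceRecords": [{"Value": ip} for ip in addresses],
                },
            }],
        },
    )

# More than 24h before the change: shrink the TTL so caches expire quickly at cutover.
set_ttl(60, ["54.192.1.1"])
# After the change has been verified stable: restore a longer TTL.
# set_ttl(300, ["52.215.2.2"])
```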
Modern DNS services offer sophisticated routing policies that go beyond simple domain-to-IP mapping. Understanding these policies is essential for designing multi-region traffic distribution.
Simple (Round-Robin) Routing
Multiple IP addresses returned for a single domain; clients pick one (typically first):
app.example.com. 60 IN A 54.192.1.1
app.example.com. 60 IN A 52.215.2.2
app.example.com. 60 IN A 13.114.3.3
Weighted Routing
Assign weights to endpoints; DNS responses proportionally reflect weights:
App US-East (weight 70)
App EU-West (weight 30)
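The effect of the weights is proportional over many resolutions: roughly 70% of answers should point at US-East and 30% at EU-West. A quick simulation of that draw, using the example weights above:

```python
# Simulate how weighted DNS answers distribute across endpoints over many queries.
import random
from collections import Counter

ENDPOINTS = {"us-east": 70, "eu-west": 30}  # weights from the example above

def pick_endpoint() -> str:
    """Choose one endpoint with probability proportional to its weight."""
    names = list(ENDPOINTS)
    weights = [ENDPOINTS[n] for n in names]
    return random.choices(names, weights=weights, k=1)[0]

counts = Counter(pick_endpoint() for _ in range(10_000))
for region, n in counts.most_common():
    print(f"{region}: {n / 10_000:.1%}")   # expect roughly 70% / 30%
```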
Use cases: gradual region migrations, canary releases, and proportional load distribution across regions of different capacity. The Terraform configuration below shows weighted routing alongside the other Route 53 routing policies covered in this section:
```hcl
# AWS Route 53 Routing Policy Examples

# 1. GEOLOCATION ROUTING
# Route users based on geographic location
resource "aws_route53_record" "app_us" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  geolocation_routing_policy {
    continent = "NA" # North America
  }

  set_identifier = "us-east"

  alias {
    name                   = aws_lb.us_east.dns_name
    zone_id                = aws_lb.us_east.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "app_eu" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  geolocation_routing_policy {
    continent = "EU" # Europe
  }

  set_identifier = "eu-west"

  alias {
    name                   = aws_lb.eu_west.dns_name
    zone_id                = aws_lb.eu_west.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "app_default" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  geolocation_routing_policy {
    country = "*" # Default for unmatched locations
  }

  set_identifier = "default"

  alias {
    name                   = aws_lb.us_east.dns_name
    zone_id                = aws_lb.us_east.zone_id
    evaluate_target_health = true
  }
}

# 2. LATENCY-BASED ROUTING
# Route to the region with lowest latency from user's location
resource "aws_route53_record" "app_latency_us" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  latency_routing_policy {
    region = "us-east-1"
  }

  set_identifier = "us-east-latency"

  alias {
    name                   = aws_lb.us_east.dns_name
    zone_id                = aws_lb.us_east.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "app_latency_eu" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  latency_routing_policy {
    region = "eu-west-1"
  }

  set_identifier = "eu-west-latency"

  alias {
    name                   = aws_lb.eu_west.dns_name
    zone_id                = aws_lb.eu_west.zone_id
    evaluate_target_health = true
  }
}

# 3. FAILOVER ROUTING
# Primary/secondary configuration for active-passive
resource "aws_route53_record" "app_primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  failover_routing_policy {
    type = "PRIMARY"
  }

  set_identifier = "primary"

  alias {
    name                   = aws_lb.us_east.dns_name
    zone_id                = aws_lb.us_east.zone_id
    evaluate_target_health = true # Critical: enables failover
  }

  health_check_id = aws_route53_health_check.primary.id
}

resource "aws_route53_record" "app_secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"

  failover_routing_policy {
    type = "SECONDARY"
  }

  set_identifier = "secondary"

  alias {
    name                   = aws_lb.eu_west.dns_name
    zone_id                = aws_lb.eu_west.zone_id
    evaluate_target_health = true
  }
}

# 4. WEIGHTED ROUTING
# Distribute traffic by percentage
resource "aws_route53_record" "app_weighted_us" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"
  ttl     = 60

  weighted_routing_policy {
    weight = 80 # 80% of traffic
  }

  set_identifier  = "us-east-weighted"
  records         = ["54.192.1.1"]
  health_check_id = aws_route53_health_check.us_east.id
}

resource "aws_route53_record" "app_weighted_eu" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "A"
  ttl     = 60

  weighted_routing_policy {
    weight = 20 # 20% of traffic
  }

  set_identifier  = "eu-west-weighted"
  records         = ["52.215.2.2"]
  health_check_id = aws_route53_health_check.eu_west.id
}

# 5. HEALTH CHECK CONFIGURATION
resource "aws_route53_health_check" "primary" {
  fqdn              = "health.us-east.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = "3"
  request_interval  = "10"

  regions = [
    "us-east-1",
    "us-west-2",
    "eu-west-1"
  ]

  tags = {
    Name = "Primary Region Health Check"
  }
}
```

Geolocation Routing
Route based on the user's geographic location (country, continent, or US state); see the geolocation records in the Terraform example above.
Latency-Based Routing
AWS Route 53 maintains latency measurements between users' networks and AWS regions. Responses direct users to the lowest-latency region; see the latency-based records in the example above.
Failover Routing
Explicit primary/secondary configuration with health-check-driven failover; see the failover records in the example above.
Multivalue Answer Routing
Returns multiple healthy IP addresses (up to eight) per response; the client chooses which to use.
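Because the client receives several healthy addresses and picks among them itself, the resulting distribution depends on resolver and application behavior. A sketch of one reasonable client strategy, shuffling the answers and falling back to the next address on connection failure (the domain and port are placeholders):

```python
# Resolve a multivalue DNS name, shuffle the answers, and try each address in turn.
import random
import socket

DOMAIN = "app.example.com"   # placeholder multivalue-routed name
PORT = 443

def connect_with_fallback(timeout: float = 3.0) -> socket.socket:
    """Try each resolved address until one accepts a TCP connection."""
    infos = socket.getaddrinfo(DOMAIN, PORT, type=socket.SOCK_STREAM)
    random.shuffle(infos)  # spread load across the returned addresses
    last_error = None
    for family, socktype, proto, _, sockaddr in infos:
        try:
            return socket.create_connection(sockaddr[:2], timeout=timeout)
        except OSError as exc:
            last_error = exc   # dead address: move on to the next answer
    raise ConnectionError(f"all addresses for {DOMAIN} failed") from last_error
```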
| Policy | Best For | Considerations |
|---|---|---|
| Geolocation | Compliance, regulatory requirements | Configure a default record; routing for users near borders may be suboptimal |
| Latency | Performance optimization | AWS only; measures network, not application latency |
| Weighted | Migrations, canary, load balancing | Weights must be adjusted manually or via automation |
| Failover | Active-passive DR | Only two tiers; can combine with other policies |
| Multivalue | Simple load distribution with health | Client behavior varies; not true load balancing |
DNS-based routing has fundamental limitations: caching, coarse-grained decisions, and slow failover. Global Load Balancers (GLBs) operate at the network layer, providing more sophisticated and responsive traffic management.
How Global Load Balancers Work
Unlike DNS routing, GLBs typically use anycast IP addressing: multiple locations advertise the same IP address, and network routing (BGP) directs traffic to the nearest point of presence. From there, the GLB can make intelligent, per-connection decisions: choosing the closest healthy backend region, respecting capacity limits, and retrying or rerouting failed connections.
This provides a single, stable IP address for clients, failover that does not wait for DNS caches to expire, and health-aware routing at the connection level rather than at resolution time.
Major Global Load Balancer Solutions
AWS Global Accelerator
Google Cloud Load Balancing
Cloudflare
Azure Front Door
```hcl
# Google Cloud Global Load Balancer Configuration
# Single anycast IP serving traffic from multiple regions

# Backend service grouping regional instance groups
resource "google_compute_backend_service" "global_app" {
  name        = "global-app-backend"
  protocol    = "HTTP"
  port_name   = "http"
  timeout_sec = 30
  enable_cdn  = true

  # Health check for backend instances
  health_checks = [google_compute_health_check.app.id]

  # US region backend with capacity
  backend {
    group           = google_compute_region_instance_group_manager.us.instance_group
    balancing_mode  = "UTILIZATION"
    capacity_scaler = 1.0
    max_utilization = 0.8
  }

  # EU region backend with capacity
  backend {
    group           = google_compute_region_instance_group_manager.eu.instance_group
    balancing_mode  = "UTILIZATION"
    capacity_scaler = 1.0
    max_utilization = 0.8
  }

  # Asia region backend with capacity
  backend {
    group           = google_compute_region_instance_group_manager.asia.instance_group
    balancing_mode  = "UTILIZATION"
    capacity_scaler = 1.0
    max_utilization = 0.8
  }

  # Outlier detection for automatic ejection of unhealthy instances
  outlier_detection {
    consecutive_errors = 5
    interval {
      seconds = 10
    }
    base_ejection_time {
      seconds = 30
    }
    max_ejection_percent = 50
  }

  # Circuit breaker settings
  circuit_breakers {
    max_connections      = 1000
    max_pending_requests = 200
    max_requests         = 1000
    max_retries          = 3
  }
}

# Health check definition
resource "google_compute_health_check" "app" {
  name                = "app-health-check"
  check_interval_sec  = 5
  timeout_sec         = 5
  healthy_threshold   = 2
  unhealthy_threshold = 3

  http_health_check {
    port         = 8080
    request_path = "/healthz"
  }
}

# URL map for routing
resource "google_compute_url_map" "global_app" {
  name            = "global-app-urlmap"
  default_service = google_compute_backend_service.global_app.id

  # Path-based routing example
  host_rule {
    hosts        = ["app.example.com"]
    path_matcher = "app-paths"
  }

  path_matcher {
    name            = "app-paths"
    default_service = google_compute_backend_service.global_app.id

    path_rule {
      paths   = ["/api/*"]
      service = google_compute_backend_service.global_app.id
    }

    path_rule {
      paths   = ["/static/*"]
      service = google_compute_backend_bucket.static.id
    }
  }
}

# HTTPS proxy with TLS termination
resource "google_compute_target_https_proxy" "global_app" {
  name             = "global-app-https-proxy"
  url_map          = google_compute_url_map.global_app.id
  ssl_certificates = [google_compute_managed_ssl_certificate.app.id]
}

# Global forwarding rule with anycast IP
resource "google_compute_global_forwarding_rule" "global_app" {
  name       = "global-app-forwarding-rule"
  target     = google_compute_target_https_proxy.global_app.id
  port_range = "443"
  ip_address = google_compute_global_address.app.address
}

# Reserve a global anycast IP
resource "google_compute_global_address" "app" {
  name = "global-app-ip"
}

# Managed SSL certificate
resource "google_compute_managed_ssl_certificate" "app" {
  name = "app-cert"
  managed {
    domains = ["app.example.com"]
  }
}
```

GLB vs DNS Routing: When to Use Each
Prefer Global Load Balancer when: you need failover measured in seconds rather than DNS TTL expirations, stable anycast IP addresses for clients or allowlists, connection-level routing and retry decisions, or acceleration of non-HTTP (TCP/UDP) traffic.
Prefer DNS Routing when: your endpoints span multiple clouds or on-premises environments, coarse geographic or weighted steering at resolution time is sufficient, or you want to avoid the cost and operational surface of an additional global network layer.
Combined Approach (Common in Practice)
Many production systems combine DNS and GLB: DNS (latency-based or geolocation routing) delivers users to a regional or global entry point, while a GLB behind that name performs health-aware, connection-level distribution across backends. DNS acts as the coarse-grained first hop; the GLB provides fast, fine-grained control and failover.
With anycast, users always connect to the same IP address regardless of region. This simplifies DNS configuration, eliminates TTL-based failover delays, and allows instant traffic shifting. If you're using a cloud GLB, you're likely already using anycast.
Health checks are the nervous system of multi-region traffic routing. They continuously probe endpoints, detect failures, and trigger routing changes. Properly designed health checks are essential for reliable failover.
Health Check Types
TCP Health Checks: verify that the endpoint accepts connections on a port; cheap, but they say nothing about application behavior.
HTTP/HTTPS Health Checks: request a specific path and require an expected status code, confirming the application is responding.
Deep Health Checks (Application-Aware): exercise dependencies such as the database, cache, and downstream services to confirm the region can actually serve real requests.
Best Practice: Layered Health Checks
Implement multiple health endpoints:
/alive (liveness): Application process is running
/ready (readiness): Application can serve traffic
/health (deep): All dependencies are functional
```typescript
/**
 * Multi-Layer Health Check Implementation
 *
 * Provides endpoints for different health check scenarios:
 * - Liveness: Process is running
 * - Readiness: Ready to receive traffic
 * - Deep: All dependencies verified
 */

import { Router, Request, Response } from 'express';
import { Pool } from 'pg';
import Redis from 'ioredis';

interface HealthStatus {
  status: 'healthy' | 'degraded' | 'unhealthy';
  checks: Record<string, CheckResult>;
  timestamp: string;
  version: string;
  region: string;
}

interface CheckResult {
  status: 'pass' | 'fail' | 'warn';
  latencyMs?: number;
  message?: string;
}

class HealthChecker {
  private db: Pool;
  private redis: Redis;
  private appVersion: string;
  private region: string;

  constructor(db: Pool, redis: Redis) {
    this.db = db;
    this.redis = redis;
    this.appVersion = process.env.APP_VERSION || 'unknown';
    this.region = process.env.AWS_REGION || 'unknown';
  }

  /**
   * Liveness check: Is the process running?
   * Used by: Container orchestration (restart on fail)
   */
  async checkLiveness(): Promise<CheckResult> {
    // If this code executes, we're alive
    return { status: 'pass' };
  }

  /**
   * Readiness check: Can we serve traffic?
   * Used by: Load balancer (remove from rotation on fail)
   */
  async checkReadiness(): Promise<HealthStatus> {
    const checks: Record<string, CheckResult> = {};

    // Check database connection pool
    checks.database = await this.checkDatabase();

    // Check cache connection
    checks.cache = await this.checkCache();

    // Overall status based on critical dependencies
    const criticalFailed = checks.database.status === 'fail';

    return {
      status: criticalFailed ? 'unhealthy' : 'healthy',
      checks,
      timestamp: new Date().toISOString(),
      version: this.appVersion,
      region: this.region
    };
  }

  /**
   * Deep health check: Are all dependencies fully functional?
   * Used by: DNS/GLB routing (failover to another region on fail)
   */
  async checkDeep(): Promise<HealthStatus> {
    const checks: Record<string, CheckResult> = {};

    // All dependency checks
    const [dbCheck, cacheCheck, externalCheck] = await Promise.allSettled([
      this.checkDatabaseQuery(),
      this.checkCacheReadWrite(),
      this.checkExternalDependencies()
    ]);

    checks.database = dbCheck.status === 'fulfilled' ?
      dbCheck.value : { status: 'fail', message: 'Check threw exception' };
    checks.cache = cacheCheck.status === 'fulfilled' ?
      cacheCheck.value : { status: 'fail', message: 'Check threw exception' };
    checks.external = externalCheck.status === 'fulfilled' ?
      externalCheck.value : { status: 'fail', message: 'Check threw exception' };

    // Determine overall health
    const failCount = Object.values(checks)
      .filter(c => c.status === 'fail').length;
    const warnCount = Object.values(checks)
      .filter(c => c.status === 'warn').length;

    let status: 'healthy' | 'degraded' | 'unhealthy' = 'healthy';
    if (failCount > 0) status = 'unhealthy';
    else if (warnCount > 0) status = 'degraded';

    return {
      status,
      checks,
      timestamp: new Date().toISOString(),
      version: this.appVersion,
      region: this.region
    };
  }

  private async checkDatabase(): Promise<CheckResult> {
    const start = Date.now();
    try {
      // Quick connection check
      await this.db.query('SELECT 1');
      return { status: 'pass', latencyMs: Date.now() - start };
    } catch (error) {
      return {
        status: 'fail',
        message: (error as Error).message,
        latencyMs: Date.now() - start
      };
    }
  }

  private async checkDatabaseQuery(): Promise<CheckResult> {
    const start = Date.now();
    try {
      // More thorough check: verify we can query
      const result = await this.db.query(
        'SELECT COUNT(*) FROM information_schema.tables'
      );
      const latency = Date.now() - start;

      // Warn if query is slow
      if (latency > 1000) {
        return {
          status: 'warn',
          message: 'Query latency elevated',
          latencyMs: latency
        };
      }
      return { status: 'pass', latencyMs: latency };
    } catch (error) {
      return {
        status: 'fail',
        message: (error as Error).message,
        latencyMs: Date.now() - start
      };
    }
  }

  private async checkCache(): Promise<CheckResult> {
    const start = Date.now();
    try {
      await this.redis.ping();
      return { status: 'pass', latencyMs: Date.now() - start };
    } catch (error) {
      return {
        status: 'fail',
        message: (error as Error).message,
        latencyMs: Date.now() - start
      };
    }
  }

  private async checkCacheReadWrite(): Promise<CheckResult> {
    const start = Date.now();
    const testKey = `health-check:${Date.now()}`;
    const testValue = 'test';
    try {
      await this.redis.set(testKey, testValue, 'EX', 10);
      const retrieved = await this.redis.get(testKey);
      await this.redis.del(testKey);

      if (retrieved !== testValue) {
        return {
          status: 'fail',
          message: 'Read/write verification failed',
          latencyMs: Date.now() - start
        };
      }
      return { status: 'pass', latencyMs: Date.now() - start };
    } catch (error) {
      return {
        status: 'fail',
        message: (error as Error).message,
        latencyMs: Date.now() - start
      };
    }
  }

  private async checkExternalDependencies(): Promise<CheckResult> {
    // Check critical external services
    // In production, you might check payment providers, etc.
    return { status: 'pass' };
  }
}

// Express router setup
export function createHealthRouter(db: Pool, redis: Redis): Router {
  const router = Router();
  const checker = new HealthChecker(db, redis);

  // Liveness - always succeed if running
  router.get('/alive', async (req: Request, res: Response) => {
    const result = await checker.checkLiveness();
    res.status(200).json(result);
  });

  // Readiness - check critical dependencies
  router.get('/ready', async (req: Request, res: Response) => {
    const result = await checker.checkReadiness();
    const statusCode = result.status === 'healthy' ? 200 : 503;
    res.status(statusCode).json(result);
  });

  // Deep health - comprehensive dependency check
  router.get('/health', async (req: Request, res: Response) => {
    const result = await checker.checkDeep();
    let statusCode: number;
    switch (result.status) {
      case 'healthy': statusCode = 200; break;
      case 'degraded': statusCode = 200; break; // Still serving
      case 'unhealthy': statusCode = 503; break;
    }
    res.status(statusCode).json(result);
  });

  return router;
}
```

Health Check Configuration Parameters
Interval: How often to probe (typically 10-30 seconds)
Timeout: How long to wait for response (typically 5-10 seconds)
Healthy Threshold: Consecutive passes to mark healthy (typically 2-3)
Unhealthy Threshold: Consecutive failures to mark unhealthy (typically 2-5)
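These parameters combine into a worst-case failover window: roughly unhealthy_threshold × interval (plus a timeout) to declare the endpoint unhealthy, plus up to one DNS TTL for cached answers to expire; GLB-routed traffic skips the TTL term. A small calculator under that simplified model:

```python
# Estimate worst-case failover time from health check parameters and DNS TTL.
# Simplified model: detection = unhealthy_threshold * interval + timeout,
# and DNS-routed clients may additionally wait up to one TTL for caches to expire.

def failover_window(interval_s: int, timeout_s: int,
                    unhealthy_threshold: int, dns_ttl_s: int = 0) -> dict:
    detection = unhealthy_threshold * interval_s + timeout_s
    return {
        "detection_s": detection,
        "worst_case_total_s": detection + dns_ttl_s,
    }

# Example: 10s interval, 5s timeout, 3 failures to trip, 60s TTL
print(failover_window(interval_s=10, timeout_s=5,
                      unhealthy_threshold=3, dns_ttl_s=60))
# -> {'detection_s': 35, 'worst_case_total_s': 95}
```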
Multi-Location Probing
Probe each endpoint from multiple geographic locations so that one vantage point's local network trouble is not mistaken for an endpoint failure; Route 53, for example, lets you specify the probing regions, as in the health check resource shown earlier.
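A common way to act on multi-location probes is a quorum rule: only declare the endpoint unhealthy when at least a configured fraction of vantage points agree. A minimal sketch of that aggregation:

```python
# Aggregate health probe results from multiple locations with a quorum rule.

def endpoint_healthy(probe_results: dict[str, bool], quorum: float = 0.5) -> bool:
    """Healthy while at least `quorum` of probe locations report success."""
    if not probe_results:
        return False   # no data: treat as unhealthy rather than guessing
    passing = sum(1 for ok in probe_results.values() if ok)
    return passing / len(probe_results) >= quorum

# One probe location having local network trouble does not flip the endpoint:
print(endpoint_healthy({"us-east-1": True, "us-west-2": True, "eu-west-1": False}))   # True
# A real regional outage seen from most locations does:
print(endpoint_healthy({"us-east-1": False, "us-west-2": False, "eu-west-1": True}))  # False
```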
Deep health checks can cause cascading failures: if a non-critical dependency fails, the health check fails, traffic shifts, overwhelming other regions, which then also fail health checks. Consider separating critical (routing-affecting) and non-critical (monitoring-only) dependencies.
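One way to implement that separation is to tag each dependency check as critical or informational and let only critical failures change the status code that routing sees. A sketch of that idea (the dependency names are illustrative):

```python
# Only failures of dependencies marked critical affect the routing-visible status.
from dataclasses import dataclass

@dataclass
class DependencyCheck:
    name: str
    critical: bool   # True: failure should pull this region out of routing
    passed: bool

def health_response(checks: list[DependencyCheck]) -> tuple[int, dict]:
    critical_failures = [c.name for c in checks if c.critical and not c.passed]
    warnings = [c.name for c in checks if not c.critical and not c.passed]
    status_code = 503 if critical_failures else 200   # routers key off this
    body = {
        "status": "unhealthy" if critical_failures
                  else ("degraded" if warnings else "healthy"),
        "critical_failures": critical_failures,
        "warnings": warnings,   # surfaced for monitoring, ignored by routing
    }
    return status_code, body

checks = [
    DependencyCheck("database", critical=True, passed=True),
    DependencyCheck("recommendation-service", critical=False, passed=False),  # hypothetical
]
print(health_response(checks))   # (200, {... 'status': 'degraded' ...})
```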
Beyond automatic failover, traffic shifting enables controlled migration between regions for deployments, maintenance, and testing.
Gradual Traffic Migration
When deploying to a new region or recovering from failover, shift traffic in stages (for example 25% → 50% → 75% → 90%), pausing at each stage to validate error rates, latency, and capacity before proceeding.
This identifies issues before they affect all users and allows capacity validation.
Canary Routing
Route a small percentage of traffic to a canary deployment, for example with a low-weight DNS record or GLB backend; compare its error rate and latency against the baseline, then promote or roll back (a decision sketch follows below).
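The routing weight is only half of a canary setup; the other half is the decision rule that compares canary metrics against the baseline before promoting. A simplified promote/rollback check; the thresholds and counts are illustrative:

```python
# Decide whether to promote a canary by comparing its error rate to the baseline.
# The allowed_ratio and minimum request count are illustrative thresholds.

def canary_decision(baseline_errors: int, baseline_requests: int,
                    canary_errors: int, canary_requests: int,
                    allowed_ratio: float = 1.5, min_requests: int = 500) -> str:
    if canary_requests < min_requests:
        return "wait"   # not enough canary traffic yet to judge
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    canary_rate = canary_errors / max(canary_requests, 1)
    if canary_rate > baseline_rate * allowed_ratio:
        return "rollback"   # canary is meaningfully worse than baseline
    return "promote"

print(canary_decision(baseline_errors=40, baseline_requests=20_000,
                      canary_errors=3, canary_requests=1_000))   # promote
print(canary_decision(baseline_errors=40, baseline_requests=20_000,
                      canary_errors=9, canary_requests=1_000))   # rollback
```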
Session Affinity Considerations
Some applications require consistent region routing for a user session, for example when session state or recently written data lives in one region. Common options are sticky cookies at the load balancer, storing a home region on the user's profile, or deterministically hashing a stable identifier to a region, as sketched below.
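A simple stateless technique is to derive a user's home region deterministically by hashing a stable identifier, so every request in a session maps to the same region. A sketch (the region list is illustrative); note that plain modulo hashing reassigns many users if the region list changes, which is one reason sticky cookies or a stored home region are often preferred:

```python
# Map a stable session/user ID to a region deterministically so a session
# always routes to the same region. The region list below is illustrative.
import hashlib

REGIONS = ["us-east-1", "eu-west-1", "ap-northeast-1"]

def home_region(user_id: str, regions: list[str] = REGIONS) -> str:
    """Deterministically assign a region from a stable identifier."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(regions)
    return regions[index]

print(home_region("user-12345"))   # same input always yields the same region
```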
"""Traffic Shifting Controller for Multi-Region Systems Manages gradual traffic shifts between regions withautomated rollback based on health metrics."""import boto3from dataclasses import dataclassfrom datetime import datetime, timedeltafrom typing import Dict, List, Optionalimport time @dataclassclass ShiftStage: """Configuration for a traffic shift stage.""" target_weights: Dict[str, int] # region -> weight (0-100) duration_minutes: int # Time to wait before next stage rollback_threshold: float # Error rate that triggers rollback @dataclassclass ShiftPlan: """Multi-stage traffic shift plan.""" name: str stages: List[ShiftStage] current_stage: int = 0 started_at: Optional[datetime] = None class TrafficShifter: """ Manages controlled traffic shifts between regions. Integrates with Route 53 weighted routing and CloudWatch for metric-based progression and rollback. """ def __init__(self, hosted_zone_id: str, record_name: str): self.route53 = boto3.client('route53') self.cloudwatch = boto3.client('cloudwatch') self.hosted_zone_id = hosted_zone_id self.record_name = record_name self.active_plan: Optional[ShiftPlan] = None def create_gradual_shift_plan( self, from_region: str, to_region: str, stages: int = 4 ) -> ShiftPlan: """ Create a gradual traffic shift plan between regions. Example 4-stage shift (75% US to 25% EU → 25% US to 75% EU): Stage 1: 75% US, 25% EU (validate) Stage 2: 50% US, 50% EU Stage 3: 25% US, 75% EU Stage 4: 10% US, 90% EU (keep some in original) """ shift_amounts = [25, 50, 75, 90][:stages] plan_stages = [] for i, to_weight in enumerate(shift_amounts): from_weight = 100 - to_weight plan_stages.append(ShiftStage( target_weights={ from_region: from_weight, to_region: to_weight }, duration_minutes=15 if i < stages - 1 else 0, rollback_threshold=0.05 # 5% error rate )) return ShiftPlan( name=f"shift-{from_region}-to-{to_region}", stages=plan_stages ) def execute_plan(self, plan: ShiftPlan, auto_progress: bool = False): """ Execute a traffic shift plan. If auto_progress is True, automatically advances stages when metrics are healthy. Otherwise, waits for manual confirmation. """ self.active_plan = plan plan.started_at = datetime.now() for stage_num, stage in enumerate(plan.stages): plan.current_stage = stage_num print(f"Executing stage {stage_num + 1}/{len(plan.stages)}") print(f"Target weights: {stage.target_weights}") # Apply traffic weights self._update_route53_weights(stage.target_weights) # Wait for DNS propagation print("Waiting for DNS propagation (60s)...") time.sleep(60) # Monitor for stage duration if stage.duration_minutes > 0: if auto_progress: healthy = self._monitor_stage( stage.duration_minutes, stage.rollback_threshold ) if not healthy: print("Unhealthy metrics detected, initiating rollback") self._rollback_to_stage(0) # Return to initial state return False else: print(f"Stage complete. 
Waiting {stage.duration_minutes}m") print("Monitor metrics and call advance_stage() to continue") return True # Pause for manual review print("Traffic shift completed successfully") self.active_plan = None return True def _update_route53_weights(self, weights: Dict[str, int]): """Update Route 53 weighted records.""" changes = [] # Get current records response = self.route53.list_resource_record_sets( HostedZoneId=self.hosted_zone_id, StartRecordName=self.record_name, StartRecordType='A', MaxItems='10' ) for record in response['ResourceRecordSets']: if record['Name'].rstrip('.') == self.record_name: if 'SetIdentifier' in record: region = record['SetIdentifier'] if region in weights: changes.append({ 'Action': 'UPSERT', 'ResourceRecordSet': { **record, 'Weight': weights[region] } }) if changes: self.route53.change_resource_record_sets( HostedZoneId=self.hosted_zone_id, ChangeBatch={ 'Comment': f'Traffic shift: {weights}', 'Changes': changes } ) def _monitor_stage( self, duration_minutes: int, error_threshold: float ) -> bool: """ Monitor metrics during stage execution. Returns False if rollback should be triggered. """ end_time = datetime.now() + timedelta(minutes=duration_minutes) check_interval = 60 # seconds while datetime.now() < end_time: error_rate = self._get_current_error_rate() print(f"Current error rate: {error_rate:.2%}") if error_rate > error_threshold: return False # Trigger rollback time.sleep(check_interval) return True # Stage completed successfully def _get_current_error_rate(self) -> float: """Get current error rate from CloudWatch.""" end_time = datetime.now() start_time = end_time - timedelta(minutes=5) # Get 5xx error count errors_response = self.cloudwatch.get_metric_statistics( Namespace='AWS/ApplicationELB', MetricName='HTTPCode_Target_5XX_Count', StartTime=start_time, EndTime=end_time, Period=300, Statistics=['Sum'] ) # Get total request count requests_response = self.cloudwatch.get_metric_statistics( Namespace='AWS/ApplicationELB', MetricName='RequestCount', StartTime=start_time, EndTime=end_time, Period=300, Statistics=['Sum'] ) errors = sum( dp['Sum'] for dp in errors_response.get('Datapoints', []) ) requests = sum( dp['Sum'] for dp in requests_response.get('Datapoints', []) ) if requests == 0: return 0.0 return errors / requests def _rollback_to_stage(self, stage_num: int): """Rollback to a previous stage's weights.""" if not self.active_plan: return stage = self.active_plan.stages[stage_num] print(f"Rolling back to stage {stage_num}: {stage.target_weights}") self._update_route53_weights(stage.target_weights) # Example usageif __name__ == "__main__": shifter = TrafficShifter( hosted_zone_id="Z123456789", record_name="app.example.com" ) # Create and execute gradual shift from US to EU plan = shifter.create_gradual_shift_plan( from_region="us-east", to_region="eu-west", stages=4 ) shifter.execute_plan(plan, auto_progress=True)Blue-Green Region Deployments
For major changes, run parallel infrastructure: stand up the new (green) stack alongside the existing (blue) one, shift traffic to green gradually using weighted routing, and keep blue running until green has proven stable so that rollback is only a weight change away.
Maintenance Window Routing
During regional maintenance, drain traffic away from the region ahead of the window (weighted routing to zero, or failover to the secondary), perform the maintenance, verify health, and then shift traffic back gradually.
Sudden traffic shifts can overwhelm the receiving region. Always scale capacity before increasing traffic weight. Connection pools, cache warming, and JIT compilation all need time to stabilize under new load patterns.
We've explored the technologies and strategies that direct users to the right region in multi-region systems. Let's consolidate the key principles: DNS routing policies (geolocation, latency-based, weighted, failover, multivalue) provide the coarse-grained first layer of control; global load balancers with anycast addressing add connection-level routing and failover that doesn't wait for caches to expire; layered, multi-location health checks drive automatic failover without false alarms; and traffic shifts should be gradual, metric-gated, and preceded by capacity scaling in the receiving region.
Module Complete: Multi-Region Architecture
This module has taken you through the complete journey of multi-region system design.
With this knowledge, you can design multi-region architectures that provide high availability, low latency, and resilience to regional failures—the hallmarks of truly global-scale systems.
You've completed the Multi-Region Architecture module. You now understand the patterns, mechanisms, and operational practices for building systems that span geographic regions—delivering high availability and low latency to users worldwide.