You now understand the individual components of global traffic distribution: GSLB for intelligent DNS-based routing, Anycast for network-layer proximity routing, GeoDNS for location-based differentiation, and the mechanics of DNS resolution. But in production, these components don't operate in isolation—they're orchestrated together through traffic management policies that encode your business requirements into routing decisions.
Traffic management policies are the glue that binds infrastructure capabilities to business objectives. They determine: how traffic flows under normal conditions; what happens when components fail; how to balance competing concerns like cost and performance; and how to enforce compliance while maximizing user experience.
This page explores the art of designing comprehensive traffic management strategies—the capstone skill that separates system designers who understand individual components from those who can architect cohesive global platforms.
By the end of this page, you will have mastered:

- Multi-dimensional traffic policy design balancing performance, availability, cost, and compliance
- Policy chaining and priority ordering for complex routing logic
- Failover hierarchies and degradation strategies
- Traffic splitting for canary deployments and migrations
- Capacity-aware routing and cost optimization
- Observability requirements for policy validation
- Real-world policy architectures from major internet services
Effective traffic management must balance multiple, often competing, objectives. Understanding these dimensions and their tradeoffs is essential for policy design.
Dimension 1: Performance (Latency/Throughput)
Minimizing user-perceived latency and maximizing throughput are primary goals. Performance-focused policies emphasize routing to the lowest-latency endpoint, often using latency-based GSLB and Anycast.
Dimension 2: Availability (Resilience)
Maintaining service availability during failures requires routing around unhealthy endpoints, failover hierarchies, and graceful degradation. Availability-focused policies prioritize redundancy and rapid failover.
Dimension 3: Cost (Infrastructure Efficiency)
Cloud and infrastructure costs can be significant. Cost-aware policies consider cross-region data transfer charges, compute pricing differences, and committed capacity utilization.
Dimension 4: Compliance (Regulatory/Contractual)
Data residency laws, contractual SLAs, and content licensing requirements impose constraints that override other considerations. Compliance must be treated as a hard constraint, not a soft preference.
| Dimension | Primary Goal | Common Tradeoffs | Example Policy Decision |
|---|---|---|---|
| Performance | Minimize latency | Higher costs (multi-region); complexity | Route to closest healthy DC even if more expensive |
| Availability | Maximize uptime | Higher costs (redundancy); complexity | Maintain idle DR capacity for instant failover |
| Cost | Minimize spend | Potentially higher latency; lower redundancy | Route to cheapest region when latency difference is <50ms |
| Compliance | Meet legal requirements | May conflict with all other dimensions | EU users must route to EU even if US is faster/cheaper |
Dimension Priority Framework:
In practice, dimensions have different priorities based on business context:
Safety-Critical Systems (healthcare, aviation, finance):
Compliance > Availability > Performance > Cost
Consumer Applications (streaming, social media):
Availability > Performance > Cost > Compliance
Cost-Sensitive Startups:
Cost > Performance > Availability > Compliance
Enterprise SaaS:
Compliance > Availability > Performance > Cost
Defining these priorities explicitly helps resolve conflicts during policy design.
Before designing traffic policies, work with stakeholders to explicitly document dimension priorities. When a conflict arises (e.g., the fastest route is non-compliant), the priority stack provides a clear decision framework. Revisit priorities annually or when business context changes.
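One way to make the priority stack actionable is to record it in configuration alongside the traffic policies it governs, so reviewers and automation resolve conflicts the same way every time. The sketch below is illustrative only; the `routing_priorities` and `conflict_resolution` fields are assumptions, not a vendor schema.

```yaml
# Illustrative sketch only: encode the agreed dimension priorities next to
# the traffic policies they govern. Field names are hypothetical.
routing_priorities:
  service: "api.example.com"
  business_context: "enterprise_saas"
  priority_stack:            # Highest priority first
    - compliance             # Hard constraint, never traded away
    - availability
    - performance
    - cost
  conflict_resolution:
    rule: "higher_priority_dimension_wins"
    example: "If the fastest route is non-compliant, route compliantly."
  review_cadence: "annual"   # Revisit when business context changes
```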
Complex routing requirements are rarely satisfied by a single policy type. Policy chaining combines multiple policies evaluated in sequence, with each policy filtering or modifying the candidate set.
The Evaluation Pipeline:
A well-designed policy chain processes routing decisions through sequential stages, each stage narrowing or reordering the candidate set before a final selection is made.
Designing Policy Chains:
Each stage in the pipeline performs a specific function:
```yaml
# Comprehensive Traffic Policy Chain Configuration
traffic_policy:
  name: "production-api-routing"
  hostname: "api.example.com"

  # Policy evaluation order (top to bottom)
  chain:
    # Stage 1: Health Filter (hard requirement)
    - type: "health_filter"
      config:
        health_check_id: "api-health-comprehensive"
        require_healthy: true
        unhealthy_action: "remove_from_pool"

    # Stage 2: Compliance Filter (hard requirement)
    - type: "compliance_filter"
      config:
        rules:
          - user_region: "EU"
            allowed_endpoints: ["frankfurt-dc", "dublin-dc", "amsterdam-dc"]
            fallback: "service_unavailable"  # Don't route EU to non-EU
          - user_region: "CN"
            allowed_endpoints: ["beijing-dc", "shanghai-dc"]
            fallback: "service_unavailable"  # China data must stay in China

    # Stage 3: Geographic Affinity (soft preference)
    - type: "geo_affinity"
      config:
        prefer_same_region: true
        max_latency_penalty_for_affinity: 50  # Accept 50ms extra for same region

    # Stage 4: Latency Optimization (soft preference)
    - type: "latency_routing"
      config:
        measurement_source: "real_user_monitoring"
        fallback_source: "probe_measurements"
        max_latency_difference: 20  # Consider endpoints within 20ms equivalent

    # Stage 5: Capacity Weighting (distribution)
    - type: "weighted_distribution"
      config:
        weights:
          frankfurt-dc: 100   # Full capacity
          dublin-dc: 75       # 75% capacity
          virginia-dc: 150    # 150% base capacity (larger DC)
          singapore-dc: 80    # 80% capacity
          tokyo-dc: 60        # 60% capacity

    # Stage 6: Selection
    - type: "selection"
      config:
        method: "weighted_random"  # Or "consistent_hash" for session affinity
        return_count: 2            # Return 2 IPs for client-side failover

  # Fallback if entire chain produces no candidates
  fallback:
    action: "return_global_fallback"
    endpoint: "virginia-dc"  # US-East as last resort
```

Each filtering stage can potentially remove all candidates. Ensure your policy chain handles empty candidate sets gracefully—either by falling back to a global endpoint, returning a service unavailable response, or executing a secondary policy chain. Never return empty DNS responses.
Production systems must handle failures gracefully. Failover hierarchies define how traffic should reroute when primary endpoints fail, while graceful degradation strategies manage partial failures.
Failover Hierarchy Design:
A robust failover hierarchy has multiple levels, each providing a less-optimal but still-acceptable alternative:
```yaml
# Multi-Level Failover Hierarchy
failover_config:
  hostname: "app.example.com"

  # Primary: In-region, same cloud
  primary:
    selection: "latency_based"
    candidates:
      - endpoint: "us-east-1a.aws.example.internal"
        provider: "aws"
        region: "us-east-1"
      - endpoint: "us-east-1b.aws.example.internal"
        provider: "aws"
        region: "us-east-1"
    health_threshold: 1  # At least 1 healthy to use this tier

  # Secondary: Same region, different AZ/cloud
  secondary:
    activation: "primary_unhealthy"
    selection: "round_robin"
    candidates:
      - endpoint: "us-east-2a.aws.example.internal"
        provider: "aws"
        region: "us-east-2"
      - endpoint: "us-east.gcp.example.internal"
        provider: "gcp"
        region: "us-east4"
    health_threshold: 1
    notification: "alert"  # Alert when secondary activated

  # Tertiary: Different region, same continent
  tertiary:
    activation: "secondary_unhealthy"
    selection: "geo_proximity"
    candidates:
      - endpoint: "us-west-2.aws.example.internal"
        provider: "aws"
        region: "us-west-2"
    notification: "page"  # Page on-call when tertiary activated

  # Quaternary: Global fallback
  quaternary:
    activation: "tertiary_unhealthy"
    candidates:
      - endpoint: "eu-west-1.aws.example.internal"
        provider: "aws"
        region: "eu-west-1"
    notification: "critical"  # Critical incident when global fallback
    degraded_mode:
      rate_limit: 0.5         # 50% rate limit in degraded mode
      cached_responses: true  # Serve stale cache if available

  # Ultimate fallback: Static error page
  static_fallback:
    activation: "all_unhealthy"
    response: "status_page"  # Return link to status page
    notification: "all_hands"
```

Graceful Degradation Strategies:
Not all failures are binary. Graceful degradation handles partial failures where some capacity remains available:
| Strategy | Trigger | Action | User Impact |
|---|---|---|---|
| Capacity Reduction | DC at 80%+ capacity | Reduce traffic weight to that DC | Slight latency increase as traffic shifts |
| Feature Degradation | Dependent service failing | Disable non-critical features | Reduced functionality, core works |
| Rate Limiting | Approaching capacity | Shed excess traffic with 429s | Some requests rejected, prevents collapse |
| Stale Cache Serving | Origin unhealthy | Serve cached responses | Potentially stale data, but available |
| Static Fallback | Total failure | Serve static 'try again later' page | Complete outage, but graceful messaging |
```yaml
# Progressive Degradation Policy
degradation_policy:
  # Level 0: Normal operation
  normal:
    capacity_threshold: 70  # Below 70% utilization
    behavior: "full_features"

  # Level 1: Light degradation
  light_degradation:
    trigger:
      - condition: "capacity > 70%"
      - condition: "error_rate > 1%"
    behavior:
      - disable: "recommendation_engine"  # Disable ML features
      - disable: "video_transcoding"      # Disable heavy processing
      - cache_ttl: "increase_2x"          # Longer cache TTL
    notification: "slack_channel"

  # Level 2: Moderate degradation
  moderate_degradation:
    trigger:
      - condition: "capacity > 85%"
      - condition: "error_rate > 5%"
      - condition: "failover_tier >= secondary"
    behavior:
      - rate_limit: 0.8      # 80% of normal traffic
      - disable: "search"    # Disable search
      - serve_stale: true    # Serve stale cache
    notification: "pagerduty_low"

  # Level 3: Heavy degradation
  heavy_degradation:
    trigger:
      - condition: "capacity > 95%"
      - condition: "error_rate > 10%"
      - condition: "failover_tier >= tertiary"
    behavior:
      - rate_limit: 0.5          # 50% of normal traffic
      - read_only_mode: true     # No writes
      - minimal_features: true   # Core functionality only
    notification: "pagerduty_high"

  # Level 4: Survival mode
  survival_mode:
    trigger:
      - condition: "all_primary_dc_unhealthy"
      - condition: "error_rate > 25%"
    behavior:
      - static_response: true  # Static responses only
      - message: "Service is experiencing issues. Please try again later."
    notification: "incident_commander"
```

Graceful degradation only works if it's been tested. Regularly exercise degradation modes in production (controlled chaos engineering) or staging to verify behavior. An untested degradation path is an unreliable degradation path.
Traffic splitting divides traffic between multiple backends for testing, gradual migrations, and controlled rollouts. This is a critical capability for reducing risk when deploying changes.
Common Traffic Splitting Use Cases:
```yaml
# Traffic Splitting Configuration Examples

# 1. Canary Deployment
canary_deployment:
  name: "api-v2-canary"
  hostname: "api.example.com"

  split:
    - weight: 99
      backend: "api-v1.internal"
      name: "stable"
    - weight: 1
      backend: "api-v2.internal"
      name: "canary"

  canary_config:
    error_threshold: 1.0        # Abort if error rate > 1%
    latency_threshold_p99: 500  # Abort if p99 > 500ms

    auto_promotion:
      enabled: true
      wait_time: 30m
      success_criteria:
        error_rate: "<= baseline * 1.1"   # Within 10% of baseline
        latency_p99: "<= baseline * 1.2"  # Within 20% of baseline

    auto_rollback:
      enabled: true
      trigger:
        - error_rate: "> 2%"
        - latency_p99: "> 1000ms"

# 2. Blue-Green Deployment
blue_green:
  name: "blue-green-switch"
  hostname: "app.example.com"

  environments:
    blue:
      backend: "app-blue.internal"
      status: "live"     # Currently receiving traffic
    green:
      backend: "app-green.internal"
      status: "standby"  # Ready but not receiving traffic

  switch:
    type: "instant"  # 0% -> 100% immediately
    pre_check:
      - smoke_test: "https://app-green.internal/health"
      - synthetic_transaction: "checkout_flow"
    post_switch:
      - monitor_duration: 5m
      - auto_rollback_on_error_spike: true

# 3. Infrastructure Migration
migration:
  name: "cloud-migration-q4"
  hostname: "services.example.com"

  split:
    - weight: 70
      backend: "onprem-lb.internal"
      name: "on-premise"
    - weight: 30
      backend: "aws-nlb.internal"
      name: "aws-cloud"

  schedule:
    - week: 1
      weights: {onprem: 90, cloud: 10}
    - week: 2
      weights: {onprem: 70, cloud: 30}
    - week: 3
      weights: {onprem: 50, cloud: 50}
    - week: 4
      weights: {onprem: 20, cloud: 80}
    - week: 5
      weights: {onprem: 0, cloud: 100}

  validation:
    compare_metrics: true
    metrics:
      - error_rate
      - latency_p50
      - latency_p99
      - throughput
```

Sticky vs. Random Traffic Splitting:
Traffic splitting can be sticky (consistent per-user) or random (per-request):
Sticky Splitting (Consistent Hashing): the same user is routed to the same variant on every request, typically by hashing a stable identifier such as a user ID, session cookie, or client IP. Use it for A/B tests, stateful sessions, and any comparison of user-level metrics.
Random Splitting (Weighted Random): each request is assigned independently according to the configured weights. It is simpler, converges to the target split quickly, and is appropriate for stateless canaries and infrastructure migrations.
If measuring conversion rates, engagement, or other user-level metrics, you must use sticky splitting. A user who sees variant A on first visit and variant B on second visit corrupts both cohorts' data. Ensure your traffic management supports consistent hashing when A/B testing.
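To make the distinction concrete, here is a minimal sketch in the same configuration style as the examples above; the `assignment` and `hash_key` fields are hypothetical, and real traffic managers expose equivalent options under their own names.

```yaml
# Illustrative sketch: sticky vs. random assignment for the same 90/10 split.
# The assignment and hash_key fields are hypothetical.
experiment_split:
  hostname: "app.example.com"
  split:
    - weight: 90
      backend: "app-control.internal"
    - weight: 10
      backend: "app-variant.internal"

  # Sticky: the same user always lands on the same backend
  sticky_mode:
    assignment: "consistent_hash"
    hash_key: "user_id"  # Or session cookie / client IP
    use_for: ["ab_tests", "stateful_sessions"]

  # Random: each request is assigned independently by weight
  random_mode:
    assignment: "weighted_random"
    use_for: ["stateless_canaries", "infrastructure_migrations"]
```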
Cloud infrastructure costs can be substantial, and routing decisions directly impact cloud bills. Cost-aware traffic management incorporates cost factors into routing policies, optimizing spend without unacceptable user experience degradation.
Cost Factors in Traffic Routing:
| Cost Factor | Description | Routing Impact | Typical Magnitude |
|---|---|---|---|
| Data Transfer (Egress) | Charges for data leaving cloud regions | Cross-region routing increases egress costs | $0.02-0.12 per GB |
| Compute Pricing | Different regions have different compute costs | US-East often cheapest; Europe/APAC higher | 10-30% variance |
| Reserved/Committed Use | Pre-purchased capacity at discount | Underutilized commitments are waste | 30-70% discount vs on-demand |
| Spot/Preemptible | Cheap but interruptible capacity | Can absorb non-critical traffic cheaply | 60-90% discount |
| Inter-Region Transfer | Traffic between provider regions | Multi-region architectures incur transfer costs | $0.01-0.02 per GB |
```yaml
# Cost-Aware Traffic Routing Configuration
cost_optimization:
  hostname: "batch.example.com"  # Batch processing, latency-tolerant
  strategy: "cost_first_with_latency_cap"

  datacenters:
    # US-East: Cheapest compute, most committed capacity
    - name: "us-east-1"
      cost_score: 10            # Lowest cost (lower is better)
      committed_capacity: 1000  # Reserved instances
      spot_capacity: 500        # Spot capacity available

    # EU-West: Higher cost, some commitments
    - name: "eu-west-1"
      cost_score: 25
      committed_capacity: 400
      spot_capacity: 200

    # APAC: Highest cost, minimal commitments
    - name: "ap-northeast-1"
      cost_score: 35
      committed_capacity: 100
      spot_capacity: 100

  routing_rules:
    # Prefer using committed capacity first (already paid for)
    - priority: 1
      condition: "committed_capacity_available"
      action: "route_to_committed"

    # Then use spot capacity for cost savings
    - priority: 2
      condition: "spot_capacity_available && latency < 500ms"
      action: "route_to_spot"
      prefer_cost_score: true  # Lowest cost region

    # Fall back to on-demand, still prefer lowest cost
    - priority: 3
      condition: "any_capacity_available"
      action: "route_to_cheapest_region"
      latency_cap: 300  # Don't exceed 300ms even for cost savings

  constraints:
    max_latency: 500        # Never sacrifice latency beyond 500ms
    min_availability: 99.9  # Availability requirements still apply

  monitoring:
    track_cost_per_request: true
    alert_on_cost_spike:
      threshold: "20% above baseline"

# Example: Hybrid Latency-Cost Optimization
hybrid_optimization:
  hostname: "api.example.com"  # Latency-sensitive API
  strategy: "latency_with_cost_tiebreaker"

  rules:
    # For endpoints with equivalent latency (<20ms difference), prefer cheaper
    - latency_equivalence_threshold: 20  # ms
      tiebreaker: "cost_score"

    # But never route to a region >100ms slower for cost savings
    - max_latency_penalty_for_cost: 100

  example_decision:
    # User in Chicago, options:
    #   - us-east-1: 25ms latency, cost_score 10
    #   - us-west-2: 40ms latency, cost_score 15
    # Decision: us-east-1 (15ms difference < 20ms threshold, so the
    # endpoints are treated as equivalent and the cost tiebreaker picks
    # the cheaper region)

    # User in Denver, options:
    #   - us-east-1: 45ms latency, cost_score 10
    #   - us-west-2: 35ms latency, cost_score 15
    # Decision: us-east-1 (10ms difference < 20ms threshold, so cost again
    # breaks the tie; us-west-2 would win only if it were more than 20ms
    # faster)
```

Committed Capacity Optimization:
If you've purchased reserved instances or committed use discounts, traffic policies should maximize their utilization before spilling traffic to on-demand capacity; underutilized commitments are capacity you have already paid for.
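As a rough sketch of that preference, assuming a hypothetical `commitment_aware_routing` policy type, the idea is to keep committed capacity nearly full and route only the overflow to on-demand:

```yaml
# Illustrative sketch: fill committed capacity first, spill to on-demand.
# The commitment_aware_routing policy and its fields are hypothetical.
commitment_aware_routing:
  hostname: "batch.example.com"
  regions:
    - name: "us-east-1"
      committed_rps: 1000            # Reserved / committed-use capacity
      target_commit_utilization: 95  # Keep commitments nearly full
    - name: "eu-west-1"
      committed_rps: 400
      target_commit_utilization: 95
  overflow:
    action: "route_to_on_demand"
    prefer: "lowest_cost_region"
    latency_cap_ms: 500              # Same cap as the batch policy above
```

The exact knobs differ by platform, but the pattern of filling what is already paid for and then choosing the cheapest compliant overflow carries across providers.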
Multi-Cloud Cost Arbitrage:
With infrastructure across multiple clouds, route traffic to the currently cheapest provider for workloads that can tolerate provider switching. This requires abstracting your infrastructure sufficiently that workloads are truly portable.
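Sketched in the same style, a cost-arbitrage policy might periodically re-weight portable workloads toward whichever provider is currently cheapest, within a latency cap; the `cost_feed`, `reweight_interval`, and `max_shift_per_interval` fields are assumptions for illustration.

```yaml
# Illustrative sketch: shift portable, latency-tolerant workloads toward the
# currently cheapest provider. Field names are hypothetical.
cost_arbitrage:
  hostname: "batch.example.com"
  eligible_workloads: ["batch", "async_processing"]  # Must be truly portable
  providers:
    - name: "aws-us-east-1"
      cost_feed: "billing_export"    # Source of current unit cost
    - name: "gcp-us-east4"
      cost_feed: "billing_export"
  policy:
    reweight_interval: "1h"          # Re-evaluate hourly, not per request
    max_shift_per_interval: 20       # Move at most 20% of traffic per hour
    latency_cap_ms: 500              # Never exceed the latency budget
```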
Frame cost optimization positively. Rather than 'degrading experience to save money,' frame it as 'intelligently routing to well-provisioned infrastructure while avoiding wasteful over-provisioning.' When done well, cost optimization improves efficiency without negative user impact.
Traffic policies are only as good as your ability to verify they're working correctly. Observability—the ability to understand system behavior from external outputs—is essential for validating that policies behave as designed.
Key Observability Dimensions:
```yaml
# Traffic Policy Observability Configuration
observability:
  # Metrics to track
  metrics:
    - name: "traffic_distribution"
      dimensions: ["endpoint", "region", "policy_tier"]
      aggregations: ["count", "rate", "percentage"]

    - name: "routing_latency_by_path"
      dimensions: ["user_region", "endpoint", "policy_tier"]
      aggregations: ["p50", "p95", "p99", "max"]

    - name: "policy_decisions"
      dimensions: ["policy_stage", "decision_type", "endpoint"]
      aggregations: ["count"]

    - name: "failover_events"
      dimensions: ["from_endpoint", "to_endpoint", "trigger"]
      aggregations: ["count", "duration_avg"]

    - name: "health_check_results"
      dimensions: ["endpoint", "check_type", "result"]
      aggregations: ["count", "success_rate"]

  # Structured logging for policy decisions
  logging:
    enabled: true
    sample_rate: 0.1  # Log 10% of decisions (sample for volume)
    full_log_on:
      - condition: "error"
      - condition: "failover"
      - condition: "compliance_block"
    fields:
      - request_id
      - user_region
      - user_asn
      - policy_chain_result
      - selected_endpoint
      - decision_reason
      - latency_to_endpoint

  # Dashboards
  dashboards:
    - name: "Traffic Overview"
      panels:
        - traffic_by_region_heatmap
        - endpoint_health_status
        - latency_percentiles
        - failover_timeline

    - name: "Policy Validation"
      panels:
        - policy_hit_rates
        - compliance_routing_accuracy
        - latency_by_policy_decision
        - weight_accuracy  # Actual vs configured weights

  # Alerts
  alerts:
    - name: "unexpected_routing"
      condition: "eu_traffic_to_non_eu_endpoint > 0"
      severity: "critical"

    - name: "weight_deviation"
      condition: "abs(actual_weight - configured_weight) > 10%"
      severity: "warning"

    - name: "failover_frequency"
      condition: "failover_count > 5 in 10m"
      severity: "warning"

    - name: "global_failover"
      condition: "policy_tier == quaternary"
      severity: "critical"
```

Synthetic Testing:
Don't wait for real users to validate policies. Deploy synthetic probes that simulate requests from various global locations and verify routing correctness: for example, that EU probes always land on EU endpoints, that measured latencies stay within expected bounds, and that configured weights are actually honored.
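A sketch of what that probe coverage might look like, again as illustrative configuration rather than any specific vendor's schema:

```yaml
# Illustrative sketch: synthetic probes that assert routing correctness from
# multiple vantage points. The schema is hypothetical.
synthetic_routing_checks:
  hostname: "api.example.com"
  probes:
    - location: "frankfurt"
      expect_endpoint_in: ["frankfurt-dc", "dublin-dc", "amsterdam-dc"]
      assert: "eu_user_never_routed_outside_eu"
    - location: "tokyo"
      expect_endpoint_in: ["tokyo-dc", "singapore-dc"]
      max_latency_ms: 150
    - location: "virginia"
      expect_endpoint_in: ["virginia-dc"]
      max_latency_ms: 50
  schedule: "every_1m"
  on_failure:
    alert: "unexpected_routing"  # Matches the alert in the observability config
    severity: "critical"
```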
Comprehensive observability is what enables you to confidently make policy changes. Without visibility into policy behavior, every change is a gamble. Invest in observability tooling proportional to the complexity of your traffic policies.
Let's examine how major internet services combine the concepts we've covered into comprehensive traffic management architectures.
Architecture 1: Global Streaming Service (Netflix-style)
```yaml
# Global Streaming Service Traffic Architecture
streaming_platform:
  # Layer 1: DNS (GSLB)
  dns_layer:
    provider: "route53"
    type: "latency_based_with_geo_override"

    geo_overrides:
      # Content licensing restrictions
      - countries: ["cn"]
        action: "block"  # Not licensed in China
      - countries: ["ru"]
        endpoint: "ru-licensing-dc"  # Special Russian catalog

    latency_based:
      - region: "us-east-1"
        weight: 40
      - region: "eu-west-1"
        weight: 30
      - region: "ap-northeast-1"
        weight: 30

  # Layer 2: Edge (Anycast CDN)
  edge_layer:
    type: "anycast"
    providers: ["own_cdn", "cloudflare", "akamai"]
    routing:
      - path_prefix: "/content/"
        cache: true
        origin: "origin_shield"
      - path_prefix: "/api/"
        cache: false
        origin: "api_gslb"

  # Layer 3: Origin Shield
  origin_shield:
    purpose: "reduce_origin_load"
    locations: ["us-east", "eu-west", "ap-northeast"]
    cache_hierarchy: "tiered"

  # Layer 4: Origin
  origin_layer:
    type: "regional_clusters"
    clusters:
      - region: "us-east"
        capacity: "100k_rps"
      - region: "eu-west"
        capacity: "60k_rps"
      - region: "ap-northeast"
        capacity: "40k_rps"
```

Architecture 2: Global SaaS Platform (Enterprise)
```yaml
# Enterprise SaaS Traffic Architecture
enterprise_saas:
  # Tenant-aware routing
  routing_model: "tenant_first"

  tenant_routing:
    # Enterprise customers with data residency requirements
    enterprise_tenants:
      - tenant: "deutsche-bank"
        restriction: "eu-only"
        primary_dc: "frankfurt"
        failover_dc: "dublin"
        cross_region_failover: false  # Never leave EU

      - tenant: "toyota"
        restriction: "japan-preferred"
        primary_dc: "tokyo"
        failover_dc: "singapore"  # APAC fallback OK

    # Standard tenants - optimal performance routing
    standard_tenants:
      routing: "latency_based"
      fallback: "geo_based"

  # Policy chain
  policy_chain:
    1_tenant_lookup:
      - identify_tenant_from_hostname  # acme.app.example.com
      - load_tenant_config
    2_compliance_check:
      - if: "tenant.restriction"
        apply: "restriction_filter"
    3_health_filter:
      - remove_unhealthy_endpoints
    4_selection:
      - enterprise: "use_tenant_primary_with_failover"
      - standard: "latency_based_selection"

  # Enterprise SLA monitoring
  sla_monitoring:
    by_tenant: true
    metrics: ["availability", "latency_p99", "error_rate"]
    alerting:
      - enterprise_sla_breach: "immediate_page"
      - standard_sla_breach: "ticket"
```

Architecture 3: E-Commerce Platform (Peak Traffic Handling)
```yaml
# E-Commerce Traffic Architecture (Black Friday Ready)
ecommerce_platform:
  # Normal operations
  normal_mode:
    routing: "geo_based_with_latency_tiebreaker"
    capacity_headroom: 200%  # Always 2x base capacity

  # Peak event mode (Black Friday)
  peak_mode:
    trigger: "scheduled OR auto_detected_traffic_spike"
    changes:
      - autoscaling: "aggressive"
        scale_up_threshold: 50%  # Scale at 50% vs normal 70%
      - routing: "capacity_aware"
        shift_overflow: true     # Shift traffic away from hot regions
      - degradation_thresholds:
          disable_search: 80%    # Disable search earlier
          disable_recommendations: 70%
          rate_limit_threshold: 85%
      - static_asset_routing:
          force_cdn: true        # Never hit origin for static
          cache_everything: true

  # Emergency mode (outage protection)
  emergency_mode:
    trigger: "manual OR error_rate > 10%"
    immediate_actions:
      - traffic_shift: "away_from_problem_region"
      - rate_limit: 50%
      - serve_cached_catalog: true         # Stale is better than down
      - disable_checkout_if_needed: false  # Protect revenue

  # Post-peak analysis
  post_event:
    capture_metrics: true
    generate_report:
      - peak_traffic_handled
      - degradation_events
      - failover_events
      - revenue_impact
```

These architectures weren't built in a day. They evolved from simpler single-region deployments. Start with straightforward policies and add complexity as requirements demand. Over-engineering traffic management before you need it creates operational burden without benefit.
Traffic management policies are the strategic layer that transforms infrastructure capabilities into business value. They encode your priorities and requirements into automated routing decisions that operate at global scale. To consolidate the key concepts: balance performance, availability, cost, and compliance against an explicit priority stack; chain policies so hard constraints (health, compliance) filter candidates before soft preferences (affinity, latency) and weighted selection; design failover hierarchies and graceful degradation so partial failures stay partial; use traffic splitting to de-risk deployments and migrations; make routing capacity-aware and cost-aware within latency caps; and invest in the observability and synthetic testing needed to validate that policies behave as designed.
Module Complete:
You've now completed the module on Global Load Balancing & Anycast. You understand how GSLB distributes traffic across worldwide infrastructure, how DNS-based load balancing leverages resolution mechanics, how Anycast enables network-layer proximity routing, how GeoDNS provides geographic targeting, and how traffic management policies orchestrate these components into cohesive architectures.
These capabilities are foundational to building internet-scale services that deliver excellent user experiences worldwide while meeting performance, availability, cost, and compliance requirements.
Congratulations! You've mastered global traffic distribution—from GSLB fundamentals through Anycast, GeoDNS, and comprehensive traffic management policies. You're now equipped to architect globally distributed systems that balance performance, resilience, cost, and compliance at internet scale.