In production systems, deploying new code is one of the highest-risk activities engineers perform. A single bug, a subtle performance regression, or an unexpected edge case can impact millions of users. Traffic splitting is the gateway capability that transforms deployments from all-or-nothing gambles into controlled, observable experiments.
Traffic splitting allows you to expose a new version to a small, controlled fraction of real traffic, compare its behavior against the stable version, and expand or roll back based on what you observe.
This page provides a Principal Engineer's deep dive into traffic splitting: the strategies, the algorithms, the configuration patterns, and the production operational considerations that make safe deployments possible at scale.
By the end of this page, you will master weighted traffic distribution algorithms, canary deployment progressions, blue-green release strategies, A/B testing implementation, traffic mirroring for safe testing, sticky sessions for consistent user experience, and automated rollback mechanisms.
At its core, traffic splitting divides incoming requests among multiple backends based on configurable weights. A weighted round-robin or weighted random algorithm determines which backend receives each request.
Weight Distribution Example:
Backend A (stable): 90% weight
Backend B (canary): 10% weight
100 requests distribution:
- ~90 requests → Backend A
- ~10 requests → Backend B
Implementation Algorithms:
1. Weighted Random Selection
For each request, generate a random number and select the backend whose cumulative weight range contains that number.
Weights: A=90, B=10 (total=100)
Random: 0-89 → A, 90-99 → B
Generate random(0,99): 42 → Backend A
Pros: Simple, stateless, naturally distributed.
Cons: Short-term variance (could see 15 of 100 go to B).
2. Weighted Round-Robin (WRR)
Maintain a counter and iterate through backends proportionally.
Sequence for weights A=3, B=1:
A, A, A, B, A, A, A, B, ...
Pros: Smoother distribution, more predictable.
Cons: Requires state, more complex with dynamic weights.
```typescript
// Traffic splitting algorithms
interface WeightedBackend {
  id: string;
  address: string;
  weight: number;
}

/**
 * Weighted Random Selection
 * Stateless, suitable for distributed gateways
 */
function selectByWeightedRandom(backends: WeightedBackend[]): WeightedBackend {
  const totalWeight = backends.reduce((sum, b) => sum + b.weight, 0);
  let random = Math.random() * totalWeight;

  for (const backend of backends) {
    random -= backend.weight;
    if (random <= 0) {
      return backend;
    }
  }

  // Fallback (shouldn't happen with correct weights)
  return backends[backends.length - 1];
}

/**
 * Smooth Weighted Round-Robin (SWRR)
 * Used by Nginx for smoother distribution
 */
class SmoothWeightedRoundRobin {
  private backends: WeightedBackend[];
  private currentWeights: number[];

  constructor(backends: WeightedBackend[]) {
    this.backends = backends;
    this.currentWeights = backends.map(() => 0);
  }

  select(): WeightedBackend {
    const effectiveWeights = this.backends.map(b => b.weight);
    const totalWeight = effectiveWeights.reduce((a, b) => a + b, 0);

    // Increase current weights by effective weights
    for (let i = 0; i < this.backends.length; i++) {
      this.currentWeights[i] += effectiveWeights[i];
    }

    // Select backend with highest current weight
    let maxIdx = 0;
    for (let i = 1; i < this.currentWeights.length; i++) {
      if (this.currentWeights[i] > this.currentWeights[maxIdx]) {
        maxIdx = i;
      }
    }

    // Reduce selected backend's current weight by total
    this.currentWeights[maxIdx] -= totalWeight;

    return this.backends[maxIdx];
  }
}

/**
 * Consistent Hashing with Weights
 * For sticky sessions with traffic splitting
 */
class ConsistentHashRing {
  private ring: Map<number, string> = new Map();
  private sortedKeys: number[] = [];
  private replicas: number;

  constructor(backends: WeightedBackend[], replicas = 100) {
    this.replicas = replicas;

    for (const backend of backends) {
      // Number of virtual nodes proportional to weight
      const virtualNodes = Math.round((backend.weight / 100) * replicas);
      for (let i = 0; i < virtualNodes; i++) {
        const hash = this.hash(`${backend.id}:${i}`);
        this.ring.set(hash, backend.id);
        this.sortedKeys.push(hash);
      }
    }
    this.sortedKeys.sort((a, b) => a - b);
  }

  select(key: string): string {
    const hash = this.hash(key);

    // Find first node with hash >= key hash
    for (const nodeHash of this.sortedKeys) {
      if (nodeHash >= hash) {
        return this.ring.get(nodeHash)!;
      }
    }

    // Wrap around to first node
    return this.ring.get(this.sortedKeys[0])!;
  }

  private hash(key: string): number {
    // Simple hash for demonstration (use proper hash in production)
    let hash = 0;
    for (let i = 0; i < key.length; i++) {
      const char = key.charCodeAt(i);
      hash = ((hash << 5) - hash) + char;
      hash = hash & hash;
    }
    return Math.abs(hash);
  }
}
```

Weights don't need to sum to 100—they're ratios. Weights of 90:10, 9:1, and 900:100 all produce the same 90%/10% split. Gateways normalize weights internally. However, using a 0-100 scale improves readability and makes percentage changes obvious.
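To make the ratios concrete, here is a small usage sketch of the three selectors above; the backend list and request counts are illustrative, not part of any real gateway configuration.

```typescript
// Illustrative usage of the selectors defined above
const demoBackends: WeightedBackend[] = [
  { id: 'stable', address: '10.0.0.1:8080', weight: 90 },
  { id: 'canary', address: '10.0.0.2:8080', weight: 10 },
];

// Weighted random: tallies converge to roughly 90/10 over many requests
const tally: Record<string, number> = { stable: 0, canary: 0 };
for (let i = 0; i < 10_000; i++) {
  tally[selectByWeightedRandom(demoBackends).id]++;
}
console.log(tally); // e.g. { stable: ~9000, canary: ~1000 }

// Smooth WRR: deterministic interleaving, one canary pick per ten requests
const wrr = new SmoothWeightedRoundRobin(demoBackends);
console.log(Array.from({ length: 10 }, () => wrr.select().id).join(','));

// Consistent hashing: the same user key always maps to the same backend
const ring = new ConsistentHashRing(demoBackends);
console.log(ring.select('user-42') === ring.select('user-42')); // true
```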
A canary deployment gradually shifts traffic from a stable version to a new version while monitoring for problems. The name comes from coal miners using canaries to detect gas—if the canary dies, evacuate. Similarly, if the canary version shows problems, halt the rollout.
Canary Progression Strategy:
Stage 1: Deploy canary with 0% traffic
→ Smoke test the deployment
Stage 2: Route 1% traffic to canary
→ Monitor errors, latency for 10 minutes
Stage 3: Route 5% traffic to canary
→ Monitor for 30 minutes
Stage 4: Route 25% traffic to canary
→ Monitor for 1 hour
Stage 5: Route 50% traffic to canary
→ Monitor for 2 hours
Stage 6: Route 100% traffic to canary
→ Canary becomes new stable
Rollback: At any stage, if metrics degrade:
→ Shift 100% back to stable
→ Investigate and fix
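In practice this staged progression is driven by a loop that raises the canary weight, waits out the bake period, and aborts on a failed health check. A minimal sketch, assuming hypothetical `setWeights` and `canaryIsHealthy` helpers standing in for your gateway API and metric checks:

```typescript
// Minimal canary progression loop; setWeights and canaryIsHealthy are
// hypothetical helpers, not a specific gateway or monitoring API
interface CanaryStage {
  weight: number;     // % of traffic routed to the canary
  bakeTimeMs: number; // how long to observe before the next stage
}

const stages: CanaryStage[] = [
  { weight: 1, bakeTimeMs: 10 * 60_000 },
  { weight: 5, bakeTimeMs: 30 * 60_000 },
  { weight: 25, bakeTimeMs: 60 * 60_000 },
  { weight: 50, bakeTimeMs: 120 * 60_000 },
  { weight: 100, bakeTimeMs: 0 },
];

async function runCanary(
  setWeights: (stablePct: number, canaryPct: number) => Promise<void>,
  canaryIsHealthy: () => Promise<boolean>
): Promise<boolean> {
  for (const stage of stages) {
    await setWeights(100 - stage.weight, stage.weight);
    await new Promise(resolve => setTimeout(resolve, stage.bakeTimeMs));

    if (!(await canaryIsHealthy())) {
      // Any degradation: shift everything back to stable and stop
      await setWeights(100, 0);
      return false;
    }
  }
  return true; // canary took 100% and becomes the new stable
}
```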
Key Canary Metrics to Monitor:
| Metric | Comparison | Rollback Threshold |
|---|---|---|
| Error Rate (5xx) | Canary vs Stable | If canary > stable * 1.1 |
| Latency P99 | Canary vs Stable | If canary > stable * 1.2 |
| Latency P50 | Canary vs Stable | If canary > stable * 1.5 |
| Request Rate | Canary expected vs actual | If significantly different (routing issue) |
| CPU/Memory | Canary pods | If resource exhaustion |
| Custom Business Metrics | Conversion, transactions | Domain-specific thresholds |
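The relative thresholds in the table translate directly into a comparison routine. A minimal sketch, assuming you already collect these metrics per version; the metric structure is illustrative, not a specific monitoring API:

```typescript
// Canary-vs-stable comparison using the table's relative thresholds
interface VersionMetrics {
  errorRate: number;  // fraction of responses that are 5xx
  latencyP99: number; // seconds
  latencyP50: number; // seconds
}

function canaryShouldRollBack(canary: VersionMetrics, stable: VersionMetrics): boolean {
  return (
    canary.errorRate > stable.errorRate * 1.1 ||   // error rate threshold
    canary.latencyP99 > stable.latencyP99 * 1.2 || // P99 latency threshold
    canary.latencyP50 > stable.latencyP50 * 1.5    // P50 latency threshold
  );
}
```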
```yaml
# Istio VirtualService for canary deployment
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: product-service
spec:
  hosts:
    - product-service
  http:
    # Header-based override for testing canary directly
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: product-service
            subset: canary
          weight: 100
    # Weighted split for gradual rollout
    - route:
        - destination:
            host: product-service
            subset: stable
          weight: 95
        - destination:
            host: product-service
            subset: canary
          weight: 5
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: product-service
spec:
  host: product-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
  subsets:
    - name: stable
      labels:
        version: v1.2.3
    - name: canary
      labels:
        version: v1.3.0
```

Issues may take time to surface—memory leaks, cache warming, time-of-day traffic patterns. A canary that looks healthy after 10 minutes might fail after 2 hours under different load. Extend canary duration proportionally to the blast radius of the change.
Blue-green deployment maintains two identical production environments. At any time, one (Blue) serves live traffic while the other (Green) is idle or receiving the new deployment. Traffic switches instantly from Blue to Green.
Blue-Green vs Canary:
| Aspect | Blue-Green | Canary |
|---|---|---|
| Traffic shift | Instant (100%) | Gradual (1% → 100%) |
| Rollback speed | Instant | Instant |
| Resource usage | 2x (both environments running) | ~1x + small canary |
| Risk exposure | All users at once | Progressive exposure |
| Complexity | Simpler routing | Complex weight management |
| Best for | Stateless services, quick validation | Stateful, long-running validation |
Blue-Green Workflow:
Initial State:
Blue (v1.2.3): 100% traffic ← Current production
Green (v1.2.3): 0% traffic ← Standby
Deployment:
1. Deploy v1.3.0 to Green
2. Run smoke tests against Green
3. Switch traffic: Blue 0%, Green 100%
4. Monitor Green closely
5. If issues: instant rollback to Blue
6. If healthy: Blue becomes new standby
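The cutover in step 3 is a single routing change. A minimal sketch of that gate-then-switch logic, assuming hypothetical `smokeTest` and `patchServiceSelector` helpers rather than a specific Kubernetes client API:

```typescript
// Minimal blue-green cutover sketch; smokeTest and patchServiceSelector are
// hypothetical helpers standing in for your test harness and cluster client
type Color = 'blue' | 'green';

async function blueGreenSwitch(
  target: Color,
  smokeTest: (color: Color) => Promise<boolean>,
  patchServiceSelector: (color: Color) => Promise<void>
): Promise<void> {
  // 1. Validate the idle environment before it receives any traffic
  if (!(await smokeTest(target))) {
    throw new Error(`Smoke tests failed against ${target}; aborting switch`);
  }

  // 2. Flip the Service selector: all traffic moves at once
  await patchServiceSelector(target);

  // 3. Rollback is the same operation in reverse: switch back to the old color
}
```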
```yaml
# Blue-Green with Kubernetes Service selector switch
---
# Blue Deployment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
        - name: app
          image: myapp:v1.2.3
---
# Green Deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: app
          image: myapp:v1.3.0
---
# Service pointing to active deployment
# Switch by updating selector
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # ← Change to 'green' for switch
  ports:
    - port: 80
      targetPort: 8080
```

Blue-green works best when both versions are database-compatible. If v1.3.0 requires schema changes, you need backward-compatible migrations that work with both versions. Otherwise, instant rollback becomes impossible (Blue can't read data written by Green's new schema).
A/B testing (or split testing) routes different user segments to different experiences to measure the impact of changes. Unlike canary deployments (focused on technical health), A/B testing measures business metrics: conversion rates, engagement, revenue.
A/B Testing vs Canary:
| Aspect | A/B Testing | Canary |
|---|---|---|
| Goal | Measure business impact | Validate technical health |
| Duration | Days to weeks | Hours to days |
| Metrics | Conversion, revenue, engagement | Error rate, latency |
| User assignment | Consistent (same user, same variant) | Random per request |
| Traffic split | Usually 50/50 | Progressive (1%→100%) |
| Rollback trigger | Statistical significance | Metric degradation |
Consistent User Assignment:
Critical for A/B testing: the same user must always see the same variant. If users flip between variants, you can't measure impact.
User assignment algorithm:
```
variant = hash(user_id + experiment_id + salt) % 100
if variant < 50:
    return "control"
else:
    return "treatment"
```
```typescript
// Gateway-based A/B testing with consistent assignment
import crypto from 'crypto';

interface Experiment {
  id: string;
  name: string;
  salt: string;
  enabled: boolean;
  variants: Variant[];
  targeting?: TargetingRule[];
  startDate: Date;
  endDate?: Date;
}

interface Variant {
  id: string;
  name: string;
  weight: number;  // 0-100
  backend: string; // Target backend cluster
}

interface TargetingRule {
  type: 'user_list' | 'percentage' | 'attribute';
  config: Record<string, any>;
}

interface UserContext {
  userId: string;
  attributes: Record<string, string>;
  isInternal: boolean;
}

// Minimal request/decision shapes so this example type-checks;
// real gateways will have richer types here
interface GatewayRequest {
  path: string;
  headers: Record<string, string>;
}

interface RoutingDecision {
  backend: string;
  headers: Record<string, string>;
  experimentAssignments: Record<string, string>;
}

/**
 * Deterministic variant assignment using consistent hashing
 */
function assignVariant(user: UserContext, experiment: Experiment): Variant | null {
  // Check if experiment is active
  const now = new Date();
  if (!experiment.enabled || now < experiment.startDate) {
    return null;
  }
  if (experiment.endDate && now > experiment.endDate) {
    return null;
  }

  // Evaluate targeting rules
  if (experiment.targeting) {
    if (!evaluateTargeting(user, experiment.targeting)) {
      return null; // User not targeted
    }
  }

  // Consistent hash for variant selection
  const hash = crypto
    .createHash('sha256')
    .update(`${user.userId}:${experiment.id}:${experiment.salt}`)
    .digest();
  const bucket = hash.readUInt32BE(0) % 10000; // 0.01% precision

  // Select variant based on cumulative weights
  let cumulative = 0;
  for (const variant of experiment.variants) {
    cumulative += variant.weight * 100; // Convert percentage to basis points
    if (bucket < cumulative) {
      return variant;
    }
  }

  return experiment.variants[experiment.variants.length - 1];
}

/**
 * Gateway routing with experiment assignment
 */
async function routeWithExperiments(
  request: GatewayRequest,
  user: UserContext,
  experiments: Experiment[]
): Promise<RoutingDecision> {
  const assignments: Record<string, string> = {};
  let selectedBackend: string | null = null;

  for (const experiment of experiments) {
    const variant = assignVariant(user, experiment);
    if (variant) {
      assignments[experiment.id] = variant.id;

      // First matching experiment determines routing
      // (or implement experiment priority/layering)
      if (!selectedBackend && variant.backend) {
        selectedBackend = variant.backend;
      }
    }
  }

  // Inject assignments as headers for backends and analytics
  const headers: Record<string, string> = {
    'X-Experiment-Assignments': JSON.stringify(assignments),
  };

  // Also set individual headers for easy backend access
  for (const [expId, variantId] of Object.entries(assignments)) {
    headers[`X-Exp-${expId}`] = variantId;
  }

  return {
    backend: selectedBackend || 'default-cluster',
    headers,
    experimentAssignments: assignments,
  };
}

/**
 * Evaluate targeting rules
 */
function evaluateTargeting(user: UserContext, rules: TargetingRule[]): boolean {
  for (const rule of rules) {
    switch (rule.type) {
      case 'user_list':
        if (!rule.config.users.includes(user.userId)) {
          return false;
        }
        break;

      case 'percentage': {
        // Exclude X% of users (e.g., internal testing only)
        const hash = crypto.createHash('md5').update(user.userId).digest();
        const bucket = hash.readUInt16BE(0) % 100;
        if (bucket >= rule.config.percentage) {
          return false;
        }
        break;
      }

      case 'attribute': {
        // e.g., { attribute: 'country', values: ['US', 'CA'] }
        const value = user.attributes[rule.config.attribute];
        if (!rule.config.values.includes(value)) {
          return false;
        }
        break;
      }
    }
  }
  return true;
}
```

Traffic mirroring (also called shadowing) sends a copy of production traffic to a test environment without affecting the production response. The mirror receives real traffic, but its responses are discarded.
Use cases include validating a new version against real production traffic before any user depends on it, load and capacity testing with realistic request patterns, and regression detection by comparing primary and mirror responses.
Mirroring Architecture:
```
      ┌──────────────────────────────────────────────────┐
      │                     Gateway                      │
      │   ┌─────────┐                      ┌─────────┐   │
 ────►│   │ Primary │─────────────────────►│Response │───│────►
      │   │  Route  │                      │         │   │
      │   └────┬────┘                      └─────────┘   │
      │        │ (async copy)                            │
      │        ▼                                         │
      │   ┌─────────┐                      ┌─────────┐   │
      │   │ Mirror  │─────────────────────►│ Discard │   │
      │   │  Route  │                      │Response │   │
      │   └─────────┘                      └─────────┘   │
      └──────────────────────────────────────────────────┘
```
Critical Mirroring Considerations:
⚠️ Stateful Operations: Mirrored writes will execute! If the mirror service writes to a shared database, you'll have duplicate writes. Mirror to isolated environments only.
⚠️ External Calls: Mirrored requests may call external services (payments, emails). Ensure the mirror environment stubs these calls.
⚠️ Resource Consumption: Mirroring doubles request volume. Ensure the mirror environment has sufficient capacity.
```yaml
# Istio traffic mirroring configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: product-service
spec:
  hosts:
    - product-service
  http:
    - route:
        # Primary route gets the response
        - destination:
            host: product-service
            subset: stable
          weight: 100
      # Mirror configuration
      mirror:
        host: product-service-shadow
        subset: canary
      # Percentage of traffic to mirror (optional, default 100)
      mirrorPercentage:
        value: 100.0
---
# Shadow service in isolated environment
apiVersion: v1
kind: Service
metadata:
  name: product-service-shadow
spec:
  selector:
    app: product-service
    environment: shadow
  ports:
    - port: 80
      targetPort: 8080
```

Capture both primary and mirror responses for comparison. Tools like Diffy (by Twitter) compare responses to detect regressions: differences in status codes, response bodies, or latencies. This reveals bugs before any user exposure.
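A lightweight version of that comparison can run in the gateway or an offline job. A minimal sketch (not the Diffy tool itself) that flags status, body, and latency differences; the captured-response shape and the 500 ms threshold are illustrative:

```typescript
// Minimal primary-vs-mirror response comparison (a sketch, not Diffy)
interface CapturedResponse {
  status: number;
  body: string;
  latencyMs: number;
}

interface ResponseDiff {
  statusChanged: boolean;
  bodyChanged: boolean;
  latencyDeltaMs: number;
}

function compareResponses(primary: CapturedResponse, mirror: CapturedResponse): ResponseDiff {
  return {
    statusChanged: primary.status !== mirror.status,
    bodyChanged: primary.body !== mirror.body,
    latencyDeltaMs: mirror.latencyMs - primary.latencyMs,
  };
}

// Flag anything worth a human look: status/body regressions or a large slowdown
function isRegression(diff: ResponseDiff): boolean {
  return diff.statusChanged || diff.bodyChanged || diff.latencyDeltaMs > 500;
}
```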
Sticky sessions (session affinity) ensure the same user consistently routes to the same backend version. This is critical for:
Stickiness Methods:
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Cookie-Based | Gateway sets cookie with backend ID | Explicit, inspectable | Requires cookie support |
| Header-Based | Hash on user ID header | Simple, stateless | Requires auth |
| IP-Based | Hash on client IP | No user identity needed | Breaks with NAT/proxies |
| URL Parameter | Include shard in URL | Explicit | Pollutes URLs |
| Consistent Hashing | Hash to ring of backends | Survives scale events | Complex |
The Stickiness Problem with Canaries:
If 10% of traffic goes to the canary, should 10% of users be consistently pinned to the canary (sticky), or should every request independently have a 10% chance of hitting it (non-sticky)?
For A/B testing and user experience: sticky is essential. For pure load distribution: non-sticky is fine.
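A common way to get sticky canary exposure is to assign a user once, pin the result in a cookie, and honor the cookie on later requests, so the weight only applies to first-time assignment. A minimal sketch; the cookie name mirrors the Envoy config below but the function itself is illustrative:

```typescript
// Sticky canary assignment via cookie: the weight applies only on first contact
const AFFINITY_COOKIE = 'BACKEND_AFFINITY';

function pickSticky(
  cookies: Record<string, string>,
  canaryWeight: number // 0-100
): { backend: 'stable' | 'canary'; setCookie?: string } {
  const existing = cookies[AFFINITY_COOKIE];
  if (existing === 'stable' || existing === 'canary') {
    return { backend: existing }; // honor the previous assignment
  }

  // First visit: roll the dice once, then pin the result in a cookie
  const backend = Math.random() * 100 < canaryWeight ? 'canary' : 'stable';
  return {
    backend,
    setCookie: `${AFFINITY_COOKIE}=${backend}; Path=/; Max-Age=3600`,
  };
}
```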
```yaml
# Envoy sticky session configuration
routes:
  - match:
      prefix: "/api"
    route:
      # Weighted clusters with stickiness
      weighted_clusters:
        clusters:
          - name: service-stable
            weight: 90
          - name: service-canary
            weight: 10
        total_weight: 100
      # Hash policy for sticky routing
      hash_policy:
        # Option 1: Cookie-based (gateway manages cookie)
        - cookie:
            name: "BACKEND_AFFINITY"
            ttl: 3600s  # 1 hour
            path: "/"
        # Option 2: Header-based (use user ID)
        # - header:
        #     header_name: "x-user-id"
        # Option 3: Connection-based
        # - connection_properties:
        #     source_ip: true

# Connection pool settings for stickiness
clusters:
  - name: service-stable
    type: STRICT_DNS
    lb_policy: RING_HASH  # Required for hash-based stickiness
    ring_hash_lb_config:
      minimum_ring_size: 1024
      maximum_ring_size: 8388608
    # ... rest of cluster config
```

When backends scale in/out, sticky sessions may break. If the backend a user was stuck to disappears, they'll be reassigned. Use consistent hashing (hash ring) to minimize reassignments—when a backend disappears, only its users get reassigned, not everyone.
The ultimate safety net for traffic splitting is automated rollback. When the new version shows unhealthy metrics, traffic automatically shifts back to the stable version without human intervention.
Rollback Triggers: error-rate or latency thresholds breached for a sustained window, resource exhaustion on canary instances, failed health checks, or degraded business metrics.
Rollback Speed Hierarchy:
From fastest to slowest: shifting gateway weights back to stable (seconds), switching a service selector or DNS target (seconds to minutes), redeploying the previous version (minutes), and reverting data or schema changes (minutes to hours, if possible at all).
Progressive Rollback:
If issues are subtle, progressive rollback can isolate the problem:
Canary at 20% → Issue detected
Rollback to 10% → Monitor
Still issues? → Rollback to 0%
Issues resolved at 10%? → Maintain at 10% for investigation
```typescript
// Rollback controller with multi-metric analysis
interface MetricThreshold {
  name: string;
  query: string;       // PromQL query
  operator: '>' | '<' | '>=' | '<=';
  threshold: number;
  forDuration: number; // Seconds metric must breach
}

interface RollbackPolicy {
  name: string;
  enabled: boolean;
  metrics: MetricThreshold[];
  action: 'instant' | 'progressive' | 'alert';
  progressiveSteps?: number[]; // e.g., [10, 5, 0]
}

// Minimal client interfaces and sleep helper assumed by this controller sketch
interface GatewayClient {
  updateWeights(weights: Record<string, number>): Promise<void>;
}
interface PrometheusClient {
  query(promql: string): Promise<{ data: { result: { value: [number, string] }[] } }>;
}
interface AlertingClient {
  send(alert: { severity: string; title: string; message: string }): Promise<void>;
}
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

class RollbackController {
  private currentWeight: number;
  private stableVersion: string;
  private canaryVersion: string;
  private breachTimers: Map<string, number> = new Map();

  constructor(
    private gateway: GatewayClient,
    private prometheus: PrometheusClient,
    private alerting: AlertingClient
  ) {}

  async evaluatePolicy(policy: RollbackPolicy): Promise<void> {
    if (!policy.enabled || this.currentWeight === 0) return;

    for (const metric of policy.metrics) {
      const isBreaching = await this.checkMetricBreach(metric);

      if (isBreaching) {
        const breachStart = this.breachTimers.get(metric.name) || Date.now();
        const breachDuration = (Date.now() - breachStart) / 1000;

        if (!this.breachTimers.has(metric.name)) {
          this.breachTimers.set(metric.name, breachStart);
        }

        if (breachDuration >= metric.forDuration) {
          console.log(`Metric ${metric.name} breached for ${breachDuration}s`);
          await this.executeRollback(policy);
          return;
        }
      } else {
        // Metric recovered, reset timer
        this.breachTimers.delete(metric.name);
      }
    }
  }

  private async checkMetricBreach(metric: MetricThreshold): Promise<boolean> {
    // Query Prometheus for metric value
    const result = await this.prometheus.query(metric.query);
    const value = parseFloat(result.data.result[0]?.value[1] || '0');

    switch (metric.operator) {
      case '>': return value > metric.threshold;
      case '<': return value < metric.threshold;
      case '>=': return value >= metric.threshold;
      case '<=': return value <= metric.threshold;
    }
  }

  private async executeRollback(policy: RollbackPolicy): Promise<void> {
    // Alert on rollback
    await this.alerting.send({
      severity: 'critical',
      title: `Automatic rollback triggered: ${policy.name}`,
      message: `Rolling back canary ${this.canaryVersion} due to metric breach`,
    });

    switch (policy.action) {
      case 'instant':
        // Immediately shift all traffic to stable
        await this.gateway.updateWeights({
          [this.stableVersion]: 100,
          [this.canaryVersion]: 0,
        });
        this.currentWeight = 0;
        break;

      case 'progressive':
        // Step down through progressive stages
        for (const weight of policy.progressiveSteps || [0]) {
          await this.gateway.updateWeights({
            [this.stableVersion]: 100 - weight,
            [this.canaryVersion]: weight,
          });
          this.currentWeight = weight;

          if (weight > 0) {
            // Wait and re-evaluate
            await sleep(60000); // 1 minute
            const stillBreaching = await this.checkMetricBreach(
              policy.metrics[0] // Primary metric
            );
            if (!stillBreaching) {
              console.log(`Metrics recovered at ${weight}% canary`);
              break;
            }
          }
        }
        break;

      case 'alert':
        // Don't auto-rollback, just alert
        console.log('Alert-only policy, not rolling back');
        break;
    }
  }
}

// Example policy configuration
const rollbackPolicy: RollbackPolicy = {
  name: 'canary-health',
  enabled: true,
  metrics: [
    {
      name: 'error-rate',
      query: `
        sum(rate(http_requests_total{status=~"5..",version="canary"}[1m]))
        /
        sum(rate(http_requests_total{version="canary"}[1m]))
      `,
      operator: '>',
      threshold: 0.01, // 1% error rate
      forDuration: 60, // For 60 seconds
    },
    {
      name: 'latency-p99',
      query: `
        histogram_quantile(0.99,
          sum(rate(http_request_duration_bucket{version="canary"}[1m])) by (le)
        )
      `,
      operator: '>',
      threshold: 1.0,   // 1 second P99
      forDuration: 120, // For 2 minutes
    },
  ],
  action: 'progressive',
  progressiveSteps: [10, 5, 0],
};
```

Test your rollback mechanism regularly—in production. Trigger deliberate rollbacks to verify the mechanism works and the team knows the process. A rollback procedure that's never been tested will fail when you need it most.
Traffic splitting transforms deployments from risky all-or-nothing events into controlled, observable experiments. Mastering these patterns is essential for any system that values both velocity and reliability.
You have completed the Routing and Traffic Management module! You now understand path-based routing, header-based routing, request transformation, and traffic splitting—the core capabilities that make API gateways powerful traffic managers. These patterns form the foundation for safe, observable deployments at scale.