In production systems, deploying new code is one of the highest-risk activities engineers perform. A single bug, a subtle performance regression, or an unexpected edge case can impact millions of users. Traffic splitting is the gateway capability that transforms deployments from all-or-nothing gambles into controlled, observable experiments.
Traffic splitting allows you to expose a new version to a small, controlled fraction of real traffic, compare its behavior against the stable version, and expand or roll back based on what you observe.
This page provides a Principal Engineer's deep dive into traffic splitting: the strategies, the algorithms, the configuration patterns, and the production operational considerations that make safe deployments possible at scale.
By the end of this page, you will master weighted traffic distribution algorithms, canary deployment progressions, blue-green release strategies, A/B testing implementation, traffic mirroring for safe testing, sticky sessions for consistent user experience, and automated rollback mechanisms.
At its core, traffic splitting divides incoming requests among multiple backends based on configurable weights. A weighted round-robin or weighted random algorithm determines which backend receives each request.
Weight Distribution Example:
Backend A (stable): 90% weight
Backend B (canary): 10% weight
100 requests distribution:
- ~90 requests → Backend A
- ~10 requests → Backend B
Implementation Algorithms:
1. Weighted Random Selection
For each request, generate a random number and select the backend whose cumulative weight range contains that number.
Weights: A=90, B=10 (total=100)
Random: 0-89 → A, 90-99 → B
Generate random(0,99): 42 → Backend A
Pros: Simple, stateless, naturally distributed.
Cons: Short-term variance (could see 15 of 100 go to B).
2. Weighted Round-Robin (WRR)
Maintain a counter and iterate through backends proportionally.
Sequence for weights A=3, B=1:
A, A, A, B, A, A, A, B, ...
Pros: Smoother distribution, more predictable.
Cons: Requires state, more complex with dynamic weights.
```typescript
// Traffic splitting algorithms
interface WeightedBackend {
  id: string;
  address: string;
  weight: number;
}

/**
 * Weighted Random Selection
 * Stateless, suitable for distributed gateways
 */
function selectByWeightedRandom(backends: WeightedBackend[]): WeightedBackend {
  const totalWeight = backends.reduce((sum, b) => sum + b.weight, 0);
  let random = Math.random() * totalWeight;

  for (const backend of backends) {
    random -= backend.weight;
    if (random <= 0) {
      return backend;
    }
  }

  // Fallback (shouldn't happen with correct weights)
  return backends[backends.length - 1];
}

/**
 * Smooth Weighted Round-Robin (SWRR)
 * Used by Nginx for smoother distribution
 */
class SmoothWeightedRoundRobin {
  private backends: WeightedBackend[];
  private currentWeights: number[];

  constructor(backends: WeightedBackend[]) {
    this.backends = backends;
    this.currentWeights = backends.map(() => 0);
  }

  select(): WeightedBackend {
    const effectiveWeights = this.backends.map(b => b.weight);
    const totalWeight = effectiveWeights.reduce((a, b) => a + b, 0);

    // Increase current weights by effective weights
    for (let i = 0; i < this.backends.length; i++) {
      this.currentWeights[i] += effectiveWeights[i];
    }

    // Select backend with highest current weight
    let maxIdx = 0;
    for (let i = 1; i < this.currentWeights.length; i++) {
      if (this.currentWeights[i] > this.currentWeights[maxIdx]) {
        maxIdx = i;
      }
    }

    // Reduce selected backend's current weight by total
    this.currentWeights[maxIdx] -= totalWeight;

    return this.backends[maxIdx];
  }
}

/**
 * Consistent Hashing with Weights
 * For sticky sessions with traffic splitting
 */
class ConsistentHashRing {
  private ring: Map<number, string> = new Map();
  private sortedKeys: number[] = [];
  private replicas: number;

  constructor(backends: WeightedBackend[], replicas = 100) {
    this.replicas = replicas;

    for (const backend of backends) {
      // Number of virtual nodes proportional to weight
      const virtualNodes = Math.round((backend.weight / 100) * replicas);
      for (let i = 0; i < virtualNodes; i++) {
        const hash = this.hash(`${backend.id}:${i}`);
        this.ring.set(hash, backend.id);
        this.sortedKeys.push(hash);
      }
    }
    this.sortedKeys.sort((a, b) => a - b);
  }

  select(key: string): string {
    const hash = this.hash(key);

    // Find first node with hash >= key hash
    for (const nodeHash of this.sortedKeys) {
      if (nodeHash >= hash) {
        return this.ring.get(nodeHash)!;
      }
    }

    // Wrap around to first node
    return this.ring.get(this.sortedKeys[0])!;
  }

  private hash(key: string): number {
    // Simple hash for demonstration (use proper hash in production)
    let hash = 0;
    for (let i = 0; i < key.length; i++) {
      const char = key.charCodeAt(i);
      hash = ((hash << 5) - hash) + char;
      hash = hash & hash;
    }
    return Math.abs(hash);
  }
}
```

Weights don't need to sum to 100—they're ratios. Weights of 90:10, 9:1, and 900:100 all produce the same 90%/10% split. Gateways normalize weights internally. However, using a 0-100 scale improves readability and makes percentage changes obvious.
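To make the ratios concrete, here is a small usage sketch of the three selectors above; the backend list and request counts are illustrative, not part of any real gateway configuration.

```typescript
// Illustrative usage of the selectors defined above
const demoBackends: WeightedBackend[] = [
  { id: 'stable', address: '10.0.0.1:8080', weight: 90 },
  { id: 'canary', address: '10.0.0.2:8080', weight: 10 },
];

// Weighted random: tallies converge to roughly 90/10 over many requests
const tally: Record<string, number> = { stable: 0, canary: 0 };
for (let i = 0; i < 10_000; i++) {
  tally[selectByWeightedRandom(demoBackends).id]++;
}
console.log(tally); // e.g. { stable: ~9000, canary: ~1000 }

// Smooth WRR: deterministic interleaving, one canary pick per ten requests
const wrr = new SmoothWeightedRoundRobin(demoBackends);
console.log(Array.from({ length: 10 }, () => wrr.select().id).join(','));

// Consistent hashing: the same user key always maps to the same backend
const ring = new ConsistentHashRing(demoBackends);
console.log(ring.select('user-42') === ring.select('user-42')); // true
```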
A canary deployment gradually shifts traffic from a stable version to a new version while monitoring for problems. The name comes from coal miners using canaries to detect gas—if the canary dies, evacuate. Similarly, if the canary version shows problems, halt the rollout.
Canary Progression Strategy:
Stage 1: Deploy canary with 0% traffic
→ Smoke test the deployment
Stage 2: Route 1% traffic to canary
→ Monitor errors, latency for 10 minutes
Stage 3: Route 5% traffic to canary
→ Monitor for 30 minutes
Stage 4: Route 25% traffic to canary
→ Monitor for 1 hour
Stage 5: Route 50% traffic to canary
→ Monitor for 2 hours
Stage 6: Route 100% traffic to canary
→ Canary becomes new stable
Rollback: At any stage, if metrics degrade:
→ Shift 100% back to stable
→ Investigate and fix
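In practice this staged progression is driven by a loop that raises the canary weight, waits out the bake period, and aborts on a failed health check. A minimal sketch, assuming hypothetical `setWeights` and `canaryIsHealthy` helpers standing in for your gateway API and metric checks:

```typescript
// Minimal canary progression loop; setWeights and canaryIsHealthy are
// hypothetical helpers, not a specific gateway or monitoring API
interface CanaryStage {
  weight: number;     // % of traffic routed to the canary
  bakeTimeMs: number; // how long to observe before the next stage
}

const stages: CanaryStage[] = [
  { weight: 1, bakeTimeMs: 10 * 60_000 },
  { weight: 5, bakeTimeMs: 30 * 60_000 },
  { weight: 25, bakeTimeMs: 60 * 60_000 },
  { weight: 50, bakeTimeMs: 120 * 60_000 },
  { weight: 100, bakeTimeMs: 0 },
];

async function runCanary(
  setWeights: (stablePct: number, canaryPct: number) => Promise<void>,
  canaryIsHealthy: () => Promise<boolean>
): Promise<boolean> {
  for (const stage of stages) {
    await setWeights(100 - stage.weight, stage.weight);
    await new Promise(resolve => setTimeout(resolve, stage.bakeTimeMs));

    if (!(await canaryIsHealthy())) {
      // Any degradation: shift everything back to stable and stop
      await setWeights(100, 0);
      return false;
    }
  }
  return true; // canary took 100% and becomes the new stable
}
```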
Key Canary Metrics to Monitor:
| Metric | Comparison | Rollback Threshold |
|---|---|---|
| Error Rate (5xx) | Canary vs Stable | If canary > stable * 1.1 |
| Latency P99 | Canary vs Stable | If canary > stable * 1.2 |
| Latency P50 | Canary vs Stable | If canary > stable * 1.5 |
| Request Rate | Canary expected vs actual | If significantly different (routing issue) |
| CPU/Memory | Canary pods | If resource exhaustion |
| Custom Business Metrics | Conversion, transactions | Domain-specific thresholds |
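The relative thresholds in the table translate directly into a comparison routine. A minimal sketch, assuming you already collect these metrics per version; the metric structure is illustrative, not a specific monitoring API:

```typescript
// Canary-vs-stable comparison using the table's relative thresholds
interface VersionMetrics {
  errorRate: number;  // fraction of responses that are 5xx
  latencyP99: number; // seconds
  latencyP50: number; // seconds
}

function canaryShouldRollBack(canary: VersionMetrics, stable: VersionMetrics): boolean {
  return (
    canary.errorRate > stable.errorRate * 1.1 ||   // error rate threshold
    canary.latencyP99 > stable.latencyP99 * 1.2 || // P99 latency threshold
    canary.latencyP50 > stable.latencyP50 * 1.5    // P50 latency threshold
  );
}
```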
```yaml
# Istio VirtualService for canary deployment
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: product-service
spec:
  hosts:
    - product-service
  http:
    # Header-based override for testing canary directly
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: product-service
            subset: canary
          weight: 100
    # Weighted split for gradual rollout
    - route:
        - destination:
            host: product-service
            subset: stable
          weight: 95
        - destination:
            host: product-service
            subset: canary
          weight: 5
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: product-service
spec:
  host: product-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
  subsets:
    - name: stable
      labels:
        version: v1.2.3
    - name: canary
      labels:
        version: v1.3.0
```

Issues may take time to surface—memory leaks, cache warming, time-of-day traffic patterns. A canary that looks healthy after 10 minutes might fail after 2 hours under different load. Extend canary duration proportionally to the blast radius of the change.
Blue-green deployment maintains two identical production environments. At any time, one (Blue) serves live traffic while the other (Green) is idle or receiving the new deployment. Traffic switches instantly from Blue to Green.
Blue-Green vs Canary:
| Aspect | Blue-Green | Canary |
|---|---|---|
| Traffic shift | Instant (100%) | Gradual (1% → 100%) |
| Rollback speed | Instant | Instant |
| Resource usage | 2x (both environments running) | ~1x + small canary |
| Risk exposure | All users at once | Progressive exposure |
| Complexity | Simpler routing | Complex weight management |
| Best for | Stateless services, quick validation | Stateful, long-running validation |
Blue-Green Workflow:
Initial State:
Blue (v1.2.3): 100% traffic ← Current production
Green (v1.2.3): 0% traffic ← Standby
Deployment:
1. Deploy v1.3.0 to Green
2. Run smoke tests against Green
3. Switch traffic: Blue 0%, Green 100%
4. Monitor Green closely
5. If issues: instant rollback to Blue
6. If healthy: Blue becomes new standby
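The cutover in step 3 is a single routing change. A minimal sketch of that gate-then-switch logic, assuming hypothetical `smokeTest` and `patchServiceSelector` helpers rather than a specific Kubernetes client API:

```typescript
// Minimal blue-green cutover sketch; smokeTest and patchServiceSelector are
// hypothetical helpers standing in for your test harness and cluster client
type Color = 'blue' | 'green';

async function blueGreenSwitch(
  target: Color,
  smokeTest: (color: Color) => Promise<boolean>,
  patchServiceSelector: (color: Color) => Promise<void>
): Promise<void> {
  // 1. Validate the idle environment before it receives any traffic
  if (!(await smokeTest(target))) {
    throw new Error(`Smoke tests failed against ${target}; aborting switch`);
  }

  // 2. Flip the Service selector: all traffic moves at once
  await patchServiceSelector(target);

  // 3. Rollback is the same operation in reverse: switch back to the old color
}
```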
```yaml
# Blue-Green with Kubernetes Service selector switch
---
# Blue Deployment (current production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
        - name: app
          image: myapp:v1.2.3
---
# Green Deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: app
          image: myapp:v1.3.0
---
# Service pointing to active deployment
# Switch by updating selector
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # ← Change to 'green' for switch
  ports:
    - port: 80
      targetPort: 8080
```

Blue-green works best when both versions are database-compatible. If v1.3.0 requires schema changes, you need backward-compatible migrations that work with both versions. Otherwise, instant rollback becomes impossible (Blue can't read data written by Green's new schema).
A/B testing (or split testing) routes different user segments to different experiences to measure the impact of changes. Unlike canary deployments (focused on technical health), A/B testing measures business metrics: conversion rates, engagement, revenue.
A/B Testing vs Canary:
| Aspect | A/B Testing | Canary |
|---|---|---|
| Goal | Measure business impact | Validate technical health |
| Duration | Days to weeks | Hours to days |
| Metrics | Conversion, revenue, engagement | Error rate, latency |
| User assignment | Consistent (same user, same variant) | Random per request |
| Traffic split | Usually 50/50 | Progressive (1%→100%) |
| Rollback trigger | Statistical significance | Metric degradation |
Consistent User Assignment:
Critical for A/B testing: the same user must always see the same variant. If users flip between variants, you can't measure impact.
User assignment algorithm:
```
variant = hash(user_id + experiment_id + salt) % 100
if variant < 50:
    return "control"
else:
    return "treatment"
```
```typescript
// Gateway-based A/B testing with consistent assignment
import crypto from 'crypto';

interface Experiment {
  id: string;
  name: string;
  salt: string;
  enabled: boolean;
  variants: Variant[];
  targeting?: TargetingRule[];
  startDate: Date;
  endDate?: Date;
}

interface Variant {
  id: string;
  name: string;
  weight: number;  // 0-100
  backend: string; // Target backend cluster
}

interface TargetingRule {
  type: 'user_list' | 'percentage' | 'attribute';
  config: Record<string, any>;
}

interface UserContext {
  userId: string;
  attributes: Record<string, string>;
  isInternal: boolean;
}

// Minimal request/decision shapes so this example type-checks;
// real gateways will have richer types here
interface GatewayRequest {
  path: string;
  headers: Record<string, string>;
}

interface RoutingDecision {
  backend: string;
  headers: Record<string, string>;
  experimentAssignments: Record<string, string>;
}

/**
 * Deterministic variant assignment using consistent hashing
 */
function assignVariant(user: UserContext, experiment: Experiment): Variant | null {
  // Check if experiment is active
  const now = new Date();
  if (!experiment.enabled || now < experiment.startDate) {
    return null;
  }
  if (experiment.endDate && now > experiment.endDate) {
    return null;
  }

  // Evaluate targeting rules
  if (experiment.targeting) {
    if (!evaluateTargeting(user, experiment.targeting)) {
      return null; // User not targeted
    }
  }

  // Consistent hash for variant selection
  const hash = crypto
    .createHash('sha256')
    .update(`${user.userId}:${experiment.id}:${experiment.salt}`)
    .digest();
  const bucket = hash.readUInt32BE(0) % 10000; // 0.01% precision

  // Select variant based on cumulative weights
  let cumulative = 0;
  for (const variant of experiment.variants) {
    cumulative += variant.weight * 100; // Convert percentage to basis points
    if (bucket < cumulative) {
      return variant;
    }
  }

  return experiment.variants[experiment.variants.length - 1];
}

/**
 * Gateway routing with experiment assignment
 */
async function routeWithExperiments(
  request: GatewayRequest,
  user: UserContext,
  experiments: Experiment[]
): Promise<RoutingDecision> {
  const assignments: Record<string, string> = {};
  let selectedBackend: string | null = null;

  for (const experiment of experiments) {
    const variant = assignVariant(user, experiment);
    if (variant) {
      assignments[experiment.id] = variant.id;

      // First matching experiment determines routing
      // (or implement experiment priority/layering)
      if (!selectedBackend && variant.backend) {
        selectedBackend = variant.backend;
      }
    }
  }

  // Inject assignments as headers for backends and analytics
  const headers: Record<string, string> = {
    'X-Experiment-Assignments': JSON.stringify(assignments),
  };

  // Also set individual headers for easy backend access
  for (const [expId, variantId] of Object.entries(assignments)) {
    headers[`X-Exp-${expId}`] = variantId;
  }

  return {
    backend: selectedBackend || 'default-cluster',
    headers,
    experimentAssignments: assignments,
  };
}

/**
 * Evaluate targeting rules
 */
function evaluateTargeting(user: UserContext, rules: TargetingRule[]): boolean {
  for (const rule of rules) {
    switch (rule.type) {
      case 'user_list':
        if (!rule.config.users.includes(user.userId)) {
          return false;
        }
        break;

      case 'percentage': {
        // Exclude X% of users (e.g., internal testing only)
        const hash = crypto.createHash('md5').update(user.userId).digest();
        const bucket = hash.readUInt16BE(0) % 100;
        if (bucket >= rule.config.percentage) {
          return false;
        }
        break;
      }

      case 'attribute': {
        // e.g., { attribute: 'country', values: ['US', 'CA'] }
        const value = user.attributes[rule.config.attribute];
        if (!rule.config.values.includes(value)) {
          return false;
        }
        break;
      }
    }
  }
  return true;
}
```

Traffic mirroring (also called shadowing) sends a copy of production traffic to a test environment without affecting the production response. The mirror receives real traffic, but its responses are discarded.
Use cases include validating a new version against real production traffic before any user depends on it, load and capacity testing with realistic request patterns, and regression detection by comparing primary and mirror responses.
Mirroring Architecture:
```
      ┌──────────────────────────────────────────────────┐
      │                     Gateway                      │
      │   ┌─────────┐                      ┌─────────┐   │
 ────►│   │ Primary │─────────────────────►│Response │───│────►
      │   │  Route  │                      │         │   │
      │   └────┬────┘                      └─────────┘   │
      │        │ (async copy)                            │
      │        ▼                                         │
      │   ┌─────────┐                      ┌─────────┐   │
      │   │ Mirror  │─────────────────────►│ Discard │   │
      │   │  Route  │                      │Response │   │
      │   └─────────┘                      └─────────┘   │
      └──────────────────────────────────────────────────┘
```
Critical Mirroring Considerations:
⚠️ Stateful Operations: Mirrored writes will execute! If the mirror service writes to a shared database, you'll have duplicate writes. Mirror to isolated environments only.
⚠️ External Calls: Mirrored requests may call external services (payments, emails). Ensure the mirror environment stubs these calls.
⚠️ Resource Consumption: Mirroring doubles request volume. Ensure the mirror environment has sufficient capacity.
```yaml
# Istio traffic mirroring configuration
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: product-service
spec:
  hosts:
    - product-service
  http:
    - route:
        # Primary route gets the response
        - destination:
            host: product-service
            subset: stable
          weight: 100
      # Mirror configuration
      mirror:
        host: product-service-shadow
        subset: canary
      # Percentage of traffic to mirror (optional, default 100)
      mirrorPercentage:
        value: 100.0
---
# Shadow service in isolated environment
apiVersion: v1
kind: Service
metadata:
  name: product-service-shadow
spec:
  selector:
    app: product-service
    environment: shadow
  ports:
    - port: 80
      targetPort: 8080
```

Capture both primary and mirror responses for comparison. Tools like Diffy (by Twitter) compare responses to detect regressions: differences in status codes, response bodies, or latencies. This reveals bugs before any user exposure.
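A lightweight version of that comparison can run in the gateway or an offline job. A minimal sketch (not the Diffy tool itself) that flags status, body, and latency differences; the captured-response shape and the 500 ms threshold are illustrative:

```typescript
// Minimal primary-vs-mirror response comparison (a sketch, not Diffy)
interface CapturedResponse {
  status: number;
  body: string;
  latencyMs: number;
}

interface ResponseDiff {
  statusChanged: boolean;
  bodyChanged: boolean;
  latencyDeltaMs: number;
}

function compareResponses(primary: CapturedResponse, mirror: CapturedResponse): ResponseDiff {
  return {
    statusChanged: primary.status !== mirror.status,
    bodyChanged: primary.body !== mirror.body,
    latencyDeltaMs: mirror.latencyMs - primary.latencyMs,
  };
}

// Flag anything worth a human look: status/body regressions or a large slowdown
function isRegression(diff: ResponseDiff): boolean {
  return diff.statusChanged || diff.bodyChanged || diff.latencyDeltaMs > 500;
}
```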
Sticky sessions (session affinity) ensure the same user consistently routes to the same backend version. This is critical for:
Stickiness Methods:
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Cookie-Based | Gateway sets cookie with backend ID | Explicit, inspectable | Requires cookie support |
| Header-Based | Hash on user ID header | Simple, stateless | Requires auth |
| IP-Based | Hash on client IP | No user identity needed | Breaks with NAT/proxies |
| URL Parameter | Include shard in URL | Explicit | Pollutes URLs |
| Consistent Hashing | Hash to ring of backends | Survives scale events | Complex |
The Stickiness Problem with Canaries:
If 10% of traffic goes to the canary, should 10% of users be consistently pinned to the canary (sticky), or should every request independently have a 10% chance of hitting it (non-sticky)?
For A/B testing and user experience: sticky is essential. For pure load distribution: non-sticky is fine.
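A common way to get sticky canary exposure is to assign a user once, pin the result in a cookie, and honor the cookie on later requests, so the weight only applies to first-time assignment. A minimal sketch; the cookie name mirrors the Envoy config below but the function itself is illustrative:

```typescript
// Sticky canary assignment via cookie: the weight applies only on first contact
const AFFINITY_COOKIE = 'BACKEND_AFFINITY';

function pickSticky(
  cookies: Record<string, string>,
  canaryWeight: number // 0-100
): { backend: 'stable' | 'canary'; setCookie?: string } {
  const existing = cookies[AFFINITY_COOKIE];
  if (existing === 'stable' || existing === 'canary') {
    return { backend: existing }; // honor the previous assignment
  }

  // First visit: roll the dice once, then pin the result in a cookie
  const backend = Math.random() * 100 < canaryWeight ? 'canary' : 'stable';
  return {
    backend,
    setCookie: `${AFFINITY_COOKIE}=${backend}; Path=/; Max-Age=3600`,
  };
}
```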
```yaml
# Envoy sticky session configuration
routes:
  - match:
      prefix: "/api"
    route:
      # Weighted clusters with stickiness
      weighted_clusters:
        clusters:
          - name: service-stable
            weight: 90
          - name: service-canary
            weight: 10
        total_weight: 100
      # Hash policy for sticky routing
      hash_policy:
        # Option 1: Cookie-based (gateway manages cookie)
        - cookie:
            name: "BACKEND_AFFINITY"
            ttl: 3600s  # 1 hour
            path: "/"
        # Option 2: Header-based (use user ID)
        # - header:
        #     header_name: "x-user-id"
        # Option 3: Connection-based
        # - connection_properties:
        #     source_ip: true

# Connection pool settings for stickiness
clusters:
  - name: service-stable
    type: STRICT_DNS
    lb_policy: RING_HASH  # Required for hash-based stickiness
    ring_hash_lb_config:
      minimum_ring_size: 1024
      maximum_ring_size: 8388608
    # ... rest of cluster config
```

When backends scale in/out, sticky sessions may break. If the backend a user was stuck to disappears, they'll be reassigned. Use consistent hashing (hash ring) to minimize reassignments—when a backend disappears, only its users get reassigned, not everyone.
The ultimate safety net for traffic splitting is automated rollback. When the new version shows unhealthy metrics, traffic automatically shifts back to the stable version without human intervention.
Rollback Triggers: error-rate or latency thresholds breached for a sustained window, resource exhaustion on canary instances, failed health checks, or degraded business metrics.
Rollback Speed Hierarchy:
From fastest to slowest: shifting gateway weights back to stable (seconds), switching a service selector or DNS target (seconds to minutes), redeploying the previous version (minutes), and reverting data or schema changes (minutes to hours, if possible at all).
Progressive Rollback:
If issues are subtle, progressive rollback can isolate the problem:
Canary at 20% → Issue detected
Rollback to 10% → Monitor
Still issues? → Rollback to 0%
Issues resolved at 10%? → Maintain at 10% for investigation
```typescript
// Rollback controller with multi-metric analysis
interface MetricThreshold {
  name: string;
  query: string;       // PromQL query
  operator: '>' | '<' | '>=' | '<=';
  threshold: number;
  forDuration: number; // Seconds metric must breach
}

interface RollbackPolicy {
  name: string;
  enabled: boolean;
  metrics: MetricThreshold[];
  action: 'instant' | 'progressive' | 'alert';
  progressiveSteps?: number[]; // e.g., [10, 5, 0]
}

// Minimal client interfaces and sleep helper assumed by this controller sketch
interface GatewayClient {
  updateWeights(weights: Record<string, number>): Promise<void>;
}
interface PrometheusClient {
  query(promql: string): Promise<{ data: { result: { value: [number, string] }[] } }>;
}
interface AlertingClient {
  send(alert: { severity: string; title: string; message: string }): Promise<void>;
}
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

class RollbackController {
  private currentWeight: number;
  private stableVersion: string;
  private canaryVersion: string;
  private breachTimers: Map<string, number> = new Map();

  constructor(
    private gateway: GatewayClient,
    private prometheus: PrometheusClient,
    private alerting: AlertingClient
  ) {}

  async evaluatePolicy(policy: RollbackPolicy): Promise<void> {
    if (!policy.enabled || this.currentWeight === 0) return;

    for (const metric of policy.metrics) {
      const isBreaching = await this.checkMetricBreach(metric);

      if (isBreaching) {
        const breachStart = this.breachTimers.get(metric.name) || Date.now();
        const breachDuration = (Date.now() - breachStart) / 1000;

        if (!this.breachTimers.has(metric.name)) {
          this.breachTimers.set(metric.name, breachStart);
        }

        if (breachDuration >= metric.forDuration) {
          console.log(`Metric ${metric.name} breached for ${breachDuration}s`);
          await this.executeRollback(policy);
          return;
        }
      } else {
        // Metric recovered, reset timer
        this.breachTimers.delete(metric.name);
      }
    }
  }

  private async checkMetricBreach(metric: MetricThreshold): Promise<boolean> {
    // Query Prometheus for metric value
    const result = await this.prometheus.query(metric.query);
    const value = parseFloat(result.data.result[0]?.value[1] || '0');

    switch (metric.operator) {
      case '>': return value > metric.threshold;
      case '<': return value < metric.threshold;
      case '>=': return value >= metric.threshold;
      case '<=': return value <= metric.threshold;
    }
  }

  private async executeRollback(policy: RollbackPolicy): Promise<void> {
    // Alert on rollback
    await this.alerting.send({
      severity: 'critical',
      title: `Automatic rollback triggered: ${policy.name}`,
      message: `Rolling back canary ${this.canaryVersion} due to metric breach`,
    });

    switch (policy.action) {
      case 'instant':
        // Immediately shift all traffic to stable
        await this.gateway.updateWeights({
          [this.stableVersion]: 100,
          [this.canaryVersion]: 0,
        });
        this.currentWeight = 0;
        break;

      case 'progressive':
        // Step down through progressive stages
        for (const weight of policy.progressiveSteps || [0]) {
          await this.gateway.updateWeights({
            [this.stableVersion]: 100 - weight,
            [this.canaryVersion]: weight,
          });
          this.currentWeight = weight;

          if (weight > 0) {
            // Wait and re-evaluate
            await sleep(60000); // 1 minute
            const stillBreaching = await this.checkMetricBreach(
              policy.metrics[0] // Primary metric
            );
            if (!stillBreaching) {
              console.log(`Metrics recovered at ${weight}% canary`);
              break;
            }
          }
        }
        break;

      case 'alert':
        // Don't auto-rollback, just alert
        console.log('Alert-only policy, not rolling back');
        break;
    }
  }
}

// Example policy configuration
const rollbackPolicy: RollbackPolicy = {
  name: 'canary-health',
  enabled: true,
  metrics: [
    {
      name: 'error-rate',
      query: `
        sum(rate(http_requests_total{status=~"5..",version="canary"}[1m]))
        /
        sum(rate(http_requests_total{version="canary"}[1m]))
      `,
      operator: '>',
      threshold: 0.01, // 1% error rate
      forDuration: 60, // For 60 seconds
    },
    {
      name: 'latency-p99',
      query: `
        histogram_quantile(0.99,
          sum(rate(http_request_duration_bucket{version="canary"}[1m])) by (le)
        )
      `,
      operator: '>',
      threshold: 1.0,   // 1 second P99
      forDuration: 120, // For 2 minutes
    },
  ],
  action: 'progressive',
  progressiveSteps: [10, 5, 0],
};
```

Test your rollback mechanism regularly—in production. Trigger deliberate rollbacks to verify the mechanism works and the team knows the process. A rollback procedure that's never been tested will fail when you need it most.
Traffic splitting transforms deployments from risky all-or-nothing events into controlled, observable experiments. Mastering these patterns is essential for any system that values both velocity and reliability.
You have completed the Routing and Traffic Management module! You now understand path-based routing, header-based routing, request transformation, and traffic splitting—the core capabilities that make API gateways powerful traffic managers. These patterns form the foundation for safe, observable deployments at scale.