What if you could deploy a new version of your application, test it thoroughly in production conditions, and then switch all traffic to it instantaneously—with the ability to switch back just as instantly if anything goes wrong? This is the promise of blue-green deployments, a strategy that trades infrastructure cost for deployment safety and flexibility.
Blue-green deployments represent a fundamentally different approach from rolling updates. Rather than gradually replacing instances, you maintain two complete, identical production environments—one serving live traffic (blue), the other ready to receive the next release (green). Traffic is switched atomically between them.
By the end of this page, you will understand how to architect blue-green deployments, handle the complex problem of database synchronization, configure traffic switching mechanisms, and evaluate when blue-green is the right choice for your system. You'll be equipped to implement blue-green deployments on any cloud platform.
A blue-green deployment maintains two identical production environments, one active and one idle.
When a new release is ready, it's deployed to the idle environment. After validation, traffic is switched from the active to the previously idle environment. The old active environment becomes the new idle—ready to receive the next release or serve as an immediate rollback target.
The environment lifecycle:
| Phase | Blue Environment | Green Environment | Traffic |
|---|---|---|---|
| Initial | Running v1.0 (active) | Idle or previous version | 100% → Blue |
| Deploy | Running v1.0 (active) | Deploying v1.1 | 100% → Blue |
| Validate | Running v1.0 (active) | Running v1.1 (testing) | 100% → Blue, internal → Green |
| Switch | Running v1.0 (standby) | Running v1.1 (active) | 100% → Green |
| Stabilize | Idle, ready for next deploy | Running v1.1 (active) | 100% → Green |
What constitutes an 'environment':
In practice, each environment typically includes:
| Component | Blue Environment | Green Environment |
|---|---|---|
| Application servers | Separate cluster/deployment | Separate cluster/deployment |
| Load balancer | Shared (routes to active) | Shared (routes to active) |
| Database | Shared (complex) | Shared (complex) |
| Cache | Separate or shared | Separate or shared |
| Configuration | Environment-specific | Environment-specific |
| DNS | Shared (points to LB) | Shared (points to LB) |
The database is typically shared because maintaining synchronized data across two databases is extremely complex. This creates the most significant constraint on blue-green deployments—both versions must be compatible with the same database schema.
You don't need to run both environments at full capacity continuously. The idle environment can run at minimal capacity (or be scaled to zero with serverless) and scaled up just before a release. Scale down the newly idle environment after validating the switch.
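As a minimal sketch—using the `app-blue`/`app-green` Deployment names from the Kubernetes examples later on this page—resizing the idle environment is a pair of `kubectl scale` commands:

```bash
# Before deploying a release to the idle (green) environment,
# bring it up to full capacity...
kubectl scale deployment app-green --replicas=10

# ...and after the switch has been validated, shrink the newly
# idle (blue) environment back down to a skeleton crew
kubectl scale deployment app-blue --replicas=2
```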
The mechanism for switching traffic between environments is the heart of blue-green deployments. The choice of switching mechanism affects switching speed, rollback time, and operational complexity.
```hcl
# AWS Blue-Green with Application Load Balancer
# Traffic switching via listener rule modification

resource "aws_lb" "main" {
  name               = "app-alb"
  internal           = false
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids

  tags = {
    Environment = "production"
  }
}

# Blue target group - current production
resource "aws_lb_target_group" "blue" {
  name     = "app-blue"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    path                = "/health"
    matcher             = "200"
  }

  tags = {
    Color = "blue"
  }
}

# Green target group - next release
resource "aws_lb_target_group" "green" {
  name     = "app-green"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    path                = "/health"
    matcher             = "200"
  }

  tags = {
    Color = "green"
  }
}

# Production listener - the switch point
resource "aws_lb_listener" "production" {
  load_balancer_arn = aws_lb.main.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS-1-2-2017-01"
  certificate_arn   = var.certificate_arn

  default_action {
    type = "forward"
    # SWITCH: Change this to green when deploying
    target_group_arn = var.active_color == "blue" ? aws_lb_target_group.blue.arn : aws_lb_target_group.green.arn
  }
}

# Test listener - access inactive environment for validation
resource "aws_lb_listener" "test" {
  load_balancer_arn = aws_lb.main.arn
  port              = "8443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS-1-2-2017-01"
  certificate_arn   = var.certificate_arn

  default_action {
    type = "forward"
    # Points to the INACTIVE environment for testing
    target_group_arn = var.active_color == "blue" ? aws_lb_target_group.green.arn : aws_lb_target_group.blue.arn
  }
}

# Variable to control which environment is active
variable "active_color" {
  description = "Which color environment is serving production traffic"
  type        = string
  default     = "blue"

  validation {
    condition     = contains(["blue", "green"], var.active_color)
    error_message = "Active color must be 'blue' or 'green'."
  }
}
```

Kubernetes-native blue-green switching:
In Kubernetes, blue-green deployments can be implemented using Service selector manipulation or by using more sophisticated tools like Argo Rollouts.
```yaml
# Blue Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
  labels:
    app: myapp
    version: v1.0.0
spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
      color: blue
  template:
    metadata:
      labels:
        app: myapp
        color: blue
        version: v1.0.0
    spec:
      containers:
        - name: app
          image: myapp:v1.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
---
# Green Deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
  labels:
    app: myapp
    version: v1.1.0
spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
      color: green
  template:
    metadata:
      labels:
        app: myapp
        color: green
        version: v1.1.0
    spec:
      containers:
        - name: app
          image: myapp:v1.1.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
---
# Production Service - selector determines which deployment receives traffic
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    # SWITCH: Change this label to switch traffic
    color: blue # or "green"
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
---
# Test Service - always points to inactive environment
apiVersion: v1
kind: Service
metadata:
  name: myapp-test
spec:
  selector:
    app: myapp
    # Points to INACTIVE color for pre-switch testing
    color: green # Opposite of production
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
```

DNS-based switching seems simple but has serious drawbacks: DNS caching means clients may continue using old IPs for minutes or hours after the switch, TTL settings are often ignored by clients, and rollback is equally slow. Use load balancer or router-based switching for instant and reliable traffic control.
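Unlike DNS, the Service-selector switch takes effect immediately for new connections. A minimal sketch of that switch, using the names from the manifests above:

```bash
# Flip the production Service from blue to green by patching its selector.
# New connections route to the green pods as soon as the patch is applied.
kubectl patch service myapp -p '{"spec":{"selector":{"app":"myapp","color":"green"}}}'

# Rollback is the same operation in reverse
kubectl patch service myapp -p '{"spec":{"selector":{"app":"myapp","color":"blue"}}}'
```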
The most complex aspect of blue-green deployments is handling the database. Unlike stateless application servers, databases contain persistent state that both environments must access. You can share a single database between both environments, or run one database per environment and keep them synchronized—each approach carries significant trade-offs. Because synchronizing two live databases is extremely complex, the shared database is by far the most common choice, and it is the focus of this section.
The shared database constraint:
With a shared database, both the blue and green environments must be able to work with the same database schema. This creates a hard constraint: migrations must be backward compatible.
A migration is backward compatible if the previous application version continues to work, unmodified, after the migration runs: the old code can still read every table and column it uses (nothing it depends on is renamed or removed), its writes still succeed (no new constraint rejects the values it produces), and no data it relies on is deleted or transformed.
This rules out many common migration patterns:
| Migration Type | Backward Compatible? | Safe Approach |
|---|---|---|
| Add nullable column | ✅ Yes | Direct migration |
| Add column with default | ✅ Yes | Direct migration |
| Add table | ✅ Yes | Direct migration |
| Drop unused column | ✅ Yes (if truly unused) | Verify no code references first |
| Rename column | ❌ No | Add new column → migrate data → update code → drop old |
| Change column type | ❌ No | Add new column → migrate data → update code → drop old |
| Drop table in use | ❌ No | Update code first → verify no access → drop table |
| Add NOT NULL constraint | ❌ No | Backfill data → update code → add constraint |
```sql
-- Example: Safely renaming a column in a blue-green environment
-- Total migration requires 4 deployment cycles

-- ============================================
-- DEPLOYMENT 1: Add new column
-- ============================================
-- Run this migration before deploying new code
ALTER TABLE users ADD COLUMN username VARCHAR(255);

-- Copy existing data
UPDATE users SET username = user_name WHERE username IS NULL;

-- At this point:
-- - Blue (v1) still reads/writes user_name (works)
-- - Green (v2) can read both columns, writes to both (works)

-- ============================================
-- DEPLOYMENT 2: Deploy code that reads both, writes both
-- ============================================
-- Application code change (pseudo-code):
--
-- function getUsername(user):
--   return user.username ?? user.user_name  // prefer new, fallback to old
--
-- function saveUser(user):
--   user.username = value
--   user.user_name = value  // write to BOTH columns

-- ============================================
-- DEPLOYMENT 3: Deploy code that only uses new column
-- ============================================
-- Application code change (pseudo-code):
--
-- function getUsername(user):
--   return user.username  // only new column
--
-- function saveUser(user):
--   user.username = value  // only new column

-- Verify no code references old column before proceeding

-- ============================================
-- DEPLOYMENT 4: Remove old column
-- ============================================
-- Only after all instances use new column exclusively
ALTER TABLE users DROP COLUMN user_name;

-- Safe because:
-- - No running code reads user_name
-- - No running code writes user_name
-- - Rollback would deploy code that reads username (new column)
```

Tools like Flyway, Liquibase, and Rails migrations support separating 'expand' (backward-compatible) and 'contract' (cleanup) migrations. Always run expand migrations before deployment and contract migrations after all instances are updated.
A critical challenge in blue-green deployments is handling user sessions and application state during the traffic switch. Users in the middle of a transaction when traffic switches must not experience errors or data loss.
```typescript
import Redis from 'ioredis';
import { v4 as uuidv4 } from 'uuid';

/**
 * Session store backed by Redis - shared between blue and green environments.
 * Both environments connect to the same Redis cluster.
 */
class SessionStore {
  private redis: Redis;
  private sessionTTL: number = 24 * 60 * 60; // 24 hours

  constructor(redisUrl: string) {
    // Both blue and green environments use the same Redis cluster
    this.redis = new Redis(redisUrl);
  }

  async createSession(userId: string, data: SessionData): Promise<string> {
    const sessionId = uuidv4();

    // Include schema version for forward/backward compatibility
    const sessionPayload: VersionedSession = {
      schemaVersion: 2, // Increment when session structure changes
      userId,
      data,
      createdAt: new Date().toISOString(),
      // New fields added in v2 - old code will ignore, new code uses
      environment: process.env.ENVIRONMENT_COLOR,
      lastActiveAt: new Date().toISOString(),
    };

    await this.redis.setex(
      `session:${sessionId}`,
      this.sessionTTL,
      JSON.stringify(sessionPayload)
    );

    return sessionId;
  }

  async getSession(sessionId: string): Promise<SessionData | null> {
    const raw = await this.redis.get(`session:${sessionId}`);
    if (!raw) return null;

    const session = JSON.parse(raw) as VersionedSession;

    // Handle different schema versions
    // Old sessions (v1) may not have new fields
    return this.migrateSession(session);
  }

  private migrateSession(session: VersionedSession): SessionData {
    // Handle missing fields from older schema versions
    // This allows green (new version) to read sessions created by blue (old version)
    if (!session.schemaVersion || session.schemaVersion < 2) {
      // Migrate v1 session to v2 format
      session.lastActiveAt = session.createdAt;
      session.environment = 'unknown';
    }
    return session.data;
  }

  async updateSession(sessionId: string, data: Partial<SessionData>): Promise<void> {
    const existing = await this.redis.get(`session:${sessionId}`);
    if (!existing) throw new Error('Session not found');

    const session = JSON.parse(existing) as VersionedSession;
    session.data = { ...session.data, ...data };
    session.lastActiveAt = new Date().toISOString();

    await this.redis.setex(
      `session:${sessionId}`,
      this.sessionTTL,
      JSON.stringify(session)
    );
  }
}

interface SessionData {
  cart?: CartItem[];
  preferences?: UserPreferences;
  [key: string]: unknown;
}

interface VersionedSession {
  schemaVersion: number;
  userId: string;
  data: SessionData;
  createdAt: string;
  environment?: string; // Added in v2
  lastActiveAt?: string; // Added in v2
}

// Minimal placeholder types so the example is self-contained
interface CartItem { productId: string; quantity: number; }
interface UserPreferences { theme?: string; locale?: string; }
```

Users with long-running operations (file uploads, complex workflows) may be affected by the switch. Consider: completing in-flight operations before switching, using queues with workers in both environments, or implementing operation hand-off between environments.
One of the primary advantages of blue-green deployments is the ability to validate the new version in a production environment before switching traffic. This validation phase is critical—it's your last line of defense before the release goes live.
```bash
#!/bin/bash
# Pre-switch validation script for blue-green deployment

set -euo pipefail

INACTIVE_ENV="green"
TEST_ENDPOINT="https://test.example.com"  # Points to inactive environment
MAIN_ENDPOINT="https://api.example.com"   # Points to active environment

echo "🔍 Starting pre-switch validation for $INACTIVE_ENV environment..."

# 1. Health check verification
echo "Step 1: Verifying health checks..."
HEALTH=$(curl -s -o /dev/null -w "%{http_code}" "$TEST_ENDPOINT/health/ready")
if [ "$HEALTH" != "200" ]; then
  echo "❌ Health check failed with status $HEALTH"
  exit 1
fi
echo "✅ Health checks passing"

# 2. Smoke tests
echo "Step 2: Running smoke tests..."
if ! npm run test:smoke -- --endpoint="$TEST_ENDPOINT"; then
  echo "❌ Smoke tests failed"
  exit 1
fi
echo "✅ Smoke tests passing"

# 3. API version verification
echo "Step 3: Verifying API version..."
DEPLOYED_VERSION=$(curl -s "$TEST_ENDPOINT/version" | jq -r '.version')
EXPECTED_VERSION="${DEPLOY_VERSION:-unknown}"
if [ "$DEPLOYED_VERSION" != "$EXPECTED_VERSION" ]; then
  echo "❌ Version mismatch: expected $EXPECTED_VERSION, got $DEPLOYED_VERSION"
  exit 1
fi
echo "✅ Version verified: $DEPLOYED_VERSION"

# 4. Performance baseline
echo "Step 4: Running performance baseline..."
LATENCY_P99=$(curl -s "$TEST_ENDPOINT/metrics" | grep 'http_request_duration_seconds' | grep 'quantile="0.99"' | awk '{print $2}')
LATENCY_THRESHOLD="0.5"
if (( $(echo "$LATENCY_P99 > $LATENCY_THRESHOLD" | bc -l) )); then
  echo "❌ P99 latency $LATENCY_P99 exceeds threshold $LATENCY_THRESHOLD"
  exit 1
fi
echo "✅ Performance baseline met: p99=${LATENCY_P99}s"

# 5. Database connectivity
echo "Step 5: Verifying database connectivity..."
DB_CHECK=$(curl -s "$TEST_ENDPOINT/health/dependencies" | jq -r '.database.status')
if [ "$DB_CHECK" != "healthy" ]; then
  echo "❌ Database connectivity check failed"
  exit 1
fi
echo "✅ Database connectivity verified"

# 6. Feature flag verification
echo "Step 6: Verifying feature flags..."
FLAGS=$(curl -s "$TEST_ENDPOINT/debug/feature-flags")
echo "Active feature flags: $FLAGS"

# 7. Compare metrics with active environment
echo "Step 7: Comparing error rates..."
ACTIVE_ERRORS=$(curl -s "$MAIN_ENDPOINT/metrics" | grep 'http_requests_total.*status="5' | awk '{sum+=$2} END {print sum}')
INACTIVE_ERRORS=$(curl -s "$TEST_ENDPOINT/metrics" | grep 'http_requests_total.*status="5' | awk '{sum+=$2} END {print sum}')
echo "Active environment errors: $ACTIVE_ERRORS"
echo "Inactive environment errors: $INACTIVE_ERRORS"

echo ""
echo "═══════════════════════════════════════════════════════════"
echo "✅ All pre-switch validations passed"
echo "═══════════════════════════════════════════════════════════"
echo ""
echo "Ready to switch traffic to $INACTIVE_ENV environment."
echo "To proceed, run: ./switch-traffic.sh $INACTIVE_ENV"
```

Synthetic traffic testing:
For critical systems, generate synthetic production-like traffic against the green environment before switching:
| Test Type | Purpose | Duration |
|---|---|---|
| Smoke tests | Verify core functionality works | 1-2 minutes |
| Load test | Verify performance under expected load | 5-10 minutes |
| Soak test | Verify stability over time (memory leaks, etc.) | 30-60 minutes |
| Chaos test | Verify resilience to failures | 10-15 minutes |
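As a rough sketch of the smoke-test row—endpoints and paths are illustrative placeholders—synthetic traffic can be as simple as a curl loop against the test listener:

```bash
#!/bin/bash
# Minimal synthetic-traffic smoke pass against the inactive (green)
# environment. Endpoint and paths are illustrative placeholders.
set -euo pipefail

TEST_ENDPOINT="https://test.example.com"

# Exercise a handful of core read paths once per second for ~1 minute
for i in $(seq 1 60); do
  for path in /health/ready /version /api/products /api/cart; do
    STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$TEST_ENDPOINT$path")
    if [ "$STATUS" -ge 500 ]; then
      echo "❌ $path returned $STATUS on iteration $i"
      exit 1
    fi
  done
  sleep 1
done
echo "✅ Synthetic smoke pass completed with no server errors"
```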
The depth of testing should match the risk level of the release. Critical path changes warrant more extensive testing.
The killer feature of blue-green deployments is instant rollback. Because the old environment is still running with the previous version, rolling back is simply switching traffic back—no redeployment, no waiting for instances to start.
Don't scale down or redeploy the old environment until you're confident the new version is stable. A common practice is to wait 24-48 hours before reusing the old environment for the next release. This gives time for delayed issues to manifest.
```bash
#!/bin/bash
# Traffic switching script for blue-green deployment

set -euo pipefail

TARGET_COLOR="$1"

if [[ "$TARGET_COLOR" != "blue" && "$TARGET_COLOR" != "green" ]]; then
  echo "Usage: $0 <blue|green>"
  exit 1
fi

# Get current state
CURRENT_COLOR=$(aws ssm get-parameter --name /app/active-color --query 'Parameter.Value' --output text)

if [ "$CURRENT_COLOR" == "$TARGET_COLOR" ]; then
  echo "Traffic already routing to $TARGET_COLOR"
  exit 0
fi

echo "═══════════════════════════════════════════════════════════"
echo "🔄 Switching traffic from $CURRENT_COLOR to $TARGET_COLOR"
echo "═══════════════════════════════════════════════════════════"

# Record the switch in audit log
aws logs put-log-events \
  --log-group-name /app/deployments \
  --log-stream-name traffic-switches \
  --log-events "timestamp=$(date +%s000),message=Switching traffic from $CURRENT_COLOR to $TARGET_COLOR by ${USER:-unknown}"

# Verify target environment is healthy before switching
echo "Verifying $TARGET_COLOR environment health..."
TARGET_TG=$(aws elbv2 describe-target-groups --names "app-$TARGET_COLOR" --query 'TargetGroups[0].TargetGroupArn' --output text)
HEALTH=$(aws elbv2 describe-target-health --target-group-arn "$TARGET_TG" --query 'TargetHealthDescriptions[?TargetHealth.State==`healthy`] | length(@)')

if [ "$HEALTH" -lt 1 ]; then
  echo "❌ No healthy targets in $TARGET_COLOR environment. Aborting switch."
  exit 1
fi
echo "✅ $HEALTH healthy targets in $TARGET_COLOR"

# Update load balancer listener
LB_ARN=$(aws elbv2 describe-load-balancers --names app-alb --query 'LoadBalancers[0].LoadBalancerArn' --output text)
LISTENER_ARN=$(aws elbv2 describe-listeners --load-balancer-arn "$LB_ARN" --query 'Listeners[?Port==`443`].ListenerArn' --output text)

echo "Updating listener to forward to $TARGET_COLOR..."
aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions Type=forward,TargetGroupArn="$TARGET_TG"

# Update parameter store with new active color
aws ssm put-parameter \
  --name /app/active-color \
  --value "$TARGET_COLOR" \
  --type String \
  --overwrite

# Verify the switch
sleep 5
echo "Verifying traffic switch..."
ACTIVE_TG=$(aws elbv2 describe-listeners --listener-arns "$LISTENER_ARN" --query 'Listeners[0].DefaultActions[0].TargetGroupArn' --output text)

if [[ "$ACTIVE_TG" == *"$TARGET_COLOR"* ]]; then
  echo ""
  echo "═══════════════════════════════════════════════════════════"
  echo "✅ Traffic successfully switched to $TARGET_COLOR"
  echo "═══════════════════════════════════════════════════════════"
  echo ""
  echo "Previous environment ($CURRENT_COLOR) is still running."
  echo "To rollback, run: $0 $CURRENT_COLOR"
else
  echo "❌ Switch verification failed"
  exit 1
fi
```

Rollback triggers:
| Trigger | Detection Time | Rollback Time | Total Impact |
|---|---|---|---|
| Complete outage | Seconds (health checks) | Seconds | Minimal |
| Error rate spike | 1-5 minutes (monitoring) | Seconds | Low |
| Performance degradation | 5-15 minutes (monitoring) | Seconds | Moderate |
| Data corruption | Hours-days (user reports) | Seconds, but damage done | High |
| Business logic bug | Minutes-hours (testing/reports) | Seconds | Variable |
The key insight: blue-green rollback is always fast, but detection time varies. Invest in monitoring to detect issues quickly.
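One way to shorten detection time is to automate the trigger itself. The sketch below—assuming a Prometheus-style `/metrics` endpoint and the `switch-traffic.sh` script shown earlier—polls the active environment's 5xx count and rolls back when it spikes:

```bash
#!/bin/bash
# Automated rollback watchdog - a sketch. Assumes a Prometheus-style
# /metrics endpoint and the switch-traffic.sh script from above.
set -euo pipefail

PREVIOUS_COLOR="$1"   # e.g. "blue" - the environment to roll back to
MAIN_ENDPOINT="https://api.example.com"
ERROR_THRESHOLD=50    # 5xx responses per polling interval

count_5xx() {
  # Sum all 5xx counters; tolerate a metrics page with no 5xx lines
  curl -s "$MAIN_ENDPOINT/metrics" \
    | { grep 'http_requests_total.*status="5' || true; } \
    | awk '{sum+=$2} END {print sum+0}'
}

LAST_COUNT=$(count_5xx)

while true; do
  sleep 60
  CURRENT_COUNT=$(count_5xx)
  DELTA=$((CURRENT_COUNT - LAST_COUNT))
  echo "5xx responses in the last minute: $DELTA"

  if [ "$DELTA" -gt "$ERROR_THRESHOLD" ]; then
    echo "❌ Error spike detected - rolling back to $PREVIOUS_COLOR"
    ./switch-traffic.sh "$PREVIOUS_COLOR"
    exit 1
  fi
  LAST_COUNT=$CURRENT_COUNT
done
```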
If the new version wrote data in an incompatible format or corrupted data, rolling back the code doesn't fix the data. You need separate data repair procedures. This is why backward-compatible migrations are essential—even a rollback must work with any data written by the new version.
Blue-green deployments require maintaining two production environments, which has significant cost implications. However, there are strategies to minimize this overhead while retaining the benefits.
| Strategy | Cost Savings | Trade-off |
|---|---|---|
| Scale idle to zero (serverless) | Up to 50% | Longer pre-switch warm-up needed |
| Scale idle to minimum | 30-40% | Slower scale-up during switch |
| Use spot/preemptible for idle | 20-30% | May need to replace instances |
| Smaller instance types for idle | 20-30% | Scale up before switch |
| Time-limited retention | Variable | Reduces rollback window |
Cost calculation example:
Assuming a production workload requiring 10 instances at $100/month each:
| Approach | Active Env | Idle Env | Monthly Cost | Premium |
|---|---|---|---|---|
| Full capacity both | 10 × $100 | 10 × $100 | $2,000 | +100% |
| Idle at 20% capacity | 10 × $100 | 2 × $100 | $1,200 | +20% |
| Idle scaled to zero | 10 × $100 | $0 (+ scale-up) | $1,050 | +5% |
| Single env (no B/G) | 10 × $100 | N/A | $1,000 | Baseline |
The 5-20% premium for blue-green capability is often justified by reduced downtime risk and faster rollback capability.
```yaml
# Kubernetes HPA for idle environment
# Keeps minimal capacity until pre-switch scale-up
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-green-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-green
  # Idle: maintain minimal capacity
  # Active: scale based on load
  minReplicas: 2   # Minimal for idle
  maxReplicas: 20  # Full capacity for active
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    # Scale up quickly for traffic switch
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    # Scale down slowly after becoming idle
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```

If the idle environment is scaled down, include a warm-up phase before switching. Scale up the idle environment, run health checks and load tests, wait for caches to populate and connections to establish, then switch traffic. This adds 5-15 minutes to the deployment process but prevents cold-start issues.
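A warm-up sequence can be scripted as a thin wrapper around the pieces already shown. In this sketch the deployment name matches the earlier manifests, and `pre-switch-validation.sh` is a hypothetical name standing in for the validation script above:

```bash
#!/bin/bash
# Pre-switch warm-up for a scaled-down idle environment - a sketch.
# Deployment name matches the earlier manifests; the validation
# script name is a placeholder for the script shown above.
set -euo pipefail

# 1. Bring the idle environment up to full capacity
kubectl scale deployment app-green --replicas=10

# 2. Wait until every replica reports ready
kubectl rollout status deployment/app-green --timeout=10m

# 3. Give caches and connection pools time to warm up, optionally
#    while synthetic traffic runs against the test endpoint
sleep 120

# 4. Run the full pre-switch validation, then switch
./pre-switch-validation.sh
./switch-traffic.sh green
```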
Blue-green deployments are powerful but not universally applicable. Understanding when they provide value—and when alternatives are better—is essential for choosing the right deployment strategy.
| Criteria | Rolling | Blue-Green | Canary |
|---|---|---|---|
| Resource overhead | Low (surge capacity) | High (2x environments) | Low-Medium |
| Rollback speed | Slow (redeploy) | Instant (switch) | Fast (routing change) |
| Test before live | No | Yes (test endpoint) | Yes (% traffic) |
| Mixed version duration | During rollout | None (atomic switch) | Extended (canary period) |
| Complexity | Low | Medium | High |
| Best for | Most workloads | Critical services | High-risk releases |
Many organizations use blue-green as the environment strategy but combine it with canary releases within each environment. Deploy to green, canary 5% of traffic, gradually increase, then switch. This provides the best of both worlds.
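On AWS, for example, this hybrid can be sketched with the ALB's weighted target groups—the listener and target-group variables below correspond to the Terraform resources shown earlier, and the weights are illustrative:

```bash
# Send 5% of production traffic to the green target group while blue
# serves the rest. Raise the green weight in steps, then finish with
# a 0/100 split to complete the switch.
aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions '[{
    "Type": "forward",
    "ForwardConfig": {
      "TargetGroups": [
        {"TargetGroupArn": "'"$BLUE_TG"'", "Weight": 95},
        {"TargetGroupArn": "'"$GREEN_TG"'", "Weight": 5}
      ]
    }
  }]'
```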
Blue-green deployments provide the gold standard for deployment safety—instant switchover, instant rollback, and the ability to validate in production before going live. The price is a second production environment, a hard requirement for backward-compatible database migrations, and careful session handling around the switch.
What's next:
Blue-green provides atomic switching but doesn't allow testing with real production traffic before full commitment. In the next page, we'll explore canary deployments, which route a small percentage of traffic to the new version while monitoring for issues—enabling gradual, low-risk releases.
You now understand blue-green deployments at a comprehensive level—from architecture and traffic switching to database synchronization and cost optimization. This knowledge applies to any cloud platform or orchestration system.