What if you could deploy a new version of your application, test it thoroughly in production conditions, and then switch all traffic to it instantaneously—with the ability to switch back just as instantly if anything goes wrong? This is the promise of blue-green deployments, a strategy that trades infrastructure cost for deployment safety and flexibility.
Blue-green deployments represent a fundamentally different approach from rolling updates. Rather than gradually replacing instances, you maintain two complete, identical production environments—one serving live traffic (blue), the other ready to receive the next release (green). Traffic is switched atomically between them.
By the end of this page, you will understand how to architect blue-green deployments, handle the complex problem of database synchronization, configure traffic switching mechanisms, and evaluate when blue-green is the right choice for your system. You'll be equipped to implement blue-green deployments on any cloud platform.
A blue-green deployment maintains two identical production environments, one active and one idle.
When a new release is ready, it's deployed to the idle environment. After validation, traffic is switched from the active to the previously idle environment. The old active environment becomes the new idle—ready to receive the next release or serve as an immediate rollback target.
The environment lifecycle:
| Phase | Blue Environment | Green Environment | Traffic |
|---|---|---|---|
| Initial | Running v1.0 (active) | Idle or previous version | 100% → Blue |
| Deploy | Running v1.0 (active) | Deploying v1.1 | 100% → Blue |
| Validate | Running v1.0 (active) | Running v1.1 (testing) | 100% → Blue, internal → Green |
| Switch | Running v1.0 (standby) | Running v1.1 (active) | 100% → Green |
| Stabilize | Idle, ready for next deploy | Running v1.1 (active) | 100% → Green |
What constitutes an 'environment':
In practice, each environment typically includes:
| Component | Blue Environment | Green Environment |
|---|---|---|
| Application servers | Separate cluster/deployment | Separate cluster/deployment |
| Load balancer | Shared (routes to active) | Shared (routes to active) |
| Database | Shared (complex) | Shared (complex) |
| Cache | Separate or shared | Separate or shared |
| Configuration | Environment-specific | Environment-specific |
| DNS | Shared (points to LB) | Shared (points to LB) |
The database is typically shared because maintaining synchronized data across two databases is extremely complex. This creates the most significant constraint on blue-green deployments—both versions must be compatible with the same database schema.
You don't need to run both environments at full capacity continuously. The idle environment can run at minimal capacity (or be scaled to zero with serverless) and scaled up just before a release. Scale down the newly idle environment after validating the switch.
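As a minimal sketch—using the `app-blue`/`app-green` Deployment names from the Kubernetes examples later on this page—resizing the idle environment is a pair of `kubectl scale` commands:

```bash
# Before deploying a release to the idle (green) environment,
# bring it up to full capacity...
kubectl scale deployment app-green --replicas=10

# ...and after the switch has been validated, shrink the newly
# idle (blue) environment back down to a skeleton crew
kubectl scale deployment app-blue --replicas=2
```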
The mechanism for switching traffic between environments is the heart of blue-green deployments. The choice of switching mechanism affects switching speed, rollback time, and operational complexity.
```hcl
# AWS Blue-Green with Application Load Balancer
# Traffic switching via listener rule modification

resource "aws_lb" "main" {
  name               = "app-alb"
  internal           = false
  load_balancer_type = "application"
  subnets            = var.public_subnet_ids

  tags = {
    Environment = "production"
  }
}

# Blue target group - current production
resource "aws_lb_target_group" "blue" {
  name     = "app-blue"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    path                = "/health"
    matcher             = "200"
  }

  tags = {
    Color = "blue"
  }
}

# Green target group - next release
resource "aws_lb_target_group" "green" {
  name     = "app-green"
  port     = 8080
  protocol = "HTTP"
  vpc_id   = var.vpc_id

  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 5
    interval            = 30
    path                = "/health"
    matcher             = "200"
  }

  tags = {
    Color = "green"
  }
}

# Production listener - the switch point
resource "aws_lb_listener" "production" {
  load_balancer_arn = aws_lb.main.arn
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS-1-2-2017-01"
  certificate_arn   = var.certificate_arn

  default_action {
    type = "forward"
    # SWITCH: Change this to green when deploying
    target_group_arn = var.active_color == "blue" ? aws_lb_target_group.blue.arn : aws_lb_target_group.green.arn
  }
}

# Test listener - access inactive environment for validation
resource "aws_lb_listener" "test" {
  load_balancer_arn = aws_lb.main.arn
  port              = "8443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS-1-2-2017-01"
  certificate_arn   = var.certificate_arn

  default_action {
    type = "forward"
    # Points to the INACTIVE environment for testing
    target_group_arn = var.active_color == "blue" ? aws_lb_target_group.green.arn : aws_lb_target_group.blue.arn
  }
}

# Variable to control which environment is active
variable "active_color" {
  description = "Which color environment is serving production traffic"
  type        = string
  default     = "blue"

  validation {
    condition     = contains(["blue", "green"], var.active_color)
    error_message = "Active color must be 'blue' or 'green'."
  }
}
```

Kubernetes-native blue-green switching:
In Kubernetes, blue-green deployments can be implemented using Service selector manipulation or by using more sophisticated tools like Argo Rollouts.
```yaml
# Blue Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
  labels:
    app: myapp
    version: v1.0.0
spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
      color: blue
  template:
    metadata:
      labels:
        app: myapp
        color: blue
        version: v1.0.0
    spec:
      containers:
        - name: app
          image: myapp:v1.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
---
# Green Deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
  labels:
    app: myapp
    version: v1.1.0
spec:
  replicas: 5
  selector:
    matchLabels:
      app: myapp
      color: green
  template:
    metadata:
      labels:
        app: myapp
        color: green
        version: v1.1.0
    spec:
      containers:
        - name: app
          image: myapp:v1.1.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
---
# Production Service - selector determines which deployment receives traffic
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    # SWITCH: Change this label to switch traffic
    color: blue # or "green"
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
---
# Test Service - always points to inactive environment
apiVersion: v1
kind: Service
metadata:
  name: myapp-test
spec:
  selector:
    app: myapp
    # Points to INACTIVE color for pre-switch testing
    color: green # Opposite of production
  ports:
    - port: 80
      targetPort: 8080
  type: ClusterIP
```

DNS-based switching seems simple but has serious drawbacks: DNS caching means clients may continue using old IPs for minutes or hours after the switch, TTL settings are often ignored by clients, and rollback is equally slow. Use load balancer or router-based switching for instant and reliable traffic control.
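Unlike DNS, the Service-selector switch takes effect immediately for new connections. A minimal sketch of that switch, using the names from the manifests above:

```bash
# Flip the production Service from blue to green by patching its selector.
# New connections route to the green pods as soon as the patch is applied.
kubectl patch service myapp -p '{"spec":{"selector":{"app":"myapp","color":"green"}}}'

# Rollback is the same operation in reverse
kubectl patch service myapp -p '{"spec":{"selector":{"app":"myapp","color":"blue"}}}'
```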
The most complex aspect of blue-green deployments is handling the database. Unlike stateless application servers, databases contain persistent state that both environments must access. You can share a single database between both environments, or run one database per environment and keep them synchronized—each approach carries significant trade-offs. Because synchronizing two live databases is extremely complex, the shared database is by far the most common choice, and it is the focus of this section.
The shared database constraint:
With a shared database, both the blue and green environments must be able to work with the same database schema. This creates a hard constraint: migrations must be backward compatible.
A migration is backward compatible if the previous application version continues to work, unmodified, after the migration runs: the old code can still read every table and column it uses (nothing it depends on is renamed or removed), its writes still succeed (no new constraint rejects the values it produces), and no data it relies on is deleted or transformed.
This rules out many common migration patterns:
| Migration Type | Backward Compatible? | Safe Approach |
|---|---|---|
| Add nullable column | ✅ Yes | Direct migration |
| Add column with default | ✅ Yes | Direct migration |
| Add table | ✅ Yes | Direct migration |
| Drop unused column | ✅ Yes (if truly unused) | Verify no code references first |
| Rename column | ❌ No | Add new column → migrate data → update code → drop old |
| Change column type | ❌ No | Add new column → migrate data → update code → drop old |
| Drop table in use | ❌ No | Update code first → verify no access → drop table |
| Add NOT NULL constraint | ❌ No | Backfill data → update code → add constraint |
```sql
-- Example: Safely renaming a column in a blue-green environment
-- Total migration requires 4 deployment cycles

-- ============================================
-- DEPLOYMENT 1: Add new column
-- ============================================
-- Run this migration before deploying new code
ALTER TABLE users ADD COLUMN username VARCHAR(255);

-- Copy existing data
UPDATE users SET username = user_name WHERE username IS NULL;

-- At this point:
-- - Blue (v1) still reads/writes user_name (works)
-- - Green (v2) can read both columns, writes to both (works)

-- ============================================
-- DEPLOYMENT 2: Deploy code that reads both, writes both
-- ============================================
-- Application code change (pseudo-code):
--
-- function getUsername(user):
--   return user.username ?? user.user_name  // prefer new, fallback to old
--
-- function saveUser(user):
--   user.username = value
--   user.user_name = value  // write to BOTH columns

-- ============================================
-- DEPLOYMENT 3: Deploy code that only uses new column
-- ============================================
-- Application code change (pseudo-code):
--
-- function getUsername(user):
--   return user.username  // only new column
--
-- function saveUser(user):
--   user.username = value  // only new column

-- Verify no code references old column before proceeding

-- ============================================
-- DEPLOYMENT 4: Remove old column
-- ============================================
-- Only after all instances use new column exclusively
ALTER TABLE users DROP COLUMN user_name;

-- Safe because:
-- - No running code reads user_name
-- - No running code writes user_name
-- - Rollback would deploy code that reads username (new column)
```

Tools like Flyway, Liquibase, and Rails migrations support separating 'expand' (backward-compatible) and 'contract' (cleanup) migrations. Always run expand migrations before deployment and contract migrations after all instances are updated.
A critical challenge in blue-green deployments is handling user sessions and application state during the traffic switch. Users in the middle of a transaction when traffic switches must not experience errors or data loss.
```typescript
import Redis from 'ioredis';
import { v4 as uuidv4 } from 'uuid';

/**
 * Session store backed by Redis - shared between blue and green environments.
 * Both environments connect to the same Redis cluster.
 */
class SessionStore {
  private redis: Redis;
  private sessionTTL: number = 24 * 60 * 60; // 24 hours

  constructor(redisUrl: string) {
    // Both blue and green environments use the same Redis cluster
    this.redis = new Redis(redisUrl);
  }

  async createSession(userId: string, data: SessionData): Promise<string> {
    const sessionId = uuidv4();

    // Include schema version for forward/backward compatibility
    const sessionPayload: VersionedSession = {
      schemaVersion: 2, // Increment when session structure changes
      userId,
      data,
      createdAt: new Date().toISOString(),
      // New fields added in v2 - old code will ignore, new code uses
      environment: process.env.ENVIRONMENT_COLOR,
      lastActiveAt: new Date().toISOString(),
    };

    await this.redis.setex(
      `session:${sessionId}`,
      this.sessionTTL,
      JSON.stringify(sessionPayload)
    );

    return sessionId;
  }

  async getSession(sessionId: string): Promise<SessionData | null> {
    const raw = await this.redis.get(`session:${sessionId}`);
    if (!raw) return null;

    const session = JSON.parse(raw) as VersionedSession;

    // Handle different schema versions
    // Old sessions (v1) may not have new fields
    return this.migrateSession(session);
  }

  private migrateSession(session: VersionedSession): SessionData {
    // Handle missing fields from older schema versions
    // This allows green (new version) to read sessions created by blue (old version)
    if (!session.schemaVersion || session.schemaVersion < 2) {
      // Migrate v1 session to v2 format
      session.lastActiveAt = session.createdAt;
      session.environment = 'unknown';
    }
    return session.data;
  }

  async updateSession(sessionId: string, data: Partial<SessionData>): Promise<void> {
    const existing = await this.redis.get(`session:${sessionId}`);
    if (!existing) throw new Error('Session not found');

    const session = JSON.parse(existing) as VersionedSession;
    session.data = { ...session.data, ...data };
    session.lastActiveAt = new Date().toISOString();

    await this.redis.setex(
      `session:${sessionId}`,
      this.sessionTTL,
      JSON.stringify(session)
    );
  }
}

interface SessionData {
  cart?: CartItem[];
  preferences?: UserPreferences;
  [key: string]: unknown;
}

interface VersionedSession {
  schemaVersion: number;
  userId: string;
  data: SessionData;
  createdAt: string;
  environment?: string; // Added in v2
  lastActiveAt?: string; // Added in v2
}

// Minimal placeholder types so the example is self-contained
interface CartItem { productId: string; quantity: number; }
interface UserPreferences { theme?: string; locale?: string; }
```

Users with long-running operations (file uploads, complex workflows) may be affected by the switch. Consider: completing in-flight operations before switching, using queues with workers in both environments, or implementing operation hand-off between environments.
One of the primary advantages of blue-green deployments is the ability to validate the new version in a production environment before switching traffic. This validation phase is critical—it's your last line of defense before the release goes live.
```bash
#!/bin/bash
# Pre-switch validation script for blue-green deployment

set -euo pipefail

INACTIVE_ENV="green"
TEST_ENDPOINT="https://test.example.com"  # Points to inactive environment
MAIN_ENDPOINT="https://api.example.com"   # Points to active environment

echo "🔍 Starting pre-switch validation for $INACTIVE_ENV environment..."

# 1. Health check verification
echo "Step 1: Verifying health checks..."
HEALTH=$(curl -s -o /dev/null -w "%{http_code}" "$TEST_ENDPOINT/health/ready")
if [ "$HEALTH" != "200" ]; then
  echo "❌ Health check failed with status $HEALTH"
  exit 1
fi
echo "✅ Health checks passing"

# 2. Smoke tests
echo "Step 2: Running smoke tests..."
if ! npm run test:smoke -- --endpoint="$TEST_ENDPOINT"; then
  echo "❌ Smoke tests failed"
  exit 1
fi
echo "✅ Smoke tests passing"

# 3. API version verification
echo "Step 3: Verifying API version..."
DEPLOYED_VERSION=$(curl -s "$TEST_ENDPOINT/version" | jq -r '.version')
EXPECTED_VERSION="${DEPLOY_VERSION:-unknown}"
if [ "$DEPLOYED_VERSION" != "$EXPECTED_VERSION" ]; then
  echo "❌ Version mismatch: expected $EXPECTED_VERSION, got $DEPLOYED_VERSION"
  exit 1
fi
echo "✅ Version verified: $DEPLOYED_VERSION"

# 4. Performance baseline
echo "Step 4: Running performance baseline..."
LATENCY_P99=$(curl -s "$TEST_ENDPOINT/metrics" | grep 'http_request_duration_seconds' | grep 'quantile="0.99"' | awk '{print $2}')
LATENCY_THRESHOLD="0.5"
if (( $(echo "$LATENCY_P99 > $LATENCY_THRESHOLD" | bc -l) )); then
  echo "❌ P99 latency $LATENCY_P99 exceeds threshold $LATENCY_THRESHOLD"
  exit 1
fi
echo "✅ Performance baseline met: p99=${LATENCY_P99}s"

# 5. Database connectivity
echo "Step 5: Verifying database connectivity..."
DB_CHECK=$(curl -s "$TEST_ENDPOINT/health/dependencies" | jq -r '.database.status')
if [ "$DB_CHECK" != "healthy" ]; then
  echo "❌ Database connectivity check failed"
  exit 1
fi
echo "✅ Database connectivity verified"

# 6. Feature flag verification
echo "Step 6: Verifying feature flags..."
FLAGS=$(curl -s "$TEST_ENDPOINT/debug/feature-flags")
echo "Active feature flags: $FLAGS"

# 7. Compare metrics with active environment
echo "Step 7: Comparing error rates..."
ACTIVE_ERRORS=$(curl -s "$MAIN_ENDPOINT/metrics" | grep 'http_requests_total.*status="5' | awk '{sum+=$2} END {print sum}')
INACTIVE_ERRORS=$(curl -s "$TEST_ENDPOINT/metrics" | grep 'http_requests_total.*status="5' | awk '{sum+=$2} END {print sum}')
echo "Active environment errors: $ACTIVE_ERRORS"
echo "Inactive environment errors: $INACTIVE_ERRORS"

echo ""
echo "═══════════════════════════════════════════════════════════"
echo "✅ All pre-switch validations passed"
echo "═══════════════════════════════════════════════════════════"
echo ""
echo "Ready to switch traffic to $INACTIVE_ENV environment."
echo "To proceed, run: ./switch-traffic.sh $INACTIVE_ENV"
```

Synthetic traffic testing:
For critical systems, generate synthetic production-like traffic against the green environment before switching:
| Test Type | Purpose | Duration |
|---|---|---|
| Smoke tests | Verify core functionality works | 1-2 minutes |
| Load test | Verify performance under expected load | 5-10 minutes |
| Soak test | Verify stability over time (memory leaks, etc.) | 30-60 minutes |
| Chaos test | Verify resilience to failures | 10-15 minutes |
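As a rough sketch of the smoke-test row—endpoints and paths are illustrative placeholders—synthetic traffic can be as simple as a curl loop against the test listener:

```bash
#!/bin/bash
# Minimal synthetic-traffic smoke pass against the inactive (green)
# environment. Endpoint and paths are illustrative placeholders.
set -euo pipefail

TEST_ENDPOINT="https://test.example.com"

# Exercise a handful of core read paths once per second for ~1 minute
for i in $(seq 1 60); do
  for path in /health/ready /version /api/products /api/cart; do
    STATUS=$(curl -s -o /dev/null -w "%{http_code}" "$TEST_ENDPOINT$path")
    if [ "$STATUS" -ge 500 ]; then
      echo "❌ $path returned $STATUS on iteration $i"
      exit 1
    fi
  done
  sleep 1
done
echo "✅ Synthetic smoke pass completed with no server errors"
```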
The depth of testing should match the risk level of the release. Critical path changes warrant more extensive testing.
The killer feature of blue-green deployments is instant rollback. Because the old environment is still running with the previous version, rolling back is simply switching traffic back—no redeployment, no waiting for instances to start.
Don't scale down or redeploy the old environment until you're confident the new version is stable. A common practice is to wait 24-48 hours before reusing the old environment for the next release. This gives time for delayed issues to manifest.
```bash
#!/bin/bash
# Traffic switching script for blue-green deployment

set -euo pipefail

TARGET_COLOR="$1"

if [[ "$TARGET_COLOR" != "blue" && "$TARGET_COLOR" != "green" ]]; then
  echo "Usage: $0 <blue|green>"
  exit 1
fi

# Get current state
CURRENT_COLOR=$(aws ssm get-parameter --name /app/active-color --query 'Parameter.Value' --output text)

if [ "$CURRENT_COLOR" == "$TARGET_COLOR" ]; then
  echo "Traffic already routing to $TARGET_COLOR"
  exit 0
fi

echo "═══════════════════════════════════════════════════════════"
echo "🔄 Switching traffic from $CURRENT_COLOR to $TARGET_COLOR"
echo "═══════════════════════════════════════════════════════════"

# Record the switch in audit log
aws logs put-log-events \
  --log-group-name /app/deployments \
  --log-stream-name traffic-switches \
  --log-events "timestamp=$(date +%s000),message=Switching traffic from $CURRENT_COLOR to $TARGET_COLOR by ${USER:-unknown}"

# Verify target environment is healthy before switching
echo "Verifying $TARGET_COLOR environment health..."
TARGET_TG=$(aws elbv2 describe-target-groups --names "app-$TARGET_COLOR" --query 'TargetGroups[0].TargetGroupArn' --output text)
HEALTH=$(aws elbv2 describe-target-health --target-group-arn "$TARGET_TG" --query 'TargetHealthDescriptions[?TargetHealth.State==`healthy`] | length(@)')

if [ "$HEALTH" -lt 1 ]; then
  echo "❌ No healthy targets in $TARGET_COLOR environment. Aborting switch."
  exit 1
fi
echo "✅ $HEALTH healthy targets in $TARGET_COLOR"

# Update load balancer listener
LB_ARN=$(aws elbv2 describe-load-balancers --names app-alb --query 'LoadBalancers[0].LoadBalancerArn' --output text)
LISTENER_ARN=$(aws elbv2 describe-listeners --load-balancer-arn "$LB_ARN" --query 'Listeners[?Port==`443`].ListenerArn' --output text)

echo "Updating listener to forward to $TARGET_COLOR..."
aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions Type=forward,TargetGroupArn="$TARGET_TG"

# Update parameter store with new active color
aws ssm put-parameter \
  --name /app/active-color \
  --value "$TARGET_COLOR" \
  --type String \
  --overwrite

# Verify the switch
sleep 5
echo "Verifying traffic switch..."
ACTIVE_TG=$(aws elbv2 describe-listeners --listener-arns "$LISTENER_ARN" --query 'Listeners[0].DefaultActions[0].TargetGroupArn' --output text)

if [[ "$ACTIVE_TG" == *"$TARGET_COLOR"* ]]; then
  echo ""
  echo "═══════════════════════════════════════════════════════════"
  echo "✅ Traffic successfully switched to $TARGET_COLOR"
  echo "═══════════════════════════════════════════════════════════"
  echo ""
  echo "Previous environment ($CURRENT_COLOR) is still running."
  echo "To rollback, run: $0 $CURRENT_COLOR"
else
  echo "❌ Switch verification failed"
  exit 1
fi
```

Rollback triggers:
| Trigger | Detection Time | Rollback Time | Total Impact |
|---|---|---|---|
| Complete outage | Seconds (health checks) | Seconds | Minimal |
| Error rate spike | 1-5 minutes (monitoring) | Seconds | Low |
| Performance degradation | 5-15 minutes (monitoring) | Seconds | Moderate |
| Data corruption | Hours-days (user reports) | Seconds, but damage done | High |
| Business logic bug | Minutes-hours (testing/reports) | Seconds | Variable |
The key insight: blue-green rollback is always fast, but detection time varies. Invest in monitoring to detect issues quickly.
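One way to shorten detection time is to automate the trigger itself. The sketch below—assuming a Prometheus-style `/metrics` endpoint and the `switch-traffic.sh` script shown earlier—polls the active environment's 5xx count and rolls back when it spikes:

```bash
#!/bin/bash
# Automated rollback watchdog - a sketch. Assumes a Prometheus-style
# /metrics endpoint and the switch-traffic.sh script from above.
set -euo pipefail

PREVIOUS_COLOR="$1"   # e.g. "blue" - the environment to roll back to
MAIN_ENDPOINT="https://api.example.com"
ERROR_THRESHOLD=50    # 5xx responses per polling interval

count_5xx() {
  # Sum all 5xx counters; tolerate a metrics page with no 5xx lines
  curl -s "$MAIN_ENDPOINT/metrics" \
    | { grep 'http_requests_total.*status="5' || true; } \
    | awk '{sum+=$2} END {print sum+0}'
}

LAST_COUNT=$(count_5xx)

while true; do
  sleep 60
  CURRENT_COUNT=$(count_5xx)
  DELTA=$((CURRENT_COUNT - LAST_COUNT))
  echo "5xx responses in the last minute: $DELTA"

  if [ "$DELTA" -gt "$ERROR_THRESHOLD" ]; then
    echo "❌ Error spike detected - rolling back to $PREVIOUS_COLOR"
    ./switch-traffic.sh "$PREVIOUS_COLOR"
    exit 1
  fi
  LAST_COUNT=$CURRENT_COUNT
done
```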
If the new version wrote data in an incompatible format or corrupted data, rolling back the code doesn't fix the data. You need separate data repair procedures. This is why backward-compatible migrations are essential—even a rollback must work with any data written by the new version.
Blue-green deployments require maintaining two production environments, which has significant cost implications. However, there are strategies to minimize this overhead while retaining the benefits.
| Strategy | Cost Savings | Trade-off |
|---|---|---|
| Scale idle to zero (serverless) | Up to 50% | Longer pre-switch warm-up needed |
| Scale idle to minimum | 30-40% | Slower scale-up during switch |
| Use spot/preemptible for idle | 20-30% | May need to replace instances |
| Smaller instance types for idle | 20-30% | Scale up before switch |
| Time-limited retention | Variable | Reduces rollback window |
Cost calculation example:
Assuming a production workload requiring 10 instances at $100/month each:
| Approach | Active Env | Idle Env | Monthly Cost | Premium |
|---|---|---|---|---|
| Full capacity both | 10 × $100 | 10 × $100 | $2,000 | +100% |
| Idle at 20% capacity | 10 × $100 | 2 × $100 | $1,200 | +20% |
| Idle scaled to zero | 10 × $100 | $0 (+ scale-up) | $1,050 | +5% |
| Single env (no B/G) | 10 × $100 | N/A | $1,000 | Baseline |
The 5-20% premium for blue-green capability is often justified by reduced downtime risk and faster rollback capability.
```yaml
# Kubernetes HPA for idle environment
# Keeps minimal capacity until pre-switch scale-up
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-green-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-green
  # Idle: maintain minimal capacity
  # Active: scale based on load
  minReplicas: 2   # Minimal for idle
  maxReplicas: 20  # Full capacity for active
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    # Scale up quickly for traffic switch
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
    # Scale down slowly after becoming idle
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
```

If the idle environment is scaled down, include a warm-up phase before switching. Scale up the idle environment, run health checks and load tests, wait for caches to populate and connections to establish, then switch traffic. This adds 5-15 minutes to the deployment process but prevents cold-start issues.
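A warm-up sequence can be scripted as a thin wrapper around the pieces already shown. In this sketch the deployment name matches the earlier manifests, and `pre-switch-validation.sh` is a hypothetical name standing in for the validation script above:

```bash
#!/bin/bash
# Pre-switch warm-up for a scaled-down idle environment - a sketch.
# Deployment name matches the earlier manifests; the validation
# script name is a placeholder for the script shown above.
set -euo pipefail

# 1. Bring the idle environment up to full capacity
kubectl scale deployment app-green --replicas=10

# 2. Wait until every replica reports ready
kubectl rollout status deployment/app-green --timeout=10m

# 3. Give caches and connection pools time to warm up, optionally
#    while synthetic traffic runs against the test endpoint
sleep 120

# 4. Run the full pre-switch validation, then switch
./pre-switch-validation.sh
./switch-traffic.sh green
```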
Blue-green deployments are powerful but not universally applicable. Understanding when they provide value—and when alternatives are better—is essential for choosing the right deployment strategy.
| Criteria | Rolling | Blue-Green | Canary |
|---|---|---|---|
| Resource overhead | Low (surge capacity) | High (2x environments) | Low-Medium |
| Rollback speed | Slow (redeploy) | Instant (switch) | Fast (routing change) |
| Test before live | No | Yes (test endpoint) | Yes (% traffic) |
| Mixed version duration | During rollout | None (atomic switch) | Extended (canary period) |
| Complexity | Low | Medium | High |
| Best for | Most workloads | Critical services | High-risk releases |
Many organizations use blue-green as the environment strategy but combine it with canary releases within each environment. Deploy to green, canary 5% of traffic, gradually increase, then switch. This provides the best of both worlds.
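On AWS, for example, this hybrid can be sketched with the ALB's weighted target groups—the listener and target-group variables below correspond to the Terraform resources shown earlier, and the weights are illustrative:

```bash
# Send 5% of production traffic to the green target group while blue
# serves the rest. Raise the green weight in steps, then finish with
# a 0/100 split to complete the switch.
aws elbv2 modify-listener \
  --listener-arn "$LISTENER_ARN" \
  --default-actions '[{
    "Type": "forward",
    "ForwardConfig": {
      "TargetGroups": [
        {"TargetGroupArn": "'"$BLUE_TG"'", "Weight": 95},
        {"TargetGroupArn": "'"$GREEN_TG"'", "Weight": 5}
      ]
    }
  }]'
```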
Blue-green deployments provide the gold standard for deployment safety—instant switchover, instant rollback, and the ability to validate in production before going live. The price is a second production environment, a hard requirement for backward-compatible database migrations, and careful session handling around the switch.
What's next:
Blue-green provides atomic switching but doesn't allow testing with real production traffic before full commitment. In the next page, we'll explore canary deployments, which route a small percentage of traffic to the new version while monitoring for issues—enabling gradual, low-risk releases.
You now understand blue-green deployments at a comprehensive level—from architecture and traffic switching to database synchronization and cost optimization. This knowledge applies to any cloud platform or orchestration system.