Every deployment strategy is incomplete without a well-designed rollback strategy. No matter how thorough your testing, how gradual your rollout, or how sophisticated your canary analysis—production issues will occur. The difference between a minor incident and a major outage often comes down to how quickly and reliably you can roll back.
Rollback is not a single action but a spectrum of techniques applicable to different failure scenarios. Understanding when to use each technique, and having practiced procedures ready, is essential for operating production systems reliably.
By the end of this page, you will understand rollback techniques for different layers (application, database, configuration), automated vs. manual rollback triggers, rollback limitations and edge cases, and how to design systems that are rollback-friendly. You'll be equipped to handle any rollback scenario with confidence.
A rollback is the process of reverting a system to a previous known-good state after a deployment introduces problems. While the concept sounds simple, rollback involves multiple layers and trade-offs.
What can be rolled back:
| Layer | Rollback Method | Typical Time | Complexity | Risk |
|---|---|---|---|---|
| Application code | Redeploy previous version | Minutes | Low | Low |
| Feature flags | Toggle flag to previous state | Seconds | Very Low | Very Low |
| Infrastructure config | Apply previous IaC state | Minutes to hours | Medium | Medium |
| Database schema | Reverse migration (if possible) | Minutes to hours | High | High |
| Data content | Restore from backup | Hours | Very High | Very High |
| External integrations | Revert API contracts | Variable | Very High | Very High |
The rollback decision framework:
Not every issue warrants rollback. The decision depends on:
| Factor | Roll Forward | Roll Back |
|---|---|---|
| Severity | Minor, affects few users | Major, affects many users |
| Root cause | Known, fix is simple | Unknown or complex to fix |
| Fix time | Minutes | Hours or longer |
| Blast radius | Limited | Widespread |
| Data integrity | No risk | Potential data corruption |
| Customer impact | Low annoyance | Revenue/trust impact |
When in doubt, roll back first to stop the bleeding, then investigate in a safe environment. Pride should never delay user recovery. A fast rollback followed by a proper fix always beats an extended outage while debugging in production.
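The decision table above can be encoded as a simple heuristic. This is an illustrative sketch only; the field names and thresholds are hypothetical, not a standard API, and a real runbook would weigh these factors with human judgment:

```typescript
// Hypothetical helper encoding the roll-forward vs. roll-back factors above.
// Field names and thresholds are illustrative examples.

interface IncidentAssessment {
  severityMajor: boolean;       // affects many users
  rootCauseUnknown: boolean;    // cause not yet identified
  fixTimeHours: number;         // estimated time to roll forward with a fix
  blastRadiusWide: boolean;     // widespread impact
  dataIntegrityAtRisk: boolean; // potential data corruption
}

function shouldRollBack(a: IncidentAssessment): boolean {
  // Data integrity risk or an unknown root cause: roll back immediately.
  if (a.dataIntegrityAtRisk || a.rootCauseUnknown) return true;
  // Major issues that are widespread or slow to fix: roll back.
  if (a.severityMajor && (a.blastRadiusWide || a.fixTimeHours >= 1)) return true;
  // Otherwise a quick roll-forward is acceptable.
  return false;
}
```

Note how the heuristic is biased toward rolling back: any uncertainty about the root cause or data integrity short-circuits to "roll back", matching the "stop the bleeding first" advice.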
Application rollback is the most common and typically safest rollback operation. It involves replacing running application instances with the previous version.
`kubectl rollout undo` reverts a Deployment to its previous ReplicaSet. Because the previous version's container images are usually still cached on the nodes, this is the fastest redeploy-based rollback method.
```bash
#!/bin/bash
# Kubernetes Application Rollback Procedures

set -euo pipefail

DEPLOYMENT="payment-service"
NAMESPACE="production"

echo "═══════════════════════════════════════════════════════════"
echo "🔄 Initiating rollback for $DEPLOYMENT"
echo "═══════════════════════════════════════════════════════════"

# Step 1: Check current status
echo "Current deployment status:"
kubectl -n "$NAMESPACE" get deployment "$DEPLOYMENT" -o wide

# Step 2: View rollout history
echo ""
echo "Rollout history:"
kubectl -n "$NAMESPACE" rollout history deployment/"$DEPLOYMENT"

# Step 3: Check what will be rolled back to
# (the current revision is stored in the deployment.kubernetes.io/revision annotation)
CURRENT_REVISION=$(kubectl -n "$NAMESPACE" get deployment "$DEPLOYMENT" \
  -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')
PREVIOUS_REVISION=$((CURRENT_REVISION - 1))
echo ""
echo "Will roll back from revision $CURRENT_REVISION to revision $PREVIOUS_REVISION"

# Step 4: Get details of previous revision
echo ""
echo "Previous revision details:"
kubectl -n "$NAMESPACE" rollout history deployment/"$DEPLOYMENT" --revision="$PREVIOUS_REVISION"

# Step 5: Perform the rollback
echo ""
echo "Executing rollback..."
kubectl -n "$NAMESPACE" rollout undo deployment/"$DEPLOYMENT"

# Step 6: Monitor rollback progress
echo ""
echo "Monitoring rollback progress..."
kubectl -n "$NAMESPACE" rollout status deployment/"$DEPLOYMENT" --timeout=300s

# Step 7: Verify rollback
echo ""
echo "Verifying rollback:"
kubectl -n "$NAMESPACE" get deployment "$DEPLOYMENT" -o wide

# Step 8: Check pod status
echo ""
echo "Pod status after rollback:"
kubectl -n "$NAMESPACE" get pods -l app="$DEPLOYMENT" -o wide

# Step 9: Notify team
echo ""
echo "═══════════════════════════════════════════════════════════"
echo "✅ Rollback complete!"
echo "═══════════════════════════════════════════════════════════"
echo ""
echo "Post-rollback actions:"
echo "1. Verify application health: kubectl -n $NAMESPACE logs -l app=$DEPLOYMENT --tail=50"
echo "2. Check error rates in monitoring dashboard"
echo "3. Notify incident channel"
echo "4. Create post-mortem ticket"
```

Rollback to a specific revision:
Sometimes you need to roll back more than one version—especially if multiple deployments occurred before an issue was detected.
```bash
# List all revisions with details
kubectl rollout history deployment/payment-service
# REVISION  CHANGE-CAUSE
# 1         Initial deployment
# 2         Feature: Add retry logic
# 3         Bugfix: Fix timeout handling
# 4         Feature: New payment provider (BROKEN)
# 5         Hotfix attempt (STILL BROKEN)

# Roll back to revision 3 (last working version)
kubectl rollout undo deployment/payment-service --to-revision=3
```
Kubernetes keeps a limited number of revisions (default 10, controlled by revisionHistoryLimit). If you need to roll back further than history allows, you'll need to redeploy the old image tag explicitly. Keep sufficient history for your rollback windows.
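Raising the retained history is a one-line change on the Deployment spec. A minimal sketch (the limit of 25 is just an example; pick a value that covers your longest realistic rollback window):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  revisionHistoryLimit: 25  # default is 10; old ReplicaSets beyond this are garbage-collected
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: payment-service:v2.0.0
```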
Database rollback is the most challenging aspect of deployment rollback. Unlike application code—which can be replaced with previous versions—database changes may be irreversible or have complex dependencies.
The three types of database rollback:
| Change Type | Rollback Possible? | Rollback Method | Data Loss Risk |
|---|---|---|---|
| Add nullable column | ✅ Yes | DROP COLUMN | None (column is empty or nullable) |
| Add column with default | ✅ Yes | DROP COLUMN | Loses any values written to the column |
| Add table | ✅ Yes | DROP TABLE | Loses any data written |
| Add index | ✅ Yes | DROP INDEX | None |
| Drop column | ❌ No | Restore from backup | Data already lost |
| Drop table | ❌ No | Restore from backup | Data already lost |
| Rename column | ⚠️ Partial | Rename back | None if no app changes |
| Change column type | ⚠️ Partial | Change back (may fail) | Possible if the conversion is lossy |
| Add NOT NULL constraint | ✅ Yes | DROP CONSTRAINT | None |
```sql
-- Example: Reversible migration pattern
-- Each migration has explicit UP and DOWN operations

-- ========================================
-- Migration: V20240115_Add_User_Preferences
-- ========================================

-- UP (forward migration)
CREATE TABLE user_preferences (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    user_id BIGINT NOT NULL,
    preference_key VARCHAR(100) NOT NULL,
    preference_value TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE,
    UNIQUE KEY uk_user_preference (user_id, preference_key)
);

CREATE INDEX idx_preferences_user ON user_preferences(user_id);

-- DOWN (rollback migration)
-- Execute this if the deployment fails and rollback is needed
DROP INDEX idx_preferences_user ON user_preferences;
DROP TABLE user_preferences;

-- ========================================
-- Migration: V20240116_Add_Email_Verified_Column
-- ========================================

-- UP
ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT FALSE;

-- Update existing users based on some criteria
UPDATE users SET email_verified = TRUE WHERE confirmed_at IS NOT NULL;

-- DOWN
-- NOTE: This loses the email_verified data
ALTER TABLE users DROP COLUMN email_verified;

-- ========================================
-- IRREVERSIBLE Migration Example
-- Migration: V20240117_Remove_Legacy_Column
-- ========================================

-- UP
-- First, verify no application code references this column
ALTER TABLE users DROP COLUMN legacy_field;

-- DOWN
-- IRREVERSIBLE: Cannot restore data that was in legacy_field
-- Requires backup restoration if rollback is needed
--
-- Procedure if rollback needed:
-- 1. Restore full database from backup
-- 2. OR: Re-add column (but data is lost)
--    ALTER TABLE users ADD COLUMN legacy_field VARCHAR(255);
--
-- IMPORTANT: Before running this migration:
-- - Ensure a full recent backup exists
-- - Ensure no rollback will be needed (migration has been in staging)
-- - Consider keeping the column as deprecated instead of dropping it
```

For severe data issues, most databases support point-in-time recovery (PITR), restoring to a specific timestamp. But PITR rolls back ALL changes after that point, not just the problematic ones. Use it only when other options are exhausted, and expect data loss for anything written after the recovery point.
Configuration changes—infrastructure, secrets, environment variables—can cause outages just as easily as code changes. A robust configuration rollback strategy is essential.
```bash
#!/bin/bash
# Configuration Rollback Procedures

# ========================================
# Kubernetes ConfigMap Rollback
# ========================================

# ConfigMaps don't have built-in versioning, but you can use labels

# Current approach: version ConfigMaps with timestamps
kubectl create configmap app-config-v20240115 \
  --from-file=config.yaml \
  -o yaml --dry-run=client | kubectl apply -f -

# Update deployment to use specific ConfigMap version
kubectl patch deployment payment-service \
  -p '{"spec":{"template":{"spec":{"volumes":[{"name":"config","configMap":{"name":"app-config-v20240115"}}]}}}}'

# To roll back: switch to the previous ConfigMap version
kubectl patch deployment payment-service \
  -p '{"spec":{"template":{"spec":{"volumes":[{"name":"config","configMap":{"name":"app-config-v20240114"}}]}}}}'

# ========================================
# AWS Parameter Store Rollback
# ========================================

# Parameter Store maintains version history automatically

# View parameter history
aws ssm get-parameter-history \
  --name /production/payment-service/api-key \
  --output table

# Get previous version value
PREVIOUS_VALUE=$(aws ssm get-parameter \
  --name /production/payment-service/api-key:2 \
  --with-decryption \
  --query 'Parameter.Value' \
  --output text)

# Roll back by setting parameter to previous value
aws ssm put-parameter \
  --name /production/payment-service/api-key \
  --value "$PREVIOUS_VALUE" \
  --type SecureString \
  --overwrite

# ========================================
# Terraform Infrastructure Rollback
# ========================================

# Option 1: Git revert and re-apply (preferred)
cd terraform/
git log --oneline -5
# abc123 Update API gateway config (BROKEN)
# def456 Add new Lambda function
# ghi789 Previous working state

git revert abc123   # Creates revert commit
terraform plan      # Review changes
terraform apply     # Apply rollback

# Option 2: Apply previous state file (risky, not recommended)
# This can cause state drift and unexpected behavior
# Only use if Git history is not available

# ========================================
# Feature Flag Rollback (Instant)
# ========================================

# LaunchDarkly CLI example
ldcli flags update \
  --project production \
  --key new-payment-flow \
  --set-toggle-off

# Generic API call
curl -X PATCH https://flags.internal/api/flags/new-payment-flow \
  -H "Authorization: Bearer $FLAG_API_KEY" \
  -d '{"enabled": false}'
```

The primary benefit of Infrastructure as Code isn't automation; it's rollback capability. If all infrastructure changes go through version-controlled code and a deployment pipeline, you can always revert to any previous state.
The fastest rollback is one that happens automatically without human intervention. Automated rollback systems monitor key metrics and trigger rollback when thresholds are breached.
```yaml
# Argo Rollouts with automatic rollback on failure
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 10
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: payment-service:v2.0.0
  # How many analysis runs to retain for debugging
  analysis:
    successfulRunHistoryLimit: 3
    unsuccessfulRunHistoryLimit: 3
  strategy:
    canary:
      steps:
        - setWeight: 10
        - analysis:
            templates:
              - templateName: success-rate-check
            args:
              - name: service
                value: payment-service-canary
        - setWeight: 50
        - analysis:
            templates:
              - templateName: success-rate-check
        - setWeight: 100
      # Steps promote automatically when analysis passes; if analysis
      # fails, the rollout aborts and traffic returns to the stable version
      abortScaleDownDelaySeconds: 30
---
# Analysis template that triggers rollback
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-check
spec:
  args:
    - name: service
  metrics:
    - name: success-rate
      interval: 30s
      # If this condition fails 3 times, rollback is triggered
      successCondition: result[0] >= 0.99
      failureCondition: result[0] < 0.95  # Immediate failure threshold
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service}}",
              status!~"5.."
            }[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service}}"
            }[2m]))
```

Kubernetes native rollout-failure detection:
Kubernetes Deployments do not roll back automatically, but they do detect stalled rollouts: a Deployment that fails to make progress within its deadline is marked as failed, and your CI/CD tooling can use that signal to trigger a rollback.
```yaml
spec:
  progressDeadlineSeconds: 600  # 10 minutes
  minReadySeconds: 30
```
Once progressDeadlineSeconds is exceeded, the Deployment is marked as failed (its Progressing condition becomes False with reason ProgressDeadlineExceeded), and automation can respond by running `kubectl rollout undo`.

Automated rollback is excellent for clear-cut failures (crash loops, error spikes) but dangerous for subtle issues. A slow memory leak or gradual performance degradation may not trigger thresholds. Always have a human review non-obvious issues.
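One way to catch such gradual regressions is a trend check over a longer metric window rather than a fixed threshold. The sketch below is illustrative only; the window contents and slope threshold are arbitrary example values, not a standard monitoring API:

```typescript
// Illustrative sketch: fixed thresholds miss gradual regressions, so compute
// a least-squares trend over recent samples and flag steady growth for
// human review. Threshold values here are arbitrary examples.

function trendSlope(samples: number[]): number {
  // Least-squares slope of samples against their indices (units per sample).
  const n = samples.length;
  const meanX = (n - 1) / 2;
  const meanY = samples.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - meanX) * (samples[i] - meanY);
    den += (i - meanX) * (i - meanX);
  }
  return num / den;
}

function flagGradualRegression(
  memoryMb: number[],
  maxSlopeMbPerSample = 0.5
): boolean {
  // A steady upward slope suggests a leak even if no absolute limit fires.
  return trendSlope(memoryMb) > maxSlopeMbPerSample;
}
```

A detector like this should page a human rather than trigger rollback directly, since slow drifts often have benign explanations (caches warming, traffic growth).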
Some deployments involve coordinated changes across multiple systems—application code, database schema, message formats, external APIs. Rolling back such deployments requires careful coordination.
Multi-service rollback procedure:
When multiple services were deployed together and one fails, you may need coordinated rollback:
Scenario: Service A v2.0 → v2.1, Service B v3.0 → v3.1
Issue discovered in B v3.1 that requires A v2.0
Rollback sequence:
1. Identify dependency graph
A v2.1 → B v3.1 (new versions depend on each other)
A v2.0 → B v3.0 (old versions depend on each other)
A v2.0 ⊗ B v3.1 (incompatible!)
A v2.1 → B v3.0 (check if compatible)
2. If A v2.1 → B v3.0 is compatible:
- Roll back B first (B v3.1 → B v3.0)
- A v2.1 continues working with B v3.0
- Optional: roll back A if needed
3. If A v2.1 ⊗ B v3.0 (incompatible):
- Roll back both simultaneously
- Use feature flag to disable the feature in A
- Roll back A, then B
- Coordinate timing carefully
```bash
#!/bin/bash
# Coordinated multi-service rollback procedure

set -euo pipefail

# Step 1: Disable the feature flag to stop new traffic to broken functionality
echo "Step 1: Disabling feature flag..."
curl -X PATCH https://flags.internal/api/flags/new-payment-flow \
  -H "Authorization: Bearer $FLAG_API_KEY" \
  -d '{"enabled": false}'

echo "Waiting for flag propagation..."
sleep 10

# Step 2: Roll back the dependent service first (consumer)
echo "Step 2: Rolling back payment-service..."
kubectl -n production rollout undo deployment/payment-service
kubectl -n production rollout status deployment/payment-service --timeout=300s

# Step 3: Roll back the upstream service (producer)
echo "Step 3: Rolling back order-service..."
kubectl -n production rollout undo deployment/order-service
kubectl -n production rollout status deployment/order-service --timeout=300s

# Step 4: Verify both services are healthy
echo "Step 4: Verifying service health..."

PAYMENT_HEALTH=$(curl -s -o /dev/null -w "%{http_code}" https://payment.internal/health)
ORDER_HEALTH=$(curl -s -o /dev/null -w "%{http_code}" https://order.internal/health)

if [[ "$PAYMENT_HEALTH" == "200" && "$ORDER_HEALTH" == "200" ]]; then
  echo "✅ Both services healthy after rollback"
else
  echo "❌ Health check failed: payment=$PAYMENT_HEALTH, order=$ORDER_HEALTH"
  exit 1
fi

# Step 5: Verify cross-service communication
echo "Step 5: Running integration test..."
npm run test:integration:payment-order

# Step 6: Document the rollback
echo "Step 6: Creating incident record..."
curl -X POST https://incidents.internal/api/incidents \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Coordinated rollback: payment-service, order-service",
    "severity": "P2",
    "services": ["payment-service", "order-service"],
    "rolled_back_by": "'"$USER"'",
    "timestamp": "'"$(date -Iseconds)"'"
  }'

echo ""
echo "═══════════════════════════════════════════════════════════"
echo "✅ Coordinated rollback complete"
echo "═══════════════════════════════════════════════════════════"
```

The best way to handle coordinated rollback is to avoid needing it. Design services to be backward and forward compatible. Deploy changes that are safe to run with adjacent services at any version. This makes individual rollbacks safe.
Not everything can be rolled back. Understanding these limitations is crucial for both deployment planning and incident response.
Rollback doesn't restore lost time:
| Scenario | Rollback Fixes | Rollback Doesn't Fix |
|---|---|---|
| Error spike | Error rate returns to normal | Users who got errors already impacted |
| Wrong prices | Prices corrected | Orders placed at wrong prices exist |
| Email sent | Stops sending more | Emails already sent are out there |
| Data corruption | New code won't corrupt more | Existing corrupt data still exists |
| Performance regression | Performance restored | Time users lost to slowness is gone |
For each deployment, ask: 'What's the worst that can happen, and can we roll back from it?' If the answer is 'no,' you need additional safeguards: feature flags, canary with very slow progression, explicit backup verification, or postponing the change.
Partial rollback strategies for edge cases:
Scenario: Data corruption discovered 2 hours after deployment
Options:
1. Full database restore (loses 2 hours of all data)
2. Selective data repair (script to fix affected records)
3. Compensating transactions (create correction records)
4. Hybrid: restore to backup, replay valid transactions
Decision factors:
- How many records affected?
- Is affected data identifiable?
- Can valid transactions be replayed?
- What's the cost of data loss vs repair effort?
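Option 3 (compensating transactions) can be sketched as follows. The record shape, field names, and corruption predicate are hypothetical, invented for illustration; a real system would also record who issued the correction and why:

```typescript
// Hypothetical sketch of compensating transactions: instead of mutating or
// restoring corrupted records, append correction records that reverse the
// bad effect while preserving the audit trail.

interface LedgerEntry {
  id: string;
  accountId: string;
  amountCents: number;
  correctionOf?: string; // links a compensating entry to the entry it reverses
}

// Build compensating entries for every entry matching a predicate that
// identifies records written by the broken deployment.
function buildCompensations(
  entries: LedgerEntry[],
  isCorrupt: (e: LedgerEntry) => boolean
): LedgerEntry[] {
  return entries
    .filter((e) => isCorrupt(e) && !e.correctionOf)
    .map((e) => ({
      id: `corr-${e.id}`,
      accountId: e.accountId,
      amountCents: -e.amountCents, // reverse the effect
      correctionOf: e.id,
    }));
}
```

This only works when affected data is identifiable and the reversal rule is well-defined, which is exactly what the decision factors above probe.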
The best rollback strategy is a system designed to make rollback safe and simple from the start. This requires upfront investment but pays dividends in operational safety.
```typescript
// Rollback-Friendly Design Patterns

// ========================================
// Pattern 1: Version-Aware Message Handling
// ========================================

interface OrderMessage {
  version: 1 | 2 | 3;
  orderId: string;
  // Version-specific fields
  [key: string]: unknown;
}

class OrderProcessor {
  async processOrder(message: OrderMessage): Promise<void> {
    // Handle all known versions
    switch (message.version) {
      case 1:
        return this.processV1(message as OrderV1);
      case 2:
        return this.processV2(message as OrderV2);
      case 3:
        return this.processV3(message as OrderV3);
      default:
        // Forward compatibility: log and skip unknown versions
        console.warn(`Unknown message version: ${message.version}`);
        return;
    }
  }

  private async processV1(order: OrderV1): Promise<void> {
    // Legacy processing
  }

  private async processV2(order: OrderV2): Promise<void> {
    // V2 added shipping_address as a separate field
    const shippingAddress = order.shipping_address ?? order.address;
    // Process with shipping address
  }

  private async processV3(order: OrderV3): Promise<void> {
    // V3 added gift options
    const giftOptions = order.gift_options ?? { isGift: false };
    // Process with gift options
  }
}

// ========================================
// Pattern 2: Idempotent Operations
// ========================================

class PaymentService {
  async processPayment(request: PaymentRequest): Promise<PaymentResult> {
    // Check if this exact payment was already processed
    const existingPayment = await this.db.payments.findUnique({
      where: { idempotencyKey: request.idempotencyKey }
    });

    if (existingPayment) {
      // Already processed - return the previous result
      // Safe even if this is a retry after rollback
      return {
        success: true,
        paymentId: existingPayment.id,
        cached: true
      };
    }

    // Process new payment
    const payment = await this.chargeCard(request);

    // Store with idempotency key
    await this.db.payments.create({
      data: { ...payment, idempotencyKey: request.idempotencyKey }
    });

    return { success: true, paymentId: payment.id, cached: false };
  }
}

// ========================================
// Pattern 3: Graceful Degradation
// ========================================

class RecommendationService {
  async getRecommendations(userId: string): Promise<Product[]> {
    try {
      // Try new ML-based recommendations (v2 feature)
      if (await this.flags.isEnabled('ml-recommendations', userId)) {
        return await this.mlRecommendations.getForUser(userId);
      }
    } catch (error) {
      // ML service down - log and fall through to fallback
      console.error('ML recommendations failed:', error);
    }

    try {
      // Try collaborative filtering (v1 feature)
      return await this.collaborativeFiltering.getForUser(userId);
    } catch (error) {
      console.error('Collaborative filtering failed:', error);
    }

    // Ultimate fallback: popular products (always works)
    return await this.getPopularProducts();
  }
}
```

Add "rollback safety" as an explicit review criterion in code reviews. Ask: "If we need to roll this back, what breaks?" Any change that isn't rollback-safe needs a documented mitigation strategy.
Rollback capability is the safety net that makes aggressive deployment possible. Without reliable rollback, every deployment is a high-stakes gamble.
Module complete:
You've now mastered the five core deployment strategies covered in this module.
Together, these strategies form a complete toolkit for releasing software reliably at any scale.
You now have comprehensive knowledge of production deployment strategies. From rolling updates to canary analysis to instant rollback, you're equipped to release software safely and confidently in any environment. Apply these strategies based on your system's risk profile, traffic patterns, and operational maturity.