The moment of truth in any infrastructure pipeline is deployment—when configuration changes become reality. Unlike application deployments where failures might affect a single service, infrastructure failures can cascade across entire systems. A misconfigured VPC can isolate all services. A broken IAM role can halt all deployments. A deleted database can lose customer data.
Deployment strategies for infrastructure minimize these risks through progressive rollouts, automated verification, and robust rollback capabilities. The goal is to deploy with confidence: knowing that issues will be caught early, blast radius will be contained, and recovery will be swift.
By the end of this page, you will understand progressive deployment strategies for infrastructure, how to structure environment promotion, techniques for verifying deployments, rollback patterns and their limitations, and strategies for handling stateful resources. You will be equipped to design deployment workflows that balance velocity with safety.
Infrastructure deployment differs fundamentally from application deployment. Understanding these differences is essential for designing appropriate deployment strategies.
Key Differences from Application Deployments:
| Aspect | Application Deployment | Infrastructure Deployment |
|---|---|---|
| Unit of Change | Versioned artifact (container, binary) | Configuration diff against current state |
| Rollback Speed | Seconds (switch to previous version) | Minutes to hours (resources must be recreated) |
| State Complexity | Often stateless (state in databases) | Inherently stateful (resources exist in cloud) |
| Blast Radius | Limited to that service's traffic | Can affect all dependent services |
| Canary Capability | Route percentage of traffic to new version | Limited—infrastructure changes are binary |
| Reversibility | Usually fully reversible | Some changes irreversible (data deletion) |
The Core Infrastructure Deployment Problem:
Infrastructure changes often have all-or-nothing semantics. When you modify a security group rule, it applies to all instances immediately. When you change a VPC route, all traffic is affected at once. There's no straightforward way to 'send 5% of traffic' through a new infrastructure configuration.
This constraint shapes infrastructure deployment strategies:
When terraform apply runs, it executes a series of API calls that modify cloud resources sequentially. If the apply fails partway through, you're left in a partially modified state. The plan was validated before apply, but the actual state may now differ from both the original and intended states. Design your strategies to handle partial failures.
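One way to script that recovery posture: `terraform plan -detailed-exitcode` documents exit code 0 for no changes, 1 for an error, and 2 for pending changes, so after a failed apply a non-zero diff means the real infrastructure still differs from the intended configuration. The helper name below (`classify_plan_exit`) is illustrative, not a standard tool.

```shell
#!/bin/bash
# Sketch: classify the outcome of `terraform plan -detailed-exitcode`
# after a failed apply, to decide how a recovery job should proceed.
classify_plan_exit() {
  case "$1" in
    0) echo "converged" ;;       # state matches config; the failure left no diff
    2) echo "diff-remaining" ;;  # partial apply: re-plan, review, re-apply
    *) echo "plan-error" ;;      # plan itself failed: investigate before retrying
  esac
}

# In a real recovery script (requires terraform + an initialized directory):
#   terraform plan -detailed-exitcode -out=recovery.tfplan
#   STATUS=$(classify_plan_exit $?)
classify_plan_exit 2
```

A recovery job that sees `diff-remaining` should re-plan and have a human review the new plan, since it now describes the path from the partially modified state, not from the original one.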
The most fundamental infrastructure deployment strategy is environment promotion: deploying changes through progressively more critical environments, validating at each stage before proceeding.
The Standard Promotion Path:
```yaml
name: Infrastructure Deployment Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'infrastructure/**'

jobs:
  # Stage 1: Deploy to Development
  deploy-dev:
    name: Deploy to Development
    runs-on: ubuntu-latest
    environment: development
    outputs:
      applied: ${{ steps.apply.outputs.applied }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: Deploy to Dev
        id: apply
        run: |
          cd infrastructure/environments/dev
          terraform init
          terraform apply -auto-approve
          echo "applied=true" >> $GITHUB_OUTPUT

      - name: Verify Dev Deployment
        run: |
          ./scripts/verify-environment.sh dev

  # Stage 2: Wait and Validate in Dev
  validate-dev:
    name: Validate Development
    needs: deploy-dev
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Integration Tests
        run: |
          cd test/integration
          go test -v -tags=dev ./...

      - name: Check Metrics (5 minute soak)
        run: |
          # Wait 5 minutes, checking error rates
          for i in {1..10}; do
            ERROR_RATE=$(./scripts/check-error-rate.sh dev)
            if [ "$ERROR_RATE" -gt 1 ]; then
              echo "Error rate too high: $ERROR_RATE%"
              exit 1
            fi
            sleep 30
          done

  # Stage 3: Deploy to Staging
  deploy-staging:
    name: Deploy to Staging
    needs: validate-dev
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: Deploy to Staging
        run: |
          cd infrastructure/environments/staging
          terraform init
          terraform apply -auto-approve

      - name: Verify Staging Deployment
        run: |
          ./scripts/verify-environment.sh staging

  # Stage 4: Extended Validation in Staging
  validate-staging:
    name: Validate Staging
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Full Test Suite
        run: |
          cd test/integration
          go test -v -tags=staging -timeout 30m ./...

      - name: Extended Soak (30 minutes)
        run: |
          # Monitor for 30 minutes
          START_TIME=$(date +%s)
          while [ $(($(date +%s) - START_TIME)) -lt 1800 ]; do
            ./scripts/health-check.sh staging || exit 1
            sleep 60
          done

  # Stage 5: Production Deployment (Manual Gate)
  deploy-production:
    name: Deploy to Production
    needs: validate-staging
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://app.company.com
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: Notify Deployment Starting
        run: |
          ./scripts/notify-slack.sh "🚀 Production deployment starting"

      - name: Deploy to Production
        run: |
          cd infrastructure/environments/production
          terraform init
          terraform apply -auto-approve

      - name: Comprehensive Verification
        run: |
          ./scripts/verify-production.sh

      - name: Notify Success
        if: success()
        run: |
          ./scripts/notify-slack.sh "✅ Production deployment complete"

      - name: Notify Failure
        if: failure()
        run: |
          ./scripts/notify-slack.sh "❌ Production deployment FAILED - investigate immediately"
```

The 'soak period' after staging deployment catches issues that only manifest under sustained load or over time. A change might work initially but cause memory leaks, connection exhaustion, or gradual degradation. Budget appropriate soak time based on your system's characteristics.
While infrastructure changes are often binary, some scenarios allow for progressive rollout—deploying changes incrementally to limit blast radius.
Where Progressive Rollout Works:
Multi-Region Progressive Deployment:
For organizations with multi-region infrastructure, deploying region-by-region provides natural progressive rollout:
```yaml
name: Multi-Region Progressive Deployment

on:
  workflow_dispatch:
    inputs:
      skip_canary_wait:
        description: 'Skip canary wait period (emergency only)'
        type: boolean
        default: false

jobs:
  # Wave 1: Canary region
  deploy-canary:
    name: Deploy Canary (us-west-2)
    runs-on: ubuntu-latest
    environment: production-usw2
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to us-west-2
        run: |
          cd infrastructure/environments/production
          terraform init -backend-config=backends/us-west-2.tfvars
          terraform apply -auto-approve -var="region=us-west-2"

      - name: Validate Canary
        run: |
          ./scripts/validate-region.sh us-west-2

  # Canary bake time
  canary-bake:
    name: Canary Validation Period
    needs: deploy-canary
    runs-on: ubuntu-latest
    if: ${{ !inputs.skip_canary_wait }}
    steps:
      - uses: actions/checkout@v4

      - name: Monitor Canary (1 hour)
        run: |
          END_TIME=$(($(date +%s) + 3600))
          while [ $(date +%s) -lt $END_TIME ]; do
            # Check error rates
            ERROR_RATE=$(./scripts/check-region-health.sh us-west-2)
            if [ "$ERROR_RATE" -gt 1 ]; then
              echo "Canary unhealthy, aborting rollout"
              exit 1
            fi
            echo "Canary healthy, waiting..."
            sleep 300
          done

  # Wave 2: Secondary regions
  deploy-wave2:
    name: Deploy Wave 2
    needs: [deploy-canary, canary-bake]
    if: always() && needs.deploy-canary.result == 'success' && (needs.canary-bake.result == 'success' || needs.canary-bake.result == 'skipped')
    runs-on: ubuntu-latest
    strategy:
      matrix:
        region: [us-east-1, eu-west-1]
      max-parallel: 2
    environment: production-${{ matrix.region }}
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to ${{ matrix.region }}
        run: |
          cd infrastructure/environments/production
          terraform init -backend-config=backends/${{ matrix.region }}.tfvars
          terraform apply -auto-approve -var="region=${{ matrix.region }}"

  # Wave 2 validation
  validate-wave2:
    name: Validate Wave 2
    needs: deploy-wave2
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Validate All Wave 2 Regions
        run: |
          for region in us-east-1 eu-west-1; do
            ./scripts/validate-region.sh $region || exit 1
          done

  # Wave 3: All remaining regions
  deploy-wave3:
    name: Deploy Wave 3
    needs: validate-wave2
    runs-on: ubuntu-latest
    strategy:
      matrix:
        region: [ap-northeast-1, eu-central-1, ap-southeast-1]
      max-parallel: 3
    environment: production-${{ matrix.region }}
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to ${{ matrix.region }}
        run: |
          cd infrastructure/environments/production
          terraform init -backend-config=backends/${{ matrix.region }}.tfvars
          terraform apply -auto-approve -var="region=${{ matrix.region }}"
```

Choose your canary region wisely: it should have real traffic (not synthetic) but not be your largest region. Issues caught in a smaller region have smaller blast radius. Many organizations use a mid-tier region like us-west-2 as the canary.
Blue-green deployment for infrastructure involves maintaining two parallel stacks and switching traffic between them. While expensive (double the resources), it provides the fastest rollback capability.
How Blue-Green Infrastructure Works:
```hcl
# Blue-Green Infrastructure Pattern

variable "active_color" {
  description = "Currently active stack: blue or green"
  type        = string
  default     = "blue"

  validation {
    condition     = contains(["blue", "green"], var.active_color)
    error_message = "Active color must be blue or green"
  }
}

variable "deploy_green" {
  description = "Whether to deploy the green stack"
  type        = bool
  default     = false
}

# Blue Stack (always exists unless actively decommissioned)
module "blue_stack" {
  source      = "./modules/application-stack"
  name        = "app-blue"
  environment = var.environment

  # Blue stack configuration
  instance_type = var.blue_instance_type
  ami_id        = var.blue_ami_id

  tags = {
    Color = "blue"
  }
}

# Green Stack (deployed when needed for updates)
module "green_stack" {
  source = "./modules/application-stack"
  count  = var.deploy_green ? 1 : 0

  name        = "app-green"
  environment = var.environment

  # Green stack configuration (new version)
  instance_type = var.green_instance_type
  ami_id        = var.green_ami_id

  tags = {
    Color = "green"
  }
}

# Traffic routing based on active color
resource "aws_lb_listener_rule" "app" {
  listener_arn = aws_lb_listener.main.arn
  priority     = 100

  action {
    type             = "forward"
    target_group_arn = var.active_color == "blue" ? module.blue_stack.target_group_arn : module.green_stack[0].target_group_arn
  }

  condition {
    path_pattern {
      values = ["/*"]
    }
  }
}

# DNS pointing to the active stack
resource "aws_route53_record" "app" {
  zone_id = var.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.main.dns_name
    zone_id                = aws_lb.main.zone_id
    evaluate_target_health = true
  }
}

# Outputs for verification
output "active_stack" {
  value = var.active_color
}

output "blue_stack_healthy" {
  value = module.blue_stack.health_check_passed
}

output "green_stack_healthy" {
  value = var.deploy_green ? module.green_stack[0].health_check_passed : null
}
```

The hardest part of blue-green is stateful resources, especially databases.
Options include: sharing a single database between stacks (limits what changes you can make), using read replicas that become primary on switch, or accepting brief downtime for database migration. Each has trade-offs.
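The cutover itself can be driven by a short script. This is a sketch only, assuming the `active_color` and `deploy_green` variables from the pattern above; the `idle_color` helper and the exact step ordering are illustrative, not part of the pattern as shown.

```shell
#!/bin/bash
# Sketch of a blue-green cutover driver.
set -e

# Pure helper: the stack we deploy the new version to is the one
# NOT currently serving traffic.
idle_color() {
  if [ "$1" = "blue" ]; then echo "green"; else echo "blue"; fi
}

ACTIVE="blue"                      # in practice, read from `terraform output active_stack`
TARGET=$(idle_color "$ACTIVE")
echo "Deploying new version to the $TARGET stack"

# The cutover is three separate applies (not executed in this sketch):
#   terraform apply -var="deploy_green=true"                               # 1. build idle stack
#   ./scripts/verify-environment.sh production                             # 2. verify it
#   terraform apply -var="active_color=$TARGET" -var="deploy_green=true"   # 3. switch traffic
# Rollback is a single apply flipping active_color back to $ACTIVE.
```

The key property is step 3: because traffic routing is a single variable, both cutover and rollback are small, fast, low-risk applies.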
When infrastructure deployments fail or cause issues, fast rollback is essential. However, infrastructure rollback is more complex than application rollback—you can't simply 'deploy the previous version' because infrastructure changes may have already modified resources.
Rollback Strategy Options:
| Strategy | How It Works | Speed | Limitations |
|---|---|---|---|
| Revert Commit | Apply previous configuration from Git | Minutes | May not work if resources were destroyed |
| State Rollback | Restore previous Terraform state file | Fast (state restore) | Doesn't change actual resources, causes drift |
| Blue-Green Switch | Route traffic to previous stack | Seconds | Requires maintaining parallel stacks |
| Forward Fix | Deploy a corrected version quickly | Minutes | Requires rapid diagnosis and fix development |
| Manual Remediation | Ops team fixes resources directly | Variable | Causes drift, needs post-incident cleanup |
The Revert Commit Approach:
The most straightforward rollback: revert the commit and apply again.
```bash
#!/bin/bash
# Infrastructure Rollback Script

set -e

ENVIRONMENT=${1:-production}
COMMIT_TO_REVERT=${2:-HEAD}

echo "🔄 Rolling back infrastructure for $ENVIRONMENT"
echo "   Reverting commit: $COMMIT_TO_REVERT"

# Step 1: Create rollback branch
ROLLBACK_BRANCH="rollback/$(date +%Y%m%d-%H%M%S)"
git checkout -b $ROLLBACK_BRANCH

# Step 2: Revert the problematic commit
git revert --no-commit $COMMIT_TO_REVERT

# Step 3: Generate plan to verify rollback
echo "📋 Generating rollback plan..."
cd infrastructure/environments/$ENVIRONMENT
terraform init
terraform plan -out=rollback.tfplan

# Step 4: Display plan summary
echo ""
echo "=== ROLLBACK PLAN SUMMARY ==="
terraform show rollback.tfplan | grep -E "^(#|Plan:)"

# Step 5: Confirm with operator
read -p "Apply this rollback? (yes/no): " CONFIRM
if [ "$CONFIRM" != "yes" ]; then
  echo "Rollback cancelled"
  exit 1
fi

# Step 6: Apply rollback
echo "🚀 Applying rollback..."
terraform apply rollback.tfplan

# Step 7: Verify
echo "✅ Verifying rollback..."
../../scripts/verify-environment.sh $ENVIRONMENT

# Step 8: Commit and push
git add .
git commit -m "Rollback: Revert $COMMIT_TO_REVERT for $ENVIRONMENT"
git push origin $ROLLBACK_BRANCH

echo "✅ Rollback complete. Branch: $ROLLBACK_BRANCH"
echo "   Create PR to merge rollback into main"
```

When Revert Doesn't Work:
Revert-and-apply won't work in every scenario: if the bad change destroyed a stateful resource, reverting recreates it empty rather than restoring its data; if the change was irreversible at the provider level (enabling encryption, for example, requires a new instance and data migration); or if a failed apply left resources half-modified, so the revert plan no longer matches reality.
In these cases, you may need a forward fix or manual intervention.
Before any production apply, snapshot the Terraform state file. If something goes catastrophically wrong, you can restore the state to a known good point. This doesn't change actual resources but lets you understand what Terraform thinks exists, which aids debugging.
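A minimal sketch of that snapshot step. `terraform state pull` and `terraform state push` are the real commands; the backup bucket name is a placeholder for illustration.

```shell
#!/bin/bash
# Sketch: snapshot remote Terraform state before a production apply.
set -e

# Timestamped snapshot filename, e.g. state-backup-20240115-143000.tfstate
SNAPSHOT="state-backup-$(date +%Y%m%d-%H%M%S).tfstate"

# In a real pipeline (requires an initialized working directory):
#   terraform state pull > "$SNAPSHOT"                                     # state as JSON
#   aws s3 cp "$SNAPSHOT" "s3://company-tf-state-backups/$SNAPSHOT"        # hypothetical bucket
# Restore later (with extreme care, as it overwrites remote state):
#   terraform state push "$SNAPSHOT"

echo "Would snapshot state to: $SNAPSHOT"
```

Because pushing a stale state can itself cause damage, treat the snapshot primarily as a debugging aid, restoring it only when you fully understand why state and reality diverged.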
Deployment isn't complete when terraform apply finishes—it's complete when you've verified the infrastructure is working correctly. Post-deployment verification catches issues that weren't apparent from the apply output.
Verification Layers:
```bash
#!/bin/bash
# Post-Deployment Verification Script

set -e

ENVIRONMENT=${1:-production}
MAX_RETRIES=10
RETRY_DELAY=30

echo "🔍 Verifying deployment for $ENVIRONMENT"

# Helper function for retries
retry() {
  local max=$1
  local delay=$2
  shift 2
  local count=0
  until "$@"; do
    count=$((count + 1))
    if [ $count -lt $max ]; then
      echo "  Attempt $count failed, retrying in $delay seconds..."
      sleep $delay
    else
      echo "  ❌ All $max attempts failed"
      return 1
    fi
  done
  return 0
}

# 1. Verify Terraform outputs exist
echo "📋 Checking Terraform outputs..."
cd infrastructure/environments/$ENVIRONMENT
terraform output -json > /tmp/tf_outputs.json

# Extract key outputs
VPC_ID=$(jq -r '.vpc_id.value' /tmp/tf_outputs.json)
ALB_DNS=$(jq -r '.alb_dns_name.value' /tmp/tf_outputs.json)
RDS_ENDPOINT=$(jq -r '.rds_endpoint.value' /tmp/tf_outputs.json)

echo "  VPC: $VPC_ID"
echo "  ALB: $ALB_DNS"
echo "  RDS: $RDS_ENDPOINT"

# 2. Verify VPC exists and is available
echo "🌐 Verifying VPC..."
VPC_STATE=$(aws ec2 describe-vpcs --vpc-ids $VPC_ID --query 'Vpcs[0].State' --output text)
[ "$VPC_STATE" == "available" ] || { echo "❌ VPC not available"; exit 1; }
echo "  ✅ VPC is available"

# 3. Verify ALB is healthy
# (inner $ and quotes are escaped so the command re-runs inside each retry)
echo "⚖️ Verifying Load Balancer..."
retry $MAX_RETRIES $RETRY_DELAY bash -c "
  ALB_STATE=\$(aws elbv2 describe-load-balancers \
    --query \"LoadBalancers[?DNSName=='$ALB_DNS'].State.Code\" \
    --output text)
  [ \"\$ALB_STATE\" == 'active' ]"
echo "  ✅ ALB is active"

# 4. Verify target group health
echo "🎯 Verifying Target Groups..."
# Look up the ALB ARN from its DNS name (needed for the target-group query)
ALB_ARN=$(aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?DNSName=='$ALB_DNS'].LoadBalancerArn" \
  --output text)
TG_ARNS=$(aws elbv2 describe-target-groups \
  --query "TargetGroups[?contains(LoadBalancerArns, '$ALB_ARN')].TargetGroupArn" \
  --output text)

for TG in $TG_ARNS; do
  HEALTHY=$(aws elbv2 describe-target-health --target-group-arn $TG \
    --query "TargetHealthDescriptions[?TargetHealth.State=='healthy'] | length(@)" \
    --output text)
  [ "$HEALTHY" -gt 0 ] || { echo "❌ No healthy targets in $TG"; exit 1; }
  echo "  ✅ $HEALTHY healthy targets in target group"
done

# 5. Verify RDS is available
echo "🗃️ Verifying Database..."
retry $MAX_RETRIES $RETRY_DELAY bash -c "
  RDS_STATUS=\$(aws rds describe-db-instances \
    --query \"DBInstances[?Endpoint.Address=='$RDS_ENDPOINT'].DBInstanceStatus\" \
    --output text)
  [ \"\$RDS_STATUS\" == 'available' ]"
echo "  ✅ RDS is available"

# 6. Run synthetic health check
echo "🧪 Running synthetic health check..."
retry $MAX_RETRIES $RETRY_DELAY curl -sf "https://$ALB_DNS/health"
echo "  ✅ Health endpoint responding"

# 7. Check error rates (last 5 minutes)
echo "📊 Checking error metrics..."
ERROR_COUNT=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Sum \
  --query 'Datapoints[0].Sum' \
  --output text)

if [ "$ERROR_COUNT" != "None" ] && [ "$ERROR_COUNT" -gt 10 ]; then
  echo "  ⚠️ Elevated error rate: $ERROR_COUNT 5xx errors in last 5 minutes"
  exit 1
fi
echo "  ✅ Error rates normal"

echo ""
echo "✅ All verification checks passed for $ENVIRONMENT"
```

Some issues only manifest under load or over time. Post-deployment verification should include not just immediate checks but also a monitoring period. Many teams require a 'bake time' of 15-60 minutes in production before considering a deployment complete.
Stateful resources—databases, storage, queues—require special care during deployments. Changes to these resources can be irreversible, and mistakes can result in data loss.
Stateful Resource Deployment Principles:
```hcl
# Protecting Stateful Resources

# RDS with maximum protection
resource "aws_db_instance" "main" {
  identifier = "production-db"
  # ... configuration ...

  # Protection against accidental deletion
  deletion_protection       = true
  skip_final_snapshot       = false
  final_snapshot_identifier = "production-db-final-${formatdate("YYYY-MM-DD-hhmm", timestamp())}"

  # Automated backups
  backup_retention_period = 30
  backup_window           = "03:00-04:00"

  # Prevent Terraform from ever deleting this instance
  lifecycle {
    prevent_destroy = true

    # Ignore changes that would cause replacement
    ignore_changes = [
      identifier,
      engine_version, # Handle upgrades separately
    ]
  }

  tags = {
    DataClassification = "production"
    BackupRequired     = "true"
  }
}

# S3 bucket with versioning and MFA delete
resource "aws_s3_bucket" "data" {
  bucket = "company-production-data"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id

  versioning_configuration {
    status     = "Enabled"
    mfa_delete = "Enabled" # Requires MFA to delete versions
  }
}

# Policy to prevent dangerous operations
resource "aws_iam_policy" "prevent_data_deletion" {
  name        = "prevent-production-data-deletion"
  description = "Denies deletion of production data resources"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "DenyDataDeletion"
        Effect = "Deny"
        Action = [
          "rds:DeleteDBInstance",
          "rds:DeleteDBCluster",
          "s3:DeleteBucket",
          "dynamodb:DeleteTable",
        ]
        Resource = [
          aws_db_instance.main.arn,
          aws_s3_bucket.data.arn,
        ]
      }
    ]
  })
}
```

Database-Specific Considerations:
| Change Type | Risk Level | Recommended Approach |
|---|---|---|
| Instance type change | Medium | Blue-green with replication |
| Engine version upgrade | High | Snapshot, test in staging, maintenance window |
| Parameter changes | Medium | Most apply without restart, verify compatibility |
| Storage increase | Low | Online expansion, no downtime |
| Encryption enable | Very High | Requires new instance and data migration |
| Multi-AZ enable | Low | Automatic failover capabilities added |
Some seemingly innocuous changes trigger resource replacement (destroy and create). For databases, this means data loss. Always review plans carefully for 'must be replaced' warnings. When in doubt, save the plan, render it with terraform show -json, and check the output for replacement actions before applying.
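A pipeline guard along those lines might look like this sketch. The plan JSON here is hand-written so the example runs standalone, but the `resource_changes[].change.actions` shape matches Terraform's JSON plan format, where a replacement appears as the action pair `["delete","create"]` (or `["create","delete"]` with create-before-destroy).

```shell
#!/bin/bash
# Sketch: block a deploy if the saved plan would replace any resource.
set -e

# Normally produced by:
#   terraform plan -out=plan.tfplan
#   terraform show -json plan.tfplan > plan.json
# A tiny hand-written plan.json keeps this sketch self-contained.
cat > plan.json <<'EOF'
{"resource_changes":[
  {"address":"aws_db_instance.main","change":{"actions":["delete","create"]}},
  {"address":"aws_s3_bucket.logs","change":{"actions":["update"]}}
]}
EOF

# Count lines containing a replacement action pair
# (a JSON-aware tool like jq is more robust than grep in production)
COUNT=$(grep -cE '\["delete","create"\]|\["create","delete"\]' plan.json || true)

if [ "$COUNT" -gt 0 ]; then
  echo "blocked: $COUNT resource(s) would be replaced"
else
  echo "ok: no replacements"
fi
```

In a real pipeline this check would run after `terraform plan` and before any approval gate, so reviewers see replacement warnings up front.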
Many organizations restrict when production changes can occur through change windows—designated times when deployments are permitted. This practice reduces risk by ensuring changes happen when support staff are available and user impact is minimized.
Implementing Change Windows:
| Change Type | Window | Requirements | Examples |
|---|---|---|---|
| Standard | Business hours, any day | Approved PR, tests pass | Tags, scaling policies, non-critical updates |
| Significant | Low-traffic hours, weekdays | Standard + additional reviewer | Security groups, IAM roles, new resources |
| High-Risk | Scheduled maintenance window | Change board approval, rollback tested | Database changes, network modifications |
| Emergency | Any time | Incident response, documented | Security patches, outage fixes |
```yaml
name: Production Deploy with Change Window

on:
  workflow_dispatch:
    inputs:
      bypass_change_window:
        description: 'Bypass change window (emergency only)'
        type: boolean
        default: false
      emergency_ticket:
        description: 'Emergency ticket number (required if bypassing)'
        type: string
        default: ''

jobs:
  check-change-window:
    name: Verify Change Window
    runs-on: ubuntu-latest
    outputs:
      allowed: ${{ steps.check.outputs.allowed }}
    steps:
      - name: Check Change Window
        id: check
        run: |
          # Get current time in UTC
          HOUR=$(date -u +%H)
          DAY=$(date -u +%u)  # 1=Monday, 7=Sunday

          # Define allowed windows (UTC)
          # Standard: Mon-Fri 14:00-18:00 UTC (9am-1pm EST)
          ALLOWED="false"
          if [ $DAY -ge 1 ] && [ $DAY -le 5 ]; then
            if [ $HOUR -ge 14 ] && [ $HOUR -lt 18 ]; then
              ALLOWED="true"
            fi
          fi

          # Check for bypass
          if [ "${{ inputs.bypass_change_window }}" == "true" ]; then
            if [ -z "${{ inputs.emergency_ticket }}" ]; then
              echo "❌ Emergency bypass requires ticket number"
              exit 1
            fi
            echo "⚠️ Change window bypassed. Ticket: ${{ inputs.emergency_ticket }}"
            ALLOWED="true"
          fi

          echo "allowed=$ALLOWED" >> $GITHUB_OUTPUT

          if [ "$ALLOWED" == "false" ]; then
            echo "❌ Outside change window. Current UTC: $(date -u)"
            echo "   Allowed: Mon-Fri 14:00-18:00 UTC"
            exit 1
          fi

  deploy-production:
    name: Deploy to Production
    needs: check-change-window
    if: needs.check-change-window.outputs.allowed == 'true'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Record Change
        run: |
          echo "Change deployed at: $(date -u)"
          echo "Deployed by: ${{ github.actor }}"
          echo "Commit: ${{ github.sha }}"
          echo "Bypass: ${{ inputs.bypass_change_window }}"
          echo "Ticket: ${{ inputs.emergency_ticket }}"

      # ... deployment steps ...
```

Many organizations implement change freezes during high-risk periods: quarter ends, major holidays, or peak business seasons. Implement freeze schedules in your pipelines that block non-emergency deployments during these periods.
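A freeze check can sit in front of the change-window check. This sketch assumes a simple comma-separated calendar file; the filename and format are inventions for illustration, and the hard-coded date stands in for `date -u +%Y-%m-%d`.

```shell
#!/bin/bash
# Sketch: block deployments during declared freeze periods.
set -e

cat > freeze-windows.txt <<'EOF'
# start,end,reason
2024-12-20,2025-01-02,holiday-freeze
2024-03-28,2024-04-01,quarter-end
EOF

TODAY="2024-12-25"   # in a real pipeline: TODAY=$(date -u +%Y-%m-%d)

FROZEN="false"
while IFS=, read -r START END REASON; do
  case "$START" in "#"*|"") continue ;; esac   # skip comments/blank lines
  # Compare ISO dates numerically by stripping the dashes
  T=$(echo "$TODAY" | tr -d '-')
  S=$(echo "$START" | tr -d '-')
  E=$(echo "$END" | tr -d '-')
  if [ "$T" -ge "$S" ] && [ "$T" -le "$E" ]; then
    FROZEN="true"
    echo "Deployment blocked: $REASON ($START to $END)"
  fi
done < freeze-windows.txt

if [ "$FROZEN" = "true" ]; then
  echo "deploys frozen"
else
  echo "clear to deploy"
fi
```

Keeping the calendar in a versioned file means freeze periods themselves go through review, and the emergency-bypass input from the workflow above can override the check the same way it overrides the window.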
Infrastructure deployment requires deliberate strategies that account for the unique challenges of stateful, potentially irreversible changes. The key principles: promote changes through progressively more critical environments, verify after every apply rather than trusting a clean exit code, contain blast radius with progressive or blue-green rollouts, protect stateful resources from accidental replacement, and have a rollback path planned before you need it.
Module Complete:
This completes the CI/CD for Infrastructure module. You've learned about infrastructure pipelines, GitOps principles, pull request workflows, automated testing, and deployment strategies. Together, these practices enable organizations to manage infrastructure with the same velocity and safety as application code.
Congratulations! You've mastered CI/CD for Infrastructure. You can now design pipelines that make infrastructure changes safe, fast, and auditable—from pull request to production deployment. Apply these practices to bring software engineering discipline to infrastructure management.