The moment of truth in any infrastructure pipeline is deployment—when configuration changes become reality. Unlike application deployments where failures might affect a single service, infrastructure failures can cascade across entire systems. A misconfigured VPC can isolate all services. A broken IAM role can halt all deployments. A deleted database can lose customer data.
Deployment strategies for infrastructure minimize these risks through progressive rollouts, automated verification, and robust rollback capabilities. The goal is to deploy with confidence: knowing that issues will be caught early, blast radius will be contained, and recovery will be swift.
By the end of this page, you will understand progressive deployment strategies for infrastructure, how to structure environment promotion, techniques for verifying deployments, rollback patterns and their limitations, and strategies for handling stateful resources. You will be equipped to design deployment workflows that balance velocity with safety.
Infrastructure deployment differs fundamentally from application deployment. Understanding these differences is essential for designing appropriate deployment strategies.
Key Differences from Application Deployments:
| Aspect | Application Deployment | Infrastructure Deployment |
|---|---|---|
| Unit of Change | Versioned artifact (container, binary) | Configuration diff against current state |
| Rollback Speed | Seconds (switch to previous version) | Minutes to hours (resources must be recreated) |
| State Complexity | Often stateless (state in databases) | Inherently stateful (resources exist in cloud) |
| Blast Radius | Limited to that service's traffic | Can affect all dependent services |
| Canary Capability | Route percentage of traffic to new version | Limited—infrastructure changes are binary |
| Reversibility | Usually fully reversible | Some changes irreversible (data deletion) |
The Core Infrastructure Deployment Problem:
Infrastructure changes often have all-or-nothing semantics. When you modify a security group rule, it applies to all instances immediately. When you change a VPC route, all traffic is affected at once. There's no straightforward way to 'send 5% of traffic' through a new infrastructure configuration.
This constraint shapes infrastructure deployment strategies:
When terraform apply runs, it executes a series of API calls that modify cloud resources sequentially. If the apply fails partway through, you're left in a partially modified state. The plan was validated before apply, but the actual state may now differ from both the original and intended states. Design your strategies to handle partial failures.
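One way to script that recovery posture: `terraform plan -detailed-exitcode` documents exit code 0 for no changes, 1 for an error, and 2 for pending changes, so after a failed apply a non-zero diff means the real infrastructure still differs from the intended configuration. The helper name below (`classify_plan_exit`) is illustrative, not a standard tool.

```shell
#!/bin/bash
# Sketch: classify the outcome of `terraform plan -detailed-exitcode`
# after a failed apply, to decide how a recovery job should proceed.
classify_plan_exit() {
  case "$1" in
    0) echo "converged" ;;       # state matches config; the failure left no diff
    2) echo "diff-remaining" ;;  # partial apply: re-plan, review, re-apply
    *) echo "plan-error" ;;      # plan itself failed: investigate before retrying
  esac
}

# In a real recovery script (requires terraform + an initialized directory):
#   terraform plan -detailed-exitcode -out=recovery.tfplan
#   STATUS=$(classify_plan_exit $?)
classify_plan_exit 2
```

A recovery job that sees `diff-remaining` should re-plan and have a human review the new plan, since it now describes the path from the partially modified state, not from the original one.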
The most fundamental infrastructure deployment strategy is environment promotion: deploying changes through progressively more critical environments, validating at each stage before proceeding.
The Standard Promotion Path:
```yaml
name: Infrastructure Deployment Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'infrastructure/**'

jobs:
  # Stage 1: Deploy to Development
  deploy-dev:
    name: Deploy to Development
    runs-on: ubuntu-latest
    environment: development
    outputs:
      applied: ${{ steps.apply.outputs.applied }}
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: Deploy to Dev
        id: apply
        run: |
          cd infrastructure/environments/dev
          terraform init
          terraform apply -auto-approve
          echo "applied=true" >> $GITHUB_OUTPUT

      - name: Verify Dev Deployment
        run: |
          ./scripts/verify-environment.sh dev

  # Stage 2: Wait and Validate in Dev
  validate-dev:
    name: Validate Development
    needs: deploy-dev
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Integration Tests
        run: |
          cd test/integration
          go test -v -tags=dev ./...

      - name: Check Metrics (5 minute soak)
        run: |
          # Wait 5 minutes, checking error rates
          for i in {1..10}; do
            ERROR_RATE=$(./scripts/check-error-rate.sh dev)
            if [ "$ERROR_RATE" -gt 1 ]; then
              echo "Error rate too high: $ERROR_RATE%"
              exit 1
            fi
            sleep 30
          done

  # Stage 3: Deploy to Staging
  deploy-staging:
    name: Deploy to Staging
    needs: validate-dev
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: Deploy to Staging
        run: |
          cd infrastructure/environments/staging
          terraform init
          terraform apply -auto-approve

      - name: Verify Staging Deployment
        run: |
          ./scripts/verify-environment.sh staging

  # Stage 4: Extended Validation in Staging
  validate-staging:
    name: Validate Staging
    needs: deploy-staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Full Test Suite
        run: |
          cd test/integration
          go test -v -tags=staging -timeout 30m ./...

      - name: Extended Soak (30 minutes)
        run: |
          # Monitor for 30 minutes
          START_TIME=$(date +%s)
          while [ $(($(date +%s) - START_TIME)) -lt 1800 ]; do
            ./scripts/health-check.sh staging || exit 1
            sleep 60
          done

  # Stage 5: Production Deployment (Manual Gate)
  deploy-production:
    name: Deploy to Production
    needs: validate-staging
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://app.company.com
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3

      - name: Notify Deployment Starting
        run: |
          ./scripts/notify-slack.sh "🚀 Production deployment starting"

      - name: Deploy to Production
        run: |
          cd infrastructure/environments/production
          terraform init
          terraform apply -auto-approve

      - name: Comprehensive Verification
        run: |
          ./scripts/verify-production.sh

      - name: Notify Success
        if: success()
        run: |
          ./scripts/notify-slack.sh "✅ Production deployment complete"

      - name: Notify Failure
        if: failure()
        run: |
          ./scripts/notify-slack.sh "❌ Production deployment FAILED - investigate immediately"
```

The 'soak period' after staging deployment catches issues that only manifest under sustained load or over time. A change might work initially but cause memory leaks, connection exhaustion, or gradual degradation. Budget appropriate soak time based on your system's characteristics.
While infrastructure changes are often binary, some scenarios allow for progressive rollout—deploying changes incrementally to limit blast radius.
Where Progressive Rollout Works:
Multi-Region Progressive Deployment:
For organizations with multi-region infrastructure, deploying region-by-region provides natural progressive rollout:
```yaml
name: Multi-Region Progressive Deployment

on:
  workflow_dispatch:
    inputs:
      skip_canary_wait:
        description: 'Skip canary wait period (emergency only)'
        type: boolean
        default: false

jobs:
  # Wave 1: Canary region
  deploy-canary:
    name: Deploy Canary (us-west-2)
    runs-on: ubuntu-latest
    environment: production-usw2
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to us-west-2
        run: |
          cd infrastructure/environments/production
          terraform init -backend-config=backends/us-west-2.tfvars
          terraform apply -auto-approve -var="region=us-west-2"

      - name: Validate Canary
        run: |
          ./scripts/validate-region.sh us-west-2

  # Canary bake time
  canary-bake:
    name: Canary Validation Period
    needs: deploy-canary
    runs-on: ubuntu-latest
    if: ${{ !inputs.skip_canary_wait }}
    steps:
      - uses: actions/checkout@v4

      - name: Monitor Canary (1 hour)
        run: |
          END_TIME=$(($(date +%s) + 3600))
          while [ $(date +%s) -lt $END_TIME ]; do
            # Check error rates
            ERROR_RATE=$(./scripts/check-region-health.sh us-west-2)
            if [ "$ERROR_RATE" -gt 1 ]; then
              echo "Canary unhealthy, aborting rollout"
              exit 1
            fi
            echo "Canary healthy, waiting..."
            sleep 300
          done

  # Wave 2: Secondary regions
  deploy-wave2:
    name: Deploy Wave 2
    needs: [deploy-canary, canary-bake]
    if: always() && needs.deploy-canary.result == 'success' && (needs.canary-bake.result == 'success' || needs.canary-bake.result == 'skipped')
    runs-on: ubuntu-latest
    strategy:
      matrix:
        region: [us-east-1, eu-west-1]
      max-parallel: 2
    environment: production-${{ matrix.region }}
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to ${{ matrix.region }}
        run: |
          cd infrastructure/environments/production
          terraform init -backend-config=backends/${{ matrix.region }}.tfvars
          terraform apply -auto-approve -var="region=${{ matrix.region }}"

  # Wave 2 validation
  validate-wave2:
    name: Validate Wave 2
    needs: deploy-wave2
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Validate All Wave 2 Regions
        run: |
          for region in us-east-1 eu-west-1; do
            ./scripts/validate-region.sh $region || exit 1
          done

  # Wave 3: All remaining regions
  deploy-wave3:
    name: Deploy Wave 3
    needs: validate-wave2
    runs-on: ubuntu-latest
    strategy:
      matrix:
        region: [ap-northeast-1, eu-central-1, ap-southeast-1]
      max-parallel: 3
    environment: production-${{ matrix.region }}
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to ${{ matrix.region }}
        run: |
          cd infrastructure/environments/production
          terraform init -backend-config=backends/${{ matrix.region }}.tfvars
          terraform apply -auto-approve -var="region=${{ matrix.region }}"
```

Choose your canary region wisely: it should have real traffic (not synthetic) but not be your largest region. Issues caught in a smaller region have smaller blast radius. Many organizations use a mid-tier region like us-west-2 as the canary.
Blue-green deployment for infrastructure involves maintaining two parallel stacks and switching traffic between them. While expensive (double the resources), it provides the fastest rollback capability.
How Blue-Green Infrastructure Works:
```hcl
# Blue-Green Infrastructure Pattern

variable "active_color" {
  description = "Currently active stack: blue or green"
  type        = string
  default     = "blue"

  validation {
    condition     = contains(["blue", "green"], var.active_color)
    error_message = "Active color must be blue or green"
  }
}

variable "deploy_green" {
  description = "Whether to deploy the green stack"
  type        = bool
  default     = false
}

# Blue Stack (always exists unless actively decommissioned)
module "blue_stack" {
  source      = "./modules/application-stack"
  name        = "app-blue"
  environment = var.environment

  # Blue stack configuration
  instance_type = var.blue_instance_type
  ami_id        = var.blue_ami_id

  tags = {
    Color = "blue"
  }
}

# Green Stack (deployed when needed for updates)
module "green_stack" {
  source = "./modules/application-stack"
  count  = var.deploy_green ? 1 : 0

  name        = "app-green"
  environment = var.environment

  # Green stack configuration (new version)
  instance_type = var.green_instance_type
  ami_id        = var.green_ami_id

  tags = {
    Color = "green"
  }
}

# Traffic routing based on active color
resource "aws_lb_listener_rule" "app" {
  listener_arn = aws_lb_listener.main.arn
  priority     = 100

  action {
    type             = "forward"
    target_group_arn = var.active_color == "blue" ? module.blue_stack.target_group_arn : module.green_stack[0].target_group_arn
  }

  condition {
    path_pattern {
      values = ["/*"]
    }
  }
}

# DNS pointing to the active stack
resource "aws_route53_record" "app" {
  zone_id = var.zone_id
  name    = "app.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.main.dns_name
    zone_id                = aws_lb.main.zone_id
    evaluate_target_health = true
  }
}

# Outputs for verification
output "active_stack" {
  value = var.active_color
}

output "blue_stack_healthy" {
  value = module.blue_stack.health_check_passed
}

output "green_stack_healthy" {
  value = var.deploy_green ? module.green_stack[0].health_check_passed : null
}
```

The hardest part of blue-green is stateful resources, especially databases.
Options include: sharing a single database between stacks (limits what changes you can make), using read replicas that become primary on switch, or accepting brief downtime for database migration. Each has trade-offs.
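The cutover itself can be driven by a short script. This is a sketch only, assuming the `active_color` and `deploy_green` variables from the pattern above; the `idle_color` helper and the exact step ordering are illustrative, not part of the pattern as shown.

```shell
#!/bin/bash
# Sketch of a blue-green cutover driver.
set -e

# Pure helper: the stack we deploy the new version to is the one
# NOT currently serving traffic.
idle_color() {
  if [ "$1" = "blue" ]; then echo "green"; else echo "blue"; fi
}

ACTIVE="blue"                      # in practice, read from `terraform output active_stack`
TARGET=$(idle_color "$ACTIVE")
echo "Deploying new version to the $TARGET stack"

# The cutover is three separate applies (not executed in this sketch):
#   terraform apply -var="deploy_green=true"                               # 1. build idle stack
#   ./scripts/verify-environment.sh production                             # 2. verify it
#   terraform apply -var="active_color=$TARGET" -var="deploy_green=true"   # 3. switch traffic
# Rollback is a single apply flipping active_color back to $ACTIVE.
```

The key property is step 3: because traffic routing is a single variable, both cutover and rollback are small, fast, low-risk applies.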
When infrastructure deployments fail or cause issues, fast rollback is essential. However, infrastructure rollback is more complex than application rollback—you can't simply 'deploy the previous version' because infrastructure changes may have already modified resources.
Rollback Strategy Options:
| Strategy | How It Works | Speed | Limitations |
|---|---|---|---|
| Revert Commit | Apply previous configuration from Git | Minutes | May not work if resources were destroyed |
| State Rollback | Restore previous Terraform state file | Fast (state restore) | Doesn't change actual resources, causes drift |
| Blue-Green Switch | Route traffic to previous stack | Seconds | Requires maintaining parallel stacks |
| Forward Fix | Deploy a corrected version quickly | Minutes | Requires rapid diagnosis and fix development |
| Manual Remediation | Ops team fixes resources directly | Variable | Causes drift, needs post-incident cleanup |
The Revert Commit Approach:
The most straightforward rollback: revert the commit and apply again.
```bash
#!/bin/bash
# Infrastructure Rollback Script

set -e

ENVIRONMENT=${1:-production}
COMMIT_TO_REVERT=${2:-HEAD}

echo "🔄 Rolling back infrastructure for $ENVIRONMENT"
echo "   Reverting commit: $COMMIT_TO_REVERT"

# Step 1: Create rollback branch
ROLLBACK_BRANCH="rollback/$(date +%Y%m%d-%H%M%S)"
git checkout -b $ROLLBACK_BRANCH

# Step 2: Revert the problematic commit
git revert --no-commit $COMMIT_TO_REVERT

# Step 3: Generate plan to verify rollback
echo "📋 Generating rollback plan..."
cd infrastructure/environments/$ENVIRONMENT
terraform init
terraform plan -out=rollback.tfplan

# Step 4: Display plan summary
echo ""
echo "=== ROLLBACK PLAN SUMMARY ==="
terraform show rollback.tfplan | grep -E "^(#|Plan:)"

# Step 5: Confirm with operator
read -p "Apply this rollback? (yes/no): " CONFIRM
if [ "$CONFIRM" != "yes" ]; then
  echo "Rollback cancelled"
  exit 1
fi

# Step 6: Apply rollback
echo "🚀 Applying rollback..."
terraform apply rollback.tfplan

# Step 7: Verify
echo "✅ Verifying rollback..."
../../scripts/verify-environment.sh $ENVIRONMENT

# Step 8: Commit and push
git add .
git commit -m "Rollback: Revert $COMMIT_TO_REVERT for $ENVIRONMENT"
git push origin $ROLLBACK_BRANCH

echo "✅ Rollback complete. Branch: $ROLLBACK_BRANCH"
echo "   Create PR to merge rollback into main"
```

When Revert Doesn't Work:
Revert-and-apply won't work in every scenario: if the bad change destroyed a stateful resource, reverting recreates it empty rather than restoring its data; if the change was irreversible at the provider level (enabling encryption, for example, requires a new instance and data migration); or if a failed apply left resources half-modified, so the revert plan no longer matches reality.
In these cases, you may need a forward fix or manual intervention.
Before any production apply, snapshot the Terraform state file. If something goes catastrophically wrong, you can restore the state to a known good point. This doesn't change actual resources but lets you understand what Terraform thinks exists, which aids debugging.
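A minimal sketch of that snapshot step. `terraform state pull` and `terraform state push` are the real commands; the backup bucket name is a placeholder for illustration.

```shell
#!/bin/bash
# Sketch: snapshot remote Terraform state before a production apply.
set -e

# Timestamped snapshot filename, e.g. state-backup-20240115-143000.tfstate
SNAPSHOT="state-backup-$(date +%Y%m%d-%H%M%S).tfstate"

# In a real pipeline (requires an initialized working directory):
#   terraform state pull > "$SNAPSHOT"                                     # state as JSON
#   aws s3 cp "$SNAPSHOT" "s3://company-tf-state-backups/$SNAPSHOT"        # hypothetical bucket
# Restore later (with extreme care, as it overwrites remote state):
#   terraform state push "$SNAPSHOT"

echo "Would snapshot state to: $SNAPSHOT"
```

Because pushing a stale state can itself cause damage, treat the snapshot primarily as a debugging aid, restoring it only when you fully understand why state and reality diverged.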
Deployment isn't complete when terraform apply finishes—it's complete when you've verified the infrastructure is working correctly. Post-deployment verification catches issues that weren't apparent from the apply output.
Verification Layers:
```bash
#!/bin/bash
# Post-Deployment Verification Script

set -e

ENVIRONMENT=${1:-production}
MAX_RETRIES=10
RETRY_DELAY=30

echo "🔍 Verifying deployment for $ENVIRONMENT"

# Helper function for retries
retry() {
  local max=$1
  local delay=$2
  shift 2
  local count=0
  until "$@"; do
    count=$((count + 1))
    if [ $count -lt $max ]; then
      echo "  Attempt $count failed, retrying in $delay seconds..."
      sleep $delay
    else
      echo "  ❌ All $max attempts failed"
      return 1
    fi
  done
  return 0
}

# 1. Verify Terraform outputs exist
echo "📋 Checking Terraform outputs..."
cd infrastructure/environments/$ENVIRONMENT
terraform output -json > /tmp/tf_outputs.json

# Extract key outputs
VPC_ID=$(jq -r '.vpc_id.value' /tmp/tf_outputs.json)
ALB_DNS=$(jq -r '.alb_dns_name.value' /tmp/tf_outputs.json)
RDS_ENDPOINT=$(jq -r '.rds_endpoint.value' /tmp/tf_outputs.json)

echo "  VPC: $VPC_ID"
echo "  ALB: $ALB_DNS"
echo "  RDS: $RDS_ENDPOINT"

# 2. Verify VPC exists and is available
echo "🌐 Verifying VPC..."
VPC_STATE=$(aws ec2 describe-vpcs --vpc-ids $VPC_ID --query 'Vpcs[0].State' --output text)
[ "$VPC_STATE" == "available" ] || { echo "❌ VPC not available"; exit 1; }
echo "  ✅ VPC is available"

# 3. Verify ALB is healthy
# (inner $ and quotes are escaped so the command re-runs inside each retry)
echo "⚖️ Verifying Load Balancer..."
retry $MAX_RETRIES $RETRY_DELAY bash -c "
  ALB_STATE=\$(aws elbv2 describe-load-balancers \
    --query \"LoadBalancers[?DNSName=='$ALB_DNS'].State.Code\" \
    --output text)
  [ \"\$ALB_STATE\" == 'active' ]"
echo "  ✅ ALB is active"

# 4. Verify target group health
echo "🎯 Verifying Target Groups..."
# Look up the ALB ARN from its DNS name (needed for the target-group query)
ALB_ARN=$(aws elbv2 describe-load-balancers \
  --query "LoadBalancers[?DNSName=='$ALB_DNS'].LoadBalancerArn" \
  --output text)
TG_ARNS=$(aws elbv2 describe-target-groups \
  --query "TargetGroups[?contains(LoadBalancerArns, '$ALB_ARN')].TargetGroupArn" \
  --output text)

for TG in $TG_ARNS; do
  HEALTHY=$(aws elbv2 describe-target-health --target-group-arn $TG \
    --query "TargetHealthDescriptions[?TargetHealth.State=='healthy'] | length(@)" \
    --output text)
  [ "$HEALTHY" -gt 0 ] || { echo "❌ No healthy targets in $TG"; exit 1; }
  echo "  ✅ $HEALTHY healthy targets in target group"
done

# 5. Verify RDS is available
echo "🗃️ Verifying Database..."
retry $MAX_RETRIES $RETRY_DELAY bash -c "
  RDS_STATUS=\$(aws rds describe-db-instances \
    --query \"DBInstances[?Endpoint.Address=='$RDS_ENDPOINT'].DBInstanceStatus\" \
    --output text)
  [ \"\$RDS_STATUS\" == 'available' ]"
echo "  ✅ RDS is available"

# 6. Run synthetic health check
echo "🧪 Running synthetic health check..."
retry $MAX_RETRIES $RETRY_DELAY curl -sf "https://$ALB_DNS/health"
echo "  ✅ Health endpoint responding"

# 7. Check error rates (last 5 minutes)
echo "📊 Checking error metrics..."
ERROR_COUNT=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_Target_5XX_Count \
  --start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 300 \
  --statistics Sum \
  --query 'Datapoints[0].Sum' \
  --output text)

if [ "$ERROR_COUNT" != "None" ] && [ "$ERROR_COUNT" -gt 10 ]; then
  echo "  ⚠️ Elevated error rate: $ERROR_COUNT 5xx errors in last 5 minutes"
  exit 1
fi
echo "  ✅ Error rates normal"

echo ""
echo "✅ All verification checks passed for $ENVIRONMENT"
```

Some issues only manifest under load or over time. Post-deployment verification should include not just immediate checks but also a monitoring period. Many teams require a 'bake time' of 15-60 minutes in production before considering a deployment complete.
Stateful resources—databases, storage, queues—require special care during deployments. Changes to these resources can be irreversible, and mistakes can result in data loss.
Stateful Resource Deployment Principles:
```hcl
# Protecting Stateful Resources

# RDS with maximum protection
resource "aws_db_instance" "main" {
  identifier = "production-db"
  # ... configuration ...

  # Protection against accidental deletion
  deletion_protection       = true
  skip_final_snapshot       = false
  final_snapshot_identifier = "production-db-final-${formatdate("YYYY-MM-DD-hhmm", timestamp())}"

  # Automated backups
  backup_retention_period = 30
  backup_window           = "03:00-04:00"

  # Prevent Terraform from ever deleting this instance
  lifecycle {
    prevent_destroy = true

    # Ignore changes that would cause replacement
    ignore_changes = [
      identifier,
      engine_version, # Handle upgrades separately
    ]
  }

  tags = {
    DataClassification = "production"
    BackupRequired     = "true"
  }
}

# S3 bucket with versioning and MFA delete
resource "aws_s3_bucket" "data" {
  bucket = "company-production-data"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "data" {
  bucket = aws_s3_bucket.data.id

  versioning_configuration {
    status     = "Enabled"
    mfa_delete = "Enabled" # Requires MFA to delete versions
  }
}

# Policy to prevent dangerous operations
resource "aws_iam_policy" "prevent_data_deletion" {
  name        = "prevent-production-data-deletion"
  description = "Denies deletion of production data resources"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "DenyDataDeletion"
        Effect = "Deny"
        Action = [
          "rds:DeleteDBInstance",
          "rds:DeleteDBCluster",
          "s3:DeleteBucket",
          "dynamodb:DeleteTable",
        ]
        Resource = [
          aws_db_instance.main.arn,
          aws_s3_bucket.data.arn,
        ]
      }
    ]
  })
}
```

Database-Specific Considerations:
| Change Type | Risk Level | Recommended Approach |
|---|---|---|
| Instance type change | Medium | Blue-green with replication |
| Engine version upgrade | High | Snapshot, test in staging, maintenance window |
| Parameter changes | Medium | Most apply without restart, verify compatibility |
| Storage increase | Low | Online expansion, no downtime |
| Encryption enable | Very High | Requires new instance and data migration |
| Multi-AZ enable | Low | Automatic failover capabilities added |
Some seemingly innocuous changes trigger resource replacement (destroy and create). For databases, this means data loss. Always review plans carefully for 'must be replaced' warnings. When in doubt, save the plan, render it with terraform show -json, and check the output for replacement actions before applying.
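A pipeline guard along those lines might look like this sketch. The plan JSON here is hand-written so the example runs standalone, but the `resource_changes[].change.actions` shape matches Terraform's JSON plan format, where a replacement appears as the action pair `["delete","create"]` (or `["create","delete"]` with create-before-destroy).

```shell
#!/bin/bash
# Sketch: block a deploy if the saved plan would replace any resource.
set -e

# Normally produced by:
#   terraform plan -out=plan.tfplan
#   terraform show -json plan.tfplan > plan.json
# A tiny hand-written plan.json keeps this sketch self-contained.
cat > plan.json <<'EOF'
{"resource_changes":[
  {"address":"aws_db_instance.main","change":{"actions":["delete","create"]}},
  {"address":"aws_s3_bucket.logs","change":{"actions":["update"]}}
]}
EOF

# Count lines containing a replacement action pair
# (a JSON-aware tool like jq is more robust than grep in production)
COUNT=$(grep -cE '\["delete","create"\]|\["create","delete"\]' plan.json || true)

if [ "$COUNT" -gt 0 ]; then
  echo "blocked: $COUNT resource(s) would be replaced"
else
  echo "ok: no replacements"
fi
```

In a real pipeline this check would run after `terraform plan` and before any approval gate, so reviewers see replacement warnings up front.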
Many organizations restrict when production changes can occur through change windows—designated times when deployments are permitted. This practice reduces risk by ensuring changes happen when support staff are available and user impact is minimized.
Implementing Change Windows:
| Change Type | Window | Requirements | Examples |
|---|---|---|---|
| Standard | Business hours, any day | Approved PR, tests pass | Tags, scaling policies, non-critical updates |
| Significant | Low-traffic hours, weekdays | Standard + additional reviewer | Security groups, IAM roles, new resources |
| High-Risk | Scheduled maintenance window | Change board approval, rollback tested | Database changes, network modifications |
| Emergency | Any time | Incident response, documented | Security patches, outage fixes |
```yaml
name: Production Deploy with Change Window

on:
  workflow_dispatch:
    inputs:
      bypass_change_window:
        description: 'Bypass change window (emergency only)'
        type: boolean
        default: false
      emergency_ticket:
        description: 'Emergency ticket number (required if bypassing)'
        type: string
        default: ''

jobs:
  check-change-window:
    name: Verify Change Window
    runs-on: ubuntu-latest
    outputs:
      allowed: ${{ steps.check.outputs.allowed }}
    steps:
      - name: Check Change Window
        id: check
        run: |
          # Get current time in UTC
          HOUR=$(date -u +%H)
          DAY=$(date -u +%u)  # 1=Monday, 7=Sunday

          # Define allowed windows (UTC)
          # Standard: Mon-Fri 14:00-18:00 UTC (9am-1pm EST)
          ALLOWED="false"
          if [ $DAY -ge 1 ] && [ $DAY -le 5 ]; then
            if [ $HOUR -ge 14 ] && [ $HOUR -lt 18 ]; then
              ALLOWED="true"
            fi
          fi

          # Check for bypass
          if [ "${{ inputs.bypass_change_window }}" == "true" ]; then
            if [ -z "${{ inputs.emergency_ticket }}" ]; then
              echo "❌ Emergency bypass requires ticket number"
              exit 1
            fi
            echo "⚠️ Change window bypassed. Ticket: ${{ inputs.emergency_ticket }}"
            ALLOWED="true"
          fi

          echo "allowed=$ALLOWED" >> $GITHUB_OUTPUT

          if [ "$ALLOWED" == "false" ]; then
            echo "❌ Outside change window. Current UTC: $(date -u)"
            echo "   Allowed: Mon-Fri 14:00-18:00 UTC"
            exit 1
          fi

  deploy-production:
    name: Deploy to Production
    needs: check-change-window
    if: needs.check-change-window.outputs.allowed == 'true'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Record Change
        run: |
          echo "Change deployed at: $(date -u)"
          echo "Deployed by: ${{ github.actor }}"
          echo "Commit: ${{ github.sha }}"
          echo "Bypass: ${{ inputs.bypass_change_window }}"
          echo "Ticket: ${{ inputs.emergency_ticket }}"

      # ... deployment steps ...
```

Many organizations implement change freezes during high-risk periods: quarter ends, major holidays, or peak business seasons. Implement freeze schedules in your pipelines that block non-emergency deployments during these periods.
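A freeze check can sit in front of the change-window check. This sketch assumes a simple comma-separated calendar file; the filename and format are inventions for illustration, and the hard-coded date stands in for `date -u +%Y-%m-%d`.

```shell
#!/bin/bash
# Sketch: block deployments during declared freeze periods.
set -e

cat > freeze-windows.txt <<'EOF'
# start,end,reason
2024-12-20,2025-01-02,holiday-freeze
2024-03-28,2024-04-01,quarter-end
EOF

TODAY="2024-12-25"   # in a real pipeline: TODAY=$(date -u +%Y-%m-%d)

FROZEN="false"
while IFS=, read -r START END REASON; do
  case "$START" in "#"*|"") continue ;; esac   # skip comments/blank lines
  # Compare ISO dates numerically by stripping the dashes
  T=$(echo "$TODAY" | tr -d '-')
  S=$(echo "$START" | tr -d '-')
  E=$(echo "$END" | tr -d '-')
  if [ "$T" -ge "$S" ] && [ "$T" -le "$E" ]; then
    FROZEN="true"
    echo "Deployment blocked: $REASON ($START to $END)"
  fi
done < freeze-windows.txt

if [ "$FROZEN" = "true" ]; then
  echo "deploys frozen"
else
  echo "clear to deploy"
fi
```

Keeping the calendar in a versioned file means freeze periods themselves go through review, and the emergency-bypass input from the workflow above can override the check the same way it overrides the window.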
Infrastructure deployment requires deliberate strategies that account for the unique challenges of stateful, potentially irreversible changes. The key principles: promote changes through progressively more critical environments, verify after every apply rather than trusting a clean exit code, contain blast radius with progressive or blue-green rollouts, protect stateful resources from accidental replacement, and have a rollback path planned before you need it.
Module Complete:
This completes the CI/CD for Infrastructure module. You've learned about infrastructure pipelines, GitOps principles, pull request workflows, automated testing, and deployment strategies. Together, these practices enable organizations to manage infrastructure with the same velocity and safety as application code.
Congratulations! You've mastered CI/CD for Infrastructure. You can now design pipelines that make infrastructure changes safe, fast, and auditable—from pull request to production deployment. Apply these practices to bring software engineering discipline to infrastructure management.