In the early days of computing, infrastructure changes were manual rituals performed by system administrators with root access and tribal knowledge. Today, elite engineering organizations treat infrastructure changes the same way they treat application code changes: through automated pipelines that validate, test, and deploy infrastructure with deterministic consistency.
This transformation—from ad-hoc scripts and runbooks to fully automated infrastructure pipelines—represents one of the most significant operational advances in modern software engineering. When done correctly, infrastructure pipelines enable teams to provision and modify infrastructure with the same speed, confidence, and auditability they expect from their application deployments.
By the end of this page, you will understand the anatomy of infrastructure pipelines, how they differ from application CI/CD, the key stages every pipeline must implement, and the patterns that enable organizations to scale infrastructure automation safely. You will be able to design pipelines that transform infrastructure changes from high-risk manual operations into routine, predictable events.
Infrastructure pipelines aren't merely a convenience—they're a fundamental shift in how organizations manage risk, velocity, and operational excellence. Understanding why they matter requires examining the problems they solve and the capabilities they unlock.
The Traditional Infrastructure Problem:
Before infrastructure pipelines, most organizations operated through a combination of manual console changes, ad-hoc scripts, hand-maintained runbooks, and tribal knowledge held by a few administrators with privileged access.
This approach doesn't scale. As organizations grow, infrastructure complexity explodes. Each new service, each new environment, each new region multiplies the configuration surface area. Manual processes that worked for ten servers become impossible with ten thousand.
Organizations often fear that adding automation and validation to infrastructure changes will slow them down. The opposite is true: by removing manual coordination overhead and reducing failure rates, infrastructure pipelines dramatically increase the sustainable rate of change. Teams that deploy through pipelines typically make 10-100x more infrastructure changes than teams relying on manual processes—with far fewer incidents.
A well-designed infrastructure pipeline consists of distinct stages, each serving a specific purpose in the journey from code commit to production deployment. Understanding these stages—and the principles behind them—is essential for building pipelines that are both robust and maintainable.
The Core Pipeline Stages:
Every infrastructure pipeline, regardless of tooling or cloud provider, follows a similar conceptual flow:
| Stage | Purpose | Key Activities | Failure Mode |
|---|---|---|---|
| Validate | Catch syntax and configuration errors early | Linting, format checking, schema validation, security scanning | Fast fail on invalid configuration |
| Plan | Preview what changes will occur | Generate execution plan, diff against current state, resource change analysis | Surface destructive changes for review |
| Review | Human approval for significant changes | Peer review, plan review, approval gates | Block deployment until approved |
| Test | Validate infrastructure behavior | Policy tests, integration tests, compliance checks | Prevent non-compliant configurations |
| Deploy | Apply changes to target environment | Execute infrastructure changes, handle dependencies, manage state | Halt and remediate on apply failure |
| Verify | Confirm deployment success | Health checks, smoke tests, drift detection | Alert on verification failure |
The Principle of Progressive Confidence:
Notice how these stages build confidence progressively. Early stages catch cheap-to-fix errors (syntax, formatting), while later stages verify more expensive properties (actual behavior, integration). This ordering minimizes wasted time and resources—you don't run integration tests on code that can't parse.
Each stage acts as a quality gate. Changes that fail any stage don't proceed to subsequent stages, ensuring that only well-validated changes reach production.
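The gate behavior can be sketched in a few lines of Python (illustrative only; the stage names and checks are placeholders, not any real pipeline's API):

```python
from typing import Callable

def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> str:
    """Run stages in order; the first failing gate stops the pipeline."""
    for name, check in stages:
        if not check():
            return f"failed at {name}"  # later stages never run
    return "deployed"

stages = [
    ("validate", lambda: True),
    ("plan",     lambda: True),
    ("test",     lambda: False),  # a failing policy test...
    ("deploy",   lambda: True),   # ...prevents deploy from running
]
print(run_pipeline(stages))  # failed at test
```

The ordering mirrors the table above: cheap checks first, so an invalid configuration never consumes the time or cloud resources of the later stages.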
The validate stage is where infrastructure pipelines provide their fastest feedback. By catching issues in seconds rather than minutes or hours, this stage dramatically improves developer experience and reduces the cost of errors.
What Validation Catches:
Validation encompasses multiple types of checks, each targeting different error categories:
```yaml
name: Infrastructure Validation

on:
  pull_request:
    paths:
      - 'infrastructure/**'
      - '.github/workflows/infrastructure-*.yml'

jobs:
  validate:
    name: Validate Infrastructure
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infrastructure
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.6.x"

      # Stage 1: Format Check
      - name: Terraform Format Check
        id: fmt
        run: terraform fmt -check -recursive
        continue-on-error: true

      # Stage 2: Initialize (required for validation)
      - name: Terraform Init
        id: init
        run: terraform init -backend=false

      # Stage 3: Syntax Validation
      - name: Terraform Validate
        id: validate
        run: terraform validate -no-color

      # Stage 4: Static Security Analysis
      - name: Run tfsec Security Scanner
        uses: aquasecurity/tfsec-action@v1.0.0
        with:
          working_directory: infrastructure
          soft_fail: false

      # Stage 5: Policy Compliance (Checkov)
      - name: Run Checkov Policy Checks
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: infrastructure
          framework: terraform
          soft_fail: false
          output_format: cli,sarif
          output_file_path: console,results.sarif

      # Stage 6: Credential Scanning
      - name: Scan for Secrets
        uses: trufflesecurity/trufflehog@v3.63.0
        with:
          path: ./infrastructure
          base: ${{ github.event.pull_request.base.sha }}
          head: ${{ github.event.pull_request.head.sha }}

      # Report validation status
      - name: Post Validation Summary
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const output = `### Infrastructure Validation Results
            | Check | Status |
            |-------|--------|
            | Format | ${{ steps.fmt.outcome == 'success' && '✅ Pass' || '❌ Fail' }} |
            | Init | ${{ steps.init.outcome == 'success' && '✅ Pass' || '❌ Fail' }} |
            | Validate | ${{ steps.validate.outcome == 'success' && '✅ Pass' || '❌ Fail' }} |
            `;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });
```

Validation catches configuration errors, but it cannot guarantee that infrastructure will behave correctly. A syntactically valid Terraform configuration might still create misconfigured resources, fail due to API limits, or violate business requirements. Validation is necessary but not sufficient—it must be complemented by planning, testing, and post-deployment verification.
The plan stage is arguably the most critical stage in infrastructure pipelines. It produces a detailed preview of exactly what changes will occur, enabling human reviewers to understand the impact before any resources are modified.
Why Planning Is Essential:
Unlike application deployments where changes are often additive and easily reversible, infrastructure changes can be destructive and irreversible. Deleting a database, removing a VPC, or modifying IAM policies can have catastrophic consequences. The plan stage provides the last line of defense before these changes occur.
The Plan Contains:
Every resource that will be created, updated, destroyed, or replaced, together with the attribute-level diff for each change.
```yaml
name: Infrastructure Plan

on:
  pull_request:
    paths:
      - 'infrastructure/**'

env:
  TF_VAR_environment: "staging"
  ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
  ARM_CLIENT_SECRET: ${{ secrets.AZURE_CLIENT_SECRET }}
  ARM_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
  ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}

jobs:
  plan:
    name: Generate Infrastructure Plan
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.6.x"

      - name: Terraform Init
        id: init
        run: |
          cd infrastructure
          terraform init

      - name: Terraform Plan
        id: plan
        run: |
          cd infrastructure
          terraform plan -no-color -out=tfplan \
            -detailed-exitcode \
            2>&1 | tee plan_output.txt
        continue-on-error: true

      - name: Analyze Plan for Destructive Changes
        id: analyze
        run: |
          cd infrastructure

          # Count different change types
          CREATES=$(grep -c "will be created" plan_output.txt || true)
          UPDATES=$(grep -c "will be updated" plan_output.txt || true)
          DESTROYS=$(grep -c "will be destroyed" plan_output.txt || true)
          REPLACES=$(grep -c "must be replaced" plan_output.txt || true)

          echo "creates=$CREATES" >> $GITHUB_OUTPUT
          echo "updates=$UPDATES" >> $GITHUB_OUTPUT
          echo "destroys=$DESTROYS" >> $GITHUB_OUTPUT
          echo "replaces=$REPLACES" >> $GITHUB_OUTPUT

          # Flag high-risk changes
          if [ "$DESTROYS" -gt 0 ] || [ "$REPLACES" -gt 0 ]; then
            echo "high_risk=true" >> $GITHUB_OUTPUT
          else
            echo "high_risk=false" >> $GITHUB_OUTPUT
          fi

      - name: Post Plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('infrastructure/plan_output.txt', 'utf8');

            // Truncate if too long for GitHub comment
            const maxLength = 60000;
            const truncatedPlan = plan.length > maxLength
              ? plan.substring(0, maxLength) + '\n... [truncated]'
              : plan;

            const highRisk = '${{ steps.analyze.outputs.high_risk }}' === 'true';
            const riskBadge = highRisk ? '⚠️ **HIGH RISK**' : '✅ Low Risk';

            const output = `### Terraform Plan Summary
            ${riskBadge}

            | Change Type | Count |
            |-------------|-------|
            | Create | ${{ steps.analyze.outputs.creates }} |
            | Update | ${{ steps.analyze.outputs.updates }} |
            | Destroy | ${{ steps.analyze.outputs.destroys }} |
            | Replace | ${{ steps.analyze.outputs.replaces }} |

            <details>
            <summary>Show Full Plan</summary>

            \`\`\`hcl
            ${truncatedPlan}
            \`\`\`
            </details>

            ${highRisk ? '\n⚠️ **This plan includes destructive changes. Please review carefully before approving.**' : ''}
            `;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

      - name: Upload Plan Artifact
        uses: actions/upload-artifact@v4
        with:
          name: terraform-plan
          path: infrastructure/tfplan
          retention-days: 5

      - name: Require Approval for Destructive Changes
        if: steps.analyze.outputs.high_risk == 'true'
        run: |
          echo "::warning::This plan includes destructive changes."
          echo "Please ensure appropriate reviewers have approved before merge."
```

Plan Artifacts Are Critical:
Notice that the plan is saved as an artifact (tfplan). This is essential for ensuring that the plan reviewed during the PR is exactly the plan applied during deployment. Without this, there's a race condition: infrastructure could change between plan and apply, causing unexpected modifications.
The pattern is: generate the plan once, publish it as an artifact, review and approve that exact artifact, then apply the saved artifact during deployment.
This guarantees that reviewers see exactly what will be deployed.
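One lightweight way to enforce this guarantee is to record a checksum of the plan artifact at review time and recompute it immediately before apply. A minimal Python sketch (the byte strings stand in for real tfplan contents):

```python
import hashlib

def digest(plan_bytes: bytes) -> str:
    """Content hash that uniquely identifies a plan artifact."""
    return hashlib.sha256(plan_bytes).hexdigest()

reviewed = digest(b"fake tfplan contents")  # recorded when the plan is approved
at_apply = digest(b"fake tfplan contents")  # recomputed just before apply

# Any mismatch means the artifact changed between review and apply,
# and the deployment should be aborted.
assert reviewed == at_apply, "plan artifact changed between review and apply"
```

In practice the recorded hash travels with the approval (for example, in the PR comment or a pipeline variable), so the apply job can refuse to run against anything else.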
The -detailed-exitcode flag makes Terraform return: 0 = no changes, 1 = error, 2 = changes pending. This allows pipelines to distinguish between 'no work needed' and 'work needed' rather than having to parse output text.
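The dispatch logic a pipeline builds on top of these exit codes can be sketched as follows (a hypothetical helper, not part of Terraform itself):

```python
def handle_plan_exit(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` results to pipeline actions.
    Per Terraform's documented codes: 0 = no changes, 1 = error, 2 = changes pending."""
    if code == 0:
        return "skip-apply"          # nothing to do; end the pipeline early
    if code == 2:
        return "proceed-to-review"   # changes pending; post the plan for review
    return "fail"                    # exit code 1 (or anything else) is an error

assert handle_plan_exit(0) == "skip-apply"
assert handle_plan_exit(2) == "proceed-to-review"
```

Branching on the exit code is both faster and more reliable than grepping plan output, which can change format between Terraform versions.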
The deployment stage—where infrastructure changes are actually applied—requires meticulous attention to safety. Unlike application deployments that can often be rolled back in seconds, infrastructure changes may take minutes or hours to reverse, and some changes (data deletion, resource destruction) may be permanently irreversible.
Key Principles for Safe Infrastructure Deployment:
```yaml
name: Infrastructure Deploy

on:
  push:
    branches:
      - main
    paths:
      - 'infrastructure/**'

concurrency:
  group: infrastructure-deploy
  cancel-in-progress: false  # Never cancel in-progress infrastructure changes

env:
  TF_VAR_environment: "production"

jobs:
  deploy:
    name: Deploy Infrastructure
    runs-on: ubuntu-latest
    environment: production  # Requires manual approval
    timeout-minutes: 60
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.6.x"

      - name: Download Plan Artifact
        uses: actions/download-artifact@v4
        with:
          name: terraform-plan
          path: infrastructure/

      - name: Terraform Init
        run: |
          cd infrastructure
          terraform init

      - name: Notify Deployment Start
        uses: slackapi/slack-github-action@v1.24.0
        with:
          channel-id: 'infra-deployments'
          slack-message: |
            🚀 Infrastructure deployment starting
            Triggered by: ${{ github.actor }}
            Commit: ${{ github.sha }}
            Environment: production

      - name: Terraform Apply
        id: apply
        run: |
          cd infrastructure
          terraform apply -auto-approve -no-color tfplan \
            2>&1 | tee apply_output.txt

      - name: Verify Deployment
        if: success()
        run: |
          cd infrastructure
          # Run post-deployment verification
          terraform output -json > outputs.json
          # Verify critical resources exist and are healthy
          # This would call your verification scripts
          ./scripts/verify-deployment.sh

      - name: Notify Success
        if: success()
        uses: slackapi/slack-github-action@v1.24.0
        with:
          channel-id: 'infra-deployments'
          slack-message: |
            ✅ Infrastructure deployment succeeded
            Commit: ${{ github.sha }}

      - name: Notify Failure
        if: failure()
        uses: slackapi/slack-github-action@v1.24.0
        with:
          channel-id: 'infra-alerts'
          slack-message: |
            ❌ Infrastructure deployment FAILED
            Triggered by: ${{ github.actor }}
            Commit: ${{ github.sha }}
            Please investigate immediately.

      - name: Archive Apply Output
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: terraform-apply-log
          path: infrastructure/apply_output.txt
          retention-days: 30
```

The concurrency block with `cancel-in-progress: false` is critical. Unlike application deployments, you should NEVER cancel an infrastructure deployment in progress. Doing so can leave infrastructure in an inconsistent state. Always let infrastructure operations complete, then address any issues.
Production infrastructure changes should never be deployed directly. Instead, changes should flow through a series of environments, building confidence at each stage. This environment promotion pattern is a fundamental practice for reducing production risk.
Common Environment Structures:
| Environment | Purpose | Deployment Trigger | Validation |
|---|---|---|---|
| Development | Rapid iteration, feature development | Push to feature branch | Basic validation, unit tests |
| Staging | Integration testing, pre-production validation | Merge to main branch | Full test suite, smoke tests |
| Production | Live customer-facing infrastructure | Manual approval after staging success | Verification, monitoring, alerting |
Promotion Gates:
Moving between environments should require explicit gates: successful deployment and verification in the lower environment, a passing test suite, and manual approval before any production change.
Handling Environment Differences:
Environments should be as similar as possible, but differences are inevitable (scale, costs, integrations). Handle these through parameterized configuration: a single codebase whose per-environment values are supplied by variables, rather than divergent copies of the code.
```hcl
# variables.tf
variable "environment" {
  description = "Environment name (dev, staging, production)"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

# Environment-specific configuration
locals {
  environment_config = {
    dev = {
      instance_type     = "t3.small"
      min_instances     = 1
      max_instances     = 2
      enable_multi_az   = false
      backup_retention  = 1
      enable_monitoring = false
    }
    staging = {
      instance_type     = "t3.medium"
      min_instances     = 2
      max_instances     = 4
      enable_multi_az   = true
      backup_retention  = 7
      enable_monitoring = true
    }
    production = {
      instance_type     = "t3.large"
      min_instances     = 3
      max_instances     = 10
      enable_multi_az   = true
      backup_retention  = 30
      enable_monitoring = true
    }
  }

  config = local.environment_config[var.environment]
}

# Usage in resources (assumes local.all_azs and the launch template
# are defined elsewhere in the configuration)
resource "aws_autoscaling_group" "main" {
  min_size = local.config.min_instances
  max_size = local.config.max_instances

  # availability_zones is a list argument, not a block:
  # span all AZs only when multi-AZ is enabled for this environment
  availability_zones = local.config.enable_multi_az ? local.all_azs : [local.all_azs[0]]

  launch_template {
    id      = aws_launch_template.main.id
    version = "$Latest"
  }
}
```

While infrastructure pipelines share concepts with application CI/CD, they have fundamental differences that inform their design. Understanding these differences is crucial for building appropriate safeguards.
These differences mean infrastructure pipelines need: longer timeouts, stricter concurrency controls, more extensive plan review, careful state management, and different rollback strategies. Don't simply copy application pipeline patterns—they may be actively dangerous for infrastructure.
The State Problem:
Infrastructure pipelines are fundamentally stateful. The pipeline must know what infrastructure currently exists to determine what changes to make. This creates unique challenges:
Robust infrastructure pipelines include explicit state management: backend configuration, locking mechanisms, state backup, and drift detection.
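Drift detection, for example, reduces to diffing recorded state against what the provider actually reports. A toy Python sketch (resource names and specs are invented for illustration):

```python
def detect_drift(recorded: dict, observed: dict) -> dict:
    """Compare recorded state against live infrastructure and flag divergence."""
    drift = {}
    for name, spec in recorded.items():
        live = observed.get(name)
        if live is None:
            drift[name] = "missing"   # deleted outside the pipeline
        elif live != spec:
            drift[name] = "modified"  # changed outside the pipeline
    return drift

recorded = {"bucket": {"versioning": True}}
observed = {"bucket": {"versioning": False}}  # someone flipped it in the console
print(detect_drift(recorded, observed))  # {'bucket': 'modified'}
```

Real tools do this by refreshing state against provider APIs; scheduling such a check on a timer turns silent drift into an actionable alert.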
Pipeline failures are inevitable. Networks fail, cloud APIs have outages, rate limits are hit, and resources conflict. The difference between resilient and fragile infrastructure pipelines lies in how failures are handled.
Categories of Pipeline Failures:
| Failure Type | Example | Recovery Strategy |
|---|---|---|
| Transient | Network timeout, rate limit, API error | Retry with exponential backoff |
| Validation | Invalid configuration, policy violation | Fix code and rerun pipeline |
| Resource Conflict | Resource already exists, naming collision | Import existing resource or rename |
| Partial Apply | Some resources created, others failed | Fix issue and reapply (idempotent) |
| State Corruption | State file corrupted or desynchronized | State surgery or import from actual infra |
| Dependency Loop | Circular resource dependencies | Restructure configuration or use targeted apply |
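The retry-with-exponential-backoff strategy from the first row can be sketched as follows (`TransientError` and the flaky operation are illustrative stand-ins for rate limits or network timeouts):

```python
import time

class TransientError(Exception):
    """Stand-in for a retryable failure: timeout, rate limit, 5xx response."""

def retry(op, attempts=4, base_delay=0.01):
    """Retry op, doubling the delay after each transient failure."""
    for attempt in range(attempts):
        try:
            return op()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the failure
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, 0.04s...

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return "ok"

print(retry(flaky))  # ok
```

Only transient failures should be retried; retrying a validation or policy failure just burns time, and retrying a state-corruption failure can make things worse.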
Designing for Idempotency:
The most important principle for handling failures is idempotency: running the pipeline multiple times should produce the same result as running it once. This means a rerun after a partial failure completes the remaining work instead of duplicating it, resources already in the desired state are left untouched, and steps outside the infrastructure tool (notifications, scripts) are safe to repeat.
Terraform and similar tools are designed for idempotency, but you must also design your pipeline logic to support it—avoid side effects outside of the infrastructure tool itself.
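Idempotent convergence itself can be illustrated with a toy reconciler (this is the principle, not how Terraform is implemented):

```python
def apply(desired: dict, actual: dict) -> dict:
    """Converge actual state toward desired state; returns the changes made."""
    changes = {}
    for name, spec in desired.items():
        if actual.get(name) != spec:
            changes[name] = spec  # create or update only what differs
            actual[name] = spec
    for name in list(actual):
        if name not in desired:
            del actual[name]      # destroy resources no longer declared
    return changes

actual = {}
desired = {"vpc": {"cidr": "10.0.0.0/16"}}
first = apply(desired, actual)   # creates the vpc
second = apply(desired, actual)  # second run is a no-op
print(first, second)  # {'vpc': {'cidr': '10.0.0.0/16'}} {}
```

Because the second run makes no changes, rerunning the pipeline after a partial failure is safe: it simply finishes the work the first run left undone.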
The worst pipeline failures leave infrastructure in an inconsistent state: some resources created, others not, state file out of sync with reality. Always ensure your failure handling either completes the change or fully rolls it back. If you must leave partial state, document it extensively and alert the right people.
Infrastructure pipelines transform infrastructure management from a high-risk manual activity into a systematic, repeatable, and auditable process. The key principles to remember: build confidence progressively through staged quality gates; preview every change with a plan and apply only the reviewed plan artifact; promote changes through environments with explicit gates; design every stage to be idempotent; and never cancel an in-progress infrastructure operation.
What's Next:
Now that you understand the fundamentals of infrastructure pipelines, the next page explores GitOps principles—a set of practices that take infrastructure automation to the next level by treating Git as the single source of truth for infrastructure state.
You now understand infrastructure pipeline architecture, the purpose of each stage, and the key differences from application CI/CD. You're equipped to design pipelines that make infrastructure changes safe, fast, and auditable.