In the early days of computing, infrastructure changes were manual rituals performed by system administrators with root access and tribal knowledge. Today, elite engineering organizations treat infrastructure changes the same way they treat application code changes: through automated pipelines that validate, test, and deploy infrastructure with deterministic consistency.
This transformation—from ad-hoc scripts and runbooks to fully automated infrastructure pipelines—represents one of the most significant operational advances in modern software engineering. When done correctly, infrastructure pipelines enable teams to provision and modify infrastructure with the same speed, confidence, and auditability they expect from their application deployments.
By the end of this page, you will understand the anatomy of infrastructure pipelines, how they differ from application CI/CD, the key stages every pipeline must implement, and the patterns that enable organizations to scale infrastructure automation safely. You will be able to design pipelines that transform infrastructure changes from high-risk manual operations into routine, predictable events.
Infrastructure pipelines aren't merely a convenience—they're a fundamental shift in how organizations manage risk, velocity, and operational excellence. Understanding why they matter requires examining the problems they solve and the capabilities they unlock.
The Traditional Infrastructure Problem:
Before infrastructure pipelines, most organizations operated through a combination of manual console changes, ad-hoc scripts, hand-maintained runbooks, and tribal knowledge held by a few administrators with privileged access.
This approach doesn't scale. As organizations grow, infrastructure complexity explodes. Each new service, each new environment, each new region multiplies the configuration surface area. Manual processes that worked for ten servers become impossible with ten thousand.
Organizations often fear that adding automation and validation to infrastructure changes will slow them down. The opposite is true: by removing manual coordination overhead and reducing failure rates, infrastructure pipelines dramatically increase the sustainable rate of change. Teams that deploy through pipelines typically make 10-100x more infrastructure changes than teams relying on manual processes—with far fewer incidents.
A well-designed infrastructure pipeline consists of distinct stages, each serving a specific purpose in the journey from code commit to production deployment. Understanding these stages—and the principles behind them—is essential for building pipelines that are both robust and maintainable.
The Core Pipeline Stages:
Every infrastructure pipeline, regardless of tooling or cloud provider, follows a similar conceptual flow:
| Stage | Purpose | Key Activities | Failure Mode |
|---|---|---|---|
| Validate | Catch syntax and configuration errors early | Linting, format checking, schema validation, security scanning | Fast fail on invalid configuration |
| Plan | Preview what changes will occur | Generate execution plan, diff against current state, resource change analysis | Surface destructive changes for review |
| Review | Human approval for significant changes | Peer review, plan review, approval gates | Block deployment until approved |
| Test | Validate infrastructure behavior | Policy tests, integration tests, compliance checks | Prevent non-compliant configurations |
| Deploy | Apply changes to target environment | Execute infrastructure changes, handle dependencies, manage state | Halt and remediate on apply failure |
| Verify | Confirm deployment success | Health checks, smoke tests, drift detection | Alert on verification failure |
The Principle of Progressive Confidence:
Notice how these stages build confidence progressively. Early stages catch cheap-to-fix errors (syntax, formatting), while later stages verify more expensive properties (actual behavior, integration). This ordering minimizes wasted time and resources—you don't run integration tests on code that can't parse.
Each stage acts as a quality gate. Changes that fail any stage don't proceed to subsequent stages, ensuring that only well-validated changes reach production.
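The gate behavior can be sketched in a few lines of Python (illustrative only; the stage names and checks are placeholders, not any real pipeline's API):

```python
from typing import Callable

def run_pipeline(stages: list[tuple[str, Callable[[], bool]]]) -> str:
    """Run stages in order; the first failing gate stops the pipeline."""
    for name, check in stages:
        if not check():
            return f"failed at {name}"  # later stages never run
    return "deployed"

stages = [
    ("validate", lambda: True),
    ("plan",     lambda: True),
    ("test",     lambda: False),  # a failing policy test...
    ("deploy",   lambda: True),   # ...prevents deploy from running
]
print(run_pipeline(stages))  # failed at test
```

The ordering mirrors the table above: cheap checks first, so an invalid configuration never consumes the time or cloud resources of the later stages.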
The validate stage is where infrastructure pipelines provide their fastest feedback. By catching issues in seconds rather than minutes or hours, this stage dramatically improves developer experience and reduces the cost of errors.
What Validation Catches:
Validation encompasses multiple types of checks, each targeting different error categories:
```yaml
name: Infrastructure Validation

on:
  pull_request:
    paths:
      - 'infrastructure/**'
      - '.github/workflows/infrastructure-*.yml'

jobs:
  validate:
    name: Validate Infrastructure
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infrastructure
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.6.x"

      # Stage 1: Format Check
      - name: Terraform Format Check
        id: fmt
        run: terraform fmt -check -recursive
        continue-on-error: true

      # Stage 2: Initialize (required for validation)
      - name: Terraform Init
        id: init
        run: terraform init -backend=false

      # Stage 3: Syntax Validation
      - name: Terraform Validate
        id: validate
        run: terraform validate -no-color

      # Stage 4: Static Security Analysis
      - name: Run tfsec Security Scanner
        uses: aquasecurity/tfsec-action@v1.0.0
        with:
          working_directory: infrastructure
          soft_fail: false

      # Stage 5: Policy Compliance (Checkov)
      - name: Run Checkov Policy Checks
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: infrastructure
          framework: terraform
          soft_fail: false
          output_format: cli,sarif
          output_file_path: console,results.sarif

      # Stage 6: Credential Scanning
      - name: Scan for Secrets
        uses: trufflesecurity/trufflehog@v3.63.0
        with:
          path: ./infrastructure
          base: ${{ github.event.pull_request.base.sha }}
          head: ${{ github.event.pull_request.head.sha }}

      # Report validation status
      - name: Post Validation Summary
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const output = `### Infrastructure Validation Results
            | Check | Status |
            |-------|--------|
            | Format | ${{ steps.fmt.outcome == 'success' && '✅ Pass' || '❌ Fail' }} |
            | Init | ${{ steps.init.outcome == 'success' && '✅ Pass' || '❌ Fail' }} |
            | Validate | ${{ steps.validate.outcome == 'success' && '✅ Pass' || '❌ Fail' }} |
            `;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });
```

Validation catches configuration errors, but it cannot guarantee that infrastructure will behave correctly. A syntactically valid Terraform configuration might still create misconfigured resources, fail due to API limits, or violate business requirements. Validation is necessary but not sufficient—it must be complemented by planning, testing, and post-deployment verification.
The plan stage is arguably the most critical stage in infrastructure pipelines. It produces a detailed preview of exactly what changes will occur, enabling human reviewers to understand the impact before any resources are modified.
Why Planning Is Essential:
Unlike application deployments where changes are often additive and easily reversible, infrastructure changes can be destructive and irreversible. Deleting a database, removing a VPC, or modifying IAM policies can have catastrophic consequences. The plan stage provides the last line of defense before these changes occur.
The Plan Contains:
Every resource that will be created, updated, destroyed, or replaced, together with the attribute-level diff for each change.
```yaml
name: Infrastructure Plan

on:
  pull_request:
    paths:
      - 'infrastructure/**'

env:
  TF_VAR_environment: "staging"
  ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
  ARM_CLIENT_SECRET: ${{ secrets.AZURE_CLIENT_SECRET }}
  ARM_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
  ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}

jobs:
  plan:
    name: Generate Infrastructure Plan
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.6.x"

      - name: Terraform Init
        id: init
        run: |
          cd infrastructure
          terraform init

      - name: Terraform Plan
        id: plan
        run: |
          cd infrastructure
          terraform plan -no-color -out=tfplan \
            -detailed-exitcode \
            2>&1 | tee plan_output.txt
        continue-on-error: true

      - name: Analyze Plan for Destructive Changes
        id: analyze
        run: |
          cd infrastructure

          # Count different change types
          CREATES=$(grep -c "will be created" plan_output.txt || true)
          UPDATES=$(grep -c "will be updated" plan_output.txt || true)
          DESTROYS=$(grep -c "will be destroyed" plan_output.txt || true)
          REPLACES=$(grep -c "must be replaced" plan_output.txt || true)

          echo "creates=$CREATES" >> $GITHUB_OUTPUT
          echo "updates=$UPDATES" >> $GITHUB_OUTPUT
          echo "destroys=$DESTROYS" >> $GITHUB_OUTPUT
          echo "replaces=$REPLACES" >> $GITHUB_OUTPUT

          # Flag high-risk changes
          if [ "$DESTROYS" -gt 0 ] || [ "$REPLACES" -gt 0 ]; then
            echo "high_risk=true" >> $GITHUB_OUTPUT
          else
            echo "high_risk=false" >> $GITHUB_OUTPUT
          fi

      - name: Post Plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('infrastructure/plan_output.txt', 'utf8');

            // Truncate if too long for GitHub comment
            const maxLength = 60000;
            const truncatedPlan = plan.length > maxLength
              ? plan.substring(0, maxLength) + '\n... [truncated]'
              : plan;

            const highRisk = '${{ steps.analyze.outputs.high_risk }}' === 'true';
            const riskBadge = highRisk ? '⚠️ **HIGH RISK**' : '✅ Low Risk';

            const output = `### Terraform Plan Summary
            ${riskBadge}

            | Change Type | Count |
            |-------------|-------|
            | Create | ${{ steps.analyze.outputs.creates }} |
            | Update | ${{ steps.analyze.outputs.updates }} |
            | Destroy | ${{ steps.analyze.outputs.destroys }} |
            | Replace | ${{ steps.analyze.outputs.replaces }} |

            <details>
            <summary>Show Full Plan</summary>

            \`\`\`hcl
            ${truncatedPlan}
            \`\`\`
            </details>

            ${highRisk ? '\n⚠️ **This plan includes destructive changes. Please review carefully before approving.**' : ''}
            `;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });

      - name: Upload Plan Artifact
        uses: actions/upload-artifact@v4
        with:
          name: terraform-plan
          path: infrastructure/tfplan
          retention-days: 5

      - name: Require Approval for Destructive Changes
        if: steps.analyze.outputs.high_risk == 'true'
        run: |
          echo "::warning::This plan includes destructive changes."
          echo "Please ensure appropriate reviewers have approved before merge."
```

Plan Artifacts Are Critical:
Notice that the plan is saved as an artifact (tfplan). This is essential for ensuring that the plan reviewed during the PR is exactly the plan applied during deployment. Without this, there's a race condition: infrastructure could change between plan and apply, causing unexpected modifications.
The pattern is: generate the plan once, publish it as an artifact, review and approve that exact artifact, then apply the saved artifact during deployment.
This guarantees that reviewers see exactly what will be deployed.
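One lightweight way to enforce this guarantee is to record a checksum of the plan artifact at review time and recompute it immediately before apply. A minimal Python sketch (the byte strings stand in for real tfplan contents):

```python
import hashlib

def digest(plan_bytes: bytes) -> str:
    """Content hash that uniquely identifies a plan artifact."""
    return hashlib.sha256(plan_bytes).hexdigest()

reviewed = digest(b"fake tfplan contents")  # recorded when the plan is approved
at_apply = digest(b"fake tfplan contents")  # recomputed just before apply

# Any mismatch means the artifact changed between review and apply,
# and the deployment should be aborted.
assert reviewed == at_apply, "plan artifact changed between review and apply"
```

In practice the recorded hash travels with the approval (for example, in the PR comment or a pipeline variable), so the apply job can refuse to run against anything else.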
The -detailed-exitcode flag makes Terraform return: 0 = no changes, 1 = error, 2 = changes pending. This allows pipelines to distinguish between 'no work needed' and 'work needed' rather than having to parse output text.
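The dispatch logic a pipeline builds on top of these exit codes can be sketched as follows (a hypothetical helper, not part of Terraform itself):

```python
def handle_plan_exit(code: int) -> str:
    """Map `terraform plan -detailed-exitcode` results to pipeline actions.
    Per Terraform's documented codes: 0 = no changes, 1 = error, 2 = changes pending."""
    if code == 0:
        return "skip-apply"          # nothing to do; end the pipeline early
    if code == 2:
        return "proceed-to-review"   # changes pending; post the plan for review
    return "fail"                    # exit code 1 (or anything else) is an error

assert handle_plan_exit(0) == "skip-apply"
assert handle_plan_exit(2) == "proceed-to-review"
```

Branching on the exit code is both faster and more reliable than grepping plan output, which can change format between Terraform versions.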
The deployment stage—where infrastructure changes are actually applied—requires meticulous attention to safety. Unlike application deployments that can often be rolled back in seconds, infrastructure changes may take minutes or hours to reverse, and some changes (data deletion, resource destruction) may be permanently irreversible.
Key Principles for Safe Infrastructure Deployment:
```yaml
name: Infrastructure Deploy

on:
  push:
    branches:
      - main
    paths:
      - 'infrastructure/**'

concurrency:
  group: infrastructure-deploy
  cancel-in-progress: false  # Never cancel in-progress infrastructure changes

env:
  TF_VAR_environment: "production"

jobs:
  deploy:
    name: Deploy Infrastructure
    runs-on: ubuntu-latest
    environment: production  # Requires manual approval
    timeout-minutes: 60
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "1.6.x"

      - name: Download Plan Artifact
        uses: actions/download-artifact@v4
        with:
          name: terraform-plan
          path: infrastructure/

      - name: Terraform Init
        run: |
          cd infrastructure
          terraform init

      - name: Notify Deployment Start
        uses: slackapi/slack-github-action@v1.24.0
        with:
          channel-id: 'infra-deployments'
          slack-message: |
            🚀 Infrastructure deployment starting
            Triggered by: ${{ github.actor }}
            Commit: ${{ github.sha }}
            Environment: production

      - name: Terraform Apply
        id: apply
        run: |
          cd infrastructure
          terraform apply -auto-approve -no-color tfplan \
            2>&1 | tee apply_output.txt

      - name: Verify Deployment
        if: success()
        run: |
          cd infrastructure
          # Run post-deployment verification
          terraform output -json > outputs.json
          # Verify critical resources exist and are healthy
          # This would call your verification scripts
          ./scripts/verify-deployment.sh

      - name: Notify Success
        if: success()
        uses: slackapi/slack-github-action@v1.24.0
        with:
          channel-id: 'infra-deployments'
          slack-message: |
            ✅ Infrastructure deployment succeeded
            Commit: ${{ github.sha }}

      - name: Notify Failure
        if: failure()
        uses: slackapi/slack-github-action@v1.24.0
        with:
          channel-id: 'infra-alerts'
          slack-message: |
            ❌ Infrastructure deployment FAILED
            Triggered by: ${{ github.actor }}
            Commit: ${{ github.sha }}
            Please investigate immediately.

      - name: Archive Apply Output
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: terraform-apply-log
          path: infrastructure/apply_output.txt
          retention-days: 30
```

The concurrency block with `cancel-in-progress: false` is critical. Unlike application deployments, you should NEVER cancel an infrastructure deployment in progress. Doing so can leave infrastructure in an inconsistent state. Always let infrastructure operations complete, then address any issues.
Production infrastructure changes should never be deployed directly. Instead, changes should flow through a series of environments, building confidence at each stage. This environment promotion pattern is a fundamental practice for reducing production risk.
Common Environment Structures:
| Environment | Purpose | Deployment Trigger | Validation |
|---|---|---|---|
| Development | Rapid iteration, feature development | Push to feature branch | Basic validation, unit tests |
| Staging | Integration testing, pre-production validation | Merge to main branch | Full test suite, smoke tests |
| Production | Live customer-facing infrastructure | Manual approval after staging success | Verification, monitoring, alerting |
Promotion Gates:
Moving between environments should require explicit gates: successful deployment and verification in the lower environment, a passing test suite, and manual approval before any production change.
Handling Environment Differences:
Environments should be as similar as possible, but differences are inevitable (scale, costs, integrations). Handle these through parameterized configuration: a single codebase whose per-environment values are supplied by variables, rather than divergent copies of the code.
```hcl
# variables.tf
variable "environment" {
  description = "Environment name (dev, staging, production)"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

# Environment-specific configuration
locals {
  environment_config = {
    dev = {
      instance_type     = "t3.small"
      min_instances     = 1
      max_instances     = 2
      enable_multi_az   = false
      backup_retention  = 1
      enable_monitoring = false
    }
    staging = {
      instance_type     = "t3.medium"
      min_instances     = 2
      max_instances     = 4
      enable_multi_az   = true
      backup_retention  = 7
      enable_monitoring = true
    }
    production = {
      instance_type     = "t3.large"
      min_instances     = 3
      max_instances     = 10
      enable_multi_az   = true
      backup_retention  = 30
      enable_monitoring = true
    }
  }

  config = local.environment_config[var.environment]
}

# Usage in resources (assumes local.all_azs and the launch template
# are defined elsewhere in the configuration)
resource "aws_autoscaling_group" "main" {
  min_size = local.config.min_instances
  max_size = local.config.max_instances

  # availability_zones is a list argument, not a block:
  # span all AZs only when multi-AZ is enabled for this environment
  availability_zones = local.config.enable_multi_az ? local.all_azs : [local.all_azs[0]]

  launch_template {
    id      = aws_launch_template.main.id
    version = "$Latest"
  }
}
```

While infrastructure pipelines share concepts with application CI/CD, they have fundamental differences that inform their design. Understanding these differences is crucial for building appropriate safeguards.
These differences mean infrastructure pipelines need: longer timeouts, stricter concurrency controls, more extensive plan review, careful state management, and different rollback strategies. Don't simply copy application pipeline patterns—they may be actively dangerous for infrastructure.
The State Problem:
Infrastructure pipelines are fundamentally stateful. The pipeline must know what infrastructure currently exists to determine what changes to make. This creates unique challenges:
Robust infrastructure pipelines include explicit state management: backend configuration, locking mechanisms, state backup, and drift detection.
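Drift detection, for example, reduces to diffing recorded state against what the provider actually reports. A toy Python sketch (resource names and specs are invented for illustration):

```python
def detect_drift(recorded: dict, observed: dict) -> dict:
    """Compare recorded state against live infrastructure and flag divergence."""
    drift = {}
    for name, spec in recorded.items():
        live = observed.get(name)
        if live is None:
            drift[name] = "missing"   # deleted outside the pipeline
        elif live != spec:
            drift[name] = "modified"  # changed outside the pipeline
    return drift

recorded = {"bucket": {"versioning": True}}
observed = {"bucket": {"versioning": False}}  # someone flipped it in the console
print(detect_drift(recorded, observed))  # {'bucket': 'modified'}
```

Real tools do this by refreshing state against provider APIs; scheduling such a check on a timer turns silent drift into an actionable alert.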
Pipeline failures are inevitable. Networks fail, cloud APIs have outages, rate limits are hit, and resources conflict. The difference between resilient and fragile infrastructure pipelines lies in how failures are handled.
Categories of Pipeline Failures:
| Failure Type | Example | Recovery Strategy |
|---|---|---|
| Transient | Network timeout, rate limit, API error | Retry with exponential backoff |
| Validation | Invalid configuration, policy violation | Fix code and rerun pipeline |
| Resource Conflict | Resource already exists, naming collision | Import existing resource or rename |
| Partial Apply | Some resources created, others failed | Fix issue and reapply (idempotent) |
| State Corruption | State file corrupted or desynchronized | State surgery or import from actual infra |
| Dependency Loop | Circular resource dependencies | Restructure configuration or use targeted apply |
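The retry-with-exponential-backoff strategy from the first row can be sketched as follows (`TransientError` and the flaky operation are illustrative stand-ins for rate limits or network timeouts):

```python
import time

class TransientError(Exception):
    """Stand-in for a retryable failure: timeout, rate limit, 5xx response."""

def retry(op, attempts=4, base_delay=0.01):
    """Retry op, doubling the delay after each transient failure."""
    for attempt in range(attempts):
        try:
            return op()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the failure
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, 0.04s...

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limited")
    return "ok"

print(retry(flaky))  # ok
```

Only transient failures should be retried; retrying a validation or policy failure just burns time, and retrying a state-corruption failure can make things worse.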
Designing for Idempotency:
The most important principle for handling failures is idempotency: running the pipeline multiple times should produce the same result as running it once. This means a rerun after a partial failure completes the remaining work instead of duplicating it, resources already in the desired state are left untouched, and steps outside the infrastructure tool (notifications, scripts) are safe to repeat.
Terraform and similar tools are designed for idempotency, but you must also design your pipeline logic to support it—avoid side effects outside of the infrastructure tool itself.
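Idempotent convergence itself can be illustrated with a toy reconciler (this is the principle, not how Terraform is implemented):

```python
def apply(desired: dict, actual: dict) -> dict:
    """Converge actual state toward desired state; returns the changes made."""
    changes = {}
    for name, spec in desired.items():
        if actual.get(name) != spec:
            changes[name] = spec  # create or update only what differs
            actual[name] = spec
    for name in list(actual):
        if name not in desired:
            del actual[name]      # destroy resources no longer declared
    return changes

actual = {}
desired = {"vpc": {"cidr": "10.0.0.0/16"}}
first = apply(desired, actual)   # creates the vpc
second = apply(desired, actual)  # second run is a no-op
print(first, second)  # {'vpc': {'cidr': '10.0.0.0/16'}} {}
```

Because the second run makes no changes, rerunning the pipeline after a partial failure is safe: it simply finishes the work the first run left undone.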
The worst pipeline failures leave infrastructure in an inconsistent state: some resources created, others not, state file out of sync with reality. Always ensure your failure handling either completes the change or fully rolls it back. If you must leave partial state, document it extensively and alert the right people.
Infrastructure pipelines transform infrastructure management from a high-risk manual activity into a systematic, repeatable, and auditable process. The key principles to remember: build confidence progressively through staged quality gates; preview every change with a plan and apply only the reviewed plan artifact; promote changes through environments with explicit gates; design every stage to be idempotent; and never cancel an in-progress infrastructure operation.
What's Next:
Now that you understand the fundamentals of infrastructure pipelines, the next page explores GitOps principles—a set of practices that take infrastructure automation to the next level by treating Git as the single source of truth for infrastructure state.
You now understand infrastructure pipeline architecture, the purpose of each stage, and the key differences from application CI/CD. You're equipped to design pipelines that make infrastructure changes safe, fast, and auditable.