"It works on my machine."
Every software developer has uttered these words. Every operations engineer has heard them with a weary sigh. Code that runs perfectly in development fails mysteriously in production. A bug that doesn't exist in staging appears in prod. A hotfix that works on one server breaks another.
The root cause is almost always environment inconsistency.
Development uses different library versions than production. Staging has a slightly different network configuration. That one production server has a manually-installed package that no one remembers. These inconsistencies accumulate silently until they create catastrophic, hard-to-diagnose failures.
Infrastructure as Code promises to eliminate this entire category of problems. When infrastructure is defined as code, every environment is provisioned from the same source. Reproducibility becomes automatic. Consistency is enforced, not hoped for.
This is perhaps the most transformative benefit of IaC—the end of snowflake environments.
By the end of this page, you will understand configuration drift and its causes, how IaC achieves reproducibility, strategies for ensuring consistency across environments, drift detection and remediation, and the practices that mature organizations use to maintain reliable infrastructure at scale.
Configuration drift occurs when the actual state of infrastructure diverges from its intended or documented state. This happens gradually, silently, and inevitably in manually-managed systems.
How Drift Happens:

- An emergency fix is applied directly to a server during an incident and never backported to the provisioning code.
- A setting is changed in the cloud console "just this once."
- Packages or library versions are updated on some machines but not others.
- Documentation falls behind reality, so the intended state itself becomes unclear.
The Compounding Effect:
Drift doesn't stay isolated. One drifted configuration leads to workarounds, which create more drift: a manual change means the automation no longer applies cleanly, so engineers route around the automation with further manual changes, pushing the documented state even further from reality.
This is how organizations end up with infrastructure that 'only works if you don't touch it.'
A 2020 study found that configuration drift causes 35% of unplanned outages and increases mean time to recovery (MTTR) by 200-400%. The time spent diagnosing 'why does this work on server A but not server B?' dwarfs the time that would have been spent preventing drift in the first place.
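To make drift concrete, here is a small hypothetical Terraform example (resource names and values are placeholders) showing how a single console change diverges from the declared configuration:

```hcl
# The code declares a t3.medium web server...
resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.medium"
}

# ...but during an incident someone resized it to t3.large in the console.
# The next `terraform plan` surfaces the divergence as a pending change:
#
#   ~ instance_type = "t3.large" -> "t3.medium"
#
# Until the drift is either re-applied or deliberately codified, the real
# infrastructure no longer matches its documented, intended state.
```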
Infrastructure as Code achieves reproducibility through a combination of practices and properties that work together to ensure consistent, predictable infrastructure.
The Reproducibility Stack:
```hcl
# version-pinning.tf - Locking versions for reproducibility

terraform {
  # Pin Terraform version - exact version for maximum reproducibility
  required_version = "= 1.5.7"

  # Pin provider versions with precision
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.20.0" # Allows 5.20.x, locks major/minor
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "= 2.23.0" # Exact version lock
    }
  }

  # Use a remote backend for state consistency
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "production/infrastructure.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks" # State locking
    encrypt        = true
  }
}

# Pin module versions - never use unversioned modules in production
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.1.2" # Exact version

  # ... configuration
}

# The .terraform.lock.hcl file (auto-generated) captures exact checksums
# This ensures bit-for-bit identical provider binaries across machines
```

Terraform generates a .terraform.lock.hcl file that captures checksums of providers. Always commit this file to version control. It ensures that every engineer and every CI/CD run uses bit-for-bit identical provider binaries, eliminating subtle version inconsistencies.
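For reference, a lock-file entry looks roughly like the sketch below (the resolved version is illustrative and the checksum list is elided); it is generated by terraform init and should be committed, never hand-edited:

```hcl
# .terraform.lock.hcl (generated by `terraform init`; commit it, don't edit it)
provider "registry.terraform.io/hashicorp/aws" {
  version     = "5.20.1"    # the exact version resolved from "~> 5.20.0"
  constraints = "~> 5.20.0"
  hashes = [
    # real files list several "h1:" and "zh:" checksums per platform here
  ]
}
```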
Maintaining consistency across development, staging, and production environments requires intentional architectural patterns. The goal is for these environments to be identical in structure while differing only in scale and data.
Environment Consistency Patterns:
| Pattern | Description | Trade-offs |
|---|---|---|
| Shared Modules | All environments use identical modules with different variables | Maximum consistency; requires careful variable design |
| Template Environments | Generate environment configs from a single template | Reduces duplication; tooling complexity |
| Environment-as-Variable | Single codebase with environment passed as parameter | DRY code; conditional logic can get complex |
| Promotion Pipeline | Changes flow dev → staging → production automatically | Strong validation; requires CI/CD maturity |
| GitOps with Branches | Each environment is a directory/branch; reconciled continuously | Clear separation; potential for drift between branches |
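As one illustration of the Environment-as-Variable row, here is a minimal, hypothetical sketch that keeps a single codebase and selects settings from a per-environment map (all values are placeholders):

```hcl
# A hypothetical environment-as-variable sketch: one codebase, one variable.
variable "environment" {
  type = string
}

locals {
  # Per-environment settings live in a single map...
  env_config = {
    development = { instance_count = 1, instance_type = "t3.small" }
    staging     = { instance_count = 2, instance_type = "t3.medium" }
    production  = { instance_count = 10, instance_type = "m5.xlarge" }
  }

  # ...and the active configuration is looked up by environment name.
  config = local.env_config[var.environment]
}

# Resources then reference local.config.instance_count,
# local.config.instance_type, and so on.
```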
The Shared Module Pattern (Recommended):
The most effective pattern uses shared modules that define how resources are constructed, while environment-specific configurations control what is deployed:
```hcl
# modules/web-cluster/main.tf - Reusable module
# Defines HOW a web cluster is built - same for all environments

variable "environment" {
  type = string
}

variable "instance_count" {
  type = number
}

variable "instance_type" {
  type = string
}

variable "vpc_id" {
  type = string
}

resource "aws_autoscaling_group" "web" {
  name             = "${var.environment}-web-asg"
  desired_capacity = var.instance_count
  min_size         = var.instance_count
  max_size         = var.instance_count * 2

  launch_template {
    id = aws_launch_template.web.id
  }

  # ... common configuration that applies to ALL environments
}

# -------------------------------------------

# environments/production/main.tf - Production values
module "web_cluster" {
  source = "../../modules/web-cluster"

  environment    = "production"
  instance_count = 10
  instance_type  = "m5.xlarge"
  vpc_id         = module.vpc.vpc_id
}

# -------------------------------------------

# environments/staging/main.tf - Staging values
module "web_cluster" {
  source = "../../modules/web-cluster"

  environment    = "staging"
  instance_count = 2
  instance_type  = "t3.medium"
  vpc_id         = module.vpc.vpc_id
}

# -------------------------------------------

# environments/development/main.tf - Development values
module "web_cluster" {
  source = "../../modules/web-cluster"

  environment    = "development"
  instance_count = 1
  instance_type  = "t3.small"
  vpc_id         = module.vpc.vpc_id
}
```

Well-designed IaC has environments that are structurally identical but parametrically different. Production might have 10 large instances; development has 1 small instance. But both use the exact same module code, ensuring behavioral consistency.
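One refinement worth considering (not shown above, and purely illustrative): the shared module can validate its inputs so a typo such as "prod" instead of "production" fails at plan time rather than creating an oddly-named environment.

```hcl
# Hypothetical input validation inside modules/web-cluster/variables.tf
variable "environment" {
  type = string

  validation {
    condition     = contains(["development", "staging", "production"], var.environment)
    error_message = "environment must be one of: development, staging, production."
  }
}
```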
Even with IaC, drift can occur if someone makes manual changes outside the normal workflow. Mature IaC practices include continuous drift detection to identify and remediate these deviations.
Drift Detection Approaches:
- Scheduled plans: run terraform plan on a schedule (daily/hourly). Any drift appears as unexpected changes in the plan.
- State refresh: terraform refresh (now implicit in plan) detects differences between state and reality.
```yaml
# .github/workflows/drift-detection.yml
name: Drift Detection

on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours
  workflow_dispatch: {}     # Manual trigger

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [production, staging]

    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.5.7

      - name: Terraform Init
        run: terraform init
        working-directory: environments/${{ matrix.environment }}

      - name: Terraform Plan (Drift Check)
        id: plan
        run: |
          terraform plan -detailed-exitcode -out=plan.out 2>&1 | tee plan.txt
        working-directory: environments/${{ matrix.environment }}
        continue-on-error: true

      - name: Check for Drift
        run: |
          if [ "${{ steps.plan.outputs.exitcode }}" == "2" ]; then
            echo "⚠️ DRIFT DETECTED in ${{ matrix.environment }}"
            echo "drift_detected=true" >> $GITHUB_OUTPUT

            # Send alert
            curl -X POST "$SLACK_WEBHOOK" -d '{
              "text": "🚨 Drift detected in ${{ matrix.environment }}!",
              "attachments": [{
                "text": "'"$(cat plan.txt | tail -50)"'"
              }]
            }'
          else
            echo "✅ No drift in ${{ matrix.environment }}"
          fi
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
```

Remediating Drift:
When drift is detected, you have two main options:

- Re-run terraform apply to restore the intended state (most common).
- Update the code to match reality, if the manual change turns out to be an intentional improvement worth keeping.

The key is that drift should never persist unnoticed. Detection must be continuous and alerts must be actionable.
Some drift is expected: AWS may add default tags, managed services may auto-update settings, or load balancers may adjust health check parameters. Good IaC separates expected drift (ignore) from actual configuration changes (alert). Use lifecycle ignore_changes for expected variations.
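As a rough sketch of that separation (the attribute chosen here is illustrative): if scaling policies adjust capacity at runtime, exclude that attribute from drift comparisons so only genuine configuration changes surface.

```hcl
# A minimal sketch: mark runtime-managed attributes as expected drift.
resource "aws_autoscaling_group" "web" {
  name             = "${var.environment}-web-asg"
  min_size         = var.instance_count
  max_size         = var.instance_count * 2
  desired_capacity = var.instance_count

  launch_template {
    id = aws_launch_template.web.id
  }

  # ... subnets and other configuration as in the shared module ...

  lifecycle {
    # Scaling policies change desired_capacity at runtime; that is expected
    # behavior, not drift, so leave it out of plan comparisons.
    ignore_changes = [desired_capacity]
  }
}
```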
Immutable infrastructure takes consistency to its logical extreme: instead of updating resources in place, you replace them entirely with new resources built from current code. This eliminates accumulated state and ensures every instance is identical.
Mutable vs. Immutable Infrastructure:

- Mutable infrastructure is updated in place: patches, configuration tweaks, and hotfixes accumulate on long-lived servers, so each server slowly develops its own history and drifts from its peers.
- Immutable infrastructure is never modified after launch: any change means building a new image from the current code and replacing the old instances entirely.
Implementing Immutable Infrastructure:
Immutable patterns typically involve baking fully-configured machine images in a CI pipeline, referencing those images from versioned launch templates, rolling new instances in before old ones are terminated, and removing in-place access (such as SSH) so servers cannot be changed after launch, as the following Terraform example shows:
```hcl
# Immutable infrastructure pattern with Terraform

# The AMI is built by a CI pipeline - contains everything pre-installed
data "aws_ami" "web_server" {
  most_recent = true
  owners      = ["self"]

  filter {
    name   = "name"
    values = ["web-server-*"]
  }

  filter {
    name   = "tag:Environment"
    values = [var.environment]
  }
}

resource "aws_launch_template" "web" {
  name_prefix   = "${var.environment}-web-"
  image_id      = data.aws_ami.web_server.id # New AMI = new version
  instance_type = var.instance_type

  # No SSH access - servers are immutable
  # key_name = "..." # Intentionally omitted

  user_data = base64encode(<<-EOF
    #!/bin/bash
    # Minimal bootstrap - heavy config is in the AMI
    echo "Starting application version: ${var.app_version}"
    systemctl start application
  EOF
  )

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "web" {
  name             = "${var.environment}-web-${var.app_version}"
  desired_capacity = var.instance_count
  min_size         = var.instance_count
  max_size         = var.instance_count * 2

  launch_template {
    id      = aws_launch_template.web.id
    version = "$Latest"
  }

  # Rolling deployment - new instances created before old terminated
  instance_refresh {
    strategy = "Rolling"
    preferences {
      min_healthy_percentage = 80
    }
  }

  lifecycle {
    create_before_destroy = true
  }
}

# A new AMI version triggers:
# 1. New launch template version
# 2. Instance refresh in ASG
# 3. New instances launch with new AMI
# 4. Old instances terminated after healthy
# 5. Zero accumulated state
```

If you're using Kubernetes, you're already practicing immutable infrastructure at the application layer. Pods are replaced, not updated. Images are versioned and immutable. Extending this principle to the underlying infrastructure (node images, cluster configuration) completes the picture.
Reproducibility isn't just about writing code correctly—it requires testing to verify that infrastructure can actually be reproduced consistently. Infrastructure testing has matured significantly and should be part of every IaC workflow.
Levels of Infrastructure Testing:
| Level | What It Tests | Tools | Speed |
|---|---|---|---|
| Syntax/Formatting | Code is syntactically valid and formatted | terraform fmt, terraform validate | Seconds |
| Linting | Code follows best practices and conventions | tflint, checkov, tfsec | Seconds |
| Policy Compliance | Resources comply with organizational policies | Sentinel, OPA, Checkov | Seconds to minutes |
| Unit Testing | Module logic produces expected outputs | Terraform test, pytest-terraform | Minutes |
| Integration Testing | Resources actually deploy and work together | Terratest, Kitchen-Terraform | Minutes to hours |
| End-to-End Testing | Complete environment functions correctly | Custom smoke tests, synthetic monitoring | Hours |
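For the unit-testing level, Terraform's native test framework (available in Terraform 1.6 and later, so newer than the 1.5.7 pin used in the examples above) can assert on planned values without creating real resources. A hypothetical sketch against the web-cluster module shown earlier:

```hcl
# tests/web_cluster.tftest.hcl - hypothetical native `terraform test` unit test
variables {
  environment    = "test"
  instance_count = 1
  instance_type  = "t3.micro"
  vpc_id         = "vpc-12345678" # placeholder
}

run "max_size_scales_with_instance_count" {
  command = plan # plan-only: nothing is actually deployed

  assert {
    condition     = aws_autoscaling_group.web.max_size == var.instance_count * 2
    error_message = "max_size should always be twice instance_count"
  }
}
```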
```go
// Integration test using Terratest
// This actually deploys infrastructure and verifies it works

package test

import (
	"testing"
	"time"

	"github.com/gruntwork-io/terratest/modules/http-helper"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestWebClusterDeploysSuccessfully(t *testing.T) {
	t.Parallel()

	terraformOptions := &terraform.Options{
		TerraformDir: "../environments/test",
		Vars: map[string]interface{}{
			"environment":    "test",
			"instance_count": 1,
			"instance_type":  "t3.micro",
		},
	}

	// Clean up at the end
	defer terraform.Destroy(t, terraformOptions)

	// Deploy the infrastructure
	terraform.InitAndApply(t, terraformOptions)

	// Get the ALB URL from outputs
	albUrl := terraform.Output(t, terraformOptions, "alb_url")

	// Verify the service is reachable
	http_helper.HttpGetWithRetry(
		t,
		albUrl+"/health",
		nil,
		200,
		"OK",
		30,            // max retries
		5*time.Second, // sleep between retries
	)

	// Verify outputs match expected values
	vpcId := terraform.Output(t, terraformOptions, "vpc_id")
	assert.Contains(t, vpcId, "vpc-", "VPC ID should be valid")

	instanceCount := terraform.Output(t, terraformOptions, "instance_count")
	assert.Equal(t, "1", instanceCount, "Should deploy 1 instance")
}

func TestEnvironmentsAreConsistent(t *testing.T) {
	// Compare that staging and production use same module versions
	// This is a meta-test for consistency
	stagingOptions := &terraform.Options{
		TerraformDir: "../environments/staging",
	}
	prodOptions := &terraform.Options{
		TerraformDir: "../environments/production",
	}

	terraform.Init(t, stagingOptions)
	terraform.Init(t, prodOptions)

	// Verify both use the same provider versions
	// (The lock files should match)
	// Custom assertion logic here
}
```

The ultimate test of reproducibility is: can you spin up a complete copy of production? If your test environment has different configurations, you're not truly testing your production infrastructure. The same IaC should create both.
Mature organizations have developed specific patterns that ensure reproducibility even in complex, large-scale environments.
Pattern 1: Ephemeral Environments
Create and destroy full environments on demand, proving reproducibility every time. A common form is a preview environment spun up for each pull request and torn down when it merges (see the sketch below).
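A minimal, hypothetical sketch, assuming the shared web-cluster module shown earlier and a pr_number variable supplied by CI:

```hcl
# environments/preview/main.tf (hypothetical) - one short-lived stack per pull request
variable "pr_number" {
  type = string # supplied by CI, e.g. -var="pr_number=123"
}

module "web_cluster" {
  source = "../../modules/web-cluster"

  environment    = "pr-${var.pr_number}"
  instance_count = 1
  instance_type  = "t3.small"
  vpc_id         = module.vpc.vpc_id
}

# CI applies this stack when the pull request opens and destroys it on merge,
# proving on every change that the environment can be rebuilt from code alone.
```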
Pattern 2: Infrastructure as Cattle
Treat entire environments as replaceable, not just individual servers: if an environment becomes suspect or a region fails, rebuild it from code rather than nursing it back to health.
Pattern 3: Infrastructure Contracts
Define explicit contracts between infrastructure and applications:
```hcl
# outputs.tf - The infrastructure 'contract' with consuming applications

# Applications depend on these outputs, not internal implementation
# Changes to these outputs are breaking changes that require coordination

output "database_endpoint" {
  description = "PostgreSQL endpoint for application connections"
  value       = aws_db_instance.main.endpoint
}

output "database_port" {
  description = "PostgreSQL port"
  value       = aws_db_instance.main.port
}

output "redis_endpoint" {
  description = "Redis cluster endpoint"
  value       = aws_elasticache_cluster.redis.cache_nodes[0].address
}

output "application_security_group" {
  description = "Security group for application containers"
  value       = aws_security_group.application.id
}

output "private_subnet_ids" {
  description = "Subnet IDs for application deployment"
  value       = aws_subnet.private[*].id
}

# These outputs form a stable interface
# Internal implementation can change without breaking consumers
# Both infrastructure and application teams know the contract
```

Netflix's famous Chaos Monkey randomly terminates production instances. This works only because their infrastructure is fully reproducible—any instance can be replaced automatically. Chaos engineering proves reproducibility under fire.
We've explored how Infrastructure as Code delivers reproducibility and consistency—perhaps its most transformative benefits. Let's consolidate the key insights:

- Configuration drift is inevitable in manually-managed systems and compounds until environments become fragile snowflakes.
- Reproducibility comes from pinning tool, provider, and module versions, committing lock files, and provisioning every environment from the same source.
- Environments should be structurally identical and differ only in parameters such as instance counts and sizes.
- Drift detection must run continuously, with expected variations filtered out and genuine deviations alerted on and remediated.
- Immutable infrastructure and automated testing prove, rather than assume, that environments can be rebuilt from code.
Module Complete:
You've now completed the foundational module on Infrastructure as Code. You understand:
What's Next:
The next module dives into Terraform, the most widely-adopted IaC tool. You'll learn Terraform fundamentals, provider and resource patterns, state management, modules, and the practical workflow that teams use to manage infrastructure at scale.
Congratulations! You now have a comprehensive understanding of Infrastructure as Code principles. These fundamentals apply regardless of which specific tools you use. You're prepared to work with any IaC tool and understand why it works the way it does.