Testing microservices requires running multiple services, databases, message brokers, and supporting infrastructure. Unlike monolithic applications where a single process can be tested in isolation, microservices demand sophisticated environment strategies. The difference between teams that ship confidently and teams that fear every deployment often comes down to their test environment practices.
Poor environment management manifests as 'it works on my machine' failures, tests that pass locally but fail in CI, staging environments that drift from production, and hours spent debugging environment configuration instead of actual bugs. Disciplined test environment management is essential infrastructure for microservices success.
By the end of this page, you will understand how to design a comprehensive test environment strategy for microservices, including local development environments, CI/CD environments, ephemeral preview environments, and production testing approaches. You'll learn patterns for environment parity, data management, and infrastructure as code.
Microservices testing requires multiple environment types, each serving different purposes and making different trade-offs between fidelity, cost, and speed.
The Environment Hierarchy:
| Environment | Purpose | Fidelity | Cost | Speed | Isolation |
|---|---|---|---|---|---|
| Local | Developer iteration | Low-Medium | Free | Instant | Complete |
| CI | Automated verification | Medium | Low | Minutes | Complete |
| Preview/Ephemeral | PR validation | High | Medium | Minutes | Per PR |
| Staging | Release validation | High | Medium-High | Always on | Shared |
| Production | Live verification | Perfect | High | Always on | Careful design |
Environment Fidelity Spectrum:
Low Fidelity: Single service running locally with mocked dependencies. Fast, cheap, but may miss integration issues.
Medium Fidelity: Multiple services running locally with real databases but simplified infrastructure. Good for most integration testing.
High Fidelity: Complete system running in cloud with near-production configuration. Required for E2E and release validation.
Production Fidelity: Actual production environment, possibly with feature flags or traffic mirroring. Only way to verify production-specific issues.
The key insight: you need all of these, not just one. Each environment type catches different categories of bugs at different stages of development.
The closer your test environment matches production, the more representative your tests are—but the more expensive and slower they become. Strive for maximum parity where it matters (infrastructure, configuration, data shapes) and acceptable differences where it doesn't (scale, geographic distribution, real user data).
The local development environment is where developers spend most of their time. A well-designed local setup enables fast iteration, realistic testing, and minimal context-switching to remote environments.
Goals for Local Environment:
```yaml
# docker-compose.yml - Local development environment
version: '3.8'

# Profiles allow selective startup: docker compose --profile web up
services:
  # Shared infrastructure - always starts
  postgres:
    image: postgres:15-alpine
    ports:
      - "5432:5432"
    environment:
      POSTGRES_PASSWORD: postgres
      POSTGRES_USER: postgres
    volumes:
      - postgres_data:/var/lib/postgresql/data
      - ./scripts/init-dbs.sql:/docker-entrypoint-initdb.d/init.sql
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 5s
      retries: 5

  kafka:
    image: confluentinc/cp-kafka:7.4.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,HOST:PLAINTEXT
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,CONTROLLER://0.0.0.0:9093,HOST://0.0.0.0:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,HOST://localhost:9092
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      CLUSTER_ID: 'local-dev-cluster-001'
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: 'true'
    healthcheck:
      test: ["CMD-SHELL", "kafka-broker-api-versions --bootstrap-server localhost:29092"]
      interval: 10s
      timeout: 10s
      retries: 10

  # Optional: Kafka UI for debugging
  kafka-ui:
    image: provectuslabs/kafka-ui:latest
    profiles: ["debug"]
    ports:
      - "8080:8080"
    environment:
      KAFKA_CLUSTERS_0_NAME: local
      KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka:29092

  # Services - selectively started based on needs
  order-service:
    build:
      context: ./services/order
      target: development
    profiles: ["order", "web", "all"]
    ports:
      - "3001:3000"
    environment:
      NODE_ENV: development
      DATABASE_URL: postgresql://postgres:postgres@postgres:5432/orders
      KAFKA_BROKERS: kafka:29092
      REDIS_URL: redis://redis:6379
      USER_SERVICE_URL: http://user-service:3000
      PRODUCT_SERVICE_URL: http://product-service:3000
    volumes:
      - ./services/order/src:/app/src
    depends_on:
      postgres:
        condition: service_healthy
      kafka:
        condition: service_healthy
    command: npm run dev

  user-service:
    build:
      context: ./services/user
      target: development
    profiles: ["user", "web", "all"]
    ports:
      - "3002:3000"
    environment:
      NODE_ENV: development
      DATABASE_URL: postgresql://postgres:postgres@postgres:5432/users
      REDIS_URL: redis://redis:6379
    volumes:
      - ./services/user/src:/app/src
    depends_on:
      postgres:
        condition: service_healthy
    command: npm run dev

  product-service:
    build:
      context: ./services/product
      target: development
    profiles: ["product", "web", "all"]
    ports:
      - "3003:3000"
    environment:
      NODE_ENV: development
      DATABASE_URL: postgresql://postgres:postgres@postgres:5432/products
      ELASTICSEARCH_URL: http://elasticsearch:9200
    volumes:
      - ./services/product/src:/app/src
    depends_on:
      postgres:
        condition: service_healthy
    command: npm run dev

  # Optional: Only for product search development
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.9.0
    profiles: ["product", "search", "all"]
    ports:
      - "9200:9200"
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    healthcheck:
      test: ["CMD-SHELL", "curl -sf http://localhost:9200/_cluster/health"]
      interval: 10s
      timeout: 10s
      retries: 10

volumes:
  postgres_data:
```

```sql
-- Database initialization script
-- scripts/init-dbs.sql
CREATE DATABASE orders;
CREATE DATABASE users;
CREATE DATABASE products;
```

Notice the volume mounts for source code.
This enables hot reloading—code changes reflect immediately without container rebuilds. Combined with watch mode (npm run dev), developers can iterate on code changes in seconds, not minutes. This speed is essential for maintaining flow state during development.
CI/CD environments run automated tests on every commit and deploy validated changes. They must be reproducible, fast to provision, and completely isolated between runs.
CI Environment Requirements:
```yaml
# .github/workflows/ci.yml - CI pipeline with proper environment management
name: CI

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  # Use consistent versions across all jobs
  NODE_VERSION: '20'
  DOCKER_BUILDKIT: 1
  COMPOSE_DOCKER_CLI_BUILD: 1

jobs:
  # Build and push images first (once)
  build:
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Login to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ghcr.io/${{ github.repository }}
          tags: |
            type=sha,prefix=
            type=ref,event=pr

      # Build all service images in parallel
      - name: Build and push images
        run: |
          for service in order user product; do
            docker buildx build \
              --push \
              --cache-from type=gha,scope=$service \
              --cache-to type=gha,mode=max,scope=$service \
              -t ghcr.io/${{ github.repository }}/$service:${{ github.sha }} \
              -f services/$service/Dockerfile \
              services/$service &
          done
          wait

  # Unit and integration tests (parallel per service)
  test:
    needs: build
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        service: [order, user, product]
    services:
      # GitHub Actions native service containers
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: test
          POSTGRES_USER: test
          POSTGRES_DB: test
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7
        ports:
          - 6379:6379
        options: >-
          --health-cmd "redis-cli ping"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}
          cache: 'npm'
          cache-dependency-path: services/${{ matrix.service }}/package-lock.json

      - name: Install dependencies
        working-directory: services/${{ matrix.service }}
        run: npm ci

      - name: Run migrations
        working-directory: services/${{ matrix.service }}
        run: npm run db:migrate
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/test

      - name: Run unit tests
        working-directory: services/${{ matrix.service }}
        run: npm run test:unit -- --coverage

      - name: Run integration tests
        working-directory: services/${{ matrix.service }}
        run: npm run test:integration
        env:
          DATABASE_URL: postgresql://test:test@localhost:5432/test
          REDIS_URL: redis://localhost:6379

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: services/${{ matrix.service }}/coverage/lcov.info
          flags: ${{ matrix.service }}

  # Contract tests after unit/integration
  contract-tests:
    needs: [build, test]
    runs-on: ubuntu-latest
    strategy:
      matrix:
        service: [order, user, product]
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: ${{ env.NODE_VERSION }}

      - name: Install dependencies
        working-directory: services/${{ matrix.service }}
        run: npm ci

      - name: Run consumer contract tests
        working-directory: services/${{ matrix.service }}
        run: npm run test:pact:consumer

      - name: Publish pacts
        if: github.event_name == 'push'
        working-directory: services/${{ matrix.service }}
        run: npm run pact:publish
        env:
          PACT_BROKER_BASE_URL: ${{ secrets.PACT_BROKER_URL }}
          PACT_BROKER_TOKEN: ${{ secrets.PACT_BROKER_TOKEN }}
          GIT_COMMIT: ${{ github.sha }}
          GIT_BRANCH: ${{ github.ref_name }}

  # E2E tests with full environment
  e2e:
    needs: [build, test, contract-tests]
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4

      - name: Create Docker network
        run: docker network create e2e-network

      - name: Start infrastructure
        run: |
          docker compose -f docker-compose.ci.yml up -d postgres redis kafka
          ./scripts/wait-healthy.sh

      - name: Start services
        run: |
          docker compose -f docker-compose.ci.yml up -d order-service user-service product-service gateway
          ./scripts/wait-for-services.sh
        env:
          IMAGE_TAG: ${{ github.sha }}

      - name: Run E2E tests
        run: npx playwright test
        env:
          E2E_BASE_URL: http://localhost

      - name: Collect logs on failure
        if: failure()
        run: docker compose -f docker-compose.ci.yml logs > docker-logs.txt

      - name: Upload artifacts
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: e2e-results
          path: |
            playwright-report/
            test-results/
            docker-logs.txt
```

GitHub Actions, GitLab CI, and other platforms offer native service containers—infrastructure that runs alongside your job. These are faster to start than Docker Compose and provide automatic health checking. Use service containers for simple databases and caches; use Docker Compose for complex multi-service setups.
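The pipeline above calls `./scripts/wait-for-services.sh` before running the E2E suite, but the script itself isn't shown. A minimal readiness poller, sketched here in TypeScript rather than shell, gives the idea; the service names, ports, and the assumption that each service exposes a `/health` endpoint are illustrative, not taken from the pipeline.

```typescript
// wait-for-services.ts - hypothetical readiness poller (a sketch, not the
// real wait-for-services.sh). Assumes each service answers GET /health
// with a 200 once it is ready to serve traffic.
const services = [
  { name: "order-service", url: "http://localhost:3001/health" },
  { name: "user-service", url: "http://localhost:3002/health" },
  { name: "product-service", url: "http://localhost:3003/health" },
];

async function waitForService(name: string, url: string, timeoutMs = 120_000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    try {
      const res = await fetch(url);
      if (res.ok) {
        console.log(`${name} is ready`);
        return;
      }
    } catch {
      // Service not accepting connections yet - keep polling
    }
    await new Promise((resolve) => setTimeout(resolve, 2_000));
  }
  throw new Error(`${name} did not become healthy within ${timeoutMs}ms`);
}

// Fail the CI job fast if any service never becomes healthy
await Promise.all(services.map((s) => waitForService(s.name, s.url)));
```

The important property is a hard timeout: a service that never comes up should fail the job with a clear message rather than letting the E2E step time out with a cryptic connection error.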
Ephemeral preview environments—also called review apps or PR environments—provide isolated, full-stack environments for each pull request. They enable realistic testing of changes before merge without polluting shared environments.
Benefits of Preview Environments:
```yaml
# Preview environment deployment with Kubernetes
# .github/workflows/preview.yml
name: Preview Environment

on:
  pull_request:
    types: [opened, synchronize, reopened, closed]

env:
  PREVIEW_NAMESPACE: preview-pr-${{ github.event.pull_request.number }}

jobs:
  deploy-preview:
    if: github.event.action != 'closed'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBE_CONFIG }}

      - name: Create namespace
        run: |
          kubectl create namespace ${{ env.PREVIEW_NAMESPACE }} --dry-run=client -o yaml | kubectl apply -f -
          # Label for automatic cleanup
          kubectl label namespace ${{ env.PREVIEW_NAMESPACE }} \
            preview=true \
            pr-number=${{ github.event.pull_request.number }} \
            created-at=$(date +%s) \
            --overwrite

      - name: Build and push images
        run: |
          for service in order user product gateway; do
            docker build -t ${{ secrets.REGISTRY }}/$service:pr-${{ github.event.pull_request.number }} \
              services/$service
            docker push ${{ secrets.REGISTRY }}/$service:pr-${{ github.event.pull_request.number }}
          done

      - name: Deploy infrastructure
        run: |
          helm upgrade --install infra ./charts/preview-infra \
            --namespace ${{ env.PREVIEW_NAMESPACE }} \
            --set postgresql.postgresPassword=${{ secrets.PREVIEW_DB_PASSWORD }} \
            --wait

      - name: Run migrations
        run: |
          for service in order user product; do
            kubectl run migrate-$service \
              --namespace ${{ env.PREVIEW_NAMESPACE }} \
              --image ${{ secrets.REGISTRY }}/$service:pr-${{ github.event.pull_request.number }} \
              --restart=Never \
              --command -- npm run db:migrate
            kubectl wait --for=condition=complete job/migrate-$service \
              --namespace ${{ env.PREVIEW_NAMESPACE }} \
              --timeout=120s
          done

      - name: Deploy services
        run: |
          helm upgrade --install services ./charts/microservices \
            --namespace ${{ env.PREVIEW_NAMESPACE }} \
            --set image.tag=pr-${{ github.event.pull_request.number }} \
            --set ingress.host=pr-${{ github.event.pull_request.number }}.${{ secrets.PREVIEW_DOMAIN }} \
            --wait

      - name: Run smoke tests
        run: |
          export E2E_BASE_URL=https://pr-${{ github.event.pull_request.number }}.${{ secrets.PREVIEW_DOMAIN }}
          npx playwright test --grep @smoke

      - name: Comment on PR
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## 🚀 Preview Environment Ready

              **URL:** https://pr-${context.issue.number}.${{ secrets.PREVIEW_DOMAIN }}

              This environment will be automatically destroyed when the PR is closed.

              **Services deployed:**
              - Order Service
              - User Service
              - Product Service
              - API Gateway

              **Smoke tests:** ✅ Passed`
            })

  # Cleanup when PR is closed
  cleanup-preview:
    if: github.event.action == 'closed'
    runs-on: ubuntu-latest
    steps:
      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBE_CONFIG }}

      - name: Delete namespace
        run: kubectl delete namespace ${{ env.PREVIEW_NAMESPACE }} --ignore-not-found

      - name: Comment on PR
        uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '🧹 Preview environment has been cleaned up.'
            })
```

```yaml
# Automatic cleanup CronJob for orphaned preview environments
apiVersion: batch/v1
kind: CronJob
metadata:
  name: preview-cleanup
  namespace: preview-system
spec:
  schedule: "0 * * * *"  # Every hour
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: preview-cleanup
          containers:
            - name: cleanup
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - |
                  # Delete preview namespaces older than 48 hours
                  CUTOFF=$(($(date +%s) - 172800))
                  for ns in $(kubectl get ns -l preview=true -o jsonpath='{.items[*].metadata.name}'); do
                    CREATED=$(kubectl get ns $ns -o jsonpath='{.metadata.labels.created-at}')
                    if [ "$CREATED" -lt "$CUTOFF" ]; then
                      echo "Deleting stale preview namespace: $ns"
                      kubectl delete ns $ns
                    fi
                  done
          restartPolicy: Never
```

Preview environments can become expensive if not managed carefully. Each PR running its own databases and services adds up. Implement automatic cleanup (on PR close and after time limits), use smaller instance sizes than production, and consider shared databases with schema-per-PR isolation for database-heavy workloads.
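The schema-per-PR idea can be as simple as provisioning a dedicated Postgres schema on a shared instance for each pull request. The sketch below uses node-postgres; the env var `SHARED_DATABASE_URL`, the `pr_<number>` naming convention, and the helper names are assumptions for illustration, not part of the workflow above.

```typescript
// provision-pr-schema.ts - hypothetical helper for schema-per-PR isolation
// on a shared Postgres instance (a sketch under assumed naming conventions).
import { Client } from "pg";

export async function provisionPrSchema(prNumber: number): Promise<string> {
  const schema = `pr_${prNumber}`;
  const client = new Client({ connectionString: process.env.SHARED_DATABASE_URL });
  await client.connect();
  try {
    // One schema per PR keeps data isolated without a database per preview env
    await client.query(`CREATE SCHEMA IF NOT EXISTS ${schema}`);
  } finally {
    await client.end();
  }
  // Services deployed for this PR then point their search_path at the schema
  return schema;
}

// Cleanup mirrors the namespace deletion: drop the schema when the PR closes
export async function dropPrSchema(prNumber: number): Promise<void> {
  const client = new Client({ connectionString: process.env.SHARED_DATABASE_URL });
  await client.connect();
  try {
    await client.query(`DROP SCHEMA IF EXISTS pr_${prNumber} CASCADE`);
  } finally {
    await client.end();
  }
}
```

The trade-off: a shared instance is cheaper and faster to provision than one database per PR, but a runaway migration or load test in one PR can affect its neighbours.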
The staging environment is the final validation step before production. It should mirror production as closely as possible while remaining safe for testing.
Staging Environment Principles:
| Aspect | Should Match Production? | Notes |
|---|---|---|
| Infrastructure (K8s, databases) | Yes | Same types, smaller scale acceptable |
| Configuration structure | Yes | Same config keys, different values |
| Service versions | Yes | Staging runs what ships to production next |
| Network topology | Yes | Same VPCs, load balancers, ingress |
| Monitoring/alerting | Yes | Catch observability issues early |
| Data volume | Reduced OK | Representative patterns, smaller scale |
| Real user data | No | Use synthetic or anonymized data |
| Third-party integrations | Sandbox mode | Use test/sandbox APIs |
```hcl
# Terraform: Staging environment that mirrors production structure
# infrastructure/environments/staging/main.tf

module "vpc" {
  source = "../../modules/vpc"

  environment = "staging"
  cidr_block  = "10.1.0.0/16" # Different from prod: 10.0.0.0/16

  # Same AZ structure as production
  azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

module "eks" {
  source = "../../modules/eks"

  cluster_name = "microservices-staging"
  vpc_id       = module.vpc.vpc_id
  subnet_ids   = module.vpc.private_subnet_ids

  # Smaller node groups than production
  node_groups = {
    general = {
      instance_types = ["t3.large"] # Prod uses t3.xlarge
      min_size       = 2            # Prod uses 3
      max_size       = 6            # Prod uses 15
      desired_size   = 3            # Prod uses 6
    }
  }
}

module "databases" {
  source = "../../modules/rds"

  for_each = {
    orders   = { db_name = "orders" }
    users    = { db_name = "users" }
    products = { db_name = "products" }
  }

  identifier = "staging-${each.key}"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids

  # Same engine versions as production
  engine         = "postgres"
  engine_version = "15.4"

  # Smaller instance class
  instance_class = "db.t3.medium" # Prod uses db.r5.large

  # Single AZ (not multi-AZ like prod)
  multi_az = false

  # Same encryption settings
  storage_encrypted = true
  kms_key_id        = module.kms.key_id
}

module "redis" {
  source = "../../modules/elasticache"

  cluster_id = "staging-cache"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids

  # Same engine version
  engine_version = "7.0"

  # Smaller node type
  node_type       = "cache.t3.small" # Prod uses cache.r5.large
  num_cache_nodes = 1                # Prod uses 2
}

module "kafka" {
  source = "../../modules/msk"

  cluster_name = "staging-events"
  vpc_id       = module.vpc.vpc_id
  subnet_ids   = module.vpc.private_subnet_ids

  # Same Kafka version
  kafka_version = "3.5.1"

  # Smaller brokers
  broker_node_type = "kafka.t3.small" # Prod uses kafka.m5.large
  number_of_nodes  = 2                # Prod uses 3
}

# Outputs for service configuration
output "database_endpoints" {
  value = { for k, v in module.databases : k => v.endpoint }
}

output "redis_endpoint" {
  value = module.redis.endpoint
}

output "kafka_bootstrap_servers" {
  value = module.kafka.bootstrap_servers
}
```

Never copy production data to staging—it creates compliance and security risks. Instead, generate synthetic data that matches production patterns: same data shapes, realistic volumes, representative edge cases. Tools like Faker can generate realistic-looking data; custom scripts can match your domain's specific patterns.
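As a concrete example of that approach, a small seeding script along these lines can populate staging with realistic but entirely synthetic users. It uses @faker-js/faker; the record shape, row count, and where the rows end up are assumptions for illustration, not the real schema.

```typescript
// seed-staging-users.ts - synthetic data generation for staging (a sketch;
// the user shape and volume are assumptions, not the actual schema).
import { faker } from "@faker-js/faker";

interface SyntheticUser {
  email: string;
  fullName: string;
  country: string;
  createdAt: Date;
}

function generateUsers(count: number): SyntheticUser[] {
  return Array.from({ length: count }, () => ({
    // Realistic-looking values with no relation to any real customer
    email: faker.internet.email().toLowerCase(),
    fullName: faker.person.fullName(),
    country: faker.location.countryCode(),
    // Spread creation dates over the past year to mimic real growth patterns
    createdAt: faker.date.recent({ days: 365 }),
  }));
}

// A reduced but representative volume compared to production
const users = generateUsers(50_000);
console.log(`Generated ${users.length} synthetic users, e.g.`, users[0]);
// Insert into the staging database with your usual data-access layer here.
```

The goal is to match production patterns (shapes, distributions, edge cases such as long names or unusual locales), not to replay production records.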
No staging environment perfectly replicates production. Real users, real data volumes, real traffic patterns, and real third-party integrations only exist in production. Mature organizations embrace production testing—carefully controlled validation in the live environment.
Production Testing Techniques:
```typescript
// Synthetic monitoring - continuous production validation
// src/monitoring/synthetics/checkout-flow.ts

import type { Page } from "playwright";
import { SyntheticMonitor } from "./framework";

export const checkoutFlowMonitor = new SyntheticMonitor({
  name: "checkout-flow",
  schedule: "*/5 * * * *", // Every 5 minutes
  locations: ["us-east-1", "us-west-2", "eu-west-1"],

  async run({ page, metrics }) {
    const startTime = Date.now();

    try {
      // Use dedicated test account
      await page.goto(process.env.PRODUCTION_URL!);
      await this.login(page, process.env.SYNTHETIC_USER!, process.env.SYNTHETIC_PASSWORD!);

      // Browse to product
      await page.goto("/products/synthetic-test-product");
      metrics.recordTiming("product_page_load", Date.now() - startTime);

      // Add to cart
      await page.click('[data-action="add-to-cart"]');
      await page.waitForSelector('[data-testid="cart-updated"]');
      metrics.recordTiming("add_to_cart", Date.now() - startTime);

      // Proceed to checkout
      await page.click('[data-action="proceed-to-checkout"]');
      await page.waitForSelector('[data-testid="checkout-form"]');
      metrics.recordTiming("checkout_page_load", Date.now() - startTime);

      // Verify payment options load
      await page.waitForSelector('[data-testid="payment-methods"]');

      // Don't actually complete purchase - just verify flow works
      await page.click('[data-action="cancel-checkout"]');

      metrics.recordSuccess();
      metrics.recordTiming("total_flow", Date.now() - startTime);
    } catch (error) {
      metrics.recordFailure(error);

      // Capture diagnostics on failure
      await page.screenshot({ path: `/tmp/synthetic-failure-${Date.now()}.png` });

      // Alert on-call if critical path is broken
      if (this.isWithinBusinessHours()) {
        await this.sendAlert({
          severity: "high",
          title: "Checkout flow synthetic failing",
          details: (error as Error).message,
        });
      }
    }
  },

  async login(page: Page, email: string, password: string) {
    await page.goto("/login");
    await page.fill('[data-testid="email"]', email);
    await page.fill('[data-testid="password"]', password);
    await page.click('[data-action="login"]');
    await page.waitForSelector('[data-testid="user-menu"]');
  },

  isWithinBusinessHours(): boolean {
    const hour = new Date().getHours();
    return hour >= 9 && hour < 18;
  },
});
```

```yaml
# Canary deployment with automated rollback
# kubernetes/rollout-strategy.yaml
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-service
spec:
  replicas: 10
  strategy:
    canary:
      # Canary steps
      steps:
        - setWeight: 5            # 5% traffic to canary
        - pause: { duration: 5m }
        - setWeight: 20           # 20% traffic
        - pause: { duration: 10m }
        - setWeight: 50           # 50% traffic
        - pause: { duration: 15m }
        - setWeight: 100          # Full rollout

      # Analysis for automatic rollback
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 1 # Start analysis after first step
        args:
          - name: service-name
            value: order-service

      # Automatic rollback triggers
      canaryMetadata:
        labels:
          role: canary
      stableMetadata:
        labels:
          role: stable
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.99 # 99% success rate required
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}", status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
    - name: latency-p99
      interval: 1m
      successCondition: result[0] <= 500 # P99 under 500ms
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{service="{{args.service-name}}"}[5m])) by (le)
            ) * 1000
```

Production testing must be done carefully. Use dedicated test accounts that don't affect real users. Ensure synthetic transactions are identifiable and excluded from business metrics. Design feature flags for quick rollback. Always have a clear blast radius—know exactly what could go wrong and how to fix it.
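One practical way to keep synthetic transactions out of business metrics, as suggested above, is to tag them explicitly and branch on that tag when recording business events. The sketch below assumes Express and a custom `x-synthetic-test` header; neither is part of the monitor framework shown earlier.

```typescript
// synthetic-traffic.ts - tag synthetic requests and exclude them from
// business metrics (a sketch; the header name and metrics call are assumptions).
import type { Request, Response, NextFunction } from "express";

const SYNTHETIC_HEADER = "x-synthetic-test";

type TaggedRequest = Request & { isSynthetic?: boolean };

// Mark the request so downstream code can tell synthetic traffic apart
export function detectSyntheticTraffic(req: Request, _res: Response, next: NextFunction) {
  (req as TaggedRequest).isSynthetic = req.header(SYNTHETIC_HEADER) === "true";
  next();
}

// Record business metrics only for real user traffic. Synthetic checks still
// flow through technical metrics (latency, error rate), so canary analysis
// and alerting keep working.
export function recordOrderPlaced(req: Request, orderTotalCents: number) {
  if ((req as TaggedRequest).isSynthetic) {
    return; // Keep revenue and conversion dashboards clean
  }
  // e.g. metricsClient.increment("orders_placed") would go here
  console.log(`order placed: ${orderTotalCents} cents`);
}
```

The synthetic monitor would then send the header on every request, and the gateway can strip it from external traffic so real users cannot spoof it.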
Configuration differences between environments are a major source of 'works in staging, fails in production' bugs. Proper configuration management ensures environments differ only in intended ways.
Configuration Principles:
```typescript
// Layered configuration pattern
// src/config/index.ts

import { z } from "zod";
import { baseConfig } from "./base";
import { developmentOverrides } from "./development";
import { stagingOverrides } from "./staging";
import { productionOverrides } from "./production";
// deepMerge and setPath are small object helpers assumed to live alongside this module
import { deepMerge, setPath } from "./utils";

// Schema validation ensures configuration is complete and correct
const ConfigSchema = z.object({
  // Environment identification
  environment: z.enum(["development", "staging", "production"]),

  // Service configuration
  service: z.object({
    name: z.string(),
    version: z.string(),
    port: z.number().int().positive(),
  }),

  // Database configuration
  database: z.object({
    host: z.string(),
    port: z.number().int().positive(),
    name: z.string(),
    user: z.string(),
    password: z.string(), // From secrets
    poolSize: z.number().int().positive(),
    ssl: z.boolean(),
  }),

  // Cache configuration
  cache: z.object({
    host: z.string(),
    port: z.number().int().positive(),
    ttlSeconds: z.number().int().positive(),
  }),

  // Kafka configuration
  kafka: z.object({
    brokers: z.array(z.string()),
    clientId: z.string(),
    ssl: z.boolean(),
    sasl: z.object({
      mechanism: z.enum(["plain", "scram-sha-256", "scram-sha-512"]),
      username: z.string(),
      password: z.string(),
    }).optional(),
  }),

  // External services
  integrations: z.object({
    paymentGateway: z.object({
      baseUrl: z.string().url(),
      apiKey: z.string(),
      testMode: z.boolean(),
    }),
    emailService: z.object({
      baseUrl: z.string().url(),
      apiKey: z.string(),
      sandboxMode: z.boolean(),
    }),
  }),

  // Feature flags
  features: z.object({
    newCheckoutFlow: z.boolean(),
    enhancedSearch: z.boolean(),
    betaFeatures: z.boolean(),
  }),

  // Observability
  observability: z.object({
    logLevel: z.enum(["debug", "info", "warn", "error"]),
    tracingEnabled: z.boolean(),
    metricsEnabled: z.boolean(),
  }),
});

type Config = z.infer<typeof ConfigSchema>;

function loadConfig(): Config {
  const environment = process.env.NODE_ENV || "development";

  // Start with base config
  let config: unknown = baseConfig;

  // Apply environment-specific overrides
  switch (environment) {
    case "development":
      config = deepMerge(config, developmentOverrides);
      break;
    case "staging":
      config = deepMerge(config, stagingOverrides);
      break;
    case "production":
      config = deepMerge(config, productionOverrides);
      break;
    default:
      throw new Error(`Unknown environment: ${environment}`);
  }

  // Inject secrets from environment variables
  config = injectSecrets(config as object);

  // Validate and return
  const result = ConfigSchema.safeParse(config);
  if (!result.success) {
    console.error("Configuration validation failed:");
    console.error(result.error.format());
    throw new Error("Invalid configuration");
  }

  return result.data;
}

// Secret injection from environment variables
function injectSecrets(config: object): object {
  const secrets = {
    "database.password": process.env.DATABASE_PASSWORD,
    "kafka.sasl.password": process.env.KAFKA_PASSWORD,
    "integrations.paymentGateway.apiKey": process.env.PAYMENT_GATEWAY_API_KEY,
    "integrations.emailService.apiKey": process.env.EMAIL_SERVICE_API_KEY,
  };

  let result = { ...config };
  for (const [path, value] of Object.entries(secrets)) {
    if (value) {
      result = setPath(result, path, value);
    }
  }
  return result;
}

export const config = loadConfig();

// Validate on startup
console.log(`Configuration loaded for environment: ${config.environment}`);
console.log(`Features enabled: ${Object.entries(config.features)
  .filter(([_, v]) => v)
  .map(([k, _]) => k)
  .join(", ")}`);
```

Store secrets in environment variables, not configuration files.
Use a secrets manager (HashiCorp Vault, AWS Secrets Manager) to inject secrets at runtime. Configuration files in version control should contain structure and non-sensitive values; secrets are always injected from secure sources.
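If secrets live in a secrets manager rather than plain environment variables, the `injectSecrets` step can fetch them at startup instead. The sketch below uses the AWS SDK for Secrets Manager; the secret name `order-service/production` and its JSON layout are assumptions for illustration.

```typescript
// fetch-secrets.ts - load runtime secrets from AWS Secrets Manager
// (a sketch; the secret name and its JSON structure are assumptions).
import { SecretsManagerClient, GetSecretValueCommand } from "@aws-sdk/client-secrets-manager";

interface RuntimeSecrets {
  databasePassword: string;
  paymentGatewayApiKey: string;
}

export async function fetchRuntimeSecrets(
  secretId = "order-service/production",
): Promise<RuntimeSecrets> {
  const client = new SecretsManagerClient({});
  const response = await client.send(new GetSecretValueCommand({ SecretId: secretId }));

  if (!response.SecretString) {
    throw new Error(`Secret ${secretId} has no string value`);
  }

  // The secret is assumed to be stored as a JSON document, one key per credential
  const parsed = JSON.parse(response.SecretString);
  return {
    databasePassword: parsed.DATABASE_PASSWORD,
    paymentGatewayApiKey: parsed.PAYMENT_GATEWAY_API_KEY,
  };
}
```

Because the same code path runs in staging and production (only the secret name differs), this keeps the configuration structure identical across environments, which is exactly the parity goal described above.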
Test environment management is foundational infrastructure for microservices testing. Well-designed environments enable fast feedback, reliable tests, and confident deployments. Poor environment practices lead to constant friction, flaky tests, and deployment fear.
Module Complete:
Congratulations! You've completed the Testing Microservices module. You now understand the complete testing strategy for distributed systems—from unit tests that run in milliseconds to production monitoring that runs continuously. The testing pyramid is replicated across every service, and contract testing enables the independent deployment that makes microservices valuable.
The key takeaway: testing microservices is not about more tests or different tests—it's about the right tests at the right level. Unit tests verify logic, integration tests verify infrastructure, contract tests verify compatibility, E2E tests verify journeys, and the right environments make all of these reliable.
With unit testing, integration testing, contract testing, end-to-end testing, and test environment management in your toolkit, you are equipped to build and maintain reliable microservices at scale.