A microservices architecture without proper infrastructure is a distributed mess. The operational complexity of running dozens or hundreds of services—each with its own deployment pipeline, logging, monitoring, and scaling requirements—quickly overwhelms teams lacking adequate tooling. The difference between successful microservices adoption and operational chaos is the platform.
Organizations that thrive with microservices invest heavily in platform capabilities before and during migration. They build self-service infrastructure that makes deploying a new service as easy as deploying to the monolith. They instrument observability that makes distributed debugging tractable. They create golden paths that encode best practices and reduce cognitive load on service teams.
This page covers the technical infrastructure required for microservices success: container orchestration platforms, CI/CD pipelines for independent deployment, observability stacks for distributed systems, service mesh for inter-service communication, and the platform engineering mindset that ties it all together.
Before extracting services, assess whether your current infrastructure can support microservices operations. Many organizations discover mid-migration that their infrastructure assumptions don't hold for distributed systems.
Key Questions for Infrastructure Assessment:
| Capability | Monolith Requirement | Microservices Requirement | Gap Assessment |
|---|---|---|---|
| Deployment Frequency | Weekly/monthly releases | Multiple deployments per day per service | Can CI/CD support independent, frequent deployments? |
| Environment Provisioning | Shared long-lived environments | On-demand ephemeral environments per service | Can teams spin up isolated test environments quickly? |
| Log Aggregation | Single application log file | Correlated logs across dozens of services | Is there centralized logging with request tracing? |
| Metric Collection | Server and application metrics | Service-level metrics, business metrics, distributed traces | Can we track request flow across service boundaries? |
| Service Discovery | Not needed (in-process calls) | Dynamic discovery of service instances | Is there a mechanism for services to find each other? |
| Load Balancing | External load balancer for monolith | Internal load balancing between services | Can traffic be distributed across service instances? |
| Secrets Management | Often manual or in config files | Programmatic secret injection, rotation support | Is there a secure secrets management system? |
| Container Support | Often VMs or bare metal | Container orchestration for dynamic workloads | Is Kubernetes or equivalent operational? |
Consider platform investment as Phase 0 of migration. Extracting services into an infrastructure that can't support them creates operational nightmares. Budget 3-6 months of platform work before significant service extraction begins—or plan for parallel development with early services being the guinea pigs.
Containers are the deployment unit of modern microservices. They provide consistency between development and production, efficient resource utilization, and the foundation for orchestration. Kubernetes has emerged as the dominant orchestration platform.
Why Kubernetes for Microservices: declarative deployments, built-in service discovery and load balancing, self-healing, rolling updates, and horizontal autoscaling. The example below shows how these capabilities map onto a single service:
```yaml
# Example: Microservice Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
  namespace: commerce
  labels:
    app: order-service
    team: order-team
    tier: backend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: order-service
        version: v2.3.1
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      serviceAccountName: order-service
      containers:
        - name: order-service
          image: registry.company.com/order-service:v2.3.1
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 8081
              name: grpc
          # Resource management
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          # Health probes
          livenessProbe:
            httpGet:
              path: /health/live
              port: http
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health/ready
              port: http
            initialDelaySeconds: 5
            periodSeconds: 5
          # Environment from ConfigMap and Secrets
          envFrom:
            - configMapRef:
                name: order-service-config
            - secretRef:
                name: order-service-secrets
          # Distributed tracing configuration
          env:
            - name: OTEL_SERVICE_NAME
              value: order-service
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel-collector.observability:4317
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
---
apiVersion: v1
kind: Service
metadata:
  name: order-service
  namespace: commerce
spec:
  selector:
    app: order-service
  ports:
    - name: http
      port: 80
      targetPort: 8080
    - name: grpc
      port: 9090
      targetPort: 8081
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service
  namespace: commerce
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
```

For most organizations, managed Kubernetes (GKE, EKS, AKS) is significantly easier than self-managed clusters. The control plane—the most operationally demanding component—is handled by the cloud provider. Focus your platform team on developer experience, not Kubernetes operations.
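The liveness and readiness probes in the Deployment above assume the application exposes matching HTTP endpoints. A minimal sketch of those handlers, assuming Express; the dependency check is a hypothetical placeholder:

```typescript
import express from 'express';

const app = express();

// Liveness: the process is up and able to serve; keep this check cheap,
// since a failure causes Kubernetes to restart the pod.
app.get('/health/live', (_req, res) => {
  res.status(200).json({ status: 'ok' });
});

// Readiness: the service can handle traffic. A failing readiness check
// removes the pod from Service endpoints without restarting it.
app.get('/health/ready', async (_req, res) => {
  try {
    await checkDatabaseConnection(); // hypothetical dependency check
    res.status(200).json({ status: 'ready' });
  } catch (err) {
    res.status(503).json({ status: 'not-ready', reason: (err as Error).message });
  }
});

// Hypothetical helper; replace with a real connectivity check (e.g. SELECT 1).
async function checkDatabaseConnection(): Promise<void> {
  // ...
}

app.listen(8080);
```

Keeping liveness checks dependency-free matters: if liveness also checked the database, a database outage would restart every pod in the fleet without fixing anything.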
The primary benefit of microservices is independent deployment. Teams should be able to ship changes to their services without coordinating with other teams. This requires CI/CD infrastructure designed for service independence.
CI/CD Maturity Levels and Migration Readiness:
| Level | Characteristics | Deployment Frequency | Migration Phase |
|---|---|---|---|
| Level 0: Manual | Manual builds, manual deployments, shared environments | Monthly or less | Not ready for migration |
| Level 1: Continuous Build | Automated builds per commit; manual deployment | Weekly | Can start simple extractions |
| Level 2: Continuous Delivery | Automated pipeline to staging; one-click production deploy | Daily | Ready for migration |
| Level 3: Continuous Deployment | Automated deploy to production on merge; feature flags | Many times per day | Optimal for microservices |
| Level 4: Progressive Delivery | Canary deployments, automated rollbacks, A/B testing | On-demand with safety | Advanced microservices maturity |
```yaml
# Example: GitHub Actions Pipeline for Microservice
name: Order Service Pipeline

on:
  push:
    branches: [main]
    paths:
      - 'services/order-service/**'
      - '.github/workflows/order-service.yml'
  pull_request:
    branches: [main]
    paths:
      - 'services/order-service/**'

env:
  SERVICE_NAME: order-service
  REGISTRY: registry.company.com
  CLUSTER: production-cluster

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Java
        uses: actions/setup-java@v4
        with:
          java-version: '21'
          distribution: 'temurin'
      - name: Run Unit Tests
        working-directory: services/order-service
        run: ./gradlew test
      - name: Run Integration Tests
        working-directory: services/order-service
        run: ./gradlew integrationTest
      - name: Contract Tests (Producer)
        working-directory: services/order-service
        run: ./gradlew pactVerify
      - name: Security Scan
        uses: snyk/actions/gradle@master
        with:
          command: test
          args: --project-name=order-service

  build:
    needs: test
    runs-on: ubuntu-latest
    outputs:
      image-tag: ${{ steps.meta.outputs.tags }}
    steps:
      - uses: actions/checkout@v4
      - name: Docker Meta
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.SERVICE_NAME }}
          tags: |
            type=sha,prefix=
            type=ref,event=branch
      - name: Build and Push
        uses: docker/build-push-action@v5
        with:
          context: services/order-service
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          cache-from: type=registry,ref=${{ env.REGISTRY }}/${{ env.SERVICE_NAME }}:buildcache
          cache-to: type=registry,ref=${{ env.REGISTRY }}/${{ env.SERVICE_NAME }}:buildcache,mode=max

  deploy-staging:
    needs: build
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Staging
        uses: azure/k8s-deploy@v4
        with:
          namespace: staging
          manifests: services/order-service/k8s/
          images: ${{ needs.build.outputs.image-tag }}
      - name: Smoke Tests
        run: |
          kubectl wait --for=condition=ready pod -l app=order-service -n staging
          ./scripts/smoke-test.sh https://order-service.staging.internal

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Canary Deploy (10%)
        uses: azure/k8s-deploy@v4
        with:
          namespace: production
          strategy: canary
          percentage: 10
          manifests: services/order-service/k8s/
          images: ${{ needs.build.outputs.image-tag }}
      - name: Monitor Canary (5 minutes)
        run: |
          ./scripts/monitor-canary.sh order-service 300
      - name: Promote to 100%
        if: success()
        uses: azure/k8s-deploy@v4
        with:
          namespace: production
          strategy: canary
          action: promote
          manifests: services/order-service/k8s/
```

Provide standardized pipeline templates that encode organizational best practices. Teams can customize, but the default path should handle 80% of cases. This is the CI/CD manifestation of the 'paved road' philosophy.
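The canary stage above calls `./scripts/monitor-canary.sh`, whose contents the pipeline leaves unspecified. As an illustration of what such a gate might check, here is a hedged TypeScript sketch that polls the Prometheus query API and fails the step if the error rate exceeds a threshold (the metric name, endpoint, and threshold are assumptions):

```typescript
// Hypothetical canary monitor: polls Prometheus and exits non-zero
// if the canary's error rate exceeds a threshold during the watch window.
const PROMETHEUS_URL = process.env.PROMETHEUS_URL ?? 'http://prometheus.monitoring:9090';

async function errorRate(service: string): Promise<number> {
  // PromQL: share of 5xx responses over the last 5 minutes (metric name assumed)
  const query = `sum(rate(http_requests_total{service="${service}",status=~"5.."}[5m]))
    / sum(rate(http_requests_total{service="${service}"}[5m]))`;
  const res = await fetch(`${PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(query)}`);
  const body = await res.json();
  const value = body.data?.result?.[0]?.value?.[1];
  return value ? parseFloat(value) : 0; // no traffic yet counts as healthy
}

async function monitorCanary(service: string, seconds: number, threshold = 0.01): Promise<void> {
  const deadline = Date.now() + seconds * 1000;
  while (Date.now() < deadline) {
    const rate = await errorRate(service);
    console.log(`${service} error rate: ${(rate * 100).toFixed(2)}%`);
    if (rate > threshold) {
      console.error('Canary unhealthy, aborting promotion');
      process.exit(1); // pipeline step fails, so promotion is skipped
    }
    await new Promise((r) => setTimeout(r, 15_000)); // poll every 15s
  }
}

monitorCanary(process.argv[2] ?? 'order-service', Number(process.argv[3] ?? 300));
```

A production gate would typically also compare the canary against the stable baseline rather than against an absolute threshold, but the fail-fast structure is the same.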
Observability is not optional for microservices—it's essential. When a request traverses 10 services before failing, you need visibility into every hop. The three pillars of observability—logs, metrics, and traces—must work together.
The Three Pillars:
| Pillar | What It Shows | Key Technologies | Microservices Considerations |
|---|---|---|---|
| Logs | Discrete events with context | ELK Stack, Loki, CloudWatch Logs | Structured logging (JSON); correlation IDs across services; log levels |
| Metrics | Aggregated measurements over time | Prometheus, Datadog, CloudWatch Metrics | Service-level metrics; business metrics; SLO-based alerting |
| Traces | Request flow across service boundaries | Jaeger, Zipkin, AWS X-Ray, Honeycomb | Automatic instrumentation; sampling strategies; span context propagation |
```typescript
// Observability Setup for Microservice
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { trace, metrics } from '@opentelemetry/api';
import pino from 'pino';

// Structured Logger with Correlation
const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  mixin() {
    // Automatically include trace context in every log line
    const span = trace.getActiveSpan();
    if (span) {
      const context = span.spanContext();
      return {
        traceId: context.traceId,
        spanId: context.spanId,
        service: process.env.OTEL_SERVICE_NAME,
      };
    }
    return { service: process.env.OTEL_SERVICE_NAME };
  },
});

// OpenTelemetry SDK Configuration
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: process.env.OTEL_SERVICE_NAME,
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION,
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.ENVIRONMENT,
    // Custom attributes for organizational grouping
    'team.name': process.env.TEAM_NAME,
    'service.tier': process.env.SERVICE_TIER,
  }),
  // Trace exporter (to Jaeger/Tempo/etc via OTLP)
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  // Metric reader (pull-based scrape endpoint for Prometheus)
  metricReader: new PrometheusExporter({ port: 9464 }),
  // Auto-instrument HTTP, gRPC, database clients, etc.
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        ignoreIncomingPaths: ['/health', '/ready', '/metrics'],
      },
    }),
  ],
});
sdk.start();

// Custom business metrics
const meter = metrics.getMeter('order-service');

// Counter: Orders created
const ordersCreated = meter.createCounter('orders.created', {
  description: 'Number of orders created',
  unit: '1',
});

// Histogram: Order processing time
const orderProcessingTime = meter.createHistogram('orders.processing.duration', {
  description: 'Time to process order',
  unit: 'ms',
});

// Gauge: Active orders in processing
const activeOrders = meter.createObservableGauge('orders.active', {
  description: 'Currently processing orders',
});

// Usage in application code
// (OrderData, Order, and processOrder are application-specific)
async function createOrder(orderData: OrderData): Promise<Order> {
  const startTime = Date.now();
  logger.info({ orderId: orderData.id, customerId: orderData.customerId }, 'Starting order creation');
  try {
    const order = await processOrder(orderData);
    ordersCreated.add(1, {
      'order.type': orderData.type,
      'payment.method': orderData.paymentMethod,
    });
    orderProcessingTime.record(Date.now() - startTime, {
      'order.type': orderData.type,
      'status': 'success',
    });
    logger.info({ orderId: order.id, duration: Date.now() - startTime }, 'Order created successfully');
    return order;
  } catch (error) {
    orderProcessingTime.record(Date.now() - startTime, {
      'order.type': orderData.type,
      'status': 'error',
    });
    logger.error({ orderId: orderData.id, error: error.message }, 'Order creation failed');
    throw error;
  }
}
```

SLO-Based Alerting:
Microservices generate massive amounts of telemetry, and alerting on every metric produces noise rather than signal. Instead, define Service Level Objectives (SLOs) and alert when error budgets are at risk. For example, with a 99.9% availability SLO, a sustained error rate at 14.4x the allowed 0.1% would exhaust a 30-day error budget in roughly two days (30 / 14.4 ≈ 2.1), which is why 14.4 appears as the fast-burn multiplier below:
```yaml
# Prometheus SLO-Based Alerting Rules (routed via Alertmanager)
groups:
  - name: order-service-slos
    rules:
      # SLO: 99.9% of requests succeed
      - alert: OrderServiceErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{service="order-service", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="order-service"}[5m]))
          ) > 0.001
        for: 5m
        labels:
          severity: warning
          service: order-service
          slo: availability
        annotations:
          summary: "Order Service error rate exceeding SLO"
          description: "Error rate is {{ $value | humanizePercentage }}, SLO allows 0.1%"
          runbook_url: https://runbooks.company.com/order-service/high-error-rate

      # SLO: 99% of requests complete in <500ms
      - alert: OrderServiceLatencyBudgetBurn
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="order-service"}[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
          service: order-service
          slo: latency
        annotations:
          summary: "Order Service p99 latency exceeding SLO"
          description: "p99 latency is {{ $value | humanizeDuration }}, SLO requires <500ms"

      # Multi-window error budget alert (more sophisticated)
      - alert: OrderServiceErrorBudgetCritical
        expr: |
          (
            # Fast burn: 14.4x burn rate would exhaust the monthly budget in ~2 days
            (1 - (sum(rate(http_requests_total{service="order-service", status=~"2.."}[1h]))
              / sum(rate(http_requests_total{service="order-service"}[1h])))) > (14.4 * 0.001)
          )
          and
          (
            # Confirm with 5m window to avoid noise
            (1 - (sum(rate(http_requests_total{service="order-service", status=~"2.."}[5m]))
              / sum(rate(http_requests_total{service="order-service"}[5m])))) > (14.4 * 0.001)
          )
        labels:
          severity: critical
          service: order-service
```

OpenTelemetry is the emerging standard for observability instrumentation. It provides vendor-neutral APIs for traces, metrics, and logs. Instrument with OpenTelemetry, then choose backends later. This prevents vendor lock-in and simplifies future tooling changes.
A service mesh provides infrastructure-level capabilities for service-to-service communication without requiring application code changes. It handles cross-cutting concerns like security, observability, and traffic management in a consistent, centralized way.
Comparing the Major Service Mesh Options:
| Feature | Istio | Linkerd | Consul Connect | AWS App Mesh |
|---|---|---|---|---|
| Complexity | High (feature-rich) | Low (focused) | Medium | Medium |
| Performance Overhead | Higher (~2ms) | Lower (~1ms) | Medium | Medium |
| Feature Set | Comprehensive | Essential | Comprehensive | Essential |
| Multi-cluster | Excellent | Good | Excellent | AWS-only |
| Learning Curve | Steep | Gentle | Moderate | Moderate |
| Community/Support | Large (CNCF) | Growing (CNCF) | HashiCorp | AWS |
```yaml
# Istio Traffic Management Examples

# Virtual Service: Define routing rules
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: order-service
  namespace: commerce
spec:
  hosts:
    - order-service
  http:
    # Canary: Route 10% to v2, 90% to v1
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: order-service
            subset: v2
    - route:
        - destination:
            host: order-service
            subset: v1
          weight: 90
        - destination:
            host: order-service
            subset: v2
          weight: 10
      # Retry configuration
      retries:
        attempts: 3
        perTryTimeout: 2s
        retryOn: 5xx,reset,connect-failure
      # Timeout
      timeout: 10s
---
# Destination Rule: Define subsets and connection policies
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: order-service
  namespace: commerce
spec:
  host: order-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        h2UpgradePolicy: UPGRADE
        http1MaxPendingRequests: 100
        http2MaxRequests: 1000
    # Circuit breaker
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
  subsets:
    - name: v1
      labels:
        version: v1
    - name: v2
      labels:
        version: v2
---
# Authorization Policy: Define access control
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: order-service-policy
  namespace: commerce
spec:
  selector:
    matchLabels:
      app: order-service
  rules:
    # Allow API Gateway
    - from:
        - source:
            principals: ["cluster.local/ns/ingress/sa/api-gateway"]
      to:
        - operation:
            methods: ["GET", "POST", "PUT"]
            paths: ["/api/orders/*"]
    # Allow Payment Service for callbacks
    - from:
        - source:
            principals: ["cluster.local/ns/commerce/sa/payment-service"]
      to:
        - operation:
            methods: ["POST"]
            paths: ["/internal/payment-callback"]
  # With an ALLOW policy in place, all other traffic is denied
  action: ALLOW
```

Service mesh adds operational complexity. For early-stage migrations with few services, a mesh may be overkill. Consider introducing it when you have 10+ services or specific requirements (mTLS everywhere, advanced traffic management). Some organizations skip the mesh entirely and implement specific capabilities (retries, circuit breakers) in application libraries.
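For teams taking that library route, here is a minimal sketch of client-side retries with a simple circuit breaker in TypeScript; the thresholds, timeout, and payment-service URL are illustrative, and production code would more likely use an established library such as opossum:

```typescript
// Minimal circuit breaker: after `failureThreshold` consecutive failures,
// reject calls outright for `resetTimeoutMs`, then allow a trial request.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,
    private resetTimeoutMs = 30_000,
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.failureThreshold) {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) {
        throw new Error('Circuit open: failing fast');
      }
      // Half-open: let one trial request through
    }
    try {
      const result = await fn();
      this.failures = 0; // success closes the circuit
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.failures >= this.failureThreshold) this.openedAt = Date.now();
      throw err;
    }
  }
}

const paymentBreaker = new CircuitBreaker();

// Retry with exponential backoff, wrapped in the breaker
async function callPaymentService(body: unknown): Promise<unknown> {
  return paymentBreaker.call(async () => {
    for (let attempt = 1; ; attempt++) {
      try {
        const res = await fetch('http://payment-service/api/payments', {
          method: 'POST',
          headers: { 'content-type': 'application/json' },
          body: JSON.stringify(body),
          signal: AbortSignal.timeout(2_000), // per-try timeout, as in the mesh config
        });
        if (res.status >= 500) throw new Error(`Upstream ${res.status}`);
        return res.json();
      } catch (err) {
        if (attempt >= 3) throw err; // give up after 3 attempts
        await new Promise((r) => setTimeout(r, 100 * 2 ** attempt));
      }
    }
  });
}
```

The trade-off versus a mesh: libraries must be reimplemented (or at least re-adopted) per language, while a sidecar applies these policies uniformly without touching application code.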
Each microservice needs configuration (feature flags, endpoints, settings) and secrets (database passwords, API keys, certificates). In a monolith, these often live in config files or environment variables set during deployment. Microservices require more sophisticated management.
Configuration vs Secrets:
| Aspect | Configuration | Secrets |
|---|---|---|
| Sensitivity | Not sensitive; can be logged | Sensitive; never log or expose |
| Change Frequency | May change frequently (feature flags) | Infrequent changes; rotation policies |
| Access Pattern | Read on startup and/or dynamically | Read on startup; cached carefully |
| Storage | ConfigMaps, config servers, feature flag services | HashiCorp Vault, AWS Secrets Manager, K8s Secrets (encrypted) |
| Versioning | Important for rollback | Important for audit and rotation |
| Sharing | May share across services | Minimize sharing; per-service credentials preferred |
```typescript
// Modern Configuration Management Pattern
import fs from 'fs';

interface ServiceConfig {
  // Environment-specific configuration
  environment: 'development' | 'staging' | 'production';

  // Feature flags (dynamically updatable)
  features: {
    newCheckoutFlow: boolean;
    enhancedFraudDetection: boolean;
    darkMode: boolean;
  };

  // Service dependencies (from service discovery or config)
  dependencies: {
    paymentServiceUrl: string;
    inventoryServiceUrl: string;
    notificationServiceUrl: string;
  };

  // Operational settings
  settings: {
    requestTimeoutMs: number;
    maxRetries: number;
    circuitBreakerThreshold: number;
  };

  // Secrets loaded from Vault (see loadFromVault below)
  database?: { host: string; username: string; password: string };
  apiKeys?: { paymentProvider: string; shippingProvider: string };
}

// Configuration Sources (in priority order)
class ConfigurationManager {
  private config!: ServiceConfig;
  private featureFlagClient!: { isEnabled(key: string): boolean };

  async initialize(): Promise<void> {
    // 1. Load defaults from code
    const defaults = this.loadDefaults();
    // 2. Load environment-specific from ConfigMap/config file
    const envConfig = await this.loadFromConfigMap();
    // 3. Load secrets from Vault
    const secrets = await this.loadFromVault();
    // 4. Load feature flags from LaunchDarkly/Unleash
    const features = await this.loadFeatureFlags();
    // 5. Override with environment variables (for local dev)
    const envOverrides = this.loadEnvOverrides();

    // Merge in priority order
    this.config = {
      ...defaults,
      ...envConfig,
      ...secrets,
      features: { ...defaults.features, ...features },
      ...envOverrides,
    };
  }

  // loadDefaults, loadFromConfigMap, and loadEnvOverrides omitted for brevity

  private async loadFromVault(): Promise<Partial<ServiceConfig>> {
    const vault = require('node-vault')({
      apiVersion: 'v1',
      endpoint: process.env.VAULT_ADDR,
    });

    // Authenticate with the pod's Kubernetes service account token
    await vault.kubernetesLogin({
      role: 'order-service',
      jwt: fs.readFileSync('/var/run/secrets/kubernetes.io/serviceaccount/token', 'utf8'),
    });

    // Read secrets (KV v2 nests payload under data.data)
    const secrets = await vault.read('secret/data/order-service');
    return {
      database: {
        host: secrets.data.data.DB_HOST,
        username: secrets.data.data.DB_USER,
        password: secrets.data.data.DB_PASSWORD,
      },
      apiKeys: {
        paymentProvider: secrets.data.data.PAYMENT_API_KEY,
        shippingProvider: secrets.data.data.SHIPPING_API_KEY,
      },
    };
  }

  private async loadFeatureFlags(): Promise<ServiceConfig['features']> {
    const { UnleashClient } = require('unleash-proxy-client');
    const unleash = new UnleashClient({
      url: process.env.FEATURE_FLAG_URL,
      clientKey: process.env.FEATURE_FLAG_KEY,
      appName: 'order-service',
    });
    await unleash.start();
    this.featureFlagClient = unleash;
    return {
      newCheckoutFlow: unleash.isEnabled('new-checkout-flow'),
      enhancedFraudDetection: unleash.isEnabled('enhanced-fraud-detection'),
      darkMode: unleash.isEnabled('dark-mode'),
    };
  }

  // Dynamic feature flag check (doesn't require restart)
  isFeatureEnabled(featureKey: string): boolean {
    return this.featureFlagClient.isEnabled(featureKey);
  }
}

// Kubernetes ConfigMap + Secret Pattern
/*
apiVersion: v1
kind: ConfigMap
metadata:
  name: order-service-config
data:
  PAYMENT_SERVICE_URL: "http://payment-service.commerce.svc.cluster.local"
  INVENTORY_SERVICE_URL: "http://inventory-service.commerce.svc.cluster.local"
  REQUEST_TIMEOUT_MS: "5000"
  MAX_RETRIES: "3"
  LOG_LEVEL: "info"
---
apiVersion: v1
kind: Secret
metadata:
  name: order-service-secrets
type: Opaque
data:
  DB_PASSWORD: <base64-encoded, or use external-secrets-operator>
  API_KEY: <base64-encoded>
*/
```

The External Secrets Operator for Kubernetes synchronizes secrets from external stores (Vault, AWS Secrets Manager, etc.) into Kubernetes Secrets. This combines the security of dedicated secret stores with the convenience of Kubernetes-native secret consumption.
Individual tools don't make a platform—a cohesive, self-service experience does. Platform engineering is the discipline of building internal developer platforms that make teams productive and consistent.
Platform Engineering in Practice:
```yaml
# Internal Developer Portal (Backstage) Service Registration
# This enables discoverability, documentation, and ownership visibility

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-service
  title: Order Service
  description: Handles order creation, processing, and lifecycle management
  annotations:
    github.com/project-slug: company/order-service
    backstage.io/techdocs-ref: dir:.
    prometheus.io/alert: https://prometheus.company.com/alerts?q=order-service
    grafana/dashboard-selector: "service=order-service"
    pagerduty.com/integration-key: "PXXX..."
    sonarqube.org/project-key: order-service
  tags:
    - java
    - grpc
    - critical-path
  links:
    - url: https://order-service.company.com/api-docs
      title: API Documentation
    - url: https://runbooks.company.com/order-service
      title: Runbooks
    - url: https://grafana.company.com/d/order-service
      title: Dashboards
spec:
  type: service
  lifecycle: production
  owner: order-team
  system: commerce-platform
  providesApis:
    - order-api
    - order-events
  consumesApis:
    - payment-api
    - inventory-api
    - notification-api
  dependsOn:
    - resource:order-database
    - resource:redis-cache
    - component:payment-service
    - component:inventory-service
---
apiVersion: backstage.io/v1alpha1
kind: API
metadata:
  name: order-api
  title: Order API
  description: REST and gRPC API for order management
spec:
  type: openapi
  lifecycle: production
  owner: order-team
  definition:
    $text: https://github.com/company/order-service/blob/main/api/openapi.yaml
---
# Service Template for creating new services
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: microservice-template
  title: New Microservice
  description: Create a new microservice with CI/CD, observability, and K8s manifests
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service Information
      properties:
        name:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]*$'
        description:
          title: Description
          type: string
        owner:
          title: Owner Team
          type: string
          ui:field: OwnerPicker
    - title: Technical Options
      properties:
        language:
          title: Language
          type: string
          enum: ['java', 'go', 'nodejs', 'python']
          default: 'java'
        database:
          title: Database
          type: string
          enum: ['postgresql', 'mysql', 'none']
        messageQueue:
          title: Message Queue
          type: boolean
          default: false
  steps:
    - id: fetch-template
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          owner: ${{ parameters.owner }}
          language: ${{ parameters.language }}
    - id: create-repo
      action: github:repo:create
      input:
        repoUrl: github.com?owner=company&repo=${{ parameters.name }}
    - id: register
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.create-repo.output.repoContentsUrl }}
```

Tools like Backstage (open source from Spotify) provide service catalogs, documentation aggregation, and scaffolding for new services. A developer portal becomes the entry point for all platform capabilities—a single place to discover services, understand ownership, access docs, and create new services from approved templates.
Infrastructure is the foundation upon which microservices success is built. Without proper tooling, teams spend their time fighting operational fires rather than delivering business value. The key requirements covered on this page:

- A container orchestration platform (typically managed Kubernetes) for deployment, scaling, and self-healing
- CI/CD pipelines that let each team deploy its services independently and frequently
- An observability stack that correlates logs, metrics, and traces across service boundaries, with SLO-based alerting
- Service-to-service communication capabilities (mTLS, retries, circuit breaking), whether via a service mesh or application libraries
- Programmatic configuration and secrets management with rotation support
- A platform engineering practice that packages all of the above into self-service golden paths
You now understand the infrastructure requirements for microservices migration. The next page covers timeline and milestone planning—how to create realistic schedules, define meaningful checkpoints, and structure the multi-phase journey from monolith to microservices.