Configuration Management - Learning Module


Configuration Management Best Practices

Synthesizing Configuration Management Excellence

Configuration management is a discipline that spans technology, process, and culture. Throughout this module, we've explored tools (Ansible, Chef, Puppet), paradigms (mutable vs immutable), challenges (configuration drift), and specialized concerns (secrets management). Now we synthesize these elements into a cohesive set of best practices that guide effective configuration management at scale.

These practices represent hard-won lessons from organizations managing infrastructure ranging from dozens to hundreds of thousands of servers. They are not dogma—context matters, and skilled engineers adapt practices to their specific situations. But they provide a foundation of principles that have proven effective across diverse environments.

What You Will Learn

This page covers best practices across all aspects of configuration management: code organization and structure, testing and validation, security and compliance, operational excellence, team collaboration, and continuous improvement. You'll develop a comprehensive framework for building and maintaining reliable, secure, and maintainable infrastructure.

Code Organization and Structure

Configuration code, like application code, benefits from thoughtful organization. Well-structured configuration is easier to understand, maintain, test, and evolve. Poorly organized configuration becomes a liability—a source of confusion, errors, and technical debt.

Fundamental Organizational Principles

Configuration Code Organization

•Single Responsibility — Each module, role, or cookbook should do one thing well. An 'nginx' role configures nginx; it doesn't also configure monitoring or log shipping.
•Separation of Concerns — Separate what (data/variables) from how (logic/code) from where (inventory/targeting). This enables reuse across environments.
•DRY but Not Prematurely — Avoid repetition, but don't abstract too early. Wait until patterns emerge before creating shared modules.
•Explicit > Implicit — Make dependencies, requirements, and assumptions explicit. Future readers (including future you) will thank you.
•Consistent Naming — Establish naming conventions and enforce them. Names should reveal intent: web_server not ws, production not prod.
•Version Everything — Lock versions of modules, dependencies, and tools. 'Latest' is not reproducible.
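
The "Version Everything" principle can be checked mechanically. The following is a minimal sketch, assuming the entries of a requirements file have already been parsed into a list of dictionaries; the function name `find_unpinned`, the set of floating labels, and the sample role names are all hypothetical.

```python
from typing import Iterable

# Version labels that do not identify a reproducible release (assumed list).
FLOATING = {"", "latest", "main", "master", "head"}


def find_unpinned(requirements: Iterable[dict]) -> list[str]:
    """Return the names of dependencies that lack a fixed version."""
    unpinned = []
    for entry in requirements:
        # A missing or empty version is treated the same as "latest".
        version = str(entry.get("version") or "").strip().lower()
        if version in FLOATING:
            unpinned.append(entry["name"])
    return unpinned


if __name__ == "__main__":
    sample = [
        {"name": "example.nginx", "version": "3.2.0"},
        {"name": "example.postgresql"},                      # no version at all
        {"name": "example.monitoring", "version": "latest"},
    ]
    print(find_unpinned(sample))
```

A check like this could run in pre-commit or CI so that an unpinned dependency fails review instead of surfacing later as an unreproducible build.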

directory_structure.txt
# Recommended Configuration Repository Structure
# Applies to Ansible; similar patterns for Chef/Puppet
 
infrastructure/
├── README.md                    # Documentation starting point
├── CONTRIBUTING.md              # How to contribute
├── .pre-commit-config.yaml      # Pre-commit hooks
├── .gitignore
├── .sops.yaml                   # SOPS encryption configuration
│
├── ansible.cfg                  # Ansible configuration
├── requirements.yml             # External role dependencies
├── requirements.txt             # Python dependencies
│
├── inventories/                 # Environment-specific inventories
│   ├── production/
│   │   ├── hosts.yml            # Host inventory
│   │   ├── group_vars/
│   │   │   ├── all/
│   │   │   │   ├── vars.yml     # Common variables
│   │   │   │   └── vault.yml    # Encrypted variables (ansible-vault)
│   │   │   ├── webservers.yml
│   │   │   └── databases.yml
│   │   └── host_vars/
│   │       └── db-primary-1.yml
│   ├── staging/
│   │   └── (same structure)
│   └── development/
│       └── (same structure)
│
├── playbooks/                   # High-level orchestration
│   ├── site.yml                 # Complete site deployment
│   ├── webservers.yml           # Web tier deployment
│   ├── databases.yml            # Database tier deployment
│   ├── deploy-app.yml           # Application deployment
│   └── security-update.yml      # Security patching
│
├── roles/                       # Reusable roles
│   ├── common/                  # Applied to all hosts
│   │   ├── defaults/main.yml
│   │   ├── tasks/main.yml
│   │   ├── handlers/main.yml
│   │   ├── templates/
│   │   ├── files/
│   │   └── meta/main.yml
│   ├── nginx/
│   ├── postgresql/
│   ├── monitoring-agent/
│   └── security-hardening/
│
├── library/                     # Custom modules
│   └── custom_module.py
│
├── filter_plugins/              # Custom filters
│   └── custom_filters.py
│
├── molecule/                    # Test configurations
│   └── default/
│       ├── molecule.yml
│       ├── converge.yml
│       └── verify.yml
│
├── docs/                        # Documentation
│   ├── architecture.md
│   ├── runbooks/
│   │   ├── deployment.md
│   │   └── incident-response.md
│   └── decision-records/        # ADRs
│       └── 003-ansible-over-terraform-for-config.md
│
└── scripts/                     # Supporting scripts
    ├── bootstrap.sh
    └── validate.sh

The Role and Profile Pattern

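Adapted from Puppet's roles-and-profiles pattern but applicable across tools, this approach separates technology roles (what a piece of software does) from business profiles (what a server is). The names below follow Ansible's vocabulary; Puppet's own terminology assigns them the other way round, with profiles wrapping a technology stack and roles describing what a server is: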

Roles are technology-specific: nginx, postgresql, prometheus-agent. They know how to install and configure one piece of software.
Profiles are business-specific: profile::webserver, profile::database_primary. They compose roles and apply business-specific configuration.
Node classification assigns profiles to servers: "This server IS a webserver" → apply profile::webserver.

This separation enables technology roles to be reused across profiles while business logic lives in profiles.
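
The composition described above can be modeled in a few lines. This is an illustrative sketch only: the `PROFILES` and `NODES` mappings and the host names are hypothetical, and a real setup would express the same idea in playbooks and inventory.

```python
# Hypothetical profile definitions: each profile is an ordered list of roles.
PROFILES = {
    "webserver": ["common", "security-hardening", "nginx", "monitoring-agent"],
    "database_primary": ["common", "security-hardening", "postgresql", "monitoring-agent"],
}

# Node classification: which profile(s) each host IS.
NODES = {
    "web-1": ["webserver"],
    "db-primary-1": ["database_primary"],
}


def roles_for(host: str) -> list[str]:
    """Resolve a host to its ordered, de-duplicated list of technology roles."""
    resolved: list[str] = []
    for profile in NODES[host]:
        for role in PROFILES[profile]:
            if role not in resolved:
                resolved.append(role)
    return resolved


def profiles_using(role: str) -> list[str]:
    """Show reuse: every profile that includes the given role."""
    return sorted(name for name, roles in PROFILES.items() if role in roles)


if __name__ == "__main__":
    print(roles_for("web-1"))
    print(profiles_using("common"))
```

The technology roles never mention a business purpose, and the profiles never contain installation logic, which is the separation the pattern is after.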

profiles_example.yml
# Role/Profile Pattern in Ansible
# Profiles compose roles with business-specific configuration
 
# profiles/webserver.yml - What a webserver IS
---
- name: Web Server Profile
  hosts: "{{ target_hosts }}"
  become: yes
  
  vars:
    # Business-specific defaults
    nginx_ssl_enabled: true
    monitoring_enabled: true
    log_retention_days: 30
  
  roles:
    # Base roles (order matters)
    - role: common
    - role: security-hardening
    
    # Technology roles
    - role: nginx
      nginx_worker_processes: "{{ ansible_processor_vcpus }}"
      nginx_worker_connections: 4096
    
    - role: app-runtime
      runtime: nodejs
      version: "20"
    
    # Operational roles
    - role: monitoring-agent
      when: monitoring_enabled
    - role: log-shipper
      log_paths:
        - /var/log/nginx
        - /opt/app/logs
 
# profiles/database_primary.yml - What a database primary IS
---
- name: Database Primary Profile
  hosts: "{{ target_hosts }}"
  become: yes
  
  vars:
    postgresql_role: primary
    replication_enabled: true
    backup_enabled: true
  
  roles:
    - role: common
    - role: security-hardening
      security_level: strict  # Databases get stricter hardening
    
    - role: postgresql
      postgresql_version: "15"
      postgresql_max_connections: 500
      postgresql_shared_buffers: "{{ (ansible_memtotal_mb * 0.25) | int }}MB"
    
    - role: postgresql-replication
      when: replication_enabled
    
    - role: backup-agent
      when: backup_enabled
      backup_type: postgresql
      backup_schedule: "0 2 * * *"
    
    - role: monitoring-agent
      custom_metrics:
        - postgresql_exporter

Start Simple, Extract as Needed

Don't create elaborate role hierarchies upfront. Start with simple, inline configuration. When you see repetition across playbooks, extract a role. When roles become complex, split them. Premature abstraction creates complexity without benefit. Let structure emerge from actual needs.

Testing and Validation

Configuration code is code. It deserves the same testing rigor as application code. Yet configuration testing is often neglected—changes are 'tested in production,' with predictable results. A comprehensive testing strategy catches errors before they reach production.

The Configuration Testing Pyramid

Like application testing, configuration testing forms a pyramid:

Static Analysis (Linting) — Fast, cheap, catches syntax errors and style violations
Unit Tests — Test individual roles/modules in isolation
Integration Tests — Test roles working together on real infrastructure
Compliance Tests — Verify systems meet security and policy requirements
Production Validation — Canary deployments and progressive rollouts
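
The pyramid implies an ordering: cheap checks run first, and expensive ones run only if everything beneath them passed. A minimal sketch of that fail-fast ordering follows; `run_pyramid` and the stage names are hypothetical, and each stage is assumed to be a callable that returns True on success.

```python
from typing import Callable, Optional

# A stage is a (name, check) pair; the check returns True when it passes.
Stage = tuple[str, Callable[[], bool]]


def run_pyramid(stages: list[Stage]) -> tuple[list[str], Optional[str]]:
    """Run stages in order and stop at the first failure.

    Returns (names of stages that passed, name of the failing stage or None).
    """
    passed: list[str] = []
    for name, check in stages:
        if not check():
            return passed, name      # later, more expensive stages never start
        passed.append(name)
    return passed, None


if __name__ == "__main__":
    stages = [
        ("lint", lambda: True),
        ("unit", lambda: True),
        ("integration", lambda: False),
        ("compliance", lambda: True),
    ]
    print(run_pyramid(stages))
```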

.github/workflows/config-testing.yaml
# GitHub Actions: Comprehensive Configuration Testing Pipeline
 
name: Configuration Testing
 
on:
  pull_request:
    paths:
      - 'roles/**'
      - 'playbooks/**'
      - 'inventories/**'
  push:
    branches: [main]
 
jobs:
  lint:
    name: Static Analysis
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Install dependencies
        run: |
          pip install ansible ansible-lint yamllint
      
      - name: YAML Lint
        run: yamllint .
      
      - name: Ansible Lint
        run: ansible-lint playbooks/ roles/
      
      - name: Syntax Check
        run: |
          for playbook in playbooks/*.yml; do
            ansible-playbook --syntax-check "$playbook"
          done
 
  unit-test:
    name: Unit Tests
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        role: [common, nginx, postgresql, monitoring-agent]
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      
      - name: Install Molecule
        run: |
          pip install molecule "molecule-plugins[docker]" ansible pytest pytest-testinfra
      
      - name: Run Molecule tests
        working-directory: roles/${{ matrix.role }}
        run: molecule test
        env:
          MOLECULE_DISTRO: ubuntu2204
 
  integration-test:
    name: Integration Tests
    runs-on: ubuntu-latest
    needs: unit-test
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup test infrastructure
        run: |
          # Spin up test VMs or containers (Compose v2 syntax)
          docker compose -f tests/integration/docker-compose.yml up -d
      
      - name: Run integration playbook
        run: |
          # Apply for real (no --check) so the tests below exercise a converged system
          ansible-playbook -i tests/integration/inventory.yml \
            playbooks/site.yml \
            --diff
      
      - name: Run integration tests
        run: |
          pytest tests/integration/ -v
      
      - name: Cleanup
        if: always()
        run: docker compose -f tests/integration/docker-compose.yml down
 
  compliance-test:
    name: Compliance Scanning
    runs-on: ubuntu-latest
    needs: integration-test
    steps:
      - uses: actions/checkout@v4
      
      - name: Install InSpec
        run: |
          curl https://omnitruck.chef.io/install.sh | sudo bash -s -- -P inspec
      
      - name: Run CIS benchmark
        run: |
          inspec exec compliance/cis-benchmark \
            -t docker://test-container \
            --reporter cli json:compliance-results.json
      
      - name: Upload compliance results
        uses: actions/upload-artifact@v4
        with:
          name: compliance-results
          path: compliance-results.json
 
  security-scan:
    name: Security Analysis
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4
      
      - name: Secrets scanning
        uses: trufflesecurity/trufflehog@main
        with:
          path: ./
          base: ${{ github.event.repository.default_branch }}
          head: HEAD
      
      - name: Ansible security scan
        run: |
          pip install bandit
          bandit -r library/ filter_plugins/ -f json -o bandit-results.json || true
      
      - name: Check for hardcoded secrets
        run: |
          # Custom patterns for infrastructure secrets
          ! grep -rE "(password|secret|key)[[:space:]]*[:=][[:space:]]*['\"][^'\"]+['\"]" \
            --include="*.yml" --include="*.yaml" \
            inventories/ roles/ playbooks/

roles/nginx/molecule/default/molecule.yml
# Molecule Configuration for nginx role
# Defines test matrix and scenario
 
dependency:
  name: galaxy
  options:
    requirements-file: requirements.yml
 
driver:
  name: docker
 
platforms:
  # Test on multiple distributions
  - name: ubuntu-22
    image: geerlingguy/docker-ubuntu2204-ansible
    pre_build_image: true
    privileged: true
    command: /lib/systemd/systemd
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    cgroupns_mode: host
  
  - name: debian-12
    image: geerlingguy/docker-debian12-ansible
    pre_build_image: true
    privileged: true
    command: /lib/systemd/systemd
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    cgroupns_mode: host
  
  - name: rocky-9
    image: geerlingguy/docker-rockylinux9-ansible
    pre_build_image: true
    privileged: true
    command: /lib/systemd/systemd
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    cgroupns_mode: host
 
provisioner:
  name: ansible
  playbooks:
    converge: converge.yml
    verify: verify.yml
  inventory:
    group_vars:
      all:
        nginx_worker_processes: 2
        nginx_worker_connections: 1024
 
verifier:
  name: ansible
 
scenario:
  name: default
  test_sequence:
    - dependency
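    # lint is not listed: recent Molecule releases dropped the built-in lint step,
    # so linting runs separately (for example with ansible-lint in CI)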
    - cleanup
    - destroy
    - syntax
    - create
    - prepare
    - converge
    - idempotence  # Critical: verify idempotence
    - verify
    - cleanup
    - destroy

Testing Best Practices

•Test Idempotence — Run configuration twice; the second run should report no changes. Non-idempotent configuration is fragile.
•Test on Multiple Platforms — If you support Ubuntu and Rocky, test on both. Platform-specific bugs are common.
•Test the Negative Case — Verify that incorrect configurations are rejected, invalid inputs are handled, and error paths work.
•Test Real-World Scenarios — Don't just test happy paths. Test upgrades, rollbacks, partial failures, and edge cases.
•Automate Everything — Manual testing doesn't scale. Every test should run automatically in CI.
•Fast Feedback Loops — Lint in pre-commit, unit tests in CI, integration tests before merge. Catch issues early.

The Idempotence Imperative

Idempotence is non-negotiable. Configuration that isn't idempotent causes drift, breaks continuous enforcement, and makes debugging nearly impossible. If running your configuration twice produces different results, something is wrong. Molecule's idempotence test is your friend—don't skip it.
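
The difference between an idempotent and a non-idempotent operation is easy to see in miniature. The sketch below is not how any CM tool is implemented; `ensure_line` and `append_line` are hypothetical helpers that operate on a list of config lines and report whether they changed anything.

```python
def ensure_line(lines: list[str], wanted: str) -> tuple[list[str], bool]:
    """Idempotent: add `wanted` only if it is missing. Returns (lines, changed)."""
    if wanted in lines:
        return lines, False
    return lines + [wanted], True


def append_line(lines: list[str], wanted: str) -> tuple[list[str], bool]:
    """Non-idempotent counterpart: appends on every run."""
    return lines + [wanted], True


if __name__ == "__main__":
    config = ["PermitRootLogin no"]

    # First run converges the state; the second run reports no change.
    config, changed_first = ensure_line(config, "X11Forwarding no")
    config, changed_second = ensure_line(config, "X11Forwarding no")
    print(changed_first, changed_second, config)

    # The naive version keeps growing the file on every run.
    drifting, _ = append_line(config, "X11Forwarding no")
    print(drifting)
```

An idempotence test asserts exactly this property: after the first converge, a second run must report zero changes.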

Security and Compliance

Configuration management systems have privileged access to infrastructure. They can create users, modify permissions, install software, and access secrets. This power demands rigorous security practices. A compromised CM system is a compromised infrastructure.

Security Principles for Configuration Management

CM Security Best Practices

•Least Privilege for CM Systems — CM tools need root/admin access to manage systems, but should only have access to systems they manage. Segment by environment.
•Secure the Control Plane — The Ansible control node, Chef Server, or Puppet Server is a high-value target. Apply the same security rigor as to production systems.
•Audit All Changes — Every configuration change should be logged, attributed to an identity, and reviewable. Use Git history plus CM tool audit logs.
•Code Review for Configuration — All configuration changes should go through peer review. No direct pushes to main. Treat it like production code.
•Separate Secrets from Configuration — Use dedicated secrets management. Don't store plaintext secrets in inventories or variables.
•Encrypt at Rest and in Transit — Use TLS for CM communication. Encrypt sensitive data in configuration files (ansible-vault, SOPS).
•Regular Security Audits — Periodically audit CM code for security issues: hardcoded credentials, overly permissive settings, unnecessary root access.
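
Part of such an audit can be automated. The sketch below scans configuration text for quoted literals assigned to secret-looking keys; the pattern, the function name `find_hardcoded_secrets`, and the rule that Jinja references are acceptable are assumptions for illustration, and a dedicated secrets scanner will catch far more.

```python
import re

# Matches e.g.  db_password: "hunter2"   or   api_key = 'abc123'
SECRET_PATTERN = re.compile(
    r"""(password|secret|key)\s*[:=]\s*['"]([^'"]+)['"]""",
    re.IGNORECASE,
)


def find_hardcoded_secrets(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that look like plaintext secrets."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = SECRET_PATTERN.search(line)
        if match is None:
            continue
        # A templated reference such as "{{ vault_db_password }}" is not a literal.
        if "{{" in match.group(2):
            continue
        findings.append((number, line.strip()))
    return findings


if __name__ == "__main__":
    sample = (
        'db_password: "hunter2"\n'
        'api_key: "{{ vault_api_key }}"\n'
        "port: 5432\n"
    )
    for number, line in find_hardcoded_secrets(sample):
        print(f"line {number}: {line}")
```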

compliance/cis-benchmark/controls/os_hardening.rb
# InSpec Control: OS Security Hardening Compliance
# Verifies systems meet CIS benchmark requirements
 
title 'CIS Benchmark - Operating System Hardening'
 
control 'cis-1.1.1' do
  impact 1.0
  title 'Ensure mounting of cramfs filesystems is disabled'
  desc 'The cramfs filesystem type is a compressed read-only filesystem.'
  
  describe kernel_module('cramfs') do
    it { should_not be_loaded }
    it { should be_disabled }
    it { should be_blacklisted }
  end
end
 
control 'cis-1.4.1' do
  impact 1.0
  title 'Ensure bootloader password is set'
  desc 'Setting a bootloader password requires anyone rebooting the system to enter a password before changing boot parameters.'
  
  describe file('/boot/grub2/grub.cfg') do
    its('content') { should match(/^set superusers/) }
    its('content') { should match(/^password_pbkdf2/) }
  end
end
 
control 'cis-1.5.1' do
  impact 1.0
  title 'Ensure core dumps are restricted'
  desc 'A core dump is the memory of an executable program.'
  
  describe limits_conf do
    its('*') { should include ['hard', 'core', '0'] }
  end
  
  describe kernel_parameter('fs.suid_dumpable') do
    its('value') { should eq 0 }
  end
end
 
control 'cis-5.2.1' do
  impact 1.0
  title 'Ensure permissions on /etc/ssh/sshd_config are configured'
  
  describe file('/etc/ssh/sshd_config') do
    it { should exist }
    its('owner') { should eq 'root' }
    its('group') { should eq 'root' }
    its('mode') { should cmp '0600' }
  end
end
 
control 'cis-5.2.5' do
  impact 1.0
  title 'Ensure SSH LogLevel is appropriate'
  
  describe sshd_config do
    its('LogLevel') { should eq 'VERBOSE' }
  end
end
 
control 'cis-5.2.6' do
  impact 1.0
  title 'Ensure SSH X11 forwarding is disabled'
  
  describe sshd_config do
    its('X11Forwarding') { should eq 'no' }
  end
end
 
control 'cis-5.2.8' do
  impact 1.0
  title 'Ensure SSH root login is disabled'
  
  describe sshd_config do
    its('PermitRootLogin') { should eq 'no' }
  end
end
 
control 'cis-5.2.11' do
  impact 1.0
  title 'Ensure SSH PermitEmptyPasswords is disabled'
  
  describe sshd_config do
    its('PermitEmptyPasswords') { should eq 'no' }
  end
end
 
control 'cis-5.4.1.1' do
  impact 1.0
  title 'Ensure password expiration is 365 days or less'
  
  describe login_defs do
    its('PASS_MAX_DAYS') { should cmp <= 365 }
  end
end
 
control 'cis-5.4.1.4' do
  impact 1.0
  title 'Ensure inactive password lock is 30 days or less'
  
  describe command('useradd -D | grep INACTIVE') do
    its('stdout') { should match(/INACTIVE=(30|[1-2][0-9]|[1-9])$/) }
  end
end
 
control 'app-security-1' do
  impact 0.8
  title 'Application-specific: No development tools in production'
  
  only_if { os.linux? }
  only_if { input('environment') == 'production' }
  
  %w[gcc make gdb strace].each do |pkg|
    describe package(pkg) do
      it { should_not be_installed }
    end
  end
end

Compliance Frameworks and CM Integration
Framework	Key Requirements	CM Integration
CIS Benchmarks	OS hardening, service configuration	InSpec profiles, Ansible/Chef hardening roles
SOC 2	Access controls, change management, logging	Audit logging, PR-based changes, role segregation
PCI DSS	Encryption, access control, monitoring	Secrets management, compliance scanning, audit trails
HIPAA	Data protection, access logging, encryption	Encrypted data handling, access auditing
FedRAMP	Strict access control, continuous monitoring	Automated compliance testing, continuous enforcement

Compliance as Code

Treat compliance requirements as code. Express requirements in InSpec, Open Policy Agent, or similar tools. Run compliance checks automatically in CI/CD and periodically against production. Shift compliance left—catch violations before deployment, not during audits.
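
To make the idea concrete without depending on any particular tool, here is a minimal sketch in which each requirement is a named check evaluated against a dictionary of host facts. The `Control` class, the control identifiers, and the shape of the `facts` dictionary are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Control:
    control_id: str
    title: str
    check: Callable[[dict], bool]   # receives host facts, returns pass/fail


CONTROLS = [
    Control(
        "ssh-root-login",
        "SSH root login is disabled",
        lambda facts: facts.get("sshd", {}).get("PermitRootLogin") == "no",
    ),
    Control(
        "pass-max-days",
        "Password expiration is 365 days or less",
        lambda facts: int(facts.get("login_defs", {}).get("PASS_MAX_DAYS", 99999)) <= 365,
    ),
]


def evaluate(facts: dict) -> dict[str, bool]:
    """Run every control against one host's facts."""
    return {control.control_id: control.check(facts) for control in CONTROLS}


def is_compliant(facts: dict) -> bool:
    return all(evaluate(facts).values())


if __name__ == "__main__":
    host = {"sshd": {"PermitRootLogin": "yes"}, "login_defs": {"PASS_MAX_DAYS": "90"}}
    print(evaluate(host), is_compliant(host))
```

Because the controls are plain data plus functions, the same definitions can run in CI against test systems and on a schedule against production.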

Operational Excellence

Configuration management in production requires operational discipline. It's not enough to write correct configurations—you must deploy them safely, monitor their effects, and respond to issues quickly.

Deployment Best Practices

Safe Configuration Deployment

•Progressive Rollouts — Deploy to a subset first (canary), validate, then expand. Never deploy to all production at once.
•Pre-flight Checks — Run in check/dry-run mode before applying. Review what will change before making changes.
•Staged Environments — Progress through dev → staging → production. Each stage should be as production-like as possible.
•Change Windows — Define when changes are allowed. Avoid deployments before weekends or holidays when possible.
•Rollback Plans — Know how to roll back before you roll forward. Test rollbacks regularly.
•Communication — Notify stakeholders before major changes. Post in #engineering-announcements.
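
A progressive rollout is, at its core, a batching plan: canaries first, then the rest in bounded groups. The following sketch shows one way to compute such a plan; `plan_batches`, its defaults, and its rounding rule are assumptions for illustration and do not reproduce how any specific tool sizes its batches.

```python
import math


def plan_batches(
    hosts: list[str],
    canary_count: int = 1,
    batch_fraction: float = 0.25,
) -> list[list[str]]:
    """Canary hosts go first, one at a time; the rest follow in equal batches."""
    canaries = hosts[:canary_count]
    rest = hosts[canary_count:]

    batches = [[host] for host in canaries]
    if rest:
        # Round up so a small fleet still makes progress; never below one host.
        size = max(1, math.ceil(len(rest) * batch_fraction))
        batches += [rest[i:i + size] for i in range(0, len(rest), size)]
    return batches


if __name__ == "__main__":
    fleet = [f"web-{n}" for n in range(1, 10)]
    for index, batch in enumerate(plan_batches(fleet), start=1):
        print(index, batch)
```

Validation belongs between batches: if the canary batch degrades health, the remaining batches should never start.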

deploy_playbook.yml
# Production Deployment Playbook with Safety Controls
# Demonstrates progressive rollout and validation
 
---
- name: Pre-flight validation
  hosts: localhost
  tasks:
    - name: Check change window
      fail:
        msg: "Deployment blocked: Outside change window (Mon-Thu 10:00-16:00 UTC)"
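      # Block the run when ANY condition places it outside the window.
      # Note: ansible_date_time reflects the control node's local clock,
      # so that node is assumed to run in UTC.
      when:
        - not (allow_outside_window | default(false))
        - >-
          ansible_date_time.weekday not in ['Monday', 'Tuesday', 'Wednesday', 'Thursday']
          or ansible_date_time.hour | int < 10
          or ansible_date_time.hour | int >= 16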
    
    - name: Verify staging deployment was successful
      uri:
        url: "https://staging.example.com/health"
        status_code: 200
      retries: 3
      delay: 5
    
    - name: Check current production health
      uri:
        url: "https://{{ item }}.example.com/health"
        status_code: 200
      loop:
        - www
        - api
      register: pre_deploy_health
 
- name: Deploy to canary hosts
  hosts: production_canary
  become: yes
  serial: 1
  max_fail_percentage: 0
  
  pre_tasks:
    - name: Drain connections from load balancer
      uri:
        url: "https://lb-api.example.com/drain/{{ inventory_hostname }}"
        method: POST
        headers:
          Authorization: "Bearer {{ lb_api_token }}"
      delegate_to: localhost
    
    - name: Wait for connections to drain
      wait_for:
        timeout: 60
  
  roles:
    - role: application-deploy
      app_version: "{{ deploy_version }}"
  
  post_tasks:
    - name: Verify application health
      uri:
        url: "http://localhost:8080/health"
        status_code: 200
      retries: 5
      delay: 10
    
    - name: Re-enable in load balancer
      uri:
        url: "https://lb-api.example.com/enable/{{ inventory_hostname }}"
        method: POST
        headers:
          Authorization: "Bearer {{ lb_api_token }}"
      delegate_to: localhost
 
- name: Canary validation pause
  hosts: localhost
  tasks:
    - name: Wait for monitoring
      pause:
        minutes: 5
        prompt: "Canary deployed. Monitor dashboards. Press Ctrl+C then C to continue early, or Ctrl+C then A to abort."
      when: not auto_proceed | default(false)
    
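    - name: Check error rate on canary
      uri:
        url: "https://prometheus.example.com/api/v1/query"
        method: POST                      # the query API accepts a form-encoded POST
        body_format: form-urlencoded
        body:
          query: 'rate(http_errors_total{instance=~"canary.*"}[5m])'
      register: error_rate
      # Fail only when a series is returned and its rate exceeds 1%
      failed_when: >-
        error_rate.json.data.result | length > 0
        and error_rate.json.data.result[0].value[1] | float > 0.01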
 
- name: Deploy to production (remaining hosts)
  hosts: production:!production_canary
  become: yes
  serial: "25%"
  max_fail_percentage: 10
  
  pre_tasks:
    - name: Drain connections
      # ... same as canary
  
  roles:
    - role: application-deploy
      app_version: "{{ deploy_version }}"
  
  post_tasks:
    - name: Verify health
      # ... same as canary
 
- name: Post-deployment validation
  hosts: localhost
  tasks:
    - name: Run integration tests against production
      command: pytest tests/integration/production/ -v
      delegate_to: localhost
    
    - name: Notify success
      slack:
        token: "{{ slack_token }}"
        channel: "#deployments"
        msg: "✅ Deployed {{ deploy_version }} to production successfully"
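
The change window named in the playbook's failure message (Mon-Thu 10:00-16:00 UTC) is a good candidate for a small, separately tested function, because boolean window logic is easy to get subtly wrong. This is a sketch under the assumption that the caller passes a timezone-aware datetime; the function and constant names are hypothetical.

```python
from datetime import datetime, timezone

ALLOWED_WEEKDAYS = {0, 1, 2, 3}   # Monday=0 ... Thursday=3
WINDOW_START_HOUR = 10            # inclusive, UTC
WINDOW_END_HOUR = 16              # exclusive, UTC


def in_change_window(moment: datetime) -> bool:
    """True only if BOTH the weekday and the hour fall inside the window."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    moment = moment.astimezone(timezone.utc)
    return (
        moment.weekday() in ALLOWED_WEEKDAYS
        and WINDOW_START_HOUR <= moment.hour < WINDOW_END_HOUR
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(now.isoformat(), in_change_window(now))
```

A deployment is outside the window when either the day or the hour is wrong, so the blocking condition is the negation of the expression above.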

Monitoring and Observability

Configuration management must be observable. You need to know:

What ran: Which configurations were applied, when, by whom
What changed: What resources were modified
What failed: Which runs failed, why, and on which hosts
System state: Current state vs. desired state (drift detection)
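
Drift detection reduces to comparing the desired state with the observed state, setting by setting. A minimal sketch, assuming both states are available as flat dictionaries; `detect_drift` is a hypothetical name, and settings that exist on the host but are not managed are deliberately ignored.

```python
from typing import Any


def detect_drift(desired: dict[str, Any], actual: dict[str, Any]) -> dict[str, tuple]:
    """Return {setting: (desired_value, actual_value)} for every mismatch."""
    drift = {}
    for key, want in desired.items():
        have = actual.get(key)      # None when the setting is absent on the host
        if have != want:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    desired = {"PermitRootLogin": "no", "X11Forwarding": "no"}
    actual = {"PermitRootLogin": "yes", "X11Forwarding": "no", "Banner": "/etc/issue"}
    print(detect_drift(desired, actual))
```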

Observability Best Practices

•Centralized Logging — Aggregate all CM run logs to a central system (ELK, Splunk, Datadog). Enable searching and alerting.
•Metrics Collection — Track run duration, success rate, changes made, hosts affected. Visualize in dashboards.
•Alerting on Failures — Alert on CM run failures, high error rates, extended run times, or excessive changes.
•Drift Dashboards — Visualize drift detection results. Track drift rate over time as a key health metric.
•Change Correlation — Correlate CM changes with application metrics. Did the config change affect error rates, latency, or availability?
•Audit Trails — Maintain complete audit trails for compliance. Who changed what, when, with what justification.
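
Most of these practices start from the same raw material: one record per CM run. The sketch below aggregates such records into the figures a dashboard or alert would use; the `RunRecord` fields and the `summarize` function are hypothetical and would map onto whatever the chosen tool actually reports.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    host: str
    succeeded: bool
    changed_tasks: int
    duration_seconds: float


def summarize(runs: list[RunRecord]) -> dict:
    """Aggregate run records into success rate, change volume, and failures."""
    total = len(runs)
    if total == 0:
        return {"runs": 0, "success_rate": None, "total_changes": 0, "failed_hosts": []}
    succeeded = sum(1 for run in runs if run.succeeded)
    return {
        "runs": total,
        "success_rate": succeeded / total,
        "total_changes": sum(run.changed_tasks for run in runs),
        "failed_hosts": sorted(run.host for run in runs if not run.succeeded),
    }


if __name__ == "__main__":
    runs = [
        RunRecord("web-1", True, 2, 41.0),
        RunRecord("web-2", True, 0, 38.5),
        RunRecord("db-1", False, 1, 12.2),
        RunRecord("db-2", True, 0, 40.1),
    ]
    print(summarize(runs))
```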

The 'No Surprises' Principle

Production should never surprise you. If a CM run changes something unexpected, your testing or visibility is inadequate. Invest in preview/dry-run capabilities, diff outputs, and change impact analysis. Know what will happen before it happens.
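
A preview can be as simple as a unified diff between the file on the host and the file the configuration would render. The sketch below uses Python's standard `difflib`; the function name `preview_change` and the sample nginx lines are hypothetical.

```python
import difflib


def preview_change(current: str, desired: str, path: str) -> str:
    """Return a unified diff of what applying `desired` would change."""
    diff = difflib.unified_diff(
        current.splitlines(keepends=True),
        desired.splitlines(keepends=True),
        fromfile=f"{path} (current)",
        tofile=f"{path} (desired)",
    )
    return "".join(diff)


if __name__ == "__main__":
    current = "worker_processes 2;\nworker_connections 1024;\n"
    desired = "worker_processes 4;\nworker_connections 1024;\n"
    print(preview_change(current, desired, "/etc/nginx/nginx.conf") or "no changes")
```

An empty diff means the run should report no change for that file; anything else is what a reviewer should see before the change is applied.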

Team Collaboration and Culture

Configuration management is a team sport. Effective CM requires collaboration between developers, operations, security, and platform teams. The practices that enable this collaboration are as important as the technical tools.

Collaboration Patterns

Team Collaboration Best Practices

•Everything in Git — All configuration lives in version control. No 'golden servers,' no undocumented changes, no tribal knowledge.
•Pull Request Workflow — All changes go through PRs. Code review for infrastructure, just like application code.
•Clear Ownership — Define who owns which configurations. Use CODEOWNERS files to require appropriate reviewers.
•Documentation as Practice — Document decisions, not just implementations. Use Architecture Decision Records (ADRs) for significant choices.
•Runbooks for Operations — Provide clear runbooks for common operations: deployments, rollbacks, incident response, maintenance.
•Onboarding Materials — New team members should be able to understand and contribute to configuration within their first week.
•Regular Knowledge Sharing — Hold regular sessions to share CM learnings, review incidents, and discuss improvements.
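
Ownership rules become enforceable once they are data. The sketch below resolves a changed path to its required reviewers, with later rules overriding earlier ones. It is a simplification: the `OWNERS` table and team handles are hypothetical, and `fnmatch` globbing is used as a stand-in for the richer pattern syntax of a real CODEOWNERS file.

```python
from fnmatch import fnmatch

# Ordered rules: the last matching pattern wins.
OWNERS = [
    ("*", ["@platform-team"]),
    ("roles/postgresql/*", ["@database-team"]),
    ("inventories/production/*", ["@platform-team", "@security-team"]),
]


def required_reviewers(path: str) -> list[str]:
    """Return the owners of the most specific (last) rule matching `path`."""
    owners: list[str] = []
    for pattern, names in OWNERS:
        if fnmatch(path, pattern):
            owners = names          # later rules override earlier ones
    return owners


if __name__ == "__main__":
    for changed in ("README.md", "roles/postgresql/tasks/main.yml",
                    "inventories/production/hosts.yml"):
        print(changed, required_reviewers(changed))
```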

docs/decision-records/003-ansible-over-terraform-for-config.md
# ADR 003: Using Ansible for Configuration Management
 
## Status
Accepted
 
## Context
We need to choose a configuration management tool for our growing 
infrastructure. Options considered:
- Ansible
- Terraform + cloud-init
- Chef
- Puppet
 
Our team has Python experience but limited Ruby experience.
We run on multiple cloud providers (AWS, GCP) with some on-premises.
We need both provisioning and ongoing configuration management.
 
## Decision
We will use Ansible for configuration management, with Terraform 
for infrastructure provisioning.
 
### Reasons:
1. **Agentless architecture** - Simpler operations, no agent 
   infrastructure to maintain
2. **Python-based** - Aligns with team skills
3. **YAML syntax** - Lower learning curve for DevOps-adjacent 
   developers
4. **Multi-cloud support** - Strong modules for AWS, GCP, and others
5. **Complement to Terraform** - Clear separation: Terraform provisions, 
   Ansible configures
 
### Trade-offs accepted:
- No continuous enforcement (mitigated with scheduled runs and 
  drift detection)
- Push-based can be slower at scale (acceptable for our ~500 hosts)
- Less mature testing story than Chef
 
## Consequences
- All team members will receive Ansible training
- We will use Ansible Galaxy for standard roles where possible
- Custom roles will follow the organizational standard structure
- Integration with Terraform will use dynamic inventory
- Scheduled Ansible runs will enforce configuration hourly
 
## Related
- ADR 001: Terraform for Infrastructure Provisioning  
- ADR 002: Git-based Workflow for Infrastructure Changes

Culture and Mindset

The most important best practice is cultural: treat infrastructure as a first-class engineering concern. Configuration management isn't 'ops work'—it's engineering work that requires the same rigor, quality, and professionalism as application development.

Cultural Anti-Patterns to Avoid
Anti-Pattern	Symptom	Better Approach
Cowboy Operations	Changes made directly on servers	All changes through code and PRs
Undocumented Knowledge	Only one person knows how X works	Document everything, cross-train
Fear of Change	Avoid updates because 'it works'	Frequent small changes, good testing
Blame Culture	Incidents blamed on individuals	Blameless postmortems, system improvement
Siloed Teams	Dev vs Ops vs Security conflicts	Shared ownership, embedded practices
Manual Heroics	Incidents require specific person	Runbooks, automation, shared on-call

The Bus Factor

If one person's absence would cripple your configuration management, you have a bus factor problem. Actively work to distribute knowledge: pair programming, documentation, cross-training, and regular rotation of responsibilities. No single point of failure—in people or systems.

Continuous Improvement

Configuration management is never 'done.' The infrastructure evolves, requirements change, tools improve, and the team learns. A structured approach to continuous improvement ensures your CM practices remain effective over time.

Improvement Practices

Continuous Improvement Framework

•Regular Retrospectives — After incidents, major deployments, or quarterly, review what worked and what didn't. Identify improvements.
•Technical Debt Tracking — Maintain a backlog of CM improvements. Allocate time for debt reduction alongside feature work.
•Metrics-Driven Improvement — Track key metrics (deployment frequency, lead time, failure rate, MTTR). Use data to guide priorities.
•Experiment and Iterate — Try new tools and practices in non-critical environments. Adopt what works, discard what doesn't.
•Stay Current — The CM landscape evolves rapidly. Attend conferences, read blogs, participate in communities. Don't get stuck on outdated practices.
•Blameless Postmortems — When things go wrong, focus on system improvement, not individual blame. Every incident is a learning opportunity.
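
Two of the metrics named above, change failure rate and mean time to recovery, can be computed directly from deployment records. This is a minimal sketch; the `Deployment` record and its fields are hypothetical, and recovery time is measured here from the moment of deployment, which is one of several reasonable definitions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    deployed_at: datetime
    failed: bool = False
    recovered_at: Optional[datetime] = None   # set once a failure is resolved


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Fraction of deployments that caused an issue."""
    if not deployments:
        return 0.0
    return sum(1 for d in deployments if d.failed) / len(deployments)


def mean_time_to_recovery(deployments: list[Deployment]) -> Optional[timedelta]:
    """Average time from a failed deployment to its recovery, if any recovered."""
    durations = [
        d.recovered_at - d.deployed_at
        for d in deployments
        if d.failed and d.recovered_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


if __name__ == "__main__":
    start = datetime(2024, 1, 8, 10, 0)
    history = [
        Deployment(start),
        Deployment(start + timedelta(days=1), failed=True,
                   recovered_at=start + timedelta(days=1, minutes=30)),
        Deployment(start + timedelta(days=2)),
    ]
    print(change_failure_rate(history), mean_time_to_recovery(history))
```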

metrics/cm-kpis.yaml
# Configuration Management Key Performance Indicators
# Track these metrics to measure and improve CM effectiveness
 
availability_metrics:
  - name: cm_run_success_rate
    description: Percentage of CM runs completing successfully
    target: ">= 99%"
    good: ">= 99%"
    warning: ">= 95%"
    critical: "< 95%"
  
  - name: time_to_apply_change
    description: Time from merge to production deployment
    target: "< 30 minutes"
    good: "< 30 minutes"
    warning: "< 2 hours"
    critical: ">= 2 hours"
 
quality_metrics:
  - name: drift_rate
    description: Percentage of resources with detected drift
    target: "< 1%"
    good: "< 1%"
    warning: "< 5%"
    critical: ">= 5%"
  
  - name: test_coverage
    description: Percentage of roles with Molecule tests
    target: ">= 90%"
    good: ">= 90%"
    warning: ">= 70%"
    critical: "< 70%"
  
  - name: compliance_pass_rate
    description: Percentage of hosts passing compliance scans
    target: ">= 98%"
    good: ">= 98%"
    warning: ">= 90%"
    critical: "< 90%"
 
velocity_metrics:
  - name: deployment_frequency
    description: Number of production deployments per week
    target: ">= 10"  # Multiple per day
    good: ">= 10"
    warning: ">= 3"
    critical: "< 3"
  
  - name: change_failure_rate
    description: Percentage of deployments causing issues
    target: "< 5%"
    good: "< 5%"
    warning: "< 15%"
    critical: ">= 15%"
  
  - name: mean_time_to_recovery
    description: Average time to recover from CM-related incidents
    target: "< 1 hour"
    good: "< 1 hour"
    warning: "< 4 hours"
    critical: ">= 4 hours"
 
operational_health:
  - name: secrets_rotation_compliance
    description: Percentage of secrets rotated within policy
    target: ">= 95%"
  
  - name: documentation_freshness
    description: Percentage of docs updated within 30 days of related changes
    target: ">= 80%"
  
  - name: onboarding_time
    description: Days for new team member to make first production change
    target: "< 5 days"
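
A KPI file like this is only useful if something evaluates measurements against it. The sketch below shows one possible way to classify a measured value as good, warning, or critical from threshold strings in the style used above. It is an illustration under stated assumptions: `parse_threshold`, `classify`, and `UNIT_FACTORS` are hypothetical names, the thresholds are copied into Python dicts so the example needs only the standard library (loading the YAML file itself is left out), and time values are assumed to be supplied in minutes.

```python
import operator
import re

# Convert time units to minutes so "30 minutes" and "2 hours" share a scale.
UNIT_FACTORS = {"minute": 1, "minutes": 1, "hour": 60, "hours": 60,
                "day": 1440, "days": 1440}
OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
_PATTERN = re.compile(r"^\s*(<=|>=|<|>)\s*([0-9.]+)\s*(%|[A-Za-z]+)?\s*$")


def parse_threshold(text):
    """Turn a string such as '< 2 hours' into (comparison function, limit)."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognized threshold: {text!r}")
    op, number, unit = match.groups()
    factor = UNIT_FACTORS.get(unit, 1)  # '%' and unitless values are unscaled
    return OPS[op], float(number) * factor


def classify(value, metric):
    """Check 'good', then 'warning'; anything else counts as critical."""
    for level in ("good", "warning"):
        compare, limit = parse_threshold(metric[level])
        if compare(value, limit):
            return level
    return "critical"


# Thresholds mirroring two entries from the KPI file.
time_to_apply_change = {"good": "< 30 minutes", "warning": "< 2 hours"}
cm_run_success_rate = {"good": ">= 99%", "warning": ">= 95%"}

print(classify(20, time_to_apply_change))     # good
print(classify(45, time_to_apply_change))     # warning
print(classify(120, time_to_apply_change))    # critical
print(classify(97.5, cm_run_success_rate))    # warning
```

Evaluating the levels in order and falling through to critical means the sketch never reads the `critical` string, so a value that sits exactly on a boundary (such as 120 minutes) always lands in some category.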

Progress Over Perfection

You don't need to implement everything immediately. Start with the most impactful practices for your context. Establish a baseline, then improve iteratively. A 10% improvement each quarter compounds to roughly 46% over a year (1.1^4 ≈ 1.46) and more than doubles the starting point within two years (1.1^8 ≈ 2.14). Consistency beats intensity.

Summary: Configuration Management Excellence

We've synthesized configuration management best practices across all dimensions. Let's consolidate the key insights:

Key Takeaways

•Structure configuration code thoughtfully — Apply software engineering principles: single responsibility, separation of concerns, explicit dependencies, consistent naming.
•Test rigorously — Static analysis, unit tests, integration tests, compliance tests. Verify idempotence. Automate everything.
•Prioritize security and compliance — Secure the control plane, audit all changes, encrypt secrets, express compliance as code.
•Deploy safely — Progressive rollouts, pre-flight checks, staged environments, clear rollback plans.
•Enable collaboration — Everything in Git, PR-based workflow, clear ownership, documentation, knowledge sharing.
•Improve continuously — Regular retrospectives, metrics-driven priorities, blameless postmortems, stay current.
•Culture matters most — The best tools fail without disciplined practices. Invest in culture alongside technology.

Module Complete:

With this page, you've completed a comprehensive journey through configuration management. You now understand the major tools (Ansible, Chef, Puppet), the paradigm choice (mutable vs immutable), the challenge of configuration drift, the criticality of secrets management, and the practices that enable excellence.

These skills form the foundation for managing infrastructure at any scale—from startups to global enterprises. Apply them thoughtfully, adapt them to your context, and continue learning as the field evolves.

Module Complete

Congratulations! You've completed the Configuration Management module. You possess comprehensive knowledge of configuration management tools, paradigms, challenges, and best practices. This foundation enables you to design, implement, and operate configuration management systems that provide reliable, secure, and maintainable infrastructure at scale.