Configuration management is a discipline that spans technology, process, and culture. Throughout this module, we've explored tools (Ansible, Chef, Puppet), paradigms (mutable vs immutable), challenges (configuration drift), and specialized concerns (secrets management). Now we synthesize these elements into a cohesive set of best practices that guide effective configuration management at scale.
These practices represent hard-won lessons from organizations managing infrastructure ranging from dozens to hundreds of thousands of servers. They are not dogma—context matters, and skilled engineers adapt practices to their specific situations. But they provide a foundation of principles that have proven effective across diverse environments.
This page covers best practices across all aspects of configuration management: code organization and structure, testing and validation, security and compliance, operational excellence, team collaboration, and continuous improvement. You'll develop a comprehensive framework for building and maintaining reliable, secure, and maintainable infrastructure.
Configuration code, like application code, benefits from thoughtful organization. Well-structured configuration is easier to understand, maintain, test, and evolve. Poorly organized configuration becomes a liability—a source of confusion, errors, and technical debt.
Fundamental Organizational Principles
Prefer descriptive names over abbreviations: web_server not ws, production not prod.
```text
# Recommended Configuration Repository Structure
# Applies to Ansible; similar patterns for Chef/Puppet

infrastructure/
├── README.md                      # Documentation starting point
├── CONTRIBUTING.md                # How to contribute
├── .pre-commit-config.yaml        # Pre-commit hooks
├── .gitignore
├── .sops.yaml                     # SOPS encryption configuration
│
├── ansible.cfg                    # Ansible configuration
├── requirements.yml               # External role dependencies
├── requirements.txt               # Python dependencies
│
├── inventories/                   # Environment-specific inventories
│   ├── production/
│   │   ├── hosts.yml              # Host inventory
│   │   ├── group_vars/
│   │   │   ├── all/
│   │   │   │   ├── vars.yml       # Common variables
│   │   │   │   └── vault.yml      # Encrypted variables (ansible-vault)
│   │   │   ├── webservers.yml
│   │   │   └── databases.yml
│   │   └── host_vars/
│   │       └── db-primary-1.yml
│   ├── staging/
│   │   └── (same structure)
│   └── development/
│       └── (same structure)
│
├── playbooks/                     # High-level orchestration
│   ├── site.yml                   # Complete site deployment
│   ├── webservers.yml             # Web tier deployment
│   ├── databases.yml              # Database tier deployment
│   ├── deploy-app.yml             # Application deployment
│   └── security-update.yml        # Security patching
│
├── roles/                         # Reusable roles
│   ├── common/                    # Applied to all hosts
│   │   ├── defaults/main.yml
│   │   ├── tasks/main.yml
│   │   ├── handlers/main.yml
│   │   ├── templates/
│   │   ├── files/
│   │   └── meta/main.yml
│   ├── nginx/
│   ├── postgresql/
│   ├── monitoring-agent/
│   └── security-hardening/
│
├── library/                       # Custom modules
│   └── custom_module.py
│
├── filter_plugins/                # Custom filters
│   └── custom_filters.py
│
├── molecule/                      # Test configurations
│   └── default/
│       ├── molecule.yml
│       ├── converge.yml
│       └── verify.yml
│
├── docs/                          # Documentation
│   ├── architecture.md
│   ├── runbooks/
│   │   ├── deployment.md
│   │   └── incident-response.md
│   └── decision-records/          # ADRs
│       └── 001-ansible-over-puppet.md
│
└── scripts/                       # Supporting scripts
    ├── bootstrap.sh
    └── validate.sh
```

The Role and Profile Pattern
Borrowed from Puppet but applicable across tools, this pattern separates technology roles (what software does) from business profiles (what a server is):
Roles are technology-specific: nginx, postgresql, prometheus-agent. They know how to install and configure one piece of software.
Profiles are business-specific: profile::webserver, profile::database_primary. They compose roles and apply business-specific configuration.
Node classification assigns profiles to servers: "This server IS a webserver" → apply profile::webserver.
This separation enables technology roles to be reused across profiles while business logic lives in profiles.
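The three layers above (roles, profiles, node classification) can be sketched as plain lookups. This is a minimal illustration and not how any particular tool implements the pattern; the host names, the `PROFILES` and `NODE_CLASSIFICATION` tables, and both helper functions are hypothetical.

```python
# Hypothetical data: role and profile names echo the examples in the text.
PROFILES = {
    "webserver": ["common", "security-hardening", "nginx", "monitoring-agent"],
    "database_primary": ["common", "security-hardening", "postgresql", "monitoring-agent"],
}

# Node classification: "this server IS a webserver".
NODE_CLASSIFICATION = {
    "web-1": "webserver",
    "db-primary-1": "database_primary",
}


def roles_for(host: str) -> list[str]:
    """Resolve host -> profile -> ordered list of technology roles."""
    profile = NODE_CLASSIFICATION[host]
    return list(PROFILES[profile])


def shared_roles() -> set[str]:
    """Roles that every profile reuses, which is the payoff of the separation."""
    role_sets = [set(roles) for roles in PROFILES.values()]
    return set.intersection(*role_sets)


if __name__ == "__main__":
    print(roles_for("web-1"))
    print(sorted(shared_roles()))
```

Adding a new kind of server in this model means adding one profile entry that composes existing roles; the roles themselves do not change.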
```yaml
# Role/Profile Pattern in Ansible
# Profiles compose roles with business-specific configuration

# profiles/webserver.yml - What a webserver IS
---
- name: Web Server Profile
  hosts: "{{ target_hosts }}"
  become: yes

  vars:
    # Business-specific defaults
    nginx_ssl_enabled: true
    monitoring_enabled: true
    log_retention_days: 30

  roles:
    # Base roles (order matters)
    - role: common
    - role: security-hardening

    # Technology roles
    - role: nginx
      nginx_worker_processes: "{{ ansible_processor_vcpus }}"
      nginx_worker_connections: 4096

    - role: app-runtime
      runtime: nodejs
      version: "20"

    # Operational roles
    - role: monitoring-agent
      when: monitoring_enabled

    - role: log-shipper
      log_paths:
        - /var/log/nginx
        - /opt/app/logs

# profiles/database_primary.yml - What a database primary IS
---
- name: Database Primary Profile
  hosts: "{{ target_hosts }}"
  become: yes

  vars:
    postgresql_role: primary
    replication_enabled: true
    backup_enabled: true

  roles:
    - role: common
    - role: security-hardening
      security_level: strict  # Databases get stricter hardening

    - role: postgresql
      postgresql_version: "15"
      postgresql_max_connections: 500
      postgresql_shared_buffers: "{{ (ansible_memtotal_mb * 0.25) | int }}MB"

    - role: postgresql-replication
      when: replication_enabled

    - role: backup-agent
      when: backup_enabled
      backup_type: postgresql
      backup_schedule: "0 2 * * *"

    - role: monitoring-agent
      custom_metrics:
        - postgresql_exporter
```

Don't create elaborate role hierarchies upfront. Start with simple, inline configuration. When you see repetition across playbooks, extract a role. When roles become complex, split them. Premature abstraction creates complexity without benefit. Let structure emerge from actual needs.
Configuration code is code. It deserves the same testing rigor as application code. Yet configuration testing is often neglected—changes are 'tested in production,' with predictable results. A comprehensive testing strategy catches errors before they reach production.
The Configuration Testing Pyramid
Like application testing, configuration testing forms a pyramid:
```yaml
# GitHub Actions: Comprehensive Configuration Testing Pipeline

name: Configuration Testing

on:
  pull_request:
    paths:
      - 'roles/**'
      - 'playbooks/**'
      - 'inventories/**'
  push:
    branches: [main]

jobs:
  lint:
    name: Static Analysis
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install dependencies
        run: |
          pip install ansible ansible-lint yamllint

      - name: YAML Lint
        run: yamllint .

      - name: Ansible Lint
        run: ansible-lint playbooks/ roles/

      - name: Syntax Check
        run: |
          for playbook in playbooks/*.yml; do
            ansible-playbook --syntax-check "$playbook"
          done

  unit-test:
    name: Unit Tests
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        role: [common, nginx, postgresql, monitoring-agent]
    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install Molecule
        run: |
          # The Docker driver ships in molecule-plugins; Testinfra is published as pytest-testinfra
          pip install molecule "molecule-plugins[docker]" ansible pytest pytest-testinfra

      - name: Run Molecule tests
        working-directory: roles/${{ matrix.role }}
        run: molecule test
        env:
          MOLECULE_DISTRO: ubuntu2204

  integration-test:
    name: Integration Tests
    runs-on: ubuntu-latest
    needs: unit-test
    steps:
      - uses: actions/checkout@v4

      - name: Setup test infrastructure
        run: |
          # Spin up test VMs or containers (Compose v2 syntax)
          docker compose -f tests/integration/docker-compose.yml up -d

      - name: Run integration playbook
        run: |
          ansible-playbook -i tests/integration/inventory.yml \
            playbooks/site.yml \
            --check --diff

      - name: Run integration tests
        run: |
          pytest tests/integration/ -v

      - name: Cleanup
        if: always()
        run: docker compose -f tests/integration/docker-compose.yml down

  compliance-test:
    name: Compliance Scanning
    runs-on: ubuntu-latest
    needs: integration-test
    steps:
      - uses: actions/checkout@v4

      - name: Install InSpec
        run: |
          curl https://omnitruck.chef.io/install.sh | sudo bash -s -- -P inspec

      - name: Run CIS benchmark
        run: |
          inspec exec compliance/cis-benchmark \
            -t docker://test-container \
            --reporter cli json:compliance-results.json

      - name: Upload compliance results
        uses: actions/upload-artifact@v4
        with:
          name: compliance-results
          path: compliance-results.json

  security-scan:
    name: Security Analysis
    runs-on: ubuntu-latest
    needs: lint
    steps:
      - uses: actions/checkout@v4

      - name: Secrets scanning
        uses: trufflesecurity/trufflehog@main
        with:
          path: ./
          base: ${{ github.event.repository.default_branch }}
          head: HEAD

      - name: Ansible security scan
        run: |
          pip install bandit
          bandit -r library/ filter_plugins/ -f json -o bandit-results.json || true

      - name: Check for hardcoded secrets
        run: |
          # Custom patterns for infrastructure secrets.
          # The step fails when grep finds a quoted literal assigned to a secret-like key.
          ! grep -rE "(password|secret|key)[[:space:]]*[:=][[:space:]]*['\"][^'\"]+['\"]" \
            --include="*.yml" --include="*.yaml" \
            inventories/ roles/ playbooks/
```
```yaml
# Molecule Configuration for nginx role
# Defines test matrix and scenario

dependency:
  name: galaxy
  options:
    requirements-file: requirements.yml

driver:
  name: docker

platforms:
  # Test on multiple distributions
  - name: ubuntu-22
    image: geerlingguy/docker-ubuntu2204-ansible
    pre_build_image: true
    privileged: true
    command: /lib/systemd/systemd
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    cgroupns_mode: host

  - name: debian-12
    image: geerlingguy/docker-debian12-ansible
    pre_build_image: true
    privileged: true
    command: /lib/systemd/systemd
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    cgroupns_mode: host

  - name: rocky-9
    image: geerlingguy/docker-rockylinux9-ansible
    pre_build_image: true
    privileged: true
    command: /lib/systemd/systemd
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
    cgroupns_mode: host

provisioner:
  name: ansible
  playbooks:
    converge: converge.yml
    verify: verify.yml
  inventory:
    group_vars:
      all:
        nginx_worker_processes: 2
        nginx_worker_connections: 1024

verifier:
  name: ansible

scenario:
  name: default
  test_sequence:
    - dependency
    # The "lint" step is omitted here: recent Molecule releases dropped it,
    # so run yamllint and ansible-lint as a separate CI job instead.
    - cleanup
    - destroy
    - syntax
    - create
    - prepare
    - converge
    - idempotence  # Critical: verify idempotence
    - verify
    - cleanup
    - destroy
```

Idempotence is non-negotiable. Configuration that isn't idempotent causes drift, breaks continuous enforcement, and makes debugging nearly impossible. If running your configuration twice produces different results, something is wrong. Molecule's idempotence test is your friend; don't skip it.
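To make the idempotence property concrete, here is a small sketch of the same "run it twice" check applied to two toy operations. The function names are hypothetical, and the check only loosely mirrors Molecule's idempotence step: the second application must report no change and leave state untouched.

```python
def ensure_line(lines: list[str], wanted: str) -> bool:
    """Idempotent: append `wanted` only if absent. Returns True when a change was made."""
    if wanted in lines:
        return False
    lines.append(wanted)
    return True


def append_line(lines: list[str], wanted: str) -> bool:
    """Not idempotent: always appends, so every run reports a change."""
    lines.append(wanted)
    return True


def is_idempotent(apply, initial: list[str], wanted: str) -> bool:
    """Apply twice; the second run must report no change and must not alter state."""
    state = list(initial)
    apply(state, wanted)
    after_first = list(state)
    changed_again = apply(state, wanted)
    return (not changed_again) and state == after_first


if __name__ == "__main__":
    print(is_idempotent(ensure_line, ["PermitRootLogin no"], "X11Forwarding no"))
    print(is_idempotent(append_line, ["PermitRootLogin no"], "X11Forwarding no"))
```

The non-idempotent variant is the pattern behind many drift bugs: a shell command that appends to a file on every run instead of declaring the desired end state.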
Configuration management systems have privileged access to infrastructure. They can create users, modify permissions, install software, and access secrets. This power demands rigorous security practices. A compromised CM system is a compromised infrastructure.
Security Principles for Configuration Management
```ruby
# InSpec Control: OS Security Hardening Compliance
# Verifies systems meet CIS benchmark requirements

title 'CIS Benchmark - Operating System Hardening'

control 'cis-1.1.1' do
  impact 1.0
  title 'Ensure mounting of cramfs filesystems is disabled'
  desc 'The cramfs filesystem type is a compressed read-only filesystem.'

  describe kernel_module('cramfs') do
    it { should_not be_loaded }
    it { should be_disabled }
    it { should be_blacklisted }
  end
end

control 'cis-1.4.1' do
  impact 1.0
  title 'Ensure bootloader password is set'
  desc 'Setting the boot loader password will require a password to reboot the server.'

  describe file('/boot/grub2/grub.cfg') do
    its('content') { should match(/^set superusers/) }
    its('content') { should match(/^password_pbkdf2/) }
  end
end

control 'cis-1.5.1' do
  impact 1.0
  title 'Ensure core dumps are restricted'
  desc 'A core dump is the memory of an executable program.'

  describe limits_conf do
    its('*') { should include ['hard', 'core', '0'] }
  end

  describe kernel_parameter('fs.suid_dumpable') do
    its('value') { should eq 0 }
  end
end

control 'cis-5.2.1' do
  impact 1.0
  title 'Ensure permissions on /etc/ssh/sshd_config are configured'

  describe file('/etc/ssh/sshd_config') do
    it { should exist }
    its('owner') { should eq 'root' }
    its('group') { should eq 'root' }
    its('mode') { should cmp '0600' }
  end
end

control 'cis-5.2.5' do
  impact 1.0
  title 'Ensure SSH LogLevel is appropriate'

  describe sshd_config do
    its('LogLevel') { should eq 'VERBOSE' }
  end
end

control 'cis-5.2.6' do
  impact 1.0
  title 'Ensure SSH X11 forwarding is disabled'

  describe sshd_config do
    its('X11Forwarding') { should eq 'no' }
  end
end

control 'cis-5.2.8' do
  impact 1.0
  title 'Ensure SSH root login is disabled'

  describe sshd_config do
    its('PermitRootLogin') { should eq 'no' }
  end
end

control 'cis-5.2.11' do
  impact 1.0
  title 'Ensure SSH PermitEmptyPasswords is disabled'

  describe sshd_config do
    its('PermitEmptyPasswords') { should eq 'no' }
  end
end

control 'cis-5.4.1.1' do
  impact 1.0
  title 'Ensure password expiration is 365 days or less'

  describe login_defs do
    its('PASS_MAX_DAYS') { should cmp <= 365 }
  end
end

control 'cis-5.4.1.4' do
  impact 1.0
  title 'Ensure inactive password lock is 30 days or less'

  describe command('useradd -D | grep INACTIVE') do
    its('stdout') { should match(/INACTIVE=(30|[1-2][0-9]|[1-9])$/) }
  end
end

control 'app-security-1' do
  impact 0.8
  title 'Application-specific: No development tools in production'

  only_if { os.linux? }
  only_if { input('environment') == 'production' }

  %w[gcc make gdb strace].each do |pkg|
    describe package(pkg) do
      it { should_not be_installed }
    end
  end
end
```

| Framework | Key Requirements | CM Integration |
|---|---|---|
| CIS Benchmarks | OS hardening, service configuration | InSpec profiles, Ansible/Chef hardening roles |
| SOC 2 | Access controls, change management, logging | Audit logging, PR-based changes, role segregation |
| PCI DSS | Encryption, access control, monitoring | Secrets management, compliance scanning, audit trails |
| HIPAA | Data protection, access logging, encryption | Encrypted data handling, access auditing |
| FedRAMP | Strict access control, continuous monitoring | Automated compliance testing, continuous enforcement |
Treat compliance requirements as code. Express requirements in InSpec, Open Policy Agent, or similar tools. Run compliance checks automatically in CI/CD and periodically against production. Shift compliance left—catch violations before deployment, not during audits.
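As a rough illustration of "compliance as code", the sketch below expresses three of the SSH requirements from the InSpec controls above as data and checks an `sshd_config` text against them. It is a simplified stand-in, not a replacement for InSpec or Open Policy Agent: the `REQUIRED_SSHD` table and both functions are hypothetical, and `Match` blocks and `Include` directives are deliberately ignored.

```python
# Requirements expressed as data (assumed policy, taken from the controls above).
REQUIRED_SSHD = {
    "PermitRootLogin": "no",
    "X11Forwarding": "no",
    "PermitEmptyPasswords": "no",
}


def parse_sshd_config(text: str) -> dict[str, str]:
    """Parse 'Keyword value' lines, skipping comments and blank lines."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts
        # Keywords are case-insensitive and sshd keeps the first value it sees.
        settings.setdefault(keyword.lower(), value.strip())
    return settings


def violations(text: str) -> list[str]:
    """Return one message per requirement that the config does not satisfy."""
    actual = parse_sshd_config(text)
    return [
        f"{keyword}: expected {wanted!r}, found {actual.get(keyword.lower())!r}"
        for keyword, wanted in REQUIRED_SSHD.items()
        if actual.get(keyword.lower()) != wanted
    ]


if __name__ == "__main__":
    sample = "# managed by CM\nPermitRootLogin no\nX11Forwarding yes\n"
    for message in violations(sample):
        print(message)
```

A check like this can run in CI against rendered templates (shifting left) and on a schedule against live hosts, failing the pipeline whenever `violations` returns a non-empty list.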
Configuration management in production requires operational discipline. It's not enough to write correct configurations—you must deploy them safely, monitor their effects, and respond to issues quickly.
Deployment Best Practices
```yaml
# Production Deployment Playbook with Safety Controls
# Demonstrates progressive rollout and validation
---
- name: Pre-flight validation
  hosts: localhost
  tasks:
    - name: Check change window
      # Blocks when the day OR the hour falls outside the window.
      # Assumes the control node clock is set to UTC.
      fail:
        msg: "Deployment blocked: Outside change window (Mon-Thu 10:00-16:00 UTC)"
      when:
        - not allow_outside_window | default(false)
        - >-
          ansible_date_time.weekday not in ['Monday', 'Tuesday', 'Wednesday', 'Thursday']
          or ansible_date_time.hour | int < 10
          or ansible_date_time.hour | int >= 16

    - name: Verify staging deployment was successful
      uri:
        url: "https://staging.example.com/health"
        status_code: 200
      register: staging_health
      until: staging_health.status == 200
      retries: 3
      delay: 5

    - name: Check current production health
      uri:
        url: "https://{{ item }}.example.com/health"
        status_code: 200
      loop:
        - www
        - api
      register: pre_deploy_health

- name: Deploy to canary hosts
  hosts: production_canary
  become: yes
  serial: 1
  max_fail_percentage: 0

  pre_tasks:
    - name: Drain connections from load balancer
      uri:
        url: "https://lb-api.example.com/drain/{{ inventory_hostname }}"
        method: POST
        headers:
          Authorization: "Bearer {{ lb_api_token }}"
      delegate_to: localhost

    - name: Wait for connections to drain
      wait_for:
        timeout: 60

  roles:
    - role: application-deploy
      app_version: "{{ deploy_version }}"

  post_tasks:
    - name: Verify application health
      uri:
        url: "http://localhost:8080/health"
        status_code: 200
      register: app_health
      until: app_health.status == 200
      retries: 5
      delay: 10

    - name: Re-enable in load balancer
      uri:
        url: "https://lb-api.example.com/enable/{{ inventory_hostname }}"
        method: POST
        headers:
          Authorization: "Bearer {{ lb_api_token }}"
      delegate_to: localhost

- name: Canary validation pause
  hosts: localhost
  tasks:
    - name: Wait for monitoring
      pause:
        minutes: 5
        prompt: "Canary deployed. Monitor dashboards. Press Enter to continue or Ctrl+C to abort."
      when: not auto_proceed | default(false)

    - name: Check error rate on canary
      # The Prometheus query API accepts the query as a form-encoded POST body
      uri:
        url: "https://prometheus.example.com/api/v1/query"
        method: POST
        body_format: form-urlencoded
        body:
          query: 'rate(http_errors_total{instance=~"canary.*"}[5m])'
      register: error_rate
      failed_when: error_rate.json.data.result[0].value[1] | float > 0.01

- name: Deploy to production (remaining hosts)
  hosts: production:!production_canary
  become: yes
  serial: "25%"
  max_fail_percentage: 10

  pre_tasks:
    - name: Drain connections
      # ... same as canary

  roles:
    - role: application-deploy
      app_version: "{{ deploy_version }}"

  post_tasks:
    - name: Verify health
      # ... same as canary

- name: Post-deployment validation
  hosts: localhost
  tasks:
    - name: Run integration tests against production
      command: pytest tests/integration/production/ -v
      delegate_to: localhost

    - name: Notify success
      slack:
        token: "{{ slack_token }}"
        channel: "#deployments"
        msg: "✅ Deployed {{ deploy_version }} to production successfully"
```

Monitoring and Observability
Configuration management must be observable. You need to know:
Production should never surprise you. If a CM run changes something unexpected, your testing or visibility is inadequate. Invest in preview/dry-run capabilities, diff outputs, and change impact analysis. Know what will happen before it happens.
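A diff preview is the simplest form of "know what will happen before it happens". The sketch below uses Python's standard `difflib` to show what a run would change in one managed file; `preview_change` and the sample nginx lines are hypothetical, and real tools (for example Ansible's `--check --diff`) do considerably more.

```python
import difflib


def preview_change(current: str, desired: str, path: str) -> str:
    """Return a unified diff of what a run would change; empty string if nothing."""
    diff = difflib.unified_diff(
        current.splitlines(keepends=True),
        desired.splitlines(keepends=True),
        fromfile=f"{path} (current)",
        tofile=f"{path} (desired)",
    )
    return "".join(diff)


if __name__ == "__main__":
    current = "worker_processes 2;\nworker_connections 1024;\n"
    desired = "worker_processes 2;\nworker_connections 4096;\n"
    print(preview_change(current, desired, "/etc/nginx/nginx.conf") or "no changes")
```

An empty result is itself useful information: a run that is expected to be a no-op but produces a diff points to drift or to an untested change.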
Configuration management is a team sport. Effective CM requires collaboration between developers, operations, security, and platform teams. The practices that enable this collaboration are as important as the technical tools.
Collaboration Patterns
```markdown
# ADR 003: Using Ansible for Configuration Management

## Status
Accepted

## Context
We need to choose a configuration management tool for our growing infrastructure.
Options considered:
- Ansible
- Terraform + cloud-init
- Chef
- Puppet

Our team has Python experience but limited Ruby experience.
We run on multiple cloud providers (AWS, GCP) with some on-premises.
We need both provisioning and ongoing configuration management.

## Decision
We will use Ansible for configuration management, with Terraform for infrastructure provisioning.

### Reasons:
1. **Agentless architecture** - Simpler operations, no agent infrastructure to maintain
2. **Python-based** - Aligns with team skills
3. **YAML syntax** - Lower learning curve for DevOps-adjacent developers
4. **Multi-cloud support** - Strong modules for AWS, GCP, and others
5. **Complement to Terraform** - Clear separation: Terraform provisions, Ansible configures

### Trade-offs accepted:
- No continuous enforcement (mitigated with scheduled runs and drift detection)
- Push-based can be slower at scale (acceptable for our ~500 hosts)
- Less mature testing story than Chef

## Consequences
- All team members will receive Ansible training
- We will use Ansible Galaxy for standard roles where possible
- Custom roles will follow the organizational standard structure
- Integration with Terraform will use dynamic inventory
- Scheduled Ansible runs will enforce configuration hourly

## Related
- ADR 001: Terraform for Infrastructure Provisioning
- ADR 002: Git-based Workflow for Infrastructure Changes
```

Culture and Mindset
The most important best practice is cultural: treat infrastructure as a first-class engineering concern. Configuration management isn't 'ops work'—it's engineering work that requires the same rigor, quality, and professionalism as application development.
| Anti-Pattern | Symptom | Better Approach |
|---|---|---|
| Cowboy Operations | Changes made directly on servers | All changes through code and PRs |
| Undocumented Knowledge | Only one person knows how X works | Document everything, cross-train |
| Fear of Change | Avoid updates because 'it works' | Frequent small changes, good testing |
| Blame Culture | Incidents blamed on individuals | Blameless postmortems, system improvement |
| Siloed Teams | Dev vs Ops vs Security conflicts | Shared ownership, embedded practices |
| Manual Heroics | Incidents require specific person | Runbooks, automation, shared on-call |
If one person's absence would cripple your configuration management, you have a bus factor problem. Actively work to distribute knowledge: pair programming, documentation, cross-training, and regular rotation of responsibilities. No single point of failure—in people or systems.
Configuration management is never 'done.' The infrastructure evolves, requirements change, tools improve, and the team learns. A structured approach to continuous improvement ensures your CM practices remain effective over time.
Improvement Practices
```yaml
# Configuration Management Key Performance Indicators
# Track these metrics to measure and improve CM effectiveness

availability_metrics:
  - name: cm_run_success_rate
    description: Percentage of CM runs completing successfully
    target: ">= 99%"
    good: ">= 99%"
    warning: ">= 95%"
    critical: "< 95%"

  - name: time_to_apply_change
    description: Time from merge to production deployment
    target: "< 30 minutes"
    good: "< 30 minutes"
    warning: "< 2 hours"
    critical: ">= 2 hours"

quality_metrics:
  - name: drift_rate
    description: Percentage of resources with detected drift
    target: "< 1%"
    good: "< 1%"
    warning: "< 5%"
    critical: ">= 5%"

  - name: test_coverage
    description: Percentage of roles with Molecule tests
    target: ">= 90%"
    good: ">= 90%"
    warning: ">= 70%"
    critical: "< 70%"

  - name: compliance_pass_rate
    description: Percentage of hosts passing compliance scans
    target: ">= 98%"
    good: ">= 98%"
    warning: ">= 90%"
    critical: "< 90%"

velocity_metrics:
  - name: deployment_frequency
    description: Number of production deployments per week
    target: ">= 10"  # Multiple per day
    good: ">= 10"
    warning: ">= 3"
    critical: "< 3"

  - name: change_failure_rate
    description: Percentage of deployments causing issues
    target: "< 5%"
    good: "< 5%"
    warning: "< 15%"
    critical: ">= 15%"

  - name: mean_time_to_recovery
    description: Average time to recover from CM-related incidents
    target: "< 1 hour"
    good: "< 1 hour"
    warning: "< 4 hours"
    critical: ">= 4 hours"

operational_health:
  - name: secrets_rotation_compliance
    description: Percentage of secrets rotated within policy
    target: ">= 95%"

  - name: documentation_freshness
    description: Percentage of docs updated within 30 days of related changes
    target: ">= 80%"

  - name: onboarding_time
    description: Days for new team member to make first production change
    target: "< 5 days"
```

You don't need to implement everything immediately. Start with the most impactful practices for your context. Establish a baseline, then improve iteratively. A 10% improvement each quarter compounds to roughly 46% over a year (1.1 to the fourth power is about 1.46). Consistency beats intensity.
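Threshold strings like those in the KPI file can be evaluated mechanically. The sketch below is one possible way to do that, under stated assumptions: `meets`, `classify`, and `compounded` are hypothetical helpers, units in the threshold string are ignored, and the measured value must therefore already be in the same unit as both thresholds (which holds for the percentage metrics but not for mixed "minutes" and "hours" thresholds).

```python
import operator
import re

_OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt, "<": operator.lt}
_THRESHOLD = re.compile(r"^\s*(>=|<=|>|<)\s*([0-9.]+)")


def meets(value: float, threshold: str) -> bool:
    """Evaluate a threshold such as '>= 99%' or '< 5%'; the unit suffix is ignored."""
    match = _THRESHOLD.match(threshold)
    if match is None:
        raise ValueError(f"unrecognized threshold: {threshold!r}")
    op, number = match.groups()
    return _OPS[op](value, float(number))


def classify(value: float, good: str, warning: str) -> str:
    """Return 'good', 'warning', or 'critical'; anything else falls through to critical."""
    if meets(value, good):
        return "good"
    if meets(value, warning):
        return "warning"
    return "critical"


def compounded(rate_per_period: float, periods: int) -> float:
    """Total relative improvement after compounding, e.g. 0.10 over 4 quarters."""
    return (1 + rate_per_period) ** periods - 1


if __name__ == "__main__":
    print(classify(3.0, good="< 1%", warning="< 5%"))      # drift_rate of 3%
    print(classify(97.0, good=">= 99%", warning=">= 95%"))  # cm_run_success_rate of 97%
    print(f"{compounded(0.10, 4):.1%}")
```

Treating "critical" as the fall-through case means a value sitting exactly on a boundary can never go unclassified.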
We've synthesized configuration management best practices across all dimensions. Let's consolidate the key insights:
Module Complete:
With this page, you've completed a comprehensive journey through configuration management. You now understand the major tools (Ansible, Chef, Puppet), the paradigm choice (mutable vs immutable), the challenge of configuration drift, the criticality of secrets management, and the practices that enable excellence.
These skills form the foundation for managing infrastructure at any scale—from startups to global enterprises. Apply them thoughtfully, adapt them to your context, and continue learning as the field evolves.
Congratulations! You've completed the Configuration Management module. You possess comprehensive knowledge of configuration management tools, paradigms, challenges, and best practices. This foundation enables you to design, implement, and operate configuration management systems that provide reliable, secure, and maintainable infrastructure at scale.