At 3:47 AM on a Sunday, your phone screams with a critical alert. The primary database cluster is down. The on-call engineer who wrote the recovery procedures left the company six months ago. You're staring at a runbook that says 'Follow standard database failover procedures' and wondering what 'standard' means.
This scenario plays out constantly across the industry. Runbooks that seemed clear when written become incomprehensible under pressure. Procedures that worked last year fail because infrastructure has changed. Documentation that assumes expert knowledge becomes useless when the expert isn't available.
The quality of your runbooks directly determines whether recovery succeeds or fails. A well-crafted runbook enables a junior engineer, woken at 3 AM, unfamiliar with the specific system, under extreme stress, to successfully execute recovery procedures. A poorly crafted runbook becomes another obstacle in an already chaotic situation.
This page teaches you to build runbooks that actually work when needed most.
By the end of this page, you will understand how to create runbooks that pass the '3 AM test'—procedures that can be successfully executed by engineers under stress with limited context. You'll learn structure, content, validation techniques, and maintenance practices that keep runbooks effective over time.
An effective runbook is not just documentation—it's an operational tool. The difference is crucial:
Documentation explains systems and concepts. It helps someone understand.
A runbook enables action under pressure. It tells someone exactly what to do, step by step, with verification at each stage.
Effective runbooks share several characteristics:
Before finalizing any runbook, apply the 3 AM test: can an engineer who was just woken up, is stressed, has limited context on this specific system, and cannot reach the original author still execute this procedure successfully? If the answer is 'probably not,' the runbook needs work.
A well-structured runbook follows a consistent format that operators can navigate quickly even under stress:
````markdown
# Runbook: [System Name] - [Procedure Name]

## Metadata
| Field | Value |
|-------|-------|
| **Owner** | [Team/Person responsible for this runbook] |
| **Last Tested** | [Date of last validation test] |
| **Last Updated** | [Date of last content update] |
| **Expected Duration** | [Typical time to complete procedure] |
| **Required Access** | [List of access/permissions needed] |
| **Related Runbooks** | [Links to dependent or related procedures] |

## Overview

### Purpose
[One paragraph explaining what this runbook accomplishes and when to use it]

### When to Use This Runbook
- [Trigger condition 1]
- [Trigger condition 2]
- [Explicit scope: what this runbook does and does NOT cover]

### Pre-Requisites
- [ ] Access to [specific system/tool] verified
- [ ] [Required credential/key] available
- [ ] [Dependent system] is operational
- [ ] [Any other prerequisites]

### Impact and Risks
| Risk | Likelihood | Mitigation |
|------|------------|------------|
| [Potential negative outcome] | [High/Medium/Low] | [How to prevent or recover] |

---

## Procedure

### Step 1: [Action Name]

**Action:**
```bash
# Copy-paste ready command
example-command --with-parameters
```

**Expected Result:**
```
Expected output you should see
```

**Verification:**
- [ ] Output matches expected result
- [ ] [Additional verification step]

**If Step Fails:**
- [ ] Check [common failure reason]
- [ ] Try [alternative approach]
- [ ] If still failing, escalate to [contact] or proceed to Rollback section

---

### Step 2: [Action Name]

**Action:**
[Detailed action instructions with copy-paste commands]

**Expected Result:**
[What you should see if successful]

**Verification:**
- [ ] [Verification checklist]

**If Step Fails:**
[Specific troubleshooting for this step]

---

### Step N: Verification Complete

**Final Verification Checklist:**
- [ ] [System component 1] is operational
- [ ] [System component 2] returns expected response
- [ ] Monitoring shows healthy metrics
- [ ] Test transaction succeeds

---

## Rollback Procedure

### When to Rollback
Use this rollback procedure if:
- [Condition 1]
- [Condition 2]
- Any step fails and troubleshooting doesn't resolve within [X minutes]

### Rollback Steps

**Step R1: [Undo Action]**
```bash
rollback-command
```

[Continue with explicit rollback steps for each forward step]

---

## Contacts and Escalation

| Role | Name | Contact | Availability |
|------|------|---------|--------------|
| Primary Escalation | [Name] | [Phone/Slack] | [Hours] |
| Database Expert | [Name] | [Phone/Slack] | [Hours] |
| Vendor Support | [Company] | [Number/Portal] | [SLA] |

---

## Revision History

| Date | Author | Changes |
|------|--------|---------|
| YYYY-MM-DD | [Name] | [Summary of changes] |
````

Key Structural Elements Explained:
Metadata Section: Critically important for assessing runbook validity. If 'Last Tested' is 18 months ago, proceed with extra caution. If 'Last Updated' predates a major infrastructure change, the runbook may be stale.
Pre-Requisites Checklist: Operators should verify these BEFORE starting the procedure. Nothing is more frustrating than getting halfway through recovery and discovering you don't have required access.
Step-by-Step Actions: Each step must be atomic and verifiable. If a step can't be verified, break it into smaller steps until each has a clear success/failure indicator.
Rollback Procedure: This is not optional. Every action that modifies state should have an undo. The rollback procedure should be as detailed as the forward procedure.
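The structural elements above lend themselves to a mechanical check. What follows is a minimal sketch, assuming runbooks are stored as Markdown and use the heading names from the template on this page. `lintRunbook`, `linesOutsideCode`, and `REQUIRED_SECTIONS` are hypothetical names for illustration, not part of any existing tool.

```typescript
// Hypothetical structural lint for a Markdown runbook. The heading names are
// assumptions copied from the template on this page; adjust them to your own.

interface LintIssue {
  section: string;
  message: string;
}

const REQUIRED_SECTIONS: string[] = [
  '## Metadata',
  '### Pre-Requisites',
  '## Procedure',
  '## Rollback Procedure',
];

// Three backticks, written as escapes so this snippet can sit inside a fence.
const FENCE = '\u0060\u0060\u0060';

// Return trimmed lines with fenced code removed, so that shell comments
// such as "# Promote the replica" are not mistaken for headings.
function linesOutsideCode(markdown: string): string[] {
  const kept: string[] = [];
  let inFence = false;
  for (const raw of markdown.split('\n')) {
    const line = raw.trim();
    if (line.indexOf(FENCE) === 0) {
      inFence = !inFence;
      continue;
    }
    if (!inFence) {
      kept.push(line);
    }
  }
  return kept;
}

function lintRunbook(markdown: string): LintIssue[] {
  const issues: LintIssue[] = [];
  const lines = linesOutsideCode(markdown);

  // 1. Every required section heading must be present.
  for (const heading of REQUIRED_SECTIONS) {
    if (lines.indexOf(heading) === -1) {
      issues.push({ section: heading, message: 'Required section is missing' });
    }
  }

  // 2. Every "### Step ..." block must contain a verification marker before
  //    the next heading, so each step has a clear success/failure indicator.
  for (let i = 0; i < lines.length; i++) {
    if (!/^### Step /.test(lines[i])) {
      continue;
    }
    let verified = false;
    for (let j = i + 1; j < lines.length && !/^#{1,6} /.test(lines[j]); j++) {
      if (/^\*\*(Final )?Verification( Checklist)?:\*\*/.test(lines[j])) {
        verified = true;
      }
    }
    if (!verified) {
      issues.push({ section: lines[i], message: 'Step has no verification block' });
    }
  }
  return issues;
}
```

A check like this only confirms that the sections exist. Whether the rollback steps are actually as detailed as the forward steps still needs a human reader.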
The quality of individual procedure steps determines whether a runbook succeeds. Here are principles for writing steps that work under pressure:
Be Explicit, Not Implicit:
Provide Copy-Paste Commands:
When a command is required, provide it exactly as it should be run. Include:
````markdown
### Step 3: Promote Read Replica to Primary

**Action:**
SSH to the database bastion host and run the promotion command:

```bash
# SSH to bastion (requires VPN connection)
ssh -i ~/.ssh/prod-bastion.pem ec2-user@bastion.prod.internal

# Promote the DR replica to primary
# This command takes 2-5 minutes to complete
aws rds promote-read-replica \
  --db-instance-identifier prod-db-replica-dr \
  --region us-west-2

# Monitor promotion status (run repeatedly until status = "available")
aws rds describe-db-instances \
  --db-instance-identifier prod-db-replica-dr \
  --region us-west-2 \
  --query 'DBInstances[0].DBInstanceStatus'
```

**Expected Result:**
After 2-5 minutes, the status query should return:
```
"available"
```

**Verification:**
```bash
# Connect to the promoted instance and verify write capability
psql -h prod-db-replica-dr.xxxx.us-west-2.rds.amazonaws.com -U admin -d production
```

At the psql prompt:
```sql
-- Run a test write (this table exists for DR testing)
INSERT INTO dr_test_writes (test_id, created_at) VALUES ('manual-test', NOW());

-- Verify the write succeeded
SELECT * FROM dr_test_writes WHERE test_id = 'manual-test';
```

**If This Step Fails:**
1. If promotion command returns error: Check replica status is "available" before attempting promotion
2. If still "modifying" after 10 minutes: Check RDS console for detailed status; may need AWS support
3. If can't connect after promotion: Verify security group allows inbound from bastion
4. If write test fails: Promotion may not be complete; wait and retry in 2 minutes
````

Handle Variations Explicitly:
Real-world procedures often have branches. When the action depends on a condition, use clear decision trees:
````markdown
### Step 5: Restore Application Configuration

First, determine which configuration source is available:

**Check 1: Is the Parameter Store accessible?**
```bash
aws ssm get-parameter --name /prod/config/version --region us-west-2
```

**If Parameter Store is accessible:** → Proceed to Step 5a
**If Parameter Store returns error:** → Proceed to Step 5b

---

#### Step 5a: Load Configuration from Parameter Store

```bash
# Pull configuration from Parameter Store
./scripts/load-config-from-ssm.sh --environment production --region us-west-2
```

**Expected Result:** Script outputs "Configuration loaded successfully"

**Then:** → Skip to Step 6

---

#### Step 5b: Load Configuration from Backup

The parameter store is unavailable. Use the configuration backup:

```bash
# Locate most recent configuration backup
aws s3 ls s3://prod-config-backups/daily/ --recursive | tail -1

# Download and apply the backup (substitute actual filename)
aws s3 cp s3://prod-config-backups/daily/<YYYY-MM-DD>/config.tar.gz ./
tar -xzf config.tar.gz
./scripts/apply-config-backup.sh --source ./config/

# Note: This backup may be up to 24 hours old.
# Document any recent configuration changes that may be missing.
```

**⚠️ Warning:** Configuration backup may not include changes from the last 24 hours.
After recovery, verify configuration matches expected values for recent deployments.

**Then:** → Proceed to Step 6
````

Every step in a runbook must have clear verification criteria. Verification answers: How do I know this step worked?
Types of Verification:
Output Verification: The command produces expected output. Document exactly what the output should look like:
````markdown
### Output Verification Example

**Command:**
```bash
kubectl get pods -n production -l app=order-service
```

**Expected Output (Success):**
```
NAME                             READY   STATUS    RESTARTS   AGE
order-service-5d8f9b7c4d-abc12   3/3     Running   0          2m
order-service-5d8f9b7c4d-def34   3/3     Running   0          2m
order-service-5d8f9b7c4d-ghi56   3/3     Running   0          2m
```
All pods should show READY 3/3 and STATUS Running.

**Failure Indicators:**
- READY shows less than 3/3 → Container not starting; check logs
- STATUS shows CrashLoopBackOff → Application error; check logs
- STATUS shows Pending → Scheduling issue; check node capacity
- Fewer than 3 pods → Check replica count in deployment

---

### Health Check Verification Example

**Action:**
```bash
curl -s https://api.example.com/health | jq .
```

**Expected Output (Success):**
```json
{
  "status": "healthy",
  "version": "2.4.1",
  "database": "connected",
  "cache": "connected",
  "timestamp": "2024-01-15T10:30:00Z"
}
```

**Success Criteria:**
- "status" is "healthy"
- "database" is "connected"
- "cache" is "connected"

**Failure Indicators:**
- "status" is "degraded" → Some dependencies failing; check database/cache fields
- "database" is "disconnected" → Database connectivity issue; return to database steps
- Request timeout → Service not yet ready; wait 30 seconds and retry
````

State Verification: Beyond command output, some steps require verifying system state has actually changed:
````markdown
### State Verification Example: DNS Propagation

**Action:** Update DNS A record from old IP to new IP

**Verification:**
DNS changes may take time to propagate. Verify propagation before proceeding:

```bash
# Check DNS resolution from multiple locations
# Should return NEW IP: 10.0.2.5

# Local resolution
dig +short api.example.com

# Google DNS
dig +short api.example.com @8.8.8.8

# Cloudflare DNS
dig +short api.example.com @1.1.1.1
```

**Success Criteria:**
All three queries should return: `10.0.2.5`

**⏱️ Timing:**
- Most queries should resolve within 60 seconds
- If still showing old IP after 5 minutes, check Route53 change status
- Maximum propagation time: 15 minutes (based on TTL settings)

**If Propagation Stalls:**
1. Verify Route53 change completed: Check change status in console
2. Consider CDN caching: May need to purge CDN DNS cache
3. If urgent: Direct traffic via /etc/hosts or load balancer, bypassing DNS
````

After writing each step, ask: 'If I execute this step, how do I know it worked?' If you can't answer with a specific, observable verification, the step is incomplete. Add explicit success criteria before moving on.
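One way to keep verification specific and observable is to write the success criteria as data, with the troubleshooting hint attached to each one. The sketch below is illustrative only: `Criterion`, `verifyStep`, and `healthCriteria` are hypothetical names, the field names come from the health-check example above, and the hint for the `cache` field is an assumption because the example gives none.

```typescript
// One expected field value, plus what the operator should do if it is wrong.
interface Criterion {
  field: string;
  expected: string;
  onFailure: string;
}

interface VerificationResult {
  passed: boolean;
  failures: string[];
}

// Compare observed values against every criterion and collect a readable
// failure message for each mismatch. Extra observed fields are ignored.
function verifyStep(
  observed: { [field: string]: string | undefined },
  criteria: Criterion[],
): VerificationResult {
  const failures: string[] = [];
  for (const c of criteria) {
    const actual = observed[c.field];
    if (actual !== c.expected) {
      const seen = actual === undefined ? 'missing' : '"' + actual + '"';
      failures.push(c.field + ' is ' + seen + ', expected "' + c.expected + '": ' + c.onFailure);
    }
  }
  return { passed: failures.length === 0, failures: failures };
}

// Success criteria from the health-check example on this page.
const healthCriteria: Criterion[] = [
  { field: 'status', expected: 'healthy', onFailure: 'check database/cache fields' },
  { field: 'database', expected: 'connected', onFailure: 'return to database steps' },
  // Assumed hint: the example above does not say what to do for the cache.
  { field: 'cache', expected: 'connected', onFailure: 'check cache connectivity' },
];
```

Calling `verifyStep` with the parsed health response returns `passed: true` only when all three fields match. Otherwise each failure message names the field, the value seen, and the next action, which is the information a tired operator needs.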
A runbook that hasn't been tested is just a theory. Validation ensures that documented procedures actually work:
Validation Methods:
1. Author Walkthrough: The runbook author executes the procedure in a test environment while timing each step. This catches obvious errors but misses assumptions the author makes implicitly.
2. Peer Execution: Someone who didn't write the runbook executes it without author assistance. This is the critical test—it exposes unstated assumptions, missing context, and confusing instructions.
3. Production Execution: During scheduled DR tests, runbooks are executed against production (or production-like) systems. This validates that procedures work with real data and at production scale.
| Level | Who Executes | Environment | Frequency | What It Catches |
|---|---|---|---|---|
| Author Walkthrough | Runbook author | Test/Staging | Every update | Syntax errors, missing commands, sequence issues |
| Peer Execution | Different engineer | Test/Staging | Quarterly | Unstated assumptions, confusing instructions, tribal knowledge |
| Simulated DR Test | On-call rotation | DR environment | Semi-annually | Integration issues, timing problems, access issues |
| Production DR Test | DR team | Production | Annually | Scale issues, real data edge cases, full recovery timing |
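The cadence in this table can be turned into a simple "is this validation due?" check. The following is a rough sketch under stated assumptions: quarterly, semi-annually, and annually are approximated as 90, 180, and 365 days, and "every update" is treated as "due whenever the runbook was modified after the last author walkthrough". `isValidationDue` and `MAX_AGE_DAYS` are hypothetical names.

```typescript
type ValidationLevel = 'author' | 'peer' | 'simulated-dr' | 'production-dr';

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Approximate day limits for each level (assumption: 90/180/365 days).
// null means the level is triggered by updates rather than by elapsed time.
const MAX_AGE_DAYS: { [L in ValidationLevel]: number | null } = {
  'author': null,
  'peer': 90,
  'simulated-dr': 180,
  'production-dr': 365,
};

function isValidationDue(
  level: ValidationLevel,
  lastRun: Date | null,
  lastModified: Date,
  now: Date,
): boolean {
  // Never validated at this level: always due.
  if (lastRun === null) {
    return true;
  }
  const limit = MAX_AGE_DAYS[level];
  if (limit === null) {
    // Author walkthrough: due if the runbook changed after the last one.
    return lastModified.getTime() > lastRun.getTime();
  }
  const ageDays = Math.floor((now.getTime() - lastRun.getTime()) / MS_PER_DAY);
  return ageDays > limit;
}
```

For example, a peer execution from 2024-01-15 is not yet due on 2024-03-01 (46 days) but is due on 2024-05-01 (107 days).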
Peer Execution Test Protocol:
The most valuable validation is having someone unfamiliar with the procedure execute it:
Runbook authors suffer from the Curse of Knowledge—they can't unknow what they know. Steps that seem obvious to the author may be opaque to others. Peer execution is the only cure. Budget time for at least one peer validation of every critical runbook.
```typescript
// Runbook Validation Tracking

interface RunbookValidation {
  runbookId: string;
  runbookName: string;
  lastModified: Date;
  validations: ValidationRecord[];
  overallStatus: 'validated' | 'needs-review' | 'stale' | 'never-validated';
  nextValidationDue: Date;
}

interface ValidationRecord {
  date: Date;
  type: 'author' | 'peer' | 'simulated-dr' | 'production-dr';
  environment: string;
  executor: string;
  result: 'passed' | 'passed-with-issues' | 'failed';
  totalDuration: number; // minutes
  findings: ValidationFinding[];
}

interface ValidationFinding {
  stepNumber: number;
  severity: 'critical' | 'major' | 'minor' | 'suggestion';
  description: string;
  resolution: string;
  resolved: boolean;
}

// Example validation record
const dbFailoverValidation: RunbookValidation = {
  runbookId: 'rb-db-failover-001',
  runbookName: 'Database Failover Procedure',
  lastModified: new Date('2024-01-10'),
  validations: [
    {
      date: new Date('2024-01-15'),
      type: 'peer',
      environment: 'staging',
      executor: 'Alex Chen',
      result: 'passed-with-issues',
      totalDuration: 45,
      findings: [
        {
          stepNumber: 3,
          severity: 'major',
          description: 'Step says "connect to bastion" but does not specify which bastion or provide SSH command',
          resolution: 'Added explicit bastion hostname and full SSH command with key path',
          resolved: true
        },
        {
          stepNumber: 7,
          severity: 'minor',
          description: 'Expected output shows old database version; actual is newer',
          resolution: 'Updated expected output to reflect current version',
          resolved: true
        },
        {
          stepNumber: 12,
          severity: 'critical',
          description: 'Verification command fails with "permission denied" - required IAM policy not documented',
          resolution: 'Added required IAM permissions to prerequisites section',
          resolved: true
        },
        {
          stepNumber: 15,
          severity: 'suggestion',
          description: 'Would help to have a summary checklist at the end of all verification steps',
          resolution: 'Added final verification checklist section',
          resolved: true
        }
      ]
    }
  ],
  overallStatus: 'validated',
  nextValidationDue: new Date('2024-04-15') // about 90 days after last validation
};

function assessRunbookHealth(validation: RunbookValidation): {
  status: string;
  recommendation: string;
} {
  if (validation.validations.length === 0) {
    return {
      status: 'CRITICAL',
      recommendation: 'Runbook has never been validated. Schedule immediate peer execution test.'
    };
  }

  // Use the most recent validation, regardless of the order records are stored in.
  const latest = validation.validations.reduce((a, b) => (b.date > a.date ? b : a));
  const now = new Date();
  const daysSinceModified = daysBetween(validation.lastModified, now);
  const daysSinceValidation = daysBetween(latest.date, now);

  // Modified more recently than it was validated.
  if (daysSinceModified > 0 && daysSinceModified < daysSinceValidation) {
    return {
      status: 'WARNING',
      recommendation: 'Runbook modified since last validation. Re-validation required.'
    };
  }

  if (daysSinceValidation > 180) {
    return {
      status: 'STALE',
      recommendation: 'Last validation over 6 months ago. Schedule re-validation.'
    };
  }

  if (daysSinceValidation > 90) {
    return {
      status: 'REVIEW',
      recommendation: 'Validation approaching staleness. Plan re-validation within 30 days.'
    };
  }

  return {
    status: 'HEALTHY',
    recommendation: 'Runbook recently validated. No action required.'
  };
}

// Whole days elapsed from date1 to date2.
function daysBetween(date1: Date, date2: Date): number {
  return Math.floor((date2.getTime() - date1.getTime()) / (1000 * 60 * 60 * 24));
}
```

Runbooks decay rapidly. Infrastructure changes, tools are updated, personnel rotate. A runbook that worked six months ago may be dangerously outdated today. Systematic maintenance is essential:
Triggers for Runbook Updates:
Ownership Model:
Every runbook needs an owner—a person or team responsible for its accuracy:
Store runbooks in version control (Git). This provides change history, enables code review for updates, allows rollback if changes introduce errors, and integrates with CI/CD for automated validation. Treat runbooks as code, not documents.
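Once runbooks live in Git, a CI job can reject or flag a runbook whose metadata looks stale. The sketch below assumes the metadata table from the template on this page, with the `Last Tested` value written as an ISO date (`YYYY-MM-DD`). `checkLastTested` is a hypothetical helper, and the age limit is whatever your team chooses.

```typescript
interface FreshnessResult {
  ok: boolean;
  reason: string;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Find the "| **Last Tested** | YYYY-MM-DD |" row in a runbook's metadata
// table and check that the date is no older than maxAgeDays.
function checkLastTested(markdown: string, now: Date, maxAgeDays: number): FreshnessResult {
  const match = /\|\s*\*\*Last Tested\*\*\s*\|\s*(\d{4}-\d{2}-\d{2})\s*\|/.exec(markdown);
  if (match === null) {
    return { ok: false, reason: 'No "Last Tested" date in YYYY-MM-DD form found in metadata' };
  }

  const tested = new Date(match[1] + 'T00:00:00Z');
  if (isNaN(tested.getTime())) {
    return { ok: false, reason: 'Last Tested value "' + match[1] + '" is not a valid date' };
  }

  const ageDays = Math.floor((now.getTime() - tested.getTime()) / DAY_MS);
  if (ageDays > maxAgeDays) {
    return { ok: false, reason: 'Last tested ' + ageDays + ' days ago (limit ' + maxAgeDays + ')' };
  }
  return { ok: true, reason: 'Last tested ' + ageDays + ' days ago' };
}
```

A pipeline step could run this over every changed runbook file and fail the build when `ok` is false. An unfilled template placeholder such as `[Date of last validation test]` is then caught the same way as a stale date.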
Automated Maintenance Checks:
Some maintenance can be automated:
```typescript
// Automated Runbook Health Monitoring

import { promises as fs } from 'fs';

interface RunbookHealthCheck {
  checkType: string;
  checkFrequency: string;
  automatable: boolean;
  implementation: string;
}

const healthChecks: RunbookHealthCheck[] = [
  {
    checkType: 'Link Validity',
    checkFrequency: 'Daily',
    automatable: true,
    implementation: 'Parse runbooks for URLs, HTTP HEAD each, alert on 4xx/5xx'
  },
  {
    checkType: 'Contact Currency',
    checkFrequency: 'Weekly',
    automatable: true,
    implementation: 'Cross-reference contacts against HR/directory system, flag mismatches'
  },
  {
    checkType: 'Referenced Resources',
    checkFrequency: 'Daily',
    automatable: true,
    implementation: 'Parse referenced AWS resources, verify they exist via API'
  },
  {
    checkType: 'Age Check',
    checkFrequency: 'Weekly',
    automatable: true,
    implementation: 'Flag runbooks not updated in 90+ days for review'
  },
  {
    checkType: 'Validation Currency',
    checkFrequency: 'Weekly',
    automatable: true,
    implementation: 'Flag runbooks not validated in 180+ days'
  },
  {
    checkType: 'Infrastructure Drift',
    checkFrequency: 'On-change',
    automatable: true,
    implementation: 'Integrate with IaC pipeline; flag runbooks referencing changed resources'
  },
  {
    checkType: 'Semantic Content',
    checkFrequency: 'Quarterly',
    automatable: false,
    implementation: 'Human review for accuracy, completeness, clarity'
  }
];

interface LinkCheckResult {
  url: string;
  status: number;
  healthy: boolean;
  error?: string;
}

// Example: Automated link checking
async function checkRunbookLinks(runbookPath: string): Promise<LinkCheckResult[]> {
  const content = await fs.readFile(runbookPath, 'utf-8');
  const urlRegex = /https?:\/\/[^\s\)\]]+/g;
  const urls = content.match(urlRegex) || [];

  const results: LinkCheckResult[] = [];

  for (const url of urls) {
    try {
      // fetch has no `timeout` option; abort the request after 5 seconds instead.
      const response = await fetch(url, { method: 'HEAD', signal: AbortSignal.timeout(5000) });
      results.push({
        url,
        status: response.status,
        healthy: response.status < 400
      });
    } catch (error) {
      // Network failure or timeout: record status 0 and keep checking other links.
      results.push({
        url,
        status: 0,
        healthy: false,
        error: String(error)
      });
    }
  }

  return results;
}
```

The best runbook is worthless if you can't access it during a disaster. Consider the failure modes:
Accessibility Failure Modes:
Mitigations:
If your DR procedures are stored only on the systems you're trying to recover, they're not DR procedures. The documentation for recovering AWS must be accessible without AWS. The documentation for recovering your Confluence wiki must be accessible without Confluence.
Recommended Architecture:
Automate synchronization between primary and secondary/tertiary locations. Test access from each location periodically.
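A periodic access test can double as a sync check: fetch each copy, then compare it with the primary. The sketch below assumes the fetching has already happened and that an unreachable location is represented by `null` content. `RunbookCopy`, `findCopyProblems`, and the location names in the usage note are hypothetical.

```typescript
// A runbook copy as retrieved from one storage location.
// content is null when the location could not be reached.
interface RunbookCopy {
  location: string;
  content: string | null;
}

interface CopyProblem {
  location: string;
  problem: 'unreachable' | 'out-of-sync';
}

// Ignore line-ending differences and trailing whitespace so that copies
// edited or exported on different platforms still compare as equal.
function normalize(content: string): string {
  return content.replace(/\r\n/g, '\n').replace(/\s+$/, '');
}

// Report every secondary copy that is missing or differs from the primary.
function findCopyProblems(primaryContent: string, copies: RunbookCopy[]): CopyProblem[] {
  const expected = normalize(primaryContent);
  const problems: CopyProblem[] = [];
  for (const copy of copies) {
    if (copy.content === null) {
      problems.push({ location: copy.location, problem: 'unreachable' });
    } else if (normalize(copy.content) !== expected) {
      problems.push({ location: copy.location, problem: 'out-of-sync' });
    }
  }
  return problems;
}
```

Given copies from, say, an object-storage bucket, an offline laptop folder, and a second cloud provider, the function returns an empty list only when every location was reachable and matched the primary. Any other result is worth an alert before the next incident, not during it.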
What's Next:
With validated, maintained runbooks in place, the final step is reducing human involvement in recovery. The next page covers DR Automation—how to automate repetitive recovery tasks, reduce recovery time, and minimize the risk of human error during high-stress situations.
You now understand how to create runbooks that work under pressure—structured, explicit, verified, and maintained. These procedures are the bridge between DR strategy and DR execution. Next, we'll explore automating these procedures to further reduce recovery time and human error.