Disaster Recovery - Learning Module

Runbook Development: Creating Executable Recovery Procedures

The Documentation That Saves You at 3 AM

At 3:47 AM on a Sunday, your phone screams with a critical alert. The primary database cluster is down. The on-call engineer who wrote the recovery procedures left the company six months ago. You're staring at a runbook that says 'Follow standard database failover procedures' and wondering what 'standard' means.

This scenario plays out constantly across the industry. Runbooks that seemed clear when written become incomprehensible under pressure. Procedures that worked last year fail because infrastructure has changed. Documentation that assumes expert knowledge becomes useless when the expert isn't available.

The quality of your runbooks directly determines whether recovery succeeds or fails. A well-crafted runbook enables a junior engineer, woken at 3 AM, unfamiliar with the specific system, under extreme stress, to successfully execute recovery procedures. A poorly crafted runbook becomes another obstacle in an already chaotic situation.

This page teaches you to build runbooks that actually work when needed most.

What You Will Master

By the end of this page, you will understand how to create runbooks that pass the '3 AM test'—procedures that can be successfully executed by engineers under stress with limited context. You'll learn structure, content, validation techniques, and maintenance practices that keep runbooks effective over time.

What Makes a Runbook Effective

An effective runbook is not just documentation—it's an operational tool. The difference is crucial:

Documentation explains systems and concepts. It helps someone understand.

A Runbook enables action under pressure. It tells someone exactly what to do, step by step, with verification at each stage.

Effective runbooks share several characteristics:

Characteristics of Effective Runbooks

•Prescriptive, Not Descriptive — 'Run this command' not 'You might want to check the database.' Each step is an explicit action.
•Self-Contained — All required information is in the runbook or explicitly linked. No assumed knowledge about where to find things.
•Verification at Every Step — After each action, how do you confirm it worked? Expected output, success indicators, and what to do if verification fails.
•Decision Trees for Variants — If different situations require different actions, explicit branching logic guides the operator to the right path.
•Rollback Procedures — If a step fails or causes additional problems, how do you undo it? Every action should have an escape hatch.
•Testable — Each procedure has been executed in a test environment to confirm accuracy. Untested runbooks are theories.
•Maintainable — Structured so updates are easy, with clear ownership and review cycles. Stale runbooks are dangerous runbooks.

The 3 AM Test

Before finalizing any runbook, apply the 3 AM test: can an engineer who was just woken up, is stressed, has limited context on this specific system, and cannot reach the original author still execute this procedure successfully? If the answer is 'probably not,' the runbook needs work.

Runbook Structure: The Essential Components

A well-structured runbook follows a consistent format that operators can navigate quickly even under stress:

runbook-template.md
# Runbook: [System Name] - [Procedure Name]
 
## Metadata
| Field | Value |
|-------|-------|
| **Owner** | [Team/Person responsible for this runbook] |
| **Last Tested** | [Date of last validation test] |
| **Last Updated** | [Date of last content update] |
| **Expected Duration** | [Typical time to complete procedure] |
| **Required Access** | [List of access/permissions needed] |
| **Related Runbooks** | [Links to dependent or related procedures] |
 
## Overview
 
### Purpose
[One paragraph explaining what this runbook accomplishes and when to use it]
 
### When to Use This Runbook
- [Trigger condition 1]
- [Trigger condition 2]
- [Explicit scope: what this runbook does and does NOT cover]
 
### Pre-Requisites
- [ ] Access to [specific system/tool] verified
- [ ] [Required credential/key] available
- [ ] [Dependent system] is operational
- [ ] [Any other prerequisites]
 
### Impact and Risks
| Risk | Likelihood | Mitigation |
|------|------------|------------|
| [Potential negative outcome] | [High/Medium/Low] | [How to prevent or recover] |
 
---
 
## Procedure
 
### Step 1: [Action Name]
 
**Action:**
```bash
# Copy-paste ready command
example-command --with-parameters
```
 
**Expected Result:**
```
Expected output you should see
```
 
**Verification:**
- [ ] Output matches expected result
- [ ] [Additional verification step]
 
**If Step Fails:**
- [ ] Check [common failure reason]
- [ ] Try [alternative approach]
- [ ] If still failing, escalate to [contact] or proceed to Rollback section
 
---
 
### Step 2: [Action Name]
 
**Action:**
[Detailed action instructions with copy-paste commands]
 
**Expected Result:**
[What you should see if successful]
 
**Verification:**
- [ ] [Verification checklist]
 
**If Step Fails:**
[Specific troubleshooting for this step]
 
---
 
### Step N: Verification Complete
 
**Final Verification Checklist:**
- [ ] [System component 1] is operational
- [ ] [System component 2] returns expected response
- [ ] Monitoring shows healthy metrics
- [ ] Test transaction succeeds
 
---
 
## Rollback Procedure
 
### When to Rollback
Use this rollback procedure if:
- [Condition 1]
- [Condition 2]
- Any step fails and troubleshooting doesn't resolve within [X minutes]
 
### Rollback Steps
 
**Step R1: [Undo Action]**
```bash
rollback-command
```
 
[Continue with explicit rollback steps for each forward step]
 
---
 
## Contacts and Escalation
 
| Role | Name | Contact | Availability |
|------|------|---------|--------------|
| Primary Escalation | [Name] | [Phone/Slack] | [Hours] |
| Database Expert | [Name] | [Phone/Slack] | [Hours] |
| Vendor Support | [Company] | [Number/Portal] | [SLA] |
 
---
 
## Revision History
 
| Date | Author | Changes |
|------|--------|---------|
| YYYY-MM-DD | [Name] | [Summary of changes] |

Key Structural Elements Explained:

Metadata Section: Critically important for assessing runbook validity. If 'Last Tested' is 18 months ago, proceed with extra caution. If 'Last Updated' predates a major infrastructure change, the runbook may be stale.

Pre-Requisites Checklist: Operators should verify these BEFORE starting the procedure. Nothing is more frustrating than getting halfway through recovery and discovering you don't have required access.

Step-by-Step Actions: Each step must be atomic and verifiable. If a step can't be verified, break it into smaller steps until each has a clear success/failure indicator.

Rollback Procedure: This is not optional. Every action that modifies state should have an undo. The rollback procedure should be as detailed as the forward procedure.
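
The structural rules above lend themselves to a mechanical check. The following is a minimal sketch, assuming steps are captured in a hypothetical `RunbookStep` shape (the interface and the `lintStep` function are illustrative and not part of the template), that flags steps with no command, no verification, or no rollback for a state-changing action:

```typescript
// Hypothetical shape for one runbook step; the field names are illustrative only.
interface RunbookStep {
  name: string;
  command: string;         // copy-paste ready command
  verification: string[];  // observable success criteria
  modifiesState: boolean;  // true if the step changes system state
  rollback?: string;       // undo command, or a pointer to a rollback step
}

// Returns human-readable problems; an empty array means the step has a command,
// at least one verification criterion, and (if it changes state) a rollback.
function lintStep(step: RunbookStep): string[] {
  const problems: string[] = [];
  if (step.command.trim() === '') {
    problems.push(step.name + ': no explicit command');
  }
  if (step.verification.length === 0) {
    problems.push(step.name + ': no verification criteria');
  }
  if (step.modifiesState && !step.rollback) {
    problems.push(step.name + ': modifies state but has no rollback');
  }
  return problems;
}

const promoteStep: RunbookStep = {
  name: 'Promote replica',
  command: 'aws rds promote-read-replica --db-instance-identifier prod-db-replica-dr',
  verification: ['Instance status is "available"'],
  modifiesState: true,
};

// Reports the missing rollback for this state-changing step.
console.log(lintStep(promoteStep));
```

A check like this cannot judge whether a verification criterion is meaningful; it only catches steps where one is absent.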

Writing Clear Procedures

The quality of individual procedure steps determines whether a runbook succeeds. Here are principles for writing steps that work under pressure:

Be Explicit, Not Implicit:

Ineffective Steps

•'Failover the database'
•'Update the DNS records'
•'Restart the affected services'
•'Check that everything is working'
•'Follow standard procedures'

Effective Steps

•'Run: aws rds failover-db-cluster --db-cluster-identifier prod-db-cluster'
•'In Route53, change A record for api.example.com from 10.0.1.5 to 10.0.2.5'
•'Run: kubectl rollout restart deployment/order-service -n production'
•'Verify: curl https://api.example.com/health returns {"status":"ok"}'
•'See linked runbook: Database Failover Procedure'

Provide Copy-Paste Commands:

When a command is required, provide it exactly as it should be run. Include:

•The complete command with all required parameters
•Placeholder markers for values that vary (clearly indicated)
•Comments explaining non-obvious parameters
•Expected execution time for long-running commands

example-command-step.md
### Step 3: Promote Read Replica to Primary
 
**Action:**
SSH to the database bastion host and run the promotion command:
 
```bash
# SSH to bastion (requires VPN connection)
ssh -i ~/.ssh/prod-bastion.pem ec2-user@bastion.prod.internal
 
# Promote the DR replica to primary
# This command takes 2-5 minutes to complete
aws rds promote-read-replica \
    --db-instance-identifier prod-db-replica-dr \
    --region us-west-2
 
# Monitor promotion status (run repeatedly until status = "available")
aws rds describe-db-instances \
    --db-instance-identifier prod-db-replica-dr \
    --region us-west-2 \
    --query 'DBInstances[0].DBInstanceStatus'
```
 
**Expected Result:**
After 2-5 minutes, the status query should return:
```
"available"
```
 
**Verification:**
```bash
# Connect to the promoted instance and run a test write
# (the dr_test_writes table exists for DR testing)
psql -h prod-db-replica-dr.xxxx.us-west-2.rds.amazonaws.com -U admin -d production \
    -c "INSERT INTO dr_test_writes (test_id, created_at) VALUES ('manual-test', NOW());"

# Verify the write succeeded (the result should include the row just inserted)
psql -h prod-db-replica-dr.xxxx.us-west-2.rds.amazonaws.com -U admin -d production \
    -c "SELECT * FROM dr_test_writes WHERE test_id = 'manual-test';"
```
 
**If This Step Fails:**
1. If promotion command returns error: Check replica status is "available" before attempting promotion
2. If still "modifying" after 10 minutes: Check RDS console for detailed status; may need AWS support
3. If can't connect after promotion: Verify security group allows inbound from bastion
4. If write test fails: Promotion may not be complete; wait and retry in 2 minutes
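
Placeholder markers are only safe if nobody pastes a command with the marker still in it. The sketch below assumes placeholders follow an angle-bracket convention such as `<YYYY-MM-DD>`; the names `findPlaceholders` and `fillPlaceholders` are hypothetical helpers, not part of any tool mentioned on this page.

```typescript
// Matches markers like <YYYY-MM-DD> or <INSTANCE_ID>. This is an assumed convention;
// a command that legitimately contains "<word>" would be flagged as a false positive.
const PLACEHOLDER = /<([A-Za-z][A-Za-z0-9_-]*)>/g;

// Lists the placeholder markers still present in a command.
function findPlaceholders(command: string): string[] {
  return command.match(PLACEHOLDER) || [];
}

// Substitutes known values and throws if any marker is left unfilled,
// so a half-edited command is never handed to the operator.
function fillPlaceholders(command: string, values: { [name: string]: string }): string {
  const filled = command.replace(PLACEHOLDER, (marker: string, name: string) => {
    const value = values[name];
    return value !== undefined ? value : marker;
  });
  const remaining = findPlaceholders(filled);
  if (remaining.length > 0) {
    throw new Error('Unfilled placeholders: ' + remaining.join(', '));
  }
  return filled;
}

const template = 'aws s3 cp s3://prod-config-backups/daily/<YYYY-MM-DD>/config.tar.gz ./';
console.log(findPlaceholders(template));
console.log(fillPlaceholders(template, { 'YYYY-MM-DD': '2024-01-15' }));
```

Failing loudly on an unfilled marker is a deliberate choice here: under stress, a command that refuses to render is easier to notice than one that runs against a literal `<YYYY-MM-DD>` path.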

Handle Variations Explicitly:

Real-world procedures often have branches. When the action depends on a condition, use clear decision trees:

decision-tree-example.md
### Step 5: Restore Application Configuration
 
First, determine which configuration source is available:
 
**Check 1: Is the Parameter Store accessible?**
```bash
aws ssm get-parameter --name /prod/config/version --region us-west-2
```
 
**If Parameter Store is accessible:** → Proceed to Step 5a
**If Parameter Store returns error:** → Proceed to Step 5b
 
---
 
#### Step 5a: Load Configuration from Parameter Store
 
```bash
# Pull configuration from Parameter Store
./scripts/load-config-from-ssm.sh --environment production --region us-west-2
```
 
**Expected Result:** Script outputs "Configuration loaded successfully"
 
**Then:** → Skip to Step 6
 
---
 
#### Step 5b: Load Configuration from Backup
 
The parameter store is unavailable. Use the configuration backup:
 
```bash
# Locate most recent configuration backup
aws s3 ls s3://prod-config-backups/daily/ --recursive | tail -1
 
# Download and apply the backup (substitute actual filename)
aws s3 cp s3://prod-config-backups/daily/<YYYY-MM-DD>/config.tar.gz ./
tar -xzf config.tar.gz
./scripts/apply-config-backup.sh --source ./config/
 
# Note: This backup may be up to 24 hours old. 
# Document any recent configuration changes that may be missing.
```
 
**⚠️ Warning:** Configuration backup may not include changes from the last 24 hours.
After recovery, verify configuration matches expected values for recent deployments.
 
**Then:** → Proceed to Step 6
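
Branching like the Step 5 example can also be written down as data, which makes it possible to check that every branch points at a step that actually exists. This is a sketch only: the `DecisionPoint` shape and both functions are hypothetical, and it assumes the check command signals "returns error" with a non-zero exit code.

```typescript
// One decision point: a check command and the step to run for each outcome.
interface DecisionPoint {
  id: string;
  check: string;      // command the operator runs
  onSuccess: string;  // step id when the check exits with 0
  onFailure: string;  // step id when the check exits non-zero
}

// Picks the next step from the exit code of the check command.
function nextStep(point: DecisionPoint, exitCode: number): string {
  return exitCode === 0 ? point.onSuccess : point.onFailure;
}

// Returns branch targets that are not defined anywhere in the runbook.
function danglingTargets(points: DecisionPoint[], stepIds: string[]): string[] {
  const missing: string[] = [];
  for (const point of points) {
    for (const target of [point.onSuccess, point.onFailure]) {
      if (stepIds.indexOf(target) === -1 && missing.indexOf(target) === -1) {
        missing.push(target);
      }
    }
  }
  return missing;
}

const configSource: DecisionPoint = {
  id: '5',
  check: 'aws ssm get-parameter --name /prod/config/version --region us-west-2',
  onSuccess: '5a',
  onFailure: '5b',
};

console.log(nextStep(configSource, 0));    // Parameter Store reachable
console.log(nextStep(configSource, 255));  // Parameter Store returned an error
console.log(danglingTargets([configSource], ['5', '5a', '5b', '6']));
```

A dangling-target check is most useful after edits, when a step has been renumbered but a branch elsewhere still points at the old id.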

Verification and Validation

Every step in a runbook must have clear verification criteria. Verification answers: How do I know this step worked?

Types of Verification:

Output Verification: The command produces expected output. Document exactly what the output should look like:

verification-examples.md
### Output Verification Example
 
**Command:**
```bash
kubectl get pods -n production -l app=order-service
```
 
**Expected Output (Success):**
```
NAME                             READY   STATUS    RESTARTS   AGE
order-service-5d8f9b7c4d-abc12   3/3     Running   0          2m
order-service-5d8f9b7c4d-def34   3/3     Running   0          2m
order-service-5d8f9b7c4d-ghi56   3/3     Running   0          2m
```
All pods should show READY 3/3 and STATUS Running.
 
**Failure Indicators:**
- READY shows less than 3/3 → Container not starting; check logs
- STATUS shows CrashLoopBackOff → Application error; check logs
- STATUS shows Pending → Scheduling issue; check node capacity
- Fewer than 3 pods → Check replica count in deployment
 
---
 
### Health Check Verification Example
 
**Action:**
```bash
curl -s https://api.example.com/health | jq .
```
 
**Expected Output (Success):**
```json
{
  "status": "healthy",
  "version": "2.4.1",
  "database": "connected",
  "cache": "connected",
  "timestamp": "2024-01-15T10:30:00Z"
}
```
 
**Success Criteria:**
- "status" is "healthy"
- "database" is "connected"
- "cache" is "connected"
 
**Failure Indicators:**
- "status" is "degraded" → Some dependencies failing; check database/cache fields
- "database" is "disconnected" → Database connectivity issue; return to database steps
- Request timeout → Service not yet ready; wait 30 seconds and retry
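
The success criteria for the health check can be expressed as a small function, so the same rules apply whether a person or a script reads the response. This is a sketch under the assumption that the endpoint returns the JSON fields shown in the example; `evaluateHealth` is a hypothetical helper.

```typescript
type HealthVerdict = { ok: boolean; reasons: string[] };

// Applies the three success criteria to a raw response body.
function evaluateHealth(body: string): HealthVerdict {
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch (err) {
    return { ok: false, reasons: ['response is not valid JSON'] };
  }
  if (typeof parsed !== 'object' || parsed === null) {
    return { ok: false, reasons: ['response is not a JSON object'] };
  }

  const fields = parsed as { [key: string]: unknown };
  const expected: { [key: string]: string } = {
    status: 'healthy',
    database: 'connected',
    cache: 'connected',
  };

  // Collect every failing field so the operator sees all problems at once.
  const reasons: string[] = [];
  for (const key of Object.keys(expected)) {
    if (fields[key] !== expected[key]) {
      reasons.push(key + ' is ' + JSON.stringify(fields[key]) + ', expected "' + expected[key] + '"');
    }
  }
  return { ok: reasons.length === 0, reasons: reasons };
}

const degraded = '{"status":"degraded","database":"disconnected","cache":"connected"}';
console.log(evaluateHealth(degraded));
```

Fields such as `version` and `timestamp` are deliberately ignored, since they change between deployments and are not part of the stated success criteria.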

State Verification: Beyond command output, some steps require verifying system state has actually changed:

state-verification-example.md
### State Verification Example: DNS Propagation
 
**Action:** Update DNS A record from old IP to new IP
 
**Verification:**
DNS changes may take time to propagate. Verify propagation before proceeding:
 
```bash
# Check DNS resolution from multiple locations
# Should return NEW IP: 10.0.2.5
 
# Local resolution
dig +short api.example.com
 
# Google DNS
dig +short api.example.com @8.8.8.8
 
# Cloudflare DNS
dig +short api.example.com @1.1.1.1
```
 
**Success Criteria:**
All three queries should return: `10.0.2.5`
 
**⏱️ Timing:**
- Most queries should resolve within 60 seconds
- If still showing old IP after 5 minutes, check Route53 change status
- Maximum propagation time: 15 minutes (based on TTL settings)
 
**If Propagation Stalls:**
1. Verify Route53 change completed: Check change status in console
2. Consider CDN caching: May need to purge CDN DNS cache
3. If urgent: Direct traffic via /etc/hosts or load balancer, bypassing DNS
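
The timing guidance in the DNS example amounts to a small decision rule. The sketch below encodes it as a pure function, assuming the operator has already collected one answer per resolver (for example from the `dig +short` commands); the action names and the `propagationAction` function are hypothetical, and the 5 and 15 minute thresholds are taken from the example.

```typescript
type PropagationAction = 'proceed' | 'wait' | 'check-route53' | 'stalled';

// answers maps a resolver label to the IP it returned.
// Returns what the operator should do next, given the time since the DNS change.
function propagationAction(
  answers: { [resolver: string]: string },
  expectedIp: string,
  elapsedSeconds: number
): PropagationAction {
  const resolvers = Object.keys(answers);
  // No answers at all is treated as "not propagated", never as success.
  const allMatch =
    resolvers.length > 0 && resolvers.every((r) => answers[r] === expectedIp);

  if (allMatch) return 'proceed';
  if (elapsedSeconds >= 15 * 60) return 'stalled';        // past the documented maximum
  if (elapsedSeconds >= 5 * 60) return 'check-route53';   // old IP still visible after 5 minutes
  return 'wait';
}

const answers = { local: '10.0.2.5', google: '10.0.1.5', cloudflare: '10.0.2.5' };
console.log(propagationAction(answers, '10.0.2.5', 90));
```

Keeping the rule separate from the polling loop makes it easy to test without touching real DNS.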

The 'How Do I Know?' Rule

After writing each step, ask: 'If I execute this step, how do I know it worked?' If you can't answer with a specific, observable verification, the step is incomplete. Add explicit success criteria before moving on.

Runbook Testing and Validation

A runbook that hasn't been tested is just a theory. Validation ensures that documented procedures actually work:

Validation Methods:

1. Author Walkthrough: The runbook author executes the procedure in a test environment while timing each step. This catches obvious errors but misses assumptions the author makes implicitly.

2. Peer Execution: Someone who didn't write the runbook executes it without author assistance. This is the critical test—it exposes unstated assumptions, missing context, and confusing instructions.

3. Production Execution: During scheduled DR tests, runbooks are executed against production (or production-like) systems. This validates that procedures work with real data and scale.

Runbook Validation Levels
Level	Who Executes	Environment	Frequency	What It Catches
Author Walkthrough	Runbook author	Test/Staging	Every update	Syntax errors, missing commands, sequence issues
Peer Execution	Different engineer	Test/Staging	Quarterly	Unstated assumptions, confusing instructions, tribal knowledge
Simulated DR Test	On-call rotation	DR environment	Semi-annually	Integration issues, timing problems, access issues
Production DR Test	DR team	Production	Annually	Scale issues, real data edge cases, full recovery timing

Peer Execution Test Protocol:

The most valuable validation is having someone unfamiliar with the procedure execute it:

1. Select an Executor: Choose someone who knows the general technology but not the specific procedure
2. No Coaching: The author observes but doesn't help unless the executor is completely blocked
3. Document Everything: Record every question, confusion, missing information, and incorrect instruction
4. Time Each Step: Capture actual execution time vs. documented estimates
5. Validate Verification Steps: Were the verification criteria clear? Did they correctly indicate success/failure?
6. Update Runbook: Incorporate all findings before declaring runbook validated
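
The 'Time Each Step' item produces data that is easy to compare automatically. As a rough sketch, assuming timings are recorded per step in minutes, the hypothetical `findTimingGaps` function below flags steps that ran well past their documented estimate; the 50% tolerance is an arbitrary default, not a recommendation from this page.

```typescript
interface StepTiming {
  step: number;
  documentedMinutes: number;  // estimate written in the runbook
  actualMinutes: number;      // time observed during peer execution
}

// Returns the steps whose observed time exceeded the estimate by more than
// the tolerance (0.5 means more than 50% over).
function findTimingGaps(timings: StepTiming[], tolerance = 0.5): StepTiming[] {
  return timings.filter(
    (t) => t.actualMinutes > t.documentedMinutes * (1 + tolerance)
  );
}

const timings: StepTiming[] = [
  { step: 1, documentedMinutes: 2, actualMinutes: 2 },
  { step: 3, documentedMinutes: 5, actualMinutes: 12 },
  { step: 4, documentedMinutes: 10, actualMinutes: 14 },
];

// Only step 3 is flagged: 12 minutes against a 5 minute estimate.
console.log(findTimingGaps(timings).map((t) => t.step));
```

A flagged step usually means either the estimate is stale or the instructions forced the executor to stop and investigate; both are findings worth recording.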

The Curse of Knowledge

Runbook authors suffer from the Curse of Knowledge—they can't unknow what they know. Steps that seem obvious to the author may be opaque to others. Peer execution is the only cure. Budget time for at least one peer validation of every critical runbook.

runbook-validation-checklist.ts
// Runbook Validation Tracking
 
interface RunbookValidation {
  runbookId: string;
  runbookName: string;
  lastModified: Date;
  
  validations: ValidationRecord[];
  
  overallStatus: 'validated' | 'needs-review' | 'stale' | 'never-validated';
  nextValidationDue: Date;
}
 
interface ValidationRecord {
  date: Date;
  type: 'author' | 'peer' | 'simulated-dr' | 'production-dr';
  environment: string;
  executor: string;
  
  result: 'passed' | 'passed-with-issues' | 'failed';
  totalDuration: number;  // minutes
  
  findings: ValidationFinding[];
}
 
interface ValidationFinding {
  stepNumber: number;
  severity: 'critical' | 'major' | 'minor' | 'suggestion';
  description: string;
  resolution: string;
  resolved: boolean;
}
 
// Example validation record
const dbFailoverValidation: RunbookValidation = {
  runbookId: 'rb-db-failover-001',
  runbookName: 'Database Failover Procedure',
  lastModified: new Date('2024-01-10'),
  
  validations: [
    {
      date: new Date('2024-01-15'),
      type: 'peer',
      environment: 'staging',
      executor: 'Alex Chen',
      result: 'passed-with-issues',
      totalDuration: 45,
      findings: [
        {
          stepNumber: 3,
          severity: 'major',
          description: 'Step says "connect to bastion" but does not specify which bastion or provide SSH command',
          resolution: 'Added explicit bastion hostname and full SSH command with key path',
          resolved: true
        },
        {
          stepNumber: 7,
          severity: 'minor',
          description: 'Expected output shows old database version; actual is newer',
          resolution: 'Updated expected output to reflect current version',
          resolved: true
        },
        {
          stepNumber: 12,
          severity: 'critical',
          description: 'Verification command fails with "permission denied" - required IAM policy not documented',
          resolution: 'Added required IAM permissions to prerequisites section',
          resolved: true
        },
        {
          stepNumber: 15,
          severity: 'suggestion',
          description: 'Would help to have a summary checklist at the end of all verification steps',
          resolution: 'Added final verification checklist section',
          resolved: true
        }
      ]
    }
  ],
  
  overallStatus: 'validated',
  nextValidationDue: new Date('2024-04-15')  // roughly 90 days (one quarter) after last validation
};
 
function assessRunbookHealth(validation: RunbookValidation): {
  status: string;
  recommendation: string;
} {
  if (validation.validations.length === 0) {
    return {
      status: 'CRITICAL',
      recommendation: 'Runbook has never been validated. Schedule immediate peer execution test.'
    };
  }

  // Use the most recent validation, regardless of the order of the array
  const lastValidated = new Date(
    Math.max(...validation.validations.map((v) => v.date.getTime()))
  );
  const daysSinceValidation = daysBetween(lastValidated, new Date());

  // Compare timestamps directly so that an edit made today is still caught
  if (validation.lastModified.getTime() > lastValidated.getTime()) {
    return {
      status: 'WARNING',
      recommendation: 'Runbook modified since last validation. Re-validation required.'
    };
  }
  
  if (daysSinceValidation > 180) {
    return {
      status: 'STALE',
      recommendation: 'Last validation over 6 months ago. Schedule re-validation.'
    };
  }
  
  if (daysSinceValidation > 90) {
    return {
      status: 'REVIEW',
      recommendation: 'Validation approaching staleness. Plan re-validation within 30 days.'
    };
  }
  
  return {
    status: 'HEALTHY',
    recommendation: 'Runbook recently validated. No action required.'
  };
}
 
function daysBetween(date1: Date, date2: Date): number {
  return Math.floor((date2.getTime() - date1.getTime()) / (1000 * 60 * 60 * 24));
}

Runbook Maintenance: Keeping Procedures Current

Runbooks decay rapidly. Infrastructure changes, tools are updated, personnel rotate. A runbook that worked six months ago may be dangerously outdated today. Systematic maintenance is essential:

Triggers for Runbook Updates:

•Infrastructure Changes: New database version, different cloud region, updated network topology
•Tool Changes: New CLI version, different monitoring platform, updated automation scripts
•Personnel Changes: Contact information outdated, role changes, team restructuring
•Post-Incident Findings: Real incident or DR test revealed procedure gaps
•Scheduled Review: Quarterly review even without specific trigger

Ownership Model:

Every runbook needs an owner—a person or team responsible for its accuracy:

•Owner receives notifications when referenced systems change
•Owner is accountable for validation schedule
•Owner approves any modifications
•Backup owner designated for coverage

Runbook Maintenance Checklist

•Quarterly Review: Read through the runbook. Does everything still make sense? Are referenced resources still accurate?
•After Any System Change: Update affected runbooks within 5 business days of change completion
•After Any Incident: If runbook was used, document what worked and what didn't. Update immediately.
•Contact Validation: Monthly verification that escalation contacts are current
•Link Validation: Automated checks that referenced URLs and resources are accessible
•Access Validation: Quarterly test that required permissions still work
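
The cadences in the checklist above can be tracked with very little code. The sketch below is illustrative only: `MaintenanceTask` and `overdueTasks` are hypothetical names, and "monthly" and "quarterly" are approximated as 30 and 90 days.

```typescript
interface MaintenanceTask {
  name: string;
  intervalDays: number;  // 30 for monthly, 90 for quarterly (approximations)
  lastDone: Date;
}

// Returns the names of tasks whose interval has fully elapsed since they were last done.
function overdueTasks(tasks: MaintenanceTask[], today: Date): string[] {
  const msPerDay = 24 * 60 * 60 * 1000;
  return tasks
    .filter((t) => today.getTime() - t.lastDone.getTime() > t.intervalDays * msPerDay)
    .map((t) => t.name);
}

const maintenanceTasks: MaintenanceTask[] = [
  { name: 'Quarterly Review', intervalDays: 90, lastDone: new Date('2024-01-10T00:00:00Z') },
  { name: 'Contact Validation', intervalDays: 30, lastDone: new Date('2024-01-15T00:00:00Z') },
  { name: 'Access Validation', intervalDays: 90, lastDone: new Date('2023-11-01T00:00:00Z') },
];

// As of 2024-03-01, contact and access validation are overdue; the quarterly review is not.
console.log(overdueTasks(maintenanceTasks, new Date('2024-03-01T00:00:00Z')));
```

Passing `today` in as a parameter, rather than reading the clock inside the function, keeps the check reproducible.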

Version Control for Runbooks

Store runbooks in version control (Git). This provides change history, enables code review for updates, allows rollback if changes introduce errors, and integrates with CI/CD for automated validation. Treat runbooks as code, not documents.
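
One way to use that CI integration is to fail the build when a runbook's metadata is missing or stale. The following is a minimal sketch, assuming runbooks use the Metadata table from the template earlier on this page and that dates are written as YYYY-MM-DD; `readMetadata` and `checkMetadata` are hypothetical helpers, and the 180-day default mirrors the staleness threshold used elsewhere here.

```typescript
// Extracts a value from a Metadata table row such as "| **Owner** | Platform Team |".
function readMetadata(markdown: string, field: string): string | undefined {
  for (const line of markdown.split('\n')) {
    // A row "| a | b |" splits into ['', 'a', 'b', ''].
    const cells = line.split('|').map((c) => c.trim());
    if (cells.length >= 4 && cells[1] === '**' + field + '**') {
      return cells[2];
    }
  }
  return undefined;
}

// Returns a list of metadata problems; an empty list means the runbook passes.
function checkMetadata(markdown: string, today: Date, maxAgeDays = 180): string[] {
  const errors: string[] = [];

  const owner = readMetadata(markdown, 'Owner');
  if (!owner || owner.charAt(0) === '[') {
    errors.push('Owner is missing or still a template placeholder');
  }

  const lastTested = readMetadata(markdown, 'Last Tested');
  if (!lastTested || !/^\d{4}-\d{2}-\d{2}$/.test(lastTested)) {
    errors.push('Last Tested is missing or not in YYYY-MM-DD format');
    return errors;
  }
  const testedAt = new Date(lastTested + 'T00:00:00Z').getTime();
  if (isNaN(testedAt)) {
    errors.push('Last Tested could not be parsed as a date');
    return errors;
  }
  const ageDays = Math.floor((today.getTime() - testedAt) / (24 * 60 * 60 * 1000));
  if (ageDays > maxAgeDays) {
    errors.push('Last Tested is ' + ageDays + ' days old (limit ' + maxAgeDays + ')');
  }
  return errors;
}

const sampleRunbook = [
  '| Field | Value |',
  '|-------|-------|',
  '| **Owner** | Platform Team |',
  '| **Last Tested** | 2023-06-01 |',
].join('\n');

console.log(checkMetadata(sampleRunbook, new Date('2024-01-15T00:00:00Z')));
```

In a pipeline, a non-empty result would typically fail the job, so a stale runbook blocks the merge that would otherwise leave it untouched.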

Automated Maintenance Checks:

Some maintenance can be automated:

runbook-health-checks.ts
// Automated Runbook Health Monitoring

import { promises as fs } from 'fs';
 
interface RunbookHealthCheck {
  checkType: string;
  checkFrequency: string;
  automatable: boolean;
  implementation: string;
}
 
const healthChecks: RunbookHealthCheck[] = [
  {
    checkType: 'Link Validity',
    checkFrequency: 'Daily',
    automatable: true,
    implementation: 'Parse runbooks for URLs, HTTP HEAD each, alert on 4xx/5xx'
  },
  {
    checkType: 'Contact Currency',
    checkFrequency: 'Weekly',
    automatable: true,
    implementation: 'Cross-reference contacts against HR/directory system, flag mismatches'
  },
  {
    checkType: 'Referenced Resources',
    checkFrequency: 'Daily',
    automatable: true,
    implementation: 'Parse referenced AWS resources, verify they exist via API'
  },
  {
    checkType: 'Age Check',
    checkFrequency: 'Weekly',
    automatable: true,
    implementation: 'Flag runbooks not updated in 90+ days for review'
  },
  {
    checkType: 'Validation Currency',
    checkFrequency: 'Weekly',
    automatable: true,
    implementation: 'Flag runbooks not validated in 180+ days'
  },
  {
    checkType: 'Infrastructure Drift',
    checkFrequency: 'On-change',
    automatable: true,
    implementation: 'Integrate with IaC pipeline; flag runbooks referencing changed resources'
  },
  {
    checkType: 'Semantic Content',
    checkFrequency: 'Quarterly',
    automatable: false,
    implementation: 'Human review for accuracy, completeness, clarity'
  }
];
 
// Example: Automated link checking
async function checkRunbookLinks(runbookPath: string): Promise<LinkCheckResult[]> {
  const content = await fs.readFile(runbookPath, 'utf-8');
  const urlRegex = /https?:\/\/[^\s\)\]]+/g;
  const urls = content.match(urlRegex) || [];
  
  const results: LinkCheckResult[] = [];
  
  for (const url of urls) {
    try {
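      // Standard fetch has no `timeout` option; abort the request after 5 seconds instead
      const response = await fetch(url, { method: 'HEAD', signal: AbortSignal.timeout(5000) });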
      results.push({
        url,
        status: response.status,
        healthy: response.status < 400
      });
    } catch (error) {
      results.push({
        url,
        status: 0,
        healthy: false,
        error: String(error)
      });
    }
  }
  
  return results;
}
 
interface LinkCheckResult {
  url: string;
  status: number;
  healthy: boolean;
  error?: string;
}

Runbook Accessibility: Available When Needed

The best runbook is worthless if you can't access it during a disaster. Consider the failure modes:

Accessibility Failure Modes:

•The wiki platform is hosted in the same region as the failed systems
•VPN is required to access documentation, but the VPN depends on failed systems
•Two-factor authentication for document access uses a service that's down
•The Single Sign-On provider is experiencing an outage
•The person who has the password is unreachable

Mitigations:

Runbook Accessibility Strategies

•Multi-Region Hosting: Store runbooks in a different region than production systems. If us-east-1 is down, you need docs accessible from us-west-2.
•Offline Copies: Critical runbooks should have offline-accessible versions—PDF exports, printed copies in office, local laptop copies for on-call engineers.
•Alternative Access Paths: If primary wiki is down, where's the backup? GitHub Pages, S3 static site, or even a shared Google Doc as fallback.
•Out-of-Band Communication: Phone numbers and alternate contact methods that don't depend on company infrastructure.
•Emergency Access Procedures: Break-glass procedures for bypassing normal authentication during outages.
•DR Documentation Drill: Periodically verify you can access runbooks without normal systems available.

Don't Store the Map in the Territory

If your DR procedures are stored only on the systems you're trying to recover, they're not DR procedures. The documentation for recovering AWS must be accessible without AWS. The documentation for recovering your Confluence wiki must be accessible without Confluence.

Recommended Architecture:

•Primary: Internal wiki or documentation platform (normal operation)
•Secondary: Static site on a different cloud provider (GitHub Pages, Netlify)
•Tertiary: Offline PDFs synced to on-call engineers' devices
•Physical: Printed copies of the most critical runbooks in a secure location

Automate synchronization between primary and secondary/tertiary locations. Test access from each location periodically.
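
A quick way to reason about this architecture is to list what each documentation location depends on and ask which copies survive a given failure. The sketch below is a toy model: the `DocLocation` shape, the dependency labels, and `reachableDuring` are all hypothetical, and real dependency chains are usually deeper than one level.

```typescript
interface DocLocation {
  name: string;
  dependsOn: string[];  // components that must be up for this copy to be readable
}

// Returns the locations that do not depend on any failed component.
function reachableDuring(locations: DocLocation[], failed: string[]): string[] {
  return locations
    .filter((loc) => loc.dependsOn.every((dep) => failed.indexOf(dep) === -1))
    .map((loc) => loc.name);
}

const docLocations: DocLocation[] = [
  { name: 'Internal wiki', dependsOn: ['aws:us-east-1', 'sso', 'vpn'] },
  { name: 'Static site (other provider)', dependsOn: ['internet'] },
  { name: 'Offline PDFs', dependsOn: [] },
];

// With the primary region down, only the static site and offline copies remain.
console.log(reachableDuring(docLocations, ['aws:us-east-1']));
```

If any realistic failure scenario returns an empty list, the documentation is stored in the territory it is supposed to map.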

Summary: Runbook Excellence

Key Takeaways

•Runbooks enable action — They're prescriptive tools, not descriptive documentation. Every step is an explicit, verifiable action.
•Pass the 3 AM test — If a stressed, sleep-deprived engineer can't follow the runbook without help, it's not ready.
•Verification is mandatory — Every step needs clear success criteria. 'How do I know it worked?' must have an answer.
•Rollback procedures are required — Every action that changes state needs an undo. Forward progress without escape routes is dangerous.
•Peer validation is essential — Authors can't see their own blind spots. Unfamiliar engineers must test every critical runbook.
•Maintenance is continuous — Runbooks decay rapidly. Systematic updates, ownership, and automated health checks keep them current.
•Accessibility is critical — Runbooks must be accessible without the systems they're recovering. Multi-region hosting and offline copies are essential.

What's Next:

With validated, maintained runbooks in place, the final step is reducing human involvement in recovery. The next page covers DR Automation—how to automate repetitive recovery tasks, reduce recovery time, and minimize the risk of human error during high-stress situations.

Page Complete

You now understand how to create runbooks that work under pressure—structured, explicit, verified, and maintained. These procedures are the bridge between DR strategy and DR execution. Next, we'll explore automating these procedures to further reduce recovery time and human error.