On November 25, 2020, Amazon Web Services experienced a major outage in US-East-1 that stretched on for hours. In theory, many customers could have recovered faster: they had runbooks, and they had DR sites standing ready. But when thousands of companies simultaneously tried to execute recovery procedures, each manual step became a bottleneck. Engineers were paged but needed time to wake up and regain context. Documentation was consulted. Commands were typed, retyped after errors, verified, and re-verified. Human decision-making, normally an asset, became the limiting factor.
Meanwhile, companies with automated DR recovered faster. Their systems detected the failure, initiated failover sequences, validated recovery, and rerouted traffic—all while human operators were still assessing the situation.
Automation transforms DR from a human-speed operation to a machine-speed operation. It removes the cognitive load of executing procedures under stress. It eliminates the variance between a well-rested senior engineer and a junior teammate woken at 3 AM. It enables sub-minute recovery where manual procedures take an hour.
But automation also introduces new risks: automated systems can fail in ways that humans wouldn't, can execute incorrect actions at machine speed, and can mask underlying problems that humans would notice. This page teaches you to automate wisely—capturing the benefits of speed and consistency while managing the risks of autonomous decision-making.
By the end of this page, you will understand the spectrum of DR automation from simple scripts to full orchestration. You'll learn what to automate first, how to design safe automation with appropriate guardrails, and how to balance automation benefits against the risks of autonomous action during critical situations.
DR automation exists on a spectrum from fully manual to fully autonomous. Understanding where you are and where you want to be is the foundation for an automation roadmap:
Level 0: Fully Manual Humans execute every step by reading documentation and running commands. Slowest and most error-prone, but provides maximum human oversight.
Level 1: Documented Scripts Individual steps are scripted, but a human initiates each script and decides when to proceed. Reduces typing errors and speeds execution, but humans remain in control.
Level 2: Orchestrated Workflows Multiple steps are combined into automated workflows. A human initiates the workflow, but it executes multiple steps in sequence with automated validation between stages.
Level 3: Triggered Automation Monitoring systems can trigger automated workflows based on defined conditions. Humans may approve the trigger (semi-automated) or automation runs immediately (fully automated).
Level 4: Autonomous Recovery The system detects failures, initiates recovery, validates success, and restores service without human involvement. Humans are notified but don't need to act.
Level 5: Self-Healing Architecture The system is designed such that component failures don't require recovery procedures at all. Redundancy, auto-scaling, and self-repair are built into the architecture.
| Level | Human Involvement | Typical RTO | Error Risk | Oversight | Cost to Implement |
|---|---|---|---|---|---|
| 0: Manual | 100% | Hours | High (human error) | Maximum | Lowest |
| 1: Scripts | 80% | 30-60 min | Medium | High | Low |
| 2: Orchestrated | 40% | 15-30 min | Low | Medium-High | Medium |
| 3: Triggered | 10-20% | 5-15 min | Low | Medium | High |
| 4: Autonomous | <5% | 1-5 min | Depends on quality | Low | Very High |
| 5: Self-Healing | ~0% | Seconds | Very Low | Minimal | Highest |
Most organizations should not jump to Level 4 or 5 automation immediately. Start by scripting individual steps (Level 1), then combine into workflows (Level 2), then add triggers (Level 3). Each level validates that automation is correct before reducing human oversight. Automating incorrect procedures at machine speed is worse than executing them slowly.
Not all DR activities benefit equally from automation. Prioritize based on how often a step runs, how long it takes manually, and how error-prone it is under stress:

High Priority for Automation: steps that are mechanical, frequently executed (in every test and every incident), time-consuming, or error-prone when typed under pressure, such as database failover, DNS cutover, configuration updates, and post-recovery verification.

Lower Priority for Automation: steps that are rare, inexpensive to perform manually, or that genuinely require human judgment, such as deciding whether to declare a disaster or communicating with stakeholders.
Automation ROI Calculation:
Estimate the value of automating a step:
Automation Value = (Manual Time Saved × Frequency × Hourly Cost) + (Errors Avoided × Cost per Error) - Automation Cost

Where:
- Manual Time Saved = Time to execute the step manually (including human delays), minus the automated execution time
- Frequency = Expected executions per year (including tests and incidents)
- Errors Avoided × Cost per Error = Expected cost of manual-execution errors that automation eliminates
- Automation Cost = Development + Testing + Maintenance
Focus first on steps with high Manual Time, high Frequency, or high Error Rate impact.
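To make the ROI estimate concrete, here is a small TypeScript sketch. The figures are invented for illustration, and the `AutomationCandidate` and `automationValue` names are hypothetical, not part of any framework:

```typescript
// Hypothetical worked example of the automation ROI estimate.
// All numbers are illustrative assumptions, not benchmarks.

interface AutomationCandidate {
  manualHoursSaved: number;   // engineer-hours saved per execution
  executionsPerYear: number;  // tests + incidents
  errorsAvoidedPerYear: number;
  costPerError: number;       // dollars
  automationCost: number;     // development + testing + maintenance, dollars
  hourlyCost: number;         // loaded engineer cost, dollars/hour
}

function automationValue(c: AutomationCandidate): number {
  const timeSavings = c.manualHoursSaved * c.executionsPerYear * c.hourlyCost;
  const errorSavings = c.errorsAvoidedPerYear * c.costPerError;
  return timeSavings + errorSavings - c.automationCost;
}

// Example: scripting a database failover step
const dbFailover: AutomationCandidate = {
  manualHoursSaved: 0.5,
  executionsPerYear: 12,     // monthly tests plus occasional incidents
  errorsAvoidedPerYear: 2,
  costPerError: 5000,
  automationCost: 8000,
  hourlyCost: 150
};

console.log(automationValue(dbFailover)); // 2900 => positive, worth automating
```

Even with modest time savings, avoided errors often dominate the calculation, which is why error-prone steps rank high regardless of frequency.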
The first step in DR automation is creating reliable, well-tested scripts for individual recovery tasks. These scripts become building blocks for higher-level orchestration.
Script Design Principles:

- Validate preconditions before acting, and fail fast with a clear message
- Log every action with timestamps to a persistent location
- Use distinct exit codes so callers (and orchestrators) can tell failure modes apart
- Bound every wait with a timeout rather than polling forever
- Verify results after acting; never assume a command succeeded
- Make scripts safe to re-run (idempotent) wherever possible

The failover script below illustrates these principles:
```bash
#!/bin/bash
# =============================================================================
# DR Database Failover Script
#
# Purpose: Promotes read replica to primary and updates application config
#
# Prerequisites:
# - AWS CLI configured with appropriate credentials
# - jq installed
# - Access to Parameter Store for config updates
#
# Usage:
# ./dr-database-failover.sh --replica-id <replica-id> --region <region>
#
# Returns:
# 0 = Success
# 1 = Validation failure
# 2 = Failover execution failure
# 3 = Post-failover verification failure
# 4 = Timeout
# =============================================================================

set -euo pipefail

# Configuration
SCRIPT_NAME="dr-database-failover"
LOG_FILE="/var/log/dr/${SCRIPT_NAME}-$(date +%Y%m%d-%H%M%S).log"
FAILOVER_TIMEOUT_SECONDS=600
VERIFICATION_RETRIES=10
VERIFICATION_DELAY_SECONDS=30

# Logging function
log() {
  local level="$1"
  local message="$2"
  local timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
  echo "[${timestamp}] [${level}] ${message}" | tee -a "${LOG_FILE}"
}

# Parse arguments
REPLICA_ID=""
REGION=""

while [[ $# -gt 0 ]]; do
  case $1 in
    --replica-id)
      REPLICA_ID="$2"
      shift 2
      ;;
    --region)
      REGION="$2"
      shift 2
      ;;
    *)
      log "ERROR" "Unknown argument: $1"
      exit 1
      ;;
  esac
done

# Validate arguments
if [[ -z "${REPLICA_ID}" || -z "${REGION}" ]]; then
  log "ERROR" "Required arguments: --replica-id and --region"
  exit 1
fi

log "INFO" "========================================="
log "INFO" "Starting DR Database Failover"
log "INFO" "Replica: ${REPLICA_ID}"
log "INFO" "Region: ${REGION}"
log "INFO" "========================================="

# =============================================================================
# Step 1: Pre-flight validation
# =============================================================================
log "INFO" "Step 1: Pre-flight validation"

# Check replica exists and is in available state
REPLICA_STATUS=$(aws rds describe-db-instances \
  --db-instance-identifier "${REPLICA_ID}" \
  --region "${REGION}" \
  --query 'DBInstances[0].DBInstanceStatus' \
  --output text 2>/dev/null || echo "NOT_FOUND")

if [[ "${REPLICA_STATUS}" != "available" ]]; then
  log "ERROR" "Replica ${REPLICA_ID} is not available. Current status: ${REPLICA_STATUS}"
  exit 1
fi

# Check replication status
REPLICA_LAG=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value="${REPLICA_ID}" \
  --start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average \
  --region "${REGION}" \
  --query 'Datapoints[0].Average' \
  --output text 2>/dev/null || echo "UNKNOWN")

log "INFO" "Current replica lag: ${REPLICA_LAG} seconds"

if [[ "${REPLICA_LAG}" != "UNKNOWN" ]] && (( $(echo "${REPLICA_LAG} > 300" | bc -l) )); then
  log "WARN" "Replica lag is high (${REPLICA_LAG}s). Proceeding may result in data loss."
fi

log "INFO" "Pre-flight validation passed"

# =============================================================================
# Step 2: Initiate failover
# =============================================================================
log "INFO" "Step 2: Initiating replica promotion"

PROMOTION_START=$(date +%s)

aws rds promote-read-replica \
  --db-instance-identifier "${REPLICA_ID}" \
  --region "${REGION}"

log "INFO" "Promotion initiated. Waiting for completion..."

# =============================================================================
# Step 3: Wait for promotion to complete
# =============================================================================
log "INFO" "Step 3: Monitoring promotion progress"

while true; do
  CURRENT_STATUS=$(aws rds describe-db-instances \
    --db-instance-identifier "${REPLICA_ID}" \
    --region "${REGION}" \
    --query 'DBInstances[0].DBInstanceStatus' \
    --output text)

  ELAPSED=$(( $(date +%s) - ${PROMOTION_START} ))
  log "INFO" "Status: ${CURRENT_STATUS} (elapsed: ${ELAPSED}s)"

  if [[ "${CURRENT_STATUS}" == "available" ]]; then
    log "INFO" "Promotion completed successfully"
    break
  fi

  if [[ ${ELAPSED} -gt ${FAILOVER_TIMEOUT_SECONDS} ]]; then
    log "ERROR" "Promotion timed out after ${FAILOVER_TIMEOUT_SECONDS} seconds"
    exit 4
  fi

  sleep 15
done

# =============================================================================
# Step 4: Verify write capability
# =============================================================================
log "INFO" "Step 4: Verifying write capability"

# Get endpoint
NEW_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier "${REPLICA_ID}" \
  --region "${REGION}" \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)

log "INFO" "New primary endpoint: ${NEW_ENDPOINT}"

# Test write capability (requires appropriate db credentials configured)
for i in $(seq 1 ${VERIFICATION_RETRIES}); do
  log "INFO" "Verification attempt ${i}/${VERIFICATION_RETRIES}"

  if psql -h "${NEW_ENDPOINT}" -U admin -d production -c \
    "INSERT INTO dr_test (test_id, created_at) VALUES ('failover-$(date +%s)', NOW());"; then
    log "INFO" "Write verification successful"
    break
  fi

  if [[ ${i} -eq ${VERIFICATION_RETRIES} ]]; then
    log "ERROR" "Write verification failed after ${VERIFICATION_RETRIES} attempts"
    exit 3
  fi

  log "WARN" "Write test failed, retrying in ${VERIFICATION_DELAY_SECONDS}s..."
  sleep ${VERIFICATION_DELAY_SECONDS}
done

# =============================================================================
# Step 5: Update application configuration
# =============================================================================
log "INFO" "Step 5: Updating application configuration"

aws ssm put-parameter \
  --name "/production/database/primary-endpoint" \
  --value "${NEW_ENDPOINT}" \
  --type "String" \
  --overwrite \
  --region "${REGION}"

log "INFO" "Parameter Store updated with new endpoint"

# =============================================================================
# Complete
# =============================================================================
TOTAL_TIME=$(( $(date +%s) - ${PROMOTION_START} ))

log "INFO" "========================================="
log "INFO" "Database failover completed successfully"
log "INFO" "Total time: ${TOTAL_TIME} seconds"
log "INFO" "New primary: ${NEW_ENDPOINT}"
log "INFO" "========================================="

exit 0
```

Individual scripts are building blocks. Workflow orchestration connects them into complete recovery procedures, managing dependencies, parallelism, error handling, and human approval gates.
Orchestration Capabilities:

- Sequencing steps with explicit dependencies between stages
- Running independent steps in parallel to shrink recovery time
- Retries with backoff for transient failures
- Validation between stages, so a failed step halts the workflow instead of compounding damage
- Human approval gates at high-risk decision points
- Automatic rollback paths when a stage fails partway through
- Notification of outcome, including partial successes

The workflow below, expressed as an AWS Step Functions state machine, demonstrates these capabilities:
```yaml
# AWS Step Functions State Machine Definition
# Full DR Failover Workflow

name: FullDRFailoverWorkflow
type: StateMachine
definition:
  Comment: "Orchestrates complete DR failover from primary to DR region"
  StartAt: ValidatePreConditions
  States:
    # ==========================================================================
    # Phase 1: Validation
    # ==========================================================================
    ValidatePreConditions:
      Type: Parallel
      Next: HumanApprovalGate
      Branches:
        - StartAt: CheckDRSiteHealth
          States:
            CheckDRSiteHealth:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:check-dr-site-health"
              ResultPath: "$.drSiteHealth"
              End: true
        - StartAt: CheckReplicationStatus
          States:
            CheckReplicationStatus:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:check-replication-status"
              ResultPath: "$.replicationStatus"
              End: true
        - StartAt: ValidateCredentials
          States:
            ValidateCredentials:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:validate-dr-credentials"
              ResultPath: "$.credentialStatus"
              End: true
      Catch:
        - ErrorEquals: ["States.ALL"]
          Next: FailValidation

    HumanApprovalGate:
      Type: Task
      Resource: "arn:aws:states:::sqs:sendMessage.waitForTaskToken"
      Parameters:
        QueueUrl: "https://sqs.us-west-2.amazonaws.com/123456789/dr-approval-queue"
        MessageBody:
          TaskToken.$: "$$.Task.Token"
          Message: "DR Failover requires approval"
          PreConditionResults.$: "$"
          RequestedBy.$: "$.requestedBy"
      TimeoutSeconds: 3600  # 1 hour to approve
      Next: DatabaseFailover
      Catch:
        - ErrorEquals: ["States.Timeout"]
          Next: ApprovalTimeout

    # ==========================================================================
    # Phase 2: Database Failover
    # ==========================================================================
    DatabaseFailover:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:execute-database-failover"
      Parameters:
        ReplicaId.$: "$.config.drReplicaId"
        Region: "us-west-2"
      ResultPath: "$.databaseResult"
      TimeoutSeconds: 900  # 15 minutes
      Retry:
        - ErrorEquals: ["RetryableError"]
          IntervalSeconds: 30
          MaxAttempts: 3
          BackoffRate: 2
      Next: VerifyDatabaseConnectivity
      Catch:
        - ErrorEquals: ["States.ALL"]
          Next: DatabaseFailoverFailed

    VerifyDatabaseConnectivity:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:verify-database-connectivity"
      Parameters:
        Endpoint.$: "$.databaseResult.newEndpoint"
      ResultPath: "$.databaseVerification"
      Next: ParallelApplicationRecovery
      Catch:
        - ErrorEquals: ["States.ALL"]
          Next: DatabaseVerificationFailed

    # ==========================================================================
    # Phase 3: Application Recovery (Parallel)
    # ==========================================================================
    ParallelApplicationRecovery:
      Type: Parallel
      Next: DNSCutover
      Branches:
        - StartAt: ScaleApplicationTier
          States:
            ScaleApplicationTier:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:scale-application-tier"
              Parameters:
                DesiredCount: 10
                Region: "us-west-2"
              End: true
        - StartAt: WarmCaches
          States:
            WarmCaches:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:warm-application-caches"
              End: true
        - StartAt: UpdateConfiguration
          States:
            UpdateConfiguration:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:sync-application-config"
              End: true
      Catch:
        - ErrorEquals: ["States.ALL"]
          Next: ApplicationRecoveryFailed

    # ==========================================================================
    # Phase 4: Traffic Cutover
    # ==========================================================================
    DNSCutover:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:execute-dns-cutover"
      Parameters:
        TargetRegion: "us-west-2"
        TrafficPercentage: 5
      ResultPath: "$.dnsResult"
      Next: ValidateLimitedTraffic

    ValidateLimitedTraffic:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:validate-error-rates"
      Parameters:
        Threshold: 0.01       # 1% error rate
        DurationSeconds: 180  # 3 minutes observation
      ResultPath: "$.limitedTrafficValidation"
      Next: CheckLimitedTrafficResult

    CheckLimitedTrafficResult:
      Type: Choice
      Choices:
        - Variable: "$.limitedTrafficValidation.passed"
          BooleanEquals: true
          Next: FullTrafficCutover
        - Variable: "$.limitedTrafficValidation.passed"
          BooleanEquals: false
          Next: RollbackDNS
      Default: FullTrafficCutover

    FullTrafficCutover:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:execute-dns-cutover"
      Parameters:
        TargetRegion: "us-west-2"
        TrafficPercentage: 100
      Next: FinalValidation

    # ==========================================================================
    # Phase 5: Final Validation
    # ==========================================================================
    FinalValidation:
      Type: Parallel
      Next: Success
      Branches:
        - StartAt: RunSmokeTests
          States:
            RunSmokeTests:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:run-smoke-tests"
              End: true
        - StartAt: VerifyIntegrations
          States:
            VerifyIntegrations:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:verify-integrations"
              End: true
        - StartAt: ConfirmMonitoring
          States:
            ConfirmMonitoring:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:confirm-dr-monitoring"
              End: true
      Catch:
        - ErrorEquals: ["States.ALL"]
          Next: ValidationWarning

    # ==========================================================================
    # Success and Failure States
    # ==========================================================================
    Success:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:notify-dr-complete"
      Parameters:
        Status: "SUCCESS"
        Summary.$: "$"
      End: true

    ValidationWarning:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:notify-dr-complete"
      Parameters:
        Status: "SUCCESS_WITH_WARNINGS"
        Summary.$: "$"
      End: true

    # Failure states
    FailValidation:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:notify-dr-failure"
      Parameters:
        Stage: "VALIDATION"
        Error.$: "$"
      End: true

    ApprovalTimeout:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:notify-dr-failure"
      Parameters:
        Stage: "APPROVAL"
        Error: "Approval timeout - no response within 1 hour"
      End: true

    DatabaseFailoverFailed:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:notify-dr-failure"
      Parameters:
        Stage: "DATABASE_FAILOVER"
        Error.$: "$"
      End: true

    DatabaseVerificationFailed:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:initiate-dr-rollback"
      Parameters:
        Stage: "DATABASE_VERIFICATION"
        RollbackSteps: ["database"]
      End: true

    ApplicationRecoveryFailed:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:initiate-dr-rollback"
      Parameters:
        Stage: "APPLICATION_RECOVERY"
        RollbackSteps: ["application", "database"]
      End: true

    RollbackDNS:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:initiate-dr-rollback"
      Parameters:
        Stage: "TRAFFIC_CUTOVER"
        RollbackSteps: ["dns", "application", "database"]
      End: true
```

Common orchestration platforms include AWS Step Functions, Azure Logic Apps, Google Cloud Workflows, Temporal, and Kubernetes Jobs/Operators. Choose based on your infrastructure and team expertise. The orchestration logic matters more than the platform—well-designed workflows work across tools.
The most advanced level of DR automation connects monitoring to recovery—when a failure is detected, automated response is initiated without waiting for human intervention.
Detection Requirements:
Automated triggering depends on reliable detection. False positives trigger unnecessary (and potentially harmful) failovers; false negatives miss real failures. Detection systems must:

- Combine multiple independent signals (synthetic transactions, external monitors, platform alarms) rather than trusting any single source
- Require conditions to be sustained for a defined duration, so transient blips don't trigger failover
- Support quorum logic (AND, OR, majority) across conditions
- Enforce cooldown periods so one incident cannot cause repeated triggering
- Map each failure scenario to an appropriate automation level, from notify-only up to auto-execute
```typescript
// Automated DR Trigger System
// Monitors health signals and initiates recovery when thresholds are breached

interface HealthSignal {
  source: string;
  status: 'healthy' | 'degraded' | 'unhealthy' | 'unknown';
  timestamp: Date;
  confidence: number; // 0-1
  details: Record<string, any>;
}

interface TriggerRule {
  name: string;
  description: string;
  conditions: TriggerCondition[];
  conditionLogic: 'AND' | 'OR' | 'MAJORITY';
  cooldownMinutes: number;
  requiresApproval: boolean;
  automationLevel: 'notify' | 'prepare' | 'execute-with-approval' | 'auto-execute';
}

interface TriggerCondition {
  signalSource: string;
  operator: 'equals' | 'not-equals' | 'duration-exceeds' | 'count-exceeds';
  value: any;
  sustainedSeconds: number; // Must be true for this duration
}

class DRTriggerSystem {
  private rules: TriggerRule[] = [];
  private lastTriggered: Map<string, Date> = new Map();
  private signalHistory: Map<string, HealthSignal[]> = new Map();

  constructor() {
    this.initializeRules();
  }

  private initializeRules(): void {
    // Rule 1: Complete region failure
    this.rules.push({
      name: 'RegionFailure',
      description: 'Primary region completely unavailable',
      conditions: [
        {
          signalSource: 'synthetic-transactions-primary',
          operator: 'equals',
          value: 'unhealthy',
          sustainedSeconds: 180 // 3 minutes sustained failure
        },
        {
          signalSource: 'external-uptime-monitor',
          operator: 'equals',
          value: 'unhealthy',
          sustainedSeconds: 180
        },
        {
          signalSource: 'cloudwatch-alarms-primary',
          operator: 'count-exceeds',
          value: 5, // More than 5 critical alarms
          sustainedSeconds: 120
        }
      ],
      conditionLogic: 'MAJORITY', // At least 2 of 3 must trigger
      cooldownMinutes: 60,
      requiresApproval: false, // Auto-execute for clear regional failure
      automationLevel: 'auto-execute'
    });

    // Rule 2: Database primary failure
    this.rules.push({
      name: 'DatabaseFailure',
      description: 'Primary database unresponsive',
      conditions: [
        {
          signalSource: 'database-health-check',
          operator: 'equals',
          value: 'unhealthy',
          sustainedSeconds: 120
        },
        {
          signalSource: 'application-db-errors',
          operator: 'count-exceeds',
          value: 100,
          sustainedSeconds: 60
        }
      ],
      conditionLogic: 'AND',
      cooldownMinutes: 30,
      requiresApproval: true, // Require approval for database failover
      automationLevel: 'execute-with-approval'
    });

    // Rule 3: Network connectivity issues
    this.rules.push({
      name: 'NetworkFailure',
      description: 'Primary region network unreachable',
      conditions: [
        {
          signalSource: 'network-health-primary',
          operator: 'equals',
          value: 'unhealthy',
          sustainedSeconds: 300 // 5 minutes - network issues may be transient
        },
        {
          signalSource: 'vpn-tunnel-primary',
          operator: 'equals',
          value: 'unhealthy',
          sustainedSeconds: 300
        }
      ],
      conditionLogic: 'AND',
      cooldownMinutes: 60,
      requiresApproval: true,
      automationLevel: 'execute-with-approval'
    });
  }

  processSignal(signal: HealthSignal): void {
    // Store in history
    const history = this.signalHistory.get(signal.source) || [];
    history.push(signal);

    // Keep last 10 minutes of history
    const cutoff = new Date(Date.now() - 600000);
    this.signalHistory.set(
      signal.source,
      history.filter(s => s.timestamp > cutoff)
    );

    // Evaluate all rules
    for (const rule of this.rules) {
      this.evaluateRule(rule);
    }
  }

  private evaluateRule(rule: TriggerRule): void {
    // Check cooldown
    const lastTrigger = this.lastTriggered.get(rule.name);
    if (lastTrigger && (Date.now() - lastTrigger.getTime()) < rule.cooldownMinutes * 60 * 1000) {
      return; // In cooldown, don't evaluate
    }

    // Evaluate conditions
    const conditionResults = rule.conditions.map(c => this.evaluateCondition(c));

    let shouldTrigger = false;
    switch (rule.conditionLogic) {
      case 'AND':
        shouldTrigger = conditionResults.every(r => r);
        break;
      case 'OR':
        shouldTrigger = conditionResults.some(r => r);
        break;
      case 'MAJORITY':
        shouldTrigger = conditionResults.filter(r => r).length > conditionResults.length / 2;
        break;
    }

    if (shouldTrigger) {
      this.triggerDR(rule);
    }
  }

  private evaluateCondition(condition: TriggerCondition): boolean {
    const history = this.signalHistory.get(condition.signalSource) || [];
    const cutoff = new Date(Date.now() - condition.sustainedSeconds * 1000);
    const relevantSignals = history.filter(s => s.timestamp > cutoff);

    if (relevantSignals.length === 0) return false;

    switch (condition.operator) {
      case 'equals':
        return relevantSignals.every(s => s.status === condition.value);
      case 'not-equals':
        return relevantSignals.every(s => s.status !== condition.value);
      case 'count-exceeds':
        // This requires a different signal format (counting events)
        return relevantSignals.length > condition.value;
      default:
        return false;
    }
  }

  private async triggerDR(rule: TriggerRule): Promise<void> {
    this.lastTriggered.set(rule.name, new Date());

    console.log(`[DR TRIGGER] Rule '${rule.name}' activated`);
    console.log(`[DR TRIGGER] Description: ${rule.description}`);
    console.log(`[DR TRIGGER] Automation level: ${rule.automationLevel}`);

    switch (rule.automationLevel) {
      case 'notify':
        await this.sendNotification(rule);
        break;
      case 'prepare':
        await this.sendNotification(rule);
        await this.prepareDREnvironment();
        break;
      case 'execute-with-approval':
        await this.sendNotification(rule);
        await this.prepareDREnvironment();
        await this.requestApproval(rule);
        break;
      case 'auto-execute':
        await this.sendNotification(rule);
        await this.executeDRFailover(rule);
        break;
    }
  }

  private async sendNotification(rule: TriggerRule): Promise<void> {
    // Send alerts via PagerDuty, Slack, etc.
    console.log(`[NOTIFY] DR trigger notification for: ${rule.name}`);
  }

  private async prepareDREnvironment(): Promise<void> {
    // Pre-warm DR environment, verify readiness
    console.log(`[PREPARE] Warming DR environment...`);
  }

  private async requestApproval(rule: TriggerRule): Promise<void> {
    // Queue approval request, wait for human decision
    console.log(`[APPROVAL] Requesting human approval for: ${rule.name}`);
  }

  private async executeDRFailover(rule: TriggerRule): Promise<void> {
    // Invoke DR orchestration workflow
    console.log(`[EXECUTE] Initiating automated DR failover for: ${rule.name}`);
  }
}
```

Fully automatic DR triggering is powerful but dangerous. A false positive can cause unnecessary failover, data inconsistency, or customer-visible disruption. Start with 'notify' automation, progress to 'prepare', then 'execute-with-approval', and only enable 'auto-execute' for clear, unambiguous failure scenarios after extensive testing.
Automation executes at machine speed—including mistakes. Guardrails prevent automated systems from causing harm that humans would have avoided:
Essential Safeguards:

- Kill switch: a single, always-available control that halts all DR automation immediately
- Cooldown periods: a minimum interval before the same action can run again
- Rate limits: a cap on how many automated actions may execute within a time window
- Scope validation: an allowlist of resources each action type may touch, so automation cannot act outside its intended blast radius
- Approval gates: human confirmation required for the highest-impact actions
```typescript
// DR Automation Guardrails Implementation

class DRGuardrails {
  private killSwitchEnabled: boolean = false;
  private actionCounts: Map<string, number> = new Map();
  private lastActions: Map<string, Date> = new Map();

  private readonly RATE_LIMIT_WINDOW_MINUTES = 60;
  private readonly MAX_ACTIONS_PER_WINDOW = 5;
  private readonly COOLDOWN_MINUTES = 30;

  // Emergency kill switch - stops all automation
  enableKillSwitch(reason: string): void {
    this.killSwitchEnabled = true;
    console.log(`[KILL SWITCH] Enabled by operator. Reason: ${reason}`);
    this.notifyAllChannels(`DR Automation halted: ${reason}`);
  }

  disableKillSwitch(approverName: string): void {
    console.log(`[KILL SWITCH] Disabled by ${approverName}`);
    this.killSwitchEnabled = false;
  }

  // Check all guardrails before allowing action
  canExecute(actionType: string, scope: string[]): GuardrailResult {
    // Check 1: Kill switch
    if (this.killSwitchEnabled) {
      return {
        allowed: false,
        reason: 'Kill switch is enabled',
        code: 'KILL_SWITCH'
      };
    }

    // Check 2: Cooldown
    const lastAction = this.lastActions.get(actionType);
    if (lastAction) {
      const minutesSinceLastAction = (Date.now() - lastAction.getTime()) / 60000;
      if (minutesSinceLastAction < this.COOLDOWN_MINUTES) {
        return {
          allowed: false,
          reason: `Cooldown active. ${this.COOLDOWN_MINUTES - minutesSinceLastAction} minutes remaining.`,
          code: 'COOLDOWN'
        };
      }
    }

    // Check 3: Rate limit
    const currentWindow = this.getCurrentWindowStart();
    const actionsInWindow = this.getActionsInWindow(actionType, currentWindow);
    if (actionsInWindow >= this.MAX_ACTIONS_PER_WINDOW) {
      return {
        allowed: false,
        reason: `Rate limit exceeded. Max ${this.MAX_ACTIONS_PER_WINDOW} actions per ${this.RATE_LIMIT_WINDOW_MINUTES} minutes.`,
        code: 'RATE_LIMIT'
      };
    }

    // Check 4: Scope validation
    const scopeCheck = this.validateScope(actionType, scope);
    if (!scopeCheck.valid) {
      return {
        allowed: false,
        reason: `Scope validation failed: ${scopeCheck.reason}`,
        code: 'SCOPE_VIOLATION'
      };
    }

    // All checks passed
    return {
      allowed: true,
      reason: 'All guardrails passed',
      code: 'OK'
    };
  }

  // Record that an action was taken (for rate limiting)
  recordAction(actionType: string): void {
    this.lastActions.set(actionType, new Date());
    const key = `${actionType}:${this.getCurrentWindowStart().toISOString()}`;
    this.actionCounts.set(key, (this.actionCounts.get(key) || 0) + 1);
  }

  // Validate that the action scope is within allowed boundaries
  private validateScope(actionType: string, scope: string[]): { valid: boolean; reason?: string } {
    const allowedScopes: Record<string, string[]> = {
      'database-failover': ['prod-db-primary', 'prod-db-replica-dr'],
      'dns-cutover': ['api.example.com', 'www.example.com'],
      'application-scale': ['us-west-2', 'eu-west-1']
    };

    const allowed = allowedScopes[actionType] || [];
    const unauthorized = scope.filter(s => !allowed.includes(s));

    if (unauthorized.length > 0) {
      return {
        valid: false,
        reason: `Unauthorized scope: ${unauthorized.join(', ')}`
      };
    }

    return { valid: true };
  }

  private getCurrentWindowStart(): Date {
    const now = new Date();
    const windowMs = this.RATE_LIMIT_WINDOW_MINUTES * 60 * 1000;
    return new Date(Math.floor(now.getTime() / windowMs) * windowMs);
  }

  private getActionsInWindow(actionType: string, windowStart: Date): number {
    const key = `${actionType}:${windowStart.toISOString()}`;
    return this.actionCounts.get(key) || 0;
  }

  private notifyAllChannels(message: string): void {
    // Send to PagerDuty, Slack, email, etc.
    console.log(`[BROADCAST] ${message}`);
  }
}

interface GuardrailResult {
  allowed: boolean;
  reason: string;
  code: string;
}
```

Automated DR must be tested as rigorously as manual DR—more rigorously, because it will execute without human review. Testing validates that automation does what it should, handles errors correctly, and that guardrails actually prevent harm.
Testing Levels:
Unit Testing: Each script and function is tested in isolation with mocked dependencies. Verify logic, error handling, and edge cases.
Integration Testing: Workflows are tested end-to-end against test infrastructure. Verify that steps connect correctly, outputs flow to inputs, and orchestration logic is correct.
Failure Injection Testing: Deliberately introduce failures at various points to verify that error handling, retries, and rollbacks work correctly.
Production Simulation: Run automation against production-like environment with production-scale data. Verify timing, resource consumption, and behavior under load.
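As a sketch of what the unit-testing level looks like in practice, the snippet below reimplements the MAJORITY quorum rule as a standalone function so it can be tested in isolation. The `majorityPassed` helper is hypothetical, assumed equivalent to the quorum logic in the trigger system:

```typescript
// Unit-test sketch: pin down the edge cases of the MAJORITY quorum rule.
// `majorityPassed` is a hypothetical standalone version of the rule,
// extracted so it can be exercised without any infrastructure.

function majorityPassed(results: boolean[]): boolean {
  return results.filter(r => r).length > results.length / 2;
}

// Edge cases worth asserting explicitly:
if (majorityPassed([true, true, false]) !== true) throw new Error("2 of 3 should trigger");
if (majorityPassed([true, false, false]) !== false) throw new Error("1 of 3 should not trigger");
if (majorityPassed([true, false]) !== false) throw new Error("1 of 2 is a tie, not a majority");
if (majorityPassed([]) !== false) throw new Error("no signals should never trigger");
console.log("majority rule edge cases pass");
```

The tie and empty-input cases are exactly the ones that get missed when quorum logic is only exercised end-to-end, which is why they belong at the unit level.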
| Test Type | Scope | Environment | Frequency | Focus |
|---|---|---|---|---|
| Unit Tests | Individual scripts | CI/CD pipeline | Every commit | Logic, edge cases, error handling |
| Integration Tests | Workflow sequences | Staging | Daily | Step connections, data flow, orchestration |
| Failure Injection | Complete workflows | Staging | Weekly | Error paths, rollback, guardrails |
| Staging DR Test | Full automation | Staging (prod-like) | Monthly | Timing, integration, scale |
| Production DR Test | Full automation | Production | Semi-annually | Real-world validation, true RTO |
Chaos Engineering for DR Automation:
Beyond testing happy paths, deliberately break things to validate resilience:

- Kill the orchestrator itself mid-workflow and verify the system lands in a known, recoverable state
- Feed the trigger system conflicting or flapping health signals and confirm it does not oscillate between failover and failback
- Force validation steps to time out and verify that retries and rollback paths engage
- Revoke credentials or permissions mid-run to exercise error handling
Don't just test that automation works—test that guardrails work. Deliberately trigger guardrail conditions and verify they halt automation as expected. A guardrail that fails to activate is worse than no guardrail, because you'll trust it until it's too late.
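A guardrail test can be sketched with a minimal stand-in for the kill-switch check. The `KillSwitchGuard` class below is hypothetical, mirroring the behavior of the `DRGuardrails` kill switch above rather than being the real implementation:

```typescript
// Guardrail test sketch: deliberately trip the kill switch and verify
// it actually blocks execution. A guardrail that silently fails to
// activate is the failure mode this test exists to catch.

class KillSwitchGuard {
  private enabled = false;
  enable(): void { this.enabled = true; }
  canExecute(): { allowed: boolean; code: string } {
    return this.enabled
      ? { allowed: false, code: 'KILL_SWITCH' }
      : { allowed: true, code: 'OK' };
  }
}

const guard = new KillSwitchGuard();

// Baseline: with no kill switch, actions are permitted
if (!guard.canExecute().allowed) throw new Error("guardrail blocked a legitimate action");

// Deliberately trip the guardrail, then confirm it halts automation
guard.enable();
const result = guard.canExecute();
if (result.allowed) throw new Error("kill switch failed to halt automation");
console.log(result.code); // KILL_SWITCH
```

The same pattern applies to cooldowns, rate limits, and scope validation: drive each guardrail into its blocking condition on purpose and assert that execution is denied with the expected code.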
Module Complete: Disaster Recovery Mastery
You have now completed the comprehensive disaster recovery module, covering DR strategy and planning, implementation, testing, and automation.
These capabilities together form a mature disaster recovery program that enables your organization to survive and rapidly recover from catastrophic failures.
Congratulations! You have mastered disaster recovery—from strategic planning through automation. You now have the knowledge to design, implement, test, and continuously improve DR capabilities that protect your organization against the inevitable failures that all systems eventually face.