On November 25, 2020, Amazon Web Services experienced a major outage in US-East-1 that stretched on for hours. In theory, many customers could have recovered faster: they had runbooks, and they had DR sites standing ready. But when thousands of companies simultaneously tried to execute recovery procedures, each manual step became a bottleneck. Engineers were paged but needed time to wake up and regain context. Documentation was consulted. Commands were typed, retyped after errors, verified, and re-verified. Human decision-making, normally an asset, became the limiting factor.
Meanwhile, companies with automated DR recovered faster. Their systems detected the failure, initiated failover sequences, validated recovery, and rerouted traffic—all while human operators were still assessing the situation.
Automation transforms DR from a human-speed operation to a machine-speed operation. It removes the cognitive load of executing procedures under stress. It eliminates the variance between a well-rested senior engineer and a junior teammate woken at 3 AM. It enables sub-minute recovery where manual procedures take an hour.
But automation also introduces new risks: automated systems can fail in ways that humans wouldn't, can execute incorrect actions at machine speed, and can mask underlying problems that humans would notice. This page teaches you to automate wisely—capturing the benefits of speed and consistency while managing the risks of autonomous decision-making.
By the end of this page, you will understand the spectrum of DR automation from simple scripts to full orchestration. You'll learn what to automate first, how to design safe automation with appropriate guardrails, and how to balance automation benefits against the risks of autonomous action during critical situations.
DR automation exists on a spectrum from fully manual to fully autonomous. Understanding where you are and where you want to be is the foundation for an automation roadmap:
Level 0: Fully Manual Humans execute every step by reading documentation and running commands. Slowest and most error-prone, but provides maximum human oversight.
Level 1: Documented Scripts Individual steps are scripted, but a human initiates each script and decides when to proceed. Reduces typing errors and speeds execution, but humans remain in control.
Level 2: Orchestrated Workflows Multiple steps are combined into automated workflows. A human initiates the workflow, but it executes multiple steps in sequence with automated validation between stages.
Level 3: Triggered Automation Monitoring systems can trigger automated workflows based on defined conditions. Humans may approve the trigger (semi-automated) or automation runs immediately (fully automated).
Level 4: Autonomous Recovery The system detects failures, initiates recovery, validates success, and restores service without human involvement. Humans are notified but don't need to act.
Level 5: Self-Healing Architecture The system is designed such that component failures don't require recovery procedures at all. Redundancy, auto-scaling, and self-repair are built into the architecture.
| Level | Human Involvement | Typical RTO | Error Risk | Oversight | Cost to Implement |
|---|---|---|---|---|---|
| 0: Manual | 100% | Hours | High (human error) | Maximum | Lowest |
| 1: Scripts | 80% | 30-60 min | Medium | High | Low |
| 2: Orchestrated | 40% | 15-30 min | Low | Medium-High | Medium |
| 3: Triggered | 10-20% | 5-15 min | Low | Medium | High |
| 4: Autonomous | <5% | 1-5 min | Depends on quality | Low | Very High |
| 5: Self-Healing | ~0% | Seconds | Very Low | Minimal | Highest |
Most organizations should not jump to Level 4 or 5 automation immediately. Start by scripting individual steps (Level 1), then combine into workflows (Level 2), then add triggers (Level 3). Each level validates that automation is correct before reducing human oversight. Automating incorrect procedures at machine speed is worse than executing them slowly.
Not all DR activities benefit equally from automation. Prioritize based on how often a step runs, how long it takes manually, and how error-prone it is under stress:

High Priority for Automation: steps that are mechanical, frequently executed (in every test and every incident), time-consuming, or error-prone when typed under pressure, such as database failover, DNS cutover, configuration updates, and post-recovery verification.

Lower Priority for Automation: steps that are rare, inexpensive to perform manually, or that genuinely require human judgment, such as deciding whether to declare a disaster or communicating with stakeholders.
Automation ROI Calculation:
Estimate the value of automating a step:
Automation Value = (Manual Time Saved × Frequency × Hourly Cost) + (Errors Avoided × Cost per Error) - Automation Cost

Where:
- Manual Time Saved = Time to execute the step manually (including human delays), minus the automated execution time
- Frequency = Expected executions per year (including tests and incidents)
- Errors Avoided × Cost per Error = Expected cost of manual-execution errors that automation eliminates
- Automation Cost = Development + Testing + Maintenance
Focus first on steps with high Manual Time, high Frequency, or high Error Rate impact.
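To make the ROI estimate concrete, here is a small TypeScript sketch. The figures are invented for illustration, and the `AutomationCandidate` and `automationValue` names are hypothetical, not part of any framework:

```typescript
// Hypothetical worked example of the automation ROI estimate.
// All numbers are illustrative assumptions, not benchmarks.

interface AutomationCandidate {
  manualHoursSaved: number;   // engineer-hours saved per execution
  executionsPerYear: number;  // tests + incidents
  errorsAvoidedPerYear: number;
  costPerError: number;       // dollars
  automationCost: number;     // development + testing + maintenance, dollars
  hourlyCost: number;         // loaded engineer cost, dollars/hour
}

function automationValue(c: AutomationCandidate): number {
  const timeSavings = c.manualHoursSaved * c.executionsPerYear * c.hourlyCost;
  const errorSavings = c.errorsAvoidedPerYear * c.costPerError;
  return timeSavings + errorSavings - c.automationCost;
}

// Example: scripting a database failover step
const dbFailover: AutomationCandidate = {
  manualHoursSaved: 0.5,
  executionsPerYear: 12,     // monthly tests plus occasional incidents
  errorsAvoidedPerYear: 2,
  costPerError: 5000,
  automationCost: 8000,
  hourlyCost: 150
};

console.log(automationValue(dbFailover)); // 2900 => positive, worth automating
```

Even with modest time savings, avoided errors often dominate the calculation, which is why error-prone steps rank high regardless of frequency.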
The first step in DR automation is creating reliable, well-tested scripts for individual recovery tasks. These scripts become building blocks for higher-level orchestration.
Script Design Principles:

- Validate preconditions before acting, and fail fast with a clear message
- Log every action with timestamps to a persistent location
- Use distinct exit codes so callers (and orchestrators) can tell failure modes apart
- Bound every wait with a timeout rather than polling forever
- Verify results after acting; never assume a command succeeded
- Make scripts safe to re-run (idempotent) wherever possible

The failover script below illustrates these principles:
```bash
#!/bin/bash
# =============================================================================
# DR Database Failover Script
#
# Purpose: Promotes read replica to primary and updates application config
#
# Prerequisites:
# - AWS CLI configured with appropriate credentials
# - jq installed
# - Access to Parameter Store for config updates
#
# Usage:
# ./dr-database-failover.sh --replica-id <replica-id> --region <region>
#
# Returns:
# 0 = Success
# 1 = Validation failure
# 2 = Failover execution failure
# 3 = Post-failover verification failure
# 4 = Timeout
# =============================================================================

set -euo pipefail

# Configuration
SCRIPT_NAME="dr-database-failover"
LOG_FILE="/var/log/dr/${SCRIPT_NAME}-$(date +%Y%m%d-%H%M%S).log"
FAILOVER_TIMEOUT_SECONDS=600
VERIFICATION_RETRIES=10
VERIFICATION_DELAY_SECONDS=30

# Logging function
log() {
  local level="$1"
  local message="$2"
  local timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
  echo "[${timestamp}] [${level}] ${message}" | tee -a "${LOG_FILE}"
}

# Parse arguments
REPLICA_ID=""
REGION=""

while [[ $# -gt 0 ]]; do
  case $1 in
    --replica-id)
      REPLICA_ID="$2"
      shift 2
      ;;
    --region)
      REGION="$2"
      shift 2
      ;;
    *)
      log "ERROR" "Unknown argument: $1"
      exit 1
      ;;
  esac
done

# Validate arguments
if [[ -z "${REPLICA_ID}" || -z "${REGION}" ]]; then
  log "ERROR" "Required arguments: --replica-id and --region"
  exit 1
fi

log "INFO" "========================================="
log "INFO" "Starting DR Database Failover"
log "INFO" "Replica: ${REPLICA_ID}"
log "INFO" "Region: ${REGION}"
log "INFO" "========================================="

# =============================================================================
# Step 1: Pre-flight validation
# =============================================================================
log "INFO" "Step 1: Pre-flight validation"

# Check replica exists and is in available state
REPLICA_STATUS=$(aws rds describe-db-instances \
  --db-instance-identifier "${REPLICA_ID}" \
  --region "${REGION}" \
  --query 'DBInstances[0].DBInstanceStatus' \
  --output text 2>/dev/null || echo "NOT_FOUND")

if [[ "${REPLICA_STATUS}" != "available" ]]; then
  log "ERROR" "Replica ${REPLICA_ID} is not available. Current status: ${REPLICA_STATUS}"
  exit 1
fi

# Check replication status
REPLICA_LAG=$(aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name ReplicaLag \
  --dimensions Name=DBInstanceIdentifier,Value="${REPLICA_ID}" \
  --start-time $(date -u -d '5 minutes ago' +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 60 \
  --statistics Average \
  --region "${REGION}" \
  --query 'Datapoints[0].Average' \
  --output text 2>/dev/null || echo "UNKNOWN")

log "INFO" "Current replica lag: ${REPLICA_LAG} seconds"

if [[ "${REPLICA_LAG}" != "UNKNOWN" ]] && (( $(echo "${REPLICA_LAG} > 300" | bc -l) )); then
  log "WARN" "Replica lag is high (${REPLICA_LAG}s). Proceeding may result in data loss."
fi

log "INFO" "Pre-flight validation passed"

# =============================================================================
# Step 2: Initiate failover
# =============================================================================
log "INFO" "Step 2: Initiating replica promotion"

PROMOTION_START=$(date +%s)

aws rds promote-read-replica \
  --db-instance-identifier "${REPLICA_ID}" \
  --region "${REGION}"

log "INFO" "Promotion initiated. Waiting for completion..."

# =============================================================================
# Step 3: Wait for promotion to complete
# =============================================================================
log "INFO" "Step 3: Monitoring promotion progress"

while true; do
  CURRENT_STATUS=$(aws rds describe-db-instances \
    --db-instance-identifier "${REPLICA_ID}" \
    --region "${REGION}" \
    --query 'DBInstances[0].DBInstanceStatus' \
    --output text)

  ELAPSED=$(( $(date +%s) - ${PROMOTION_START} ))
  log "INFO" "Status: ${CURRENT_STATUS} (elapsed: ${ELAPSED}s)"

  if [[ "${CURRENT_STATUS}" == "available" ]]; then
    log "INFO" "Promotion completed successfully"
    break
  fi

  if [[ ${ELAPSED} -gt ${FAILOVER_TIMEOUT_SECONDS} ]]; then
    log "ERROR" "Promotion timed out after ${FAILOVER_TIMEOUT_SECONDS} seconds"
    exit 4
  fi

  sleep 15
done

# =============================================================================
# Step 4: Verify write capability
# =============================================================================
log "INFO" "Step 4: Verifying write capability"

# Get endpoint
NEW_ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier "${REPLICA_ID}" \
  --region "${REGION}" \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)

log "INFO" "New primary endpoint: ${NEW_ENDPOINT}"

# Test write capability (requires appropriate db credentials configured)
for i in $(seq 1 ${VERIFICATION_RETRIES}); do
  log "INFO" "Verification attempt ${i}/${VERIFICATION_RETRIES}"

  if psql -h "${NEW_ENDPOINT}" -U admin -d production -c \
    "INSERT INTO dr_test (test_id, created_at) VALUES ('failover-$(date +%s)', NOW());"; then
    log "INFO" "Write verification successful"
    break
  fi

  if [[ ${i} -eq ${VERIFICATION_RETRIES} ]]; then
    log "ERROR" "Write verification failed after ${VERIFICATION_RETRIES} attempts"
    exit 3
  fi

  log "WARN" "Write test failed, retrying in ${VERIFICATION_DELAY_SECONDS}s..."
  sleep ${VERIFICATION_DELAY_SECONDS}
done

# =============================================================================
# Step 5: Update application configuration
# =============================================================================
log "INFO" "Step 5: Updating application configuration"

aws ssm put-parameter \
  --name "/production/database/primary-endpoint" \
  --value "${NEW_ENDPOINT}" \
  --type "String" \
  --overwrite \
  --region "${REGION}"

log "INFO" "Parameter Store updated with new endpoint"

# =============================================================================
# Complete
# =============================================================================
TOTAL_TIME=$(( $(date +%s) - ${PROMOTION_START} ))

log "INFO" "========================================="
log "INFO" "Database failover completed successfully"
log "INFO" "Total time: ${TOTAL_TIME} seconds"
log "INFO" "New primary: ${NEW_ENDPOINT}"
log "INFO" "========================================="

exit 0
```

Individual scripts are building blocks. Workflow orchestration connects them into complete recovery procedures, managing dependencies, parallelism, error handling, and human approval gates.
Orchestration Capabilities:

- Sequencing steps with explicit dependencies between stages
- Running independent steps in parallel to shrink recovery time
- Retries with backoff for transient failures
- Validation between stages, so a failed step halts the workflow instead of compounding damage
- Human approval gates at high-risk decision points
- Automatic rollback paths when a stage fails partway through
- Notification of outcome, including partial successes

The workflow below, expressed as an AWS Step Functions state machine, demonstrates these capabilities:
```yaml
# AWS Step Functions State Machine Definition
# Full DR Failover Workflow

name: FullDRFailoverWorkflow
type: StateMachine
definition:
  Comment: "Orchestrates complete DR failover from primary to DR region"
  StartAt: ValidatePreConditions
  States:
    # ==========================================================================
    # Phase 1: Validation
    # ==========================================================================
    ValidatePreConditions:
      Type: Parallel
      Next: HumanApprovalGate
      Branches:
        - StartAt: CheckDRSiteHealth
          States:
            CheckDRSiteHealth:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:check-dr-site-health"
              ResultPath: "$.drSiteHealth"
              End: true
        - StartAt: CheckReplicationStatus
          States:
            CheckReplicationStatus:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:check-replication-status"
              ResultPath: "$.replicationStatus"
              End: true
        - StartAt: ValidateCredentials
          States:
            ValidateCredentials:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:validate-dr-credentials"
              ResultPath: "$.credentialStatus"
              End: true
      Catch:
        - ErrorEquals: ["States.ALL"]
          Next: FailValidation

    HumanApprovalGate:
      Type: Task
      Resource: "arn:aws:states:::sqs:sendMessage.waitForTaskToken"
      Parameters:
        QueueUrl: "https://sqs.us-west-2.amazonaws.com/123456789/dr-approval-queue"
        MessageBody:
          TaskToken.$: "$$.Task.Token"
          Message: "DR Failover requires approval"
          PreConditionResults.$: "$"
          RequestedBy.$: "$.requestedBy"
      TimeoutSeconds: 3600  # 1 hour to approve
      Next: DatabaseFailover
      Catch:
        - ErrorEquals: ["States.Timeout"]
          Next: ApprovalTimeout

    # ==========================================================================
    # Phase 2: Database Failover
    # ==========================================================================
    DatabaseFailover:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:execute-database-failover"
      Parameters:
        ReplicaId.$: "$.config.drReplicaId"
        Region: "us-west-2"
      ResultPath: "$.databaseResult"
      TimeoutSeconds: 900  # 15 minutes
      Retry:
        - ErrorEquals: ["RetryableError"]
          IntervalSeconds: 30
          MaxAttempts: 3
          BackoffRate: 2
      Next: VerifyDatabaseConnectivity
      Catch:
        - ErrorEquals: ["States.ALL"]
          Next: DatabaseFailoverFailed

    VerifyDatabaseConnectivity:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:verify-database-connectivity"
      Parameters:
        Endpoint.$: "$.databaseResult.newEndpoint"
      ResultPath: "$.databaseVerification"
      Next: ParallelApplicationRecovery
      Catch:
        - ErrorEquals: ["States.ALL"]
          Next: DatabaseVerificationFailed

    # ==========================================================================
    # Phase 3: Application Recovery (Parallel)
    # ==========================================================================
    ParallelApplicationRecovery:
      Type: Parallel
      Next: DNSCutover
      Branches:
        - StartAt: ScaleApplicationTier
          States:
            ScaleApplicationTier:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:scale-application-tier"
              Parameters:
                DesiredCount: 10
                Region: "us-west-2"
              End: true
        - StartAt: WarmCaches
          States:
            WarmCaches:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:warm-application-caches"
              End: true
        - StartAt: UpdateConfiguration
          States:
            UpdateConfiguration:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:sync-application-config"
              End: true
      Catch:
        - ErrorEquals: ["States.ALL"]
          Next: ApplicationRecoveryFailed

    # ==========================================================================
    # Phase 4: Traffic Cutover
    # ==========================================================================
    DNSCutover:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:execute-dns-cutover"
      Parameters:
        TargetRegion: "us-west-2"
        TrafficPercentage: 5
      ResultPath: "$.dnsResult"
      Next: ValidateLimitedTraffic

    ValidateLimitedTraffic:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:validate-error-rates"
      Parameters:
        Threshold: 0.01       # 1% error rate
        DurationSeconds: 180  # 3 minutes observation
      ResultPath: "$.limitedTrafficValidation"
      Next: CheckLimitedTrafficResult

    CheckLimitedTrafficResult:
      Type: Choice
      Choices:
        - Variable: "$.limitedTrafficValidation.passed"
          BooleanEquals: true
          Next: FullTrafficCutover
        - Variable: "$.limitedTrafficValidation.passed"
          BooleanEquals: false
          Next: RollbackDNS
      Default: FullTrafficCutover

    FullTrafficCutover:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:execute-dns-cutover"
      Parameters:
        TargetRegion: "us-west-2"
        TrafficPercentage: 100
      Next: FinalValidation

    # ==========================================================================
    # Phase 5: Final Validation
    # ==========================================================================
    FinalValidation:
      Type: Parallel
      Next: Success
      Branches:
        - StartAt: RunSmokeTests
          States:
            RunSmokeTests:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:run-smoke-tests"
              End: true
        - StartAt: VerifyIntegrations
          States:
            VerifyIntegrations:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:verify-integrations"
              End: true
        - StartAt: ConfirmMonitoring
          States:
            ConfirmMonitoring:
              Type: Task
              Resource: "arn:aws:lambda:us-west-2:123456789:function:confirm-dr-monitoring"
              End: true
      Catch:
        - ErrorEquals: ["States.ALL"]
          Next: ValidationWarning

    # ==========================================================================
    # Success and Failure States
    # ==========================================================================
    Success:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:notify-dr-complete"
      Parameters:
        Status: "SUCCESS"
        Summary.$: "$"
      End: true

    ValidationWarning:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:notify-dr-complete"
      Parameters:
        Status: "SUCCESS_WITH_WARNINGS"
        Summary.$: "$"
      End: true

    # Failure states
    FailValidation:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:notify-dr-failure"
      Parameters:
        Stage: "VALIDATION"
        Error.$: "$"
      End: true

    ApprovalTimeout:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:notify-dr-failure"
      Parameters:
        Stage: "APPROVAL"
        Error: "Approval timeout - no response within 1 hour"
      End: true

    DatabaseFailoverFailed:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:notify-dr-failure"
      Parameters:
        Stage: "DATABASE_FAILOVER"
        Error.$: "$"
      End: true

    DatabaseVerificationFailed:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:initiate-dr-rollback"
      Parameters:
        Stage: "DATABASE_VERIFICATION"
        RollbackSteps: ["database"]
      End: true

    ApplicationRecoveryFailed:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:initiate-dr-rollback"
      Parameters:
        Stage: "APPLICATION_RECOVERY"
        RollbackSteps: ["application", "database"]
      End: true

    RollbackDNS:
      Type: Task
      Resource: "arn:aws:lambda:us-west-2:123456789:function:initiate-dr-rollback"
      Parameters:
        Stage: "TRAFFIC_CUTOVER"
        RollbackSteps: ["dns", "application", "database"]
      End: true
```

Common orchestration platforms include AWS Step Functions, Azure Logic Apps, Google Cloud Workflows, Temporal, and Kubernetes Jobs/Operators. Choose based on your infrastructure and team expertise. The orchestration logic matters more than the platform—well-designed workflows work across tools.
The most advanced level of DR automation connects monitoring to recovery—when a failure is detected, automated response is initiated without waiting for human intervention.
Detection Requirements:
Automated triggering depends on reliable detection. False positives trigger unnecessary (and potentially harmful) failovers; false negatives miss real failures. Detection systems must:

- Combine multiple independent signals (synthetic transactions, external monitors, platform alarms) rather than trusting any single source
- Require conditions to be sustained for a defined duration, so transient blips don't trigger failover
- Support quorum logic (AND, OR, majority) across conditions
- Enforce cooldown periods so one incident cannot cause repeated triggering
- Map each failure scenario to an appropriate automation level, from notify-only up to auto-execute
```typescript
// Automated DR Trigger System
// Monitors health signals and initiates recovery when thresholds are breached

interface HealthSignal {
  source: string;
  status: 'healthy' | 'degraded' | 'unhealthy' | 'unknown';
  timestamp: Date;
  confidence: number; // 0-1
  details: Record<string, any>;
}

interface TriggerRule {
  name: string;
  description: string;
  conditions: TriggerCondition[];
  conditionLogic: 'AND' | 'OR' | 'MAJORITY';
  cooldownMinutes: number;
  requiresApproval: boolean;
  automationLevel: 'notify' | 'prepare' | 'execute-with-approval' | 'auto-execute';
}

interface TriggerCondition {
  signalSource: string;
  operator: 'equals' | 'not-equals' | 'duration-exceeds' | 'count-exceeds';
  value: any;
  sustainedSeconds: number; // Must be true for this duration
}

class DRTriggerSystem {
  private rules: TriggerRule[] = [];
  private lastTriggered: Map<string, Date> = new Map();
  private signalHistory: Map<string, HealthSignal[]> = new Map();

  constructor() {
    this.initializeRules();
  }

  private initializeRules(): void {
    // Rule 1: Complete region failure
    this.rules.push({
      name: 'RegionFailure',
      description: 'Primary region completely unavailable',
      conditions: [
        {
          signalSource: 'synthetic-transactions-primary',
          operator: 'equals',
          value: 'unhealthy',
          sustainedSeconds: 180 // 3 minutes sustained failure
        },
        {
          signalSource: 'external-uptime-monitor',
          operator: 'equals',
          value: 'unhealthy',
          sustainedSeconds: 180
        },
        {
          signalSource: 'cloudwatch-alarms-primary',
          operator: 'count-exceeds',
          value: 5, // More than 5 critical alarms
          sustainedSeconds: 120
        }
      ],
      conditionLogic: 'MAJORITY', // At least 2 of 3 must trigger
      cooldownMinutes: 60,
      requiresApproval: false, // Auto-execute for clear regional failure
      automationLevel: 'auto-execute'
    });

    // Rule 2: Database primary failure
    this.rules.push({
      name: 'DatabaseFailure',
      description: 'Primary database unresponsive',
      conditions: [
        {
          signalSource: 'database-health-check',
          operator: 'equals',
          value: 'unhealthy',
          sustainedSeconds: 120
        },
        {
          signalSource: 'application-db-errors',
          operator: 'count-exceeds',
          value: 100,
          sustainedSeconds: 60
        }
      ],
      conditionLogic: 'AND',
      cooldownMinutes: 30,
      requiresApproval: true, // Require approval for database failover
      automationLevel: 'execute-with-approval'
    });

    // Rule 3: Network connectivity issues
    this.rules.push({
      name: 'NetworkFailure',
      description: 'Primary region network unreachable',
      conditions: [
        {
          signalSource: 'network-health-primary',
          operator: 'equals',
          value: 'unhealthy',
          sustainedSeconds: 300 // 5 minutes - network issues may be transient
        },
        {
          signalSource: 'vpn-tunnel-primary',
          operator: 'equals',
          value: 'unhealthy',
          sustainedSeconds: 300
        }
      ],
      conditionLogic: 'AND',
      cooldownMinutes: 60,
      requiresApproval: true,
      automationLevel: 'execute-with-approval'
    });
  }

  processSignal(signal: HealthSignal): void {
    // Store in history
    const history = this.signalHistory.get(signal.source) || [];
    history.push(signal);

    // Keep last 10 minutes of history
    const cutoff = new Date(Date.now() - 600000);
    this.signalHistory.set(
      signal.source,
      history.filter(s => s.timestamp > cutoff)
    );

    // Evaluate all rules
    for (const rule of this.rules) {
      this.evaluateRule(rule);
    }
  }

  private evaluateRule(rule: TriggerRule): void {
    // Check cooldown
    const lastTrigger = this.lastTriggered.get(rule.name);
    if (lastTrigger && (Date.now() - lastTrigger.getTime()) < rule.cooldownMinutes * 60 * 1000) {
      return; // In cooldown, don't evaluate
    }

    // Evaluate conditions
    const conditionResults = rule.conditions.map(c => this.evaluateCondition(c));

    let shouldTrigger = false;
    switch (rule.conditionLogic) {
      case 'AND':
        shouldTrigger = conditionResults.every(r => r);
        break;
      case 'OR':
        shouldTrigger = conditionResults.some(r => r);
        break;
      case 'MAJORITY':
        shouldTrigger = conditionResults.filter(r => r).length > conditionResults.length / 2;
        break;
    }

    if (shouldTrigger) {
      this.triggerDR(rule);
    }
  }

  private evaluateCondition(condition: TriggerCondition): boolean {
    const history = this.signalHistory.get(condition.signalSource) || [];
    const cutoff = new Date(Date.now() - condition.sustainedSeconds * 1000);
    const relevantSignals = history.filter(s => s.timestamp > cutoff);

    if (relevantSignals.length === 0) return false;

    switch (condition.operator) {
      case 'equals':
        return relevantSignals.every(s => s.status === condition.value);
      case 'not-equals':
        return relevantSignals.every(s => s.status !== condition.value);
      case 'count-exceeds':
        // This requires a different signal format (counting events)
        return relevantSignals.length > condition.value;
      default:
        return false;
    }
  }

  private async triggerDR(rule: TriggerRule): Promise<void> {
    this.lastTriggered.set(rule.name, new Date());

    console.log(`[DR TRIGGER] Rule '${rule.name}' activated`);
    console.log(`[DR TRIGGER] Description: ${rule.description}`);
    console.log(`[DR TRIGGER] Automation level: ${rule.automationLevel}`);

    switch (rule.automationLevel) {
      case 'notify':
        await this.sendNotification(rule);
        break;
      case 'prepare':
        await this.sendNotification(rule);
        await this.prepareDREnvironment();
        break;
      case 'execute-with-approval':
        await this.sendNotification(rule);
        await this.prepareDREnvironment();
        await this.requestApproval(rule);
        break;
      case 'auto-execute':
        await this.sendNotification(rule);
        await this.executeDRFailover(rule);
        break;
    }
  }

  private async sendNotification(rule: TriggerRule): Promise<void> {
    // Send alerts via PagerDuty, Slack, etc.
    console.log(`[NOTIFY] DR trigger notification for: ${rule.name}`);
  }

  private async prepareDREnvironment(): Promise<void> {
    // Pre-warm DR environment, verify readiness
    console.log(`[PREPARE] Warming DR environment...`);
  }

  private async requestApproval(rule: TriggerRule): Promise<void> {
    // Queue approval request, wait for human decision
    console.log(`[APPROVAL] Requesting human approval for: ${rule.name}`);
  }

  private async executeDRFailover(rule: TriggerRule): Promise<void> {
    // Invoke DR orchestration workflow
    console.log(`[EXECUTE] Initiating automated DR failover for: ${rule.name}`);
  }
}
```

Fully automatic DR triggering is powerful but dangerous. A false positive can cause unnecessary failover, data inconsistency, or customer-visible disruption. Start with 'notify' automation, progress to 'prepare', then 'execute-with-approval', and only enable 'auto-execute' for clear, unambiguous failure scenarios after extensive testing.
Automation executes at machine speed—including mistakes. Guardrails prevent automated systems from causing harm that humans would have avoided:
Essential Safeguards:

- Kill switch: a single, always-available control that halts all DR automation immediately
- Cooldown periods: a minimum interval before the same action can run again
- Rate limits: a cap on how many automated actions may execute within a time window
- Scope validation: an allowlist of resources each action type may touch, so automation cannot act outside its intended blast radius
- Approval gates: human confirmation required for the highest-impact actions
```typescript
// DR Automation Guardrails Implementation

class DRGuardrails {
  private killSwitchEnabled: boolean = false;
  private actionCounts: Map<string, number> = new Map();
  private lastActions: Map<string, Date> = new Map();

  private readonly RATE_LIMIT_WINDOW_MINUTES = 60;
  private readonly MAX_ACTIONS_PER_WINDOW = 5;
  private readonly COOLDOWN_MINUTES = 30;

  // Emergency kill switch - stops all automation
  enableKillSwitch(reason: string): void {
    this.killSwitchEnabled = true;
    console.log(`[KILL SWITCH] Enabled by operator. Reason: ${reason}`);
    this.notifyAllChannels(`DR Automation halted: ${reason}`);
  }

  disableKillSwitch(approverName: string): void {
    console.log(`[KILL SWITCH] Disabled by ${approverName}`);
    this.killSwitchEnabled = false;
  }

  // Check all guardrails before allowing action
  canExecute(actionType: string, scope: string[]): GuardrailResult {
    // Check 1: Kill switch
    if (this.killSwitchEnabled) {
      return {
        allowed: false,
        reason: 'Kill switch is enabled',
        code: 'KILL_SWITCH'
      };
    }

    // Check 2: Cooldown
    const lastAction = this.lastActions.get(actionType);
    if (lastAction) {
      const minutesSinceLastAction = (Date.now() - lastAction.getTime()) / 60000;
      if (minutesSinceLastAction < this.COOLDOWN_MINUTES) {
        return {
          allowed: false,
          reason: `Cooldown active. ${this.COOLDOWN_MINUTES - minutesSinceLastAction} minutes remaining.`,
          code: 'COOLDOWN'
        };
      }
    }

    // Check 3: Rate limit
    const currentWindow = this.getCurrentWindowStart();
    const actionsInWindow = this.getActionsInWindow(actionType, currentWindow);
    if (actionsInWindow >= this.MAX_ACTIONS_PER_WINDOW) {
      return {
        allowed: false,
        reason: `Rate limit exceeded. Max ${this.MAX_ACTIONS_PER_WINDOW} actions per ${this.RATE_LIMIT_WINDOW_MINUTES} minutes.`,
        code: 'RATE_LIMIT'
      };
    }

    // Check 4: Scope validation
    const scopeCheck = this.validateScope(actionType, scope);
    if (!scopeCheck.valid) {
      return {
        allowed: false,
        reason: `Scope validation failed: ${scopeCheck.reason}`,
        code: 'SCOPE_VIOLATION'
      };
    }

    // All checks passed
    return {
      allowed: true,
      reason: 'All guardrails passed',
      code: 'OK'
    };
  }

  // Record that an action was taken (for rate limiting)
  recordAction(actionType: string): void {
    this.lastActions.set(actionType, new Date());
    const key = `${actionType}:${this.getCurrentWindowStart().toISOString()}`;
    this.actionCounts.set(key, (this.actionCounts.get(key) || 0) + 1);
  }

  // Validate that the action scope is within allowed boundaries
  private validateScope(actionType: string, scope: string[]): { valid: boolean; reason?: string } {
    const allowedScopes: Record<string, string[]> = {
      'database-failover': ['prod-db-primary', 'prod-db-replica-dr'],
      'dns-cutover': ['api.example.com', 'www.example.com'],
      'application-scale': ['us-west-2', 'eu-west-1']
    };

    const allowed = allowedScopes[actionType] || [];
    const unauthorized = scope.filter(s => !allowed.includes(s));

    if (unauthorized.length > 0) {
      return {
        valid: false,
        reason: `Unauthorized scope: ${unauthorized.join(', ')}`
      };
    }

    return { valid: true };
  }

  private getCurrentWindowStart(): Date {
    const now = new Date();
    const windowMs = this.RATE_LIMIT_WINDOW_MINUTES * 60 * 1000;
    return new Date(Math.floor(now.getTime() / windowMs) * windowMs);
  }

  private getActionsInWindow(actionType: string, windowStart: Date): number {
    const key = `${actionType}:${windowStart.toISOString()}`;
    return this.actionCounts.get(key) || 0;
  }

  private notifyAllChannels(message: string): void {
    // Send to PagerDuty, Slack, email, etc.
    console.log(`[BROADCAST] ${message}`);
  }
}

interface GuardrailResult {
  allowed: boolean;
  reason: string;
  code: string;
}
```

Automated DR must be tested as rigorously as manual DR—more rigorously, because it will execute without human review. Testing validates that automation does what it should, handles errors correctly, and that guardrails actually prevent harm.
Testing Levels:
Unit Testing: Each script and function is tested in isolation with mocked dependencies. Verify logic, error handling, and edge cases.
Integration Testing: Workflows are tested end-to-end against test infrastructure. Verify that steps connect correctly, outputs flow to inputs, and orchestration logic is correct.
Failure Injection Testing: Deliberately introduce failures at various points to verify that error handling, retries, and rollbacks work correctly.
Production Simulation: Run automation against production-like environment with production-scale data. Verify timing, resource consumption, and behavior under load.
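As a sketch of what the unit-testing level looks like in practice, the snippet below reimplements the MAJORITY quorum rule as a standalone function so it can be tested in isolation. The `majorityPassed` helper is hypothetical, assumed equivalent to the quorum logic in the trigger system:

```typescript
// Unit-test sketch: pin down the edge cases of the MAJORITY quorum rule.
// `majorityPassed` is a hypothetical standalone version of the rule,
// extracted so it can be exercised without any infrastructure.

function majorityPassed(results: boolean[]): boolean {
  return results.filter(r => r).length > results.length / 2;
}

// Edge cases worth asserting explicitly:
if (majorityPassed([true, true, false]) !== true) throw new Error("2 of 3 should trigger");
if (majorityPassed([true, false, false]) !== false) throw new Error("1 of 3 should not trigger");
if (majorityPassed([true, false]) !== false) throw new Error("1 of 2 is a tie, not a majority");
if (majorityPassed([]) !== false) throw new Error("no signals should never trigger");
console.log("majority rule edge cases pass");
```

The tie and empty-input cases are exactly the ones that get missed when quorum logic is only exercised end-to-end, which is why they belong at the unit level.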
| Test Type | Scope | Environment | Frequency | Focus |
|---|---|---|---|---|
| Unit Tests | Individual scripts | CI/CD pipeline | Every commit | Logic, edge cases, error handling |
| Integration Tests | Workflow sequences | Staging | Daily | Step connections, data flow, orchestration |
| Failure Injection | Complete workflows | Staging | Weekly | Error paths, rollback, guardrails |
| Staging DR Test | Full automation | Staging (prod-like) | Monthly | Timing, integration, scale |
| Production DR Test | Full automation | Production | Semi-annually | Real-world validation, true RTO |
Chaos Engineering for DR Automation:
Beyond testing happy paths, deliberately break things to validate resilience:

- Kill the orchestrator itself mid-workflow and verify the system lands in a known, recoverable state
- Feed the trigger system conflicting or flapping health signals and confirm it does not oscillate between failover and failback
- Force validation steps to time out and verify that retries and rollback paths engage
- Revoke credentials or permissions mid-run to exercise error handling
Don't just test that automation works—test that guardrails work. Deliberately trigger guardrail conditions and verify they halt automation as expected. A guardrail that fails to activate is worse than no guardrail, because you'll trust it until it's too late.
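A guardrail test can be sketched with a minimal stand-in for the kill-switch check. The `KillSwitchGuard` class below is hypothetical, mirroring the behavior of the `DRGuardrails` kill switch above rather than being the real implementation:

```typescript
// Guardrail test sketch: deliberately trip the kill switch and verify
// it actually blocks execution. A guardrail that silently fails to
// activate is the failure mode this test exists to catch.

class KillSwitchGuard {
  private enabled = false;
  enable(): void { this.enabled = true; }
  canExecute(): { allowed: boolean; code: string } {
    return this.enabled
      ? { allowed: false, code: 'KILL_SWITCH' }
      : { allowed: true, code: 'OK' };
  }
}

const guard = new KillSwitchGuard();

// Baseline: with no kill switch, actions are permitted
if (!guard.canExecute().allowed) throw new Error("guardrail blocked a legitimate action");

// Deliberately trip the guardrail, then confirm it halts automation
guard.enable();
const result = guard.canExecute();
if (result.allowed) throw new Error("kill switch failed to halt automation");
console.log(result.code); // KILL_SWITCH
```

The same pattern applies to cooldowns, rate limits, and scope validation: drive each guardrail into its blocking condition on purpose and assert that execution is denied with the expected code.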
Module Complete: Disaster Recovery Mastery
You have now completed the comprehensive disaster recovery module, covering DR strategy and planning, implementation, testing, and automation.
These capabilities together form a mature disaster recovery program that enables your organization to survive and rapidly recover from catastrophic failures.
Congratulations! You have mastered disaster recovery—from strategic planning through automation. You now have the knowledge to design, implement, test, and continuously improve DR capabilities that protect your organization against the inevitable failures that all systems eventually face.