On February 28, 2017, a simple typo disrupted a significant portion of the internet for roughly four hours. An Amazon S3 engineer, while debugging a billing system issue, accidentally entered a command that removed a larger set of servers than intended. The cascading failure took down S3 in US-East-1, which in turn took down thousands of websites and services—from Slack and Trello to Coursera and the US Securities and Exchange Commission.
But here's the part that doesn't make headlines: The companies that recovered fastest weren't the ones with the most sophisticated DR infrastructure. They were the ones that had tested their recovery procedures recently enough that teams knew exactly what to do, scripts still worked, and runbooks were current.
Untested disaster recovery is not disaster recovery—it's disaster hope. The only way to know your DR actually works is to test it, regularly, rigorously, and under conditions that approximate real failure scenarios.
The uncomfortable truth: Studies consistently show that 60-70% of DR plans fail on first invocation. Not because the technology fails, but because the human processes, the scripts, the documentation haven't been validated against reality. Testing is what transforms a theoretical plan into a proven capability.
By the end of this page, you will understand the full spectrum of DR testing approaches, from low-risk tabletop exercises to full production failovers. You'll learn how to design test plans that validate your RPO/RTO targets, how to execute tests safely, and how to transform test results into continuous improvement of your DR capabilities.
DR testing exists on a spectrum of realism vs. risk. Lower-realism tests are safer but may miss issues that only emerge under real conditions. Higher-realism tests expose real problems but carry real risks. A mature DR program uses all levels of the spectrum:
Level 0: Plan Review A structured walkthrough of DR documentation to identify gaps, outdated procedures, or missing information. No systems are touched; this is purely a document review exercise.
Level 1: Tabletop Exercise Team members verbally walk through disaster scenarios, discussing what they would do at each stage. Systems remain untouched, but the exercise exposes gaps in understanding, role confusion, and procedural ambiguity.
Level 2: Component Testing Individual DR components are tested in isolation: backup restore, replication failover, network rerouting—each tested separately without full integration.
Level 3: Integrated DR Test (Non-Production) Full recovery is executed against a realistic test environment. All components work together, but production is not affected.
Level 4: Production Failover Test Production traffic is actually switched to the DR environment. This is real disaster recovery, just planned and controlled rather than crisis-driven.
| Test Level | Realism | Risk | Cost | Typical Duration | Frequency |
|---|---|---|---|---|---|
| Plan Review | Very Low | None | Low | 2-4 hours | Quarterly |
| Tabletop Exercise | Low | None | Low-Medium | 2-4 hours | Quarterly |
| Component Testing | Medium | Low | Medium | 1-4 hours each | Monthly |
| Integrated (Non-Prod) | High | Medium | High | 4-8 hours | Quarterly |
| Production Failover | Maximum | High | Very High | 2-8 hours | Annually |
Like the software testing pyramid, DR testing should have many low-level tests and fewer high-level tests. Run component tests frequently, integrated tests quarterly, and production failovers annually. Each level builds confidence for the next.
A DR test is only as valuable as its design. Poorly designed tests provide false confidence—they 'pass' but don't validate actual recovery capability. Here's how to design tests that matter:
Define Clear Objectives: Every test should have explicit objectives that map to your DR requirements:
Select Realistic Scenarios: Choose scenarios that represent credible disaster types:
Include Injects and Curveballs: Real disasters don't follow scripts. Add unexpected complications:
```typescript
// Comprehensive DR Test Plan Template
interface DRTestPlan {
  testId: string;
  testDate: Date;
  testType: 'plan_review' | 'tabletop' | 'component' | 'integrated' | 'production_failover';
  objectives: TestObjective[];
  scenario: DisasterScenario;
  scope: TestScope;
  participants: Participant[];
  timeline: TimelinePhase[];
  successCriteria: SuccessCriterion[];
  riskMitigation: RiskMitigation[];
  rollbackPlan: string;
  communicationPlan: CommunicationPlan;
}

interface TestObjective {
  id: string;
  description: string;
  targetMetric: string;
  targetValue: string;
  measurementMethod: string;
}

interface DisasterScenario {
  name: string;
  description: string;
  affectedSystems: string[];
  simulatedCause: string;
  expectedDuration: number; // minutes
  injects: Inject[];        // Unexpected complications
}

interface Inject {
  timing: number; // minutes into test
  description: string;
  expectedImpact: string;
  responseValidation: string;
}

interface SuccessCriterion {
  criterion: string;
  target: string;
  measurement: string;
  mustPass: boolean; // Hard failure vs. learning opportunity
}

interface TestScope {
  systemsIncluded: string[];
  systemsExcluded: string[];
  dataScope: string;
  trafficScope: string;
}

interface Participant {
  role: string;
  name: string;
  contact: string;
}

interface TimelinePhase {
  phase: string;
  start: number;    // minutes from test start
  duration: number; // minutes
  activities: string[];
}

interface RiskMitigation {
  risk: string;
  mitigation: string;
  likelihood: string;
}

interface CommunicationPlan {
  internal: string;
  external: string;
  escalation: string;
}

// Example Test Plan: Annual Production Failover
const annualProductionFailover: DRTestPlan = {
  testId: 'DR-2024-Q1-PROD',
  testDate: new Date('2024-03-15'),
  testType: 'production_failover',
  objectives: [
    {
      id: 'OBJ-1',
      description: 'Validate Tier 1 systems achieve RTO target',
      targetMetric: 'Time to production traffic on DR site',
      targetValue: '≤ 30 minutes',
      measurementMethod: 'Timestamp from disaster declaration to first successful production request'
    },
    {
      id: 'OBJ-2',
      description: 'Validate database RPO is within target',
      targetMetric: 'Data loss measured by transaction gap',
      targetValue: '≤ 5 minutes',
      measurementMethod: 'Compare last committed transaction on primary vs first available on DR'
    },
    {
      id: 'OBJ-3',
      description: 'Verify secondary team can execute without principal',
      targetMetric: 'Recovery completion without designated primary on-call',
      targetValue: 'Pass/Fail',
      measurementMethod: 'Exclude primary on-call from communication during test'
    },
    {
      id: 'OBJ-4',
      description: 'Validate external partner API connectivity from DR',
      targetMetric: 'All partner integrations functional',
      targetValue: '100% of critical integrations operational',
      measurementMethod: 'Synthetic transactions to each integration endpoint'
    }
  ],
  scenario: {
    name: 'Complete Primary Region Failure',
    description: 'Simulated catastrophic failure of US-East-1 region requiring full failover to US-West-2',
    affectedSystems: ['All Tier 1', 'All Tier 2', 'Selected Tier 3'],
    simulatedCause: 'Simulated regional AWS outage',
    expectedDuration: 120, // 2-hour planned window
    injects: [
      {
        timing: 30,
        description: 'Primary DNS fails to update within expected time',
        expectedImpact: 'Test ability to use backup DNS propagation method',
        responseValidation: 'Team switches to direct IP / CDN failover'
      },
      {
        timing: 60,
        description: 'Customer service reports cannot access customer lookup tool',
        expectedImpact: 'Test prioritization during partial recovery',
        responseValidation: 'Team makes appropriate triage decision'
      },
      {
        timing: 90,
        description: 'Third-party payment processor reports our new IP not whitelisted',
        expectedImpact: 'Test external dependency recovery',
        responseValidation: 'Team executes payment processor failover procedure'
      }
    ]
  },
  scope: {
    systemsIncluded: [
      'Web application cluster',
      'Primary database',
      'Order processing service',
      'Payment gateway integration',
      'Customer authentication',
      'CDN configuration'
    ],
    systemsExcluded: [
      'Internal tools (will use primary or manual workaround)',
      'Analytics pipeline (acceptable delay)',
      'Development environments'
    ],
    dataScope: 'Full production data (replicated)',
    trafficScope: 'Gradual cutover: 5% → 25% → 100% over 30 minutes'
  },
  participants: [
    { role: 'Test Director', name: 'Sarah Chen', contact: '+1-555-0100' },
    { role: 'Technical Lead', name: 'Marcus Johnson', contact: '+1-555-0101' },
    { role: 'Database Recovery', name: 'Priya Patel', contact: '+1-555-0102' },
    { role: 'Network/DNS', name: 'James Wilson', contact: '+1-555-0103' },
    { role: 'Application Team', name: 'Elena Rodriguez', contact: '+1-555-0104' },
    { role: 'Observer/Timekeeper', name: 'David Kim', contact: '+1-555-0105' },
    { role: 'Executive Sponsor', name: 'VP Engineering', contact: '+1-555-0106' }
  ],
  timeline: [
    { phase: 'Pre-Test Verification', start: 0, duration: 30, activities: ['Verify DR site readiness', 'Confirm all participants', 'Final go/no-go'] },
    { phase: 'Disaster Declaration', start: 30, duration: 5, activities: ['Declare simulated disaster', 'Start official clock'] },
    { phase: 'Recovery Execution', start: 35, duration: 60, activities: ['Execute runbook procedures', 'Document timing and issues'] },
    { phase: 'Verification', start: 95, duration: 20, activities: ['Synthetic transactions', 'Integration checks', 'Performance validation'] },
    { phase: 'Traffic Cutover', start: 115, duration: 30, activities: ['Gradual traffic migration', 'Monitor error rates'] },
    { phase: 'Steady-State Validation', start: 145, duration: 30, activities: ['Full production on DR', 'Monitor for issues'] },
    { phase: 'Failback', start: 175, duration: 60, activities: ['Return to primary region', 'Verify primary recovery'] }
  ],
  successCriteria: [
    { criterion: 'RTO achieved', target: '≤ 30 minutes', measurement: 'Timestamp delta', mustPass: true },
    { criterion: 'RPO achieved', target: '≤ 5 minutes', measurement: 'Transaction gap analysis', mustPass: true },
    { criterion: 'Error rate acceptable', target: '< 0.1% after cutover', measurement: 'Error monitoring', mustPass: true },
    { criterion: 'Latency acceptable', target: '< 200ms p99 from DR', measurement: 'APM tools', mustPass: false },
    { criterion: 'All integrations functional', target: '100%', measurement: 'Integration health checks', mustPass: true },
    { criterion: 'Failback successful', target: 'Complete within 60 min', measurement: 'Timestamp', mustPass: true }
  ],
  riskMitigation: [
    { risk: 'Test causes production data loss', mitigation: 'Additional backup taken pre-test; read-only mode verification', likelihood: 'Low' },
    { risk: 'Cannot failback to primary', mitigation: 'DR site capable of running indefinitely; extended customer communication ready', likelihood: 'Medium' },
    { risk: 'Customer-visible errors during cutover', mitigation: 'Maintenance window communication; gradual traffic shift', likelihood: 'Medium' },
    { risk: 'Test reveals critical DR flaw', mitigation: 'Abort criteria defined; rapid failback procedure ready', likelihood: 'Medium' }
  ],
  rollbackPlan: 'If test reveals critical issues, immediately redirect traffic back to primary using pre-configured DNS failback. All team members authorized to call abort.',
  communicationPlan: {
    internal: 'Slack #dr-test-2024 for real-time updates; executive summary email at completion',
    external: '24-hour advance customer notification; status page update during test window',
    escalation: 'Any production-impacting issue escalates to VP Engineering immediately'
  }
};
```

Component tests validate individual DR building blocks. They're lower risk, faster to execute, and can be run frequently. Here are the essential component tests for most DR architectures:
Backup Restore Testing: The most fundamental DR component. Validates that backups exist, are recent enough to meet your RPO, can be restored within your RTO, and produce valid, queryable data.
Best practice: Automate daily restore tests to an isolated environment. If a restore fails, alert immediately—don't wait for disaster to discover backup problems.
Replication Health Validation: For systems using replication for DR, continuously measure replication lag against your RPO target, verify that replicated data matches the primary, and periodically exercise replica promotion.
DNS/Traffic Failover Testing: Traffic routing is often the final step in DR. Test DNS TTL behavior and propagation timing, health-check-based failover triggers, and backup routing paths (direct IP or CDN failover) for when DNS updates lag.
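The trigger side of health-check-based failover is pure logic, so it can be unit-tested without touching DNS at all. A minimal sketch of a consecutive-failure threshold (the style most managed DNS health checks use); the names here are illustrative, not from any particular service:

```typescript
// Sketch: decide when health-check results should trigger failover.
// A single healthy probe inside the window resets the trigger, which
// prevents a flapping endpoint from causing a premature failover.
interface HealthProbe {
  timestamp: number; // epoch millis
  healthy: boolean;
}

function shouldTriggerFailover(
  probes: HealthProbe[],          // ordered oldest to newest
  consecutiveFailures: number     // e.g. 3 for "fail three checks in a row"
): boolean {
  if (probes.length < consecutiveFailures) return false;
  return probes.slice(-consecutiveFailures).every(p => !p.healthy);
}

// A flapping endpoint does not trigger; a consistently down one does.
const flapping: HealthProbe[] = [
  { timestamp: 1, healthy: false },
  { timestamp: 2, healthy: true },
  { timestamp: 3, healthy: false },
];
const down: HealthProbe[] = [1, 2, 3].map(t => ({ timestamp: t, healthy: false }));
```

Testing this logic in isolation is exactly the "component test" level of the spectrum: frequent, cheap, and no production risk.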
```typescript
// Automated DR Component Testing Framework
import { MetricsClient } from './metrics';
import { AlertingClient } from './alerting';
import { DatabaseClient } from './database';
import { StorageClient } from './storage';

interface ComponentTestResult {
  testName: string;
  component: string;
  passed: boolean;
  duration: number; // seconds
  details: string;
  metricsMet: {
    metric: string;
    target: string;
    actual: string;
    passed: boolean;
  }[];
  timestamp: Date;
}

class DRComponentTester {
  private metricsClient: MetricsClient;
  private alertingClient: AlertingClient;

  constructor() {
    this.metricsClient = new MetricsClient();
    this.alertingClient = new AlertingClient();
  }

  /**
   * Database Backup Restore Test
   * Validates backup exists and can be restored within RTO
   */
  async testDatabaseBackupRestore(config: {
    backupIdentifier: string;
    targetInstance: string;
    maxRestoreTimeMinutes: number;
    validationQueries: string[];
  }): Promise<ComponentTestResult> {
    const startTime = Date.now();
    const db = new DatabaseClient();

    try {
      // Step 1: Verify backup exists and is recent
      const backup = await db.getBackupMetadata(config.backupIdentifier);
      const backupAgeMinutes = (Date.now() - backup.createdAt.getTime()) / 60000;

      // Step 2: Initiate restore to test instance
      console.log(`Initiating restore from ${config.backupIdentifier}`);
      const restoreJob = await db.restoreBackup({
        backupId: config.backupIdentifier,
        targetInstance: config.targetInstance,
        deleteExisting: true
      });

      // Step 3: Wait for restore completion
      await db.waitForRestoreComplete(restoreJob.id, config.maxRestoreTimeMinutes * 60 * 1000);
      const restoreDuration = (Date.now() - startTime) / 1000;

      // Step 4: Validate restored data
      const validationResults = await Promise.all(
        config.validationQueries.map(async (query) => {
          try {
            await db.executeQuery(config.targetInstance, query);
            return { query, passed: true };
          } catch (error) {
            return { query, passed: false, error: String(error) };
          }
        })
      );
      const allValidationsPassed = validationResults.every(r => r.passed);
      const restoreTimeTarget = config.maxRestoreTimeMinutes * 60;

      // Step 5: Clean up test instance
      await db.deleteInstance(config.targetInstance);

      return {
        testName: 'Database Backup Restore',
        component: 'Primary Database',
        passed: allValidationsPassed && restoreDuration <= restoreTimeTarget,
        duration: restoreDuration,
        details: allValidationsPassed
          ? `Restore completed in ${restoreDuration.toFixed(0)}s`
          : `Validation failures: ${validationResults.filter(r => !r.passed).length}`,
        metricsMet: [
          {
            metric: 'Restore Time',
            target: `≤ ${config.maxRestoreTimeMinutes} minutes`,
            actual: `${(restoreDuration / 60).toFixed(1)} minutes`,
            passed: restoreDuration <= restoreTimeTarget
          },
          {
            metric: 'Backup Age',
            target: 'Within RPO',
            actual: `${backupAgeMinutes.toFixed(0)} minutes`,
            passed: backupAgeMinutes <= 60 // Assumes 1-hour RPO
          },
          {
            metric: 'Data Validation',
            target: 'All queries pass',
            actual: `${validationResults.filter(r => r.passed).length}/${validationResults.length} passed`,
            passed: allValidationsPassed
          }
        ],
        timestamp: new Date()
      };
    } catch (error) {
      return {
        testName: 'Database Backup Restore',
        component: 'Primary Database',
        passed: false,
        duration: (Date.now() - startTime) / 1000,
        details: `Test failed: ${error}`,
        metricsMet: [],
        timestamp: new Date()
      };
    }
  }

  /**
   * Replication Lag Test
   * Writes synthetic transaction and measures time to appear on replica
   */
  async testReplicationLag(config: {
    primaryConnection: string;
    replicaConnection: string;
    maxLagSeconds: number;
    testTable: string;
  }): Promise<ComponentTestResult> {
    const db = new DatabaseClient();
    const testId = `dr_test_${Date.now()}`;
    const startTime = Date.now();

    try {
      // Write synthetic record to primary
      const writeTime = Date.now();
      await db.executeQuery(
        config.primaryConnection,
        `INSERT INTO ${config.testTable} (id, created_at) VALUES ('${testId}', NOW())`
      );

      // Poll replica until record appears or timeout
      let replicaReadTime: number | null = null;
      const maxWaitMs = config.maxLagSeconds * 1000 * 2; // 2x target for timeout
      while (Date.now() - writeTime < maxWaitMs) {
        const result = await db.executeQuery(
          config.replicaConnection,
          `SELECT id FROM ${config.testTable} WHERE id = '${testId}'`
        );
        if (result.rows.length > 0) {
          replicaReadTime = Date.now();
          break;
        }
        await this.sleep(100); // Poll every 100ms
      }

      // Clean up test record
      await db.executeQuery(
        config.primaryConnection,
        `DELETE FROM ${config.testTable} WHERE id = '${testId}'`
      );

      if (!replicaReadTime) {
        return {
          testName: 'Replication Lag Test',
          component: 'Database Replication',
          passed: false,
          duration: (Date.now() - startTime) / 1000,
          details: `Record did not appear on replica within ${config.maxLagSeconds * 2}s`,
          metricsMet: [{
            metric: 'Replication Lag',
            target: `≤ ${config.maxLagSeconds}s`,
            actual: 'Timeout',
            passed: false
          }],
          timestamp: new Date()
        };
      }

      const measuredLagMs = replicaReadTime - writeTime;
      const lagSeconds = measuredLagMs / 1000;

      return {
        testName: 'Replication Lag Test',
        component: 'Database Replication',
        passed: lagSeconds <= config.maxLagSeconds,
        duration: (Date.now() - startTime) / 1000,
        details: `Measured replication lag: ${lagSeconds.toFixed(2)}s`,
        metricsMet: [{
          metric: 'Replication Lag',
          target: `≤ ${config.maxLagSeconds}s`,
          actual: `${lagSeconds.toFixed(2)}s`,
          passed: lagSeconds <= config.maxLagSeconds
        }],
        timestamp: new Date()
      };
    } catch (error) {
      return {
        testName: 'Replication Lag Test',
        component: 'Database Replication',
        passed: false,
        duration: (Date.now() - startTime) / 1000,
        details: `Test error: ${error}`,
        metricsMet: [],
        timestamp: new Date()
      };
    }
  }

  /**
   * Run all scheduled component tests and report results
   */
  async runDailyComponentTests(): Promise<ComponentTestResult[]> {
    const results: ComponentTestResult[] = [];

    // Database backup restore (runs on isolated test instance)
    results.push(await this.testDatabaseBackupRestore({
      backupIdentifier: 'latest-automated',
      targetInstance: 'dr-test-restore-target',
      maxRestoreTimeMinutes: 30,
      validationQueries: [
        'SELECT COUNT(*) FROM users',
        "SELECT COUNT(*) FROM orders WHERE created_at > NOW() - INTERVAL '24 hours'",
        "SELECT 1 FROM critical_config WHERE key = 'version'"
      ]
    }));

    // Replication lag (continuous check)
    results.push(await this.testReplicationLag({
      primaryConnection: 'postgresql://primary:5432/app',
      replicaConnection: 'postgresql://dr-replica:5432/app',
      maxLagSeconds: 30,
      testTable: 'dr_replication_test'
    }));

    // Report results
    const failedTests = results.filter(r => !r.passed);
    if (failedTests.length > 0) {
      await this.alertingClient.sendAlert({
        severity: 'warning',
        title: `DR Component Test Failures: ${failedTests.length}`,
        details: failedTests.map(t => `${t.testName}: ${t.details}`).join('\n'),
        action: 'Review DR component test results and remediate failures'
      });
    }

    // Store results for trending
    await this.metricsClient.recordDRTestResults(results);

    return results;
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}
```

Tabletop exercises are discussion-based sessions where teams walk through disaster scenarios verbally. They're low-cost, zero-risk, and highly effective at exposing process gaps, role confusion, and decision-making weaknesses.
Conducting an Effective Tabletop:
1. Define a Realistic Scenario: Start with a specific, plausible disaster scenario. Vague scenarios produce vague discussions. Good examples: "The primary database host fails at 2 a.m. on a holiday weekend," or "The entire primary region becomes unreachable during peak traffic."
2. Assign a Facilitator: The facilitator guides discussion, introduces complications (injects), and ensures all participants engage. They should NOT be part of the response team—their job is to drive the exercise, not solve the problem.
3. Introduce Progressive Information: Real disasters don't reveal all information at once. Start with limited information ("monitoring shows elevated error rates in the primary region") and add details as the exercise progresses ("the cloud provider has now confirmed a regional outage").
4. Document Decisions and Gaps: Capture every decision, every question that couldn't be answered, every point of confusion. These are the valuable outputs of the tabletop.
Tabletops are one of the few DR activities where including executives provides high value. They expose leadership to recovery complexity, build confidence in the team, and often surface organizational blockers (budget, policy, authority) that technical staff cannot resolve alone.
Common Tabletop Findings:
Tabletops consistently reveal the same categories of gaps: unclear decision authority (who declares the disaster?), outdated contact information, role confusion, undocumented dependencies, and procedures that quietly assume a specific person will be available.
Production failover tests are the ultimate validation of DR capability. They're also the most complex and highest-risk form of DR testing. Here's how to execute them safely:
Pre-Test Preparation (Weeks Before):
Test Execution:
Before any production failover test, define explicit abort criteria. Examples: Error rate exceeds 1% for 5 minutes, recovery actions take 50% longer than expected, critical integration fails verification. Anyone on the team should be empowered to call abort if criteria are met.
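Abort criteria work best when they are machine-checkable rather than mid-test judgment calls. A sketch of an evaluator using the example thresholds above; the metric field names are assumptions for illustration:

```typescript
// Sketch: evaluate pre-agreed abort criteria against live test metrics.
// Thresholds mirror the examples in the text; field names are illustrative.
interface TestMetrics {
  errorRatePct: number;          // current rolling error rate
  errorRateDurationMin: number;  // minutes it has held at that level
  elapsedMin: number;            // actual time spent on recovery actions
  plannedMin: number;            // planned time for those actions
  criticalIntegrationFailed: boolean;
}

function abortReasons(m: TestMetrics): string[] {
  const reasons: string[] = [];
  if (m.errorRatePct > 1 && m.errorRateDurationMin >= 5) {
    reasons.push('Error rate above 1% for 5+ minutes');
  }
  if (m.elapsedMin > m.plannedMin * 1.5) {
    reasons.push('Recovery running 50% over plan');
  }
  if (m.criticalIntegrationFailed) {
    reasons.push('Critical integration failed verification');
  }
  return reasons; // empty array means: continue the test
}
```

Because the criteria are explicit data, any team member watching the dashboard can call abort by pointing at a returned reason rather than arguing a judgment call.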
| Phase | Action | Responsible | Duration | Abort Trigger |
|---|---|---|---|---|
| Pre-Test | Verify DR site replication current | DBA | T-24h | Lag > RPO target |
| Pre-Test | Customer notification sent | Comms | T-24h | |
| Pre-Test | Fresh backup taken | DBA | T-2h | Backup fails |
| Pre-Test | Team briefing complete | Test Lead | T-1h | |
| Go/No-Go | Final readiness check | Test Lead | T-0 | Any critical blocker |
| Execution | Database failover | DBA | Target: 5min | Failover fails |
| Execution | App tier startup | App Team | Target: 10min | Health checks fail 3x |
| Execution | Integration verification | App Team | Target: 10min | Critical integration fails |
| Cutover | 5% traffic shift | Infra | 5min observe | Error rate > 1% |
| Cutover | 25% traffic shift | Infra | 10min observe | Error rate > 0.5% |
| Cutover | 100% traffic shift | Infra | 30min observe | Error rate > 0.1% |
| Failback | Return to primary | Infra | Target: 30min | Primary unhealthy |
| Completion | Site restored, test ended | Test Lead | | |
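The gradual cutover rows above (5% → 25% → 100%, each with its own observation window and abort threshold) can be expressed as data, so the shift plan and its gates live in one place. A sketch, with the stages taken from the table:

```typescript
// Sketch: the cutover stages from the plan above, expressed as data.
// nextStage() returns the next stage to shift to, or null when the
// observed error rate breaches the current stage's abort threshold.
interface CutoverStage {
  trafficPct: number;      // share of traffic on the DR site
  observeMin: number;      // minutes to observe before advancing
  maxErrorRatePct: number; // abort threshold during this stage
}

const stages: CutoverStage[] = [
  { trafficPct: 5,   observeMin: 5,  maxErrorRatePct: 1.0 },
  { trafficPct: 25,  observeMin: 10, maxErrorRatePct: 0.5 },
  { trafficPct: 100, observeMin: 30, maxErrorRatePct: 0.1 },
];

function nextStage(
  currentIndex: number,
  observedErrorRatePct: number
): CutoverStage | null {
  const current = stages[currentIndex];
  if (observedErrorRatePct > current.maxErrorRatePct) {
    return null; // abort: fail back to primary
  }
  return stages[currentIndex + 1] ?? current; // hold at 100% once reached
}
```

Note the thresholds tighten as the traffic share grows: at 5% a 1% error rate is an acceptable cost of learning, but at full cutover the steady-state target of 0.1% applies.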
Post-Test Analysis:
The value of a production failover test extends far beyond the test itself. Rigorous post-test analysis transforms observations into improvements:
How often should you test? The answer balances risk, cost, and operational burden:
Testing Cadence Guidelines:
| Test Type | Minimum Frequency | Ideal Frequency | Trigger for Extra Tests |
|---|---|---|---|
| Backup Verification | Weekly | Daily (automated) | After any backup process change |
| Replication Lag Monitoring | Continuous | Continuous | Alerts on threshold approach |
| Component Tests | Monthly | Weekly (automated) | After infrastructure changes |
| Tabletop Exercises | Quarterly | Monthly | After major incidents, new scenarios |
| Integrated Non-Prod Test | Quarterly | Monthly | After architecture changes |
| Production Failover | Annually | Semi-annually | After any major DR investment |
Coverage Strategy:
Not all systems need the same test intensity. Match testing investment to system criticality:
Tier 1 (Mission-Critical): Full production failover annually, integrated tests quarterly, component tests monthly, continuous replication monitoring.
Tier 2 (Business-Critical): Integrated tests semi-annually, component tests monthly, backup verification weekly.
Tier 3 (Operational): Component tests quarterly, backup verification monthly.
Tier 4 (Administrative): Annual backup verification, IaC rebuild test annually.
Rotation Strategy: With many systems, testing everything at once is impractical. Implement a rotation: for example, put a different Tier 1 system through an integrated test each quarter, so that every critical system receives a high-level test within the annual cycle.
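One way to make the rotation concrete is to derive each quarter's test slate from the system inventory rather than maintaining a calendar by hand. A minimal round-robin sketch; the system names are invented:

```typescript
// Sketch: assign each system an integrated-test quarter by round-robin,
// so every system in the list is exercised within the year.
// System names are invented for illustration.
function rotationSchedule(systems: string[]): Map<string, string[]> {
  const quarters = ['Q1', 'Q2', 'Q3', 'Q4'];
  const schedule = new Map<string, string[]>(
    quarters.map(q => [q, []] as [string, string[]])
  );
  systems.forEach((system, i) => {
    schedule.get(quarters[i % quarters.length])!.push(system);
  });
  return schedule;
}

const tier1 = ['payments', 'auth', 'orders', 'catalog', 'checkout'];
const plan = rotationSchedule(tier1);
// With five systems, two land in Q1 and the rest spread across Q2-Q4.
```

A generated schedule like this also makes gaps auditable: any system absent from the output is, by definition, untested this cycle.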
Beyond scheduled testing, any significant change should trigger DR validation: New database cluster? Test failover. Updated runbook? Validate with tabletop. New region added? Full production failover test. Change-triggered testing catches drift between documentation and reality.
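Change-triggered testing is easy to encode as a lookup, so a deployment pipeline can flag the required DR validation automatically. A sketch using the mappings from the paragraph above; the change-type keys and the default are assumptions:

```typescript
// Sketch: map change types to the DR validation the text calls for.
// Change-type names and the conservative default are illustrative.
const changeToTest: Record<string, string> = {
  'new-database-cluster': 'failover test',
  'runbook-update': 'tabletop validation',
  'new-region': 'full production failover test',
};

function requiredDrTest(changeType: string): string {
  // Unknown change types still get a conservative default rather than
  // silently requiring nothing.
  return changeToTest[changeType] ?? 'component test review';
}
```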
A DR test that exposes problems is a successful test. The worst outcome isn't a test failure—it's a test that passes despite hidden weaknesses, only to fail during actual disaster.
Treating Test Failures as Opportunities:
Every test failure should generate:
Common DR Test Failure Patterns:
Understanding common failure patterns helps you anticipate and prevent them:
Credential/Access Failures: DR systems can't access secrets, APIs, or databases because credentials expired, weren't replicated, or aren't authorized from DR IP ranges.
Configuration Drift: DR environment configuration has drifted from production. Database connection strings point to wrong hosts, feature flags don't match, SSL certs are expired.
Capacity Shortfall: DR environment was provisioned for old production scale. Current load exceeds DR capacity.
Procedure Rot: Runbook procedures reference obsolete tools, old URLs, changed role names, or steps that no longer apply.
Network/DNS Issues: Firewall rules don't allow required traffic, DNS changes take longer than expected, health checks misconfigured.
Data Inconsistency: Recovery produces inconsistent state—transactions missing, foreign key violations, corrupted records.
Human Factors: Key personnel unavailable, confusion about roles, communication breakdowns, decision paralysis.
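Several of these patterns, configuration drift especially, can be caught between tests by automated comparison of production and DR configuration. A hedged sketch that diffs two flattened config maps; the keys and values are invented:

```typescript
// Sketch: detect drift between production and DR configuration maps.
// Keys and values are invented for illustration.
interface DriftReport {
  missingInDr: string[]; // keys present in prod but absent in DR
  mismatched: string[];  // keys present in both with different values
}

function detectDrift(
  prod: Record<string, string>,
  dr: Record<string, string>
): DriftReport {
  const missingInDr: string[] = [];
  const mismatched: string[] = [];
  for (const [key, value] of Object.entries(prod)) {
    if (!(key in dr)) missingInDr.push(key);
    else if (dr[key] !== value) mismatched.push(key);
  }
  return { missingInDr, mismatched };
}

const prodConfig = { db_host: 'db.prod.internal', tls_cert: 'cert-2025', feature_x: 'on' };
const drConfig = { db_host: 'db.dr.internal', tls_cert: 'cert-2025' };
const report = detectDrift(prodConfig, drConfig);
// db_host legitimately differs per site; feature_x is missing from DR.
```

In practice some keys (like `db_host` here) differ by design, so a real check needs an allowlist of expected per-site differences; everything else that diverges is drift worth alerting on.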
What's Next:
Testing validates capability, but capability depends on executable procedures. The next page covers Runbook Development—how to create documentation that enables successful recovery even when written by people no longer available during the disaster.
You now understand the full spectrum of DR testing, from low-risk tabletops to production failovers. You can design effective tests, execute them safely, and transform findings into continuous improvement. Next, we'll explore runbook development for executable disaster recovery procedures.