When disaster strikes and critical databases need recovery, you won't have time to figure out procedures from scratch. Documentation transforms tacit knowledge into actionable instructions that anyone with appropriate skills can execute under pressure.
The documentation imperative:
Consider this scenario: Your senior DBA who designed the backup system is unavailable during a major outage. Can the on-call engineer restore the database? Do they know where backups are stored, which one to use, the correct restoration procedure, and how to validate success? Without documentation, the answer is often "no"—and the outage extends until someone who "knows" is reached.
By the end of this page, you will understand how to create, maintain, and organize backup and recovery documentation that enables reliable disaster recovery even when key personnel are unavailable.
Comprehensive backup documentation spans multiple categories, each serving different purposes and audiences. Well-organized documentation enables both quick reference during incidents and deep understanding during planning.
| Category | Purpose | Audience | Update Frequency |
|---|---|---|---|
| Architecture Overview | High-level backup system design | Management, Architects | Major changes only |
| Operational Runbooks | Step-by-step procedures for common tasks | DBAs, Operations | As procedures change |
| Emergency Procedures | Disaster recovery playbooks | On-call engineers | After each DR test |
| Configuration Reference | Detailed system settings | DBAs | Real-time with changes |
| Backup Inventory | What's backed up, where, retention | All technical staff | Continuous |
| Test Results History | Record of verification tests | Auditors, Management | After each test |
| Incident Records | Past failures and resolutions | DBAs, Operations | After each incident |
Documentation must evolve with your systems. Treat documentation as code—version control it, review changes, and automate updates where possible. Stale documentation is dangerous; it creates false confidence while providing incorrect guidance.
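One way to automate the "stale documentation" check is to record a review date in each runbook and flag files past a freshness threshold. The sketch below is a minimal illustration, assuming each runbook carries a line of the form `Last reviewed: YYYY-MM-DD` (the marker and paths are hypothetical):

```shell
# Hypothetical staleness check: flag any runbook whose "Last reviewed:" line
# is older than a given number of days. Assumes each runbook records its
# review date as "Last reviewed: YYYY-MM-DD".
runbook_is_stale() {
  local file="$1" max_days="$2"
  local reviewed epoch_reviewed epoch_now age_days
  reviewed=$(grep -m1 '^Last reviewed:' "$file" | awk '{print $3}')
  [ -n "$reviewed" ] || return 0              # no review date counts as stale
  epoch_reviewed=$(date -d "$reviewed" +%s)
  epoch_now=$(date +%s)
  age_days=$(( (epoch_now - epoch_reviewed) / 86400 ))
  [ "$age_days" -gt "$max_days" ]
}

# Example: a long-unreviewed runbook is flagged; a fresh one is not.
printf 'Last reviewed: 2020-01-01\n' > /tmp/old_runbook.md
printf 'Last reviewed: %s\n' "$(date +%F)" > /tmp/new_runbook.md
runbook_is_stale /tmp/old_runbook.md 90 && echo "old_runbook.md is stale"
runbook_is_stale /tmp/new_runbook.md 90 || echo "new_runbook.md is current"
```

A check like this can run in CI so a pull request fails when a runbook drifts past its review window.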
Runbooks are step-by-step guides for executing specific procedures. Effective runbooks are detailed enough that a competent engineer unfamiliar with the specific system can execute them successfully under pressure.
Essential runbook components:
# Database Restore Runbook: Production PostgreSQL

## Purpose
Restore the production PostgreSQL database from backup following complete data loss or corruption.

## Prerequisites
- [ ] SSH access to database server (prod-db-01)
- [ ] Read access to S3 backup bucket (backups-prod)
- [ ] PostgreSQL superuser credentials (in Vault: db/prod/admin)
- [ ] At least 500GB free space on /data
- [ ] Notification sent to incident channel

## Estimated Duration: 2-4 hours

## Risk Assessment
| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Wrong backup selected | Medium | Verify backup timestamp matches target |
| Storage exhausted mid-restore | Low | Check space before starting |
| Application connections fail | Medium | Coordinate with app team on timing |

## Procedure

### 1. Assess the Situation (5 min)
1.1 Confirm database is truly unrecoverable
1.2 Identify target recovery point (latest backup vs PITR target)
1.3 Notify stakeholders: "Database restore initiated, ETA X hours"

### 2. Prepare Environment (15 min)
2.1 Stop application connections
```bash
sudo systemctl stop app-server
```
2.2 Verify storage space
```bash
df -h /data  # Must show >500GB free
```
2.3 Backup any remaining data (if applicable)

### 3. Locate Correct Backup (10 min)
3.1 List available backups
```bash
aws s3 ls s3://backups-prod/postgres/daily/
```
3.2 Select backup based on target recovery time
3.3 Verify backup integrity before download
```bash
aws s3 cp s3://backups-prod/postgres/daily/YYYYMMDD.sha256 -
```

### 4. Execute Restore (2-3 hours)
4.1 Download backup
```bash
aws s3 cp s3://backups-prod/postgres/daily/YYYYMMDD.backup /restore/
```
4.2 Stop existing PostgreSQL
```bash
sudo systemctl stop postgresql
```
4.3 Clear data directory
```bash
sudo rm -rf /data/postgresql/14/main/*
```
4.4 Execute restore
```bash
pg_restore -d postgres -C -j 4 /restore/YYYYMMDD.backup
```

### 5. Validate Restoration (30 min)
5.1 Start PostgreSQL
5.2 Run validation queries
```bash
psql -f /scripts/validate_restore.sql
```
5.3 Verify all checks pass

### 6. Resume Operations
6.1 Start application servers
6.2 Verify application connectivity
6.3 Monitor for errors

## Rollback
If the restore fails mid-process, the database is already in a failed state. Retry with a different backup or escalate.

## Escalation
- DBA Lead: +1-555-0100 (Alice)
- Platform Engineering: #platform-oncall
- VP Engineering: (after 4 hours)

## Success Criteria
- [ ] Database accepting connections
- [ ] All validation queries pass
- [ ] Application functional
- [ ] Data loss within acceptable RPO

## Revision History
| Date | Author | Change |
|------|--------|--------|
| 2024-01-15 | alice | Updated for PostgreSQL 14 |
| 2023-09-01 | bob | Added PITR section |

A backup inventory catalogs every database, its backup configuration, storage location, and retention policy. This inventory answers the fundamental question: "What do we have backed up and where is it?"
Essential inventory fields:
| Field | Description | Example |
|---|---|---|
| Database Name | Canonical database identifier | prod-orders-db |
| Database Type | Database engine and version | PostgreSQL 14.5 |
| Data Classification | Sensitivity level | Confidential - PII |
| Backup Type | Full, incremental, or differential | Daily full + hourly WAL |
| Backup Schedule | When backups run | 02:00 UTC daily |
| Primary Storage | Initial backup location | s3://backups-prod/postgres/ |
| Secondary Storage | Replicated location | s3://backups-dr/postgres/ |
| Retention Period | How long backups are kept | 30 days full, 7 days WAL |
| PITR Window | Point-in-time recovery extent | 7 days |
| RTO | Target recovery time | 2 hours |
| RPO | Target recovery point | 1 hour |
| Owner | Responsible team/person | Platform Team - Alice |
| Last Verified | Most recent restore test | 2024-01-10 |
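An inventory is only trustworthy if every record is complete. A lightweight way to enforce that is to validate the inventory file automatically. This sketch assumes a simplified CSV layout (the column set below is illustrative, not the full field list above):

```shell
# Hypothetical inventory check: every record in a CSV-format backup inventory
# must fill the fields that recovery depends on. The field layout is an
# assumption for illustration:
#   name,engine,schedule,primary_storage,retention,owner,last_verified
required_fields=7

validate_inventory() {
  awk -F',' -v n="$required_fields" '
    NR == 1 { next }                       # skip the header row
    NF != n { printf "line %d: %d fields, expected %d\n", NR, NF, n; bad = 1 }
    { for (i = 1; i <= NF; i++)
        if ($i == "") { printf "line %d: field %d is empty\n", NR, i; bad = 1 } }
    END { exit bad }
  ' "$1"
}

cat > /tmp/inventory.csv <<'EOF'
name,engine,schedule,primary_storage,retention,owner,last_verified
prod-orders-db,PostgreSQL 14.5,02:00 UTC daily,s3://backups-prod/postgres/,30 days,Platform Team,2024-01-10
EOF
validate_inventory /tmp/inventory.csv && echo "inventory OK"
```

Running such a check on every inventory change catches records that were added with a name but never given an owner or a verification date.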
The most dangerous databases are those not in your inventory—shadow databases created for testing, development copies that became production, or replicas that contain critical data. Regular audits should discover all databases and ensure they're in the inventory.
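The audit itself can be largely mechanical: diff the set of databases you can discover against the set in the inventory. The discovery step below uses stand-in data; in practice it might query a cloud API or scan known hosts:

```shell
# Sketch of a discovery audit: compare databases actually running against the
# inventory and report anything undocumented. The file contents here are
# stand-ins for real discovery output and a real inventory export.
discovered=/tmp/discovered.txt
inventoried=/tmp/inventoried.txt

printf '%s\n' prod-orders-db prod-users-db test-copy-of-prod | sort > "$discovered"
printf '%s\n' prod-orders-db prod-users-db | sort > "$inventoried"

# comm -23 prints lines only in the first (sorted) file: the shadow databases.
shadow=$(comm -23 "$discovered" "$inventoried")
if [ -n "$shadow" ]; then
  echo "Databases with NO backup inventory entry:"
  echo "$shadow"
fi
```

Anything this report surfaces is either a database that needs a backup policy or one that should be decommissioned; both outcomes improve your posture.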
Configuration documentation captures the technical details of backup systems—settings, credentials, network paths, and dependencies. This enables troubleshooting, auditing, and recreation of backup infrastructure.
Configuration documentation strategies:
Use Infrastructure as Code for backup configuration. Terraform, Ansible, or similar tools serve as both implementation and documentation. The code is always accurate because it IS the configuration, eliminating the drift between docs and reality.
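As a concrete illustration of configuration-as-documentation, a Terraform fragment for an AWS Backup plan encodes schedule and retention in one reviewable place. This is a hedged sketch, not your configuration; the resource names and values are assumptions chosen to match the inventory example above:

```hcl
resource "aws_backup_vault" "prod" {
  name = "backups-prod"
}

resource "aws_backup_plan" "postgres_daily" {
  name = "postgres-daily"

  rule {
    rule_name         = "daily-full"
    target_vault_name = aws_backup_vault.prod.name
    schedule          = "cron(0 2 * * ? *)" # 02:00 UTC daily, as in the inventory

    lifecycle {
      delete_after = 30 # 30-day retention
    }
  }
}
```

Anyone reading this file knows exactly when backups run and how long they are kept, and a change to either is a reviewed pull request rather than an undocumented console edit.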
Every backup test should produce a documented record—evidence of what was tested, results obtained, and any issues discovered. This documentation supports compliance audits, trend analysis, and continuous improvement.
Test documentation template:
Compliance requirements:
Many compliance frameworks (SOC 2, ISO 27001, HIPAA, PCI-DSS) require documented evidence of backup testing. Maintain test records in a format suitable for auditor review—timestamped, immutable, and easily retrievable.
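Immutability can be approximated even in a plain text log by chaining each record to a checksum of the previous one, so after-the-fact edits are detectable. The log path and record format below are assumptions for illustration:

```shell
# Sketch of an append-only test log: each record stores the SHA-256 of the
# previous record, so any later tampering breaks the chain.
LOG=/tmp/backup_test_log.txt

record_test() {
  local result="$1" prev
  # Hash the previous record (hash of empty input for the first record).
  prev=$(tail -n1 "$LOG" 2>/dev/null | sha256sum | awk '{print $1}')
  printf '%s|%s|%s\n' "$(date -u +%FT%TZ)" "$result" "$prev" >> "$LOG"
}

verify_log() {
  local prev ts result stored
  prev=$(printf '' | sha256sum | awk '{print $1}')
  while IFS='|' read -r ts result stored; do
    [ "$stored" = "$prev" ] || return 1     # chain broken: record was altered
    prev=$(printf '%s|%s|%s\n' "$ts" "$result" "$stored" | sha256sum | awk '{print $1}')
  done < "$LOG"
}

rm -f "$LOG"
record_test "restore-test:prod-orders-db:PASS"
record_test "restore-test:prod-users-db:PASS"
verify_log && echo "log intact"
```

Real deployments would more likely use object storage with versioning or object lock, but the principle is the same: evidence an auditor reviews must be provably unmodified.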
Documentation that can't be accessed during a disaster is useless. Ensure recovery documentation is available even when primary systems are down.
Accessibility strategies:
If your only copy of database recovery procedures is in a wiki hosted on that database, you have a circular dependency that will fail when most needed. Critical procedures must exist outside the systems they recover.
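Breaking the circular dependency can be as simple as mirroring critical runbooks to independent storage along with a checksum manifest, so the copy can later be verified without trusting the source. The paths here are illustrative stand-ins for offsite or offline storage:

```shell
# Sketch: copy critical runbooks to a location independent of the systems they
# recover, plus a checksum manifest for later verification.
SRC=/tmp/runbooks
MIRROR=/tmp/offline-mirror

mkdir -p "$SRC"
printf '# Production restore runbook\n' > "$SRC/restore.md"

rm -rf "$MIRROR"
cp -r "$SRC" "$MIRROR"
( cd "$MIRROR" && sha256sum restore.md > MANIFEST )

# Later, from the mirror alone, confirm the docs are intact:
( cd "$MIRROR" && sha256sum -c MANIFEST )
```

In practice the mirror would be object storage in another account or region, a printed binder, or both; what matters is that reaching it never requires the system being recovered.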
Documentation decays without active maintenance. Systems change, procedures evolve, and contacts move on. Scheduled review and update processes keep documentation accurate and trustworthy.
| Document Type | Review Frequency | Review Trigger | Owner |
|---|---|---|---|
| Emergency Runbooks | Quarterly | After each DR test or incident | DBA Lead |
| Backup Inventory | Monthly | Any database change | Operations |
| Configuration Docs | Continuous | Any config change | Automated |
| Architecture Overview | Annually | Major system changes | Architect |
| Contact Lists | Monthly | Any personnel change | Team Lead |
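A review schedule like the one above can itself be checked mechanically. This sketch drives overdue warnings from a small table of document type, review interval in days, and last-review date; the data is illustrative:

```shell
# Sketch: flag documents past their review interval. Input rows are
# "document|interval_days|last_review_date" (a hypothetical format).
today=$(date +%s)

check_reviews() {
  local doc interval last due
  while IFS='|' read -r doc interval last; do
    due=$(( $(date -d "$last" +%s) + interval * 86400 ))
    [ "$today" -gt "$due" ] && echo "OVERDUE: $doc (last reviewed $last)"
  done
}

check_reviews <<'EOF'
Emergency Runbooks|90|2020-06-01
Backup Inventory|30|2020-12-15
Contact Lists|30|2020-11-01
EOF
```

Piping the output into a team channel turns the review cadence from a policy document into a recurring, visible obligation.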
Documentation review checklist:
You now understand the essential documentation practices for backup and recovery. Next, we'll examine regular practice—the discipline of exercising recovery procedures before they're needed in production.