Backup files can become corrupted silently—during creation, transfer, or storage. Integrity checks are systematic validations that detect corruption before you attempt recovery, ensuring that when disaster strikes, you're not compounding the crisis with unusable backups.
The integrity challenge: corruption can occur at multiple levels:

- Physical (bit-level): flipped bits or truncated files introduced during creation, transfer, or storage
- Structural: backup files that no longer conform to their format specification (missing headers, broken segment boundaries, inconsistent metadata)
- Block/page level: individual database pages whose embedded checksums no longer match their contents
- Logical: structurally valid data that violates referential integrity, constraints, or application-level rules
Each corruption type requires different detection methods. Comprehensive integrity checking addresses all levels.
By the end of this page, you will understand how to implement multi-layer integrity checking that catches corruption at every level—from physical bit-level validation to logical data consistency verification.
Checksums are the foundation of integrity verification—mathematical fingerprints that detect any modification to backup files. By comparing calculated checksums against stored values, you can identify corruption with high confidence.
Checksum algorithm selection:
| Algorithm | Output Size | Speed | Collision Resistance | Use Case |
|---|---|---|---|---|
| CRC32 | 32 bits | Very Fast | Low | Quick sanity checks, not security |
| MD5 | 128 bits | Fast | Broken | Legacy systems only, not recommended |
| SHA-1 | 160 bits | Fast | Weakened | Transitional, avoid for new systems |
| SHA-256 | 256 bits | Moderate | Strong | Recommended for integrity verification |
| SHA-512 | 512 bits | Moderate | Very Strong | High-security environments |
| BLAKE3 | 256+ bits | Very Fast | Strong | Modern high-performance option |
```python
#!/usr/bin/env python3
"""
Backup Integrity Verification using Checksums
Implements multi-algorithm checksum validation with streaming support
"""

import hashlib
import json
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path
from typing import Dict


@dataclass
class ChecksumResult:
    algorithm: str
    expected: str
    calculated: str
    match: bool
    file_size: int


def calculate_checksum(
    file_path: Path,
    algorithm: str = 'sha256',
    chunk_size: int = 8192 * 1024  # 8MB chunks for large files
) -> str:
    """
    Calculate checksum using streaming to handle large backup files.
    Memory-efficient: never loads entire file into RAM.
    """
    hash_func = hashlib.new(algorithm)
    with open(file_path, 'rb') as f:
        while chunk := f.read(chunk_size):
            hash_func.update(chunk)
    return hash_func.hexdigest()


def verify_backup_checksum(
    backup_path: Path,
    manifest_path: Path
) -> ChecksumResult:
    """
    Verify backup file against stored checksum in manifest.
    """
    # Load manifest with expected checksums
    with open(manifest_path) as f:
        manifest = json.load(f)

    expected = manifest['checksums']['sha256']
    calculated = calculate_checksum(backup_path, 'sha256')

    return ChecksumResult(
        algorithm='sha256',
        expected=expected,
        calculated=calculated,
        match=(expected == calculated),
        file_size=backup_path.stat().st_size
    )


def create_backup_manifest(backup_path: Path) -> Dict:
    """
    Create integrity manifest for a backup file.
    Includes multiple checksums for defense in depth.
    """
    return {
        'backup_file': backup_path.name,
        'file_size': backup_path.stat().st_size,
        'checksums': {
            'sha256': calculate_checksum(backup_path, 'sha256'),
            'sha512': calculate_checksum(backup_path, 'sha512'),
        },
        'created_at': datetime.now().isoformat()
    }
```

Beyond bit-level integrity, backups must have valid internal structure. Structural validation verifies that backup files conform to expected format specifications—headers present, segments properly delimited, metadata consistent.
Database-specific structural validation:
- PostgreSQL: `pg_restore --list` validates backup structure without restoring; `pg_verifybackup` (v13+) validates base backups against a manifest
- MySQL: `mysqlbinlog --verify-binlog-checksum` for binary logs; `myisamchk` for MyISAM table files
- SQL Server: `RESTORE VERIFYONLY` validates backup sets; `DBCC CHECKDB` for logical consistency
- Oracle: `RMAN VALIDATE` checks backup file structure; `DBVERIFY` validates datafile blocks
- MongoDB: `mongorestore --dryRun` validates without restoring; oplog validation for replica consistency
```bash
#!/bin/bash
# Structural validation for PostgreSQL backups

BACKUP_FILE="$1"
LOG_FILE="/var/log/backup-validation.log"

log() { echo "[$(date -Iseconds)] $*" | tee -a "$LOG_FILE"; }

log "Validating structure: $BACKUP_FILE"

# Method 1: List backup contents (validates format)
if pg_restore --list "$BACKUP_FILE" > /dev/null 2>&1; then
    OBJECT_COUNT=$(pg_restore --list "$BACKUP_FILE" | wc -l)
    log "PASS: Valid format, contains $OBJECT_COUNT objects"
else
    log "FAIL: Backup format invalid or corrupted"
    exit 1
fi

# Method 2: Verify against manifest (PostgreSQL 13+)
MANIFEST="${BACKUP_FILE%.backup}.manifest"
if [[ -f "$MANIFEST" ]]; then
    if pg_verifybackup -m "$MANIFEST" "$BACKUP_FILE"; then
        log "PASS: Manifest verification successful"
    else
        log "FAIL: Manifest verification failed"
        exit 1
    fi
fi

log "Structural validation complete: PASS"
```

Enterprise databases store page/block-level checksums that enable fine-grained corruption detection. Block verification validates each data page independently, identifying corruption even when file-level checksums pass.
How block-level checksums work:
Database pages (typically 8KB or 16KB) include embedded checksums calculated when the page is written. During verification, each page's content is re-checksummed and compared against the embedded value. A mismatch indicates corruption.
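To make the mechanism concrete, here is a conceptual sketch in Python. The 8 KB page size and 4-byte CRC footer are illustrative assumptions, not any specific engine's on-disk page format:

```python
#!/usr/bin/env python3
"""Conceptual sketch of page-level checksum verification.
The 8 KB page size and 4-byte CRC footer are illustrative assumptions,
not any specific database engine's on-disk page format."""

import zlib
from pathlib import Path
from typing import List

PAGE_SIZE = 8192       # typical database page size
CHECKSUM_BYTES = 4     # assume each page ends with a 4-byte CRC of its payload


def find_corrupt_pages(data_file: Path) -> List[int]:
    """Return the page numbers whose embedded checksum does not match the payload."""
    corrupt_pages = []
    with open(data_file, 'rb') as f:
        page_no = 0
        while page := f.read(PAGE_SIZE):
            payload, stored = page[:-CHECKSUM_BYTES], page[-CHECKSUM_BYTES:]
            calculated = zlib.crc32(payload).to_bytes(4, 'big')
            if calculated != stored:
                corrupt_pages.append(page_no)  # pinpoints exactly which page is damaged
            page_no += 1
    return corrupt_pages
```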
Advantages over file-level checksums: block verification pinpoints exactly which pages are damaged instead of flagging the entire file, and it can surface corruption that a file-level checksum misses, such as pages that were already corrupt in the source database when the backup (and its file checksum) was created.
Block-level checksums must usually be enabled when the database is created (e.g., PostgreSQL's data_checksums, set via initdb --data-checksums); turning them on later typically requires downtime or a rebuild. Always enable this feature for production databases: the small write-time overhead is far outweighed by the corruption detection it provides.
Even structurally valid backups can contain logically inconsistent data—referential integrity violations, constraint failures, or application-level anomalies. Consistency verification validates data correctness beyond physical integrity.
Consistency check categories:
| Check Type | What It Validates | Failure Indicates | Verification Method |
|---|---|---|---|
| Primary Key Uniqueness | No duplicate primary keys | Backup during concurrent modification | SELECT with GROUP BY HAVING COUNT > 1 |
| Foreign Key Integrity | All references valid | Incomplete backup or truncation | LEFT JOIN WHERE parent IS NULL |
| Check Constraints | Domain constraints satisfied | Data corruption or invalid backup | COUNT(*) WHERE NOT constraint |
| Index Consistency | Index entries match table data | Index corruption in source | REINDEX and compare counts |
| Sequence Alignment | Sequences ahead of max values | Sequence not captured in backup | Compare sequence value vs MAX(column) |
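Each verification method in the table translates into a query that returns rows only when the rule is violated. The sketch below illustrates that pattern with sqlite3 so it runs standalone; the orders/customers schema and column names are hypothetical, and in practice you would run equivalent queries against the restored database:

```python
#!/usr/bin/env python3
"""Sketch of consistency verification against a restored copy.
Uses sqlite3 so the example runs standalone; the schema is hypothetical."""

import sqlite3

# Each query returns rows only when the corresponding rule is violated.
CONSISTENCY_CHECKS = {
    'primary_key_uniqueness': """
        SELECT id, COUNT(*) FROM orders GROUP BY id HAVING COUNT(*) > 1
    """,
    'foreign_key_integrity': """
        SELECT o.id FROM orders o
        LEFT JOIN customers c ON o.customer_id = c.id
        WHERE c.id IS NULL
    """,
}


def run_consistency_checks(conn: sqlite3.Connection) -> dict:
    """Run every check; any non-empty result set marks the restored data inconsistent."""
    results = {}
    for name, query in CONSISTENCY_CHECKS.items():
        violations = conn.execute(query).fetchall()
        results[name] = {'passed': not violations, 'violations': len(violations)}
    return results


if __name__ == '__main__':
    conn = sqlite3.connect(':memory:')
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY);
        CREATE TABLE orders (id INTEGER, customer_id INTEGER);
        INSERT INTO customers VALUES (1);
        INSERT INTO orders VALUES (1, 1), (2, 99);  -- order 2 references a missing customer
    """)
    for name, outcome in run_consistency_checks(conn).items():
        status = 'PASS' if outcome['passed'] else 'FAIL'
        print(f"{name}: {status} ({outcome['violations']} violations)")
```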
Application-level consistency:
Beyond database constraints, applications often have implicit consistency rules that the database itself does not enforce, such as derived totals that must match their underlying detail rows, or status fields that imply the presence of certain timestamps. Include application-specific consistency queries in your verification suite to catch this kind of domain-level corruption.
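Application-level rules slot into the same pattern as the sketch above. The queries below are hypothetical examples of domain invariants the database itself does not enforce:

```python
# Hypothetical application-level checks: each query returns rows only on violation.
# These plug into the same run_consistency_checks() runner shown above.
APPLICATION_CHECKS = {
    # Derived totals must match their underlying detail rows.
    'order_totals_match_line_items': """
        SELECT o.id
        FROM orders o
        JOIN line_items li ON li.order_id = o.id
        GROUP BY o.id, o.total
        HAVING o.total <> SUM(li.amount)
    """,
    # Status transitions imply timestamps: shipped orders must have a shipped_at value.
    'shipped_orders_have_timestamps': """
        SELECT id FROM orders WHERE status = 'shipped' AND shipped_at IS NULL
    """,
}
```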
Integrity checking should be automated and continuous—running after every backup, during storage, and periodically throughout retention. An automated pipeline ensures no backup goes unverified.
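A minimal sketch of such a pipeline is shown below, assuming the calculate_checksum() helper from the earlier example and a PostgreSQL custom-format backup; the module name and messages are placeholders. The stages run in order and stop at the first failure:

```python
#!/usr/bin/env python3
"""Sketch of an automated post-backup verification pipeline.
Runs bit-level and structural checks in sequence and stops at the first failure."""

import json
import subprocess
from pathlib import Path

# Helper from the earlier checksum example (module name is hypothetical).
from backup_checksums import calculate_checksum


def verify_pipeline(backup_path: Path, manifest_path: Path) -> bool:
    manifest = json.loads(manifest_path.read_text())

    # Stage 1: bit-level integrity -- recalculate and compare against the manifest.
    if calculate_checksum(backup_path, 'sha256') != manifest['checksums']['sha256']:
        print(f"FAIL checksum mismatch: {backup_path.name}")
        return False

    # Stage 2: structural validation -- the backup must be readable without restoring.
    listing = subprocess.run(
        ['pg_restore', '--list', str(backup_path)],
        capture_output=True,
    )
    if listing.returncode != 0:
        print(f"FAIL invalid structure: {backup_path.name}")
        return False

    print(f"PASS all integrity stages: {backup_path.name}")
    return True
```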
Combine integrity checks with immutable storage (WORM, object lock). Immutability prevents malicious modification, while checksums detect corruption. Together, they provide defense-in-depth against both intentional attacks and accidental damage.
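As one possible implementation, the sketch below uploads a backup to S3-compatible object storage under a compliance-mode retention lock and stores its checksum as object metadata so downloads can be re-verified. It assumes boto3 is installed and the target bucket was created with Object Lock enabled; the bucket name and 30-day retention period are placeholders:

```python
#!/usr/bin/env python3
"""Sketch: pair checksums with immutable object storage.
Assumes boto3 and a bucket created with Object Lock enabled; retention is a placeholder."""

from datetime import datetime, timedelta, timezone
from pathlib import Path

import boto3


def upload_immutable_backup(backup_path: Path, bucket: str, sha256_digest: str) -> None:
    s3 = boto3.client('s3')
    with open(backup_path, 'rb') as f:
        s3.put_object(
            Bucket=bucket,
            Key=backup_path.name,
            Body=f,
            # The checksum travels with the object so later downloads can be re-verified.
            Metadata={'sha256': sha256_digest},
            # Compliance mode: the object cannot be overwritten or deleted
            # until the retention date passes.
            ObjectLockMode='COMPLIANCE',
            ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=30),
        )
```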
When integrity checks fail, immediate response is critical. A failed check means your backup safety net has a hole that must be addressed before the next disaster.
Integrity failure response protocol: quarantine the affected backup so nothing attempts to restore from it, trigger a replacement backup immediately, verify other backups from the same source and time window, and investigate the root cause before closing the incident.
An integrity failure you ignore today becomes a data loss event tomorrow. Treat every failed check as a production incident requiring tracking, resolution, and post-mortem analysis.
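One way to make that response automatic is sketched below; the quarantine directory is hypothetical and the print() calls stand in for your own paging or ticketing integration:

```python
#!/usr/bin/env python3
"""Sketch of an integrity-failure response hook: quarantine the suspect backup
and raise an incident. Paths and alerting are placeholders."""

import shutil
from pathlib import Path

QUARANTINE_DIR = Path('/backups/quarantine')  # hypothetical location


def handle_integrity_failure(backup_path: Path, reason: str) -> None:
    # 1. Remove the backup from the restore rotation so nothing attempts to use it.
    QUARANTINE_DIR.mkdir(parents=True, exist_ok=True)
    quarantined = QUARANTINE_DIR / backup_path.name
    shutil.move(str(backup_path), str(quarantined))

    # 2. Open an incident and trigger a replacement backup (placeholder alerting).
    print(f"INCIDENT: integrity check failed for {backup_path.name}: {reason}")
    print(f"Quarantined at {quarantined}; schedule a fresh backup and investigate the root cause.")
```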
You now understand integrity checking principles—the systematic detection of corruption before it impacts recovery. Next, we'll examine documentation practices that ensure backup and recovery procedures are well-defined and accessible.