A backup that runs but fails silently is worse than no backup at all—it creates false confidence. Organizations discover their backup failures only during a crisis, when the data they expected to recover simply isn't there.
Monitoring transforms backup from a scheduled job into a verified protection system.
Comprehensive monitoring continuously answers the critical questions: Did every scheduled backup run? Did it succeed? Is each database within its RPO target? And can the backups actually be restored?
By the end of this page, you will understand what metrics to collect, how to set meaningful alerts, how to build operational dashboards, and how to implement proactive monitoring that catches issues before they become crises.
Effective monitoring starts with collecting the right metrics. These fall into several categories:
| Metric | What It Measures | Why It Matters | Alert Threshold Example |
|---|---|---|---|
| Success Rate | % of backups completing without error | Primary health indicator | < 99% triggers warning; < 95% critical |
| Duration | Time from start to completion | Performance baseline, window compliance | > 150% of average triggers alert |
| Size | Backup data volume | Growth trends, anomaly detection | > 200% or < 50% of expected size |
| Throughput | MB/s during backup | Infrastructure performance | < 50% of baseline |
| Last Successful | Time since last good backup | RPO compliance monitoring | > RPO target triggers critical |
| Verification Status | Last successful restore test | Recovery assurance | > 30 days since last test |
| Storage Utilization | Backup storage consumption | Capacity planning | > 80% of capacity triggers warning |
| Retention Compliance | Backups meeting retention policy | Regulatory compliance | Any non-compliance is critical |
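The thresholds in this table translate directly into alerting logic. A minimal sketch, assuming a summary dict with illustrative field names and the example cutoffs above; the SQL queries below show how such a summary can be computed from a backup catalog:

```python
# Sketch: evaluate a 24-hour metrics summary against the example thresholds
# from the table above. Field names and cutoff values are illustrative.
def classify_backup_health(summary: dict) -> str:
    """Classify a metrics summary as 'critical', 'warning', or 'ok'."""
    if summary['success_rate'] < 95.0:
        return 'critical'                      # < 95% success is critical
    if summary['hours_since_last_success'] > summary['rpo_hours']:
        return 'critical'                      # RPO target exceeded
    if summary['success_rate'] < 99.0:
        return 'warning'                       # < 99% success is a warning
    if summary['last_duration_minutes'] > 1.5 * summary['avg_duration_minutes']:
        return 'warning'                       # run took > 150% of the recent average
    if summary['storage_used_fraction'] > 0.80:
        return 'warning'                       # capacity-planning threshold
    return 'ok'
```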
```sql
-- Core backup metrics queries for monitoring dashboard

-- Current backup health summary
SELECT
    COUNT(*) AS total_jobs,
    SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) AS successful,
    ROUND(100.0 * SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) / COUNT(*), 2) AS success_rate,
    AVG(duration_minutes) AS avg_duration,
    MAX(duration_minutes) AS max_duration,
    SUM(size_bytes) / (1024^3) AS total_size_gb
FROM backup_history
WHERE start_time > NOW() - INTERVAL '24 hours';

-- Databases exceeding RPO (no backup within target window)
SELECT
    d.database_name,
    d.rpo_hours,
    MAX(b.end_time) AS last_backup,
    EXTRACT(EPOCH FROM (NOW() - MAX(b.end_time))) / 3600 AS hours_since_backup,
    CASE WHEN EXTRACT(EPOCH FROM (NOW() - MAX(b.end_time))) / 3600 > d.rpo_hours
         THEN 'VIOLATION' ELSE 'OK' END AS rpo_status
FROM databases d
LEFT JOIN backup_history b
       ON d.database_name = b.database_name AND b.status = 'success'
GROUP BY d.database_name, d.rpo_hours
HAVING MAX(b.end_time) IS NULL
    OR EXTRACT(EPOCH FROM (NOW() - MAX(b.end_time))) / 3600 > d.rpo_hours;

-- Duration trend analysis (detect slowdowns)
SELECT
    database_name,
    DATE(start_time) AS backup_date,
    AVG(duration_minutes) AS avg_duration,
    LAG(AVG(duration_minutes), 7) OVER (PARTITION BY database_name ORDER BY DATE(start_time)) AS duration_7d_ago,
    ROUND(100.0 * (AVG(duration_minutes)
                   - LAG(AVG(duration_minutes), 7) OVER (PARTITION BY database_name ORDER BY DATE(start_time)))
          / NULLIF(LAG(AVG(duration_minutes), 7) OVER (PARTITION BY database_name ORDER BY DATE(start_time)), 0),
          1) AS pct_change
FROM backup_history
WHERE start_time > NOW() - INTERVAL '30 days'
GROUP BY database_name, DATE(start_time)
ORDER BY database_name, backup_date DESC;
```

Alerts must be actionable, not overwhelming. A team drowning in alerts ignores them all. Design alerting with clear escalation tiers:
```yaml
# Prometheus alerting rules for backup monitoring
groups:
  - name: backup_alerts
    rules:
      # Critical: RPO violation
      - alert: BackupRPOViolation
        expr: (time() - backup_last_success_timestamp) > (backup_rpo_seconds * 1.0)
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "RPO violation for {{ $labels.database }}"
          description: "Last successful backup was {{ $value | humanizeDuration }} ago. RPO target: {{ $labels.rpo }}"

      # Critical: Backup failure
      - alert: BackupFailed
        expr: backup_last_status{tier="1"} == 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Tier 1 backup failed: {{ $labels.database }}"

      # Warning: Duration anomaly
      - alert: BackupDurationAnomaly
        expr: backup_duration_seconds > (backup_duration_avg_seconds * 1.5)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Backup taking 50%+ longer than average"

      # Warning: Storage capacity
      - alert: BackupStorageNearFull
        expr: backup_storage_used_bytes / backup_storage_total_bytes > 0.85
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Backup storage at {{ $value | humanizePercentage }}"
```

Every alert should have a documented response. If an alert fires and the response is 'ignore it,' the alert shouldn't exist. Regularly review and tune alert thresholds based on actual response patterns, and track the alert-to-action ratio.
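The rules above reference metrics such as backup_last_success_timestamp, backup_last_status, and backup_duration_seconds, so something has to publish them. Below is a sketch of a backup wrapper pushing them to a Prometheus Pushgateway with the prometheus_client library; the gateway address, job naming, and label set are assumptions:

```python
# Sketch: publish per-database backup metrics to a Prometheus Pushgateway after
# each run. The gateway address, job naming, and label set are assumptions; the
# metric names mirror the alerting rules above.
import time

from prometheus_client import CollectorRegistry, Gauge, pushadd_to_gateway

def publish_backup_metrics(database: str, tier: str, succeeded: bool,
                           duration_seconds: float, rpo_seconds: int,
                           gateway: str = 'pushgateway:9091') -> None:
    registry = CollectorRegistry()
    labelnames = ['database', 'tier']

    status = Gauge('backup_last_status', '1 if the last run succeeded, 0 if it failed',
                   labelnames, registry=registry)
    duration = Gauge('backup_duration_seconds', 'Duration of the last backup run',
                     labelnames, registry=registry)
    rpo = Gauge('backup_rpo_seconds', 'RPO target for this database',
                labelnames, registry=registry)

    status.labels(database, tier).set(1 if succeeded else 0)
    duration.labels(database, tier).set(duration_seconds)
    rpo.labels(database, tier).set(rpo_seconds)

    if succeeded:
        last_success = Gauge('backup_last_success_timestamp',
                             'Unix time of the last successful backup',
                             labelnames, registry=registry)
        last_success.labels(database, tier).set(time.time())

    # pushadd (HTTP POST) merges with metrics the gateway already holds, so a
    # failed run does not erase the previously recorded success timestamp.
    pushadd_to_gateway(gateway, job=f'backup_{database}', registry=registry)
```

A dedicated exporter that scrapes the backup catalog works just as well; the point is that every metric an alert rule references must be produced and kept current somewhere.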
Dashboards provide at-a-glance visibility into backup health. Design dashboards for different audiences:

Executive Dashboard: overall protection status, SLA compliance, and high-level trends, summarized rather than itemized.

Operations Dashboard: real-time job status, failures, and RPO compliance for day-to-day response.

Compliance Dashboard: verification history, retention adherence, and the evidence needed for audits.
| Component | Visualization | Update Frequency | Primary Audience |
|---|---|---|---|
| Success Rate | Gauge (green/yellow/red) | Real-time | All |
| RPO Compliance | Status grid by database | 5 minutes | Operations |
| Active Jobs | Live table with progress | Real-time | Operations |
| Failure Log | Scrolling event list | Real-time | Operations |
| Duration Trends | Time-series graph | Hourly | Capacity Planning |
| Storage Growth | Area chart with projection | Daily | Capacity Planning |
| Verification Calendar | Heatmap by date | Daily | Compliance |
Use color consistently: Green = healthy, Yellow = attention needed, Red = action required. If your dashboard is always yellow or red, either your systems need fixing or your thresholds need adjustment. A perpetually alarmed dashboard becomes background noise.
A backup that cannot be restored is not a backup. Verification monitoring ensures backups are actually recoverable.
Verification levels, in increasing order of confidence and cost:

1. Completion check: the backup job reported success
2. Checksum validation: the backup file matches the checksum recorded at backup time
3. Header/metadata read: the backup opens and its structure is readable
4. Partial restore: selected objects are restored and spot-checked
5. Full restore: the entire backup is restored to a test environment
6. Application-level validation: the application runs correctly against the restored data
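As an illustration of level 2, here is a minimal sketch of checksum validation against a value recorded at backup time; the function name and chunk size are illustrative, and the expected checksum would come from the backup catalog:

```python
# Sketch: level-2 verification - recompute a backup file's SHA-256 and compare
# it with the checksum recorded when the backup was taken.
import hashlib
from pathlib import Path

def verify_checksum(backup_path: str, expected_sha256: str,
                    chunk_size: int = 4 * 1024 * 1024) -> bool:
    digest = hashlib.sha256()
    with Path(backup_path).open('rb') as f:
        # Stream in chunks so multi-terabyte backups do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256.lower()
```

Levels 4 and above are better run as scheduled restore-test jobs that record their outcome in the backup_verifications table defined below.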
```sql
-- Track verification status and schedule
CREATE TABLE backup_verifications (
    verification_id UUID PRIMARY KEY,
    backup_id UUID REFERENCES backup_catalog(backup_id),
    verification_level INT,   -- 1=completion, 2=checksum, 3=header, 4=partial, 5=full, 6=app
    verified_at TIMESTAMP DEFAULT NOW(),
    verified_by VARCHAR(100),
    result VARCHAR(20),       -- passed, failed, partial
    duration_seconds INT,
    notes TEXT
);

-- Databases overdue for restore testing
SELECT
    d.database_name,
    d.tier,
    d.verification_frequency_days,
    MAX(v.verified_at) AS last_verified,
    CURRENT_DATE - MAX(v.verified_at)::date AS days_since_verification,
    CASE WHEN CURRENT_DATE - MAX(v.verified_at)::date > d.verification_frequency_days
         THEN 'OVERDUE' ELSE 'OK' END AS status
FROM databases d
LEFT JOIN backup_catalog b ON d.database_name = b.database_name
LEFT JOIN backup_verifications v
       ON b.backup_id = v.backup_id AND v.verification_level >= 4
GROUP BY d.database_name, d.tier, d.verification_frequency_days
HAVING MAX(v.verified_at) IS NULL
    OR CURRENT_DATE - MAX(v.verified_at)::date > d.verification_frequency_days
ORDER BY tier, days_since_verification DESC;
```

Monitoring must be automated and continuous. Manual checking doesn't scale and misses off-hours failures.
```python
#!/usr/bin/env python3
"""Automated backup monitoring and alerting"""

import logging

import psycopg2
import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class BackupMonitor:
    def __init__(self, db_config: dict, alert_config: dict):
        self.db_config = db_config
        self.alert_config = alert_config

    def check_rpo_compliance(self) -> list:
        """Check all databases against their RPO targets"""
        violations = []
        with psycopg2.connect(**self.db_config) as conn:
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT database_name, rpo_hours,
                           EXTRACT(EPOCH FROM (NOW() - last_backup)) / 3600 AS hours_since
                    FROM database_rpo_status
                    WHERE EXTRACT(EPOCH FROM (NOW() - last_backup)) / 3600 > rpo_hours
                """)
                for row in cur.fetchall():
                    violations.append({
                        'database': row[0],
                        'rpo_hours': row[1],
                        'hours_since': row[2],
                        'severity': 'critical'
                    })
        return violations

    def check_recent_failures(self) -> list:
        """Check for backup failures in the last window"""
        failures = []
        with psycopg2.connect(**self.db_config) as conn:
            with conn.cursor() as cur:
                cur.execute("""
                    SELECT database_name, backup_type, error_message, end_time
                    FROM backup_history
                    WHERE status = 'failed'
                      AND end_time > NOW() - INTERVAL '1 hour'
                """)
                for row in cur.fetchall():
                    failures.append({
                        'database': row[0],
                        'type': row[1],
                        'error': row[2],
                        'time': row[3]
                    })
        return failures

    def send_alert(self, alert: dict):
        """Send alert via configured channels"""
        if alert['severity'] == 'critical':
            # PagerDuty for critical
            requests.post(self.alert_config['pagerduty_url'], json={'event': alert})
        # All alerts to Slack
        requests.post(self.alert_config['slack_webhook'],
                      json={'text': f"[{alert['severity'].upper()}] {alert['message']}"})
        logger.info(f"Alert sent: {alert}")

    def run_checks(self):
        """Run all monitoring checks"""
        # RPO violations
        for v in self.check_rpo_compliance():
            self.send_alert({
                'severity': 'critical',
                'message': (f"RPO violation: {v['database']} - "
                            f"{v['hours_since']:.1f}h since backup (target: {v['rpo_hours']}h)")
            })
        # Recent failures
        for f in self.check_recent_failures():
            self.send_alert({
                'severity': 'critical' if 'tier1' in f['database'] else 'warning',
                'message': f"Backup failed: {f['database']} - {f['error']}"
            })


if __name__ == "__main__":
    monitor = BackupMonitor(
        db_config={'host': 'monitoring-db', 'database': 'backup_catalog'},
        alert_config={'slack_webhook': 'https://...', 'pagerduty_url': 'https://...'}
    )
    monitor.run_checks()
```

Regular reporting demonstrates backup program effectiveness and supports compliance audits.
Essential reports:
| Report | Frequency | Audience | Key Content |
|---|---|---|---|
| Daily Status | Daily | Operations | Success/failure summary, exceptions, actions required |
| Weekly Summary | Weekly | IT Leadership | Success rates, duration trends, capacity projections |
| Monthly Review | Monthly | Management | SLA compliance, cost analysis, improvement initiatives |
| Compliance Report | Quarterly | Audit/Compliance | Retention adherence, encryption status, verification history |
| DR Test Report | Quarterly | Executive/Board | Recovery test results, RTO/RPO achievement |
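The Daily Status report is the easiest to automate. Below is a sketch that builds it from the backup catalog and posts it to a team channel; the table, column, and webhook details are assumptions:

```python
# Sketch: build the Daily Status report from the backup catalog and post it to
# a chat channel. Table/column names and the webhook URL are assumptions.
import psycopg2
import requests

DAILY_SUMMARY_SQL = """
    SELECT COUNT(*) AS total,
           SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) AS successful,
           SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) AS failed
    FROM backup_history
    WHERE start_time > NOW() - INTERVAL '24 hours'
"""

FAILURES_SQL = """
    SELECT database_name, error_message
    FROM backup_history
    WHERE status = 'failed' AND start_time > NOW() - INTERVAL '24 hours'
    ORDER BY database_name
"""

def send_daily_status(db_config: dict, slack_webhook: str) -> None:
    with psycopg2.connect(**db_config) as conn:
        with conn.cursor() as cur:
            cur.execute(DAILY_SUMMARY_SQL)
            total, successful, failed = cur.fetchone()
            cur.execute(FAILURES_SQL)
            failures = cur.fetchall()

    rate = 100.0 * successful / total if total else 0.0
    lines = [f"Daily backup status: {successful}/{total} succeeded ({rate:.1f}%)"]
    # List each failure with its error so the report names the required actions.
    lines += [f"FAILED: {db} - {err}" for db, err in failures]
    if failed == 0:
        lines.append("No exceptions. No action required.")

    requests.post(slack_webhook, json={'text': "\n".join(lines)})
```

Weekly and monthly reports aggregate the same data over longer windows and add the trend and capacity views called out in the table.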
Maintain documentation that answers auditor questions: What is backed up? How often? Where is it stored? How is it protected? How do you verify it works? Keep this current and accessible—scrambling during an audit reveals weakness.
Move beyond reactive alerting to predict and prevent failures before they impact data protection.
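Storage exhaustion is a good example: it is predictable well before it becomes an outage. Here is a sketch that fits a linear trend to recent utilization samples and estimates the days until the backup volume fills; the sampling source and lead-time threshold are assumptions:

```python
# Sketch: predictive capacity check - fit a linear trend to recent storage
# usage samples and estimate when the backup volume will reach capacity.
from datetime import datetime
from typing import Optional, Sequence, Tuple

def days_until_full(samples: Sequence[Tuple[datetime, float]],
                    capacity_bytes: float) -> Optional[float]:
    """samples: (timestamp, used_bytes) pairs, oldest first.
    Returns None if usage is flat or shrinking."""
    if len(samples) < 2:
        return None
    t0 = samples[0][0]
    xs = [(ts - t0).total_seconds() / 86400.0 for ts, _ in samples]  # days since first sample
    ys = [used for _, used in samples]

    # Least-squares slope (bytes per day) without external dependencies.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    denom = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / denom if denom else 0.0
    if slope <= 0:
        return None

    remaining = max(capacity_bytes - ys[-1], 0.0)
    return remaining / slope

# Usage (illustrative): if the projection falls inside your procurement lead
# time (say 30 days), raise a warning through the same alerting channels used above.
```

The same approach applies to duration: the percent-change query earlier flags backups drifting toward their window limit before they actually miss it.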
Technical metrics matter to operations, but the business cares about two questions: Can we recover if something goes wrong, and how much data might we lose? Frame monitoring outputs in business terms: 'All critical systems can recover within 1 hour with less than 15 minutes of data loss' is more meaningful than 'backup success rate 99.2%'.
You have completed the Backup Best Practices module. You now understand scheduling, retention, offsite storage, encryption, and monitoring—the five pillars of enterprise backup strategy. Apply these practices to build backup systems that truly protect organizational data.