Every alert is a demand for human attention. When an alert fires, it rips someone from deep work, wakes them at 3 AM, or adds to the cognitive load of an already stressed on-call engineer. This is a precious resource that must be spent wisely.
Poorly configured alerting systems create two dangerous failure modes:
Alert Fatigue — Too many alerts, mostly noise. Engineers stop responding to pages, real incidents get missed, and trust in monitoring erodes.
Alert Blindness — Critical conditions pass unnoticed because no alert was configured, thresholds were wrong, or symptoms weren't correlated to root causes.
The goal of alert configuration is surgical precision: wake someone up when and only when a human needs to take action. Everything else belongs in dashboards, logs, or automated remediation—not in pagers.
By the end of this page, you will be able to:
- Understand alerting philosophy and the distinction between alerts and monitoring
- Design effective alert thresholds using symptom-based methods
- Implement multi-tier severity classifications
- Configure alert routing and escalation
- Apply best practices that minimize noise while catching real incidents
Before configuring any alerts, we must establish principles that guide our decisions. Modern Site Reliability Engineering (SRE) practices emphasize a fundamental distinction:
Alerts vs. Monitoring: monitoring collects and visualizes everything you might need when investigating a problem; alerts are the narrow subset of conditions that must interrupt a human immediately. If a condition can safely wait until someone looks at a dashboard, it is monitoring data, not an alert.
The Three Alert Questions:
For every potential alert, ask:
Does this require immediate human action? If it can wait until morning, it's not an alert—it's a ticket or dashboard item.
Is the condition actionable? If the responder can't do anything useful, the alert wastes everyone's time.
Does this indicate user impact? Internal metrics without user symptoms may indicate problems worth investigating, but not paging.
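As a rough illustration, the answers to these three questions can be encoded directly on the alert as routing labels, so that conditions demanding immediate action reach a pager while everything else becomes a ticket. A minimal sketch in Prometheus rule syntax; the `notify` label convention and the `pg_table_bloat_ratio` metric are assumptions for illustration, not part of any standard exporter:

```yaml
groups:
  - name: page_vs_ticket_examples
    rules:
      # Immediate, actionable, user-impacting: this one may wake someone up
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
          notify: page      # assumed label convention consumed by the router
      # Real problem, but it can wait until morning: file a ticket instead
      - alert: TableBloatHigh
        expr: pg_table_bloat_ratio > 0.4   # assumed metric from a custom exporter
        for: 6h
        labels:
          severity: low
          notify: ticket    # assumed label convention: routed to the ticket queue
```

The routing configuration later on this page shows how labels like these can drive notification channels.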
Symptom-Based vs. Cause-Based Alerting:
| Approach | Example | Pros | Cons |
|---|---|---|---|
| Cause-Based | CPU > 90% | Easy to understand, fast detection | High false positive rate; many causes don't affect users |
| Symptom-Based | p99 latency > 2s | Directly measures user impact | May miss silent problems; requires good instrumentation |
| Combined | p99 latency > 2s AND CPU > 90% | Reduces noise while maintaining sensitivity | More complex to configure and maintain |
High CPU isn't a problem if users aren't affected. Disk at 85% isn't urgent if growth rate is slow. Symptom-based alerts (response time, error rate, availability) align alerting with what actually matters: user experience. Use cause-based alerts to diagnose symptoms, not to page.
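A hedged sketch of the combined row from the table above: page on the user-facing symptom, but only when the suspected cause is also present, and keep both visible in the alert. It assumes the `query_duration_seconds_bucket` histogram used elsewhere on this page, the standard node_exporter CPU metric, and that both exporters share the same `instance` label:

```yaml
groups:
  - name: symptom_with_cause
    rules:
      - alert: QueryLatencyHighWithCPUSaturation
        expr: |
          histogram_quantile(0.99,
            sum(rate(query_duration_seconds_bucket[5m])) by (le, instance)
          ) > 2
          and on (instance)
          (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "p99 latency above 2s on {{ $labels.instance }} while CPU exceeds 90%"
```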
Not all problems demand the same response. A well-designed severity system ensures critical issues get immediate attention while less urgent conditions are handled appropriately.
Four-Tier Severity Model:
| Severity | Response Time | Criteria | Notification |
|---|---|---|---|
| P1 — Critical | Immediate (< 5 min) | Production down, data loss risk, security breach | Phone call, SMS, multiple escalation paths |
| P2 — High | < 30 min | Significant degradation, imminent critical if unchecked | Page/SMS during on-call hours |
| P3 — Medium | < 4 hours | Minor degradation, no immediate user impact | Email, Slack, next business day if after hours |
| P4 — Low | < 24 hours | Anomalies, trend warnings, maintenance needed | Dashboard, weekly review, ticket creation |
Database-Specific Severity Examples:
If everything is P1, nothing is P1. Resist pressure to escalate alert severity 'just in case.' Each severity level must have genuinely different response requirements. Regular review of alert firing frequency helps calibrate severity appropriately.
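Each tier should also map to a genuinely different notification path in the router. A minimal Alertmanager routing sketch, under the assumption that alerts carry a `severity` label with values `p1` through `p4` (receiver definitions are omitted; the full configuration format appears later on this page):

```yaml
route:
  receiver: 'ticket-queue'            # P4 default: ticket / weekly review
  routes:
    - match:
        severity: p1
      receiver: 'pagerduty-phone'     # P1: phone/SMS, aggressive re-notification
      repeat_interval: 30m
    - match:
        severity: p2
      receiver: 'pagerduty-oncall'    # P2: page the on-call rotation
    - match:
        severity: p3
      receiver: 'slack-team-channel'  # P3: Slack/email, next business day
```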
Alert thresholds are the numeric boundaries that trigger notifications. Setting them requires balancing sensitivity (catching problems) against specificity (avoiding false positives).
Static vs. Dynamic Thresholds:
| Type | How It Works | When to Use | Limitations |
|---|---|---|---|
| Static Threshold | Fixed value: CPU > 90% | Absolute limits (disk space, connections) | Ignores normal variation; many false positives |
| Baseline Threshold | Deviation from historical average | Metrics with predictable patterns | Requires training period; can normalize bad states |
| Rate of Change | Change velocity: growth > 10%/hour | Detecting sudden spikes or degradation | Noisy for volatile metrics |
| Anomaly Detection | ML-based deviation scoring | Complex patterns; many metrics | Black box; requires tuning; delayed detection |
Threshold Design Patterns:
```yaml
# Pattern 1: Multi-Window Rate Alerting
# Alert only if condition persists across multiple time windows
# Reduces single-spike false positives

groups:
  - name: database_alerts
    rules:
      # Short window for critical issues
      - alert: DatabaseHighLatencyCritical
        expr: |
          histogram_quantile(0.99, rate(query_duration_seconds_bucket[1m])) > 5
        for: 2m  # Must persist for 2 minutes
        labels:
          severity: critical

      # Longer window for warning
      - alert: DatabaseHighLatencyWarning
        expr: |
          histogram_quantile(0.99, rate(query_duration_seconds_bucket[5m])) > 2
        for: 10m  # Must persist for 10 minutes
        labels:
          severity: warning

      # Pattern 2: Burn Rate Alerting (SLO-based)
      # Alert when consuming error budget faster than sustainable
      - alert: DatabaseErrorBudgetBurn
        expr: |
          (
            # Fast burn: 14.4x budget consumption in 1 hour
            (1 - (sum(rate(query_success_total[1h])) / sum(rate(query_total[1h])))) > (14.4 * 0.001)
            and
            (1 - (sum(rate(query_success_total[5m])) / sum(rate(query_total[5m])))) > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical

      # Pattern 3: Percentage Capacity with Growth Rate
      - alert: DiskSpaceWarning
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 20)
          or
          (predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low or depleting within 24 hours"
```

Key Threshold Design Principles:
Service Level Objectives (SLOs) provide principled threshold guidance. If your SLO is 99.9% query success rate (43 minutes of downtime allowed per month), you can calculate exactly when error rates threaten the SLO and alert accordingly. This replaces guesswork with math.
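To make that math concrete for the 99.9% example, here is a sketch of the arithmetic behind the burn-rate rule in Pattern 2 above (the 14.4 multiplier and the two-day exhaustion figure follow directly from the numbers; the multi-window convention itself comes from common SRE practice):

```yaml
# Error budget for a 99.9% SLO: 1 - 0.999 = 0.001 (0.1% of queries may fail)
# Allowed downtime per 30-day month: 0.001 * 30 * 24 * 60 ≈ 43 minutes
#
# Burn rate = observed error ratio / error budget
# A sustained burn rate of 14.4 exhausts the monthly budget in 30 / 14.4 ≈ 2.1 days,
# or equivalently burns 14.4 / 720 ≈ 2% of the budget in a single hour.
# That is why Pattern 2 pages when the 1-hour error ratio exceeds 14.4 * 0.001.
```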
Getting alerts to the right person at the right time is as important as detecting issues. Modern alerting systems provide sophisticated routing capabilities.
Alertmanager Configuration Example:
```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.company.com:587'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Route tree: defines how alerts are routed
route:
  # Default receiver
  receiver: 'slack-notifications'

  # Group alerts by these labels
  group_by: ['alertname', 'severity', 'database']

  # Wait before sending first notification (batching)
  group_wait: 30s

  # Wait before sending subsequent notification
  group_interval: 5m

  # Wait before re-notifying same alert
  repeat_interval: 4h

  # Child routes (evaluated in order, first match wins)
  routes:
    # Critical database alerts → PagerDuty immediately
    - match:
        severity: critical
        team: database
      receiver: 'pagerduty-database'
      group_wait: 10s
      repeat_interval: 30m
      continue: false

    # High severity → Slack + ticket
    - match:
        severity: high
      receiver: 'slack-high-priority'
      continue: true

    # Database team alerts → Database Slack channel
    - match_re:
        team: database|dba
      receiver: 'slack-database'

# Receiver definitions
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts-general'
        send_resolved: true

  - name: 'pagerduty-database'
    pagerduty_configs:
      - service_key: '<DATABASE_PAGERDUTY_KEY>'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          database: '{{ .CommonLabels.database }}'

  - name: 'slack-database'
    slack_configs:
      - channel: '#dba-alerts'
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'

# Inhibition rules: suppress alerts when related alerts are firing
inhibit_rules:
  # Don't alert on replica issues if primary is down
  - source_match:
      alertname: 'PrimaryDatabaseDown'
    target_match:
      alertname: 'ReplicationLag'
    equal: ['database_cluster']

  # Don't page warning if critical already firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'database']
```

Escalation Strategies:
When the primary database is down, you don't need separate alerts for replication lag, high connection count, and query failures. Inhibition rules automatically suppress dependent alerts when root-cause alerts are firing, reducing noise during incidents.
An alert that simply says 'Database problem' at 3 AM is useless. Effective alerts include enough context for immediate action.
Essential Alert Components:
| Component | Purpose | Example |
|---|---|---|
| Alert Name | Unique, descriptive identifier | PostgresDiskSpaceCritical |
| Severity | Urgency classification | P1 / Critical |
| Summary | One-line description of the problem | Disk usage at 95% on prod-db-1 |
| Description | Detailed explanation with values | Primary database disk at 95.3% usage. 12.4GB remaining. Growth rate: 500MB/day. Estimated exhaustion: 24 hours. |
| Affected Service | What user-facing systems are impacted | Order processing, checkout |
| Runbook Link | Direct link to remediation steps | https://wiki/runbooks/disk-space |
| Dashboard Link | Link to relevant visualization | https://grafana/d/postgres-overview |
| Labels | Machine-parseable metadata | team=database, environment=prod, database=orders |
Alert Annotation Templates:
```yaml
groups:
  - name: database_alerts
    rules:
      - alert: PostgreSQLReplicationLag
        expr: pg_replication_lag_seconds > 60
        for: 5m
        labels:
          severity: high
          team: database
        annotations:
          summary: "Replication lag high on {{ $labels.instance }}"
          description: |
            PostgreSQL replication lag is {{ $value | printf "%.0f" }} seconds on {{ $labels.instance }}.

            **Impact:** Read replicas serving stale data. In case of failover,
            up to {{ $value | printf "%.0f" }}s of transactions could be lost.

            **Threshold:** > 60 seconds sustained for 5 minutes
            **Current Value:** {{ $value | printf "%.1f" }} seconds

            **Possible Causes:**
            - High write load on primary
            - Network latency between primary and replica
            - Replica CPU/IO saturation
            - Long-running transactions blocking replay
          runbook_url: "https://wiki.company.com/runbooks/postgres-replication-lag"
          dashboard_url: "https://grafana.company.com/d/postgres-replication?var-instance={{ $labels.instance }}"

      - alert: DatabaseConnectionPoolExhausted
        expr: pg_stat_activity_count / pg_settings_max_connections > 0.9
        for: 5m
        labels:
          severity: critical
          team: database
        annotations:
          summary: "Connection pool > 90% on {{ $labels.instance }}"
          description: |
            Database {{ $labels.instance }} has {{ $value | printf "%.0f" }}% of max_connections in use.

            **Immediate Risk:** New connections will be refused. Application errors imminent.

            **Current Connections:** {{ with printf "pg_stat_activity_count{instance='%s'}" $labels.instance | query }}{{ . | first | value }}{{ end }}
            **Max Connections:** {{ with printf "pg_settings_max_connections{instance='%s'}" $labels.instance | query }}{{ . | first | value }}{{ end }}

            **Immediate Actions:**
            1. Check for idle-in-transaction sessions: https://grafana/d/sessions?var-instance={{ $labels.instance }}
            2. Check application connection pool configuration
            3. Consider increasing max_connections (requires restart)
          runbook_url: "https://wiki.company.com/runbooks/connection-exhaustion"
```

At 3 AM, an on-call engineer shouldn't need to remember remediation steps. Every alert should link to a runbook with specific troubleshooting and remediation procedures. No runbook? The alert isn't ready for production.
Alerts that never fire in production might not work when you need them. Alert testing is as important as code testing.
Alert Testing Strategies:
Promtool for Alert Rule Testing:
```yaml
# tests/alert_rules_test.yml
# Run with: promtool test rules tests/alert_rules_test.yml

rule_files:
  - ../prometheus/rules/database_alerts.yml

tests:
  # Test 1: Verify disk space alert fires at correct threshold
  - interval: 1m
    input_series:
      - series: 'node_filesystem_avail_bytes{instance="db-1", mountpoint="/"}'
        values: '100000000000 50000000000 10000000000 5000000000 4000000000'
      - series: 'node_filesystem_size_bytes{instance="db-1", mountpoint="/"}'
        values: '100000000000 100000000000 100000000000 100000000000 100000000000'
    alert_rule_test:
      - eval_time: 5m
        alertname: DiskSpaceCritical
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: db-1
            exp_annotations:
              summary: "Disk space critical on db-1"

  # Test 2: Verify replication lag alert with proper threshold
  - interval: 1m
    input_series:
      - series: 'pg_replication_lag_seconds{instance="replica-1"}'
        values: '70 80 90 100 110 120 130 140 150'
    alert_rule_test:
      - eval_time: 8m  # After 'for' duration
        alertname: PostgreSQLReplicationLag
        exp_alerts:
          - exp_labels:
              severity: high
              team: database
              instance: replica-1

  # Test 3: Verify alert does NOT fire for normal conditions
  - interval: 1m
    input_series:
      - series: 'pg_stat_activity_count{instance="db-1"}'
        values: '50 55 60 55 50'
      - series: 'pg_settings_max_connections{instance="db-1"}'
        values: '200 200 200 200 200'
    alert_rule_test:
      - eval_time: 5m
        alertname: DatabaseConnectionPoolExhausted
        exp_alerts: []  # Expect no alerts
```

Alert Review Process:
Regular review ensures alerts remain effective:
| Review Type | Frequency | What to Check |
|---|---|---|
| Weekly Triage | Weekly | Which alerts fired? How many were actionable? Any false positives? |
| Monthly Analysis | Monthly | Alert frequency trends, on-call burden, threshold adjustments needed |
| Quarterly Deep Dive | Quarterly | Are we catching real incidents? Missing any? Coverage gaps? |
| Post-Incident Review | After each incident | Did alerts fire appropriately? Fast enough? Good context? |
Track 'alert toil': time spent responding to alerts that don't result in meaningful action. If more than 50% of on-call time goes to noise, prioritize alert quality improvements. Google's SRE book treats 50% as the upper bound on the toil any engineer should carry.
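Prometheus exposes firing alerts as the synthetic `ALERTS` series, so the weekly triage can start from data instead of memory. A hedged sketch of two review queries (they approximate time spent firing by counting evaluation-interval samples; adjust the range to your review cadence):

```promql
# Time each alert spent firing over the past week, by alert name and severity
sum by (alertname, severity) (count_over_time(ALERTS{alertstate="firing"}[7d]))

# The ten noisiest alerts by the same measure: prime candidates for threshold review
topk(10, sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d])))
```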
Let's consolidate the essential database alerts every monitoring system should include.
Availability Alerts:
```yaml
groups:
  - name: database_availability
    rules:
      # Database unreachable
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL instance {{ $labels.instance }} is down"

      # Primary database failed
      - alert: PrimaryDatabaseDown
        expr: |
          pg_up{role="primary"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "PRIMARY database {{ $labels.instance }} is DOWN - immediate failover required"

      # All replicas down: fires when every replica in the cluster reports pg_up == 0
      - alert: AllReplicasDown
        expr: |
          sum by (cluster) (pg_up{role="replica"}) == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "All replicas in cluster {{ $labels.cluster }} are down"
```

Performance Alerts:
```yaml
groups:
  - name: database_performance
    rules:
      # High query latency (symptom-based)
      - alert: QueryLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum(rate(pg_query_duration_seconds_bucket[5m])) by (le, instance)
          ) > 2
        for: 5m
        labels:
          severity: warning

      # Buffer cache hit ratio low
      - alert: BufferCacheHitRatioLow
        expr: |
          (pg_stat_database_blks_hit /
           (pg_stat_database_blks_hit + pg_stat_database_blks_read)) * 100 < 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'Buffer cache hit ratio {{ $value | printf "%.1f" }}% on {{ $labels.instance }}'

      # High lock wait time
      - alert: LockWaitTimeHigh
        expr: |
          rate(pg_stat_database_deadlocks[5m]) > 0.1
          or
          pg_locks_count{mode="ExclusiveLock"} > 10
        for: 5m
        labels:
          severity: warning
```

Capacity Alerts:
```yaml
groups:
  - name: database_capacity
    rules:
      # Connection pool exhaustion
      - alert: ConnectionPoolCritical
        expr: |
          pg_stat_activity_count / pg_settings_max_connections > 0.9
        for: 5m
        labels:
          severity: critical

      # Disk space critical
      - alert: DiskSpaceCritical
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"} /
           node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical

      # Disk will fill in 24 hours
      - alert: DiskWillFillIn24Hours
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"}[6h], 24*60*60) < 0
        for: 30m
        labels:
          severity: warning

      # Replication lag critical
      - alert: ReplicationLagCritical
        expr: pg_replication_lag_seconds > 300
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Replication lag > 5 minutes. Failover would lose significant data."
```

Start with conservative thresholds that might generate some false positives. It's easier to loosen a noisy threshold than to explain why a real incident never triggered an alert. Track every alert firing and refine thresholds based on actual data.
Alerting is a system of trust. When on-call engineers trust that alerts represent real problems requiring immediate action, they respond promptly. When that trust erodes through noise and false positives, response times increase and real incidents get missed.
What's Next:
With alerts configured to notify the right people about real problems, the next page addresses Dashboards—visualizing database health for proactive monitoring, incident response, and capacity planning.
You now understand how to design alerting systems that balance sensitivity with specificity—catching real problems while avoiding the noise that erodes trust. Next, we'll build the dashboards that provide visual context for these alerts.