Every alert is a demand for human attention. When an alert fires, it rips someone from deep work, wakes them at 3 AM, or adds to the cognitive load of an already stressed on-call engineer. This is a precious resource that must be spent wisely.
Poorly configured alerting systems create two dangerous failure modes:
Alert Fatigue — Too many alerts, mostly noise. Engineers stop responding to pages, real incidents get missed, and trust in monitoring erodes.
Alert Blindness — Critical conditions pass unnoticed because no alert was configured, thresholds were wrong, or symptoms weren't correlated to root causes.
The goal of alert configuration is surgical precision: wake someone up when and only when a human needs to take action. Everything else belongs in dashboards, logs, or automated remediation—not in pagers.
By the end of this page, you will be able to:
- Understand alerting philosophy and the distinction between alerts and monitoring
- Design effective alert thresholds using symptom-based methods
- Implement multi-tier severity classifications
- Configure alert routing and escalation
- Apply best practices that minimize noise while catching real incidents
Before configuring any alerts, we must establish principles that guide our decisions. Modern Site Reliability Engineering (SRE) practices emphasize a fundamental distinction:
Alerts vs. Monitoring: monitoring collects and visualizes everything you might need when investigating a problem; alerts are the narrow subset of conditions that must interrupt a human immediately. If a condition can safely wait until someone looks at a dashboard, it is monitoring data, not an alert.
The Three Alert Questions:
For every potential alert, ask:
Does this require immediate human action? If it can wait until morning, it's not an alert—it's a ticket or dashboard item.
Is the condition actionable? If the responder can't do anything useful, the alert wastes everyone's time.
Does this indicate user impact? Internal metrics without user symptoms may indicate problems worth investigating, but not paging.
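As a rough illustration, the answers to these three questions can be encoded directly on the alert as routing labels, so that conditions demanding immediate action reach a pager while everything else becomes a ticket. A minimal sketch in Prometheus rule syntax; the `notify` label convention and the `pg_table_bloat_ratio` metric are assumptions for illustration, not part of any standard exporter:

```yaml
groups:
  - name: page_vs_ticket_examples
    rules:
      # Immediate, actionable, user-impacting: this one may wake someone up
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
          notify: page      # assumed label convention consumed by the router
      # Real problem, but it can wait until morning: file a ticket instead
      - alert: TableBloatHigh
        expr: pg_table_bloat_ratio > 0.4   # assumed metric from a custom exporter
        for: 6h
        labels:
          severity: low
          notify: ticket    # assumed label convention: routed to the ticket queue
```

The routing configuration later on this page shows how labels like these can drive notification channels.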
Symptom-Based vs. Cause-Based Alerting:
| Approach | Example | Pros | Cons |
|---|---|---|---|
| Cause-Based | CPU > 90% | Easy to understand, fast detection | High false positive rate; many causes don't affect users |
| Symptom-Based | p99 latency > 2s | Directly measures user impact | May miss silent problems; requires good instrumentation |
| Combined | p99 latency > 2s AND CPU > 90% | Reduces noise while maintaining sensitivity | More complex to configure and maintain |
High CPU isn't a problem if users aren't affected. Disk at 85% isn't urgent if growth rate is slow. Symptom-based alerts (response time, error rate, availability) align alerting with what actually matters: user experience. Use cause-based alerts to diagnose symptoms, not to page.
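A hedged sketch of the combined row from the table above: page on the user-facing symptom, but only when the suspected cause is also present, and keep both visible in the alert. It assumes the `query_duration_seconds_bucket` histogram used elsewhere on this page, the standard node_exporter CPU metric, and that both exporters share the same `instance` label:

```yaml
groups:
  - name: symptom_with_cause
    rules:
      - alert: QueryLatencyHighWithCPUSaturation
        expr: |
          histogram_quantile(0.99,
            sum(rate(query_duration_seconds_bucket[5m])) by (le, instance)
          ) > 2
          and on (instance)
          (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "p99 latency above 2s on {{ $labels.instance }} while CPU exceeds 90%"
```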
Not all problems demand the same response. A well-designed severity system ensures critical issues get immediate attention while less urgent conditions are handled appropriately.
Four-Tier Severity Model:
| Severity | Response Time | Criteria | Notification |
|---|---|---|---|
| P1 — Critical | Immediate (< 5 min) | Production down, data loss risk, security breach | Phone call, SMS, multiple escalation paths |
| P2 — High | < 30 min | Significant degradation, imminent critical if unchecked | Page/SMS during on-call hours |
| P3 — Medium | < 4 hours | Minor degradation, no immediate user impact | Email, Slack, next business day if after hours |
| P4 — Low | < 24 hours | Anomalies, trend warnings, maintenance needed | Dashboard, weekly review, ticket creation |
Database-Specific Severity Examples:
If everything is P1, nothing is P1. Resist pressure to escalate alert severity 'just in case.' Each severity level must have genuinely different response requirements. Regular review of alert firing frequency helps calibrate severity appropriately.
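Each tier should also map to a genuinely different notification path in the router. A minimal Alertmanager routing sketch, under the assumption that alerts carry a `severity` label with values `p1` through `p4` (receiver definitions are omitted; the full configuration format appears later on this page):

```yaml
route:
  receiver: 'ticket-queue'            # P4 default: ticket / weekly review
  routes:
    - match:
        severity: p1
      receiver: 'pagerduty-phone'     # P1: phone/SMS, aggressive re-notification
      repeat_interval: 30m
    - match:
        severity: p2
      receiver: 'pagerduty-oncall'    # P2: page the on-call rotation
    - match:
        severity: p3
      receiver: 'slack-team-channel'  # P3: Slack/email, next business day
```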
Alert thresholds are the numeric boundaries that trigger notifications. Setting them requires balancing sensitivity (catching problems) against specificity (avoiding false positives).
Static vs. Dynamic Thresholds:
| Type | How It Works | When to Use | Limitations |
|---|---|---|---|
| Static Threshold | Fixed value: CPU > 90% | Absolute limits (disk space, connections) | Ignores normal variation; many false positives |
| Baseline Threshold | Deviation from historical average | Metrics with predictable patterns | Requires training period; can normalize bad states |
| Rate of Change | Change velocity: growth > 10%/hour | Detecting sudden spikes or degradation | Noisy for volatile metrics |
| Anomaly Detection | ML-based deviation scoring | Complex patterns; many metrics | Black box; requires tuning; delayed detection |
Threshold Design Patterns:
```yaml
# Pattern 1: Multi-Window Rate Alerting
# Alert only if condition persists across multiple time windows
# Reduces single-spike false positives

groups:
  - name: database_alerts
    rules:
      # Short window for critical issues
      - alert: DatabaseHighLatencyCritical
        expr: |
          histogram_quantile(0.99, rate(query_duration_seconds_bucket[1m])) > 5
        for: 2m  # Must persist for 2 minutes
        labels:
          severity: critical

      # Longer window for warning
      - alert: DatabaseHighLatencyWarning
        expr: |
          histogram_quantile(0.99, rate(query_duration_seconds_bucket[5m])) > 2
        for: 10m  # Must persist for 10 minutes
        labels:
          severity: warning

      # Pattern 2: Burn Rate Alerting (SLO-based)
      # Alert when consuming error budget faster than sustainable
      - alert: DatabaseErrorBudgetBurn
        expr: |
          (
            # Fast burn: 14.4x budget consumption in 1 hour
            (1 - (sum(rate(query_success_total[1h])) / sum(rate(query_total[1h])))) > (14.4 * 0.001)
            and
            (1 - (sum(rate(query_success_total[5m])) / sum(rate(query_total[5m])))) > (14.4 * 0.001)
          )
        for: 2m
        labels:
          severity: critical

      # Pattern 3: Percentage Capacity with Growth Rate
      - alert: DiskSpaceWarning
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 < 20)
          or
          (predict_linear(node_filesystem_avail_bytes[6h], 24*60*60) < 0)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low or depleting within 24 hours"
```

Key Threshold Design Principles:
Service Level Objectives (SLOs) provide principled threshold guidance. If your SLO is 99.9% query success rate (43 minutes of downtime allowed per month), you can calculate exactly when error rates threaten the SLO and alert accordingly. This replaces guesswork with math.
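To make that math concrete for the 99.9% example, here is a sketch of the arithmetic behind the burn-rate rule in Pattern 2 above (the 14.4 multiplier and the two-day exhaustion figure follow directly from the numbers; the multi-window convention itself comes from common SRE practice):

```yaml
# Error budget for a 99.9% SLO: 1 - 0.999 = 0.001 (0.1% of queries may fail)
# Allowed downtime per 30-day month: 0.001 * 30 * 24 * 60 ≈ 43 minutes
#
# Burn rate = observed error ratio / error budget
# A sustained burn rate of 14.4 exhausts the monthly budget in 30 / 14.4 ≈ 2.1 days,
# or equivalently burns 14.4 / 720 ≈ 2% of the budget in a single hour.
# That is why Pattern 2 pages when the 1-hour error ratio exceeds 14.4 * 0.001.
```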
Getting alerts to the right person at the right time is as important as detecting issues. Modern alerting systems provide sophisticated routing capabilities.
Alertmanager Configuration Example:
```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.company.com:587'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Route tree: defines how alerts are routed
route:
  # Default receiver
  receiver: 'slack-notifications'

  # Group alerts by these labels
  group_by: ['alertname', 'severity', 'database']

  # Wait before sending first notification (batching)
  group_wait: 30s

  # Wait before sending subsequent notification
  group_interval: 5m

  # Wait before re-notifying same alert
  repeat_interval: 4h

  # Child routes (evaluated in order, first match wins)
  routes:
    # Critical database alerts → PagerDuty immediately
    - match:
        severity: critical
        team: database
      receiver: 'pagerduty-database'
      group_wait: 10s
      repeat_interval: 30m
      continue: false

    # High severity → Slack + ticket
    - match:
        severity: high
      receiver: 'slack-high-priority'
      continue: true

    # Database team alerts → Database Slack channel
    - match_re:
        team: database|dba
      receiver: 'slack-database'

# Receiver definitions
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts-general'
        send_resolved: true

  - name: 'pagerduty-database'
    pagerduty_configs:
      - service_key: '<DATABASE_PAGERDUTY_KEY>'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          database: '{{ .CommonLabels.database }}'

  - name: 'slack-database'
    slack_configs:
      - channel: '#dba-alerts'
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'

# Inhibition rules: suppress alerts when related alerts are firing
inhibit_rules:
  # Don't alert on replica issues if primary is down
  - source_match:
      alertname: 'PrimaryDatabaseDown'
    target_match:
      alertname: 'ReplicationLag'
    equal: ['database_cluster']

  # Don't page warning if critical already firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'database']
```

Escalation Strategies:
When the primary database is down, you don't need separate alerts for replication lag, high connection count, and query failures. Inhibition rules automatically suppress dependent alerts when root-cause alerts are firing, reducing noise during incidents.
An alert that simply says 'Database problem' at 3 AM is useless. Effective alerts include enough context for immediate action.
Essential Alert Components:
| Component | Purpose | Example |
|---|---|---|
| Alert Name | Unique, descriptive identifier | PostgresDiskSpaceCritical |
| Severity | Urgency classification | P1 / Critical |
| Summary | One-line description of the problem | Disk usage at 95% on prod-db-1 |
| Description | Detailed explanation with values | Primary database disk at 95.3% usage. 12.4GB remaining. Growth rate: 500MB/day. Estimated exhaustion: 24 hours. |
| Affected Service | What user-facing systems are impacted | Order processing, checkout |
| Runbook Link | Direct link to remediation steps | https://wiki/runbooks/disk-space |
| Dashboard Link | Link to relevant visualization | https://grafana/d/postgres-overview |
| Labels | Machine-parseable metadata | team=database, environment=prod, database=orders |
Alert Annotation Templates:
```yaml
groups:
  - name: database_alerts
    rules:
      - alert: PostgreSQLReplicationLag
        expr: pg_replication_lag_seconds > 60
        for: 5m
        labels:
          severity: high
          team: database
        annotations:
          summary: "Replication lag high on {{ $labels.instance }}"
          description: |
            PostgreSQL replication lag is {{ $value | printf "%.0f" }} seconds on {{ $labels.instance }}.

            **Impact:** Read replicas serving stale data. In case of failover,
            up to {{ $value | printf "%.0f" }}s of transactions could be lost.

            **Threshold:** > 60 seconds sustained for 5 minutes
            **Current Value:** {{ $value | printf "%.1f" }} seconds

            **Possible Causes:**
            - High write load on primary
            - Network latency between primary and replica
            - Replica CPU/IO saturation
            - Long-running transactions blocking replay
          runbook_url: "https://wiki.company.com/runbooks/postgres-replication-lag"
          dashboard_url: "https://grafana.company.com/d/postgres-replication?var-instance={{ $labels.instance }}"

      - alert: DatabaseConnectionPoolExhausted
        expr: pg_stat_activity_count / pg_settings_max_connections > 0.9
        for: 5m
        labels:
          severity: critical
          team: database
        annotations:
          summary: "Connection pool > 90% on {{ $labels.instance }}"
          description: |
            Database {{ $labels.instance }} has {{ $value | printf "%.0f" }}% of max_connections in use.

            **Immediate Risk:** New connections will be refused. Application errors imminent.

            **Current Connections:** {{ with printf "pg_stat_activity_count{instance='%s'}" $labels.instance | query }}{{ . | first | value }}{{ end }}
            **Max Connections:** {{ with printf "pg_settings_max_connections{instance='%s'}" $labels.instance | query }}{{ . | first | value }}{{ end }}

            **Immediate Actions:**
            1. Check for idle-in-transaction sessions: https://grafana/d/sessions?var-instance={{ $labels.instance }}
            2. Check application connection pool configuration
            3. Consider increasing max_connections (requires restart)
          runbook_url: "https://wiki.company.com/runbooks/connection-exhaustion"
```

At 3 AM, an on-call engineer shouldn't need to remember remediation steps. Every alert should link to a runbook with specific troubleshooting and remediation procedures. No runbook? The alert isn't ready for production.
Alerts that never fire in production might not work when you need them. Alert testing is as important as code testing.
Alert Testing Strategies:
Promtool for Alert Rule Testing:
```yaml
# tests/alert_rules_test.yml
# Run with: promtool test rules tests/alert_rules_test.yml

rule_files:
  - ../prometheus/rules/database_alerts.yml

tests:
  # Test 1: Verify disk space alert fires at correct threshold
  - interval: 1m
    input_series:
      - series: 'node_filesystem_avail_bytes{instance="db-1", mountpoint="/"}'
        values: '100000000000 50000000000 10000000000 5000000000 4000000000'
      - series: 'node_filesystem_size_bytes{instance="db-1", mountpoint="/"}'
        values: '100000000000 100000000000 100000000000 100000000000 100000000000'
    alert_rule_test:
      - eval_time: 5m
        alertname: DiskSpaceCritical
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: db-1
            exp_annotations:
              summary: "Disk space critical on db-1"

  # Test 2: Verify replication lag alert with proper threshold
  - interval: 1m
    input_series:
      - series: 'pg_replication_lag_seconds{instance="replica-1"}'
        values: '70 80 90 100 110 120 130 140 150'
    alert_rule_test:
      - eval_time: 8m  # After 'for' duration
        alertname: PostgreSQLReplicationLag
        exp_alerts:
          - exp_labels:
              severity: high
              team: database
              instance: replica-1

  # Test 3: Verify alert does NOT fire for normal conditions
  - interval: 1m
    input_series:
      - series: 'pg_stat_activity_count{instance="db-1"}'
        values: '50 55 60 55 50'
      - series: 'pg_settings_max_connections{instance="db-1"}'
        values: '200 200 200 200 200'
    alert_rule_test:
      - eval_time: 5m
        alertname: DatabaseConnectionPoolExhausted
        exp_alerts: []  # Expect no alerts
```

Alert Review Process:
Regular review ensures alerts remain effective:
| Review Type | Frequency | What to Check |
|---|---|---|
| Weekly Triage | Weekly | Which alerts fired? How many were actionable? Any false positives? |
| Monthly Analysis | Monthly | Alert frequency trends, on-call burden, threshold adjustments needed |
| Quarterly Deep Dive | Quarterly | Are we catching real incidents? Missing any? Coverage gaps? |
| Post-Incident Review | After each incident | Did alerts fire appropriately? Fast enough? Good context? |
Track 'alert toil': time spent responding to alerts that don't result in meaningful action. If more than 50% of on-call time goes to noise, prioritize alert quality improvements. Google's SRE book treats 50% as the upper bound on the toil any engineer should carry.
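Prometheus exposes firing alerts as the synthetic `ALERTS` series, so the weekly triage can start from data instead of memory. A hedged sketch of two review queries (they approximate time spent firing by counting evaluation-interval samples; adjust the range to your review cadence):

```promql
# Time each alert spent firing over the past week, by alert name and severity
sum by (alertname, severity) (count_over_time(ALERTS{alertstate="firing"}[7d]))

# The ten noisiest alerts by the same measure: prime candidates for threshold review
topk(10, sum by (alertname) (count_over_time(ALERTS{alertstate="firing"}[7d])))
```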
Let's consolidate the essential database alerts every monitoring system should include.
Availability Alerts:
```yaml
groups:
  - name: database_availability
    rules:
      # Database unreachable
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL instance {{ $labels.instance }} is down"

      # Primary database failed
      - alert: PrimaryDatabaseDown
        expr: |
          pg_up{role="primary"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "PRIMARY database {{ $labels.instance }} is DOWN - immediate failover required"

      # All replicas down: fires when every replica in the cluster reports pg_up == 0
      - alert: AllReplicasDown
        expr: |
          sum by (cluster) (pg_up{role="replica"}) == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "All replicas in cluster {{ $labels.cluster }} are down"
```

Performance Alerts:
```yaml
groups:
  - name: database_performance
    rules:
      # High query latency (symptom-based)
      - alert: QueryLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum(rate(pg_query_duration_seconds_bucket[5m])) by (le, instance)
          ) > 2
        for: 5m
        labels:
          severity: warning

      # Buffer cache hit ratio low
      - alert: BufferCacheHitRatioLow
        expr: |
          (pg_stat_database_blks_hit /
           (pg_stat_database_blks_hit + pg_stat_database_blks_read)) * 100 < 90
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'Buffer cache hit ratio {{ $value | printf "%.1f" }}% on {{ $labels.instance }}'

      # High lock wait time
      - alert: LockWaitTimeHigh
        expr: |
          rate(pg_stat_database_deadlocks[5m]) > 0.1
          or
          pg_locks_count{mode="ExclusiveLock"} > 10
        for: 5m
        labels:
          severity: warning
```

Capacity Alerts:
```yaml
groups:
  - name: database_capacity
    rules:
      # Connection pool exhaustion
      - alert: ConnectionPoolCritical
        expr: |
          pg_stat_activity_count / pg_settings_max_connections > 0.9
        for: 5m
        labels:
          severity: critical

      # Disk space critical
      - alert: DiskSpaceCritical
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"} /
           node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical

      # Disk will fill in 24 hours
      - alert: DiskWillFillIn24Hours
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/var/lib/postgresql"}[6h], 24*60*60) < 0
        for: 30m
        labels:
          severity: warning

      # Replication lag critical
      - alert: ReplicationLagCritical
        expr: pg_replication_lag_seconds > 300
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Replication lag > 5 minutes. Failover would lose significant data."
```

Start with conservative thresholds that might generate some false positives. It's easier to loosen a noisy threshold than to explain why a real incident never triggered an alert. Track every alert firing and refine thresholds based on actual data.
Alerting is a system of trust. When on-call engineers trust that alerts represent real problems requiring immediate action, they respond promptly. When that trust erodes through noise and false positives, response times increase and real incidents get missed.
What's Next:
With alerts configured to notify the right people about real problems, the next page addresses Dashboards—visualizing database health for proactive monitoring, incident response, and capacity planning.
You now understand how to design alerting systems that balance sensitivity with specificity—catching real problems while avoiding the noise that erodes trust. Next, we'll build the dashboards that provide visual context for these alerts.