At 3:17 AM, an engineer's phone buzzes. Another alert. They glance at it through half-open eyes: 'High CPU usage on web-server-42'. Is this critical? Is the system actually degrading? Or is it just a routine spike that will resolve itself? After years of false alarms, they've learned to ignore most alerts. They roll over and go back to sleep.

Two hours later, the real incident strikes. A cascading failure brings down the entire service. The alert that could have caught it early was buried among dozens of meaningless notifications that the on-call engineer had learned to tune out.

This scenario plays out in organizations worldwide, revealing a fundamental truth about alerting: the hardest problem isn't detecting issues—it's deciding what deserves human attention.
By the end of this page, you will understand the philosophy and practical principles that guide effective alerting decisions. You'll learn to distinguish between symptoms and causes, user-facing impact and internal metrics, and ultimately develop a framework for deciding what should wake someone up at 3 AM versus what should be reviewed during business hours.
Before we discuss what to alert on, we must understand why we alert at all. Alerts exist for one fundamental purpose: to prompt human action when automated systems cannot resolve an issue themselves.

This seemingly simple statement carries profound implications that many organizations overlook.
Every alert should be actionable. If an alert fires and the correct response is 'wait and see' or 'nothing I can do', that alert should not exist. Alerts that don't require action erode trust in the alerting system itself, leading engineers to ignore critical notifications when they do arrive.
The Three Questions Every Alert Must Answer

Before creating any alert, ask these fundamental questions:

1. What specific action should the responder take? If you cannot clearly articulate the response, the alert is premature.

2. Why does this require human intervention? If automation could handle it, implement automation instead of alerting.

3. Will delaying response by minutes (or hours) cause material harm? If not, this belongs in a daily review dashboard, not an alert.

The Cost of Every Alert

Alerts are not free. Each one carries costs, from interrupted work and broken sleep to the gradual erosion of trust, that must be weighed against its benefits.
The teams with the most alerts are often the ones with the poorest incident response. They've created so much noise that the real signal drowns. Conversely, highly reliable organizations are notable for how few alerts they have—each one carefully designed to indicate genuine problems requiring human intervention.
One of the most fundamental principles in effective alerting is the distinction between symptoms and causes. Understanding this distinction determines whether your alerts detect real problems or merely chase shadows.
Why This Distinction Matters

Cause-based alerts suffer from a fundamental problem: they may indicate issues that never manifest as user impact. A server at 90% CPU might be perfectly healthy under load, or it might be about to fall over. The CPU metric alone cannot tell you which.

Symptom-based alerts, by contrast, directly measure what users experience. If error rates are normal and latency is acceptable, the system is healthy—regardless of what internal metrics suggest.

The Monitoring Hierarchy

Think of monitoring as a diagnostic hierarchy:

```
┌──────────────────────────────────────────────────────────┐
│  USER EXPERIENCE (Alert Here)                            │
│  Error rates, latency, availability, throughput          │
├──────────────────────────────────────────────────────────┤
│  APPLICATION METRICS                                     │
│  Request queues, connection pools, thread states         │
├──────────────────────────────────────────────────────────┤
│  INFRASTRUCTURE METRICS                                  │
│  CPU, memory, disk, network, container health            │
├──────────────────────────────────────────────────────────┤
│  HARDWARE METRICS                                        │
│  Temperature, fan speed, disk SMART data                 │
└──────────────────────────────────────────────────────────┘
```

Alerts should primarily trigger at the top of this hierarchy. Lower-level metrics are invaluable for investigation after an alert fires, but they rarely should trigger alerts themselves.
There are legitimate cases for cause-based alerts: when the cause will inevitably lead to user impact if not addressed. Disk at 99% will cause writes to fail. Memory exhaustion will trigger OOM kills. These predictable failures warrant proactive alerts—but they should be the exception, not the rule.
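To make the hierarchy concrete, here is a minimal Python sketch (the names `WindowStats`, `symptom_alert`, and the 1% / 500 ms thresholds are illustrative assumptions, not taken from any monitoring tool): only user-facing symptoms decide whether an alert fires, while infrastructure metrics travel along purely as diagnostic context for the responder.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WindowStats:
    total_requests: int
    failed_requests: int
    p99_latency_ms: float

def symptom_alert(stats: WindowStats,
                  max_error_rate: float = 0.01,
                  max_p99_ms: float = 500.0) -> bool:
    """Fire only on what users experience: errors and latency."""
    if stats.total_requests == 0:
        return False  # no traffic means no users are being hurt right now
    error_rate = stats.failed_requests / stats.total_requests
    return error_rate > max_error_rate or stats.p99_latency_ms > max_p99_ms

def build_alert(stats: WindowStats, infra_context: dict) -> Optional[dict]:
    """Attach CPU/memory/etc. as context for the responder, never as triggers."""
    if not symptom_alert(stats):
        return None
    return {
        "summary": "User-facing degradation detected",
        "error_rate": stats.failed_requests / stats.total_requests,
        "p99_latency_ms": stats.p99_latency_ms,
        "diagnostic_context": infra_context,  # e.g. {"cpu": 0.92, "memory": 0.71}
    }

if __name__ == "__main__":
    stats = WindowStats(total_requests=10_000, failed_requests=250, p99_latency_ms=430)
    print(build_alert(stats, infra_context={"cpu": 0.92, "memory": 0.71}))
    # Fires on the 2.5% error rate; the CPU number is along for the ride only.
```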
The most effective alerting strategy centers on a single question: Is the user experiencing degraded service?

This user-centric approach filters out countless internal metrics that may fluctuate without affecting anyone. It focuses engineering attention on what actually matters: delivering value to customers.
Defining 'User'

The term 'user' is deliberately broad. It encompasses:

- External customers using your product
- Internal teams consuming your platform or APIs
- Automated systems that depend on your services
- Business processes that require your infrastructure

Each of these user categories has different tolerance thresholds and expectations, which should inform your alerting decisions.
| Impact Category | Description | Alert Priority | Response Time |
|---|---|---|---|
| Complete Outage | Core functionality entirely unavailable | Critical (Page immediately) | Minutes |
| Severe Degradation | Major features broken, significant latency | High (Page during day, escalate at night) | < 30 minutes |
| Partial Degradation | Some features affected, workarounds exist | Medium (Ticket + on-call awareness) | < 4 hours |
| Minor Issues | Edge cases affected, most users unimpacted | Low (Ticket for next business day) | < 24 hours |
| Cosmetic/Informational | No functional impact, aesthetic issues | Informational (Log only) | As capacity allows |
Measuring User Impact

To alert on user impact, you must measure user impact. This requires instrumentation at every layer where users interact with your system:

Request-Level Metrics
- Success rate (percentage of requests returning expected results)
- Error rate by type (client errors vs. server errors)
- Latency distributions (p50, p90, p99)
- Throughput (requests per second)

Session-Level Metrics
- Login success rate
- Session duration anomalies
- User flow abandonment rates
- Feature adoption metrics

Business-Level Metrics
- Transaction completion rate
- Revenue per minute/hour
- Order placement success
- Critical business process completion
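As a concrete illustration of the request-level metrics above, here is a small Python sketch that derives success rate, error breakdown, and latency percentiles from raw request records. The sample data and the nearest-rank percentile helper are illustrative assumptions, not a production implementation.

```python
# Illustrative request records: (HTTP status, latency in milliseconds)
requests = [
    (200, 42), (200, 55), (500, 120), (200, 38), (404, 25),
    (200, 61), (200, 47), (503, 900), (200, 52), (200, 44),
]

def percentile(sorted_vals, p):
    """Approximate nearest-rank percentile; good enough for an illustration."""
    idx = round(p / 100 * (len(sorted_vals) - 1))
    return sorted_vals[idx]

total = len(requests)
server_errors = sum(1 for status, _ in requests if status >= 500)
client_errors = sum(1 for status, _ in requests if 400 <= status < 500)
success_rate = (total - server_errors) / total  # client errors usually aren't the service's fault

latencies = sorted(latency for _, latency in requests)
p50, p90, p99 = (percentile(latencies, p) for p in (50, 90, 99))

print(f"success rate: {success_rate:.1%}")
print(f"server errors: {server_errors / total:.1%}, client errors: {client_errors / total:.1%}")
print(f"p50={p50}ms  p90={p90}ms  p99={p99}ms")
```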
Imagine your executive dashboard showing red/yellow/green for system health. An alert should indicate something is red or turning red. If your system would show green despite the metric you're alerting on, you shouldn't be alerting on it.
The most sophisticated approach to 'what to alert on' comes from Service Level Objective (SLO) based alerting. Rather than alerting on arbitrary thresholds, you alert when the system is consuming its error budget at an unsustainable rate.

This approach, pioneered by Google's Site Reliability Engineering practices, fundamentally changes how teams think about alerts.
Understanding Error Budgets

An error budget is the inverse of your SLO. If you promise 99.9% availability, you have a 0.1% error budget. For a 30-day month, this translates to approximately 43 minutes of acceptable downtime.

Why This Matters for Alerting

Traditional alerting asks: 'Is something wrong right now?'

SLO-based alerting asks: 'Are we on track to meet our reliability commitments?'

This shift is profound. A brief spike in error rate that quickly resolves may trigger traditional alerts but wouldn't concern SLO-based alerting—because it barely dented the error budget. Conversely, a slowly increasing error rate that wouldn't trigger percentage-based alerts might be rapidly burning through error budget, warranting immediate attention.
```
Error Budget Calculation:
───────────────────────────────────────────────────────────
SLO = 99.9% availability over 30 days

Error Budget = (100% - 99.9%) × 30 days × 24 hours × 60 minutes
             = 0.1% × 43,200 minutes
             = 43.2 minutes of total allowed downtime

Burn Rate Analysis:
───────────────────────────────────────────────────────────
Current Error Rate: 0.5% (5x the 0.1% budget rate)
Burn Rate: 5x (consuming error budget five times faster than it accrues)

At this rate:
- One hour's share of the budget (0.06 minutes) consumed in: 12 minutes
- Complete monthly budget exhausted in: ~8,640 minutes (about 6 days)

This SHOULD trigger an alert!

vs.

Current Error Rate: 0.15% (1.5x the budget rate)
Burn Rate: 1.5x (consuming error budget 1.5 times faster than it accrues)

At this rate:
- Daily share of the budget (1.44 minutes) consumed in: ~16 hours
- Complete monthly budget exhausted in: ~20 days

This might be acceptable if it's a brief spike.
```

Multiple Burn Rate Windows

Effective SLO-based alerting uses multiple time windows to detect both acute issues and slow burns:

| Window | Burn Rate | Meaning | Response |
|--------|-----------|---------|----------|
| 5 min | 14.4x | Severe incident in progress | Page immediately |
| 1 hour | 6x | Significant issue developing | Page with urgency |
| 6 hours | 1.5x | Slow burn consuming budget | Ticket for prompt investigation |
| 3 days | 1x | Gradual degradation | Weekly review item |

The combination of fast and slow windows catches both sudden failures and gradual degradation that might otherwise slip detection.
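The same arithmetic is easy to encode. Below is a minimal Python sketch, assuming a 99.9% / 30-day SLO and the burn-rate thresholds from the table above; the function and variable names are illustrative rather than taken from any monitoring product.

```python
SLO_TARGET = 0.999                        # 99.9% availability
BUDGET_RATE = 1 - SLO_TARGET              # 0.1% of requests may fail
PERIOD_MINUTES = 30 * 24 * 60             # 43,200 minutes in the 30-day window
TOTAL_BUDGET_MINUTES = BUDGET_RATE * PERIOD_MINUTES  # ≈ 43.2 minutes

# (window label, burn-rate threshold, action), mirroring the table above
BURN_RATE_POLICY = [
    ("5m",  14.4, "page immediately"),
    ("1h",   6.0, "page with urgency"),
    ("6h",   1.5, "ticket for prompt investigation"),
    ("3d",   1.0, "weekly review item"),
]

def burn_rate(error_rate: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / BUDGET_RATE

def minutes_until_budget_exhausted(error_rate: float) -> float:
    """Wall-clock minutes until the full 30-day budget is gone at this rate."""
    return PERIOD_MINUTES / burn_rate(error_rate)

def evaluate(error_rates_by_window: dict) -> list:
    """Return the actions whose burn-rate thresholds are currently exceeded."""
    triggered = []
    for window, threshold, action in BURN_RATE_POLICY:
        rate = error_rates_by_window.get(window)
        if rate is not None and burn_rate(rate) >= threshold:
            triggered.append((window, action))
    return triggered

if __name__ == "__main__":
    print(burn_rate(0.005))                       # ≈ 5x, as in the example above
    print(minutes_until_budget_exhausted(0.005))  # ≈ 8,640 minutes (about 6 days)
    # The 6h and 3d tiers trigger here: a ticket plus a weekly review item
    print(evaluate({"5m": 0.005, "1h": 0.004, "6h": 0.002, "3d": 0.0012}))
```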
SLO-based alerting naturally calibrates sensitivity to business impact. During low-traffic periods, a higher error percentage is acceptable because fewer users are affected. During peak traffic, even small percentage increases matter more. The error budget approach handles this automatically.
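One common way to see this self-calibration is to count the budget in failed requests rather than minutes, since availability SLOs are often measured over requests. A brief sketch with hypothetical traffic numbers shows why the same error percentage matters far more at peak than overnight:

```python
# Hypothetical volumes: the budget is a fixed number of failed requests per month,
# so an hour's budget consumption scales with that hour's traffic.
MONTHLY_REQUESTS = 1_000_000_000          # assumed monthly volume
BUDGET_FRACTION = 0.001                   # 99.9% SLO: 0.1% of requests may fail
MONTHLY_BUDGET = MONTHLY_REQUESTS * BUDGET_FRACTION  # 1,000,000 allowed failures

def budget_consumed(requests_in_hour: int, error_rate: float) -> float:
    """Fraction of the monthly budget burned by one hour at this error rate."""
    return (requests_in_hour * error_rate) / MONTHLY_BUDGET

# 1% errors overnight (6,000 req/hour) vs. 1% errors at peak (6,000,000 req/hour)
print(f"{budget_consumed(6_000, 0.01):.4%} of monthly budget")      # 0.0060%
print(f"{budget_consumed(6_000_000, 0.01):.4%} of monthly budget")  # 6.0000%
```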
Not all alerts are created equal. Understanding the different categories helps you design appropriate responses for each type.
What Deserves a Page?

The highest bar in alerting is the page—an alert that interrupts someone regardless of time of day. Pages should be reserved for situations where:

1. Users are currently impacted or will be within minutes
2. Human intervention is required to resolve the issue
3. Delay will cause escalating harm (more users affected, data loss, revenue impact)
4. The required action is known and can be performed by on-call

If any of these conditions isn't met, the alert shouldn't page.
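These four criteria can double as a pre-page checklist. Here is a minimal sketch (the field names are hypothetical): if any answer is 'no', the alert should open a ticket instead of paging. The table that follows shows how the same reasoning plays out for common scenarios.

```python
from dataclasses import dataclass

@dataclass
class PageChecklist:
    users_impacted_now_or_imminently: bool   # criterion 1
    requires_human_intervention: bool        # criterion 2
    delay_causes_escalating_harm: bool       # criterion 3
    oncall_can_act_with_known_steps: bool    # criterion 4

def should_page(checklist: PageChecklist) -> bool:
    """Page only when every criterion holds; otherwise file a ticket."""
    return all(vars(checklist).values())

# Example: disk at 85% and growing 2%/day fails criteria 1 and 3,
# so it becomes a ticket rather than a 3 AM page.
print(should_page(PageChecklist(False, True, False, True)))  # False
```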
| Scenario | Page? | Ticket? | Rationale |
|---|---|---|---|
| API returning 500s to 20% of requests | Yes | — | Active user impact, requires immediate investigation |
| Disk at 85% capacity, growing 2%/day | No | Yes | Not urgent, can be addressed in business hours |
| Single pod CrashLoopBackOff, others healthy | No | Yes | Redundancy handling, no user impact |
| SSL certificate expires in 3 days | No | Yes (High) | Predictable, addressable in work hours |
| Database primary failover occurred | Depends | Yes | Page if failover was unexpected; ticket if planned |
| Memory trending up, no OOM yet | No | Yes | Investigation needed, but not urgent |
| All replicas lost for critical service | Yes | — | Complete outage, immediate action required |
Resist the temptation to create page-level alerts 'just in case'. Every unnecessary page contributes to alert fatigue. If you're unsure whether something needs a page, start with a ticket. You can always escalate based on real-world experience.
Learning what not to alert on is as important as learning what to alert on. Alerting anti-patterns, from non-actionable notifications to cause-based alerts that never correlate with user impact, are commonly found in organizations struggling with alert fatigue.
Schedule quarterly alert reviews. For each alert, ask: 'When did this last fire? What action was taken? Did the action match the expected runbook?' Alerts that haven't fired in 6 months may be candidates for removal or re-tuning. Alerts that fired without useful action should be deleted or redesigned.
With principles established, let's construct a practical framework for deciding what to alert on in your systems.
Step 1: Define Your SLOs

Start with what matters to users and the business. Typical SLO categories include:

- Availability: What percentage of requests should succeed?
- Latency: What response time should users experience?
- Throughput: What transaction volume must the system support?
- Correctness: What data accuracy is required?

Step 2: Instrument User-Facing Metrics

Ensure you're measuring what users experience, not just what servers report. Key metrics:

- Requests handled vs. requests failed (as users see it)
- Time from request receipt to response delivery
- End-to-end transaction success rates
- User-reported errors and complaints

Step 3: Calculate Error Budgets

For each SLO, determine the acceptable error budget and how it's consumed over time. This becomes your alerting baseline.

Step 4: Define Alert Conditions Based on Burn Rate

Create alerts that trigger when error budget consumption threatens your SLO. Use multiple time windows for different severities.

Step 5: Establish Clear Response Expectations

For each alert, document:

- What does this alert mean?
- What should the responder check first?
- What are the likely causes?
- What remediation steps are available?
- When should the responder escalate?
```yaml
# Example: E-commerce Checkout Service Alert Strategy

slo:
  name: "checkout-availability"
  target: 99.9%   # 43 minutes/month error budget
  window: 30d

alerts:
  # Critical: Immediate page required
  - name: checkout_critical_burn
    description: "Checkout SLO burning at critical rate"
    condition: |
      (1 - checkout_success_rate[5m]) >= 0.01  # 10x burn rate
    severity: critical
    action: page
    runbook: "runbook.example.com/checkout-critical"

  # High: Page during business hours, ticket at night
  - name: checkout_high_burn
    description: "Checkout SLO burning at elevated rate"
    condition: |
      (1 - checkout_success_rate[1h]) >= 0.006  # 6x burn rate
    severity: high
    action: page_business_hours
    runbook: "runbook.example.com/checkout-elevated"

  # Medium: Create ticket for investigation
  - name: checkout_slow_burn
    description: "Checkout error budget slowly depleting"
    condition: |
      (1 - checkout_success_rate[6h]) >= 0.002  # 2x burn rate
    severity: medium
    action: ticket
    runbook: "runbook.example.com/checkout-investigation"

  # Proactive: Warning before problems manifest
  - name: checkout_dependency_unhealthy
    description: "Critical checkout dependency showing degradation"
    condition: |
      payment_gateway_health < 0.95 OR inventory_service_health < 0.95
    severity: medium
    action: ticket
    runbook: "runbook.example.com/checkout-dependencies"
```

We've covered the foundational principles for designing effective alerts. Let's consolidate the key insights and look at what comes next.
What's Next:

Now that we understand what to alert on, we need to understand how to determine when an alert should fire. The next page explores alert thresholds—the art and science of setting boundaries that catch real problems while avoiding false positives.
You now understand the philosophy and principles that guide effective alerting decisions. The core insight is simple but powerful: alerts exist to prompt human action when the system cannot heal itself. Everything else—symptoms vs. causes, SLO burn rates, urgency categories—flows from this fundamental truth.