At 3:17 AM, an engineer's phone buzzes. Another alert. They glance at it through half-open eyes: 'High CPU usage on web-server-42'. Is this critical? Is the system actually degrading? Or is it just a routine spike that will resolve itself? After years of false alarms, they've learned to ignore most alerts. They roll over and go back to sleep.

Two hours later, the real incident strikes. A cascading failure brings down the entire service. The alert that could have caught it early was buried among dozens of meaningless notifications that the on-call engineer had learned to tune out.

This scenario plays out in organizations worldwide, revealing a fundamental truth about alerting: the hardest problem isn't detecting issues—it's deciding what deserves human attention.
By the end of this page, you will understand the philosophy and practical principles that guide effective alerting decisions. You'll learn to distinguish between symptoms and causes, user-facing impact and internal metrics, and ultimately develop a framework for deciding what should wake someone up at 3 AM versus what should be reviewed during business hours.
Before we discuss what to alert on, we must understand why we alert at all. Alerts exist for one fundamental purpose: to prompt human action when automated systems cannot resolve an issue themselves.

This seemingly simple statement carries profound implications that many organizations overlook.
Every alert should be actionable. If an alert fires and the correct response is 'wait and see' or 'nothing I can do', that alert should not exist. Alerts that don't require action erode trust in the alerting system itself, leading engineers to ignore critical notifications when they do arrive.
The Three Questions Every Alert Must Answer

Before creating any alert, ask these fundamental questions:

1. What specific action should the responder take? If you cannot clearly articulate the response, the alert is premature.

2. Why does this require human intervention? If automation could handle it, implement automation instead of alerting.

3. Will delaying response by minutes (or hours) cause material harm? If not, this belongs in a daily review dashboard, not an alert.

The Cost of Every Alert

Alerts are not free. Each one carries costs, from interrupted work and broken sleep to the gradual erosion of trust, that must be weighed against its benefits.
The teams with the most alerts are often the ones with the poorest incident response. They've created so much noise that the real signal drowns. Conversely, highly reliable organizations are notable for how few alerts they have—each one carefully designed to indicate genuine problems requiring human intervention.
One of the most fundamental principles in effective alerting is the distinction between symptoms and causes. Understanding this distinction determines whether your alerts detect real problems or merely chase shadows.
Why This Distinction Matters

Cause-based alerts suffer from a fundamental problem: they may indicate issues that never manifest as user impact. A server at 90% CPU might be perfectly healthy under load, or it might be about to fall over. The CPU metric alone cannot tell you which.

Symptom-based alerts, by contrast, directly measure what users experience. If error rates are normal and latency is acceptable, the system is healthy—regardless of what internal metrics suggest.

The Monitoring Hierarchy

Think of monitoring as a diagnostic hierarchy:

```
┌──────────────────────────────────────────────────────────┐
│  USER EXPERIENCE (Alert Here)                            │
│  Error rates, latency, availability, throughput          │
├──────────────────────────────────────────────────────────┤
│  APPLICATION METRICS                                     │
│  Request queues, connection pools, thread states         │
├──────────────────────────────────────────────────────────┤
│  INFRASTRUCTURE METRICS                                  │
│  CPU, memory, disk, network, container health            │
├──────────────────────────────────────────────────────────┤
│  HARDWARE METRICS                                        │
│  Temperature, fan speed, disk SMART data                 │
└──────────────────────────────────────────────────────────┘
```

Alerts should primarily trigger at the top of this hierarchy. Lower-level metrics are invaluable for investigation after an alert fires, but they rarely should trigger alerts themselves.
There are legitimate cases for cause-based alerts: when the cause will inevitably lead to user impact if not addressed. Disk at 99% will cause writes to fail. Memory exhaustion will trigger OOM kills. These predictable failures warrant proactive alerts—but they should be the exception, not the rule.
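To make the hierarchy concrete, here is a minimal Python sketch (the names `WindowStats`, `symptom_alert`, and the 1% / 500 ms thresholds are illustrative assumptions, not taken from any monitoring tool): only user-facing symptoms decide whether an alert fires, while infrastructure metrics travel along purely as diagnostic context for the responder.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class WindowStats:
    total_requests: int
    failed_requests: int
    p99_latency_ms: float

def symptom_alert(stats: WindowStats,
                  max_error_rate: float = 0.01,
                  max_p99_ms: float = 500.0) -> bool:
    """Fire only on what users experience: errors and latency."""
    if stats.total_requests == 0:
        return False  # no traffic means no users are being hurt right now
    error_rate = stats.failed_requests / stats.total_requests
    return error_rate > max_error_rate or stats.p99_latency_ms > max_p99_ms

def build_alert(stats: WindowStats, infra_context: dict) -> Optional[dict]:
    """Attach CPU/memory/etc. as context for the responder, never as triggers."""
    if not symptom_alert(stats):
        return None
    return {
        "summary": "User-facing degradation detected",
        "error_rate": stats.failed_requests / stats.total_requests,
        "p99_latency_ms": stats.p99_latency_ms,
        "diagnostic_context": infra_context,  # e.g. {"cpu": 0.92, "memory": 0.71}
    }

if __name__ == "__main__":
    stats = WindowStats(total_requests=10_000, failed_requests=250, p99_latency_ms=430)
    print(build_alert(stats, infra_context={"cpu": 0.92, "memory": 0.71}))
    # Fires on the 2.5% error rate; the CPU number is along for the ride only.
```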
The most effective alerting strategy centers on a single question: Is the user experiencing degraded service?

This user-centric approach filters out countless internal metrics that may fluctuate without affecting anyone. It focuses engineering attention on what actually matters: delivering value to customers.
Defining 'User'

The term 'user' is deliberately broad. It encompasses:

- External customers using your product
- Internal teams consuming your platform or APIs
- Automated systems that depend on your services
- Business processes that require your infrastructure

Each of these user categories has different tolerance thresholds and expectations, which should inform your alerting decisions.
| Impact Category | Description | Alert Priority | Response Time |
|---|---|---|---|
| Complete Outage | Core functionality entirely unavailable | Critical (Page immediately) | Minutes |
| Severe Degradation | Major features broken, significant latency | High (Page during day, escalate at night) | < 30 minutes |
| Partial Degradation | Some features affected, workarounds exist | Medium (Ticket + on-call awareness) | < 4 hours |
| Minor Issues | Edge cases affected, most users unimpacted | Low (Ticket for next business day) | < 24 hours |
| Cosmetic/Informational | No functional impact, aesthetic issues | Informational (Log only) | As capacity allows |
Measuring User Impact

To alert on user impact, you must measure user impact. This requires instrumentation at every layer where users interact with your system:

Request-Level Metrics
- Success rate (percentage of requests returning expected results)
- Error rate by type (client errors vs. server errors)
- Latency distributions (p50, p90, p99)
- Throughput (requests per second)

Session-Level Metrics
- Login success rate
- Session duration anomalies
- User flow abandonment rates
- Feature adoption metrics

Business-Level Metrics
- Transaction completion rate
- Revenue per minute/hour
- Order placement success
- Critical business process completion
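As a concrete illustration of the request-level metrics above, here is a small Python sketch that derives success rate, error breakdown, and latency percentiles from raw request records. The sample data and the nearest-rank percentile helper are illustrative assumptions, not a production implementation.

```python
# Illustrative request records: (HTTP status, latency in milliseconds)
requests = [
    (200, 42), (200, 55), (500, 120), (200, 38), (404, 25),
    (200, 61), (200, 47), (503, 900), (200, 52), (200, 44),
]

def percentile(sorted_vals, p):
    """Approximate nearest-rank percentile; good enough for an illustration."""
    idx = round(p / 100 * (len(sorted_vals) - 1))
    return sorted_vals[idx]

total = len(requests)
server_errors = sum(1 for status, _ in requests if status >= 500)
client_errors = sum(1 for status, _ in requests if 400 <= status < 500)
success_rate = (total - server_errors) / total  # client errors usually aren't the service's fault

latencies = sorted(latency for _, latency in requests)
p50, p90, p99 = (percentile(latencies, p) for p in (50, 90, 99))

print(f"success rate: {success_rate:.1%}")
print(f"server errors: {server_errors / total:.1%}, client errors: {client_errors / total:.1%}")
print(f"p50={p50}ms  p90={p90}ms  p99={p99}ms")
```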
Imagine your executive dashboard showing red/yellow/green for system health. An alert should indicate something is red or turning red. If your system would show green despite the metric you're alerting on, you shouldn't be alerting on it.
The most sophisticated approach to 'what to alert on' comes from Service Level Objective (SLO) based alerting. Rather than alerting on arbitrary thresholds, you alert when the system is consuming its error budget at an unsustainable rate.

This approach, pioneered by Google's Site Reliability Engineering practices, fundamentally changes how teams think about alerts.
Understanding Error Budgets

An error budget is the inverse of your SLO. If you promise 99.9% availability, you have a 0.1% error budget. For a 30-day month, this translates to approximately 43 minutes of acceptable downtime.

Why This Matters for Alerting

Traditional alerting asks: 'Is something wrong right now?'

SLO-based alerting asks: 'Are we on track to meet our reliability commitments?'

This shift is profound. A brief spike in error rate that quickly resolves may trigger traditional alerts but wouldn't concern SLO-based alerting—because it barely dented the error budget. Conversely, a slowly increasing error rate that wouldn't trigger percentage-based alerts might be rapidly burning through error budget, warranting immediate attention.
```
Error Budget Calculation:
───────────────────────────────────────────────────────────
SLO = 99.9% availability over 30 days

Error Budget = (100% - 99.9%) × 30 days × 24 hours × 60 minutes
             = 0.1% × 43,200 minutes
             = 43.2 minutes of total allowed downtime

Burn Rate Analysis:
───────────────────────────────────────────────────────────
Current Error Rate: 0.5% (5x the 0.1% budget rate)
Burn Rate: 5x (consuming error budget five times faster than it accrues)

At this rate:
- One hour's share of the budget (0.06 minutes) consumed in: 12 minutes
- Complete monthly budget exhausted in: ~8,640 minutes (about 6 days)

This SHOULD trigger an alert!

vs.

Current Error Rate: 0.15% (1.5x the budget rate)
Burn Rate: 1.5x (consuming error budget 1.5 times faster than it accrues)

At this rate:
- Daily share of the budget (1.44 minutes) consumed in: ~16 hours
- Complete monthly budget exhausted in: ~20 days

This might be acceptable if it's a brief spike.
```

Multiple Burn Rate Windows

Effective SLO-based alerting uses multiple time windows to detect both acute issues and slow burns:

| Window | Burn Rate | Meaning | Response |
|--------|-----------|---------|----------|
| 5 min | 14.4x | Severe incident in progress | Page immediately |
| 1 hour | 6x | Significant issue developing | Page with urgency |
| 6 hours | 1.5x | Slow burn consuming budget | Ticket for prompt investigation |
| 3 days | 1x | Gradual degradation | Weekly review item |

The combination of fast and slow windows catches both sudden failures and gradual degradation that might otherwise slip detection.
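The same arithmetic is easy to encode. Below is a minimal Python sketch, assuming a 99.9% / 30-day SLO and the burn-rate thresholds from the table above; the function and variable names are illustrative rather than taken from any monitoring product.

```python
SLO_TARGET = 0.999                        # 99.9% availability
BUDGET_RATE = 1 - SLO_TARGET              # 0.1% of requests may fail
PERIOD_MINUTES = 30 * 24 * 60             # 43,200 minutes in the 30-day window
TOTAL_BUDGET_MINUTES = BUDGET_RATE * PERIOD_MINUTES  # ≈ 43.2 minutes

# (window label, burn-rate threshold, action), mirroring the table above
BURN_RATE_POLICY = [
    ("5m",  14.4, "page immediately"),
    ("1h",   6.0, "page with urgency"),
    ("6h",   1.5, "ticket for prompt investigation"),
    ("3d",   1.0, "weekly review item"),
]

def burn_rate(error_rate: float) -> float:
    """How many times faster than sustainable the budget is being spent."""
    return error_rate / BUDGET_RATE

def minutes_until_budget_exhausted(error_rate: float) -> float:
    """Wall-clock minutes until the full 30-day budget is gone at this rate."""
    return PERIOD_MINUTES / burn_rate(error_rate)

def evaluate(error_rates_by_window: dict) -> list:
    """Return the actions whose burn-rate thresholds are currently exceeded."""
    triggered = []
    for window, threshold, action in BURN_RATE_POLICY:
        rate = error_rates_by_window.get(window)
        if rate is not None and burn_rate(rate) >= threshold:
            triggered.append((window, action))
    return triggered

if __name__ == "__main__":
    print(burn_rate(0.005))                       # ≈ 5x, as in the example above
    print(minutes_until_budget_exhausted(0.005))  # ≈ 8,640 minutes (about 6 days)
    # The 6h and 3d tiers trigger here: a ticket plus a weekly review item
    print(evaluate({"5m": 0.005, "1h": 0.004, "6h": 0.002, "3d": 0.0012}))
```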
SLO-based alerting naturally calibrates sensitivity to business impact. During low-traffic periods, a higher error percentage is acceptable because fewer users are affected. During peak traffic, even small percentage increases matter more. The error budget approach handles this automatically.
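One common way to see this self-calibration is to count the budget in failed requests rather than minutes, since availability SLOs are often measured over requests. A brief sketch with hypothetical traffic numbers shows why the same error percentage matters far more at peak than overnight:

```python
# Hypothetical volumes: the budget is a fixed number of failed requests per month,
# so an hour's budget consumption scales with that hour's traffic.
MONTHLY_REQUESTS = 1_000_000_000          # assumed monthly volume
BUDGET_FRACTION = 0.001                   # 99.9% SLO: 0.1% of requests may fail
MONTHLY_BUDGET = MONTHLY_REQUESTS * BUDGET_FRACTION  # 1,000,000 allowed failures

def budget_consumed(requests_in_hour: int, error_rate: float) -> float:
    """Fraction of the monthly budget burned by one hour at this error rate."""
    return (requests_in_hour * error_rate) / MONTHLY_BUDGET

# 1% errors overnight (6,000 req/hour) vs. 1% errors at peak (6,000,000 req/hour)
print(f"{budget_consumed(6_000, 0.01):.4%} of monthly budget")      # 0.0060%
print(f"{budget_consumed(6_000_000, 0.01):.4%} of monthly budget")  # 6.0000%
```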
Not all alerts are created equal. Understanding the different categories helps you design appropriate responses for each type.
What Deserves a Page?

The highest bar in alerting is the page—an alert that interrupts someone regardless of time of day. Pages should be reserved for situations where:

1. Users are currently impacted or will be within minutes
2. Human intervention is required to resolve the issue
3. Delay will cause escalating harm (more users affected, data loss, revenue impact)
4. The required action is known and can be performed by on-call

If any of these conditions isn't met, the alert shouldn't page.
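These four criteria can double as a pre-page checklist. Here is a minimal sketch (the field names are hypothetical): if any answer is 'no', the alert should open a ticket instead of paging. The table that follows shows how the same reasoning plays out for common scenarios.

```python
from dataclasses import dataclass

@dataclass
class PageChecklist:
    users_impacted_now_or_imminently: bool   # criterion 1
    requires_human_intervention: bool        # criterion 2
    delay_causes_escalating_harm: bool       # criterion 3
    oncall_can_act_with_known_steps: bool    # criterion 4

def should_page(checklist: PageChecklist) -> bool:
    """Page only when every criterion holds; otherwise file a ticket."""
    return all(vars(checklist).values())

# Example: disk at 85% and growing 2%/day fails criteria 1 and 3,
# so it becomes a ticket rather than a 3 AM page.
print(should_page(PageChecklist(False, True, False, True)))  # False
```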
| Scenario | Page? | Ticket? | Rationale |
|---|---|---|---|
| API returning 500s to 20% of requests | Yes | — | Active user impact, requires immediate investigation |
| Disk at 85% capacity, growing 2%/day | No | Yes | Not urgent, can be addressed in business hours |
| Single pod CrashLoopBackOff, others healthy | No | Yes | Redundancy handling, no user impact |
| SSL certificate expires in 3 days | No | Yes (High) | Predictable, addressable in work hours |
| Database primary failover occurred | Depends | Yes | Page if failover was unexpected; ticket if planned |
| Memory trending up, no OOM yet | No | Yes | Investigation needed, but not urgent |
| All replicas lost for critical service | Yes | — | Complete outage, immediate action required |
Resist the temptation to create page-level alerts 'just in case'. Every unnecessary page contributes to alert fatigue. If you're unsure whether something needs a page, start with a ticket. You can always escalate based on real-world experience.
Learning what not to alert on is as important as learning what to alert on. Alerting anti-patterns, from non-actionable notifications to cause-based alerts that never correlate with user impact, are commonly found in organizations struggling with alert fatigue.
Schedule quarterly alert reviews. For each alert, ask: 'When did this last fire? What action was taken? Did the action match the expected runbook?' Alerts that haven't fired in 6 months may be candidates for removal or re-tuning. Alerts that fired without useful action should be deleted or redesigned.
With principles established, let's construct a practical framework for deciding what to alert on in your systems.
Step 1: Define Your SLOs

Start with what matters to users and the business. Typical SLO categories include:

- Availability: What percentage of requests should succeed?
- Latency: What response time should users experience?
- Throughput: What transaction volume must the system support?
- Correctness: What data accuracy is required?

Step 2: Instrument User-Facing Metrics

Ensure you're measuring what users experience, not just what servers report. Key metrics:

- Requests handled vs. requests failed (as users see it)
- Time from request receipt to response delivery
- End-to-end transaction success rates
- User-reported errors and complaints

Step 3: Calculate Error Budgets

For each SLO, determine the acceptable error budget and how it's consumed over time. This becomes your alerting baseline.

Step 4: Define Alert Conditions Based on Burn Rate

Create alerts that trigger when error budget consumption threatens your SLO. Use multiple time windows for different severities.

Step 5: Establish Clear Response Expectations

For each alert, document:

- What does this alert mean?
- What should the responder check first?
- What are the likely causes?
- What remediation steps are available?
- When should the responder escalate?
```yaml
# Example: E-commerce Checkout Service Alert Strategy

slo:
  name: "checkout-availability"
  target: 99.9%   # 43 minutes/month error budget
  window: 30d

alerts:
  # Critical: Immediate page required
  - name: checkout_critical_burn
    description: "Checkout SLO burning at critical rate"
    condition: |
      (1 - checkout_success_rate[5m]) >= 0.01  # 10x burn rate
    severity: critical
    action: page
    runbook: "runbook.example.com/checkout-critical"

  # High: Page during business hours, ticket at night
  - name: checkout_high_burn
    description: "Checkout SLO burning at elevated rate"
    condition: |
      (1 - checkout_success_rate[1h]) >= 0.006  # 6x burn rate
    severity: high
    action: page_business_hours
    runbook: "runbook.example.com/checkout-elevated"

  # Medium: Create ticket for investigation
  - name: checkout_slow_burn
    description: "Checkout error budget slowly depleting"
    condition: |
      (1 - checkout_success_rate[6h]) >= 0.002  # 2x burn rate
    severity: medium
    action: ticket
    runbook: "runbook.example.com/checkout-investigation"

  # Proactive: Warning before problems manifest
  - name: checkout_dependency_unhealthy
    description: "Critical checkout dependency showing degradation"
    condition: |
      payment_gateway_health < 0.95 OR inventory_service_health < 0.95
    severity: medium
    action: ticket
    runbook: "runbook.example.com/checkout-dependencies"
```

We've covered the foundational principles for designing effective alerts. Let's consolidate the key insights and look at what comes next.
What's Next:

Now that we understand what to alert on, we need to understand how to determine when an alert should fire. The next page explores alert thresholds—the art and science of setting boundaries that catch real problems while avoiding false positives.
You now understand the philosophy and principles that guide effective alerting decisions. The core insight is simple but powerful: alerts exist to prompt human action when the system cannot heal itself. Everything else—symptoms vs. causes, SLO burn rates, urgency categories—flows from this fundamental truth.