Setting alert thresholds is a perpetual balancing act. Set them too sensitive, and you drown in false alarms—every minor fluctuation becomes an emergency. Set them too lenient, and you miss genuine incidents until they escalate into catastrophes.

This is the Goldilocks Problem of alerting: finding thresholds that are just right—sensitive enough to catch real problems but stable enough to avoid crying wolf. Unlike the fairy tale, however, there's no single correct answer. The 'just right' threshold depends on your system's behavior, your SLOs, your team's capacity, and the cost of missing an incident versus the cost of responding to a false alarm.
By the end of this page, you will understand the mathematical foundations of threshold selection, the trade-offs between different threshold types, how to use statistical methods to reduce false positives, and practical techniques for tuning thresholds based on real-world feedback.
Before diving into threshold selection, we need to understand the different types of thresholds available. Each type has distinct characteristics that make it more or less suitable for different scenarios.
| Type | Pros | Cons | Best For |
|---|---|---|---|
| Static | Simple, predictable, easy to explain | Ignores normal variation, may need frequent adjustment | Infrastructure limits (disk space, connection limits) |
| Dynamic | Adapts to patterns, reduces seasonal false positives | Complex, can mask slow degradation, training period needed | Traffic-dependent metrics (latency, throughput) |
| Percentage | Scales with load naturally | Sensitive to baseline calculation, noisy at low volumes | Performance metrics relative to established baselines |
| Rate-Based | Catches developing problems early | Can false-positive on normal fluctuations | Metrics where rate of change is more important than absolute value |
| Compound | Very low false positive rate | Can miss single-dimension problems | Critical alerts where false positives are very costly |
Static thresholds are the most common and easiest to implement. Despite their simplicity, they're often misused. Understanding when static thresholds work and when they fail is crucial.
When Static Thresholds Work Well:

1. Hard Limits Exist — Disk at 95% capacity will cause problems regardless of traffic patterns. Memory exhaustion triggers OOM kills at specific thresholds. These physical boundaries are inherently static.

2. SLO Contracts Are Absolute — If your SLA promises 99.9% availability, an error rate of 0.1% is significant regardless of context. Business commitments create static boundaries.

3. Metrics Are Stable — Connection pool usage should remain relatively constant if your application is well-tuned. Static thresholds work when the metric itself doesn't vary dramatically.

4. Simplicity Is Paramount — For teams new to alerting, starting with static thresholds provides transparency and learnability. You can evolve to dynamic approaches later.
Static thresholds notoriously fail for traffic-dependent metrics. If you set 'Alert if request volume < 1000/min' based on weekday traffic, it will fire every weekend. If you set it based on weekend traffic, you'll miss weekday outages. This is where dynamic thresholds become essential.
Setting Effective Static Thresholds:

When static thresholds are appropriate, follow these guidelines:

Start From Real Data
Plot your metric over 2-4 weeks. Identify:
- Normal operating range
- Maximum typical values
- Minimum typical values
- Anomalous spikes that were or weren't problems

Apply the Margin Principle
Set thresholds beyond normal variation but before failure:

```
┌──────────────────────────────────────────────────────────────────────┐
│                        FAILURE ZONE (Too Late)                       │
├──────────────────────────────────────────────────────────────────────┤
│ ████████████████████  CRITICAL THRESHOLD  ██████████████████████     │
├──────────────────────────────────────────────────────────────────────┤
│                       WARNING ZONE (Just Right)                      │
├──────────────────────────────────────────────────────────────────────┤
│ ████████████████████  WARNING THRESHOLD  ██████████████████████      │
├──────────────────────────────────────────────────────────────────────┤
│                        NORMAL OPERATING RANGE                        │
│            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~           │
│                      Typical metric fluctuation                      │
└──────────────────────────────────────────────────────────────────────┘
```

The gap between your maximum typical value and the threshold should be large enough to avoid false positives from normal spikes but small enough to catch real problems before they escalate.
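To make the 'start from real data' step concrete, here is a minimal Python sketch of deriving candidate thresholds from historical samples. It is illustrative only: the synthetic data stands in for your exported metric history, and the margin multipliers are assumptions to tune, not standards.

```python
# A minimal sketch of "start from real data": derive candidate static
# thresholds from historical samples. The synthetic latency data stands in
# for 2-4 weeks of real metric exports; the margin factors are illustrative.
import random
import statistics

random.seed(42)
# Pretend these are per-minute p95 latency readings (ms) over two weeks
samples = [random.lognormvariate(4.6, 0.25) for _ in range(14 * 24 * 60)]

typical_max = statistics.quantiles(samples, n=100)[98]  # p99 of normal operation
warning = round(typical_max * 1.3)    # comfortably above normal spikes
critical = round(typical_max * 1.8)   # but still well before user-visible failure

print(f"p99 of observed values: {typical_max:.0f} ms")
print(f"candidate warning threshold:  {warning} ms")
print(f"candidate critical threshold: {critical} ms")
```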
```yaml
# Good use cases for static thresholds
# Infrastructure limits - these are genuinely static

alerts:
  - name: disk_space_warning
    condition: disk_used_percent > 80
    severity: warning
    rationale: "80% gives ~2 days to respond at typical growth rate"

  - name: disk_space_critical
    condition: disk_used_percent > 95
    severity: critical
    rationale: "95% leaves minimal buffer; some FS degrade at this level"

  - name: memory_pressure
    condition: memory_available_bytes < 500MB
    severity: critical
    rationale: "Below 500MB, OOM killer becomes likely"

  # Connection limits - tied to configuration
  - name: database_connections_exhausted
    condition: db_connection_pool_used > 90%
    severity: warning
    rationale: "Pool is 100 connections; 90% means new requests may wait"

  # Certificate expiration - truly static deadline
  - name: ssl_cert_expiring
    condition: ssl_cert_days_remaining < 14
    severity: warning
    rationale: "14 days gives time for renewal even with delays"

  - name: ssl_cert_critical
    condition: ssl_cert_days_remaining < 3
    severity: critical
    rationale: "3 days requires immediate action to prevent outage"
```

For metrics that naturally fluctuate with time, load, or business cycles, dynamic thresholds provide a more intelligent approach. These thresholds adapt based on what's 'normal' for a given context.
The Statistical Foundation

Dynamic thresholds typically use statistical methods to define 'normal'. The most common approach involves standard deviations from the mean:

- Mean (μ): The average value over a historical period
- Standard Deviation (σ): The measure of how spread out values are
- Z-Score: How many standard deviations a value is from the mean

```
Z = (Current Value - μ) / σ

Interpretation:
- Z = 0:  Value equals the mean
- Z = 1:  Value is 1 standard deviation above the mean
- Z = -2: Value is 2 standard deviations below the mean
```

The Empirical Rule (68-95-99.7)

For normally distributed data:
- 68% of values fall within ±1σ
- 95% of values fall within ±2σ
- 99.7% of values fall within ±3σ

This means alerting at ±3σ would, for truly normal data, produce false positives only 0.3% of the time.
Latency, error rates, and queue depths are rarely normally distributed. They tend to be right-skewed with heavy tails. Consider using median and percentiles instead of mean and standard deviation for robust anomaly detection.
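As a rough illustration of both ideas, the sketch below computes a classic z-score and a median/MAD alternative over a window of samples. The function names, sample data, and cutoff values are illustrative assumptions, not part of any particular monitoring library.

```python
# Minimal anomaly checks over a window of recent samples.
# zscore_anomaly follows the formula above; mad_anomaly is the robust
# alternative suggested for skewed metrics. Cutoffs are illustrative.
import statistics

def zscore_anomaly(history, current, cutoff=3.0):
    """Flag values more than `cutoff` standard deviations from the mean."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    z = (current - mu) / sigma
    return abs(z) > cutoff, z

def mad_anomaly(history, current, cutoff=3.5):
    """Robust variant: distance from the median, scaled by the MAD."""
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history)
    score = 0.6745 * (current - med) / mad  # 0.6745 makes MAD comparable to sigma
    return abs(score) > cutoff, score

latencies_ms = [52, 48, 55, 50, 47, 53, 49, 51, 250, 54, 50, 52]  # one 250 ms outlier
print(zscore_anomaly(latencies_ms, 180))  # mean/stdev dragged up by the outlier: 180 ms NOT flagged
print(mad_anomaly(latencies_ms, 180))     # median/MAD still flag 180 ms as clearly unusual
```

Neither check replaces a proper seasonal baseline, but the robust variant is far less likely to be blinded by the very outliers you want to catch.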
Seasonality-Aware Thresholds

Many systems exhibit predictable patterns:
- Hourly: Morning ramp-up, lunch lull, afternoon peak
- Daily: Weekday vs. weekend traffic
- Weekly: Monday traffic spike after weekend quiet
- Monthly/Yearly: Month-end processing, holiday seasons

Effective dynamic thresholds compare current values against the historically typical value for this time period, not just an overall average.
```yaml
# Simple approach: Compare against the same hour last week
# Alert if current value differs significantly from last week

# Calculate baseline from the same hour, 7 days ago
- record: http_requests_baseline
  expr: http_requests_total offset 168h  # 168 hours = 7 days

# Alert if current deviates more than 50% from baseline
# Requires at least 100 requests to avoid noise on low traffic
- alert: TrafficAnomaly
  expr: |
    abs(
      (http_requests_total - http_requests_baseline)
      / http_requests_baseline
    ) > 0.5
    and http_requests_total > 100
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Traffic differs significantly from baseline"

# More sophisticated: Use a multiple-week average
- record: http_requests_avg_baseline
  expr: |
    (
      http_requests_total offset 168h
      + http_requests_total offset 336h
      + http_requests_total offset 504h
    ) / 3

# Alert on deviation from the multi-week baseline
- alert: TrafficAnomalyRobust
  expr: |
    abs(
      (http_requests_total - http_requests_avg_baseline)
      / http_requests_avg_baseline
    ) > 0.4
    and http_requests_total > 100
  for: 15m
  labels:
    severity: warning
```

Machine Learning Approaches

For sophisticated environments, ML-based anomaly detection can identify patterns humans would miss:

- Isolation Forests: Identify outliers by how easily they can be isolated
- LSTM Networks: Learn temporal patterns and predict expected ranges
- Prophet-style decomposition: Separate trend, seasonality, and residual

These approaches excel at catching unusual patterns but require careful tuning to avoid black-box alerting where no one understands why an alert fired.
One of the most powerful techniques for reducing false positives while maintaining sensitivity is multi-window alerting—using multiple time windows to confirm that an anomaly is real and persistent.
The Problem with Single Windows

A single 5-minute window might show:
- A brief spike that self-resolves
- A calculation artifact from metric aggregation
- A transient network hiccup affecting data collection
- A genuine developing incident

You can't distinguish between these with one window. Multi-window alerting solves this.
| Short Window | Long Window | Alert If | Interpretation |
|---|---|---|---|
| 5 min high | 1 hour high | Both triggered | Sustained incident, page immediately |
| 5 min high | 1 hour normal | Short only | Possible spike, continue monitoring |
| 5 min normal | 1 hour high | Long only | Slow burn, create ticket |
| 5 min normal | 1 hour normal | Neither | System healthy |
Implementing Multi-Window Logic

The key insight is that different window sizes serve different purposes:

- Short windows (1-5 min): Catch acute incidents quickly
- Medium windows (30 min - 1 hour): Confirm persistence, filter transient spikes
- Long windows (6 hours - 1 day): Detect slow burns and trending degradation

The Google SRE Approach

Google's SRE team recommends a specific multi-window configuration for SLO-based alerting:
```yaml
# Google-style multi-window, multi-burn-rate alerting
# Based on the SRE book recommendations

# For a 99.9% SLO (30-day window, 43-minute error budget):

alerts:
  # Page immediately: Fast burn detected
  - name: slo_high_fast_burn
    # Both short AND long windows must exceed threshold
    condition: |
      # 2% budget burned in 1 hour = high burn rate
      # Check both 5-minute and 1-hour windows agree
      (1 - success_rate[5m]) >= 0.01
      AND (1 - success_rate[1h]) >= 0.005
    severity: critical
    action: page
    rationale: |
      5m window catches acute issues quickly.
      1h window confirms it's not a transient spike.
      Combined: confident this is a real, fast-burning incident.

  # Page during day, ticket at night: Medium burn
  - name: slo_medium_burn
    condition: |
      # 5% budget burned in 6 hours
      (1 - success_rate[30m]) >= 0.003
      AND (1 - success_rate[6h]) >= 0.0015
    severity: high
    action: page_business_hours
    rationale: |
      30m window detects developing issues.
      6h window confirms sustained degradation.
      Combined: needs attention but not immediately.

  # Ticket: Slow burn
  - name: slo_slow_burn
    condition: |
      # Burning faster than sustainable over days
      (1 - success_rate[6h]) >= 0.001
      AND (1 - success_rate[3d]) >= 0.0004
    severity: medium
    action: ticket
    rationale: |
      Multi-day windows catch gradual degradation
      that slips under short-term radar.
```

The long window should be at least 12x the short window. This ratio ensures the long window isn't dominated by the same spike affecting the short window. A 5-minute spike may affect a 1-hour average, but it won't significantly move a 6-hour average.
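The dilution effect behind that guideline is easy to verify with back-of-the-envelope arithmetic. The sketch below uses made-up numbers (a 1% baseline error rate spiking to 20% for five minutes) to show how little a short spike moves the longer windows:

```python
# How much does a 5-minute spike move the average over different windows?
# Illustrative numbers only: 1% baseline error rate spiking to 20% for 5 minutes.
baseline_rate, spike_rate, spike_minutes = 0.01, 0.20, 5

for window_minutes in (5, 60, 360):  # 5 min, 1 hour, 6 hours
    spiked = min(spike_minutes, window_minutes)
    window_avg = (spiked * spike_rate + (window_minutes - spiked) * baseline_rate) / window_minutes
    print(f"{window_minutes:>4} min window average error rate: {window_avg:.2%}")

# ~20%  in the 5-minute window: the spike dominates
# ~2.6% in the 1-hour window: clearly elevated
# ~1.3% in the 6-hour window: barely moved, which is why the long window filters spikes
```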
For latency and other right-skewed metrics, percentile-based thresholds are essential. Average-based thresholds hide problems affecting a minority of users.
Why Averages Lie

Consider this latency distribution:
- 95% of requests: 50ms
- 5% of requests: 2000ms
- Average latency: 147.5ms

The average looks acceptable, but 5% of your users are experiencing 40x worse performance! These frustrated users generate support tickets, churn, and negative reviews.

Percentile Coverage

Different percentiles tell different stories:

| Percentile | Interpretation | Typical Use |
|---|---|---|
| p50 (median) | Half of requests are better than this | General performance indicator |
| p90 | 90% of requests are better than this | Most users' experience |
| p95 | 95% of requests are better than this | Catches edge cases affecting many |
| p99 | 99% of requests are better than this | Worst-case for regular traffic |
| p99.9 | 999 of 1000 requests are better | Long tail, often indicates systemic issues |
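To see that arithmetic in action, here is a quick check of the distribution from 'Why Averages Lie' above (950 requests at 50 ms, 50 at 2000 ms); the numbers come from that example, not from real measurements:

```python
# The distribution from the example above: 95% of requests at 50 ms, 5% at 2000 ms.
import statistics

latencies_ms = [50] * 950 + [2000] * 50
cuts = statistics.quantiles(latencies_ms, n=100)  # cut points for p1..p99

print(f"mean: {statistics.fmean(latencies_ms):.1f} ms")  # 147.5 -- looks acceptable
print(f"p50:  {cuts[49]:.0f} ms")                        # 50 -- the median user is fine
print(f"p90:  {cuts[89]:.0f} ms")                        # 50 -- still fine
print(f"p99:  {cuts[98]:.0f} ms")                        # 2000 -- the pain the average hides
```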
```yaml
# Alert on latency percentiles, not averages

# p50 alert - if the median is high, the whole service is slow
- alert: LatencyHighP50
  expr: histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m])) > 0.2
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Median latency exceeds 200ms"
    description: "Half of all requests taking >200ms indicates a systemic issue"

# p90 alert - catches degradation affecting many users
- alert: LatencyHighP90
  expr: histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "P90 latency exceeds 500ms"
    description: "10% of users experiencing significant delays"

# p99 alert - catches long tail problems
- alert: LatencyHighP99
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2.0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P99 latency exceeds 2 seconds"
    description: "Long tail latency affecting 1% of requests"

# Multi-percentile alert - degradation across the board
- alert: LatencyDegradationWidespread
  expr: |
    histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[10m])) > 0.1
    and histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m])) > 0.3
    and histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[10m])) > 1.0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Widespread latency degradation across all percentiles"
```

Alert on p50/p90 for page-worthy incidents (widespread impact). Alert on p99/p99.9 for ticket-worthy investigation (tail latency issues). The higher the percentile, the more sensitive to outliers—which is useful for investigation but noisy for paging.
Threshold selection isn't a one-time decision—it's an ongoing process of refinement based on real-world feedback. The goal is to maximize the ratio of true positives to false positives while minimizing missed incidents.
The Tuning Cycle

```
┌──────────────────┐
│   Set Initial    │
│   Thresholds     │◄──────────────────────────────────────────┐
└────────┬─────────┘                                            │
         │                                                      │
         ▼                                                      │
┌──────────────────┐                                            │
│  Monitor Alert   │                                            │
│    Behavior      │                                            │
└────────┬─────────┘                                            │
         │                                                      │
         ▼                                                      │
┌──────────────────┐     ┌──────────────────────────────────┐   │
│  Classify Each   │────▶│ True Positive: Correct!          │   │
│      Alert       │     │ False Positive: Threshold too    │   │
└────────┬─────────┘     │   sensitive, raise it            │   │
         │               │ False Negative: Threshold too    │   │
         │               │   lenient, lower it              │   │
         │               │ True Negative: Correct!          │   │
         ▼               └──────────────────────────────────┘   │
┌──────────────────┐                                            │
│ Gather Feedback  │                                            │
│ from Responders  │                                            │
└────────┬─────────┘                                            │
         │                                                      │
         ▼                                                      │
┌──────────────────┐                                            │
│ Analyze Patterns │                                            │
│   and Adjust     │────────────────────────────────────────────┘
└──────────────────┘
```
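One way to make the classification step actionable is to track precision (how often an alert was worth responding to) and recall (how often real incidents actually alerted) per rule over a review period. A minimal sketch, with invented counts for illustration:

```python
# Track signal quality per alert rule from post-incident classifications.
# Counts are invented for illustration; gather them from your own review process.
alert_reviews = {
    # rule name: (true positives, false positives, false negatives / missed incidents)
    "api_latency_p99_critical": (8, 2, 1),
    "disk_space_warning":       (3, 14, 0),
    "traffic_anomaly":          (5, 1, 4),
}

for rule, (tp, fp, fn) in alert_reviews.items():
    precision = tp / (tp + fp)   # fraction of alerts that were worth responding to
    recall = tp / (tp + fn)      # fraction of real incidents that actually alerted
    print(f"{rule:<28} precision={precision:.0%}  recall={recall:.0%}")

# Low precision (disk_space_warning) -> threshold too sensitive, raise it or add duration.
# Low recall (traffic_anomaly)       -> threshold too lenient, lower it.
```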
Beware of raising thresholds just because a condition persists. If your error rate has 'always' been 3%, you might raise the alert threshold to 5%. But that 3% may represent a real, fixable problem. Sometimes the right response is fixing the root cause rather than adjusting the alert.
Let's examine real-world threshold configurations for common alerting scenarios, with rationale for each choice.
```yaml
# =====================================================
# AVAILABILITY ALERTS
# =====================================================

availability_alerts:

  # HTTP success rate
  - name: http_success_rate_critical
    condition: success_rate[5m] < 0.99
    for: 2m
    severity: critical
    threshold_rationale: |
      99% success rate means 1% errors. For a 99.9% SLO,
      1% sustained error rate consumes 10x the budget - clearly critical.

  - name: http_success_rate_warning
    condition: success_rate[5m] < 0.995
    for: 5m
    severity: warning
    threshold_rationale: |
      0.5% error rate is elevated but may recover.
      Longer 'for' duration filters transient spikes.

# =====================================================
# LATENCY ALERTS
# =====================================================

latency_alerts:

  # API latency - SLO is p99 < 500ms
  - name: api_latency_p99_critical
    condition: latency_p99[5m] > 1.0  # 1 second
    for: 5m
    severity: critical
    threshold_rationale: |
      2x the SLO target indicates severe degradation.
      Users are actively frustrated at this latency.

  - name: api_latency_p99_warning
    condition: latency_p99[5m] > 0.5  # 500ms
    for: 10m
    severity: warning
    threshold_rationale: |
      At SLO boundary; longer duration means we're sure
      it's not a transient spike.

  - name: api_latency_p50_critical
    condition: latency_p50[5m] > 0.5  # 500ms
    for: 2m
    severity: critical
    threshold_rationale: |
      If MEDIAN latency is at the p99 target, the entire
      service is slow. Immediate action needed.

# =====================================================
# RESOURCE UTILIZATION ALERTS
# =====================================================

resource_alerts:

  # CPU - multi-threshold approach
  - name: cpu_high_sustained
    condition: cpu_utilization[15m] > 0.85
    for: 10m
    severity: warning
    threshold_rationale: |
      85% for 10+ minutes indicates genuinely high load,
      not a processing spike. 25 minutes total before alert fires.

  - name: cpu_critical
    condition: cpu_utilization[5m] > 0.95
    for: 5m
    severity: critical
    threshold_rationale: |
      95% leaves no headroom. System likely degraded
      or about to become unresponsive.

  # Memory - OOM prevention
  - name: memory_pressure_warning
    condition: memory_available_bytes < 1GB
    for: 5m
    severity: warning
    threshold_rationale: |
      1GB buffer is typically enough for most applications
      but getting close to danger.

  - name: memory_pressure_critical
    condition: memory_available_bytes < 256MB
    for: 1m
    severity: critical
    threshold_rationale: |
      OOM killer likely to trigger soon. Very short 'for'
      duration because impact is imminent.

  # Disk - growth-aware thresholds
  - name: disk_space_warning
    condition: disk_used_percent > 80
    for: 30m
    severity: warning
    threshold_rationale: |
      80% typically allows days of growth. 30m duration filters
      temporary spikes from large file processing.

  - name: disk_space_critical
    condition: disk_used_percent > 95
    for: 5m
    severity: critical
    threshold_rationale: |
      5% remaining. Some systems reserve 5% for root.
      Immediate action needed.

  # Smarter: Alert on growth rate, not just level
  - name: disk_growth_unsustainable
    condition: predict_linear(disk_used_bytes[1h], 24*3600) > disk_total_bytes
    for: 30m
    severity: warning
    threshold_rationale: |
      If the current growth rate continues, the disk fills in 24h.
      Creates a ticket before the emergency hits.
```

Setting effective thresholds is both art and science. Let's consolidate the key principles:

- Match the threshold type to the metric: static for hard limits and contractual SLOs, dynamic or percentage-based for traffic-dependent metrics.
- Derive thresholds from real data, leaving a margin between normal variation and the failure zone.
- Prefer percentiles over averages for latency and other right-skewed metrics.
- Use multi-window confirmation to filter transient spikes while still catching slow burns.
- Document the rationale behind every threshold so responders understand why it exists.
- Treat every threshold as a hypothesis: classify alerts as true or false positives, gather responder feedback, and tune continuously.
What's Next:

Even with perfect thresholds, alert volume can become unsustainable. The next page explores reducing alert fatigue—strategies for managing alert volume, correlating related alerts, and ensuring that when an alert fires, it receives the attention it deserves.
You now understand the different types of thresholds, when to apply each, and how to tune them for optimal signal-to-noise ratio. The key insight: thresholds are hypotheses about system behavior that must be continuously validated against reality.