Setting alert thresholds is a perpetual balancing act. Set them too sensitive, and you drown in false alarms—every minor fluctuation becomes an emergency. Set them too lenient, and you miss genuine incidents until they escalate into catastrophes.

This is the Goldilocks Problem of alerting: finding thresholds that are just right—sensitive enough to catch real problems but stable enough to avoid crying wolf. Unlike the fairy tale, however, there's no single correct answer. The 'just right' threshold depends on your system's behavior, your SLOs, your team's capacity, and the cost of missing an incident versus the cost of responding to a false alarm.
By the end of this page, you will understand the mathematical foundations of threshold selection, the trade-offs between different threshold types, how to use statistical methods to reduce false positives, and practical techniques for tuning thresholds based on real-world feedback.
Before diving into threshold selection, we need to understand the different types of thresholds available. Each type has distinct characteristics that make it more or less suitable for different scenarios.
| Type | Pros | Cons | Best For |
|---|---|---|---|
| Static | Simple, predictable, easy to explain | Ignores normal variation, may need frequent adjustment | Infrastructure limits (disk space, connection limits) |
| Dynamic | Adapts to patterns, reduces seasonal false positives | Complex, can mask slow degradation, training period needed | Traffic-dependent metrics (latency, throughput) |
| Percentage | Scales with load naturally | Sensitive to baseline calculation, noisy at low volumes | Performance metrics relative to established baselines |
| Rate-Based | Catches developing problems early | Can false-positive on normal fluctuations | Metrics where rate of change is more important than absolute value |
| Compound | Very low false positive rate | Can miss single-dimension problems | Critical alerts where false positives are very costly |
Static thresholds are the most common and easiest to implement. Despite their simplicity, they're often misused. Understanding when static thresholds work and when they fail is crucial.
When Static Thresholds Work Well:

1. Hard Limits Exist — Disk at 95% capacity will cause problems regardless of traffic patterns. Memory exhaustion triggers OOM kills at specific thresholds. These physical boundaries are inherently static.

2. SLO Contracts Are Absolute — If your SLA promises 99.9% availability, an error rate of 0.1% is significant regardless of context. Business commitments create static boundaries.

3. Metrics Are Stable — Connection pool usage should remain relatively constant if your application is well-tuned. Static thresholds work when the metric itself doesn't vary dramatically.

4. Simplicity Is Paramount — For teams new to alerting, starting with static thresholds provides transparency and learnability. You can evolve to dynamic approaches later.
Static thresholds notoriously fail for traffic-dependent metrics. If you set 'Alert if request volume < 1000/min' based on weekday traffic, it will fire every weekend. If you set it based on weekend traffic, you'll miss weekday outages. This is where dynamic thresholds become essential.
Setting Effective Static Thresholds:

When static thresholds are appropriate, follow these guidelines:

Start From Real Data
Plot your metric over 2-4 weeks. Identify:
- Normal operating range
- Maximum typical values
- Minimum typical values
- Anomalous spikes that were or weren't problems

Apply the Margin Principle
Set thresholds beyond normal variation but before failure:

```
┌──────────────────────────────────────────────────────────────────────┐
│                        FAILURE ZONE (Too Late)                       │
├──────────────────────────────────────────────────────────────────────┤
│ ████████████████████  CRITICAL THRESHOLD  ██████████████████████     │
├──────────────────────────────────────────────────────────────────────┤
│                       WARNING ZONE (Just Right)                      │
├──────────────────────────────────────────────────────────────────────┤
│ ████████████████████  WARNING THRESHOLD  ██████████████████████      │
├──────────────────────────────────────────────────────────────────────┤
│                        NORMAL OPERATING RANGE                        │
│            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~           │
│                      Typical metric fluctuation                      │
└──────────────────────────────────────────────────────────────────────┘
```

The gap between your maximum typical value and the threshold should be large enough to avoid false positives from normal spikes but small enough to catch real problems before they escalate.
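To make the 'start from real data' step concrete, here is a minimal Python sketch of deriving candidate thresholds from historical samples. It is illustrative only: the synthetic data stands in for your exported metric history, and the margin multipliers are assumptions to tune, not standards.

```python
# A minimal sketch of "start from real data": derive candidate static
# thresholds from historical samples. The synthetic latency data stands in
# for 2-4 weeks of real metric exports; the margin factors are illustrative.
import random
import statistics

random.seed(42)
# Pretend these are per-minute p95 latency readings (ms) over two weeks
samples = [random.lognormvariate(4.6, 0.25) for _ in range(14 * 24 * 60)]

typical_max = statistics.quantiles(samples, n=100)[98]  # p99 of normal operation
warning = round(typical_max * 1.3)    # comfortably above normal spikes
critical = round(typical_max * 1.8)   # but still well before user-visible failure

print(f"p99 of observed values: {typical_max:.0f} ms")
print(f"candidate warning threshold:  {warning} ms")
print(f"candidate critical threshold: {critical} ms")
```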
```yaml
# Good use cases for static thresholds
# Infrastructure limits - these are genuinely static

alerts:
  - name: disk_space_warning
    condition: disk_used_percent > 80
    severity: warning
    rationale: "80% gives ~2 days to respond at typical growth rate"

  - name: disk_space_critical
    condition: disk_used_percent > 95
    severity: critical
    rationale: "95% leaves minimal buffer; some FS degrade at this level"

  - name: memory_pressure
    condition: memory_available_bytes < 500MB
    severity: critical
    rationale: "Below 500MB, OOM killer becomes likely"

  # Connection limits - tied to configuration
  - name: database_connections_exhausted
    condition: db_connection_pool_used > 90%
    severity: warning
    rationale: "Pool is 100 connections; 90% means new requests may wait"

  # Certificate expiration - truly static deadline
  - name: ssl_cert_expiring
    condition: ssl_cert_days_remaining < 14
    severity: warning
    rationale: "14 days gives time for renewal even with delays"

  - name: ssl_cert_critical
    condition: ssl_cert_days_remaining < 3
    severity: critical
    rationale: "3 days requires immediate action to prevent outage"
```

For metrics that naturally fluctuate with time, load, or business cycles, dynamic thresholds provide a more intelligent approach. These thresholds adapt based on what's 'normal' for a given context.
The Statistical Foundation

Dynamic thresholds typically use statistical methods to define 'normal'. The most common approach involves standard deviations from the mean:

- Mean (μ): The average value over a historical period
- Standard Deviation (σ): The measure of how spread out values are
- Z-Score: How many standard deviations a value is from the mean

```
Z = (Current Value - μ) / σ

Interpretation:
- Z = 0:  Value equals the mean
- Z = 1:  Value is 1 standard deviation above the mean
- Z = -2: Value is 2 standard deviations below the mean
```

The Empirical Rule (68-95-99.7)

For normally distributed data:
- 68% of values fall within ±1σ
- 95% of values fall within ±2σ
- 99.7% of values fall within ±3σ

This means alerting at ±3σ would, for truly normal data, produce false positives only 0.3% of the time.
Latency, error rates, and queue depths are rarely normally distributed. They tend to be right-skewed with heavy tails. Consider using median and percentiles instead of mean and standard deviation for robust anomaly detection.
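As a rough illustration of both ideas, the sketch below computes a classic z-score and a median/MAD alternative over a window of samples. The function names, sample data, and cutoff values are illustrative assumptions, not part of any particular monitoring library.

```python
# Minimal anomaly checks over a window of recent samples.
# zscore_anomaly follows the formula above; mad_anomaly is the robust
# alternative suggested for skewed metrics. Cutoffs are illustrative.
import statistics

def zscore_anomaly(history, current, cutoff=3.0):
    """Flag values more than `cutoff` standard deviations from the mean."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    z = (current - mu) / sigma
    return abs(z) > cutoff, z

def mad_anomaly(history, current, cutoff=3.5):
    """Robust variant: distance from the median, scaled by the MAD."""
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history)
    score = 0.6745 * (current - med) / mad  # 0.6745 makes MAD comparable to sigma
    return abs(score) > cutoff, score

latencies_ms = [52, 48, 55, 50, 47, 53, 49, 51, 250, 54, 50, 52]  # one 250 ms outlier
print(zscore_anomaly(latencies_ms, 180))  # mean/stdev dragged up by the outlier: 180 ms NOT flagged
print(mad_anomaly(latencies_ms, 180))     # median/MAD still flag 180 ms as clearly unusual
```

Neither check replaces a proper seasonal baseline, but the robust variant is far less likely to be blinded by the very outliers you want to catch.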
Seasonality-Aware Thresholds

Many systems exhibit predictable patterns:
- Hourly: Morning ramp-up, lunch lull, afternoon peak
- Daily: Weekday vs. weekend traffic
- Weekly: Monday traffic spike after weekend quiet
- Monthly/Yearly: Month-end processing, holiday seasons

Effective dynamic thresholds compare current values against the historically typical value for this time period, not just an overall average.
```yaml
# Simple approach: Compare against the same hour last week
# Alert if current value differs significantly from last week

# Calculate baseline from the same hour, 7 days ago
- record: http_requests_baseline
  expr: http_requests_total offset 168h  # 168 hours = 7 days

# Alert if current deviates more than 50% from baseline
# Requires at least 100 requests to avoid noise on low traffic
- alert: TrafficAnomaly
  expr: |
    abs(
      (http_requests_total - http_requests_baseline)
      / http_requests_baseline
    ) > 0.5
    and http_requests_total > 100
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Traffic differs significantly from baseline"

# More sophisticated: Use a multiple-week average
- record: http_requests_avg_baseline
  expr: |
    (
      http_requests_total offset 168h
      + http_requests_total offset 336h
      + http_requests_total offset 504h
    ) / 3

# Alert on deviation from the multi-week baseline
- alert: TrafficAnomalyRobust
  expr: |
    abs(
      (http_requests_total - http_requests_avg_baseline)
      / http_requests_avg_baseline
    ) > 0.4
    and http_requests_total > 100
  for: 15m
  labels:
    severity: warning
```

Machine Learning Approaches

For sophisticated environments, ML-based anomaly detection can identify patterns humans would miss:

- Isolation Forests: Identify outliers by how easily they can be isolated
- LSTM Networks: Learn temporal patterns and predict expected ranges
- Prophet-style decomposition: Separate trend, seasonality, and residual

These approaches excel at catching unusual patterns but require careful tuning to avoid black-box alerting where no one understands why an alert fired.
One of the most powerful techniques for reducing false positives while maintaining sensitivity is multi-window alerting—using multiple time windows to confirm that an anomaly is real and persistent.
The Problem with Single Windows

A single 5-minute window might show:
- A brief spike that self-resolves
- A calculation artifact from metric aggregation
- A transient network hiccup affecting data collection
- A genuine developing incident

You can't distinguish between these with one window. Multi-window alerting solves this.
| Short Window | Long Window | Alert If | Interpretation |
|---|---|---|---|
| 5 min high | 1 hour high | Both triggered | Sustained incident, page immediately |
| 5 min high | 1 hour normal | Short only | Possible spike, continue monitoring |
| 5 min normal | 1 hour high | Long only | Slow burn, create ticket |
| 5 min normal | 1 hour normal | Neither | System healthy |
Implementing Multi-Window Logic

The key insight is that different window sizes serve different purposes:

- Short windows (1-5 min): Catch acute incidents quickly
- Medium windows (30 min - 1 hour): Confirm persistence, filter transient spikes
- Long windows (6 hours - 1 day): Detect slow burns and trending degradation

The Google SRE Approach

Google's SRE team recommends a specific multi-window configuration for SLO-based alerting:
```yaml
# Google-style multi-window, multi-burn-rate alerting
# Based on the SRE book recommendations

# For a 99.9% SLO (30-day window, 43-minute error budget):

alerts:
  # Page immediately: Fast burn detected
  - name: slo_high_fast_burn
    # Both short AND long windows must exceed threshold
    condition: |
      # 2% budget burned in 1 hour = high burn rate
      # Check both 5-minute and 1-hour windows agree
      (1 - success_rate[5m]) >= 0.01
      AND (1 - success_rate[1h]) >= 0.005
    severity: critical
    action: page
    rationale: |
      5m window catches acute issues quickly.
      1h window confirms it's not a transient spike.
      Combined: confident this is a real, fast-burning incident.

  # Page during day, ticket at night: Medium burn
  - name: slo_medium_burn
    condition: |
      # 5% budget burned in 6 hours
      (1 - success_rate[30m]) >= 0.003
      AND (1 - success_rate[6h]) >= 0.0015
    severity: high
    action: page_business_hours
    rationale: |
      30m window detects developing issues.
      6h window confirms sustained degradation.
      Combined: needs attention but not immediately.

  # Ticket: Slow burn
  - name: slo_slow_burn
    condition: |
      # Burning faster than sustainable over days
      (1 - success_rate[6h]) >= 0.001
      AND (1 - success_rate[3d]) >= 0.0004
    severity: medium
    action: ticket
    rationale: |
      Multi-day windows catch gradual degradation
      that slips under short-term radar.
```

The long window should be at least 12x the short window. This ratio ensures the long window isn't dominated by the same spike affecting the short window. A 5-minute spike may affect a 1-hour average, but it won't significantly move a 6-hour average.
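The dilution effect behind that guideline is easy to verify with back-of-the-envelope arithmetic. The sketch below uses made-up numbers (a 1% baseline error rate spiking to 20% for five minutes) to show how little a short spike moves the longer windows:

```python
# How much does a 5-minute spike move the average over different windows?
# Illustrative numbers only: 1% baseline error rate spiking to 20% for 5 minutes.
baseline_rate, spike_rate, spike_minutes = 0.01, 0.20, 5

for window_minutes in (5, 60, 360):  # 5 min, 1 hour, 6 hours
    spiked = min(spike_minutes, window_minutes)
    window_avg = (spiked * spike_rate + (window_minutes - spiked) * baseline_rate) / window_minutes
    print(f"{window_minutes:>4} min window average error rate: {window_avg:.2%}")

# ~20%  in the 5-minute window: the spike dominates
# ~2.6% in the 1-hour window: clearly elevated
# ~1.3% in the 6-hour window: barely moved, which is why the long window filters spikes
```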
For latency and other right-skewed metrics, percentile-based thresholds are essential. Average-based thresholds hide problems affecting a minority of users.
Why Averages Lie

Consider this latency distribution:
- 95% of requests: 50ms
- 5% of requests: 2000ms
- Average latency: 147.5ms

The average looks acceptable, but 5% of your users are experiencing 40x worse performance! These frustrated users generate support tickets, churn, and negative reviews.

Percentile Coverage

Different percentiles tell different stories:

| Percentile | Interpretation | Typical Use |
|---|---|---|
| p50 (median) | Half of requests are better than this | General performance indicator |
| p90 | 90% of requests are better than this | Most users' experience |
| p95 | 95% of requests are better than this | Catches edge cases affecting many |
| p99 | 99% of requests are better than this | Worst-case for regular traffic |
| p99.9 | 999 of 1000 requests are better | Long tail, often indicates systemic issues |
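To see that arithmetic in action, here is a quick check of the distribution from 'Why Averages Lie' above (950 requests at 50 ms, 50 at 2000 ms); the numbers come from that example, not from real measurements:

```python
# The distribution from the example above: 95% of requests at 50 ms, 5% at 2000 ms.
import statistics

latencies_ms = [50] * 950 + [2000] * 50
cuts = statistics.quantiles(latencies_ms, n=100)  # cut points for p1..p99

print(f"mean: {statistics.fmean(latencies_ms):.1f} ms")  # 147.5 -- looks acceptable
print(f"p50:  {cuts[49]:.0f} ms")                        # 50 -- the median user is fine
print(f"p90:  {cuts[89]:.0f} ms")                        # 50 -- still fine
print(f"p99:  {cuts[98]:.0f} ms")                        # 2000 -- the pain the average hides
```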
```yaml
# Alert on latency percentiles, not averages

# p50 alert - if the median is high, the whole service is slow
- alert: LatencyHighP50
  expr: histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m])) > 0.2
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Median latency exceeds 200ms"
    description: "Half of all requests taking >200ms indicates a systemic issue"

# p90 alert - catches degradation affecting many users
- alert: LatencyHighP90
  expr: histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "P90 latency exceeds 500ms"
    description: "10% of users experiencing significant delays"

# p99 alert - catches long tail problems
- alert: LatencyHighP99
  expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2.0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "P99 latency exceeds 2 seconds"
    description: "Long tail latency affecting 1% of requests"

# Multi-percentile alert - degradation across the board
- alert: LatencyDegradationWidespread
  expr: |
    histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[10m])) > 0.1
    and histogram_quantile(0.9, rate(http_request_duration_seconds_bucket[10m])) > 0.3
    and histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[10m])) > 1.0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Widespread latency degradation across all percentiles"
```

Alert on p50/p90 for page-worthy incidents (widespread impact). Alert on p99/p99.9 for ticket-worthy investigation (tail latency issues). The higher the percentile, the more sensitive to outliers—which is useful for investigation but noisy for paging.
Threshold selection isn't a one-time decision—it's an ongoing process of refinement based on real-world feedback. The goal is to maximize the ratio of true positives to false positives while minimizing missed incidents.
The Tuning Cycle

```
┌──────────────────┐
│   Set Initial    │
│   Thresholds     │◄──────────────────────────────────────────┐
└────────┬─────────┘                                            │
         │                                                      │
         ▼                                                      │
┌──────────────────┐                                            │
│  Monitor Alert   │                                            │
│    Behavior      │                                            │
└────────┬─────────┘                                            │
         │                                                      │
         ▼                                                      │
┌──────────────────┐     ┌──────────────────────────────────┐   │
│  Classify Each   │────▶│ True Positive: Correct!          │   │
│      Alert       │     │ False Positive: Threshold too    │   │
└────────┬─────────┘     │   sensitive, raise it            │   │
         │               │ False Negative: Threshold too    │   │
         │               │   lenient, lower it              │   │
         │               │ True Negative: Correct!          │   │
         ▼               └──────────────────────────────────┘   │
┌──────────────────┐                                            │
│ Gather Feedback  │                                            │
│ from Responders  │                                            │
└────────┬─────────┘                                            │
         │                                                      │
         ▼                                                      │
┌──────────────────┐                                            │
│ Analyze Patterns │                                            │
│   and Adjust     │────────────────────────────────────────────┘
└──────────────────┘
```
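One way to make the classification step actionable is to track precision (how often an alert was worth responding to) and recall (how often real incidents actually alerted) per rule over a review period. A minimal sketch, with invented counts for illustration:

```python
# Track signal quality per alert rule from post-incident classifications.
# Counts are invented for illustration; gather them from your own review process.
alert_reviews = {
    # rule name: (true positives, false positives, false negatives / missed incidents)
    "api_latency_p99_critical": (8, 2, 1),
    "disk_space_warning":       (3, 14, 0),
    "traffic_anomaly":          (5, 1, 4),
}

for rule, (tp, fp, fn) in alert_reviews.items():
    precision = tp / (tp + fp)   # fraction of alerts that were worth responding to
    recall = tp / (tp + fn)      # fraction of real incidents that actually alerted
    print(f"{rule:<28} precision={precision:.0%}  recall={recall:.0%}")

# Low precision (disk_space_warning) -> threshold too sensitive, raise it or add duration.
# Low recall (traffic_anomaly)       -> threshold too lenient, lower it.
```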
Beware of raising thresholds just because a condition persists. If your error rate has 'always' been 3%, you might raise the alert threshold to 5%. But that 3% may represent a real, fixable problem. Sometimes the right response is fixing the root cause rather than adjusting the alert.
Let's examine real-world threshold configurations for common alerting scenarios, with rationale for each choice.
```yaml
# =====================================================
# AVAILABILITY ALERTS
# =====================================================

availability_alerts:

  # HTTP success rate
  - name: http_success_rate_critical
    condition: success_rate[5m] < 0.99
    for: 2m
    severity: critical
    threshold_rationale: |
      99% success rate means 1% errors. For a 99.9% SLO,
      1% sustained error rate consumes 10x the budget - clearly critical.

  - name: http_success_rate_warning
    condition: success_rate[5m] < 0.995
    for: 5m
    severity: warning
    threshold_rationale: |
      0.5% error rate is elevated but may recover.
      Longer 'for' duration filters transient spikes.

# =====================================================
# LATENCY ALERTS
# =====================================================

latency_alerts:

  # API latency - SLO is p99 < 500ms
  - name: api_latency_p99_critical
    condition: latency_p99[5m] > 1.0  # 1 second
    for: 5m
    severity: critical
    threshold_rationale: |
      2x the SLO target indicates severe degradation.
      Users are actively frustrated at this latency.

  - name: api_latency_p99_warning
    condition: latency_p99[5m] > 0.5  # 500ms
    for: 10m
    severity: warning
    threshold_rationale: |
      At SLO boundary; longer duration means we're sure
      it's not a transient spike.

  - name: api_latency_p50_critical
    condition: latency_p50[5m] > 0.5  # 500ms
    for: 2m
    severity: critical
    threshold_rationale: |
      If MEDIAN latency is at the p99 target, the entire
      service is slow. Immediate action needed.

# =====================================================
# RESOURCE UTILIZATION ALERTS
# =====================================================

resource_alerts:

  # CPU - multi-threshold approach
  - name: cpu_high_sustained
    condition: cpu_utilization[15m] > 0.85
    for: 10m
    severity: warning
    threshold_rationale: |
      85% for 10+ minutes indicates genuinely high load,
      not a processing spike. 25 minutes total before alert fires.

  - name: cpu_critical
    condition: cpu_utilization[5m] > 0.95
    for: 5m
    severity: critical
    threshold_rationale: |
      95% leaves no headroom. System likely degraded
      or about to become unresponsive.

  # Memory - OOM prevention
  - name: memory_pressure_warning
    condition: memory_available_bytes < 1GB
    for: 5m
    severity: warning
    threshold_rationale: |
      1GB buffer is typically enough for most applications
      but getting close to danger.

  - name: memory_pressure_critical
    condition: memory_available_bytes < 256MB
    for: 1m
    severity: critical
    threshold_rationale: |
      OOM killer likely to trigger soon. Very short 'for'
      duration because impact is imminent.

  # Disk - growth-aware thresholds
  - name: disk_space_warning
    condition: disk_used_percent > 80
    for: 30m
    severity: warning
    threshold_rationale: |
      80% typically allows days of growth. 30m duration filters
      temporary spikes from large file processing.

  - name: disk_space_critical
    condition: disk_used_percent > 95
    for: 5m
    severity: critical
    threshold_rationale: |
      5% remaining. Some systems reserve 5% for root.
      Immediate action needed.

  # Smarter: Alert on growth rate, not just level
  - name: disk_growth_unsustainable
    condition: predict_linear(disk_used_bytes[1h], 24*3600) > disk_total_bytes
    for: 30m
    severity: warning
    threshold_rationale: |
      If the current growth rate continues, the disk fills in 24h.
      Creates a ticket before the emergency hits.
```

Setting effective thresholds is both art and science. Let's consolidate the key principles:

- Match the threshold type to the metric: static for hard limits and contractual SLOs, dynamic or percentage-based for traffic-dependent metrics.
- Derive thresholds from real data, leaving a margin between normal variation and the failure zone.
- Prefer percentiles over averages for latency and other right-skewed metrics.
- Use multi-window confirmation to filter transient spikes while still catching slow burns.
- Document the rationale behind every threshold so responders understand why it exists.
- Treat every threshold as a hypothesis: classify alerts as true or false positives, gather responder feedback, and tune continuously.
What's Next:

Even with perfect thresholds, alert volume can become unsustainable. The next page explores reducing alert fatigue—strategies for managing alert volume, correlating related alerts, and ensuring that when an alert fires, it receives the attention it deserves.
You now understand the different types of thresholds, when to apply each, and how to tune them for optimal signal-to-noise ratio. The key insight: thresholds are hypotheses about system behavior that must be continuously validated against reality.