On February 28, 2017, Amazon S3 suffered one of the most impactful outages in cloud computing history. A single typo in a command removed more servers than intended, triggering a cascade that took down a significant portion of the internet for nearly four hours. The impact was staggering: major websites went dark, IoT devices stopped working, and businesses lost millions in revenue.
But here's what's often overlooked in the post-mortem discussions: the time between when the problem started and when Amazon knew they had a problem. In incident management, this gap—called the Time to Detection (TTD)—is one of the most critical metrics an organization can optimize. Every minute of delayed detection translates directly into extended customer impact, revenue loss, and reputational damage.
Incident detection is the foundation of incident management. You cannot respond to what you don't know exists. Yet many organizations invest heavily in response processes while neglecting the detection mechanisms that trigger those processes. This is akin to training firefighters without installing smoke detectors.
By the end of this page, you will understand the complete landscape of incident detection: from proactive monitoring and alerting strategies to user-reported issues, from anomaly detection algorithms to synthetic monitoring. You'll learn how to design detection systems that minimize both Time to Detection and false positives, enabling your organization to identify and begin resolving incidents within minutes, not hours.
To understand why incident detection deserves significant engineering investment, we need to examine the anatomy of an incident from a timeline perspective. Every incident follows a lifecycle: onset, when the problem begins; detection, when the organization becomes aware of it; response, when people engage; mitigation, when customer impact ends; and resolution, when the underlying cause is fully addressed.
The interval between onset and detection—the Time to Detection (TTD)—is pure waste. During this window, customers experience degradation while the organization remains oblivious. Unlike response and mitigation time, which can be reduced through training and automation, detection latency often goes unmeasured and unoptimized.
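Measuring TTD is straightforward once incidents record when the problem actually began and when it was first detected. The sketch below is a minimal illustration, not tied to any particular incident tool; the field names are hypothetical:

```python
# Minimal sketch: computing Time to Detection from incident records.
# Field names (started_at, detected_at) are illustrative, not a specific tool's schema.
from datetime import datetime
from statistics import mean

incidents = [
    {"id": "INC-101", "started_at": datetime(2024, 3, 1, 14, 2), "detected_at": datetime(2024, 3, 1, 14, 5)},
    {"id": "INC-102", "started_at": datetime(2024, 3, 9, 3, 40), "detected_at": datetime(2024, 3, 9, 4, 22)},
]

ttd_minutes = [
    (inc["detected_at"] - inc["started_at"]).total_seconds() / 60
    for inc in incidents
]

for inc, ttd in zip(incidents, ttd_minutes):
    print(f'{inc["id"]}: TTD = {ttd:.0f} min')

# Mean TTD is the number to measure and drive down over time
print(f"Mean TTD: {mean(ttd_minutes):.1f} min")
```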
| Detection Delay | Potential Impact | Customer Experience | Business Cost |
|---|---|---|---|
| < 1 minute | Minimal - caught before widespread impact | Momentary blip, if noticed | Negligible |
| 1-5 minutes | Moderate - some users affected | Noticeable errors or latency | $10K-$100K for large services |
| 5-30 minutes | Significant - broad user impact | Sustained degradation, complaints begin | $100K-$1M, customer churn begins |
| 30-60 minutes | Severe - major outage | Service unusable, social media mentions | $1M-$10M, reputational damage |
| > 60 minutes | Critical - catastrophic failure | Complete service unavailability | $10M+, potential regulatory scrutiny |
The Detection Investment Case
Consider a service processing $1 million in transactions per hour. A one-minute reduction in detection latency—achieved through better monitoring—saves approximately $16,667 per incident. For a service experiencing one significant incident monthly, that's $200,000 annually. The monitoring infrastructure enabling that improvement costs a fraction of this amount.
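The arithmetic is simple enough to sanity-check. A quick sketch using the illustrative figures above:

```python
# Back-of-the-envelope sketch of the detection investment case.
# All inputs are the illustrative figures from the text, not real data.
revenue_per_hour = 1_000_000          # $ processed per hour
minutes_saved_per_incident = 1        # detection-latency improvement
incidents_per_year = 12               # one significant incident per month

savings_per_incident = revenue_per_hour / 60 * minutes_saved_per_incident
annual_savings = savings_per_incident * incidents_per_year

print(f"Per incident: ${savings_per_incident:,.0f}")   # ~ $16,667
print(f"Per year:     ${annual_savings:,.0f}")         # ~ $200,000
```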
But the financial case understates the true value. Detection latency compounds: a problem detected early is often simpler to diagnose and faster to mitigate. A memory leak caught at 70% utilization is trivially resolved with a pod restart; caught at 99% utilization during a traffic spike, it may require complex emergency procedures while the service is actively failing.
Detection effectiveness becomes organizational capability. Organizations that detect quickly develop confidence to deploy more frequently, experiment more boldly, and operate more leanly—because they know they'll catch problems early.
For every major incident an organization experiences, there are typically 10 moderate incidents and 100 minor incidents that were detected and resolved before escalating. Excellent detection doesn't just reduce MTTR—it prevents incidents from ever becoming visible to customers or executives.
Incident detection mechanisms fall into two fundamental categories, each with distinct characteristics, strengths, and applications:
Proactive Detection: The system or its observers identify the problem before users notice or report it. Proactive detection is the gold standard—it means your observability infrastructure is working.
Reactive Detection: Users, customers, or external observers report the problem to you. Reactive detection means your proactive mechanisms failed, but it's still valuable as a backstop and often catches issues that automated systems miss.
World-class incident detection employs both approaches in depth, creating multiple layers of detection that complement each other.
The Proactive-Reactive Ratio
A key metric for detection maturity is the ratio of proactively-detected incidents to reactively-detected incidents. World-class organizations aim for 90%+ proactive detection—meaning that 9 out of 10 incidents are caught by internal monitoring before any customer needs to report them.
This ratio reveals the effectiveness of your observability investment. A low proactive ratio indicates gaps in monitoring coverage, alert configuration, or detection infrastructure. It signals that you're relying on customers as unpaid quality assurance—a pattern that erodes trust and competitiveness.
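A minimal sketch of how this ratio might be tracked from incident records; the `detected_by` tags are hypothetical labels you would record when closing an incident:

```python
# Sketch: computing the proactive-to-reactive detection ratio.
incidents = [
    {"id": "INC-201", "detected_by": "alert"},
    {"id": "INC-202", "detected_by": "synthetic"},
    {"id": "INC-203", "detected_by": "customer_report"},
    {"id": "INC-204", "detected_by": "alert"},
]

# Sources counted as proactive (internal monitoring caught the issue first)
PROACTIVE_SOURCES = {"alert", "anomaly_detection", "synthetic"}

proactive = sum(1 for i in incidents if i["detected_by"] in PROACTIVE_SOURCES)
ratio = proactive / len(incidents)

print(f"Proactive detection ratio: {ratio:.0%}")  # target: 90%+
```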
When customers report issues faster than your monitoring detects them, you have a serious observability gap. This often happens because monitoring focuses on infrastructure metrics (CPU, memory) rather than user-facing outcomes (checkout success rate, page load time). Always monitor what matters to users, not just what's easy to measure.
Threshold-based alerting is the most common detection mechanism. It operates on a simple principle: when a metric crosses a predefined boundary, trigger an alert. Despite—or perhaps because of—its simplicity, threshold alerting requires careful design to be effective.
Anatomy of a Threshold Alert
A well-constructed threshold alert has several components:
```yaml
# Prometheus Alerting Rule Example
# Demonstrating best practices in threshold-based alerting

groups:
  - name: payment-service-alerts
    rules:
      # Critical: Payment processing failure rate too high
      # This directly impacts revenue - immediate response required
      - alert: PaymentFailureRateCritical
        expr: |
          (
            sum(rate(payment_transactions_total{status="failed"}[5m]))
            /
            sum(rate(payment_transactions_total[5m]))
          ) > 0.05
        for: 2m
        labels:
          severity: critical
          service: payment
          team: payments
        annotations:
          summary: "Payment failure rate exceeds 5%"
          description: |
            Payment failure rate is {{ $value | humanizePercentage }} over the last 5 minutes.
            This is above the critical threshold of 5%.
            Immediate investigation required - revenue impact in progress.
          runbook_url: https://runbooks.example.com/payments/high-failure-rate
          dashboard: https://grafana.example.com/d/payments-overview

      # Warning: Payment latency degradation
      # Users experiencing slow checkouts - investigate soon
      - alert: PaymentLatencyHigh
        expr: |
          histogram_quantile(0.99,
            rate(payment_processing_duration_seconds_bucket[5m])
          ) > 2.0
        for: 5m
        labels:
          severity: warning
          service: payment
          team: payments
        annotations:
          summary: "P99 payment latency exceeds 2 seconds"
          description: |
            P99 payment processing latency is {{ $value | humanizeDuration }}.
            Users are experiencing slow checkout. SLO target is 1.5 seconds at P99.
          runbook_url: https://runbooks.example.com/payments/high-latency

      # Predictive: Payment service approaching capacity
      # Warning before we hit actual problems
      - alert: PaymentServiceNearCapacity
        expr: |
          (
            rate(payment_transactions_total[5m])
            /
            payment_max_throughput_per_second
          ) > 0.8
        for: 10m
        labels:
          severity: warning
          service: payment
          team: payments
        annotations:
          summary: "Payment service at 80%+ capacity"
          description: |
            Payment throughput is at {{ $value | humanizePercentage }} of max capacity.
            Consider scaling up before traffic increases further.
```

Threshold Selection: The Art and Science
Choosing the right threshold is where detection design becomes nuanced. Set thresholds too tight, and you'll drown in false positives (alert fatigue). Set them too loose, and you'll miss real incidents. Several strategies inform threshold selection:
1. SLO-Based Thresholds: Derive thresholds from your Service Level Objectives. If your SLO requires 99.9% availability, alert when you're burning through your error budget faster than sustainable—typically when error rate exceeds 10× the budgeted rate.
2. Historical Baseline: Analyze historical data to understand normal behavior. Set thresholds at statistical boundaries—for example, 3 standard deviations above the mean, or at the 99th percentile of historical values.
3. Capacity-Based Thresholds: For resource metrics, alert at percentages of capacity (e.g., 80% CPU, 90% disk) that provide enough room for remediation before exhaustion.
4. Business-Derived Thresholds: Some thresholds come directly from business requirements. If the product team says checkout must complete in under 3 seconds, that's your latency threshold—regardless of historical behavior.
Sophisticated alerting uses multiple thresholds and time windows for the same metric. For example: Warning at 1% error rate (5-minute average), Critical at 5% error rate (1-minute average). This provides early warning for developing problems while reserving immediate escalation for acute failures.
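A rough sketch of this idea follows; the thresholds and windows are the illustrative numbers from this section, not a standard, and the burn-rate helper simply expresses how much faster than budgeted the error budget is being consumed:

```python
# Sketch of multi-window, multi-threshold evaluation of one error-rate signal.
# Error rates are fractions (0.01 == 1%); thresholds mirror the example in the text.
def classify(error_rate_1m: float, error_rate_5m: float) -> str:
    """Return an alert level from two averaging windows of the same metric."""
    if error_rate_1m > 0.05:      # acute failure: 5% over 1 minute -> page now
        return "critical"
    if error_rate_5m > 0.01:      # developing problem: 1% over 5 minutes -> warn
        return "warning"
    return "ok"

def burn_rate(error_rate: float, slo_availability: float = 0.999) -> float:
    """How many times faster than budgeted we are consuming the error budget."""
    budget = 1 - slo_availability            # 0.1% budgeted error rate for a 99.9% SLO
    return error_rate / budget

print(classify(error_rate_1m=0.002, error_rate_5m=0.015))   # warning
print(classify(error_rate_1m=0.08,  error_rate_5m=0.03))    # critical
print(f"burn rate: {burn_rate(0.01):.0f}x the budgeted rate")  # 10x
```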
Fixed thresholds work well for stable metrics but struggle with systems that exhibit complex, time-varying behavior. Consider an e-commerce platform where normal traffic varies by 10× between 3 AM and 3 PM, with additional variation by day of week and season. A fixed threshold that catches problems during peak traffic would trigger constantly during off-peak hours, while one tuned for off-peak would miss problems during peaks.
Anomaly detection addresses this by learning normal patterns and alerting on deviations from those patterns rather than fixed values. This requires more sophisticated algorithms but provides superior detection for complex systems.
Types of Anomaly Detection Approaches
Statistical Methods: Simple but effective approaches based on statistical modeling
Machine Learning Methods: More sophisticated pattern recognition
Time-Series Specialized Methods: Purpose-built for metric data
"""Anomaly Detection for Incident Detection This example demonstrates a practical approach to anomaly detectionfor system metrics, combining statistical methods with operationalconsiderations like seasonality and alert dampening.""" import numpy as npfrom scipy import statsfrom dataclasses import dataclassfrom typing import List, Optional, Tuplefrom datetime import datetime, timedeltafrom collections import deque @dataclassclass AnomalyResult: """Result of anomaly detection analysis""" is_anomaly: bool current_value: float expected_value: float deviation_score: float # Standard deviations from expected confidence: float # 0-1, higher means more confident context: str class SeasonalAnomalyDetector: """ Anomaly detector that accounts for daily and weekly seasonality. This approach is practical for most production systems: - Learns patterns from historical data - Handles multiple seasonality (hour-of-day, day-of-week) - Provides confidence scores for operations - Supports gradual threshold adjustment """ def __init__( self, sensitivity: float = 3.0, # std deviations for anomaly min_data_points: int = 168, # 1 week of hourly data learning_rate: float = 0.1, ): self.sensitivity = sensitivity self.min_data_points = min_data_points self.learning_rate = learning_rate # Historical profiles: hourly buckets for each day of week # Shape: (7 days, 24 hours, list of values) self.hourly_profiles = [[[] for _ in range(24)] for _ in range(7)] self.data_points_seen = 0 # Recent values for short-term trend detection self.recent_values = deque(maxlen=60) # Last hour at minute granularity def add_observation(self, value: float, timestamp: datetime) -> None: """Record an observation for learning""" day_of_week = timestamp.weekday() hour = timestamp.hour # Add to seasonal profile self.hourly_profiles[day_of_week][hour].append(value) # Limit history per bucket to prevent memory bloat max_per_bucket = 100 if len(self.hourly_profiles[day_of_week][hour]) > max_per_bucket: # Keep recent values with exponential decay self.hourly_profiles[day_of_week][hour] = ( self.hourly_profiles[day_of_week][hour][-max_per_bucket:] ) self.recent_values.append((timestamp, value)) self.data_points_seen += 1 def get_expected_range( self, timestamp: datetime ) -> Tuple[float, float, float]: """ Calculate expected value and acceptable range for a given time. Returns: (expected_value, lower_bound, upper_bound) """ day_of_week = timestamp.weekday() hour = timestamp.hour bucket = self.hourly_profiles[day_of_week][hour] if len(bucket) < 10: # Not enough data for this specific bucket # Fall back to all data for this hour across all days all_hour_data = [] for day in range(7): all_hour_data.extend(self.hourly_profiles[day][hour]) bucket = all_hour_data if all_hour_data else [0] mean = np.mean(bucket) std = np.std(bucket) if len(bucket) > 1 else mean * 0.1 # Ensure minimum standard deviation (10% of mean or 1) std = max(std, mean * 0.1, 1.0) lower = mean - (self.sensitivity * std) upper = mean + (self.sensitivity * std) return float(mean), float(lower), float(upper) def detect( self, value: float, timestamp: datetime ) -> AnomalyResult: """ Analyze if a value is anomalous given the timestamp. 
""" # Check if we have enough data to make reliable predictions if self.data_points_seen < self.min_data_points: return AnomalyResult( is_anomaly=False, current_value=value, expected_value=value, deviation_score=0.0, confidence=0.0, context=f"Insufficient data ({self.data_points_seen}/{self.min_data_points})" ) expected, lower, upper = self.get_expected_range(timestamp) # Calculate deviation in terms of standard deviations bucket = self._get_bucket(timestamp) std = np.std(bucket) if len(bucket) > 1 else expected * 0.1 std = max(std, expected * 0.1, 1.0) deviation_score = abs(value - expected) / std # Is it anomalous? is_anomaly = value < lower or value > upper # Calculate confidence based on data quality data_quality = min(len(bucket) / 50, 1.0) # More data = higher confidence confidence = data_quality * min(deviation_score / self.sensitivity, 1.0) # Build context string direction = "above" if value > expected else "below" context = ( f"Value {value:.2f} is {deviation_score:.1f}σ {direction} expected " f"({expected:.2f}). Normal range: [{lower:.2f}, {upper:.2f}]" ) return AnomalyResult( is_anomaly=is_anomaly, current_value=value, expected_value=expected, deviation_score=deviation_score, confidence=confidence, context=context ) def _get_bucket(self, timestamp: datetime) -> List[float]: """Get the historical bucket for a timestamp""" day_of_week = timestamp.weekday() hour = timestamp.hour return self.hourly_profiles[day_of_week][hour] class MultiMetricAnomalyDetector: """ Coordinates anomaly detection across multiple related metrics. This helps reduce false positives by correlating anomalies: - If only one metric is anomalous, confidence is lower - If multiple correlated metrics are anomalous, confidence is higher """ def __init__(self, metric_names: List[str], sensitivity: float = 3.0): self.detectors = { name: SeasonalAnomalyDetector(sensitivity=sensitivity) for name in metric_names } self.metric_names = metric_names def add_observations( self, values: dict[str, float], timestamp: datetime ) -> None: """Add observations for all metrics""" for name, value in values.items(): if name in self.detectors: self.detectors[name].add_observation(value, timestamp) def detect_correlated( self, values: dict[str, float], timestamp: datetime ) -> dict: """ Detect anomalies with correlation analysis. Returns enhanced results considering metric correlations. 
""" results = {} anomaly_count = 0 # Run detection for each metric for name, value in values.items(): if name in self.detectors: result = self.detectors[name].detect(value, timestamp) results[name] = result if result.is_anomaly: anomaly_count += 1 # Adjust confidence based on correlation correlation_factor = anomaly_count / len(self.metric_names) return { "individual_results": results, "anomalous_metrics": anomaly_count, "total_metrics": len(self.metric_names), "correlation_factor": correlation_factor, "high_confidence_incident": correlation_factor > 0.5, } # Usage Exampleif __name__ == "__main__": # Initialize detector for request latency detector = SeasonalAnomalyDetector( sensitivity=3.0, # Alert on 3+ standard deviations min_data_points=168, # Require 1 week of hourly data ) # Simulate loading historical data base_time = datetime.now() - timedelta(days=14) for day in range(14): for hour in range(24): # Simulate traffic pattern: higher during business hours base_latency = 100 + (50 if 9 <= hour <= 17 else 0) # Add day-of-week pattern: higher on weekdays if (base_time + timedelta(days=day)).weekday() < 5: base_latency += 20 # Add noise value = base_latency + np.random.normal(0, 10) timestamp = base_time + timedelta(days=day, hours=hour) detector.add_observation(value, timestamp) # Test detection test_time = datetime.now().replace(hour=14, minute=0) # 2 PM # Normal value result = detector.detect(155.0, test_time) print(f"Normal value: {result}") # Anomalous value (latency spike) result = detector.detect(350.0, test_time) print(f"Anomalous value: {result}")Operational Considerations for Anomaly Detection
False Positive Management: Anomaly detection inherently trades off sensitivity against false positives. Start conservative (a higher deviation threshold, such as 4σ rather than 3σ) and tighten it as you gain confidence in the model's accuracy.
Training Data Quality: Models are only as good as their training data. Ensure your historical data doesn't include past incidents—or mark those periods as anomalous in training so the model doesn't learn them as "normal."
Concept Drift: Systems evolve. New features, architecture changes, and traffic growth all shift what's "normal." Implement continuous learning or periodic retraining to keep models current.
Interpretability: When anomaly detection triggers, responders need to understand why. Provide context: expected range, deviation magnitude, recent trend, and any correlated metrics showing similar behavior.
Anomaly detection requires historical data to establish baselines. New services, metrics, or environments lack this history. Plan for a "cold start" period where detection relies on static thresholds or conservative defaults while the model learns normal behavior. Typically, 2-4 weeks of data is needed for reliable seasonality detection.
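One way to handle the cold start, assuming the SeasonalAnomalyDetector sketched earlier, is a thin wrapper that falls back to a static threshold until enough history has accumulated; the threshold value here is purely illustrative:

```python
from datetime import datetime

def is_anomalous(detector, value: float, now: datetime,
                 static_threshold: float = 500.0) -> bool:
    """Cold-start guard around the seasonal detector from the example above."""
    if detector.data_points_seen < detector.min_data_points:
        # Not enough history for seasonal baselines yet: use a conservative static limit
        return value > static_threshold
    return detector.detect(value, now).is_anomaly
```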
Synthetic monitoring—also called proactive monitoring or active monitoring—generates artificial traffic to test system behavior from the user's perspective. Rather than waiting for real users to encounter problems, synthetic monitors continuously exercise critical paths, detecting failures even when no real users are present.
Why Synthetic Monitoring Matters
Real user monitoring (RUM) tells you how actual users are experiencing your system. But RUM has limitations: it produces no signal when traffic is low (overnight, in new regions, for newly launched features), it only covers the paths users actually exercise, and by definition it can only surface a problem after real users have already been affected.
Synthetic monitoring fills these gaps by providing consistent, predictable test coverage regardless of real traffic patterns.
| Strategy | What It Tests | Frequency | Best For |
|---|---|---|---|
| Availability Checks | Endpoint responds with 2xx | Every 1-5 minutes | Basic uptime monitoring |
| API Contract Tests | Response schema and values | Every 5-15 minutes | API correctness |
| End-to-End Transactions | Complete user flows | Every 5-15 minutes | Critical business processes |
| Multi-Region Probes | Availability from multiple locations | Every 1-5 minutes | Geographic redundancy |
| SSL/TLS Validation | Certificate validity and expiration | Every hour to daily | Security and trust |
| Dependency Checks | Third-party service availability | Every 1-5 minutes | External dependency health |
| Performance Baselines | Response time benchmarks | Every 5-15 minutes | Performance regression detection |
```typescript
/**
 * Synthetic Monitoring Implementation
 *
 * This implementation demonstrates a production-grade approach
 * to synthetic monitoring with proper alerting, metrics, and
 * failure handling.
 */

interface SyntheticCheckResult {
  checkName: string;
  success: boolean;
  durationMs: number;
  statusCode?: number;
  error?: string;
  timestamp: Date;
  location: string;
}

interface CheckDefinition {
  name: string;
  url: string;
  method: 'GET' | 'POST' | 'PUT' | 'DELETE';
  headers?: Record<string, string>;
  body?: unknown;
  expectedStatus: number;
  timeoutMs: number;
  assertions?: ((response: Response, body: unknown) => void)[];
}

class SyntheticMonitor {
  private checks: CheckDefinition[] = [];
  private results: SyntheticCheckResult[] = [];
  private alertCallback?: (result: SyntheticCheckResult) => void;
  private metricsCallback?: (result: SyntheticCheckResult) => void;
  private location: string;

  constructor(location: string = 'us-east-1') {
    this.location = location;
  }

  /**
   * Register a check to be executed
   */
  addCheck(check: CheckDefinition): void {
    this.checks.push(check);
  }

  /**
   * Register alert callback for failed checks
   */
  onAlert(callback: (result: SyntheticCheckResult) => void): void {
    this.alertCallback = callback;
  }

  /**
   * Register metrics callback for all check results
   */
  onMetrics(callback: (result: SyntheticCheckResult) => void): void {
    this.metricsCallback = callback;
  }

  /**
   * Execute a single check
   */
  async runCheck(check: CheckDefinition): Promise<SyntheticCheckResult> {
    const startTime = Date.now();

    try {
      const controller = new AbortController();
      const timeout = setTimeout(() => controller.abort(), check.timeoutMs);

      const response = await fetch(check.url, {
        method: check.method,
        headers: {
          'User-Agent': 'SyntheticMonitor/1.0',
          ...check.headers,
        },
        body: check.body ? JSON.stringify(check.body) : undefined,
        signal: controller.signal,
      });

      clearTimeout(timeout);
      const durationMs = Date.now() - startTime;

      // Read the body once, then attempt JSON parsing
      // (reading the stream twice would fail once it has been consumed)
      const rawBody = await response.text();
      let body: unknown;
      try {
        body = JSON.parse(rawBody);
      } catch {
        body = rawBody;
      }

      // Check basic status code
      if (response.status !== check.expectedStatus) {
        return {
          checkName: check.name,
          success: false,
          durationMs,
          statusCode: response.status,
          error: `Expected status ${check.expectedStatus}, got ${response.status}`,
          timestamp: new Date(),
          location: this.location,
        };
      }

      // Run custom assertions
      if (check.assertions) {
        for (const assertion of check.assertions) {
          try {
            assertion(response, body);
          } catch (assertionError) {
            return {
              checkName: check.name,
              success: false,
              durationMs,
              statusCode: response.status,
              error: `Assertion failed: ${(assertionError as Error).message}`,
              timestamp: new Date(),
              location: this.location,
            };
          }
        }
      }

      return {
        checkName: check.name,
        success: true,
        durationMs,
        statusCode: response.status,
        timestamp: new Date(),
        location: this.location,
      };
    } catch (error) {
      const durationMs = Date.now() - startTime;
      const errorMessage = error instanceof Error ? error.message : 'Unknown error';

      return {
        checkName: check.name,
        success: false,
        durationMs,
        error: errorMessage,
        timestamp: new Date(),
        location: this.location,
      };
    }
  }

  /**
   * Run all registered checks
   */
  async runAllChecks(): Promise<SyntheticCheckResult[]> {
    const results = await Promise.all(
      this.checks.map(check => this.runCheck(check))
    );

    // Emit metrics for all results
    results.forEach(result => {
      this.metricsCallback?.(result);

      // Alert on failures
      if (!result.success) {
        this.alertCallback?.(result);
      }
    });

    this.results = results;
    return results;
  }

  /**
   * Start continuous monitoring at specified interval
   */
  startContinuousMonitoring(intervalMs: number = 60000): void {
    console.log(`Starting continuous monitoring every ${intervalMs}ms`);

    // Run immediately
    this.runAllChecks();

    // Schedule recurring runs
    setInterval(() => this.runAllChecks(), intervalMs);
  }
}

// Example: Critical Path Synthetic Checks
const monitor = new SyntheticMonitor('us-east-1');

// Check 1: API Health Endpoint
monitor.addCheck({
  name: 'api-health',
  url: 'https://api.example.com/health',
  method: 'GET',
  expectedStatus: 200,
  timeoutMs: 5000,
  assertions: [
    (response, body) => {
      const health = body as { status: string };
      if (health.status !== 'healthy') {
        throw new Error(`Health status is ${health.status}`);
      }
    }
  ]
});

// Check 2: Authentication Flow
monitor.addCheck({
  name: 'auth-flow',
  url: 'https://api.example.com/auth/token',
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: {
    grant_type: 'client_credentials',
    client_id: process.env.SYNTHETIC_CLIENT_ID,
    client_secret: process.env.SYNTHETIC_CLIENT_SECRET,
  },
  expectedStatus: 200,
  timeoutMs: 10000,
  assertions: [
    (response, body) => {
      const token = body as { access_token?: string; expires_in?: number };
      if (!token.access_token) {
        throw new Error('No access token returned');
      }
      if (!token.expires_in || token.expires_in < 300) {
        throw new Error('Token expiration too short');
      }
    }
  ]
});

// Check 3: Payment Gateway Connectivity
monitor.addCheck({
  name: 'payment-gateway-ping',
  url: 'https://api.example.com/payments/health',
  method: 'GET',
  expectedStatus: 200,
  timeoutMs: 5000,
  assertions: [
    (response, body) => {
      const health = body as { gateway_status: string };
      if (health.gateway_status !== 'connected') {
        throw new Error(`Payment gateway: ${health.gateway_status}`);
      }
    }
  ]
});

// Set up alerting
monitor.onAlert((result) => {
  console.error(`🚨 SYNTHETIC CHECK FAILED: ${result.checkName}`);
  console.error(`   Error: ${result.error}`);
  console.error(`   Duration: ${result.durationMs}ms`);
  console.error(`   Location: ${result.location}`);

  // In production: Send to PagerDuty, Slack, etc.
  // await pagerduty.trigger({
  //   summary: `Synthetic check failed: ${result.checkName}`,
  //   severity: 'critical',
  //   ...
  // });
});

// Set up metrics emission
monitor.onMetrics((result) => {
  // In production: Send to Prometheus, DataDog, etc.
  console.log(
    `📊 Metric: synthetic_check_duration{check="${result.checkName}",` +
    `location="${result.location}",success="${result.success}"}` +
    ` ${result.durationMs}`
  );
});

// Start monitoring
monitor.startContinuousMonitoring(60000); // Every minute
```

Despite our best efforts at proactive detection, users will sometimes discover issues before our monitoring catches them. Rather than viewing user reports as monitoring failures, mature organizations treat them as essential signals that complement automated detection.
Why Users Detect What Monitoring Misses
Users encounter scenarios that are difficult to anticipate or synthesize:
Triaging User Reports for Incident Detection
Not every user report indicates an incident. The challenge is distinguishing signal from noise:
Immediate Escalation Indicators:
Investigation-Worthy Signals:
Background Processing:
A practical heuristic: one user report is anecdotal, two are coincidental, three are a pattern. When three independent users report the same issue within a short window, treat it as a potential incident and investigate actively.
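A minimal sketch of this heuristic follows; the report fields and the 30-minute window are illustrative choices, not a standard:

```python
# Sketch: flag an issue when three distinct reporters raise it within a short window.
from datetime import datetime, timedelta
from collections import defaultdict

WINDOW = timedelta(minutes=30)
THRESHOLD = 3  # distinct reporters within the window

def find_potential_incidents(reports):
    by_issue = defaultdict(list)
    for r in reports:
        by_issue[r["issue_key"]].append(r)

    flagged = []
    for issue, items in by_issue.items():
        items.sort(key=lambda r: r["received_at"])
        for i in range(len(items)):
            # Reports received within WINDOW of report i
            window = [r for r in items[i:]
                      if r["received_at"] - items[i]["received_at"] <= WINDOW]
            if len({r["reporter"] for r in window}) >= THRESHOLD:
                flagged.append(issue)
                break
    return flagged

reports = [
    {"issue_key": "checkout-error", "reporter": "u1", "received_at": datetime(2024, 5, 1, 10, 0)},
    {"issue_key": "checkout-error", "reporter": "u2", "received_at": datetime(2024, 5, 1, 10, 12)},
    {"issue_key": "checkout-error", "reporter": "u3", "received_at": datetime(2024, 5, 1, 10, 25)},
]
print(find_potential_incidents(reports))  # ['checkout-error']
```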
Individual detection mechanisms—alerts, anomaly detection, synthetic monitoring, user reports—provide value in isolation. But their true power emerges when integrated into a cohesive detection pipeline that correlates signals, suppresses noise, and routes actionable alerts to responders.
Detection Pipeline Architecture
A modern detection pipeline consists of several stages: ingestion of signals from every source (alerts, anomaly detectors, synthetic checks, user reports), normalization into a common event format, enrichment with context such as service ownership and recent deployments, correlation and deduplication of related signals, suppression of known noise, and routing of actionable incidents to the right responders.
Signal Correlation Strategies
Correlation transforms a flood of independent alerts into coherent incident narratives:
Temporal Correlation: Alerts occurring within a time window (e.g., 5 minutes) are likely related. If CPU spikes, memory alerts trigger, and latency increases all at 2:15 PM, they're probably one incident, not three.
Topological Correlation: Alerts from components in the same call path belong together. A database alert during an API latency alert suggests causation, especially if traces connect them.
Deployment Correlation: Any alerts occurring shortly after a deployment should be grouped and flagged as potentially deployment-related.
Historical Pattern Correlation: If CPU spikes at the same time every day (batch job), correlation engines can suppress or relabel instead of alerting.
Cross-Signal Correlation: A synthetic monitor failure + user reports + latency alert = high confidence incident. A single latency alert = investigate before escalating.
Tools like PagerDuty Event Intelligence, Splunk ITSI, BigPanda, and Moogsoft specialize in alert correlation and noise reduction. These platforms use machine learning to identify patterns, suppress duplicates, and surface actionable incidents from raw alert streams.
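To make the temporal-correlation idea concrete, here is a minimal sketch that chains alerts firing close together into one candidate incident; real platforms layer topology awareness and machine learning on top of this kind of basic grouping:

```python
# Sketch of temporal correlation: alerts that fire within WINDOW of the previous
# alert are chained into one group (candidate incident). Fields are illustrative.
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def group_by_time(alerts):
    """alerts: list of dicts with 'name' and 'fired_at'."""
    alerts = sorted(alerts, key=lambda a: a["fired_at"])
    groups, current = [], []
    for alert in alerts:
        if current and alert["fired_at"] - current[-1]["fired_at"] > WINDOW:
            groups.append(current)
            current = []
        current.append(alert)
    if current:
        groups.append(current)
    return groups

alerts = [
    {"name": "cpu_high",     "fired_at": datetime(2024, 6, 1, 14, 15)},
    {"name": "memory_high",  "fired_at": datetime(2024, 6, 1, 14, 16)},
    {"name": "latency_p99",  "fired_at": datetime(2024, 6, 1, 14, 17)},
    {"name": "disk_warning", "fired_at": datetime(2024, 6, 1, 18, 0)},
]
for group in group_by_time(alerts):
    print([a["name"] for a in group])  # first three group together; the disk alert stands alone
```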
Incident detection is the critical first step in incident response. The effectiveness of your entire incident management program depends on how quickly and accurately you identify problems. Let's consolidate the key principles:
- Measure Time to Detection and treat every minute of it as direct customer impact to be driven down.
- Aim for a high proactive-to-reactive ratio: 90%+ of incidents caught by internal monitoring before any customer report.
- Alert on user-facing outcomes and SLOs, not just infrastructure metrics that are easy to measure.
- Layer detection mechanisms: threshold alerts, anomaly detection, synthetic monitoring, and a triage path for user reports.
- Correlate and suppress signals so responders receive coherent, actionable incidents rather than raw alert noise.
What's Next:
With detection in place, we need processes to respond effectively. The next page explores the Incident Response Process—the structured workflows that translate detection into resolution, covering roles, communication protocols, and the mechanics of incident command.
You now understand the complete landscape of incident detection: from threshold-based alerting and anomaly detection to synthetic monitoring and user report pipelines. Detection is the foundation—without knowing there's a problem, all downstream processes are moot. Next, we'll explore how to respond once an incident is detected.