On February 28, 2017, Amazon S3 suffered one of the most impactful outages in cloud computing history. A single typo in a command removed more servers than intended, triggering a cascade that took down a significant portion of the internet for nearly four hours. The impact was staggering: major websites went dark, IoT devices stopped working, and businesses lost millions in revenue.
But here's what's often overlooked in the post-mortem discussions: the time between when the problem started and when Amazon knew they had a problem. In incident management, this gap—called the Time to Detection (TTD)—is one of the most critical metrics an organization can optimize. Every minute of delayed detection translates directly into extended customer impact, revenue loss, and reputational damage.
Incident detection is the foundation of incident management. You cannot respond to what you don't know exists. Yet many organizations invest heavily in response processes while neglecting the detection mechanisms that trigger those processes. This is akin to training firefighters without installing smoke detectors.
By the end of this page, you will understand the complete landscape of incident detection: from proactive monitoring and alerting strategies to user-reported issues, from anomaly detection algorithms to synthetic monitoring. You'll learn how to design detection systems that minimize both Time to Detection and false positives, enabling your organization to identify and begin resolving incidents within minutes, not hours.
To understand why incident detection deserves significant engineering investment, we need to examine the anatomy of an incident from a timeline perspective. Every incident follows a lifecycle: onset, when the problem begins; detection, when the organization becomes aware of it; response, when people engage; mitigation, when customer impact ends; and resolution, when the underlying cause is fully addressed.
The interval between onset and detection—the Time to Detection (TTD)—is pure waste. During this window, customers experience degradation while the organization remains oblivious. Unlike response and mitigation time, which can be reduced through training and automation, detection latency often goes unmeasured and unoptimized.
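Measuring TTD is straightforward once incidents record when the problem actually began and when it was first detected. The sketch below is a minimal illustration, not tied to any particular incident tool; the field names are hypothetical:

```python
# Minimal sketch: computing Time to Detection from incident records.
# Field names (started_at, detected_at) are illustrative, not a specific tool's schema.
from datetime import datetime
from statistics import mean

incidents = [
    {"id": "INC-101", "started_at": datetime(2024, 3, 1, 14, 2), "detected_at": datetime(2024, 3, 1, 14, 5)},
    {"id": "INC-102", "started_at": datetime(2024, 3, 9, 3, 40), "detected_at": datetime(2024, 3, 9, 4, 22)},
]

ttd_minutes = [
    (inc["detected_at"] - inc["started_at"]).total_seconds() / 60
    for inc in incidents
]

for inc, ttd in zip(incidents, ttd_minutes):
    print(f'{inc["id"]}: TTD = {ttd:.0f} min')

# Mean TTD is the number to measure and drive down over time
print(f"Mean TTD: {mean(ttd_minutes):.1f} min")
```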
| Detection Delay | Potential Impact | Customer Experience | Business Cost |
|---|---|---|---|
| < 1 minute | Minimal - caught before widespread impact | Momentary blip, if noticed | Negligible |
| 1-5 minutes | Moderate - some users affected | Noticeable errors or latency | $10K-$100K for large services |
| 5-30 minutes | Significant - broad user impact | Sustained degradation, complaints begin | $100K-$1M, customer churn begins |
| 30-60 minutes | Severe - major outage | Service unusable, social media mentions | $1M-$10M, reputational damage |
| > 60 minutes | Critical - catastrophic failure | Complete service unavailability | $10M+, potential regulatory scrutiny |
The Detection Investment Case
Consider a service processing $1 million in transactions per hour. A one-minute reduction in detection latency—achieved through better monitoring—saves approximately $16,667 per incident. For a service experiencing one significant incident monthly, that's $200,000 annually. The monitoring infrastructure enabling that improvement costs a fraction of this amount.
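The arithmetic is simple enough to sanity-check. A quick sketch using the illustrative figures above:

```python
# Back-of-the-envelope sketch of the detection investment case.
# All inputs are the illustrative figures from the text, not real data.
revenue_per_hour = 1_000_000          # $ processed per hour
minutes_saved_per_incident = 1        # detection-latency improvement
incidents_per_year = 12               # one significant incident per month

savings_per_incident = revenue_per_hour / 60 * minutes_saved_per_incident
annual_savings = savings_per_incident * incidents_per_year

print(f"Per incident: ${savings_per_incident:,.0f}")   # ~ $16,667
print(f"Per year:     ${annual_savings:,.0f}")         # ~ $200,000
```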
But the financial case understates the true value. Detection latency compounds: a problem detected early is often simpler to diagnose and faster to mitigate. A memory leak caught at 70% utilization is trivially resolved with a pod restart; caught at 99% utilization during a traffic spike, it may require complex emergency procedures while the service is actively failing.
Detection effectiveness becomes organizational capability. Organizations that detect quickly develop confidence to deploy more frequently, experiment more boldly, and operate more leanly—because they know they'll catch problems early.
For every major incident an organization experiences, there are typically 10 moderate incidents and 100 minor incidents that were detected and resolved before escalating. Excellent detection doesn't just reduce MTTR—it prevents incidents from ever becoming visible to customers or executives.
Incident detection mechanisms fall into two fundamental categories, each with distinct characteristics, strengths, and applications:
Proactive Detection: The system or its observers identify the problem before users notice or report it. Proactive detection is the gold standard—it means your observability infrastructure is working.
Reactive Detection: Users, customers, or external observers report the problem to you. Reactive detection means your proactive mechanisms failed, but it's still valuable as a backstop and often catches issues that automated systems miss.
World-class incident detection employs both approaches in depth, creating multiple layers of detection that complement each other.
The Proactive-Reactive Ratio
A key metric for detection maturity is the ratio of proactively-detected incidents to reactively-detected incidents. World-class organizations aim for 90%+ proactive detection—meaning that 9 out of 10 incidents are caught by internal monitoring before any customer needs to report them.
This ratio reveals the effectiveness of your observability investment. A low proactive ratio indicates gaps in monitoring coverage, alert configuration, or detection infrastructure. It signals that you're relying on customers as unpaid quality assurance—a pattern that erodes trust and competitiveness.
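A minimal sketch of how this ratio might be tracked from incident records; the `detected_by` tags are hypothetical labels you would record when closing an incident:

```python
# Sketch: computing the proactive-to-reactive detection ratio.
incidents = [
    {"id": "INC-201", "detected_by": "alert"},
    {"id": "INC-202", "detected_by": "synthetic"},
    {"id": "INC-203", "detected_by": "customer_report"},
    {"id": "INC-204", "detected_by": "alert"},
]

# Sources counted as proactive (internal monitoring caught the issue first)
PROACTIVE_SOURCES = {"alert", "anomaly_detection", "synthetic"}

proactive = sum(1 for i in incidents if i["detected_by"] in PROACTIVE_SOURCES)
ratio = proactive / len(incidents)

print(f"Proactive detection ratio: {ratio:.0%}")  # target: 90%+
```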
When customers report issues faster than your monitoring detects them, you have a serious observability gap. This often happens because monitoring focuses on infrastructure metrics (CPU, memory) rather than user-facing outcomes (checkout success rate, page load time). Always monitor what matters to users, not just what's easy to measure.
Threshold-based alerting is the most common detection mechanism. It operates on a simple principle: when a metric crosses a predefined boundary, trigger an alert. Despite—or perhaps because of—its simplicity, threshold alerting requires careful design to be effective.
Anatomy of a Threshold Alert
A well-constructed threshold alert has several components:
```yaml
# Prometheus Alerting Rule Example
# Demonstrating best practices in threshold-based alerting

groups:
  - name: payment-service-alerts
    rules:
      # Critical: Payment processing failure rate too high
      # This directly impacts revenue - immediate response required
      - alert: PaymentFailureRateCritical
        expr: |
          (
            sum(rate(payment_transactions_total{status="failed"}[5m]))
            /
            sum(rate(payment_transactions_total[5m]))
          ) > 0.05
        for: 2m
        labels:
          severity: critical
          service: payment
          team: payments
        annotations:
          summary: "Payment failure rate exceeds 5%"
          description: |
            Payment failure rate is {{ $value | humanizePercentage }} over the last 5 minutes.
            This is above the critical threshold of 5%.
            Immediate investigation required - revenue impact in progress.
          runbook_url: https://runbooks.example.com/payments/high-failure-rate
          dashboard: https://grafana.example.com/d/payments-overview

      # Warning: Payment latency degradation
      # Users experiencing slow checkouts - investigate soon
      - alert: PaymentLatencyHigh
        expr: |
          histogram_quantile(0.99,
            rate(payment_processing_duration_seconds_bucket[5m])
          ) > 2.0
        for: 5m
        labels:
          severity: warning
          service: payment
          team: payments
        annotations:
          summary: "P99 payment latency exceeds 2 seconds"
          description: |
            P99 payment processing latency is {{ $value | humanizeDuration }}.
            Users are experiencing slow checkout. SLO target is 1.5 seconds at P99.
          runbook_url: https://runbooks.example.com/payments/high-latency

      # Predictive: Payment service approaching capacity
      # Warning before we hit actual problems
      - alert: PaymentServiceNearCapacity
        expr: |
          (
            rate(payment_transactions_total[5m])
            /
            payment_max_throughput_per_second
          ) > 0.8
        for: 10m
        labels:
          severity: warning
          service: payment
          team: payments
        annotations:
          summary: "Payment service at 80%+ capacity"
          description: |
            Payment throughput is at {{ $value | humanizePercentage }} of max capacity.
            Consider scaling up before traffic increases further.
```

Threshold Selection: The Art and Science
Choosing the right threshold is where detection design becomes nuanced. Set thresholds too tight, and you'll drown in false positives (alert fatigue). Set them too loose, and you'll miss real incidents. Several strategies inform threshold selection:
1. SLO-Based Thresholds: Derive thresholds from your Service Level Objectives. If your SLO requires 99.9% availability, alert when you're burning through your error budget faster than sustainable—typically when error rate exceeds 10× the budgeted rate.
2. Historical Baseline: Analyze historical data to understand normal behavior. Set thresholds at statistical boundaries—for example, 3 standard deviations above the mean, or at the 99th percentile of historical values.
3. Capacity-Based Thresholds: For resource metrics, alert at percentages of capacity (e.g., 80% CPU, 90% disk) that provide enough room for remediation before exhaustion.
4. Business-Derived Thresholds: Some thresholds come directly from business requirements. If the product team says checkout must complete in under 3 seconds, that's your latency threshold—regardless of historical behavior.
Sophisticated alerting uses multiple thresholds and time windows for the same metric. For example: Warning at 1% error rate (5-minute average), Critical at 5% error rate (1-minute average). This provides early warning for developing problems while reserving immediate escalation for acute failures.
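A rough sketch of this idea follows; the thresholds and windows are the illustrative numbers from this section, not a standard, and the burn-rate helper simply expresses how much faster than budgeted the error budget is being consumed:

```python
# Sketch of multi-window, multi-threshold evaluation of one error-rate signal.
# Error rates are fractions (0.01 == 1%); thresholds mirror the example in the text.
def classify(error_rate_1m: float, error_rate_5m: float) -> str:
    """Return an alert level from two averaging windows of the same metric."""
    if error_rate_1m > 0.05:      # acute failure: 5% over 1 minute -> page now
        return "critical"
    if error_rate_5m > 0.01:      # developing problem: 1% over 5 minutes -> warn
        return "warning"
    return "ok"

def burn_rate(error_rate: float, slo_availability: float = 0.999) -> float:
    """How many times faster than budgeted we are consuming the error budget."""
    budget = 1 - slo_availability            # 0.1% budgeted error rate for a 99.9% SLO
    return error_rate / budget

print(classify(error_rate_1m=0.002, error_rate_5m=0.015))   # warning
print(classify(error_rate_1m=0.08,  error_rate_5m=0.03))    # critical
print(f"burn rate: {burn_rate(0.01):.0f}x the budgeted rate")  # 10x
```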
Fixed thresholds work well for stable metrics but struggle with systems that exhibit complex, time-varying behavior. Consider an e-commerce platform where normal traffic varies by 10× between 3 AM and 3 PM, with additional variation by day of week and season. A fixed threshold that catches problems during peak traffic would trigger constantly during off-peak hours, while one tuned for off-peak would miss problems during peaks.
Anomaly detection addresses this by learning normal patterns and alerting on deviations from those patterns rather than fixed values. This requires more sophisticated algorithms but provides superior detection for complex systems.
Types of Anomaly Detection Approaches
Statistical Methods: Simple but effective approaches based on statistical modeling
Machine Learning Methods: More sophisticated pattern recognition
Time-Series Specialized Methods: Purpose-built for metric data
"""Anomaly Detection for Incident Detection This example demonstrates a practical approach to anomaly detectionfor system metrics, combining statistical methods with operationalconsiderations like seasonality and alert dampening.""" import numpy as npfrom scipy import statsfrom dataclasses import dataclassfrom typing import List, Optional, Tuplefrom datetime import datetime, timedeltafrom collections import deque @dataclassclass AnomalyResult: """Result of anomaly detection analysis""" is_anomaly: bool current_value: float expected_value: float deviation_score: float # Standard deviations from expected confidence: float # 0-1, higher means more confident context: str class SeasonalAnomalyDetector: """ Anomaly detector that accounts for daily and weekly seasonality. This approach is practical for most production systems: - Learns patterns from historical data - Handles multiple seasonality (hour-of-day, day-of-week) - Provides confidence scores for operations - Supports gradual threshold adjustment """ def __init__( self, sensitivity: float = 3.0, # std deviations for anomaly min_data_points: int = 168, # 1 week of hourly data learning_rate: float = 0.1, ): self.sensitivity = sensitivity self.min_data_points = min_data_points self.learning_rate = learning_rate # Historical profiles: hourly buckets for each day of week # Shape: (7 days, 24 hours, list of values) self.hourly_profiles = [[[] for _ in range(24)] for _ in range(7)] self.data_points_seen = 0 # Recent values for short-term trend detection self.recent_values = deque(maxlen=60) # Last hour at minute granularity def add_observation(self, value: float, timestamp: datetime) -> None: """Record an observation for learning""" day_of_week = timestamp.weekday() hour = timestamp.hour # Add to seasonal profile self.hourly_profiles[day_of_week][hour].append(value) # Limit history per bucket to prevent memory bloat max_per_bucket = 100 if len(self.hourly_profiles[day_of_week][hour]) > max_per_bucket: # Keep recent values with exponential decay self.hourly_profiles[day_of_week][hour] = ( self.hourly_profiles[day_of_week][hour][-max_per_bucket:] ) self.recent_values.append((timestamp, value)) self.data_points_seen += 1 def get_expected_range( self, timestamp: datetime ) -> Tuple[float, float, float]: """ Calculate expected value and acceptable range for a given time. Returns: (expected_value, lower_bound, upper_bound) """ day_of_week = timestamp.weekday() hour = timestamp.hour bucket = self.hourly_profiles[day_of_week][hour] if len(bucket) < 10: # Not enough data for this specific bucket # Fall back to all data for this hour across all days all_hour_data = [] for day in range(7): all_hour_data.extend(self.hourly_profiles[day][hour]) bucket = all_hour_data if all_hour_data else [0] mean = np.mean(bucket) std = np.std(bucket) if len(bucket) > 1 else mean * 0.1 # Ensure minimum standard deviation (10% of mean or 1) std = max(std, mean * 0.1, 1.0) lower = mean - (self.sensitivity * std) upper = mean + (self.sensitivity * std) return float(mean), float(lower), float(upper) def detect( self, value: float, timestamp: datetime ) -> AnomalyResult: """ Analyze if a value is anomalous given the timestamp. 
""" # Check if we have enough data to make reliable predictions if self.data_points_seen < self.min_data_points: return AnomalyResult( is_anomaly=False, current_value=value, expected_value=value, deviation_score=0.0, confidence=0.0, context=f"Insufficient data ({self.data_points_seen}/{self.min_data_points})" ) expected, lower, upper = self.get_expected_range(timestamp) # Calculate deviation in terms of standard deviations bucket = self._get_bucket(timestamp) std = np.std(bucket) if len(bucket) > 1 else expected * 0.1 std = max(std, expected * 0.1, 1.0) deviation_score = abs(value - expected) / std # Is it anomalous? is_anomaly = value < lower or value > upper # Calculate confidence based on data quality data_quality = min(len(bucket) / 50, 1.0) # More data = higher confidence confidence = data_quality * min(deviation_score / self.sensitivity, 1.0) # Build context string direction = "above" if value > expected else "below" context = ( f"Value {value:.2f} is {deviation_score:.1f}σ {direction} expected " f"({expected:.2f}). Normal range: [{lower:.2f}, {upper:.2f}]" ) return AnomalyResult( is_anomaly=is_anomaly, current_value=value, expected_value=expected, deviation_score=deviation_score, confidence=confidence, context=context ) def _get_bucket(self, timestamp: datetime) -> List[float]: """Get the historical bucket for a timestamp""" day_of_week = timestamp.weekday() hour = timestamp.hour return self.hourly_profiles[day_of_week][hour] class MultiMetricAnomalyDetector: """ Coordinates anomaly detection across multiple related metrics. This helps reduce false positives by correlating anomalies: - If only one metric is anomalous, confidence is lower - If multiple correlated metrics are anomalous, confidence is higher """ def __init__(self, metric_names: List[str], sensitivity: float = 3.0): self.detectors = { name: SeasonalAnomalyDetector(sensitivity=sensitivity) for name in metric_names } self.metric_names = metric_names def add_observations( self, values: dict[str, float], timestamp: datetime ) -> None: """Add observations for all metrics""" for name, value in values.items(): if name in self.detectors: self.detectors[name].add_observation(value, timestamp) def detect_correlated( self, values: dict[str, float], timestamp: datetime ) -> dict: """ Detect anomalies with correlation analysis. Returns enhanced results considering metric correlations. 
""" results = {} anomaly_count = 0 # Run detection for each metric for name, value in values.items(): if name in self.detectors: result = self.detectors[name].detect(value, timestamp) results[name] = result if result.is_anomaly: anomaly_count += 1 # Adjust confidence based on correlation correlation_factor = anomaly_count / len(self.metric_names) return { "individual_results": results, "anomalous_metrics": anomaly_count, "total_metrics": len(self.metric_names), "correlation_factor": correlation_factor, "high_confidence_incident": correlation_factor > 0.5, } # Usage Exampleif __name__ == "__main__": # Initialize detector for request latency detector = SeasonalAnomalyDetector( sensitivity=3.0, # Alert on 3+ standard deviations min_data_points=168, # Require 1 week of hourly data ) # Simulate loading historical data base_time = datetime.now() - timedelta(days=14) for day in range(14): for hour in range(24): # Simulate traffic pattern: higher during business hours base_latency = 100 + (50 if 9 <= hour <= 17 else 0) # Add day-of-week pattern: higher on weekdays if (base_time + timedelta(days=day)).weekday() < 5: base_latency += 20 # Add noise value = base_latency + np.random.normal(0, 10) timestamp = base_time + timedelta(days=day, hours=hour) detector.add_observation(value, timestamp) # Test detection test_time = datetime.now().replace(hour=14, minute=0) # 2 PM # Normal value result = detector.detect(155.0, test_time) print(f"Normal value: {result}") # Anomalous value (latency spike) result = detector.detect(350.0, test_time) print(f"Anomalous value: {result}")Operational Considerations for Anomaly Detection
False Positive Management: Anomaly detection inherently trades off sensitivity against false positives. Start conservative (a higher deviation threshold, such as 4σ rather than 3σ) and tighten it as you gain confidence in the model's accuracy.
Training Data Quality: Models are only as good as their training data. Ensure your historical data doesn't include past incidents—or mark those periods as anomalous in training so the model doesn't learn them as "normal."
Concept Drift: Systems evolve. New features, architecture changes, and traffic growth all shift what's "normal." Implement continuous learning or periodic retraining to keep models current.
Interpretability: When anomaly detection triggers, responders need to understand why. Provide context: expected range, deviation magnitude, recent trend, and any correlated metrics showing similar behavior.
Anomaly detection requires historical data to establish baselines. New services, metrics, or environments lack this history. Plan for a "cold start" period where detection relies on static thresholds or conservative defaults while the model learns normal behavior. Typically, 2-4 weeks of data is needed for reliable seasonality detection.
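One way to handle the cold start, assuming the SeasonalAnomalyDetector sketched earlier, is a thin wrapper that falls back to a static threshold until enough history has accumulated; the threshold value here is purely illustrative:

```python
from datetime import datetime

def is_anomalous(detector, value: float, now: datetime,
                 static_threshold: float = 500.0) -> bool:
    """Cold-start guard around the seasonal detector from the example above."""
    if detector.data_points_seen < detector.min_data_points:
        # Not enough history for seasonal baselines yet: use a conservative static limit
        return value > static_threshold
    return detector.detect(value, now).is_anomaly
```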
Synthetic monitoring—also called proactive monitoring or active monitoring—generates artificial traffic to test system behavior from the user's perspective. Rather than waiting for real users to encounter problems, synthetic monitors continuously exercise critical paths, detecting failures even when no real users are present.
Why Synthetic Monitoring Matters
Real user monitoring (RUM) tells you how actual users are experiencing your system. But RUM has limitations: it produces no signal when traffic is low (overnight, in new regions, for newly launched features), it only covers the paths users actually exercise, and by definition it can only surface a problem after real users have already been affected.
Synthetic monitoring fills these gaps by providing consistent, predictable test coverage regardless of real traffic patterns.
| Strategy | What It Tests | Frequency | Best For |
|---|---|---|---|
| Availability Checks | Endpoint responds with 2xx | Every 1-5 minutes | Basic uptime monitoring |
| API Contract Tests | Response schema and values | Every 5-15 minutes | API correctness |
| End-to-End Transactions | Complete user flows | Every 5-15 minutes | Critical business processes |
| Multi-Region Probes | Availability from multiple locations | Every 1-5 minutes | Geographic redundancy |
| SSL/TLS Validation | Certificate validity and expiration | Every hour to daily | Security and trust |
| Dependency Checks | Third-party service availability | Every 1-5 minutes | External dependency health |
| Performance Baselines | Response time benchmarks | Every 5-15 minutes | Performance regression detection |
```typescript
/**
 * Synthetic Monitoring Implementation
 *
 * This implementation demonstrates a production-grade approach
 * to synthetic monitoring with proper alerting, metrics, and
 * failure handling.
 */

interface SyntheticCheckResult {
  checkName: string;
  success: boolean;
  durationMs: number;
  statusCode?: number;
  error?: string;
  timestamp: Date;
  location: string;
}

interface CheckDefinition {
  name: string;
  url: string;
  method: 'GET' | 'POST' | 'PUT' | 'DELETE';
  headers?: Record<string, string>;
  body?: unknown;
  expectedStatus: number;
  timeoutMs: number;
  assertions?: ((response: Response, body: unknown) => void)[];
}

class SyntheticMonitor {
  private checks: CheckDefinition[] = [];
  private results: SyntheticCheckResult[] = [];
  private alertCallback?: (result: SyntheticCheckResult) => void;
  private metricsCallback?: (result: SyntheticCheckResult) => void;
  private location: string;

  constructor(location: string = 'us-east-1') {
    this.location = location;
  }

  /**
   * Register a check to be executed
   */
  addCheck(check: CheckDefinition): void {
    this.checks.push(check);
  }

  /**
   * Register alert callback for failed checks
   */
  onAlert(callback: (result: SyntheticCheckResult) => void): void {
    this.alertCallback = callback;
  }

  /**
   * Register metrics callback for all check results
   */
  onMetrics(callback: (result: SyntheticCheckResult) => void): void {
    this.metricsCallback = callback;
  }

  /**
   * Execute a single check
   */
  async runCheck(check: CheckDefinition): Promise<SyntheticCheckResult> {
    const startTime = Date.now();

    try {
      const controller = new AbortController();
      const timeout = setTimeout(() => controller.abort(), check.timeoutMs);

      const response = await fetch(check.url, {
        method: check.method,
        headers: {
          'User-Agent': 'SyntheticMonitor/1.0',
          ...check.headers,
        },
        body: check.body ? JSON.stringify(check.body) : undefined,
        signal: controller.signal,
      });

      clearTimeout(timeout);
      const durationMs = Date.now() - startTime;

      // Read the body once, then attempt JSON parsing
      // (reading the stream twice would fail once it has been consumed)
      const rawBody = await response.text();
      let body: unknown;
      try {
        body = JSON.parse(rawBody);
      } catch {
        body = rawBody;
      }

      // Check basic status code
      if (response.status !== check.expectedStatus) {
        return {
          checkName: check.name,
          success: false,
          durationMs,
          statusCode: response.status,
          error: `Expected status ${check.expectedStatus}, got ${response.status}`,
          timestamp: new Date(),
          location: this.location,
        };
      }

      // Run custom assertions
      if (check.assertions) {
        for (const assertion of check.assertions) {
          try {
            assertion(response, body);
          } catch (assertionError) {
            return {
              checkName: check.name,
              success: false,
              durationMs,
              statusCode: response.status,
              error: `Assertion failed: ${(assertionError as Error).message}`,
              timestamp: new Date(),
              location: this.location,
            };
          }
        }
      }

      return {
        checkName: check.name,
        success: true,
        durationMs,
        statusCode: response.status,
        timestamp: new Date(),
        location: this.location,
      };
    } catch (error) {
      const durationMs = Date.now() - startTime;
      const errorMessage = error instanceof Error ? error.message : 'Unknown error';

      return {
        checkName: check.name,
        success: false,
        durationMs,
        error: errorMessage,
        timestamp: new Date(),
        location: this.location,
      };
    }
  }

  /**
   * Run all registered checks
   */
  async runAllChecks(): Promise<SyntheticCheckResult[]> {
    const results = await Promise.all(
      this.checks.map(check => this.runCheck(check))
    );

    // Emit metrics for all results
    results.forEach(result => {
      this.metricsCallback?.(result);

      // Alert on failures
      if (!result.success) {
        this.alertCallback?.(result);
      }
    });

    this.results = results;
    return results;
  }

  /**
   * Start continuous monitoring at specified interval
   */
  startContinuousMonitoring(intervalMs: number = 60000): void {
    console.log(`Starting continuous monitoring every ${intervalMs}ms`);

    // Run immediately
    this.runAllChecks();

    // Schedule recurring runs
    setInterval(() => this.runAllChecks(), intervalMs);
  }
}

// Example: Critical Path Synthetic Checks
const monitor = new SyntheticMonitor('us-east-1');

// Check 1: API Health Endpoint
monitor.addCheck({
  name: 'api-health',
  url: 'https://api.example.com/health',
  method: 'GET',
  expectedStatus: 200,
  timeoutMs: 5000,
  assertions: [
    (response, body) => {
      const health = body as { status: string };
      if (health.status !== 'healthy') {
        throw new Error(`Health status is ${health.status}`);
      }
    }
  ]
});

// Check 2: Authentication Flow
monitor.addCheck({
  name: 'auth-flow',
  url: 'https://api.example.com/auth/token',
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: {
    grant_type: 'client_credentials',
    client_id: process.env.SYNTHETIC_CLIENT_ID,
    client_secret: process.env.SYNTHETIC_CLIENT_SECRET,
  },
  expectedStatus: 200,
  timeoutMs: 10000,
  assertions: [
    (response, body) => {
      const token = body as { access_token?: string; expires_in?: number };
      if (!token.access_token) {
        throw new Error('No access token returned');
      }
      if (!token.expires_in || token.expires_in < 300) {
        throw new Error('Token expiration too short');
      }
    }
  ]
});

// Check 3: Payment Gateway Connectivity
monitor.addCheck({
  name: 'payment-gateway-ping',
  url: 'https://api.example.com/payments/health',
  method: 'GET',
  expectedStatus: 200,
  timeoutMs: 5000,
  assertions: [
    (response, body) => {
      const health = body as { gateway_status: string };
      if (health.gateway_status !== 'connected') {
        throw new Error(`Payment gateway: ${health.gateway_status}`);
      }
    }
  ]
});

// Set up alerting
monitor.onAlert((result) => {
  console.error(`🚨 SYNTHETIC CHECK FAILED: ${result.checkName}`);
  console.error(`   Error: ${result.error}`);
  console.error(`   Duration: ${result.durationMs}ms`);
  console.error(`   Location: ${result.location}`);

  // In production: Send to PagerDuty, Slack, etc.
  // await pagerduty.trigger({
  //   summary: `Synthetic check failed: ${result.checkName}`,
  //   severity: 'critical',
  //   ...
  // });
});

// Set up metrics emission
monitor.onMetrics((result) => {
  // In production: Send to Prometheus, DataDog, etc.
  console.log(
    `📊 Metric: synthetic_check_duration{check="${result.checkName}",` +
    `location="${result.location}",success="${result.success}"}` +
    ` ${result.durationMs}`
  );
});

// Start monitoring
monitor.startContinuousMonitoring(60000); // Every minute
```

Despite our best efforts at proactive detection, users will sometimes discover issues before our monitoring catches them. Rather than viewing user reports as monitoring failures, mature organizations treat them as essential signals that complement automated detection.
Why Users Detect What Monitoring Misses
Users encounter scenarios that are difficult to anticipate or synthesize:
Triaging User Reports for Incident Detection
Not every user report indicates an incident. The challenge is distinguishing signal from noise:
Immediate Escalation Indicators:
Investigation-Worthy Signals:
Background Processing:
A practical heuristic: one user report is anecdotal, two are coincidental, three are a pattern. When three independent users report the same issue within a short window, treat it as a potential incident and investigate actively.
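A minimal sketch of this heuristic follows; the report fields and the 30-minute window are illustrative choices, not a standard:

```python
# Sketch: flag an issue when three distinct reporters raise it within a short window.
from datetime import datetime, timedelta
from collections import defaultdict

WINDOW = timedelta(minutes=30)
THRESHOLD = 3  # distinct reporters within the window

def find_potential_incidents(reports):
    by_issue = defaultdict(list)
    for r in reports:
        by_issue[r["issue_key"]].append(r)

    flagged = []
    for issue, items in by_issue.items():
        items.sort(key=lambda r: r["received_at"])
        for i in range(len(items)):
            # Reports received within WINDOW of report i
            window = [r for r in items[i:]
                      if r["received_at"] - items[i]["received_at"] <= WINDOW]
            if len({r["reporter"] for r in window}) >= THRESHOLD:
                flagged.append(issue)
                break
    return flagged

reports = [
    {"issue_key": "checkout-error", "reporter": "u1", "received_at": datetime(2024, 5, 1, 10, 0)},
    {"issue_key": "checkout-error", "reporter": "u2", "received_at": datetime(2024, 5, 1, 10, 12)},
    {"issue_key": "checkout-error", "reporter": "u3", "received_at": datetime(2024, 5, 1, 10, 25)},
]
print(find_potential_incidents(reports))  # ['checkout-error']
```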
Individual detection mechanisms—alerts, anomaly detection, synthetic monitoring, user reports—provide value in isolation. But their true power emerges when integrated into a cohesive detection pipeline that correlates signals, suppresses noise, and routes actionable alerts to responders.
Detection Pipeline Architecture
A modern detection pipeline consists of several stages: ingestion of signals from every source (alerts, anomaly detectors, synthetic checks, user reports), normalization into a common event format, enrichment with context such as service ownership and recent deployments, correlation and deduplication of related signals, suppression of known noise, and routing of actionable incidents to the right responders.
Signal Correlation Strategies
Correlation transforms a flood of independent alerts into coherent incident narratives:
Temporal Correlation: Alerts occurring within a time window (e.g., 5 minutes) are likely related. If CPU spikes, memory alerts trigger, and latency increases all at 2:15 PM, they're probably one incident, not three.
Topological Correlation: Alerts from components in the same call path belong together. A database alert during an API latency alert suggests causation, especially if traces connect them.
Deployment Correlation: Any alerts occurring shortly after a deployment should be grouped and flagged as potentially deployment-related.
Historical Pattern Correlation: If CPU spikes at the same time every day (batch job), correlation engines can suppress or relabel instead of alerting.
Cross-Signal Correlation: A synthetic monitor failure + user reports + latency alert = high confidence incident. A single latency alert = investigate before escalating.
Tools like PagerDuty Event Intelligence, Splunk ITSI, BigPanda, and Moogsoft specialize in alert correlation and noise reduction. These platforms use machine learning to identify patterns, suppress duplicates, and surface actionable incidents from raw alert streams.
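To make the temporal-correlation idea concrete, here is a minimal sketch that chains alerts firing close together into one candidate incident; real platforms layer topology awareness and machine learning on top of this kind of basic grouping:

```python
# Sketch of temporal correlation: alerts that fire within WINDOW of the previous
# alert are chained into one group (candidate incident). Fields are illustrative.
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)

def group_by_time(alerts):
    """alerts: list of dicts with 'name' and 'fired_at'."""
    alerts = sorted(alerts, key=lambda a: a["fired_at"])
    groups, current = [], []
    for alert in alerts:
        if current and alert["fired_at"] - current[-1]["fired_at"] > WINDOW:
            groups.append(current)
            current = []
        current.append(alert)
    if current:
        groups.append(current)
    return groups

alerts = [
    {"name": "cpu_high",     "fired_at": datetime(2024, 6, 1, 14, 15)},
    {"name": "memory_high",  "fired_at": datetime(2024, 6, 1, 14, 16)},
    {"name": "latency_p99",  "fired_at": datetime(2024, 6, 1, 14, 17)},
    {"name": "disk_warning", "fired_at": datetime(2024, 6, 1, 18, 0)},
]
for group in group_by_time(alerts):
    print([a["name"] for a in group])  # first three group together; the disk alert stands alone
```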
Incident detection is the critical first step in incident response. The effectiveness of your entire incident management program depends on how quickly and accurately you identify problems. Let's consolidate the key principles:
- Measure Time to Detection and treat every minute of it as direct customer impact to be driven down.
- Aim for a high proactive-to-reactive ratio: 90%+ of incidents caught by internal monitoring before any customer report.
- Alert on user-facing outcomes and SLOs, not just infrastructure metrics that are easy to measure.
- Layer detection mechanisms: threshold alerts, anomaly detection, synthetic monitoring, and a triage path for user reports.
- Correlate and suppress signals so responders receive coherent, actionable incidents rather than raw alert noise.
What's Next:
With detection in place, we need processes to respond effectively. The next page explores the Incident Response Process—the structured workflows that translate detection into resolution, covering roles, communication protocols, and the mechanics of incident command.
You now understand the complete landscape of incident detection: from threshold-based alerting and anomaly detection to synthetic monitoring and user report pipelines. Detection is the foundation—without knowing there's a problem, all downstream processes are moot. Next, we'll explore how to respond once an incident is detected.