It's 3:17 AM. Your phone buzzes with a PagerDuty alert: "CRITICAL: Payment Service Error Rate > 5%." You check the dashboard—error rate is climbing, now at 8%. Customers can't complete purchases. Your company loses roughly $50,000 per minute when checkout is broken.
What happens next determines whether this incident resolves in 15 minutes or 3 hours. Whether it's a controlled response or a chaotic scramble. Whether your team learns from the experience or repeats the same mistakes.
The difference isn't luck or individual heroics—it's process. Organizations that handle incidents well have internalized a structured response framework that channels the urgency of a crisis into coordinated, effective action. This page explores that framework in depth.
By the end of this page, you will understand the complete incident response lifecycle: the phases from detection through resolution and beyond, the roles that participate in response, the coordination mechanisms that enable effective parallel work, and the documentation practices that capture learning for the future. You'll be able to design and implement incident response processes that scale from five-person startups to thousand-person enterprises.
Every incident, regardless of severity or duration, follows a predictable lifecycle. Understanding this lifecycle is fundamental to effective response—it provides a mental model for where you are, what should happen next, and what success looks like at each stage.
The Five Phases of Incident Response
Phase Transitions
Transitions between phases aren't always linear. Investigation may reveal the need for additional escalation. A mitigation attempt may fail, cycling back to investigation. Resolution may be partial, requiring iteration. Experienced responders recognize these phase transitions and communicate them clearly:
Naming the phase explicitly helps the team maintain shared understanding of where they are in the process.
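As a rough sketch of what "naming the phase" can look like in tooling, here's one possible encoding of phases and transition announcements. The phase names follow the sections of this page (detection, triage, investigation, mitigation, resolution), and the type and function names are illustrative, not a prescribed standard:

```typescript
// Hypothetical phase type and announcement helper; the phase names are an assumption
// based on this page's structure, not a required taxonomy.
type IncidentPhase = 'detection' | 'triage' | 'investigation' | 'mitigation' | 'resolution';

interface PhaseChange {
  from: IncidentPhase;
  to: IncidentPhase;
  reason: string; // why the team is moving forward, or cycling back
}

// Announcing the transition keeps the team's shared mental model current,
// including non-linear moves such as mitigation -> investigation after a failed fix.
function announcePhaseChange(change: PhaseChange): string {
  return `[STATUS] Moving from ${change.from} to ${change.to}: ${change.reason}`;
}

// Example: a mitigation attempt failed, so the team cycles back to investigation.
console.log(
  announcePhaseChange({
    from: 'mitigation',
    to: 'investigation',
    reason: 'rollback did not reduce error rate; revisiting the dependency theory',
  })
);
```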
Borrowed from emergency medicine: the first 60 minutes of incident response are disproportionately impactful. Actions taken (or not taken) in this window often determine whether resolution takes 2 hours or 12 hours. Prioritize getting the right people engaged and stabilizing the situation over perfect diagnosis during this critical window.
Effective incident response requires clear role definition. Without explicit roles, you get either paralysis (everyone waiting for someone else) or chaos (everyone doing everything, stepping on each other). Mature incident response frameworks define several distinct roles:
Core Incident Response Roles
| Role | Primary Focus | Key Responsibilities | Required Skills |
|---|---|---|---|
| Incident Commander (IC) | Overall coordination | Owns the incident, coordinates work, makes decisions, drives toward resolution | Leadership, communication, technical breadth, calm under pressure |
| Technical Lead | Technical investigation | Leads diagnosis, proposes mitigations, validates fixes, coordinates SMEs | Deep system knowledge, debugging skills, architectural understanding |
| Communications Lead | Stakeholder updates | Updates status page, notifies stakeholders, manages external messaging | Clear writing, stakeholder awareness, timing judgment |
| Scribe | Documentation | Records timeline, actions, decisions; maintains incident channel log | Attention to detail, fast typing, synthesis skills |
| Subject Matter Expert | Domain expertise | Provides deep knowledge on specific systems, answers technical questions | Specialized system expertise |
| Customer Liaison | Customer impact | Monitors support channels, escalates customer reports, coordinates response | Customer empathy, support knowledge, communication |
The Incident Commander Role
The Incident Commander (IC) is the most critical role in incident response. This person is responsible for the incident as a whole—not for solving the technical problem personally, but for ensuring the incident progresses toward resolution.
IC Responsibilities:
IC Anti-Patterns:
For smaller incidents, one person may fill multiple roles. For major incidents, each role might have a dedicated person or even a sub-team. The key is explicit assignment: everyone should know who's doing what. Never assume—always state clearly: 'I'll take IC' or 'Can someone volunteer as scribe?'
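A small sketch of what explicit assignment might look like as data, using the roles from the table above. The shape, the severity-to-role mapping, and the function names are illustrative assumptions, not a required schema:

```typescript
// Hypothetical role-assignment record: every filled role is stated explicitly,
// and unfilled roles are visible at a glance.
type IncidentRole =
  | 'incident-commander'
  | 'technical-lead'
  | 'communications-lead'
  | 'scribe'
  | 'subject-matter-expert'
  | 'customer-liaison';

type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

type RoleAssignments = Partial<Record<IncidentRole, string>>; // role -> person

// Which roles must be filled per severity is an assumption for illustration;
// adjust to your own organization's policy.
function unfilledCriticalRoles(assignments: RoleAssignments, severity: Severity): IncidentRole[] {
  const critical: IncidentRole[] =
    severity === 'SEV-1'
      ? ['incident-commander', 'technical-lead', 'communications-lead', 'scribe']
      : ['incident-commander', 'technical-lead'];
  return critical.filter(role => !assignments[role]);
}

// Example: "I'll take IC" and "joining as Tech Lead" have been stated, nothing else.
const assignments: RoleAssignments = {
  'incident-commander': 'sarah.chen',
  'technical-lead': 'mike.johnson',
};
console.log(unfilledCriticalRoles(assignments, 'SEV-1'));
// -> ['communications-lead', 'scribe']
```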
Not every alert is an incident. The triage phase determines whether a detected issue warrants incident response activation. This decision has significant implications: declaring an incident too readily leads to chaos and fatigue; hesitating to declare leads to delayed response and extended impact.
Incident Declaration Criteria
An incident should generally be declared when one or more of the criteria below are met. The following triage framework encodes these criteria as an executable decision aid:
```typescript
/**
 * Incident Triage Decision Framework
 *
 * This framework helps responders systematically evaluate
 * whether an issue warrants incident declaration and what
 * severity level to assign.
 */

interface TriageInput {
  // Impact assessment
  userImpact: 'none' | 'degraded' | 'partial-outage' | 'full-outage';
  affectedUserPercentage: number;
  revenueImpact: 'none' | 'minor' | 'significant' | 'major';

  // Scope assessment
  affectedServices: string[];
  isEscalating: boolean;
  estimatedResolutionTime: 'minutes' | 'hours' | 'uncertain';

  // Risk assessment
  dataAtRisk: boolean;
  securityImplications: boolean;
  regulatoryImplications: boolean;

  // Context
  recentDeployments: boolean;
  knownIssue: boolean;
  existingIncident: boolean;
}

interface TriageDecision {
  declareIncident: boolean;
  recommendedSeverity: 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4' | null;
  rationale: string;
  immediateActions: string[];
  escalationTargets: string[];
}

function triageIssue(input: TriageInput): TriageDecision {
  // Critical path: Full outage with user/revenue impact
  if (input.userImpact === 'full-outage' && input.revenueImpact !== 'none') {
    return {
      declareIncident: true,
      recommendedSeverity: 'SEV-1',
      rationale: 'Complete service outage affecting revenue',
      immediateActions: [
        'Page all on-call responders for affected services',
        'Open incident bridge/channel immediately',
        'Begin customer communication preparation',
        'Alert leadership within 5 minutes',
      ],
      escalationTargets: [
        'Primary and secondary on-call',
        'Engineering leadership',
        'Customer success leadership',
      ],
    };
  }

  // Security or data breach
  if (input.securityImplications || input.dataAtRisk) {
    return {
      declareIncident: true,
      recommendedSeverity: 'SEV-1',
      rationale: 'Security incident or data at risk requires immediate response',
      immediateActions: [
        'Engage security team immediately',
        'Consider isolation/containment measures',
        'Begin evidence preservation',
        'Alert legal/compliance if data breach suspected',
      ],
      escalationTargets: [
        'Security on-call',
        'CISO or security leadership',
        'Legal/compliance if needed',
      ],
    };
  }

  // Partial outage or significant degradation
  if (
    input.userImpact === 'partial-outage' ||
    (input.userImpact === 'degraded' && input.affectedUserPercentage > 25)
  ) {
    const severity =
      input.isEscalating || input.revenueImpact === 'significant' ? 'SEV-1' : 'SEV-2';
    return {
      declareIncident: true,
      recommendedSeverity: severity,
      rationale: 'Significant user impact exceeds acceptable thresholds',
      immediateActions: [
        'Page primary on-call for affected services',
        'Open incident channel',
        'Begin initial investigation',
        'Prepare customer status update',
      ],
      escalationTargets: [
        'Primary on-call for affected services',
        'Engineering manager if escalating',
      ],
    };
  }

  // Multi-service impact
  if (input.affectedServices.length >= 3) {
    return {
      declareIncident: true,
      recommendedSeverity: 'SEV-2',
      rationale: 'Widespread impact across multiple services',
      immediateActions: [
        'Page on-call for all affected services',
        'Open incident channel for coordination',
        'Look for common cause (shared dependency)',
      ],
      escalationTargets: input.affectedServices.map(s => `${s} on-call`),
    };
  }

  // Known issue with workaround
  if (input.knownIssue && input.userImpact === 'degraded') {
    return {
      declareIncident: false,
      recommendedSeverity: null,
      rationale: 'Known issue with understood impact and workaround',
      immediateActions: [
        'Monitor for escalation',
        'Ensure workaround documentation is current',
        'Track toward permanent resolution',
      ],
      escalationTargets: [],
    };
  }

  // Minor degradation
  if (input.userImpact === 'degraded' && input.affectedUserPercentage <= 5) {
    return {
      declareIncident: input.estimatedResolutionTime !== 'minutes',
      recommendedSeverity: 'SEV-3',
      rationale: 'Minor impact, investigate without full incident response unless prolonged',
      immediateActions: [
        'Primary team investigate',
        'Monitor for escalation',
        'Create ticket for tracking',
      ],
      escalationTargets: ['Primary team on-call'],
    };
  }

  // Default: Don't declare, but investigate
  return {
    declareIncident: false,
    recommendedSeverity: null,
    rationale: 'Issue does not meet incident criteria; handle through normal operations',
    immediateActions: [
      'Create ticket for tracking',
      'Assign to appropriate team',
      'Monitor for changes in scope',
    ],
    escalationTargets: [],
  };
}

// Usage example
const triageResult = triageIssue({
  userImpact: 'partial-outage',
  affectedUserPercentage: 30,
  revenueImpact: 'significant',
  affectedServices: ['checkout', 'payment-gateway'],
  isEscalating: true,
  estimatedResolutionTime: 'uncertain',
  dataAtRisk: false,
  securityImplications: false,
  regulatoryImplications: false,
  recentDeployments: true,
  knownIssue: false,
  existingIncident: false,
});

console.log('Triage Decision:', triageResult);
// Output: SEV-1 incident, page all responders, recent deployment likely cause
```

When in doubt, declare the incident. An over-declared incident can be quickly downgraded or closed. An under-declared incident results in delayed response, extended impact, and frustrated teams who weren't informed early enough. It's much easier to scale down than to catch up.
Once an incident is declared, the team must understand what's happening before they can fix it. Investigation is the phase where hypotheses are formed and tested, data is gathered, and root cause is identified. Effective investigation is methodical rather than chaotic—even under pressure.
The Investigation Framework
Systematic investigation follows a structured approach:
The 'Recent Changes' Checklist
The majority of incidents are caused by changes. When triaging, quickly review:
A deployment timeline correlated with the incident onset time often reveals the trigger immediately.
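As a sketch of that correlation, assuming change events (deploys, flag flips, config and infrastructure changes) are available with timestamps, here's one way to surface likely triggers. The event shape and function names are illustrative:

```typescript
// Hypothetical change-event record; wire this to your own deploy and change tooling.
interface ChangeEvent {
  kind: 'deploy' | 'feature-flag' | 'config' | 'infra';
  service: string;
  description: string;
  at: Date;
}

// Return changes that landed shortly before the incident began - the most likely triggers.
function changesNearOnset(changes: ChangeEvent[], onset: Date, windowMinutes = 60): ChangeEvent[] {
  const windowMs = windowMinutes * 60 * 1000;
  return changes
    .filter(c => c.at.getTime() <= onset.getTime() && onset.getTime() - c.at.getTime() <= windowMs)
    .sort((a, b) => b.at.getTime() - a.at.getTime()); // most recent first
}

// Example: a 10:12 deploy surfaces immediately for a 10:15 incident onset.
const suspects = changesNearOnset(
  [
    {
      kind: 'deploy',
      service: 'payment-service',
      description: 'v2.47.3 retry logic change',
      at: new Date('2024-05-01T10:12:00Z'),
    },
  ],
  new Date('2024-05-01T10:15:00Z')
);
console.log(suspects);
```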
For complex incidents, apply the 'Five Whys' iteratively:
- Why are users seeing errors? Because the API is timing out.
- Why is the API timing out? Because database queries are slow.
- Why are the queries slow? Because the table lacks an index on the new column.
- Why is the column unindexed? Because the migration didn't include index creation.
- Why wasn't the index included? Because our migration process doesn't include a performance review step.
This reveals root causes beyond the immediate symptom.
Mitigation is the phase where action replaces investigation. The goal shifts from understanding the problem to stopping the bleeding. In many incidents, mitigation can and should begin before diagnosis is complete—if there's a safe action that might help, take it.
The Mitigation Mindset
Key principles for effective mitigation:
Restore First, Root Cause Later: Don't delay customer recovery to achieve perfect understanding. Rollback now; post-mortem later.
Reversible Actions First: Prefer actions you can undo. A rollback is easily reversed; a database schema change is not.
Small Blast Radius Before Large: If trying a fix in production, limit exposure: one region, one pod, one percent of traffic.
Communicate Before Acting: Announce mitigation actions in the incident channel: "Rolling back the 2:15 PM deploy now."
Verify After Acting: Confirm the mitigation worked: "Error rate dropping. Now at 2%, down from 12%."
| Strategy | When to Use | Risks/Considerations | Reversibility |
|---|---|---|---|
| Rollback Deployment | Recent deploy correlated with issue onset | Ensure rollback process is tested; may need to rollback database migrations first | Usually reversible (re-deploy) |
| Disable Feature Flag | New feature causing problems | Fast and surgical; ensure flag controls the suspected code path | Easily reversible |
| Scale Up Resources | Capacity exhaustion (CPU, memory, connections) | May mask the symptom; doesn't address root cause | Easily reversible |
| Failover to Backup | Primary system unhealthy | Verify backup is current and functional before switching | Usually reversible (fail back) |
| Block Bad Traffic | DDoS or abuse pattern identified | May block legitimate users if filtering criteria are too broad | Easily reversible |
| Restart Services | Suspected memory leak, deadlock, or corrupted state | Rolling restart preferred; verify graceful draining is working | N/A (one-time action) |
| Shed Load | System overloaded beyond capacity | Proactively reject some requests to protect overall stability | Reversible (remove limiting) |
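The principles above (communicate before acting, limit the blast radius, verify after acting) can be made routine with a thin wrapper around whatever mitigation you choose from the table. This is a minimal sketch under stated assumptions: the announce, apply, and health-check hooks are placeholders you would wire to your own chat, deploy, and metrics tooling.

```typescript
// Hypothetical announce -> act -> verify loop for a single mitigation attempt.
interface MitigationStep {
  description: string;               // e.g. "Roll back payment-service to v2.47.2"
  trafficPercentage: number;         // keep the blast radius small: 1, then 10, then 100
  apply: () => Promise<void>;        // the actual action (rollback, flag flip, failover, ...)
  isHealthy: () => Promise<boolean>; // e.g. error rate back below threshold
}

async function runMitigation(
  step: MitigationStep,
  announce: (msg: string) => void
): Promise<boolean> {
  // Communicate before acting
  announce(`[ACTION] ${step.description} (${step.trafficPercentage}% of traffic)`);
  await step.apply();

  // Verify after acting
  const healthy = await step.isHealthy();
  announce(
    healthy
      ? `[STATUS] Mitigation looks effective at ${step.trafficPercentage}%`
      : `[STATUS] Mitigation did not help; reverting and returning to investigation`
  );
  return healthy;
}
```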
The Mitigation Decision Tree
When facing multiple mitigation options, prioritize by:
Resolution vs. Mitigation
Mitigation restores service; resolution fixes the root cause. They're related but distinct:
Incident response should achieve mitigation quickly. Resolution may come later, even after the incident is officially closed, through follow-up work items.
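One way to keep that separation honest is to track resolution work as explicit follow-up items linked to the incident. A minimal sketch, with illustrative field names:

```typescript
// Hypothetical follow-up record: the incident can close once mitigated,
// while resolution work remains visible and owned.
interface FollowUpItem {
  incidentId: string;
  title: string; // e.g. "Fix retry amplification in payment-service"
  kind: 'root-cause-fix' | 'monitoring-gap' | 'process-improvement';
  owner: string;
  dueBy: Date;
  done: boolean;
}

// An incident may be closed while its root-cause fixes are still outstanding.
function resolutionOutstanding(items: FollowUpItem[]): FollowUpItem[] {
  return items.filter(i => i.kind === 'root-cause-fix' && !i.done);
}
```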
Teams often hesitate to roll back because they fear losing work or causing other issues. This hesitation extends outages. Establish a cultural norm: rolling back is always acceptable when customer impact is ongoing. The engineering work isn't lost—it's in version control waiting to be fixed and redeployed.
Multiple people working on an incident isn't automatically better than one person—it's only better if they're coordinated. Effective coordination transforms individual efforts into collective capability. Poor coordination creates confusion, duplication, and conflict.
Communication Channels
Most organizations use multiple channels during incidents. Within the primary incident channel, a lightweight message convention keeps the log scannable:
```
[STATUS]   Current state of the incident
[ACTION]   Something someone is doing or has done
[FINDING]  An observation or discovery
[THEORY]   A hypothesis being considered
[DECISION] A choice being made by IC or team
```

Here is an example incident channel using these tags:

```
#inc-2024-0142-payment-failures

📌 Pinned
- Dashboard: https://grafana.example.com/d/payments-overview
- Runbook: https://runbooks.example.com/payments/high-failure-rate
- Status Page: https://status.example.com

---

[10:17] 🤖 Incident Bot: Incident created by alert "PaymentFailureRateCritical"
        Severity: SEV-1 | Commander: unassigned

[10:18] @sarah.chen: Taking IC. This is SEV-1, full incident response.

[10:18] @sarah.chen:
[STATUS] Payment failure rate at 12%, up from baseline 0.5%
[STATUS] Started approximately 10:15 AM PST
[ACTION] Paging payments team and platform on-call

[10:19] @mike.johnson: Joining as Tech Lead. Looking at recent deploys.

[10:20] @sarah.chen:
[STATUS] Bridge call started: meet.google.com/abc-defg-hij
Need SMEs on: payments gateway, database team

[10:22] @mike.johnson:
[FINDING] Deploy at 10:12 AM to payment-service v2.47.3
[ACTION] Pulling up the change list now

[10:23] @lisa.wong: Database on-call here. Checking for query performance issues.

[10:25] @mike.johnson:
[FINDING] v2.47.3 changed the retry logic for gateway timeouts
[THEORY] Might be creating amplified load on downstream gateway

[10:26] @sarah.chen:
[DECISION] Let's prepare rollback to v2.47.2 while continuing investigation
@mike.johnson can you stage the rollback?
@lisa.wong any database anomalies?

[10:27] @lisa.wong:
[FINDING] No database issues - queries normal latency

[10:28] @mike.johnson:
[ACTION] Rollback staged, awaiting confirmation to execute

[10:29] @sarah.chen:
[DECISION] Execute rollback. Gateway amplification theory is plausible.
[ACTION] @mike.johnson proceed with rollback to v2.47.2

[10:30] @mike.johnson:
[ACTION] Rollback initiated - payment-service v2.47.2 deploying

[10:33] @mike.johnson:
[STATUS] Rollback complete. New pods healthy.

[10:34] @sarah.chen:
[STATUS] Error rate dropping. 8%... 5%... 3%...

[10:37] @sarah.chen:
[STATUS] Error rate at 0.6%, returning to baseline
[STATUS] Monitoring for stability. Will close incident if it holds for 15 min.

[10:52] @sarah.chen:
[STATUS] Stable at baseline for 15 min. Closing incident.
[ACTION] Will schedule post-mortem for tomorrow 11 AM
[ACTION] @mike.johnson please create follow-up ticket for retry logic fix
```

A dedicated scribe dramatically improves incident quality. They capture details that would otherwise be lost, create the foundation for post-mortems, and allow technical responders to focus entirely on diagnosis and mitigation. For SEV-1 incidents, always assign a scribe.
Knowing when and how to close an incident is as important as knowing how to respond to one. Premature closure leads to recurrence; delayed closure wastes resources and creates responder fatigue.
Incident Closure Criteria
An incident should be considered for closure when:
Closure ≠ Complete
Closing an incident means the immediate crisis has passed—not that all work is done. Closure should include:
Handoff Between Responders
For extended incidents spanning shift changes, formal handoff is essential:
Handoff Protocol:
The worst handoffs happen when:
For incidents where you're confident but not certain of resolution: close the incident but keep the channel open and monitoring elevated for 24-48 hours. Announce: 'Soft-closing this incident. Keeping channel active for observation. Will hard-close tomorrow if stable.' This prevents premature confident closure while not keeping responders actively engaged.
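A small sketch of how a soft-close observation window might be tracked, assuming the incident record carries a status and a window duration; field and function names are illustrative:

```typescript
// Hypothetical closure record supporting the soft-close pattern described above.
interface IncidentClosure {
  incidentId: string;
  status: 'open' | 'soft-closed' | 'closed';
  softClosedAt?: Date;
  observationHours: number; // e.g. 24-48 hours of elevated monitoring
}

function softClose(incidentId: string, observationHours = 24): IncidentClosure {
  return { incidentId, status: 'soft-closed', softClosedAt: new Date(), observationHours };
}

// Hard-close only after the observation window passes without regression.
function canHardClose(c: IncidentClosure, now: Date, regressed: boolean): boolean {
  if (c.status !== 'soft-closed' || !c.softClosedAt || regressed) return false;
  const elapsedHours = (now.getTime() - c.softClosedAt.getTime()) / (1000 * 60 * 60);
  return elapsedHours >= c.observationHours;
}
```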
Effective incident response transforms chaos into coordinated action. It's not about individual heroics—it's about reliable processes that work even when responders are tired, stressed, or unfamiliar with the specific system failing.
What's Next:
Incident response relies on responders being available when incidents occur. The next page explores On-Call Practices—the systems and norms that ensure qualified responders are available, well-rested, and prepared to handle incidents whenever they arise.
You now understand the complete incident response process: from lifecycle phases and roles to investigation techniques, mitigation strategies, coordination mechanics, and closure procedures. Process transforms chaos into resolution. Next, we'll explore how on-call practices ensure responders are ready when needed.