Imagine a hospital emergency room where every patient is immediately rushed to the operating theater—the person with a paper cut, the patient having a heart attack, and everyone in between. The result would be chaos: critical patients dying while resources are wasted on minor issues. Emergency medicine solved this with triage—a systematic approach to categorizing patients by urgency.
Incident management faces the same challenge. Without severity classification, organizations oscillate between two failure modes: treating everything as critical (leading to alert fatigue and responder burnout) or treating everything as routine (leading to catastrophic delays on real crises). Neither extreme serves customers or teams.
Severity levels provide the triage framework for incidents. They determine how quickly responders mobilize, who gets paged, how much customer communication occurs, and what level of organizational attention the incident receives. Get severity wrong, and you either under-respond to crises or over-respond to non-issues. Get it right, and your organization deploys exactly the right level of effort to each situation.
By the end of this page, you will understand how to design a severity classification system that matches response effort to incident impact. You'll learn to define severity levels with clear criteria, map response expectations to each level, handle severity changes during incidents, and build organizational consensus around classification decisions.
The Purpose of Severity Levels
Severity levels exist to calibrate response. Every incident response action has costs: responders' time, interrupted sleep, stakeholder attention, and organizational overhead. Severity classification ensures these costs are proportional to incident impact. A typical four-level scheme looks like this:
| Level | Name | Typical Criteria | Response Expectation |
|---|---|---|---|
| SEV-1 / P1 | Critical | Complete outage, data loss, security breach, major revenue impact | All-hands response, executive involvement, 24/7 until resolved |
| SEV-2 / P2 | High | Major functionality broken, significant user impact, no workaround | Immediate response, dedicated resources, escalation to leadership |
| SEV-3 / P3 | Medium | Partial degradation, workaround exists, limited user impact | Business hours response, normal priority, standard workflow |
| SEV-4 / P4 | Low | Minor issue, cosmetic problems, edge cases, minimal impact | Queue-based handling, address as capacity allows |
The Consequences of Misclassification
Under-Severity (calling SEV-2 when it's SEV-1): The response is too small and too slow. A critical outage gets worked at routine pace, leadership learns about it late, and customer impact compounds while the organization treats the incident as ordinary.
Over-Severity (calling SEV-1 when it's SEV-3): The response is too large. Responders are pulled from other work, executives are briefed unnecessarily, sleep is interrupted for minor issues, and repeated false alarms erode what "SEV-1" means.
Good severity classification is 'just right'—enough urgency to drive appropriate response, not so much that it loses meaning. If 50% of your incidents are SEV-1, either your systems are catastrophically unreliable or your classification is broken. Typical healthy distribution: 5-10% SEV-1, 15-20% SEV-2, 40-50% SEV-3, remainder SEV-4.
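If you record incident severities, that distribution is easy to monitor automatically. The sketch below is a minimal TypeScript helper—the thresholds and function names are assumptions for illustration, not from any specific tool—that computes the distribution from recent incidents and flags a SEV-1 share above the 10% guideline.

```typescript
// Minimal sketch: check whether recent incidents match a healthy severity
// distribution. Thresholds mirror the guideline above and should be tuned.
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

function severityDistribution(severities: Severity[]): Record<Severity, number> {
  const counts: Record<Severity, number> = { 'SEV-1': 0, 'SEV-2': 0, 'SEV-3': 0, 'SEV-4': 0 };
  for (const s of severities) counts[s] += 1;
  const total = Math.max(severities.length, 1);
  return {
    'SEV-1': counts['SEV-1'] / total,
    'SEV-2': counts['SEV-2'] / total,
    'SEV-3': counts['SEV-3'] / total,
    'SEV-4': counts['SEV-4'] / total,
  };
}

function distributionWarnings(dist: Record<Severity, number>): string[] {
  const warnings: string[] = [];
  if (dist['SEV-1'] > 0.10) {
    warnings.push('More than 10% of incidents are SEV-1: criteria may be too loose, or reliability needs attention.');
  }
  if (dist['SEV-1'] + dist['SEV-2'] < 0.05) {
    warnings.push('Almost nothing is high-severity: criteria may be too strict to ever trigger real escalation.');
  }
  return warnings;
}

// Example: a quarter where a third of incidents were declared SEV-1.
const lastQuarter: Severity[] = ['SEV-1', 'SEV-1', 'SEV-2', 'SEV-3', 'SEV-3', 'SEV-4'];
console.log(distributionWarnings(severityDistribution(lastQuarter)));
```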
Clear, unambiguous severity criteria prevent debates during incidents and ensure consistent classification across responders. Good criteria are objective enough that two responders looking at the same facts reach the same classification.
Dimensions for Severity Classification
Most severity frameworks consider multiple dimensions of impact:
| Dimension | Questions to Ask | Example Indicators |
|---|---|---|
| User Impact | How many users affected? What fraction of total? | >50% = higher severity; <1% = lower severity |
| Functionality Impact | Core functionality vs. secondary features? | Login broken = higher; minor UI bug = lower |
| Revenue Impact | Does this directly prevent transactions? | Checkout broken = higher; internal tool = lower |
| Data Impact | Is data lost, corrupted, or exposed? | Data loss = SEV-1; data stale = lower |
| Security Impact | Is there active attack or data breach? | Active breach = SEV-1 always |
| Workaround Availability | Can users achieve their goal another way? | No workaround = higher severity |
| Duration/Trend | How long has it persisted? Getting worse? | Escalating = higher; stable minor = lower |
| Blast Radius | Single service or spreading to others? | Cascading failure = higher severity |
Example Severity Definitions
Here's a concrete severity framework for a typical B2B SaaS product:
SEV-1 (Critical) — Any of: complete platform outage; confirmed data loss or corruption; active security breach or data exposure; checkout, billing, or another revenue-critical flow fully broken.
SEV-2 (High) — Any of: major functionality broken with no workaround; any feature broken for a large share of users (roughly 50% or more); significant, escalating degradation; an enterprise customer blocked against SLA commitments.
SEV-3 (Medium) — Any of: partial degradation with a reasonable workaround; a non-core feature impaired; limited user impact that is stable rather than worsening.
SEV-4 (Low) — Any of: cosmetic problems; edge-case bugs; minor issues with minimal user impact.
Notice the 'any of' structure. If any single criterion for a level is met, classify at that level. A minor feature broken for 50% of users is still SEV-2 due to user impact, even though it's not a 'major' feature. When in doubt, classify higher—you can always downgrade.
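One way to make the 'any of' rule unambiguous is to encode it: check levels from most to least severe and return the first level with any matching criterion. The sketch below is illustrative only; the field names and thresholds are assumptions loosely drawn from the tables above, not a definitive rule set.

```typescript
// Illustrative impact assessment; field names are assumptions for this sketch.
interface ImpactAssessment {
  completeOutage: boolean;
  dataLossOrBreach: boolean;
  affectedUserFraction: number; // 0..1
  coreFunctionalityBroken: boolean;
  workaroundExists: boolean;
  cosmeticOnly: boolean;
}

type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

function classify(impact: ImpactAssessment): Severity {
  // SEV-1: any single critical criterion is enough ("any of").
  if (impact.completeOutage || impact.dataLossOrBreach) return 'SEV-1';

  // SEV-2: major functionality broken with no workaround, or a large
  // fraction of users affected -- even if the feature itself is minor.
  if ((impact.coreFunctionalityBroken && !impact.workaroundExists) ||
      impact.affectedUserFraction >= 0.5) {
    return 'SEV-2';
  }

  // SEV-3: partial degradation, workaround exists, limited impact.
  if (!impact.cosmeticOnly) return 'SEV-3';

  // SEV-4: cosmetic problems, edge cases, minimal impact.
  return 'SEV-4';
}

// A minor feature broken for 50% of users still lands at SEV-2:
console.log(classify({
  completeOutage: false,
  dataLossOrBreach: false,
  affectedUserFraction: 0.5,
  coreFunctionalityBroken: false,
  workaroundExists: true,
  cosmeticOnly: false,
})); // "SEV-2"
```

Checking SEV-1 first means a single critical criterion is enough, no matter how minor everything else looks.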
Each severity level should have explicit response expectations. These expectations create accountability and help responders understand what's required. Key dimensions include:
| Aspect | SEV-1 | SEV-2 | SEV-3 | SEV-4 |
|---|---|---|---|---|
| Acknowledge Time | < 5 minutes | < 15 minutes | < 1 hour | < 4 hours |
| Response Time | < 15 minutes | < 30 minutes | < 4 hours | Next business day |
| Update Cadence | Every 15 minutes | Every 30 minutes | Every 2 hours | Daily if ongoing |
| Resolution Target | < 2 hours (MTTR) | < 4 hours (MTTR) | < 24 hours | < 1 week |
| Leadership Notification | Immediate | Within 30 minutes | Daily summary | Not required |
| Customer Communication | Status page + proactive email | Status page update | On request | None |
| Post-Mortem Required | Yes, within 48 hours | Yes, within 1 week | Optional (recommended) | No |
| Executive Briefing | Yes, during incident | Yes, post-resolution | No | No |
```yaml
# Incident Severity Configuration
# Defines response expectations for each severity level

severity_levels:
  SEV-1:
    name: "Critical"
    description: "Complete outage or critical business impact"
    color: "#FF0000"  # Red
    response:
      acknowledge_sla_minutes: 5
      response_sla_minutes: 15
      resolution_target_hours: 2
    escalation:
      immediate:
        - primary_oncall
        - secondary_oncall
        - engineering_manager
      after_15_minutes:
        - director
        - vp_engineering
      after_30_minutes:
        - cto
    communication:
      internal_update_interval_minutes: 15
      status_page_required: true
      customer_notification_required: true
      executive_briefing_required: true
    post_incident:
      postmortem_required: true
      postmortem_deadline_hours: 48
      publish_to_team: true
      publish_externally: true

  SEV-2:
    name: "High"
    description: "Major functionality impaired with significant impact"
    color: "#FF8C00"  # Orange
    response:
      acknowledge_sla_minutes: 15
      response_sla_minutes: 30
      resolution_target_hours: 4
    escalation:
      immediate:
        - primary_oncall
      after_15_minutes:
        - secondary_oncall
      after_30_minutes:
        - engineering_manager
    communication:
      internal_update_interval_minutes: 30
      status_page_required: true
      customer_notification_required: false  # On request
      executive_briefing_required: false     # Post-resolution summary
    post_incident:
      postmortem_required: true
      postmortem_deadline_hours: 168  # 1 week
      publish_to_team: true
      publish_externally: false

  SEV-3:
    name: "Medium"
    description: "Partial degradation with limited impact"
    color: "#FFD700"  # Yellow
    response:
      acknowledge_sla_minutes: 60
      response_sla_minutes: 240  # 4 hours
      resolution_target_hours: 24
    escalation:
      immediate:
        - primary_oncall
      # No further escalation unless manually triggered
    communication:
      internal_update_interval_minutes: 120  # 2 hours
      status_page_required: false  # Optional
      customer_notification_required: false
      executive_briefing_required: false
    post_incident:
      postmortem_required: false  # Recommended
      postmortem_deadline_hours: null
      publish_to_team: false
      publish_externally: false

  SEV-4:
    name: "Low"
    description: "Minor issue with minimal impact"
    color: "#32CD32"  # Green
    response:
      acknowledge_sla_minutes: 240   # 4 hours
      response_sla_minutes: 1440     # Next business day
      resolution_target_hours: 168   # 1 week
    escalation:
      immediate:
        - ticket_queue  # Handled through normal ticketing workflow
    communication:
      internal_update_interval_minutes: null  # Daily max
      status_page_required: false
      customer_notification_required: false
      executive_briefing_required: false
    post_incident:
      postmortem_required: false
      postmortem_deadline_hours: null
      publish_to_team: false
      publish_externally: false
```

Response time (how quickly you start working) and resolution time (how quickly you fix it) are different commitments. A SEV-1 response time of 15 minutes is realistic; a SEV-1 resolution time of 2 hours is a target, not a guarantee. Complex issues may take longer—but if you're repeatedly missing resolution targets, either your targets are unrealistic or you have systemic reliability issues.
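Because response and resolution are separate commitments, it helps to measure them separately. Here is a small sketch, with hypothetical field names and the SEV-1 values from the configuration above, that evaluates acknowledge, response, and resolution times independently so a long fix doesn't hide a fast response (or vice versa).

```typescript
// Minimal sketch: evaluate acknowledge/response/resolution times separately.
// SLA minutes mirror the SEV-1 values in the configuration above.
interface IncidentTimestamps {
  declaredAt: Date;
  acknowledgedAt: Date;
  responseStartedAt: Date; // when active mitigation work began
  resolvedAt: Date;
}

interface SlaTargets {
  acknowledgeMinutes: number;
  responseMinutes: number;
  resolutionTargetMinutes: number; // a target, not a guarantee
}

const SEV1_TARGETS: SlaTargets = {
  acknowledgeMinutes: 5,
  responseMinutes: 15,
  resolutionTargetMinutes: 120,
};

function minutesBetween(a: Date, b: Date): number {
  return (b.getTime() - a.getTime()) / 60000;
}

function evaluateSla(t: IncidentTimestamps, sla: SlaTargets) {
  return {
    ackMet: minutesBetween(t.declaredAt, t.acknowledgedAt) <= sla.acknowledgeMinutes,
    responseMet: minutesBetween(t.declaredAt, t.responseStartedAt) <= sla.responseMinutes,
    resolutionTargetMet: minutesBetween(t.declaredAt, t.resolvedAt) <= sla.resolutionTargetMinutes,
  };
}

// Example: fast acknowledgement and response, but resolution ran long.
console.log(evaluateSla({
  declaredAt: new Date('2024-03-01T02:00:00Z'),
  acknowledgedAt: new Date('2024-03-01T02:03:00Z'),
  responseStartedAt: new Date('2024-03-01T02:10:00Z'),
  resolvedAt: new Date('2024-03-01T05:30:00Z'),
}, SEV1_TARGETS));
// { ackMet: true, responseMet: true, resolutionTargetMet: false }
```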
Incidents evolve. What starts as a minor issue may escalate into a critical outage; what looks catastrophic may turn out to be limited in scope. Severity should reflect current reality, not initial assessment.
When to Escalate Severity
Increase severity when: impact turns out to be larger than the initial assessment (more users, more functionality, or more revenue affected); the situation is worsening or spreading to other services; or investigation uncovers data, security, or regulatory implications.
When to De-escalate Severity
Decrease severity when: mitigation is in place and impact is confirmed to have dropped; investigation shows the scope is smaller than first feared; or a workaround restores functionality for most users and the situation is stable.
The Severity Change Protocol
Announce the Change: Post in incident channel: "Escalating this to SEV-1. Error rate now at 15% and rising."
Communicate Rationale: Briefly explain why severity changed: "Impact larger than expected—affecting checkout, not just cart."
Trigger Appropriate Response: Escalation should automatically trigger the additional notifications and resources the new level requires (see the sketch after this list).
Update External Communications: Status page should reflect new severity if it affects customer impact statement.
Adjust Expectations: New severity means new response expectations—remind responders of updated cadence.
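Automating the notification step keeps an escalation from depending on someone remembering who else to page. The sketch below is a hypothetical handler, assuming an escalation map like the configuration earlier; announce() and page() stand in for whatever chat and paging integrations you actually use.

```typescript
// Hypothetical severity-change handler. The escalation map mirrors the
// "immediate" escalation lists in the configuration above.
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

const IMMEDIATE_ESCALATION: Record<Severity, string[]> = {
  'SEV-1': ['primary_oncall', 'secondary_oncall', 'engineering_manager'],
  'SEV-2': ['primary_oncall'],
  'SEV-3': ['primary_oncall'],
  'SEV-4': ['ticket_queue'],
};

function announce(channel: string, message: string): void {
  console.log(`[${channel}] ${message}`);
}

function page(target: string, incidentId: string): void {
  console.log(`Paging ${target} for ${incidentId}`);
}

function changeSeverity(
  incidentId: string,
  channel: string,
  from: Severity,
  to: Severity,
  rationale: string,
): void {
  // Steps 1-2: announce the change and the reason in the incident channel.
  announce(channel, `Severity changed ${from} -> ${to}: ${rationale}`);

  // Step 3: page anyone required at the new level who wasn't already engaged.
  const alreadyPaged = new Set(IMMEDIATE_ESCALATION[from]);
  for (const target of IMMEDIATE_ESCALATION[to]) {
    if (!alreadyPaged.has(target)) page(target, incidentId);
  }

  // Steps 4-5 (status page update, new cadence reminder) would hook in here.
  announce(channel, `Update cadence and expectations now follow ${to}.`);
}

// Example: escalating after impact turns out larger than expected.
changeSeverity('INC-123', '#inc-123', 'SEV-2', 'SEV-1',
  'Error rate now at 15% and rising; checkout affected, not just cart');
```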
Cultural Considerations
Make severity changes routine, not controversial: any responder should feel comfortable proposing a change, initial classifications should be treated as best guesses rather than commitments, and adjusting severity should be seen as good calibration, not as an admission that the first call was wrong.
Escalation and de-escalation aren't symmetric. Escalation should be quick and easy—any responder can propose it. De-escalation should be more deliberate—confirm the situation is truly improved before reducing response level. It's much harder to re-mobilize after de-escalation than to simply maintain elevated response.
What happens when you have two SEV-1 incidents simultaneously? Or when a SEV-1 strikes while you're still resolving a SEV-2? Concurrent incidents require explicit prioritization beyond simple severity levels.
Priority vs. Severity
Severity describes impact; priority describes response order. Severity answers "how bad is this?" while priority answers "what do we work on first?"
Two SEV-1 incidents have equal severity but may have different priority based on factors like revenue at stake, data or regulatory exposure, the number of users affected, whether the situation is still escalating, and how close each incident is to resolution.
Prioritization Factors
When facing concurrent incidents, weigh these factors explicitly rather than by gut feel; the priority-scoring example below shows one way to combine them into an ordering.
Resource Splitting Strategies
Options include splitting responders across incidents, rotating a single team's focus between them, or deliberately finishing the higher-priority incident before starting the next. For resource-constrained teams, sequential focus is often better than fragmented parallel attention.
```typescript
/**
 * Incident Priority Scoring
 *
 * When multiple incidents have the same severity,
 * this scoring helps determine priority order.
 */

interface Incident {
  id: string;
  severity: 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';
  affectedUsers: number;
  revenueImpactPerHour: number;
  dataAtRisk: boolean;
  regulatoryImplications: boolean;
  durationMinutes: number;
  estimatedResolutionMinutes: number;
  isEscalating: boolean;
}

interface PriorityScore {
  incidentId: string;
  score: number;
  primaryFactor: string;
}

function calculatePriorityScore(incident: Incident): PriorityScore {
  let score = 0;
  let primaryFactor = '';

  // Base score from severity
  const severityScores = {
    'SEV-1': 10000,
    'SEV-2': 1000,
    'SEV-3': 100,
    'SEV-4': 10,
  };
  score += severityScores[incident.severity];

  // Data and regulatory trump all else
  if (incident.regulatoryImplications) {
    score += 50000;
    primaryFactor = 'Regulatory implications';
  }
  if (incident.dataAtRisk) {
    score += 25000;
    if (!primaryFactor) primaryFactor = 'Data at risk';
  }

  // Revenue impact (normalized to score range)
  const revenueScore = Math.min(incident.revenueImpactPerHour / 1000, 5000);
  score += revenueScore;

  // User impact (log scale to not over-weight large numbers)
  const userScore = Math.log10(Math.max(incident.affectedUsers, 1)) * 500;
  score += userScore;

  // Escalating incidents get priority boost
  if (incident.isEscalating) {
    score += 2000;
    if (!primaryFactor) primaryFactor = 'Escalating situation';
  }

  // Duration penalty - longer incidents may need fresh attention
  // But also consider if resolution is close
  const resolutionProgress =
    incident.durationMinutes /
    (incident.durationMinutes + incident.estimatedResolutionMinutes);
  if (resolutionProgress > 0.8) {
    // Close to resolution - slight priority boost for quick win
    score += 500;
    if (!primaryFactor) primaryFactor = 'Near resolution';
  } else if (incident.durationMinutes > 60) {
    // Prolonged incident without progress
    score += 1000;
    if (!primaryFactor) primaryFactor = 'Extended duration';
  }

  if (!primaryFactor) {
    if (revenueScore > userScore) {
      primaryFactor = `Revenue impact ($${incident.revenueImpactPerHour}/hr)`;
    } else {
      primaryFactor = `User impact (${incident.affectedUsers} users)`;
    }
  }

  return {
    incidentId: incident.id,
    score,
    primaryFactor,
  };
}

function prioritizeIncidents(incidents: Incident[]): PriorityScore[] {
  const scored = incidents.map(i => calculatePriorityScore(i));
  return scored.sort((a, b) => b.score - a.score);
}

// Example usage
const incidents: Incident[] = [
  {
    id: 'INC-001',
    severity: 'SEV-1',
    affectedUsers: 50000,
    revenueImpactPerHour: 75000,
    dataAtRisk: false,
    regulatoryImplications: false,
    durationMinutes: 25,
    estimatedResolutionMinutes: 30,
    isEscalating: false,
  },
  {
    id: 'INC-002',
    severity: 'SEV-1',
    affectedUsers: 5000,
    revenueImpactPerHour: 10000,
    dataAtRisk: true, // Data at risk trumps revenue
    regulatoryImplications: false,
    durationMinutes: 10,
    estimatedResolutionMinutes: 60,
    isEscalating: true,
  },
];

const prioritized = prioritizeIncidents(incidents);
console.log('Priority order:', prioritized);
// INC-002 ranked higher due to data at risk, despite lower revenue/user impact
```

Sometimes multiple incidents are related (cascading failure). Before prioritizing them separately, ask: "Could these be the same root cause manifesting differently?" Fixing one might fix both. Conversely, treating symptoms separately while ignoring root cause extends all incidents.
Severity frameworks only work if the organization agrees on them. Misaligned expectations—where engineers classify SEV-2 and executives expect SEV-1 response—create friction and erode trust. Building alignment requires explicit discussion and ongoing calibration.
Stakeholder Alignment Process
Define Together: Engineering, product, support, and executives should collaborate on severity definitions. Each group has different perspectives on impact.
Use Real Examples: Abstract criteria are ambiguous. Calibrate with: "Last month's payment outage—was that SEV-1 or SEV-2?" Concrete examples expose disagreements.
Document Rationale: Don't just document criteria; explain why each dimension matters and how it maps to business impact.
Train Consistently: Everyone who might declare or respond to incidents should receive severity training. Include realistic scenarios and practice classification.
Review Regularly: In post-mortems, evaluate whether severity was accurate. If consistently over/under-classified, refine criteria.
The Severity Calibration Meeting
Quarterly, bring stakeholders together to review severity accuracy:
Agenda: walk through incidents where severity was disputed or changed mid-incident, compare the quarter's severity distribution against the expected mix, identify criteria that produced inconsistent calls, and agree on specific wording changes to the definitions.
The goal isn't to blame responders but to calibrate the system. Classification disagreements reveal ambiguous criteria that need clarification.
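A lightweight input for that meeting is a comparison of initial versus final severity for each incident. The sketch below assumes you record both values; the field names and summary shape are illustrative, not a prescribed format.

```typescript
// Sketch: summarize how often initial severity matched final severity,
// as input to the quarterly calibration review. Field names are assumptions.
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

interface ClassifiedIncident {
  id: string;
  initialSeverity: Severity;
  finalSeverity: Severity;
}

// Lower rank = more severe.
const RANK: Record<Severity, number> = { 'SEV-1': 1, 'SEV-2': 2, 'SEV-3': 3, 'SEV-4': 4 };

function calibrationSummary(incidents: ClassifiedIncident[]) {
  let accurate = 0;
  let underClassified = 0; // initially called less severe than it turned out to be
  let overClassified = 0;  // initially called more severe than it turned out to be
  for (const i of incidents) {
    if (i.initialSeverity === i.finalSeverity) accurate += 1;
    else if (RANK[i.initialSeverity] > RANK[i.finalSeverity]) underClassified += 1;
    else overClassified += 1;
  }
  const total = Math.max(incidents.length, 1);
  return {
    accuracy: accurate / total,
    underClassified,
    overClassified,
    reviewCandidates: incidents
      .filter(i => i.initialSeverity !== i.finalSeverity)
      .map(i => i.id),
  };
}

// Example input for a quarter's review.
console.log(calibrationSummary([
  { id: 'INC-101', initialSeverity: 'SEV-2', finalSeverity: 'SEV-2' },
  { id: 'INC-102', initialSeverity: 'SEV-3', finalSeverity: 'SEV-1' }, // under-classified
  { id: 'INC-103', initialSeverity: 'SEV-1', finalSeverity: 'SEV-3' }, // over-classified
]));
```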
Create a one-page severity decision guide that responders can reference during triage. Include the criteria matrix, escalation paths, and common examples. Make it accessible from your incident management tool. During 3 AM pages, responders shouldn't have to hunt for classification guidance.
Standard severity frameworks handle most incidents well, but some situations require special consideration. These edge cases test your framework's completeness.
Security Incidents
Security events often follow different severity logic: an active attack or confirmed breach is SEV-1 regardless of how many users are visibly affected, and even a suspected compromise may warrant elevated response while the scope is still unknown.
Data Incidents
Data problems—loss, corruption, exposure—often warrant higher severity than equivalent availability issues: an outage ends when service is restored, but lost or corrupted data may be unrecoverable, and exposure can carry regulatory consequences long after the incident closes.
| Scenario | Standard Classification | Special Consideration |
|---|---|---|
| Internal tool outage | SEV-3 (limited user impact) | May be SEV-1 if blocks operations (deploy pipeline during incident) |
| Single VIP customer affected | SEV-3 (one customer) | May be SEV-2 for enterprise customer with SLA commitments |
| Slow degradation over days | SEV-4 (minor impact) | May need escalation if trend predicts eventual failure |
| Partial fix stabilizes situation | Maintain original severity | Consider de-escalation to allow focused follow-up |
| Scheduled maintenance overruns | Not an incident initially | Becomes SEV-2/3 when exceeds planned window significantly |
| Third-party dependency outage | Based on customer impact | May limit mitigation options; track separately for dependency analysis |
| Monitoring/observability failure | SEV-3 (no direct user impact) | May be SEV-2 if it impairs incident detection capability |
The 'Unknown' Severity
Sometimes you don't have enough information to classify accurately. Rather than guess incorrectly:
Start High, Adjust Down: If uncertain, declare at higher severity and downgrade once you have data. This ensures adequate response.
Time-Box Uncertainty: "I'm declaring SEV-2 pending investigation. If impact is confirmed smaller in 15 minutes, we'll downgrade." (A minimal sketch of this approach follows this list.)
Log Uncertainty: Note in incident record that initial classification was uncertain. This helps post-mortem learning.
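The time-boxed approach can be as simple as a reminder attached to the declaration. The sketch below is a hypothetical helper: it declares at the more severe of two candidate levels, logs the uncertainty, and schedules a re-evaluation prompt; remind() stands in for whatever notification hook your tooling provides.

```typescript
// Hypothetical helper for time-boxed classification under uncertainty.
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

// Lower rank = more severe.
const RANK: Record<Severity, number> = { 'SEV-1': 1, 'SEV-2': 2, 'SEV-3': 3, 'SEV-4': 4 };

function remind(incidentId: string, message: string): void {
  console.log(`[reminder][${incidentId}] ${message}`);
}

function declareWithTimeBox(
  incidentId: string,
  candidates: [Severity, Severity],
  reviewMinutes: number,
): Severity {
  // Start high: pick the more severe of the two plausible classifications.
  const declared = RANK[candidates[0]] <= RANK[candidates[1]] ? candidates[0] : candidates[1];
  console.log(
    `${incidentId}: declared ${declared} pending investigation ` +
    `(uncertain between ${candidates[0]} and ${candidates[1]})`,
  );

  // Time-box: schedule an explicit re-evaluation instead of letting it drift.
  setTimeout(
    () => remind(incidentId, `Re-evaluate severity: still ${declared}, or downgrade?`),
    reviewMinutes * 60 * 1000,
  );
  return declared;
}

// Example: unsure whether this is SEV-2 or SEV-3; start at SEV-2, review in 15 minutes.
declareWithTimeBox('INC-207', ['SEV-2', 'SEV-3'], 15);
```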
External Incidents
When the problem isn't yours (AWS outage, Stripe down), you still own the impact on your customers. Classify based on how your users are affected, communicate on your own status page, and track the incident against the dependency so it feeds later dependency analysis, even though your mitigation options may be limited.
Sometimes a minor-seeming issue reveals a critical systemic problem—the 'paper tower' that looks stable until you remove one sheet. If investigation reveals that a SEV-4 issue indicates much larger risk ('this could bring down the whole system under load'), escalate based on potential impact, not just current impact.
Severity classification is the control system that ensures incident response effort matches incident impact. Well-designed severity frameworks enable organizations to respond decisively to crises while avoiding exhausting resources on non-issues.
Module Complete:
This concludes the Incident Management module. You've learned how to detect incidents, respond with structured processes, maintain sustainable on-call practices, communicate effectively across audiences, and classify incidents appropriately. Together, these capabilities form a comprehensive incident management system that protects customers, responders, and business outcomes.
The next module explores Post-Mortems—how organizations learn from incidents to prevent recurrence and build more resilient systems.
You now understand the complete framework of incident management: from detection and response processes to on-call practices, communication strategies, and severity classification. These interconnected practices enable organizations to handle production incidents quickly, effectively, and sustainably. Incidents are inevitable—your response defines whether they're minor blips or major crises.