Imagine a hospital emergency room where every patient is immediately rushed to the operating theater—the person with a paper cut, the patient having a heart attack, and everyone in between. The result would be chaos: critical patients dying while resources are wasted on minor issues. Emergency medicine solved this with triage—a systematic approach to categorizing patients by urgency.
Incident management faces the same challenge. Without severity classification, organizations oscillate between two failure modes: treating everything as critical (leading to alert fatigue and responder burnout) or treating everything as routine (leading to catastrophic delays on real crises). Neither extreme serves customers or teams.
Severity levels provide the triage framework for incidents. They determine how quickly responders mobilize, who gets paged, how much customer communication occurs, and what level of organizational attention the incident receives. Get severity wrong, and you either under-respond to crises or over-respond to non-issues. Get it right, and your organization deploys exactly the right level of effort to each situation.
By the end of this page, you will understand how to design a severity classification system that matches response effort to incident impact. You'll learn to define severity levels with clear criteria, map response expectations to each level, handle severity changes during incidents, and build organizational consensus around classification decisions.
The Purpose of Severity Levels
Severity levels exist to calibrate response. Every incident response action has costs: responders' time, interrupted sleep, stakeholder attention, and organizational overhead. Severity classification ensures these costs are proportional to incident impact. A typical four-level scheme looks like this:
| Level | Name | Typical Criteria | Response Expectation |
|---|---|---|---|
| SEV-1 / P1 | Critical | Complete outage, data loss, security breach, major revenue impact | All-hands response, executive involvement, 24/7 until resolved |
| SEV-2 / P2 | High | Major functionality broken, significant user impact, no workaround | Immediate response, dedicated resources, escalation to leadership |
| SEV-3 / P3 | Medium | Partial degradation, workaround exists, limited user impact | Business hours response, normal priority, standard workflow |
| SEV-4 / P4 | Low | Minor issue, cosmetic problems, edge cases, minimal impact | Queue-based handling, address as capacity allows |
The Consequences of Misclassification
Under-Severity (calling SEV-2 when it's SEV-1): The response is too small and too slow. A critical outage gets worked at routine pace, leadership learns about it late, and customer impact compounds while the organization treats the incident as ordinary.
Over-Severity (calling SEV-1 when it's SEV-3): The response is too large. Responders are pulled from other work, executives are briefed unnecessarily, sleep is interrupted for minor issues, and repeated false alarms erode what "SEV-1" means.
Good severity classification is 'just right'—enough urgency to drive appropriate response, not so much that it loses meaning. If 50% of your incidents are SEV-1, either your systems are catastrophically unreliable or your classification is broken. Typical healthy distribution: 5-10% SEV-1, 15-20% SEV-2, 40-50% SEV-3, remainder SEV-4.
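If you record incident severities, that distribution is easy to monitor automatically. The sketch below is a minimal TypeScript helper—the thresholds and function names are assumptions for illustration, not from any specific tool—that computes the distribution from recent incidents and flags a SEV-1 share above the 10% guideline.

```typescript
// Minimal sketch: check whether recent incidents match a healthy severity
// distribution. Thresholds mirror the guideline above and should be tuned.
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

function severityDistribution(severities: Severity[]): Record<Severity, number> {
  const counts: Record<Severity, number> = { 'SEV-1': 0, 'SEV-2': 0, 'SEV-3': 0, 'SEV-4': 0 };
  for (const s of severities) counts[s] += 1;
  const total = Math.max(severities.length, 1);
  return {
    'SEV-1': counts['SEV-1'] / total,
    'SEV-2': counts['SEV-2'] / total,
    'SEV-3': counts['SEV-3'] / total,
    'SEV-4': counts['SEV-4'] / total,
  };
}

function distributionWarnings(dist: Record<Severity, number>): string[] {
  const warnings: string[] = [];
  if (dist['SEV-1'] > 0.10) {
    warnings.push('More than 10% of incidents are SEV-1: criteria may be too loose, or reliability needs attention.');
  }
  if (dist['SEV-1'] + dist['SEV-2'] < 0.05) {
    warnings.push('Almost nothing is high-severity: criteria may be too strict to ever trigger real escalation.');
  }
  return warnings;
}

// Example: a quarter where a third of incidents were declared SEV-1.
const lastQuarter: Severity[] = ['SEV-1', 'SEV-1', 'SEV-2', 'SEV-3', 'SEV-3', 'SEV-4'];
console.log(distributionWarnings(severityDistribution(lastQuarter)));
```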
Clear, unambiguous severity criteria prevent debates during incidents and ensure consistent classification across responders. Good criteria are objective enough that two responders looking at the same facts reach the same classification.
Dimensions for Severity Classification
Most severity frameworks consider multiple dimensions of impact:
| Dimension | Questions to Ask | Example Indicators |
|---|---|---|
| User Impact | How many users affected? What fraction of total? | >50% = higher severity; <1% = lower severity |
| Functionality Impact | Core functionality vs. secondary features? | Login broken = higher; minor UI bug = lower |
| Revenue Impact | Does this directly prevent transactions? | Checkout broken = higher; internal tool = lower |
| Data Impact | Is data lost, corrupted, or exposed? | Data loss = SEV-1; data stale = lower |
| Security Impact | Is there active attack or data breach? | Active breach = SEV-1 always |
| Workaround Availability | Can users achieve their goal another way? | No workaround = higher severity |
| Duration/Trend | How long has it persisted? Getting worse? | Escalating = higher; stable minor = lower |
| Blast Radius | Single service or spreading to others? | Cascading failure = higher severity |
Example Severity Definitions
Here's a concrete severity framework for a typical B2B SaaS product:
SEV-1 (Critical) — Any of: complete platform outage; confirmed data loss or corruption; active security breach or data exposure; checkout, billing, or another revenue-critical flow fully broken.
SEV-2 (High) — Any of: major functionality broken with no workaround; any feature broken for a large share of users (roughly 50% or more); significant, escalating degradation; an enterprise customer blocked against SLA commitments.
SEV-3 (Medium) — Any of: partial degradation with a reasonable workaround; a non-core feature impaired; limited user impact that is stable rather than worsening.
SEV-4 (Low) — Any of: cosmetic problems; edge-case bugs; minor issues with minimal user impact.
Notice the 'any of' structure. If any single criterion for a level is met, classify at that level. A minor feature broken for 50% of users is still SEV-2 due to user impact, even though it's not a 'major' feature. When in doubt, classify higher—you can always downgrade.
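One way to make the 'any of' rule unambiguous is to encode it: check levels from most to least severe and return the first level with any matching criterion. The sketch below is illustrative only; the field names and thresholds are assumptions loosely drawn from the tables above, not a definitive rule set.

```typescript
// Illustrative impact assessment; field names are assumptions for this sketch.
interface ImpactAssessment {
  completeOutage: boolean;
  dataLossOrBreach: boolean;
  affectedUserFraction: number; // 0..1
  coreFunctionalityBroken: boolean;
  workaroundExists: boolean;
  cosmeticOnly: boolean;
}

type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

function classify(impact: ImpactAssessment): Severity {
  // SEV-1: any single critical criterion is enough ("any of").
  if (impact.completeOutage || impact.dataLossOrBreach) return 'SEV-1';

  // SEV-2: major functionality broken with no workaround, or a large
  // fraction of users affected -- even if the feature itself is minor.
  if ((impact.coreFunctionalityBroken && !impact.workaroundExists) ||
      impact.affectedUserFraction >= 0.5) {
    return 'SEV-2';
  }

  // SEV-3: partial degradation, workaround exists, limited impact.
  if (!impact.cosmeticOnly) return 'SEV-3';

  // SEV-4: cosmetic problems, edge cases, minimal impact.
  return 'SEV-4';
}

// A minor feature broken for 50% of users still lands at SEV-2:
console.log(classify({
  completeOutage: false,
  dataLossOrBreach: false,
  affectedUserFraction: 0.5,
  coreFunctionalityBroken: false,
  workaroundExists: true,
  cosmeticOnly: false,
})); // "SEV-2"
```

Checking SEV-1 first means a single critical criterion is enough, no matter how minor everything else looks.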
Each severity level should have explicit response expectations. These expectations create accountability and help responders understand what's required. Key dimensions include:
| Aspect | SEV-1 | SEV-2 | SEV-3 | SEV-4 |
|---|---|---|---|---|
| Acknowledge Time | < 5 minutes | < 15 minutes | < 1 hour | < 4 hours |
| Response Time | < 15 minutes | < 30 minutes | < 4 hours | Next business day |
| Update Cadence | Every 15 minutes | Every 30 minutes | Every 2 hours | Daily if ongoing |
| Resolution Target | < 2 hours (MTTR) | < 4 hours (MTTR) | < 24 hours | < 1 week |
| Leadership Notification | Immediate | Within 30 minutes | Daily summary | Not required |
| Customer Communication | Status page + proactive email | Status page update | On request | None |
| Post-Mortem Required | Yes, within 48 hours | Yes, within 1 week | Optional (recommended) | No |
| Executive Briefing | Yes, during incident | Yes, post-resolution | No | No |
```yaml
# Incident Severity Configuration
# Defines response expectations for each severity level

severity_levels:
  SEV-1:
    name: "Critical"
    description: "Complete outage or critical business impact"
    color: "#FF0000"  # Red
    response:
      acknowledge_sla_minutes: 5
      response_sla_minutes: 15
      resolution_target_hours: 2
    escalation:
      immediate:
        - primary_oncall
        - secondary_oncall
        - engineering_manager
      after_15_minutes:
        - director
        - vp_engineering
      after_30_minutes:
        - cto
    communication:
      internal_update_interval_minutes: 15
      status_page_required: true
      customer_notification_required: true
      executive_briefing_required: true
    post_incident:
      postmortem_required: true
      postmortem_deadline_hours: 48
      publish_to_team: true
      publish_externally: true

  SEV-2:
    name: "High"
    description: "Major functionality impaired with significant impact"
    color: "#FF8C00"  # Orange
    response:
      acknowledge_sla_minutes: 15
      response_sla_minutes: 30
      resolution_target_hours: 4
    escalation:
      immediate:
        - primary_oncall
      after_15_minutes:
        - secondary_oncall
      after_30_minutes:
        - engineering_manager
    communication:
      internal_update_interval_minutes: 30
      status_page_required: true
      customer_notification_required: false  # On request
      executive_briefing_required: false     # Post-resolution summary
    post_incident:
      postmortem_required: true
      postmortem_deadline_hours: 168  # 1 week
      publish_to_team: true
      publish_externally: false

  SEV-3:
    name: "Medium"
    description: "Partial degradation with limited impact"
    color: "#FFD700"  # Yellow
    response:
      acknowledge_sla_minutes: 60
      response_sla_minutes: 240  # 4 hours
      resolution_target_hours: 24
    escalation:
      immediate:
        - primary_oncall
      # No further escalation unless manually triggered
    communication:
      internal_update_interval_minutes: 120  # 2 hours
      status_page_required: false  # Optional
      customer_notification_required: false
      executive_briefing_required: false
    post_incident:
      postmortem_required: false  # Recommended
      postmortem_deadline_hours: null
      publish_to_team: false
      publish_externally: false

  SEV-4:
    name: "Low"
    description: "Minor issue with minimal impact"
    color: "#32CD32"  # Green
    response:
      acknowledge_sla_minutes: 240   # 4 hours
      response_sla_minutes: 1440     # Next business day
      resolution_target_hours: 168   # 1 week
    escalation:
      immediate:
        - ticket_queue  # Handled through normal ticketing workflow
    communication:
      internal_update_interval_minutes: null  # Daily max
      status_page_required: false
      customer_notification_required: false
      executive_briefing_required: false
    post_incident:
      postmortem_required: false
      postmortem_deadline_hours: null
      publish_to_team: false
      publish_externally: false
```

Response time (how quickly you start working) and resolution time (how quickly you fix it) are different commitments. A SEV-1 response time of 15 minutes is realistic; a SEV-1 resolution time of 2 hours is a target, not a guarantee. Complex issues may take longer—but if you're repeatedly missing resolution targets, either your targets are unrealistic or you have systemic reliability issues.
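Because response and resolution are separate commitments, it helps to measure them separately. Here is a small sketch, with hypothetical field names and the SEV-1 values from the configuration above, that evaluates acknowledge, response, and resolution times independently so a long fix doesn't hide a fast response (or vice versa).

```typescript
// Minimal sketch: evaluate acknowledge/response/resolution times separately.
// SLA minutes mirror the SEV-1 values in the configuration above.
interface IncidentTimestamps {
  declaredAt: Date;
  acknowledgedAt: Date;
  responseStartedAt: Date; // when active mitigation work began
  resolvedAt: Date;
}

interface SlaTargets {
  acknowledgeMinutes: number;
  responseMinutes: number;
  resolutionTargetMinutes: number; // a target, not a guarantee
}

const SEV1_TARGETS: SlaTargets = {
  acknowledgeMinutes: 5,
  responseMinutes: 15,
  resolutionTargetMinutes: 120,
};

function minutesBetween(a: Date, b: Date): number {
  return (b.getTime() - a.getTime()) / 60000;
}

function evaluateSla(t: IncidentTimestamps, sla: SlaTargets) {
  return {
    ackMet: minutesBetween(t.declaredAt, t.acknowledgedAt) <= sla.acknowledgeMinutes,
    responseMet: minutesBetween(t.declaredAt, t.responseStartedAt) <= sla.responseMinutes,
    resolutionTargetMet: minutesBetween(t.declaredAt, t.resolvedAt) <= sla.resolutionTargetMinutes,
  };
}

// Example: fast acknowledgement and response, but resolution ran long.
console.log(evaluateSla({
  declaredAt: new Date('2024-03-01T02:00:00Z'),
  acknowledgedAt: new Date('2024-03-01T02:03:00Z'),
  responseStartedAt: new Date('2024-03-01T02:10:00Z'),
  resolvedAt: new Date('2024-03-01T05:30:00Z'),
}, SEV1_TARGETS));
// { ackMet: true, responseMet: true, resolutionTargetMet: false }
```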
Incidents evolve. What starts as a minor issue may escalate into a critical outage; what looks catastrophic may turn out to be limited in scope. Severity should reflect current reality, not initial assessment.
When to Escalate Severity
Increase severity when: impact turns out to be larger than the initial assessment (more users, more functionality, or more revenue affected); the situation is worsening or spreading to other services; or investigation uncovers data, security, or regulatory implications.
When to De-escalate Severity
Decrease severity when: mitigation is in place and impact is confirmed to have dropped; investigation shows the scope is smaller than first feared; or a workaround restores functionality for most users and the situation is stable.
The Severity Change Protocol
Announce the Change: Post in incident channel: "Escalating this to SEV-1. Error rate now at 15% and rising."
Communicate Rationale: Briefly explain why severity changed: "Impact larger than expected—affecting checkout, not just cart."
Trigger Appropriate Response: Escalation should automatically trigger the additional notifications and resources the new level requires (see the sketch after this list).
Update External Communications: Status page should reflect new severity if it affects customer impact statement.
Adjust Expectations: New severity means new response expectations—remind responders of updated cadence.
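Automating the notification step keeps an escalation from depending on someone remembering who else to page. The sketch below is a hypothetical handler, assuming an escalation map like the configuration earlier; announce() and page() stand in for whatever chat and paging integrations you actually use.

```typescript
// Hypothetical severity-change handler. The escalation map mirrors the
// "immediate" escalation lists in the configuration above.
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

const IMMEDIATE_ESCALATION: Record<Severity, string[]> = {
  'SEV-1': ['primary_oncall', 'secondary_oncall', 'engineering_manager'],
  'SEV-2': ['primary_oncall'],
  'SEV-3': ['primary_oncall'],
  'SEV-4': ['ticket_queue'],
};

function announce(channel: string, message: string): void {
  console.log(`[${channel}] ${message}`);
}

function page(target: string, incidentId: string): void {
  console.log(`Paging ${target} for ${incidentId}`);
}

function changeSeverity(
  incidentId: string,
  channel: string,
  from: Severity,
  to: Severity,
  rationale: string,
): void {
  // Steps 1-2: announce the change and the reason in the incident channel.
  announce(channel, `Severity changed ${from} -> ${to}: ${rationale}`);

  // Step 3: page anyone required at the new level who wasn't already engaged.
  const alreadyPaged = new Set(IMMEDIATE_ESCALATION[from]);
  for (const target of IMMEDIATE_ESCALATION[to]) {
    if (!alreadyPaged.has(target)) page(target, incidentId);
  }

  // Steps 4-5 (status page update, new cadence reminder) would hook in here.
  announce(channel, `Update cadence and expectations now follow ${to}.`);
}

// Example: escalating after impact turns out larger than expected.
changeSeverity('INC-123', '#inc-123', 'SEV-2', 'SEV-1',
  'Error rate now at 15% and rising; checkout affected, not just cart');
```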
Cultural Considerations
Make severity changes routine, not controversial: any responder should feel comfortable proposing a change, initial classifications should be treated as best guesses rather than commitments, and adjusting severity should be seen as good calibration, not as an admission that the first call was wrong.
Escalation and de-escalation aren't symmetric. Escalation should be quick and easy—any responder can propose it. De-escalation should be more deliberate—confirm the situation is truly improved before reducing response level. It's much harder to re-mobilize after de-escalation than to simply maintain elevated response.
What happens when you have two SEV-1 incidents simultaneously? Or when a SEV-1 strikes while you're still resolving a SEV-2? Concurrent incidents require explicit prioritization beyond simple severity levels.
Priority vs. Severity
Severity describes impact; priority describes response order. Severity answers "how bad is this?" while priority answers "what do we work on first?"
Two SEV-1 incidents have equal severity but may have different priority based on factors like revenue at stake, data or regulatory exposure, the number of users affected, whether the situation is still escalating, and how close each incident is to resolution.
Prioritization Factors
When facing concurrent incidents, weigh these factors explicitly rather than by gut feel; the priority-scoring example below shows one way to combine them into an ordering.
Resource Splitting Strategies
Options include splitting responders across incidents, rotating a single team's focus between them, or deliberately finishing the higher-priority incident before starting the next. For resource-constrained teams, sequential focus is often better than fragmented parallel attention.
```typescript
/**
 * Incident Priority Scoring
 *
 * When multiple incidents have the same severity,
 * this scoring helps determine priority order.
 */

interface Incident {
  id: string;
  severity: 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';
  affectedUsers: number;
  revenueImpactPerHour: number;
  dataAtRisk: boolean;
  regulatoryImplications: boolean;
  durationMinutes: number;
  estimatedResolutionMinutes: number;
  isEscalating: boolean;
}

interface PriorityScore {
  incidentId: string;
  score: number;
  primaryFactor: string;
}

function calculatePriorityScore(incident: Incident): PriorityScore {
  let score = 0;
  let primaryFactor = '';

  // Base score from severity
  const severityScores = {
    'SEV-1': 10000,
    'SEV-2': 1000,
    'SEV-3': 100,
    'SEV-4': 10,
  };
  score += severityScores[incident.severity];

  // Data and regulatory trump all else
  if (incident.regulatoryImplications) {
    score += 50000;
    primaryFactor = 'Regulatory implications';
  }
  if (incident.dataAtRisk) {
    score += 25000;
    if (!primaryFactor) primaryFactor = 'Data at risk';
  }

  // Revenue impact (normalized to score range)
  const revenueScore = Math.min(incident.revenueImpactPerHour / 1000, 5000);
  score += revenueScore;

  // User impact (log scale to not over-weight large numbers)
  const userScore = Math.log10(Math.max(incident.affectedUsers, 1)) * 500;
  score += userScore;

  // Escalating incidents get priority boost
  if (incident.isEscalating) {
    score += 2000;
    if (!primaryFactor) primaryFactor = 'Escalating situation';
  }

  // Duration penalty - longer incidents may need fresh attention
  // But also consider if resolution is close
  const resolutionProgress =
    incident.durationMinutes /
    (incident.durationMinutes + incident.estimatedResolutionMinutes);
  if (resolutionProgress > 0.8) {
    // Close to resolution - slight priority boost for quick win
    score += 500;
    if (!primaryFactor) primaryFactor = 'Near resolution';
  } else if (incident.durationMinutes > 60) {
    // Prolonged incident without progress
    score += 1000;
    if (!primaryFactor) primaryFactor = 'Extended duration';
  }

  if (!primaryFactor) {
    if (revenueScore > userScore) {
      primaryFactor = `Revenue impact ($${incident.revenueImpactPerHour}/hr)`;
    } else {
      primaryFactor = `User impact (${incident.affectedUsers} users)`;
    }
  }

  return {
    incidentId: incident.id,
    score,
    primaryFactor,
  };
}

function prioritizeIncidents(incidents: Incident[]): PriorityScore[] {
  const scored = incidents.map(i => calculatePriorityScore(i));
  return scored.sort((a, b) => b.score - a.score);
}

// Example usage
const incidents: Incident[] = [
  {
    id: 'INC-001',
    severity: 'SEV-1',
    affectedUsers: 50000,
    revenueImpactPerHour: 75000,
    dataAtRisk: false,
    regulatoryImplications: false,
    durationMinutes: 25,
    estimatedResolutionMinutes: 30,
    isEscalating: false,
  },
  {
    id: 'INC-002',
    severity: 'SEV-1',
    affectedUsers: 5000,
    revenueImpactPerHour: 10000,
    dataAtRisk: true, // Data at risk trumps revenue
    regulatoryImplications: false,
    durationMinutes: 10,
    estimatedResolutionMinutes: 60,
    isEscalating: true,
  },
];

const prioritized = prioritizeIncidents(incidents);
console.log('Priority order:', prioritized);
// INC-002 ranked higher due to data at risk, despite lower revenue/user impact
```

Sometimes multiple incidents are related (cascading failure). Before prioritizing them separately, ask: "Could these be the same root cause manifesting differently?" Fixing one might fix both. Conversely, treating symptoms separately while ignoring root cause extends all incidents.
Severity frameworks only work if the organization agrees on them. Misaligned expectations—where engineers classify SEV-2 and executives expect SEV-1 response—create friction and erode trust. Building alignment requires explicit discussion and ongoing calibration.
Stakeholder Alignment Process
Define Together: Engineering, product, support, and executives should collaborate on severity definitions. Each group has different perspectives on impact.
Use Real Examples: Abstract criteria are ambiguous. Calibrate with: "Last month's payment outage—was that SEV-1 or SEV-2?" Concrete examples expose disagreements.
Document Rationale: Don't just document criteria; explain why each dimension matters and how it maps to business impact.
Train Consistently: Everyone who might declare or respond to incidents should receive severity training. Include realistic scenarios and practice classification.
Review Regularly: In post-mortems, evaluate whether severity was accurate. If consistently over/under-classified, refine criteria.
The Severity Calibration Meeting
Quarterly, bring stakeholders together to review severity accuracy:
Agenda: walk through incidents where severity was disputed or changed mid-incident, compare the quarter's severity distribution against the expected mix, identify criteria that produced inconsistent calls, and agree on specific wording changes to the definitions.
The goal isn't to blame responders but to calibrate the system. Classification disagreements reveal ambiguous criteria that need clarification.
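A lightweight input for that meeting is a comparison of initial versus final severity for each incident. The sketch below assumes you record both values; the field names and summary shape are illustrative, not a prescribed format.

```typescript
// Sketch: summarize how often initial severity matched final severity,
// as input to the quarterly calibration review. Field names are assumptions.
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

interface ClassifiedIncident {
  id: string;
  initialSeverity: Severity;
  finalSeverity: Severity;
}

// Lower rank = more severe.
const RANK: Record<Severity, number> = { 'SEV-1': 1, 'SEV-2': 2, 'SEV-3': 3, 'SEV-4': 4 };

function calibrationSummary(incidents: ClassifiedIncident[]) {
  let accurate = 0;
  let underClassified = 0; // initially called less severe than it turned out to be
  let overClassified = 0;  // initially called more severe than it turned out to be
  for (const i of incidents) {
    if (i.initialSeverity === i.finalSeverity) accurate += 1;
    else if (RANK[i.initialSeverity] > RANK[i.finalSeverity]) underClassified += 1;
    else overClassified += 1;
  }
  const total = Math.max(incidents.length, 1);
  return {
    accuracy: accurate / total,
    underClassified,
    overClassified,
    reviewCandidates: incidents
      .filter(i => i.initialSeverity !== i.finalSeverity)
      .map(i => i.id),
  };
}

// Example input for a quarter's review.
console.log(calibrationSummary([
  { id: 'INC-101', initialSeverity: 'SEV-2', finalSeverity: 'SEV-2' },
  { id: 'INC-102', initialSeverity: 'SEV-3', finalSeverity: 'SEV-1' }, // under-classified
  { id: 'INC-103', initialSeverity: 'SEV-1', finalSeverity: 'SEV-3' }, // over-classified
]));
```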
Create a one-page severity decision guide that responders can reference during triage. Include the criteria matrix, escalation paths, and common examples. Make it accessible from your incident management tool. During 3 AM pages, responders shouldn't have to hunt for classification guidance.
Standard severity frameworks handle most incidents well, but some situations require special consideration. These edge cases test your framework's completeness.
Security Incidents
Security events often follow different severity logic: an active attack or confirmed breach is SEV-1 regardless of how many users are visibly affected, and even a suspected compromise may warrant elevated response while the scope is still unknown.
Data Incidents
Data problems—loss, corruption, exposure—often warrant higher severity than equivalent availability issues: an outage ends when service is restored, but lost or corrupted data may be unrecoverable, and exposure can carry regulatory consequences long after the incident closes.
| Scenario | Standard Classification | Special Consideration |
|---|---|---|
| Internal tool outage | SEV-3 (limited user impact) | May be SEV-1 if blocks operations (deploy pipeline during incident) |
| Single VIP customer affected | SEV-3 (one customer) | May be SEV-2 for enterprise customer with SLA commitments |
| Slow degradation over days | SEV-4 (minor impact) | May need escalation if trend predicts eventual failure |
| Partial fix stabilizes situation | Maintain original severity | Consider de-escalation to allow focused follow-up |
| Scheduled maintenance overruns | Not an incident initially | Becomes SEV-2/3 when exceeds planned window significantly |
| Third-party dependency outage | Based on customer impact | May limit mitigation options; track separately for dependency analysis |
| Monitoring/observability failure | SEV-3 (no direct user impact) | May be SEV-2 if it impairs incident detection capability |
The 'Unknown' Severity
Sometimes you don't have enough information to classify accurately. Rather than guess incorrectly:
Start High, Adjust Down: If uncertain, declare at higher severity and downgrade once you have data. This ensures adequate response.
Time-Box Uncertainty: "I'm declaring SEV-2 pending investigation. If impact is confirmed smaller in 15 minutes, we'll downgrade." (A minimal sketch of this approach follows this list.)
Log Uncertainty: Note in incident record that initial classification was uncertain. This helps post-mortem learning.
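The time-boxed approach can be as simple as a reminder attached to the declaration. The sketch below is a hypothetical helper: it declares at the more severe of two candidate levels, logs the uncertainty, and schedules a re-evaluation prompt; remind() stands in for whatever notification hook your tooling provides.

```typescript
// Hypothetical helper for time-boxed classification under uncertainty.
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

// Lower rank = more severe.
const RANK: Record<Severity, number> = { 'SEV-1': 1, 'SEV-2': 2, 'SEV-3': 3, 'SEV-4': 4 };

function remind(incidentId: string, message: string): void {
  console.log(`[reminder][${incidentId}] ${message}`);
}

function declareWithTimeBox(
  incidentId: string,
  candidates: [Severity, Severity],
  reviewMinutes: number,
): Severity {
  // Start high: pick the more severe of the two plausible classifications.
  const declared = RANK[candidates[0]] <= RANK[candidates[1]] ? candidates[0] : candidates[1];
  console.log(
    `${incidentId}: declared ${declared} pending investigation ` +
    `(uncertain between ${candidates[0]} and ${candidates[1]})`,
  );

  // Time-box: schedule an explicit re-evaluation instead of letting it drift.
  setTimeout(
    () => remind(incidentId, `Re-evaluate severity: still ${declared}, or downgrade?`),
    reviewMinutes * 60 * 1000,
  );
  return declared;
}

// Example: unsure whether this is SEV-2 or SEV-3; start at SEV-2, review in 15 minutes.
declareWithTimeBox('INC-207', ['SEV-2', 'SEV-3'], 15);
```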
External Incidents
When the problem isn't yours (AWS outage, Stripe down), you still own the impact on your customers. Classify based on how your users are affected, communicate on your own status page, and track the incident against the dependency so it feeds later dependency analysis, even though your mitigation options may be limited.
Sometimes a minor-seeming issue reveals a critical systemic problem—the 'paper tower' that looks stable until you remove one sheet. If investigation reveals that a SEV-4 issue indicates much larger risk ('this could bring down the whole system under load'), escalate based on potential impact, not just current impact.
Severity classification is the control system that ensures incident response effort matches incident impact. Well-designed severity frameworks enable organizations to respond decisively to crises while avoiding exhausting resources on non-issues.
Module Complete:
This concludes the Incident Management module. You've learned how to detect incidents, respond with structured processes, maintain sustainable on-call practices, communicate effectively across audiences, and classify incidents appropriately. Together, these capabilities form a comprehensive incident management system that protects customers, responders, and business outcomes.
The next module explores Post-Mortems—how organizations learn from incidents to prevent recurrence and build more resilient systems.
You now understand the complete framework of incident management: from detection and response processes to on-call practices, communication strategies, and severity classification. These interconnected practices enable organizations to handle production incidents quickly, effectively, and sustainably. Incidents are inevitable—your response defines whether they're minor blips or major crises.