At 2:34 AM, the critical alert fired. The primary on-call engineer was unreachable—phone on silent, perhaps asleep too deeply, perhaps the app had a glitch. The alert aged.

Five minutes passed. Ten minutes. Fifteen minutes. The system continued to degrade. Users started abandoning transactions. Revenue was hemorrhaging.

There was no escalation policy.

No secondary on-call was notified. No supervisor was paged. No backup system existed. The company had invested millions in monitoring infrastructure that detected the problem within seconds—then did nothing useful with that detection for 47 minutes until someone happened to check their phone.

This is the gap that escalation policies fill: the space between detecting an incident and ensuring human response.
By the end of this page, you will understand how to design escalation policies that guarantee timely response, balance urgency with responder well-being, and adapt to different incident types and severities. You'll learn the mechanics of escalation chains, timeout configurations, and multi-tier response structures.
An escalation policy defines what happens when the initial response fails. It's the backup plan for the backup plan—a systematic approach to ensuring incidents receive attention regardless of individual failures.
Why Escalation Matters

Incident response has multiple failure points:

1. Human Availability: On-call may be asleep, in a meeting, or experiencing phone issues
2. Human Capability: The issue may exceed the on-call's expertise
3. Human Capacity: The on-call may be overwhelmed with concurrent incidents
4. Human Misjudgment: The severity may be underestimated initially

Escalation policies address each of these by defining when and how additional resources are engaged.

Escalation vs. Routing

It's important to distinguish escalation from initial routing:

| Routing | Escalation |
|---------|------------|
| Determines who receives the alert first | Determines what happens if the first responder doesn't resolve |
| Based on alert type, service, time of day | Based on response time, incident duration, severity |
| Usually static or schedule-based | Dynamic based on incident progression |
| One-time decision at alert creation | Ongoing process throughout the incident |
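To make the contrast concrete, here is a minimal sketch in Python. The service names, chain, and timings are illustrative assumptions, not part of any particular tool: routing is a one-time lookup at alert creation, while escalation is re-evaluated as the unacknowledged alert ages.

```python
# Routing: static, decided once when the alert is created.
ROUTES = {"payments-api": "payments-oncall", "auth-service": "platform-oncall"}

def route(service: str) -> str:
    """Who gets paged first for an alert on this service."""
    return ROUTES.get(service, "default-oncall")

# Escalation: dynamic, driven by how long the alert has gone unacknowledged.
ESCALATION_CHAIN = [(0, "primary-oncall"), (5, "secondary-oncall"), (15, "engineering-manager")]

def escalation_target(minutes_unacknowledged: int) -> str:
    """The most senior target that should have been paged by now."""
    target = ESCALATION_CHAIN[0][1]
    for delay_minutes, who in ESCALATION_CHAIN:
        if minutes_unacknowledged >= delay_minutes:
            target = who
    return target
```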
A healthy escalation culture treats escalation as a natural part of incident response, not a failure or punishment. The goal is effective resolution, not blame assignment. Responders should feel comfortable escalating when they need help rather than struggling alone.
Escalation tiers define who gets involved as an incident persists or escalates. The structure should match your organization's size, on-call capacity, and incident severity model.
The Classic Three-Tier Model

Most organizations benefit from a variation of this structure:

Tier 1: Primary On-Call
- First responder for all alerts
- Expected to acknowledge within minutes
- Has authority to resolve or escalate
- Usually one person per service or domain

Tier 2: Secondary On-Call / Specialist
- Engaged when primary is unavailable or needs help
- May have deeper expertise in specific areas
- Often a more senior engineer
- Can provide coverage across related services

Tier 3: Incident Commander / Management
- Engaged for extended or high-severity incidents
- Coordinates cross-team response
- Has authority to make organizational decisions
- Responsible for communication to stakeholders
| Severity | Tier 1 Timeout | Tier 2 Timeout | Tier 3 Timeout | Rationale |
|---|---|---|---|---|
| SEV1/Critical | 3 minutes | 5 minutes | 10 minutes | Maximum urgency; every minute matters |
| SEV2/High | 10 minutes | 20 minutes | 45 minutes | Urgent but brief delays acceptable |
| SEV3/Medium | 30 minutes | 2 hours | Next business day | Can wait for reasonable response |
| SEV4/Low | 4 hours | 8 hours | N/A | Low urgency, no management escalation |
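One way to see how these values get used: the sketch below turns the severity rows into data that an escalation engine could consult. The values are copied from the table; the cumulative-timeout reading and the helper name are assumptions.

```python
# Timeouts read as cumulative from alert creation; None means no automatic
# escalation to that tier.
from datetime import timedelta
from typing import Optional

ESCALATION_TIMEOUTS: dict[str, list[Optional[timedelta]]] = {
    "SEV1": [timedelta(minutes=3), timedelta(minutes=5), timedelta(minutes=10)],
    "SEV2": [timedelta(minutes=10), timedelta(minutes=20), timedelta(minutes=45)],
    "SEV3": [timedelta(minutes=30), timedelta(hours=2), None],  # Tier 3: next business day
    "SEV4": [timedelta(hours=4), timedelta(hours=8), None],     # No management escalation
}

def tiers_engaged(severity: str, unacknowledged_for: timedelta) -> int:
    """Number of tiers (1-3) that should have been paged so far."""
    engaged = 1  # Tier 1 is always paged immediately
    for timeout in ESCALATION_TIMEOUTS[severity]:
        if timeout is not None and unacknowledged_for >= timeout:
            engaged += 1
    return min(engaged, 3)  # Past the last timeout, the chain repeats or holds
```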
Specialized Escalation Paths

Different incident types may require different escalation structures:

Security Incidents
- Tier 1: Security Operations Center (SOC)
- Tier 2: Security Engineer On-Call
- Tier 3: CISO / Security Leadership
- Additional: Legal, PR for breach scenarios

Data Incidents
- Tier 1: Data Engineering On-Call
- Tier 2: Data Platform Lead
- Tier 3: Data Protection Officer
- Additional: Compliance, Legal for GDPR/Privacy issues

Customer-Facing Outages
- Tier 1: Service Owner On-Call
- Tier 2: Engineering Manager
- Tier 3: VP Engineering / CTO
- Additional: Customer Success, Communications
```yaml
# PagerDuty-style escalation policy configuration
escalation_policies:
  # Standard service escalation
  - name: "default-service-escalation"
    description: "Standard on-call escalation for production services"
    repeat_limit: 2  # Repeat entire chain twice if still unacked
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "schedule"
            id: "primary-oncall-schedule"
        notify_methods: ["push", "sms", "phone"]
      - escalation_delay_in_minutes: 5
        targets:
          - type: "schedule"
            id: "secondary-oncall-schedule"
        notify_methods: ["push", "sms", "phone"]
      - escalation_delay_in_minutes: 15
        targets:
          - type: "user"
            id: "engineering-manager"
          - type: "user"
            id: "team-lead"
        notify_methods: ["phone", "sms"]
      - escalation_delay_in_minutes: 30
        targets:
          - type: "user"
            id: "vp-engineering"
        notify_methods: ["phone"]

  # Critical path - accelerated escalation
  - name: "critical-infrastructure-escalation"
    description: "Accelerated escalation for SEV1 incidents"
    services: ["database-primary", "payment-gateway", "auth-service"]
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "schedule"
            id: "platform-oncall-schedule"
        notify_methods: ["push", "sms", "phone"]
      - escalation_delay_in_minutes: 3  # Faster escalation
        targets:
          - type: "schedule"
            id: "platform-secondary-schedule"
          - type: "schedule"
            id: "sre-oncall-schedule"  # Parallel notification
        notify_methods: ["phone", "sms"]
      - escalation_delay_in_minutes: 8
        targets:
          - type: "user"
            id: "platform-director"
          - type: "user"
            id: "incident-commander-oncall"
        notify_methods: ["phone"]

  # Security incident path
  - name: "security-incident-escalation"
    description: "Security-specific escalation path"
    trigger_on_labels:
      category: "security"
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "schedule"
            id: "soc-oncall-schedule"
        notify_methods: ["push", "sms"]
      - escalation_delay_in_minutes: 5
        targets:
          - type: "schedule"
            id: "security-engineer-oncall"
        notify_methods: ["phone", "sms"]
      - escalation_delay_in_minutes: 15
        targets:
          - type: "user"
            id: "ciso"
        notify_methods: ["phone"]
        additional_context: "Include legal@company.com in comms"
```

How you notify responders can be as important as who you notify. Different channels have different reliability, intrusiveness, and appropriateness for various situations.
| Channel | Intrusiveness | Reliability | Best For | Limitations |
|---|---|---|---|---|
| Phone Call | Very High | High | Critical/SEV1, unreachable responders | Disruptive, may not work internationally |
| SMS | High | High | Urgent alerts, backup to app push | Length limits, carrier delays possible |
| Push Notification | Medium | Medium | Standard on-call alerts | Requires app, phone must be online |
| Slack/Teams | Low | Medium | Team awareness, SEV3+ | Easy to miss, requires checking |
| Email | Very Low | High | Non-urgent, documentation | Not for real-time response |
Multi-Channel Notification

For critical alerts, use multiple channels simultaneously:

```
SEV1 Notification Sequence:

  T+0s       T+30s      T+60s      T+90s
   │           │           │           │
   ▼           ▼           ▼           ▼
  Push ───►   SMS  ───►   Call ───► Second Call
   └───────────┴───────────┴───────────┘
         Continue until acknowledged
```

This layered approach ensures the alert reaches the responder even if one channel fails.

Channel Selection by Context

Smart alerting systems can adapt channel selection based on context:

- Time of Day: Phone calls for night alerts; push for working hours
- Responder Preference: Some prefer SMS over push; respect preferences
- Alert History: If push rarely gets acknowledgment, escalate to call faster
- Responder Status: If marked 'in meeting', try text before calling
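Here is a hedged sketch of how the layering and context rules above might combine. The rules are illustrative, and `send` and `is_acknowledged` stand in for thin wrappers around whatever paging system is in use.

```python
import time

def channel_order(severity: str, hour: int, in_meeting: bool) -> list[str]:
    """Pick a notification sequence from context (illustrative rules only)."""
    if severity == "SEV1":
        return ["push", "sms", "phone", "phone"]   # Layer everything, repeat the call
    if in_meeting:
        return ["sms", "push", "phone"]            # Try text before calling
    if hour < 7 or hour >= 22:
        return ["phone", "sms", "push"]            # Night: most intrusive channel first
    return ["push", "sms", "phone"]                # Default during working hours

def notify_until_acknowledged(responder: str, severity: str, hour: int,
                              in_meeting: bool, send, is_acknowledged,
                              interval_seconds: int = 30) -> bool:
    """Returns True once acknowledged; False if every channel was exhausted."""
    for channel in channel_order(severity, hour, in_meeting):
        send(responder, channel)                   # Fire the next layer
        time.sleep(interval_seconds)               # Give it a moment to land
        if is_acknowledged():
            return True
    return False  # Caller escalates to the next tier
```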
Many missed pages happen because phones are on silent, Do Not Disturb is enabled, or the paging app lost notification permission. Require on-call engineers to configure their phones for alert receipt, and test bi-weekly that notifications actually reach them.
Escalation timing depends on what constitutes an adequate 'response'. Defining clear response stages with associated SLAs prevents ambiguity.
Response Stages

Acknowledgment: The responder indicates they've received and are aware of the alert. This stops immediate escalation but doesn't indicate resolution.

Engagement: The responder is actively investigating the issue. They've looked at dashboards, logs, or the affected system.

Triage: The responder has assessed severity and determined next steps—whether to resolve, escalate, or seek assistance.

Resolution: The incident is resolved and the system has returned to normal operation.

Each stage can have its own SLA and escalation triggers.
| Severity | Acknowledge SLA | Engage SLA | Triage SLA | Escalation Trigger |
|---|---|---|---|---|
| SEV1 | 3 min | 10 min | 30 min | Any SLA breach triggers next tier |
| SEV2 | 15 min | 30 min | 2 hours | Acknowledge or Engage miss triggers escalation |
| SEV3 | 60 min | 4 hours | 24 hours | Only Acknowledge miss triggers escalation |
| SEV4 | 4 hours | Next business day | 3 business days | No automatic escalation |
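To show how stage SLAs can drive escalation mechanically, here is a small sketch using the SEV1 and SEV2 rows above. The values come from the table; the function names and data shapes are assumptions, not a vendor API.

```python
from datetime import datetime, timedelta
from typing import Optional

STAGE_SLAS: dict[str, dict[str, timedelta]] = {
    "SEV1": {"acknowledge": timedelta(minutes=3),
             "engage": timedelta(minutes=10),
             "triage": timedelta(minutes=30)},
    "SEV2": {"acknowledge": timedelta(minutes=15),
             "engage": timedelta(minutes=30),
             "triage": timedelta(hours=2)},
}

def breached_stages(severity: str, opened_at: datetime,
                    stage_times: dict[str, Optional[datetime]],
                    now: datetime) -> list[str]:
    """Stages whose SLA was missed: either not reached by the deadline,
    or reached after it."""
    breaches = []
    for stage, sla in STAGE_SLAS.get(severity, {}).items():
        deadline = opened_at + sla
        reached_at = stage_times.get(stage)
        if reached_at is None and now > deadline:
            breaches.append(stage)          # Stage never happened in time
        elif reached_at is not None and reached_at > deadline:
            breaches.append(stage)          # Stage happened, but late
    return breaches
```

Per the Escalation Trigger column, a SEV1 incident would escalate to the next tier on any stage returned here, while a SEV3 would escalate only if the acknowledge stage appears.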
Acknowledgment Schemes

Different acknowledgment models suit different situations:

Simple Acknowledgment
- Any team member can acknowledge
- Escalation stops when acknowledged
- Works for small teams with shared ownership

Assigned Acknowledgment
- Only the assigned on-call can acknowledge
- Prevents 'ack and ignore' from non-responders
- Better accountability

Progressive Acknowledgment
- Acknowledge pauses escalation for N minutes
- Re-alerts if no resolution in that window
- Prevents acking then forgetting

Multi-Responder Acknowledgment
- Requires multiple people to acknowledge (for critical incidents)
- Ensures incident commander + technical responder both engaged
- Used for SEV1 only
```
// Progressive Acknowledgment Flow

WHEN alert fires:
    START escalation_timer (5 minutes)
    NOTIFY tier_1

WHEN acknowledgment received:
    IF from tier_1 responder:
        PAUSE escalation_timer
        START work_timer (30 minutes for SEV1, 2 hours for SEV2)
    ELSE:
        LOG "Non-assigned ack received"
        CONTINUE escalation_timer

WHEN work_timer expires:
    IF incident NOT resolved:
        NOTIFY tier_1: "Reminder: Incident still open"
        RESTART work_timer (15 minutes)
        INCREMENT reminder_count
        IF reminder_count >= 3:
            ESCALATE to tier_2
            INCLUDE note: "Tier 1 acknowledged but no resolution after {total_time}"

WHEN incident resolved:
    STOP all timers
    RECORD resolution_time
    NOTIFY stakeholders
    CLOSE escalation chain

// This prevents the "ack and forget" pattern while respecting
// that complex incidents take time to resolve
```

Many incidents require response from multiple teams or external parties. Designing escalation policies for these complex scenarios requires special consideration.
Cross-Team Scenario: The Dependency Alert

Your payment service is failing. Investigation reveals the root cause is in the upstream authentication service, owned by a different team.

Poor Escalation Pattern:
1. Payment on-call investigates for 30 minutes
2. Realizes it's an auth issue
3. Manually notifies auth team via Slack
4. Auth on-call doesn't see message for 15 minutes
5. Time wasted: 45+ minutes

Better Escalation Pattern:
1. Payment on-call investigates for 10 minutes
2. Recognizes cross-team dependency
3. Triggers built-in escalation: 'Escalate to dependency owner: Auth Service'
4. Auth on-call immediately paged with context: 'Escalated from Payment Service incident'
5. Both teams collaborate; incident commander coordinates
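The 'escalate to dependency owner' step works best when it is a first-class action in the paging tooling. A rough sketch follows, assuming a generic `pager` client and a hand-maintained dependency-to-policy map; both are hypothetical.

```python
from dataclasses import dataclass

# Assumed mapping from dependency service to the policy that pages its owners.
DEPENDENCY_POLICIES = {
    "auth-service": "auth-team-escalation",
    "billing-service": "billing-team-escalation",
}

@dataclass
class Incident:
    id: str
    title: str
    team: str
    investigation_summary: str

def escalate_to_dependency(incident: Incident, dependency: str, pager) -> None:
    """`pager` stands in for the paging tool's client; `trigger` is assumed."""
    policy = DEPENDENCY_POLICIES.get(dependency)
    if policy is None:
        raise ValueError(f"No escalation policy registered for {dependency}")
    pager.trigger(
        escalation_policy=policy,
        title=f"Cross-team escalation from {incident.team}: {incident.title}",
        details={
            "originating_incident": incident.id,
            "investigation_summary": incident.investigation_summary,
            "suspected_dependency": dependency,
        },
    )
```

The point of the helper is that the dependency team gets paged with context attached, rather than discovering a Slack message after the fact.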
External Escalation

Some incidents require escalating to vendors, partners, or providers:

Cloud Provider Outages
- Escalation path: Internal On-Call → Cloud Support Case → Account Manager → Executive Escalation
- Have enterprise support contracts that enable fast escalation
- Know how to file 'Urgent' or 'Business Critical' cases

Third-Party Service Degradation
- Escalation path: Check status page → API support channel → Partner engineering contact
- Maintain contact lists for partners with different severity levels
- Some agreements include specific escalation SLAs

Vendor Issues Affecting Production
- Document escalation procedures for each critical vendor
- Include backup vendor activation as part of escalation
- Have pre-negotiated support levels that provide access to engineering, not just front-line support
```yaml
# External escalation paths for vendor dependencies
external_escalation:
  aws:
    name: "Amazon Web Services"
    services: ["EC2", "RDS", "S3", "Lambda", "EKS"]
    tier_1:
      channel: "AWS Support Center"
      url: "https://console.aws.amazon.com/support"
      case_type: "Technical"
      severity: "Urgent (Production system down)"
      sla: "15 min response for Enterprise support"
    tier_2:
      channel: "Technical Account Manager"
      contact: "tam@aws.example.com"  # Placeholder
      phone: "+1-xxx-xxx-xxxx"
      when: "Tier 1 unresponsive after 30 min OR SEV1 incidents"
    tier_3:
      channel: "Executive Escalation"
      contact: "executive-escalations-aws@company.internal"
      when: "Major outage with business impact > $X/hour"
      requires: "VP or C-level approval"

  stripe:
    name: "Stripe Payment Processing"
    services: ["payment-intent", "charges", "refunds"]
    tier_1:
      channel: "Stripe Dashboard Support"
      url: "https://dashboard.stripe.com/support"
      escalation_email: "urgent@stripe.com"  # If available
      sla: "1 hour for standard, 15 min for critical"
    tier_2:
      channel: "Assigned Partner Engineer"
      contact: "partner-engineer@stripe.example.com"
      when: "Tier 1 unresponsive OR suspected platform issue"
    fallback:
      action: "Enable PayPal fallback"
      runbook: "runbook.internal/payment-fallback-paypal"
      auto_trigger_threshold: "Payment success rate < 50% for 5 min"

  datadog:
    name: "Datadog Monitoring"
    services: ["metrics", "logs", "apm", "synthetics"]
    tier_1:
      channel: "Datadog Chat Support"
      url: "In-app chat"
      sla: "Varies by plan"
    notes: |
      If Datadog is down, we lose alerting visibility.
      Fallback: Direct CloudWatch alarms, PagerDuty direct integrations
      Check status.datadoghq.com for known issues first
```

For critical vendors, establish relationships before you need them. Meet your AWS TAM during non-emergencies. Have your Stripe partner engineer on Slack. When an incident happens, a warm relationship means faster response than a cold support ticket.
Common escalation mistakes undermine incident response. Recognizing these anti-patterns helps you design better policies.
Escalation should be neither too easy nor too hard: easy enough that responders don't struggle alone with overwhelming incidents, yet hard enough that each tier genuinely attempts resolution before escalating. The balance point varies by incident type and team maturity.
Let's synthesize the principles into a practical framework for designing escalation policies for your organization.
Step 1: Map Your Incident Types

Identify the categories of incidents you handle:
- Production outages (service level)
- Performance degradation
- Security incidents
- Data issues
- Customer-impacting bugs
- Infrastructure failures

Each may need a specialized escalation path.

Step 2: Define Severity for Each Type

Create a severity matrix that's specific and unambiguous:
- What makes a database issue SEV1 vs. SEV2?
- How do we assess impact for different services?
- Who has authority to assign/change severity?

Step 3: Identify Responder Pools

For each incident type:
- Who is Tier 1? (Usually domain-specific on-call)
- Who is Tier 2? (Senior engineer, specialist, or cross-trained peer)
- Who is Tier 3? (Management, incident commander, or exec)
- What external escalation paths exist?

Step 4: Set Timeout Values

Balance urgency against practical response times:
- What can someone reasonably acknowledge in the middle of the night?
- How long should we wait before assuming no response?
- Different times for different severities

Step 5: Configure Notification Channels

For each tier and severity:
- What channels are used?
- In what order or combination?
- How do we verify delivery?

Step 6: Document and Train

Write it down and make sure everyone knows:
- Escalation policy documentation
- Training for new on-call engineers
- Regular drills and reviews
```yaml
# Complete escalation policy for a mid-sized SaaS company
organization:
  name: "Acme Corp"
  timezone: "America/Los_Angeles"
  on_call_tool: "PagerDuty"

severity_definitions:
  SEV1:
    description: "Complete outage or >50% of users affected"
    examples:
      - "Core API returning 500s"
      - "Database primary offline"
      - "Security breach detected"
    initial_response: "Page immediately, all hands"
  SEV2:
    description: "Significant degradation, 10-50% users affected"
    examples:
      - "Elevated latency across services"
      - "One region unavailable"
      - "Key feature broken"
    initial_response: "Page on-call, prepare for escalation"
  SEV3:
    description: "Minor issue, <10% users affected, workarounds exist"
    examples:
      - "Non-critical feature broken"
      - "Single customer issue"
      - "Performance degradation in non-peak hours"
    initial_response: "Ticket, address within 4 hours"

escalation_policies:
  production_service:
    applies_to:
      services: ["api", "web", "mobile-backend", "worker"]

    sev1_escalation:
      tier_1:
        team: "service-owning-team-oncall"
        timeout: 3m
        notify: ["push", "sms", "phone"]
      tier_2:
        team: "platform-oncall"
        timeout: 5m
        notify: ["phone", "sms"]
      tier_3:
        role: "incident-commander-rotation"
        timeout: 10m
        notify: ["phone"]
        action: "Open incident bridge, notify leadership"
      tier_4:
        role: "engineering-leadership"
        timeout: 20m
        notify: ["phone"]
        includes: ["VP Engineering", "CTO"]

    sev2_escalation:
      tier_1:
        team: "service-owning-team-oncall"
        timeout: 10m
        notify: ["push", "sms"]
      tier_2:
        team: "service-owning-team-secondary"
        timeout: 20m
        notify: ["phone", "sms"]
      tier_3:
        role: "engineering-manager"
        timeout: 45m
        notify: ["phone"]

    sev3_escalation:
      tier_1:
        team: "service-owning-team-oncall"
        timeout: 60m
        notify: ["push", "slack"]
      tier_2:
        team: "service-owning-team-secondary"
        timeout: 4h
        notify: ["sms"]

acknowledgment_rules:
  sev1:
    required_from: "assigned on-call only"
    ack_timeout: 3m
    work_check_timeout: 15m  # Re-ping if no progress
  sev2:
    required_from: "any team member"
    ack_timeout: 10m
    work_check_timeout: 30m
  sev3:
    required_from: "any team member"
    ack_timeout: 60m
    work_check_timeout: null  # No automatic re-ping

cross_team_escalation:
  enabled: true
  method: "One-click escalate to dependency owner"
  context_required:
    - "Original alert details"
    - "Investigation summary"
    - "Suspected root cause"
  notification: "Cross-team escalation from {originating_team} re: {incident_title}"

external_escalation:
  aws:
    tier_1: "Support case, Urgent severity"
    tier_2: "Technical Account Manager"
    contact_info: "See external contacts runbook"
  stripe:
    tier_1: "Dashboard support chat"
    fallback: "Enable PayPal processor"
    runbook: "link/to/payment-fallback"

review_schedule:
  frequency: "Monthly"
  attendees:
    - "On-call leads from each team"
    - "Incident Commander rotation"
    - "Engineering leadership representative"
  agenda:
    - "Review escalation events from past month"
    - "Assess timeout appropriateness"
    - "Update contact information"
    - "Drill one escalation scenario"
```

Escalation policies transform alerting from 'fire and hope' to guaranteed response. They're the final safety net that ensures incidents receive the attention they require.
What's Next:

Escalation gets responders engaged. But what do they do once engaged? The next page explores runbook integration—how to connect alerts with actionable documentation that guides responders through diagnosis and remediation.
You now understand how to design escalation policies that guarantee response, balance urgency with sustainability, and adapt to incident complexity. The key insight: detecting an incident is worthless without ensuring human engagement. Escalation policies bridge that gap.