At 2:34 AM, the critical alert fired. The primary on-call engineer was unreachable—phone on silent, perhaps asleep too deeply, perhaps the app had a glitch. The alert aged.

Five minutes passed. Ten minutes. Fifteen minutes. The system continued to degrade. Users started abandoning transactions. Revenue was hemorrhaging.

There was no escalation policy.

No secondary on-call was notified. No supervisor was paged. No backup system existed. The company had invested millions in monitoring infrastructure that detected the problem within seconds—then did nothing useful with that detection for 47 minutes until someone happened to check their phone.

This is the gap that escalation policies fill: the space between detecting an incident and ensuring human response.
By the end of this page, you will understand how to design escalation policies that guarantee timely response, balance urgency with responder well-being, and adapt to different incident types and severities. You'll learn the mechanics of escalation chains, timeout configurations, and multi-tier response structures.
An escalation policy defines what happens when the initial response fails. It's the backup plan for the backup plan—a systematic approach to ensuring incidents receive attention regardless of individual failures.
Why Escalation Matters

Incident response has multiple failure points:

1. Human Availability: On-call may be asleep, in a meeting, or experiencing phone issues
2. Human Capability: The issue may exceed the on-call's expertise
3. Human Capacity: The on-call may be overwhelmed with concurrent incidents
4. Human Misjudgment: The severity may be underestimated initially

Escalation policies address each of these by defining when and how additional resources are engaged.

Escalation vs. Routing

It's important to distinguish escalation from initial routing:

| Routing | Escalation |
|---------|------------|
| Determines who receives the alert first | Determines what happens if the first responder doesn't resolve |
| Based on alert type, service, time of day | Based on response time, incident duration, severity |
| Usually static or schedule-based | Dynamic based on incident progression |
| One-time decision at alert creation | Ongoing process throughout the incident |
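To make the contrast concrete, here is a minimal sketch in Python. The service names, chain, and timings are illustrative assumptions, not part of any particular tool: routing is a one-time lookup at alert creation, while escalation is re-evaluated as the unacknowledged alert ages.

```python
# Routing: static, decided once when the alert is created.
ROUTES = {"payments-api": "payments-oncall", "auth-service": "platform-oncall"}

def route(service: str) -> str:
    """Who gets paged first for an alert on this service."""
    return ROUTES.get(service, "default-oncall")

# Escalation: dynamic, driven by how long the alert has gone unacknowledged.
ESCALATION_CHAIN = [(0, "primary-oncall"), (5, "secondary-oncall"), (15, "engineering-manager")]

def escalation_target(minutes_unacknowledged: int) -> str:
    """The most senior target that should have been paged by now."""
    target = ESCALATION_CHAIN[0][1]
    for delay_minutes, who in ESCALATION_CHAIN:
        if minutes_unacknowledged >= delay_minutes:
            target = who
    return target
```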
A healthy escalation culture treats escalation as a natural part of incident response, not a failure or punishment. The goal is effective resolution, not blame assignment. Responders should feel comfortable escalating when they need help rather than struggling alone.
Escalation tiers define who gets involved as an incident persists or escalates. The structure should match your organization's size, on-call capacity, and incident severity model.
The Classic Three-Tier Model

Most organizations benefit from a variation of this structure:

Tier 1: Primary On-Call
- First responder for all alerts
- Expected to acknowledge within minutes
- Has authority to resolve or escalate
- Usually one person per service or domain

Tier 2: Secondary On-Call / Specialist
- Engaged when primary is unavailable or needs help
- May have deeper expertise in specific areas
- Often a more senior engineer
- Can provide coverage across related services

Tier 3: Incident Commander / Management
- Engaged for extended or high-severity incidents
- Coordinates cross-team response
- Has authority to make organizational decisions
- Responsible for communication to stakeholders
| Severity | Tier 1 Timeout | Tier 2 Timeout | Tier 3 Timeout | Rationale |
|---|---|---|---|---|
| SEV1/Critical | 3 minutes | 5 minutes | 10 minutes | Maximum urgency; every minute matters |
| SEV2/High | 10 minutes | 20 minutes | 45 minutes | Urgent but brief delays acceptable |
| SEV3/Medium | 30 minutes | 2 hours | Next business day | Can wait for reasonable response |
| SEV4/Low | 4 hours | 8 hours | N/A | Low urgency, no management escalation |
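One way to see how these values get used: the sketch below turns the severity rows into data that an escalation engine could consult. The values are copied from the table; the cumulative-timeout reading and the helper name are assumptions.

```python
# Timeouts read as cumulative from alert creation; None means no automatic
# escalation to that tier.
from datetime import timedelta
from typing import Optional

ESCALATION_TIMEOUTS: dict[str, list[Optional[timedelta]]] = {
    "SEV1": [timedelta(minutes=3), timedelta(minutes=5), timedelta(minutes=10)],
    "SEV2": [timedelta(minutes=10), timedelta(minutes=20), timedelta(minutes=45)],
    "SEV3": [timedelta(minutes=30), timedelta(hours=2), None],  # Tier 3: next business day
    "SEV4": [timedelta(hours=4), timedelta(hours=8), None],     # No management escalation
}

def tiers_engaged(severity: str, unacknowledged_for: timedelta) -> int:
    """Number of tiers (1-3) that should have been paged so far."""
    engaged = 1  # Tier 1 is always paged immediately
    for timeout in ESCALATION_TIMEOUTS[severity]:
        if timeout is not None and unacknowledged_for >= timeout:
            engaged += 1
    return min(engaged, 3)  # Past the last timeout, the chain repeats or holds
```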
Specialized Escalation Paths

Different incident types may require different escalation structures:

Security Incidents
- Tier 1: Security Operations Center (SOC)
- Tier 2: Security Engineer On-Call
- Tier 3: CISO / Security Leadership
- Additional: Legal, PR for breach scenarios

Data Incidents
- Tier 1: Data Engineering On-Call
- Tier 2: Data Platform Lead
- Tier 3: Data Protection Officer
- Additional: Compliance, Legal for GDPR/Privacy issues

Customer-Facing Outages
- Tier 1: Service Owner On-Call
- Tier 2: Engineering Manager
- Tier 3: VP Engineering / CTO
- Additional: Customer Success, Communications
```yaml
# PagerDuty-style escalation policy configuration
escalation_policies:
  # Standard service escalation
  - name: "default-service-escalation"
    description: "Standard on-call escalation for production services"
    repeat_limit: 2  # Repeat entire chain twice if still unacked
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "schedule"
            id: "primary-oncall-schedule"
        notify_methods: ["push", "sms", "phone"]
      - escalation_delay_in_minutes: 5
        targets:
          - type: "schedule"
            id: "secondary-oncall-schedule"
        notify_methods: ["push", "sms", "phone"]
      - escalation_delay_in_minutes: 15
        targets:
          - type: "user"
            id: "engineering-manager"
          - type: "user"
            id: "team-lead"
        notify_methods: ["phone", "sms"]
      - escalation_delay_in_minutes: 30
        targets:
          - type: "user"
            id: "vp-engineering"
        notify_methods: ["phone"]

  # Critical path - accelerated escalation
  - name: "critical-infrastructure-escalation"
    description: "Accelerated escalation for SEV1 incidents"
    services: ["database-primary", "payment-gateway", "auth-service"]
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "schedule"
            id: "platform-oncall-schedule"
        notify_methods: ["push", "sms", "phone"]
      - escalation_delay_in_minutes: 3  # Faster escalation
        targets:
          - type: "schedule"
            id: "platform-secondary-schedule"
          - type: "schedule"
            id: "sre-oncall-schedule"  # Parallel notification
        notify_methods: ["phone", "sms"]
      - escalation_delay_in_minutes: 8
        targets:
          - type: "user"
            id: "platform-director"
          - type: "user"
            id: "incident-commander-oncall"
        notify_methods: ["phone"]

  # Security incident path
  - name: "security-incident-escalation"
    description: "Security-specific escalation path"
    trigger_on_labels:
      category: "security"
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "schedule"
            id: "soc-oncall-schedule"
        notify_methods: ["push", "sms"]
      - escalation_delay_in_minutes: 5
        targets:
          - type: "schedule"
            id: "security-engineer-oncall"
        notify_methods: ["phone", "sms"]
      - escalation_delay_in_minutes: 15
        targets:
          - type: "user"
            id: "ciso"
        notify_methods: ["phone"]
        additional_context: "Include legal@company.com in comms"
```

How you notify responders can be as important as who you notify. Different channels have different reliability, intrusiveness, and appropriateness for various situations.
| Channel | Intrusiveness | Reliability | Best For | Limitations |
|---|---|---|---|---|
| Phone Call | Very High | High | Critical/SEV1, unreachable responders | Disruptive, may not work internationally |
| SMS | High | High | Urgent alerts, backup to app push | Length limits, carrier delays possible |
| Push Notification | Medium | Medium | Standard on-call alerts | Requires app, phone must be online |
| Slack/Teams | Low | Medium | Team awareness, SEV3+ | Easy to miss, requires checking |
| Email | Very Low | High | Non-urgent, documentation | Not for real-time response |
Multi-Channel Notification

For critical alerts, use multiple channels simultaneously:

```
SEV1 Notification Sequence:

  T+0s       T+30s      T+60s      T+90s
   │           │           │           │
   ▼           ▼           ▼           ▼
  Push ───►   SMS  ───►   Call ───► Second Call
   └───────────┴───────────┴───────────┘
         Continue until acknowledged
```

This layered approach ensures the alert reaches the responder even if one channel fails.

Channel Selection by Context

Smart alerting systems can adapt channel selection based on context:

- Time of Day: Phone calls for night alerts; push for working hours
- Responder Preference: Some prefer SMS over push; respect preferences
- Alert History: If push rarely gets acknowledgment, escalate to call faster
- Responder Status: If marked 'in meeting', try text before calling
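Here is a hedged sketch of how the layering and context rules above might combine. The rules are illustrative, and `send` and `is_acknowledged` stand in for thin wrappers around whatever paging system is in use.

```python
import time

def channel_order(severity: str, hour: int, in_meeting: bool) -> list[str]:
    """Pick a notification sequence from context (illustrative rules only)."""
    if severity == "SEV1":
        return ["push", "sms", "phone", "phone"]   # Layer everything, repeat the call
    if in_meeting:
        return ["sms", "push", "phone"]            # Try text before calling
    if hour < 7 or hour >= 22:
        return ["phone", "sms", "push"]            # Night: most intrusive channel first
    return ["push", "sms", "phone"]                # Default during working hours

def notify_until_acknowledged(responder: str, severity: str, hour: int,
                              in_meeting: bool, send, is_acknowledged,
                              interval_seconds: int = 30) -> bool:
    """Returns True once acknowledged; False if every channel was exhausted."""
    for channel in channel_order(severity, hour, in_meeting):
        send(responder, channel)                   # Fire the next layer
        time.sleep(interval_seconds)               # Give it a moment to land
        if is_acknowledged():
            return True
    return False  # Caller escalates to the next tier
```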
Many missed pages happen because phones are on silent, Do Not Disturb is enabled, or the paging app lost notification permission. Require on-call engineers to configure their phones for alert receipt, and test bi-weekly that notifications actually reach them.
Escalation timing depends on what constitutes an adequate 'response'. Defining clear response stages with associated SLAs prevents ambiguity.
Response Stages

Acknowledgment: The responder indicates they've received and are aware of the alert. This stops immediate escalation but doesn't indicate resolution.

Engagement: The responder is actively investigating the issue. They've looked at dashboards, logs, or the affected system.

Triage: The responder has assessed severity and determined next steps—whether to resolve, escalate, or seek assistance.

Resolution: The incident is resolved and the system has returned to normal operation.

Each stage can have its own SLA and escalation triggers.
| Severity | Acknowledge SLA | Engage SLA | Triage SLA | Escalation Trigger |
|---|---|---|---|---|
| SEV1 | 3 min | 10 min | 30 min | Any SLA breach triggers next tier |
| SEV2 | 15 min | 30 min | 2 hours | Acknowledge or Engage miss triggers escalation |
| SEV3 | 60 min | 4 hours | 24 hours | Only Acknowledge miss triggers escalation |
| SEV4 | 4 hours | Next business day | 3 business days | No automatic escalation |
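To show how stage SLAs can drive escalation mechanically, here is a small sketch using the SEV1 and SEV2 rows above. The values come from the table; the function names and data shapes are assumptions, not a vendor API.

```python
from datetime import datetime, timedelta
from typing import Optional

STAGE_SLAS: dict[str, dict[str, timedelta]] = {
    "SEV1": {"acknowledge": timedelta(minutes=3),
             "engage": timedelta(minutes=10),
             "triage": timedelta(minutes=30)},
    "SEV2": {"acknowledge": timedelta(minutes=15),
             "engage": timedelta(minutes=30),
             "triage": timedelta(hours=2)},
}

def breached_stages(severity: str, opened_at: datetime,
                    stage_times: dict[str, Optional[datetime]],
                    now: datetime) -> list[str]:
    """Stages whose SLA was missed: either not reached by the deadline,
    or reached after it."""
    breaches = []
    for stage, sla in STAGE_SLAS.get(severity, {}).items():
        deadline = opened_at + sla
        reached_at = stage_times.get(stage)
        if reached_at is None and now > deadline:
            breaches.append(stage)          # Stage never happened in time
        elif reached_at is not None and reached_at > deadline:
            breaches.append(stage)          # Stage happened, but late
    return breaches
```

Per the Escalation Trigger column, a SEV1 incident would escalate to the next tier on any stage returned here, while a SEV3 would escalate only if the acknowledge stage appears.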
Acknowledgment Schemes

Different acknowledgment models suit different situations:

Simple Acknowledgment
- Any team member can acknowledge
- Escalation stops when acknowledged
- Works for small teams with shared ownership

Assigned Acknowledgment
- Only the assigned on-call can acknowledge
- Prevents 'ack and ignore' from non-responders
- Better accountability

Progressive Acknowledgment
- Acknowledge pauses escalation for N minutes
- Re-alerts if no resolution in that window
- Prevents acking then forgetting

Multi-Responder Acknowledgment
- Requires multiple people to acknowledge (for critical incidents)
- Ensures incident commander + technical responder both engaged
- Used for SEV1 only
```
// Progressive Acknowledgment Flow

WHEN alert fires:
    START escalation_timer (5 minutes)
    NOTIFY tier_1

WHEN acknowledgment received:
    IF from tier_1 responder:
        PAUSE escalation_timer
        START work_timer (30 minutes for SEV1, 2 hours for SEV2)
    ELSE:
        LOG "Non-assigned ack received"
        CONTINUE escalation_timer

WHEN work_timer expires:
    IF incident NOT resolved:
        NOTIFY tier_1: "Reminder: Incident still open"
        RESTART work_timer (15 minutes)
        INCREMENT reminder_count
        IF reminder_count >= 3:
            ESCALATE to tier_2
            INCLUDE note: "Tier 1 acknowledged but no resolution after {total_time}"

WHEN incident resolved:
    STOP all timers
    RECORD resolution_time
    NOTIFY stakeholders
    CLOSE escalation chain

// This prevents the "ack and forget" pattern while respecting
// that complex incidents take time to resolve
```

Many incidents require response from multiple teams or external parties. Designing escalation policies for these complex scenarios requires special consideration.
Cross-Team Scenario: The Dependency Alert

Your payment service is failing. Investigation reveals the root cause is in the upstream authentication service, owned by a different team.

Poor Escalation Pattern:
1. Payment on-call investigates for 30 minutes
2. Realizes it's an auth issue
3. Manually notifies auth team via Slack
4. Auth on-call doesn't see message for 15 minutes
5. Time wasted: 45+ minutes

Better Escalation Pattern:
1. Payment on-call investigates for 10 minutes
2. Recognizes cross-team dependency
3. Triggers built-in escalation: 'Escalate to dependency owner: Auth Service'
4. Auth on-call immediately paged with context: 'Escalated from Payment Service incident'
5. Both teams collaborate; incident commander coordinates
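The 'escalate to dependency owner' step works best when it is a first-class action in the paging tooling. A rough sketch follows, assuming a generic `pager` client and a hand-maintained dependency-to-policy map; both are hypothetical.

```python
from dataclasses import dataclass

# Assumed mapping from dependency service to the policy that pages its owners.
DEPENDENCY_POLICIES = {
    "auth-service": "auth-team-escalation",
    "billing-service": "billing-team-escalation",
}

@dataclass
class Incident:
    id: str
    title: str
    team: str
    investigation_summary: str

def escalate_to_dependency(incident: Incident, dependency: str, pager) -> None:
    """`pager` stands in for the paging tool's client; `trigger` is assumed."""
    policy = DEPENDENCY_POLICIES.get(dependency)
    if policy is None:
        raise ValueError(f"No escalation policy registered for {dependency}")
    pager.trigger(
        escalation_policy=policy,
        title=f"Cross-team escalation from {incident.team}: {incident.title}",
        details={
            "originating_incident": incident.id,
            "investigation_summary": incident.investigation_summary,
            "suspected_dependency": dependency,
        },
    )
```

The point of the helper is that the dependency team gets paged with context attached, rather than discovering a Slack message after the fact.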
External Escalation

Some incidents require escalating to vendors, partners, or providers:

Cloud Provider Outages
- Escalation path: Internal On-Call → Cloud Support Case → Account Manager → Executive Escalation
- Have enterprise support contracts that enable fast escalation
- Know how to file 'Urgent' or 'Business Critical' cases

Third-Party Service Degradation
- Escalation path: Check status page → API support channel → Partner engineering contact
- Maintain contact lists for partners with different severity levels
- Some agreements include specific escalation SLAs

Vendor Issues Affecting Production
- Document escalation procedures for each critical vendor
- Include backup vendor activation as part of escalation
- Have pre-negotiated support levels that provide access to engineering, not just front-line support
```yaml
# External escalation paths for vendor dependencies
external_escalation:
  aws:
    name: "Amazon Web Services"
    services: ["EC2", "RDS", "S3", "Lambda", "EKS"]
    tier_1:
      channel: "AWS Support Center"
      url: "https://console.aws.amazon.com/support"
      case_type: "Technical"
      severity: "Urgent (Production system down)"
      sla: "15 min response for Enterprise support"
    tier_2:
      channel: "Technical Account Manager"
      contact: "tam@aws.example.com"  # Placeholder
      phone: "+1-xxx-xxx-xxxx"
      when: "Tier 1 unresponsive after 30 min OR SEV1 incidents"
    tier_3:
      channel: "Executive Escalation"
      contact: "executive-escalations-aws@company.internal"
      when: "Major outage with business impact > $X/hour"
      requires: "VP or C-level approval"

  stripe:
    name: "Stripe Payment Processing"
    services: ["payment-intent", "charges", "refunds"]
    tier_1:
      channel: "Stripe Dashboard Support"
      url: "https://dashboard.stripe.com/support"
      escalation_email: "urgent@stripe.com"  # If available
      sla: "1 hour for standard, 15 min for critical"
    tier_2:
      channel: "Assigned Partner Engineer"
      contact: "partner-engineer@stripe.example.com"
      when: "Tier 1 unresponsive OR suspected platform issue"
    fallback:
      action: "Enable PayPal fallback"
      runbook: "runbook.internal/payment-fallback-paypal"
      auto_trigger_threshold: "Payment success rate < 50% for 5 min"

  datadog:
    name: "Datadog Monitoring"
    services: ["metrics", "logs", "apm", "synthetics"]
    tier_1:
      channel: "Datadog Chat Support"
      url: "In-app chat"
      sla: "Varies by plan"
    notes: |
      If Datadog is down, we lose alerting visibility.
      Fallback: Direct CloudWatch alarms, PagerDuty direct integrations
      Check status.datadoghq.com for known issues first
```

For critical vendors, establish relationships before you need them. Meet your AWS TAM during non-emergencies. Have your Stripe partner engineer on Slack. When an incident happens, a warm relationship means faster response than a cold support ticket.
Common escalation mistakes undermine incident response. Recognizing these anti-patterns helps you design better policies.
Escalation should be neither too easy nor too hard: easy enough that responders don't struggle alone with overwhelming incidents, yet hard enough that each tier genuinely attempts resolution before escalating. The balance point varies by incident type and team maturity.
Let's synthesize the principles into a practical framework for designing escalation policies for your organization.
Step 1: Map Your Incident Types

Identify the categories of incidents you handle:
- Production outages (service level)
- Performance degradation
- Security incidents
- Data issues
- Customer-impacting bugs
- Infrastructure failures

Each may need a specialized escalation path.

Step 2: Define Severity for Each Type

Create a severity matrix that's specific and unambiguous:
- What makes a database issue SEV1 vs. SEV2?
- How do we assess impact for different services?
- Who has authority to assign/change severity?

Step 3: Identify Responder Pools

For each incident type:
- Who is Tier 1? (Usually domain-specific on-call)
- Who is Tier 2? (Senior engineer, specialist, or cross-trained peer)
- Who is Tier 3? (Management, incident commander, or exec)
- What external escalation paths exist?

Step 4: Set Timeout Values

Balance urgency against practical response times:
- What can someone reasonably acknowledge in the middle of the night?
- How long should we wait before assuming no response?
- Different times for different severities

Step 5: Configure Notification Channels

For each tier and severity:
- What channels are used?
- In what order or combination?
- How do we verify delivery?

Step 6: Document and Train

Write it down and make sure everyone knows:
- Escalation policy documentation
- Training for new on-call engineers
- Regular drills and reviews
```yaml
# Complete escalation policy for a mid-sized SaaS company
organization:
  name: "Acme Corp"
  timezone: "America/Los_Angeles"
  on_call_tool: "PagerDuty"

severity_definitions:
  SEV1:
    description: "Complete outage or >50% of users affected"
    examples:
      - "Core API returning 500s"
      - "Database primary offline"
      - "Security breach detected"
    initial_response: "Page immediately, all hands"
  SEV2:
    description: "Significant degradation, 10-50% users affected"
    examples:
      - "Elevated latency across services"
      - "One region unavailable"
      - "Key feature broken"
    initial_response: "Page on-call, prepare for escalation"
  SEV3:
    description: "Minor issue, <10% users affected, workarounds exist"
    examples:
      - "Non-critical feature broken"
      - "Single customer issue"
      - "Performance degradation in non-peak hours"
    initial_response: "Ticket, address within 4 hours"

escalation_policies:
  production_service:
    applies_to:
      services: ["api", "web", "mobile-backend", "worker"]

    sev1_escalation:
      tier_1:
        team: "service-owning-team-oncall"
        timeout: 3m
        notify: ["push", "sms", "phone"]
      tier_2:
        team: "platform-oncall"
        timeout: 5m
        notify: ["phone", "sms"]
      tier_3:
        role: "incident-commander-rotation"
        timeout: 10m
        notify: ["phone"]
        action: "Open incident bridge, notify leadership"
      tier_4:
        role: "engineering-leadership"
        timeout: 20m
        notify: ["phone"]
        includes: ["VP Engineering", "CTO"]

    sev2_escalation:
      tier_1:
        team: "service-owning-team-oncall"
        timeout: 10m
        notify: ["push", "sms"]
      tier_2:
        team: "service-owning-team-secondary"
        timeout: 20m
        notify: ["phone", "sms"]
      tier_3:
        role: "engineering-manager"
        timeout: 45m
        notify: ["phone"]

    sev3_escalation:
      tier_1:
        team: "service-owning-team-oncall"
        timeout: 60m
        notify: ["push", "slack"]
      tier_2:
        team: "service-owning-team-secondary"
        timeout: 4h
        notify: ["sms"]

acknowledgment_rules:
  sev1:
    required_from: "assigned on-call only"
    ack_timeout: 3m
    work_check_timeout: 15m  # Re-ping if no progress
  sev2:
    required_from: "any team member"
    ack_timeout: 10m
    work_check_timeout: 30m
  sev3:
    required_from: "any team member"
    ack_timeout: 60m
    work_check_timeout: null  # No automatic re-ping

cross_team_escalation:
  enabled: true
  method: "One-click escalate to dependency owner"
  context_required:
    - "Original alert details"
    - "Investigation summary"
    - "Suspected root cause"
  notification: "Cross-team escalation from {originating_team} re: {incident_title}"

external_escalation:
  aws:
    tier_1: "Support case, Urgent severity"
    tier_2: "Technical Account Manager"
    contact_info: "See external contacts runbook"
  stripe:
    tier_1: "Dashboard support chat"
    fallback: "Enable PayPal processor"
    runbook: "link/to/payment-fallback"

review_schedule:
  frequency: "Monthly"
  attendees:
    - "On-call leads from each team"
    - "Incident Commander rotation"
    - "Engineering leadership representative"
  agenda:
    - "Review escalation events from past month"
    - "Assess timeout appropriateness"
    - "Update contact information"
    - "Drill one escalation scenario"
```

Escalation policies transform alerting from 'fire and hope' to guaranteed response. They're the final safety net that ensures incidents receive the attention they require.
What's Next:

Escalation gets responders engaged. But what do they do once engaged? The next page explores runbook integration—how to connect alerts with actionable documentation that guides responders through diagnosis and remediation.
You now understand how to design escalation policies that guarantee response, balance urgency with sustainability, and adapt to incident complexity. The key insight: detecting an incident is worthless without ensuring human engagement. Escalation policies bridge that gap.