At 2:34 AM, the critical alert fired. The primary on-call engineer was unreachable—phone on silent, perhaps asleep too deeply, perhaps the app had a glitch. The alert aged.
Five minutes passed. Ten minutes. Fifteen minutes. The system continued to degrade. Users started abandoning transactions. Revenue was hemorrhaging.
There was no escalation policy.
No secondary on-call was notified. No supervisor was paged. No backup system existed. The company had invested millions in monitoring infrastructure that detected the problem within seconds—then did nothing useful with that detection for 47 minutes until someone happened to check their phone.
This is the gap that escalation policies fill: the space between detecting an incident and ensuring human response.
By the end of this page, you will understand how to design escalation policies that guarantee timely response, balance urgency with responder well-being, and adapt to different incident types and severities. You'll learn the mechanics of escalation chains, timeout configurations, and multi-tier response structures.
An escalation policy defines what happens when the initial response fails. It's the backup plan for the backup plan—a systematic approach to ensuring incidents receive attention regardless of individual failures.
Why Escalation Matters
Incident response has multiple failure points:

- The alert never reaches the responder (silenced phone, broken app, lost notification permission).
- The responder receives it but can't act (asleep, traveling, already handling another incident).
- The responder engages but lacks the expertise or access to resolve the issue.
- The incident is too large for any single responder to handle alone.
Escalation policies address each of these by defining when and how additional resources are engaged.
Escalation vs. Routing
It's important to distinguish escalation from initial routing:
| Routing | Escalation |
|---|---|
| Determines who receives the alert first | Determines what happens if first responder doesn't resolve |
| Based on alert type, service, time of day | Based on response time, incident duration, severity |
| Usually static or schedule-based | Dynamic based on incident progression |
| One-time decision at alert creation | Ongoing process throughout incident |
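To make the distinction concrete, here is a minimal sketch, assuming hypothetical schedule names and a polling acknowledgment check: routing is a single lookup at alert creation, while escalation is a time-driven loop that continues as the incident ages.

```python
import time
from typing import Callable

# Routing: a one-time decision made when the alert is created.
ROUTING_TABLE = {
    "payments-api": "payments-oncall-schedule",   # hypothetical names
    "auth-service": "identity-oncall-schedule",
}

# Escalation: a time-driven chain that engages new tiers as the alert ages.
ESCALATION_CHAIN = [
    ("primary-oncall-schedule", 0),      # notify immediately
    ("secondary-oncall-schedule", 300),  # +5 min if still unacknowledged
    ("engineering-manager", 900),        # +15 min
]

def route(service: str) -> str:
    """Static lookup: who hears about this alert first."""
    return ROUTING_TABLE[service]

def escalate(is_acknowledged: Callable[[], bool],
             notify: Callable[[str], None]) -> None:
    """Walk the chain, engaging the next tier each time a timeout expires."""
    start = time.monotonic()
    for target, delay_s in ESCALATION_CHAIN:
        while time.monotonic() - start < delay_s:
            if is_acknowledged():
                return  # acknowledgment stops further escalation
            time.sleep(1)
        notify(target)
```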
A healthy escalation culture treats escalation as a natural part of incident response, not a failure or punishment. The goal is effective resolution, not blame assignment. Responders should feel comfortable escalating when they need help rather than struggling alone.
Escalation tiers define who gets involved as an incident persists or escalates. The structure should match your organization's size, on-call capacity, and incident severity model.
The Classic Three-Tier Model
Most organizations benefit from a variation of this structure:
Tier 1: Primary On-Call
Tier 2: Secondary On-Call / Specialist
Tier 3: Incident Commander / Management
| Severity | Tier 1 Timeout | Tier 2 Timeout | Tier 3 Timeout | Rationale |
|---|---|---|---|---|
| SEV1/Critical | 3 minutes | 5 minutes | 10 minutes | Maximum urgency; every minute matters |
| SEV2/High | 10 minutes | 20 minutes | 45 minutes | Urgent but brief delays acceptable |
| SEV3/Medium | 30 minutes | 2 hours | Next business day | Can wait for reasonable response |
| SEV4/Low | 4 hours | 8 hours | N/A | Low urgency, no management escalation |
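The timeout table translates directly into configuration data. The sketch below (Python; the tier indexing and `None` convention are assumptions, not a specific tool's API) shows how a scheduler might look up the next tier's timeout:

```python
from datetime import timedelta
from typing import Optional

# Tier timeouts per severity, mirroring the table above.
# None means "no automatic escalation at this tier" (handled out of band).
TIER_TIMEOUTS: dict[str, list[Optional[timedelta]]] = {
    "SEV1": [timedelta(minutes=3), timedelta(minutes=5), timedelta(minutes=10)],
    "SEV2": [timedelta(minutes=10), timedelta(minutes=20), timedelta(minutes=45)],
    "SEV3": [timedelta(minutes=30), timedelta(hours=2), None],  # tier 3: next business day
    "SEV4": [timedelta(hours=4), timedelta(hours=8), None],     # no management escalation
}

def next_tier_timeout(severity: str, tier_index: int) -> Optional[timedelta]:
    """How long to wait at this tier before engaging the next one (0-indexed)."""
    timeouts = TIER_TIMEOUTS.get(severity, [])
    if tier_index >= len(timeouts):
        return None
    return timeouts[tier_index]
```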
Specialized Escalation Paths
Different incident types may require different escalation structures:
Security Incidents
Data Incidents
Customer-Facing Outages
```yaml
# PagerDuty-style escalation policy configuration
escalation_policies:
  # Standard service escalation
  - name: "default-service-escalation"
    description: "Standard on-call escalation for production services"
    repeat_limit: 2  # Repeat entire chain twice if still unacked
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "schedule"
            id: "primary-oncall-schedule"
        notify_methods: ["push", "sms", "phone"]
      - escalation_delay_in_minutes: 5
        targets:
          - type: "schedule"
            id: "secondary-oncall-schedule"
        notify_methods: ["push", "sms", "phone"]
      - escalation_delay_in_minutes: 15
        targets:
          - type: "user"
            id: "engineering-manager"
          - type: "user"
            id: "team-lead"
        notify_methods: ["phone", "sms"]
      - escalation_delay_in_minutes: 30
        targets:
          - type: "user"
            id: "vp-engineering"
        notify_methods: ["phone"]

  # Critical path - accelerated escalation
  - name: "critical-infrastructure-escalation"
    description: "Accelerated escalation for SEV1 incidents"
    services: ["database-primary", "payment-gateway", "auth-service"]
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "schedule"
            id: "platform-oncall-schedule"
        notify_methods: ["push", "sms", "phone"]
      - escalation_delay_in_minutes: 3  # Faster escalation
        targets:
          - type: "schedule"
            id: "platform-secondary-schedule"
          - type: "schedule"
            id: "sre-oncall-schedule"  # Parallel notification
        notify_methods: ["phone", "sms"]
      - escalation_delay_in_minutes: 8
        targets:
          - type: "user"
            id: "platform-director"
          - type: "user"
            id: "incident-commander-oncall"
        notify_methods: ["phone"]

  # Security incident path
  - name: "security-incident-escalation"
    description: "Security-specific escalation path"
    trigger_on_labels:
      category: "security"
    escalation_rules:
      - escalation_delay_in_minutes: 0
        targets:
          - type: "schedule"
            id: "soc-oncall-schedule"
        notify_methods: ["push", "sms"]
      - escalation_delay_in_minutes: 5
        targets:
          - type: "schedule"
            id: "security-engineer-oncall"
        notify_methods: ["phone", "sms"]
      - escalation_delay_in_minutes: 15
        targets:
          - type: "user"
            id: "ciso"
        notify_methods: ["phone"]
        additional_context: "Include legal@company.com in comms"
```

How you notify responders can be as important as who you notify. Different channels have different reliability, intrusiveness, and appropriateness for various situations.
| Channel | Intrusiveness | Reliability | Best For | Limitations |
|---|---|---|---|---|
| Phone Call | Very High | High | Critical/SEV1, unreachable responders | Disruptive, may not work internationally |
| SMS | High | High | Urgent alerts, backup to app push | Length limits, carrier delays possible |
| Push Notification | Medium | Medium | Standard on-call alerts | Requires app, phone must be online |
| Slack/Teams | Low | Medium | Team awareness, SEV3+ | Easy to miss, requires checking |
| Email | Very Low | High | Non-urgent, documentation | Not for real-time response |
Multi-Channel Notification
For critical alerts, use multiple channels simultaneously:
SEV1 Notification Sequence:
```
T+0s        T+30s       T+60s       T+90s
  │           │           │           │
  ▼           ▼           ▼           ▼
Push ─────► SMS ─────► Call ─────► Second Call
  └───────────┴───────────┴───────────┘
        Continue until acknowledged
```
This layered approach ensures the alert reaches the responder even if one channel fails.
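A minimal sketch of that layered sequence, assuming hypothetical `send_push`/`send_sms`/`place_call` transports (stubbed out here) and a polling acknowledgment check:

```python
import time
from typing import Callable

# Stubs standing in for real notification integrations.
def send_push(user: str) -> None: ...
def send_sms(user: str) -> None: ...
def place_call(user: str) -> None: ...

# (channel, seconds after alert start) - mirrors the SEV1 sequence above.
SEV1_SEQUENCE: list[tuple[Callable[[str], None], int]] = [
    (send_push, 0),
    (send_sms, 30),
    (place_call, 60),
    (place_call, 90),  # second call
]

def notify_until_acked(user: str, is_acked: Callable[[], bool],
                       max_cycles: int = 5) -> bool:
    """Fire each channel on schedule; repeat the whole cycle until acknowledged."""
    for _ in range(max_cycles):
        start = time.monotonic()
        for channel, offset_s in SEV1_SEQUENCE:
            while time.monotonic() - start < offset_s:
                if is_acked():
                    return True
                time.sleep(1)
            channel(user)
    return is_acked()
```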
Channel Selection by Context
Smart alerting systems can adapt channel selection based on context:
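For example, a system might prefer a phone call at night, when push notifications are easily slept through, and reserve the least intrusive channels for low severities. The sketch below shows the idea in Python; the context fields and thresholds are illustrative assumptions, not any specific product's API.

```python
from datetime import datetime

def choose_channels(severity: str, now: datetime,
                    responder_tz_offset: int = 0) -> list[str]:
    """Pick notification channels based on severity and the responder's local time."""
    local_hour = (now.hour + responder_tz_offset) % 24
    night = local_hour < 7 or local_hour >= 22  # push is easy to sleep through

    if severity == "SEV1":
        return ["push", "sms", "phone"]  # everything, immediately
    if severity == "SEV2":
        return ["phone", "sms"] if night else ["push", "sms"]
    if night:
        return []  # hold SEV3/SEV4 until morning, per your low-severity policy
    return ["push", "slack"]
```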
Many missed pages happen because phones are on silent, Do Not Disturb is enabled, or the paging app has lost notification permission. Require on-call engineers to configure their phones so pages break through (critical-alert overrides, Do Not Disturb exceptions), and test every two weeks that notifications actually reach them.
Escalation timing depends on what constitutes an adequate 'response'. Defining clear response stages with associated SLAs prevents ambiguity.
Response Stages
Acknowledgment: The responder indicates they've received and are aware of the alert. This stops immediate escalation but doesn't indicate resolution.
Engagement: The responder is actively investigating the issue. They've looked at dashboards, logs, or the affected system.
Triage: The responder has assessed severity and determined next steps—whether to resolve, escalate, or seek assistance.
Resolution: The incident is resolved and the system has returned to normal operation.
Each stage can have its own SLA and escalation triggers.
| Severity | Acknowledge SLA | Engage SLA | Triage SLA | Escalation Trigger |
|---|---|---|---|---|
| SEV1 | 3 min | 10 min | 30 min | Any SLA breach triggers next tier |
| SEV2 | 15 min | 30 min | 2 hours | Acknowledge or Engage miss triggers escalation |
| SEV3 | 60 min | 4 hours | 24 hours | Only Acknowledge miss triggers escalation |
| SEV4 | 4 hours | Next business day | 3 business days | No automatic escalation |
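As a sketch of how these stage SLAs could be encoded and checked, the structure below mirrors the table (Python; the stage names are assumptions, and non-time entries such as "next business day" are simplified to `None`):

```python
from datetime import datetime, timedelta
from typing import Optional

# Stage SLAs per severity, mirroring the table above.
STAGE_SLAS: dict[str, dict[str, Optional[timedelta]]] = {
    "SEV1": {"acknowledge": timedelta(minutes=3),
             "engage": timedelta(minutes=10),
             "triage": timedelta(minutes=30)},
    "SEV2": {"acknowledge": timedelta(minutes=15),
             "engage": timedelta(minutes=30),
             "triage": timedelta(hours=2)},
    "SEV3": {"acknowledge": timedelta(minutes=60),
             "engage": timedelta(hours=4),
             "triage": timedelta(hours=24)},
    "SEV4": {"acknowledge": timedelta(hours=4),
             "engage": None,   # next business day
             "triage": None},  # 3 business days
}

def breached(severity: str, stage: str,
             alert_time: datetime, now: datetime) -> bool:
    """True if this severity's SLA for the given stage has been missed."""
    sla = STAGE_SLAS[severity][stage]
    return sla is not None and now - alert_time > sla
```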
Acknowledgment Schemes
Different acknowledgment models suit different situations:
Simple Acknowledgment
Assigned Acknowledgment
Progressive Acknowledgment
Multi-Responder Acknowledgment
```
// Progressive Acknowledgment Flow

WHEN alert fires:
    START escalation_timer (5 minutes)
    NOTIFY tier_1

WHEN acknowledgment received:
    IF from tier_1 responder:
        PAUSE escalation_timer
        START work_timer (30 minutes for SEV1, 2 hours for SEV2)
    ELSE:
        LOG "Non-assigned ack received"
        CONTINUE escalation_timer

WHEN work_timer expires:
    IF incident NOT resolved:
        NOTIFY tier_1: "Reminder: Incident still open"
        RESTART work_timer (15 minutes)
        INCREMENT reminder_count
        IF reminder_count >= 3:
            ESCALATE to tier_2
            INCLUDE note: "Tier 1 acknowledged but no resolution after {total_time}"

WHEN incident resolved:
    STOP all timers
    RECORD resolution_time
    NOTIFY stakeholders
    CLOSE escalation chain

// This prevents the "ack and forget" pattern while respecting
// that complex incidents take time to resolve
```

Many incidents require response from multiple teams or external parties. Designing escalation policies for these complex scenarios requires special consideration.
Cross-Team Scenario: The Dependency Alert
Your payment service is failing. Investigation reveals the root cause is in the upstream authentication service, owned by a different team.
Poor Escalation Pattern:

- The payment on-call escalates up their own chain: secondary, then manager, none of whom can fix the auth service.
- The auth team is eventually reached through a manager-to-manager relay, losing precious time.
- When the auth on-call finally engages, they receive none of the investigation context and start from zero.
Better Escalation Pattern:

- The payment on-call investigates, identifies the upstream dependency, and directly pages the auth team's on-call.
- The escalation carries context: the original alert details, an investigation summary, and the suspected root cause.
- Both teams stay engaged: auth owns the fix while payments handles customer impact and mitigation.
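As an illustration, the context handed over in that better pattern could be structured like this (a sketch; the field names are assumptions modeled on the cross-team rules in the complete policy example later on):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CrossTeamEscalation:
    """Context handed to the dependency team so they don't start from zero."""
    originating_team: str
    target_team: str
    incident_title: str
    alert_details: str          # original alert payload
    investigation_summary: str  # what the first responder already ruled out
    suspected_root_cause: str
    escalated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

def escalation_message(e: CrossTeamEscalation) -> str:
    """Render the page sent to the dependency team's on-call."""
    return (f"Cross-team escalation from {e.originating_team} re: {e.incident_title}\n"
            f"Suspected root cause: {e.suspected_root_cause}\n"
            f"Already investigated: {e.investigation_summary}")
```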
External Escalation
Some incidents require escalating to vendors, partners, or providers:
Cloud Provider Outages
Third-Party Service Degradation
Vendor Issues Affecting Production
```yaml
# External escalation paths for vendor dependencies
external_escalation:
  aws:
    name: "Amazon Web Services"
    services: ["EC2", "RDS", "S3", "Lambda", "EKS"]
    tier_1:
      channel: "AWS Support Center"
      url: "https://console.aws.amazon.com/support"
      case_type: "Technical"
      severity: "Urgent (Production system down)"
      sla: "15 min response for Enterprise support"
    tier_2:
      channel: "Technical Account Manager"
      contact: "tam@aws.example.com"  # Placeholder
      phone: "+1-xxx-xxx-xxxx"
      when: "Tier 1 unresponsive after 30 min OR SEV1 incidents"
    tier_3:
      channel: "Executive Escalation"
      contact: "executive-escalations-aws@company.internal"
      when: "Major outage with business impact > $X/hour"
      requires: "VP or C-level approval"

  stripe:
    name: "Stripe Payment Processing"
    services: ["payment-intent", "charges", "refunds"]
    tier_1:
      channel: "Stripe Dashboard Support"
      url: "https://dashboard.stripe.com/support"
      escalation_email: "urgent@stripe.com"  # If available
      sla: "1 hour for standard, 15 min for critical"
    tier_2:
      channel: "Assigned Partner Engineer"
      contact: "partner-engineer@stripe.example.com"
      when: "Tier 1 unresponsive OR suspected platform issue"
    fallback:
      action: "Enable PayPal fallback"
      runbook: "runbook.internal/payment-fallback-paypal"
      auto_trigger_threshold: "Payment success rate < 50% for 5 min"

  datadog:
    name: "Datadog Monitoring"
    services: ["metrics", "logs", "apm", "synthetics"]
    tier_1:
      channel: "Datadog Chat Support"
      url: "In-app chat"
      sla: "Varies by plan"
    notes: |
      If Datadog is down, we lose alerting visibility.
      Fallback: Direct CloudWatch alarms, PagerDuty direct integrations
      Check status.datadoghq.com for known issues first
```

For critical vendors, establish relationships before you need them. Meet your AWS TAM during non-emergencies. Have your Stripe partner engineer on Slack. When an incident happens, a warm relationship means faster response than a cold support ticket.
Common escalation mistakes undermine incident response. Recognizing these anti-patterns helps you design better policies.
Escalation should be neither too easy nor too hard: easy enough that responders don't suffer alone with overwhelming incidents, yet hard enough that each tier actually attempts resolution before escalating. The balance point varies by incident type and team maturity.
Let's synthesize the principles into a practical framework for designing escalation policies for your organization.
Step 1: Map Your Incident Types
Identify the categories of incidents you handle:

- Production service incidents (availability, latency, errors)
- Security incidents
- Data incidents (corruption, loss, pipeline failures)
- Customer-facing outages with communication obligations
Each may need a specialized escalation path.
Step 2: Define Severity for Each Type
Create a severity matrix that's specific and unambiguous:

- SEV1: complete outage or a majority of users affected
- SEV2: significant degradation affecting a meaningful fraction of users
- SEV3: minor issues with workarounds available
- Include concrete examples for each level so responders never have to guess
Step 3: Identify Responder Pools
For each incident type:

- Who is primary on-call (team or schedule)?
- Who provides secondary or specialist backup?
- Who acts as incident commander or management escalation?
- Which external parties (vendors, providers) sit in the path?
Step 4: Set Timeout Values
Balance urgency against practical response times:

- Minutes for SEV1, where every minute of delay costs users and revenue
- Tens of minutes for SEV2; hours for SEV3
- No automatic escalation for low-severity issues
- Allow for reality: waking up, finding a laptop, and connecting to the VPN all take time
Step 5: Configure Notification Channels
For each tier and severity:

- Match channel intrusiveness to urgency (phone for SEV1, push or Slack for SEV3)
- Layer multiple channels for critical alerts
- Define fallbacks for when the primary channel fails
Step 6: Document and Train
Write it down and make sure everyone knows:

- How the chain works: who gets paged, in what order, with what timeouts
- How to escalate manually when the automation isn't enough
- Where contact information lives and who keeps it current
- Drill escalation scenarios regularly so the first real use isn't a surprise

The complete example below pulls these steps together.
```yaml
# Complete escalation policy for a mid-sized SaaS company
organization:
  name: "Acme Corp"
  timezone: "America/Los_Angeles"
  on_call_tool: "PagerDuty"

severity_definitions:
  SEV1:
    description: "Complete outage or >50% of users affected"
    examples:
      - "Core API returning 500s"
      - "Database primary offline"
      - "Security breach detected"
    initial_response: "Page immediately, all hands"
  SEV2:
    description: "Significant degradation, 10-50% users affected"
    examples:
      - "Elevated latency across services"
      - "One region unavailable"
      - "Key feature broken"
    initial_response: "Page on-call, prepare for escalation"
  SEV3:
    description: "Minor issue, <10% users affected, workarounds exist"
    examples:
      - "Non-critical feature broken"
      - "Single customer issue"
      - "Performance degradation in non-peak hours"
    initial_response: "Ticket, address within 4 hours"

escalation_policies:
  production_service:
    applies_to:
      services: ["api", "web", "mobile-backend", "worker"]

    sev1_escalation:
      tier_1:
        team: "service-owning-team-oncall"
        timeout: 3m
        notify: ["push", "sms", "phone"]
      tier_2:
        team: "platform-oncall"
        timeout: 5m
        notify: ["phone", "sms"]
      tier_3:
        role: "incident-commander-rotation"
        timeout: 10m
        notify: ["phone"]
        action: "Open incident bridge, notify leadership"
      tier_4:
        role: "engineering-leadership"
        timeout: 20m
        notify: ["phone"]
        includes: ["VP Engineering", "CTO"]

    sev2_escalation:
      tier_1:
        team: "service-owning-team-oncall"
        timeout: 10m
        notify: ["push", "sms"]
      tier_2:
        team: "service-owning-team-secondary"
        timeout: 20m
        notify: ["phone", "sms"]
      tier_3:
        role: "engineering-manager"
        timeout: 45m
        notify: ["phone"]

    sev3_escalation:
      tier_1:
        team: "service-owning-team-oncall"
        timeout: 60m
        notify: ["push", "slack"]
      tier_2:
        team: "service-owning-team-secondary"
        timeout: 4h
        notify: ["sms"]

acknowledgment_rules:
  sev1:
    required_from: "assigned on-call only"
    ack_timeout: 3m
    work_check_timeout: 15m  # Re-ping if no progress
  sev2:
    required_from: "any team member"
    ack_timeout: 10m
    work_check_timeout: 30m
  sev3:
    required_from: "any team member"
    ack_timeout: 60m
    work_check_timeout: null  # No automatic re-ping

cross_team_escalation:
  enabled: true
  method: "One-click escalate to dependency owner"
  context_required:
    - "Original alert details"
    - "Investigation summary"
    - "Suspected root cause"
  notification: "Cross-team escalation from {originating_team} re: {incident_title}"

external_escalation:
  aws:
    tier_1: "Support case, Urgent severity"
    tier_2: "Technical Account Manager"
    contact_info: "See external contacts runbook"
  stripe:
    tier_1: "Dashboard support chat"
    fallback: "Enable PayPal processor"
    runbook: "link/to/payment-fallback"

review_schedule:
  frequency: "Monthly"
  attendees:
    - "On-call leads from each team"
    - "Incident Commander rotation"
    - "Engineering leadership representative"
  agenda:
    - "Review escalation events from past month"
    - "Assess timeout appropriateness"
    - "Update contact information"
    - "Drill one escalation scenario"
```

Escalation policies transform alerting from 'fire and hope' to guaranteed response. They're the final safety net that ensures incidents receive the attention they require.
What's Next:
Escalation gets responders engaged. But what do they do once engaged? The next page explores runbook integration—how to connect alerts with actionable documentation that guides responders through diagnosis and remediation.
You now understand how to design escalation policies that guarantee response, balance urgency with sustainability, and adapt to incident complexity. The key insight: detecting an incident is worthless without ensuring human engagement. Escalation policies bridge that gap.