Incident Management - Learning Module

Loading content...

0/273

On-Call Practices

The Human Infrastructure of Reliability

At 2:47 AM, somewhere in the world, someone's phone is buzzing. A database cluster has crossed a latency threshold. A payment service is returning errors. A Kubernetes pod is stuck in a crash loop. In that moment, the difference between a minor blip and a major outage often comes down to a single question: Is the right person awake, alert, and equipped to respond?

On-call is the human infrastructure of reliability. All the monitoring, alerting, and automation in the world is worthless if there's no qualified human ready to receive the signals and take action. Yet on-call is also one of the most challenging aspects of engineering culture—it impacts work-life balance, creates stress, and can lead to burnout if poorly managed.

Organizations that excel at incident management recognize on-call as a first-class operational concern, not an afterthought. They design rotations carefully, invest in tooling, compensate appropriately, and continuously improve the experience. This page explores how.

What You Will Learn

By the end of this page, you will understand how to design sustainable on-call rotations, select and configure alerting tools, compensate on-call fairly, protect responder well-being, and build on-call practices that teams actually want to participate in. You'll learn the difference between on-call systems that burn people out and those that make teams more effective.

The Purpose of On-Call

On-call exists to ensure that production systems have responsible human oversight at all times. While automation handles routine operations, complex failures require human judgment—the ability to diagnose novel situations, weigh trade-offs, and make decisions under uncertainty.

What On-Call Provides

Continuous Coverage: Someone is always responsible, eliminating gaps when issues arise outside business hours
Clear Accountability: A specific person is responsible at any given moment, preventing bystander effect
Rapid Response: Defined escalation paths ensure quick engagement when incidents occur
Expert Availability: On-call rotations typically include team members with relevant system expertise
Organizational Learning: On-call experience builds deep operational knowledge within teams

On-Call Responsibility Levels
Level	Responsibility	Typical Response Time	Escalates To
Primary On-Call	First responder for all alerts; triages and handles or escalates	5-15 minutes	Secondary On-Call
Secondary On-Call	Backup when primary is unavailable; steps in after timeout	15-30 minutes	Engineering Manager
Specialist On-Call	Subject matter expert for specific systems; engaged by primary as needed	Variable	Depends on system
Manager On-Call	Escalation point for severity decisions, resource allocation, external communication	15-30 minutes	Director/VP
Executive On-Call	Major incident awareness; customer and stakeholder communication authority	30-60 minutes	CEO/CTO

The On-Call Contract

On-call is an explicit agreement between the organization and the engineer. When you're on-call:

You carry a functional alerting device at all times
You remain in a location where you can respond (connectivity, equipment access)
You can achieve response time SLAs (typically under 15 minutes to acknowledge, under 30 minutes to be actively working)
You limit activities that would impair response (excessive alcohol, unreachable locations, sleep deprivation)

In return, the organization provides:

Clear escalation paths when you need help
Tools and access to diagnose and resolve issues
Compensation for availability and disruption
Support to ensure you're not on-call more than is sustainable
Authority to take necessary action without seeking approval

On-Call Is Not Heroism

A healthy on-call culture doesn't celebrate engineers who sacrifice sleep and personal time. It celebrates systems that rarely page, alerts that are actionable, and rotations that share the burden fairly. The goal is boring on-call shifts—nothing happens, everyone sleeps well.

Designing On-Call Rotations

Rotation design determines how on-call burden is distributed across the team. Poor rotation design leads to burnout, resentment, and coverage gaps. Good design balances coverage needs with human sustainability.

Key Rotation Parameters

Rotation Length: How long is each on-call shift?
Team Size: How many people share the rotation?
Primary/Secondary Structure: How many tiers of on-call?
Handoff Timing: When do shifts change?
Override Policies: How are planned absences handled?
Follow-the-Sun: Does rotation follow time zones for global teams?

Rotation Best Practices

•1-Week Rotations: Long enough to adapt, short enough to not exhaust. Shorter rotations (3-4 days) can work for high-page-volume teams.
•6+ Person Rotation Pool: Minimum to avoid burnout. 8-10 is ideal for 1-week rotations (on-call ~5 weeks/year).
•Primary + Secondary: Secondary provides backup and reduces primary stress. Secondary should be able to handle any escalation.
•Business-Hours Handoffs: Hand off during working hours (e.g., 10 AM Tuesday) so incumbents can brief and are available for questions.
•Multi-Timezone Coverage: For global teams, split coverage by region to avoid nighttime pages entirely.
•Protected Recovery Time: Don't schedule demanding work for engineers coming off on-call, especially after incident-heavy weeks.

Rotation Anti-Patterns

•Perpetual On-Call: Same person(s) always on-call due to small team size. Leads to burnout and single points of failure.
•Month-Long Rotations: Too exhausting. On-call for an entire month creates dread and life disruption.
•Weekend-Only Penalty: Junior or new engineers only get weekend shifts. Unfair and demoralizing.
•No Secondary: Single point of failure. What happens if primary is incapacitated or unreachable?
•Midnight Handoffs: Handoffs at midnight create ambiguity and tired responders.
•Invisible Rotation: Rotation only known to the on-call tool. Team should have visibility into upcoming schedule.

Sample Rotation Structures

Standard Weekly Rotation (8-person team)

Primary rotates weekly, Tuesday 10 AM to Tuesday 10 AM
Secondary is 'previous week's primary' (natural context transfer)
Each person is primary ~6 weeks/year, secondary ~6 weeks/year

Follow-the-Sun (3-region global team)

APAC: Covers 8 PM - 4 AM UTC
EMEA: Covers 4 AM - 12 PM UTC
Americas: Covers 12 PM - 8 PM UTC
Each region has its own primary/secondary rotation
No nighttime pages for anyone

Weekend Hero (for teams with limited coverage)

Weekday rotation: separate weekday on-call schedule
Weekend rotation: dedicated weekend rotation with higher compensation
Allows smaller teams to provide 24/7 coverage without everyone doing nights

The Vacation/Holiday Problem

Holidays and vacations create coverage gaps. Solve them explicitly:

• Maintain an override pool of volunteers willing to swap • Offer premium compensation for holiday coverage • Allow (encourage!) swaps—don't force people to be on-call during important personal events • Plan holiday coverage 4+ weeks in advance • Consider reduced SLOs during low-traffic holiday periods

Alerting and Escalation Configuration

On-call relies on robust alerting infrastructure to deliver the right alerts to the right people at the right time. Misconfigured alerting leads to missed incidents, excessive noise, or responders who can't sleep because their phones buzz constantly.

Alerting Stack Components

Monitoring System: Generates alerts based on thresholds or anomalies (Prometheus, DataDog, Cloudwatch)
Alert Routing: Directs alerts to appropriate channels (PagerDuty, Opsgenie, VictorOps)
Notification Delivery: Reaches responders through multiple channels (push notification, SMS, phone call)
Escalation Engine: Escalates unanswered alerts through the response chain
De-duplication: Consolidates related alerts to prevent notification storms

escalation-policy.yaml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
# PagerDuty-style Escalation Policy Configuration
# This policy demonstrates multi-tier escalation with 
# appropriate timeouts and notification preferences
 
escalation_policy:
  name: "Payment Service Production"
  description: "Escalation for payment service production alerts"
  
  # Repeat the entire escalation chain if still unresolved
  repeat_enabled: true
  repeat_count: 3  # After 3 complete cycles, page on-call manager
  
  escalation_rules:
    # Level 0: Primary On-Call
    - timeout_minutes: 5
      targets:
        - type: schedule
          id: payment-service-primary-schedule
      notification_rules:
        - channels: [push, sms]
          delay_minutes: 0
        - channels: [phone_call]
          delay_minutes: 2
 
    # Level 1: Secondary On-Call (if primary doesn't respond)
    - timeout_minutes: 10
      targets:
        - type: schedule
          id: payment-service-secondary-schedule
      notification_rules:
        - channels: [push, sms, phone_call]
          delay_minutes: 0
 
    # Level 2: Engineering Manager
    - timeout_minutes: 10
      targets:
        - type: user
          id: payments-eng-manager
        - type: user
          id: payments-eng-manager-backup
      notification_rules:
        - channels: [phone_call]
          delay_minutes: 0
 
    # Level 3: On-Call Director (major escalation)
    - timeout_minutes: 15
      targets:
        - type: schedule
          id: engineering-director-on-call
      notification_rules:
        - channels: [phone_call]
          delay_minutes: 0
 
---
# Schedule Configuration
schedules:
  - id: payment-service-primary-schedule
    name: "Payments Primary On-Call"
    timezone: "America/Los_Angeles"
    
    layers:
      # Standard weekly rotation
      - name: "Weekly Rotation"
        rotation_type: weekly
        handoff_time: "10:00:00"
        handoff_day: tuesday
        users:
          - alice@example.com
          - bob@example.com
          - carol@example.com
          - david@example.com
          - eve@example.com
          - frank@example.com
          - grace@example.com
          - henry@example.com
    
    # Overrides for vacations, holidays, swaps
    overrides:
      - start: "2024-12-24T00:00:00"
        end: "2024-12-26T00:00:00"
        user: volunteer@example.com
        reason: "Holiday coverage - Christmas"
 
---
# Service Configuration
services:
  - id: payment-service-prod
    name: "Payment Service (Production)"
    escalation_policy: payment-service-production
    
    alert_creation: create_alerts_and_incidents
    
    # Intelligent grouping to reduce noise
    alert_grouping:
      type: intelligent
      timeout_minutes: 5
      
    # Auto-resolve when condition clears
    auto_resolve:
      enabled: true
      timeout_minutes: 240
      
    # Integration with monitoring
    integrations:
      - type: prometheus_alertmanager
        routing_key: payment-prod-key
      - type: datadog
        routing_key: datadog-payment-key

Escalation Timeout Considerations

Escalation timeouts balance response urgency against responder availability:

Too short (< 3 minutes): Responders can't get to their phone in time; frequent false escalations
Too long (> 15 minutes): Critical incidents sit unacknowledged; customer impact extends
Sweet spot (5-10 minutes): Enough time for bathroom breaks and grabbing phones, not so long that incidents linger

Consider time of day—nighttime escalations may need slightly longer timeouts (responders are waking up).

Acknowledgment vs. Resolution

Distinguish between:

Acknowledge: "I see this and am looking at it" - stops escalation
Resolution: "The incident is fixed" - closes the alert

Acknowledgment should happen quickly (within minutes). Resolution takes as long as it takes. Track both metrics separately.

The Dead-of-Night Test

Review every alert that could page at 3 AM and ask: 'Does this require immediate human action that cannot wait until morning?' If the answer is no, the alert should either be lower severity (no page) or should be suppressed during off-hours. Waking engineers for non-urgent issues destroys on-call sustainability.

On-Call Compensation

On-call is work. Being available to respond, even when no incidents occur, constrains what you can do with your time. Actual incident response disrupts sleep, personal time, and mental health. Fair compensation acknowledges these costs and ensures on-call burden is shared equitably.

Compensation Models

Organizations compensate on-call through various mechanisms:

On-Call Compensation Approaches
Model	How It Works	Pros	Cons
Flat Stipend	$X per on-call shift (e.g., $500/week)	Simple; predictable for budgeting	Doesn't account for actual incident volume
Per-Page Payment	$Y for each page received (e.g., $50/page)	Reflects actual disruption; incentivizes reducing pages	Complex tracking; variable team costs
Hybrid Model	Base stipend + per-page bonus	Balances availability and disruption compensation	More complex to administer
Time-in-Lieu	Comp time earned for on-call and incidents	No direct cost; time recovery	Doesn't feel like real compensation; may be unused
Salary Loading	On-call expectations baked into base salary	Simplest; no per-shift overhead	Unfair if burden isn't shared equally
Weekend Premium	Higher rate for weekend/holiday coverage	Reflects higher personal cost of weekend disruption	May create competition for weekday-only slots

Recommended Approach: Hybrid Compensation

Most mature organizations use a hybrid model:

Base Stipend: Compensates for availability and life constraints during on-call week
Night/Weekend Premium: Higher rate for non-business-hours coverage
Incident Compensation: Additional payment or time-off for actual incident work
Post-Incident Recovery: Mandatory comp time after particularly difficult incidents

Example Calculation:

Weekday on-call shift (5 days): $400 base
Weekend coverage (2 days): $200 additional
Night-time incident (per hour of active work): $75
Major incident (SEV-1/SEV-2 response): 0.5 day comp time

Typical week with 1 night incident: $600 + 0.5 day off

Non-Monetary Recognition

Beyond compensation:

Public acknowledgment of on-call service
Consideration in performance reviews
Career growth credit (on-call experience valued for senior roles)
Choice of on-call weeks (veterans get schedule preferences)
Team events and appreciation (on-call dinners, etc.)

The Fairness Imperative

Unfair compensation breeds resentment and attrition. Junior engineers shouldn't be subsidizing reliability by taking more on-call for less pay. Senior engineers with families shouldn't opt out entirely. Design compensation and rotation to ensure equitable burden distribution across the team.

Responder Well-Being

On-call can be sustainable or it can burn people out. The difference lies in how organizations balance reliability needs against human well-being. Burned-out responders make mistakes, leave the company, and create institutional knowledge loss. Sustainable on-call builds expertise, team cohesion, and operational excellence.

The Well-Being Equation

Responder well-being depends on several factors:

Alert Volume: How often are you paged?
Alert Actionability: Can you fix the issue, or are you just acknowledging noise?
Sleep Disruption: How often are you woken at night?
Recovery Time: Do you get time to recover after difficult shifts?
Support Availability: Can you get help when overwhelmed?
Tooling Quality: Do you have what you need to diagnose and resolve issues?

Sustainable On-Call Practices

•Page Volume Targets: Target < 2 pages per on-call week on average. More than 5 pages/week indicates a reliability problem, not an on-call problem. Invest in fixing the root causes.
•Sleep Protection: Avoid pages between 10 PM and 6 AM for non-critical issues. Use off-hours suppression for warnings and low-severity alerts.
•Mandatory Post-Incident Recovery: After a night incident, the next morning is protected. No meetings, no expectations until afternoon. After a major incident, consider a full day of recovery.
•Secondary Support: Always have secondary on-call available. Primary should never feel alone. Encourage escalation to secondary when overwhelmed.
•Mental Health Resources: Recognize that repeated incident response stress affects mental health. Provide access to counseling, stress management resources, and check-ins.
•Rotation Flexibility: Allow people to swap shifts for important life events. Build a culture where asking for help is encouraged, not stigmatized.
•On-Call Sabbaticals: For teams with high incident load, periodic breaks from on-call rotation (a month off after 6 months on) can prevent cumulative burnout.

The Burnout Warning Signs

Recognize early indicators of on-call burnout:

Cynicism: "It's always broken. Why do I even try?"
Dread: Visible anxiety when on-call week approaches
Avoidance: Finding excuses to skip on-call or swap every shift
Fatigue: Chronic tiredness even outside on-call weeks
Mistakes: Increased error rate during incident response
Attrition: Team members citing on-call as reason for leaving

When you see these signs, address the root causes—usually excessive page volume from unreliable systems—rather than just counseling the individual.

The 'Quiet Week' Principle

On-call should be boring most weeks. If responders regularly have exciting on-call shifts (multiple incidents, complex pages, night interruptions), you have a reliability problem disguised as an on-call problem. Invest in stability to protect your people.

On-Call Onboarding and Training

New team members shouldn't be thrown into on-call without preparation. Effective on-call onboarding builds confidence, reduces mistakes, and creates a sustainable path from observer to primary responder.

The On-Call Onboarding Journey

A structured progression from novice to competent responder:

On-Call Onboarding Progression
Stage	Duration	Activities	Responsibilities
Shadow Week 1	1 week	Observe primary's response; join incident calls; ask questions	None - learning only
Shadow Week 2	1 week	Suggest diagnosis steps; write incident notes; practice with runbooks	Scribe in incidents; no pages received
Reverse Shadow	1 week	Primary in name; experienced responder shadows and advises	Handle pages with immediate support available
Supported Primary	2-4 weeks	Full primary with secondary who is experienced and actively monitoring	Full primary duties; can escalate freely
Full Primary	Ongoing	Standard primary rotation	Full ownership; normal escalation paths

On-Call Prerequisites

Before entering on-call rotation, engineers should have:

System Knowledge:

Completed architecture overview of systems they'll support
Reviewed and practiced with runbooks for common issues
Gained hands-on experience with observability tools (dashboards, log queries, trace analysis)
Understood deployment and rollback procedures

Process Knowledge:

Completed incident response training
Understood escalation paths and when to use them
Familiar with communication channels and expectations
Practiced with alerting tools (acknowledge, escalate, resolve)

Access Verification:

All production access provisioned and tested
VPN and SSH access verified from home network
Alerting tools configured on personal device
Phone number verified in on-call system

On-Call Readiness Checklist

•✅ Reviewed service architecture documentation for all owned services
•✅ Completed runbook walkthrough with experienced team member
•✅ Practiced querying metrics and logs for common failure modes
•✅ Deployed and rolled back a change in non-prod environment
•✅ Participated in at least 2 incident responses as shadow
•✅ Configured alerting app on phone; verified test page received
•✅ VPN access tested from home network
•✅ SSH/kubectl access verified for all production systems
•✅ Emergency contacts for escalation saved
•✅ Completed incident response training module
•✅ Signed off by team lead or on-call coordinator

Never Solo for the First Time

No one should be completely alone on their first primary on-call week. Even if confident, the psychological support of knowing an experienced person is actively available reduces anxiety and improves outcomes. The reverse-shadow period is essential, not optional.

On-Call Tooling Ecosystem

Effective on-call requires more than willingness—it requires tools that enable rapid response. The right tooling stack reduces friction, improves response times, and prevents human error during stressful incidents.

The On-Call Tooling Stack

On-Call Tool Categories
Category	Purpose	Example Tools	Key Features
Alert Management	Route alerts, manage escalations, track incidents	PagerDuty, Opsgenie, VictorOps, Rootly	Multi-channel notification, escalation policies, on-call scheduling
Monitoring/Observability	Generate alerts, provide diagnostic data	Prometheus, Datadog, New Relic, Grafana	Dashboards, alerting rules, historical data
Communication	Coordinate during incidents	Slack, Microsoft Teams, Zoom/Meet	Incident channels, bridge calls, async updates
Documentation	Runbooks, incident records, post-mortems	Confluence, Notion, Backstage	Searchable, version-controlled, linked to services
Status Pages	External and internal communication	Statuspage, Instatus, Cachet	Subscriber notifications, incident timelines
Access Management	Production access provisioning	Teleport, HashiCorp Boundary, AWS SSO	Just-in-time access, audit logging

Essential On-Call Tool Configurations

1. Mobile App Configuration:

Install alerting app on personal phone (not just work laptop)
Enable high-priority/critical notifications to bypass Do Not Disturb
Test that pages wake you up at night
Enable auto-escalation if phone dies

2. Laptop Readiness:

VPN client installed and credentials fresh
SSH keys accessible
Common dashboards bookmarked
Quick-access to deployment tools

3. Communication Setup:

Slack/Teams on mobile with incident channel notifications
Bridge call shortcut (one-tap to join incident call)
Key contact information readily accessible

4. Documentation Access:

Runbooks accessible offline (or via reliable connection)
Architecture diagrams saved locally
Escalation contact list on phone

oncall-automation.sh
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
#!/bin/bash
# On-Call Toolkit: Quick Access Scripts for Responders
# Store these as aliases or scripts for rapid access during incidents
 
# =============================================================================
# ENVIRONMENT QUICK ACCESS
# =============================================================================
 
# Connect to production VPN (adjust for your VPN client)
alias vpn-prod='sudo openvpn --config /etc/openvpn/prod.ovpn'
 
# Quick SSH to bastion host
alias bastion='ssh -A bastion.prod.example.com'
 
# Kubernetes context switching
alias k-prod='kubectl config use-context prod-cluster'
alias k-staging='kubectl config use-context staging-cluster'
 
# =============================================================================
# QUICK DIAGNOSTICS
# =============================================================================
 
# Get pod status for a service
function pods() {
    local service="$1"
    kubectl get pods -l app="$service" -o wide
}
 
# Tail logs for a service (last 100 lines + follow)
function logs() {
    local service="$1"
    kubectl logs -l app="$service" --tail=100 -f --max-log-requests=10
}
 
# Get recent events for a namespace
function events() {
    local ns="${1:-default}"
    kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -20
}
 
# Check deployment status
function deploy-status() {
    local service="$1"
    kubectl rollout status deployment/"$service" --timeout=5s 2>&1 || true
    kubectl describe deployment "$service" | grep -A5 "Replicas:"
}
 
# =============================================================================
# QUICK ACTIONS
# =============================================================================
 
# Rollback to previous deployment
function rollback() {
    local service="$1"
    echo "⚠️  Rolling back $service to previous version..."
    kubectl rollout undo deployment/"$service"
    kubectl rollout status deployment/"$service"
    echo "✅ Rollback complete. Verify in dashboards."
}
 
# Restart all pods for a service (rolling restart)
function restart-pods() {
    local service="$1"
    echo "🔄 Initiating rolling restart for $service..."
    kubectl rollout restart deployment/"$service"
    kubectl rollout status deployment/"$service"
}
 
# Scale a service temporarily
function scale() {
    local service="$1"
    local replicas="$2"
    echo "📈 Scaling $service to $replicas replicas..."
    kubectl scale deployment/"$service" --replicas="$replicas"
}
 
# =============================================================================
# INCIDENT HELPERS
# =============================================================================
 
# Create incident channel in Slack
function incident-channel() {
    local name="$1"
    local date=$(date +%Y%m%d)
    echo "📢 Creating incident channel: #inc-$date-$name"
    # Replace with your Slack API call or use Incident Bot
    curl -X POST "https://slack.com/api/conversations.create"         -H "Authorization: Bearer $SLACK_TOKEN"         -d "name=inc-$date-$name"
}
 
# Open key dashboards
function dashboards() {
    echo "Opening incident dashboards..."
    open "https://grafana.example.com/d/overview"
    open "https://grafana.example.com/d/errors"
    open "https://example.datadoghq.com/dashboard/abc"
}
 
# Quick runbook access
function runbook() {
    local service="$1"
    open "https://wiki.example.com/runbooks/$service"
}
 
# =============================================================================
# USAGE INSTRUCTIONS
# =============================================================================
: '
Source this file in your shell profile:
    echo "source ~/oncall-toolkit.sh" >> ~/.zshrc
 
Common commands during incidents:
    vpn-prod          # Connect to production VPN
    k-prod            # Switch to production Kubernetes context
    pods checkout     # See checkout service pods
    logs payment      # Tail payment service logs
    rollback checkout # Rollback checkout to previous version
    dashboards        # Open all incident dashboards
    runbook payment   # Open payment service runbook
'

Automate the First 5 Minutes

Create scripts that automate your initial diagnosis steps: connecting to VPN, switching to the right cluster, opening relevant dashboards, and pulling recent logs. When paged at 3 AM, you want one-click access, not a multi-step process requiring full cognitive function.

Summary: Building Sustainable On-Call

On-call is the human infrastructure that makes reliability possible. When done well, it's a sustainable practice that builds expertise, protects customers, and enables teams to sleep soundly most nights. When done poorly, it burns people out and drives away talent.

Key Takeaways

•On-call is an explicit contract — Clear expectations, fair compensation, and organizational support in exchange for availability and response.
•Rotation design matters — Weekly rotations, adequate team size, primary/secondary structure, and thoughtful handoffs create sustainable coverage.
•Alerting and escalation must be robust — Multi-channel notification, appropriate timeouts, and tested escalation paths ensure pages reach responders.
•Compensation acknowledges real burden — Stipends, per-page payments, and time-off recognize that on-call constrains life even when phones stay quiet.
•Well-being is non-negotiable — Target low page volumes, protect sleep, provide recovery time, and recognize burnout warning signs.
•Onboarding is a gradual process — Shadow weeks, reverse-shadowing, and readiness checklists prepare responders before solo responsibility.
•Tooling reduces friction — Mobile apps, quick-access scripts, and pre-configured environments enable rapid response under stress.
•Boring on-call is the goal — If on-call is exciting, you have a reliability problem. Invest in stability to protect your people.

What's Next:

On-call responders need to communicate effectively with stakeholders during incidents. The next page explores Communication During Incidents—how to keep internal teams, executives, and customers informed while responders focus on resolution.

Page Complete

You now understand how to build sustainable on-call practices: from rotation design and alerting configuration to compensation models, well-being protection, and onboarding programs. On-call is the human foundation of reliability—take care of your responders and they'll take care of your systems.