At 2:47 AM, somewhere in the world, someone's phone is buzzing. A database cluster has crossed a latency threshold. A payment service is returning errors. A Kubernetes pod is stuck in a crash loop. In that moment, the difference between a minor blip and a major outage often comes down to a single question: Is the right person awake, alert, and equipped to respond?
On-call is the human infrastructure of reliability. All the monitoring, alerting, and automation in the world is worthless if there's no qualified human ready to receive the signals and take action. Yet on-call is also one of the most challenging aspects of engineering culture—it impacts work-life balance, creates stress, and can lead to burnout if poorly managed.
Organizations that excel at incident management recognize on-call as a first-class operational concern, not an afterthought. They design rotations carefully, invest in tooling, compensate appropriately, and continuously improve the experience. This page explores how.
By the end of this page, you will understand how to design sustainable on-call rotations, select and configure alerting tools, compensate on-call fairly, protect responder well-being, and build on-call practices that teams actually want to participate in. You'll learn the difference between on-call systems that burn people out and those that make teams more effective.
On-call exists to ensure that production systems have responsible human oversight at all times. While automation handles routine operations, complex failures require human judgment—the ability to diagnose novel situations, weigh trade-offs, and make decisions under uncertainty.
What On-Call Provides
On-call gives every production system a named responder at all times, a predictable response time, and a clear path to more help when the first responder needs it. Most organizations structure this as a set of tiers:
| Level | Responsibility | Typical Response Time | Escalates To |
|---|---|---|---|
| Primary On-Call | First responder for all alerts; triages and handles or escalates | 5-15 minutes | Secondary On-Call |
| Secondary On-Call | Backup when primary is unavailable; steps in after timeout | 15-30 minutes | Engineering Manager |
| Specialist On-Call | Subject matter expert for specific systems; engaged by primary as needed | Variable | Depends on system |
| Manager On-Call | Escalation point for severity decisions, resource allocation, external communication | 15-30 minutes | Director/VP |
| Executive On-Call | Major incident awareness; customer and stakeholder communication authority | 30-60 minutes | CEO/CTO |
The On-Call Contract
On-call is an explicit agreement between the organization and the engineer. When you're on-call:
• You stay reachable and able to start responding within the agreed time (typically 5-15 minutes for a primary)
• You keep a charged laptop, network access, and working production credentials within reach
• You triage every page: handle it, or escalate without hesitation
• You hand off cleanly at the end of your shift, including any incidents still open
In return, the organization provides:
• Fair compensation for availability and for actual incident response
• Alerts that are actionable and a page volume that lets you sleep most nights
• Working tooling, runbooks, and pre-provisioned access
• A clear escalation path so no one is ever stuck alone
• Recovery time after disruptive shifts
A healthy on-call culture doesn't celebrate engineers who sacrifice sleep and personal time. It celebrates systems that rarely page, alerts that are actionable, and rotations that share the burden fairly. The goal is boring on-call shifts—nothing happens, everyone sleeps well.
Rotation design determines how on-call burden is distributed across the team. Poor rotation design leads to burnout, resentment, and coverage gaps. Good design balances coverage needs with human sustainability.
Key Rotation Parameters
• Shift length: a week is the common default; shorter shifts reduce fatigue but multiply handoffs
• Handoff time: a mid-week, mid-day handoff (e.g., Tuesday at 10:00) avoids Monday-morning and weekend transitions
• Team size: rotations smaller than roughly six people put each engineer on-call too often to be sustainable
• Coverage tiers: whether a secondary backs up the primary, and who fills that slot (often the outgoing primary)
Sample Rotation Structures
Standard Weekly Rotation (8-person team): each engineer takes one week as primary roughly every eight weeks, typically with the outgoing primary rolling into the secondary slot; handoff happens at a fixed mid-week time.
Follow-the-Sun (3-region global team): each region covers roughly its own business hours, so nobody is paged overnight; the cost is three handoffs per day (see the schedule sketch after this list).
Weekend Hero (for teams with limited coverage): a separate, volunteer-staffed weekend rotation, usually at a premium rate, so the weekday primary gets their weekends back.
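To make the follow-the-sun pattern concrete, here is a minimal schedule sketch in the same PagerDuty-style YAML used for the escalation policy configuration later on this page. The region names, handoff times, and restriction fields are illustrative assumptions, not any specific vendor's API.

```yaml
# Follow-the-sun schedule sketch (PagerDuty-style YAML; field names illustrative)
# Three regional layers, each restricted to an 8-hour UTC window, so every page
# lands during the local business day of whoever is on duty.
schedules:
  - id: platform-follow-the-sun
    name: "Platform Primary (Follow-the-Sun)"
    timezone: "UTC"
    layers:
      - name: "APAC (Sydney)"               # roughly 09:00-17:00 local
        rotation_type: weekly
        handoff_time: "22:00:00"
        users: [apac-1@example.com, apac-2@example.com, apac-3@example.com]
        restrictions:
          - type: daily
            start_time: "22:00:00"
            duration_hours: 8
      - name: "EMEA (Dublin)"               # roughly 07:00-15:00 local
        rotation_type: weekly
        handoff_time: "06:00:00"
        users: [emea-1@example.com, emea-2@example.com, emea-3@example.com]
        restrictions:
          - type: daily
            start_time: "06:00:00"
            duration_hours: 8
      - name: "AMER (Denver)"               # roughly 08:00-16:00 local
        rotation_type: weekly
        handoff_time: "14:00:00"
        users: [amer-1@example.com, amer-2@example.com, amer-3@example.com]
        restrictions:
          - type: daily
            start_time: "14:00:00"
            duration_hours: 8
```

Each engineer is paged only during their region's daytime, which is what makes follow-the-sun sustainable; the price is three handoffs per day, so concise written handoff notes become part of the job.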
Holidays and vacations create coverage gaps. Solve them explicitly:
• Maintain an override pool of volunteers willing to swap
• Offer premium compensation for holiday coverage
• Allow (encourage!) swaps—don't force people to be on-call during important personal events
• Plan holiday coverage 4+ weeks in advance
• Consider reduced SLOs during low-traffic holiday periods
On-call relies on robust alerting infrastructure to deliver the right alerts to the right people at the right time. Misconfigured alerting leads to missed incidents, excessive noise, or responders who can't sleep because their phones buzz constantly.
Alerting Stack Components
```yaml
# PagerDuty-style Escalation Policy Configuration
# This policy demonstrates multi-tier escalation with
# appropriate timeouts and notification preferences

escalation_policy:
  name: "Payment Service Production"
  description: "Escalation for payment service production alerts"

  # Repeat the entire escalation chain if still unresolved
  repeat_enabled: true
  repeat_count: 3  # After 3 complete cycles, page on-call manager

  escalation_rules:
    # Level 0: Primary On-Call
    - timeout_minutes: 5
      targets:
        - type: schedule
          id: payment-service-primary-schedule
      notification_rules:
        - channels: [push, sms]
          delay_minutes: 0
        - channels: [phone_call]
          delay_minutes: 2

    # Level 1: Secondary On-Call (if primary doesn't respond)
    - timeout_minutes: 10
      targets:
        - type: schedule
          id: payment-service-secondary-schedule
      notification_rules:
        - channels: [push, sms, phone_call]
          delay_minutes: 0

    # Level 2: Engineering Manager
    - timeout_minutes: 10
      targets:
        - type: user
          id: payments-eng-manager
        - type: user
          id: payments-eng-manager-backup
      notification_rules:
        - channels: [phone_call]
          delay_minutes: 0

    # Level 3: On-Call Director (major escalation)
    - timeout_minutes: 15
      targets:
        - type: schedule
          id: engineering-director-on-call
      notification_rules:
        - channels: [phone_call]
          delay_minutes: 0

---
# Schedule Configuration
schedules:
  - id: payment-service-primary-schedule
    name: "Payments Primary On-Call"
    timezone: "America/Los_Angeles"
    layers:
      # Standard weekly rotation
      - name: "Weekly Rotation"
        rotation_type: weekly
        handoff_time: "10:00:00"
        handoff_day: tuesday
        users:
          - alice@example.com
          - bob@example.com
          - carol@example.com
          - david@example.com
          - eve@example.com
          - frank@example.com
          - grace@example.com
          - henry@example.com

    # Overrides for vacations, holidays, swaps
    overrides:
      - start: "2024-12-24T00:00:00"
        end: "2024-12-26T00:00:00"
        user: volunteer@example.com
        reason: "Holiday coverage - Christmas"

---
# Service Configuration
services:
  - id: payment-service-prod
    name: "Payment Service (Production)"
    escalation_policy: payment-service-production
    alert_creation: create_alerts_and_incidents

    # Intelligent grouping to reduce noise
    alert_grouping:
      type: intelligent
      timeout_minutes: 5

    # Auto-resolve when condition clears
    auto_resolve:
      enabled: true
      timeout_minutes: 240

    # Integration with monitoring
    integrations:
      - type: prometheus_alertmanager
        routing_key: payment-prod-key
      - type: datadog
        routing_key: datadog-payment-key
```

Escalation Timeout Considerations
Escalation timeouts balance response urgency against responder availability:
• Too short, and pages leapfrog to the secondary or the manager before the primary has even unlocked a laptop, which teaches people to ignore the chain
• Too long, and a genuinely missed page adds many minutes of customer impact before anyone else is engaged
• Timeouts must also leave room for the slowest notification channel: a phone call placed 2 minutes after the push notification needs time to be answered before escalation fires
Consider time of day—nighttime escalations may need slightly longer timeouts (responders are waking up).
Acknowledgment vs. Resolution
Distinguish between:
• Acknowledgment: the responder confirms they have received the page and are investigating, which stops further escalation
• Resolution: the underlying condition is fixed (or confirmed to be a false positive) and the alert is closed
Acknowledgment should happen quickly (within minutes). Resolution takes as long as it takes. Track both metrics separately.
Review every alert that could page at 3 AM and ask: 'Does this require immediate human action that cannot wait until morning?' If the answer is no, the alert should either be lower severity (no page) or should be suppressed during off-hours. Waking engineers for non-urgent issues destroys on-call sustainability.
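As a concrete sketch of the "no page off-hours" path, the Prometheus Alertmanager routing below sends only critical alerts to the pager and mutes warning-level notifications overnight. The receiver names, Slack channel, routing key, and the off-hours window are assumptions for illustration.

```yaml
# Alertmanager routing sketch: only critical alerts page a human; warning-level
# alerts go to chat and are muted overnight for review in the morning.
# Receiver names, the Slack channel, the routing key, and the off-hours window
# are illustrative assumptions.
route:
  receiver: slack-warnings            # default: chat only, never pages
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-payments    # pages on-call via the PagerDuty integration
    - matchers:
        - severity = "warning"
      receiver: slack-warnings
      mute_time_intervals:
        - offhours                    # no notifications between 22:00 and 07:00

time_intervals:
  - name: offhours
    time_intervals:
      - times:
          - start_time: "22:00"
            end_time: "07:00"
        location: "America/Los_Angeles"

receivers:
  - name: pagerduty-payments
    pagerduty_configs:
      - routing_key: payment-prod-key
  - name: slack-warnings
    slack_configs:
      - channel: "#payments-alerts"   # assumes slack_api_url is set in global config
```

Anything muted overnight should land in a morning review queue rather than disappearing; muting changes when a human looks at the alert, not whether they look at all.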
On-call is work. Being available to respond, even when no incidents occur, constrains what you can do with your time. Actual incident response disrupts sleep, personal time, and mental health. Fair compensation acknowledges these costs and ensures on-call burden is shared equitably.
Compensation Models
Organizations compensate on-call through various mechanisms:
| Model | How It Works | Pros | Cons |
|---|---|---|---|
| Flat Stipend | $X per on-call shift (e.g., $500/week) | Simple; predictable for budgeting | Doesn't account for actual incident volume |
| Per-Page Payment | $Y for each page received (e.g., $50/page) | Reflects actual disruption; incentivizes reducing pages | Complex tracking; variable team costs |
| Hybrid Model | Base stipend + per-page bonus | Balances availability and disruption compensation | More complex to administer |
| Time-in-Lieu | Comp time earned for on-call and incidents | No direct cost; time recovery | Doesn't feel like real compensation; may be unused |
| Salary Loading | On-call expectations baked into base salary | Simplest; no per-shift overhead | Unfair if burden isn't shared equally |
| Weekend Premium | Higher rate for weekend/holiday coverage | Reflects higher personal cost of weekend disruption | May create competition for weekday-only slots |
Recommended Approach: Hybrid Compensation
Most mature organizations use a hybrid model:
• A base stipend for every shift, paid whether or not anything pages
• A premium for weekend and holiday coverage
• Per-incident or per-hour payment for off-hours response work
• Compensatory time off after major or sleep-disrupting incidents
Example Calculation:
• Weekday on-call shift (5 days): $400 base
• Weekend coverage (2 days): $200 additional
• Night-time incident (per hour of active work): $75
• Major incident (SEV-1/SEV-2 response): 0.5 day comp time
• Typical week with one SEV-2 night incident (about an hour of active work): $600 + $75 = $675, plus 0.5 day comp time
Non-Monetary Recognition
Beyond compensation:
• Give responders recovery time after night or weekend incidents, including no early meetings the next morning
• Recognize incident response and follow-up work in performance reviews and promotion cases
• Let the person who was paged prioritize fixing whatever paged them
• Thank responders visibly; on-call work is easy to render invisible
Unfair compensation breeds resentment and attrition. Junior engineers shouldn't be subsidizing reliability by taking more on-call for less pay. Senior engineers with families shouldn't opt out entirely. Design compensation and rotation to ensure equitable burden distribution across the team.
On-call can be sustainable or it can burn people out. The difference lies in how organizations balance reliability needs against human well-being. Burned-out responders make mistakes, leave the company, and create institutional knowledge loss. Sustainable on-call builds expertise, team cohesion, and operational excellence.
The Well-Being Equation
Responder well-being depends on several factors:
• Page volume, and especially how often pages arrive at night
• Predictability: shift dates known well in advance, and a schedule people can trust
• Fairness: whether the burden is shared evenly across the team
• Support: a clear escalation path and no expectation of solving everything alone
• Recovery: real time off after disruptive shifts
The Burnout Warning Signs
Recognize early indicators of on-call burnout:
• Dread in the days before a shift, or bargaining to avoid upcoming rotations
• Sleep problems and irritability that persist after the shift ends
• Cynicism about alerts ("it's probably nothing again") and slower acknowledgment times
• Quiet disengagement: skipped handoffs, skipped post-incident reviews, talk of leaving
When you see these signs, address the root causes—usually excessive page volume from unreliable systems—rather than just counseling the individual.
On-call should be boring most weeks. If responders regularly have exciting on-call shifts (multiple incidents, complex pages, night interruptions), you have a reliability problem disguised as an on-call problem. Invest in stability to protect your people.
New team members shouldn't be thrown into on-call without preparation. Effective on-call onboarding builds confidence, reduces mistakes, and creates a sustainable path from observer to primary responder.
The On-Call Onboarding Journey
A structured progression from novice to competent responder:
| Stage | Duration | Activities | Responsibilities |
|---|---|---|---|
| Shadow Week 1 | 1 week | Observe primary's response; join incident calls; ask questions | None - learning only |
| Shadow Week 2 | 1 week | Suggest diagnosis steps; write incident notes; practice with runbooks | Scribe in incidents; no pages received |
| Reverse Shadow | 1 week | Primary in name; experienced responder shadows and advises | Handle pages with immediate support available |
| Supported Primary | 2-4 weeks | Full primary with secondary who is experienced and actively monitoring | Full primary duties; can escalate freely |
| Full Primary | Ongoing | Standard primary rotation | Full ownership; normal escalation paths |
On-Call Prerequisites
Before entering on-call rotation, engineers should have:
System Knowledge: the service architecture and its critical dependencies, the main dashboards, and the most common failure modes and the runbooks that cover them.
Process Knowledge: severity definitions, the escalation policy and who sits behind each tier, how incident channels and bridge calls are run, and where incident notes and post-mortems live.
Access Verification: VPN, bastion and production access, Kubernetes contexts, dashboards, and the paging tool itself, all tested end to end before the first shift.
No one should be completely alone on their first primary on-call week. Even if confident, the psychological support of knowing an experienced person is actively available reduces anxiety and improves outcomes. The reverse-shadow period is essential, not optional.
Effective on-call requires more than willingness—it requires tools that enable rapid response. The right tooling stack reduces friction, improves response times, and prevents human error during stressful incidents.
The On-Call Tooling Stack
| Category | Purpose | Example Tools | Key Features |
|---|---|---|---|
| Alert Management | Route alerts, manage escalations, track incidents | PagerDuty, Opsgenie, VictorOps, Rootly | Multi-channel notification, escalation policies, on-call scheduling |
| Monitoring/Observability | Generate alerts, provide diagnostic data | Prometheus, Datadog, New Relic, Grafana | Dashboards, alerting rules, historical data |
| Communication | Coordinate during incidents | Slack, Microsoft Teams, Zoom/Meet | Incident channels, bridge calls, async updates |
| Documentation | Runbooks, incident records, post-mortems | Confluence, Notion, Backstage | Searchable, version-controlled, linked to services |
| Status Pages | External and internal communication | Statuspage, Instatus, Cachet | Subscriber notifications, incident timelines |
| Access Management | Production access provisioning | Teleport, HashiCorp Boundary, AWS SSO | Just-in-time access, audit logging |
Essential On-Call Tool Configurations
1. Mobile App Configuration: install the paging app, allow it to bypass Do Not Disturb, give pages a distinct ringtone, and send yourself a test page before the shift starts.
2. Laptop Readiness: keep the laptop charged and nearby, with VPN profiles, Kubernetes contexts, and production credentials verified before the shift, not during an incident.
3. Communication Setup: join the team's alert and incident channels, know how to start or join a bridge call, and confirm you can post to the status page.
4. Documentation Access: bookmark the runbooks and dashboards for the services you cover, and confirm they are reachable from the machine you will actually respond from.
```bash
#!/bin/bash
# On-Call Toolkit: Quick Access Scripts for Responders
# Store these as aliases or scripts for rapid access during incidents

# =============================================================================
# ENVIRONMENT QUICK ACCESS
# =============================================================================

# Connect to production VPN (adjust for your VPN client)
alias vpn-prod='sudo openvpn --config /etc/openvpn/prod.ovpn'

# Quick SSH to bastion host
alias bastion='ssh -A bastion.prod.example.com'

# Kubernetes context switching
alias k-prod='kubectl config use-context prod-cluster'
alias k-staging='kubectl config use-context staging-cluster'

# =============================================================================
# QUICK DIAGNOSTICS
# =============================================================================

# Get pod status for a service
function pods() {
  local service="$1"
  kubectl get pods -l app="$service" -o wide
}

# Tail logs for a service (last 100 lines + follow)
function logs() {
  local service="$1"
  kubectl logs -l app="$service" --tail=100 -f --max-log-requests=10
}

# Get recent events for a namespace
function events() {
  local ns="${1:-default}"
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -20
}

# Check deployment status
function deploy-status() {
  local service="$1"
  kubectl rollout status deployment/"$service" --timeout=5s 2>&1 || true
  kubectl describe deployment "$service" | grep -A5 "Replicas:"
}

# =============================================================================
# QUICK ACTIONS
# =============================================================================

# Rollback to previous deployment
function rollback() {
  local service="$1"
  echo "⚠️ Rolling back $service to previous version..."
  kubectl rollout undo deployment/"$service"
  kubectl rollout status deployment/"$service"
  echo "✅ Rollback complete. Verify in dashboards."
}

# Restart all pods for a service (rolling restart)
function restart-pods() {
  local service="$1"
  echo "🔄 Initiating rolling restart for $service..."
  kubectl rollout restart deployment/"$service"
  kubectl rollout status deployment/"$service"
}

# Scale a service temporarily
function scale() {
  local service="$1"
  local replicas="$2"
  echo "📈 Scaling $service to $replicas replicas..."
  kubectl scale deployment/"$service" --replicas="$replicas"
}

# =============================================================================
# INCIDENT HELPERS
# =============================================================================

# Create incident channel in Slack
function incident-channel() {
  local name="$1"
  local date=$(date +%Y%m%d)
  echo "📢 Creating incident channel: #inc-$date-$name"
  # Replace with your Slack API call or use Incident Bot
  curl -X POST "https://slack.com/api/conversations.create" \
    -H "Authorization: Bearer $SLACK_TOKEN" \
    -d "name=inc-$date-$name"
}

# Open key dashboards
function dashboards() {
  echo "Opening incident dashboards..."
  open "https://grafana.example.com/d/overview"
  open "https://grafana.example.com/d/errors"
  open "https://example.datadoghq.com/dashboard/abc"
}

# Quick runbook access
function runbook() {
  local service="$1"
  open "https://wiki.example.com/runbooks/$service"
}

# =============================================================================
# USAGE INSTRUCTIONS
# =============================================================================
: '
Source this file in your shell profile:
  echo "source ~/oncall-toolkit.sh" >> ~/.zshrc

Common commands during incidents:
  vpn-prod           # Connect to production VPN
  k-prod             # Switch to production Kubernetes context
  pods checkout      # See checkout service pods
  logs payment       # Tail payment service logs
  rollback checkout  # Rollback checkout to previous version
  dashboards         # Open all incident dashboards
  runbook payment    # Open payment service runbook
'
```

Create scripts that automate your initial diagnosis steps: connecting to VPN, switching to the right cluster, opening relevant dashboards, and pulling recent logs. When paged at 3 AM, you want one-click access, not a multi-step process requiring full cognitive function.
On-call is the human infrastructure that makes reliability possible. When done well, it's a sustainable practice that builds expertise, protects customers, and enables teams to sleep soundly most nights. When done poorly, it burns people out and drives away talent.
What's Next:
On-call responders need to communicate effectively with stakeholders during incidents. The next page explores Communication During Incidents—how to keep internal teams, executives, and customers informed while responders focus on resolution.
You now understand how to build sustainable on-call practices: from rotation design and alerting configuration to compensation models, well-being protection, and onboarding programs. On-call is the human foundation of reliability—take care of your responders and they'll take care of your systems.