The alert reads: 'Payment Processing Queue Depth Critical'.
The on-call engineer stares at their phone. They're six months into the job. The senior engineer who built this system left last quarter. It's 3 AM, and they have no idea what this alert means, what's normal, or what to do.
They log into the monitoring dashboard. The queue depth is indeed high—but is 50,000 high? Is it supposed to recover automatically? Is there a button to push? A service to restart? A team to call?
Forty-five minutes later, after reading outdated wiki pages, grepping through Slack history, and finally waking a colleague, they learn the fix was a single command that takes ten seconds to run.
This is the knowledge gap that runbooks are designed to close.
By the end of this page, you will understand how to create runbooks that empower any on-call engineer to respond effectively, how to integrate runbooks seamlessly with alerting systems, and how to maintain runbooks as living documents that evolve with your systems.
A runbook (also called a playbook, standard operating procedure, or incident response guide) is a document that provides step-by-step instructions for responding to a specific operational scenario. Runbooks bridge the gap between detecting a problem (via alerts) and resolving it (via human action).
Why Runbooks Matter
Knowledge Democratization: The expert's knowledge becomes accessible to all on-call responders, including those new to the team or system.
Reduced Mean Time to Resolution (MTTR): Instead of investigating from scratch, responders follow established procedures, drastically reducing time to fix.
Consistent Response: Every incident of a given type receives the same quality response, regardless of which engineer handles it.
Reduced Stress: An on-call facing an unfamiliar incident at 3 AM has a lifeline—documented guidance that tells them exactly what to do.
Institutional Memory: When engineers leave, their operational knowledge remains encoded in runbooks rather than walking out the door with them.
| Aspect | Without Runbook | With Runbook |
|---|---|---|
| Time to understand alert | 5-15 minutes (reading code, dashboards) | 1-2 minutes (read summary) |
| Diagnosis approach | Ad hoc, based on responder experience | Systematic, follows decision tree |
| Common missteps | Frequent (inexperienced responders) | Rare (documented pitfalls avoided) |
| Resolution consistency | Varies wildly by responder | Consistent, proven remediation |
| Post-incident learning | Informal, may not propagate | Updates runbook, learning persists |
| On-call confidence | Low for new team members | High regardless of experience |
If a remediation can be fully automated, it should be. Runbooks are for scenarios requiring human judgment, situations too rare to justify automation, or as a fallback when automation fails. The ideal trajectory: identify common runbook steps that should become automation.
Great runbooks share common structural elements that make them accessible, actionable, and maintainable. Here's the anatomy of an effective runbook:
````markdown
# Runbook: Payment Processing Queue Depth Critical

## Overview

This alert fires when the payment processing queue exceeds 50,000 pending
items, indicating processing is falling behind.

**User Impact**: Payment confirmations delayed. Users see "processing"
indefinitely. Severe: may cause checkout abandonment, refund requests.

**Typical Resolution**: Scale workers or resolve downstream dependency.
Time to fix: Usually 5-15 minutes.

---

## Severity and Escalation

| Initial Severity | Escalation Trigger | Escalate To |
|------------------|--------------------|-------------|
| SEV2 | Queue > 100k OR duration > 30min | Payment Team Lead |
| SEV1 | User complaints OR revenue impact | Engineering Manager + Product |

---

## Initial Diagnosis

### Step 1: Check current queue depth and trend

Dashboard: [Payment Queue Dashboard](link)

- Current depth: Should be < 10,000 normally
- Trend: Is it growing, stable, or draining?
- Rate: Items entering vs. items processed per second

### Step 2: Check worker health

Dashboard: [Payment Workers Dashboard](link)

- Worker count: Should be >= 10 in production
- Worker status: All should be "Healthy"
- Error rate: Should be < 1%

### Step 3: Check downstream dependencies

- Stripe API: [Stripe Status](https://status.stripe.com)
- Database: [Database Dashboard](link)
- Internal services: [Service Health](link)

---

## Common Causes and Solutions

### Cause 1: Insufficient Workers (Most Common, ~60%)

**Symptoms**:

- Worker count < expected
- Workers healthy but can't keep up
- No errors, just slow processing

**Solution**: Scale workers

```bash
# Check current count
kubectl get deployment payment-workers -n production

# Scale to 20 workers (2x normal)
kubectl scale deployment payment-workers -n production --replicas=20

# Verify scaling
watch kubectl get pods -n production -l app=payment-workers
```

Expected: Pods should reach Running within 2 minutes.
Queue should start draining within 5 minutes.

### Cause 2: Stripe Rate Limiting (~20%)

**Symptoms**:

- Errors mentioning "rate limit" or "429"
- Stripe dashboard shows elevated errors
- Workers healthy but failing requests

**Solution**: Reduce request rate, wait for limit reset

```bash
# Reduce worker concurrency (temp config)
kubectl set env deployment/payment-workers CONCURRENCY=5 -n production

# Wait 1-2 minutes for Stripe rate limit reset
# Then verify error rate decreasing
```

Consider: Contact Stripe about a limit increase for sustained traffic.

### Cause 3: Database Connection Issues (~15%)

**Symptoms**:

- Errors mentioning database/connection/timeout
- Database dashboard shows elevated connections or CPU
- Query latency elevated

**Solution**: See database runbook
Link: [Database Connection Issues Runbook](link)

### Cause 4: Code Bug (Rare, <5%)

**Symptoms**:

- Specific error types in logs
- Recent deployment correlates with issue start
- Only certain payment types affected

**Solution**: Roll back recent deployment

```bash
# Check recent deployments
kubectl rollout history deployment/payment-workers -n production

# Rollback to previous version
kubectl rollout undo deployment/payment-workers -n production

# Verify rollback complete
kubectl rollout status deployment/payment-workers -n production
```

---

## Verification

After applying remediation:

1. **Queue Depth**: Should be decreasing at > 1000/minute
   Dashboard: [Queue Depth Charts](link)
2. **Processing Rate**: Should return to normal (> 500/min)
   Dashboard: [Processing Rate](link)
3. **Error Rate**: Should be < 1%
   Dashboard: [Error Rate](link)
4. **User Impact**: Check recent payment attempts succeeding
   Query: `SELECT count(*) FROM payments WHERE status='success' AND created_at > NOW() - INTERVAL '5 minutes'`

If metrics don't improve within 10 minutes, escalate.

---

## Escalation

**When to escalate**:

- Queue exceeds 100,000 items
- Issue persists > 30 minutes after attempted remediation
- Cause is unclear after following diagnosis steps
- Customer complaints begin appearing

**Escalate to**: @payments-oncall-secondary, @payments-manager
**Include**: Current queue depth, steps tried, suspected cause

---

## Additional Context

### Architecture

```
[API] --> [Queue] --> [Workers] --> [Stripe]
             |                         |
             v                         v
        [Database]              [Notification]
```

### Historical Notes

- 2024-03-15: Similar incident caused by Stripe maintenance. Resolved.
- 2024-01-22: Queue migration caused 30min backup. Will not recur.

### Related Runbooks

- [Stripe Integration Issues](link)
- [Payment Database Runbook](link)
- [Worker Scaling Automation](link)

---

## Metadata

| Field | Value |
|-------|-------|
| Owner | Payment Team (@payments-team) |
| Last Updated | 2024-11-15 |
| Review Schedule | Quarterly |
| Alert Links | [Queue Depth Alert](link) |
````
The best runbook is useless if responders can't find it during an incident. Tight integration between alerts and runbooks is essential for low-friction incident response.
Direct Linking
Every alert should include a direct link to its corresponding runbook. This should be a first-class field in the alert definition, not a convention responders have to remember under pressure.
```yaml
# Prometheus alerting rule with runbook URL (routed via Alertmanager)
groups:
  - name: payment-alerts
    rules:
      - alert: PaymentQueueDepthCritical
        expr: payment_queue_depth > 50000
        for: 5m
        labels:
          severity: critical
          team: payments
          service: payment-processor
        annotations:
          summary: "Payment queue depth critical: {{ $value }} items pending"
          description: |
            The payment processing queue has exceeded 50,000 pending items.
            Current depth: {{ $value }}
            Threshold: 50,000

            This indicates payment processing is falling behind.
            User payments may be delayed.
          # CRITICAL: Include runbook URL in every alert
          runbook_url: "https://runbooks.example.com/payments/queue-depth-critical"
          # Include dashboard link for quick investigation
          dashboard_url: "https://grafana.example.com/d/payments-queue"
          # Include escalation information
          escalation: "After 30min, escalate to payments-manager"
```
Integration Patterns
Different organizations implement alert-runbook integration differently:
Pattern 1: URL in Alert Annotations
Pattern 2: Convention-Based Mapping (e.g., `runbooks.example.com/{alert-name}`)
Pattern 3: Registry/Database Mapping
Pattern 4: Embedded Runbooks
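Convention-based mapping (Pattern 2) is simple enough to implement as a one-liner helper. Here is a minimal sketch in Python, assuming CamelCase Prometheus-style alert names and a hypothetical `runbooks.example.com` base URL; the slugging scheme is an illustrative choice, not a standard.

```python
import re

def runbook_url(alert_name: str, base: str = "https://runbooks.example.com") -> str:
    """Derive a runbook URL from an alert name by convention:
    CamelCase names become lowercase, hyphen-separated slugs."""
    # "PaymentQueueDepthCritical" -> ["Payment", "Queue", "Depth", "Critical"]
    words = re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", alert_name)
    slug = "-".join(w.lower() for w in words)
    return f"{base}/{slug}"

print(runbook_url("PaymentQueueDepthCritical"))
# https://runbooks.example.com/payment-queue-depth-critical
```

The appeal of this pattern is that a CI check can verify every alert name resolves to an existing runbook page, with no registry to keep in sync.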
A responder should reach the relevant runbook within two clicks of seeing an alert: (1) Open alert details, (2) Click runbook link. More than two clicks introduces friction; responders may skip the runbook and improvise.
A runbook is only as good as its clarity. Poorly written runbooks can be worse than nothing—they waste time and provide false confidence. Here's how to write runbooks that actually help.
```bash
# Check worker pods - expect 10 Running pods
kubectl get pods -n payments -l app=worker

# Restore the normal worker count
kubectl scale deployment payment-worker --replicas=10 -n payments
```
The Stress Test
Before publishing a runbook, apply this test:
Imagine a junior engineer who joined the team last month, has never touched this system, and is paged alone at 3 AM. Can they follow this runbook successfully?
If the answer is no, add more detail, more links, more explanation.
While runbooks are human-readable documents, tooling can significantly enhance their creation, maintenance, and execution.
Semi-Automated Runbook Execution
Some organizations implement platforms that turn runbooks into interactive experiences:
Interactive Runbooks
Parameterized Commands
A parameterized step stores a command template such as `kubectl get pods -n {namespace}` and presents it to the responder as `kubectl get pods -n production`, with the value auto-filled from the alert.

Safely Executable Steps
```yaml
# Interactive runbook definition (pseudo-format)
runbook:
  name: "Payment Queue Depth Critical"
  version: "2.3"
  last_updated: "2024-11-15"

  # Parameters from alert context
  parameters:
    - name: namespace
      source: "alert.labels.namespace"
      default: "production"
    - name: current_depth
      source: "alert.annotations.current_value"

  steps:
    - id: check_queue_depth
      title: "Verify Current Queue Depth"
      type: command
      command: |
        kubectl exec -n {{ namespace }} deploy/queue-monitor -- redis-cli llen payment_queue
      expected_output_pattern: '\d+'
      interpretation: |
        Normal: < 10,000
        Elevated: 10,000 - 50,000
        Critical: > 50,000
        Current value from alert: {{ current_depth }}

    - id: check_workers
      title: "Check Worker Status"
      type: command
      command: |
        kubectl get pods -n {{ namespace }} -l app=payment-worker \
          -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount
      expected_output_pattern: '.*Running.*'
      success_criteria: "All pods should show 'Running'"

    - id: scale_decision
      title: "Decide on Scaling"
      type: decision
      question: "Are fewer than 10 worker pods in Running state?"
      options:
        - label: "Yes, need more workers"
          goto: scale_workers
        - label: "No, workers are healthy"
          goto: check_dependencies

    - id: scale_workers
      title: "Scale Worker Deployment"
      type: command
      command: |
        kubectl scale deployment payment-worker -n {{ namespace }} --replicas=20
      requires_confirmation: true
      confirmation_message: "This will double worker count. Proceed?"
      rollback_command: |
        kubectl scale deployment payment-worker -n {{ namespace }} --replicas=10

    - id: check_dependencies
      title: "Check Dependencies"
      type: checklist
      items:
        - "Stripe API status (link)"
        - "Database connection pool (link)"
        - "Redis cluster health (link)"
      manual_check: true

    - id: verification
      title: "Verify Resolution"
      type: command
      command: |
        watch -n 5 'kubectl exec -n {{ namespace }} deploy/queue-monitor -- redis-cli llen payment_queue'
      success_criteria: "Queue depth should decrease by > 1000 every minute"
      timeout_minutes: 10
      on_timeout: "Escalate to @payments-lead"

  escalation:
    after_minutes: 30
    to: "payments-oncall-secondary"
    include:
      - steps_completed
      - outputs
      - current_metrics
```
Runbook Platforms
Several platforms specialize in runbook management:
| Platform | Strength | Integration |
|---|---|---|
| Runbook.md / Wiki | Simple, version-controlled | Manual links in alerts |
| PagerDuty Runbook Automation | Runs commands via secure agents | Deep PagerDuty integration |
| Rundeck | Self-hosted, powerful automation | Integrates with most alerting |
| Shoreline | AI-assisted, learns from runs | Cloud-native focus |
| Rootly/Incident.io | Incident management with runbooks | Slack-centric workflow |
Interactive runbooks reveal automation opportunities. If responders execute the same steps every time, those steps become automation candidates. Track execution patterns; runbooks are a stepping stone to self-healing systems.
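Tracking which steps responders actually run makes the automation candidates visible. Below is a minimal sketch, assuming step executions are logged as lists of step ids per incident; the function name and the 90% threshold are illustrative assumptions, not any platform's built-in feature.

```python
from collections import Counter

def automation_candidates(executions: list[list[str]], threshold: float = 0.9) -> list[str]:
    """Given step-id sequences from past runbook executions, return steps
    that ran in nearly every execution - prime automation candidates."""
    if not executions:
        return []
    # Count each step once per execution, then keep those above threshold
    counts = Counter(step for run in executions for step in set(run))
    total = len(executions)
    return [step for step, n in counts.items() if n / total >= threshold]

runs = [
    ["check_queue_depth", "check_workers", "scale_workers"],
    ["check_queue_depth", "check_workers", "scale_workers"],
    ["check_queue_depth", "check_workers", "check_dependencies"],
]
print(sorted(automation_candidates(runs)))
# ['check_queue_depth', 'check_workers']
```

Steps that appear in nearly every run with no decision branch in between are the first candidates to collapse into a single automated remediation.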
Runbooks decay rapidly. Command syntax changes, dashboards are reorganized, services are renamed, and procedures that worked last month no longer apply. Runbook maintenance is as important as runbook creation.
Ownership Model
Every runbook needs a clear owner:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Service │───▶│ Alert │───▶│ Runbook │
│ Payment Proc │ │ Queue Depth │ │ Queue Depth │
│ │ │ │ │ │
│ Owner: @alice │ │ │ │ Owner: @alice │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Rule: The service owner owns all runbooks for that service and is responsible for keeping them accurate and current.
Review Process
Quarterly Review Meeting
Post-Incident Review
Continuous Improvement
A stale runbook is worse than no runbook. Responders may follow outdated commands that cause harm, or waste time on procedures that no longer apply. Build staleness detection: alerts for runbooks not updated in 90+ days, automatic 'last verified' reminders.
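The staleness detection described above can be sketched as a small check run on a schedule. This is a minimal illustration assuming you can collect a last-updated date per runbook (from git history, wiki metadata, or an HTTP `Last-Modified` header); the names are hypothetical.

```python
from datetime import date

def stale_runbooks(last_updated: dict[str, date], today: date,
                   max_age_days: int = 90) -> list[str]:
    """Return runbook names not updated within max_age_days, suitable
    for feeding a 'stale runbook' alert or a review-reminder bot."""
    return [name for name, updated in last_updated.items()
            if (today - updated).days > max_age_days]

inventory = {
    "payment-queue-depth-critical": date(2024, 11, 15),
    "stripe-integration-issues": date(2024, 5, 1),
}
print(stale_runbooks(inventory, today=date(2024, 12, 1)))
# ['stripe-integration-issues']
```

Wiring the output into the team's alerting keeps runbook maintenance on the same footing as any other operational signal.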
Not every alert needs a dedicated runbook, but every page-worthy alert should have one. Understanding coverage gaps helps prioritize runbook development.
Coverage Analysis
List all page-worthy alerts
Map alerts to runbooks
Calculate coverage metrics
Prioritize gaps
| Alert Frequency | Severity | Runbook Priority | Recommended Action |
|---|---|---|---|
| Weekly+ | Critical | HIGHEST | Immediate creation, detail required |
| Weekly+ | High | HIGH | Create within 1 sprint |
| Monthly | Critical | HIGH | Create within 1 sprint |
| Monthly | High | MEDIUM | Plan for next quarter |
| Quarterly | Critical | MEDIUM | Create when capacity allows |
| Quarterly | High | LOW | Minimal runbook acceptable |
| Yearly | Any | LOW | Link to general troubleshooting |
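The prioritization matrix above is mechanical enough to encode directly, so gap triage can be scripted rather than argued per alert. A minimal sketch, using the frequency and severity buckets from the table; the function name is an assumption.

```python
def runbook_priority(frequency: str, severity: str) -> str:
    """Map alert frequency and severity buckets to runbook priority,
    following the matrix in the table above."""
    freq, sev = frequency.lower(), severity.lower()
    if freq == "yearly":
        return "LOW"  # any severity: link to general troubleshooting
    matrix = {
        ("weekly+", "critical"): "HIGHEST",
        ("weekly+", "high"): "HIGH",
        ("monthly", "critical"): "HIGH",
        ("monthly", "high"): "MEDIUM",
        ("quarterly", "critical"): "MEDIUM",
        ("quarterly", "high"): "LOW",
    }
    return matrix.get((freq, sev), "LOW")

print(runbook_priority("Weekly+", "Critical"))
# HIGHEST
```

Combined with the coverage report later in this section, this turns "which runbook do we write next?" into a sorted backlog.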
Runbook Templates
Speed up runbook creation with templates:
Service-Specific Template: Pre-filled with service architecture, common dashboards, standard rollback procedures
Alert-Type Templates: Templates for common alert types (high latency, error rate, resource exhaustion) with appropriate diagnostic steps
Skeleton Template: Minimal structure with all sections as placeholders, ensuring consistency
Templates reduce the barrier to runbook creation from 'write a complete document' to 'fill in the blanks'.
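A skeleton template can itself be generated, so every new runbook starts with the same section order used in the example earlier on this page. A minimal sketch; the section list and function name are illustrative assumptions.

```python
SECTIONS = [
    "Overview", "Severity and Escalation", "Initial Diagnosis",
    "Common Causes and Solutions", "Verification", "Escalation", "Metadata",
]

def skeleton_runbook(alert_name: str, owner: str) -> str:
    """Generate a fill-in-the-blanks runbook skeleton with a
    consistent section order for every new alert."""
    lines = [f"# Runbook: {alert_name}", ""]
    for section in SECTIONS:
        lines += [f"## {section}", "", "TODO", ""]
    lines += [f"Owner: {owner}", "Last Updated: YYYY-MM-DD"]
    return "\n".join(lines)

print(skeleton_runbook("Payment Queue Depth Critical", "@payments-team"))
```

Generating skeletons (rather than copy-pasting an old runbook) avoids inheriting stale commands and links from whichever document happened to be copied.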
```bash
#!/bin/bash
# Generate runbook coverage report

echo "Runbook Coverage Analysis"
echo "========================="
echo ""

# Get all alert names from Prometheus AlertManager
alerts=$(curl -s "http://alertmanager:9093/api/v2/alerts" | jq -r '.[].labels.alertname' | sort -u)

total_alerts=$(echo "$alerts" | wc -l)
covered=0
stale=0
missing=0

echo "Alert,Runbook Status,Last Updated" > coverage_report.csv

for alert in $alerts; do
  # Check if a runbook URL exists in the alert's rule definition
  runbook_url=$(curl -s "http://prometheus:9090/api/v1/rules" |
    jq -r --arg alert "$alert" \
      '.data.groups[].rules[] | select(.name == $alert) | .annotations.runbook_url // ""')

  if [ -z "$runbook_url" ]; then
    echo "$alert,MISSING,N/A" >> coverage_report.csv
    ((missing++))
  else
    # Check that the runbook page actually exists
    response=$(curl -s -o /dev/null -w "%{http_code}" "$runbook_url")
    if [ "$response" == "200" ]; then
      # Read the Last-Modified header (simplified - actual implementation varies)
      last_modified=$(curl -sI "$runbook_url" | grep -i "^last-modified:" | cut -d' ' -f2-)

      # Flag runbooks not updated in the last 90 days (GNU date)
      age_days=$(( ($(date +%s) - $(date -d "$last_modified" +%s)) / 86400 ))
      if [ "$age_days" -gt 90 ]; then
        echo "$alert,STALE,$last_modified" >> coverage_report.csv
        ((stale++))
      else
        echo "$alert,CURRENT,$last_modified" >> coverage_report.csv
        ((covered++))
      fi
    else
      echo "$alert,BROKEN_LINK,$runbook_url" >> coverage_report.csv
      ((missing++))
    fi
  fi
done

echo ""
echo "Summary:"
echo "--------"
echo "Total Alerts: $total_alerts"
echo "Current Runbooks: $covered ($(echo "scale=1; $covered*100/$total_alerts" | bc)%)"
echo "Stale Runbooks: $stale ($(echo "scale=1; $stale*100/$total_alerts" | bc)%)"
echo "Missing Runbooks: $missing ($(echo "scale=1; $missing*100/$total_alerts" | bc)%)"
echo ""
echo "Detailed report saved to coverage_report.csv"
```
Runbooks transform alerting from mere detection into effective response. They encode expert knowledge, reduce MTTR, and ensure consistent incident handling regardless of which engineer responds.
Module Complete:
You've now covered all essential aspects of alerting design: understanding what to alert on, how to set thresholds, managing alert fatigue, designing escalation policies, and integrating runbooks. Together, these practices transform raw monitoring data into effective, actionable incident response.
The journey doesn't end here—alerting is a continuously improving system. Review your alerts regularly, gather feedback from responders, and iterate relentlessly. The goal is a system where every alert is meaningful, every responder is empowered, and every incident is resolved efficiently.
Congratulations! You've completed the Alerting Design module. You now understand how to build an alerting system that detects problems, routes them appropriately, provides actionable guidance, and evolves with your systems. This forms the bridge between monitoring infrastructure and human response—the critical link in operational excellence.