The alert reads: 'Payment Processing Queue Depth Critical'.

The on-call engineer stares at their phone. They're six months into the job. The senior engineer who built this system left last quarter. It's 3 AM, and they have no idea what this alert means, what's normal, or what to do.

They log into the monitoring dashboard. The queue depth is indeed high—but is 50,000 high? Is it supposed to recover automatically? Is there a button to push? A service to restart? A team to call?

Forty-five minutes later, after reading outdated wiki pages, grepping through Slack history, and finally waking a colleague, they learn the fix was a single command that takes ten seconds to run.

This is the knowledge gap that runbooks are designed to close.
By the end of this page, you will understand how to create runbooks that empower any on-call engineer to respond effectively, how to integrate runbooks seamlessly with alerting systems, and how to maintain runbooks as living documents that evolve with your systems.
A runbook (also called a playbook, standard operating procedure, or incident response guide) is a document that provides step-by-step instructions for responding to a specific operational scenario. Runbooks bridge the gap between detecting a problem (via alerts) and resolving it (via human action).
### Why Runbooks Matter

1. **Knowledge Democratization**: The expert's knowledge becomes accessible to all on-call responders, including those new to the team or system.
2. **Reduced Mean Time to Resolution (MTTR)**: Instead of investigating from scratch, responders follow established procedures, drastically reducing time to fix.
3. **Consistent Response**: Every incident of a given type receives the same quality response, regardless of which engineer handles it.
4. **Reduced Stress**: An on-call engineer facing an unfamiliar incident at 3 AM has a lifeline—documented guidance that tells them exactly what to do.
5. **Institutional Memory**: When engineers leave, their operational knowledge remains encoded in runbooks rather than walking out the door with them.
| Aspect | Without Runbook | With Runbook |
|---|---|---|
| Time to understand alert | 5-15 minutes (reading code, dashboards) | 1-2 minutes (read summary) |
| Diagnosis approach | Ad hoc, based on responder experience | Systematic, follows decision tree |
| Common missteps | Frequent (inexperienced responders) | Rare (documented pitfalls avoided) |
| Resolution consistency | Varies wildly by responder | Consistent, proven remediation |
| Post-incident learning | Informal, may not propagate | Updates runbook, learning persists |
| On-call confidence | Low for new team members | High regardless of experience |
If a remediation can be fully automated, it should be. Runbooks are for scenarios requiring human judgment, situations too rare to justify automation, or as a fallback when automation fails. The ideal trajectory: identify common runbook steps that should become automation.
Great runbooks share common structural elements that make them accessible, actionable, and maintainable. Here's the anatomy of an effective runbook:
````markdown
# Runbook: Payment Processing Queue Depth Critical

## Overview

This alert fires when the payment processing queue exceeds 50,000 pending
items, indicating processing is falling behind.

**User Impact**: Payment confirmations delayed. Users see "processing"
indefinitely. Severe: may cause checkout abandonment, refund requests.

**Typical Resolution**: Scale workers or resolve downstream dependency.
Time to fix: Usually 5-15 minutes.

---

## Severity and Escalation

| Initial Severity | Escalation Trigger | Escalate To |
|-----------------|-------------------|-------------|
| SEV2 | Queue > 100k OR duration > 30min | Payment Team Lead |
| SEV1 | User complaints OR revenue impact | Engineering Manager + Product |

---

## Initial Diagnosis

### Step 1: Check current queue depth and trend

Dashboard: [Payment Queue Dashboard](link)

- Current depth: Should be < 10,000 normally
- Trend: Is it growing, stable, or draining?
- Rate: Items entering vs. items processed per second

### Step 2: Check worker health

Dashboard: [Payment Workers Dashboard](link)

- Worker count: Should be >= 10 in production
- Worker status: All should be "Healthy"
- Error rate: Should be < 1%

### Step 3: Check downstream dependencies

- Stripe API: [Stripe Status](https://status.stripe.com)
- Database: [Database Dashboard](link)
- Internal services: [Service Health](link)

---

## Common Causes and Solutions

### Cause 1: Insufficient Workers (Most Common, ~60%)

**Symptoms**:

- Worker count < expected
- Workers healthy but can't keep up
- No errors, just slow processing

**Solution**: Scale workers

```bash
# Check current count
kubectl get deployment payment-workers -n production

# Scale to 20 workers (2x normal)
kubectl scale deployment payment-workers -n production --replicas=20

# Verify scaling
watch kubectl get pods -n production -l app=payment-workers
```

Expected: Pods should reach Running within 2 minutes.
Queue should start draining within 5 minutes.

### Cause 2: Stripe Rate Limiting (~20%)

**Symptoms**:

- Errors mentioning "rate limit" or "429"
- Stripe dashboard shows elevated errors
- Workers healthy but failing requests

**Solution**: Reduce request rate, wait for limit reset

```bash
# Reduce worker concurrency (temp config)
kubectl set env deployment/payment-workers CONCURRENCY=5 -n production

# Wait 1-2 minutes for Stripe rate limit reset
# Then verify error rate decreasing
```

Consider: Contact Stripe about a limit increase for sustained traffic.

### Cause 3: Database Connection Issues (~15%)

**Symptoms**:

- Errors mentioning database/connection/timeout
- Database dashboard shows elevated connections or CPU
- Query latency elevated

**Solution**: See database runbook
Link: [Database Connection Issues Runbook](link)

### Cause 4: Code Bug (Rare, <5%)

**Symptoms**:

- Specific error types in logs
- Recent deployment correlates with issue start
- Only certain payment types affected

**Solution**: Rollback recent deployment

```bash
# Check recent deployments
kubectl rollout history deployment/payment-workers -n production

# Rollback to previous version
kubectl rollout undo deployment/payment-workers -n production

# Verify rollback complete
kubectl rollout status deployment/payment-workers -n production
```

---

## Verification

After applying remediation:

1. **Queue Depth**: Should be decreasing at > 1000/minute
   Dashboard: [Queue Depth Charts](link)
2. **Processing Rate**: Should return to normal (> 500/min)
   Dashboard: [Processing Rate](link)
3. **Error Rate**: Should be < 1%
   Dashboard: [Error Rate](link)
4. **User Impact**: Check recent payment attempts succeeding
   Query: `SELECT count(*) FROM payments WHERE status='success' AND created_at > NOW() - INTERVAL '5 minutes'`

If metrics don't improve within 10 minutes, escalate.

---

## Escalation

**When to escalate**:

- Queue exceeds 100,000 items
- Issue persists > 30 minutes after attempted remediation
- Cause is unclear after following diagnosis steps
- Customer complaints begin appearing

**Escalate to**: @payments-oncall-secondary, @payments-manager
**Include**: Current queue depth, steps tried, suspected cause

---

## Additional Context

### Architecture

```
[API] --> [Queue] --> [Workers] --> [Stripe]
             |            |
             v            v
        [Database]  [Notification]
```

### Historical Notes

- 2024-03-15: Similar incident caused by Stripe maintenance. Resolved.
- 2024-01-22: Queue migration caused 30min backup. Will not recur.

### Related Runbooks

- [Stripe Integration Issues](link)
- [Payment Database Runbook](link)
- [Worker Scaling Automation](link)

---

## Metadata

| Field | Value |
|-------|-------|
| Owner | Payment Team (@payments-team) |
| Last Updated | 2024-11-15 |
| Review Schedule | Quarterly |
| Alert Links | [Queue Depth Alert](link) |
````

The best runbook is useless if responders can't find it during an incident. Tight integration between alerts and runbooks is essential for low-friction incident response.
### Direct Linking

Every alert should include a direct link to its corresponding runbook. The link should be:

1. **Included in the alert notification** — The page/text/Slack message includes a clickable runbook link
2. **Visible in the alert dashboard** — When viewing alert details, the runbook is one click away
3. **Consistent and predictable** — Always in the same location so responders develop muscle memory
```yaml
# Prometheus AlertManager Rule with Runbook URL
groups:
  - name: payment-alerts
    rules:
      - alert: PaymentQueueDepthCritical
        expr: payment_queue_depth > 50000
        for: 5m
        labels:
          severity: critical
          team: payments
          service: payment-processor
        annotations:
          summary: "Payment queue depth critical: {{ $value }} items pending"
          description: |
            The payment processing queue has exceeded 50,000 pending items.
            Current depth: {{ $value }}
            Threshold: 50,000

            This indicates payment processing is falling behind.
            User payments may be delayed.

          # CRITICAL: Include runbook URL in every alert
          runbook_url: "https://runbooks.example.com/payments/queue-depth-critical"

          # Include dashboard link for quick investigation
          dashboard_url: "https://grafana.example.com/d/payments-queue"

          # Include escalation information
          escalation: "After 30min, escalate to payments-manager"
```

### Integration Patterns

Different organizations implement alert-runbook integration differently:

**Pattern 1: URL in Alert Annotations**
- Runbook URL is a field in the alert definition
- Displayed in notification and dashboard
- Simple and widely supported
- Requires manual sync when URLs change

**Pattern 2: Convention-Based Mapping**
- Runbook URL derived from the alert name: `runbooks.example.com/{alert-name}`
- No explicit configuration needed per alert
- Requires consistent naming and URL structure
- Easy to maintain at scale

**Pattern 3: Registry/Database Mapping**
- Central database maps alert IDs to runbook IDs
- Supports many-to-one mappings (multiple alerts → one runbook)
- Enables runbook versioning and search
- More complex to implement

**Pattern 4: Embedded Runbooks**
- Runbook content included in alert definition
- Displayed directly in alert UI
- Guaranteed to be current
- Limited to short procedures; full runbooks need links
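Convention-based mapping can be sketched as a small shell helper that derives the runbook URL from the alert name alone (CamelCase to kebab-case). The base URL and naming scheme here are illustrative assumptions, not a standard:

```shell
# Hypothetical helper: derive a runbook URL from an alert name by
# convention. CamelCase alert name -> kebab-case path under a fixed base.
runbook_url() {
  base="https://runbooks.example.com"
  # Put a hyphen before each capital, drop the leading one, lowercase.
  slug=$(printf '%s' "$1" | sed 's/\([A-Z]\)/-\1/g; s/^-//' | tr '[:upper:]' '[:lower:]')
  printf '%s/%s\n' "$base" "$slug"
}

runbook_url "PaymentQueueDepthCritical"
# -> https://runbooks.example.com/payment-queue-depth-critical
```

Because the mapping is pure convention, no per-alert configuration is needed; the trade-off is that every alert and runbook must follow the naming scheme exactly.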
A responder should reach the relevant runbook within two clicks of seeing an alert: (1) Open alert details, (2) Click runbook link. More than two clicks introduces friction; responders may skip the runbook and improvise.
A runbook is only as good as its clarity. Poorly written runbooks can be worse than nothing—they waste time and provide false confidence. Here's how to write runbooks that actually help.
For example, commands should be copy-pasteable, with the expected result stated alongside:

```bash
# Check worker pods — expect 10 Running pods
kubectl get pods -n payments -l app=worker

# Restore the deployment to its normal replica count
kubectl scale deployment payment-worker --replicas=10 -n payments
```

### The Stress Test

Before publishing a runbook, apply this test. Imagine a junior engineer who:

- Has never worked on this service
- Is bleary-eyed at 3 AM
- Is stressed because users are affected
- Has no access to experts (they're on vacation)

Can they follow this runbook successfully?

If the answer is no, add more detail, more links, more explanation.
While runbooks are human-readable documents, tooling can significantly enhance their creation, maintenance, and execution.
### Semi-Automated Runbook Execution

Some organizations implement platforms that turn runbooks into interactive experiences:

**Interactive Runbooks**
- Runbook displayed as a checklist
- Responder marks steps complete
- System tracks progress and time
- Commands can be executed with one click
- Outputs displayed inline

**Parameterized Commands**
- Replace placeholders with context from the alert
- Instead of: `kubectl get pods -n {namespace}`
- Display: `kubectl get pods -n production` (auto-filled from alert)

**Safely Executable Steps**
- Certain steps marked as 'auto-executable'
- Responder clicks 'Run' instead of copying to terminal
- Output captured and associated with incident
- Audit trail of actions taken
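The parameterized-commands idea reduces to simple template substitution. A minimal sketch, assuming `{{ name }}` placeholders and name=value pairs pulled from alert labels (the helper itself is hypothetical, not a platform API):

```shell
# Hypothetical helper: fill {{ name }} placeholders in a command template
# with values from the alert context, passed as name=value pairs.
render_command() {
  out="$1"; shift
  for pair in "$@"; do
    name="${pair%%=*}"
    value="${pair#*=}"
    # Replace every {{ name }} occurrence (spaces inside braces optional).
    out=$(printf '%s' "$out" | sed "s/{{ *$name *}}/$value/g")
  done
  printf '%s\n' "$out"
}

render_command 'kubectl get pods -n {{ namespace }} -l app={{ app }}' \
  namespace=production app=payment-worker
# -> kubectl get pods -n production -l app=payment-worker
```

A real platform would source the pairs from the alert payload (e.g. `alert.labels.namespace`) rather than from arguments.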
```yaml
# Interactive runbook definition (pseudo-format)
runbook:
  name: "Payment Queue Depth Critical"
  version: "2.3"
  last_updated: "2024-11-15"

  # Parameters from alert context
  parameters:
    - name: namespace
      source: "alert.labels.namespace"
      default: "production"
    - name: current_depth
      source: "alert.annotations.current_value"

  steps:
    - id: check_queue_depth
      title: "Verify Current Queue Depth"
      type: command
      command: |
        kubectl exec -n {{ namespace }} deploy/queue-monitor -- redis-cli llen payment_queue
      expected_output_pattern: '\d+'
      interpretation: |
        Normal: < 10,000
        Elevated: 10,000 - 50,000
        Critical: > 50,000
        Current value from alert: {{ current_depth }}

    - id: check_workers
      title: "Check Worker Status"
      type: command
      command: |
        kubectl get pods -n {{ namespace }} -l app=payment-worker \
          -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount
      expected_output_pattern: '.*Running.*'
      success_criteria: "All pods should show 'Running'"

    - id: scale_decision
      title: "Decide on Scaling"
      type: decision
      question: "Are fewer than 10 worker pods in Running state?"
      options:
        - label: "Yes, need more workers"
          goto: scale_workers
        - label: "No, workers are healthy"
          goto: check_dependencies

    - id: scale_workers
      title: "Scale Worker Deployment"
      type: command
      command: |
        kubectl scale deployment payment-worker -n {{ namespace }} --replicas=20
      requires_confirmation: true
      confirmation_message: "This will double worker count. Proceed?"
      rollback_command: |
        kubectl scale deployment payment-worker -n {{ namespace }} --replicas=10

    - id: check_dependencies
      title: "Check Dependencies"
      type: checklist
      items:
        - "Stripe API status (link)"
        - "Database connection pool (link)"
        - "Redis cluster health (link)"
      manual_check: true

    - id: verification
      title: "Verify Resolution"
      type: command
      command: |
        watch -n 5 'kubectl exec -n {{ namespace }} deploy/queue-monitor -- redis-cli llen payment_queue'
      success_criteria: "Queue depth should decrease by > 1000 every minute"
      timeout_minutes: 10
      on_timeout: "Escalate to @payments-lead"

  escalation:
    after_minutes: 30
    to: "payments-oncall-secondary"
    include:
      - steps_completed
      - outputs
      - current_metrics
```

### Runbook Platforms

Several platforms specialize in runbook management:

| Platform | Strength | Integration |
|----------|----------|-------------|
| Runbook.md / Wiki | Simple, version-controlled | Manual links in alerts |
| PagerDuty Runbook Automation | Runs commands via secure agents | Deep PagerDuty integration |
| Rundeck | Self-hosted, powerful automation | Integrates with most alerting |
| Shoreline | AI-assisted, learns from runs | Cloud-native focus |
| Rootly/Incident.io | Incident management with runbooks | Slack-centric workflow |
Interactive runbooks reveal automation opportunities. If responders execute the same steps every time, those steps become automation candidates. Track execution patterns; runbooks are a stepping stone to self-healing systems.
Runbooks decay rapidly. Command syntax changes, dashboards are reorganized, services are renamed, and procedures that worked last month no longer apply. Runbook maintenance is as important as runbook creation.
### Ownership Model

Every runbook needs a clear owner:

```
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│ Service          │────▶│ Alert            │────▶│ Runbook          │
│ Payment Proc     │     │ Queue Depth      │     │ Queue Depth      │
│                  │     │                  │     │                  │
│ Owner: @alice    │     │                  │     │ Owner: @alice    │
└──────────────────┘     └──────────────────┘     └──────────────────┘
```

**Rule**: The service owner owns all runbooks for that service. They are responsible for accuracy and for keeping them current.

### Review Process

1. **Quarterly Review Meeting**
   - Each service owner presents their runbooks
   - Walk through one runbook in detail
   - Identify stale sections, missing scenarios

2. **Post-Incident Review**
   - Incident postmortem includes 'Was the runbook helpful?'
   - Action items include runbook updates
   - Verify updates completed

3. **Continuous Improvement**
   - Track 'runbook helpfulness' rating from responders
   - Prioritize improvements by usage frequency and helpfulness gap
A stale runbook is worse than no runbook. Responders may follow outdated commands that cause harm, or waste time on procedures that no longer apply. Build staleness detection: alerts for runbooks not updated in 90+ days, automatic 'last verified' reminders.
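Staleness detection can start very simply if runbooks live as markdown files in a version-controlled checkout. A minimal sketch; the directory layout, file extension, and 90-day threshold are all assumptions to adapt:

```shell
# Hypothetical helper: list runbook files not modified in the last N days.
stale_runbooks() {
  dir="$1"
  days="${2:-90}"
  # find's -mtime +N matches files last modified more than N days ago.
  find "$dir" -name '*.md' -type f -mtime "+$days"
}

# Example usage: stale_runbooks ./runbooks 90
```

A more robust version would use `git log -1 --format=%ct -- <file>` per file, so that a bulk checkout doesn't reset modification times.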
Not every alert needs a dedicated runbook, but every page-worthy alert should have one. Understanding coverage gaps helps prioritize runbook development.
### Coverage Analysis

1. **List all page-worthy alerts**
   - Export from alerting system
   - Include alerts that fired in past 6 months

2. **Map alerts to runbooks**
   - Does this alert have a linked runbook?
   - Is the runbook current (updated < 90 days)?
   - Is the runbook detailed enough?

3. **Calculate coverage metrics**
   - Percentage of alerts with runbooks
   - Percentage with current runbooks
   - Percentage with 'high quality' runbooks (by review)

4. **Prioritize gaps**
   - Frequent alerts without runbooks: highest priority
   - Critical-severity alerts without runbooks: highest priority
   - Rarely-firing alerts: lower priority but still needed
| Alert Frequency | Severity | Runbook Priority | Recommended Action |
|---|---|---|---|
| Weekly+ | Critical | HIGHEST | Immediate creation, detail required |
| Weekly+ | High | HIGH | Create within 1 sprint |
| Monthly | Critical | HIGH | Create within 1 sprint |
| Monthly | High | MEDIUM | Plan for next quarter |
| Quarterly | Critical | MEDIUM | Create when capacity allows |
| Quarterly | High | LOW | Minimal runbook acceptable |
| Yearly | Any | LOW | Link to general troubleshooting |
### Runbook Templates

Speed up runbook creation with templates:

1. **Service-Specific Template**: Pre-filled with service architecture, common dashboards, standard rollback procedures
2. **Alert-Type Templates**: Templates for common alert types (high latency, error rate, resource exhaustion) with appropriate diagnostic steps
3. **Skeleton Template**: Minimal structure with all sections as placeholders, ensuring consistency

Templates reduce the barrier to runbook creation from 'write a complete document' to 'fill in the blanks'.
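A skeleton template can be as simple as a script that stamps out the standard sections so authors only fill in the blanks. The section list follows the runbook anatomy shown earlier; the helper itself is hypothetical:

```shell
# Hypothetical helper: print a skeleton runbook for a given alert title.
new_runbook() {
  cat <<EOF
# Runbook: $1

## Overview
TODO: what fires this alert, user impact, typical resolution time.

## Severity and Escalation
TODO: severity levels and escalation triggers.

## Initial Diagnosis
TODO: step-by-step checks with dashboard links.

## Common Causes and Solutions
TODO: one subsection per cause, with copy-pasteable commands.

## Verification
TODO: metrics that confirm recovery.

## Metadata
Owner: TODO | Last Updated: $(date +%Y-%m-%d) | Review: Quarterly
EOF
}

# Example usage: new_runbook "Payment Queue Depth Critical" > queue-depth.md
new_runbook "Example Alert Name"
```

Because every generated file has the same headings, reviews and coverage tooling can rely on the structure being present.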
```bash
#!/bin/bash
# Generate runbook coverage report

echo "Runbook Coverage Analysis"
echo "========================="
echo ""

# Get all alert names from Prometheus AlertManager
alerts=$(curl -s "http://alertmanager:9093/api/v2/alerts" \
  | jq -r '.[].labels.alertname' | sort -u)

total_alerts=$(echo "$alerts" | wc -l)
covered=0
stale=0
missing=0

echo "Alert,Runbook Status,Last Updated" > coverage_report.csv

for alert in $alerts; do
  # Look up the runbook URL in the alert's rule definition
  runbook_url=$(curl -s "http://prometheus:9090/api/v1/rules" \
    | jq -r --arg name "$alert" \
        '.data.groups[].rules[] | select(.name == $name) | .annotations.runbook_url // ""')

  if [ -z "$runbook_url" ]; then
    echo "$alert,MISSING,N/A" >> coverage_report.csv
    ((missing++))
  else
    # Check that the runbook URL actually resolves
    response=$(curl -s -o /dev/null -w "%{http_code}" "$runbook_url")
    if [ "$response" == "200" ]; then
      # Read the Last-Modified header (simplified - actual implementation varies)
      last_modified=$(curl -sI "$runbook_url" | grep -i "last-modified" | cut -d' ' -f2-)
      modified_epoch=$(date -d "$last_modified" +%s)
      age_days=$(( ($(date +%s) - modified_epoch) / 86400 ))

      # Flag runbooks older than 90 days as stale
      if [ "$age_days" -gt 90 ]; then
        echo "$alert,STALE,$last_modified" >> coverage_report.csv
        ((stale++))
      else
        echo "$alert,CURRENT,$last_modified" >> coverage_report.csv
        ((covered++))
      fi
    else
      echo "$alert,BROKEN_LINK,$runbook_url" >> coverage_report.csv
      ((missing++))
    fi
  fi
done

echo ""
echo "Summary:"
echo "--------"
echo "Total Alerts: $total_alerts"
echo "Current Runbooks: $covered ($(echo "scale=1; $covered*100/$total_alerts" | bc)%)"
echo "Stale Runbooks: $stale ($(echo "scale=1; $stale*100/$total_alerts" | bc)%)"
echo "Missing Runbooks: $missing ($(echo "scale=1; $missing*100/$total_alerts" | bc)%)"
echo ""
echo "Detailed report saved to coverage_report.csv"
```

Runbooks transform alerting from mere detection into effective response. They encode expert knowledge, reduce MTTR, and ensure consistent incident handling regardless of which engineer responds.
### Module Complete

You've now covered all essential aspects of alerting design: understanding what to alert on, how to set thresholds, managing alert fatigue, designing escalation policies, and integrating runbooks. Together, these practices transform raw monitoring data into effective, actionable incident response.

The journey doesn't end here—alerting is a continuously improving system. Review your alerts regularly, gather feedback from responders, and iterate relentlessly. The goal is a system where every alert is meaningful, every responder is empowered, and every incident is resolved efficiently.
Congratulations! You've completed the Alerting Design module. You now understand how to build an alerting system that detects problems, routes them appropriately, provides actionable guidance, and evolves with your systems. This forms the bridge between monitoring infrastructure and human response—the critical link in operational excellence.