After every significant incident, someone asks: 'What was the root cause?' The question seems simple, almost administrative—we need something to write in the incident report, something to tell stakeholders, something to check off before moving on.
But this question, and especially our approach to answering it, determines whether we genuinely prevent future incidents or merely document past failures. Root cause analysis (RCA) is the systematic process of identifying the fundamental factors that contributed to an incident. Done well, RCA reveals actionable insights that improve system reliability. Done poorly, it produces superficial explanations that change nothing.
The challenge is that complex systems fail in complex ways. There is rarely a single 'root cause' in any meaningful sense. Multiple factors combine—technical, organizational, environmental—to produce an outcome that no single factor would have caused alone. Effective RCA must navigate this complexity rather than suppress it.
By the end of this page, you will master multiple root cause analysis techniques, understand when to apply each, recognize common pitfalls that produce superficial analysis, and learn to distinguish between proximate causes and the deeper systemic factors that genuinely explain why incidents occur.
Before diving into techniques, we must address a fundamental misconception: the notion that every incident has a single 'root cause' that, if eliminated, would have prevented the failure.
This belief—sometimes called the 'root cause fallacy'—is deeply embedded in engineering culture, inherited from simpler domains where linear cause-and-effect relationships hold. If a machine breaks, there's often a specific component that failed. Find the component, replace it, problem solved.
Complex sociotechnical systems don't work this way. Consider a typical modern production incident:

- An engineer runs a migration script against the production database instead of staging.
- The deployment tooling uses the same script for staging and production, with ambiguous environment variables distinguishing the two.
- The work happens at 2:47 AM, off-hours and under deadline pressure, and the engineer is new to the migration process.
- The runbook is outdated, and nothing prompts for confirmation before production writes.
- No alert fires on unexpected production writes, so detection is delayed.
Which of these is 'the' root cause? All of them contributed, and none alone is sufficient. If any one of them had been different, the incident might not have occurred—or might have been detected and resolved before causing user impact.
The term 'root cause' is so entrenched that abandoning it is impractical, but we must reframe it. Root causes are the set of systemic factors that, if addressed, significantly reduce the probability or impact of similar incidents. There are always multiple, and the goal of RCA is to identify the ones most amenable to improvement.
Safety researcher Sidney Dekker warns: 'Root cause analysis typically stops at the human operator. It's politically convenient: they can be identified, blamed, retrained, or replaced. But the conditions that made the operator's action possible—the workload, the interface design, the organizational pressure—remain unexamined.' Effective RCA must push past the human action to the conditions that shaped it.
The Five Whys is the most widely known RCA technique, originating from the Toyota Production System. The method is deceptively simple: when a problem occurs, ask 'Why?' successively (traditionally five times, though the number is not fixed) to peel back layers of causation until reaching an addressable root level.
Example application:
```
Incident: The website returned 500 errors for 15 minutes.

Why #1: Why did the website return 500 errors?
→ The API service crashed.

Why #2: Why did the API service crash?
→ It ran out of memory.

Why #3: Why did it run out of memory?
→ A new code path created a memory leak under high load.

Why #4: Why did this code path reach production?
→ The memory leak was not caught in testing.

Why #5: Why was the memory leak not caught in testing?
→ The test environment does not simulate production-level load.

Actionable root cause: Lack of production-realistic load testing.
```

Strengths of Five Whys:

- Simple to learn and facilitate; no special tooling or training required.
- Fast: a chain can be worked through in minutes during a post-mortem.
- Pushes discussion past the first, most visible explanation.
- Accessible to the whole team, not just reliability specialists.
Five Whys has serious limitations that make it problematic for complex incidents: (1) Linear assumption — it implies a single chain of causation when real incidents have branching causes; (2) Stopping arbitrarily — why stop at five? The 'root' depends on when you choose to stop; (3) Hindsight bias — each 'why' is answered with knowledge of the outcome, which may not reflect what was knowable at the time; (4) Facilitator dependence — different facilitators produce different chains.
When to use Five Whys:
Five Whys works best for relatively simple incidents with a dominant causal chain. For complex, multi-factor incidents, use it as a starting point but complement with more rigorous methods like Fault Trees or Ishikawa diagrams.
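One practical workaround for the linear-chain limitation is to record each 'why' as a node that may have several answers, turning the chain into a small tree. The sketch below is a minimal Python illustration of this idea; the `Why` class and its field names are hypothetical, not part of any standard Five Whys tooling.

```python
from dataclasses import dataclass, field

@dataclass
class Why:
    """One 'Why?' step; allowing multiple answers turns the chain into a tree."""
    question: str
    answers: list["Why"] = field(default_factory=list)
    evidence: str = ""  # pointer to the log/metric supporting the answer

# The chain from the example above, with a second branch added at step 2:
root = Why("Why did the website return 500 errors?")
crash = Why("Why did the API service crash?", evidence="OOM-killer entry in kernel log")
root.answers.append(crash)
crash.answers.append(Why("Why did it run out of memory?"))
crash.answers.append(Why("Why was memory pressure not alerted on?"))

def leaves(node: Why) -> list[Why]:
    """Terminal questions are the candidate root-level factors to address."""
    if not node.answers:
        return [node]
    return [leaf for child in node.answers for leaf in leaves(child)]

for leaf in leaves(root):
    print(leaf.question)
```

Recording evidence alongside each answer also counters the hindsight-bias problem: an answer with no supporting data is a flag for further investigation, not a fact.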
Best practices:

- Answer each 'why' with evidence (logs, metrics, timestamps), not recollection.
- Allow the chain to branch when a question has more than one true answer.
- Do not stop at a human action; keep asking why the conditions made that action likely.
- Involve the people closest to the system rather than relying on a single facilitator's chain.
Ishikawa diagrams, also called fishbone or cause-and-effect diagrams, address the Five Whys' limitation of linear causation. Created by Kaoru Ishikawa in the 1960s, this technique visually maps multiple categories of contributing factors to an incident.
The diagram structure resembles a fish skeleton: the 'head' is the incident/problem, and 'bones' branch off the spine representing categories of causes. Within each category, specific factors are listed.
```
                  Incident: Database Corruption

  METHODS                 MACHINES                PEOPLE
  • No staged rollout     • Same script for       • Engineer fatigued
  • Missing backup          staging and prod        (2:47 AM)
    validation            • Env vars ambiguous    • New to migration
                                                    process

  MONITORING              ENVIRONMENT             MATERIAL
  • No prod write alert   • Deadline pressure     • Runbook outdated
  • Delayed detection     • Off-hours             • No prompt for prod
                            maintenance
```

Common category frameworks for software systems:
The 6 Ms (adapted from manufacturing):

- Methods: processes and procedures (rollout practices, validation steps)
- Machines: systems and tooling (deployment scripts, environments)
- Manpower (People): training, experience, fatigue
- Materials: runbooks, documentation, configuration
- Measurement (Monitoring): alerts, detection, observability
- Mother Nature (Environment): schedule pressure, timing, organizational context
Alternative: 4 Ps (production-focused), commonly: Policies (standards governing changes), Procedures (the specific steps followed), People (skills, staffing, human factors), and Plant/Technology (the infrastructure and tooling itself).
Ishikawa diagrams are excellent for group brainstorming. Draw the skeleton on a whiteboard, assign categories to different team members, and spend 10-15 minutes populating causes. The visual structure prevents linear thinking and surfaces factors that might be overlooked with sequential analysis.
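If you want to keep the whiteboard output with the post-mortem record, a plain mapping from category to factors is enough. A minimal sketch, using the adapted 6M categories and factors from the example above (the structure is illustrative, not a standard format):

```python
# Capture fishbone brainstorming output as category -> factors.
ISHIKAWA_CATEGORIES = ["Methods", "Machines", "People",
                       "Monitoring", "Environment", "Material"]

causes: dict[str, list[str]] = {c: [] for c in ISHIKAWA_CATEGORIES}

# Factors from the database-corruption example:
causes["Methods"] += ["No staged rollout", "Missing backup validation"]
causes["People"] += ["Engineer fatigued (2:47 AM)", "New to migration process"]
causes["Monitoring"] += ["No prod write alert", "Delayed detection"]

# Print the populated skeleton for the post-mortem document.
for category, factors in causes.items():
    if factors:
        print(f"{category}:")
        for factor in factors:
            print(f"  - {factor}")
```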
Fault Tree Analysis (FTA) is a deductive, top-down technique that maps the logical relationships between a top-level failure (the incident) and its underlying causes. Unlike Ishikawa diagrams, which are primarily taxonomic, fault trees model how causes combine using Boolean logic.
FTA originated in aerospace and nuclear industries where rigorous failure analysis is mandatory. It's particularly powerful for complex incidents involving multiple interacting failures.
```
SERVICE UNAVAILABLE for 15 minutes        (top event)
└─ AND  (all three branches required)
   ├─ Bad config deployed
   │  └─ OR
   │     ├─ Human error
   │     ├─ Script bug
   │     └─ Config drift
   ├─ Config not validated
   │  └─ AND
   │     ├─ No unit test
   │     ├─ No e2e test
   │     └─ No staging test
   └─ Issue not detected early
      └─ AND
         ├─ Alerts fired late
         └─ No auto-rollback
```

Key concepts in Fault Tree Analysis:

- Top event: the incident being explained, at the root of the tree.
- Intermediate events: failures that must themselves be explained further.
- Basic events: leaf-level causes that are not decomposed further.
- AND gates: the parent event occurs only if all child events occur.
- OR gates: the parent event occurs if any child event occurs.
- Cut sets: combinations of basic events sufficient to produce the top event; minimal cut sets are the smallest such combinations.
Analyzing the tree: the Boolean structure reveals critical insights:

- The top gate is an AND: the outage required a bad config, missing validation, and late detection all at once. Breaking any one branch prevents the top event entirely.
- 'Bad config deployed' sits above an OR gate: bad configs can arise in three independent ways, so prevention cannot rely on eliminating a single source.
- The validation and detection branches are AND gates: adding any single test, or fixing either detection gap, breaks its branch and with it every path to the top event. These are the cheapest effective interventions.
Fault Tree Analysis is most valuable for high-severity incidents where understanding the precise failure logic matters for remediation. It's overkill for simple incidents but essential when you need to prove that proposed mitigations address all failure paths. The tree structure also helps communicate complex causation to stakeholders.
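Because a fault tree is just a Boolean expression, proposed mitigations can be checked mechanically. The following is a minimal sketch assuming the example tree above; the `Basic` and `Gate` types are hypothetical, but the evaluation logic is standard gate semantics. It demonstrates the key property of the top-level AND gate: falsifying any one branch, here by adding a staging test, prevents the top event regardless of the other factors.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Basic:
    """Leaf-level cause (basic event)."""
    name: str

@dataclass
class Gate:
    """Combines child events with Boolean logic."""
    op: str                  # "AND" or "OR"
    children: list["Node"]

Node = Union[Basic, Gate]

def occurs(node: Node, state: dict[str, bool]) -> bool:
    """True if this event occurs given the truth value of each basic event."""
    if isinstance(node, Basic):
        return state.get(node.name, False)
    results = (occurs(child, state) for child in node.children)
    return all(results) if node.op == "AND" else any(results)

# The tree from the example above:
top = Gate("AND", [
    Gate("OR",  [Basic("human error"), Basic("script bug"), Basic("config drift")]),
    Gate("AND", [Basic("no unit test"), Basic("no e2e test"), Basic("no staging test")]),
    Gate("AND", [Basic("alerts fired late"), Basic("no auto-rollback")]),
])

# With every basic event true, the outage occurs:
state = {b: True for b in ["human error", "script bug", "config drift",
                           "no unit test", "no e2e test", "no staging test",
                           "alerts fired late", "no auto-rollback"]}
assert occurs(top, state)

# Adding a staging test falsifies the middle AND branch, so the top
# event cannot occur no matter what else is true:
state["no staging test"] = False
assert not occurs(top, state)
```

Enumerating subsets of basic events with the same evaluator yields the minimal cut sets described above, which is how you prove that a set of mitigations covers every failure path.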
Clear terminology is essential for effective RCA. Different levels of causation require different responses:
Proximate Cause: The immediate trigger of the incident—the final action or event in the causal chain. Example: 'The engineer ran the migration script against production.'
Proximate causes are often human actions. While they're the 'cause' in a technical sense, stopping here is almost always insufficient for prevention.
Contributing Factors: Conditions that increased the probability or severity of the incident but weren't direct causes by themselves. Example: 'The migration was scheduled for 2:47 AM when the on-call engineer was fatigued.'
Contributing factors often involve organizational, environmental, or contextual elements. They represent the 'latent conditions' in James Reason's Swiss Cheese model.
Root Causes: The fundamental systemic flaws that, if addressed, would prevent the entire class of incident. Example: 'The deployment system does not distinguish between production and staging environments.'
Root causes typically involve system design, processes, tooling, or organizational structures. They're 'root' not because they're first chronologically, but because they're the deepest level at which intervention would be effective.
| Cause Type | Characteristics | Typical Remediation |
|---|---|---|
| Proximate | Immediate trigger, often human action, visible in timeline | Not directly targeted—addressing proximate cause alone rarely prevents recurrence |
| Contributing | Amplified impact or probability, contextual factors, latent conditions | May be addressed if practical—reducing frequency of conditions that amplify failures |
| Root | Systemic, structural, would prevent class of incidents, actionable | Primary target—architectural changes, process improvements, tooling investments |
Calling 'human error' a root cause is almost always wrong. Human error is a proximate cause at best. The root cause is whatever made the human error possible and consequential: unclear interfaces, inadequate training, time pressure, missing safeguards. Always ask: 'What conditions made this error likely to occur and likely to cause harm?'
Even with good intentions and structured techniques, RCA frequently produces inadequate results. Awareness of common pitfalls allows teams to avoid them:

- Stopping at the first plausible explanation, usually a human action, instead of the conditions behind it.
- Hindsight bias: judging decisions by the outcome rather than by what was knowable at the time.
- Politically convenient conclusions that blame individuals and leave processes unexamined.
- Confirmation bias: gathering only the evidence that supports the first theory.
- Root causes phrased so vaguely ('improve testing') that no specific action follows.
Apply this test to every identified root cause: 'If we replaced the individuals involved with different people of similar skill and experience, and made no other changes, would the incident still be possible?' If yes, the root cause isn't the individuals—it's the system conditions. This test reliably redirects analysis toward systemic factors.
Quality indicators for root cause analysis:
✓ Root causes are systemic, not individual
✓ Multiple contributing factors are identified
✓ Each root cause is actionable (leads to a specific improvement)
✓ The analysis explains how, not just what
✓ Timeline events are supported by data, not memory
✓ Analysis accounts for what was knowable at decision points
✓ Uncomfortable truths about systems/processes are not avoided
Rigorous RCA requires evidence, not narrative reconstruction from memory. Human memory is notoriously unreliable, especially for stressful incidents. Within hours, recollections become contaminated by hindsight, conversation, and the natural tendency to construct coherent stories.
Evidence sources for RCA:
| Source | What It Provides | Preservation Requirements |
|---|---|---|
| Application logs | Detailed request-level behavior, error messages, stack traces | Ensure log retention covers incident period; centralized logging essential |
| Metrics/time-series data | Quantified system behavior: latency, throughput, error rates, resource utilization | High-resolution metrics (at least 1-minute granularity) retained |
| Distributed traces | Request flow across services, identifying bottlenecks and failures | Trace sampling may miss incident-relevant traces; consider higher sampling during incidents |
| Deployment logs | What changed and when: code, configuration, infrastructure | Immutable deployment records; version control history |
| Chat/communication logs | Human coordination, decision rationale, confusion points | Export Slack/Teams transcripts; often overlooked evidence source |
| Alert history | What alerted, when, and whether it was actionable | Alerting platform should retain history with timestamps |
| Runbook/documentation state | What instructions were available at incident time | Version-controlled documentation; ability to view historical versions |
Preserving evidence during incidents:
The time to gather evidence is immediately during and after the incident—not days later during the post-mortem meeting. Establish a practice of:

- Snapshotting dashboards and exporting high-resolution metrics for the incident window before retention expires.
- Copying relevant application and deployment logs to incident-specific storage.
- Exporting chat transcripts from the incident channel.
- Recording the exact versions of code, configuration, and runbooks in effect at the time.
- Timestamping key observations as they happen rather than reconstructing them later.
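Even a small script run from the incident channel can make preservation routine. Below is a minimal, standard-library-only sketch; the directory layout, incident ID format, and log paths are hypothetical placeholders for whatever your environment uses.

```python
import json
import shutil
import datetime
from pathlib import Path

def snapshot_evidence(incident_id: str, log_paths: list[Path]) -> Path:
    """Copy raw logs into a per-incident directory before rotation discards them."""
    dest = Path("incident-evidence") / incident_id
    dest.mkdir(parents=True, exist_ok=True)
    manifest = {
        "incident": incident_id,
        "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "files": [],
    }
    for path in log_paths:
        if path.exists():
            shutil.copy2(path, dest / path.name)  # copy2 preserves timestamps
            manifest["files"].append(path.name)
    # Record what was captured and when, for the post-mortem timeline.
    (dest / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return dest

# e.g. snapshot_evidence("2024-03-12-db-corruption", [Path("/var/log/api/app.log")])
```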
Conduct post-mortems within 5 business days of incident resolution. Beyond this window, memory degrades rapidly, evidence may be rotated out, and participants' attention moves to other priorities. If a post-mortem cannot be scheduled within 5 days, at minimum gather all evidence and draft the timeline immediately while details are fresh.
Root cause analysis is only valuable if it produces actionable improvements. The bridge between analysis and improvement is the action item—a concrete, assignable, time-bounded task that addresses an identified root cause or contributing factor.
Well-formed action items have:

- A specific, concrete description of the change to make.
- A single named owner who is accountable for completion.
- A due date, so the item is time-bounded rather than aspirational.
- An explicit link to the root cause or contributing factor it addresses.
- Clear completion criteria, so 'done' is verifiable.
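Representing action items as structured records makes these criteria checkable rather than aspirational. A minimal sketch; the field names and the `is_well_formed` check are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    """One post-mortem follow-up task."""
    description: str    # the specific change to make
    owner: str          # a single accountable person, not a team
    due: date           # time-bounded, not "someday"
    root_cause: str     # which finding this addresses
    done_criteria: str  # how we will know it is complete

    def is_well_formed(self) -> bool:
        # Every field populated, and the deadline not already past.
        return all([self.description, self.owner, self.root_cause,
                    self.done_criteria]) and self.due >= date.today()

item = ActionItem(
    description="Add production-realistic load tests to CI",
    owner="jane.doe",
    due=date(2030, 1, 15),
    root_cause="Test environment does not simulate production-level load",
    done_criteria="Load test at 2x peak traffic runs on every release candidate",
)
assert item.is_well_formed()
```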
Resist the temptation to generate many action items to demonstrate thoroughness. A focused set of high-impact actions is more valuable than a long list that will be ignored. Prioritize ruthlessly: which 3-5 items would most reduce the probability of recurrence?
Root cause analysis is the intellectual core of the post-mortem process—the discipline that transforms incident narratives into organizational learning. Let's consolidate the key insights:

- Complex incidents have multiple interacting causes; treat 'root causes' as the set of systemic factors most amenable to improvement.
- Match the technique to the incident: Five Whys for simple causal chains, Ishikawa diagrams for broad brainstorming, Fault Tree Analysis for high-severity, multi-factor failures.
- Distinguish proximate causes, contributing factors, and root causes; remediation should target the systemic level.
- 'Human error' is a proximate cause, never a root cause; ask what conditions made the error likely and consequential.
- Ground the analysis in preserved evidence, not memory, and conduct the post-mortem within days, not weeks.
- Convert findings into a small, prioritized set of owned, time-bounded action items.
You now possess a comprehensive toolkit for root cause analysis. In the next page, we will explore action items and follow-up—the critical bridge between analysis and improvement that determines whether post-mortems produce real change or merely documents for the archives.