After every significant incident, someone asks: 'What was the root cause?' The question seems simple, almost administrative—we need something to write in the incident report, something to tell stakeholders, something to check off before moving on.
But this question, and especially our approach to answering it, determines whether we genuinely prevent future incidents or merely document past failures. Root cause analysis (RCA) is the systematic process of identifying the fundamental factors that contributed to an incident. Done well, RCA reveals actionable insights that improve system reliability. Done poorly, it produces superficial explanations that change nothing.
The challenge is that complex systems fail in complex ways. There is rarely a single 'root cause' in any meaningful sense. Multiple factors combine—technical, organizational, environmental—to produce an outcome that no single factor would have caused alone. Effective RCA must navigate this complexity rather than suppress it.
By the end of this page, you will master multiple root cause analysis techniques, understand when to apply each, recognize common pitfalls that produce superficial analysis, and learn to distinguish between proximate causes and the deeper systemic factors that genuinely explain why incidents occur.
Before diving into techniques, we must address a fundamental misconception: the notion that every incident has a single 'root cause' that, if eliminated, would have prevented the failure.
This belief—sometimes called the 'root cause fallacy'—is deeply embedded in engineering culture, inherited from simpler domains where linear cause-and-effect relationships hold. If a machine breaks, there's often a specific component that failed. Find the component, replace it, problem solved.
Complex sociotechnical systems don't work this way. Consider a typical modern production incident:

- An engineer runs a migration script against the production database instead of staging.
- The deployment tooling uses the same script for staging and production, with ambiguous environment variables distinguishing the two.
- The work happens at 2:47 AM, off-hours and under deadline pressure, and the engineer is new to the migration process.
- The runbook is outdated, and nothing prompts for confirmation before production writes.
- No alert fires on unexpected production writes, so detection is delayed.
Which of these is 'the' root cause? All of them contributed, and none alone is sufficient. If any one of them had been different, the incident might not have occurred—or might have been detected and resolved before causing user impact.
The term 'root cause' is so entrenched that abandoning it is impractical, but we must reframe it. Root causes are the set of systemic factors that, if addressed, significantly reduce the probability or impact of similar incidents. There are always multiple, and the goal of RCA is to identify the ones most amenable to improvement.
Safety researcher Sidney Dekker warns: 'Root cause analysis typically stops at the human operator. It's politically convenient: they can be identified, blamed, retrained, or replaced. But the conditions that made the operator's action possible—the workload, the interface design, the organizational pressure—remain unexamined.' Effective RCA must push past the human action to the conditions that shaped it.
The Five Whys is the most widely known RCA technique, originating from the Toyota Production System. The method is deceptively simple: when a problem occurs, ask 'Why?' successively (traditionally five times, though the number is not fixed) to peel back layers of causation until reaching an addressable root level.
Example application:
```
Incident: The website returned 500 errors for 15 minutes.

Why #1: Why did the website return 500 errors?
→ The API service crashed.

Why #2: Why did the API service crash?
→ It ran out of memory.

Why #3: Why did it run out of memory?
→ A new code path created a memory leak under high load.

Why #4: Why did this code path reach production?
→ The memory leak was not caught in testing.

Why #5: Why was the memory leak not caught in testing?
→ The test environment does not simulate production-level load.

Actionable root cause: Lack of production-realistic load testing.
```

Strengths of Five Whys:

- Simple to learn and facilitate; no special tooling or training required.
- Fast: a chain can be worked through in minutes during a post-mortem.
- Pushes discussion past the first, most visible explanation.
- Accessible to the whole team, not just reliability specialists.
Five Whys has serious limitations that make it problematic for complex incidents: (1) Linear assumption — it implies a single chain of causation when real incidents have branching causes; (2) Stopping arbitrarily — why stop at five? The 'root' depends on when you choose to stop; (3) Hindsight bias — each 'why' is answered with knowledge of the outcome, which may not reflect what was knowable at the time; (4) Facilitator dependence — different facilitators produce different chains.
When to use Five Whys:
Five Whys works best for relatively simple incidents with a dominant causal chain. For complex, multi-factor incidents, use it as a starting point but complement with more rigorous methods like Fault Trees or Ishikawa diagrams.
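One practical workaround for the linear-chain limitation is to record each 'why' as a node that may have several answers, turning the chain into a small tree. The sketch below is a minimal Python illustration of this idea; the `Why` class and its field names are hypothetical, not part of any standard Five Whys tooling.

```python
from dataclasses import dataclass, field

@dataclass
class Why:
    """One 'Why?' step; allowing multiple answers turns the chain into a tree."""
    question: str
    answers: list["Why"] = field(default_factory=list)
    evidence: str = ""  # pointer to the log/metric supporting the answer

# The chain from the example above, with a second branch added at step 2:
root = Why("Why did the website return 500 errors?")
crash = Why("Why did the API service crash?", evidence="OOM-killer entry in kernel log")
root.answers.append(crash)
crash.answers.append(Why("Why did it run out of memory?"))
crash.answers.append(Why("Why was memory pressure not alerted on?"))

def leaves(node: Why) -> list[Why]:
    """Terminal questions are the candidate root-level factors to address."""
    if not node.answers:
        return [node]
    return [leaf for child in node.answers for leaf in leaves(child)]

for leaf in leaves(root):
    print(leaf.question)
```

Recording evidence alongside each answer also counters the hindsight-bias problem: an answer with no supporting data is a flag for further investigation, not a fact.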
Best practices:

- Answer each 'why' with evidence (logs, metrics, timestamps), not recollection.
- Allow the chain to branch when a question has more than one true answer.
- Do not stop at a human action; keep asking why the conditions made that action likely.
- Involve the people closest to the system rather than relying on a single facilitator's chain.
Ishikawa diagrams, also called fishbone or cause-and-effect diagrams, address the Five Whys' limitation of linear causation. Created by Kaoru Ishikawa in the 1960s, this technique visually maps multiple categories of contributing factors to an incident.
The diagram structure resembles a fish skeleton: the 'head' is the incident/problem, and 'bones' branch off the spine representing categories of causes. Within each category, specific factors are listed.
```
                  Incident: Database Corruption

  METHODS                 MACHINES                PEOPLE
  • No staged rollout     • Same script for       • Engineer fatigued
  • Missing backup          staging and prod        (2:47 AM)
    validation            • Env vars ambiguous    • New to migration
                                                    process

  MONITORING              ENVIRONMENT             MATERIAL
  • No prod write alert   • Deadline pressure     • Runbook outdated
  • Delayed detection     • Off-hours             • No prompt for prod
                            maintenance
```

Common category frameworks for software systems:
The 6 Ms (adapted from manufacturing):

- Methods: processes and procedures (rollout practices, validation steps)
- Machines: systems and tooling (deployment scripts, environments)
- Manpower (People): training, experience, fatigue
- Materials: runbooks, documentation, configuration
- Measurement (Monitoring): alerts, detection, observability
- Mother Nature (Environment): schedule pressure, timing, organizational context
Alternative: 4 Ps (production-focused), commonly: Policies (standards governing changes), Procedures (the specific steps followed), People (skills, staffing, human factors), and Plant/Technology (the infrastructure and tooling itself).
Ishikawa diagrams are excellent for group brainstorming. Draw the skeleton on a whiteboard, assign categories to different team members, and spend 10-15 minutes populating causes. The visual structure prevents linear thinking and surfaces factors that might be overlooked with sequential analysis.
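If you want to keep the whiteboard output with the post-mortem record, a plain mapping from category to factors is enough. A minimal sketch, using the adapted 6M categories and factors from the example above (the structure is illustrative, not a standard format):

```python
# Capture fishbone brainstorming output as category -> factors.
ISHIKAWA_CATEGORIES = ["Methods", "Machines", "People",
                       "Monitoring", "Environment", "Material"]

causes: dict[str, list[str]] = {c: [] for c in ISHIKAWA_CATEGORIES}

# Factors from the database-corruption example:
causes["Methods"] += ["No staged rollout", "Missing backup validation"]
causes["People"] += ["Engineer fatigued (2:47 AM)", "New to migration process"]
causes["Monitoring"] += ["No prod write alert", "Delayed detection"]

# Print the populated skeleton for the post-mortem document.
for category, factors in causes.items():
    if factors:
        print(f"{category}:")
        for factor in factors:
            print(f"  - {factor}")
```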
Fault Tree Analysis (FTA) is a deductive, top-down technique that maps the logical relationships between a top-level failure (the incident) and its underlying causes. Unlike Ishikawa diagrams, which are primarily taxonomic, fault trees model how causes combine using Boolean logic.
FTA originated in aerospace and nuclear industries where rigorous failure analysis is mandatory. It's particularly powerful for complex incidents involving multiple interacting failures.
```
SERVICE UNAVAILABLE for 15 minutes        (top event)
└─ AND  (all three branches required)
   ├─ Bad config deployed
   │  └─ OR
   │     ├─ Human error
   │     ├─ Script bug
   │     └─ Config drift
   ├─ Config not validated
   │  └─ AND
   │     ├─ No unit test
   │     ├─ No e2e test
   │     └─ No staging test
   └─ Issue not detected early
      └─ AND
         ├─ Alerts fired late
         └─ No auto-rollback
```

Key concepts in Fault Tree Analysis:

- Top event: the incident being explained, at the root of the tree.
- Intermediate events: failures that must themselves be explained further.
- Basic events: leaf-level causes that are not decomposed further.
- AND gates: the parent event occurs only if all child events occur.
- OR gates: the parent event occurs if any child event occurs.
- Cut sets: combinations of basic events sufficient to produce the top event; minimal cut sets are the smallest such combinations.
Analyzing the tree: the Boolean structure reveals critical insights:

- The top gate is an AND: the outage required a bad config, missing validation, and late detection all at once. Breaking any one branch prevents the top event entirely.
- 'Bad config deployed' sits above an OR gate: bad configs can arise in three independent ways, so prevention cannot rely on eliminating a single source.
- The validation and detection branches are AND gates: adding any single test, or fixing either detection gap, breaks its branch and with it every path to the top event. These are the cheapest effective interventions.
Fault Tree Analysis is most valuable for high-severity incidents where understanding the precise failure logic matters for remediation. It's overkill for simple incidents but essential when you need to prove that proposed mitigations address all failure paths. The tree structure also helps communicate complex causation to stakeholders.
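Because a fault tree is just a Boolean expression, proposed mitigations can be checked mechanically. The following is a minimal sketch assuming the example tree above; the `Basic` and `Gate` types are hypothetical, but the evaluation logic is standard gate semantics. It demonstrates the key property of the top-level AND gate: falsifying any one branch, here by adding a staging test, prevents the top event regardless of the other factors.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Basic:
    """Leaf-level cause (basic event)."""
    name: str

@dataclass
class Gate:
    """Combines child events with Boolean logic."""
    op: str                  # "AND" or "OR"
    children: list["Node"]

Node = Union[Basic, Gate]

def occurs(node: Node, state: dict[str, bool]) -> bool:
    """True if this event occurs given the truth value of each basic event."""
    if isinstance(node, Basic):
        return state.get(node.name, False)
    results = (occurs(child, state) for child in node.children)
    return all(results) if node.op == "AND" else any(results)

# The tree from the example above:
top = Gate("AND", [
    Gate("OR",  [Basic("human error"), Basic("script bug"), Basic("config drift")]),
    Gate("AND", [Basic("no unit test"), Basic("no e2e test"), Basic("no staging test")]),
    Gate("AND", [Basic("alerts fired late"), Basic("no auto-rollback")]),
])

# With every basic event true, the outage occurs:
state = {b: True for b in ["human error", "script bug", "config drift",
                           "no unit test", "no e2e test", "no staging test",
                           "alerts fired late", "no auto-rollback"]}
assert occurs(top, state)

# Adding a staging test falsifies the middle AND branch, so the top
# event cannot occur no matter what else is true:
state["no staging test"] = False
assert not occurs(top, state)
```

Enumerating subsets of basic events with the same evaluator yields the minimal cut sets described above, which is how you prove that a set of mitigations covers every failure path.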
Clear terminology is essential for effective RCA. Different levels of causation require different responses:
Proximate Cause: The immediate trigger of the incident—the final action or event in the causal chain. Example: 'The engineer ran the migration script against production.'
Proximate causes are often human actions. While they're the 'cause' in a technical sense, stopping here is almost always insufficient for prevention.
Contributing Factors: Conditions that increased the probability or severity of the incident but weren't direct causes by themselves. Example: 'The migration was scheduled for 2:47 AM when the on-call engineer was fatigued.'
Contributing factors often involve organizational, environmental, or contextual elements. They represent the 'latent conditions' in James Reason's Swiss Cheese model.
Root Causes: The fundamental systemic flaws that, if addressed, would prevent the entire class of incident. Example: 'The deployment system does not distinguish between production and staging environments.'
Root causes typically involve system design, processes, tooling, or organizational structures. They're 'root' not because they're first chronologically, but because they're the deepest level at which intervention would be effective.
| Cause Type | Characteristics | Typical Remediation |
|---|---|---|
| Proximate | Immediate trigger, often human action, visible in timeline | Not directly targeted—addressing proximate cause alone rarely prevents recurrence |
| Contributing | Amplified impact or probability, contextual factors, latent conditions | May be addressed if practical—reducing frequency of conditions that amplify failures |
| Root | Systemic, structural, would prevent class of incidents, actionable | Primary target—architectural changes, process improvements, tooling investments |
Calling 'human error' a root cause is almost always wrong. Human error is a proximate cause at best. The root cause is whatever made the human error possible and consequential: unclear interfaces, inadequate training, time pressure, missing safeguards. Always ask: 'What conditions made this error likely to occur and likely to cause harm?'
Even with good intentions and structured techniques, RCA frequently produces inadequate results. Awareness of common pitfalls allows teams to avoid them:

- Stopping at the first plausible explanation, usually a human action, instead of the conditions behind it.
- Hindsight bias: judging decisions by the outcome rather than by what was knowable at the time.
- Politically convenient conclusions that blame individuals and leave processes unexamined.
- Confirmation bias: gathering only the evidence that supports the first theory.
- Root causes phrased so vaguely ('improve testing') that no specific action follows.
Apply this test to every identified root cause: 'If we replaced the individuals involved with different people of similar skill and experience, and made no other changes, would the incident still be possible?' If yes, the root cause isn't the individuals—it's the system conditions. This test reliably redirects analysis toward systemic factors.
Quality indicators for root cause analysis:
✓ Root causes are systemic, not individual
✓ Multiple contributing factors are identified
✓ Each root cause is actionable (leads to a specific improvement)
✓ The analysis explains how, not just what
✓ Timeline events are supported by data, not memory
✓ Analysis accounts for what was knowable at decision points
✓ Uncomfortable truths about systems/processes are not avoided
Rigorous RCA requires evidence, not narrative reconstruction from memory. Human memory is notoriously unreliable, especially for stressful incidents. Within hours, recollections become contaminated by hindsight, conversation, and the natural tendency to construct coherent stories.
Evidence sources for RCA:
| Source | What It Provides | Preservation Requirements |
|---|---|---|
| Application logs | Detailed request-level behavior, error messages, stack traces | Ensure log retention covers incident period; centralized logging essential |
| Metrics/time-series data | Quantified system behavior: latency, throughput, error rates, resource utilization | High-resolution metrics (at least 1-minute granularity) retained |
| Distributed traces | Request flow across services, identifying bottlenecks and failures | Trace sampling may miss incident-relevant traces; consider higher sampling during incidents |
| Deployment logs | What changed and when: code, configuration, infrastructure | Immutable deployment records; version control history |
| Chat/communication logs | Human coordination, decision rationale, confusion points | Export Slack/Teams transcripts; often overlooked evidence source |
| Alert history | What alerted, when, and whether it was actionable | Alerting platform should retain history with timestamps |
| Runbook/documentation state | What instructions were available at incident time | Version-controlled documentation; ability to view historical versions |
Preserving evidence during incidents:
The time to gather evidence is immediately during and after the incident—not days later during the post-mortem meeting. Establish a practice of:

- Snapshotting dashboards and exporting high-resolution metrics for the incident window before retention expires.
- Copying relevant application and deployment logs to incident-specific storage.
- Exporting chat transcripts from the incident channel.
- Recording the exact versions of code, configuration, and runbooks in effect at the time.
- Timestamping key observations as they happen rather than reconstructing them later.
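Even a small script run from the incident channel can make preservation routine. Below is a minimal, standard-library-only sketch; the directory layout, incident ID format, and log paths are hypothetical placeholders for whatever your environment uses.

```python
import json
import shutil
import datetime
from pathlib import Path

def snapshot_evidence(incident_id: str, log_paths: list[Path]) -> Path:
    """Copy raw logs into a per-incident directory before rotation discards them."""
    dest = Path("incident-evidence") / incident_id
    dest.mkdir(parents=True, exist_ok=True)
    manifest = {
        "incident": incident_id,
        "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "files": [],
    }
    for path in log_paths:
        if path.exists():
            shutil.copy2(path, dest / path.name)  # copy2 preserves timestamps
            manifest["files"].append(path.name)
    # Record what was captured and when, for the post-mortem timeline.
    (dest / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return dest

# e.g. snapshot_evidence("2024-03-12-db-corruption", [Path("/var/log/api/app.log")])
```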
Conduct post-mortems within 5 business days of incident resolution. Beyond this window, memory degrades rapidly, evidence may be rotated out, and participants' attention moves to other priorities. If a post-mortem cannot be scheduled within 5 days, at minimum gather all evidence and draft the timeline immediately while details are fresh.
Root cause analysis is only valuable if it produces actionable improvements. The bridge between analysis and improvement is the action item—a concrete, assignable, time-bounded task that addresses an identified root cause or contributing factor.
Well-formed action items have:

- A specific, concrete description of the change to make.
- A single named owner who is accountable for completion.
- A due date, so the item is time-bounded rather than aspirational.
- An explicit link to the root cause or contributing factor it addresses.
- Clear completion criteria, so 'done' is verifiable.
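Representing action items as structured records makes these criteria checkable rather than aspirational. A minimal sketch; the field names and the `is_well_formed` check are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    """One post-mortem follow-up task."""
    description: str    # the specific change to make
    owner: str          # a single accountable person, not a team
    due: date           # time-bounded, not "someday"
    root_cause: str     # which finding this addresses
    done_criteria: str  # how we will know it is complete

    def is_well_formed(self) -> bool:
        # Every field populated, and the deadline not already past.
        return all([self.description, self.owner, self.root_cause,
                    self.done_criteria]) and self.due >= date.today()

item = ActionItem(
    description="Add production-realistic load tests to CI",
    owner="jane.doe",
    due=date(2030, 1, 15),
    root_cause="Test environment does not simulate production-level load",
    done_criteria="Load test at 2x peak traffic runs on every release candidate",
)
assert item.is_well_formed()
```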
Resist the temptation to generate many action items to demonstrate thoroughness. A focused set of high-impact actions is more valuable than a long list that will be ignored. Prioritize ruthlessly: which 3-5 items would most reduce the probability of recurrence?
Root cause analysis is the intellectual core of the post-mortem process—the discipline that transforms incident narratives into organizational learning. Let's consolidate the key insights:

- Complex incidents have multiple interacting causes; treat 'root causes' as the set of systemic factors most amenable to improvement.
- Match the technique to the incident: Five Whys for simple causal chains, Ishikawa diagrams for broad brainstorming, Fault Tree Analysis for high-severity, multi-factor failures.
- Distinguish proximate causes, contributing factors, and root causes; remediation should target the systemic level.
- 'Human error' is a proximate cause, never a root cause; ask what conditions made the error likely and consequential.
- Ground the analysis in preserved evidence, not memory, and conduct the post-mortem within days, not weeks.
- Convert findings into a small, prioritized set of owned, time-bounded action items.
You now possess a comprehensive toolkit for root cause analysis. In the next page, we will explore action items and follow-up—the critical bridge between analysis and improvement that determines whether post-mortems produce real change or merely documents for the archives.