Post Mortems - Learning Module

Learning from Failures

The Organization That Cannot Forget

In 2003, NASA's Space Shuttle Columbia disintegrated during re-entry, killing all seven crew members. The immediate physical cause was a piece of insulating foam that broke off the external tank during launch and struck the leading edge of the left wing, damaging its thermal protection. But the deeper finding of the Columbia Accident Investigation Board was devastating: NASA had failed to learn from previous incidents.

Foam loss had been observed on most previous shuttle missions. It had been discussed, investigated, and normalized. The organization had the information to prevent Columbia's loss, but it failed to transform that information into effective action. The lessons were documented but not learned.

The post-mortem is not learning. The post-mortem creates the potential for learning. Actual learning happens when insights change behavior, when knowledge spreads beyond the incident team, when patterns are recognized across incidents, and when the organization fundamentally improves its capacity to prevent and respond to failures.

This page is about closing that gap—transforming incidents from isolated events into drivers of organizational improvement.

What You Will Learn

By the end of this page, you will understand how to maximize learning from individual incidents, disseminate insights beyond the immediate team, identify patterns across multiple incidents, create knowledge repositories, and build an organizational culture that treats failures as opportunities for growth.

Individual Learning vs. Organizational Learning

When an incident occurs and a post-mortem is conducted, who learns? At minimum, the individuals directly involved learn something—they gain firsthand experience of a failure mode and its resolution. But individual learning is fragile and limited:

Individuals leave organizations, taking knowledge with them
Individuals forget details over time
Individuals may not encounter similar situations again
Individual learning doesn't scale—the same lesson must be learned repeatedly by different people

Organizational learning occurs when the organization itself changes: its systems, processes, documentation, training, tooling, and culture. These changes persist beyond any individual and apply to everyone who encounters similar situations.

Individual vs. Organizational Learning
Dimension	Individual Learning	Organizational Learning
Persistence	Exists in one person's memory	Embedded in systems and processes
Scalability	Benefits one person	Benefits everyone who uses the improved system
Reliability	Subject to forgetting, cognitive load	Encoded in automation, documentation, training
Transfer	Requires explicit teaching	Implicit in using the system
Example	'I now know to check the environment variable'	'The system now validates the environment variable'

The Learning Hierarchy:

Learning operates at multiple levels, each with increasing impact and persistence:

Individual memory — Lowest level; fragile, doesn't scale
Team knowledge — Shared understanding within a team; lost when team changes
Documentation — Written records that persist; requires discovery and reading
Process change — New procedures that apply to everyone; requires compliance
Automation/tooling — Systemic changes that apply automatically; highest persistence

Goal: Push learning up the hierarchy. Whenever possible, transform individual lessons into automation or tooling. When automation isn't feasible, encode the lesson in process. When process is insufficient, at least document it.

The 'No Human Intervention' Test

For each lesson learned from an incident, ask: 'Can we prevent this failure mode without relying on a human to remember and act correctly?' If yes, implement that solution. Human memory and attention are among the least reliable safeguards in any system, not because people are incompetent, but because they get tired, distracted, and interrupted. Offload to machines wherever possible.
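
As a minimal sketch of this test, the check below turns the lesson 'remember to check the environment variable' into a startup validation that needs no one to remember anything. The variable names and the functions validate_environment and require_environment are hypothetical, chosen only for illustration.

```python
import os
from typing import Mapping, Sequence

# Hypothetical variable names, used only for illustration.
REQUIRED_VARS = ("DATABASE_URL", "PAYMENTS_API_KEY")


def validate_environment(
    env: Mapping[str, str], required: Sequence[str] = REQUIRED_VARS
) -> list[str]:
    """Return the names of required variables that are missing or blank."""
    return [name for name in required if not env.get(name, "").strip()]


def require_environment(env: Mapping[str, str] = os.environ) -> None:
    """Fail fast at startup instead of relying on someone to remember the check."""
    missing = validate_environment(env)
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
```

Calling require_environment() as the first step of service startup would make a missing variable a loud, immediate failure rather than a latent one.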

Knowledge Dissemination Mechanisms

A post-mortem document sitting in a folder helps no one if teams in similar situations don't know it exists. Dissemination is the process of spreading incident learnings beyond the immediate team to everyone who might benefit.

Dissemination channels:

•Post-mortem email/summary — Send a concise summary to a broad engineering mailing list. Include: what happened, impact, key lessons, and link to full document. People who wouldn't read the full document will read the summary.
•Post-mortem readouts — Present significant post-mortems at team or engineering-wide meetings. The oral format enables Q&A and discussion that written documents cannot.
•Weekly/monthly incident digest — Aggregate recent post-mortems into a regular publication. Highlight patterns, notable lessons, and completed action items.
•Searchable post-mortem database — All post-mortems in a searchable repository. When starting work on a system, engineers can search for past incidents to understand potential failure modes.
•On-call handoff material — Include relevant post-mortems in on-call handoff documentation. When inheriting on-call responsibility, review recent incidents for the services covered.
•New hire onboarding — Include post-mortem review in onboarding curriculum. New engineers learn about real failures rather than theoretical risks.
•Cross-team learning sessions — Periodic sessions where teams present post-mortems to other teams. Especially valuable when incidents reveal issues in shared infrastructure or dependencies.

Targeting dissemination:

Not every post-mortem needs maximum distribution. Consider audience relevance:

Team-level incidents (limited impact, localized cause) — Share within team; summarize to broader org
Cross-team incidents (involving multiple services) — Share with all involved teams; summarize broadly
Organization-wide incidents (major outages, novel failure modes) — Full readout to all engineering; executive summary to company
Industry-relevant incidents (novel patterns, useful to others) — Consider publishing externally (blog posts, conferences)
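
The targeting rules above can be written down as a small lookup so that distribution is a default rather than a per-incident debate. This is only a sketch: the Scope names and channel labels are illustrative and do not refer to any real tool.

```python
from enum import Enum


class Scope(Enum):
    TEAM = "team"
    CROSS_TEAM = "cross_team"
    ORG_WIDE = "org_wide"
    INDUSTRY = "industry"


# Channel labels are illustrative; adapt them to your own channels.
CHANNELS = {
    Scope.TEAM: ["share within team", "summary to broader org"],
    Scope.CROSS_TEAM: ["share with all involved teams", "summary to broader org"],
    Scope.ORG_WIDE: ["full readout to all engineering", "executive summary to company"],
    Scope.INDUSTRY: ["consider external publication"],
}


def channels_for(scope: Scope) -> list[str]:
    """Return the suggested dissemination channels for an incident scope."""
    return list(CHANNELS[scope])
```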

The Pull vs. Push Balance

Push mechanisms (emails, meetings) ensure everyone receives information but can create overload. Pull mechanisms (searchable databases, documentation) allow on-demand access but require initiative. A healthy dissemination strategy uses both: push for immediate awareness, pull for long-term reference.

Pattern Recognition Across Incidents

Individual incidents reveal local failure modes. Patterns across incidents reveal systemic issues. An organization that analyzes incidents only in isolation misses critical insights that emerge from aggregation.

Types of patterns to watch for:

Common Incident Pattern Types
Pattern Type	Description	Example
Component patterns	Certain services/systems repeatedly involved	'The payments service has had 4 incidents this quarter'
Causal patterns	Similar root causes across different services	'5 incidents involved missing input validation'
Trigger patterns	Similar triggering events	'3 incidents followed holiday traffic spikes'
Detection patterns	Consistent detection gaps	'7 incidents were discovered by customers, not monitoring'
Response patterns	Consistent response challenges	'4 incidents had delayed resolution due to unclear ownership'
Temporal patterns	Correlation with time	'Incidents are 3x more likely in the week after major releases'
Organizational patterns	Correlation with team dynamics	'Team X has elevated incident rate since losing senior engineer'

Implementing pattern recognition:

1. Incident tagging/categorization

Develop a taxonomy of incident attributes:

Component(s) involved
Root cause category (e.g., configuration, capacity, bug, dependency)
Trigger type (e.g., deployment, traffic spike, external event)
Detection method (e.g., monitoring, customer report, internal discovery)
Impact category (e.g., availability, latency, data integrity)

Apply tags consistently to every post-mortem.
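
One way to keep tags consistent is to make the taxonomy a closed set of values instead of free text. The sketch below assumes the example categories listed above; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class RootCause(Enum):
    CONFIGURATION = "configuration"
    CAPACITY = "capacity"
    BUG = "bug"
    DEPENDENCY = "dependency"


class Trigger(Enum):
    DEPLOYMENT = "deployment"
    TRAFFIC_SPIKE = "traffic_spike"
    EXTERNAL_EVENT = "external_event"


class Detection(Enum):
    MONITORING = "monitoring"
    CUSTOMER_REPORT = "customer_report"
    INTERNAL_DISCOVERY = "internal_discovery"


class Impact(Enum):
    AVAILABILITY = "availability"
    LATENCY = "latency"
    DATA_INTEGRITY = "data_integrity"


@dataclass(frozen=True)
class IncidentTags:
    """The tag set attached to one post-mortem."""
    components: tuple[str, ...]
    root_cause: RootCause
    trigger: Trigger
    detection: Detection
    impact: Impact
```

Because each tag is an Enum, a misspelled value such as RootCause("config") raises ValueError instead of silently creating a new category that later skews the analysis.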

2. Periodic pattern analysis

Quarterly review of all incidents with aggregate analysis:

Volume and trends
Distribution by category
Emerging themes
Repeat offenders (services, teams, causes)
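
The aggregate views in a quarterly review can be produced with very little code once incidents are tagged. The sketch below uses hypothetical sample records and field names (service, root_cause, quarter).

```python
from collections import Counter

# Hypothetical sample data; each dict is one tagged post-mortem.
incidents = [
    {"service": "payments", "root_cause": "configuration", "quarter": "Q1"},
    {"service": "payments", "root_cause": "bug", "quarter": "Q2"},
    {"service": "search", "root_cause": "configuration", "quarter": "Q2"},
    {"service": "payments", "root_cause": "configuration", "quarter": "Q2"},
]


def distribution(records, field):
    """Count incidents per value of one tag field, most common first."""
    return Counter(r[field] for r in records).most_common()


def volume_by_period(records):
    """Incident volume per quarter, in sorted period order."""
    return sorted(Counter(r["quarter"] for r in records).items())


print(distribution(incidents, "service"))     # repeat offenders by service
print(distribution(incidents, "root_cause"))  # distribution by category
print(volume_by_period(incidents))            # volume and trend
```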

3. Threshold-triggered investigation

Automatic alerts when thresholds are crossed:

'Service X has had 3+ incidents this quarter'
'Root cause Y has appeared in 5+ post-mortems'
'Customer detection rate is above 30%'

Threshold crossing triggers a focused investigation into the systemic issue.
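
A threshold check of this kind might look like the following sketch. The default limits mirror the three example thresholds above; the function name and record fields are assumptions.

```python
from collections import Counter


def threshold_alerts(incidents, service_limit=3, cause_limit=5, customer_rate_limit=0.30):
    """Return a human-readable alert for every threshold that is crossed."""
    alerts = []
    for service, n in Counter(i["service"] for i in incidents).items():
        if n >= service_limit:
            alerts.append(f"Service {service} has had {n} incidents this quarter")
    for cause, n in Counter(i["root_cause"] for i in incidents).items():
        if n >= cause_limit:
            alerts.append(f"Root cause {cause} has appeared in {n} post-mortems")
    if incidents:
        by_customer = sum(1 for i in incidents if i["detection"] == "customer_report")
        rate = by_customer / len(incidents)
        if rate > customer_rate_limit:
            alerts.append(f"Customer detection rate is {rate:.0%}")
    return alerts
```

Running this on a schedule against the post-mortem database, and opening an investigation ticket for each alert, is one way to make the trigger automatic.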

Beware Survivorship Bias

Pattern analysis can only find patterns in incidents that are tracked. If certain types of incidents (minor, quickly resolved, involving senior engineers) are systematically not documented, the patterns will be skewed. Maintain consistent post-mortem criteria to ensure the dataset is representative.

The Learning Review

While individual post-mortems focus on specific incidents, the Learning Review is a periodic examination of incident patterns and organizational improvement trends. It answers: Are we actually getting better?

Learning Review structure:

•Incident metrics overview — Incident count, MTTR, customer impact hours, error budget consumption. Trends vs. previous period.
•Category analysis — Distribution of incidents by component, root cause type, detection method. Changes from previous period.
•Action item review — Completion rate, notable completed items, aging items, blocked items.
•Pattern deep dives — If patterns emerged from aggregation, assign working groups to investigate systemic issues.
•Action items from review — The Learning Review itself produces action items addressing systemic patterns.
•Process improvement — How can we improve the post-mortem process itself? What's working, what isn't?
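
Two of the numbers a Learning Review opens with, MTTR and action item health, can be computed as in the sketch below. MTTR is taken here as mean time to resolve, and the 30-day aging cutoff and the field names are assumptions.

```python
from datetime import date, datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average of (resolved - started) across incidents, or None if empty."""
    durations = [i["resolved"] - i["started"] for i in incidents]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def action_item_summary(items, today, aging_after_days=30):
    """Completion rate plus titles of open items older than the cutoff."""
    done = sum(1 for a in items if a["done"])
    aging = [
        a["title"]
        for a in items
        if not a["done"] and (today - a["opened"]).days > aging_after_days
    ]
    rate = done / len(items) if items else None
    return {"completion_rate": rate, "aging": aging}
```

Comparing these values with the previous period's values gives the 'trends vs. previous period' view.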

Frequency and attendees:

Engineering-wide Learning Review: Quarterly, attended by all team leads + reliability/SRE + engineering leadership. 60-90 minutes.
Team-level Learning Review: Monthly, attended by team members. 30 minutes.

Sample agenda for quarterly Learning Review:

Time	Topic	Owner
10 min	Incident metrics overview	SRE Lead
15 min	Notable incidents walkthrough	Rotating teams
15 min	Pattern analysis presentation	Data/reliability team
20 min	Deep dive discussion	Facilitated
20 min	Action items and next steps	VP Engineering
10 min	Process feedback	All

The 'Learning Debt' Concept

Just as organizations accumulate technical debt, they can accumulate 'learning debt'—patterns that have been observed but not addressed, action items that remain incomplete, and lessons that haven't been institutionalized. The Learning Review is an opportunity to audit and pay down learning debt.

Near-Miss Learning

Not all failures cause incidents. Sometimes an error is made, but recovery happens before customers are affected. A misconfiguration is deployed but caught by monitoring and rolled back in seconds. A database failover is triggered but completes seamlessly.

These near-misses are goldmines of learning—they reveal vulnerabilities without incurring damage. But they're easy to overlook. Since no harm occurred, there's little urgency to investigate.

High-reliability organizations actively cultivate near-miss reporting.

Why Near-Misses Matter

•Near-misses are far more common than incidents
•They reveal vulnerabilities before they cause harm
•Lower stakes enable more candid discussion
•They provide early warning of degrading systems
•Addressing near-misses prevents incidents

Why They're Ignored

•No customer impact = no urgency
•Reporting requires effort with no obvious reward
•Fear that reporting implies incompetence
•'Nothing bad happened, so why investigate?'
•No process for near-miss collection

Building a near-miss program:

Create a low-ceremony reporting mechanism — A Slack channel, a simple form, or a dedicated email alias. The goal is to minimize friction so people actually report.
Explicitly encourage and thank reporters — Public recognition for near-miss reports reinforces the behavior. 'Thanks for catching this before it became an incident.'
Regular near-miss review — Weekly or biweekly review of near-miss reports to identify patterns and prioritize investigations.
Elevated near-misses — When a near-miss reveals a serious vulnerability, escalate to a full post-mortem. Just because nothing bad happened doesn't mean it couldn't have.
Psychological safety — Near-miss reporting only works if people don't fear punishment for admitting they almost caused problems. Blameless culture is a prerequisite.
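
To illustrate what 'low-ceremony' can mean in practice, the sketch below models a near-miss report with only two required fields and a simple escalation rule. The field names, the anonymous default, and the escalation criterion are all assumptions, not a prescribed format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class NearMissReport:
    """A deliberately small report: two required fields keep friction low."""
    summary: str
    what_stopped_it: str
    could_have_affected_customers: bool = False
    reporter: str = "anonymous"  # optional, so reporting never requires naming yourself
    reported_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


def needs_full_postmortem(report: NearMissReport) -> bool:
    """Escalate when the near-miss exposed a path to customer impact."""
    return report.could_have_affected_customers
```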

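The Heinrich Ratio: Classic industrial safety research by H. W. Heinrich proposed that for every major injury there are roughly 29 minor injuries and 300 no-injury accidents (near-misses). The exact ratio is debated and varies by domain, but the shape holds: if you're only learning from incidents, you're missing roughly 90% of your learning opportunities (300 of 330 events).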

Near-Miss Example

An engineer accidentally runs 'DROP DATABASE' in what they thought was a test environment. They realize the error and press Ctrl+C before the command completes. The database is unharmed. In many organizations, this event is never reported: the engineer is embarrassed, and nothing bad happened. In a near-miss-positive culture, it is reported immediately, and the investigation reveals that the production and test database hostnames differ by one character. The organization implements color-coded prompts to distinguish environments, reducing the chance of a repeat.
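
Both halves of this example can be sketched in a few lines: a check that flags hostnames differing by a single character, and a prompt builder that colors production red. The hostnames in the usage note are hypothetical, and the escape sequences are standard ANSI background colors.

```python
def differs_by_one_char(a: str, b: str) -> bool:
    """True if two equal-length names differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


# ANSI escape codes: red background for production, green for everything else.
RED_BG, GREEN_BG, RESET = "\033[41m", "\033[42m", "\033[0m"


def prompt_for(hostname: str, production_hosts: set[str]) -> str:
    """Build a shell-style prompt whose color signals the environment."""
    is_prod = hostname in production_hosts
    color = RED_BG if is_prod else GREEN_BG
    label = "PROD" if is_prod else "test"
    return f"{color}[{label}] {hostname}{RESET}> "
```

For instance, differs_by_one_char("db1.example.internal", "db2.example.internal") returns True, which is the kind of pair an audit of hostnames would want to surface.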

External Learning: Other Organizations' Failures

You don't have to experience every failure firsthand to learn from it. Organizations across the industry regularly publish incident reports, post-mortems, and reliability case studies. Learning from others' failures is vastly cheaper than learning from your own.

Sources of external incident learning:

•Company engineering blogs — Many technology companies (Google, Netflix, Cloudflare, GitHub, Stripe, etc.) publish detailed post-mortems of significant incidents. These are gold mines of learning.
•Hacker News / r/programming discussions — Major outages often generate community discussion analyzing causes and responses.
•SREcon / QCon presentations — Conference talks on incident management often include detailed case studies.
•Books and reports — Publications like 'Site Reliability Engineering' (Google), 'Seeking SRE' (edited by David N. Blank-Edelman), and 'Accelerate' (Forsgren, Humble, and Kim) contain case studies and aggregated reliability insights.
•CISA / NIST advisories — For security-related incidents, government agencies publish analyses of significant breaches.
•Status page histories — Public status pages of cloud providers reveal patterns in outages and recovery.

Applying external learnings:

Regular reading/review — Assign engineers to periodically review external incident reports and summarize relevant lessons.
Defensive analysis — When reading about an incident at another company, ask: 'Could this happen to us?' If yes, what would prevent or detect it?
Architecture review input — When designing new systems, review relevant external incidents to inform design decisions.
Tabletop exercises — Use external incident scenarios for tabletop exercises: 'If this happened to us, how would we respond?'
Knowledge sharing — Circulate notable external post-mortems to relevant teams. 'Cloudflare had an outage due to X—here's how our system handles this.'

The 'What If' Practice

When reviewing external incidents, make it concrete: 'Our system uses the same library they mentioned. Let's check if we're vulnerable.' 'They had a cascading failure due to circuit breaker misconfiguration—let's audit our circuit breaker settings.' Passive reading produces little learning; active application produces insight.
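
The first 'What If' question, whether we use the library they mentioned, can often be answered mechanically. The sketch below assumes dependencies are pinned as name==version lines; the package name examplelib and the affected versions in the usage note are hypothetical.

```python
def parse_pins(requirements_text: str) -> dict[str, str]:
    """Parse 'name==version' lines; comments and unpinned lines are skipped."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip().lower()] = version.strip()
    return pins


def exposed_to(pins: dict[str, str], package: str, affected_versions: set[str]) -> bool:
    """True if we pin one of the versions named in the external report."""
    return pins.get(package.lower()) in affected_versions
```

Given pins = parse_pins("examplelib==1.4.2"), the call exposed_to(pins, "examplelib", {"1.4.1", "1.4.2"}) returns True, which turns passive reading into a concrete follow-up.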

Building a Knowledge Repository

Incident learnings must be stored somewhere accessible. A knowledge repository is the organizational memory for incident experience—searchable, browsable, and integrated into engineering workflows.

Repository requirements:

•Searchability — Full-text search across all post-mortem content. When an engineer encounters a strange behavior, they should be able to search for prior occurrences.
•Discoverability — Browsable by team, service, time period, severity, and root cause category. New team members should be able to explore without knowing what to search for.
•Consistency — Standard template and structure across all documents. Consistent structure enables pattern analysis and reading efficiency.
•Linking — Connections between related incidents, action items, runbooks, and service documentation. Context is as valuable as content.
•Freshness indicators — Clear timestamps and aging indicators. A five-year-old post-mortem may be outdated; readers need to know.
•Access control — Balance openness (maximize learning) with sensitivity (some incidents involve confidential information). Default to open.
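
A rough sketch of the searchability and freshness requirements together: a case-insensitive full-text match that returns the newest documents first and flags old ones. The five-year staleness cutoff and the record fields (title, body, date) are assumptions; a real repository would more likely rely on its platform's search.

```python
from datetime import date


def search_postmortems(postmortems, query, today, stale_after_days=5 * 365):
    """Case-insensitive full-text match, newest first, with a staleness flag."""
    terms = query.lower().split()
    hits = []
    for pm in sorted(postmortems, key=lambda p: p["date"], reverse=True):
        text = (pm["title"] + " " + pm["body"]).lower()
        if all(term in text for term in terms):
            age_days = (today - pm["date"]).days
            hits.append({"title": pm["title"], "stale": age_days > stale_after_days})
    return hits
```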

Tool options:

Dedicated incident management platforms (Blameless, FireHydrant, incident.io) — Built-in post-mortem management with search, tagging, and tracking
Documentation platforms (Notion, Confluence, GitHub Wiki) — Flexible, integrated with other engineering docs, but requires discipline for consistency
Custom internal tools — Some organizations build bespoke incident management systems tailored to their processes

Integration points:

Link from on-call handoffs to relevant post-mortems
Include 'Related Incidents' section in service runbooks
Surface relevant post-mortems during incident response (based on affected service or symptoms)
Connect action items to originating post-mortems bidirectionally
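
Surfacing relevant post-mortems during a live incident can start as a simple ranking by affected service and shared symptoms. In this sketch the scoring weights (2 for the same service, 1 per shared symptom tag) are arbitrary assumptions, as are the field names.

```python
def related_postmortems(postmortems, service, symptoms, limit=3):
    """Rank past post-mortems by same-service match plus shared symptom tags."""
    wanted = set(symptoms)
    scored = []
    for pm in postmortems:
        score = 2 * (pm["service"] == service) + len(wanted & set(pm["symptoms"]))
        if score > 0:
            scored.append((score, pm["id"]))
    # Highest score first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [pm_id for _, pm_id in scored[:limit]]
```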

Curation Over Accumulation

A repository with thousands of unorganized post-mortems becomes unsearchable noise. Invest in curation: tagging, categorization, summary extraction, and periodic archiving of obsolete content. Quality of organization matters more than quantity of content.

Learning Loops: Continuous Improvement

Learning from failures is not a one-time activity but a continuous loop. Each incident feeds into the next cycle of improvement, which produces a more resilient system, which encounters new failure modes, which drive further improvement.

The Learning Loop:

                    ┌─────────────────────────────────────────┐
                    │                                         │
                    ▼                                         │
        ┌───────────────────┐                                 │
        │     INCIDENT      │                                 │
        │  Something fails  │                                 │
        └─────────┬─────────┘                                 │
                  │                                           │
                  ▼                                           │
        ┌───────────────────┐                                 │
        │  POST-MORTEM      │                                 │
        │  Analyze causes   │                                 │
        └─────────┬─────────┘                                 │
                  │                                           │
                  ▼                                           │
        ┌───────────────────┐                                 │
        │  DISSEMINATION    │                                 │
        │  Spread knowledge │                                 │
        └─────────┬─────────┘                                 │
                  │                                           │
                  ▼                                           │
        ┌───────────────────┐                                 │
        │  IMPLEMENTATION   │                                 │
        │  Execute actions  │                                 │
        └─────────┬─────────┘                                 │
                  │                                           │
                  ▼                                           │
        ┌───────────────────┐                                 │
        │  VERIFICATION     │                                 │
        │  Confirm it works │                                 │
        └─────────┬─────────┘                                 │
                  │                                           │
                  ▼                                           │
        ┌───────────────────┐                                 │
        │  IMPROVED SYSTEM  │─────────────────────────────────┘
        │  More resilient   │  (New failure modes emerge, 
        └───────────────────┘   triggering new cycle)

Accelerating the loop:

Shorten cycle time — Fast post-mortems, rapid action item completion, quick verification. The longer the cycle, the more risk accumulates.
Increase learning per cycle — Deeper analysis, better root cause identification, more effective action items. Each cycle should produce maximum improvement.
Widen the loop — Include near-misses, external learnings, and proactive analysis (chaos engineering) to feed the loop with more inputs.
Measure loop health — Track incident rates over time. A healthy loop produces declining incident frequency or severity.

Signs of a broken loop:

Same types of incidents recurring repeatedly
Action items not being completed
Post-mortems becoming ritualistic without substance
Teams disengaged from the process
No measurable improvement in reliability metrics
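
Two of these signs lend themselves to a quick automated check: recurring incident types and the absence of a downward trend. The sketch below is one possible formulation; the min_repeats default and the definition of 'declining' are assumptions.

```python
from collections import Counter


def recurring_causes(incidents, min_repeats=2):
    """Root-cause categories that appear at least min_repeats times."""
    counts = Counter(i["root_cause"] for i in incidents)
    return sorted(cause for cause, n in counts.items() if n >= min_repeats)


def is_declining(counts_per_period):
    """True if counts never rise between periods and the last is below the first."""
    if len(counts_per_period) < 2:
        return False
    pairs = zip(counts_per_period, counts_per_period[1:])
    never_rises = all(later <= earlier for earlier, later in pairs)
    return never_rises and counts_per_period[-1] < counts_per_period[0]
```

A loop where recurring_causes keeps returning the same categories quarter after quarter, while is_declining stays False, is a candidate for the kind of deeper investigation described under The Plateau Problem.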

The Plateau Problem

Organizations sometimes reach a plateau where obvious improvements are exhausted but serious resilience gaps remain. Breaking through requires investing in harder, larger projects: architectural refactoring, platform improvements, cultural change. When the easy action items are done, don't mistake stagnation for success.

Summary: Building an Organization That Learns

Learning from failures transforms incidents from pure losses into investments in future reliability. But learning doesn't happen automatically—it requires deliberate systems and practices.

Key Takeaways

•Organizational learning > individual learning. Encode lessons in systems, processes, and automation—not just in people's memories.
•Disseminate actively. Use multiple channels (summaries, readouts, digests, databases) to spread knowledge beyond incident participants.
•Recognize patterns. Individual incidents reveal local failures; patterns reveal systemic issues. Aggregate and analyze.
•Conduct Learning Reviews. Periodic examination of incident trends, action item health, and systemic patterns.
•Embrace near-misses. Near-misses reveal vulnerabilities without damage. Create systems to capture and learn from them.
•Learn from others. External incident reports are free learning. Read, analyze, and apply defensively.
•Build a knowledge repository. Searchable, browsable, integrated post-mortem storage enables future learning.
•Treat learning as a loop. Continuous improvement requires continuous cycling through analysis, action, and verification.

You now understand how to maximize organizational learning from failures. In the final page of this module, we will explore post-mortem culture—how to build and maintain the cultural foundations that make all of these practices possible.