Every organization has one: a graveyard of post-mortem action items that were agreed upon with great seriousness, assigned with conviction, and then quietly forgotten. Two months later, when a similar incident occurs, someone searches the ticketing system and discovers a half-closed action item from the previous incident—partially implemented, never verified, and ultimately ineffective.
This is the post-mortem's failure mode. Not poor analysis, not blame culture, but the gap between identifying what should change and actually changing it. The most insightful root cause analysis in the world produces nothing if its recommendations die in a backlog.
Action items are the bridge between understanding and improvement. This page is about building that bridge to last—formulating action items that are actually actionable, prioritizing ruthlessly, tracking effectively, and closing the loop to verify that improvements work.
By the end of this page, you will understand how to craft action items that lead to real improvement, establish prioritization frameworks that ensure high-impact items get implemented, build tracking systems that maintain visibility, and create verification practices that confirm improvements actually work.
Not all action items are created equal. The difference between an action item that drives improvement and one that languishes in the backlog often comes down to how it's formulated.
The SMART framework, borrowed from project management, provides a useful structure: action items should be Specific, Measurable, Achievable, Relevant, and Time-bound.
But SMART alone isn't sufficient. Effective post-mortem action items have additional characteristics:
| Weak Action Item | Problem | Improved Version |
|---|---|---|
| 'Improve monitoring' | Vague—what monitoring? What improvement? | 'Add latency P99 alert for /checkout endpoint with >500ms threshold (Owner: Alice, Due: Feb 5)' |
| 'Fix the bug' | Assumes one bug; no verification | 'Fix race condition in payment processor (#4521), add regression test, verify in staging (Owner: Bob, Due: Feb 7)' |
| 'Add documentation' | What documentation? For whom? | 'Add troubleshooting section to on-call runbook covering database failover (Owner: Carol, Due: Feb 3)' |
| 'Team should be more careful' | Not an action—blame in disguise | 'Implement deployment confirmation prompt requiring production environment name (Owner: Dan, Due: Feb 10)' |
| 'Consider adding validation' | 'Consider' is not an action | 'Implement input validation for config parameters with blocking behavior in CI pipeline (Owner: Eve, Due: Feb 12)' |
Training is often proposed when no better solution comes to mind. While training has its place, it's typically ineffective as a sole remedy. Humans forget, make errors under stress, and rotate off teams. Prefer systemic changes (automation, validation, guardrails) over 'train the humans to not make mistakes.' If training is included, pair it with systemic controls.
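The formulation rules above can even be checked mechanically before an action item enters the tracker. Below is a minimal sketch of such a linter; the function name, the weak-opener list, and the six-word heuristic are illustrative assumptions, not a standard tool.

```python
from datetime import date

# Opening verbs from the weak-item table above that often signal a vague,
# unactionable item (hypothetical heuristic, tune for your organization).
WEAK_OPENERS = ("improve", "fix", "add", "consider", "be more careful")

def lint_action_item(title, owner=None, due=None):
    """Return a list of problems with a proposed action item (empty = OK)."""
    problems = []
    if owner is None:
        problems.append("needs exactly one named owner")
    if due is None:
        problems.append("needs a due date")
    lowered = title.lower()
    # A weak opener is fine if the title goes on to name a system and a
    # threshold; very short titles starting with one are almost always vague.
    if any(lowered.startswith(w) for w in WEAK_OPENERS) and len(title.split()) < 6:
        problems.append("title is too vague -- name the system, change, and threshold")
    return problems

print(lint_action_item("Improve monitoring"))
print(lint_action_item(
    "Add latency P99 alert for /checkout endpoint with >500ms threshold",
    owner="Alice", due=date(2024, 2, 5),
))
```

The second call passes because it names the endpoint, the threshold, an owner, and a deadline, mirroring the "Improved Version" column above.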
Post-mortems often produce more action items than can be immediately addressed. Without prioritization, teams either attempt everything (and complete nothing well) or cherry-pick easy items while high-impact work is deferred indefinitely.
Effective prioritization balances multiple dimensions:
The Impact/Effort Matrix:
A simple and widely-used prioritization tool categorizes action items into four quadrants:
| | Low Effort | High Effort |
|---|---|---|
| High Impact | Quick Wins (Do First) | Major Projects (Schedule) |
| Low Impact | Fill-Ins (If Time Permits) | Reconsider (Often Not Worth It) |
Quick Wins are the obvious priorities—high-value improvements that can be implemented rapidly. Do these immediately.
Major Projects require investment but deliver significant improvement. These should be formally scheduled with appropriate resources.
Fill-Ins are low-cost but limited impact. Include when convenient but don't prioritize over higher-impact work.
Reconsider items require substantial effort for limited benefit. Unless circumstances change, these often aren't worth pursuing.
When evaluating impact, consider the 'blast radius' of the vulnerability. An action item that closes a vulnerability in a single, rarely-used code path has limited impact. An action item that adds validation to a shared library used by 50 services has enormous impact. Prefer fixes at bottlenecks and shared infrastructure.
Prioritization in practice:
A simple scoring heuristic: priority score = (Impact × Urgency) / Effort.

A realistic commitment: most teams can sustain 3-5 post-mortem action items per incident across the team's backlog. Overcommitting leads to action item sprawl and declining completion rates.
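The (Impact × Urgency) / Effort heuristic is easy to apply in code. Here is a minimal sketch; the 1-5 scales and the example items are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class ActionItem:
    title: str
    impact: int    # 1-5: blast radius of the vulnerability addressed
    urgency: int   # 1-5: likelihood and cost of near-term recurrence
    effort: int    # 1-5: estimated implementation cost

    @property
    def score(self) -> float:
        # The prioritization heuristic: (Impact × Urgency) / Effort
        return (self.impact * self.urgency) / self.effort

items = [
    ActionItem("Add p99 latency alert for /checkout", impact=4, urgency=5, effort=1),
    ActionItem("Rewrite payment retry logic", impact=5, urgency=3, effort=5),
    ActionItem("Update runbook screenshots", impact=1, urgency=1, effort=2),
]

# Rank highest-score first; commit only to the top 3-5 per incident.
for item in sorted(items, key=lambda i: i.score, reverse=True):
    print(f"{item.score:5.1f}  {item.title}")
```

The scores are coarse by design; the point is to force an explicit ranking conversation, not to pretend the numbers are precise.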
Every action item must have exactly one owner. This principle is simple but frequently violated. 'The team' is not an owner. 'SRE' is not an owner. 'Someone from platform engineering' is not an owner. These pseudo-assignments guarantee that no one is accountable and the action item drifts.
The owner is not necessarily the person who implements the change—they may coordinate others or delegate. But they are the single point of accountability for completion.
| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| 'The team will...' | Diffusion of responsibility; no individual accountable | Assign to a specific team member who coordinates |
| Assigning to someone not present | Owner may not accept, understand, or have capacity | Confirm with owner or their manager before assigning |
| Assigning to managers | Managers often delegate and lose track | Assign to the implementing engineer; manager sponsors |
| Owner without authority | Owner can't access required systems or make decisions | Ensure owner has or can obtain necessary access |
| Multiple owners | Each assumes the other is driving | Single owner coordinates; others are collaborators |
Who should own action items?
Ownership often determines completion probability. Consider:
The incident participant principle: When possible, assign action items to engineers who participated in the incident response. They have firsthand understanding of the failure mode and are often highly motivated to prevent recurrence. This also transforms the incident from a negative experience into a growth opportunity.
Ownership of action items is fundamentally different from blame. Blame says 'you caused this, so you're responsible for the problem.' Ownership says 'you're empowered and supported to drive this improvement.' Effective blameless cultures assign ownership generously while rejecting blame entirely.
Action items need a tracking system that provides visibility to stakeholders, enables status updates, and prevents items from being forgotten. This is not merely administrative overhead—it's essential infrastructure for organizational learning.
Tracking system requirements:
Implementation options:
Integrated tooling — Platforms like Blameless, incident.io, Rootly, and FireHydrant provide built-in action item tracking alongside post-mortem documentation. These offer the smoothest experience but require organizational commitment.
Issue tracker integration — Create action items as tickets in your existing issue tracker (Jira, Linear, GitHub Issues) with a dedicated 'post-mortem' Epic or label. This leverages existing workflows but may lose visibility in the broader backlog.
Dedicated spreadsheet/Notion database — Lower ceremony for smaller organizations. Risk: becomes stale without discipline.
Best practice: dual tracking — Create action items in both the post-mortem document AND your issue tracker. The post-mortem provides context; the issue tracker provides workflow integration.
Action items older than 30 days without progress are a red flag. They suggest either the item was never realistically scoped, priorities have shifted, or the owner lacks capacity. Establish a practice: any action item untouched for 30 days triggers auto-escalation to the team lead and a required status update. Either make progress, formally defer with documented reasoning, or close as 'will not do.'
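The 30-day escalation rule can be automated with a periodic scan of the tracker. The sketch below assumes a hypothetical item shape (a dict with `id`, `status`, and `last_update`), not any particular tracker's API.

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=30)

def stale_items(items, today=None):
    """Return open action items untouched for 30+ days, for escalation.

    `items` is a list of dicts with 'id', 'status', and 'last_update'
    (a date) -- an assumed shape, not a real tracker schema.
    """
    today = today or date.today()
    return [
        i for i in items
        if i["status"] not in ("closed", "risk-accepted")
        and today - i["last_update"] >= STALE_AFTER
    ]

items = [
    {"id": "PM-101", "status": "in-progress", "last_update": date(2024, 1, 2)},
    {"id": "PM-102", "status": "closed",      "last_update": date(2024, 1, 2)},
    {"id": "PM-103", "status": "not-started", "last_update": date(2024, 2, 20)},
]

# Only PM-101 is escalated: PM-102 is closed, PM-103 was updated recently.
for item in stale_items(items, today=date(2024, 3, 1)):
    print(f"ESCALATE {item['id']}: no update in 30+ days")
```

Wired into a weekly cron job or CI schedule, this turns the escalation rule from a team norm into an enforced default.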
Visibility rituals:
Tracking systems only work if people look at them. Build visibility into regular team rituals:
One of the most common failures in action item management is premature closure. An engineer marks an action item 'complete' when the code is merged, but the improvement was never verified to actually work in production. Three months later, a similar incident reveals that the 'fix' had a bug, was misconfigured, or didn't address the actual root cause.
A complete action item has passed through multiple stages:
```
Lifecycle of a Post-Mortem Action Item:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌───────────┐     ┌──────────┐
│ NOT STARTED │ ──▶ │ IN PROGRESS │ ──▶ │ IMPLEMENTED │ ──▶ │ VERIFIED  │ ──▶ │  CLOSED  │
└─────────────┘     └─────────────┘     └─────────────┘     └───────────┘     └──────────┘
       │                   │                   │                  │                 │
   Assigned            Active             Code/config        Confirmed         Documented
   with deadline       development        merged and         to work in        and linked
                       or work            deployed to        production        to post-mortem
                                          production                           for reference
```

The critical distinction: Implemented vs. Verified
An action item is implemented when the change is deployed. It is verified when evidence confirms the change addresses the root cause. These are not the same.
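The implemented-vs-verified distinction can be enforced by modeling the lifecycle as a small state machine in which CLOSED is only reachable through VERIFIED. This is a minimal sketch; the state names follow the lifecycle above, while the transition table and function are illustrative assumptions.

```python
from enum import Enum

class State(Enum):
    NOT_STARTED = "not started"
    IN_PROGRESS = "in progress"
    IMPLEMENTED = "implemented"
    VERIFIED = "verified"
    CLOSED = "closed"

# Allowed forward transitions: an item cannot be closed without first
# being verified, encoding the implemented-vs-verified distinction.
TRANSITIONS = {
    State.NOT_STARTED: {State.IN_PROGRESS},
    State.IN_PROGRESS: {State.IMPLEMENTED},
    State.IMPLEMENTED: {State.VERIFIED},
    State.VERIFIED:    {State.CLOSED},
    State.CLOSED:      set(),
}

def advance(current: State, target: State) -> State:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target

state = State.IMPLEMENTED
try:
    advance(state, State.CLOSED)          # skipping verification is rejected
except ValueError as e:
    print(e)
state = advance(state, State.VERIFIED)    # must verify first
state = advance(state, State.CLOSED)      # then close
```

A tracker that validates status changes this way makes premature closure a deliberate override rather than a default.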
Verification methods include regression tests that would fail if the fix regressed, monitoring and alerts that confirm the new behavior in production, and failure-injection exercises (game days, chaos experiments) that re-create the original trigger.
Before closing an action item, ask: 'If this specific improvement were removed tomorrow, would we know?' If the answer is no, consider whether you've actually verified the improvement. Ideally, there should be a test, alert, or monitoring that would detect regression of the improvement.
Not every action item can be completed as originally planned. Dependencies emerge, priorities shift, and resource constraints bite. The difference between effective and ineffective organizations is not that the former complete all action items—it's that they handle blocked and deferred items explicitly.
Blocking scenarios:
| Blocking Reason | Appropriate Response |
|---|---|
| Dependency on another team | Escalate to management to unblock; document dependency; explore temporary mitigations |
| Requires infrastructure not yet available | Defer with clear trigger condition (e.g., 'after infrastructure X is deployed'); track as dependency |
| Scope larger than estimated | Re-scope: break into smaller items, commit to phase 1, defer later phases |
| Owner left/unavailable | Immediately reassign; don't allow orphaned items |
| Conflicting priorities | Explicit decision by leadership: either reprioritize or defer with documented risk acceptance |
Deferral is a decision, not a default:
Deferring an action item should require explicit justification and risk acknowledgment. If the item addressed a genuine root cause, deferral means the organization is accepting ongoing risk of recurrence.
Deferral documentation should include: the reason for deferral, the specific risk being accepted, the trigger condition or review date for revisiting the item, and who approved the decision.
The 'Accept Risk' option:
Sometimes the honest conclusion is that an action item is not worth doing. The fix may be disproportionately expensive relative to the risk, or the system may be scheduled for decommissioning. In these cases, explicitly close the item as 'Risk Accepted' with documented justification. This is honest and traceable—far better than leaving items in zombie state.
Every deferred action item represents unaddressed risk—a known vulnerability that could enable future incidents. Track deferred items as technical debt and include them in debt reduction planning. Periodically review: has the risk profile changed? Is the item now feasible? If the same item is repeatedly deferred, this signals that the underlying risk is being systematically under-prioritized.
A perverse failure mode afflicts organizations that take post-mortems seriously: action item sprawl. Each incident generates 5-10 action items. With multiple incidents per month, the backlog grows faster than items are closed. Soon, teams are drowning in hundreds of open items, and the tracking system becomes a graveyard of good intentions.
Symptoms of action item sprawl:
Strategies to prevent sprawl:
1. Ruthless prioritization at creation
Don't create action items for everything that could be improved—only for items that will actually be implemented. It's better to consciously defer or decline at creation than to create false commitments.
2. Quota per incident
Limit action items to 3-5 per post-mortem. This forces prioritization during the meeting rather than after. If the team identifies more candidates, they go into a 'future considerations' section—not the action item list.
3. Team capacity planning
Post-mortem work competes with feature development. Explicitly reserve capacity (e.g., 10-15% of engineering time) for reliability work including action items. Without reserved capacity, action items perpetually lose to feature priorities.
4. Regular backlog hygiene
Monthly review of all open action items. Close items that are no longer relevant. Re-prioritize based on current understanding. Consolidate duplicates. This prevents the backlog from becoming stale.
5. Theme aggregation
If multiple incidents produce action items addressing similar themes (e.g., 'add monitoring for service X'), consolidate into a single larger project rather than tracking as individual items. Address the theme, not just the symptoms.
Aim to close at least two action items for every one created. This ensures the backlog shrinks over time. Track this ratio monthly. If it falls below 1:1, stop adding new items until the backlog is under control.
What gets measured gets managed. Organizations serious about follow-up effectiveness track metrics that provide visibility into the health of their action item process.
| Metric | Definition | Target | Red Flag |
|---|---|---|---|
| Completion Rate | % of action items eventually closed | >80% | <60% |
| On-Time Completion | % of action items closed on or before original deadline | >70% | <50% |
| Average Time to Close | Mean days from creation to closure | <30 days | >60 days |
| Backlog Size | Total open action items at any time | <20 per team | >50 per team |
| Backlog Age (P90) | 90th percentile age of open items | <45 days | >90 days |
| Items Created / Items Closed | Ratio of new items to closed items per month | <1:1 | >2:1 |
| Recurrence Despite Action Item | Incidents where a relevant action item exists but wasn't completed | 0 | Any occurrence |
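Most of these metrics can be computed directly from tracker exports. The sketch below assumes a hypothetical item shape (dicts with `created`, `closed`, and `deadline` dates); the function name and the nearest-rank percentile method are illustrative choices.

```python
import math
from datetime import date

def p90(values):
    """90th percentile (nearest-rank method) of a non-empty list."""
    ordered = sorted(values)
    return ordered[math.ceil(0.9 * len(ordered)) - 1]

def health_metrics(items, today):
    """Compute action-item health metrics from a list of item dicts
    with 'created', 'closed' (date or None), and 'deadline' keys."""
    closed = [i for i in items if i["closed"] is not None]
    open_ = [i for i in items if i["closed"] is None]
    on_time = [i for i in closed if i["closed"] <= i["deadline"]]
    open_ages = [(today - i["created"]).days for i in open_]
    return {
        "completion_rate": len(closed) / len(items),
        "on_time_rate": len(on_time) / len(items),
        "backlog_size": len(open_),
        "backlog_age_p90": p90(open_ages) if open_ages else 0,
    }

items = [
    {"created": date(2024, 1, 1),  "closed": date(2024, 1, 20), "deadline": date(2024, 1, 25)},
    {"created": date(2024, 1, 5),  "closed": date(2024, 2, 15), "deadline": date(2024, 1, 30)},
    {"created": date(2024, 1, 10), "closed": None,              "deadline": date(2024, 2, 1)},
    {"created": date(2024, 2, 1),  "closed": None,              "deadline": date(2024, 3, 1)},
]

print(health_metrics(items, today=date(2024, 3, 1)))
```

Run monthly against the tracker export, this gives leadership the summary described below without manual tallying.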
The ultimate measure: incident recurrence.
All the process metrics matter only insofar as they impact the outcome that counts: preventing similar incidents. Track recurrence patterns:
A pattern of recurrence despite closed action items indicates that either root cause analysis is shallow or action items are insufficiently scoped. This should trigger a process review.
Leadership should receive monthly or quarterly summaries: completion rates, aging item counts, and notable recurrence patterns. This creates organizational accountability for follow-through and enables resource conversations when teams are under-capacity for reliability work.
The gap between analysis and improvement is where many post-mortem programs fail. Effective action items and disciplined follow-up are the bridge that makes post-mortems valuable.
You now understand how to bridge the gap between post-mortem analysis and real-world improvement. In the next page, we will explore learning from failures—how to extract maximum organizational learning from incidents and disseminate knowledge beyond the immediate team.