Every organization has one: a graveyard of post-mortem action items that were agreed upon with great seriousness, assigned with conviction, and then quietly forgotten. Two months later, when a similar incident occurs, someone searches the ticketing system and discovers a half-closed action item from the previous incident—partially implemented, never verified, and ultimately ineffective.
This is the post-mortem's failure mode. Not poor analysis, not blame culture, but the gap between identifying what should change and actually changing it. The most insightful root cause analysis in the world produces nothing if its recommendations die in a backlog.
Action items are the bridge between understanding and improvement. This page is about building that bridge to last—formulating action items that are actually actionable, prioritizing ruthlessly, tracking effectively, and closing the loop to verify that improvements work.
By the end of this page, you will understand how to craft action items that lead to real improvement, establish prioritization frameworks that ensure high-impact items get implemented, build tracking systems that maintain visibility, and create verification practices that confirm improvements actually work.
Not all action items are created equal. The difference between an action item that drives improvement and one that languishes in the backlog often comes down to how it's formulated.
The SMART framework, borrowed from project management, provides a useful structure: action items should be Specific, Measurable, Achievable, Relevant, and Time-bound.
But SMART alone isn't sufficient. Effective post-mortem action items have additional characteristics:
| Weak Action Item | Problem | Improved Version |
|---|---|---|
| 'Improve monitoring' | Vague—what monitoring? What improvement? | 'Add latency P99 alert for /checkout endpoint with >500ms threshold (Owner: Alice, Due: Feb 5)' |
| 'Fix the bug' | Assumes one bug; no verification | 'Fix race condition in payment processor (#4521), add regression test, verify in staging (Owner: Bob, Due: Feb 7)' |
| 'Add documentation' | What documentation? For whom? | 'Add troubleshooting section to on-call runbook covering database failover (Owner: Carol, Due: Feb 3)' |
| 'Team should be more careful' | Not an action—blame in disguise | 'Implement deployment confirmation prompt requiring production environment name (Owner: Dan, Due: Feb 10)' |
| 'Consider adding validation' | 'Consider' is not an action | 'Implement input validation for config parameters with blocking behavior in CI pipeline (Owner: Eve, Due: Feb 12)' |
Training is often proposed when no better solution comes to mind. While training has its place, it's typically ineffective as a sole remedy. Humans forget, make errors under stress, and rotate off teams. Prefer systemic changes (automation, validation, guardrails) over 'train the humans to not make mistakes.' If training is included, pair it with systemic controls.
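The formulation rules above can even be checked mechanically before an action item enters the tracker. Below is a minimal sketch of such a linter; the function name, the weak-opener list, and the six-word heuristic are illustrative assumptions, not a standard tool.

```python
from datetime import date

# Opening verbs from the weak-item table above that often signal a vague,
# unactionable item (hypothetical heuristic, tune for your organization).
WEAK_OPENERS = ("improve", "fix", "add", "consider", "be more careful")

def lint_action_item(title, owner=None, due=None):
    """Return a list of problems with a proposed action item (empty = OK)."""
    problems = []
    if owner is None:
        problems.append("needs exactly one named owner")
    if due is None:
        problems.append("needs a due date")
    lowered = title.lower()
    # A weak opener is fine if the title goes on to name a system and a
    # threshold; very short titles starting with one are almost always vague.
    if any(lowered.startswith(w) for w in WEAK_OPENERS) and len(title.split()) < 6:
        problems.append("title is too vague -- name the system, change, and threshold")
    return problems

print(lint_action_item("Improve monitoring"))
print(lint_action_item(
    "Add latency P99 alert for /checkout endpoint with >500ms threshold",
    owner="Alice", due=date(2024, 2, 5),
))
```

The second call passes because it names the endpoint, the threshold, an owner, and a deadline, mirroring the "Improved Version" column above.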
Post-mortems often produce more action items than can be immediately addressed. Without prioritization, teams either attempt everything (and complete nothing well) or cherry-pick easy items while high-impact work is deferred indefinitely.
Effective prioritization balances multiple dimensions:
The Impact/Effort Matrix:
A simple and widely-used prioritization tool categorizes action items into four quadrants:
| | Low Effort | High Effort |
|---|---|---|
| High Impact | Quick Wins (Do First) | Major Projects (Schedule) |
| Low Impact | Fill-Ins (If Time Permits) | Reconsider (Often Not Worth It) |
Quick Wins are the obvious priorities—high-value improvements that can be implemented rapidly. Do these immediately.
Major Projects require investment but deliver significant improvement. These should be formally scheduled with appropriate resources.
Fill-Ins are low-cost but limited impact. Include when convenient but don't prioritize over higher-impact work.
Reconsider items require substantial effort for limited benefit. Unless circumstances change, these often aren't worth pursuing.
When evaluating impact, consider the 'blast radius' of the vulnerability. An action item that closes a vulnerability in a single, rarely-used code path has limited impact. An action item that adds validation to a shared library used by 50 services has enormous impact. Prefer fixes at bottlenecks and shared infrastructure.
Prioritization in practice:
A simple scoring heuristic: priority score = (Impact × Urgency) / Effort.

A realistic commitment: most teams can sustain 3-5 post-mortem action items per incident across the team's backlog. Overcommitting leads to action item sprawl and declining completion rates.
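The (Impact × Urgency) / Effort heuristic is easy to apply in code. Here is a minimal sketch; the 1-5 scales and the example items are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class ActionItem:
    title: str
    impact: int    # 1-5: blast radius of the vulnerability addressed
    urgency: int   # 1-5: likelihood and cost of near-term recurrence
    effort: int    # 1-5: estimated implementation cost

    @property
    def score(self) -> float:
        # The prioritization heuristic: (Impact × Urgency) / Effort
        return (self.impact * self.urgency) / self.effort

items = [
    ActionItem("Add p99 latency alert for /checkout", impact=4, urgency=5, effort=1),
    ActionItem("Rewrite payment retry logic", impact=5, urgency=3, effort=5),
    ActionItem("Update runbook screenshots", impact=1, urgency=1, effort=2),
]

# Rank highest-score first; commit only to the top 3-5 per incident.
for item in sorted(items, key=lambda i: i.score, reverse=True):
    print(f"{item.score:5.1f}  {item.title}")
```

The scores are coarse by design; the point is to force an explicit ranking conversation, not to pretend the numbers are precise.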
Every action item must have exactly one owner. This principle is simple but frequently violated. 'The team' is not an owner. 'SRE' is not an owner. 'Someone from platform engineering' is not an owner. These pseudo-assignments guarantee that no one is accountable and the action item drifts.
The owner is not necessarily the person who implements the change—they may coordinate others or delegate. But they are the single point of accountability for completion.
| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| 'The team will...' | Diffusion of responsibility; no individual accountable | Assign to a specific team member who coordinates |
| Assigning to someone not present | Owner may not accept, understand, or have capacity | Confirm with owner or their manager before assigning |
| Assigning to managers | Managers often delegate and lose track | Assign to the implementing engineer; manager sponsors |
| Owner without authority | Owner can't access required systems or make decisions | Ensure owner has or can obtain necessary access |
| Multiple owners | Each assumes the other is driving | Single owner coordinates; others are collaborators |
Who should own action items?
Ownership often determines completion probability. Consider:
The incident participant principle: When possible, assign action items to engineers who participated in the incident response. They have firsthand understanding of the failure mode and are often highly motivated to prevent recurrence. This also transforms the incident from a negative experience into a growth opportunity.
Ownership of action items is fundamentally different from blame. Blame says 'you caused this, so you're responsible for the problem.' Ownership says 'you're empowered and supported to drive this improvement.' Effective blameless cultures assign ownership generously while rejecting blame entirely.
Action items need a tracking system that provides visibility to stakeholders, enables status updates, and prevents items from being forgotten. This is not merely administrative overhead—it's essential infrastructure for organizational learning.
Tracking system requirements:
Implementation options:
Integrated tooling — Platforms like Blameless, incident.io, Rootly, and FireHydrant provide built-in action item tracking alongside post-mortem documentation. These offer the smoothest experience but require organizational commitment.
Issue tracker integration — Create action items as tickets in your existing issue tracker (Jira, Linear, GitHub Issues) with a dedicated 'post-mortem' Epic or label. This leverages existing workflows but may lose visibility in the broader backlog.
Dedicated spreadsheet/Notion database — Lower ceremony for smaller organizations. Risk: becomes stale without discipline.
Best practice: dual tracking — Create action items in both the post-mortem document AND your issue tracker. The post-mortem provides context; the issue tracker provides workflow integration.
Action items older than 30 days without progress are a red flag. They suggest either the item was never realistically scoped, priorities have shifted, or the owner lacks capacity. Establish a practice: any action item untouched for 30 days triggers auto-escalation to the team lead and a required status update. Either make progress, formally defer with documented reasoning, or close as 'will not do.'
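The 30-day escalation rule can be automated with a periodic scan of the tracker. The sketch below assumes a hypothetical item shape (a dict with `id`, `status`, and `last_update`), not any particular tracker's API.

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=30)

def stale_items(items, today=None):
    """Return open action items untouched for 30+ days, for escalation.

    `items` is a list of dicts with 'id', 'status', and 'last_update'
    (a date) -- an assumed shape, not a real tracker schema.
    """
    today = today or date.today()
    return [
        i for i in items
        if i["status"] not in ("closed", "risk-accepted")
        and today - i["last_update"] >= STALE_AFTER
    ]

items = [
    {"id": "PM-101", "status": "in-progress", "last_update": date(2024, 1, 2)},
    {"id": "PM-102", "status": "closed",      "last_update": date(2024, 1, 2)},
    {"id": "PM-103", "status": "not-started", "last_update": date(2024, 2, 20)},
]

# Only PM-101 is escalated: PM-102 is closed, PM-103 was updated recently.
for item in stale_items(items, today=date(2024, 3, 1)):
    print(f"ESCALATE {item['id']}: no update in 30+ days")
```

Wired into a weekly cron job or CI schedule, this turns the escalation rule from a team norm into an enforced default.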
Visibility rituals:
Tracking systems only work if people look at them. Build visibility into regular team rituals:
One of the most common failures in action item management is premature closure. An engineer marks an action item 'complete' when the code is merged, but the improvement was never verified to actually work in production. Three months later, a similar incident reveals that the 'fix' had a bug, was misconfigured, or didn't address the actual root cause.
A complete action item has passed through multiple stages:
```
Lifecycle of a Post-Mortem Action Item:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌───────────┐     ┌──────────┐
│ NOT STARTED │ ──▶ │ IN PROGRESS │ ──▶ │ IMPLEMENTED │ ──▶ │ VERIFIED  │ ──▶ │  CLOSED  │
└─────────────┘     └─────────────┘     └─────────────┘     └───────────┘     └──────────┘
       │                   │                   │                  │                 │
   Assigned            Active             Code/config        Confirmed         Documented
   with deadline       development        merged and         to work in        and linked
                       or work            deployed to        production        to post-mortem
                                          production                           for reference
```

The critical distinction: Implemented vs. Verified
An action item is implemented when the change is deployed. It is verified when evidence confirms the change addresses the root cause. These are not the same.
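The implemented-vs-verified distinction can be enforced by modeling the lifecycle as a small state machine in which CLOSED is only reachable through VERIFIED. This is a minimal sketch; the state names follow the lifecycle above, while the transition table and function are illustrative assumptions.

```python
from enum import Enum

class State(Enum):
    NOT_STARTED = "not started"
    IN_PROGRESS = "in progress"
    IMPLEMENTED = "implemented"
    VERIFIED = "verified"
    CLOSED = "closed"

# Allowed forward transitions: an item cannot be closed without first
# being verified, encoding the implemented-vs-verified distinction.
TRANSITIONS = {
    State.NOT_STARTED: {State.IN_PROGRESS},
    State.IN_PROGRESS: {State.IMPLEMENTED},
    State.IMPLEMENTED: {State.VERIFIED},
    State.VERIFIED:    {State.CLOSED},
    State.CLOSED:      set(),
}

def advance(current: State, target: State) -> State:
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target

state = State.IMPLEMENTED
try:
    advance(state, State.CLOSED)          # skipping verification is rejected
except ValueError as e:
    print(e)
state = advance(state, State.VERIFIED)    # must verify first
state = advance(state, State.CLOSED)      # then close
```

A tracker that validates status changes this way makes premature closure a deliberate override rather than a default.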
Verification methods include regression tests that would fail if the fix regressed, monitoring and alerts that confirm the new behavior in production, and failure-injection exercises (game days, chaos experiments) that re-create the original trigger.
Before closing an action item, ask: 'If this specific improvement were removed tomorrow, would we know?' If the answer is no, consider whether you've actually verified the improvement. Ideally, there should be a test, alert, or monitoring that would detect regression of the improvement.
Not every action item can be completed as originally planned. Dependencies emerge, priorities shift, and resource constraints bite. The difference between effective and ineffective organizations is not that the former complete all action items—it's that they handle blocked and deferred items explicitly.
Blocking scenarios:
| Blocking Reason | Appropriate Response |
|---|---|
| Dependency on another team | Escalate to management to unblock; document dependency; explore temporary mitigations |
| Requires infrastructure not yet available | Defer with clear trigger condition (e.g., 'after infrastructure X is deployed'); track as dependency |
| Scope larger than estimated | Re-scope: break into smaller items, commit to phase 1, defer later phases |
| Owner left/unavailable | Immediately reassign; don't allow orphaned items |
| Conflicting priorities | Explicit decision by leadership: either reprioritize or defer with documented risk acceptance |
Deferral is a decision, not a default:
Deferring an action item should require explicit justification and risk acknowledgment. If the item addressed a genuine root cause, deferral means the organization is accepting ongoing risk of recurrence.
Deferral documentation should include: the reason for deferral, the specific risk being accepted, the trigger condition or review date for revisiting the item, and who approved the decision.
The 'Accept Risk' option:
Sometimes the honest conclusion is that an action item is not worth doing. The fix may be disproportionately expensive relative to the risk, or the system may be scheduled for decommissioning. In these cases, explicitly close the item as 'Risk Accepted' with documented justification. This is honest and traceable—far better than leaving items in zombie state.
Every deferred action item represents unaddressed risk—a known vulnerability that could enable future incidents. Track deferred items as technical debt and include them in debt reduction planning. Periodically review: has the risk profile changed? Is the item now feasible? If the same item is repeatedly deferred, this signals that the underlying risk is being systematically under-prioritized.
A perverse failure mode afflicts organizations that take post-mortems seriously: action item sprawl. Each incident generates 5-10 action items. With multiple incidents per month, the backlog grows faster than items are closed. Soon, teams are drowning in hundreds of open items, and the tracking system becomes a graveyard of good intentions.
Symptoms of action item sprawl:
Strategies to prevent sprawl:
1. Ruthless prioritization at creation
Don't create action items for everything that could be improved—only for items that will actually be implemented. It's better to consciously defer or decline at creation than to create false commitments.
2. Quota per incident
Limit action items to 3-5 per post-mortem. This forces prioritization during the meeting rather than after. If the team identifies more candidates, they go into a 'future considerations' section—not the action item list.
3. Team capacity planning
Post-mortem work competes with feature development. Explicitly reserve capacity (e.g., 10-15% of engineering time) for reliability work including action items. Without reserved capacity, action items perpetually lose to feature priorities.
4. Regular backlog hygiene
Monthly review of all open action items. Close items that are no longer relevant. Re-prioritize based on current understanding. Consolidate duplicates. This prevents the backlog from becoming stale.
5. Theme aggregation
If multiple incidents produce action items addressing similar themes (e.g., 'add monitoring for service X'), consolidate into a single larger project rather than tracking as individual items. Address the theme, not just the symptoms.
Aim to close at least two action items for every one created. This ensures the backlog shrinks over time. Track this ratio monthly. If it falls below 1:1, stop adding new items until the backlog is under control.
What gets measured gets managed. Organizations serious about follow-up effectiveness track metrics that provide visibility into the health of their action item process.
| Metric | Definition | Target | Red Flag |
|---|---|---|---|
| Completion Rate | % of action items eventually closed | >80% | <60% |
| On-Time Completion | % of action items closed on or before original deadline | >70% | <50% |
| Average Time to Close | Mean days from creation to closure | <30 days | >60 days |
| Backlog Size | Total open action items at any time | <20 per team | >50 per team |
| Backlog Age (P90) | 90th percentile age of open items | <45 days | >90 days |
| Items Created / Items Closed | Ratio of new items to closed items per month | <1:1 | >2:1 |
| Recurrence Despite Action Item | Incidents where a relevant action item exists but wasn't completed | 0 | Any occurrence |
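Most of these metrics can be computed directly from tracker exports. The sketch below assumes a hypothetical item shape (dicts with `created`, `closed`, and `deadline` dates); the function name and the nearest-rank percentile method are illustrative choices.

```python
import math
from datetime import date

def p90(values):
    """90th percentile (nearest-rank method) of a non-empty list."""
    ordered = sorted(values)
    return ordered[math.ceil(0.9 * len(ordered)) - 1]

def health_metrics(items, today):
    """Compute action-item health metrics from a list of item dicts
    with 'created', 'closed' (date or None), and 'deadline' keys."""
    closed = [i for i in items if i["closed"] is not None]
    open_ = [i for i in items if i["closed"] is None]
    on_time = [i for i in closed if i["closed"] <= i["deadline"]]
    open_ages = [(today - i["created"]).days for i in open_]
    return {
        "completion_rate": len(closed) / len(items),
        "on_time_rate": len(on_time) / len(items),
        "backlog_size": len(open_),
        "backlog_age_p90": p90(open_ages) if open_ages else 0,
    }

items = [
    {"created": date(2024, 1, 1),  "closed": date(2024, 1, 20), "deadline": date(2024, 1, 25)},
    {"created": date(2024, 1, 5),  "closed": date(2024, 2, 15), "deadline": date(2024, 1, 30)},
    {"created": date(2024, 1, 10), "closed": None,              "deadline": date(2024, 2, 1)},
    {"created": date(2024, 2, 1),  "closed": None,              "deadline": date(2024, 3, 1)},
]

print(health_metrics(items, today=date(2024, 3, 1)))
```

Run monthly against the tracker export, this gives leadership the summary described below without manual tallying.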
The ultimate measure: incident recurrence.
All the process metrics matter only insofar as they impact the outcome that counts: preventing similar incidents. Track recurrence patterns:
A pattern of recurrence despite closed action items indicates that either root cause analysis is shallow or action items are insufficiently scoped. This should trigger a process review.
Leadership should receive monthly or quarterly summaries: completion rates, aging item counts, and notable recurrence patterns. This creates organizational accountability for follow-through and enables resource conversations when teams are under-capacity for reliability work.
The gap between analysis and improvement is where many post-mortem programs fail. Effective action items and disciplined follow-up are the bridge that makes post-mortems valuable.
You now understand how to bridge the gap between post-mortem analysis and real-world improvement. In the next page, we will explore learning from failures—how to extract maximum organizational learning from incidents and disseminate knowledge beyond the immediate team.