Detection is only valuable when it leads to effective response. When your monitoring systems alert on a potential compromise, what happens next? The difference between a minor incident and a catastrophic breach often comes down to the speed and effectiveness of your incident response.
Consider two organizations facing identical attacks:
Organization A: No incident response plan. Alert fires at 2 AM. On-call engineer isn't sure if it's real or how to escalate. Incident sits in queue until morning standup. By then, the attacker has moved laterally across the network, exfiltrated customer data, and deployed ransomware. Recovery takes weeks, costs millions, and makes headlines.
Organization B: Established incident response process. Alert fires at 2 AM. On-call engineer follows runbook: isolates affected system within 15 minutes, escalates to security team, initiates forensic preservation. Attack is contained to single service. Full recovery within hours. Customers never know.
The difference isn't luck or resources—it's preparation. This page covers building and executing incident response capabilities that turn security alerts into effective threat neutralization.
By the end of this page, you will understand the incident response lifecycle, how to build response playbooks for common scenarios, strategies for containment and eradication in distributed systems, forensic investigation principles, and communication frameworks for managing stakeholders during incidents. You'll gain the knowledge to respond to security incidents effectively and minimize their impact.
Security incident response follows a well-established lifecycle, typically modeled on NIST's Computer Security Incident Handling Guide (SP 800-61). The lifecycle has distinct phases, each with specific objectives and activities.
The PICERL Framework
A common mnemonic for incident response phases:
| Phase | Primary Objective | Key Activities | Success Criteria |
|---|---|---|---|
| Preparation | Enable effective response before incidents | Team training, playbooks, tools, communication plans | Response capability tested and documented |
| Identification | Confirm incident and assess scope | Alert triage, initial analysis, scope determination | Incident confirmed, initial scope understood |
| Containment | Stop the bleeding, prevent spread | Isolate systems, block IOCs, preserve evidence | Threat contained, no further spread |
| Eradication | Remove threat from environment | Malware removal, account disabling, vulnerability patching | All traces of threat eliminated |
| Recovery | Restore normal operations | System restoration, monitoring increase, user communication | Business operations restored safely |
| Lessons Learned | Improve for future incidents | Post-mortem, detection gaps, process improvements | Improvements documented and implemented |
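If your incident tooling tracks phase, the lifecycle can be encoded directly so timelines and metrics fall out of phase transitions. Below is a minimal Python sketch of the table above; the names and record shape are illustrative, not tied to any particular platform:

```python
from datetime import datetime, timezone
from enum import Enum


class IncidentPhase(Enum):
    """PICERL phases in the order they are typically entered."""
    PREPARATION = "preparation"
    IDENTIFICATION = "identification"
    CONTAINMENT = "containment"
    ERADICATION = "eradication"
    RECOVERY = "recovery"
    LESSONS_LEARNED = "lessons_learned"


# Primary objective per phase, taken from the table above.
PHASE_OBJECTIVES = {
    IncidentPhase.PREPARATION: "Enable effective response before incidents",
    IncidentPhase.IDENTIFICATION: "Confirm incident and assess scope",
    IncidentPhase.CONTAINMENT: "Stop the bleeding, prevent spread",
    IncidentPhase.ERADICATION: "Remove threat from environment",
    IncidentPhase.RECOVERY: "Restore normal operations",
    IncidentPhase.LESSONS_LEARNED: "Improve for future incidents",
}


def record_phase_change(incident_id: str, phase: IncidentPhase) -> dict:
    """Record a phase transition; useful for building the incident timeline."""
    return {
        "incident": incident_id,
        "phase": phase.value,
        "objective": PHASE_OBJECTIVES[phase],
        "entered_at": datetime.now(timezone.utc).isoformat(),
    }
```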
Parallelization in Practice
While the lifecycle is presented linearly, real incidents require parallel work: effective incident response teams run multiple workstreams simultaneously, coordinated through regular sync meetings (every 1-4 hours during active incidents).
A critical consideration throughout incident response is evidence preservation. The instinct to 'fix things quickly' can destroy forensic evidence needed to understand the attack, attribute responsibility, or support legal action. Before taking containment actions that modify systems, ensure evidence is captured: memory dumps, disk images, log preservation.
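The ordering is worth encoding in automation: capture volatile evidence first, then contain. The sketch below is a minimal illustration of that sequence; `forensics` and `network` are hypothetical clients standing in for whatever EDR, snapshot, and network APIs your environment actually exposes.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)


async def contain_with_evidence(host_id: str, forensics, network) -> dict:
    """Capture volatile evidence before isolation modifies system state.

    `forensics` and `network` are placeholders, not real library APIs;
    only the ordering of the steps is the point of this sketch.
    """
    evidence = {"host": host_id,
                "captured_at": datetime.now(timezone.utc).isoformat()}

    # 1. Volatile data first: memory disappears on reboot or process kill.
    evidence["memory_dump"] = await forensics.capture_memory(host_id)

    # 2. Disk image (or at minimum a snapshot) before any cleanup runs.
    evidence["disk_image"] = await forensics.snapshot_disk(host_id)

    # 3. Copy logs off the host so the attacker cannot alter or delete them.
    evidence["logs"] = await forensics.export_logs(host_id,
                                                   destination="evidence-store")

    # Only now take the containment action that changes system state.
    await network.isolate_host(host_id)
    logger.info("Evidence preserved and host isolated: %s", host_id)
    return evidence
```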
The preparation phase determines how effectively you respond to incidents. Organizations that invest in preparation handle incidents faster, with less damage, and at lower cost.
Building the Incident Response Team
Incident response requires coordinated effort across multiple roles:
| Role | Responsibilities | Skills Required |
|---|---|---|
| Incident Commander (IC) | Overall incident leadership, decision making, coordination | Leadership, communication, technical breadth |
| Security Analyst | Alert triage, threat analysis, IOC identification | Malware analysis, log analysis, forensics |
| Infrastructure Engineer | System isolation, containment actions, recovery | Systems administration, networking, automation |
| Communications Lead | Stakeholder updates, customer communication, PR coordination | Writing, stakeholder management, crisis communication |
| Legal/Compliance | Regulatory requirements, evidence preservation, disclosure obligations | Regulatory knowledge, legal frameworks |
| Executive Sponsor | Resource allocation, external communication, strategic decisions | Business context, authority, judgment |
Essential Preparation Activities
Contact Lists and Escalation Procedures
Tool Readiness
Documentation
Playbooks
```yaml
# Incident Severity Classification Matrix
# Used to determine response urgency and resources

severity_levels:
  critical:
    name: "SEV-1 Critical"
    description: "Active breach with confirmed data exfiltration or system compromise"
    response_time_minutes: 15
    stakeholder_notification: immediate
    executive_notification: required
    external_parties: "legal, PR, potentially law enforcement"
    page_on_call: true
    video_bridge: open_immediately
    indicators:
      - Confirmed ransomware deployment
      - Active unauthorized data exfiltration
      - Production systems compromised with customer data access
      - Critical infrastructure (auth, core services) compromised

  high:
    name: "SEV-2 High"
    description: "Confirmed security incident with potential for significant impact"
    response_time_minutes: 60
    stakeholder_notification: within_1_hour
    executive_notification: situational
    external_parties: legal_standby
    page_on_call: true
    video_bridge: open_immediately
    indicators:
      - Successful phishing with credential compromise
      - Malware detected on internal systems
      - Unauthorized access to sensitive systems
      - Privilege escalation detected

  medium:
    name: "SEV-3 Medium"
    description: "Potential security incident requiring investigation"
    response_time_minutes: 240
    stakeholder_notification: within_4_hours
    executive_notification: not_required
    external_parties: none
    page_on_call: business_hours
    video_bridge: as_needed
    indicators:
      - Failed attack attempts (blocked exploit, failed logins)
      - Policy violation detected
      - Suspicious but unconfirmed activity
      - Vulnerability exploitation attempt

  low:
    name: "SEV-4 Low"
    description: "Security event requiring logging and review"
    response_time_minutes: 1440  # 24 hours
    stakeholder_notification: next_business_day
    executive_notification: not_required
    external_parties: none
    page_on_call: false
    video_bridge: not_required
    indicators:
      - Routine security scan detection
      - Low-risk policy violations
      - Informational security events

escalation_criteria:
  - name: "Data breach confirmed"
    current_severity_max: high
    new_severity: critical
  - name: "Lateral movement detected"
    current_severity_max: medium
    new_severity: high
  - name: "Multiple systems affected"
    current_severity_max: medium
    new_severity: high
  - name: "Attack ongoing after containment attempt"
    current_severity_max: high
    new_severity: critical
```

The identification phase bridges detection and response. The goals are to confirm whether an incident has occurred, classify its severity, and understand its initial scope.
Alert Triage Process
Not every security alert is an incident. Triage determines which alerts warrant incident response:
Initial Assessment (5-15 minutes)
Contextual Investigation (15-60 minutes)
Severity Classification
Initial Investigation Questions
During identification, answer these questions as quickly as possible:
What happened?
Who is involved?
Where is the impact?
Is it ongoing?
```typescript
interface TriageResult {
  isIncident: boolean;
  confidence: 'low' | 'medium' | 'high';
  severity?: 'critical' | 'high' | 'medium' | 'low';
  summary: string;
  affectedSystems: string[];
  affectedUsers: string[];
  indicators: IndicatorOfCompromise[];
  recommendedActions: string[];
  escalationRequired: boolean;
}

interface TriageChecklist {
  alertId: string;
  triageStartTime: Date;

  // Step 1: Alert Validation
  alertValidation: {
    dataSourceVerified: boolean;
    rawDataReviewed: boolean;
    falsePositiveIndicators: string[];
    truePositiveIndicators: string[];
    validationConclusion: 'true_positive' | 'false_positive' | 'uncertain';
  };

  // Step 2: Scope Determination
  scopeAssessment: {
    primaryAffectedSystem: string;
    additionalAffectedSystems: string[];
    affectedUserAccounts: string[];
    dataTypesAtRisk: string[];
    businessImpact: string;
    spreadPotential: 'contained' | 'spreading' | 'unknown';
  };

  // Step 3: Threat Assessment
  threatAssessment: {
    attackType: string;
    attackVector: string;
    attackerSkillLevel: 'opportunistic' | 'targeted' | 'advanced';
    isAttackOngoing: boolean;
    killChainPhase: string;
  };

  // Step 4: IOC Extraction
  indicators: {
    maliciousIPs: string[];
    maliciousDomains: string[];
    maliciousHashes: string[];
    suspiciousAccounts: string[];
    abnormalProcesses: string[];
  };
}

async function performTriage(alertId: string): Promise<TriageResult> {
  const checklist: TriageChecklist = initializeChecklist(alertId);
  const alert = await getAlertDetails(alertId);

  // Step 1: Validate the alert
  const rawEvents = await getRawEvents(alert.timeRange, alert.sourceSystem);
  checklist.alertValidation = {
    dataSourceVerified: await verifyDataSource(alert.sourceSystem),
    rawDataReviewed: true,
    falsePositiveIndicators: identifyFPIndicators(rawEvents),
    truePositiveIndicators: identifyTPIndicators(rawEvents),
    validationConclusion: 'uncertain' // Will be updated
  };

  // Determine if this is a true incident
  if (checklist.alertValidation.truePositiveIndicators.length > 0) {
    checklist.alertValidation.validationConclusion = 'true_positive';
  } else if (checklist.alertValidation.falsePositiveIndicators.length >
             checklist.alertValidation.truePositiveIndicators.length) {
    checklist.alertValidation.validationConclusion = 'false_positive';
    return {
      isIncident: false,
      confidence: 'high',
      summary: 'Alert validated as false positive',
      affectedSystems: [],
      affectedUsers: [],
      indicators: [],
      recommendedActions: ['Close alert as false positive', 'Tune detection rule'],
      escalationRequired: false
    };
  }

  // Step 2: Determine scope
  checklist.scopeAssessment = await assessScope(alert, rawEvents);

  // Step 3: Assess threat
  checklist.threatAssessment = await assessThreat(rawEvents, checklist.scopeAssessment);

  // Step 4: Extract IOCs
  checklist.indicators = await extractIndicators(rawEvents);

  // Classify severity
  const severity = classifySeverity(checklist);

  return {
    isIncident: true,
    confidence: checklist.alertValidation.truePositiveIndicators.length > 2 ? 'high' : 'medium',
    severity,
    summary: generateIncidentSummary(checklist),
    affectedSystems: [
      checklist.scopeAssessment.primaryAffectedSystem,
      ...checklist.scopeAssessment.additionalAffectedSystems
    ],
    affectedUsers: checklist.scopeAssessment.affectedUserAccounts,
    indicators: formatIndicators(checklist.indicators),
    recommendedActions: determineRecommendedActions(checklist, severity),
    escalationRequired: severity === 'critical' || severity === 'high'
  };
}
```

Don't let perfect be the enemy of good. If you're unsure whether an incident is real after 30-60 minutes of analysis, treat it as real and begin containment.
The cost of containing a false positive is much lower than the cost of allowing a real attack to spread. You can always stand down if investigation proves it benign.
Containment stops the bleeding. The goal is to limit the damage and prevent the attack from spreading while preserving evidence and avoiding unnecessary business disruption.
Containment Principles
Containment Techniques by System Type
Network-Level Containment:
Host-Level Containment:
Cloud/Container Containment (see the pod-quarantine sketch after this list):
Identity Containment:
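For containerized workloads, one approach is to quarantine a suspect pod rather than delete it, which preserves it for forensics while cutting its traffic. The sketch below shells out to `kubectl` and assumes a pre-created deny-all NetworkPolicy in each namespace that selects pods labeled `quarantine=true`; the label names are illustrative, not a standard.

```python
import subprocess


def quarantine_pod(pod: str, namespace: str) -> None:
    """Quarantine a suspect pod without deleting it (keeps it available for forensics).

    Assumes a deny-all NetworkPolicy already exists in the namespace that
    matches pods labeled quarantine=true. Label names here are illustrative.
    """
    # Apply the quarantine label so the deny-all NetworkPolicy matches the pod.
    subprocess.run(
        ["kubectl", "label", "pod", pod, "-n", namespace,
         "quarantine=true", "--overwrite"],
        check=True,
    )
    # Remove the label that Service selectors use (here assumed to be "app"),
    # so no traffic is routed to the pod while it is under investigation.
    subprocess.run(
        ["kubectl", "label", "pod", pod, "-n", namespace, "app-"],
        check=True,
    )
```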
```python
from datetime import datetime
from enum import Enum
from typing import List, Optional
import logging

logger = logging.getLogger(__name__)


class ContainmentType(Enum):
    NETWORK_ISOLATION = "network_isolation"
    ACCOUNT_DISABLE = "account_disable"
    SESSION_REVOKE = "session_revoke"
    PROCESS_KILL = "process_kill"
    CREDENTIAL_ROTATE = "credential_rotate"
    FIREWALL_BLOCK = "firewall_block"


class ContainmentAction:
    def __init__(self, action_type: ContainmentType, target: str,
                 incident_id: str, operator: str):
        self.action_type = action_type
        self.target = target
        self.incident_id = incident_id
        self.operator = operator
        self.timestamp = datetime.utcnow()
        self.status = "pending"
        self.rollback_info = None

    def log_action(self):
        """Record action for audit trail."""
        logger.info(f"CONTAINMENT ACTION: {self.action_type.value} | "
                    f"Target: {self.target} | Incident: {self.incident_id} | "
                    f"Operator: {self.operator} | Time: {self.timestamp.isoformat()}")


class ContainmentOrchestrator:
    """
    Orchestrates containment actions across multiple systems.
    Ensures rollback capability and audit trail.
    """

    def __init__(self, incident_id: str, operator: str):
        self.incident_id = incident_id
        self.operator = operator
        self.actions_taken: List[ContainmentAction] = []

    async def isolate_host(self, host_id: str,
                           preserve_forensic_access: bool = True) -> ContainmentAction:
        """
        Isolate a host from the network while optionally preserving
        access for forensic investigation.
        """
        action = ContainmentAction(
            ContainmentType.NETWORK_ISOLATION, host_id,
            self.incident_id, self.operator
        )
        action.log_action()

        # Capture current network config for rollback
        current_config = await self.network_api.get_host_network_rules(host_id)
        action.rollback_info = current_config

        # Apply isolation rules
        isolation_rules = {
            "deny_all_inbound": True,
            "deny_all_outbound": True,
            "allow_forensic_subnet": preserve_forensic_access,
            "allow_dns": False,  # Prevent data exfiltration
            "log_blocked_traffic": True
        }

        try:
            await self.network_api.apply_isolation(host_id, isolation_rules)
            action.status = "completed"
            logger.info(f"Host {host_id} successfully isolated")
        except Exception as e:
            action.status = "failed"
            logger.error(f"Failed to isolate host {host_id}: {e}")
            raise

        self.actions_taken.append(action)
        return action

    async def disable_user_account(self, user_id: str,
                                   revoke_sessions: bool = True) -> ContainmentAction:
        """
        Disable a user account and optionally revoke all active sessions.
        """
        action = ContainmentAction(
            ContainmentType.ACCOUNT_DISABLE, user_id,
            self.incident_id, self.operator
        )
        action.log_action()

        # Store current account state for rollback
        current_state = await self.identity_api.get_account_state(user_id)
        action.rollback_info = current_state

        try:
            # Disable account
            await self.identity_api.disable_account(user_id)

            # Revoke all active sessions
            if revoke_sessions:
                sessions = await self.identity_api.get_active_sessions(user_id)
                for session in sessions:
                    await self.identity_api.revoke_session(session.id)
                logger.info(f"Revoked {len(sessions)} sessions for user {user_id}")

            action.status = "completed"
            logger.info(f"User account {user_id} disabled")
        except Exception as e:
            action.status = "failed"
            logger.error(f"Failed to disable user {user_id}: {e}")
            raise

        self.actions_taken.append(action)
        return action

    async def block_iocs(self, iocs: List[dict]) -> List[ContainmentAction]:
        """
        Block indicators of compromise across security controls.
        """
        actions = []
        for ioc in iocs:
            if ioc['type'] == 'ip':
                action = await self.block_ip(ioc['value'])
            elif ioc['type'] == 'domain':
                action = await self.block_domain(ioc['value'])
            elif ioc['type'] == 'hash':
                action = await self.block_hash(ioc['value'])
            actions.append(action)
        return actions

    async def rollback_containment(self, action: ContainmentAction):
        """
        Rollback a containment action during recovery phase.
        """
        if not action.rollback_info:
            raise ValueError(f"No rollback information for action {action.action_type}")

        logger.info(f"Rolling back containment action: {action.action_type.value} "
                    f"on {action.target}")

        if action.action_type == ContainmentType.NETWORK_ISOLATION:
            await self.network_api.restore_network_rules(
                action.target, action.rollback_info)
        elif action.action_type == ContainmentType.ACCOUNT_DISABLE:
            await self.identity_api.restore_account_state(
                action.target, action.rollback_info)

        action.status = "rolled_back"
        logger.info(f"Rollback completed for {action.target}")
```

Sophisticated attackers monitor for detection. If they see containment actions, they may immediately execute destructive actions (deploy ransomware, exfiltrate more data) or go deeper into hiding. Consider whether to perform 'quiet' containment (blocking C2 channels without obvious isolation) versus 'loud' containment (full isolation). The choice depends on attacker sophistication and your immediate goals.
With the threat contained, the focus shifts to removing all traces of the attack and restoring normal operations.
Eradication Objectives
| Attack Type | Eradication Actions | Verification Steps |
|---|---|---|
| Malware Infection | Remove malware, patch exploit, scan for variants, check for persistence | Full AV scan, behavioral monitoring, memory analysis |
| Compromised Credentials | Reset passwords, rotate secrets, revoke tokens, check for backdoor accounts | Review all admin accounts, audit recent account creation |
| Web Application Attack | Patch vulnerability, review logs for data access, check for web shells | Web shell scan, file integrity check, penetration test |
| Insider Threat | Disable accounts, revoke access, preserve evidence, HR coordination | Access audit, data access review, legal preparation |
| Ransomware | Wipe and rebuild systems, restore from clean backups, patch entry point | Monitor for re-infection, verify backup integrity before restore |
Recovery Strategies
Option 1: Clean and Restore
Option 2: Wipe and Rebuild
Recovery Verification:
Before declaring recovery complete, verify:
Heightened Monitoring Period:
After recovery, maintain elevated monitoring for 30-90 days:
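One way to keep the heightened-monitoring period from quietly becoming permanent, or from being forgotten, is to give it an explicit expiry in configuration. The sketch below is one possible shape for that record; the specific knobs are placeholders for whatever your monitoring stack actually supports.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


@dataclass
class HeightenedMonitoring:
    """Temporary post-recovery monitoring posture with an explicit expiry."""
    incident_id: str
    duration_days: int = 60  # anywhere in the 30-90 day range discussed above
    starts_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    # Illustrative adjustments -- map these to your own tooling.
    alert_threshold_multiplier: float = 0.5      # fire alerts at half the usual threshold
    extra_log_sources: Tuple[str, ...] = ("edr", "dns", "vpn")
    review_cadence_hours: int = 24               # daily review of affected systems

    @property
    def expires_at(self) -> datetime:
        return self.starts_at + timedelta(days=self.duration_days)

    def is_active(self, now: Optional[datetime] = None) -> bool:
        now = now or datetime.now(timezone.utc)
        return now < self.expires_at
```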
A common mistake is restoring systems quickly without addressing the root cause. If you restore a web server from backup without patching the vulnerability that was exploited, you'll be compromised again—often within hours. Eradication must include closing the door, not just ejecting the intruder.
Effective communication during incidents is as important as technical response. Poor communication leads to confusion, duplicated effort, stakeholder distrust, and regulatory problems.
Internal Communication
Incident Communication Bridge:
Status Update Template:
Incident: [ID] - [One-line description]
Severity: [SEV-1/2/3/4]
Current Status: [Investigating/Containing/Eradicating/Recovering/Resolved]
What we know:
- [Confirmed facts only]
What we're doing:
- [Current active workstreams]
What we need:
- [Required resources, decisions, or support]
Next update: [Time]
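Generating updates from structured fields keeps them consistent across responders and shift handovers. A small sketch of how the template above might be rendered programmatically; the field names are illustrative:

```python
from dataclasses import dataclass
from typing import List


def _bullets(items: List[str]) -> str:
    return "\n".join(f"- {item}" for item in items)


@dataclass
class StatusUpdate:
    incident_id: str
    description: str
    severity: str            # "SEV-1" .. "SEV-4"
    status: str              # Investigating / Containing / Eradicating / Recovering / Resolved
    known_facts: List[str]   # confirmed facts only, no speculation
    active_workstreams: List[str]
    needs: List[str]
    next_update: str         # e.g. "16:00 UTC"

    def render(self) -> str:
        """Render the update in the template format shown above."""
        return (
            f"Incident: {self.incident_id} - {self.description}\n"
            f"Severity: {self.severity}\n"
            f"Current Status: {self.status}\n"
            f"What we know:\n{_bullets(self.known_facts)}\n"
            f"What we're doing:\n{_bullets(self.active_workstreams)}\n"
            f"What we need:\n{_bullets(self.needs)}\n"
            f"Next update: {self.next_update}"
        )
```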
Executive Communication
Executives need different information than responders:
Customer/Public Communication
If the incident affects customers or requires public disclosure:
What NOT to communicate:
In some jurisdictions, incident response conducted under legal guidance may be protected by attorney-client privilege. Early engagement of legal counsel can protect investigation findings from discovery in potential litigation. Discuss with your legal team before incidents occur to establish appropriate structures.
The incident isn't truly over when systems are restored. Post-incident activities capture lessons learned and improve future response.
Post-Mortem / Lessons Learned
Conduct a formal post-mortem within 1-2 weeks of incident resolution. The goal is improvement, not blame.
Post-Mortem Agenda:
```markdown
# Post-Mortem: [Incident Title]

**Incident ID:** INC-2024-0042
**Date of Incident:** 2024-01-15
**Post-Mortem Date:** 2024-01-22
**Author:** [Name]
**Attendees:** [List all participants]

## Executive Summary
[2-3 sentence summary of what happened and impact]

## Timeline
| Time (UTC) | Event |
|------------|-------|
| 14:32 | Initial alert triggered for anomalous authentication |
| 14:45 | On-call engineer acknowledges, begins triage |
| 15:15 | Incident declared, severity assessed as SEV-2 |
| 15:30 | Compromised account identified and disabled |
| 16:00 | Lateral movement detected, escalated to SEV-1 |
| 16:30 | Network isolation of affected systems |
| 18:00 | Scope determined: 3 systems, no data exfiltration confirmed |
| 22:00 | Eradication complete, recovery initiated |
| +1 day | Systems restored, incident closed |

## Impact Analysis
- **Systems Affected:** 3 internal application servers
- **Data Affected:** None confirmed exfiltrated
- **User Impact:** Internal users unable to access affected apps for 8 hours
- **Financial Impact:** Estimated $XX,XXX in response costs + productivity loss
- **Regulatory Impact:** None (no customer data affected)

## Root Cause Analysis
### What happened
Attacker gained initial access via phishing email that bypassed email filters.
Victim's credentials harvested through fake login page.
Attacker used valid credentials to access VPN, then moved laterally using SMB.

### Why it happened
1. Email filtering did not detect the phishing URL (URL was newly registered)
2. User was not using hardware MFA (software token was compromised)
3. Internal network segmentation allowed excessive lateral access

## Detection Analysis
- **How was it detected?** Authentication anomaly detection flagged unusual VPN access from new geographic location
- **Time to detect:** 2.5 hours from initial access to alert
- **Detection gaps:** Lateral movement via SMB was not initially detected

## Response Analysis
### What went well
- On-call response within 15 minutes of alert
- Clear escalation when severity increased
- Effective containment prevented further spread

### What could improve
- Initial triage took too long (30 min) due to unclear runbook
- Communication gap during shift handover
- Forensic imaging delayed by tool availability

## Action Items
| Action | Owner | Deadline | Status |
|--------|-------|----------|--------|
| Implement hardware MFA for all VPN access | Identity Team | 2024-02-15 | In Progress |
| Update email filtering rules for newly registered domains | Security | 2024-01-29 | Complete |
| Create SMB lateral movement detection rule | Detection Eng | 2024-02-05 | In Progress |
| Improve incident triage runbook for auth alerts | SIRT Lead | 2024-02-01 | Not Started |
| Pre-position forensic imaging tools on jump hosts | Security | 2024-02-10 | Not Started |

## Appendix
- [Link to incident ticket]
- [Link to detailed timeline]
- [Link to forensic report]
```

Metrics to Track
Measure incident response effectiveness over time with metrics such as mean time to detect (MTTD), mean time to contain, and mean time to recover. Track them over quarters and years to demonstrate improvement in security maturity.
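These metrics fall out of the timestamps you should already be recording on each incident. A minimal sketch of computing them for a reporting period; the record fields are assumptions about your ticketing data, not a standard schema.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List, Optional


@dataclass
class IncidentRecord:
    incident_id: str
    attack_started: datetime          # best estimate from forensics
    detected: datetime
    contained: datetime
    recovered: Optional[datetime] = None


def _mean_hours(deltas) -> float:
    """Average a collection of timedeltas, expressed in hours."""
    return mean(d.total_seconds() / 3600 for d in deltas)


def response_metrics(incidents: List[IncidentRecord]) -> dict:
    """Aggregate common IR metrics (in hours) for a reporting period.

    Assumes at least one incident per category; add guards if your data
    can have empty periods.
    """
    return {
        "mean_time_to_detect": _mean_hours(
            i.detected - i.attack_started for i in incidents),
        "mean_time_to_contain": _mean_hours(
            i.contained - i.detected for i in incidents),
        "mean_time_to_recover": _mean_hours(
            i.recovered - i.contained for i in incidents if i.recovered),
    }
```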
Security incident response transforms detection into protection. How you respond to incidents determines whether security events become minor disruptions or catastrophic breaches.
What's Next:
Incident response is a reactive capability—responding to threats after they're detected. But security programs must also demonstrate compliance with regulations and policies. The final page covers Compliance Auditing—how to systematically verify and demonstrate that security controls are operating effectively.
You now understand the security incident response lifecycle and practical implementation. This knowledge enables you to build response capabilities that minimize damage when breaches occur, ensuring that security incidents remain manageable events rather than existential crises.