Prevention is the ideal approach to deadlock, but it's not always practical. Complex systems with dynamic resource patterns, legacy code integration, or third-party components may not support strict prevention disciplines. In these cases, we need detection and recovery:
This is the approach taken by most database systems: allow deadlocks to occur, detect them quickly, and recover by aborting one of the participating transactions.
By the end of this page, you will understand how to implement wait-for graph analysis for deadlock detection, choose victim transactions for recovery, and design systems that recover gracefully from deadlocks without data loss or corruption.
| Approach | Best For | Trade-offs |
|---|---|---|
| Prevention | Systems you fully control; critical paths; simple lock patterns | Requires discipline; may reduce concurrency; upfront design effort |
| Detection + Recovery | Database systems; complex/legacy systems; third-party integration | Runtime overhead; occasional transaction aborts; recovery complexity |
| Hybrid | Most production systems | Prevention as primary; detection as safety net; best of both worlds |
The Wait-For Graph (WFG) is the fundamental data structure for deadlock detection. Unlike the Resource Allocation Graph (which models both processes and resources), the Wait-For Graph directly models which processes are waiting for which other processes.
Construction: add an edge Pi → Pj whenever process Pi is waiting for a resource currently held by process Pj; remove the edge when the wait ends (the lock is acquired or the waiter gives up).
Detection: A deadlock exists if and only if there is a cycle in the Wait-For Graph.
Resource Allocation Graph → Wait-For Graph

```
RESOURCE ALLOCATION GRAPH:                 WAIT-FOR GRAPH:

  P1 ──holds──► R1                           P1
  │                                          │
  │ waits for                                │ waits for
  ▼                                          ▼
  R2 ◄──holds── P2                           P2
                │                            │
                │ waits for                  │ waits for
                ▼                            ▼
                R3 ◄──holds── P3             P3
                              │              │
                              │ waits for    │ waits for
                              ▼              ▼
                              R1             P1   ←── CYCLE!

Conversion:
- P1 waits for R2, R2 held by P2 → P1 waits for P2
- P2 waits for R3, R3 held by P3 → P2 waits for P3
- P3 waits for R1, R1 held by P1 → P3 waits for P1

Cycle detection finds: P1 → P2 → P3 → P1. DEADLOCK!
```
```typescript
/**
 * Wait-For Graph for deadlock detection.
 * Supports efficient cycle detection.
 */
class WaitForGraph {
  // Adjacency list: waiter -> set of processes being waited for
  private edges: Map<string, Set<string>> = new Map();

  // Track all known processes
  private processes: Set<string> = new Set();

  /**
   * Add a wait-for relationship.
   * @param waiter - Process that is waiting
   * @param holder - Process that holds the resource
   */
  addWait(waiter: string, holder: string): void {
    this.processes.add(waiter);
    this.processes.add(holder);
    if (!this.edges.has(waiter)) {
      this.edges.set(waiter, new Set());
    }
    this.edges.get(waiter)!.add(holder);
  }

  /**
   * Remove a wait-for relationship (when the lock is acquired or the waiter gives up).
   */
  removeWait(waiter: string, holder: string): void {
    this.edges.get(waiter)?.delete(holder);
  }

  /**
   * Remove all edges involving a process (when it completes or aborts).
   */
  removeProcess(process: string): void {
    // Remove outgoing edges
    this.edges.delete(process);
    // Remove incoming edges
    for (const dependents of this.edges.values()) {
      dependents.delete(process);
    }
    this.processes.delete(process);
  }

  /**
   * Detect any cycle in the graph.
   * Returns the cycle if found, or null if no deadlock.
   */
  detectCycle(): string[] | null {
    const visited = new Set<string>();
    const recursionStack = new Set<string>();
    const path: string[] = [];

    for (const process of this.processes) {
      if (!visited.has(process)) {
        const cycle = this.dfs(process, visited, recursionStack, path);
        if (cycle) {
          return cycle;
        }
      }
    }
    return null;
  }

  private dfs(
    node: string,
    visited: Set<string>,
    recursionStack: Set<string>,
    path: string[]
  ): string[] | null {
    visited.add(node);
    recursionStack.add(node);
    path.push(node);

    const neighbors = this.edges.get(node) || new Set<string>();
    for (const neighbor of neighbors) {
      if (!visited.has(neighbor)) {
        const cycle = this.dfs(neighbor, visited, recursionStack, path);
        if (cycle) {
          return cycle;
        }
      } else if (recursionStack.has(neighbor)) {
        // Found a cycle - extract it from the current path
        const cycleStart = path.indexOf(neighbor);
        return path.slice(cycleStart);
      }
    }

    path.pop();
    recursionStack.delete(node);
    return null;
  }

  /**
   * Get all cycles (for comprehensive deadlock analysis).
   * Johnson's algorithm finds all elementary cycles efficiently;
   * this is a simplified implementation for illustration.
   */
  detectAllCycles(): string[][] {
    const cycles: string[][] = [];
    for (const start of this.processes) {
      this.findCyclesFrom(start, start, new Set(), [], cycles);
    }
    return this.deduplicateCycles(cycles);
  }

  private findCyclesFrom(
    start: string,
    node: string,
    visited: Set<string>,
    stack: string[],
    cycles: string[][]
  ): void {
    visited.add(node);
    stack.push(node);
    for (const neighbor of this.edges.get(node) ?? new Set<string>()) {
      if (neighbor === start) {
        cycles.push([...stack]); // Closed a cycle back to the start node
      } else if (!visited.has(neighbor)) {
        this.findCyclesFrom(start, neighbor, visited, stack, cycles);
      }
    }
    stack.pop();
  }

  private deduplicateCycles(cycles: string[][]): string[][] {
    // The same cycle is discovered once per member (in rotation);
    // normalize by sorted membership and keep one representative each.
    const seen = new Set<string>();
    return cycles.filter(cycle => {
      const key = [...cycle].sort().join(',');
      if (seen.has(key)) return false;
      seen.add(key);
      return true;
    });
  }
}
```

Time complexity analysis: the DFS visits each process and each wait-for edge at most once, so a full cycle check runs in O(V + E), where V is the number of processes and E the number of wait-for edges (`detectAllCycles` is more expensive and should be used sparingly).
For most practical systems, the graph is sparse and detection is fast.
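To make the detection step concrete, here is a compact, self-contained version of the same DFS cycle check (function and variable names here are illustrative) run against the P1 → P2 → P3 → P1 example from the diagram:

```typescript
// Adjacency list: waiter -> list of processes it waits for
type Wfg = Map<string, string[]>;

// Depth-first search returning the first cycle found, or null.
function findCycle(graph: Wfg): string[] | null {
  const visited = new Set<string>();
  const onStack = new Set<string>();
  const path: string[] = [];

  const dfs = (node: string): string[] | null => {
    visited.add(node);
    onStack.add(node);
    path.push(node);
    for (const next of graph.get(node) ?? []) {
      if (!visited.has(next)) {
        const cycle = dfs(next);
        if (cycle) return cycle;
      } else if (onStack.has(next)) {
        // Back edge: the cycle is the path suffix starting at `next`
        return path.slice(path.indexOf(next));
      }
    }
    path.pop();
    onStack.delete(node);
    return null;
  };

  for (const node of graph.keys()) {
    if (!visited.has(node)) {
      const cycle = dfs(node);
      if (cycle) return cycle;
    }
  }
  return null;
}

// The wait-for graph from the diagram: P1 → P2 → P3 → P1
const example: Wfg = new Map([
  ['P1', ['P2']],
  ['P2', ['P3']],
  ['P3', ['P1']],
]);

console.log(findCycle(example)); // [ 'P1', 'P2', 'P3' ]
```

Note that the returned array lists the cycle members in wait order, which is exactly the information victim selection needs later.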
Instead of periodically rebuilding and analyzing the entire graph, production systems often use incremental detection: check for cycles only when adding a new edge. If adding waiter → holder would create a cycle, the deadlock is detected immediately. This is O(path length) per lock request rather than O(V+E) periodically.
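The incremental check reduces to a reachability query: adding waiter → holder closes a cycle exactly when waiter is already reachable from holder. A minimal sketch (names are illustrative, not from any particular lock manager):

```typescript
type WaitEdges = Map<string, Set<string>>;

// Would adding the edge waiter -> holder create a cycle?
// Equivalent to asking: is `waiter` already reachable from `holder`?
function wouldDeadlock(edges: WaitEdges, waiter: string, holder: string): boolean {
  const seen = new Set<string>();
  const stack = [holder];
  while (stack.length > 0) {
    const node = stack.pop()!;
    if (node === waiter) return true; // holder transitively waits on waiter
    if (seen.has(node)) continue;
    seen.add(node);
    for (const next of edges.get(node) ?? []) stack.push(next);
  }
  return false; // safe to add the edge and block
}

// Existing waits: P2 → P3 → P1
const edges: WaitEdges = new Map([
  ['P2', new Set(['P3'])],
  ['P3', new Set(['P1'])],
]);

console.log(wouldDeadlock(edges, 'P1', 'P2')); // true: closes P1 → P2 → P3 → P1
console.log(wouldDeadlock(edges, 'P2', 'P1')); // false: P1 waits on nothing
```

A lock manager would run this check before enqueueing the waiter and reject or abort the request when it returns true.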
Beyond simple cycle detection, various algorithms optimize for different scenarios:
```typescript
/**
 * Banker's Algorithm for deadlock detection with multi-instance resources.
 *
 * Named after a bank that must ensure it can satisfy all customer withdrawals.
 */
class BankersAlgorithm {
  private n: number; // Number of processes
  private m: number; // Number of resource types

  // Current state
  private available: number[];    // Available instances of each resource
  private allocation: number[][]; // allocation[i][j] = instances of resource j held by process i
  private request: number[][];    // request[i][j] = additional instances of j that process i wants

  constructor(
    available: number[],
    allocation: number[][],
    request: number[][]
  ) {
    this.n = allocation.length;
    this.m = available.length;
    this.available = [...available];
    this.allocation = allocation.map(row => [...row]);
    this.request = request.map(row => [...row]);
  }

  /**
   * Detect if the system is in a deadlocked state.
   * Returns the list of deadlocked processes, or an empty array if no deadlock.
   */
  detectDeadlock(): number[] {
    // Work = available resources
    const work = [...this.available];
    // finish[i] = true if process i can finish
    const finish = new Array(this.n).fill(false);

    // Iteratively find processes that can complete
    let changed = true;
    while (changed) {
      changed = false;
      for (let i = 0; i < this.n; i++) {
        if (finish[i]) continue;

        // Check if process i's request can be satisfied
        if (this.canSatisfy(this.request[i], work)) {
          // Assume the process completes and releases its resources
          for (let j = 0; j < this.m; j++) {
            work[j] += this.allocation[i][j];
          }
          finish[i] = true;
          changed = true;
        }
      }
    }

    // Processes that can't finish are deadlocked
    const deadlocked: number[] = [];
    for (let i = 0; i < this.n; i++) {
      if (!finish[i]) {
        deadlocked.push(i);
      }
    }
    return deadlocked;
  }

  private canSatisfy(request: number[], available: number[]): boolean {
    for (let j = 0; j < this.m; j++) {
      if (request[j] > available[j]) {
        return false;
      }
    }
    return true;
  }
}

// Example usage
const available = [3, 3, 2]; // 3 of resource A, 3 of B, 2 of C

const allocation = [
  [0, 1, 0], // Process 0 holds 0 A, 1 B, 0 C
  [2, 0, 0], // Process 1 holds 2 A, 0 B, 0 C
  [3, 0, 2], // Process 2 holds 3 A, 0 B, 2 C
  [2, 1, 1], // Process 3 holds 2 A, 1 B, 1 C
  [0, 0, 2], // Process 4 holds 0 A, 0 B, 2 C
];

const request = [
  [0, 0, 0], // Process 0 needs nothing more
  [2, 0, 2], // Process 1 needs 2 A, 0 B, 2 C
  [0, 0, 0], // Process 2 needs nothing more
  [1, 0, 0], // Process 3 needs 1 A
  [0, 0, 2], // Process 4 needs 0 A, 0 B, 2 C
];

const banker = new BankersAlgorithm(available, allocation, request);
const deadlocked = banker.detectDeadlock();
console.log('Deadlocked processes:', deadlocked);
```

When to run detection:
Detection has a cost (CPU time, and potential contention on the lock manager's data structures). Systems must balance responsiveness against overhead:
| Strategy | When to Run | Pros | Cons |
|---|---|---|---|
| Continuous | On every lock request that blocks | Immediate detection; minimal deadlock duration | Highest overhead; may slow down normal operations |
| Periodic | Every N seconds (typically 1-30s) | Low overhead; predictable CPU usage | Deadlocks persist until next check; wasted resources |
| Threshold-based | When wait queue exceeds threshold | Adaptive to actual contention | May miss some deadlocks; tuning required |
| Timeout-triggered | When any wait exceeds timeout | Only runs when needed; efficient | Timeout must be tuned; delayed detection |
```typescript
interface DeadlockCheckResult {
  deadlock: boolean;
  cycle?: string[];
  detectedBy?: 'incremental' | 'timeout' | 'periodic';
  waitDuration?: number;
}

class DeadlockDetector {
  // Assumes WaitForGraph also offers detectCycleFrom(p) (a DFS rooted at p)
  // and removeProcessAsWaiter(p) (drop p's outgoing edges).
  private waitForGraph = new WaitForGraph();
  private lastFullCheck = Date.now();
  private waitingProcesses = new Map<string, number>(); // process -> wait start time

  private readonly periodicIntervalMs = 10000; // Full check every 10s
  private readonly waitTimeoutMs = 5000;       // Flag after 5s wait

  /**
   * Called when a process starts waiting for a lock.
   */
  onWaitStart(waiter: string, holder: string): DeadlockCheckResult {
    const startTime = Date.now();
    this.waitingProcesses.set(waiter, startTime);
    this.waitForGraph.addWait(waiter, holder);

    // STRATEGY 1: Incremental check - would adding this edge create a cycle?
    // This is O(path length), very fast
    const cycle = this.waitForGraph.detectCycleFrom(waiter);
    if (cycle) {
      return { deadlock: true, cycle, detectedBy: 'incremental' };
    }
    return { deadlock: false };
  }

  /**
   * Called periodically (e.g., by a background thread).
   */
  periodicCheck(): DeadlockCheckResult[] {
    const now = Date.now();
    const results: DeadlockCheckResult[] = [];

    // STRATEGY 2: Check long-waiting processes first
    for (const [process, startTime] of this.waitingProcesses) {
      if (now - startTime > this.waitTimeoutMs) {
        // This process has waited too long - check for deadlock
        const cycle = this.waitForGraph.detectCycleFrom(process);
        if (cycle) {
          results.push({
            deadlock: true,
            cycle,
            detectedBy: 'timeout',
            waitDuration: now - startTime
          });
        }
      }
    }

    // STRATEGY 3: Periodic full graph scan (catches edge cases)
    if (now - this.lastFullCheck > this.periodicIntervalMs) {
      const allCycles = this.waitForGraph.detectAllCycles();
      for (const cycle of allCycles) {
        results.push({ deadlock: true, cycle, detectedBy: 'periodic' });
      }
      this.lastFullCheck = now;
    }

    return results;
  }

  /**
   * Called when a process acquires or releases a lock.
   */
  onWaitEnd(waiter: string): void {
    this.waitingProcesses.delete(waiter);
    this.waitForGraph.removeProcessAsWaiter(waiter);
  }
}
```

In distributed systems, deadlocks can span multiple nodes, making detection significantly more complex.
No single node has a complete view of the wait-for relationships. Three main approaches exist: centralized detection (one coordinator assembles a global wait-for graph), edge-chasing detection (nodes forward probe messages along wait-for edges, as in Chandy-Misra-Haas below), and timeout-based schemes (no explicit detection; long waits are presumed deadlocks):
```typescript
/**
 * Chandy-Misra-Haas algorithm for distributed deadlock detection.
 * Each node maintains local wait-for information and propagates probes.
 * (NetworkInterface, getNodeFor, and initiateRecovery are environment-specific
 * abstractions and are not fleshed out here.)
 */
interface Probe {
  initiator: string; // Process that started the probe
  sender: string;    // Process that sent this probe
  receiver: string;  // Process that should receive this probe
  path: string[];    // Path the probe has taken
}

class DistributedDeadlockDetector {
  private nodeId: string;
  private localWaitFor: Map<string, string[]> = new Map(); // local waits
  private network: NetworkInterface; // Abstraction over the messaging layer

  /**
   * Called when a local process starts waiting for a resource
   * held by a process on another node.
   */
  onCrossNodeWait(localProcess: string, remoteProcess: string, remoteNode: string): void {
    // Initiate a probe
    const probe: Probe = {
      initiator: localProcess,
      sender: localProcess,
      receiver: remoteProcess,
      path: [localProcess]
    };
    this.network.sendTo(remoteNode, { type: 'PROBE', probe });
  }

  /**
   * Handle a received probe message.
   */
  onProbeReceived(probe: Probe): void {
    const localProcess = probe.receiver;

    // Check if the probe has returned to its initiator → DEADLOCK!
    if (localProcess === probe.initiator) {
      this.handleDeadlockDetected(probe.path);
      return;
    }

    // Check if the local process is blocked
    const waitingFor = this.localWaitFor.get(localProcess);
    if (!waitingFor || waitingFor.length === 0) {
      // Process is not waiting - the probe dies
      return;
    }

    // Propagate the probe to all processes this one is waiting for
    for (const target of waitingFor) {
      const newProbe: Probe = {
        initiator: probe.initiator,
        sender: localProcess,
        receiver: target,
        path: [...probe.path, localProcess]
      };

      const targetNode = this.getNodeFor(target); // Lookup: which node owns a process
      if (targetNode === this.nodeId) {
        // Local - recurse
        this.onProbeReceived(newProbe);
      } else {
        // Remote - send over the network
        this.network.sendTo(targetNode, { type: 'PROBE', probe: newProbe });
      }
    }
  }

  private handleDeadlockDetected(cycle: string[]): void {
    console.log('Distributed deadlock detected:', cycle.join(' → '));
    // Initiate recovery (e.g., abort the youngest transaction in the cycle)
    this.initiateRecovery(cycle);
  }
}
```

In distributed detection, timing issues can cause phantom deadlocks: a deadlock is detected when none actually exists. This happens because the global state being analyzed is inconsistent—some edges existed in the past but have since been removed. Systems must be robust to occasional false positives.
Timeout-based approach in distributed systems:
Many distributed databases (including Spanner, CockroachDB, and TiDB) rely on timeouts rather than explicit detection: a transaction that waits on a lock beyond a configured threshold is presumed deadlocked, aborted, and retried by the client.
This is simpler than distributed detection and handles network partitions naturally (a node that can't communicate will eventually timeout).
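A minimal sketch of the timeout discipline, using a single in-process lock for illustration (class and method names are hypothetical; real systems build this into the lock manager and abort the whole transaction on timeout):

```typescript
class TimeoutLock {
  private holder: string | null = null;

  /**
   * Try to acquire the lock, giving up after timeoutMs.
   * Returning false models the "presume deadlock, abort, retry" path.
   */
  async acquire(txId: string, timeoutMs: number): Promise<boolean> {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
      if (this.holder === null) {
        this.holder = txId;
        return true;
      }
      // Polling keeps the sketch short; real lock managers use wait queues
      await new Promise(resolve => setTimeout(resolve, 5));
    }
    return false; // timed out: caller aborts and retries
  }

  release(txId: string): void {
    if (this.holder === txId) this.holder = null;
  }
}
```

A caller that receives `false` releases everything it holds and retries, which is also exactly how a timed-out node behaves during a network partition.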
Once a deadlock is detected, we must choose which process(es) to abort to break the cycle. This is victim selection. The goal is to minimize the cost of recovery while ensuring the deadlock is resolved.
Criteria for victim selection:
| Criterion | Description | Rationale |
|---|---|---|
| Minimum rollback | Choose process that has done least work | Less wasted computation; faster recovery |
| Minimum resources held | Choose process holding fewest resources | Frees resources for more processes |
| Youngest first | Choose most recently started process | Less work invested; natural priority by age |
| Priority-based | Choose lowest priority process | Protects important transactions |
| Restart cost | Choose process cheapest to restart | Holistic optimization of system throughput |
| Cycle breaking | Choose process that breaks most cycles | One abort may resolve multiple deadlocks |
```typescript
interface TransactionInfo {
  id: string;
  startTime: number;
  priority: 'LOW' | 'NORMAL' | 'HIGH' | 'CRITICAL';
  resourcesHeld: number;
  operationsCompleted: number;
  estimatedRemainingWork: number;
  rollbackCost: number; // Computed from the operations to undo
}

class VictimSelector {
  /**
   * Select the optimal victim from a deadlock cycle.
   * Returns the transaction to abort.
   */
  selectVictim(
    cycle: string[],
    transactions: Map<string, TransactionInfo>
  ): string {
    let bestVictim: string | null = null;
    let lowestCost = Infinity;

    for (const txId of cycle) {
      const tx = transactions.get(txId);
      if (!tx) continue;

      // Never abort critical transactions if possible
      if (tx.priority === 'CRITICAL' && cycle.length > 1) {
        continue;
      }

      const cost = this.calculateAbortCost(tx);
      if (cost < lowestCost) {
        lowestCost = cost;
        bestVictim = txId;
      }
    }

    // If no victim was selected (all critical), pick the youngest
    if (!bestVictim) {
      bestVictim = this.selectYoungest(cycle, transactions);
    }
    return bestVictim!;
  }

  private calculateAbortCost(tx: TransactionInfo): number {
    const priorityWeight = { 'LOW': 1, 'NORMAL': 2, 'HIGH': 4, 'CRITICAL': 100 };

    // Multi-factor cost function
    const cost =
      tx.rollbackCost * 1.0 +                   // Direct rollback cost
      tx.operationsCompleted * 0.5 +            // Wasted work
      tx.resourcesHeld * 0.3 +                  // Resources to free
      priorityWeight[tx.priority] * 10 +        // Priority protection
      (Date.now() - tx.startTime) / 1000 * 0.1; // Age (older = higher cost)
    return cost;
  }

  private selectYoungest(
    cycle: string[],
    transactions: Map<string, TransactionInfo>
  ): string {
    let youngest: string = cycle[0];
    let latestStart = 0;

    for (const txId of cycle) {
      const tx = transactions.get(txId);
      if (tx && tx.startTime > latestStart) {
        latestStart = tx.startTime;
        youngest = txId;
      }
    }
    return youngest;
  }
}
```

If the same transaction is always selected as victim (e.g., always youngest), it may never complete—starvation.
Combat this by tracking abort counts and increasing priority for repeatedly-aborted transactions, or by eventually making them immune to victim selection.
```typescript
class StarvationAwareVictimSelector extends VictimSelector {
  private abortCounts: Map<string, number> = new Map();
  private readonly starvationThreshold = 3;

  selectVictim(
    cycle: string[],
    transactions: Map<string, TransactionInfo>
  ): string {
    // Filter out transactions that have been aborted too many times
    const eligibleCandidates = cycle.filter(txId => {
      const abortCount = this.abortCounts.get(txId) || 0;
      return abortCount < this.starvationThreshold;
    });

    // If all have been aborted many times, reset counts and choose normally
    if (eligibleCandidates.length === 0) {
      console.warn('All transactions in cycle show signs of starvation');
      // Reset and fall back to normal selection
      for (const txId of cycle) {
        this.abortCounts.set(txId, 0);
      }
      return super.selectVictim(cycle, transactions);
    }

    // Select from eligible candidates only
    return super.selectVictim(eligibleCandidates, transactions);
  }

  recordAbort(txId: string): void {
    const current = this.abortCounts.get(txId) || 0;
    this.abortCounts.set(txId, current + 1);
  }

  recordCommit(txId: string): void {
    // Reset on successful commit
    this.abortCounts.delete(txId);
  }
}
```

Once a victim is selected, recovery must break the deadlock without leaving the system in an inconsistent state. The primary strategies are full rollback (undo everything the victim did, then release its locks) and partial rollback to a savepoint (undo only the operations after a checkpoint):
```typescript
interface UndoLogEntry {
  operation: 'INSERT' | 'UPDATE' | 'DELETE' | 'SAVEPOINT';
  table: string;
  key: string;
  previousValue?: any;
  newValue?: any;
  timestamp: number;
}

class TransactionRecoveryManager {
  private undoLogs: Map<string, UndoLogEntry[]> = new Map();
  // Assumed collaborators: a lock manager and a database access layer
  private lockManager: LockManager;
  private database: Database;

  /**
   * Record an operation for potential rollback.
   */
  logOperation(txId: string, entry: UndoLogEntry): void {
    if (!this.undoLogs.has(txId)) {
      this.undoLogs.set(txId, []);
    }
    this.undoLogs.get(txId)!.push(entry);
  }

  /**
   * Roll back a transaction completely.
   */
  async rollbackTransaction(txId: string): Promise<void> {
    const entries = this.undoLogs.get(txId) || [];

    // Process in reverse order (LIFO)
    for (let i = entries.length - 1; i >= 0; i--) {
      await this.undoOperation(entries[i]);
    }

    // Clear the undo log
    this.undoLogs.delete(txId);

    // Release all locks held by this transaction
    await this.lockManager.releaseAllFor(txId);

    console.log(`Transaction ${txId} rolled back successfully`);
  }

  private async undoOperation(entry: UndoLogEntry): Promise<void> {
    switch (entry.operation) {
      case 'INSERT':
        // Undo insert by deleting
        await this.database.delete(entry.table, entry.key);
        break;
      case 'UPDATE':
        // Undo update by restoring the previous value
        await this.database.update(entry.table, entry.key, entry.previousValue);
        break;
      case 'DELETE':
        // Undo delete by reinserting
        await this.database.insert(entry.table, entry.key, entry.previousValue);
        break;
      case 'SAVEPOINT':
        // Savepoint markers have nothing to undo
        break;
    }
  }

  /**
   * Roll back to a specific savepoint.
   */
  async rollbackToSavepoint(txId: string, savepoint: string): Promise<void> {
    const entries = this.undoLogs.get(txId) || [];
    const savepointIndex = entries.findIndex(
      e => e.operation === 'SAVEPOINT' && e.key === savepoint
    );
    if (savepointIndex === -1) {
      throw new Error(`Savepoint ${savepoint} not found`);
    }

    // Undo operations after the savepoint
    for (let i = entries.length - 1; i > savepointIndex; i--) {
      await this.undoOperation(entries[i]);
    }

    // Truncate the log to the savepoint
    this.undoLogs.set(txId, entries.slice(0, savepointIndex + 1));

    // Release locks acquired after the savepoint
    await this.lockManager.releaseLocksAfter(txId, entries[savepointIndex].timestamp);
  }
}
```

Systems designed with deadlock recovery in mind share several traits: clear transaction boundaries, proper undo logging, idempotent operations where possible, and retry logic at the application level. The application must expect occasional aborts and handle them gracefully.
Modern databases have sophisticated, battle-tested deadlock handling. Let's examine how major databases implement detection and recovery:
| Database | Detection Method | Victim Selection | Recovery |
|---|---|---|---|
| PostgreSQL | Wait-for graph; checked when wait exceeds deadlock_timeout (1s default) | Detects cycle; aborts transaction that closed the cycle | Raises ERROR 40P01; transaction must retry |
| MySQL/InnoDB | Wait-for graph; immediate check on every lock wait | Chooses smallest transaction (fewest rows modified) | Raises ERROR 1213; automatic rollback; app must retry |
| SQL Server | Background thread checks every 5 seconds | Chooses transaction with lowest DEADLOCK_PRIORITY | Raises ERROR 1205; transaction rolled back |
| Oracle | Immediate detection when wait begins | Transaction that detected the deadlock (requester) | Statement rolled back (not full transaction); app handles |
```typescript
class DeadlockRetriesExhaustedError extends Error {
  constructor(message: string, public readonly cause: Error | null) {
    super(message);
    this.name = 'DeadlockRetriesExhaustedError';
  }
}

class RobustDatabaseClient {
  private readonly maxRetries = 3;
  private readonly baseBackoffMs = 100;

  /**
   * Execute a database operation with automatic deadlock retry.
   */
  async executeWithDeadlockRetry<T>(
    operation: () => Promise<T>
  ): Promise<T> {
    let lastError: Error | null = null;

    for (let attempt = 0; attempt < this.maxRetries; attempt++) {
      try {
        return await operation();
      } catch (error: any) {
        if (this.isDeadlockError(error)) {
          lastError = error;

          // Log for monitoring
          console.warn(
            `Deadlock detected (attempt ${attempt + 1}/${this.maxRetries})`,
            { error: error.message }
          );

          // Exponential backoff with jitter
          const backoffMs = this.baseBackoffMs * Math.pow(2, attempt);
          const jitter = Math.random() * backoffMs * 0.5;
          await this.sleep(backoffMs + jitter);

          // Retry
          continue;
        }
        // Non-deadlock error - rethrow immediately
        throw error;
      }
    }

    // Exhausted retries
    throw new DeadlockRetriesExhaustedError(
      `Operation failed after ${this.maxRetries} deadlock retries`,
      lastError
    );
  }

  private isDeadlockError(error: any): boolean {
    // Check for database-specific error codes
    const deadlockCodes = [
      '40P01',     // PostgreSQL
      '1213',      // MySQL
      '1205',      // SQL Server
      'ORA-00060'  // Oracle
    ];
    const errorCode = error.code || error.errorCode || error.errno;
    return deadlockCodes.includes(String(errorCode));
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Usage example (beginTransaction/query/commit/rollback delegate to the
// underlying driver and are omitted here)
const db = new RobustDatabaseClient();

async function transferFunds(fromId: string, toId: string, amount: number) {
  await db.executeWithDeadlockRetry(async () => {
    await db.beginTransaction();
    try {
      await db.query('UPDATE accounts SET balance = balance - $1 WHERE id = $2', [amount, fromId]);
      await db.query('UPDATE accounts SET balance = balance + $1 WHERE id = $2', [amount, toId]);
      await db.commit();
    } catch (error) {
      await db.rollback();
      throw error;
    }
  });
}
```

When retrying after a deadlock, the operation runs again from the beginning. If the operation has side effects (sending emails, calling external APIs), you must ensure idempotency or use saga patterns to handle partial failures.
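One common way to make such retries safe is an idempotency key: record the result of each side-effecting step under a unique key, so a deadlock retry replays the recorded result instead of re-running the side effect. A minimal in-memory sketch (a real implementation would persist the store; all names here are hypothetical):

```typescript
// In-memory stand-in for a persistent idempotency store.
class IdempotentExecutor {
  private completed = new Map<string, unknown>();

  /** Run the side-effecting operation at most once per key. */
  async runOnce<T>(key: string, operation: () => Promise<T>): Promise<T> {
    if (this.completed.has(key)) {
      return this.completed.get(key) as T; // replay the recorded result on retry
    }
    const result = await operation();
    this.completed.set(key, result);
    return result;
  }
}

// Demo: the email is sent once even though the transaction retried.
const executor = new IdempotentExecutor();
let emailsSent = 0;
const sendConfirmation = async () => { emailsSent++; return 'sent'; };

(async () => {
  await executor.runOnce('confirm:order-42', sendConfirmation); // first attempt
  await executor.runOnce('confirm:order-42', sendConfirmation); // deadlock retry
  console.log(emailsSent); // 1
})();
```

The key should be derived from business identity (order ID, transfer ID), not from the attempt, so every retry of the same logical operation maps to the same key.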
Deadlock detection and recovery handle incidents, but monitoring helps prevent them from becoming chronic problems. A good monitoring strategy includes deadlock-rate metrics, per-resource hot-spot tracking, alerting thresholds, and periodic analysis reports:
```typescript
class DeadlockMonitor {
  // Assumed collaborators: a metrics client and an alerting service
  private metrics: MetricsClient;
  private alerter: AlertingService;

  // Thresholds for alerting
  private readonly deadlocksPerMinuteWarning = 5;
  private readonly deadlocksPerMinuteCritical = 20;
  private readonly sameResourceThreshold = 3; // Same resource in N deadlocks = hot spot

  onDeadlockDetected(deadlock: DeadlockEvent): void {
    // Record metrics
    this.metrics.increment('deadlock.detected', {
      resources: deadlock.resources.join(','),
      victimType: deadlock.victim.type,
      cycleLength: deadlock.cycle.length.toString()
    });

    // Track hot spots
    for (const resource of deadlock.resources) {
      this.metrics.increment('deadlock.by_resource', { resource });
    }

    // Check for alert conditions
    this.checkAlertConditions(deadlock);
  }

  private async checkAlertConditions(deadlock: DeadlockEvent): Promise<void> {
    // Rate-based alert
    const recentDeadlocks = await this.metrics.query(
      'sum(deadlock.detected)[1m]'
    );

    if (recentDeadlocks >= this.deadlocksPerMinuteCritical) {
      await this.alerter.sendAlert({
        severity: 'CRITICAL',
        title: 'High deadlock rate detected',
        description: `${recentDeadlocks} deadlocks in the past minute`,
        runbook: 'https://wiki/deadlock-runbook',
        metadata: deadlock
      });
    } else if (recentDeadlocks >= this.deadlocksPerMinuteWarning) {
      await this.alerter.sendAlert({
        severity: 'WARNING',
        title: 'Elevated deadlock rate',
        description: `${recentDeadlocks} deadlocks in the past minute`,
        runbook: 'https://wiki/deadlock-runbook'
      });
    }

    // Hot spot detection
    for (const resource of deadlock.resources) {
      const resourceCount = await this.metrics.query(
        `sum(deadlock.by_resource{resource="${resource}"})[5m]`
      );
      if (resourceCount >= this.sameResourceThreshold) {
        await this.alerter.sendAlert({
          severity: 'WARNING',
          title: 'Deadlock hot spot detected',
          description: `Resource ${resource} involved in ${resourceCount} deadlocks`,
          suggestion: 'Review lock ordering or access patterns for this resource'
        });
      }
    }
  }

  /**
   * Generate a deadlock analysis report.
   * (Helper queries such as getTopResources are omitted for brevity.)
   */
  async generateReport(timeRange: string): Promise<DeadlockReport> {
    return {
      totalDeadlocks: await this.metrics.query(`sum(deadlock.detected)[${timeRange}]`),
      averagePerHour: await this.metrics.query(`avg_over_time(rate(deadlock.detected[1h])[${timeRange}])`),
      topResources: await this.getTopResources(timeRange),
      topTransactionTypes: await this.getTopTransactionTypes(timeRange),
      hourlyDistribution: await this.getHourlyDistribution(timeRange),
      recommendations: await this.generateRecommendations(timeRange)
    };
  }
}
```

When deadlocks spike, capture: the full cycle (which transactions/resources), stack traces of waiting threads, recent code changes, and load patterns. This data is invaluable for root cause analysis and permanent fixes.
We've covered the complete lifecycle of deadlock detection and recovery. The key concepts:

- The Wait-For Graph models which processes wait on which others; a deadlock exists if and only if the graph contains a cycle.
- Detection can run incrementally (on each blocking lock request), periodically, or when waits exceed a threshold or timeout; each strategy trades overhead against detection latency.
- Distributed deadlocks require probe-based algorithms such as Chandy-Misra-Haas, or simpler timeout schemes; phantom deadlocks make occasional false positives unavoidable.
- Victim selection balances rollback cost, resources held, age, and priority, with safeguards against starvation.
- Recovery rolls the victim back (fully or to a savepoint) using undo logs, and applications retry aborted transactions with backoff.
- Monitoring deadlock rates and hot spots turns one-off incidents into fixable patterns.
Module complete.

Congratulations! You now have comprehensive knowledge of deadlocks—from fundamental definitions through prevention strategies to detection and recovery. This knowledge enables you to choose between prevention, detection, and hybrid strategies; implement wait-for graph analysis; select recovery victims without starving transactions; and write application code that handles deadlock aborts gracefully.
You've mastered one of the most challenging topics in concurrent programming. Deadlocks have been the source of countless production incidents, but with the knowledge from this module, you're equipped to prevent, detect, and recover from them. Apply these principles to build robust, reliable concurrent systems.