When each resource type in the system has exactly one instance—one printer, one database connection, one exclusive file lock—deadlock detection becomes remarkably elegant. In this scenario, the wait-for graph provides a complete and direct representation of the system's deadlock potential.
Why is single-instance detection simpler? Because resource allocation is binary: a resource is either held by exactly one process or available. There's no ambiguity about "which instance" a process might receive. When process P waits for resource R, and R is held by process Q, we know exactly that P is waiting for Q. This directness translates into a graph where:

- each node is a process,
- an edge P → Q means "P is waiting for a resource held by Q," and
- a cycle exists if and only if a deadlock exists.
This equivalence between cycles and deadlocks makes single-instance detection both conceptually clear and computationally efficient.
By the end of this page, you will understand why cycles in the wait-for graph precisely indicate deadlock for single-instance resources, master multiple algorithms for cycle detection (DFS, topological sort), implement production-quality detection code, and analyze the time and space complexity of these approaches.
The wait-for graph (WFG) is a directed graph that captures the essential waiting relationships between processes.
Formal Definition:

Given:

- a set of processes P = {P₁, P₂, …, Pₙ}, and
- a set of single-instance resources R = {R₁, R₂, …, Rₘ},

the wait-for graph G = (V, E) is defined as:

- V = P (one vertex per process)
- E = the set of wait-for edges

An edge Pᵢ → Pⱼ exists if and only if Pᵢ is blocked waiting for a resource that is currently held by Pⱼ.
```python
from dataclasses import dataclass, field
from typing import Dict, Set, List, Optional, Tuple
from enum import Enum


class ResourceState(Enum):
    """State of a single-instance resource."""
    FREE = 0
    HELD = 1


@dataclass
class Resource:
    """A single-instance resource."""
    id: int
    name: str
    state: ResourceState = ResourceState.FREE
    holder: Optional[int] = None  # Process ID holding this resource


@dataclass
class Process:
    """A process that may hold or request resources."""
    pid: int
    name: str
    holding: Set[int] = field(default_factory=set)  # Resource IDs held
    waiting_for: Optional[int] = None  # Resource ID waiting for (if any)


@dataclass
class WaitForGraph:
    """
    Wait-for graph for single-instance resource deadlock detection.

    The graph is derived from the current state of processes and
    resources. Cycles in this graph indicate definite deadlock.
    """
    adjacency: Dict[int, Set[int]] = field(default_factory=dict)  # pid -> set of pids
    processes: Dict[int, Process] = field(default_factory=dict)
    resources: Dict[int, Resource] = field(default_factory=dict)

    def add_process(self, process: Process):
        """Register a process in the system."""
        self.processes[process.pid] = process
        self.adjacency[process.pid] = set()

    def add_resource(self, resource: Resource):
        """Register a single-instance resource."""
        self.resources[resource.id] = resource

    def allocate(self, pid: int, resource_id: int) -> bool:
        """
        Attempt to allocate a resource to a process.

        Returns True if successful, False if resource is held by another.
        """
        resource = self.resources[resource_id]
        process = self.processes[pid]

        if resource.state == ResourceState.FREE:
            # Grant the resource
            resource.state = ResourceState.HELD
            resource.holder = pid
            process.holding.add(resource_id)
            process.waiting_for = None  # No longer waiting
            self._rebuild_edges_for(pid)
            return True
        else:
            # Resource is held - process must wait
            process.waiting_for = resource_id
            self._rebuild_edges_for(pid)
            return False

    def release(self, pid: int, resource_id: int):
        """Release a resource held by a process."""
        resource = self.resources[resource_id]
        process = self.processes[pid]

        if resource.holder == pid:
            resource.state = ResourceState.FREE
            resource.holder = None
            process.holding.discard(resource_id)

            # Rebuild edges for all processes that might have been waiting for this
            for p in self.processes.values():
                if p.waiting_for == resource_id:
                    self._rebuild_edges_for(p.pid)

    def _rebuild_edges_for(self, pid: int):
        """Rebuild wait-for edges for a specific process."""
        process = self.processes[pid]
        self.adjacency[pid].clear()

        if process.waiting_for is not None:
            resource = self.resources[process.waiting_for]
            if resource.holder is not None and resource.holder != pid:
                # This process waits for the holder
                self.adjacency[pid].add(resource.holder)

    def get_edges(self) -> List[Tuple[int, int]]:
        """Get all edges as (waiting_pid, holding_pid) tuples."""
        edges = []
        for pid, waiting_for in self.adjacency.items():
            for holder_pid in waiting_for:
                edges.append((pid, holder_pid))
        return edges

    def print_graph(self):
        """Print the current wait-for graph."""
        print("Wait-For Graph:")
        print("  Processes:", [p.name for p in self.processes.values()])
        print("  Edges:")
        for pid, waiting_for in self.adjacency.items():
            if waiting_for:
                p = self.processes[pid]
                for holder_pid in waiting_for:
                    h = self.processes[holder_pid]
                    r = self.resources.get(p.waiting_for)
                    print(f"    {p.name} --({r.name if r else '?'})--> {h.name}")


# Demonstration
def demonstrate_wait_for_graph():
    wfg = WaitForGraph()

    # Create processes
    for i, name in enumerate(["Web Server", "Database", "Cache", "Logger"]):
        wfg.add_process(Process(pid=i, name=name))

    # Create single-instance resources
    for i, name in enumerate(["Mutex A", "Mutex B", "Mutex C"]):
        wfg.add_resource(Resource(id=i, name=name))

    # Simulate a deadlock scenario:
    #   P0 acquires R0, then requests R1
    #   P1 acquires R1, then requests R2
    #   P2 acquires R2, then requests R0
    print("=== Initial allocations ===")
    wfg.allocate(0, 0)  # P0 gets R0 (Mutex A)
    wfg.allocate(1, 1)  # P1 gets R1 (Mutex B)
    wfg.allocate(2, 2)  # P2 gets R2 (Mutex C)
    wfg.print_graph()

    print("=== Creating deadlock ===")
    wfg.allocate(0, 1)  # P0 wants R1 (held by P1) -> P0 waits for P1
    wfg.allocate(1, 2)  # P1 wants R2 (held by P2) -> P1 waits for P2
    wfg.allocate(2, 0)  # P2 wants R0 (held by P0) -> P2 waits for P0
    wfg.print_graph()
    print("Cycle: P0 -> P1 -> P2 -> P0 [DEADLOCK!]")


if __name__ == "__main__":
    demonstrate_wait_for_graph()
```

Why Cycles Mean Deadlock (for Single Instances):
Suppose we have a cycle: P₀ → P₁ → P₂ → ... → Pₖ → P₀
This means: P₀ is waiting for a resource held by P₁, P₁ is waiting for a resource held by P₂, and so on, until Pₖ, which is waiting for a resource held by P₀.

No process in this cycle can make progress: each one is blocked until its successor releases a resource, and every successor is itself blocked waiting on the next.
Since each resource has only one instance, there's no chance of another instance becoming available. The cycle IS the deadlock.
This perfect correspondence between cycles and deadlocks holds ONLY for single-instance resources. With multiple instances, a cycle is necessary but not sufficient—the allocation of specific instances matters. We'll explore this distinction when we cover multiple-instance detection.
The most intuitive approach to finding cycles is depth-first search (DFS). The algorithm colors nodes to track their exploration state and identifies cycles via "back edges"—edges pointing to an ancestor in the current DFS tree.
The Three-Color Algorithm:

- WHITE — the node has not been visited yet.
- GRAY — the node is on the current DFS path (exploration in progress).
- BLACK — the node and all of its descendants are fully explored.
A back edge exists when we encounter a GRAY node during DFS—meaning we've found a path back to an ancestor, completing a cycle.
```python
from enum import IntEnum
from typing import Dict, Set, List, Optional
from dataclasses import dataclass


class Color(IntEnum):
    """DFS node coloring for cycle detection."""
    WHITE = 0  # Not yet visited
    GRAY = 1   # Currently in DFS stack (being explored)
    BLACK = 2  # Fully explored


@dataclass
class CycleDetectionResult:
    """Result of cycle detection algorithm."""
    has_cycle: bool
    cycle: List[int]      # Nodes in the cycle (if found)
    deadlocked: Set[int]  # All nodes involved in some cycle


def detect_cycles_dfs(adjacency: Dict[int, Set[int]]) -> CycleDetectionResult:
    """
    Detect cycles in a directed graph using DFS with three-color marking.

    Time Complexity: O(V + E) where V = vertices, E = edges
    Space Complexity: O(V) for color array and recursion stack

    Args:
        adjacency: Mapping from node ID to set of neighbor node IDs

    Returns:
        CycleDetectionResult with cycle information
    """
    if not adjacency:
        return CycleDetectionResult(has_cycle=False, cycle=[], deadlocked=set())

    color: Dict[int, Color] = {node: Color.WHITE for node in adjacency}
    parent: Dict[int, Optional[int]] = {node: None for node in adjacency}
    cycle_found: List[int] = []
    all_in_cycles: Set[int] = set()

    def dfs(node: int) -> bool:
        """DFS from a node, returning True if a cycle is found."""
        color[node] = Color.GRAY

        for neighbor in adjacency.get(node, set()):
            if color[neighbor] == Color.GRAY:
                # Back edge found! This is a cycle.
                # Reconstruct the cycle: neighbor -> ... -> node -> neighbor
                nonlocal cycle_found
                cycle_found = reconstruct_cycle(node, neighbor, parent)
                all_in_cycles.update(cycle_found)
                return True
            elif color[neighbor] == Color.WHITE:
                parent[neighbor] = node
                if dfs(neighbor):
                    return True

        color[node] = Color.BLACK
        return False

    def reconstruct_cycle(current: int, cycle_start: int,
                          parent_map: Dict[int, Optional[int]]) -> List[int]:
        """
        Reconstruct the cycle path from cycle_start back to cycle_start.

        When we find a back edge current -> cycle_start, the cycle is:
        cycle_start -> ... -> current -> cycle_start
        """
        cycle = [current]
        node = current
        while parent_map[node] != cycle_start and parent_map[node] is not None:
            node = parent_map[node]
            cycle.append(node)
        cycle.append(cycle_start)
        cycle.reverse()  # Start from cycle_start
        return cycle

    # Run DFS from each unvisited node
    for start_node in adjacency:
        if color[start_node] == Color.WHITE:
            if dfs(start_node):
                # Found a cycle starting from this component
                pass  # Continue to find all cycles

    return CycleDetectionResult(
        has_cycle=len(cycle_found) > 0,
        cycle=cycle_found,
        deadlocked=all_in_cycles
    )


def find_all_cycles(adjacency: Dict[int, Set[int]]) -> List[List[int]]:
    """
    Find ALL cycles in the graph, not just one.

    Uses Johnson's algorithm concept: for each node, find all cycles
    that include that node as the smallest-indexed node. This is more
    expensive but provides complete information.
    """
    all_cycles: List[List[int]] = []
    nodes = sorted(adjacency.keys())

    for start in nodes:
        # Find cycles starting from 'start' that only go through
        # nodes >= start (to avoid counting the same cycle multiple times)
        blocked: Set[int] = set()
        path: List[int] = []

        def circuit(v: int) -> bool:
            found_cycle = False
            path.append(v)
            blocked.add(v)

            for w in adjacency.get(v, set()):
                if w == start:
                    # Found a cycle
                    all_cycles.append(path.copy())
                    found_cycle = True
                elif w not in blocked and w > start:
                    if circuit(w):
                        found_cycle = True

            if found_cycle:
                blocked.discard(v)
            path.pop()
            return found_cycle

        circuit(start)

    return all_cycles


# Demonstration
def demo_cycle_detection():
    print("=== Single Cycle ===")
    # P0 -> P1 -> P2 -> P0
    graph1 = {0: {1}, 1: {2}, 2: {0}}
    result1 = detect_cycles_dfs(graph1)
    print(f"Has cycle: {result1.has_cycle}")
    print(f"Cycle: {result1.cycle}")
    print(f"Deadlocked processes: {result1.deadlocked}")

    print("=== No Cycle ===")
    # P0 -> P1 -> P2 (no back edge)
    graph2 = {0: {1}, 1: {2}, 2: set()}
    result2 = detect_cycles_dfs(graph2)
    print(f"Has cycle: {result2.has_cycle}")

    print("=== Multiple Cycles ===")
    # P0 -> P1 -> P0 (cycle 1)
    # P2 -> P3 -> P2 (cycle 2, disconnected)
    graph3 = {0: {1}, 1: {0}, 2: {3}, 3: {2}}
    all_cycles = find_all_cycles(graph3)
    print(f"All cycles found: {all_cycles}")

    print("=== Complex Graph ===")
    # 0 -> 1 -> 2
    # ^    |    |
    # |    v    |
    # 4 <- 3 <--+
    graph4 = {0: {1}, 1: {2, 3}, 2: {3}, 3: {4}, 4: {0}}
    result4 = detect_cycles_dfs(graph4)
    print(f"Has cycle: {result4.has_cycle}")
    print(f"One cycle: {result4.cycle}")


if __name__ == "__main__":
    demo_cycle_detection()
```

Algorithm Correctness Proof:
We prove that a GRAY node encounter indicates a cycle:
Gray nodes are ancestors: When DFS is exploring node u (colored GRAY), every node on the path from the DFS root to u is also GRAY—they are all still on the recursion stack.
Back edge creates cycle: If we're at node v, exploring edge v → w, and w is GRAY, then w is an ancestor of v. The path w → ... → v plus edge v → w forms a cycle.
Completeness: Any cycle in the graph will be detected because DFS explores all edges. When we traverse an edge that's part of a cycle, we'll find a back edge.
Black nodes are safe: If w is BLACK, we've fully explored w and found no cycle through it. Edge v → w doesn't create a new cycle because any cycle containing this edge would have been detected during w's exploration.
For production systems with potentially deep recursion, convert the recursive DFS to an iterative version using an explicit stack. This prevents stack overflow for graphs with thousands of nodes or long chains. The logic remains identical—we just manage the stack manually.
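A minimal sketch of that conversion, using the same adjacency-dict representation as above: each stack frame pairs a node with an iterator over its not-yet-explored neighbors, so a frame resumes exactly where it left off after a descent.

```python
from enum import IntEnum
from typing import Dict, Set


class Color(IntEnum):
    WHITE = 0  # Not yet visited
    GRAY = 1   # On the current DFS path
    BLACK = 2  # Fully explored


def has_cycle_iterative(adjacency: Dict[int, Set[int]]) -> bool:
    """Three-color cycle detection without recursion (no stack-overflow risk)."""
    color = {node: Color.WHITE for node in adjacency}

    for start in adjacency:
        if color[start] != Color.WHITE:
            continue
        # Each frame is (node, iterator over its unexplored neighbors)
        stack = [(start, iter(adjacency.get(start, ())))]
        color[start] = Color.GRAY
        while stack:
            node, neighbors = stack[-1]
            advanced = False
            for neighbor in neighbors:
                if color[neighbor] == Color.GRAY:
                    return True  # Back edge: cycle found
                if color[neighbor] == Color.WHITE:
                    color[neighbor] = Color.GRAY
                    stack.append((neighbor, iter(adjacency.get(neighbor, ()))))
                    advanced = True
                    break  # Descend before finishing this frame
            if not advanced:
                color[node] = Color.BLACK  # All neighbors explored
                stack.pop()
    return False
```

Because the explicit stack lives on the heap, this version handles wait chains far deeper than Python's default recursion limit (~1000 frames) without any tuning.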
An alternative to DFS is the topological sort approach using Kahn's algorithm. The insight: a directed acyclic graph (DAG) can be topologically sorted, but a graph with cycles cannot. If topological sort fails to include all nodes, cycles exist.
Kahn's Algorithm Insight:
Topological sort works by repeatedly removing nodes with no incoming edges (in-degree = 0). In a DAG, we can always find such a node until the graph is empty. In a graph with cycles, we'll reach a point where all remaining nodes have incoming edges—they're in cycles.
```python
from collections import deque
from typing import Dict, Set, List
from dataclasses import dataclass


@dataclass
class KahnDetectionResult:
    """Result of Kahn's algorithm-based deadlock detection."""
    has_deadlock: bool
    deadlocked: Set[int]   # Processes in deadlock
    safe_order: List[int]  # Safe processes in completion order


def detect_deadlock_kahn(adjacency: Dict[int, Set[int]]) -> KahnDetectionResult:
    """
    Detect deadlock using Kahn's topological sort algorithm.

    Key insight for wait-for graphs: a process with out-degree 0 is not
    waiting for anyone, so it can run to completion. When it completes
    and releases its resources, processes waiting for it may unblock.
    We therefore repeatedly remove nodes with out-degree 0; any node
    that can never be removed is stuck.

    Time Complexity: O(V + E)
    Space Complexity: O(V)
    """
    if not adjacency:
        return KahnDetectionResult(has_deadlock=False, deadlocked=set(), safe_order=[])

    # Out-degree = number of processes this process is waiting for
    out_degree = {node: len(neighbors) for node, neighbors in adjacency.items()}

    # Processes with out-degree 0 are not waiting - they can complete
    queue = deque(node for node in adjacency if out_degree[node] == 0)

    safe_order: List[int] = []
    removed: Set[int] = set()

    while queue:
        # This process can complete
        process = queue.popleft()
        safe_order.append(process)
        removed.add(process)

        # When this process completes, anyone waiting for it might now proceed
        for waiting_process, waits_for in adjacency.items():
            if process in waits_for and waiting_process not in removed:
                out_degree[waiting_process] -= 1
                if out_degree[waiting_process] == 0:
                    queue.append(waiting_process)

    # Any node not removed is either in a cycle or transitively waiting
    # on one - either way, it can never proceed
    deadlocked = set(adjacency.keys()) - removed

    return KahnDetectionResult(
        has_deadlock=len(deadlocked) > 0,
        deadlocked=deadlocked,
        safe_order=safe_order
    )


def detect_deadlock_reduction(adjacency: Dict[int, Set[int]]) -> KahnDetectionResult:
    """
    Alternative formulation: graph reduction algorithm.

    1. Find all nodes with out-degree 0 (not waiting)
    2. Remove them and all edges to them
    3. Repeat until no more removals are possible
    4. Remaining nodes are deadlocked

    This is conceptually identical to Kahn's but framed as reduction.
    """
    # Make a working copy
    remaining = {node: neighbors.copy() for node, neighbors in adjacency.items()}
    safe_order: List[int] = []

    changed = True
    while changed:
        changed = False
        # Find nodes with out-degree 0
        removable = [node for node, neighbors in remaining.items()
                     if len(neighbors) == 0]
        for node in removable:
            safe_order.append(node)
            del remaining[node]
            changed = True
            # Remove this node from all neighbor sets
            for neighbors in remaining.values():
                neighbors.discard(node)

    return KahnDetectionResult(
        has_deadlock=len(remaining) > 0,
        deadlocked=set(remaining.keys()),
        safe_order=safe_order
    )


# Demonstration
def demo_kahn_detection():
    print("=== Deadlock Scenario ===")
    # P0 -> P1 -> P2 -> P0 (cycle)
    g1 = {0: {1}, 1: {2}, 2: {0}}
    result1 = detect_deadlock_kahn(g1)
    print(f"Has deadlock: {result1.has_deadlock}")
    print(f"Deadlocked: {result1.deadlocked}")
    print(f"Safe processes: {result1.safe_order}")

    print("=== Some Safe, Some Deadlocked ===")
    # P1 -> P2 -> P1 (P1, P2 in cycle)
    # P0 -> P1 and P3 -> P1 (both depend on deadlocked P1)
    # P4 not waiting (P4 is safe)
    g2 = {0: {1}, 1: {2}, 2: {1}, 3: {1}, 4: set()}
    result2 = detect_deadlock_kahn(g2)
    print(f"Has deadlock: {result2.has_deadlock}")
    print(f"Deadlocked: {result2.deadlocked}")
    print(f"Safe processes: {result2.safe_order}")

    print("=== Reduction Algorithm ===")
    result3 = detect_deadlock_reduction(g2)
    print(f"Has deadlock: {result3.has_deadlock}")
    print(f"Deadlocked: {result3.deadlocked}")
    print(f"Safe processes: {result3.safe_order}")


if __name__ == "__main__":
    demo_kahn_detection()
```

Comparison: DFS vs Kahn's Algorithm:
| Aspect | DFS (Three-Color) | Kahn's Algorithm |
|---|---|---|
| Time Complexity | O(V + E) | O(V + E) |
| Space Complexity | O(V) + recursion stack | O(V) + queue |
| Output | One cycle path | All deadlocked nodes |
| Early termination | Stops at first cycle | Processes entire graph |
| Recursive? | Yes (or explicit stack) | No (iterative) |
| Additional info | Cycle path for debugging | Safe execution order |
| Best for | Quick detection | Full analysis |
DFS is better when you just need to know IF there's a deadlock and want to stop early. Kahn's algorithm is better when you need to identify ALL deadlocked processes and want the safe completion order for non-deadlocked processes. Most production systems use DFS for its simplicity and early termination.
A production deadlock detector must handle real-world concerns: thread safety, efficient updates, minimal locking, and detailed reporting. Here's a production-quality implementation:
```cpp
#include <vector>
#include <unordered_map>
#include <unordered_set>
#include <mutex>
#include <shared_mutex>
#include <optional>
#include <chrono>
#include <atomic>
#include <memory>      // std::shared_ptr
#include <functional>  // std::function
#include <cstdint>     // uint64_t

/**
 * Production-grade single-instance deadlock detector.
 *
 * Features:
 * - Lock-free graph reads via copy-on-write
 * - Concurrent reads during detection
 * - Detailed cycle reporting
 * - Performance metrics
 */
class DeadlockDetector {
public:
    using ProcessId = uint64_t;
    using ResourceId = uint64_t;

    struct DetectionResult {
        bool has_deadlock;
        std::vector<ProcessId> cycle;           // Cycle path
        std::vector<ProcessId> all_deadlocked;  // All affected processes
        std::chrono::nanoseconds detection_time;
    };

    struct ResourceHold {
        ResourceId resource;
        ProcessId holder;
    };

private:
    // Graph state (copy-on-write for lock-free reads)
    struct GraphState {
        std::unordered_map<ProcessId, std::unordered_set<ProcessId>> wait_for;
        std::unordered_map<ResourceId, ProcessId> resource_holders;
        std::unordered_map<ProcessId, std::optional<ResourceId>> process_waiting;
        uint64_t version = 0;
    };

    std::shared_ptr<GraphState> current_state_;
    mutable std::shared_mutex state_mutex_;

    // Metrics (mutable: updated from the const detect() method)
    mutable std::atomic<uint64_t> detection_count_{0};
    mutable std::atomic<uint64_t> deadlock_count_{0};
    mutable std::atomic<uint64_t> total_detection_ns_{0};

public:
    DeadlockDetector() : current_state_(std::make_shared<GraphState>()) {}

    /**
     * Record that a process now holds a resource.
     * Thread-safe.
     */
    void record_acquire(ProcessId process, ResourceId resource) {
        std::unique_lock lock(state_mutex_);

        // Copy-on-write
        auto new_state = std::make_shared<GraphState>(*current_state_);
        new_state->resource_holders[resource] = process;
        new_state->process_waiting[process] = std::nullopt;
        new_state->version++;

        rebuild_edges_for(process, *new_state);
        current_state_ = new_state;
    }

    /**
     * Record that a process released a resource.
     * Thread-safe.
     */
    void record_release(ProcessId process, ResourceId resource) {
        std::unique_lock lock(state_mutex_);

        auto new_state = std::make_shared<GraphState>(*current_state_);
        new_state->resource_holders.erase(resource);
        new_state->version++;

        // Rebuild edges for all processes waiting for this resource
        for (auto& [pid, waiting_for] : new_state->process_waiting) {
            if (waiting_for && *waiting_for == resource) {
                rebuild_edges_for(pid, *new_state);
            }
        }
        current_state_ = new_state;
    }

    /**
     * Record that a process is blocked waiting for a resource.
     * Thread-safe.
     */
    void record_block(ProcessId process, ResourceId resource) {
        std::unique_lock lock(state_mutex_);

        auto new_state = std::make_shared<GraphState>(*current_state_);
        new_state->process_waiting[process] = resource;
        new_state->version++;

        rebuild_edges_for(process, *new_state);
        current_state_ = new_state;
    }

    /**
     * Detect deadlock in the current graph state.
     * Thread-safe; does not block writers while traversing.
     */
    DetectionResult detect() const {
        auto start = std::chrono::high_resolution_clock::now();

        // Get snapshot (lock-free read after this)
        std::shared_ptr<GraphState> snapshot;
        {
            std::shared_lock lock(state_mutex_);
            snapshot = current_state_;
        }

        DetectionResult result;
        result.has_deadlock = false;

        // DFS-based cycle detection
        enum Color { WHITE, GRAY, BLACK };
        std::unordered_map<ProcessId, Color> color;
        std::unordered_map<ProcessId, ProcessId> parent;

        for (const auto& [pid, _] : snapshot->wait_for) {
            color[pid] = WHITE;
        }

        std::function<bool(ProcessId)> dfs = [&](ProcessId node) -> bool {
            color[node] = GRAY;

            auto it = snapshot->wait_for.find(node);
            if (it != snapshot->wait_for.end()) {
                for (ProcessId neighbor : it->second) {
                    if (color[neighbor] == GRAY) {
                        // Back edge - cycle found!
                        result.has_deadlock = true;
                        // Reconstruct cycle: walk parents from node back
                        // to neighbor (the cycle entry point)
                        result.cycle.push_back(node);
                        ProcessId curr = node;
                        while (curr != neighbor) {
                            curr = parent[curr];
                            result.cycle.push_back(curr);
                        }
                        return true;
                    }
                    if (color[neighbor] == WHITE) {
                        parent[neighbor] = node;
                        if (dfs(neighbor)) return true;
                    }
                }
            }

            color[node] = BLACK;
            return false;
        };

        for (const auto& [pid, _] : snapshot->wait_for) {
            if (color[pid] == WHITE) {
                if (dfs(pid)) break;
            }
        }

        // Collect all deadlocked (still GRAY when detection stopped)
        if (result.has_deadlock) {
            for (const auto& [pid, c] : color) {
                if (c == GRAY) {
                    result.all_deadlocked.push_back(pid);
                }
            }
        }

        auto end = std::chrono::high_resolution_clock::now();
        result.detection_time =
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);

        // Update metrics
        detection_count_++;
        if (result.has_deadlock) deadlock_count_++;
        total_detection_ns_ += result.detection_time.count();

        return result;
    }

    /**
     * Get detection statistics.
     */
    struct Stats {
        uint64_t total_detections;
        uint64_t deadlocks_found;
        double avg_detection_us;
    };

    Stats get_stats() const {
        Stats s;
        s.total_detections = detection_count_.load();
        s.deadlocks_found = deadlock_count_.load();
        s.avg_detection_us = s.total_detections > 0
            ? (total_detection_ns_.load() / 1000.0) / s.total_detections
            : 0.0;
        return s;
    }

private:
    void rebuild_edges_for(ProcessId pid, GraphState& state) {
        state.wait_for[pid].clear();

        auto waiting_it = state.process_waiting.find(pid);
        if (waiting_it != state.process_waiting.end() && waiting_it->second) {
            ResourceId resource = *waiting_it->second;
            auto holder_it = state.resource_holders.find(resource);
            if (holder_it != state.resource_holders.end() &&
                holder_it->second != pid) {
                state.wait_for[pid].insert(holder_it->second);
            }
        }
    }
};
```

Key Production Features:
Copy-on-Write State: Graph updates create a new state object. Readers always see a consistent snapshot without blocking writers.
Shared Mutex: Multiple detection threads can run concurrently (shared lock). Updates take the exclusive lock, copy the state, and publish it with a pointer swap—readers never observe a half-updated graph, at the cost of an O(graph) copy per update.
Metrics Collection: Track detection count, deadlock count, and average detection time for monitoring and alerting.
Detailed Results: Return not just "has deadlock" but the cycle path and all affected processes for debugging and recovery.
Nanosecond Timing: Precise timing allows tuning detection frequency based on actual overhead.
In highly concurrent systems, be careful of ABA problems where state changes back to an earlier configuration between your check and action. The version number in our GraphState helps detect such issues. Always validate that the state you're acting on is still current.
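A minimal sketch of that validate-before-acting pattern. The `VersionedGraph` class and the victim-selection policy here are illustrative (not part of the detector above); the point is the version comparison between snapshot and action:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class VersionedGraph:
    """Toy wait-for graph with a version counter bumped on every mutation."""
    adjacency: Dict[int, Set[int]] = field(default_factory=dict)
    version: int = 0

    def snapshot(self) -> Tuple[Dict[int, Set[int]], int]:
        # Copy so later mutations don't affect the detector's view
        return ({n: s.copy() for n, s in self.adjacency.items()}, self.version)

    def add_edge(self, waiter: int, holder: int) -> None:
        self.adjacency.setdefault(waiter, set()).add(holder)
        self.adjacency.setdefault(holder, set())
        self.version += 1


def find_cycle(adjacency: Dict[int, Set[int]]) -> Optional[List[int]]:
    """Return one cycle as a list of pids, or None (simple recursive DFS)."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in adjacency}

    def dfs(n: int, path: List[int]) -> Optional[List[int]]:
        color[n] = GRAY
        path.append(n)
        for m in adjacency.get(n, ()):
            if color.get(m) == GRAY:
                return path[path.index(m):]  # Slice out just the cycle
            if color.get(m, WHITE) == WHITE and (c := dfs(m, path)):
                return c
        path.pop()
        color[n] = BLACK
        return None

    for n in list(adjacency):
        if color[n] == WHITE and (c := dfs(n, [])):
            return c
    return None


def pick_victim_if_current(graph: VersionedGraph) -> Optional[int]:
    """Detect on a snapshot; act only if the graph hasn't changed since."""
    adjacency, seen_version = graph.snapshot()
    cycle = find_cycle(adjacency)
    if cycle is None:
        return None
    if graph.version != seen_version:
        return None  # Stale snapshot: the cycle may already be gone; retry
    return max(cycle)  # Victim policy (example): highest pid
```

Returning `None` on a stale snapshot rather than acting anyway is the conservative choice: a missed detection is caught on the next pass, but aborting a victim for a cycle that no longer exists is unrecoverable.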
Let's examine how single-instance deadlock detection applies to common scenarios.
Scenario: Database transactions acquiring exclusive locks on individual rows.
Each row lock is a single-instance resource (only one transaction can hold it). This is a perfect fit for wait-for graph detection.
```sql
-- Transaction T1
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- Acquires lock on row 1
-- ... some work ...
UPDATE accounts SET balance = balance + 100 WHERE id = 2;  -- Wants lock on row 2
-- BLOCKED if T2 holds lock on row 2

-- Transaction T2 (concurrent)
BEGIN;
UPDATE accounts SET balance = balance - 50 WHERE id = 2;   -- Acquires lock on row 2
-- ... some work ...
UPDATE accounts SET balance = balance + 50 WHERE id = 1;   -- Wants lock on row 1
-- BLOCKED if T1 holds lock on row 1

-- Wait-For Graph:
--   T1 -> T2  (T1 waiting for row 2, held by T2)
--   T2 -> T1  (T2 waiting for row 1, held by T1)
--   CYCLE: T1 -> T2 -> T1  [DEADLOCK]

-- Database detects this and aborts one transaction:
-- ERROR:  deadlock detected
-- DETAIL: Process 12345 waits for ShareLock on transaction 67890
--         Process 67890 waits for ShareLock on transaction 12345
-- HINT:   See server log for query details.
```

Most modern databases (PostgreSQL, MySQL, SQL Server) implement wait-for graph deadlock detection for row locks. When deadlock is detected, one transaction is chosen as the 'victim' and aborted. The application receives an error and can retry. This is the detection + recovery pattern in action.
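On the application side, the detection + recovery pattern implies a retry loop around each transaction. This is a sketch: `DeadlockError` stands in for whatever specific exception your database driver raises when a transaction is chosen as the victim.

```python
import random
import time


class DeadlockError(Exception):
    """Stand-in for the driver-specific deadlock/victim error (hypothetical)."""


def run_with_deadlock_retry(run_transaction, max_attempts: int = 5):
    """
    Retry a transaction that may be aborted as a deadlock victim.

    `run_transaction` is a callable performing one full attempt
    (BEGIN ... COMMIT) and raising DeadlockError if the database
    aborted it to break a cycle.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return run_transaction()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # Give up: surface the error to the caller
            # Exponential backoff with jitter, so the two competing
            # transactions don't re-run in lockstep and deadlock again
            time.sleep((2 ** attempt) * 0.01 * (0.5 + random.random()))
```

The jitter matters: if both victims retry after identical delays, they tend to re-acquire locks in the same interleaving and deadlock repeatedly.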
Understanding the computational complexity helps tune detection frequency and resource budgets.
| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Build wait-for graph | O(P × R) | O(P + E) | P=processes, R=resources, E=edges |
| DFS cycle detection | O(P + E) | O(P) | Linear in graph size |
| Kahn's algorithm | O(P + E) | O(P) | Same as DFS |
| Find ALL cycles | O((P+E)(C+1)) | O(P + E) | C = number of cycles (Johnson's) |
| Edge insertion (incremental) | O(P) | O(1) | Check for cycle through new edge |
| Edge deletion (incremental) | O(E) | O(P) | May need re-validation |
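The O(P) incremental insertion check exploits a single-instance invariant: a blocked process waits on at most one resource, so every node has at most one outgoing edge, and adding edge u → v can only create a cycle if the unique wait chain starting at v leads back to u. A sketch, assuming `waits_for` maps each blocked pid to the single pid it waits on:

```python
from typing import Dict, Optional


def creates_cycle(waits_for: Dict[int, int], u: int, v: int) -> bool:
    """
    Would adding wait edge u -> v close a cycle?

    Because each node has at most one outgoing edge, we simply follow
    the chain from v; if it reaches u, the new edge completes a loop.
    O(P) worst case, no graph rebuild or full traversal needed.
    """
    node: Optional[int] = v
    while node is not None:
        if node == u:
            return True  # Chain from v leads back to u: cycle
        node = waits_for.get(node)
    return False
```

This is why single-instance detectors can afford to check on every block operation rather than on a timer: the check touches only the chain the new edge extends.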
Practical Performance:

For typical systems, the wait-for graph is small and sparse: each blocked process contributes at most one outgoing edge, so a graph of a few thousand processes has at most a few thousand edges, and a full detection pass completes in well under a millisecond.
Scaling Considerations:

For very large systems (100,000+ processes), consider:

- Incremental detection: when a new wait edge is added, check only whether it closes a cycle instead of rescanning the whole graph.
- Partitioning: shard the graph by lock domain or subsystem—cycles can only form among processes that share resources.
- Background detection: run full scans on a dedicated thread at an adaptive interval, increasing frequency only when waits grow long.
Don't overthink detection overhead. Modern CPUs can traverse graphs with millions of edges per second. For a system with 1000 processes and sparse wait-for relationships, detection takes ~10-100 microseconds. The cost of NOT detecting (processes stuck forever) far exceeds detection overhead.
Production implementations must handle edge cases that simple algorithms overlook.
```python
import time


def detect_deadlock_robust(adjacency: dict,
                           blocked_since: dict,
                           timeout_threshold: float) -> dict:
    """
    Robust detection handling common edge cases.

    Args:
        adjacency: Wait-for graph (pid -> set of pids waited for)
        blocked_since: When each waiting process started waiting (pid -> timestamp)
        timeout_threshold: Seconds before considering a wait pathological

    Returns:
        Dict with: has_deadlock, cycle, pending_timeouts, recommendations
    """
    result = {
        'has_deadlock': False,
        'cycle': [],
        'pending_timeouts': [],  # Processes about to time out
        'recommendations': []
    }

    # Edge case 1: Filter out processes not actually waiting
    active_waiters = {pid: waits for pid, waits in adjacency.items()
                      if waits}  # Non-empty wait set

    # Edge case 2: Self-loops (shouldn't happen, but catch it)
    for pid, waits in active_waiters.items():
        if pid in waits:
            result['has_deadlock'] = True
            result['cycle'] = [pid, pid]
            result['recommendations'].append(
                f"CRITICAL: Process {pid} waiting for itself. Bug in application."
            )
            return result

    # Edge case 3: Check for processes close to timeout
    now = time.time()
    for pid in active_waiters:
        blocked_duration = now - blocked_since.get(pid, now)
        if blocked_duration > timeout_threshold * 0.8:  # 80% of timeout
            result['pending_timeouts'].append(pid)

    # Main detection (DFS)
    # ... standard DFS cycle detection ...

    # Edge case 4: Distinguish timeout candidates from true deadlock
    if result['cycle']:
        # Check if any cycle participant will soon time out
        cycle_set = set(result['cycle'])
        timeout_set = set(result['pending_timeouts'])
        if cycle_set & timeout_set:
            result['recommendations'].append(
                f"Deadlock may self-resolve: processes {cycle_set & timeout_set} "
                f"near timeout. Consider waiting briefly before recovery."
            )

    return result
```

Every edge case listed above has caused production outages. Unit test each scenario. Inject faults (crashed processes, slow responses) into integration tests. A detection algorithm that works on simple cases but fails on edge cases is worse than no algorithm—it gives false confidence.
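As a sketch of that testing advice, the edge-case tests can be direct assertions against whichever detector you deploy. Here a small Kahn-style reduction stands in as the test subject:

```python
from typing import Dict, Set


def deadlocked_pids(adjacency: Dict[int, Set[int]]) -> Set[int]:
    """Reference detector (graph reduction) used as the test subject here."""
    remaining = {n: s.copy() for n, s in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for n in [n for n, s in remaining.items() if not s]:
            del remaining[n]
            for s in remaining.values():
                s.discard(n)
            changed = True
    return set(remaining)


def test_empty_graph():
    assert deadlocked_pids({}) == set()


def test_self_loop():
    # A process "waiting for itself" is an application bug, but the
    # detector must still flag it rather than loop or crash
    assert deadlocked_pids({0: {0}}) == {0}


def test_chain_is_not_deadlock():
    assert deadlocked_pids({0: {1}, 1: {2}, 2: set()}) == set()


def test_waiter_on_cycle_is_stuck_too():
    # P3 isn't in the cycle, but it waits on it, so it never completes either
    assert deadlocked_pids({1: {2}, 2: {1}, 3: {1}}) == {1, 2, 3}


if __name__ == "__main__":
    for t in (test_empty_graph, test_self_loop,
              test_chain_is_not_deadlock, test_waiter_on_cycle_is_stuck_too):
        t()
    print("all edge-case tests passed")
```

The same four cases translate directly to pytest functions or to fault-injection scenarios in an integration suite.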
We've comprehensively covered deadlock detection for single-instance resources—the foundational case that underpins more complex scenarios.
What's Next:
The next page addresses the more complex case: multiple-instance detection. When resources have multiple instances (e.g., 5 printers, 10 database connections), a cycle in the wait-for graph is necessary but not sufficient for deadlock. We'll learn the Banker's-style reduction algorithm that handles this case.
You now have complete mastery of single-instance deadlock detection. You can implement production-quality detectors, understand their complexity, handle edge cases, and apply them to real systems like databases and multi-threaded applications. This is the foundational skill for all deadlock detection work.