In the realm of operating systems, waiting is an inescapable reality. Processes frequently need to pause their execution until certain conditions are met—a lock becomes available, data arrives from disk, a message is received from another process, or a timer expires. The question is not whether processes will wait, but how they should wait.
We've previously explored spinlocks and busy-waiting, where a process continuously polls a condition in a tight loop. While simple and sometimes appropriate, busy-waiting wastes CPU cycles—cycles that could power other productive work. For long waits, this approach is catastrophically inefficient.
The sleep mechanism offers an elegant alternative: a process that cannot proceed voluntarily yields the CPU and enters a dormant state. The kernel removes it from the run queue and records that it's waiting for a specific event. The process consumes no CPU resources while sleeping. When the awaited condition materializes, a corresponding wakeup operation restores the process to runnability.
This page provides an exhaustive exploration of the sleep mechanism—the fundamental primitive that transforms wasteful spinning into efficient blocking synchronization.
By the end of this page, you will understand: the fundamental concept and motivation behind sleep; the internal kernel data structures and operations involved; the distinction between interruptible and uninterruptible sleep states; how sleep integrates with process scheduling; and the semantic contract that sleep establishes between a process and the operating system.
Before diving into sleep, let's crystallize why we need it by examining the costs of its absence.
Active waiting (busy-waiting or spinning) occurs when a process repeatedly checks a condition in a loop:
while (!condition) {
// Do nothing, just keep checking
}
This approach has several critical problems: it burns CPU cycles that could power other productive work, it keeps the core active and wastes energy, and it degrades as the number of spinning waiters grows.
On a uniprocessor system with preemptive scheduling, spinning is particularly pathological. The spinner and the process that would set the condition share the same CPU. Every cycle the spinner uses is a cycle unavailable to the process that could end the wait. The spinner literally prolongs its own wait.
When spinning is acceptable:
Despite its flaws, spinning has legitimate uses: waits expected to last only a few microseconds (shorter than a context switch); multiprocessor systems where the process that will set the condition is running on another CPU; and contexts where sleeping is forbidden, such as interrupt handlers or code holding a spinlock.
But for general synchronization—waiting for I/O, waiting for locks held for arbitrary durations, waiting for conditions set by other processes—spinning is unacceptable. We need blocking primitives that don't consume CPU while waiting.
The sleep mechanism provides a fundamentally different waiting strategy. Instead of actively polling, a process declares:
"I cannot make progress until event X occurs. Remove me from the CPU and wake me when X happens."
This is a voluntary abdication of the processor. The process explicitly requests to be suspended. The kernel:
1. Marks the process as sleeping rather than runnable.
2. Removes it from the run queue.
3. Records on a wait queue which event it is awaiting.
4. Context-switches the CPU to another process.
The sleeping process now consumes zero CPU cycles. It exists only as a data structure in memory, waiting for its wakeup call.
| Aspect | Busy-Waiting (Spinning) | Blocking (Sleeping) |
|---|---|---|
| CPU usage while waiting | 100% of allocated time | 0% |
| Response latency | Immediate (already running) | ~1-10μs (context switch) |
| Appropriate wait duration | < 1-10 microseconds | > 10 microseconds |
| Energy efficiency | Poor (core stays active) | Excellent (core can idle) |
| Scalability under contention | Degrades linearly with waiters | Efficient regardless of waiters |
| Implementation complexity | Simple (just a loop) | Complex (kernel involvement) |
| Progress guarantee | May livelock on uniprocessor | Guaranteed (waiter yields CPU) |
The semantic contract of sleep:
When a process calls a sleep primitive, it establishes a contract with the kernel:
- The process promises to re-verify the awaited condition after every wakeup.
- The kernel promises to take the process off the CPU, preserve its state, and make it runnable again when a wakeup arrives.
- The waker (another process or an interrupt handler) promises to issue a wakeup once the event occurs.
If any party violates this contract, problems arise. If the wakeup never comes, the process sleeps forever. If the wakeup comes before sleep, the process may miss it (the lost wakeup problem we'll cover later). The safety of sleep depends on careful coordination between all three participants.
Sleep embodies a fundamental OS design principle: processes should not consume resources they don't need. A waiting process needs only memory for its state; the CPU should serve active work. Sleep transforms a waiting process from a CPU consumer into a memory resident, maximizing overall system throughput.
To understand sleep precisely, we must examine how it fits into the process state model. Operating systems track each process's execution state using a finite state machine. While implementations vary, the core states relevant to sleep are:
Running (TASK_RUNNING in Linux): The process is either currently executing on a CPU or ready to execute (in the run queue). The scheduler considers it for CPU allocation.
Sleeping/Waiting: The process has voluntarily suspended. It's not in the run queue. The scheduler ignores it until something triggers a wakeup. This state typically has two variants:
Interruptible Sleep (TASK_INTERRUPTIBLE): The process can be awakened by the awaited event or by a signal. Most sleeps are interruptible—if a user presses Ctrl+C, the sleeping process should wake to handle SIGINT.
Uninterruptible Sleep (TASK_UNINTERRUPTIBLE): The process can only be awakened by the specific event it's waiting for. Signals are ignored. This is used for waits that cannot be safely interrupted, like certain I/O operations mid-transaction.
Stopped/Suspended: The process has been explicitly stopped (e.g., SIGSTOP). Different from sleep—this is externally imposed, not voluntarily entered for condition-waiting.
Zombie: The process has terminated but its parent hasn't collected its exit status. Irrelevant to sleep mechanics.
The transition mechanics:
When a process enters sleep:
1. It sets its state to sleeping (TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE).
2. It adds itself to the wait queue associated with the awaited event.
3. It calls schedule() to context-switch to another process.

The sleeping process's kernel stack is preserved. Its user-space state (registers, memory mappings) remains intact. It simply doesn't execute.
When wakeup occurs:
1. The kernel sets the process's state back to TASK_RUNNING.
2. It removes the process from the wait queue.
3. It places the process on the run queue, making it visible to the scheduler again.
Note: Wakeup doesn't immediately run the process—it makes the process eligible to run. The scheduler decides when it actually gets the CPU.
In Linux, uninterruptible sleep appears as 'D' in the ps output. Processes in D state cannot be killed even with SIGKILL—they're waiting for something the kernel considers uninterruptible (typically disk or NFS I/O). Too many processes in D state often indicates I/O system problems, hung file systems, or kernel bugs.
The kernel needs a way to track which processes are waiting for which events. This is accomplished through wait queues (also called sleep queues or wait channels).
A wait queue is a data structure that:
In Linux, the primary wait queue structure is wait_queue_head_t:
```c
// Wait queue head - represents an event that processes can wait on
struct wait_queue_head {
    spinlock_t lock;              // Protects the list
    struct list_head head;        // List of waiting entries
};
typedef struct wait_queue_head wait_queue_head_t;

// Wait queue entry - represents one sleeping process
struct wait_queue_entry {
    unsigned int flags;           // Exclusive vs. shared wakeup
    void *private;                // Typically points to task_struct
    wait_queue_func_t func;       // Custom wakeup function
    struct list_head entry;       // Links in the queue
};
typedef struct wait_queue_entry wait_queue_entry_t;

// Initialization macro
#define DECLARE_WAIT_QUEUE_HEAD(name) \
    wait_queue_head_t name = { \
        .lock = __SPIN_LOCK_UNLOCKED(name.lock), \
        .head = { &(name).head, &(name).head } \
    }
```

Key components:
wait_queue_head_t: The anchor point for a wait queue. Each condition you can wait on has one. Examples: the readers blocked on an empty pipe, the processes waiting for a particular disk I/O to complete, or a parent blocked in wait() for a child to exit.
wait_queue_entry_t: Represents one waiting process. Contains: flags distinguishing exclusive from shared waiters, a private pointer (typically the sleeper's task_struct), a wakeup callback function, and the list linkage into the queue.
The spinlock: The wait queue itself needs protection. Multiple CPUs may add/remove entries concurrently. The spinlock serializes these operations.
Exclusive vs. non-exclusive waiters: A non-exclusive (shared) waiter is awakened by every wakeup on the queue; exclusive waiters are awakened at most one at a time. Exclusivity prevents the thundering herd: when a contended lock is released, only one waiter can acquire it, so waking every sleeper just to have all but one go back to sleep wastes work. wake_up() therefore wakes all non-exclusive waiters plus at most one exclusive waiter.
```c
// Add/remove a waiter on a wait queue
void add_wait_queue(wait_queue_head_t *q, wait_queue_entry_t *wait);
void remove_wait_queue(wait_queue_head_t *q, wait_queue_entry_t *wait);

// Common pattern: prepare to wait, check condition, schedule
void prepare_to_wait(wait_queue_head_t *q, wait_queue_entry_t *wait, long state);

// After waking up: clean up
void finish_wait(wait_queue_head_t *q, wait_queue_entry_t *wait);

// Wake up processes
int wake_up(wait_queue_head_t *q);               // Wake all non-exclusive + 1 exclusive
int wake_up_all(wait_queue_head_t *q);           // Wake everyone
int wake_up_interruptible(wait_queue_head_t *q); // Only interruptible sleepers
```

The spinlock protecting a wait queue must be held only briefly. Adding a waiter, removing a waiter, and traversing for wakeup are all O(n) in the worst case but typically fast. Never perform blocking operations while holding the wait queue lock—that would deadlock.
Let's trace exactly what happens when a process sleeps in Linux. Understanding this sequence is crucial for implementing correct synchronization.
The canonical sleep pattern:
```c
// The correct pattern for sleeping until a condition is true
DEFINE_WAIT(wait);                   // Declare wait queue entry

while (!condition) {                 // Loop until condition met
    prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE);
    if (!condition)                  // Re-check after prepare
        schedule();                  // Actually sleep
    if (signal_pending(current)) {   // Handle signals
        // Return -EINTR or handle signal
        break;
    }
}
finish_wait(&wq, &wait);             // Clean up
```

Dissecting each step:
Step 1: DEFINE_WAIT(wait) Declares and initializes a wait queue entry on the stack. Sets up the default wake function and marks the waiter as non-exclusive.
Step 2: while (!condition) The condition must be checked in a loop, not just once. Spurious wakeups can occur, and the condition should be verified after every wakeup.
Step 3: prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE) Adds the wait queue entry to the queue (if not already linked) and sets the process state to TASK_INTERRUPTIBLE, both while the process is still running on the CPU.
Critical insight: After prepare_to_wait, the process is in a vulnerable state. Its state flag says "sleeping" but it hasn't called schedule() yet. If wakeup happens now, it will set the state back to RUNNING, and the subsequent schedule() will simply not sleep.
Step 4: if (!condition) Re-check the condition. This is essential to avoid the lost wakeup problem. If the condition became true between the outer while check and prepare_to_wait, we skip schedule().
Step 5: schedule() The actual context switch: the scheduler sees that the current task is no longer TASK_RUNNING, leaves it off the run queue, picks another runnable task, and switches to it. The sleeper's execution freezes at this point and resumes only after a wakeup makes it runnable and the scheduler selects it again.
Step 6: signal_pending(current) After wakeup from TASK_INTERRUPTIBLE sleep, check if woken by a signal. If so, may need to return with an error code to allow signal handling.
Step 7: finish_wait(&wq, &wait) Sets the state back to TASK_RUNNING (defensively, in case it was never changed back) and removes the entry from the wait queue if it is still linked.
Modern kernels provide several convenience wrappers around the core sleep mechanism. Understanding these variants helps you choose the right tool for each situation.
Linux wait_event family:
```c
// Wait until condition is true (uninterruptible)
// DANGER: Cannot be killed - use sparingly
wait_event(wq, condition);

// Wait until condition is true (interruptible by signals)
// Returns 0 if condition true, -ERESTARTSYS if interrupted
wait_event_interruptible(wq, condition);

// Wait with timeout (jiffies) - interruptible
// Returns remaining time, or 0 on timeout, or -ERESTARTSYS
wait_event_interruptible_timeout(wq, condition, timeout);

// Wait until condition, but killable (SIGKILL wakes it)
// Compromise between interruptible and uninterruptible
wait_event_killable(wq, condition);

// Wake only one exclusive waiter (used for mutexes/locks)
wait_event_exclusive(wq, condition);

// Wait with a lock held, releasing it during sleep
wait_event_interruptible_lock_irq(wq, condition, lock);
```

Choosing the right variant:
| Situation | Recommended Variant | Rationale |
|---|---|---|
| Normal user-space request | wait_event_interruptible | Allow Ctrl+C to work; return -EINTR/EAGAIN |
| Short kernel-internal wait | wait_event_killable | Block signals except kill for cleaner code |
| Critical I/O completion | wait_event | Interruption could corrupt data; use carefully |
| Network/disk with timeout | wait_event_interruptible_timeout | Don't hang forever on unresponsive hardware |
| Mutex acquisition | wait_event_exclusive | Only one can acquire; avoid thundering herd |
| Wait while holding spinlock | wait_event_*_lock_irq | Atomically release lock and sleep |
Code using wait_event() (uninterruptible) cannot be stopped by signals, including SIGKILL. If the wakeup never comes (buggy driver, hung hardware), the process is stuck forever. The system must be rebooted. Use uninterruptible sleep only when: (1) the wait is guaranteed to be short, (2) interruption would cause data corruption, or (3) there's no safe way to cancel the operation.
User-space sleep functions:
At the application level, similar concepts exist:
```c
// Simple time-based sleep (no event, just delay)
unsigned int sleep(unsigned int seconds);
int usleep(useconds_t usec);
int nanosleep(const struct timespec *req, struct timespec *rem);

// Condition variable wait (blocks until signaled)
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
int pthread_cond_timedwait(pthread_cond_t *cond, pthread_mutex_t *mutex,
                           const struct timespec *abstime);
```
These ultimately invoke kernel sleep mechanisms. pthread_cond_wait, for instance, uses the futex() system call, which implements efficient user-space/kernel-space sleep coordination: the uncontended fast path completes entirely in user space, and the kernel is entered only when a thread must actually sleep or be woken.
Let's examine a simplified implementation of the core sleep mechanism to solidify understanding. This pseudo-code captures the essential operations:
```c
// The wait queue entry, embedded in or pointed to by each waiter
struct wait_queue_entry {
    struct task_struct *task;    // The sleeping process
    struct list_head link;       // Linked list linkage
    int flags;                   // EXCLUSIVE, etc.
};

// The wait queue head - one per logical event
struct wait_queue_head {
    spinlock_t lock;
    struct list_head waiters;    // List of wait_queue_entry
};

// Add current process to wait queue and set state
void prepare_to_wait(struct wait_queue_head *wq,
                     struct wait_queue_entry *entry, int state)
{
    entry->task = current;       // Point to current process
    entry->flags = 0;

    spin_lock(&wq->lock);
    // Add to list if not already present
    if (list_empty(&entry->link))
        list_add_tail(&entry->link, &wq->waiters);
    // Mark ourselves as sleeping (but we're still on CPU!)
    set_current_state(state);    // TASK_INTERRUPTIBLE, etc.
    spin_unlock(&wq->lock);
}

// The actual sleep - invoke the scheduler
void schedule(void)
{
    struct task_struct *prev = current;
    struct task_struct *next;

    // If current is still TASK_RUNNING, don't remove from run queue
    // But if it's TASK_INTERRUPTIBLE/UNINTERRUPTIBLE, it's not runnable
    if (prev->state != TASK_RUNNING) {
        // Already dequeued by prepare_to_wait (conceptually)
        // In reality, scheduler checks state and skips if not RUNNING
    }

    // Pick next task from run queue
    next = pick_next_task();
    if (prev != next) {
        // Context switch: save prev's registers, load next's registers
        context_switch(prev, next);
        // When we return here, we've been re-scheduled
        // Could be much later - seconds, hours, days!
    }
}

// Clean up after wakeup
void finish_wait(struct wait_queue_head *wq, struct wait_queue_entry *entry)
{
    set_current_state(TASK_RUNNING);    // Defensive
    spin_lock(&wq->lock);
    if (!list_empty(&entry->link))
        list_del_init(&entry->link);
    spin_unlock(&wq->lock);
}
```

Key implementation insights:
1. State is set before schedule(): The process state is changed to SLEEPING before calling schedule(). This is crucial. If a wakeup arrives between prepare_to_wait and schedule, it sets state back to RUNNING. Then schedule() sees RUNNING and doesn't actually sleep.
2. The double-check pattern: The condition is checked twice: once in the while loop before prepare_to_wait, and again after prepare_to_wait but before calling schedule().
This handles the race where condition becomes true between checks.
3. Wait queue entry on stack: Typically, the wait_queue_entry is declared on the calling function's stack. This is safe because the entry is removed before finish_wait returns, and finish_wait is called before the function returns.
4. Spinlock scope: The wait queue spinlock is held only for list operations, never across schedule(). Sleeping while holding a spinlock would deadlock the system.
5. context_switch is the dividing line: Everything before context_switch runs in the context of the sleeping process. After context_switch, the CPU runs a different process. The sleeping process's execution is frozen mid-context_switch until wakeup.
Real implementations include memory barriers to ensure the state change is visible to other CPUs before the condition check and schedule. set_current_state() typically includes a memory barrier. Without this, a CPU might reorder the state write after the schedule() call, breaking the lost-wakeup prevention.
The sleep mechanism is foundational to efficient synchronization. Let's consolidate the key concepts:
- Sleeping replaces wasteful spinning with voluntary blocking: a waiting process consumes no CPU, only memory.
- Wait queues associate sleeping processes with the events they await, protected by a spinlock held only briefly.
- Setting the task state before the final condition check and schedule() is what prevents lost wakeups.
- Interruptible sleep can be ended by signals; uninterruptible sleep cannot, so it must be used sparingly.
- Wakeup makes a process runnable; the scheduler decides when it actually runs.
You now understand the sleep mechanism—what it is, why it exists, and how it works internally. This knowledge is essential for understanding synchronization primitives, debugging hung processes, and implementing efficient concurrent systems. Next, we'll examine the complementary wakeup mechanism—how sleeping processes are returned to activity.