At first glance, sleep and wakeup appear simple: remove a process from the run queue, add it back later. But decades of operating system development have revealed that this apparent simplicity conceals a minefield of subtle challenges.
Implementing sleep/wakeup correctly requires navigating atomicity requirements, memory-ordering rules, interrupt-context constraints, wait queue management, priority inversion, cancellation and cleanup, and performance pressure.
This page examines the engineering challenges that make sleep/wakeup implementation non-trivial. Understanding these challenges illuminates why kernel code is as complex as it is—and why user-space synchronization libraries exist to abstract this complexity.
By the end of this page, you will understand: the atomicity requirements for sleep/wakeup; challenges with memory ordering and barriers; interrupt context constraints; the complexity of wait queue management; priority inheritance and inversion; cancellation and cleanup; and performance optimization techniques.
The core challenge of sleep/wakeup is ensuring multiple operations happen atomically. Let's enumerate what must be coordinated:
For sleep, these must be atomic: checking the wait condition, marking the process SLEEPING, adding it to the wait queue, and handing control to the scheduler.
If any of these can be observed independently by another CPU, races occur.
For wakeup, these must be atomic: checking the sleeper's state, removing it from the wait queue, and marking it runnable again.
The fundamental tension:
We want to hold locks for atomicity, but we can't hold locks across schedule()—that would mean sleeping while holding a lock, preventing the lock from ever being released by another process (deadlock).
```c
// WRONG: Trying to hold lock across schedule()
void broken_sleep() {
    spin_lock(&queue_lock);          // Acquire lock
    add_to_wait_queue(current);      // Add ourselves
    set_current_state(SLEEPING);     // Mark sleeping
    // We want atomicity... but:
    schedule();                      // BUG! schedule() switches to another process
                                     // That process might need queue_lock -> DEADLOCK
    spin_unlock(&queue_lock);        // Never reached if deadlock
}

// CORRECT: Careful ordering and lock scoping
void correct_sleep() {
    spin_lock(&queue_lock);          // Lock for queue manipulation
    add_to_wait_queue(current);
    set_current_state(SLEEPING);     // State visible before releasing lock
    spin_unlock(&queue_lock);        // Release BEFORE schedule
    // Memory barrier in set_current_state ensures wakeup on another CPU
    // can see SLEEPING state even though we released the lock
    schedule();                      // Now safe to context switch
}

// The trick: we set state BEFORE releasing lock
// So any wakeup that happens:
// - Before we release lock: blocked on lock, will wake us after we schedule
// - After we release lock: sees SLEEPING state, wakes us
// No lost wakeups!
```

The prepare_to_wait pattern achieves atomicity through careful ordering:
The key insight: state is set while holding the lock, so any concurrent wakeup either blocks on the lock until the sleeper's state has been set, or acquires the lock afterward and sees the SLEEPING state. In both cases the wakeup is delivered.
Some implementations use lock-free techniques (compare-and-swap on state) instead of spinlocks for the wait queue. This can reduce contention under high load but requires even more careful reasoning about memory ordering and is typically reserved for performance-critical paths like futex.
On modern CPUs with relaxed memory ordering, writes can appear reordered to other processors. This creates subtle correctness issues for sleep/wakeup.
The problem:
Consider a waiter and waker on different CPUs:
```c
// CPU 1 (Waiter)                 // CPU 2 (Waker)
state = SLEEPING;   // Store S1   data = new_value;   // Store D
check_condition();  // Load D     wakeup();           // Load S1
```
On a weakly-ordered architecture (ARM, PowerPC), CPU 2 might see S1 before CPU 1's earlier stores become visible. Or CPU 1 might not see Store D when checking the condition.
The result:
```c
// Required memory barriers for correctness

// WAITER: Must ensure state is visible before checking condition
void correct_waiter() {
    state = SLEEPING;
    smp_mb();   // Full memory barrier: all previous stores visible,
                // all subsequent loads see fresh values
    if (!condition) {
        schedule();  // Safe: waker will see SLEEPING if we see old condition
    }
}

// WAKER: Must ensure condition update visible before checking state
void correct_waker() {
    condition = true;
    smp_mb();   // Full memory barrier
    if (state == SLEEPING) {
        wake_up();  // Safe: sleeper will see new condition if we see SLEEPING
    }
}

// Linux's set_current_state() includes the barrier:
#define set_current_state(state_value)      \
    do {                                    \
        current->state = (state_value);     \
        smp_mb();  /* Barrier included! */  \
    } while (0)

// This is why you must use set_current_state(), not direct assignment:
// current->state = TASK_INTERRUPTIBLE;    // WRONG: no barrier
// set_current_state(TASK_INTERRUPTIBLE);  // CORRECT: barrier included
```

Why barriers are placed where they are:
The waiter's barrier after setting SLEEPING ensures the SLEEPING store is globally visible before the condition is loaded; if the waiter reads a stale condition, the waker is guaranteed to see SLEEPING and issue a wakeup.
The waker's barrier after setting the condition ensures the condition store is globally visible before the state is loaded; if the waker reads a stale (non-SLEEPING) state, the waiter is guaranteed to see the new condition and skip the sleep.
Together, these barriers create a total ordering that prevents the lost wakeup race.
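The two-sided protocol can be checked in user space with C11 fences standing in for smp_mb(). This sketch (illustrative, not kernel code) runs the waiter's and waker's store/load pairs on two threads; the paired seq_cst fences guarantee that at least one side observes the other's store, which is exactly the invariant that rules out a lost wakeup:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_int  state_sleeping;  // waiter's store (S1)
static atomic_bool condition;       // waker's store (D)
static bool waiter_saw_cond, waker_saw_sleeping;

static void *waiter(void *unused) {
    (void)unused;
    atomic_store_explicit(&state_sleeping, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);        // the smp_mb() analogue
    waiter_saw_cond = atomic_load_explicit(&condition, memory_order_relaxed);
    return NULL;
}

static void *waker(void *unused) {
    (void)unused;
    atomic_store_explicit(&condition, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);        // the smp_mb() analogue
    waker_saw_sleeping =
        atomic_load_explicit(&state_sleeping, memory_order_relaxed) == 1;
    return NULL;
}

// Returns true iff at least one side observed the other. With both fences
// present, false is impossible (the classic store-buffering litmus test).
bool litmus_round(void) {
    pthread_t t1, t2;
    atomic_store(&state_sleeping, 0);
    atomic_store(&condition, false);
    waiter_saw_cond = waker_saw_sleeping = false;
    pthread_create(&t1, NULL, waiter, NULL);
    pthread_create(&t2, NULL, waker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return waiter_saw_cond || waker_saw_sleeping;
}
```

Removing either fence (and compiling for a weakly ordered target) would allow rounds where neither side sees the other, i.e. the lost-wakeup outcome.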
Platform-specific considerations:
| Architecture | Memory Model | Barrier Needs |
|---|---|---|
| x86/x64 | Total Store Order | Compiler barriers often sufficient; HW handles most cases |
| ARM | Weakly ordered | Full barriers (dmb) required |
| PowerPC | Weakly ordered | sync/lwsync instructions needed |
| RISC-V | RVWMO (weak) | fence instructions required |
The Linux kernel abstracts these with smp_mb(), which compiles to the appropriate instruction per architecture.
Memory ordering bugs are among the hardest to debug—they may only manifest on specific hardware under specific timing. Always use the kernel's provided sleep/wakeup infrastructure (wait_event, etc.) which has barriers in the right places. Implementing your own is a recipe for subtle, architecture-dependent bugs.
Sleep and wakeup must interact correctly with interrupt handling, but they have asymmetric constraints.
Sleep cannot happen in interrupt context:
Interrupt handlers:
If an interrupt handler tries to sleep:
Attempting to sleep in interrupt context triggers a kernel panic (BUG) in Linux.
Wakeup can happen in interrupt context:
Wakeup is specifically designed to be safe from interrupts: it uses spin_lock_irqsave() to handle nested interrupts.

This asymmetry is essential: I/O completion interrupts must be able to wake waiting processes.
```c
// Checking execution context in Linux

bool can_sleep() {
    // Cannot sleep if:
    // - In interrupt context (hardirq or softirq)
    // - Holding a spinlock
    // - With preemption disabled
    return !in_atomic() && !irqs_disabled() && !in_interrupt();
}

// Kernel's might_sleep() annotation - debug check
void might_sleep() {
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
    if (in_atomic() || irqs_disabled()) {
        // Print warning and backtrace
        printk(KERN_ERR "BUG: sleeping function called from invalid context\n");
        dump_stack();
    }
#endif
}

// Use in functions that may sleep
void my_blocking_function() {
    might_sleep();              // Warns if called from wrong context
    wait_event(wq, condition);  // Might actually sleep
}

// Wakeup is safe anywhere - but uses appropriate locking
void wake_up_safe(wait_queue_head_t *wq) {
    unsigned long flags;
    // irqsave works from any context
    spin_lock_irqsave(&wq->lock, flags);
    __wake_up_common(wq, TASK_NORMAL, 1, 0, NULL);
    spin_unlock_irqrestore(&wq->lock, flags);
}
```

Deferred work patterns:
When interrupt context needs to do work that requires sleeping:
```c
// Pattern 1: Workqueues - defer heavy work to a kernel worker thread
IRQ_HANDLER(my_irq) {
    prepare_data_fast();        // Minimal work in IRQ
    schedule_work(&my_work);    // Defer heavy work to kworker
}

WORK_HANDLER(my_work) {
    might_sleep();              // This can sleep
    process_data_slowly();
}

// Pattern 2: Threaded IRQs - split the handler in two
request_threaded_irq(irq, quick_handler, thread_handler, ...);
// quick_handler: runs in IRQ context (fast, no sleep)
// thread_handler: runs in kthread context (can sleep)

// Pattern 3: Tasklets - defer out of the hard IRQ (still cannot sleep)
irq_handler() {
    acknowledge_hardware();
    tasklet_schedule(&mytask);  // Will run soon, outside IRQ
}
```
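The workqueue idea translates directly to user space: a fast "handler" enqueues an item and returns immediately, while a worker thread that is allowed to sleep does the heavy part. A minimal sketch with illustrative names (a one-slot queue, doubling standing in for real work):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            pending;   // a one-slot "work queue"
    int             input;
    int             result;
} workqueue_t;

// Fast path: what the "IRQ handler" does - no blocking, just enqueue
static void schedule_work_demo(workqueue_t *wq, int input) {
    pthread_mutex_lock(&wq->lock);
    wq->input = input;
    wq->pending = true;
    pthread_cond_signal(&wq->cond);
    pthread_mutex_unlock(&wq->lock);
}

// Slow path: the worker may sleep, take mutexes, etc.
static void *worker(void *arg) {
    workqueue_t *wq = arg;
    pthread_mutex_lock(&wq->lock);
    while (!wq->pending)
        pthread_cond_wait(&wq->cond, &wq->lock);  // sleeping is fine here
    wq->result = wq->input * 2;                   // stand-in for heavy work
    pthread_mutex_unlock(&wq->lock);
    return NULL;
}

int run_deferred(int input) {
    workqueue_t wq = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                       false, 0, 0 };
    pthread_t t;
    pthread_create(&t, NULL, worker, &wq);
    schedule_work_demo(&wq, input);   // the "interrupt handler" defers work
    pthread_join(&t, NULL);           // wait for the worker to finish
    return wq.result;
}
```

The handler side never blocks on anything but a spinlock-like critical section; all sleeping happens in the worker, mirroring the kernel split.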
In ancient Linux versions, the Big Kernel Lock (BKL) could be held while sleeping—it would be automatically released and reacquired. This made kernel development easier but caused massive contention. Modern Linux has no such allowance; all locks that can be held across sleep are explicitly blocking (mutexes, semaphores, etc.).
Wait queues seem simple—just a linked list of waiters. But efficient and correct management requires addressing numerous concerns.
Queue organization: a single FIFO list is simplest, but implementations may keep separate lists for exclusive and non-exclusive waiters, or order entries by priority, trading memory and fairness against wakeup efficiency.
Concurrent modification:
Multiple CPUs may simultaneously add new waiters, remove waiters (on wakeup, timeout, or signal), and traverse the list delivering wakeups.
This requires careful locking or lock-free techniques.
```c
// Edge case 1: Removal during wakeup traversal
// Wakeup is iterating the list when a waiter is removed
// Solution: use list_for_each_entry_safe

void wake_up_common(wq) {
    wait_entry *pos, *tmp;
    // _safe version allows removal of current entry during iteration
    list_for_each_entry_safe(pos, tmp, &wq->head, entry) {
        pos->func(pos);  // May remove itself from list
    }
}

// Edge case 2: Multiple removals racing
// Two CPUs try to remove the same waiter (signal + wakeup)
void remove_waiter(wq, entry) {
    spin_lock(&wq->lock);
    // Check if already removed (list_empty after removal)
    if (!list_empty(&entry->entry)) {
        list_del_init(&entry->entry);  // init so it's safely empty
    }
    spin_unlock(&wq->lock);
}

// Edge case 3: Add during wakeup
// New waiter adds itself while wakeup is traversing
// If added after current position, won't be woken this round (intended)
// If added before... depends on list implementation

// Edge case 4: Very long wake time
// Holding spinlock while waking 1000s of waiters = long latency
// Solution: Linux uses bookmarks to release lock periodically
void wake_up_huge_queue(wq) {
    while (more_waiters) {
        spin_lock(&wq->lock);
        wake_some_waiters();
        insert_bookmark();       // Mark progress
        spin_unlock(&wq->lock);  // Let other CPUs proceed
        cond_resched();          // Maybe yield to others
    }
}
```

| Aspect | Simple Approach | Advanced Approach | Trade-off |
|---|---|---|---|
| Data structure | Single linked list | Separate lists for exclusive/non-exclusive | Memory vs. wakeup efficiency |
| Locking | Global lock per queue | Lock-free with CAS | Simplicity vs. scalability |
| Wakeup traversal | Wake all, let them re-check | Selective wake based on key | Generality vs. precision |
| Priority handling | FIFO only | Priority-ordered list | Fairness vs. responsiveness |
| Large queue handling | Single atomic wakeup | Batched with lock release | Latency vs. complexity |
Linux's poll and epoll use a 'key' parameter in wait queues to identify which events (POLLIN, POLLOUT, etc.) the waiter cares about. The wakeup only triggers for entries whose key matches the event. This avoids waking all poll waiters for every event—a significant optimization for high-connection servers.
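A minimal sketch of such key-filtered wakeup (illustrative structures and constants, loosely modeled on the kernel mechanism): each wait entry carries an event mask, and the wakeup fires only for entries whose mask overlaps the event.

```c
#include <stdbool.h>
#include <stddef.h>

// Illustrative event bits, modeled on POLLIN/POLLOUT
#define DEMO_POLLIN  0x1
#define DEMO_POLLOUT 0x4

typedef struct wait_entry {
    unsigned int       key;     // events this waiter cares about
    bool               woken;
    struct wait_entry *next;
} wait_entry_t;

// Wake only the entries whose key overlaps the event; returns how many fired.
// In a real kernel, "wake" would mark the task runnable instead of a flag.
int wake_up_key(wait_entry_t *head, unsigned int event) {
    int n = 0;
    for (wait_entry_t *e = head; e != NULL; e = e->next) {
        if (e->key & event) {
            e->woken = true;
            n++;
        }
    }
    return n;
}
```

Waiters interested only in writability are never disturbed by a readability event, which is the point of the optimization.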
When processes of different priorities interact through sleep/wakeup, priority inversion can occur—a high-priority process waiting for a low-priority process.
The classic scenario: a low-priority process L holds a lock; a high-priority process H blocks waiting for that lock; a medium-priority process M then preempts L. M runs instead of L, so L never gets CPU time to release the lock, and H, despite its priority, waits behind M.
In the worst case, M runs indefinitely, and H (the highest priority) waits indefinitely. The Mars Pathfinder mission famously encountered this bug.
Solutions:
Priority Inheritance: When H waits for L, temporarily boost L's priority to H's level. Now L can preempt M and release the lock faster.
```c
// When high-priority waiter blocks on lock held by low-priority holder:
void boost_holder_priority(lock_holder, waiter_priority) {
    if (lock_holder->priority < waiter_priority) {
        // Temporarily raise holder's scheduling priority
        lock_holder->boosted_priority = waiter_priority;
        // Reschedule to apply new priority
        resched_task(lock_holder);
    }
}

// When holder releases lock, restore original priority
void restore_priority(task) {
    task->boosted_priority = task->original_priority;
}
```
Priority Ceiling: A lock is assigned a priority ceiling—the highest priority of any thread that might acquire it. Any thread holding the lock runs at ceiling priority. This prevents blocking by medium-priority tasks entirely.
RT-Mutex in Linux: Linux's real-time mutexes (rtmutex) implement priority inheritance. When a high-priority task blocks on an rtmutex, the holder's priority is boosted. This is transitive—if L is waiting on another lock, that holder is also boosted.
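The transitive boost can be sketched as a walk along the blocking chain: "waiter → lock holder → lock that holder waits on → …", raising each holder's effective priority along the way. The task and lock structures below are illustrative, not kernel types:

```c
#include <stddef.h>

typedef struct task {
    int          prio;        // effective (possibly boosted) priority
    struct lock *waiting_on;  // lock this task is blocked on, or NULL
} task_t;

typedef struct lock {
    task_t *holder;
} lock_t;

// Boost every holder in the blocking chain to at least waiter_prio.
// Stops early when a holder is already high enough (the chain beyond it
// was boosted when that holder blocked).
void boost_chain(lock_t *blocked_on, int waiter_prio) {
    for (lock_t *l = blocked_on; l != NULL && l->holder != NULL;
         l = l->holder->waiting_on) {
        if (l->holder->prio >= waiter_prio)
            break;
        l->holder->prio = waiter_prio;  // transitive boost
    }
}
```

If H (priority 90) blocks on a lock held by M (50), and M is itself blocked on a lock held by L (10), both M and L end up boosted to 90, so L can finally run and unwind the chain.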
User-space mutexes (pthread_mutex) typically don't implement priority inheritance by default. Using them with real-time threads requires the PTHREAD_PRIO_INHERIT protocol attribute. Failing to set it can cause priority inversion bugs that are intermittent and hard to diagnose.
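Requesting priority inheritance on a POSIX mutex uses the standard pthread attribute API; the wrapper below is a sketch of the typical setup (on Linux, such a mutex is backed by a PI futex):

```c
#define _GNU_SOURCE
#include <pthread.h>

// Initialize *m as a priority-inheritance mutex; returns 0 on success.
int init_pi_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int err = pthread_mutexattr_init(&attr);
    if (err)
        return err;
    // Without this call, the default protocol is PTHREAD_PRIO_NONE:
    // no inheritance, and the inversion described above is possible.
    err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (!err)
        err = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}
```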
A sleeping process may need to be removed from its wait before the awaited event occurs: a signal may arrive, a timeout may expire, the process may exit, or the thread may be cancelled.
The cleanup challenge: when a sleeper is forcibly awakened, it must remove itself from the wait queue, restore its state to RUNNING, and determine why it woke (event, signal, or timeout) before deciding whether to keep waiting.
```c
// Interruptible sleep with signal handling

long wait_event_interruptible(wq, condition) {
    DEFINE_WAIT(wait);
    long ret = 0;

    while (!condition) {
        prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE);
        if (!condition) {
            schedule();
            // After waking, check why we woke
            if (signal_pending(current)) {
                // A signal arrived! Don't keep waiting.
                ret = -ERESTARTSYS;  // Or -EINTR
                break;
            }
        }
        // Otherwise, check condition again (might be spurious)
    }
    finish_wait(&wq, &wait);  // Always clean up!
    return ret;
}

// Process termination during sleep
void do_exit(int exit_code) {
    // ... lots of cleanup ...

    // Remove from any wait queues
    // (handled implicitly by scheduler during exit path)
    // Can't be left on wait queue - would have dangling task pointer
    remove_from_all_wait_queues(current);

    // ... rest of exit ...
}

// Timeout-based cleanup
long schedule_timeout(long timeout) {
    struct timer_list timer;

    // Set up timer to wake us
    setup_timer(&timer, process_timeout, (unsigned long)current);
    timer.expires = jiffies + timeout;
    add_timer(&timer);

    // Sleep
    schedule();

    // Woke up: either timer fired or something else woke us
    // Delete timer (may already have fired)
    del_timer_sync(&timer);

    // Return remaining time (0 if timeout)
    return timer.expires - jiffies;
}
```

The finish_wait guarantee:
finish_wait() is designed to handle all cleanup: it resets the task state to TASK_RUNNING and removes the wait entry from the queue if it is still linked, so it is safe to call regardless of whether the wakeup, a signal, or a timeout ended the wait.
This design makes the wait loop robust against any wakeup timing.
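The same robustness can be sketched in user space: re-check the condition in a loop, treat a cancel flag as the analogue of signal_pending(), and run the cleanup step unconditionally. Names below are illustrative:

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            condition;   // the awaited event
    bool            cancelled;   // stand-in for signal_pending()
    bool            cleaned_up;  // did finish_wait-style cleanup run?
} waiter_t;

// Returns 0 if the condition arrived, -1 if we were "signalled" out.
// Cleanup runs on every exit path, mirroring the finish_wait() contract.
int wait_event_demo(waiter_t *w) {
    int ret = 0;
    pthread_mutex_lock(&w->lock);
    while (!w->condition) {
        if (w->cancelled) {      // woke for a reason other than the event
            ret = -1;
            break;
        }
        pthread_cond_wait(&w->cond, &w->lock);
    }
    w->cleaned_up = true;        // the finish_wait() analogue: always runs
    pthread_mutex_unlock(&w->lock);
    return ret;
}
```

Whether the wait ends normally or is cancelled, the cleanup flag is set exactly once, and the caller learns why it returned from the return value.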
Cleanup points in Linux:
| Situation | Cleanup Mechanism |
|---|---|
| Normal wakeup | finish_wait() in the waiting code |
| Signal during wait | finish_wait() called after breaking loop |
| Process exit | do_exit() cleanup routines |
| Timeout | Timer callback wakes, finish_wait() cleans |
| Thread cancel | Cancellation handler must call finish_wait() equivalent |
TASK_KILLABLE is a compromise state: interruptible only by fatal signals (SIGKILL). This simplifies cleanup—you don't need to handle every possible signal gracefully, just know that the process might be killed. Used for kernel waits that shouldn't be interrupted by SIGINT but shouldn't be completely uninterruptible either.
Sleep/wakeup is on the critical path of many operations. Microseconds matter.
Optimization techniques used in production kernels: a fast path that checks the condition before touching the wait queue, branch-prediction hints (likely/unlikely), and brief adaptive spinning before committing to a sleep.
```c
// Optimized wait_event with fast-path

#define wait_event(wq, condition)                               \
do {                                                            \
    /* Fast path: condition already true, no wait needed */     \
    if (condition)                                              \
        break;                                                  \
    /* Slow path: need to actually wait */                      \
    __wait_event(wq, condition);                                \
} while (0)

// The fast path:
// - Checks condition (hopefully branch-predicted as likely)
// - If true, no function calls, no wait queue ops, no locks
// - Compiles to just a few instructions

// In hot paths (locks with low contention), the fast path
// dominates. Optimizing it is crucial.

// Another optimization: the __builtin_expect hint
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

#define wait_event_optimized(wq, condition)                     \
do {                                                            \
    if (likely(condition))  /* Condition usually true */        \
        break;                                                  \
    __wait_event(wq, condition);                                \
} while (0)

// This helps the CPU's branch predictor and instruction cache
```

| Operation | Approximate Latency | Notes |
|---|---|---|
| Fast-path (condition true) | ~5-20 ns | Just condition check + branch |
| Sleep + immediate wakeup (same CPU) | ~1-5 μs | State changes + scheduler |
| Sleep + wakeup (different CPU) | ~5-50 μs | Includes IPI overhead |
| Sleep + wakeup (CPU was idle) | ~50-200 μs | CPU must exit power-saving state |
| Context switch | ~1-10 μs | Register save/restore + TLB |
Some implementations spin briefly before sleeping, betting that the wakeup will come soon. If it does, they avoid the context switch cost. This is the adaptive spinning used by futex and many mutex implementations. The spin duration is tuned based on observed wait times.
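A spin-then-block wait can be sketched as follows. The spin budget here is an arbitrary illustrative constant; real implementations tune it from observed wait times:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_LIMIT 1000   // illustrative; real code tunes this adaptively

typedef struct {
    atomic_bool     ready;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} event_t;

void event_signal(event_t *e) {
    pthread_mutex_lock(&e->lock);
    atomic_store(&e->ready, true);
    pthread_cond_broadcast(&e->cond);
    pthread_mutex_unlock(&e->lock);
}

// Returns true if the fast (spinning) path sufficed, false if we had to
// fall back to a real blocking wait (the context-switch cost we wanted
// to avoid).
bool event_wait_adaptive(event_t *e) {
    for (int i = 0; i < SPIN_LIMIT; i++)
        if (atomic_load(&e->ready))
            return true;            // wakeup came soon: no context switch
    pthread_mutex_lock(&e->lock);   // slow path: actually sleep
    while (!atomic_load(&e->ready))
        pthread_cond_wait(&e->cond, &e->lock);
    pthread_mutex_unlock(&e->lock);
    return false;
}
```

If the event is already (or very soon) signalled, the waiter never touches the mutex or condition variable at all, which is the same bet futex-based locks make.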
Implementing sleep/wakeup correctly and efficiently requires addressing numerous challenges: atomicity of state changes, memory ordering across CPUs, interrupt-context constraints, wait queue management, priority inversion, cancellation and cleanup, and fast-path performance.
You now understand the implementation challenges that make sleep/wakeup complex in real operating systems. This knowledge helps you appreciate why kernel APIs are designed as they are, and why violating their contracts leads to subtle bugs. Next, we'll explore how sleep/wakeup integrates with the scheduler.