At first glance, sleep and wakeup appear simple: remove a process from the run queue, add it back later. But decades of operating system development have revealed that this apparent simplicity conceals a minefield of subtle challenges.
Implementing sleep/wakeup correctly requires navigating atomicity requirements, memory-ordering rules, interrupt-context constraints, wait queue management, priority inversion, cancellation and cleanup, and performance pressure.
This page examines the engineering challenges that make sleep/wakeup implementation non-trivial. Understanding these challenges illuminates why kernel code is as complex as it is—and why user-space synchronization libraries exist to abstract this complexity.
By the end of this page, you will understand: the atomicity requirements for sleep/wakeup; challenges with memory ordering and barriers; interrupt context constraints; the complexity of wait queue management; priority inheritance and inversion; cancellation and cleanup; and performance optimization techniques.
The core challenge of sleep/wakeup is ensuring multiple operations happen atomically. Let's enumerate what must be coordinated:
For sleep, these must be atomic: checking the wait condition, marking the process SLEEPING, adding it to the wait queue, and handing control to the scheduler.
If any of these can be observed independently by another CPU, races occur.
For wakeup, these must be atomic: checking the sleeper's state, removing it from the wait queue, and marking it runnable again.
The fundamental tension:
We want to hold locks for atomicity, but we can't hold locks across schedule()—that would mean sleeping while holding a lock, preventing the lock from ever being released by another process (deadlock).
```c
// WRONG: Trying to hold lock across schedule()
void broken_sleep() {
    spin_lock(&queue_lock);          // Acquire lock
    add_to_wait_queue(current);      // Add ourselves
    set_current_state(SLEEPING);     // Mark sleeping
    // We want atomicity... but:
    schedule();                      // BUG! schedule() switches to another process
                                     // That process might need queue_lock -> DEADLOCK
    spin_unlock(&queue_lock);        // Never reached if deadlock
}

// CORRECT: Careful ordering and lock scoping
void correct_sleep() {
    spin_lock(&queue_lock);          // Lock for queue manipulation
    add_to_wait_queue(current);
    set_current_state(SLEEPING);     // State visible before releasing lock
    spin_unlock(&queue_lock);        // Release BEFORE schedule
    // Memory barrier in set_current_state ensures wakeup on another CPU
    // can see SLEEPING state even though we released the lock
    schedule();                      // Now safe to context switch
}

// The trick: we set state BEFORE releasing lock
// So any wakeup that happens:
// - Before we release lock: blocked on lock, will wake us after we schedule
// - After we release lock: sees SLEEPING state, wakes us
// No lost wakeups!
```

The prepare_to_wait pattern achieves atomicity through careful ordering:
The key insight: state is set while holding the lock, so any concurrent wakeup either blocks on the lock until the sleeper's state has been set, or acquires the lock afterward and sees the SLEEPING state. In both cases the wakeup is delivered.
Some implementations use lock-free techniques (compare-and-swap on state) instead of spinlocks for the wait queue. This can reduce contention under high load but requires even more careful reasoning about memory ordering and is typically reserved for performance-critical paths like futex.
On modern CPUs with relaxed memory ordering, writes can appear reordered to other processors. This creates subtle correctness issues for sleep/wakeup.
The problem:
Consider a waiter and waker on different CPUs:
```c
// CPU 1 (Waiter)                 // CPU 2 (Waker)
state = SLEEPING;   // Store S1   data = new_value;   // Store D
check_condition();  // Load D     wakeup();           // Load S1
```
On a weakly-ordered architecture (ARM, PowerPC), CPU 2 might see S1 before CPU 1's earlier stores become visible. Or CPU 1 might not see Store D when checking the condition.
The result:
```c
// Required memory barriers for correctness

// WAITER: Must ensure state is visible before checking condition
void correct_waiter() {
    state = SLEEPING;
    smp_mb();   // Full memory barrier: all previous stores visible,
                // all subsequent loads see fresh values
    if (!condition) {
        schedule();  // Safe: waker will see SLEEPING if we see old condition
    }
}

// WAKER: Must ensure condition update visible before checking state
void correct_waker() {
    condition = true;
    smp_mb();   // Full memory barrier
    if (state == SLEEPING) {
        wake_up();  // Safe: sleeper will see new condition if we see SLEEPING
    }
}

// Linux's set_current_state() includes the barrier:
#define set_current_state(state_value)      \
    do {                                    \
        current->state = (state_value);     \
        smp_mb();  /* Barrier included! */  \
    } while (0)

// This is why you must use set_current_state(), not direct assignment:
// current->state = TASK_INTERRUPTIBLE;    // WRONG: no barrier
// set_current_state(TASK_INTERRUPTIBLE);  // CORRECT: barrier included
```

Why barriers are placed where they are:
The waiter's barrier after setting SLEEPING ensures the SLEEPING store is globally visible before the condition is loaded; if the waiter reads a stale condition, the waker is guaranteed to see SLEEPING and issue a wakeup.
The waker's barrier after setting the condition ensures the condition store is globally visible before the state is loaded; if the waker reads a stale (non-SLEEPING) state, the waiter is guaranteed to see the new condition and skip the sleep.
Together, these barriers create a total ordering that prevents the lost wakeup race.
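The two-sided protocol can be checked in user space with C11 fences standing in for smp_mb(). This sketch (illustrative, not kernel code) runs the waiter's and waker's store/load pairs on two threads; the paired seq_cst fences guarantee that at least one side observes the other's store, which is exactly the invariant that rules out a lost wakeup:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_int  state_sleeping;  // waiter's store (S1)
static atomic_bool condition;       // waker's store (D)
static bool waiter_saw_cond, waker_saw_sleeping;

static void *waiter(void *unused) {
    (void)unused;
    atomic_store_explicit(&state_sleeping, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);        // the smp_mb() analogue
    waiter_saw_cond = atomic_load_explicit(&condition, memory_order_relaxed);
    return NULL;
}

static void *waker(void *unused) {
    (void)unused;
    atomic_store_explicit(&condition, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);        // the smp_mb() analogue
    waker_saw_sleeping =
        atomic_load_explicit(&state_sleeping, memory_order_relaxed) == 1;
    return NULL;
}

// Returns true iff at least one side observed the other. With both fences
// present, false is impossible (the classic store-buffering litmus test).
bool litmus_round(void) {
    pthread_t t1, t2;
    atomic_store(&state_sleeping, 0);
    atomic_store(&condition, false);
    waiter_saw_cond = waker_saw_sleeping = false;
    pthread_create(&t1, NULL, waiter, NULL);
    pthread_create(&t2, NULL, waker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return waiter_saw_cond || waker_saw_sleeping;
}
```

Removing either fence (and compiling for a weakly ordered target) would allow rounds where neither side sees the other, i.e. the lost-wakeup outcome.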
Platform-specific considerations:
| Architecture | Memory Model | Barrier Needs |
|---|---|---|
| x86/x64 | Total Store Order | Compiler barriers often sufficient; HW handles most cases |
| ARM | Weakly ordered | Full barriers (dmb) required |
| PowerPC | Weakly ordered | sync/lwsync instructions needed |
| RISC-V | RVWMO (weak) | fence instructions required |
The Linux kernel abstracts these with smp_mb(), which compiles to the appropriate instruction per architecture.
Memory ordering bugs are among the hardest to debug—they may only manifest on specific hardware under specific timing. Always use the kernel's provided sleep/wakeup infrastructure (wait_event, etc.) which has barriers in the right places. Implementing your own is a recipe for subtle, architecture-dependent bugs.
Sleep and wakeup must interact correctly with interrupt handling, but they have asymmetric constraints.
Sleep cannot happen in interrupt context:
Interrupt handlers:
If an interrupt handler tries to sleep:
Attempting to sleep in interrupt context triggers a kernel panic (BUG) in Linux.
Wakeup can happen in interrupt context:
Wakeup is specifically designed to be safe from interrupts: it uses spin_lock_irqsave() to handle nested interrupts.

This asymmetry is essential: I/O completion interrupts must be able to wake waiting processes.
```c
// Checking execution context in Linux

bool can_sleep() {
    // Cannot sleep if:
    // - In interrupt context (hardirq or softirq)
    // - Holding a spinlock
    // - With preemption disabled
    return !in_atomic() && !irqs_disabled() && !in_interrupt();
}

// Kernel's might_sleep() annotation - debug check
void might_sleep() {
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
    if (in_atomic() || irqs_disabled()) {
        // Print warning and backtrace
        printk(KERN_ERR "BUG: sleeping function called from invalid context\n");
        dump_stack();
    }
#endif
}

// Use in functions that may sleep
void my_blocking_function() {
    might_sleep();              // Warns if called from wrong context
    wait_event(wq, condition);  // Might actually sleep
}

// Wakeup is safe anywhere - but uses appropriate locking
void wake_up_safe(wait_queue_head_t *wq) {
    unsigned long flags;
    // irqsave works from any context
    spin_lock_irqsave(&wq->lock, flags);
    __wake_up_common(wq, TASK_NORMAL, 1, 0, NULL);
    spin_unlock_irqrestore(&wq->lock, flags);
}
```

Deferred work patterns:
When interrupt context needs to do work that requires sleeping:
```c
// Pattern 1: Workqueues - defer heavy work to a kernel worker thread
IRQ_HANDLER(my_irq) {
    prepare_data_fast();        // Minimal work in IRQ
    schedule_work(&my_work);    // Defer heavy work to kworker
}

WORK_HANDLER(my_work) {
    might_sleep();              // This can sleep
    process_data_slowly();
}

// Pattern 2: Threaded IRQs - split the handler in two
request_threaded_irq(irq, quick_handler, thread_handler, ...);
// quick_handler: runs in IRQ context (fast, no sleep)
// thread_handler: runs in kthread context (can sleep)

// Pattern 3: Tasklets - defer out of the hard IRQ (still cannot sleep)
irq_handler() {
    acknowledge_hardware();
    tasklet_schedule(&mytask);  // Will run soon, outside IRQ
}
```
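The workqueue idea translates directly to user space: a fast "handler" enqueues an item and returns immediately, while a worker thread that is allowed to sleep does the heavy part. A minimal sketch with illustrative names (a one-slot queue, doubling standing in for real work):

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            pending;   // a one-slot "work queue"
    int             input;
    int             result;
} workqueue_t;

// Fast path: what the "IRQ handler" does - no blocking, just enqueue
static void schedule_work_demo(workqueue_t *wq, int input) {
    pthread_mutex_lock(&wq->lock);
    wq->input = input;
    wq->pending = true;
    pthread_cond_signal(&wq->cond);
    pthread_mutex_unlock(&wq->lock);
}

// Slow path: the worker may sleep, take mutexes, etc.
static void *worker(void *arg) {
    workqueue_t *wq = arg;
    pthread_mutex_lock(&wq->lock);
    while (!wq->pending)
        pthread_cond_wait(&wq->cond, &wq->lock);  // sleeping is fine here
    wq->result = wq->input * 2;                   // stand-in for heavy work
    pthread_mutex_unlock(&wq->lock);
    return NULL;
}

int run_deferred(int input) {
    workqueue_t wq = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER,
                       false, 0, 0 };
    pthread_t t;
    pthread_create(&t, NULL, worker, &wq);
    schedule_work_demo(&wq, input);   // the "interrupt handler" defers work
    pthread_join(&t, NULL);           // wait for the worker to finish
    return wq.result;
}
```

The handler side never blocks on anything but a spinlock-like critical section; all sleeping happens in the worker, mirroring the kernel split.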
In ancient Linux versions, the Big Kernel Lock (BKL) could be held while sleeping—it would be automatically released and reacquired. This made kernel development easier but caused massive contention. Modern Linux has no such allowance; all locks that can be held across sleep are explicitly blocking (mutexes, semaphores, etc.).
Wait queues seem simple—just a linked list of waiters. But efficient and correct management requires addressing numerous concerns.
Queue organization: a single FIFO list is simplest, but implementations may keep separate lists for exclusive and non-exclusive waiters, or order entries by priority, trading memory and fairness against wakeup efficiency.
Concurrent modification:
Multiple CPUs may simultaneously add new waiters, remove waiters (on wakeup, timeout, or signal), and traverse the list delivering wakeups.
This requires careful locking or lock-free techniques.
```c
// Edge case 1: Removal during wakeup traversal
// Wakeup is iterating the list when a waiter is removed
// Solution: use list_for_each_entry_safe

void wake_up_common(wq) {
    wait_entry *pos, *tmp;
    // _safe version allows removal of current entry during iteration
    list_for_each_entry_safe(pos, tmp, &wq->head, entry) {
        pos->func(pos);  // May remove itself from list
    }
}

// Edge case 2: Multiple removals racing
// Two CPUs try to remove the same waiter (signal + wakeup)
void remove_waiter(wq, entry) {
    spin_lock(&wq->lock);
    // Check if already removed (list_empty after removal)
    if (!list_empty(&entry->entry)) {
        list_del_init(&entry->entry);  // init so it's safely empty
    }
    spin_unlock(&wq->lock);
}

// Edge case 3: Add during wakeup
// New waiter adds itself while wakeup is traversing
// If added after current position, won't be woken this round (intended)
// If added before... depends on list implementation

// Edge case 4: Very long wake time
// Holding spinlock while waking 1000s of waiters = long latency
// Solution: Linux uses bookmarks to release lock periodically
void wake_up_huge_queue(wq) {
    while (more_waiters) {
        spin_lock(&wq->lock);
        wake_some_waiters();
        insert_bookmark();       // Mark progress
        spin_unlock(&wq->lock);  // Let other CPUs proceed
        cond_resched();          // Maybe yield to others
    }
}
```

| Aspect | Simple Approach | Advanced Approach | Trade-off |
|---|---|---|---|
| Data structure | Single linked list | Separate lists for exclusive/non-exclusive | Memory vs. wakeup efficiency |
| Locking | Global lock per queue | Lock-free with CAS | Simplicity vs. scalability |
| Wakeup traversal | Wake all, let them re-check | Selective wake based on key | Generality vs. precision |
| Priority handling | FIFO only | Priority-ordered list | Fairness vs. responsiveness |
| Large queue handling | Single atomic wakeup | Batched with lock release | Latency vs. complexity |
Linux's poll and epoll use a 'key' parameter in wait queues to identify which events (POLLIN, POLLOUT, etc.) the waiter cares about. The wakeup only triggers for entries whose key matches the event. This avoids waking all poll waiters for every event—a significant optimization for high-connection servers.
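A minimal sketch of such key-filtered wakeup (illustrative structures and constants, loosely modeled on the kernel mechanism): each wait entry carries an event mask, and the wakeup fires only for entries whose mask overlaps the event.

```c
#include <stdbool.h>
#include <stddef.h>

// Illustrative event bits, modeled on POLLIN/POLLOUT
#define DEMO_POLLIN  0x1
#define DEMO_POLLOUT 0x4

typedef struct wait_entry {
    unsigned int       key;     // events this waiter cares about
    bool               woken;
    struct wait_entry *next;
} wait_entry_t;

// Wake only the entries whose key overlaps the event; returns how many fired.
// In a real kernel, "wake" would mark the task runnable instead of a flag.
int wake_up_key(wait_entry_t *head, unsigned int event) {
    int n = 0;
    for (wait_entry_t *e = head; e != NULL; e = e->next) {
        if (e->key & event) {
            e->woken = true;
            n++;
        }
    }
    return n;
}
```

Waiters interested only in writability are never disturbed by a readability event, which is the point of the optimization.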
When processes of different priorities interact through sleep/wakeup, priority inversion can occur—a high-priority process waiting for a low-priority process.
The classic scenario: a low-priority process L holds a lock; a high-priority process H blocks waiting for that lock; a medium-priority process M then preempts L. M runs instead of L, so L never gets CPU time to release the lock, and H, despite its priority, waits behind M.
In the worst case, M runs indefinitely, and H (the highest priority) waits indefinitely. The Mars Pathfinder mission famously encountered this bug.
Solutions:
Priority Inheritance: When H waits for L, temporarily boost L's priority to H's level. Now L can preempt M and release the lock faster.
```c
// When high-priority waiter blocks on lock held by low-priority holder:
void boost_holder_priority(lock_holder, waiter_priority) {
    if (lock_holder->priority < waiter_priority) {
        // Temporarily raise holder's scheduling priority
        lock_holder->boosted_priority = waiter_priority;
        // Reschedule to apply new priority
        resched_task(lock_holder);
    }
}

// When holder releases lock, restore original priority
void restore_priority(task) {
    task->boosted_priority = task->original_priority;
}
```
Priority Ceiling: A lock is assigned a priority ceiling—the highest priority of any thread that might acquire it. Any thread holding the lock runs at ceiling priority. This prevents blocking by medium-priority tasks entirely.
RT-Mutex in Linux: Linux's real-time mutexes (rtmutex) implement priority inheritance. When a high-priority task blocks on an rtmutex, the holder's priority is boosted. This is transitive—if L is waiting on another lock, that holder is also boosted.
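The transitive boost can be sketched as a walk along the blocking chain: "waiter → lock holder → lock that holder waits on → …", raising each holder's effective priority along the way. The task and lock structures below are illustrative, not kernel types:

```c
#include <stddef.h>

typedef struct task {
    int          prio;        // effective (possibly boosted) priority
    struct lock *waiting_on;  // lock this task is blocked on, or NULL
} task_t;

typedef struct lock {
    task_t *holder;
} lock_t;

// Boost every holder in the blocking chain to at least waiter_prio.
// Stops early when a holder is already high enough (the chain beyond it
// was boosted when that holder blocked).
void boost_chain(lock_t *blocked_on, int waiter_prio) {
    for (lock_t *l = blocked_on; l != NULL && l->holder != NULL;
         l = l->holder->waiting_on) {
        if (l->holder->prio >= waiter_prio)
            break;
        l->holder->prio = waiter_prio;  // transitive boost
    }
}
```

If H (priority 90) blocks on a lock held by M (50), and M is itself blocked on a lock held by L (10), both M and L end up boosted to 90, so L can finally run and unwind the chain.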
User-space mutexes (pthread_mutex) typically don't implement priority inheritance by default. Using them with real-time threads requires the PTHREAD_PRIO_INHERIT protocol attribute. Failing to set it can cause priority inversion bugs that are intermittent and hard to diagnose.
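Requesting priority inheritance on a POSIX mutex uses the standard pthread attribute API; the wrapper below is a sketch of the typical setup (on Linux, such a mutex is backed by a PI futex):

```c
#define _GNU_SOURCE
#include <pthread.h>

// Initialize *m as a priority-inheritance mutex; returns 0 on success.
int init_pi_mutex(pthread_mutex_t *m) {
    pthread_mutexattr_t attr;
    int err = pthread_mutexattr_init(&attr);
    if (err)
        return err;
    // Without this call, the default protocol is PTHREAD_PRIO_NONE:
    // no inheritance, and the inversion described above is possible.
    err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (!err)
        err = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}
```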
A sleeping process may need to be removed from its wait before the awaited event occurs: a signal may arrive, a timeout may expire, the process may exit, or the thread may be cancelled.
The cleanup challenge: when a sleeper is forcibly awakened, it must remove itself from the wait queue, restore its state to RUNNING, and determine why it woke (event, signal, or timeout) before deciding whether to keep waiting.
```c
// Interruptible sleep with signal handling

long wait_event_interruptible(wq, condition) {
    DEFINE_WAIT(wait);
    long ret = 0;

    while (!condition) {
        prepare_to_wait(&wq, &wait, TASK_INTERRUPTIBLE);
        if (!condition) {
            schedule();
            // After waking, check why we woke
            if (signal_pending(current)) {
                // A signal arrived! Don't keep waiting.
                ret = -ERESTARTSYS;  // Or -EINTR
                break;
            }
        }
        // Otherwise, check condition again (might be spurious)
    }
    finish_wait(&wq, &wait);  // Always clean up!
    return ret;
}

// Process termination during sleep
void do_exit(int exit_code) {
    // ... lots of cleanup ...

    // Remove from any wait queues
    // (handled implicitly by scheduler during exit path)
    // Can't be left on wait queue - would have dangling task pointer
    remove_from_all_wait_queues(current);

    // ... rest of exit ...
}

// Timeout-based cleanup
long schedule_timeout(long timeout) {
    struct timer_list timer;

    // Set up timer to wake us
    setup_timer(&timer, process_timeout, (unsigned long)current);
    timer.expires = jiffies + timeout;
    add_timer(&timer);

    // Sleep
    schedule();

    // Woke up: either timer fired or something else woke us
    // Delete timer (may already have fired)
    del_timer_sync(&timer);

    // Return remaining time (0 if timeout)
    return timer.expires - jiffies;
}
```

The finish_wait guarantee:
finish_wait() is designed to handle all cleanup: it resets the task state to TASK_RUNNING and removes the wait entry from the queue if it is still linked, so it is safe to call regardless of whether the wakeup, a signal, or a timeout ended the wait.
This design makes the wait loop robust against any wakeup timing.
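The same robustness can be sketched in user space: re-check the condition in a loop, treat a cancel flag as the analogue of signal_pending(), and run the cleanup step unconditionally. Names below are illustrative:

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            condition;   // the awaited event
    bool            cancelled;   // stand-in for signal_pending()
    bool            cleaned_up;  // did finish_wait-style cleanup run?
} waiter_t;

// Returns 0 if the condition arrived, -1 if we were "signalled" out.
// Cleanup runs on every exit path, mirroring the finish_wait() contract.
int wait_event_demo(waiter_t *w) {
    int ret = 0;
    pthread_mutex_lock(&w->lock);
    while (!w->condition) {
        if (w->cancelled) {      // woke for a reason other than the event
            ret = -1;
            break;
        }
        pthread_cond_wait(&w->cond, &w->lock);
    }
    w->cleaned_up = true;        // the finish_wait() analogue: always runs
    pthread_mutex_unlock(&w->lock);
    return ret;
}
```

Whether the wait ends normally or is cancelled, the cleanup flag is set exactly once, and the caller learns why it returned from the return value.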
Cleanup points in Linux:
| Situation | Cleanup Mechanism |
|---|---|
| Normal wakeup | finish_wait() in the waiting code |
| Signal during wait | finish_wait() called after breaking loop |
| Process exit | do_exit() cleanup routines |
| Timeout | Timer callback wakes, finish_wait() cleans |
| Thread cancel | Cancellation handler must call finish_wait() equivalent |
TASK_KILLABLE is a compromise state: interruptible only by fatal signals (SIGKILL). This simplifies cleanup—you don't need to handle every possible signal gracefully, just know that the process might be killed. Used for kernel waits that shouldn't be interrupted by SIGINT but shouldn't be completely uninterruptible either.
Sleep/wakeup is on the critical path of many operations. Microseconds matter.
Optimization techniques used in production kernels: a fast path that checks the condition before touching the wait queue, branch-prediction hints (likely/unlikely), and brief adaptive spinning before committing to a sleep.
```c
// Optimized wait_event with fast-path

#define wait_event(wq, condition)                               \
do {                                                            \
    /* Fast path: condition already true, no wait needed */     \
    if (condition)                                              \
        break;                                                  \
    /* Slow path: need to actually wait */                      \
    __wait_event(wq, condition);                                \
} while (0)

// The fast path:
// - Checks condition (hopefully branch-predicted as likely)
// - If true, no function calls, no wait queue ops, no locks
// - Compiles to just a few instructions

// In hot paths (locks with low contention), the fast path
// dominates. Optimizing it is crucial.

// Another optimization: the __builtin_expect hint
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

#define wait_event_optimized(wq, condition)                     \
do {                                                            \
    if (likely(condition))  /* Condition usually true */        \
        break;                                                  \
    __wait_event(wq, condition);                                \
} while (0)

// This helps the CPU's branch predictor and instruction cache
```

| Operation | Approximate Latency | Notes |
|---|---|---|
| Fast-path (condition true) | ~5-20 ns | Just condition check + branch |
| Sleep + immediate wakeup (same CPU) | ~1-5 μs | State changes + scheduler |
| Sleep + wakeup (different CPU) | ~5-50 μs | Includes IPI overhead |
| Sleep + wakeup (CPU was idle) | ~50-200 μs | CPU must exit power-saving state |
| Context switch | ~1-10 μs | Register save/restore + TLB |
Some implementations spin briefly before sleeping, betting that the wakeup will come soon. If it does, they avoid the context switch cost. This is the adaptive spinning used by futex and many mutex implementations. The spin duration is tuned based on observed wait times.
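A spin-then-block wait can be sketched as follows. The spin budget here is an arbitrary illustrative constant; real implementations tune it from observed wait times:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SPIN_LIMIT 1000   // illustrative; real code tunes this adaptively

typedef struct {
    atomic_bool     ready;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} event_t;

void event_signal(event_t *e) {
    pthread_mutex_lock(&e->lock);
    atomic_store(&e->ready, true);
    pthread_cond_broadcast(&e->cond);
    pthread_mutex_unlock(&e->lock);
}

// Returns true if the fast (spinning) path sufficed, false if we had to
// fall back to a real blocking wait (the context-switch cost we wanted
// to avoid).
bool event_wait_adaptive(event_t *e) {
    for (int i = 0; i < SPIN_LIMIT; i++)
        if (atomic_load(&e->ready))
            return true;            // wakeup came soon: no context switch
    pthread_mutex_lock(&e->lock);   // slow path: actually sleep
    while (!atomic_load(&e->ready))
        pthread_cond_wait(&e->cond, &e->lock);
    pthread_mutex_unlock(&e->lock);
    return false;
}
```

If the event is already (or very soon) signalled, the waiter never touches the mutex or condition variable at all, which is the same bet futex-based locks make.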
Implementing sleep/wakeup correctly and efficiently requires addressing numerous challenges: atomicity of state changes, memory ordering across CPUs, interrupt-context constraints, wait queue management, priority inversion, cancellation and cleanup, and fast-path performance.
You now understand the implementation challenges that make sleep/wakeup complex in real operating systems. This knowledge helps you appreciate why kernel APIs are designed as they are, and why violating their contracts leads to subtle bugs. Next, we'll explore how sleep/wakeup integrates with the scheduler.