In the previous page, we explored how processes enter sleep—voluntarily yielding the CPU when they cannot make progress. A sleeping process is inert: it consumes no CPU cycles and waits passively for its awakening. But sleep without wakeup would mean processes sleep forever—a useless suspended animation.
The wakeup mechanism is the essential complement to sleep. It is the signal that tells a sleeping process: "The event you awaited has occurred. Time to resume." Wakeup transitions a process from dormancy back to candidacy for CPU time, returning it to the scheduler's consideration.
But wakeup is more than a simple state flip. It must choose the right waiters to rouse, hand them off to the scheduler correctly, and remain safe to call even from interrupt context.
This page provides a comprehensive examination of wakeup—the mechanism that restores sleeping processes to active life.
By the end of this page, you will understand: the fundamental concept and API of wakeup operations; the difference between waking one vs. all waiters; exclusive vs. non-exclusive wakeup; the internal implementation of wakeup; how wakeup integrates with the scheduler; and the critical role of wakeup in avoiding deadlock and ensuring progress.
Wakeup is the operation that transitions a sleeping process back to the runnable state. It is the inverse of sleep and forms the second half of the sleep/wakeup primitive pair.
Conceptually, wakeup does the following: it changes the target process's state from a sleep state back to TASK_RUNNING, places the process on a run queue, and checks whether the newly runnable process should preempt whatever is currently executing there.
The semantic contract:
Wakeup establishes a guarantee: after wakeup is called, the target process will eventually be scheduled (assuming no further sleeps). Wakeup doesn't mean immediate execution, only eligibility for scheduling. The actual moment of resumption depends on the scheduling policy, the awakened process's priority, and the current load on the CPUs.
Unlike sleep, wakeup never blocks the caller. It's a quick state manipulation—typically a few dozen instructions. The waker continues immediately after calling wakeup, regardless of when or whether the awakened process actually runs. This asymmetry is fundamental: sleep blocks, wakeup notifies.
There are fundamentally two ways to specify who should be awakened:
1. Direct wakeup (by task reference): The waker has a direct pointer to the sleeping task's control block (task_struct in Linux) and awakens that specific process:
int wake_up_process(struct task_struct *p);
This is used when the waker already holds a reference to the specific task that should run, such as kernel code that manages a dedicated helper thread.
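For illustration, here is a minimal sketch of this pattern. The names worker_task, work_pending(), process_work(), and enqueue_work() are hypothetical helpers, not real kernel APIs:
// Sketch: direct wakeup of a known kernel thread.
static struct task_struct *worker_task;   // assumed: set when thread is created

static int worker_fn(void *data) {
    while (!kthread_should_stop()) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (!work_pending())          // assumed helper: anything queued?
            schedule();               // sleep until woken directly
        __set_current_state(TASK_RUNNING);
        process_work();               // assumed helper: do the work
    }
    return 0;
}

// Producer side: we hold a pointer to exactly the task to wake
void submit_work(void) {
    enqueue_work();                   // assumed helper: queue an item
    wake_up_process(worker_task);     // direct wakeup by task reference
}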
2. Wait queue wakeup (by event): The waker doesn't know or care which specific processes are waiting. It signals an event, and all relevant waiters are awakened:
int wake_up(wait_queue_head_t *q);
This is used when the identity of the waiters doesn't matter, only the event does, for example an I/O completion on which any number of processes may be blocked.
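A sketch of the event-based pattern, with assumed helpers (read_wq, mark_data_ready, data_ready are illustrative names):
// Sketch: event-based wakeup; the completion side never names a task.
static DECLARE_WAIT_QUEUE_HEAD(read_wq);

// Completion side: signal the event; who is waiting is irrelevant
void data_arrived(void) {
    mark_data_ready();                 // assumed helper: make the condition true
    wake_up(&read_wq);                 // wake whoever waits on this event
}

// Waiter side: sleep until the condition holds
int wait_for_data(void) {
    return wait_event_interruptible(read_wq, data_ready());
}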
Wait queue wakeup strategies:
When waking by event (wait queue), the kernel offers several strategies:
| Function | Behavior | Use Case |
|---|---|---|
| wake_up() | Wake all non-exclusive + one exclusive | Standard wakeup: handles both cases |
| wake_up_all() | Wake every waiter, exclusive or not | Condition affecting all (e.g., shutdown) |
| wake_up_nr(wq, n) | Wake up to n exclusive waiters | Resource with multiple available slots |
| wake_up_interruptible() | Only wake TASK_INTERRUPTIBLE waiters | Don't disturb uninterruptible I/O |
| wake_up_sync() | Wake but don't reschedule immediately | Batching multiple wakeups for efficiency |
| wake_up_locked() | Wake while already holding queue lock | Avoid double-lock when already protected |
The wake_up_nr(wq, n) variant wakes up to n exclusive waiters. This is useful for resources that become available in batches: for example, if 5 buffer slots become free, wake 5 waiters. This is more efficient than wake_up_all when you know how many waiters can actually proceed, as sketched below.
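A minimal sketch of the batch case, assuming a hypothetical buffer_wq and a remove_n_items() helper:
// Sketch: n slots become free at once, so wake up to n exclusive waiters
static DECLARE_WAIT_QUEUE_HEAD(buffer_wq);

void consume_batch(int n) {
    remove_n_items(n);            // assumed helper: frees n buffer slots
    wake_up_nr(&buffer_wq, n);    // at most n exclusive producers proceed
}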
A critical distinction in wait queue design is whether waiters are exclusive or non-exclusive. This affects how many processes are awakened.
Non-exclusive waiters (default): every non-exclusive waiter on the queue is awakened by a wakeup. Use this when all waiters can make progress from the same event.
Exclusive waiters: a standard wakeup stops after waking one exclusive waiter (or up to n, with wake_up_nr). Use this when only a limited number of waiters can make progress.
The thundering herd problem:
If 100 processes wait for a lock and you wake all of them when the lock is released, 99 will immediately go back to sleep (only one can acquire the lock). This wastes CPU cycles on 99 pointless context switches. Exclusive wakeup avoids this—only one is awakened, and that one acquires the lock.
How wake_up handles mixed queues:
A wait queue can contain both exclusive and non-exclusive waiters. The standard wake_up() walks the queue in order, waking every non-exclusive waiter it encounters, and stops after it wakes the first exclusive waiter. This means any waiters queued behind that first exclusive waiter, exclusive or not, stay asleep; exclusive waiters are therefore added at the tail of the queue so they don't shadow non-exclusive waiters in front of them.
The practical implications are easiest to see in a worked example:
// Wait queue with mixed waiters
// Position:  1   2   3   4   5
// Type:      NE  NE  EX  EX  NE
// Process:   P1  P2  P3  P4  P5

// Linux wake_up() behavior:
// 1. NE waiters at front: wake P1, P2 (non-exclusive)
// 2. First EX waiter: wake P3, then STOP
// 3. P4 (EX) and P5 (NE) remain sleeping

// wake_up_all() behavior:
// Wake P1, P2, P3, P4, P5 (everyone)

// wake_up_nr(wq, 2) behavior (for exclusive):
// 1. Wake P1, P2 (non-exclusive, always woken)
// 2. Wake P3 (first exclusive)
// 3. Wake P4 (second exclusive)
// 4. P5 remains (exhausted the n=2 exclusive count)

Setting exclusive mode:
When adding to a wait queue, processes indicate whether they're exclusive:
// Non-exclusive wait (default)
add_wait_queue(wq, &wait);
// Exclusive wait
add_wait_queue_exclusive(wq, &wait);
// Or set the flag directly
wait.flags |= WQ_FLAG_EXCLUSIVE;
prepare_to_wait_exclusive(wq, &wait, TASK_INTERRUPTIBLE);
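Putting these pieces together, a typical exclusive wait loop looks roughly like the sketch below; resource_wq, resource_available(), and claim_resource() are assumed names for illustration:
// Sketch: full exclusive wait loop using the prepare/finish API
static DECLARE_WAIT_QUEUE_HEAD(resource_wq);

int acquire_resource(void) {
    DEFINE_WAIT(wait);

    for (;;) {
        // Re-queue (exclusively) and set our state on every iteration
        prepare_to_wait_exclusive(&resource_wq, &wait, TASK_INTERRUPTIBLE);
        if (resource_available())        // assumed condition check
            break;
        if (signal_pending(current)) {   // interruptible sleep: honor signals
            finish_wait(&resource_wq, &wait);
            return -ERESTARTSYS;
        }
        schedule();                      // actually sleep here
    }
    finish_wait(&resource_wq, &wait);    // dequeue and set TASK_RUNNING
    claim_resource();                    // assumed helper: take the resource
    return 0;
}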
Guidelines for choosing:
| Scenario | Wakeup Mode | Reason |
|---|---|---|
| Lock/mutex acquisition | Exclusive | Only one can hold the lock |
| Condition variable signal | Exclusive | Typically only one waiter can consume each signal |
| Shared resource becomes available | Non-exclusive | Multiple can use simultaneously |
| Broadcast notification | Non-exclusive | All interested parties should know |
| Semaphore with count N | Wake N exclusive | Exactly N can proceed |
Let's examine the internal implementation of wakeup to understand precisely what happens when a process is awakened.
The wakeup call chain (Linux):
// High-level wakeup: wake all non-exclusive + one exclusive
void wake_up(wait_queue_head_t *wq) {
    unsigned long flags;
    spin_lock_irqsave(&wq->lock, flags);
    __wake_up_common(wq, TASK_NORMAL, 1, 0, NULL);
    spin_unlock_irqrestore(&wq->lock, flags);
}

// Core wakeup logic
static void __wake_up_common(wait_queue_head_t *wq,
                             unsigned int mode,   // Which states to wake
                             int nr_exclusive,    // How many exclusive
                             int wake_flags,      // Sync, etc.
                             void *key) {
    wait_queue_entry_t *entry, *next;

    list_for_each_entry_safe(entry, next, &wq->head, entry) {
        unsigned flags = entry->flags;

        // Call the waiter's wake function
        // Default: default_wake_function
        int ret = entry->func(entry, mode, wake_flags, key);
        if (ret < 0)
            break;  // Stop iteration

        // If exclusive, decrement the counter; stop once enough are woken
        if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;  // Woken enough exclusives
    }
}

// The actual process wakeup
int default_wake_function(wait_queue_entry_t *entry, unsigned mode,
                          int sync, void *key) {
    struct task_struct *p = entry->private;
    return try_to_wake_up(p, mode, sync);
}

// Core of process wakeup (simplified)
int try_to_wake_up(struct task_struct *p, unsigned int state,
                   int wake_flags) {
    unsigned long flags;
    int cpu, success = 0;

    raw_spin_lock_irqsave(&p->pi_lock, flags);

    // Check if the process is in an acceptable sleep state
    if (!(p->state & state))
        goto out;  // Not sleeping in a matching state

    // Atomically change state to TASK_RUNNING
    p->state = TASK_RUNNING;
    success = 1;

    // Find a CPU to add to a run queue
    // Could be the current CPU, the previous CPU (for cache affinity),
    // or an idle CPU (for load balancing)
    cpu = select_task_rq(p, p->wake_cpu, wake_flags);

    // Add to that CPU's run queue
    // Uses the task's scheduler class (CFS, RT, etc.)
    activate_task(cpu_rq(cpu), p, ENQUEUE_WAKEUP);

    // Check if we should preempt the current task on that CPU
    check_preempt_curr(cpu_rq(cpu), p, wake_flags);

out:
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    return success;
}

Key implementation details:
1. Spin lock protection:
The wait queue spinlock is held during the wakeup traversal. This prevents races with concurrent add/remove operations. The irqsave variant is used because wakeup can be called from interrupt context (e.g., timer handler, I/O completion).
2. list_for_each_entry_safe: This safe iteration macro handles the case where the current entry is removed during iteration—which can happen if the wakeup function removes itself.
3. Custom wake functions:
Each wait entry has a func pointer (usually default_wake_function). This allows custom wakeup logic—for example, poll() and epoll() use custom functions to aggregate multiple events.
4. The mode parameter:
This is a bitmask of states to wake: TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, or both (TASK_NORMAL). wake_up_interruptible() only wakes TASK_INTERRUPTIBLE processes.
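These variants differ mainly in the state mask and exclusive count they pass down; the kernel's wrappers in <linux/wait.h> are essentially thin macros (shown here slightly simplified):
// Simplified forms of the kernel's wake_up wrapper macros
#define wake_up(x)               __wake_up(x, TASK_NORMAL, 1, NULL)
#define wake_up_nr(x, nr)        __wake_up(x, TASK_NORMAL, nr, NULL)
#define wake_up_all(x)           __wake_up(x, TASK_NORMAL, 0, NULL)  // 0 = no exclusive limit
#define wake_up_interruptible(x) __wake_up(x, TASK_INTERRUPTIBLE, 1, NULL)
Note that nr_exclusive = 0 means "wake all": in __wake_up_common above, !--nr_exclusive never becomes true, so the loop never stops early.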
5. CPU selection for wakeup:
select_task_rq chooses which CPU's run queue receives the awakened process. Factors include cache affinity (the CPU the process last ran on may still hold its data in cache), the current load on each candidate CPU, and whether an idle CPU is available for load balancing.
6. Preemption check: After enqueuing, the kernel checks if the newly runnable process has higher priority than whatever is currently running on the target CPU. If so, it sets a "need resched" flag, causing preemption at the next opportunity.
For very long wait queues, Linux uses a bookmark entry to track progress. This allows the lock to be dropped periodically during wakeup traversal, preventing excessive lock hold times. The bookmark is a dummy entry that marks where to resume iteration.
Wakeup doesn't exist in isolation—it's intimately connected with the scheduler. Understanding this relationship is crucial for performance and correctness.
The scheduler's role in wakeup:
When try_to_wake_up succeeds, it must integrate the newly runnable process with the scheduler. This involves selecting a run queue, enqueuing the task through its scheduler class, adjusting its virtual runtime, and checking whether it should preempt the currently running task.
CFS sleep bonus:
The Completely Fair Scheduler gives sleeping processes a boost when they wake up. The intuition: a process that slept is likely interactive (waiting for user input) and should get responsive scheduling.
// On wakeup, CFS places the task with a slight vruntime advantage,
// capped so a long sleeper cannot fall arbitrarily far behind the queue.
// This ensures recently-sleeping tasks get scheduled quickly.
vruntime = max(vruntime, min_vruntime - sleep_bonus)
This "sleeper fairness" prevents interactive processes from being starved by CPU-bound processes that never sleep.
Wakeup latency:
The time from wakeup call to actual process execution is called wakeup latency. It depends on:
| Factor | Impact |
|---|---|
| Same CPU vs. remote | Same CPU: ~1-5μs; Remote: ~10-50μs (IPI overhead) |
| Current process priority | If lower, preemption adds ~1-10μs |
| Interrupt context | May need to defer to scheduler tick |
| RT vs. normal process | RT gets immediate preemption |
| CPU power state | If idle/deep sleep: ~10-100μs to wake CPU |
When you're about to yield the CPU anyway (e.g., you wake someone then immediately sleep yourself), use wake_up_sync(). This skips the preemption check and avoids unnecessary reschedule IPIs. The awakened process will be scheduled when you naturally yield, saving overhead.
// Suboptimal: might trigger premature preemption
void producer_thread() {
    // Produce data
    buffer[idx] = data;

    // Wake consumer (might preempt us immediately)
    wake_up(&consumer_wq);

    // We were about to sleep anyway!
    wait_event_interruptible(producer_wq, !buffer_full);
}

// Optimal: use sync wakeup
void producer_thread_optimized() {
    buffer[idx] = data;

    // Wake consumer but don't trigger an immediate reschedule
    wake_up_sync(&consumer_wq);

    // Now we sleep - the scheduler will pick the consumer if appropriate
    wait_event_interruptible(producer_wq, !buffer_full);
}

A significant portion of wakeups occur from interrupt context: hardware signals that an awaited event has occurred, and the handler performs the wakeup. Interrupt handlers must follow special rules.
Why interrupts trigger wakeups: the events that sleeping processes wait for (disk I/O completion, timer expiry, arriving network data) are announced by hardware as interrupts, so the interrupt handler is the natural place to call wake_up.
Constraints in interrupt context:
Interrupt handlers run with interrupts disabled or at elevated priority. They must never sleep, must not take locks that can block, and must finish quickly so other interrupts and processes are not delayed.
Safe wakeup from interrupts:
Wakeup is specifically designed to be safe from interrupt context:
// All wakeup variants are interrupt-safe
wake_up(); // Safe
wake_up_interruptible(); // Safe
wake_up_all(); // Safe
wake_up_process(p); // Safe
Internally, these use spin_lock_irqsave to handle the nested interrupt case—an interrupt handler that's itself interrupted.
The deferred work pattern:
Sometimes the interrupt handler needs to do more than just wakeup—it needs to copy data, update statistics, or perform complex logic. But it must stay fast. The solution is deferred work:
// Disk I/O completion interrupt handler (simplified)
irqreturn_t disk_irq_handler(int irq, void *dev_id) {
    struct request *req = get_completed_request(dev_id);

    // Mark the request complete (very fast)
    req->status = SUCCESS;
    req->bytes_transferred = req->length;

    // Wake the waiting process
    wake_up(&req->wait_queue);

    // That's it! The handler exits quickly;
    // the awakened process does any heavy processing.
    return IRQ_HANDLED;
}
Modern kernels support threaded interrupt handlers (request_threaded_irq). The interrupt triggers a kernel thread rather than running in raw interrupt context. This thread can sleep, take blocking locks, and do extensive processing—while still waking other processes. This simplifies driver development at a small latency cost.
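A sketch of the threaded pattern, with hypothetical device helpers (device_raised_irq, process_device_data, and waiters are illustrative names):
// Hard-IRQ half: runs in interrupt context, must be fast
static irqreturn_t fast_check(int irq, void *dev_id) {
    if (!device_raised_irq(dev_id))   // assumed helper: is it our device?
        return IRQ_NONE;              // not ours; let other handlers run
    return IRQ_WAKE_THREAD;           // defer the real work to the thread
}

// Threaded half: runs in a kernel thread, may sleep and take mutexes
static irqreturn_t handler_thread(int irq, void *dev_id) {
    process_device_data(dev_id);      // assumed helper: heavy processing
    wake_up(&waiters);                // and still wake sleeping processes
    return IRQ_HANDLED;
}

// Registration (e.g., in the driver's probe function):
// the kernel creates and manages the handler thread for us
int mydev_setup_irq(int irq, void *dev) {
    return request_threaded_irq(irq, fast_check, handler_thread,
                                IRQF_ONESHOT, "mydev", dev);
}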
A spurious wakeup occurs when a process wakes up even though the condition it was waiting for is not true. This is not a bug—it's an expected possibility that code must handle.
Sources of spurious wakeups include broadcast wakeups (wake_up_all) where another waiter consumes the condition first, races in which the condition becomes false again between the wakeup and the process actually running, multiple events sharing one wait queue, and signal delivery to interruptible sleepers.
if (condition) { ... } else { sleep; } — This pattern FAILS on spurious wakeups. After waking, the condition is not rechecked, leading to proceeding with false assumptions. ALWAYS use while (!condition) { sleep; } to re-check after every wakeup.
// WRONG: Susceptible to spurious wakeups
void consumer_wrong() {
    // Check once, then sleep if not ready
    if (buffer_empty()) {
        wait_event_interruptible(wq, !buffer_empty());
    }
    // BUG: If spuriously awakened but still empty, we proceed incorrectly
    item = buffer[--count];  // Potential underflow!
}

// CORRECT: Always re-check the condition
void consumer_correct() {
    // Loop until the condition is actually true
    while (buffer_empty()) {
        int ret = wait_event_interruptible(wq, !buffer_empty());
        if (ret == -ERESTARTSYS) {
            // Woken by a signal - handle appropriately and bail out
            handle_signal();
            return;
        }
        // If we get here without error, check the condition again
        // The while loop naturally re-checks
    }
    // Guaranteed: buffer is not empty
    item = buffer[--count];  // Safe!
}

// The wait_event macro does this automatically
// This is CORRECT (wait_event loops internally)
void consumer_wait_event() {
    wait_event_interruptible(wq, !buffer_empty());
    // wait_event only returns when the condition is true
    // (or when interrupted by a signal)
    item = buffer[--count];
}

Why spurious wakeups are tolerated:
You might wonder: why not prevent all spurious wakeups? The answer is efficiency and simplicity: guaranteeing condition-accurate, exactly-once wakeups would require the kernel to track and re-verify every waiter's condition on every wakeup, adding cost to a hot path, while letting each waiter re-check its condition in a loop costs almost nothing.
The design philosophy is: make wakeup cheap and simple, require waiters to be robust.
The POSIX spec for pthread_cond_wait() explicitly states that spurious wakeups may occur. This is why all correct condition variable usage wraps the wait in a while loop. This isn't a limitation—it's a deliberate API contract.
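The same discipline applies in user-space C. A minimal sketch of the canonical pattern (all names here are illustrative):
#include <pthread.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int ready = 0;                         // the condition, protected by 'lock'

void *waiter(void *arg) {
    pthread_mutex_lock(&lock);
    while (!ready)                     // loop: tolerates spurious wakeups
        pthread_cond_wait(&cond, &lock);
    // Here 'ready' is guaranteed true, and we hold the lock
    pthread_mutex_unlock(&lock);
    return NULL;
}

void signaler(void) {
    pthread_mutex_lock(&lock);
    ready = 1;                         // make the condition true first
    pthread_cond_signal(&cond);        // then wake one waiter
    pthread_mutex_unlock(&lock);
}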
The wakeup mechanism completes the sleep/wakeup pair, enabling efficient blocking synchronization. To consolidate the essential concepts: wakeup makes a sleeping process runnable and eligible for scheduling rather than running it immediately; exclusive wakeup prevents thundering herds; every wakeup variant is safe to call from interrupt context; and waiters must re-check their condition in a loop, because spurious wakeups are part of the contract.
You now understand the wakeup mechanism—how sleeping processes are restored to activity, the different wakeup strategies, and the critical integration with the scheduler. Next, we'll examine what happens when wakeup and sleep are improperly coordinated: the infamous lost wakeup problem.