A sleeping process exists in suspension, consuming no CPU cycles, waiting for an event that may be microseconds or hours away. When that event finally occurs—the disk completes a read, a network packet arrives, a lock becomes available, or a timer expires—the process must be awakened and returned to the pool of runnable processes.
The Wake transition (also called the Unblock or Wakeup transition) moves a process from the Waiting state to the Ready state. It is the counterpart to the Block transition, completing the cycle that allows processes to efficiently wait for external events. Without wakeup, blocked processes would sleep forever.
This transition is almost always initiated by something other than the sleeping process itself—an interrupt handler, another process releasing a lock, or a timer subsystem. Understanding wakeup mechanisms is essential for understanding how I/O completion propagates through the system.
By the end of this page, you will understand the complete Wake transition—from the event that triggers wakeup through the kernel's wake mechanisms, wait queue processing, potential preemption, and reinsertion into the Ready queue. You'll also learn about wakeup optimization techniques and common problems like the thundering herd.
A process wakes when the event it was waiting for occurs. The nature of this event depends on why the process blocked in the first place.
| Blocked Waiting For | Woken By | Context of Wakeup |
|---|---|---|
| Disk I/O | Disk interrupt handler | Interrupt context (fast, limited) |
| Network data | Network softirq | Softirq context (deferred interrupt) |
| Mutex lock | Lock owner calling unlock() | Process context (full capabilities) |
| Semaphore | Another process calling sem_post() | Process context |
| Child exit | Child's do_exit() path | Process context |
| sleep()/usleep() | Timer interrupt + hrtimer subsystem | Timer softirq context |
| Condition variable | pthread_cond_signal/broadcast | Process context |
| select/poll/epoll | Any monitored FD becomes ready | Various (interrupt, process) |
Wakeup context matters:
The context in which wakeup occurs affects what the wakeup code can do:
Interrupt Context: cannot sleep, cannot touch user memory, and must finish quickly. It can set flags and call wake_up(), but heavier work is deferred.
Process Context: normal kernel execution on behalf of a process. It can sleep, take mutexes, and access user memory.
Hard interrupt handlers run with interrupts disabled on that CPU—they must be extremely fast. Softirqs run with interrupts enabled but in interrupt context—no sleeping. Process context is normal kernel execution with full capabilities. Many wakeups are triggered in hard interrupt, deferred to softirq for the actual wake_up() call, then the woken process runs later in process context.
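As a concrete illustration, here is a minimal, hypothetical driver sketch (the mydev names are invented for this example): the hard interrupt handler only records the event and calls wake_up(), which never sleeps and is safe in interrupt context, while the reader blocks in process context with wait_event_interruptible().

```c
#include <linux/interrupt.h>
#include <linux/sched.h>
#include <linux/wait.h>

struct mydev {
    wait_queue_head_t wq;        /* initialized elsewhere with init_waitqueue_head() */
    bool data_ready;
};

/* Hard IRQ context: interrupts disabled on this CPU, no sleeping allowed.
 * Publishing a flag and calling wake_up() is fine; wake_up() never sleeps. */
static irqreturn_t mydev_irq(int irq, void *cookie)
{
    struct mydev *dev = cookie;

    dev->data_ready = true;      /* the event the reader is waiting for */
    wake_up(&dev->wq);           /* Waiting -> Ready for any sleeper    */
    return IRQ_HANDLED;
}

/* Process context: full capabilities, so blocking is allowed.
 * wait_event_interruptible() rechecks the condition around each sleep. */
static int mydev_wait_for_data(struct mydev *dev)
{
    return wait_event_interruptible(dev->wq, dev->data_ready);
}
```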
Waking a process involves several precise steps: locating the sleeping process in its wait queue, changing its state, removing it from the wait queue, and adding it to the appropriate ready queue. Let's examine this mechanism in detail.
```c
// Linux wakeup implementation
// Simplified from kernel/sched/core.c and kernel/sched/wait.c

// Main wakeup entry point
void wake_up(wait_queue_head_t *wq) {
    __wake_up(wq, TASK_NORMAL, 1);
}

void wake_up_all(wait_queue_head_t *wq) {
    __wake_up(wq, TASK_NORMAL, 0);   // nr=0 means "wake all"
}

void __wake_up(wait_queue_head_t *wq, unsigned int mode, int nr_exclusive) {
    unsigned long flags;

    // Lock the wait queue
    spin_lock_irqsave(&wq->lock, flags);

    // Walk the list of waiters
    __wake_up_common(wq, mode, nr_exclusive);

    spin_unlock_irqrestore(&wq->lock, flags);
}

void __wake_up_common(wait_queue_head_t *wq, unsigned int mode, int nr_exclusive) {
    wait_queue_entry_t *curr, *next;

    list_for_each_entry_safe(curr, next, &wq->head, entry) {
        // Call the waiter's callback function
        unsigned flags = curr->flags;
        int ret = curr->func(curr, mode, 0, NULL);

        if (ret < 0)
            break;                           // Stop waking

        // Handle exclusive waiters
        if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;                           // Only wake nr_exclusive exclusive waiters
    }
}

// Default wake function - actually wakes the process
int default_wake_function(wait_queue_entry_t *wait, unsigned mode,
                          int flags, void *key) {
    return try_to_wake_up(wait->private, mode);
}

// Core wakeup logic
int try_to_wake_up(struct task_struct *p, unsigned int state) {
    unsigned long flags;
    int cpu, success = 0;

    raw_spin_lock_irqsave(&p->pi_lock, flags);

    // Check if the process is in the right state to wake
    if (!(p->state & state))
        goto out;                            // Not in a wakeable state

    success = 1;

    // Change process state to RUNNING (Ready)
    p->state = TASK_RUNNING;

    // Select a CPU for the woken process
    cpu = select_task_rq(p);

    // Add to that CPU's run queue
    activate_task(cpu_rq(cpu), p);

    // Check if the woken task should preempt the current task on the target CPU
    if (p->prio < cpu_curr(cpu)->prio)
        resched_curr(cpu_rq(cpu));

out:
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    return success;
}
```

Almost all wakeup paths eventually call try_to_wake_up(). This function handles the state transition, run queue insertion, and potential preemption. Whether the wakeup comes from I/O completion, lock release, or signal delivery, everything funnels through this function, so performance optimizations here benefit the entire system.
I/O completion is one of the most common wakeup triggers. When a disk read finishes, a network packet arrives, or a USB device responds, the device generates an interrupt that eventually leads to waking the blocked process.
```c
// Example: Disk I/O completion wakeup path
// Simplified from actual block layer code

// 1. Interrupt handler (runs in hard IRQ context)
irqreturn_t disk_interrupt(int irq, void *dev_id) {
    struct disk_device *disk = dev_id;
    struct request *req;

    // Acknowledge the interrupt to the hardware
    disk_ack_interrupt(disk);

    // Get the completed request
    req = disk_get_completed_request(disk);

    // Mark the request as complete
    req->status = IO_COMPLETE;
    req->error = check_for_errors(disk);

    // Schedule bottom-half processing
    raise_softirq(BLOCK_SOFTIRQ);

    return IRQ_HANDLED;
}

// 2. Softirq handler (runs in softirq context)
void block_softirq_handler(void) {
    struct request *req;

    while ((req = get_completed_request())) {
        // Call the request's completion callback
        req->end_io(req);
    }
}

// 3. Request completion callback
void bio_end_io(struct bio *bio) {
    struct kiocb *iocb = bio->bi_private;

    // Signal I/O completion
    if (iocb->ki_complete) {
        // Async I/O - invoke the callback
        iocb->ki_complete(iocb, bio->bi_status);
    } else {
        // Sync I/O - wake the waiter
        struct task_struct *waiter = iocb->private;

        WRITE_ONCE(iocb->done, true);  // simplified completion flag, not a real kiocb field

        // THE CRITICAL WAKEUP CALL
        wake_up_process(waiter);
        // Process moves: Waiting -> Ready
    }
}

// 4. Back in the original read() syscall
ssize_t sync_read(struct file *file, char *buf, size_t len) {
    struct kiocb iocb;
    struct bio *bio;

    init_sync_kiocb(&iocb, file);
    iocb.private = current;            // Store the current process for wakeup

    // Build the block request for this read (hypothetical helper, details omitted)
    bio = alloc_and_map_bio(file, buf, len);
    bio->bi_private = &iocb;

    // Submit I/O
    submit_bio(bio);

    // Block until I/O completes
    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);
        if (READ_ONCE(iocb.done))      // Check if the I/O is done
            break;
        schedule();                    // Sleep - woken by bio_end_io()
    }
    set_current_state(TASK_RUNNING);

    // Copy data to the user buffer (simplified)
    if (copy_to_user(buf, iocb.data, len))
        return -EFAULT;
    return len;
}
```

Notice how the hard interrupt handler doesn't call wake_up() directly; it schedules a softirq instead. wake_up() itself is safe in hard interrupt context, but request completion can involve a fair amount of work (per-request callbacks, error handling), so the block layer defers it to keep the hard handler short. The softirq runs shortly after, with interrupts enabled, and performs the actual wakeup.
When a process releases a lock that others are waiting for, one or more waiters must be woken. The mechanism differs slightly from I/O wakeup because it occurs in process context, not interrupt context.
```c
// Mutex wakeup in Linux kernel
// Simplified from kernel/locking/mutex.c

struct mutex {
    atomic_long_t owner;          // Current owner (or 0 if free)
    spinlock_t wait_lock;         // Protects wait_list
    struct list_head wait_list;   // Waiters
};

void mutex_unlock(struct mutex *lock) {
    // Fast path: no waiters
    if (atomic_long_cmpxchg(&lock->owner, (long)current, 0) == (long)current)
        return;   // Unlocked, no one waiting

    // Slow path: need to wake someone
    __mutex_unlock_slowpath(lock);
}

void __mutex_unlock_slowpath(struct mutex *lock) {
    struct task_struct *next = NULL;
    struct mutex_waiter *waiter;
    unsigned long flags;

    spin_lock_irqsave(&lock->wait_lock, flags);

    if (!list_empty(&lock->wait_list)) {
        // Get the first waiter (FIFO order typically)
        waiter = list_first_entry(&lock->wait_list,
                                  struct mutex_waiter, list);

        // Remove from the wait list
        list_del(&waiter->list);
        next = waiter->task;

        // Transfer ownership atomically (lock handoff)
        atomic_long_set(&lock->owner, (long)next);
    } else {
        // No waiters, just clear the owner
        atomic_long_set(&lock->owner, 0);
    }

    spin_unlock_irqrestore(&lock->wait_lock, flags);

    if (next) {
        // Wake up the next owner
        wake_up_process(next);
        // The woken process now owns the mutex and is Ready
    }
}

// The waiting side
void mutex_lock(struct mutex *lock) {
    // Fast path: the lock is free
    if (atomic_long_cmpxchg(&lock->owner, 0, (long)current) == 0)
        return;   // Got the lock

    // Slow path: need to wait
    __mutex_lock_slowpath(lock);
}

void __mutex_lock_slowpath(struct mutex *lock) {
    struct mutex_waiter waiter;
    unsigned long flags;

    waiter.task = current;

    spin_lock_irqsave(&lock->wait_lock, flags);
    list_add_tail(&waiter.list, &lock->wait_list);
    spin_unlock_irqrestore(&lock->wait_lock, flags);

    // (Simplified: the real code rechecks the lock before sleeping so an
    //  unlock that raced with our enqueue cannot leave us sleeping forever.)
    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);

        // Check if we got ownership (unlock woke us with ownership)
        if (atomic_long_read(&lock->owner) == (long)current)
            break;

        schedule();   // Sleep until woken by unlock
    }
    set_current_state(TASK_RUNNING);
    // We now own the mutex
}
```

Lock handoff optimization:
Modern mutex implementations often use lock handoff rather than a simple wakeup: the unlocking thread selects the first waiter, transfers ownership to it before dropping the wait-queue lock, and only then wakes it (exactly what __mutex_unlock_slowpath does above).
Without handoff, a woken waiter might lose the lock race to a CPU that just tried to acquire it, leading to potential starvation of the waiter.
| Primitive | Wakeup Strategy | Reason |
|---|---|---|
| Mutex | Wake one (first waiter) | Only one can hold the lock |
| Read-Write Lock (write unlock) | Wake one writer OR all readers | Readers can share |
| Semaphore | Wake one per count increase | Controlled by count |
| Condition Variable (signal) | Wake one | Typically for one-at-a-time |
| Condition Variable (broadcast) | Wake all | All should reevaluate condition |
| File I/O (device ready) | Wake all waiters | All can proceed with I/O |
A process may be woken even when the condition it was waiting for isn't met; this is called a spurious wakeup. It can happen because of implementation details (for example, racing wakeups on a shared wait queue) or signal delivery. This is why wait loops always recheck the condition after waking: `while (!condition) { sleep(); }`
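The same rule applies in userspace. A minimal POSIX condition-variable sketch (error handling omitted) shows why the wait must sit inside a while loop:

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool data_ready = false;

void *consumer(void *arg)
{
    pthread_mutex_lock(&lock);
    // while, not if: pthread_cond_wait() may return spuriously,
    // so the predicate must be rechecked every time we wake.
    while (!data_ready)
        pthread_cond_wait(&cond, &lock);   // atomically unlocks, sleeps, relocks
    /* ... consume the data ... */
    pthread_mutex_unlock(&lock);
    return NULL;
}

void *producer(void *arg)
{
    pthread_mutex_lock(&lock);
    data_ready = true;
    pthread_cond_signal(&cond);            // wake one waiter
    pthread_mutex_unlock(&lock);
    return NULL;
}
```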
Timers are fundamental to many blocking operations. When a process calls sleep(), nanosleep(), or poll() with a timeout, the kernel sets a timer. When the timer expires, the process is woken.
```c
// Timer-based wakeup implementation
// Simplified from kernel/time/hrtimer.c and kernel/time/timer.c

// The sleep() implementation using timers
unsigned int __sched sleep(unsigned int seconds) {
    struct hrtimer_sleeper sleeper;
    ktime_t expire = ktime_add_ns(ktime_get(), seconds * NSEC_PER_SEC);

    // Initialize the high-resolution timer
    hrtimer_init_sleeper(&sleeper, current);
    sleeper.timer.function = hrtimer_wakeup;

    // Set the expiration time
    hrtimer_set_expires(&sleeper.timer, expire);

    // Start the timer
    hrtimer_start(&sleeper.timer, expire, HRTIMER_MODE_ABS);

    // Sleep until the timer fires (or a signal arrives)
    set_current_state(TASK_INTERRUPTIBLE);
    if (sleeper.task)        // Still valid (not yet expired)
        schedule();          // Block here

    // Woken up - clean up the timer
    hrtimer_cancel(&sleeper.timer);

    // Return the remaining time if interrupted early
    return calculate_remaining(expire);
}

// Timer callback - the wakeup function
enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer) {
    struct hrtimer_sleeper *sleeper =
        container_of(timer, struct hrtimer_sleeper, timer);
    struct task_struct *task = sleeper->task;

    // Mark the sleeper as expired
    sleeper->task = NULL;

    // Wake the sleeping process
    if (task)
        wake_up_process(task);

    return HRTIMER_NORESTART;   // One-shot timer
}

// How timers are processed (timer softirq)
void run_timer_softirq(void) {
    struct timer_base *base = this_cpu_ptr(&timer_bases);

    while (time_after_eq(jiffies, base->clk)) {
        // Find and remove expired timers
        struct timer_list *timer;

        while ((timer = find_expired_timer(base))) {
            // Call the timer's callback function
            // This may call wake_up_process()
            timer->function(timer);
        }
        base->clk++;
    }
}
```

Two types of kernel timers:
Traditional Timers (timer_list): jiffies-based and roughly millisecond-resolution, organized in a per-CPU timer wheel; cheap to arm and cancel, which makes them well suited to timeouts that usually never expire (network retransmits, device watchdogs).
High-Resolution Timers (hrtimer): nanosecond-resolution, kept in a per-CPU red-black tree ordered by expiry; used where precision matters, such as nanosleep(), POSIX timers, and the high-resolution scheduler tick.
| Function | Timer Type | Typical Precision |
|---|---|---|
| sleep() | Traditional or HR | 1 ms - 10 ms |
| usleep() | High-resolution | 100 μs - 1 ms |
| nanosleep() | High-resolution | 50 μs - 500 μs |
| poll/select timeout | Traditional | 1 ms - 10 ms |
| pthread_cond_timedwait | High-resolution | 50 μs - 500 μs |
| Scheduler time slice | Local APIC timer | Variable (CFS) |
To save power on idle systems, Linux coalesces nearby timer expirations. A timer set for 1.001 seconds might fire at 1.010 seconds along with several other timers. This reduces wakeups on an otherwise-idle CPU. The timer_slack_ns field controls how much slop is allowed.
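To observe timer slack from userspace, here is a small illustrative sketch (the measurement approach and the 4 ms slack value are assumptions for demonstration): it times a 1 ms nanosleep() before and after widening the calling thread's slack with prctl(PR_SET_TIMERSLACK). Whether the overshoot visibly grows depends on system activity; coalescing matters most on an otherwise-idle machine.

```c
#include <stdio.h>
#include <sys/prctl.h>
#include <time.h>

static long sleep_overshoot_ns(long request_ns)
{
    struct timespec req = { .tv_sec = 0, .tv_nsec = request_ns };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    nanosleep(&req, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long elapsed = (t1.tv_sec - t0.tv_sec) * 1000000000L
                 + (t1.tv_nsec - t0.tv_nsec);
    return elapsed - request_ns;           // how late the wakeup was
}

int main(void)
{
    // Default slack is typically 50 us; widen it to 4 ms so the kernel
    // may coalesce our wakeup with other expiring timers.
    printf("overshoot (default slack): %ld ns\n", sleep_overshoot_ns(1000000));
    prctl(PR_SET_TIMERSLACK, 4000000UL);   // 4 ms of allowed slop
    printf("overshoot (4 ms slack):    %ld ns\n", sleep_overshoot_ns(1000000));
    return 0;
}
```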
When an event occurs that multiple processes are waiting for, should the kernel wake all of them or just one? Waking all can cause the thundering herd problem—many processes wake up, only to find the resource gone, and go back to sleep.
Exclusive waiters:
To solve the thundering herd, Linux supports exclusive waiters. When processes add themselves to a wait queue with the WQ_FLAG_EXCLUSIVE flag, wake_up() will wake only one of them.
How it works:
```c
// Exclusive wakeup to avoid thundering herd
// Simplified from net/ipv4/inet_connection_sock.c - accept() implementation

// When a server calls accept(), it uses an exclusive wait
int inet_csk_wait_for_connect(struct sock *sk, long timeo) {
    struct inet_connection_sock *icsk = inet_csk(sk);
    DEFINE_WAIT(wait);

    for (;;) {
        // Add ourselves as an EXCLUSIVE waiter
        // Only one process is woken per incoming connection
        prepare_to_wait_exclusive(&sk->sk_socket->wq.wait, &wait,
                                  TASK_INTERRUPTIBLE);

        // Check if a connection is already queued
        if (reqsk_queue_empty(&icsk->icsk_accept_queue)) {
            release_sock(sk);
            if (timeo)
                timeo = schedule_timeout(timeo);
            lock_sock(sk);
        }

        // Check if we got one
        if (!reqsk_queue_empty(&icsk->icsk_accept_queue))
            break;

        // Check for errors/signals
        if (signal_pending(current))
            break;

        // Timed out without a connection
        if (!timeo)
            break;
    }

    finish_wait(&sk->sk_socket->wq.wait, &wait);
    return 0;
}

// Wait queue helper showing the exclusive flag
void add_wait_queue_exclusive(wait_queue_head_t *wq, wait_queue_entry_t *wait) {
    wait->flags |= WQ_FLAG_EXCLUSIVE;

    spin_lock(&wq->lock);
    // Exclusive waiters go at the TAIL,
    // so non-exclusive waiters at the head get woken first
    list_add_tail(&wait->entry, &wq->head);
    spin_unlock(&wq->lock);
}

// Wake-up respects the exclusive flag (sketch)
void __wake_up_common(wait_queue_head_t *wq, ...) {
    int nr_exclusive = ...;      // Usually 1

    list_for_each_entry_safe(curr, next, &wq->head, entry) {
        curr->func(curr, ...);   // Wake this one

        // If this is an exclusive waiter, decrement the counter
        if (curr->flags & WQ_FLAG_EXCLUSIVE) {
            if (--nr_exclusive == 0)
                break;           // Stop after waking nr_exclusive
        }
        // Non-exclusive: always continue
    }
}
```

| Scenario | Wakeup Type | Reason |
|---|---|---|
| accept() on listen socket | Exclusive | One waiter woken per incoming connection |
| File data becomes readable | Non-exclusive (all) | Multiple readers may proceed |
| Mutex unlock | Exclusive (one) | Only one new owner |
| Broadcast condition | Non-exclusive (all) | All should reevaluate |
| Semaphore post (+1) | Exclusive (one) | One more can acquire |
| epoll event | Exclusive with EPOLLEXCLUSIVE (else all) | One handler per event |
Before Linux 4.5, epoll suffered from the thundering herd problem when multiple processes called epoll_wait() on the same file descriptor. The EPOLLEXCLUSIVE flag was added to wake only one waiter per event, enabling efficient load balancing across worker processes without excessive wakeups.
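A userspace sketch of that pattern follows (the worker_loop name and the single-event batch size are illustrative; socket setup and worker forking are assumed to happen elsewhere). Each worker adds the shared listening socket to its own epoll instance with EPOLLEXCLUSIVE so that an incoming connection wakes only one of them:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void worker_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = {
        .events  = EPOLLIN | EPOLLEXCLUSIVE,   // wake one waiter per event (Linux >= 4.5)
        .data.fd = listen_fd,
    };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        struct epoll_event out;
        int n = epoll_wait(epfd, &out, 1, -1);  // blocks: Running -> Waiting
        if (n <= 0)
            continue;

        // Only this worker was woken for the event; sibling workers keep sleeping.
        int conn = accept(listen_fd, NULL, NULL);
        if (conn >= 0)
            close(conn);                        // handle the connection here
    }
}
```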
When a process is woken, it's added to the Ready queue—but it doesn't necessarily run immediately. Whether the woken process runs soon depends on its priority relative to the current running process.
Priority-based wakeup preemption:
During the wakeup path, the kernel checks whether the newly woken task should preempt the task currently running on the target CPU: real-time tasks use a strict priority comparison, while CFS tasks compare vruntime, gated by the wakeup granularity.
Even when the check succeeds, this doesn't immediately preempt; it only marks that a reschedule is needed (TIF_NEED_RESCHED). The actual switch happens at the next preemption point (return from interrupt, syscall exit, and so on). The small sketch below shows where that flag is honored; the full wakeup-path check follows it.
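As a rough sketch of where that deferred flag is honored, loosely modeled on the kernel's return-to-user path (the real logic lives in arch-specific entry code and is more involved):

```c
// resched_curr() only sets TIF_NEED_RESCHED on the target task/CPU.
// The flag is acted on at the next preemption point, for example on the
// way back to user space after an interrupt or syscall (simplified).
static void exit_to_user_mode_prepare(void)
{
    if (test_thread_flag(TIF_NEED_RESCHED))
        schedule();    /* the woken, higher-priority task gets the CPU here */
}
```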
```c
// Preemption check during wakeup
// Simplified from kernel/sched/core.c

void ttwu_do_activate(struct rq *rq, struct task_struct *p) {
    // Add to the run queue
    activate_task(rq, p);

    // Account this task toward the run queue's load (simplified)
    p->sched_contributes_to_load = 1;

    // Check for preemption
    check_preempt_curr(rq, p);
}

// CFS preemption check
void check_preempt_curr_fair(struct rq *rq, struct task_struct *p) {
    struct task_struct *curr = rq->curr;
    struct sched_entity *se = &p->se;
    struct sched_entity *curr_se = &curr->se;

    // A SCHED_IDLE current task is always preempted
    if (unlikely(curr->policy == SCHED_IDLE))
        goto preempt;

    // Check if the woken task should preempt based on vruntime
    // CFS: lower vruntime = deserves the CPU more
    s64 delta = curr_se->vruntime - se->vruntime;
    if (delta > sysctl_sched_wakeup_granularity)
        goto preempt;

    return;   // Current task continues

preempt:
    // Mark a reschedule needed on this CPU
    resched_curr(rq);
}

// Real-time preemption check (simpler: strict priority)
void check_preempt_curr_rt(struct rq *rq, struct task_struct *p) {
    // If the woken task has higher RT priority, always preempt
    if (p->prio < rq->curr->prio)
        resched_curr(rq);
}

// Cross-CPU wakeup and IPI
void ttwu_queue(struct task_struct *p, int cpu) {
    struct rq *rq = cpu_rq(cpu);

    if (cpu == smp_processor_id()) {
        // Local CPU - activate directly
        ttwu_do_activate(rq, p);
    } else {
        // Remote CPU - enqueue there (the real code uses a lock-free wake list),
        // then notify that CPU if it needs to reschedule
        ttwu_do_activate(rq, p);

        if (p->prio < rq->curr->prio) {
            // Need to preempt the remote CPU
            smp_send_reschedule(cpu);
            // The IPI forces the remote CPU through its scheduler
        }
    }
}
```

Cross-CPU wakeup and IPIs:
If a process should run on a different CPU than the one performing the wakeup, the waking CPU enqueues it on the remote CPU's run queue and, when the woken task should preempt whatever is running there, sends that CPU a reschedule IPI (inter-processor interrupt).
IPIs have overhead (~1-5 μs) but are necessary for low-latency wakeup on SMP systems.
| Scenario | Approximate Latency | Factors |
|---|---|---|
| Local CPU, no preemption needed | Until the current task yields (up to a full time slice) | No wakeup overhead, but no preemption either |
| Local CPU, preemption needed | 2-10 μs | Preempt at next point |
| Remote CPU, idle | 5-15 μs | IPI + context switch |
| Remote CPU, busy, higher priority | 10-50 μs | IPI + preemption |
| Remote CPU, PREEMPT_RT | 50-200 μs | Bounded latency |
The sched_wakeup_granularity_ns tunable controls how much 'better' a woken task must be to preempt the current one. A value too low causes excessive context switches on every wakeup; too high causes poor latency for woken tasks. The default is on the order of 1 ms (scaled up with CPU count) and is adjustable via sysctl on older kernels or debugfs on newer ones.
When processes don't wake as expected, debugging requires understanding the entire wakeup path. Common issues include missed wakeups, excessive latency, and stuck processes.
```bash
# Diagnose wakeup issues

# Check what a sleeping process is waiting for
cat /proc/<pid>/wchan
# Output: wait_woken or a specific waiting function name

# Get the full kernel stack of a sleeping process
cat /proc/<pid>/stack
# Shows the complete kernel call stack - reveals the blocking point

# Trace wakeups in real time
sudo trace-cmd record -e sched:sched_wakeup_new \
                      -e sched:sched_wakeup \
                      -e sched:sched_switch
sudo trace-cmd report

# Use perf to measure wakeup latency
sudo perf sched record sleep 10
sudo perf sched latency
# Shows average and max wakeup latencies per task

# BPF-based wakeup analysis: count wakeups by waking task and
# histogram the priorities of woken tasks
sudo bpftrace -e '
kprobe:try_to_wake_up { @wakeup_count[comm] = count(); }
tracepoint:sched:sched_wakeup { @woken_prio = hist(args->prio); }'

# Check for processes stuck in D state (uninterruptible sleep)
ps aux | awk '$8 ~ /^D/'
# If many D-state processes appear, check dmesg for I/O errors

# Monitor scheduler statistics
cat /proc/schedstat
# Per-CPU scheduling stats, including wakeup counts

# Per-process scheduling info
cat /proc/<pid>/sched
# Shows: nr_wakeups, nr_wakeups_sync, nr_wakeups_migrate, etc.
```

The proper wait loop pattern:
To avoid lost wakeups and spurious wakeups, always use this pattern:
```c
// CORRECT: Proper wait loop avoiding lost wakeups
// Uses the wait_event pattern

void wait_for_condition(bool *condition, wait_queue_head_t *wq) {
    DEFINE_WAIT(wait);

    for (;;) {
        // 1. Register ourselves on the wait queue
        prepare_to_wait(wq, &wait, TASK_INTERRUPTIBLE);

        // 2. Check the condition AFTER registering
        //    If the event happened while registering, we won't miss it
        if (*condition)
            break;

        // 3. Check for signals (if interruptible)
        if (signal_pending(current)) {
            // Handle the signal - may abort the wait
            break;
        }

        // 4. Actually sleep - only reached if the condition is false
        schedule();

        // 5. Woken up - loop back to recheck the condition
        //    (may have been a spurious wakeup)
    }

    // 6. Clean up
    finish_wait(wq, &wait);
}

// INCORRECT: Race condition - DON'T DO THIS
void broken_wait(bool *condition, wait_queue_head_t *wq) {
    if (!*condition) {
        // BAD: The event could happen RIGHT HERE.
        // Between the check and the sleep, the wakeup would be lost
        // (and we never registered on wq, so wake_up(wq) can't find us).
        set_current_state(TASK_INTERRUPTIBLE);
        schedule();   // May sleep forever!
    }
}
```

Never assume the condition is true just because you woke up. Spurious wakeups, signal delivery, and incorrect exclusive wakeup handling can all cause a process to wake when the condition isn't met. Always loop back and recheck.
We've thoroughly explored the Wake transition—the mechanism that returns sleeping processes to the runnable pool when their waited-for events occur.
Module complete:
With the Wake transition, we've now covered all five fundamental state transitions: Admit (New → Ready), Dispatch (Ready → Running), Timeout (Running → Ready), Block (Running → Waiting), and Wake (Waiting → Ready).
These five transitions, operating continuously across all processes, create the dynamic multitasking behavior we depend on in modern operating systems.
You now have a comprehensive understanding of all process state transitions. From a program's first admission into the system through repeated cycles of dispatch, timeout, blocking, and waking, you understand the kernel mechanisms that orchestrate the dance of processes competing for the CPU. This knowledge is foundational for understanding scheduling, synchronization, and system performance.