A sleeping process exists in suspension, consuming no CPU cycles, waiting for an event that may be microseconds or hours away. When that event finally occurs—the disk completes a read, a network packet arrives, a lock becomes available, or a timer expires—the process must be awakened and returned to the pool of runnable processes.
The Wake transition (also called the Unblock or Wakeup transition) moves a process from the Waiting state to the Ready state. It is the counterpart to the Block transition, completing the cycle that allows processes to efficiently wait for external events. Without wakeup, blocked processes would sleep forever.
This transition is almost always initiated by something other than the sleeping process itself—an interrupt handler, another process releasing a lock, or a timer subsystem. Understanding wakeup mechanisms is essential for understanding how I/O completion propagates through the system.
By the end of this page, you will understand the complete Wake transition—from the event that triggers wakeup through the kernel's wake mechanisms, wait queue processing, potential preemption, and reinsertion into the Ready queue. You'll also learn about wakeup optimization techniques and common problems like the thundering herd.
A process wakes when the event it was waiting for occurs. The nature of this event depends on why the process blocked in the first place.
| Blocked Waiting For | Woken By | Context of Wakeup |
|---|---|---|
| Disk I/O | Disk interrupt handler | Interrupt context (fast, limited) |
| Network data | Network softirq | Softirq context (deferred interrupt) |
| Mutex lock | Lock owner calling unlock() | Process context (full capabilities) |
| Semaphore | Another process calling sem_post() | Process context |
| Child exit | Child's do_exit() path | Process context |
| sleep()/usleep() | Timer interrupt + hrtimer subsystem | Timer softirq context |
| Condition variable | pthread_cond_signal/broadcast | Process context |
| select/poll/epoll | Any monitored FD becomes ready | Various (interrupt, process) |
Wakeup context matters:
The context in which wakeup occurs affects what the wakeup code can do:
Interrupt Context: cannot sleep, cannot touch user memory, and must finish quickly. It can set flags and call wake_up(), but heavier work is deferred.
Process Context: normal kernel execution on behalf of a process. It can sleep, take mutexes, and access user memory.
Hard interrupt handlers run with interrupts disabled on that CPU—they must be extremely fast. Softirqs run with interrupts enabled but in interrupt context—no sleeping. Process context is normal kernel execution with full capabilities. Many wakeups are triggered in hard interrupt, deferred to softirq for the actual wake_up() call, then the woken process runs later in process context.
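As a concrete illustration, here is a minimal, hypothetical driver sketch (the mydev names are invented for this example): the hard interrupt handler only records the event and calls wake_up(), which never sleeps and is safe in interrupt context, while the reader blocks in process context with wait_event_interruptible().

```c
#include <linux/interrupt.h>
#include <linux/sched.h>
#include <linux/wait.h>

struct mydev {
    wait_queue_head_t wq;        /* initialized elsewhere with init_waitqueue_head() */
    bool data_ready;
};

/* Hard IRQ context: interrupts disabled on this CPU, no sleeping allowed.
 * Publishing a flag and calling wake_up() is fine; wake_up() never sleeps. */
static irqreturn_t mydev_irq(int irq, void *cookie)
{
    struct mydev *dev = cookie;

    dev->data_ready = true;      /* the event the reader is waiting for */
    wake_up(&dev->wq);           /* Waiting -> Ready for any sleeper    */
    return IRQ_HANDLED;
}

/* Process context: full capabilities, so blocking is allowed.
 * wait_event_interruptible() rechecks the condition around each sleep. */
static int mydev_wait_for_data(struct mydev *dev)
{
    return wait_event_interruptible(dev->wq, dev->data_ready);
}
```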
Waking a process involves several precise steps: locating the sleeping process in its wait queue, changing its state, removing it from the wait queue, and adding it to the appropriate ready queue. Let's examine this mechanism in detail.
```c
// Linux wakeup implementation
// Simplified from kernel/sched/core.c and kernel/sched/wait.c

// Main wakeup entry point
void wake_up(wait_queue_head_t *wq) {
    __wake_up(wq, TASK_NORMAL, 1);
}

void wake_up_all(wait_queue_head_t *wq) {
    __wake_up(wq, TASK_NORMAL, 0);   // nr=0 means "wake all"
}

void __wake_up(wait_queue_head_t *wq, unsigned int mode, int nr_exclusive) {
    unsigned long flags;

    // Lock the wait queue
    spin_lock_irqsave(&wq->lock, flags);

    // Walk the list of waiters
    __wake_up_common(wq, mode, nr_exclusive);

    spin_unlock_irqrestore(&wq->lock, flags);
}

void __wake_up_common(wait_queue_head_t *wq, unsigned int mode, int nr_exclusive) {
    wait_queue_entry_t *curr, *next;

    list_for_each_entry_safe(curr, next, &wq->head, entry) {
        // Call the waiter's callback function
        unsigned flags = curr->flags;
        int ret = curr->func(curr, mode, 0, NULL);

        if (ret < 0)
            break;                           // Stop waking

        // Handle exclusive waiters
        if ((flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;                           // Only wake nr_exclusive exclusive waiters
    }
}

// Default wake function - actually wakes the process
int default_wake_function(wait_queue_entry_t *wait, unsigned mode,
                          int flags, void *key) {
    return try_to_wake_up(wait->private, mode);
}

// Core wakeup logic
int try_to_wake_up(struct task_struct *p, unsigned int state) {
    unsigned long flags;
    int cpu, success = 0;

    raw_spin_lock_irqsave(&p->pi_lock, flags);

    // Check if the process is in the right state to wake
    if (!(p->state & state))
        goto out;                            // Not in a wakeable state

    success = 1;

    // Change process state to RUNNING (Ready)
    p->state = TASK_RUNNING;

    // Select a CPU for the woken process
    cpu = select_task_rq(p);

    // Add to that CPU's run queue
    activate_task(cpu_rq(cpu), p);

    // Check if the woken task should preempt the current task on the target CPU
    if (p->prio < cpu_curr(cpu)->prio)
        resched_curr(cpu_rq(cpu));

out:
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    return success;
}
```

Almost all wakeup paths eventually call try_to_wake_up(). This function handles the state transition, run queue insertion, and potential preemption. Whether the wakeup comes from I/O completion, lock release, or signal delivery, everything funnels through this function, so performance optimizations here benefit the entire system.
I/O completion is one of the most common wakeup triggers. When a disk read finishes, a network packet arrives, or a USB device responds, the device generates an interrupt that eventually leads to waking the blocked process.
```c
// Example: Disk I/O completion wakeup path
// Simplified from actual block layer code

// 1. Interrupt handler (runs in hard IRQ context)
irqreturn_t disk_interrupt(int irq, void *dev_id) {
    struct disk_device *disk = dev_id;
    struct request *req;

    // Acknowledge the interrupt to the hardware
    disk_ack_interrupt(disk);

    // Get the completed request
    req = disk_get_completed_request(disk);

    // Mark the request as complete
    req->status = IO_COMPLETE;
    req->error = check_for_errors(disk);

    // Schedule bottom-half processing
    raise_softirq(BLOCK_SOFTIRQ);

    return IRQ_HANDLED;
}

// 2. Softirq handler (runs in softirq context)
void block_softirq_handler(void) {
    struct request *req;

    while ((req = get_completed_request())) {
        // Call the request's completion callback
        req->end_io(req);
    }
}

// 3. Request completion callback
void bio_end_io(struct bio *bio) {
    struct kiocb *iocb = bio->bi_private;

    // Signal I/O completion
    if (iocb->ki_complete) {
        // Async I/O - invoke the callback
        iocb->ki_complete(iocb, bio->bi_status);
    } else {
        // Sync I/O - wake the waiter
        struct task_struct *waiter = iocb->private;

        WRITE_ONCE(iocb->done, true);  // simplified completion flag, not a real kiocb field

        // THE CRITICAL WAKEUP CALL
        wake_up_process(waiter);
        // Process moves: Waiting -> Ready
    }
}

// 4. Back in the original read() syscall
ssize_t sync_read(struct file *file, char *buf, size_t len) {
    struct kiocb iocb;
    struct bio *bio;

    init_sync_kiocb(&iocb, file);
    iocb.private = current;            // Store the current process for wakeup

    // Build the block request for this read (hypothetical helper, details omitted)
    bio = alloc_and_map_bio(file, buf, len);
    bio->bi_private = &iocb;

    // Submit I/O
    submit_bio(bio);

    // Block until I/O completes
    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);
        if (READ_ONCE(iocb.done))      // Check if the I/O is done
            break;
        schedule();                    // Sleep - woken by bio_end_io()
    }
    set_current_state(TASK_RUNNING);

    // Copy data to the user buffer (simplified)
    if (copy_to_user(buf, iocb.data, len))
        return -EFAULT;
    return len;
}
```

Notice how the hard interrupt handler doesn't call wake_up() directly; it schedules a softirq instead. wake_up() itself is safe in hard interrupt context, but request completion can involve a fair amount of work (per-request callbacks, error handling), so the block layer defers it to keep the hard handler short. The softirq runs shortly after, with interrupts enabled, and performs the actual wakeup.
When a process releases a lock that others are waiting for, one or more waiters must be woken. The mechanism differs slightly from I/O wakeup because it occurs in process context, not interrupt context.
```c
// Mutex wakeup in Linux kernel
// Simplified from kernel/locking/mutex.c

struct mutex {
    atomic_long_t owner;          // Current owner (or 0 if free)
    spinlock_t wait_lock;         // Protects wait_list
    struct list_head wait_list;   // Waiters
};

void mutex_unlock(struct mutex *lock) {
    // Fast path: no waiters
    if (atomic_long_cmpxchg(&lock->owner, (long)current, 0) == (long)current)
        return;   // Unlocked, no one waiting

    // Slow path: need to wake someone
    __mutex_unlock_slowpath(lock);
}

void __mutex_unlock_slowpath(struct mutex *lock) {
    struct task_struct *next = NULL;
    struct mutex_waiter *waiter;
    unsigned long flags;

    spin_lock_irqsave(&lock->wait_lock, flags);

    if (!list_empty(&lock->wait_list)) {
        // Get the first waiter (FIFO order typically)
        waiter = list_first_entry(&lock->wait_list,
                                  struct mutex_waiter, list);

        // Remove from the wait list
        list_del(&waiter->list);
        next = waiter->task;

        // Transfer ownership atomically (lock handoff)
        atomic_long_set(&lock->owner, (long)next);
    } else {
        // No waiters, just clear the owner
        atomic_long_set(&lock->owner, 0);
    }

    spin_unlock_irqrestore(&lock->wait_lock, flags);

    if (next) {
        // Wake up the next owner
        wake_up_process(next);
        // The woken process now owns the mutex and is Ready
    }
}

// The waiting side
void mutex_lock(struct mutex *lock) {
    // Fast path: the lock is free
    if (atomic_long_cmpxchg(&lock->owner, 0, (long)current) == 0)
        return;   // Got the lock

    // Slow path: need to wait
    __mutex_lock_slowpath(lock);
}

void __mutex_lock_slowpath(struct mutex *lock) {
    struct mutex_waiter waiter;
    unsigned long flags;

    waiter.task = current;

    spin_lock_irqsave(&lock->wait_lock, flags);
    list_add_tail(&waiter.list, &lock->wait_list);
    spin_unlock_irqrestore(&lock->wait_lock, flags);

    // (Simplified: the real code rechecks the lock before sleeping so an
    //  unlock that raced with our enqueue cannot leave us sleeping forever.)
    for (;;) {
        set_current_state(TASK_UNINTERRUPTIBLE);

        // Check if we got ownership (unlock woke us with ownership)
        if (atomic_long_read(&lock->owner) == (long)current)
            break;

        schedule();   // Sleep until woken by unlock
    }
    set_current_state(TASK_RUNNING);
    // We now own the mutex
}
```

Lock handoff optimization:
Modern mutex implementations often use lock handoff rather than a simple wakeup: the unlocking thread selects the first waiter, transfers ownership to it before dropping the wait-queue lock, and only then wakes it (exactly what __mutex_unlock_slowpath does above).
Without handoff, a woken waiter might lose the lock race to a CPU that just tried to acquire it, leading to potential starvation of the waiter.
| Primitive | Wakeup Strategy | Reason |
|---|---|---|
| Mutex | Wake one (first waiter) | Only one can hold the lock |
| Read-Write Lock (write unlock) | Wake one writer OR all readers | Readers can share |
| Semaphore | Wake one per count increase | Controlled by count |
| Condition Variable (signal) | Wake one | Typically for one-at-a-time |
| Condition Variable (broadcast) | Wake all | All should reevaluate condition |
| File I/O (device ready) | Wake all waiters | All can proceed with I/O |
A process may be woken even when the condition it was waiting for isn't met; this is called a spurious wakeup. It can happen because of implementation details (for example, racing wakeups on a shared wait queue) or signal delivery. This is why wait loops always recheck the condition after waking: `while (!condition) { sleep(); }`
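The same rule applies in userspace. A minimal POSIX condition-variable sketch (error handling omitted) shows why the wait must sit inside a while loop:

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool data_ready = false;

void *consumer(void *arg)
{
    pthread_mutex_lock(&lock);
    // while, not if: pthread_cond_wait() may return spuriously,
    // so the predicate must be rechecked every time we wake.
    while (!data_ready)
        pthread_cond_wait(&cond, &lock);   // atomically unlocks, sleeps, relocks
    /* ... consume the data ... */
    pthread_mutex_unlock(&lock);
    return NULL;
}

void *producer(void *arg)
{
    pthread_mutex_lock(&lock);
    data_ready = true;
    pthread_cond_signal(&cond);            // wake one waiter
    pthread_mutex_unlock(&lock);
    return NULL;
}
```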
Timers are fundamental to many blocking operations. When a process calls sleep(), nanosleep(), or poll() with a timeout, the kernel sets a timer. When the timer expires, the process is woken.
```c
// Timer-based wakeup implementation
// Simplified from kernel/time/hrtimer.c and kernel/time/timer.c

// The sleep() implementation using timers
unsigned int __sched sleep(unsigned int seconds) {
    struct hrtimer_sleeper sleeper;
    ktime_t expire = ktime_add_ns(ktime_get(), seconds * NSEC_PER_SEC);

    // Initialize the high-resolution timer
    hrtimer_init_sleeper(&sleeper, current);
    sleeper.timer.function = hrtimer_wakeup;

    // Set the expiration time
    hrtimer_set_expires(&sleeper.timer, expire);

    // Start the timer
    hrtimer_start(&sleeper.timer, expire, HRTIMER_MODE_ABS);

    // Sleep until the timer fires (or a signal arrives)
    set_current_state(TASK_INTERRUPTIBLE);
    if (sleeper.task)        // Still valid (not yet expired)
        schedule();          // Block here

    // Woken up - clean up the timer
    hrtimer_cancel(&sleeper.timer);

    // Return the remaining time if interrupted early
    return calculate_remaining(expire);
}

// Timer callback - the wakeup function
enum hrtimer_restart hrtimer_wakeup(struct hrtimer *timer) {
    struct hrtimer_sleeper *sleeper =
        container_of(timer, struct hrtimer_sleeper, timer);
    struct task_struct *task = sleeper->task;

    // Mark the sleeper as expired
    sleeper->task = NULL;

    // Wake the sleeping process
    if (task)
        wake_up_process(task);

    return HRTIMER_NORESTART;   // One-shot timer
}

// How timers are processed (timer softirq)
void run_timer_softirq(void) {
    struct timer_base *base = this_cpu_ptr(&timer_bases);

    while (time_after_eq(jiffies, base->clk)) {
        // Find and remove expired timers
        struct timer_list *timer;

        while ((timer = find_expired_timer(base))) {
            // Call the timer's callback function
            // This may call wake_up_process()
            timer->function(timer);
        }
        base->clk++;
    }
}
```

Two types of kernel timers:
Traditional Timers (timer_list): jiffies-based and roughly millisecond-resolution, organized in a per-CPU timer wheel; cheap to arm and cancel, which makes them well suited to timeouts that usually never expire (network retransmits, device watchdogs).
High-Resolution Timers (hrtimer): nanosecond-resolution, kept in a per-CPU red-black tree ordered by expiry; used where precision matters, such as nanosleep(), POSIX timers, and the high-resolution scheduler tick.
| Function | Timer Type | Typical Precision |
|---|---|---|
| sleep() | Traditional or HR | 1 ms - 10 ms |
| usleep() | High-resolution | 100 μs - 1 ms |
| nanosleep() | High-resolution | 50 μs - 500 μs |
| poll/select timeout | Traditional | 1 ms - 10 ms |
| pthread_cond_timedwait | High-resolution | 50 μs - 500 μs |
| Scheduler time slice | Local APIC timer | Variable (CFS) |
To save power on idle systems, Linux coalesces nearby timer expirations. A timer set for 1.001 seconds might fire at 1.010 seconds along with several other timers. This reduces wakeups on an otherwise-idle CPU. The timer_slack_ns field controls how much slop is allowed.
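To observe timer slack from userspace, here is a small illustrative sketch (the measurement approach and the 4 ms slack value are assumptions for demonstration): it times a 1 ms nanosleep() before and after widening the calling thread's slack with prctl(PR_SET_TIMERSLACK). Whether the overshoot visibly grows depends on system activity; coalescing matters most on an otherwise-idle machine.

```c
#include <stdio.h>
#include <sys/prctl.h>
#include <time.h>

static long sleep_overshoot_ns(long request_ns)
{
    struct timespec req = { .tv_sec = 0, .tv_nsec = request_ns };
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    nanosleep(&req, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    long elapsed = (t1.tv_sec - t0.tv_sec) * 1000000000L
                 + (t1.tv_nsec - t0.tv_nsec);
    return elapsed - request_ns;           // how late the wakeup was
}

int main(void)
{
    // Default slack is typically 50 us; widen it to 4 ms so the kernel
    // may coalesce our wakeup with other expiring timers.
    printf("overshoot (default slack): %ld ns\n", sleep_overshoot_ns(1000000));
    prctl(PR_SET_TIMERSLACK, 4000000UL);   // 4 ms of allowed slop
    printf("overshoot (4 ms slack):    %ld ns\n", sleep_overshoot_ns(1000000));
    return 0;
}
```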
When an event occurs that multiple processes are waiting for, should the kernel wake all of them or just one? Waking all can cause the thundering herd problem—many processes wake up, only to find the resource gone, and go back to sleep.
Exclusive waiters:
To solve the thundering herd, Linux supports exclusive waiters. When processes add themselves to a wait queue with the WQ_FLAG_EXCLUSIVE flag, wake_up() will wake only one of them.
How it works:
```c
// Exclusive wakeup to avoid thundering herd
// Simplified from net/ipv4/inet_connection_sock.c - accept() implementation

// When a server calls accept(), it uses an exclusive wait
int inet_csk_wait_for_connect(struct sock *sk, long timeo) {
    struct inet_connection_sock *icsk = inet_csk(sk);
    DEFINE_WAIT(wait);

    for (;;) {
        // Add ourselves as an EXCLUSIVE waiter
        // Only one process is woken per incoming connection
        prepare_to_wait_exclusive(&sk->sk_socket->wq.wait, &wait,
                                  TASK_INTERRUPTIBLE);

        // Check if a connection is already queued
        if (reqsk_queue_empty(&icsk->icsk_accept_queue)) {
            release_sock(sk);
            if (timeo)
                timeo = schedule_timeout(timeo);
            lock_sock(sk);
        }

        // Check if we got one
        if (!reqsk_queue_empty(&icsk->icsk_accept_queue))
            break;

        // Check for errors/signals
        if (signal_pending(current))
            break;

        // Timed out without a connection
        if (!timeo)
            break;
    }

    finish_wait(&sk->sk_socket->wq.wait, &wait);
    return 0;
}

// Wait queue helper showing the exclusive flag
void add_wait_queue_exclusive(wait_queue_head_t *wq, wait_queue_entry_t *wait) {
    wait->flags |= WQ_FLAG_EXCLUSIVE;

    spin_lock(&wq->lock);
    // Exclusive waiters go at the TAIL,
    // so non-exclusive waiters at the head get woken first
    list_add_tail(&wait->entry, &wq->head);
    spin_unlock(&wq->lock);
}

// Wake-up respects the exclusive flag (sketch)
void __wake_up_common(wait_queue_head_t *wq, ...) {
    int nr_exclusive = ...;      // Usually 1

    list_for_each_entry_safe(curr, next, &wq->head, entry) {
        curr->func(curr, ...);   // Wake this one

        // If this is an exclusive waiter, decrement the counter
        if (curr->flags & WQ_FLAG_EXCLUSIVE) {
            if (--nr_exclusive == 0)
                break;           // Stop after waking nr_exclusive
        }
        // Non-exclusive: always continue
    }
}
```

| Scenario | Wakeup Type | Reason |
|---|---|---|
| accept() on listen socket | Exclusive | One waiter woken per incoming connection |
| File data becomes readable | Non-exclusive (all) | Multiple readers may proceed |
| Mutex unlock | Exclusive (one) | Only one new owner |
| Broadcast condition | Non-exclusive (all) | All should reevaluate |
| Semaphore post (+1) | Exclusive (one) | One more can acquire |
| epoll event | Exclusive with EPOLLEXCLUSIVE (else all) | One handler per event |
Before Linux 4.5, epoll suffered from the thundering herd problem when multiple processes called epoll_wait() on the same file descriptor. The EPOLLEXCLUSIVE flag was added to wake only one waiter per event, enabling efficient load balancing across worker processes without excessive wakeups.
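A userspace sketch of that pattern follows (the worker_loop name and the single-event batch size are illustrative; socket setup and worker forking are assumed to happen elsewhere). Each worker adds the shared listening socket to its own epoll instance with EPOLLEXCLUSIVE so that an incoming connection wakes only one of them:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void worker_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = {
        .events  = EPOLLIN | EPOLLEXCLUSIVE,   // wake one waiter per event (Linux >= 4.5)
        .data.fd = listen_fd,
    };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    for (;;) {
        struct epoll_event out;
        int n = epoll_wait(epfd, &out, 1, -1);  // blocks: Running -> Waiting
        if (n <= 0)
            continue;

        // Only this worker was woken for the event; sibling workers keep sleeping.
        int conn = accept(listen_fd, NULL, NULL);
        if (conn >= 0)
            close(conn);                        // handle the connection here
    }
}
```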
When a process is woken, it's added to the Ready queue—but it doesn't necessarily run immediately. Whether the woken process runs soon depends on its priority relative to the current running process.
Priority-based wakeup preemption:
During the wakeup path, the kernel checks whether the newly woken task should preempt the task currently running on the target CPU: real-time tasks use a strict priority comparison, while CFS tasks compare vruntime, gated by the wakeup granularity.
Even when the check succeeds, this doesn't immediately preempt; it only marks that a reschedule is needed (TIF_NEED_RESCHED). The actual switch happens at the next preemption point (return from interrupt, syscall exit, and so on). The small sketch below shows where that flag is honored; the full wakeup-path check follows it.
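As a rough sketch of where that deferred flag is honored, loosely modeled on the kernel's return-to-user path (the real logic lives in arch-specific entry code and is more involved):

```c
// resched_curr() only sets TIF_NEED_RESCHED on the target task/CPU.
// The flag is acted on at the next preemption point, for example on the
// way back to user space after an interrupt or syscall (simplified).
static void exit_to_user_mode_prepare(void)
{
    if (test_thread_flag(TIF_NEED_RESCHED))
        schedule();    /* the woken, higher-priority task gets the CPU here */
}
```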
```c
// Preemption check during wakeup
// Simplified from kernel/sched/core.c

void ttwu_do_activate(struct rq *rq, struct task_struct *p) {
    // Add to the run queue
    activate_task(rq, p);

    // Account this task toward the run queue's load (simplified)
    p->sched_contributes_to_load = 1;

    // Check for preemption
    check_preempt_curr(rq, p);
}

// CFS preemption check
void check_preempt_curr_fair(struct rq *rq, struct task_struct *p) {
    struct task_struct *curr = rq->curr;
    struct sched_entity *se = &p->se;
    struct sched_entity *curr_se = &curr->se;

    // A SCHED_IDLE current task is always preempted
    if (unlikely(curr->policy == SCHED_IDLE))
        goto preempt;

    // Check if the woken task should preempt based on vruntime
    // CFS: lower vruntime = deserves the CPU more
    s64 delta = curr_se->vruntime - se->vruntime;
    if (delta > sysctl_sched_wakeup_granularity)
        goto preempt;

    return;   // Current task continues

preempt:
    // Mark a reschedule needed on this CPU
    resched_curr(rq);
}

// Real-time preemption check (simpler: strict priority)
void check_preempt_curr_rt(struct rq *rq, struct task_struct *p) {
    // If the woken task has higher RT priority, always preempt
    if (p->prio < rq->curr->prio)
        resched_curr(rq);
}

// Cross-CPU wakeup and IPI
void ttwu_queue(struct task_struct *p, int cpu) {
    struct rq *rq = cpu_rq(cpu);

    if (cpu == smp_processor_id()) {
        // Local CPU - activate directly
        ttwu_do_activate(rq, p);
    } else {
        // Remote CPU - enqueue there (the real code uses a lock-free wake list),
        // then notify that CPU if it needs to reschedule
        ttwu_do_activate(rq, p);

        if (p->prio < rq->curr->prio) {
            // Need to preempt the remote CPU
            smp_send_reschedule(cpu);
            // The IPI forces the remote CPU through its scheduler
        }
    }
}
```

Cross-CPU wakeup and IPIs:
If a process should run on a different CPU than the one performing the wakeup, the waking CPU enqueues it on the remote CPU's run queue and, when the woken task should preempt whatever is running there, sends that CPU a reschedule IPI (inter-processor interrupt).
IPIs have overhead (~1-5 μs) but are necessary for low-latency wakeup on SMP systems.
| Scenario | Approximate Latency | Factors |
|---|---|---|
| Local CPU, no preemption needed | Until the current task yields (up to a full time slice) | No wakeup overhead, but no preemption either |
| Local CPU, preemption needed | 2-10 μs | Preempt at next point |
| Remote CPU, idle | 5-15 μs | IPI + context switch |
| Remote CPU, busy, higher priority | 10-50 μs | IPI + preemption |
| Remote CPU, PREEMPT_RT | 50-200 μs | Bounded latency |
The sched_wakeup_granularity_ns tunable controls how much 'better' a woken task must be to preempt the current one. A value too low causes excessive context switches on every wakeup; too high causes poor latency for woken tasks. The default is on the order of 1 ms (scaled up with CPU count) and is adjustable via sysctl on older kernels or debugfs on newer ones.
When processes don't wake as expected, debugging requires understanding the entire wakeup path. Common issues include missed wakeups, excessive latency, and stuck processes.
```bash
# Diagnose wakeup issues

# Check what a sleeping process is waiting for
cat /proc/<pid>/wchan
# Output: wait_woken or a specific waiting function name

# Get the full kernel stack of a sleeping process
cat /proc/<pid>/stack
# Shows the complete kernel call stack - reveals the blocking point

# Trace wakeups in real time
sudo trace-cmd record -e sched:sched_wakeup_new \
                      -e sched:sched_wakeup \
                      -e sched:sched_switch
sudo trace-cmd report

# Use perf to measure wakeup latency
sudo perf sched record sleep 10
sudo perf sched latency
# Shows average and max wakeup latencies per task

# BPF-based wakeup analysis: count wakeups by waking task and
# histogram the priorities of woken tasks
sudo bpftrace -e '
kprobe:try_to_wake_up { @wakeup_count[comm] = count(); }
tracepoint:sched:sched_wakeup { @woken_prio = hist(args->prio); }'

# Check for processes stuck in D state (uninterruptible sleep)
ps aux | awk '$8 ~ /^D/'
# If many D-state processes appear, check dmesg for I/O errors

# Monitor scheduler statistics
cat /proc/schedstat
# Per-CPU scheduling stats, including wakeup counts

# Per-process scheduling info
cat /proc/<pid>/sched
# Shows: nr_wakeups, nr_wakeups_sync, nr_wakeups_migrate, etc.
```

The proper wait loop pattern:
To avoid lost wakeups and spurious wakeups, always use this pattern:
```c
// CORRECT: Proper wait loop avoiding lost wakeups
// Uses the wait_event pattern

void wait_for_condition(bool *condition, wait_queue_head_t *wq) {
    DEFINE_WAIT(wait);

    for (;;) {
        // 1. Register ourselves on the wait queue
        prepare_to_wait(wq, &wait, TASK_INTERRUPTIBLE);

        // 2. Check the condition AFTER registering
        //    If the event happened while registering, we won't miss it
        if (*condition)
            break;

        // 3. Check for signals (if interruptible)
        if (signal_pending(current)) {
            // Handle the signal - may abort the wait
            break;
        }

        // 4. Actually sleep - only reached if the condition is false
        schedule();

        // 5. Woken up - loop back to recheck the condition
        //    (may have been a spurious wakeup)
    }

    // 6. Clean up
    finish_wait(wq, &wait);
}

// INCORRECT: Race condition - DON'T DO THIS
void broken_wait(bool *condition, wait_queue_head_t *wq) {
    if (!*condition) {
        // BAD: The event could happen RIGHT HERE.
        // Between the check and the sleep, the wakeup would be lost
        // (and we never registered on wq, so wake_up(wq) can't find us).
        set_current_state(TASK_INTERRUPTIBLE);
        schedule();   // May sleep forever!
    }
}
```

Never assume the condition is true just because you woke up. Spurious wakeups, signal delivery, and incorrect exclusive wakeup handling can all cause a process to wake when the condition isn't met. Always loop back and recheck.
We've thoroughly explored the Wake transition—the mechanism that returns sleeping processes to the runnable pool when their waited-for events occur.
Module complete:
With the Wake transition, we've now covered all five fundamental state transitions: Admit (New → Ready), Dispatch (Ready → Running), Timeout (Running → Ready), Block (Running → Waiting), and Wake (Waiting → Ready).
These five transitions, operating continuously across all processes, create the dynamic multitasking behavior we depend on in modern operating systems.
You now have a comprehensive understanding of all process state transitions. From a program's first admission into the system through repeated cycles of dispatch, timeout, blocking, and waking, you understand the kernel mechanisms that orchestrate the dance of processes competing for the CPU. This knowledge is foundational for understanding scheduling, synchronization, and system performance.