Sleep and wakeup don't operate in isolation—they are fundamentally scheduler operations. The scheduler is the kernel subsystem that decides which process runs on which CPU at any given moment. Sleep tells the scheduler "remove me from consideration," and wakeup says "consider me again."
Understanding how sleep/wakeup integrate with scheduling reveals why blocking code behaves the way it does under load.
This page examines the relationship between blocking synchronization and process scheduling in depth. The goal is to understand sleep/wakeup not as isolated mechanisms, but as integral parts of the scheduler's decision-making.
By the end of this page, you will understand: how schedule() implements the actual sleep; the relationship between process states and scheduler queues; how wakeup chooses which CPU to target; scheduler-class-specific behaviors for sleeping; sleep duration effects on priority; and load balancing implications.
When a process calls sleep, the actual suspension happens in schedule(). This function is the core of the Linux scheduler—it decides what runs next and performs the context switch.
What schedule() does:
The sleep path through schedule():
When a process is sleeping (state ≠ TASK_RUNNING), schedule() doesn't put it back on the run queue. The process simply vanishes from the scheduler's consideration until wakeup changes its state back.
// Simplified schedule() showing the sleep path
void __sched schedule(void) {
    struct task_struct *prev = current;
    struct task_struct *next;
    struct rq *rq = this_rq();      // This CPU's run queue

    // Lock the run queue
    raw_spin_lock(&rq->lock);

    // CRITICAL: If prev is not TASK_RUNNING, deactivate it
    if (prev->state != TASK_RUNNING) {
        // This is the sleep path!
        // prev won't be re-added to run queue
        deactivate_task(rq, prev);  // Remove from run queue

        // Special case: check for pending signals for INTERRUPTIBLE
        if (prev->state == TASK_INTERRUPTIBLE && signal_pending(prev)) {
            // Wake it up to handle the signal
            prev->state = TASK_RUNNING;
            activate_task(rq, prev);
        }
    }

    // Pick the next task to run
    next = pick_next_task(rq);      // Consults CFS, RT, etc.

    if (prev != next) {
        // Actually switch processes
        rq->curr = next;

        // THE CONTEXT SWITCH
        context_switch(rq, prev, next);
        // When we return here, we've been scheduled again
        // prev was saved, ran other stuff, prev restored
        // Could be microseconds or hours later!
    }

    raw_spin_unlock(&rq->lock);
}

// The context switch - where the magic happens
void context_switch(struct rq *rq, struct task_struct *prev,
                    struct task_struct *next) {
    // Switch memory context (page tables)
    switch_mm(prev->mm, next->mm, next);

    // Switch register context (stack, instruction pointer, etc.)
    switch_to(prev, next, prev);

    // After switch_to returns, we're running as 'next'
    // 'prev' is now the task we switched FROM
    // (which was 'next' before, confusingly)
}

Key insights:
1. State check before the schedule call matters: The sleeping process sets its state to a sleep state (TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE) before calling schedule(). When schedule() runs, it checks this state and decides not to re-enqueue the process. If the state were still TASK_RUNNING, the process would be immediately re-added to the run queue (a yield, not a sleep); see the sketch just after this list.
2. deactivate_task vs staying on run queue:
deactivate_task() removes the process from the run queue data structures (CFS's red-black tree, RT's priority queue, etc.). This is the moment the process truly becomes "sleeping"—it won't be picked by pick_next_task().
3. The context switch is the dividing line: Everything before context_switch runs in the sleeping process's context. After context_switch, a different process runs. The sleeping process's execution is frozen at that point until wakeup + scheduling returns it to a CPU.
4. switch_to is architecture-specific: The actual register save/restore is done by architecture-specific assembly. It saves prev's stack pointer, instruction pointer, and callee-saved registers, then loads next's.
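To make insight 1 concrete, here is a minimal sketch of the open-coded sleep pattern described above, using the standard kernel wait-queue helpers. The names my_waitq and my_condition are illustrative, not taken from the kernel; a real user would normally reach for the wait_event family instead of open-coding this.

#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(my_waitq);   /* illustrative wait queue  */
static bool my_condition;                   /* set by some waker        */

static int wait_for_my_condition(void)
{
    DEFINE_WAIT(wait);
    int ret = 0;

    for (;;) {
        /* Sets state to TASK_INTERRUPTIBLE *before* the condition check,
         * and links us onto my_waitq so wakeup can find us. */
        prepare_to_wait(&my_waitq, &wait, TASK_INTERRUPTIBLE);

        if (my_condition)
            break;
        if (signal_pending(current)) {
            ret = -ERESTARTSYS;
            break;
        }

        /* State is not TASK_RUNNING here, so schedule() deactivates us.
         * Execution resumes at this point only after wakeup + rescheduling. */
        schedule();
    }

    /* Restores TASK_RUNNING and removes us from the wait queue. */
    finish_wait(&my_waitq, &wait);
    return ret;
}

If a wakeup arrives anywhere after prepare_to_wait(), it flips the state back to TASK_RUNNING, so schedule() treats the call as a yield rather than a sleep—exactly the behavior described in insight 1.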
A sleeping process experiences a VOLUNTARY context switch—it called schedule() itself. Preemption causes INVOLUNTARY switches—the kernel forces schedule() during an interrupt return. Both use the same schedule() function, but the accounting differs. Sleep shows as voluntary switches in /proc/PID/status.
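From user space, the same split can be observed without reading /proc at all: getrusage() reports the two counters directly. A small sketch (Linux-specific fields):

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

/* Prints voluntary vs. involuntary context switches for this process.
 * The sleep() blocks, so it adds at least one voluntary switch. */
int main(void)
{
    struct rusage ru;

    sleep(1);   /* voluntary: we asked the scheduler to remove us */

    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        perror("getrusage");
        return 1;
    }
    printf("voluntary switches:   %ld\n", ru.ru_nvcsw);
    printf("involuntary switches: %ld\n", ru.ru_nivcsw);
    return 0;
}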
The scheduler maintains run queues—data structures holding processes eligible to run. Sleep and wakeup move processes between membership states:
On run queue (state == TASK_RUNNING): the process is visible to pick_next_task() and will receive CPU time.
Off run queue (state ≠ TASK_RUNNING): the process is invisible to the scheduler entirely.
Processes off the run queue consume zero CPU time. They're just data structures in memory, tracked by wait queues.
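A sketch of that arrangement using the standard wait-queue helpers (io_waitq and data_ready are illustrative names; a real driver would protect the shared flag with appropriate locking):

#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(io_waitq);   /* illustrative names */
static bool data_ready;

/* Consumer: while data_ready is false it sits on io_waitq, off the run
 * queue, consuming no CPU at all. */
static int consumer(void)
{
    return wait_event_interruptible(io_waitq, data_ready);
}

/* Producer: makes the condition true, then moves any sleepers on io_waitq
 * back to TASK_RUNNING and onto a run queue. */
static void producer(void)
{
    data_ready = true;
    wake_up_interruptible(&io_waitq);
}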
Per-CPU run queues:
Modern SMP Linux has a run queue per CPU. Each queue has its own lock and scheduling data:
struct rq {
raw_spinlock_t lock; // Protects this queue
unsigned int nr_running; // Count of runnable tasks
struct cfs_rq cfs; // CFS scheduling data
struct rt_rq rt; // RT scheduling data
struct task_struct *curr; // Currently running task
struct task_struct *idle; // This CPU's idle task
// ...
};
Sleep removes from ONE CPU's run queue: When sleeping, the process is deactivated from its current CPU's queue. It's no longer associated with any specific CPU.
Wakeup adds to a (possibly different) CPU's run queue: Wakeup must choose which CPU to target. This might be the CPU the task last ran on, the waker's CPU, or an idle CPU elsewhere, depending on the heuristics described below.
The sum of nr_running across all CPUs approximates instantaneous CPU demand. Sleeping processes don't count toward it—they're waiting for events, not CPU. (Linux's load average adds one wrinkle: it also counts tasks in uninterruptible D-state sleep, which is why heavy disk I/O can still raise the load average.) This is why most I/O-bound workloads show low load despite having many processes.
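A small user-space sketch makes this visible: the fourth field of /proc/loadavg is "runnable/total", so a machine full of sleeping processes still reports only a handful of runnable tasks.

#include <stdio.h>

/* Reads /proc/loadavg and shows how few of the system's tasks are
 * actually runnable at any instant. */
int main(void)
{
    double one, five, fifteen;
    int runnable, total, last_pid;
    FILE *f = fopen("/proc/loadavg", "r");

    if (!f) {
        perror("/proc/loadavg");
        return 1;
    }
    if (fscanf(f, "%lf %lf %lf %d/%d %d",
               &one, &five, &fifteen, &runnable, &total, &last_pid) == 6)
        printf("load(1m)=%.2f  runnable %d of %d tasks\n",
               one, runnable, total);
    fclose(f);
    return 0;
}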
When a process wakes up, where should it run? This decision significantly impacts performance. The kernel uses several heuristics:
Factors in CPU selection:
| Factor | Consideration | Trade-off |
|---|---|---|
| Cache warmth | Running on last CPU preserves L1/L2 cache data | Reduces cache misses vs. potential delay |
| Idle CPUs | An idle CPU can run the task immediately | Reduces latency vs. migration cost |
| Waker locality | Waker's CPU may have relevant data cached | Good for producer-consumer patterns |
| Load balancing | Spread work across CPUs evenly | Throughput vs. latency |
| NUMA locality | Same NUMA node has faster memory access | Critical for memory-intensive tasks |
| Affinity mask | Task may be restricted to certain CPUs | Hard constraint; must be respected |
// Simplified logic for selecting wakeup CPU
int select_task_rq(struct task_struct *p,
                   int prev_cpu,        // Where task last ran
                   int wake_flags) {
    // Check affinity - must respect this
    cpumask_t *allowed = &p->cpus_allowed;
    if (!cpumask_test_cpu(prev_cpu, allowed))
        prev_cpu = cpumask_any(allowed);

    // Fast path: if prev CPU is idle, use it (cache warm + immediate run)
    if (cpu_idle(prev_cpu) && cpumask_test_cpu(prev_cpu, allowed))
        return prev_cpu;

    // Check the waking CPU (for producer-consumer locality)
    int waker_cpu = smp_processor_id();
    if (cpumask_test_cpu(waker_cpu, allowed) &&
        shares_llc(waker_cpu, prev_cpu)) {
        // Waker and prev are on same LLC domain
        // Prefer prev for cache warmth
        return prev_cpu;
    }

    // Find an idle CPU in the same LLC (last-level cache) domain
    int idle = find_idlest_cpu(p, prev_cpu, wake_flags);
    if (idle >= 0)
        return idle;

    // Fall back to previous CPU (preserve cache)
    return prev_cpu;
}

// Special case: WF_SYNC flag
// When waker uses wake_up_sync(), it's hinting "I'm about to sleep"
// In this case, target the waking CPU directly:
// waker sleeps, awakened task runs on same CPU without migration
if (wake_flags & WF_SYNC) {
    // Target the waking CPU if allowed
    if (cpumask_test_cpu(waker_cpu, allowed))
        return waker_cpu;
}

The wake-affine heuristic:
Linux's scheduler has a "wake-affine" feature: When task A wakes task B, and they communicate frequently, it often makes sense to run B near A. The scheduler tracks waker-wakee relationships and may pull the wakee to the waker's CPU or LLC domain.
NUMA-aware wakeup:
On NUMA systems, memory access time depends on which node the CPU is on. The scheduler therefore weighs where the task's memory pages reside against how loaded each node's CPUs are when choosing a wakeup target.
Migrating a task to a different NUMA node can cause significant slowdown if its memory is remote. The scheduler balances CPU utilization against memory access costs.
IPI for remote wakeup:
When the wakeup targets a different CPU, an Inter-Processor Interrupt (IPI) is sent:
// Target CPU is notified via IPI
void smp_send_reschedule(int cpu) {
// Send IPI to 'cpu'
// When it receives, it will call schedule()
arch_send_ipi(cpu, RESCHEDULE_VECTOR);
}
The target CPU handles the IPI and invokes the scheduler to consider the newly runnable task.
IPIs are relatively expensive (~1-10μs). Waking a process on a different CPU costs more than waking on the current CPU. This is why cache warmth and same-CPU wake are preferred when the target CPU isn't idle. Excessive IPIs can become a bottleneck in highly concurrent systems.
Different scheduler classes (policies) handle sleeping processes differently:
CFS (Completely Fair Scheduler) - SCHED_NORMAL:
CFS tracks each task's "virtual runtime" (vruntime)—how much CPU time it has received relative to others. When a task sleeps, its vruntime stops advancing, so by the time it wakes it may be far behind the tasks that kept running.
CFS gives sleepers a bonus: their vruntime is set to max(their_vruntime, min_vruntime - threshold). This prevents a long-sleeping task from having such low vruntime that it monopolizes the CPU. It also ensures recently-woken interactive tasks get scheduled promptly.
// When a CFS task wakes up
void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) {
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se = &p->se;

    if (flags & ENQUEUE_WAKEUP) {
        // Sleeper bonus: don't let sleepers have unfairly low vruntime
        u64 min_vruntime = cfs_rq->min_vruntime;
        if (sched_feat(GENTLE_FAIR_SLEEPERS)) {
            // Cap vruntime so sleeper doesn't dominate
            se->vruntime = max(se->vruntime, min_vruntime - thresh);
        }
    }

    // Add to the red-black tree
    __enqueue_entity(cfs_rq, se);

    // Update min_vruntime
    update_min_vruntime(cfs_rq);
}

// The sleeper's vruntime may be stale from when they slept
// We bring it forward to be fair to tasks that ran while we slept
// But we also give a small bonus for being interactive (sleeping often)

Real-Time Scheduler - SCHED_FIFO/SCHED_RR:
RT tasks have static priorities (1-99). When an RT task wakes, it is inserted into the queue for its priority level and, if that priority is higher than the currently running task's, it preempts immediately.
RT sleepers get immediate scheduling attention on wakeup—there's no vruntime calculation. This is essential for meeting deadlines.
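For reference, this is how a task typically acquires such a static priority from user space (a sketch; the priority value 50 is arbitrary, and the call requires CAP_SYS_NICE or root):

#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Static RT priority in the 1-99 range; higher preempts lower. */
    struct sched_param sp = { .sched_priority = 50 };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }
    /* From here on, every wakeup of this task preempts any lower-priority
     * or SCHED_NORMAL task on its CPU immediately. */
    return 0;
}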
SCHED_DEADLINE:
Deadline scheduling tracks each task's deadline and runtime budget. When a deadline task sleeps, its remaining runtime is accounted; when it wakes, the scheduler checks its deadline and period to decide whether it can still be admitted for the current period.
Deadline tasks that oversleep may therefore miss their deadlines, which the application observes as deadline misses.
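As an illustration of how a task declares its runtime budget, deadline, and period, the usual route is the sched_setattr() syscall. This is a sketch: there is no glibc wrapper, the struct layout follows sched_setattr(2), and the 5 ms / 20 ms values are arbitrary.

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

struct sched_attr {                  /* layout from sched_setattr(2) */
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;          /* CPU budget per period (ns)        */
    uint64_t sched_deadline;         /* budget must be consumed by here   */
    uint64_t sched_period;           /* how often the budget recurs (ns)  */
};

int main(void)
{
    struct sched_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.sched_policy   = SCHED_DEADLINE;
    attr.sched_runtime  =  5 * 1000 * 1000;   /*  5 ms of CPU...        */
    attr.sched_deadline = 20 * 1000 * 1000;   /* ...within every 20 ms  */
    attr.sched_period   = 20 * 1000 * 1000;

    if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
        perror("sched_setattr (needs root; admission control may refuse)");
        return 1;
    }
    /* Sleeping past the deadline inside a period risks a deadline miss. */
    return 0;
}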
| Class | On Sleep | On Wakeup | Sleep Impact |
|---|---|---|---|
| SCHED_NORMAL (CFS) | Record vruntime, remove from RB-tree | Adjust vruntime (sleeper bonus), insert in RB-tree | Sleepers treated fairly; small interactive bonus |
| SCHED_FIFO | Remove from priority queue | Insert in priority queue, preempt if higher | Immediate scheduling at fixed priority |
| SCHED_RR | Remove from priority queue | Insert at end of priority level | Round-robin among same-priority peers |
| SCHED_DEADLINE | Account remaining runtime | Check deadline/period, admit if eligible | May miss deadline if oversleep |
There's ongoing debate about how much bonus sleepers should receive. Too much bonus and batch jobs are starved. Too little and interactive applications feel sluggish. CFS has tuned this over many kernel versions. The current behavior represents a balance based on empirical testing across diverse workloads.
When wakeup makes a new process runnable, the question arises: should it run immediately by preempting the current process?
The preemption decision:
After wakeup adds a task to the run queue, it calls check_preempt_curr():
void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) {
// Ask the scheduler class if p should preempt curr
rq->curr->sched_class->check_preempt_curr(rq, p, flags);
}
CFS preemption logic: CFS compares the woken task's vruntime to the current task's. If the woken task has lower vruntime (is further behind), a resched flag is set:
if (se->vruntime < curr->vruntime - wakeup_granularity) {
resched_curr(rq); // Set TIF_NEED_RESCHED flag
}
The wakeup_granularity is a tunable (~1ms by default) that prevents too-frequent preemption for minor vruntime differences.
RT preemption logic: Simple priority comparison:
if (p->prio < rq->curr->prio) {
resched_curr(rq); // Higher priority (lower number) preempts
}
RT tasks of higher priority always preempt lower priority or normal tasks.
When preemption actually happens:
Setting TIF_NEED_RESCHED doesn't immediately invoke schedule(). The actual context switch happens at the next scheduling point: on return to user space from a syscall or interrupt, or, with kernel preemption enabled, when kernel code reaches a preemptible point.
Latency implications:
Without CONFIG_PREEMPT, a newly-awakened high-priority task might have to wait until the current task finishes its kernel work. With CONFIG_PREEMPT or CONFIG_PREEMPT_RT, the kernel itself can be interrupted to honor the wakeup.
// After wakeup on target CPU, or on IPI reception
void scheduler_ipi(void) {
    // Called on receipt of reschedule IPI
    // The TIF_NEED_RESCHED flag is set
    // On return from this interrupt, schedule() will be called
    irq_enter();
    // Mark that we need to reschedule
    set_tsk_need_resched(current);
    irq_exit();
}

// Returning from interrupt or syscall
void prepare_exit_to_usermode(struct pt_regs *regs) {
    if (need_resched()) {
        // Don't return to user yet - call scheduler first
        schedule();
    }
    // Now return to user space
}

// With CONFIG_PREEMPT, even kernel code checks:
void maybe_preempt_kernel(void) {
    if (preemptible() && need_resched()) {
        // We're not holding any locks, and reschedule is needed
        preempt_schedule();
    }
}

Aggressive preemption (CONFIG_PREEMPT_RT) minimizes wakeup latency—critical for real-time systems. But context switching has overhead. Server workloads often prefer CONFIG_PREEMPT_VOLUNTARY or even CONFIG_PREEMPT_NONE for throughput, accepting slightly higher latency.
The scheduler periodically rebalances load across CPUs. How do sleeping tasks factor in?
Sleeping tasks don't consume CPU: A task on a wait queue doesn't count toward CPU load. If 1000 processes are sleeping and 2 are running, the load is 2, not 1002. Load balancing focuses on runnable tasks.
But sleeping tasks affect memory locality: A sleeping task's memory pages may be on a specific NUMA node. If it wakes on a different node, it experiences remote memory access penalties. The scheduler tracks NUMA placement and prefers waking on the same node.
Sleeping pattern analysis: The scheduler tracks how often tasks sleep and for how long. Tasks that rarely sleep are migrated more readily (they're CPU-bound); tasks that sleep frequently are kept near their previous or waker CPU (they're interactive or I/O-bound and benefit most from warm caches).
// PELT (Per-Entity Load Tracking) accounts for running vs. sleeping
struct sched_avg {
    u64 last_update_time;    // When we last calculated
    u64 load_sum;            // Accumulated load contribution
    u64 runnable_sum;        // Time spent runnable (not sleeping)
    u64 util_sum;            // Time spent executing
    u32 load_avg;            // Smoothed load average
    u32 runnable_avg;        // Smoothed runnable average
    u32 util_avg;            // Smoothed utilization
};

// When task sleeps, its load_sum/util_sum stop accumulating
// When task runs, they accumulate
// The _avg values are decayed exponentially

// A task that sleeps 80% of the time has lower util_avg than one
// that runs 80% of the time

// Load balancing uses these metrics:
void load_balance(int this_cpu) {
    // Find busiest run queue
    struct rq *busiest = find_busiest_queue();

    // Compare load_avg of tasks, not count
    // Pull tasks until balanced
    while (imbalanced(this_rq, busiest)) {
        struct task_struct *p = pick_next_to_migrate(busiest);
        if (p) {
            deactivate_task(busiest, p);
            set_cpu(p, this_cpu);
            activate_task(this_rq, p);
        }
    }
}

Idle balancing:
When a CPU becomes idle (all its tasks are sleeping), it can "steal" runnable work from busier CPUs' run queues.
This is called "pull migration" or "work stealing."
Wake-to-run latency considerations:
For latency-sensitive workloads (e.g., trading systems), you might pin critical threads to dedicated CPUs, isolate those CPUs from the general scheduler, or busy-poll instead of sleeping so no wakeup is needed at all (a pinning sketch follows below).
These reduce migration-induced cache misses and IPI latency at the cost of CPU utilization efficiency.
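A minimal sketch of the pinning approach (the CPU number 3 is arbitrary; a real deployment would typically also isolate that CPU via isolcpus or cpusets):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    /* Restrict this thread to CPU 3: it will always wake up there,
     * keeping its cache warm and avoiding cross-CPU reschedule IPIs. */
    CPU_ZERO(&set);
    CPU_SET(3, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... latency-critical work runs only on CPU 3 from here on ... */
    return 0;
}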
A task that frequently sleeps and wakes on different CPUs 'ping-pongs' between CPUs, never warming any cache. The scheduler detects this and may 'park' the task on one CPU, trading balance for locality. This is the wake-affine heuristic in action.
When things go wrong with sleeping processes, how do you diagnose?
Common symptoms and causes:
| Symptom | Possible Causes | Diagnostic Approach |
|---|---|---|
| Process stuck in D state | Waiting for I/O that never completes; kernel bug | Check /proc/PID/stack; look for hung I/O |
| Process sleeps forever (S state) | Lost wakeup; condition never true | Verify condition; check wakeup called |
| High latency after wakeup | Waking on busy/wrong CPU; low priority | Use perf sched; check affinity |
| Excessive context switches | Lock contention causing sleep/wake cycles | perf stat; check lock hold times |
| Process not running despite being runnable | Priority inversion; affinity mismatch | Check scheduling policy and affinity |
# Check what a process is waiting for
$ cat /proc/PID/stack
[<0>] do_wait+0x...
[<0>] kernel_wait4+0x...
[<0>] __do_sys_wait4+0x...
# Shows the kernel callstack - where in the kernel it's blocked

# Check process scheduling statistics
$ cat /proc/PID/sched
nr_involuntary_switches: 1234     # Preemptions
nr_voluntary_switches: 5678       # Sleep calls
se.sum_exec_runtime: 123456789    # Total CPU time (ns)
se.vruntime: 98765432             # CFS virtual runtime

# Trace scheduler events
$ sudo perf sched record -p PID sleep 10
$ sudo perf sched latency
# Shows sleep/wakeup latencies and causes

# Watch for processes stuck in D state
$ while true; do ps aux | awk '$8 ~ /D/'; sleep 1; done

# Enable scheduler debugging (kernel)
$ echo 1 > /sys/kernel/debug/sched/debug

# Trace wakeups
$ sudo trace-cmd record -e sched:sched_wakeup -p PID
$ sudo trace-cmd report

perf sched for deep analysis:
perf sched provides detailed visibility into scheduling decisions:
- perf sched record: Record scheduling events
- perf sched latency: Show per-task scheduling latencies
- perf sched timehist: Timeline of schedule events
- perf sched map: Visual CPU-to-task mapping over time
Together, these reveal where time goes between a wakeup and the task actually running: how long each task waited, which CPU it ran on, and what preempted it.
Tracing and debugging tools affect timing. A race condition that causes lost wakeups may disappear when tracing is enabled because the tracing code changes timing. Use minimal, always-on instrumentation (systemwide trace buffers) to catch timing bugs in production.
Sleep and wakeup are fundamentally scheduler operations, and understanding their integration is key to building and debugging efficient concurrent systems.
You have now completed the Sleep and Wakeup module. You understand the fundamental blocking synchronization mechanisms, the dangers of lost wakeups, the implementation challenges, and the deep integration with the operating system scheduler. This knowledge is essential for building correct and efficient synchronization primitives and for debugging performance issues in concurrent systems.