Sleep and wakeup don't operate in isolation—they are fundamentally scheduler operations. The scheduler is the kernel subsystem that decides which process runs on which CPU at any given moment. Sleep tells the scheduler "remove me from consideration," and wakeup says "consider me again."
Understanding how sleep/wakeup integrate with scheduling reveals why blocking code behaves the way it does under load.
This page examines the relationship between blocking synchronization and process scheduling in depth. The goal is to understand sleep/wakeup not as isolated mechanisms, but as integral parts of the scheduler's decision-making.
By the end of this page, you will understand: how schedule() implements the actual sleep; the relationship between process states and scheduler queues; how wakeup chooses which CPU to target; scheduler-class-specific behaviors for sleeping; sleep duration effects on priority; and load balancing implications.
When a process calls sleep, the actual suspension happens in schedule(). This function is the core of the Linux scheduler—it decides what runs next and performs the context switch.
What schedule() does:
The sleep path through schedule():
When a process is sleeping (state ≠ TASK_RUNNING), schedule() doesn't put it back on the run queue. The process simply vanishes from the scheduler's consideration until wakeup changes its state back.
// Simplified schedule() showing the sleep path
void __sched schedule(void) {
    struct task_struct *prev = current;
    struct task_struct *next;
    struct rq *rq = this_rq();      // This CPU's run queue

    // Lock the run queue
    raw_spin_lock(&rq->lock);

    // CRITICAL: If prev is not TASK_RUNNING, deactivate it
    if (prev->state != TASK_RUNNING) {
        // This is the sleep path!
        // prev won't be re-added to run queue
        deactivate_task(rq, prev);  // Remove from run queue

        // Special case: check for pending signals for INTERRUPTIBLE
        if (prev->state == TASK_INTERRUPTIBLE && signal_pending(prev)) {
            // Wake it up to handle the signal
            prev->state = TASK_RUNNING;
            activate_task(rq, prev);
        }
    }

    // Pick the next task to run
    next = pick_next_task(rq);      // Consults CFS, RT, etc.

    if (prev != next) {
        // Actually switch processes
        rq->curr = next;

        // THE CONTEXT SWITCH
        context_switch(rq, prev, next);
        // When we return here, we've been scheduled again
        // prev was saved, ran other stuff, prev restored
        // Could be microseconds or hours later!
    }

    raw_spin_unlock(&rq->lock);
}

// The context switch - where the magic happens
void context_switch(struct rq *rq, struct task_struct *prev,
                    struct task_struct *next) {
    // Switch memory context (page tables)
    switch_mm(prev->mm, next->mm, next);

    // Switch register context (stack, instruction pointer, etc.)
    switch_to(prev, next, prev);

    // After switch_to returns, we're running as 'next'
    // 'prev' is now the task we switched FROM
    // (which was 'next' before, confusingly)
}

Key insights:
1. State check before the schedule call matters: The sleeping process sets its state to a sleep state (TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE) before calling schedule(). When schedule() runs, it checks this state and decides not to re-enqueue the process. If the state were still TASK_RUNNING, the process would be immediately re-added to the run queue (a yield, not a sleep); see the sketch just after this list.
2. deactivate_task vs staying on run queue:
deactivate_task() removes the process from the run queue data structures (CFS's red-black tree, RT's priority queue, etc.). This is the moment the process truly becomes "sleeping"—it won't be picked by pick_next_task().
3. The context switch is the dividing line: Everything before context_switch runs in the sleeping process's context. After context_switch, a different process runs. The sleeping process's execution is frozen at that point until wakeup + scheduling returns it to a CPU.
4. switch_to is architecture-specific: The actual register save/restore is done by architecture-specific assembly. It saves prev's stack pointer, instruction pointer, and callee-saved registers, then loads next's.
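To make insight 1 concrete, here is a minimal sketch of the open-coded sleep pattern described above, using the standard kernel wait-queue helpers. The names my_waitq and my_condition are illustrative, not taken from the kernel; a real user would normally reach for the wait_event family instead of open-coding this.

#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/sched/signal.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(my_waitq);   /* illustrative wait queue  */
static bool my_condition;                   /* set by some waker        */

static int wait_for_my_condition(void)
{
    DEFINE_WAIT(wait);
    int ret = 0;

    for (;;) {
        /* Sets state to TASK_INTERRUPTIBLE *before* the condition check,
         * and links us onto my_waitq so wakeup can find us. */
        prepare_to_wait(&my_waitq, &wait, TASK_INTERRUPTIBLE);

        if (my_condition)
            break;
        if (signal_pending(current)) {
            ret = -ERESTARTSYS;
            break;
        }

        /* State is not TASK_RUNNING here, so schedule() deactivates us.
         * Execution resumes at this point only after wakeup + rescheduling. */
        schedule();
    }

    /* Restores TASK_RUNNING and removes us from the wait queue. */
    finish_wait(&my_waitq, &wait);
    return ret;
}

If a wakeup arrives anywhere after prepare_to_wait(), it flips the state back to TASK_RUNNING, so schedule() treats the call as a yield rather than a sleep—exactly the behavior described in insight 1.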
A sleeping process experiences a VOLUNTARY context switch—it called schedule() itself. Preemption causes INVOLUNTARY switches—the kernel forces schedule() during an interrupt return. Both use the same schedule() function, but the accounting differs. Sleep shows as voluntary switches in /proc/PID/status.
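From user space, the same split can be observed without reading /proc at all: getrusage() reports the two counters directly. A small sketch (Linux-specific fields):

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

/* Prints voluntary vs. involuntary context switches for this process.
 * The sleep() blocks, so it adds at least one voluntary switch. */
int main(void)
{
    struct rusage ru;

    sleep(1);   /* voluntary: we asked the scheduler to remove us */

    if (getrusage(RUSAGE_SELF, &ru) != 0) {
        perror("getrusage");
        return 1;
    }
    printf("voluntary switches:   %ld\n", ru.ru_nvcsw);
    printf("involuntary switches: %ld\n", ru.ru_nivcsw);
    return 0;
}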
The scheduler maintains run queues—data structures holding processes eligible to run. Sleep and wakeup move processes between membership states:
On run queue (state == TASK_RUNNING): the process is visible to pick_next_task() and will receive CPU time.
Off run queue (state ≠ TASK_RUNNING): the process is invisible to the scheduler entirely.
Processes off the run queue consume zero CPU time. They're just data structures in memory, tracked by wait queues.
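A sketch of that arrangement using the standard wait-queue helpers (io_waitq and data_ready are illustrative names; a real driver would protect the shared flag with appropriate locking):

#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(io_waitq);   /* illustrative names */
static bool data_ready;

/* Consumer: while data_ready is false it sits on io_waitq, off the run
 * queue, consuming no CPU at all. */
static int consumer(void)
{
    return wait_event_interruptible(io_waitq, data_ready);
}

/* Producer: makes the condition true, then moves any sleepers on io_waitq
 * back to TASK_RUNNING and onto a run queue. */
static void producer(void)
{
    data_ready = true;
    wake_up_interruptible(&io_waitq);
}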
Per-CPU run queues:
Modern SMP Linux has a run queue per CPU. Each queue has its own lock and scheduling data:
struct rq {
raw_spinlock_t lock; // Protects this queue
unsigned int nr_running; // Count of runnable tasks
struct cfs_rq cfs; // CFS scheduling data
struct rt_rq rt; // RT scheduling data
struct task_struct *curr; // Currently running task
struct task_struct *idle; // This CPU's idle task
// ...
};
Sleep removes from ONE CPU's run queue: When sleeping, the process is deactivated from its current CPU's queue. It's no longer associated with any specific CPU.
Wakeup adds to a (possibly different) CPU's run queue: Wakeup must choose which CPU to target. This might be the CPU the task last ran on, the waker's CPU, or an idle CPU elsewhere, depending on the heuristics described below.
The sum of nr_running across all CPUs approximates instantaneous CPU demand. Sleeping processes don't count toward it—they're waiting for events, not CPU. (Linux's load average adds one wrinkle: it also counts tasks in uninterruptible D-state sleep, which is why heavy disk I/O can still raise the load average.) This is why most I/O-bound workloads show low load despite having many processes.
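A small user-space sketch makes this visible: the fourth field of /proc/loadavg is "runnable/total", so a machine full of sleeping processes still reports only a handful of runnable tasks.

#include <stdio.h>

/* Reads /proc/loadavg and shows how few of the system's tasks are
 * actually runnable at any instant. */
int main(void)
{
    double one, five, fifteen;
    int runnable, total, last_pid;
    FILE *f = fopen("/proc/loadavg", "r");

    if (!f) {
        perror("/proc/loadavg");
        return 1;
    }
    if (fscanf(f, "%lf %lf %lf %d/%d %d",
               &one, &five, &fifteen, &runnable, &total, &last_pid) == 6)
        printf("load(1m)=%.2f  runnable %d of %d tasks\n",
               one, runnable, total);
    fclose(f);
    return 0;
}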
When a process wakes up, where should it run? This decision significantly impacts performance. The kernel uses several heuristics:
Factors in CPU selection:
| Factor | Consideration | Trade-off |
|---|---|---|
| Cache warmth | Running on last CPU preserves L1/L2 cache data | Reduces cache misses vs. potential delay |
| Idle CPUs | An idle CPU can run the task immediately | Reduces latency vs. migration cost |
| Waker locality | Waker's CPU may have relevant data cached | Good for producer-consumer patterns |
| Load balancing | Spread work across CPUs evenly | Throughput vs. latency |
| NUMA locality | Same NUMA node has faster memory access | Critical for memory-intensive tasks |
| Affinity mask | Task may be restricted to certain CPUs | Hard constraint; must be respected |
// Simplified logic for selecting wakeup CPU
int select_task_rq(struct task_struct *p,
                   int prev_cpu,        // Where task last ran
                   int wake_flags) {
    // Check affinity - must respect this
    cpumask_t *allowed = &p->cpus_allowed;
    if (!cpumask_test_cpu(prev_cpu, allowed))
        prev_cpu = cpumask_any(allowed);

    // Fast path: if prev CPU is idle, use it (cache warm + immediate run)
    if (cpu_idle(prev_cpu) && cpumask_test_cpu(prev_cpu, allowed))
        return prev_cpu;

    // Check the waking CPU (for producer-consumer locality)
    int waker_cpu = smp_processor_id();
    if (cpumask_test_cpu(waker_cpu, allowed) &&
        shares_llc(waker_cpu, prev_cpu)) {
        // Waker and prev are on same LLC domain
        // Prefer prev for cache warmth
        return prev_cpu;
    }

    // Find an idle CPU in the same LLC (last-level cache) domain
    int idle = find_idlest_cpu(p, prev_cpu, wake_flags);
    if (idle >= 0)
        return idle;

    // Fall back to previous CPU (preserve cache)
    return prev_cpu;
}

// Special case: WF_SYNC flag
// When waker uses wake_up_sync(), it's hinting "I'm about to sleep"
// In this case, target the waking CPU directly:
// waker sleeps, awakened task runs on same CPU without migration
if (wake_flags & WF_SYNC) {
    // Target the waking CPU if allowed
    if (cpumask_test_cpu(waker_cpu, allowed))
        return waker_cpu;
}

The wake-affine heuristic:
Linux's scheduler has a "wake-affine" feature: When task A wakes task B, and they communicate frequently, it often makes sense to run B near A. The scheduler tracks waker-wakee relationships and may pull the wakee to the waker's CPU or LLC domain.
NUMA-aware wakeup:
On NUMA systems, memory access time depends on which node the CPU is on. The scheduler therefore weighs where the task's memory pages reside against how loaded each node's CPUs are when choosing a wakeup target.
Migrating a task to a different NUMA node can cause significant slowdown if its memory is remote. The scheduler balances CPU utilization against memory access costs.
IPI for remote wakeup:
When the wakeup targets a different CPU, an Inter-Processor Interrupt (IPI) is sent:
// Target CPU is notified via IPI
void smp_send_reschedule(int cpu) {
// Send IPI to 'cpu'
// When it receives, it will call schedule()
arch_send_ipi(cpu, RESCHEDULE_VECTOR);
}
The target CPU handles the IPI and invokes the scheduler to consider the newly runnable task.
IPIs are relatively expensive (~1-10μs). Waking a process on a different CPU costs more than waking on the current CPU. This is why cache warmth and same-CPU wake are preferred when the target CPU isn't idle. Excessive IPIs can become a bottleneck in highly concurrent systems.
Different scheduler classes (policies) handle sleeping processes differently:
CFS (Completely Fair Scheduler) - SCHED_NORMAL:
CFS tracks each task's "virtual runtime" (vruntime)—how much CPU time it has received relative to others. When a task sleeps, its vruntime stops advancing, so by the time it wakes it may be far behind the tasks that kept running.
CFS gives sleepers a bonus: their vruntime is set to max(their_vruntime, min_vruntime - threshold). This prevents a long-sleeping task from having such low vruntime that it monopolizes the CPU. It also ensures recently-woken interactive tasks get scheduled promptly.
// When a CFS task wakes up
void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) {
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se = &p->se;

    if (flags & ENQUEUE_WAKEUP) {
        // Sleeper bonus: don't let sleepers have unfairly low vruntime
        u64 min_vruntime = cfs_rq->min_vruntime;
        if (sched_feat(GENTLE_FAIR_SLEEPERS)) {
            // Cap vruntime so sleeper doesn't dominate
            se->vruntime = max(se->vruntime, min_vruntime - thresh);
        }
    }

    // Add to the red-black tree
    __enqueue_entity(cfs_rq, se);

    // Update min_vruntime
    update_min_vruntime(cfs_rq);
}

// The sleeper's vruntime may be stale from when they slept
// We bring it forward to be fair to tasks that ran while we slept
// But we also give a small bonus for being interactive (sleeping often)

Real-Time Scheduler - SCHED_FIFO/SCHED_RR:
RT tasks have static priorities (1-99). When an RT task wakes, it is inserted into the queue for its priority level and, if that priority is higher than the currently running task's, it preempts immediately.
RT sleepers get immediate scheduling attention on wakeup—there's no vruntime calculation. This is essential for meeting deadlines.
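For reference, this is how a task typically acquires such a static priority from user space (a sketch; the priority value 50 is arbitrary, and the call requires CAP_SYS_NICE or root):

#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Static RT priority in the 1-99 range; higher preempts lower. */
    struct sched_param sp = { .sched_priority = 50 };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }
    /* From here on, every wakeup of this task preempts any lower-priority
     * or SCHED_NORMAL task on its CPU immediately. */
    return 0;
}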
SCHED_DEADLINE:
Deadline scheduling tracks each task's deadline and runtime budget. When a deadline task sleeps, its remaining runtime is accounted; when it wakes, the scheduler checks its deadline and period to decide whether it can still be admitted for the current period.
Deadline tasks that oversleep may therefore miss their deadlines, which the application observes as deadline misses.
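As an illustration of how a task declares its runtime budget, deadline, and period, the usual route is the sched_setattr() syscall. This is a sketch: there is no glibc wrapper, the struct layout follows sched_setattr(2), and the 5 ms / 20 ms values are arbitrary.

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

struct sched_attr {                  /* layout from sched_setattr(2) */
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;          /* CPU budget per period (ns)        */
    uint64_t sched_deadline;         /* budget must be consumed by here   */
    uint64_t sched_period;           /* how often the budget recurs (ns)  */
};

int main(void)
{
    struct sched_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size           = sizeof(attr);
    attr.sched_policy   = SCHED_DEADLINE;
    attr.sched_runtime  =  5 * 1000 * 1000;   /*  5 ms of CPU...        */
    attr.sched_deadline = 20 * 1000 * 1000;   /* ...within every 20 ms  */
    attr.sched_period   = 20 * 1000 * 1000;

    if (syscall(SYS_sched_setattr, 0, &attr, 0) != 0) {
        perror("sched_setattr (needs root; admission control may refuse)");
        return 1;
    }
    /* Sleeping past the deadline inside a period risks a deadline miss. */
    return 0;
}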
| Class | On Sleep | On Wakeup | Sleep Impact |
|---|---|---|---|
| SCHED_NORMAL (CFS) | Record vruntime, remove from RB-tree | Adjust vruntime (sleeper bonus), insert in RB-tree | Sleepers treated fairly; small interactive bonus |
| SCHED_FIFO | Remove from priority queue | Insert in priority queue, preempt if higher | Immediate scheduling at fixed priority |
| SCHED_RR | Remove from priority queue | Insert at end of priority level | Round-robin among same-priority peers |
| SCHED_DEADLINE | Account remaining runtime | Check deadline/period, admit if eligible | May miss deadline if oversleep |
There's ongoing debate about how much bonus sleepers should receive. Too much bonus and batch jobs are starved. Too little and interactive applications feel sluggish. CFS has tuned this over many kernel versions. The current behavior represents a balance based on empirical testing across diverse workloads.
When wakeup makes a new process runnable, the question arises: should it run immediately by preempting the current process?
The preemption decision:
After wakeup adds a task to the run queue, it calls check_preempt_curr():
void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags) {
// Ask the scheduler class if p should preempt curr
rq->curr->sched_class->check_preempt_curr(rq, p, flags);
}
CFS preemption logic: CFS compares the woken task's vruntime to the current task's. If the woken task has lower vruntime (is further behind), a resched flag is set:
if (se->vruntime < curr->vruntime - wakeup_granularity) {
resched_curr(rq); // Set TIF_NEED_RESCHED flag
}
The wakeup_granularity is a tunable (~1ms by default) that prevents too-frequent preemption for minor vruntime differences.
RT preemption logic: Simple priority comparison:
if (p->prio < rq->curr->prio) {
resched_curr(rq); // Higher priority (lower number) preempts
}
RT tasks of higher priority always preempt lower priority or normal tasks.
When preemption actually happens:
Setting TIF_NEED_RESCHED doesn't immediately invoke schedule(). The actual context switch happens at the next scheduling point: on return to user space from a syscall or interrupt, or, with kernel preemption enabled, when kernel code reaches a preemptible point.
Latency implications:
Without CONFIG_PREEMPT, a newly-awakened high-priority task might have to wait until the current task finishes its kernel work. With CONFIG_PREEMPT or CONFIG_PREEMPT_RT, the kernel itself can be interrupted to honor the wakeup.
// After wakeup on target CPU, or on IPI reception
void scheduler_ipi(void) {
    // Called on receipt of reschedule IPI
    // The TIF_NEED_RESCHED flag is set
    // On return from this interrupt, schedule() will be called
    irq_enter();
    // Mark that we need to reschedule
    set_tsk_need_resched(current);
    irq_exit();
}

// Returning from interrupt or syscall
void prepare_exit_to_usermode(struct pt_regs *regs) {
    if (need_resched()) {
        // Don't return to user yet - call scheduler first
        schedule();
    }
    // Now return to user space
}

// With CONFIG_PREEMPT, even kernel code checks:
void maybe_preempt_kernel(void) {
    if (preemptible() && need_resched()) {
        // We're not holding any locks, and reschedule is needed
        preempt_schedule();
    }
}

Aggressive preemption (CONFIG_PREEMPT_RT) minimizes wakeup latency—critical for real-time systems. But context switching has overhead. Server workloads often prefer CONFIG_PREEMPT_VOLUNTARY or even CONFIG_PREEMPT_NONE for throughput, accepting slightly higher latency.
The scheduler periodically rebalances load across CPUs. How do sleeping tasks factor in?
Sleeping tasks don't consume CPU: A task on a wait queue doesn't count toward CPU load. If 1000 processes are sleeping and 2 are running, the load is 2, not 1002. Load balancing focuses on runnable tasks.
But sleeping tasks affect memory locality: A sleeping task's memory pages may be on a specific NUMA node. If it wakes on a different node, it experiences remote memory access penalties. The scheduler tracks NUMA placement and prefers waking on the same node.
Sleeping pattern analysis: The scheduler tracks how often tasks sleep and for how long. Tasks that rarely sleep are migrated more readily (they're CPU-bound); tasks that sleep frequently are kept near their previous or waker CPU (they're interactive or I/O-bound and benefit most from warm caches).
// PELT (Per-Entity Load Tracking) accounts for running vs. sleeping
struct sched_avg {
    u64 last_update_time;    // When we last calculated
    u64 load_sum;            // Accumulated load contribution
    u64 runnable_sum;        // Time spent runnable (not sleeping)
    u64 util_sum;            // Time spent executing
    u32 load_avg;            // Smoothed load average
    u32 runnable_avg;        // Smoothed runnable average
    u32 util_avg;            // Smoothed utilization
};

// When task sleeps, its load_sum/util_sum stop accumulating
// When task runs, they accumulate
// The _avg values are decayed exponentially

// A task that sleeps 80% of the time has lower util_avg than one
// that runs 80% of the time

// Load balancing uses these metrics:
void load_balance(int this_cpu) {
    // Find busiest run queue
    struct rq *busiest = find_busiest_queue();

    // Compare load_avg of tasks, not count
    // Pull tasks until balanced
    while (imbalanced(this_rq, busiest)) {
        struct task_struct *p = pick_next_to_migrate(busiest);
        if (p) {
            deactivate_task(busiest, p);
            set_cpu(p, this_cpu);
            activate_task(this_rq, p);
        }
    }
}

Idle balancing:
When a CPU becomes idle (all its tasks are sleeping), it can "steal" runnable work from busier CPUs' run queues.
This is called "pull migration" or "work stealing."
Wake-to-run latency considerations:
For latency-sensitive workloads (e.g., trading systems), you might pin critical threads to dedicated CPUs, isolate those CPUs from the general scheduler, or busy-poll instead of sleeping so no wakeup is needed at all (a pinning sketch follows below).
These reduce migration-induced cache misses and IPI latency at the cost of CPU utilization efficiency.
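A minimal sketch of the pinning approach (the CPU number 3 is arbitrary; a real deployment would typically also isolate that CPU via isolcpus or cpusets):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    /* Restrict this thread to CPU 3: it will always wake up there,
     * keeping its cache warm and avoiding cross-CPU reschedule IPIs. */
    CPU_ZERO(&set);
    CPU_SET(3, &set);

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... latency-critical work runs only on CPU 3 from here on ... */
    return 0;
}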
A task that frequently sleeps and wakes on different CPUs 'ping-pongs' between CPUs, never warming any cache. The scheduler detects this and may 'park' the task on one CPU, trading balance for locality. This is the wake-affine heuristic in action.
When things go wrong with sleeping processes, how do you diagnose?
Common symptoms and causes:
| Symptom | Possible Causes | Diagnostic Approach |
|---|---|---|
| Process stuck in D state | Waiting for I/O that never completes; kernel bug | Check /proc/PID/stack; look for hung I/O |
| Process sleeps forever (S state) | Lost wakeup; condition never true | Verify condition; check wakeup called |
| High latency after wakeup | Waking on busy/wrong CPU; low priority | Use perf sched; check affinity |
| Excessive context switches | Lock contention causing sleep/wake cycles | perf stat; check lock hold times |
| Process not running despite being runnable | Priority inversion; affinity mismatch | Check scheduling policy and affinity |
# Check what a process is waiting for
$ cat /proc/PID/stack
[<0>] do_wait+0x...
[<0>] kernel_wait4+0x...
[<0>] __do_sys_wait4+0x...
# Shows the kernel callstack - where in the kernel it's blocked

# Check process scheduling statistics
$ cat /proc/PID/sched
nr_involuntary_switches: 1234     # Preemptions
nr_voluntary_switches: 5678       # Sleep calls
se.sum_exec_runtime: 123456789    # Total CPU time (ns)
se.vruntime: 98765432             # CFS virtual runtime

# Trace scheduler events
$ sudo perf sched record -p PID sleep 10
$ sudo perf sched latency
# Shows sleep/wakeup latencies and causes

# Watch for processes stuck in D state
$ while true; do ps aux | awk '$8 ~ /D/'; sleep 1; done

# Enable scheduler debugging (kernel)
$ echo 1 > /sys/kernel/debug/sched/debug

# Trace wakeups
$ sudo trace-cmd record -e sched:sched_wakeup -p PID
$ sudo trace-cmd report

perf sched for deep analysis:
perf sched provides detailed visibility into scheduling decisions:
- perf sched record: Record scheduling events
- perf sched latency: Show per-task scheduling latencies
- perf sched timehist: Timeline of schedule events
- perf sched map: Visual CPU-to-task mapping over time
Together, these reveal where time goes between a wakeup and the task actually running: how long each task waited, which CPU it ran on, and what preempted it.
Tracing and debugging tools affect timing. A race condition that causes lost wakeups may disappear when tracing is enabled because the tracing code changes timing. Use minimal, always-on instrumentation (systemwide trace buffers) to catch timing bugs in production.
Sleep and wakeup are fundamentally scheduler operations, and understanding their integration is key to building and debugging efficient concurrent systems.
You have now completed the Sleep and Wakeup module. You understand the fundamental blocking synchronization mechanisms, the dangers of lost wakeups, the implementation challenges, and the deep integration with the operating system scheduler. This knowledge is essential for building correct and efficient synchronization primitives and for debugging performance issues in concurrent systems.