Every time you run a command in your terminal, open an application, or load a web page on a Linux server, an elegant piece of software engineering determines when and for how long your process gets to use the CPU. This software is the Completely Fair Scheduler (CFS)—the default process scheduler in the Linux kernel since October 2007.
CFS represents a fundamental departure from traditional scheduling algorithms. Rather than relying on fixed time slices and priority queues like its predecessors, CFS approaches scheduling as a problem of proportional fairness. Its goal is deceptively simple: give every runnable process a fair share of CPU time, proportional to its weight. But achieving this goal efficiently across thousands of processes, on systems ranging from embedded devices to supercomputers, requires sophisticated algorithms and data structures.
Designed by Ingo Molnár, CFS replaced the O(1) scheduler that had been in use since Linux 2.6.0. While the O(1) scheduler was efficient, it struggled with interactivity and fairness on desktop workloads. CFS solved these problems through a revolutionary approach based on virtual runtime and red-black trees.
By the end of this page, you will understand: (1) The historical context and motivation for CFS, (2) The mathematical foundation of proportional-share scheduling, (3) How CFS models an 'ideal multitasking CPU,' (4) The core algorithm and its complexity guarantees, and (5) Why CFS became the default scheduler for billions of devices worldwide.
To appreciate CFS, we must understand the scheduler it replaced and the problems that motivated its creation.
The O(1) Scheduler (2003-2007)
The O(1) scheduler, introduced by Ingo Molnár in Linux 2.6.0, was a significant achievement. Its name came from its O(1) time complexity for all scheduling operations—regardless of how many processes were runnable, selecting the next process took constant time. It achieved this through:

- Per-CPU runqueues, which avoided contending on a single global lock
- Two arrays of priority lists per runqueue (active and expired), covering 140 priority levels
- A bitmap of non-empty priority lists, so the highest-priority runnable task could be found with a single find-first-bit operation
- Interactivity heuristics that adjusted a task's dynamic priority based on how long it slept
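To see why this is constant time, here is a minimal, hypothetical sketch of the bitmap-based pick (in the spirit of, but not copied from, the kernel code): the cost depends only on the fixed number of priority levels, never on how many tasks are runnable.

```c
/*
 * Illustrative sketch of O(1)-style task selection (not actual kernel code).
 * Each runqueue keeps one FIFO list per priority level plus a bitmap that
 * marks which lists are non-empty; picking the next task is a find-first-bit
 * followed by taking the head of that list.
 */
#include <stddef.h>

#define MAX_PRIO      140
#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct task_list;                       /* opaque per-priority FIFO list */

struct o1_runqueue {
    unsigned long bitmap[(MAX_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG];
    struct task_list *queue[MAX_PRIO];  /* one list per priority level */
};

/* Return the lowest-numbered (highest-priority) set bit, or -1 if none. */
static int find_first_set(const unsigned long *bitmap, int nbits)
{
    for (int i = 0; i < nbits; i++)
        if (bitmap[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
            return i;
    return -1;
}

/* Cost is bounded by MAX_PRIO, a constant: O(1) in the number of tasks. */
static struct task_list *o1_pick_next(struct o1_runqueue *rq)
{
    int prio = find_first_set(rq->bitmap, MAX_PRIO);
    return prio < 0 ? NULL : rq->queue[prio];
}
```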
While elegant for server workloads, the O(1) scheduler had fundamental problems with desktop interactivity.
The Con Kolivas Experiment
Australian anesthesiologist and kernel developer Con Kolivas proposed alternative schedulers (RSDL, SD) that challenged the O(1) approach. His work demonstrated that a fundamentally different approach—based on deadline scheduling and strict fairness—could dramatically improve desktop responsiveness.
Kolivas's ideas influenced Ingo Molnár to create CFS, which took the concept of fairness and implemented it with unprecedented elegance. Molnár described the design philosophy:
"CFS basically models an 'ideal, precise multitasking CPU' on real hardware. The 'virtual runtime' of a task specifies when its next timeslice would start execution on the ideal multi-tasking CPU. In practice, the virtual runtime of a task is the actual runtime normalized to the total number of running tasks."
This single insight—modeling an ideal CPU where every process runs simultaneously with infinitesimal slices—became the foundation of CFS.
Imagine a CPU that could run N processes truly simultaneously, each receiving exactly 1/N of the CPU's power. While physically impossible, CFS approximates this ideal by tracking how much CPU time each process 'deserves' versus how much it has actually received, then always running the process that's most behind.
CFS is built on a mathematically rigorous foundation: proportional-share scheduling (also known as weighted fair queueing in networking). The core idea is that each process should receive CPU time proportional to its weight rather than according to fixed priority levels.
The Fairness Equation
For a system with N runnable processes, each with weight wᵢ, the CPU share for process i should be:
CPU Share(i) = wᵢ / Σwⱼ
where the sum runs over all runnable processes. A process's share therefore depends only on its weight relative to the total: if all weights are equal, each process receives 1/N of the CPU, and doubling one process's weight doubles its share at everyone else's proportional expense.
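As a quick sanity check of the equation, this small standalone C program (illustrative only, not kernel code) computes the share each task would receive for a hypothetical mix of weights drawn from the kernel's nice-to-weight table:

```c
/* Proportional shares from weights: CPU Share(i) = w_i / sum(w_j). */
#include <stdio.h>

int main(void)
{
    /* Hypothetical runnable tasks: nice 0, nice -5, and nice +5,
     * using the kernel's standard weights for those nice values. */
    unsigned long weight[] = { 1024, 3121, 335 };
    int n = sizeof(weight) / sizeof(weight[0]);
    unsigned long total = 0;

    for (int i = 0; i < n; i++)
        total += weight[i];

    for (int i = 0; i < n; i++)
        printf("task %d: weight %5lu -> %5.1f%% of the CPU\n",
               i, weight[i], 100.0 * weight[i] / total);
    return 0;
}
```

With these weights the nice -5 task gets roughly 70% of the CPU, the nice 0 task about 23%, and the nice +5 task about 7%.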
From Shares to Scheduling
The challenge is converting this abstract fairness requirement into concrete scheduling decisions. CFS solves this with a brilliant insight: instead of tracking actual CPU time, track a normalized time called virtual runtime (vruntime).
Virtual runtime increases slower for high-weight processes and faster for low-weight processes. When virtual runtimes are equal, actual CPU time is proportional to weight. CFS simply selects the process with the smallest vruntime—the process that's most 'behind' relative to what it deserves.
```c
/*
 * Conceptual illustration of virtual runtime calculation
 *
 * In CFS, each process has a vruntime that represents its
 * "virtual" elapsed time. The key insight:
 *
 *   vruntime increases proportionally to actual runtime
 *   but inversely to the process's weight.
 */

/* NICE_0_LOAD represents the weight of a nice-0 process (1024) */
#define NICE_0_LOAD 1024

/*
 * delta_exec: actual CPU time used in this scheduling period
 * weight:     process weight (derived from nice value)
 *
 * vruntime delta = (delta_exec * NICE_0_LOAD) / weight
 */
static inline u64 calc_delta_vruntime(u64 delta_exec, unsigned long weight)
{
    /*
     * For a nice 0 process (weight = 1024):
     *   delta_vruntime = delta_exec             (no change)
     *
     * For a nice +19 process (weight = 15, lowest priority):
     *   delta_vruntime = delta_exec * 68.27     (much faster increase)
     *
     * For a nice -20 process (weight = 88761, highest priority):
     *   delta_vruntime = delta_exec * 0.0115    (much slower increase)
     */
    return (delta_exec * NICE_0_LOAD) / weight;
}

/*
 * This elegant formulation ensures:
 * 1. All processes with equal weight have equal vruntime growth
 * 2. Higher-weight processes can run longer before "catching up"
 * 3. Scheduling the lowest-vruntime process first is always fair
 */
```

Why Virtual Runtime Works
Consider two processes, A (weight 1024) and B (weight 2048). For the same amount of actual runtime, A's vruntime advances at the nice-0 rate while B's advances at half that rate, because B's weight is twice NICE_0_LOAD. If both start at vruntime 0, CFS will schedule them so that their vruntimes stay approximately equal at any point, which means B has received roughly twice as much actual CPU time as A: about two thirds of the CPU for B and one third for A.
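The following toy simulation (an illustrative sketch, not kernel code) makes this concrete: at every 1 ms tick it runs whichever task has the smaller vruntime and updates that task's vruntime using the weight-normalized formula above.

```c
/* Toy CFS simulation: always run the task with the smallest vruntime. */
#include <stdio.h>

#define NICE_0_LOAD 1024
#define TICKS       9000          /* simulate 9000 one-millisecond ticks */

struct task {
    const char   *name;
    unsigned long weight;
    double        vruntime;       /* virtual runtime, in ms */
    double        cpu_ms;         /* actual CPU time received, in ms */
};

int main(void)
{
    struct task a = { "A", 1024, 0.0, 0.0 };
    struct task b = { "B", 2048, 0.0, 0.0 };

    for (int tick = 0; tick < TICKS; tick++) {
        /* Pick the task that is furthest "behind" (smallest vruntime). */
        struct task *t = (a.vruntime <= b.vruntime) ? &a : &b;
        t->cpu_ms   += 1.0;
        t->vruntime += 1.0 * NICE_0_LOAD / t->weight;
    }

    printf("%s: %6.0f ms (%4.1f%%), vruntime %.1f\n",
           a.name, a.cpu_ms, 100.0 * a.cpu_ms / TICKS, a.vruntime);
    printf("%s: %6.0f ms (%4.1f%%), vruntime %.1f\n",
           b.name, b.cpu_ms, 100.0 * b.cpu_ms / TICKS, b.vruntime);
    return 0;
}
```

Running this yields roughly 33% of the CPU for A and 67% for B, with both vruntimes ending near 3000 ms—exactly the behavior the fairness equation predicts.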
This mathematical elegance eliminates the heuristics, magic numbers, and special cases that plagued the O(1) scheduler. Fairness emerges naturally from a single, principled algorithm.
By normalizing actual runtime to virtual runtime, CFS converts a complex multi-objective optimization problem (balance fairness, priorities, throughput) into a simple single-objective problem: always run the process with the smallest vruntime. This is a textbook example of elegant algorithm design.
CFS operates through a remarkably simple loop, made efficient by sophisticated data structures. Let's trace through the algorithm step by step.
Core Data Structures
Each CPU maintains a runqueue (struct rq) containing a CFS runqueue (struct cfs_rq). The CFS runqueue's most important component is a red-black tree ordered by vruntime. Every runnable process is a node in this tree, keyed by its vruntime.
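A heavily simplified sketch of these structures is shown below. The field names loosely follow the kernel's struct sched_entity and struct cfs_rq, but nearly all members (load tracking, group scheduling, statistics) are omitted, and the real kernel wraps the tree root and cached leftmost node together in an rb_root_cached.

```c
/* Simplified view of the per-CPU CFS data structures (illustrative only). */

struct rb_node {                          /* red-black tree node */
    struct rb_node *parent, *left, *right;
    int             color;
};

struct sched_entity {
    struct rb_node      run_node;         /* node in the cfs_rq's rb-tree  */
    unsigned long       weight;           /* derived from the nice value   */
    unsigned long long  vruntime;         /* key that orders the tree      */
    unsigned long long  exec_start;       /* timestamp of last accounting  */
    unsigned long long  sum_exec_runtime; /* total actual CPU time used    */
    int                 on_rq;            /* currently enqueued?           */
};

struct cfs_rq {
    struct rb_node      *tasks_timeline;  /* root of the rb-tree           */
    struct rb_node      *rb_leftmost;     /* cached smallest-vruntime node */
    struct sched_entity *curr;            /* entity currently on the CPU   */
    unsigned long long   min_vruntime;    /* monotonic floor for placement */
    unsigned int         nr_running;      /* number of runnable entities   */
};
```

Caching the leftmost node is what makes picking the next task O(1), a detail that matters for the complexity analysis below.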
The Scheduling Loop
On each scheduler tick (and on events such as wakeups and yields), CFS repeats a simple sequence:

1. update_curr() measures how long the current task has run since the last update (delta_exec).
2. It converts this to virtual time as delta_vruntime = delta_exec * (NICE_0_LOAD / weight) and adds this to the task's current vruntime.
3. The runqueue's min_vruntime is advanced to track the smallest vruntime seen.
4. If another task's vruntime is now smaller than the current task's by more than the wakeup granularity, or the current task has exhausted its slice, the current task is preempted and re-inserted into the tree.
5. pick_next_task_fair() selects the leftmost node of the red-black tree—the task with the smallest vruntime—to run next.
```c
/*
 * CFS Main Scheduling Loop (Simplified)
 *
 * This pseudocode illustrates the core CFS algorithm.
 * The actual kernel code is more complex due to
 * handling of SMP, CPU affinity, load balancing, etc.
 */

struct task_struct *pick_next_task_fair(struct rq *rq)
{
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se;

    /* If no runnable tasks, return idle task */
    if (!cfs_rq->nr_running)
        return idle_task(rq);

    /* Pick the task with smallest vruntime (leftmost in tree) */
    se = __pick_first_entity(cfs_rq);

    /* Return the task_struct containing this sched_entity */
    return task_of(se);
}

void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_clock_task(rq_of(cfs_rq));
    u64 delta_exec;

    if (!curr)
        return;

    /* Calculate how long the task has been running */
    delta_exec = now - curr->exec_start;
    curr->exec_start = now;

    /* Update task's statistics */
    curr->sum_exec_runtime += delta_exec;

    /* Update virtual runtime based on weight */
    curr->vruntime += calc_delta_fair(delta_exec, curr);

    /* Update the minimum vruntime for the runqueue */
    update_min_vruntime(cfs_rq);
}

void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    /* If task was sleeping, give it some vruntime credit */
    if (se->on_rq == 0) {
        /* Place near the minimum vruntime to prevent starvation */
        se->vruntime = max(se->vruntime,
                           cfs_rq->min_vruntime - sched_latency / 2);
    }

    /* Update runqueue load statistics */
    account_entity_enqueue(cfs_rq, se);

    /* Insert into the red-black tree: O(log n) */
    __enqueue_entity(cfs_rq, se);

    cfs_rq->nr_running++;
}
```

Time Slice Calculation
Unlike the O(1) scheduler, CFS doesn't assign fixed time slices. Instead, it calculates a target latency (typically 6ms for <= 8 tasks) and divides it among runnable tasks proportional to their weights:
time_slice(i) = target_latency × (wᵢ / Σwⱼ)
This means that each runnable task receives a slice proportional to its weight within every latency period: with N equal-weight tasks, each gets target_latency / N. When so many tasks are runnable that a slice would drop below the minimum granularity (0.75 ms by default), the period itself is stretched instead, trading latency for lower context-switch overhead.
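A simplified sketch of that calculation, loosely modeled on the kernel's __sched_period() and sched_slice() logic (the function name and parameters here are illustrative), might look like this:

```c
/* Illustrative time-slice calculation: divide the latency period by weight. */

#define NSEC_PER_MSEC 1000000ULL

static unsigned long long sched_latency_ns         = 6 * NSEC_PER_MSEC; /* 6 ms    */
static unsigned long long sched_min_granularity_ns = 750000ULL;         /* 0.75 ms */

static unsigned long long cfs_time_slice(unsigned long weight,
                                         unsigned long total_weight,
                                         unsigned int nr_running)
{
    unsigned long long period = sched_latency_ns;

    /* With many runnable tasks, stretch the period so that no slice
     * falls below the minimum granularity. */
    if ((unsigned long long)nr_running * sched_min_granularity_ns > period)
        period = (unsigned long long)nr_running * sched_min_granularity_ns;

    /* Each task's slice is its weight's share of the period. */
    return period * weight / total_weight;
}
```

For example, eight nice-0 tasks would each get 6 ms / 8 = 0.75 ms; a ninth task pushes the period up to 9 × 0.75 ms = 6.75 ms rather than shrinking the slices further.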
Complexity Analysis
Enqueueing and dequeueing a task are O(log n) red-black tree operations, while picking the next task is O(1) thanks to the cached leftmost node. Compared to the O(1) scheduler's constant time for everything, CFS trades these logarithmic tree operations for vastly improved fairness and simplicity.
Many questioned whether O(log n) operations were acceptable. In practice, log(1000) ≈ 10, and modern CPUs execute red-black tree operations in microseconds. The improved fairness and code simplicity far outweigh this theoretical overhead. Most systems have far fewer runnable tasks than sleeping ones.
A production scheduler must handle numerous edge cases that the basic algorithm doesn't address. CFS includes carefully designed mechanisms for each.
New Process Insertion
When a new process becomes runnable, where should its vruntime start? If it started at 0, it would monopolize the CPU until it 'caught up' with existing processes. CFS solves this by initializing a new task's vruntime to min_vruntime, the smallest vruntime currently in the runqueue (plus a small debit when the START_DEBIT feature is enabled). This ensures new tasks get fair treatment without starving existing ones.
Waking from Sleep
Processes that sleep (waiting for I/O, timers, etc.) don't accumulate vruntime during sleep. When they wake, their vruntime might be far behind running processes. If unchanged, they would monopolize the CPU upon waking.
CFS handles this with sleeper fairness: when a task wakes, its vruntime is set to max(old_vruntime, min_vruntime - sched_latency/2). This gives sleepers a slight bonus (they can preempt the current task) without allowing them to accumulate unlimited credit.
```c
/*
 * Sleeper Fairness: Handling tasks waking from sleep
 *
 * The challenge: A task sleeping for 10 seconds would have
 * vruntime 10 seconds behind running tasks. Without adjustment,
 * it could starve other processes for a long time.
 *
 * The solution: Cap the vruntime credit for sleepers.
 */
static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
    u64 vruntime = cfs_rq->min_vruntime;

    /* For newly forked tasks, start at current min_vruntime */
    if (initial && sched_feat(START_DEBIT)) {
        vruntime += sched_vslice(cfs_rq, se);
    }

    /* For waking tasks, give a small credit (half a latency period) */
    if (!initial) {
        /* This credit allows sleepers to preempt the current task */
        if (sched_feat(GENTLE_FAIR_SLEEPERS))
            vruntime -= sysctl_sched_latency >> 1;
    }

    /* Never let vruntime go backward too far */
    se->vruntime = max_vruntime(se->vruntime, vruntime);
}

/*
 * The GENTLE_FAIR_SLEEPERS feature provides a bounded credit:
 * - Sleepers get priority but can't starve other processes
 * - Interactive processes (short bursts, frequent sleeps)
 *   get responsive treatment
 * - The half-latency credit is enough to preempt once but
 *   not enough to run indefinitely
 */
```

Even CFS isn't perfectly fair in all scenarios. CPU-bound tasks running on SMT (Hyper-Threading) cores may receive less actual CPU time than expected due to shared execution resources. CFS addresses this through load accounting adjustments, but perfect fairness across heterogeneous hardware remains challenging.
CFS provides several tunable parameters through /proc/sys/kernel/ that allow administrators to adjust behavior for different workloads. Understanding these parameters is essential for performance optimization.
Key Tunable Parameters
| Parameter | Default | Description | Trade-off |
|---|---|---|---|
| sched_latency_ns | 6,000,000 (6ms) | Target latency period for the runqueue | Lower = better interactivity, more context switches |
| sched_min_granularity_ns | 750,000 (0.75ms) | Minimum time slice any task receives | Lower = fairer with many tasks, more overhead |
| sched_wakeup_granularity_ns | 1,000,000 (1ms) | Minimum vruntime advantage needed to preempt | Lower = faster response, more context switches |
| sched_child_runs_first | 0 (disabled) | Whether forked children run before parent | 1 = better for fork-exec pattern |
| sched_migration_cost_ns | 500,000 (0.5ms) | Assumed cost of migrating task between CPUs | Lower = more migrations, better balance |
| sched_nr_migrate | 32 | Maximum tasks to migrate per load balance | Higher = faster balancing, more migration overhead |
```bash
#!/bin/bash
# CFS Tuning Examples

# View current CFS parameters
cat /proc/sys/kernel/sched_latency_ns
cat /proc/sys/kernel/sched_min_granularity_ns
cat /proc/sys/kernel/sched_wakeup_granularity_ns

# Desktop Optimization: Lower latency for better interactivity
# Smaller time slices = faster response but more context switches
echo 4000000 > /proc/sys/kernel/sched_latency_ns             # 4ms
echo 500000  > /proc/sys/kernel/sched_min_granularity_ns     # 0.5ms
echo 500000  > /proc/sys/kernel/sched_wakeup_granularity_ns  # 0.5ms

# Server Optimization: Higher throughput, fewer context switches
# Larger time slices = less switching overhead, more throughput
echo 10000000 > /proc/sys/kernel/sched_latency_ns            # 10ms
echo 1000000  > /proc/sys/kernel/sched_min_granularity_ns    # 1ms
echo 2000000  > /proc/sys/kernel/sched_wakeup_granularity_ns # 2ms

# HPC/Batch Workloads: Maximum throughput
# Very large slices minimize scheduler overhead
echo 24000000 > /proc/sys/kernel/sched_latency_ns            # 24ms
echo 3000000  > /proc/sys/kernel/sched_min_granularity_ns    # 3ms

# The kernel automatically scales sched_latency based on
# the number of CPUs at boot time:
#   scaled_latency = base_latency * (1 + log2(nr_cpus))
# This ensures reasonable latency on large NUMA systems.
```

Recent Linux kernels have evolved CFS with the EEVDF (Earliest Eligible Virtual Deadline First) enhancements, which improve latency guarantees. Linux 6.6 made EEVDF the default, providing better deadline awareness while maintaining CFS's fairness properties. The fundamental vruntime concept remains, but with added deadline tracking.
CFS's impact extends far beyond kernel internals. It fundamentally shapes the behavior of every Linux system, from Android phones to supercomputers.
Desktop and Workstation Use
On desktop systems, CFS dramatically improved perceived responsiveness compared to the O(1) scheduler. When you're compiling code while playing music and browsing the web, the compiler's threads run continuously, accumulate vruntime quickly, and drift to the right of the tree, while the music player and browser sleep frequently, keep low vruntimes, and promptly preempt the compiler whenever they wake with work to do.
No special 'interactive' detection is needed—fairness naturally prioritizes applications that need short bursts over those that want continuous CPU.
Server and Cloud
In data centers, CFS enables efficient resource sharing: group scheduling extends the same weight-based fairness to control groups, so containers and virtual machines can be assigned proportional CPU shares (via cgroup cpu.shares or cpu.weight), and a noisy tenant is bounded by the same vruntime accounting that governs individual tasks.
CFS has been the default Linux scheduler for over 15 years, running on billions of devices—from $35 Raspberry Pis to million-dollar supercomputers. Its success stems from a key insight: good design principles (fairness, simplicity, mathematical rigor) beat clever heuristics in the long run. This is a lesson that applies far beyond operating systems.
CFS represents a triumph of elegant algorithm design in systems software. By replacing heuristics with mathematics and complex code with principled abstraction, it solved problems that had vexed scheduler designers for decades.
What's Next
In the next page, we'll dive deeper into virtual runtime—the core abstraction that makes CFS work. We'll explore the mathematics of weight-to-vruntime conversion, how nice values map to weights, and the subtleties that ensure fairness even in complex scenarios.
You now understand the architecture and philosophy of the Linux Completely Fair Scheduler—one of the most successful scheduling algorithms in computing history. Next, we'll examine virtual runtime in detail, exploring how CFS translates the abstract concept of 'fairness' into concrete scheduling decisions.