Every time you run a command in your terminal, open an application, or load a web page on a Linux server, an elegant piece of software engineering determines when and for how long your process gets to use the CPU. This software is the Completely Fair Scheduler (CFS)—the default process scheduler in the Linux kernel since October 2007.
CFS represents a fundamental departure from traditional scheduling algorithms. Rather than relying on fixed time slices and priority queues like its predecessors, CFS approaches scheduling as a problem of proportional fairness. Its goal is deceptively simple: give every runnable process a fair share of CPU time, proportional to its weight. But achieving this goal efficiently across thousands of processes, on systems ranging from embedded devices to supercomputers, requires sophisticated algorithms and data structures.
Designed by Ingo Molnár, CFS replaced the O(1) scheduler that had been in use since Linux 2.6.0. While the O(1) scheduler was efficient, it struggled with interactivity and fairness on desktop workloads. CFS solved these problems through a revolutionary approach based on virtual runtime and red-black trees.
By the end of this page, you will understand: (1) The historical context and motivation for CFS, (2) The mathematical foundation of proportional-share scheduling, (3) How CFS models an 'ideal multitasking CPU,' (4) The core algorithm and its complexity guarantees, and (5) Why CFS became the default scheduler for billions of devices worldwide.
To appreciate CFS, we must understand the scheduler it replaced and the problems that motivated its creation.
The O(1) Scheduler (2003-2007)
The O(1) scheduler, introduced by Ingo Molnár in Linux 2.6.0, was a significant achievement. Its name came from its O(1) time complexity for all scheduling operations—regardless of how many processes were runnable, selecting the next process took constant time. It achieved this through:

- Per-CPU runqueues, which avoided contending on a single global lock
- Two arrays of priority lists per runqueue (active and expired), covering 140 priority levels
- A bitmap of non-empty priority lists, so the highest-priority runnable task could be found with a single find-first-bit operation
- Interactivity heuristics that adjusted a task's dynamic priority based on how long it slept
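To see why this is constant time, here is a minimal, hypothetical sketch of the bitmap-based pick (in the spirit of, but not copied from, the kernel code): the cost depends only on the fixed number of priority levels, never on how many tasks are runnable.

```c
/*
 * Illustrative sketch of O(1)-style task selection (not actual kernel code).
 * Each runqueue keeps one FIFO list per priority level plus a bitmap that
 * marks which lists are non-empty; picking the next task is a find-first-bit
 * followed by taking the head of that list.
 */
#include <stddef.h>

#define MAX_PRIO      140
#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct task_list;                       /* opaque per-priority FIFO list */

struct o1_runqueue {
    unsigned long bitmap[(MAX_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG];
    struct task_list *queue[MAX_PRIO];  /* one list per priority level */
};

/* Return the lowest-numbered (highest-priority) set bit, or -1 if none. */
static int find_first_set(const unsigned long *bitmap, int nbits)
{
    for (int i = 0; i < nbits; i++)
        if (bitmap[i / BITS_PER_LONG] & (1UL << (i % BITS_PER_LONG)))
            return i;
    return -1;
}

/* Cost is bounded by MAX_PRIO, a constant: O(1) in the number of tasks. */
static struct task_list *o1_pick_next(struct o1_runqueue *rq)
{
    int prio = find_first_set(rq->bitmap, MAX_PRIO);
    return prio < 0 ? NULL : rq->queue[prio];
}
```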
While elegant for server workloads, the O(1) scheduler had fundamental problems with desktop interactivity.
The Con Kolivas Experiment
Australian anesthesiologist and kernel developer Con Kolivas proposed alternative schedulers (RSDL, SD) that challenged the O(1) approach. His work demonstrated that a fundamentally different approach—based on deadline scheduling and strict fairness—could dramatically improve desktop responsiveness.
Kolivas's ideas influenced Ingo Molnár to create CFS, which took the concept of fairness and implemented it with unprecedented elegance. Molnár described the design philosophy:
"CFS basically models an 'ideal, precise multitasking CPU' on real hardware. The 'virtual runtime' of a task specifies when its next timeslice would start execution on the ideal multi-tasking CPU. In practice, the virtual runtime of a task is the actual runtime normalized to the total number of running tasks."
This single insight—modeling an ideal CPU where every process runs simultaneously with infinitesimal slices—became the foundation of CFS.
Imagine a CPU that could run N processes truly simultaneously, each receiving exactly 1/N of the CPU's power. While physically impossible, CFS approximates this ideal by tracking how much CPU time each process 'deserves' versus how much it has actually received, then always running the process that's most behind.
CFS is built on a mathematically rigorous foundation: proportional-share scheduling (also known as weighted fair queueing in networking). The core idea is that each process should receive CPU time proportional to its weight rather than according to fixed priority levels.
The Fairness Equation
For a system with N runnable processes, each with weight wᵢ, the CPU share for process i should be:
CPU Share(i) = wᵢ / Σwⱼ
where the sum runs over all runnable processes. A process's share therefore depends only on its weight relative to the total: if all weights are equal, each process receives 1/N of the CPU, and doubling one process's weight doubles its share at everyone else's proportional expense.
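As a quick sanity check of the equation, this small standalone C program (illustrative only, not kernel code) computes the share each task would receive for a hypothetical mix of weights drawn from the kernel's nice-to-weight table:

```c
/* Proportional shares from weights: CPU Share(i) = w_i / sum(w_j). */
#include <stdio.h>

int main(void)
{
    /* Hypothetical runnable tasks: nice 0, nice -5, and nice +5,
     * using the kernel's standard weights for those nice values. */
    unsigned long weight[] = { 1024, 3121, 335 };
    int n = sizeof(weight) / sizeof(weight[0]);
    unsigned long total = 0;

    for (int i = 0; i < n; i++)
        total += weight[i];

    for (int i = 0; i < n; i++)
        printf("task %d: weight %5lu -> %5.1f%% of the CPU\n",
               i, weight[i], 100.0 * weight[i] / total);
    return 0;
}
```

With these weights the nice -5 task gets roughly 70% of the CPU, the nice 0 task about 23%, and the nice +5 task about 7%.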
From Shares to Scheduling
The challenge is converting this abstract fairness requirement into concrete scheduling decisions. CFS solves this with a brilliant insight: instead of tracking actual CPU time, track a normalized time called virtual runtime (vruntime).
Virtual runtime increases slower for high-weight processes and faster for low-weight processes. When virtual runtimes are equal, actual CPU time is proportional to weight. CFS simply selects the process with the smallest vruntime—the process that's most 'behind' relative to what it deserves.
```c
/*
 * Conceptual illustration of virtual runtime calculation
 *
 * In CFS, each process has a vruntime that represents its
 * "virtual" elapsed time. The key insight:
 *
 *   vruntime increases proportionally to actual runtime
 *   but inversely to the process's weight.
 */

/* NICE_0_LOAD represents the weight of a nice-0 process (1024) */
#define NICE_0_LOAD 1024

/*
 * delta_exec: actual CPU time used in this scheduling period
 * weight:     process weight (derived from nice value)
 *
 * vruntime delta = (delta_exec * NICE_0_LOAD) / weight
 */
static inline u64 calc_delta_vruntime(u64 delta_exec, unsigned long weight)
{
    /*
     * For a nice 0 process (weight = 1024):
     *   delta_vruntime = delta_exec             (no change)
     *
     * For a nice +19 process (weight = 15, lowest priority):
     *   delta_vruntime = delta_exec * 68.27     (much faster increase)
     *
     * For a nice -20 process (weight = 88761, highest priority):
     *   delta_vruntime = delta_exec * 0.0115    (much slower increase)
     */
    return (delta_exec * NICE_0_LOAD) / weight;
}

/*
 * This elegant formulation ensures:
 * 1. All processes with equal weight have equal vruntime growth
 * 2. Higher-weight processes can run longer before "catching up"
 * 3. Scheduling the lowest-vruntime process first is always fair
 */
```

Why Virtual Runtime Works
Consider two processes, A (weight 1024) and B (weight 2048). For the same amount of actual runtime, A's vruntime advances at the nice-0 rate while B's advances at half that rate, because B's weight is twice NICE_0_LOAD. If both start at vruntime 0, CFS will schedule them so that their vruntimes stay approximately equal at any point, which means B has received roughly twice as much actual CPU time as A: about two thirds of the CPU for B and one third for A.
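The following toy simulation (an illustrative sketch, not kernel code) makes this concrete: at every 1 ms tick it runs whichever task has the smaller vruntime and updates that task's vruntime using the weight-normalized formula above.

```c
/* Toy CFS simulation: always run the task with the smallest vruntime. */
#include <stdio.h>

#define NICE_0_LOAD 1024
#define TICKS       9000          /* simulate 9000 one-millisecond ticks */

struct task {
    const char   *name;
    unsigned long weight;
    double        vruntime;       /* virtual runtime, in ms */
    double        cpu_ms;         /* actual CPU time received, in ms */
};

int main(void)
{
    struct task a = { "A", 1024, 0.0, 0.0 };
    struct task b = { "B", 2048, 0.0, 0.0 };

    for (int tick = 0; tick < TICKS; tick++) {
        /* Pick the task that is furthest "behind" (smallest vruntime). */
        struct task *t = (a.vruntime <= b.vruntime) ? &a : &b;
        t->cpu_ms   += 1.0;
        t->vruntime += 1.0 * NICE_0_LOAD / t->weight;
    }

    printf("%s: %6.0f ms (%4.1f%%), vruntime %.1f\n",
           a.name, a.cpu_ms, 100.0 * a.cpu_ms / TICKS, a.vruntime);
    printf("%s: %6.0f ms (%4.1f%%), vruntime %.1f\n",
           b.name, b.cpu_ms, 100.0 * b.cpu_ms / TICKS, b.vruntime);
    return 0;
}
```

Running this yields roughly 33% of the CPU for A and 67% for B, with both vruntimes ending near 3000 ms—exactly the behavior the fairness equation predicts.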
This mathematical elegance eliminates the heuristics, magic numbers, and special cases that plagued the O(1) scheduler. Fairness emerges naturally from a single, principled algorithm.
By normalizing actual runtime to virtual runtime, CFS converts a complex multi-objective optimization problem (balance fairness, priorities, throughput) into a simple single-objective problem: always run the process with the smallest vruntime. This is a textbook example of elegant algorithm design.
CFS operates through a remarkably simple loop, made efficient by sophisticated data structures. Let's trace through the algorithm step by step.
Core Data Structures
Each CPU maintains a runqueue (struct rq) containing a CFS runqueue (struct cfs_rq). The CFS runqueue's most important component is a red-black tree ordered by vruntime. Every runnable process is a node in this tree, keyed by its vruntime.
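A heavily simplified sketch of these structures is shown below. The field names loosely follow the kernel's struct sched_entity and struct cfs_rq, but nearly all members (load tracking, group scheduling, statistics) are omitted, and the real kernel wraps the tree root and cached leftmost node together in an rb_root_cached.

```c
/* Simplified view of the per-CPU CFS data structures (illustrative only). */

struct rb_node {                          /* red-black tree node */
    struct rb_node *parent, *left, *right;
    int             color;
};

struct sched_entity {
    struct rb_node      run_node;         /* node in the cfs_rq's rb-tree  */
    unsigned long       weight;           /* derived from the nice value   */
    unsigned long long  vruntime;         /* key that orders the tree      */
    unsigned long long  exec_start;       /* timestamp of last accounting  */
    unsigned long long  sum_exec_runtime; /* total actual CPU time used    */
    int                 on_rq;            /* currently enqueued?           */
};

struct cfs_rq {
    struct rb_node      *tasks_timeline;  /* root of the rb-tree           */
    struct rb_node      *rb_leftmost;     /* cached smallest-vruntime node */
    struct sched_entity *curr;            /* entity currently on the CPU   */
    unsigned long long   min_vruntime;    /* monotonic floor for placement */
    unsigned int         nr_running;      /* number of runnable entities   */
};
```

Caching the leftmost node is what makes picking the next task O(1), a detail that matters for the complexity analysis below.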
The Scheduling Loop
On each scheduler tick (and on events such as wakeups and yields), CFS repeats a simple sequence:

1. update_curr() measures how long the current task has run since the last update (delta_exec).
2. It converts this to virtual time as delta_vruntime = delta_exec * (NICE_0_LOAD / weight) and adds this to the task's current vruntime.
3. The runqueue's min_vruntime is advanced to track the smallest vruntime seen.
4. If another task's vruntime is now smaller than the current task's by more than the wakeup granularity, or the current task has exhausted its slice, the current task is preempted and re-inserted into the tree.
5. pick_next_task_fair() selects the leftmost node of the red-black tree—the task with the smallest vruntime—to run next.
```c
/*
 * CFS Main Scheduling Loop (Simplified)
 *
 * This pseudocode illustrates the core CFS algorithm.
 * The actual kernel code is more complex due to
 * handling of SMP, CPU affinity, load balancing, etc.
 */

struct task_struct *pick_next_task_fair(struct rq *rq)
{
    struct cfs_rq *cfs_rq = &rq->cfs;
    struct sched_entity *se;

    /* If no runnable tasks, return idle task */
    if (!cfs_rq->nr_running)
        return idle_task(rq);

    /* Pick the task with smallest vruntime (leftmost in tree) */
    se = __pick_first_entity(cfs_rq);

    /* Return the task_struct containing this sched_entity */
    return task_of(se);
}

void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_clock_task(rq_of(cfs_rq));
    u64 delta_exec;

    if (!curr)
        return;

    /* Calculate how long the task has been running */
    delta_exec = now - curr->exec_start;
    curr->exec_start = now;

    /* Update task's statistics */
    curr->sum_exec_runtime += delta_exec;

    /* Update virtual runtime based on weight */
    curr->vruntime += calc_delta_fair(delta_exec, curr);

    /* Update the minimum vruntime for the runqueue */
    update_min_vruntime(cfs_rq);
}

void enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    /* If task was sleeping, give it some vruntime credit */
    if (se->on_rq == 0) {
        /* Place near the minimum vruntime to prevent starvation */
        se->vruntime = max(se->vruntime,
                           cfs_rq->min_vruntime - sched_latency / 2);
    }

    /* Update runqueue load statistics */
    account_entity_enqueue(cfs_rq, se);

    /* Insert into the red-black tree: O(log n) */
    __enqueue_entity(cfs_rq, se);

    cfs_rq->nr_running++;
}
```

Time Slice Calculation
Unlike the O(1) scheduler, CFS doesn't assign fixed time slices. Instead, it calculates a target latency (typically 6ms for <= 8 tasks) and divides it among runnable tasks proportional to their weights:
time_slice(i) = target_latency × (wᵢ / Σwⱼ)
This means that each runnable task receives a slice proportional to its weight within every latency period: with N equal-weight tasks, each gets target_latency / N. When so many tasks are runnable that a slice would drop below the minimum granularity (0.75 ms by default), the period itself is stretched instead, trading latency for lower context-switch overhead.
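A simplified sketch of that calculation, loosely modeled on the kernel's __sched_period() and sched_slice() logic (the function name and parameters here are illustrative), might look like this:

```c
/* Illustrative time-slice calculation: divide the latency period by weight. */

#define NSEC_PER_MSEC 1000000ULL

static unsigned long long sched_latency_ns         = 6 * NSEC_PER_MSEC; /* 6 ms    */
static unsigned long long sched_min_granularity_ns = 750000ULL;         /* 0.75 ms */

static unsigned long long cfs_time_slice(unsigned long weight,
                                         unsigned long total_weight,
                                         unsigned int nr_running)
{
    unsigned long long period = sched_latency_ns;

    /* With many runnable tasks, stretch the period so that no slice
     * falls below the minimum granularity. */
    if ((unsigned long long)nr_running * sched_min_granularity_ns > period)
        period = (unsigned long long)nr_running * sched_min_granularity_ns;

    /* Each task's slice is its weight's share of the period. */
    return period * weight / total_weight;
}
```

For example, eight nice-0 tasks would each get 6 ms / 8 = 0.75 ms; a ninth task pushes the period up to 9 × 0.75 ms = 6.75 ms rather than shrinking the slices further.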
Complexity Analysis
Enqueueing and dequeueing a task are O(log n) red-black tree operations, while picking the next task is O(1) thanks to the cached leftmost node. Compared to the O(1) scheduler's constant time for everything, CFS trades these logarithmic tree operations for vastly improved fairness and simplicity.
Many questioned whether O(log n) operations were acceptable. In practice, log(1000) ≈ 10, and modern CPUs execute red-black tree operations in microseconds. The improved fairness and code simplicity far outweigh this theoretical overhead. Most systems have far fewer runnable tasks than sleeping ones.
A production scheduler must handle numerous edge cases that the basic algorithm doesn't address. CFS includes carefully designed mechanisms for each.
New Process Insertion
When a new process becomes runnable, where should its vruntime start? If it started at 0, it would monopolize the CPU until it 'caught up' with existing processes. CFS solves this by initializing a new task's vruntime to min_vruntime, the smallest vruntime currently in the runqueue (plus a small debit when the START_DEBIT feature is enabled). This ensures new tasks get fair treatment without starving existing ones.
Waking from Sleep
Processes that sleep (waiting for I/O, timers, etc.) don't accumulate vruntime during sleep. When they wake, their vruntime might be far behind running processes. If unchanged, they would monopolize the CPU upon waking.
CFS handles this with sleeper fairness: when a task wakes, its vruntime is set to max(old_vruntime, min_vruntime - sched_latency/2). This gives sleepers a slight bonus (they can preempt the current task) without allowing them to accumulate unlimited credit.
```c
/*
 * Sleeper Fairness: Handling tasks waking from sleep
 *
 * The challenge: A task sleeping for 10 seconds would have
 * vruntime 10 seconds behind running tasks. Without adjustment,
 * it could starve other processes for a long time.
 *
 * The solution: Cap the vruntime credit for sleepers.
 */
static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
    u64 vruntime = cfs_rq->min_vruntime;

    /* For newly forked tasks, start at current min_vruntime */
    if (initial && sched_feat(START_DEBIT)) {
        vruntime += sched_vslice(cfs_rq, se);
    }

    /* For waking tasks, give a small credit (half a latency period) */
    if (!initial) {
        /* This credit allows sleepers to preempt the current task */
        if (sched_feat(GENTLE_FAIR_SLEEPERS))
            vruntime -= sysctl_sched_latency >> 1;
    }

    /* Never let vruntime go backward too far */
    se->vruntime = max_vruntime(se->vruntime, vruntime);
}

/*
 * The GENTLE_FAIR_SLEEPERS feature provides a bounded credit:
 * - Sleepers get priority but can't starve other processes
 * - Interactive processes (short bursts, frequent sleeps)
 *   get responsive treatment
 * - The half-latency credit is enough to preempt once but
 *   not enough to run indefinitely
 */
```

Even CFS isn't perfectly fair in all scenarios. CPU-bound tasks running on SMT (Hyper-Threading) cores may receive less actual CPU time than expected due to shared execution resources. CFS addresses this through load accounting adjustments, but perfect fairness across heterogeneous hardware remains challenging.
CFS provides several tunable parameters through /proc/sys/kernel/ that allow administrators to adjust behavior for different workloads. Understanding these parameters is essential for performance optimization.
Key Tunable Parameters
| Parameter | Default | Description | Trade-off |
|---|---|---|---|
| sched_latency_ns | 6,000,000 (6ms) | Target latency period for the runqueue | Lower = better interactivity, more context switches |
| sched_min_granularity_ns | 750,000 (0.75ms) | Minimum time slice any task receives | Lower = fairer with many tasks, more overhead |
| sched_wakeup_granularity_ns | 1,000,000 (1ms) | Minimum vruntime advantage needed to preempt | Lower = faster response, more context switches |
| sched_child_runs_first | 0 (disabled) | Whether forked children run before parent | 1 = better for fork-exec pattern |
| sched_migration_cost_ns | 500,000 (0.5ms) | Assumed cost of migrating task between CPUs | Lower = more migrations, better balance |
| sched_nr_migrate | 32 | Maximum tasks to migrate per load balance | Higher = faster balancing, more migration overhead |
```bash
#!/bin/bash
# CFS Tuning Examples

# View current CFS parameters
cat /proc/sys/kernel/sched_latency_ns
cat /proc/sys/kernel/sched_min_granularity_ns
cat /proc/sys/kernel/sched_wakeup_granularity_ns

# Desktop Optimization: Lower latency for better interactivity
# Smaller time slices = faster response but more context switches
echo 4000000 > /proc/sys/kernel/sched_latency_ns             # 4ms
echo 500000  > /proc/sys/kernel/sched_min_granularity_ns     # 0.5ms
echo 500000  > /proc/sys/kernel/sched_wakeup_granularity_ns  # 0.5ms

# Server Optimization: Higher throughput, fewer context switches
# Larger time slices = less switching overhead, more throughput
echo 10000000 > /proc/sys/kernel/sched_latency_ns            # 10ms
echo 1000000  > /proc/sys/kernel/sched_min_granularity_ns    # 1ms
echo 2000000  > /proc/sys/kernel/sched_wakeup_granularity_ns # 2ms

# HPC/Batch Workloads: Maximum throughput
# Very large slices minimize scheduler overhead
echo 24000000 > /proc/sys/kernel/sched_latency_ns            # 24ms
echo 3000000  > /proc/sys/kernel/sched_min_granularity_ns    # 3ms

# The kernel automatically scales sched_latency based on
# the number of CPUs at boot time:
#   scaled_latency = base_latency * (1 + log2(nr_cpus))
# This ensures reasonable latency on large NUMA systems.
```

Recent Linux kernels have evolved CFS with the EEVDF (Earliest Eligible Virtual Deadline First) enhancements, which improve latency guarantees. Linux 6.6 made EEVDF the default, providing better deadline awareness while maintaining CFS's fairness properties. The fundamental vruntime concept remains, but with added deadline tracking.
CFS's impact extends far beyond kernel internals. It fundamentally shapes the behavior of every Linux system, from Android phones to supercomputers.
Desktop and Workstation Use
On desktop systems, CFS dramatically improved perceived responsiveness compared to the O(1) scheduler. When you're compiling code while playing music and browsing the web, the compiler's threads run continuously, accumulate vruntime quickly, and drift to the right of the tree, while the music player and browser sleep frequently, keep low vruntimes, and promptly preempt the compiler whenever they wake with work to do.
No special 'interactive' detection is needed—fairness naturally prioritizes applications that need short bursts over those that want continuous CPU.
Server and Cloud
In data centers, CFS enables efficient resource sharing: group scheduling extends the same weight-based fairness to control groups, so containers and virtual machines can be assigned proportional CPU shares (via cgroup cpu.shares or cpu.weight), and a noisy tenant is bounded by the same vruntime accounting that governs individual tasks.
CFS has been the default Linux scheduler for over 15 years, running on billions of devices—from $35 Raspberry Pis to million-dollar supercomputers. Its success stems from a key insight: good design principles (fairness, simplicity, mathematical rigor) beat clever heuristics in the long run. This is a lesson that applies far beyond operating systems.
CFS represents a triumph of elegant algorithm design in systems software. By replacing heuristics with mathematics and complex code with principled abstraction, it solved problems that had vexed scheduler designers for decades.
What's Next
In the next page, we'll dive deeper into virtual runtime—the core abstraction that makes CFS work. We'll explore the mathematics of weight-to-vruntime conversion, how nice values map to weights, and the subtleties that ensure fairness even in complex scenarios.
You now understand the architecture and philosophy of the Linux Completely Fair Scheduler—one of the most successful scheduling algorithms in computing history. Next, we'll examine virtual runtime in detail, exploring how CFS translates the abstract concept of 'fairness' into concrete scheduling decisions.