The Completely Fair Scheduler (CFS), introduced in Linux 2.6.23 (2007) by Ingo Molnár, revolutionized Linux scheduling. Unlike its predecessor (the O(1) scheduler with complex heuristics), CFS achieves fairness through elegant mathematics: it models an ideal processor that could run all tasks simultaneously, each receiving an infinitesimal slice of CPU proportional to its weight.
In reality, only one task runs at a time, but CFS tracks the idealized runtime each task should have received and always runs the task furthest behind. The result is remarkable fairness across diverse workloads without complex tuning parameters.
By the end of this page, you will understand virtual runtime, the red-black tree implementation, how nice values map to weights, the concept of targeted latency, and how CFS handles edge cases like new tasks and sleepers.
CFS's central concept is virtual runtime (vruntime)—a measure of how much "fair" time a task has consumed. Unlike wall-clock time, vruntime is weighted by task priority.
```c
/*
 * Virtual runtime formula:
 *
 *     vruntime += delta_exec * (NICE_0_LOAD / weight)
 *
 * Where:
 *   - delta_exec: actual wall-clock time executed
 *   - NICE_0_LOAD: weight of a nice-0 task (1024)
 *   - weight: this task's weight (based on nice value)
 *
 * Higher weight = slower vruntime growth = more CPU time
 * Lower weight  = faster vruntime growth = less CPU time
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_clock_task(rq_of(cfs_rq));
    u64 delta_exec;

    /* How long did this task run? */
    delta_exec = now - curr->exec_start;
    curr->exec_start = now;

    /* Update total execution time */
    curr->sum_exec_runtime += delta_exec;

    /* Calculate weighted vruntime delta */
    curr->vruntime += calc_delta_fair(delta_exec, curr);

    /* Update min_vruntime for the runqueue */
    update_min_vruntime(cfs_rq);
}

/* calc_delta_fair adjusts for weight */
static u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
    if (unlikely(se->load.weight != NICE_0_LOAD))
        /* Weighted calculation */
        delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

    return delta;
}
```

| Nice Value | Weight | vruntime Growth Rate | Relative CPU Share |
|---|---|---|---|
| -20 (highest) | 88761 | 0.012x | ~87x nice 0 |
| -10 | 9548 | 0.11x | ~9.3x nice 0 |
| 0 (default) | 1024 | 1.0x | 1.0x (baseline) |
| +10 | 110 | 9.3x | ~0.1x nice 0 |
| +19 (lowest) | 15 | 68x | ~0.015x nice 0 |
CFS always runs the task with the lowest vruntime. High-weight tasks accumulate vruntime slowly, so they stay at the 'front of the line' longer and get more CPU time. Low-weight tasks accumulate vruntime quickly, fall behind in the ordering, and get scheduled less often. Perfect mathematical fairness!
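To see that ordering in action, here is a small user-space sketch (not kernel code; the 1ms tick, the loop length, and the weights 1024 vs 335 are illustrative assumptions) that repeatedly picks the task with the lowest vruntime and applies the weighted update from update_curr:

```c
/*
 * Hypothetical user-space simulation of CFS ordering (not kernel code):
 * two always-runnable tasks, the one with the lowest vruntime runs each
 * tick, and vruntime grows inversely with weight.
 */
#include <stdio.h>
#include <stdint.h>

#define NICE_0_LOAD 1024ULL

struct task {
    const char *name;
    uint64_t weight;
    uint64_t vruntime;
    uint64_t picks;      /* how many 1ms ticks this task ran */
};

int main(void)
{
    struct task tasks[2] = {
        { "nice 0 (weight 1024)", 1024, 0, 0 },
        { "nice 5 (weight  335)",  335, 0, 0 },
    };
    const uint64_t tick_ns = 1000000;   /* pretend each pick runs for 1ms */

    for (int i = 0; i < 10000; i++) {
        /* "Pick next": the task with the lowest vruntime runs */
        struct task *t = (tasks[0].vruntime <= tasks[1].vruntime)
                             ? &tasks[0] : &tasks[1];

        /* update_curr: vruntime grows inversely with weight */
        t->vruntime += tick_ns * NICE_0_LOAD / t->weight;
        t->picks++;
    }

    for (int i = 0; i < 2; i++)
        printf("%s: %llu ticks (%.1f%% of CPU)\n", tasks[i].name,
               (unsigned long long)tasks[i].picks,
               100.0 * tasks[i].picks / 10000);
    return 0;
}
```

After 10,000 ticks the split converges to roughly 75% / 25%, which is exactly the 1024 : 335 weight ratio used in the worked example later on this page.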
CFS organizes tasks in a red-black tree (self-balancing BST) ordered by vruntime. This provides O(log n) insertion and deletion, with O(1) access to the leftmost (minimum vruntime) node.
```c
/* The CFS runqueue */
struct cfs_rq {
    /* Load tracking */
    struct load_weight load;        /* Total weight of runnable tasks */
    unsigned int nr_running;        /* Number of runnable tasks */

    /* The red-black tree of runnable entities */
    struct rb_root_cached tasks_timeline;   /* Tree with cached leftmost node */

    /* Currently running entity */
    struct sched_entity *curr;
    struct sched_entity *next;      /* Hint for next task */
    struct sched_entity *skip;      /* Hint to skip this task */

    /* Minimum vruntime - our "timeline" position */
    u64 min_vruntime;

#ifdef CONFIG_FAIR_GROUP_SCHED
    /* Group scheduling support */
    unsigned long tg_load_avg_contrib;  /* Contribution to the group's load average */
    struct task_group *tg;              /* Owner task group */
#endif
};

/* Each task has a scheduling entity */
struct sched_entity {
    struct load_weight load;        /* This entity's weight */
    struct rb_node run_node;        /* Position in RB-tree */
    unsigned int on_rq;             /* Currently on runqueue? */

    u64 vruntime;                   /* Virtual runtime */
    u64 exec_start;                 /* Start of current run */
    u64 sum_exec_runtime;           /* Total time executed */

    /* Group scheduling hierarchy */
    struct sched_entity *parent;
    struct cfs_rq *cfs_rq;          /* Runqueue I belong to */
    struct cfs_rq *my_q;            /* Runqueue I own (groups only) */
};
```
```c
/* Insert entity into the RB tree */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    bool leftmost = true;

    /* Walk tree to find insertion point */
    while (*link) {
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);

        if (entity_before(se, entry)) {
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = false;   /* Not leftmost anymore */
        }
    }

    /* Insert and rebalance */
    rb_link_node(&se->run_node, parent, link);
    rb_insert_color_cached(&se->run_node, &cfs_rq->tasks_timeline, leftmost);
}

/* Pick next task: O(1) because leftmost is cached! */
static struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

    if (!left)
        return NULL;

    return rb_entry(left, struct sched_entity, run_node);
}
```

The rb_root_cached structure caches the leftmost node pointer, making pick_next_task O(1) instead of O(log n). This matters because pick_next_task runs on every context switch, potentially thousands of times per second.
CFS maps nice values (-20 to +19) to weights using a carefully designed table. Each nice level represents approximately a 10% difference in CPU allocation.
```c
/*
 * Nice level to weight mapping
 * Each step is ~1.25x the next (~10% difference in CPU share)
 * Nice 0 = weight 1024 (the baseline)
 */
const int sched_prio_to_weight[40] = {
    /* -20 */ 88761, 71755, 56483, 46273, 36291,
    /* -15 */ 29154, 23254, 18705, 14949, 11916,
    /* -10 */  9548,  7620,  6100,  4904,  3906,
    /*  -5 */  3121,  2501,  1991,  1586,  1277,
    /*   0 */  1024,   820,   655,   526,   423,
    /*   5 */   335,   272,   215,   172,   137,
    /*  10 */   110,    87,    70,    56,    45,
    /*  15 */    36,    29,    23,    18,    15,
};

/*
 * Example: Two tasks competing for CPU
 *   Task A: nice 0 -> weight 1024
 *   Task B: nice 5 -> weight 335
 *
 * Total weight: 1024 + 335 = 1359
 *   Task A gets: 1024/1359 = 75.4% of CPU
 *   Task B gets:  335/1359 = 24.6% of CPU
 *
 * Ratio: 75.4/24.6 ≈ 3.06 (matches 1024/335 ≈ 3.06)
 */
```

The weight table is designed so that a single nice level difference changes the CPU allocation by roughly 10%. Two tasks at nice 0 and nice 1 split the CPU approximately 55% / 45% (1024 / (1024 + 820) ≈ 55.5%), so the nice 1 task ends up with about 10% less CPU than it would have had at nice 0. This makes nice values intuitive for users: 'bump priority by 1' has a consistent, predictable effect.
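To check the '10% per nice level' claim numerically, here is a hypothetical user-space helper (not kernel code) that copies the weight table above and prints how two CPU-bound tasks one nice level apart would split the CPU:

```c
/*
 * Hypothetical helper: for each pair of adjacent nice levels, print how
 * two otherwise-identical CPU-bound tasks split the CPU. Weights are
 * copied from sched_prio_to_weight above.
 */
#include <stdio.h>

static const int prio_to_weight[40] = {
    88761, 71755, 56483, 46273, 36291,
    29154, 23254, 18705, 14949, 11916,
     9548,  7620,  6100,  4904,  3906,
     3121,  2501,  1991,  1586,  1277,
     1024,   820,   655,   526,   423,
      335,   272,   215,   172,   137,
      110,    87,    70,    56,    45,
       36,    29,    23,    18,    15,
};

int main(void)
{
    for (int nice = -20; nice < 19; nice++) {
        int w_hi = prio_to_weight[nice + 20];   /* this nice level */
        int w_lo = prio_to_weight[nice + 21];   /* one level nicer */
        double total = w_hi + w_lo;

        printf("nice %3d vs %3d: %.1f%% / %.1f%%\n",
               nice, nice + 1,
               100.0 * w_hi / total, 100.0 * w_lo / total);
    }
    return 0;
}
```

Every adjacent pair lands near 55% / 45%, which is where the 'one nice level is worth about 10% of CPU' rule of thumb comes from.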
While CFS doesn't use fixed time slices, it does maintain a targeted latency—the time within which every runnable task should get at least one chance to run.
```c
/* Tunable parameters (in /proc/sys/kernel) */

/* Target latency: Max time before every task runs once */
unsigned int sched_latency = 6000000;          /* 6ms default */

/* Minimum granularity: Minimum timeslice per task */
unsigned int sched_min_granularity = 750000;   /* 0.75ms */

/* When nr_running exceeds this, use min_granularity */
unsigned int sched_nr_latency = 8;             /* sched_latency / sched_min_granularity */

/*
 * Ideal runtime for each task is:
 *
 *     ideal_runtime = sched_latency * (task_weight / total_weight)
 *
 * But never less than sched_min_granularity.
 *
 * If nr_running > 8, sched_latency expands:
 *     effective_latency = nr_running * sched_min_granularity
 */

static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    u64 slice = __sched_period(cfs_rq->nr_running);

    /* Adjust for task's weight relative to total */
    slice = __calc_delta(slice, se->load.weight, &cfs_rq->load);

    return max(slice, (u64)sched_min_granularity);
}
```

| Tasks Running | Effective Latency | Avg Slice per Task |
|---|---|---|
| 1 | 6ms | 6ms (runs alone) |
| 4 | 6ms | 1.5ms each |
| 8 | 6ms | 0.75ms each (min granularity) |
| 16 | 12ms | 0.75ms each |
| 100 | 75ms | 0.75ms each |
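sched_slice() above calls __sched_period() without showing it. As a rough sketch of its behavior (following the comment above and reusing the tunable names from that snippet, not the exact kernel source):

```c
/*
 * Sketch of __sched_period(): the scheduling period is the target
 * latency, but once more than sched_nr_latency tasks are runnable it
 * stretches so that every task still gets at least
 * sched_min_granularity.
 */
static u64 __sched_period(unsigned long nr_running)
{
    if (nr_running > sched_nr_latency)
        return (u64)nr_running * sched_min_granularity;

    return sched_latency;
}
```

With 16 runnable tasks this gives 16 × 0.75ms = 12ms, matching the table above.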
CFS must carefully handle tasks that join or rejoin the runqueue. Without proper initialization, new or waking tasks could game the scheduler.
```c
/* New task vruntime initialization */
static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
    u64 vruntime = cfs_rq->min_vruntime;

    if (initial) {
        /* New task: place it slightly after current min_vruntime */
        vruntime += sched_vslice(cfs_rq, se);   /* One slice's worth of vruntime */
    }

    /* Waking sleeper: Limit credit for sleeping */
    if (!initial) {
        /* Don't give too much credit for long sleeps */
        u64 thresh = sched_latency;

        if (sched_feat(GENTLE_FAIR_SLEEPERS))
            thresh >>= 1;   /* Half a latency period max credit */

        vruntime -= thresh;
    }

    /* Ensure the task never gains time by being placed backward */
    se->vruntime = max_vruntime(se->vruntime, vruntime);
}

/*
 * Why this matters:
 *
 * 1. NEW TASKS: If they started at vruntime=0, they'd run
 *    for a very long time before catching up. Instead, we
 *    place them near min_vruntime so they integrate fairly.
 *
 * 2. SLEEPERS: A task sleeping for an hour has very old vruntime.
 *    If we let it use that, it would dominate CPU on wake.
 *    We give limited "credit" for sleeping - enough to be
 *    responsive, not enough to starve other tasks.
 */
```

The sleeper credit mechanism is why interactive programs feel responsive. When you click a button in an idle GUI app, the waking task gets a small vruntime bonus, letting it run quickly. But the bonus is limited: you can't game the system by sleeping artificially.
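A quick user-space illustration of the placement rules (all numbers are made-up assumptions for the example: the runqueue's min_vruntime, the new task's vslice, and the sleeper's ancient vruntime are not real kernel values):

```c
/*
 * Hypothetical illustration: where would place_entity() put a brand-new
 * task vs. a task waking after a long sleep?
 */
#include <stdio.h>
#include <stdint.h>

#define MAX(a, b) ((a) > (b) ? (a) : (b))

int main(void)
{
    uint64_t min_vruntime  = 5000000000ULL;  /* 5s of vruntime on this runqueue */
    uint64_t sched_latency = 6000000ULL;     /* 6ms target latency */
    uint64_t vslice        = 3000000ULL;     /* assume ~3ms vslice for the new task */

    /* New task: placed one slice past min_vruntime, so it waits its turn */
    uint64_t new_task = min_vruntime + vslice;

    /* Sleeper: its old vruntime is ancient (it slept an hour), but the
     * wake-up credit is capped at half a latency period */
    uint64_t old_vruntime = 1000000000ULL;   /* far behind min_vruntime */
    uint64_t placed  = min_vruntime - (sched_latency >> 1);
    uint64_t sleeper = MAX(old_vruntime, placed);

    printf("min_vruntime: %llu\n", (unsigned long long)min_vruntime);
    printf("new task    : %llu (runs soon, but not immediately)\n",
           (unsigned long long)new_task);
    printf("sleeper     : %llu (only ~3ms of credit, not an hour)\n",
           (unsigned long long)sleeper);
    return 0;
}
```

The hour-long sleeper wakes only about 3ms 'ahead' of the pack, so it runs promptly but cannot monopolize the CPU.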
CFS integrates with cgroups to provide hierarchical CPU control. Task groups compete for CPU at each hierarchy level, with their CPU share distributed among their members.
```c
/*
 * Group scheduling creates a hierarchy of sched_entities.
 *
 * Consider this cgroup hierarchy:
 *
 *   root (cpu.weight = implicit 100)
 *   ├── groupA (cpu.weight = 200)
 *   │   ├── task1
 *   │   └── task2
 *   └── groupB (cpu.weight = 100)
 *       └── task3
 *
 * The CFS structure mirrors this:
 *
 *   Root CFS runqueue:
 *   ├── sched_entity for groupA (weight 200)
 *   └── sched_entity for groupB (weight 100)
 *
 *   GroupA's CFS runqueue:
 *   ├── sched_entity for task1
 *   └── sched_entity for task2
 *
 *   GroupB's CFS runqueue:
 *   └── sched_entity for task3
 *
 * Result:
 *   - GroupA gets ~67% of CPU (200/300)
 *   - GroupB gets ~33% of CPU (100/300)
 *   - task1 and task2 split GroupA's ~67% evenly
 *   - task3 gets all of GroupB's ~33%
 */

/* How group scheduling works */
struct task_group {
    struct sched_entity **se;   /* Per-CPU scheduling entities */
    struct cfs_rq **cfs_rq;     /* Per-CPU runqueues owned by this group */
    unsigned long shares;       /* cpu.weight value */
    /* ... */
};
```

Docker's --cpu-shares option maps to the cgroup CPU weight (cpu.shares under cgroup v1, converted to cpu.weight under cgroup v2), which sets the scheduling entity's weight for group scheduling. A container with 2048 shares gets twice the CPU of one with 1024 when both are competing: exactly how nice values work, but at the container level!
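As a sanity check on the percentages above, a tiny user-space sketch (weights and layout assumed from the example, not read from a real cgroup tree) can multiply each task's share down the hierarchy:

```c
/*
 * Hypothetical calculation: a task's effective CPU share is its share of
 * its group's runqueue multiplied by the group's share at the next level
 * up, all the way to the root.
 */
#include <stdio.h>

int main(void)
{
    /* Level 1: groups competing on the root runqueue */
    double groupA = 200.0 / (200.0 + 100.0);   /* ~0.667 */
    double groupB = 100.0 / (200.0 + 100.0);   /* ~0.333 */

    /* Level 2: tasks competing inside their group's runqueue */
    double task1 = groupA * (1024.0 / (1024.0 + 1024.0));  /* ~33.3% */
    double task2 = groupA * (1024.0 / (1024.0 + 1024.0));  /* ~33.3% */
    double task3 = groupB * 1.0;                           /* ~33.3% */

    printf("task1: %.1f%%  task2: %.1f%%  task3: %.1f%%\n",
           task1 * 100, task2 * 100, task3 * 100);
    return 0;
}
```

In this particular tree all three tasks end up near 33%, because task1 and task2 halve groupA's two-thirds share.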
In 2023, Linux 6.6 introduced EEVDF (Earliest Eligible Virtual Deadline First) as a refinement to CFS. EEVDF maintains CFS's vruntime-based fairness but adds deadline tracking for improved latency control.
```c
/*
 * EEVDF adds a virtual deadline to each entity:
 *
 *     deadline = eligible_time + request_size / weight
 *
 * Where eligible_time is when the entity is allowed to run
 * based on its vruntime vs system virtual time.
 *
 * Selection: Pick the eligible entity with earliest deadline.
 *
 * Benefits:
 *   1. Better latency bounds than pure CFS
 *   2. More predictable scheduling for latency-sensitive tasks
 *   3. Still maintains proportional fairness over time
 */

struct sched_entity {
    /* Existing CFS fields */
    u64 vruntime;

    /* EEVDF additions */
    u64 deadline;       /* Virtual deadline */
    u64 min_deadline;   /* Min deadline in subtree (for tree augmentation) */
    s64 vlag;           /* Lag: how far behind fair share */
    u64 slice;          /* Request size */
};

/*
 * EEVDF Task Selection:
 *   1. Find all "eligible" tasks (vruntime <= system virtual time)
 *   2. Among eligible, pick earliest deadline
 *   3. Tree is augmented with min_deadline for efficient search
 */
```

EEVDF is a refinement, not a replacement. It uses the same red-black tree, weight system, and cgroups integration. From userspace, the interface remains unchanged: existing nice values and cpu.weight settings continue to work identically.
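The following is a simplified user-space model of that selection rule (an assumption-laden sketch: the real kernel derives eligibility from vlag and walks the min_deadline-augmented RB-tree rather than scanning an array):

```c
/*
 * Simplified EEVDF selection model: an entity is eligible when it is not
 * ahead of the queue's weighted-average virtual time, and among eligible
 * entities the earliest virtual deadline wins.
 */
#include <stdio.h>
#include <stdint.h>

struct entity {
    const char *name;
    uint64_t weight;
    uint64_t vruntime;
    uint64_t deadline;   /* virtual deadline = vruntime + vslice */
};

/* Weighted-average vruntime stands in for the "system virtual time" */
static double avg_vruntime(const struct entity *e, int n)
{
    double sum = 0, w = 0;

    for (int i = 0; i < n; i++) {
        sum += (double)e[i].vruntime * e[i].weight;
        w   += e[i].weight;
    }
    return sum / w;
}

static const struct entity *pick_eevdf(const struct entity *e, int n)
{
    double V = avg_vruntime(e, n);
    const struct entity *best = NULL;

    for (int i = 0; i < n; i++) {
        if ((double)e[i].vruntime > V)       /* ahead of fair share: not eligible */
            continue;
        if (!best || e[i].deadline < best->deadline)
            best = &e[i];                    /* earliest deadline so far */
    }
    return best;
}

int main(void)
{
    struct entity tasks[] = {
        { "A", 1024, 100, 103 },   /* eligible, deadline 103 */
        { "B", 1024,  90,  96 },   /* eligible, earliest deadline */
        { "C", 1024, 120, 123 },   /* ahead of fair share: not eligible */
    };
    const struct entity *next = pick_eevdf(tasks, 3);

    printf("EEVDF picks: %s\n", next ? next->name : "(none)");
    return 0;
}
```

Task C has run ahead of its fair share, so it is not eligible even though it sits in the queue; among A and B, B's earlier virtual deadline wins.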
You now understand CFS—the scheduler that runs billions of Linux tasks worldwide. Next, we'll explore process states and the lifecycle of tasks from creation to termination.