In the world of CFS, all scheduling decisions reduce to a single question: which task has the smallest virtual runtime? This elegant simplicity is made possible by virtual runtime (vruntime)—the abstraction that transforms the complex problem of fair CPU allocation into a simple comparison operation.
Think of virtual runtime as a normalized clock. While actual CPU time ticks at the same rate for everyone, virtual time ticks at different rates for different tasks based on their weight. High-weight (high-priority) tasks have slow virtual clocks; low-weight (low-priority) tasks have fast virtual clocks. When everyone's virtual clock shows the same time, the allocation is fair.
This page explores virtual runtime in exhaustive detail: the mathematics behind it, how it interacts with task weights, the edge cases that complicate the simple model, and the implementation techniques that make it efficient in practice.
By the end of this page, you will understand: (1) The mathematical definition of virtual runtime, (2) How vruntime relates to actual execution time through weights, (3) The concept of min_vruntime and why it's necessary, (4) How vruntime updates occur during execution, and (5) The numerical precision considerations in vruntime accounting.
Virtual runtime is defined by a simple but powerful equation. Understanding this equation is the key to understanding CFS.
The Virtual Runtime Equation
For a task with weight w that runs for actual time Δt, its virtual runtime increases by:
Δvruntime = Δt × (reference_weight / w)
Where reference_weight is a constant representing the weight of a 'normal' process (nice value 0, weight = 1024 in Linux). This formulation ensures that a nice-0 task's vruntime advances at the same rate as its actual execution time, that higher-weight tasks accumulate vruntime more slowly, and that equal vruntimes imply CPU time was divided in proportion to weight. The demonstration below makes this concrete:
"""Virtual Runtime Mathematics This module demonstrates the core mathematical conceptsbehind CFS's virtual runtime calculation.""" # Reference weight: nice 0 process weightNICE_0_WEIGHT = 1024 def calculate_delta_vruntime(delta_exec_ns: int, task_weight: int) -> int: """ Calculate the virtual runtime delta for a given execution time. Args: delta_exec_ns: Actual execution time in nanoseconds task_weight: The task's scheduling weight Returns: Virtual runtime delta in nanoseconds Formula: delta_vruntime = delta_exec * (NICE_0_WEIGHT / weight) """ # Integer math to avoid floating point, with proper rounding delta_vruntime = (delta_exec_ns * NICE_0_WEIGHT) // task_weight return delta_vruntime def demonstrate_fairness(): """ Demonstrate how virtual runtime ensures proportional fairness. """ # Two tasks: A has weight 1024 (nice 0), B has weight 2048 (nice -5) task_a_weight = 1024 task_b_weight = 2048 # Let's run each for 100ms of actual CPU time actual_runtime_ns = 100_000_000 # 100ms vruntime_a = calculate_delta_vruntime(actual_runtime_ns, task_a_weight) vruntime_b = calculate_delta_vruntime(actual_runtime_ns, task_b_weight) print(f"Task A (weight {task_a_weight}): " f"actual = {actual_runtime_ns/1e6:.2f}ms, " f"vruntime = {vruntime_a/1e6:.2f}ms") print(f"Task B (weight {task_b_weight}): " f"actual = {actual_runtime_ns/1e6:.2f}ms, " f"vruntime = {vruntime_b/1e6:.2f}ms") # For equal vruntime, how much actual time does each need? target_vruntime_ns = 100_000_000 # 100ms of virtual time actual_a = target_vruntime_ns * task_a_weight // NICE_0_WEIGHT actual_b = target_vruntime_ns * task_b_weight // NICE_0_WEIGHT print(f"\nFor equal vruntime ({target_vruntime_ns/1e6:.2f}ms):") print(f" Task A gets {actual_a/1e6:.2f}ms actual CPU") print(f" Task B gets {actual_b/1e6:.2f}ms actual CPU") print(f" Ratio B/A = {actual_b/actual_a:.2f} (expected: 2.0)") def trace_scheduling_scenario(): """ Simulate a CFS scheduling scenario to show vruntime in action. """ print("\n--- CFS Scheduling Trace ---") tasks = { 'A': {'weight': 1024, 'vruntime': 0, 'actual': 0}, # nice 0 'B': {'weight': 820, 'vruntime': 0, 'actual': 0}, # nice 2 'C': {'weight': 1277, 'vruntime': 0, 'actual': 0}, # nice -2 } time_slice_ns = 4_000_000 # 4ms per scheduling round total_time_ns = 100_000_000 # 100ms total simulation current_time = 0 while current_time < total_time_ns: # Pick task with minimum vruntime next_task = min(tasks.keys(), key=lambda t: tasks[t]['vruntime']) # Run it for a time slice task = tasks[next_task] delta_vruntime = calculate_delta_vruntime(time_slice_ns, task['weight']) task['vruntime'] += delta_vruntime task['actual'] += time_slice_ns current_time += time_slice_ns print(f"\nAfter {total_time_ns/1e6:.0f}ms of total CPU time:") total_weight = sum(t['weight'] for t in tasks.values()) for name, task in sorted(tasks.items()): expected_share = task['weight'] / total_weight * 100 actual_share = task['actual'] / total_time_ns * 100 print(f" {name}: vruntime={task['vruntime']/1e6:.2f}ms, " f"actual={task['actual']/1e6:.2f}ms, " f"share={actual_share:.1f}% (expected {expected_share:.1f}%)") if __name__ == "__main__": demonstrate_fairness() trace_scheduling_scenario()The Fairness Invariant
CFS maintains an invariant: at any point in time, tasks that have received their fair share of CPU have approximately equal vruntimes. The task with the smallest vruntime is the one that's most behind—it has received less than its fair share and should run next.
This invariant naturally handles tasks joining the runqueue, blocking and waking, and changing priority—no special cases are required.
The beauty of this approach is that fairness is emergent, not enforced through explicit priority queues or time slice allocations.
Imagine each task has a stopwatch, but the stopwatches tick at different speeds. A high-priority task's stopwatch ticks slowly; a low-priority task's ticks fast. When we line up all stopwatches showing the same time, the high-priority task has actually been running longer. The stopwatch reading is vruntime; the scheduler always picks the stopwatch showing the lowest time.
The connection between user-visible priorities (nice values) and scheduling behavior goes through two conversions: nice → weight → vruntime scaling.
Nice Values
Linux uses nice values from -20 (highest priority) to +19 (lowest priority). The name comes from the idea that a 'nice' process yields CPU to others. Nice 0 is the default.
Weight Calculation
Nice values are converted to weights using a carefully designed lookup table. The design principle: each nice level should provide approximately 10% more or less CPU than the adjacent level.
```c
/*
 * Linux kernel weight table for nice values
 * From: kernel/sched/core.c
 *
 * Nice value -20 to +19 mapped to weight values.
 * The ratio between adjacent values is ~1.25 (10% CPU difference principle)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

/*
 * Inverse weight table for faster division (multiply instead)
 * inv_weight = 2^32 / weight
 */
const u32 sched_prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

/*
 * The beauty of this table:
 *
 * 1. weight[nice 0] = 1024 (convenient power of 2)
 * 2. weight[nice -20] / weight[nice +19] ≈ 5917 (huge range)
 * 3. Ratio of adjacent weights ≈ 1.25, giving ~10% CPU difference
 *
 * This means nice -20 can get 5917x more CPU than nice +19,
 * but each step only changes allocation by ~10%.
 */

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
	/* Fast path: nice 0 doesn't need scaling */
	if (unlikely(se->load.weight != NICE_0_LOAD))
		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

	return delta;
}

/*
 * Optimized division using multiplication by inverse weight
 * delta_vruntime = delta_exec * NICE_0_LOAD / weight
 *                = delta_exec * NICE_0_LOAD * inv_weight >> 32
 *
 * (Simplified excerpt: the real kernel also folds the numerator
 *  weight into the multiplication and guards against overflow.)
 */
static u64 __calc_delta(u64 delta_exec, unsigned long weight,
			struct load_weight *lw)
{
	u64 fact = scale_load_down(weight);
	int shift = WMULT_SHIFT;	/* 32 */

	__update_inv_weight(lw);	/* Ensure inv_weight is current */

	/* Multiply by inverse weight and shift */
	fact = mul_u64_u32_shr(delta_exec, lw->inv_weight, shift);

	return fact;
}
```

| Nice | Weight | vruntime Rate | Relative CPU Share |
|---|---|---|---|
| -20 | 88,761 | 0.012× | ~87× of nice 0 |
| -10 | 9,548 | 0.107× | ~9× of nice 0 |
| -5 | 3,121 | 0.328× | ~3× of nice 0 |
| 0 | 1,024 | 1.000× | 1× (reference) |
| +5 | 335 | 3.057× | ~0.33× of nice 0 |
| +10 | 110 | 9.309× | ~0.11× of nice 0 |
| +19 | 15 | 68.27× | ~0.015× of nice 0 |
The ~1.25 ratio between adjacent nice levels means each nice step shifts CPU allocation by roughly 10%. If the ratio were larger (say, 2×), administrators would have only coarse-grained control; if smaller, the full nice range would not provide meaningful differentiation. The 1.25 factor was chosen empirically as a good balance for real-world workloads; the short calculation below shows the effect for two tasks one nice level apart.
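To see where the "10%" figure comes from, here is a quick check (not from the original kernel source, just a sketch using the weight values from the table above): two competing tasks one nice level apart split the CPU roughly 55%/45%.

```python
# Sketch: how a one-step nice difference translates into CPU share.
# Weights taken from the kernel's sched_prio_to_weight table.
NICE_0_WEIGHT = 1024   # nice 0
NICE_1_WEIGHT = 820    # nice +1

total = NICE_0_WEIGHT + NICE_1_WEIGHT
share_nice0 = NICE_0_WEIGHT / total
share_nice1 = NICE_1_WEIGHT / total

print(f"nice 0 share: {share_nice0:.1%}")                     # ~55.5%
print(f"nice 1 share: {share_nice1:.1%}")                     # ~44.5%
print(f"weight ratio: {NICE_0_WEIGHT / NICE_1_WEIGHT:.3f}")   # ~1.249
```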
Each CFS runqueue maintains a crucial variable called min_vruntime. This value tracks the smallest vruntime among all runnable tasks, serving as an anchor point for various scheduling decisions.
Why min_vruntime Matters
Virtual runtimes are ever-increasing. If we simply used raw vruntimes, they would eventually overflow (even 64-bit integers have limits). More practically, several operations need a reference point: placing newly forked tasks, positioning tasks that wake from sleep, and adjusting the vruntime of tasks that migrate between CPUs.
min_vruntime provides this reference point. It monotonically increases (never decreases), and all task vruntimes are implicitly relative to it.
```c
/*
 * min_vruntime update logic
 *
 * min_vruntime is updated to track the smallest vruntime
 * among all runnable tasks, but with important constraints.
 */

static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);

	u64 vruntime = cfs_rq->min_vruntime;

	/*
	 * Consider both the currently running task and the
	 * leftmost task in the red-black tree
	 */
	if (curr) {
		if (curr->on_rq)
			vruntime = curr->vruntime;
		else
			curr = NULL;
	}

	if (leftmost) {
		struct sched_entity *se = __node_2_se(leftmost);

		if (!curr)
			vruntime = se->vruntime;
		else
			vruntime = min_vruntime(vruntime, se->vruntime);
	}

	/*
	 * CRITICAL: min_vruntime must never go backwards!
	 *
	 * This ensures:
	 * 1. New tasks placed at min_vruntime don't get unfair advantage
	 * 2. Migrated tasks don't get "time travel" to the past
	 * 3. Progress is monotonic
	 */
	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);

#ifndef CONFIG_64BIT
	smp_wmb();
	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
}

/*
 * Comparison functions that handle wraparound correctly.
 * Even with 64-bit vruntimes, we use signed comparison
 * to handle potential (rare) overflow scenarios.
 */
static inline bool entity_before(struct sched_entity *a,
				 struct sched_entity *b)
{
	return (s64)(a->vruntime - b->vruntime) < 0;
}

static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime)
{
	s64 delta = (s64)(vruntime - max_vruntime);
	if (delta > 0)
		max_vruntime = vruntime;

	return max_vruntime;
}

static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
{
	s64 delta = (s64)(vruntime - min_vruntime);
	if (delta < 0)
		min_vruntime = vruntime;

	return min_vruntime;
}
```

The Monotonic Property
The most important invariant of min_vruntime is that it never decreases. This is enforced by the max_vruntime() call in the update function. Why is this critical?
Prevents Time Travel: If min_vruntime could decrease, tasks placed at the new min_vruntime could be scheduled before tasks that were already 'ahead'
Bounds Sleeper Credit: Sleeping tasks have their vruntime updated upon waking relative to min_vruntime. If it could decrease, sleepers could accumulate unbounded credit
Migration Consistency: When tasks migrate, their vruntimes are adjusted by (source_min_vruntime - dest_min_vruntime). Monotonicity ensures this adjustment is well-defined
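As a rough sketch of that migration adjustment (simplified; the kernel performs this inside its dequeue/enqueue and migration paths, and the helper below is hypothetical), a task's vruntime is made relative to the source queue's min_vruntime on the way out and re-anchored to the destination queue's min_vruntime on the way in:

```python
def migrate_vruntime(task_vruntime: int, src_min_vruntime: int,
                     dst_min_vruntime: int) -> int:
    """
    Sketch of vruntime adjustment on CPU migration (values in ns).

    The task keeps its *relative* position against the source queue's
    min_vruntime, re-expressed against the destination queue's
    min_vruntime. Equivalent to adding (dst_min - src_min).
    """
    relative = task_vruntime - src_min_vruntime   # lag/lead on the source CPU
    return dst_min_vruntime + relative            # same lag/lead on the destination


# Example: a task 2ms ahead of its source queue stays 2ms ahead
# of the destination queue after migration.
print(migrate_vruntime(102_000_000, 100_000_000, 500_000_000))  # 502000000
```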
Edge Case: Empty Runqueue
When a runqueue becomes empty (all tasks blocked), min_vruntime is left at its last value. When new tasks arrive, they start near this value, maintaining continuity.
With 64-bit vruntimes measured in nanoseconds, how long until overflow? 2^63 nanoseconds ≈ 292 years. A system would need to run continuously for centuries before this becomes a concern. Nevertheless, CFS uses signed comparisons to handle wraparound gracefully, adopting the principle that if (a - b) < 0, then a is 'before' b.
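A minimal sketch (Python emulating the kernel's u64/s64 arithmetic) of why the "if (a - b) < 0 then a is before b" trick stays correct even across wraparound:

```python
# Sketch: wraparound-safe vruntime comparison, emulating 64-bit signed math.
U64_MASK = (1 << 64) - 1

def to_s64(x: int) -> int:
    """Reinterpret a 64-bit unsigned value as signed (two's complement)."""
    x &= U64_MASK
    return x - (1 << 64) if x >= (1 << 63) else x

def vruntime_before(a: int, b: int) -> bool:
    """a is 'before' b if the signed 64-bit difference (a - b) is negative."""
    return to_s64((a - b) & U64_MASK) < 0

# Normal case: smaller vruntime is before larger.
print(vruntime_before(100, 200))          # True

# Wraparound case: b has already wrapped past 2^64, a is just below the limit.
a = (1 << 64) - 10
b = 5
print(vruntime_before(a, b))              # True: a is still "before" b
```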
Understanding when and how vruntimes are updated reveals the practical mechanics of CFS. Updates happen at several key points in a task's lifecycle.
During Execution: The Core Update Path
The primary vruntime update happens through update_curr(), called whenever the scheduler needs current accounting. This occurs on periodic scheduler ticks, when the current task is descheduled, when the next task is picked, and on state changes such as fork, exec, or a nice-value change (the call sites are summarized at the end of the code below).
```c
/*
 * update_curr - Account for the 'current' task's runtime
 *
 * This is the heart of CFS accounting. Called on every
 * scheduling decision point.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_clock_task(rq_of(cfs_rq));
	u64 delta_exec;

	if (unlikely(!curr))
		return;

	/* How long has current task been running since last update? */
	delta_exec = now - curr->exec_start;
	if (unlikely((s64)delta_exec <= 0))
		return;

	/* Reset the execution start time for next accounting period */
	curr->exec_start = now;

	/* Accumulate statistics */
	curr->sum_exec_runtime += delta_exec;
	schedstat_add(cfs_rq->exec_clock, delta_exec);

	/*
	 * THE KEY OPERATION: Update virtual runtime
	 *
	 * delta_vruntime = delta_exec * (NICE_0_LOAD / curr->load.weight)
	 */
	curr->vruntime += calc_delta_fair(delta_exec, curr);

	/* Keep min_vruntime up to date */
	update_min_vruntime(cfs_rq);

	/* Handle cgroup bandwidth accounting if enabled */
	if (entity_is_task(curr)) {
		struct task_struct *curtask = task_of(curr);

		trace_sched_stat_runtime(curtask, delta_exec);
		cgroup_account_cputime(curtask, delta_exec);
		account_group_exec_runtime(curtask, delta_exec);
	}
}

/*
 * When is update_curr called?
 *
 * 1. scheduler_tick() - periodic timer interrupt
 *    └─> entity_tick() -> update_curr()
 *
 * 2. put_prev_task() - when descheduling current task
 *    └─> put_prev_entity() -> update_curr()
 *
 * 3. pick_next_task() - when selecting new task
 *    └─> pick_next_entity() -> set_next_entity()
 *        -> update_curr() [for prev task]
 *
 * 4. task state changes - fork, exec, nice change
 *    └─> Various paths call update_curr()
 */
```

On Wakeup: Sleeper Placement
When a sleeping task wakes up, its vruntime must be adjusted. The task's old vruntime might be far behind current tasks, which would allow it to monopolize the CPU unfairly. CFS handles this with place_entity():
```c
/*
 * place_entity - Position a waking or forking task in the vruntime timeline
 *
 * This function balances two competing goals:
 * 1. Sleepers should get some credit for sleeping (responsiveness)
 * 2. Sleepers shouldn't starve runners (fairness)
 */
static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
			 int initial)
{
	u64 vruntime = cfs_rq->min_vruntime;

	/*
	 * For newly forked tasks: start ahead of min_vruntime
	 * by one expected time slice. This prevents fork bombs
	 * from starving existing tasks.
	 */
	if (initial && sched_feat(START_DEBIT))
		vruntime += sched_vslice(cfs_rq, se);

	/*
	 * For waking tasks: give a bounded credit.
	 *
	 * GENTLE_FAIR_SLEEPERS: Place at min_vruntime - (latency/2)
	 * This gives sleepers priority (they're "behind")
	 * but limits how much priority (half a latency period)
	 */
	if (!initial && sched_feat(GENTLE_FAIR_SLEEPERS)) {
		u64 thresh = sysctl_sched_latency >> 1;

		/*
		 * Clamp the credit: don't let vruntime go too far
		 * back from min_vruntime
		 */
		if (sched_feat(BASE_SLICE))
			thresh = calc_delta_fair(thresh, se);

		vruntime -= thresh;
	}

	/*
	 * CRITICAL: Never let vruntime go backward from its natural position
	 *
	 * If task's vruntime is already ahead of where we'd place it,
	 * leave it alone. This handles tasks that blocked briefly
	 * and woke up before "earning" any significant credit.
	 */
	se->vruntime = max_vruntime(se->vruntime, vruntime);
}

/*
 * The result:
 * - Long sleepers get capped credit (half a latency period)
 * - Brief sleepers aren't penalized for their vruntime position
 * - New forks start slightly behind to prevent fork bombs
 */
```

The sleeper credit is controversial. Too much credit and interactive tasks can starve CPU-bound tasks. Too little and interactive responsiveness suffers. The 'half a latency period' value was chosen empirically to balance desktop interactivity (where it helps) against server throughput (where it can hurt). Administrators can tune sched_latency to adjust this balance.
Kernel code must be both precise and efficient. CFS's vruntime calculations involve division by weights, which is expensive in CPU cycles. The kernel uses several techniques to optimize this.
Avoiding Division: Multiply and Shift
Division by weight can be converted to multiplication by inverse weight:
value / weight = value * inv_weight >> 32
The kernel precomputes inv_weight = 2^32 / weight for each nice level. This converts expensive divisions into fast multiplications and bit shifts.
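A small check (a sketch, using Python integers in place of the kernel's u64 math) that multiplying by the precomputed inverse weight and shifting right by 32 reproduces the division to within rounding error:

```python
# Sketch: division by weight replaced with multiply-by-inverse and shift.
NICE_0_LOAD = 1024
WMULT_SHIFT = 32

def inv_weight(weight: int) -> int:
    """Precomputed inverse: 2^32 // weight (as in sched_prio_to_wmult)."""
    return (1 << WMULT_SHIFT) // weight

def calc_delta_div(delta_exec: int, weight: int) -> int:
    """Reference form: direct division."""
    return delta_exec * NICE_0_LOAD // weight

def calc_delta_mulshift(delta_exec: int, weight: int) -> int:
    """Optimized form: multiply by inverse weight, then shift right by 32."""
    return (delta_exec * NICE_0_LOAD * inv_weight(weight)) >> WMULT_SHIFT

for weight in (88761, 3121, 1024, 820, 15):
    d = calc_delta_div(4_000_000, weight)       # 4ms of execution
    m = calc_delta_mulshift(4_000_000, weight)
    print(f"weight={weight:6d}: div={d:12d}  mulshift={m:12d}  diff={d - m}")
```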
Handling 32-bit Systems
On 32-bit systems, 64-bit arithmetic is slower (requires multiple instructions). The kernel includes optimized paths for 32-bit platforms, using careful overflow detection and multi-word arithmetic.
Nanosecond Resolution
CFS uses nanoseconds for all time values. This provides sufficient resolution for modern CPUs executing billions of instructions per second, while keeping values in human-comprehensible units for debugging.
```c
/*
 * Optimized delta calculation using inverse weights
 *
 * Goal: Calculate delta_exec * NICE_0_LOAD / weight efficiently
 *
 * Method: Use precomputed inv_weight = 2^32 / weight
 *         Result = delta_exec * NICE_0_LOAD * inv_weight >> 32
 */

#define WMULT_SHIFT	32
#define WMULT_CONST	(~0UL)

/*
 * Ensure inv_weight is up to date
 * This is called lazily when weight changes
 */
static void __update_inv_weight(struct load_weight *lw)
{
	unsigned long w;

	if (likely(lw->inv_weight))
		return;

	w = scale_load_down(lw->weight);

	if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
		lw->inv_weight = 1;
	else if (unlikely(!w))
		lw->inv_weight = WMULT_CONST;
	else
		lw->inv_weight = WMULT_CONST / w;
}

/*
 * mul_u64_u32_shr - Multiply 64-bit by 32-bit and shift right
 *
 * This is architecture-optimized on most platforms
 */
static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
{
	/*
	 * On 64-bit: single multiply instruction + shift
	 * On 32-bit: software 64*32 multiply
	 * (simplified excerpt; assumes shift == 32 as used by CFS)
	 */
#if BITS_PER_LONG == 64
	return (a * mul) >> shift;
#else
	u32 ah = a >> 32;
	u32 al = a;
	u64 ret;

	ret = ((u64)ah * mul) << (32 - shift);
	if (shift > 32)
		ret >>= shift - 32;
	else
		ret += ((u64)al * mul) >> shift;

	return ret;
#endif
}

/*
 * Performance characteristics:
 *
 * calc_delta_fair() on modern x86-64:
 * - No division instructions (avoided via inv_weight)
 * - 2-3 multiplications
 * - 1-2 shifts
 * - Total: ~5-10 cycles
 *
 * This is called every scheduler tick and task switch,
 * so the optimization matters significantly.
 */
```

Timing Sources
The accuracy of vruntime depends on accurate time measurement, and Linux selects among several clock sources depending on context. CFS uses rq_clock_task(), which provides a stable, monotonic nanosecond timestamp suitable for accounting.
Using nanoseconds ensures that even microsecond-scale execution times are accurately tracked. On a 4GHz CPU, 1 nanosecond is about 4 cycles—enough resolution for fine-grained accounting. The 64-bit vruntime can represent 292 years of nanoseconds, far exceeding any practical system lifetime.
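For completeness, the "292 years" figure follows directly from the size of a signed 64-bit nanosecond counter; a one-line check:

```python
# Sketch: overflow horizon of a signed 64-bit nanosecond counter.
NS_PER_YEAR = 365.25 * 24 * 3600 * 1e9
print(f"{(2**63) / NS_PER_YEAR:.1f} years")   # ~292.3 years
```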
Understanding vruntime dynamics is easier with visualization. Let's trace through a concrete scenario showing how vruntimes evolve over time.
Scenario: Three Tasks with Different Weights
Task A has weight 1024 (nice 0), Task B has weight 2048, and Task C has weight 512. Expected behavior: over any time period, B should get 2× the CPU of A, and A should get 2× the CPU of C, while their vruntimes stay approximately equal.
"""Visualization of CFS Virtual Runtime Dynamics This script simulates CFS scheduling and shows howvruntimes evolve to achieve proportional fairness.""" import matplotlib.pyplot as pltfrom dataclasses import dataclass NICE_0_LOAD = 1024TIME_SLICE_MS = 4 @dataclassclass Task: name: str weight: int vruntime: float = 0.0 actual_runtime: float = 0.0 def vruntime_rate(self) -> float: """How fast vruntime increases per ms of actual runtime""" return NICE_0_LOAD / self.weight def simulate_cfs(tasks: list, total_time_ms: int): """ Simulate CFS scheduling for given tasks and duration. Returns history for visualization. """ history = {t.name: {'vruntime': [], 'actual': [], 'time': []} for t in tasks} current_time = 0 while current_time < total_time_ms: # CFS picks task with minimum vruntime task = min(tasks, key=lambda t: t.vruntime) # Run for one time slice slice_ms = TIME_SLICE_MS task.actual_runtime += slice_ms task.vruntime += slice_ms * task.vruntime_rate() current_time += slice_ms # Record state for t in tasks: history[t.name]['time'].append(current_time) history[t.name]['vruntime'].append(t.vruntime) history[t.name]['actual'].append(t.actual_runtime) return history def visualize(): tasks = [ Task('A (weight=1024)', 1024), Task('B (weight=2048)', 2048), Task('C (weight=512)', 512), ] history = simulate_cfs(tasks, 1000) # 1 second fig, axes = plt.subplots(2, 1, figsize=(12, 8)) # Plot vruntimes - should converge/stay close ax1 = axes[0] for name, data in history.items(): ax1.plot(data['time'], data['vruntime'], label=name) ax1.set_xlabel('Wall Clock Time (ms)') ax1.set_ylabel('Virtual Runtime') ax1.set_title('Virtual Runtime Over Time (All Tasks Stay Close)') ax1.legend() ax1.grid(True, alpha=0.3) # Plot actual runtimes - should differ by weight ratios ax2 = axes[1] for name, data in history.items(): ax2.plot(data['time'], data['actual'], label=name) ax2.set_xlabel('Wall Clock Time (ms)') ax2.set_ylabel('Actual Runtime (ms)') ax2.set_title('Actual Runtime Over Time (B gets 2x of A, A gets 2x of C)') ax2.legend() ax2.grid(True, alpha=0.3) plt.tight_layout() plt.savefig('cfs_vruntime_visualization.png', dpi=150) # Print final statistics print("\nFinal Statistics after 1000ms:") print("-" * 50) total_weight = sum(t.weight for t in tasks) for t in tasks: expected = t.weight / total_weight * 1000 print(f"{t.name}:") print(f" Actual runtime: {t.actual_runtime:.1f}ms " f"(expected {expected:.1f}ms)") print(f" Final vruntime: {t.vruntime:.1f}") print(f" vruntime rate: {t.vruntime_rate():.3f}") # Verify vruntimes are close vruntimes = [t.vruntime for t in tasks] vrt_range = max(vruntimes) - min(vruntimes) print(f"\nVruntime spread: {vrt_range:.1f} " f"(should be small, ~{TIME_SLICE_MS} * max_rate)") if __name__ == "__main__": visualize()Key Observations from the Visualization
Vruntimes Track Together: Despite different actual runtimes, all three tasks maintain approximately equal vruntimes. This is the fairness guarantee in action.
Higher Weight = More CPU: Task B (weight 2048) accumulates actual runtime twice as fast as Task A, and four times as fast as Task C.
Vruntime Slopes Differ: Task C's vruntime increases fastest (steep slope), Task B's increases slowest (gentle slope). Each task advances its virtual clock at a different rate, yet all cover the same virtual 'distance.'
Scheduler Oscillation: The scheduler bounces between tasks as vruntimes leapfrog each other. This creates the fair interleaving pattern.
The visualization reveals CFS's core insight: by normalizing time differently for each task, the scheduler can use a single metric (minimum vruntime) to make fair decisions. There's no need for separate priority queues, complex heuristics, or interactivity detection—fairness emerges naturally from the mathematics.
Virtual runtime is the conceptual heart of CFS—the abstraction that makes proportional fairness achievable and efficient. Understanding vruntime is essential for anyone seeking to comprehend Linux scheduling at a deep level.
What's Next
The next page explores the red-black tree—the data structure that makes CFS's vruntime-based scheduling efficient. We'll see how this self-balancing binary search tree enables O(1) minimum selection and O(log n) insertions, and why it's perfectly suited to CFS's access patterns.
You now understand virtual runtime—the normalized time metric that enables CFS to achieve proportional fairness efficiently. This understanding is foundational for comprehending not just Linux scheduling, but the broader class of weighted fair queueing algorithms used throughout systems software.