In the world of CFS, all scheduling decisions reduce to a single question: which task has the smallest virtual runtime? This elegant simplicity is made possible by virtual runtime (vruntime)—the abstraction that transforms the complex problem of fair CPU allocation into a simple comparison operation.
Think of virtual runtime as a normalized clock. While actual CPU time ticks at the same rate for everyone, virtual time ticks at different rates for different tasks based on their weight. High-weight (high-priority) tasks have slow virtual clocks; low-weight (low-priority) tasks have fast virtual clocks. When everyone's virtual clock shows the same time, the allocation is fair.
This page explores virtual runtime in exhaustive detail: the mathematics behind it, how it interacts with task weights, the edge cases that complicate the simple model, and the implementation techniques that make it efficient in practice.
By the end of this page, you will understand: (1) The mathematical definition of virtual runtime, (2) How vruntime relates to actual execution time through weights, (3) The concept of min_vruntime and why it's necessary, (4) How vruntime updates occur during execution, and (5) The numerical precision considerations in vruntime accounting.
Virtual runtime is defined by a simple but powerful equation. Understanding this equation is the key to understanding CFS.
The Virtual Runtime Equation
For a task with weight w that runs for actual time Δt, its virtual runtime increases by:
Δvruntime = Δt × (reference_weight / w)
Where reference_weight is a constant representing the weight of a 'normal' process (nice value 0, weight = 1024 in Linux). This formulation ensures that a nice-0 task's vruntime advances at the same rate as its actual execution time, that higher-weight tasks accumulate vruntime more slowly, and that equal vruntimes imply CPU time was divided in proportion to weight. The demonstration below makes this concrete:
"""Virtual Runtime Mathematics This module demonstrates the core mathematical conceptsbehind CFS's virtual runtime calculation.""" # Reference weight: nice 0 process weightNICE_0_WEIGHT = 1024 def calculate_delta_vruntime(delta_exec_ns: int, task_weight: int) -> int: """ Calculate the virtual runtime delta for a given execution time. Args: delta_exec_ns: Actual execution time in nanoseconds task_weight: The task's scheduling weight Returns: Virtual runtime delta in nanoseconds Formula: delta_vruntime = delta_exec * (NICE_0_WEIGHT / weight) """ # Integer math to avoid floating point, with proper rounding delta_vruntime = (delta_exec_ns * NICE_0_WEIGHT) // task_weight return delta_vruntime def demonstrate_fairness(): """ Demonstrate how virtual runtime ensures proportional fairness. """ # Two tasks: A has weight 1024 (nice 0), B has weight 2048 (nice -5) task_a_weight = 1024 task_b_weight = 2048 # Let's run each for 100ms of actual CPU time actual_runtime_ns = 100_000_000 # 100ms vruntime_a = calculate_delta_vruntime(actual_runtime_ns, task_a_weight) vruntime_b = calculate_delta_vruntime(actual_runtime_ns, task_b_weight) print(f"Task A (weight {task_a_weight}): " f"actual = {actual_runtime_ns/1e6:.2f}ms, " f"vruntime = {vruntime_a/1e6:.2f}ms") print(f"Task B (weight {task_b_weight}): " f"actual = {actual_runtime_ns/1e6:.2f}ms, " f"vruntime = {vruntime_b/1e6:.2f}ms") # For equal vruntime, how much actual time does each need? target_vruntime_ns = 100_000_000 # 100ms of virtual time actual_a = target_vruntime_ns * task_a_weight // NICE_0_WEIGHT actual_b = target_vruntime_ns * task_b_weight // NICE_0_WEIGHT print(f"\nFor equal vruntime ({target_vruntime_ns/1e6:.2f}ms):") print(f" Task A gets {actual_a/1e6:.2f}ms actual CPU") print(f" Task B gets {actual_b/1e6:.2f}ms actual CPU") print(f" Ratio B/A = {actual_b/actual_a:.2f} (expected: 2.0)") def trace_scheduling_scenario(): """ Simulate a CFS scheduling scenario to show vruntime in action. """ print("\n--- CFS Scheduling Trace ---") tasks = { 'A': {'weight': 1024, 'vruntime': 0, 'actual': 0}, # nice 0 'B': {'weight': 820, 'vruntime': 0, 'actual': 0}, # nice 2 'C': {'weight': 1277, 'vruntime': 0, 'actual': 0}, # nice -2 } time_slice_ns = 4_000_000 # 4ms per scheduling round total_time_ns = 100_000_000 # 100ms total simulation current_time = 0 while current_time < total_time_ns: # Pick task with minimum vruntime next_task = min(tasks.keys(), key=lambda t: tasks[t]['vruntime']) # Run it for a time slice task = tasks[next_task] delta_vruntime = calculate_delta_vruntime(time_slice_ns, task['weight']) task['vruntime'] += delta_vruntime task['actual'] += time_slice_ns current_time += time_slice_ns print(f"\nAfter {total_time_ns/1e6:.0f}ms of total CPU time:") total_weight = sum(t['weight'] for t in tasks.values()) for name, task in sorted(tasks.items()): expected_share = task['weight'] / total_weight * 100 actual_share = task['actual'] / total_time_ns * 100 print(f" {name}: vruntime={task['vruntime']/1e6:.2f}ms, " f"actual={task['actual']/1e6:.2f}ms, " f"share={actual_share:.1f}% (expected {expected_share:.1f}%)") if __name__ == "__main__": demonstrate_fairness() trace_scheduling_scenario()The Fairness Invariant
CFS maintains an invariant: at any point in time, tasks that have received their fair share of CPU have approximately equal vruntimes. The task with the smallest vruntime is the one that's most behind—it has received less than its fair share and should run next.
This invariant naturally handles tasks joining the runqueue, blocking and waking, and changing priority—no special cases are required.
The beauty of this approach is that fairness is emergent, not enforced through explicit priority queues or time slice allocations.
Imagine each task has a stopwatch, but the stopwatches tick at different speeds. A high-priority task's stopwatch ticks slowly; a low-priority task's ticks fast. When we line up all stopwatches showing the same time, the high-priority task has actually been running longer. The stopwatch reading is vruntime; the scheduler always picks the stopwatch showing the lowest time.
The connection between user-visible priorities (nice values) and scheduling behavior goes through two conversions: nice → weight → vruntime scaling.
Nice Values
Linux uses nice values from -20 (highest priority) to +19 (lowest priority). The name comes from the idea that a 'nice' process yields CPU to others. Nice 0 is the default.
Weight Calculation
Nice values are converted to weights using a carefully designed lookup table. The design principle: each nice level should provide approximately 10% more or less CPU than the adjacent level.
```c
/*
 * Linux kernel weight table for nice values
 * From: kernel/sched/core.c
 *
 * Nice value -20 to +19 mapped to weight values.
 * The ratio between adjacent values is ~1.25 (10% CPU difference principle)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

/*
 * Inverse weight table for faster division (multiply instead)
 * inv_weight = 2^32 / weight
 */
const u32 sched_prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

/*
 * The beauty of this table:
 *
 * 1. weight[nice 0] = 1024 (convenient power of 2)
 * 2. weight[nice -20] / weight[nice +19] ≈ 5917 (huge range)
 * 3. Ratio of adjacent weights ≈ 1.25, giving ~10% CPU difference
 *
 * This means nice -20 can get 5917x more CPU than nice +19,
 * but each step only changes allocation by ~10%.
 */

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
	/* Fast path: nice 0 doesn't need scaling */
	if (unlikely(se->load.weight != NICE_0_LOAD))
		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

	return delta;
}

/*
 * Optimized division using multiplication by inverse weight
 * delta_vruntime = delta_exec * NICE_0_LOAD / weight
 *                = delta_exec * NICE_0_LOAD * inv_weight >> 32
 *
 * (Simplified excerpt: the real kernel also folds the numerator
 *  weight into the multiplication and guards against overflow.)
 */
static u64 __calc_delta(u64 delta_exec, unsigned long weight,
			struct load_weight *lw)
{
	u64 fact = scale_load_down(weight);
	int shift = WMULT_SHIFT;	/* 32 */

	__update_inv_weight(lw);	/* Ensure inv_weight is current */

	/* Multiply by inverse weight and shift */
	fact = mul_u64_u32_shr(delta_exec, lw->inv_weight, shift);

	return fact;
}
```

| Nice | Weight | vruntime Rate | Relative CPU Share |
|---|---|---|---|
| -20 | 88,761 | 0.012× | ~87× of nice 0 |
| -10 | 9,548 | 0.107× | ~9× of nice 0 |
| -5 | 3,121 | 0.328× | ~3× of nice 0 |
| 0 | 1,024 | 1.000× | 1× (reference) |
| +5 | 335 | 3.057× | ~0.33× of nice 0 |
| +10 | 110 | 9.309× | ~0.11× of nice 0 |
| +19 | 15 | 68.27× | ~0.015× of nice 0 |
The ~1.25 ratio between adjacent nice levels means each nice step shifts CPU allocation by roughly 10%. If the ratio were larger (say, 2×), administrators would have only coarse-grained control; if smaller, the full nice range would not provide meaningful differentiation. The 1.25 factor was chosen empirically as a good balance for real-world workloads; the short calculation below shows the effect for two tasks one nice level apart.
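To see where the "10%" figure comes from, here is a quick check (not from the original kernel source, just a sketch using the weight values from the table above): two competing tasks one nice level apart split the CPU roughly 55%/45%.

```python
# Sketch: how a one-step nice difference translates into CPU share.
# Weights taken from the kernel's sched_prio_to_weight table.
NICE_0_WEIGHT = 1024   # nice 0
NICE_1_WEIGHT = 820    # nice +1

total = NICE_0_WEIGHT + NICE_1_WEIGHT
share_nice0 = NICE_0_WEIGHT / total
share_nice1 = NICE_1_WEIGHT / total

print(f"nice 0 share: {share_nice0:.1%}")                     # ~55.5%
print(f"nice 1 share: {share_nice1:.1%}")                     # ~44.5%
print(f"weight ratio: {NICE_0_WEIGHT / NICE_1_WEIGHT:.3f}")   # ~1.249
```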
Each CFS runqueue maintains a crucial variable called min_vruntime. This value tracks the smallest vruntime among all runnable tasks, serving as an anchor point for various scheduling decisions.
Why min_vruntime Matters
Virtual runtimes are ever-increasing. If we simply used raw vruntimes, they would eventually overflow (even 64-bit integers have limits). More practically, several operations need a reference point: placing newly forked tasks, positioning tasks that wake from sleep, and adjusting the vruntime of tasks that migrate between CPUs.
min_vruntime provides this reference point. It monotonically increases (never decreases), and all task vruntimes are implicitly relative to it.
```c
/*
 * min_vruntime update logic
 *
 * min_vruntime is updated to track the smallest vruntime
 * among all runnable tasks, but with important constraints.
 */

static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);

	u64 vruntime = cfs_rq->min_vruntime;

	/*
	 * Consider both the currently running task and the
	 * leftmost task in the red-black tree
	 */
	if (curr) {
		if (curr->on_rq)
			vruntime = curr->vruntime;
		else
			curr = NULL;
	}

	if (leftmost) {
		struct sched_entity *se = __node_2_se(leftmost);

		if (!curr)
			vruntime = se->vruntime;
		else
			vruntime = min_vruntime(vruntime, se->vruntime);
	}

	/*
	 * CRITICAL: min_vruntime must never go backwards!
	 *
	 * This ensures:
	 * 1. New tasks placed at min_vruntime don't get unfair advantage
	 * 2. Migrated tasks don't get "time travel" to the past
	 * 3. Progress is monotonic
	 */
	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);

#ifndef CONFIG_64BIT
	smp_wmb();
	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
}

/*
 * Comparison functions that handle wraparound correctly.
 * Even with 64-bit vruntimes, we use signed comparison
 * to handle potential (rare) overflow scenarios.
 */
static inline bool entity_before(struct sched_entity *a,
				 struct sched_entity *b)
{
	return (s64)(a->vruntime - b->vruntime) < 0;
}

static inline u64 max_vruntime(u64 max_vruntime, u64 vruntime)
{
	s64 delta = (s64)(vruntime - max_vruntime);
	if (delta > 0)
		max_vruntime = vruntime;

	return max_vruntime;
}

static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
{
	s64 delta = (s64)(vruntime - min_vruntime);
	if (delta < 0)
		min_vruntime = vruntime;

	return min_vruntime;
}
```

The Monotonic Property
The most important invariant of min_vruntime is that it never decreases. This is enforced by the max_vruntime() call in the update function. Why is this critical?
Prevents Time Travel: If min_vruntime could decrease, tasks placed at the new min_vruntime could be scheduled before tasks that were already 'ahead'
Bounds Sleeper Credit: Sleeping tasks have their vruntime updated upon waking relative to min_vruntime. If it could decrease, sleepers could accumulate unbounded credit
Migration Consistency: When tasks migrate, their vruntimes are adjusted by (source_min_vruntime - dest_min_vruntime). Monotonicity ensures this adjustment is well-defined
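As a rough sketch of that migration adjustment (simplified; the kernel performs this inside its dequeue/enqueue and migration paths, and the helper below is hypothetical), a task's vruntime is made relative to the source queue's min_vruntime on the way out and re-anchored to the destination queue's min_vruntime on the way in:

```python
def migrate_vruntime(task_vruntime: int, src_min_vruntime: int,
                     dst_min_vruntime: int) -> int:
    """
    Sketch of vruntime adjustment on CPU migration (values in ns).

    The task keeps its *relative* position against the source queue's
    min_vruntime, re-expressed against the destination queue's
    min_vruntime. Equivalent to adding (dst_min - src_min).
    """
    relative = task_vruntime - src_min_vruntime   # lag/lead on the source CPU
    return dst_min_vruntime + relative            # same lag/lead on the destination


# Example: a task 2ms ahead of its source queue stays 2ms ahead
# of the destination queue after migration.
print(migrate_vruntime(102_000_000, 100_000_000, 500_000_000))  # 502000000
```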
Edge Case: Empty Runqueue
When a runqueue becomes empty (all tasks blocked), min_vruntime is left at its last value. When new tasks arrive, they start near this value, maintaining continuity.
With 64-bit vruntimes measured in nanoseconds, how long until overflow? 2^63 nanoseconds ≈ 292 years. A system would need to run continuously for centuries before this becomes a concern. Nevertheless, CFS uses signed comparisons to handle wraparound gracefully, adopting the principle that if (a - b) < 0, then a is 'before' b.
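A minimal sketch (Python emulating the kernel's u64/s64 arithmetic) of why the "if (a - b) < 0 then a is before b" trick stays correct even across wraparound:

```python
# Sketch: wraparound-safe vruntime comparison, emulating 64-bit signed math.
U64_MASK = (1 << 64) - 1

def to_s64(x: int) -> int:
    """Reinterpret a 64-bit unsigned value as signed (two's complement)."""
    x &= U64_MASK
    return x - (1 << 64) if x >= (1 << 63) else x

def vruntime_before(a: int, b: int) -> bool:
    """a is 'before' b if the signed 64-bit difference (a - b) is negative."""
    return to_s64((a - b) & U64_MASK) < 0

# Normal case: smaller vruntime is before larger.
print(vruntime_before(100, 200))          # True

# Wraparound case: b has already wrapped past 2^64, a is just below the limit.
a = (1 << 64) - 10
b = 5
print(vruntime_before(a, b))              # True: a is still "before" b
```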
Understanding when and how vruntimes are updated reveals the practical mechanics of CFS. Updates happen at several key points in a task's lifecycle.
During Execution: The Core Update Path
The primary vruntime update happens through update_curr(), called whenever the scheduler needs current accounting. This occurs on periodic scheduler ticks, when the current task is descheduled, when the next task is picked, and on state changes such as fork, exec, or a nice-value change (the call sites are summarized at the end of the code below).
```c
/*
 * update_curr - Account for the 'current' task's runtime
 *
 * This is the heart of CFS accounting. Called on every
 * scheduling decision point.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_clock_task(rq_of(cfs_rq));
	u64 delta_exec;

	if (unlikely(!curr))
		return;

	/* How long has current task been running since last update? */
	delta_exec = now - curr->exec_start;
	if (unlikely((s64)delta_exec <= 0))
		return;

	/* Reset the execution start time for next accounting period */
	curr->exec_start = now;

	/* Accumulate statistics */
	curr->sum_exec_runtime += delta_exec;
	schedstat_add(cfs_rq->exec_clock, delta_exec);

	/*
	 * THE KEY OPERATION: Update virtual runtime
	 *
	 * delta_vruntime = delta_exec * (NICE_0_LOAD / curr->load.weight)
	 */
	curr->vruntime += calc_delta_fair(delta_exec, curr);

	/* Keep min_vruntime up to date */
	update_min_vruntime(cfs_rq);

	/* Handle cgroup bandwidth accounting if enabled */
	if (entity_is_task(curr)) {
		struct task_struct *curtask = task_of(curr);

		trace_sched_stat_runtime(curtask, delta_exec);
		cgroup_account_cputime(curtask, delta_exec);
		account_group_exec_runtime(curtask, delta_exec);
	}
}

/*
 * When is update_curr called?
 *
 * 1. scheduler_tick() - periodic timer interrupt
 *    └─> entity_tick() -> update_curr()
 *
 * 2. put_prev_task() - when descheduling current task
 *    └─> put_prev_entity() -> update_curr()
 *
 * 3. pick_next_task() - when selecting new task
 *    └─> pick_next_entity() -> set_next_entity()
 *        -> update_curr() [for prev task]
 *
 * 4. task state changes - fork, exec, nice change
 *    └─> Various paths call update_curr()
 */
```

On Wakeup: Sleeper Placement
When a sleeping task wakes up, its vruntime must be adjusted. The task's old vruntime might be far behind current tasks, which would allow it to monopolize the CPU unfairly. CFS handles this with place_entity():
```c
/*
 * place_entity - Position a waking or forking task in the vruntime timeline
 *
 * This function balances two competing goals:
 * 1. Sleepers should get some credit for sleeping (responsiveness)
 * 2. Sleepers shouldn't starve runners (fairness)
 */
static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
			 int initial)
{
	u64 vruntime = cfs_rq->min_vruntime;

	/*
	 * For newly forked tasks: start ahead of min_vruntime
	 * by one expected time slice. This prevents fork bombs
	 * from starving existing tasks.
	 */
	if (initial && sched_feat(START_DEBIT))
		vruntime += sched_vslice(cfs_rq, se);

	/*
	 * For waking tasks: give a bounded credit.
	 *
	 * GENTLE_FAIR_SLEEPERS: Place at min_vruntime - (latency/2)
	 * This gives sleepers priority (they're "behind")
	 * but limits how much priority (half a latency period)
	 */
	if (!initial && sched_feat(GENTLE_FAIR_SLEEPERS)) {
		u64 thresh = sysctl_sched_latency >> 1;

		/*
		 * Clamp the credit: don't let vruntime go too far
		 * back from min_vruntime
		 */
		if (sched_feat(BASE_SLICE))
			thresh = calc_delta_fair(thresh, se);

		vruntime -= thresh;
	}

	/*
	 * CRITICAL: Never let vruntime go backward from its natural position
	 *
	 * If task's vruntime is already ahead of where we'd place it,
	 * leave it alone. This handles tasks that blocked briefly
	 * and woke up before "earning" any significant credit.
	 */
	se->vruntime = max_vruntime(se->vruntime, vruntime);
}

/*
 * The result:
 * - Long sleepers get capped credit (half a latency period)
 * - Brief sleepers aren't penalized for their vruntime position
 * - New forks start slightly behind to prevent fork bombs
 */
```

The sleeper credit is controversial. Too much credit and interactive tasks can starve CPU-bound tasks. Too little and interactive responsiveness suffers. The 'half a latency period' value was chosen empirically to balance desktop interactivity (where it helps) against server throughput (where it can hurt). Administrators can tune sched_latency to adjust this balance.
Kernel code must be both precise and efficient. CFS's vruntime calculations involve division by weights, which is expensive in CPU cycles. The kernel uses several techniques to optimize this.
Avoiding Division: Multiply and Shift
Division by weight can be converted to multiplication by inverse weight:
value / weight = value * inv_weight >> 32
The kernel precomputes inv_weight = 2^32 / weight for each nice level. This converts expensive divisions into fast multiplications and bit shifts.
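A small check (a sketch, using Python integers in place of the kernel's u64 math) that multiplying by the precomputed inverse weight and shifting right by 32 reproduces the division to within rounding error:

```python
# Sketch: division by weight replaced with multiply-by-inverse and shift.
NICE_0_LOAD = 1024
WMULT_SHIFT = 32

def inv_weight(weight: int) -> int:
    """Precomputed inverse: 2^32 // weight (as in sched_prio_to_wmult)."""
    return (1 << WMULT_SHIFT) // weight

def calc_delta_div(delta_exec: int, weight: int) -> int:
    """Reference form: direct division."""
    return delta_exec * NICE_0_LOAD // weight

def calc_delta_mulshift(delta_exec: int, weight: int) -> int:
    """Optimized form: multiply by inverse weight, then shift right by 32."""
    return (delta_exec * NICE_0_LOAD * inv_weight(weight)) >> WMULT_SHIFT

for weight in (88761, 3121, 1024, 820, 15):
    d = calc_delta_div(4_000_000, weight)       # 4ms of execution
    m = calc_delta_mulshift(4_000_000, weight)
    print(f"weight={weight:6d}: div={d:12d}  mulshift={m:12d}  diff={d - m}")
```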
Handling 32-bit Systems
On 32-bit systems, 64-bit arithmetic is slower (requires multiple instructions). The kernel includes optimized paths for 32-bit platforms, using careful overflow detection and multi-word arithmetic.
Nanosecond Resolution
CFS uses nanoseconds for all time values. This provides sufficient resolution for modern CPUs executing billions of instructions per second, while keeping values in human-comprehensible units for debugging.
```c
/*
 * Optimized delta calculation using inverse weights
 *
 * Goal: Calculate delta_exec * NICE_0_LOAD / weight efficiently
 *
 * Method: Use precomputed inv_weight = 2^32 / weight
 *         Result = delta_exec * NICE_0_LOAD * inv_weight >> 32
 */

#define WMULT_SHIFT	32
#define WMULT_CONST	(~0UL)

/*
 * Ensure inv_weight is up to date
 * This is called lazily when weight changes
 */
static void __update_inv_weight(struct load_weight *lw)
{
	unsigned long w;

	if (likely(lw->inv_weight))
		return;

	w = scale_load_down(lw->weight);

	if (BITS_PER_LONG > 32 && unlikely(w >= WMULT_CONST))
		lw->inv_weight = 1;
	else if (unlikely(!w))
		lw->inv_weight = WMULT_CONST;
	else
		lw->inv_weight = WMULT_CONST / w;
}

/*
 * mul_u64_u32_shr - Multiply 64-bit by 32-bit and shift right
 *
 * This is architecture-optimized on most platforms
 */
static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
{
	/*
	 * On 64-bit: single multiply instruction + shift
	 * On 32-bit: software 64*32 multiply
	 * (simplified excerpt; assumes shift == 32 as used by CFS)
	 */
#if BITS_PER_LONG == 64
	return (a * mul) >> shift;
#else
	u32 ah = a >> 32;
	u32 al = a;
	u64 ret;

	ret = ((u64)ah * mul) << (32 - shift);
	if (shift > 32)
		ret >>= shift - 32;
	else
		ret += ((u64)al * mul) >> shift;

	return ret;
#endif
}

/*
 * Performance characteristics:
 *
 * calc_delta_fair() on modern x86-64:
 * - No division instructions (avoided via inv_weight)
 * - 2-3 multiplications
 * - 1-2 shifts
 * - Total: ~5-10 cycles
 *
 * This is called every scheduler tick and task switch,
 * so the optimization matters significantly.
 */
```

Timing Sources
The accuracy of vruntime depends on accurate time measurement, and Linux selects among several clock sources depending on context. CFS uses rq_clock_task(), which provides a stable, monotonic nanosecond timestamp suitable for accounting.
Using nanoseconds ensures that even microsecond-scale execution times are accurately tracked. On a 4GHz CPU, 1 nanosecond is about 4 cycles—enough resolution for fine-grained accounting. The 64-bit vruntime can represent 292 years of nanoseconds, far exceeding any practical system lifetime.
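For completeness, the "292 years" figure follows directly from the size of a signed 64-bit nanosecond counter; a one-line check:

```python
# Sketch: overflow horizon of a signed 64-bit nanosecond counter.
NS_PER_YEAR = 365.25 * 24 * 3600 * 1e9
print(f"{(2**63) / NS_PER_YEAR:.1f} years")   # ~292.3 years
```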
Understanding vruntime dynamics is easier with visualization. Let's trace through a concrete scenario showing how vruntimes evolve over time.
Scenario: Three Tasks with Different Weights
Task A has weight 1024 (nice 0), Task B has weight 2048, and Task C has weight 512. Expected behavior: over any time period, B should get 2× the CPU of A, and A should get 2× the CPU of C, while their vruntimes stay approximately equal.
"""Visualization of CFS Virtual Runtime Dynamics This script simulates CFS scheduling and shows howvruntimes evolve to achieve proportional fairness.""" import matplotlib.pyplot as pltfrom dataclasses import dataclass NICE_0_LOAD = 1024TIME_SLICE_MS = 4 @dataclassclass Task: name: str weight: int vruntime: float = 0.0 actual_runtime: float = 0.0 def vruntime_rate(self) -> float: """How fast vruntime increases per ms of actual runtime""" return NICE_0_LOAD / self.weight def simulate_cfs(tasks: list, total_time_ms: int): """ Simulate CFS scheduling for given tasks and duration. Returns history for visualization. """ history = {t.name: {'vruntime': [], 'actual': [], 'time': []} for t in tasks} current_time = 0 while current_time < total_time_ms: # CFS picks task with minimum vruntime task = min(tasks, key=lambda t: t.vruntime) # Run for one time slice slice_ms = TIME_SLICE_MS task.actual_runtime += slice_ms task.vruntime += slice_ms * task.vruntime_rate() current_time += slice_ms # Record state for t in tasks: history[t.name]['time'].append(current_time) history[t.name]['vruntime'].append(t.vruntime) history[t.name]['actual'].append(t.actual_runtime) return history def visualize(): tasks = [ Task('A (weight=1024)', 1024), Task('B (weight=2048)', 2048), Task('C (weight=512)', 512), ] history = simulate_cfs(tasks, 1000) # 1 second fig, axes = plt.subplots(2, 1, figsize=(12, 8)) # Plot vruntimes - should converge/stay close ax1 = axes[0] for name, data in history.items(): ax1.plot(data['time'], data['vruntime'], label=name) ax1.set_xlabel('Wall Clock Time (ms)') ax1.set_ylabel('Virtual Runtime') ax1.set_title('Virtual Runtime Over Time (All Tasks Stay Close)') ax1.legend() ax1.grid(True, alpha=0.3) # Plot actual runtimes - should differ by weight ratios ax2 = axes[1] for name, data in history.items(): ax2.plot(data['time'], data['actual'], label=name) ax2.set_xlabel('Wall Clock Time (ms)') ax2.set_ylabel('Actual Runtime (ms)') ax2.set_title('Actual Runtime Over Time (B gets 2x of A, A gets 2x of C)') ax2.legend() ax2.grid(True, alpha=0.3) plt.tight_layout() plt.savefig('cfs_vruntime_visualization.png', dpi=150) # Print final statistics print("\nFinal Statistics after 1000ms:") print("-" * 50) total_weight = sum(t.weight for t in tasks) for t in tasks: expected = t.weight / total_weight * 1000 print(f"{t.name}:") print(f" Actual runtime: {t.actual_runtime:.1f}ms " f"(expected {expected:.1f}ms)") print(f" Final vruntime: {t.vruntime:.1f}") print(f" vruntime rate: {t.vruntime_rate():.3f}") # Verify vruntimes are close vruntimes = [t.vruntime for t in tasks] vrt_range = max(vruntimes) - min(vruntimes) print(f"\nVruntime spread: {vrt_range:.1f} " f"(should be small, ~{TIME_SLICE_MS} * max_rate)") if __name__ == "__main__": visualize()Key Observations from the Visualization
Vruntimes Track Together: Despite different actual runtimes, all three tasks maintain approximately equal vruntimes. This is the fairness guarantee in action.
Higher Weight = More CPU: Task B (weight 2048) accumulates actual runtime twice as fast as Task A, and four times as fast as Task C.
Vruntime Slopes Differ: Task C's vruntime increases fastest (steep slope), Task B's increases slowest (gentle slope). Each task advances its virtual clock at a different rate, yet all cover the same virtual 'distance.'
Scheduler Oscillation: The scheduler bounces between tasks as vruntimes leapfrog each other. This creates the fair interleaving pattern.
The visualization reveals CFS's core insight: by normalizing time differently for each task, the scheduler can use a single metric (minimum vruntime) to make fair decisions. There's no need for separate priority queues, complex heuristics, or interactivity detection—fairness emerges naturally from the mathematics.
Virtual runtime is the conceptual heart of CFS—the abstraction that makes proportional fairness achievable and efficient. Understanding vruntime is essential for anyone seeking to comprehend Linux scheduling at a deep level.
What's Next
The next page explores the red-black tree—the data structure that makes CFS's vruntime-based scheduling efficient. We'll see how this self-balancing binary search tree enables O(1) minimum selection and O(log n) insertions, and why it's perfectly suited to CFS's access patterns.
You now understand virtual runtime—the normalized time metric that enables CFS to achieve proportional fairness efficiently. This understanding is foundational for comprehending not just Linux scheduling, but the broader class of weighted fair queueing algorithms used throughout systems software.