Throughout this module, we've discussed 'load', 'imbalance', and 'overload' without precisely defining how these are measured. This final page addresses the foundational question: How do we quantify load and evaluate balance quality?
The choice of metrics profoundly affects scheduler behavior. A metric that emphasizes queue length produces different decisions than one emphasizing CPU utilization or weighted task priority. Understanding the available metrics—their strengths, limitations, and implementation—enables you to reason about scheduler behavior and make informed tuning decisions.
This page provides a comprehensive exploration of load balancing metrics: from simple counts to sophisticated weighted averages, from instantaneous measurements to temporally-decayed tracking, and from single-CPU metrics to system-wide balance indicators.
By completing this page, you will understand: (1) The fundamental metrics for quantifying CPU load, (2) Per-Entity Load Tracking (PELT) and other decay mechanisms, (3) Imbalance calculation and balance quality indicators, (4) NUMA-aware metrics and capacity-normalized load, and (5) Observability and monitoring of load balance effectiveness.
Before we can balance load, we must define what 'load' means. Several fundamental metrics capture different aspects of system demand.
Run Queue Length
The simplest metric: count the number of runnable tasks on each CPU's queue.
Load(CPU) = count of tasks in RUNNABLE state on CPU's queue
Advantages: Simple, fast to compute, intuitive. Disadvantages: Ignores task priority, overhead, and CPU utilization variations.
```c
/* Run queue length as load metric */

/* Simple count of runnable tasks */
unsigned int queue_length_load(struct run_queue *rq)
{
    return rq->nr_running;
}

/* Check for imbalance using queue length */
bool queue_length_imbalanced(void)
{
    int max_length = 0, min_length = INT_MAX;
    int cpu;

    for_each_online_cpu(cpu) {
        unsigned int length = queue_length_load(cpu_rq(cpu));
        max_length = max(max_length, (int)length);
        min_length = min(min_length, (int)length);
    }

    /* Imbalanced if max exceeds min by threshold */
    return (max_length - min_length) > IMBALANCE_THRESHOLD;
}

/* Limitation: A CPU with 10 nice +19 tasks appears the same as
 * a CPU with 10 nice -20 tasks, despite vastly different
 * actual processing demand */
```

Weighted Load
Account for task priority by weighting each task's contribution:
Load(CPU) = Σ weight(task) for each runnable task
Higher-priority tasks contribute more to load, reflecting their greater demand on CPU time.
```c
/* Priority-weighted load calculation */

/* Linux's nice-to-weight conversion table (simplified) */
/* nice 0 = weight 1024, each nice increment ~= 1.25x change */
static const int nice_to_weight[40] = {
 /* -20 */ 88761, 71755, 56483, 46273, 36291,
 /* -15 */ 29154, 23254, 18705, 14949, 11916,
 /* -10 */  9548,  7620,  6100,  4904,  3906,
 /*  -5 */  3121,  2501,  1991,  1586,  1277,
 /*   0 */  1024,   820,   655,   526,   423,
 /*   5 */   335,   272,   215,   172,   137,
 /*  10 */   110,    87,    70,    56,    45,
 /*  15 */    36,    29,    23,    18,    15,
};

/* Get weight for a task based on nice value */
unsigned long task_weight(struct task_struct *p)
{
    int nice = task_nice(p);  /* -20 to +19 */
    int idx = nice + 20;      /* Convert to 0-39 index */

    return nice_to_weight[idx];
}

/* Calculate weighted load for a run queue */
unsigned long weighted_load(struct run_queue *rq)
{
    unsigned long total = 0;
    struct task_struct *p;

    list_for_each_entry(p, &rq->tasks, run_list) {
        total += task_weight(p);
    }
    return total;
}

/* This is what CFS uses as the basis for load balancing */
/* A nice -20 task contributes ~6000x more than a nice +19 task */
```

| Metric | Accounts For | Complexity | Use Case |
|---|---|---|---|
| Queue Length | Task count only | O(1) | Simple systems, quick checks |
| Weighted Load | Priority differences | O(n) | CFS-style fair scheduling |
| CPU Utilization | Actual time consumed | N/A (sampled) | Performance monitoring |
| Runnable Time | Queue wait time | O(n) | Latency-focused scheduling |
CPU Utilization
Measure actual CPU consumption rather than queue state:
Utilization(CPU) = time_busy / (time_busy + time_idle) over period
Advantages: Captures actual demand, not just potential demand. Disadvantages: Lagging indicator (measures past, not current), doesn't predict future load.
Queue length is a leading indicator—it shows demand waiting to be served. Utilization is a lagging indicator—it shows demand that was served. For proactive balancing, leading indicators are more useful, but utilization helps validate that balancing is working.
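As a sketch of how such a utilization metric could be maintained (illustrative bookkeeping only, not a specific kernel API; the `cpu_stat` and `account_tick` names are made up for this example), each tick is attributed to either busy or idle time and the ratio is computed over a sampling window:

```c
/* Illustrative utilization tracking per the formula above. */
#include <stdint.h>
#include <stdbool.h>

struct cpu_stat {
    uint64_t busy_ns;   /* Time spent running tasks in this window */
    uint64_t idle_ns;   /* Time spent in the idle loop in this window */
};

/* Called from the periodic tick: attribute the elapsed time to busy or idle */
static void account_tick(struct cpu_stat *st, uint64_t tick_ns, bool was_idle)
{
    if (was_idle)
        st->idle_ns += tick_ns;
    else
        st->busy_ns += tick_ns;
}

/* Utilization in percent over the window, then reset for the next window */
static unsigned int utilization_pct(struct cpu_stat *st)
{
    uint64_t total = st->busy_ns + st->idle_ns;
    unsigned int pct = total ? (unsigned int)(st->busy_ns * 100 / total) : 0;

    st->busy_ns = st->idle_ns = 0;   /* Start a fresh window */
    return pct;
}
```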
Instantaneous load measurements are noisy—a task may run for 1ms then sleep for 99ms. Using instantaneous load would see the CPU as 'fully loaded' during that 1ms. Temporal averaging smooths these fluctuations.
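To make the smoothing effect concrete, here is a small self-contained simulation of the 1ms-run/99ms-sleep task described above. It uses a plain exponentially weighted moving average with a 32ms half-life (the same half-life used by the kernel mechanism described next), not the kernel's actual algorithm:

```c
/* Illustrative only: exponential smoothing of a bursty busy/idle signal.
 * Compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Per-millisecond decay factor for a 32ms half-life: 2^(-1/32) */
    const double decay = pow(0.5, 1.0 / 32.0);
    double smoothed = 0.0;

    for (int ms = 0; ms < 1000; ms++) {
        /* Instantaneous signal: the CPU is busy for 1ms out of every 100ms */
        double instantaneous = (ms % 100 == 0) ? 1.0 : 0.0;

        smoothed = smoothed * decay + instantaneous * (1.0 - decay);
        if (ms % 100 == 50)
            printf("t=%3dms  smoothed load = %.1f%%  (instantaneous swings 0%%..100%%)\n",
                   ms, smoothed * 100.0);
    }
    /* The smoothed value settles in the low single digits, close to the
     * task's true ~1% demand, instead of flapping between idle and busy. */
    return 0;
}
```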
Per-Entity Load Tracking (PELT)
Linux's CFS uses PELT—a sophisticated exponential decay tracking mechanism that computes temporally-weighted averages for each scheduling entity (task or group).
```c
/* PELT: Per-Entity Load Tracking */

/* PELT decays load with a half-life of approximately 32ms */
/* This means: after 32ms, a contribution decays to 50%     */
/*             after 64ms, a contribution decays to 25%     */
/*             after 96ms, a contribution decays to 12.5%   */

#define LOAD_AVG_PERIOD     32                              /* ms, decay half-life */
#define LOAD_AVG_PERIOD_NS  (LOAD_AVG_PERIOD * 1000000ULL)  /* same period in ns */
#define LOAD_AVG_MAX        47742                           /* Maximum load_avg value */

struct sched_avg {
    /* Running average of runnable time */
    unsigned long load_avg;
    /* Running average of running time */
    unsigned long runnable_avg;
    /* Running average of utilization */
    unsigned long util_avg;

    /* Period tracking for decay */
    u64 last_update_time;
    u32 period_contrib;
};

/* Update PELT averages - called on each scheduler tick */
void update_load_avg(struct sched_entity *se, struct run_queue *rq, int running)
{
    struct sched_avg *sa = &se->avg;
    u64 now = rq_clock(rq);
    u64 delta_time = now - sa->last_update_time;
    u32 periods, contrib;

    if (delta_time == 0)
        return;

    /* Core formula: new_avg = old_avg * decay + contribution * (1 - decay) */

    /* Step 1: Decay existing averages for each full period elapsed */
    periods = delta_time / LOAD_AVG_PERIOD_NS;
    if (periods > 0) {
        sa->load_avg     = decay_load(sa->load_avg, periods);
        sa->runnable_avg = decay_load(sa->runnable_avg, periods);
        sa->util_avg     = decay_load(sa->util_avg, periods);
    }

    /* Step 2: Add contribution from the current period */
    contrib = calculate_contribution(delta_time, running);
    if (se->on_rq) {
        sa->load_avg += contrib * se->load.weight / LOAD_AVG_MAX;
        sa->runnable_avg += contrib;
    }
    if (running) {
        sa->util_avg += contrib;
    }

    sa->last_update_time = now;
}

/* Geometric decay: load * (1/2)^(periods/32) */
static inline unsigned long decay_load(unsigned long load, int periods)
{
    /* Pre-computed fractional decay values for efficiency:
     * decay_table[i] ~= (1/2)^(i/32) * 2^32 (entry 0 saturates at 2^32 - 1) */
    static const u32 decay_table[32] = {
        4294967295UL, 4264570326UL, 4234504929UL, /* ... */
    };

    if (periods >= 2016)
        return 0;               /* Fully decayed */

    /* Each full block of 32 periods halves the load */
    while (periods >= 32) {
        load >>= 1;
        periods -= 32;
    }
    /* Fractional decay for the remaining periods */
    if (periods > 0)
        load = ((u64)load * decay_table[periods]) >> 32;

    return load;
}
```

Why Exponential Decay?
Exponential decay provides several desirable properties: recent behavior dominates the average, older contributions fade smoothly rather than dropping off abruptly, and the average can be updated incrementally in O(1) time without storing a history of samples.
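To see where the LOAD_AVG_MAX constant used above comes from, the sketch below (a standalone illustration, not kernel code) sums the geometric series produced by a 32-period half-life: an entity that is runnable in every period accumulates at most about 1024 / (1 - y) with y = 2^(-1/32), which lands near the 47742 figure; the kernel's fixed-point arithmetic accounts for the small remaining difference.

```c
/* Illustrative only: derive the PELT saturation value from the 32-period
 * half-life. Compile with -lm. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double y = pow(0.5, 1.0 / 32.0);   /* per-period decay, ~0.97857 */
    double sum = 0.0;

    /* A task runnable in every period contributes 1024 * y^n for the period
     * n steps in the past; the total is a geometric series. */
    for (int n = 0; n < 2000; n++)
        sum += 1024.0 * pow(y, n);

    printf("per-period decay y      = %.5f\n", y);
    printf("accumulated maximum     = %.0f\n", sum);
    printf("closed form 1024/(1-y)  = %.0f\n", 1024.0 / (1.0 - y));
    return 0;
}
```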
The Three PELT Metrics
PELT tracks three distinct averages for each entity:
| Metric | What It Tracks | Used For |
|---|---|---|
| load_avg | Priority-weighted runnable time | Load balancing decisions |
| runnable_avg | Total runnable time (unweighted) | Capacity planning |
| util_avg | Actual running time | DVFS (CPU frequency scaling) |
```c
/* How PELT metrics guide scheduler decisions */

/* Load balancing uses load_avg */
unsigned long task_load_for_balancing(struct task_struct *p)
{
    return p->se.avg.load_avg;
}

/* CPU frequency scaling uses util_avg */
unsigned long cpu_util_for_dvfs(int cpu)
{
    return cpu_rq(cpu)->cfs.avg.util_avg;
}

/* Example: balance decision based on PELT */
bool should_migrate_task(struct task_struct *p,
                         struct run_queue *src_rq,
                         struct run_queue *dst_rq)
{
    unsigned long task_load = task_load_for_balancing(p);
    unsigned long src_load = src_rq->cfs.avg.load_avg;
    unsigned long dst_load = dst_rq->cfs.avg.load_avg;

    /* Migration improves balance if:
     * (src_load - task_load) is closer to (dst_load + task_load)
     * than src_load is to dst_load */
    unsigned long current_imbalance = abs_diff(src_load, dst_load);
    unsigned long new_imbalance = abs_diff(src_load - task_load,
                                           dst_load + task_load);

    /* The move must actually reduce imbalance, and by a minimum
     * amount that justifies the migration cost */
    return new_imbalance < current_imbalance &&
           (current_imbalance - new_imbalance) > MIN_BALANCE_IMPROVEMENT;
}
```

The 32ms half-life is a compromise. A shorter half-life is more responsive but noisier; a longer one is more stable but slower to adapt. Some researchers have proposed an adaptive half-life that shortens during high activity and lengthens during stability, and some kernel builds expose the decay period as a tunable.
With per-CPU load defined, we can quantify system-wide imbalance. Several formulations capture different aspects of load distribution quality.
Simple Imbalance: Max - Min
The most intuitive definition:
Imbalance = max(Load(CPU)) - min(Load(CPU))
Simple, but ignores intermediate CPUs—doesn't distinguish between 'one CPU overloaded' vs. 'several CPUs overloaded'.
```c
/* Various imbalance metrics */

/* Simple max-min imbalance */
unsigned long max_min_imbalance(void)
{
    unsigned long max_load = 0, min_load = ULONG_MAX;
    int cpu;

    for_each_online_cpu(cpu) {
        unsigned long load = cpu_load(cpu);
        max_load = max(max_load, load);
        min_load = min(min_load, load);
    }
    return max_load - min_load;
}

/* Variance-based imbalance - accounts for all CPUs */
unsigned long variance_imbalance(void)
{
    unsigned long total = 0, sq_total = 0;
    int count = 0;
    int cpu;

    for_each_online_cpu(cpu) {
        unsigned long load = cpu_load(cpu);
        total += load;
        sq_total += load * load;
        count++;
    }

    unsigned long mean = total / count;
    unsigned long variance = (sq_total / count) - (mean * mean);

    return int_sqrt(variance);  /* Standard deviation */
}

/* What CFS actually uses: group-based imbalance */
unsigned long cfs_imbalance(struct sched_domain *sd)
{
    struct sched_group *busiest = find_busiest_group(sd);
    struct sched_group *local = sd->groups;  /* Local group */

    if (!busiest || busiest == local) {
        return 0;  /* No imbalance */
    }

    unsigned long busiest_avg = busiest->load_avg / busiest->nr_cpus;
    unsigned long local_avg = local->load_avg / local->nr_cpus;

    if (busiest_avg <= local_avg) {
        return 0;  /* Local group is busier or equal */
    }

    /* Imbalance is the excess that should move to local */
    return (busiest_avg - local_avg) * local->nr_cpus;
}
```

Group-Based Imbalance
CFS organizes CPUs into scheduling groups (reflecting NUMA nodes, sockets, etc.) and computes imbalance between groups rather than individual CPUs. This reduces noise from individual CPU fluctuations and matches the migration cost structure—balancing within a group is cheaper than across groups.
Imbalance Threshold
Not every imbalance warrants action. The scheduler defines minimum thresholds below which imbalance is tolerated:
```c
/* Imbalance thresholds and action triggers */

/* CFS uses imbalance_pct per scheduling domain */
struct sched_domain {
    /* ... */
    unsigned int imbalance_pct;  /* Percentage, with 100 = perfectly balanced */
    /* imbalance_pct = 117 means a 17% imbalance triggers action */
};

/* Check if imbalance exceeds threshold */
bool imbalance_exceeds_threshold(struct sched_domain *sd,
                                 unsigned long imbalance,
                                 unsigned long avg_load)
{
    /* Threshold is the percentage above a balanced load */
    unsigned long threshold = avg_load * (sd->imbalance_pct - 100) / 100;

    /* Also enforce a minimum absolute threshold */
    threshold = max(threshold, sd->min_imbalance);

    return imbalance > threshold;
}

/* Typical imbalance_pct values by domain level */
/* SMT:  110 (10% imbalance triggers) - cheap migration     */
/* MC:   125 (25% imbalance triggers) - moderate cost       */
/* NUMA: 133 (33% imbalance triggers) - expensive migration */

/* The threshold prevents thrashing on minor fluctuations */
```

Imbalance thresholds implicitly provide statistical significance for balance decisions. Small imbalances could be noise (random fluctuation in task behavior). Only when the imbalance exceeds the threshold—typically the equivalent of 2-3 standard deviations—do we act. This is similar to hypothesis testing in statistics.
So far, we've assumed all CPUs have equal capacity. Modern systems increasingly feature heterogeneous CPUs (different speeds, capabilities) requiring capacity-adjusted metrics.
CPU Capacity
Capacity represents a CPU's processing power relative to a baseline:
Capacity(CPU) = (CPU's max throughput) / (reference CPU's max throughput)
A 'big' core in ARM big.LITTLE might have capacity 1024, while a 'LITTLE' core has capacity 512.
```c
/* CPU capacity and capacity-adjusted load */

/* Per-CPU capacity (normalized, 1024 = standard core) */
DEFINE_PER_CPU(unsigned long, cpu_capacity);

/* Sources of capacity variation:
 * 1. Heterogeneous cores (big.LITTLE)
 * 2. Current CPU frequency (DVFS)
 * 3. Thermal throttling
 * 4. Architecture differences
 */

/* Initialize capacity at boot based on hardware */
void init_cpu_capacity(int cpu)
{
    struct cpuinfo *info = &cpu_data(cpu);
    unsigned long capacity = SCHED_CAPACITY_SCALE;  /* 1024 = baseline */

    /* Adjust for max frequency relative to the fastest CPU */
    capacity = capacity * info->max_freq / reference_max_freq;

    /* Adjust for IPC (instructions per cycle) differences */
    if (info->core_type == CORE_LITTLE) {
        capacity = capacity * LITTLE_IPC_RATIO / 100;
    }

    per_cpu(cpu_capacity, cpu) = capacity;
}

/* Runtime capacity update for frequency changes */
void update_cpu_capacity_for_freq(int cpu, unsigned long new_freq)
{
    unsigned long base_capacity = per_cpu(cpu_base_capacity, cpu);
    unsigned long max_freq = per_cpu(cpu_max_freq, cpu);

    per_cpu(cpu_capacity, cpu) = base_capacity * new_freq / max_freq;

    /* Trigger balance reconsideration - capacities changed */
    set_balance_needed(cpu);
}

/* Capacity-normalized load: what fraction of capacity is used? */
unsigned long capacity_normalized_load(int cpu)
{
    unsigned long load = cpu_load(cpu);
    unsigned long capacity = per_cpu(cpu_capacity, cpu);

    /* normalized_load = load * SCHED_CAPACITY_SCALE / capacity */
    return (load << SCHED_CAPACITY_SHIFT) / capacity;
}
```

Capacity-Aware Balancing
With capacity known, balancing compares utilized fraction rather than absolute load:
```c
/* Capacity-aware load balancing */

/* Compare CPUs by utilized fraction, not absolute load */
bool cpu_overloaded_for_capacity(int cpu)
{
    unsigned long load = cpu_load(cpu);
    unsigned long capacity = cpu_capacity_of(cpu);

    /* Overloaded if load exceeds 80% of capacity */
    return (load * 100 / capacity) > CAPACITY_OVERLOAD_PCT;
}

/* Find best destination accounting for capacity */
int find_best_destination_capacity_aware(struct task_struct *p)
{
    int best_cpu = -1;
    unsigned long best_spare_capacity = 0;
    int cpu;

    for_each_cpu(cpu, &p->cpus_allowed) {
        unsigned long load = cpu_load(cpu);
        unsigned long capacity = cpu_capacity_of(cpu);
        unsigned long task_load = task_load_for_balancing(p);

        /* Would adding this task overload the CPU? */
        if (load + task_load > capacity) {
            continue;  /* Would exceed capacity */
        }

        /* Calculate spare capacity after adding the task */
        unsigned long spare = capacity - (load + task_load);

        if (spare > best_spare_capacity) {
            best_spare_capacity = spare;
            best_cpu = cpu;
        }
    }
    return best_cpu;
}

/* Energy-aware scheduling: prefer efficient CPUs */
int find_energy_efficient_cpu(struct task_struct *p)
{
    int best_cpu = -1;
    unsigned long best_energy = ULONG_MAX;
    int cpu;

    for_each_cpu(cpu, &p->cpus_allowed) {
        unsigned long energy = estimate_cpu_energy(cpu, p);

        if (energy < best_energy && cpu_has_capacity(cpu, p)) {
            best_energy = energy;
            best_cpu = cpu;
        }
    }
    return best_cpu;
}
```

ARM's Energy Aware Scheduler (EAS) integrates with capacity tracking to make energy-optimal placement decisions. Small tasks go to 'LITTLE' cores (lower power). Large tasks go to 'big' cores (faster). This capacity-aware approach extends to mobile and embedded systems where power matters as much as performance.
Non-Uniform Memory Access (NUMA) architectures add another dimension: memory locality. Effective NUMA-aware scheduling requires metrics that capture both CPU load and memory placement.
Memory Placement Score
Track what fraction of a task's memory accesses are local vs. remote:
```c
/* NUMA-aware load and placement metrics */

struct task_numa_stats {
    /* Page access tracking */
    unsigned long local_faults;   /* Pages accessed on local node */
    unsigned long remote_faults;  /* Pages accessed on remote node */

    /* Per-node access counts */
    unsigned long faults[MAX_NUMA_NODES];

    /* Preferred node based on memory access pattern */
    int preferred_node;

    /* NUMA scanning period */
    unsigned long scan_period;
};

/* Calculate locality score for a task on a given node */
unsigned long task_locality_score(struct task_struct *p, int node)
{
    struct task_numa_stats *numa = &p->numa_stats;
    unsigned long total_faults = numa->local_faults + numa->remote_faults;

    if (total_faults == 0) {
        return 0;  /* No data yet */
    }

    unsigned long node_faults = numa->faults[node];

    /* Score = fraction of accesses that would be local if on this node */
    return (node_faults * NUMA_LOCALITY_MAX) / total_faults;
}

/* Find best node for a task based on memory access pattern */
int find_best_numa_node(struct task_struct *p)
{
    int best_node = numa_node_id();  /* Default to current */
    unsigned long best_score = 0;
    int node;

    for_each_online_node(node) {
        unsigned long score = task_locality_score(p, node);

        if (score > best_score) {
            best_score = score;
            best_node = node;
        }
    }

    /* Only prefer a different node if it is significantly better */
    if (best_score > task_locality_score(p, numa_node_id()) * 130 / 100) {
        return best_node;  /* 30% improvement threshold */
    }
    return numa_node_id();
}
```

Combined CPU + NUMA Metrics
The scheduler must balance CPU load against memory locality. Sometimes the optimal choice is a busier CPU on the right NUMA node:
```c
/* Combined CPU load and NUMA locality scoring */

struct migration_score {
    unsigned long cpu_score;       /* Higher is better (more spare capacity) */
    unsigned long numa_score;      /* Higher is better (more local) */
    unsigned long combined_score;  /* Overall score */
};

/* Calculate combined migration score */
void calculate_migration_score(struct task_struct *p, int dst_cpu,
                               struct migration_score *score)
{
    int dst_node = cpu_to_node(dst_cpu);

    /* CPU score: percentage of spare capacity (less loaded = higher score) */
    unsigned long load = cpu_load(dst_cpu);
    unsigned long capacity = cpu_capacity_of(dst_cpu);
    score->cpu_score = (capacity - load) * 100 / capacity;

    /* NUMA score: locality percentage */
    score->numa_score = task_locality_score(p, dst_node);

    /* Combined score: weighted average.
     * The weighting depends on the task's memory intensity. */
    unsigned long mem_weight = task_memory_intensity(p);  /* 0-100 */
    unsigned long cpu_weight = 100 - mem_weight;

    score->combined_score = (score->cpu_score * cpu_weight +
                             score->numa_score * mem_weight) / 100;
}

/* Find best CPU considering both load and NUMA */
int find_best_cpu_numa_aware(struct task_struct *p)
{
    int best_cpu = task_cpu(p);  /* Default: stay put */
    struct migration_score best_score = { 0 };
    int cpu;

    calculate_migration_score(p, best_cpu, &best_score);

    for_each_cpu(cpu, &p->cpus_allowed) {
        struct migration_score score;

        calculate_migration_score(p, cpu, &score);

        /* Need a significant improvement (15% here) to justify migration */
        if (score.combined_score > best_score.combined_score * 115 / 100) {
            best_score = score;
            best_cpu = cpu;
        }
    }
    return best_cpu;
}
```

Tracking per-node page faults relies on the kernel's NUMA-hinting faults (page table modifications that make the next access trap) and regular scanning. This overhead is worthwhile for memory-bound workloads but wasteful for CPU-bound ones. Linux's automatic NUMA balancing can be disabled for workloads where it hurts more than it helps.
Metrics aren't just for the scheduler—they're essential for operators and developers to understand system behavior. Linux exposes rich scheduling statistics through several interfaces.
/proc/schedstat
Per-CPU and per-domain scheduling statistics:
```bash
#!/bin/bash
# Reading and interpreting /proc/schedstat
#
# Simplified field layout used in this example (the exact layout depends on
# the schedstat version; see Documentation/scheduler/sched-stats.rst):
#   cpu<N> <yld_count> <sched_count> <sched_goidle> \
#          <ttwu_count> <ttwu_local> <sum_exec_runtime> \
#          <sum_sleep_runtime>
#   domain<N> <balance_count> <balance_failed> \
#             <push_count> <push_failed> <pull_count> \
#             <pull_failed> <alb_count> <alb_failed>

# View raw schedstat
cat /proc/schedstat

# Parse key metrics for each CPU
awk '/^cpu/ {
    cpu = $1
    sched_count = $3
    goidle = $4
    ttwu = $5
    local_ttwu = $6
    local_pct = (ttwu > 0) ? (local_ttwu * 100 / ttwu) : 0
    printf "%s: scheduled %d times, went idle %d times, "\
           "wakeups: %d (%.1f%% local)\n",
           cpu, sched_count, goidle, ttwu, local_pct
}' /proc/schedstat

# Domain-level balance statistics
awk '/^domain/ {
    domain = $1
    balance_count = $2
    balance_fail = $3
    push = $4
    push_fail = $5
    pull = $6
    pull_fail = $7
    if (balance_count > 0) {
        fail_pct = (balance_fail * 100 / balance_count)
        printf "%s: %d balance attempts (%.1f%% failed), "\
               "push: %d/%d, pull: %d/%d\n",
               domain, balance_count, fail_pct,
               push - push_fail, push,
               pull - pull_fail, pull
    }
}' /proc/schedstat
```

perf sched
The perf tool provides detailed scheduler tracing and analysis:
```bash
#!/bin/bash
# Using perf sched for scheduler analysis

# Record scheduler events for 10 seconds
perf sched record -- sleep 10

# Analyze scheduler latencies
perf sched latency
# Output: Task latencies (how long tasks wait in the queue)
#         Per-task breakdown with max and average delay

# Show per-CPU scheduling activity
perf sched map
# Output: Visual timeline of task-to-CPU mapping
#         Shows migrations as tasks move between CPU columns

# Migration analysis
perf sched timehist --migrations
# Output: Per-event timeline including migration events
#         (which tasks moved, and between which CPUs)

# Detailed scheduler trace
perf sched script
# Output: Raw scheduler events with timestamps
#         sched:sched_switch, sched:sched_wakeup, etc.

# Example: Find excessive migrations
perf sched record -g -- ./my_application
perf sched timehist --migrations
# Alternatively, count migrations directly:
#   perf stat -e sched:sched_migrate_task -- ./my_application
# A high migration count may indicate poor affinity settings
```

Key Health Indicators
Monitor these metrics to assess scheduling health:
| Metric | Healthy Range | Problem Indication |
|---|---|---|
| Balance success rate | > 80% | Low = excessive failed attempts |
| Local wakeup % | > 70% | Low = poor affinity, cache loss |
| Migrations/sec | < 100/sec typical | High = thrashing |
| Run queue latency | < 10ms p99 | High = overload or imbalance |
| Idle time imbalance | < 10% difference | High = load imbalance |
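As one concrete example, the 'Idle time imbalance' row can be checked from userspace using only /proc/stat. The sketch below is illustrative (the 5-second sampling interval is an arbitrary choice, and the ~10-point spread is the rule of thumb from the table, not a kernel-defined threshold): it samples per-CPU idle counters twice and reports the spread in idle percentage.

```c
/* Illustrative idle-imbalance check: sample per-CPU idle time from /proc/stat
 * twice and compare the idle fraction across CPUs. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_CPUS 256

/* Fill total[] and idle[] with cumulative jiffies per CPU; return CPU count */
static int read_cpu_times(unsigned long long total[], unsigned long long idle[])
{
    FILE *f = fopen("/proc/stat", "r");
    char line[512];
    int n = 0;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f) && n < MAX_CPUS) {
        /* Only per-CPU lines ("cpu0 ...", "cpu1 ..."), not the aggregate "cpu" */
        if (strncmp(line, "cpu", 3) != 0 || !isdigit((unsigned char)line[3]))
            continue;

        unsigned long long v[10] = { 0 };
        int cpu;
        /* Fields: user nice system idle iowait irq softirq steal guest guest_nice */
        if (sscanf(line, "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
                   &cpu, &v[0], &v[1], &v[2], &v[3], &v[4],
                   &v[5], &v[6], &v[7], &v[8], &v[9]) < 5)
            continue;

        unsigned long long sum = 0;
        for (int i = 0; i < 10; i++)
            sum += v[i];
        total[n] = sum;
        idle[n] = v[3] + v[4];   /* idle + iowait */
        n++;
    }
    fclose(f);
    return n;
}

int main(void)
{
    unsigned long long t1[MAX_CPUS], i1[MAX_CPUS], t2[MAX_CPUS], i2[MAX_CPUS];
    int n = read_cpu_times(t1, i1);

    if (n <= 0)
        return 1;
    sleep(5);                       /* Sampling interval */
    if (read_cpu_times(t2, i2) != n)
        return 1;

    double max_idle = 0.0, min_idle = 100.0;
    for (int c = 0; c < n; c++) {
        double idle_pct = 100.0 * (double)(i2[c] - i1[c]) / (double)(t2[c] - t1[c]);
        if (idle_pct > max_idle) max_idle = idle_pct;
        if (idle_pct < min_idle) min_idle = idle_pct;
    }

    /* Per the table above, a spread of more than ~10 percentage points between
     * the most and least idle CPU suggests a load imbalance. */
    printf("idle%% spread: %.1f (max %.1f, min %.1f)\n",
           max_idle - min_idle, max_idle, min_idle);
    return 0;
}
```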
```python
#!/usr/bin/env python3
"""Scheduler health monitoring script.

Assumes the simplified /proc/schedstat field layout shown earlier on this
page; the exact layout depends on the schedstat version of your kernel.
"""

from collections import defaultdict


def parse_schedstat():
    """Parse /proc/schedstat for key metrics"""
    metrics = defaultdict(dict)

    with open('/proc/schedstat', 'r') as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            if parts[0].startswith('cpu'):
                cpu = parts[0]
                metrics[cpu] = {
                    'schedules': int(parts[2]),
                    'idle_entries': int(parts[3]),
                    'wakeups': int(parts[4]),
                    'local_wakeups': int(parts[5]),
                }
            elif parts[0].startswith('domain'):
                domain = parts[0]
                metrics[domain] = {
                    'balance_attempts': int(parts[1]),
                    'balance_failed': int(parts[2]),
                    'migrations': int(parts[3]) - int(parts[4]),
                }
    return metrics


def check_health(metrics):
    """Evaluate scheduler health from metrics"""
    issues = []

    # Check local wakeup percentage
    total_wakeups = sum(m.get('wakeups', 0)
                        for m in metrics.values() if 'wakeups' in m)
    local_wakeups = sum(m.get('local_wakeups', 0)
                        for m in metrics.values() if 'local_wakeups' in m)
    if total_wakeups > 0:
        local_pct = local_wakeups * 100 / total_wakeups
        if local_pct < 70:
            issues.append(f"Low local wakeup rate: {local_pct:.1f}%")

    # Check balance success rate
    for domain, stats in metrics.items():
        if 'balance_attempts' in stats and stats['balance_attempts'] > 100:
            success_rate = 100 - (stats['balance_failed'] * 100
                                  / stats['balance_attempts'])
            if success_rate < 80:
                issues.append(
                    f"{domain}: Low balance success rate: {success_rate:.1f}%"
                )

    return issues or ["All metrics healthy"]


if __name__ == "__main__":
    metrics = parse_schedstat()
    for issue in check_health(metrics):
        print(issue)
```

Integrate scheduler metrics into your monitoring stack (Prometheus, Grafana, etc.). Alert on anomalies like sudden migration spikes or degraded local wakeup rates. These often indicate workload changes or configuration issues that need attention.
Metrics form the foundation of intelligent load balancing—you cannot optimize what you cannot measure. The key insights from this page: load can be quantified as queue length, weighted load, or utilization, each with different trade-offs; PELT smooths these signals with exponential decay; imbalance is acted upon only when it exceeds per-domain thresholds; capacity and NUMA awareness adjust raw load for heterogeneous and multi-socket hardware; and interfaces like /proc/schedstat and perf sched make balance quality observable.
Module Complete: Load Balancing
With this page, we've completed our exploration of load balancing in multiprocessor systems. You now understand the mechanisms that move work between CPUs (push migration, pull migration, and work stealing), when and how often balancing should run, and the metrics used to drive and evaluate those decisions.
This knowledge equips you to reason about scheduler behavior, diagnose performance issues, and make informed tuning decisions across diverse multiprocessor systems.
Congratulations! You've mastered load balancing in multiprocessor operating systems. From mechanism (push/pull/work stealing) through timing (frequency) to measurement (metrics), you possess the conceptual foundation to understand, analyze, and optimize scheduling on modern hardware.