In an ideal world, work would naturally distribute evenly across all processors. Reality is messier: some processes are CPU-intensive while others sleep waiting for I/O; processes wake and sleep at different times; cores run at different speeds; and affinity constraints limit migration options.
Load balancing is the scheduler's mechanism for detecting and correcting workload imbalance—moving work from overloaded processors to underutilized ones. This sounds simple, but effective load balancing is one of the most challenging aspects of multi-processor scheduling. Balance too aggressively, and you destroy cache affinity; balance too conservatively, and CPUs sit idle while work waits.
By the end of this page, you will understand load balancing mechanisms in depth—how schedulers detect imbalance, push and pull migration strategies, scheduling domains that reflect hardware topology, work stealing algorithms used in parallel programming frameworks, and the tradeoffs that govern balancing frequency and aggressiveness. You'll be able to reason about load balancer behavior and tune systems for optimal balance.
With per-CPU run queues—the architecture all modern SMP schedulers use—load imbalance is inevitable. Processes are created, wake, sleep, and terminate asynchronously on different cores, leading to transient and sometimes persistent imbalance.
Sources of Imbalance:
1. Process Creation and Termination: When a new process is created, it's typically placed on the parent's CPU or a nearby CPU. Batch jobs may spawn many children on one CPU before the scheduler can intervene.
2. Asymmetric Wake Patterns: I/O completion, timer expirations, and signals wake processes. The CPU that handles the interrupt often becomes the wake target, concentrating ready processes.
3. Affinity Constraints: Hard affinity limits which CPUs can run certain processes. If many affinitized processes are constrained to the same CPUs, those CPUs become overloaded while others with different affinity targets remain underutilized.
4. Workload Phase Changes: Applications have phases: initialization, processing, cleanup. A process that was I/O-bound might suddenly become CPU-intensive, shifting load on its CPU.
5. Priority Preemption: High-priority processes preempt lower-priority work. Multiple high-priority processes on the same CPU create load concentration.
| Imbalance Scenario | Performance Impact | User-Visible Effect |
|---|---|---|
| One CPU overloaded, others idle | 50%+ throughput loss | Slow response, poor parallelization |
| Slight imbalance (10-20%) | Minor throughput loss | Usually unnoticeable |
| Persistent single-CPU bottleneck | System-wide stalls possible | Latency spikes, timeouts |
| Oscillating imbalance | Cache thrashing from migration | Variable, unpredictable performance |
| Cross-NUMA migration to correct imbalance | Remote memory access penalty | Increased memory latency |
Measuring Load:
Before load can be balanced, it must be measured. Different schedulers use different metrics:
Linux's CFS scheduler calculates load as the sum of per-task weights on each CPU's run queue. A higher-priority (lower-nice) process contributes more load than a lower-priority one.
Load and utilization are related but distinct. Utilization of 100% means the CPU is never idle but says nothing about queueing. Load of 2.0 on a single CPU means two processes are contending for one CPU—one runs while one waits. Understanding this distinction is essential for interpreting system metrics like Linux 'load average'.
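To make the distinction concrete, here is a toy, self-contained sketch (all numbers are invented for illustration) comparing the two metrics over one sampled interval:

```c
#include <stdio.h>

/* Toy model of one sampling interval on a single CPU, used only to
 * illustrate the difference between utilization and load. */
struct cpu_sample {
    double busy_ns;      /* time the CPU spent running tasks        */
    double interval_ns;  /* length of the sampling interval          */
    int    nr_runnable;  /* runnable tasks (running + waiting)       */
};

int main(void)
{
    /* CPU never idle (100% utilization), but two tasks are
     * contending: one runs while the other waits in the queue. */
    struct cpu_sample s = { 1e9, 1e9, 2 };

    double utilization = s.busy_ns / s.interval_ns;   /* 1.0 */
    double load        = (double)s.nr_runnable;       /* 2.0 */

    printf("utilization = %.0f%%, load = %.1f\n",
           utilization * 100.0, load);
    /* The same 100% utilization with load 1.0 would mean no queueing;
     * load 2.0 means each task effectively gets half the CPU. */
    return 0;
}
```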
Push migration is a load balancing strategy where overloaded CPUs actively push tasks to less-loaded CPUs. It's a proactive approach: the "heavy" CPU takes responsibility for shedding load.
How Push Migration Works:
```c
/* Conceptual push migration algorithm */
void periodic_load_balance(int this_cpu)
{
    struct run_queue *this_rq = cpu_rq(this_cpu);
    unsigned long this_load = get_weighted_load(this_rq);

    /* Find the least-loaded CPU in our domain */
    int target_cpu = find_least_loaded_cpu(this_cpu);
    if (target_cpu < 0)
        return;  /* No lighter CPU found */

    struct run_queue *target_rq = cpu_rq(target_cpu);
    unsigned long target_load = get_weighted_load(target_rq);

    /* Calculate imbalance */
    unsigned long imbalance = this_load - target_load;

    /* Check if imbalance exceeds threshold */
    if (imbalance < MIGRATION_THRESHOLD)
        return;  /* Not worth migrating */

    /* Calculate how much load to move */
    unsigned long load_to_move = imbalance / 2;  /* Split the difference */

    /* Select tasks to migrate */
    struct list_head *tasks_to_push = select_tasks_for_migration(
        this_rq, load_to_move, target_cpu);

    /* Execute migrations */
    struct task *task;
    list_for_each_entry(task, tasks_to_push, migration_list) {
        /* Lock both run queues */
        double_lock_balance(this_rq, target_rq);

        /* Dequeue from source */
        dequeue_task(this_rq, task);

        /* Update task's CPU assignment */
        task->cpu = target_cpu;

        /* Enqueue on target */
        enqueue_task(target_rq, task);

        double_unlock_balance(this_rq, target_rq);

        /* If target was idle, wake it */
        if (target_rq->nr_running == 1)
            send_reschedule_ipi(target_cpu);
    }
}
```
Task Selection Criteria:
Not all tasks are equally good candidates for migration. Selection logic typically skips the task currently running on the source CPU, respects each task's affinity mask (a task can only move to a CPU it is allowed to run on), and prefers cache-cold tasks over cache-hot ones, since a recently run task loses more accumulated cache state when it moves.
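As a sketch of how such a selection routine might look, written in the same conceptual pseudocode style as the listing above (the run-queue fields, task_weight(), task_hot(), and LIST_HEAD are assumptions carried over from that sketch, not a real kernel API):

```c
/* Hypothetical companion to the push-migration sketch: pick tasks
 * from the source run queue until roughly load_to_move worth of
 * weight has been selected. */
struct list_head *select_tasks_for_migration(struct run_queue *src_rq,
                                             unsigned long load_to_move,
                                             int target_cpu)
{
    static LIST_HEAD(chosen);           /* tasks picked for migration  */
    unsigned long moved = 0;
    struct task *p;

    list_for_each_entry(p, &src_rq->task_list, run_list) {
        if (p == src_rq->current_task)
            continue;                   /* can't move the running task */
        if (!cpumask_test_cpu(target_cpu, &p->cpus_allowed))
            continue;                   /* affinity forbids the move   */
        if (task_hot(p) && moved > 0)
            continue;                   /* prefer cache-cold tasks     */

        list_add_tail(&p->migration_list, &chosen);
        moved += task_weight(p);
        if (moved >= load_to_move)
            break;                      /* moved enough load           */
    }
    return &chosen;
}
```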
Advantages of Push Migration:
Disadvantages of Push Migration:
Linux runs periodic load balancing on each CPU through the scheduler_tick() function. The frequency varies by scheduling domain—more frequent balancing for SMT siblings and same-socket cores, less frequent for cross-socket migration. This tiered approach reflects the differing migration costs at different topology levels.
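A rough sketch of how tick-driven, per-domain intervals might fit together, again in conceptual pseudocode rather than real kernel code (cpu_domain(), balance_domain(), and the field names are illustrative assumptions):

```c
/* Conceptual sketch: on each tick, walk this CPU's domain hierarchy
 * from the smallest domain outward and balance a level only when its
 * interval has elapsed. Smaller domains (SMT, same core) use short
 * intervals; larger domains (cross-socket, NUMA) use long ones. */
struct sched_domain_lite {
    struct sched_domain_lite *parent;   /* next larger domain, or NULL     */
    unsigned long balance_interval_ms;  /* e.g. 1 (SMT), 8 (MC), 64 (NUMA) */
    unsigned long last_balance_ms;      /* when this level last balanced   */
};

void on_scheduler_tick(int this_cpu, unsigned long now_ms)
{
    struct sched_domain_lite *sd;

    for (sd = cpu_domain(this_cpu); sd != NULL; sd = sd->parent) {
        if (now_ms - sd->last_balance_ms < sd->balance_interval_ms)
            continue;                   /* not yet time for this level */

        sd->last_balance_ms = now_ms;
        balance_domain(this_cpu, sd);   /* push/pull within this domain */
    }
}
```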
Pull migration is a load balancing strategy where idle or lightly-loaded CPUs actively pull tasks from heavily-loaded CPUs. It's a reactive approach: when a CPU runs out of work, it seeks more.
How Pull Migration Works:
```c
/* Conceptual pull migration (idle load balancing) */
struct task *idle_balance(int this_cpu)
{
    struct run_queue *this_rq = cpu_rq(this_cpu);

    /* Find the busiest CPU in our scheduling domain */
    int busiest_cpu = find_busiest_cpu(this_cpu);
    if (busiest_cpu < 0)
        return NULL;  /* No work to pull */

    struct run_queue *busiest_rq = cpu_rq(busiest_cpu);

    /* Check if busiest has enough load to share */
    if (busiest_rq->nr_running < 2)
        return NULL;  /* Only one task, can't steal without idling it */

    /* Lock the busiest run queue */
    spin_lock(&busiest_rq->lock);

    /* Find a suitable task to pull */
    struct task *pulled_task = NULL;
    struct task *p;
    list_for_each_entry_reverse(p, &busiest_rq->task_list, run_list) {
        /* Check affinity: can this task run on our CPU? */
        if (!cpumask_test_cpu(this_cpu, &p->cpus_allowed))
            continue;

        /* Check cache hotness: prefer cold tasks */
        if (task_hot(p))
            continue;  /* Skip cache-hot tasks if others available */

        /* Found a candidate */
        pulled_task = p;
        break;
    }

    /* If no cold task, consider hot tasks (better than idling) */
    if (!pulled_task)
        pulled_task = select_any_migratable_task(busiest_rq, this_cpu);

    if (pulled_task) {
        /* Dequeue from busiest */
        dequeue_task(busiest_rq, pulled_task);

        /* Update CPU assignment */
        pulled_task->cpu = this_cpu;
    }

    spin_unlock(&busiest_rq->lock);

    if (pulled_task) {
        /* Enqueue on our (the idle) run queue */
        enqueue_task(this_rq, pulled_task);
    }

    return pulled_task;  /* Will be scheduled immediately */
}

/* Called when scheduler has no runnable task */
void schedule_idle(void)
{
    int cpu = current_cpu();

    /* Try to pull work before going idle */
    struct task *new_task = idle_balance(cpu);

    if (new_task) {
        /* Run the pulled task */
        switch_to(new_task);
    } else {
        /* No work found; enter idle state */
        cpu_idle_loop();
    }
}
```
Advantages of Pull Migration:
Disadvantages of Pull Migration:
Pull Migration in Practice:
Most modern schedulers use pull migration as the primary mechanism for immediate work distribution, triggered when a CPU is about to go idle. It's complemented by periodic push migration to handle cases where all CPUs are busy but some more so than others.
Linux implements 'newidle' balancing—a lightweight quick check when a CPU is about to go idle. It only scans nearby CPUs (SMT siblings, same socket) before giving up. Full idle balance runs less frequently. This tiered approach reduces the overhead of idle CPUs constantly searching for work while still enabling rapid rebalancing within cache-coherent domains.
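The idea can be sketched in the same conceptual pseudocode style as the earlier listings (pull_from_domain() and the domain levels are assumed helpers, not real kernel interfaces):

```c
/* Conceptual sketch: a newidle-style balance scans only nearby,
 * cheap-to-migrate-from domains before letting the CPU go idle. */
enum domain_level { DOMAIN_SMT, DOMAIN_MC, DOMAIN_NUMA };

struct task *newidle_balance(int this_cpu)
{
    /* Only the cheap levels: SMT siblings and the same socket.
     * A full idle balance covering the NUMA level runs separately
     * and less frequently. */
    enum domain_level cheap_levels[] = { DOMAIN_SMT, DOMAIN_MC };

    for (int i = 0; i < 2; i++) {
        struct task *t = pull_from_domain(this_cpu, cheap_levels[i]);
        if (t)
            return t;   /* Found work close by: run it immediately */
    }
    return NULL;        /* Nothing nearby; let the CPU go idle */
}
```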
Modern systems have hierarchical CPU topologies, and migration costs differ dramatically within the hierarchy. Scheduling domains are the scheduler's representation of this topology, enabling intelligent, topology-aware load balancing.
The Domain Hierarchy:
Scheduling domains form a nested hierarchy corresponding to the hardware:
Level 1: SMT (Sibling Threads) CPUs sharing a physical core (Hyper-Threading siblings). Lowest migration cost—shared L1/L2 caches mean minimal cache state loss.
Level 2: MC (Multi-Core / Same Socket) CPUs on the same physical chip. Moderate migration cost—shared L3 cache but separate L1/L2.
Level 3: NUMA (Same NUMA Node) CPUs with shared local memory. Higher migration cost if caches differ, but memory remains local.
Level 4: System (Cross-NUMA) All CPUs in the system. Highest migration cost—different L3 caches, potentially remote NUMA memory.
```
Scheduling Domain Hierarchy Example
(Dual-Socket, 8-core-per-socket, 2-way SMT = 32 logical CPUs)

SYSTEM DOMAIN (SD_NUMA)              balance interval: 64ms, migrates across NUMA nodes
├── NUMA NODE 0 (MC DOMAIN)          balance interval: 8ms,  CPUs 0-15
│   ├── SMT domain, CPUs {0,8}       balance interval: 1ms
│   ├── SMT domain, CPUs {1,9}       balance interval: 1ms
│   ├── ...
│   └── SMT domain, CPUs {7,15}      balance interval: 1ms
└── NUMA NODE 1 (MC DOMAIN)          balance interval: 8ms,  CPUs 16-31
    ├── SMT domain, CPUs {16,24}     balance interval: 1ms
    ├── SMT domain, CPUs {17,25}     balance interval: 1ms
    ├── ...
    └── SMT domain, CPUs {23,31}     balance interval: 1ms

Balance Intervals Reflect Migration Cost:
- SMT Domain: 1ms (nearly free migration)
- MC Domain: 8ms (L3 shared, L1/L2 lost)
- NUMA Domain: 64ms (cache completely cold, memory may be remote)
```
Domain-Aware Load Balancing:
The scheduler balances each domain level at a different frequency, matching the intervals in the hierarchy above: roughly every millisecond between SMT siblings, every few milliseconds within a socket, and only every few tens of milliseconds across NUMA nodes.
Domain Parameters:
Each scheduling domain has configurable parameters:
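As a hedged sketch of what such parameters look like (the struct and field names are illustrative, loosely modeled on the kinds of knobs exposed in the scheduling domain files described below, not the kernel's actual struct sched_domain), a domain descriptor might carry:

```c
/* Illustrative domain descriptor: the kinds of per-domain parameters
 * a topology-aware balancer needs. Names and values are examples. */
struct domain_params {
    const char   *name;              /* "SMT", "MC", "NUMA"                 */
    unsigned long min_interval_ms;   /* shortest time between balance runs  */
    unsigned long max_interval_ms;   /* interval may back off up to this    */
    unsigned int  imbalance_pct;     /* how lopsided load must be to act,
                                        e.g. 125 = busiest must exceed local
                                        load by 25%                          */
    unsigned int  cache_hot_time_us; /* tasks that ran more recently than
                                        this are treated as cache-hot        */
    unsigned int  flags;             /* e.g. balance on exec/fork/wake       */
};

/* Example values echoing the hierarchy above (purely illustrative). */
static const struct domain_params example_domains[] = {
    { "SMT",   1,   2, 110,    0, 0 },
    { "MC",    8,  16, 117,  500, 0 },
    { "NUMA", 64, 128, 125, 1000, 0 },
};
```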
You can view the kernel's scheduling domain configuration through /proc/sys/kernel/sched_domain/ (newer kernels expose it under /sys/kernel/debug/sched/domains/ instead). Each CPU has entries for each domain level, exposing parameters like min_interval, max_interval, busy_idx, and flags. These are tunable at runtime, though the defaults are appropriate for most workloads.
Work stealing is a load balancing algorithm particularly suited to parallel programming frameworks where a single application spawns many short-lived tasks. It's a specialized form of pull migration optimized for task-based parallelism.
The Work Stealing Model:
Each worker thread owns a double-ended queue (deque) of pending tasks. A worker pushes newly spawned tasks onto the bottom of its own deque and pops from the bottom as well (LIFO order, which keeps recently created, cache-warm work local). When its deque runs empty, a worker becomes a thief: it picks another worker at random and steals from the top of the victim's deque (FIFO order, which tends to take the oldest and typically largest pieces of work).
This design is elegant: workers access their own deques with only cheap, mostly uncontended synchronization. Contention only occurs during steals, which are relatively rare in balanced workloads.
```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <sched.h>

/* Work-stealing deque operations (Chase-Lev style, simplified:
 * no resizing, capacity assumed large enough). Signed indices
 * avoid underflow when popping from an empty deque. */
struct work_item {
    void (*function)(void *);
    void *arg;
};

struct ws_deque {
    struct work_item *items;
    atomic_llong top;      /* Steal from here (FIFO end) */
    atomic_llong bottom;   /* Push/pop here (LIFO end) */
    size_t capacity;
};

/* Worker pushes new work onto the bottom (local, fast path) */
void push(struct ws_deque *deque, struct work_item *item)
{
    long long b = atomic_load(&deque->bottom);
    deque->items[(size_t)b % deque->capacity] = *item;
    /* Ensure item visible before bottom update */
    atomic_thread_fence(memory_order_release);
    atomic_store(&deque->bottom, b + 1);
}

/* Worker pops from its own deque (local, fast path) */
struct work_item *pop(struct ws_deque *deque)
{
    long long b = atomic_load(&deque->bottom) - 1;
    atomic_store(&deque->bottom, b);
    atomic_thread_fence(memory_order_seq_cst);  /* Full fence for correctness */
    long long t = atomic_load(&deque->top);

    if (t <= b) {
        /* Non-empty: return item */
        struct work_item *item = &deque->items[(size_t)b % deque->capacity];
        if (t == b) {
            /* Last item: race with a potential steal */
            if (!atomic_compare_exchange_strong(&deque->top, &t, t + 1)) {
                /* Lost race to stealer */
                atomic_store(&deque->bottom, b + 1);
                return NULL;
            }
            atomic_store(&deque->bottom, b + 1);
        }
        return item;
    } else {
        /* Empty deque */
        atomic_store(&deque->bottom, b + 1);
        return NULL;
    }
}

/* Thief steals from the top of the victim's deque (remote, contended) */
struct work_item *steal(struct ws_deque *deque)
{
    long long t = atomic_load(&deque->top);
    atomic_thread_fence(memory_order_seq_cst);
    long long b = atomic_load(&deque->bottom);

    if (t < b) {
        /* Non-empty: try to steal */
        struct work_item *item = &deque->items[(size_t)t % deque->capacity];
        if (atomic_compare_exchange_strong(&deque->top, &t, t + 1))
            return item;  /* Success */
    }
    return NULL;  /* Empty or lost race */
}

/* Set to true to stop all workers */
static atomic_bool shutdown_requested;

/* Main worker loop */
void worker_loop(int worker_id, struct ws_deque *deques, int num_workers)
{
    struct ws_deque *my_deque = &deques[worker_id];

    while (!atomic_load(&shutdown_requested)) {
        /* Try to pop from own deque */
        struct work_item *item = pop(my_deque);

        if (item == NULL) {
            /* Own deque empty: try stealing */
            for (int attempts = 0; attempts < num_workers * 2; attempts++) {
                int victim = random() % num_workers;
                if (victim == worker_id)
                    continue;
                item = steal(&deques[victim]);
                if (item != NULL)
                    break;
            }
        }

        if (item != NULL) {
            /* Execute the work */
            item->function(item->arg);
        } else {
            /* No work found: back off */
            sched_yield();
        }
    }
}
```
Why Work Stealing Works:
Work Stealing in Practice:
Work stealing is implemented in many parallel frameworks, including Cilk, Intel Threading Building Blocks (TBB), Java's ForkJoinPool, the Go runtime scheduler, Rust's Rayon, and .NET's Task Parallel Library.
Work stealing operates at the application/runtime level on user-space tasks, while OS load balancing operates at the kernel level on kernel-scheduled threads. They're complementary: the OS balances threads across CPUs, while work stealing balances tasks (fine-grained work units) across threads. A well-designed system uses both appropriately.
Effective load balancing requires careful calibration of how load is measured and how frequently balancing runs.
Load Metrics in Linux:
Linux uses load weight rather than simple task count:
| Nice Value | Weight | Relative to Default | Interpretation |
|---|---|---|---|
| -20 (highest priority) | 88761 | 86.7x | Counts as 87 normal tasks |
| -10 | 9548 | 9.3x | Counts as 9 normal tasks |
| 0 (default) | 1024 | 1.0x | Baseline weight |
| +10 | 110 | 0.11x | 1/9 of a normal task |
| +19 (lowest priority) | 15 | 0.015x | Nearly invisible load |
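To see how these weights feed balancing decisions, here is a small self-contained sketch (it hard-codes only the five weights from the table above; everything else is illustrative) that sums per-task weights into a run-queue load figure:

```c
#include <stdio.h>
#include <stddef.h>

/* Weights for the nice values shown in the table above. */
static unsigned int weight_for_nice(int nice)
{
    switch (nice) {
    case -20: return 88761;
    case -10: return 9548;
    case 0:   return 1024;
    case 10:  return 110;
    case 19:  return 15;
    default:  return 1024;  /* treat anything else as default here */
    }
}

/* Weighted load: the sum of the weights of all runnable tasks. */
static unsigned long queue_load(const int *nice_values, size_t n)
{
    unsigned long load = 0;
    for (size_t i = 0; i < n; i++)
        load += weight_for_nice(nice_values[i]);
    return load;
}

int main(void)
{
    /* CPU A: one nice -10 task. CPU B: four nice +19 background tasks. */
    int cpu_a[] = { -10 };
    int cpu_b[] = { 19, 19, 19, 19 };

    /* Prints 9548 vs 60: CPU B runs more tasks but carries far less
     * weighted load, so the balancer should not pull from it. */
    printf("CPU A load = %lu, CPU B load = %lu\n",
           queue_load(cpu_a, 1), queue_load(cpu_b, 4));
    return 0;
}
```

Raw task counts would make CPU B look four times busier than CPU A; weighted load correctly identifies CPU A as the heavier queue.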
Balancing Frequency:
Balancing frequency is a critical parameter:
Too Frequent: migrations outpace cache warm-up, so tasks repeatedly run with cold caches, and the balancer itself consumes a noticeable share of CPU time.
Too Infrequent: CPUs sit idle or overloaded for long stretches before an imbalance is corrected, costing both throughput and responsiveness.
Linux Balancing Intervals (defaults vary by kernel version): intervals scale with the domain level, on the order of a few milliseconds within a core or socket and tens of milliseconds across sockets and NUMA nodes, as sketched in the hierarchy above.
Load Balance Tuning Considerations:
```
BALANCE FREQUENCY SPECTRUM

  Low Frequency (Conservative)              High Frequency (Aggressive)
  ◄────────────────────────────────────────────────────────────────►
  Less migration                            More migration
  Better cache efficiency                   Worse cache efficiency
  Potential idle time                       Potential thrashing
  Lower overhead                            Higher overhead
  Slower response to imbalance              Faster response to imbalance

Workload-Specific Tuning:

  BATCH/HPC WORKLOADS:
  - Prefer lower frequency
  - Large working sets; migration expensive
  - Predictable task placement preferred
  - Tune: increase balance_interval, increase imbalance_pct threshold

  LATENCY-SENSITIVE WORKLOADS:
  - Prefer higher frequency (within reason)
  - Small tasks; migration cheap
  - Quick response to arriving work critical
  - Tune: decrease balance_interval, enable SD_WAKE_AFFINE

  MIXED WORKLOADS:
  - Rely on the scheduling domain hierarchy
  - Frequent balancing within a socket (cheap)
  - Infrequent balancing across sockets (expensive)
  - Default settings usually appropriate
```
Load balancing itself consumes CPU cycles. On systems with many CPUs, the balancer can become a significant overhead if intervals are too short. Linux limits how long newidle balancing may run by comparing a run queue's average idle time against the measured cost of past balance attempts, so that balancing does not dominate scheduling. If profiling shows significant time in load_balance(), consider increasing balance intervals.
System administrators and developers can tune load balancing behavior through various mechanisms. Here's a practical guide to common scenarios.
Scenario 1: Reduce Migration for Latency Sensitivity
For real-time or latency-critical workloads:
# Raise the cache-hot threshold so tasks are migrated less readily (sysctl; exact path varies by kernel version)
echo 5000000 > /proc/sys/kernel/sched_migration_cost_ns  # Default: ~500000 (0.5ms); 5000000 = 5ms
# Use CPU isolation to remove CPUs from balancing entirely
# Add to kernel boot parameters:
isolcpus=4-7
# Pin critical processes to isolated CPUs
taskset -c 4 ./critical_process
Scenario 2: Improve Balance for Throughput
For batch processing with many short tasks:
# Decrease min granularity (more preemption, quicker rebalancing)
echo 1000000 > /proc/sys/kernel/sched_min_granularity_ns # 1ms instead of 3ms
# Reduce cache hot threshold (willing to migrate sooner)
echo 0 > /proc/sys/kernel/sched_migration_cost_ns
Scenario 3: NUMA-Aware Balancing
For memory-intensive workloads on NUMA systems:
# View NUMA topology
numactl --hardware
# Prefer local memory allocation
echo 1 > /proc/sys/kernel/numa_balancing # Enable automatic NUMA balancing
# For specific application: bind to NUMA node
numactl --cpunodebind=0 --membind=0 ./application
Observing Load Balancing:
Monitor balancer effectiveness:
# View per-CPU load
watch -n 0.5 'cat /proc/loadavg; mpstat -P ALL 1 1'
# View scheduler statistics
cat /proc/schedstat
# Trace scheduling and migration events (uses scheduler tracepoints)
perf sched record sleep 10
perf sched latency
| Symptom | Likely Cause | Tuning Action |
|---|---|---|
| CPUs idle while others overloaded | Balance interval too long or affinity blocking | Check affinity masks; reduce balance interval |
| Excessive context switches | Migration too aggressive | Increase sched_migration_cost_ns |
| Latency spikes on critical CPUs | Unwanted migration of critical tasks | Use CPU isolation (isolcpus) for critical tasks |
| Poor cache hit rates | Excessive migration | Increase migration cost; check for false affinity |
| Imbalance persists on idle CPUs | Pull migration failing (affinity, newidle disabled) | Verify newidle balancing enabled; check for bugs |
Modern schedulers are sophisticated. Before tuning, confirm that the problem is actually load balancing and not application architecture, affinity misuse, or lock contention. Profile with perf and schedstat before making changes. Many 'load balancing problems' turn out to be something else entirely.
We have explored load balancing from fundamental concepts through practical tuning. This knowledge enables you to understand, diagnose, and optimize work distribution in multi-processor systems.
Consolidating Our Understanding: load imbalance is inevitable with per-CPU run queues; push and pull migration correct it; scheduling domains make that correction topology-aware; work stealing applies the same idea inside parallel runtimes; and tuning ultimately trades cache affinity against responsiveness to imbalance.
What's Next:
With load balancing understood, we'll explore NUMA Considerations—the final topic in our multi-processor scheduling module. NUMA (Non-Uniform Memory Access) adds another dimension to scheduling: not all memory is equidistant from all CPUs. Understanding NUMA is essential for scaling to large multi-socket systems.
You now understand load balancing mechanisms at the depth required for kernel-level reasoning and production system optimization. You can analyze balancer behavior, diagnose imbalance problems, and tune systems for optimal work distribution. This knowledge is essential for anyone working with high-performance multi-processor systems.