In an ideal world, work would naturally distribute evenly across all processors. Reality is messier: some processes are CPU-intensive while others sleep waiting for I/O; processes wake and sleep at different times; cores run at different speeds; and affinity constraints limit migration options.
Load balancing is the scheduler's mechanism for detecting and correcting workload imbalance—moving work from overloaded processors to underutilized ones. This sounds simple, but effective load balancing is one of the most challenging aspects of multi-processor scheduling. Balance too aggressively, and you destroy cache affinity; balance too conservatively, and CPUs sit idle while work waits.
By the end of this page, you will understand load balancing mechanisms in depth—how schedulers detect imbalance, push and pull migration strategies, scheduling domains that reflect hardware topology, work stealing algorithms used in parallel programming frameworks, and the tradeoffs that govern balancing frequency and aggressiveness. You'll be able to reason about load balancer behavior and tune systems for optimal balance.
With per-CPU run queues—the architecture all modern SMP schedulers use—load imbalance is inevitable. Processes are created, wake, sleep, and terminate asynchronously on different cores, leading to transient and sometimes persistent imbalance.
Sources of Imbalance:
1. Process Creation and Termination: When a new process is created, it's typically placed on the parent's CPU or a nearby CPU. Batch jobs may spawn many children on one CPU before the scheduler can intervene.
2. Asymmetric Wake Patterns: I/O completion, timer expirations, and signals wake processes. The CPU that handles the interrupt often becomes the wake target, concentrating ready processes.
3. Affinity Constraints: Hard affinity limits which CPUs can run certain processes. If many affinitized processes are constrained to the same CPUs, those CPUs become overloaded while others with different affinity targets remain underutilized.
4. Workload Phase Changes: Applications have phases: initialization, processing, cleanup. A process that was I/O-bound might suddenly become CPU-intensive, shifting load on its CPU.
5. Priority Preemption: High-priority processes preempt lower-priority work. Multiple high-priority processes on the same CPU create load concentration.
| Imbalance Scenario | Performance Impact | User-Visible Effect |
|---|---|---|
| One CPU overloaded, others idle | 50%+ throughput loss | Slow response, poor parallelization |
| Slight imbalance (10-20%) | Minor throughput loss | Usually unnoticeable |
| Persistent single-CPU bottleneck | System-wide stalls possible | Latency spikes, timeouts |
| Oscillating imbalance | Cache thrashing from migration | Variable, unpredictable performance |
| Cross-NUMA migration to correct imbalance | Remote memory access penalty | Increased memory latency |
Measuring Load:
Before load can be balanced, it must be measured. Different schedulers use different metrics:
Linux's CFS scheduler calculates load as the sum of per-task weights on each CPU's run queue. A higher-priority (lower-nice) process contributes more load than a lower-priority one.
Load and utilization are related but distinct. Utilization of 100% means the CPU is never idle but says nothing about queueing. Load of 2.0 on a single CPU means two processes are contending for one CPU—one runs while one waits. Understanding this distinction is essential for interpreting system metrics like Linux 'load average'.
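To make the distinction concrete, here is a toy, self-contained sketch (all numbers are invented for illustration) comparing the two metrics over one sampled interval:

```c
#include <stdio.h>

/* Toy model of one sampling interval on a single CPU, used only to
 * illustrate the difference between utilization and load. */
struct cpu_sample {
    double busy_ns;      /* time the CPU spent running tasks        */
    double interval_ns;  /* length of the sampling interval          */
    int    nr_runnable;  /* runnable tasks (running + waiting)       */
};

int main(void)
{
    /* CPU never idle (100% utilization), but two tasks are
     * contending: one runs while the other waits in the queue. */
    struct cpu_sample s = { 1e9, 1e9, 2 };

    double utilization = s.busy_ns / s.interval_ns;   /* 1.0 */
    double load        = (double)s.nr_runnable;       /* 2.0 */

    printf("utilization = %.0f%%, load = %.1f\n",
           utilization * 100.0, load);
    /* The same 100% utilization with load 1.0 would mean no queueing;
     * load 2.0 means each task effectively gets half the CPU. */
    return 0;
}
```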
Push migration is a load balancing strategy where overloaded CPUs actively push tasks to less-loaded CPUs. It's a proactive approach: the "heavy" CPU takes responsibility for shedding load.
How Push Migration Works:
```c
/* Conceptual push migration algorithm */
void periodic_load_balance(int this_cpu)
{
    struct run_queue *this_rq = cpu_rq(this_cpu);
    unsigned long this_load = get_weighted_load(this_rq);

    /* Find the least-loaded CPU in our domain */
    int target_cpu = find_least_loaded_cpu(this_cpu);
    if (target_cpu < 0)
        return;  /* No lighter CPU found */

    struct run_queue *target_rq = cpu_rq(target_cpu);
    unsigned long target_load = get_weighted_load(target_rq);

    /* Calculate imbalance */
    unsigned long imbalance = this_load - target_load;

    /* Check if imbalance exceeds threshold */
    if (imbalance < MIGRATION_THRESHOLD)
        return;  /* Not worth migrating */

    /* Calculate how much load to move */
    unsigned long load_to_move = imbalance / 2;  /* Split the difference */

    /* Select tasks to migrate */
    struct list_head *tasks_to_push = select_tasks_for_migration(
        this_rq, load_to_move, target_cpu);

    /* Execute migrations */
    struct task *task;
    list_for_each_entry(task, tasks_to_push, migration_list) {
        /* Lock both run queues */
        double_lock_balance(this_rq, target_rq);

        /* Dequeue from source */
        dequeue_task(this_rq, task);

        /* Update task's CPU assignment */
        task->cpu = target_cpu;

        /* Enqueue on target */
        enqueue_task(target_rq, task);

        double_unlock_balance(this_rq, target_rq);

        /* If target was idle, wake it */
        if (target_rq->nr_running == 1)
            send_reschedule_ipi(target_cpu);
    }
}
```
Task Selection Criteria:
Not all tasks are equally good candidates for migration. Selection logic typically skips the task currently running on the source CPU, respects each task's affinity mask (a task can only move to a CPU it is allowed to run on), and prefers cache-cold tasks over cache-hot ones, since a recently run task loses more accumulated cache state when it moves.
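As a sketch of how such a selection routine might look, written in the same conceptual pseudocode style as the listing above (the run-queue fields, task_weight(), task_hot(), and LIST_HEAD are assumptions carried over from that sketch, not a real kernel API):

```c
/* Hypothetical companion to the push-migration sketch: pick tasks
 * from the source run queue until roughly load_to_move worth of
 * weight has been selected. */
struct list_head *select_tasks_for_migration(struct run_queue *src_rq,
                                             unsigned long load_to_move,
                                             int target_cpu)
{
    static LIST_HEAD(chosen);           /* tasks picked for migration  */
    unsigned long moved = 0;
    struct task *p;

    list_for_each_entry(p, &src_rq->task_list, run_list) {
        if (p == src_rq->current_task)
            continue;                   /* can't move the running task */
        if (!cpumask_test_cpu(target_cpu, &p->cpus_allowed))
            continue;                   /* affinity forbids the move   */
        if (task_hot(p) && moved > 0)
            continue;                   /* prefer cache-cold tasks     */

        list_add_tail(&p->migration_list, &chosen);
        moved += task_weight(p);
        if (moved >= load_to_move)
            break;                      /* moved enough load           */
    }
    return &chosen;
}
```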
Advantages of Push Migration:
Disadvantages of Push Migration:
Linux runs periodic load balancing on each CPU through the scheduler_tick() function. The frequency varies by scheduling domain—more frequent balancing for SMT siblings and same-socket cores, less frequent for cross-socket migration. This tiered approach reflects the differing migration costs at different topology levels.
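A rough sketch of how tick-driven, per-domain intervals might fit together, again in conceptual pseudocode rather than real kernel code (cpu_domain(), balance_domain(), and the field names are illustrative assumptions):

```c
/* Conceptual sketch: on each tick, walk this CPU's domain hierarchy
 * from the smallest domain outward and balance a level only when its
 * interval has elapsed. Smaller domains (SMT, same core) use short
 * intervals; larger domains (cross-socket, NUMA) use long ones. */
struct sched_domain_lite {
    struct sched_domain_lite *parent;   /* next larger domain, or NULL     */
    unsigned long balance_interval_ms;  /* e.g. 1 (SMT), 8 (MC), 64 (NUMA) */
    unsigned long last_balance_ms;      /* when this level last balanced   */
};

void on_scheduler_tick(int this_cpu, unsigned long now_ms)
{
    struct sched_domain_lite *sd;

    for (sd = cpu_domain(this_cpu); sd != NULL; sd = sd->parent) {
        if (now_ms - sd->last_balance_ms < sd->balance_interval_ms)
            continue;                   /* not yet time for this level */

        sd->last_balance_ms = now_ms;
        balance_domain(this_cpu, sd);   /* push/pull within this domain */
    }
}
```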
Pull migration is a load balancing strategy where idle or lightly-loaded CPUs actively pull tasks from heavily-loaded CPUs. It's a reactive approach: when a CPU runs out of work, it seeks more.
How Pull Migration Works:
```c
/* Conceptual pull migration (idle load balancing) */
struct task *idle_balance(int this_cpu)
{
    struct run_queue *this_rq = cpu_rq(this_cpu);

    /* Find the busiest CPU in our scheduling domain */
    int busiest_cpu = find_busiest_cpu(this_cpu);
    if (busiest_cpu < 0)
        return NULL;  /* No work to pull */

    struct run_queue *busiest_rq = cpu_rq(busiest_cpu);

    /* Check if busiest has enough load to share */
    if (busiest_rq->nr_running < 2)
        return NULL;  /* Only one task, can't steal without idling it */

    /* Lock the busiest run queue */
    spin_lock(&busiest_rq->lock);

    /* Find a suitable task to pull */
    struct task *pulled_task = NULL;
    struct task *p;
    list_for_each_entry_reverse(p, &busiest_rq->task_list, run_list) {
        /* Check affinity: can this task run on our CPU? */
        if (!cpumask_test_cpu(this_cpu, &p->cpus_allowed))
            continue;

        /* Check cache hotness: prefer cold tasks */
        if (task_hot(p))
            continue;  /* Skip cache-hot tasks if others available */

        /* Found a candidate */
        pulled_task = p;
        break;
    }

    /* If no cold task, consider hot tasks (better than idling) */
    if (!pulled_task)
        pulled_task = select_any_migratable_task(busiest_rq, this_cpu);

    if (pulled_task) {
        /* Dequeue from busiest */
        dequeue_task(busiest_rq, pulled_task);

        /* Update CPU assignment */
        pulled_task->cpu = this_cpu;
    }

    spin_unlock(&busiest_rq->lock);

    if (pulled_task) {
        /* Enqueue on our (the idle) run queue */
        enqueue_task(this_rq, pulled_task);
    }

    return pulled_task;  /* Will be scheduled immediately */
}

/* Called when scheduler has no runnable task */
void schedule_idle(void)
{
    int cpu = current_cpu();

    /* Try to pull work before going idle */
    struct task *new_task = idle_balance(cpu);

    if (new_task) {
        /* Run the pulled task */
        switch_to(new_task);
    } else {
        /* No work found; enter idle state */
        cpu_idle_loop();
    }
}
```
Advantages of Pull Migration:
Disadvantages of Pull Migration:
Pull Migration in Practice:
Most modern schedulers use pull migration as the primary mechanism for immediate work distribution, triggered when a CPU is about to go idle. It's complemented by periodic push migration to handle cases where all CPUs are busy but some more so than others.
Linux implements 'newidle' balancing—a lightweight quick check when a CPU is about to go idle. It only scans nearby CPUs (SMT siblings, same socket) before giving up. Full idle balance runs less frequently. This tiered approach reduces the overhead of idle CPUs constantly searching for work while still enabling rapid rebalancing within cache-coherent domains.
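The idea can be sketched in the same conceptual pseudocode style as the earlier listings (pull_from_domain() and the domain levels are assumed helpers, not real kernel interfaces):

```c
/* Conceptual sketch: a newidle-style balance scans only nearby,
 * cheap-to-migrate-from domains before letting the CPU go idle. */
enum domain_level { DOMAIN_SMT, DOMAIN_MC, DOMAIN_NUMA };

struct task *newidle_balance(int this_cpu)
{
    /* Only the cheap levels: SMT siblings and the same socket.
     * A full idle balance covering the NUMA level runs separately
     * and less frequently. */
    enum domain_level cheap_levels[] = { DOMAIN_SMT, DOMAIN_MC };

    for (int i = 0; i < 2; i++) {
        struct task *t = pull_from_domain(this_cpu, cheap_levels[i]);
        if (t)
            return t;   /* Found work close by: run it immediately */
    }
    return NULL;        /* Nothing nearby; let the CPU go idle */
}
```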
Modern systems have hierarchical CPU topologies, and migration costs differ dramatically within the hierarchy. Scheduling domains are the scheduler's representation of this topology, enabling intelligent, topology-aware load balancing.
The Domain Hierarchy:
Scheduling domains form a nested hierarchy corresponding to the hardware:
Level 1: SMT (Sibling Threads) CPUs sharing a physical core (Hyper-Threading siblings). Lowest migration cost—shared L1/L2 caches mean minimal cache state loss.
Level 2: MC (Multi-Core / Same Socket) CPUs on the same physical chip. Moderate migration cost—shared L3 cache but separate L1/L2.
Level 3: NUMA (Same NUMA Node) CPUs with shared local memory. Higher migration cost if caches differ, but memory remains local.
Level 4: System (Cross-NUMA) All CPUs in the system. Highest migration cost—different L3 caches, potentially remote NUMA memory.
```
Scheduling Domain Hierarchy Example
(Dual-Socket, 8-core-per-socket, 2-way SMT = 32 logical CPUs)

SYSTEM DOMAIN (SD_NUMA)              balance interval: 64ms, migrates across NUMA nodes
├── NUMA NODE 0 (MC DOMAIN)          balance interval: 8ms,  CPUs 0-15
│   ├── SMT domain, CPUs {0,8}       balance interval: 1ms
│   ├── SMT domain, CPUs {1,9}       balance interval: 1ms
│   ├── ...
│   └── SMT domain, CPUs {7,15}      balance interval: 1ms
└── NUMA NODE 1 (MC DOMAIN)          balance interval: 8ms,  CPUs 16-31
    ├── SMT domain, CPUs {16,24}     balance interval: 1ms
    ├── SMT domain, CPUs {17,25}     balance interval: 1ms
    ├── ...
    └── SMT domain, CPUs {23,31}     balance interval: 1ms

Balance Intervals Reflect Migration Cost:
- SMT Domain: 1ms (nearly free migration)
- MC Domain: 8ms (L3 shared, L1/L2 lost)
- NUMA Domain: 64ms (cache completely cold, memory may be remote)
```
Domain-Aware Load Balancing:
The scheduler balances each domain level at a different frequency, matching the intervals in the hierarchy above: roughly every millisecond between SMT siblings, every few milliseconds within a socket, and only every few tens of milliseconds across NUMA nodes.
Domain Parameters:
Each scheduling domain has configurable parameters:
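As a hedged sketch of what such parameters look like (the struct and field names are illustrative, loosely modeled on the kinds of knobs exposed in the scheduling domain files described below, not the kernel's actual struct sched_domain), a domain descriptor might carry:

```c
/* Illustrative domain descriptor: the kinds of per-domain parameters
 * a topology-aware balancer needs. Names and values are examples. */
struct domain_params {
    const char   *name;              /* "SMT", "MC", "NUMA"                 */
    unsigned long min_interval_ms;   /* shortest time between balance runs  */
    unsigned long max_interval_ms;   /* interval may back off up to this    */
    unsigned int  imbalance_pct;     /* how lopsided load must be to act,
                                        e.g. 125 = busiest must exceed local
                                        load by 25%                          */
    unsigned int  cache_hot_time_us; /* tasks that ran more recently than
                                        this are treated as cache-hot        */
    unsigned int  flags;             /* e.g. balance on exec/fork/wake       */
};

/* Example values echoing the hierarchy above (purely illustrative). */
static const struct domain_params example_domains[] = {
    { "SMT",   1,   2, 110,    0, 0 },
    { "MC",    8,  16, 117,  500, 0 },
    { "NUMA", 64, 128, 125, 1000, 0 },
};
```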
You can view the kernel's scheduling domain configuration through /proc/sys/kernel/sched_domain/ (newer kernels expose it under /sys/kernel/debug/sched/domains/ instead). Each CPU has entries for each domain level, exposing parameters like min_interval, max_interval, busy_idx, and flags. These are tunable at runtime, though the defaults are appropriate for most workloads.
Work stealing is a load balancing algorithm particularly suited to parallel programming frameworks where a single application spawns many short-lived tasks. It's a specialized form of pull migration optimized for task-based parallelism.
The Work Stealing Model:
Each worker thread owns a double-ended queue (deque) of pending tasks. A worker pushes newly spawned tasks onto the bottom of its own deque and pops from the bottom as well (LIFO order, which keeps recently created, cache-warm work local). When its deque runs empty, a worker becomes a thief: it picks another worker at random and steals from the top of the victim's deque (FIFO order, which tends to take the oldest and typically largest pieces of work).
This design is elegant: workers access their own deques with only cheap, mostly uncontended synchronization. Contention only occurs during steals, which are relatively rare in balanced workloads.
```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <sched.h>

/* Work-stealing deque operations (Chase-Lev style, simplified:
 * no resizing, capacity assumed large enough). Signed indices
 * avoid underflow when popping from an empty deque. */
struct work_item {
    void (*function)(void *);
    void *arg;
};

struct ws_deque {
    struct work_item *items;
    atomic_llong top;      /* Steal from here (FIFO end) */
    atomic_llong bottom;   /* Push/pop here (LIFO end) */
    size_t capacity;
};

/* Worker pushes new work onto the bottom (local, fast path) */
void push(struct ws_deque *deque, struct work_item *item)
{
    long long b = atomic_load(&deque->bottom);
    deque->items[(size_t)b % deque->capacity] = *item;
    /* Ensure item visible before bottom update */
    atomic_thread_fence(memory_order_release);
    atomic_store(&deque->bottom, b + 1);
}

/* Worker pops from its own deque (local, fast path) */
struct work_item *pop(struct ws_deque *deque)
{
    long long b = atomic_load(&deque->bottom) - 1;
    atomic_store(&deque->bottom, b);
    atomic_thread_fence(memory_order_seq_cst);  /* Full fence for correctness */
    long long t = atomic_load(&deque->top);

    if (t <= b) {
        /* Non-empty: return item */
        struct work_item *item = &deque->items[(size_t)b % deque->capacity];
        if (t == b) {
            /* Last item: race with a potential steal */
            if (!atomic_compare_exchange_strong(&deque->top, &t, t + 1)) {
                /* Lost race to stealer */
                atomic_store(&deque->bottom, b + 1);
                return NULL;
            }
            atomic_store(&deque->bottom, b + 1);
        }
        return item;
    } else {
        /* Empty deque */
        atomic_store(&deque->bottom, b + 1);
        return NULL;
    }
}

/* Thief steals from the top of the victim's deque (remote, contended) */
struct work_item *steal(struct ws_deque *deque)
{
    long long t = atomic_load(&deque->top);
    atomic_thread_fence(memory_order_seq_cst);
    long long b = atomic_load(&deque->bottom);

    if (t < b) {
        /* Non-empty: try to steal */
        struct work_item *item = &deque->items[(size_t)t % deque->capacity];
        if (atomic_compare_exchange_strong(&deque->top, &t, t + 1))
            return item;  /* Success */
    }
    return NULL;  /* Empty or lost race */
}

/* Set to true to stop all workers */
static atomic_bool shutdown_requested;

/* Main worker loop */
void worker_loop(int worker_id, struct ws_deque *deques, int num_workers)
{
    struct ws_deque *my_deque = &deques[worker_id];

    while (!atomic_load(&shutdown_requested)) {
        /* Try to pop from own deque */
        struct work_item *item = pop(my_deque);

        if (item == NULL) {
            /* Own deque empty: try stealing */
            for (int attempts = 0; attempts < num_workers * 2; attempts++) {
                int victim = random() % num_workers;
                if (victim == worker_id)
                    continue;
                item = steal(&deques[victim]);
                if (item != NULL)
                    break;
            }
        }

        if (item != NULL) {
            /* Execute the work */
            item->function(item->arg);
        } else {
            /* No work found: back off */
            sched_yield();
        }
    }
}
```
Why Work Stealing Works:
Work Stealing in Practice:
Work stealing is implemented in many parallel frameworks, including Cilk, Intel Threading Building Blocks (TBB), Java's ForkJoinPool, the Go runtime scheduler, Rust's Rayon, and .NET's Task Parallel Library.
Work stealing operates at the application/runtime level on user-space tasks, while OS load balancing operates at the kernel level on kernel-scheduled threads. They're complementary: the OS balances threads across CPUs, while work stealing balances tasks (fine-grained work units) across threads. A well-designed system uses both appropriately.
Effective load balancing requires careful calibration of how load is measured and how frequently balancing runs.
Load Metrics in Linux:
Linux uses load weight rather than simple task count:
| Nice Value | Weight | Relative to Default | Interpretation |
|---|---|---|---|
| -20 (highest priority) | 88761 | 86.7x | Counts as 87 normal tasks |
| -10 | 9548 | 9.3x | Counts as 9 normal tasks |
| 0 (default) | 1024 | 1.0x | Baseline weight |
| +10 | 110 | 0.11x | 1/9 of a normal task |
| +19 (lowest priority) | 15 | 0.015x | Nearly invisible load |
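To see how these weights feed balancing decisions, here is a small self-contained sketch (it hard-codes only the five weights from the table above; everything else is illustrative) that sums per-task weights into a run-queue load figure:

```c
#include <stdio.h>
#include <stddef.h>

/* Weights for the nice values shown in the table above. */
static unsigned int weight_for_nice(int nice)
{
    switch (nice) {
    case -20: return 88761;
    case -10: return 9548;
    case 0:   return 1024;
    case 10:  return 110;
    case 19:  return 15;
    default:  return 1024;  /* treat anything else as default here */
    }
}

/* Weighted load: the sum of the weights of all runnable tasks. */
static unsigned long queue_load(const int *nice_values, size_t n)
{
    unsigned long load = 0;
    for (size_t i = 0; i < n; i++)
        load += weight_for_nice(nice_values[i]);
    return load;
}

int main(void)
{
    /* CPU A: one nice -10 task. CPU B: four nice +19 background tasks. */
    int cpu_a[] = { -10 };
    int cpu_b[] = { 19, 19, 19, 19 };

    /* Prints 9548 vs 60: CPU B runs more tasks but carries far less
     * weighted load, so the balancer should not pull from it. */
    printf("CPU A load = %lu, CPU B load = %lu\n",
           queue_load(cpu_a, 1), queue_load(cpu_b, 4));
    return 0;
}
```

Raw task counts would make CPU B look four times busier than CPU A; weighted load correctly identifies CPU A as the heavier queue.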
Balancing Frequency:
Balancing frequency is a critical parameter:
Too Frequent: migrations outpace cache warm-up, so tasks repeatedly run with cold caches, and the balancer itself consumes a noticeable share of CPU time.
Too Infrequent: CPUs sit idle or overloaded for long stretches before an imbalance is corrected, costing both throughput and responsiveness.
Linux Balancing Intervals (defaults vary by kernel version): intervals scale with the domain level, on the order of a few milliseconds within a core or socket and tens of milliseconds across sockets and NUMA nodes, as sketched in the hierarchy above.
Load Balance Tuning Considerations:
```
BALANCE FREQUENCY SPECTRUM

  Low Frequency (Conservative)              High Frequency (Aggressive)
  ◄────────────────────────────────────────────────────────────────►
  Less migration                            More migration
  Better cache efficiency                   Worse cache efficiency
  Potential idle time                       Potential thrashing
  Lower overhead                            Higher overhead
  Slower response to imbalance              Faster response to imbalance

Workload-Specific Tuning:

  BATCH/HPC WORKLOADS:
  - Prefer lower frequency
  - Large working sets; migration expensive
  - Predictable task placement preferred
  - Tune: increase balance_interval, increase imbalance_pct threshold

  LATENCY-SENSITIVE WORKLOADS:
  - Prefer higher frequency (within reason)
  - Small tasks; migration cheap
  - Quick response to arriving work critical
  - Tune: decrease balance_interval, enable SD_WAKE_AFFINE

  MIXED WORKLOADS:
  - Rely on the scheduling domain hierarchy
  - Frequent balancing within a socket (cheap)
  - Infrequent balancing across sockets (expensive)
  - Default settings usually appropriate
```
Load balancing itself consumes CPU cycles. On systems with many CPUs, the balancer can become a significant overhead if intervals are too short. Linux limits how long newidle balancing may run by comparing a run queue's average idle time against the measured cost of past balance attempts, so that balancing does not dominate scheduling. If profiling shows significant time in load_balance(), consider increasing balance intervals.
System administrators and developers can tune load balancing behavior through various mechanisms. Here's a practical guide to common scenarios.
Scenario 1: Reduce Migration for Latency Sensitivity
For real-time or latency-critical workloads:
# Raise the cache-hot threshold so tasks are migrated less readily (sysctl; exact path varies by kernel version)
echo 5000000 > /proc/sys/kernel/sched_migration_cost_ns  # Default: ~500000 (0.5ms); 5000000 = 5ms
# Use CPU isolation to remove CPUs from balancing entirely
# Add to kernel boot parameters:
isolcpus=4-7
# Pin critical processes to isolated CPUs
taskset -c 4 ./critical_process
Scenario 2: Improve Balance for Throughput
For batch processing with many short tasks:
# Decrease min granularity (more preemption, quicker rebalancing)
echo 1000000 > /proc/sys/kernel/sched_min_granularity_ns # 1ms instead of 3ms
# Reduce cache hot threshold (willing to migrate sooner)
echo 0 > /proc/sys/kernel/sched_migration_cost_ns
Scenario 3: NUMA-Aware Balancing
For memory-intensive workloads on NUMA systems:
# View NUMA topology
numactl --hardware
# Prefer local memory allocation
echo 1 > /proc/sys/kernel/numa_balancing # Enable automatic NUMA balancing
# For specific application: bind to NUMA node
numactl --cpunodebind=0 --membind=0 ./application
Observing Load Balancing:
Monitor balancer effectiveness:
# View per-CPU load
watch -n 0.5 'cat /proc/loadavg; mpstat -P ALL 1 1'
# View scheduler statistics
cat /proc/schedstat
# Trace scheduling and migration events (uses scheduler tracepoints)
perf sched record sleep 10
perf sched latency
| Symptom | Likely Cause | Tuning Action |
|---|---|---|
| CPUs idle while others overloaded | Balance interval too long or affinity blocking | Check affinity masks; reduce balance interval |
| Excessive context switches | Migration too aggressive | Increase sched_migration_cost_ns |
| Latency spikes on critical CPUs | Unwanted migration of critical tasks | Use CPU isolation (isolcpus) for critical tasks |
| Poor cache hit rates | Excessive migration | Increase migration cost; check for false affinity |
| Imbalance persists on idle CPUs | Pull migration failing (affinity, newidle disabled) | Verify newidle balancing enabled; check for bugs |
Modern schedulers are sophisticated. Before tuning, confirm that the problem is actually load balancing and not application architecture, affinity misuse, or lock contention. Profile with perf and schedstat before making changes. Many 'load balancing problems' turn out to be something else entirely.
We have explored load balancing from fundamental concepts through practical tuning. This knowledge enables you to understand, diagnose, and optimize work distribution in multi-processor systems.
Consolidating Our Understanding: load imbalance is inevitable with per-CPU run queues; push and pull migration correct it; scheduling domains make that correction topology-aware; work stealing applies the same idea inside parallel runtimes; and tuning ultimately trades cache affinity against responsiveness to imbalance.
What's Next:
With load balancing understood, we'll explore NUMA Considerations—the final topic in our multi-processor scheduling module. NUMA (Non-Uniform Memory Access) adds another dimension to scheduling: not all memory is equidistant from all CPUs. Understanding NUMA is essential for scaling to large multi-socket systems.
You now understand load balancing mechanisms at the depth required for kernel-level reasoning and production system optimization. You can analyze balancer behavior, diagnose imbalance problems, and tune systems for optimal work distribution. This knowledge is essential for anyone working with high-performance multi-processor systems.