We've explored what load balancing does (push, pull, and work stealing) and how it works (algorithms and data structures). Now we address an equally critical question: when should load balancing occur?
Balancing frequency represents a fundamental tradeoff in system design. Balance too frequently, and you waste CPU cycles on overhead—computing loads, acquiring locks, migrating tasks—cycles that could be spent on actual work. Balance too infrequently, and imbalances persist, leaving CPUs idle while others are overloaded.
This page explores balancing frequency in depth: the theoretical considerations, the practical tradeoffs, interval selection strategies, and the adaptive mechanisms that let systems self-tune across varying workload conditions.
By completing this page, you will understand: (1) The overhead-responsiveness tradeoff in balancing frequency, (2) Timer-based vs. event-driven balance triggers, (3) Hierarchical frequency strategies for multi-level topologies, (4) Adaptive frequency mechanisms that respond to workload dynamics, and (5) Production scheduler tuning parameters for balance frequency.
Every load balancing invocation consumes resources. Understanding these costs is essential for making informed frequency decisions.
The Cost Components
Load balancing incurs several distinct costs:
| Component | Cost Range | Frequency Impact |
|---|---|---|
| Load calculation | 100-500 ns | Scales with CPU count |
| Lock operations | 50-200 ns uncontended | Increases with contention |
| Cache effects | Variable, 1-10 μs effective | Depends on balance code size |
| IPI delivery | 1-5 μs | Per remote CPU involved |
| Task migration | 5-50 μs | Only when migration occurs |
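As a rough sanity check, these components can be summed to estimate the cost of a single balance invocation. The sketch below uses midpoint values from the table; the constants are illustrative placeholders, not measurements from any particular machine.

```c
#include <stdio.h>

/* Back-of-the-envelope cost of one balance invocation, summing midpoint
 * values from the table above. All constants are illustrative. */
int main(void)
{
    double load_calc_ns = 300.0;     /* load calculation: ~100-500 ns */
    double lock_ns      = 125.0;     /* uncontended locks: ~50-200 ns */
    double cache_ns     = 5000.0;    /* cache effects: ~1-10 us       */
    double ipi_ns       = 3000.0;    /* one IPI: ~1-5 us              */
    double migrate_ns   = 25000.0;   /* one migration: ~5-50 us       */

    double check_only = load_calc_ns + lock_ns + cache_ns;
    double with_move  = check_only + ipi_ns + migrate_ns;

    printf("balance check only: ~%.1f us\n", check_only / 1000.0);
    printf("check + migration : ~%.1f us\n", with_move / 1000.0);
    return 0;
}
```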
The Benefit Model
Balancing provides benefit when it corrects imbalances that would otherwise waste CPU cycles:
Benefit = (Idle_Time_Prevented) × (CPU_Throughput)
If a CPU sits idle for 10ms when it could have been running a task, the lost benefit is 10ms of compute capacity. Frequent balancing catches imbalances early; infrequent balancing lets them accumulate.
The Optimal Balance Point
The optimal frequency maximizes net benefit:
Net_Benefit = Imbalance_Reduction_Benefit - Balance_Overhead_Cost
This optimal point depends on workload characteristics: highly dynamic workloads (frequent task creation, destruction, and load shifts) justify more frequent balancing, while stable workloads need far less.
A common guideline: scheduler overhead (including load balancing) should consume less than 1% of total CPU time. If balance operations are running every 1ms and each takes 10μs, that's 1% just for balancing. Monitor /proc/schedstat or equivalent to verify overhead stays within budget.
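To make that budget concrete, the sketch below computes the overhead fraction from an assumed per-invocation cost and balance interval; both inputs are placeholders that would, in practice, come from measurement (for example, deltas in /proc/schedstat over a known window).

```c
#include <stdio.h>

/* Does a given balance cost and interval fit a 1% overhead budget?
 * The inputs are assumed values; replace them with measured ones. */
int main(void)
{
    double cost_per_balance_us = 10.0;    /* assumed cost of one invocation */
    double balance_interval_us = 1000.0;  /* one balance every 1 ms         */
    double budget_pct          = 1.0;     /* guideline: <1% of CPU time     */

    double overhead_pct = 100.0 * cost_per_balance_us / balance_interval_us;

    printf("balance overhead: %.2f%% of CPU time (%s the %.1f%% budget)\n",
           overhead_pct,
           overhead_pct <= budget_pct ? "within" : "exceeds",
           budget_pct);
    return 0;
}
```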
Load balancing can be triggered by two fundamental mechanisms: timer-based (periodic) and event-driven. Most production schedulers use a combination of both.
Timer-Based (Periodic) Balancing
A timer interrupt fires at regular intervals, invoking the load balance routine regardless of workload state.
```c
/* Timer-based periodic load balancing */

#define BALANCE_TICK_MS 4   /* Base balance interval */
#define JITTER_FACTOR   8   /* Spread timer across CPUs */

/* Initialize per-CPU balance timer */
void init_balance_timer(int cpu)
{
    struct run_queue *rq = cpu_rq(cpu);

    /* Stagger timers to prevent synchronization surge */
    unsigned long initial_delay = BALANCE_TICK_MS + (cpu % JITTER_FACTOR);

    timer_setup(&rq->balance_timer, balance_timer_handler, 0);
    mod_timer(&rq->balance_timer, jiffies + msecs_to_jiffies(initial_delay));
}

/* Timer handler - invoked periodically */
void balance_timer_handler(struct timer_list *timer)
{
    struct run_queue *rq = container_of(timer, struct run_queue, balance_timer);

    /* Perform balance check */
    if (should_balance(rq)) {
        trigger_balance(rq);
    }

    /* Re-arm timer for next interval */
    unsigned long next_interval = calculate_next_interval(rq);
    mod_timer(&rq->balance_timer, jiffies + next_interval);
}

/* Calculate next balance interval (may be adaptive) */
unsigned long calculate_next_interval(struct run_queue *rq)
{
    unsigned long base = msecs_to_jiffies(BALANCE_TICK_MS);

    /* If last balance found work, keep checking frequently */
    if (rq->last_balance_migrated > 0) {
        return base;
    }

    /* If system seems balanced, increase interval */
    if (rq->balance_attempts_without_migration > 10) {
        return base * 2;   /* Back off when nothing to do */
    }

    return base;
}
```

Event-Driven Balancing
Event-driven balancing triggers on specific scheduling events rather than time passage:
```c
/* Event-driven load balancing triggers */

/* Event 1: CPU going idle - try to pull work */
void schedule_idle_balance(struct run_queue *rq)
{
    /* Immediate pull attempt when becoming idle */
    pull_tasks_from_busy_cpus(rq);
}

/* Event 2: Task wakeup - consider placement */
int select_task_rq_fair(struct task_struct *p, int prev_cpu)
{
    /* Where should this waking task run? */
    int new_cpu = find_idlest_cpu(p, prev_cpu);

    if (new_cpu != prev_cpu) {
        /* Implicit "push" via wake affinity */
        migrate_task_rq(p, new_cpu);
    }
    return new_cpu;
}

/* Event 3: Task creation - inherent load injection */
void wake_up_new_task(struct task_struct *p)
{
    int cpu = select_task_rq(p, task_cpu(p));

    /* New task may trigger immediate balance if it overloads */
    if (rq_of(cpu)->nr_running > OVERLOAD_THRESHOLD) {
        trigger_push_balance(rq_of(cpu));
    }
}

/* Event 4: Affinity change - may require migration */
void set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
{
    int old_cpu = task_cpu(p);

    if (!cpumask_test_cpu(old_cpu, new_mask)) {
        /* Current CPU no longer allowed - must migrate */
        int new_cpu = cpumask_first(new_mask);
        migrate_task_immediately(p, new_cpu);
    }
}
```

Production schedulers combine both triggers. Timer-based balancing catches gradual drift and provides a safety net. Event-driven balancing provides immediate response to significant changes. The combination covers all scenarios.
Modern systems have hierarchical CPU topologies: SMT siblings, cores on a socket, sockets in a NUMA node, multiple NUMA nodes. Migration cost varies dramatically across these levels, and so should balancing frequency.
The Hierarchical Principle
Balance frequently at low-cost levels; balance cautiously at high-cost levels.
This principle reflects the asymmetric cost structure:
| Domain Level | SD_FLAG | Typical Interval | Rationale |
|---|---|---|---|
| SMT | SD_SHARE_CPUCAPACITY | 1-2 ms | Nearly free migration |
| MC (Multi-Core) | SD_SHARE_PKG_RESOURCES | 4-8 ms | Shared L3, low cost |
| DIE | SD_ASYM_PACKING | 8-16 ms | Same die, moderate cost |
| NUMA | SD_NUMA | 16-64 ms | High cost, cautious approach |
```c
/* Hierarchical load balancing implementation */

/* Per-domain balance parameters */
struct sched_domain {
    /* Domain identification */
    int level;                        /* 0=SMT, 1=MC, 2=DIE, 3=NUMA */
    unsigned long flags;              /* SD_* flags */

    /* Balance interval parameters */
    unsigned int balance_interval;    /* Ticks between balance attempts */
    unsigned int max_interval;        /* Maximum interval (backed off) */
    unsigned int min_interval;        /* Minimum interval (urgent) */

    /* Balance timing tracking */
    unsigned long last_balance;       /* Jiffies of last balance */
    unsigned long next_balance;       /* Jiffies of next balance */

    /* Balance cost tracking for adaptation */
    unsigned long avg_balance_cost;
    unsigned long avg_balance_benefit;
};

/* Domain-aware balance trigger check */
void run_rebalance_domains(struct run_queue *rq)
{
    struct sched_domain *sd;

    /* Walk domain hierarchy from innermost to outermost */
    for_each_domain(rq_cpu(rq), sd) {
        /* Check if this domain is due for balance */
        if (time_before(jiffies, sd->next_balance)) {
            continue;   /* Not yet time for this domain */
        }

        /* Perform balance at this domain level */
        int migrated = load_balance(sd, rq);

        /* Update next balance time based on results */
        update_domain_balance_interval(sd, migrated);
    }
}

/* Adaptive interval update based on balance results */
void update_domain_balance_interval(struct sched_domain *sd, int migrated)
{
    if (migrated > 0) {
        /* Successful balance - maintain current or decrease interval */
        sd->balance_interval = max(sd->balance_interval * 3 / 4,
                                   sd->min_interval);
    } else {
        /* No migration needed - increase interval */
        sd->balance_interval = min(sd->balance_interval * 5 / 4,
                                   sd->max_interval);
    }

    sd->next_balance = jiffies + sd->balance_interval;
    sd->last_balance = jiffies;
}

/* Initialize domain intervals based on level */
void init_domain_intervals(struct sched_domain *sd)
{
    switch (sd->level) {
    case 0: /* SMT */
        sd->min_interval     = msecs_to_jiffies(1);
        sd->balance_interval = msecs_to_jiffies(2);
        sd->max_interval     = msecs_to_jiffies(8);
        break;
    case 1: /* MC */
        sd->min_interval     = msecs_to_jiffies(4);
        sd->balance_interval = msecs_to_jiffies(8);
        sd->max_interval     = msecs_to_jiffies(32);
        break;
    case 2: /* DIE */
        sd->min_interval     = msecs_to_jiffies(8);
        sd->balance_interval = msecs_to_jiffies(16);
        sd->max_interval     = msecs_to_jiffies(64);
        break;
    case 3: /* NUMA */
        sd->min_interval     = msecs_to_jiffies(16);
        sd->balance_interval = msecs_to_jiffies(64);
        sd->max_interval     = msecs_to_jiffies(256);
        break;
    }
}
```

When balance at a lower domain succeeds, higher domains can sometimes be skipped. If SMT balancing fixes the imbalance, there's no need to incur NUMA balancing cost. This 'domain stopping' optimization reduces overhead while maintaining effectiveness.
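A minimal sketch of how domain stopping might be wired into the walk above, in the same pseudocode style: `load_balance_with_flag()` and the `continue_balancing` flag are illustrative names (Linux's `rebalance_domains()` uses a similar flag), not an exact kernel interface.

```c
/* Sketch: stop walking outward once an inner domain resolves the imbalance.
 * load_balance_with_flag() and continue_balancing are illustrative names. */
void run_rebalance_domains_stopping(struct run_queue *rq)
{
    struct sched_domain *sd;
    int continue_balancing = 1;   /* Cleared when this CPU is balanced enough */

    for_each_domain(rq_cpu(rq), sd) {
        if (!continue_balancing)
            break;   /* Inner level already fixed it - skip costlier levels */

        if (time_before(jiffies, sd->next_balance))
            continue;   /* This level not due yet */

        /* The balance routine clears the flag when no further pulling
         * at outer (more expensive) levels is needed this round. */
        int migrated = load_balance_with_flag(sd, rq, &continue_balancing);
        update_domain_balance_interval(sd, migrated);
    }
}
```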
Static balance intervals are suboptimal for dynamic workloads. Adaptive mechanisms adjust frequency based on observed behavior, achieving better overhead-benefit balance.
Workload-Responsive Adaptation
The key insight: balance frequency should track workload dynamism. When tasks are being created/destroyed rapidly, balance often. When workload is stable, balance rarely.
```c
/* Adaptive balance frequency based on workload dynamics */

struct balance_statistics {
    /* Activity metrics */
    unsigned long task_arrivals;      /* Tasks placed on queue */
    unsigned long task_departures;    /* Tasks leaving queue */
    unsigned long load_changes;       /* Significant load deltas */

    /* Balance result metrics */
    unsigned long balance_attempts;   /* Total balance invocations */
    unsigned long migrations;         /* Successful migrations */
    unsigned long failed_finds;       /* No imbalance found */

    /* Timing metrics */
    unsigned long balance_cost_sum;   /* Cumulative time in balance */
    unsigned long balance_time_sum;   /* Time saved by migrations */
};

/* Per-CPU balance stats */
DEFINE_PER_CPU(struct balance_statistics, balance_stats);

/* Calculate optimal interval based on recent history */
unsigned long calculate_adaptive_interval(struct run_queue *rq)
{
    struct balance_statistics *stats = &per_cpu(balance_stats, rq_cpu(rq));
    unsigned long base_interval = msecs_to_jiffies(4);
    unsigned long interval;

    /* Factor 1: Activity level */
    unsigned long activity = stats->task_arrivals + stats->task_departures;
    if (activity > 100) {
        /* High activity - balance frequently */
        interval = base_interval / 2;
    } else if (activity < 10) {
        /* Low activity - balance rarely */
        interval = base_interval * 4;
    } else {
        interval = base_interval;
    }

    /* Factor 2: Migration success rate */
    if (stats->balance_attempts > 0) {
        unsigned long success_rate =
            stats->migrations * 100 / stats->balance_attempts;

        if (success_rate > 50) {
            /* Frequently finding work to migrate - stay aggressive */
            interval = interval * 3 / 4;
        } else if (success_rate < 10) {
            /* Rarely finding imbalance - back off */
            interval = interval * 3 / 2;
        }
    }

    /* Factor 3: Balance cost vs benefit */
    if (stats->balance_cost_sum > stats->balance_time_sum) {
        /* Spending more on balance than we're gaining - back off */
        interval = interval * 2;
    }

    /* Clamp to reasonable range */
    interval = clamp(interval, msecs_to_jiffies(1), msecs_to_jiffies(100));

    /* Periodic reset of stats for freshness */
    if (stats->balance_attempts > 1000) {
        memset(stats, 0, sizeof(*stats));
    }

    return interval;
}
```

NOHZ Operation and Balance Deferral
Modern kernels support tickless ('nohz') operation: idle CPUs stop their timer tick, and in 'nohz full' mode even a busy CPU running a single task can shut the tick off. A CPU without a tick never runs periodic balance on its own, which creates special considerations for load balancing:
```c
/* NOHZ-aware load balancing */

/* In NOHZ mode, idle CPUs don't receive timer interrupts.
 * We need other mechanisms to ensure balance still occurs. */

/* Designate one CPU as the "nohz balancer" */
static int nohz_balance_cpu = -1;

/* Called when a CPU goes into nohz-idle */
void nohz_balance_enter_idle(int cpu)
{
    /* If no balancer designated, or we're the first idle, take the role */
    if (nohz_balance_cpu < 0) {
        nohz_balance_cpu = cpu;
    }

    /* Otherwise, request wake when balance is needed */
    cpumask_set_cpu(cpu, nohz_idle_cpus);
}

/* Called when a CPU exits nohz-idle */
void nohz_balance_exit_idle(int cpu)
{
    cpumask_clear_cpu(cpu, nohz_idle_cpus);

    /* If we were the designated balancer, transfer role */
    if (nohz_balance_cpu == cpu) {
        nohz_balance_cpu = cpumask_first(nohz_idle_cpus);
    }
}

/* The designated balancer runs balance on behalf of idle CPUs */
void nohz_idle_balance(struct run_queue *this_rq)
{
    int this_cpu = rq_cpu(this_rq);

    if (this_cpu != nohz_balance_cpu) {
        return;   /* Not our job */
    }

    /* Balance for all idle CPUs */
    int cpu;
    for_each_cpu(cpu, nohz_idle_cpus) {
        if (cpu == this_cpu)
            continue;

        struct run_queue *rq = cpu_rq(cpu);

        /* Check if remote CPU needs balancing */
        if (need_nohz_balance(rq)) {
            /* Option 1: Balance locally on their behalf */
            /* Option 2: Send IPI to wake them for self-balance */
            if (cpumask_weight(nohz_idle_cpus) > nr_cpus / 2) {
                /* Mostly idle - do it ourselves */
                trigger_nohz_balance(rq);
            } else {
                /* Mostly busy - wake the idle CPU */
                kick_idle_cpu(cpu);
            }
        }
    }
}
```

NOHZ balancing is a significant source of scheduler complexity. When most CPUs are idle and tickless, the designated balancer becomes a hot spot. Careful throttling and distribution of balancer responsibility is essential to avoid creating new bottlenecks.
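One way to keep the designated balancer from becoming a hot spot is to rate-limit its passes and rotate the role among idle CPUs. The sketch below stays in the same pseudocode style; `next_nohz_balance` and `rotate_nohz_balancer()` are illustrative names, not kernel interfaces.

```c
/* Sketch: throttle NOHZ balance passes and rotate the balancer role so a
 * single idle CPU doesn't absorb all remote balance work. Illustrative only. */

static unsigned long next_nohz_balance;   /* Jiffies of next allowed pass */

void rotate_nohz_balancer(void)
{
    /* Hand the role to the next idle CPU, wrapping around the mask */
    int next = cpumask_next(nohz_balance_cpu, nohz_idle_cpus);

    if (next >= nr_cpus)
        next = cpumask_first(nohz_idle_cpus);
    nohz_balance_cpu = next;
}

void nohz_idle_balance_throttled(struct run_queue *this_rq)
{
    /* At most one full NOHZ pass every few milliseconds, however often
     * we are kicked by wakeups or idle entries. */
    if (time_before(jiffies, next_nohz_balance))
        return;
    next_nohz_balance = jiffies + msecs_to_jiffies(4);

    nohz_idle_balance(this_rq);

    rotate_nohz_balancer();
}
```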
Even with adaptive frequency, balance operations need throttling to prevent runaway overhead during system stress or pathological workloads.
Throttling Mechanisms
Several throttling strategies work together:
```c
/* Balance throttling and rate limiting */

/* Throttle states */
#define THROTTLE_NONE   0
#define THROTTLE_LIGHT  1
#define THROTTLE_HEAVY  2

/* Per-domain throttle tracking */
struct domain_throttle {
    int level;                      /* Current throttle level */
    unsigned long throttle_until;   /* Jiffies until throttle expires */
    unsigned long recent_cost;      /* Recent balance cost accumulation */
    unsigned long cost_threshold;   /* Cost threshold for escalation */
};

/* Check if balance should be throttled */
int check_balance_throttle(struct sched_domain *sd)
{
    struct domain_throttle *throttle = &sd->throttle;

    if (time_before(jiffies, throttle->throttle_until)) {
        return throttle->level;   /* Still throttled */
    }

    return THROTTLE_NONE;   /* Not currently throttled */
}

/* Apply throttling based on observed cost */
void apply_balance_throttle(struct sched_domain *sd, unsigned long cost)
{
    struct domain_throttle *throttle = &sd->throttle;

    throttle->recent_cost += cost;

    /* Check if cost exceeds threshold */
    if (throttle->recent_cost > throttle->cost_threshold) {
        /* Escalate throttle level */
        if (throttle->level < THROTTLE_HEAVY) {
            throttle->level++;
        }

        /* Calculate throttle duration based on level */
        unsigned long duration;
        switch (throttle->level) {
        case THROTTLE_LIGHT:
            duration = sd->balance_interval;   /* Skip one interval */
            break;
        case THROTTLE_HEAVY:
            duration = sd->max_interval;       /* Skip to max interval */
            break;
        default:
            duration = 0;
        }

        throttle->throttle_until = jiffies + duration;
        throttle->recent_cost = 0;   /* Reset cost tracker */
    }

    /* Natural decay of cost accumulation */
    if (throttle->recent_cost > 0 &&
        time_after(jiffies, sd->last_balance + HZ)) {
        throttle->recent_cost = throttle->recent_cost * 3 / 4;

        /* Also decay throttle level over time */
        if (throttle->level > THROTTLE_NONE) {
            throttle->level--;
        }
    }
}

/* Rate limit balance invocations per domain */
#define min_balance_interval(sd) \
    ((sd)->flags & SD_NUMA ? msecs_to_jiffies(16) : msecs_to_jiffies(2))

bool should_balance_domain(struct sched_domain *sd)
{
    /* Minimum interval enforcement */
    if (time_before(jiffies, sd->last_balance + min_balance_interval(sd))) {
        return false;
    }

    /* Throttle check */
    if (check_balance_throttle(sd) == THROTTLE_HEAVY) {
        return false;
    }

    /* Scheduled interval check */
    if (time_before(jiffies, sd->next_balance) &&
        check_balance_throttle(sd) != THROTTLE_NONE) {
        return false;   /* Throttled and not due */
    }

    return true;
}
```

Imbalance Threshold Scaling
Another throttling approach: as balance frequency increases, demand larger imbalances before acting. This prevents micro-migrations from dominating:
```c
/* Dynamic imbalance thresholds */

/* Minimum imbalance worth migrating - scales with frequency */
unsigned long calculate_imbalance_threshold(struct sched_domain *sd)
{
    /* Base threshold: one task's load worth */
    unsigned long base = SCHED_LOAD_SCALE;

    /* Scale up threshold if we've been balancing frequently */
    unsigned long recent_balances = sd->recent_balance_count;
    unsigned long time_window = jiffies - sd->balance_window_start;

    if (time_window > 0) {
        unsigned long balance_rate = recent_balances * HZ / time_window;

        if (balance_rate > 10) {   /* >10 balances/sec */
            /* Balancing too often - raise threshold */
            base = base * balance_rate / 10;
        }
    }

    /* Cap threshold to prevent over-accumulation */
    base = min(base, SCHED_LOAD_SCALE * 4);

    /* NUMA domains always have a higher threshold */
    if (sd->flags & SD_NUMA) {
        base = base * 2;
    }

    return base;
}

/* Only migrate if imbalance exceeds threshold */
bool imbalance_worth_migrating(struct sched_domain *sd, unsigned long imbalance)
{
    unsigned long threshold = calculate_imbalance_threshold(sd);

    if (imbalance < threshold) {
        sd->balance_rejects++;
        return false;
    }

    return true;
}
```

Effective throttling uses multiple mechanisms: rate limiting (minimum interval), cost tracking (throttle escalation), and threshold scaling (demand larger imbalances). Each catches different pathological patterns. Together they ensure scheduler overhead stays bounded regardless of workload behavior.
Production schedulers expose tunable parameters for balance frequency. Understanding these allows workload-specific optimization.
Linux Scheduler Tunables
Linux's CFS exposes several balance-related parameters via /proc/sys/kernel/ (newer kernels relocate most of them under /sys/kernel/debug/sched/):
| Parameter | Default | Effect | Workload Guidance |
|---|---|---|---|
| sched_migration_cost_ns | 500000 (500μs) | Minimum time between migrations for a task | Increase for cache-sensitive workloads |
| sched_nr_migrate | 32 | Max tasks to migrate per balance | Decrease for latency-sensitive |
| sched_min_granularity_ns | 3000000 | Indirect: affects balance opportunity | Affects how often balance can trigger |
| numa_balancing_scan_delay_ms | 1000 | NUMA-specific: initial scan delay | Increase for memory-bound workloads |
```bash
#!/bin/bash
# Linux scheduler balance tuning examples

# View current balance settings
cat /proc/sys/kernel/sched_migration_cost_ns
cat /proc/sys/kernel/sched_nr_migrate

# For latency-sensitive workloads: reduce migration aggressiveness
echo 1000000 > /proc/sys/kernel/sched_migration_cost_ns   # 1ms
echo 8 > /proc/sys/kernel/sched_nr_migrate                # Fewer per batch

# For throughput workloads: increase migration willingness
echo 100000 > /proc/sys/kernel/sched_migration_cost_ns    # 100μs
echo 64 > /proc/sys/kernel/sched_nr_migrate               # More per batch

# For NUMA systems: tune NUMA-specific balancing
echo 0 > /proc/sys/kernel/numa_balancing                  # Disable auto NUMA
# Or fine-tune:
echo 2000 > /proc/sys/kernel/numa_balancing_scan_delay_ms
echo 10 > /proc/sys/kernel/numa_balancing_scan_period_min_ms
echo 60000 > /proc/sys/kernel/numa_balancing_scan_period_max_ms

# Monitor balance statistics
cat /proc/schedstat | grep -E "domain|balance"

# Per-domain balance intervals (debugfs must be mounted)
for cpu in /sys/kernel/debug/sched/domains/cpu*/domain*; do
    echo "=== $cpu ==="
    cat "$cpu/min_interval"
    cat "$cpu/max_interval"
done
```

Domain-Level Tuning
For fine-grained control, scheduling domain parameters can be adjusted via sysfs or boot-time configuration:
```c
/* Programmatic domain tuning (kernel module or patch) */

/* Adjust domain parameters during initialization */
void tune_sched_domain(struct sched_domain *sd, int workload_type)
{
    switch (workload_type) {
    case WORKLOAD_LATENCY_SENSITIVE:
        /* Minimize migration to preserve cache locality */
        sd->balance_interval *= 2;
        sd->max_interval *= 2;
        sd->min_interval = sd->min_interval * 3 / 2;
        sd->imbalance_pct = 133;   /* Require 33% imbalance */
        break;

    case WORKLOAD_THROUGHPUT:
        /* Aggressive balance for maximum utilization */
        sd->balance_interval /= 2;
        sd->min_interval = msecs_to_jiffies(1);
        sd->imbalance_pct = 110;   /* 10% imbalance triggers */
        break;

    case WORKLOAD_MIXED:
        /* Default balanced settings */
        break;

    case WORKLOAD_NUMA_BOUND:
        /* Very conservative cross-NUMA balance */
        if (sd->flags & SD_NUMA) {
            sd->balance_interval *= 4;
            sd->max_interval *= 4;
            sd->imbalance_pct = 150;   /* 50% imbalance required */
        }
        break;
    }
}

/* User-space can request workload hints via cgroups or syscalls */
int set_workload_hint(pid_t pid, int hint)
{
    struct task_struct *p = find_task_by_vpid(pid);
    if (!p)
        return -ESRCH;

    p->workload_hint = hint;

    /* Hint affects balance decisions for this task */
    if (hint == HINT_LATENCY_SENSITIVE) {
        p->migration_disabled++;   /* Discourage migration */
    }

    return 0;
}
```

Always measure current behavior before changing scheduler parameters. Use perf sched, schedstat, and application-level metrics to establish a baseline. Tune one parameter at a time, measuring impact. Scheduler tuning is highly workload-dependent: what helps one application may hurt another.
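A simple before/after capture like the following can show whether a tuning change actually moved balance behavior. This is an illustrative sequence that assumes perf is installed and debugfs is mounted; the output file names are arbitrary.

```bash
#!/bin/bash
# Illustrative baseline capture around a single tunable change.

# Snapshot scheduler stats (includes per-domain load-balance counters)
cat /proc/schedstat > schedstat.before

# Record scheduling events for 10 seconds of representative load,
# then summarize per-task scheduling latency
perf sched record -- sleep 10
perf sched latency > sched_latency.before

# ... change ONE tunable, rerun the same workload, capture *.after files ...

# Compare balance counters and latency distributions
# diff schedstat.before schedstat.after
# diff sched_latency.before sched_latency.after
```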
Balancing frequency sits at the heart of scheduler design—too frequent wastes cycles, too infrequent wastes parallelism. The key insights: weigh balance overhead against the idle time it prevents, combine periodic timers with event-driven triggers, scale intervals with migration cost across the topology hierarchy, and rely on adaptation and throttling to keep overhead bounded as workloads change.
Connecting to the Broader Picture
With push, pull, work stealing, and frequency understood, we have a complete picture of mechanisms for load balancing. The final page of this module examines metrics—how we quantify load, measure balance quality, and evaluate scheduler effectiveness.
You now understand balancing frequency—the timing dimension of load balancing. You can reason about overhead-benefit tradeoffs, recognize when adaptive frequency is beneficial, and tune scheduler parameters for specific workload requirements. Next: load balancing metrics.