On the previous page, we explored push migration—the proactive approach where overloaded CPUs redistribute their excess work. But what happens when this process fails to achieve perfect balance? What if some CPUs become idle while others remain busy, yet the push thresholds haven't been triggered?
Enter pull migration: the complementary mechanism where idle or underloaded processors actively seek work from their busier peers.
Pull migration embodies a demand-driven philosophy. Rather than waiting for overloaded CPUs to notice and react, idle CPUs take initiative—they 'pull' work toward themselves, ensuring that available computational capacity is never wasted while runnable tasks exist elsewhere in the system.
This page provides a comprehensive exploration of pull migration: its design rationale, implementation strategies, integration with push migration, and its critical role in modern operating system schedulers.
By completing this page, you will understand: (1) Why pull migration is necessary alongside push migration, (2) The idle CPU detection mechanisms that trigger pulling, (3) Algorithms for finding work to pull and selecting appropriate tasks, (4) The interplay between push and pull in maintaining system balance, and (5) How production schedulers implement efficient pull migration.
While push migration addresses overload from the sender's perspective, pull migration addresses underutilization from the receiver's perspective. Understanding why both mechanisms are necessary requires appreciating their complementary roles.
Why Push Alone is Insufficient
Push migration has inherent limitations that create gaps in load balancing coverage:
Threshold Sensitivity — Push thresholds prevent every minor imbalance from triggering migration. This conservatism, while necessary to prevent thrashing, means some imbalances persist.
Detection Delay — Overloaded CPUs check for imbalance periodically (not continuously), creating windows where imbalance exists but isn't addressed.
Race Conditions — Between detecting overload and completing migration, the situation may change, leaving the push incomplete.
Asymmetric Load Patterns — Some workloads create imbalance without any single CPU becoming 'overloaded' by threshold definitions.
Pull migration fills these gaps by having idle CPUs actively scan for work rather than passively waiting. For example, on an eight-CPU system where seven CPUs each carry a load of 50 and one is idle, the system average is about 44; no CPU exceeds a 125% push threshold (≈55), yet one CPU sits idle while runnable work exists elsewhere—only a pull closes that gap.
Formal Definition
Pull migration is a load balancing mechanism in which a processor P, upon becoming idle or detecting its load L(P) is significantly below the system average, identifies processors with excess load and transfers one or more tasks from those processors to itself.
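The definition reduces to a trigger predicate evaluated on the prospective puller. Here is a minimal sketch, assuming an illustrative 75% underload threshold (the names `should_pull` and `PULL_UNDERLOAD_PCT` are not from any real scheduler; the 75% figure merely mirrors the UNDERLOAD_THRESHOLD used in the tick-based example later on this page):

```c
#include <stdbool.h>

#define PULL_UNDERLOAD_PCT 75  /* illustrative: pull if below 75% of average */

/* A CPU qualifies as a puller if it is idle, or clearly below average load */
static bool should_pull(unsigned long cpu_load, unsigned long avg_load,
                        unsigned int nr_running)
{
    if (nr_running == 0)
        return true;  /* idle: always look for work */

    return cpu_load * 100 < avg_load * PULL_UNDERLOAD_PCT;
}
```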
The Idle Balance Problem
The core problem pull migration solves is idle balance: ensuring that when a CPU runs out of work, it doesn't passively enter idle state if runnable tasks exist elsewhere.
Consider this scenario: CPU 0 holds several runnable tasks—not enough excess to cross the push threshold—while CPU 1 has just finished its last task and is about to go idle.
Pull migration's event-driven nature catches exactly this situation—CPU 1, about to idle, checks if it can pull work from CPU 0.
Push and pull are not redundant—they operate on different triggers with different sensitivities. Push prevents severe overload buildup. Pull prevents idle CPU waste. Together they form a complete solution: push handles supply (excess work), pull handles demand (available capacity).
Pull migration is triggered when a CPU transitions to idle state or detects it is underloaded. This detection mechanism is the first critical component of the pull system.
Idle Entry Path
When the scheduler runs out of runnable tasks, the CPU is about to enter idle state. This transition point is the primary trigger for pull migration—a last chance to find work before wasting cycles:
```c
/* Scheduler's idle entry path with pull migration attempt */
void schedule_idle(void)
{
    struct run_queue *rq = this_rq();

    /* Standard check: do we have any runnable tasks? */
    if (rq->nr_running > 0) {
        /* This shouldn't happen - we have local work */
        schedule_next_task();
        return;
    }

    /* About to go idle - opportunity for pull migration */
    /* This is the critical point where pull adds value */

    /* Attempt to pull work from busy CPUs */
    if (try_to_pull_tasks(rq)) {
        /* Successfully pulled work - schedule it instead of idling */
        schedule_next_task();
        return;
    }

    /* Check idle sibling override (when we need to help siblings) */
    if (need_idle_sibling_balance()) {
        attempt_sibling_pull();
    }

    /* No work found anywhere - now we truly idle */
    enter_idle_state();
}

/* Check if we should attempt pull based on system state */
static bool should_attempt_pull(struct run_queue *rq)
{
    /* We must have no local work */
    if (rq->nr_running > 0) {
        return false;
    }

    /* System must have work somewhere */
    if (atomic_read(&system_runnable_count) <= 1) {
        return false; /* No one else has work either */
    }

    /* Rate limit pull attempts to avoid burning CPU */
    if (time_before(jiffies, rq->last_pull_time + PULL_MIN_INTERVAL)) {
        return false;
    }

    return true;
}
```

Alternative Triggers for Pull
Beyond the primary idle-entry trigger, several other events can initiate pull migration:
1. Periodic Idle Rebalance
Even while idle, CPUs periodically wake to recheck for work. This catches tasks that became runnable after the initial pull attempt failed:
```c
/* Periodic idle rebalance timer handler */
void idle_balance_timer(struct run_queue *rq)
{
    /* We're idle - timer woke us to recheck */
    if (rq->nr_running > 0) {
        /* Work arrived while we were idle - stop timer, schedule */
        cancel_idle_timer(rq);
        schedule();
        return;
    }

    /* Still idle - try to pull */
    if (try_to_pull_tasks(rq)) {
        cancel_idle_timer(rq);
        schedule();
        return;
    }

    /* Still nothing - go back to idle, timer will wake us again */
    /* Exponential backoff on timer interval prevents busy-waiting */
    rq->idle_check_interval = min(rq->idle_check_interval * 2, MAX_IDLE_CHECK);
    arm_idle_timer(rq, rq->idle_check_interval);
}
```

2. Scheduler Tick Underload Detection
Even non-idle CPUs with low load can trigger pull behavior during scheduler tick:
```c
/* Scheduler tick - check if we're underloaded and should pull */
void scheduler_tick(void)
{
    struct run_queue *rq = this_rq();

    /* Update load averages */
    update_rq_load_avg(rq);

    /* Check if we're underloaded relative to system average */
    if (rq->avg_load < system_avg_load() * UNDERLOAD_THRESHOLD / 100) {
        /* We have capacity - consider pulling */
        rq->pull_pending = true;

        /* Don't pull too aggressively during tick - defer to idle path */
        if (rq->nr_running < 2 && pull_opportunity_exists()) {
            try_to_pull_tasks(rq);
        }
    }

    /* Regular tick processing... */
    account_process_tick(current);
}

#define UNDERLOAD_THRESHOLD 75 /* Pull if below 75% of average load */
```

3. New Idle (newidle) Balance
Linux CFS implements a 'newidle' balance specifically for the moment a CPU finishes its last task. This is the most performance-critical pull trigger—it happens at exactly the right moment to prevent idle cycles:
```c
/* Called just before a CPU would go idle */
int newidle_balance(struct run_queue *this_rq)
{
    struct sched_domain *sd;
    int pulled = 0;

    /* Quick sanity checks */
    if (this_rq->nr_running > 0) {
        return 0; /* Not actually idle */
    }

    /* Don't waste time if system is nearly idle */
    if (atomic_read(&system_runnable_count) <= 1) {
        return 0;
    }

    /* Walk up the scheduling domain hierarchy */
    for_each_domain(cpu_of(this_rq), sd) {
        if (time_before(jiffies, sd->newidle_check_time)) {
            continue; /* This domain was checked recently */
        }

        /* Try to balance within this domain */
        pulled = load_balance(NEWIDLE, sd, this_rq);
        if (pulled) {
            break; /* Success - stop searching */
        }

        /* Update timestamp to prevent immediate re-checking */
        sd->newidle_check_time = jiffies + sd->newidle_check_interval;

        /* Higher domains are more expensive - be more conservative */
        if (sd->flags & SD_NUMA && !pulled) {
            /* Cross-NUMA pull at newidle is expensive - bail out */
            break;
        }
    }

    return pulled;
}
```

Newidle balance runs on the critical path between task completion and idle entry. Spending too long searching for work can be worse than just going idle—especially if work arrives naturally via wakeup. Production schedulers strictly limit newidle search time and domain depth.
Once a CPU decides to pull, it must identify which CPUs have work to spare. This is the inverse of push migration's destination selection—we're now looking for the source of work to pull.
Efficient Candidate Identification
Naively checking every CPU in the system is expensive, especially on large machines. Efficient implementations use hierarchical information and cached state:
```c
/* Find a CPU with excess work that we can pull from */
static int find_pull_source(struct run_queue *dst_rq, struct pull_context *ctx)
{
    int dst_cpu = rq_cpu(dst_rq);
    struct sched_domain *sd;
    int best_source = -1;
    unsigned long best_excess = 0;

    /* Strategy: Start local, expand if necessary */
    /* This minimizes migration cost when work is found nearby */
    for_each_domain(dst_cpu, sd) {
        struct sched_group *sg;

        /* Check if this domain has imbalance */
        if (!domain_has_excess_load(sd, dst_cpu)) {
            continue;
        }

        /* Find the busiest group within this domain */
        sg = find_busiest_group(sd);
        if (!sg) {
            continue;
        }

        /* Find the busiest CPU within that group */
        best_source = find_busiest_cpu_in_group(sg, dst_cpu, ctx);
        if (best_source >= 0) {
            /* Found a source at this domain level */
            break;
        }

        /* Check if we should expand search to higher domains */
        if (sd->flags & SD_NUMA) {
            /* NUMA crossing is expensive - only if really necessary */
            if (!cross_numa_pull_worthwhile(sd, ctx)) {
                break; /* Not worth crossing NUMA boundary */
            }
        }
    }

    ctx->source_cpu = best_source;
    return best_source;
}

/* Find the busiest CPU within a scheduling group */
static int find_busiest_cpu_in_group(struct sched_group *sg, int dst_cpu,
                                     struct pull_context *ctx)
{
    int busiest_cpu = -1;
    unsigned long busiest_load = 0;
    int cpu;

    for_each_cpu(cpu, sched_group_cpus(sg)) {
        struct run_queue *rq = cpu_rq(cpu);
        unsigned long load;

        if (cpu == dst_cpu) {
            continue; /* Can't pull from ourselves */
        }

        /* Skip single-task CPUs - pulling leaves them idle */
        if (rq->nr_running <= 1) {
            continue;
        }

        load = rq->avg_load;

        /* Factor in cache/NUMA relationships */
        load = adjust_load_for_distance(load, cpu, dst_cpu);

        if (load > busiest_load) {
            busiest_load = load;
            busiest_cpu = cpu;
        }
    }

    if (busiest_cpu >= 0) {
        ctx->source_load = busiest_load;
    }

    return busiest_cpu;
}
```

Busy CPU Tracking Optimizations
To avoid expensive scans, schedulers maintain cached information about busy CPUs:
```c
/* Bitmap of CPUs that currently have multiple runnable tasks */
static cpumask_t overloaded_cpus;

/* Called when a CPU's run queue state changes */
void update_overload_status(struct run_queue *rq)
{
    int cpu = rq_cpu(rq);

    if (rq->nr_running > 1) {
        cpumask_set_cpu(cpu, &overloaded_cpus);
    } else {
        cpumask_clear_cpu(cpu, &overloaded_cpus);
    }

    /* Also update per-NUMA-node tracking */
    update_node_overload_status(cpu_to_node(cpu));
}

/* Quick check: is there anyone to pull from? */
static inline bool any_overloaded_cpus(void)
{
    return !cpumask_empty(&overloaded_cpus);
}

/* Find overloaded CPU nearest to destination */
static int find_nearest_overloaded_cpu(int dst_cpu)
{
    int cpu;
    int best_cpu = -1;
    int best_distance = INT_MAX;

    for_each_cpu(cpu, &overloaded_cpus) {
        int distance = cpu_distance(dst_cpu, cpu);

        if (distance < best_distance) {
            best_distance = distance;
            best_cpu = cpu;
        }
    }

    return best_cpu;
}
```

The overloaded_cpus bitmap enables lock-free checking of pull potential. An idle CPU can quickly determine if any pull candidates exist without acquiring any run queue locks. This is essential for keeping newidle balance fast.
Cache-Distance Aware Selection
When multiple source candidates exist, the pull algorithm prefers sources that minimize cache/memory impact:
| Relationship | Priority | Rationale |
|---|---|---|
| SMT Sibling | Highest | Same core, shared L1/L2 cache |
| Same Socket Core | High | Shared L3 cache |
| Same NUMA Node | Medium | Local memory, shared memory controller |
| Remote NUMA Node | Low | Remote memory access penalty |
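The earlier find_busiest_cpu_in_group() listing calls adjust_load_for_distance() without showing it. Below is a minimal sketch of what such a helper could look like, with scaling factors chosen purely to mirror the priorities in the table above; the factors and the helper names cpus_share_smt() and cpus_share_llc() are illustrative assumptions, not kernel APIs:

```c
/* Sketch: scale a source CPU's load by topological distance so that nearer,
 * cache-friendlier sources win ties. Scaling factors are illustrative only;
 * cpus_share_smt()/cpus_share_llc() are hypothetical topology helpers. */
static unsigned long adjust_load_for_distance(unsigned long load,
                                              int src_cpu, int dst_cpu)
{
    if (cpus_share_smt(src_cpu, dst_cpu))
        return load;                    /* SMT sibling: no penalty */
    if (cpus_share_llc(src_cpu, dst_cpu))
        return load * 9 / 10;           /* same socket: small penalty */
    if (cpu_to_node(src_cpu) == cpu_to_node(dst_cpu))
        return load * 3 / 4;            /* same NUMA node */
    return load / 2;                    /* remote node: strong penalty */
}
```

Because the caller keeps the candidate with the highest adjusted load, discounting distant sources means a nearby, moderately busy CPU beats a far-away, slightly busier one.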
Having identified a source CPU with excess work, the puller must select which specific task(s) to migrate. This selection is critical—pulling the wrong task can harm overall performance despite achieving better load balance.
Selection Criteria
The ideal task for pulling has a cold cache footprint, a load weight that roughly fits the measured imbalance, weak or no CPU affinity restrictions, a long wait in the source queue, and lower priority. The selection logic below scores candidates against exactly these criteria:
```c
/* Select the best task to pull from source run queue */
static struct task_struct *select_task_to_pull(struct run_queue *src_rq,
                                               struct run_queue *dst_rq,
                                               struct pull_context *ctx)
{
    struct task_struct *best_task = NULL;
    int best_score = INT_MIN;
    struct task_struct *task;
    int dst_cpu = rq_cpu(dst_rq);

    /* Must hold src_rq lock during selection */
    lockdep_assert_held(&src_rq->lock);

    list_for_each_entry(task, &src_rq->cfs_tasks, se.group_node) {
        int score = 0;

        /* Hard constraints: can this task even migrate? */
        if (!can_migrate_task(task, dst_cpu)) {
            continue;
        }

        /* Currently running tasks cannot be pulled */
        if (task == src_rq->curr) {
            continue;
        }

        /* Build a preference score for this task */

        /* 1. Cache temperature - prefer cold tasks */
        if (task_has_cold_cache(task, src_rq)) {
            score += 200; /* Strong preference for cold tasks */
        } else if (task_just_ran(task)) {
            score -= 300; /* Strong avoidance of hot tasks */
        }

        /* 2. Task load relative to imbalance */
        unsigned long task_weight = task_load_weight(task);
        if (task_weight <= ctx->imbalance) {
            /* Task fits within our capacity gap - good */
            score += 100;
        } else if (task_weight <= ctx->imbalance * 2) {
            /* Slightly oversized but acceptable */
            score += 50;
        } else {
            /* Too heavy - might over-correct */
            score -= 50;
        }

        /* 3. CPU affinity strength - check the unrestricted case first so
         * the broader weak-affinity branch doesn't shadow it */
        if (cpumask_weight(&task->cpus_allowed) == nr_cpu_ids) {
            score += 100; /* No affinity restrictions */
        } else if (cpumask_weight(&task->cpus_allowed) > 4) {
            score += 75; /* Weak affinity - happy to run almost anywhere */
        }

        /* 4. Time waiting in queue */
        unsigned long wait_time = jiffies - task->last_enqueue_time;
        if (wait_time > HZ / 10) {
            score += 50; /* Been waiting a while - help it */
        }

        /* 5. Priority considerations */
        if (task_nice(task) > 0) {
            score += 25; /* Low priority - less latency-sensitive */
        } else if (task_nice(task) < 0) {
            score -= 25; /* High priority - prefer local execution */
        }

        /* Track best candidate */
        if (score > best_score) {
            best_score = score;
            best_task = task;
        }
    }

    /* If best candidate has negative score, pulling may hurt */
    if (best_task && best_score < PULL_VIABILITY_THRESHOLD) {
        /* Signal that no good candidates exist */
        ctx->no_viable_tasks = true;
        return NULL;
    }

    return best_task;
}

#define PULL_VIABILITY_THRESHOLD -100
```

Cache Temperature Estimation
Determining whether a task's cache footprint is 'hot' or 'cold' is crucial for pull decisions. Tasks with warm caches should generally stay on their current CPU:
```c
/* Estimate whether a task's cache state is 'hot' on current CPU */
static bool task_has_cold_cache(struct task_struct *task, struct run_queue *rq)
{
    /* Time since task last executed */
    unsigned long idle_time = jiffies - task->last_run_time;

    /* Cache cold threshold varies by CPU type and cache size */
    unsigned long cache_decay_time = rq->cache_half_life;

    /* After 2-3 half-lives, cache is considered cold */
    if (idle_time > cache_decay_time * 3) {
        return true; /* Definitely cold */
    }

    /* Check if task was scheduled-out for I/O (will need data reload) */
    if (task->sched_class == SCHED_BLOCKED_IO) {
        return true; /* Will hit memory anyway */
    }

    /* Check task's working set size estimate */
    if (task->estimated_working_set < L3_CACHE_SIZE / 4) {
        /* Small working set - quick to reload on new CPU */
        return true;
    }

    return false; /* Assume hot */
}

/* Estimate how long cache stays warm for this CPU */
static void calibrate_cache_decay(struct run_queue *rq)
{
    /* Typically 1-10ms depending on cache size and access patterns */
    /* Larger caches decay slower */
    unsigned long l3_size_mb = this_cpu_l3_size() / (1024 * 1024);

    /* Rough heuristic: 1ms per 2MB of L3 */
    rq->cache_half_life = msecs_to_jiffies(l3_size_mb / 2);

    /* Clamp to reasonable bounds */
    rq->cache_half_life = clamp(rq->cache_half_life,
                                msecs_to_jiffies(1),
                                msecs_to_jiffies(10));
}
```

Cache temperature estimation is inherently imperfect—the scheduler cannot directly observe cache contents. These heuristics provide reasonable approximations that work well in aggregate, even if individual decisions are sometimes suboptimal.
With source and task identified, the pull migration execution transfers the task to the pulling CPU. This phase requires careful synchronization and follows a specific protocol:
```c
/* Main pull migration implementation */
static int try_to_pull_tasks(struct run_queue *this_rq)
{
    int pulled = 0;
    struct pull_context ctx = { 0 };
    int src_cpu;

    /* Quick check: anyone to pull from? */
    if (!any_overloaded_cpus()) {
        return 0;
    }

    /* Find a source CPU */
    src_cpu = find_pull_source(this_rq, &ctx);
    if (src_cpu < 0) {
        return 0; /* No suitable source found */
    }

    /* Acquire locks in consistent order to prevent deadlock */
    struct run_queue *src_rq = cpu_rq(src_cpu);
    pull_lock_pair(this_rq, src_rq);

    /* Re-verify conditions after getting locks */
    if (!pull_still_viable(this_rq, src_rq, &ctx)) {
        pull_unlock_pair(this_rq, src_rq);
        return 0;
    }

    /* Pull tasks until we've corrected the imbalance */
    while (ctx.imbalance > 0 && pulled < MAX_PULL_PER_ATTEMPT) {
        struct task_struct *task;

        task = select_task_to_pull(src_rq, this_rq, &ctx);
        if (!task) {
            break; /* No more suitable tasks */
        }

        /* Execute the migration */
        if (move_task(task, src_rq, this_rq)) {
            ctx.imbalance -= task_load_weight(task);
            pulled++;
        } else {
            /* Task became unmovable after selection - stop rather than
             * re-selecting the same candidate forever */
            break;
        }
    }

    pull_unlock_pair(this_rq, src_rq);

    /* Update pull statistics */
    update_pull_stats(rq_cpu(this_rq), src_cpu, pulled);

    return pulled;
}

/* Lock both run queues in consistent order */
static void pull_lock_pair(struct run_queue *rq1, struct run_queue *rq2)
{
    /* Always lock lower CPU number first to prevent deadlock */
    if (rq_cpu(rq1) < rq_cpu(rq2)) {
        spin_lock(&rq1->lock);
        spin_lock(&rq2->lock);
    } else {
        spin_lock(&rq2->lock);
        spin_lock(&rq1->lock);
    }
}

/* Move task from source to destination run queue */
static bool move_task(struct task_struct *task, struct run_queue *src_rq,
                      struct run_queue *dst_rq)
{
    /* Final migration eligibility check */
    if (!can_migrate_task(task, rq_cpu(dst_rq))) {
        return false;
    }

    /* Dequeue from source */
    dequeue_task(src_rq, task);

    /* Update task state */
    task->cpu = rq_cpu(dst_rq);
    task->last_migration_time = jiffies;
    task->migration_count++;

    /* Enqueue on destination */
    enqueue_task(dst_rq, task);

    /* Update accounting */
    src_rq->nr_running--;
    src_rq->load_weight -= task_load_weight(task);
    dst_rq->nr_running++;
    dst_rq->load_weight += task_load_weight(task);

    /* Update overloaded tracking */
    update_overload_status(src_rq);
    update_overload_status(dst_rq);

    return true;
}
```

Verification After Locking
A critical aspect of robust pull implementation is re-verifying conditions after acquiring locks. The system state may have changed in the window between finding a source and locking it:
```c
/* Re-verify pull conditions after acquiring locks */
static bool pull_still_viable(struct run_queue *dst_rq, struct run_queue *src_rq,
                              struct pull_context *ctx)
{
    /* Check 1: Destination still needs work */
    if (dst_rq->nr_running > 0) {
        /* Work arrived while we were locking - no longer idle */
        return false;
    }

    /* Check 2: Source still has excess work */
    if (src_rq->nr_running <= 1) {
        /* Source drained - can't pull without leaving it empty */
        return false;
    }

    /* Check 3: Imbalance still exists */
    unsigned long imbalance = calculate_current_imbalance(dst_rq, src_rq);
    if (imbalance < MIN_PULL_IMBALANCE) {
        return false; /* Imbalance resolved */
    }
    ctx->imbalance = imbalance;

    /* Check 4: Migration didn't become too expensive */
    if (migration_cost_increased(ctx)) {
        return false; /* Conditions changed unfavorably */
    }

    return true;
}

/* Reasons pull might become non-viable between check and lock:
 * - Task woke up on destination CPU (no longer idle)
 * - Source CPU's tasks completed or migrated elsewhere
 * - Another CPU already pulled from this source
 * - CPU hotplug event changed topology
 */
```

The pull_lock_pair() function's consistent ordering (lower CPU first) is essential. Without this discipline, two CPUs trying to pull from each other could deadlock—each holding their own lock while waiting for the other. This pattern appears throughout kernel synchronization.
With both push and pull mechanisms operating in the system, coordination is necessary to prevent conflicts and ensure they work synergistically rather than redundantly.
Potential Issues Without Coordination
Left uncoordinated, the two mechanisms can work against each other: a task pushed to one CPU may be pulled straight back (ping-pong thrashing), both mechanisms may react to the same imbalance and over-correct it, and concurrent balance operations may contend for the same run queue locks.
Coordination Strategies
Production schedulers employ several strategies to coordinate push and pull:
```c
/* Coordination structures and logic */

/* Per-task migration tracking to prevent thrashing */
struct task_struct {
    /* ... other fields ... */
    unsigned long last_migration_time; /* When task last migrated */
    int migration_count;               /* Recent migration count */
    int on_rq;                         /* Is task on a run queue? */
};

/* Prevent recently-migrated tasks from migrating again */
static bool task_recently_migrated(struct task_struct *task)
{
    /* Require minimum residence time before next migration */
    if (time_before(jiffies, task->last_migration_time + MIGRATION_COOLDOWN)) {
        return true; /* Too soon to migrate again */
    }

    /* Detect thrashing: multiple recent migrations */
    if (task->migration_count > THRASH_THRESHOLD) {
        /* This task is bouncing - give it more time to settle */
        if (time_before(jiffies,
                        task->last_migration_time + MIGRATION_COOLDOWN * 4)) {
            return true;
        }
        /* Decay the counter over time */
        task->migration_count = task->migration_count * 3 / 4;
    }

    return false;
}

/* Run queue level coordination */
struct run_queue {
    /* ... other fields ... */
    unsigned long last_push_time; /* Timestamp of last push attempt */
    unsigned long last_pull_time; /* Timestamp of last pull */
    int in_balance_operation;     /* Lock for balance mutual exclusion */
};

/* Mutual exclusion for balance operations on a queue */
static bool try_enter_balance(struct run_queue *rq)
{
    if (xchg(&rq->in_balance_operation, 1) == 0) {
        return true; /* We got exclusive access */
    }
    return false; /* Another balance operation in progress */
}

static void exit_balance(struct run_queue *rq)
{
    rq->in_balance_operation = 0;
}

/* Integrated balance check respecting both push and pull */
static void run_balance_callback(struct run_queue *rq)
{
    /* Only one balance operation at a time per queue */
    if (!try_enter_balance(rq)) {
        return;
    }

    /* Decide between push and pull based on current state */
    int state = assess_balance_state(rq);

    switch (state) {
    case BALANCE_OVERLOADED:
        try_to_push_tasks(rq);
        break;
    case BALANCE_UNDERLOADED:
        try_to_pull_tasks(rq);
        break;
    case BALANCE_NEUTRAL:
        /* No action needed */
        break;
    }

    exit_balance(rq);
}

#define MIGRATION_COOLDOWN (HZ / 50) /* 20ms minimum residence */
#define THRASH_THRESHOLD 4           /* >4 recent migrations = thrashing */
```

Asymmetric Trigger Conditions
Another coordination approach uses asymmetric thresholds to separate push and pull domains:
| Mechanism | Trigger Condition | Prevents |
|---|---|---|
| Push | Load > 125% of avg | Overlap with pull's trigger range |
| Pull | Load = 0 (idle) | Pulling from non-overloaded CPUs |
| Dead Zone | 75% < Load < 125% | Neither push nor pull - stable |
The 'dead zone' where neither push nor pull triggers is essential. It prevents the scenario where CPU A pushes to CPU B, then CPU A pulls right back—a pointless thrashing cycle. By requiring load to be clearly above average for push and clearly zero for pull, the mechanisms target distinct situations.
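The run_balance_callback() listing above calls assess_balance_state() without defining it. A minimal sketch of how such a classifier could apply the asymmetric thresholds is shown below; the 125% figure and the idle-only pull rule come from the table, while the function body and the PUSH_THRESHOLD_PCT name are assumptions for illustration:

```c
#define PUSH_THRESHOLD_PCT 125  /* push only when clearly above average */

enum balance_state { BALANCE_OVERLOADED, BALANCE_UNDERLOADED, BALANCE_NEUTRAL };

/* Classify a run queue using the asymmetric push/pull thresholds */
static int assess_balance_state(struct run_queue *rq)
{
    unsigned long avg = system_avg_load();

    /* Pull side: only a truly idle CPU qualifies */
    if (rq->nr_running == 0)
        return BALANCE_UNDERLOADED;

    /* Push side: load must clearly exceed the system average */
    if (rq->avg_load * 100 > avg * PUSH_THRESHOLD_PCT)
        return BALANCE_OVERLOADED;

    /* Dead zone: neither mechanism fires */
    return BALANCE_NEUTRAL;
}
```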
Push runs periodically (timer-driven), while pull runs event-driven (on idle entry). This natural frequency separation reduces conflicts. Push catches gradual imbalance buildup; pull catches immediate idle opportunities. They rarely compete for the same situation.
Linux's CFS scheduler implements sophisticated pull migration under several balance modes, with idle_balance() and newidle_balance() being the primary entry points. Let's examine the production implementation.
CFS Balance Types
CFS categorizes balance operations by the context in which they run: periodic balance on a busy CPU, periodic balance driven from an already-idle CPU, and 'newly idle' balance invoked at the moment a CPU runs out of work.
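In the kernel source these contexts appear as the cpu_idle_type enumeration; the rendering below is simplified, and the exact definition and ordering vary across kernel versions:

```c
/* Simplified; consult your kernel version for the authoritative definition */
enum cpu_idle_type {
    CPU_IDLE,       /* periodic balance run by an already-idle CPU */
    CPU_NOT_IDLE,   /* periodic balance run by a busy CPU */
    CPU_NEWLY_IDLE, /* balance run as a CPU is about to go idle */
};
```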
```c
/* CFS idle balance implementation (simplified from kernel/sched/fair.c) */

/* Called when CPU is about to go idle */
static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
{
    int this_cpu = this_rq->cpu;
    struct sched_domain *sd;
    int pulled_task = 0;
    u64 t0, t1, curr_cost = 0;

    /* Quick exit if no one else is running */
    if (atomic_read(&rq_nr_running) <= 1) {
        return 0;
    }

    t0 = sched_clock_cpu(this_cpu); /* Start timing */

    /* Walk scheduling domains from most local to NUMA level */
    for_each_domain(this_cpu, sd) {
        u64 domain_cost;

        /* Check if enough time has passed for this domain */
        if (!time_after_eq(jiffies, sd->next_newidle_check)) {
            continue;
        }

        /* Try to balance within this domain */
        pulled_task = load_balance(CPU_NEWLY_IDLE, sd, this_cpu, rf);

        t1 = sched_clock_cpu(this_cpu);
        curr_cost = t1 - t0;

        if (pulled_task || curr_cost > sd->max_newidle_cost) {
            /* Success or timeout - stop searching */
            if (!pulled_task && curr_cost > sd->max_newidle_cost) {
                /* Took too long - widen interval for this domain */
                sd->newidle_check_interval += sd->newidle_check_interval / 4;
            }
            break;
        }

        /* NUMA domains: extra caution for cross-node pull */
        if (sd->flags & SD_NUMA) {
            /* Only cross NUMA boundary if absolutely necessary */
            if (!high_numa_benefit_expected(sd)) {
                break;
            }
        }
    }

    return pulled_task;
}

/* CFS's main load balance - handles both push and pull perspectives */
static int load_balance(int idle, struct sched_domain *sd, int this_cpu,
                        struct rq_flags *rf)
{
    int ld_moved, cur_ld_moved;
    struct sched_group *busiest;
    struct rq *busiest_rq;
    struct lb_env env = {
        .sd       = sd,
        .dst_cpu  = this_cpu,
        .dst_rq   = cpu_rq(this_cpu),
        .idle     = idle,
        .loop_max = sysctl_sched_nr_migrate,
    };

    /* Find the busiest group/CPU in this domain */
    busiest = find_busiest_group(&env);
    if (!busiest) {
        return 0; /* No imbalance */
    }

    busiest_rq = find_busiest_queue(&env, busiest);
    if (!busiest_rq) {
        return 0; /* Busiest group exists but no CPU to pull from */
    }

    /* Calculate how much load to migrate */
    env.src_cpu = cpu_of(busiest_rq);
    env.src_rq = busiest_rq;

    /* Detach tasks from source and attach to destination */
    cur_ld_moved = detach_tasks(&env);
    if (cur_ld_moved) {
        attach_tasks(&env);
    }

    return cur_ld_moved;
}
```

Newidle Cost Tracking
CFS dynamically limits newidle searching based on observed cost. If newidle balance frequently takes too long without finding work, the scheduler backs off for that domain:
```c
/* Track and adapt newidle balance costing */
static void update_newidle_cost(struct sched_domain *sd, u64 cost)
{
    /* Exponential moving average of newidle balance cost */
    sd->avg_newidle_cost = (sd->avg_newidle_cost * 7 + cost) / 8;

    /* If average cost exceeds threshold, reduce frequency */
    if (sd->avg_newidle_cost > sd->max_newidle_cost) {
        /* Back off: increase interval between checks */
        sd->newidle_check_interval = min(
            sd->newidle_check_interval + msecs_to_jiffies(1),
            sd->max_newidle_check_interval
        );
    } else {
        /* Benefit exists: try to check more often */
        sd->newidle_check_interval = max(
            sd->newidle_check_interval - msecs_to_jiffies(1),
            sd->min_newidle_check_interval
        );
    }
}

/* Typical values (configurable via sysctl) */
/* max_newidle_cost: 0.5ms for SMT, 1ms for MC, 2ms for NUMA */
/* Cost = time spent searching (even if no task found) */
```

CFS's cost tracking creates a self-tuning system. When newidle balance consistently finds work quickly, it maintains aggressive checking. When the system is lightly loaded and searching is usually futile, it backs off automatically. This adaptation happens per-domain, recognizing that local domains are cheaper to search than NUMA domains.
Pull migration completes our understanding of the fundamental load balancing mechanisms in multiprocessor scheduling. Let's consolidate the key insights:
The Complete Balancing Picture
With push and pull understood, we have the two fundamental reactive mechanisms for load balancing. In the next page, we'll explore work stealing—a more sophisticated technique that combines aspects of both, used extensively in user-space thread pools and increasingly in kernel schedulers.
You now understand pull migration—from the idle CPU's perspective through source selection to task extraction. Combined with push migration knowledge, you can reason about how modern schedulers maintain balance across diverse workload patterns and hardware configurations. Next: work stealing algorithms.