On the previous page, we explored push migration—the proactive approach where overloaded CPUs redistribute their excess work. But what happens when this process fails to achieve perfect balance? What if some CPUs become idle while others remain busy, yet the push thresholds haven't been triggered?
Enter pull migration: the complementary mechanism where idle or underloaded processors actively seek work from their busier peers.
Pull migration embodies a demand-driven philosophy. Rather than waiting for overloaded CPUs to notice and react, idle CPUs take initiative—they 'pull' work toward themselves, ensuring that available computational capacity is never wasted while runnable tasks exist elsewhere in the system.
This page provides a comprehensive exploration of pull migration: its design rationale, implementation strategies, integration with push migration, and its critical role in modern operating system schedulers.
By completing this page, you will understand: (1) Why pull migration is necessary alongside push migration, (2) The idle CPU detection mechanisms that trigger pulling, (3) Algorithms for finding work to pull and selecting appropriate tasks, (4) The interplay between push and pull in maintaining system balance, and (5) How production schedulers implement efficient pull migration.
While push migration addresses overload from the sender's perspective, pull migration addresses underutilization from the receiver's perspective. Understanding why both mechanisms are necessary requires appreciating their complementary roles.
Why Push Alone is Insufficient
Push migration has inherent limitations that create gaps in load balancing coverage:
Threshold Sensitivity — Push thresholds prevent every minor imbalance from triggering migration. This conservatism, while necessary to prevent thrashing, means some imbalances persist.
Detection Delay — Overloaded CPUs check for imbalance periodically (not continuously), creating windows where imbalance exists but isn't addressed.
Race Conditions — Between detecting overload and completing migration, the situation may change, leaving the push incomplete.
Asymmetric Load Patterns — Some workloads create imbalance without any single CPU becoming 'overloaded' by threshold definitions.
Pull migration fills these gaps by having idle CPUs actively scan for work rather than passively waiting. For example, on an eight-CPU system where seven CPUs each carry a load of 50 and one is idle, the system average is about 44; no CPU exceeds a 125% push threshold (≈55), yet one CPU sits idle while runnable work exists elsewhere—only a pull closes that gap.
Formal Definition
Pull migration is a load balancing mechanism in which a processor P, upon becoming idle or detecting its load L(P) is significantly below the system average, identifies processors with excess load and transfers one or more tasks from those processors to itself.
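The definition reduces to a trigger predicate evaluated on the prospective puller. Here is a minimal sketch, assuming an illustrative 75% underload threshold (the names `should_pull` and `PULL_UNDERLOAD_PCT` are not from any real scheduler; the 75% figure merely mirrors the UNDERLOAD_THRESHOLD used in the tick-based example later on this page):

```c
#include <stdbool.h>

#define PULL_UNDERLOAD_PCT 75  /* illustrative: pull if below 75% of average */

/* A CPU qualifies as a puller if it is idle, or clearly below average load */
static bool should_pull(unsigned long cpu_load, unsigned long avg_load,
                        unsigned int nr_running)
{
    if (nr_running == 0)
        return true;  /* idle: always look for work */

    return cpu_load * 100 < avg_load * PULL_UNDERLOAD_PCT;
}
```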
The Idle Balance Problem
The core problem pull migration solves is idle balance: ensuring that when a CPU runs out of work, it doesn't passively enter idle state if runnable tasks exist elsewhere.
Consider this scenario: CPU 0 holds several runnable tasks—not enough excess to cross the push threshold—while CPU 1 has just finished its last task and is about to go idle.
Pull migration's event-driven nature catches exactly this situation—CPU 1, about to idle, checks if it can pull work from CPU 0.
Push and pull are not redundant—they operate on different triggers with different sensitivities. Push prevents severe overload buildup. Pull prevents idle CPU waste. Together they form a complete solution: push handles supply (excess work), pull handles demand (available capacity).
Pull migration is triggered when a CPU transitions to idle state or detects it is underloaded. This detection mechanism is the first critical component of the pull system.
Idle Entry Path
When the scheduler runs out of runnable tasks, the CPU is about to enter idle state. This transition point is the primary trigger for pull migration—a last chance to find work before wasting cycles:
```c
/* Scheduler's idle entry path with pull migration attempt */
void schedule_idle(void)
{
    struct run_queue *rq = this_rq();

    /* Standard check: do we have any runnable tasks? */
    if (rq->nr_running > 0) {
        /* This shouldn't happen - we have local work */
        schedule_next_task();
        return;
    }

    /* About to go idle - opportunity for pull migration */
    /* This is the critical point where pull adds value */

    /* Attempt to pull work from busy CPUs */
    if (try_to_pull_tasks(rq)) {
        /* Successfully pulled work - schedule it instead of idling */
        schedule_next_task();
        return;
    }

    /* Check idle sibling override (when we need to help siblings) */
    if (need_idle_sibling_balance()) {
        attempt_sibling_pull();
    }

    /* No work found anywhere - now we truly idle */
    enter_idle_state();
}

/* Check if we should attempt pull based on system state */
static bool should_attempt_pull(struct run_queue *rq)
{
    /* We must have no local work */
    if (rq->nr_running > 0) {
        return false;
    }

    /* System must have work somewhere */
    if (atomic_read(&system_runnable_count) <= 1) {
        return false; /* No one else has work either */
    }

    /* Rate limit pull attempts to avoid burning CPU */
    if (time_before(jiffies, rq->last_pull_time + PULL_MIN_INTERVAL)) {
        return false;
    }

    return true;
}
```

Alternative Triggers for Pull
Beyond the primary idle-entry trigger, several other events can initiate pull migration:
1. Periodic Idle Rebalance
Even while idle, CPUs periodically wake to recheck for work. This catches tasks that became runnable after the initial pull attempt failed:
```c
/* Periodic idle rebalance timer handler */
void idle_balance_timer(struct run_queue *rq)
{
    /* We're idle - timer woke us to recheck */
    if (rq->nr_running > 0) {
        /* Work arrived while we were idle - stop timer, schedule */
        cancel_idle_timer(rq);
        schedule();
        return;
    }

    /* Still idle - try to pull */
    if (try_to_pull_tasks(rq)) {
        cancel_idle_timer(rq);
        schedule();
        return;
    }

    /* Still nothing - go back to idle, timer will wake us again */
    /* Exponential backoff on timer interval prevents busy-waiting */
    rq->idle_check_interval = min(rq->idle_check_interval * 2, MAX_IDLE_CHECK);
    arm_idle_timer(rq, rq->idle_check_interval);
}
```

2. Scheduler Tick Underload Detection
Even non-idle CPUs with low load can trigger pull behavior during scheduler tick:
```c
/* Scheduler tick - check if we're underloaded and should pull */
void scheduler_tick(void)
{
    struct run_queue *rq = this_rq();

    /* Update load averages */
    update_rq_load_avg(rq);

    /* Check if we're underloaded relative to system average */
    if (rq->avg_load < system_avg_load() * UNDERLOAD_THRESHOLD / 100) {
        /* We have capacity - consider pulling */
        rq->pull_pending = true;

        /* Don't pull too aggressively during tick - defer to idle path */
        if (rq->nr_running < 2 && pull_opportunity_exists()) {
            try_to_pull_tasks(rq);
        }
    }

    /* Regular tick processing... */
    account_process_tick(current);
}

#define UNDERLOAD_THRESHOLD 75 /* Pull if below 75% of average load */
```

3. New Idle (newidle) Balance
Linux CFS implements a 'newidle' balance specifically for the moment a CPU finishes its last task. This is the most performance-critical pull trigger—it happens at exactly the right moment to prevent idle cycles:
```c
/* Called just before a CPU would go idle */
int newidle_balance(struct run_queue *this_rq)
{
    struct sched_domain *sd;
    int pulled = 0;

    /* Quick sanity checks */
    if (this_rq->nr_running > 0) {
        return 0; /* Not actually idle */
    }

    /* Don't waste time if system is nearly idle */
    if (atomic_read(&system_runnable_count) <= 1) {
        return 0;
    }

    /* Walk up the scheduling domain hierarchy */
    for_each_domain(cpu_of(this_rq), sd) {
        if (time_before(jiffies, sd->newidle_check_time)) {
            continue; /* This domain was checked recently */
        }

        /* Try to balance within this domain */
        pulled = load_balance(NEWIDLE, sd, this_rq);
        if (pulled) {
            break; /* Success - stop searching */
        }

        /* Update timestamp to prevent immediate re-checking */
        sd->newidle_check_time = jiffies + sd->newidle_check_interval;

        /* Higher domains are more expensive - be more conservative */
        if (sd->flags & SD_NUMA && !pulled) {
            /* Cross-NUMA pull at newidle is expensive - bail out */
            break;
        }
    }

    return pulled;
}
```

Newidle balance runs on the critical path between task completion and idle entry. Spending too long searching for work can be worse than just going idle—especially if work arrives naturally via wakeup. Production schedulers strictly limit newidle search time and domain depth.
Once a CPU decides to pull, it must identify which CPUs have work to spare. This is the inverse of push migration's destination selection—we're now looking for the source of work to pull.
Efficient Candidate Identification
Naively checking every CPU in the system is expensive, especially on large machines. Efficient implementations use hierarchical information and cached state:
```c
/* Find a CPU with excess work that we can pull from */
static int find_pull_source(struct run_queue *dst_rq, struct pull_context *ctx)
{
    int dst_cpu = rq_cpu(dst_rq);
    struct sched_domain *sd;
    int best_source = -1;
    unsigned long best_excess = 0;

    /* Strategy: Start local, expand if necessary */
    /* This minimizes migration cost when work is found nearby */
    for_each_domain(dst_cpu, sd) {
        struct sched_group *sg;

        /* Check if this domain has imbalance */
        if (!domain_has_excess_load(sd, dst_cpu)) {
            continue;
        }

        /* Find the busiest group within this domain */
        sg = find_busiest_group(sd);
        if (!sg) {
            continue;
        }

        /* Find the busiest CPU within that group */
        best_source = find_busiest_cpu_in_group(sg, dst_cpu, ctx);
        if (best_source >= 0) {
            /* Found a source at this domain level */
            break;
        }

        /* Check if we should expand search to higher domains */
        if (sd->flags & SD_NUMA) {
            /* NUMA crossing is expensive - only if really necessary */
            if (!cross_numa_pull_worthwhile(sd, ctx)) {
                break; /* Not worth crossing NUMA boundary */
            }
        }
    }

    ctx->source_cpu = best_source;
    return best_source;
}

/* Find the busiest CPU within a scheduling group */
static int find_busiest_cpu_in_group(struct sched_group *sg, int dst_cpu,
                                     struct pull_context *ctx)
{
    int busiest_cpu = -1;
    unsigned long busiest_load = 0;
    int cpu;

    for_each_cpu(cpu, sched_group_cpus(sg)) {
        struct run_queue *rq = cpu_rq(cpu);
        unsigned long load;

        if (cpu == dst_cpu) {
            continue; /* Can't pull from ourselves */
        }

        /* Skip single-task CPUs - pulling leaves them idle */
        if (rq->nr_running <= 1) {
            continue;
        }

        load = rq->avg_load;

        /* Factor in cache/NUMA relationships */
        load = adjust_load_for_distance(load, cpu, dst_cpu);

        if (load > busiest_load) {
            busiest_load = load;
            busiest_cpu = cpu;
        }
    }

    if (busiest_cpu >= 0) {
        ctx->source_load = busiest_load;
    }

    return busiest_cpu;
}
```

Busy CPU Tracking Optimizations
To avoid expensive scans, schedulers maintain cached information about busy CPUs:
```c
/* Bitmap of CPUs that currently have multiple runnable tasks */
static cpumask_t overloaded_cpus;

/* Called when a CPU's run queue state changes */
void update_overload_status(struct run_queue *rq)
{
    int cpu = rq_cpu(rq);

    if (rq->nr_running > 1) {
        cpumask_set_cpu(cpu, &overloaded_cpus);
    } else {
        cpumask_clear_cpu(cpu, &overloaded_cpus);
    }

    /* Also update per-NUMA-node tracking */
    update_node_overload_status(cpu_to_node(cpu));
}

/* Quick check: is there anyone to pull from? */
static inline bool any_overloaded_cpus(void)
{
    return !cpumask_empty(&overloaded_cpus);
}

/* Find overloaded CPU nearest to destination */
static int find_nearest_overloaded_cpu(int dst_cpu)
{
    int cpu;
    int best_cpu = -1;
    int best_distance = INT_MAX;

    for_each_cpu(cpu, &overloaded_cpus) {
        int distance = cpu_distance(dst_cpu, cpu);

        if (distance < best_distance) {
            best_distance = distance;
            best_cpu = cpu;
        }
    }

    return best_cpu;
}
```

The overloaded_cpus bitmap enables lock-free checking of pull potential. An idle CPU can quickly determine if any pull candidates exist without acquiring any run queue locks. This is essential for keeping newidle balance fast.
Cache-Distance Aware Selection
When multiple source candidates exist, the pull algorithm prefers sources that minimize cache/memory impact:
| Relationship | Priority | Rationale |
|---|---|---|
| SMT Sibling | Highest | Same core, shared L1/L2 cache |
| Same Socket Core | High | Shared L3 cache |
| Same NUMA Node | Medium | Local memory, shared memory controller |
| Remote NUMA Node | Low | Remote memory access penalty |
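The earlier find_busiest_cpu_in_group() listing calls adjust_load_for_distance() without showing it. Below is a minimal sketch of what such a helper could look like, with scaling factors chosen purely to mirror the priorities in the table above; the factors and the helper names cpus_share_smt() and cpus_share_llc() are illustrative assumptions, not kernel APIs:

```c
/* Sketch: scale a source CPU's load by topological distance so that nearer,
 * cache-friendlier sources win ties. Scaling factors are illustrative only;
 * cpus_share_smt()/cpus_share_llc() are hypothetical topology helpers. */
static unsigned long adjust_load_for_distance(unsigned long load,
                                              int src_cpu, int dst_cpu)
{
    if (cpus_share_smt(src_cpu, dst_cpu))
        return load;                    /* SMT sibling: no penalty */
    if (cpus_share_llc(src_cpu, dst_cpu))
        return load * 9 / 10;           /* same socket: small penalty */
    if (cpu_to_node(src_cpu) == cpu_to_node(dst_cpu))
        return load * 3 / 4;            /* same NUMA node */
    return load / 2;                    /* remote node: strong penalty */
}
```

Because the caller keeps the candidate with the highest adjusted load, discounting distant sources means a nearby, moderately busy CPU beats a far-away, slightly busier one.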
Having identified a source CPU with excess work, the puller must select which specific task(s) to migrate. This selection is critical—pulling the wrong task can harm overall performance despite achieving better load balance.
Selection Criteria
The ideal task for pulling has a cold cache footprint, a load weight that roughly fits the measured imbalance, weak or no CPU affinity restrictions, a long wait in the source queue, and lower priority. The selection logic below scores candidates against exactly these criteria:
```c
/* Select the best task to pull from source run queue */
static struct task_struct *select_task_to_pull(struct run_queue *src_rq,
                                               struct run_queue *dst_rq,
                                               struct pull_context *ctx)
{
    struct task_struct *best_task = NULL;
    int best_score = INT_MIN;
    struct task_struct *task;
    int dst_cpu = rq_cpu(dst_rq);

    /* Must hold src_rq lock during selection */
    lockdep_assert_held(&src_rq->lock);

    list_for_each_entry(task, &src_rq->cfs_tasks, se.group_node) {
        int score = 0;

        /* Hard constraints: can this task even migrate? */
        if (!can_migrate_task(task, dst_cpu)) {
            continue;
        }

        /* Currently running tasks cannot be pulled */
        if (task == src_rq->curr) {
            continue;
        }

        /* Build a preference score for this task */

        /* 1. Cache temperature - prefer cold tasks */
        if (task_has_cold_cache(task, src_rq)) {
            score += 200; /* Strong preference for cold tasks */
        } else if (task_just_ran(task)) {
            score -= 300; /* Strong avoidance of hot tasks */
        }

        /* 2. Task load relative to imbalance */
        unsigned long task_weight = task_load_weight(task);
        if (task_weight <= ctx->imbalance) {
            /* Task fits within our capacity gap - good */
            score += 100;
        } else if (task_weight <= ctx->imbalance * 2) {
            /* Slightly oversized but acceptable */
            score += 50;
        } else {
            /* Too heavy - might over-correct */
            score -= 50;
        }

        /* 3. CPU affinity strength - check the unrestricted case first so
         * the broader weak-affinity branch doesn't shadow it */
        if (cpumask_weight(&task->cpus_allowed) == nr_cpu_ids) {
            score += 100; /* No affinity restrictions */
        } else if (cpumask_weight(&task->cpus_allowed) > 4) {
            score += 75; /* Weak affinity - happy to run almost anywhere */
        }

        /* 4. Time waiting in queue */
        unsigned long wait_time = jiffies - task->last_enqueue_time;
        if (wait_time > HZ / 10) {
            score += 50; /* Been waiting a while - help it */
        }

        /* 5. Priority considerations */
        if (task_nice(task) > 0) {
            score += 25; /* Low priority - less latency-sensitive */
        } else if (task_nice(task) < 0) {
            score -= 25; /* High priority - prefer local execution */
        }

        /* Track best candidate */
        if (score > best_score) {
            best_score = score;
            best_task = task;
        }
    }

    /* If best candidate has negative score, pulling may hurt */
    if (best_task && best_score < PULL_VIABILITY_THRESHOLD) {
        /* Signal that no good candidates exist */
        ctx->no_viable_tasks = true;
        return NULL;
    }

    return best_task;
}

#define PULL_VIABILITY_THRESHOLD -100
```

Cache Temperature Estimation
Determining whether a task's cache footprint is 'hot' or 'cold' is crucial for pull decisions. Tasks with warm caches should generally stay on their current CPU:
```c
/* Estimate whether a task's cache state is 'hot' on current CPU */
static bool task_has_cold_cache(struct task_struct *task, struct run_queue *rq)
{
    /* Time since task last executed */
    unsigned long idle_time = jiffies - task->last_run_time;

    /* Cache cold threshold varies by CPU type and cache size */
    unsigned long cache_decay_time = rq->cache_half_life;

    /* After 2-3 half-lives, cache is considered cold */
    if (idle_time > cache_decay_time * 3) {
        return true; /* Definitely cold */
    }

    /* Check if task was scheduled-out for I/O (will need data reload) */
    if (task->sched_class == SCHED_BLOCKED_IO) {
        return true; /* Will hit memory anyway */
    }

    /* Check task's working set size estimate */
    if (task->estimated_working_set < L3_CACHE_SIZE / 4) {
        /* Small working set - quick to reload on new CPU */
        return true;
    }

    return false; /* Assume hot */
}

/* Estimate how long cache stays warm for this CPU */
static void calibrate_cache_decay(struct run_queue *rq)
{
    /* Typically 1-10ms depending on cache size and access patterns */
    /* Larger caches decay slower */
    unsigned long l3_size_mb = this_cpu_l3_size() / (1024 * 1024);

    /* Rough heuristic: 1ms per 2MB of L3 */
    rq->cache_half_life = msecs_to_jiffies(l3_size_mb / 2);

    /* Clamp to reasonable bounds */
    rq->cache_half_life = clamp(rq->cache_half_life,
                                msecs_to_jiffies(1),
                                msecs_to_jiffies(10));
}
```

Cache temperature estimation is inherently imperfect—the scheduler cannot directly observe cache contents. These heuristics provide reasonable approximations that work well in aggregate, even if individual decisions are sometimes suboptimal.
With source and task identified, the pull migration execution transfers the task to the pulling CPU. This phase requires careful synchronization and follows a specific protocol:
```c
/* Main pull migration implementation */
static int try_to_pull_tasks(struct run_queue *this_rq)
{
    int pulled = 0;
    struct pull_context ctx = { 0 };
    int src_cpu;

    /* Quick check: anyone to pull from? */
    if (!any_overloaded_cpus()) {
        return 0;
    }

    /* Find a source CPU */
    src_cpu = find_pull_source(this_rq, &ctx);
    if (src_cpu < 0) {
        return 0; /* No suitable source found */
    }

    /* Acquire locks in consistent order to prevent deadlock */
    struct run_queue *src_rq = cpu_rq(src_cpu);
    pull_lock_pair(this_rq, src_rq);

    /* Re-verify conditions after getting locks */
    if (!pull_still_viable(this_rq, src_rq, &ctx)) {
        pull_unlock_pair(this_rq, src_rq);
        return 0;
    }

    /* Pull tasks until we've corrected the imbalance */
    while (ctx.imbalance > 0 && pulled < MAX_PULL_PER_ATTEMPT) {
        struct task_struct *task;

        task = select_task_to_pull(src_rq, this_rq, &ctx);
        if (!task) {
            break; /* No more suitable tasks */
        }

        /* Execute the migration */
        if (move_task(task, src_rq, this_rq)) {
            ctx.imbalance -= task_load_weight(task);
            pulled++;
        } else {
            /* Task became unmovable after selection - stop rather than
             * re-selecting the same candidate forever */
            break;
        }
    }

    pull_unlock_pair(this_rq, src_rq);

    /* Update pull statistics */
    update_pull_stats(rq_cpu(this_rq), src_cpu, pulled);

    return pulled;
}

/* Lock both run queues in consistent order */
static void pull_lock_pair(struct run_queue *rq1, struct run_queue *rq2)
{
    /* Always lock lower CPU number first to prevent deadlock */
    if (rq_cpu(rq1) < rq_cpu(rq2)) {
        spin_lock(&rq1->lock);
        spin_lock(&rq2->lock);
    } else {
        spin_lock(&rq2->lock);
        spin_lock(&rq1->lock);
    }
}

/* Move task from source to destination run queue */
static bool move_task(struct task_struct *task, struct run_queue *src_rq,
                      struct run_queue *dst_rq)
{
    /* Final migration eligibility check */
    if (!can_migrate_task(task, rq_cpu(dst_rq))) {
        return false;
    }

    /* Dequeue from source */
    dequeue_task(src_rq, task);

    /* Update task state */
    task->cpu = rq_cpu(dst_rq);
    task->last_migration_time = jiffies;
    task->migration_count++;

    /* Enqueue on destination */
    enqueue_task(dst_rq, task);

    /* Update accounting */
    src_rq->nr_running--;
    src_rq->load_weight -= task_load_weight(task);
    dst_rq->nr_running++;
    dst_rq->load_weight += task_load_weight(task);

    /* Update overloaded tracking */
    update_overload_status(src_rq);
    update_overload_status(dst_rq);

    return true;
}
```

Verification After Locking
A critical aspect of robust pull implementation is re-verifying conditions after acquiring locks. The system state may have changed in the window between finding a source and locking it:
```c
/* Re-verify pull conditions after acquiring locks */
static bool pull_still_viable(struct run_queue *dst_rq, struct run_queue *src_rq,
                              struct pull_context *ctx)
{
    /* Check 1: Destination still needs work */
    if (dst_rq->nr_running > 0) {
        /* Work arrived while we were locking - no longer idle */
        return false;
    }

    /* Check 2: Source still has excess work */
    if (src_rq->nr_running <= 1) {
        /* Source drained - can't pull without leaving it empty */
        return false;
    }

    /* Check 3: Imbalance still exists */
    unsigned long imbalance = calculate_current_imbalance(dst_rq, src_rq);
    if (imbalance < MIN_PULL_IMBALANCE) {
        return false; /* Imbalance resolved */
    }
    ctx->imbalance = imbalance;

    /* Check 4: Migration didn't become too expensive */
    if (migration_cost_increased(ctx)) {
        return false; /* Conditions changed unfavorably */
    }

    return true;
}

/* Reasons pull might become non-viable between check and lock:
 * - Task woke up on destination CPU (no longer idle)
 * - Source CPU's tasks completed or migrated elsewhere
 * - Another CPU already pulled from this source
 * - CPU hotplug event changed topology
 */
```

The pull_lock_pair() function's consistent ordering (lower CPU first) is essential. Without this discipline, two CPUs trying to pull from each other could deadlock—each holding their own lock while waiting for the other. This pattern appears throughout kernel synchronization.
With both push and pull mechanisms operating in the system, coordination is necessary to prevent conflicts and ensure they work synergistically rather than redundantly.
Potential Issues Without Coordination
Left uncoordinated, the two mechanisms can work against each other: a task pushed to one CPU may be pulled straight back (ping-pong thrashing), both mechanisms may react to the same imbalance and over-correct it, and concurrent balance operations may contend for the same run queue locks.
Coordination Strategies
Production schedulers employ several strategies to coordinate push and pull:
```c
/* Coordination structures and logic */

/* Per-task migration tracking to prevent thrashing */
struct task_struct {
    /* ... other fields ... */
    unsigned long last_migration_time; /* When task last migrated */
    int migration_count;               /* Recent migration count */
    int on_rq;                         /* Is task on a run queue? */
};

/* Prevent recently-migrated tasks from migrating again */
static bool task_recently_migrated(struct task_struct *task)
{
    /* Require minimum residence time before next migration */
    if (time_before(jiffies, task->last_migration_time + MIGRATION_COOLDOWN)) {
        return true; /* Too soon to migrate again */
    }

    /* Detect thrashing: multiple recent migrations */
    if (task->migration_count > THRASH_THRESHOLD) {
        /* This task is bouncing - give it more time to settle */
        if (time_before(jiffies,
                        task->last_migration_time + MIGRATION_COOLDOWN * 4)) {
            return true;
        }
        /* Decay the counter over time */
        task->migration_count = task->migration_count * 3 / 4;
    }

    return false;
}

/* Run queue level coordination */
struct run_queue {
    /* ... other fields ... */
    unsigned long last_push_time; /* Timestamp of last push attempt */
    unsigned long last_pull_time; /* Timestamp of last pull */
    int in_balance_operation;     /* Lock for balance mutual exclusion */
};

/* Mutual exclusion for balance operations on a queue */
static bool try_enter_balance(struct run_queue *rq)
{
    if (xchg(&rq->in_balance_operation, 1) == 0) {
        return true; /* We got exclusive access */
    }
    return false; /* Another balance operation in progress */
}

static void exit_balance(struct run_queue *rq)
{
    rq->in_balance_operation = 0;
}

/* Integrated balance check respecting both push and pull */
static void run_balance_callback(struct run_queue *rq)
{
    /* Only one balance operation at a time per queue */
    if (!try_enter_balance(rq)) {
        return;
    }

    /* Decide between push and pull based on current state */
    int state = assess_balance_state(rq);

    switch (state) {
    case BALANCE_OVERLOADED:
        try_to_push_tasks(rq);
        break;
    case BALANCE_UNDERLOADED:
        try_to_pull_tasks(rq);
        break;
    case BALANCE_NEUTRAL:
        /* No action needed */
        break;
    }

    exit_balance(rq);
}

#define MIGRATION_COOLDOWN (HZ / 50) /* 20ms minimum residence */
#define THRASH_THRESHOLD 4           /* >4 recent migrations = thrashing */
```

Asymmetric Trigger Conditions
Another coordination approach uses asymmetric thresholds to separate push and pull domains:
| Mechanism | Trigger Condition | Prevents |
|---|---|---|
| Push | Load > 125% of avg | Overlap with pull's trigger range |
| Pull | Load = 0 (idle) | Pulling from non-overloaded CPUs |
| Dead Zone | 75% < Load < 125% | Neither push nor pull - stable |
The 'dead zone' where neither push nor pull triggers is essential. It prevents the scenario where CPU A pushes to CPU B, then CPU A pulls right back—a pointless thrashing cycle. By requiring load to be clearly above average for push and clearly zero for pull, the mechanisms target distinct situations.
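The run_balance_callback() listing above calls assess_balance_state() without defining it. A minimal sketch of how such a classifier could apply the asymmetric thresholds is shown below; the 125% figure and the idle-only pull rule come from the table, while the function body and the PUSH_THRESHOLD_PCT name are assumptions for illustration:

```c
#define PUSH_THRESHOLD_PCT 125  /* push only when clearly above average */

enum balance_state { BALANCE_OVERLOADED, BALANCE_UNDERLOADED, BALANCE_NEUTRAL };

/* Classify a run queue using the asymmetric push/pull thresholds */
static int assess_balance_state(struct run_queue *rq)
{
    unsigned long avg = system_avg_load();

    /* Pull side: only a truly idle CPU qualifies */
    if (rq->nr_running == 0)
        return BALANCE_UNDERLOADED;

    /* Push side: load must clearly exceed the system average */
    if (rq->avg_load * 100 > avg * PUSH_THRESHOLD_PCT)
        return BALANCE_OVERLOADED;

    /* Dead zone: neither mechanism fires */
    return BALANCE_NEUTRAL;
}
```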
Push runs periodically (timer-driven), while pull runs event-driven (on idle entry). This natural frequency separation reduces conflicts. Push catches gradual imbalance buildup; pull catches immediate idle opportunities. They rarely compete for the same situation.
Linux's CFS scheduler implements sophisticated pull migration under several balance modes, with idle_balance() and newidle_balance() being the primary entry points. Let's examine the production implementation.
CFS Balance Types
CFS categorizes balance operations by the context in which they run: periodic balance on a busy CPU, periodic balance driven from an already-idle CPU, and 'newly idle' balance invoked at the moment a CPU runs out of work.
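In the kernel source these contexts appear as the cpu_idle_type enumeration; the rendering below is simplified, and the exact definition and ordering vary across kernel versions:

```c
/* Simplified; consult your kernel version for the authoritative definition */
enum cpu_idle_type {
    CPU_IDLE,       /* periodic balance run by an already-idle CPU */
    CPU_NOT_IDLE,   /* periodic balance run by a busy CPU */
    CPU_NEWLY_IDLE, /* balance run as a CPU is about to go idle */
};
```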
```c
/* CFS idle balance implementation (simplified from kernel/sched/fair.c) */

/* Called when CPU is about to go idle */
static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
{
    int this_cpu = this_rq->cpu;
    struct sched_domain *sd;
    int pulled_task = 0;
    u64 t0, t1, curr_cost = 0;

    /* Quick exit if no one else is running */
    if (atomic_read(&rq_nr_running) <= 1) {
        return 0;
    }

    t0 = sched_clock_cpu(this_cpu); /* Start timing */

    /* Walk scheduling domains from most local to NUMA level */
    for_each_domain(this_cpu, sd) {
        u64 domain_cost;

        /* Check if enough time has passed for this domain */
        if (!time_after_eq(jiffies, sd->next_newidle_check)) {
            continue;
        }

        /* Try to balance within this domain */
        pulled_task = load_balance(CPU_NEWLY_IDLE, sd, this_cpu, rf);

        t1 = sched_clock_cpu(this_cpu);
        curr_cost = t1 - t0;

        if (pulled_task || curr_cost > sd->max_newidle_cost) {
            /* Success or timeout - stop searching */
            if (!pulled_task && curr_cost > sd->max_newidle_cost) {
                /* Took too long - widen interval for this domain */
                sd->newidle_check_interval += sd->newidle_check_interval / 4;
            }
            break;
        }

        /* NUMA domains: extra caution for cross-node pull */
        if (sd->flags & SD_NUMA) {
            /* Only cross NUMA boundary if absolutely necessary */
            if (!high_numa_benefit_expected(sd)) {
                break;
            }
        }
    }

    return pulled_task;
}

/* CFS's main load balance - handles both push and pull perspectives */
static int load_balance(int idle, struct sched_domain *sd, int this_cpu,
                        struct rq_flags *rf)
{
    int ld_moved, cur_ld_moved;
    struct sched_group *busiest;
    struct rq *busiest_rq;
    struct lb_env env = {
        .sd       = sd,
        .dst_cpu  = this_cpu,
        .dst_rq   = cpu_rq(this_cpu),
        .idle     = idle,
        .loop_max = sysctl_sched_nr_migrate,
    };

    /* Find the busiest group/CPU in this domain */
    busiest = find_busiest_group(&env);
    if (!busiest) {
        return 0; /* No imbalance */
    }

    busiest_rq = find_busiest_queue(&env, busiest);
    if (!busiest_rq) {
        return 0; /* Busiest group exists but no CPU to pull from */
    }

    /* Calculate how much load to migrate */
    env.src_cpu = cpu_of(busiest_rq);
    env.src_rq = busiest_rq;

    /* Detach tasks from source and attach to destination */
    cur_ld_moved = detach_tasks(&env);
    if (cur_ld_moved) {
        attach_tasks(&env);
    }

    return cur_ld_moved;
}
```

Newidle Cost Tracking
CFS dynamically limits newidle searching based on observed cost. If newidle balance frequently takes too long without finding work, the scheduler backs off for that domain:
```c
/* Track and adapt newidle balance costing */
static void update_newidle_cost(struct sched_domain *sd, u64 cost)
{
    /* Exponential moving average of newidle balance cost */
    sd->avg_newidle_cost = (sd->avg_newidle_cost * 7 + cost) / 8;

    /* If average cost exceeds threshold, reduce frequency */
    if (sd->avg_newidle_cost > sd->max_newidle_cost) {
        /* Back off: increase interval between checks */
        sd->newidle_check_interval = min(
            sd->newidle_check_interval + msecs_to_jiffies(1),
            sd->max_newidle_check_interval
        );
    } else {
        /* Benefit exists: try to check more often */
        sd->newidle_check_interval = max(
            sd->newidle_check_interval - msecs_to_jiffies(1),
            sd->min_newidle_check_interval
        );
    }
}

/* Typical values (configurable via sysctl) */
/* max_newidle_cost: 0.5ms for SMT, 1ms for MC, 2ms for NUMA */
/* Cost = time spent searching (even if no task found) */
```

CFS's cost tracking creates a self-tuning system. When newidle balance consistently finds work quickly, it maintains aggressive checking. When the system is lightly loaded and searching is usually futile, it backs off automatically. This adaptation happens per-domain, recognizing that local domains are cheaper to search than NUMA domains.
Pull migration completes our understanding of the fundamental load balancing mechanisms in multiprocessor scheduling. Let's consolidate the key insights:
The Complete Balancing Picture
With push and pull understood, we have the two fundamental reactive mechanisms for load balancing. In the next page, we'll explore work stealing—a more sophisticated technique that combines aspects of both, used extensively in user-space thread pools and increasingly in kernel schedulers.
You now understand pull migration—from the idle CPU's perspective through source selection to task extraction. Combined with push migration knowledge, you can reason about how modern schedulers maintain balance across diverse workload patterns and hardware configurations. Next: work stealing algorithms.