In multiprocessor systems, the fundamental promise of parallelism—multiple CPUs working simultaneously to achieve greater throughput—can be undermined by a deceptively simple problem: load imbalance. When some processors are overwhelmed with work while others sit idle, the system fails to realize its full potential, wasting both computational resources and power.
Push migration represents one of the most intuitive and widely deployed solutions to this problem. It embodies a proactive philosophy: when a processor detects that it is overloaded relative to its peers, it actively 'pushes' excess tasks to less-loaded processors, redistributing work before the imbalance becomes severe.
This page provides a comprehensive exploration of push migration—its mechanisms, implementation strategies, performance characteristics, and role in modern operating system schedulers. By the end, you will understand not just how push migration works, but when it is the optimal choice and why it behaves the way it does under various workload conditions.
By completing this page, you will understand: (1) The fundamental concept and motivation behind push migration, (2) The architectural components required for implementation, (3) Algorithms for detecting overload and selecting migration candidates, (4) Integration with SMP schedulers in production operating systems, and (5) The performance tradeoffs inherent in proactive load redistribution.
To understand push migration deeply, we must first appreciate the operating environment it addresses and the specific problem it solves.
The Multiprocessor Scheduling Context
In symmetric multiprocessing (SMP) systems, each CPU maintains its own run queue—a data structure containing processes ready to execute. The scheduler on each CPU independently selects the next process to run from its local queue. This per-CPU queue architecture is essential for scalability: it avoids contention on a single global queue lock, keeps scheduling decisions local and fast, and preserves cache affinity by tending to run tasks on the CPU where they last executed.
However, this distributed design introduces a fundamental challenge: run queues can become unbalanced. One CPU might have 20 runnable processes while another has none. Without intervention, tasks on the overloaded CPU suffer long queueing delays while the idle CPU's capacity goes unused.
Load imbalance arises naturally from multiple sources: processes are often created on specific CPUs (e.g., where the parent runs), processes exit unpredictably, I/O completions wake processes on arbitrary CPUs, and user-space thread pools may distribute work unevenly. Even with perfect initial placement, the system drifts toward imbalance over time.
The Push Migration Philosophy
Push migration addresses imbalance through a proactive, sender-initiated approach. The core concept is elegantly simple: an overloaded CPU periodically checks its own load, and when it finds itself carrying more than its fair share, it selects one or more tasks from its run queue and transfers them to less-loaded CPUs.
This 'push' terminology reflects the direction of agency: the overloaded processor initiates the migration, pushing work away from itself. This contrasts with 'pull' migration where idle processors request work from busy ones.
Formal Definition
We can define push migration formally as follows:
Push migration is a load balancing mechanism in which a processor P, upon detecting that its local load L(P) exceeds a threshold T, selects one or more tasks from its run queue and transfers them to processors whose load is below T, thereby reducing L(P) and increasing utilization of underloaded processors.
The elegance of this definition belies the complexity of implementation: What constitutes 'load'? How is the threshold determined? Which tasks should be migrated? How do we avoid thrashing? These questions drive the detailed design we explore next.
| Design Aspect | Question | Typical Approaches |
|---|---|---|
| Load Metric | How do we quantify CPU load? | Run queue length, weighted by priority/niceness |
| Threshold | When is a CPU 'overloaded'? | Absolute count, relative to average, percentage above mean |
| Target Selection | Which CPU receives migrated tasks? | Least loaded, round-robin among idle, NUMA-aware selection |
| Task Selection | Which tasks should migrate? | Lowest priority, most recently queued, cache-cold processes |
| Timing | When do we check for imbalance? | Periodic timer interrupt, after scheduling events |
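One way to see how these choices fit together is as a small policy description that a scheduler could consult. The structure, field names, and defaults below are illustrative assumptions for exposition, not an existing kernel interface:

```c
/* Illustrative bundle of the design choices from the table above. */
enum load_metric   { LOAD_NR_RUNNING, LOAD_WEIGHTED };
enum target_policy { TARGET_LEAST_LOADED, TARGET_IDLE_ROUND_ROBIN, TARGET_NUMA_AWARE };

struct push_policy {
    enum load_metric   metric;               /* How CPU load is quantified */
    unsigned int       threshold_pct;        /* Overloaded if this % above the mean */
    enum target_policy target;               /* Which CPU receives pushed tasks */
    bool               prefer_cache_cold;    /* Bias task selection toward cold tasks */
    unsigned int       min_push_interval_ms; /* Minimum gap between push attempts */
};

/* Defaults matching the constants used later on this page. */
static const struct push_policy default_push_policy = {
    .metric               = LOAD_WEIGHTED,
    .threshold_pct        = 125,
    .target               = TARGET_NUMA_AWARE,
    .prefer_cache_cold    = true,
    .min_push_interval_ms = 100,   /* PUSH_MIN_INTERVAL = HZ / 10 */
};
```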
Implementing push migration requires several coordinated components within the operating system's scheduler. Understanding these components is essential for grasping how the abstract concept becomes working code.
Per-CPU Run Queue Infrastructure
The foundation of push migration is the per-CPU run queue. Each processor maintains its own queue of runnable tasks, typically organized as a multi-level structure with different priority levels or scheduling classes.
In Linux, for example, the struct rq (run queue) structure exists for each CPU and contains:
```c
/* Simplified representation of per-CPU run queue components */
struct run_queue {
    /* Core scheduling structures */
    struct list_head tasks;          /* List of runnable tasks */
    unsigned int nr_running;         /* Number of runnable tasks */
    unsigned int nr_waiting;         /* Tasks waiting for I/O */

    /* Load tracking for balancing decisions */
    unsigned long load_weight;       /* Weighted load (priority-adjusted) */
    unsigned long avg_load;          /* Running average over time */
    unsigned long cpu_capacity;      /* This CPU's processing capacity */

    /* Push migration specific fields */
    int overloaded;                  /* Flag: is this CPU overloaded? */
    int push_count;                  /* Tasks pushed in current interval */
    unsigned long last_push_time;    /* Timestamp of last push attempt */

    /* Locking and synchronization */
    spinlock_t lock;                 /* Protects queue modifications */
    int migration_disabled;          /* Temporarily prevent migrations */

    /* Migration target tracking */
    cpumask_t idle_siblings;         /* Known idle CPUs in local domain */
    int busy_idx;                    /* Index for scanning busy CPUs */
};
```

Load Calculation Subsystem
Accurate load calculation is the cornerstone of effective push migration. The system must answer a deceptively complex question: How busy is this CPU compared to others?
Simple metrics like 'number of runnable tasks' are inadequate because tasks are not interchangeable: a queue of ten low-priority background jobs represents far less demand than a queue of three high-priority, CPU-bound tasks, and tasks that sleep frequently consume far less CPU than their presence in the queue suggests.
Weighted Load Calculation
Modern schedulers use weighted load calculations that account for task priority. Each task contributes a 'load weight' proportional to its scheduling priority:
```c
/* Priority-to-weight mapping (simplified from Linux CFS) */
static const int priority_to_weight[40] = {
/* -20 */ 88761, 71755, 56483, 46273, 36291,
/* -15 */ 29154, 23254, 18705, 14949, 11916,
/* -10 */  9548,  7620,  6100,  4904,  3906,
/*  -5 */  3121,  2501,  1991,  1586,  1277,
/*   0 */  1024,   820,   655,   526,   423,
/*   5 */   335,   272,   215,   172,   137,
/*  10 */   110,    87,    70,    56,    45,
/*  15 */    36,    29,    23,    18,    15,
};

/* Calculate weighted load for a run queue */
unsigned long calculate_weighted_load(struct run_queue *rq)
{
    unsigned long total_weight = 0;
    struct task_struct *task;

    list_for_each_entry(task, &rq->tasks, run_list) {
        int priority = task->static_prio - 100;  /* Normalize to 0-39 */
        if (priority < 0)
            priority = 0;
        if (priority > 39)
            priority = 39;
        total_weight += priority_to_weight[priority];
    }
    return total_weight;
}

/* Calculate exponential moving average for stability */
unsigned long update_load_average(struct run_queue *rq, unsigned long new_load)
{
    /* Classic EMA: new_avg = alpha * new + (1-alpha) * old */
    /* Using fixed-point: (new * 4 + old * 12) / 16 */
    rq->avg_load = (new_load * 4 + rq->avg_load * 12) >> 4;
    return rq->avg_load;
}
```

Using a running average rather than instantaneous load prevents 'migration thrashing'—where tasks bounce between CPUs responding to momentary fluctuations. The smoothing factor (alpha) balances responsiveness against stability. Too responsive: thrashing. Too stable: slow to correct imbalances.
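To see the smoothing behave, this small standalone program (a userspace sketch, not kernel code) pushes a one-off load spike through the same fixed-point EMA and prints how quickly the average settles:

```c
#include <stdio.h>

/* Same fixed-point EMA as update_load_average(): alpha = 4/16 */
static unsigned long ema(unsigned long avg, unsigned long sample)
{
    return (sample * 4 + avg * 12) >> 4;
}

int main(void)
{
    unsigned long avg = 1024;          /* steady-state load */
    unsigned long samples[] = { 4096,  /* brief spike */
                                1024, 1024, 1024, 1024, 1024 };

    for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        avg = ema(avg, samples[i]);
        printf("update %u: avg_load = %lu\n", i, avg);
    }
    /* The spike lifts avg_load to 1792, but three quiet updates later
     * it is already back below 1350 - a single burst never looks like
     * sustained overload. */
    return 0;
}
```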
Overload Detection Mechanism
With load calculated, the next component determines when a CPU is sufficiently overloaded to trigger migration. Several strategies exist:
Absolute Threshold
if (rq->nr_running > PUSH_THRESHOLD) trigger_push();
Simple but inflexible—ignores system-wide load context.
Relative Threshold (Average-Based)
if (rq->avg_load > system_avg_load * 1.25) trigger_push();
Adapts to overall system load but requires cross-CPU coordination.
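The relative-threshold check above, and the quick-check routine later on this page, lean on a helper that returns the system-wide average load per CPU. A minimal sketch, assuming the same per_cpu(runqueues, cpu) accessor and for_each_online_cpu() iterator used by the other examples here:

```c
/* System-wide mean of the per-CPU load averages, read without locks
 * since the balancer only needs an approximate view. */
static unsigned long avg_load_per_cpu(void)
{
    unsigned long total_load = 0;
    unsigned int  nr_cpus = 0;
    int cpu;

    for_each_online_cpu(cpu) {
        total_load += per_cpu(runqueues, cpu).avg_load;
        nr_cpus++;
    }

    return nr_cpus ? total_load / nr_cpus : 0;
}
```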
Imbalance-Based
group_avg = domain_total_load / nr_cpus;   /* per-CPU mean across the domain */
imbalance = rq->avg_load - group_avg;
if (imbalance > min_migration_threshold) migrate(imbalance);
Most sophisticated—considers where work would go, not just local overload.
Linux's CFS scheduler uses a hybrid approach, computing 'imbalance' as the weighted difference between scheduling domains and triggering migration when this exceeds a configurable threshold.
With the foundational components in place, we can now examine the push migration algorithm itself. This section presents a production-quality algorithm with all the nuances required for robust operation.
High-Level Algorithm Flow
The push migration process follows a four-phase structure: (1) a trigger and quick feasibility check, (2) a detailed imbalance calculation, (3) selection of a destination CPU and a candidate task, and (4) execution of the migration itself.
Let's examine each phase in detail.
```c
/* Main push migration entry point - called from timer interrupt */
void try_to_push_tasks(struct run_queue *this_rq)
{
    int this_cpu = smp_processor_id();
    struct migration_context ctx;
    int nr_pushed = 0;

    /* Phase 1: Quick check - any point in pushing? */
    if (!should_attempt_push(this_rq)) {
        return;  /* Not overloaded or no migration needed */
    }

    /* Acquire local runqueue lock */
    spin_lock(&this_rq->lock);

    /* Phase 2: Detailed analysis */
    if (!calculate_push_imbalance(this_rq, &ctx)) {
        spin_unlock(&this_rq->lock);
        return;  /* No actionable imbalance */
    }

    /* Phase 3: Find destination CPUs and candidate tasks */
    while (ctx.imbalance > 0 && nr_pushed < MAX_PUSH_PER_CYCLE) {
        struct task_struct *task;
        int dst_cpu;

        /* Find a suitable destination CPU */
        dst_cpu = find_push_destination(this_rq, &ctx);
        if (dst_cpu < 0) {
            break;  /* No suitable destinations available */
        }

        /* Select a task to migrate */
        task = select_task_for_push(this_rq, dst_cpu, &ctx);
        if (!task) {
            break;  /* No suitable tasks to migrate */
        }

        /* Phase 4: Execute the migration */
        if (migrate_task_to_cpu(task, dst_cpu)) {
            ctx.imbalance -= task_load_weight(task);
            nr_pushed++;
            this_rq->push_count++;
        }
    }

    spin_unlock(&this_rq->lock);

    /* Update statistics */
    update_push_statistics(this_cpu, nr_pushed);
}
```

Phase 1: Trigger and Quick Check
Push migration is typically triggered by a periodic timer interrupt (every few milliseconds) or after significant scheduling events. Before proceeding with expensive calculations, a quick check determines if migration is even plausible:
```c
/* Quick pre-check to avoid unnecessary work */
static bool should_attempt_push(struct run_queue *rq)
{
    /* Need at least 2 runnable tasks to push one away */
    if (rq->nr_running < 2) {
        return false;
    }

    /* Check if we've pushed recently (prevent thrashing) */
    if (time_before(jiffies, rq->last_push_time + PUSH_MIN_INTERVAL)) {
        return false;
    }

    /* Check if system has any idle CPUs worth pushing to */
    if (cpumask_empty(&rq->idle_siblings)) {
        /* No known idle CPUs - do detailed scan only periodically */
        if (!time_to_rescan_idle_cpus(rq)) {
            return false;
        }
    }

    /* Basic weighted load check */
    if (rq->load_weight <= avg_load_per_cpu() * PUSH_THRESHOLD_PCT / 100) {
        return false;  /* Not overloaded relative to average */
    }

    return true;  /* Worth doing detailed analysis */
}

/* Constants governing push behavior */
#define PUSH_MIN_INTERVAL  (HZ / 10)  /* Max 10 push attempts/second */
#define PUSH_THRESHOLD_PCT 125        /* 25% above average triggers push */
#define MAX_PUSH_PER_CYCLE 2          /* Limit work per cycle */
```

Phase 2: Imbalance Calculation
If the quick check passes, we compute the precise imbalance—the amount of load that should be migrated to achieve balance:
```c
/* Detailed imbalance calculation */
static bool calculate_push_imbalance(struct run_queue *rq,
                                     struct migration_context *ctx)
{
    unsigned long this_load = rq->avg_load;
    unsigned long target_load;
    unsigned long system_load = 0;
    int nr_online_cpus = 0;
    int cpu;

    /* Calculate system-wide load */
    for_each_online_cpu(cpu) {
        system_load += per_cpu(runqueues, cpu).avg_load;
        nr_online_cpus++;
    }

    /* Target: each CPU should have average load */
    target_load = system_load / nr_online_cpus;

    /* Imbalance is how much we exceed target */
    if (this_load <= target_load) {
        return false;  /* We're at or below average - no push needed */
    }

    ctx->imbalance = this_load - target_load;

    /* Apply minimum threshold to prevent trivial migrations */
    if (ctx->imbalance < MIN_PUSH_IMBALANCE) {
        return false;
    }

    /* Adjust for migration cost - don't push if benefit is marginal */
    if (ctx->imbalance < estimated_migration_cost()) {
        return false;
    }

    ctx->this_load = this_load;
    ctx->target_load = target_load;
    ctx->system_load = system_load;
    return true;
}
```

Every migration has a cost: cache invalidation, TLB flushes, memory bandwidth for moving task state. Push migration must only occur when the expected benefit (reduced imbalance) exceeds this cost. Without this check, aggressive pushing can decrease overall throughput despite achieving better balance on paper.
Phase 3: Destination and Task Selection
With imbalance quantified, we must select which task to push and where to push it. These decisions profoundly affect migration effectiveness:
```c
/* Find the best destination CPU for pushing a task */
static int find_push_destination(struct run_queue *src_rq,
                                 struct migration_context *ctx)
{
    int best_cpu = -1;
    unsigned long best_capacity = 0;
    int cpu;

    /* Priority 1: Check for any idle CPUs in same NUMA node */
    for_each_cpu(cpu, &src_rq->idle_siblings) {
        if (cpu_is_same_numa_node(src_rq, cpu)) {
            struct run_queue *dst_rq = &per_cpu(runqueues, cpu);
            if (dst_rq->nr_running == 0) {
                /* Perfect match: idle CPU on same NUMA node */
                return cpu;
            }
        }
    }

    /* Priority 2: Any idle CPU (cross-NUMA if necessary) */
    for_each_cpu(cpu, &src_rq->idle_siblings) {
        struct run_queue *dst_rq = &per_cpu(runqueues, cpu);
        if (dst_rq->nr_running == 0) {
            return cpu;
        }
    }

    /* Priority 3: Find least-loaded CPU that's below average */
    for_each_online_cpu(cpu) {
        struct run_queue *dst_rq = &per_cpu(runqueues, cpu);
        unsigned long available_capacity;

        if (cpu == smp_processor_id()) {
            continue;  /* Don't push to self */
        }

        /* Skip CPUs at or above average load */
        if (dst_rq->avg_load >= ctx->target_load) {
            continue;
        }

        /* Calculate how much load this CPU can accept */
        available_capacity = ctx->target_load - dst_rq->avg_load;
        if (available_capacity > best_capacity) {
            best_capacity = available_capacity;
            best_cpu = cpu;
        }
    }

    return best_cpu;
}

/* Select a task suitable for migration to given destination */
static struct task_struct *select_task_for_push(struct run_queue *rq,
                                                int dst_cpu,
                                                struct migration_context *ctx)
{
    struct task_struct *best_task = NULL;
    struct task_struct *task;
    int best_score = INT_MIN;

    list_for_each_entry(task, &rq->tasks, run_list) {
        int score = 0;

        /* Skip tasks that cannot migrate */
        if (!task_can_migrate(task, dst_cpu)) {
            continue;
        }

        /* Prefer cache-cold tasks (haven't run recently) */
        if (task_cache_cold(task)) {
            score += 100;
        }

        /* Prefer tasks with weak CPU affinity */
        if (task_has_weak_affinity(task)) {
            score += 50;
        }

        /* Prefer lower priority tasks (less latency-sensitive) */
        score += (MAX_PRIO - task->prio);

        /* Prefer tasks whose load contribution matches our needs */
        if (task_load_weight(task) <= ctx->imbalance * 2) {
            score += 25;  /* Right size for our imbalance */
        }

        if (score > best_score) {
            best_score = score;
            best_task = task;
        }
    }

    return best_task;
}

/* Check if a task can legally migrate to a given CPU */
static bool task_can_migrate(struct task_struct *task, int dst_cpu)
{
    /* Check CPU affinity mask */
    if (!cpumask_test_cpu(dst_cpu, &task->cpus_allowed)) {
        return false;
    }

    /* Check if task requested migration disabled */
    if (task->migration_disabled) {
        return false;
    }

    /* Currently running tasks cannot migrate */
    if (task_running(task)) {
        return false;
    }

    /* Kernel threads with CPU bindings */
    if (task_is_bound_kthread(task)) {
        return false;
    }

    return true;
}
```

Phase 4: Migration Execution
Once destination and task are selected, the actual migration transfers the task's scheduling state:
```c
/* Execute the migration of a task to a new CPU */
static bool migrate_task_to_cpu(struct task_struct *task, int dst_cpu)
{
    struct run_queue *src_rq = task_rq(task);
    struct run_queue *dst_rq = &per_cpu(runqueues, dst_cpu);

    /* The caller already holds the source lock; we also need the
     * destination lock before touching its queue.  To prevent deadlock
     * we honor a fixed ordering (lower-numbered CPU first): when the
     * destination is lower-numbered, taking its lock now would violate
     * that order, so only trylock and back off on failure. */
    if (dst_cpu > smp_processor_id()) {
        spin_lock(&dst_rq->lock);
    } else if (!spin_trylock(&dst_rq->lock)) {
        return false;  /* Retry on a later balancing pass */
    }

    /* Dequeue from source */
    dequeue_task(src_rq, task);

    /* Update task's CPU assignment */
    task->cpu = dst_cpu;

    /* Enqueue on destination */
    enqueue_task(dst_rq, task);

    /* Update load tracking */
    src_rq->load_weight -= task_load_weight(task);
    dst_rq->load_weight += task_load_weight(task);
    src_rq->nr_running--;
    dst_rq->nr_running++;

    /* Release the destination lock */
    spin_unlock(&dst_rq->lock);

    /* Record migration for statistics */
    record_migration(smp_processor_id(), dst_cpu, task);

    /* Send IPI to wake destination CPU if it was idle */
    if (need_resched_cpu(dst_cpu)) {
        send_reschedule_ipi(dst_cpu);
    }

    return true;
}
```

Modern multiprocessor systems are typically organized as Non-Uniform Memory Access (NUMA) architectures, where memory access latency varies based on the physical relationship between CPUs and memory controllers. This architectural reality profoundly impacts push migration strategies.
The NUMA Challenge
In NUMA systems, migrating a task to a distant CPU can significantly degrade that task's performance: its memory pages remain on the original node, so every access now crosses the inter-node interconnect at higher latency and lower bandwidth, and the task also loses whatever cache state it had built up on its old CPU.
Naive push migration that ignores NUMA topology can actually decrease system throughput despite achieving better load balance.
| Approach | Load Balance | Memory Locality | Overall Throughput |
|---|---|---|---|
| NUMA-Blind | Optimal | Severely degraded | May decrease 20-40% |
| Local-Only | Suboptimal | Preserved | Limited improvement |
| NUMA-Aware Hybrid | Near-optimal | Mostly preserved | Best overall results |
Hierarchical Domain Organization
To handle NUMA effectively, schedulers organize CPUs into hierarchical scheduling domains. Each domain represents a set of CPUs that share some architectural characteristic:
NUMA Node 0 NUMA Node 1
+-----------+ +-----------+
| Core 0-3 |<------>| Core 4-7 |
| L3 Cache | QPI | L3 Cache |
| DDR4 Bank | | DDR4 Bank |
+-----------+ +-----------+
| |
v v
SMT Domain SMT Domain
(Core siblings) (Core siblings)
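To make the hierarchy concrete, here is one plausible in-memory representation of a scheduling domain and its groups. It is a sketch whose field names are assumptions chosen to line up with the NUMA-aware balancing code below, not the exact Linux structures:

```c
/* Sketch of a per-CPU chain of scheduling domains, from the SMT level
 * up to the NUMA level. */
struct sched_group {
    struct sched_group *next;   /* Circular list of groups in the domain */
    cpumask_t           cpus;   /* CPUs belonging to this group */
};

#define sched_group_cpus(sg)  (&(sg)->cpus)

struct sched_domain {
    struct sched_domain *parent;                 /* Next larger (costlier) domain */
    struct sched_group  *groups;                 /* Groups of CPUs inside this domain */
    cpumask_t            span;                   /* All CPUs covered by this domain */
    int                  level;                  /* 0 = SMT, 1 = core, 2 = socket, 3 = NUMA */
    unsigned int         flags;                  /* e.g., SD_NUMA for cross-node domains */
    unsigned int         balance_interval;       /* Milliseconds between balance attempts */
    unsigned long        last_balance;           /* jiffies at the last balance attempt */
    unsigned int         migration_success_rate; /* Feedback statistic, in percent */
};
```

Walking for_each_domain(cpu, sd) then simply follows the parent pointers from the smallest, cheapest domain outward.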
NUMA-Aware Push Strategy
With domain hierarchy in place, push migration follows a domain-aware strategy:
```c
/* NUMA-aware destination selection for push migration */
static int find_numa_aware_destination(struct run_queue *src_rq,
                                       struct migration_context *ctx)
{
    int src_cpu = rq_cpu(src_rq);
    struct sched_domain *sd;
    int best_cpu = -1;

    /* Walk up the domain hierarchy, starting local */
    for_each_domain(src_cpu, sd) {
        struct sched_group *sg = sd->groups;
        bool found_in_domain = false;

        /* Scan scheduling groups within this domain */
        do {
            int cpu;

            for_each_cpu(cpu, sched_group_cpus(sg)) {
                struct run_queue *dst_rq = &per_cpu(runqueues, cpu);

                if (cpu == src_cpu)
                    continue;

                /* Check if this CPU can accept work */
                if (dst_rq->avg_load < ctx->target_load) {
                    /* At lower domains (closer), prefer any available */
                    /* At higher domains (NUMA), require significant benefit */
                    unsigned long domain_cost = domain_migration_cost(sd);
                    unsigned long benefit = ctx->imbalance;

                    if (benefit > domain_cost) {
                        best_cpu = cpu;
                        found_in_domain = true;

                        /* Prefer idle CPUs */
                        if (dst_rq->nr_running == 0) {
                            return cpu;  /* Perfect match at this level */
                        }
                    }
                }
            }
            sg = sg->next;
        } while (sg != sd->groups);

        /* If we found a candidate at this level, use it */
        /* Prevents unnecessary promotion to higher (costlier) domains */
        if (found_in_domain && best_cpu >= 0) {
            return best_cpu;
        }
    }

    return best_cpu;  /* May be -1 if no suitable destination */
}

/* Estimate cost of migrating across a scheduling domain */
static unsigned long domain_migration_cost(struct sched_domain *sd)
{
    /* Base cost increases with domain level */
    unsigned long base_cost = sd->level * 1000;

    /* NUMA domains have additional memory latency cost */
    if (sd->flags & SD_NUMA) {
        base_cost += numa_remote_access_penalty();
    }

    /* Factor in domain's historical migration success rate */
    base_cost = base_cost * sd->migration_success_rate / 100;

    return base_cost;
}
```

The scheduling domain hierarchy encodes migration costs implicitly. SMT siblings share L1 cache—nearly free migration. Same-socket cores share L3—cheap migration. Same-node CPUs share memory—moderate cost. Cross-node requires interconnect—expensive. Push migration uses this hierarchy to make cost-aware decisions.
Linux's Completely Fair Scheduler (CFS) provides a production-grade implementation of push migration that has been refined over two decades of deployment on systems ranging from smartphones to supercomputers. Examining its design reveals practical solutions to the theoretical challenges we've discussed.
CFS Load Balancing Overview
CFS integrates push migration into a broader load balancing framework triggered by timer interrupts. The run_rebalance_domains() function, invoked periodically, walks the scheduling domain hierarchy and calls load_balance() for domains requiring intervention.
Key CFS Concepts for Push
Load Weight — CFS uses PELT (Per-Entity Load Tracking) to compute temporally-decayed load for each task, accounting for both recent CPU usage and historical patterns (a decay sketch follows this list).
Scheduling Domains — Hierarchical CPU groupings (SMT → Core → Socket → NUMA) with per-domain balance intervals.
Busiest Queue Detection — Before pushing, CFS identifies the busiest CPU/group, ensuring push actions actually improve balance.
Migration Throttling — Rate limits on migrations prevent oscillation.
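The heart of PELT is a geometrically decaying sum: each accounting period (roughly 1 ms) contributes fully when fresh and about half as much 32 periods later. A minimal fixed-point sketch of that decay rule, with illustrative constants rather than the kernel's exact values:

```c
/* PELT-style decay: choose y so that y^32 ~= 1/2, i.e. a period's
 * contribution halves roughly every 32 ms.  Q10 fixed point:
 * y ~= 0.9786 ~= 1002/1024. */
#define PELT_Y_Q10   1002UL
#define PELT_SHIFT   10

/* Fold one accounting period into the running sum: decay the existing
 * history, then add this period's runnable contribution (0..1024). */
static unsigned long pelt_accumulate(unsigned long load_sum,
                                     unsigned long runnable_contrib)
{
    load_sum = (load_sum * PELT_Y_Q10) >> PELT_SHIFT;
    return load_sum + runnable_contrib;
}
```

A task's load_avg is this sum normalized by its maximum possible value, which is where the LOAD_AVG_MAX division in the CFS snippet below comes from.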
```c
/* Simplified representation of CFS load balancing logic */
/* See kernel/sched/fair.c for full implementation */

/* Main load balance function called for each scheduling domain */
static int load_balance(struct lb_env *env)
{
    struct run_queue *busiest;
    unsigned long imbalance;
    int ld_moved = 0;

    /* Find the busiest group in this scheduling domain */
    struct sched_group *busiest_group = find_busiest_group(env);
    if (!busiest_group) {
        return 0;  /* No imbalance at group level */
    }

    /* Find the busiest run queue within that group */
    busiest = find_busiest_queue(env, busiest_group);
    if (!busiest) {
        return 0;  /* No specific CPU to balance from */
    }

    /* Calculate how much load to move */
    imbalance = calculate_imbalance(env, busiest_group);
    if (imbalance == 0) {
        return 0;  /* Imbalance below threshold */
    }

    /* Attempt to migrate tasks from busiest to local CPU */
    env->src_rq = busiest;
    env->dst_rq = this_rq();
    env->imbalance = imbalance;

    /* The actual migration loop */
    while (env->imbalance > 0) {
        struct task_struct *p;

        /* Select a task that can migrate */
        p = detach_one_task(env);
        if (!p) {
            break;  /* No more migratable tasks */
        }

        /* Attach the task to local run queue */
        attach_one_task(env->dst_rq, p);
        ld_moved++;
        env->imbalance -= task_load(p);
    }

    return ld_moved;
}

/* PELT-based load calculation (simplified) */
static unsigned long task_load(struct task_struct *p)
{
    /* Load is decayed average of CPU demand */
    return p->se.load.weight * p->se.avg.load_avg / LOAD_AVG_MAX;
}
```

Balance Intervals by Domain Level
CFS adjusts push migration frequency based on domain level, balancing responsiveness against overhead:
| Domain Level | Balance Interval | Rationale |
|---|---|---|
| SMT (Hyperthreads) | 4ms | Same core—nearly free migration |
| MC (Multi-core) | 8ms | Same socket—low cost |
| DIE (Same die) | 16ms | Shared L3—moderate cost |
| NUMA (Cross-node) | 64ms | High latency—cautious balancing |
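These intervals are typically enforced with a per-domain timestamp: the periodic balancer walks a CPU's domain chain and skips any domain whose interval has not yet elapsed. A minimal sketch using the illustrative sched_domain fields from earlier, with a hypothetical load_balance_domain() standing in for the load_balance() routine sketched above:

```c
/* Periodic balancing pass for one CPU: balance each domain level only
 * when its own interval has elapsed, so cheap (SMT) domains are
 * rebalanced often and expensive (NUMA) domains only rarely. */
static void rebalance_domains_for_cpu(int cpu, struct sched_domain *sd_chain)
{
    struct sched_domain *sd;

    for (sd = sd_chain; sd; sd = sd->parent) {
        unsigned long interval = msecs_to_jiffies(sd->balance_interval);

        if (time_before(jiffies, sd->last_balance + interval)) {
            continue;  /* Too soon to rebalance at this level */
        }

        sd->last_balance = jiffies;
        load_balance_domain(cpu, sd);  /* Stand-in for load_balance() above */
    }
}
```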
CFS actually implements both push and pull, but in a unified framework. The 'push' perspective occurs when code running on CPU A considers sending tasks away. The 'pull' perspective occurs when idle CPU B looks for work to take. CFS's load_balance() can be viewed as 'push from busiest to local' or 'pull from busiest to local' depending on which CPU initiates.
Understanding when push migration helps—and when it hurts—requires analyzing its performance characteristics under different workload patterns.
Beneficial Scenarios
Push migration provides significant benefits in scenarios with sustained, uneven load: long-running CPU-bound tasks that keep some queues deep while others drain, bursty task creation concentrated on a few CPUs (such as fork-heavy servers), and workloads where imbalance persists long enough for migrated tasks to amortize the migration cost.
Detrimental Scenarios
Push migration can hurt performance when tasks are short-lived and complete before the rebalanced placement pays off, when working sets are cache-hot and migration destroys locality, when migrations cross NUMA boundaries without enough imbalance to justify the remote-memory penalty, or when load fluctuates so rapidly that the balancer ends up chasing noise.
| Overhead Source | Typical Cost | Mitigation Strategy |
|---|---|---|
| Run queue lock contention | 1-10 μs | Per-CPU queues, lock-free checking |
| Task state copying | 0.5-2 μs | Minimal state transfer |
| Cache invalidation (L1/L2) | 100-500 cycles | Prefer cache-cold tasks |
| TLB flush on new CPU | 50-200 cycles | Batch migrations |
| Memory controller switch (NUMA) | 100-300 ns additional latency | NUMA-aware selection |
| IPI for destination wakeup | 1-5 μs | Coalesce with other IPIs |
Quantifying the Benefit
The net benefit of push migration can be modeled as:
Benefit = (Imbalance_Reduction × Task_Throughput_Gain) - Migration_Cost
Where Imbalance_Reduction is the amount of load removed from the overloaded CPU, Task_Throughput_Gain is the reduction in queueing delay the migrated tasks experience on their new CPU, and Migration_Cost aggregates the direct and indirect overheads listed in the table above.
For a task that would wait 50ms in queue on an overloaded CPU: migration overhead measured in tens of microseconds is negligible next to the latency saved, so pushing is a clear win.
For a task that would wait 200μs: the saved wait is only an order of magnitude larger than the migration cost, and the loss of cache warmth can erase the difference, so the benefit is marginal.
For a task with 1ms remaining lifetime: the task will likely finish before the rebalanced placement pays off, so migration is pure overhead and should be skipped.
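The same reasoning can be written as a guard. The sketch below encodes the decision rule the three examples above illustrate; all quantities are in microseconds, and the function name and margin factor are assumptions for illustration:

```c
/* Rough cost/benefit guard for one candidate migration.  Values are
 * estimates: the cost comes from the overhead table above, the savings
 * from per-task queueing accounting. */
static bool migration_worthwhile(unsigned long queue_wait_saved_us,
                                 unsigned long expected_remaining_runtime_us,
                                 unsigned long migration_cost_us)
{
    /* A task about to finish cannot recoup the cost of moving. */
    if (expected_remaining_runtime_us < migration_cost_us)
        return false;

    /* Demand a clear margin so marginal cases stay where they are. */
    return queue_wait_saved_us > 2 * migration_cost_us;
}
```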
Production schedulers often track task 'cache footprint' or 'run time since last migration'. Tasks with warm caches or short expected remaining runtime are skipped for migration. This simple heuristic prevents the most common anti-pattern: migrating a task that was about to complete anyway.
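The candidate-selection code earlier calls a task_cache_cold() predicate without defining it. A minimal sketch of that heuristic, assuming the task records a timestamp of its last execution and using an illustrative 5 ms cache-hot horizon (the clock helper is hypothetical):

```c
/* A task is treated as cache-cold if it has not run recently enough
 * for its working set to plausibly survive in this CPU's caches. */
#define CACHE_HOT_TIME_NS  (5ULL * 1000 * 1000)   /* Illustrative tunable */

static bool task_cache_cold(struct task_struct *task)
{
    u64 now = sched_clock_now();                  /* Hypothetical clock helper */

    return (now - task->last_ran_ns) > CACHE_HOT_TIME_NS;
}
```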
Push migration represents a fundamental technique in the operating system's arsenal for extracting maximum performance from multiprocessor hardware. The key insights from this exploration: it is a proactive, sender-initiated mechanism; its effectiveness hinges on a meaningful load metric and thresholds that avoid thrashing; destination and task selection must respect cache and NUMA topology; and every migration must clear a cost/benefit bar, since better balance on paper is worthless if it lowers throughput.
Connecting to the Broader Picture
Push migration is one half of the load balancing equation. In the next page, we'll explore pull migration—the complementary approach where idle processors actively seek work from busy ones. Together, push and pull form a complete solution for maintaining balance across diverse workload patterns.
You now possess deep understanding of push migration—from conceptual foundations through production implementation. You can reason about when push migration helps, when it hurts, and how operating systems tune its behavior across different hardware topologies. Next, we examine the complementary pull migration mechanism.