We've explored what load balancing does (push, pull, and work stealing) and how it works (algorithms and data structures). Now we address an equally critical question: when should load balancing occur?
Balancing frequency represents a fundamental tradeoff in system design. Balance too frequently, and you waste CPU cycles on overhead—computing loads, acquiring locks, migrating tasks—cycles that could be spent on actual work. Balance too infrequently, and imbalances persist, leaving CPUs idle while others are overloaded.
This page explores balancing frequency in depth: the theoretical considerations, the practical tradeoffs, interval selection strategies, and the adaptive mechanisms that let systems self-tune across varying workload conditions.
By completing this page, you will understand: (1) The overhead-responsiveness tradeoff in balancing frequency, (2) Timer-based vs. event-driven balance triggers, (3) Hierarchical frequency strategies for multi-level topologies, (4) Adaptive frequency mechanisms that respond to workload dynamics, and (5) Production scheduler tuning parameters for balance frequency.
Every load balancing invocation consumes resources. Understanding these costs is essential for making informed frequency decisions.
The Cost Components
Load balancing incurs several distinct costs:
| Component | Cost Range | Frequency Impact |
|---|---|---|
| Load calculation | 100-500 ns | Scales with CPU count |
| Lock operations | 50-200 ns uncontended | Increases with contention |
| Cache effects | Variable, 1-10 μs effective | Depends on balance code size |
| IPI delivery | 1-5 μs | Per remote CPU involved |
| Task migration | 5-50 μs | Only when migration occurs |
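As a rough sanity check, these components can be summed to estimate the cost of a single balance invocation. The sketch below uses midpoint values from the table; the constants are illustrative placeholders, not measurements from any particular machine.

```c
#include <stdio.h>

/* Back-of-the-envelope cost of one balance invocation, summing midpoint
 * values from the table above. All constants are illustrative. */
int main(void)
{
    double load_calc_ns = 300.0;     /* load calculation: ~100-500 ns */
    double lock_ns      = 125.0;     /* uncontended locks: ~50-200 ns */
    double cache_ns     = 5000.0;    /* cache effects: ~1-10 us       */
    double ipi_ns       = 3000.0;    /* one IPI: ~1-5 us              */
    double migrate_ns   = 25000.0;   /* one migration: ~5-50 us       */

    double check_only = load_calc_ns + lock_ns + cache_ns;
    double with_move  = check_only + ipi_ns + migrate_ns;

    printf("balance check only: ~%.1f us\n", check_only / 1000.0);
    printf("check + migration : ~%.1f us\n", with_move / 1000.0);
    return 0;
}
```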
The Benefit Model
Balancing provides benefit when it corrects imbalances that would otherwise waste CPU cycles:
Benefit = (Idle_Time_Prevented) × (CPU_Throughput)
If a CPU sits idle for 10ms when it could have been running a task, the lost benefit is 10ms of compute capacity. Frequent balancing catches imbalances early; infrequent balancing lets them accumulate.
The Optimal Balance Point
The optimal frequency maximizes net benefit:
Net_Benefit = Imbalance_Reduction_Benefit - Balance_Overhead_Cost
This optimal point depends on workload characteristics: highly dynamic workloads (frequent task creation, destruction, and load shifts) justify more frequent balancing, while stable workloads need far less.
A common guideline: scheduler overhead (including load balancing) should consume less than 1% of total CPU time. If balance operations are running every 1ms and each takes 10μs, that's 1% just for balancing. Monitor /proc/schedstat or equivalent to verify overhead stays within budget.
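To make that budget concrete, the sketch below computes the overhead fraction from an assumed per-invocation cost and balance interval; both inputs are placeholders that would, in practice, come from measurement (for example, deltas in /proc/schedstat over a known window).

```c
#include <stdio.h>

/* Does a given balance cost and interval fit a 1% overhead budget?
 * The inputs are assumed values; replace them with measured ones. */
int main(void)
{
    double cost_per_balance_us = 10.0;    /* assumed cost of one invocation */
    double balance_interval_us = 1000.0;  /* one balance every 1 ms         */
    double budget_pct          = 1.0;     /* guideline: <1% of CPU time     */

    double overhead_pct = 100.0 * cost_per_balance_us / balance_interval_us;

    printf("balance overhead: %.2f%% of CPU time (%s the %.1f%% budget)\n",
           overhead_pct,
           overhead_pct <= budget_pct ? "within" : "exceeds",
           budget_pct);
    return 0;
}
```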
Load balancing can be triggered by two fundamental mechanisms: timer-based (periodic) and event-driven. Most production schedulers use a combination of both.
Timer-Based (Periodic) Balancing
A timer interrupt fires at regular intervals, invoking the load balance routine regardless of workload state.
```c
/* Timer-based periodic load balancing */

#define BALANCE_TICK_MS 4   /* Base balance interval */
#define JITTER_FACTOR   8   /* Spread timer across CPUs */

/* Initialize per-CPU balance timer */
void init_balance_timer(int cpu)
{
    struct run_queue *rq = cpu_rq(cpu);

    /* Stagger timers to prevent synchronization surge */
    unsigned long initial_delay = BALANCE_TICK_MS + (cpu % JITTER_FACTOR);

    timer_setup(&rq->balance_timer, balance_timer_handler, 0);
    mod_timer(&rq->balance_timer, jiffies + msecs_to_jiffies(initial_delay));
}

/* Timer handler - invoked periodically */
void balance_timer_handler(struct timer_list *timer)
{
    struct run_queue *rq = container_of(timer, struct run_queue, balance_timer);

    /* Perform balance check */
    if (should_balance(rq)) {
        trigger_balance(rq);
    }

    /* Re-arm timer for next interval */
    unsigned long next_interval = calculate_next_interval(rq);
    mod_timer(&rq->balance_timer, jiffies + next_interval);
}

/* Calculate next balance interval (may be adaptive) */
unsigned long calculate_next_interval(struct run_queue *rq)
{
    unsigned long base = msecs_to_jiffies(BALANCE_TICK_MS);

    /* If last balance found work, keep checking frequently */
    if (rq->last_balance_migrated > 0) {
        return base;
    }

    /* If system seems balanced, increase interval */
    if (rq->balance_attempts_without_migration > 10) {
        return base * 2;   /* Back off when nothing to do */
    }

    return base;
}
```

Event-Driven Balancing
Event-driven balancing triggers on specific scheduling events rather than time passage:
```c
/* Event-driven load balancing triggers */

/* Event 1: CPU going idle - try to pull work */
void schedule_idle_balance(struct run_queue *rq)
{
    /* Immediate pull attempt when becoming idle */
    pull_tasks_from_busy_cpus(rq);
}

/* Event 2: Task wakeup - consider placement */
int select_task_rq_fair(struct task_struct *p, int prev_cpu)
{
    /* Where should this waking task run? */
    int new_cpu = find_idlest_cpu(p, prev_cpu);

    if (new_cpu != prev_cpu) {
        /* Implicit "push" via wake affinity */
        migrate_task_rq(p, new_cpu);
    }
    return new_cpu;
}

/* Event 3: Task creation - inherent load injection */
void wake_up_new_task(struct task_struct *p)
{
    int cpu = select_task_rq(p, task_cpu(p));

    /* New task may trigger immediate balance if it overloads */
    if (rq_of(cpu)->nr_running > OVERLOAD_THRESHOLD) {
        trigger_push_balance(rq_of(cpu));
    }
}

/* Event 4: Affinity change - may require migration */
void set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
{
    int old_cpu = task_cpu(p);

    if (!cpumask_test_cpu(old_cpu, new_mask)) {
        /* Current CPU no longer allowed - must migrate */
        int new_cpu = cpumask_first(new_mask);
        migrate_task_immediately(p, new_cpu);
    }
}
```

Production schedulers combine both triggers. Timer-based balancing catches gradual drift and provides a safety net. Event-driven balancing provides immediate response to significant changes. The combination covers all scenarios.
Modern systems have hierarchical CPU topologies: SMT siblings, cores on a socket, sockets in a NUMA node, multiple NUMA nodes. Migration cost varies dramatically across these levels, and so should balancing frequency.
The Hierarchical Principle
Balance frequently at low-cost levels; balance cautiously at high-cost levels.
This principle reflects the asymmetric cost structure:
| Domain Level | SD_FLAG | Typical Interval | Rationale |
|---|---|---|---|
| SMT | SD_SHARE_CPUCAPACITY | 1-2 ms | Nearly free migration |
| MC (Multi-Core) | SD_SHARE_PKG_RESOURCES | 4-8 ms | Shared L3, low cost |
| DIE | SD_ASYM_PACKING | 8-16 ms | Same die, moderate cost |
| NUMA | SD_NUMA | 16-64 ms | High cost, cautious approach |
```c
/* Hierarchical load balancing implementation */

/* Per-domain balance parameters */
struct sched_domain {
    /* Domain identification */
    int level;                        /* 0=SMT, 1=MC, 2=DIE, 3=NUMA */
    unsigned long flags;              /* SD_* flags */

    /* Balance interval parameters */
    unsigned int balance_interval;    /* Ticks between balance attempts */
    unsigned int max_interval;        /* Maximum interval (backed off) */
    unsigned int min_interval;        /* Minimum interval (urgent) */

    /* Balance timing tracking */
    unsigned long last_balance;       /* Jiffies of last balance */
    unsigned long next_balance;       /* Jiffies of next balance */

    /* Balance cost tracking for adaptation */
    unsigned long avg_balance_cost;
    unsigned long avg_balance_benefit;
};

/* Domain-aware balance trigger check */
void run_rebalance_domains(struct run_queue *rq)
{
    struct sched_domain *sd;

    /* Walk domain hierarchy from innermost to outermost */
    for_each_domain(rq_cpu(rq), sd) {
        /* Check if this domain is due for balance */
        if (time_before(jiffies, sd->next_balance)) {
            continue;   /* Not yet time for this domain */
        }

        /* Perform balance at this domain level */
        int migrated = load_balance(sd, rq);

        /* Update next balance time based on results */
        update_domain_balance_interval(sd, migrated);
    }
}

/* Adaptive interval update based on balance results */
void update_domain_balance_interval(struct sched_domain *sd, int migrated)
{
    if (migrated > 0) {
        /* Successful balance - maintain current or decrease interval */
        sd->balance_interval = max(sd->balance_interval * 3 / 4,
                                   sd->min_interval);
    } else {
        /* No migration needed - increase interval */
        sd->balance_interval = min(sd->balance_interval * 5 / 4,
                                   sd->max_interval);
    }

    sd->next_balance = jiffies + sd->balance_interval;
    sd->last_balance = jiffies;
}

/* Initialize domain intervals based on level */
void init_domain_intervals(struct sched_domain *sd)
{
    switch (sd->level) {
    case 0: /* SMT */
        sd->min_interval     = msecs_to_jiffies(1);
        sd->balance_interval = msecs_to_jiffies(2);
        sd->max_interval     = msecs_to_jiffies(8);
        break;
    case 1: /* MC */
        sd->min_interval     = msecs_to_jiffies(4);
        sd->balance_interval = msecs_to_jiffies(8);
        sd->max_interval     = msecs_to_jiffies(32);
        break;
    case 2: /* DIE */
        sd->min_interval     = msecs_to_jiffies(8);
        sd->balance_interval = msecs_to_jiffies(16);
        sd->max_interval     = msecs_to_jiffies(64);
        break;
    case 3: /* NUMA */
        sd->min_interval     = msecs_to_jiffies(16);
        sd->balance_interval = msecs_to_jiffies(64);
        sd->max_interval     = msecs_to_jiffies(256);
        break;
    }
}
```

When balance at a lower domain succeeds, higher domains can sometimes be skipped. If SMT balancing fixes the imbalance, there's no need to incur NUMA balancing cost. This 'domain stopping' optimization reduces overhead while maintaining effectiveness.
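A minimal sketch of how domain stopping might be wired into the walk above, in the same pseudocode style: `load_balance_with_flag()` and the `continue_balancing` flag are illustrative names (Linux's `rebalance_domains()` uses a similar flag), not an exact kernel interface.

```c
/* Sketch: stop walking outward once an inner domain resolves the imbalance.
 * load_balance_with_flag() and continue_balancing are illustrative names. */
void run_rebalance_domains_stopping(struct run_queue *rq)
{
    struct sched_domain *sd;
    int continue_balancing = 1;   /* Cleared when this CPU is balanced enough */

    for_each_domain(rq_cpu(rq), sd) {
        if (!continue_balancing)
            break;   /* Inner level already fixed it - skip costlier levels */

        if (time_before(jiffies, sd->next_balance))
            continue;   /* This level not due yet */

        /* The balance routine clears the flag when no further pulling
         * at outer (more expensive) levels is needed this round. */
        int migrated = load_balance_with_flag(sd, rq, &continue_balancing);
        update_domain_balance_interval(sd, migrated);
    }
}
```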
Static balance intervals are suboptimal for dynamic workloads. Adaptive mechanisms adjust frequency based on observed behavior, achieving better overhead-benefit balance.
Workload-Responsive Adaptation
The key insight: balance frequency should track workload dynamism. When tasks are being created/destroyed rapidly, balance often. When workload is stable, balance rarely.
```c
/* Adaptive balance frequency based on workload dynamics */

struct balance_statistics {
    /* Activity metrics */
    unsigned long task_arrivals;      /* Tasks placed on queue */
    unsigned long task_departures;    /* Tasks leaving queue */
    unsigned long load_changes;       /* Significant load deltas */

    /* Balance result metrics */
    unsigned long balance_attempts;   /* Total balance invocations */
    unsigned long migrations;         /* Successful migrations */
    unsigned long failed_finds;       /* No imbalance found */

    /* Timing metrics */
    unsigned long balance_cost_sum;   /* Cumulative time in balance */
    unsigned long balance_time_sum;   /* Time saved by migrations */
};

/* Per-CPU balance stats */
DEFINE_PER_CPU(struct balance_statistics, balance_stats);

/* Calculate optimal interval based on recent history */
unsigned long calculate_adaptive_interval(struct run_queue *rq)
{
    struct balance_statistics *stats = &per_cpu(balance_stats, rq_cpu(rq));
    unsigned long base_interval = msecs_to_jiffies(4);
    unsigned long interval;

    /* Factor 1: Activity level */
    unsigned long activity = stats->task_arrivals + stats->task_departures;
    if (activity > 100) {
        /* High activity - balance frequently */
        interval = base_interval / 2;
    } else if (activity < 10) {
        /* Low activity - balance rarely */
        interval = base_interval * 4;
    } else {
        interval = base_interval;
    }

    /* Factor 2: Migration success rate */
    if (stats->balance_attempts > 0) {
        unsigned long success_rate =
            stats->migrations * 100 / stats->balance_attempts;

        if (success_rate > 50) {
            /* Frequently finding work to migrate - stay aggressive */
            interval = interval * 3 / 4;
        } else if (success_rate < 10) {
            /* Rarely finding imbalance - back off */
            interval = interval * 3 / 2;
        }
    }

    /* Factor 3: Balance cost vs benefit */
    if (stats->balance_cost_sum > stats->balance_time_sum) {
        /* Spending more on balance than we're gaining - back off */
        interval = interval * 2;
    }

    /* Clamp to reasonable range */
    interval = clamp(interval, msecs_to_jiffies(1), msecs_to_jiffies(100));

    /* Periodic reset of stats for freshness */
    if (stats->balance_attempts > 1000) {
        memset(stats, 0, sizeof(*stats));
    }

    return interval;
}
```

NOHZ Operation and Balance Deferral
Modern kernels support tickless ('nohz') operation: idle CPUs stop their timer tick, and in 'nohz full' mode even a busy CPU running a single task can shut the tick off. A CPU without a tick never runs periodic balance on its own, which creates special considerations for load balancing:
```c
/* NOHZ-aware load balancing */

/* In NOHZ mode, idle CPUs don't receive timer interrupts.
 * We need other mechanisms to ensure balance still occurs. */

/* Designate one CPU as the "nohz balancer" */
static int nohz_balance_cpu = -1;

/* Called when a CPU goes into nohz-idle */
void nohz_balance_enter_idle(int cpu)
{
    /* If no balancer designated, or we're the first idle, take the role */
    if (nohz_balance_cpu < 0) {
        nohz_balance_cpu = cpu;
    }

    /* Otherwise, request wake when balance is needed */
    cpumask_set_cpu(cpu, nohz_idle_cpus);
}

/* Called when a CPU exits nohz-idle */
void nohz_balance_exit_idle(int cpu)
{
    cpumask_clear_cpu(cpu, nohz_idle_cpus);

    /* If we were the designated balancer, transfer role */
    if (nohz_balance_cpu == cpu) {
        nohz_balance_cpu = cpumask_first(nohz_idle_cpus);
    }
}

/* The designated balancer runs balance on behalf of idle CPUs */
void nohz_idle_balance(struct run_queue *this_rq)
{
    int this_cpu = rq_cpu(this_rq);

    if (this_cpu != nohz_balance_cpu) {
        return;   /* Not our job */
    }

    /* Balance for all idle CPUs */
    int cpu;
    for_each_cpu(cpu, nohz_idle_cpus) {
        if (cpu == this_cpu)
            continue;

        struct run_queue *rq = cpu_rq(cpu);

        /* Check if remote CPU needs balancing */
        if (need_nohz_balance(rq)) {
            /* Option 1: Balance locally on their behalf */
            /* Option 2: Send IPI to wake them for self-balance */
            if (cpumask_weight(nohz_idle_cpus) > nr_cpus / 2) {
                /* Mostly idle - do it ourselves */
                trigger_nohz_balance(rq);
            } else {
                /* Mostly busy - wake the idle CPU */
                kick_idle_cpu(cpu);
            }
        }
    }
}
```

NOHZ balancing is a significant source of scheduler complexity. When most CPUs are idle and tickless, the designated balancer becomes a hot spot. Careful throttling and distribution of balancer responsibility is essential to avoid creating new bottlenecks.
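One way to keep the designated balancer from becoming a hot spot is to rate-limit its passes and rotate the role among idle CPUs. The sketch below stays in the same pseudocode style; `next_nohz_balance` and `rotate_nohz_balancer()` are illustrative names, not kernel interfaces.

```c
/* Sketch: throttle NOHZ balance passes and rotate the balancer role so a
 * single idle CPU doesn't absorb all remote balance work. Illustrative only. */

static unsigned long next_nohz_balance;   /* Jiffies of next allowed pass */

void rotate_nohz_balancer(void)
{
    /* Hand the role to the next idle CPU, wrapping around the mask */
    int next = cpumask_next(nohz_balance_cpu, nohz_idle_cpus);

    if (next >= nr_cpus)
        next = cpumask_first(nohz_idle_cpus);
    nohz_balance_cpu = next;
}

void nohz_idle_balance_throttled(struct run_queue *this_rq)
{
    /* At most one full NOHZ pass every few milliseconds, however often
     * we are kicked by wakeups or idle entries. */
    if (time_before(jiffies, next_nohz_balance))
        return;
    next_nohz_balance = jiffies + msecs_to_jiffies(4);

    nohz_idle_balance(this_rq);

    rotate_nohz_balancer();
}
```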
Even with adaptive frequency, balance operations need throttling to prevent runaway overhead during system stress or pathological workloads.
Throttling Mechanisms
Several throttling strategies work together:
```c
/* Balance throttling and rate limiting */

/* Throttle states */
#define THROTTLE_NONE   0
#define THROTTLE_LIGHT  1
#define THROTTLE_HEAVY  2

/* Per-domain throttle tracking */
struct domain_throttle {
    int level;                      /* Current throttle level */
    unsigned long throttle_until;   /* Jiffies until throttle expires */
    unsigned long recent_cost;      /* Recent balance cost accumulation */
    unsigned long cost_threshold;   /* Cost threshold for escalation */
};

/* Check if balance should be throttled */
int check_balance_throttle(struct sched_domain *sd)
{
    struct domain_throttle *throttle = &sd->throttle;

    if (time_before(jiffies, throttle->throttle_until)) {
        return throttle->level;   /* Still throttled */
    }

    return THROTTLE_NONE;   /* Not currently throttled */
}

/* Apply throttling based on observed cost */
void apply_balance_throttle(struct sched_domain *sd, unsigned long cost)
{
    struct domain_throttle *throttle = &sd->throttle;

    throttle->recent_cost += cost;

    /* Check if cost exceeds threshold */
    if (throttle->recent_cost > throttle->cost_threshold) {
        /* Escalate throttle level */
        if (throttle->level < THROTTLE_HEAVY) {
            throttle->level++;
        }

        /* Calculate throttle duration based on level */
        unsigned long duration;
        switch (throttle->level) {
        case THROTTLE_LIGHT:
            duration = sd->balance_interval;   /* Skip one interval */
            break;
        case THROTTLE_HEAVY:
            duration = sd->max_interval;       /* Skip to max interval */
            break;
        default:
            duration = 0;
        }

        throttle->throttle_until = jiffies + duration;
        throttle->recent_cost = 0;   /* Reset cost tracker */
    }

    /* Natural decay of cost accumulation */
    if (throttle->recent_cost > 0 &&
        time_after(jiffies, sd->last_balance + HZ)) {
        throttle->recent_cost = throttle->recent_cost * 3 / 4;

        /* Also decay throttle level over time */
        if (throttle->level > THROTTLE_NONE) {
            throttle->level--;
        }
    }
}

/* Rate limit balance invocations per domain */
#define min_balance_interval(sd) \
    ((sd)->flags & SD_NUMA ? msecs_to_jiffies(16) : msecs_to_jiffies(2))

bool should_balance_domain(struct sched_domain *sd)
{
    /* Minimum interval enforcement */
    if (time_before(jiffies, sd->last_balance + min_balance_interval(sd))) {
        return false;
    }

    /* Throttle check */
    if (check_balance_throttle(sd) == THROTTLE_HEAVY) {
        return false;
    }

    /* Scheduled interval check */
    if (time_before(jiffies, sd->next_balance) &&
        check_balance_throttle(sd) != THROTTLE_NONE) {
        return false;   /* Throttled and not due */
    }

    return true;
}
```

Imbalance Threshold Scaling
Another throttling approach: as balance frequency increases, demand larger imbalances before acting. This prevents micro-migrations from dominating:
```c
/* Dynamic imbalance thresholds */

/* Minimum imbalance worth migrating - scales with frequency */
unsigned long calculate_imbalance_threshold(struct sched_domain *sd)
{
    /* Base threshold: one task's load worth */
    unsigned long base = SCHED_LOAD_SCALE;

    /* Scale up threshold if we've been balancing frequently */
    unsigned long recent_balances = sd->recent_balance_count;
    unsigned long time_window = jiffies - sd->balance_window_start;

    if (time_window > 0) {
        unsigned long balance_rate = recent_balances * HZ / time_window;

        if (balance_rate > 10) {   /* >10 balances/sec */
            /* Balancing too often - raise threshold */
            base = base * balance_rate / 10;
        }
    }

    /* Cap threshold to prevent over-accumulation */
    base = min(base, SCHED_LOAD_SCALE * 4);

    /* NUMA domains always have a higher threshold */
    if (sd->flags & SD_NUMA) {
        base = base * 2;
    }

    return base;
}

/* Only migrate if imbalance exceeds threshold */
bool imbalance_worth_migrating(struct sched_domain *sd, unsigned long imbalance)
{
    unsigned long threshold = calculate_imbalance_threshold(sd);

    if (imbalance < threshold) {
        sd->balance_rejects++;
        return false;
    }

    return true;
}
```

Effective throttling uses multiple mechanisms: rate limiting (minimum interval), cost tracking (throttle escalation), and threshold scaling (demand larger imbalances). Each catches different pathological patterns. Together they ensure scheduler overhead stays bounded regardless of workload behavior.
Production schedulers expose tunable parameters for balance frequency. Understanding these allows workload-specific optimization.
Linux Scheduler Tunables
Linux's CFS exposes several balance-related parameters via /proc/sys/kernel/ (newer kernels relocate most of them under /sys/kernel/debug/sched/):
| Parameter | Default | Effect | Workload Guidance |
|---|---|---|---|
| sched_migration_cost_ns | 500000 (500μs) | Minimum time between migrations for a task | Increase for cache-sensitive workloads |
| sched_nr_migrate | 32 | Max tasks to migrate per balance | Decrease for latency-sensitive |
| sched_min_granularity_ns | 3000000 | Indirect: affects balance opportunity | Affects how often balance can trigger |
| numa_balancing_scan_delay_ms | 1000 | NUMA-specific: initial scan delay | Increase for memory-bound workloads |
```bash
#!/bin/bash
# Linux scheduler balance tuning examples

# View current balance settings
cat /proc/sys/kernel/sched_migration_cost_ns
cat /proc/sys/kernel/sched_nr_migrate

# For latency-sensitive workloads: reduce migration aggressiveness
echo 1000000 > /proc/sys/kernel/sched_migration_cost_ns   # 1ms
echo 8 > /proc/sys/kernel/sched_nr_migrate                # Fewer per batch

# For throughput workloads: increase migration willingness
echo 100000 > /proc/sys/kernel/sched_migration_cost_ns    # 100μs
echo 64 > /proc/sys/kernel/sched_nr_migrate               # More per batch

# For NUMA systems: tune NUMA-specific balancing
echo 0 > /proc/sys/kernel/numa_balancing                  # Disable auto NUMA
# Or fine-tune:
echo 2000 > /proc/sys/kernel/numa_balancing_scan_delay_ms
echo 10 > /proc/sys/kernel/numa_balancing_scan_period_min_ms
echo 60000 > /proc/sys/kernel/numa_balancing_scan_period_max_ms

# Monitor balance statistics
cat /proc/schedstat | grep -E "domain|balance"

# Per-domain balance intervals (debugfs must be mounted)
for cpu in /sys/kernel/debug/sched/domains/cpu*/domain*; do
    echo "=== $cpu ==="
    cat "$cpu/min_interval"
    cat "$cpu/max_interval"
done
```

Domain-Level Tuning
For fine-grained control, scheduling domain parameters can be adjusted via sysfs or boot-time configuration:
```c
/* Programmatic domain tuning (kernel module or patch) */

/* Adjust domain parameters during initialization */
void tune_sched_domain(struct sched_domain *sd, int workload_type)
{
    switch (workload_type) {
    case WORKLOAD_LATENCY_SENSITIVE:
        /* Minimize migration to preserve cache locality */
        sd->balance_interval *= 2;
        sd->max_interval *= 2;
        sd->min_interval = sd->min_interval * 3 / 2;
        sd->imbalance_pct = 133;   /* Require 33% imbalance */
        break;

    case WORKLOAD_THROUGHPUT:
        /* Aggressive balance for maximum utilization */
        sd->balance_interval /= 2;
        sd->min_interval = msecs_to_jiffies(1);
        sd->imbalance_pct = 110;   /* 10% imbalance triggers */
        break;

    case WORKLOAD_MIXED:
        /* Default balanced settings */
        break;

    case WORKLOAD_NUMA_BOUND:
        /* Very conservative cross-NUMA balance */
        if (sd->flags & SD_NUMA) {
            sd->balance_interval *= 4;
            sd->max_interval *= 4;
            sd->imbalance_pct = 150;   /* 50% imbalance required */
        }
        break;
    }
}

/* User-space can request workload hints via cgroups or syscalls */
int set_workload_hint(pid_t pid, int hint)
{
    struct task_struct *p = find_task_by_vpid(pid);
    if (!p)
        return -ESRCH;

    p->workload_hint = hint;

    /* Hint affects balance decisions for this task */
    if (hint == HINT_LATENCY_SENSITIVE) {
        p->migration_disabled++;   /* Discourage migration */
    }

    return 0;
}
```

Always measure current behavior before changing scheduler parameters. Use perf sched, schedstat, and application-level metrics to establish a baseline. Tune one parameter at a time, measuring impact. Scheduler tuning is highly workload-dependent: what helps one application may hurt another.
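A simple before/after capture like the following can show whether a tuning change actually moved balance behavior. This is an illustrative sequence that assumes perf is installed and debugfs is mounted; the output file names are arbitrary.

```bash
#!/bin/bash
# Illustrative baseline capture around a single tunable change.

# Snapshot scheduler stats (includes per-domain load-balance counters)
cat /proc/schedstat > schedstat.before

# Record scheduling events for 10 seconds of representative load,
# then summarize per-task scheduling latency
perf sched record -- sleep 10
perf sched latency > sched_latency.before

# ... change ONE tunable, rerun the same workload, capture *.after files ...

# Compare balance counters and latency distributions
# diff schedstat.before schedstat.after
# diff sched_latency.before sched_latency.after
```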
Balancing frequency sits at the heart of scheduler design—too frequent wastes cycles, too infrequent wastes parallelism. The key insights: weigh balance overhead against the idle time it prevents, combine periodic timers with event-driven triggers, scale intervals with migration cost across the topology hierarchy, and rely on adaptation and throttling to keep overhead bounded as workloads change.
Connecting to the Broader Picture
With push, pull, work stealing, and frequency understood, we have a complete picture of mechanisms for load balancing. The final page of this module examines metrics—how we quantify load, measure balance quality, and evaluate scheduler effectiveness.
You now understand balancing frequency—the timing dimension of load balancing. You can reason about overhead-benefit tradeoffs, recognize when adaptive frequency is beneficial, and tune scheduler parameters for specific workload requirements. Next: load balancing metrics.