Context switching is not free. Every time the kernel saves one process's state and loads another's, it consumes CPU cycles that could have been used for actual work. This overhead is the cost we pay for the illusion of simultaneous execution.
But how expensive is a context switch? A naive answer might focus only on the time to push and pop registers—perhaps a few hundred cycles. The true answer is far more complex and far more costly. The indirect effects of context switching—cache pollution, TLB misses, pipeline disruption—often dominate the direct costs and can persist for thousands of instructions after the switch completes.
Understanding context switch overhead is essential for systems programmers, performance engineers, and anyone designing software that must run efficiently on shared hardware. This page provides a comprehensive analysis of every cost component and how they interact.
By the end of this page, you will understand the complete taxonomy of context switch costs—direct and indirect, immediate and deferred. You'll learn how to measure context switch overhead, interpret the results, and understand why even 'fast' context switches can devastate application performance.
Direct costs are the CPU cycles consumed by the context switch code itself—the time spent in kernel code saving and restoring state. These are measurable and relatively predictable.
Components of Direct Cost:
| Operation | Cycles (Approx.) | Bytes Moved | Notes |
|---|---|---|---|
| Mode transition (syscall entry) | 50-100 | 64 | User→Kernel via SYSCALL instruction |
| Push all GPRs to stack | 20-40 | 128 | 16 registers × 8 bytes each |
| Scheduler logic (pick next) | 100-500 | — | Varies greatly by scheduler |
| Stack pointer swap | 5-10 | 16 | Save/load RSP to/from thread_struct |
| Pop all GPRs from stack | 20-40 | 128 | 16 registers × 8 bytes each |
| CR3 load (page table switch) | 100-500 | 8 | Major cost, flushes TLB |
| FPU state (XSAVE/XRSTOR) | 200-800 | 512-2048 | Depends on state size |
| TLS segment base update | 20-50 | 16 | WRMSRL for FS/GS base |
| Mode transition (iretq/sysret) | 30-50 | 40 | Kernel→User return |
| TOTAL DIRECT COST | 500-2000+ | ~1KB | On modern x86-64 CPUs |
On a 3 GHz CPU, 1000 cycles ≈ 0.33 microseconds (μs). So the direct cost of a context switch is roughly 0.2-1 μs. This seems tiny—but modern CPUs can execute 3000+ instructions per microsecond. A 1 μs context switch 'wastes' 3000 potential instructions of work.
```c
/**
 * Measuring direct context switch cost using RDTSC
 *
 * This measures the minimum time for a round-trip context switch
 * between two cooperating processes using pipes for synchronization.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>
#include <stdint.h>

/* Read CPU timestamp counter */
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

#define WARMUP     1000
#define ITERATIONS 100000

int main() {
    int pipe_parent_to_child[2];
    int pipe_child_to_parent[2];
    char buf = 'x';

    pipe(pipe_parent_to_child);
    pipe(pipe_child_to_parent);

    if (fork() == 0) {
        /* Child: bounce messages back (warmup rounds included) */
        for (int i = 0; i < WARMUP + ITERATIONS; i++) {
            read(pipe_parent_to_child[0], &buf, 1);
            write(pipe_child_to_parent[1], &buf, 1);
        }
        exit(0);
    }

    /* Parent: measure round-trip time */
    uint64_t total_cycles = 0;

    /* Warmup */
    for (int i = 0; i < WARMUP; i++) {
        write(pipe_parent_to_child[1], &buf, 1);
        read(pipe_child_to_parent[0], &buf, 1);
    }

    /* Measurement */
    for (int i = 0; i < ITERATIONS; i++) {
        uint64_t start = rdtsc();
        write(pipe_parent_to_child[1], &buf, 1);
        read(pipe_child_to_parent[0], &buf, 1);
        uint64_t end = rdtsc();
        total_cycles += (end - start);
    }

    wait(NULL);

    /* Each round-trip = 2 context switches + pipe overhead */
    double cycles_per_roundtrip = (double)total_cycles / ITERATIONS;

    /*
     * Estimated context switch cost:
     * Subtract pipe overhead (~500-1000 cycles for small writes),
     * then divide by 2 for a single switch.
     */
    double estimated_switch = (cycles_per_roundtrip - 1000) / 2;

    printf("Cycles per round-trip: %.0f\n", cycles_per_roundtrip);
    printf("Estimated cycles per context switch: %.0f\n", estimated_switch);
    printf("At 3 GHz, approx %.3f microseconds per switch\n",
           estimated_switch / 3000.0);

    return 0;
}

/*
 * Typical output on modern Linux (3 GHz CPU):
 *   Cycles per round-trip: 3500
 *   Estimated cycles per context switch: 1250
 *   At 3 GHz, approx 0.417 microseconds per switch
 *
 * But this is just the DIRECT cost...
 */
```

The most significant indirect cost of context switching is cache pollution. Modern CPUs rely heavily on caches to achieve high performance—L1, L2, and L3 caches store recently accessed data and instructions. When a context switch occurs, the new process's working set gradually evicts the old process's cache entries. When the old process resumes, it suffers a cascade of cache misses.
The Cache Miss Penalty:
A process returning from a context switch may experience thousands of cache misses as it reloads its working set. Each miss costs 100+ cycles, dwarfing the direct switch cost.
```c
/**
 * Demonstrating cache pollution from context switches
 *
 * This program measures array access time before and after
 * a context switch to a memory-intensive process.
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/wait.h>

#define ARRAY_SIZE (1 << 20)   /* 1M elements = 8MB (larger than L2) */
#define ITERATIONS 100

volatile uint64_t *array;

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Access array sequentially - should be cache-friendly */
uint64_t measure_array_access(void) {
    uint64_t sum = 0;
    uint64_t start = rdtsc();
    for (size_t i = 0; i < ARRAY_SIZE; i++) {
        sum += array[i];
    }
    uint64_t end = rdtsc();
    return end - start;
}

/* Pollute the cache with different data */
void pollute_cache(size_t size) {
    volatile uint64_t *polluter = malloc(size * sizeof(uint64_t));
    for (size_t i = 0; i < size; i++) {
        polluter[i] = i;   /* Each write allocates a cache line, evicting other data */
    }
    free((void *)polluter);
}

int main() {
    array = malloc(ARRAY_SIZE * sizeof(uint64_t));
    for (size_t i = 0; i < ARRAY_SIZE; i++) {
        array[i] = i;
    }

    /* Warm up cache with our array */
    measure_array_access();
    measure_array_access();

    /* Measure with warm cache */
    uint64_t warm_cache_cycles = 0;
    for (int i = 0; i < ITERATIONS; i++) {
        warm_cache_cycles += measure_array_access();
        measure_array_access();   /* Keep cache warm */
    }
    warm_cache_cycles /= ITERATIONS;

    /* Now measure after cache pollution (simulating context switch) */
    uint64_t cold_cache_cycles = 0;
    for (int i = 0; i < ITERATIONS; i++) {
        pollute_cache(ARRAY_SIZE);                   /* Evict our data from cache */
        cold_cache_cycles += measure_array_access(); /* First access after eviction */
    }
    cold_cache_cycles /= ITERATIONS;

    printf("Warm cache access: %lu cycles\n", warm_cache_cycles);
    printf("Cold cache access: %lu cycles\n", cold_cache_cycles);
    printf("Cache pollution penalty: %lu cycles (%.1fx slower)\n",
           cold_cache_cycles - warm_cache_cycles,
           (double)cold_cache_cycles / warm_cache_cycles);
    printf("\nAt 3 GHz: %.2f ms warm, %.2f ms cold\n",
           warm_cache_cycles / 3e6, cold_cache_cycles / 3e6);

    free((void *)array);
    return 0;
}

/*
 * Typical output:
 *   Warm cache access: 1200000 cycles
 *   Cold cache access: 18000000 cycles
 *   Cache pollution penalty: 16800000 cycles (15.0x slower)
 *
 *   At 3 GHz: 0.40 ms warm, 6.00 ms cold
 *
 * The "direct" context switch cost is ~1 microsecond.
 * The INDIRECT cache penalty can be milliseconds!
 */
```

While direct context switch cost is ~1 μs, the cache reload penalty for a process with a large working set can approach milliseconds. This is why high-frequency trading systems, real-time applications, and performance-critical servers go to extreme lengths to minimize context switches.
The Translation Lookaside Buffer (TLB) is a specialized cache that stores virtual-to-physical address translations. On x86-64, every memory access requires address translation, so the TLB sits on the critical path of every load, store, and instruction fetch.
When switching between processes (not threads), the kernel typically loads a new CR3, which flushes the entire TLB. The new process starts with an empty TLB and suffers a miss on every unique page accessed.
The Page Table Walk Cost:
On x86-64 with 4-level paging, a TLB miss requires up to 4 sequential memory accesses: one each for the PML4, PDPT, page directory, and page table entries.
Each step may itself cause a cache miss, compounding the penalty.
| TLB Level | Entries | Miss Penalty (cycles) | Impact |
|---|---|---|---|
| L1 Data TLB (4KB pages) | 64-128 | ~10 (L2 TLB hit) | Fastest, first checked |
| L1 Instruction TLB | 64-128 | ~10 (L2 TLB hit) | For code pages |
| L1 Data TLB (2MB pages) | 32-64 | ~10 (L2 TLB hit) | Huge pages benefit |
| L2 Unified TLB | 1024-2048 | ~50-100 (page walk) | Shared, larger |
| Full page walk (L2 miss) | N/A | 100-500+ | 4-5 memory accesses |
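To put these penalties in perspective, here is a rough back-of-envelope model of the TLB refill cost after a full flush; the working-set size and per-walk cost below are illustrative assumptions, not measurements:

```c
/* Rough model of TLB refill cost after a full flush.
 * All numbers below are illustrative assumptions for a 3 GHz CPU. */
#include <stdio.h>

int main(void) {
    double working_set = 8.0 * 1024 * 1024;        /* assumed 8 MB working set */
    double page_size   = 4096;                     /* 4 KB pages */
    double pages       = working_set / page_size;  /* unique pages touched */
    double walk_cycles = 150;                      /* assumed average page-walk cost */

    double refill = pages * walk_cycles;
    printf("Pages to re-translate: %.0f\n", pages);
    printf("Estimated refill cost: %.0f cycles (~%.0f us at 3 GHz)\n",
           refill, refill / 3000.0);
    return 0;
}

/* With these assumptions: 2048 pages × 150 cycles ≈ 307,000 cycles,
 * roughly 100 μs, orders of magnitude above the direct switch cost. */
```

Huge (2 MB) pages cut the number of translations by a factor of 512, which is one reason the dedicated 2 MB TLB entries in the table above matter. The PCID mechanism described next attacks the same problem from the other side, by letting translations survive the switch.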
```c
/**
 * PCID (Process Context IDentifier) - Mitigating TLB Flush Cost
 *
 * PCIDs allow keeping TLB entries from multiple address spaces,
 * tagged by a 12-bit identifier. This avoids full TLB flushes
 * on context switch.
 */

/* PCID is stored in lower 12 bits of CR3 */
#define CR3_PCID_MASK 0x0FFF

/*
 * Bit 63 of CR3 controls TLB flush behavior:
 *   - 0: Flush all entries with the new PCID
 *   - 1: Don't flush (keep existing entries for this PCID)
 */
#define CR3_NOFLUSH (1UL << 63)

/**
 * switch_mm_pcid() - Switch address space with PCID optimization
 *
 * Instead of flushing all TLB entries, we tag entries with PCID.
 * When switching back to a process, its TLB entries may still be valid.
 */
void switch_mm_pcid(struct mm_struct *prev, struct mm_struct *next)
{
    unsigned long new_cr3;

    if (prev == next) {
        return;   /* Same address space - no CR3 change needed */
    }

    /*
     * Build CR3 value:
     *   - Physical address of PML4 table
     *   - PCID from mm_struct context
     *   - NOFLUSH bit if this PCID was recently used
     */
    new_cr3 = __pa(next->pgd);
    new_cr3 |= next->context.pcid;

    /*
     * Check if we've used this PCID recently on this CPU.
     * If so, TLB entries might still be valid - don't flush.
     */
    if (this_cpu_read(tlb_gen[next->context.pcid]) == next->context.tlb_gen) {
        new_cr3 |= CR3_NOFLUSH;   /* Keep TLB entries */
    } else {
        /* PCID was used by different mm - must flush */
        this_cpu_write(tlb_gen[next->context.pcid], next->context.tlb_gen);
    }

    native_write_cr3(new_cr3);
}

/**
 * Performance impact of PCID:
 *
 * Without PCID (before ~2013, or disabled):
 *   - Every process switch flushes entire TLB
 *   - New process gets 100% TLB misses until warmed up
 *
 * With PCID:
 *   - TLB entries tagged with PCID survive switch
 *   - Switching back to a process may find hot TLB entries
 *   - 12-bit PCID = up to 4096 concurrent address spaces cached
 *   - Dramatic improvement for frequent process switches
 *
 * Measured improvement: 20-40% reduction in context switch overhead
 * on workloads with frequent process switches.
 */
```

Threads within the same process share the same address space (mm_struct). When switching between threads of the same process, no CR3 change is needed, and the TLB remains valid. This is a major performance advantage of multi-threading over multi-processing.
Modern CPUs are deeply pipelined and heavily speculative. A context switch disrupts these mechanisms, causing stalls and requiring restart from a cold state.
Pipeline Effects:
Pipeline Flush: Mode transitions (IRET, SYSRET) flush the pipeline to prevent speculated user-mode instructions from completing in kernel mode. Penalty: 20-50 cycles.
Branch Predictor Reset: Branch history is process-specific. After a switch, the predictor has no history for the new process. Until it learns, every branch risks misprediction (15-20 cycle penalty each).
Return Stack Buffer (RSB): Predicts return addresses based on past calls. After a switch, RSB is stale, causing return mispredictions.
Indirect Branch Predictor: Predicts targets of indirect jumps (virtual function calls). Process-specific patterns are lost after a switch.
Hardware Prefetchers: Prefetchers learn recent memory access patterns (strides, streaming reads) and fetch data ahead of use. The new process's access patterns are entirely different, so prefetching is ineffective until the prefetchers retrain.
| Predictor | Entries (typical) | Recovery Time | Impact After Switch |
|---|---|---|---|
| Branch History Table (BHT) | 2K-8K entries | 100s of branches | Many mispredictions until retrained |
| Branch Target Buffer (BTB) | 2K-8K entries | 100s of branches | Indirect branch targets unknown |
| Return Stack Buffer (RSB) | 16-32 entries | First returns | Return targets mispredicted |
| Loop Predictor | ~64 loops | Immediate | Loop bounds unknown |
| Prefetcher State | N/A | 1000s of accesses | Must re-learn patterns |
```text
BRANCH MISPREDICTION COST ANALYSIS
==================================

Scenario: Process with tight loop containing conditional branch

Before Context Switch (warm predictor):
  - Branch misprediction rate: 2%
  - Loop iterations: 10,000
  - Mispredictions: 200
  - Penalty: 200 × 15 cycles = 3,000 cycles

After Context Switch (cold predictor):
  - Branch misprediction rate: 25-50% initially
  - First 1,000 iterations: avg 35% mispredict
  - Mispredictions in warmup: 350
  - Penalty: 350 × 15 cycles = 5,250 extra cycles

Total warmup penalty: ~5,000-10,000 cycles
(in addition to direct context switch cost)

--------------------------------------------

REAL-WORLD EXAMPLE: Database Query Engine

A database executing B-tree traversals after a context switch:
  - B-tree has many conditional branches (key comparisons)
  - Cold predictor: 30% misprediction rate
  - 50 comparisons/query, 100 queries to warm up predictor
  - Warmup penalty: 50 × 100 × 0.3 × 15 = 22,500 cycles

This is ~7 microseconds of CPU time "wasted" on mispredictions
during the warmup period after a context switch.
```

Post-Spectre, kernels flush or invalidate branch prediction state on context switch to prevent cross-process speculation attacks. This INCREASES context switch overhead but is necessary for security. The IBPB (Indirect Branch Prediction Barrier) operation adds 100+ cycles but prevents one process from influencing another's branch predictions.
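The ~15-cycle per-branch penalty assumed above can be observed from user space by timing the same loop over predictable versus unpredictable data. The sketch below is a hypothetical illustration rather than a kernel measurement; note that with aggressive optimization the compiler may emit a conditional move instead of a branch, which hides the effect.

```c
/* Time a data-dependent branch over predictable vs. unpredictable input.
 * The per-element difference approximates the misprediction penalty. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define N (1 << 22)   /* 4M elements */

static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static uint64_t count_below(const unsigned char *data, int threshold) {
    uint64_t count = 0;
    for (size_t i = 0; i < N; i++) {
        if (data[i] < threshold)   /* the branch under test */
            count++;
    }
    return count;
}

int main(void) {
    unsigned char *easy = malloc(N), *hard = malloc(N);
    for (size_t i = 0; i < N; i++) {
        easy[i] = 0;              /* always below threshold: trivially predictable */
        hard[i] = rand() & 0xFF;  /* ~50% below threshold: hard to predict */
    }

    volatile uint64_t sink;
    uint64_t t0 = rdtsc();
    sink = count_below(easy, 128);
    uint64_t t1 = rdtsc();
    sink = count_below(hard, 128);
    uint64_t t2 = rdtsc();
    (void)sink;

    printf("Predictable branches:   %llu cycles\n", (unsigned long long)(t1 - t0));
    printf("Unpredictable branches: %llu cycles\n", (unsigned long long)(t2 - t1));
    printf("Extra cycles per branch: ~%.1f\n",
           (double)((t2 - t1) - (t1 - t0)) / N);

    free(easy);
    free(hard);
    return 0;
}
```

A freshly switched-in process pays this same retraining cost, except that it is spread across every hot branch in its working set rather than a single loop.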
Not all context switches are created equal. Switching between threads of the same process is significantly cheaper than switching between different processes.
Why Thread Switches Are Cheaper:
| Cost Component | Process Switch | Thread Switch | Savings |
|---|---|---|---|
| Register save/restore | ~200 cycles | ~200 cycles | None |
| CR3 load | ~200 cycles | 0 cycles | 200 cycles |
| TLB flush penalty | 5K-50K cycles | 0 cycles | 5K-50K cycles |
| Cache pollution | Variable (high) | Variable (lower) | Depends on sharing |
| Branch predictor | Polluted | Partially shared | Some benefit |
| FPU state | ~300 cycles | ~300 cycles | Usually same |
| Typical Total | 5K-100K cycles | 500-2K cycles | 10-50x cheaper |
```c
/**
 * Comparing process switch vs thread switch overhead
 *
 * Creates either thread pairs or process pairs, measures
 * round-trip time through pipe communication.
 *
 * Build with -pthread. Pin everything to one CPU (e.g. run under
 * taskset -c 0) so that each round-trip actually forces a switch.
 */
#include <stdio.h>
#include <stdint.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/wait.h>

#define ITERATIONS 100000

/* Read CPU timestamp counter */
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Measure round-trip through pipes */
uint64_t measure_switches(int read_fd, int write_fd) {
    char buf = 'x';
    uint64_t start = rdtsc();
    for (int i = 0; i < ITERATIONS; i++) {
        write(write_fd, &buf, 1);
        read(read_fd, &buf, 1);
    }
    return rdtsc() - start;
}

/* Thread function for thread-switch test */
void *thread_bouncer(void *arg) {
    int *fds = (int *)arg;
    char buf;
    for (int i = 0; i < ITERATIONS; i++) {
        read(fds[0], &buf, 1);
        write(fds[1], &buf, 1);
    }
    return NULL;
}

int main() {
    /*
     * Test 1: Process context switches
     */
    int p2c[2], c2p[2];
    pipe(p2c);
    pipe(c2p);

    if (fork() == 0) {
        char buf;
        for (int i = 0; i < ITERATIONS; i++) {
            read(p2c[0], &buf, 1);
            write(c2p[1], &buf, 1);
        }
        _exit(0);
    }

    uint64_t process_cycles = measure_switches(c2p[0], p2c[1]);
    wait(NULL);

    /*
     * Test 2: Thread context switches
     */
    int t2t[2], t2m[2];
    pipe(t2t);
    pipe(t2m);

    pthread_t thread;
    int fds[2] = {t2t[0], t2m[1]};
    pthread_create(&thread, NULL, thread_bouncer, fds);

    uint64_t thread_cycles = measure_switches(t2m[0], t2t[1]);
    pthread_join(thread, NULL);

    /*
     * Results
     */
    double process_per_switch = (double)process_cycles / (ITERATIONS * 2);
    double thread_per_switch = (double)thread_cycles / (ITERATIONS * 2);

    printf("Process switch: %.0f cycles/switch\n", process_per_switch);
    printf("Thread switch:  %.0f cycles/switch\n", thread_per_switch);
    printf("Ratio: %.1fx\n", process_per_switch / thread_per_switch);

    return 0;
}

/*
 * Typical results:
 *   Process switch: 3500 cycles/switch
 *   Thread switch:  1200 cycles/switch
 *   Ratio: 2.9x
 *
 * Note: This understates the difference because:
 *   1. Both tests have pipe overhead
 *   2. Doesn't capture full cache/TLB warmup costs
 *
 * In heavy workloads with large working sets, the
 * ratio can be 10-50x or more.
 */
```

Measuring context switch overhead accurately is challenging because the direct cost, the indirect cache and TLB warmup cost, and the overhead of the measurement mechanism itself (pipes, timers) are intertwined, and the results vary with working set size and where the tasks are scheduled.
Methods for Measurement:
```bash
#!/bin/bash
# Measuring context switch overhead with various tools

# 1. Basic count of context switches during workload
echo "=== Context Switch Count ==="
perf stat -e context-switches,cpu-migrations ./my_workload

# 2. Detailed timing of context switches
echo -e "\n=== Context Switch Timing ==="
perf record -e sched:sched_switch -c 1 ./my_workload
perf script | head -20

# 3. Per-process scheduling statistics
echo -e "\n=== Process Scheduling Stats ==="
# Run workload in background
./my_workload &
PID=$!
sleep 1
cat /proc/$PID/sched
kill $PID 2>/dev/null

# 4. System-wide context switch rate
echo -e "\n=== System-Wide Context Switch Rate ==="
vmstat 1 5

# 5. Using lmbench for dedicated measurement
echo -e "\n=== lmbench Context Switch Latency ==="
# lat_ctx -s 0 2       # 2 processes, 0KB working set
# lat_ctx -s 64 2      # 2 processes, 64KB working set
# lat_ctx -s 1024 2    # 2 processes, 1MB working set

# Example output annotation:
# lat_ctx output format: "size=XXK ovr=Y.YY"
#   - size: working set size per process
#   - ovr:  microseconds per context switch
#
# Typical results:
#   size=0K    ovr=1.5   (minimal working set)
#   size=64K   ovr=3.2   (L1 cache exceeded)
#   size=256K  ovr=12.4  (L2 cache exceeded)
#   size=1024K ovr=48.6  (L3 cache exceeded)
#
# Notice how overhead grows dramatically with working set size!
```

| Working Set | Cache Level | Latency (μs) | Cycles (3GHz) |
|---|---|---|---|
| 0 KB | L1 (32KB) | 1.2-2.0 | 3,600-6,000 |
| 16 KB | L1 (32KB) | 1.5-2.5 | 4,500-7,500 |
| 64 KB | L2 (256KB) | 3-5 | 9,000-15,000 |
| 256 KB | L2/L3 boundary | 8-15 | 24,000-45,000 |
| 1 MB | L3 (8MB) | 30-50 | 90,000-150,000 |
| 8 MB | L3/RAM boundary | 80-150 | 240,000-450,000 |
| 32 MB | Main memory | 200-500 | 600,000-1,500,000 |
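Linux also exposes per-process counters that a program can read about itself. The sketch below prints a process's own voluntary and involuntary switch counts from /proc/self/status; the field names are those reported by the Linux kernel, and the parsing is a minimal illustration:

```c
/* Print this process's context-switch counters from /proc/self/status.
 * voluntary_ctxt_switches:    the task blocked (I/O, sleep, lock wait)
 * nonvoluntary_ctxt_switches: the task was preempted by the scheduler */
#include <stdio.h>
#include <string.h>

int main(void) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }

    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "voluntary_ctxt_switches", 23) == 0 ||
            strncmp(line, "nonvoluntary_ctxt_switches", 26) == 0) {
            fputs(line, stdout);
        }
    }

    fclose(f);
    return 0;
}
```

A rising involuntary count under load is often the first hint that a service is being preempted more than its latency budget can absorb.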
Context switch overhead compounds at the system level. A server handling thousands of requests per second may perform thousands of context switches per second. Each switch adds latency and reduces throughput.
Worked Example: Web Server Performance:
Consider a web server whose worker cores each handle 10,000 requests/second:
```text
WEB SERVER CONTEXT SWITCH OVERHEAD ANALYSIS
===========================================

Scenario: Apache-style process-per-request model

Request Rate:      10,000 requests/second per core
Processes:         100 worker processes
Context Switches:  ~2 per request (in + out) = 20,000 switches/sec per core

Direct Cost:
  - 2,000 cycles/switch × 20,000 switches/sec = 40 million cycles/sec
  - On a 3 GHz core: 40M / 3G = 1.3% CPU overhead

Indirect Cost (estimated):
  - Cache/TLB warmup: ~20,000 cycles/switch average
  - 20,000 × 20,000 = 400 million cycles/sec
  - On a 3 GHz core: 400M / 3G = 13.3% CPU overhead

TOTAL OVERHEAD: ~15% of CPU capacity

On an 8-core server:
  - 15% overhead = 1.2 cores spent on switching
  - Effective capacity: 6.8 cores for actual work

============================================

Alternative: Event-driven model (nginx-style)

Worker Threads:      8 (one per core)
Context Switches:    Minimal (mostly same-thread handling)
Estimated switches:  ~2,000/sec per core (10x fewer)

Direct Cost:   0.13% CPU
Indirect Cost: 1.3% CPU

TOTAL OVERHEAD: ~1.5% of CPU capacity

IMPROVEMENT: 10x reduction in overhead
             15% → 1.5% of CPU freed

This is why event-driven servers (nginx, Node.js) can handle
more connections than process-per-request servers (Apache).
```

Understanding context switch overhead drives major architectural decisions: event-driven vs. threaded servers, process pools vs. on-demand spawning, user-space vs. kernel threads, coroutine-based concurrency. High-performance systems minimize context switches through careful design.
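As a sanity check on the arithmetic above, the short sketch below computes the overhead fraction from a switch rate and a per-switch cost; the inputs are the illustrative numbers from the worked example, not measurements:

```c
/* Fraction of CPU capacity consumed by context switching.
 * Inputs mirror the illustrative worked example above. */
#include <stdio.h>

int main(void) {
    double switches_per_sec = 20000.0;   /* per core, process-per-request model */
    double direct_cycles    = 2000.0;    /* direct cost per switch */
    double indirect_cycles  = 20000.0;   /* assumed cache/TLB warmup per switch */
    double core_hz          = 3e9;       /* 3 GHz core */

    double overhead = switches_per_sec * (direct_cycles + indirect_cycles);
    double fraction = overhead / core_hz;

    printf("Overhead: %.0f Mcycles/s = %.1f%% of each core\n",
           overhead / 1e6, fraction * 100.0);
    printf("On an 8-core server: %.2f cores lost to switching\n", fraction * 8.0);
    return 0;
}

/* With these inputs: 440 Mcycles/s ≈ 14.7% of each core,
 * about 1.2 cores of an 8-core server. */
```

Plugging in the event-driven numbers (2,000 switches/sec per core) drops the loss to roughly 0.1 cores, which is the 10x improvement claimed above.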
Context switch overhead is far more than the visible register save/restore. The indirect costs—cache pollution, TLB flushes, branch predictor disruption—often dominate and can vary by orders of magnitude based on working set size.
You now understand the complete cost structure of context switching—from cycle-level register operations to system-level throughput impact. Next, we'll explore strategies for MINIMIZING context switches to optimize system performance.