We've established that context switches are expensive—far more expensive than the naive calculation of register save/restore cycles suggests. Cache pollution, TLB flushes, and branch predictor disruption can add hundreds of microseconds of overhead per switch.
The obvious question follows: how do we minimize these switches?
This is not merely an academic exercise. Production systems at companies like Google, Amazon, and Netflix process millions of requests per second. At that scale, shaving microseconds from each request translates directly into significant infrastructure savings and measurably better user experience. Context switch minimization is a first-class performance engineering concern.
This page presents a comprehensive toolkit of strategies—from architectural patterns to programming techniques to kernel tuning—that reduce context switch frequency and mitigate their impact when switches are unavoidable.
By the end of this page, you will understand multiple strategies for minimizing context switches: event-driven architectures, CPU affinity, cooperative scheduling, proper I/O management, kernel tuning parameters, and hardware features. You'll know when to apply each technique and understand the tradeoffs involved.
The most effective way to minimize context switches is to design systems that inherently require fewer switches. This is an architectural decision that affects the entire application structure.
Event-Driven Architecture:
Traditional thread-per-connection models create a new thread for each client, causing a context switch whenever the active client changes. Event-driven architecture uses a single thread (or a few threads) to handle all connections using non-blocking I/O and an event loop.
How it works:
The kernel provides readiness-notification APIs for exactly this pattern: epoll (Linux), kqueue (BSD/macOS), and IOCP (Windows).

```c
/*
 * Event-Driven Server Example (Linux epoll)
 *
 * A single thread handles 10,000+ connections with minimal context switches.
 * Compare to thread-per-connection, which would require 10,000 threads.
 */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <unistd.h>

#define MAX_EVENTS 10000

void handle_client(int fd);  /* application logic, defined elsewhere */

void set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

void event_loop(int listen_fd) {
    int epoll_fd = epoll_create1(0);

    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];

    /*
     * This single thread handles ALL connections.
     * No context switches between client handling!
     */
    while (1) {
        /* Wait for events on any of thousands of file descriptors */
        int n = epoll_wait(epoll_fd, events, MAX_EVENTS, -1);

        /*
         * A context switch only happens here if no events are ready.
         * When events ARE ready, we process them all in one go.
         */
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                /* Accept a new connection */
                int client_fd = accept(listen_fd, NULL, NULL);
                set_nonblocking(client_fd);

                ev.events = EPOLLIN | EPOLLET;  /* edge-triggered */
                ev.data.fd = client_fd;
                epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client_fd, &ev);
            } else {
                /* Handle client data - process without switching */
                handle_client(events[i].data.fd);
            }
        }
    }
}

/*
 * Context Switch Comparison:
 *
 * Thread-per-connection (10,000 clients):
 * - 10,000 threads competing for CPU
 * - Scheduler constantly switching between threads
 * - Each read() may block and cause a switch
 * - Estimated: 50,000+ switches/second
 *
 * Event-driven (10,000 clients):
 * - 1-4 threads (typically = CPU cores)
 * - Threads only switch when the event loop is idle
 * - Non-blocking I/O never causes a switch
 * - Estimated: 100-1,000 switches/second
 *
 * Result: 50-500x reduction in context switches
 */
```

| Architecture | Threads/Connections | Switches/sec (10K conn) | Best For |
|---|---|---|---|
| Process-per-connection | 10,000 processes | 100,000+ | Isolation, simple code (CGI era) |
| Thread-per-connection | 10,000 threads | 50,000+ | Moderate load, blocking I/O |
| Thread pool | 100-1000 threads | 10,000+ | Balanced approach |
| Event-driven (single) | 1 thread | 100-500 | Maximum throughput, I/O-bound |
| Event-driven (multi) | 4-8 threads | 500-2,000 | Multi-core utilization |
| Hybrid (event + pool) | Varies | 5,000-10,000 | Mixed CPU/IO workloads |
nginx (event-driven) handles 10,000+ connections per worker process. Apache (process/thread-per-connection) struggles with 1,000. Node.js, Redis, and memcached all use event-driven architecture. This isn't coincidence—it's deliberate optimization for minimal context switching.
CPU affinity (or processor affinity) binds a process or thread to specific CPU cores. This reduces context switch overhead by:

- Keeping the thread's working set warm in that core's L1/L2 caches
- Avoiding cross-core migration delays
- Making scheduling behavior more predictable
When to Use CPU Affinity:

- Latency-critical threads that must not lose cache state
- Dedicated worker threads in a thread-per-core design
- NUMA systems, where threads should stay near their memory (see numactl below)
```c
/*
 * CPU Affinity Examples
 *
 * Pinning threads to specific cores to minimize cache effects
 * from context switches and core migrations.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

void process_work(void);  /* application work loop, defined elsewhere */

/*
 * Pin the current thread to a specific CPU core
 */
void pin_to_core(int core_id) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core_id, &cpuset);

    /* Set affinity for the current thread */
    int result = pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);

    if (result != 0) {
        /* pthread functions return the error code; they do not set errno */
        fprintf(stderr, "Failed to set CPU affinity: %s\n", strerror(result));
    } else {
        printf("Thread pinned to core %d\n", core_id);
    }
}

/*
 * Pin to a range of cores (allow migration within the set)
 */
void pin_to_core_range(int start_core, int end_core) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    for (int i = start_core; i <= end_core; i++) {
        CPU_SET(i, &cpuset);
    }
    pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}

/*
 * High-performance worker threads with dedicated cores
 */
#define NUM_WORKERS 4

void *worker_thread(void *arg) {
    int worker_id = *(int *)arg;

    /* Pin each worker to a specific core */
    pin_to_core(worker_id);

    /*
     * Benefits:
     * 1. This worker's data stays in the core's L1/L2 cache
     * 2. No cache pollution from other workers
     * 3. No migration delays
     * 4. Predictable performance
     */
    while (1) {
        process_work();  /* process work items */
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_WORKERS];
    int worker_ids[NUM_WORKERS];

    /*
     * Create workers, each pinned to a core.
     * Core 0 often handles interrupts, so start at 1.
     */
    for (int i = 0; i < NUM_WORKERS; i++) {
        worker_ids[i] = i + 1;  /* cores 1, 2, 3, 4 */
        pthread_create(&threads[i], NULL, worker_thread, &worker_ids[i]);
    }

    /*
     * Alternative: use the isolcpus kernel parameter to reserve cores
     * for your application, preventing other processes from using them.
     *
     * Boot param: isolcpus=1-4
     * Now cores 1-4 only run your pinned threads, not system tasks.
     */
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}
```
```bash
#!/bin/bash
# CPU Affinity Management with taskset and cgroups

# Pin a running process to core 0
taskset -p 0x1 <PID>

# Start a new process on cores 0-3
taskset -c 0-3 ./my_application

# View current affinity
taskset -p <PID>

# Use numactl for NUMA-aware placement
numactl --cpubind=0 --membind=0 ./my_application

# Isolate CPUs at boot (add to the kernel cmdline)
# isolcpus=4-7      # Reserve cores 4-7 for dedicated use
# nohz_full=4-7     # Disable timer interrupts on cores 4-7
# rcu_nocbs=4-7     # Move RCU callbacks off cores 4-7

# Use cgroups v2 for CPU isolation
mkdir /sys/fs/cgroup/myapp
echo "4-7" > /sys/fs/cgroup/myapp/cpuset.cpus
echo "0" > /sys/fs/cgroup/myapp/cpuset.mems
echo $$ > /sys/fs/cgroup/myapp/cgroup.procs

# Verify isolation - should show minimal interrupts on isolated cores
watch -n 1 'cat /proc/interrupts | head -20'
```

CPU affinity can backfire if overused. Pinning threads to specific cores prevents the scheduler from load balancing: if one pinned core becomes overloaded while others sit idle, overall throughput drops. Use affinity for latency-critical paths, but allow the scheduler flexibility for batch workloads.
Many context switches are caused by threads blocking on synchronization primitives. By choosing the right synchronization strategy, you can dramatically reduce involuntary yields.
The Synchronization Hierarchy (least to most costly):

1. Lock-free atomic operations - no blocking, no switches
2. Spinlocks with backoff - burn CPU briefly instead of sleeping
3. Adaptive mutexes - spin first, sleep only under contention
4. Sleeping locks (futex-backed mutexes) - syscall and context switch when contended
```c
/*
 * Synchronization Strategies and Their Context Switch Impact
 */
#include <stdatomic.h>
#include <pthread.h>
#include <immintrin.h>  /* for _mm_pause() */

/*
 * Strategy 1: Lock-free (no context switches)
 *
 * Use atomic operations to update shared state without locks.
 * No thread ever blocks, so synchronization never causes a switch.
 */
atomic_int lock_free_counter = 0;

void increment_lock_free(void) {
    atomic_fetch_add(&lock_free_counter, 1);
    /* No blocking possible, no context switch */
}

/*
 * Strategy 2: Spinlock with backoff (avoids short-term switches)
 *
 * For very short critical sections, spinning is cheaper than sleeping.
 * The pause instruction reduces power use and memory bus contention.
 */
typedef struct {
    atomic_int locked;
} spinlock_t;

void spin_lock(spinlock_t *lock) {
    int backoff = 1;
    while (1) {
        /* Try to acquire */
        if (atomic_exchange(&lock->locked, 1) == 0) {
            return;  /* Got the lock */
        }
        /* Spin with exponential backoff */
        for (int i = 0; i < backoff; i++) {
            _mm_pause();  /* CPU hint: spinning, reduce power */
        }
        backoff = (backoff < 1024) ? backoff * 2 : 1024;
    }
}

void spin_unlock(spinlock_t *lock) {
    atomic_store(&lock->locked, 0);
}

/*
 * Strategy 3: Adaptive mutex (best of both worlds)
 *
 * Spin for a short time hoping the lock becomes available.
 * If not, fall back to sleeping (one context switch).
 * glibc offers this behavior via PTHREAD_MUTEX_ADAPTIVE_NP.
 */
typedef struct {
    pthread_mutex_t mutex;
} adaptive_mutex_t;

#define SPIN_COUNT 100

void adaptive_lock(adaptive_mutex_t *m) {
    /* First, try spinning */
    for (int i = 0; i < SPIN_COUNT; i++) {
        if (pthread_mutex_trylock(&m->mutex) == 0) {
            return;  /* Got the lock without sleeping */
        }
        _mm_pause();
    }
    /* Spinning failed; block (this is where the context switch happens) */
    pthread_mutex_lock(&m->mutex);
}

/*
 * Strategy 4: Futex for minimal kernel interaction
 *
 * futex: "fast userspace mutex"
 * - Fast path is pure userspace (no syscall, no switch)
 * - Slow path asks the kernel to sleep
 */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

typedef struct {
    atomic_int value;  /* 0 = unlocked, 1 = locked */
} futex_lock_t;

void futex_lock(futex_lock_t *lock) {
    /* Fast path: uncontended case - pure userspace */
    if (atomic_exchange(&lock->value, 1) == 0) {
        return;  /* Got the lock, no syscall needed */
    }
    /* Slow path: contended - ask the kernel to put us to sleep */
    while (1) {
        /* Sleep in the kernel while value == 1 */
        syscall(SYS_futex, &lock->value, FUTEX_WAIT, 1, NULL, NULL, 0);
        /* Woke up, try again */
        if (atomic_exchange(&lock->value, 1) == 0) {
            return;
        }
    }
}

void futex_unlock(futex_lock_t *lock) {
    atomic_store(&lock->value, 0);
    /*
     * Wake one waiter (if any). This simplified version always makes the
     * wake syscall; production futex locks track contention (e.g. value
     * 2 = "locked with waiters") so an uncontended unlock can skip it.
     */
    syscall(SYS_futex, &lock->value, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

Blocking I/O is a primary cause of voluntary context switches. Every read() or write() that cannot complete immediately puts the calling thread to sleep, triggering a context switch. Non-blocking and asynchronous I/O patterns avoid this penalty.
I/O Patterns and Their Context Switch Behavior:
```c
/*
 * Comparison of I/O patterns and their context switch implications
 */
#include <poll.h>
#include <unistd.h>
#include <fcntl.h>

void process_data(char *buf, ssize_t len);  /* defined elsewhere */
void do_other_work(void);                   /* defined elsewhere */

/*
 * Pattern 1: Blocking I/O (causes a context switch)
 */
void blocking_io_example(int fd) {
    char buffer[4096];

    /* This WILL context switch if data isn't ready */
    ssize_t n = read(fd, buffer, sizeof(buffer));

    /* Thread was asleep, now continuing */
    process_data(buffer, n);
}

/*
 * Pattern 2: Non-blocking with poll (one switch covers many FDs)
 */
void multiplexed_io_example(int *fds, int num_fds) {
    struct pollfd pfd[num_fds];
    char buffer[4096];

    for (int i = 0; i < num_fds; i++) {
        pfd[i].fd = fds[i];
        pfd[i].events = POLLIN;
    }

    /*
     * A single context switch (if nothing is ready yet)
     * covers ALL file descriptors.
     */
    poll(pfd, num_fds, -1);

    /* No more switches needed - just process the ready FDs */
    for (int i = 0; i < num_fds; i++) {
        if (pfd[i].revents & POLLIN) {
            /* Data is ready, so this read should not block */
            ssize_t n = read(fds[i], buffer, sizeof(buffer));
            process_data(buffer, n);
        }
    }
}

/*
 * Pattern 3: io_uring (Linux 5.1+) - truly asynchronous
 */
#include <liburing.h>

void io_uring_example(void) {
    struct io_uring ring;
    io_uring_queue_init(32, &ring, 0);

    char buffer[4096];
    int fd = open("data.txt", O_RDONLY);

    /* Submit a read request - doesn't block! */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buffer, sizeof(buffer), 0);
    io_uring_submit(&ring);

    /* Continue doing other work while the I/O happens in the background */
    do_other_work();

    /* Check completion - can be blocking or non-blocking */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);  /* or io_uring_peek_cqe() for non-blocking */

    /* I/O complete, process the result */
    int result = cqe->res;
    io_uring_cqe_seen(&ring, cqe);
    process_data(buffer, result);
}

/*
 * Comparison at 10,000 I/O operations:
 *
 * Blocking I/O:          10,000 context switches
 * poll/epoll (batched):  ~100 context switches (depends on arrival pattern)
 * io_uring (batched):    ~10 context switches
 */
```

io_uring also allows batching: submit multiple I/O operations with a single syscall, then harvest completions in a batch. This can reduce syscall overhead to near zero and minimize context switches even for I/O-heavy workloads.
User-level threads (also called green threads, fibers, or coroutines) are managed entirely in user space, without kernel involvement. Switching between user-level threads requires no system call and no kernel context switch—just saving/restoring a few registers.
User-Level vs. Kernel-Level Threads:
| Aspect | Kernel Threads | User-Level Threads |
|---|---|---|
| Switch cost | 1-10 μs | 10-100 ns |
| Involves kernel | Yes | No |
| True parallelism | Yes | No (unless M:N mapping) |
| Blocking syscall | Only blocks thread | Blocks entire process* |
*Unless using async I/O or M:N hybrid model
```c
/*
 * Simple coroutine implementation using ucontext
 *
 * Demonstrates user-level context switching without kernel scheduling.
 * A user-level switch costs tens of nanoseconds vs ~1000 ns for a
 * kernel thread switch.
 */
#include <ucontext.h>
#include <stdio.h>
#include <stdlib.h>

#define STACK_SIZE (64 * 1024)

typedef struct {
    ucontext_t context;
    char stack[STACK_SIZE];
    int id;
    int finished;
} coroutine_t;

coroutine_t *current_coroutine;
coroutine_t main_coroutine;
coroutine_t *coroutines[10];
int num_coroutines = 0;

/*
 * Yield to the scheduler - switch WITHOUT kernel involvement
 */
void yield(void) {
    coroutine_t *prev = current_coroutine;

    /* Find the next coroutine (simple round-robin) */
    static int next = 0;
    next = (next + 1) % num_coroutines;
    coroutine_t *next_coro = coroutines[next];

    current_coroutine = next_coro;

    /*
     * swapcontext: the entire "context switch"
     *
     * A pure userspace operation:
     * 1. Save current registers into prev->context
     * 2. Load registers from next_coro->context
     * 3. Continue execution in the next coroutine
     *
     * Cost: tens of cycles (vs ~3000 for a kernel thread switch).
     * No kernel scheduling involved!
     */
    swapcontext(&prev->context, &next_coro->context);
}

/*
 * Coroutine body - can yield and resume
 */
void coroutine_func(int id) {
    for (int i = 0; i < 5; i++) {
        printf("Coroutine %d: iteration %d\n", id, i);
        yield();  /* give other coroutines a chance - no kernel! */
    }
    current_coroutine->finished = 1;
}

/*
 * Create a new coroutine
 */
coroutine_t *create_coroutine(void (*func)(int), int id) {
    coroutine_t *coro = malloc(sizeof(coroutine_t));
    getcontext(&coro->context);
    coro->context.uc_stack.ss_sp = coro->stack;
    coro->context.uc_stack.ss_size = STACK_SIZE;
    coro->context.uc_link = &main_coroutine.context;  /* return here on exit */
    coro->id = id;
    coro->finished = 0;

    /* makecontext only passes int arguments portably, so pass the id */
    makecontext(&coro->context, (void (*)(void))func, 1, id);

    coroutines[num_coroutines++] = coro;
    return coro;
}

int main(void) {
    /* Create several coroutines */
    for (int i = 1; i <= 3; i++) {
        create_coroutine(coroutine_func, i);
    }

    /* Run the (simplified) scheduler: ends when the first coroutine returns */
    current_coroutine = coroutines[0];
    swapcontext(&main_coroutine.context, &current_coroutine->context);

    printf("Scheduler finished\n");
    return 0;
}

/*
 * Modern languages with native coroutines:
 * - Go: goroutines (M:N model with work stealing)
 * - Python: async/await with asyncio
 * - Rust: async/await with tokio/async-std
 * - C++20: coroutines (stackless)
 * - Java 21: Virtual Threads (Project Loom)
 */
```

Modern runtimes like Go use M:N threading: M user-level threads (goroutines) multiplexed onto N kernel threads. This combines the low-overhead switching of user threads with the parallelism of kernel threads. When a goroutine blocks on I/O, the runtime moves other goroutines to idle OS threads, preventing the blocking problem of pure user threading.
The Linux kernel provides numerous tuning parameters that affect context switch behavior. Proper tuning can significantly reduce switch frequency for specific workloads.
Key Tunable Parameters:
```bash
#!/bin/bash
# Kernel Tuning for Reduced Context Switches
# (On kernels >= 5.13 the sched_* knobs below moved from
#  /proc/sys/kernel/ to /sys/kernel/debug/sched/)

#################################################
# SCHEDULER TUNING
#################################################

# Increase the minimum time slice (default: 0.75ms)
# Larger slices = fewer preemptions
echo 4000000 > /proc/sys/kernel/sched_min_granularity_ns      # 4ms

# Increase the scheduler latency target
# Higher = allows longer time slices under load
echo 24000000 > /proc/sys/kernel/sched_latency_ns             # 24ms

# Raise wakeup preemption threshold
# Higher = longer before a newly-woken process preempts the current one
echo 10000000 > /proc/sys/kernel/sched_wakeup_granularity_ns  # 10ms

# Disable automatic NUMA balancing (if manual placement is better)
echo 0 > /proc/sys/kernel/numa_balancing

#################################################
# TIMER AND TICK SETTINGS
#################################################

# Boot parameter for tickless operation on certain CPUs:
# nohz_full=4-7    # No timer interrupts on cores 4-7

# Boot parameter to move RCU callbacks off certain CPUs:
# rcu_nocbs=4-7    # Move RCU work off cores 4-7

#################################################
# I/O SCHEDULER TUNING
#################################################

# Use none (or mq-deadline) for NVMe - fewer I/O-related switches
echo "none" > /sys/block/nvme0n1/queue/scheduler

# Increase the request queue depth (batch more, fewer interrupts)
echo 256 > /sys/block/sda/queue/nr_requests

#################################################
# NETWORK TUNING
#################################################

# Busy poll - spin instead of sleeping for network I/O
echo 50 > /proc/sys/net/core/busy_read   # microseconds to busy-poll
echo 50 > /proc/sys/net/core/busy_poll

# Increase socket buffers to reduce syscall frequency
echo 16777216 > /proc/sys/net/core/rmem_max
echo 16777216 > /proc/sys/net/core/wmem_max

#################################################
# MEMORY TUNING
#################################################

# Reduce transparent hugepage management overhead
echo "madvise" > /sys/kernel/mm/transparent_hugepage/enabled

# Avoid swapping (pair with mlockall() in the application)
echo 0 > /proc/sys/vm/swappiness

#################################################
# CPU ISOLATION (boot parameters)
#################################################

# Kernel boot parameters for isolation:
# isolcpus=4-7     # Isolate cores 4-7 from the scheduler
# nohz_full=4-7    # Disable timer ticks on cores 4-7
# rcu_nocbs=4-7    # Move RCU callbacks off cores 4-7
# irqaffinity=0-3  # Route IRQs to cores 0-3 only

# After boot, move IRQs off the isolated cores
for irq in /proc/irq/[0-9]*/smp_affinity; do
    echo f > $irq   # cores 0-3 only (bitmask 0xf)
done
```

| Parameter | Default | Latency-Optimized | Throughput-Optimized |
|---|---|---|---|
| sched_min_granularity_ns | 750,000 | 3,000,000 | 10,000,000 |
| sched_latency_ns | 6,000,000 | 12,000,000 | 40,000,000 |
| sched_wakeup_granularity_ns | 1,000,000 | 5,000,000 | 15,000,000 |
| sched_migration_cost_ns | 500,000 | 500,000 | 5,000,000 |
There is no universal 'best' setting. Interactive desktops need low latency (more switches, shorter slices). Batch processing needs throughput (fewer switches, longer slices). Always benchmark with your actual workload before deploying tuning changes to production.
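Benchmarking starts with measuring the switch rate itself. A few standard Linux tools (a sketch; `<PID>` is a placeholder for your process):

```shell
# Per-process totals: voluntary (blocked on I/O or locks) vs
# involuntary (preempted) switches - shown here for the current shell
grep ctxt /proc/$$/status

# System-wide switch rate: watch the "cs" column
#   vmstat 1
# Per-process switch rate over time (sysstat package):
#   pidstat -w -p <PID> 1
# Counting switches during a run (perf, may need privileges):
#   perf stat -e context-switches,cpu-migrations -p <PID> sleep 5
```

A high voluntary count points at blocking I/O or lock contention; a high involuntary count points at CPU oversubscription or time slices that are too short.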
Modern CPUs include features specifically designed to make context switching faster. Understanding and leveraging these features is key to minimizing overhead.
Key Hardware Features:
```c
/*
 * Using hardware features to optimize context switching
 */
#include <cpuid.h>
#include <stdint.h>

/* Check for PCID support (tagged TLB entries survive switches) */
int has_pcid(void) {
    unsigned int eax, ebx, ecx, edx;
    __cpuid(1, eax, ebx, ecx, edx);
    return (ecx >> 17) & 1;  /* PCID is bit 17 of ECX */
}

/* Check for XSAVEOPT support */
int has_xsaveopt(void) {
    unsigned int eax, ebx, ecx, edx;
    __cpuid_count(0xd, 1, eax, ebx, ecx, edx);
    return eax & 1;  /* XSAVEOPT is bit 0 of EAX */
}

/* Check for FSGSBASE support */
int has_fsgsbase(void) {
    unsigned int eax, ebx, ecx, edx;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return ebx & 1;  /* FSGSBASE is bit 0 of EBX */
}

/*
 * FSGSBASE: fast TLS base manipulation
 *
 * Old way (MSR access, ~100 cycles):
 *     wrmsr(MSR_FS_BASE, value);
 *
 * New way with FSGSBASE (~5 cycles; the kernel must enable
 * CR4.FSGSBASE for userspace - Linux does since 5.9):
 *     wrfsbase(value);
 */
static inline void write_fs_base(unsigned long base) {
    asm volatile("wrfsbase %0" : : "r" (base));
}

static inline unsigned long read_fs_base(void) {
    unsigned long base;
    asm volatile("rdfsbase %0" : "=r" (base));
    return base;
}

/*
 * XSAVEOPT: optimized FPU state save
 *
 * Regular XSAVE saves all requested components.
 * XSAVEOPT only saves components that have been modified
 * since the last XRSTOR, using hardware init/modified tracking.
 *
 * For processes that don't touch FPU/SIMD state between switches,
 * XSAVEOPT is essentially a no-op, saving hundreds of cycles.
 */
static inline void fpu_save_optimized(void *buffer, uint64_t mask) {
    asm volatile(
        "xsaveopt %0"
        : "=m" (*(char *)buffer)
        : "a" ((uint32_t)mask), "d" ((uint32_t)(mask >> 32))
        : "memory"
    );
}
```

Minimizing context switches is a multi-faceted challenge requiring a combination of architectural, programming, and systems-level approaches. The right combination depends on your specific workload characteristics.
| Technique | Impact | Complexity | When to Use |
|---|---|---|---|
| Event-driven architecture | Very High | High | High-connection servers, I/O-bound |
| CPU affinity/pinning | Medium-High | Low | Latency-critical, batch processing |
| Lock-free data structures | Medium | High | High-contention shared data |
| Adaptive mutex/spinlocks | Medium | Medium | Short critical sections |
| Non-blocking/async I/O | High | Medium | I/O-heavy applications |
| User-level threading | Very High | Medium | Fine-grained concurrency |
| Kernel tuning | Medium | Low | All workloads (baseline) |
| Thread pooling | Medium | Low | Request-based workloads |
You have now completed the Context Switching module. You understand what triggers context switches, how state is saved and restored, the true overhead involved, and a comprehensive toolkit for minimizing that overhead. This knowledge is fundamental to building high-performance systems and understanding why modern server architectures operate the way they do.