Consider a modern laptop with an 8-core CPU. When you run a well-written multithreaded program on this machine, something remarkable happens: multiple threads execute their instructions simultaneously—not interleaved, not time-sliced, but genuinely at the same instant. One thread computes on Core 0 while another computes on Core 1, and six more threads can be actively computing on the remaining cores, all in parallel.
This is true parallelism—the ability of a program to perform multiple computations literally at the same time. It's the holy grail of concurrent programming, and it's made possible by one critical factor: kernel-level threads.
The kernel is the gatekeeper of CPU resources. It decides which code runs on which processor core. User-level threads, invisible to the kernel, cannot achieve true parallelism because the kernel doesn't know they exist—it sees only the process, which it schedules onto a single core at a time. Kernel-level threads, by contrast, are first-class citizens in the kernel's scheduling universe, each eligible to be scheduled onto any available CPU core.
This page explores exactly how kernel threads enable true parallelism, why this matters profoundly for performance, and how to harness this capability effectively.
By the end of this page, you will understand: (1) The precise distinction between concurrency and parallelism, (2) How kernel-level threads achieve true simultaneous execution, (3) The role of the multiprocessor scheduler in enabling parallelism, (4) Theoretical and practical limits of parallel speedup, (5) How to verify and observe parallel execution, and (6) Design principles for applications that leverage true parallelism.
Before diving into how kernel threads enable parallelism, we must clearly distinguish between two concepts that are often confused: concurrency and parallelism.
Concurrency: Multiple tasks make progress during overlapping time periods. They may or may not execute at exactly the same instant. Concurrency is about structure—designing programs where independent tasks can be interleaved.
Parallelism: Multiple tasks execute at literally the same instant, on different processors or cores. Parallelism is about execution—achieving simultaneous computation.
The key insight: You can have concurrency without parallelism (single core, time-sliced), parallelism without concurrency (SIMD, vector operations), or both together. Kernel-level threads enable both.
| Aspect | Concurrency Only | True Parallelism |
|---|---|---|
| Execution model | Tasks interleaved on one CPU | Tasks execute simultaneously on multiple CPUs |
| Hardware requirement | Single core sufficient | Multiple cores required |
| Speedup potential (2 tasks) | 1x (no speedup; tasks merely interleave) | Up to 2x (twice the compute) |
| User-level threads | ✓ Can achieve | ✗ Cannot achieve |
| Kernel-level threads | ✓ Can achieve | ✓ Can achieve |
| I/O-bound workloads | Significant benefit | Less additional benefit |
| CPU-bound workloads | No speed benefit | Near-linear speedup possible |
Why this distinction matters:
For I/O-bound workloads (waiting for network, disk, user input), concurrency alone provides significant benefits. While one task waits for I/O, another can use the CPU. A single core can keep busy by switching between waiting tasks.
For CPU-bound workloads (mathematical computation, data processing, rendering), concurrency alone provides no speedup on a single core. The tasks must wait for each other—the total computation time equals the sum of all task times. Only true parallelism reduces total execution time for CPU-bound work.
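To make the contrast concrete, here is a minimal sketch (assuming Linux/POSIX; a one-second nanosleep() stands in for real I/O). The whole process is pinned to a single core, so no parallelism is possible, yet the two "I/O" tasks still finish in roughly half the sequential time because a blocked thread costs no CPU:

```c
// Illustrative only: concurrency helps I/O-bound work even on ONE core.
// Assumes Linux/POSIX; "I/O" is simulated with nanosleep(). Build with -pthread.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

static void *io_task(void *arg) {
    (void)arg;
    struct timespec wait = { .tv_sec = 1, .tv_nsec = 0 };
    nanosleep(&wait, NULL);            // Thread blocks; the CPU is free for others
    return NULL;
}

static double elapsed(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    // Pin the whole program to CPU 0: concurrency without any parallelism.
    cpu_set_t one_cpu;
    CPU_ZERO(&one_cpu);
    CPU_SET(0, &one_cpu);
    sched_setaffinity(0, sizeof(one_cpu), &one_cpu);

    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    io_task(NULL);                     // Sequential: ~2 seconds total
    io_task(NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Sequential I/O waits: %.2f s\n", elapsed(t0, t1));

    pthread_t a, b;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, io_task, NULL);   // Concurrent: ~1 second total,
    pthread_create(&b, NULL, io_task, NULL);   // even though only one core is used
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Concurrent I/O waits: %.2f s\n", elapsed(t0, t1));
    return 0;
}
```

Repeat the same experiment with CPU-bound loops instead of sleeps and the pinned concurrent version takes just as long as the sequential one, which is exactly the distinction drawn above.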
This is why kernel-level threads—and their ability to enable true parallelism—are essential for leveraging modern multi-core hardware.
"Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once."
Concurrency is a programming model—how you structure code to handle multiple tasks. Parallelism is an execution property—whether multiple computations happen simultaneously. Kernel threads let you achieve both.
The mechanism by which kernel threads achieve true parallelism involves the intricate coordination between the CPU scheduler, per-CPU data structures, and the hardware itself.
The fundamental mechanism:
Modern operating systems implement Symmetric Multiprocessing (SMP), where any thread can run on any CPU, and the kernel treats all CPUs equally. Here's how this enables parallelism:
Per-CPU Run Queues: The scheduler maintains a queue of ready-to-run threads for each CPU core. Each core draws from its own queue for maximum locality.
Independent Scheduling: Each CPU runs its own scheduling loop, independently deciding which thread to execute next. No global lock is needed for the common case.
Load Balancing: The kernel periodically rebalances threads across CPUs to prevent one core from being overloaded while others are idle.
Simultaneous Execution: With N CPUs, up to N kernel threads can be in the Running state at the exact same instant, each executing on a different core.
```c
// Conceptual SMP scheduler structure enabling true parallelism
// Each CPU has its own run queue - no global lock for local scheduling

typedef struct RunQueue {
    spinlock_t lock;                      // Protects only this CPU's queue
    int nr_running;                       // Number of runnable threads on this CPU
    struct rb_root tasks_timeline;        // Red-black tree of tasks (CFS style)
    struct ThreadControlBlock *current;   // Currently executing thread
    struct ThreadControlBlock *idle;      // Idle thread for this CPU

    // For load balancing
    unsigned long load;                   // Weighted load on this CPU
} RunQueue __percpu;                      // One per CPU

// Global array of per-CPU run queues
RunQueue runqueues[MAX_CPUS];

// === SCHEDULER ENTRY (runs on each CPU independently) ===
void schedule(void) {
    RunQueue *rq = this_cpu_runqueue();   // Get current CPU's runqueue
    spin_lock_irq(&rq->lock);             // Lock only local queue

    // Save context of current thread (if not exiting)
    if (rq->current->state != THREAD_TERMINATED) {
        save_context(rq->current);
    }

    // Pick next thread to run (from LOCAL queue only)
    ThreadControlBlock *next = pick_next_task(rq);

    if (next != rq->current) {
        // Switch context to new thread
        context_switch(rq->current, next);
        rq->current = next;
    }

    spin_unlock_irq(&rq->lock);
}

// === PARALLEL EXECUTION VISUALIZED ===
/*
 * Time T=0:
 *   CPU 0: Running Thread A from Process 1
 *   CPU 1: Running Thread B from Process 1
 *   CPU 2: Running Thread X from Process 2
 *   CPU 3: Running Thread Y from Process 2
 *
 * All four threads are executing *simultaneously*.
 * Threads A and B share an address space (same process).
 * Threads X and Y share a different address space.
 *
 * This is TRUE PARALLELISM - four instruction streams
 * making progress at the exact same instant.
 */

// === HOW THREADS GET DISTRIBUTED ===

// When a new thread is created:
pid_t do_clone(clone_args *args) {
    ThreadControlBlock *new_thread = allocate_tcb();
    initialize_thread(new_thread, args);

    // Critical: Choose which CPU's runqueue to add to
    int target_cpu = select_task_rq(new_thread);

    // Add to that CPU's runqueue (may be different from current CPU)
    RunQueue *rq = &runqueues[target_cpu];
    spin_lock(&rq->lock);
    enqueue_task(rq, new_thread);
    spin_unlock(&rq->lock);

    // If target CPU is idle, send an Inter-Processor Interrupt
    // to wake it up and reschedule immediately
    if (rq->current == rq->idle) {
        smp_send_reschedule(target_cpu);
    }

    return new_thread->tid;
}

// CPU selection considers:
// - Load balancing (put on underloaded CPU)
// - Cache locality (prefer CPU with related data)
// - NUMA topology (prefer local memory node)
// - Affinity mask (respect process preferences)
```

Why user-level threads cannot achieve this:
User-level threads are invisible to the kernel. From the kernel's perspective, a process with 100 user-level threads looks identical to a process with 1 thread—it's just one schedulable entity. The kernel schedules this single entity onto one CPU at a time.
Internally, the process's user-space runtime multiplexes its 100 user threads onto this single execution context. While clever, this means: only one user thread can actually execute at any instant, a blocking system call in any user thread can stall the entire process, and adding CPU cores yields no speedup.
Kernel threads solve this by making each thread a first-class kernel entity, eligible for independent scheduling on any CPU.
Modern high-level concurrency runtimes (Go, Tokio, work-stealing pools) use a hybrid approach: they create N kernel threads (usually = CPU count) and multiplex many lightweight tasks onto these threads. This gives the efficiency of user-level scheduling while still achieving true parallelism through the underlying kernel threads. The best of both worlds.
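A minimal sketch of that hybrid idea, under simplifying assumptions (a fixed task array, a mutex-protected queue index, and illustrative names like run_task and worker rather than any real runtime's API). One kernel thread per CPU drains many small tasks, so the tasks run in parallel without one kernel thread per task:

```c
// Sketch of the M:N idea behind Go/Tokio-style runtimes: a few kernel threads
// (one per CPU) execute many lightweight tasks. Names here are illustrative.
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_TASKS 64
#define MAX_WORKERS 64

typedef struct { int id; } task_t;

static task_t tasks[NUM_TASKS];
static int next_task = 0;                          // Index of next unclaimed task
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void run_task(task_t *t) {
    volatile double x = 0;
    for (int i = 0; i < 1000000; i++) x += i;      // Stand-in for task work
    printf("task %d done\n", t->id);
}

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        int idx = (next_task < NUM_TASKS) ? next_task++ : -1;  // Claim a task
        pthread_mutex_unlock(&queue_lock);
        if (idx < 0) return NULL;                  // Queue drained
        run_task(&tasks[idx]);
    }
}

int main(void) {
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);     // One kernel thread per CPU
    if (ncpu < 1) ncpu = 1;
    if (ncpu > MAX_WORKERS) ncpu = MAX_WORKERS;

    pthread_t workers[MAX_WORKERS];
    for (int i = 0; i < NUM_TASKS; i++) tasks[i].id = i;

    for (long i = 0; i < ncpu; i++)
        pthread_create(&workers[i], NULL, worker, NULL);
    for (long i = 0; i < ncpu; i++)
        pthread_join(workers[i], NULL);
    return 0;
}
```

Real runtimes add per-worker queues and work stealing on top of this skeleton, but the parallelism still comes from the underlying kernel threads.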
How do we know parallelism is actually happening? Let's examine concrete ways to verify that multiple threads are executing simultaneously:
1. Timing-Based Verification
The most straightforward test: if two CPU-bound tasks, each taking time T, complete in total time close to T (not 2T), they must have run in parallel.
```c
// Demonstrate true parallelism through timing measurement
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <sched.h>

#define ITERATIONS 500000000  // Heavyweight computation

// CPU-intensive work that takes ~1 second
void *cpu_intensive_work(void *arg) {
    int thread_id = *(int *)arg;
    volatile double result = 0;  // Prevent optimization

    // Get which CPU we're running on
    int cpu = sched_getcpu();
    printf("Thread %d starting on CPU %d\n", thread_id, cpu);

    for (long i = 0; i < ITERATIONS; i++) {
        result += i * 0.000001;
    }

    cpu = sched_getcpu();  // May have migrated
    printf("Thread %d finished on CPU %d, result=%f\n", thread_id, cpu, result);
    return NULL;
}

int main() {
    struct timespec start, end;

    // === Test 1: Sequential execution ===
    printf("\n=== Sequential Execution ===\n");
    clock_gettime(CLOCK_MONOTONIC, &start);

    int id1 = 1, id2 = 2;
    cpu_intensive_work(&id1);  // First task
    cpu_intensive_work(&id2);  // Second task, after first completes

    clock_gettime(CLOCK_MONOTONIC, &end);
    double sequential_time = (end.tv_sec - start.tv_sec) +
                             (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("Sequential time: %.2f seconds\n", sequential_time);

    // === Test 2: Parallel execution with kernel threads ===
    printf("\n=== Parallel Execution ===\n");
    pthread_t threads[2];
    int ids[2] = {1, 2};

    clock_gettime(CLOCK_MONOTONIC, &start);

    // Create two kernel threads
    pthread_create(&threads[0], NULL, cpu_intensive_work, &ids[0]);
    pthread_create(&threads[1], NULL, cpu_intensive_work, &ids[1]);

    // Wait for both to complete
    pthread_join(threads[0], NULL);
    pthread_join(threads[1], NULL);

    clock_gettime(CLOCK_MONOTONIC, &end);
    double parallel_time = (end.tv_sec - start.tv_sec) +
                           (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("Parallel time: %.2f seconds\n", parallel_time);

    // Report speedup
    printf("\n=== Results ===\n");
    printf("Speedup: %.2fx\n", sequential_time / parallel_time);
    printf("Efficiency: %.1f%%\n", (sequential_time / parallel_time / 2) * 100);

    // Expected output on a multi-core system:
    //   Sequential time: ~2.0 seconds (task A + task B)
    //   Parallel time:   ~1.0 seconds (both tasks simultaneously)
    //   Speedup:         ~2.0x (true parallel execution)

    // If you run this on a single-core system or with pinned affinity:
    //   Parallel time: ~2.0 seconds (time-sliced, no true parallelism)
    //   Speedup:       ~1.0x (just overhead)

    return 0;
}
```

2. CPU Utilization Monitoring
Watching system CPU usage reveals parallel execution:
```bash
#!/bin/bash
# Tools to observe parallel thread execution

# 1. Watch per-CPU usage with mpstat
#    Run in one terminal while executing a multithreaded program
mpstat -P ALL 1

# Output shows individual CPU utilization:
#   14:30:01  CPU   %usr  %sys  %iowait  %idle
#   14:30:02  all   50.0   0.5      0.0   49.5   # Two threads on 4-core system
#   14:30:02    0   99.0   1.0      0.0    0.0   # Core 0: Thread A running
#   14:30:02    1   99.0   1.0      0.0    0.0   # Core 1: Thread B running
#   14:30:02    2    1.0   0.0      0.0   99.0   # Core 2: Idle
#   14:30:02    3    1.0   0.0      0.0   99.0   # Core 3: Idle

# 2. Use htop for a visual per-core display
htop
#    Each CPU bar shows separate activity
#    You can see threads executing on different cores

# 3. Trace thread scheduling with perf
perf sched record ./my_parallel_program
perf sched map

# Shows which threads ran on which CPUs over time:
#   TIME     CPU 0     CPU 1     CPU 2   CPU 3
#   0.000    Thread1   Thread2   idle    idle
#   0.001    Thread1   Thread2   idle    idle
#   0.002    Thread1   Thread2   idle    idle
#   ...

# 4. Watch thread-to-CPU mapping live
watch -n 0.5 'ps -eLo pid,tid,psr,comm | grep my_program'

# Output shows which processor (PSR) each thread runs on:
#   PID    TID    PSR  COMMAND
#   12345  12345  0    my_program   # Main thread on CPU 0
#   12345  12346  1    my_program   # Worker thread on CPU 1
#   12345  12347  2    my_program   # Worker thread on CPU 2
```

The definitive proof of true parallelism: if a CPU-bound 2-thread program achieves >1.5x speedup over single-threaded execution, parallelism must be occurring. Time-sliced concurrency on one core cannot exceed 1.0x speedup for CPU-bound work (and typically achieves <1.0x due to context switch overhead).
True parallelism is powerful, but it has fundamental limits. Amdahl's Law describes the maximum theoretical speedup achievable by parallelizing a program, based on the fraction that must remain sequential.
The formula:
Speedup(N) = 1 / (S + (1-S)/N)
Where:
N = number of processors
S = fraction of program that is inherently sequential
(1-S) = fraction that can be parallelized
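As a sanity check, here is a minimal sketch that simply evaluates the formula for a few sequential fractions; the 8-CPU values it prints should line up with the table below.

```c
// Quick check of Amdahl's Law: speedup(N) = 1 / (S + (1 - S) / N).
// The 8-CPU column printed here should match the table that follows.
#include <stdio.h>

static double amdahl(double S, int N) {
    return 1.0 / (S + (1.0 - S) / N);
}

int main(void) {
    double sequential[] = {0.0, 0.01, 0.05, 0.10, 0.25, 0.50, 0.75};
    int count = sizeof(sequential) / sizeof(sequential[0]);

    for (int i = 0; i < count; i++) {
        double S = sequential[i];
        printf("S = %4.0f%%  ->  8 CPUs: %.1fx   limit as N grows: ",
               S * 100, amdahl(S, 8));
        if (S == 0.0)
            printf("unbounded\n");     // No sequential part at all
        else
            printf("%.1fx\n", 1.0 / S); // Speedup can never exceed 1/S
    }
    return 0;
}
```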
The implications are stark:
| Sequential % | Parallel % | Max Speedup (∞ CPUs) | Speedup with 8 CPUs |
|---|---|---|---|
| 0% | 100% | ∞ (perfect) | 8.0x |
| 1% | 99% | 100x | 7.5x |
| 5% | 95% | 20x | 5.9x |
| 10% | 90% | 10x | 4.7x |
| 25% | 75% | 4x | 2.9x |
| 50% | 50% | 2x | 1.8x |
| 75% | 25% | 1.33x | 1.3x |
The profound insight:
Even with unlimited processors, if 10% of your program must run sequentially, you can never exceed 10x speedup. That sequential 10% becomes the bottleneck that dominates total runtime as the parallel portion speeds up.
What creates sequential portions? Common culprits include one-time initialization, final aggregation of per-thread results, critical sections protected by locks, and I/O such as logging. The example below times the sequential initialization and aggregation phases that surround a parallel computation.
```c
// Demonstrating Amdahl's Law in practice

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ARRAY_SIZE 100000000
#define NUM_THREADS 8

double data[ARRAY_SIZE];
double partial_sums[NUM_THREADS];

// Work function for parallel portion
void *parallel_work(void *arg) {
    int thread_id = *(int *)arg;
    int chunk_size = ARRAY_SIZE / NUM_THREADS;
    int start = thread_id * chunk_size;
    int end = start + chunk_size;

    double sum = 0;
    for (int i = start; i < end; i++) {
        sum += data[i] * data[i];  // Parallelizable work
    }
    partial_sums[thread_id] = sum;
    return NULL;
}

int main() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // ============================================
    // SEQUENTIAL PORTION: Initialization (S%)
    // ============================================
    for (int i = 0; i < ARRAY_SIZE; i++) {
        data[i] = i * 0.0001;  // Must be sequential (or is it?)
    }

    struct timespec after_init;
    clock_gettime(CLOCK_MONOTONIC, &after_init);

    // ============================================
    // PARALLEL PORTION: Computation ((1-S)%)
    // ============================================
    pthread_t threads[NUM_THREADS];
    int thread_ids[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        thread_ids[i] = i;
        pthread_create(&threads[i], NULL, parallel_work, &thread_ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }

    struct timespec after_parallel;
    clock_gettime(CLOCK_MONOTONIC, &after_parallel);

    // ============================================
    // SEQUENTIAL PORTION: Aggregation (S%)
    // ============================================
    double total = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        total += partial_sums[i];
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    // Calculate timings
    double init_time = (after_init.tv_sec - start.tv_sec) +
                       (after_init.tv_nsec - start.tv_nsec) / 1e9;
    double parallel_time = (after_parallel.tv_sec - after_init.tv_sec) +
                           (after_parallel.tv_nsec - after_init.tv_nsec) / 1e9;
    double total_time = (end.tv_sec - start.tv_sec) +
                        (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("Initialization (sequential): %.3f seconds (%.1f%%)\n",
           init_time, (init_time / total_time) * 100);
    printf("Computation (parallel):      %.3f seconds (%.1f%%)\n",
           parallel_time, (parallel_time / total_time) * 100);
    printf("Total time: %.3f seconds\n", total_time);
    printf("Result: %f\n", total);

    // Amdahl's Law prediction:
    //   If init is 30% of sequential runtime,
    //   max speedup with 8 threads = 1 / (0.3 + 0.7/8) = 2.58x
    //   Not 8x, even with perfect parallel efficiency!

    return 0;
}

/*
 * Key insight: To maximize parallel speedup:
 *   1. Minimize sequential fractions
 *   2. Parallelize initialization if possible
 *   3. Use concurrent data structures to reduce aggregation overhead
 *   4. Design algorithms with minimal synchronization
 */
```

Gustafson's Law offers a more optimistic perspective: as we add processors, we often scale up the problem size too. If the parallel portion scales with problem size while the sequential portion stays constant, speedup can be much better than Amdahl's Law suggests. This is why large-scale computing (supercomputers, cloud) can achieve high efficiency—the problems are chosen to be massively parallel.
Beyond Amdahl's Law, practical parallel programming faces additional challenges that can limit the effectiveness of kernel threads:
1. Cache Coherency Overhead
When multiple CPUs access shared data, the hardware maintains coherency using protocols like MESI. This creates invisible overhead:
```c
// Demonstrating false sharing - a hidden parallelism killer

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000
#define NUM_THREADS 4

// BAD: Counters are adjacent in memory (same cache line)
struct BadCounters {
    long count[NUM_THREADS];  // All in same or adjacent cache lines
};

// GOOD: Counters are padded to separate cache lines
struct GoodCounters {
    struct {
        long count;
        char padding[56];  // Pad to 64 bytes (typical cache line)
    } slots[NUM_THREADS];
};

struct BadCounters bad_counters = {0};
struct GoodCounters good_counters = {0};

void *increment_bad(void *arg) {
    int id = *(int *)arg;
    for (long i = 0; i < ITERATIONS; i++) {
        bad_counters.count[id]++;  // False sharing!
    }
    return NULL;
}

void *increment_good(void *arg) {
    int id = *(int *)arg;
    for (long i = 0; i < ITERATIONS; i++) {
        good_counters.slots[id].count++;  // No false sharing
    }
    return NULL;
}

int main() {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    struct timespec start, end;

    // Test with false sharing
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, increment_bad, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double bad_time = (end.tv_sec - start.tv_sec) +
                      (end.tv_nsec - start.tv_nsec) / 1e9;

    // Test without false sharing
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_create(&threads[i], NULL, increment_good, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double good_time = (end.tv_sec - start.tv_sec) +
                       (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("With false sharing:    %.2f seconds\n", bad_time);
    printf("Without false sharing: %.2f seconds\n", good_time);
    printf("Speedup from fix:      %.2fx\n", bad_time / good_time);

    // Typical results: the false sharing version is 3-10x SLOWER!
    // Each write invalidates the cache line on all other CPUs,
    // causing expensive cache coherency traffic.

    return 0;
}
```

2. Memory Bandwidth Limitations
All CPUs share the same memory subsystem. When parallel threads all access memory intensively, they compete for bandwidth: once the memory controllers saturate, adding threads no longer improves throughput, so memory-bound workloads often stop scaling well before the core count is reached.
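The following sketch (buffer sizes, pass counts, and helper names are invented for illustration) shows the effect: each thread streams through its own private buffer with zero sharing and zero locking, yet on many machines the multi-threaded run still takes noticeably longer than the single-threaded baseline because the memory controllers are shared.

```c
// Illustrative memory-bandwidth scaling test. No data is shared between
// threads, so any slowdown in the concurrent run comes from the shared
// memory subsystem, not from synchronization. Build with -pthread.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_THREADS 4
#define BUF_DOUBLES (8 * 1024 * 1024)   // 64 MB of doubles per thread
#define PASSES 20                       // Stream each buffer several times

static double *buffers[NUM_THREADS];

static void *stream_sum(void *arg) {
    double *buf = buffers[*(int *)arg];
    volatile double sum = 0;
    for (int pass = 0; pass < PASSES; pass++)
        for (long i = 0; i < BUF_DOUBLES; i++)
            sum += buf[i];              // One load per add: bandwidth-bound
    return NULL;
}

static double seconds_between(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    int ids[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        buffers[i] = malloc(BUF_DOUBLES * sizeof(double));
        if (!buffers[i]) return 1;
        for (long j = 0; j < BUF_DOUBLES; j++) buffers[i][j] = j;
    }

    struct timespec t0, t1;

    // Baseline: one thread streaming alone
    clock_gettime(CLOCK_MONOTONIC, &t0);
    stream_sum(&ids[0]);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double one = seconds_between(t0, t1);

    // All threads streaming at once (still zero sharing)
    pthread_t th[NUM_THREADS];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&th[i], NULL, stream_sum, &ids[i]);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(th[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double all = seconds_between(t0, t1);

    // With unlimited bandwidth, "all" would stay close to "one"; the gap
    // between them is the cost of sharing the memory subsystem.
    printf("1 thread: %.2f s, %d threads concurrently: %.2f s\n",
           one, NUM_THREADS, all);

    for (int i = 0; i < NUM_THREADS; i++) free(buffers[i]);
    return 0;
}
```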
3. Synchronization Overhead
Every synchronization point—lock, barrier, atomic operation—creates a sequential bottleneck and adds overhead:
| Operation | Uncontended Cost | Contended Cost |
|---|---|---|
| Atomic increment | 5-20 cycles | 100-500 cycles (cache coherency) |
| Mutex lock | 20-50 cycles | Microseconds (context switch) |
| Spinlock | 10-30 cycles | Busy-wait until release |
| Barrier sync | 100+ cycles | All threads must wait for slowest |
| Memory fence | 10-50 cycles | Forces memory ordering |
Adding more threads to a program with heavy synchronization often decreases performance. Each additional thread increases contention, and the sequential synchronization bottleneck dominates. This is why designing for minimal synchronization is essential—parallel performance is won or lost in the architecture, not the implementation.
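To see that serialization in action, here is a hypothetical micro-benchmark (thread and iteration counts are arbitrary): every increment funnels through a single mutex, so the four threads spend most of their time waiting on each other rather than computing, and the program is often no faster than a single-threaded loop.

```c
// Micro-benchmark of the "sequential bottleneck" effect: four threads
// hammering ONE mutex-protected counter serialize on the lock. Build with -pthread.
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NUM_THREADS 4
#define INCREMENTS_PER_THREAD 2000000L

static long shared_counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *locked_increments(void *arg) {
    (void)arg;
    for (long i = 0; i < INCREMENTS_PER_THREAD; i++) {
        pthread_mutex_lock(&counter_lock);   // Every iteration serializes here
        shared_counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

int main(void) {
    pthread_t th[NUM_THREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&th[i], NULL, locked_increments, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(th[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("counter = %ld in %.2f s\n", shared_counter, secs);
    // The critical section covers essentially all of the work, so by
    // Amdahl's Law the achievable speedup is about 1x regardless of cores.
    return 0;
}
```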
Achieving effective parallelism requires deliberate design decisions. Here are key principles for designing systems that fully exploit true parallelism via kernel threads:
Principle 1: Partition Data, Not Operations
Divide data among threads, letting each thread perform all operations on its portion. This minimizes sharing and synchronization.
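A brief illustrative sketch of this principle (the helper names and the scale/clamp/sum pipeline are invented for the example): each thread owns a contiguous slice and applies every operation to it, and the only shared step is a tiny final merge, so no locks are needed.

```c
// Partition the DATA: each thread runs the whole pipeline on its own slice.
// Build with -pthread.
#include <pthread.h>
#include <stdio.h>

#define N 8000000
#define NUM_THREADS 4

static double values[N];
static double partial[NUM_THREADS];

typedef struct { int first, last, slot; } chunk_t;

static void *process_chunk(void *arg) {
    chunk_t *c = (chunk_t *)arg;
    double sum = 0;
    for (int i = c->first; i < c->last; i++) {
        double v = values[i] * 1.5;          // scale
        if (v > 100.0) v = 100.0;            // clamp
        sum += v;                            // accumulate locally
    }
    partial[c->slot] = sum;                  // One write per thread, own slot
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) values[i] = i % 200;

    pthread_t th[NUM_THREADS];
    chunk_t chunks[NUM_THREADS];
    int per_thread = N / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        chunks[t] = (chunk_t){ t * per_thread,
                               (t == NUM_THREADS - 1) ? N : (t + 1) * per_thread,
                               t };
        pthread_create(&th[t], NULL, process_chunk, &chunks[t]);
    }

    double total = 0;
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(th[t], NULL);
        total += partial[t];                 // Small sequential merge
    }
    printf("total = %f\n", total);
    return 0;
}
```

The alternative, giving each thread one *operation* (one scales, one clamps, one sums), forces every element through every thread and turns the hand-offs into synchronization points.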
Principle 2: Use Lock-Free Structures for Hot Paths
For data structures accessed by all threads, lock-free algorithms avoid serialization:
```c
// Lock-free counter using atomic operations
#include <stdatomic.h>

#define MAX_THREADS 64  // Assumed upper bound for this example

// Lock-free: concurrent increments don't block
typedef struct {
    _Atomic long value;
} AtomicCounter;

void atomic_increment(AtomicCounter *counter) {
    atomic_fetch_add(&counter->value, 1);  // Hardware atomic
    // No lock, no waiting, true parallel increments
    // (though still cache coherency traffic)
}

// Even better: Per-thread counters with lazy aggregation
typedef struct {
    struct {
        long count;
        char padding[56];   // Pad to a full cache line to avoid false sharing
    } counts[MAX_THREADS];  // Each thread's slot lives on its own cache line
} DistributedCounter;

void distributed_increment(DistributedCounter *counter, int thread_id) {
    counter->counts[thread_id].count++;  // No atomics, no sharing!
}

long distributed_read(DistributedCounter *counter) {
    long total = 0;
    for (int i = 0; i < MAX_THREADS; i++) {
        total += counter->counts[i].count;  // Aggregate when needed
    }
    return total;
}

// The distributed version scales perfectly - each thread
// works on its own cache line with zero contention.
```

The best parallel workloads are "embarrassingly parallel"—tasks that require no communication or synchronization between threads. Examples: independent web requests, parallel file processing, map operations on independent data. Strive to structure your work to approach this ideal. Each ounce of communication you remove yields a pound of parallel speedup.
Modern kernels include sophisticated features specifically designed to maximize parallel efficiency. Understanding these helps in writing applications that fully leverage hardware parallelism.
1. CPU Affinity
CPU affinity allows binding threads to specific CPUs, providing control over where threads execute:
```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

void *worker(void *arg) {
    int cpu = *(int *)arg;

    // Create a CPU set containing a single CPU
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);

    // Bind this thread to the specified CPU
    pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);

    printf("Thread bound to CPU %d, now running on CPU %d\n",
           cpu, sched_getcpu());

    // Benefits:
    //   1. Cache locality - thread's data stays in this CPU's cache
    //   2. Predictability - no migrations between CPUs
    //   3. NUMA optimization - can bind to a CPU near the thread's memory

    // Now do work that benefits from staying on one CPU...

    return NULL;
}

int main() {
    pthread_t threads[4];
    int cpus[4] = {0, 1, 2, 3};

    // Create threads, each bound to a specific CPU
    for (int i = 0; i < 4; i++) {
        pthread_create(&threads[i], NULL, worker, &cpus[i]);
    }
    for (int i = 0; i < 4; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}
```

2. NUMA Optimization
Non-Uniform Memory Access (NUMA) systems have memory local to each CPU socket. The kernel provides NUMA-aware allocation:
Use numa_alloc_onnode() to allocate memory on a specific node, so that a thread running on that node's CPUs touches local memory instead of paying remote-access latency across the interconnect.
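A hedged sketch of node-local allocation using libnuma (assumes the libnuma development package is installed and the program is linked with -lnuma):

```c
// Allocate memory on the NUMA node of the CPU we are currently running on.
// Compile with: gcc -D_GNU_SOURCE numa_demo.c -lnuma
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    // Find out which NUMA node the current CPU belongs to...
    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);

    // ...and allocate a buffer on that node, so a thread kept on this node
    // (e.g. via CPU affinity) reads and writes local memory.
    size_t size = 64 * 1024 * 1024;
    double *buf = numa_alloc_onnode(size, node);
    if (buf == NULL) {
        perror("numa_alloc_onnode");
        return 1;
    }

    printf("CPU %d is on node %d; allocated %zu bytes of node-local memory\n",
           cpu, node, size);

    numa_free(buf, size);
    return 0;
}
```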
3. Scheduler Features for Parallelism

| Feature | Purpose | Benefit |
|---|---|---|
| Per-CPU run queues | Each CPU has its own queue | No global lock for scheduling |
| Work stealing | Idle CPUs take work from busy ones | Automatic load balancing |
| Cache-hot balancing | Prefer migrating threads without hot data | Preserves cache locality |
| Wake-to-idle | Wake threads preferring idle CPUs | Faster wakeup, better distribution |
| cgroup bandwidth control | Limit CPU time per group | Fair sharing across applications |
```c
// Using scheduler hints for better parallelism

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <sys/syscall.h>

// Request wake-to-idle behavior for a thread:
// the kernel will try to wake it on an idle CPU
// rather than the CPU that woke it (which might be busy)

struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    // ... other fields
};

// Real-time scheduling for latency-sensitive threads
void set_realtime_priority(pthread_t thread, int priority) {
    struct sched_param param;
    param.sched_priority = priority;  // 1-99, higher = more priority

    // SCHED_FIFO: Real-time, no time slicing
    // SCHED_RR:   Real-time with round-robin among same priority
    pthread_setschedparam(thread, SCHED_FIFO, &param);

    // Warning: requires elevated privileges; can starve other processes!
}

// Hints to the scheduler about thread behavior:
// Linux 5.7+ supports CLONE_INTO_CGROUP for better placement
// Linux 5.3+ supports SCHED_FLAG_UTIL_CLAMP for frequency hints
```

Modern kernel schedulers are highly optimized for parallel workloads. Features like automatic load balancing, cache-aware migration, and wake-to-idle work transparently without application involvement. For most applications, simply creating an appropriate number of long-lived threads and minimizing synchronization is sufficient—the kernel handles distribution efficiently.
We've explored the defining capability of kernel-level threads: enabling true parallelism—genuine simultaneous execution on multiple CPUs. The key insights: concurrency is about program structure, while parallelism is about simultaneous execution. Because each kernel thread is independently schedulable, an SMP scheduler with per-CPU run queues can run N threads on N cores at the same instant. Amdahl's Law caps the achievable speedup by the program's sequential fraction, and practical costs (cache coherency traffic, memory bandwidth, synchronization) often matter as much as the theory, which is why data partitioning and minimal sharing dominate parallel design.
What's next:
True parallelism comes with a cost. Every kernel thread consumes kernel resources—memory for data structures, kernel stack space, CPU cycles for scheduling. The next page examines overhead considerations: what resources kernel threads consume, how overhead scales with thread count, and when this overhead becomes a limiting factor.
You now understand true parallelism—the fundamental capability that makes kernel-level threads essential for utilizing modern multi-core hardware. This knowledge is crucial for designing systems that effectively leverage parallel hardware and for understanding why certain architectural decisions (reduced synchronization, data partitioning, appropriate thread counts) have such profound performance impacts.