Consider a modern laptop with an 8-core CPU. When you run a well-written multithreaded program on this machine, something remarkable happens: multiple threads execute their instructions simultaneously—not interleaved, not time-sliced, but genuinely at the same instant. One thread computes on Core 0 while another computes on Core 1, and six more threads can be actively computing on the remaining cores, all in parallel.
This is true parallelism—the ability of a program to perform multiple computations literally at the same time. It's the holy grail of concurrent programming, and it's made possible by one critical factor: kernel-level threads.
The kernel is the gatekeeper of CPU resources. It decides which code runs on which processor core. User-level threads, invisible to the kernel, cannot achieve true parallelism because the kernel doesn't know they exist—it sees only the process, which it schedules onto a single core at a time. Kernel-level threads, by contrast, are first-class citizens in the kernel's scheduling universe, each eligible to be scheduled onto any available CPU core.
This page explores exactly how kernel threads enable true parallelism, why this matters profoundly for performance, and how to harness this capability effectively.
By the end of this page, you will understand: (1) The precise distinction between concurrency and parallelism, (2) How kernel-level threads achieve true simultaneous execution, (3) The role of the multiprocessor scheduler in enabling parallelism, (4) Theoretical and practical limits of parallel speedup, (5) How to verify and observe parallel execution, and (6) Design principles for applications that leverage true parallelism.
Before diving into how kernel threads enable parallelism, we must clearly distinguish between two concepts that are often confused: concurrency and parallelism.
Concurrency: Multiple tasks make progress during overlapping time periods. They may or may not execute at exactly the same instant. Concurrency is about structure—designing programs where independent tasks can be interleaved.
Parallelism: Multiple tasks execute at literally the same instant, on different processors or cores. Parallelism is about execution—achieving simultaneous computation.
The key insight: You can have concurrency without parallelism (single core, time-sliced), parallelism without concurrency (SIMD, vector operations), or both together. Kernel-level threads enable both.
| Aspect | Concurrency Only | True Parallelism |
|---|---|---|
| Execution model | Tasks interleaved on one CPU | Tasks execute simultaneously on multiple CPUs |
| Hardware requirement | Single core sufficient | Multiple cores required |
| Speedup potential (2 tasks) | 1x (no speedup; tasks merely interleave) | Up to 2x (twice the compute) |
| User-level threads | ✓ Can achieve | ✗ Cannot achieve |
| Kernel-level threads | ✓ Can achieve | ✓ Can achieve |
| I/O-bound workloads | Significant benefit | Less additional benefit |
| CPU-bound workloads | No speed benefit | Near-linear speedup possible |
Why this distinction matters:
For I/O-bound workloads (waiting for network, disk, user input), concurrency alone provides significant benefits. While one task waits for I/O, another can use the CPU. A single core can keep busy by switching between waiting tasks.
For CPU-bound workloads (mathematical computation, data processing, rendering), concurrency alone provides no speedup on a single core. The tasks must wait for each other—the total computation time equals the sum of all task times. Only true parallelism reduces total execution time for CPU-bound work.
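To make the contrast concrete, here is a minimal sketch (assuming Linux/POSIX; a one-second nanosleep() stands in for real I/O). The whole process is pinned to a single core, so no parallelism is possible, yet the two "I/O" tasks still finish in roughly half the sequential time because a blocked thread costs no CPU:

```c
// Illustrative only: concurrency helps I/O-bound work even on ONE core.
// Assumes Linux/POSIX; "I/O" is simulated with nanosleep(). Build with -pthread.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

static void *io_task(void *arg) {
    (void)arg;
    struct timespec wait = { .tv_sec = 1, .tv_nsec = 0 };
    nanosleep(&wait, NULL);            // Thread blocks; the CPU is free for others
    return NULL;
}

static double elapsed(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    // Pin the whole program to CPU 0: concurrency without any parallelism.
    cpu_set_t one_cpu;
    CPU_ZERO(&one_cpu);
    CPU_SET(0, &one_cpu);
    sched_setaffinity(0, sizeof(one_cpu), &one_cpu);

    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    io_task(NULL);                     // Sequential: ~2 seconds total
    io_task(NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Sequential I/O waits: %.2f s\n", elapsed(t0, t1));

    pthread_t a, b;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&a, NULL, io_task, NULL);   // Concurrent: ~1 second total,
    pthread_create(&b, NULL, io_task, NULL);   // even though only one core is used
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("Concurrent I/O waits: %.2f s\n", elapsed(t0, t1));
    return 0;
}
```

Repeat the same experiment with CPU-bound loops instead of sleeps and the pinned concurrent version takes just as long as the sequential one, which is exactly the distinction drawn above.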
This is why kernel-level threads—and their ability to enable true parallelism—are essential for leveraging modern multi-core hardware.
"Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once."
Concurrency is a programming model—how you structure code to handle multiple tasks. Parallelism is an execution property—whether multiple computations happen simultaneously. Kernel threads let you achieve both.
The mechanism by which kernel threads achieve true parallelism involves the intricate coordination between the CPU scheduler, per-CPU data structures, and the hardware itself.
The fundamental mechanism:
Modern operating systems implement Symmetric Multiprocessing (SMP), where any thread can run on any CPU, and the kernel treats all CPUs equally. Here's how this enables parallelism:
Per-CPU Run Queues: The scheduler maintains a queue of ready-to-run threads for each CPU core. Each core draws from its own queue for maximum locality.
Independent Scheduling: Each CPU runs its own scheduling loop, independently deciding which thread to execute next. No global lock is needed for the common case.
Load Balancing: The kernel periodically rebalances threads across CPUs to prevent one core from being overloaded while others are idle.
Simultaneous Execution: With N CPUs, up to N kernel threads can be in the Running state at the exact same instant, each executing on a different core.
```c
// Conceptual SMP scheduler structure enabling true parallelism
// Each CPU has its own run queue - no global lock for local scheduling

typedef struct RunQueue {
    spinlock_t lock;                      // Protects only this CPU's queue
    int nr_running;                       // Number of runnable threads on this CPU
    struct rb_root tasks_timeline;        // Red-black tree of tasks (CFS style)
    struct ThreadControlBlock *current;   // Currently executing thread
    struct ThreadControlBlock *idle;      // Idle thread for this CPU

    // For load balancing
    unsigned long load;                   // Weighted load on this CPU
} RunQueue __percpu;                      // One per CPU

// Global array of per-CPU run queues
RunQueue runqueues[MAX_CPUS];

// === SCHEDULER ENTRY (runs on each CPU independently) ===
void schedule(void) {
    RunQueue *rq = this_cpu_runqueue();   // Get current CPU's runqueue
    spin_lock_irq(&rq->lock);             // Lock only local queue

    // Save context of current thread (if not exiting)
    if (rq->current->state != THREAD_TERMINATED) {
        save_context(rq->current);
    }

    // Pick next thread to run (from LOCAL queue only)
    ThreadControlBlock *next = pick_next_task(rq);

    if (next != rq->current) {
        // Switch context to new thread
        context_switch(rq->current, next);
        rq->current = next;
    }

    spin_unlock_irq(&rq->lock);
}

// === PARALLEL EXECUTION VISUALIZED ===
/*
 * Time T=0:
 *   CPU 0: Running Thread A from Process 1
 *   CPU 1: Running Thread B from Process 1
 *   CPU 2: Running Thread X from Process 2
 *   CPU 3: Running Thread Y from Process 2
 *
 * All four threads are executing *simultaneously*.
 * Threads A and B share an address space (same process).
 * Threads X and Y share a different address space.
 *
 * This is TRUE PARALLELISM - four instruction streams
 * making progress at the exact same instant.
 */

// === HOW THREADS GET DISTRIBUTED ===

// When a new thread is created:
pid_t do_clone(clone_args *args) {
    ThreadControlBlock *new_thread = allocate_tcb();
    initialize_thread(new_thread, args);

    // Critical: Choose which CPU's runqueue to add to
    int target_cpu = select_task_rq(new_thread);

    // Add to that CPU's runqueue (may be different from current CPU)
    RunQueue *rq = &runqueues[target_cpu];
    spin_lock(&rq->lock);
    enqueue_task(rq, new_thread);
    spin_unlock(&rq->lock);

    // If target CPU is idle, send an Inter-Processor Interrupt
    // to wake it up and reschedule immediately
    if (rq->current == rq->idle) {
        smp_send_reschedule(target_cpu);
    }

    return new_thread->tid;
}

// CPU selection considers:
// - Load balancing (put on underloaded CPU)
// - Cache locality (prefer CPU with related data)
// - NUMA topology (prefer local memory node)
// - Affinity mask (respect process preferences)
```

Why user-level threads cannot achieve this:
User-level threads are invisible to the kernel. From the kernel's perspective, a process with 100 user-level threads looks identical to a process with 1 thread—it's just one schedulable entity. The kernel schedules this single entity onto one CPU at a time.
Internally, the process's user-space runtime multiplexes its 100 user threads onto this single execution context. While clever, this means: only one user thread can actually execute at any instant, a blocking system call in any user thread can stall the entire process, and adding CPU cores yields no speedup.
Kernel threads solve this by making each thread a first-class kernel entity, eligible for independent scheduling on any CPU.
Modern high-level concurrency runtimes (Go, Tokio, work-stealing pools) use a hybrid approach: they create N kernel threads (usually = CPU count) and multiplex many lightweight tasks onto these threads. This gives the efficiency of user-level scheduling while still achieving true parallelism through the underlying kernel threads. The best of both worlds.
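A minimal sketch of that hybrid idea, under simplifying assumptions (a fixed task array, a mutex-protected queue index, and illustrative names like run_task and worker rather than any real runtime's API). One kernel thread per CPU drains many small tasks, so the tasks run in parallel without one kernel thread per task:

```c
// Sketch of the M:N idea behind Go/Tokio-style runtimes: a few kernel threads
// (one per CPU) execute many lightweight tasks. Names here are illustrative.
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_TASKS 64
#define MAX_WORKERS 64

typedef struct { int id; } task_t;

static task_t tasks[NUM_TASKS];
static int next_task = 0;                          // Index of next unclaimed task
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void run_task(task_t *t) {
    volatile double x = 0;
    for (int i = 0; i < 1000000; i++) x += i;      // Stand-in for task work
    printf("task %d done\n", t->id);
}

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        int idx = (next_task < NUM_TASKS) ? next_task++ : -1;  // Claim a task
        pthread_mutex_unlock(&queue_lock);
        if (idx < 0) return NULL;                  // Queue drained
        run_task(&tasks[idx]);
    }
}

int main(void) {
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);     // One kernel thread per CPU
    if (ncpu < 1) ncpu = 1;
    if (ncpu > MAX_WORKERS) ncpu = MAX_WORKERS;

    pthread_t workers[MAX_WORKERS];
    for (int i = 0; i < NUM_TASKS; i++) tasks[i].id = i;

    for (long i = 0; i < ncpu; i++)
        pthread_create(&workers[i], NULL, worker, NULL);
    for (long i = 0; i < ncpu; i++)
        pthread_join(workers[i], NULL);
    return 0;
}
```

Real runtimes add per-worker queues and work stealing on top of this skeleton, but the parallelism still comes from the underlying kernel threads.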
How do we know parallelism is actually happening? Let's examine concrete ways to verify that multiple threads are executing simultaneously:
1. Timing-Based Verification
The most straightforward test: if two CPU-bound tasks, each taking time T, complete in total time close to T (not 2T), they must have run in parallel.
```c
// Demonstrate true parallelism through timing measurement
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <sched.h>

#define ITERATIONS 500000000  // Heavyweight computation

// CPU-intensive work that takes ~1 second
void *cpu_intensive_work(void *arg) {
    int thread_id = *(int *)arg;
    volatile double result = 0;  // Prevent optimization

    // Get which CPU we're running on
    int cpu = sched_getcpu();
    printf("Thread %d starting on CPU %d\n", thread_id, cpu);

    for (long i = 0; i < ITERATIONS; i++) {
        result += i * 0.000001;
    }

    cpu = sched_getcpu();  // May have migrated
    printf("Thread %d finished on CPU %d, result=%f\n", thread_id, cpu, result);
    return NULL;
}

int main() {
    struct timespec start, end;

    // === Test 1: Sequential execution ===
    printf("\n=== Sequential Execution ===\n");
    clock_gettime(CLOCK_MONOTONIC, &start);

    int id1 = 1, id2 = 2;
    cpu_intensive_work(&id1);  // First task
    cpu_intensive_work(&id2);  // Second task, after first completes

    clock_gettime(CLOCK_MONOTONIC, &end);
    double sequential_time = (end.tv_sec - start.tv_sec) +
                             (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("Sequential time: %.2f seconds\n", sequential_time);

    // === Test 2: Parallel execution with kernel threads ===
    printf("\n=== Parallel Execution ===\n");
    pthread_t threads[2];
    int ids[2] = {1, 2};

    clock_gettime(CLOCK_MONOTONIC, &start);

    // Create two kernel threads
    pthread_create(&threads[0], NULL, cpu_intensive_work, &ids[0]);
    pthread_create(&threads[1], NULL, cpu_intensive_work, &ids[1]);

    // Wait for both to complete
    pthread_join(threads[0], NULL);
    pthread_join(threads[1], NULL);

    clock_gettime(CLOCK_MONOTONIC, &end);
    double parallel_time = (end.tv_sec - start.tv_sec) +
                           (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("Parallel time: %.2f seconds\n", parallel_time);

    // Report speedup
    printf("\n=== Results ===\n");
    printf("Speedup: %.2fx\n", sequential_time / parallel_time);
    printf("Efficiency: %.1f%%\n", (sequential_time / parallel_time / 2) * 100);

    // Expected output on a multi-core system:
    //   Sequential time: ~2.0 seconds (task A + task B)
    //   Parallel time:   ~1.0 seconds (both tasks simultaneously)
    //   Speedup:         ~2.0x (true parallel execution)

    // If you run this on a single-core system or with pinned affinity:
    //   Parallel time: ~2.0 seconds (time-sliced, no true parallelism)
    //   Speedup:       ~1.0x (just overhead)

    return 0;
}
```

2. CPU Utilization Monitoring
Watching system CPU usage reveals parallel execution:
```bash
#!/bin/bash
# Tools to observe parallel thread execution

# 1. Watch per-CPU usage with mpstat
#    Run in one terminal while executing a multithreaded program
mpstat -P ALL 1

# Output shows individual CPU utilization:
#   14:30:01  CPU   %usr  %sys  %iowait  %idle
#   14:30:02  all   50.0   0.5      0.0   49.5   # Two threads on 4-core system
#   14:30:02    0   99.0   1.0      0.0    0.0   # Core 0: Thread A running
#   14:30:02    1   99.0   1.0      0.0    0.0   # Core 1: Thread B running
#   14:30:02    2    1.0   0.0      0.0   99.0   # Core 2: Idle
#   14:30:02    3    1.0   0.0      0.0   99.0   # Core 3: Idle

# 2. Use htop for a visual per-core display
htop
#    Each CPU bar shows separate activity
#    You can see threads executing on different cores

# 3. Trace thread scheduling with perf
perf sched record ./my_parallel_program
perf sched map

# Shows which threads ran on which CPUs over time:
#   TIME     CPU 0     CPU 1     CPU 2   CPU 3
#   0.000    Thread1   Thread2   idle    idle
#   0.001    Thread1   Thread2   idle    idle
#   0.002    Thread1   Thread2   idle    idle
#   ...

# 4. Watch thread-to-CPU mapping live
watch -n 0.5 'ps -eLo pid,tid,psr,comm | grep my_program'

# Output shows which processor (PSR) each thread runs on:
#   PID    TID    PSR  COMMAND
#   12345  12345  0    my_program   # Main thread on CPU 0
#   12345  12346  1    my_program   # Worker thread on CPU 1
#   12345  12347  2    my_program   # Worker thread on CPU 2
```

The definitive proof of true parallelism: if a CPU-bound 2-thread program achieves >1.5x speedup over single-threaded execution, parallelism must be occurring. Time-sliced concurrency on one core cannot exceed 1.0x speedup for CPU-bound work (and typically achieves <1.0x due to context switch overhead).
True parallelism is powerful, but it has fundamental limits. Amdahl's Law describes the maximum theoretical speedup achievable by parallelizing a program, based on the fraction that must remain sequential.
The formula:
Speedup(N) = 1 / (S + (1-S)/N)
Where:
N = number of processors
S = fraction of program that is inherently sequential
(1-S) = fraction that can be parallelized
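As a sanity check, here is a minimal sketch that simply evaluates the formula for a few sequential fractions; the 8-CPU values it prints should line up with the table below.

```c
// Quick check of Amdahl's Law: speedup(N) = 1 / (S + (1 - S) / N).
// The 8-CPU column printed here should match the table that follows.
#include <stdio.h>

static double amdahl(double S, int N) {
    return 1.0 / (S + (1.0 - S) / N);
}

int main(void) {
    double sequential[] = {0.0, 0.01, 0.05, 0.10, 0.25, 0.50, 0.75};
    int count = sizeof(sequential) / sizeof(sequential[0]);

    for (int i = 0; i < count; i++) {
        double S = sequential[i];
        printf("S = %4.0f%%  ->  8 CPUs: %.1fx   limit as N grows: ",
               S * 100, amdahl(S, 8));
        if (S == 0.0)
            printf("unbounded\n");     // No sequential part at all
        else
            printf("%.1fx\n", 1.0 / S); // Speedup can never exceed 1/S
    }
    return 0;
}
```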
The implications are stark:
| Sequential % | Parallel % | Max Speedup (∞ CPUs) | Speedup with 8 CPUs |
|---|---|---|---|
| 0% | 100% | ∞ (perfect) | 8.0x |
| 1% | 99% | 100x | 7.5x |
| 5% | 95% | 20x | 5.9x |
| 10% | 90% | 10x | 4.7x |
| 25% | 75% | 4x | 2.9x |
| 50% | 50% | 2x | 1.8x |
| 75% | 25% | 1.33x | 1.3x |
The profound insight:
Even with unlimited processors, if 10% of your program must run sequentially, you can never exceed 10x speedup. That sequential 10% becomes the bottleneck that dominates total runtime as the parallel portion speeds up.
What creates sequential portions? Common culprits include one-time initialization, final aggregation of per-thread results, critical sections protected by locks, and I/O such as logging. The example below times the sequential initialization and aggregation phases that surround a parallel computation.
```c
// Demonstrating Amdahl's Law in practice

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ARRAY_SIZE 100000000
#define NUM_THREADS 8

double data[ARRAY_SIZE];
double partial_sums[NUM_THREADS];

// Work function for parallel portion
void *parallel_work(void *arg) {
    int thread_id = *(int *)arg;
    int chunk_size = ARRAY_SIZE / NUM_THREADS;
    int start = thread_id * chunk_size;
    int end = start + chunk_size;

    double sum = 0;
    for (int i = start; i < end; i++) {
        sum += data[i] * data[i];  // Parallelizable work
    }
    partial_sums[thread_id] = sum;
    return NULL;
}

int main() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // ============================================
    // SEQUENTIAL PORTION: Initialization (S%)
    // ============================================
    for (int i = 0; i < ARRAY_SIZE; i++) {
        data[i] = i * 0.0001;  // Must be sequential (or is it?)
    }

    struct timespec after_init;
    clock_gettime(CLOCK_MONOTONIC, &after_init);

    // ============================================
    // PARALLEL PORTION: Computation ((1-S)%)
    // ============================================
    pthread_t threads[NUM_THREADS];
    int thread_ids[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        thread_ids[i] = i;
        pthread_create(&threads[i], NULL, parallel_work, &thread_ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }

    struct timespec after_parallel;
    clock_gettime(CLOCK_MONOTONIC, &after_parallel);

    // ============================================
    // SEQUENTIAL PORTION: Aggregation (S%)
    // ============================================
    double total = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        total += partial_sums[i];
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    // Calculate timings
    double init_time = (after_init.tv_sec - start.tv_sec) +
                       (after_init.tv_nsec - start.tv_nsec) / 1e9;
    double parallel_time = (after_parallel.tv_sec - after_init.tv_sec) +
                           (after_parallel.tv_nsec - after_init.tv_nsec) / 1e9;
    double total_time = (end.tv_sec - start.tv_sec) +
                        (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("Initialization (sequential): %.3f seconds (%.1f%%)\n",
           init_time, (init_time / total_time) * 100);
    printf("Computation (parallel):      %.3f seconds (%.1f%%)\n",
           parallel_time, (parallel_time / total_time) * 100);
    printf("Total time: %.3f seconds\n", total_time);
    printf("Result: %f\n", total);

    // Amdahl's Law prediction:
    //   If init is 30% of sequential runtime,
    //   max speedup with 8 threads = 1 / (0.3 + 0.7/8) = 2.58x
    //   Not 8x, even with perfect parallel efficiency!

    return 0;
}

/*
 * Key insight: To maximize parallel speedup:
 *   1. Minimize sequential fractions
 *   2. Parallelize initialization if possible
 *   3. Use concurrent data structures to reduce aggregation overhead
 *   4. Design algorithms with minimal synchronization
 */
```

Gustafson's Law offers a more optimistic perspective: as we add processors, we often scale up the problem size too. If the parallel portion scales with problem size while the sequential portion stays constant, speedup can be much better than Amdahl's Law suggests. This is why large-scale computing (supercomputers, cloud) can achieve high efficiency—the problems are chosen to be massively parallel.
Beyond Amdahl's Law, practical parallel programming faces additional challenges that can limit the effectiveness of kernel threads:
1. Cache Coherency Overhead
When multiple CPUs access shared data, the hardware maintains coherency using protocols like MESI. This creates invisible overhead:
```c
// Demonstrating false sharing - a hidden parallelism killer

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000000
#define NUM_THREADS 4

// BAD: Counters are adjacent in memory (same cache line)
struct BadCounters {
    long count[NUM_THREADS];  // All in same or adjacent cache lines
};

// GOOD: Counters are padded to separate cache lines
struct GoodCounters {
    struct {
        long count;
        char padding[56];  // Pad to 64 bytes (typical cache line)
    } slots[NUM_THREADS];
};

struct BadCounters bad_counters = {0};
struct GoodCounters good_counters = {0};

void *increment_bad(void *arg) {
    int id = *(int *)arg;
    for (long i = 0; i < ITERATIONS; i++) {
        bad_counters.count[id]++;  // False sharing!
    }
    return NULL;
}

void *increment_good(void *arg) {
    int id = *(int *)arg;
    for (long i = 0; i < ITERATIONS; i++) {
        good_counters.slots[id].count++;  // No false sharing
    }
    return NULL;
}

int main() {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    struct timespec start, end;

    // Test with false sharing
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, increment_bad, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double bad_time = (end.tv_sec - start.tv_sec) +
                      (end.tv_nsec - start.tv_nsec) / 1e9;

    // Test without false sharing
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_create(&threads[i], NULL, increment_good, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double good_time = (end.tv_sec - start.tv_sec) +
                       (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("With false sharing:    %.2f seconds\n", bad_time);
    printf("Without false sharing: %.2f seconds\n", good_time);
    printf("Speedup from fix:      %.2fx\n", bad_time / good_time);

    // Typical results: the false sharing version is 3-10x SLOWER!
    // Each write invalidates the cache line on all other CPUs,
    // causing expensive cache coherency traffic.

    return 0;
}
```

2. Memory Bandwidth Limitations
All CPUs share the same memory subsystem. When parallel threads all access memory intensively, they compete for bandwidth: once the memory controllers saturate, adding threads no longer improves throughput, so memory-bound workloads often stop scaling well before the core count is reached.
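The following sketch (buffer sizes, pass counts, and helper names are invented for illustration) shows the effect: each thread streams through its own private buffer with zero sharing and zero locking, yet on many machines the multi-threaded run still takes noticeably longer than the single-threaded baseline because the memory controllers are shared.

```c
// Illustrative memory-bandwidth scaling test. No data is shared between
// threads, so any slowdown in the concurrent run comes from the shared
// memory subsystem, not from synchronization. Build with -pthread.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NUM_THREADS 4
#define BUF_DOUBLES (8 * 1024 * 1024)   // 64 MB of doubles per thread
#define PASSES 20                       // Stream each buffer several times

static double *buffers[NUM_THREADS];

static void *stream_sum(void *arg) {
    double *buf = buffers[*(int *)arg];
    volatile double sum = 0;
    for (int pass = 0; pass < PASSES; pass++)
        for (long i = 0; i < BUF_DOUBLES; i++)
            sum += buf[i];              // One load per add: bandwidth-bound
    return NULL;
}

static double seconds_between(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
    int ids[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        buffers[i] = malloc(BUF_DOUBLES * sizeof(double));
        if (!buffers[i]) return 1;
        for (long j = 0; j < BUF_DOUBLES; j++) buffers[i][j] = j;
    }

    struct timespec t0, t1;

    // Baseline: one thread streaming alone
    clock_gettime(CLOCK_MONOTONIC, &t0);
    stream_sum(&ids[0]);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double one = seconds_between(t0, t1);

    // All threads streaming at once (still zero sharing)
    pthread_t th[NUM_THREADS];
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&th[i], NULL, stream_sum, &ids[i]);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(th[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double all = seconds_between(t0, t1);

    // With unlimited bandwidth, "all" would stay close to "one"; the gap
    // between them is the cost of sharing the memory subsystem.
    printf("1 thread: %.2f s, %d threads concurrently: %.2f s\n",
           one, NUM_THREADS, all);

    for (int i = 0; i < NUM_THREADS; i++) free(buffers[i]);
    return 0;
}
```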
3. Synchronization Overhead
Every synchronization point—lock, barrier, atomic operation—creates a sequential bottleneck and adds overhead:
| Operation | Uncontended Cost | Contended Cost |
|---|---|---|
| Atomic increment | 5-20 cycles | 100-500 cycles (cache coherency) |
| Mutex lock | 20-50 cycles | Microseconds (context switch) |
| Spinlock | 10-30 cycles | Busy-wait until release |
| Barrier sync | 100+ cycles | All threads must wait for slowest |
| Memory fence | 10-50 cycles | Forces memory ordering |
Adding more threads to a program with heavy synchronization often decreases performance. Each additional thread increases contention, and the sequential synchronization bottleneck dominates. This is why designing for minimal synchronization is essential—parallel performance is won or lost in the architecture, not the implementation.
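To see that serialization in action, here is a hypothetical micro-benchmark (thread and iteration counts are arbitrary): every increment funnels through a single mutex, so the four threads spend most of their time waiting on each other rather than computing, and the program is often no faster than a single-threaded loop.

```c
// Micro-benchmark of the "sequential bottleneck" effect: four threads
// hammering ONE mutex-protected counter serialize on the lock. Build with -pthread.
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NUM_THREADS 4
#define INCREMENTS_PER_THREAD 2000000L

static long shared_counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

static void *locked_increments(void *arg) {
    (void)arg;
    for (long i = 0; i < INCREMENTS_PER_THREAD; i++) {
        pthread_mutex_lock(&counter_lock);   // Every iteration serializes here
        shared_counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

int main(void) {
    pthread_t th[NUM_THREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&th[i], NULL, locked_increments, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(th[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("counter = %ld in %.2f s\n", shared_counter, secs);
    // The critical section covers essentially all of the work, so by
    // Amdahl's Law the achievable speedup is about 1x regardless of cores.
    return 0;
}
```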
Achieving effective parallelism requires deliberate design decisions. Here are key principles for designing systems that fully exploit true parallelism via kernel threads:
Principle 1: Partition Data, Not Operations
Divide data among threads, letting each thread perform all operations on its portion. This minimizes sharing and synchronization.
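A brief illustrative sketch of this principle (the helper names and the scale/clamp/sum pipeline are invented for the example): each thread owns a contiguous slice and applies every operation to it, and the only shared step is a tiny final merge, so no locks are needed.

```c
// Partition the DATA: each thread runs the whole pipeline on its own slice.
// Build with -pthread.
#include <pthread.h>
#include <stdio.h>

#define N 8000000
#define NUM_THREADS 4

static double values[N];
static double partial[NUM_THREADS];

typedef struct { int first, last, slot; } chunk_t;

static void *process_chunk(void *arg) {
    chunk_t *c = (chunk_t *)arg;
    double sum = 0;
    for (int i = c->first; i < c->last; i++) {
        double v = values[i] * 1.5;          // scale
        if (v > 100.0) v = 100.0;            // clamp
        sum += v;                            // accumulate locally
    }
    partial[c->slot] = sum;                  // One write per thread, own slot
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) values[i] = i % 200;

    pthread_t th[NUM_THREADS];
    chunk_t chunks[NUM_THREADS];
    int per_thread = N / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {
        chunks[t] = (chunk_t){ t * per_thread,
                               (t == NUM_THREADS - 1) ? N : (t + 1) * per_thread,
                               t };
        pthread_create(&th[t], NULL, process_chunk, &chunks[t]);
    }

    double total = 0;
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(th[t], NULL);
        total += partial[t];                 // Small sequential merge
    }
    printf("total = %f\n", total);
    return 0;
}
```

The alternative, giving each thread one *operation* (one scales, one clamps, one sums), forces every element through every thread and turns the hand-offs into synchronization points.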
Principle 2: Use Lock-Free Structures for Hot Paths
For data structures accessed by all threads, lock-free algorithms avoid serialization:
```c
// Lock-free counter using atomic operations
#include <stdatomic.h>

#define MAX_THREADS 64  // Assumed upper bound for this example

// Lock-free: concurrent increments don't block
typedef struct {
    _Atomic long value;
} AtomicCounter;

void atomic_increment(AtomicCounter *counter) {
    atomic_fetch_add(&counter->value, 1);  // Hardware atomic
    // No lock, no waiting, true parallel increments
    // (though still cache coherency traffic)
}

// Even better: Per-thread counters with lazy aggregation
typedef struct {
    struct {
        long count;
        char padding[56];   // Pad to a full cache line to avoid false sharing
    } counts[MAX_THREADS];  // Each thread's slot lives on its own cache line
} DistributedCounter;

void distributed_increment(DistributedCounter *counter, int thread_id) {
    counter->counts[thread_id].count++;  // No atomics, no sharing!
}

long distributed_read(DistributedCounter *counter) {
    long total = 0;
    for (int i = 0; i < MAX_THREADS; i++) {
        total += counter->counts[i].count;  // Aggregate when needed
    }
    return total;
}

// The distributed version scales perfectly - each thread
// works on its own cache line with zero contention.
```

The best parallel workloads are "embarrassingly parallel"—tasks that require no communication or synchronization between threads. Examples: independent web requests, parallel file processing, map operations on independent data. Strive to structure your work to approach this ideal. Each ounce of communication you remove yields a pound of parallel speedup.
Modern kernels include sophisticated features specifically designed to maximize parallel efficiency. Understanding these helps in writing applications that fully leverage hardware parallelism.
1. CPU Affinity
CPU affinity allows binding threads to specific CPUs, providing control over where threads execute:
```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

void *worker(void *arg) {
    int cpu = *(int *)arg;

    // Create a CPU set containing a single CPU
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);

    // Bind this thread to the specified CPU
    pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);

    printf("Thread bound to CPU %d, now running on CPU %d\n",
           cpu, sched_getcpu());

    // Benefits:
    //   1. Cache locality - thread's data stays in this CPU's cache
    //   2. Predictability - no migrations between CPUs
    //   3. NUMA optimization - can bind to a CPU near the thread's memory

    // Now do work that benefits from staying on one CPU...

    return NULL;
}

int main() {
    pthread_t threads[4];
    int cpus[4] = {0, 1, 2, 3};

    // Create threads, each bound to a specific CPU
    for (int i = 0; i < 4; i++) {
        pthread_create(&threads[i], NULL, worker, &cpus[i]);
    }
    for (int i = 0; i < 4; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}
```

2. NUMA Optimization
Non-Uniform Memory Access (NUMA) systems have memory local to each CPU socket. The kernel provides NUMA-aware allocation:
Use numa_alloc_onnode() to allocate memory on a specific node, so that a thread running on that node's CPUs touches local memory instead of paying remote-access latency across the interconnect.
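A hedged sketch of node-local allocation using libnuma (assumes the libnuma development package is installed and the program is linked with -lnuma):

```c
// Allocate memory on the NUMA node of the CPU we are currently running on.
// Compile with: gcc -D_GNU_SOURCE numa_demo.c -lnuma
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }

    // Find out which NUMA node the current CPU belongs to...
    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);

    // ...and allocate a buffer on that node, so a thread kept on this node
    // (e.g. via CPU affinity) reads and writes local memory.
    size_t size = 64 * 1024 * 1024;
    double *buf = numa_alloc_onnode(size, node);
    if (buf == NULL) {
        perror("numa_alloc_onnode");
        return 1;
    }

    printf("CPU %d is on node %d; allocated %zu bytes of node-local memory\n",
           cpu, node, size);

    numa_free(buf, size);
    return 0;
}
```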
3. Scheduler Features for Parallelism

| Feature | Purpose | Benefit |
|---|---|---|
| Per-CPU run queues | Each CPU has its own queue | No global lock for scheduling |
| Work stealing | Idle CPUs take work from busy ones | Automatic load balancing |
| Cache-hot balancing | Prefer migrating threads without hot data | Preserves cache locality |
| Wake-to-idle | Wake threads preferring idle CPUs | Faster wakeup, better distribution |
| cgroup bandwidth control | Limit CPU time per group | Fair sharing across applications |
```c
// Using scheduler hints for better parallelism

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <sys/syscall.h>

// Request wake-to-idle behavior for a thread:
// the kernel will try to wake it on an idle CPU
// rather than the CPU that woke it (which might be busy)

struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    // ... other fields
};

// Real-time scheduling for latency-sensitive threads
void set_realtime_priority(pthread_t thread, int priority) {
    struct sched_param param;
    param.sched_priority = priority;  // 1-99, higher = more priority

    // SCHED_FIFO: Real-time, no time slicing
    // SCHED_RR:   Real-time with round-robin among same priority
    pthread_setschedparam(thread, SCHED_FIFO, &param);

    // Warning: requires elevated privileges; can starve other processes!
}

// Hints to the scheduler about thread behavior:
// Linux 5.7+ supports CLONE_INTO_CGROUP for better placement
// Linux 5.3+ supports SCHED_FLAG_UTIL_CLAMP for frequency hints
```

Modern kernel schedulers are highly optimized for parallel workloads. Features like automatic load balancing, cache-aware migration, and wake-to-idle work transparently without application involvement. For most applications, simply creating an appropriate number of long-lived threads and minimizing synchronization is sufficient—the kernel handles distribution efficiently.
We've explored the defining capability of kernel-level threads: enabling true parallelism—genuine simultaneous execution on multiple CPUs. The key insights: concurrency is about program structure, while parallelism is about simultaneous execution. Because each kernel thread is independently schedulable, an SMP scheduler with per-CPU run queues can run N threads on N cores at the same instant. Amdahl's Law caps the achievable speedup by the program's sequential fraction, and practical costs (cache coherency traffic, memory bandwidth, synchronization) often matter as much as the theory, which is why data partitioning and minimal sharing dominate parallel design.
What's next:
True parallelism comes with a cost. Every kernel thread consumes kernel resources—memory for data structures, kernel stack space, CPU cycles for scheduling. The next page examines overhead considerations: what resources kernel threads consume, how overhead scales with thread count, and when this overhead becomes a limiting factor.
You now understand true parallelism—the fundamental capability that makes kernel-level threads essential for utilizing modern multi-core hardware. This knowledge is crucial for designing systems that effectively leverage parallel hardware and for understanding why certain architectural decisions (reduced synchronization, data partitioning, appropriate thread counts) have such profound performance impacts.