The Many-to-Many model (also called M:N threading or hybrid threading) represents the most sophisticated approach to thread mapping. It multiplexes M user-level threads onto N kernel-level threads, where M ≥ N. This allows applications to create lightweight user threads in large numbers while still achieving true parallelism across multiple CPU cores.
This model attempts to combine the advantages of both Many-to-One (fast, lightweight threads) and One-to-One (true parallelism, independent blocking) while avoiding their respective limitations. The trade-off is implementation complexity—the runtime must manage both user-level scheduling and coordination with kernel threads.
By the end of this page, you will understand the Many-to-Many model's architecture and how user threads are scheduled onto kernel threads, appreciate its advantages for high-concurrency applications, recognize the complexity challenges that limit its adoption, and identify modern systems (like Go's goroutines) that successfully implement this model.
The Many-to-Many model introduces a pool of kernel threads that serve as execution vehicles for a potentially much larger population of user-level threads. The thread library (or language runtime) schedules user threads onto available kernel threads, dynamically allocating kernel resources as needed.
Key Architectural Properties:
Two-Level Scheduling — The system has two schedulers: a user-level scheduler in the thread library that schedules user threads onto kernel threads, and the kernel scheduler that schedules kernel threads onto CPU cores.
Decoupled Thread Counts — The number of user threads (M) is independent of the number of kernel threads (N). An application might have 1 million user threads running on 8 kernel threads (one per CPU core). The sketch after this list demonstrates this decoupling.
Dynamic Kernel Thread Pool — The number of kernel threads can adjust dynamically based on workload. When threads block on I/O, new kernel threads may be created to maintain parallelism.
Lightweight User Threads — User threads have minimal overhead (small stacks, no kernel resources), enabling creation of millions of concurrent tasks.
True Parallelism — With N kernel threads, up to N user threads can execute truly in parallel on N CPU cores.
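To make the decoupling concrete, here is a minimal Go sketch. It parks 100,000 goroutines on a channel while the runtime multiplexes them over roughly one scheduling context per core; the count and the trivial workload are arbitrary choices for illustration, not a prescribed pattern.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	const m = 100_000 // user threads; scale toward millions given enough RAM
	release := make(chan struct{})

	var wg sync.WaitGroup
	for i := 0; i < m; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-release // park: a few KB of stack, no kernel thread held
		}()
	}

	// M is 100,000; N is roughly one kernel thread per core.
	fmt.Printf("user threads (G): %d, scheduling contexts (P): %d\n",
		runtime.NumGoroutine(), runtime.GOMAXPROCS(0))

	close(release) // unpark all goroutines
	wg.Wait()
}
```

Each parked goroutine costs only a few kilobytes of stack, which is why scaling the count into the millions is feasible with sufficient memory.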
| Characteristic | Many-to-Many Behavior | Implication |
|---|---|---|
| User Thread Count | Virtually unlimited (millions possible) | Lightweight concurrency at massive scale |
| Kernel Thread Count | Typically matches core count, can grow | Full CPU utilization with minimal overhead |
| Thread Scheduling | Two-level: user + kernel scheduler | Complex but flexible |
| Maximum CPU Utilization | All cores (100% of all CPUs) | True parallelism maintained |
| Blocking Behavior | Complex - requires scheduler cooperation | Can be handled if runtime manages it |
| Implementation Complexity | High | Requires sophisticated runtime |
Many-to-Many trades implementation complexity for optimal resource usage. It combines lightweight thread creation (like Many-to-One) with true parallelism (like One-to-One). The runtime must carefully manage the mapping between user and kernel threads to maintain this balance.
The heart of the Many-to-Many model is the user-level scheduler—a sophisticated component within the thread library or language runtime that decides which user thread runs on which kernel thread. This scheduler must make intelligent decisions while minimizing overhead.
The Scheduling Algorithm:
A typical Many-to-Many user-level scheduler uses a multi-queue approach with work stealing:
```c
// Conceptual Many-to-Many Scheduler (Based on Go Runtime Design)

struct UserThread {
    uintptr_t stack_pointer;     // Current stack position
    void *stack_base;            // Stack memory (small: 2-8KB initially)
    size_t stack_size;           // Current stack size (growable)
    thread_state_t state;        // RUNNABLE, RUNNING, BLOCKED, DEAD
    UserThread *next;            // Queue linkage
    void (*entry_func)(void*);   // Thread entry point
    void *arg;                   // Argument to entry
};

struct Processor {
    KernelThread *kernel_thread; // Associated kernel thread (1:1)
    UserThread *running;         // Currently executing user thread
    LocalRunQueue local_queue;   // Per-processor run queue
    int id;                      // Processor identifier
};

// Global state
GlobalRunQueue global_queue;     // Overflow queue for user threads
Processor processors[NUM_CORES]; // One Processor per CPU core
Lock global_lock;                // Lock for global queue

/*
 * Main scheduler loop - runs on each kernel thread
 *
 * Each kernel thread (Processor) loops forever, finding and
 * executing user threads. When a user thread blocks or yields,
 * control returns here to find the next thread.
 */
void scheduler_loop(Processor *p) {
    while (true) {
        UserThread *ut = find_runnable_thread(p);
        if (ut != NULL) {
            p->running = ut;
            ut->state = RUNNING;
            /*
             * Context switch to user thread
             * This swaps stacks - when the user thread yields or blocks,
             * we return here to find the next thread.
             */
            switch_to_user_thread(p, ut);

            // Returned from user thread (it yielded, blocked, or finished)
            handle_thread_return(p, ut);
        } else {
            // No work available - try stealing from other processors
            ut = steal_work(p);
            if (ut == NULL) {
                // Really no work - park this kernel thread
                park_processor(p);
            }
        }
    }
}

/*
 * Find a runnable user thread for this processor
 * Priority: local queue > global queue > steal from others
 */
UserThread *find_runnable_thread(Processor *p) {
    // 1. Check local run queue first (fast, no locking)
    if (!is_empty(&p->local_queue)) {
        return dequeue(&p->local_queue);
    }

    // 2. Check global run queue (requires lock, less frequent)
    if (!is_empty(&global_queue)) {
        acquire(&global_lock);
        // Grab batch of threads to amortize locking cost
        int count = min(BATCH_SIZE, queue_length(&global_queue));
        UserThread *batch = dequeue_batch(&global_queue, count);
        release(&global_lock);

        // Put extras in local queue, return first one
        enqueue_batch(&p->local_queue, batch->next);
        return batch;
    }

    // 3. Try work stealing from other processors
    return steal_work(p);
}

/*
 * Work stealing: Take threads from busy processors
 *
 * This is key to load balancing. When a processor is idle,
 * it scans other processors' local queues and steals half
 * their work. This prevents hot spots without global coordination.
 */
UserThread *steal_work(Processor *p) {
    // Randomize starting point to reduce contention
    int start = random() % NUM_PROCESSORS;

    for (int i = 0; i < NUM_PROCESSORS; i++) {
        int target = (start + i) % NUM_PROCESSORS;
        if (target == p->id) continue; // Don't steal from self

        Processor *victim = &processors[target];
        int victim_len = queue_length(&victim->local_queue);
        if (victim_len > 1) {
            // Steal half the victim's queue
            int steal_count = victim_len / 2;
            UserThread *stolen = steal_from_queue(
                &victim->local_queue, steal_count);
            if (stolen != NULL) {
                // Put extras in our local queue, return first
                enqueue_batch(&p->local_queue, stolen->next);
                return stolen;
            }
        }
    }
    return NULL; // No work to steal
}

/*
 * Handle a user thread that yielded, blocked, or completed
 */
void handle_thread_return(Processor *p, UserThread *ut) {
    p->running = NULL;

    switch (ut->state) {
    case RUNNABLE:
        // Thread yielded - put back in local queue
        enqueue(&p->local_queue, ut);
        break;
    case BLOCKED:
        // Thread blocked on I/O or sync primitive
        // Don't put in run queue - wake handler will re-queue
        break;
    case DEAD:
        // Thread finished - free resources
        free_user_thread(ut);
        break;
    }
}
```

The pseudocode above closely mirrors Go's runtime scheduler (the 'GMP' model: Goroutines, M threads, Processors). Go's goroutines are user-level threads scheduled onto a pool of OS threads, with work stealing for load balancing. This design enables Go to handle millions of goroutines on standard hardware.
The blocking problem—where a blocking system call freezes all user threads sharing a kernel thread—requires careful handling in the Many-to-Many model. Unlike Many-to-One (where the problem has no real solution) and One-to-One (where it never arises), Many-to-Many can mitigate blocking through several techniques.
When a user thread executing on kernel thread K makes a blocking system call (e.g., read()), kernel thread K blocks. If other user threads were queued to run on K, they cannot execute until K unblocks. The Many-to-Many model needs strategies to maintain parallelism despite blocking calls.
Go's Approach in Detail:
Go's runtime provides one of the best modern examples of handling blocking in a Many-to-Many model:
Network I/O: All network operations use non-blocking sockets internally. When a goroutine would block on network I/O, the runtime registers interest with epoll/kqueue/IOCP and parks the goroutine. The network poller (running on its own goroutine) wakes goroutines when their I/O is ready.
System Calls: When a goroutine enters a potentially blocking system call, Go's runtime marks the kernel thread (M) as in-syscall and decouples it from its scheduling context (P). If the call does not return quickly, the P is handed off to another kernel thread (creating one if necessary) so the remaining goroutines keep running. When the syscall finally returns, the thread tries to reacquire a P; if none is available, its goroutine is placed on a run queue and the thread parks.
File I/O: File I/O is genuinely blocking on most systems; portable asynchronous file I/O is largely unavailable, and Go's runtime does not rely on platform-specific async file interfaces. Go detects this and spawns additional kernel threads to maintain parallelism while file I/O threads block.
This approach means goroutines 'block' from the programmer's perspective (simple, intuitive API) while the runtime prevents actual kernel thread blocking in most cases.
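The kernel-thread handoff can be observed indirectly. The sketch below (Unix-only) uses the raw syscall package to park goroutines in genuinely blocking read() calls that bypass the netpoller, then samples the threadcreate profile as a rough count of kernel threads spawned; the blocker count and sleep duration are arbitrary illustration choices.

```go
//go:build unix

package main

import (
	"fmt"
	"runtime/pprof"
	"sync"
	"syscall"
	"time"
)

func main() {
	threads := pprof.Lookup("threadcreate") // kernel threads created so far
	fmt.Println("threads before:", threads.Count())

	const blockers = 16 // arbitrary number of blocking goroutines
	var wg sync.WaitGroup
	writeEnds := make([]int, blockers)

	for i := 0; i < blockers; i++ {
		fds := make([]int, 2)
		if err := syscall.Pipe(fds); err != nil {
			panic(err)
		}
		writeEnds[i] = fds[1]

		wg.Add(1)
		go func(rd int) {
			defer wg.Done()
			buf := make([]byte, 1)
			// Raw read() on a blocking fd: it bypasses the netpoller,
			// so this kernel thread (M) really blocks in the kernel.
			// The runtime hands our P to another M, spawning one if needed.
			syscall.Read(rd, buf)
		}(fds[0])
	}

	time.Sleep(100 * time.Millisecond) // give the readers time to block
	fmt.Println("threads while blocked:", threads.Count())

	for _, wr := range writeEnds {
		syscall.Write(wr, []byte("x")) // unblock each reader
	}
	wg.Wait()
}
```

The complete server example below shows the same machinery at the application level, where network I/O goes through the netpoller instead of blocking kernel threads.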
```go
package main

import (
	"net"
	"runtime"
)

/*
 * Go's Many-to-Many Model in Action
 *
 * This example demonstrates how Go handles thousands of
 * "blocking" network operations without actually blocking
 * kernel threads.
 */

func handleConnection(conn net.Conn) {
	defer conn.Close()
	buffer := make([]byte, 1024)

	for {
		// This "blocks" from the goroutine's perspective.
		// But internally:
		//   1. Go sets the socket to non-blocking
		//   2. Attempts read - if EAGAIN, goroutine is parked
		//   3. epoll/kqueue monitors the socket
		//   4. When data arrives, goroutine is unparked
		//   5. The kernel thread was NOT blocked during the wait
		n, err := conn.Read(buffer)
		if err != nil {
			return
		}
		// Echo back
		conn.Write(buffer[:n])
	}
}

func main() {
	// Limit to 4 simultaneously executing kernel threads (4 Ps)
	runtime.GOMAXPROCS(4)

	ln, _ := net.Listen("tcp", ":8080")

	/*
	 * This server handles 10,000 concurrent connections using
	 * only 4 kernel threads. Each connection has its own goroutine,
	 * but goroutines are multiplexed onto the 4 kernel threads.
	 *
	 * Many-to-Many in action:
	 * - M = 10,000 goroutines (user threads)
	 * - N = 4 kernel threads (plus netpoller)
	 * - All 10,000 can be "blocked" on Read() simultaneously
	 * - But 0 kernel threads are actually blocked on I/O
	 * - The netpoller watches all 10,000 sockets efficiently
	 */
	for {
		conn, _ := ln.Accept()
		go handleConnection(conn) // Creates lightweight goroutine
	}
}

/*
 * Why this works:
 *
 * Traditional One-to-One model:
 *   10,000 connections = 10,000 kernel threads
 *   Each thread: ~8MB stack = 80GB memory (!)
 *   Plus kernel overhead per thread
 *
 * Go's Many-to-Many model:
 *   10,000 connections = 10,000 goroutines
 *   Each goroutine: ~2KB initial stack = 20MB memory
 *   Only 4-5 kernel threads total
 *   Non-blocking I/O means no thread is wasted waiting
 *
 * This is the power of Many-to-Many with proper blocking handling.
 */
```

The Many-to-Many model, when properly implemented, combines the strengths of both simpler models while avoiding their most severe limitations:
| Characteristic | Many-to-One | One-to-One | Many-to-Many |
|---|---|---|---|
| Max Threads | Millions | Thousands | Millions |
| True Parallelism | ✗ No | ✓ Yes | ✓ Yes |
| Thread Creation Cost | ~1μs | ~10μs | ~1μs |
| Memory per Thread | ~4KB | ~8MB + kernel | ~2-8KB |
| Blocking Behavior | All freeze | Independent | Mitigated |
| Implementation Complexity | Medium | Low | High |
| Kernel Awareness | None | Full | Partial |
Ideal Use Cases:
The Many-to-Many model excels in scenarios requiring:
High-Concurrency Network Servers — Web servers, chat servers, and API gateways handling thousands to millions of simultaneous connections. Each connection can have its own lightweight thread.
Massively Concurrent Task Processing — Systems where each task is a natural unit of work but tasks number in the millions. Examples: web crawlers, data processing pipelines, simulation systems.
Actor Systems and Message-Passing Architectures — Where each actor is a lightweight 'process' that should be independently schedulable but numerous.
Async/Await without Callback Hell — Languages using Many-to-Many can provide async/await syntax where each 'await' can yield the user thread without blocking the underlying kernel thread.
Microservices with High Fan-Out — Services that make many outbound requests per incoming request benefit from lightweight threads that can wait on those requests simultaneously.
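As a sketch of the fan-out pattern from the last item, the hypothetical fanOut function below issues one outbound HTTP request per URL concurrently; every goroutine waits on its round trip while the netpoller keeps the kernel threads free. The URLs and timeout are placeholder values.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// fanOut issues one outbound request per URL concurrently. Each
// goroutine "blocks" on its HTTP round trip, but the netpoller keeps
// the kernel threads free, so 100 in-flight requests cost ~100
// cheap goroutines rather than 100 OS threads.
func fanOut(urls []string) map[string]int {
	client := &http.Client{Timeout: 2 * time.Second} // placeholder timeout

	var (
		mu      sync.Mutex
		results = make(map[string]int)
		wg      sync.WaitGroup
	)

	for _, url := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			resp, err := client.Get(u)
			if err != nil {
				return // drop failures in this sketch
			}
			resp.Body.Close()

			mu.Lock()
			results[u] = resp.StatusCode
			mu.Unlock()
		}(url)
	}
	wg.Wait()
	return results
}

func main() {
	// Placeholder URL; a real service would fan out to its backends.
	fmt.Println(fanOut([]string{"https://example.com"}))
}
```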
Many-to-Many hits the sweet spot for applications needing thousands to millions of concurrent tasks with good CPU utilization. It's why Go, Erlang, and similar runtimes have become popular for cloud infrastructure, network services, and concurrent data processing.
The Many-to-Many model's advantages come at the cost of significant implementation complexity. This complexity is why most languages and systems have historically avoided Many-to-Many in favor of simpler One-to-One threading.
Building a correct, performant Many-to-Many runtime is extremely difficult. It requires deep integration with the OS, careful handling of edge cases (signals, fork, exec), and sophisticated scheduling algorithms. This is why successful implementations (Go, Erlang/BEAM, Java Loom) are major engineering efforts that took years to mature.
Why Solaris Moved Away:
Solaris was a notable Many-to-Many implementation that eventually abandoned the model:
Initial Design (Solaris 2.x): Used a two-level thread library (libthread) multiplexing user threads onto LWPs (Lightweight Processes, which mapped to kernel threads).
Problems Encountered: Signal delivery semantics were extremely difficult to implement correctly across two schedulers; the user-level and kernel schedulers made conflicting decisions that produced hard-to-diagnose priority and latency anomalies; debuggers and profilers struggled with user threads the kernel could not see; and as LWP operations grew cheaper, the performance advantage over One-to-One largely evaporated.
Resolution (Solaris 9+): Moved to One-to-One threading. Every user thread became a LWP. Simpler, more predictable, and hardware got fast enough that One-to-One overhead was acceptable for most workloads.
This history lesson shows that Many-to-Many can fail if not carefully designed and maintained.
| System | Status | Outcome |
|---|---|---|
| Solaris LWP | Abandoned | Switched to 1:1 (Solaris 9+) |
| HP-UX threads | Abandoned | Switched to 1:1 |
| IRIX threads | Abandoned (platform dead) | Never completed transition |
| Windows Fibers | Limited | Optional, rarely used, manual scheduling |
| GNU Pth | Niche | Still M:1 not M:N, limited use |
| Go goroutines | Success | Mature, widely used, handles edge cases |
| Erlang/BEAM | Success | Mature, specialized for actors |
| Java Project Loom | Shipped | Virtual threads, standard since JDK 21 |
Go and Erlang prove that Many-to-Many can work brilliantly when the runtime is designed from the ground up around it. Both languages were designed with Many-to-Many in mind, not retrofitted. They handle blocking carefully, provide excellent tooling, and have millions of users proving their implementations work.
Despite the complexity, several modern runtimes have successfully implemented Many-to-Many threading. These implementations demonstrate that the model is viable when carefully designed.
| Implementation | User Thread Name | Min Stack Size | Preemptive | Blocking Handling |
|---|---|---|---|---|
| Go | Goroutine | 2KB (growable) | Yes (since 1.14) | Dynamic M spawn, netpoller |
| Erlang BEAM | Process | ~1KB | Yes (reduction counting) | Never blocks (pure message passing) |
| Java Loom | Virtual Thread | Platform-controlled | No (cooperative) | Unmount on block, carrier continues |
| Kotlin | Coroutine | Very small | No (cooperative) | Dispatcher flexibility |
| Rust Tokio | Task | ~200 bytes state | No (cooperative) | Async all the way down |
Go's GMP Model in Detail:
Go's scheduler is often considered the reference for modern Many-to-Many implementations. It uses three entities: G (goroutine), the lightweight user thread with its own growable stack; M (machine), an OS kernel thread that actually executes code; and P (processor), a scheduling context that owns a local run queue, with exactly GOMAXPROCS Ps in existence.
The key insight: A goroutine (G) needs both an M and a P to run. The P owns the run queue. When a goroutine blocks in a syscall, the M is 'stuck' in the syscall, but the P can be handed off to a different M to continue running other goroutines.
This P abstraction allows Go to keep up to GOMAXPROCS goroutines executing simultaneously, even as Ms block and unblock on syscalls.
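One way to see that Ps bound parallelism is to time a CPU-bound workload under different GOMAXPROCS settings. The rough benchmark sketch below (the burn and timeRun helpers and the iteration counts are illustrative choices) runs four workers serially on one P, then in parallel on four Ps, where wall time should drop roughly 4x on a machine with at least four cores.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// burn is CPU-bound: it never blocks, so how many copies run at once
// is bounded by the number of Ps (GOMAXPROCS), not by Ms or Gs.
func burn(n int) int {
	x := 0
	for i := 0; i < n; i++ {
		x += i
	}
	return x
}

// timeRun runs `workers` CPU-bound goroutines under `procs` Ps and
// returns the wall-clock time.
func timeRun(procs, workers int) time.Duration {
	runtime.GOMAXPROCS(procs)
	results := make([]int, workers) // one slot per worker, no races

	start := time.Now()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			results[w] = burn(50_000_000) // arbitrary iteration count
		}(w)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	fmt.Println("P=1:", timeRun(1, 4)) // workers time-slice on one P
	fmt.Println("P=4:", timeRun(4, 4)) // workers run in parallel
}
```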
Successful M:N runtimes share common patterns: designed from the start for M:N (not retrofitted), integrated I/O handling (netpoller or non-blocking I/O), work stealing for load balance, careful handling of blocking calls, excellent debugging/profiling tools, and years of iteration and hardening. They're major engineering investments that pay off for high-concurrency workloads.
The Many-to-Many model represents the most sophisticated approach to thread mapping, combining lightweight concurrency with true parallelism. To consolidate the key insights: M lightweight user threads are multiplexed onto N kernel threads by a two-level scheduler; work stealing balances load across kernel threads; blocking is mitigated through non-blocking I/O, pollers, and kernel-thread handoff; the cost is a sophisticated runtime that is genuinely hard to build; and mature implementations (Go, Erlang/BEAM, Java virtual threads) prove the investment pays off for high-concurrency workloads.
What's Next:
We've now covered Many-to-One, One-to-One, and Many-to-Many—the three fundamental threading models. The next page examines the Two-Level model, a variation that combines M:N threading with the ability to directly bind critical user threads to dedicated kernel threads for predictable performance.
You now understand the Many-to-Many threading model's architecture, scheduling mechanisms, advantages for high-concurrency workloads, complexity challenges, and modern implementations like Go's goroutines. This prepares you to understand the Two-Level model and to make informed choices about threading approaches for different application requirements.