The Many-to-Many model (also called M:N threading or hybrid threading) represents the most sophisticated approach to thread mapping. It multiplexes M user-level threads onto N kernel-level threads, where M ≥ N. This allows applications to create lightweight user threads in large numbers while still achieving true parallelism across multiple CPU cores.
This model attempts to combine the advantages of both Many-to-One (fast, lightweight threads) and One-to-One (true parallelism, independent blocking) while avoiding their respective limitations. The trade-off is implementation complexity—the runtime must manage both user-level scheduling and coordination with kernel threads.
By the end of this page, you will understand the Many-to-Many model's architecture and how user threads are scheduled onto kernel threads, appreciate its advantages for high-concurrency applications, recognize the complexity challenges that limit its adoption, and identify modern systems (like Go's goroutines) that successfully implement this model.
The Many-to-Many model introduces a pool of kernel threads that serve as execution vehicles for a potentially much larger population of user-level threads. The thread library (or language runtime) schedules user threads onto available kernel threads, dynamically allocating kernel resources as needed.
Key Architectural Properties:
Two-Level Scheduling — The system has two schedulers: a user-level scheduler in the thread library that schedules user threads onto kernel threads, and the kernel scheduler that schedules kernel threads onto CPU cores.
Decoupled Thread Counts — The number of user threads (M) is independent of the number of kernel threads (N). An application might have 1 million user threads running on 8 kernel threads (one per CPU core). The sketch after this list demonstrates this decoupling.
Dynamic Kernel Thread Pool — The number of kernel threads can adjust dynamically based on workload. When threads block on I/O, new kernel threads may be created to maintain parallelism.
Lightweight User Threads — User threads have minimal overhead (small stacks, no kernel resources), enabling creation of millions of concurrent tasks.
True Parallelism — With N kernel threads, up to N user threads can execute truly in parallel on N CPU cores.
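To make the decoupling concrete, here is a minimal Go sketch. It parks 100,000 goroutines on a channel while the runtime multiplexes them over roughly one scheduling context per core; the count and the trivial workload are arbitrary choices for illustration, not a prescribed pattern.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	const m = 100_000 // user threads; scale toward millions given enough RAM
	release := make(chan struct{})

	var wg sync.WaitGroup
	for i := 0; i < m; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-release // park: a few KB of stack, no kernel thread held
		}()
	}

	// M is 100,000; N is roughly one kernel thread per core.
	fmt.Printf("user threads (G): %d, scheduling contexts (P): %d\n",
		runtime.NumGoroutine(), runtime.GOMAXPROCS(0))

	close(release) // unpark all goroutines
	wg.Wait()
}
```

Each parked goroutine costs only a few kilobytes of stack, which is why scaling the count into the millions is feasible with sufficient memory.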
| Characteristic | Many-to-Many Behavior | Implication |
|---|---|---|
| User Thread Count | Virtually unlimited (millions possible) | Lightweight concurrency at massive scale |
| Kernel Thread Count | Typically matches core count, can grow | Full CPU utilization with minimal overhead |
| Thread Scheduling | Two-level: user + kernel scheduler | Complex but flexible |
| Maximum CPU Utilization | All cores (100% of all CPUs) | True parallelism maintained |
| Blocking Behavior | Complex - requires scheduler cooperation | Can be handled if runtime manages it |
| Implementation Complexity | High | Requires sophisticated runtime |
Many-to-Many trades implementation complexity for optimal resource usage. It combines lightweight thread creation (like Many-to-One) with true parallelism (like One-to-One). The runtime must carefully manage the mapping between user and kernel threads to maintain this balance.
The heart of the Many-to-Many model is the user-level scheduler—a sophisticated component within the thread library or language runtime that decides which user thread runs on which kernel thread. This scheduler must make intelligent decisions while minimizing overhead.
The Scheduling Algorithm:
A typical Many-to-Many user-level scheduler uses a multi-queue approach with work stealing:
```c
// Conceptual Many-to-Many Scheduler (Based on Go Runtime Design)

struct UserThread {
    uintptr_t stack_pointer;     // Current stack position
    void *stack_base;            // Stack memory (small: 2-8KB initially)
    size_t stack_size;           // Current stack size (growable)
    thread_state_t state;        // RUNNABLE, RUNNING, BLOCKED, DEAD
    UserThread *next;            // Queue linkage
    void (*entry_func)(void*);   // Thread entry point
    void *arg;                   // Argument to entry
};

struct Processor {
    KernelThread *kernel_thread; // Associated kernel thread (1:1)
    UserThread *running;         // Currently executing user thread
    LocalRunQueue local_queue;   // Per-processor run queue
    int id;                      // Processor identifier
};

// Global state
GlobalRunQueue global_queue;     // Overflow queue for user threads
Processor processors[NUM_CORES]; // One Processor per CPU core
Lock global_lock;                // Lock for global queue

/*
 * Main scheduler loop - runs on each kernel thread
 *
 * Each kernel thread (Processor) loops forever, finding and
 * executing user threads. When a user thread blocks or yields,
 * control returns here to find the next thread.
 */
void scheduler_loop(Processor *p) {
    while (true) {
        UserThread *ut = find_runnable_thread(p);
        if (ut != NULL) {
            p->running = ut;
            ut->state = RUNNING;
            /*
             * Context switch to user thread
             * This swaps stacks - when the user thread yields or blocks,
             * we return here to find the next thread.
             */
            switch_to_user_thread(p, ut);

            // Returned from user thread (it yielded, blocked, or finished)
            handle_thread_return(p, ut);
        } else {
            // No work available - try stealing from other processors
            ut = steal_work(p);
            if (ut == NULL) {
                // Really no work - park this kernel thread
                park_processor(p);
            }
        }
    }
}

/*
 * Find a runnable user thread for this processor
 * Priority: local queue > global queue > steal from others
 */
UserThread *find_runnable_thread(Processor *p) {
    // 1. Check local run queue first (fast, no locking)
    if (!is_empty(&p->local_queue)) {
        return dequeue(&p->local_queue);
    }

    // 2. Check global run queue (requires lock, less frequent)
    if (!is_empty(&global_queue)) {
        acquire(&global_lock);
        // Grab batch of threads to amortize locking cost
        int count = min(BATCH_SIZE, queue_length(&global_queue));
        UserThread *batch = dequeue_batch(&global_queue, count);
        release(&global_lock);

        // Put extras in local queue, return first one
        enqueue_batch(&p->local_queue, batch->next);
        return batch;
    }

    // 3. Try work stealing from other processors
    return steal_work(p);
}

/*
 * Work stealing: Take threads from busy processors
 *
 * This is key to load balancing. When a processor is idle,
 * it scans other processors' local queues and steals half
 * their work. This prevents hot spots without global coordination.
 */
UserThread *steal_work(Processor *p) {
    // Randomize starting point to reduce contention
    int start = random() % NUM_PROCESSORS;

    for (int i = 0; i < NUM_PROCESSORS; i++) {
        int target = (start + i) % NUM_PROCESSORS;
        if (target == p->id) continue; // Don't steal from self

        Processor *victim = &processors[target];
        int victim_len = queue_length(&victim->local_queue);
        if (victim_len > 1) {
            // Steal half the victim's queue
            int steal_count = victim_len / 2;
            UserThread *stolen = steal_from_queue(
                &victim->local_queue, steal_count);
            if (stolen != NULL) {
                // Put extras in our local queue, return first
                enqueue_batch(&p->local_queue, stolen->next);
                return stolen;
            }
        }
    }
    return NULL; // No work to steal
}

/*
 * Handle a user thread that yielded, blocked, or completed
 */
void handle_thread_return(Processor *p, UserThread *ut) {
    p->running = NULL;

    switch (ut->state) {
    case RUNNABLE:
        // Thread yielded - put back in local queue
        enqueue(&p->local_queue, ut);
        break;
    case BLOCKED:
        // Thread blocked on I/O or sync primitive
        // Don't put in run queue - wake handler will re-queue
        break;
    case DEAD:
        // Thread finished - free resources
        free_user_thread(ut);
        break;
    }
}
```

The pseudocode above closely mirrors Go's runtime scheduler (the 'GMP' model: Goroutines, M threads, Processors). Go's goroutines are user-level threads scheduled onto a pool of OS threads, with work stealing for load balancing. This design enables Go to handle millions of goroutines on standard hardware.
The blocking problem—where a blocking system call freezes all user threads sharing a kernel thread—requires careful handling in the Many-to-Many model. Unlike Many-to-One (where the problem has no real solution) and One-to-One (where it never arises), Many-to-Many can mitigate blocking through several techniques.
When a user thread executing on kernel thread K makes a blocking system call (e.g., read()), kernel thread K blocks. If other user threads were queued to run on K, they cannot execute until K unblocks. The Many-to-Many model needs strategies to maintain parallelism despite blocking calls.
Go's Approach in Detail:
Go's runtime provides one of the best modern examples of handling blocking in a Many-to-Many model:
Network I/O: All network operations use non-blocking sockets internally. When a goroutine would block on network I/O, the runtime registers interest with epoll/kqueue/IOCP and parks the goroutine. The network poller (running on its own goroutine) wakes goroutines when their I/O is ready.
System Calls: When a goroutine enters a potentially blocking system call, Go's runtime marks the kernel thread (M) as in-syscall and decouples it from its scheduling context (P). If the call does not return quickly, the P is handed off to another kernel thread (creating one if necessary) so the remaining goroutines keep running. When the syscall finally returns, the thread tries to reacquire a P; if none is available, its goroutine is placed on a run queue and the thread parks.
File I/O: File I/O is genuinely blocking on most systems; portable asynchronous file I/O is largely unavailable, and Go's runtime does not rely on platform-specific async file interfaces. Go detects this and spawns additional kernel threads to maintain parallelism while file I/O threads block.
This approach means goroutines 'block' from the programmer's perspective (simple, intuitive API) while the runtime prevents actual kernel thread blocking in most cases.
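The kernel-thread handoff can be observed indirectly. The sketch below (Unix-only) uses the raw syscall package to park goroutines in genuinely blocking read() calls that bypass the netpoller, then samples the threadcreate profile as a rough count of kernel threads spawned; the blocker count and sleep duration are arbitrary illustration choices.

```go
//go:build unix

package main

import (
	"fmt"
	"runtime/pprof"
	"sync"
	"syscall"
	"time"
)

func main() {
	threads := pprof.Lookup("threadcreate") // kernel threads created so far
	fmt.Println("threads before:", threads.Count())

	const blockers = 16 // arbitrary number of blocking goroutines
	var wg sync.WaitGroup
	writeEnds := make([]int, blockers)

	for i := 0; i < blockers; i++ {
		fds := make([]int, 2)
		if err := syscall.Pipe(fds); err != nil {
			panic(err)
		}
		writeEnds[i] = fds[1]

		wg.Add(1)
		go func(rd int) {
			defer wg.Done()
			buf := make([]byte, 1)
			// Raw read() on a blocking fd: it bypasses the netpoller,
			// so this kernel thread (M) really blocks in the kernel.
			// The runtime hands our P to another M, spawning one if needed.
			syscall.Read(rd, buf)
		}(fds[0])
	}

	time.Sleep(100 * time.Millisecond) // give the readers time to block
	fmt.Println("threads while blocked:", threads.Count())

	for _, wr := range writeEnds {
		syscall.Write(wr, []byte("x")) // unblock each reader
	}
	wg.Wait()
}
```

The complete server example below shows the same machinery at the application level, where network I/O goes through the netpoller instead of blocking kernel threads.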
```go
package main

import (
	"net"
	"runtime"
)

/*
 * Go's Many-to-Many Model in Action
 *
 * This example demonstrates how Go handles thousands of
 * "blocking" network operations without actually blocking
 * kernel threads.
 */

func handleConnection(conn net.Conn) {
	defer conn.Close()
	buffer := make([]byte, 1024)

	for {
		// This "blocks" from the goroutine's perspective.
		// But internally:
		//   1. Go sets the socket to non-blocking
		//   2. Attempts read - if EAGAIN, goroutine is parked
		//   3. epoll/kqueue monitors the socket
		//   4. When data arrives, goroutine is unparked
		//   5. The kernel thread was NOT blocked during the wait
		n, err := conn.Read(buffer)
		if err != nil {
			return
		}
		// Echo back
		conn.Write(buffer[:n])
	}
}

func main() {
	// Limit to 4 simultaneously executing kernel threads (4 Ps)
	runtime.GOMAXPROCS(4)

	ln, _ := net.Listen("tcp", ":8080")

	/*
	 * This server handles 10,000 concurrent connections using
	 * only 4 kernel threads. Each connection has its own goroutine,
	 * but goroutines are multiplexed onto the 4 kernel threads.
	 *
	 * Many-to-Many in action:
	 * - M = 10,000 goroutines (user threads)
	 * - N = 4 kernel threads (plus netpoller)
	 * - All 10,000 can be "blocked" on Read() simultaneously
	 * - But 0 kernel threads are actually blocked on I/O
	 * - The netpoller watches all 10,000 sockets efficiently
	 */
	for {
		conn, _ := ln.Accept()
		go handleConnection(conn) // Creates lightweight goroutine
	}
}

/*
 * Why this works:
 *
 * Traditional One-to-One model:
 *   10,000 connections = 10,000 kernel threads
 *   Each thread: ~8MB stack = 80GB memory (!)
 *   Plus kernel overhead per thread
 *
 * Go's Many-to-Many model:
 *   10,000 connections = 10,000 goroutines
 *   Each goroutine: ~2KB initial stack = 20MB memory
 *   Only 4-5 kernel threads total
 *   Non-blocking I/O means no thread is wasted waiting
 *
 * This is the power of Many-to-Many with proper blocking handling.
 */
```

The Many-to-Many model, when properly implemented, combines the strengths of both simpler models while avoiding their most severe limitations:
| Characteristic | Many-to-One | One-to-One | Many-to-Many |
|---|---|---|---|
| Max Threads | Millions | Thousands | Millions |
| True Parallelism | ✗ No | ✓ Yes | ✓ Yes |
| Thread Creation Cost | ~1μs | ~10μs | ~1μs |
| Memory per Thread | ~4KB | ~8MB + kernel | ~2-8KB |
| Blocking Behavior | All freeze | Independent | Mitigated |
| Implementation Complexity | Medium | Low | High |
| Kernel Awareness | None | Full | Partial |
Ideal Use Cases:
The Many-to-Many model excels in scenarios requiring:
High-Concurrency Network Servers — Web servers, chat servers, and API gateways handling thousands to millions of simultaneous connections. Each connection can have its own lightweight thread.
Massively Concurrent Task Processing — Systems where each task is a natural unit of work but tasks number in the millions. Examples: web crawlers, data processing pipelines, simulation systems.
Actor Systems and Message-Passing Architectures — Where each actor is a lightweight 'process' that should be independently schedulable but numerous.
Async/Await without Callback Hell — Languages using Many-to-Many can provide async/await syntax where each 'await' can yield the user thread without blocking the underlying kernel thread.
Microservices with High Fan-Out — Services that make many outbound requests per incoming request benefit from lightweight threads that can wait on those requests simultaneously.
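As a sketch of the fan-out pattern from the last item, the hypothetical fanOut function below issues one outbound HTTP request per URL concurrently; every goroutine waits on its round trip while the netpoller keeps the kernel threads free. The URLs and timeout are placeholder values.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// fanOut issues one outbound request per URL concurrently. Each
// goroutine "blocks" on its HTTP round trip, but the netpoller keeps
// the kernel threads free, so 100 in-flight requests cost ~100
// cheap goroutines rather than 100 OS threads.
func fanOut(urls []string) map[string]int {
	client := &http.Client{Timeout: 2 * time.Second} // placeholder timeout

	var (
		mu      sync.Mutex
		results = make(map[string]int)
		wg      sync.WaitGroup
	)

	for _, url := range urls {
		wg.Add(1)
		go func(u string) {
			defer wg.Done()
			resp, err := client.Get(u)
			if err != nil {
				return // drop failures in this sketch
			}
			resp.Body.Close()

			mu.Lock()
			results[u] = resp.StatusCode
			mu.Unlock()
		}(url)
	}
	wg.Wait()
	return results
}

func main() {
	// Placeholder URL; a real service would fan out to its backends.
	fmt.Println(fanOut([]string{"https://example.com"}))
}
```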
Many-to-Many hits the sweet spot for applications needing thousands to millions of concurrent tasks with good CPU utilization. It's why Go, Erlang, and similar runtimes have become popular for cloud infrastructure, network services, and concurrent data processing.
The Many-to-Many model's advantages come at the cost of significant implementation complexity. This complexity is why most languages and systems have historically avoided Many-to-Many in favor of simpler One-to-One threading.
Building a correct, performant Many-to-Many runtime is extremely difficult. It requires deep integration with the OS, careful handling of edge cases (signals, fork, exec), and sophisticated scheduling algorithms. This is why successful implementations (Go, Erlang/BEAM, Java Loom) are major engineering efforts that took years to mature.
Why Solaris Moved Away:
Solaris was a notable Many-to-Many implementation that eventually abandoned the model:
Initial Design (Solaris 2.x): Used a two-level thread library (libthread) multiplexing user threads onto LWPs (Lightweight Processes, which mapped to kernel threads).
Problems Encountered: Signal delivery semantics were extremely difficult to implement correctly across two schedulers; the user-level and kernel schedulers made conflicting decisions that produced hard-to-diagnose priority and latency anomalies; debuggers and profilers struggled with user threads the kernel could not see; and as LWP operations grew cheaper, the performance advantage over One-to-One largely evaporated.
Resolution (Solaris 9+): Moved to One-to-One threading. Every user thread became a LWP. Simpler, more predictable, and hardware got fast enough that One-to-One overhead was acceptable for most workloads.
This history lesson shows that Many-to-Many can fail if not carefully designed and maintained.
| System | Status | Outcome |
|---|---|---|
| Solaris LWP | Abandoned | Switched to 1:1 (Solaris 9+) |
| HP-UX threads | Abandoned | Switched to 1:1 |
| IRIX threads | Abandoned (platform dead) | Never completed transition |
| Windows Fibers | Limited | Optional, rarely used, manual scheduling |
| GNU Pth | Niche | Still M:1 not M:N, limited use |
| Go goroutines | Success | Mature, widely used, handles edge cases |
| Erlang/BEAM | Success | Mature, specialized for actors |
| Java Project Loom | Shipped | Virtual threads, standard since JDK 21 |
Go and Erlang prove that Many-to-Many can work brilliantly when the runtime is designed from the ground up around it. Both languages were designed with Many-to-Many in mind, not retrofitted. They handle blocking carefully, provide excellent tooling, and have millions of users proving their implementations work.
Despite the complexity, several modern runtimes have successfully implemented Many-to-Many threading. These implementations demonstrate that the model is viable when carefully designed.
| Implementation | User Thread Name | Min Stack Size | Preemptive | Blocking Handling |
|---|---|---|---|---|
| Go | Goroutine | 2KB (growable) | Yes (since 1.14) | Dynamic M spawn, netpoller |
| Erlang BEAM | Process | ~1KB | Yes (reduction counting) | Never blocks (pure message passing) |
| Java Loom | Virtual Thread | Platform-controlled | No (cooperative) | Unmount on block, carrier continues |
| Kotlin | Coroutine | Very small | No (cooperative) | Dispatcher flexibility |
| Rust Tokio | Task | ~200 bytes state | No (cooperative) | Async all the way down |
Go's GMP Model in Detail:
Go's scheduler is often considered the reference for modern Many-to-Many implementations. It uses three entities: G (goroutine), the lightweight user thread with its own growable stack; M (machine), an OS kernel thread that actually executes code; and P (processor), a scheduling context that owns a local run queue, with exactly GOMAXPROCS Ps in existence.
The key insight: A goroutine (G) needs both an M and a P to run. The P owns the run queue. When a goroutine blocks in a syscall, the M is 'stuck' in the syscall, but the P can be handed off to a different M to continue running other goroutines.
This P abstraction allows Go to keep up to GOMAXPROCS goroutines executing simultaneously, even as Ms block and unblock on syscalls.
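One way to see that Ps bound parallelism is to time a CPU-bound workload under different GOMAXPROCS settings. The rough benchmark sketch below (the burn and timeRun helpers and the iteration counts are illustrative choices) runs four workers serially on one P, then in parallel on four Ps, where wall time should drop roughly 4x on a machine with at least four cores.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"time"
)

// burn is CPU-bound: it never blocks, so how many copies run at once
// is bounded by the number of Ps (GOMAXPROCS), not by Ms or Gs.
func burn(n int) int {
	x := 0
	for i := 0; i < n; i++ {
		x += i
	}
	return x
}

// timeRun runs `workers` CPU-bound goroutines under `procs` Ps and
// returns the wall-clock time.
func timeRun(procs, workers int) time.Duration {
	runtime.GOMAXPROCS(procs)
	results := make([]int, workers) // one slot per worker, no races

	start := time.Now()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			results[w] = burn(50_000_000) // arbitrary iteration count
		}(w)
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	fmt.Println("P=1:", timeRun(1, 4)) // workers time-slice on one P
	fmt.Println("P=4:", timeRun(4, 4)) // workers run in parallel
}
```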
Successful M:N runtimes share common patterns: designed from the start for M:N (not retrofitted), integrated I/O handling (netpoller or non-blocking I/O), work stealing for load balance, careful handling of blocking calls, excellent debugging/profiling tools, and years of iteration and hardening. They're major engineering investments that pay off for high-concurrency workloads.
The Many-to-Many model represents the most sophisticated approach to thread mapping, combining lightweight concurrency with true parallelism. To consolidate the key insights: M lightweight user threads are multiplexed onto N kernel threads by a two-level scheduler; work stealing balances load across kernel threads; blocking is mitigated through non-blocking I/O, pollers, and kernel-thread handoff; the cost is a sophisticated runtime that is genuinely hard to build; and mature implementations (Go, Erlang/BEAM, Java virtual threads) prove the investment pays off for high-concurrency workloads.
What's Next:
We've now covered Many-to-One, One-to-One, and Many-to-Many—the three fundamental threading models. The next page examines the Two-Level model, a variation that combines M:N threading with the ability to directly bind critical user threads to dedicated kernel threads for predictable performance.
You now understand the Many-to-Many threading model's architecture, scheduling mechanisms, advantages for high-concurrency workloads, complexity challenges, and modern implementations like Go's goroutines. This prepares you to understand the Two-Level model and to make informed choices about threading approaches for different application requirements.