Imagine you've carefully designed a highly concurrent server using 1,000 user-level threads. Each thread handles one client request. The system hums along beautifully—until Thread #537 calls read() on a file that isn't in the buffer cache.
In that instant, the entire process disappears from the CPU. All 1,000 threads freeze. 999 perfectly healthy, ready-to-run threads stop dead because one thread needs to wait for a disk. The kernel, unaware that any threads exist, simply blocks the process.
This is the blocking problem—the most severe practical limitation of pure user-level threading, and the reason why sophisticated runtime systems go to extraordinary lengths to avoid blocking system calls.
The blocking problem isn't merely theoretical—it has caused real production outages. Understanding it deeply is essential for anyone considering user-level threads, building async runtimes, or designing I/O-heavy concurrent systems.
The blocking problem arises from the intersection of two design decisions:
When a user-level thread makes a blocking system call, the kernel blocks the entire process—not just that thread—because the kernel doesn't know other threads exist.
Let's trace through exactly what happens at the kernel level:
```c
/*
 * Simplified view of what happens in the kernel when a blocking
 * system call is made by a process using user-level threads
 */

/* User-level thread calls read() */
void user_thread_537(void) {
    char buffer[4096];

    /* This looks innocent... */
    ssize_t n = read(fd, buffer, sizeof(buffer));
    /* ...but can freeze 999 other threads */
}

/* What happens in the kernel (simplified) */
ssize_t sys_read(int fd, void *buf, size_t count) {
    struct file *f = get_file(fd);

    /* Check if data is available */
    if (!data_ready(f)) {
        /* Issue I/O request to device */
        submit_io_request(f, buf, count);

        /*
         * CRITICAL: This blocks the PROCESS.
         *
         * current->state = TASK_INTERRUPTIBLE;
         * schedule();  // Remove from run queue, pick another process
         *
         * ALL user-level threads in this process stop here.
         * The kernel doesn't know they exist.
         * The scheduler picks a completely different process.
         */
        wait_for_io_completion(f);

        /*
         * When I/O completes, the process is woken up.
         * Whichever user-level thread was "current" continues.
         * The other user-level threads never knew they were frozen.
         */
    }

    return copy_data_to_user(f, buf, count);
}
```

The blocking problem is particularly insidious because it's invisible to user-level threads. Thread 537 thinks it just did a slow read(). Threads 1-536 and 538-1000 never know they were suspended—they just experience mysterious latency. There's no exception, no error, no notification. Time simply stops for everyone.
The blocking problem has severe practical consequences for systems using pure user-level threads:
Even if 99% of your system calls complete instantly, the 1% that block cause latency spikes for all threads:
| Blocking Operation | Typical Duration | Impact on All Threads |
|---|---|---|
| SSD read (cache miss) | 50-200 μs | All threads frozen for 50-200 μs |
| HDD read (random) | 5-15 ms | All threads frozen for 5-15 ms |
| Network read (LAN) | 1-10 ms | All threads frozen waiting for data |
| Network read (WAN) | 50-500 ms | Catastrophic freeze for all threads |
| Database query | 1 ms - 30s+ | Entire server frozen |
| File open (NFS) | 10 ms - 30s | NFS timeout can freeze everything |
| DNS lookup | 1 ms - 30s | gethostbyname() can be deadly |
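To make the table concrete, a quick back-of-the-envelope calculation (a sketch with assumed figures, not a benchmark) shows how much concurrency a single mid-table event destroys:

```python
threads = 1000   # user-level threads in the process
block_ms = 10    # one HDD-class blocking read (assumed figure)

# With pure user-level threads, the block freezes the whole process,
# so every thread loses block_ms of runnable time.
lost_thread_seconds = threads * block_ms / 1000
print(f"{lost_thread_seconds:.0f} thread-seconds of concurrency lost")

# With kernel threads, only the caller waits.
lost_kernel = 1 * block_ms / 1000
print(f"{lost_kernel:.2f} thread-seconds lost with kernel threads")
```

One 10 ms disk read costs 10 full thread-seconds of potential work—a 1000x amplification of the wait.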
Consider a web server handling 1000 concurrent requests with user-level threads:
```python
"""Throughput impact analysis for the blocking problem"""

# Scenario: 1000-thread web server
threads = 1000
requests_per_second_ideal = 10000  # 10 req/thread/sec

# Without the blocking problem (kernel threads or non-blocking I/O),
# each thread works independently
effective_throughput_ideal = requests_per_second_ideal
print(f"Ideal throughput: {effective_throughput_ideal} req/sec")

# With the blocking problem:
# Assume 5% of requests trigger a 10ms blocking call
blocking_probability = 0.05
block_duration_ms = 10
request_interval_ms = 100  # 10 req/thread/sec means one every 100ms

# In any 100ms window, each thread issues one request, so the window
# sees ~1000 requests. At a 5% blocking rate, that's ~50 blocking events.
blocking_events_per_100ms = threads * (100 / request_interval_ms) * blocking_probability
# = 1000 * 1 * 0.05 = 50 blocking events per 100ms window

# With pure user-level threads, these DON'T overlap - they serialize!
# Each 10ms block stops ALL threads.
# 50 events x 10ms = 500ms of stall demanded from a 100ms window.
blocked_time_ms = blocking_events_per_100ms * block_duration_ms
actual_blocked_time_pct = min(1.0, blocked_time_ms / 100)
# = min(1.0, 500 / 100) = 100% blocked!

print("With blocking problem:")
print(f"  Blocking events per 100ms: ~{blocking_events_per_100ms:.0f}")
print(f"  Each block: {block_duration_ms}ms, blocks ALL threads")
print("  Result: System nearly 100% stalled!")
print("  Effective throughput: ~0 req/sec")

print("Conclusion: Even a 5% blocking rate can completely")
print("stall a user-level threaded system.")
```

The blocking problem doesn't degrade gracefully. A small percentage of blocking calls can completely stall the system.
If any thread blocks frequently enough that a block is always in progress, no other thread can run—even if 99% of threads never block. One bad thread poisons the entire process.
Many operations that developers assume are fast can unexpectedly block. Understanding these scenarios is crucial for designing systems that use user-level threads:
- open() — May require directory traversal, inode lookup, disk I/O for metadata. Network filesystems (NFS, SMB) can take seconds.
- read() / write() — Fast if data is cached, but cache misses trigger disk I/O. Large files or memory pressure cause cache misses.
- stat() / fstat() — Requires an inode read from disk if not cached. Common in file serving, logging, web frameworks.
- fsync() / fdatasync() — Explicitly waits for data to reach disk. Can take hundreds of milliseconds on HDDs.
- close() — Usually fast, but may trigger writeback of dirty buffers or release of locks, causing waits.
- connect() — TCP handshake requires a network round-trip. Hundreds of milliseconds for remote hosts; timeout (often 30s+) if unreachable.
- accept() — Blocks if no incoming connection. Indefinite block without pending clients.
- recv() / read() on socket — Blocks waiting for data. Remote peer delay = local thread freeze.
- send() / write() on socket — Blocks if the socket send buffer is full. TCP backpressure becomes a threading disaster.
- gethostbyname() / getaddrinfo() — Synchronous DNS lookup. DNS timeout (often 5-30s) freezes everything.

| Operation | When It Blocks | Typical Duration |
|---|---|---|
| malloc() / free() | Arena lock contention; mmap/munmap | Usually µs, occasionally ms |
| printf() / fwrite() to stdout | Terminal blocked; pipe buffer full | Indefinite if consumer slow |
| syslog() | Syslog daemon connection | ms to seconds |
| exec*() | Loads executable, libraries | Tens to hundreds of ms |
| fork() | Page table copy, COW setup | ms (proportional to memory) |
| Locking (flock(), lockf()) | Lock held by another process | Indefinite |
| IPC (msgrcv(), semop()) | Resource not available | Until sender/signaler acts |
| Page fault | Swap in, demand paging | µs (SSD) to ms (HDD) per fault |
When using user-level threads, every system call is suspect. Audit all I/O paths. Ask: 'Can this block? Under what conditions? How long?' The answer is often 'yes, unexpectedly, for a long time.' Defensive programming means assuming any syscall might block.
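As a small audit aid, a helper can verify at runtime whether a descriptor is actually in non-blocking mode (a sketch using Python's standard fcntl module; the helper name is_nonblocking is our own):

```python
import fcntl
import os
import socket

def is_nonblocking(fd: int) -> bool:
    """Return True if the descriptor has O_NONBLOCK set."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    return bool(flags & os.O_NONBLOCK)

# Sockets start out in blocking mode - exactly the dangerous default
s = socket.socket()
print(is_nonblocking(s.fileno()))   # False

s.setblocking(False)                # sets O_NONBLOCK under the hood
print(is_nonblocking(s.fileno()))   # True
s.close()
```

A check like this at the point where descriptors enter the thread library catches blocking-mode descriptors before they can freeze the process.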
The most fundamental mitigation for the blocking problem is to avoid blocking system calls entirely. This is achieved through non-blocking I/O: configuring file descriptors to return immediately with an error if the operation would block.
```c
#include <fcntl.h>
#include <errno.h>
#include <sys/socket.h>

/*
 * Set a file descriptor to non-blocking mode
 * Returns 0 on success, -1 on failure
 */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Non-blocking read wrapper for user-level threads
 *
 * Returns:
 *   > 0 : Number of bytes read
 *     0 : EOF (connection closed)
 *    -1 : Would block (EAGAIN/EWOULDBLOCK) - try again later
 *    -2 : Actual error
 */
ssize_t nonblocking_read(int fd, void *buf, size_t count) {
    ssize_t n = read(fd, buf, count);

    if (n >= 0) {
        return n;  /* Success or EOF */
    }

    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        return -1;  /* Would block - data not ready */
    }

    return -2;  /* Real error */
}

/*
 * User-level thread-aware read
 *
 * Instead of blocking, we yield to other threads and retry.
 * The thread library's scheduler handles the polling.
 */
ssize_t thread_aware_read(int fd, void *buf, size_t count) {
    while (1) {
        ssize_t n = nonblocking_read(fd, buf, count);

        if (n >= 0) {
            return n;  /* Got data or EOF */
        }

        if (n == -1) {
            /* Would block - yield to other threads */
            register_fd_for_read_ready(fd, current_thread);
            thread_block();  /* Sleep until fd is readable */
            /* When we wake up, loop and try the read again */
            continue;
        }

        /* n == -2: Real error */
        return -1;
    }
}

/*
 * Non-blocking connect with thread integration
 */
int thread_aware_connect(int sockfd, const struct sockaddr *addr,
                         socklen_t addrlen) {
    /* Set non-blocking first */
    set_nonblocking(sockfd);

    int result = connect(sockfd, addr, addrlen);

    if (result == 0) {
        return 0;  /* Immediate success (rare for remote) */
    }

    if (errno == EINPROGRESS) {
        /* Connection in progress - wait for writability */
        register_fd_for_write_ready(sockfd, current_thread);
        thread_block();

        /* Check if connect succeeded */
        int error;
        socklen_t len = sizeof(error);
        getsockopt(sockfd, SOL_SOCKET, SO_ERROR, &error, &len);

        if (error == 0) {
            return 0;  /* Success */
        }
        errno = error;
        return -1;
    }

    return -1;  /* Immediate failure */
}
```

Non-blocking I/O requires an event notification mechanism to know when operations can proceed. This is implemented through I/O multiplexing system calls:
| Mechanism | Platforms | Scalability | Notes |
|---|---|---|---|
| select() | All POSIX | O(n) per call | Limited to ~1024 fds; oldest |
| poll() | All POSIX | O(n) per call | No fd limit; still linear scan |
| epoll | Linux | O(1) per event | Edge/level triggered; scales to millions |
| kqueue | BSD/macOS | O(1) per event | Similar to epoll; very flexible |
| IOCP | Windows | O(1) per event | Completion-based; async model |
In a well-designed user-level thread library, the runtime handles all I/O multiplexing transparently. Threads call 'read()' (actually a wrapper); if the operation would block, the wrapper registers interest with epoll/kqueue, blocks the user-level thread (not the process!), and continues scheduling other threads. When the fd becomes ready, the blocked thread is woken.
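The register-then-sleep pattern can be sketched in a few lines with Python's standard selectors module, which wraps epoll/kqueue behind a portable API (the pipe and the "thread-537" tag are illustrative stand-ins for a real runtime's fd table):

```python
import os
import selectors

sel = selectors.DefaultSelector()   # epoll on Linux, kqueue on BSD/macOS
r, w = os.pipe()
os.set_blocking(r, False)

# "Register interest": remember which user-level thread waits on this fd
sel.register(r, selectors.EVENT_READ, data="thread-537")

# Nothing readable yet: the scheduler would run other threads here
assert sel.select(timeout=0) == []

os.write(w, b"disk block arrived")   # the awaited I/O completes

# The fd is now ready; wake the thread recorded in key.data
for key, _ in sel.select(timeout=1):
    print(f"waking {key.data}: {os.read(key.fd, 4096)!r}")
```

The data field attached at registration time is exactly how a thread library finds the blocked thread to wake—the same role as the `data.ptr = current_thread` trick in the C examples.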
Jacketing (also called wrapper interposition) is a technique where blocking system calls are intercepted and wrapped with thread-aware code. The wrapper converts blocking behavior into non-blocking behavior with thread-level sleep, transparent to the calling code.
```c
/*
 * Jacketing: Transparent blocking call interception
 *
 * These wrapper functions replace standard library functions,
 * making blocking operations thread-aware without changing
 * application code.
 */

#include <dlfcn.h>  /* For dlsym to get the original function */

/* Original function pointers, loaded once at startup */
static ssize_t (*real_read)(int, void*, size_t) = NULL;
static ssize_t (*real_write)(int, const void*, size_t) = NULL;
static int (*real_accept)(int, struct sockaddr*, socklen_t*) = NULL;

__attribute__((constructor))
void init_jackets(void) {
    real_read = dlsym(RTLD_NEXT, "read");
    real_write = dlsym(RTLD_NEXT, "write");
    real_accept = dlsym(RTLD_NEXT, "accept");
}

/*
 * Jacketed read() - intercepts all read() calls
 *
 * 1. Check if fd is non-blocking or would succeed
 * 2. If it would block, register for events and sleep the thread
 * 3. When ready, retry
 */
ssize_t read(int fd, void *buf, size_t count) {
    /* First, ensure fd is non-blocking */
    ensure_nonblocking(fd);

    while (1) {
        ssize_t n = real_read(fd, buf, count);

        if (n >= 0) {
            return n;  /* Success */
        }

        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* Would block - yield to other threads */
            io_wait_read(fd);  /* Register and block user-thread */
            continue;          /* Retry after wake */
        }

        /* Real error */
        return -1;
    }
}

/*
 * Jacketed accept() - intercepts all accept() calls
 */
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen) {
    ensure_nonblocking(sockfd);

    while (1) {
        int client = real_accept(sockfd, addr, addrlen);

        if (client >= 0) {
            /* Also make the client socket non-blocking */
            ensure_nonblocking(client);
            return client;
        }

        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            io_wait_read(sockfd);  /* Wait for connection */
            continue;
        }

        return -1;
    }
}

/*
 * io_wait_read() - Block user-level thread until fd is readable
 */
void io_wait_read(int fd) {
    /* Add fd to epoll interest list for the current thread */
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLET,
        .data.ptr = current_thread
    };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

    /* Mark thread as blocked on I/O */
    current_thread->wait_reason = WAIT_IO_READ;
    current_thread->wait_fd = fd;

    /* Block this thread (not the process!) */
    thread_block();

    /* Woken up - remove from epoll */
    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
}
```

Jacketing is used extensively in production systems. Go's runtime jackets all I/O operations. Gevent (Python) uses greenlet with libevent/libev jackets. Even some databases use internal jacketing for their thread pools. When done well, it's invisible to developers writing application code.
Scheduler activations is an advanced technique that adds limited kernel support to address the blocking problem while preserving user-level threading benefits. First proposed by Anderson et al. in 1991, it creates a communication channel between the kernel and user-level scheduler.
Instead of leaving the kernel completely unaware, scheduler activations provide upcalls—notifications from kernel to user space when blocking events occur:
Virtual Processors: The kernel grants the process a number of virtual processors (analogous to CPU time slots). Each virtual processor can run one user-level thread.
Blocking Notification: When a user-level thread makes a blocking call, the kernel blocks only that activation, creates a fresh virtual processor, and performs an upcall on it so the user-level scheduler can run another thread.
Unblocking Notification: When the blocked operation completes, the kernel performs another upcall to tell the user-level scheduler that the thread is runnable again, and the scheduler returns it to the ready queue.
```c
/*
 * Scheduler Activations - User-level scheduler integration
 *
 * Simplified conceptual implementation showing the upcall handling
 */

/* Upcall handler - the kernel calls this when a thread blocks/unblocks */
void upcall_handler(int event_type, tcb_t *thread, void *aux) {
    switch (event_type) {

    case UPCALL_THREAD_BLOCKED:
        /*
         * A user-level thread made a blocking syscall.
         * The kernel has allocated us a new virtual processor,
         * and we're running on that new processor now.
         *
         * The blocked thread's state is saved by the kernel.
         * We need to schedule work for this processor.
         */

        /* Mark thread as blocked */
        thread->state = THREAD_BLOCKED_SYSCALL;
        thread->blocked_on = aux;  /* What it's waiting for */

        /* Schedule another thread on this virtual processor */
        tcb_t *next = dequeue_ready();
        if (next) {
            context_switch_to(next);
        } else {
            /* No ready threads - idle this processor */
            idle_virtual_processor();
        }
        break;

    case UPCALL_THREAD_UNBLOCKED:
        /*
         * A previously blocked thread can now continue.
         * Add it back to the ready queue.
         */
        thread->state = THREAD_READY;
        enqueue_ready(thread);

        /* Potentially preempt the current thread if higher priority */
        maybe_preempt_current(thread);
        break;

    case UPCALL_PROCESSOR_PREEMPTED:
        /*
         * The kernel preempted one of our virtual processors.
         * Save state and adjust scheduling.
         */
        thread->state = THREAD_READY;
        enqueue_ready(thread);
        num_virtual_processors--;
        break;

    case UPCALL_PROCESSOR_GRANTED:
        /*
         * The kernel gave us back a virtual processor.
         * Schedule a thread on it.
         */
        num_virtual_processors++;
        tcb_t *to_run = dequeue_ready();
        if (to_run) {
            context_switch_to(to_run);
        }
        break;
    }
}
```

Scheduler activations require kernel modifications and were never widely adopted in mainstream operating systems. Solaris had a related mechanism (lightweight processes), and some research systems implemented full activations. Today, the M:N threading model (multiple user threads on kernel threads) with runtime scheduling achieves similar benefits without kernel changes. Go's runtime is a practical example of this approach.
The most practical solution to the blocking problem in modern systems is the M:N threading model—multiplexing M user-level threads onto N kernel threads (where typically N ≈ number of CPU cores).
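The scheduling half of the model can be sketched with plain Python generators: many user-level contexts multiplexed over one kernel thread (the M-over-1 degenerate case; a real M:N runtime runs several such loops on N kernel threads, with work stealing between them). The user_thread helper and round-robin policy are illustrative:

```python
from collections import deque

def user_thread(name, steps):
    """A user-level thread: runs, then yields at each would-block point."""
    for i in range(steps):
        yield f"{name} step {i}"

# M user-level threads multiplexed onto this single kernel thread
ready = deque([user_thread("A", 2), user_thread("B", 2), user_thread("C", 1)])
trace = []
while ready:
    t = ready.popleft()          # pick the next runnable thread
    try:
        trace.append(next(t))    # run until it yields (a would-block point)
        ready.append(t)          # still alive: back to the ready queue
    except StopIteration:
        pass                     # thread finished

print(trace)
# Round-robin interleaving: ['A step 0', 'B step 0', 'C step 0',
#                            'A step 1', 'B step 1']
```

Because each generator suspends voluntarily at its yield points, one "thread" hitting a would-block point never stalls the others—the property the full M:N model preserves while adding real parallelism.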
Go provides an excellent case study in solving the blocking problem:
```go
/*
 * Go's runtime handles blocking transparently
 *
 * When a goroutine blocks on I/O, the runtime:
 * 1. Parks the goroutine (user-level block)
 * 2. Uses the netpoller for async I/O internally
 * 3. Keeps the kernel thread (M) running other goroutines
 * 4. When I/O is ready, the netpoller unparks the goroutine
 */

package main

import (
	"fmt"
	"net/http"
	"time"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// This sleeps *only* this goroutine, not the thread
	time.Sleep(100 * time.Millisecond)

	// This network I/O uses the netpoller - non-blocking internally
	resp, _ := http.Get("https://example.com")
	defer resp.Body.Close()

	// Other goroutines continue running on other Ms
	// or even on this same M after we park
	fmt.Fprintf(w, "Done!")
}

func main() {
	// GOMAXPROCS sets the number of logical processors (Ps),
	// which bounds how many goroutines run in parallel.
	// Default is the number of CPUs.
	// Each P can run many goroutines (Gs).

	// 10000 concurrent handlers only need a handful of kernel threads
	// Blocking in one handler doesn't affect the others
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}

/*
 * Key insight: Go wraps ALL blocking operations:
 * - Network I/O  → netpoller (epoll/kqueue based)
 * - File I/O     → often a background goroutine on a separate M
 * - Sleep        → timer heap in the runtime
 * - Channel ops  → goroutine parking
 *
 * The programmer writes blocking-style code;
 * the runtime makes it non-blocking underneath.
 */
```

M:N threading combines user-level thread efficiency (fast switches, many threads) with kernel thread robustness (multiprocessor support, blocking tolerance). Go, Rust (tokio), Java (Project Loom), and many other modern platforms use variants of this approach. It's the practical resolution of the blocking problem.
We have thoroughly examined the blocking problem—the most significant practical limitation of pure user-level threading—and the sophisticated strategies developed to overcome it.
You now understand both the problem and its solutions in depth. In the next page, we'll explore green threads—a specific form of user-level threading used in many modern languages and runtimes—and see how they apply these concepts in practice.