Imagine you've carefully designed a highly concurrent server using 1,000 user-level threads. Each thread handles one client request. The system hums along beautifully—until Thread #537 calls read() on a file that isn't in the buffer cache.
In that instant, the entire process disappears from the CPU. All 1,000 threads freeze. 999 perfectly healthy, ready-to-run threads stop dead because one thread needs to wait for a disk. The kernel, unaware that any threads exist, simply blocks the process.
This is the blocking problem—the most severe practical limitation of pure user-level threading, and the reason why sophisticated runtime systems go to extraordinary lengths to avoid blocking system calls.
The blocking problem isn't merely theoretical—it has caused real production outages. Understanding it deeply is essential for anyone considering user-level threads, building async runtimes, or designing I/O-heavy concurrent systems.
The blocking problem arises from the intersection of two design decisions:
When a user-level thread makes a blocking system call, the kernel blocks the entire process—not just that thread—because the kernel doesn't know other threads exist.
Let's trace through exactly what happens at the kernel level:
```c
/*
 * Simplified view of what happens in the kernel when a blocking
 * system call is made by a process using user-level threads
 */

/* User-level thread calls read() */
void user_thread_537(void) {
    char buffer[4096];

    /* This looks innocent... */
    ssize_t n = read(fd, buffer, sizeof(buffer));
    /* ...but can freeze 999 other threads */
}

/* What happens in the kernel (simplified) */
ssize_t sys_read(int fd, void *buf, size_t count) {
    struct file *f = get_file(fd);

    /* Check if data is available */
    if (!data_ready(f)) {
        /* Issue I/O request to device */
        submit_io_request(f, buf, count);

        /*
         * CRITICAL: This blocks the PROCESS.
         *
         * current->state = TASK_INTERRUPTIBLE;
         * schedule();  // Remove from run queue, pick another process
         *
         * ALL user-level threads in this process stop here.
         * The kernel doesn't know they exist.
         * The scheduler picks a completely different process.
         */
        wait_for_io_completion(f);

        /*
         * When I/O completes, the process is woken up.
         * Whichever user-level thread was "current" continues.
         * The other user-level threads never knew they were frozen.
         */
    }

    return copy_data_to_user(f, buf, count);
}
```

The blocking problem is particularly insidious because it's invisible to user-level threads. Thread 537 thinks it just did a slow read(). Threads 1-536 and 538-1000 never know they were suspended—they just experience mysterious latency. There's no exception, no error, no notification. Time simply stops for everyone.
The blocking problem has severe practical consequences for systems using pure user-level threads:
Even if 99% of your system calls complete instantly, the 1% that block cause latency spikes for all threads:
| Blocking Operation | Typical Duration | Impact on All Threads |
|---|---|---|
| SSD read (cache miss) | 50-200 μs | All threads frozen for 50-200 μs |
| HDD read (random) | 5-15 ms | All threads frozen for 5-15 ms |
| Network read (LAN) | 1-10 ms | All threads frozen waiting for data |
| Network read (WAN) | 50-500 ms | Catastrophic freeze for all threads |
| Database query | 1 ms - 30s+ | Entire server frozen |
| File open (NFS) | 10 ms - 30s | NFS timeout can freeze everything |
| DNS lookup | 1 ms - 30s | gethostbyname() can be deadly |
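To make the table concrete, a quick back-of-the-envelope calculation (a sketch with assumed figures, not a benchmark) shows how much concurrency a single mid-table event destroys:

```python
threads = 1000   # user-level threads in the process
block_ms = 10    # one HDD-class blocking read (assumed figure)

# With pure user-level threads, the block freezes the whole process,
# so every thread loses block_ms of runnable time.
lost_thread_seconds = threads * block_ms / 1000
print(f"{lost_thread_seconds:.0f} thread-seconds of concurrency lost")

# With kernel threads, only the caller waits.
lost_kernel = 1 * block_ms / 1000
print(f"{lost_kernel:.2f} thread-seconds lost with kernel threads")
```

One 10 ms disk read costs 10 full thread-seconds of potential work—a 1000x amplification of the wait.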
Consider a web server handling 1000 concurrent requests with user-level threads:
```python
"""Throughput impact analysis for the blocking problem"""

# Scenario: 1000-thread web server
threads = 1000
requests_per_second_ideal = 10000  # 10 req/thread/sec

# Without the blocking problem (kernel threads or non-blocking I/O),
# each thread works independently
effective_throughput_ideal = requests_per_second_ideal
print(f"Ideal throughput: {effective_throughput_ideal} req/sec")

# With the blocking problem:
# Assume 5% of requests trigger a 10ms blocking call
blocking_probability = 0.05
block_duration_ms = 10
request_interval_ms = 100  # 10 req/thread/sec means one every 100ms

# In any 100ms window, each thread issues one request, so the window
# sees ~1000 requests. At a 5% blocking rate, that's ~50 blocking events.
blocking_events_per_100ms = threads * (100 / request_interval_ms) * blocking_probability
# = 1000 * 1 * 0.05 = 50 blocking events per 100ms window

# With pure user-level threads, these DON'T overlap - they serialize!
# Each 10ms block stops ALL threads.
# 50 events x 10ms = 500ms of stall demanded from a 100ms window.
blocked_time_ms = blocking_events_per_100ms * block_duration_ms
actual_blocked_time_pct = min(1.0, blocked_time_ms / 100)
# = min(1.0, 500 / 100) = 100% blocked!

print("With blocking problem:")
print(f"  Blocking events per 100ms: ~{blocking_events_per_100ms:.0f}")
print(f"  Each block: {block_duration_ms}ms, blocks ALL threads")
print("  Result: System nearly 100% stalled!")
print("  Effective throughput: ~0 req/sec")

print("Conclusion: Even a 5% blocking rate can completely")
print("stall a user-level threaded system.")
```

The blocking problem doesn't degrade gracefully. A small percentage of blocking calls can completely stall the system.
If any thread blocks frequently enough that a block is always in progress, no other thread can run—even if 99% of threads never block. One bad thread poisons the entire process.
Many operations that developers assume are fast can unexpectedly block. Understanding these scenarios is crucial for designing systems that use user-level threads:
- open() — May require directory traversal, inode lookup, disk I/O for metadata. Network filesystems (NFS, SMB) can take seconds.
- read() / write() — Fast if data is cached, but cache misses trigger disk I/O. Large files or memory pressure cause cache misses.
- stat() / fstat() — Requires an inode read from disk if not cached. Common in file serving, logging, web frameworks.
- fsync() / fdatasync() — Explicitly waits for data to reach disk. Can take hundreds of milliseconds on HDDs.
- close() — Usually fast, but may trigger writeback of dirty buffers or release of locks, causing waits.
- connect() — TCP handshake requires a network round-trip. Hundreds of milliseconds for remote hosts; timeout (often 30s+) if unreachable.
- accept() — Blocks if no incoming connection. Indefinite block without pending clients.
- recv() / read() on socket — Blocks waiting for data. Remote peer delay = local thread freeze.
- send() / write() on socket — Blocks if the socket send buffer is full. TCP backpressure becomes a threading disaster.
- gethostbyname() / getaddrinfo() — Synchronous DNS lookup. DNS timeout (often 5-30s) freezes everything.

| Operation | When It Blocks | Typical Duration |
|---|---|---|
| malloc() / free() | Arena lock contention; mmap/munmap | Usually µs, occasionally ms |
| printf() / fwrite() to stdout | Terminal blocked; pipe buffer full | Indefinite if consumer slow |
| syslog() | Syslog daemon connection | ms to seconds |
| exec*() | Loads executable, libraries | Tens to hundreds of ms |
| fork() | Page table copy, COW setup | ms (proportional to memory) |
| Locking (flock(), lockf()) | Lock held by another process | Indefinite |
| IPC (msgrcv(), semop()) | Resource not available | Until sender/signaler acts |
| Page fault | Swap in, demand paging | µs (SSD) to ms (HDD) per fault |
When using user-level threads, every system call is suspect. Audit all I/O paths. Ask: 'Can this block? Under what conditions? How long?' The answer is often 'yes, unexpectedly, for a long time.' Defensive programming means assuming any syscall might block.
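As a small audit aid, a helper can verify at runtime whether a descriptor is actually in non-blocking mode (a sketch using Python's standard fcntl module; the helper name is_nonblocking is our own):

```python
import fcntl
import os
import socket

def is_nonblocking(fd: int) -> bool:
    """Return True if the descriptor has O_NONBLOCK set."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    return bool(flags & os.O_NONBLOCK)

# Sockets start out in blocking mode - exactly the dangerous default
s = socket.socket()
print(is_nonblocking(s.fileno()))   # False

s.setblocking(False)                # sets O_NONBLOCK under the hood
print(is_nonblocking(s.fileno()))   # True
s.close()
```

A check like this at the point where descriptors enter the thread library catches blocking-mode descriptors before they can freeze the process.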
The most fundamental mitigation for the blocking problem is to avoid blocking system calls entirely. This is achieved through non-blocking I/O: configuring file descriptors to return immediately with an error if the operation would block.
```c
#include <fcntl.h>
#include <errno.h>
#include <sys/socket.h>

/*
 * Set a file descriptor to non-blocking mode
 * Returns 0 on success, -1 on failure
 */
int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Non-blocking read wrapper for user-level threads
 *
 * Returns:
 *   > 0 : Number of bytes read
 *     0 : EOF (connection closed)
 *    -1 : Would block (EAGAIN/EWOULDBLOCK) - try again later
 *    -2 : Actual error
 */
ssize_t nonblocking_read(int fd, void *buf, size_t count) {
    ssize_t n = read(fd, buf, count);

    if (n >= 0) {
        return n;  /* Success or EOF */
    }

    if (errno == EAGAIN || errno == EWOULDBLOCK) {
        return -1;  /* Would block - data not ready */
    }

    return -2;  /* Real error */
}

/*
 * User-level thread-aware read
 *
 * Instead of blocking, we yield to other threads and retry.
 * The thread library's scheduler handles the polling.
 */
ssize_t thread_aware_read(int fd, void *buf, size_t count) {
    while (1) {
        ssize_t n = nonblocking_read(fd, buf, count);

        if (n >= 0) {
            return n;  /* Got data or EOF */
        }

        if (n == -1) {
            /* Would block - yield to other threads */
            register_fd_for_read_ready(fd, current_thread);
            thread_block();  /* Sleep until fd is readable */
            /* When we wake up, loop and try the read again */
            continue;
        }

        /* n == -2: Real error */
        return -1;
    }
}

/*
 * Non-blocking connect with thread integration
 */
int thread_aware_connect(int sockfd, const struct sockaddr *addr,
                         socklen_t addrlen) {
    /* Set non-blocking first */
    set_nonblocking(sockfd);

    int result = connect(sockfd, addr, addrlen);

    if (result == 0) {
        return 0;  /* Immediate success (rare for remote) */
    }

    if (errno == EINPROGRESS) {
        /* Connection in progress - wait for writability */
        register_fd_for_write_ready(sockfd, current_thread);
        thread_block();

        /* Check if connect succeeded */
        int error;
        socklen_t len = sizeof(error);
        getsockopt(sockfd, SOL_SOCKET, SO_ERROR, &error, &len);

        if (error == 0) {
            return 0;  /* Success */
        }
        errno = error;
        return -1;
    }

    return -1;  /* Immediate failure */
}
```

Non-blocking I/O requires an event notification mechanism to know when operations can proceed. This is implemented through I/O multiplexing system calls:
| Mechanism | Platforms | Scalability | Notes |
|---|---|---|---|
| select() | All POSIX | O(n) per call | Limited to ~1024 fds; oldest |
| poll() | All POSIX | O(n) per call | No fd limit; still linear scan |
| epoll | Linux | O(1) per event | Edge/level triggered; scales to millions |
| kqueue | BSD/macOS | O(1) per event | Similar to epoll; very flexible |
| IOCP | Windows | O(1) per event | Completion-based; async model |
In a well-designed user-level thread library, the runtime handles all I/O multiplexing transparently. Threads call 'read()' (actually a wrapper); if the operation would block, the wrapper registers interest with epoll/kqueue, blocks the user-level thread (not the process!), and continues scheduling other threads. When the fd becomes ready, the blocked thread is woken.
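The register-then-sleep pattern can be sketched in a few lines with Python's standard selectors module, which wraps epoll/kqueue behind a portable API (the pipe and the "thread-537" tag are illustrative stand-ins for a real runtime's fd table):

```python
import os
import selectors

sel = selectors.DefaultSelector()   # epoll on Linux, kqueue on BSD/macOS
r, w = os.pipe()
os.set_blocking(r, False)

# "Register interest": remember which user-level thread waits on this fd
sel.register(r, selectors.EVENT_READ, data="thread-537")

# Nothing readable yet: the scheduler would run other threads here
assert sel.select(timeout=0) == []

os.write(w, b"disk block arrived")   # the awaited I/O completes

# The fd is now ready; wake the thread recorded in key.data
for key, _ in sel.select(timeout=1):
    print(f"waking {key.data}: {os.read(key.fd, 4096)!r}")
```

The data field attached at registration time is exactly how a thread library finds the blocked thread to wake—the same role as the `data.ptr = current_thread` trick in the C examples.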
Jacketing (also called wrapper interposition) is a technique where blocking system calls are intercepted and wrapped with thread-aware code. The wrapper converts blocking behavior into non-blocking behavior with thread-level sleep, transparent to the calling code.
```c
/*
 * Jacketing: Transparent blocking call interception
 *
 * These wrapper functions replace standard library functions,
 * making blocking operations thread-aware without changing
 * application code.
 */

#include <dlfcn.h>  /* For dlsym to get the original function */

/* Original function pointers, loaded once at startup */
static ssize_t (*real_read)(int, void*, size_t) = NULL;
static ssize_t (*real_write)(int, const void*, size_t) = NULL;
static int (*real_accept)(int, struct sockaddr*, socklen_t*) = NULL;

__attribute__((constructor))
void init_jackets(void) {
    real_read = dlsym(RTLD_NEXT, "read");
    real_write = dlsym(RTLD_NEXT, "write");
    real_accept = dlsym(RTLD_NEXT, "accept");
}

/*
 * Jacketed read() - intercepts all read() calls
 *
 * 1. Check if fd is non-blocking or would succeed
 * 2. If it would block, register for events and sleep the thread
 * 3. When ready, retry
 */
ssize_t read(int fd, void *buf, size_t count) {
    /* First, ensure fd is non-blocking */
    ensure_nonblocking(fd);

    while (1) {
        ssize_t n = real_read(fd, buf, count);

        if (n >= 0) {
            return n;  /* Success */
        }

        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* Would block - yield to other threads */
            io_wait_read(fd);  /* Register and block user-thread */
            continue;          /* Retry after wake */
        }

        /* Real error */
        return -1;
    }
}

/*
 * Jacketed accept() - intercepts all accept() calls
 */
int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen) {
    ensure_nonblocking(sockfd);

    while (1) {
        int client = real_accept(sockfd, addr, addrlen);

        if (client >= 0) {
            /* Also make the client socket non-blocking */
            ensure_nonblocking(client);
            return client;
        }

        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            io_wait_read(sockfd);  /* Wait for connection */
            continue;
        }

        return -1;
    }
}

/*
 * io_wait_read() - Block user-level thread until fd is readable
 */
void io_wait_read(int fd) {
    /* Add fd to epoll interest list for the current thread */
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLET,
        .data.ptr = current_thread
    };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

    /* Mark thread as blocked on I/O */
    current_thread->wait_reason = WAIT_IO_READ;
    current_thread->wait_fd = fd;

    /* Block this thread (not the process!) */
    thread_block();

    /* Woken up - remove from epoll */
    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
}
```

Jacketing is used extensively in production systems. Go's runtime jackets all I/O operations. Gevent (Python) uses greenlet with libevent/libev jackets. Even some databases use internal jacketing for their thread pools. When done well, it's invisible to developers writing application code.
Scheduler activations is an advanced technique that adds limited kernel support to address the blocking problem while preserving user-level threading benefits. First proposed by Anderson et al. in 1991, it creates a communication channel between the kernel and user-level scheduler.
Instead of leaving the kernel completely unaware, scheduler activations provide upcalls—notifications from kernel to user space when blocking events occur:
Virtual Processors: The kernel grants the process a number of virtual processors (analogous to CPU time slots). Each virtual processor can run one user-level thread.
Blocking Notification: When a user-level thread makes a blocking call, the kernel blocks only that activation, creates a fresh virtual processor, and performs an upcall on it so the user-level scheduler can run another thread.
Unblocking Notification: When the blocked operation completes, the kernel performs another upcall to tell the user-level scheduler that the thread is runnable again, and the scheduler returns it to the ready queue.
```c
/*
 * Scheduler Activations - User-level scheduler integration
 *
 * Simplified conceptual implementation showing the upcall handling
 */

/* Upcall handler - the kernel calls this when a thread blocks/unblocks */
void upcall_handler(int event_type, tcb_t *thread, void *aux) {
    switch (event_type) {

    case UPCALL_THREAD_BLOCKED:
        /*
         * A user-level thread made a blocking syscall.
         * The kernel has allocated us a new virtual processor,
         * and we're running on that new processor now.
         *
         * The blocked thread's state is saved by the kernel.
         * We need to schedule work for this processor.
         */

        /* Mark thread as blocked */
        thread->state = THREAD_BLOCKED_SYSCALL;
        thread->blocked_on = aux;  /* What it's waiting for */

        /* Schedule another thread on this virtual processor */
        tcb_t *next = dequeue_ready();
        if (next) {
            context_switch_to(next);
        } else {
            /* No ready threads - idle this processor */
            idle_virtual_processor();
        }
        break;

    case UPCALL_THREAD_UNBLOCKED:
        /*
         * A previously blocked thread can now continue.
         * Add it back to the ready queue.
         */
        thread->state = THREAD_READY;
        enqueue_ready(thread);

        /* Potentially preempt the current thread if higher priority */
        maybe_preempt_current(thread);
        break;

    case UPCALL_PROCESSOR_PREEMPTED:
        /*
         * The kernel preempted one of our virtual processors.
         * Save state and adjust scheduling.
         */
        thread->state = THREAD_READY;
        enqueue_ready(thread);
        num_virtual_processors--;
        break;

    case UPCALL_PROCESSOR_GRANTED:
        /*
         * The kernel gave us back a virtual processor.
         * Schedule a thread on it.
         */
        num_virtual_processors++;
        tcb_t *to_run = dequeue_ready();
        if (to_run) {
            context_switch_to(to_run);
        }
        break;
    }
}
```

Scheduler activations require kernel modifications and were never widely adopted in mainstream operating systems. Solaris had a related mechanism (lightweight processes), and some research systems implemented full activations. Today, the M:N threading model (multiple user threads on kernel threads) with runtime scheduling achieves similar benefits without kernel changes. Go's runtime is a practical example of this approach.
The most practical solution to the blocking problem in modern systems is the M:N threading model—multiplexing M user-level threads onto N kernel threads (where typically N ≈ number of CPU cores).
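The scheduling half of the model can be sketched with plain Python generators: many user-level contexts multiplexed over one kernel thread (the M-over-1 degenerate case; a real M:N runtime runs several such loops on N kernel threads, with work stealing between them). The user_thread helper and round-robin policy are illustrative:

```python
from collections import deque

def user_thread(name, steps):
    """A user-level thread: runs, then yields at each would-block point."""
    for i in range(steps):
        yield f"{name} step {i}"

# M user-level threads multiplexed onto this single kernel thread
ready = deque([user_thread("A", 2), user_thread("B", 2), user_thread("C", 1)])
trace = []
while ready:
    t = ready.popleft()          # pick the next runnable thread
    try:
        trace.append(next(t))    # run until it yields (a would-block point)
        ready.append(t)          # still alive: back to the ready queue
    except StopIteration:
        pass                     # thread finished

print(trace)
# Round-robin interleaving: ['A step 0', 'B step 0', 'C step 0',
#                            'A step 1', 'B step 1']
```

Because each generator suspends voluntarily at its yield points, one "thread" hitting a would-block point never stalls the others—the property the full M:N model preserves while adding real parallelism.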
Go provides an excellent case study in solving the blocking problem:
```go
/*
 * Go's runtime handles blocking transparently
 *
 * When a goroutine blocks on I/O, the runtime:
 * 1. Parks the goroutine (user-level block)
 * 2. Uses the netpoller for async I/O internally
 * 3. Keeps the kernel thread (M) running other goroutines
 * 4. When I/O is ready, the netpoller unparks the goroutine
 */

package main

import (
	"fmt"
	"net/http"
	"time"
)

func handler(w http.ResponseWriter, r *http.Request) {
	// This sleeps *only* this goroutine, not the thread
	time.Sleep(100 * time.Millisecond)

	// This network I/O uses the netpoller - non-blocking internally
	resp, _ := http.Get("https://example.com")
	defer resp.Body.Close()

	// Other goroutines continue running on other Ms
	// or even on this same M after we park
	fmt.Fprintf(w, "Done!")
}

func main() {
	// GOMAXPROCS sets the number of logical processors (Ps),
	// which bounds how many goroutines run in parallel.
	// Default is the number of CPUs.
	// Each P can run many goroutines (Gs).

	// 10000 concurrent handlers only need a handful of kernel threads
	// Blocking in one handler doesn't affect the others
	http.HandleFunc("/", handler)
	http.ListenAndServe(":8080", nil)
}

/*
 * Key insight: Go wraps ALL blocking operations:
 * - Network I/O  → netpoller (epoll/kqueue based)
 * - File I/O     → often a background goroutine on a separate M
 * - Sleep        → timer heap in the runtime
 * - Channel ops  → goroutine parking
 *
 * The programmer writes blocking-style code;
 * the runtime makes it non-blocking underneath.
 */
```

M:N threading combines user-level thread efficiency (fast switches, many threads) with kernel thread robustness (multiprocessor support, blocking tolerance). Go, Rust (tokio), Java (Project Loom), and many other modern platforms use variants of this approach. It's the practical resolution of the blocking problem.
We have thoroughly examined the blocking problem—the most significant practical limitation of pure user-level threading—and the sophisticated strategies developed to overcome it.
You now understand both the problem and its solutions in depth. In the next page, we'll explore green threads—a specific form of user-level threading used in many modern languages and runtimes—and see how they apply these concepts in practice.