Not all work can be done instantly. A process may need to read data from a slow disk, wait for a network packet, receive user input, or acquire a lock held by another process. In these situations, the process cannot make forward progress—continuing to run would just waste CPU cycles spinning in place.
The Block transition (also called the Wait or Sleep transition) moves a process from the Running state to the Waiting (or Blocked) state. The process voluntarily surrenders the CPU, acknowledging that it cannot proceed until some external event occurs. This is fundamentally different from the Timeout transition—blocking is the process's choice, not the kernel's enforcement.
This transition is crucial for system efficiency. Without blocking, I/O-bound processes would spin-wait, wasting enormous CPU resources. Instead, they sleep, allowing other processes to run productively.
By the end of this page, you will understand why and how processes block—from the various conditions that trigger blocking, through the kernel's wait queue mechanism, to the data structures that track sleeping processes. You'll see how blocking enables efficient I/O multiplexing and understand interruptible vs uninterruptible sleep states.
A process blocks when it requests something that isn't immediately available. Rather than busy-waiting (consuming CPU while repeatedly checking), the process asks the kernel to put it to sleep until the condition is met.
The alternative to blocking—busy-waiting:
Imagine waiting for a disk read without blocking:
```c
// Terrible approach: busy-waiting
while (!disk_data_ready()) {
    // Spin, wasting millions of CPU cycles
}
process_data();
```
On a 3 GHz CPU, this loop executes millions of check iterations while waiting for a disk (which takes milliseconds). It wastes power, heats the CPU, and prevents other processes from running.
The proper approach—blocking:
```c
// Correct approach: blocking
submit_disk_request();
sleep_until_data_ready();  // Process blocks here
process_data();            // Resumes when data arrives
```
The process surrenders the CPU for the duration of the wait. Other processes run. When the disk completes, the blocked process is awakened.
| Approach | CPU Usage | Power | Other Processes | Response Time |
|---|---|---|---|---|
| Busy-waiting (spin) | 100% of one core | High | Starved | Immediate when ready |
| Blocking (sleep) | 0% while waiting | Low (can sleep) | Can run | Small wakeup latency |
| Hybrid (spin then block) | Brief spike, then 0% | Medium | Mostly can run | Low latency for fast ops |
Spinning isn't always wrong. Spinlocks are appropriate when the expected wait is very short (shorter than the context switch overhead). If a lock is typically held for < 1μs, spinning wastes less CPU than the 5μs+ context switch. Modern kernels often use adaptive spinning: spin briefly, then block if the lock isn't released.
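A minimal user-space sketch of the spin-then-block idea, using a bounded trylock loop before falling back to a blocking lock (the SPIN_LIMIT constant and adaptive_lock helper are illustrative, not a standard API):

```c
#include <pthread.h>

#define SPIN_LIMIT 1000  // Illustrative bound; real kernels tune this adaptively

void adaptive_lock(pthread_mutex_t *m) {
    // Phase 1: spin briefly, hoping the holder releases within ~1 microsecond
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return;  // Acquired without blocking
    }
    // Phase 2: expected wait is long - block and let other processes run
    pthread_mutex_lock(m);
}
```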
A process in the Waiting (Blocked) state is paused, not consuming CPU, waiting for a specific event. It differs from the Ready state in a crucial way: a Ready process can run if given the CPU, but a Waiting process cannot—it needs something external first.
| Aspect | Ready State | Waiting State |
|---|---|---|
| CPU assignment | Eligible - waiting for dispatch | Ineligible - not schedulable |
| Missing resource | Only CPU time | Something else (I/O, lock, etc.) |
| Exit condition | Scheduler selects it | Event occurs (I/O done, lock free) |
| Queue location | Ready queue/run queue | Wait queue (device-specific) |
| Kernel view | TASK_RUNNING (runnable) | TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE |
Linux sleep states:
Linux distinguishes several sleeping states:
TASK_INTERRUPTIBLE: The common sleep state. The process wakes when the awaited event occurs or when a signal is delivered; an interrupted system call typically returns -EINTR.
TASK_UNINTERRUPTIBLE: Sleep that signals cannot disturb. Used when waking mid-operation could leave hardware or kernel data structures inconsistent, classically during disk I/O.
TASK_KILLABLE (since Linux 2.6.25): Uninterruptible sleep that still responds to fatal signals such as SIGKILL, so otherwise-stuck processes can at least be killed.
TASK_IDLE: Uninterruptible sleep for idle kernel threads that deliberately does not count toward the load average.
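The practical difference shows up in user space as EINTR: a signal can cut an interruptible sleep short. A minimal demonstration (assumes a terminal on stdin; press Ctrl-C while it waits):

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_sigint(int sig) { (void)sig; /* just interrupt the syscall */ }

int main(void) {
    // Install the handler WITHOUT SA_RESTART, so a blocked
    // system call fails with EINTR instead of being restarted
    struct sigaction sa = {0};
    sa.sa_handler = on_sigint;
    sigaction(SIGINT, &sa, NULL);

    char buf[64];
    // read() sleeps interruptibly: Ctrl-C wakes it before any data arrives
    ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
    if (n < 0 && errno == EINTR)
        printf("read() was interrupted by a signal (EINTR)\n");
    return 0;
}
```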
```bash
# View process states with ps
ps aux | head -1
# USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND

# STAT column meanings:
# S = Interruptible sleep (waiting for event)
# D = Uninterruptible sleep (usually I/O)
# R = Running or runnable (on run queue)
# T = Stopped (by signal or debugger)
# Z = Zombie (terminated but not reaped)

# Find all uninterruptible (D state) processes
ps aux | awk '$8 ~ /D/'
# These processes are stuck waiting for I/O
# Common for slow NFS mounts or failing disks

# Count processes in each state
ps -eo stat | sort | uniq -c | sort -rn
#    150 S   <- Sleeping (interruptible)
#     23 R   <- Running/runnable
#      5 D   <- Uninterruptible sleep
#      2 Z   <- Zombies

# Detailed scheduling info for a sleeping process
cat /proc/<pid>/status | grep State
# State: S (sleeping)

# What is a process waiting for?
cat /proc/<pid>/stack
# Shows kernel stack trace - reveals the blocking call
```

Processes in uninterruptible sleep (D state) cannot be killed—not even with kill -9. They're waiting for I/O that must complete. If a disk or NFS server becomes unresponsive, D-state processes accumulate. The only resolution is fixing the underlying I/O issue or rebooting. TASK_KILLABLE was introduced specifically to allow killing processes in some formerly-uninterruptible situations.
The Block transition is always initiated by the process itself (though indirectly through system calls). The process requests an operation that cannot complete immediately, and the kernel responds by putting it to sleep.
System calls that commonly block:
```c
// Common blocking system calls and their wait conditions

// FILE I/O BLOCKING
ssize_t n = read(fd, buf, size);
// Blocks until: data available, EOF, or error
// Wait queue: associated with the file descriptor / device

ssize_t n = write(fd, buf, size);
// Blocks until: buffer space available or error
// Can block on slow devices, full pipes, network congestion

// NETWORK BLOCKING
int client = accept(sockfd, NULL, NULL);
// Blocks until: incoming connection arrives
// Server spends most time blocked here

ssize_t n = recv(sockfd, buf, size, 0);
// Blocks until: data arrives on socket
// May block indefinitely waiting for peer

int result = connect(sockfd, addr, len);
// Blocks until: connection established or fails (timeout)
// TCP handshake takes multiple round-trips

// SYNCHRONIZATION BLOCKING
pthread_mutex_lock(&mutex);
// Blocks until: mutex becomes available
// May block indefinitely if holder doesn't release

sem_wait(&semaphore);
// Blocks until: semaphore count > 0
// Classic producer-consumer synchronization

// PROCESS BLOCKING
pid_t pid = wait(&status);
// Blocks until: any child terminates
// Parent often waits for children to complete

int result = pause();
// Blocks until: signal is delivered
// Unconditional block waiting for signal

// TIMER BLOCKING
unsigned remain = sleep(5);
// Blocks for: specified seconds
// Returns early if signal interrupts

int result = nanosleep(&req, &rem);
// Blocks for: specified nanoseconds
// Higher precision than sleep()

// MULTIPLEXED BLOCKING
int ready = select(nfds, &reads, &writes, &excepts, &timeout);
// Blocks until: any of multiple FDs is ready, or timeout
// Foundation of event-driven I/O

int ready = poll(fds, nfds, timeout);
// Similar to select, different interface

int nready = epoll_wait(epfd, events, max, timeout);
// Linux-specific, scalable event interface
// Efficiently handles thousands of connections
```

Most blocking calls have non-blocking variants. Setting O_NONBLOCK on a file descriptor makes read/write return immediately with EAGAIN if they would block. Locks have trylock variants. These are essential for event-driven programming but require the application to handle the 'not ready' case explicitly, as in the sketch below.
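A minimal sketch of the non-blocking variant in action, switching stdin to O_NONBLOCK and handling the 'not ready' case (error handling trimmed for brevity):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Switch stdin to non-blocking mode
    int flags = fcntl(STDIN_FILENO, F_GETFL, 0);
    fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);

    char buf[128];
    ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        // The call would have blocked; we must handle 'not ready' ourselves
        printf("no data ready - doing other work instead of sleeping\n");
    }
    return 0;
}
```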
The kernel must track which processes are waiting for which events. This is accomplished through wait queues—linked lists of processes waiting for a particular condition. Each blocking resource (device, file, lock) has an associated wait queue.
Wait queue structure:
A Linux wait queue consists of:
- A wait_queue_head attached to the resource, holding a spinlock and the linked list of waiters
- One wait_queue_entry per sleeping process, holding a pointer to its task_struct and a wakeup function
When a process blocks, it creates a wait_queue_entry, adds it to the appropriate queue, sets its state to INTERRUPTIBLE/UNINTERRUPTIBLE, and calls schedule(). When the event occurs, the kernel walks the queue and wakes up waiters.
```c
// Linux wait queue implementation
// Simplified from include/linux/wait.h and kernel/sched/wait.c

// Wait queue head - attached to the resource being waited on
struct wait_queue_head {
    spinlock_t lock;         // Protects the list
    struct list_head head;   // List of waiting processes
};

// Wait queue entry - one per waiting process
struct wait_queue_entry {
    unsigned int flags;            // WQ_FLAG_* flags
    struct task_struct *private;   // The process waiting
    wait_queue_func_t func;        // Wakeup function
    struct list_head entry;        // Links into wait_queue_head
};

// Typical blocking pattern: wait_event macro
#define wait_event(wq_head, condition)                            \
do {                                                              \
    if (condition)                                                \
        break;                                                    \
    DEFINE_WAIT(wait);                                            \
    for (;;) {                                                    \
        prepare_to_wait(&wq_head, &wait, TASK_INTERRUPTIBLE);     \
        if (condition)                                            \
            break;                                                \
        schedule();   /* Block here until woken */                \
    }                                                             \
    finish_wait(&wq_head, &wait);                                 \
} while (0)

// Add process to wait queue and set state
void prepare_to_wait(wait_queue_head_t *wq, wait_queue_entry_t *wait, int state) {
    spin_lock(&wq->lock);
    // Only add to queue if not already there
    if (list_empty(&wait->entry))
        list_add(&wait->entry, &wq->head);
    // Set process state to sleeping
    set_current_state(state);
    spin_unlock(&wq->lock);
}

// Wake up processes on a wait queue
void wake_up(wait_queue_head_t *wq) {
    wait_queue_entry_t *curr;
    spin_lock(&wq->lock);
    list_for_each_entry(curr, &wq->head, entry) {
        // Call the wakeup function for each waiter
        curr->func(curr, TASK_NORMAL, 0, NULL);
    }
    spin_unlock(&wq->lock);
}

// Default wakeup function
int default_wake_function(wait_queue_entry_t *wait) {
    // Move process from waiting to ready
    return try_to_wake_up(wait->private, TASK_NORMAL);
}
```

Some wait queues use 'exclusive wait' to wake only one waiter instead of all. This prevents the 'thundering herd' problem where many processes wake up, only one gets the resource, and the rest go back to sleep. Linux's accept() on sockets uses exclusive wait to avoid waking all listening processes for each connection; a sketch of the pattern follows.
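A sketch of the exclusive-wait pattern in the same simplified kernel style as above (prepare_to_wait_exclusive is the real kernel helper; the wq_head and condition here are placeholders):

```c
// Exclusive wait: the entry is flagged WQ_FLAG_EXCLUSIVE and queued at
// the tail; wake_up() wakes all non-exclusive waiters but stops after
// the first exclusive one, so only one sleeper contends for the resource
DEFINE_WAIT(wait);
for (;;) {
    prepare_to_wait_exclusive(&wq_head, &wait, TASK_INTERRUPTIBLE);
    if (condition)
        break;
    schedule();
}
finish_wait(&wq_head, &wait);
```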
Let's trace through a complete Block transition, using a read() from a slow device as an example:
```c
// Pseudocode for read() system call implementation
// Shows the blocking path in detail

ssize_t sys_read(int fd, char __user *buf, size_t count) {
    struct file *file = get_file(fd);
    struct device *dev = file->device;

    // 1. Check if data is already buffered
    if (device_has_data(dev)) {
        // Fast path: data available, no blocking
        ssize_t n = copy_to_user(buf, dev->buffer, count);
        return n;
    }

    // 2. No data available - must wait
    // Submit I/O request to device
    submit_io_request(dev, count);

    // 3. Block until data arrives
    DEFINE_WAIT(wait);
    for (;;) {
        // Add ourselves to device's wait queue
        prepare_to_wait(&dev->wait_queue, &wait, TASK_INTERRUPTIBLE);

        // Double-check if data arrived while setting up
        if (device_has_data(dev))
            break;

        // Check for pending signals (interruptible sleep)
        if (signal_pending(current)) {
            finish_wait(&dev->wait_queue, &wait);
            return -EINTR;   // Interrupted by signal
        }

        // Actually block - give up CPU
        // Process state changes: Running → Waiting
        schedule();

        // We return here when woken up
        // Loop back to check if condition is met
    }

    // 4. Remove from wait queue
    finish_wait(&dev->wait_queue, &wait);

    // 5. Data is now available
    ssize_t n = copy_to_user(buf, dev->buffer, count);
    return n;
}

// In the device driver's interrupt handler:
irqreturn_t device_interrupt_handler(int irq, void *dev_id) {
    struct device *dev = dev_id;

    // Device completed I/O, data is ready
    dev->data_ready = true;

    // Wake up anyone waiting on this device
    wake_up(&dev->wait_queue);
    // All waiters moved: Waiting → Ready

    return IRQ_HANDLED;
}
```

Notice the loop structure with repeated checking? This prevents 'lost wakeups'. If we set our state to INTERRUPTIBLE and then check the condition, there's a race: the event could occur between check and schedule(). We'd sleep forever. The loop pattern, with recheck after wakeup, ensures the condition is truly met before proceeding.
Many applications need to wait on multiple events simultaneously: data on any of several sockets, input from keyboard OR timeout, etc. The kernel provides mechanisms for this: select(), poll(), and epoll().
Multiplexed I/O blocking:
With select/poll/epoll, a process blocks until any of multiple conditions is met:
```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_EVENTS 64

// Assumed to be set up elsewhere: a listening socket
// and a per-client handler for the non-blocking I/O
extern int server_socket;
void handle_client(int fd);

// Event-driven server using epoll
int main() {
    int epfd = epoll_create1(0);

    // Add server socket
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = server_socket;
    epoll_ctl(epfd, EPOLL_CTL_ADD, server_socket, &ev);

    struct epoll_event events[MAX_EVENTS];

    while (1) {
        // Block until any registered FD has events
        // This single process can handle thousands of connections
        int nready = epoll_wait(epfd, events, MAX_EVENTS, -1);

        // Process blocks here until:
        // - A client sends data
        // - A new connection arrives
        // - A write becomes possible

        for (int i = 0; i < nready; i++) {
            int fd = events[i].data.fd;

            if (fd == server_socket) {
                // New connection - accept it
                int client = accept(server_socket, NULL, NULL);

                // Set non-blocking and add to epoll
                fcntl(client, F_SETFL, O_NONBLOCK);
                ev.events = EPOLLIN | EPOLLET;
                ev.data.fd = client;
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &ev);
            } else {
                // Client ready for I/O - handle it
                // Non-blocking read won't block since epoll
                // said it's ready
                handle_client(fd);
            }
        }
    }
}
```

| Mechanism | Max FDs | Scalability | Platform |
|---|---|---|---|
| select() | Usually 1024 | O(n) scan each call | POSIX (portable) |
| poll() | Unlimited | O(n) scan each call | POSIX (portable) |
| epoll() | Unlimited | O(1) for ready FDs | Linux only |
| kqueue() | Unlimited | O(1) for ready FDs | BSD, macOS |
| IOCP | Unlimited | O(1) | Windows |
High-performance servers (nginx, Node.js, Redis) use event-driven I/O with epoll/kqueue. Instead of one thread per connection (each potentially blocked), they use a single thread handling thousands of connections via non-blocking I/O and event multiplexing. The thread only blocks in epoll_wait(), waking when any connection needs attention.
Processes block not only for I/O but also for synchronization primitives. When a mutex, semaphore, or condition variable isn't available, the process must wait for another process to release it.
```c
#include <pthread.h>
#include <semaphore.h>

// Example: Blocking on a mutex
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void critical_section() {
    // If lock is held by another thread, we BLOCK here
    // Thread moves to Waiting state until lock is released
    pthread_mutex_lock(&lock);

    // ... critical section work ...

    pthread_mutex_unlock(&lock);
    // Other waiting threads may now wake up
}

// Example: Blocking on a semaphore (producer-consumer)
sem_t items;

void consumer() {
    for (;;) {
        // If semaphore count is 0, BLOCK here
        // Wait until producer adds an item
        sem_wait(&items);
        consume_item();
    }
}

void producer() {
    for (;;) {
        produce_item();
        // Increment semaphore, potentially waking consumer
        sem_post(&items);
    }
}

// Example: Blocking on condition variable
pthread_mutex_t cv_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
int ready = 0;

void waiter() {
    pthread_mutex_lock(&cv_lock);
    while (!ready) {
        // Release lock and block atomically
        // Thread waits until signaled AND lock reacquired
        pthread_cond_wait(&cv, &cv_lock);
        // When we return, we hold lock again
    }
    process_data();
    pthread_mutex_unlock(&cv_lock);
}

void signaler() {
    pthread_mutex_lock(&cv_lock);
    ready = 1;
    pthread_cond_signal(&cv);       // Wake one waiter
    // pthread_cond_broadcast(&cv); // Wake ALL waiters
    pthread_mutex_unlock(&cv_lock);
}
```

Futex: The modern Linux locking primitive:
User-space synchronization (pthread_mutex) is built on futex (Fast Userspace Mutex):
- Fast path: acquiring an uncontended lock is a single atomic compare-and-swap on a shared integer in user space; no system call is made
- Slow path: under contention, the thread calls futex(FUTEX_WAIT) to block in the kernel, and the lock holder calls futex(FUTEX_WAKE) on release
This design makes uncontended locking extremely fast (single atomic instruction), only involving the kernel when blocking is actually necessary.
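A minimal sketch of a futex-based lock illustrating both paths (simplified for clarity; the unlock here always issues a wake syscall, which production implementations avoid by tracking whether anyone is actually waiting):

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_word = 0;   // 0 = unlocked, 1 = locked

static long futex(atomic_int *uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void futex_lock(void) {
    int expected = 0;
    // Fast path: one atomic CAS in user space, no kernel involvement
    while (!atomic_compare_exchange_strong(&lock_word, &expected, 1)) {
        // Slow path: sleep in the kernel while the word still reads 1;
        // FUTEX_WAIT returns immediately if it has already changed
        futex(&lock_word, FUTEX_WAIT, 1);
        expected = 0;   // CAS overwrote 'expected'; reset before retrying
    }
}

void futex_unlock(void) {
    atomic_store(&lock_word, 0);
    futex(&lock_word, FUTEX_WAKE, 1);   // Wake at most one waiter
}
```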
When a high-priority process blocks on a lock held by a low-priority process, and a medium-priority process runs instead of the lock holder, the high-priority process is effectively blocked by the medium-priority one. This is priority inversion. Solutions include priority inheritance (boosting holder's priority) and priority ceiling (locks have fixed high priority).
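POSIX exposes priority inheritance through a mutex attribute; a minimal sketch of setting it up:

```c
#include <pthread.h>

pthread_mutex_t pi_lock;

void init_pi_lock(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // With PTHREAD_PRIO_INHERIT, a low-priority holder is temporarily
    // boosted to the priority of the highest-priority blocked waiter
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&pi_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}
```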
Each I/O device has associated wait queues for processes blocked on that device. These are called device queues or I/O queues. Understanding their organization helps explain I/O performance and scheduling.
I/O wait analysis:
The time a process spends in device queues (waiting for I/O) is called I/O wait time. Tools like top and iostat report this:
```bash
# System-wide I/O wait percentage
top -bn1 | grep Cpu
# Shows: %us user, %sy system, %wa I/O wait

# If %wa (I/O wait) is high, processes are blocking on I/O

# Per-device I/O statistics
iostat -x 1
# avgqu-sz: Average queue length (processes waiting)
# await: Average wait time (ms)
# %util: Device utilization

# Find processes in I/O wait (D state)
ps aux | awk '$8 ~ /D/ {print}'
# These processes are blocked on I/O

# What is a specific process waiting for?
cat /proc/<pid>/wchan
# Shows the kernel function where process is sleeping

# More detail with stack trace
cat /proc/<pid>/stack
# Shows full kernel stack trace

# Monitor block layer I/O
sudo bpftrace -e 'tracepoint:block:block_rq_insert { @[comm] = count(); }'
# Shows which processes are issuing I/O

# I/O wait breakdown by process
pidstat -d 1
# Shows per-process disk I/O and wait states
```

| I/O Type | Best Case | Typical | Worst Case |
|---|---|---|---|
| L1 cache hit | 1 ns | 1 ns | 1 ns |
| L3 cache hit | 10 ns | 20 ns | 50 ns |
| Main memory | 100 ns | 100 ns | 200 ns |
| NVMe SSD read | 20 μs | 100 μs | 500 μs |
| SATA SSD read | 100 μs | 500 μs | 2 ms |
| HDD read | 5 ms | 10 ms | 50 ms |
| Network (LAN) | 100 μs | 500 μs | 10 ms |
| Network (WAN) | 20 ms | 100 ms | 500 ms |
Just as the CPU scheduler decides which Ready process runs next, I/O schedulers decide which blocked process's I/O request is serviced next. For disks, elevator algorithms (like CFQ, deadline, or mq-deadline) reorder requests to minimize seek time, affecting how long each process waits in the device queue.
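On Linux, the active scheduler for a block device can be inspected through sysfs; a small sketch (the device name sda is an assumption, adjust for your system):

```c
#include <stdio.h>

int main(void) {
    // The scheduler shown in brackets is the active one,
    // e.g. "[mq-deadline] kyber bfq none"
    FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");
    char line[256];
    if (f != NULL) {
        if (fgets(line, sizeof line, f))
            printf("active I/O scheduler: %s", line);
        fclose(f);
    }
    return 0;
}
```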
We've thoroughly explored the Block transition—the mechanism by which processes voluntarily give up the CPU while waiting for external events.
Looking ahead:
We've now seen processes block and enter the Waiting state. But what happens when the event they're waiting for occurs? The Wake transition (Waiting → Ready) returns processes to the schedulable pool. This is our final transition, completing the process state machine.
You now understand the Block transition (Running → Waiting) comprehensively—from the motivations for blocking through wait queue implementation, I/O multiplexing, synchronization primitives, and device queues. Next, we'll examine the Wake transition that moves processes from Waiting back to Ready.