Not all work can be done instantly. A process may need to read data from a slow disk, wait for a network packet, receive user input, or acquire a lock held by another process. In these situations, the process cannot make forward progress—continuing to run would just waste CPU cycles spinning in place.
The Block transition (also called the Wait or Sleep transition) moves a process from the Running state to the Waiting (or Blocked) state. The process voluntarily surrenders the CPU, acknowledging that it cannot proceed until some external event occurs. This is fundamentally different from the Timeout transition—blocking is the process's choice, not the kernel's enforcement.
This transition is crucial for system efficiency. Without blocking, I/O-bound processes would spin-wait, wasting enormous CPU resources. Instead, they sleep, allowing other processes to run productively.
By the end of this page, you will understand why and how processes block—from the various conditions that trigger blocking, through the kernel's wait queue mechanism, to the data structures that track sleeping processes. You'll see how blocking enables efficient I/O multiplexing and understand interruptible vs uninterruptible sleep states.
A process blocks when it requests something that isn't immediately available. Rather than busy-waiting (consuming CPU while repeatedly checking), the process asks the kernel to put it to sleep until the condition is met.
The alternative to blocking—busy-waiting:
Imagine waiting for a disk read without blocking:
```c
// Terrible approach: busy-waiting
while (!disk_data_ready()) {
    // Spin, wasting millions of CPU cycles
}
process_data();
```
On a 3 GHz CPU, this loop executes millions of check iterations while waiting for a disk (which takes milliseconds). It wastes power, heats the CPU, and prevents other processes from running.
The proper approach—blocking:
```c
// Correct approach: blocking
submit_disk_request();
sleep_until_data_ready();  // Process blocks here
process_data();            // Resumes when data arrives
```
The process surrenders the CPU for the duration of the wait. Other processes run. When the disk completes, the blocked process is awakened.
| Approach | CPU Usage | Power | Other Processes | Response Time |
|---|---|---|---|---|
| Busy-waiting (spin) | 100% of one core | High | Starved | Immediate when ready |
| Blocking (sleep) | 0% while waiting | Low (can sleep) | Can run | Small wakeup latency |
| Hybrid (spin then block) | Brief spike, then 0% | Medium | Mostly can run | Low latency for fast ops |
Spinning isn't always wrong. Spinlocks are appropriate when the expected wait is very short (shorter than the context switch overhead). If a lock is typically held for < 1μs, spinning wastes less CPU than the 5μs+ context switch. Modern kernels often use adaptive spinning: spin briefly, then block if the lock isn't released.
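A minimal user-space sketch of the spin-then-block idea, using a bounded trylock loop before falling back to a blocking lock (the SPIN_LIMIT constant and adaptive_lock helper are illustrative, not a standard API):

```c
#include <pthread.h>

#define SPIN_LIMIT 1000  // Illustrative bound; real kernels tune this adaptively

void adaptive_lock(pthread_mutex_t *m) {
    // Phase 1: spin briefly, hoping the holder releases within ~1 microsecond
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return;  // Acquired without blocking
    }
    // Phase 2: expected wait is long - block and let other processes run
    pthread_mutex_lock(m);
}
```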
A process in the Waiting (Blocked) state is paused, not consuming CPU, waiting for a specific event. It differs from the Ready state in a crucial way: a Ready process can run if given the CPU, but a Waiting process cannot—it needs something external first.
| Aspect | Ready State | Waiting State |
|---|---|---|
| CPU assignment | Eligible - waiting for dispatch | Ineligible - not schedulable |
| Missing resource | Only CPU time | Something else (I/O, lock, etc.) |
| Exit condition | Scheduler selects it | Event occurs (I/O done, lock free) |
| Queue location | Ready queue/run queue | Wait queue (device-specific) |
| Kernel view | TASK_RUNNING (runnable) | TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE |
Linux sleep states:
Linux distinguishes several sleeping states:
TASK_INTERRUPTIBLE: The common sleep state. The process wakes when the awaited event occurs or when a signal is delivered; an interrupted system call typically returns -EINTR.
TASK_UNINTERRUPTIBLE: Sleep that signals cannot disturb. Used when waking mid-operation could leave hardware or kernel data structures inconsistent, classically during disk I/O.
TASK_KILLABLE (since Linux 2.6.25): Uninterruptible sleep that still responds to fatal signals such as SIGKILL, so otherwise-stuck processes can at least be killed.
TASK_IDLE: Uninterruptible sleep for idle kernel threads that deliberately does not count toward the load average.
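The practical difference shows up in user space as EINTR: a signal can cut an interruptible sleep short. A minimal demonstration (assumes a terminal on stdin; press Ctrl-C while it waits):

```c
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_sigint(int sig) { (void)sig; /* just interrupt the syscall */ }

int main(void) {
    // Install the handler WITHOUT SA_RESTART, so a blocked
    // system call fails with EINTR instead of being restarted
    struct sigaction sa = {0};
    sa.sa_handler = on_sigint;
    sigaction(SIGINT, &sa, NULL);

    char buf[64];
    // read() sleeps interruptibly: Ctrl-C wakes it before any data arrives
    ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
    if (n < 0 && errno == EINTR)
        printf("read() was interrupted by a signal (EINTR)\n");
    return 0;
}
```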
```bash
# View process states with ps
ps aux | head -1
# USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND

# STAT column meanings:
# S = Interruptible sleep (waiting for event)
# D = Uninterruptible sleep (usually I/O)
# R = Running or runnable (on run queue)
# T = Stopped (by signal or debugger)
# Z = Zombie (terminated but not reaped)

# Find all uninterruptible (D state) processes
ps aux | awk '$8 ~ /D/'
# These processes are stuck waiting for I/O
# Common for slow NFS mounts or failing disks

# Count processes in each state
ps -eo stat | sort | uniq -c | sort -rn
#    150 S   <- Sleeping (interruptible)
#     23 R   <- Running/runnable
#      5 D   <- Uninterruptible sleep
#      2 Z   <- Zombies

# Detailed scheduling info for a sleeping process
cat /proc/<pid>/status | grep State
# State: S (sleeping)

# What is a process waiting for?
cat /proc/<pid>/stack
# Shows kernel stack trace - reveals the blocking call
```

Processes in uninterruptible sleep (D state) cannot be killed—not even with kill -9. They're waiting for I/O that must complete. If a disk or NFS server becomes unresponsive, D-state processes accumulate. The only resolution is fixing the underlying I/O issue or rebooting. TASK_KILLABLE was introduced specifically to allow killing processes in some formerly-uninterruptible situations.
The Block transition is always initiated by the process itself (though indirectly through system calls). The process requests an operation that cannot complete immediately, and the kernel responds by putting it to sleep.
System calls that commonly block:
```c
// Common blocking system calls and their wait conditions

// FILE I/O BLOCKING
ssize_t n = read(fd, buf, size);
// Blocks until: data available, EOF, or error
// Wait queue: associated with the file descriptor / device

ssize_t n = write(fd, buf, size);
// Blocks until: buffer space available or error
// Can block on slow devices, full pipes, network congestion

// NETWORK BLOCKING
int client = accept(sockfd, NULL, NULL);
// Blocks until: incoming connection arrives
// Server spends most time blocked here

ssize_t n = recv(sockfd, buf, size, 0);
// Blocks until: data arrives on socket
// May block indefinitely waiting for peer

int result = connect(sockfd, addr, len);
// Blocks until: connection established or fails (timeout)
// TCP handshake takes multiple round-trips

// SYNCHRONIZATION BLOCKING
pthread_mutex_lock(&mutex);
// Blocks until: mutex becomes available
// May block indefinitely if holder doesn't release

sem_wait(&semaphore);
// Blocks until: semaphore count > 0
// Classic producer-consumer synchronization

// PROCESS BLOCKING
pid_t pid = wait(&status);
// Blocks until: any child terminates
// Parent often waits for children to complete

int result = pause();
// Blocks until: signal is delivered
// Unconditional block waiting for signal

// TIMER BLOCKING
unsigned remain = sleep(5);
// Blocks for: specified seconds
// Returns early if signal interrupts

int result = nanosleep(&req, &rem);
// Blocks for: specified nanoseconds
// Higher precision than sleep()

// MULTIPLEXED BLOCKING
int ready = select(nfds, &reads, &writes, &excepts, &timeout);
// Blocks until: any of multiple FDs is ready, or timeout
// Foundation of event-driven I/O

int ready = poll(fds, nfds, timeout);
// Similar to select, different interface

int nready = epoll_wait(epfd, events, max, timeout);
// Linux-specific, scalable event interface
// Efficiently handles thousands of connections
```

Most blocking calls have non-blocking variants. Setting O_NONBLOCK on a file descriptor makes read/write return immediately with EAGAIN if they would block. Locks have trylock variants. These are essential for event-driven programming but require the application to handle the 'not ready' case explicitly, as in the sketch below.
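A minimal sketch of the non-blocking variant in action, switching stdin to O_NONBLOCK and handling the 'not ready' case (error handling trimmed for brevity):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Switch stdin to non-blocking mode
    int flags = fcntl(STDIN_FILENO, F_GETFL, 0);
    fcntl(STDIN_FILENO, F_SETFL, flags | O_NONBLOCK);

    char buf[128];
    ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        // The call would have blocked; we must handle 'not ready' ourselves
        printf("no data ready - doing other work instead of sleeping\n");
    }
    return 0;
}
```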
The kernel must track which processes are waiting for which events. This is accomplished through wait queues—linked lists of processes waiting for a particular condition. Each blocking resource (device, file, lock) has an associated wait queue.
Wait queue structure:
A Linux wait queue consists of:
- A wait_queue_head attached to the resource, holding a spinlock and the linked list of waiters
- One wait_queue_entry per sleeping process, holding a pointer to its task_struct and a wakeup function
When a process blocks, it creates a wait_queue_entry, adds it to the appropriate queue, sets its state to INTERRUPTIBLE/UNINTERRUPTIBLE, and calls schedule(). When the event occurs, the kernel walks the queue and wakes up waiters.
```c
// Linux wait queue implementation
// Simplified from include/linux/wait.h and kernel/sched/wait.c

// Wait queue head - attached to the resource being waited on
struct wait_queue_head {
    spinlock_t lock;         // Protects the list
    struct list_head head;   // List of waiting processes
};

// Wait queue entry - one per waiting process
struct wait_queue_entry {
    unsigned int flags;            // WQ_FLAG_* flags
    struct task_struct *private;   // The process waiting
    wait_queue_func_t func;        // Wakeup function
    struct list_head entry;        // Links into wait_queue_head
};

// Typical blocking pattern: wait_event macro
#define wait_event(wq_head, condition)                            \
do {                                                              \
    if (condition)                                                \
        break;                                                    \
    DEFINE_WAIT(wait);                                            \
    for (;;) {                                                    \
        prepare_to_wait(&wq_head, &wait, TASK_INTERRUPTIBLE);     \
        if (condition)                                            \
            break;                                                \
        schedule();   /* Block here until woken */                \
    }                                                             \
    finish_wait(&wq_head, &wait);                                 \
} while (0)

// Add process to wait queue and set state
void prepare_to_wait(wait_queue_head_t *wq, wait_queue_entry_t *wait, int state) {
    spin_lock(&wq->lock);
    // Only add to queue if not already there
    if (list_empty(&wait->entry))
        list_add(&wait->entry, &wq->head);
    // Set process state to sleeping
    set_current_state(state);
    spin_unlock(&wq->lock);
}

// Wake up processes on a wait queue
void wake_up(wait_queue_head_t *wq) {
    wait_queue_entry_t *curr;
    spin_lock(&wq->lock);
    list_for_each_entry(curr, &wq->head, entry) {
        // Call the wakeup function for each waiter
        curr->func(curr, TASK_NORMAL, 0, NULL);
    }
    spin_unlock(&wq->lock);
}

// Default wakeup function
int default_wake_function(wait_queue_entry_t *wait) {
    // Move process from waiting to ready
    return try_to_wake_up(wait->private, TASK_NORMAL);
}
```

Some wait queues use 'exclusive wait' to wake only one waiter instead of all. This prevents the 'thundering herd' problem where many processes wake up, only one gets the resource, and the rest go back to sleep. Linux's accept() on sockets uses exclusive wait to avoid waking all listening processes for each connection; a sketch of the pattern follows.
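A sketch of the exclusive-wait pattern in the same simplified kernel style as above (prepare_to_wait_exclusive is the real kernel helper; the wq_head and condition here are placeholders):

```c
// Exclusive wait: the entry is flagged WQ_FLAG_EXCLUSIVE and queued at
// the tail; wake_up() wakes all non-exclusive waiters but stops after
// the first exclusive one, so only one sleeper contends for the resource
DEFINE_WAIT(wait);
for (;;) {
    prepare_to_wait_exclusive(&wq_head, &wait, TASK_INTERRUPTIBLE);
    if (condition)
        break;
    schedule();
}
finish_wait(&wq_head, &wait);
```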
Let's trace through a complete Block transition, using a read() from a slow device as an example:
```c
// Pseudocode for read() system call implementation
// Shows the blocking path in detail

ssize_t sys_read(int fd, char __user *buf, size_t count) {
    struct file *file = get_file(fd);
    struct device *dev = file->device;

    // 1. Check if data is already buffered
    if (device_has_data(dev)) {
        // Fast path: data available, no blocking
        ssize_t n = copy_to_user(buf, dev->buffer, count);
        return n;
    }

    // 2. No data available - must wait
    // Submit I/O request to device
    submit_io_request(dev, count);

    // 3. Block until data arrives
    DEFINE_WAIT(wait);
    for (;;) {
        // Add ourselves to device's wait queue
        prepare_to_wait(&dev->wait_queue, &wait, TASK_INTERRUPTIBLE);

        // Double-check if data arrived while setting up
        if (device_has_data(dev))
            break;

        // Check for pending signals (interruptible sleep)
        if (signal_pending(current)) {
            finish_wait(&dev->wait_queue, &wait);
            return -EINTR;   // Interrupted by signal
        }

        // Actually block - give up CPU
        // Process state changes: Running → Waiting
        schedule();

        // We return here when woken up
        // Loop back to check if condition is met
    }

    // 4. Remove from wait queue
    finish_wait(&dev->wait_queue, &wait);

    // 5. Data is now available
    ssize_t n = copy_to_user(buf, dev->buffer, count);
    return n;
}

// In the device driver's interrupt handler:
irqreturn_t device_interrupt_handler(int irq, void *dev_id) {
    struct device *dev = dev_id;

    // Device completed I/O, data is ready
    dev->data_ready = true;

    // Wake up anyone waiting on this device
    wake_up(&dev->wait_queue);
    // All waiters moved: Waiting → Ready

    return IRQ_HANDLED;
}
```

Notice the loop structure with repeated checking? This prevents 'lost wakeups'. If we set our state to INTERRUPTIBLE and then check the condition, there's a race: the event could occur between check and schedule(). We'd sleep forever. The loop pattern, with recheck after wakeup, ensures the condition is truly met before proceeding.
Many applications need to wait on multiple events simultaneously: data on any of several sockets, input from keyboard OR timeout, etc. The kernel provides mechanisms for this: select(), poll(), and epoll().
Multiplexed I/O blocking:
With select/poll/epoll, a process blocks until any of multiple conditions is met:
```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_EVENTS 64

// Assumed to be set up elsewhere: a listening socket
// and a per-client handler for the non-blocking I/O
extern int server_socket;
void handle_client(int fd);

// Event-driven server using epoll
int main() {
    int epfd = epoll_create1(0);

    // Add server socket
    struct epoll_event ev;
    ev.events = EPOLLIN;
    ev.data.fd = server_socket;
    epoll_ctl(epfd, EPOLL_CTL_ADD, server_socket, &ev);

    struct epoll_event events[MAX_EVENTS];

    while (1) {
        // Block until any registered FD has events
        // This single process can handle thousands of connections
        int nready = epoll_wait(epfd, events, MAX_EVENTS, -1);

        // Process blocks here until:
        // - A client sends data
        // - A new connection arrives
        // - A write becomes possible

        for (int i = 0; i < nready; i++) {
            int fd = events[i].data.fd;

            if (fd == server_socket) {
                // New connection - accept it
                int client = accept(server_socket, NULL, NULL);

                // Set non-blocking and add to epoll
                fcntl(client, F_SETFL, O_NONBLOCK);
                ev.events = EPOLLIN | EPOLLET;
                ev.data.fd = client;
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &ev);
            } else {
                // Client ready for I/O - handle it
                // Non-blocking read won't block since epoll
                // said it's ready
                handle_client(fd);
            }
        }
    }
}
```

| Mechanism | Max FDs | Scalability | Platform |
|---|---|---|---|
| select() | Usually 1024 | O(n) scan each call | POSIX (portable) |
| poll() | Unlimited | O(n) scan each call | POSIX (portable) |
| epoll() | Unlimited | O(1) for ready FDs | Linux only |
| kqueue() | Unlimited | O(1) for ready FDs | BSD, macOS |
| IOCP | Unlimited | O(1) | Windows |
High-performance servers (nginx, Node.js, Redis) use event-driven I/O with epoll/kqueue. Instead of one thread per connection (each potentially blocked), they use a single thread handling thousands of connections via non-blocking I/O and event multiplexing. The thread only blocks in epoll_wait(), waking when any connection needs attention.
Processes block not only for I/O but also for synchronization primitives. When a mutex, semaphore, or condition variable isn't available, the process must wait for another process to release it.
```c
#include <pthread.h>
#include <semaphore.h>

// Example: Blocking on a mutex
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void critical_section() {
    // If lock is held by another thread, we BLOCK here
    // Thread moves to Waiting state until lock is released
    pthread_mutex_lock(&lock);

    // ... critical section work ...

    pthread_mutex_unlock(&lock);
    // Other waiting threads may now wake up
}

// Example: Blocking on a semaphore (producer-consumer)
sem_t items;

void consumer() {
    for (;;) {
        // If semaphore count is 0, BLOCK here
        // Wait until producer adds an item
        sem_wait(&items);
        consume_item();
    }
}

void producer() {
    for (;;) {
        produce_item();
        // Increment semaphore, potentially waking consumer
        sem_post(&items);
    }
}

// Example: Blocking on condition variable
pthread_mutex_t cv_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
int ready = 0;

void waiter() {
    pthread_mutex_lock(&cv_lock);
    while (!ready) {
        // Release lock and block atomically
        // Thread waits until signaled AND lock reacquired
        pthread_cond_wait(&cv, &cv_lock);
        // When we return, we hold lock again
    }
    process_data();
    pthread_mutex_unlock(&cv_lock);
}

void signaler() {
    pthread_mutex_lock(&cv_lock);
    ready = 1;
    pthread_cond_signal(&cv);       // Wake one waiter
    // pthread_cond_broadcast(&cv); // Wake ALL waiters
    pthread_mutex_unlock(&cv_lock);
}
```

Futex: The modern Linux locking primitive:
User-space synchronization (pthread_mutex) is built on futex (Fast Userspace Mutex):
- Fast path: acquiring an uncontended lock is a single atomic compare-and-swap on a shared integer in user space; no system call is made
- Slow path: under contention, the thread calls futex(FUTEX_WAIT) to block in the kernel, and the lock holder calls futex(FUTEX_WAKE) on release
This design makes uncontended locking extremely fast (single atomic instruction), only involving the kernel when blocking is actually necessary.
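A minimal sketch of a futex-based lock illustrating both paths (simplified for clarity; the unlock here always issues a wake syscall, which production implementations avoid by tracking whether anyone is actually waiting):

```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int lock_word = 0;   // 0 = unlocked, 1 = locked

static long futex(atomic_int *uaddr, int op, int val) {
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void futex_lock(void) {
    int expected = 0;
    // Fast path: one atomic CAS in user space, no kernel involvement
    while (!atomic_compare_exchange_strong(&lock_word, &expected, 1)) {
        // Slow path: sleep in the kernel while the word still reads 1;
        // FUTEX_WAIT returns immediately if it has already changed
        futex(&lock_word, FUTEX_WAIT, 1);
        expected = 0;   // CAS overwrote 'expected'; reset before retrying
    }
}

void futex_unlock(void) {
    atomic_store(&lock_word, 0);
    futex(&lock_word, FUTEX_WAKE, 1);   // Wake at most one waiter
}
```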
When a high-priority process blocks on a lock held by a low-priority process, and a medium-priority process runs instead of the lock holder, the high-priority process is effectively blocked by the medium-priority one. This is priority inversion. Solutions include priority inheritance (boosting holder's priority) and priority ceiling (locks have fixed high priority).
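POSIX exposes priority inheritance through a mutex attribute; a minimal sketch of setting it up:

```c
#include <pthread.h>

pthread_mutex_t pi_lock;

void init_pi_lock(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // With PTHREAD_PRIO_INHERIT, a low-priority holder is temporarily
    // boosted to the priority of the highest-priority blocked waiter
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&pi_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}
```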
Each I/O device has associated wait queues for processes blocked on that device. These are called device queues or I/O queues. Understanding their organization helps explain I/O performance and scheduling.
I/O wait analysis:
The time a process spends in device queues (waiting for I/O) is called I/O wait time. Tools like top and iostat report this:
```bash
# System-wide I/O wait percentage
top -bn1 | grep Cpu
# Shows: %us user, %sy system, %wa I/O wait

# If %wa (I/O wait) is high, processes are blocking on I/O

# Per-device I/O statistics
iostat -x 1
# avgqu-sz: Average queue length (processes waiting)
# await: Average wait time (ms)
# %util: Device utilization

# Find processes in I/O wait (D state)
ps aux | awk '$8 ~ /D/ {print}'
# These processes are blocked on I/O

# What is a specific process waiting for?
cat /proc/<pid>/wchan
# Shows the kernel function where process is sleeping

# More detail with stack trace
cat /proc/<pid>/stack
# Shows full kernel stack trace

# Monitor block layer I/O
sudo bpftrace -e 'tracepoint:block:block_rq_insert { @[comm] = count(); }'
# Shows which processes are issuing I/O

# I/O wait breakdown by process
pidstat -d 1
# Shows per-process disk I/O and wait states
```

| I/O Type | Best Case | Typical | Worst Case |
|---|---|---|---|
| L1 cache hit | 1 ns | 1 ns | 1 ns |
| L3 cache hit | 10 ns | 20 ns | 50 ns |
| Main memory | 100 ns | 100 ns | 200 ns |
| NVMe SSD read | 20 μs | 100 μs | 500 μs |
| SATA SSD read | 100 μs | 500 μs | 2 ms |
| HDD read | 5 ms | 10 ms | 50 ms |
| Network (LAN) | 100 μs | 500 μs | 10 ms |
| Network (WAN) | 20 ms | 100 ms | 500 ms |
Just as the CPU scheduler decides which Ready process runs next, I/O schedulers decide which blocked process's I/O request is serviced next. For disks, elevator algorithms (like CFQ, deadline, or mq-deadline) reorder requests to minimize seek time, affecting how long each process waits in the device queue.
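On Linux, the active scheduler for a block device can be inspected through sysfs; a small sketch (the device name sda is an assumption, adjust for your system):

```c
#include <stdio.h>

int main(void) {
    // The scheduler shown in brackets is the active one,
    // e.g. "[mq-deadline] kyber bfq none"
    FILE *f = fopen("/sys/block/sda/queue/scheduler", "r");
    char line[256];
    if (f != NULL) {
        if (fgets(line, sizeof line, f))
            printf("active I/O scheduler: %s", line);
        fclose(f);
    }
    return 0;
}
```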
We've thoroughly explored the Block transition—the mechanism by which processes voluntarily give up the CPU while waiting for external events.
Looking ahead:
We've now seen processes block and enter the Waiting state. But what happens when the event they're waiting for occurs? The Wake transition (Waiting → Ready) returns processes to the schedulable pool. This is our final transition, completing the process state machine.
You now understand the Block transition (Running → Waiting) comprehensively—from the motivations for blocking through wait queue implementation, I/O multiplexing, synchronization primitives, and device queues. Next, we'll examine the Wake transition that moves processes from Waiting back to Ready.