While the CPU is blindingly fast, the physical world moves at a glacial pace by comparison. When a process needs to read from a disk, write to a network interface, or wait for keyboard input, it cannot simply spin waiting—that would waste precious CPU cycles. Instead, the operating system places the process in a device queue where it sleeps until the I/O operation completes.
Device queues are the interface between the world of processes and the world of hardware. They embody the fundamental asymmetry of modern computing: CPUs execute billions of instructions per second, but disks, networks, and human input operate thousands to millions of times slower. Understanding device queues is essential for understanding how operating systems achieve the illusion of responsiveness despite these vast speed disparities.
By the end of this page, you will understand why device queues exist, how they are organized per device and device class, the relationship between device queues and interrupt handling, I/O scheduling algorithms that order requests, and how processes transition between device queues and the ready queue.
To appreciate device queues, we must first understand the fundamental problem they solve: the I/O wait problem.
The Speed Disparity:
Consider the relative speeds of different operations:
| Operation | Typical Latency | CPU Cycles (3 GHz) | Human Time Analogy (1 cycle ≈ 1 s) |
|---|---|---|---|
| CPU Register Access | ~0.3 ns | ~1 cycle | 1 second |
| L1 Cache Hit | ~1 ns | ~3 cycles | 3 seconds |
| L2 Cache Hit | ~3 ns | ~10 cycles | 10 seconds |
| L3 Cache Hit | ~10 ns | ~30 cycles | 30 seconds |
| Main Memory Access | ~100 ns | ~300 cycles | 5 minutes |
| NVMe SSD Read | ~20 μs | ~60,000 cycles | 17 hours |
| SATA SSD Read | ~100 μs | ~300,000 cycles | 3.5 days |
| HDD Read (random) | ~10 ms | ~30,000,000 cycles | 1 year |
| Network Round Trip (LAN) | ~0.5 ms | ~1,500,000 cycles | 17 days |
| Network Round Trip (Internet) | ~50 ms | ~150,000,000 cycles | 5 years |
| Keyboard Input | ~100 ms | ~300,000,000 cycles | 10 years |
The Implication:
If a process were to wait synchronously for a disk read while holding the CPU, it would waste millions of CPU cycles that could execute useful instructions. This is the fundamental insight behind multiprogramming: while one process waits for I/O, another can use the CPU.
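The cycle counts in the table are just latency multiplied by clock rate. As a quick sanity check, the small sketch below recomputes two of them, assuming the approximate latencies and the 3 GHz clock used in the table:

```c
/* Back-of-the-envelope: cycles lost = latency (seconds) x clock rate (Hz). */
#include <stdio.h>

int main(void) {
    const double clock_hz    = 3e9;    /* 3 GHz CPU, as in the table */
    const double hdd_read_s  = 10e-3;  /* ~10 ms random HDD read */
    const double nvme_read_s = 20e-6;  /* ~20 us NVMe read */

    printf("HDD read:  ~%.0f cycles\n", hdd_read_s * clock_hz);   /* ~30,000,000 */
    printf("NVMe read: ~%.0f cycles\n", nvme_read_s * clock_hz);  /* ~60,000 */
    return 0;
}
```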
The Solution: Blocking and Queuing
When a process issues an I/O request:

1. The kernel builds a request and adds it to the appropriate device queue.
2. The process moves from running to a waiting (blocked) state and no longer competes for the CPU.
3. The scheduler dispatches another process from the ready queue.
4. When the device finishes, it raises an interrupt; the handler completes the request and wakes the process, which rejoins the ready queue.
Think of device queues like waiting lists at restaurants. When you order food (I/O request), you don't stand at the kitchen door—you wait at your table (device queue). The waiter (interrupt handler) notifies you when your food is ready, and then you can eat (process ready to run). Meanwhile, the kitchen (device) serves multiple orders (requests) efficiently.
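From user space, all of this machinery hides behind an ordinary blocking system call. The minimal sketch below (the file path is hypothetical) issues a read() that, if the data is not already cached, parks the process in the device's wait queue until the disk answers:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[4096];
    int fd = open("/var/log/syslog", O_RDONLY);   /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }

    /* If the page isn't cached, the kernel queues a disk request and blocks
     * this process; the CPU runs other work until the I/O completes. */
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0) { perror("read"); close(fd); return 1; }

    printf("read %zd bytes once the device queue serviced our request\n", n);
    close(fd);
    return 0;
}
```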
The operating system maintains separate queues for each device (or device class in some designs). This organization reflects the physical reality that each device can only service one request at a time (or a fixed number for some devices).
```c
/* Conceptual device queue structure */
struct device_queue {
    /* Device identification */
    struct device *dev;               /* Associated device */
    char name[32];                    /* Device name (e.g., "sda") */

    /* Queue of pending I/O requests */
    struct list_head request_queue;   /* I/O requests pending */
    struct request *current_request;  /* Currently serviced request */

    /* Queue of waiting processes (may be in request structures) */
    int waiting_count;                /* Number of blocked processes */

    /* Queue statistics */
    unsigned long total_requests;     /* Total requests serviced */
    unsigned long total_wait_time;    /* Cumulative wait time */

    /* Scheduling policy */
    struct elevator_type *elevator;   /* I/O scheduler */

    /* Synchronization */
    spinlock_t queue_lock;
    wait_queue_head_t wait;           /* Wait queue for blocking */
};

/* Individual I/O request structure */
struct request {
    struct list_head list;            /* Queue linkage */

    /* Request details */
    sector_t sector;                  /* Starting sector (for block devices) */
    unsigned int nr_sectors;          /* Number of sectors */
    enum req_op operation;            /* READ, WRITE, FLUSH, etc. */

    /* Buffers */
    void *buffer;                     /* Data buffer */
    struct bio *bio;                  /* Bio chain (Linux) */

    /* Associated process */
    struct task_struct *task;         /* Process waiting for this request */

    /* Timing */
    unsigned long start_time;         /* When request was submitted */
};
```

Linux Block Layer Implementation:
In Linux, block device queues are managed by the block layer, which sits between the filesystem and device drivers:
```c
/* Linux request queue structure (simplified) */
struct request_queue {
    /* Request lists */
    struct list_head queue_head;            /* List of pending requests */

    /* Queue limits */
    unsigned int nr_requests;               /* Max requests in queue */
    unsigned int nr_congestion_on;          /* Congestion threshold */

    /* I/O scheduler (elevator) */
    struct elevator_queue *elevator;

    /* Make request function - entry point from filesystem */
    make_request_fn *make_request_fn;

    /* Per-queue statistics */
    struct queue_limits limits;

    /* Backing device info */
    struct backing_dev_info *backing_dev_info;

    /* Synchronization */
    spinlock_t queue_lock;
};

/* Submit I/O request and potentially block */
void submit_bio_wait(struct bio *bio) {
    struct completion done;

    init_completion(&done);
    bio->bi_private = &done;
    bio->bi_end_io = bio_complete;

    /* Submit to device queue */
    submit_bio(bio);

    /* Block current process until I/O completes */
    wait_for_completion(&done);
}

/* Called by device driver when I/O completes */
void bio_complete(struct bio *bio) {
    struct completion *done = bio->bi_private;

    /* Wake up the waiting process */
    complete(done);
}
```

Modern Linux uses the blk-mq (Multi-Queue Block Layer) which maintains per-CPU submission queues and per-hardware-queue dispatch queues. This design was introduced in Linux 3.13 to better match the parallel capabilities of NVMe SSDs, which can handle hundreds of thousands of IOPS across multiple hardware queues.
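The structures below are a conceptual sketch of that blk-mq idea, not the real struct blk_mq_* definitions: each CPU submits into its own software queue, and software queues are mapped onto however many hardware queues the device exposes.

```c
/* Illustrative only - not the actual kernel blk-mq structures. */
#define MAX_CPUS 64

struct request;                        /* forward declaration */

struct sw_queue {                      /* per-CPU software submission queue */
    struct request *pending[256];
    int head, tail;
};

struct hw_queue {                      /* per-hardware-queue dispatch context */
    int hw_index;                      /* which device hardware queue this feeds */
    struct sw_queue *mapped[MAX_CPUS]; /* software queues mapped onto it */
    int nr_mapped;
};

struct mq_device {
    struct sw_queue sw[MAX_CPUS];      /* one submission queue per CPU */
    struct hw_queue *hw;               /* array: one entry per hardware queue */
    int nr_hw_queues;                  /* e.g., 1 for SATA, many for NVMe */
};
```

The payoff of this split is that CPUs rarely contend on a shared queue lock: submissions stay CPU-local until dispatch.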
Device queues aren't simply FIFO. For many devices, especially rotational hard drives, the order in which requests are serviced dramatically affects performance. I/O schedulers (also called elevators) reorder requests to optimize throughput and latency.
| Scheduler | Strategy | Best For | Drawbacks |
|---|---|---|---|
| FCFS (None) | First-come, first-served | Simple devices, SSDs | Poor HDD performance, no fairness |
| SSTF | Shortest Seek Time First | Throughput optimization | Starvation of distant requests |
| SCAN (Elevator) | Move head in one direction, then reverse | Balanced performance | Middle requests favored |
| C-SCAN | Circular SCAN - only one direction | More uniform wait time | Empty return sweep wastes time |
| LOOK/C-LOOK | Like SCAN but only to farthest request | Efficient SCAN variant | Same as SCAN, less wasteful |
| CFQ | Complete Fairness Queueing (Linux) | Interactive workloads | Complex; removed in Linux 5.0 |
| Deadline | Separate read/write queues with deadlines | Avoiding starvation | May sacrifice throughput |
| BFQ | Budget Fair Queueing | Desktop/interactive | Higher CPU overhead |
| mq-deadline | Multi-queue deadline scheduler | NVMe and modern SSDs (default on many Linux systems) | No per-process fairness guarantees |
| none/noop | No reordering, simple FIFO | NVMe/SSD (no seek) | Poor for HDDs |
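To make the elevator idea concrete, here is a minimal user-space sketch (not kernel code) of SCAN ordering: pending sectors at or above the current head position are served in increasing order, then the head reverses and serves the rest on the way back.

```c
/* Minimal SCAN (elevator) ordering sketch over an array of pending sectors. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_long(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Print the SCAN service order starting at `head`, sweeping upward first. */
void scan_order(long *sectors, int n, long head) {
    qsort(sectors, n, sizeof(long), cmp_long);

    /* Forward sweep: everything at or above the head, in increasing order. */
    for (int i = 0; i < n; i++)
        if (sectors[i] >= head) printf("%ld ", sectors[i]);

    /* Reverse sweep: everything below the head, in decreasing order. */
    for (int i = n - 1; i >= 0; i--)
        if (sectors[i] < head) printf("%ld ", sectors[i]);
    printf("\n");
}

int main(void) {
    long pending[] = { 98, 183, 37, 122, 14, 124, 65, 67 };  /* example queue */
    scan_order(pending, 8, 53);                              /* head at sector 53 */
    return 0;
}
```

Production schedulers such as deadline layer time limits and batching on top of this kind of sector ordering, as the next section shows.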
Deadline Scheduler Deep Dive:
The deadline scheduler is particularly interesting because it explicitly prevents starvation while maintaining reasonable throughput:
```c
/* Deadline scheduler maintains four queues */
struct deadline_data {
    /* Sorted queues (by sector number) for seeking efficiency */
    struct rb_root sort_list[2];      /* [READ] and [WRITE] sorted lists */

    /* FIFO queues (by time) for deadline enforcement */
    struct list_head fifo_list[2];    /* [READ] and [WRITE] deadline lists */

    /* Batching state */
    int fifo_batch;                   /* How many to service before checking other queue */
    int writes_starved;               /* Count of times reads preempted writes */

    /* Tunable deadlines */
    int read_expire;                  /* Read deadline: 500ms default */
    int write_expire;                 /* Write deadline: 5000ms default */

    /* Which direction we're currently servicing */
    int data_dir;                     /* READ or WRITE */
    sector_t last_sector;             /* Last sector serviced (for batching) */
};

/* Deadline scheduler dispatch logic */
struct request *deadline_dispatch(struct deadline_data *dd) {
    struct request *rq;
    int data_dir = dd->data_dir;

    /* Check if there's an expired request (deadline missed) */
    rq = deadline_check_fifo(dd, READ);   /* Reads have priority */
    if (rq)
        return rq;                        /* Must service this immediately */

    rq = deadline_check_fifo(dd, WRITE);
    if (rq)
        return rq;

    /* No deadline pressure - batch requests in current direction */
    if (deadline_has_work(dd, data_dir)) {
        /* Get next request in sector order for efficient seeking */
        rq = deadline_next_request(dd, data_dir);
        if (rq)
            return rq;
    }

    /* Switch direction if other direction has work */
    if (deadline_has_work(dd, !data_dir)) {
        dd->data_dir = !data_dir;
        rq = deadline_next_request(dd, !data_dir);
        if (rq)
            return rq;
    }

    return NULL;   /* No work */
}

/* Check if any request in this queue has exceeded its deadline */
struct request *deadline_check_fifo(struct deadline_data *dd, int dir) {
    struct request *rq = list_first_entry_or_null(
        &dd->fifo_list[dir], struct request, queuelist);

    if (rq) {
        unsigned long deadline = dir == READ ? dd->read_expire
                                             : dd->write_expire;
        if (time_after(jiffies, rq->start_time + deadline)) {
            /* This request has waited too long */
            return rq;
        }
    }
    return NULL;   /* No expired requests */
}
```

For NVMe SSDs, use 'none' or 'mq-deadline': no seek optimization is needed, and minimal overhead is best. For SATA SSDs, 'mq-deadline' works well. For HDDs, 'bfq' provides good interactive response, while 'mq-deadline' maximizes throughput. You can check or change the scheduler with `cat /sys/block/sda/queue/scheduler` and `echo mq-deadline > /sys/block/sda/queue/scheduler`.
When a process blocks on a device queue, it uses the kernel's wait queue infrastructure. Wait queues are the general mechanism for putting processes to sleep until some condition is met.
```c
/* Wait queue head - one per condition/event */
struct wait_queue_head {
    spinlock_t lock;
    struct list_head head;            /* List of waiting tasks */
};
typedef struct wait_queue_head wait_queue_head_t;

/* Wait queue entry - one per waiting task */
struct wait_queue_entry {
    unsigned int flags;
    void *private;                    /* Usually the task_struct pointer */
    wait_queue_func_t func;           /* Function to call on wake */
    struct list_head entry;           /* List linkage */
};

/* Initialize a wait queue head */
#define DECLARE_WAIT_QUEUE_HEAD(name) \
    struct wait_queue_head name = { \
        .lock = __SPIN_LOCK_UNLOCKED(name.lock), \
        .head = LIST_HEAD_INIT(name.head) \
    }

/* Initialize a wait queue entry */
#define DECLARE_WAITQUEUE(name, tsk) \
    struct wait_queue_entry name = { \
        .private = tsk, \
        .func = default_wake_function, \
        .entry = LIST_HEAD_INIT(name.entry) \
    }

/* Adding/removing from wait queue */
void add_wait_queue(struct wait_queue_head *wq_head,
                    struct wait_queue_entry *wq_entry) {
    unsigned long flags;

    spin_lock_irqsave(&wq_head->lock, flags);
    list_add(&wq_entry->entry, &wq_head->head);
    spin_unlock_irqrestore(&wq_head->lock, flags);
}

void remove_wait_queue(struct wait_queue_head *wq_head,
                       struct wait_queue_entry *wq_entry) {
    unsigned long flags;

    spin_lock_irqsave(&wq_head->lock, flags);
    list_del(&wq_entry->entry);
    spin_unlock_irqrestore(&wq_head->lock, flags);
}
```

The Sleep/Wake Pattern:
Blocking on an I/O operation follows a specific pattern to avoid race conditions:
```c
/* Correct blocking pattern - avoids lost wakeup race */
int wait_for_io_completion(struct io_request *req) {
    DECLARE_WAITQUEUE(wait, current);
    int result;

    /* Add ourselves to wait queue BEFORE checking condition */
    add_wait_queue(&req->wait_queue, &wait);

    while (1) {
        /* Set state to INTERRUPTIBLE or UNINTERRUPTIBLE */
        set_current_state(TASK_INTERRUPTIBLE);

        /* Check if condition is already satisfied */
        if (io_request_complete(req))
            break;

        /* Check for signals (if INTERRUPTIBLE) */
        if (signal_pending(current)) {
            result = -ERESTARTSYS;
            goto out;
        }

        /* Actually sleep - scheduler will run another task */
        schedule();

        /* We've been woken up - loop back to check condition */
    }

    result = req->result;

out:
    /* Reset state and remove from wait queue */
    set_current_state(TASK_RUNNING);
    remove_wait_queue(&req->wait_queue, &wait);
    return result;
}

/* Wake up waiting process when I/O completes (interrupt context) */
void complete_io_request(struct io_request *req) {
    /* Mark request complete */
    req->complete = true;
    req->result = 0;   /* Success */

    /* Wake up waiters */
    wake_up(&req->wait_queue);
}

/* Convenience macros hide this complexity */
/* wait_event(wq, condition)                  - uninterruptible wait */
/* wait_event_interruptible(wq, condition)    - can be signaled */
/* wait_event_timeout(wq, condition, timeout) - with timeout */

int read_from_device(struct device *dev, void *buf, size_t count) {
    struct io_request req;

    init_io_request(&req, READ, buf, count);
    submit_io_request(dev, &req);

    /* This macro handles all the sleep/wake logic */
    return wait_event_interruptible(req.wait_queue, io_request_complete(&req));
}
```

A critical race condition can occur if you check the condition before adding to the wait queue. If the I/O completes between checking and sleeping, the wakeup is 'lost' and the process sleeps forever. Always add to the wait queue first, then check the condition in a loop.
When an I/O operation completes, the device signals the CPU via an interrupt. This interrupt triggers a chain of events that moves the waiting process from the device queue to the ready queue.
```c
/* Device driver interrupt handler */
irqreturn_t disk_interrupt_handler(int irq, void *dev_id) {
    struct block_device *bdev = dev_id;
    struct request *req;

    /* Acknowledge interrupt to hardware */
    ack_device_interrupt(bdev);

    /* Get the completed request */
    req = bdev->current_request;
    if (!req)
        return IRQ_NONE;   /* Spurious interrupt */

    /* Transfer data if this was a read */
    if (req->operation == READ) {
        copy_from_device_buffer(req->buffer, bdev, req->nr_sectors);
    }

    /* Mark request complete */
    req->error = 0;   /* Success */

    /* Complete the request - this wakes the waiting process */
    blk_mq_complete_request(req);

    /* Start next request if any */
    bdev->current_request = get_next_request(bdev);
    if (bdev->current_request)
        start_device_operation(bdev);

    return IRQ_HANDLED;
}

/* Request completion path */
void blk_mq_complete_request(struct request *rq) {
    /* May need to defer to softirq if in hard irq context */
    if (in_hardirq()) {
        /* Schedule completion to run in softirq context */
        raise_softirq(BLOCK_SOFTIRQ);
        return;
    }

    /* Call the request's completion function */
    rq->end_io(rq, rq->error);
}

/* Completion callback set by the submitting code */
void bio_end_io(struct bio *bio) {
    struct task_struct *waiter = bio->bi_private;

    /* Record completion status (success in this path) */
    bio->bi_status = BLK_STS_OK;

    /* Wake up the process that submitted this I/O */
    wake_up_process(waiter);
}

/* wake_up_process implementation */
int wake_up_process(struct task_struct *p) {
    return try_to_wake_up(p, TASK_NORMAL, 0);
}

int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) {
    unsigned long flags;
    struct rq *rq;

    /* Get the run queue this task is associated with */
    raw_spin_lock_irqsave(&p->pi_lock, flags);

    /* Check if task is in a wakeable state */
    if (!(p->state & state))
        goto out_unlock;

    /* Mark task as runnable */
    p->state = TASK_RUNNING;

    /* Add to ready queue */
    rq = task_rq(p);
    activate_task(rq, p, ENQUEUE_WAKEUP);

    /* Check if we should preempt current task */
    check_preempt_curr(rq, p, wake_flags);

out_unlock:
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    return 0;
}
```

Code running in interrupt context cannot sleep, allocate memory with GFP_KERNEL, acquire sleeping locks, or access user space. This is why interrupt handlers often defer work to softirqs or tasklets, which run with interrupts enabled and have fewer restrictions.
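One common way to respect those restrictions is to do only the urgent work in the hard-irq handler and defer the rest. The sketch below is hypothetical driver code (it uses a workqueue, which runs in process context, rather than a softirq or tasklet) showing the general pattern:

```c
#include <linux/interrupt.h>
#include <linux/workqueue.h>

struct my_device {
    struct work_struct finish_work;   /* deferred completion work */
    /* ... device registers, current request, etc. ... */
};

/* Runs later in process context: may sleep, use GFP_KERNEL, take mutexes. */
static void my_finish_work(struct work_struct *work)
{
    struct my_device *dev = container_of(work, struct my_device, finish_work);
    /* Copy data, complete requests, wake waiters... */
    (void)dev;
}

/* Hard-irq handler: acknowledge the device, then defer everything else. */
static irqreturn_t my_irq_handler(int irq, void *data)
{
    struct my_device *dev = data;

    schedule_work(&dev->finish_work);   /* queue the deferred work */
    return IRQ_HANDLED;
}

/* At probe time: INIT_WORK(&dev->finish_work, my_finish_work); */
```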
Real applications often need to wait for events from multiple sources simultaneously—network connections, files, timers, and user input. This requires mechanisms for waiting on multiple device queues at once.
| Mechanism | POSIX/Linux | Windows | Characteristics |
|---|---|---|---|
| Multiplexing | select(), poll() | WaitForMultipleObjects() | Block until any of N events ready |
| Advanced Multiplexing | epoll (Linux) | I/O Completion Ports | Scalable to 100K+ connections |
| Async I/O (POSIX) | aio_* functions | ReadFileEx() | Non-blocking, callback/signal notification |
| Async I/O (Modern) | io_uring (Linux 5.1+) | IOCP | Ring buffer based, batched syscalls |
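Before looking at epoll, here is what the simpler poll() interface from the first row looks like. This is a minimal sketch; sock_fd and file_fd are assumed to be descriptors opened elsewhere:

```c
#include <poll.h>
#include <stdio.h>

/* Block until either descriptor has data to read, then report which. */
int wait_for_either(int sock_fd, int file_fd)
{
    struct pollfd fds[2] = {
        { .fd = sock_fd, .events = POLLIN },
        { .fd = file_fd, .events = POLLIN },
    };

    int n = poll(fds, 2, -1);          /* -1 = no timeout: sleep until ready */
    if (n < 0) { perror("poll"); return -1; }

    if (fds[0].revents & POLLIN) printf("socket is readable\n");
    if (fds[1].revents & POLLIN) printf("file is readable\n");
    return n;
}
```

Because poll() passes and rescans the whole descriptor array on every call, its cost grows with the number of descriptors, which is why epoll scales so much better for very large connection counts.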
```c
/* epoll: Efficiently wait on thousands of file descriptors */
#include <sys/epoll.h>

int main() {
    /* Create epoll instance */
    int epfd = epoll_create1(0);

    /* Add file descriptors to monitor */
    struct epoll_event ev;
    ev.events = EPOLLIN;       /* Wait for read readiness */
    ev.data.fd = socket_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, socket_fd, &ev);

    ev.data.fd = file_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, file_fd, &ev);

    /* Wait for events */
    struct epoll_event events[MAX_EVENTS];
    while (1) {
        /* Block until at least one fd is ready */
        int nfds = epoll_wait(epfd, events, MAX_EVENTS, -1);

        /* Process ready file descriptors */
        for (int i = 0; i < nfds; i++) {
            if (events[i].data.fd == socket_fd) {
                handle_socket_data();
            } else if (events[i].data.fd == file_fd) {
                handle_file_data();
            }
        }
    }
}

/* Under the hood, epoll uses a red-black tree + ready list */
struct eventpoll {
    /* Lock protecting this structure */
    spinlock_t lock;

    /* RB-tree of monitored file descriptors */
    struct rb_root rbr;

    /* List of file descriptors ready for userspace */
    struct list_head rdllist;

    /* Wait queue for epoll_wait callers */
    wait_queue_head_t wq;
};

/* When any monitored fd becomes ready:
 * 1. fd's poll callback triggers
 * 2. Epitem is added to rdllist
 * 3. Process waiting in epoll_wait is woken
 * 4. Ready fds returned to userspace
 */
```

io_uring: The Modern Approach:
The latest Linux approach to asynchronous I/O uses shared ring buffers between kernel and userspace, eliminating most syscall overhead:
```c
/* io_uring: Ring-buffer based async I/O */
#include <liburing.h>
#include <stdio.h>

int main() {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;

    /* Initialize ring */
    io_uring_queue_init(32, &ring, 0);

    /* Get a submission queue entry */
    sqe = io_uring_get_sqe(&ring);

    /* Prepare a read operation (doesn't submit yet) */
    io_uring_prep_read(sqe, fd, buffer, sizeof(buffer), 0);
    sqe->user_data = 123;   /* Tag for completion matching */

    /* Submit all prepared ops to kernel */
    io_uring_submit(&ring);

    /* Wait for completion */
    io_uring_wait_cqe(&ring, &cqe);

    /* Process result */
    if (cqe->res >= 0) {
        printf("Read %d bytes\n", cqe->res);
    }

    /* Mark completion as consumed */
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
}

/*
 * io_uring advantage: batched submissions
 *
 * Traditional approach: 1000 reads = 1000 syscalls
 * io_uring approach:    1000 reads = ~1-2 syscalls
 *
 * Also supports:
 * - Polling mode (no interrupts needed)
 * - Linked operations
 * - Fixed buffers (no copy overhead)
 */
```

The evolution from select() to poll() to epoll to io_uring represents decades of solving the C10K, C100K, and C10M problems—handling 10 thousand, 100 thousand, and 10 million concurrent connections. Modern servers like nginx and HAProxy use epoll or io_uring to efficiently multiplex huge numbers of connections with minimal CPU overhead.
System administrators and developers need visibility into device queue behavior to diagnose performance issues and optimize I/O patterns.
```bash
# iostat - I/O statistics per device
$ iostat -xz 1
Device    r/s     w/s     rkB/s     wkB/s    avgrq-sz  avgqu-sz  await  svctm  %util
sda       100.0   50.0    4000.0    2000.0   80.0      2.5       16.0   3.0    45.0
nvme0n1   5000.0  2000.0  200000.0  80000.0  80.0      0.5       0.1    0.05   35.0

# Key metrics:
# avgqu-sz - Average queue depth (requests waiting)
# await    - Average wait time (queue + service)
# svctm    - Average service time (device only)
# %util    - Percentage of time device was busy

# iotop - I/O by process (like top for I/O)
$ sudo iotop
Total DISK READ: 50.00 M/s | Total DISK WRITE: 25.00 M/s
  PID  PRIO  USER      DISK READ   DISK WRITE  SWAPIN   IO>      COMMAND
 1234  be/4  mysql     45.00 M/s   20.00 M/s   0.00 %   50.00 %  mysqld
 5678  be/4  postgres   5.00 M/s    5.00 M/s   0.00 %   10.00 %  postgres

# /sys/block/*/queue/ - Per-device queue parameters
$ cat /sys/block/sda/queue/scheduler
[mq-deadline] none

$ cat /sys/block/sda/queue/nr_requests
256

$ cat /sys/block/sda/queue/queue_depth
32

# blktrace - Detailed block layer tracing
$ sudo blktrace -d /dev/sda -o - | blkparse -i -
  8,0    1    1     0.000000000  1234  Q   R 12345678 + 8 [process]
  8,0    1    2     0.000001000  1234  G   R 12345678 + 8 [process]
  8,0    1    3     0.000100000     0  D   R 12345678 + 8 [process]
  8,0    1    4     0.001000000     0  C   R 12345678 + 8 [0]
# Q=Queued, G=GetRequest, D=Dispatch, C=Complete
```

High average queue depth (avgqu-sz > 4 for HDD, > 32 for SSD) combined with high utilization (>80%) indicates the device is becoming a bottleneck. Either upgrade the device, reduce I/O load, or optimize access patterns (e.g., batching, caching).
Device queues bridge the gap between the fast world of CPU execution and the slow world of physical I/O. To consolidate:

- They exist because devices are thousands to millions of times slower than the CPU; blocking waiting processes frees the CPU for other work.
- They are organized per device (or device class), reflecting the limited number of requests each device can service at once.
- I/O schedulers (deadline, BFQ, mq-deadline, none) reorder queued requests to balance throughput, latency, and starvation avoidance.
- Processes block via the kernel's wait queue infrastructure and return to the ready queue when a completion interrupt wakes them.
- Multiplexing and async interfaces (select/poll, epoll, io_uring) let a single process wait on or submit many I/O operations at once.
What's Next:
Having explored the three main queue types (job queue, ready queue, device queues), we'll now examine how these queues are implemented at a data structure level. The next page covers Queue Implementation—the linked lists, arrays, and specialized structures that bring these abstractions to life.
You now understand device queues: their purpose in bridging CPU and I/O speeds, their per-device organization, the I/O scheduling algorithms that optimize access patterns, the wait queue mechanism for blocking, and how interrupts drive process wakeup. This knowledge is essential for understanding system performance and I/O-bound behavior.