While the CPU is blindingly fast, the physical world moves at a glacial pace by comparison. When a process needs to read from a disk, write to a network interface, or wait for keyboard input, it cannot simply spin waiting—that would waste precious CPU cycles. Instead, the operating system places the process in a device queue where it sleeps until the I/O operation completes.
Device queues are the interface between the world of processes and the world of hardware. They embody the fundamental asymmetry of modern computing: CPUs execute billions of instructions per second, but disks, networks, and human input operate thousands to millions of times slower. Understanding device queues is essential for understanding how operating systems achieve the illusion of responsiveness despite these vast speed disparities.
By the end of this page, you will understand why device queues exist, how they are organized per device and device class, the relationship between device queues and interrupt handling, I/O scheduling algorithms that order requests, and how processes transition between device queues and the ready queue.
To appreciate device queues, we must first understand the fundamental problem they solve: the I/O wait problem.
The Speed Disparity:
Consider the relative speeds of different operations:
| Operation | Typical Latency | CPU Cycles (3 GHz) | Human Time Analogy (1 cycle ≈ 1 s) |
|---|---|---|---|
| CPU Register Access | ~0.3 ns | ~1 cycle | 1 second |
| L1 Cache Hit | ~1 ns | ~3 cycles | 3 seconds |
| L2 Cache Hit | ~3 ns | ~10 cycles | 10 seconds |
| L3 Cache Hit | ~10 ns | ~30 cycles | 30 seconds |
| Main Memory Access | ~100 ns | ~300 cycles | 5 minutes |
| NVMe SSD Read | ~20 μs | ~60,000 cycles | 17 hours |
| SATA SSD Read | ~100 μs | ~300,000 cycles | 3.5 days |
| HDD Read (random) | ~10 ms | ~30,000,000 cycles | 1 year |
| Network Round Trip (LAN) | ~0.5 ms | ~1,500,000 cycles | 17 days |
| Network Round Trip (Internet) | ~50 ms | ~150,000,000 cycles | 5 years |
| Keyboard Input | ~100 ms | ~300,000,000 cycles | 10 years |
The Implication:
If a process were to wait synchronously for a disk read while holding the CPU, it would waste millions of CPU cycles that could execute useful instructions. This is the fundamental insight behind multiprogramming: while one process waits for I/O, another can use the CPU.
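The cycle counts in the table are just latency multiplied by clock rate. As a quick sanity check, the small sketch below recomputes two of them, assuming the approximate latencies and the 3 GHz clock used in the table:

```c
/* Back-of-the-envelope: cycles lost = latency (seconds) x clock rate (Hz). */
#include <stdio.h>

int main(void) {
    const double clock_hz    = 3e9;    /* 3 GHz CPU, as in the table */
    const double hdd_read_s  = 10e-3;  /* ~10 ms random HDD read */
    const double nvme_read_s = 20e-6;  /* ~20 us NVMe read */

    printf("HDD read:  ~%.0f cycles\n", hdd_read_s * clock_hz);   /* ~30,000,000 */
    printf("NVMe read: ~%.0f cycles\n", nvme_read_s * clock_hz);  /* ~60,000 */
    return 0;
}
```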
The Solution: Blocking and Queuing
When a process issues an I/O request:

1. The kernel builds a request and adds it to the appropriate device queue.
2. The process moves from running to a waiting (blocked) state and no longer competes for the CPU.
3. The scheduler dispatches another process from the ready queue.
4. When the device finishes, it raises an interrupt; the handler completes the request and wakes the process, which rejoins the ready queue.
Think of device queues like waiting lists at restaurants. When you order food (I/O request), you don't stand at the kitchen door—you wait at your table (device queue). The waiter (interrupt handler) notifies you when your food is ready, and then you can eat (process ready to run). Meanwhile, the kitchen (device) serves multiple orders (requests) efficiently.
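From user space, all of this machinery hides behind an ordinary blocking system call. The minimal sketch below (the file path is hypothetical) issues a read() that, if the data is not already cached, parks the process in the device's wait queue until the disk answers:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[4096];
    int fd = open("/var/log/syslog", O_RDONLY);   /* hypothetical path */
    if (fd < 0) { perror("open"); return 1; }

    /* If the page isn't cached, the kernel queues a disk request and blocks
     * this process; the CPU runs other work until the I/O completes. */
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0) { perror("read"); close(fd); return 1; }

    printf("read %zd bytes once the device queue serviced our request\n", n);
    close(fd);
    return 0;
}
```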
The operating system maintains separate queues for each device (or device class in some designs). This organization reflects the physical reality that each device can only service one request at a time (or a fixed number for some devices).
```c
/* Conceptual device queue structure */
struct device_queue {
    /* Device identification */
    struct device *dev;               /* Associated device */
    char name[32];                    /* Device name (e.g., "sda") */

    /* Queue of pending I/O requests */
    struct list_head request_queue;   /* I/O requests pending */
    struct request *current_request;  /* Currently serviced request */

    /* Queue of waiting processes (may be in request structures) */
    int waiting_count;                /* Number of blocked processes */

    /* Queue statistics */
    unsigned long total_requests;     /* Total requests serviced */
    unsigned long total_wait_time;    /* Cumulative wait time */

    /* Scheduling policy */
    struct elevator_type *elevator;   /* I/O scheduler */

    /* Synchronization */
    spinlock_t queue_lock;
    wait_queue_head_t wait;           /* Wait queue for blocking */
};

/* Individual I/O request structure */
struct request {
    struct list_head list;            /* Queue linkage */

    /* Request details */
    sector_t sector;                  /* Starting sector (for block devices) */
    unsigned int nr_sectors;          /* Number of sectors */
    enum req_op operation;            /* READ, WRITE, FLUSH, etc. */

    /* Buffers */
    void *buffer;                     /* Data buffer */
    struct bio *bio;                  /* Bio chain (Linux) */

    /* Associated process */
    struct task_struct *task;         /* Process waiting for this request */

    /* Timing */
    unsigned long start_time;         /* When request was submitted */
};
```

Linux Block Layer Implementation:
In Linux, block device queues are managed by the block layer, which sits between the filesystem and device drivers:
```c
/* Linux request queue structure (simplified) */
struct request_queue {
    /* Request lists */
    struct list_head queue_head;            /* List of pending requests */

    /* Queue limits */
    unsigned int nr_requests;               /* Max requests in queue */
    unsigned int nr_congestion_on;          /* Congestion threshold */

    /* I/O scheduler (elevator) */
    struct elevator_queue *elevator;

    /* Make request function - entry point from filesystem */
    make_request_fn *make_request_fn;

    /* Per-queue statistics */
    struct queue_limits limits;

    /* Backing device info */
    struct backing_dev_info *backing_dev_info;

    /* Synchronization */
    spinlock_t queue_lock;
};

/* Submit I/O request and potentially block */
void submit_bio_wait(struct bio *bio) {
    struct completion done;

    init_completion(&done);
    bio->bi_private = &done;
    bio->bi_end_io = bio_complete;

    /* Submit to device queue */
    submit_bio(bio);

    /* Block current process until I/O completes */
    wait_for_completion(&done);
}

/* Called by device driver when I/O completes */
void bio_complete(struct bio *bio) {
    struct completion *done = bio->bi_private;

    /* Wake up the waiting process */
    complete(done);
}
```

Modern Linux uses the blk-mq (Multi-Queue Block Layer) which maintains per-CPU submission queues and per-hardware-queue dispatch queues. This design was introduced in Linux 3.13 to better match the parallel capabilities of NVMe SSDs, which can handle hundreds of thousands of IOPS across multiple hardware queues.
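The structures below are a conceptual sketch of that blk-mq idea, not the real struct blk_mq_* definitions: each CPU submits into its own software queue, and software queues are mapped onto however many hardware queues the device exposes.

```c
/* Illustrative only - not the actual kernel blk-mq structures. */
#define MAX_CPUS 64

struct request;                        /* forward declaration */

struct sw_queue {                      /* per-CPU software submission queue */
    struct request *pending[256];
    int head, tail;
};

struct hw_queue {                      /* per-hardware-queue dispatch context */
    int hw_index;                      /* which device hardware queue this feeds */
    struct sw_queue *mapped[MAX_CPUS]; /* software queues mapped onto it */
    int nr_mapped;
};

struct mq_device {
    struct sw_queue sw[MAX_CPUS];      /* one submission queue per CPU */
    struct hw_queue *hw;               /* array: one entry per hardware queue */
    int nr_hw_queues;                  /* e.g., 1 for SATA, many for NVMe */
};
```

The payoff of this split is that CPUs rarely contend on a shared queue lock: submissions stay CPU-local until dispatch.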
Device queues aren't simply FIFO. For many devices, especially rotational hard drives, the order in which requests are serviced dramatically affects performance. I/O schedulers (also called elevators) reorder requests to optimize throughput and latency.
| Scheduler | Strategy | Best For | Drawbacks |
|---|---|---|---|
| FCFS (None) | First-come, first-served | Simple devices, SSDs | Poor HDD performance, no fairness |
| SSTF | Shortest Seek Time First | Throughput optimization | Starvation of distant requests |
| SCAN (Elevator) | Move head in one direction, then reverse | Balanced performance | Middle requests favored |
| C-SCAN | Circular SCAN - only one direction | More uniform wait time | Empty return sweep wastes time |
| LOOK/C-LOOK | Like SCAN but only to farthest request | Efficient SCAN variant | Same as SCAN, less wasteful |
| CFQ | Complete Fairness Queueing (Linux) | Interactive workloads | Complex; removed in Linux 5.0 |
| Deadline | Separate read/write queues with deadlines | Avoiding starvation | May sacrifice throughput |
| BFQ | Budget Fair Queueing | Desktop/interactive | Higher CPU overhead |
| mq-deadline | Multi-queue deadline scheduler | NVMe and modern SSDs (default on many Linux systems) | No per-process fairness guarantees |
| none/noop | No reordering, simple FIFO | NVMe/SSD (no seek) | Poor for HDDs |
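To make the elevator idea concrete, here is a minimal user-space sketch (not kernel code) of SCAN ordering: pending sectors at or above the current head position are served in increasing order, then the head reverses and serves the rest on the way back.

```c
/* Minimal SCAN (elevator) ordering sketch over an array of pending sectors. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_long(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Print the SCAN service order starting at `head`, sweeping upward first. */
void scan_order(long *sectors, int n, long head) {
    qsort(sectors, n, sizeof(long), cmp_long);

    /* Forward sweep: everything at or above the head, in increasing order. */
    for (int i = 0; i < n; i++)
        if (sectors[i] >= head) printf("%ld ", sectors[i]);

    /* Reverse sweep: everything below the head, in decreasing order. */
    for (int i = n - 1; i >= 0; i--)
        if (sectors[i] < head) printf("%ld ", sectors[i]);
    printf("\n");
}

int main(void) {
    long pending[] = { 98, 183, 37, 122, 14, 124, 65, 67 };  /* example queue */
    scan_order(pending, 8, 53);                              /* head at sector 53 */
    return 0;
}
```

Production schedulers such as deadline layer time limits and batching on top of this kind of sector ordering, as the next section shows.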
Deadline Scheduler Deep Dive:
The deadline scheduler is particularly interesting because it explicitly prevents starvation while maintaining reasonable throughput:
```c
/* Deadline scheduler maintains four queues */
struct deadline_data {
    /* Sorted queues (by sector number) for seeking efficiency */
    struct rb_root sort_list[2];      /* [READ] and [WRITE] sorted lists */

    /* FIFO queues (by time) for deadline enforcement */
    struct list_head fifo_list[2];    /* [READ] and [WRITE] deadline lists */

    /* Batching state */
    int fifo_batch;                   /* How many to service before checking other queue */
    int writes_starved;               /* Count of times reads preempted writes */

    /* Tunable deadlines */
    int read_expire;                  /* Read deadline: 500ms default */
    int write_expire;                 /* Write deadline: 5000ms default */

    /* Which direction we're currently servicing */
    int data_dir;                     /* READ or WRITE */
    sector_t last_sector;             /* Last sector serviced (for batching) */
};

/* Deadline scheduler dispatch logic */
struct request *deadline_dispatch(struct deadline_data *dd) {
    struct request *rq;
    int data_dir = dd->data_dir;

    /* Check if there's an expired request (deadline missed) */
    rq = deadline_check_fifo(dd, READ);   /* Reads have priority */
    if (rq)
        return rq;                        /* Must service this immediately */

    rq = deadline_check_fifo(dd, WRITE);
    if (rq)
        return rq;

    /* No deadline pressure - batch requests in current direction */
    if (deadline_has_work(dd, data_dir)) {
        /* Get next request in sector order for efficient seeking */
        rq = deadline_next_request(dd, data_dir);
        if (rq)
            return rq;
    }

    /* Switch direction if other direction has work */
    if (deadline_has_work(dd, !data_dir)) {
        dd->data_dir = !data_dir;
        rq = deadline_next_request(dd, !data_dir);
        if (rq)
            return rq;
    }

    return NULL;   /* No work */
}

/* Check if any request in this queue has exceeded its deadline */
struct request *deadline_check_fifo(struct deadline_data *dd, int dir) {
    struct request *rq = list_first_entry_or_null(
        &dd->fifo_list[dir], struct request, queuelist);

    if (rq) {
        unsigned long deadline = dir == READ ? dd->read_expire
                                             : dd->write_expire;
        if (time_after(jiffies, rq->start_time + deadline)) {
            /* This request has waited too long */
            return rq;
        }
    }
    return NULL;   /* No expired requests */
}
```

For NVMe SSDs, use 'none' or 'mq-deadline': no seek optimization is needed, and minimal overhead is best. For SATA SSDs, 'mq-deadline' works well. For HDDs, 'bfq' provides good interactive response, while 'mq-deadline' maximizes throughput. You can check or change the scheduler with `cat /sys/block/sda/queue/scheduler` and `echo mq-deadline > /sys/block/sda/queue/scheduler`.
When a process blocks on a device queue, it uses the kernel's wait queue infrastructure. Wait queues are the general mechanism for putting processes to sleep until some condition is met.
```c
/* Wait queue head - one per condition/event */
struct wait_queue_head {
    spinlock_t lock;
    struct list_head head;            /* List of waiting tasks */
};
typedef struct wait_queue_head wait_queue_head_t;

/* Wait queue entry - one per waiting task */
struct wait_queue_entry {
    unsigned int flags;
    void *private;                    /* Usually the task_struct pointer */
    wait_queue_func_t func;           /* Function to call on wake */
    struct list_head entry;           /* List linkage */
};

/* Initialize a wait queue head */
#define DECLARE_WAIT_QUEUE_HEAD(name) \
    struct wait_queue_head name = { \
        .lock = __SPIN_LOCK_UNLOCKED(name.lock), \
        .head = LIST_HEAD_INIT(name.head) \
    }

/* Initialize a wait queue entry */
#define DECLARE_WAITQUEUE(name, tsk) \
    struct wait_queue_entry name = { \
        .private = tsk, \
        .func = default_wake_function, \
        .entry = LIST_HEAD_INIT(name.entry) \
    }

/* Adding/removing from wait queue */
void add_wait_queue(struct wait_queue_head *wq_head,
                    struct wait_queue_entry *wq_entry) {
    unsigned long flags;

    spin_lock_irqsave(&wq_head->lock, flags);
    list_add(&wq_entry->entry, &wq_head->head);
    spin_unlock_irqrestore(&wq_head->lock, flags);
}

void remove_wait_queue(struct wait_queue_head *wq_head,
                       struct wait_queue_entry *wq_entry) {
    unsigned long flags;

    spin_lock_irqsave(&wq_head->lock, flags);
    list_del(&wq_entry->entry);
    spin_unlock_irqrestore(&wq_head->lock, flags);
}
```

The Sleep/Wake Pattern:
Blocking on an I/O operation follows a specific pattern to avoid race conditions:
```c
/* Correct blocking pattern - avoids lost wakeup race */
int wait_for_io_completion(struct io_request *req) {
    DECLARE_WAITQUEUE(wait, current);
    int result;

    /* Add ourselves to wait queue BEFORE checking condition */
    add_wait_queue(&req->wait_queue, &wait);

    while (1) {
        /* Set state to INTERRUPTIBLE or UNINTERRUPTIBLE */
        set_current_state(TASK_INTERRUPTIBLE);

        /* Check if condition is already satisfied */
        if (io_request_complete(req))
            break;

        /* Check for signals (if INTERRUPTIBLE) */
        if (signal_pending(current)) {
            result = -ERESTARTSYS;
            goto out;
        }

        /* Actually sleep - scheduler will run another task */
        schedule();

        /* We've been woken up - loop back to check condition */
    }

    result = req->result;

out:
    /* Reset state and remove from wait queue */
    set_current_state(TASK_RUNNING);
    remove_wait_queue(&req->wait_queue, &wait);
    return result;
}

/* Wake up waiting process when I/O completes (interrupt context) */
void complete_io_request(struct io_request *req) {
    /* Mark request complete */
    req->complete = true;
    req->result = 0;   /* Success */

    /* Wake up waiters */
    wake_up(&req->wait_queue);
}

/* Convenience macros hide this complexity */
/* wait_event(wq, condition)                  - uninterruptible wait */
/* wait_event_interruptible(wq, condition)    - can be signaled */
/* wait_event_timeout(wq, condition, timeout) - with timeout */

int read_from_device(struct device *dev, void *buf, size_t count) {
    struct io_request req;

    init_io_request(&req, READ, buf, count);
    submit_io_request(dev, &req);

    /* This macro handles all the sleep/wake logic */
    return wait_event_interruptible(req.wait_queue, io_request_complete(&req));
}
```

A critical race condition can occur if you check the condition before adding to the wait queue. If the I/O completes between checking and sleeping, the wakeup is 'lost' and the process sleeps forever. Always add to the wait queue first, then check the condition in a loop.
When an I/O operation completes, the device signals the CPU via an interrupt. This interrupt triggers a chain of events that moves the waiting process from the device queue to the ready queue.
```c
/* Device driver interrupt handler */
irqreturn_t disk_interrupt_handler(int irq, void *dev_id) {
    struct block_device *bdev = dev_id;
    struct request *req;

    /* Acknowledge interrupt to hardware */
    ack_device_interrupt(bdev);

    /* Get the completed request */
    req = bdev->current_request;
    if (!req)
        return IRQ_NONE;   /* Spurious interrupt */

    /* Transfer data if this was a read */
    if (req->operation == READ) {
        copy_from_device_buffer(req->buffer, bdev, req->nr_sectors);
    }

    /* Mark request complete */
    req->error = 0;   /* Success */

    /* Complete the request - this wakes the waiting process */
    blk_mq_complete_request(req);

    /* Start next request if any */
    bdev->current_request = get_next_request(bdev);
    if (bdev->current_request)
        start_device_operation(bdev);

    return IRQ_HANDLED;
}

/* Request completion path */
void blk_mq_complete_request(struct request *rq) {
    /* May need to defer to softirq if in hard irq context */
    if (in_hardirq()) {
        /* Schedule completion to run in softirq context */
        raise_softirq(BLOCK_SOFTIRQ);
        return;
    }

    /* Call the request's completion function */
    rq->end_io(rq, rq->error);
}

/* Completion callback set by the submitting code */
void bio_end_io(struct bio *bio) {
    struct task_struct *waiter = bio->bi_private;

    /* Record completion status (success in this path) */
    bio->bi_status = BLK_STS_OK;

    /* Wake up the process that submitted this I/O */
    wake_up_process(waiter);
}

/* wake_up_process implementation */
int wake_up_process(struct task_struct *p) {
    return try_to_wake_up(p, TASK_NORMAL, 0);
}

int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) {
    unsigned long flags;
    struct rq *rq;

    /* Get the run queue this task is associated with */
    raw_spin_lock_irqsave(&p->pi_lock, flags);

    /* Check if task is in a wakeable state */
    if (!(p->state & state))
        goto out_unlock;

    /* Mark task as runnable */
    p->state = TASK_RUNNING;

    /* Add to ready queue */
    rq = task_rq(p);
    activate_task(rq, p, ENQUEUE_WAKEUP);

    /* Check if we should preempt current task */
    check_preempt_curr(rq, p, wake_flags);

out_unlock:
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    return 0;
}
```

Code running in interrupt context cannot sleep, allocate memory with GFP_KERNEL, acquire sleeping locks, or access user space. This is why interrupt handlers often defer work to softirqs or tasklets, which run with interrupts enabled and have fewer restrictions.
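One common way to respect those restrictions is to do only the urgent work in the hard-irq handler and defer the rest. The sketch below is hypothetical driver code (it uses a workqueue, which runs in process context, rather than a softirq or tasklet) showing the general pattern:

```c
#include <linux/interrupt.h>
#include <linux/workqueue.h>

struct my_device {
    struct work_struct finish_work;   /* deferred completion work */
    /* ... device registers, current request, etc. ... */
};

/* Runs later in process context: may sleep, use GFP_KERNEL, take mutexes. */
static void my_finish_work(struct work_struct *work)
{
    struct my_device *dev = container_of(work, struct my_device, finish_work);
    /* Copy data, complete requests, wake waiters... */
    (void)dev;
}

/* Hard-irq handler: acknowledge the device, then defer everything else. */
static irqreturn_t my_irq_handler(int irq, void *data)
{
    struct my_device *dev = data;

    schedule_work(&dev->finish_work);   /* queue the deferred work */
    return IRQ_HANDLED;
}

/* At probe time: INIT_WORK(&dev->finish_work, my_finish_work); */
```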
Real applications often need to wait for events from multiple sources simultaneously—network connections, files, timers, and user input. This requires mechanisms for waiting on multiple device queues at once.
| Mechanism | POSIX/Linux | Windows | Characteristics |
|---|---|---|---|
| Multiplexing | select(), poll() | WaitForMultipleObjects() | Block until any of N events ready |
| Advanced Multiplexing | epoll (Linux) | I/O Completion Ports | Scalable to 100K+ connections |
| Async I/O (POSIX) | aio_* functions | ReadFileEx() | Non-blocking, callback/signal notification |
| Async I/O (Modern) | io_uring (Linux 5.1+) | IOCP | Ring buffer based, batched syscalls |
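Before looking at epoll, here is what the simpler poll() interface from the first row looks like. This is a minimal sketch; sock_fd and file_fd are assumed to be descriptors opened elsewhere:

```c
#include <poll.h>
#include <stdio.h>

/* Block until either descriptor has data to read, then report which. */
int wait_for_either(int sock_fd, int file_fd)
{
    struct pollfd fds[2] = {
        { .fd = sock_fd, .events = POLLIN },
        { .fd = file_fd, .events = POLLIN },
    };

    int n = poll(fds, 2, -1);          /* -1 = no timeout: sleep until ready */
    if (n < 0) { perror("poll"); return -1; }

    if (fds[0].revents & POLLIN) printf("socket is readable\n");
    if (fds[1].revents & POLLIN) printf("file is readable\n");
    return n;
}
```

Because poll() passes and rescans the whole descriptor array on every call, its cost grows with the number of descriptors, which is why epoll scales so much better for very large connection counts.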
```c
/* epoll: Efficiently wait on thousands of file descriptors */
#include <sys/epoll.h>

int main() {
    /* Create epoll instance */
    int epfd = epoll_create1(0);

    /* Add file descriptors to monitor */
    struct epoll_event ev;
    ev.events = EPOLLIN;       /* Wait for read readiness */
    ev.data.fd = socket_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, socket_fd, &ev);

    ev.data.fd = file_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, file_fd, &ev);

    /* Wait for events */
    struct epoll_event events[MAX_EVENTS];
    while (1) {
        /* Block until at least one fd is ready */
        int nfds = epoll_wait(epfd, events, MAX_EVENTS, -1);

        /* Process ready file descriptors */
        for (int i = 0; i < nfds; i++) {
            if (events[i].data.fd == socket_fd) {
                handle_socket_data();
            } else if (events[i].data.fd == file_fd) {
                handle_file_data();
            }
        }
    }
}

/* Under the hood, epoll uses a red-black tree + ready list */
struct eventpoll {
    /* Lock protecting this structure */
    spinlock_t lock;

    /* RB-tree of monitored file descriptors */
    struct rb_root rbr;

    /* List of file descriptors ready for userspace */
    struct list_head rdllist;

    /* Wait queue for epoll_wait callers */
    wait_queue_head_t wq;
};

/* When any monitored fd becomes ready:
 * 1. fd's poll callback triggers
 * 2. Epitem is added to rdllist
 * 3. Process waiting in epoll_wait is woken
 * 4. Ready fds returned to userspace
 */
```

io_uring: The Modern Approach:
The latest Linux approach to asynchronous I/O uses shared ring buffers between kernel and userspace, eliminating most syscall overhead:
```c
/* io_uring: Ring-buffer based async I/O */
#include <liburing.h>
#include <stdio.h>

int main() {
    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;

    /* Initialize ring */
    io_uring_queue_init(32, &ring, 0);

    /* Get a submission queue entry */
    sqe = io_uring_get_sqe(&ring);

    /* Prepare a read operation (doesn't submit yet) */
    io_uring_prep_read(sqe, fd, buffer, sizeof(buffer), 0);
    sqe->user_data = 123;   /* Tag for completion matching */

    /* Submit all prepared ops to kernel */
    io_uring_submit(&ring);

    /* Wait for completion */
    io_uring_wait_cqe(&ring, &cqe);

    /* Process result */
    if (cqe->res >= 0) {
        printf("Read %d bytes\n", cqe->res);
    }

    /* Mark completion as consumed */
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
}

/*
 * io_uring advantage: batched submissions
 *
 * Traditional approach: 1000 reads = 1000 syscalls
 * io_uring approach:    1000 reads = ~1-2 syscalls
 *
 * Also supports:
 * - Polling mode (no interrupts needed)
 * - Linked operations
 * - Fixed buffers (no copy overhead)
 */
```

The evolution from select() to poll() to epoll to io_uring represents decades of solving the C10K, C100K, and C10M problems—handling 10 thousand, 100 thousand, and 10 million concurrent connections. Modern servers like nginx and HAProxy use epoll or io_uring to efficiently multiplex huge numbers of connections with minimal CPU overhead.
System administrators and developers need visibility into device queue behavior to diagnose performance issues and optimize I/O patterns.
```bash
# iostat - I/O statistics per device
$ iostat -xz 1
Device    r/s     w/s     rkB/s     wkB/s    avgrq-sz  avgqu-sz  await  svctm  %util
sda       100.0   50.0    4000.0    2000.0   80.0      2.5       16.0   3.0    45.0
nvme0n1   5000.0  2000.0  200000.0  80000.0  80.0      0.5       0.1    0.05   35.0

# Key metrics:
# avgqu-sz - Average queue depth (requests waiting)
# await    - Average wait time (queue + service)
# svctm    - Average service time (device only)
# %util    - Percentage of time device was busy

# iotop - I/O by process (like top for I/O)
$ sudo iotop
Total DISK READ: 50.00 M/s | Total DISK WRITE: 25.00 M/s
  PID  PRIO  USER      DISK READ   DISK WRITE  SWAPIN   IO>      COMMAND
 1234  be/4  mysql     45.00 M/s   20.00 M/s   0.00 %   50.00 %  mysqld
 5678  be/4  postgres   5.00 M/s    5.00 M/s   0.00 %   10.00 %  postgres

# /sys/block/*/queue/ - Per-device queue parameters
$ cat /sys/block/sda/queue/scheduler
[mq-deadline] none

$ cat /sys/block/sda/queue/nr_requests
256

$ cat /sys/block/sda/queue/queue_depth
32

# blktrace - Detailed block layer tracing
$ sudo blktrace -d /dev/sda -o - | blkparse -i -
  8,0    1    1     0.000000000  1234  Q   R 12345678 + 8 [process]
  8,0    1    2     0.000001000  1234  G   R 12345678 + 8 [process]
  8,0    1    3     0.000100000     0  D   R 12345678 + 8 [process]
  8,0    1    4     0.001000000     0  C   R 12345678 + 8 [0]
# Q=Queued, G=GetRequest, D=Dispatch, C=Complete
```

High average queue depth (avgqu-sz > 4 for HDD, > 32 for SSD) combined with high utilization (>80%) indicates the device is becoming a bottleneck. Either upgrade the device, reduce I/O load, or optimize access patterns (e.g., batching, caching).
Device queues bridge the gap between the fast world of CPU execution and the slow world of physical I/O. To consolidate:

- They exist because devices are thousands to millions of times slower than the CPU; blocking waiting processes frees the CPU for other work.
- They are organized per device (or device class), reflecting the limited number of requests each device can service at once.
- I/O schedulers (deadline, BFQ, mq-deadline, none) reorder queued requests to balance throughput, latency, and starvation avoidance.
- Processes block via the kernel's wait queue infrastructure and return to the ready queue when a completion interrupt wakes them.
- Multiplexing and async interfaces (select/poll, epoll, io_uring) let a single process wait on or submit many I/O operations at once.
What's Next:
Having explored the three main queue types (job queue, ready queue, device queues), we'll now examine how these queues are implemented at a data structure level. The next page covers Queue Implementation—the linked lists, arrays, and specialized structures that bring these abstractions to life.
You now understand device queues: their purpose in bridging CPU and I/O speeds, their per-device organization, the I/O scheduling algorithms that optimize access patterns, the wait queue mechanism for blocking, and how interrupts drive process wakeup. This knowledge is essential for understanding system performance and I/O-bound behavior.