We've explored what process queues are and their roles in operating systems—but how are they actually built? The choice of data structure for a process queue has profound implications for system performance. These queues are accessed on every context switch, every I/O completion, and every process creation. Even small inefficiencies multiply across millions of operations per second.
This page dives deep into the implementation details: the linked lists, arrays, trees, and hybrid structures that kernel developers use to implement efficient process queues. You'll understand not just which structures are used, but why they're chosen and how their characteristics match the requirements of different queue types.
By the end of this page, you will understand the intrusive linked list design pattern used in kernel code, array-based queue implementations for fixed-size scenarios, tree-based structures for priority ordering, the trade-offs between different implementations, and memory considerations for queue nodes.
Before examining specific implementations, let's understand the requirements that drive data structure selection for process queues.
| Queue Type | Primary Operations | Priority Needed | Typical Size |
|---|---|---|---|
| Job Queue | Iterate, lookup by PID, insert, remove | No | 100 - 100,000 processes |
| Ready Queue | Remove-min, insert, priority update | Yes (usually) | 1 - 100 per CPU |
| Device Queue | Insert-back, remove-front, reorder | Sometimes (I/O priority) | 1 - 1000 per device |
| Wait Queue | Insert, remove-all, remove-one | No | 1 - 100 per condition |
Different queue types have different access patterns, leading to different optimal data structures. The job queue is traversed but rarely searched by position. The ready queue needs fast min-extraction. Device queues may need reordering. A single data structure rarely satisfies all needs efficiently.
The most common data structure for process queues in kernels is the intrusive linked list. Unlike standard library containers where the container owns the nodes, intrusive lists embed the linkage pointers directly within the data structures being linked.
Linux Kernel list_head:
Linux uses the list_head structure for doubly-linked lists throughout the kernel:
```c
/* The fundamental list structure - just two pointers */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Initialize a list head (empty list points to itself) */
#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

static inline void INIT_LIST_HEAD(struct list_head *list) {
    list->next = list;
    list->prev = list;
}

/* Check if list is empty */
static inline int list_empty(const struct list_head *head) {
    return head->next == head;
}

/* Internal insert helper */
static inline void __list_add(struct list_head *new,
                              struct list_head *prev,
                              struct list_head *next) {
    next->prev = new;
    new->next = next;
    new->prev = prev;
    prev->next = new;
}

/* Insert at head (stack behavior) */
static inline void list_add(struct list_head *new, struct list_head *head) {
    __list_add(new, head, head->next);
}

/* Insert at tail (queue behavior) */
static inline void list_add_tail(struct list_head *new, struct list_head *head) {
    __list_add(new, head->prev, head);
}

/* Remove entry from list */
static inline void list_del(struct list_head *entry) {
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    /* Poison pointers to catch use-after-remove bugs */
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}

/* The magic: get containing structure from list_head pointer */
#define container_of(ptr, type, member) ({ \
    const typeof(((type *)0)->member) *__mptr = (ptr); \
    (type *)((char *)__mptr - offsetof(type, member)); })

#define list_entry(ptr, type, member) \
    container_of(ptr, type, member)

/* Iterate over list entries */
#define list_for_each_entry(pos, head, member) \
    for (pos = list_entry((head)->next, typeof(*pos), member); \
         &pos->member != (head); \
         pos = list_entry(pos->member.next, typeof(*pos), member))
```

Using list_head in task_struct:
```c
/* Process structure with multiple list memberships */
struct task_struct {
    /* ... other fields ... */

    /* ALL-TASKS LIST (job queue) */
    struct list_head tasks;      /* Link in global task list */

    /* SIBLING LIST (children of same parent) */
    struct list_head sibling;    /* Link in parent's children list */
    struct list_head children;   /* Head of this process's children */

    /* RUN QUEUE (ready queue) - CFS uses run_node in sched_entity */
    struct sched_entity se;      /* Contains rb_node for CFS tree */

    /* ... more list_heads for other memberships ... */
};

/* A single process can be in multiple lists simultaneously */
void demonstrate_multi_membership(struct task_struct *task) {
    struct list_head all_tasks = LIST_HEAD_INIT(all_tasks);

    /* Add to job queue (global task list) */
    list_add(&task->tasks, &all_tasks);

    /* Add to parent's children list */
    list_add(&task->sibling, &current->children);

    /* Task is now in THREE lists using THREE different list_heads:
     * 1. all_tasks (via task->tasks)
     * 2. parent->children (via task->sibling)
     * 3. Its own children list head (task->children)
     */
}

/* Iterate all processes in the system */
void print_all_pids(void) {
    struct task_struct *task;

    /* for_each_process uses the 'tasks' list_head */
    for_each_process(task) {
        printk("PID: %d\n", task->pid);
    }
}
```

The container_of macro is the key to intrusive lists. Given a pointer to a list_head member within a structure, it calculates the address of the containing structure using pointer arithmetic. offsetof(type, member) gives the byte offset of member within type. Subtracting this from the member's address gives the structure's base address.
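To make the pointer arithmetic concrete, here is a minimal user-space sketch (the struct job type and its field names are hypothetical, and the macro is the simplified form without the kernel's typeof type check):

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel definitions above */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct job {
    int id;
    struct list_head link;   /* Embedded linkage, as in task_struct */
};

int main(void) {
    struct job j = { .id = 42 };
    struct list_head *p = &j.link;   /* All we have is the member pointer */

    /* Walk back from the member to the containing struct job */
    struct job *owner = container_of(p, struct job, link);
    printf("id = %d\n", owner->id);  /* Prints: id = 42 */
    return 0;
}
```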
For queues with bounded size or where cache efficiency is paramount, array-based implementations offer advantages over linked lists.
Circular Buffer (Ring Buffer):
For FIFO queues with bounded size, circular buffers are optimal:
```c
/* Fixed-size circular buffer for process pointers */
struct circular_queue {
    struct task_struct *buffer[MAX_QUEUE_SIZE];
    int head;    /* Index of next element to remove */
    int tail;    /* Index of next free slot */
    int count;   /* Number of elements */
    spinlock_t lock;
};

/* Initialize queue */
void circular_queue_init(struct circular_queue *q) {
    q->head = 0;
    q->tail = 0;
    q->count = 0;
    spin_lock_init(&q->lock);
}

/* Enqueue at tail - O(1) */
int circular_queue_enqueue(struct circular_queue *q, struct task_struct *task) {
    unsigned long flags;

    spin_lock_irqsave(&q->lock, flags);

    if (q->count >= MAX_QUEUE_SIZE) {
        spin_unlock_irqrestore(&q->lock, flags);
        return -ENOSPC;    /* Queue full */
    }

    q->buffer[q->tail] = task;
    q->tail = (q->tail + 1) % MAX_QUEUE_SIZE;    /* Wrap around */
    q->count++;

    spin_unlock_irqrestore(&q->lock, flags);
    return 0;
}

/* Dequeue from head - O(1) */
struct task_struct *circular_queue_dequeue(struct circular_queue *q) {
    struct task_struct *task;
    unsigned long flags;

    spin_lock_irqsave(&q->lock, flags);

    if (q->count == 0) {
        spin_unlock_irqrestore(&q->lock, flags);
        return NULL;    /* Queue empty */
    }

    task = q->buffer[q->head];
    q->buffer[q->head] = NULL;    /* Clear (optional, helps debugging) */
    q->head = (q->head + 1) % MAX_QUEUE_SIZE;
    q->count--;

    spin_unlock_irqrestore(&q->lock, flags);
    return task;
}

/* Lock-free variant using atomic operations */
struct lockfree_circular_queue {
    struct task_struct *buffer[MAX_QUEUE_SIZE];
    atomic_t head;
    atomic_t tail;
};

/* Single-producer, single-consumer lock-free enqueue */
int spsc_enqueue(struct lockfree_circular_queue *q, struct task_struct *task) {
    int tail = atomic_read(&q->tail);
    int next_tail = (tail + 1) % MAX_QUEUE_SIZE;

    /* Check if full (tail would catch head) */
    if (next_tail == atomic_read(&q->head))
        return -ENOSPC;

    q->buffer[tail] = task;
    smp_wmb();    /* Ensure the write is visible before updating tail */
    atomic_set(&q->tail, next_tail);

    return 0;
}
```

Array of Lists (Multi-Level Priority Queue):
The classic O(1) scheduler design combines arrays with linked lists:
```c
/* Array-of-lists structure for O(1) priority scheduling */
#define NUM_PRIORITY_LEVELS 140

struct prio_array {
    /* Bitmap: bit n set if queue[n] is non-empty */
    unsigned long bitmap[BITS_TO_LONGS(NUM_PRIORITY_LEVELS)];

    /* Array of list heads, one per priority level */
    struct list_head queue[NUM_PRIORITY_LEVELS];

    /* Total number of tasks */
    int nr_active;
};

/* Initialize the priority array */
void prio_array_init(struct prio_array *pa) {
    int i;

    bitmap_zero(pa->bitmap, NUM_PRIORITY_LEVELS);
    pa->nr_active = 0;

    for (i = 0; i < NUM_PRIORITY_LEVELS; i++) {
        INIT_LIST_HEAD(&pa->queue[i]);
    }
}

/* Enqueue task at its priority level - O(1) */
void prio_array_enqueue(struct prio_array *pa, struct task_struct *task) {
    int prio = task->priority;    /* 0-139 */

    list_add_tail(&task->run_list, &pa->queue[prio]);
    __set_bit(prio, pa->bitmap);
    pa->nr_active++;
}

/* Find and remove highest priority task - O(1) */
struct task_struct *prio_array_dequeue(struct prio_array *pa) {
    int prio;
    struct task_struct *task;

    if (pa->nr_active == 0)
        return NULL;

    /* Find first (highest priority = lowest number) non-empty queue */
    prio = sched_find_first_bit(pa->bitmap);

    /* Remove first task from that queue */
    task = list_first_entry(&pa->queue[prio], struct task_struct, run_list);
    list_del_init(&task->run_list);

    /* Clear bitmap bit if queue is now empty */
    if (list_empty(&pa->queue[prio])) {
        __clear_bit(prio, pa->bitmap);
    }

    pa->nr_active--;
    return task;
}

/* sched_find_first_bit is highly optimized */
static inline int sched_find_first_bit(const unsigned long *bitmap) {
    /* On x86, this compiles to just a few instructions:
     * - Load bitmap word
     * - BSF (Bit Scan Forward) instruction
     * Total: ~3-5 cycles
     */
    return find_first_bit(bitmap, NUM_PRIORITY_LEVELS);
}
```

The bitmap is the key to O(1) behavior. Finding the first set bit in a 64-bit word takes 1-3 CPU cycles using hardware instructions like BSF (Bit Scan Forward) or TZCNT (Trailing Zero Count). Even with 140 priorities requiring 3 64-bit words, finding the highest priority is extremely fast.
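To give a rough sense of why the bitmap scan is so cheap, here is a small user-space sketch that stands in for the kernel's find_first_bit using the GCC/Clang __builtin_ctzl intrinsic (which maps to TZCNT/BSF on x86); the bitmap layout mirrors the prio_array above:

```c
#include <stdio.h>

#define NUM_PRIORITY_LEVELS 140
#define BITS_PER_LONG       64
#define NUM_WORDS ((NUM_PRIORITY_LEVELS + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Scan the bitmap word by word; __builtin_ctzl counts trailing zeros */
static int find_first_set(const unsigned long bitmap[NUM_WORDS]) {
    for (int w = 0; w < NUM_WORDS; w++) {
        if (bitmap[w])
            return w * BITS_PER_LONG + __builtin_ctzl(bitmap[w]);
    }
    return -1;    /* No runnable task at any priority */
}

int main(void) {
    unsigned long bitmap[NUM_WORDS] = { 0 };

    /* Mark priorities 100 and 17 as having runnable tasks */
    bitmap[100 / BITS_PER_LONG] |= 1UL << (100 % BITS_PER_LONG);
    bitmap[17 / BITS_PER_LONG]  |= 1UL << (17 % BITS_PER_LONG);

    /* Lowest number wins: prints 17 */
    printf("highest priority level = %d\n", find_first_set(bitmap));
    return 0;
}
```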
When priorities are continuous (like virtual runtime in CFS) rather than discrete levels, tree structures provide efficient priority ordering.
Red-Black Trees in CFS:
Linux CFS uses red-black trees because they provide O(log n) insertion and deletion with worst-case balancing guarantees, in-order traversal of entities sorted by virtual runtime, and, via a cached leftmost pointer, O(1) access to the entity with the smallest vruntime:
```c
/* Red-black tree node - embedded in managed structures */
struct rb_node {
    unsigned long __rb_parent_color;    /* Parent pointer + color bit packed */
    struct rb_node *rb_right;
    struct rb_node *rb_left;
};

/* Tree root with cached leftmost node */
struct rb_root_cached {
    struct rb_root rb_root;
    struct rb_node *rb_leftmost;    /* Cache of min element */
};

/* Colors are encoded in the low bit of parent pointer */
#define RB_RED   0
#define RB_BLACK 1

#define rb_parent(r)   ((struct rb_node *)((r)->__rb_parent_color & ~3))
#define rb_color(r)    ((r)->__rb_parent_color & 1)
#define rb_is_red(r)   (!rb_color(r))
#define rb_is_black(r) rb_color(r)

/* Get first (leftmost/minimum) element - O(1) with cache */
static inline struct rb_node *rb_first_cached(const struct rb_root_cached *root)
{
    return root->rb_leftmost;
}

/* Insert node and rebalance */
void rb_insert_color_cached(struct rb_node *node,
                            struct rb_root_cached *root, bool leftmost) {
    /* Update leftmost cache if this is the new minimum */
    if (leftmost)
        root->rb_leftmost = node;

    /* Standard RB-tree rebalancing */
    rb_insert_color(node, &root->rb_root);
}

/* CFS scheduling entity with rb_node embedded */
struct sched_entity {
    /* ... load weight, vruntime, etc. ... */
    struct rb_node run_node;    /* Embedded RB-tree node */
    u64 vruntime;               /* Key for tree ordering */
};

/* Insert sched_entity into CFS tree */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) {
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    bool leftmost = true;

    /* Binary search for insertion point */
    while (*link) {
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);

        if (entity_before(se, entry)) {    /* Compare vruntimes */
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = false;    /* Not the new minimum */
        }
    }

    /* Link node into tree */
    rb_link_node(&se->run_node, parent, link);

    /* Color the node and rebalance; update leftmost cache */
    rb_insert_color_cached(&se->run_node, &cfs_rq->tasks_timeline, leftmost);
}

/* Remove sched_entity from CFS tree */
static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) {
    rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
}
```

Binary Heaps for Priority Queues:
Heaps are simpler than RB-trees and can be more cache-friendly with array storage:
```c
/* Array-based binary min-heap for deadline scheduling */
struct heap_queue {
    struct task_struct *tasks[HEAP_SIZE];
    int size;
    spinlock_t lock;
};

/* Heap helper functions */
#define PARENT(i) (((i) - 1) / 2)
#define LEFT(i)   ((2 * (i)) + 1)
#define RIGHT(i)  ((2 * (i)) + 2)

/* Compare by deadline (lower deadline = higher priority) */
static inline bool higher_priority(struct task_struct *a, struct task_struct *b) {
    return a->deadline < b->deadline;
}

/* Swap two heap entries */
static inline void heap_swap(struct heap_queue *heap, int i, int j) {
    struct task_struct *tmp = heap->tasks[i];
    heap->tasks[i] = heap->tasks[j];
    heap->tasks[j] = tmp;

    /* Update heap index stored in task (for O(1) removal) */
    heap->tasks[i]->heap_idx = i;
    heap->tasks[j]->heap_idx = j;
}

/* Bubble up after insertion */
static void heap_bubble_up(struct heap_queue *heap, int idx) {
    while (idx > 0) {
        int parent = PARENT(idx);
        if (higher_priority(heap->tasks[idx], heap->tasks[parent])) {
            heap_swap(heap, idx, parent);
            idx = parent;
        } else {
            break;    /* Heap property satisfied */
        }
    }
}

/* Bubble down after extraction or priority change */
static void heap_bubble_down(struct heap_queue *heap, int idx) {
    while (LEFT(idx) < heap->size) {
        int smallest = idx;
        int left = LEFT(idx);
        int right = RIGHT(idx);

        if (higher_priority(heap->tasks[left], heap->tasks[smallest]))
            smallest = left;
        if (right < heap->size &&
            higher_priority(heap->tasks[right], heap->tasks[smallest]))
            smallest = right;

        if (smallest != idx) {
            heap_swap(heap, idx, smallest);
            idx = smallest;
        } else {
            break;
        }
    }
}

/* Insert task - O(log n) */
int heap_insert(struct heap_queue *heap, struct task_struct *task) {
    if (heap->size >= HEAP_SIZE)
        return -ENOMEM;

    int idx = heap->size++;
    heap->tasks[idx] = task;
    task->heap_idx = idx;

    heap_bubble_up(heap, idx);
    return 0;
}

/* Extract minimum - O(log n) */
struct task_struct *heap_extract_min(struct heap_queue *heap) {
    if (heap->size == 0)
        return NULL;

    struct task_struct *min = heap->tasks[0];

    /* Move last element to root */
    heap->tasks[0] = heap->tasks[--heap->size];
    heap->tasks[0]->heap_idx = 0;

    /* Restore heap property */
    heap_bubble_down(heap, 0);

    min->heap_idx = -1;    /* No longer in heap */
    return min;
}

/* Remove arbitrary element using stored index - O(log n) */
void heap_remove(struct heap_queue *heap, struct task_struct *task) {
    int idx = task->heap_idx;

    if (idx < 0 || idx >= heap->size)
        return;

    /* Move last element to the removed position */
    heap->tasks[idx] = heap->tasks[--heap->size];
    heap->tasks[idx]->heap_idx = idx;

    /* Fix heap property (may need to go up or down) */
    if (idx > 0 && higher_priority(heap->tasks[idx], heap->tasks[PARENT(idx)])) {
        heap_bubble_up(heap, idx);
    } else {
        heap_bubble_down(heap, idx);
    }

    task->heap_idx = -1;
}
```

Heaps are simpler and, being array-based, offer better cache locality, but O(log n) removal of an arbitrary element requires storing a heap index in each element. RB-trees keep elements fully sorted, supporting in-order traversal and efficient range queries, with worst-case O(log n) guarantees. CFS chose RB-trees partly because visualizing and debugging a sorted tree is much easier than a heap.
While linked lists and trees organize processes for iteration and priority ordering, hash tables provide O(1) lookup by process ID—essential for operations like signal delivery and /proc access.
```c
/* Linux uses IDR (ID Radix tree) for PID lookup, but let's show
 * the conceptual hash table approach for clarity */

#define PID_HASH_BITS 12    /* 4096 buckets */
#define PID_HASH_SIZE (1 << PID_HASH_BITS)

struct pid_hash_table {
    struct hlist_head buckets[PID_HASH_SIZE];
    spinlock_t lock;
};

/* Hash bucket list (singly-linked with back-pointer for O(1) removal) */
struct hlist_head {
    struct hlist_node *first;
};

struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;    /* Pointer to previous node's next pointer */
};

/* PID structure with hash linkage */
struct pid {
    pid_t nr;                     /* The PID number */
    struct hlist_node hash_link;  /* Link in hash table */
    struct task_struct *task;     /* Associated task */
    atomic_t count;               /* Reference count */
};

/* Hash function for PIDs */
static inline unsigned int pid_hash(pid_t pid) {
    /* Golden ratio hash; keep the multiply in 32 bits so the top
     * PID_HASH_BITS bits of the product select the bucket */
    return ((unsigned int)pid * 0x9e370001U) >> (32 - PID_HASH_BITS);
}

/* Insert PID into hash table - O(1) average */
void pid_hash_insert(struct pid_hash_table *table, struct pid *pid) {
    unsigned int hash = pid_hash(pid->nr);

    spin_lock(&table->lock);
    hlist_add_head(&pid->hash_link, &table->buckets[hash]);
    spin_unlock(&table->lock);
}

/* Lookup task by PID - O(1) average */
struct task_struct *find_task_by_pid(struct pid_hash_table *table, pid_t nr) {
    unsigned int hash = pid_hash(nr);
    struct pid *pid;
    struct task_struct *task = NULL;

    rcu_read_lock();    /* Lock-free read via RCU */
    hlist_for_each_entry_rcu(pid, &table->buckets[hash], hash_link) {
        if (pid->nr == nr) {
            task = pid->task;
            break;
        }
    }
    rcu_read_unlock();

    return task;
}

/* Remove PID from hash table - O(1) */
void pid_hash_remove(struct pid *pid) {
    /* hlist allows O(1) removal using the pprev pointer */
    hlist_del_rcu(&pid->hash_link);
}

/* Modern Linux uses IDR (radix tree) instead:
 * - Better worst-case behavior
 * - Efficient ID allocation
 * - Supports hierarchical namespaces
 */
struct idr {
    struct radix_tree_root idr_rt;
    unsigned int idr_base;
    unsigned int idr_next;
};

/* IDR provides both allocation and lookup */
int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp);
void *idr_find(const struct idr *idr, unsigned long id);
void *idr_remove(struct idr *idr, unsigned long id);
```

Why Multiple Data Structures?
The job queue uses BOTH a linked list (for iteration) AND a hash table (for lookup). This dual-indexing pattern is common in kernel data structures, as sketched below.
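As a conceptual sketch of dual indexing (reusing the list_head and hlist_node linkage shown earlier; the proc_record type and helper names are hypothetical, and locking is omitted), one object simply carries two independent sets of links:

```c
/* Hypothetical process record indexed two ways at once */
struct proc_record {
    pid_t pid;
    struct list_head  all_link;    /* Position in the global iteration list */
    struct hlist_node hash_link;   /* Position in a PID hash bucket */
    /* ... other per-process data ... */
};

static LIST_HEAD(all_procs);                        /* "Walk every process" */
static struct hlist_head proc_hash[PID_HASH_SIZE];  /* "Find by PID" */

/* Registration links the record into BOTH indexes
 * (a real kernel would do this under a lock or with RCU) */
static void proc_record_register(struct proc_record *p) {
    list_add_tail(&p->all_link, &all_procs);
    hlist_add_head(&p->hash_link, &proc_hash[pid_hash(p->pid)]);
}

/* Removal must unlink from BOTH indexes, or one of them dangles */
static void proc_record_unregister(struct proc_record *p) {
    list_del(&p->all_link);
    hlist_del(&p->hash_link);
}
```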
Linux uses RCU (Read-Copy-Update) for the PID hash table. Readers can traverse the hash table without locks—they only need rcu_read_lock() which doesn't actually acquire a lock. Writers use careful ordering and deferred reclamation. This makes operations like signal delivery extremely fast.
Queue implementation must consider memory layout for cache efficiency. Modern CPUs have multiple cache levels, and cache misses can cost hundreds of cycles.
| Access Pattern | Cache Behavior | Performance Impact |
|---|---|---|
| Sequential array iteration | Excellent - prefetching works | ~1 cycle per element after warmup |
| Linked list with locality | Good - elements near each other | ~3-10 cycles per element |
| Linked list with fragmentation | Poor - random memory access | ~50-200 cycles per element |
| Hash table lookup (hit) | Moderate - 2-3 pointer dereferences | ~10-50 cycles |
| Tree traversal (balanced) | Poor - random access down tree | ~50-100 cycles per level |
```c
/* Technique 1: Slab Allocation
 * Allocate task_structs from a dedicated cache, keeping them
 * in contiguous memory and improving cache behavior.
 */
struct kmem_cache *task_struct_cachep;

void __init task_cache_init(void) {
    task_struct_cachep = kmem_cache_create(
        "task_struct",
        sizeof(struct task_struct),
        __alignof__(struct task_struct),
        SLAB_PANIC | SLAB_ACCOUNT,
        NULL);
}

struct task_struct *alloc_task_struct_node(int node) {
    return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL, node);
}

/* Technique 2: Hot/Cold Field Separation
 * Group frequently accessed fields together at the start
 * of the structure to fit in a single cache line.
 */
struct task_struct {
    /* HOT DATA - accessed during scheduling */
    volatile long state;    /* Current state */
    void *stack;            /* Kernel stack */
    unsigned int flags;     /* Process flags */
    int on_cpu;             /* Currently running on which CPU */
    int prio;               /* Dynamic priority */
    /* ... */

    /* COLD DATA - accessed less frequently */
    struct mm_struct *mm;           /* Address space */
    struct files_struct *files;     /* Open files */
    /* ... */

    /* VERY COLD DATA - rarely accessed */
    char comm[TASK_COMM_LEN];       /* Process name */
    struct audit_context *audit_context;
    /* ... */
};

/* Technique 3: Percpu Caching
 * For per-CPU run queues, ensure each CPU's queue
 * is on a different cache line to avoid false sharing.
 */
DEFINE_PER_CPU_ALIGNED(struct rq, runqueues);

/* The _ALIGNED suffix ensures each CPU's rq starts on
 * a cache line boundary (typically 64 bytes).
 * This prevents CPU 0's writes from invalidating CPU 1's cache.
 */

/* Technique 4: Prefetching
 * Hint to the CPU that certain data will be needed soon.
 */
void traverse_ready_queue(struct list_head *head) {
    struct task_struct *task, *next;

    list_for_each_entry_safe(task, next, head, run_list) {
        /* Prefetch next element while processing current */
        prefetch(&next->state);
        prefetch(&next->prio);

        /* Process current task */
        update_task_accounting(task);
    }
}
```

On NUMA (Non-Uniform Memory Access) systems, memory access time depends on which CPU is accessing which memory node. Queue structures should be allocated on the node where they'll be primarily accessed. Linux's NUMA-aware scheduling tries to keep processes near their memory, and queue structures are allocated per-node when beneficial.
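A minimal sketch of per-node placement, assuming a hypothetical device_queue type: the queue is allocated on the memory node of the CPU that will service it, using kzalloc_node and cpu_to_node.

```c
/* Hypothetical per-device queue structure */
struct device_queue {
    struct list_head waiting;    /* Requests waiting for the device */
    spinlock_t lock;
};

/* Allocate the queue on the memory node local to the servicing CPU */
static struct device_queue *alloc_queue_for_cpu(int cpu) {
    int node = cpu_to_node(cpu);
    struct device_queue *q;

    /* kzalloc_node places the allocation on the requested NUMA node,
     * so the CPU that drains this queue avoids remote-memory latency. */
    q = kzalloc_node(sizeof(*q), GFP_KERNEL, node);
    if (!q)
        return NULL;

    INIT_LIST_HEAD(&q->waiting);
    spin_lock_init(&q->lock);
    return q;
}
```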
Process queues are accessed from multiple CPUs and contexts (process context, interrupt context, softirq context). Proper synchronization is essential for correctness and performance.
| Mechanism | Use Case | Overhead | Scalability |
|---|---|---|---|
| Spinlock | Short critical sections, interrupt-safe | Low (if uncontended) | Poor (one CPU at a time) |
| Mutex | Longer critical sections, can sleep | Medium | Poor (one CPU at a time) |
| RW Lock | Read-heavy workloads | Medium | Good for reads |
| RCU | Read-mostly, rare writes | Very low for reads | Excellent for reads |
| Per-CPU Data | Avoid sharing entirely | Very low | Excellent |
| Lock-free | Specific patterns (SPSC) | Low | Excellent |
```c
/* Ready queue (per-CPU) uses spinlock with interrupt disable */
void enqueue_task(struct rq *rq, struct task_struct *p) {
    raw_spin_lock(&rq->lock);    /* Spinlock, not sleeping lock */

    /* ... enqueue operation ... */
    list_add_tail(&p->run_list, &rq->cfs.tasks);

    raw_spin_unlock(&rq->lock);
}

/* When called from interrupt context, must save/restore flags */
void wake_up_task(struct task_struct *p) {
    unsigned long flags;
    struct rq *rq;

    raw_spin_lock_irqsave(&p->pi_lock, flags);    /* Disable interrupts */
    rq = __task_rq_lock(p, &flags);

    /* ... wakeup operation ... */

    __task_rq_unlock(rq, &flags);
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);    /* Restore interrupts */
}

/* PID hash table uses RCU for lock-free reading */
struct task_struct *find_task_by_vpid(pid_t vnr) {
    struct task_struct *task;

    rcu_read_lock();    /* Marks read-side critical section */
    /* No actual lock acquired - just increments preempt count */

    task = find_task_by_pid_ns(vnr, task_active_pid_ns(current));

    rcu_read_unlock();    /* End of critical section */
    return task;
}

/* Writers to PID hash must synchronize with RCU */
void detach_pid(struct task_struct *task, enum pid_type type) {
    struct pid *pid = task->pids[type].pid;
    struct hlist_head *head = &pid_hash[pid_hashfn(pid->nr)];

    spin_lock(&pidmap_lock);           /* Writers still need a lock */
    hlist_del_rcu(&pid->pid_chain);    /* RCU-safe deletion */
    spin_unlock(&pidmap_lock);

    /* Actual memory free is deferred until all RCU readers complete */
    call_rcu(&pid->rcu, delayed_put_pid);
}

/* Lock ordering is critical to prevent deadlock */
/*
 * Lock hierarchy (must acquire in this order):
 * 1. task->pi_lock (protects priority inheritance)
 * 2. rq->lock (protects run queue)
 * 3. p->lock (protects individual task)
 *
 * Never hold a higher-numbered lock while trying to acquire
 * a lower-numbered one!
 */
```

The most elegant solution to queue synchronization is avoiding shared data entirely. Per-CPU run queues mean each CPU only accesses its own queue, requiring no locking for local operations. Load balancing still requires cross-CPU access, but it's infrequent compared to scheduling decisions. This is why the move from global run queue to per-CPU queues was transformative for SMP scalability.
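To illustrate the per-CPU idea, here is a minimal sketch (the local_queue type is hypothetical and far simpler than the scheduler's struct rq, and protection against interrupt-context access is omitted) in which each CPU enqueues onto its own instance without touching any shared lock:

```c
/* Hypothetical per-CPU queue: one instance per CPU, cache-line aligned */
struct local_queue {
    struct list_head tasks;
    unsigned int nr_tasks;
};

static DEFINE_PER_CPU_ALIGNED(struct local_queue, local_queues);

/* Each CPU's list head must be initialized before use */
static void local_queues_init(void) {
    int cpu;

    for_each_possible_cpu(cpu)
        INIT_LIST_HEAD(&per_cpu(local_queues, cpu).tasks);
}

/* Enqueue onto the CURRENT CPU's queue. No cross-CPU spinlock is taken;
 * get_cpu_ptr() disables preemption so the task cannot migrate to another
 * CPU while we modify the queue. A real run queue still needs a per-CPU
 * lock (or IRQ disabling) for wakeups from interrupt context and for
 * load balancing from other CPUs. */
static void local_enqueue(struct task_struct *p) {
    struct local_queue *q = get_cpu_ptr(&local_queues);

    list_add_tail(&p->run_list, &q->tasks);
    q->nr_tasks++;

    put_cpu_ptr(&local_queues);    /* Re-enable preemption */
}
```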
Queue implementation is a masterclass in practical data structure design—balancing theoretical complexity with real-world concerns like cache behavior, memory layout, and synchronization.
What's Next:
With a solid understanding of queue implementation, we're ready to examine Queue Management—the policies and algorithms that determine how queues are maintained, how load is balanced, and how queue behavior is tuned for different workloads.
You now understand the data structures that power process queues: intrusive linked lists for flexibility, arrays for cache efficiency, trees for priority ordering, and hash tables for fast lookup. You've seen how these structures are synchronized and optimized for the demanding environment of kernel code.