We've explored what process queues are and their roles in operating systems—but how are they actually built? The choice of data structure for a process queue has profound implications for system performance. These queues are accessed on every context switch, every I/O completion, and every process creation. Even small inefficiencies multiply across millions of operations per second.
This page dives deep into the implementation details: the linked lists, arrays, trees, and hybrid structures that kernel developers use to implement efficient process queues. You'll understand not just which structures are used, but why they're chosen and how their characteristics match the requirements of different queue types.
By the end of this page, you will understand the intrusive linked list design pattern used in kernel code, array-based queue implementations for fixed-size scenarios, tree-based structures for priority ordering, the trade-offs between different implementations, and memory considerations for queue nodes.
Before examining specific implementations, let's understand the requirements that drive data structure selection for process queues.
| Queue Type | Primary Operations | Priority Needed | Typical Size |
|---|---|---|---|
| Job Queue | Iterate, lookup by PID, insert, remove | No | 100 - 100,000 processes |
| Ready Queue | Remove-min, insert, priority update | Yes (usually) | 1 - 100 per CPU |
| Device Queue | Insert-back, remove-front, reorder | Sometimes (I/O priority) | 1 - 1000 per device |
| Wait Queue | Insert, remove-all, remove-one | No | 1 - 100 per condition |
Different queue types have different access patterns, leading to different optimal data structures. The job queue is traversed but rarely searched by position. The ready queue needs fast min-extraction. Device queues may need reordering. A single data structure rarely satisfies all needs efficiently.
The most common data structure for process queues in kernels is the intrusive linked list. Unlike standard library containers where the container owns the nodes, intrusive lists embed the linkage pointers directly within the data structures being linked.
Linux Kernel list_head:
Linux uses the list_head structure for doubly-linked lists throughout the kernel:
```c
/* The fundamental list structure - just two pointers */
struct list_head {
    struct list_head *next;
    struct list_head *prev;
};

/* Initialize a list head (empty list points to itself) */
#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define LIST_HEAD(name) struct list_head name = LIST_HEAD_INIT(name)

static inline void INIT_LIST_HEAD(struct list_head *list) {
    list->next = list;
    list->prev = list;
}

/* Check if list is empty */
static inline int list_empty(const struct list_head *head) {
    return head->next == head;
}

/* Internal insert helper */
static inline void __list_add(struct list_head *new,
                              struct list_head *prev,
                              struct list_head *next) {
    next->prev = new;
    new->next = next;
    new->prev = prev;
    prev->next = new;
}

/* Insert at head (stack behavior) */
static inline void list_add(struct list_head *new, struct list_head *head) {
    __list_add(new, head, head->next);
}

/* Insert at tail (queue behavior) */
static inline void list_add_tail(struct list_head *new, struct list_head *head) {
    __list_add(new, head->prev, head);
}

/* Remove entry from list */
static inline void list_del(struct list_head *entry) {
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    /* Poison pointers to catch use-after-remove bugs */
    entry->next = LIST_POISON1;
    entry->prev = LIST_POISON2;
}

/* The magic: get containing structure from list_head pointer */
#define container_of(ptr, type, member) ({ \
    const typeof(((type *)0)->member) *__mptr = (ptr); \
    (type *)((char *)__mptr - offsetof(type, member)); })

#define list_entry(ptr, type, member) \
    container_of(ptr, type, member)

/* Iterate over list entries */
#define list_for_each_entry(pos, head, member) \
    for (pos = list_entry((head)->next, typeof(*pos), member); \
         &pos->member != (head); \
         pos = list_entry(pos->member.next, typeof(*pos), member))
```

Using list_head in task_struct:
```c
/* Process structure with multiple list memberships */
struct task_struct {
    /* ... other fields ... */

    /* ALL-TASKS LIST (job queue) */
    struct list_head tasks;      /* Link in global task list */

    /* SIBLING LIST (children of same parent) */
    struct list_head sibling;    /* Link in parent's children list */
    struct list_head children;   /* Head of this process's children */

    /* RUN QUEUE (ready queue) - CFS uses run_node in sched_entity */
    struct sched_entity se;      /* Contains rb_node for CFS tree */

    /* ... more list_heads for other memberships ... */
};

/* A single process can be in multiple lists simultaneously */
void demonstrate_multi_membership(struct task_struct *task) {
    struct list_head all_tasks = LIST_HEAD_INIT(all_tasks);

    /* Add to job queue (global task list) */
    list_add(&task->tasks, &all_tasks);

    /* Add to parent's children list */
    list_add(&task->sibling, &current->children);

    /* Task is now in THREE lists using THREE different list_heads:
     * 1. all_tasks (via task->tasks)
     * 2. parent->children (via task->sibling)
     * 3. Its own children list head (task->children)
     */
}

/* Iterate all processes in the system */
void print_all_pids(void) {
    struct task_struct *task;

    /* for_each_process uses the 'tasks' list_head */
    for_each_process(task) {
        printk("PID: %d\n", task->pid);
    }
}
```

The container_of macro is the key to intrusive lists. Given a pointer to a list_head member within a structure, it calculates the address of the containing structure using pointer arithmetic. offsetof(type, member) gives the byte offset of member within type. Subtracting this from the member's address gives the structure's base address.
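To make the pointer arithmetic concrete, here is a minimal user-space sketch (the struct job type and its field names are hypothetical, and the macro is the simplified form without the kernel's typeof type check):

```c
#include <stddef.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel definitions above */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct job {
    int id;
    struct list_head link;   /* Embedded linkage, as in task_struct */
};

int main(void) {
    struct job j = { .id = 42 };
    struct list_head *p = &j.link;   /* All we have is the member pointer */

    /* Walk back from the member to the containing struct job */
    struct job *owner = container_of(p, struct job, link);
    printf("id = %d\n", owner->id);  /* Prints: id = 42 */
    return 0;
}
```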
For queues with bounded size or where cache efficiency is paramount, array-based implementations offer advantages over linked lists.
Circular Buffer (Ring Buffer):
For FIFO queues with bounded size, circular buffers are optimal:
```c
/* Fixed-size circular buffer for process pointers */
struct circular_queue {
    struct task_struct *buffer[MAX_QUEUE_SIZE];
    int head;    /* Index of next element to remove */
    int tail;    /* Index of next free slot */
    int count;   /* Number of elements */
    spinlock_t lock;
};

/* Initialize queue */
void circular_queue_init(struct circular_queue *q) {
    q->head = 0;
    q->tail = 0;
    q->count = 0;
    spin_lock_init(&q->lock);
}

/* Enqueue at tail - O(1) */
int circular_queue_enqueue(struct circular_queue *q, struct task_struct *task) {
    unsigned long flags;

    spin_lock_irqsave(&q->lock, flags);

    if (q->count >= MAX_QUEUE_SIZE) {
        spin_unlock_irqrestore(&q->lock, flags);
        return -ENOSPC;    /* Queue full */
    }

    q->buffer[q->tail] = task;
    q->tail = (q->tail + 1) % MAX_QUEUE_SIZE;    /* Wrap around */
    q->count++;

    spin_unlock_irqrestore(&q->lock, flags);
    return 0;
}

/* Dequeue from head - O(1) */
struct task_struct *circular_queue_dequeue(struct circular_queue *q) {
    struct task_struct *task;
    unsigned long flags;

    spin_lock_irqsave(&q->lock, flags);

    if (q->count == 0) {
        spin_unlock_irqrestore(&q->lock, flags);
        return NULL;    /* Queue empty */
    }

    task = q->buffer[q->head];
    q->buffer[q->head] = NULL;    /* Clear (optional, helps debugging) */
    q->head = (q->head + 1) % MAX_QUEUE_SIZE;
    q->count--;

    spin_unlock_irqrestore(&q->lock, flags);
    return task;
}

/* Lock-free variant using atomic operations */
struct lockfree_circular_queue {
    struct task_struct *buffer[MAX_QUEUE_SIZE];
    atomic_t head;
    atomic_t tail;
};

/* Single-producer, single-consumer lock-free enqueue */
int spsc_enqueue(struct lockfree_circular_queue *q, struct task_struct *task) {
    int tail = atomic_read(&q->tail);
    int next_tail = (tail + 1) % MAX_QUEUE_SIZE;

    /* Check if full (tail would catch head) */
    if (next_tail == atomic_read(&q->head))
        return -ENOSPC;

    q->buffer[tail] = task;
    smp_wmb();    /* Ensure the write is visible before updating tail */
    atomic_set(&q->tail, next_tail);

    return 0;
}
```

Array of Lists (Multi-Level Priority Queue):
The classic O(1) scheduler design combines arrays with linked lists:
```c
/* Array-of-lists structure for O(1) priority scheduling */
#define NUM_PRIORITY_LEVELS 140

struct prio_array {
    /* Bitmap: bit n set if queue[n] is non-empty */
    unsigned long bitmap[BITS_TO_LONGS(NUM_PRIORITY_LEVELS)];

    /* Array of list heads, one per priority level */
    struct list_head queue[NUM_PRIORITY_LEVELS];

    /* Total number of tasks */
    int nr_active;
};

/* Initialize the priority array */
void prio_array_init(struct prio_array *pa) {
    int i;

    bitmap_zero(pa->bitmap, NUM_PRIORITY_LEVELS);
    pa->nr_active = 0;

    for (i = 0; i < NUM_PRIORITY_LEVELS; i++) {
        INIT_LIST_HEAD(&pa->queue[i]);
    }
}

/* Enqueue task at its priority level - O(1) */
void prio_array_enqueue(struct prio_array *pa, struct task_struct *task) {
    int prio = task->priority;    /* 0-139 */

    list_add_tail(&task->run_list, &pa->queue[prio]);
    __set_bit(prio, pa->bitmap);
    pa->nr_active++;
}

/* Find and remove highest priority task - O(1) */
struct task_struct *prio_array_dequeue(struct prio_array *pa) {
    int prio;
    struct task_struct *task;

    if (pa->nr_active == 0)
        return NULL;

    /* Find first (highest priority = lowest number) non-empty queue */
    prio = sched_find_first_bit(pa->bitmap);

    /* Remove first task from that queue */
    task = list_first_entry(&pa->queue[prio], struct task_struct, run_list);
    list_del_init(&task->run_list);

    /* Clear bitmap bit if queue is now empty */
    if (list_empty(&pa->queue[prio])) {
        __clear_bit(prio, pa->bitmap);
    }

    pa->nr_active--;
    return task;
}

/* sched_find_first_bit is highly optimized */
static inline int sched_find_first_bit(const unsigned long *bitmap) {
    /* On x86, this compiles to just a few instructions:
     * - Load bitmap word
     * - BSF (Bit Scan Forward) instruction
     * Total: ~3-5 cycles
     */
    return find_first_bit(bitmap, NUM_PRIORITY_LEVELS);
}
```

The bitmap is the key to O(1) behavior. Finding the first set bit in a 64-bit word takes 1-3 CPU cycles using hardware instructions like BSF (Bit Scan Forward) or TZCNT (Trailing Zero Count). Even with 140 priorities requiring 3 64-bit words, finding the highest priority is extremely fast.
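To give a rough sense of why the bitmap scan is so cheap, here is a small user-space sketch that stands in for the kernel's find_first_bit using the GCC/Clang __builtin_ctzl intrinsic (which maps to TZCNT/BSF on x86); the bitmap layout mirrors the prio_array above:

```c
#include <stdio.h>

#define NUM_PRIORITY_LEVELS 140
#define BITS_PER_LONG       64
#define NUM_WORDS ((NUM_PRIORITY_LEVELS + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Scan the bitmap word by word; __builtin_ctzl counts trailing zeros */
static int find_first_set(const unsigned long bitmap[NUM_WORDS]) {
    for (int w = 0; w < NUM_WORDS; w++) {
        if (bitmap[w])
            return w * BITS_PER_LONG + __builtin_ctzl(bitmap[w]);
    }
    return -1;    /* No runnable task at any priority */
}

int main(void) {
    unsigned long bitmap[NUM_WORDS] = { 0 };

    /* Mark priorities 100 and 17 as having runnable tasks */
    bitmap[100 / BITS_PER_LONG] |= 1UL << (100 % BITS_PER_LONG);
    bitmap[17 / BITS_PER_LONG]  |= 1UL << (17 % BITS_PER_LONG);

    /* Lowest number wins: prints 17 */
    printf("highest priority level = %d\n", find_first_set(bitmap));
    return 0;
}
```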
When priorities are continuous (like virtual runtime in CFS) rather than discrete levels, tree structures provide efficient priority ordering.
Red-Black Trees in CFS:
Linux CFS uses red-black trees because they provide O(log n) insertion and deletion with worst-case balancing guarantees, in-order traversal of entities sorted by virtual runtime, and, via a cached leftmost pointer, O(1) access to the entity with the smallest vruntime:
```c
/* Red-black tree node - embedded in managed structures */
struct rb_node {
    unsigned long __rb_parent_color;    /* Parent pointer + color bit packed */
    struct rb_node *rb_right;
    struct rb_node *rb_left;
};

/* Tree root with cached leftmost node */
struct rb_root_cached {
    struct rb_root rb_root;
    struct rb_node *rb_leftmost;    /* Cache of min element */
};

/* Colors are encoded in the low bit of parent pointer */
#define RB_RED   0
#define RB_BLACK 1

#define rb_parent(r)   ((struct rb_node *)((r)->__rb_parent_color & ~3))
#define rb_color(r)    ((r)->__rb_parent_color & 1)
#define rb_is_red(r)   (!rb_color(r))
#define rb_is_black(r) rb_color(r)

/* Get first (leftmost/minimum) element - O(1) with cache */
static inline struct rb_node *rb_first_cached(const struct rb_root_cached *root)
{
    return root->rb_leftmost;
}

/* Insert node and rebalance */
void rb_insert_color_cached(struct rb_node *node,
                            struct rb_root_cached *root, bool leftmost) {
    /* Update leftmost cache if this is the new minimum */
    if (leftmost)
        root->rb_leftmost = node;

    /* Standard RB-tree rebalancing */
    rb_insert_color(node, &root->rb_root);
}

/* CFS scheduling entity with rb_node embedded */
struct sched_entity {
    /* ... load weight, vruntime, etc. ... */
    struct rb_node run_node;    /* Embedded RB-tree node */
    u64 vruntime;               /* Key for tree ordering */
};

/* Insert sched_entity into CFS tree */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) {
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    bool leftmost = true;

    /* Binary search for insertion point */
    while (*link) {
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);

        if (entity_before(se, entry)) {    /* Compare vruntimes */
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = false;    /* Not the new minimum */
        }
    }

    /* Link node into tree */
    rb_link_node(&se->run_node, parent, link);

    /* Color the node and rebalance; update leftmost cache */
    rb_insert_color_cached(&se->run_node, &cfs_rq->tasks_timeline, leftmost);
}

/* Remove sched_entity from CFS tree */
static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se) {
    rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
}
```

Binary Heaps for Priority Queues:
Heaps are simpler than RB-trees and can be more cache-friendly with array storage:
```c
/* Array-based binary min-heap for deadline scheduling */
struct heap_queue {
    struct task_struct *tasks[HEAP_SIZE];
    int size;
    spinlock_t lock;
};

/* Heap helper functions */
#define PARENT(i) (((i) - 1) / 2)
#define LEFT(i)   ((2 * (i)) + 1)
#define RIGHT(i)  ((2 * (i)) + 2)

/* Compare by deadline (lower deadline = higher priority) */
static inline bool higher_priority(struct task_struct *a, struct task_struct *b) {
    return a->deadline < b->deadline;
}

/* Swap two heap entries */
static inline void heap_swap(struct heap_queue *heap, int i, int j) {
    struct task_struct *tmp = heap->tasks[i];
    heap->tasks[i] = heap->tasks[j];
    heap->tasks[j] = tmp;

    /* Update heap index stored in task (for O(1) removal) */
    heap->tasks[i]->heap_idx = i;
    heap->tasks[j]->heap_idx = j;
}

/* Bubble up after insertion */
static void heap_bubble_up(struct heap_queue *heap, int idx) {
    while (idx > 0) {
        int parent = PARENT(idx);
        if (higher_priority(heap->tasks[idx], heap->tasks[parent])) {
            heap_swap(heap, idx, parent);
            idx = parent;
        } else {
            break;    /* Heap property satisfied */
        }
    }
}

/* Bubble down after extraction or priority change */
static void heap_bubble_down(struct heap_queue *heap, int idx) {
    while (LEFT(idx) < heap->size) {
        int smallest = idx;
        int left = LEFT(idx);
        int right = RIGHT(idx);

        if (higher_priority(heap->tasks[left], heap->tasks[smallest]))
            smallest = left;
        if (right < heap->size &&
            higher_priority(heap->tasks[right], heap->tasks[smallest]))
            smallest = right;

        if (smallest != idx) {
            heap_swap(heap, idx, smallest);
            idx = smallest;
        } else {
            break;
        }
    }
}

/* Insert task - O(log n) */
int heap_insert(struct heap_queue *heap, struct task_struct *task) {
    if (heap->size >= HEAP_SIZE)
        return -ENOMEM;

    int idx = heap->size++;
    heap->tasks[idx] = task;
    task->heap_idx = idx;

    heap_bubble_up(heap, idx);
    return 0;
}

/* Extract minimum - O(log n) */
struct task_struct *heap_extract_min(struct heap_queue *heap) {
    if (heap->size == 0)
        return NULL;

    struct task_struct *min = heap->tasks[0];

    /* Move last element to root */
    heap->tasks[0] = heap->tasks[--heap->size];
    heap->tasks[0]->heap_idx = 0;

    /* Restore heap property */
    heap_bubble_down(heap, 0);

    min->heap_idx = -1;    /* No longer in heap */
    return min;
}

/* Remove arbitrary element using stored index - O(log n) */
void heap_remove(struct heap_queue *heap, struct task_struct *task) {
    int idx = task->heap_idx;

    if (idx < 0 || idx >= heap->size)
        return;

    /* Move last element to the removed position */
    heap->tasks[idx] = heap->tasks[--heap->size];
    heap->tasks[idx]->heap_idx = idx;

    /* Fix heap property (may need to go up or down) */
    if (idx > 0 && higher_priority(heap->tasks[idx], heap->tasks[PARENT(idx)])) {
        heap_bubble_up(heap, idx);
    } else {
        heap_bubble_down(heap, idx);
    }

    task->heap_idx = -1;
}
```

Heaps are simpler and, being array-based, offer better cache locality, but O(log n) removal of an arbitrary element requires storing a heap index in each element. RB-trees keep elements fully sorted, supporting in-order traversal and efficient range queries, with worst-case O(log n) guarantees. CFS chose RB-trees partly because visualizing and debugging a sorted tree is much easier than a heap.
While linked lists and trees organize processes for iteration and priority ordering, hash tables provide O(1) lookup by process ID—essential for operations like signal delivery and /proc access.
```c
/* Linux uses IDR (ID Radix tree) for PID lookup, but let's show
 * the conceptual hash table approach for clarity */

#define PID_HASH_BITS 12    /* 4096 buckets */
#define PID_HASH_SIZE (1 << PID_HASH_BITS)

struct pid_hash_table {
    struct hlist_head buckets[PID_HASH_SIZE];
    spinlock_t lock;
};

/* Hash bucket list (singly-linked with back-pointer for O(1) removal) */
struct hlist_head {
    struct hlist_node *first;
};

struct hlist_node {
    struct hlist_node *next;
    struct hlist_node **pprev;    /* Pointer to previous node's next pointer */
};

/* PID structure with hash linkage */
struct pid {
    pid_t nr;                     /* The PID number */
    struct hlist_node hash_link;  /* Link in hash table */
    struct task_struct *task;     /* Associated task */
    atomic_t count;               /* Reference count */
};

/* Hash function for PIDs */
static inline unsigned int pid_hash(pid_t pid) {
    /* Golden ratio hash; keep the multiply in 32 bits so the top
     * PID_HASH_BITS bits of the product select the bucket */
    return ((unsigned int)pid * 0x9e370001U) >> (32 - PID_HASH_BITS);
}

/* Insert PID into hash table - O(1) average */
void pid_hash_insert(struct pid_hash_table *table, struct pid *pid) {
    unsigned int hash = pid_hash(pid->nr);

    spin_lock(&table->lock);
    hlist_add_head(&pid->hash_link, &table->buckets[hash]);
    spin_unlock(&table->lock);
}

/* Lookup task by PID - O(1) average */
struct task_struct *find_task_by_pid(struct pid_hash_table *table, pid_t nr) {
    unsigned int hash = pid_hash(nr);
    struct pid *pid;
    struct task_struct *task = NULL;

    rcu_read_lock();    /* Lock-free read via RCU */
    hlist_for_each_entry_rcu(pid, &table->buckets[hash], hash_link) {
        if (pid->nr == nr) {
            task = pid->task;
            break;
        }
    }
    rcu_read_unlock();

    return task;
}

/* Remove PID from hash table - O(1) */
void pid_hash_remove(struct pid *pid) {
    /* hlist allows O(1) removal using the pprev pointer */
    hlist_del_rcu(&pid->hash_link);
}

/* Modern Linux uses IDR (radix tree) instead:
 * - Better worst-case behavior
 * - Efficient ID allocation
 * - Supports hierarchical namespaces
 */
struct idr {
    struct radix_tree_root idr_rt;
    unsigned int idr_base;
    unsigned int idr_next;
};

/* IDR provides both allocation and lookup */
int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp);
void *idr_find(const struct idr *idr, unsigned long id);
void *idr_remove(struct idr *idr, unsigned long id);
```

Why Multiple Data Structures?
The job queue uses BOTH a linked list (for iteration) AND a hash table (for lookup). This dual-indexing pattern is common in kernel data structures, as sketched below.
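As a conceptual sketch of dual indexing (reusing the list_head and hlist_node linkage shown earlier; the proc_record type and helper names are hypothetical, and locking is omitted), one object simply carries two independent sets of links:

```c
/* Hypothetical process record indexed two ways at once */
struct proc_record {
    pid_t pid;
    struct list_head  all_link;    /* Position in the global iteration list */
    struct hlist_node hash_link;   /* Position in a PID hash bucket */
    /* ... other per-process data ... */
};

static LIST_HEAD(all_procs);                        /* "Walk every process" */
static struct hlist_head proc_hash[PID_HASH_SIZE];  /* "Find by PID" */

/* Registration links the record into BOTH indexes
 * (a real kernel would do this under a lock or with RCU) */
static void proc_record_register(struct proc_record *p) {
    list_add_tail(&p->all_link, &all_procs);
    hlist_add_head(&p->hash_link, &proc_hash[pid_hash(p->pid)]);
}

/* Removal must unlink from BOTH indexes, or one of them dangles */
static void proc_record_unregister(struct proc_record *p) {
    list_del(&p->all_link);
    hlist_del(&p->hash_link);
}
```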
Linux uses RCU (Read-Copy-Update) for the PID hash table. Readers can traverse the hash table without locks—they only need rcu_read_lock() which doesn't actually acquire a lock. Writers use careful ordering and deferred reclamation. This makes operations like signal delivery extremely fast.
Queue implementation must consider memory layout for cache efficiency. Modern CPUs have multiple cache levels, and cache misses can cost hundreds of cycles.
| Access Pattern | Cache Behavior | Performance Impact |
|---|---|---|
| Sequential array iteration | Excellent - prefetching works | ~1 cycle per element after warmup |
| Linked list with locality | Good - elements near each other | ~3-10 cycles per element |
| Linked list with fragmentation | Poor - random memory access | ~50-200 cycles per element |
| Hash table lookup (hit) | Moderate - 2-3 pointer dereferences | ~10-50 cycles |
| Tree traversal (balanced) | Poor - random access down tree | ~50-100 cycles per level |
```c
/* Technique 1: Slab Allocation
 * Allocate task_structs from a dedicated cache, keeping them
 * in contiguous memory and improving cache behavior.
 */
struct kmem_cache *task_struct_cachep;

void __init task_cache_init(void) {
    task_struct_cachep = kmem_cache_create(
        "task_struct",
        sizeof(struct task_struct),
        __alignof__(struct task_struct),
        SLAB_PANIC | SLAB_ACCOUNT,
        NULL);
}

struct task_struct *alloc_task_struct_node(int node) {
    return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL, node);
}

/* Technique 2: Hot/Cold Field Separation
 * Group frequently accessed fields together at the start
 * of the structure to fit in a single cache line.
 */
struct task_struct {
    /* HOT DATA - accessed during scheduling */
    volatile long state;    /* Current state */
    void *stack;            /* Kernel stack */
    unsigned int flags;     /* Process flags */
    int on_cpu;             /* Currently running on which CPU */
    int prio;               /* Dynamic priority */
    /* ... */

    /* COLD DATA - accessed less frequently */
    struct mm_struct *mm;           /* Address space */
    struct files_struct *files;     /* Open files */
    /* ... */

    /* VERY COLD DATA - rarely accessed */
    char comm[TASK_COMM_LEN];       /* Process name */
    struct audit_context *audit_context;
    /* ... */
};

/* Technique 3: Percpu Caching
 * For per-CPU run queues, ensure each CPU's queue
 * is on a different cache line to avoid false sharing.
 */
DEFINE_PER_CPU_ALIGNED(struct rq, runqueues);

/* The _ALIGNED suffix ensures each CPU's rq starts on
 * a cache line boundary (typically 64 bytes).
 * This prevents CPU 0's writes from invalidating CPU 1's cache.
 */

/* Technique 4: Prefetching
 * Hint to the CPU that certain data will be needed soon.
 */
void traverse_ready_queue(struct list_head *head) {
    struct task_struct *task, *next;

    list_for_each_entry_safe(task, next, head, run_list) {
        /* Prefetch next element while processing current */
        prefetch(&next->state);
        prefetch(&next->prio);

        /* Process current task */
        update_task_accounting(task);
    }
}
```

On NUMA (Non-Uniform Memory Access) systems, memory access time depends on which CPU is accessing which memory node. Queue structures should be allocated on the node where they'll be primarily accessed. Linux's NUMA-aware scheduling tries to keep processes near their memory, and queue structures are allocated per-node when beneficial.
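A minimal sketch of per-node placement, assuming a hypothetical device_queue type: the queue is allocated on the memory node of the CPU that will service it, using kzalloc_node and cpu_to_node.

```c
/* Hypothetical per-device queue structure */
struct device_queue {
    struct list_head waiting;    /* Requests waiting for the device */
    spinlock_t lock;
};

/* Allocate the queue on the memory node local to the servicing CPU */
static struct device_queue *alloc_queue_for_cpu(int cpu) {
    int node = cpu_to_node(cpu);
    struct device_queue *q;

    /* kzalloc_node places the allocation on the requested NUMA node,
     * so the CPU that drains this queue avoids remote-memory latency. */
    q = kzalloc_node(sizeof(*q), GFP_KERNEL, node);
    if (!q)
        return NULL;

    INIT_LIST_HEAD(&q->waiting);
    spin_lock_init(&q->lock);
    return q;
}
```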
Process queues are accessed from multiple CPUs and contexts (process context, interrupt context, softirq context). Proper synchronization is essential for correctness and performance.
| Mechanism | Use Case | Overhead | Scalability |
|---|---|---|---|
| Spinlock | Short critical sections, interrupt-safe | Low (if uncontended) | Poor (one CPU at a time) |
| Mutex | Longer critical sections, can sleep | Medium | Poor (one CPU at a time) |
| RW Lock | Read-heavy workloads | Medium | Good for reads |
| RCU | Read-mostly, rare writes | Very low for reads | Excellent for reads |
| Per-CPU Data | Avoid sharing entirely | Very low | Excellent |
| Lock-free | Specific patterns (SPSC) | Low | Excellent |
```c
/* Ready queue (per-CPU) uses spinlock with interrupt disable */
void enqueue_task(struct rq *rq, struct task_struct *p) {
    raw_spin_lock(&rq->lock);    /* Spinlock, not sleeping lock */

    /* ... enqueue operation ... */
    list_add_tail(&p->run_list, &rq->cfs.tasks);

    raw_spin_unlock(&rq->lock);
}

/* When called from interrupt context, must save/restore flags */
void wake_up_task(struct task_struct *p) {
    unsigned long flags;
    struct rq *rq;

    raw_spin_lock_irqsave(&p->pi_lock, flags);    /* Disable interrupts */
    rq = __task_rq_lock(p, &flags);

    /* ... wakeup operation ... */

    __task_rq_unlock(rq, &flags);
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);    /* Restore interrupts */
}

/* PID hash table uses RCU for lock-free reading */
struct task_struct *find_task_by_vpid(pid_t vnr) {
    struct task_struct *task;

    rcu_read_lock();    /* Marks read-side critical section */
    /* No actual lock acquired - just increments preempt count */

    task = find_task_by_pid_ns(vnr, task_active_pid_ns(current));

    rcu_read_unlock();    /* End of critical section */
    return task;
}

/* Writers to PID hash must synchronize with RCU */
void detach_pid(struct task_struct *task, enum pid_type type) {
    struct pid *pid = task->pids[type].pid;
    struct hlist_head *head = &pid_hash[pid_hashfn(pid->nr)];

    spin_lock(&pidmap_lock);           /* Writers still need a lock */
    hlist_del_rcu(&pid->pid_chain);    /* RCU-safe deletion */
    spin_unlock(&pidmap_lock);

    /* Actual memory free is deferred until all RCU readers complete */
    call_rcu(&pid->rcu, delayed_put_pid);
}

/* Lock ordering is critical to prevent deadlock */
/*
 * Lock hierarchy (must acquire in this order):
 * 1. task->pi_lock (protects priority inheritance)
 * 2. rq->lock (protects run queue)
 * 3. p->lock (protects individual task)
 *
 * Never hold a higher-numbered lock while trying to acquire
 * a lower-numbered one!
 */
```

The most elegant solution to queue synchronization is avoiding shared data entirely. Per-CPU run queues mean each CPU only accesses its own queue, requiring no locking for local operations. Load balancing still requires cross-CPU access, but it's infrequent compared to scheduling decisions. This is why the move from global run queue to per-CPU queues was transformative for SMP scalability.
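To illustrate the per-CPU idea, here is a minimal sketch (the local_queue type is hypothetical and far simpler than the scheduler's struct rq, and protection against interrupt-context access is omitted) in which each CPU enqueues onto its own instance without touching any shared lock:

```c
/* Hypothetical per-CPU queue: one instance per CPU, cache-line aligned */
struct local_queue {
    struct list_head tasks;
    unsigned int nr_tasks;
};

static DEFINE_PER_CPU_ALIGNED(struct local_queue, local_queues);

/* Each CPU's list head must be initialized before use */
static void local_queues_init(void) {
    int cpu;

    for_each_possible_cpu(cpu)
        INIT_LIST_HEAD(&per_cpu(local_queues, cpu).tasks);
}

/* Enqueue onto the CURRENT CPU's queue. No cross-CPU spinlock is taken;
 * get_cpu_ptr() disables preemption so the task cannot migrate to another
 * CPU while we modify the queue. A real run queue still needs a per-CPU
 * lock (or IRQ disabling) for wakeups from interrupt context and for
 * load balancing from other CPUs. */
static void local_enqueue(struct task_struct *p) {
    struct local_queue *q = get_cpu_ptr(&local_queues);

    list_add_tail(&p->run_list, &q->tasks);
    q->nr_tasks++;

    put_cpu_ptr(&local_queues);    /* Re-enable preemption */
}
```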
Queue implementation is a masterclass in practical data structure design—balancing theoretical complexity with real-world concerns like cache behavior, memory layout, and synchronization.
What's Next:
With a solid understanding of queue implementation, we're ready to examine Queue Management—the policies and algorithms that determine how queues are maintained, how load is balanced, and how queue behavior is tuned for different workloads.
You now understand the data structures that power process queues: intrusive linked lists for flexibility, arrays for cache efficiency, trees for priority ordering, and hash tables for fast lookup. You've seen how these structures are synchronized and optimized for the demanding environment of kernel code.