CFS's elegant vruntime-based scheduling requires an equally elegant data structure. The scheduler needs to efficiently:

- Find the task with the minimum vruntime (on every scheduling decision)
- Insert a task when it becomes runnable (wakeup, fork)
- Remove a task when it blocks, exits, or is selected to run
The red-black tree is CFS's answer. This self-balancing binary search tree maintains O(log n) height, ensuring all operations complete quickly even with thousands of runnable tasks.
Red-black trees might seem like an academic data structure relegated to textbooks, but they're the backbone of one of the world's most important schedulers. Every time you run a process on Linux—whether on your laptop, a cloud server, or an Android phone—a red-black tree is making the scheduling decision.
By the end of this page, you will understand: (1) Red-black tree properties and invariants, (2) Why CFS chose RB-trees over other data structures, (3) How the Linux kernel implements and optimizes RB-trees, (4) The specific CFS integration including cached leftmost access, and (5) Performance characteristics and trade-offs.
A red-black tree is a binary search tree with additional properties that ensure the tree remains approximately balanced. Each node has a color (red or black), and the tree maintains specific invariants that prevent pathological imbalance.
Red-Black Tree Properties
Every valid red-black tree satisfies these five properties:

1. Every node is either red or black.
2. The root is black.
3. Every leaf (NIL sentinel) is black.
4. A red node's children are both black (no two adjacent red nodes on any path).
5. Every path from a node to its descendant leaves contains the same number of black nodes (the black-height).
These properties guarantee that the longest path is at most twice the shortest path, ensuring O(log n) height.
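A toy validator makes the invariants concrete. This is an illustrative sketch (names like `struct node` and `is_valid_rb` are made up, not kernel code); it checks Properties 2, 4, and 5 recursively, treating NULL as a black NIL leaf:

```c
#include <assert.h>
#include <stddef.h>

enum color { RED, BLACK };

struct node {
    enum color color;
    struct node *left, *right;
};

/* Returns the black-height of the subtree, or -1 if an
 * invariant is violated. NULL counts as a black NIL leaf. */
static int check(const struct node *n)
{
    if (!n)
        return 1;                        /* NIL leaves are black */
    if (n->color == RED &&
        ((n->left && n->left->color == RED) ||
         (n->right && n->right->color == RED)))
        return -1;                       /* Property 4: red-red violation */

    int lh = check(n->left);
    int rh = check(n->right);
    if (lh < 0 || rh < 0 || lh != rh)
        return -1;                       /* Property 5: black-height mismatch */
    return lh + (n->color == BLACK);
}

static int is_valid_rb(const struct node *root)
{
    /* Property 2: root must be black */
    return root && root->color == BLACK && check(root) > 0;
}
```

Running a checker like this after every insertion and deletion is a common way to unit-test an RB-tree implementation.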
```c
/*
 * Linux kernel red-black tree node structure
 * From: include/linux/rbtree.h
 *
 * The kernel uses a compact representation that embeds
 * the color in the low bit of the parent pointer.
 */
struct rb_node {
    unsigned long __rb_parent_color;    /* Parent pointer + color bit */
    struct rb_node *rb_right;
    struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));

/*
 * The alignment ensures the low bits of __rb_parent_color
 * are available for flags (the color).
 *
 * Bit 0: Color (0 = red, 1 = black)
 * Remaining bits: parent pointer (low bits masked off)
 */

/* Access macros */
#define rb_parent(r)   ((struct rb_node *)((r)->__rb_parent_color & ~3))
#define rb_color(r)    ((r)->__rb_parent_color & 1)
#define rb_is_red(r)   (!rb_color(r))
#define rb_is_black(r) rb_color(r)

/*
 * Root structure with optional leftmost caching
 * (critical for CFS performance)
 */
struct rb_root {
    struct rb_node *rb_node;    /* Root of tree */
};

struct rb_root_cached {
    struct rb_root rb_root;
    struct rb_node *rb_leftmost;    /* Cached minimum */
};

/*
 * The rb_root_cached structure is essential for CFS:
 * - rb_leftmost points to the task with minimum vruntime
 * - Selecting the next task is O(1) instead of O(log n)
 * - The cache is updated automatically during insertions
 */
```

Why These Properties Work
The 'no two adjacent reds' rule (Property 4) limits how 'stretched' a path can become with red nodes. The 'black-height consistency' rule (Property 5) ensures all paths have the same 'core structure.'
Together, they guarantee:

- The shortest root-to-leaf path is all black, with length equal to the black-height bh.
- The longest root-to-leaf path alternates red and black nodes, so its length is at most 2 × bh.
- A tree of black-height bh contains at least 2^bh − 1 internal nodes, so the height is at most 2 × log₂(n + 1).

This bounds the tree height logarithmically, making all operations O(log n) in the worst case—a crucial guarantee for a scheduler that runs millions of times per second.
Think of colors as 'slack' in the tree structure. Red nodes are 'bonus' nodes that don't count toward black-height. When the tree becomes too deep after an insertion, recoloring and rotations restore balance by either converting red nodes to black (gaining height capacity) or rotating to redistribute nodes.
CFS could have used many data structures for its vruntime-ordered task queue. Understanding why red-black trees were chosen illuminates both CFS's requirements and RB-tree strengths.
CFS's Access Pattern
CFS performs three fundamental operations:

1. Find minimum — pick the task with the smallest vruntime (every scheduling decision)
2. Insert — add a task when it becomes runnable (wakeup, fork)
3. Delete — remove a task when it blocks, exits, or is dispatched to run
The key observation: find minimum must be extremely fast, but insert/delete can tolerate O(log n).
| Data Structure | Find Min | Insert | Delete | CFS Suitability |
|---|---|---|---|---|
| Unsorted Array | O(n) | O(1) | O(n) | ❌ Min search too slow |
| Sorted Array | O(1) | O(n) | O(n) | ❌ Insert/delete too slow |
| Binary Heap | O(1) | O(log n) | O(log n)* | ⚠️ Delete requires search |
| Skip List | O(log n) | O(log n) | O(log n) | ⚠️ Probabilistic guarantees |
| AVL Tree | O(log n)** | O(log n) | O(log n) | ⚠️ More rotations than RB |
| Red-Black Tree | O(1)* | O(log n) | O(log n) | ✅ Perfect fit |
Notes on the comparison:

- \* Binary heap: deletion is O(log n) only once the element is located; without an external index, finding an arbitrary task is O(n).
- \* Red-black tree: O(1) find-min relies on the cached leftmost pointer (rb_root_cached); an uncached RB-tree would take O(log n).
- \*\* AVL tree: the minimum could be cached here too, but AVL's stricter balance costs more rotations per insert/delete.
Why Not a Heap?
Binary heaps seem ideal for minimum extraction, but CFS has a critical requirement: random access deletion. When a running task blocks, it must be removed from the runqueue. In a heap, this requires finding the element first (O(n) without external tracking), then performing the O(log n) delete.
Red-black trees provide direct pointer-based deletion when you already have a reference to the node—which CFS does, via the sched_entity structure.
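The embedded-node pattern behind this is container_of(): since the rb_node lives inside the owning structure at a known offset, the kernel can convert between the two with pure pointer arithmetic. A minimal sketch, using a hypothetical my_entity in place of sched_entity:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel type (illustrative only). */
struct rb_node {
    struct rb_node *rb_left, *rb_right;
};

/* Recover the containing structure from an embedded member,
 * as the kernel's rb_entry()/container_of() macros do. */
#define rb_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct my_entity {
    struct rb_node run_node;    /* embedded tree linkage */
    uint64_t vruntime;          /* sort key */
};

/* Given only the rb_node pointer (e.g. from a tree traversal),
 * get back to the entity that owns it. */
static uint64_t key_of(struct rb_node *n)
{
    return rb_entry(n, struct my_entity, run_node)->vruntime;
}
```

Because each entity carries its own rb_node, "remove this task" needs no search at all: the scheduler passes &se->run_node straight to the erase routine.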
Why Not AVL Trees?
AVL trees maintain stricter balance (height difference ≤ 1 between subtrees), providing faster lookups. However, this strictness requires more rotations during insertion and deletion. For CFS, where insertions/deletions are frequent, RB-trees' looser balance (up to 2:1 path ratio) reduces constant-factor overhead.
The Leftmost Cache Insight
The clincher for CFS is rb_root_cached. By maintaining a pointer to the leftmost (minimum vruntime) node, CFS achieves O(1) task selection. This cache is updated during insertions—when a new node becomes the leftmost, the pointer is updated. This combines the best of heaps (O(1) min) with BST flexibility (O(log n) delete).
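The cache update can be sketched with a toy (unbalanced) BST: during the insertion descent, if we ever step right, the new node cannot be the new minimum—exactly the leftmost flag CFS tracks. Names here are illustrative, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

struct tnode {
    long key;
    struct tnode *left, *right;
};

struct tree_cached {
    struct tnode *root;
    struct tnode *leftmost;    /* cached minimum, like rb_leftmost */
};

static void insert_cached(struct tree_cached *t, struct tnode *n)
{
    struct tnode **link = &t->root;
    int leftmost = 1;          /* stays true only if we never go right */

    n->left = n->right = NULL;
    while (*link) {
        if (n->key < (*link)->key) {
            link = &(*link)->left;
        } else {
            link = &(*link)->right;
            leftmost = 0;      /* new node can't be the minimum */
        }
    }
    *link = n;
    if (leftmost)
        t->leftmost = n;       /* O(1) min stays current */
}
```

Find-min is then just t->leftmost—a single pointer dereference, with no traversal.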
The Linux kernel emphasizes practical performance over theoretical optimality. RB-trees aren't the theoretically fastest for any single operation, but they provide excellent balanced performance across all operations CFS needs, with well-understood worst-case bounds and battle-tested implementation.
RB-tree operations maintain the five invariants through a combination of standard BST operations and repair procedures. Let's examine each operation in detail.
Insertion
Insertion follows the BST rule: traverse left if new key < current key, right otherwise. The new node is always inserted as a red leaf. This preserves Property 5 (black-height) but may violate Property 4 (no adjacent reds) if the parent is red.
After insertion, the tree may need rebalancing. This involves recoloring and rotations that propagate up the tree until all invariants are satisfied.
```c
/*
 * CFS Red-Black Tree Insertion (Simplified)
 *
 * The actual kernel code handles more edge cases and uses
 * optimized macros, but this captures the essence.
 */

static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
    struct rb_node *parent = NULL;
    struct sched_entity *entry;
    bool leftmost = true;

    /*
     * Phase 1: Find insertion point (standard BST traversal)
     */
    while (*link) {
        parent = *link;
        entry = rb_entry(parent, struct sched_entity, run_node);

        /*
         * Key comparison: vruntime
         * entity_before() handles wraparound correctly
         */
        if (entity_before(se, entry)) {
            link = &parent->rb_left;
        } else {
            link = &parent->rb_right;
            leftmost = false;    /* We went right, not leftmost */
        }
    }

    /*
     * Phase 2: Link and color the new node
     * rb_link_node() sets parent, colors node RED
     */
    rb_link_node(&se->run_node, parent, link);

    /*
     * Phase 3: Rebalance and update leftmost cache
     * rb_insert_color_cached() performs rotations as needed
     * and updates rb_leftmost if leftmost == true
     */
    rb_insert_color_cached(&se->run_node, &cfs_rq->tasks_timeline, leftmost);

    cfs_rq->nr_running++;
}

/*
 * Key observation: We track 'leftmost' during traversal.
 * If we never go right (link = &parent->rb_right), this
 * insertion creates the new minimum and updates the cache.
 * This makes pick_next_task_fair() O(1).
 */
```

Insertion Repair Cases
After inserting a red node, if its parent is also red, we have a violation. The repair procedure considers the 'uncle' (parent's sibling):
Case 1: Uncle is red — Recolor only: make the parent and uncle black and the grandparent red, then treat the grandparent as the newly inserted red node and repeat. No rotation is needed, but the violation may propagate up the tree.

Case 2: Uncle is black, node is inner child (zigzag) — Rotate at the parent to straighten the zigzag, converting the situation into Case 3.

Case 3: Uncle is black, node is outer child (straight) — Rotate at the grandparent and swap the colors of the parent and grandparent. The subtree root is now black, so the repair terminates.
These three cases cover all possible violations. A single insertion performs at most two rotations in total, and the tree's black-height grows only when Case 1 recoloring propagates all the way to the root.
```c
/*
 * CFS Red-Black Tree Deletion (Simplified)
 *
 * Deletion is more complex than insertion because
 * removing a node may violate the black-height invariant.
 */

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    /*
     * rb_erase_cached handles:
     * 1. Standard BST deletion (find replacement if needed)
     * 2. Rebalancing if we removed a black node
     * 3. Updating leftmost cache if we removed the minimum
     */
    rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);

    cfs_rq->nr_running--;
}

/*
 * Deletion repair is needed when we remove a BLACK node.
 * (Removing a red node doesn't affect black-height.)
 *
 * The repair considers the sibling of the node that
 * "inherits" the black deficit. Multiple cases exist,
 * but the key insight:
 *
 * - We need to "push" a black node down from the sibling's
 *   subtree to restore balance
 * - This may require rotations to position nodes correctly
 * - At most 3 rotations are needed
 */

/*
 * CFS-specific optimization:
 *
 * When a task blocks, we already have a pointer to its
 * sched_entity. rb_erase_cached() takes the rb_node
 * directly—no O(n) search required, unlike in a heap.
 *
 * This is why RB-trees excel for CFS: random access
 * deletion is O(log n), not O(n).
 */
```

Deletion repair has more cases than insertion because black-height violations are harder to fix locally. The kernel implementation uses an iterative approach with careful case analysis. While complex, it guarantees O(log n) worst-case—essential for scheduler predictability.
Rotations are the fundamental restructuring operation in RB-trees. They change the tree shape while preserving the BST ordering property. Understanding rotations is key to understanding how balance is maintained.
Left Rotation
A left rotation at node X promotes X's right child Y to take X's position, with X becoming Y's left child:
```
      X                 Y
     / \               / \
    a   Y     -->     X   c
       / \           / \
      b   c         a   b
```
Right Rotation
A right rotation is the mirror: Y's left child X is promoted, with Y becoming X's right child:
```
      Y                 X
     / \               / \
    X   c     -->     a   Y
   / \                   / \
  a   b                 b   c
```
Key Properties Preserved:

- Inorder (BST) ordering is unchanged: a < X < b < Y < c holds before and after either rotation.
- Only a constant number of pointers are updated, so every rotation is O(1).
- Subtree b changes parents (from Y to X in a left rotation, and back in a right rotation) without violating the ordering.
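These claims are easy to verify on a plain BST node—a toy sketch without the kernel's parent pointers: rotating left at X must leave the inorder sequence unchanged while subtree b moves from Y to X.

```c
#include <assert.h>
#include <stddef.h>

struct bst {
    int key;
    struct bst *left, *right;
};

/* Left-rotate the subtree rooted at *x: x's right child y takes
 * x's place, x becomes y's left child, and y's old left subtree
 * (the "b" subtree) becomes x's right child. */
static void rotate_left(struct bst **x)
{
    struct bst *y = (*x)->right;
    (*x)->right = y->left;    /* subtree 'b' moves from y to x */
    y->left = *x;
    *x = y;                   /* y is the new subtree root */
}

/* Record the inorder traversal so we can verify it is unchanged. */
static void inorder(const struct bst *n, int *out, int *i)
{
    if (!n)
        return;
    inorder(n->left, out, i);
    out[(*i)++] = n->key;
    inorder(n->right, out, i);
}
```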
```c
/*
 * Linux kernel RB-tree rotation implementation
 * From: lib/rbtree.c
 *
 * These are the building blocks for rebalancing.
 */

static inline void
__rb_rotate_set_parents(struct rb_node *old, struct rb_node *new,
                        struct rb_root *root, int color)
{
    struct rb_node *parent = rb_parent(old);

    /* Update new node's parent/color */
    new->__rb_parent_color = old->__rb_parent_color;

    /* Set old node's parent to new, with specified color */
    rb_set_parent_color(old, new, color);

    /* Update parent's child pointer */
    __rb_change_child(old, new, parent, root);
}

/*
 * Left rotation at 'node'
 *
 * Before:            After:
 *      P                 P
 *      |                 |
 *     node             right
 *     /  \             /   \
 *    a  right  -->   node   c
 *       /  \         /  \
 *      b    c       a    b
 */
static __always_inline void
__rb_rotate_left(struct rb_node *node, struct rb_root *root)
{
    struct rb_node *right = node->rb_right;
    struct rb_node *parent = rb_parent(node);

    /* node's right becomes right's left subtree */
    node->rb_right = right->rb_left;
    if (right->rb_left)
        rb_set_parent(right->rb_left, node);

    /* right takes node's position */
    right->rb_left = node;
    rb_set_parent(node, right);

    /* Update parent's pointer */
    if (parent) {
        if (node == parent->rb_left)
            parent->rb_left = right;
        else
            parent->rb_right = right;
    } else {
        root->rb_node = right;
    }
    rb_set_parent(right, parent);
}

/*
 * Performance note:
 *
 * Rotations are O(1) operations—just pointer updates.
 * The logarithmic bound comes from tree height limiting
 * how many rotations might be needed in the worst case.
 *
 * In practice:
 * - Insertions require at most 2 rotations
 * - Deletions require at most 3 rotations
 */
```

Visualizing Rotation Purpose
Think of rotations as 'reshuffling' the tree: a rotation moves one node up a level and its former parent down a level, transferring the middle subtree between them—all without breaking the inorder ordering.
Rotations never happen in isolation in RB-trees. They're always paired with recoloring to fix specific invariant violations. The art is knowing which rotation+recolor combination fixes each violation pattern.
RB-tree repair is designed so that once a rotation is performed (Cases 2/3 of insertion, final cases of deletion), the tree is completely fixed. Recoloring can propagate to the root, but it's O(1) per level and only increases the tree's ability to absorb future changes. This bounds rotation count and provides excellent amortized performance.
The Linux kernel's RB-tree implementation includes several optimizations specifically for CFS's usage pattern.
The Leftmost Cache
CFS uses rb_root_cached to maintain a pointer to the leftmost node. This is updated during insertion:
A leftmost flag tracked during traversal determines in O(1) whether the new node becomes the minimum. This transforms pick_next_task_fair() from O(log n) (traverse to find the minimum) to O(1) (dereference a pointer).
```c
/*
 * CFS Red-Black Tree Integration
 *
 * The key structures that tie RB-trees to CFS scheduling
 */

struct sched_entity {
    /* Link to the RB-tree node */
    struct rb_node run_node;

    /* Virtual runtime - the tree's sort key */
    u64 vruntime;

    /* Other fields... */
    u64 sum_exec_runtime;
    u64 prev_sum_exec_runtime;
    struct load_weight load;
    /* ... */
};

struct cfs_rq {
    /*
     * The RB-tree with leftmost cache
     * Key = vruntime, stored in sched_entity
     */
    struct rb_root_cached tasks_timeline;

    /*
     * Number of runnable entities
     */
    unsigned int nr_running;

    /*
     * Minimum vruntime across all entities
     * (not the same as leftmost->vruntime during updates)
     */
    u64 min_vruntime;

    /*
     * Currently running entity (might not be in tree)
     */
    struct sched_entity *curr;

    /* ... more fields for load tracking, bandwidth, etc. */
};

/*
 * O(1) minimum extraction - the heart of CFS task selection
 */
static struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
    struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

    if (!left)
        return NULL;

    return rb_entry(left, struct sched_entity, run_node);
}

/*
 * rb_first_cached is literally just:
 *
 *   static inline struct rb_node *
 *   rb_first_cached(const struct rb_root_cached *tree)
 *   {
 *       return tree->rb_leftmost;
 *   }
 *
 * One pointer dereference. That's O(1) task selection.
 */
```

Key takeaways:

- rb_root_cached maintains a pointer to the minimum, enabling O(1) next-task selection instead of an O(log n) leftmost traversal.
- run_node is embedded in sched_entity, avoiding a separate allocation and improving cache locality.

Memory Layout Considerations
The kernel places sched_entity fields strategically:
```c
struct sched_entity {
    struct rb_node run_node;    /* Offset 0: hot, accessed during insert/delete */
    u64 vruntime;               /* Offset 24: hot, compared during sort */
    u64 sum_exec_runtime;       /* Offset 32: warm, updated on tick */
    /* ... less frequently accessed fields at higher offsets */
};
```
By placing RB-tree node and vruntime at the start, the most frequently accessed fields are likely in the same cache line, reducing memory latency impact.
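The offsets are easy to verify with offsetof. This sketch uses stand-in types mirroring the layout above (illustrative, not the full kernel structures) and assumes a typical LP64 build with 8-byte pointers, where struct rb_node is 24 bytes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in mirroring the kernel's rb_node layout on LP64. */
struct rb_node {
    unsigned long __rb_parent_color;
    struct rb_node *rb_right;
    struct rb_node *rb_left;
};

/* Leading fields of sched_entity as described above. */
struct sched_entity_layout {
    struct rb_node run_node;     /* hot: insert/delete */
    uint64_t vruntime;           /* hot: comparisons */
    uint64_t sum_exec_runtime;   /* warm: tick updates */
};
```

On such a build, run_node and vruntime both land within the first 64 bytes, so a single cache-line fill covers the tree linkage and the sort key.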
CFS's choice of RB-trees exemplifies how understanding your access pattern leads to optimal data structure selection. CFS needs: (1) ultra-fast minimum finding → leftmost cache, (2) efficient insert → O(log n) RB insert, (3) direct delete via pointer → O(log n) RB erase. RB-trees provide all three with proven worst-case bounds.
Understanding RB-tree performance helps predict scheduler behavior under various loads.
Theoretical Bounds
| Operation | Time | Rotations | Notes |
|---|---|---|---|
| Find minimum | O(1) | 0 | Cached leftmost pointer |
| Insert | O(log n) | ≤2 | Traverse + rebalance |
| Delete | O(log n) | ≤3 | Find replacement + rebalance |
| Find by key | O(log n) | 0 | Standard BST search |
| Iterate all | O(n) | 0 | Inorder traversal |
Practical Performance
In real-world scheduler operation:
Task Selection (most frequent): O(1) leftmost access dominates. Even on heavily loaded servers with thousands of runnable tasks, selection is a single pointer dereference.
Enqueue/Dequeue (moderately frequent): O(log n) with small constants. For n=1000 tasks, this is about 10 comparisons plus 0-3 rotations. At CPU speeds of billions of operations per second, this completes in nanoseconds.
Tree Height: With n tasks, height ≤ 2⌈log₂(n+1)⌉. For 1000 tasks: height ≤ 20. For 1 million tasks: height ≤ 40.
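Those numbers follow directly from height ≤ 2⌈log₂(n + 1)⌉. A quick numeric check using an integer ceiling-log (an illustrative helper, not kernel code):

```c
#include <assert.h>

/* Smallest b such that 2^b >= x, i.e. ceil(log2(x)). */
static int ceil_log2(unsigned long long x)
{
    int bits = 0;
    unsigned long long p = 1;

    while (p < x) {
        p <<= 1;
        bits++;
    }
    return bits;
}

/* Worst-case red-black tree height for n internal nodes. */
static int rb_max_height(unsigned long long n)
{
    return 2 * ceil_log2(n + 1);
}
```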
Cache Considerations
Modern CPU performance is often dominated by memory access patterns rather than raw operation counts. RB-trees benefit from:

- Intrusive nodes: run_node is embedded in sched_entity, so there is no separate allocation, and the sort key (vruntime) usually sits in the same cache line as the tree linkage.
- Compact nodes: three pointer-sized words per node, with the color packed into the parent pointer.
- A hot leftmost pointer: the selection path touches one well-cached pointer rather than walking the tree.
```c
/*
 * Stylized RB-tree performance characteristics
 * Based on kernel profiling data
 */

/*
 * Typical operation timings on modern x86-64:
 *
 * rb_first_cached() (O(1) min):        ~3-5 cycles
 *   - Single pointer dereference
 *   - Almost always L1 cache hit (hot path)
 *
 * rb_insert_color() (O(log n) insert): ~200-500 cycles
 *   - Traversal: ~15 cycles per level
 *   - Rotations: ~20-30 cycles each
 *   - Dominated by pointer chasing, cache misses
 *
 * rb_erase() (O(log n) delete):        ~250-600 cycles
 *   - Finding replacement: ~15 cycles per level
 *   - Rebalancing: similar to insert
 *
 * context_switch() (for comparison):   ~3000-5000 cycles
 *   - Vastly more expensive than any RB operation
 *   - Makes scheduler overhead negligible
 */

/*
 * Scaling behavior (cycles, approximate):
 *
 * n tasks    | insert | delete | pick_first
 * ---------- | ------ | ------ | ----------
 * 10         |    100 |    120 |          5
 * 100        |    200 |    250 |          5
 * 1,000      |    350 |    400 |          5
 * 10,000     |    500 |    550 |          5
 * 100,000    |    650 |    700 |          5
 *
 * Note: pick_first is constant regardless of n!
 * This is why CFS scales to massive workloads.
 */

/*
 * Comparison with O(1) scheduler overhead:
 * - O(1) scheduler: ~50 cycles for all operations
 * - CFS: ~5 cycles for selection, ~400 for modify
 *
 * Since selection happens far more often than modification,
 * and context switches dwarf all scheduler overhead,
 * the O(log n) trade-off is negligible in practice.
 */
```

Scheduler overhead must be evaluated against the actual work being scheduled. A context switch costs thousands of cycles; scheduler overhead is a single-digit percentage of that. The O(log n) vs O(1) difference becomes meaningful only at extreme scale—and even then, the fairness benefits of CFS far outweigh the microseconds of overhead.
Red-black trees are the enabling data structure for CFS's vruntime-based scheduling. Their balanced structure, efficient operations, and kernel-optimized implementation make them ideal for the scheduler's demands.
What's Next
The next page explores real-time policies in Linux—SCHED_FIFO and SCHED_RR scheduling classes that take priority over CFS. We'll see how Linux provides hard priority guarantees when needed while maintaining CFS's fairness for normal workloads.
You now understand how red-black trees enable CFS's efficient vruntime-based scheduling. This knowledge connects algorithm theory to systems practice—showing how the right data structure choice can make a complex scheduler both fast and fair.