Linux supports a remarkable diversity of workloads: interactive desktops requiring millisecond responsiveness, batch jobs running for hours, real-time audio processing with microsecond precision, and deadline-sensitive industrial control systems. How can a single scheduler satisfy such varied requirements?
The answer is Linux's scheduling class architecture—an elegant object-oriented design that separates scheduling mechanism from policy. The core scheduler provides infrastructure (runqueues, context switching, load balancing), while scheduling classes implement specific algorithms. This modular design allows real-time, fair-share, and deadline scheduling to coexist harmoniously.
By the end of this page, you will understand the scheduling class hierarchy, how the scheduler selects which class to consult, and the characteristics of each major scheduling class: stop, deadline, real-time, fair (CFS), and idle.
Every scheduling policy implements the sched_class interface—a structure of function pointers that the core scheduler calls at key decision points. This is classic object-oriented design in C:
```c
struct sched_class {
	/* Enqueue task onto runqueue */
	void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);

	/* Dequeue task from runqueue */
	void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);

	/* Handle explicit yield (sched_yield syscall) */
	void (*yield_task)(struct rq *rq);

	/* Check if incoming task should preempt current */
	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

	/* THE critical method: select next task to run */
	struct task_struct *(*pick_next_task)(struct rq *rq);

	/* Called when switching away from a task */
	void (*put_prev_task)(struct rq *rq, struct task_struct *p);

	/* Set up a task as the next to run */
	void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);

	/* Per-tick processing (timeslice accounting, etc.) */
	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);

	/* Fork notification */
	void (*task_fork)(struct task_struct *p);

	/* Priority changed */
	void (*prio_changed)(struct rq *rq, struct task_struct *p, int oldprio);

	/* For SMP: migration and load balancing */
	int (*balance)(struct rq *rq, struct task_struct *prev, ...);
	int (*select_task_rq)(struct task_struct *p, int cpu, int flags);
	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

	/* Update current task's runtime statistics */
	void (*update_curr)(struct rq *rq);
};
```

The sched_class pattern is how Linux achieves polymorphism without C++. Each scheduling class defines its own implementations of these functions. When the scheduler needs to, say, pick the next task, it calls rq->curr->sched_class->pick_next_task(rq). Different task types get different behavior through the same interface.
Linux scheduling classes form a strict priority hierarchy. When selecting the next task to run, the scheduler consults classes in priority order until one returns a runnable task:
| Priority | Class | Policy Constants | Use Case |
|---|---|---|---|
| 1 (Highest) | stop_sched_class | Internal only | Migration, hot-unplug, stopper threads |
| 2 | dl_sched_class | SCHED_DEADLINE | Hard deadline tasks (industrial, automotive) |
| 3 | rt_sched_class | SCHED_FIFO, SCHED_RR | Real-time audio, robotics, trading |
| 4 | fair_sched_class | SCHED_OTHER, SCHED_BATCH, SCHED_IDLE | Normal processes (vast majority) |
| 5 (Lowest) | idle_sched_class | Internal only (per-CPU swapper task) | Only runs when nothing else can |

(Note: despite the name, the user-visible SCHED_IDLE policy is handled by the fair class at minimal weight; idle_sched_class runs only each CPU's idle thread.)
```c
/* Classes are linked in priority order */
/* In kernel/sched/sched.h */
#define for_each_class(class) \
	for (class = &stop_sched_class; class; class = class->next)

/* The chain: stop -> dl -> rt -> fair -> idle -> NULL */

/* pick_next_task() walks this chain */
static struct task_struct *__pick_next_task(struct rq *rq, ...)
{
	const struct sched_class *class;
	struct task_struct *p;

	/* Try each class in priority order */
	for_each_class(class) {
		p = class->pick_next_task(rq);
		if (p)
			return p; /* Found a runnable task in this class */
	}

	/* Should never reach here - idle class always returns idle task */
	BUG();
}

/*
 * Key insight: A SCHED_DEADLINE task ALWAYS preempts SCHED_OTHER.
 * A SCHED_FIFO/RR task ALWAYS preempts SCHED_OTHER.
 * Only tasks in the same class compete fairly with each other.
 */
```

The stop class is the highest priority and exists for special kernel operations that must not be interrupted. It's not available to user processes.
When the kernel needs to perform an operation that requires ALL CPUs to stop (like modifying global data structures or live-patching the kernel), stop_machine() runs stopper tasks on all CPUs simultaneously, ensuring a quiescent state. This is the nuclear option—used sparingly for truly global critical sections.
The deadline class implements Earliest Deadline First (EDF) scheduling with Constant Bandwidth Server (CBS) to provide hard real-time guarantees with bandwidth isolation.
```c
/* Deadline tasks specify three parameters (struct abridged: the full
 * sched_attr also carries size, sched_flags, sched_nice, sched_priority) */
struct sched_attr {
	__u32 sched_policy;
	__u64 sched_runtime;   /* Execution time per period (ns) */
	__u64 sched_deadline;  /* Relative deadline from period start (ns) */
	__u64 sched_period;    /* Period length (ns) */
};

/* Example: Video decoder that must decode a frame every 16.67ms */
struct sched_attr attr = {
	.sched_policy   = SCHED_DEADLINE,
	.sched_runtime  = 5 * 1000000,   /* 5ms of CPU time */
	.sched_deadline = 15 * 1000000,  /* Must complete within 15ms */
	.sched_period   = 16666666,      /* Every 16.67ms (60 FPS) */
};
sched_setattr(0, &attr, 0);

/*
 * Scheduler guarantees:
 * - Task receives 5ms of CPU time every 16.67ms period
 * - Task runs to completion before 15ms deadline
 * - If task exceeds runtime, it's throttled until next period
 *
 * EDF: Always runs the task with earliest absolute deadline.
 * CBS: Guarantees bandwidth and prevents overrun from starving others.
 */
```

SCHED_DEADLINE includes admission control—the kernel rejects new deadline tasks if adding them would exceed the CPU's bandwidth capacity. This prevents over-subscription that would make deadlines impossible to meet. Total deadline bandwidth is typically limited to ~95% to leave headroom.
The real-time class provides fixed-priority preemptive scheduling through two policies: SCHED_FIFO (run until blocked or preempted) and SCHED_RR (round-robin with time slices).
| Aspect | SCHED_FIFO | SCHED_RR |
|---|---|---|
| Time slice | Infinite (no preemption by equals) | Limited (defaults to 100ms) |
| Same-priority behavior | First-in, first-out | Round-robin rotation |
| When yields CPU | Block, exit, or higher priority arrives | Time slice expires, block, exit, or higher priority |
| Use case | Strictly ordered execution | Equal sharing among same-priority tasks |
```c
/* RT priorities range from 1-99 (higher = more important) */
struct sched_param param = { .sched_priority = 50 };
sched_setscheduler(0, SCHED_FIFO, &param);

/* RT class uses a simple array of linked lists */
struct rt_rq {
	struct rt_prio_array active;  /* 100 priority queues */
	unsigned int rt_nr_running;   /* Number of RT tasks */
	/* ... */
};

struct rt_prio_array {
	DECLARE_BITMAP(bitmap, MAX_RT_PRIO);  /* Which priorities have tasks? */
	struct list_head queue[MAX_RT_PRIO];  /* The queues themselves */
};

/* pick_next_task_rt: O(1) selection */
static struct task_struct *pick_next_task_rt(struct rq *rq)
{
	struct rt_rq *rt_rq = &rq->rt;
	struct list_head *queue;
	int idx;

	/* Find highest priority with tasks (bitmap scan) */
	idx = sched_find_first_bit(rt_rq->active.bitmap);
	if (idx >= MAX_RT_PRIO)
		return NULL; /* No RT tasks runnable */

	queue = &rt_rq->active.queue[idx];
	return list_first_entry(queue, struct task_struct, rt.run_list);
}
```

A runaway SCHED_FIFO task at priority 99 can completely lock out all normal processes, including your login shell. The system appears hung. Use RT throttling (sched_rt_runtime_us/sched_rt_period_us) to guarantee some CPU for non-RT tasks, or use SCHED_DEADLINE which has built-in bandwidth limiting.
The Completely Fair Scheduler (CFS) handles the vast majority of processes on a typical Linux system. It implements SCHED_OTHER (the default), SCHED_BATCH, and integrates with cgroups for hierarchical resource control.
```c
/* CFS tracks "virtual runtime" - weighted CPU time */

/*
 * Key insight: CFS doesn't use time slices!
 * Instead, it tracks vruntime (virtual runtime) for each task.
 *
 * vruntime = actual_runtime * (NICE_0_WEIGHT / task_weight)
 *
 * Low-priority tasks accumulate vruntime faster.
 * CFS ALWAYS picks the task with the LOWEST vruntime.
 *
 * Result: Over time, each task gets CPU proportional to its weight.
 * A nice -20 task (weight 88761) gets roughly 6000x the CPU of a
 * competing nice +19 task (weight 15)!
 */

/* The CFS runqueue: a red-black tree ordered by vruntime */
struct cfs_rq {
	struct rb_root_cached tasks_timeline; /* RB-tree of runnable tasks */
	struct sched_entity *curr;            /* Currently running entity */
	u64 min_vruntime;                     /* Smallest vruntime (for normalization) */
	unsigned int nr_running;              /* Number of runnable tasks */
	/* ... */
};

/* pick_next_task_fair: O(1) - leftmost node is cached */
static struct task_struct *pick_next_task_fair(struct rq *rq)
{
	struct cfs_rq *cfs_rq = &rq->cfs;
	struct sched_entity *se;

	/* Leftmost node in RB-tree has lowest vruntime */
	se = __pick_first_entity(cfs_rq); /* O(1) - cached! */
	return task_of(se);
}
```

The idle class is the fallback when no other work exists. Each CPU has an idle thread that runs when the runqueue is empty.
```c
/* The idle task - runs when nothing else can */
void cpu_idle_loop(void)
{
	while (1) {
		/* Check for pending work */
		while (!need_resched()) {
			/* Enter power-saving state */
			cpuidle_idle_call(); /* C-states, WFI, etc. */
		}

		/* Work is available - let scheduler pick it */
		schedule_idle();
	}
}

/*
 * pick_next_task_idle() is trivial:
 * - Return the CPU's idle task
 * - This is only called when ALL other classes returned NULL
 *
 * The idle task:
 * - Has PID 0 (swapper) on each CPU
 * - Never blocks, never exits
 * - Enters low-power CPU states to save energy
 */
```

The idle task is crucial for power efficiency. When running, it puts the CPU into progressively deeper sleep states (C-states). A busy system rarely runs the idle task; an idle laptop spends most time here, with the CPU consuming minimal power.
You now understand Linux's scheduling class architecture. Next, we'll deep-dive into CFS implementation—the scheduler that runs the vast majority of tasks on Linux systems.