In a multitasking operating system, fairness is paramount. No single process should be allowed to monopolize the CPU while others starve. But processes don't voluntarily relinquish control—they're programmed to complete their tasks, not to be considerate neighbors. This is where preemptive scheduling and the Timeout transition become essential.
The Timeout transition occurs when the operating system forcibly removes a running process from the CPU because it has consumed its allocated time quantum. The process hasn't finished its work, but its time is up. It moves from the Running state back to the Ready state, waiting in line for another turn.
This mechanism is fundamental to the responsive, fair multitasking we expect from modern operating systems. Without it, a single infinite loop could freeze your entire computer.
By the end of this page, you will understand how operating systems enforce time limits on CPU usage—from hardware timer interrupts through quantum calculation, preemption implementation, and the delicate balance between responsiveness and throughput. You'll see why time slice selection is one of the most impactful tuning parameters in system design.
Before preemptive scheduling, early operating systems relied on cooperative scheduling: processes would voluntarily yield the CPU when they had done enough work or needed to wait for something. This approach failed spectacularly whenever a process entered an infinite loop, crashed before reaching its yield point, or was simply written without regard for other processes: one misbehaving program could freeze the entire system.
Preemptive scheduling solves these problems by having the operating system take control periodically, regardless of what the running process wants.
The time quantum concept:
A time quantum (also called time slice) is the maximum time a process can run before being preempted. When a process is dispatched, the scheduler starts a timer. When the timer expires, an interrupt fires, and the kernel regains control.
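To make the mechanism concrete, here is a minimal sketch of classic round-robin quantum enforcement. All names here (QUANTUM_TICKS, on_timer_tick, need_resched_flag, the stubbed schedule) are illustrative, not real kernel symbols:

```c
#define QUANTUM_TICKS 10          // e.g., 10 ticks x 1 ms/tick = 10 ms quantum

struct task {
    int ticks_remaining;          // Slice budget left for the running task
};

static struct task *current_task; // The task now on the CPU
static int need_resched_flag;     // Set in interrupt context, acted on later

// Stub: pick the next Ready task and context-switch (omitted in this sketch)
static void schedule(void) { }

// Called from the periodic timer interrupt, once per tick
void on_timer_tick(void) {
    if (--current_task->ticks_remaining <= 0) {
        // Quantum exhausted: only *request* a reschedule here. The actual
        // context switch happens later, at a safe preemption point.
        need_resched_flag = 1;
    }
}

// Called at a safe point, e.g., just before returning to user mode
void maybe_preempt(void) {
    if (need_resched_flag) {
        need_resched_flag = 0;
        current_task->ticks_remaining = QUANTUM_TICKS; // Refill for its next turn
        schedule(); // Current task goes Running -> Ready; another is dispatched
    }
}
```

Note the split between detecting quantum expiry (in the interrupt) and acting on it (at a safe point); real kernels follow the same pattern.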
Typical time quantum values:

- Classic timesharing systems: on the order of 10-100 ms
- Windows: a few clock intervals, roughly 20-30 ms on client systems and substantially longer (100+ ms) on server configurations
- Linux CFS: no fixed quantum; slices are computed from a target latency of a few milliseconds with a sub-millisecond minimum granularity (detailed below)
The first preemptive multitasking OS for personal computers was AmigaOS (1985). Before that, MS-DOS had no multitasking, and the original Mac OS used cooperative scheduling until Mac OS X in 2001. Windows NT brought preemptive multitasking to Windows in 1993, while the consumer Windows 9x line remained cooperative through Windows ME in 2000.
Preemption relies on hardware support: a timer that periodically interrupts the CPU, forcing it to execute kernel code. This interrupt is the fundamental mechanism that allows the OS to regain control from user-mode processes.
The timer interrupt flow:
When the timer fires, a complex sequence ensures the running process is safely preempted:
```c
// Simplified timer interrupt handler (Linux-inspired)
// Real implementation in kernel/sched/core.c

void timer_interrupt_handler(struct pt_regs *regs) {
    // 1. Acknowledge the interrupt to hardware
    ack_timer_interrupt();

    // 2. Update system time
    update_wall_clock();

    // 3. Account CPU time to current process
    account_process_tick(current);

    // 4. Update scheduler timekeeping
    //    This checks if current process should be preempted
    scheduler_tick();

    // 5. Check and run expired timers (kernel timers)
    run_local_timers();

    // 6. Check for rescheduling need
    //    If set, schedule() will be called on return to user
    if (need_resched())
        set_preempt_pending();
}

void scheduler_tick(void) {
    struct task_struct *curr = current;

    // Update virtual runtime for CFS
    curr->sched_class->task_tick(rq, curr);

    // CFS checks if this process has run longer than its fair share
    // If so, it calls resched_curr() to trigger reschedule
}

// Called when returning to user space
void exit_to_user_mode(void) {
    while (true) {
        local_irq_enable();

        if (need_resched()) {
            // Preemption point: call scheduler
            schedule(); // May switch to different process
        }

        // Handle pending signals
        if (signal_pending(current)) {
            do_signal(current);
        }

        // If nothing else to do, return to user mode
        if (!need_resched() && !signal_pending(current))
            break;
    }
}
```

Modern kernels can operate 'tickless' (NO_HZ) when idle. Instead of periodic ticks, the kernel programs the timer for the next needed event. This saves power—an idle CPU can sleep deeper without being woken by unnecessary ticks. The tick is only needed when processes are actively running.
The time quantum is not a static value—it can vary based on process priority, system policy, and even runtime behavior. The choice of quantum size profoundly impacts system behavior:
| Short Quantum | Long Quantum |
|---|---|
| Better response time | Worse response time |
| More context switches (overhead) | Fewer context switches |
| Lower throughput (more overhead) | Higher throughput |
| Fairer short-term CPU distribution | Less fair in short term |
| Higher interactive responsiveness | Better for batch workloads |
| CPU cache thrashing possible | Better cache utilization |
Linux CFS approach:
Linux's Completely Fair Scheduler doesn't use fixed time quanta. Instead, it divides a target latency window among all runnable processes, weighting each process's share by its priority:

`slice = target_latency / nr_running`

Example calculation: with a 6 ms target latency and 3 equally weighted runnable processes, each receives a 2 ms slice.

If one process is nice +10 (weight ~0.1x), its slice shrinks in proportion. With two nice-0 processes (weight 1024 each) and one nice +10 process (weight 110), the nice +10 process receives about 110/2158 ≈ 5% of the window (~0.31 ms), while each nice-0 process gets ~2.85 ms.
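The arithmetic is easy to verify. Below is a small standalone program (ordinary user-space C, not kernel code) that reproduces the example; the weights follow the kernel's nice-to-weight table, where nice 0 maps to 1024 and nice +10 maps to 110:

```c
// Worked example of the proportional CFS slice formula (illustrative only)
#include <stdio.h>

int main(void) {
    double target_latency_ms = 6.0;          // Target latency window: 6 ms
    double weights[] = { 1024, 1024, 110 };  // Two nice-0 tasks, one nice +10
    int n = 3;

    double total = 0;
    for (int i = 0; i < n; i++)
        total += weights[i];                 // Total weight: 2158

    // Each task's slice is its weight share of the latency window
    for (int i = 0; i < n; i++) {
        double slice = target_latency_ms * weights[i] / total;
        printf("task %d: weight %4.0f -> slice %.2f ms\n", i, weights[i], slice);
    }
    // Output: the nice-0 tasks get ~2.85 ms each; the nice +10 task ~0.31 ms
    return 0;
}
```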
```c
// Simplified CFS time slice calculation
// Real implementation in kernel/sched/fair.c

// CFS parameters (sysctl tunable)
unsigned int sysctl_sched_latency = 6000000;         // 6ms in ns
unsigned int sysctl_sched_min_granularity = 750000;  // 0.75ms

// Nice value to weight conversion
// Nice 0 = weight 1024, Nice -20 = weight ~88761, Nice +19 = weight ~15
const int prio_to_weight[40] = {
    /* -20 */ 88761, 71755, 56483, ...,
    /*   0 */ 1024, 820, 655, 526, ...,
    /* +19 */ 15
};

// Calculate time slice for a process
u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se) {
    u64 slice;
    unsigned int nr_running = cfs_rq->nr_running;

    // Start with target latency
    slice = sysctl_sched_latency;

    // Scale down if many processes running
    if (nr_running > 1)
        slice = slice / nr_running;

    // Enforce minimum granularity
    if (slice < sysctl_sched_min_granularity)
        slice = sysctl_sched_min_granularity;

    // Scale by this process's weight vs average
    slice = calc_delta_fair(slice, se->load.weight);

    return max(slice, (u64)sysctl_sched_min_granularity);
}

// Check if process should be preempted
void check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) {
    u64 ideal_runtime = sched_slice(cfs_rq, curr);
    u64 actual_runtime = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;

    // If running longer than its fair share, preempt
    if (actual_runtime > ideal_runtime) {
        resched_curr(rq); // Mark for preemption
    }
}
```

On Linux, you can adjust CFS parameters via /proc/sys/kernel/sched_* files. Desktop-oriented distributions often use shorter latencies (4ms) for responsiveness. Server distributions may prefer longer latencies (10ms+) for throughput. The kernel also auto-adjusts: with many runnable processes, the minimum granularity limits how short slices become.
A timer interrupt can occur at any moment, but the kernel can't always immediately perform a context switch. Preemption must occur at safe points where kernel data structures are in consistent states. Understanding preemption points is crucial for understanding system responsiveness.
The preempt_count mechanism:
Linux tracks when preemption is allowed using a per-task preempt_count variable:
```c
// Preemption control in Linux kernel

// preempt_count structure:
// Bits 0-7:   Preemption disable count (incremented by preempt_disable())
// Bits 8-15:  Softirq count (in softirq handler)
// Bits 16-23: Hardirq count (in interrupt handler)
// Bit 24:     NMI bit

// Preemption is only allowed when preempt_count == 0

void preempt_disable(void) {
    current->preempt_count++;
    barrier(); // Compiler barrier - ensure count updated before critical section
}

void preempt_enable(void) {
    barrier();
    if (--current->preempt_count == 0 && need_resched()) {
        // Count hit zero and preemption needed - reschedule now
        preempt_schedule();
    }
}

// Used in spinlock acquisition
void spin_lock(spinlock_t *lock) {
    preempt_disable(); // Can't preempt while holding spinlock
    _raw_spin_lock(lock);
}

void spin_unlock(spinlock_t *lock) {
    _raw_spin_unlock(lock);
    preempt_enable(); // May trigger preemption here
}

// Voluntary preemption point for long loops
void cond_resched(void) {
    if (need_resched()) {
        // Explicitly yield - allow other processes to run
        schedule();
    }
}

// Check if preemption is currently allowed
bool preemptible(void) {
    return current->preempt_count == 0 && !irqs_disabled();
}
```

| Model | CONFIG Option | Preemption Allowed | Use Case |
|---|---|---|---|
| No Forced Preemption | PREEMPT_NONE | Only on syscall return | Throughput-focused servers |
| Voluntary | PREEMPT_VOLUNTARY | Syscall return + cond_resched() | General servers |
| Preemptible | PREEMPT | Almost anytime (outside spinlocks) | Desktop, embedded |
| Full Real-Time | PREEMPT_RT | Nearly everywhere (spinlocks sleep) | Hard real-time |
If a low-priority process holds a spinlock and gets preempted by a medium-priority process, a high-priority process waiting for that lock is blocked—by a lower priority process! This is priority inversion. PREEMPT_RT addresses this by making spinlocks sleepable with priority inheritance, but standard kernels must be careful about preemption in critical sections.
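A minimal sketch of the priority-inheritance remedy mentioned above, assuming illustrative types and helpers (struct pi_mutex, pi_block_on, pi_unlock are not kernel APIs):

```c
#include <stddef.h>

struct task {
    int prio;       // Lower value = higher priority (kernel convention)
    int base_prio;  // Priority before any inheritance boost
};

struct pi_mutex {
    struct task *owner;
};

// A high-priority task blocks on a lock held by a lower-priority task:
// boost the owner so a medium-priority task can no longer preempt it.
void pi_block_on(struct pi_mutex *m, struct task *waiter) {
    if (m->owner && waiter->prio < m->owner->prio)
        m->owner->prio = waiter->prio; // Inherit the waiter's priority
    /* ... enqueue waiter and sleep until the lock is released ... */
}

// On release, the owner drops back to its base priority.
void pi_unlock(struct pi_mutex *m) {
    m->owner->prio = m->owner->base_prio;
    m->owner = NULL;
    /* ... wake the highest-priority waiter ... */
}
```

With the boost in place, the medium-priority process can no longer starve the lock holder, so the high-priority waiter is delayed only as long as the critical section itself.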
Let's trace through a complete Timeout transition, from the moment a process's time slice expires to when it's back in the Ready queue:

1. The hardware timer fires and the CPU vectors into the kernel's interrupt handler.
2. scheduler_tick() accounts the elapsed time and detects that the process has exceeded its slice.
3. The kernel sets the need_resched flag and the interrupt handler returns.
4. At the next safe preemption point (for example, on the path back to user mode), schedule() runs.
5. The preempted process's context is saved, its state changes from Running to Ready, and it is placed back on the run queue.
6. The scheduler dispatches the next process, which resumes execution on the CPU.
State preservation during timeout:
When a process is preempted (not voluntarily yielding), it may be in the middle of any operation. The kernel must save:

- The program counter (instruction pointer), so execution resumes at the exact interrupted instruction
- All general-purpose registers and the stack pointer
- Processor status flags (condition codes)
- Floating-point/SIMD state (often saved lazily)
- The memory-management context, i.e., which address space to restore
The process is completely unaware it was suspended—when rescheduled, it continues exactly where it left off.
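A conceptual sketch of that saved state is below. The field names are simplified and hypothetical; the real layout is architecture-specific (on Linux/x86-64, see struct pt_regs and thread_struct), and real context switches are written in assembly, not C:

```c
// Illustrative snapshot of a preempted task's CPU state
struct cpu_context {
    unsigned long ip;       // Instruction pointer: where to resume
    unsigned long sp;       // Stack pointer
    unsigned long gpr[16];  // General-purpose registers
    unsigned long flags;    // Condition codes / status flags
    unsigned char fpu[512]; // FPU/SIMD state (often saved lazily)
};

struct task {
    struct cpu_context ctx; // Filled in when the task is switched out
    void *mm;               // Address-space handle (page tables to reload)
};

// Stubs standing in for arch-specific assembly
static void save_cpu_context(struct cpu_context *c)    { (void)c; }
static void switch_address_space(void *mm)             { (void)mm; }
static void restore_cpu_context(struct cpu_context *c) { (void)c; }

// Conceptual context switch: save 'prev', restore 'next', and resume
// 'next' exactly where it previously stopped.
void context_switch(struct task *prev, struct task *next) {
    save_cpu_context(&prev->ctx);
    switch_address_space(next->mm);
    restore_cpu_context(&next->ctx); // Control continues in 'next'
}
```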
From the process's perspective, preemption is invisible. It doesn't receive a signal, can't detect that time passed, and has no opportunity to react. This is by design—applications shouldn't need to know or care about scheduling. The only observable effect is that clock time progresses faster than CPU time.
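One way to observe that effect from user space is to compare wall-clock time against CPU time over the same stretch of work; the gap is time the process spent preempted or otherwise off the CPU. A small self-contained demo using standard POSIX clocks on Linux:

```c
#include <stdio.h>
#include <time.h>

static double to_sec(struct timespec t) {
    return t.tv_sec + t.tv_nsec / 1e9;
}

int main(void) {
    struct timespec wall0, wall1, cpu0, cpu1;
    clock_gettime(CLOCK_MONOTONIC, &wall0);           // Wall-clock time
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);   // CPU time consumed

    // Busy work: on a contended CPU, this takes more wall time than CPU time
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 500000000UL; i++)
        x += i;

    clock_gettime(CLOCK_MONOTONIC, &wall1);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);

    printf("wall: %.3f s, cpu: %.3f s\n",
           to_sec(wall1) - to_sec(wall0), to_sec(cpu1) - to_sec(cpu0));
    // The difference (wall - cpu) is time spent in the Ready queue or asleep
    return 0;
}
```

Run it alone and the two numbers nearly match; run it alongside a CPU-heavy workload and wall time pulls ahead, which is preemption made visible.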
Besides time-based preemption (timeout), preemption can also occur when a higher-priority process becomes ready. This is essential for real-time responsiveness—a high-priority task shouldn't wait for a lower-priority task's time slice to expire.
Scenarios triggering priority preemption:
I/O Completion: A high-priority process was waiting for disk/network and just received its data. It's now Ready and may preempt the current Running process.
Timer Expiration: A process was sleeping with a timeout (e.g., sleep(1)) and the timer fired. If it's higher priority than current, it preempts.
Signal Delivery: A signal wakes up a blocked high-priority process.
Lock Release: A high-priority process was blocked on a mutex. When released, it can preempt.
New Process Creation: A fork() creates a child that may have higher priority (though usually inherits parent's priority).
```c
// Priority preemption in Linux kernel

// Called when a process becomes runnable (wakes up)
void try_to_wake_up(struct task_struct *p, unsigned int state) {
    // Mark process as runnable
    p->state = TASK_RUNNING;

    // Add to run queue
    activate_task(rq, p);

    // Check if should preempt current process
    check_preempt_curr(rq, p);
}

// Check if new task should preempt current
void check_preempt_curr(struct rq *rq, struct task_struct *p) {
    // Get current running task
    struct task_struct *curr = rq->curr;

    // Let the scheduling class decide
    // For RT: higher priority always preempts
    // For CFS: lower vruntime may preempt
    curr->sched_class->check_preempt_curr(rq, p);
}

// Real-time class: strict priority preemption
void check_preempt_curr_rt(struct rq *rq, struct task_struct *p) {
    // If new task has higher RT priority, preempt immediately
    if (p->prio < rq->curr->prio) {
        resched_curr(rq); // Set need_resched
    }
}

// CFS class: fairness-based preemption
void check_preempt_curr_fair(struct rq *rq, struct task_struct *p) {
    // Check if new task has significantly lower vruntime
    // If so, it deserves to run (current has had too much CPU)
    if (entity_before(&p->se, &rq->curr->se)) {
        resched_curr(rq); // Set need_resched
    }
}

// The need_resched flag will cause schedule() to be called
// at the next preemption point (return to userspace, etc.)
```

| System Type | Acceptable Latency | Preemption Strategy |
|---|---|---|
| Hard Real-Time | < 100 μs | Immediate, fully preemptible kernel |
| Soft Real-Time | < 1 ms | Nearly full preemption |
| Desktop/Interactive | < 10 ms | Standard preemptive kernel |
| Server/Throughput | < 100 ms | Voluntary preemption or none |
| Batch Processing | Seconds acceptable | Minimal preemption |
Even with PREEMPT_RT, certain kernel paths can delay preemption: long-held spinlocks, interrupt handlers that run too long, kernel code that disables interrupts. Achieving hard real-time guarantees requires careful auditing of all kernel code paths, which is why only a subset of Linux (with RT patches) is suitable for hard real-time.
The kernel tracks how much CPU time each process consumes. This accounting serves multiple purposes: enforcing time limits, scheduling decisions, resource usage monitoring, and billing (in cloud environments).
```c
// CPU time accounting during timer tick
// Simplified from Linux kernel/sched/cputime.c

void account_process_tick(struct task_struct *p, int user_tick) {
    u64 cputime = tick_to_cputime(1);

    if (user_tick) {
        // Timer fired while in user mode
        account_user_time(p, cputime);
        p->utime += cputime; // User time
    } else if (in_kernel()) {
        // Timer fired while in kernel mode
        account_system_time(p, cputime);
        p->stime += cputime; // System time
    } else if (in_interrupt()) {
        // Timer fired while handling interrupt
        account_irq_time();
    }

    // Also update scheduling statistics
    update_curr_stats(p, cputime);
}

// High-resolution accounting for CFS
void update_curr(struct cfs_rq *cfs_rq) {
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_clock_task(rq); // High-res clock
    u64 delta_exec = now - curr->exec_start;

    if (unlikely(delta_exec <= 0))
        return;

    curr->exec_start = now;
    curr->sum_exec_runtime += delta_exec;

    // Update virtual runtime (weighted by priority)
    curr->vruntime += calc_delta_fair(delta_exec, curr);

    // Update minimum vruntime of runqueue
    update_min_vruntime(cfs_rq);
}
```

Viewing time accounting:
You can observe process time accounting with various tools:
```bash
# View time for a specific process
cat /proc/<pid>/stat
# ... utime stime ...

# Nice format with ps
ps -o pid,user,%cpu,time,etime,comm -p <pid>

# Real-time monitoring with top
top -p <pid>
# Shows %CPU (current), TIME (cumulative)

# Detailed per-process stats
cat /proc/<pid>/schedstat
# run_time wait_time num_slices

# System-wide CPU accounting
cat /proc/stat
# cpu user nice system idle iowait irq softirq steal guest

# Time command for completed process
time ./my_program
# real 0m3.456s  # Wall clock time
# user 0m2.100s  # User CPU time
# sys  0m0.340s  # System CPU time
```

Traditional tick-based accounting has 1/HZ resolution (typically 1ms). Modern kernels use high-resolution clocks for CFS virtual runtime calculations, providing nanosecond precision. However, the user-visible utime/stime may still use coarser granularity. Be aware that very short-lived processes may have misleading time statistics.
Understanding timeout behavior helps diagnose real-world performance issues. Let's examine practical scenarios where time slice management directly impacts system behavior.
Desktop Responsiveness Issues:
Symptom: Video playback stutters when compiling code in background.
Cause: Video player is a normal-priority process. Compiler spawns many threads, all consuming their time slices. CFS tries to be fair, giving each runnable thread roughly equal time, but with 8 compiler threads vs 1 video player thread, the compiler collectively receives about 8/9 of the CPU. The player's slices shrink and arrive late, so frame deadlines are missed and playback stutters.
Solution:
- `nice +19 make -j8` — Lower compiler priority
- `chrt -i 0 make -j8` — Use idle scheduling class
```bash
# Diagnose scheduling issues

# View current scheduler settings
cat /proc/sys/kernel/sched_latency_ns
cat /proc/sys/kernel/sched_min_granularity_ns

# Monitor context switches
vmstat 1 | awk '{print "Context switches:", $12}'

# Per-process scheduling stats
cat /proc/<pid>/sched
# Shows: nr_switches, wait_sum, wait_max, etc.

# Watch for scheduling delays
perf sched record -a sleep 10
perf sched latency
# Shows scheduling latencies by task

# Real-time process scheduling
chrt -f -p 99 <pid>  # Set SCHED_FIFO priority 99
chrt -p <pid>        # View current policy

# Trace scheduling events
echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
cat /sys/kernel/debug/tracing/trace_pipe
# Shows every context switch with timing
```

There's no universally optimal time slice. Desktop systems favor short slices for responsiveness. Servers favor longer slices for throughput. Real-time systems need deterministic latency. Container orchestrators like Kubernetes specify CPU requests/limits that the kernel enforces via cgroup bandwidth control. Understand your workload to tune appropriately.
We've thoroughly explored the Timeout transition—the mechanism that prevents CPU monopolization and ensures fair time sharing among processes.
Looking ahead:
So far, we've seen processes transition from New to Ready (Admit), Ready to Running (Dispatch), and Running back to Ready (Timeout). But what happens when a running process can't proceed because it needs something—data from disk, input from a user, or a lock held by another process? This brings us to the Block transition (Running → Waiting), the subject of our next page.
You now understand the Timeout transition (Running → Ready) comprehensively—from timer interrupt mechanics through time quantum calculation, preemption points, priority preemption, and time accounting. Next, we'll examine the Block transition that moves processes from Running to Waiting.