In a multitasking operating system, fairness is paramount. No single process should be allowed to monopolize the CPU while others starve. But processes don't voluntarily relinquish control—they're programmed to complete their tasks, not to be considerate neighbors. This is where preemptive scheduling and the Timeout transition become essential.
The Timeout transition occurs when the operating system forcibly removes a running process from the CPU because it has consumed its allocated time quantum. The process hasn't finished its work, but its time is up. It moves from the Running state back to the Ready state, waiting in line for another turn.
This mechanism is fundamental to the responsive, fair multitasking we expect from modern operating systems. Without it, a single infinite loop could freeze your entire computer.
By the end of this page, you will understand how operating systems enforce time limits on CPU usage—from hardware timer interrupts through quantum calculation, preemption implementation, and the delicate balance between responsiveness and throughput. You'll see why time slice selection is one of the most impactful tuning parameters in system design.
Before preemptive scheduling, early operating systems relied on cooperative scheduling: processes would voluntarily yield the CPU when they had done enough work or needed to wait for something. This approach failed spectacularly whenever a process entered an infinite loop, crashed before reaching its yield point, or was simply written without regard for other processes: one misbehaving program could freeze the entire system.
Preemptive scheduling solves these problems by having the operating system take control periodically, regardless of what the running process wants.
The time quantum concept:
A time quantum (also called time slice) is the maximum time a process can run before being preempted. When a process is dispatched, the scheduler starts a timer. When the timer expires, an interrupt fires, and the kernel regains control.
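To make the mechanism concrete, here is a minimal sketch of classic round-robin quantum enforcement. All names here (QUANTUM_TICKS, on_timer_tick, need_resched_flag, the stubbed schedule) are illustrative, not real kernel symbols:

```c
#define QUANTUM_TICKS 10          // e.g., 10 ticks x 1 ms/tick = 10 ms quantum

struct task {
    int ticks_remaining;          // Slice budget left for the running task
};

static struct task *current_task; // The task now on the CPU
static int need_resched_flag;     // Set in interrupt context, acted on later

// Stub: pick the next Ready task and context-switch (omitted in this sketch)
static void schedule(void) { }

// Called from the periodic timer interrupt, once per tick
void on_timer_tick(void) {
    if (--current_task->ticks_remaining <= 0) {
        // Quantum exhausted: only *request* a reschedule here. The actual
        // context switch happens later, at a safe preemption point.
        need_resched_flag = 1;
    }
}

// Called at a safe point, e.g., just before returning to user mode
void maybe_preempt(void) {
    if (need_resched_flag) {
        need_resched_flag = 0;
        current_task->ticks_remaining = QUANTUM_TICKS; // Refill for its next turn
        schedule(); // Current task goes Running -> Ready; another is dispatched
    }
}
```

Note the split between detecting quantum expiry (in the interrupt) and acting on it (at a safe point); real kernels follow the same pattern.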
Typical time quantum values:

- Classic timesharing systems: on the order of 10-100 ms
- Windows: a few clock intervals, roughly 20-30 ms on client systems and substantially longer (100+ ms) on server configurations
- Linux CFS: no fixed quantum; slices are computed from a target latency of a few milliseconds with a sub-millisecond minimum granularity (detailed below)
The first preemptive multitasking OS for personal computers was AmigaOS (1985). Before that, MS-DOS had no multitasking, and the original Mac OS used cooperative scheduling until Mac OS X in 2001. Windows NT brought preemptive multitasking to Windows in 1993, while the consumer Windows 9x line remained cooperative through Windows ME in 2000.
Preemption relies on hardware support: a timer that periodically interrupts the CPU, forcing it to execute kernel code. This interrupt is the fundamental mechanism that allows the OS to regain control from user-mode processes.
The timer interrupt flow:
When the timer fires, a complex sequence ensures the running process is safely preempted:
```c
// Simplified timer interrupt handler (Linux-inspired)
// Real implementation in kernel/sched/core.c

void timer_interrupt_handler(struct pt_regs *regs) {
    // 1. Acknowledge the interrupt to hardware
    ack_timer_interrupt();

    // 2. Update system time
    update_wall_clock();

    // 3. Account CPU time to current process
    account_process_tick(current);

    // 4. Update scheduler timekeeping
    //    This checks if current process should be preempted
    scheduler_tick();

    // 5. Check and run expired timers (kernel timers)
    run_local_timers();

    // 6. Check for rescheduling need
    //    If set, schedule() will be called on return to user
    if (need_resched())
        set_preempt_pending();
}

void scheduler_tick(void) {
    struct task_struct *curr = current;

    // Update virtual runtime for CFS
    curr->sched_class->task_tick(rq, curr);

    // CFS checks if this process has run longer than its fair share
    // If so, it calls resched_curr() to trigger reschedule
}

// Called when returning to user space
void exit_to_user_mode(void) {
    while (true) {
        local_irq_enable();

        if (need_resched()) {
            // Preemption point: call scheduler
            schedule(); // May switch to different process
        }

        // Handle pending signals
        if (signal_pending(current)) {
            do_signal(current);
        }

        // If nothing else to do, return to user mode
        if (!need_resched() && !signal_pending(current))
            break;
    }
}
```

Modern kernels can operate 'tickless' (NO_HZ) when idle. Instead of periodic ticks, the kernel programs the timer for the next needed event. This saves power—an idle CPU can sleep deeper without being woken by unnecessary ticks. The tick is only needed when processes are actively running.
The time quantum is not a static value—it can vary based on process priority, system policy, and even runtime behavior. The choice of quantum size profoundly impacts system behavior:
| Short Quantum | Long Quantum |
|---|---|
| Better response time | Worse response time |
| More context switches (overhead) | Fewer context switches |
| Lower throughput (more overhead) | Higher throughput |
| Fairer short-term CPU distribution | Less fair in short term |
| Higher interactive responsiveness | Better for batch workloads |
| CPU cache thrashing possible | Better cache utilization |
Linux CFS approach:
Linux's Completely Fair Scheduler doesn't use fixed time quanta. Instead, it divides a target latency window among all runnable processes, weighting each process's share by its priority:

`slice = target_latency / nr_running`

Example calculation: with a 6 ms target latency and 3 equally weighted runnable processes, each receives a 2 ms slice.

If one process is nice +10 (weight ~0.1x), its slice shrinks in proportion. With two nice-0 processes (weight 1024 each) and one nice +10 process (weight 110), the nice +10 process receives about 110/2158 ≈ 5% of the window (~0.31 ms), while each nice-0 process gets ~2.85 ms.
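The arithmetic is easy to verify. Below is a small standalone program (ordinary user-space C, not kernel code) that reproduces the example; the weights follow the kernel's nice-to-weight table, where nice 0 maps to 1024 and nice +10 maps to 110:

```c
// Worked example of the proportional CFS slice formula (illustrative only)
#include <stdio.h>

int main(void) {
    double target_latency_ms = 6.0;          // Target latency window: 6 ms
    double weights[] = { 1024, 1024, 110 };  // Two nice-0 tasks, one nice +10
    int n = 3;

    double total = 0;
    for (int i = 0; i < n; i++)
        total += weights[i];                 // Total weight: 2158

    // Each task's slice is its weight share of the latency window
    for (int i = 0; i < n; i++) {
        double slice = target_latency_ms * weights[i] / total;
        printf("task %d: weight %4.0f -> slice %.2f ms\n", i, weights[i], slice);
    }
    // Output: the nice-0 tasks get ~2.85 ms each; the nice +10 task ~0.31 ms
    return 0;
}
```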
```c
// Simplified CFS time slice calculation
// Real implementation in kernel/sched/fair.c

// CFS parameters (sysctl tunable)
unsigned int sysctl_sched_latency = 6000000;         // 6ms in ns
unsigned int sysctl_sched_min_granularity = 750000;  // 0.75ms

// Nice value to weight conversion
// Nice 0 = weight 1024, Nice -20 = weight ~88761, Nice +19 = weight ~15
const int prio_to_weight[40] = {
    /* -20 */ 88761, 71755, 56483, ...,
    /*   0 */ 1024, 820, 655, 526, ...,
    /* +19 */ 15
};

// Calculate time slice for a process
u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se) {
    u64 slice;
    unsigned int nr_running = cfs_rq->nr_running;

    // Start with target latency
    slice = sysctl_sched_latency;

    // Scale down if many processes running
    if (nr_running > 1)
        slice = slice / nr_running;

    // Enforce minimum granularity
    if (slice < sysctl_sched_min_granularity)
        slice = sysctl_sched_min_granularity;

    // Scale by this process's weight vs average
    slice = calc_delta_fair(slice, se->load.weight);

    return max(slice, (u64)sysctl_sched_min_granularity);
}

// Check if process should be preempted
void check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr) {
    u64 ideal_runtime = sched_slice(cfs_rq, curr);
    u64 actual_runtime = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;

    // If running longer than its fair share, preempt
    if (actual_runtime > ideal_runtime) {
        resched_curr(rq); // Mark for preemption
    }
}
```

On Linux, you can adjust CFS parameters via /proc/sys/kernel/sched_* files. Desktop-oriented distributions often use shorter latencies (4ms) for responsiveness. Server distributions may prefer longer latencies (10ms+) for throughput. The kernel also auto-adjusts: with many runnable processes, the minimum granularity limits how short slices become.
A timer interrupt can occur at any moment, but the kernel can't always immediately perform a context switch. Preemption must occur at safe points where kernel data structures are in consistent states. Understanding preemption points is crucial for understanding system responsiveness.
The preempt_count mechanism:
Linux tracks when preemption is allowed using a per-task preempt_count variable:
```c
// Preemption control in Linux kernel

// preempt_count structure:
// Bits 0-7:   Preemption disable count (incremented by preempt_disable())
// Bits 8-15:  Softirq count (in softirq handler)
// Bits 16-23: Hardirq count (in interrupt handler)
// Bit 24:     NMI bit

// Preemption is only allowed when preempt_count == 0

void preempt_disable(void) {
    current->preempt_count++;
    barrier(); // Compiler barrier - ensure count updated before critical section
}

void preempt_enable(void) {
    barrier();
    if (--current->preempt_count == 0 && need_resched()) {
        // Count hit zero and preemption needed - reschedule now
        preempt_schedule();
    }
}

// Used in spinlock acquisition
void spin_lock(spinlock_t *lock) {
    preempt_disable(); // Can't preempt while holding spinlock
    _raw_spin_lock(lock);
}

void spin_unlock(spinlock_t *lock) {
    _raw_spin_unlock(lock);
    preempt_enable(); // May trigger preemption here
}

// Voluntary preemption point for long loops
void cond_resched(void) {
    if (need_resched()) {
        // Explicitly yield - allow other processes to run
        schedule();
    }
}

// Check if preemption is currently allowed
bool preemptible(void) {
    return current->preempt_count == 0 && !irqs_disabled();
}
```

| Model | CONFIG Option | Preemption Allowed | Use Case |
|---|---|---|---|
| No Forced Preemption | PREEMPT_NONE | Only on syscall return | Throughput-focused servers |
| Voluntary | PREEMPT_VOLUNTARY | Syscall return + cond_resched() | General servers |
| Preemptible | PREEMPT | Almost anytime (outside spinlocks) | Desktop, embedded |
| Full Real-Time | PREEMPT_RT | Nearly everywhere (spinlocks sleep) | Hard real-time |
If a low-priority process holds a spinlock and gets preempted by a medium-priority process, a high-priority process waiting for that lock is blocked—by a lower priority process! This is priority inversion. PREEMPT_RT addresses this by making spinlocks sleepable with priority inheritance, but standard kernels must be careful about preemption in critical sections.
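A minimal sketch of the priority-inheritance remedy mentioned above, assuming illustrative types and helpers (struct pi_mutex, pi_block_on, pi_unlock are not kernel APIs):

```c
#include <stddef.h>

struct task {
    int prio;       // Lower value = higher priority (kernel convention)
    int base_prio;  // Priority before any inheritance boost
};

struct pi_mutex {
    struct task *owner;
};

// A high-priority task blocks on a lock held by a lower-priority task:
// boost the owner so a medium-priority task can no longer preempt it.
void pi_block_on(struct pi_mutex *m, struct task *waiter) {
    if (m->owner && waiter->prio < m->owner->prio)
        m->owner->prio = waiter->prio; // Inherit the waiter's priority
    /* ... enqueue waiter and sleep until the lock is released ... */
}

// On release, the owner drops back to its base priority.
void pi_unlock(struct pi_mutex *m) {
    m->owner->prio = m->owner->base_prio;
    m->owner = NULL;
    /* ... wake the highest-priority waiter ... */
}
```

With the boost in place, the medium-priority process can no longer starve the lock holder, so the high-priority waiter is delayed only as long as the critical section itself.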
Let's trace through a complete Timeout transition, from the moment a process's time slice expires to when it's back in the Ready queue:

1. The hardware timer fires and the CPU vectors into the kernel's interrupt handler.
2. scheduler_tick() accounts the elapsed time and detects that the process has exceeded its slice.
3. The kernel sets the need_resched flag and the interrupt handler returns.
4. At the next safe preemption point (for example, on the path back to user mode), schedule() runs.
5. The preempted process's context is saved, its state changes from Running to Ready, and it is placed back on the run queue.
6. The scheduler dispatches the next process, which resumes execution on the CPU.
State preservation during timeout:
When a process is preempted (not voluntarily yielding), it may be in the middle of any operation. The kernel must save:

- The program counter (instruction pointer), so execution resumes at the exact interrupted instruction
- All general-purpose registers and the stack pointer
- Processor status flags (condition codes)
- Floating-point/SIMD state (often saved lazily)
- The memory-management context, i.e., which address space to restore
The process is completely unaware it was suspended—when rescheduled, it continues exactly where it left off.
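A conceptual sketch of that saved state is below. The field names are simplified and hypothetical; the real layout is architecture-specific (on Linux/x86-64, see struct pt_regs and thread_struct), and real context switches are written in assembly, not C:

```c
// Illustrative snapshot of a preempted task's CPU state
struct cpu_context {
    unsigned long ip;       // Instruction pointer: where to resume
    unsigned long sp;       // Stack pointer
    unsigned long gpr[16];  // General-purpose registers
    unsigned long flags;    // Condition codes / status flags
    unsigned char fpu[512]; // FPU/SIMD state (often saved lazily)
};

struct task {
    struct cpu_context ctx; // Filled in when the task is switched out
    void *mm;               // Address-space handle (page tables to reload)
};

// Stubs standing in for arch-specific assembly
static void save_cpu_context(struct cpu_context *c)    { (void)c; }
static void switch_address_space(void *mm)             { (void)mm; }
static void restore_cpu_context(struct cpu_context *c) { (void)c; }

// Conceptual context switch: save 'prev', restore 'next', and resume
// 'next' exactly where it previously stopped.
void context_switch(struct task *prev, struct task *next) {
    save_cpu_context(&prev->ctx);
    switch_address_space(next->mm);
    restore_cpu_context(&next->ctx); // Control continues in 'next'
}
```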
From the process's perspective, preemption is invisible. It doesn't receive a signal, can't detect that time passed, and has no opportunity to react. This is by design—applications shouldn't need to know or care about scheduling. The only observable effect is that clock time progresses faster than CPU time.
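One way to observe that effect from user space is to compare wall-clock time against CPU time over the same stretch of work; the gap is time the process spent preempted or otherwise off the CPU. A small self-contained demo using standard POSIX clocks on Linux:

```c
#include <stdio.h>
#include <time.h>

static double to_sec(struct timespec t) {
    return t.tv_sec + t.tv_nsec / 1e9;
}

int main(void) {
    struct timespec wall0, wall1, cpu0, cpu1;
    clock_gettime(CLOCK_MONOTONIC, &wall0);           // Wall-clock time
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu0);   // CPU time consumed

    // Busy work: on a contended CPU, this takes more wall time than CPU time
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 500000000UL; i++)
        x += i;

    clock_gettime(CLOCK_MONOTONIC, &wall1);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu1);

    printf("wall: %.3f s, cpu: %.3f s\n",
           to_sec(wall1) - to_sec(wall0), to_sec(cpu1) - to_sec(cpu0));
    // The difference (wall - cpu) is time spent in the Ready queue or asleep
    return 0;
}
```

Run it alone and the two numbers nearly match; run it alongside a CPU-heavy workload and wall time pulls ahead, which is preemption made visible.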
Besides time-based preemption (timeout), preemption can also occur when a higher-priority process becomes ready. This is essential for real-time responsiveness—a high-priority task shouldn't wait for a lower-priority task's time slice to expire.
Scenarios triggering priority preemption:
I/O Completion: A high-priority process was waiting for disk/network and just received its data. It's now Ready and may preempt the current Running process.
Timer Expiration: A process was sleeping with a timeout (e.g., sleep(1)) and the timer fired. If it's higher priority than current, it preempts.
Signal Delivery: A signal wakes up a blocked high-priority process.
Lock Release: A high-priority process was blocked on a mutex. When released, it can preempt.
New Process Creation: A fork() creates a child that may have higher priority (though usually inherits parent's priority).
```c
// Priority preemption in Linux kernel

// Called when a process becomes runnable (wakes up)
void try_to_wake_up(struct task_struct *p, unsigned int state) {
    // Mark process as runnable
    p->state = TASK_RUNNING;

    // Add to run queue
    activate_task(rq, p);

    // Check if should preempt current process
    check_preempt_curr(rq, p);
}

// Check if new task should preempt current
void check_preempt_curr(struct rq *rq, struct task_struct *p) {
    // Get current running task
    struct task_struct *curr = rq->curr;

    // Let the scheduling class decide
    // For RT: higher priority always preempts
    // For CFS: lower vruntime may preempt
    curr->sched_class->check_preempt_curr(rq, p);
}

// Real-time class: strict priority preemption
void check_preempt_curr_rt(struct rq *rq, struct task_struct *p) {
    // If new task has higher RT priority, preempt immediately
    if (p->prio < rq->curr->prio) {
        resched_curr(rq); // Set need_resched
    }
}

// CFS class: fairness-based preemption
void check_preempt_curr_fair(struct rq *rq, struct task_struct *p) {
    // Check if new task has significantly lower vruntime
    // If so, it deserves to run (current has had too much CPU)
    if (entity_before(&p->se, &rq->curr->se)) {
        resched_curr(rq); // Set need_resched
    }
}

// The need_resched flag will cause schedule() to be called
// at the next preemption point (return to userspace, etc.)
```

| System Type | Acceptable Latency | Preemption Strategy |
|---|---|---|
| Hard Real-Time | < 100 μs | Immediate, fully preemptible kernel |
| Soft Real-Time | < 1 ms | Nearly full preemption |
| Desktop/Interactive | < 10 ms | Standard preemptive kernel |
| Server/Throughput | < 100 ms | Voluntary preemption or none |
| Batch Processing | Seconds acceptable | Minimal preemption |
Even with PREEMPT_RT, certain kernel paths can delay preemption: long-held spinlocks, interrupt handlers that run too long, kernel code that disables interrupts. Achieving hard real-time guarantees requires careful auditing of all kernel code paths, which is why only a subset of Linux (with RT patches) is suitable for hard real-time.
The kernel tracks how much CPU time each process consumes. This accounting serves multiple purposes: enforcing time limits, scheduling decisions, resource usage monitoring, and billing (in cloud environments).
```c
// CPU time accounting during timer tick
// Simplified from Linux kernel/sched/cputime.c

void account_process_tick(struct task_struct *p, int user_tick) {
    u64 cputime = tick_to_cputime(1);

    if (user_tick) {
        // Timer fired while in user mode
        account_user_time(p, cputime);
        p->utime += cputime; // User time
    } else if (in_kernel()) {
        // Timer fired while in kernel mode
        account_system_time(p, cputime);
        p->stime += cputime; // System time
    } else if (in_interrupt()) {
        // Timer fired while handling interrupt
        account_irq_time();
    }

    // Also update scheduling statistics
    update_curr_stats(p, cputime);
}

// High-resolution accounting for CFS
void update_curr(struct cfs_rq *cfs_rq) {
    struct sched_entity *curr = cfs_rq->curr;
    u64 now = rq_clock_task(rq); // High-res clock
    u64 delta_exec = now - curr->exec_start;

    if (unlikely(delta_exec <= 0))
        return;

    curr->exec_start = now;
    curr->sum_exec_runtime += delta_exec;

    // Update virtual runtime (weighted by priority)
    curr->vruntime += calc_delta_fair(delta_exec, curr);

    // Update minimum vruntime of runqueue
    update_min_vruntime(cfs_rq);
}
```

Viewing time accounting:
You can observe process time accounting with various tools:
```bash
# View time for a specific process
cat /proc/<pid>/stat
# ... utime stime ...

# Nice format with ps
ps -o pid,user,%cpu,time,etime,comm -p <pid>

# Real-time monitoring with top
top -p <pid>
# Shows %CPU (current), TIME (cumulative)

# Detailed per-process stats
cat /proc/<pid>/schedstat
# run_time wait_time num_slices

# System-wide CPU accounting
cat /proc/stat
# cpu user nice system idle iowait irq softirq steal guest

# Time command for completed process
time ./my_program
# real 0m3.456s  # Wall clock time
# user 0m2.100s  # User CPU time
# sys  0m0.340s  # System CPU time
```

Traditional tick-based accounting has 1/HZ resolution (typically 1ms). Modern kernels use high-resolution clocks for CFS virtual runtime calculations, providing nanosecond precision. However, the user-visible utime/stime may still use coarser granularity. Be aware that very short-lived processes may have misleading time statistics.
Understanding timeout behavior helps diagnose real-world performance issues. Let's examine practical scenarios where time slice management directly impacts system behavior.
Desktop Responsiveness Issues:
Symptom: Video playback stutters when compiling code in background.
Cause: Video player is a normal-priority process. Compiler spawns many threads, all consuming their time slices. CFS tries to be fair, giving each runnable thread roughly equal time, but with 8 compiler threads vs 1 video player thread, the compiler collectively receives about 8/9 of the CPU. The player's slices shrink and arrive late, so frame deadlines are missed and playback stutters.
Solution:
- `nice +19 make -j8` — Lower compiler priority
- `chrt -i 0 make -j8` — Use idle scheduling class
```bash
# Diagnose scheduling issues

# View current scheduler settings
cat /proc/sys/kernel/sched_latency_ns
cat /proc/sys/kernel/sched_min_granularity_ns

# Monitor context switches
vmstat 1 | awk '{print "Context switches:", $12}'

# Per-process scheduling stats
cat /proc/<pid>/sched
# Shows: nr_switches, wait_sum, wait_max, etc.

# Watch for scheduling delays
perf sched record -a sleep 10
perf sched latency
# Shows scheduling latencies by task

# Real-time process scheduling
chrt -f -p 99 <pid>  # Set SCHED_FIFO priority 99
chrt -p <pid>        # View current policy

# Trace scheduling events
echo 1 > /sys/kernel/debug/tracing/events/sched/sched_switch/enable
cat /sys/kernel/debug/tracing/trace_pipe
# Shows every context switch with timing
```

There's no universally optimal time slice. Desktop systems favor short slices for responsiveness. Servers favor longer slices for throughput. Real-time systems need deterministic latency. Container orchestrators like Kubernetes specify CPU requests/limits that the kernel enforces via cgroup bandwidth control. Understand your workload to tune appropriately.
We've thoroughly explored the Timeout transition—the mechanism that prevents CPU monopolization and ensures fair time sharing among processes.
Looking ahead:
So far, we've seen processes transition from New to Ready (Admit), Ready to Running (Dispatch), and Running back to Ready (Timeout). But what happens when a running process can't proceed because it needs something—data from disk, input from a user, or a lock held by another process? This brings us to the Block transition (Running → Waiting), the subject of our next page.
You now understand the Timeout transition (Running → Ready) comprehensively—from timer interrupt mechanics through time quantum calculation, preemption points, priority preemption, and time accounting. Next, we'll examine the Block transition that moves processes from Running to Waiting.