Linux was never designed to be a real-time operating system. Born as a general-purpose Unix clone, its kernel architecture prioritized throughput, fairness, and scalability over determinism and bounded latency. Yet today, Linux powers industrial robots, medical devices, autonomous vehicles, and telecommunications infrastructure—systems where missing a deadline can mean catastrophic failure.
This transformation from a general-purpose system to a real-time platform represents one of the most ambitious kernel engineering efforts in computing history: the PREEMPT_RT patch.
For over two decades, this patchset has systematically reengineered the Linux kernel's fundamental assumptions about scheduling, locking, and interrupt handling. What began as an experimental project has evolved into production infrastructure that runs mission-critical systems worldwide, and as of 2024, significant portions have been merged into the mainline Linux kernel.
By the end of this page, you will understand: (1) Why standard Linux fails real-time requirements; (2) The architectural philosophy behind PREEMPT_RT; (3) Key kernel modifications that enable determinism; (4) The technical mechanisms that reduce worst-case latency; (5) How PREEMPT_RT compares to dedicated RTOSes; and (6) Practical considerations for deployment and configuration.
To appreciate what PREEMPT_RT accomplishes, we must first understand why standard Linux fundamentally violates real-time requirements. The problem isn't performance—Linux can be extremely fast. The problem is predictability.
The Unpredictability Sources:
Standard Linux exhibits unpredictable latencies from multiple kernel subsystems, each capable of delaying high-priority task execution for arbitrary periods.
| Latency Source | Mechanism | Worst-Case Duration | Real-Time Impact |
|---|---|---|---|
| Interrupt Handlers | Hardirq handlers run with interrupts disabled, blocking all other processing | Hundreds of microseconds to milliseconds | Unbounded delay for high-priority tasks |
| Spinlocks | Lock holders cannot be preempted; critical sections run to completion | Milliseconds under contention | Priority inversion, deadline misses |
| RCU Grace Periods | Memory reclamation requires synchronization across all CPUs | Tens to hundreds of milliseconds | Unpredictable memory pressure effects |
| Softirqs/Tasklets | Deferred interrupt work runs at high priority, non-preemptible | Milliseconds during I/O bursts | Network/storage activity blocks RT tasks |
| Kernel Preemption Points | Preemption only at explicit points in non-PREEMPT kernels | Entire system call duration | Syscall latency becomes RT latency |
| Memory Allocation | Page reclaim, compaction, and slab allocation can block | Seconds under memory pressure | Catastrophic deadline misses |
The Fundamental Tension:
Linux kernel developers have historically optimized for throughput and average-case performance. From this perspective, running an interrupt handler to completion before returning to user space is efficient—it minimizes context switch overhead and cache pollution.
But for real-time systems, worst-case latency is the only metric that matters. A system that completes 99.99% of operations in 10 microseconds but occasionally takes 100 milliseconds is worthless for controlling a robotic arm that requires 1-millisecond response guarantees.
Real-time failures occur in the tail of the latency distribution—the rare worst-case scenarios that happen once per hour, per day, or per week. Standard Linux testing and optimization focuses on average cases, completely ignoring the tail events that cause real-time systems to fail catastrophically.
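To make the tail-versus-average point concrete, here is a minimal sketch that summarizes a set of latency samples by mean, 99th percentile, and maximum. The names `summarize_latencies` and `struct latency_summary`, and the nearest-rank percentile method, are illustrative choices for this page and not part of any kernel or rt-tests API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical summary type used only for this illustration. */
struct latency_summary {
    uint64_t mean_ns;
    uint64_t p99_ns;
    uint64_t max_ns;
};

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;

    return (x > y) - (x < y);
}

/*
 * Sorts the samples in place and fills in mean, p99, and max.
 * Returns 0 on success, -1 if there are no samples.
 */
int summarize_latencies(uint64_t *samples, size_t n, struct latency_summary *out)
{
    uint64_t sum = 0;
    size_t rank;

    assert(samples != NULL && out != NULL);
    if (n == 0)
        return -1;

    qsort(samples, n, sizeof(*samples), cmp_u64);
    for (size_t i = 0; i < n; i++)
        sum += samples[i];

    /* Nearest-rank percentile: rank = ceil(0.99 * n), 1-based. */
    rank = (n * 99 + 99) / 100;

    out->mean_ns = sum / n;
    out->p99_ns = samples[rank - 1];
    out->max_ns = samples[n - 1];
    return 0;
}
```

With 9,999 samples of 10 μs and a single 100 ms outlier, this reports a mean of roughly 20 μs and a p99 of 10 μs, while the maximum is 100 ms. Only the maximum reveals the event that would break a 1 ms control loop.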
Quantifying the Problem:
In standard Linux (without PREEMPT_RT), measured worst-case latencies under load can reach:
For a real-time system requiring 100μs guarantees, these figures represent failure rates that are completely unacceptable. Even microsecond-level jitter can accumulate to cause deadline misses in tightly coupled control systems.
The PREEMPT_RT patch emerged from a revolutionary insight: instead of building a real-time system alongside Linux (the dual-kernel approach), transform Linux itself into a real-time kernel. This philosophy has profound implications for system design, maintenance, and the broader Linux ecosystem.
Historical Development:
The PREEMPT_RT project began around 2004-2005, led by kernel developers including Ingo Molnár, Thomas Gleixner, and Steven Rostedt. The project built upon earlier preemption work and introduced increasingly aggressive kernel modifications.
| Period | Development | Significance |
|---|---|---|
| 2004-2005 | Initial PREEMPT_RT patchset | Proved concept of fully preemptible Linux kernel |
| 2006-2010 | Threaded interrupt handlers | Fundamental architecture for interrupt preemption |
| 2010-2015 | Raw spinlock separation | Clean API distinguishing RT-safe and non-RT-safe locks |
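| 2015-2020 | Mainline integration accelerates | Builds on earlier merges such as priority-inheritance mutexes (2.6.18) and generic threaded IRQ support (2.6.30); the CONFIG_PREEMPT_RT symbol enters mainline in Linux 5.3 (2019) |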
| 2020-2024 | Accelerated upstreaming | Printk, locking, and timer subsystem changes merged; RT becomes kernel config option |
As of Linux 6.x, the majority of PREEMPT_RT changes have been merged into mainline. With the printk rework merged, Linux 6.12 (released in November 2024) can be built with CONFIG_PREEMPT_RT on x86, arm64, and RISC-V without out-of-tree patches; support for other architectures and some remaining cleanups are still being upstreamed. This marks a two-decade engineering effort nearing completion.
Why Not a Dedicated RTOS?
The PREEMPT_RT approach offers compelling advantages over using a separate real-time operating system:
The Linux kernel supports multiple preemption models, each representing a different tradeoff between throughput and latency. Understanding these models is essential for configuring real-time systems.
| Config Option | Preemption Model | Kernel Behavior | Use Case |
|---|---|---|---|
| PREEMPT_NONE | No Preemption | Kernel code runs to completion; preemption only on return to user space | Servers, throughput-focused workloads |
| PREEMPT_VOLUNTARY | Voluntary Preemption | Explicit preemption points scattered through kernel; checks at might_sleep() calls | Desktop systems, general-purpose computing |
| PREEMPT | Full Preemption (Standard) | Kernel code preemptible except when holding spinlocks or in interrupt context | Low-latency desktops, soft real-time |
| PREEMPT_RT | Full Real-Time Preemption | Nearly all kernel code preemptible; spinlocks converted to mutexes; threaded interrupts | Hard real-time systems, industrial control |
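One quick way to see which of these models a running kernel was built with is to inspect the build banner returned by `uname(2)`. The sketch below is a heuristic under stated assumptions: it assumes the banner contains `PREEMPT_RT` (or `PREEMPT RT` on some older patched kernels), `PREEMPT_DYNAMIC`, or `PREEMPT`, which is common but varies across kernel versions and distributions. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/utsname.h>

/* Hypothetical classification used only for this illustration. */
enum preempt_model {
    PREEMPT_MODEL_UNKNOWN,  /* no marker: typically NONE or VOLUNTARY */
    PREEMPT_MODEL_FULL,     /* CONFIG_PREEMPT */
    PREEMPT_MODEL_DYNAMIC,  /* model selectable at boot */
    PREEMPT_MODEL_RT        /* CONFIG_PREEMPT_RT */
};

/* Classify a kernel build banner such as "#1 SMP PREEMPT_RT Tue ...". */
enum preempt_model classify_preempt_model(const char *version)
{
    assert(version != NULL);

    /* Check the most specific markers first: all of them contain "PREEMPT". */
    if (strstr(version, "PREEMPT_RT") || strstr(version, "PREEMPT RT"))
        return PREEMPT_MODEL_RT;
    if (strstr(version, "PREEMPT_DYNAMIC"))
        return PREEMPT_MODEL_DYNAMIC;
    if (strstr(version, "PREEMPT"))
        return PREEMPT_MODEL_FULL;
    return PREEMPT_MODEL_UNKNOWN;
}

/* Classify the kernel this process is running on. */
enum preempt_model running_preempt_model(void)
{
    struct utsname u;

    if (uname(&u) != 0)
        return PREEMPT_MODEL_UNKNOWN;
    return classify_preempt_model(u.version);
}
```

Calling `running_preempt_model()` at application start-up lets a real-time program refuse to run, or at least log a warning, when it is not on a PREEMPT_RT kernel. The banner is the same string that `uname -v` prints.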
The Preemption Hierarchy:
Each preemption model builds upon the previous, adding more preemption points and reducing worst-case latency at the cost of increased overhead and complexity.
```text
PREEMPT_NONE:
┌─────────────────────────────────────────────────────────────────┐
│ User Space │ Syscall/Interrupt → Kernel → Return │ User Space   │
└─────────────────────────────────────────────────────────────────┘
                                                   ↑ Preemption only here

PREEMPT_VOLUNTARY:
┌─────────────────────────────────────────────────────────────────┐
│ User │ Kernel code ──●──●──●──●── Kernel code │ User            │
└─────────────────────────────────────────────────────────────────┘
                       ↑ Preemption at explicit check points (●)

PREEMPT:
┌─────────────────────────────────────────────────────────────────┐
│ User │ ══════│ spinlock │══════│ spinlock │══════ │ User        │
└─────────────────────────────────────────────────────────────────┘
         ↑ Preemptible  ↑ Not    ↑ Preemptible

PREEMPT_RT:
┌─────────────────────────────────────────────────────────────────┐
│ User │ ═══════════════════════════════════════════════ │ User   │
└─────────────────────────────────────────────────────────────────┘
         ↑ Almost everything preemptible (sleeping locks)
```

Latency Implications:
The difference between preemption models becomes dramatic under load:
| Preemption Model | Idle System | Moderate Load | Heavy I/O Load | Memory Pressure |
|---|---|---|---|---|
| PREEMPT_NONE | 10-50 μs | 100 μs - 5 ms | 10-100 ms | 100 ms - 1 s |
| PREEMPT_VOLUNTARY | 10-30 μs | 50 μs - 1 ms | 5-50 ms | 50-500 ms |
| PREEMPT | 10-20 μs | 30-200 μs | 1-10 ms | 10-100 ms |
| PREEMPT_RT | 5-15 μs | 15-50 μs | 20-100 μs | 50-200 μs |
Notice that PREEMPT_RT doesn't just improve average latency—it fundamentally changes worst-case behavior. Under heavy I/O, standard PREEMPT shows 1-10ms worst-case while PREEMPT_RT maintains 20-100μs. This bounded behavior is what makes real-time systems reliable.
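Figures like those in the table are normally obtained with `cyclictest` from the rt-tests suite. The underlying idea can be sketched in a few lines of user-space C: sleep until an absolute deadline, then measure how late the wakeup actually was. This is a simplified illustration, not a replacement for cyclictest; the function name and parameters are hypothetical, and a meaningful measurement would also run the thread under SCHED_FIFO with locked memory while the system is under load.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_nanosleep() */
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

static int64_t timespec_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

/*
 * Wake up every period_ns nanoseconds, 'iterations' times, and return the
 * largest observed difference between the requested and the actual wakeup
 * time in nanoseconds. Returns -1 on invalid arguments or clock errors.
 */
int64_t measure_max_wakeup_latency_ns(int iterations, int64_t period_ns)
{
    struct timespec next, now;
    int64_t max_ns = 0;

    if (iterations <= 0 || period_ns <= 0)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return -1;

    for (int i = 0; i < iterations; i++) {
        int64_t latency_ns;
        int rc;

        /* Advance the absolute deadline by one period. */
        next.tv_sec += period_ns / NSEC_PER_SEC;
        next.tv_nsec += period_ns % NSEC_PER_SEC;
        if (next.tv_nsec >= NSEC_PER_SEC) {
            next.tv_nsec -= NSEC_PER_SEC;
            next.tv_sec++;
        }

        /* clock_nanosleep() returns the error number directly. */
        do {
            rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
        } while (rc == EINTR);
        if (rc != 0)
            return -1;

        if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
            return -1;

        /* The sleep cannot end before the deadline on a monotonic clock. */
        latency_ns = timespec_to_ns(&now) - timespec_to_ns(&next);
        assert(latency_ns >= 0);
        if (latency_ns > max_ns)
            max_ns = latency_ns;
    }
    return max_ns;
}
```

A call such as `measure_max_wakeup_latency_ns(100000, 1000000)` samples a 1 ms period for about 100 seconds. Using absolute deadlines (`TIMER_ABSTIME`) keeps one late wakeup from shifting all later deadlines, so each sample measures only its own delay.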
One of PREEMPT_RT's most significant architectural changes is the conversion of interrupt handlers from hardirq context to kernel threads. This transformation is fundamental to achieving bounded latency.
Implementation Architecture:
When a device driver requests a threaded interrupt, the kernel creates the following structure:
```text
Hardware Interrupt Occurs
            │
            ▼
┌────────────────────────────────────────────────────┐
│ Hardirq Stub (Primary Handler)                     │
│ - Acknowledge interrupt to hardware                │
│ - Check if this interrupt needs handling           │
│ - Return IRQ_WAKE_THREAD to schedule thread        │
│ Duration: < 1 microsecond typically                │
└────────────────────────────────────────────────────┘
            │
            ▼ (Thread wakeup)
┌────────────────────────────────────────────────────┐
│ Scheduler Runs                                     │
│ - Threaded handler competes with other threads     │
│ - Priority inheritance if waiting on RT mutex      │
│ - System admin can set irq thread priorities       │
└────────────────────────────────────────────────────┘
            │
            ▼
┌────────────────────────────────────────────────────┐
│ Threaded Handler (Secondary Handler)               │
│ - Full interrupt processing                        │
│ - Can sleep, acquire mutexes                       │
│ - CAN BE PREEMPTED by higher priority threads      │
│ Duration: Whatever the handler needs               │
└────────────────────────────────────────────────────┘
```

```c
/**
 * Example: Requesting a threaded interrupt handler
 *
 * The kernel creates an "irq/N-driver_name" thread for this handler.
 * This thread can be observed with 'ps' and its priority adjusted.
 */
#include <linux/interrupt.h>
#include <linux/mutex.h>
#include <linux/pci.h>

/* Primary handler: runs in hardirq context, must be fast */
static irqreturn_t my_device_hardirq(int irq, void *dev_id)
{
    struct my_device *dev = dev_id;

    /* Quick check: is this interrupt for us? */
    if (!device_interrupt_pending(dev))
        return IRQ_NONE; /* Not our interrupt */

    /* Acknowledge interrupt to hardware (stop it from re-firing) */
    device_ack_interrupt(dev);

    /* Store any volatile state that must be captured immediately */
    dev->captured_timestamp = read_hardware_timestamp(dev);

    /* Request threaded handler execution */
    return IRQ_WAKE_THREAD;
}

/* Threaded handler: runs in process context, can do heavy work */
static irqreturn_t my_device_thread(int irq, void *dev_id)
{
    struct my_device *dev = dev_id;

    /*
     * This code runs in a kernel thread.
     * It CAN:
     * - Sleep
     * - Acquire mutexes (with priority inheritance on PREEMPT_RT)
     * - Be preempted by higher-priority threads
     * - Take as long as necessary
     *
     * It CANNOT:
     * - Assume it runs immediately after the interrupt
     * - Assume no other code ran between hardirq and here
     */
    mutex_lock(&dev->data_lock); /* Safe! Will sleep if contended */

    process_received_data(dev);
    wake_up_waiting_userspace(dev);
    prepare_next_dma_transfer(dev);

    mutex_unlock(&dev->data_lock);

    return IRQ_HANDLED;
}

/* Registration (PCI probe callbacks take the matched device ID as well) */
int my_device_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    struct my_device *dev = /* ... allocate ... */;
    int ret;

    ret = request_threaded_irq(
        pdev->irq,
        my_device_hardirq,  /* Primary: hardirq context */
        my_device_thread,   /* Secondary: thread context */
        IRQF_SHARED,
        "my_device",
        dev
    );
    if (ret) {
        dev_err(&pdev->dev, "Failed to request threaded IRQ\n");
        return ret;
    }

    /*
     * After this, you can see the thread:
     * $ ps aux | grep irq
     * root ... [irq/24-my_device]
     *
     * And adjust its priority:
     * # chrt -f -p 90 <pid>
     */
    return 0;
}
```

On PREEMPT_RT kernels, most interrupt handlers are automatically force-threaded even if the driver didn't request it. Only handlers marked with IRQF_NO_THREAD (for critical low-level functions like timer interrupts) retain hardirq execution. This ensures system-wide determinism regardless of driver quality.
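The `chrt -f -p 90 <pid>` command mentioned in the example is a thin wrapper around the `sched_setscheduler(2)` system call, so an application or supervisor process can do the same thing programmatically. The following is a minimal user-space sketch; the two helper names are hypothetical, and changing another task's policy requires root or `CAP_SYS_NICE`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Give task 'pid' (0 means the caller) the SCHED_FIFO policy at 'priority'.
 * Returns 0 on success or a negative errno value, for example -EPERM when
 * the caller lacks the required privilege.
 */
int set_fifo_priority(pid_t pid, int priority)
{
    struct sched_param param = { .sched_priority = priority };
    int min = sched_get_priority_min(SCHED_FIFO);
    int max = sched_get_priority_max(SCHED_FIFO);

    if (min < 0 || max < 0)
        return -errno;
    if (priority < min || priority > max)
        return -EINVAL; /* reject out-of-range values before the syscall */
    if (sched_setscheduler(pid, SCHED_FIFO, &param) != 0)
        return -errno;
    return 0;
}

/* Read back the policy and priority of 'pid', similar to 'chrt -p <pid>'. */
int get_policy_and_priority(pid_t pid, int *policy, int *priority)
{
    struct sched_param param;
    int pol;

    assert(policy != NULL && priority != NULL);

    pol = sched_getscheduler(pid);
    if (pol < 0)
        return -errno;
    if (sched_getparam(pid, &param) != 0)
        return -errno;

    *policy = pol;
    *priority = param.sched_priority;
    return 0;
}
```

Passing the PID of an `irq/N-name` thread to `set_fifo_priority()` has the same effect as the `chrt` invocation, and `get_policy_and_priority()` can confirm the change took effect. On Linux the valid SCHED_FIFO range is 1 to 99.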
The most counterintuitive aspect of PREEMPT_RT is its treatment of spinlocks. In standard Linux, spinlocks are busy-wait locks that disable preemption—holding a spinlock means you cannot be preempted. PREEMPT_RT transforms most spinlocks into sleeping mutexes with priority inheritance.
The Spinlock Problem:
Consider a typical spinlock usage pattern in a device driver:
```c
/*
 * Standard spinlock usage - problematic for real-time
 */
void driver_operation(struct device *dev)
{
    unsigned long flags;

    spin_lock_irqsave(&dev->lock, flags);
    /*
     * PROBLEM: While holding this lock:
     * 1. Preemption is disabled
     * 2. All interrupts are disabled (on this CPU)
     * 3. Any higher-priority task wanting to run must wait
     * 4. Duration is unbounded (depends on work done here)
     *
     * If this critical section takes 1ms, we add 1ms to the
     * worst-case latency of EVERY real-time task in the system!
     */
    perform_lengthy_device_operation(dev);
    spin_unlock_irqrestore(&dev->lock, flags);
}
```

PREEMPT_RT Solution: Sleeping Locks
PREEMPT_RT redefines spinlock_t to be a sleeping lock (an rtmutex internally). When code calls spin_lock(), it may actually sleep if the lock is contended—and critically, the lock holder can be preempted.
```text
Standard Linux Spinlock:
┌─────────────────────────────────────────────────────────────────┐
│ Thread A: [───── spin_lock ─────────────── spin_unlock ────]    │
│                ↑ preemption disabled                            │
│ Thread B: [...BLOCKED...BLOCKED...BLOCKED...]                   │
│                ↑ Cannot run even if higher priority             │
└─────────────────────────────────────────────────────────────────┘

PREEMPT_RT Sleeping Spinlock:
┌─────────────────────────────────────────────────────────────────┐
│ Thread A: [── lock ──┐                      ┌── unlock ──]      │
│                      │ preempted!           │                   │
│ Thread B:            └─[HIGH PRIORITY RUNS]─┘                   │
│                                                                 │
│ Note: Thread A continues after B completes. Priority            │
│ inheritance ensures A completes quickly to release lock.        │
└─────────────────────────────────────────────────────────────────┘
```

Priority Inheritance:
When a high-priority thread blocks waiting for a lock held by a low-priority thread, PREEMPT_RT implements priority inheritance: the lock holder temporarily inherits the waiter's priority. This prevents classic priority inversion scenarios.
```text
Priority Inversion WITHOUT Inheritance:
┌─────────────────────────────────────────────────────────────────┐
│ High   (Pri=90):  [BLOCKED on lock ─────────────────────────]   │
│ Medium (Pri=50):  ════════════════════════════════════════      │
│ Low    (Pri=10):  [holds lock...preempted by Medium...]         │
│                                                                 │
│ Problem: High waits for Low, but Medium runs instead of Low!    │
│ High's latency = Medium's entire execution time                 │
└─────────────────────────────────────────────────────────────────┘

Priority Inversion WITH Inheritance (PREEMPT_RT):
┌─────────────────────────────────────────────────────────────────┐
│ High   (Pri=90):     [wait]──────────[RUNS]                     │
│ Medium (Pri=50):     [blocked behind Low-at-90]                 │
│ Low    (Pri=10→90):  [runs at 90, releases lock]                │
│                      ↑ Inherits High's priority                 │
│                                                                 │
│ Result: High's latency = only Low's critical section time       │
└─────────────────────────────────────────────────────────────────┘
```

Raw Spinlocks:
Some kernel code genuinely requires non-sleeping spinlocks—typically low-level code that manages the sleeping infrastructure itself, or code that runs before the scheduler is available. PREEMPT_RT provides raw_spinlock_t for these cases:
```c
#include <linux/spinlock.h>

/* Regular spinlock: becomes sleeping mutex on PREEMPT_RT */
static DEFINE_SPINLOCK(normal_lock);

/* Raw spinlock: always a true spinlock, even on PREEMPT_RT */
static DEFINE_RAW_SPINLOCK(raw_lock);

void regular_path(void)
{
    spin_lock(&normal_lock);
    /* On PREEMPT_RT: May sleep! Lock holder can be preempted. */
    /* Use for most normal device driver critical sections. */
    do_normal_work();
    spin_unlock(&normal_lock);
}

void scheduler_critical_path(void)
{
    unsigned long flags;

    raw_spin_lock_irqsave(&raw_lock, flags);
    /*
     * TRUE busy-wait spinlock. Preemption disabled.
     *
     * Use ONLY for:
     * - Scheduler internals
     * - Interrupt controller manipulation
     * - Timer hardware programming
     * - Debugging/tracing infrastructure
     *
     * Keep critical sections EXTREMELY short (< 1μs ideal).
     */
    manipulate_scheduler_structures();
    raw_spin_unlock_irqrestore(&raw_lock, flags);
}
```

Every raw_spinlock in the kernel is a potential latency source. PREEMPT_RT developers carefully audit raw spinlock usage, striving to minimize both the number of raw spinlocks and the duration of their critical sections. Adding raw spinlocks to new code requires strong justification.
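The priority inheritance described in this section applies to kernel locks, but user-space real-time applications face the same inversion problem with their own mutexes. POSIX threads offer an analogous opt-in through the `PTHREAD_PRIO_INHERIT` mutex protocol, which on Linux is backed by PI futexes. The sketch below shows one way to create such a mutex; `init_pi_mutex` is a hypothetical helper name, and default pthread mutexes do not inherit priority unless this protocol is requested.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/*
 * Initialise 'mutex' with the priority-inheritance protocol.
 * Returns 0 on success or a positive error number, following the
 * pthread convention (for example ENOTSUP if PI mutexes are unavailable).
 */
int init_pi_mutex(pthread_mutex_t *mutex)
{
    pthread_mutexattr_t attr;
    int rc;

    assert(mutex != NULL);

    rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;

    /* While a higher-priority thread waits, the owner runs at its priority. */
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(mutex, &attr);

    /* The attribute object is no longer needed once the mutex exists. */
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

After `init_pi_mutex(&m)` succeeds, the mutex is used with the ordinary `pthread_mutex_lock()` and `pthread_mutex_unlock()` calls. On older glibc versions the program needs to be linked with `-pthread`.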
Beyond threaded interrupts and sleeping spinlocks, PREEMPT_RT includes numerous other modifications that collectively achieve deterministic behavior.
Softirq Threading:
In standard Linux, softirqs run immediately after hardware interrupts with interrupts enabled but preemption disabled. This can cause significant latency if multiple softirqs are pending. PREEMPT_RT moves softirq execution to dedicated kernel threads:
```text
Standard Linux Softirq Processing:
┌─────────────────────────────────────────────────────────────────┐
│ [Hardirq] → [Softirq: NET_RX + TIMER + SCHED + ...] → [User]    │
│             └──────── Cannot be preempted ────────┘             │
│             Potentially milliseconds of non-preemptible work    │
└─────────────────────────────────────────────────────────────────┘

PREEMPT_RT Threaded Softirqs:
┌─────────────────────────────────────────────────────────────────┐
│ [Hardirq] → [Wake ksoftirqd] → [Return immediately]             │
│                    ↓                                            │
│             [ksoftirqd/N thread runs when scheduled]            │
│             [Can be preempted by higher priority threads]       │
│                                                                 │
│ $ ps aux | grep ksoftirqd                                       │
│ root ... [ksoftirqd/0]  # CPU 0 softirq thread                  │
│ root ... [ksoftirqd/1]  # CPU 1 softirq thread                  │
└─────────────────────────────────────────────────────────────────┘
```

Preemptible RCU:
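Read-Copy-Update (RCU) is a fundamental Linux synchronization mechanism. In kernels built without kernel preemption, RCU read-side critical sections run with preemption disabled, so a reader must run to completion. PREEMPT_RT relies on the preemptible RCU variant (CONFIG_PREEMPT_RCU, which CONFIG_PREEMPT kernels also use), where readers can be preempted mid-critical-section: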
```c
/*
 * RCU Reader: non-preemptible RCU vs PREEMPT_RT
 * (struct my_data, global_ptr, and the process helpers are illustrative)
 */
#include <linux/rcupdate.h>

/* Non-preemptible RCU (no CONFIG_PREEMPT_RCU): reader cannot be preempted */
void standard_rcu_reader(void)
{
    struct my_data *ptr;

    rcu_read_lock();    /* Disables preemption */
    ptr = rcu_dereference(global_ptr);
    /* CANNOT be preempted here */
    process(ptr);
    rcu_read_unlock();  /* Re-enables preemption */
}

/* PREEMPT_RT: Reader CAN be preempted */
void preempt_rt_rcu_reader(void)
{
    struct my_data *ptr;

    rcu_read_lock();    /* Does NOT disable preemption */
    ptr = rcu_dereference(global_ptr);
    /*
     * CAN be preempted here!
     *
     * The RCU machinery tracks that we're in a critical section
     * and ensures grace periods account for preempted readers.
     *
     * This prevents a long-running RCU reader from blocking
     * high-priority real-time tasks.
     */
    process_potentially_long_operation(ptr);
    rcu_read_unlock();
}
```

All these modifications add overhead. Threaded interrupts require thread context switches. Sleeping spinlocks have mutex acquisition costs. Preemptible RCU has more complex tracking. PREEMPT_RT trades some average-case performance for dramatically improved worst-case performance, which is exactly the trade-off real-time systems require.
Deploying PREEMPT_RT requires careful kernel configuration and system tuning. This section covers the practical aspects of building and configuring an RT kernel.
Obtaining PREEMPT_RT:
As of recent Linux versions, PREEMPT_RT support is largely mainlined. For older kernels or the complete patchset:
```bash
# For mainline kernels (6.x+): PREEMPT_RT is a config option
# No patches needed for many configurations

# For kernels requiring patches:
# Visit: https://wiki.linuxfoundation.org/realtime/start
# Download matching patch version for your kernel

# Example: Applying patches to kernel 5.15
wget https://cdn.kernel.org/pub/linux/kernel/projects/rt/5.15/patch-5.15.XX-rtXX.patch.gz
gunzip patch-5.15.XX-rtXX.patch.gz

# Apply to kernel source
cd /path/to/linux-5.15
patch -p1 < ../patch-5.15.XX-rtXX.patch
```

Essential Kernel Configuration:
```
# Essential PREEMPT_RT Configuration Options

# Core preemption model - select PREEMPT_RT
CONFIG_PREEMPT_RT=y                        # Full real-time preemption

# Timer configuration
CONFIG_HIGH_RES_TIMERS=y                   # High-resolution timer support (essential)
CONFIG_NO_HZ_FULL=y                        # Tickless operation for RT tasks (optional)

# Scheduler features
CONFIG_RT_GROUP_SCHED=y                    # RT task group scheduling (optional)

# Disable problematic features for RT
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y  # Avoid frequency scaling
CONFIG_CPU_IDLE=n                          # Or configure very short idle states only

# Debugging (disable in production for lower latency)
CONFIG_DEBUG_PREEMPT=n                     # Disable for production
CONFIG_PROVE_LOCKING=n                     # Disable for production
CONFIG_LOCKDEP=n                           # Disable for production

# Tracing (keep for latency analysis, disable for absolute minimum)
CONFIG_FTRACE=y                            # Function tracing
CONFIG_IRQSOFF_TRACER=y                    # Track IRQs-off latency
CONFIG_PREEMPTIRQ_EVENTS=y                 # Preemption/IRQ tracing
CONFIG_SCHED_TRACER=y                      # Scheduler tracing

# Memory configuration
CONFIG_TRANSPARENT_HUGEPAGE=n              # Avoid THP overhead (recommended)
CONFIG_COMPACTION=n                        # Consider disabling compaction
```

Runtime Configuration:
After booting the RT kernel, additional runtime configuration optimizes real-time behavior:
```bash
#!/bin/bash
# Runtime configuration for PREEMPT_RT system

# 1. Verify RT kernel is running (the build banner names the preemption model)
uname -v | grep -Eq 'PREEMPT[_ ]RT' || echo "WARNING: Not running RT kernel!"

# 2. Set CPU frequency to maximum (avoid frequency scaling latency)
for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$cpu"
done

# 3. Disable real-time throttling (allow RT tasks to use 100% CPU)
# WARNING: A runaway RT task can hang the system!
echo -1 > /proc/sys/kernel/sched_rt_runtime_us

# 4. Isolate CPUs for RT tasks (kernel boot parameter is better)
# In /etc/default/grub, add: isolcpus=2,3 nohz_full=2,3

# 5. Configure IRQ affinities - move non-RT IRQs off RT CPUs
# Move all IRQs to CPU 0,1 (leaving 2,3 for RT tasks)
for irq in /proc/irq/*/smp_affinity; do
    echo 3 > "$irq" 2>/dev/null  # Bitmask 0x3 = CPUs 0 and 1; some IRQs cannot be moved
done

# 6. Set RT thread priorities
# Example: Set network IRQ thread to priority 90
pgrep -f "irq/.*eth" | xargs -r -I{} chrt -f -p 90 {}

# 7. Lock memory for RT application
# Application should use mlockall(MCL_CURRENT | MCL_FUTURE)

# 8. Verify configuration
echo "=== RT Configuration Summary ==="
echo "Kernel: $(uname -r)"
echo "RT Runtime: $(cat /proc/sys/kernel/sched_rt_runtime_us)"
echo "Governor (cpu0): $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)"
echo "Isolated CPUs: $(cat /sys/devices/system/cpu/isolated)"
```

Setting sched_rt_runtime_us to -1 disables RT throttling, allowing RT tasks to consume 100% CPU indefinitely. A buggy RT task can completely lock up the system. Only disable throttling on fully tested production systems with hardware watchdogs.
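Step 7 of the script leaves memory locking to the application itself, because a page fault inside a real-time loop reintroduces exactly the unbounded latency the kernel configuration was meant to remove. Below is a minimal sketch of what that application-side setup might look like. The function names and the 64 KiB stack figure are assumptions for illustration; a real application should size the prefault region to its own worst-case stack use.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed worst-case stack use of the RT thread; adjust per application. */
#define PREFAULT_STACK_BYTES (64 * 1024)

/* Touch one byte per page so the stack pages are mapped before RT work. */
static void prefault_stack(void)
{
    volatile unsigned char frame[PREFAULT_STACK_BYTES];
    long page = sysconf(_SC_PAGESIZE);

    assert(page > 0);
    for (size_t off = 0; off < sizeof(frame); off += (size_t)page)
        frame[off] = 0;
}

/*
 * Lock all current and future pages into RAM, then prefault the stack.
 * Returns 0 on success or a negative errno value, for example -ENOMEM
 * when RLIMIT_MEMLOCK is too low for an unprivileged process.
 */
int lock_memory_for_rt(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return -errno;

    prefault_stack();
    return 0;
}
```

Calling `lock_memory_for_rt()` once during initialisation, before entering the time-critical loop, keeps paging out of the critical path. Heap memory used by the loop should likewise be allocated and touched up front rather than on demand.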
The PREEMPT_RT patch represents a fundamental reimagining of the Linux kernel for real-time applications. Let's consolidate the key concepts:
What's Next:
With the foundational understanding of PREEMPT_RT architecture, we'll next explore the specific real-time schedulers available in Linux—SCHED_FIFO, SCHED_RR, and SCHED_DEADLINE—and how to effectively use them for different real-time requirements.
You now understand the fundamental architecture and mechanisms of the PREEMPT_RT patch—the key technology that transforms Linux into a real-time operating system. This knowledge is essential for developing and deploying real-time applications on Linux platforms.