In any real computer system, multiple I/O devices compete for CPU attention simultaneously. A keystroke arrives while a network packet completes while a disk read finishes—all at the same instant. The system must decide: Which interrupt gets serviced first?
This isn't an academic question. Wrong prioritization can cause audio to stutter (missed samples), network connections to drop (buffer overflow), or real-time control systems to fail catastrophically (missed deadlines). Interrupt priority is the mechanism that ensures critical events receive timely attention while less urgent work waits its turn.
By the end of this page, you will understand why interrupt priority is necessary, see how priority is implemented at both the hardware (interrupt controller) and software (handler) levels, master priority-based preemption, and learn how priority inversion and other pathologies are handled.
To understand why priority matters, consider what happens without it. In a simple first-come-first-served model, interrupts are serviced strictly in arrival order, so a slow, low-urgency handler (say, keyboard input) that happens to start first delays a timer tick or even a machine check waiting behind it.
This is unacceptable for any system with time-sensitive requirements. The solution is priority: assigning relative importance to different interrupt sources, ensuring that more critical events can preempt less critical ones.
| Interrupt Source | Urgency | Consequence of Delay |
|---|---|---|
| Machine Check (hardware failure) | Critical | Data corruption, system damage |
| Timer interrupt | Very High | Scheduler jitter, time inaccuracy |
| Real-time audio I/O | High | Audible clicks/pops in playback |
| Network packet arrival | Medium-High | Buffer overflow, dropped packets |
| Disk I/O completion | Medium | Slower application response |
| USB device event | Low-Medium | User perceives minor lag |
| Keyboard/mouse input | Low | User perceives minor lag |
The Priority Principle:
A higher-priority interrupt should be able to preempt a lower-priority handler—interrupting the interrupter. This creates a priority order where critical events are always serviced promptly, regardless of what other interrupts are being handled.
Priority can be implemented at multiple levels: in the interrupt controller hardware (fixed or programmable IRQ priorities), in the CPU (priority thresholds such as the APIC's Task Priority Register), and in software (the order in which the kernel runs deferred interrupt work).
Priority systems inherently risk starvation: if high-priority interrupts arrive continuously, low-priority ones never get serviced. Good system design uses priority for urgency but includes mechanisms (timeouts, round-robin within a level) to prevent complete starvation of low-priority sources.
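As a concrete illustration of both ideas, here is a minimal, self-contained sketch (not modeled on any particular controller) of a dispatcher that always serves the highest-priority pending level, but rotates service order within a level so equal-priority sources cannot starve each other:

```c
#include <stdint.h>

#define NUM_LEVELS   8   /* priority levels, 0 = highest */
#define SRC_PER_LVL  4   /* interrupt sources per level  */

/* Pending bitmaps: one bitmask of sources per priority level. */
static uint8_t pending[NUM_LEVELS];
/* Round-robin cursor per level, so equal-priority sources take turns. */
static uint8_t rr_cursor[NUM_LEVELS];

/* Returns (level << 4) | source for the next interrupt to service,
 * or -1 if nothing is pending. The highest-priority level wins
 * outright; within a level, service rotates to prevent starvation. */
int pick_next_interrupt(void)
{
    for (int level = 0; level < NUM_LEVELS; level++) {
        if (!pending[level])
            continue;                     /* nothing waiting at this level */

        /* Start scanning just past the last source serviced at this level. */
        for (int i = 0; i < SRC_PER_LVL; i++) {
            int src = (rr_cursor[level] + i) % SRC_PER_LVL;
            if (pending[level] & (1u << src)) {
                pending[level] &= ~(1u << src);             /* claim it */
                rr_cursor[level] = (src + 1) % SRC_PER_LVL; /* rotate   */
                return (level << 4) | src;
            }
        }
    }
    return -1;  /* idle */
}
```

Hardware controllers implement the same policy in logic rather than loops; the 8259's optional rotating-priority modes are the classic example of fairness within a level.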
The Intel 8259 Programmable Interrupt Controller (PIC) was the de facto standard for x86 interrupt handling for nearly 30 years. Though obsolete in modern systems, understanding the PIC illuminates fundamental priority concepts still used today.
PIC Architecture:
The PC/AT design used two 8259 chips cascaded: a master handling IRQ0-7 and a slave handling IRQ8-15, wired into the master's IRQ2 line.
Total: 15 usable IRQ lines (IRQ2 used for cascade)
```
; Traditional PC/AT IRQ Assignments with 8259 PICs

Master PIC (IRQ 0-7):
  IRQ0:  System Timer (highest priority)
  IRQ1:  Keyboard
  IRQ2:  Cascade from Slave PIC
  IRQ3:  COM2 (serial port)
  IRQ4:  COM1 (serial port)
  IRQ5:  LPT2 or Sound Card
  IRQ6:  Floppy Disk Controller
  IRQ7:  LPT1 (parallel port) / Spurious (lowest priority)

Slave PIC (IRQ 8-15):
  IRQ8:  Real Time Clock (RTC)
  IRQ9:  ACPI / Available
  IRQ10: Available
  IRQ11: Available
  IRQ12: PS/2 Mouse
  IRQ13: FPU / Coprocessor
  IRQ14: Primary IDE
  IRQ15: Secondary IDE

Priority order (highest to lowest):
  IRQ0 > IRQ1 > IRQ8-15 > IRQ3 > IRQ4 > IRQ5 > IRQ6 > IRQ7

Note: Slave PIC IRQs (8-15) all appear at master IRQ2's priority level,
then are further prioritized within the slave.
```

PIC Priority Scheme:
The 8259 uses a fixed priority scheme by default: IRQ0 has highest priority, IRQ7 the lowest. The currently-serviced interrupt level blocks all equal or lower priority interrupts.
- In-Service Register (ISR): Tracks which interrupts are currently being handled
- Interrupt Mask Register (IMR): Allows software to disable specific IRQs
When an interrupt arrives, the PIC compares its priority with any levels already recorded in the ISR. If nothing of equal or higher priority is in service and the line is not masked in the IMR, the PIC raises INTR; during the CPU's acknowledge cycle it supplies the vector and sets the corresponding ISR bit. The handler must finish by sending an End-of-Interrupt (EOI) command, which clears the ISR bit and reopens that priority level.
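To make the resolution step concrete, here is a small model of the fixed-priority rule. It is only an illustration of the policy, not the chip's actual logic: the irr parameter models the pending-request bits (the chip's Interrupt Request Register), and imr and isr correspond to the registers described above.

```c
#include <stdint.h>

/* Model of 8259 fixed-priority resolution (IRQ0 highest, IRQ7 lowest).
 * irr: pending requests, imr: masked lines, isr: levels in service.
 * Returns the IRQ to deliver, or -1 if nothing is deliverable. */
int pic_resolve(uint8_t irr, uint8_t imr, uint8_t isr)
{
    uint8_t candidates = irr & ~imr;   /* pending and not masked */

    for (int irq = 0; irq < 8; irq++) {
        uint8_t bit = 1u << irq;

        /* An in-service level at equal or higher priority blocks delivery. */
        if (isr & bit)
            return -1;

        if (candidates & bit)
            return irq;                /* highest-priority deliverable IRQ */
    }
    return -1;
}
```

Actual configuration of the controller happens through its initialization and operation command words, shown next.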
```c
// Programming the 8259 PIC

#define PIC1_CMD    0x20   // Master PIC command port
#define PIC1_DATA   0x21   // Master PIC data port
#define PIC2_CMD    0xA0   // Slave PIC command port
#define PIC2_DATA   0xA1   // Slave PIC data port

// Initialization Command Words (ICW)
#define ICW1_INIT   0x10   // Initialization
#define ICW1_ICW4   0x01   // ICW4 needed
#define ICW4_8086   0x01   // 8086/88 mode

// Operation Command Words (OCW)
#define OCW2_EOI    0x20   // Non-specific End-of-Interrupt
#define OCW2_SEOI   0x60   // Specific EOI (add IRQ number)

// Remap PICs to avoid conflict with CPU exceptions (vectors 0-31)
void pic_remap(uint8_t master_base, uint8_t slave_base) {
    uint8_t master_mask, slave_mask;

    // Save current masks
    master_mask = inb(PIC1_DATA);
    slave_mask  = inb(PIC2_DATA);

    // ICW1: Start initialization sequence
    outb(PIC1_CMD, ICW1_INIT | ICW1_ICW4);
    outb(PIC2_CMD, ICW1_INIT | ICW1_ICW4);

    // ICW2: Set vector offset
    outb(PIC1_DATA, master_base);  // Master: vectors master_base to master_base+7
    outb(PIC2_DATA, slave_base);   // Slave: vectors slave_base to slave_base+7

    // ICW3: Tell PICs about cascade
    outb(PIC1_DATA, 0x04);  // Master: slave on IRQ2 (bit 2)
    outb(PIC2_DATA, 0x02);  // Slave: cascade identity 2

    // ICW4: Set mode
    outb(PIC1_DATA, ICW4_8086);
    outb(PIC2_DATA, ICW4_8086);

    // Restore saved masks
    outb(PIC1_DATA, master_mask);
    outb(PIC2_DATA, slave_mask);
}

// Send End-of-Interrupt signal
void pic_send_eoi(uint8_t irq) {
    if (irq >= 8) {
        // IRQ from slave: must EOI both slave and master
        outb(PIC2_CMD, OCW2_EOI);
    }
    outb(PIC1_CMD, OCW2_EOI);
}

// Mask (disable) a specific IRQ
void pic_mask_irq(uint8_t irq) {
    uint16_t port = (irq < 8) ? PIC1_DATA : PIC2_DATA;
    uint8_t mask = inb(port);
    mask |= (1 << (irq & 7));
    outb(port, mask);
}

// Unmask (enable) a specific IRQ
void pic_unmask_irq(uint8_t irq) {
    uint16_t port = (irq < 8) ? PIC1_DATA : PIC2_DATA;
    uint8_t mask = inb(port);
    mask &= ~(1 << (irq & 7));
    outb(port, mask);
}
```

The 8259 PIC has severe limitations for modern systems: only 15 IRQ lines (forcing devices to share), broadcast EOI (can't easily target a specific IRQ), no MSI support, and poor SMP (multi-processor) support. Modern systems use the APIC, but PIC understanding remains valuable for legacy code, embedded systems, and interview questions.
The Advanced Programmable Interrupt Controller (APIC) architecture replaced the 8259 PIC, addressing its limitations and adding essential features for multi-processor systems.
APIC Components:
- Local APIC: one per CPU core; accepts interrupts, applies the priority (TPR) check, and handles inter-processor interrupts.
- I/O APIC: shared by the system; collects interrupt lines from devices and routes each to a target Local APIC through its redirection table.
This two-tier architecture enables per-CPU interrupt handling and sophisticated routing.
APIC Priority Mechanism:
The APIC implements a 256-level priority scheme. Each interrupt vector has an inherent priority based on its vector number:
Priority Class = Vector / 16 (0-15)
Priority Subclass = Vector % 16 (0-15)
Higher vector numbers = higher priority. Vectors 0-31 (CPU exceptions) have lowest priority class, device interrupts at vectors 32-239 have medium priority, and vectors 240-255 have highest priority.
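The class/subclass arithmetic and the TPR check can be written out directly. The snippet below is a stand-alone illustration of that rule (it does not touch real APIC registers); following the description above, a vector is deliverable only when its priority class exceeds the class held in the TPR.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

static inline uint8_t priority_class(uint8_t vector)    { return vector >> 4;  }
static inline uint8_t priority_subclass(uint8_t vector) { return vector & 0xF; }

/* An interrupt is deliverable only if its priority class is strictly
 * greater than the class programmed in the Task Priority Register. */
static inline bool deliverable(uint8_t vector, uint8_t tpr)
{
    return priority_class(vector) > priority_class(tpr);
}

int main(void)
{
    uint8_t tpr = 0x30;                       /* blocks classes 0-3 */
    uint8_t vectors[] = { 0x21, 0x35, 0x80, 0xF0 };

    for (unsigned i = 0; i < sizeof vectors / sizeof vectors[0]; i++)
        printf("vector 0x%02X: class %u, subclass %u, %s\n",
               vectors[i],
               priority_class(vectors[i]),
               priority_subclass(vectors[i]),
               deliverable(vectors[i], tpr) ? "deliverable" : "blocked");
    return 0;
}
```

The listing below shows the same mechanism at the register level, along with the I/O APIC redirection entry that routes an IRQ to a CPU.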
```c
// APIC Priority Mechanism

// Priority classes (based on vector number / 16)
// Class 0-1:  Reserved for CPU exceptions (vectors 0-31)
// Class 2-14: Device interrupt range (vectors 32-239)
// Class 15:   Highest priority (vectors 240-255)

// Task Priority Register (TPR) - Local APIC register
// Bits 7:4 = Task Priority Class
// Bits 3:0 = Task Priority Subclass
// The APIC will only deliver interrupts with priority > TPR

#define APIC_TPR 0x80   // Task Priority Register offset

// Read current TPR
static inline uint8_t apic_read_tpr(void) {
    return *(volatile uint32_t*)(APIC_BASE + APIC_TPR) & 0xFF;
}

// Set TPR to block interrupts below a threshold
static inline void apic_set_tpr(uint8_t priority) {
    *(volatile uint32_t*)(APIC_BASE + APIC_TPR) = priority;
}

// Example: Block all device interrupts during critical section
void critical_section_enter(void) {
    // Set TPR to priority class 15 (block all below 240)
    // This blocks device IRQs but allows IPI at highest vectors
    apic_set_tpr(0xF0);  // Priority = 15 (vectors 240-255 can still arrive)
}

void critical_section_exit(void) {
    apic_set_tpr(0);  // Allow all priorities
}

// End of Interrupt (EOI) for APIC
#define APIC_EOI 0xB0

static inline void apic_eoi(void) {
    *(volatile uint32_t*)(APIC_BASE + APIC_EOI) = 0;
}

// I/O APIC Redirection Table Entry
struct ioapic_redir_entry {
    union {
        struct {
            uint8_t  vector;        // Interrupt vector (32-255)
            uint8_t  delvmode:3;    // Delivery mode (0=fixed, 1=lowest priority, ...)
            uint8_t  destmode:1;    // Destination mode (0=physical, 1=logical)
            uint8_t  delvstatus:1;  // Delivery status (read-only)
            uint8_t  polarity:1;    // 0=active high, 1=active low
            uint8_t  irr:1;         // Remote IRR (read-only)
            uint8_t  trigger:1;     // 0=edge, 1=level
            uint8_t  mask:1;        // 0=enabled, 1=masked
            uint32_t reserved:15;
            uint8_t  destination;   // APIC ID of target CPU(s)
        } fields;
        uint64_t raw;
    };
};

// Example: Route IRQ to lowest-priority CPU
void ioapic_route_irq(uint8_t irq, uint8_t vector) {
    struct ioapic_redir_entry entry = {0};

    entry.fields.vector      = vector;
    entry.fields.delvmode    = 1;     // Lowest priority delivery
    entry.fields.destmode    = 1;     // Logical destination
    entry.fields.trigger     = 1;     // Level-triggered (typical for PCI)
    entry.fields.destination = 0xFF;  // All CPUs in logical dest

    ioapic_write_redir(irq, entry);
}
```

I/O APIC Delivery Modes:

| Mode | Value | Description |
|---|---|---|
| Fixed | 0 | Deliver to CPU(s) in destination field |
| Lowest Priority | 1 | Deliver to CPU with lowest TPR |
| SMI | 2 | System Management Interrupt |
| NMI | 4 | Non-Maskable Interrupt |
| INIT | 5 | Initialization (reset CPU) |
| ExtINT | 7 | External interrupt (8259 compatibility) |
The 'Lowest Priority' delivery mode is elegant: the I/O APIC queries all target CPUs' TPR values and delivers to the one with lowest current priority. This provides automatic load balancing—idle CPUs (low TPR) receive more interrupts than busy ones (high TPR).
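A rough sketch of that selection rule follows. The cpu_tpr array is an assumed snapshot of each CPU's TPR, and real hardware resolves ties through its own arbitration logic, so treat this purely as an illustration of the idea:

```c
#include <stdint.h>

#define MAX_CPUS 8

/* Pick the CPU in 'dest_mask' with the lowest current task priority.
 * cpu_tpr holds a snapshot of each CPU's TPR. Ties go to the
 * lowest-numbered CPU. */
int lowest_priority_target(const uint8_t cpu_tpr[MAX_CPUS], uint8_t dest_mask)
{
    int best_cpu = -1;
    uint8_t best_tpr = 0;

    for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
        if (!(dest_mask & (1u << cpu)))
            continue;                       /* not in the logical destination */
        if (best_cpu < 0 || cpu_tpr[cpu] < best_tpr) {
            best_tpr = cpu_tpr[cpu];
            best_cpu = cpu;                 /* idle CPUs (low TPR) win */
        }
    }
    return best_cpu;                        /* -1 if the mask was empty */
}
```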
When a higher-priority interrupt arrives while a lower-priority handler is running, preemption occurs: the lower-priority handler is interrupted, the higher-priority handler runs, then control returns to the lower-priority handler. This creates nested interrupts.
Preemption Mechanics: when the higher-priority interrupt is accepted, the CPU pushes the state of the running handler onto the stack exactly as it would for interrupted application code, dispatches the new handler, and on its return pops the saved state so the lower-priority handler resumes where it left off.
Controlling Preemption:
Preemption can be controlled at multiple levels:
CPU Interrupt Flag (IF): When cleared, no maskable interrupts are delivered. This prevents all preemption but is heavy-handed.
APIC TPR: Setting the Task Priority Register blocks interrupts at or below that threshold; the CPU's effective current priority becomes the TPR value.
Handler Design: Handlers can explicitly enable interrupts (set IF) to allow preemption, or keep them disabled for atomicity.
```c
// Nested interrupt handling example

// Handler that allows higher-priority preemption
irqreturn_t preemptible_handler(int irq, void *dev_id) {
    // Phase 1: Critical section (no preemption)
    // Interrupts are automatically disabled on handler entry
    acknowledge_device_interrupt();
    capture_device_data();

    // Phase 2: Allow preemption for lengthy processing
    local_irq_enable();  // RE-ENABLE interrupts
    // WARNING: We can now be preempted by higher-priority interrupts!
    // All data structures must be in consistent state before this point.

    process_captured_data();  // Potentially lengthy

    // Phase 3: Prepare to return
    local_irq_disable();  // Disable before EOI

    return IRQ_HANDLED;
}

// Handler that prevents all preemption
irqreturn_t non_preemptible_handler(int irq, void *dev_id) {
    // Interrupts remain disabled throughout
    // This handler has exclusive CPU access
    // MUST BE VERY SHORT to avoid latency issues!

    uint32_t data = read_device_register();
    update_statistics(data);
    acknowledge_interrupt();

    // Approximately 1-5 microseconds total
    return IRQ_HANDLED;
}

// Implementation with priority management
irqreturn_t priority_aware_handler(int irq, void *dev_id) {
    // Save current priority and raise to our level
    unsigned long flags;
    uint8_t old_tpr = apic_read_tpr();

    // Set TPR to our priority - blocks equal and lower, allows higher
    uint8_t our_priority = irq_to_priority(irq);
    apic_set_tpr(our_priority);

    local_save_flags(flags);
    local_irq_enable();  // Allow higher-priority interrupts

    // Do work - can be preempted by higher priority
    handle_device();

    local_irq_restore(flags);
    apic_set_tpr(old_tpr);  // Restore previous priority level

    return IRQ_HANDLED;
}
```

Each nested interrupt requires stack space for saved context. Deeply nested interrupts can overflow the interrupt stack—typically only 4-8 KB. Systems must ensure worst-case nesting depth fits within stack limits. The use of IST (Interrupt Stack Table) per priority level can help by giving each level its own stack.
Priority inversion occurs when a high-priority task is effectively blocked by a low-priority task. This can happen in interrupt systems when handlers share resources with locks.
Classic Priority Inversion Scenario: a low-priority task holds a lock that a high-priority task needs, and a medium-priority task, which needs no lock at all, keeps preempting the low-priority holder, so the high-priority task ends up waiting on the medium-priority one.
The Famous Mars Pathfinder Case:
In 1997, the Mars Pathfinder spacecraft experienced repeated system resets due to priority inversion. A low-priority task held a shared data bus mutex, a high-priority task needed the mutex, but a medium-priority task kept preempting the low-priority task, preventing it from releasing the mutex. The watchdog timer would fire after the high-priority task missed its deadline, causing a system reset.
Solutions to Priority Inversion:
| Solution | Mechanism | Trade-offs |
|---|---|---|
| Priority Inheritance | Lock holder inherits highest waiter priority | Complexity, requires mutex tracking |
| Priority Ceiling | Lock raises priority to predefined ceiling | Simpler, but ceiling must be known in advance |
| Disable Interrupts | Hold lock only with interrupts disabled | Prevents all preemption—works but heavy-handed |
| Lock-Free Algorithms | Avoid locks entirely using atomic operations | Complex to implement correctly |
| Interrupt Threading | Convert to threaded handlers, use RT mutexes | Linux approach: RT-PREEMPT kernel |
```c
// Solutions to priority inversion in interrupt handlers

// Solution 1: Disable interrupts while holding lock
// Simple but prevents all interrupt handling during lock hold
void handle_with_irq_disabled(void) {
    unsigned long flags;

    spin_lock_irqsave(&data_lock, flags);       // Disable interrupts + lock
    // Critical section - no interrupt can run, no inversion possible
    modify_shared_data();
    spin_unlock_irqrestore(&data_lock, flags);  // Restore + unlock
}

// Solution 2: Use priority inheritance mutex (RT kernel)
// Lock holder automatically inherits waiting task's priority
#include <linux/mutex.h>

struct mutex rt_mutex;  // RT-PREEMPT kernel uses PI-aware mutexes

irqreturn_t threaded_irq_handler(int irq, void *dev_id) {
    // This runs as a kernel thread, not hard IRQ context
    // Can use sleeping locks with priority inheritance
    mutex_lock(&rt_mutex);    // If someone waits, they boost our priority
    process_data();
    mutex_unlock(&rt_mutex);  // Priority restored when we release
    return IRQ_HANDLED;
}

// Solution 3: Interrupt-aware spinlocks
// Different spinlock types for different contexts

// For data shared between process context and IRQ:
spinlock_t irq_data_lock;

// Process context uses spin_lock_irqsave (disables IRQs)
void process_context_access(void) {
    unsigned long flags;
    spin_lock_irqsave(&irq_data_lock, flags);
    // ...
    spin_unlock_irqrestore(&irq_data_lock, flags);
}

// IRQ context can use plain spin_lock (already in IRQ, can't be preempted by same IRQ)
irqreturn_t irq_handler(int irq, void *dev_id) {
    spin_lock(&irq_data_lock);
    // ...
    spin_unlock(&irq_data_lock);
    return IRQ_HANDLED;
}

// Solution 4: Lock-free data structures (where applicable)
#include <linux/llist.h>

// Lock-free linked list example
struct llist_head event_queue;

// IRQ can add without locking
irqreturn_t lockfree_irq_handler(int irq, void *dev_id) {
    struct event *evt = alloc_event_atomic();
    evt->data = read_device();
    llist_add(&evt->node, &event_queue);  // Lock-free!
    return IRQ_HANDLED;
}

// Consumer removes without blocking IRQs
void process_events(void) {
    struct llist_node *list = llist_del_all(&event_queue);
    // Process all events in 'list'
}
```

The PREEMPT_RT Linux kernel converts most interrupt handlers to threaded handlers. These run as kernel threads with real-time scheduling, and mutexes support priority inheritance. This largely eliminates traditional interrupt-context priority inversion, translating the problem into the well-understood domain of real-time process scheduling.
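Priority inheritance is not confined to the kernel: POSIX exposes it to user space through mutex protocol attributes. The sketch below shows the setup using pthread_mutexattr_setprotocol with PTHREAD_PRIO_INHERIT, which asks the kernel to boost whichever thread holds the mutex to the priority of its highest-priority waiter (error handling kept minimal):

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t shared_lock;

/* Create a mutex whose holder inherits the priority of its waiters,
 * closing the window exploited by classic priority inversion. */
static int init_pi_mutex(void)
{
    pthread_mutexattr_t attr;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    /* PTHREAD_PRIO_INHERIT: the holder runs at the highest priority of
     * any thread currently blocked on this mutex. */
    if (pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT) != 0)
        return -1;

    int rc = pthread_mutex_init(&shared_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc ? -1 : 0;
}

int main(void)
{
    if (init_pi_mutex() != 0) {
        fprintf(stderr, "PI mutex not supported on this system\n");
        return 1;
    }

    /* Threads of different priorities would contend for shared_lock here;
     * a low-priority holder is boosted while high-priority waiters block. */
    pthread_mutex_lock(&shared_lock);
    pthread_mutex_unlock(&shared_lock);
    pthread_mutex_destroy(&shared_lock);
    return 0;
}
```

Compile with -lpthread. This is the same class of fix that ended the Mars Pathfinder resets: priority inheritance was enabled on the offending mutex.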
In multi-processor systems, interrupt distribution involves choosing which CPU handles each interrupt. This isn't purely a priority concern, but affects interrupt latency and system throughput.
IRQ Affinity:
Each IRQ can be assigned an affinity mask—a bitmask indicating which CPUs are allowed to handle it. The I/O APIC uses this when routing interrupts.
Benefits of IRQ Affinity Control:
- Cache locality: a device's data stays warm on the CPU that consistently services it.
- Isolation: latency-sensitive or real-time CPUs can be shielded from interrupt noise.
- NUMA awareness: interrupts can be routed to CPUs on the same node as the device.
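The same affinity masks can be set programmatically. The snippet below is a user-space sketch, assuming a hypothetical IRQ number 25 and root privileges: it builds the hex bitmask from a list of CPU numbers and writes it to /proc/irq/<irq>/smp_affinity, the same interface the shell commands below use.

```c
#include <stdio.h>
#include <stdint.h>

/* Build an affinity bitmask from a list of CPU numbers and write it to
 * /proc/irq/<irq>/smp_affinity. Returns 0 on success, -1 on error.
 * Covers up to 64 CPUs; wider masks use comma-separated 32-bit words. */
int set_irq_affinity(unsigned int irq, const int *cpus, int ncpus)
{
    uint64_t mask = 0;
    for (int i = 0; i < ncpus; i++)
        mask |= 1ULL << cpus[i];          /* one bit per allowed CPU */

    char path[64];
    snprintf(path, sizeof path, "/proc/irq/%u/smp_affinity", irq);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;                        /* needs root; IRQ may not exist */

    int rc = fprintf(f, "%llx\n", (unsigned long long)mask) > 0 ? 0 : -1;
    fclose(f);
    return rc;
}

int main(void)
{
    int cpus[] = { 0, 2 };                /* allow CPUs 0 and 2 -> mask 0x5 */
    if (set_irq_affinity(25, cpus, 2) != 0)
        perror("set_irq_affinity");
    return 0;
}
```

The write requires root and is rejected by the kernel if the mask selects no usable CPU.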
```bash
# Managing IRQ affinity in Linux

# View current IRQ affinity
$ cat /proc/irq/25/smp_affinity
f          # Bitmask: 0xF = all 4 CPUs (binary: 1111)

# Set IRQ 25 to only run on CPU 2
$ echo 4 > /proc/irq/25/smp_affinity   # 0x4 = bit 2 = CPU 2

# View in a more readable format
$ cat /proc/irq/25/smp_affinity_list
2          # CPU 2

# Set to CPUs 0 and 2
$ echo 0,2 > /proc/irq/25/smp_affinity_list

# View all IRQ distributions
$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  25:         0       1892          0          0   PCI-MSI   ahci[0000:00:1f.2]
  26:         0          0      12487          0   PCI-MSI   nvme0q0
  27:         0          0          0      89432   PCI-MSI   nvme0q1

# Automatic IRQ balancing daemon
$ systemctl status irqbalance
● irqbalance.service - irqbalance daemon
     Active: active (running)

# irqbalance policy hints (/proc/irq/N/affinity_hint)
# Drivers can suggest ideal CPUs; irqbalance may follow
```

Per-Queue Interrupts (Modern NVMe/Network):
Modern devices like NVMe SSDs and high-speed network cards expose multiple queues, each with its own interrupt (typically an MSI-X vector). Routing each queue's interrupt to a different CPU means completions are processed on the core that issued the I/O, with no cross-CPU locking or cache-line bouncing.
This is why NVMe can handle millions of IOPS with minimal CPU overhead: each core handles its own queue independently.
| Strategy | Use Case | Implementation |
|---|---|---|
| Round-Robin | General purpose, fair distribution | irqbalance with default settings |
| CPU Isolation | Real-time, latency-sensitive | isolcpus kernel parameter + manual affinity |
| NUMA-Aware | Large memory systems | Pin IRQs to CPUs on same NUMA node as device |
| Per-Queue | High-throughput devices | RSS/MSI-X with per-CPU queues |
| Power-Aware | Mobile, battery systems | Consolidate to minimal CPUs during low load |
Linux includes the irqbalance daemon that automatically distributes IRQs across CPUs based on heuristics. It considers NUMA topology, power state, and interrupt load. For most systems, irqbalance provides reasonable defaults. High-performance or real-time systems often disable it in favor of manual, application-specific affinity settings.
Beyond hardware priority, Linux implements software-level priority for interrupt-related work. This is most visible in the softirq subsystem, where different types of deferred work have different priorities.
Softirq Priority Order:
```c
// Linux softirq priority order (from kernel source)
// Lower number = higher priority

enum {
    HI_SOFTIRQ = 0,      // High-priority tasklets (run first)
    TIMER_SOFTIRQ,       // Timer expiration callbacks
    NET_TX_SOFTIRQ,      // Network transmit processing
    NET_RX_SOFTIRQ,      // Network receive processing
    BLOCK_SOFTIRQ,       // Block device completion
    IRQ_POLL_SOFTIRQ,    // IRQ polling mode
    TASKLET_SOFTIRQ,     // Regular tasklets
    SCHED_SOFTIRQ,       // Scheduler load balancing
    HRTIMER_SOFTIRQ,     // High-resolution timers
    RCU_SOFTIRQ,         // RCU callbacks (always last)

    NR_SOFTIRQS
};

// Simplified sketch of the execution order in __do_softirq():
void __do_softirq(void)
{
    uint32_t pending = local_softirq_pending();
    struct softirq_action *h = softirq_vec;

    // Process in priority order (bit 0 first)
    while (pending) {
        if (pending & 1) {
            // Execute this softirq's handler
            h->action(h);
        }
        h++;
        pending >>= 1;
    }
}

// Priority implications:
// - HI_SOFTIRQ (0) runs before NET_RX_SOFTIRQ (3)
// - TIMER processing happens before network processing
// - RCU_SOFTIRQ (9) always runs last

// For tasklets, tasklet_hi_schedule() queues on HI_SOFTIRQ,
// giving higher priority for time-sensitive bottom-half work
DECLARE_TASKLET(my_tasklet, my_function);

tasklet_schedule(&my_tasklet);     // Uses TASKLET_SOFTIRQ (priority 6)
tasklet_hi_schedule(&my_tasklet);  // Uses HI_SOFTIRQ (priority 0)
```

Time Limits on Softirq Processing:
To prevent softirq handlers from monopolizing the CPU, Linux limits how long softirqs can run before yielding:
```bash
# Softirq processing limits

# Maximum time in microseconds
$ cat /proc/sys/net/core/netdev_budget_usecs
2000       # 2ms default

# Maximum packets per NAPI poll cycle
$ cat /proc/sys/net/core/netdev_budget
300

# After these limits, softirq processing yields to:
# 1. Check for pending higher-priority work
# 2. Allow scheduler to run waiting processes
# 3. Resume softirq processing (via ksoftirqd thread if heavy load)

# The ksoftirqd/N kernel threads:
$ ps aux | grep ksoftirqd
root         9  0.0  0.0      0     0 ?  S  Jan01  0:05  [ksoftirqd/0]
root        15  0.0  0.0      0     0 ?  S  Jan01  0:03  [ksoftirqd/1]
# When softirq load is high, these threads handle overflow at lower priority
```

Under extreme interrupt load, all softirq processing may be offloaded to the ksoftirqd kernel threads, which run at normal process priority. If the system is also under heavy CPU load from user processes, ksoftirqd may be starved, causing packet drops and poor I/O performance. Watch for high ksoftirqd CPU usage—it indicates the system is at or beyond its interrupt handling capacity.
We've explored how systems prioritize interrupts. The key takeaways:
- Hardware controllers enforce priority in silicon: the 8259 PIC uses a fixed IRQ order, while the APIC provides 256 vector-based priority levels and a per-CPU TPR threshold.
- Higher-priority interrupts can preempt lower-priority handlers, producing nested interrupts that must fit within limited interrupt-stack space.
- Priority inversion lets unrelated medium-priority work block critical tasks; priority inheritance, priority ceilings, and threaded handlers are the standard remedies.
- On SMP systems, IRQ affinity and per-queue (MSI-X) interrupts determine which CPU absorbs each interrupt's cost.
- Below the hardware, Linux orders deferred work through softirq priorities, bounded by processing budgets and the ksoftirqd threads.
What's Next:
With priority determining which interrupt gets serviced, the final piece is interrupt acknowledgment—how handlers signal completion to the interrupt controller, clearing the way for new interrupts to be delivered.
You now understand how systems prioritize competing interrupt requests—from hardware priority in interrupt controllers to software priority in deferred work mechanisms. This knowledge is essential for performance tuning, real-time system design, and debugging latency issues.