In any real computer system, multiple I/O devices compete for CPU attention simultaneously. A keystroke arrives while a network packet completes while a disk read finishes—all at the same instant. The system must decide: Which interrupt gets serviced first?
This isn't an academic question. Wrong prioritization can cause audio to stutter (missed samples), network connections to drop (buffer overflow), or real-time control systems to fail catastrophically (missed deadlines). Interrupt priority is the mechanism that ensures critical events receive timely attention while less urgent work waits its turn.
By the end of this page, you will understand why interrupt priority is necessary, see how priority is implemented at both the hardware (interrupt controller) and software (handler) levels, master priority-based preemption, and learn how priority inversion and other pathologies are handled.
To understand why priority matters, consider what happens without it. In a simple first-come-first-served model, interrupts are serviced strictly in arrival order, so a slow, low-urgency handler (say, keyboard input) that happens to start first delays a timer tick or even a machine check waiting behind it.
This is unacceptable for any system with time-sensitive requirements. The solution is priority: assigning relative importance to different interrupt sources, ensuring that more critical events can preempt less critical ones.
| Interrupt Source | Urgency | Consequence of Delay |
|---|---|---|
| Machine Check (hardware failure) | Critical | Data corruption, system damage |
| Timer interrupt | Very High | Scheduler jitter, time inaccuracy |
| Real-time audio I/O | High | Audible clicks/pops in playback |
| Network packet arrival | Medium-High | Buffer overflow, dropped packets |
| Disk I/O completion | Medium | Slower application response |
| USB device event | Low-Medium | User perceives minor lag |
| Keyboard/mouse input | Low | User perceives minor lag |
The Priority Principle:
A higher-priority interrupt should be able to preempt a lower-priority handler—interrupting the interrupter. This creates a priority order where critical events are always serviced promptly, regardless of what other interrupts are being handled.
Priority can be implemented at multiple levels: in the interrupt controller hardware (fixed or programmable IRQ priorities), in the CPU (priority thresholds such as the APIC's Task Priority Register), and in software (the order in which the kernel runs deferred interrupt work).
Priority systems inherently risk starvation: if high-priority interrupts arrive continuously, low-priority ones never get serviced. Good system design uses priority for urgency but includes mechanisms (timeouts, round-robin within a level) to prevent complete starvation of low-priority sources.
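As a concrete illustration of both ideas, here is a minimal, self-contained sketch (not modeled on any particular controller) of a dispatcher that always serves the highest-priority pending level, but rotates service order within a level so equal-priority sources cannot starve each other:

```c
#include <stdint.h>

#define NUM_LEVELS   8   /* priority levels, 0 = highest */
#define SRC_PER_LVL  4   /* interrupt sources per level  */

/* Pending bitmaps: one bitmask of sources per priority level. */
static uint8_t pending[NUM_LEVELS];
/* Round-robin cursor per level, so equal-priority sources take turns. */
static uint8_t rr_cursor[NUM_LEVELS];

/* Returns (level << 4) | source for the next interrupt to service,
 * or -1 if nothing is pending. The highest-priority level wins
 * outright; within a level, service rotates to prevent starvation. */
int pick_next_interrupt(void)
{
    for (int level = 0; level < NUM_LEVELS; level++) {
        if (!pending[level])
            continue;                     /* nothing waiting at this level */

        /* Start scanning just past the last source serviced at this level. */
        for (int i = 0; i < SRC_PER_LVL; i++) {
            int src = (rr_cursor[level] + i) % SRC_PER_LVL;
            if (pending[level] & (1u << src)) {
                pending[level] &= ~(1u << src);             /* claim it */
                rr_cursor[level] = (src + 1) % SRC_PER_LVL; /* rotate   */
                return (level << 4) | src;
            }
        }
    }
    return -1;  /* idle */
}
```

Hardware controllers implement the same policy in logic rather than loops; the 8259's optional rotating-priority modes are the classic example of fairness within a level.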
The Intel 8259 Programmable Interrupt Controller (PIC) was the de facto standard for x86 interrupt handling for nearly 30 years. Though obsolete in modern systems, understanding the PIC illuminates fundamental priority concepts still used today.
PIC Architecture:
The PC/AT design used two 8259 chips cascaded: a master handling IRQ0-7 and a slave handling IRQ8-15, wired into the master's IRQ2 line.
Total: 15 usable IRQ lines (IRQ2 used for cascade)
```
; Traditional PC/AT IRQ Assignments with 8259 PICs

Master PIC (IRQ 0-7):
  IRQ0:  System Timer (highest priority)
  IRQ1:  Keyboard
  IRQ2:  Cascade from Slave PIC
  IRQ3:  COM2 (serial port)
  IRQ4:  COM1 (serial port)
  IRQ5:  LPT2 or Sound Card
  IRQ6:  Floppy Disk Controller
  IRQ7:  LPT1 (parallel port) / Spurious (lowest priority)

Slave PIC (IRQ 8-15):
  IRQ8:  Real Time Clock (RTC)
  IRQ9:  ACPI / Available
  IRQ10: Available
  IRQ11: Available
  IRQ12: PS/2 Mouse
  IRQ13: FPU / Coprocessor
  IRQ14: Primary IDE
  IRQ15: Secondary IDE

Priority order (highest to lowest):
  IRQ0 > IRQ1 > IRQ8-15 > IRQ3 > IRQ4 > IRQ5 > IRQ6 > IRQ7

Note: Slave PIC IRQs (8-15) all appear at master IRQ2's priority level,
then are further prioritized within the slave.
```

PIC Priority Scheme:
The 8259 uses a fixed priority scheme by default: IRQ0 has highest priority, IRQ7 the lowest. The currently-serviced interrupt level blocks all equal or lower priority interrupts.
- In-Service Register (ISR): Tracks which interrupts are currently being handled
- Interrupt Mask Register (IMR): Allows software to disable specific IRQs
When an interrupt arrives, the PIC compares its priority with any levels already recorded in the ISR. If nothing of equal or higher priority is in service and the line is not masked in the IMR, the PIC raises INTR; during the CPU's acknowledge cycle it supplies the vector and sets the corresponding ISR bit. The handler must finish by sending an End-of-Interrupt (EOI) command, which clears the ISR bit and reopens that priority level.
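To make the resolution step concrete, here is a small model of the fixed-priority rule. It is only an illustration of the policy, not the chip's actual logic: the irr parameter models the pending-request bits (the chip's Interrupt Request Register), and imr and isr correspond to the registers described above.

```c
#include <stdint.h>

/* Model of 8259 fixed-priority resolution (IRQ0 highest, IRQ7 lowest).
 * irr: pending requests, imr: masked lines, isr: levels in service.
 * Returns the IRQ to deliver, or -1 if nothing is deliverable. */
int pic_resolve(uint8_t irr, uint8_t imr, uint8_t isr)
{
    uint8_t candidates = irr & ~imr;   /* pending and not masked */

    for (int irq = 0; irq < 8; irq++) {
        uint8_t bit = 1u << irq;

        /* An in-service level at equal or higher priority blocks delivery. */
        if (isr & bit)
            return -1;

        if (candidates & bit)
            return irq;                /* highest-priority deliverable IRQ */
    }
    return -1;
}
```

Actual configuration of the controller happens through its initialization and operation command words, shown next.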
```c
// Programming the 8259 PIC

#define PIC1_CMD    0x20   // Master PIC command port
#define PIC1_DATA   0x21   // Master PIC data port
#define PIC2_CMD    0xA0   // Slave PIC command port
#define PIC2_DATA   0xA1   // Slave PIC data port

// Initialization Command Words (ICW)
#define ICW1_INIT   0x10   // Initialization
#define ICW1_ICW4   0x01   // ICW4 needed
#define ICW4_8086   0x01   // 8086/88 mode

// Operation Command Words (OCW)
#define OCW2_EOI    0x20   // Non-specific End-of-Interrupt
#define OCW2_SEOI   0x60   // Specific EOI (add IRQ number)

// Remap PICs to avoid conflict with CPU exceptions (vectors 0-31)
void pic_remap(uint8_t master_base, uint8_t slave_base) {
    uint8_t master_mask, slave_mask;

    // Save current masks
    master_mask = inb(PIC1_DATA);
    slave_mask  = inb(PIC2_DATA);

    // ICW1: Start initialization sequence
    outb(PIC1_CMD, ICW1_INIT | ICW1_ICW4);
    outb(PIC2_CMD, ICW1_INIT | ICW1_ICW4);

    // ICW2: Set vector offset
    outb(PIC1_DATA, master_base);  // Master: vectors master_base to master_base+7
    outb(PIC2_DATA, slave_base);   // Slave: vectors slave_base to slave_base+7

    // ICW3: Tell PICs about cascade
    outb(PIC1_DATA, 0x04);  // Master: slave on IRQ2 (bit 2)
    outb(PIC2_DATA, 0x02);  // Slave: cascade identity 2

    // ICW4: Set mode
    outb(PIC1_DATA, ICW4_8086);
    outb(PIC2_DATA, ICW4_8086);

    // Restore saved masks
    outb(PIC1_DATA, master_mask);
    outb(PIC2_DATA, slave_mask);
}

// Send End-of-Interrupt signal
void pic_send_eoi(uint8_t irq) {
    if (irq >= 8) {
        // IRQ from slave: must EOI both slave and master
        outb(PIC2_CMD, OCW2_EOI);
    }
    outb(PIC1_CMD, OCW2_EOI);
}

// Mask (disable) a specific IRQ
void pic_mask_irq(uint8_t irq) {
    uint16_t port = (irq < 8) ? PIC1_DATA : PIC2_DATA;
    uint8_t mask = inb(port);
    mask |= (1 << (irq & 7));
    outb(port, mask);
}

// Unmask (enable) a specific IRQ
void pic_unmask_irq(uint8_t irq) {
    uint16_t port = (irq < 8) ? PIC1_DATA : PIC2_DATA;
    uint8_t mask = inb(port);
    mask &= ~(1 << (irq & 7));
    outb(port, mask);
}
```

The 8259 PIC has severe limitations for modern systems: only 15 IRQ lines (forcing devices to share), broadcast EOI (can't easily target a specific IRQ), no MSI support, and poor SMP (multi-processor) support. Modern systems use the APIC, but PIC understanding remains valuable for legacy code, embedded systems, and interview questions.
The Advanced Programmable Interrupt Controller (APIC) architecture replaced the 8259 PIC, addressing its limitations and adding essential features for multi-processor systems.
APIC Components:
- Local APIC: one per CPU core; accepts interrupts, applies the priority (TPR) check, and handles inter-processor interrupts.
- I/O APIC: shared by the system; collects interrupt lines from devices and routes each to a target Local APIC through its redirection table.
This two-tier architecture enables per-CPU interrupt handling and sophisticated routing.
APIC Priority Mechanism:
The APIC implements a 256-level priority scheme. Each interrupt vector has an inherent priority based on its vector number:
Priority Class = Vector / 16 (0-15)
Priority Subclass = Vector % 16 (0-15)
Higher vector numbers = higher priority. Vectors 0-31 (CPU exceptions) have lowest priority class, device interrupts at vectors 32-239 have medium priority, and vectors 240-255 have highest priority.
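The class/subclass arithmetic and the TPR check can be written out directly. The snippet below is a stand-alone illustration of that rule (it does not touch real APIC registers); following the description above, a vector is deliverable only when its priority class exceeds the class held in the TPR.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

static inline uint8_t priority_class(uint8_t vector)    { return vector >> 4;  }
static inline uint8_t priority_subclass(uint8_t vector) { return vector & 0xF; }

/* An interrupt is deliverable only if its priority class is strictly
 * greater than the class programmed in the Task Priority Register. */
static inline bool deliverable(uint8_t vector, uint8_t tpr)
{
    return priority_class(vector) > priority_class(tpr);
}

int main(void)
{
    uint8_t tpr = 0x30;                       /* blocks classes 0-3 */
    uint8_t vectors[] = { 0x21, 0x35, 0x80, 0xF0 };

    for (unsigned i = 0; i < sizeof vectors / sizeof vectors[0]; i++)
        printf("vector 0x%02X: class %u, subclass %u, %s\n",
               vectors[i],
               priority_class(vectors[i]),
               priority_subclass(vectors[i]),
               deliverable(vectors[i], tpr) ? "deliverable" : "blocked");
    return 0;
}
```

The listing below shows the same mechanism at the register level, along with the I/O APIC redirection entry that routes an IRQ to a CPU.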
```c
// APIC Priority Mechanism

// Priority classes (based on vector number / 16)
// Class 0-1:  Reserved for CPU exceptions (vectors 0-31)
// Class 2-14: Device interrupt range (vectors 32-239)
// Class 15:   Highest priority (vectors 240-255)

// Task Priority Register (TPR) - Local APIC register
// Bits 7:4 = Task Priority Class
// Bits 3:0 = Task Priority Subclass
// The APIC will only deliver interrupts with priority > TPR

#define APIC_TPR 0x80   // Task Priority Register offset

// Read current TPR
static inline uint8_t apic_read_tpr(void) {
    return *(volatile uint32_t*)(APIC_BASE + APIC_TPR) & 0xFF;
}

// Set TPR to block interrupts below a threshold
static inline void apic_set_tpr(uint8_t priority) {
    *(volatile uint32_t*)(APIC_BASE + APIC_TPR) = priority;
}

// Example: Block all device interrupts during critical section
void critical_section_enter(void) {
    // Set TPR to priority class 15 (block all below 240)
    // This blocks device IRQs but allows IPI at highest vectors
    apic_set_tpr(0xF0);  // Priority = 15 (vectors 240-255 can still arrive)
}

void critical_section_exit(void) {
    apic_set_tpr(0);  // Allow all priorities
}

// End of Interrupt (EOI) for APIC
#define APIC_EOI 0xB0

static inline void apic_eoi(void) {
    *(volatile uint32_t*)(APIC_BASE + APIC_EOI) = 0;
}

// I/O APIC Redirection Table Entry
struct ioapic_redir_entry {
    union {
        struct {
            uint8_t  vector;        // Interrupt vector (32-255)
            uint8_t  delvmode:3;    // Delivery mode (0=fixed, 1=lowest priority, ...)
            uint8_t  destmode:1;    // Destination mode (0=physical, 1=logical)
            uint8_t  delvstatus:1;  // Delivery status (read-only)
            uint8_t  polarity:1;    // 0=active high, 1=active low
            uint8_t  irr:1;         // Remote IRR (read-only)
            uint8_t  trigger:1;     // 0=edge, 1=level
            uint8_t  mask:1;        // 0=enabled, 1=masked
            uint32_t reserved:15;
            uint8_t  destination;   // APIC ID of target CPU(s)
        } fields;
        uint64_t raw;
    };
};

// Example: Route IRQ to lowest-priority CPU
void ioapic_route_irq(uint8_t irq, uint8_t vector) {
    struct ioapic_redir_entry entry = {0};

    entry.fields.vector      = vector;
    entry.fields.delvmode    = 1;     // Lowest priority delivery
    entry.fields.destmode    = 1;     // Logical destination
    entry.fields.trigger     = 1;     // Level-triggered (typical for PCI)
    entry.fields.destination = 0xFF;  // All CPUs in logical dest

    ioapic_write_redir(irq, entry);
}
```

I/O APIC Delivery Modes:

| Mode | Value | Description |
|---|---|---|
| Fixed | 0 | Deliver to CPU(s) in destination field |
| Lowest Priority | 1 | Deliver to CPU with lowest TPR |
| SMI | 2 | System Management Interrupt |
| NMI | 4 | Non-Maskable Interrupt |
| INIT | 5 | Initialization (reset CPU) |
| ExtINT | 7 | External interrupt (8259 compatibility) |
The 'Lowest Priority' delivery mode is elegant: the I/O APIC queries all target CPUs' TPR values and delivers to the one with lowest current priority. This provides automatic load balancing—idle CPUs (low TPR) receive more interrupts than busy ones (high TPR).
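A rough sketch of that selection rule follows. The cpu_tpr array is an assumed snapshot of each CPU's TPR, and real hardware resolves ties through its own arbitration logic, so treat this purely as an illustration of the idea:

```c
#include <stdint.h>

#define MAX_CPUS 8

/* Pick the CPU in 'dest_mask' with the lowest current task priority.
 * cpu_tpr holds a snapshot of each CPU's TPR. Ties go to the
 * lowest-numbered CPU. */
int lowest_priority_target(const uint8_t cpu_tpr[MAX_CPUS], uint8_t dest_mask)
{
    int best_cpu = -1;
    uint8_t best_tpr = 0;

    for (int cpu = 0; cpu < MAX_CPUS; cpu++) {
        if (!(dest_mask & (1u << cpu)))
            continue;                       /* not in the logical destination */
        if (best_cpu < 0 || cpu_tpr[cpu] < best_tpr) {
            best_tpr = cpu_tpr[cpu];
            best_cpu = cpu;                 /* idle CPUs (low TPR) win */
        }
    }
    return best_cpu;                        /* -1 if the mask was empty */
}
```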
When a higher-priority interrupt arrives while a lower-priority handler is running, preemption occurs: the lower-priority handler is interrupted, the higher-priority handler runs, then control returns to the lower-priority handler. This creates nested interrupts.
Preemption Mechanics: when the higher-priority interrupt is accepted, the CPU pushes the state of the running handler onto the stack exactly as it would for interrupted application code, dispatches the new handler, and on its return pops the saved state so the lower-priority handler resumes where it left off.
Controlling Preemption:
Preemption can be controlled at multiple levels:
CPU Interrupt Flag (IF): When cleared, no maskable interrupts are delivered. This prevents all preemption but is heavy-handed.
APIC TPR: Setting the Task Priority Register blocks interrupts at or below that threshold; the CPU's effective current priority becomes the TPR value.
Handler Design: Handlers can explicitly enable interrupts (set IF) to allow preemption, or keep them disabled for atomicity.
```c
// Nested interrupt handling example

// Handler that allows higher-priority preemption
irqreturn_t preemptible_handler(int irq, void *dev_id) {
    // Phase 1: Critical section (no preemption)
    // Interrupts are automatically disabled on handler entry
    acknowledge_device_interrupt();
    capture_device_data();

    // Phase 2: Allow preemption for lengthy processing
    local_irq_enable();  // RE-ENABLE interrupts
    // WARNING: We can now be preempted by higher-priority interrupts!
    // All data structures must be in consistent state before this point.

    process_captured_data();  // Potentially lengthy

    // Phase 3: Prepare to return
    local_irq_disable();  // Disable before EOI

    return IRQ_HANDLED;
}

// Handler that prevents all preemption
irqreturn_t non_preemptible_handler(int irq, void *dev_id) {
    // Interrupts remain disabled throughout
    // This handler has exclusive CPU access
    // MUST BE VERY SHORT to avoid latency issues!

    uint32_t data = read_device_register();
    update_statistics(data);
    acknowledge_interrupt();

    // Approximately 1-5 microseconds total
    return IRQ_HANDLED;
}

// Implementation with priority management
irqreturn_t priority_aware_handler(int irq, void *dev_id) {
    // Save current priority and raise to our level
    unsigned long flags;
    uint8_t old_tpr = apic_read_tpr();

    // Set TPR to our priority - blocks equal and lower, allows higher
    uint8_t our_priority = irq_to_priority(irq);
    apic_set_tpr(our_priority);

    local_save_flags(flags);
    local_irq_enable();  // Allow higher-priority interrupts

    // Do work - can be preempted by higher priority
    handle_device();

    local_irq_restore(flags);
    apic_set_tpr(old_tpr);  // Restore previous priority level

    return IRQ_HANDLED;
}
```

Each nested interrupt requires stack space for saved context. Deeply nested interrupts can overflow the interrupt stack—typically only 4-8 KB. Systems must ensure worst-case nesting depth fits within stack limits. The use of IST (Interrupt Stack Table) per priority level can help by giving each level its own stack.
Priority inversion occurs when a high-priority task is effectively blocked by a low-priority task. This can happen in interrupt systems when handlers share resources with locks.
Classic Priority Inversion Scenario: a low-priority task holds a lock that a high-priority task needs, and a medium-priority task, which needs no lock at all, keeps preempting the low-priority holder, so the high-priority task ends up waiting on the medium-priority one.
The Famous Mars Pathfinder Case:
In 1997, the Mars Pathfinder spacecraft experienced repeated system resets due to priority inversion. A low-priority task held a shared data bus mutex, a high-priority task needed the mutex, but a medium-priority task kept preempting the low-priority task, preventing it from releasing the mutex. The watchdog timer would fire after the high-priority task missed its deadline, causing a system reset.
Solutions to Priority Inversion:
| Solution | Mechanism | Trade-offs |
|---|---|---|
| Priority Inheritance | Lock holder inherits highest waiter priority | Complexity, requires mutex tracking |
| Priority Ceiling | Lock raises priority to predefined ceiling | Simpler, but ceiling must be known in advance |
| Disable Interrupts | Hold lock only with interrupts disabled | Prevents all preemption—works but heavy-handed |
| Lock-Free Algorithms | Avoid locks entirely using atomic operations | Complex to implement correctly |
| Interrupt Threading | Convert to threaded handlers, use RT mutexes | Linux approach: RT-PREEMPT kernel |
```c
// Solutions to priority inversion in interrupt handlers

// Solution 1: Disable interrupts while holding lock
// Simple but prevents all interrupt handling during lock hold
void handle_with_irq_disabled(void) {
    unsigned long flags;

    spin_lock_irqsave(&data_lock, flags);       // Disable interrupts + lock
    // Critical section - no interrupt can run, no inversion possible
    modify_shared_data();
    spin_unlock_irqrestore(&data_lock, flags);  // Restore + unlock
}

// Solution 2: Use priority inheritance mutex (RT kernel)
// Lock holder automatically inherits waiting task's priority
#include <linux/mutex.h>

struct mutex rt_mutex;  // RT-PREEMPT kernel uses PI-aware mutexes

irqreturn_t threaded_irq_handler(int irq, void *dev_id) {
    // This runs as a kernel thread, not hard IRQ context
    // Can use sleeping locks with priority inheritance
    mutex_lock(&rt_mutex);    // If someone waits, they boost our priority
    process_data();
    mutex_unlock(&rt_mutex);  // Priority restored when we release
    return IRQ_HANDLED;
}

// Solution 3: Interrupt-aware spinlocks
// Different spinlock types for different contexts

// For data shared between process context and IRQ:
spinlock_t irq_data_lock;

// Process context uses spin_lock_irqsave (disables IRQs)
void process_context_access(void) {
    unsigned long flags;
    spin_lock_irqsave(&irq_data_lock, flags);
    // ...
    spin_unlock_irqrestore(&irq_data_lock, flags);
}

// IRQ context can use plain spin_lock (already in IRQ, can't be preempted by same IRQ)
irqreturn_t irq_handler(int irq, void *dev_id) {
    spin_lock(&irq_data_lock);
    // ...
    spin_unlock(&irq_data_lock);
    return IRQ_HANDLED;
}

// Solution 4: Lock-free data structures (where applicable)
#include <linux/llist.h>

// Lock-free linked list example
struct llist_head event_queue;

// IRQ can add without locking
irqreturn_t lockfree_irq_handler(int irq, void *dev_id) {
    struct event *evt = alloc_event_atomic();
    evt->data = read_device();
    llist_add(&evt->node, &event_queue);  // Lock-free!
    return IRQ_HANDLED;
}

// Consumer removes without blocking IRQs
void process_events(void) {
    struct llist_node *list = llist_del_all(&event_queue);
    // Process all events in 'list'
}
```

The PREEMPT_RT Linux kernel converts most interrupt handlers to threaded handlers. These run as kernel threads with real-time scheduling, and mutexes support priority inheritance. This largely eliminates traditional interrupt-context priority inversion, translating the problem into the well-understood domain of real-time process scheduling.
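Priority inheritance is not confined to the kernel: POSIX exposes it to user space through mutex protocol attributes. The sketch below shows the setup using pthread_mutexattr_setprotocol with PTHREAD_PRIO_INHERIT, which asks the kernel to boost whichever thread holds the mutex to the priority of its highest-priority waiter (error handling kept minimal):

```c
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t shared_lock;

/* Create a mutex whose holder inherits the priority of its waiters,
 * closing the window exploited by classic priority inversion. */
static int init_pi_mutex(void)
{
    pthread_mutexattr_t attr;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    /* PTHREAD_PRIO_INHERIT: the holder runs at the highest priority of
     * any thread currently blocked on this mutex. */
    if (pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT) != 0)
        return -1;

    int rc = pthread_mutex_init(&shared_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc ? -1 : 0;
}

int main(void)
{
    if (init_pi_mutex() != 0) {
        fprintf(stderr, "PI mutex not supported on this system\n");
        return 1;
    }

    /* Threads of different priorities would contend for shared_lock here;
     * a low-priority holder is boosted while high-priority waiters block. */
    pthread_mutex_lock(&shared_lock);
    pthread_mutex_unlock(&shared_lock);
    pthread_mutex_destroy(&shared_lock);
    return 0;
}
```

Compile with -lpthread. This is the same class of fix that ended the Mars Pathfinder resets: priority inheritance was enabled on the offending mutex.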
In multi-processor systems, interrupt distribution involves choosing which CPU handles each interrupt. This isn't purely a priority concern, but affects interrupt latency and system throughput.
IRQ Affinity:
Each IRQ can be assigned an affinity mask—a bitmask indicating which CPUs are allowed to handle it. The I/O APIC uses this when routing interrupts.
Benefits of IRQ Affinity Control:
- Cache locality: a device's data stays warm on the CPU that consistently services it.
- Isolation: latency-sensitive or real-time CPUs can be shielded from interrupt noise.
- NUMA awareness: interrupts can be routed to CPUs on the same node as the device.
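The same affinity masks can be set programmatically. The snippet below is a user-space sketch, assuming a hypothetical IRQ number 25 and root privileges: it builds the hex bitmask from a list of CPU numbers and writes it to /proc/irq/<irq>/smp_affinity, the same interface the shell commands below use.

```c
#include <stdio.h>
#include <stdint.h>

/* Build an affinity bitmask from a list of CPU numbers and write it to
 * /proc/irq/<irq>/smp_affinity. Returns 0 on success, -1 on error.
 * Covers up to 64 CPUs; wider masks use comma-separated 32-bit words. */
int set_irq_affinity(unsigned int irq, const int *cpus, int ncpus)
{
    uint64_t mask = 0;
    for (int i = 0; i < ncpus; i++)
        mask |= 1ULL << cpus[i];          /* one bit per allowed CPU */

    char path[64];
    snprintf(path, sizeof path, "/proc/irq/%u/smp_affinity", irq);

    FILE *f = fopen(path, "w");
    if (!f)
        return -1;                        /* needs root; IRQ may not exist */

    int rc = fprintf(f, "%llx\n", (unsigned long long)mask) > 0 ? 0 : -1;
    fclose(f);
    return rc;
}

int main(void)
{
    int cpus[] = { 0, 2 };                /* allow CPUs 0 and 2 -> mask 0x5 */
    if (set_irq_affinity(25, cpus, 2) != 0)
        perror("set_irq_affinity");
    return 0;
}
```

The write requires root and is rejected by the kernel if the mask selects no usable CPU.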
```bash
# Managing IRQ affinity in Linux

# View current IRQ affinity
$ cat /proc/irq/25/smp_affinity
f          # Bitmask: 0xF = all 4 CPUs (binary: 1111)

# Set IRQ 25 to only run on CPU 2
$ echo 4 > /proc/irq/25/smp_affinity   # 0x4 = bit 2 = CPU 2

# View in a more readable format
$ cat /proc/irq/25/smp_affinity_list
2          # CPU 2

# Set to CPUs 0 and 2
$ echo 0,2 > /proc/irq/25/smp_affinity_list

# View all IRQ distributions
$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  25:         0       1892          0          0   PCI-MSI   ahci[0000:00:1f.2]
  26:         0          0      12487          0   PCI-MSI   nvme0q0
  27:         0          0          0      89432   PCI-MSI   nvme0q1

# Automatic IRQ balancing daemon
$ systemctl status irqbalance
● irqbalance.service - irqbalance daemon
     Active: active (running)

# irqbalance policy hints (/proc/irq/N/affinity_hint)
# Drivers can suggest ideal CPUs; irqbalance may follow
```

Per-Queue Interrupts (Modern NVMe/Network):
Modern devices like NVMe SSDs and high-speed network cards expose multiple queues, each with its own interrupt (typically an MSI-X vector). Routing each queue's interrupt to a different CPU means completions are processed on the core that issued the I/O, with no cross-CPU locking or cache-line bouncing.
This is why NVMe can handle millions of IOPS with minimal CPU overhead: each core handles its own queue independently.
| Strategy | Use Case | Implementation |
|---|---|---|
| Round-Robin | General purpose, fair distribution | irqbalance with default settings |
| CPU Isolation | Real-time, latency-sensitive | isolcpus kernel parameter + manual affinity |
| NUMA-Aware | Large memory systems | Pin IRQs to CPUs on same NUMA node as device |
| Per-Queue | High-throughput devices | RSS/MSI-X with per-CPU queues |
| Power-Aware | Mobile, battery systems | Consolidate to minimal CPUs during low load |
Linux includes the irqbalance daemon that automatically distributes IRQs across CPUs based on heuristics. It considers NUMA topology, power state, and interrupt load. For most systems, irqbalance provides reasonable defaults. High-performance or real-time systems often disable it in favor of manual, application-specific affinity settings.
Beyond hardware priority, Linux implements software-level priority for interrupt-related work. This is most visible in the softirq subsystem, where different types of deferred work have different priorities.
Softirq Priority Order:
```c
// Linux softirq priority order (from kernel source)
// Lower number = higher priority

enum {
    HI_SOFTIRQ = 0,      // High-priority tasklets (run first)
    TIMER_SOFTIRQ,       // Timer expiration callbacks
    NET_TX_SOFTIRQ,      // Network transmit processing
    NET_RX_SOFTIRQ,      // Network receive processing
    BLOCK_SOFTIRQ,       // Block device completion
    IRQ_POLL_SOFTIRQ,    // IRQ polling mode
    TASKLET_SOFTIRQ,     // Regular tasklets
    SCHED_SOFTIRQ,       // Scheduler load balancing
    HRTIMER_SOFTIRQ,     // High-resolution timers
    RCU_SOFTIRQ,         // RCU callbacks (always last)

    NR_SOFTIRQS
};

// Simplified sketch of the execution order in __do_softirq():
void __do_softirq(void)
{
    uint32_t pending = local_softirq_pending();
    struct softirq_action *h = softirq_vec;

    // Process in priority order (bit 0 first)
    while (pending) {
        if (pending & 1) {
            // Execute this softirq's handler
            h->action(h);
        }
        h++;
        pending >>= 1;
    }
}

// Priority implications:
// - HI_SOFTIRQ (0) runs before NET_RX_SOFTIRQ (3)
// - TIMER processing happens before network processing
// - RCU_SOFTIRQ (9) always runs last

// For tasklets, tasklet_hi_schedule() queues on HI_SOFTIRQ,
// giving higher priority for time-sensitive bottom-half work
DECLARE_TASKLET(my_tasklet, my_function);

tasklet_schedule(&my_tasklet);     // Uses TASKLET_SOFTIRQ (priority 6)
tasklet_hi_schedule(&my_tasklet);  // Uses HI_SOFTIRQ (priority 0)
```

Time Limits on Softirq Processing:
To prevent softirq handlers from monopolizing the CPU, Linux limits how long softirqs can run before yielding:
```bash
# Softirq processing limits

# Maximum time in microseconds
$ cat /proc/sys/net/core/netdev_budget_usecs
2000       # 2ms default

# Maximum packets per NAPI poll cycle
$ cat /proc/sys/net/core/netdev_budget
300

# After these limits, softirq processing yields to:
# 1. Check for pending higher-priority work
# 2. Allow scheduler to run waiting processes
# 3. Resume softirq processing (via ksoftirqd thread if heavy load)

# The ksoftirqd/N kernel threads:
$ ps aux | grep ksoftirqd
root         9  0.0  0.0      0     0 ?  S  Jan01  0:05  [ksoftirqd/0]
root        15  0.0  0.0      0     0 ?  S  Jan01  0:03  [ksoftirqd/1]
# When softirq load is high, these threads handle overflow at lower priority
```

Under extreme interrupt load, all softirq processing may be offloaded to the ksoftirqd kernel threads, which run at normal process priority. If the system is also under heavy CPU load from user processes, ksoftirqd may be starved, causing packet drops and poor I/O performance. Watch for high ksoftirqd CPU usage—it indicates the system is at or beyond its interrupt handling capacity.
We've explored how systems prioritize interrupts. The key takeaways:
- Hardware controllers enforce priority in silicon: the 8259 PIC uses a fixed IRQ order, while the APIC provides 256 vector-based priority levels and a per-CPU TPR threshold.
- Higher-priority interrupts can preempt lower-priority handlers, producing nested interrupts that must fit within limited interrupt-stack space.
- Priority inversion lets unrelated medium-priority work block critical tasks; priority inheritance, priority ceilings, and threaded handlers are the standard remedies.
- On SMP systems, IRQ affinity and per-queue (MSI-X) interrupts determine which CPU absorbs each interrupt's cost.
- Below the hardware, Linux orders deferred work through softirq priorities, bounded by processing budgets and the ksoftirqd threads.
What's Next:
With priority determining which interrupt gets serviced, the final piece is interrupt acknowledgment—how handlers signal completion to the interrupt controller, clearing the way for new interrupts to be delivered.
You now understand how systems prioritize competing interrupt requests—from hardware priority in interrupt controllers to software priority in deferred work mechanisms. This knowledge is essential for performance tuning, real-time system design, and debugging latency issues.