When an interrupt occurs—whether a timer tick, a keystroke, or a page fault—the CPU must perform an incredibly delicate operation: save the interrupted program's state, switch to a known-good stack when crossing privilege levels, locate and invoke the correct handler, and later resume the interrupted code exactly where it left off.
This entire process must be atomic (uninterruptible at critical points), fast (happening thousands of times per second), and transparent (the interrupted code must not notice, unless it's supposed to).
Understanding interrupt handling is essential for kernel development, debugging, and understanding how operating systems manage the boundary between hardware and software.
By the end of this page, you will understand the complete interrupt handling lifecycle: CPU state saving, stack frame construction, privilege level transitions, handler dispatch, and the IRET return sequence. You'll learn the critical differences between interrupt and exception handling, nested interrupts, and the security implications of improper handling.
The interrupt lifecycle can be divided into three phases: entry (CPU hardware), execution (software handler), and return (CPU hardware + IRET). Understanding each phase is crucial for writing correct interrupt handlers.
- Phase 1: Interrupt Entry (CPU Hardware)
- Phase 2: Handler Execution (Software)
- Phase 3: Interrupt Return (IRET Instruction)
When an interrupt occurs, the CPU must save enough state to resume execution later. The hardware automatically pushes a minimum set of registers onto the stack—this is the interrupt stack frame. Additional state must be saved by software if needed.
Automatic Hardware Save (x86-64 Long Mode):
The CPU pushes the following values in this order (remember: stack grows downward, so the first push is at the highest address):
| Offset from RSP | Value Pushed | Description |
|---|---|---|
| +40 | SS | Stack Segment (only if privilege change) |
| +32 | RSP | Stack Pointer (only if privilege change) |
| +24 | RFLAGS | Processor flags (IF, TF, etc.) |
| +16 | CS | Code Segment (includes CPL) |
| +8 | RIP | Instruction Pointer (return address) |
| +0 | Error Code | Only for exceptions that push one |

The offsets above assume an error code is present; for vectors that do not push one, RIP sits at +0 and each subsequent value is 8 bytes lower.
```c
#include <stdint.h>

// C structure representing the interrupt stack frame
// Matches the layout pushed by CPU hardware

// Frame as pushed for interrupts and exceptions without an error code
struct interrupt_frame {
    uint64_t rip;     // Return instruction pointer
    uint64_t cs;      // Code segment (with CPL in low 2 bits)
    uint64_t rflags;  // Processor flags
    uint64_t rsp;     // Stack pointer (from before interrupt)
    uint64_t ss;      // Stack segment
} __attribute__((packed));

// Raw stack layout for exceptions WITH an error code (page fault, GPF, etc.)
struct interrupt_frame_error {
    uint64_t error_code;  // Exception-specific error code
    uint64_t rip;         // Return instruction pointer
    uint64_t cs;          // Code segment
    uint64_t rflags;      // Processor flags
    uint64_t rsp;         // Stack pointer
    uint64_t ss;          // Stack segment
} __attribute__((packed));

// Handler function signatures (GCC/Clang x86-64)
// Note: with __attribute__((interrupt)), the compiler delivers the error
// code as a separate second parameter, not as part of the frame struct
__attribute__((interrupt))
void timer_handler(struct interrupt_frame *frame);

__attribute__((interrupt))
void page_fault_handler(struct interrupt_frame *frame, uint64_t error_code);
```

The CPU only saves the minimum needed for return. General-purpose registers (RAX, RBX, RCX, etc.), SIMD registers, and most segment registers are NOT saved by hardware. If your handler uses any of these, it MUST save and restore them manually. Failure to do so corrupts the interrupted program's state—a catastrophic and hard-to-debug bug.
```nasm
; Complete interrupt entry sequence with full register save
; Used when handler needs to access/modify full CPU state

; Macro to save all general-purpose registers
%macro SAVE_ALL 0
    push rax
    push rbx
    push rcx
    push rdx
    push rsi
    push rdi
    push rbp
    push r8
    push r9
    push r10
    push r11
    push r12
    push r13
    push r14
    push r15
%endmacro

%macro RESTORE_ALL 0
    pop r15
    pop r14
    pop r13
    pop r12
    pop r11
    pop r10
    pop r9
    pop r8
    pop rbp
    pop rdi
    pop rsi
    pop rdx
    pop rcx
    pop rbx
    pop rax
%endmacro

; Example interrupt handler entry point
timer_interrupt_entry:
    ; CPU has already pushed SS, RSP, RFLAGS, CS, RIP
    SAVE_ALL                ; Save all GP registers (120 bytes)

    ; At this point, stack has complete context
    ; RSP points to saved R15
    mov rdi, rsp            ; Pass pointer to saved context as argument
    call timer_handler_c    ; Call C handler

    RESTORE_ALL             ; Restore all GP registers
    iretq                   ; Return from interrupt
```

One of the most critical aspects of interrupt handling is managing privilege level transitions. When an interrupt occurs, the CPU may need to switch from a less privileged level (Ring 3/user mode) to a more privileged level (Ring 0/kernel mode). This transition involves additional security checks and stack switching.
The Current Privilege Level (CPL):
The CPL is stored in the low 2 bits of the CS register: 0 means Ring 0 (kernel mode) and 3 means Ring 3 (user mode); Rings 1 and 2 are rarely used. Reading CS therefore tells you the privilege level of the currently executing code.
Stack Switching on Privilege Change:
When transitioning from Ring 3 to Ring 0, the CPU cannot use the user-mode stack for security reasons—a malicious user program could manipulate the stack to corrupt kernel data or hijack execution. Instead, the CPU switches to a kernel stack.
The Task State Segment (TSS):
The TSS is a hardware data structure that stores the stack pointers for each privilege level. When an interrupt causes a privilege transition, the CPU reads the new RSP from the TSS:
```c
#include <stdint.h>
#include <string.h>

// Task State Segment for x86-64 Long Mode
// Much simpler than protected mode TSS—mainly for stack pointers
struct tss64 {
    uint32_t reserved0;    // Reserved, must be 0

    // Stack pointers loaded on privilege level change
    uint64_t rsp0;         // Ring 0 stack (used for Ring 3 → 0)
    uint64_t rsp1;         // Ring 1 stack (usually unused)
    uint64_t rsp2;         // Ring 2 stack (usually unused)

    uint64_t reserved1;    // Reserved

    // Interrupt Stack Table (IST)
    // Used for critical handlers that need a known-good stack
    uint64_t ist1;         // IST entry 1 (e.g., double fault)
    uint64_t ist2;         // IST entry 2 (e.g., NMI)
    uint64_t ist3;         // IST entry 3 (e.g., debug)
    uint64_t ist4;         // IST entry 4
    uint64_t ist5;         // IST entry 5
    uint64_t ist6;         // IST entry 6
    uint64_t ist7;         // IST entry 7

    uint64_t reserved2;    // Reserved
    uint16_t reserved3;    // Reserved
    uint16_t iopb_offset;  // I/O permission bitmap offset
} __attribute__((packed));

// Per-CPU TSS setup (each CPU needs its own)
void setup_tss(struct tss64 *tss, void *kernel_stack_top)
{
    memset(tss, 0, sizeof(struct tss64));

    // Set kernel stack for interrupts arriving from user mode
    tss->rsp0 = (uint64_t)kernel_stack_top;

    // Set IST entries for critical handlers
    tss->ist1 = (uint64_t)alloc_ist_stack();  // Double fault
    tss->ist2 = (uint64_t)alloc_ist_stack();  // NMI
    tss->ist3 = (uint64_t)alloc_ist_stack();  // Debug

    // No I/O permission bitmap (disable with offset beyond limit)
    tss->iopb_offset = sizeof(struct tss64);
}
```

The IST provides up to 7 dedicated stacks for specific interrupt handlers. This is critical for handlers that cannot trust the current stack—such as the double fault handler (stack may be corrupted) or NMI handler (may interrupt kernel code with inconsistent stack state). Each IDT entry can specify an IST entry (1-7) or 0 for normal stack switching.
The IDT entries that define interrupt handlers come in two primary flavors: Interrupt Gates and Trap Gates. Their critical difference lies in how they handle the Interrupt Flag (IF).
The Interrupt Flag (IF):
The IF bit (bit 9) in RFLAGS controls whether the CPU responds to maskable hardware interrupts (INTR): when IF=1 the CPU accepts them, and when IF=0 they are held pending until IF is set again. Non-maskable interrupts (NMI) and exceptions are delivered regardless of IF.
| Characteristic | Interrupt Gate | Trap Gate |
|---|---|---|
| IF Behavior | Clears IF (disables interrupts) | Leaves IF unchanged |
| Typical Use | Hardware interrupts, timer | System calls, breakpoints |
| Nested Interrupts | Prevented by default | Allowed by default |
| Handler Complexity | Simpler—no nesting concerns | Must handle potential nesting |
| Type Field Value | 0xE (64-bit interrupt gate) | 0xF (64-bit trap gate) |
Why Interrupt Gates Disable Interrupts:
Consider what happens if an interrupt handler is interrupted by another interrupt before it has saved state or acknowledged the device: handlers can nest without bound, each nesting level consumes more kernel stack until it overflows, and non-reentrant handler code can corrupt shared data structures.
Interrupt gates prevent this by atomically clearing IF when entering the handler. The handler executes to completion, issues EOI, and then re-enables interrupts (via IRET restoring RFLAGS with IF=1).
When Trap Gates are Appropriate:
System calls invoked via INT 0x80 often use trap gates because syscall handlers are comparatively long-running and are written to run with interrupts enabled; disabling all hardware interrupts for the duration of every system call would cripple responsiveness. (The SYSCALL instruction bypasses the IDT entirely and masks RFLAGS bits according to the IA32_FMASK MSR instead.)
```c
#include <stdint.h>

// IDT entry structure for x86-64
struct idt_entry {
    uint16_t offset_low;   // Handler offset bits 0-15
    uint16_t selector;     // Code segment selector
    uint8_t  ist;          // IST index (bits 0-2), zero bits (3-7)
    uint8_t  type_attr;    // Type and attributes
    uint16_t offset_mid;   // Handler offset bits 16-31
    uint32_t offset_high;  // Handler offset bits 32-63
    uint32_t reserved;     // Reserved, must be 0
} __attribute__((packed));

// Type attribute values (64-bit long mode)
#define IDT_TYPE_INTERRUPT_GATE 0x8E  // P=1, DPL=0, Type=0xE
#define IDT_TYPE_TRAP_GATE      0x8F  // P=1, DPL=0, Type=0xF
#define IDT_TYPE_USER_INTERRUPT 0xEE  // P=1, DPL=3, Type=0xE (for INT from user)
#define IDT_TYPE_USER_TRAP      0xEF  // P=1, DPL=3, Type=0xF

void set_idt_entry(struct idt_entry *entry, void (*handler)(void),
                   uint16_t selector, uint8_t type_attr, uint8_t ist)
{
    uint64_t offset = (uint64_t)handler;

    entry->offset_low  = offset & 0xFFFF;
    entry->offset_mid  = (offset >> 16) & 0xFFFF;
    entry->offset_high = (offset >> 32) & 0xFFFFFFFF;
    entry->selector    = selector;
    entry->ist         = ist & 0x7;   // Only bits 0-2
    entry->type_attr   = type_attr;
    entry->reserved    = 0;
}

// Example: register handlers during IDT initialization
void idt_setup_examples(void)
{
    // Timer interrupt (interrupt gate, disables IF)
    set_idt_entry(&idt[32], timer_handler, KERNEL_CS,
                  IDT_TYPE_INTERRUPT_GATE, 0);

    // Syscall via INT 0x80 (trap gate, preserves IF; DPL=3 for user access)
    set_idt_entry(&idt[0x80], syscall_handler, KERNEL_CS,
                  IDT_TYPE_USER_TRAP, 0);
}
```

With 256 possible interrupt vectors, the kernel needs an efficient mechanism to dispatch interrupts to the appropriate handlers. Several architectural approaches exist.
Direct Vector Handlers:
The simplest approach assigns each vector its own entry in the IDT, pointing directly to the handler code. Simple but inflexible—each handler has independent code.
Stub-Based Dispatch:

A more elegant approach uses small stubs that push the vector number, then jump to a common handler:
```nasm
; Interrupt stub macros - generate small entry points
; that push vector number and jump to common handler

; For exceptions WITHOUT error code
%macro ISR_NOERR 1
isr_stub_%1:
    push 0      ; Dummy error code for uniform stack frame
    push %1     ; Push interrupt vector number
    jmp common_interrupt_handler
%endmacro

; For exceptions WITH error code (already pushed by CPU)
%macro ISR_ERR 1
isr_stub_%1:
    push %1     ; Push interrupt vector number
    jmp common_interrupt_handler
%endmacro

; Generate stubs for all exception vectors
ISR_NOERR 0    ; Divide Error
ISR_NOERR 1    ; Debug
ISR_NOERR 2    ; NMI
ISR_NOERR 3    ; Breakpoint
ISR_NOERR 4    ; Overflow
ISR_NOERR 5    ; Bound Range
ISR_NOERR 6    ; Invalid Opcode
ISR_NOERR 7    ; Device Not Available
ISR_ERR   8    ; Double Fault (error code = 0)
ISR_NOERR 9    ; Coprocessor Segment (reserved)
ISR_ERR   10   ; Invalid TSS
ISR_ERR   11   ; Segment Not Present
ISR_ERR   12   ; Stack Fault
ISR_ERR   13   ; General Protection Fault
ISR_ERR   14   ; Page Fault
ISR_NOERR 15   ; Reserved
; ... continue for all vectors

; Common handler - receives all interrupts
common_interrupt_handler:
    ; Save all general-purpose registers
    push rax
    push rbx
    push rcx
    push rdx
    push rsi
    push rdi
    push rbp
    push r8
    push r9
    push r10
    push r11
    push r12
    push r13
    push r14
    push r15

    ; Pass pointer to saved state
    mov rdi, rsp

    ; Call C interrupt dispatcher
    call interrupt_dispatch

    ; Restore all registers
    pop r15
    pop r14
    pop r13
    pop r12
    pop r11
    pop r10
    pop r9
    pop r8
    pop rbp
    pop rdi
    pop rsi
    pop rdx
    pop rcx
    pop rbx
    pop rax

    ; Remove vector number and error code
    add rsp, 16

    ; Return from interrupt
    iretq
```

The IRET (Interrupt Return) instruction is the counterpart to the interrupt entry sequence. It reverses everything the CPU did when entering the handler, restoring the interrupted context and resuming execution.
IRET vs Normal RET:
A normal RET instruction only pops RIP—it cannot change privilege levels or restore flags. IRET is special:
| Step | Action | Security Implications |
|---|---|---|
| 1 | Pop RIP (return address) | Checked against segment limits |
| 2 | Pop CS (code segment) | CPL derived from CS, checked for valid transition |
| 3 | Pop RFLAGS | IF restored, IOPL may change based on CPL |
| 4 | Pop RSP (if CPL changes) | Only for Ring 0 → Ring 3 |
| 5 | Pop SS (if CPL changes) | Validates SS selector |
IRET performs extensive validation. The popped CS must be valid for the target CPL. If returning to Ring 3, RSP and SS must also be popped (cannot leave kernel stack accessible to user code). RFLAGS changes are restricted—user mode cannot set privileged flags. Bugs in constructing the interrupt frame can cause security vulnerabilities.
```c
// Using IRET to switch to user mode
// Commonly used to start the first user process

void switch_to_user_mode(void *user_entry, void *user_stack)
{
    // Construct a fake interrupt frame on the kernel stack
    // that IRET will pop to enter user mode.
    // The frame must be ordered as IRET expects:
    // SS, RSP, RFLAGS, CS, RIP (top of stack = RIP)
    asm volatile (
        "cli\n\t"                  // Disable interrupts during setup
        "pushq $0x23\n\t"          // SS: user data segment, RPL=3
        "pushq %0\n\t"             // User RSP
        "pushfq\n\t"               // Push current RFLAGS...
        "popq %%rax\n\t"
        "orq $0x200, %%rax\n\t"    // ...with IF set (enable interrupts in user mode)
        "pushq %%rax\n\t"
        "pushq $0x1B\n\t"          // CS: user code segment, RPL=3
        "pushq %1\n\t"             // User RIP (entry point)
        // IRET pops RIP, CS, RFLAGS, RSP, SS:
        // transitions to Ring 3 and jumps to user_entry
        "iretq\n\t"
        :
        : "r"(user_stack), "r"(user_entry)
        : "rax"
    );

    // Unreachable - we're now in user mode
    __builtin_unreachable();
}

// IRET can also be used to perform context switches
void context_switch(struct task_state *new_task)
{
    // Point the stack at the new task's saved interrupt frame,
    // restore its registers, and IRET to its execution point
    asm volatile (
        "movq %0, %%rsp\n\t"       // Switch to new task's stack
        "popq %%r15\n\t"           // Restore GP registers
        "popq %%r14\n\t"           // ... restore all registers ...
        "popq %%rax\n\t"
        "addq $16, %%rsp\n\t"      // Skip vector number and error code
        "iretq\n\t"                // Return to new task
        :
        : "r"(new_task->kernel_stack_pointer)
    );
}
```

Nested interrupts occur when an interrupt handler is itself interrupted by another interrupt. This can happen when using trap gates (which preserve IF) or when handlers explicitly re-enable interrupts.
Why Allow Nested Interrupts?
Some interrupt handlers take significant time: a network interrupt may have dozens of packets to drain, a disk interrupt may complete a large transfer, and so on.
Blocking all interrupts during these handlers harms system responsiveness. High-priority interrupts (like a hardware failure NMI) should preempt lower-priority handlers.
Linux's Split Handler Model:
Linux addresses these challenges by splitting interrupt handling into two parts:
Top Half (Hardirq): runs immediately in interrupt context with interrupts disabled. It does only the urgent, minimal work: acknowledge the device, capture volatile data, and schedule the deferred work.
Bottom Half (Softirq/Tasklet/Workqueue): runs later with interrupts enabled, in a context that new interrupts can preempt. It performs the time-consuming processing (protocol handling, buffer management) without holding up the rest of the system.
```c
// Example: Network interrupt using split handler model

// Top half - runs in interrupt context, interrupts disabled
irqreturn_t network_interrupt_handler(int irq, void *dev_id)
{
    struct net_device *dev = dev_id;
    uint32_t status = read_device_status(dev);

    if (!(status & INTERRUPT_PENDING))
        return IRQ_NONE;  // Not our interrupt (shared IRQ)

    // Acknowledge interrupt to device immediately
    write_device_register(dev, STATUS_REG, status);

    // Quick check: do we have received packets?
    if (status & RX_COMPLETE) {
        // Disable further RX interrupts (we'll poll in softirq)
        disable_rx_interrupt(dev);

        // Schedule NAPI softirq for packet processing
        napi_schedule(&dev->napi);
    }

    // Handle TX completion inline (fast)
    if (status & TX_COMPLETE)
        reclaim_tx_buffers_fast(dev);

    return IRQ_HANDLED;
}

// Bottom half - runs in softirq context, interrupts enabled
int network_poll(struct napi_struct *napi, int budget)
{
    struct net_device *dev = container_of(napi, struct net_device, napi);
    int packets_processed = 0;

    // Process up to 'budget' packets
    while (packets_processed < budget) {
        struct packet *pkt = dequeue_rx_packet(dev);
        if (!pkt)
            break;

        // Process packet (can take time)
        process_packet(dev, pkt);
        packets_processed++;
    }

    // If we processed all available packets, re-enable interrupts
    if (packets_processed < budget) {
        napi_complete(napi);
        enable_rx_interrupt(dev);
    }

    return packets_processed;
}
```

High-speed networks can generate millions of packets per second. Using traditional interrupt-per-packet handling would cause 'livelock'—the CPU spends all time handling interrupts, no time processing packets. NAPI uses interrupt coalescing: the first packet triggers an interrupt, which disables further interrupts and schedules polling. The bottom half polls until no packets remain, then re-enables interrupts.
We've explored the complete interrupt handling lifecycle—from the moment an interrupt is recognized through handler execution and return. This mechanism forms the foundation of all OS-hardware and OS-application interactions.
What's Next:
Now that we understand how interrupts are handled, we'll examine how the CPU finds the correct handler. The next page covers the Interrupt Vector Table (IVT) and Interrupt Descriptor Table (IDT)—the data structures that map interrupt vectors to handler addresses.
You now understand interrupt handling: the CPU's state-saving mechanism, privilege transitions, stack switching, gate types, and the critical IRET instruction. This knowledge is essential for kernel development and understanding how operating systems respond to events. Next, we'll explore the data structures that organize interrupt handlers.