When an interrupt occurs—whether a timer tick, a keystroke, or a page fault—the CPU must perform an incredibly delicate operation: save the interrupted program's state, switch to a known-good stack when crossing privilege levels, locate and invoke the correct handler, and later resume the interrupted code exactly where it left off.
This entire process must be atomic (uninterruptible at critical points), fast (happening thousands of times per second), and transparent (the interrupted code must not notice, unless it's supposed to).
Understanding interrupt handling is essential for kernel development, debugging, and understanding how operating systems manage the boundary between hardware and software.
By the end of this page, you will understand the complete interrupt handling lifecycle: CPU state saving, stack frame construction, privilege level transitions, handler dispatch, and the IRET return sequence. You'll learn the critical differences between interrupt and exception handling, nested interrupts, and the security implications of improper handling.
The interrupt lifecycle can be divided into three phases: entry (CPU hardware), execution (software handler), and return (CPU hardware + IRET). Understanding each phase is crucial for writing correct interrupt handlers.
- Phase 1: Interrupt Entry (CPU Hardware)
- Phase 2: Handler Execution (Software)
- Phase 3: Interrupt Return (IRET Instruction)
When an interrupt occurs, the CPU must save enough state to resume execution later. The hardware automatically pushes a minimum set of registers onto the stack—this is the interrupt stack frame. Additional state must be saved by software if needed.
Automatic Hardware Save (x86-64 Long Mode):
The CPU pushes the following values in this order (remember: stack grows downward, so the first push is at the highest address):
| Offset from RSP | Value Pushed | Description |
|---|---|---|
| +40 | SS | Stack Segment (only if privilege change) |
| +32 | RSP | Stack Pointer (only if privilege change) |
| +24 | RFLAGS | Processor flags (IF, TF, etc.) |
| +16 | CS | Code Segment (includes CPL) |
| +8 | RIP | Instruction Pointer (return address) |
| +0 | Error Code | Only for exceptions that push one |

The offsets above assume an error code is present; for vectors that do not push one, RIP sits at +0 and each subsequent value is 8 bytes lower.
```c
#include <stdint.h>

// C structure representing the interrupt stack frame
// Matches the layout pushed by CPU hardware

// Frame as pushed for interrupts and exceptions without an error code
struct interrupt_frame {
    uint64_t rip;     // Return instruction pointer
    uint64_t cs;      // Code segment (with CPL in low 2 bits)
    uint64_t rflags;  // Processor flags
    uint64_t rsp;     // Stack pointer (from before interrupt)
    uint64_t ss;      // Stack segment
} __attribute__((packed));

// Raw stack layout for exceptions WITH an error code (page fault, GPF, etc.)
struct interrupt_frame_error {
    uint64_t error_code;  // Exception-specific error code
    uint64_t rip;         // Return instruction pointer
    uint64_t cs;          // Code segment
    uint64_t rflags;      // Processor flags
    uint64_t rsp;         // Stack pointer
    uint64_t ss;          // Stack segment
} __attribute__((packed));

// Handler function signatures (GCC/Clang x86-64)
// Note: with __attribute__((interrupt)), the compiler delivers the error
// code as a separate second parameter, not as part of the frame struct
__attribute__((interrupt))
void timer_handler(struct interrupt_frame *frame);

__attribute__((interrupt))
void page_fault_handler(struct interrupt_frame *frame, uint64_t error_code);
```

The CPU only saves the minimum needed for return. General-purpose registers (RAX, RBX, RCX, etc.), SIMD registers, and most segment registers are NOT saved by hardware. If your handler uses any of these, it MUST save and restore them manually. Failure to do so corrupts the interrupted program's state—a catastrophic and hard-to-debug bug.
```nasm
; Complete interrupt entry sequence with full register save
; Used when handler needs to access/modify full CPU state

; Macro to save all general-purpose registers
%macro SAVE_ALL 0
    push rax
    push rbx
    push rcx
    push rdx
    push rsi
    push rdi
    push rbp
    push r8
    push r9
    push r10
    push r11
    push r12
    push r13
    push r14
    push r15
%endmacro

%macro RESTORE_ALL 0
    pop r15
    pop r14
    pop r13
    pop r12
    pop r11
    pop r10
    pop r9
    pop r8
    pop rbp
    pop rdi
    pop rsi
    pop rdx
    pop rcx
    pop rbx
    pop rax
%endmacro

; Example interrupt handler entry point
timer_interrupt_entry:
    ; CPU has already pushed SS, RSP, RFLAGS, CS, RIP
    SAVE_ALL                ; Save all GP registers (120 bytes)

    ; At this point, stack has complete context
    ; RSP points to saved R15
    mov rdi, rsp            ; Pass pointer to saved context as argument
    call timer_handler_c    ; Call C handler

    RESTORE_ALL             ; Restore all GP registers
    iretq                   ; Return from interrupt
```

One of the most critical aspects of interrupt handling is managing privilege level transitions. When an interrupt occurs, the CPU may need to switch from a less privileged level (Ring 3/user mode) to a more privileged level (Ring 0/kernel mode). This transition involves additional security checks and stack switching.
The Current Privilege Level (CPL):
The CPL is stored in the low 2 bits of the CS register: 0 means Ring 0 (kernel mode) and 3 means Ring 3 (user mode); Rings 1 and 2 are rarely used. Reading CS therefore tells you the privilege level of the currently executing code.
Stack Switching on Privilege Change:
When transitioning from Ring 3 to Ring 0, the CPU cannot use the user-mode stack for security reasons—a malicious user program could manipulate the stack to corrupt kernel data or hijack execution. Instead, the CPU switches to a kernel stack.
The Task State Segment (TSS):
The TSS is a hardware data structure that stores the stack pointers for each privilege level. When an interrupt causes a privilege transition, the CPU reads the new RSP from the TSS:
```c
#include <stdint.h>
#include <string.h>

// Task State Segment for x86-64 Long Mode
// Much simpler than protected mode TSS—mainly for stack pointers
struct tss64 {
    uint32_t reserved0;    // Reserved, must be 0

    // Stack pointers loaded on privilege level change
    uint64_t rsp0;         // Ring 0 stack (used for Ring 3 → 0)
    uint64_t rsp1;         // Ring 1 stack (usually unused)
    uint64_t rsp2;         // Ring 2 stack (usually unused)

    uint64_t reserved1;    // Reserved

    // Interrupt Stack Table (IST)
    // Used for critical handlers that need a known-good stack
    uint64_t ist1;         // IST entry 1 (e.g., double fault)
    uint64_t ist2;         // IST entry 2 (e.g., NMI)
    uint64_t ist3;         // IST entry 3 (e.g., debug)
    uint64_t ist4;         // IST entry 4
    uint64_t ist5;         // IST entry 5
    uint64_t ist6;         // IST entry 6
    uint64_t ist7;         // IST entry 7

    uint64_t reserved2;    // Reserved
    uint16_t reserved3;    // Reserved
    uint16_t iopb_offset;  // I/O permission bitmap offset
} __attribute__((packed));

// Per-CPU TSS setup (each CPU needs its own)
void setup_tss(struct tss64 *tss, void *kernel_stack_top)
{
    memset(tss, 0, sizeof(struct tss64));

    // Set kernel stack for interrupts arriving from user mode
    tss->rsp0 = (uint64_t)kernel_stack_top;

    // Set IST entries for critical handlers
    tss->ist1 = (uint64_t)alloc_ist_stack();  // Double fault
    tss->ist2 = (uint64_t)alloc_ist_stack();  // NMI
    tss->ist3 = (uint64_t)alloc_ist_stack();  // Debug

    // No I/O permission bitmap (disable with offset beyond limit)
    tss->iopb_offset = sizeof(struct tss64);
}
```

The IST provides up to 7 dedicated stacks for specific interrupt handlers. This is critical for handlers that cannot trust the current stack—such as the double fault handler (stack may be corrupted) or NMI handler (may interrupt kernel code with inconsistent stack state). Each IDT entry can specify an IST entry (1-7) or 0 for normal stack switching.
The IDT entries that define interrupt handlers come in two primary flavors: Interrupt Gates and Trap Gates. Their critical difference lies in how they handle the Interrupt Flag (IF).
The Interrupt Flag (IF):
The IF bit (bit 9) in RFLAGS controls whether the CPU responds to maskable hardware interrupts (INTR): when IF=1 the CPU accepts them, and when IF=0 they are held pending until IF is set again. Non-maskable interrupts (NMI) and exceptions are delivered regardless of IF.
| Characteristic | Interrupt Gate | Trap Gate |
|---|---|---|
| IF Behavior | Clears IF (disables interrupts) | Leaves IF unchanged |
| Typical Use | Hardware interrupts, timer | System calls, breakpoints |
| Nested Interrupts | Prevented by default | Allowed by default |
| Handler Complexity | Simpler—no nesting concerns | Must handle potential nesting |
| Type Field Value | 0xE (64-bit interrupt gate) | 0xF (64-bit trap gate) |
Why Interrupt Gates Disable Interrupts:
Consider what happens if an interrupt handler is interrupted by another interrupt before it has saved state or acknowledged the device: handlers can nest without bound, each nesting level consumes more kernel stack until it overflows, and non-reentrant handler code can corrupt shared data structures.
Interrupt gates prevent this by atomically clearing IF when entering the handler. The handler executes to completion, issues EOI, and then re-enables interrupts (via IRET restoring RFLAGS with IF=1).
When Trap Gates are Appropriate:
System calls invoked via INT 0x80 often use trap gates because syscall handlers are comparatively long-running and are written to run with interrupts enabled; disabling all hardware interrupts for the duration of every system call would cripple responsiveness. (The SYSCALL instruction bypasses the IDT entirely and masks RFLAGS bits according to the IA32_FMASK MSR instead.)
```c
#include <stdint.h>

// IDT entry structure for x86-64
struct idt_entry {
    uint16_t offset_low;   // Handler offset bits 0-15
    uint16_t selector;     // Code segment selector
    uint8_t  ist;          // IST index (bits 0-2), zero bits (3-7)
    uint8_t  type_attr;    // Type and attributes
    uint16_t offset_mid;   // Handler offset bits 16-31
    uint32_t offset_high;  // Handler offset bits 32-63
    uint32_t reserved;     // Reserved, must be 0
} __attribute__((packed));

// Type attribute values (64-bit long mode)
#define IDT_TYPE_INTERRUPT_GATE 0x8E  // P=1, DPL=0, Type=0xE
#define IDT_TYPE_TRAP_GATE      0x8F  // P=1, DPL=0, Type=0xF
#define IDT_TYPE_USER_INTERRUPT 0xEE  // P=1, DPL=3, Type=0xE (for INT from user)
#define IDT_TYPE_USER_TRAP      0xEF  // P=1, DPL=3, Type=0xF

void set_idt_entry(struct idt_entry *entry, void (*handler)(void),
                   uint16_t selector, uint8_t type_attr, uint8_t ist)
{
    uint64_t offset = (uint64_t)handler;

    entry->offset_low  = offset & 0xFFFF;
    entry->offset_mid  = (offset >> 16) & 0xFFFF;
    entry->offset_high = (offset >> 32) & 0xFFFFFFFF;
    entry->selector    = selector;
    entry->ist         = ist & 0x7;   // Only bits 0-2
    entry->type_attr   = type_attr;
    entry->reserved    = 0;
}

// Example: register handlers during IDT initialization
void idt_setup_examples(void)
{
    // Timer interrupt (interrupt gate, disables IF)
    set_idt_entry(&idt[32], timer_handler, KERNEL_CS,
                  IDT_TYPE_INTERRUPT_GATE, 0);

    // Syscall via INT 0x80 (trap gate, preserves IF; DPL=3 for user access)
    set_idt_entry(&idt[0x80], syscall_handler, KERNEL_CS,
                  IDT_TYPE_USER_TRAP, 0);
}
```

With 256 possible interrupt vectors, the kernel needs an efficient mechanism to dispatch interrupts to the appropriate handlers. Several architectural approaches exist.
Direct Vector Handlers:
The simplest approach assigns each vector its own entry in the IDT, pointing directly to the handler code. Simple but inflexible—each handler has independent code.
Stub-Based Dispatch:

A more elegant approach uses small stubs that push the vector number, then jump to a common handler:
```nasm
; Interrupt stub macros - generate small entry points
; that push vector number and jump to common handler

; For exceptions WITHOUT error code
%macro ISR_NOERR 1
isr_stub_%1:
    push 0      ; Dummy error code for uniform stack frame
    push %1     ; Push interrupt vector number
    jmp common_interrupt_handler
%endmacro

; For exceptions WITH error code (already pushed by CPU)
%macro ISR_ERR 1
isr_stub_%1:
    push %1     ; Push interrupt vector number
    jmp common_interrupt_handler
%endmacro

; Generate stubs for all exception vectors
ISR_NOERR 0    ; Divide Error
ISR_NOERR 1    ; Debug
ISR_NOERR 2    ; NMI
ISR_NOERR 3    ; Breakpoint
ISR_NOERR 4    ; Overflow
ISR_NOERR 5    ; Bound Range
ISR_NOERR 6    ; Invalid Opcode
ISR_NOERR 7    ; Device Not Available
ISR_ERR   8    ; Double Fault (error code = 0)
ISR_NOERR 9    ; Coprocessor Segment (reserved)
ISR_ERR   10   ; Invalid TSS
ISR_ERR   11   ; Segment Not Present
ISR_ERR   12   ; Stack Fault
ISR_ERR   13   ; General Protection Fault
ISR_ERR   14   ; Page Fault
ISR_NOERR 15   ; Reserved
; ... continue for all vectors

; Common handler - receives all interrupts
common_interrupt_handler:
    ; Save all general-purpose registers
    push rax
    push rbx
    push rcx
    push rdx
    push rsi
    push rdi
    push rbp
    push r8
    push r9
    push r10
    push r11
    push r12
    push r13
    push r14
    push r15

    ; Pass pointer to saved state
    mov rdi, rsp

    ; Call C interrupt dispatcher
    call interrupt_dispatch

    ; Restore all registers
    pop r15
    pop r14
    pop r13
    pop r12
    pop r11
    pop r10
    pop r9
    pop r8
    pop rbp
    pop rdi
    pop rsi
    pop rdx
    pop rcx
    pop rbx
    pop rax

    ; Remove vector number and error code
    add rsp, 16

    ; Return from interrupt
    iretq
```

The IRET (Interrupt Return) instruction is the counterpart to the interrupt entry sequence. It reverses everything the CPU did when entering the handler, restoring the interrupted context and resuming execution.
IRET vs Normal RET:
A normal RET instruction only pops RIP—it cannot change privilege levels or restore flags. IRET is special:
| Step | Action | Security Implications |
|---|---|---|
| 1 | Pop RIP (return address) | Checked against segment limits |
| 2 | Pop CS (code segment) | CPL derived from CS, checked for valid transition |
| 3 | Pop RFLAGS | IF restored, IOPL may change based on CPL |
| 4 | Pop RSP (if CPL changes) | Only for Ring 0 → Ring 3 |
| 5 | Pop SS (if CPL changes) | Validates SS selector |
IRET performs extensive validation. The popped CS must be valid for the target CPL. If returning to Ring 3, RSP and SS must also be popped (cannot leave kernel stack accessible to user code). RFLAGS changes are restricted—user mode cannot set privileged flags. Bugs in constructing the interrupt frame can cause security vulnerabilities.
```c
// Using IRET to switch to user mode
// Commonly used to start the first user process

void switch_to_user_mode(void *user_entry, void *user_stack)
{
    // Construct a fake interrupt frame on the kernel stack
    // that IRET will pop to enter user mode.
    // The frame must be ordered as IRET expects:
    // SS, RSP, RFLAGS, CS, RIP (top of stack = RIP)
    asm volatile (
        "cli\n\t"                  // Disable interrupts during setup
        "pushq $0x23\n\t"          // SS: user data segment, RPL=3
        "pushq %0\n\t"             // User RSP
        "pushfq\n\t"               // Push current RFLAGS...
        "popq %%rax\n\t"
        "orq $0x200, %%rax\n\t"    // ...with IF set (enable interrupts in user mode)
        "pushq %%rax\n\t"
        "pushq $0x1B\n\t"          // CS: user code segment, RPL=3
        "pushq %1\n\t"             // User RIP (entry point)
        // IRET pops RIP, CS, RFLAGS, RSP, SS:
        // transitions to Ring 3 and jumps to user_entry
        "iretq\n\t"
        :
        : "r"(user_stack), "r"(user_entry)
        : "rax"
    );

    // Unreachable - we're now in user mode
    __builtin_unreachable();
}

// IRET can also be used to perform context switches
void context_switch(struct task_state *new_task)
{
    // Point the stack at the new task's saved interrupt frame,
    // restore its registers, and IRET to its execution point
    asm volatile (
        "movq %0, %%rsp\n\t"       // Switch to new task's stack
        "popq %%r15\n\t"           // Restore GP registers
        "popq %%r14\n\t"           // ... restore all registers ...
        "popq %%rax\n\t"
        "addq $16, %%rsp\n\t"      // Skip vector number and error code
        "iretq\n\t"                // Return to new task
        :
        : "r"(new_task->kernel_stack_pointer)
    );
}
```

Nested interrupts occur when an interrupt handler is itself interrupted by another interrupt. This can happen when using trap gates (which preserve IF) or when handlers explicitly re-enable interrupts.
Why Allow Nested Interrupts?
Some interrupt handlers take significant time: a network interrupt may have dozens of packets to drain, a disk interrupt may complete a large transfer, and so on.
Blocking all interrupts during these handlers harms system responsiveness. High-priority interrupts (like a hardware failure NMI) should preempt lower-priority handlers.
Linux's Split Handler Model:
Linux addresses these challenges by splitting interrupt handling into two parts:
Top Half (Hardirq): runs immediately in interrupt context with interrupts disabled. It does only the urgent, minimal work: acknowledge the device, capture volatile data, and schedule the deferred work.
Bottom Half (Softirq/Tasklet/Workqueue): runs later with interrupts enabled, in a context that new interrupts can preempt. It performs the time-consuming processing (protocol handling, buffer management) without holding up the rest of the system.
```c
// Example: Network interrupt using split handler model

// Top half - runs in interrupt context, interrupts disabled
irqreturn_t network_interrupt_handler(int irq, void *dev_id)
{
    struct net_device *dev = dev_id;
    uint32_t status = read_device_status(dev);

    if (!(status & INTERRUPT_PENDING))
        return IRQ_NONE;  // Not our interrupt (shared IRQ)

    // Acknowledge interrupt to device immediately
    write_device_register(dev, STATUS_REG, status);

    // Quick check: do we have received packets?
    if (status & RX_COMPLETE) {
        // Disable further RX interrupts (we'll poll in softirq)
        disable_rx_interrupt(dev);

        // Schedule NAPI softirq for packet processing
        napi_schedule(&dev->napi);
    }

    // Handle TX completion inline (fast)
    if (status & TX_COMPLETE)
        reclaim_tx_buffers_fast(dev);

    return IRQ_HANDLED;
}

// Bottom half - runs in softirq context, interrupts enabled
int network_poll(struct napi_struct *napi, int budget)
{
    struct net_device *dev = container_of(napi, struct net_device, napi);
    int packets_processed = 0;

    // Process up to 'budget' packets
    while (packets_processed < budget) {
        struct packet *pkt = dequeue_rx_packet(dev);
        if (!pkt)
            break;

        // Process packet (can take time)
        process_packet(dev, pkt);
        packets_processed++;
    }

    // If we processed all available packets, re-enable interrupts
    if (packets_processed < budget) {
        napi_complete(napi);
        enable_rx_interrupt(dev);
    }

    return packets_processed;
}
```

High-speed networks can generate millions of packets per second. Using traditional interrupt-per-packet handling would cause 'livelock'—the CPU spends all time handling interrupts, no time processing packets. NAPI uses interrupt coalescing: the first packet triggers an interrupt, which disables further interrupts and schedules polling. The bottom half polls until no packets remain, then re-enables interrupts.
We've explored the complete interrupt handling lifecycle—from the moment an interrupt is recognized through handler execution and return. This mechanism forms the foundation of all OS-hardware and OS-application interactions.
What's Next:
Now that we understand how interrupts are handled, we'll examine how the CPU finds the correct handler. The next page covers the Interrupt Vector Table (IVT) and Interrupt Descriptor Table (IDT)—the data structures that map interrupt vectors to handler addresses.
You now understand interrupt handling: the CPU's state-saving mechanism, privilege transitions, stack switching, gate types, and the critical IRET instruction. This knowledge is essential for kernel development and understanding how operating systems respond to events. Next, we'll explore the data structures that organize interrupt handlers.