Operating Systems: System Calls & API

System Call Mechanism

Level: Intermediate

Duration: 75 mins

Topic: System Calls & API

Trap Instruction

The Deliberate Exception

In the previous page, we explored how the CPU enforces the user-kernel boundary and allows transitions only through controlled entry points. Now we examine the deliberate mechanism by which user code requests these transitions: the trap instruction.

Unlike exceptions caused by errors (divide by zero, page faults) or asynchronous hardware interrupts (keyboard, disk), a trap is a synchronous, intentional event. User code explicitly invokes it to request a service from the operating system. This deliberate nature distinguishes the system call trap from other privilege escalation mechanisms and has profound implications for how applications and kernels interact.

What You Will Learn

By the end of this page, you will understand the semantics of trap instructions, how they differ from interrupts and exceptions, the various trap instruction implementations across major CPU architectures, and the precise sequence of CPU state changes that occurs during trap execution.

Understanding Traps: Interrupts, Exceptions, and Traps

Before diving into trap instructions specifically, we must understand the taxonomy of events that cause privilege transitions. The terms 'interrupt,' 'exception,' and 'trap' are often used interchangeably, but they have distinct technical meanings:

Interrupts (Hardware Interrupts)

Source: External hardware devices (keyboard, disk controller, network card, timer)
Timing: Asynchronous—occur independently of currently executing instructions
Resume: Execution continues at the instruction following the interrupted point
Example: Disk controller signals data is ready; timer fires for scheduling

Exceptions (Faults, Traps, Aborts)

Source: CPU detects an abnormal condition caused by instruction execution
Timing: Synchronous—directly caused by executing an instruction
Resume: Depends on exception type:
- Faults: Can be corrected; instruction re-executed (e.g., page fault)
- Traps: Intentional; execution continues at next instruction (e.g., INT 3)
- Aborts: Unrecoverable; process terminated (e.g., machine check)

Software Traps (System Calls)

Source: Explicit trap instruction executed by user code
Timing: Synchronous—deliberately triggered at a specific instruction
Resume: Execution continues at the instruction following the trap
Example: Application calls read(), which invokes SYSCALL instruction

Classification of Privilege Transition Events
Type	Source	Timing	Cause	Resume Point
Hardware Interrupt	External device	Asynchronous	Device signals CPU	Next instruction
Fault	CPU detection	Synchronous	Error condition (recoverable)	Faulting instruction (retry)
Trap	Explicit instruction	Synchronous	Deliberate software request	Next instruction
Abort	CPU detection	Synchronous	Severe error (unrecoverable)	Process terminated
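The resume rules in the table can be sketched as a small lookup. This is only an illustrative model: the names `EventClass`, `EVENTS`, and `resume_pc` are hypothetical, and `pc` is assumed to be the address of the instruction that triggered the event (or, for a hardware interrupt, the last instruction that completed).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Resume(Enum):
    NEXT_INSTRUCTION = "next instruction"
    RETRY_INSTRUCTION = "faulting instruction (retry)"
    NO_RESUME = "process terminated"


@dataclass(frozen=True)
class EventClass:
    source: str
    synchronous: bool
    resume: Resume


# Hypothetical encoding of the classification table above.
EVENTS = {
    "hardware_interrupt": EventClass("external device", False, Resume.NEXT_INSTRUCTION),
    "fault": EventClass("CPU detection", True, Resume.RETRY_INSTRUCTION),
    "trap": EventClass("explicit instruction", True, Resume.NEXT_INSTRUCTION),
    "abort": EventClass("CPU detection", True, Resume.NO_RESUME),
}


def resume_pc(kind: str, pc: int, insn_len: int) -> Optional[int]:
    """Return the address where execution resumes, or None if it never does."""
    resume = EVENTS[kind].resume
    if resume is Resume.NEXT_INSTRUCTION:
        return pc + insn_len   # traps and interrupts continue after the instruction
    if resume is Resume.RETRY_INSTRUCTION:
        return pc              # faults re-execute the same instruction
    return None                # aborts do not resume
```

For example, a 2-byte trap instruction at 0x1000 resumes at 0x1002, while a fault at the same address resumes at 0x1000 once the kernel has fixed the cause.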

Terminology Variations

Different CPU vendors use different terminology. Intel's manuals speak of 'interrupts and exceptions' and dispatch both through the same descriptor table, while ARM uses 'exception' for both synchronous and asynchronous events. The conceptual distinctions remain the same; what matters is understanding when and why the CPU transitions to a different privilege level.

The Semantics of Trap Instructions

A trap instruction is designed with specific semantics that distinguish it from other instructions:

1. Atomic Privilege Transition

The trap must atomically switch the CPU from user mode to kernel mode. There cannot be a 'half-transitioned' state where the CPU is partly in user mode and partly in kernel mode. This atomicity prevents race conditions and security vulnerabilities.

2. Controlled Destination

The trap instruction does not specify where to jump—the destination is determined by tables configured by the operating system. User code cannot choose to jump to arbitrary kernel addresses; it can only trigger a transition to predefined handler locations.

3. State Preservation

The CPU must preserve sufficient state for the kernel to:

Identify what the user was doing when the trap occurred
Access the user's register values (including arguments to the system call)
Eventually return to user mode and continue where it left off

4. Interrupt Disabling (Optional)

Different trap mechanisms have different interrupt behavior. Some disable interrupts during the transition (like interrupt gates), while others leave interrupts enabled (like trap gates). The choice affects kernel complexity and latency.

5. Stack Safety

The transition mechanism must either switch to a known-safe kernel stack or provide a way for the kernel to immediately do so. Running kernel code on an untrusted user stack is a severe security vulnerability.

Essential Trap Instruction Properties

•Deterministic — The same trap instruction with the same parameters always triggers the same transition (barring configuration changes).
•Non-returnable by user — Once the trap executes, user code loses control until the kernel explicitly returns.
•Low overhead — Modern CPUs optimize trap instructions to minimize the cycle count for privilege transitions.
•Secure — User code cannot exploit the trap to gain unauthorized access or corrupt kernel state.
•Bidirectional (with return instruction) — A corresponding mechanism (SYSRET, IRET) allows the kernel to return to user mode.

x86 Trap Instructions: From INT to SYSCALL

The x86 architecture has evolved multiple trap mechanisms, each addressing limitations of its predecessors:

INT n (Software Interrupt)

The original mechanism, inherited from the 8086. The INT instruction with an immediate operand triggers a software interrupt:

MOV EAX, 1        ; system call number (exit)
MOV EBX, 0        ; exit status
INT 0x80          ; trigger system call

The CPU uses the operand (0x80 in this case) as an index into the Interrupt Descriptor Table (IDT). Historically, Linux used INT 0x80 for 32-bit system calls, and Windows used INT 0x2E.
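The IDT lookup can be modeled in a few lines. The sketch below is a simplification: the `Gate` type, the handler addresses, and the `software_interrupt` helper are hypothetical, and the real CPU distinguishes more failure cases. It does capture one real rule: a software INT n is only allowed if the gate's descriptor privilege level (DPL) is at least the caller's current privilege level (CPL), which is why user code can use the system call vector but cannot invoke, say, the page fault handler directly.

```python
from dataclasses import dataclass


class GeneralProtectionFault(Exception):
    """Stands in for the #GP exception the CPU would raise."""


@dataclass(frozen=True)
class Gate:
    handler: int  # kernel entry address chosen by the OS, not by user code
    dpl: int      # highest ring number allowed to reach this gate via INT n


def software_interrupt(idt: dict, vector: int, cpl: int) -> int:
    """Return the handler address for INT <vector>, or raise on a privilege violation."""
    gate = idt.get(vector)
    if gate is None or cpl > gate.dpl:
        raise GeneralProtectionFault(vector)
    return gate.handler


# Hypothetical IDT: vector 0x80 is open to ring 3, vector 14 (page fault) is not.
IDT = {
    0x80: Gate(handler=0xC0100000, dpl=3),
    14: Gate(handler=0xC0101000, dpl=0),
}
```

Calling `software_interrupt(IDT, 0x80, cpl=3)` yields the system call entry address, whereas `software_interrupt(IDT, 14, cpl=3)` raises, mirroring the controlled-destination property described earlier.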

x86 Trap Instruction Comparison
Instruction	Year	Bits	Latency	State Saved	Used By
`INT 0x80`	1978	16/32	~250-500 cycles	SS, ESP, EFLAGS, CS, EIP pushed to kernel stack	Legacy Linux 32-bit
`SYSENTER`	1997	32	~100-200 cycles	None (loads CS, EIP, ESP from MSRs)	Modern Linux 32-bit, Windows
`SYSCALL`	2003	64	~50-100 cycles	RCX=RIP, R11=RFLAGS	Modern Linux 64-bit, Windows 64-bit

SYSENTER/SYSEXIT (Pentium II and later)

Intel introduced SYSENTER as a faster alternative to INT. Unlike INT, which uses the IDT and involves stack operations during the transition, SYSENTER uses Model-Specific Registers (MSRs) to define the entry point:

IA32_SYSENTER_CS: Kernel code segment selector
IA32_SYSENTER_EIP: Kernel entry point address
IA32_SYSENTER_ESP: Kernel stack pointer

SYSENTER does NOT save user state to the stack—it only loads the new CS, EIP, and ESP. The kernel entry code must manually save user registers.

SYSCALL/SYSRET (AMD64/x86-64)

For 64-bit mode, AMD introduced SYSCALL, which became the standard for x86-64 Linux and Windows:

IA32_STAR: CS/SS selectors for kernel and user mode
IA32_LSTAR: Kernel entry point (RIP)
IA32_FMASK: RFLAGS mask (bits to clear)

SYSCALL saves user RIP to RCX and user RFLAGS to R11. It does NOT switch stacks—the kernel entry code must load the kernel stack manually.

SYSCALL Instruction Behavior (x86-64)

Pseudocode

SYSCALL:
    // Save user state to registers (NOT stack)
    RCX ← RIP         // User return address saved to RCX
    R11 ← RFLAGS      // User flags saved to R11
    
    // Load kernel segments from MSR
    CS.selector ← IA32_STAR[47:32]    // Kernel CS from IA32_STAR
    SS.selector ← IA32_STAR[47:32] + 8 // Kernel SS
    
    // Clear flags specified by mask
    RFLAGS ← RFLAGS AND NOT(IA32_FMASK)
    
    // Jump to kernel entry point
    RIP ← IA32_LSTAR   // Kernel entry address from MSR
    
    // CPL is now 0 (kernel mode)
    // RSP is UNCHANGED - still points to user stack!
    // Kernel must immediately load safe stack

Stack Pointer Not Switched

Unlike INT, the SYSCALL instruction does NOT automatically switch the stack pointer. RSP still points to user memory when kernel execution begins. This means the very first thing the kernel entry code must do is load a safe kernel stack pointer. Any instructions before this are extremely security-sensitive.
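To make the register effects concrete, here is a minimal Python model of the SYSCALL transition. It is a sketch rather than a faithful emulator: the `Cpu` class and the example addresses are made up for illustration, and segment loading is omitted. The 2-byte instruction length matches the SYSCALL encoding (0F 05).

```python
from dataclasses import dataclass

IF_FLAG = 1 << 9  # RFLAGS.IF, the interrupt-enable bit


@dataclass
class Cpu:
    rip: int
    rsp: int
    rflags: int
    rcx: int = 0
    r11: int = 0
    cpl: int = 3


def syscall(cpu: Cpu, lstar: int, fmask: int, insn_len: int = 2) -> None:
    """Apply the architectural effects of SYSCALL to the modeled CPU."""
    cpu.rcx = cpu.rip + insn_len  # return address: the instruction after SYSCALL
    cpu.r11 = cpu.rflags          # user flags preserved in R11
    cpu.rflags &= ~fmask          # clear every flag bit set in IA32_FMASK
    cpu.rip = lstar               # jump to the kernel entry point
    cpu.cpl = 0                   # now in kernel mode; note that RSP is untouched


# Hypothetical user state and MSR values.
cpu = Cpu(rip=0x401000, rsp=0x7FFD0000, rflags=0x246)
syscall(cpu, lstar=0xFFFFFFFF81000000, fmask=IF_FLAG)
```

After the call, `cpu.rsp` still holds the user value 0x7FFD0000 even though `cpu.cpl` is 0, which is exactly the window the kernel entry code has to close.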

ARM Trap Instructions: SVC and Exception Levels

ARM processors use a different privilege model based on Exception Levels (EL):

EL0: User applications (unprivileged)
EL1: Operating system kernel (privileged)
EL2: Hypervisor (for virtualization)
EL3: Secure monitor (for TrustZone security)

SVC (Supervisor Call)

The ARM equivalent of x86's SYSCALL is the SVC instruction (formerly called SWI - Software Interrupt in ARM32):

// ARM64 (AArch64) system call example
MOV     X8, #93       // System call number (exit)
MOV     X0, #0        // Exit status
SVC     #0            // Trap to kernel

When SVC executes:

Exception Syndrome Register (ESR_EL1) is updated with exception cause
Program Counter is saved to Exception Link Register (ELR_EL1)
Processor State (PSTATE) is saved to Saved Program Status Register (SPSR_EL1)
Exception level changes from EL0 to EL1
Execution continues at the exception vector (defined in VBAR_EL1)

ARM64 Exception Handling Registers
Register	Purpose	Saved Automatically?
ELR_EL1	Return address (user PC)	Yes
SPSR_EL1	User processor state (PSTATE)	Yes
ESR_EL1	Exception syndrome (cause + details)	Yes
FAR_EL1	Faulting address (for memory exceptions)	Yes
SP_EL1	Kernel stack pointer	Used automatically
X0-X30	General purpose registers	No (must save manually)
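As an illustration of how a handler might use ESR_EL1, the sketch below decodes the syndrome fields. The field positions (exception class in bits 31:26, instruction-length bit 25, instruction-specific syndrome in bits 24:0) and the class value 0x15 for an SVC executed in AArch64 state follow the architecture; the function names are hypothetical.

```python
from typing import Optional

EC_SVC_AARCH64 = 0x15  # exception class: SVC instruction executed in AArch64 state


def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value into its three top-level fields."""
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class
        "il": (esr >> 25) & 0x1,    # 1 = 32-bit instruction
        "iss": esr & 0x1FFFFFF,     # instruction-specific syndrome
    }


def svc_immediate(esr: int) -> Optional[int]:
    """Return the 16-bit immediate of the SVC, or None if this was not an SVC."""
    fields = decode_esr(esr)
    if fields["ec"] != EC_SVC_AARCH64:
        return None
    return fields["iss"] & 0xFFFF
```

For `SVC #0` the syndrome value is 0x56000000, so `svc_immediate(0x56000000)` returns 0. Linux ignores this immediate and takes the system call number from X8, as in the example above.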

Exception Vector Table

Unlike x86's IDT, which holds up to 256 entries, ARM uses a compact exception vector table with a fixed layout. For AArch64, each exception level has a vector base address register (VBAR_ELx) that points to a table with 16 entries:

Offset	Exception Type
0x000	Synchronous, current EL, SP0
0x080	IRQ, current EL, SP0
0x100	FIQ, current EL, SP0
0x180	SError, current EL, SP0
0x200	Synchronous, current EL, SPx
...	...
0x400	Synchronous, lower EL, AArch64
0x480	IRQ, lower EL, AArch64
...	...

When EL0 code executes SVC, the CPU jumps to VBAR_EL1 + 0x400 (synchronous exception from lower EL using AArch64).
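The fixed layout means each entry's offset can be computed rather than looked up: four source groups of 0x200 bytes, each holding four 0x80-byte entries. A small sketch follows; the string labels are chosen here for readability and are not architectural names.

```python
# Order matters: the index of each label determines its offset.
SOURCES = ["current_el_sp0", "current_el_spx", "lower_el_aarch64", "lower_el_aarch32"]
KINDS = ["sync", "irq", "fiq", "serror"]


def vector_offset(source: str, kind: str) -> int:
    """Offset of an exception vector relative to VBAR_ELx."""
    return SOURCES.index(source) * 0x200 + KINDS.index(kind) * 0x80


def vector_address(vbar: int, source: str, kind: str) -> int:
    """Absolute address the CPU jumps to for the given exception."""
    return vbar + vector_offset(source, kind)
```

An SVC from 64-bit user code is a synchronous exception from a lower EL using AArch64, so `vector_offset("lower_el_aarch64", "sync")` gives 0x400, matching the table.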

ARM Stack Pointer Banking

ARM64 provides banked stack pointers for each exception level. When transitioning from EL0 to EL1, the processor automatically switches from SP_EL0 to SP_EL1. This is MORE secure than x86's SYSCALL, which leaves RSP pointing to user memory.

RISC-V Trap Instructions: ECALL and Privilege Modes

RISC-V, being a modern clean-slate ISA, provides a straightforward trap mechanism designed for clarity and correctness.

Privilege Modes

U-mode (User): Unprivileged applications
S-mode (Supervisor): Operating system kernel
M-mode (Machine): Firmware and bootloader (e.g., an SBI implementation)

ECALL (Environment Call)

The RISC-V trap instruction is ECALL. It requests a service from the next higher privilege level:

# RISC-V Linux system call example
li      a7, 93         # System call number (exit)
li      a0, 0          # Exit status
ecall                  # Trap to kernel (U→S) or SBI (S→M)

ECALL behavior:

mepc (or sepc for S-mode) is set to the ECALL instruction address
mcause (or scause) is set to identify the exception (8 for ECALL from U-mode, 9 for ECALL from S-mode, 11 for ECALL from M-mode)
Privilege mode changes to M-mode or S-mode
PC is set to the value in mtvec (or stvec) trap vector register
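A handler usually starts by decoding the cause register. The sketch below assumes a 64-bit hart (so the interrupt flag is bit 63) and uses the standard exception codes 8, 9, and 11 for ECALL from U-, S-, and M-mode; the helper names are hypothetical. Because the saved epc points at the ECALL itself and ECALL has no compressed encoding, the return address is always epc + 4.

```python
from typing import Optional, Tuple

XLEN = 64
INTERRUPT_BIT = 1 << (XLEN - 1)  # top bit of mcause/scause marks an interrupt

# Exception codes for ECALL, keyed by code, giving the mode that executed it.
ECALL_CAUSES = {8: "U", 9: "S", 11: "M"}


def decode_cause(cause: int) -> Tuple[bool, int]:
    """Split a cause value into (is_interrupt, code)."""
    return bool(cause & INTERRUPT_BIT), cause & ~INTERRUPT_BIT


def ecall_origin(cause: int) -> Optional[str]:
    """Return the mode that executed ECALL, or None if the trap was not an ECALL."""
    is_interrupt, code = decode_cause(cause)
    if is_interrupt:
        return None
    return ECALL_CAUSES.get(code)


def return_address(epc: int) -> int:
    """Address to resume at after servicing an ECALL."""
    return epc + 4
```

Note that the same numeric code means different things depending on the interrupt bit: code 9 is an ECALL from S-mode as an exception, but a supervisor external interrupt when the top bit is set.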

RISC-V ECALL Exception Handling

Assembly (RISC-V)

# RISC-V S-mode trap handler entry (simplified)
.align 4
trap_entry:
    # At this point:
    # - sstatus.SPP indicates previous privilege (0=User, 1=Supervisor)  
    # - sepc contains the address of ECALL instruction
    # - scause contains ECALL_FROM_U (value 8)
    # - sscratch contains kernel stack pointer (swapped before entry)
    
    # Switch to the kernel stack
    csrrw   sp, sscratch, sp   # Swap user SP with kernel SP
    
    # Allocate trapframe on kernel stack
    addi    sp, sp, -288       # Space for 36 registers (8 bytes each)
    
    # Save user registers
    sd      ra, 0(sp)
    sd      gp, 8(sp)
    sd      tp, 16(sp)
    sd      t0, 24(sp)
    # ... save remaining registers ...
    
    # Save sepc (user return address)
    csrr    t0, sepc
    sd      t0, 256(sp)
    
    # Call C handler
    mv      a0, sp             # Pass trapframe pointer
    call    handle_exception
    
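    # Restore and return
    ld      t0, 256(sp)        # Load saved sepc
    addi    t0, t0, 4          # Advance past the 4-byte ECALL instruction
    csrw    sepc, t0
    
    # Restore registers...
    ld      ra, 0(sp)
    # ... restore remaining registers (t0 last, since it was used as scratch) ...
    
    addi    sp, sp, 288        # Release the trapframe
    csrrw   sp, sscratch, sp   # Restore user SP; sscratch holds kernel SP again
    sret                       # Return to user mode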

RISC-V's Clean Design

RISC-V was designed with security and simplicity in mind. Unlike x86, there's only one system call instruction (ECALL), and the trap handling mechanism is consistent across all exception types. The sscratch register provides a clean way to store the kernel stack pointer, available immediately upon trap entry.
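The sscratch swap pattern only works if the handler leaves the kernel stack pointer exactly where it found it. The following model (a sketch with made-up addresses; `trap_round_trip` is a hypothetical helper) tracks just `sp` and `sscratch` through entry and exit of a handler that allocates a 288-byte trapframe, and shows what happens if the frame is not released before the final swap.

```python
FRAME_SIZE = 288  # 36 registers of 8 bytes each


def trap_round_trip(user_sp: int, kernel_sp: int, release_frame: bool = True):
    """Return (sp, sscratch) as seen after sret for one trap entry and exit."""
    sp, sscratch = user_sp, kernel_sp

    sp, sscratch = sscratch, sp   # entry: csrrw sp, sscratch, sp
    sp -= FRAME_SIZE              # allocate the trapframe on the kernel stack

    # ... handler body would run here ...

    if release_frame:
        sp += FRAME_SIZE          # release the trapframe
    sp, sscratch = sscratch, sp   # exit: csrrw sp, sscratch, sp
    return sp, sscratch
```

With the frame released, `trap_round_trip(0x7FFF0000, 0x80200000)` returns the original pair. With `release_frame=False`, the user stack pointer is still correct but sscratch ends up 288 bytes lower, so every subsequent trap would start further down the kernel stack.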

CPU State Changes During Trap Execution

When a trap instruction executes, the CPU performs a carefully orchestrated sequence of state changes. Understanding these changes is essential for kernel developers and security researchers. Let's examine each phase:

Phase 1: Instruction Recognition

•The CPU's instruction decoder identifies the trap instruction (SYSCALL, SVC, ECALL, etc.).
•The instruction is recognized as a privilege-escalating operation requiring special handling.
•The CPU begins the trap sequence, which is atomic with respect to interrupts and other exceptions.
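•Earlier instructions complete in program order before the trap takes effect, so the handler sees a consistent architectural state. The trap instruction itself is not necessarily a full memory barrier.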

Phase 2: State Preservation

•Program Counter (PC/RIP/EIP): The address of the next instruction is saved. On x86-64 SYSCALL, this goes to RCX; on ARM64 SVC, to ELR_EL1.
•Processor Flags (RFLAGS/PSTATE): Condition codes and control bits are preserved. x86-64 SYSCALL saves to R11; ARM64 saves to SPSR_EL1.
•Stack Pointer: Behavior varies by architecture. x86-64 SYSCALL leaves RSP unchanged (dangerous!). ARM64 SVC switches to SP_EL1 automatically.
•General Purpose Registers: NOT automatically saved on most modern architectures. Kernel entry code must preserve them.

Phase 3: Privilege Escalation

•Privilege Level Change: CPL (x86) or Exception Level (ARM) is elevated. User mode (Ring 3/EL0) becomes kernel mode (Ring 0/EL1).
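•Segment Selector Update (x86): CS and SS are loaded with kernel segment selectors. In 64-bit mode segmentation is essentially flat, so access to kernel memory is governed by the new CPL together with the page-table permission bits.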
•Memory Protection Change: Page table interpretation may change (SMEP, SMAP effects), and kernel-only mappings become accessible.

Phase 4: Control Transfer

•PC/RIP Load: The program counter is loaded with the kernel entry point address from the appropriate source (IA32_LSTAR, VBAR_ELx, stvec).
•Flag Modification: Certain flags are modified. SYSCALL clears bits specified by IA32_FMASK (often IF to disable interrupts temporarily).
•Execution Begins: The CPU fetches and executes the instruction at the new PC—the first instruction of the kernel's trap handler.

The Atomicity Guarantee

This entire sequence is atomic—it cannot be interrupted halfway. If an interrupt arrives during the trap sequence, it is held pending until the trap completes. This atomicity is crucial:

It prevents race conditions where an interrupt handler might find the CPU in an inconsistent state
It ensures the kernel entry code always starts from a well-defined state
It simplifies reasoning about security properties

The atomicity is implemented in microcode or hardwired logic, not in software. It's a fundamental property of the trap instruction itself.

Trap Return Instructions: SYSRET, ERET, and SRET

Every trap instruction has a corresponding return instruction that restores user mode. These return instructions are privileged—only the kernel can execute them—and they reverse the trap's state changes:

Trap Return Instructions by Architecture
Architecture	Instruction	PC Source	Flags Source	Stack Behavior
x86-64	`SYSRET`	RCX	R11	RSP must be set by kernel
x86 (legacy)	`IRET`	Stack pop	Stack pop	Full stack restore
ARM64	`ERET`	ELR_ELx	SPSR_ELx	Automatic SP switching
RISC-V	`SRET`/`MRET`	sepc/mepc	sstatus/mstatus	sscratch swap pattern

SYSRET Behavior (x86-64)

Pseudocode

SYSRET:
    // Verify we're in Ring 0
    IF CPL != 0:
        RAISE #GP(0)
    
    // Load user segments from MSR
    CS.selector ← IA32_STAR[63:48] + 16  // User CS
    SS.selector ← IA32_STAR[63:48] + 8   // User SS
    
    // Restore user state from registers
    RIP ← RCX         // Return address (saved by SYSCALL)
    RFLAGS ← R11      // Flags (saved by SYSCALL)
    
    // Change privilege level
    CPL ← 3           // Back to user mode
    
    // Execution continues at user RIP
    // RSP must have been restored by kernel before SYSRET
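The selector arithmetic in the SYSCALL and SYSRET pseudocode can be checked with a short calculation. The sketch below follows the documented formulas, including the masking of the kernel CS and the forcing of RPL 3 on the user selectors; the example IA32_STAR value uses bases 0x10 and 0x23, which are assumed here to resemble a typical Linux configuration and should be treated as illustrative.

```python
from typing import Tuple


def syscall_selectors(star: int) -> Tuple[int, int]:
    """Kernel (CS, SS) selectors loaded by SYSCALL from IA32_STAR[47:32]."""
    base = (star >> 32) & 0xFFFF
    return base & 0xFFFC, (base + 8) & 0xFFFF


def sysret_selectors(star: int) -> Tuple[int, int]:
    """User (CS, SS) selectors loaded by a 64-bit SYSRET from IA32_STAR[63:48]."""
    base = (star >> 48) & 0xFFFF
    user_cs = ((base + 16) | 3) & 0xFFFF  # requested privilege level forced to 3
    user_ss = ((base + 8) | 3) & 0xFFFF
    return user_cs, user_ss


# Illustrative MSR value: kernel base 0x10, user base 0x23.
STAR = (0x23 << 48) | (0x10 << 32)
```

With this value, SYSCALL selects kernel CS 0x10 and SS 0x18, and SYSRET selects user CS 0x33 and SS 0x2B. The fixed +8 and +16 offsets are why the operating system has to lay out these descriptors in a specific order in its GDT.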

SYSRET Security Vulnerability

SYSRET on Intel CPUs has a subtle security issue: if the return address in RCX is non-canonical, the processor raises #GP while still at CPL 0, before the switch to user mode completes. By that point the kernel has typically already restored the user's RSP and GS base, so the #GP handler starts in Ring 0 on a user-controlled stack pointer. Linux works around this by validating RCX and using IRET for potentially problematic returns.

Kernel Responsibilities Before Return

Before executing the return instruction, the kernel must:

Restore user registers: General purpose registers that were saved on entry must be restored from the kernel stack.
Set return values: System call results (typically in RAX on x86, X0 on ARM) must be set before return.
Restore user stack pointer: On x86-64, RSP must be explicitly loaded with the user's stack pointer.
Handle signals: If signals are pending, the kernel may need to divert return to a signal handler instead.
Update flags: SYSRET loads RFLAGS from R11, so the kernel must place the intended user flags in R11 before returning. Some kernels also use a flag bit to report errors (for example, the carry flag on the BSDs).
Re-enable interrupts: If interrupts were disabled during system call handling, they must be re-enabled (or the return instruction does this automatically).
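One further check belongs on the x86-64 return path: deciding whether the saved return address is safe to hand to SYSRET. The sketch below shows a canonical-address test, assuming 48-bit virtual addresses by default (bits 63 through 47 must be all zeros or all ones). The `choose_return_path` helper is hypothetical and only illustrates the idea of falling back to IRET.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if the upper bits of addr are a sign extension of bit va_bits - 1."""
    addr &= (1 << 64) - 1
    upper = addr >> (va_bits - 1)                 # bits 63 .. va_bits-1
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones


def choose_return_path(user_rip: int) -> str:
    """Pick the fast path only when the return address cannot fault in SYSRET."""
    return "sysret" if is_canonical(user_rip) else "iret"
```

For instance, 0x00007FFFFFFFFFFF is the highest canonical user address, while 0x0000800000000000 is non-canonical and would take the IRET path in this model.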

Performance: INT vs. SYSCALL

The evolution from INT-based to SYSCALL-based system calls was driven by performance requirements. Understanding the differences explains why modern systems use dedicated instructions:

INT 0x80 vs. SYSCALL Overhead Breakdown
Operation	INT 0x80	SYSCALL	Notes
IDT lookup	~20 cycles	N/A	SYSCALL uses MSR directly
Gate validation	~10 cycles	N/A	No gate for SYSCALL
Stack switching	~30 cycles (auto)	~5 cycles (manual)	Microcode vs. kernel code
State save to stack	~50 cycles	~5 cycles (to regs)	Full push vs. RCX, R11
Segment reload	~20 cycles	~5 cycles	Memory lookup vs. fixed values
Pipeline flush	~50 cycles	~30 cycles	Both require flush
Total (approx)	~180-300 cycles	~50-80 cycles	3-4x improvement
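As a quick sanity check on the table, the per-row estimates can be summed directly. The figures below are copied from the rows above (with N/A treated as zero); they are rough estimates, not measurements.

```python
# (INT 0x80 cycles, SYSCALL cycles) per operation, taken from the table above.
BREAKDOWN = {
    "IDT lookup": (20, 0),
    "Gate validation": (10, 0),
    "Stack switching": (30, 5),
    "State save": (50, 5),
    "Segment reload": (20, 5),
    "Pipeline flush": (50, 30),
}

int_total = sum(int_cost for int_cost, _ in BREAKDOWN.values())
syscall_total = sum(sys_cost for _, sys_cost in BREAKDOWN.values())
speedup = int_total / syscall_total

print(f"INT 0x80: ~{int_total} cycles, SYSCALL: ~{syscall_total} cycles, ratio {speedup:.1f}x")
```

The row estimates sum to 180 cycles for INT 0x80 and 45 for SYSCALL, a ratio of 4.0, which sits at the low end of the ranges in the Total row and is consistent with the stated 3-4x improvement.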

Why INT Is Slower

Memory accesses: INT must read the IDT entry from memory (even if cached), validate its type and privilege, and read the segment descriptor. SYSCALL uses pre-loaded MSR values.
Automatic stack push: INT automatically pushes SS, RSP, RFLAGS, CS, and RIP to the new stack. SYSCALL saves only RIP and RFLAGS to registers—no memory writes.
Segment descriptor lookups: INT reloads segment selectors, requiring descriptor table lookups. SYSCALL still loads selectors but uses a simpler path.
Flexibility overhead: INT is a general-purpose mechanism for hundreds of interrupt types. SYSCALL is optimized specifically for the user→kernel→user transition.

Modern Impact (with KPTI)

With Kernel Page Table Isolation (KPTI) mitigating Meltdown, system call overhead has increased again. Each transition now requires:

Page table switch (CR3 reload) on entry
TLB flush (or PCID switch) on entry
Reverse on exit

This adds ~50-100 cycles per system call, partially negating the gains from SYSCALL over INT. However, SYSCALL is still faster because the additional overhead applies equally to both.

Measuring System Call Latency

Use perf stat -e 'cycles' -- ./getpid_loop where getpid_loop calls getpid() in a tight loop. On modern hardware: ~100-200 cycles without KPTI, ~200-400 cycles with KPTI. Tools like LMBench's lat_syscall provide standardized measurements.
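For a rough first impression without writing a C benchmark, the same tight-loop idea can be tried from Python. This is only a sketch: `ns_per_getpid` is a hypothetical helper, the result is in nanoseconds rather than cycles, and interpreter overhead is included in every iteration, so the number is an upper bound on the system call cost rather than a precise measurement. It also assumes `os.getpid()` reaches the kernel on each call, which holds on recent C libraries that no longer cache the PID.

```python
import os
import time


def ns_per_getpid(iterations: int = 200_000) -> float:
    """Average wall-clock nanoseconds per os.getpid() call, loop overhead included."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        os.getpid()
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


if __name__ == "__main__":
    print(f"~{ns_per_getpid():.0f} ns per getpid() call (including Python overhead)")
```

Running the script with KPTI enabled and disabled (where the system allows it) gives a relative comparison; for absolute cycle counts, perf or lat_syscall remain the better tools.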

Security Implications of Trap Instructions

Trap instructions are the boundary between trusted and untrusted code. Any security flaw in the trap mechanism or its handlers can lead to complete system compromise:

Historical Vulnerabilities

•SYSRET Non-Canonical RIP (CVE-2012-0217): On Intel CPUs, SYSRET raises #GP while still in Ring 0 if RCX contains a non-canonical address. Because the kernel has already restored the user's RSP and GS base, the fault handler runs with kernel privilege on attacker-controlled state. This enabled privilege escalation on FreeBSD, Xen, and Windows.
•Spectre/Meltdown (2018): Speculative execution during trap handling can leak kernel memory. SYSCALL entry becomes a speculation gadget, enabling user-space to infer kernel secrets.
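•SWAPGS Side Channel (CVE-2019-1125): The conditional SWAPGS at kernel entry can be speculatively skipped or executed when it should not be, so later GS-relative loads run speculatively with the wrong GS base and can leak kernel data. Mitigated mainly in software by adding speculation barriers (LFENCE) to the entry path.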
•Stack Pivot Attacks: If kernel entry code doesn't immediately switch to a safe stack, attackers can arrange for the user stack to be paged out, causing a page fault handler to run on a corrupt stack.

Modern Mitigations

Operating systems deploy multiple mitigations to protect the trap boundary:

Mitigation	Protection	Overhead
KPTI	Unmaps kernel from user page tables	~50-100 cycles/syscall
SMEP	Prevents kernel executing user code	Minimal (bit in CR4)
SMAP	Prevents kernel reading or writing user memory unless explicitly permitted (STAC/CLAC)	Minimal (~2 cycles)
KASLR	Randomizes kernel addresses	Minimal (boot-time)
Retpoline	Mitigates Spectre-BTB	~5% overall
IBPB/IBRS	Hardware Spectre mitigations	Variable (~5-10%)

Defense in Depth

No single mitigation is sufficient. Modern kernels combine:

Hardware protections (SMEP/SMAP, CET)
Isolation (KPTI, KASLR)
Speculation barriers (LFENCE, retpolines)
Stack protections (stack canaries, guard pages)
Control flow integrity (CFI, shadow call stacks)

The Trap Entry Code Is Critical Path

The code between trap instruction execution and full kernel context establishment is the most security-critical code in the entire kernel. It runs with elevated privilege but incomplete state. Every instruction must be audited for correctness, side-channel safety, and robustness against hostile inputs.

Summary: The Deliberate Doorway

We've thoroughly examined the trap instruction—the deliberate mechanism by which user code requests kernel services. Let's consolidate our understanding:

Key Takeaways

•Traps are intentional — Unlike faults (errors) or interrupts (asynchronous), traps are deliberately executed by user code to request services.
•Multiple trap mechanisms exist — x86 provides INT, SYSENTER, and SYSCALL with different performance characteristics. ARM uses SVC, RISC-V uses ECALL.
•State preservation varies — Different trap instructions save different amounts of state automatically. SYSCALL saves minimal state (RCX, R11), while INT saves more to stack.
•Stack handling is critical — Some traps (INT) switch stacks automatically; others (SYSCALL) do not. The kernel must handle this safely.
•Performance motivated evolution — SYSCALL is ~3-4x faster than INT due to reduced memory access and simpler microcode path.
•Security is paramount — The trap entry point is the most security-sensitive code in the kernel. Vulnerabilities here lead to complete system compromise.

What's Next:

Now that we understand how the CPU transitions from user to kernel mode, we'll examine how the kernel identifies what service is being requested: the system call number. We'll explore system call tables, numbering conventions, and how kernels maintain compatibility across versions.

Page Complete

You now understand the trap instruction—the deliberate, synchronous mechanism that allows user applications to request operating system services. You've seen how different architectures implement this fundamental operation and the critical security considerations involved. Next, we'll explore how the kernel knows which service is being requested through system call numbers.

2 / 5

Loading learning content...

Operating SystemsSystem Calls & API

System Call Mechanism

LevelIntermediate

Duration75 mins

TopicSystem Calls & API

2 / 5

Trap Instruction

The Deliberate Exception

What You Will Learn

Understanding Traps: Interrupts, Exceptions, and Traps

Interrupts (Hardware Interrupts)

Source: External hardware devices (keyboard, disk controller, network card, timer)
Timing: Asynchronous—occur independently of currently executing instructions
Resume: Execution continues at the instruction following the interrupted point
Example: Disk controller signals data is ready; timer fires for scheduling

Exceptions (Faults, Traps, Aborts)

Source: CPU detects an abnormal condition caused by instruction execution
Timing: Synchronous—directly caused by executing an instruction
Resume: Depends on exception type:
- Faults: Can be corrected; instruction re-executed (e.g., page fault)
- Traps: Intentional; execution continues at next instruction (e.g., INT 3)
- Aborts: Unrecoverable; process terminated (e.g., machine check)

Software Traps (System Calls)

Source: Explicit trap instruction executed by user code
Timing: Synchronous—deliberately triggered at a specific instruction
Resume: Execution continues at the instruction following the trap
Example: Application calls read(), which invokes SYSCALL instruction

Classification of Privilege Transition Events
Type	Source	Timing	Cause	Resume Point
Hardware Interrupt	External device	Asynchronous	Device signals CPU	Next instruction
Fault	CPU detection	Synchronous	Error condition (recoverable)	Faulting instruction (retry)
Trap	Explicit instruction	Synchronous	Deliberate software request	Next instruction
Abort	CPU detection	Synchronous	Severe error (unrecoverable)	Process terminated

Terminology Variations

The Semantics of Trap Instructions

A trap instruction is designed with specific semantics that distinguish it from other instructions:

1. Atomic Privilege Transition

2. Controlled Destination

3. State Preservation

The CPU must preserve sufficient state for the kernel to:

Identify what the user was doing when the trap occurred
Access the user's register values (including arguments to the system call)
Eventually return to user mode and continue where it left off

4. Interrupt Disabling (Optional)

5. Stack Safety

Essential Trap Instruction Properties

•Deterministic — The same trap instruction with the same parameters always triggers the same transition (barring configuration changes).
•Non-returnable by user — Once the trap executes, user code loses control until the kernel explicitly returns.
•Low overhead — Modern CPUs optimize trap instructions to minimize the cycle count for privilege transitions.
•Secure — User code cannot exploit the trap to gain unauthorized access or corrupt kernel state.
•Bidirectional (with return instruction) — A corresponding mechanism (SYSRET, IRET) allows the kernel to return to user mode.

x86 Trap Instructions: From INT to SYSCALL

The x86 architecture has evolved multiple trap mechanisms, each addressing limitations of its predecessors:

INT n (Software Interrupt)

The original mechanism, inherited from the 8086. The INT instruction with an immediate operand triggers a software interrupt:

MOV EAX, 1        ; system call number (exit)
MOV EBX, 0        ; exit status
INT 0x80          ; trigger system call

The CPU uses the operand (0x80 in this case) as an index into the Interrupt Descriptor Table (IDT). Historically, Linux used INT 0x80 for 32-bit system calls, and Windows used INT 0x2E.

x86 Trap Instruction Comparison
Instruction	Year	Bits	Latency	State Saved	Used By
`INT 0x80`	1979	16/32	~250-500 cycles	All to stack via IDT	Legacy Linux 32-bit
`SYSENTER`	1997	32	~100-200 cycles	Minimal (CS, EIP only)	Modern Linux 32-bit, Windows
`SYSCALL`	2003	64	~50-100 cycles	RCX=RIP, R11=RFLAGS	Modern Linux 64-bit, Windows 64-bit

SYSENTER/SYSEXIT (Pentium II and later)

IA32_SYSENTER_CS: Kernel code segment selector
IA32_SYSENTER_EIP: Kernel entry point address
IA32_SYSENTER_ESP: Kernel stack pointer

SYSENTER does NOT save user state to the stack—it only loads the new CS, EIP, and ESP. The kernel entry code must manually save user registers.

SYSCALL/SYSRET (AMD64/x86-64)

For 64-bit mode, AMD introduced SYSCALL, which became the standard for x86-64 Linux and Windows:

IA32_STAR: CS/SS selectors for kernel and user mode
IA32_LSTAR: Kernel entry point (RIP)
IA32_FMASK: RFLAGS mask (bits to clear)

SYSCALL saves user RIP to RCX and user RFLAGS to R11. It does NOT switch stacks—the kernel entry code must load the kernel stack manually.

SYSCALL Instruction Behavior (x86-64)

Pseudocode

SYSCALL:
    // Save user state to registers (NOT stack)
    RCX ← RIP         // User return address saved to RCX
    R11 ← RFLAGS      // User flags saved to R11
    
    // Load kernel segments from MSR
    CS.selector ← IA32_STAR[47:32]    // Kernel CS from IA32_STAR
    SS.selector ← IA32_STAR[47:32] + 8 // Kernel SS
    
    // Clear flags specified by mask
    RFLAGS ← RFLAGS AND NOT(IA32_FMASK)
    
    // Jump to kernel entry point
    RIP ← IA32_LSTAR   // Kernel entry address from MSR
    
    // CPL is now 0 (kernel mode)
    // RSP is UNCHANGED - still points to user stack!
    // Kernel must immediately load safe stack

Stack Pointer Not Switched

ARM Trap Instructions: SVC and Exception Levels

ARM processors use a different privilege model based on Exception Levels (EL):

EL0: User applications (unprivileged)
EL1: Operating system kernel (privileged)
EL2: Hypervisor (for virtualization)
EL3: Secure monitor (for TrustZone security)

SVC (Supervisor Call)

The ARM equivalent of x86's SYSCALL is the SVC instruction (formerly called SWI - Software Interrupt in ARM32):

// ARM64 (AArch64) system call example
MOV     X8, #93       // System call number (exit)
MOV     X0, #0        // Exit status
SVC     #0            // Trap to kernel

When SVC executes:

Exception Syndrome Register (ESR_EL1) is updated with exception cause
Program Counter is saved to Exception Link Register (ELR_EL1)
Processor State (PSTATE) is saved to Saved Program Status Register (SPSR_EL1)
Exception level changes from EL0 to EL1
Execution continues at the exception vector (defined in VBAR_EL1)

ARM64 Exception Handling Registers
Register	Purpose	Saved Automatically?
ELR_EL1	Return address (user PC)	Yes
SPSR_EL1	User processor state (PSTATE)	Yes
ESR_EL1	Exception syndrome (cause + details)	Yes
FAR_EL1	Faulting address (for memory exceptions)	Yes
SP_EL1	Kernel stack pointer	Used automatically
X0-X30	General purpose registers	No (must save manually)

Exception Vector Table

Offset	Exception Type
0x000	Synchronous, current EL, SP0
0x080	IRQ, current EL, SP0
0x100	FIQ, current EL, SP0
0x180	SError, current EL, SP0
0x200	Synchronous, current EL, SPx
...	...
0x400	Synchronous, lower EL, AArch64
0x480	IRQ, lower EL, AArch64
...	...

When EL0 code executes SVC, the CPU jumps to VBAR_EL1 + 0x400 (synchronous exception from lower EL using AArch64).

ARM Stack Pointer Banking

RISC-V Trap Instructions: ECALL and Privilege Modes

RISC-V, being a modern clean-slate ISA, provides a straightforward trap mechanism designed for clarity and correctness.

Privilege Modes

U-mode (User): Unprivileged applications
S-mode (Supervisor): Operating system kernel
M-mode (Machine): Firmware, bootloader, hypervisor

ECALL (Environment Call)

The RISC-V trap instruction is ECALL. It requests a service from the next higher privilege level:

# RISC-V Linux system call example
li      a7, 93         # System call number (exit)
li      a0, 0          # Exit status
ecall                  # Trap to kernel (U→S) or SBI (S→M)

ECALL behavior:

mepc (or sepc for S-mode) is set to the ECALL instruction address
mcause (or scause) is set to identify the exception (11 for U→M, 9 for U→S)
Privilege mode changes to M-mode or S-mode
PC is set to the value in mtvec (or stvec) trap vector register

RISC-V ECALL Exception Handling

Assembly (RISC-V)

# RISC-V S-mode trap handler entry (simplified)
.align 4
trap_entry:
    # At this point:
    # - sstatus.SPP indicates previous privilege (0=User, 1=Supervisor)  
    # - sepc contains the address of ECALL instruction
    # - scause contains ECALL_FROM_U (value 8)
    # - sscratch contains kernel stack pointer (placed there by the kernel before the last return to user mode)
    
    # Switch to the kernel stack
    csrrw   sp, sscratch, sp   # Swap user SP with kernel SP
    
    # Allocate trapframe on kernel stack
    addi    sp, sp, -288       # Space for 36 registers (8 bytes each)
    
    # Save user registers
    sd      ra, 0(sp)
    sd      gp, 8(sp)
    sd      tp, 16(sp)
    sd      t0, 24(sp)
    # ... save remaining registers ...
    
    # Save sepc (user return address)
    csrr    t0, sepc
    sd      t0, 256(sp)
    
    # Call C handler
    mv      a0, sp             # Pass trapframe pointer
    call    handle_exception
    
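    # Restore and return
    ld      t0, 256(sp)        # Load sepc
    addi    t0, t0, 4          # Step past ECALL (always 4 bytes); only valid when scause is an ECALL
    csrw    sepc, t0
    
    # Restore registers...
    ld      ra, 0(sp)
    # ...
    
    addi    sp, sp, 288        # Release the trapframe so sscratch gets the original kernel SP back
    csrrw   sp, sscratch, sp   # Restore user SP; sscratch holds the kernel SP again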
    sret                       # Return to user mode
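
The `handle_exception` routine called from the entry code is not shown in the listing. As a rough sketch of what such a handler might do for a system call, the model below reads the call number from a7, dispatches through a table, and writes the result to a0 in the trapframe. The trapframe is modelled as a dict, the handlers are placeholders, and the numbers 93 (exit) and 172 (getpid) are the RISC-V Linux values assumed here for illustration.

```python
SCAUSE_ECALL_FROM_U = 8
ENOSYS = 38  # Linux errno for "function not implemented"


def handle_exception(frame: dict, scause: int, table: dict) -> None:
    """Dispatch a system call using the saved registers in the trapframe."""
    if scause != SCAUSE_ECALL_FROM_U:
        raise NotImplementedError(f"unhandled scause {scause}")
    handler = table.get(frame["a7"])  # a7 carries the system call number
    if handler is None:
        frame["a0"] = -ENOSYS  # unknown call: negative errno in a0
    else:
        # Arguments arrive in a0..a2 (more in a real ABI); the result replaces a0.
        frame["a0"] = handler(frame["a0"], frame["a1"], frame["a2"])


# Hypothetical handlers standing in for real kernel services.
exit_codes = []


def sys_exit(status, _a1, _a2):
    exit_codes.append(status)
    return 0


def sys_getpid(_a0, _a1, _a2):
    return 1234  # placeholder process ID


SYSCALL_TABLE = {93: sys_exit, 172: sys_getpid}

frame = {"a0": 0, "a1": 0, "a2": 0, "a7": 172}
handle_exception(frame, SCAUSE_ECALL_FROM_U, SYSCALL_TABLE)
print(frame["a0"])
```

The assembly wrapper, not this handler, is what advances sepc past the ECALL before `sret`.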

RISC-V's Clean Design

CPU State Changes During Trap Execution

Phase 1: Instruction Recognition

•The CPU's instruction decoder identifies the trap instruction (SYSCALL, SVC, ECALL, etc.).
•The instruction is recognized as a privilege-escalating operation requiring special handling.
•The CPU begins the trap sequence, which is atomic with respect to interrupts and other exceptions.
•Earlier instructions complete in program order before the trap is taken, so the kernel observes a precise architectural state.

Phase 2: State Preservation

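•Program Counter (PC/RIP/EIP): The return address is saved. On x86-64 SYSCALL, the address of the next instruction goes to RCX; on ARM64 SVC, to ELR_EL1. RISC-V instead records the address of the ECALL itself in sepc, so the handler must advance it by 4 before returning.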
•Processor Flags (RFLAGS/PSTATE): Condition codes and control bits are preserved. x86-64 SYSCALL saves to R11; ARM64 saves to SPSR_EL1.
•Stack Pointer: Behavior varies by architecture. x86-64 SYSCALL leaves RSP unchanged (dangerous!). ARM64 SVC switches to SP_EL1 automatically.
•General Purpose Registers: NOT automatically saved on most modern architectures. Kernel entry code must preserve them.

Phase 3: Privilege Escalation

•Privilege Level Change: CPL (x86) or Exception Level (ARM) is elevated. User mode (Ring 3/EL0) becomes kernel mode (Ring 0/EL1).
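•Segment Selector Update (x86): CS and SS are loaded with fixed kernel selectors derived from IA32_STAR[47:32], without consulting a descriptor table. In 64-bit mode, access to kernel memory is then governed by the new CPL and the page permissions.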
•Memory Protection Change: Page table interpretation may change (SMEP, SMAP effects), and kernel-only mappings become accessible.

Phase 4: Control Transfer

•PC/RIP Load: The program counter is loaded with the kernel entry point address from the appropriate source (IA32_LSTAR, VBAR_ELx, stvec).
•Flag Modification: Certain flags are modified. SYSCALL clears bits specified by IA32_FMASK (often IF to disable interrupts temporarily).
•Execution Begins: The CPU fetches and executes the instruction at the new PC—the first instruction of the kernel's trap handler.
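
For x86-64 SYSCALL, phases 2 through 4 can be modelled as a handful of register assignments. The sketch below is a simplified model under stated assumptions: the `Cpu` class is hypothetical, the MSR values are illustrative, and only the fields discussed above are tracked.

```python
from dataclasses import dataclass

RFLAGS_IF = 1 << 9   # interrupt-enable flag
SYSCALL_LENGTH = 2   # SYSCALL encodes as the two bytes 0F 05


@dataclass
class Cpu:
    rip: int
    rsp: int
    rflags: int
    rcx: int = 0
    r11: int = 0
    cs: int = 0
    ss: int = 0
    cpl: int = 3


def syscall(cpu: Cpu, *, lstar: int, star: int, fmask: int) -> None:
    """Apply the register effects of SYSCALL in 64-bit mode."""
    # Phase 2: state preservation (registers only, no memory writes).
    cpu.rcx = cpu.rip + SYSCALL_LENGTH  # address of the next instruction
    cpu.r11 = cpu.rflags
    # Phase 3: privilege escalation with selectors taken from IA32_STAR[47:32].
    cpu.cs = (star >> 32) & 0xFFFC
    cpu.ss = cpu.cs + 8
    cpu.cpl = 0
    # Phase 4: control transfer; flags named in IA32_FMASK are cleared.
    cpu.rflags &= ~fmask
    cpu.rip = lstar
    # RSP is deliberately left alone: the kernel must switch stacks itself.


cpu = Cpu(rip=0x401000, rsp=0x7FFD_0000_1000, rflags=0x246)
syscall(cpu, lstar=0xFFFF_FFFF_8100_0000, star=0x10 << 32, fmask=RFLAGS_IF)
print(hex(cpu.rip), hex(cpu.rcx), hex(cpu.r11), hex(cpu.rflags), cpu.cpl)
```

Note that `rsp` still holds the user value after the call, which is why the first instructions of a SYSCALL entry path are so sensitive.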

The Atomicity Guarantee

This entire sequence is atomic—it cannot be interrupted halfway. If an interrupt arrives during the trap sequence, it is held pending until the trap completes. This atomicity is crucial:

It prevents race conditions where an interrupt handler might find the CPU in an inconsistent state
It ensures the kernel entry code always starts from a well-defined state
It simplifies reasoning about security properties

The atomicity is implemented in microcode or hardwired logic, not in software. It's a fundamental property of the trap instruction itself.

Trap Return Instructions: SYSRET, ERET, and SRET

Trap Return Instructions by Architecture
Architecture	Instruction	PC Source	Flags Source	Stack Behavior
x86-64	`SYSRET`	RCX	R11	RSP must be set by kernel
x86 (legacy)	`IRET`	Stack pop	Stack pop	Full stack restore
ARM64	`ERET`	ELR_ELx	SPSR_ELx	Automatic SP switching
RISC-V	`SRET`/`MRET`	sepc/mepc	sstatus/mstatus	sscratch swap pattern

SYSRET Behavior (x86-64)

Pseudocode

SYSRET:
    // Verify we're in Ring 0
    IF CPL != 0:
        RAISE #GP(0)
    
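    // Load user segments from MSR (64-bit return; RPL is forced to 3)
    CS.selector ← (IA32_STAR[63:48] + 16) OR 3  // User CS
    SS.selector ← (IA32_STAR[63:48] + 8) OR 3   // User SS
    
    // Restore user state from registers
    // (Intel CPUs raise #GP(0) here, still at CPL 0, if RCX is non-canonical)
    RIP ← RCX         // Return address (saved by SYSCALL)
    RFLAGS ← R11      // Flags (saved by SYSCALL)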
    
    // Change privilege level
    CPL ← 3           // Back to user mode
    
    // Execution continues at user RIP
    // RSP must have been restored by kernel before SYSRET
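
The selector arithmetic in the pseudocode is easy to get wrong by hand, so here is a small worked example. It assumes a 64-bit return and uses 0x23 as an illustrative value for IA32_STAR[63:48]; the function name is hypothetical.

```python
def sysret_user_selectors(star: int) -> tuple:
    """Derive the user CS and SS selectors that a 64-bit SYSRET loads."""
    base = (star >> 48) & 0xFFFF
    user_cs = ((base + 16) | 3) & 0xFFFF  # RPL forced to 3
    user_ss = ((base + 8) | 3) & 0xFFFF
    return user_cs, user_ss


star = 0x23 << 48  # illustrative MSR value; only bits 63:48 matter here
cs, ss = sysret_user_selectors(star)
print(hex(cs), hex(ss))
```

Because both selectors come from one MSR field, the GDT must place the user data segment 8 bytes above the base selector and the 64-bit user code segment 8 bytes above that.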

SYSRET Security Vulnerability

Kernel Responsibilities Before Return

Before executing the return instruction, the kernel must:

Restore user registers: General purpose registers that were saved on entry must be restored from the kernel stack.
Set return values: System call results (typically in RAX on x86, X0 on ARM) must be set before return.
Restore user stack pointer: On x86-64, RSP must be explicitly loaded with the user's stack pointer.
Handle signals: If signals are pending, the kernel may need to divert return to a signal handler instead.
Update flags: SYSRET loads RFLAGS from R11, so any change the kernel wants user code to see (for example, the carry flag that some BSD-derived ABIs use to signal an error) must be made to the saved R11 value.
Re-enable interrupts: If interrupts were disabled during system call handling, they must be re-enabled (or the return instruction does this automatically).

Performance: INT vs. SYSCALL

The evolution from INT-based to SYSCALL-based system calls was driven by performance requirements. Understanding the differences explains why modern systems use dedicated instructions:

INT 0x80 vs. SYSCALL Overhead Breakdown
Operation	INT 0x80	SYSCALL	Notes
IDT lookup	~20 cycles	N/A	SYSCALL uses MSR directly
Gate validation	~10 cycles	N/A	No gate for SYSCALL
Stack switching	~30 cycles (auto)	~5 cycles (manual)	Microcode (INT) vs. kernel code (SYSCALL)
State save to stack	~50 cycles	~5 cycles (to regs)	Full push vs. RCX and R11 only
Segment reload	~20 cycles	~5 cycles	Descriptor read from memory vs. fixed values
Pipeline flush	~50 cycles	~30 cycles	Both require flush
Total (approx)	~180-300 cycles	~50-80 cycles	3-4x improvement

Why INT Is Slower

Memory accesses: INT must read the IDT entry from memory (even if cached), validate its type and privilege, and read the segment descriptor. SYSCALL uses pre-loaded MSR values.
Automatic stack push: INT automatically pushes SS, RSP, RFLAGS, CS, and RIP to the new stack. SYSCALL saves only RIP and RFLAGS to registers—no memory writes.
Segment descriptor lookups: INT reloads segment selectors, requiring descriptor table lookups. SYSCALL still loads selectors but uses a simpler path.
Flexibility overhead: INT is a general-purpose mechanism for hundreds of interrupt types. SYSCALL is optimized specifically for the user→kernel→user transition.

Modern Impact (with KPTI)

With Kernel Page Table Isolation (KPTI) mitigating Meltdown, system call overhead has increased again. Each transition now requires:

Page table switch (CR3 reload) on entry
TLB flush (or PCID switch) on entry
Reverse on exit

This adds ~50-100 cycles per system call, which shrinks the relative advantage of SYSCALL over INT. The absolute saving is unchanged, however, because the KPTI overhead applies equally to both paths, so SYSCALL remains the faster option.
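
A quick calculation makes the effect concrete. The figures below are the midpoints of the approximate ranges quoted in this section (240 cycles for INT, 65 for SYSCALL, 75 for KPTI), so the results are illustrative rather than measured.

```python
def speedup(int_cycles: float, syscall_cycles: float, extra: float = 0) -> float:
    """Ratio of INT cost to SYSCALL cost when both pay the same extra overhead."""
    return (int_cycles + extra) / (syscall_cycles + extra)


INT_CYCLES = 240      # midpoint of ~180-300
SYSCALL_CYCLES = 65   # midpoint of ~50-80
KPTI_CYCLES = 75      # midpoint of ~50-100

without_kpti = speedup(INT_CYCLES, SYSCALL_CYCLES)
with_kpti = speedup(INT_CYCLES, SYSCALL_CYCLES, KPTI_CYCLES)
saved = INT_CYCLES - SYSCALL_CYCLES  # absolute saving, identical in both cases

print(f"without KPTI: {without_kpti:.2f}x, with KPTI: {with_kpti:.2f}x, saved: {saved} cycles")
```

With these inputs the ratio drops from roughly 3.7x to 2.25x while the absolute saving stays at 175 cycles per call.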

Measuring System Call Latency
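
One rough way to observe system call cost from user space is to time a cheap call in a tight loop and compare it with an empty function call. The sketch below does this with `os.getppid`, which enters the kernel on each invocation on typical systems. Interpreter overhead dominates the absolute numbers, so treat the difference as an order-of-magnitude indication only; a C or assembly harness with cycle counters would be needed for figures comparable to the table above.

```python
import os
import time


def time_per_call(fn, iterations: int = 100_000) -> float:
    """Average wall-clock nanoseconds per call of fn over a tight loop."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    return (time.perf_counter_ns() - start) / iterations


def noop():
    """Baseline: a call that never leaves user space."""
    return None


baseline_ns = time_per_call(noop)
syscall_ns = time_per_call(os.getppid)

print(f"baseline: {baseline_ns:.0f} ns/call")
print(f"getppid:  {syscall_ns:.0f} ns/call")
print(f"approximate kernel round trip: {syscall_ns - baseline_ns:.0f} ns")
```

Running the loop several times and taking the minimum reduces noise from scheduling and frequency scaling.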

Security Implications of Trap Instructions

Trap instructions are the boundary between trusted and untrusted code. Any security flaw in the trap mechanism or its handlers can lead to complete system compromise:

Historical Vulnerabilities

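•SYSRET Non-Canonical RIP (CVE-2012-0217): On Intel CPUs, SYSRET raises #GP while still at CPL 0 if RCX holds a non-canonical address, after the kernel has already restored the user's RSP (and typically the user GS base). The fault handler then runs in kernel mode on an attacker-controlled stack. Led to privilege escalation on FreeBSD, Xen, and Windows.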
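•Spectre/Meltdown (2018): Speculative execution can leak kernel memory across the trap boundary. Meltdown lets user code speculatively read kernel mappings present in its page tables; Spectre lets user code mistrain branch predictors so that kernel code reached through a system call speculatively executes attacker-chosen gadgets.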
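•SWAPGS Side Channel (CVE-2019-1125): The conditional SWAPGS at kernel entry can be speculatively skipped or executed on the wrong path, so subsequent GS-relative loads use a user-controlled base and leak kernel data. Mitigated in software by adding speculation barriers (LFENCE) to the affected entry paths.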
•Stack Pivot Attacks: If kernel entry code doesn't immediately switch to a safe stack, attackers can arrange for the user stack to be paged out, causing a page fault handler to run on a corrupt stack.
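
The SYSRET issue above hinges on whether the return address is canonical, meaning that all bits above the implemented virtual address width are copies of the top implemented bit. A minimal sketch of that test follows, assuming 48-bit virtual addresses by default. The `return_path` helper is hypothetical and much simpler than the checks a real kernel performs before choosing SYSRET.

```python
def is_canonical(addr: int, va_bits: int = 48) -> bool:
    """True if bits 63..(va_bits-1) of a 64-bit address are all 0 or all 1."""
    upper = (addr & 0xFFFF_FFFF_FFFF_FFFF) >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return upper == 0 or upper == all_ones


def return_path(user_rip: int) -> str:
    """Pick a return instruction: fall back to IRET when SYSRET would fault."""
    return "sysret" if is_canonical(user_rip) else "iret"


for addr in (0x0000_7FFF_FFFF_FFFF, 0x0000_8000_0000_0000, 0xFFFF_8000_0000_0000):
    print(hex(addr), is_canonical(addr), return_path(addr))
```

The first non-canonical address, 0x0000800000000000, sits directly above the top of the lower half, which is why an instruction placed at the very end of user address space is the classic trigger.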

Modern Mitigations

Operating systems deploy multiple mitigations to protect the trap boundary:

Mitigation	Protection	Overhead
KPTI	Unmaps kernel from user page tables	~50-100 cycles/syscall
SMEP	Prevents kernel executing user code	Minimal (bit in CR4)
SMAP	Prevents kernel accessing user memory (unless explicitly allowed via STAC/CLAC)	Minimal (~2 cycles)
KASLR	Randomizes kernel addresses	Minimal (boot-time)
Retpoline	Mitigates Spectre-BTB	~5% overall
IBPB/IBRS	Hardware Spectre mitigations	Variable (~5-10%)

Defense in Depth

No single mitigation is sufficient. Modern kernels combine:

Hardware protections (SMEP/SMAP, CET)
Isolation (KPTI, KASLR)
Speculation barriers (LFENCE, retpolines)
Stack protections (stack canaries, guard pages)
Control flow integrity (CFI, shadow call stacks)

The Trap Entry Code Is Critical Path

Summary: The Deliberate Doorway

We've thoroughly examined the trap instruction—the deliberate mechanism by which user code requests kernel services. Let's consolidate our understanding:

Key Takeaways

•Traps are intentional — Unlike faults (errors) or interrupts (asynchronous), traps are deliberately executed by user code to request services.
•Multiple trap mechanisms exist — x86 provides INT, SYSENTER, and SYSCALL with different performance characteristics. ARM uses SVC, RISC-V uses ECALL.
•State preservation varies — Different trap instructions save different amounts of state automatically. SYSCALL saves minimal state (RCX, R11), while INT pushes a full frame to the stack.
•Stack handling is critical — Some traps (INT) switch stacks automatically; others (SYSCALL) do not. The kernel must handle this safely.
•Performance motivated evolution — SYSCALL is ~3-4x faster than INT due to reduced memory access and simpler microcode path.
•Security is paramount — The trap entry point is the most security-sensitive code in the kernel. Vulnerabilities here lead to complete system compromise.
