When the syscall instruction executes, something extraordinary happens: the CPU changes identity. In a matter of nanoseconds, the processor transitions from operating on behalf of an unprivileged user application to executing the most protected code in the system—the operating system kernel.
This transition is called a context switch, though for system calls it is more precisely a mode switch or kernel entry: the CPU changes privilege level and control transfers to the kernel, but the process's address space remains in place.
This page dissects the context switch mechanism at the hardware and software levels, revealing exactly what happens in those critical nanoseconds.
By the end of this page, you will understand the complete flow of a syscall context switch—from the moment the syscall instruction executes through CPU state capture, kernel stack setup, and entry into the system call handler. You'll know exactly what the hardware does automatically versus what the kernel must do in software.
Before diving into context switching mechanics, we must understand the privilege model that makes context switching necessary.
x86 Protection Rings:
Intel's x86 architecture defines four privilege levels, called "rings": Ring 0 is the most privileged, Rings 1 and 2 are intermediate levels, and Ring 3 is the least privileged.
In practice, modern operating systems use only Ring 0 (kernel) and Ring 3 (user). The unused rings were intended for systems like OS/2 but never gained traction.
What does privilege level determine?
The current privilege level (CPL), stored in the lowest two bits of the CS register, controls which instructions may execute, which memory is accessible, and whether direct I/O is permitted.
The two-ring model persists because modern OS designs don't need intermediate privilege levels. Device drivers run in Ring 0 with the kernel (monolithic design) or in Ring 3 as user-space servers (microkernel design). The intermediate rings would require complex inter-ring call gates that add overhead without clear benefit.
| Capability | Ring 0 (Kernel) | Ring 3 (User) |
|---|---|---|
| Execute privileged instructions | ✓ Yes | ✗ No (GP fault) |
| Access I/O ports directly | ✓ Yes | ✗ No (unless IOPL allows) |
| Modify page tables | ✓ Yes | ✗ No |
| Disable interrupts | ✓ Yes (CLI/STI) | ✗ No |
| Access all physical memory | ✓ Yes | ✗ No (only mapped pages) |
| Load special registers (GDT, IDT, etc.) | ✓ Yes | ✗ No |
| Execute user memory | ✓ Yes (if SMEP disabled) | ✓ Yes |
| Read user memory | ✓ Yes (if SMAP disabled) | ✓ Yes |
The syscall instruction is the fast system call entry mechanism on x86-64. It was introduced because the older int 0x80 mechanism (software interrupt) was too slow for the syscall-heavy workloads of modern systems.
What syscall does (hardware-automated):
When the CPU executes syscall, the following happens atomically, without software intervention:
Critical: What syscall does NOT do:
After syscall executes, the CPU is in kernel mode but RSP still points to the user stack! This is a security-critical moment—the kernel must immediately switch to a kernel stack before doing anything that uses the stack. Using the user stack would be a privilege escalation vulnerability.
```
; What the CPU does when 'syscall' executes (x86-64)
; This is hardware behavior, not code you write

; Step 1: Save return address (next instruction's address)
RCX ← RIP                          ; RCX = address to return to

; Step 2: Save flags
R11 ← RFLAGS                       ; R11 = saved flags

; Step 3: Load kernel segment selectors
; STAR MSR format: [32:47] = SYSCALL CS (SS = CS+8), [48:63] = SYSRET CS base
CS ← STAR[47:32]                   ; Usually 0x10 (kernel code segment)
SS ← STAR[47:32] + 8               ; Usually 0x18 (kernel data segment)

; Step 4: Transfer to kernel handler
RIP ← LSTAR                        ; Jump to syscall entry point

; Step 5: Clear flags per mask
RFLAGS ← RFLAGS AND NOT(SFMASK)    ; Usually clears IF (disables interrupts)

; Step 6: Privilege level change (implicit with CS load)
CPL ← 0                            ; Now in Ring 0

; CRITICAL: RSP is UNCHANGED - still points to user stack!
; The kernel entry code must fix this immediately
```

The MSR Configuration:
During boot, the kernel configures several Model-Specific Registers (MSRs) that control syscall behavior:
| MSR Name | Address | Purpose | Typical Value |
|---|---|---|---|
| STAR | 0xC0000081 | Segment selectors for syscall/sysret | 0x0023001000000000 |
| LSTAR | 0xC0000082 | Kernel RIP for syscall (Long mode) | Address of entry_SYSCALL_64 |
| CSTAR | 0xC0000083 | Kernel RIP for syscall (compat mode) | Address of entry_SYSCALL_compat |
| SFMASK | 0xC0000084 | RFLAGS bits to clear on syscall | 0x47700 (clears TF, IF, DF, IOPL, NT, AC) |
```c
/* Linux kernel: arch/x86/kernel/cpu/common.c */
/* Setting up MSRs for syscall during boot */

void syscall_init(void)
{
	/* STAR MSR: Set up segment selectors
	 * Bits 32-47: Kernel CS (0x10) for syscall
	 * Bits 48-63: User CS (0x23) for sysret
	 * (SS is derived as CS+8 for both)
	 */
	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);

	/* LSTAR: Kernel entry point for 64-bit syscalls */
	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

	/* CSTAR: Entry point for 32-bit syscalls in long mode */
	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);

	/* SFMASK: Flags to clear on syscall entry
	 * X86_EFLAGS_TF:   Trap flag (single-step debugging)
	 * X86_EFLAGS_DF:   Direction flag (string operations)
	 * X86_EFLAGS_IF:   Interrupt flag (disable interrupts)
	 * X86_EFLAGS_IOPL: I/O privilege level
	 * X86_EFLAGS_AC:   Alignment check
	 * X86_EFLAGS_NT:   Nested task
	 */
	wrmsrl(MSR_SYSCALL_MASK,
	       X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_IF |
	       X86_EFLAGS_IOPL | X86_EFLAGS_AC | X86_EFLAGS_NT);
}
```

The syscall instruction jumps to the address in the LSTAR MSR—the kernel's syscall entry point. This code operates under the most stringent requirements in the entire kernel: it cannot touch the stack until a kernel stack is installed, and it must preserve every user register.
On Linux, this entry point is called entry_SYSCALL_64, implemented in assembly:
```asm
/* Linux kernel: arch/x86/entry/entry_64.S (simplified) */
/* This is the actual syscall entry point */

SYM_CODE_START(entry_SYSCALL_64)
	/* At this point:
	 * - We're in Ring 0 (kernel mode)
	 * - RCX = user RIP (saved by hardware)
	 * - R11 = user RFLAGS (saved by hardware)
	 * - RAX = syscall number
	 * - RDI, RSI, RDX, R10, R8, R9 = syscall arguments
	 * - RSP = user stack pointer (UNTRUSTED!)
	 */

	/* CRITICAL: First instructions must not use the stack.
	 * Use the per-CPU scratch area to temporarily store user RSP. */
	swapgs                      /* Load kernel GS base (per-CPU data) */
	movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)   /* Save user RSP */

	/* Load the kernel stack pointer from per-CPU data:
	 * the top of the current task's kernel stack */
	movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

	/* Now we have a valid kernel stack. Build the pt_regs
	 * structure by pushing the saved user state. */

	/* Push the iret-frame portion: SS, RSP (from TSS_sp2), RFLAGS, CS, RIP */
	pushq $__USER_DS                            /* User SS */
	pushq PER_CPU_VAR(cpu_tss_rw + TSS_sp2)     /* User RSP */
	pushq %r11                                  /* User RFLAGS */
	pushq $__USER_CS                            /* User CS */
	pushq %rcx                                  /* User RIP */

	/* Syscall number, preserved for restart after signals */
	pushq %rax          /* pt_regs->orig_ax */

	/* Push the general-purpose registers */
	pushq %rdi          /* Argument 1 */
	pushq %rsi          /* Argument 2 */
	pushq %rdx          /* Argument 3 */
	pushq %rcx          /* (Clobbered by syscall: holds user RIP) */
	pushq $-ENOSYS      /* pt_regs->ax: placeholder return value */
	pushq %r8           /* Argument 5 */
	pushq %r9           /* Argument 6 */
	pushq %r10          /* Argument 4 */
	pushq %r11          /* (Clobbered by syscall: holds user RFLAGS) */
	pushq %rbx          /* Callee-saved */
	pushq %rbp          /* Callee-saved */
	pushq %r12          /* Callee-saved */
	pushq %r13          /* Callee-saved */
	pushq %r14          /* Callee-saved */
	pushq %r15          /* Callee-saved */

	/* The stack now contains a complete pt_regs structure.
	 * RSP points to it - it becomes the argument to the C handler */
	movq %rsp, %rdi     /* pt_regs* as first argument */
	call do_syscall_64  /* Call the C handler */

	/* ... syscall return path continues ... */
SYM_CODE_END(entry_SYSCALL_64)
```

swapgs exchanges the current GS base (the user's) with the kernel's GS base stored in an MSR. This gives the kernel access to per-CPU data (including the kernel stack pointer) without using any general-purpose registers. It's the key to the stack switch.
The stack switch explained:
The sequence swapgs → movq %rsp, PER_CPU_VAR(...) → movq PER_CPU_VAR(...), %rsp is the critical stack switch:
1. swapgs — GS now references per-CPU kernel data
2. movq %rsp, PER_CPU_VAR(...) — the user RSP is stored in a per-CPU scratch slot
3. movq PER_CPU_VAR(...), %rsp — RSP is loaded with the top of this task's kernel stack

After these three instructions, we have a valid kernel stack and the user's RSP is safely stored. Function calls are now possible.
After the entry trampoline pushes all registers, the kernel stack contains a complete snapshot of the user's CPU state. This is the pt_regs structure—the fundamental representation of saved process state in Linux.
Why pt_regs matters: it is the kernel's authoritative record of user state. Syscall handlers read arguments from it, the return value is written into it, ptrace and signal delivery inspect and modify it, and the exit path restores user execution from it.
```c
/* Linux kernel: arch/x86/include/asm/ptrace.h */

struct pt_regs {
	/* Pushed by entry code in reverse order (growing down) */

	/* C ABI callee-saved registers */
	unsigned long r15;
	unsigned long r14;
	unsigned long r13;
	unsigned long r12;
	unsigned long bp;   /* Frame pointer */
	unsigned long bx;

	/* These are clobbered by syscall, but saved anyway */
	unsigned long r11;
	unsigned long r10;

	/* Syscall arguments (some overlap with above) */
	unsigned long r9;   /* Argument 6 */
	unsigned long r8;   /* Argument 5 */
	unsigned long ax;   /* Syscall number / return value */
	unsigned long cx;   /* Clobbered by syscall (user RIP) */
	unsigned long dx;   /* Argument 3 */
	unsigned long si;   /* Argument 2 */
	unsigned long di;   /* Argument 1 */

	/* Syscall metadata */
	unsigned long orig_ax;  /* Original syscall number (for restart) */

	/* Instruction pointer - pushed as part of iret frame */
	unsigned long ip;       /* User RIP */
	unsigned long cs;       /* User CS */
	unsigned long flags;    /* User RFLAGS */
	unsigned long sp;       /* User RSP */
	unsigned long ss;       /* User SS */
};

/* Accessing syscall arguments from pt_regs */
static inline unsigned long syscall_arg1(struct pt_regs *regs)
{
	return regs->di;
}

static inline unsigned long syscall_arg2(struct pt_regs *regs)
{
	return regs->si;
}

/* ... and so on for all 6 arguments */

/* Get/set return value */
static inline void syscall_set_return_value(
	struct pt_regs *regs, int error, long val)
{
	if (error) {
		regs->ax = -error;  /* Negative errno */
	} else {
		regs->ax = val;     /* Success value */
	}
}
```

Layout on the kernel stack:
After the entry trampoline completes, the kernel stack looks like:
```
Higher addresses (stack grows down)
┌──────────────────────────────────┐
│ (Top of kernel stack)            │
├──────────────────────────────────┤
│ SS      (user stack segment)     │ ← pt_regs + 0xa0
├──────────────────────────────────┤
│ RSP     (user stack pointer)     │ ← pt_regs + 0x98
├──────────────────────────────────┤
│ RFLAGS  (user flags)             │ ← pt_regs + 0x90
├──────────────────────────────────┤
│ CS      (user code segment)      │ ← pt_regs + 0x88
├──────────────────────────────────┤
│ RIP     (user instruction ptr)   │ ← pt_regs + 0x80
├──────────────────────────────────┤
│ orig_ax (syscall number)         │ ← pt_regs + 0x78
├──────────────────────────────────┤
│ di      (argument 1)             │ ← pt_regs + 0x70
├──────────────────────────────────┤
│ ... remaining registers ...      │
├──────────────────────────────────┤
│ r15     (last saved register)    │ ← pt_regs + 0x00 = RSP now points here
├──────────────────────────────────┤
│ (Local variables, call frames)   │ ← Stack grows into here
└──────────────────────────────────┘
Lower addresses
```

The pt_regs structure has both orig_ax and ax. orig_ax preserves the original syscall number (needed to restart the syscall after signals), while ax holds the current return value. During execution, the kernel writes the return value to ax; on signal restart, orig_ax tells it which syscall to re-execute.
Every executing thread in Linux has not one but several stacks associated with it. Understanding this multi-stack architecture is essential for grasping syscall context switches.
Stack types in Linux x86-64:
| Stack Type | Size | Allocation | Purpose |
|---|---|---|---|
| User stack | 8MB (typical) | Per-process, grows down from high addresses | User-mode function calls, local variables |
| Kernel stack | 16KB (THREAD_SIZE) | Per-task (process/thread) | Kernel mode execution for this task |
| IRQ stack | 16KB | Per-CPU | Hardware interrupt handling |
| IST1 (Double Fault) | 4KB | Per-CPU | Double fault handling |
| IST2 (NMI) | 4KB | Per-CPU | Non-maskable interrupt handling |
| IST3 (Debug) | 4KB | Per-CPU | Debug exception handling |
| IST4 (MCE) | 4KB | Per-CPU | Machine check exception |
Why separate kernel stacks per task?
Each task (process or thread) has its own kernel stack because a task can block in the middle of a syscall: its in-progress kernel call chain must survive on its stack until the task is rescheduled. Meanwhile, other tasks enter the kernel concurrently, each needing a private stack for its own call chain.
```c
/* Linux kernel: include/linux/sched.h and related (simplified) */

#define THREAD_SIZE (16 * 1024)  /* 16KB kernel stack */

/* Every task_struct has an associated kernel stack */
struct task_struct {
	/* ... many fields ... */
	void *stack;   /* Points to the stack allocation */
	/* ... many more fields ... */
};

/* Get current task's kernel stack top */
static inline void *task_stack_page(struct task_struct *task)
{
	return task->stack;
}

/* Initialize kernel stack for new task */
static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
{
	struct task_struct *tsk;
	unsigned long *stack;

	/* Allocate task_struct */
	tsk = alloc_task_struct();

	/* Allocate kernel stack (16KB aligned) on the given NUMA node */
	stack = alloc_thread_stack_node(tsk, node);
	tsk->stack = stack;

	/* Set up initial pt_regs at top of stack */
	/* ... */

	return tsk;
}

/* The stack layout at task creation:
 *
 * High address (stack top)
 * ┌────────────────────┐
 * │ pt_regs            │ ← Initial saved state
 * ├────────────────────┤
 * │                    │
 * │ (unused)           │ ← Stack grows downward into this
 * │                    │
 * ├────────────────────┤
 * │ thread_info        │ ← Task metadata at stack base (historical)
 * └────────────────────┘
 * Low address (stack base)
 */
```

Kernel stacks are small (16KB) and have guard pages. A kernel stack overflow is catastrophic—it corrupts the thread_info structure or adjacent memory. Deep recursion in kernel code is dangerous. The kernel uses static analysis and runtime checks (VMAP_STACK) to detect overflows.
After the syscall handler completes, the kernel must return to user space, reversing everything the entry path did: registers are restored from pt_regs, the GS base and stack are switched back to the user's, and the CPU drops to Ring 3.
The sysret instruction is the fast-path return mechanism, paired with syscall:
```asm
/* Linux kernel: arch/x86/entry/entry_64.S (heavily simplified) */
/* Returning from syscall to user space */

SYM_CODE_START(entry_SYSCALL_64_return_path)
	/* At this point:
	 *   RSP points to pt_regs on the kernel stack
	 *   pt_regs->ax contains the return value
	 */

	/* Load the registers sysret consumes (offsets into pt_regs) */
	movq (16*8)(%rsp), %rcx    /* pt_regs->ip    → RCX (user RIP) */
	movq (18*8)(%rsp), %r11    /* pt_regs->flags → R11 (user RFLAGS) */

	/* Restore callee-saved registers */
	popq %r15
	popq %r14
	popq %r13
	popq %r12
	popq %rbp
	popq %rbx

	/* Restore the remaining general-purpose registers */
	addq $8, %rsp              /* Skip r11 slot (R11 holds user RFLAGS) */
	popq %r10
	popq %r9
	popq %r8
	popq %rax                  /* Return value */
	addq $8, %rsp              /* Skip cx slot (RCX holds user RIP) */
	popq %rdx
	popq %rsi
	popq %rdi
	addq $8, %rsp              /* Skip orig_ax */

	/* Switch back to the user GS base */
	swapgs

	/* Load user RSP last - after this, the kernel stack is gone */
	movq (3*8)(%rsp), %rsp     /* pt_regs->sp → user RSP */

	/* sysret does the reverse of syscall:
	 * - Load RIP from RCX
	 * - Load RFLAGS from R11
	 * - Set CPL to 3 (Ring 3)
	 * - Load user CS and SS selectors from the STAR MSR
	 */
	sysretq

	/* User execution resumes at RCX (original RIP + instruction length) */
SYM_CODE_END(entry_SYSCALL_64_return_path)
```

What sysret does (hardware-automated): it loads RIP from RCX and RFLAGS from R11, sets CPL to 3, and loads the user CS and SS selectors from the STAR MSR—the exact inverse of syscall.
When sysret can't be used:
The sysret fast path has restrictions. The kernel falls back to iret (slower but more flexible) when the saved context doesn't match sysret's fixed assumptions—for example, when the saved RIP is non-canonical, when flags such as TF (single-step) or RF are set in the saved RFLAGS, or when CS/SS were modified (e.g., by ptrace or signal handling) and no longer match the values sysret hardwires from STAR.
A famous vulnerability exists with sysret: if the return RIP in RCX is non-canonical (not a valid user address), the CPU raises a general protection fault. But by then, the kernel has already loaded user's RSP and switched to Ring 3. The GP fault handler runs in Ring 0 but with a user-controlled RSP! This led to CVE-2014-4699 on older kernels. Modern kernels validate RCX before sysret.
| Aspect | sysret | iret |
|---|---|---|
| Speed | ~20 cycles | ~40 cycles |
| What it restores | RIP (from RCX), RFLAGS (from R11), CS, SS | RIP, CS, RFLAGS, RSP, SS from stack |
| Flexibility | Fixed register sources | All from stack (more flexible) |
| When used | Fast path return | Signal delivery, ptrace, fallback |
| Security concern | Non-canonical RCX vulnerability | None (fully controlled by kernel) |
System call context switches are the performance-critical path in any operating system. Applications make millions of syscalls per second. Every optimization matters.
Historical evolution:
| Mechanism | Era | Approximate Latency | Notes |
|---|---|---|---|
| int 0x80 | 1980s-2000s | ~350 cycles | Software interrupt, very general but slow |
| sysenter/sysexit | Pentium II (1997) | ~150 cycles | Intel's fast syscall, complex setup |
| syscall/sysret | AMD K6 (1998), x86-64 | ~50-100 cycles | AMD's fast syscall, became x86-64 standard |
| vDSO (no syscall) | Linux 2.6 (2003) | ~5-10 cycles | No kernel entry at all for certain calls |
Modern performance on x86-64:
On a modern Intel/AMD processor, the fixed cost of a null system call breaks down roughly as follows: the syscall entry transition (~20-30 cycles), register save and stack switch (~30-50 cycles), dispatch plus a trivial handler (~50-100 cycles), and the return via sysret (~50-70 cycles).
Total overhead for a null syscall: ~150-250 cycles ≈ 50-83ns on a 3GHz CPU
What hurts performance: cache and TLB pollution from switching between user and kernel working sets, KPTI page-table switches, speculative-execution mitigations, and workloads that issue many small syscalls instead of batching work.
Post-Spectre/Meltdown (2018), syscall latency increased by 30-100% depending on workload. The kernel does extra work to prevent speculative execution attacks: flushing branch prediction history, using retpolines for indirect calls, and separating user/kernel page tables. Security won, performance lost.
```c
/* Measure syscall overhead using getpid() */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

/* Read timestamp counter */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t start, end, total = 0;
    const int iterations = 1000000;

    /* Warm up */
    for (int i = 0; i < 10000; i++) {
        getpid();
    }

    /* Measure */
    for (int i = 0; i < iterations; i++) {
        start = rdtsc();
        getpid();   /* Simple syscall with minimal work */
        end = rdtsc();
        total += (end - start);
    }

    printf("Average syscall overhead: %.1f cycles\n",
           (double)total / iterations);

    /* Typical results:
     * Without mitigations: ~150 cycles
     * With KPTI: ~300-500 cycles
     * With full Spectre mitigations: ~400-700 cycles
     */
    return 0;
}
```

We've traced the complete journey of a system call context switch—one of the most intricate and performance-critical paths in any operating system. The key concepts: the hardware automates the privilege transition itself, software performs the stack switch via swapgs and captures state into pt_regs, and sysret reverses it all on the way back.
What's next:
The context switch lands us in the kernel with a pt_regs structure containing the syscall number and arguments. But how does the kernel actually handle the request? The next page explores the Kernel Handler—the dispatch mechanism that routes syscall numbers to specific handler functions, and how those handlers execute the requested operation.
You now understand the low-level mechanics of system call context switching—from the syscall instruction through kernel entry, state preservation, and return via sysret. This foundation prepares you to follow a syscall through the kernel's dispatch and handler mechanisms.