When the syscall instruction executes, something extraordinary happens: the CPU changes identity. In a matter of nanoseconds, the processor transitions from operating on behalf of an unprivileged user application to executing the most protected code in the system—the operating system kernel.
This transition is called a context switch, though for system calls it is more precisely a mode switch or kernel entry: the CPU changes privilege level and control transfers to the kernel, but the process's address space remains in place.
This page dissects the context switch mechanism at the hardware and software levels, revealing exactly what happens in those critical nanoseconds.
By the end of this page, you will understand the complete flow of a syscall context switch—from the moment the syscall instruction executes through CPU state capture, kernel stack setup, and entry into the system call handler. You'll know exactly what the hardware does automatically versus what the kernel must do in software.
Before diving into context switching mechanics, we must understand the privilege model that makes context switching necessary.
x86 Protection Rings:
Intel's x86 architecture defines four privilege levels, called "rings": Ring 0 is the most privileged, Rings 1 and 2 are intermediate levels, and Ring 3 is the least privileged.
In practice, modern operating systems use only Ring 0 (kernel) and Ring 3 (user). The unused rings were intended for systems like OS/2 but never gained traction.
What does privilege level determine?
The current privilege level (CPL), stored in the lowest two bits of the CS register, controls which instructions may execute, which memory is accessible, and whether direct I/O is permitted.
The two-ring model persists because modern OS designs don't need intermediate privilege levels. Device drivers run in Ring 0 with the kernel (monolithic design) or in Ring 3 as user-space servers (microkernel design). The intermediate rings would require complex inter-ring call gates that add overhead without clear benefit.
| Capability | Ring 0 (Kernel) | Ring 3 (User) |
|---|---|---|
| Execute privileged instructions | ✓ Yes | ✗ No (GP fault) |
| Access I/O ports directly | ✓ Yes | ✗ No (unless IOPL allows) |
| Modify page tables | ✓ Yes | ✗ No |
| Disable interrupts | ✓ Yes (CLI/STI) | ✗ No |
| Access all physical memory | ✓ Yes | ✗ No (only mapped pages) |
| Load special registers (GDT, IDT, etc.) | ✓ Yes | ✗ No |
| Execute user memory | ✓ Yes (if SMEP disabled) | ✓ Yes |
| Read user memory | ✓ Yes (if SMAP disabled) | ✓ Yes |
The syscall instruction is the fast system call entry mechanism on x86-64. It was introduced because the older int 0x80 mechanism (software interrupt) was too slow for the syscall-heavy workloads of modern systems.
What syscall does (hardware-automated):
When the CPU executes syscall, the following happens atomically, without software intervention:
Critical: What syscall does NOT do:
After syscall executes, the CPU is in kernel mode but RSP still points to the user stack! This is a security-critical moment—the kernel must immediately switch to a kernel stack before doing anything that uses the stack. Using the user stack would be a privilege escalation vulnerability.
```
; What the CPU does when 'syscall' executes (x86-64)
; This is hardware behavior, not code you write

; Step 1: Save return address (next instruction's address)
RCX ← RIP                          ; RCX = address to return to

; Step 2: Save flags
R11 ← RFLAGS                       ; R11 = saved flags

; Step 3: Load kernel segment selectors
; STAR MSR format: [32:47] = SYSCALL CS (SS = CS+8), [48:63] = SYSRET CS base
CS ← STAR[47:32]                   ; Usually 0x10 (kernel code segment)
SS ← STAR[47:32] + 8               ; Usually 0x18 (kernel data segment)

; Step 4: Transfer to kernel handler
RIP ← LSTAR                        ; Jump to syscall entry point

; Step 5: Clear flags per mask
RFLAGS ← RFLAGS AND NOT(SFMASK)    ; Usually clears IF (disables interrupts)

; Step 6: Privilege level change (implicit with CS load)
CPL ← 0                            ; Now in Ring 0

; CRITICAL: RSP is UNCHANGED - still points to user stack!
; The kernel entry code must fix this immediately
```

The MSR Configuration:
During boot, the kernel configures several Model-Specific Registers (MSRs) that control syscall behavior:
| MSR Name | Address | Purpose | Typical Value |
|---|---|---|---|
| STAR | 0xC0000081 | Segment selectors for syscall/sysret | 0x0023001000000000 |
| LSTAR | 0xC0000082 | Kernel RIP for syscall (Long mode) | Address of entry_SYSCALL_64 |
| CSTAR | 0xC0000083 | Kernel RIP for syscall (compat mode) | Address of entry_SYSCALL_compat |
| SFMASK | 0xC0000084 | RFLAGS bits to clear on syscall | 0x47700 (clears TF, IF, DF, IOPL, NT, AC) |
```c
/* Linux kernel: arch/x86/kernel/cpu/common.c */
/* Setting up MSRs for syscall during boot */

void syscall_init(void)
{
	/* STAR MSR: Set up segment selectors
	 * Bits 32-47: Kernel CS (0x10) for syscall
	 * Bits 48-63: User CS (0x23) for sysret
	 * (SS is derived as CS+8 for both)
	 */
	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);

	/* LSTAR: Kernel entry point for 64-bit syscalls */
	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

	/* CSTAR: Entry point for 32-bit syscalls in long mode */
	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);

	/* SFMASK: Flags to clear on syscall entry
	 * X86_EFLAGS_TF:   Trap flag (single-step debugging)
	 * X86_EFLAGS_DF:   Direction flag (string operations)
	 * X86_EFLAGS_IF:   Interrupt flag (disable interrupts)
	 * X86_EFLAGS_IOPL: I/O privilege level
	 * X86_EFLAGS_AC:   Alignment check
	 * X86_EFLAGS_NT:   Nested task
	 */
	wrmsrl(MSR_SYSCALL_MASK,
	       X86_EFLAGS_TF | X86_EFLAGS_DF | X86_EFLAGS_IF |
	       X86_EFLAGS_IOPL | X86_EFLAGS_AC | X86_EFLAGS_NT);
}
```

The syscall instruction jumps to the address in the LSTAR MSR—the kernel's syscall entry point. This code operates under the most stringent requirements in the entire kernel: it cannot touch the stack until a kernel stack is installed, and it must preserve every user register.
On Linux, this entry point is called entry_SYSCALL_64, implemented in assembly:
```asm
/* Linux kernel: arch/x86/entry/entry_64.S (simplified) */
/* This is the actual syscall entry point */

SYM_CODE_START(entry_SYSCALL_64)
	/* At this point:
	 * - We're in Ring 0 (kernel mode)
	 * - RCX = user RIP (saved by hardware)
	 * - R11 = user RFLAGS (saved by hardware)
	 * - RAX = syscall number
	 * - RDI, RSI, RDX, R10, R8, R9 = syscall arguments
	 * - RSP = user stack pointer (UNTRUSTED!)
	 */

	/* CRITICAL: First instructions must not use the stack.
	 * Use the per-CPU scratch area to temporarily store user RSP. */
	swapgs                      /* Load kernel GS base (per-CPU data) */
	movq %rsp, PER_CPU_VAR(cpu_tss_rw + TSS_sp2)   /* Save user RSP */

	/* Load the kernel stack pointer from per-CPU data:
	 * the top of the current task's kernel stack */
	movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

	/* Now we have a valid kernel stack. Build the pt_regs
	 * structure by pushing the saved user state. */

	/* Push the iret-frame portion: SS, RSP (from TSS_sp2), RFLAGS, CS, RIP */
	pushq $__USER_DS                            /* User SS */
	pushq PER_CPU_VAR(cpu_tss_rw + TSS_sp2)     /* User RSP */
	pushq %r11                                  /* User RFLAGS */
	pushq $__USER_CS                            /* User CS */
	pushq %rcx                                  /* User RIP */

	/* Syscall number, preserved for restart after signals */
	pushq %rax          /* pt_regs->orig_ax */

	/* Push the general-purpose registers */
	pushq %rdi          /* Argument 1 */
	pushq %rsi          /* Argument 2 */
	pushq %rdx          /* Argument 3 */
	pushq %rcx          /* (Clobbered by syscall: holds user RIP) */
	pushq $-ENOSYS      /* pt_regs->ax: placeholder return value */
	pushq %r8           /* Argument 5 */
	pushq %r9           /* Argument 6 */
	pushq %r10          /* Argument 4 */
	pushq %r11          /* (Clobbered by syscall: holds user RFLAGS) */
	pushq %rbx          /* Callee-saved */
	pushq %rbp          /* Callee-saved */
	pushq %r12          /* Callee-saved */
	pushq %r13          /* Callee-saved */
	pushq %r14          /* Callee-saved */
	pushq %r15          /* Callee-saved */

	/* The stack now contains a complete pt_regs structure.
	 * RSP points to it - it becomes the argument to the C handler */
	movq %rsp, %rdi     /* pt_regs* as first argument */
	call do_syscall_64  /* Call the C handler */

	/* ... syscall return path continues ... */
SYM_CODE_END(entry_SYSCALL_64)
```

swapgs exchanges the current GS base (the user's) with the kernel's GS base stored in an MSR. This gives the kernel access to per-CPU data (including the kernel stack pointer) without using any general-purpose registers. It's the key to the stack switch.
The stack switch explained:
The sequence swapgs → movq %rsp, PER_CPU_VAR(...) → movq PER_CPU_VAR(...), %rsp is the critical stack switch:
1. swapgs — GS now references per-CPU kernel data
2. movq %rsp, PER_CPU_VAR(...) — the user RSP is stored in a per-CPU scratch slot
3. movq PER_CPU_VAR(...), %rsp — RSP is loaded with the top of this task's kernel stack

After these three instructions, we have a valid kernel stack and the user's RSP is safely stored. Function calls are now possible.
After the entry trampoline pushes all registers, the kernel stack contains a complete snapshot of the user's CPU state. This is the pt_regs structure—the fundamental representation of saved process state in Linux.
Why pt_regs matters: it is the kernel's authoritative record of user state. Syscall handlers read arguments from it, the return value is written into it, ptrace and signal delivery inspect and modify it, and the exit path restores user execution from it.
```c
/* Linux kernel: arch/x86/include/asm/ptrace.h */

struct pt_regs {
	/* Pushed by entry code in reverse order (growing down) */

	/* C ABI callee-saved registers */
	unsigned long r15;
	unsigned long r14;
	unsigned long r13;
	unsigned long r12;
	unsigned long bp;   /* Frame pointer */
	unsigned long bx;

	/* These are clobbered by syscall, but saved anyway */
	unsigned long r11;
	unsigned long r10;

	/* Syscall arguments (some overlap with above) */
	unsigned long r9;   /* Argument 6 */
	unsigned long r8;   /* Argument 5 */
	unsigned long ax;   /* Syscall number / return value */
	unsigned long cx;   /* Clobbered by syscall (user RIP) */
	unsigned long dx;   /* Argument 3 */
	unsigned long si;   /* Argument 2 */
	unsigned long di;   /* Argument 1 */

	/* Syscall metadata */
	unsigned long orig_ax;  /* Original syscall number (for restart) */

	/* Instruction pointer - pushed as part of iret frame */
	unsigned long ip;       /* User RIP */
	unsigned long cs;       /* User CS */
	unsigned long flags;    /* User RFLAGS */
	unsigned long sp;       /* User RSP */
	unsigned long ss;       /* User SS */
};

/* Accessing syscall arguments from pt_regs */
static inline unsigned long syscall_arg1(struct pt_regs *regs)
{
	return regs->di;
}

static inline unsigned long syscall_arg2(struct pt_regs *regs)
{
	return regs->si;
}

/* ... and so on for all 6 arguments */

/* Get/set return value */
static inline void syscall_set_return_value(
	struct pt_regs *regs, int error, long val)
{
	if (error) {
		regs->ax = -error;  /* Negative errno */
	} else {
		regs->ax = val;     /* Success value */
	}
}
```

Layout on the kernel stack:
After the entry trampoline completes, the kernel stack looks like:
```
Higher addresses (stack grows down)
┌──────────────────────────────────┐
│ (Top of kernel stack)            │
├──────────────────────────────────┤
│ SS      (user stack segment)     │ ← pt_regs + 0xa0
├──────────────────────────────────┤
│ RSP     (user stack pointer)     │ ← pt_regs + 0x98
├──────────────────────────────────┤
│ RFLAGS  (user flags)             │ ← pt_regs + 0x90
├──────────────────────────────────┤
│ CS      (user code segment)      │ ← pt_regs + 0x88
├──────────────────────────────────┤
│ RIP     (user instruction ptr)   │ ← pt_regs + 0x80
├──────────────────────────────────┤
│ orig_ax (syscall number)         │ ← pt_regs + 0x78
├──────────────────────────────────┤
│ di      (argument 1)             │ ← pt_regs + 0x70
├──────────────────────────────────┤
│ ... remaining registers ...      │
├──────────────────────────────────┤
│ r15     (last saved register)    │ ← pt_regs + 0x00 = RSP now points here
├──────────────────────────────────┤
│ (Local variables, call frames)   │ ← Stack grows into here
└──────────────────────────────────┘
Lower addresses
```

The pt_regs structure has both orig_ax and ax. orig_ax preserves the original syscall number (needed to restart the syscall after signals), while ax holds the current return value. During execution, the kernel writes the return value to ax; on signal restart, orig_ax tells it which syscall to re-execute.
Every executing thread in Linux has not one but several stacks associated with it. Understanding this multi-stack architecture is essential for grasping syscall context switches.
Stack types in Linux x86-64:
| Stack Type | Size | Allocation | Purpose |
|---|---|---|---|
| User stack | 8MB (typical) | Per-process, grows down from high addresses | User-mode function calls, local variables |
| Kernel stack | 16KB (THREAD_SIZE) | Per-task (process/thread) | Kernel mode execution for this task |
| IRQ stack | 16KB | Per-CPU | Hardware interrupt handling |
| IST1 (Double Fault) | 4KB | Per-CPU | Double fault handling |
| IST2 (NMI) | 4KB | Per-CPU | Non-maskable interrupt handling |
| IST3 (Debug) | 4KB | Per-CPU | Debug exception handling |
| IST4 (MCE) | 4KB | Per-CPU | Machine check exception |
Why separate kernel stacks per task?
Each task (process or thread) has its own kernel stack because a task can block in the middle of a syscall: its in-progress kernel call chain must survive on its stack until the task is rescheduled. Meanwhile, other tasks enter the kernel concurrently, each needing a private stack for its own call chain.
```c
/* Linux kernel: include/linux/sched.h and related (simplified) */

#define THREAD_SIZE (16 * 1024)  /* 16KB kernel stack */

/* Every task_struct has an associated kernel stack */
struct task_struct {
	/* ... many fields ... */
	void *stack;   /* Points to the stack allocation */
	/* ... many more fields ... */
};

/* Get current task's kernel stack top */
static inline void *task_stack_page(struct task_struct *task)
{
	return task->stack;
}

/* Initialize kernel stack for new task */
static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
{
	struct task_struct *tsk;
	unsigned long *stack;

	/* Allocate task_struct */
	tsk = alloc_task_struct();

	/* Allocate kernel stack (16KB aligned) on the given NUMA node */
	stack = alloc_thread_stack_node(tsk, node);
	tsk->stack = stack;

	/* Set up initial pt_regs at top of stack */
	/* ... */

	return tsk;
}

/* The stack layout at task creation:
 *
 * High address (stack top)
 * ┌────────────────────┐
 * │ pt_regs            │ ← Initial saved state
 * ├────────────────────┤
 * │                    │
 * │ (unused)           │ ← Stack grows downward into this
 * │                    │
 * ├────────────────────┤
 * │ thread_info        │ ← Task metadata at stack base (historical)
 * └────────────────────┘
 * Low address (stack base)
 */
```

Kernel stacks are small (16KB) and have guard pages. A kernel stack overflow is catastrophic—it corrupts the thread_info structure or adjacent memory. Deep recursion in kernel code is dangerous. The kernel uses static analysis and runtime checks (VMAP_STACK) to detect overflows.
After the syscall handler completes, the kernel must return to user space, reversing everything the entry path did: registers are restored from pt_regs, the GS base and stack are switched back to the user's, and the CPU drops to Ring 3.
The sysret instruction is the fast-path return mechanism, paired with syscall:
```asm
/* Linux kernel: arch/x86/entry/entry_64.S (heavily simplified) */
/* Returning from syscall to user space */

SYM_CODE_START(entry_SYSCALL_64_return_path)
	/* At this point:
	 *   RSP points to pt_regs on the kernel stack
	 *   pt_regs->ax contains the return value
	 */

	/* Load the registers sysret consumes (offsets into pt_regs) */
	movq (16*8)(%rsp), %rcx    /* pt_regs->ip    → RCX (user RIP) */
	movq (18*8)(%rsp), %r11    /* pt_regs->flags → R11 (user RFLAGS) */

	/* Restore callee-saved registers */
	popq %r15
	popq %r14
	popq %r13
	popq %r12
	popq %rbp
	popq %rbx

	/* Restore the remaining general-purpose registers */
	addq $8, %rsp              /* Skip r11 slot (R11 holds user RFLAGS) */
	popq %r10
	popq %r9
	popq %r8
	popq %rax                  /* Return value */
	addq $8, %rsp              /* Skip cx slot (RCX holds user RIP) */
	popq %rdx
	popq %rsi
	popq %rdi
	addq $8, %rsp              /* Skip orig_ax */

	/* Switch back to the user GS base */
	swapgs

	/* Load user RSP last - after this, the kernel stack is gone */
	movq (3*8)(%rsp), %rsp     /* pt_regs->sp → user RSP */

	/* sysret does the reverse of syscall:
	 * - Load RIP from RCX
	 * - Load RFLAGS from R11
	 * - Set CPL to 3 (Ring 3)
	 * - Load user CS and SS selectors from the STAR MSR
	 */
	sysretq

	/* User execution resumes at RCX (original RIP + instruction length) */
SYM_CODE_END(entry_SYSCALL_64_return_path)
```

What sysret does (hardware-automated): it loads RIP from RCX and RFLAGS from R11, sets CPL to 3, and loads the user CS and SS selectors from the STAR MSR—the exact inverse of syscall.
When sysret can't be used:
The sysret fast path has restrictions. The kernel falls back to iret (slower but more flexible) when the saved context doesn't match sysret's fixed assumptions—for example, when the saved RIP is non-canonical, when flags such as TF (single-step) or RF are set in the saved RFLAGS, or when CS/SS were modified (e.g., by ptrace or signal handling) and no longer match the values sysret hardwires from STAR.
A famous vulnerability exists with sysret: if the return RIP in RCX is non-canonical (not a valid user address), the CPU raises a general protection fault. But by then, the kernel has already loaded user's RSP and switched to Ring 3. The GP fault handler runs in Ring 0 but with a user-controlled RSP! This led to CVE-2014-4699 on older kernels. Modern kernels validate RCX before sysret.
| Aspect | sysret | iret |
|---|---|---|
| Speed | ~20 cycles | ~40 cycles |
| What it restores | RIP (from RCX), RFLAGS (from R11), CS, SS | RIP, CS, RFLAGS, RSP, SS from stack |
| Flexibility | Fixed register sources | All from stack (more flexible) |
| When used | Fast path return | Signal delivery, ptrace, fallback |
| Security concern | Non-canonical RCX vulnerability | None (fully controlled by kernel) |
System call context switches are the performance-critical path in any operating system. Applications make millions of syscalls per second. Every optimization matters.
Historical evolution:
| Mechanism | Era | Approximate Latency | Notes |
|---|---|---|---|
| int 0x80 | 1980s-2000s | ~350 cycles | Software interrupt, very general but slow |
| sysenter/sysexit | Pentium II (1997) | ~150 cycles | Intel's fast syscall, complex setup |
| syscall/sysret | AMD K6 (1998), x86-64 | ~50-100 cycles | AMD's fast syscall, became x86-64 standard |
| vDSO (no syscall) | Linux 2.6 (2003) | ~5-10 cycles | No kernel entry at all for certain calls |
Modern performance on x86-64:
On a modern Intel/AMD processor, the fixed cost of a null system call breaks down roughly as follows: the syscall entry transition (~20-30 cycles), register save and stack switch (~30-50 cycles), dispatch plus a trivial handler (~50-100 cycles), and the return via sysret (~50-70 cycles).
Total overhead for a null syscall: ~150-250 cycles ≈ 50-83ns on a 3GHz CPU
What hurts performance: cache and TLB pollution from switching between user and kernel working sets, KPTI page-table switches, speculative-execution mitigations, and workloads that issue many small syscalls instead of batching work.
Post-Spectre/Meltdown (2018), syscall latency increased by 30-100% depending on workload. The kernel does extra work to prevent speculative execution attacks: flushing branch prediction history, using retpolines for indirect calls, and separating user/kernel page tables. Security won, performance lost.
```c
/* Measure syscall overhead using getpid() */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

/* Read timestamp counter */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ volatile ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t start, end, total = 0;
    const int iterations = 1000000;

    /* Warm up */
    for (int i = 0; i < 10000; i++) {
        getpid();
    }

    /* Measure */
    for (int i = 0; i < iterations; i++) {
        start = rdtsc();
        getpid();   /* Simple syscall with minimal work */
        end = rdtsc();
        total += (end - start);
    }

    printf("Average syscall overhead: %.1f cycles\n",
           (double)total / iterations);

    /* Typical results:
     * Without mitigations: ~150 cycles
     * With KPTI: ~300-500 cycles
     * With full Spectre mitigations: ~400-700 cycles
     */
    return 0;
}
```

We've traced the complete journey of a system call context switch—one of the most intricate and performance-critical paths in any operating system. The key concepts: the hardware automates the privilege transition itself, software performs the stack switch via swapgs and captures state into pt_regs, and sysret reverses it all on the way back.
What's next:
The context switch lands us in the kernel with a pt_regs structure containing the syscall number and arguments. But how does the kernel actually handle the request? The next page explores the Kernel Handler—the dispatch mechanism that routes syscall numbers to specific handler functions, and how those handlers execute the requested operation.
You now understand the low-level mechanics of system call context switching—from the syscall instruction through kernel entry, state preservation, and return via sysret. This foundation prepares you to follow a syscall through the kernel's dispatch and handler mechanisms.