After the context switch completes, the kernel is ready to act. The CPU is in Ring 0, the kernel stack holds the user's saved state in pt_regs, and the syscall number sits in a register. But the kernel supports hundreds of different system calls, each requiring different parameters and performing different operations.
How does the kernel route syscall #0 to the read() implementation, #1 to write(), #2 to open(), and so on? The answer is the kernel handler layer—a sophisticated dispatch mechanism that translates syscall numbers into function calls.
This page explores the complete flow from syscall number to handler execution, including the syscall table architecture, dispatch routines, function pointer invocation, and how the kernel maintains this mapping across hundreds of system calls and multiple architectures.
By the end of this page, you will understand how the kernel's syscall dispatch works—from the C entry point receiving pt_regs through the syscall table lookup to individual handler execution. You'll know how syscall tables are generated, how the kernel validates syscall numbers, and how handler functions access their arguments.
At the heart of syscall dispatch is a simple data structure: an array of function pointers, indexed by syscall number. This is the syscall table (or system call table).
The fundamental idea:
handler = syscall_table[syscall_number];
result = handler(arguments...);
The syscall number (in RAX) becomes an array index. The table lookup retrieves a function pointer. The kernel calls that function. Done.
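The lookup-and-call idea can be tried in user space. The following is a minimal sketch, not kernel code: `toy_regs`, `toy_table`, `toy_dispatch`, and both handlers are hypothetical names invented for illustration, and the table has only four slots.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's pt_regs: only what the toy needs. */
struct toy_regs {
    long di, si, dx;   /* first three argument registers */
};

typedef long (*toy_call_ptr_t)(const struct toy_regs *);

/* Default handler for numbers that have no implementation. */
static long toy_ni_syscall(const struct toy_regs *regs)
{
    (void)regs;
    return -ENOSYS;
}

/* Two made-up handlers so the table has something to dispatch to. */
static long toy_add(const struct toy_regs *regs) { return regs->di + regs->si; }
static long toy_mul(const struct toy_regs *regs) { return regs->di * regs->si; }

#define TOY_NR_MAX 3u

/* The table: an array of function pointers indexed by call number. */
static const toy_call_ptr_t toy_table[TOY_NR_MAX + 1] = {
    [0] = toy_add,
    [1] = toy_mul,
    [2] = toy_ni_syscall,
    [3] = toy_ni_syscall,
};

/* Validate the number, fetch the pointer, call it. */
long toy_dispatch(unsigned int nr, const struct toy_regs *regs)
{
    if (nr > TOY_NR_MAX)
        return -ENOSYS;
    return toy_table[nr](regs);
}
```

Calling `toy_dispatch(1, &regs)` with `di = 6` and `si = 7` goes through `toy_table[1]` and runs `toy_mul`. Any number past the end of the table yields `-ENOSYS` without touching the array.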
Reality adds complexity:

    /* Linux kernel: arch/x86/entry/syscall_64.c */

    /* Type for syscall handler functions */
    typedef long (*sys_call_ptr_t)(const struct pt_regs *);

    /* The syscall table: array of function pointers */
    const sys_call_ptr_t sys_call_table[__NR_syscall_max + 1] = {
        /*
         * This array is populated by including a generated file.
         * Each entry maps a syscall number to its handler.
         */
        [0 ... __NR_syscall_max] = &__x64_sys_ni_syscall,  /* Default: not implemented */
    #include <asm/syscalls_64.h>  /* Generated: populates with real handlers */
    };

    /* The generated syscalls_64.h contains entries like:
     *
     * [0] = __x64_sys_read,    // read()
     * [1] = __x64_sys_write,   // write()
     * [2] = __x64_sys_open,    // open()
     * [3] = __x64_sys_close,   // close()
     * [4] = __x64_sys_stat,    // stat()
     * ... hundreds more ...
     * [334] = __x64_sys_rseq,  // rseq()
     */

    /* Not-implemented stub for undefined syscall numbers */
    asmlinkage long __x64_sys_ni_syscall(const struct pt_regs *regs)
    {
        return -ENOSYS;  /* Function not implemented */
    }

Table generation pipeline:
The syscall table isn't hardcoded—it's generated during kernel build from a declarative specification:
| File | Purpose | Example Content |
|---|---|---|
| syscall_64.tbl | Human-readable syscall definitions | 0 common read sys_read |
| syscalltbl.sh | Script to process .tbl files | Parses table, generates headers |
| syscalls_64.h | Generated header with table entries | [0] = __x64_sys_read, |
| syscall_64.c | Includes generated header, defines table | const sys_call_ptr_t sys_call_table[] |
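To make the transformation concrete, here is a small user-space sketch of the per-line step. It is an assumption-laden illustration: the real generator is a shell script that also handles the ABI column and compat entry points, and `tbl_line_to_entry` is a hypothetical name. The output format simply follows the example content in the table above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Turn one line of a .tbl file, e.g. "0 common read sys_read", into a
 * table initializer such as "[0] = __x64_sys_read,".
 * Returns 0 on success, -1 for comments, blank lines, or malformed input.
 */
int tbl_line_to_entry(const char *line, char *out, size_t outlen)
{
    unsigned int nr;
    char abi[16], name[64], entry[64];

    /* Skip comments and empty lines. */
    if (line[0] == '#' || line[0] == '\n' || line[0] == '\0')
        return -1;

    /* Expect: <number> <abi> <name> <entry point> */
    if (sscanf(line, "%u %15s %63s %63s", &nr, abi, name, entry) != 4)
        return -1;

    /* This sketch only accepts entry points written as sys_<something>. */
    if (strncmp(entry, "sys_", 4) != 0)
        return -1;

    /* Prefix the entry point to form the x86-64 wrapper name. */
    snprintf(out, outlen, "[%u] = __x64_%s,", nr, entry);
    return 0;
}
```

Feeding it `"4 common stat sys_newstat"` produces `[4] = __x64_sys_newstat,`, which shows that the table index comes from the first column and the handler name from the fourth.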
12345678910111213141516171819202122232425262728293031323334353637383940
# Linux kernel: arch/x86/entry/syscalls/syscall_64.tbl# Format: <number> <abi> <name> <entry point> [<compat entry point>]## - number: syscall number (assigned sequentially, never reused)# - abi: "common" (both 64-bit and 32-bit), "64" (64-bit only), "x32" (x32 ABI)# - name: symbolic name used for macros# - entry point: kernel function implementing this syscall # Core file operations0 common read sys_read1 common write sys_write2 common open sys_open3 common close sys_close4 common stat sys_newstat5 common fstat sys_newfstat6 common lstat sys_newlstat7 common poll sys_poll # Memory mapping9 common mmap sys_mmap10 common mprotect sys_mprotect11 common munmap sys_munmap12 common brk sys_brk # Process control56 common clone sys_clone57 common fork sys_fork58 common vfork sys_vfork59 common execve sys_execve60 common exit sys_exit61 common wait4 sys_wait462 common kill sys_kill # ... continues for 300+ syscalls ... # Recent additions (Linux 5.x-6.x)434 common pidfd_open sys_pidfd_open435 common clone3 sys_clone3439 common faccessat2 sys_faccessat2448 common process_mrelease sys_process_mreleaseOnce a syscall number is assigned, it NEVER changes. This is part of the kernel's stable ABI promise. Syscall #0 has been read() since the earliest Linux versions and will remain so forever. New syscalls get new numbers. Old syscalls may be deprecated but their numbers are never reassigned.
The assembly entry code (entry_SYSCALL_64) calls a C function to perform the actual dispatch. This function, do_syscall_64(), is the heart of syscall handling:

    /* Linux kernel: arch/x86/entry/common.c (simplified) */

    /* Look up and call the handler; returns false if nr is out of range */
    static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
    {
        /*
         * Compare as unsigned so that a negative nr fails the range check.
         * NR_syscalls is one more than the highest valid syscall number.
         */
        unsigned int unr = nr;

        if (likely(unr < NR_syscalls)) {
            /* Clamp the index under speculation (Spectre v1 hardening) */
            unr = array_index_nospec(unr, NR_syscalls);
            /*
             * Actually call the handler!
             * sys_call_table[unr] is a function pointer.
             * The function takes pt_regs* and returns long.
             */
            regs->ax = sys_call_table[unr](regs);
            return true;
        }
        return false;
    }

    /* Main entry point for 64-bit syscalls */
    __visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
    {
        /* nr = syscall number from RAX (passed by assembly caller) */

        /* Randomize the kernel stack offset for this syscall (security hardening) */
        add_random_kstack_offset();

        /* Entry work (seccomp, ptrace, audit) may ask us to skip the call */
        if (!syscall_enter_from_user_mode(regs))
            return;

        /* A tracer may have rewritten the syscall number; read it again */
        nr = syscall_get_nr(current, regs);

        if (!do_syscall_x64(regs, nr)) {
            /* Invalid syscall number */
            regs->ax = __x64_sys_ni_syscall(regs);
        }
    }

    /*
     * The return value ends up in regs->ax.
     * When we return to user space (via sysret or iret),
     * the assembly code will:
     * 1. Restore registers from pt_regs
     * 2. This includes RAX from regs->ax
     * 3. User sees the return value in RAX
     */

Security hooks in the dispatch path:
The dispatch function isn't a straight-line path. The kernel checks several security mechanisms before executing the syscall:

    /*
     * Simplified syscall entry with security checks.
     * Returns false if the syscall should be skipped. (The real kernel
     * returns the possibly rewritten syscall number, or -1 to skip.)
     */
    static bool syscall_enter_from_user_mode(struct pt_regs *regs)
    {
        unsigned long work = READ_ONCE(current_thread_info()->syscall_work);

        if (work & SYSCALL_WORK_ENTER) {
            /* There's work to do before the syscall */

            /* Handle ptrace syscall-enter-stop */
            if (work & SYSCALL_WORK_SYSCALL_TRACE) {
                if (tracehook_report_syscall_entry(regs)) {
                    /* Tracer said to skip this syscall */
                    return false;
                }
            }

            /* Check seccomp filters after ptrace, to catch tracer changes */
            if (work & SYSCALL_WORK_SECCOMP) {
                if (__secure_computing(NULL) == -1) {
                    /* Seccomp denied this syscall */
                    return false;
                }
            }

            /* Audit the syscall entry */
            if (work & SYSCALL_WORK_SYSCALL_AUDIT) {
                unsigned long args[6];

                syscall_get_arguments(current, regs, args);
                audit_syscall_entry(syscall_get_nr(current, regs),
                                    args[0], args[1], args[2], args[3]);
            }
        }

        return true;  /* Proceed with syscall */
    }

The likely() macro around the range check in the dispatch function tells the compiler that the condition is almost always true, so it can lay out the code with the valid-number case as the fall-through path. Invalid syscall numbers are rare; the hot path should be optimized for valid numbers.
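One detail of the range check deserves a closer look: when the syscall number is held in a signed `int`, a plain `nr < limit` comparison accepts negative values. The sketch below contrasts the two comparisons in user space. `TOY_NR_SYSCALLS` and both function names are hypothetical, chosen only for this illustration.

```c
#include <stdbool.h>

#define TOY_NR_SYSCALLS 4u   /* hypothetical table size */

/* Pitfall: a negative nr passes a signed comparison. */
bool nr_valid_signed(int nr)
{
    return nr < (int)TOY_NR_SYSCALLS;
}

/* Converting to unsigned first turns negative values into huge ones. */
bool nr_valid_unsigned(int nr)
{
    unsigned int unr = (unsigned int)nr;

    return unr < TOY_NR_SYSCALLS;
}
```

With `nr == -1`, the signed check returns true and would allow an out-of-bounds table index, while the unsigned check rejects it. A single unsigned comparison therefore covers both "too large" and "negative".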
Syscall handlers in modern Linux follow a specific calling convention. Understanding this convention is essential for reading kernel code and implementing new syscalls.
The modern approach: pt_regs-based handlers
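Since Linux 4.17, x86-64 syscall handlers receive a single argument: a pointer to pt_regs. Previously, all six argument registers were passed directly to every handler, whether it used them or not. With the pt_regs convention, each handler reads only the registers it actually needs, which keeps unused, user-controlled register values out of the call chain and reduces the attack surface for speculative-execution attacks.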

    /* Modern syscall handler signature (Linux 4.17+) */

    /* The handler receives pt_regs and extracts arguments from it */
    asmlinkage long __x64_sys_read(const struct pt_regs *regs)
    {
        /* Extract arguments from pt_regs */
        unsigned int fd = (unsigned int)regs->di;    /* First argument: rdi */
        char __user *buf = (char __user *)regs->si;  /* Second: rsi */
        size_t count = regs->dx;                     /* Third: rdx */

        /* Call the common implementation */
        return ksys_read(fd, buf, count);
    }

    /* Illustrative helper macros for extracting arguments */
    #define SC_ARG0(regs) ((regs)->di)   /* Arg 1 */
    #define SC_ARG1(regs) ((regs)->si)   /* Arg 2 */
    #define SC_ARG2(regs) ((regs)->dx)   /* Arg 3 */
    #define SC_ARG3(regs) ((regs)->r10)  /* Arg 4 - note: r10, not rcx! */
    #define SC_ARG4(regs) ((regs)->r8)   /* Arg 5 */
    #define SC_ARG5(regs) ((regs)->r9)   /* Arg 6 */

    /* The common implementation does the real work (abridged) */
    ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
    {
        struct fd f = fdget_pos(fd);
        ssize_t ret = -EBADF;

        if (!f.file)
            return ret;

        /* Actually perform the read at the file's current position... */
        ret = vfs_read(f.file, buf, count, file_ppos(f.file));

        fdput_pos(f);
        return ret;
    }

The SYSCALL_DEFINE macros:
Writing the argument extraction code manually is tedious and error-prone. Linux provides macros that generate the boilerplate:

    /* Linux kernel: include/linux/syscalls.h */

    /* SYSCALL_DEFINE3: Define a syscall with 3 arguments
     * The number suffix indicates argument count (0-6)
     */
    #define SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3) \
        __SYSCALL_DEFINEx(3, _##name, t1, a1, t2, a2, t3, a3)

    /* This generates:
     * 1. Prototype for __x64_sys_<name>(const struct pt_regs *)
     * 2. Static inline __do_sys_<name>(t1 a1, t2 a2, t3 a3)
     * 3. __x64_sys wrapper that extracts args and calls __do_sys
     */

    /* Example: Defining the read() syscall */
    /* fs/read_write.c */

    SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
    {
        /* This becomes the body of __do_sys_read(fd, buf, count) */
        return ksys_read(fd, buf, count);
    }

    /* The macro expands to something like: */
    static inline long __do_sys_read(unsigned int fd, char __user *buf, size_t count);

    asmlinkage long __x64_sys_read(const struct pt_regs *regs)
    {
        return __do_sys_read(
            (unsigned int)SC_ARG0(regs),   /* fd from rdi */
            (char __user *)SC_ARG1(regs),  /* buf from rsi */
            (size_t)SC_ARG2(regs)          /* count from rdx */
        );
    }

    static inline long __do_sys_read(unsigned int fd, char __user *buf, size_t count)
    {
        return ksys_read(fd, buf, count);
    }

The __user annotation marks pointers that point to user space memory. This enables sparse (a static analysis tool) to catch bugs where kernel code dereferences user pointers directly instead of using copy_from_user()/copy_to_user(). Direct access to __user pointers is a security vulnerability.
| Macro | Arguments | Use Case |
|---|---|---|
| SYSCALL_DEFINE0(name) | 0 | getpid(), getuid(), fork() |
| SYSCALL_DEFINE1(name, t1, a1) | 1 | close(fd), exit(status) |
| SYSCALL_DEFINE2(name, t1, a1, t2, a2) | 2 | creat(), access() |
| SYSCALL_DEFINE3(name, ...) | 3 | read(), write(), open() |
| SYSCALL_DEFINE4(name, ...) | 4 | ptrace(), reboot() |
| SYSCALL_DEFINE5(name, ...) | 5 | select(), mount() |
| SYSCALL_DEFINE6(name, ...) | 6 | mmap(), futex() |
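The wrapper-generating pattern can be imitated in ordinary C. The macro below is a simplified sketch under stated assumptions, not the kernel's implementation: `toy_regs`, `TOY_SYSCALL_DEFINE3`, and the `clamp` call are hypothetical, and only the three-argument case is covered.

```c
#include <stddef.h>

/* Hypothetical stand-in for pt_regs with the six x86-64 argument registers. */
struct toy_regs {
    unsigned long di, si, dx, r10, r8, r9;
};

/*
 * TOY_SYSCALL_DEFINE3 declares the typed body, emits a wrapper that pulls
 * the arguments out of toy_regs and casts them, then opens the definition
 * of the typed body so the braces that follow the macro become its code.
 */
#define TOY_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3)        \
    static long toy_do_##name(t1 a1, t2 a2, t3 a3);              \
    long toy_sys_##name(const struct toy_regs *regs)             \
    {                                                            \
        return toy_do_##name((t1)regs->di, (t2)regs->si,         \
                             (t3)regs->dx);                      \
    }                                                            \
    static long toy_do_##name(t1 a1, t2 a2, t3 a3)

/* A made-up three-argument call: clamp value into the range [lo, hi]. */
TOY_SYSCALL_DEFINE3(clamp, long, value, long, lo, long, hi)
{
    if (value < lo)
        return lo;
    if (value > hi)
        return hi;
    return value;
}
```

The author of `clamp` writes only the typed body; the register extraction lives in `toy_sys_clamp`, which is the function that a dispatch table would point to.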
Once the handler function is called, it executes like any other kernel function—with full Ring 0 privileges. However, syscall handlers have specific patterns and constraints:
What handlers can do:
What handlers must be careful about:

    /* Tracing through sys_read: from syscall to disk */

    /* Step 1: Entry wrapper (generated by SYSCALL_DEFINE3) */
    SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
    {
        return ksys_read(fd, buf, count);
    }

    /* Step 2: Common kernel helper */
    ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
    {
        struct fd f = fdget_pos(fd);  /* Look up file descriptor */
        ssize_t ret = -EBADF;

        if (f.file) {
            loff_t pos, *ppos = file_ppos(f.file);
            if (ppos) {
                pos = *ppos;
                ppos = &pos;
            }
            ret = vfs_read(f.file, buf, count, ppos);  /* Call VFS layer */
            if (ret >= 0 && ppos)
                f.file->f_pos = pos;
            fdput_pos(f);
        }
        return ret;
    }

    /* Step 3: VFS (Virtual File System) read */
    ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
    {
        ssize_t ret;

        /* Validate the file allows reading */
        if (!(file->f_mode & FMODE_READ))
            return -EBADF;
        if (!(file->f_mode & FMODE_CAN_READ))
            return -EINVAL;

        /* Check access_ok() for buffer */
        if (!access_ok(buf, count))
            return -EFAULT;

        /* Call file-specific read operation */
        if (file->f_op->read)
            ret = file->f_op->read(file, buf, count, pos);
        else if (file->f_op->read_iter)
            ret = new_sync_read(file, buf, count, pos);
        else
            ret = -EINVAL;

        return ret;
    }

    /* Step 4: Filesystem-specific read (e.g., ext4) */
    /* This eventually calls the block layer and disk driver */

Notice how sys_read() doesn't know about ext4, NFS, or procfs. It calls vfs_read(), which uses the file->f_op function pointer table to call the right filesystem's read implementation. This abstraction allows one syscall to work with hundreds of filesystem types.
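The f_op indirection is ordinary function-pointer dispatch, and a small user-space model may help make it concrete. Everything here is hypothetical (`toy_file`, `toy_file_ops`, the two "filesystems"); it sketches the pattern only and does not reflect the kernel's actual structures.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Per-file-type operations, in the spirit of struct file_operations. */
struct toy_file_ops {
    long (*read)(struct toy_file *f, char *buf, size_t count);
};

struct toy_file {
    const struct toy_file_ops *f_op;
    const char *data;   /* backing bytes for the in-memory type */
    size_t pos;         /* current read offset */
};

/* "Filesystem" 1: reads from an in-memory string. */
static long mem_read(struct toy_file *f, char *buf, size_t count)
{
    size_t left = strlen(f->data) - f->pos;
    size_t n = count < left ? count : left;

    memcpy(buf, f->data + f->pos, n);
    f->pos += n;
    return (long)n;
}

/* "Filesystem" 2: an endless source of zero bytes. */
static long zero_read(struct toy_file *f, char *buf, size_t count)
{
    (void)f;
    memset(buf, 0, count);
    return (long)count;
}

const struct toy_file_ops mem_ops  = { .read = mem_read };
const struct toy_file_ops zero_ops = { .read = zero_read };
const struct toy_file_ops no_ops   = { .read = NULL };

/* Generic layer: knows nothing about the concrete file type. */
long toy_vfs_read(struct toy_file *f, char *buf, size_t count)
{
    if (!f->f_op->read)
        return -EINVAL;
    return f->f_op->read(f, buf, count);
}
```

`toy_vfs_read()` is written once, and each file type supplies its own `read` through the operations table. Adding a third type means adding one more `toy_file_ops` instance, with no change to the generic layer.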
Handler call depth:
A syscall handler may call many kernel functions before completing. A typical read() on an ext4 file might traverse:

    __x64_sys_read → argument extraction
    ksys_read → fd lookup
    vfs_read → VFS layer dispatch
    ext4_file_read_iter → filesystem handling
    generic_file_buffered_read → page cache
    ext4_readpage → block layer
    submit_bio → block I/O submission
    scsi_queue_rq → SCSI driver (or nvme_queue_rq → NVMe driver, depending on the device)

...then the return path unwinds all of this back to user space.
Syscall handlers return a value that eventually reaches user space in the RAX register. The kernel uses a consistent convention:
Return value interpretation:
Different syscalls, different success values:
| Syscall | Success Return Value | Interpretation |
|---|---|---|
| read() | 0 to n | Number of bytes read (0 = EOF) |
| write() | 0 to n | Number of bytes written |
| open() | ≥ 0 | New file descriptor |
| close() | 0 | Success (no meaningful value) |
| fork() | > 0 or 0 | Child PID to parent, 0 to child |
| getpid() | > 0 | Process ID (never fails) |
| mmap() | Address | Pointer to mapped region |
| brk() | Address | New program break address |

    /* How the return value flows from handler to user */

    /* 1. Handler returns a long value */
    SYSCALL_DEFINE3(read, ...)
    {
        /* ... */
        if (error)
            return -EFAULT;   /* Returns -14 (negative errno) */
        return bytes_read;    /* Returns a non-negative count */
    }

    /* 2. Dispatch stores result in pt_regs->ax */
    static bool do_syscall_x64(struct pt_regs *regs, int nr)
    {
        /*
         * sys_call_table[nr](regs) returns the handler's result
         * This is stored in regs->ax
         */
        regs->ax = sys_call_table[nr](regs);
        return true;
    }

    /* 3. Assembly exit path restores RAX from pt_regs->ax */
    /*
     * movq OFFSET_AX(%rsp), %rax   ; Load saved ax
     * ...
     * sysretq                      ; Return to user
     */

    /* 4. User-space wrapper interprets RAX value */
    ssize_t read(int fd, void *buf, size_t count)
    {
        /*
         * raw_syscall3() is a hypothetical helper that executes the
         * syscall instruction and returns RAX unmodified. (glibc's own
         * syscall() already performs the errno conversion shown here.)
         */
        long ret = raw_syscall3(__NR_read, fd, buf, count);

        if (ret < 0 && ret > -4096) {
            errno = -ret;     /* Convert -14 to errno=14 */
            return -1;
        }
        return ret;
    }

Some syscalls return pointers (mmap, brk). Because errors are confined to the range [-4095, -1], every other value can be returned as a valid result, and no valid user-space mapping ever falls in that range. On error, the mmap syscall returns a negative errno like any other syscall; the glibc wrapper then sets errno and returns MAP_FAILED, which is (void *)-1.
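The range test can be written as a single unsigned comparison, which also works for syscalls that return addresses. The sketch below is a user-space illustration: `raw_is_error` and `raw_to_libc` are hypothetical names, and the 4095 limit is the error range stated above.

```c
#include <errno.h>

#define MAX_ERRNO 4095

/* True when a raw syscall return value lies in the error range [-4095, -1]. */
int raw_is_error(long raw)
{
    return (unsigned long)raw >= (unsigned long)-MAX_ERRNO;
}

/*
 * Convert a raw return value to the libc convention:
 * on error, store the positive code in *err and return -1;
 * otherwise return the value unchanged and leave *err alone.
 */
long raw_to_libc(long raw, int *err)
{
    if (raw_is_error(raw)) {
        *err = (int)-raw;
        return -1;
    }
    return raw;
}
```

A raw value of `-EFAULT` becomes a return of -1 with the error code set to `EFAULT`, while a byte count such as 832 passes through untouched. Values just below the error range, such as -4096, are not treated as errors, which is what lets pointer-returning syscalls share the same convention.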
Common error codes:
The kernel defines over a hundred error codes in include/uapi/asm-generic/errno-base.h and errno.h. The most frequently encountered:
| errno | Value | Meaning | Common Cause |
|---|---|---|---|
| EPERM | 1 | Operation not permitted | Lacks privilege/capability |
| ENOENT | 2 | No such file or directory | Path doesn't exist |
| ESRCH | 3 | No such process | PID doesn't exist |
| EINTR | 4 | Interrupted system call | Signal received during syscall |
| EIO | 5 | I/O error | Hardware or driver failure |
| EBADF | 9 | Bad file descriptor | fd not open or wrong mode |
| EAGAIN | 11 | Try again | Would block (non-blocking I/O) |
| ENOMEM | 12 | Out of memory | Allocation failed |
| EACCES | 13 | Permission denied | File permissions deny access |
| EFAULT | 14 | Bad address | Pointer outside address space |
| EINVAL | 22 | Invalid argument | Argument value is invalid |
| ENOSYS | 38 | Function not implemented | Syscall doesn't exist |
x86-64 Linux supports running 32-bit applications through compatibility mode. This introduces multiple syscall tables and dispatch paths:
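Three ABIs on x86-64: native 64-bit, 32-bit compatibility (entered via int 0x80 or compat mode), and x32 (64-bit registers with 32-bit pointers).

| Aspect | Native 64-bit | 32-bit Compat | x32 |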
|---|---|---|---|
| Entry instruction | syscall | int $0x80 | syscall |
| Syscall number | 64-bit table | 32-bit table | 64-bit + 0x40000000 |
| Pointer size | 64 bits | 32 bits | 32 bits |
| Register size | 64 bits | 32 bits (limited) | 64 bits |
| Table | sys_call_table | ia32_sys_call_table | sys_call_table |
| Entry point | entry_SYSCALL_64 | entry_INT80_compat | entry_SYSCALL_64 |

    /* Linux handles multiple ABIs with separate dispatch paths (simplified) */

    /* Native 64-bit: uses sys_call_table */
    void do_syscall_64(struct pt_regs *regs, int nr)
    {
        regs->ax = sys_call_table[nr](regs);  /* range check omitted for brevity */
    }

    /* 32-bit compatibility: uses ia32_sys_call_table */
    void do_int80_syscall_32(struct pt_regs *regs)
    {
        unsigned int nr = regs->orig_ax;  /* 32-bit syscall number from eax */

        if (nr < IA32_NR_syscalls) {
            regs->ax = ia32_sys_call_table[nr](regs);
        } else {
            regs->ax = -ENOSYS;
        }
    }

    /* The 32-bit table has different handlers that handle 32-bit semantics */
    const sys_call_ptr_t ia32_sys_call_table[] = {
        /* 32-bit syscall 0 is not read! It's restart_syscall */
        [0] = __ia32_sys_restart_syscall,
        [1] = __ia32_sys_exit,
        [2] = __ia32_sys_fork,
        [3] = __ia32_sys_read,   /* read() is syscall 3 in 32-bit! */
        [4] = __ia32_sys_write,
        /* ... different mapping from 64-bit ... */
    };

    /* Compat handlers may need to convert arguments */
    asmlinkage long __ia32_compat_sys_truncate(const struct pt_regs *regs)
    {
        /* 32-bit user passed a 32-bit pointer, we need to zero-extend */
        return ksys_truncate(compat_ptr(regs->bx), regs->cx);
    }

The 32-bit and 64-bit syscall tables have different numbering. For example, read() is syscall 0 on 64-bit but syscall 3 on 32-bit (inherited from i386 Linux). This is why you can't just use 64-bit syscall numbers from 32-bit code and vice versa.
Argument conversion for compat:
32-bit programs have 32-bit pointers. When they pass pointers to syscalls, the kernel must widen them before use: compat_ptr() converts a 32-bit user pointer into a full-width user pointer by zero-extension.

The compat_ prefixed functions handle these conversions throughout the kernel.
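Two of the conversions described in this section are simple bit operations, sketched below in user space. The function names are hypothetical; the 0x40000000 flag is the x32 offset listed in the ABI table, and zero-extension is what the compat_ptr() comment in the listing above describes.

```c
#include <stdint.h>

#define X32_SYSCALL_BIT 0x40000000u   /* x32 marker added to the 64-bit number */

/* Widen a 32-bit user pointer by zero-extension: upper 32 bits become 0. */
uint64_t widen_compat_ptr(uint32_t uptr)
{
    return (uint64_t)uptr;
}

/* What a sign-extending conversion would produce instead (the bug to avoid). */
uint64_t widen_sign_extended(uint32_t uptr)
{
    return (uint64_t)(int64_t)(int32_t)uptr;
}

/* Does this syscall number carry the x32 marker bit? */
int is_x32_nr(unsigned int nr)
{
    return (nr & X32_SYSCALL_BIT) != 0;
}

/* Strip the marker bit to recover the plain table index. */
unsigned int x32_index(unsigned int nr)
{
    return nr & ~X32_SYSCALL_BIT;
}
```

For a 32-bit pointer such as 0xfffff000, zero-extension keeps the address in the low 4 GiB, whereas sign-extension would turn it into an address in the upper half of the 64-bit space. That is why the widening has to be explicit rather than left to an ordinary integer cast.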
Understanding syscall dispatch is essential for debugging system-level issues. Several tools leverage this knowledge:
strace:
The strace utility uses ptrace(PTRACE_SYSCALL) to intercept every syscall a process makes:

    # Trace all syscalls for a command
    $ strace ls /tmp
    execve("/usr/bin/ls", ["ls", "/tmp"], 0x7ffd... /* 50 vars */) = 0
    brk(NULL) = 0x55ff0a240000
    access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file)
    openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
    read(3, "\177ELF\002\001\001\003..."..., 832) = 832
    # ... many more syscalls ...
    write(1, "file1.txt  file2.txt\n", 21) = 21
    close(1) = 0
    exit_group(0) = ?

    # Count syscall types
    $ strace -c ls /tmp
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
     26.45    0.000081          20         4           openat
     21.90    0.000067          16         4           mmap
     16.34    0.000050          12         4           close
      8.17    0.000025          12         2           read
    # ...

Kernel tracepoints:
The kernel has built-in tracepoints for syscall entry/exit that can be used with ftrace or perf:

    # List available syscall tracepoints
    $ ls /sys/kernel/debug/tracing/events/syscalls/
    sys_enter_read    sys_exit_read
    sys_enter_write   sys_exit_write
    # ... one pair per syscall ...

    # Trace all read() calls system-wide
    $ echo 1 > /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/enable
    $ cat /sys/kernel/debug/tracing/trace_pipe
        ls-12345 [000] .... 1234.567890: sys_read(fd: 3, buf: 7ffd..., count: 832)
      bash-54321 [001] .... 1234.567891: sys_read(fd: 0, buf: 7ffd..., count: 1)
    # ...

    # Use perf for detailed syscall analysis
    $ perf trace ls /tmp
    0.000 (0.012 ms): execve(filename: "/usr/bin/ls", argv: 0x7ffd...) = 0
    0.089 (0.002 ms): brk(brk: 0) = 0x55a9...
    0.093 (0.003 ms): access(filename: "/etc/ld.so.preload", mode: R) = -1 ENOENT
    # ...

Modern kernels support BPF (Berkeley Packet Filter) programs that can attach to syscall entry/exit points with very low overhead. Tools like bpftrace and bcc allow sophisticated analysis, for example:

    bpftrace -e 'tracepoint:syscalls:sys_enter_open { printf("%s opened %s\n", comm, str(args->filename)); }'

| Tool | Overhead | Scope | Best For |
|---|---|---|---|
| strace | High (ptrace) | Single process | Quick debugging, seeing all args/results |
| ltrace | High (ptrace) | Single process | Library calls + syscalls |
| ftrace | Low | System-wide | Kernel development, global patterns |
| perf trace | Low-Medium | Flexible | Performance analysis with syscall context |
| bpftrace/bcc | Very Low | Flexible | Production tracing, complex queries |
We've traced the complete path from syscall number to handler execution—the dispatch mechanism that makes all OS services accessible. Let's consolidate the key concepts:
What's next:
The handler receives arguments through pt_regs, but those arguments often include pointers to user-space memory. How does the kernel safely read and write user memory? The next page explores Parameter Validation—the critical security checks that prevent malicious or buggy user code from corrupting kernel state.
You now understand the kernel handler layer—from the syscall table through dispatch to individual handler execution. This knowledge enables you to read kernel syscall code, understand strace output, and reason about syscall behavior. Next, we'll examine the critical security barrier between user pointers and kernel operations.