When your application executes a trap instruction, it transitions to kernel mode—but the kernel still needs to know which service you're requesting. Are you trying to read a file? Create a new process? Allocate memory? Wait for a network packet?
This is where the system call number enters the picture. Before executing the trap, user code places a numeric identifier in a designated register or stack location. The kernel uses this number to index into a system call table, finding the appropriate handler function. This simple mechanism—a number mapping to a function—is the foundation of the entire system call API.
By the end of this page, you will understand how system call numbers identify kernel services, how system call tables are organized, the conventions for passing the call number, versioning challenges, and how operating systems maintain decades of backward compatibility through careful numbering discipline.
At the heart of system call dispatching is the system call table (syscall table)—an array of function pointers where each index corresponds to a system call number. When the kernel receives a system call, it uses the call number as an index into this table to find the handler function.
Conceptual Structure:
// Simplified representation
typedef long (*syscall_fn_t)(...);
// The system call table is essentially:
syscall_fn_t sys_call_table[] = {
[0] = sys_read, // read()
[1] = sys_write, // write()
[2] = sys_open, // open()
[3] = sys_close, // close()
// ... hundreds more entries ...
[n] = sys_new_fancy_syscall, // newest syscall
};
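When a user program requests system call 1, the kernel calls through sys_call_table[1], which points to sys_write(). Because every request goes through this one lookup, the kernel can validate the number and route the call along a single, uniform dispatch path.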
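The table lookup can be modeled in ordinary user-space C. The following is a minimal sketch, not kernel code: the handler names, the one-argument signature, and the three-entry table are all hypothetical, chosen only to show a number being used as an index into an array of function pointers.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler signature: every toy handler takes one argument. */
typedef long (*toy_syscall_fn)(long arg);

static long toy_double(long arg)    { return arg * 2; }
static long toy_negate(long arg)    { return -arg; }
static long toy_increment(long arg) { return arg + 1; }

/* The table: the array index is the call number. */
static const toy_syscall_fn toy_table[] = {
    [0] = toy_double,
    [1] = toy_negate,
    [2] = toy_increment,
};

#define TOY_NR_CALLS (sizeof(toy_table) / sizeof(toy_table[0]))

/* Dispatch: validate the number, then call through the table. */
long toy_dispatch(unsigned int nr, long arg)
{
    if (nr >= TOY_NR_CALLS)
        return -ENOSYS;   /* unknown call number */
    return toy_table[nr](arg);
}
```

Calling toy_dispatch(1, 5) selects toy_negate through the table, while any number past the end of the table is rejected before the array is touched.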
// arch/x86/entry/syscall_64.c (conceptual representation)
#include <asm/syscall.h>

// Define the table using the __SYSCALL macro
#define __SYSCALL(nr, sym) [nr] = __x64_##sym,

// The actual table - compiled from syscall definitions
const sys_call_ptr_t sys_call_table[] = {
    [0] = __x64_sys_read,
    [1] = __x64_sys_write,
    [2] = __x64_sys_open,
    [3] = __x64_sys_close,
    [4] = __x64_sys_stat,
    [5] = __x64_sys_fstat,
    [6] = __x64_sys_lstat,
    [7] = __x64_sys_poll,
    [8] = __x64_sys_lseek,
    [9] = __x64_sys_mmap,
    [10] = __x64_sys_mprotect,
    [11] = __x64_sys_munmap,
    [12] = __x64_sys_brk,
    // ... 330+ more entries on modern kernels ...
    [435] = __x64_sys_clone3,
    [436] = __x64_sys_close_range,
    // ... continues to grow ...
};

// Table size for bounds checking
const unsigned int NR_syscalls = ARRAY_SIZE(sys_call_table);

Each CPU architecture maintains its own system call table with potentially different numbers for the same functionality. Linux on x86-64 uses one numbering, ARM64 uses another. The POSIX API functions (read, write, etc.) provide a portable interface, but the underlying syscall numbers differ.
Different architectures use different conventions for passing the system call number from user space to the kernel. The number must be available to the kernel immediately after the trap instruction executes.
| Architecture | Register | Example Instruction | Notes |
|---|---|---|---|
| x86-64 (Linux) | RAX | mov rax, 1 | Number before SYSCALL |
| x86-32 (Linux) | EAX | mov eax, 4 | Number before INT 0x80 |
| ARM64 (Linux) | X8 | mov x8, #64 | Number before SVC #0 |
| ARM32 (Linux) | R7 | mov r7, #4 | Number before SWI 0 |
| RISC-V (Linux) | A7 | li a7, 64 | Number before ECALL |
| x86-64 (Windows) | RAX | mov rax, 0x50 | Number via NTDLL stub |
| PowerPC (Linux) | R0 | li r0, 4 | Number before sc |
| MIPS (Linux) | $v0 | li $v0, 4001 | O32 ABI, before SYSCALL |
Why a Register?
Using a register (rather than the stack) for the system call number has several advantages:
Speed: Register access is faster than memory access. The kernel can check the call number immediately without any loads.
Atomicity: The value is captured as part of the trap's atomic state save. On x86-64, RAX is preserved across the trap.
Simplicity: No stack manipulation is needed before the trap. This is especially important for SYSCALL, which doesn't automatically switch stacks.
Security: Stack-based passing would require reading user memory, which is susceptible to TOCTOU (time-of-check-time-of-use) attacks.
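On Linux, user code rarely writes the register setup by hand. The C library's syscall() function takes the call number as its first argument and places it in the register the architecture expects. The sketch below assumes Linux with glibc or a compatible libc; raw_getpid and raw_write are illustrative names, not library functions.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/syscall.h>   /* SYS_getpid, SYS_write */
#include <unistd.h>        /* syscall(), getpid(), pipe() */

/* Ask for the process ID by call number instead of by wrapper name. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* Write a buffer to a file descriptor by call number.
 * Returns the byte count, or -1 with errno set. */
long raw_write(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Because SYS_getpid and SYS_write come from the headers for the target architecture, the same source compiles to different numbers on x86-64 and ARM64 without any change.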
// Linux kernel: arch/x86/entry/common.c (simplified)
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
    // nr comes from RAX (saved in regs->orig_ax during entry)
    nr = syscall_enter_from_user_mode(regs, nr);

    // Check for valid system call number
    if (likely(nr < NR_syscalls)) {
        // Look up handler in table and call it
        regs->ax = sys_call_table[nr](regs);
    } else {
        // Invalid system call number
        regs->ax = -ENOSYS; // "Function not implemented"
    }

    syscall_exit_to_user_mode(regs);
}

// The regs structure contains all saved user registers:
// regs->di = first argument (RDI)
// regs->si = second argument (RSI)
// regs->dx = third argument (RDX)
// regs->r10 = fourth argument (R10, not RCX!)
// regs->r8 = fifth argument (R8)
// regs->r9 = sixth argument (R9)
// regs->orig_ax = system call number (RAX at entry)

In the x86-64 Linux ABI, RAX serves double duty: it holds the system call number on entry and the return value on exit. The kernel saves the original system call number to a separate field (orig_ax/orig_rax) so it can still identify the call even after setting the return value.
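The double duty of RAX can be shown with a two-field model. This is only a sketch: struct toy_regs and both functions are hypothetical stand-ins for the much larger saved-register frame.

```c
/* Toy stand-in for the saved register frame; only two fields are modeled. */
struct toy_regs {
    long ax;       /* call number on entry, return value on exit */
    long orig_ax;  /* copy of the call number, not overwritten on exit */
};

/* Entry path: remember which call was requested. */
void toy_syscall_entry(struct toy_regs *regs)
{
    regs->orig_ax = regs->ax;
}

/* Exit path: ax is reused to carry the result back to user space. */
void toy_set_return(struct toy_regs *regs, long result)
{
    regs->ax = result;
}
```

After both steps, ax holds the result while orig_ax still names the call. Debuggers that read the orig_rax field through ptrace, and the kernel's own logic for restarting an interrupted call, depend on that preserved copy.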
How are system call numbers assigned? The process is surprisingly deliberate, driven by historical compatibility and practical constraints:
Historical Assignment
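Early UNIX systems assigned numbers sequentially as calls were added, which is why the oldest calls have the lowest numbers in traditional tables (read is 3 and write is 4 in the original i386 numbering).
These original numbers have been maintained for decades to preserve binary compatibility. A program compiled decades ago should still run on today's kernel if it uses only standard system calls.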
Rules for Adding New System Calls
When a new system call is added to Linux:
Never reuse numbers: A number, once assigned, is never reassigned to a different call.
Append to the end: New calls get the next available number.
Reserve gaps carefully: Some ranges are reserved for future use or experimental calls.
Architecture consistency (when possible): While numbers differ across architectures, the semantic relationship should be consistent.
| Architecture | read() Number | Notes |
|---|---|---|
| x86-64 | 0 | Renumbered for 64-bit ABI |
| x86-32 | 3 | Original i386 numbering |
| ARM64 | 63 | Matches generic numbering |
| ARM32 (EABI) | 3 | Follows x86-32 tradition |
| RISC-V | 63 | Modern unified numbering |
| MIPS O32 | 4003 | Offset by 4000 |
| PowerPC | 3 | Follows UNIX tradition |
The Generic System Call Table
Newer architectures (ARM64, RISC-V) use a standardized generic system call table (defined in include/uapi/asm-generic/unistd.h) that aims to unify numbering across platforms. Older architectures retain their historical numbers for compatibility.
The generic table leaves out many legacy calls in favor of their modern replacements (by default it provides openat but not open) and maintains consistent numbering for all participating architectures. This simplifies cross-architecture development and makes it easier to add calls that work consistently everywhere.
On Linux, you can find system call numbers in /usr/include/asm/unistd_64.h (x86-64) or equivalent headers. The ausyscall --dump command (from audit tools) lists all calls and numbers for your architecture.
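The headers can also be queried from C. The sketch below assumes a Linux toolchain; it returns the number the installed headers assign to read() and compares it with the per-architecture values from the read() table earlier in this section. The function names are illustrative, and architectures the sketch does not cover report -1.

```c
#include <sys/syscall.h>   /* SYS_read for the architecture being compiled */

/* The number this toolchain's headers assign to read(). */
long read_syscall_number(void)
{
    return SYS_read;
}

/* The number listed for this architecture in the read() table,
 * or -1 if the architecture is not covered by this sketch. */
long expected_read_number(void)
{
#if defined(__x86_64__) && !defined(__ILP32__)
    return 0;    /* x86-64 */
#elif defined(__i386__)
    return 3;    /* x86-32 */
#elif defined(__aarch64__) || defined(__riscv)
    return 63;   /* generic table */
#elif defined(__arm__) || defined(__powerpc__)
    return 3;    /* ARM32 EABI, PowerPC */
#else
    return -1;   /* not covered here */
#endif
}
```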
User space can pass any value as a system call number. The kernel must validate this number before using it as a table index to prevent out-of-bounds access:
The Attack Scenario:
If the kernel blindly indexed into sys_call_table without bounds checking:
// VULNERABLE CODE - DO NOT USE
regs->ax = sys_call_table[nr](regs); // What if nr > table size?
An attacker could pass nr = 0x7FFFFFFF, causing the CPU to read beyond the table into arbitrary kernel memory and jump to an attacker-controlled address. This would be a trivial kernel exploit.
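A bounds check also has to get signedness right. The following user-space sketch, with hypothetical names and a table size of 451 chosen only for illustration, contrasts a signed comparison (which lets negative numbers through) with an unsigned one.

```c
#include <errno.h>
#include <stdbool.h>

#define TOY_NR_SYSCALLS 451u   /* illustrative table size */

/* Flawed: with a signed comparison, -1 is "less than" the table size. */
bool toy_valid_signed(int nr)
{
    return nr < (int)TOY_NR_SYSCALLS;
}

/* Safer: as unsigned, -1 becomes a huge value and fails the check. */
bool toy_valid_unsigned(unsigned int nr)
{
    return nr < TOY_NR_SYSCALLS;
}

/* What a caller would see: 0 stands in for "dispatched". */
long toy_check(int nr)
{
    return toy_valid_unsigned((unsigned int)nr) ? 0 : -ENOSYS;
}
```

With the signed version, a call number of -1 would index one slot before the table; the unsigned version rejects it with the same comparison that rejects numbers that are too large.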
// Linux kernel system call dispatch (simplified)
#define NR_syscalls 451 // Total number of valid system calls

__visible void do_syscall_64(struct pt_regs *regs, unsigned int nr)
{
    // Bounds check: CRITICAL for security
    if (likely(nr < NR_syscalls)) {
        // Only safe AFTER validation
        regs->ax = sys_call_table[nr](regs);
    } else {
        // Invalid number: return standard error
        regs->ax = -ENOSYS; // errno = 38 (Function not implemented)
    }
}

// The 'likely()' macro hints to the compiler that this branch
// is expected to be taken most of the time, enabling optimization.

// Note: The comparison uses 'unsigned int' to prevent negative
// numbers from passing the check (they become very large positive
// numbers, failing the < NR_syscalls comparison).

Handling Invalid Numbers
When the kernel receives an invalid system call number, it returns -ENOSYS (errno 38: "Function not implemented"). The same error covers both numbers that fall outside the table and calls that the running kernel does not implement.
User-space wrappers typically handle ENOSYS by either falling back to an alternative method or propagating the error to the application.
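The fallback pattern can be sketched without touching real system calls. Everything in the block below is hypothetical: the three toy_* operations stand in for a newer call that is missing, a newer call that is present, and an older method that always works.

```c
#include <errno.h>

/* Hypothetical operation: returns a result or a negative errno value. */
typedef long (*toy_op_fn)(long arg);

/* Stand-in for a newer call the running kernel does not provide. */
long toy_new_call_missing(long arg) { (void)arg; return -ENOSYS; }

/* Stand-in for a newer call that is available. */
long toy_new_call_present(long arg) { return arg + 100; }

/* Stand-in for the older, always-available method. */
long toy_old_call(long arg) { return arg + 1; }

/* Try the preferred call; use the fallback only when it is unimplemented. */
long toy_call_with_fallback(toy_op_fn preferred, toy_op_fn fallback, long arg)
{
    long result = preferred(arg);
    if (result == -ENOSYS)
        return fallback(arg);
    return result;   /* success, or an error other than ENOSYS */
}
```

Only ENOSYS triggers the fallback. Any other error means the call exists and failed for a real reason, so it is passed through unchanged.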
Speculative Execution Concerns
With Spectre vulnerabilities, even the bounds check itself became a security concern. The CPU may speculatively execute past the check and read the table out of bounds before the branch is resolved. Modern kernels clamp the index in a speculation-safe way to prevent this:
// Modern Linux with Spectre mitigations
__visible void do_syscall_64(struct pt_regs *regs, unsigned int nr)
{
    if (likely(nr < NR_syscalls)) {
        // Array index masking prevents speculative out-of-bounds access
        nr = array_index_nospec(nr, NR_syscalls);
        // Now safe even under speculative execution
        regs->ax = sys_call_table[nr](regs);
    } else {
        regs->ax = -ENOSYS;
    }
}

// array_index_nospec() ensures that even if the branch is
// mispredicted, the index cannot exceed the array bounds.
// It uses a data dependency (a mask) instead of a branch, so the
// index is always < NR_syscalls; out-of-range values become 0.

Before Spectre was disclosed in 2018, a bounds check alone was considered sufficient. Because the CPU can speculatively execute past a mispredicted branch, a bounds check on an attacker-controlled index can still leak data through side channels. Such checks now need a speculation barrier or index masking to remain secure.
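The masking idea can be tried in user space. The sketch below is modeled on the arithmetic of the kernel's generic C fallback for this mask as I understand it; the toy_* names are my own. It assumes size is at most LONG_MAX and that right-shifting a negative long is an arithmetic shift, which mainstream compilers provide. A user-space compiler gives no guarantee about speculation, so this shows the arithmetic only.

```c
#include <limits.h>
#include <stddef.h>

#define TOY_BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/* All-ones when index < size, zero otherwise, computed without a branch.
 * If index is out of range, either index itself or (size - 1 - index)
 * has its top bit set, so the inverted value shifts down to zero. */
unsigned long toy_index_mask(unsigned long index, unsigned long size)
{
    return (unsigned long)(~(long)(index | (size - 1UL - index))
                           >> (TOY_BITS_PER_LONG - 1));
}

/* Clamp: in-range indices pass through, out-of-range ones become 0. */
unsigned long toy_index_nospec(unsigned long index, unsigned long size)
{
    return index & toy_index_mask(index, size);
}
```

Because the clamped index is produced by arithmetic on the original value, there is no branch for the processor to mispredict at this step. On x86 the kernel uses a short assembly sequence for the same purpose.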
The system call table is a high-value target for attackers. Rootkits have historically modified the table to intercept system calls, hide files, disguise processes, or log passwords. Modern kernels implement multiple protections, including keeping the table in read-only memory.
The CR0.WP bit ensures that even kernel code cannot write to read-only pages.

// Linux kernel: arch/x86/entry/syscall_64.c
// Table is declared in read-only section
__section(".rodata..sys_call_table")
asmlinkage const sys_call_ptr_t sys_call_table[] = {
    // ... table entries ...
};

// The __section() attribute places this in .rodata
// which is mapped as read-only after kernel init.

// Attempting to modify:
// sys_call_table[1] = evil_handler;
// Would trigger a page fault, NOT succeed.

// Even from kernel code, this won't work because:
// 1. The table is in rodata section
// 2. CR0.WP (Write Protect) is set
// 3. The page tables mark these pages as read-only

Some security tools (like seccomp) don't modify the table directly but intercept calls at a higher level. The kernel's LSM (Linux Security Modules) framework provides hooks for security decisions without touching the syscall table. These are the correct, supported ways to implement security monitoring.
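A user-space analogy for a write-protected table is a page that is filled in and then stripped of write permission with mprotect(). This sketch assumes a POSIX system with anonymous mappings (Linux, for example); the toy_* names are hypothetical and the mechanism is an analogy, not the kernel's own.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

typedef long (*toy_fn)(long);

static long toy_identity(long x) { return x; }
static long toy_square(long x)   { return x * x; }

/* Build a two-entry table on its own page, then drop write permission.
 * Returns NULL on failure. */
const toy_fn *toy_make_readonly_table(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    toy_fn *table = mmap(NULL, page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (table == MAP_FAILED)
        return NULL;

    table[0] = toy_identity;
    table[1] = toy_square;

    /* From here on, a store to the table raises SIGSEGV. */
    if (mprotect(table, page, PROT_READ) != 0) {
        munmap(table, page);
        return NULL;
    }
    return table;
}
```

The entries remain callable after the permission change; only writes are refused, which is the property the kernel wants for its dispatch table.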
Operating systems must maintain backward compatibility for decades. A binary compiled 20 years ago should still run on today's kernel. This creates significant constraints on system call evolution.
The Sacred ABI
The system call interface is part of the kernel ABI (Application Binary Interface). Unlike internal kernel APIs (which can change freely), the ABI is a contract with user space:
"We do not break user space." — Linus Torvalds (numerous times)
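This means that once a system call has shipped in a release, its number, its arguments, and its observable behavior must keep working for existing binaries.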
Evolution Strategies
When functionality needs to change, the kernel uses several strategies:
1. Adding New Calls
Instead of modifying open(), Linux added openat(), then openat2(). Each new version provides additional functionality while the original remains unchanged.
| Generation | Call | Features |
|---|---|---|
| Original | open(path, flags) | Basic file opening |
| Extended | openat(dirfd, path, flags) | Relative paths, race-free |
| Modern | openat2(dirfd, path, how, size) | Extensible struct, RESOLVE_* flags |
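The difference between the first two generations is visible in a few lines of POSIX C. In this sketch, open_relative is an illustrative helper (not a library function) that opens a directory once and then resolves a name relative to that descriptor. The flags argument is assumed not to include O_CREAT, since no mode is passed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Open `name` relative to the directory `dir`.
 * Returns a file descriptor, or -1 with errno set. */
int open_relative(const char *dir, const char *name, int flags)
{
    int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    /* The lookup starts at dirfd, not at a path string that could be
     * renamed or replaced between two separate calls. */
    int fd = openat(dirfd, name, flags);

    int saved_errno = errno;   /* close() may overwrite errno */
    close(dirfd);
    errno = saved_errno;
    return fd;
}
```

A long-running program would usually keep the directory descriptor open and reuse it for many openat() calls; it is closed here only to keep the helper self-contained.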
2. Flag Extension
New behaviors are added via new flag bits. open() has accumulated dozens of flags over decades:
O_RDONLY (1970s) // Original
O_NONBLOCK (1980s) // Non-blocking I/O
O_CLOEXEC (2000s) // Close on exec
O_PATH (2011) // Path-only file descriptor
O_TMPFILE (2013) // Create unnamed temp file
Old programs never set the new bits, so their behavior is unchanged; new programs can opt in to the newer features.
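Flag extension only stays safe if the kernel decides what to do with bits it does not recognize. The sketch below uses hypothetical flag names and values to contrast two policies: ignoring unknown bits, as open() has historically done, and rejecting them with EINVAL, as openat2() does for unknown flags.

```c
#include <errno.h>

/* Hypothetical flag bits for a toy call; the values are illustrative. */
#define TOY_FLAG_APPEND   0x1u   /* part of the original interface */
#define TOY_FLAG_NONBLOCK 0x2u   /* added later */
#define TOY_FLAG_CLOEXEC  0x4u   /* added later still */

#define TOY_KNOWN_FLAGS (TOY_FLAG_APPEND | TOY_FLAG_NONBLOCK | TOY_FLAG_CLOEXEC)

/* Lenient policy: unknown bits are silently dropped.
 * Returns the flags that were actually honored. */
long toy_open_lenient(unsigned int flags)
{
    return (long)(flags & TOY_KNOWN_FLAGS);
}

/* Strict policy: unknown bits are an error, so a program can detect
 * that the kernel does not support a feature it asked for. */
long toy_open_strict(unsigned int flags)
{
    if (flags & ~TOY_KNOWN_FLAGS)
        return -EINVAL;
    return (long)flags;
}
```

The lenient policy makes it impossible for a new program to tell whether an old kernel honored a new flag, which is one motivation for the strict checking in newer interfaces.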
// Modern extensible system call design: openat2()
struct open_how {
    __u64 flags;   // O_* flags
    __u64 mode;    // File mode for creation
    __u64 resolve; // RESOLVE_* flags
    // Future fields go here...
};

// System call takes size parameter for versioning
long sys_openat2(int dirfd, const char *pathname,
                 struct open_how *how, size_t size);

// Kernel handles versioning (conceptual; the real kernel performs this
// step with copy_struct_from_user() on its own copy of the struct):
if (size < OPEN_HOW_SIZE_VER0)
    return -EINVAL; // Too old

// Zero any fields beyond what userspace provided
if (size < OPEN_HOW_SIZE_LATEST) {
    memset((char*)how + size, 0, OPEN_HOW_SIZE_LATEST - size);
}

// Old programs pass small struct, get defaults for new fields
// New programs can use new fields on old kernels (graceful fail)
// New kernels add fields to end, old programs unaffected

Modern Linux system calls that take structs often have an explicit size parameter. This allows both forward and backward compatibility: old programs on new kernels get default values for new fields; new programs on old kernels can detect missing support and fall back gracefully.
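The size-based versioning rule can be exercised in plain C. The block below is a user-space model with hypothetical names (struct toy_how, toy_copy_how); it follows the copy_struct_from_user() semantics as I understand them: a struct that is too small is rejected, a larger struct is accepted only if its extra bytes are zero, and missing fields default to zero.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy versioned argument: v0 had flags and mode; v1 added resolve. */
struct toy_how {
    uint64_t flags;
    uint64_t mode;
    uint64_t resolve;   /* added in "version 1" */
};

#define TOY_HOW_SIZE_VER0   16u
#define TOY_HOW_SIZE_LATEST sizeof(struct toy_how)

/* Copy a caller-supplied struct of `usize` bytes into `dst`.
 * Returns 0, -EINVAL (older than v0), or -E2BIG (unknown nonzero fields). */
long toy_copy_how(struct toy_how *dst, const void *src, size_t usize)
{
    const unsigned char *bytes = src;
    size_t ksize = TOY_HOW_SIZE_LATEST;
    size_t ncopy = usize < ksize ? usize : ksize;

    if (usize < TOY_HOW_SIZE_VER0)
        return -EINVAL;

    /* Caller's struct is newer than ours: the extra bytes must be zero,
     * otherwise the caller is asking for something we cannot provide. */
    for (size_t i = ksize; i < usize; i++)
        if (bytes[i] != 0)
            return -E2BIG;

    memset(dst, 0, sizeof(*dst));   /* defaults for fields the caller lacks */
    memcpy(dst, bytes, ncopy);
    return 0;
}
```

An old caller passing 16 bytes gets resolve set to zero, and a newer caller passing a longer struct succeeds as long as the fields this version does not know about are left at zero.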
Some system calls act as multiplexers—a single call number that dispatches to many different operations based on an argument. This pattern trades call-number simplicity for argument complexity.
ioctl: The Classic Multiplexer
The ioctl() system call is the most famous example:
int ioctl(int fd, unsigned long request, ... /* arg */);
The request code determines the operation. There are thousands of ioctl codes:
| Domain | Example Request | Purpose |
|---|---|---|
| Terminal | TIOCGWINSZ | Get terminal window size |
| Block device | BLKGETSIZE | Get device size |
| Network | SIOCGIFADDR | Get interface address |
| Graphics | DRM_IOCTL_MODE_GETRESOURCES | Get display resources |
| USB | USBDEVFS_SUBMITURB | Submit USB request |
This approach was historically used because adding new system calls required kernel changes, while new ioctl codes could be added by device drivers.
Modern Multiplexers
Several newer system calls also use multiplexing:
prctl() — Process control operations:
prctl(PR_SET_NAME, "mythread"); // Set thread name
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT); // Enable seccomp (strict mode)
prctl(PR_SET_DUMPABLE, 0); // Prevent core dumps
fcntl() — File descriptor control:
fcntl(fd, F_GETFL); // Get flags
fcntl(fd, F_SETFL, O_NONBLOCK); // Set non-blocking
fcntl(fd, F_DUPFD, 10); // Duplicate to fd >= 10
futex() — Fast userspace mutex operations:
futex(addr, FUTEX_WAIT, val, timeout); // Wait if *addr == val
futex(addr, FUTEX_WAKE, n); // Wake n waiters
futex(addr, FUTEX_CMP_REQUEUE, ...); // Requeue waiters
Multiplexed calls have disadvantages: they are harder to trace (a tool must decode each sub-command to show more than a raw request number), harder to sandbox (seccomp must understand sub-commands), and error-prone (it is easy to pass the wrong command). Modern practice prefers separate system calls for major functionality.
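The multiplexing pattern is easy to see with fcntl(), where one entry point serves both a "get" and a "set" sub-command. The helper names below are illustrative; the fcntl() commands and O_NONBLOCK are standard POSIX.

```c
#include <fcntl.h>

/* Turn on O_NONBLOCK for `fd` using two fcntl() sub-commands.
 * Returns 0 on success, -1 with errno set on failure. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);            /* sub-command 1: read flags */
    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)   /* sub-command 2 */
        return -1;
    return 0;
}

/* Report whether O_NONBLOCK is set: 1 = yes, 0 = no, -1 = error. */
int is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) != 0;
}
```

Both helpers issue the same system call number; only the second argument tells the kernel which operation is meant.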
Seccomp and Multiplexed Calls
Seccomp (secure computing mode) allows filtering system calls. With multiplexed calls like ioctl, simple call-number filtering is insufficient—you need to inspect the command argument:
// Allow ioctl but only for TIOCGWINSZ (get window size)
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ioctl, 0, 3),
BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
offsetof(struct seccomp_data, args[1])), // Load request
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, TIOCGWINSZ, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
This complexity is one reason the kernel now prefers adding new system calls rather than new ioctl commands for major features.
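For reference, the fragment above can be expanded into a complete filter program. This is a sketch under several assumptions: Linux headers are available, the machine is little-endian (so the low 32 bits of args[1] sit at the field's base offset), and the architecture check that production filters perform first is omitted for brevity. The array name and the choice to deny with EPERM are my own, and the filter is only defined here, not installed.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>    /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>   /* struct seccomp_data, SECCOMP_RET_* */
#include <sys/ioctl.h>       /* TIOCGWINSZ */
#include <sys/syscall.h>     /* __NR_ioctl */

/* Allow everything except ioctl(); allow ioctl() only for TIOCGWINSZ. */
const struct sock_filter toy_ioctl_filter[] = {
    /* [0] A = system call number */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* [1] not ioctl? skip three instructions to [5] (allow) */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ioctl, 0, 3),
    /* [2] A = low 32 bits of the request argument */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[1])),
    /* [3] TIOCGWINSZ? skip one instruction to [5]; otherwise fall to [4] */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, TIOCGWINSZ, 1, 0),
    /* [4] any other ioctl request fails with EPERM */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    /* [5] allow */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

const size_t toy_ioctl_filter_len =
    sizeof(toy_ioctl_filter) / sizeof(toy_ioctl_filter[0]);
```

Installing such a filter would additionally require setting PR_SET_NO_NEW_PRIVS (or holding CAP_SYS_ADMIN) and passing the array in a struct sock_fprog to prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, ...) or the seccomp() system call.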
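Different operating systems and architectures use different system call numbers. This creates challenges for any software that issues or interprets raw system calls, such as emulators and compatibility layers.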
| System | write() Number | Notes |
|---|---|---|
| Linux x86-64 | 1 | Modern Linux ABI |
| Linux x86-32 | 4 | Original i386 |
| Linux ARM64 | 64 | Generic unified table |
| FreeBSD x86-64 | 4 | BSD tradition |
| macOS x86-64 | 0x2000004 | BSD + Mach hybrid |
| Windows x64 | N/A | No direct write syscall |
| Solaris x86-64 | 4 | System V tradition |
macOS System Call Classes
macOS is particularly interesting—it has multiple system call "classes" accessed through different number ranges:
// macOS syscall classes (embedded in number)
#define SYSCALL_CLASS_UNIX 2 // BSD layer
#define SYSCALL_CLASS_MACH 1 // Mach kernel
#define SYSCALL_CLASS_MDEP 3 // Machine-dependent
// write() is BSD class 2, number 4:
// Full number: 0x2000004 = (2 << 24) | 4
This reflects macOS's hybrid kernel architecture combining BSD and Mach components.
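The class encoding is plain bit packing, which a few helper functions make concrete. The toy_* names below are illustrative and are not the macros from Apple's headers; the 24-bit shift matches the arithmetic shown above.

```c
#include <stdint.h>

#define TOY_SYSCALL_CLASS_SHIFT 24
#define TOY_SYSCALL_NUMBER_MASK ((1u << TOY_SYSCALL_CLASS_SHIFT) - 1u)

/* Pack a class and a per-class number into one 32-bit value. */
uint32_t toy_encode(uint32_t class_id, uint32_t number)
{
    return (class_id << TOY_SYSCALL_CLASS_SHIFT)
         | (number & TOY_SYSCALL_NUMBER_MASK);
}

/* Recover the class from the top byte. */
uint32_t toy_class_of(uint32_t full)
{
    return full >> TOY_SYSCALL_CLASS_SHIFT;
}

/* Recover the per-class number from the low 24 bits. */
uint32_t toy_number_of(uint32_t full)
{
    return full & TOY_SYSCALL_NUMBER_MASK;
}
```

Encoding class 2 with number 4 reproduces the 0x2000004 value listed for write(), and the two decode helpers split it back into its parts.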
Windows: No Traditional Numbers
Windows doesn't expose stable system call numbers to user space. Applications call functions in NTDLL.DLL, which internally uses undocumented syscall numbers. These numbers change between Windows versions:
// NtWriteFile on different Windows versions (illustrative; check a per-build table for exact values):
// Windows 7: 0x0005
// Windows 10: 0x0008 (varies by build!)
// Windows 11: 0x0009 (still changing)
This is by design—Microsoft reserves the right to change the kernel interface, requiring all programs to go through their documented API layer.
Emulators like WSL1, QEMU user-mode, and box64 must translate system call numbers between systems. They maintain mapping tables and adapt calling conventions. For example, WSL1 intercepts Linux system calls on Windows and implements them using Windows NT kernel APIs.
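At its simplest, number translation is a lookup table. The sketch below maps two 32-bit x86 Linux numbers to their x86-64 equivalents using the read() and write() values from the tables in this page; the structure and function names are hypothetical.

```c
#include <stddef.h>

/* One row of a guest-to-host number map. */
struct toy_nr_map {
    long guest;         /* number the emulated program used */
    long host;          /* number the host kernel expects */
    const char *name;   /* for diagnostics only */
};

/* Guest: Linux x86-32. Host: Linux x86-64. */
static const struct toy_nr_map toy_i386_to_x86_64[] = {
    { 3, 0, "read"  },
    { 4, 1, "write" },
};

/* Translate a guest number; returns -1 if the call is not in the map. */
long toy_translate(long guest_nr)
{
    size_t n = sizeof(toy_i386_to_x86_64) / sizeof(toy_i386_to_x86_64[0]);
    for (size_t i = 0; i < n; i++)
        if (toy_i386_to_x86_64[i].guest == guest_nr)
            return toy_i386_to_x86_64[i].host;
    return -1;
}
```

Real emulators do considerably more than renumber: they also convert argument registers, structure layouts, and error codes between the two systems.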
The Linux system call table has grown significantly over its 30+ year history. Examining this growth reveals the evolving needs of computing:
| Version | Year | Approx. Count | Notable Additions |
|---|---|---|---|
| 1.0 | 1994 | ~140 | Original set |
| 2.0 | 1996 | ~180 | SMP support calls |
| 2.4 | 2001 | ~250 | Networking, capabilities |
| 2.6 | 2003 | ~270 | Futex, epoll, inotify |
| 3.0 | 2011 | ~310 | Fanotify, name_to_handle |
| 4.0 | 2015 | ~330 | BPF, memfd, getrandom |
| 5.0 | 2019 | ~350 | io_uring, pidfd_open |
| 6.0 | 2022 | ~450 | Landlock, futex2, fsconfig |
Trends in System Call Evolution:
Security Enhancement: Many new calls add security features—seccomp (sandboxing), capabilities (fine-grained privileges), landlock (unprivileged sandboxing).
Race-Free Operations: The *at() family (openat, fstatat, etc.) eliminates TOCTOU race conditions by operating relative to directory file descriptors.
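Performance Optimization: io_uring provides a high-performance async I/O interface through just three system calls (io_uring_setup, io_uring_enter, and io_uring_register); individual I/O operations are submitted through shared memory rings rather than through new call numbers.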
Containerization Support: pidfd_* calls enable container runtimes to manage processes without PID race conditions.
Obsolescence: Some old calls become deprecated (e.g., obsolete signal APIs) but their numbers are never reused.
Adding New System Calls
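Adding a system call to Linux follows a documented procedure (see Documentation/process/adding-syscalls.rst in the kernel source), including public review of the proposed interface before a number is allocated.
This rigorous process ensures stability: once added, a call exists forever.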
To see the newest Linux system calls, check the kernel source: arch/x86/entry/syscalls/syscall_64.tbl shows the canonical x86-64 list. The man syscalls page provides documentation for all calls.
We've explored how operating systems identify requested services through system call numbers—a deceptively simple mechanism with profound implications for compatibility, security, and system evolution.
What's Next:
We've seen how the kernel identifies which service is requested. But system calls need more than just identification—they need arguments. The next page examines parameter passing: how applications transfer data to and from the kernel efficiently and safely.
You now understand system call numbers—the kernel's service identification mechanism. From the simple table lookup to the complex versioning requirements, this numbering scheme enables applications to request hundreds of different kernel services through a unified interface. Next, we'll examine how arguments are passed to those services.