System Call Mechanism - Learning Module


System Call Number

The Kernel's Service Menu

When your application executes a trap instruction, it transitions to kernel mode—but the kernel still needs to know which service you're requesting. Are you trying to read a file? Create a new process? Allocate memory? Wait for a network packet?

This is where the system call number enters the picture. Before executing the trap, user code places a numeric identifier in a designated register or stack location. The kernel uses this number to index into a system call table, finding the appropriate handler function. This simple mechanism—a number mapping to a function—is the foundation of the entire system call API.

What You Will Learn

By the end of this page, you will understand how system call numbers identify kernel services, how system call tables are organized, the conventions for passing the call number, versioning challenges, and how operating systems maintain decades of backward compatibility through careful numbering discipline.

The System Call Table

At the heart of system call dispatching is the system call table (syscall table)—an array of function pointers where each index corresponds to a system call number. When the kernel receives a system call, it uses the call number as an index into this table to find the handler function.

Conceptual Structure:

// Simplified representation
typedef long (*syscall_fn_t)(...);

// The system call table is essentially:
syscall_fn_t sys_call_table[] = {
    [0]   = sys_read,       // read()
    [1]   = sys_write,      // write()
    [2]   = sys_open,       // open()
    [3]   = sys_close,      // close()  
    // ... hundreds more entries ...
    [n]   = sys_new_fancy_syscall, // newest syscall
};

When a user program requests system call 1, the kernel executes sys_call_table[1](), which is sys_write(). This indirection allows the kernel to:

Centralize dispatch logic
Add new system calls simply by adding entries
Maintain a clear mapping between numbers and functions
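The indirection described above can be imitated in ordinary user-space C. The following is a minimal sketch, not kernel code: the handler names, the single-argument handler type, and toy_dispatch are all invented for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical handler type: every toy handler takes one long argument. */
typedef long (*toy_handler_t)(long arg);

static long toy_double(long arg) { return arg * 2; }
static long toy_negate(long arg) { return -arg; }
static long toy_square(long arg) { return arg * arg; }

/* Designated initializers tie each call number to its handler. */
static const toy_handler_t toy_table[] = {
    [0] = toy_double,
    [1] = toy_negate,
    [2] = toy_square,
};

#define TOY_NR_CALLS (sizeof(toy_table) / sizeof(toy_table[0]))

/* Look up the handler by number and call it; unknown numbers get -ENOSYS. */
long toy_dispatch(unsigned int nr, long arg)
{
    if (nr >= TOY_NR_CALLS || toy_table[nr] == NULL)
        return -ENOSYS;
    return toy_table[nr](arg);
}
```

Calling toy_dispatch(1, 5) runs toy_negate and returns -5. Adding a fourth call only requires one more table entry; the dispatch function itself does not change.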

Linux x86-64 System Call Table (Simplified)
// arch/x86/entry/syscall_64.c (conceptual representation)
 
#include <asm/syscall.h>
 
// Define the table using the __SYSCALL macro
#define __SYSCALL(nr, sym) [nr] = __x64_##sym,
 
// The actual table - compiled from syscall definitions
const sys_call_ptr_t sys_call_table[] = {
    [0]   = __x64_sys_read,
    [1]   = __x64_sys_write,
    [2]   = __x64_sys_open,
    [3]   = __x64_sys_close,
    [4]   = __x64_sys_stat,
    [5]   = __x64_sys_fstat,
    [6]   = __x64_sys_lstat,
    [7]   = __x64_sys_poll,
    [8]   = __x64_sys_lseek,
    [9]   = __x64_sys_mmap,
    [10]  = __x64_sys_mprotect,
    [11]  = __x64_sys_munmap,
    [12]  = __x64_sys_brk,
    // ... 330+ more entries on modern kernels ...
    [435] = __x64_sys_clone3,
    [436] = __x64_sys_close_range,
    // ... continues to grow ...
};
 
// Table size for bounds checking
const unsigned int NR_syscalls = ARRAY_SIZE(sys_call_table);

Architecture-Specific Tables

Each CPU architecture maintains its own system call table with potentially different numbers for the same functionality. Linux on x86-64 uses one numbering, ARM64 uses another. The POSIX API functions (read, write, etc.) provide a portable interface, but the underlying syscall numbers differ.

Passing the System Call Number

Different architectures use different conventions for passing the system call number from user space to the kernel. The number must be available to the kernel immediately after the trap instruction executes.

System Call Number Passing Conventions
Architecture	Register	Example Instruction	Notes
x86-64 (Linux)	`RAX`	`mov rax, 1`	Number before `SYSCALL`
x86-32 (Linux)	`EAX`	`mov eax, 4`	Number before `INT 0x80`
ARM64 (Linux)	`X8`	`mov x8, #64`	Number before `SVC #0`
ARM32 (Linux)	`R7`	`mov r7, #4`	Number before `SWI 0`
RISC-V (Linux)	`A7`	`li a7, 64`	Number before `ECALL`
x86-64 (Windows)	`RAX`	`mov rax, 0x50`	Number via NTDLL stub
PowerPC (Linux)	`R0`	`li r0, 4`	Number before `sc`
MIPS (Linux)	`$v0`	`li $v0, 4001`	O32 ABI, before `SYSCALL`

Why a Register?

Using a register (rather than the stack) for the system call number has several advantages:

Speed: Register access is faster than memory access. The kernel can check the call number immediately without any loads.
Atomicity: The number is captured in a single step. On x86-64, SYSCALL leaves RAX untouched (it overwrites only RCX and R11), so the kernel's entry code saves exactly the value user space loaded.
Simplicity: No stack manipulation is needed before the trap. This is especially important for SYSCALL, which doesn't automatically switch stacks.
Security: Stack-based passing would require reading user memory, which is susceptible to TOCTOU (time-of-check-time-of-use) attacks.
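On Linux, the libc syscall() function exposes this convention directly: its first argument is the call number, which libc loads into the architecture's designated register before executing the trap instruction. A minimal sketch, assuming a Linux system with glibc or musl; raw_write and raw_getpid are names invented here.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_write, SYS_getpid */
#include <unistd.h>        /* syscall() */

/* Issue write(2) by number instead of through the libc wrapper. */
long raw_write(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}

/* Same idea for a call that takes no arguments. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Because SYS_write and SYS_getpid come from the headers for the target architecture, the same source compiles to a different number on x86-64 than on ARM64 without any change.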

System Call Number Usage in Kernel Dispatch
// Linux kernel: arch/x86/entry/common.c (simplified)
 
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
    // nr comes from RAX (saved in regs->orig_ax during entry)
    nr = syscall_enter_from_user_mode(regs, nr);
    
    // Check for valid system call number; the unsigned compare
    // also rejects negative values
    if (likely((unsigned int)nr < NR_syscalls)) {
        // Look up handler in table and call it
        regs->ax = sys_call_table[nr](regs);
    } else {
        // Invalid system call number
        regs->ax = -ENOSYS;  // "Function not implemented"
    }
    
    syscall_exit_to_user_mode(regs);
}
 
// The regs structure contains all saved user registers:
// regs->di  = first argument (RDI)
// regs->si  = second argument (RSI)
// regs->dx  = third argument (RDX)
// regs->r10 = fourth argument (R10, not RCX!)
// regs->r8  = fifth argument (R8)
// regs->r9  = sixth argument (R9)
// regs->orig_ax = system call number (RAX at entry)

RAX Overloaded Purpose

In the x86-64 Linux ABI, RAX serves double duty: it holds the system call number on entry and the return value on exit. The kernel saves the original system call number to a separate field (orig_ax/orig_rax) so it can still identify the call even after setting the return value.

System Call Number Assignment

How are system call numbers assigned? The process is surprisingly deliberate, driven by historical compatibility and practical constraints:

Historical Assignment

Early UNIX systems assigned numbers sequentially as calls were added:

1: exit
2: fork
3: read
4: write
5: open
... and so on

These original numbers have been maintained for decades to preserve binary compatibility. A statically linked program compiled in the 1990s should still work on today's kernel if it uses standard system calls.

Rules for Adding New System Calls

When a new system call is added to Linux:

Never reuse numbers: A number, once assigned, is never reassigned to a different call.
Append to the end: New calls get the next available number.
Reserve gaps carefully: Some ranges are reserved for future use or experimental calls.
Architecture consistency (when possible): While numbers differ across architectures, the semantic relationship should be consistent.

Example: read() System Call Numbers Across Architectures
Architecture	read() Number	Notes
x86-64	0	Renumbered for 64-bit ABI
x86-32	3	Original i386 numbering
ARM64	63	Matches generic numbering
ARM32 (EABI)	3	Follows x86-32 tradition
RISC-V	63	Modern unified numbering
MIPS O32	4003	Offset by 4000
PowerPC	3	Follows UNIX tradition

The Generic System Call Table

Newer architectures (ARM64, RISC-V) use a standardized generic system call table (defined in include/uapi/asm-generic/unistd.h) that aims to unify numbering across platforms. Older architectures retain their historical numbers for compatibility.

The generic table omits legacy calls that newer ones supersede (it provides openat but not open, for example) and keeps the same numbers on all participating architectures. This simplifies cross-architecture development and makes it easier to add calls that work consistently everywhere.

Checking System Call Numbers

On Linux, you can find system call numbers in /usr/include/asm/unistd_64.h (x86-64) or equivalent headers. The ausyscall --dump command (from audit tools) lists all calls and numbers for your architecture.
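The same numbers are available to C programs through the SYS_* macros in <sys/syscall.h>. As a small sketch (format_syscall_numbers is a name made up for this example), the function below formats a few of them for whatever architecture it is compiled on.

```c
#include <stdio.h>
#include <sys/syscall.h>   /* SYS_read, SYS_write, SYS_close, SYS_getpid */

/* Write a few syscall numbers for the current architecture into buf. */
int format_syscall_numbers(char *buf, size_t len)
{
    return snprintf(buf, len, "read=%ld write=%ld close=%ld getpid=%ld",
                    (long)SYS_read, (long)SYS_write,
                    (long)SYS_close, (long)SYS_getpid);
}
```

Building this once for x86-64 and once for ARM64 and comparing the output is a quick way to see the per-architecture numbering described earlier.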

Bounds Checking and Validation

User space can pass any value as a system call number. The kernel must validate this number before using it as a table index to prevent out-of-bounds access:

The Attack Scenario:

If the kernel blindly indexed into sys_call_table without bounds checking:

// VULNERABLE CODE - DO NOT USE
regs->ax = sys_call_table[nr](regs);  // What if nr > table size?

An attacker could pass nr = 0x7FFFFFFF, causing the kernel to read a "function pointer" from far beyond the end of the table and jump to whatever address that memory holds. If the attacker can influence that memory, this becomes a straightforward kernel exploit.

Proper System Call Number Validation
// Linux kernel system call dispatch (simplified)
 
#define NR_syscalls 451  // Total number of valid system calls
 
__visible void do_syscall_64(struct pt_regs *regs, unsigned int nr)
{
    // Bounds check: CRITICAL for security
    if (likely(nr < NR_syscalls)) {
        // Only safe AFTER validation
        regs->ax = sys_call_table[nr](regs);
    } else {
        // Invalid number: return standard error
        regs->ax = -ENOSYS;  // errno = 38 (Function not implemented)
    }
}
 
// The 'likely()' macro hints to the compiler that this branch
// is expected to be taken most of the time, enabling optimization.
 
// Note: The comparison uses 'unsigned int' to prevent negative
// numbers from passing the check (they become very large positive
// numbers, failing the < NR_syscalls comparison).
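The point about unsigned comparison is easy to verify in user space. The two helpers below are hypothetical (as is the TOY_NR_SYSCALLS limit, which simply reuses the illustrative 451); they differ only in the type of nr.

```c
#define TOY_NR_SYSCALLS 451   /* illustrative limit, not tied to any kernel */

/* Buggy: a negative number is "less than" the limit and passes the check. */
int check_signed(int nr)
{
    return nr < TOY_NR_SYSCALLS;
}

/* Correct: -1 converts to UINT_MAX, which fails the comparison. */
int check_unsigned(unsigned int nr)
{
    return nr < TOY_NR_SYSCALLS;
}
```

check_signed(-1) returns 1, so a signed check would let a negative index through to the table lookup, while check_unsigned((unsigned int)-1) returns 0.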

Handling Invalid Numbers

When the kernel receives an invalid system call number, it returns -ENOSYS (errno 38: "Function not implemented"). This is the standard error for:

Numbers beyond the table size
Numbers that correspond to unimplemented slots (newer kernels may have gaps)
Obsolete system calls that have been removed

User-space wrappers typically handle ENOSYS by either falling back to an alternative method or propagating the error to the application.

Speculative Execution Concerns

With the Spectre vulnerabilities, the bounds check alone stopped being enough. The CPU can speculatively execute past a mispredicted check and index the table out of bounds before the misprediction is resolved. Modern kernels clamp the index with a branch-free mask so that even a speculative access stays inside the table:

Speculation-Safe System Call Dispatch
// Modern Linux with Spectre mitigations
 
__visible void do_syscall_64(struct pt_regs *regs, unsigned int nr)
{
    if (likely(nr < NR_syscalls)) {
        // Array index masking prevents speculative out-of-bounds access
        nr = array_index_nospec(nr, NR_syscalls);
        
        // Now safe even under speculative execution
        regs->ax = sys_call_table[nr](regs);
    } else {
        regs->ax = -ENOSYS;
    }
}
 
// array_index_nospec() ensures that even if the branch is
// mispredicted, the index cannot exceed the array bounds.
// It builds a mask from the index with branch-free arithmetic:
// in-range indices pass through, out-of-range ones become 0.

Spectre Changes Everything

Before Spectre was disclosed in 2018, a bounds check alone was considered sufficient. Spectre variant 1 showed that a mispredicted branch lets the CPU speculatively perform the out-of-bounds access and leak data through cache side effects. Kernel bounds checks on indices that user space controls therefore pair the check with index masking or a speculation barrier.
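The masking arithmetic can be reproduced in portable-looking C. This sketch follows the shape of the kernel's generic fallback but is only an illustration of the arithmetic: the toy_ names are invented, and a user-space build has none of the kernel's safeguards against the compiler turning the expression back into a branch.

```c
#include <limits.h>

#define TOY_BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/*
 * All ones when index < size, zero otherwise, computed without a branch.
 * Assumes size <= LONG_MAX and an arithmetic right shift of negative
 * values (true for GCC and Clang).
 */
unsigned long toy_index_mask(unsigned long index, unsigned long size)
{
    return ~(long)(index | (size - 1UL - index)) >> (TOY_BITS_PER_LONG - 1);
}

/* Clamp: in-range indices pass through, out-of-range ones become 0. */
unsigned long toy_index_nospec(unsigned long index, unsigned long size)
{
    return index & toy_index_mask(index, size);
}
```

For an in-range index, neither index nor size - 1 - index has its top bit set, so the complement is negative and the shift smears the sign bit into an all-ones mask. For an out-of-range index, the subtraction wraps around, the top bit is set, and the mask collapses to zero.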

System Call Table Security

The system call table is a high-value target for attackers. Rootkits historically have modified the table to intercept system calls, hide files, disguise processes, or log passwords. Modern kernels implement multiple protections:

System Call Table Attack Techniques

•Direct Table Modification: Overwriting table entries to point to malicious handlers. Classic rootkit technique from the 2000s.
•Inline Hooking: Modifying the first few bytes of a handler function to jump to malicious code instead.
•Table Relocation: Changing the pointer to the system call table itself (in kernel data structures).
•Hardware Breakpoints: Using debug registers to intercept specific system calls without modifying code.

Modern Kernel Protections

•Read-Only Table: The system call table is placed in read-only memory. Modifications trigger page faults.
•Write-Protected Kernel Text: CR0.WP bit ensures even kernel code cannot write to read-only pages.
•Kernel Address Space Layout Randomization (KASLR): Randomizes table location, making it harder to find.
•Secure Boot: Ensures only signed kernels load, preventing rootkit installation at boot.
•Integrity Measurement: TPM-based verification can detect table modifications.

System Call Table Memory Protection
// Linux kernel: arch/x86/entry/syscall_64.c (conceptual; the section name is illustrative)
 
// Table is declared const and placed in a read-only section
__section(".rodata..sys_call_table")
asmlinkage const sys_call_ptr_t sys_call_table[] = {
    // ... table entries ...
};
 
// The const qualifier already puts the table in .rodata,
// which is mapped as read-only after kernel init.
 
// Attempting to modify:
// sys_call_table[1] = evil_handler;  
// Would trigger a page fault, NOT succeed.
 
// Even from kernel code, this won't work because:
// 1. The table is in rodata section
// 2. CR0.WP (Write Protect) is set
// 3. The page tables mark these pages as read-only

Legitimate Alternatives to Table Modification

Security mechanisms such as seccomp never modify the table; they filter calls in the entry path, before dispatch. The kernel's LSM (Linux Security Modules) framework likewise provides hooks for security decisions without touching the syscall table. These are the correct, supported ways to implement security monitoring.

Versioning and Backward Compatibility

Operating systems must maintain backward compatibility for decades. A binary compiled 20 years ago should still run on today's kernel. This creates significant constraints on system call evolution.

The Sacred ABI

The system call interface is part of the kernel ABI (Application Binary Interface). Unlike internal kernel APIs (which can change freely), the ABI is a contract with user space:

"We do not break user space." — Linus Torvalds (numerous times)

This means:

System call numbers are permanent
Argument layouts cannot change
Return value semantics are fixed
Error codes maintain their meanings

Evolution Strategies

When functionality needs to change, the kernel uses several strategies:

1. Adding New Calls

Instead of modifying open(), Linux added openat(), then openat2(). Each new version provides additional functionality while the original remains unchanged.

Generation	Call	Features
Original	`open(path, flags)`	Basic file opening
Extended	`openat(dirfd, path, flags)`	Relative paths, race-free
Modern	`openat2(dirfd, path, how, size)`	Extensible struct, RESOLVE_* flags

2. Flag Extension

New behaviors are added via new flag bits. open() has accumulated dozens of flags over decades:

O_RDONLY    (1970s)  // Original
O_NONBLOCK  (1980s)  // Non-blocking I/O
O_CLOEXEC   (2000s)  // Close on exec
O_PATH      (2011)   // Path-only file descriptor
O_TMPFILE   (2013)   // Create unnamed temp file

Programs written before a flag existed never set it, so their behavior is unchanged; newer programs opt in to the added behavior.

Extensible System Call Design (openat2)
// Modern extensible system call design: openat2()
 
struct open_how {
    __u64 flags;    // O_* flags
    __u64 mode;     // File mode for creation
    __u64 resolve;  // RESOLVE_* flags
    // Future fields go here...
};
 
// System call takes size parameter for versioning
long sys_openat2(int dirfd, const char *pathname,
                 struct open_how *how, size_t size);
 
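// Kernel handles versioning (simplified):
struct open_how tmp;

if (size < OPEN_HOW_SIZE_VER0)
    return -EINVAL;  // Smaller than the first published layout

// Copy into a kernel-side struct rather than touching user memory.
// Fields the caller did not supply are zero-filled; if the caller's
// struct is larger than the kernel's, the extra bytes must all be
// zero or the copy fails with -E2BIG.
err = copy_struct_from_user(&tmp, sizeof(tmp), how, size);
if (err)
    return err;

// Old programs pass a small struct, get defaults for new fields
// New programs on old kernels fail cleanly if they set unknown fields
// New kernels add fields to the end, old programs unaffected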

The Size Parameter Pattern

Modern Linux system calls that take structs often have an explicit size parameter. This allows both forward and backward compatibility: old programs on new kernels get default values for new fields; new programs on old kernels can detect missing support and fall back gracefully.
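The copy rule behind this pattern can be simulated in user space. The sketch below is a toy under stated assumptions: struct toy_how, TOY_HOW_SIZE_VER0, and toy_copy_struct are invented names, and the function only mimics the zero-fill and reject-unknown-nonzero behavior described above, without any real user/kernel memory boundary.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical argument struct: v0 had only `flags`; v1 appended `resolve`. */
struct toy_how {
    unsigned long long flags;
    unsigned long long resolve;
};
#define TOY_HOW_SIZE_VER0 sizeof(unsigned long long)

/*
 * Copy a caller struct of `usize` bytes into a callee struct of `ksize`
 * bytes. Missing trailing fields are zero-filled; unknown trailing bytes
 * must be zero, otherwise the copy fails with -E2BIG.
 */
int toy_copy_struct(void *dst, size_t ksize, const void *src, size_t usize)
{
    const unsigned char *bytes = src;
    size_t n = usize < ksize ? usize : ksize;

    if (usize < TOY_HOW_SIZE_VER0)
        return -EINVAL;
    /* Caller's struct is newer than ours: unknown fields must be zero. */
    for (size_t i = ksize; i < usize; i++)
        if (bytes[i] != 0)
            return -E2BIG;
    memset(dst, 0, ksize);   /* defaults for fields the caller lacks */
    memcpy(dst, src, n);
    return 0;
}
```

An old caller that passes only the 8-byte v0 layout gets resolve == 0, while a newer caller that sets a field this version does not know about receives -E2BIG and can fall back.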

System Call Multiplexing and Sub-Commands

Some system calls act as multiplexers—a single call number that dispatches to many different operations based on an argument. This pattern trades call-number simplicity for argument complexity.

ioctl: The Classic Multiplexer

The ioctl() system call is the most famous example:

int ioctl(int fd, unsigned long request, ...  /* arg */);

The request code determines the operation. There are thousands of ioctl codes:

Domain	Example Request	Purpose
Terminal	`TIOCGWINSZ`	Get terminal window size
Block device	`BLKGETSIZE`	Get device size
Network	`SIOCGIFADDR`	Get interface address
Graphics	`DRM_IOCTL_MODE_GETRESOURCES`	Get display resources
USB	`USBDEVFS_SUBMITURB`	Submit USB request

This approach became popular because adding a system call means changing the core kernel and every architecture's table, while a device driver can define new ioctl codes on its own.
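For instance, the terminal request from the table above can be issued as follows. This is a minimal sketch assuming a Linux or other POSIX-like system; get_window_size is a helper name chosen for this example.

```c
#include <sys/ioctl.h>   /* ioctl(), TIOCGWINSZ, struct winsize */

/*
 * Ask the terminal behind fd for its size. Returns 0 and fills rows/cols
 * on success; returns -1 with errno set on failure (for example ENOTTY
 * when fd does not refer to a terminal).
 */
int get_window_size(int fd, unsigned short *rows, unsigned short *cols)
{
    struct winsize ws;

    if (ioctl(fd, TIOCGWINSZ, &ws) < 0)
        return -1;
    *rows = ws.ws_row;
    *cols = ws.ws_col;
    return 0;
}
```

The system call number is the same for every ioctl; only the request code (TIOCGWINSZ here) and the type of the third argument tell the kernel which operation is meant.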

Modern Multiplexers

Several other system calls also use multiplexing:

prctl() — Process control operations:

prctl(PR_SET_NAME, "mythread");      // Set thread name
prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);  // Enable strict seccomp
prctl(PR_SET_DUMPABLE, 0);            // Prevent core dumps

fcntl() — File descriptor control:

fcntl(fd, F_GETFL);                   // Get flags
fcntl(fd, F_SETFL, O_NONBLOCK);       // Set non-blocking
fcntl(fd, F_DUPFD, 10);               // Duplicate to fd >= 10

futex() — Fast userspace mutex operations:

futex(addr, FUTEX_WAIT, val, timeout);  // Wait if *addr == val
futex(addr, FUTEX_WAKE, n);              // Wake n waiters
futex(addr, FUTEX_CMP_REQUEUE, ...);     // Requeue waiters
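To make the multiplexing visible, the sketch below sends two different commands through the single prctl() entry point: one sets the calling thread's name and the other reads it back. It assumes Linux; rename_thread is a name invented for this example.

```c
#include <sys/prctl.h>   /* prctl(), PR_SET_NAME, PR_GET_NAME */

/*
 * Set the calling thread's name, then read it back into out.
 * Thread names hold at most 15 characters plus the terminating NUL,
 * so out must provide at least 16 bytes.
 */
int rename_thread(const char *name, char out[16])
{
    if (prctl(PR_SET_NAME, (unsigned long)name, 0, 0, 0) != 0)
        return -1;
    return prctl(PR_GET_NAME, (unsigned long)out, 0, 0, 0);
}
```

Both calls use the same system call number; the first argument alone decides whether the kernel reads from or writes to the supplied buffer.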

Multiplexing Drawbacks

Multiplexed calls have disadvantages: they are harder to trace (tools such as strace must decode every request code separately), harder to sandbox (seccomp must understand sub-commands), and error-prone (it is easy to pass the wrong command or argument type). Modern practice prefers separate system calls for major functionality.

Seccomp and Multiplexed Calls

Seccomp (secure computing mode) allows filtering system calls. With multiplexed calls like ioctl, simple call-number filtering is insufficient—you need to inspect the command argument:

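// Allow ioctl, but only for TIOCGWINSZ (get window size).
// Fragment of a struct sock_filter array; architecture check omitted.
BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
         offsetof(struct seccomp_data, nr)),       // Load syscall number
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_ioctl, 0, 4),  // Not ioctl: skip to later rules
BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
         offsetof(struct seccomp_data, args[1])),  // Load request (low 32 bits on little-endian)
BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, TIOCGWINSZ, 0, 1),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),  // Any other ioctl request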

This complexity is one reason the kernel now prefers adding new system calls rather than new ioctl commands for major features.

Cross-Platform System Call Numbering

Different operating systems and architectures use different system call numbers. This creates challenges for:

Cross-platform development
Binary translation and emulation
Security tools that need to interpret syscalls

write() System Call Numbers Across OS/Architectures
System	write() Number	Notes
Linux x86-64	1	Modern Linux ABI
Linux x86-32	4	Original i386
Linux ARM64	64	Generic unified table
FreeBSD x86-64	4	BSD tradition
macOS x86-64	0x2000004	BSD + Mach hybrid
Windows x64	N/A	No stable number; NtWriteFile is the closest equivalent
Solaris x86-64	4	System V tradition

macOS System Call Classes

macOS is particularly interesting—it has multiple system call "classes" accessed through different number ranges:

// macOS syscall classes (embedded in number)
#define SYSCALL_CLASS_UNIX    2  // BSD layer
#define SYSCALL_CLASS_MACH    1  // Mach kernel
#define SYSCALL_CLASS_MDEP    3  // Machine-dependent

// write() is BSD class 2, number 4:
// Full number: 0x2000004 = (2 << 24) | 4

This reflects macOS's hybrid kernel architecture combining BSD and Mach components.
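The class-plus-number encoding is plain bit arithmetic. The sketch below uses the 24-bit shift shown above; the toy_ helper names and macros are invented for illustration and are not the identifiers used in Apple's headers.

```c
#define TOY_CLASS_SHIFT 24
#define TOY_CLASS_MASK  (0xFFu << TOY_CLASS_SHIFT)

/* Combine a class and a per-class number into one syscall number. */
unsigned int toy_make_syscall(unsigned int cls, unsigned int nr)
{
    return (cls << TOY_CLASS_SHIFT) | (nr & ~TOY_CLASS_MASK);
}

/* Extract the class from a combined number. */
unsigned int toy_syscall_class(unsigned int full)
{
    return (full & TOY_CLASS_MASK) >> TOY_CLASS_SHIFT;
}

/* Extract the per-class number from a combined number. */
unsigned int toy_syscall_index(unsigned int full)
{
    return full & ~TOY_CLASS_MASK;
}
```

toy_make_syscall(2, 4) yields 0x2000004, the write() value listed in the table, and the two extractors recover class 2 and number 4 from it.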

Windows: No Traditional Numbers

Windows doesn't expose stable system call numbers to user space. Applications call functions in NTDLL.DLL, which internally uses undocumented syscall numbers. These numbers change between Windows versions:

// NtWriteFile on different Windows versions
// (illustrative, undocumented values; verify against a per-build syscall table):
// Windows 7:    0x0005
// Windows 10:   0x0008 (varies by build!)
// Windows 11:   0x0009 (still changing)

This is by design—Microsoft reserves the right to change the kernel interface, requiring all programs to go through their documented API layer.

Binary Translation Challenges

Emulators like WSL1, QEMU user-mode, and box64 must translate system call numbers between systems. They maintain mapping tables and adapt calling conventions. For example, WSL1 intercepts Linux system calls on Windows and implements them using Windows NT kernel APIs.
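At its simplest, such a mapping is a lookup table from guest numbers to host numbers. The sketch below covers six Linux calls, translating i386 numbers to x86-64 numbers. The struct and function names are invented, and a real translator must also convert argument layouts and calling conventions, which this toy ignores.

```c
#include <stddef.h>

/* One row of the mapping: guest (i386) number to host (x86-64) number. */
struct toy_nr_pair {
    int i386_nr;
    int x86_64_nr;
};

static const struct toy_nr_pair toy_nr_map[] = {
    { 1, 60 },  /* exit  */
    { 2, 57 },  /* fork  */
    { 3,  0 },  /* read  */
    { 4,  1 },  /* write */
    { 5,  2 },  /* open  */
    { 6,  3 },  /* close */
};

/* Return the x86-64 number for an i386 number, or -1 if it is not mapped. */
int toy_translate_i386_to_x86_64(int nr)
{
    for (size_t i = 0; i < sizeof(toy_nr_map) / sizeof(toy_nr_map[0]); i++)
        if (toy_nr_map[i].i386_nr == nr)
            return toy_nr_map[i].x86_64_nr;
    return -1;
}
```

Returning -1 for unmapped numbers matters because 0 is itself a valid x86-64 call (read), so a zero-filled array indexed by guest number could not distinguish "unmapped" from "read".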

System Call Number Evolution: A Historical View

The Linux system call table has grown significantly over its 30+ year history. Examining this growth reveals the evolving needs of computing:

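Linux System Call Table Growth (approximate; the earliest rows predate the x86-64 port and describe i386)
Version	Year	Approx. Highest Number	Notable Additions (around this release)
1.0	1994	~140	Original set
2.0	1996	~180	SMP support calls
2.4	2001	~250	Networking, capabilities
2.6	2003	~270	Futex, epoll, inotify
3.0	2011	~310	Fanotify, name_to_handle
4.0	2015	~330	BPF, memfd, getrandom
5.x	2019	~435	io_uring, pidfd_open
6.0	2022	~450	Landlock, futex_waitv, fsconfig

On x86-64, numbers 335 through 423 are unused: starting with 5.1, new calls jumped to 424 so that they share one number across architectures. The highest number therefore overstates how many calls actually exist.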

Trends in System Call Evolution:

Security Enhancement: Many new calls add security features—seccomp (sandboxing), capabilities (fine-grained privileges), landlock (unprivileged sandboxing).
Race-Free Operations: The *at() family (openat, fstatat, etc.) eliminates TOCTOU race conditions by operating relative to directory file descriptors.
Performance Optimization: io_uring provides a high-performance async I/O interface through just three syscalls (io_uring_setup, io_uring_enter, io_uring_register).
Containerization Support: pidfd_* calls enable container runtimes to manage processes without PID race conditions.
Obsolescence: Some old calls become deprecated (e.g., obsolete signal APIs) but their numbers are never reused.

Adding New System Calls

The process for adding a system call to Linux:

Propose on LKML (Linux Kernel Mailing List)
Justify why existing calls are insufficient
Design API with future extensibility
Implement for all architectures
Add user-space wrappers (glibc, musl)
Document in man pages

This rigorous process ensures stability—once added, a call exists forever.

Finding the Latest Syscalls

To see the newest Linux system calls, check the kernel source: arch/x86/entry/syscalls/syscall_64.tbl shows the canonical x86-64 list. The syscalls(2) manual page (man 2 syscalls) lists the calls and the kernel version in which each appeared.

Summary: The Numbering System

We've explored how operating systems identify requested services through system call numbers—a deceptively simple mechanism with profound implications for compatibility, security, and system evolution.

Key Takeaways

•System call numbers index into handler tables — The kernel uses the number to look up the appropriate function in the syscall table.
•Numbers are passed in registers — Most architectures use a designated register (RAX on x86-64, X8 on ARM64) for speed and atomicity.
•Numbers are permanent — Once assigned, a number is never reused, ensuring decades of binary compatibility.
•Bounds checking is essential — The kernel validates numbers before using them as array indices. Speculation barriers provide additional protection.
•The table is security-critical — Modern kernels make the table read-only and use KASLR to protect against rootkits.
•Different systems use different numbers — Cross-platform tools must account for OS and architecture differences in numbering.

What's Next:

We've seen how the kernel identifies which service is requested. But system calls need more than just identification—they need arguments. The next page examines parameter passing: how applications transfer data to and from the kernel efficiently and safely.
