System calls need more than just identification—they need data. When you call write(fd, buffer, count), the kernel must receive three arguments: the file descriptor number, a pointer to the data buffer, and the byte count. When you call read(), the kernel must not only receive arguments but also return data to user-supplied buffers.
This data transfer across the privilege boundary is one of the most complex aspects of system call implementation. The kernel must receive arguments efficiently, validate that user-provided pointers are safe to access, copy data between address spaces without creating security vulnerabilities, and handle errors gracefully when things go wrong.
By the end of this page, you will understand the register-based and memory-based conventions for passing system call arguments, the critical security requirements for validating user pointers, the mechanisms for copying data between user and kernel space, and how errors and partial failures are handled.
Passing arguments to system calls is more complex than passing arguments to regular functions. Several factors create this complexity:
1. Privilege Boundary
At the trap instruction, execution transitions from Ring 3 to Ring 0. Normal function calling conventions don't apply—there's no direct function call, just a privilege-level change. The kernel and user code have completely separate stacks.
2. Address Space Separation
With virtual memory, the kernel and user space may have different views of memory. A pointer valid in user space might not be directly usable in kernel space (depending on kernel memory model).
3. Security Validation
The kernel cannot trust any data from user space. Every pointer must be validated as pointing to user-accessible memory. Every size argument must be checked for overflow. Every piece of data must be treated as potentially malicious.
4. Performance Requirements
System calls happen billions of times per second globally. The parameter passing mechanism must be highly optimized while maintaining security.
| Challenge | Risk | Solution |
|---|---|---|
| Untrusted pointers | Kernel crash, privilege escalation | Pointer validation, copy functions |
| Buffer overflows | Stack smashing, code execution | Size bounds checking |
| TOCTOU races | Security bypass | Immediate copy-in, validation after copy |
| Null pointers | Kernel panic | Explicit null checks |
| Non-aligned data | CPU exceptions, performance | Alignment handling |
The kernel must treat all user-space data as potentially crafted by an attacker. Even seemingly harmless arguments like file descriptors or byte counts can be attack vectors if not properly validated. The mindset should be: 'What's the worst thing that could happen if this value is malicious?'
Modern architectures pass system call arguments in registers—the fastest and most direct method. Each architecture defines a convention mapping arguments to specific registers:
| Arg # | x86-64 (Linux) | ARM64 | RISC-V | Notes |
|---|---|---|---|---|
| Syscall # | RAX | X8 | A7 | Identifies the system call |
| 1 | RDI | X0 | A0 | First argument |
| 2 | RSI | X1 | A1 | Second argument |
| 3 | RDX | X2 | A2 | Third argument |
| 4 | R10* | X3 | A3 | Fourth argument |
| 5 | R8 | X4 | A4 | Fifth argument |
| 6 | R9 | X5 | A5 | Sixth argument |
*Note: x86-64 uses R10 instead of RCX for the fourth argument because SYSCALL uses RCX to store the return address.
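From C, this register convention is usually hidden behind the C library's `syscall()` wrapper, which loads the call number and arguments into the registers listed above before trapping. The following is a minimal sketch, assuming Linux with a C library that provides `syscall()`; the names `raw_getpid` and `raw_write` are illustrative and not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_getpid, SYS_write */
#include <unistd.h>        /* syscall() */

/* Zero-argument call: only the syscall-number register is loaded. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}

/* Three-argument call: fd, buf and count travel in the first three
 * argument registers. On failure the wrapper returns -1 and sets errno. */
long raw_write(int fd, const void *buf, size_t count)
{
    assert(buf != NULL || count == 0);
    return syscall(SYS_write, fd, buf, count);
}
```

Calling `raw_write(1, "hi\n", 3)` behaves like `write(1, "hi\n", 3)`; the wrapper only makes the call number explicit.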
Why Six Arguments?
Six register arguments cover the vast majority of system calls. Looking at Linux:
| Argument Count | Percentage | Example Calls |
|---|---|---|
| 0 arguments | ~5% | getpid(), fork() |
| 1-2 arguments | ~35% | close(), exit(), time() |
| 3 arguments | ~40% | read(), write(), open() |
| 4-6 arguments | ~18% | mmap(), ppoll(), clone() |
| More than 6 arguments | ~2% | Rarely needed |
For the rare cases needing more than six arguments, a pointer to a structure is passed instead.
```asm
; ssize_t write(int fd, const void *buf, size_t count);
; Returns bytes written or negative error

section .data
    message: db "Hello, World!", 10     ; String with newline
    msg_len: equ $ - message            ; Calculate length

section .text
global _start

_start:
    ; Set up system call arguments
    mov rax, 1                  ; System call number: write = 1
    mov rdi, 1                  ; Arg1 (fd): 1 = stdout
    lea rsi, [rel message]      ; Arg2 (buf): pointer to buffer
    mov rdx, msg_len            ; Arg3 (count): number of bytes

    ; Execute system call
    syscall                     ; Trap to kernel
    ; RAX now contains return value (bytes written or -errno)

    ; Exit
    mov rax, 60                 ; System call number: exit = 60
    xor rdi, rdi                ; Arg1 (status): 0
    syscall
```
On x86-64 Linux, the kernel preserves every general-purpose register across a system call except three: RAX (which carries the return value), RCX (overwritten by SYSCALL with the return address), and R11 (overwritten by SYSCALL with the saved RFLAGS). All three are caller-saved in the function calling convention as well, so compilers can treat a system call much like an ordinary function call.
When system calls need more data than registers can hold, or when data must flow in both directions, memory buffers are used. A pointer to the data is passed in a register.
Common Patterns:
1. Fixed-Size Structures
Calls like stat() return data into a user-provided structure:
```c
struct stat {
    dev_t   st_dev;     // Device ID
    ino_t   st_ino;     // Inode number
    mode_t  st_mode;    // File mode
    nlink_t st_nlink;   // Number of links
    // ... many more fields (~144 bytes on x86-64)
};

int stat(const char *path, struct stat *statbuf);
// statbuf is a user-space pointer, kernel writes to it
```
2. Variable-Length Data
Calls like read() and write() transfer variable amounts of data:
```c
ssize_t read(int fd, void *buf, size_t count);
// buf is user-space pointer, count is maximum bytes
// Returns actual bytes read, fills buffer partially or fully
```
3. String Arguments
Calls like open() take null-terminated path strings:
```c
int open(const char *pathname, int flags);
// pathname is user-space pointer to C string
// Kernel must copy string, find null terminator
```
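Seen from user space, the fixed-size structure and string patterns look like this. The sketch below assumes a POSIX system; `is_directory` is a hypothetical helper written for this illustration. The caller owns both buffers: the path string the kernel reads and the `struct stat` the kernel fills in.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>   /* stat(), struct stat, S_ISDIR */

/* Returns 1 if path names a directory, 0 if it names something else,
 * and -1 if stat() failed (errno is set by the C library). */
int is_directory(const char *path)
{
    struct stat sb;             /* user-space buffer the kernel fills in */

    assert(path != NULL);
    if (stat(path, &sb) != 0)   /* two pointers cross the boundary: path in, sb out */
        return -1;
    return S_ISDIR(sb.st_mode) ? 1 : 0;
}
```

Only the two pointers travel in registers; the path bytes and the structure contents are copied across the boundary by the kernel.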
```c
// Linux kernel: sys_write implementation (simplified)
SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf, size_t, count)
{
    struct fd f;
    ssize_t ret = -EBADF;

    // Get file descriptor (validates fd is valid)
    f = fdget_pos(fd);
    if (f.file) {
        loff_t pos = f.file->f_pos;     // Current file offset

        // Note: 'buf' is marked __user - it's a user-space pointer
        // We cannot directly dereference it!

        // The actual write operation will:
        // 1. Validate the user pointer range
        // 2. Copy data from user space
        // 3. Perform the I/O
        ret = vfs_write(f.file, buf, count, &pos);
        if (ret >= 0)
            f.file->f_pos = pos;        // Store the advanced offset
        fdput_pos(f);
    }
    return ret;
}

// The __user annotation:
// - Documents that this is a user-space pointer
// - Enables sparse (static analysis tool) to detect direct dereferences
// - Has no runtime effect, but prevents bugs at compile time
```
The kernel must NEVER directly dereference a user-space pointer (like *buf or buf[0]). This is because: (1) the pointer might not be valid, causing a kernel panic; (2) the pointer might be a kernel address, enabling kernel memory read/write; (3) the pointer might cause page faults that can't be handled safely. Always use copy_from_user() and copy_to_user().
Before accessing any user-space memory, the kernel must validate that the pointer is safe. This validation has multiple layers:
Layer 1: Address Range Check
The kernel first verifies that the pointer falls within the user address range, not in kernel space:
```c
// x86-64: User addresses are below 0x00007FFFFFFFFFFF
// Anything higher is kernel space
static inline bool __access_ok(unsigned long addr, unsigned long size)
{
    // Check that addr and addr+size-1 are both in user range
    return likely(addr + size <= TASK_SIZE_MAX) &&
           likely(addr <= addr + size);  // Overflow check
}
```
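The range check can be tried out in user space. The sketch below is an assumption-laden model, not kernel code: `USER_LIMIT` is a hypothetical upper bound standing in for `TASK_SIZE_MAX`, and `user_range_ok` is an illustrative name. It is written so that no intermediate sum can wrap, which removes the need for a separate overflow comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical top of the user address range (assumption for this sketch). */
#define USER_LIMIT UINT64_C(0x00007FFFFFFFF000)

static_assert(USER_LIMIT < UINT64_MAX, "limit must be a valid 64-bit bound");

/* True if [addr, addr + size) lies entirely at or below USER_LIMIT.
 * Subtracting from the limit avoids computing addr + size, which could wrap. */
bool user_range_ok(uint64_t addr, uint64_t size)
{
    if (size > USER_LIMIT)
        return false;
    return addr <= USER_LIMIT - size;
}
```

A huge `size` paired with a small `addr`, or an `addr` near the top of the 64-bit range, is rejected without ever forming `addr + size`.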
Layer 2: Page Table Verification
The memory must be mapped and accessible with appropriate permissions. This is handled during the actual access, not beforehand (because page tables can change).
Layer 3: SMAP (Supervisor Mode Access Prevention)
Modern CPUs (Intel Broadwell+, AMD Zen) implement SMAP, which causes a fault if kernel code tries to access user memory without explicitly enabling access. This provides hardware-level protection against accidental user memory access.
```c
// Linux kernel: The core user memory access functions

/**
 * copy_from_user - Copy data from user space to kernel space
 * @to:   Destination address (in kernel space)
 * @from: Source address (in user space, untrusted!)
 * @n:    Number of bytes to copy
 *
 * Returns: Number of bytes that could NOT be copied.
 *          Zero on success, non-zero on partial/failed copy.
 */
unsigned long copy_from_user(void *to, const void __user *from, unsigned long n)
{
    // Check if user pointer is in valid range
    if (access_ok(from, n)) {
        // Temporarily disable SMAP protection
        // (allows kernel to access user memory)
        stac();  // Set AC flag

        // Do the actual copy with exception handling
        n = __do_copy_from_user(to, from, n);

        // Re-enable SMAP protection
        clac();  // Clear AC flag
    }
    return n;  // Return bytes NOT copied
}

/**
 * copy_to_user - Copy data from kernel space to user space
 * @to:   Destination address (in user space, untrusted!)
 * @from: Source address (in kernel space)
 * @n:    Number of bytes to copy
 */
unsigned long copy_to_user(void __user *to, const void *from, unsigned long n)
{
    if (access_ok(to, n)) {
        stac();
        n = __do_copy_to_user(to, from, n);
        clac();
    }
    return n;
}

// Example usage in system call:
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    char kernel_buf[4096];
    ssize_t ret;

    // Read into kernel buffer first
    ret = do_read(fd, kernel_buf, min(count, sizeof(kernel_buf)));

    if (ret > 0) {
        // Copy result to user buffer
        if (copy_to_user(buf, kernel_buf, ret))
            return -EFAULT;  // User buffer was invalid
    }
    return ret;
}
```
The stac (Set AC flag) and clac (Clear AC flag) instructions control SMAP. When SMAP is enabled and AC=0, any kernel access to user memory causes a page fault. The kernel briefly sets AC=1 only when intentionally accessing user memory, providing defense-in-depth against accidental or malicious user memory access.
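The "returns the number of bytes NOT copied" convention is easy to misread, so here is a toy user-space model of it. Everything in this sketch is hypothetical: `struct user_region` stands in for a mapped range of user memory, and `model_copy_from_user` imitates only the return-value convention, not real fault handling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a user mapping: only [base, base + len) is "mapped". */
struct user_region {
    const unsigned char *base;
    size_t len;
};

/* Copies up to n bytes starting at offset off inside the region.
 * Returns the number of bytes NOT copied (0 means complete success). */
size_t model_copy_from_user(void *dst, const struct user_region *r,
                            size_t off, size_t n)
{
    assert(dst != NULL && r != NULL);
    if (off >= r->len)
        return n;                        /* nothing accessible at all */

    size_t avail = r->len - off;
    size_t ok = n < avail ? n : avail;   /* bytes readable before the "fault" */
    memcpy(dst, r->base + off, ok);
    return n - ok;
}
```

A caller that only needs success or failure tests the result for non-zero, which is why kernel code commonly reads `if (copy_from_user(...)) return -EFAULT;`.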
User memory might not be physically present when the kernel tries to access it—it could be swapped out, copy-on-write, or demand-zero. The kernel must handle page faults gracefully during user memory copies.
The Exception Table Mechanism
Linux uses an exception table that maps addresses where faults might occur to fixup code that handles them:
```c
// During copy, if a page fault occurs at certain instructions,
// the kernel can look up what to do:
struct exception_table_entry {
    int insn;   // Relative address of instruction that might fault
    int fixup;  // Relative address of fixup code
};

// When page fault handler runs:
// 1. Look up faulting instruction in exception table
// 2. If found, redirect execution to fixup code
// 3. Fixup code returns error to caller (EFAULT)
// 4. If not found, it's a real kernel bug (panic)
```
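The lookup in steps 1 and 2 can be sketched as a binary search over a table sorted by instruction address. This is a simplified user-space model under stated assumptions: `struct ex_entry` and `find_fixup` are hypothetical names, and the entries hold absolute addresses rather than the relative offsets used by the structure above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified entry with absolute addresses (assumption for this sketch). */
struct ex_entry {
    uintptr_t insn;    /* address of an instruction that is allowed to fault */
    uintptr_t fixup;   /* where execution resumes if it does */
};

/* Binary search over a table sorted by insn.
 * Returns the fixup address, or 0 if the fault was not anticipated. */
uintptr_t find_fixup(const struct ex_entry *tbl, size_t n, uintptr_t fault_ip)
{
    assert(tbl != NULL || n == 0);
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (tbl[mid].insn == fault_ip)
            return tbl[mid].fixup;
        if (tbl[mid].insn < fault_ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;   /* no entry: the fault is a genuine bug */
}
```

A zero result corresponds to step 4: the faulting instruction was never expected to touch user memory.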
```asm
;; Simplified copy_from_user implementation with exception handling
copy_from_user_asm:
    ; RDI = kernel dest, RSI = user source, RDX = count
.Lcopy_loop:
    test rdx, rdx
    jz .Ldone
    ; This instruction might fault if user memory is invalid
1:  mov al, [rsi]           ; Load byte from user space
    mov [rdi], al           ; Store to kernel space
    inc rsi
    inc rdi
    dec rdx
    jmp .Lcopy_loop
.Ldone:
    xor eax, eax            ; Return 0 (success, all bytes copied)
    ret

; Fixup code - execution jumps here if load at label 1 faults
.Lfault:
    mov rax, rdx            ; Return remaining bytes (not copied)
    ret

; Exception table entry - links the faulting instruction to fixup
.section __ex_table, "a"
    .long 1b - .            ; Instruction address (relative)
    .long .Lfault - .       ; Fixup address (relative)
```
Demand Paging and Kernel Copies
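When the kernel accesses user memory that's not present, the CPU raises a page fault while in kernel mode. If the address belongs to a valid user mapping, the fault handler brings the page in (from swap or the backing file, or by allocating a zeroed or copied page) and the copy resumes at the faulting instruction. If the address is not validly mapped, the handler falls back to the exception table and the copy reports an error.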
This means copy_from_user() can trigger disk I/O if user data is swapped out—the kernel must be prepared for this delay.
Code running in atomic context (holding spinlocks, interrupt handlers) cannot use copy_from_user()/copy_to_user() because page faults might sleep waiting for I/O. For atomic contexts, use the non-faulting variants (in Linux, for example, `__copy_from_user_inatomic()` with page faults disabled), which fail immediately if a page fault would occur, or pre-fault the pages first.
String parameters (like file paths) require special handling because their length isn't known until the null terminator is found. The kernel cannot simply call strlen() on a user pointer—this could read arbitrary amounts of kernel memory or crash if the string isn't terminated.
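The bounded scan this requires can be modeled in plain C. The sketch below is a user-space illustration only: `bounded_copy_string` is a hypothetical name, and the model covers the length limit but not fault handling. It reads at most `count` bytes and signals "no terminator found" by returning `count`.

```c
#include <assert.h>
#include <stddef.h>

/* Copies a NUL-terminated string from src to dst, reading at most count
 * bytes. Returns the string length (excluding the NUL) if a terminator
 * was found, or count if it was not. */
long bounded_copy_string(char *dst, const char *src, long count)
{
    assert(dst != NULL && src != NULL && count >= 0);
    for (long i = 0; i < count; i++) {
        dst[i] = src[i];
        if (src[i] == '\0')
            return i;       /* terminator copied; i is the length */
    }
    return count;           /* limit reached without finding a NUL */
}
```

Because a result equal to `count` means the string did not fit, a caller passes the full buffer size and treats that value as "too long" rather than as a valid length.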
The strncpy_from_user() Function
```c
// Linux kernel: lib/strncpy_from_user.c (simplified)

/**
 * strncpy_from_user - Copy string from user space
 * @dst:   Destination buffer (kernel space)
 * @src:   Source address (user space)
 * @count: Maximum bytes to copy (including null terminator)
 *
 * Returns: Length of string (not including null) on success,
 *          count if no null terminator was found within count bytes,
 *          -EFAULT on fault
 */
long strncpy_from_user(char *dst, const char __user *src, long count)
{
    long res;

    if (!access_ok(src, 1))  // At least 1 byte must be accessible
        return -EFAULT;

    stac();  // Enable user memory access

    // Copy up to 'count' bytes, stopping at null terminator
    res = do_strncpy_from_user(dst, src, count);

    clac();  // Disable user memory access
    return res;
}

// Usage in open() system call:
SYSCALL_DEFINE2(open, const char __user *, filename, int, flags)
{
    char kernel_path[PATH_MAX];  // 4096 bytes (real code allocates this, see getname())
    long len;

    // Copy path from user space, max PATH_MAX bytes
    len = strncpy_from_user(kernel_path, filename, PATH_MAX);
    if (len < 0)
        return len;  // Error (EFAULT or similar)
    if (len == PATH_MAX)
        return -ENAMETOOLONG;  // Path too long

    // Now kernel_path is safe to use
    return do_sys_open(kernel_path, flags);
}
```
Path Name Handling: getname()
For filesystem paths specifically, Linux provides optimized functions:
```c
// getname_kernel() - path already in kernel space
// getname()        - path in user space, copies to kernel buffer
// putname()        - releases path buffer

SYSCALL_DEFINE2(stat, const char __user *, filename,
                struct stat __user *, statbuf)
{
    struct filename *name = getname(filename);
    struct stat stat;   // Kernel-side result buffer (simplified)
    int error;

    if (IS_ERR(name))
        return PTR_ERR(name);

    error = vfs_stat(name->name, &stat);  // Now safe to use
    putname(name);

    // Copy result to user space
    if (!error && copy_to_user(statbuf, &stat, sizeof(stat)))
        error = -EFAULT;
    return error;
}
```
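getname() handles allocating the kernel buffer, copying the string from user space, and rejecting paths that are too long. On failure it returns an error pointer, which is why the caller checks IS_ERR().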
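Linux defines PATH_MAX as 4096 bytes, including the terminating null byte. A path string longer than this is rejected with ENAMETOOLONG. The limit applies to each path string passed to a system call, not to the fully resolved location of a file (individual components are separately limited to NAME_MAX, 255 bytes). Symlink resolution can therefore traverse longer total paths, which is why Linux also limits how many symlinks it will follow and fails with ELOOP beyond that.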
Many system calls pass complex data structures. The kernel must copy these safely while handling structure layout differences.
Copying Structures:
```c
// Modern extensible structure pattern: io_uring_params
struct io_uring_params {
    __u32 sq_entries;       // Size of submission queue
    __u32 cq_entries;       // Size of completion queue
    __u32 flags;            // Feature flags
    __u32 sq_thread_cpu;    // CPU for kernel polling thread
    __u32 sq_thread_idle;   // Idle time before sleeping
    __u32 features;         // Kernel feature flags (output)
    __u32 wq_fd;            // Workqueue sharing descriptor
    __u32 resv[3];          // Reserved for future use
    struct io_sqring_offsets sq_off;  // Offsets for SQ ring
    struct io_cqring_offsets cq_off;  // Offsets for CQ ring
};

// The call takes no size argument; reserved fields that must be zero
// leave room for future extension
SYSCALL_DEFINE2(io_uring_setup, u32, entries,
                struct io_uring_params __user *, p)
{
    struct io_uring_params params;
    int ret;

    // Copy entire structure from user space
    if (copy_from_user(&params, p, sizeof(params)))
        return -EFAULT;

    // Validate reserved fields are zero (compatibility check)
    if (params.resv[0] || params.resv[1] || params.resv[2])
        return -EINVAL;

    // Validate flags (reject unknown flags)
    if (params.flags & ~IORING_SETUP_SUPPORTED_FLAGS)
        return -EINVAL;

    // Do the actual setup
    ret = io_uring_create(entries, &params);

    // Copy output fields back to user
    if (!ret && copy_to_user(p, &params, sizeof(params)))
        ret = -EFAULT;
    return ret;
}
```
Array Arguments:
Some calls pass arrays of items:
```c
// poll() takes an array of file descriptors to monitor
struct pollfd {
    int   fd;       // File descriptor
    short events;   // Events to watch for
    short revents;  // Events that occurred (output)
};

int poll(struct pollfd *fds, nfds_t nfds, int timeout);
// fds points to array of nfds items
// Kernel copies entire array in, modifies revents, copies back
```
The kernel must check that `nfds * sizeof(struct pollfd)` doesn't overflow before it allocates or copies anything.
When multiplying count × sizeof(element), overflow can create security vulnerabilities. If nfds = 0x20000001 and sizeof(struct pollfd) = 8, a 32-bit multiplication wraps to 8, causing the kernel to allocate a small buffer but then process a huge number of entries (buffer overflow). Always check: count <= MAX_SAFE / sizeof(element).
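The division-based check can be sketched as follows. Both function names are hypothetical: `array_bytes` performs the checked multiplication, and `wrapped_bytes32` shows what an unchecked 32-bit multiplication produces for comparison (for instance, 0x20000001 × 8 wraps to 8).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Computes count * elem_size into *out. Returns false, leaving *out
 * untouched, if the product would not fit in size_t. */
bool array_bytes(size_t count, size_t elem_size, size_t *out)
{
    assert(out != NULL);
    if (elem_size != 0 && count > SIZE_MAX / elem_size)
        return false;
    *out = count * elem_size;
    return true;
}

/* Unchecked 32-bit multiplication: unsigned arithmetic wraps modulo 2^32. */
uint32_t wrapped_bytes32(uint32_t count, uint32_t elem_size)
{
    return count * elem_size;
}
```

Dividing the maximum by the element size keeps the check itself free of overflow, which a comparison written as `count * elem_size > limit` would not be.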
One of the most subtle security issues in parameter handling is the TOCTOU race condition—where user data changes between when the kernel checks it and when the kernel uses it.
The Classic Attack:
Imagine a vulnerable kernel implementation:
```c
// VULNERABLE CODE - DO NOT USE
SYSCALL_DEFINE2(open, const char __user *, path, int, flags)
{
    // Step 1: Check if user can access the file
    if (!access_ok_for_user(path))  // Reads path from user space
        return -EACCES;

    // <-- RACE WINDOW: Another thread changes 'path' in memory

    // Step 2: Actually open the file
    return do_open(path);  // Reads path AGAIN from user space
}
```
An attacker can:
1. Set `path` to /harmless/file and make the call.
2. From a second thread, change `path` to /etc/shadow during the race window.
3. Let the kernel's permission check read /harmless/file, which passes.
4. Have the kernel then open /etc/shadow!
```c
// CORRECT: Copy ONCE, then use the kernel copy
SYSCALL_DEFINE2(open, const char __user *, path, int, flags)
{
    // Copy user data once into kernel space
    struct filename *name = getname(path);
    int fd;

    if (IS_ERR(name))
        return PTR_ERR(name);

    // Now use ONLY the kernel copy - user can't modify it
    // All permission checks and operations use name->name
    fd = do_filp_open(name->name, flags);

    putname(name);
    return fd;
}

// General principle: COPY FIRST, VALIDATE AFTER
// Never read the same user memory twice
// Never validate user data without copying first

// Another example - validating struct fields:
SYSCALL_DEFINE2(set_value, struct config __user *, cfg, size_t, size)
{
    struct config kernel_cfg;

    if (size != sizeof(kernel_cfg))
        return -EINVAL;

    // Copy the ENTIRE struct
    if (copy_from_user(&kernel_cfg, cfg, sizeof(kernel_cfg)))
        return -EFAULT;

    // Now validate the KERNEL COPY, not user memory
    if (kernel_cfg.value > MAX_VALUE)
        return -EINVAL;
    if (kernel_cfg.flags & ~VALID_FLAGS)
        return -EINVAL;

    // Use kernel_cfg, never touch cfg again
    return apply_config(&kernel_cfg);
}
```
Reading the same user memory twice is called a 'double-fetch' and is almost always a security bug. Static analysis tools specifically look for this pattern. The rule is simple: copy once, then never read from user space again for that data.
System calls must communicate results back to user space—both success values and error conditions. Different conventions exist across operating systems.
Linux/UNIX Convention:
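System calls return a single value in a register (RAX on x86-64):
- A non-negative value means success and carries the result.
- A value from -1 to -4095 means failure, and -return_value is the errno.

| Return Value | Meaning |
|---|---|
| ≥ 0 | Success (value is result) |
| -1 to -4095 | Error (-value is errno) |
| < -4095 | Not an error code; treated as a result (for example, a high address viewed as a signed number) |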
The C library translates this:
```c
// glibc syscall wrapper (simplified)
long syscall_wrapper(long syscall_num, ...) {
    long result = do_syscall(syscall_num, ...);

    // Errors occupy the top 4095 values of the unsigned range (-4095..-1)
    if ((unsigned long)result >= (unsigned long)-4095L) {
        // Error: convert to errno
        errno = -result;
        return -1;
    }
    return result;
}
```
| Error | Value | Meaning |
|---|---|---|
| EPERM | 1 | Operation not permitted |
| ENOENT | 2 | No such file or directory |
| EINTR | 4 | Interrupted system call |
| EIO | 5 | Input/output error |
| ENOMEM | 12 | Cannot allocate memory |
| EACCES | 13 | Permission denied |
| EFAULT | 14 | Bad address (invalid pointer) |
| EEXIST | 17 | File exists |
| EINVAL | 22 | Invalid argument |
| ENOSYS | 38 | Function not implemented |
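Two of these codes are easy to provoke from user space. The sketch below assumes a POSIX system; `errno_after_bad_write` and `errno_after_open` are hypothetical helpers that report the errno value the C library derived from the kernel's negative return.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>    /* open(), O_RDONLY */
#include <stddef.h>
#include <unistd.h>   /* write(), close() */

/* Writes to an invalid descriptor and returns the resulting errno. */
int errno_after_bad_write(void)
{
    errno = 0;
    ssize_t r = write(-1, "x", 1);
    assert(r == -1);          /* the wrapper reports failure as -1 */
    return errno;
}

/* Opens path read-only. Returns 0 on success or the errno on failure. */
int errno_after_open(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return errno;
    close(fd);
    return 0;
}
```

The write fails with EBADF because descriptor -1 is never valid; opening a path that does not exist yields ENOENT.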
```c
// Linux kernel: Error return patterns
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    struct fd f = fdget(fd);
    ssize_t ret;

    // Error: invalid file descriptor
    if (!f.file)
        return -EBADF;  // Return negative errno

    // Error: buffer pointer is invalid
    if (!access_ok(buf, count)) {
        fdput(f);
        return -EFAULT;  // Bad address
    }

    // Try to read
    ret = vfs_read(f.file, buf, count, &f.file->f_pos);
    // ret is either:
    //   >= 0: number of bytes read (success)
    //   <  0: negative errno (error from lower layers)

    fdput(f);
    return ret;  // Propagate error or success
}

// Multiple output values: Some calls need to return both status and data.
// Solutions:
// 1. Output struct (stat, getrusage)
// 2. Pointer arguments get filled (pipe: int pipefd[2])
// 3. Multiplex: high bits are status, low bits are data (waitpid status)
```
Some calls can partially succeed: read() might return fewer bytes than requested, write() might write only some data. The caller must check the return value and potentially retry. Signal interruption (EINTR) is a special case where the call should typically be restarted.
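The retry logic a caller needs for short writes and EINTR can be sketched as a loop. This assumes a POSIX system; `write_all` is a hypothetical helper name, not a library function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>   /* write() */

/* Writes exactly len bytes unless a real error occurs.
 * Returns 0 on success, -1 on error (errno is set by write()). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data moved: retry */
            return -1;          /* genuine error: give up */
        }
        p += n;                 /* short write: skip what was accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

The loop treats a short write as progress rather than failure, and only a negative return other than EINTR ends it early.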
We've explored the mechanisms for passing data between user space and the kernel—a seemingly simple operation that requires careful attention to security, correctness, and performance.
What's Next:
We've now covered how arguments get to the kernel and how errors are reported. The final piece of the system call mechanism is returning to user mode—how the kernel restores user state and transitions back to Ring 3 execution. We'll examine the return path, signal delivery, and the complete round-trip of a system call.
You now understand how system call parameters are safely transferred across the privilege boundary. From register conventions to TOCTOU race prevention, these mechanisms ensure that the kernel can receive and return data without compromising system security. Next, we'll complete the journey by examining the return to user mode.