System calls represent the most critical trust boundary in any operating system. On one side is user space—untrusted, potentially malicious, definitely buggy code that can send any garbage it wants. On the other side is the kernel—the most privileged code in the system, with access to all memory, all processes, all hardware.
Every argument to every syscall is a potential attack vector. A malicious user could:

- Pass a pointer into kernel memory, tricking the kernel into reading or overwriting its own data
- Pass an unmapped or non-canonical address to crash the kernel
- Pass a huge size or count that wraps bounds arithmetic
- Race another thread to change user memory between the kernel's check and its use
Parameter validation is the kernel's immune system. Get it wrong, and you have a privilege escalation vulnerability. This page explores how the kernel validates every syscall argument to maintain system security.
By the end of this page, you will understand how the kernel validates syscall parameters—from address space verification through safe memory copy primitives to TOCTTOU (time-of-check-to-time-of-use) attack prevention. You'll know why direct pointer dereference is forbidden and how the kernel safely moves data across the user/kernel boundary.
User-space code can invoke any syscall with any arguments. The kernel must assume every argument is malicious until proven otherwise.
Why can't we just dereference user pointers?
Consider a naive implementation of sys_read():
```c
/* VULNERABLE: Never do this! */
ssize_t vulnerable_sys_read(int fd, char *buf, size_t count)
{
    struct file *f = get_file(fd);
    char kernel_buffer[1024];

    /* Read data from file into kernel buffer */
    ssize_t n = read_file(f, kernel_buffer, count);

    /* VULNERABLE: Directly copy to user pointer! */
    memcpy(buf, kernel_buffer, n);
    return n;
}

/* Attack 1: buf points to kernel memory
 *
 * User calls: read(fd, 0xffffffff81000000, 100);
 * (0xffffffff81000000 is in kernel space!)
 *
 * The memcpy writes to kernel memory,
 * potentially overwriting kernel code/data.
 * Result: Kernel crash or arbitrary code execution.
 */

/* Attack 2: buf is not mapped
 *
 * User calls: read(fd, 0xdeadbeef, 100);
 * (0xdeadbeef is unmapped)
 *
 * The memcpy causes a page fault in kernel mode.
 * At best: kernel crash (DoS)
 * At worst: exploitable panic handler
 */

/* Attack 3: Integer overflow
 *
 * User calls: read(fd, buf, 0xffffffffffffffff);
 * (count = SIZE_MAX)
 *
 * count + offset in the kernel may overflow,
 * and bounds checks may be bypassed.
 * Result: Out-of-bounds read/write
 */
```

The kernel CANNOT safely dereference any pointer provided by user space. The pointer could point to kernel memory, unmapped memory, or memory-mapped hardware registers—or change value between check and use. All access must go through validated copy functions.
The categories of validation:
Parameter validation breaks down into several categories, each preventing a different attack class:

- Address-range validation — is the pointer even in user space? (access_ok())
- Safe copying — moving data across the boundary without faulting the kernel (copy_from_user(), copy_to_user(), get_user(), put_user())
- String handling — bounded, fault-safe copies of null-terminated data (strncpy_from_user())
- Race prevention — defeating TOCTTOU and double-fetch attacks
- Integer-overflow checking — ensuring size arithmetic cannot wrap
The first line of defense is checking whether a pointer is even in the range of possible user addresses. On x86-64 Linux, user space occupies only the bottom 128 TB of the 64-bit address space (see the layout table below).
The access_ok() macro performs this fundamental check:
```c
/* Linux kernel: arch/x86/include/asm/uaccess.h */

/* Verify that a user pointer is in the valid user range */
#define access_ok(addr, size) \
    likely(!__range_not_ok(addr, size, TASK_SIZE_MAX))

/* The actual range check */
static inline bool __range_not_ok(unsigned long addr, unsigned long size,
                                  unsigned long limit)
{
    /*
     * If addr + size overflows, this test will catch it
     * because the result would wrap around below addr.
     *
     * Also catches addr + size > limit (outside user space).
     */
    return unlikely(size > limit) || unlikely(addr > limit - size);
}

/* TASK_SIZE_MAX: The upper bound of user space
 * On x86-64: 0x00007ffffffff000 (128 TB minus guard pages)
 */

/* Example usage in a syscall handler */
ssize_t ksys_read(int fd, char __user *buf, size_t count)
{
    /* First: verify buf is in user space */
    if (!access_ok(buf, count))
        return -EFAULT; /* Bad address */

    /* Now we know buf points into user space, but we still can't
     * dereference it directly. We must use copy_to_user().
     */
    /* ... */
}
```

access_ok() only checks that the address is in the user address range. It does NOT verify that the memory is mapped, that the process has permission to access it, or that the pages are currently resident. Those checks happen during the actual copy.
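To see why the single comparison `addr > limit - size` also rejects wrapping ranges, it helps to plug in concrete numbers. Below is a user-space re-creation of the same check—ordinary C, not kernel code—with the expected outputs in comments:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define TASK_SIZE_MAX 0x00007ffffffff000UL /* x86-64 user-space limit */

/* User-space re-implementation of the kernel's range check */
static bool range_not_ok(uint64_t addr, uint64_t size, uint64_t limit)
{
    return size > limit || addr > limit - size;
}

int main(void)
{
    /* Ordinary buffer well inside user space: accepted */
    printf("%d\n", range_not_ok(0x400000, 4096, TASK_SIZE_MAX));       /* 0 */

    /* Kernel address: rejected because addr > limit - size */
    printf("%d\n", range_not_ok(0xffffffff81000000UL, 100,
                                TASK_SIZE_MAX));                       /* 1 */

    /* A size so large that addr + size would wrap past 2^64 is
     * rejected by the size > limit clause alone - the overflowing
     * sum is never computed. */
    printf("%d\n", range_not_ok(0x00007fffffffe000UL,
                                UINT64_MAX - 0x1000, TASK_SIZE_MAX));  /* 1 */
    return 0;
}
```

Note how the check never adds addr and size directly: together, the two comparisons cover every out-of-range combination without risking the very overflow they guard against.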
Historical note: set_fs() and the dangers it posed
Older Linux kernels had a mechanism called set_fs() that could temporarily raise the user/kernel address limit, letting kernel code treat kernel pointers as "user" pointers and effectively bypassing access_ok(). This was removed in Linux 5.10 because:

- A forgotten or mismatched restore left the entire kernel address space open to subsequent "user" accesses—a recurring source of serious vulnerabilities
- It made every access_ok() check depend on hidden per-thread state, which was hard to audit and easy to get wrong
- Removing it allowed the user-copy paths to be simplified and more aggressively hardened
| Address Range | Size | Purpose | Pointer Check |
|---|---|---|---|
| 0x0000000000000000 - 0x00007FFFFFFFFFFF | 128 TB | User space | access_ok() passes |
| 0x0000800000000000 - 0xFFFF7FFFFFFFFFFF | ~16 EB | Non-canonical hole | CPU fault on use |
| 0xFFFF800000000000 - 0xFFFFBFFFFFFFFFFF | 64 TB | Kernel direct mapping | access_ok() fails |
| 0xFFFFFFFF80000000 - 0xFFFFFFFFFFFFFFFF | 2 GB | Kernel text/modules | access_ok() fails |
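You can observe this from user space: handing a syscall an address from the kernel half of the table doesn't crash anything—the call just fails with EFAULT. A minimal demonstration, assuming Linux on x86-64 and any readable file (here /etc/hostname):

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0)
        return 1;

    /* Kernel-space address: access_ok() fails, read() returns EFAULT */
    ssize_t n = read(fd, (void *)0xffffffff81000000UL, 100);
    printf("kernel addr:   ret=%zd errno=%s\n", n, strerror(errno));

    /* Unmapped but valid-range user address: access_ok() passes, the
     * page fault is caught during the copy, and we still get EFAULT */
    n = read(fd, (void *)0x1000UL, 100);
    printf("unmapped addr: ret=%zd errno=%s\n", n, strerror(errno));

    close(fd);
    return 0;
}
```

Both calls return -1 with errno set to EFAULT; the kernel absorbed the bad pointers without so much as a warning in dmesg.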
Once we've verified the address range, we need to actually move data between kernel and user space. The kernel provides specialized functions that handle all the complexity:
The copy_*_user() family:
- copy_from_user(to, from, n) — Copy n bytes from user space to the kernel
- copy_to_user(to, from, n) — Copy n bytes from the kernel to user space
- get_user(x, ptr) — Read a simple value (1, 2, 4, or 8 bytes) from user space
- put_user(x, ptr) — Write a simple value to user space
- strncpy_from_user() — Copy a null-terminated string from user space
```c
/* Safe memory copy functions - conceptual implementation */

/* Copy from user space to kernel space */
unsigned long copy_from_user(void *to, const void __user *from,
                             unsigned long n)
{
    /* First: validate the source address range */
    if (!access_ok(from, n)) {
        /* Clear destination (security: don't leak old kernel data) */
        memset(to, 0, n);
        return n; /* Return number of bytes NOT copied */
    }

    /* Attempt the copy, handling page faults */
    return raw_copy_from_user(to, from, n);
}

/* Copy from kernel space to user space */
unsigned long copy_to_user(void __user *to, const void *from,
                           unsigned long n)
{
    /* Validate the destination range */
    if (!access_ok(to, n))
        return n; /* Return number of bytes NOT copied */

    return raw_copy_to_user(to, from, n);
}

/* Return value convention:
 * Returns 0 on success (all bytes copied)
 * Returns >0 on failure (number of bytes NOT copied)
 */

/* Example usage */
ssize_t sys_read_impl(struct file *f, char __user *buf, size_t count)
{
    char *kbuf;
    ssize_t ret;

    /* Allocate kernel buffer */
    kbuf = kmalloc(count, GFP_KERNEL);
    if (!kbuf)
        return -ENOMEM;

    /* Read data into kernel buffer (conceptual helper; the real
     * vfs_read() takes a user pointer) */
    ret = vfs_read(f, kbuf, count);
    if (ret <= 0)
        goto out;

    /* Copy to user space SAFELY */
    if (copy_to_user(buf, kbuf, ret)) {
        /* copy_to_user failed (returned non-zero) */
        ret = -EFAULT;
    }

out:
    kfree(kbuf);
    return ret;
}
```

How page faults are handled:
The raw_copy_*_user() functions may encounter page faults—the user memory might not be currently resident. The kernel handles this with a clever mechanism:
```c
/* The kernel uses exception tables to handle faults gracefully */

/* Assembly for raw_copy_from_user (simplified concept) */
/*
 * 1: rep movsb          // Copy bytes
 *    xor %eax, %eax     // Return 0 (success)
 *    ret
 * 2: mov %ecx, %eax     // Return remaining count (failure)
 *    ret
 *
 * .section __ex_table
 *   .quad 1b, 2b        // If fault at 1, jump to 2
 */

/* When a page fault occurs during the copy:
 *
 * 1. CPU generates a page fault exception
 * 2. do_page_fault() is called
 * 3. Kernel checks if the fault came from an "expected" location
 * 4. Looks up RIP in the exception table (__ex_table)
 * 5. If found: modify RIP to jump to the fixup handler
 * 6. If not found: kernel oops (bug in kernel code)
 * 7. The fixup handler returns an error to the caller
 */

/* The exception table entry */
struct exception_table_entry {
    int insn;    /* Relative address of faulting instruction */
    int fixup;   /* Relative address of fixup code */
    int handler; /* Handler type (optional) */
};

/* During page fault handling */
bool fixup_exception(struct pt_regs *regs, int trap)
{
    const struct exception_table_entry *e;

    /* Search for this RIP in the exception table */
    e = search_exception_tables(regs->ip);
    if (e) {
        /* Found! Redirect RIP to the fixup code (the stored offset
         * is relative to the address of the field itself). */
        regs->ip = (unsigned long)&e->fixup + e->fixup;
        return true; /* Fault handled */
    }
    return false; /* No fixup - kernel bug! */
}
```

The exception table approach means that copy_*_user() can never cause a kernel crash, even if the user provides a completely invalid pointer. The fault is caught, the copy aborts, and the caller receives an error. This is why we MUST use these functions instead of direct pointer access.
For accessing single values (1, 2, 4, or 8 bytes), the kernel provides optimized get_user() and put_user() macros that are faster than copy_*_user() for small transfers:
```c
/* Optimized accessors for simple types */

/* get_user: Read a single value from user space
 *
 * x:   kernel variable to receive the value
 * ptr: user-space pointer
 * Returns: 0 on success, -EFAULT on error
 */
#define get_user(x, ptr)                            \
({                                                  \
    int __ret;                                      \
    __typeof__(*(ptr)) __val;                       \
    /* Verify address is in user space */           \
    if (!access_ok(ptr, sizeof(*(ptr)))) {          \
        __ret = -EFAULT;                            \
    } else {                                        \
        __ret = __get_user(__val, ptr);             \
        (x) = __val;                                \
    }                                               \
    __ret;                                          \
})

/* put_user: Write a single value to user space */
#define put_user(x, ptr)                            \
({                                                  \
    int __ret;                                      \
    if (!access_ok(ptr, sizeof(*(ptr)))) {          \
        __ret = -EFAULT;                            \
    } else {                                        \
        __ret = __put_user(x, ptr);                 \
    }                                               \
    __ret;                                          \
})

/* Example usage */
SYSCALL_DEFINE2(getrlimit, unsigned int, resource,
                struct rlimit __user *, rlim)
{
    struct rlimit value;
    int retval;

    retval = do_prlimit(current, resource, NULL, &value);
    if (retval)
        return retval;

    /* Write the result to user space */
    if (put_user(value.rlim_cur, &rlim->rlim_cur) ||
        put_user(value.rlim_max, &rlim->rlim_max))
        return -EFAULT;
    return 0;
}

/* __get_user (unsafe version) - used after access_ok() */
/* These are faster but the caller MUST verify the address first */
#define __get_user(x, ptr) ({                           \
    __typeof__(*(ptr)) __gu_val;                        \
    int __gu_err = 0;                                   \
    /* Inline assembly with exception handler */        \
    switch (sizeof(*(ptr))) {                           \
    case 1: __get_user_asm(__gu_val, ptr, "b"); break;  \
    case 2: __get_user_asm(__gu_val, ptr, "w"); break;  \
    case 4: __get_user_asm(__gu_val, ptr, "l"); break;  \
    case 8: __get_user_asm(__gu_val, ptr, "q"); break;  \
    }                                                   \
    (x) = __gu_val;                                     \
    __gu_err;                                           \
})
```

The double-underscore versions (__get_user, __put_user) skip the access_ok() check for performance. They should ONLY be used when the caller has already verified the address. Misuse leads to security vulnerabilities. Prefer the safe versions (get_user, put_user) unless you're absolutely certain access_ok() was already called.
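The intended use of the unsafe variants is batched access: pay for access_ok() once, then run the cheap accessor in a loop. A sketch under that assumption—fill_from_user() is a made-up helper name, and note that the size multiplication is overflow-checked before the range check:

```c
/* Copy an array of ints from user space, checking the range once */
static int fill_from_user(int *dst, const int __user *src, size_t n)
{
    size_t i;

    /* Guard the multiplication before using it in the range check */
    if (n > SIZE_MAX / sizeof(*src))
        return -EINVAL;

    /* Single range check for the entire array */
    if (!access_ok(src, n * sizeof(*src)))
        return -EFAULT;

    for (i = 0; i < n; i++) {
        /* Safe: the range was validated above. __get_user still
         * handles faults on unmapped pages via the exception table. */
        if (__get_user(dst[i], &src[i]))
            return -EFAULT;
    }
    return 0;
}
```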
| Function | Direction | Size | Returns |
|---|---|---|---|
| get_user(x, ptr) | User → Kernel | 1/2/4/8 bytes | 0 or -EFAULT |
| put_user(x, ptr) | Kernel → User | 1/2/4/8 bytes | 0 or -EFAULT |
| copy_from_user(to, from, n) | User → Kernel | n bytes | Bytes not copied |
| copy_to_user(to, from, n) | Kernel → User | n bytes | Bytes not copied |
| strncpy_from_user(to, from, n) | User → Kernel | String | Length or -EFAULT |
| strnlen_user(s, n) | Measure length | String | Length or 0 |
| clear_user(to, n) | Clear user memory | n bytes | Bytes not cleared |
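Putting several of these functions together: a typical handler copies a fixed-size request struct in, validates and works on the kernel copy only, then copies the result back. The sketch below follows that pattern; struct demo_req, DEMO_VALID_FLAGS, and compute_value() are invented names for illustration:

```c
#include <linux/types.h>
#include <linux/uaccess.h>
#include <linux/errno.h>

/* Hypothetical request layout and flag mask, for illustration only */
#define DEMO_VALID_FLAGS 0x3

struct demo_req {
    u32 flags;
    u64 value;
};

static u64 compute_value(u32 flags)
{
    return flags * 2ULL; /* stand-in for real work */
}

static long demo_ioctl_get(struct demo_req __user *ureq)
{
    struct demo_req req;

    /* One copy_from_user() for the whole struct: every field is now
     * a kernel copy that user space cannot change under us */
    if (copy_from_user(&req, ureq, sizeof(req)))
        return -EFAULT;

    /* Validate the kernel copy, never the user memory */
    if (req.flags & ~DEMO_VALID_FLAGS)
        return -EINVAL;

    req.value = compute_value(req.flags);

    /* Copy the result back; any failure becomes -EFAULT */
    if (copy_to_user(ureq, &req, sizeof(req)))
        return -EFAULT;
    return 0;
}
```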
Strings from user space are particularly dangerous. They can be:

- Missing a null terminator entirely, running off into unmapped memory
- Arbitrarily long, overflowing any fixed-size kernel buffer
- Changed by another thread while the kernel is reading them
- Terminated right at a page boundary, with the next page unmapped
The kernel provides strncpy_from_user() for safe string copy:
```c
/* Safe string copy from user space */

/**
 * strncpy_from_user - Copy a string from user space
 * @dst:   Destination kernel buffer
 * @src:   Source user-space pointer
 * @count: Maximum bytes to copy (including null terminator)
 *
 * Returns:
 *  - Length of string (not including null) on success
 *  - -EFAULT on access error
 *  - If the string is longer than count, copies count-1 chars + null
 */
long strncpy_from_user(char *dst, const char __user *src, long count)
{
    long res;

    if (!access_ok(src, 1)) /* Check at least the first byte */
        return -EFAULT;

    /* Call the architecture-specific safe version */
    res = do_strncpy_from_user(dst, src, count);

    if (res >= count) {
        /* String was truncated - add null terminator */
        dst[count - 1] = '\0';
        return count - 1;
    }
    return res;
}

/* Example: Getting a pathname from user space */
char *getname(const char __user *filename)
{
    char *kname;
    int len;

    /* PATH_MAX is typically 4096 */
    kname = kmalloc(PATH_MAX, GFP_KERNEL);
    if (!kname)
        return ERR_PTR(-ENOMEM);

    /* Safely copy the path */
    len = strncpy_from_user(kname, filename, PATH_MAX);
    if (len < 0) {
        kfree(kname);
        return ERR_PTR(len); /* -EFAULT */
    }
    if (len == PATH_MAX - 1) {
        /* Possibly truncated - path too long */
        kfree(kname);
        return ERR_PTR(-ENAMETOOLONG);
    }
    return kname;
}

/* strnlen_user: Get the length of a user string without copying.
 * Returns the length including the null terminator, or 0 on error.
 *
 * Useful for pre-allocation:
 *   len = strnlen_user(user_str, MAX_LEN);
 *   if (len == 0) return -EFAULT;
 *   if (len > MAX_LEN) return -ENAMETOOLONG;
 *   buf = kmalloc(len, GFP_KERNEL);
 *   strncpy_from_user(buf, user_str, len);
 */
long strnlen_user(const char __user *str, long count);
```

The kernel's getname() function (and getname_flags()) handles all the complexity of copying paths from user space, including length limits, error handling, and audit logging. Most syscalls dealing with filenames use this shared infrastructure rather than calling strncpy_from_user() directly.
TOCTTOU (Time-of-Check-to-Time-of-Use) vulnerabilities occur when the system checks a condition, then later acts on it—but the condition changes between check and use.
Classic TOCTTOU attack:
access("/tmp/foo", W_OK) to check write permissionopen("/tmp/foo", O_WRONLY)123456789101112131415161718192021222324252627282930313233343536373839
```c
/* TOCTTOU-vulnerable pattern (in USER space) */
/* This is broken by design, not a kernel bug */

if (access(filename, W_OK) == 0) {
    /* Window of vulnerability between check and use:
     * the attacker can change the filesystem here! */
    int fd = open(filename, O_WRONLY); /* Might open a different file */
    write(fd, data, len);
}

/* TOCTTOU in KERNEL space - validation must be atomic with use */

/* VULNERABLE kernel code (conceptual) */
ssize_t vulnerable_read(char __user *buf, size_t count)
{
    /* Check that buf is valid */
    if (!access_ok(buf, count))
        return -EFAULT;

    /* Window: another thread can unmap buf here! */

    /* Use buf - might fault now! */
    for (size_t i = 0; i < count; i++)
        *buf++ = data[i]; /* CRASH! */
    return count;
}

/* CORRECT approach: use the fault-safe copy functions */
ssize_t correct_read(char __user *buf, size_t count)
{
    /* copy_to_user combines the check with the access: if a page
     * becomes invalid during the copy, the exception table catches
     * the fault and an error is returned instead of crashing.
     */
    if (copy_to_user(buf, kernel_data, count))
        return -EFAULT;
    return count;
}
```

How the kernel prevents TOCTTOU in syscalls:
```c
/* Double-fetch vulnerability example */

struct user_cmd {
    size_t size;
    char data[];
};

/* VULNERABLE: Double fetch from user memory */
int vuln_handler(struct user_cmd __user *ucmd)
{
    size_t size;
    char *kbuf;

    /* First fetch: get size */
    if (get_user(size, &ucmd->size))
        return -EFAULT;

    /* Allocate based on size */
    if (size > MAX_SIZE)
        return -EINVAL;
    kbuf = kmalloc(size, GFP_KERNEL);
    if (!kbuf)
        return -ENOMEM;

    /* VULNERABLE: Second fetch of size! */
    /* Attacker changes ucmd->size between fetches */
    if (copy_from_user(kbuf, ucmd->data, ucmd->size)) /* WRONG! */
        return -EFAULT;

    /* If the attacker raised size to a huge value: buffer overflow! */
    return 0;
}

/* CORRECT: Copy once, use the copy */
int safe_handler(struct user_cmd __user *ucmd)
{
    size_t size;
    char *kbuf;

    /* Fetch size exactly once */
    if (get_user(size, &ucmd->size))
        return -EFAULT;

    if (size > MAX_SIZE)
        return -EINVAL;
    kbuf = kmalloc(size, GFP_KERNEL);
    if (!kbuf)
        return -ENOMEM;

    /* Use the KERNEL copy of size */
    if (copy_from_user(kbuf, ucmd->data, size)) /* Correct! */
        return -EFAULT;

    /* size is in kernel memory - the attacker can't change it */
    return 0;
}
```

Any value read from user space must be copied to kernel memory ONCE, and only the kernel copy used thereafter. Reading user memory multiple times creates race conditions that attackers can exploit.
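Another way to honor the copy-once rule is to avoid per-field get_user() calls entirely: pull the whole fixed-size header across with one copy_from_user(), then read only the kernel copy. A sketch reusing struct user_cmd and MAX_SIZE from the example above (safe_handler_v2 is a hypothetical name):

```c
/* Copy the fixed-size header in one operation, then trust only
 * the kernel copy - there is no second fetch to race against */
int safe_handler_v2(struct user_cmd __user *ucmd)
{
    struct user_cmd hdr;
    char *kbuf;

    if (copy_from_user(&hdr, ucmd, sizeof(hdr)))
        return -EFAULT;

    if (hdr.size > MAX_SIZE)
        return -EINVAL;

    kbuf = kmalloc(hdr.size, GFP_KERNEL);
    if (!kbuf)
        return -ENOMEM;

    /* hdr.size lives in kernel memory; user space can't change it */
    if (copy_from_user(kbuf, ucmd->data, hdr.size)) {
        kfree(kbuf);
        return -EFAULT;
    }

    /* ... use kbuf ... */
    kfree(kbuf);
    return 0;
}
```

This pattern scales better as structs grow: one validated snapshot of the header, instead of one racy fetch per field.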
Size and count parameters can cause integer overflows that bypass bounds checks. The kernel uses careful arithmetic and dedicated overflow-safe functions:
Classic overflow attack:
```c
/* Integer overflow protection mechanisms */

/* VULNERABLE: Simple addition overflows */
ssize_t vuln_copy(void __user *buf, size_t offset, size_t count)
{
    /* This check is bypassed by overflow! */
    if (offset + count > buffer_size) /* WRONG */
        return -EINVAL;

    /* Attacker: offset = SIZE_MAX - 10, count = 100
     * offset + count wraps around to 89
     * 89 < buffer_size - the check passes!
     * But the actual access is at offset SIZE_MAX - 10
     */
}

/* CORRECT: Check for overflow explicitly */
ssize_t safe_copy(void __user *buf, size_t offset, size_t count)
{
    /* Check for addition overflow */
    if (count > SIZE_MAX - offset) /* Can't overflow */
        return -EINVAL;

    /* Now safe to add */
    if (offset + count > buffer_size)
        return -EINVAL;

    /* Proceed... */
}

/* Using compiler builtins (modern approach) */
ssize_t modern_copy(void __user *buf, size_t offset, size_t count)
{
    size_t end;

    /* __builtin_add_overflow returns true on overflow */
    if (__builtin_add_overflow(offset, count, &end))
        return -EOVERFLOW;

    if (end > buffer_size)
        return -EINVAL;

    /* Safe to proceed */
}

/* Kernel helper macros */
#include <linux/overflow.h>

/* check_add_overflow(a, b, d)  - true if a+b overflows, stores in *d */
/* check_mul_overflow(a, b, d)  - true if a*b overflows */
/* array_size(a, b)             - returns a*b, or SIZE_MAX on overflow */
/* struct_size(ptr, member, n)  - size of struct with n array elements */

/* Example: Allocating an array with overflow protection */
struct my_struct {
    int header;
    char data[];
};

struct my_struct *alloc_struct(size_t num_elements)
{
    struct my_struct *p;
    size_t size;

    /* Safe calculation of total size */
    size = struct_size(p, data, num_elements);
    if (size == SIZE_MAX)
        return NULL; /* Overflow detected */

    return kmalloc(size, GFP_KERNEL);
}
```

When allocating structures with flexible array members, always use struct_size(). It correctly calculates the size including the array and returns SIZE_MAX if the multiplication overflows. This single function prevents an entire class of vulnerabilities.
| Function | Operation | Overflow Behavior |
|---|---|---|
check_add_overflow(a, b, &d) | d = a + b | Returns true if overflow |
check_sub_overflow(a, b, &d) | d = a - b | Returns true if underflow |
check_mul_overflow(a, b, &d) | d = a * b | Returns true if overflow |
array_size(n, size) | n * size | Returns SIZE_MAX on overflow |
array3_size(a, b, c) | a * b * c | Returns SIZE_MAX on overflow |
struct_size(p, member, n) | sizeof(p) + nsizeof(member[0]) | Returns SIZE_MAX on overflow |
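The wraparound from the vulnerable example above is easy to reproduce in ordinary user-space C, using the same __builtin_add_overflow() that the kernel's check_add_overflow() wraps:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    size_t buffer_size = 4096;
    size_t offset = SIZE_MAX - 10; /* attacker-controlled */
    size_t count  = 100;           /* attacker-controlled */
    size_t end;

    /* Naive check: offset + count wraps around to 89, so it "passes" */
    printf("offset + count = %zu\n", offset + count);
    printf("naive check passes: %d\n", offset + count <= buffer_size);

    /* Overflow-aware check: the wrap is detected before any compare */
    if (__builtin_add_overflow(offset, count, &end))
        printf("overflow detected, request rejected\n");
    else
        printf("end = %zu, in bounds: %d\n", end, end <= buffer_size);
    return 0;
}
```

On a 64-bit system this prints offset + count = 89 and "naive check passes: 1", then rejects the same request once the addition is overflow-checked—exactly the difference between vuln_copy() and modern_copy() above.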
We've explored the critical security barrier between user space and the kernel—the parameter validation layer that protects the system from malicious or buggy input. Let's consolidate the key concepts:

- access_ok() verifies only that an address lies in the user range—nothing about mapping or permissions
- All data crosses the boundary through copy_from_user()/copy_to_user() or get_user()/put_user(), which survive page faults via the exception table
- Strings need bounded, fault-safe helpers such as strncpy_from_user()
- Every user-supplied value is fetched into kernel memory exactly once; only the kernel copy is trusted afterward
- Size arithmetic goes through overflow-checked helpers like check_add_overflow() and struct_size()
What's next:
Even with perfect parameter validation, syscalls can fail. The kernel must communicate errors to user space in a consistent, informative way. The final page in this module explores Error Handling—how the kernel signals errors, how errno propagates through wrappers, and how to debug syscall failures.
You now understand the parameter validation layer—the kernel's immune system against malicious input. This knowledge is essential for kernel development, security auditing, and understanding CVE reports about syscall vulnerabilities. Next, we'll examine how errors flow from kernel to user space.