System Call Implementation - Learning Module

Kernel Handler

Dispatching to the Right Handler

After the switch from user mode to kernel mode completes, the kernel is ready to act. The CPU is in Ring 0, the kernel stack holds the user's saved state in pt_regs, and the syscall number sits in a register. But the kernel supports hundreds of different system calls, each requiring different parameters and performing different operations.

How does the kernel route syscall #0 to the read() implementation, #1 to write(), #2 to open(), and so on? The answer is the kernel handler layer—a sophisticated dispatch mechanism that translates syscall numbers into function calls.

This page explores the complete flow from syscall number to handler execution, including the syscall table architecture, dispatch routines, function pointer invocation, and how the kernel maintains this mapping across hundreds of system calls and multiple architectures.

What You Will Learn

By the end of this page, you will understand how the kernel's syscall dispatch works—from the C entry point receiving pt_regs through the syscall table lookup to individual handler execution. You'll know how syscall tables are generated, how the kernel validates syscall numbers, and how handler functions access their arguments.

The Syscall Table

At the heart of syscall dispatch is a simple data structure: an array of function pointers, indexed by syscall number. This is the syscall table (or system call table).

The fundamental idea:

handler = syscall_table[syscall_number];
result = handler(arguments...);

The syscall number (in RAX) becomes an array index. The table lookup retrieves a function pointer. The kernel calls that function. Done.

Reality adds complexity:

Different architectures have different syscall tables
32-bit and 64-bit processes use different tables on x86-64
Syscalls have different numbers of arguments (0 to 6)
The table must be protected from modification
Invalid syscall numbers must be handled gracefully
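
The lookup-and-call idea, including graceful handling of invalid numbers, can be sketched as a small user-space model. Everything prefixed with toy_ or TOY_ is a hypothetical name invented for this illustration, and struct toy_regs is only a stand-in for pt_regs; it is not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for pt_regs: only the fields this toy needs. */
struct toy_regs {
    unsigned long di;   /* first argument  */
    unsigned long si;   /* second argument */
};

typedef long (*toy_handler_t)(const struct toy_regs *);

/* Default handler for numbers that have no implementation. */
static long toy_ni_syscall(const struct toy_regs *regs)
{
    (void)regs;
    return -ENOSYS;
}

static long toy_add(const struct toy_regs *regs)
{
    return (long)(regs->di + regs->si);
}

static long toy_larger(const struct toy_regs *regs)
{
    return (long)(regs->di > regs->si ? regs->di : regs->si);
}

enum { TOY_NR_ADD = 0, TOY_NR_LARGER = 1, TOY_NR_LIMIT = 8 };

/* Slots that are not listed stay NULL and count as "not implemented". */
static const toy_handler_t toy_table[TOY_NR_LIMIT] = {
    [TOY_NR_ADD]    = toy_add,
    [TOY_NR_LARGER] = toy_larger,
};

long toy_dispatch(long nr, const struct toy_regs *regs)
{
    /* Compare as unsigned so that negative numbers are out of range too. */
    unsigned long unr = (unsigned long)nr;

    assert(regs != NULL);
    if (unr >= TOY_NR_LIMIT || toy_table[unr] == NULL)
        return toy_ni_syscall(regs);
    return toy_table[unr](regs);
}
```

Calling toy_dispatch(TOY_NR_ADD, &regs) with di = 2 and si = 3 yields 5, while an unassigned or out-of-range number yields -ENOSYS. The real kernel avoids the NULL check by pre-filling every slot with the not-implemented stub.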

syscall_table.c
C
/* Linux kernel: arch/x86/entry/syscall_64.c */
 
/* Type for syscall handler functions */
typedef long (*sys_call_ptr_t)(const struct pt_regs *);
 
/* The syscall table: array of function pointers */
const sys_call_ptr_t sys_call_table[__NR_syscall_max + 1] = {
    /* 
     * This array is populated by including a generated file.
     * Each entry maps a syscall number to its handler.
     */
    
    [0 ... __NR_syscall_max] = &__x64_sys_ni_syscall, /* Default: not implemented */
    
#include <asm/syscalls_64.h>  /* Generated: populates with real handlers */
};
 
/* After macro expansion, the generated syscalls_64.h yields entries like:
 * 
 * [0] = __x64_sys_read,      // read()
 * [1] = __x64_sys_write,     // write()
 * [2] = __x64_sys_open,      // open()
 * [3] = __x64_sys_close,     // close()
 * [4] = __x64_sys_newstat,   // stat()
 * ... hundreds more ...
 * [334] = __x64_sys_rseq,    // rseq()
 */
 
/* Not-implemented stub for undefined syscall numbers */
asmlinkage long __x64_sys_ni_syscall(const struct pt_regs *regs)
{
    return -ENOSYS;  /* Function not implemented */
}

Table generation pipeline:

The syscall table isn't hardcoded—it's generated during kernel build from a declarative specification:

Syscall Table Generation Pipeline
File	Purpose	Example Content
`syscall_64.tbl`	Human-readable syscall definitions	`0 common read sys_read`
`syscalltbl.sh`	Script to process .tbl files	Parses table, generates headers
`syscalls_64.h`	Generated header with table entries	`[0] = __x64_sys_read,`
`syscall_64.c`	Includes generated header, defines table	`const sys_call_ptr_t sys_call_table[]`

syscall_64.tbl

Text

# Linux kernel: arch/x86/entry/syscalls/syscall_64.tbl
# Format: <number> <abi> <name> <entry point> [<compat entry point>]
#
# - number: syscall number (assigned sequentially, never reused)
# - abi: "common" (both native 64-bit and x32), "64" (native 64-bit only), "x32" (x32 ABI only)
# - name: symbolic name used for macros
# - entry point: kernel function implementing this syscall
 
# Core file operations
0       common  read                    sys_read
1       common  write                   sys_write
2       common  open                    sys_open
3       common  close                   sys_close
4       common  stat                    sys_newstat
5       common  fstat                   sys_newfstat
6       common  lstat                   sys_newlstat
7       common  poll                    sys_poll
 
# Memory mapping
9       common  mmap                    sys_mmap
10      common  mprotect                sys_mprotect
11      common  munmap                  sys_munmap
12      common  brk                     sys_brk
 
# Process control
56      common  clone                   sys_clone
57      common  fork                    sys_fork
58      common  vfork                   sys_vfork
59      common  execve                  sys_execve
60      common  exit                    sys_exit
61      common  wait4                   sys_wait4
62      common  kill                    sys_kill
 
# ... continues for 300+ syscalls ...
 
# Recent additions (Linux 5.x)
434     common  pidfd_open              sys_pidfd_open
435     common  clone3                  sys_clone3
439     common  faccessat2              sys_faccessat2
448     common  process_mrelease        sys_process_mrelease
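
The transformation from a .tbl line to a table entry can be illustrated with a short sketch. The kernel build does this with a shell script, so the C function below is only a model of the idea; tbl_line_to_entry is a hypothetical name, and the output format follows the `[0] = __x64_sys_read,` example shown in the pipeline table.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Turn one .tbl line into a table initializer such as "[0] = __x64_sys_read,".
 * Returns 0 on success, -1 for blank lines, comments, or malformed input.
 */
int tbl_line_to_entry(const char *line, char *out, size_t outlen)
{
    unsigned int nr;
    char abi[16], name[64], entry[64];
    int n;

    assert(line != NULL && out != NULL);

    /* Skip leading whitespace, then ignore blank lines and comments. */
    while (isspace((unsigned char)*line))
        line++;
    if (*line == '\0' || *line == '#')
        return -1;

    /* Fields: <number> <abi> <name> <entry point> */
    if (sscanf(line, "%u %15s %63s %63s", &nr, abi, name, entry) != 4)
        return -1;
    if (strncmp(entry, "sys_", 4) != 0)
        return -1;

    /* sys_read becomes __x64_sys_read, the 64-bit wrapper name. */
    n = snprintf(out, outlen, "[%u] = __x64_%s,", nr, entry);
    return (n > 0 && (size_t)n < outlen) ? 0 : -1;
}
```

Feeding it the first line of the listing above produces `[0] = __x64_sys_read,`; comment lines are rejected with -1 so a caller can simply skip them.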

Syscall Numbers Are Stable ABI

Once a syscall number is assigned, it NEVER changes. This is part of the kernel's stable ABI promise. On x86-64, syscall #0 has been read() since the architecture was first supported and will remain so. Numbers are assigned per architecture (on 32-bit x86, read() is syscall #3). New syscalls get new numbers. Old syscalls may be deprecated but their numbers are never reassigned.
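
Because the numbers are stable, user-space code can invoke a syscall by number. The sketch below assumes Linux with glibc (or a compatible libc) that provides syscall() and the SYS_* constants from <sys/syscall.h>; raw_getpid and read_syscall_number are hypothetical helper names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* exposes the syscall() declaration in glibc */
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Invoke getpid by number, bypassing the dedicated libc wrapper. */
pid_t raw_getpid(void)
{
    long ret = syscall(SYS_getpid);

    assert(ret > 0);    /* getpid cannot fail */
    return (pid_t)ret;
}

/* Return the number the headers assign to read() on this architecture. */
long read_syscall_number(void)
{
    return SYS_read;
}
```

On an x86-64 build, read_syscall_number() returns 0; on other architectures the value differs, which is why programs should use the SYS_* constants instead of literal numbers.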

The Dispatch Function

The assembly entry code (entry_SYSCALL_64) calls a C function to perform the actual dispatch. This function, do_syscall_64(), is the heart of syscall handling:

Extract the syscall number from pt_regs
Validate the number is in range
Look up the handler in the syscall table
Call the handler with pt_regs as argument
Store the return value for the return path

do_syscall_64.c
C
/* Linux kernel: arch/x86/entry/common.c (simplified) */
 
/* Called from do_syscall_64 to execute the actual syscall */
static __always_inline bool do_syscall_x64(struct pt_regs *regs, int nr)
{
    /*
     * Treat nr as unsigned so that negative numbers become very large
     * and fail the range check. NR_syscalls is one more than the
     * highest valid syscall number.
     */
    unsigned int unr = nr;
    
    if (likely(unr < NR_syscalls)) {
        /* Clamp the index under speculation (Spectre v1 hardening) */
        unr = array_index_nospec(unr, NR_syscalls);
        
        /*
         * Actually call the handler!
         * sys_call_table[unr] is a function pointer.
         * The function takes pt_regs* and returns long.
         */
        regs->ax = sys_call_table[unr](regs);
        return true;
    }
    
    return false;  /* Number is out of range for the 64-bit table */
}
 
/* Main entry point for 64-bit syscalls */
__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
{
    /* nr = syscall number from RAX (passed by assembly caller) */
    
    /* Randomize the kernel stack offset for this syscall (security hardening) */
    add_random_kstack_offset();
    
    /* Entry work: ptrace, seccomp, audit. May change nr, or return -1 to skip. */
    nr = syscall_enter_from_user_mode(regs, nr);
    
    if (!do_syscall_x64(regs, nr) && nr != -1) {
        /* Invalid syscall number: report "not implemented" */
        regs->ax = __x64_sys_ni_syscall(regs);
    }
    
    /* Exit work: pending signals, rescheduling, tracing */
    syscall_exit_to_user_mode(regs);
}
 
/*
 * The return value ends up in regs->ax.
 * When we return to user space (via sysret or iret),
 * the assembly code will:
 * 1. Restore registers from pt_regs
 * 2. This includes RAX from regs->ax
 * 3. User sees the return value in RAX
 */

Security hooks in the dispatch path:

The dispatch function isn't a straight-line path. The kernel checks several security mechanisms before executing the syscall:

Security Checks in Dispatch

•seccomp (Secure Computing) — If the task has a seccomp filter installed, the syscall number and arguments are checked against the filter rules. The filter can allow, deny, log, or trap the syscall.
•ptrace syscall tracing — If the task is being traced by a debugger, the kernel notifies the tracer before and after syscall execution. The tracer can inspect/modify arguments and return values.
•audit system — If Linux auditing is enabled, syscalls may be logged to the audit log with their arguments and return values.
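•SELinux/AppArmor — Linux Security Modules impose additional access checks beyond standard permissions. Their hooks run inside the individual handlers (for example, when a file is opened) rather than in the common entry path.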

syscall_enter_work.c
C
/* Simplified syscall entry work, modeled on kernel/entry/common.c */
 
/*
 * Reached from syscall_enter_from_user_mode() when the task has entry
 * work pending (work & SYSCALL_WORK_ENTRY), where work was read with
 * READ_ONCE(current_thread_info()->syscall_work).
 * Returns the syscall number to run, or -1 to skip the syscall.
 */
static long syscall_trace_enter(struct pt_regs *regs, long syscall,
                                unsigned long work)
{
    unsigned long args[6];
    long ret = 0;
    
    /* Handle ptrace syscall-enter-stop first */
    if (work & SYSCALL_WORK_SYSCALL_TRACE) {
        /* Named ptrace_report_syscall_entry() since Linux 5.18 */
        ret = tracehook_report_syscall_entry(regs);
        if (ret) {
            /* Tracer said to skip this syscall */
            return -1L;
        }
    }
    
    /* Run seccomp after ptrace, so filters see any tracer changes */
    if (work & SYSCALL_WORK_SECCOMP) {
        ret = __secure_computing(NULL);
        if (ret == -1L) {
            /* Seccomp denied this syscall */
            return ret;
        }
    }
    
    /* A tracer or SECCOMP_RET_TRACE may have changed the number: reload it */
    syscall = syscall_get_nr(current, regs);
    
    /* Audit the syscall entry with its first four arguments */
    if (work & SYSCALL_WORK_SYSCALL_AUDIT) {
        syscall_get_arguments(current, regs, args);
        audit_syscall_entry(syscall, args[0], args[1], args[2], args[3]);
    }
    
    return syscall;  /* Proceed with this syscall number */
}

likely() and unlikely() Macros

The likely() wrapper around the range check hints to the compiler that the condition is almost always true, so it can lay out the code with the valid-number case as the straight-line path. Invalid syscall numbers are rare; the hot path should be optimized for valid numbers.
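
As a sketch of how these hints are built: on GCC and Clang, likely() and unlikely() are thin wrappers around the __builtin_expect builtin, and the two macro definitions below follow the kernel's usual form. TOY_NR_SYSCALLS and the toy_ functions are hypothetical names for this illustration.

```c
#include <assert.h>

/* GCC/Clang builtin: the second argument is the expected truth value. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

#define TOY_NR_SYSCALLS 8u   /* hypothetical table size */

/* Returns 1 when nr indexes the table, 0 otherwise. */
int toy_nr_is_valid(int nr)
{
    unsigned int unr = (unsigned int)nr;   /* negative values become huge */

    if (likely(unr < TOY_NR_SYSCALLS))
        return 1;       /* hot path: expected to be the common case */
    return 0;           /* cold path */
}

/* Index helper for callers that have already validated nr. */
unsigned int toy_table_index(int nr)
{
    assert(toy_nr_is_valid(nr));
    return (unsigned int)nr;
}
```

The hint never changes the result of the comparison; it only influences how the compiler arranges the generated code.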

Handler Function Signatures

Syscall handlers in modern Linux follow a specific calling convention. Understanding this convention is essential for reading kernel code and implementing new syscalls.

The modern approach: pt_regs-based handlers

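Since Linux 4.17, x86-64 syscall handlers receive a single argument: a pointer to pt_regs. Previously, all six argument registers were passed to every handler. With the pt_regs convention, each handler reads only the registers it actually uses, so unused, user-controlled register values are not carried deeper into the kernel, which reduces the attack surface for speculative-execution exploits.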

syscall_handler_convention.c
C
/* Modern syscall handler signature (Linux 4.17+) */
 
/* The handler receives pt_regs and extracts arguments from it */
asmlinkage long __x64_sys_read(const struct pt_regs *regs)
{
    /* Extract arguments from pt_regs, cast to the declared types */
    unsigned int fd = (unsigned int)regs->di;     /* First argument: rdi */
    char __user *buf = (char __user *)regs->si;   /* Second: rsi */
    size_t count = regs->dx;                      /* Third: rdx */
    
    /* Call the common implementation */
    return ksys_read(fd, buf, count);
}
 
/* Illustrative helper macros for extracting arguments (not kernel names) */
#define SC_ARG0(regs) ((regs)->di)    /* Arg 1 */
#define SC_ARG1(regs) ((regs)->si)    /* Arg 2 */
#define SC_ARG2(regs) ((regs)->dx)    /* Arg 3 */
#define SC_ARG3(regs) ((regs)->r10)   /* Arg 4 - note: r10, not rcx! */
#define SC_ARG4(regs) ((regs)->r8)    /* Arg 5 */
#define SC_ARG5(regs) ((regs)->r9)    /* Arg 6 */
 
/* The common implementation does the real work (simplified) */
ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
{
    struct fd f = fdget_pos(fd);
    ssize_t ret = -EBADF;
    loff_t pos;
    
    if (!f.file)
        return ret;
    
    /* Read at the current file position, then store the new position */
    pos = f.file->f_pos;
    ret = vfs_read(f.file, buf, count, &pos);
    if (ret >= 0)
        f.file->f_pos = pos;
    
    fdput_pos(f);
    return ret;
}

The SYSCALL_DEFINE macros:

Writing the argument extraction code manually is tedious and error-prone. Linux provides macros that generate the boilerplate:

syscall_define_macros.c
C
/* Linux kernel: include/linux/syscalls.h */
 
/* SYSCALL_DEFINE3: Define a syscall with 3 arguments
 * The number suffix indicates argument count (0-6)
 */
#define SYSCALL_DEFINE3(name, ...) \
    SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
 
/* On x86-64 this generates:
 * 1. __x64_sys_<name>(const struct pt_regs *), the function stored in the table
 * 2. __se_sys_<name>(), which casts each raw register value to its declared type
 * 3. Static inline __do_sys_<name>(t1 a1, t2 a2, t3 a3), which holds the body you write
 */
 
/* Example: Defining the read() syscall */
/* fs/read_write.c */
 
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    /* This becomes the body of __do_sys_read(fd, buf, count) */
    return ksys_read(fd, buf, count);
}
 
/* The macro expands to something like: */
static inline long __do_sys_read(unsigned int fd, 
                                  char __user *buf,
                                  size_t count);
 
asmlinkage long __x64_sys_read(const struct pt_regs *regs)
{
    return __do_sys_read(
        (unsigned int)SC_ARG0(regs),                /* fd from rdi */
        (char __user *)SC_ARG1(regs),               /* buf from rsi */
        (size_t)SC_ARG2(regs)                       /* count from rdx */
    );
}
 
static inline long __do_sys_read(unsigned int fd,
                                  char __user *buf,
                                  size_t count)
{
    return ksys_read(fd, buf, count);
}
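
The same pattern can be reproduced in user space to see how one macro yields both the register-unpacking wrapper and the typed body. This is only a sketch: TOY_SYSCALL_DEFINE3, struct toy_regs, and the clamp "syscall" are hypothetical names, and the casts assume a platform where long and unsigned long have the same width.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for pt_regs with the first three argument registers. */
struct toy_regs {
    unsigned long di, si, dx;
};

/*
 * Declares the typed body, emits a wrapper that unpacks the registers and
 * casts them, then opens the definition of the typed body.
 */
#define TOY_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3)                 \
    static long toy_do_sys_##name(t1 a1, t2 a2, t3 a3);                    \
    long toy_sys_##name(const struct toy_regs *regs)                       \
    {                                                                      \
        assert(regs != NULL);                                              \
        return toy_do_sys_##name((t1)regs->di, (t2)regs->si,               \
                                 (t3)regs->dx);                            \
    }                                                                      \
    static long toy_do_sys_##name(t1 a1, t2 a2, t3 a3)

/* The braces below become the body of toy_do_sys_clamp(value, lo, hi). */
TOY_SYSCALL_DEFINE3(clamp, long, value, long, lo, long, hi)
{
    if (value < lo)
        return lo;
    if (value > hi)
        return hi;
    return value;
}
```

A dispatcher only ever sees toy_sys_clamp(const struct toy_regs *), so every handler has the same signature regardless of how many typed arguments its body takes.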

Why __user Annotation?

The __user annotation marks pointers that point to user space memory. This enables sparse (a static analysis tool) to catch bugs where kernel code dereferences user pointers directly instead of using copy_from_user()/copy_to_user(). Direct access to __user pointers is a security vulnerability.

SYSCALL_DEFINE Macro Variants
Macro	Arguments	Use Case
`SYSCALL_DEFINE0(name)`	0	`getpid()`, `getuid()`, `fork()`
`SYSCALL_DEFINE1(name, t1, a1)`	1	`close(fd)`, `exit(status)`
`SYSCALL_DEFINE2(name, t1, a1, t2, a2)`	2	`creat()`, `access()`
`SYSCALL_DEFINE3(name, ...)`	3	`read()`, `write()`, `open()`
`SYSCALL_DEFINE4(name, ...)`	4	`ptrace()`, `reboot()`
`SYSCALL_DEFINE5(name, ...)`	5	`select()`, `mount()`
`SYSCALL_DEFINE6(name, ...)`	6	`mmap()`, `futex()`

Handler Execution

Once the handler function is called, it executes like any other kernel function—with full Ring 0 privileges. However, syscall handlers have specific patterns and constraints:

What handlers can do:

Access all kernel data structures
Read/modify the calling task's state
Interact with hardware (via driver interfaces)
Block waiting for I/O or events
Call other kernel subsystems (VFS, networking, memory, etc.)

What handlers must be careful about:

Syscall Handler Constraints

•Never trust user pointers — All pointers from user space must be validated. Use copy_from_user()/copy_to_user(), never direct dereference.
•Check all arguments — Users can pass any values. Every argument must be validated before use.
•Handle blocking carefully — If the handler blocks, the task's state must be safely interruptible. Signal handling may interrupt blocking calls.
•Don't hold locks too long — Long critical sections hurt latency and scalability. Never sleep while holding a spinlock, and take locks in a consistent order to avoid deadlock.
•Check permissions — The calling process may lack permission for the requested operation. Check file permissions, capabilities, etc.
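
The first two constraints can be illustrated with a toy model of a checked copy. Here the "user address space" is just a byte array and addresses are offsets into it; all toy_ names are hypothetical. The real copy_from_user() differs in important ways: it returns the number of bytes that could not be copied and relies on hardware fault handling, so treat this only as a sketch of the overflow-safe range check.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_USER_SIZE 4096u

/* Hypothetical "user address space": offsets 0..TOY_USER_SIZE-1 are valid. */
static unsigned char toy_user_mem[TOY_USER_SIZE];

/* Range check written so that user_off + len can never overflow. */
static int toy_access_ok(size_t user_off, size_t len)
{
    return len <= TOY_USER_SIZE && user_off <= TOY_USER_SIZE - len;
}

/* Copy len bytes out of the toy user region; 0 on success, -EFAULT otherwise. */
int toy_copy_from_user(void *dst, size_t user_off, size_t len)
{
    assert(dst != NULL);
    if (!toy_access_ok(user_off, len))
        return -EFAULT;
    memcpy(dst, toy_user_mem + user_off, len);
    return 0;
}

/* Copy len bytes into the toy user region; 0 on success, -EFAULT otherwise. */
int toy_copy_to_user(size_t user_off, const void *src, size_t len)
{
    assert(src != NULL);
    if (!toy_access_ok(user_off, len))
        return -EFAULT;
    memcpy(toy_user_mem + user_off, src, len);
    return 0;
}
```

Note that the check compares user_off against TOY_USER_SIZE - len instead of computing user_off + len, so a huge offset cannot wrap around and slip past the test.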

sys_read_implementation.c
C
/* Tracing through sys_read: from syscall to disk */
 
/* Step 1: Entry wrapper (generated by SYSCALL_DEFINE3) */
SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    return ksys_read(fd, buf, count);
}
 
/* Step 2: Common kernel helper */
ssize_t ksys_read(unsigned int fd, char __user *buf, size_t count)
{
    struct fd f = fdget_pos(fd);  /* Look up file descriptor */
    ssize_t ret = -EBADF;
    
    if (f.file) {
        loff_t pos, *ppos = file_ppos(f.file);
        if (ppos) {
            pos = *ppos;
            ppos = &pos;
        }
        ret = vfs_read(f.file, buf, count, ppos);  /* Call VFS layer */
        if (ret >= 0 && ppos)
            f.file->f_pos = pos;
        fdput_pos(f);
    }
    return ret;
}
 
/* Step 3: VFS (Virtual File System) read */
ssize_t vfs_read(struct file *file, char __user *buf, 
                 size_t count, loff_t *pos)
{
    ssize_t ret;
    
    /* Validate the file allows reading */
    if (!(file->f_mode & FMODE_READ))
        return -EBADF;
    if (!(file->f_mode & FMODE_CAN_READ))
        return -EINVAL;
    
    /* Check access_ok() for buffer */
    if (!access_ok(buf, count))
        return -EFAULT;
    
    /* Call file-specific read operation */
    if (file->f_op->read)
        ret = file->f_op->read(file, buf, count, pos);
    else if (file->f_op->read_iter)
        ret = new_sync_read(file, buf, count, pos);
    else
        ret = -EINVAL;
    
    return ret;
}
 
/* Step 4: Filesystem-specific read (e.g., ext4) */
/* This eventually calls the block layer and disk driver */

The VFS Abstraction

Notice how sys_read() doesn't know about ext4, NFS, or procfs. It calls vfs_read(), which uses the file->f_op function pointer table to call the right filesystem's read implementation. This abstraction allows one syscall to work with hundreds of filesystem types.
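
A minimal user-space sketch of this function-pointer dispatch follows. The struct names mirror the kernel's file and file_operations only loosely, and every toy_ identifier is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Per-"filesystem" operations, like a tiny struct file_operations. */
struct toy_file_ops {
    long (*read)(struct toy_file *file, char *buf, size_t count);
};

struct toy_file {
    const struct toy_file_ops *f_op;
    const char *data;   /* backing bytes for the in-memory "filesystem" */
    size_t size;
    size_t pos;
};

/* "Filesystem" 1: reads from a memory buffer and advances the position. */
static long toy_mem_read(struct toy_file *file, char *buf, size_t count)
{
    size_t left = file->size - file->pos;

    if (count > left)
        count = left;
    memcpy(buf, file->data + file->pos, count);
    file->pos += count;
    return (long)count;
}

/* "Filesystem" 2: an endless stream of zero bytes, similar to /dev/zero. */
static long toy_zero_read(struct toy_file *file, char *buf, size_t count)
{
    (void)file;
    memset(buf, 0, count);
    return (long)count;
}

const struct toy_file_ops toy_mem_ops  = { .read = toy_mem_read };
const struct toy_file_ops toy_zero_ops = { .read = toy_zero_read };

/* The generic layer: knows nothing about which implementation it calls. */
long toy_vfs_read(struct toy_file *file, char *buf, size_t count)
{
    assert(file != NULL && file->f_op != NULL);
    if (!file->f_op->read)
        return -EINVAL;
    return file->f_op->read(file, buf, count);
}
```

The caller uses toy_vfs_read() for both kinds of file; which implementation runs is decided entirely by the f_op pointer stored in the file object.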

Handler call depth:

A syscall handler may call many kernel functions before completing. A typical read() on an ext4 file might traverse:

__x64_sys_read → argument extraction
ksys_read → fd lookup
vfs_read → VFS layer dispatch
ext4_file_read_iter → filesystem handling
generic_file_buffered_read → page cache
ext4_readpage → on a cache miss, maps the page to disk blocks
submit_bio → block I/O submission
scsi_queue_rq or nvme_queue_rq → device driver (SCSI or NVMe, depending on the disk; only one of the two runs)
Hardware DMA → actual disk read

...then the return path unwinds all of this back to user space.

Return Value Semantics

Syscall handlers return a value that eventually reaches user space in the RAX register. The kernel uses a consistent convention:

Return value interpretation:

Negative values (-1 to -4095): Error code (negated errno)
Non-negative values: Success; meaning depends on the syscall

Different syscalls, different success values:

Syscall Return Value Examples
Syscall	Success Return Value	Interpretation
`read()`	0 to n	Number of bytes read (0 = EOF)
`write()`	0 to n	Number of bytes written (may be fewer than requested)
`open()`	≥ 0	New file descriptor
`close()`	0	Success (no meaningful value)
`fork()`	PID or 0	Child PID to parent, 0 to child
`getpid()`	> 0	Process ID (never fails)
`mmap()`	Address	Pointer to mapped region
`brk()`	Address	New program break address

return_value_flow.c
C
/* How the return value flows from handler to user */
 
/* 1. Handler returns a long value */
SYSCALL_DEFINE3(read, ...)
{
    /* ... */
    if (error)
        return -EFAULT;  /* Returns -14 (negative errno) */
    return bytes_read;   /* Returns positive count */
}
 
/* 2. Dispatch stores result in pt_regs->ax */
static bool do_syscall_x64(struct pt_regs *regs, int nr)
{
    /*            
     * sys_call_table[nr](regs) returns the handler's result
     * This is stored in regs->ax
     */
    regs->ax = sys_call_table[nr](regs);
    return true;
}
 
/* 3. Assembly exit path restores RAX from pt_regs->ax */
/*
 * movq    OFFSET_AX(%rsp), %rax    ; Load saved ax
 * ... 
 * sysretq                           ; Return to user
 */
 
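/* 4. User-space wrapper interprets RAX value */
/*
 * raw_syscall3() is a hypothetical helper that executes the syscall
 * instruction and returns RAX unmodified. (glibc's syscall() function
 * already performs the errno conversion shown here.)
 */
ssize_t read(int fd, void *buf, size_t count)
{
    long ret = raw_syscall3(__NR_read, fd, buf, count);
    
    /* RAX is now available via 'ret' */
    if (ret < 0 && ret > -4096) {
        errno = -ret;  /* Convert -14 to errno=14 */
        return -1;
    }
    return ret;
}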

Returning Pointers

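Some syscalls return pointers (mmap, brk). The error range [-4095, -1] corresponds to the top 4095 bytes of the address space, and the kernel never places a valid user mapping there, so an address can never be mistaken for an error code. On failure, the mmap syscall returns a negative errno; the glibc wrapper stores it in errno and returns MAP_FAILED, which is (void *)-1. The raw brk syscall is an exception: on failure it returns the unchanged program break rather than an error code.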

Common error codes:

The kernel defines over a hundred error codes in errno-base.h and errno.h under include/uapi/asm-generic/. The most frequently encountered:

Frequently Encountered errno Values
errno	Value	Meaning	Common Cause
EPERM	1	Operation not permitted	Lacks privilege/capability
ENOENT	2	No such file or directory	Path doesn't exist
ESRCH	3	No such process	PID doesn't exist
EINTR	4	Interrupted system call	Signal received during syscall
EIO	5	I/O error	Hardware or driver failure
EBADF	9	Bad file descriptor	fd not open or wrong mode
EAGAIN	11	Try again	Would block (non-blocking I/O)
ENOMEM	12	Out of memory	Allocation failed
EACCES	13	Permission denied	File permissions deny access
EFAULT	14	Bad address	Pointer outside address space
EINVAL	22	Invalid argument	Argument value is invalid
ENOSYS	38	Function not implemented	Syscall doesn't exist
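
Two of these codes are easy to provoke from user space. The sketch below assumes a POSIX system; the function names and the nonexistent path are made up for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns the errno produced by reading from a descriptor that is not open. */
int errno_from_bad_fd(void)
{
    char byte;
    ssize_t n = read(-1, &byte, 1);

    assert(n == -1);    /* the libc wrapper reports failure as -1 */
    return errno;
}

/* Returns the errno produced by opening a path that does not exist. */
int errno_from_missing_path(void)
{
    int fd = open("/nonexistent-dir-for-errno-demo/file", O_RDONLY);

    if (fd >= 0) {      /* unexpected: the path exists after all */
        close(fd);
        return 0;
    }
    return errno;
}
```

The first function yields EBADF and the second ENOENT: the kernel returned -9 and -2 in RAX, and the libc wrappers converted them into -1 plus a positive errno.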

x32 and Compatibility Mode

x86-64 Linux supports running 32-bit applications through compatibility mode. This introduces multiple syscall tables and dispatch paths:

Three ABIs on x86-64:

Native 64-bit — 64-bit programs using 64-bit syscalls
32-bit compatibility — 32-bit programs using 32-bit syscalls (via int 0x80, or sysenter/syscall executed in compatibility mode)
x32 — 32-bit pointers with 64-bit syscalls (hybrid; rarely used)

x86-64 Syscall ABIs Comparison
Aspect	Native 64-bit	32-bit Compat	x32
Entry instruction	`syscall`	`int $0x80`	`syscall`
Syscall number	64-bit table	32-bit table	64-bit + 0x40000000
Pointer size	64 bits	32 bits	32 bits
Register size	64 bits	32 bits (limited)	64 bits
Table	`sys_call_table`	`ia32_sys_call_table`	`x32_sys_call_table` (older kernels shared `sys_call_table`)
Entry point	`entry_SYSCALL_64`	`entry_INT80_compat`	`entry_SYSCALL_64`
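
The x32 marker bit from the table can be handled with two small helpers. The constant matches the 0x40000000 offset listed above (the kernel calls it __X32_SYSCALL_BIT); the toy_ names are hypothetical.

```c
#include <assert.h>

#define TOY_X32_SYSCALL_BIT 0x40000000u   /* marker bit for x32 syscall numbers */

/* Nonzero when the number carries the x32 marker bit. */
int toy_is_x32_nr(unsigned int nr)
{
    return (nr & TOY_X32_SYSCALL_BIT) != 0;
}

/* Strip the marker bit to obtain the index used for the table lookup. */
unsigned int toy_x32_index(unsigned int nr)
{
    assert(toy_is_x32_nr(nr));
    return nr - TOY_X32_SYSCALL_BIT;
}
```

Because native 64-bit and x32 programs both enter through the syscall instruction, this bit is what lets the dispatcher tell the two ABIs apart.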

compat_dispatch.c
C
/* Linux handles multiple ABIs with separate dispatch paths */
 
/* Native 64-bit: uses sys_call_table (range check omitted here) */
void do_syscall_64(struct pt_regs *regs, int nr)
{
    regs->ax = sys_call_table[nr](regs);
}
 
/* 32-bit compatibility: uses ia32_sys_call_table */
void do_int80_syscall_32(struct pt_regs *regs)
{
    /* 32-bit syscall number from eax; unsigned so negatives are out of range */
    unsigned int nr = regs->orig_ax;
    
    if (nr < IA32_NR_syscalls) {
        regs->ax = ia32_sys_call_table[nr](regs);
    } else {
        regs->ax = -ENOSYS;
    }
}
 
/* The 32-bit table has different handlers that handle 32-bit semantics */
const sys_call_ptr_t ia32_sys_call_table[] = {
    /* 32-bit syscall 0 is not read! It's restart_syscall */
    [0] = __ia32_sys_restart_syscall,
    [1] = __ia32_sys_exit,
    [2] = __ia32_sys_fork,
    [3] = __ia32_sys_read,   /* read() is syscall 3 in 32-bit! */
    [4] = __ia32_sys_write,
    /* ... different mapping from 64-bit ... */
};
 
/* Compat handlers may need to convert arguments */
asmlinkage long __ia32_compat_sys_truncate(const struct pt_regs *regs)
{
    /* 32-bit user passed a 32-bit pointer, we need to zero-extend */
    return ksys_truncate(compat_ptr(regs->bx), regs->cx);
}

Different Numbers, Same Functionality

The 32-bit and 64-bit syscall tables have different numbering. For example, read() is syscall 0 on 64-bit but syscall 3 on 32-bit (inherited from i386 Linux). This is why you can't just use 64-bit syscall numbers from 32-bit code and vice versa.

Argument conversion for compat:

32-bit programs have 32-bit pointers. When they pass pointers to syscalls, the kernel must:

Zero-extend 32-bit pointers and unsigned values to 64-bit (safe for all values 0-4GB), and sign-extend signed values such as offsets
Use compat_ptr() to convert 32-bit pointers to kernel pointers
Carefully handle struct layouts that differ between 32-bit and 64-bit

The compat_ prefixed functions handle these conversions throughout the kernel.
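
The three conversions listed above can be sketched with fixed-width types. The toy_ names and struct layouts are hypothetical, and the sign-extending cast assumes the usual two's-complement behavior of GCC and Clang; compat_ptr() in the kernel is conceptually the zero-extension shown first.

```c
#include <assert.h>
#include <stdint.h>

/* Zero-extend a 32-bit user pointer into a 64-bit address. */
uint64_t toy_compat_ptr(uint32_t uptr)
{
    return (uint64_t)uptr;
}

/* Sign-extend a 32-bit signed argument (for example an offset). */
int64_t toy_compat_long(uint32_t raw)
{
    return (int64_t)(int32_t)raw;
}

/* The same logical struct has a different layout under each ABI. */
struct toy_timeval32 { int32_t tv_sec; int32_t tv_usec; };
struct toy_timeval64 { int64_t tv_sec; int64_t tv_usec; };

static_assert(sizeof(struct toy_timeval32) != sizeof(struct toy_timeval64),
              "layouts differ, so the struct cannot be copied as raw bytes");

/* Convert field by field instead of copying bytes. */
struct toy_timeval64 toy_widen_timeval(struct toy_timeval32 in)
{
    struct toy_timeval64 out = { .tv_sec = in.tv_sec, .tv_usec = in.tv_usec };
    return out;
}
```

The pointer 0xfffff000 stays a positive 4 GB-range address after zero-extension, while the argument 0xffffffff becomes -1 after sign-extension; mixing the two up is a classic compat bug.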

Tracing and Debugging Syscalls

Understanding syscall dispatch is essential for debugging system-level issues. Several tools leverage this knowledge:

strace:

The strace utility uses ptrace(PTRACE_SYSCALL) to intercept every syscall a process makes:

strace_example.sh
Shell
# Trace all syscalls for a command
$ strace ls /tmp
execve("/usr/bin/ls", ["ls", "/tmp"], 0x7ffd... /* 50 vars */) = 0
brk(NULL)                               = 0x55ff0a240000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\002\001\001\003..."..., 832) = 832
# ... many more syscalls ...
write(1, "file1.txt  file2.txt\n", 21) = 21
close(1)                                = 0
exit_group(0)                           = ?
 
# Count syscall types
$ strace -c ls /tmp
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 26.45    0.000081          20         4           openat
 21.90    0.000067          16         4           mmap
 16.34    0.000050          12         4           close
  8.17    0.000025          12         2           read
# ...

Kernel tracepoints:

The kernel has built-in tracepoints for syscall entry/exit that can be used with ftrace or perf:

ftrace_syscalls.sh
Shell
# List available syscall tracepoints
$ ls /sys/kernel/debug/tracing/events/syscalls/
sys_enter_read   sys_exit_read
sys_enter_write  sys_exit_write
# ... one pair per syscall ...
 
# Trace all read() calls system-wide
$ echo 1 > /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/enable
$ cat /sys/kernel/debug/tracing/trace_pipe
    ls-12345 [000] .... 1234.567890: sys_read(fd: 3, buf: 7ffd..., count: 832)
    bash-54321 [001] .... 1234.567891: sys_read(fd: 0, buf: 7ffd..., count: 1)
# ...
 
# Use perf for detailed syscall analysis
$ perf trace ls /tmp
     0.000 (0.012 ms): execve(filename: "/usr/bin/ls", argv: 0x7ffd...) = 0
     0.089 (0.002 ms): brk(brk: 0) = 0x55a9...
     0.093 (0.003 ms): access(filename: "/etc/ld.so.preload", mode: R) = -1 ENOENT
# ...

Debugging with BPF

Modern kernels support BPF programs (BPF originally stood for Berkeley Packet Filter) that can attach to syscall entry/exit tracepoints with low overhead. Tools like bpftrace and bcc allow sophisticated analysis. For example, bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s opened %s\n", comm, str(args->filename)); }' prints every file opened system-wide (most modern programs open files through openat rather than open).

Syscall Debugging Tools Comparison
Tool	Overhead	Scope	Best For
strace	High (ptrace)	Single process	Quick debugging, seeing all args/results
ltrace	High (ptrace)	Single process	Library calls + syscalls
ftrace	Low	System-wide	Kernel development, global patterns
perf trace	Low-Medium	Flexible	Performance analysis with syscall context
bpftrace/bcc	Very Low	Flexible	Production tracing, complex queries

Summary: The Kernel Handler Layer

We've traced the complete path from syscall number to handler execution—the dispatch mechanism that makes all OS services accessible. Let's consolidate the key concepts:

Key Takeaways

•The syscall table is an array of function pointers — Simple indexing by syscall number retrieves the handler. Invalid numbers return -ENOSYS.
•Tables are generated from declarative specifications — The syscall_64.tbl file defines syscalls in human-readable format; build scripts generate the actual C table.
•do_syscall_64() dispatches to handlers — This C function validates the syscall number, performs security checks, and calls the handler via function pointer.
•SYSCALL_DEFINE macros generate boilerplate — Kernel developers define handlers with SYSCALL_DEFINE3(read, ...) which generates argument extraction code.
•Handlers extract arguments from pt_regs — Modern handlers receive pt_regs and extract arguments using architecture-specific offsets (rdi, rsi, rdx, r10, r8, r9).
•Return values use negative errno convention — [-4095, -1] indicates error; non-negative indicates success with syscall-specific meaning.
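•Multiple ABIs require multiple tables — Native 64-bit, 32-bit compat, and x32 use different syscall tables or numbering. The 32-bit compat ABI has its own entry points, while x32 enters through entry_SYSCALL_64 with a marker bit in the syscall number.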
•Rich tracing infrastructure exists — strace, ftrace, perf trace, and BPF tools enable detailed syscall analysis.

What's next:

The handler receives arguments through pt_regs, but those arguments often include pointers to user-space memory. How does the kernel safely read and write user memory? The next page explores Parameter Validation—the critical security checks that prevent malicious or buggy user code from corrupting kernel state.

Page Complete

You now understand the kernel handler layer—from the syscall table through dispatch to individual handler execution. This knowledge enables you to read kernel syscall code, understand strace output, and reason about syscall behavior. Next, we'll examine the critical security barrier between user pointers and kernel operations.