Every process you see running on a Linux system—from the init process (PID 1) to thousands of containerized microservices—was created through a single kernel code path. Whether triggered by fork(), vfork(), clone(), or pthread_create(), all process creation flows through the kernel's kernel_clone() function (historically known as do_fork()).
Understanding this mechanism reveals how Linux achieves efficient process creation through copy-on-write semantics, how threads share resources with their parent, and why fork() is surprisingly fast despite seemingly copying an entire address space.
By the end of this page, you will understand the complete process creation flow from system call to runnable task, the role of clone flags in resource sharing, copy-on-write optimization, and how the kernel allocates and initializes a new task_struct.
Linux provides several system calls for process creation, each with different semantics but sharing a common implementation:
| System Call | Purpose | Key Characteristics |
|---|---|---|
| fork() | Create child process | Full copy of parent (COW optimized), child gets new PID |
| vfork() | Create child for exec | Parent blocked until child exits/execs, shares address space |
| clone() | Flexible creation | Fine-grained control via flags, used for threads |
| clone3() | Modern clone | Extensible struct-based arguments, additional features |
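Before looking at the kernel side, a small userspace sketch can make the "common implementation" idea concrete: a raw clone() call with only an exit signal and no CLONE_* flags behaves like fork(). This is an illustration, not production code; it assumes the x86-64 raw-syscall argument order (which differs on some architectures) and bypasses glibc's fork() bookkeeping.

```c
/* Illustration only: fork() expressed as a raw clone() system call.
 * Assumes the x86-64 argument order (flags, stack, ptid, ctid, tls). */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* No sharing flags, exit_signal = SIGCHLD, stack = 0 means
     * "reuse the caller's stack" - the kernel COW-copies it. */
    long pid = syscall(SYS_clone, (unsigned long)SIGCHLD, 0UL, NULL, NULL, 0UL);

    if (pid == 0) {
        printf("child:  created by raw clone(), pid %d\n", getpid());
        _exit(0);
    }
    printf("parent: raw clone() returned %ld\n", pid);
    waitpid((pid_t)pid, NULL, 0);
    return 0;
}
```

The kernel-side wrappers below show why this works: every entry point just fills in a kernel_clone_args structure and hands it to kernel_clone().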
```c
/* All process creation funnels through kernel_clone() */

/* fork() - Traditional process creation */
SYSCALL_DEFINE0(fork)
{
    struct kernel_clone_args args = {
        .exit_signal = SIGCHLD,   /* Signal parent on exit */
    };

    return kernel_clone(&args);
}

/* vfork() - Optimized for the fork+exec pattern */
SYSCALL_DEFINE0(vfork)
{
    struct kernel_clone_args args = {
        .flags       = CLONE_VFORK | CLONE_VM,
        .exit_signal = SIGCHLD,
    };

    return kernel_clone(&args);
}

/* clone() - Full control via flags */
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
                int __user *, parent_tidptr,
                int __user *, child_tidptr,
                unsigned long, tls)
{
    struct kernel_clone_args args = {
        .flags       = (lower_32_bits(clone_flags) & ~CSIGNAL),
        .exit_signal = (lower_32_bits(clone_flags) & CSIGNAL),
        .stack       = newsp,
        .parent_tid  = parent_tidptr,
        .child_tid   = child_tidptr,
        .tls         = tls,
    };

    return kernel_clone(&args);
}
```

The power of Linux process creation lies in clone flags. Each flag controls whether a specific resource is shared with the parent or copied for the child. This is how Linux implements both processes (mostly copied) and threads (mostly shared).
| Flag | When Set: SHARE | When Clear: COPY |
|---|---|---|
| CLONE_VM | Share memory space (mm_struct) | Copy address space (COW) |
| CLONE_FS | Share filesystem info (pwd, root) | Copy filesystem context |
| CLONE_FILES | Share file descriptor table | Copy open file descriptors |
| CLONE_SIGHAND | Share signal handlers | Copy signal handlers |
| CLONE_THREAD | Same thread group (TGID) | New thread group |
| CLONE_PARENT | Share parent with caller | Caller becomes parent |
| CLONE_NEWPID | New PID namespace | Inherit PID namespace |
| CLONE_NEWNS | New mount namespace | Inherit mount namespace |
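The share-vs-copy semantics in the table can be observed from userspace with the glibc clone() wrapper. The sketch below is illustrative only: it assumes descriptor 3 is the lowest free fd and uses an arbitrary 1 MiB child stack.

```c
/* Sketch: CLONE_FILES shares the fd table; omitting it copies the table. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static int child_fn(void *arg)
{
    /* Lands in whichever fd table the clone flags selected. */
    int fd = open("/dev/null", O_RDONLY);
    printf("child: opened fd %d\n", fd);
    return 0;
}

static void run(int flags, const char *label)
{
    char *stack = malloc(STACK_SIZE);            /* child stack, grows down */
    pid_t pid = clone(child_fn, stack + STACK_SIZE, flags | SIGCHLD, NULL);

    waitpid(pid, NULL, 0);

    /* If the table was shared, the child's fd 3 is still open here. */
    int shared = fcntl(3, F_GETFD) != -1;
    printf("%s: parent sees fd 3 %s\n", label,
           shared ? "open (table shared)" : "closed (table copied)");
    if (shared)
        close(3);
    free(stack);
}

int main(void)
{
    run(0,           "fork-like clone  ");
    run(CLONE_FILES, "CLONE_FILES clone");
    return 0;
}
```

The first run behaves like fork(): the child's open() lives and dies in its private copy of the table. The second run shows the thread-style sharing that the pthread flag set below relies on.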
```c
/* How pthread_create() creates threads (via clone) */
#define CLONE_THREAD_FLAGS (CLONE_VM | CLONE_FS | CLONE_FILES |      \
                            CLONE_SIGHAND | CLONE_THREAD |           \
                            CLONE_SYSVSEM | CLONE_SETTLS |           \
                            CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID)

/*
 * Thread creation shares nearly everything:
 * - CLONE_VM: Same address space (critical for threads!)
 * - CLONE_FS: Same pwd, root directory
 * - CLONE_FILES: Same file descriptor table
 * - CLONE_SIGHAND: Same signal handlers
 * - CLONE_THREAD: Same thread group ID (getpid() returns the same value)
 *
 * fork() sets NONE of these flags, so the child gets copies of everything.
 */
```

Container runtimes use the CLONE_NEW* flags to create isolated namespaces. CLONE_NEWPID gives a container its own PID namespace, where the container's init is PID 1. CLONE_NEWNS provides an isolated mount table. CLONE_NEWNET creates a separate network stack. All from the same kernel_clone() path!
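Here is a minimal sketch of that container case, again using the glibc clone() wrapper. The 1 MiB stack size is arbitrary, and creating a PID namespace requires CAP_SYS_ADMIN, so run it as root.

```c
/* Sketch: CLONE_NEWPID makes the child PID 1 in its own namespace. */
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

static int child_fn(void *arg)
{
    /* Inside the new namespace this prints 1, like a container's init. */
    printf("child:  getpid() in new PID namespace = %d\n", getpid());
    return 0;
}

int main(void)
{
    char *stack = malloc(STACK_SIZE);
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_NEWPID | SIGCHLD, NULL);
    if (pid == -1) {
        perror("clone(CLONE_NEWPID)");   /* usually EPERM without root */
        return 1;
    }

    /* In the parent's namespace the same task has an ordinary PID. */
    printf("parent: child's PID in my namespace   = %d\n", pid);
    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}
```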
The kernel_clone() function (previously do_fork()) orchestrates the entire creation process. Let's trace the critical path:
```c
pid_t kernel_clone(struct kernel_clone_args *args)
{
    struct completion vfork;
    struct task_struct *p;
    struct pid *pid;
    int trace = 0;
    pid_t nr;

    /* Step 1: Copy the process (the heavy lifting) */
    p = copy_process(NULL, trace, NUMA_NO_NODE, args);
    if (IS_ERR(p))
        return PTR_ERR(p);

    /* Step 2: Get the PID (in the appropriate namespace) */
    pid = get_task_pid(p, PIDTYPE_PID);
    nr = pid_vnr(pid);                   /* Virtual PID number */

    /* Step 3: Handle vfork - the parent must wait for the child */
    if (args->flags & CLONE_VFORK) {
        /* Parent will block until the child calls exec() or exit() */
        p->vfork_done = &vfork;
        init_completion(&vfork);
    }

    /* Step 4: Wake up the new task */
    wake_up_new_task(p);

    /* Step 5: If vfork, wait for the child to exec or exit */
    if (args->flags & CLONE_VFORK)
        wait_for_vfork_done(p, &vfork);

    /* Return the child's PID to the parent */
    return nr;
}
```

kernel_clone() returns the child's PID to the parent. But how does the child return 0 from fork()? The answer is in copy_thread(): it sets up the child's registers so that when the child is first scheduled, it returns 0 from the syscall. Parent and child execute the same return path but see different values!
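A tiny userspace program makes both return values visible; nothing here is kernel-specific, it is just the classic fork pattern.

```c
/* The "one call, two returns" behaviour set up by copy_thread(). */
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();              /* one call, two returns */

    if (pid == 0) {
        /* Child: its registers were set up to return 0 from the syscall. */
        printf("child:  fork() returned 0, my PID is %d\n", getpid());
        return 0;
    }

    /* Parent: kernel_clone() handed back the child's (virtual) PID. */
    printf("parent: fork() returned %d\n", pid);
    waitpid(pid, NULL, 0);
    return 0;
}
```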
The copy_process() function is where the actual work happens. It allocates and initializes every component of the new task:
```c
static struct task_struct *copy_process(struct pid *pid, int trace, int node,
                                        struct kernel_clone_args *args)
{
    struct task_struct *p;
    int retval;
    unsigned long clone_flags = args->flags;

    /* Validate flag combinations */
    retval = -EINVAL;
    if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
        goto bad_fork;                   /* Threads must share signals */

    /* Allocate task_struct and kernel stack */
    p = dup_task_struct(current, node);
    if (!p)
        goto bad_fork;

    /* Initialize scheduler data */
    retval = sched_fork(clone_flags, p);

    /* Copy/share each subsystem based on flags */
    retval = copy_files(clone_flags, p);       /* File descriptors */
    retval = copy_fs(clone_flags, p);          /* Filesystem context */
    retval = copy_sighand(clone_flags, p);     /* Signal handlers */
    retval = copy_signal(clone_flags, p);      /* Signal state */
    retval = copy_mm(clone_flags, p);          /* Memory (COW!) */
    retval = copy_namespaces(clone_flags, p);  /* Namespaces */
    retval = copy_thread(p, args);             /* Registers, stack */

    /* Allocate the PID */
    pid = alloc_pid(p->nsproxy->pid_ns_for_children, ...);
    p->pid = pid_nr(pid);

    /* Set up parent-child relationships */
    p->real_parent = current;                  /* or as flags dictate */
    list_add_tail(&p->sibling, &p->real_parent->children);

    return p;
}
```

A naive implementation of fork() would copy the entire address space—potentially gigabytes of memory. Linux uses Copy-on-Write (COW) to make fork() nearly instantaneous regardless of process size.
```c
/* copy_mm() - Memory descriptor handling */
static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
    struct mm_struct *mm, *oldmm;

    oldmm = current->mm;
    if (!oldmm)
        return 0;                        /* Kernel thread, no mm */

    /* CLONE_VM: Share the mm_struct (threads) */
    if (clone_flags & CLONE_VM) {
        mmget(oldmm);                    /* Increment reference count */
        tsk->mm = oldmm;
        tsk->active_mm = oldmm;
        return 0;
    }

    /* Process fork: Create a COW copy */
    mm = dup_mm(tsk, current->mm);
    tsk->mm = mm;
    tsk->active_mm = mm;
    return 0;
}

/* dup_mm() creates the COW mappings.
 *
 * COW works by:
 * 1. Parent and child share the SAME physical pages
 * 2. All writable pages are marked READ-ONLY in both
 * 3. On the first write, a page fault occurs
 * 4. The fault handler copies the page and gives the writer a private copy
 * 5. Only actually-modified pages ever get copied
 *
 * Result: fork() copies page TABLES (small), not pages (large).
 * Most pages are never modified, so they are never copied.
 */
```

COW makes fork() fast but can cause unexpected latency later. A process forked from a 10GB parent might run quickly for a while, then hit severe page fault storms when it starts modifying data. Redis's background save (BGSAVE) is a famous example—forking is instant, but as the parent modifies keys, COW faults cause latency spikes.
| Operation | Cost | Notes |
|---|---|---|
| Allocate task_struct | O(1) | Slab allocation, very fast |
| Copy page tables | O(n) | n = number of page table pages, not memory pages |
| Allocate PID | O(1) | IDR allocation |
| First write to shared page | O(1) + page copy | COW fault, deferred cost |
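The deferred cost in the last row is easy to observe from userspace. The sketch below is illustrative (the 256 MiB buffer size is arbitrary): fork() returns almost immediately even though the parent owns a large heap, while the child's first pass of writes pays for the page copies.

```c
/* Sketch: fork() is fast; the COW copies are paid for on first write. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <sys/wait.h>
#include <unistd.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    size_t size = 256UL << 20;           /* 256 MiB, arbitrary for the demo */
    char *buf = malloc(size);
    memset(buf, 1, size);                /* populate the pages in the parent */

    double t0 = now_sec();
    pid_t pid = fork();                  /* copies page tables, not pages */

    if (pid == 0) {
        printf("child:  fork() visible after  %.4f s\n", now_sec() - t0);
        double t1 = now_sec();
        memset(buf, 2, size);            /* every write faults in a COW copy */
        printf("child:  touching all pages    %.4f s\n", now_sec() - t1);
        _exit(0);
    }
    printf("parent: fork() returned after  %.4f s\n", now_sec() - t0);
    waitpid(pid, NULL, 0);
    free(buf);
    return 0;
}
```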
The dup_task_struct() function handles the critical job of allocating memory for the new task and its kernel stack:
```c
static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
{
    struct task_struct *tsk;
    unsigned long *stack;
    int err;

    /* Allocate the task_struct from the task slab cache */
    tsk = alloc_task_struct_node(node);
    if (!tsk)
        return NULL;

    /* Allocate the kernel stack (typically 2 or 4 pages) */
    stack = alloc_thread_stack_node(tsk, node);
    if (!stack)
        goto free_tsk;

    /* Copy the parent's task_struct as a starting point */
    err = arch_dup_task_struct(tsk, orig);

    /* Set the new stack pointer */
    tsk->stack = stack;

    /* Reset fields that must be fresh */
    tsk->stack_canary = get_random_canary();
    refcount_set(&tsk->usage, 2);        /* 1 for the thread, 1 for the return value */

    /* Clear statistics */
    tsk->utime = tsk->stime = 0;

    /* Set up thread_info in the stack */
    setup_thread_stack(tsk, orig);

    return tsk;
}

/*
 * Why slab allocation?
 * - task_struct is ~7KB, awkward for the page allocator
 * - Slab caches keep free task_structs ready
 * - SLAB_TYPESAFE_BY_RCU allows RCU-safe task lookup
 * - Allocation is essentially O(1)
 */
```

Notice the stack_canary initialization. This per-task random value backs the kernel's stack protector: protected functions copy it into their stack frame and verify it on return. If a buffer overflow corrupts the frame, the canary no longer matches and the kernel detects the corruption before the attacker can hijack control flow.
After copy_process() creates the new task, it's not yet runnable. The wake_up_new_task() function inserts it into the scheduler:
```c
void wake_up_new_task(struct task_struct *p)
{
    struct rq *rq;
    unsigned long flags;

    /* Mark the task as runnable */
    p->state = TASK_RUNNING;

    /* Select a CPU for the new task (the scheduler decides) */
    __set_task_cpu(p, select_task_rq(p, ...));

    /* Lock the runqueue and enqueue */
    rq = task_rq_lock(p, &flags);

    /* Add to the runqueue via the task's scheduling class */
    activate_task(rq, p, ENQUEUE_NOCLOCK);

    /* Check whether the new task should preempt the current one */
    check_preempt_curr(rq, p, WF_FORK);

    task_rq_unlock(rq, p, &flags);
}

/*
 * Key decisions here:
 * 1. Which CPU should run the new task?
 *    - Usually the parent's CPU, for cache affinity
 *    - But it may move if the parent's CPU is overloaded
 *
 * 2. Should the new task preempt the parent immediately?
 *    - CFS often runs the child first (fork pattern: if (fork()) wait)
 *    - This avoids COW faults when the child execs immediately
 */
```

You now understand how Linux creates processes—from system call through kernel_clone() to a running task. Next, we'll explore the scheduling classes that determine how these tasks compete for CPU time.