Thread Creation

The Birth of a Thread

Creating a thread is one of the most fundamental operations in concurrent programming, yet the mechanics behind it are often treated as a black box. When you call pthread_create(), CreateThread(), or new Thread().start(), what actually happens? How does the operating system allocate resources, set up execution context, and prepare a new flow of control within your process?

Understanding thread creation at a deep level is essential for:

Performance optimization — Thread creation has significant overhead; knowing the costs helps you design efficient systems
Debugging — Stack configuration issues, initialization failures, and resource limits become diagnosable
Architecture decisions — Choosing between creating threads directly, using thread pools, or adopting lightweight threading models

This page dissects the thread creation process across POSIX, Windows, and Java platforms, revealing the kernel interactions, memory allocations, and initialization sequences that bring a thread to life.

What You Will Master

By the end of this page, you will understand the complete thread creation lifecycle: stack allocation and layout, kernel data structure creation, thread attributes and their effects, the differences between fork() and thread creation, and the true costs of thread creation across platforms.

Thread Creation Overview

Thread creation involves coordinated work between user-space libraries and the operating system kernel. The process can be broken down into distinct phases:

High-Level Creation Sequence

User Request — Application calls thread creation API (pthread_create, CreateThread, Thread.start)
Parameter Validation — Library validates thread function, attributes, stack size requirements
Memory Allocation — Allocate stack memory, thread-local storage, control structures
Kernel Request — System call to create kernel-level thread structure
Initialization — Set up execution context: registers, program counter, stack pointer
Scheduling — Add new thread to scheduler's ready queue
Return — Return thread identifier to caller; new thread may already be running

This sequence takes microseconds on modern systems—fast enough for most applications, but significant when creating many threads or in latency-sensitive contexts.
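The sequence above can be seen from the caller's side in a minimal POSIX sketch. The names `square_worker` and `run_square_in_thread` are hypothetical helpers made up for this illustration; only `pthread_create` and `pthread_join` are real API calls.

```c
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical worker: squares the integer it is handed. */
static void *square_worker(void *arg) {
    int *value = arg;
    *value = (*value) * (*value);
    return value;                 /* handed back to pthread_join */
}

/* Creates one thread, waits for it, and returns the squared input,
 * or -1 if creation or joining fails. */
int run_square_in_thread(int input) {
    pthread_t tid;
    int slot = input;             /* stays valid because we join below */
    void *ret = NULL;

    /* User request: the library validates arguments, allocates a stack,
     * and asks the kernel for a new thread. */
    int rc = pthread_create(&tid, NULL, square_worker, &slot);
    if (rc != 0) {
        fprintf(stderr, "pthread_create: %s\n", strerror(rc));
        return -1;
    }

    /* Return: we hold the identifier, and the new thread may already
     * be running or even finished at this point. */
    rc = pthread_join(tid, &ret);
    if (rc != 0) {
        fprintf(stderr, "pthread_join: %s\n", strerror(rc));
        return -1;
    }
    return *(int *)ret;
}
```

Note that pthread functions return the error code directly instead of setting `errno`, which is why the sketch passes `rc` to `strerror`.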

Thread vs Process Creation Cost
Operation	Approximate Time	Relative Cost
Thread creation	10-50 μs	1x (baseline)
Process creation (fork)	100-500 μs	10-50x
Process creation + exec	1-5 ms	100-500x
Thread pool work submission	0.1-1 μs	0.01-0.1x
Goroutine creation (Go)	0.3-1 μs	0.03-0.1x
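The gap between thread creation and thread pool work submission in the table comes from reuse: a pool pays the creation cost once and afterwards only enqueues work. The following is a minimal single-worker sketch under that assumption. The type `mini_pool` and the `pool_*` functions are hypothetical names for this illustration, not part of any library, and a production pool would add multiple workers and fuller error handling.

```c
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAP 64

typedef void (*task_fn)(void *);

struct mini_pool {
    pthread_t worker;
    pthread_mutex_t lock;
    pthread_cond_t not_empty;
    struct { task_fn fn; void *arg; } q[QUEUE_CAP];  /* ring buffer */
    size_t head, count;
    bool stopping;
};

static void *pool_worker(void *p) {
    struct mini_pool *pool = p;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (pool->count == 0 && !pool->stopping)
            pthread_cond_wait(&pool->not_empty, &pool->lock);
        if (pool->count == 0) {              /* stopping and fully drained */
            pthread_mutex_unlock(&pool->lock);
            return NULL;
        }
        task_fn fn = pool->q[pool->head].fn;
        void *arg = pool->q[pool->head].arg;
        pool->head = (pool->head + 1) % QUEUE_CAP;
        pool->count--;
        pthread_mutex_unlock(&pool->lock);
        fn(arg);                             /* run the task outside the lock */
    }
}

/* Pays the thread creation cost exactly once. Returns 0 on success. */
int pool_start(struct mini_pool *pool) {
    pool->head = pool->count = 0;
    pool->stopping = false;
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->not_empty, NULL);
    return pthread_create(&pool->worker, NULL, pool_worker, pool);
}

/* Submission is a lock, a queue write, and a signal. Returns 0 on
 * success, -1 if the queue is currently full. */
int pool_submit(struct mini_pool *pool, task_fn fn, void *arg) {
    int rc = -1;
    pthread_mutex_lock(&pool->lock);
    if (pool->count < QUEUE_CAP) {
        size_t tail = (pool->head + pool->count) % QUEUE_CAP;
        pool->q[tail].fn = fn;
        pool->q[tail].arg = arg;
        pool->count++;
        pthread_cond_signal(&pool->not_empty);
        rc = 0;
    }
    pthread_mutex_unlock(&pool->lock);
    return rc;
}

/* Lets the worker drain remaining tasks, then joins it. */
void pool_stop(struct mini_pool *pool) {
    pthread_mutex_lock(&pool->lock);
    pool->stopping = true;
    pthread_cond_signal(&pool->not_empty);
    pthread_mutex_unlock(&pool->lock);
    pthread_join(pool->worker, NULL);
    pthread_mutex_destroy(&pool->lock);
    pthread_cond_destroy(&pool->not_empty);
}

static void add_one(void *arg) { ++*(int *)arg; }

/* Demo: runs `tasks` increments on the single worker and returns the
 * final counter, or -1 if the worker could not be created. */
int pool_demo(int tasks) {
    struct mini_pool pool;
    int counter = 0;                 /* touched only by the worker until join */
    if (pool_start(&pool) != 0) return -1;
    for (int i = 0; i < tasks; i++)
        while (pool_submit(&pool, add_one, &counter) != 0)
            sched_yield();           /* queue full: let the worker catch up */
    pool_stop(&pool);
    return counter;
}
```

One thread is created no matter how many tasks are submitted, which is the design choice the cost table rewards.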

What Thread Creation Does NOT Do

Understanding what's not involved in thread creation clarifies why threads are lighter than processes:

No address space duplication — Threads share the parent's address space
No page table copying — All threads use the same page tables
No file descriptor table copying — Threads share file descriptors
No signal handler copying — Signal disposition is per-process
No IPC overhead — Threads communicate through shared memory directly

The kernel creates a new scheduling entity within the existing process, allocating only thread-specific resources: registers, stack, kernel stack, and thread-local data.
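The absence of address space duplication is easy to observe. In this sketch (the function names are hypothetical), a write made by a thread is visible to its creator, while a write made by a forked child lands in the child's private copy and never reaches the parent.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_value = 0;

static void *set_from_thread(void *arg) {
    (void)arg;
    shared_value = 1;            /* writes the process's only copy */
    return NULL;
}

/* Returns shared_value as seen by the caller after a thread wrote 1,
 * or -1 on failure. */
int value_after_thread_write(void) {
    pthread_t tid;
    shared_value = 0;
    if (pthread_create(&tid, NULL, set_from_thread, NULL) != 0) return -1;
    if (pthread_join(tid, NULL) != 0) return -1;
    return shared_value;
}

/* Returns shared_value as seen by the parent after a forked child wrote 2
 * to its own copy, or -1 on failure. */
int value_after_child_write(void) {
    shared_value = 0;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        shared_value = 2;        /* modifies the child's copy-on-write page */
        _exit(0);
    }
    if (waitpid(pid, NULL, 0) < 0) return -1;
    return shared_value;
}
```

On a POSIX system the first function should return 1 and the second 0. The `pthread_join` call also provides the memory synchronization that makes the thread's write safely visible to the caller.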

User-Level vs Kernel-Level Creation

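In user-level threading models (green threads), thread 'creation' happens entirely in user space without kernel involvement: the runtime just allocates a stack and adds the thread to a user-space run queue. This is extremely fast (sub-microsecond), but in a pure user-level (N:1) model all such threads run on a single kernel thread, so they cannot execute in parallel on multiple cores. Mainstream thread libraries such as NPTL on Linux and the Win32 thread API use kernel-level threads in a 1:1 model, where each thread creation involves a system call.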
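To make user-level creation concrete, here is a minimal sketch built on the `ucontext` family (`getcontext`, `makecontext`, `swapcontext`), which glibc still ships although POSIX.1-2008 dropped it. Creating the green thread is just a `malloc` plus `makecontext`. No new kernel thread is created, though glibc's context switch does save and restore the signal mask. The names `green_body` and `run_green_thread` are hypothetical, and real green-thread runtimes usually use hand-written context switches instead.

```c
#include <stdlib.h>
#include <ucontext.h>

#define GREEN_STACK_SIZE (64 * 1024)

static ucontext_t main_ctx;     /* context of the caller ("scheduler") */
static ucontext_t green_ctx;    /* context of the user-level thread */
static int green_progress = 0;  /* steps completed by the green thread */

static void green_body(void) {
    green_progress = 1;
    swapcontext(&green_ctx, &main_ctx);   /* cooperative yield */
    green_progress = 2;
    /* returning resumes uc_link, which is main_ctx */
}

/* Creates a user-level thread, runs it to its yield point, then resumes
 * it to completion. Returns 0 if both steps were observed, -1 otherwise. */
int run_green_thread(void) {
    void *stack = malloc(GREEN_STACK_SIZE);
    if (stack == NULL) return -1;

    if (getcontext(&green_ctx) != 0) {
        free(stack);
        return -1;
    }
    green_ctx.uc_stack.ss_sp = stack;          /* the "stack allocation" step */
    green_ctx.uc_stack.ss_size = GREEN_STACK_SIZE;
    green_ctx.uc_link = &main_ctx;             /* where to go when it returns */
    makecontext(&green_ctx, green_body, 0);    /* the "initialization" step */

    green_progress = 0;
    swapcontext(&main_ctx, &green_ctx);        /* schedule it until it yields */
    int at_yield = green_progress;
    swapcontext(&main_ctx, &green_ctx);        /* resume after the yield */
    int at_exit = green_progress;

    free(stack);
    return (at_yield == 1 && at_exit == 2) ? 0 : -1;
}
```

Both contexts run on the calling kernel thread, which is exactly why this model is cheap and why it cannot use a second core.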

Stack Allocation and Layout

Every thread requires a stack—a contiguous region of memory for function calls, local variables, and return addresses. Stack allocation is one of the most significant aspects of thread creation, both in terms of memory consumption and configuration complexity.

Stack Structure

A typical thread stack layout on Linux with glibc (x86-64), from high to low addresses:

High Address
+------------------------+
|  TCB and static TLS    |  <- glibc places the thread descriptor at the top of the mapping
+------------------------+
|                        |
|    Usable Stack        |  <- Stack grows downward from here
|    (function frames,   |
|     local variables)   |
|                        |
+------------------------+
|     Guard Page(s)      |  <- Inaccessible; touching it on overflow raises SIGSEGV
+------------------------+
Low Address

The red zone (x86-64 System V ABI) is not a fixed region of this layout: it is the 128 bytes just below the current RSP, which leaf functions may use without adjusting the stack pointer.

Default Stack Sizes

Stack sizes vary significantly across platforms and can dramatically impact how many threads you can create:

Default Stack Sizes by Platform
Platform	User Stack	Kernel Stack	Notes
Linux (x86_64)	8 MB	16 KB	ulimit -s shows default; NPTL
Linux (arm64)	8 MB	16 KB	Similar to x86_64
macOS	512 KB (secondary)	16 KB	Main thread: 8 MB
Windows (x64)	1 MB (reserved)	24 KB	Only 4KB committed initially
Java (64-bit)	1 MB (default)	OS-dependent	-Xss flag to configure; kernel stack is whatever the host OS uses
Go goroutine	2-8 KB (grows)	N/A	Dynamically grows to 1 GB

stack_allocation.c
#include <limits.h>     /* PTHREAD_STACK_MIN */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/resource.h>
 
/*
 * Querying and Setting Stack Sizes
 */
 
void query_default_stack_size(void) {
    pthread_attr_t attr;
    size_t stack_size;
    
    pthread_attr_init(&attr);
    pthread_attr_getstacksize(&attr, &stack_size);
    printf("Default stack size: %zu bytes (%.2f MB)\n", 
           stack_size, (double)stack_size / (1024 * 1024));
    pthread_attr_destroy(&attr);
    
    // Also check system limit
    struct rlimit rlim;
    getrlimit(RLIMIT_STACK, &rlim);
    printf("Stack soft limit: %lu, hard limit: %lu\n",
           (unsigned long)rlim.rlim_cur, (unsigned long)rlim.rlim_max);
}
 
/*
 * Creating Threads with Custom Stack Size
 */
 
void *worker(void *arg) {
    // Demonstrate stack usage
    char buffer[4096];  // 4KB on stack
    memset(buffer, 0, sizeof(buffer));
    printf("Thread running with stack buffer at %p\n", (void *)buffer);
    return NULL;
}
 
int create_thread_with_custom_stack(size_t stack_bytes) {
    pthread_t tid;
    pthread_attr_t attr;
    int result;
    
    result = pthread_attr_init(&attr);
    if (result != 0) {
        fprintf(stderr, "attr_init failed: %s\n", strerror(result));
        return -1;
    }
    
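    /*
     * PTHREAD_STACK_MIN (from <limits.h>) is the minimum allowed size,
     * typically 16KB on Linux x86_64. Some implementations also reject
     * sizes that are not a multiple of the page size. Leave headroom for
     * signal handlers, which run on the thread's stack by default.
     * On recent glibc PTHREAD_STACK_MIN may expand to a long-valued
     * sysconf() call, hence the explicit casts below.
     */
    if (stack_bytes < (size_t)PTHREAD_STACK_MIN) {
        fprintf(stderr, "Stack too small, minimum is %ld\n",
                (long)PTHREAD_STACK_MIN);
        stack_bytes = (size_t)PTHREAD_STACK_MIN;
    }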
    
    result = pthread_attr_setstacksize(&attr, stack_bytes);
    if (result != 0) {
        fprintf(stderr, "setstacksize failed: %s\n", strerror(result));
        pthread_attr_destroy(&attr);
        return -1;
    }
    
    result = pthread_create(&tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);  // Safe to destroy after create
    
    if (result != 0) {
        fprintf(stderr, "pthread_create failed: %s\n", strerror(result));
        return -1;
    }
    
    pthread_join(tid, NULL);
    return 0;
}
 
/*
 * Using a Pre-Allocated Stack
 * Useful for embedded systems or memory-constrained environments
 */
 
int create_thread_with_custom_stack_memory(void) {
    pthread_t tid;
    pthread_attr_t attr;
    void *stack_base;
    size_t stack_size = 64 * 1024;  // 64 KB
    int result;
    
    // Allocate aligned memory for stack
    result = posix_memalign(&stack_base, sysconf(_SC_PAGESIZE), stack_size);
    if (result != 0) {
        fprintf(stderr, "posix_memalign failed: %s\n", strerror(result));
        return -1;
    }
    
    pthread_attr_init(&attr);
    
    // Set both stack address and size
    result = pthread_attr_setstack(&attr, stack_base, stack_size);
    if (result != 0) {
        fprintf(stderr, "setstack failed: %s\n", strerror(result));
        free(stack_base);
        pthread_attr_destroy(&attr);
        return -1;
    }
    
    result = pthread_create(&tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
    
    if (result != 0) {
        fprintf(stderr, "pthread_create failed: %s\n", strerror(result));
        free(stack_base);
        return -1;
    }
    
    pthread_join(tid, NULL);
    
    /*
     * IMPORTANT: When using pthread_attr_setstack, YOU are responsible
     * for freeing the stack memory. The thread library will NOT free it.
     */
    free(stack_base);
    
    return 0;
}
 
/*
 * Stack Guard Pages
 * Protecting against stack overflow
 */
 
int demonstrate_guard_pages(void) {
    pthread_attr_t attr;
    size_t guard_size;
    
    pthread_attr_init(&attr);
    
    // Query default guard size (typically one page)
    pthread_attr_getguardsize(&attr, &guard_size);
    printf("Default guard size: %zu bytes\n", guard_size);
    
    // Can set to 0 to disable (saves memory, loses protection)
    // pthread_attr_setguardsize(&attr, 0);
    
    // Or increase for more protection
    pthread_attr_setguardsize(&attr, 4096 * 2);  // Two pages
    
    pthread_attr_destroy(&attr);
    return 0;
}

Stack Overflow Risk

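A stack overflow does not just affect one thread. When the overflow hits a guard page, the kernel delivers SIGSEGV and, by default, terminates the whole process. When there is no guard page, or a single large frame jumps past it, the overflow can silently corrupt adjacent memory, including other threads' stacks. Guard pages cost address space but make overflows detectable. When reducing stack sizes for many threads, carefully audit maximum stack depth, including signal handlers, which execute on the thread's stack unless an alternate stack is installed with sigaltstack().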

Kernel Thread Creation (Linux)

On Linux, all threads are created via the clone() system call—the universal mechanism for creating both processes and threads. Understanding clone() reveals how the kernel implements threading.

The clone() System Call

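/* Raw system call interface (argument order shown for x86-64; it differs on
 * some architectures). The glibc wrapper, clone(fn, stack, flags, arg, ...),
 * has a different signature. */
long clone(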
    unsigned long flags,           // What to share
    void *child_stack,             // Stack for new thread
    pid_t *parent_tidptr,          // Where to store child TID in parent
    pid_t *child_tidptr,           // Where to store child TID in child
    unsigned long tls              // Thread-local storage descriptor
);

The flags parameter determines what resources are shared between parent and child. The difference between fork() and thread creation is just which flags are set:

Key clone() Flags for Threading
Flag	Effect	Used By
CLONE_VM	Share virtual memory (address space)	Threads (essential)
CLONE_FS	Share filesystem info (cwd, root)	Threads
CLONE_FILES	Share file descriptor table	Threads
CLONE_SIGHAND	Share signal handlers	Threads
CLONE_THREAD	Same thread group (share PID)	Threads
CLONE_SYSVSEM	Share System V semaphore adjustments	Threads
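CLONE_SETTLS	Set the child's TLS descriptor from the tls argument	Threads
CLONE_PARENT_SETTID	Store child TID at parent_tidptr in the parent's memory	pthread_create
CLONE_CHILD_CLEARTID	Clear TID at child_tidptr and wake futex waiters on exit (how pthread_join is notified)	NPTL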

clone_internals.c
/*
 * How pthread_create() uses clone() internally (simplified)
 * This is what NPTL does under the hood
 */
 
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
 
/*
 * Conceptual implementation of pthread_create-like functionality
 * using raw clone() system call
 */
 
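// Start record handed to the new thread. It is stored at the top of the
// thread's own stack allocation so that it stays valid after
// simple_thread_create() returns (a local array in the creator would not).
struct thread_start {
    void (*start_routine)(void *);
    void *arg;
};

// Thread entry point wrapper
static int thread_wrapper(void *arg) {
    struct thread_start *ts = arg;
    
    // Call the user's thread function. No TLS was set up for this thread
    // (no CLONE_SETTLS), so it must avoid libc calls that rely on TLS.
    ts->start_routine(ts->arg);
    
    // Returning makes the glibc clone() wrapper terminate this thread only
    return 0;
}
 
// Simplified thread creation
int simple_thread_create(void (*start_routine)(void *), void *arg) {
    const size_t STACK_SIZE = 1024 * 1024;  // 1 MB
    
    // Allocate stack. It is never freed here: a real library reclaims it
    // only after it knows the thread has exited.
    char *stack = malloc(STACK_SIZE);
    if (!stack) {
        return -1;
    }
    
    // Reserve a 16-byte-aligned slot for the start record at the top
    size_t reserve = (sizeof(struct thread_start) + 15) & ~(size_t)15;
    struct thread_start *ts =
        (struct thread_start *)(stack + STACK_SIZE - reserve);
    ts->start_routine = start_routine;
    ts->arg = arg;
    
    // Stack grows down, so the usable stack starts just below the record
    void *stack_top = ts;
    
    /*
     * clone() flags for creating a thread:
     * 
     * CLONE_VM:       Share address space (critical for threads)
     * CLONE_FS:       Share filesystem context
     * CLONE_FILES:    Share file descriptors
     * CLONE_SIGHAND:  Share signal handlers
     * CLONE_THREAD:   Same thread group (same getpid() result)
     * CLONE_SYSVSEM:  Share SysV semaphore undo values
     *
     * No exit signal (such as SIGCHLD) is specified: with CLONE_THREAD
     * the creator is not signaled when the thread terminates.
     */
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                CLONE_THREAD | CLONE_SYSVSEM;
    
    // Create the thread
    pid_t tid = clone(thread_wrapper, stack_top, flags, ts);
    
    if (tid == -1) {
        perror("clone");
        free(stack);
        return -1;
    }
    
    printf("Created thread with TID: %d\n", tid);
    return tid;
}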
 
/*
 * What the kernel does when clone() is called:
 * 
 * 1. Allocate task_struct for new thread
 * 2. Copy/share resources based on flags:
 *    - If CLONE_VM: share mm_struct (page tables, mappings)
 *    - If CLONE_FILES: share files_struct (fd table)
 *    - etc.
 * 3. Allocate kernel stack (separate from user stack)
 * 4. Set up initial register state:
 *    - Stack pointer -> child_stack
 *    - Instruction pointer -> clone return address (child starts in clone())
 * 5. Add to scheduler's run queue
 * 6. Return TID to parent, 0 to child
 */
 
/*
 * Kernel data structures created for each thread:
 *
 * task_struct (~6 KB on 64-bit Linux)
 *   - Process/thread state
 *   - Scheduling information
 *   - Links to shared resources
 *   - Signal handling state
 *   - Credentials
 *   - Timers
 *
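 * Kernel stack (~16 KB on x86-64)
 *   - Kernel-mode execution stack
 *   - On older kernels it also holds thread_info (architecture-specific
 *     flags); recent x86-64 kernels keep thread_info inside task_struct
 *
 * thread_struct (embedded in task_struct)
 *   - Saved CPU state such as floating point state
 *   - Debug registers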
 */

PID vs TID

On Linux, getpid() returns the Thread Group ID (TGID), which is the same for all threads in a process. gettid() (via syscall) returns the unique Thread ID. CLONE_THREAD makes the new thread share the parent's TGID while having its own TID. This maintains POSIX semantics where getpid() returns the same value for all threads.
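The distinction can be checked directly on Linux. This sketch (function and struct names are hypothetical) records `getpid()` and the TID inside a new thread and compares them with the creator's values. It obtains the TID through `syscall(SYS_gettid)`, which works regardless of whether the installed glibc provides a `gettid()` wrapper.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct ids {
    pid_t pid;   /* TGID, as returned by getpid() */
    pid_t tid;   /* kernel thread ID */
};

static pid_t current_tid(void) {
    return (pid_t)syscall(SYS_gettid);
}

static void *record_ids(void *out) {
    struct ids *ids = out;
    ids->pid = getpid();
    ids->tid = current_tid();
    return NULL;
}

/* Returns 1 if a new thread reports the caller's PID but a different TID,
 * 0 if not, and -1 if the thread could not be created or joined. */
int thread_shares_pid_not_tid(void) {
    struct ids child = {0, 0};
    pthread_t t;
    if (pthread_create(&t, NULL, record_ids, &child) != 0) return -1;
    if (pthread_join(t, NULL) != 0) return -1;
    return child.pid == getpid() && child.tid != current_tid();
}
```

The same relationship is visible from the shell: each TID of a process appears as a directory under `/proc/<pid>/task/`.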

Thread Creation on Windows

Windows thread creation follows a different architectural model but achieves similar results. The CreateThread() API ultimately invokes NtCreateThreadEx() in the kernel, which creates the ETHREAD and KTHREAD structures that represent the thread.

Windows Thread Creation Sequence

Validate parameters — Check thread function pointer, stack size, etc.
Reserve/commit stack — Virtual memory for user-mode stack
Create kernel structures — ETHREAD, KTHREAD in kernel pool
Initialize TEB — Thread Environment Block in user-space
Set initial context — Register state for new thread
Insert in process — Link thread to process's thread list
Signal creation event — Thread is ready
Optional initial suspend — If CREATE_SUSPENDED flag set
Begin scheduling — Thread becomes runnable

windows_creation.c
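#include <windows.h>
#include <winternl.h>   // NTSTATUS, POBJECT_ATTRIBUTES (used by the NtCreateThreadEx typedef)
#include <stdio.h>

// Forward declaration: ThreadProc is defined at the end of this file
DWORD WINAPI ThreadProc(LPVOID lpParam);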
 
/*
 * Windows Thread Creation Deep Dive
 */
 
/*
 * CreateThread() Parameters Explained
 */
void demonstrate_create_thread(void) {
    HANDLE hThread;
    DWORD threadId;
    
    hThread = CreateThread(
        NULL,           // lpThreadAttributes:
                        //   Security descriptor for thread handle
                        //   NULL = default security, not inheritable
        
        0,              // dwStackSize:
                        //   Initial stack size in bytes
                        //   0 = default (1 MB reserved, 1 page committed)
        
        ThreadProc,     // lpStartAddress:
                        //   Thread entry point function
                        //   Must match LPTHREAD_START_ROUTINE signature
        
        (LPVOID)42,     // lpParameter:
                        //   Argument passed to thread function
        
        0,              // dwCreationFlags:
                        //   0 = run immediately
                        //   CREATE_SUSPENDED = start suspended
                        //   STACK_SIZE_PARAM_IS_A_RESERVATION = dwStackSize is reservation, not commit
        
        &threadId       // lpThreadId:
                        //   Receives the thread ID
                        //   Can be NULL if not needed
    );
    
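    if (hThread == NULL) {
        printf("CreateThread failed: %lu\n", GetLastError());
        return;
    }
    
    // Wait for the thread, then release the handle (otherwise it leaks)
    WaitForSingleObject(hThread, INFINITE);
    CloseHandle(hThread);
}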
 
/*
 * Stack Size Control on Windows
 */
void demonstrate_stack_sizes(void) {
    HANDLE hThread;
    
    // Default: 1 MB reserved, 1 page (4KB) committed
    // Stack grows and commits more pages as needed
    hThread = CreateThread(NULL, 0, ThreadProc, NULL, 0, NULL);
    if (hThread != NULL) CloseHandle(hThread);
    
    // Reserve 256KB, commit as needed
    hThread = CreateThread(NULL, 256 * 1024, ThreadProc, NULL,
                          STACK_SIZE_PARAM_IS_A_RESERVATION, NULL);
    if (hThread != NULL) CloseHandle(hThread);
    
    // Commit 256KB up front. Without STACK_SIZE_PARAM_IS_A_RESERVATION,
    // dwStackSize is the initial commit size; the reservation keeps its
    // default (1 MB), so the stack can still grow beyond 256KB.
    hThread = CreateThread(NULL, 256 * 1024, ThreadProc, NULL,
                          0, NULL);
    if (hThread != NULL) CloseHandle(hThread);
}
 
/*
 * Creating Suspended Threads
 * Useful for setting thread properties before it runs
 */
void demonstrate_suspended_creation(void) {
    HANDLE hThread;
    DWORD threadId;
    
    // Create in suspended state
    hThread = CreateThread(NULL, 0, ThreadProc, NULL,
                          CREATE_SUSPENDED, &threadId);
    
    if (hThread == NULL) {
        printf("CreateThread failed: %lu\n", GetLastError());
        return;
    }
    
    // Configure thread before it runs
    SetThreadPriority(hThread, THREAD_PRIORITY_HIGHEST);
    
    // Set processor affinity
    SetThreadAffinityMask(hThread, 0x1);  // CPU 0 only
    
    // Now start it
    if (ResumeThread(hThread) == (DWORD)-1) {
        printf("ResumeThread failed: %lu\n", GetLastError());
    }
    
    WaitForSingleObject(hThread, INFINITE);
    CloseHandle(hThread);
}
 
/*
 * NtCreateThreadEx: The low-level API
 * (Not documented, but used by CreateThread internally)
 */
 
// Declaration (from ntdll.dll)
typedef NTSTATUS (NTAPI *pNtCreateThreadEx)(
    OUT PHANDLE ThreadHandle,
    IN ACCESS_MASK DesiredAccess,
    IN POBJECT_ATTRIBUTES ObjectAttributes OPTIONAL,
    IN HANDLE ProcessHandle,
    IN LPTHREAD_START_ROUTINE StartAddress,
    IN LPVOID Parameter OPTIONAL,
    IN ULONG CreateFlags,  // THREAD_CREATE_FLAGS_*
    IN SIZE_T ZeroBits OPTIONAL,
    IN SIZE_T StackSize OPTIONAL,
    IN SIZE_T MaximumStackSize OPTIONAL,
    IN PVOID AttributeList OPTIONAL
);
 
/*
 * Kernel structures created:
 *
 * ETHREAD (Executive Thread Object)
 *   - Win32StartAddress (user entry point)
 *   - Process link
 *   - IrpList (pending I/O requests)
 *   - ThreadListEntry
 *   - Create/exit times
 *   - Timer data
 *
 * KTHREAD (Kernel Thread Object) 
 *   - Embedded in ETHREAD
 *   - State (Running, Ready, Waiting, etc.)
 *   - Priority (base and dynamic)
 *   - Quantum (time slice)
 *   - Stack pointers (kernel and user)
 *   - Wait blocks
 *   - Context (registers)
 *
 * TEB (Thread Environment Block, user space)
 *   - TLS array
 *   - Stack boundaries
 *   - Last error value
 *   - Exception handling chain
 *   - Self-reference (NtTib.Self)
 */
 
DWORD WINAPI ThreadProc(LPVOID lpParam) {
    printf("Thread running!\n");
    return 0;
}

Windows Stack Reservation vs Commitment

Windows distinguishes between reserved and committed stack memory. Reservation (default 1MB) just reserves address space. Commitment allocates actual memory pages. By default, only one page is committed initially; more pages are committed as the stack grows. This allows many threads without consuming actual memory until needed.
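For readers more familiar with POSIX, the reserve/commit split can be approximated with `mmap` and `mprotect`. This is only an analogy, not the Windows API: Windows uses `VirtualAlloc` with `MEM_RESERVE` and `MEM_COMMIT`, and its commit accounting differs from Linux's lazy page allocation. The function names below are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* "Reserve": claim address space only. No access is permitted and no
 * physical pages are assigned. Returns NULL on failure. */
void *reserve_region(size_t bytes) {
    void *p = mmap(NULL, bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* "Commit": make the first `bytes` of a reserved region usable.
 * Physical pages still arrive lazily, on first touch. */
int commit_prefix(void *base, size_t bytes) {
    return mprotect(base, bytes, PROT_READ | PROT_WRITE);
}

/* Reserves 256 pages, commits one, and writes to it.
 * Returns the value read back (42), or -1 on failure. */
int reserve_then_commit_demo(void) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t reserved = 256 * page;

    char *region = reserve_region(reserved);
    if (region == NULL) return -1;

    if (commit_prefix(region, page) != 0) {
        munmap(region, reserved);
        return -1;
    }

    region[0] = 42;              /* legal: inside the committed page */
    int value = region[0];
    /* region[page] would fault here: that page is still PROT_NONE */

    munmap(region, reserved);
    return value;
}
```

The point carried over from Windows is that a large reservation costs address space, not memory, until pages are actually committed and touched.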

Java Thread Creation Internals

Java's Thread.start() method triggers a complex sequence involving the Java runtime, JNI, and ultimately native thread creation. Understanding this sequence explains why Java thread creation has noticeable overhead compared to raw native threads.

Java Thread Creation Sequence

Java validation — Check thread state (must be NEW), validate Thread object
Native transition — JNI call to JVM_StartThread in the JVM
JVM thread structure — Allocate JavaThread C++ object
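Stack allocation — Determine the stack size (from -Xss or the constructor hint); in HotSpot, Java frames and native frames share a single OS thread stack, which the OS thread API allocates in the next step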
OS thread creation — Call OS API (pthread_create, CreateThread)
Thread initialization — Set up TLS, stack traces capability, safepoint tracking
JVM registration — Add to thread list, attach to current thread group
Memory barrier — Ensure visibility of Thread object fields
run() invocation — Call Thread.run() method (or Runnable.run())

JavaThreadCreation.java
import java.lang.management.*;
import java.util.concurrent.*;
 
/**
 * Java Thread Creation Internals and Configuration
 */
public class JavaThreadCreation {
 
    /**
     * Examining the thread creation sequence
     */
    public static void examineThreadCreation() {
        Thread t = new Thread(() -> {
            System.out.println("Thread running!");
        });
        
        // State is NEW (thread object exists, native thread does not)
        System.out.println("Before start: " + t.getState());  // NEW
        
        t.start();  // Native thread created here!
        
        // State is now RUNNABLE (or may have already terminated)
        System.out.println("After start: " + t.getState());  // RUNNABLE or TERMINATED
        
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        
        // State is TERMINATED
        System.out.println("After join: " + t.getState());  // TERMINATED
    }
    
    /**
     * Stack size configuration via Thread constructor
     */
    public static void customStackSize() {
        // Create thread with 256KB stack
        Thread t = new Thread(
            null,           // ThreadGroup
            () -> {
                System.out.println("Small stack thread");
                // Be careful with recursive calls!
            },
            "SmallStackThread",
            256 * 1024      // Stack size in bytes
        );
        
        t.start();
        
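        /*
         * Notes on stack size:
         * - Default is 1MB on most 64-bit JVMs
         * - Minimum is platform-dependent
         * - The constructor's stackSize is only a hint; the JVM may round or ignore it
         * - Use -Xss to set the default for new Java threads: java -Xss256k MyClass
         * - -XX:ThreadStackSize=<KB> is the HotSpot equivalent of -Xss
         *   (internal VM threads use -XX:VMThreadStackSize)
         */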
    }
    
    /**
     * Thread creation cost measurement
     */
    public static void measureCreationCost() {
        final int ITERATIONS = 1000;
        
        // Warm up
        for (int i = 0; i < 100; i++) {
            Thread t = new Thread(() -> {});
            t.start();
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); return; }
        }
        
        // Measure
        long start = System.nanoTime();
        
        for (int i = 0; i < ITERATIONS; i++) {
            Thread t = new Thread(() -> {
                // Empty task
            });
            t.start();
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); return; }
        }
        
        long elapsed = System.nanoTime() - start;
        double avgMicros = (elapsed / 1000.0) / ITERATIONS;
        
        System.out.printf("Average thread create+join: %.2f μs%n", avgMicros);
    }
    
    /**
     * Thread pool vs direct creation performance
     */
    public static void poolVsDirect() throws Exception {
        final int TASKS = 10000;
        Runnable task = () -> {
            // Minimal work
            int x = 0;
            for (int i = 0; i < 100; i++) x += i;
        };
        
        // Direct thread creation
        long start = System.nanoTime();
        for (int i = 0; i < TASKS; i++) {
            Thread t = new Thread(task);
            t.start();
            t.join();
        }
        long directTime = System.nanoTime() - start;
        
        // Thread pool
        ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors()
        );
        
        start = System.nanoTime();
        for (int i = 0; i < TASKS; i++) {
            pool.submit(task).get();
        }
        long poolTime = System.nanoTime() - start;
        
        pool.shutdown();
        
        System.out.printf("Direct: %.2f ms%n", directTime / 1e6);
        System.out.printf("Pool: %.2f ms%n", poolTime / 1e6);
        System.out.printf("Speedup: %.2fx%n", (double)directTime / poolTime);
    }
    
    /**
     * Thread information via management APIs
     */
    public static void threadInfoViaManagement() {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        
        System.out.println("Thread count: " + threadMXBean.getThreadCount());
        System.out.println("Peak thread count: " + threadMXBean.getPeakThreadCount());
        System.out.println("Daemon thread count: " + threadMXBean.getDaemonThreadCount());
        System.out.println("Total started: " + threadMXBean.getTotalStartedThreadCount());
        
        // Get info about current thread
        long tid = Thread.currentThread().getId();
        ThreadInfo info = threadMXBean.getThreadInfo(tid);
        
        System.out.println("\nCurrent thread info:");
        System.out.println("  Name: " + info.getThreadName());
        System.out.println("  State: " + info.getThreadState());
        System.out.println("  Blocked count: " + info.getBlockedCount());
        System.out.println("  Waited count: " + info.getWaitedCount());
    }
}

JVM Native Thread Overhead

Java thread creation is slower than raw native thread creation because the JVM must allocate Java-specific structures (JavaThread, handles), set up safepoint polling, initialize stack traces capability, and perform various thread-safety operations. This overhead is typically 20-100μs on modern systems—negligible for most applications but significant for fine-grained parallelism.

Thread Creation Costs

Understanding the true costs of thread creation helps architects make informed decisions about concurrency design. Thread creation cost has three components: time (latency), memory (per-thread overhead), and scalability (limits on thread count).

Time Cost Breakdown

The time to create a thread can be decomposed into:

Thread Creation Time Components (approximate)
Component	Time	Description
User-space setup	1-5 μs	Attribute parsing, allocation bookkeeping
System call entry	0.5-1 μs	Mode switch to kernel
Kernel structure allocation	5-20 μs	task_struct, thread_info, etc.
Stack allocation	2-10 μs	Virtual memory mapping, guard pages
Scheduler insertion	1-5 μs	Add to run queue, potential IPI
Return to user space	0.5-1 μs	Mode switch back
Total	10-50 μs	Typical range on modern systems

Memory Cost Per Thread

Each thread consumes memory for:

User stack: 64KB - 8MB (application-controlled)
Kernel stack: 16-24KB (fixed by OS)
Kernel structures: 6-10KB (task_struct, etc.)
TLS/TSD: Variable (depends on application)
Thread-local allocator state: 0-64KB (for jemalloc, tcmalloc)

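Example calculation: 1000 threads with 1MB stacks each reserve about 1GB of virtual address space for stacks (only the pages each thread actually touches become resident), plus roughly 22-34MB of kernel memory for kernel stacks and structures.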
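The arithmetic behind this estimate is simple enough to encode. The helper below is a hypothetical sketch that multiplies out the per-thread figures from the list above. The inputs are the caller's assumptions, not measured values.

```c
#include <stdint.h>

struct thread_mem_estimate {
    uint64_t stack_bytes;    /* user stacks: address space reserved */
    uint64_t kernel_bytes;   /* kernel stacks plus kernel structures */
};

/* Multiplies the per-thread costs by the thread count. */
struct thread_mem_estimate estimate_thread_memory(uint64_t threads,
                                                  uint64_t user_stack_bytes,
                                                  uint64_t kernel_stack_bytes,
                                                  uint64_t kernel_struct_bytes) {
    struct thread_mem_estimate e;
    e.stack_bytes = threads * user_stack_bytes;
    e.kernel_bytes = threads * (kernel_stack_bytes + kernel_struct_bytes);
    return e;
}
```

For 1000 threads with 1 MiB user stacks, 16 KiB kernel stacks, and 6 KiB of kernel structures, this gives 1,048,576,000 bytes of stack address space and 22,528,000 bytes of kernel memory. That is the low end of the range quoted above.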

Scalability Limits

Maximum thread count is constrained by:

Virtual address space: Total stack space cannot exceed available VA
Kernel memory: Each thread needs kernel structures
PID/TID namespace: Usually 32K-4M depending on configuration
File descriptors: Default per-process limit (~1024 on Linux); a constraint only when each thread holds its own descriptors (sockets, pipes)
ulimits: RLIMIT_NPROC limits threads per user
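On Linux, several of these limits can be read at run time. The sketch below assumes the standard procfs paths `/proc/sys/kernel/threads-max`, `/proc/sys/kernel/pid_max`, and `/proc/sys/vm/max_map_count` (the last matters because each thread stack and its guard region are separate memory mappings). The helper names are hypothetical, and a value of -1 means the file could not be read.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Reads one integer from a file. Returns -1 if the file is missing
 * or does not start with a number. */
long read_long_from_file(const char *path) {
    FILE *f = fopen(path, "r");
    long value = -1;
    if (f == NULL) return -1;
    if (fscanf(f, "%ld", &value) != 1) value = -1;
    fclose(f);
    return value;
}

/* Prints the main system-wide and per-user limits on thread count. */
void print_thread_limits(void) {
    struct rlimit rl;

    printf("kernel.threads-max: %ld\n",
           read_long_from_file("/proc/sys/kernel/threads-max"));
    printf("kernel.pid_max:     %ld\n",
           read_long_from_file("/proc/sys/kernel/pid_max"));
    printf("vm.max_map_count:   %ld\n",
           read_long_from_file("/proc/sys/vm/max_map_count"));

    if (getrlimit(RLIMIT_NPROC, &rl) == 0) {
        if (rl.rlim_cur == RLIM_INFINITY)
            printf("RLIMIT_NPROC (soft): unlimited\n");
        else
            printf("RLIMIT_NPROC (soft): %lu\n", (unsigned long)rl.rlim_cur);
    }
}
```

Whichever of these limits is smallest for a given workload is the one `pthread_create` will hit first, typically reporting `EAGAIN`.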

thread_limits.c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <time.h>       /* clock_gettime */
#include <unistd.h>     /* pause, sysconf */
#include <sys/resource.h>
#include <sys/sysinfo.h>
 
/*
 * Measuring Thread Creation Costs
 */
 
void *empty_thread(void *arg) {
    return NULL;
}
 
void measure_creation_time(void) {
    enum { ITERATIONS = 1000 };
    pthread_t tids[ITERATIONS];
    struct timespec start, end;
    int created = 0;
    
    clock_gettime(CLOCK_MONOTONIC, &start);
    
    for (int i = 0; i < ITERATIONS; i++) {
        int result = pthread_create(&tids[i], NULL, empty_thread, NULL);
        if (result != 0) {
            fprintf(stderr, "pthread_create failed at %d: %s\n", 
                    i, strerror(result));
            break;
        }
        created++;
    }
    
    // Wait only for the threads that were actually created
    for (int i = 0; i < created; i++) {
        pthread_join(tids[i], NULL);
    }
    
    clock_gettime(CLOCK_MONOTONIC, &end);
    
    if (created == 0) {
        return;
    }
    
    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 + 
                        (end.tv_nsec - start.tv_nsec);
    double per_thread_us = elapsed_ns / created / 1000.0;
    
    // The interval covers creation and joining, not creation alone
    printf("Created and joined %d threads in %.2f ms\n", 
           created, elapsed_ns / 1e6);
    printf("Average per thread (create + join): %.2f μs\n", per_thread_us);
}
 
/*
 * Finding maximum thread count
 */
 
void *wait_forever(void *arg) {
    while (1) {
        pause();  // Wait for signal
    }
    return NULL;
}
 
int find_max_threads(size_t stack_size) {
    pthread_attr_t attr;
    pthread_t *tids = NULL;
    int count = 0;
    int capacity = 1024;
    
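    pthread_attr_init(&attr);
    
    if (stack_size > 0) {
        int rc = pthread_attr_setstacksize(&attr, stack_size);
        if (rc != 0) {
            // Typically EINVAL: the size is below PTHREAD_STACK_MIN
            fprintf(stderr, "pthread_attr_setstacksize: %s\n", strerror(rc));
            pthread_attr_destroy(&attr);
            return -1;
        }
        // Disable guard pages to save memory (this also removes overflow detection)
        pthread_attr_setguardsize(&attr, 0);
    }
    
    tids = malloc(capacity * sizeof(pthread_t));
    if (!tids) {
        perror("malloc");
        pthread_attr_destroy(&attr);
        return -1;
    }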
    
    printf("Creating threads with %zu byte stacks...\n", stack_size);
    
    while (1) {
        if (count >= capacity) {
            capacity *= 2;
            pthread_t *new_tids = realloc(tids, capacity * sizeof(pthread_t));
            if (!new_tids) {
                printf("Realloc failed at %d threads\n", count);
                break;
            }
            tids = new_tids;
        }
        
        int result = pthread_create(&tids[count], &attr, wait_forever, NULL);
        if (result != 0) {
            printf("pthread_create failed at %d: %s\n", 
                   count, strerror(result));
            break;
        }
        
        count++;
        
        if (count % 1000 == 0) {
            printf("Created %d threads...\n", count);
        }
    }
    
    printf("Maximum threads created: %d\n", count);
    
    // Cleanup (will take a while)
    printf("Cleaning up...\n");
    for (int i = 0; i < count; i++) {
        pthread_cancel(tids[i]);
    }
    for (int i = 0; i < count; i++) {
        pthread_join(tids[i], NULL);
    }
    
    free(tids);
    pthread_attr_destroy(&attr);
    
    return count;
}
 
/*
 * Checking system limits
 */
 
void show_system_limits(void) {
    struct rlimit rlim;
    
    // Threads per process (actually per-user on Linux)
    getrlimit(RLIMIT_NPROC, &rlim);
    printf("RLIMIT_NPROC: soft=%lu, hard=%lu\n",
           (unsigned long)rlim.rlim_cur, (unsigned long)rlim.rlim_max);
    
    // Stack size
    getrlimit(RLIMIT_STACK, &rlim);
    printf("RLIMIT_STACK: soft=%lu, hard=%lu\n",
           (unsigned long)rlim.rlim_cur, (unsigned long)rlim.rlim_max);
    
    // Virtual memory
    getrlimit(RLIMIT_AS, &rlim);
    printf("RLIMIT_AS: soft=%lu, hard=%lu\n",
           (unsigned long)rlim.rlim_cur, (unsigned long)rlim.rlim_max);
    
    // System-wide limits
    printf("\nSystem info:\n");
    printf("Available RAM: %lu MB\n", 
           (unsigned long)(get_avphys_pages() * sysconf(_SC_PAGESIZE) / 1024 / 1024));
    printf("Total RAM: %lu MB\n",
           (unsigned long)(get_phys_pages() * sysconf(_SC_PAGESIZE) / 1024 / 1024));
    printf("CPUs: %d\n", get_nprocs());
    
    // PID max
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");
    if (f) {
        int pid_max;
        if (fscanf(f, "%d", &pid_max) == 1) {
            printf("PID max: %d\n", pid_max);
        }
        fclose(f);
    }
    
    // Threads max
    f = fopen("/proc/sys/kernel/threads-max", "r");
    if (f) {
        int threads_max;
        if (fscanf(f, "%d", &threads_max) == 1) {
            printf("Threads max: %d\n", threads_max);
        }
        fclose(f);
    }
}

10,000 Thread Rule of Thumb

With the 1MB default stack, 10GB of virtual address space holds at most ~10,000 threads. Reducing the stack size to 64KB raises that ceiling to roughly 150,000-160,000 threads in the same space, depending on per-thread guard-page overhead. For applications needing more concurrent tasks, consider thread pools (which reuse threads) or M:N threading models (like Go goroutines or Java virtual threads) that multiplex many tasks onto fewer OS threads.

Best Practices for Thread Creation

Thread creation decisions significantly impact application performance, resource consumption, and scalability. Following these best practices ensures efficient concurrent systems.

Thread Creation Best Practices

• Use thread pools for task-based work: amortize creation cost across many tasks with ExecutorService (Java), the thread pool API (Windows), or a custom pool (Pthreads).
• Size stacks appropriately: the 1-8MB default is often excessive. Analyze maximum call depth and reduce the stack size when running many concurrent threads.
• Create threads at startup when possible: front-load creation cost to avoid latency spikes during operation.
• Use CREATE_SUSPENDED (Windows) for pre-configuration: set priority, affinity, etc. before the thread runs.
• Consider thread-per-core for CPU-bound work: more threads than cores adds overhead without any parallelism benefit.
• Monitor thread count: use /proc/[pid]/task (Linux), Process Explorer (Windows), or JMX (Java) to track thread growth.
• Handle creation failure gracefully: pthread_create can fail with EAGAIN and CreateThread can return NULL, so have a fallback strategy.
• Clean up properly: join or detach every thread, close handles, and free custom stacks.
• Be aware of platform differences: stack sizes, creation speeds, and limits vary significantly across operating systems.
• Profile before optimizing: measure actual creation costs in your environment before micro-optimizing.

Summary: The True Cost of a Thread

Creating a thread involves:

Time: 10-100μs for the system call sequence
Memory: 64KB-8MB stack + 20-50KB kernel structures
Limits: System-wide and per-process constraints

For high-performance systems:

Reuse threads through pools
Right-size stacks for your workload
Consider lightweight alternatives (async I/O, coroutines, virtual threads)

Thread creation is not free, but it's also not prohibitively expensive for most applications. The key is matching your threading model to your workload characteristics.

Page Complete

You now understand the complete thread creation process—from system call to scheduler insertion—across POSIX, Windows, and Java environments. You can reason about creation costs, configure stack sizes, and make informed architectural decisions. Next, we'll examine thread joining and termination: how threads coordinate their completion and clean up resources.

4 / 5

Loading learning content...

Operating SystemsThread Concepts

Thread Libraries

LevelIntermediate

Duration75 mins

TopicThread Concepts

4 / 5

Thread Creation

The Birth of a Thread

Understanding thread creation at a deep level is essential for:

Performance optimization — Thread creation has significant overhead; knowing the costs helps you design efficient systems
Debugging — Stack configuration issues, initialization failures, and resource limits become diagnosable
Architecture decisions — Choosing between creating threads directly, using thread pools, or adopting lightweight threading models

What You Will Master

Thread Creation Overview

Thread creation involves coordinated work between user-space libraries and the operating system kernel. The process can be broken down into distinct phases:

High-Level Creation Sequence

User Request — Application calls thread creation API (pthread_create, CreateThread, Thread.start)
Parameter Validation — Library validates thread function, attributes, stack size requirements
Memory Allocation — Allocate stack memory, thread-local storage, control structures
Kernel Request — System call to create kernel-level thread structure
Initialization — Set up execution context: registers, program counter, stack pointer
Scheduling — Add new thread to scheduler's ready queue
Return — Return thread identifier to caller; new thread may already be running

This sequence takes microseconds on modern systems—fast enough for most applications, but significant when creating many threads or in latency-sensitive contexts.

Thread vs Process Creation Cost
Operation	Approximate Time	Relative Cost
Thread creation	10-50 μs	1x (baseline)
Process creation (fork)	100-500 μs	10-50x
Process creation + exec	1-5 ms	100-500x
Thread pool work submission	0.1-1 μs	0.01-0.1x
Goroutine creation (Go)	0.3-1 μs	0.03-0.1x

What Thread Creation Does NOT Do

Understanding what's not involved in thread creation clarifies why threads are lighter than processes:

No address space duplication — Threads share the parent's address space
No page table copying — All threads use the same page tables
No file descriptor table copying — Threads share file descriptors
No signal handler copying — Signal disposition is per-process
No IPC overhead — Threads communicate through shared memory directly

The kernel creates a new scheduling entity within the existing process, allocating only thread-specific resources: registers, stack, kernel stack, and thread-local data.

User-Level vs Kernel-Level Creation

Stack Allocation and Layout

Stack Structure

A typical thread stack layout (high to low addresses):

High Address
+------------------------+
|     Stack Guard        |  <- Guard page (optional, triggers SIGSEGV on overflow)
+------------------------+
|                        |
|    Usable Stack        |  <- Stack grows downward
|    (function frames,   |
|     local variables)   |
|                        |
+------------------------+
|   Thread Control Block |  <- TCB / Thread Local Storage
+------------------------+
|    Red Zone (x64)      |  <- 128 bytes below RSP, usable without adjustment
+------------------------+
Low Address

Default Stack Sizes

Stack sizes vary significantly across platforms and can dramatically impact how many threads you can create:

Default Stack Sizes by Platform
Platform	User Stack	Kernel Stack	Notes
Linux (x86_64)	8 MB	16 KB	ulimit -s shows default; NPTL
Linux (arm64)	8 MB	16 KB	Similar to x86_64
macOS	512 KB (secondary)	16 KB	Main thread: 8 MB
Windows (x64)	1 MB (reserved)	24 KB	Only 4KB committed initially
Java (64-bit)	1 MB (default)	~24 KB	-Xss flag to configure
Go goroutine	2-8 KB (grows)	N/A	Dynamically grows to 1 GB

stack_allocation.c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/resource.h>
 
/*
 * Querying and Setting Stack Sizes
 */
 
void query_default_stack_size(void) {
    pthread_attr_t attr;
    size_t stack_size;
    
    pthread_attr_init(&attr);
    pthread_attr_getstacksize(&attr, &stack_size);
    printf("Default stack size: %zu bytes (%.2f MB)\n", 
           stack_size, (double)stack_size / (1024 * 1024));
    pthread_attr_destroy(&attr);
    
    // Also check system limit
    struct rlimit rlim;
    getrlimit(RLIMIT_STACK, &rlim);
    printf("Stack soft limit: %lu, hard limit: %lu\n",
           (unsigned long)rlim.rlim_cur, (unsigned long)rlim.rlim_max);
}
 
/*
 * Creating Threads with Custom Stack Size
 */
 
void *worker(void *arg) {
    // Demonstrate stack usage
    char buffer[4096];  // 4KB on stack
    memset(buffer, 0, sizeof(buffer));
    printf("Thread running with stack buffer at %p\n", (void *)buffer);
    return NULL;
}
 
int create_thread_with_custom_stack(size_t stack_bytes) {
    pthread_t tid;
    pthread_attr_t attr;
    int result;
    
    result = pthread_attr_init(&attr);
    if (result != 0) {
        fprintf(stderr, "attr_init failed: %s\n", strerror(result));
        return -1;
    }
    
    /*
     * PTHREAD_STACK_MIN is the minimum allowed (typically 16KB)
     * Stacks must be page-aligned and large enough for signal handlers
     */
    if (stack_bytes < PTHREAD_STACK_MIN) {
        fprintf(stderr, "Stack too small, minimum is %d\n", PTHREAD_STACK_MIN);
        stack_bytes = PTHREAD_STACK_MIN;
    }
    
    result = pthread_attr_setstacksize(&attr, stack_bytes);
    if (result != 0) {
        fprintf(stderr, "setstacksize failed: %s\n", strerror(result));
        pthread_attr_destroy(&attr);
        return -1;
    }
    
    result = pthread_create(&tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);  // Safe to destroy after create
    
    if (result != 0) {
        fprintf(stderr, "pthread_create failed: %s\n", strerror(result));
        return -1;
    }
    
    pthread_join(tid, NULL);
    return 0;
}
 
/*
 * Using a Pre-Allocated Stack
 * Useful for embedded systems or memory-constrained environments
 */
 
int create_thread_with_custom_stack_memory(void) {
    pthread_t tid;
    pthread_attr_t attr;
    void *stack_base;
    size_t stack_size = 64 * 1024;  // 64 KB
    int result;
    
    // Allocate aligned memory for stack
    result = posix_memalign(&stack_base, sysconf(_SC_PAGESIZE), stack_size);
    if (result != 0) {
        fprintf(stderr, "posix_memalign failed: %s\n", strerror(result));
        return -1;
    }
    
    pthread_attr_init(&attr);
    
    // Set both stack address and size
    result = pthread_attr_setstack(&attr, stack_base, stack_size);
    if (result != 0) {
        fprintf(stderr, "setstack failed: %s\n", strerror(result));
        free(stack_base);
        pthread_attr_destroy(&attr);
        return -1;
    }
    
    result = pthread_create(&tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);
    
    if (result != 0) {
        fprintf(stderr, "pthread_create failed: %s\n", strerror(result));
        free(stack_base);
        return -1;
    }
    
    pthread_join(tid, NULL);
    
    /*
     * IMPORTANT: When using pthread_attr_setstack, YOU are responsible
     * for freeing the stack memory. The thread library will NOT free it.
     */
    free(stack_base);
    
    return 0;
}
 
/*
 * Stack Guard Pages
 * Protecting against stack overflow
 */
 
int demonstrate_guard_pages(void) {
    pthread_attr_t attr;
    size_t guard_size;
    
    pthread_attr_init(&attr);
    
    // Query default guard size (typically one page)
    pthread_attr_getguardsize(&attr, &guard_size);
    printf("Default guard size: %zu bytes\n", guard_size);
    
    // Can set to 0 to disable (saves memory, loses protection)
    // pthread_attr_setguardsize(&attr, 0);
    
    // Or increase for more protection
    pthread_attr_setguardsize(&attr, 4096 * 2);  // Two pages
    
    pthread_attr_destroy(&attr);
    return 0;
}

Stack Overflow Risk

Kernel Thread Creation (Linux)

On Linux, all threads are created via the clone() system call—the universal mechanism for creating both processes and threads. Understanding clone() reveals how the kernel implements threading.

The clone() System Call

long clone(
    unsigned long flags,           // What to share
    void *child_stack,             // Stack for new thread
    pid_t *parent_tidptr,          // Where to store child TID in parent
    pid_t *child_tidptr,           // Where to store child TID in child
    unsigned long tls              // Thread-local storage descriptor
);

The flags parameter determines what resources are shared between parent and child. The difference between fork() and thread creation is just which flags are set:

Key clone() Flags for Threading
Flag	Effect	Used By
CLONE_VM	Share virtual memory (address space)	Threads (essential)
CLONE_FS	Share filesystem info (cwd, root)	Threads
CLONE_FILES	Share file descriptor table	Threads
CLONE_SIGHAND	Share signal handlers	Threads
CLONE_THREAD	Same thread group (share PID)	Threads
CLONE_SYSVSEM	Share System V semaphore adjustments	Threads
CLONE_SETTLS	Create new TLS for child	Threads
CLONE_PARENT_SETTID	Store TID in parent address space	pthread_create
CLONE_CHILD_CLEARTID	Clear TID on exit (for pthread_join)	NPTL

clone_internals.c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
/*
 * How pthread_create() uses clone() internally (simplified)
 * This is what NPTL does under the hood
 */
 
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
 
/*
 * Conceptual implementation of pthread_create-like functionality
 * using raw clone() system call
 */
 
// Thread entry point wrapper
static int thread_wrapper(void *arg) {
    void (*start_routine)(void *) = ((void **)arg)[0];
    void *thread_arg = ((void **)arg)[1];
    
    // Call the user's thread function
    start_routine(thread_arg);
    
    // Thread exits here
    return 0;
}
 
// Simplified thread creation
int simple_thread_create(void (*start_routine)(void *), void *arg) {
    const size_t STACK_SIZE = 1024 * 1024;  // 1 MB
    
    // Allocate stack
    void *stack = malloc(STACK_SIZE);
    if (!stack) {
        return -1;
    }
    
    // Stack grows down, so start at top
    void *stack_top = (char *)stack + STACK_SIZE;
    
    // Package arguments
    void *args[2] = { start_routine, arg };
    
    /*
     * clone() flags for creating a thread:
     * 
     * CLONE_VM:       Share address space (critical for threads)
     * CLONE_FS:       Share filesystem context
     * CLONE_FILES:    Share file descriptors
     * CLONE_SIGHAND:  Share signal handlers
     * CLONE_THREAD:   Same thread group (same getpid() result)
     * CLONE_SYSVSEM:  Share SysV semaphore undo values
     * SIGCHLD:        Signal to send parent on termination
     */
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                CLONE_THREAD | CLONE_SYSVSEM | SIGCHLD;
    
    // Create the thread
    pid_t tid = clone(thread_wrapper, stack_top, flags, args);
    
    if (tid == -1) {
        free(stack);
        perror("clone");
        return -1;
    }
    
    printf("Created thread with TID: %d\n", tid);
    return tid;
}
 
/*
 * What the kernel does when clone() is called:
 * 
 * 1. Allocate task_struct for new thread
 * 2. Copy/share resources based on flags:
 *    - If CLONE_VM: share mm_struct (page tables, mappings)
 *    - If CLONE_FILES: share files_struct (fd table)
 *    - etc.
 * 3. Allocate kernel stack (separate from user stack)
 * 4. Set up initial register state:
 *    - Stack pointer -> child_stack
 *    - Instruction pointer -> clone return address (child starts in clone())
 * 5. Add to scheduler's run queue
 * 6. Return TID to parent, 0 to child
 */
 
/*
 * Kernel data structures created for each thread:
 *
 * task_struct (~6 KB on 64-bit Linux)
 *   - Process/thread state
 *   - Scheduling information
 *   - Links to shared resources
 *   - Signal handling state
 *   - Credentials
 *   - Timers
 *
 * thread_info + kernel stack (~16 KB)
 *   - Architecture-specific thread state
 *   - Kernel-mode execution stack
 *
 * Optional: thread_struct
 *   - Floating point state
 *   - Debug registers
 */

PID vs TID

Thread Creation on Windows

Windows Thread Creation Sequence

Validate parameters — Check thread function pointer, stack size, etc.
Reserve/commit stack — Virtual memory for user-mode stack
Create kernel structures — ETHREAD, KTHREAD in kernel pool
Initialize TEB — Thread Environment Block in user-space
Set initial context — Register state for new thread
Insert in process — Link thread to process's thread list
Signal creation event — Thread is ready
Optional initial suspend — If CREATE_SUSPENDED flag set
Begin scheduling — Thread becomes runnable

windows_creation.c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
#include <windows.h>
#include <stdio.h>
 
/*
 * Windows Thread Creation Deep Dive
 */
 
/*
 * CreateThread() Parameters Explained
 */
void demonstrate_create_thread(void) {
    HANDLE hThread;
    DWORD threadId;
    
    hThread = CreateThread(
        NULL,           // lpThreadAttributes:
                        //   Security descriptor for thread handle
                        //   NULL = default security, not inheritable
        
        0,              // dwStackSize:
                        //   Initial stack size in bytes
                        //   0 = default (1 MB reserved, 1 page committed)
        
        ThreadProc,     // lpStartAddress:
                        //   Thread entry point function
                        //   Must match LPTHREAD_START_ROUTINE signature
        
        (LPVOID)42,     // lpParameter:
                        //   Argument passed to thread function
        
        0,              // dwCreationFlags:
                        //   0 = run immediately
                        //   CREATE_SUSPENDED = start suspended
                        //   STACK_SIZE_PARAM_IS_A_RESERVATION = dwStackSize is reservation, not commit
        
        &threadId       // lpThreadId:
                        //   Receives the thread ID
                        //   Can be NULL if not needed
    );
    
    if (hThread == NULL) {
        printf("CreateThread failed: %lu\n", GetLastError());
    }
}
 
/*
 * Stack Size Control on Windows
 */
void demonstrate_stack_sizes(void) {
    HANDLE hThread;
    
    // Default: 1 MB reserved, 1 page (4KB) committed
    // Stack grows and commits more pages as needed
    hThread = CreateThread(NULL, 0, ThreadProc, NULL, 0, NULL);
    CloseHandle(hThread);
    
    // Reserve 256KB, commit as needed
    hThread = CreateThread(NULL, 256 * 1024, ThreadProc, NULL,
                          STACK_SIZE_PARAM_IS_A_RESERVATION, NULL);
    CloseHandle(hThread);
    
    // Commit 256KB immediately (all stack memory is committed upfront)
    hThread = CreateThread(NULL, 256 * 1024, ThreadProc, NULL,
                          0, NULL);  // Without STACK_SIZE_PARAM_IS_A_RESERVATION
    CloseHandle(hThread);
}
 
/*
 * Creating Suspended Threads
 * Useful for setting thread properties before it runs
 */
void demonstrate_suspended_creation(void) {
    HANDLE hThread;
    DWORD threadId;
    
    // Create in suspended state
    hThread = CreateThread(NULL, 0, ThreadProc, NULL,
                          CREATE_SUSPENDED, &threadId);
    
    if (hThread == NULL) {
        printf("CreateThread failed: %lu\n", GetLastError());
        return;
    }
    
    // Configure thread before it runs
    SetThreadPriority(hThread, THREAD_PRIORITY_HIGHEST);
    
    // Set processor affinity
    SetThreadAffinityMask(hThread, 0x1);  // CPU 0 only
    
    // Now start it
    if (ResumeThread(hThread) == (DWORD)-1) {
        printf("ResumeThread failed: %lu\n", GetLastError());
    }
    
    WaitForSingleObject(hThread, INFINITE);
    CloseHandle(hThread);
}
 
/*
 * NtCreateThreadEx: The low-level API
 * (Not documented, but used by CreateThread internally)
 */
 
// Declaration (from ntdll.dll)
typedef NTSTATUS (NTAPI *pNtCreateThreadEx)(
    OUT PHANDLE ThreadHandle,
    IN ACCESS_MASK DesiredAccess,
    IN POBJECT_ATTRIBUTES ObjectAttributes OPTIONAL,
    IN HANDLE ProcessHandle,
    IN LPTHREAD_START_ROUTINE StartAddress,
    IN LPVOID Parameter OPTIONAL,
    IN ULONG CreateFlags,  // THREAD_CREATE_FLAGS_*
    IN SIZE_T ZeroBits OPTIONAL,
    IN SIZE_T StackSize OPTIONAL,
    IN SIZE_T MaximumStackSize OPTIONAL,
    IN PVOID AttributeList OPTIONAL
);
 
/*
 * Kernel structures created:
 *
 * ETHREAD (Executive Thread Object)
 *   - Win32StartAddress (user entry point)
 *   - Process link
 *   - IrpList (pending I/O requests)
 *   - ThreadListEntry
 *   - Create/exit times
 *   - Timer data
 *
 * KTHREAD (Kernel Thread Object) 
 *   - Embedded in ETHREAD
 *   - State (Running, Ready, Waiting, etc.)
 *   - Priority (base and dynamic)
 *   - Quantum (time slice)
 *   - Stack pointers (kernel and user)
 *   - Wait blocks
 *   - Context (registers)
 *
 * TEB (Thread Environment Block, user space)
 *   - TLS array
 *   - Stack boundaries
 *   - Last error value
 *   - Exception handling chain
 *   - Self-reference (NtTib.Self)
 */
 
DWORD WINAPI ThreadProc(LPVOID lpParam) {
    printf("Thread running!\n");
    return 0;
}

Windows Stack Reservation vs Commitment

Java Thread Creation Internals

Java Thread Creation Sequence

Java validation — Check thread state (must be NEW), validate Thread object
Native transition — JNI call to JVM_StartThread in the JVM
JVM thread structure — Allocate JavaThread C++ object
Stack allocation — Allocate Java stack (for interpreted code) and native stack
OS thread creation — Call OS API (pthread_create, CreateThread)
Thread initialization — Set up TLS, stack traces capability, safepoint tracking
JVM registration — Add to thread list, attach to current thread group
Memory barrier — Ensure visibility of Thread object fields
run() invocation — Call Thread.run() method (or Runnable.run())

JavaThreadCreation.java
Java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
import java.lang.management.*;
import java.util.concurrent.*;
 
/**
 * Java Thread Creation Internals and Configuration
 */
public class JavaThreadCreation {
 
    /**
     * Examining the thread creation sequence
     */
    public static void examineThreadCreation() {
        Thread t = new Thread(() -> {
            System.out.println("Thread running!");
        });
        
        // State is NEW (thread object exists, native thread does not)
        System.out.println("Before start: " + t.getState());  // NEW
        
        t.start();  // Native thread created here!
        
        // State is now RUNNABLE (or may have already terminated)
        System.out.println("After start: " + t.getState());  // RUNNABLE
        
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        
        // State is TERMINATED
        System.out.println("After join: " + t.getState());  // TERMINATED
    }
    
    /**
     * Stack size configuration via Thread constructor
     */
    public static void customStackSize() {
        // Create thread with 256KB stack
        Thread t = new Thread(
            null,           // ThreadGroup
            () -> {
                System.out.println("Small stack thread");
                // Be careful with recursive calls!
            },
            "SmallStackThread",
            256 * 1024      // Stack size in bytes
        );
        
        t.start();
        
        /*
         * Notes on stack size:
         * - Default is 1MB on most 64-bit JVMs
         * - Minimum is platform-dependent
         * - JVM may ignore the hint (it's advisory!)
         * - Use -Xss flag for default: java -Xss256k MyClass
         * - -XX:ThreadStackSize for JVM threads
         */
    }
    
    /**
     * Thread creation cost measurement
     */
    public static void measureCreationCost() {
        final int ITERATIONS = 1000;
        
        // Warm up
        for (int i = 0; i < 100; i++) {
            Thread t = new Thread(() -> {});
            t.start();
            try { t.join(); } catch (InterruptedException e) {}
        }
        
        // Measure
        long start = System.nanoTime();
        
        for (int i = 0; i < ITERATIONS; i++) {
            Thread t = new Thread(() -> {
                // Empty task
            });
            t.start();
            try { t.join(); } catch (InterruptedException e) {}
        }
        
        long elapsed = System.nanoTime() - start;
        double avgMicros = (elapsed / 1000.0) / ITERATIONS;
        
        System.out.printf("Average thread create+join: %.2f μs%n", avgMicros);
    }
    
    /**
     * Thread pool vs direct creation performance
     */
    public static void poolVsDirect() throws Exception {
        final int TASKS = 10000;
        Runnable task = () -> {
            // Minimal work
            int x = 0;
            for (int i = 0; i < 100; i++) x += i;
        };
        
        // Direct thread creation
        long start = System.nanoTime();
        for (int i = 0; i < TASKS; i++) {
            Thread t = new Thread(task);
            t.start();
            t.join();
        }
        long directTime = System.nanoTime() - start;
        
        // Thread pool
        ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors()
        );
        
        start = System.nanoTime();
        for (int i = 0; i < TASKS; i++) {
            pool.submit(task).get();
        }
        long poolTime = System.nanoTime() - start;
        
        pool.shutdown();
        
        System.out.printf("Direct: %.2f ms%n", directTime / 1e6);
        System.out.printf("Pool: %.2f ms%n", poolTime / 1e6);
        System.out.printf("Speedup: %.2fx%n", (double)directTime / poolTime);
    }
    
    /**
     * Thread information via management APIs
     */
    public static void threadInfoViaManagement() {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        
        System.out.println("Thread count: " + threadMXBean.getThreadCount());
        System.out.println("Peak thread count: " + threadMXBean.getPeakThreadCount());
        System.out.println("Daemon thread count: " + threadMXBean.getDaemonThreadCount());
        System.out.println("Total started: " + threadMXBean.getTotalStartedThreadCount());
        
        // Get info about current thread
        long tid = Thread.currentThread().getId();
        ThreadInfo info = threadMXBean.getThreadInfo(tid);
        
        System.out.println("\nCurrent thread info:");
        System.out.println("  Name: " + info.getThreadName());
        System.out.println("  State: " + info.getThreadState());
        System.out.println("  Blocked count: " + info.getBlockedCount());
        System.out.println("  Waited count: " + info.getWaitedCount());
    }
}

JVM Native Thread Overhead

Thread Creation Costs

Time Cost Breakdown

The time to create a thread can be decomposed into:

Thread Creation Time Components (approximate)
Component	Time	Description
User-space setup	1-5 μs	Attribute parsing, allocation bookkeeping
System call entry	0.5-1 μs	Mode switch to kernel
Kernel structure allocation	5-20 μs	task_struct, thread_info, etc.
Stack allocation	2-10 μs	Virtual memory mapping, guard pages
Scheduler insertion	1-5 μs	Add to run queue, potential IPI
Return to user space	0.5-1 μs	Mode switch back
Total	10-50 μs	Typical range on modern systems

Memory Cost Per Thread

Each thread consumes memory for:

User stack: 64KB - 8MB (application-controlled)
Kernel stack: 16-24KB (fixed by OS)
Kernel structures: 6-10KB (task_struct, etc.)
TLS/TSD: Variable (depends on application)
Thread-local allocator state: 0-64KB (for jemalloc, tcmalloc)

Example calculation: 1000 threads with 1MB stacks each = ~1GB memory just for stacks, plus ~30MB for kernel structures.

Scalability Limits

Maximum thread count is constrained by:

Virtual address space: Total stack space cannot exceed available VA
Kernel memory: Each thread needs kernel structures
PID/TID namespace: Usually 32K-4M depending on configuration
File descriptors: Default per-process limit (~1024 on Linux)
ulimits: RLIMIT_NPROC limits threads per user

thread_limits.c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <time.h>          // clock_gettime
#include <unistd.h>        // pause, sysconf
#include <sys/resource.h>
#include <sys/sysinfo.h>
 
/*
 * Measuring Thread Creation Costs
 */
 
void *empty_thread(void *arg) {
    return NULL;
}
 
void measure_creation_time(void) {
    const int ITERATIONS = 1000;
    pthread_t tids[ITERATIONS];
    struct timespec start, end;
    int created = 0;  // Number of threads that were actually started
    
    clock_gettime(CLOCK_MONOTONIC, &start);
    
    for (int i = 0; i < ITERATIONS; i++) {
        int result = pthread_create(&tids[i], NULL, empty_thread, NULL);
        if (result != 0) {
            fprintf(stderr, "pthread_create failed at %d: %s\n", 
                    i, strerror(result));
            break;
        }
        created++;
    }
    
    // Join only the threads that were created; joining an
    // uninitialized pthread_t is undefined behavior
    for (int i = 0; i < created; i++) {
        pthread_join(tids[i], NULL);
    }
    
    clock_gettime(CLOCK_MONOTONIC, &end);
    
    if (created == 0) {
        return;  // Nothing to report
    }
    
    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 + 
                        (end.tv_nsec - start.tv_nsec);
    double per_thread_us = elapsed_ns / created / 1000.0;
    
    // The measured interval covers both creation and join
    printf("Created and joined %d threads in %.2f ms\n", 
           created, elapsed_ns / 1e6);
    printf("Average per thread: %.2f μs\n", per_thread_us);
}
 
/*
 * Finding maximum thread count
 */
 
void *wait_forever(void *arg) {
    while (1) {
        pause();  // Wait for signal
    }
    return NULL;
}
 
int find_max_threads(size_t stack_size) {
    pthread_attr_t attr;
    pthread_t *tids = NULL;
    int count = 0;
    int capacity = 1024;
    
    pthread_attr_init(&attr);
    
    if (stack_size > 0) {
        pthread_attr_setstacksize(&attr, stack_size);
        // Disable guard pages to save memory
        pthread_attr_setguardsize(&attr, 0);
    }
    
    tids = malloc(capacity * sizeof(pthread_t));
    if (!tids) {
        perror("malloc");
        return -1;
    }
    
    printf("Creating threads with %zu byte stacks...\n", stack_size);
    
    while (1) {
        if (count >= capacity) {
            capacity *= 2;
            pthread_t *new_tids = realloc(tids, capacity * sizeof(pthread_t));
            if (!new_tids) {
                printf("Realloc failed at %d threads\n", count);
                break;
            }
            tids = new_tids;
        }
        
        int result = pthread_create(&tids[count], &attr, wait_forever, NULL);
        if (result != 0) {
            printf("pthread_create failed at %d: %s\n", 
                   count, strerror(result));
            break;
        }
        
        count++;
        
        if (count % 1000 == 0) {
            printf("Created %d threads...\n", count);
        }
    }
    
    printf("Maximum threads created: %d\n", count);
    
    // Cleanup (will take a while)
    printf("Cleaning up...\n");
    for (int i = 0; i < count; i++) {
        pthread_cancel(tids[i]);
    }
    for (int i = 0; i < count; i++) {
        pthread_join(tids[i], NULL);
    }
    
    free(tids);
    pthread_attr_destroy(&attr);
    
    return count;
}
 
/*
 * Checking system limits
 */
 
void show_system_limits(void) {
    struct rlimit rlim;
    
    // Threads per process (actually per-user on Linux)
    getrlimit(RLIMIT_NPROC, &rlim);
    printf("RLIMIT_NPROC: soft=%lu, hard=%lu\n",
           (unsigned long)rlim.rlim_cur, (unsigned long)rlim.rlim_max);
    
    // Stack size
    getrlimit(RLIMIT_STACK, &rlim);
    printf("RLIMIT_STACK: soft=%lu, hard=%lu\n",
           (unsigned long)rlim.rlim_cur, (unsigned long)rlim.rlim_max);
    
    // Virtual memory
    getrlimit(RLIMIT_AS, &rlim);
    printf("RLIMIT_AS: soft=%lu, hard=%lu\n",
           (unsigned long)rlim.rlim_cur, (unsigned long)rlim.rlim_max);
    
    // System-wide limits
    printf("\nSystem info:\n");
    printf("Available RAM: %lu MB\n", 
           (unsigned long)(get_avphys_pages() * sysconf(_SC_PAGESIZE) / 1024 / 1024));
    printf("Total RAM: %lu MB\n",
           (unsigned long)(get_phys_pages() * sysconf(_SC_PAGESIZE) / 1024 / 1024));
    printf("CPUs: %d\n", get_nprocs());
    
    // PID max
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");
    if (f) {
        int pid_max;
        if (fscanf(f, "%d", &pid_max) == 1) {
            printf("PID max: %d\n", pid_max);
        }
        fclose(f);
    }
    
    // Threads max
    f = fopen("/proc/sys/kernel/threads-max", "r");
    if (f) {
        int threads_max;
        if (fscanf(f, "%d", &threads_max) == 1) {
            printf("Threads max: %d\n", threads_max);
        }
        fclose(f);
    }
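}
 
int main(int argc, char *argv[]) {
    show_system_limits();
    measure_creation_time();
    
    // find_max_threads() deliberately exhausts thread limits,
    // so run it only when a stack size (in bytes) is given
    if (argc > 1) {
        find_max_threads((size_t)strtoul(argv[1], NULL, 10));
    }
    return 0;
}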

10,000 Thread Rule of Thumb

Best Practices for Thread Creation

Thread creation decisions significantly impact application performance, resource consumption, and scalability. Following these best practices ensures efficient concurrent systems.

Thread Creation Best Practices

• Use thread pools for task-based work — Amortize creation cost across many tasks. ExecutorService (Java), thread pool API (Windows), or custom pool (Pthreads).
• Size stacks appropriately — Default 1-8MB is often excessive. Analyze maximum call depth and reduce stack size for many concurrent threads.
• Create threads at startup when possible — Front-load creation cost to avoid latency spikes during operation.
• Use CREATE_SUSPENDED (Windows) for pre-configuration — Set priority, affinity, etc. before the thread runs.
• Consider thread-per-core for CPU-bound work — More threads than cores adds overhead without parallelism benefit.
• Monitor thread count — Use /proc/[pid]/task (Linux), Process Explorer (Windows), or JMX (Java) to track thread growth.
• Handle creation failure gracefully — pthread_create can fail (EAGAIN); CreateThread can return NULL. Have a fallback strategy.
• Clean up properly — Join or detach every thread; close handles; free custom stacks.
• Be aware of platform differences — Stack sizes, creation speeds, and limits vary significantly across operating systems.
• Profile before optimizing — Measure actual creation costs in your environment before micro-optimizing.

Summary: The True Cost of a Thread

Creating a thread involves:

Time: 10-100 μs for the system call sequence
Memory: 64KB-8MB stack + 20-50KB kernel structures
Limits: System-wide and per-process constraints

For high-performance systems:

Reuse threads through pools
Right-size stacks for your workload
Consider lightweight alternatives (async I/O, coroutines, virtual threads)
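To make "reuse threads through pools" concrete, here is a minimal fixed-size pool built on Pthreads. It is a sketch under stated assumptions: the `thread_pool` type, the `pool_*` functions, the `add_one` task, and the queue capacity of 64 are all hypothetical choices for illustration, error handling is abbreviated, and tasks must not be submitted after shutdown begins.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define POOL_QUEUE_CAP 64  /* assumption: small fixed ring buffer */

typedef struct { void (*fn)(void *); void *arg; } pool_task;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
    pool_task       queue[POOL_QUEUE_CAP];
    size_t          head, count;   /* ring buffer state */
    int             shutting_down;
    size_t          nworkers;
    pthread_t      *workers;
} thread_pool;

/* Example task (hypothetical): atomically add 1 to a counter. */
static void add_one(void *arg) { atomic_fetch_add((atomic_int *)arg, 1); }

/* Worker loop: take tasks until shutdown is requested and the queue is empty. */
static void *pool_worker(void *p) {
    thread_pool *tp = p;
    for (;;) {
        pthread_mutex_lock(&tp->lock);
        while (tp->count == 0 && !tp->shutting_down)
            pthread_cond_wait(&tp->not_empty, &tp->lock);
        if (tp->count == 0) {              /* shutting down and drained */
            pthread_mutex_unlock(&tp->lock);
            return NULL;
        }
        pool_task t = tp->queue[tp->head];
        tp->head = (tp->head + 1) % POOL_QUEUE_CAP;
        tp->count--;
        pthread_cond_signal(&tp->not_full);
        pthread_mutex_unlock(&tp->lock);
        t.fn(t.arg);                       /* run the task outside the lock */
    }
}

/* Pay the thread creation cost once, up front. Returns 0 on success. */
static int pool_init(thread_pool *tp, size_t nworkers) {
    tp->head = tp->count = 0;
    tp->shutting_down = 0;
    tp->nworkers = 0;
    tp->workers = malloc(nworkers * sizeof *tp->workers);
    if (!tp->workers) return -1;
    pthread_mutex_init(&tp->lock, NULL);
    pthread_cond_init(&tp->not_empty, NULL);
    pthread_cond_init(&tp->not_full, NULL);
    for (size_t i = 0; i < nworkers; i++) {
        if (pthread_create(&tp->workers[i], NULL, pool_worker, tp) != 0)
            break;                         /* keep the workers we did get */
        tp->nworkers++;
    }
    if (tp->nworkers == 0) {
        free(tp->workers);
        pthread_mutex_destroy(&tp->lock);
        pthread_cond_destroy(&tp->not_empty);
        pthread_cond_destroy(&tp->not_full);
        return -1;
    }
    return 0;
}

/* Queue a task; blocks while the queue is full. */
static void pool_submit(thread_pool *tp, void (*fn)(void *), void *arg) {
    pthread_mutex_lock(&tp->lock);
    while (tp->count == POOL_QUEUE_CAP)
        pthread_cond_wait(&tp->not_full, &tp->lock);
    tp->queue[(tp->head + tp->count) % POOL_QUEUE_CAP] = (pool_task){ fn, arg };
    tp->count++;
    pthread_cond_signal(&tp->not_empty);
    pthread_mutex_unlock(&tp->lock);
}

/* Let the workers drain the queue, then join them and release resources. */
static void pool_shutdown(thread_pool *tp) {
    pthread_mutex_lock(&tp->lock);
    tp->shutting_down = 1;
    pthread_cond_broadcast(&tp->not_empty);
    pthread_mutex_unlock(&tp->lock);
    for (size_t i = 0; i < tp->nworkers; i++)
        pthread_join(tp->workers[i], NULL);
    free(tp->workers);
    pthread_mutex_destroy(&tp->lock);
    pthread_cond_destroy(&tp->not_empty);
    pthread_cond_destroy(&tp->not_full);
}
```

Typical use is `pool_init(&tp, 4)`, a series of `pool_submit(&tp, add_one, &counter)` calls, then `pool_shutdown(&tp)`. The bounded queue provides back-pressure: a producer that outruns the workers blocks in `pool_submit` instead of growing memory without limit.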

Thread creation is not free, but it's also not prohibitively expensive for most applications. The key is matching your threading model to your workload characteristics.
