Creating a thread is one of the most fundamental operations in concurrent programming, yet the mechanics behind it are often treated as a black box. When you call pthread_create(), CreateThread(), or new Thread().start(), what actually happens? How does the operating system allocate resources, set up execution context, and prepare a new flow of control within your process?
Understanding thread creation at a deep level is essential for:
This page dissects the thread creation process across POSIX, Windows, and Java platforms, revealing the kernel interactions, memory allocations, and initialization sequences that bring a thread to life.
By the end of this page, you will understand the complete thread creation lifecycle: stack allocation and layout, kernel data structure creation, thread attributes and their effects, the differences between fork() and thread creation, and the true costs of thread creation across platforms.
Thread creation involves coordinated work between user-space libraries and the operating system kernel. The process can be broken down into distinct phases:
This sequence typically takes tens of microseconds on modern systems. That is fast enough for most applications, but it becomes significant when creating many threads or in latency-sensitive contexts.
| Operation | Approximate Time | Relative Cost |
|---|---|---|
| Thread creation | 10-50 μs | 1x (baseline) |
| Process creation (fork) | 100-500 μs | 10-50x |
| Process creation + exec | 1-5 ms | 100-500x |
| Thread pool work submission | 0.1-1 μs | 0.01-0.1x |
| Goroutine creation (Go) | 0.3-1 μs | 0.03-0.1x |
Understanding what's not involved in thread creation clarifies why threads are lighter than processes:
The kernel creates a new scheduling entity within the existing process, allocating only thread-specific resources: registers, stack, kernel stack, and thread-local data.
In user-level threading models (green threads), thread 'creation' happens entirely in user space without kernel involvement—just allocating a stack and adding to a user-space run queue. This is extremely fast (sub-microsecond) but sacrifices true parallelism. Modern systems use kernel-level threads where each thread creation involves a system call.
Every thread requires a stack—a contiguous region of memory for function calls, local variables, and return addresses. Stack allocation is one of the most significant aspects of thread creation, both in terms of memory consumption and configuration complexity.
A typical thread stack layout for a downward-growing stack (high to low addresses; the TCB placement shown is the one glibc uses on x86-64):

```
High Address
+------------------------+
| Thread Control Block   | <- TCB / Thread Local Storage
+------------------------+
|                        |
|     Usable Stack       | <- Stack grows downward
|   (function frames,    |
|    local variables)    |
|                        |
+------------------------+
|    Red Zone (x64)      | <- 128 bytes below RSP, usable without adjustment (moves with RSP)
+------------------------+
|     Stack Guard        | <- Guard page (optional, triggers SIGSEGV on overflow)
+------------------------+
Low Address
```
Stack sizes vary significantly across platforms and can dramatically impact how many threads you can create:
| Platform | User Stack | Kernel Stack | Notes |
|---|---|---|---|
| Linux (x86_64) | 8 MB | 16 KB | ulimit -s shows default; NPTL |
| Linux (arm64) | 8 MB | 16 KB | Similar to x86_64 |
| macOS | 512 KB (secondary) | 16 KB | Main thread: 8 MB |
| Windows (x64) | 1 MB (reserved) | 24 KB | Only 4KB committed initially |
| Java (64-bit) | 1 MB (default) | ~24 KB | -Xss flag to configure |
| Go goroutine | 2-8 KB (grows) | N/A | Dynamically grows to 1 GB |
```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
#include <limits.h>        /* PTHREAD_STACK_MIN */
#include <unistd.h>
#include <sys/resource.h>

/*
 * Querying and Setting Stack Sizes
 */
void query_default_stack_size(void) {
    pthread_attr_t attr;
    size_t stack_size;

    pthread_attr_init(&attr);
    pthread_attr_getstacksize(&attr, &stack_size);
    printf("Default stack size: %zu bytes (%.2f MB)\n",
           stack_size, (double)stack_size / (1024 * 1024));
    pthread_attr_destroy(&attr);

    // Also check system limit
    struct rlimit rlim;
    getrlimit(RLIMIT_STACK, &rlim);
    printf("Stack soft limit: %lu, hard limit: %lu\n",
           (unsigned long)rlim.rlim_cur, (unsigned long)rlim.rlim_max);
}

/*
 * Creating Threads with Custom Stack Size
 */
void *worker(void *arg) {
    (void)arg;

    // Demonstrate stack usage
    char buffer[4096];  // 4KB on stack
    memset(buffer, 0, sizeof(buffer));
    printf("Thread running with stack buffer at %p\n", (void *)buffer);
    return NULL;
}

int create_thread_with_custom_stack(size_t stack_bytes) {
    pthread_t tid;
    pthread_attr_t attr;
    int result;

    result = pthread_attr_init(&attr);
    if (result != 0) {
        fprintf(stderr, "attr_init failed: %s\n", strerror(result));
        return -1;
    }

    /*
     * PTHREAD_STACK_MIN is the minimum allowed (typically 16KB).
     * The stack must also be large enough for signal handlers.
     * The cast is needed because newer glibc defines the macro
     * as a sysconf() call that returns long.
     */
    if (stack_bytes < (size_t)PTHREAD_STACK_MIN) {
        fprintf(stderr, "Stack too small, minimum is %ld\n",
                (long)PTHREAD_STACK_MIN);
        stack_bytes = (size_t)PTHREAD_STACK_MIN;
    }

    result = pthread_attr_setstacksize(&attr, stack_bytes);
    if (result != 0) {
        fprintf(stderr, "setstacksize failed: %s\n", strerror(result));
        pthread_attr_destroy(&attr);
        return -1;
    }

    result = pthread_create(&tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);  // Safe to destroy after create

    if (result != 0) {
        fprintf(stderr, "pthread_create failed: %s\n", strerror(result));
        return -1;
    }

    pthread_join(tid, NULL);
    return 0;
}

/*
 * Using a Pre-Allocated Stack
 * Useful for embedded systems or memory-constrained environments
 */
int create_thread_with_custom_stack_memory(void) {
    pthread_t tid;
    pthread_attr_t attr;
    void *stack_base;
    size_t stack_size = 64 * 1024;  // 64 KB
    int result;

    // Allocate page-aligned memory for the stack
    result = posix_memalign(&stack_base, (size_t)sysconf(_SC_PAGESIZE), stack_size);
    if (result != 0) {
        fprintf(stderr, "posix_memalign failed: %s\n", strerror(result));
        return -1;
    }

    result = pthread_attr_init(&attr);
    if (result != 0) {
        fprintf(stderr, "attr_init failed: %s\n", strerror(result));
        free(stack_base);
        return -1;
    }

    // Set both stack address (lowest byte) and size.
    // The guard size attribute is ignored for caller-supplied stacks.
    result = pthread_attr_setstack(&attr, stack_base, stack_size);
    if (result != 0) {
        fprintf(stderr, "setstack failed: %s\n", strerror(result));
        free(stack_base);
        pthread_attr_destroy(&attr);
        return -1;
    }

    result = pthread_create(&tid, &attr, worker, NULL);
    pthread_attr_destroy(&attr);

    if (result != 0) {
        fprintf(stderr, "pthread_create failed: %s\n", strerror(result));
        free(stack_base);
        return -1;
    }

    pthread_join(tid, NULL);

    /*
     * IMPORTANT: When using pthread_attr_setstack, YOU are responsible
     * for freeing the stack memory. The thread library will NOT free it.
     * Free it only after the thread has been joined.
     */
    free(stack_base);
    return 0;
}

/*
 * Stack Guard Pages
 * Protecting against stack overflow
 */
int demonstrate_guard_pages(void) {
    pthread_attr_t attr;
    size_t guard_size;

    pthread_attr_init(&attr);

    // Query default guard size (typically one page)
    pthread_attr_getguardsize(&attr, &guard_size);
    printf("Default guard size: %zu bytes\n", guard_size);

    // Can set to 0 to disable (saves address space, loses protection)
    // pthread_attr_setguardsize(&attr, 0);

    // Or increase for more protection
    pthread_attr_setguardsize(&attr, 2 * (size_t)sysconf(_SC_PAGESIZE));  // Two pages

    pthread_attr_destroy(&attr);
    return 0;
}
```

Stack overflow doesn't just crash one thread: without a guard page it can silently corrupt adjacent memory, including other threads' stacks. Guard pages turn an overflow into an immediate fault, at the cost of extra address space per thread. When reducing stack sizes for many threads, carefully audit maximum stack depth, including signal handlers, which execute on the thread's stack unless an alternate stack has been installed with sigaltstack().
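To make the guard-page idea concrete, here is a minimal sketch of a caller-supplied stack that carries its own guard page, which matters because the guard size attribute does not apply to stacks passed through pthread_attr_setstack(). It assumes a downward-growing stack on Linux or a similar POSIX system; the names `run_thread_on_guarded_stack` and `touch_stack` are hypothetical and exist only for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical thread body: touches a small buffer on its own stack. */
static void *touch_stack(void *arg) {
    volatile char buf[1024];
    buf[0] = 1;
    *(int *)arg = buf[0];   /* report back that the thread ran */
    return NULL;
}

/*
 * Maps usable_bytes of stack plus one extra page, makes the LOWEST page
 * inaccessible, and runs a thread on the remainder. Because the stack
 * grows downward, an overflow hits the PROT_NONE page and faults instead
 * of overwriting neighbouring memory.
 * Returns 0 on success, -1 on failure.
 */
int run_thread_on_guarded_stack(size_t usable_bytes) {
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && usable_bytes % (size_t)page == 0);
    size_t total = usable_bytes + (size_t)page;

    char *region = mmap(NULL, total, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    /* Guard page at the low end of the mapping. */
    if (mprotect(region, (size_t)page, PROT_NONE) != 0) {
        munmap(region, total);
        return -1;
    }

    pthread_attr_t attr;
    pthread_t tid;
    int ran = 0;
    int rc = pthread_attr_init(&attr);
    if (rc == 0) {
        /* The usable stack starts just above the guard page. */
        rc = pthread_attr_setstack(&attr, region + page, usable_bytes);
        if (rc == 0)
            rc = pthread_create(&tid, &attr, touch_stack, &ran);
        pthread_attr_destroy(&attr);
    }
    if (rc == 0)
        pthread_join(tid, NULL);

    /* Unmap only after the thread has been joined. */
    munmap(region, total);
    return (rc == 0 && ran == 1) ? 0 : -1;
}
```

Calling `run_thread_on_guarded_stack(256 * 1024)` runs one thread on a 256 KB stack. Using mmap() rather than malloc() keeps the region page-aligned and lets mprotect() be applied without affecting unrelated heap data.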
On Linux, all threads are created via the clone() system call—the universal mechanism for creating both processes and threads. Understanding clone() reveals how the kernel implements threading.
```c
/* Raw system call, x86-64 argument order (the order differs on some architectures) */
long clone(
    unsigned long flags,    // What to share
    void *child_stack,      // Stack for new thread
    pid_t *parent_tidptr,   // Where to store child TID in parent
    pid_t *child_tidptr,    // Where to store child TID in child
    unsigned long tls       // Thread-local storage descriptor
);
```
The flags parameter determines what resources are shared between parent and child. The difference between fork() and thread creation is just which flags are set:
| Flag | Effect | Used By |
|---|---|---|
| CLONE_VM | Share virtual memory (address space) | Threads (essential) |
| CLONE_FS | Share filesystem info (cwd, root) | Threads |
| CLONE_FILES | Share file descriptor table | Threads |
| CLONE_SIGHAND | Share signal handlers | Threads |
| CLONE_THREAD | Same thread group (share PID) | Threads |
| CLONE_SYSVSEM | Share System V semaphore adjustments | Threads |
| CLONE_SETTLS | Set the TLS descriptor for the child | Threads |
| CLONE_PARENT_SETTID | Store TID in parent address space | pthread_create |
| CLONE_CHILD_CLEARTID | Clear TID on exit (for pthread_join) | NPTL |
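The flag table can be read as two bitmasks. The sketch below builds a thread-style mask from exactly the flags listed above and a deliberately simplified fork-style mask that shares nothing; it also encodes the dependency rule documented in clone(2), where CLONE_THREAD requires CLONE_SIGHAND and CLONE_SIGHAND requires CLONE_VM. The function names are hypothetical, and the fork-style mask omits the extra TID-related flags a real fork() implementation may pass.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>

/* Kernel rule: CLONE_THREAD needs CLONE_SIGHAND, which needs CLONE_VM. */
int clone_flags_consistent(unsigned long flags) {
    if ((flags & CLONE_THREAD) && !(flags & CLONE_SIGHAND))
        return 0;
    if ((flags & CLONE_SIGHAND) && !(flags & CLONE_VM))
        return 0;
    return 1;
}

/* Thread-style mask: everything in the table above is shared or set. */
unsigned long thread_like_clone_flags(void) {
    unsigned long flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                          CLONE_THREAD | CLONE_SYSVSEM | CLONE_SETTLS |
                          CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;
    assert(clone_flags_consistent(flags));
    return flags;
}

/*
 * Simplified fork-style mask: no sharing flags at all. The low byte of the
 * flags word carries the signal sent to the parent when the child exits.
 */
unsigned long fork_like_clone_flags(void) {
    return SIGCHLD;
}

/* True when the child would run in the caller's address space. */
int shares_address_space(unsigned long flags) {
    return (flags & CLONE_VM) != 0;
}
```

For example, `shares_address_space(thread_like_clone_flags())` is true while `shares_address_space(fork_like_clone_flags())` is false, which is the essential difference between the two creation paths.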
```c
/*
 * How pthread_create() uses clone() internally (simplified).
 * This approximates what NPTL does under the hood.
 *
 * Conceptual sketch only: the new thread gets no TLS of its own
 * (no CLONE_SETTLS), so it is not safe to call most libc functions
 * from it, and the stack is never freed.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Start parameters; stored at the top of the new thread's stack */
struct thread_start {
    void (*start_routine)(void *);
    void *arg;
};

// Thread entry point wrapper
static int thread_wrapper(void *arg) {
    struct thread_start *start = arg;

    // Call the user's thread function
    start->start_routine(start->arg);

    // Thread exits here
    return 0;
}

// Simplified thread creation
int simple_thread_create(void (*start_routine)(void *), void *arg) {
    const size_t STACK_SIZE = 1024 * 1024;  // 1 MB

    // Allocate stack
    char *stack = malloc(STACK_SIZE);
    if (!stack) {
        return -1;
    }

    /*
     * Package arguments at the top of the new stack. A local array in
     * this function would be gone by the time the new thread reads it.
     */
    struct thread_start *start =
        (struct thread_start *)(stack + STACK_SIZE) - 1;
    start->start_routine = start_routine;
    start->arg = arg;

    // Stack grows down, so start just below the arguments (16-byte aligned)
    void *stack_top = (void *)((uintptr_t)start & ~(uintptr_t)15);

    /*
     * clone() flags for creating a thread:
     *
     * CLONE_VM:      Share address space (critical for threads)
     * CLONE_FS:      Share filesystem context
     * CLONE_FILES:   Share file descriptors
     * CLONE_SIGHAND: Share signal handlers
     * CLONE_THREAD:  Same thread group (same getpid() result)
     * CLONE_SYSVSEM: Share SysV semaphore undo values
     *
     * No exit signal (such as SIGCHLD) is specified: with CLONE_THREAD
     * the creator is not signalled when the thread terminates.
     */
    int flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                CLONE_THREAD | CLONE_SYSVSEM;

    // Create the thread (glibc wrapper: fn, stack, flags, arg)
    pid_t tid = clone(thread_wrapper, stack_top, flags, start);

    if (tid == -1) {
        perror("clone");
        free(stack);
        return -1;
    }

    printf("Created thread with TID: %d\n", tid);
    return tid;
}

/*
 * What the kernel does when clone() is called:
 *
 * 1. Allocate task_struct for new thread
 * 2. Copy/share resources based on flags:
 *    - If CLONE_VM: share mm_struct (page tables, mappings)
 *    - If CLONE_FILES: share files_struct (fd table)
 *    - etc.
 * 3. Allocate kernel stack (separate from user stack)
 * 4. Set up initial register state:
 *    - Stack pointer -> child_stack
 *    - Instruction pointer -> clone return address (child starts in clone())
 * 5. Add to scheduler's run queue
 * 6. Return TID to parent, 0 to child
 */

/*
 * Kernel data structures created for each thread:
 *
 * task_struct (~6 KB on 64-bit Linux, varies with kernel configuration)
 *   - Process/thread state
 *   - Scheduling information
 *   - Links to shared resources
 *   - Signal handling state
 *   - Credentials
 *   - Timers
 *
 * thread_info + kernel stack (~16 KB)
 *   - Architecture-specific thread state
 *   - Kernel-mode execution stack
 *
 * thread_struct (embedded in task_struct)
 *   - Floating point state
 *   - Debug registers
 */
```

On Linux, getpid() returns the Thread Group ID (TGID), which is the same for all threads in a process. gettid() (a glibc wrapper since version 2.30, otherwise syscall(SYS_gettid)) returns the unique Thread ID. CLONE_THREAD makes the new thread share the parent's TGID while having its own TID. This maintains POSIX semantics where getpid() returns the same value for all threads.
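The TGID/TID distinction is easy to observe from user space. The following sketch, which assumes Linux and uses the hypothetical names `record_ids` and `child_shares_pid_but_not_tid`, has a new thread record its getpid() and gettid values so the creator can compare them with its own.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct ids {
    pid_t pid;  /* getpid(): thread group ID, shared by all threads */
    pid_t tid;  /* gettid:   unique per thread */
};

/* Thread body: record the IDs as seen from inside the new thread. */
static void *record_ids(void *arg) {
    struct ids *out = arg;
    assert(out != NULL);
    out->pid = getpid();
    out->tid = (pid_t)syscall(SYS_gettid);
    return NULL;
}

/*
 * Returns 1 if a newly created thread reports the same PID as the caller
 * but a different TID, 0 if not, and -1 if the thread could not be created.
 */
int child_shares_pid_but_not_tid(void) {
    struct ids self = { getpid(), (pid_t)syscall(SYS_gettid) };
    struct ids child = { 0, 0 };
    pthread_t t;

    if (pthread_create(&t, NULL, record_ids, &child) != 0)
        return -1;
    pthread_join(t, NULL);

    return child.pid == self.pid && child.tid != self.tid;
}
```

The raw syscall(SYS_gettid) form is used here so the sketch does not depend on a particular glibc version.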
Windows thread creation follows a different architectural model but achieves similar results. The CreateThread() API ultimately invokes NtCreateThreadEx() in the kernel, which creates the ETHREAD and KTHREAD structures that represent the thread.
```c
#include <windows.h>
#include <winternl.h>   /* NTSTATUS, POBJECT_ATTRIBUTES */
#include <stdio.h>

/*
 * Windows Thread Creation Deep Dive
 */

/* Thread entry point, defined at the end of this file */
DWORD WINAPI ThreadProc(LPVOID lpParam);

/*
 * CreateThread() Parameters Explained
 */
void demonstrate_create_thread(void) {
    HANDLE hThread;
    DWORD threadId;

    hThread = CreateThread(
        NULL,           // lpThreadAttributes:
                        //   Security descriptor for thread handle
                        //   NULL = default security, not inheritable

        0,              // dwStackSize:
                        //   Initial stack size in bytes
                        //   0 = default (1 MB reserved, 1 page committed)

        ThreadProc,     // lpStartAddress:
                        //   Thread entry point function
                        //   Must match LPTHREAD_START_ROUTINE signature

        (LPVOID)(ULONG_PTR)42,  // lpParameter:
                        //   Argument passed to thread function

        0,              // dwCreationFlags:
                        //   0 = run immediately
                        //   CREATE_SUSPENDED = start suspended
                        //   STACK_SIZE_PARAM_IS_A_RESERVATION = dwStackSize is reservation, not commit

        &threadId       // lpThreadId:
                        //   Receives the thread ID
                        //   Can be NULL if not needed
    );

    if (hThread == NULL) {
        printf("CreateThread failed: %lu\n", GetLastError());
        return;
    }

    WaitForSingleObject(hThread, INFINITE);
    CloseHandle(hThread);  // Release the handle once it is no longer needed
}

/*
 * Stack Size Control on Windows
 */
void demonstrate_stack_sizes(void) {
    HANDLE hThread;

    // Default: 1 MB reserved, 1 page (4KB) committed
    // Stack grows and commits more pages as needed
    hThread = CreateThread(NULL, 0, ThreadProc, NULL, 0, NULL);
    if (hThread != NULL) CloseHandle(hThread);

    // Reserve 256KB, commit as needed
    hThread = CreateThread(NULL, 256 * 1024, ThreadProc, NULL,
                           STACK_SIZE_PARAM_IS_A_RESERVATION, NULL);
    if (hThread != NULL) CloseHandle(hThread);

    // Commit 256KB immediately (all stack memory is committed upfront)
    hThread = CreateThread(NULL, 256 * 1024, ThreadProc, NULL,
                           0, NULL);  // Without STACK_SIZE_PARAM_IS_A_RESERVATION
    if (hThread != NULL) CloseHandle(hThread);
}

/*
 * Creating Suspended Threads
 * Useful for setting thread properties before it runs
 */
void demonstrate_suspended_creation(void) {
    HANDLE hThread;
    DWORD threadId;

    // Create in suspended state
    hThread = CreateThread(NULL, 0, ThreadProc, NULL,
                           CREATE_SUSPENDED, &threadId);

    if (hThread == NULL) {
        printf("CreateThread failed: %lu\n", GetLastError());
        return;
    }

    // Configure thread before it runs
    SetThreadPriority(hThread, THREAD_PRIORITY_HIGHEST);

    // Set processor affinity
    SetThreadAffinityMask(hThread, 0x1);  // CPU 0 only

    // Now start it
    if (ResumeThread(hThread) == (DWORD)-1) {
        printf("ResumeThread failed: %lu\n", GetLastError());
    }

    WaitForSingleObject(hThread, INFINITE);
    CloseHandle(hThread);
}

/*
 * NtCreateThreadEx: The low-level API
 * (Not documented, but used by CreateThread internally)
 */

// Declaration (from ntdll.dll)
typedef NTSTATUS (NTAPI *pNtCreateThreadEx)(
    OUT PHANDLE ThreadHandle,
    IN ACCESS_MASK DesiredAccess,
    IN POBJECT_ATTRIBUTES ObjectAttributes OPTIONAL,
    IN HANDLE ProcessHandle,
    IN LPTHREAD_START_ROUTINE StartAddress,
    IN LPVOID Parameter OPTIONAL,
    IN ULONG CreateFlags,  // THREAD_CREATE_FLAGS_*
    IN SIZE_T ZeroBits OPTIONAL,
    IN SIZE_T StackSize OPTIONAL,
    IN SIZE_T MaximumStackSize OPTIONAL,
    IN PVOID AttributeList OPTIONAL
);

/*
 * Kernel structures created:
 *
 * ETHREAD (Executive Thread Object)
 *   - Win32StartAddress (user entry point)
 *   - Process link
 *   - IrpList (pending I/O requests)
 *   - ThreadListEntry
 *   - Create/exit times
 *   - Timer data
 *
 * KTHREAD (Kernel Thread Object)
 *   - Embedded in ETHREAD
 *   - State (Running, Ready, Waiting, etc.)
 *   - Priority (base and dynamic)
 *   - Quantum (time slice)
 *   - Stack pointers (kernel and user)
 *   - Wait blocks
 *   - Context (registers)
 *
 * TEB (Thread Environment Block, user space)
 *   - TLS array
 *   - Stack boundaries
 *   - Last error value
 *   - Exception handling chain
 *   - Self-reference (NtTib.Self)
 */

DWORD WINAPI ThreadProc(LPVOID lpParam) {
    (void)lpParam;
    printf("Thread running!\n");
    return 0;
}
```

Windows distinguishes between reserved and committed stack memory. Reservation (default 1MB) only sets aside address space. Commitment backs pages with physical memory or page file. By default, only one page is committed initially; more pages are committed as the stack grows. This allows many threads to exist without consuming actual memory until it is needed.
Java's Thread.start() method triggers a complex sequence involving the Java runtime, JNI, and ultimately native thread creation. Understanding this sequence explains why Java thread creation has noticeable overhead compared to raw native threads.
```java
import java.lang.management.*;
import java.util.concurrent.*;

/**
 * Java Thread Creation Internals and Configuration
 */
public class JavaThreadCreation {

    /**
     * Examining the thread creation sequence
     */
    public static void examineThreadCreation() {
        Thread t = new Thread(() -> {
            System.out.println("Thread running!");
        });

        // State is NEW (thread object exists, native thread does not)
        System.out.println("Before start: " + t.getState());  // NEW

        t.start();  // Native thread created here!

        // State is now RUNNABLE (or may have already terminated)
        System.out.println("After start: " + t.getState());  // Usually RUNNABLE

        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        // State is TERMINATED
        System.out.println("After join: " + t.getState());  // TERMINATED
    }

    /**
     * Stack size configuration via Thread constructor
     */
    public static void customStackSize() {
        // Create thread with 256KB stack
        Thread t = new Thread(
            null,                    // ThreadGroup
            () -> {
                System.out.println("Small stack thread");
                // Be careful with recursive calls!
            },
            "SmallStackThread",
            256 * 1024               // Stack size in bytes
        );
        t.start();

        /*
         * Notes on stack size:
         * - Default is 1MB on most 64-bit JVMs
         * - Minimum is platform-dependent
         * - JVM may ignore the hint (it's advisory!)
         * - Use -Xss flag for default: java -Xss256k MyClass
         * - -XX:ThreadStackSize=<KB> is the HotSpot equivalent of -Xss
         */
    }

    /**
     * Thread creation cost measurement
     */
    public static void measureCreationCost() {
        final int ITERATIONS = 1000;

        // Warm up
        for (int i = 0; i < 100; i++) {
            Thread t = new Thread(() -> {});
            t.start();
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }

        // Measure
        long start = System.nanoTime();
        for (int i = 0; i < ITERATIONS; i++) {
            Thread t = new Thread(() -> {
                // Empty task
            });
            t.start();
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
        long elapsed = System.nanoTime() - start;

        double avgMicros = (elapsed / 1000.0) / ITERATIONS;
        System.out.printf("Average thread create+join: %.2f μs%n", avgMicros);
    }

    /**
     * Thread pool vs direct creation performance
     */
    public static void poolVsDirect() throws Exception {
        final int TASKS = 10000;

        Runnable task = () -> {
            // Minimal work
            int x = 0;
            for (int i = 0; i < 100; i++) x += i;
        };

        // Direct thread creation
        long start = System.nanoTime();
        for (int i = 0; i < TASKS; i++) {
            Thread t = new Thread(task);
            t.start();
            t.join();
        }
        long directTime = System.nanoTime() - start;

        // Thread pool
        ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors()
        );
        start = System.nanoTime();
        for (int i = 0; i < TASKS; i++) {
            pool.submit(task).get();
        }
        long poolTime = System.nanoTime() - start;
        pool.shutdown();

        System.out.printf("Direct: %.2f ms%n", directTime / 1e6);
        System.out.printf("Pool:   %.2f ms%n", poolTime / 1e6);
        System.out.printf("Speedup: %.2fx%n", (double) directTime / poolTime);
    }

    /**
     * Thread information via management APIs
     */
    public static void threadInfoViaManagement() {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();

        System.out.println("Thread count: " + threadMXBean.getThreadCount());
        System.out.println("Peak thread count: " + threadMXBean.getPeakThreadCount());
        System.out.println("Daemon thread count: " + threadMXBean.getDaemonThreadCount());
        System.out.println("Total started: " + threadMXBean.getTotalStartedThreadCount());

        // Get info about current thread
        long tid = Thread.currentThread().getId();
        ThreadInfo info = threadMXBean.getThreadInfo(tid);

        System.out.println("\nCurrent thread info:");
        System.out.println("  Name: " + info.getThreadName());
        System.out.println("  State: " + info.getThreadState());
        System.out.println("  Blocked count: " + info.getBlockedCount());
        System.out.println("  Waited count: " + info.getWaitedCount());
    }

    public static void main(String[] args) throws Exception {
        examineThreadCreation();
        customStackSize();
        measureCreationCost();
        poolVsDirect();
        threadInfoViaManagement();
    }
}
```

Java thread creation is slower than raw native thread creation because the JVM must allocate Java-specific structures (JavaThread, handles), set up safepoint polling, prepare the thread for stack-trace collection, and perform various thread-safety operations. This overhead is typically 20-100μs on modern systems, which is negligible for most applications but significant for fine-grained parallelism.
Understanding the true costs of thread creation helps architects make informed decisions about concurrency design. Thread creation cost has three components: time (latency), memory (per-thread overhead), and scalability (limits on thread count).
The time to create a thread can be decomposed into:
| Component | Time | Description |
|---|---|---|
| User-space setup | 1-5 μs | Attribute parsing, allocation bookkeeping |
| System call entry | 0.5-1 μs | Mode switch to kernel |
| Kernel structure allocation | 5-20 μs | task_struct, thread_info, etc. |
| Stack allocation | 2-10 μs | Virtual memory mapping, guard pages |
| Scheduler insertion | 1-5 μs | Add to run queue, potential IPI |
| Return to user space | 0.5-1 μs | Mode switch back |
| Total | 10-50 μs | Typical range on modern systems |
Each thread consumes memory for:
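Example calculation: 1000 threads with 1MB stacks each reserve about 1GB of virtual address space for stacks alone, plus roughly 30MB for kernel structures. Stack pages are usually backed by physical memory only once they are touched, so resident memory is typically far lower than the reservation.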
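The arithmetic behind this estimate can be written down directly. The sketch below is plain integer arithmetic with hypothetical helper names; the 30 KB per-thread kernel figure is simply the value implied by the example (30MB across 1000 threads), not a measured constant.

```c
#include <assert.h>
#include <stdint.h>

#define KIB 1024ULL
#define MIB (1024ULL * KIB)
#define GIB (1024ULL * MIB)

/* Address space reserved for user stacks across all threads. */
uint64_t stack_reservation_bytes(uint64_t threads, uint64_t stack_bytes) {
    return threads * stack_bytes;
}

/* Kernel-side memory (task structure, kernel stack, ...) across all threads. */
uint64_t kernel_overhead_bytes(uint64_t threads, uint64_t per_thread_bytes) {
    return threads * per_thread_bytes;
}

/*
 * Inverse question: how many stacks of a given size (plus guard region)
 * fit into an address-space budget? Ignores fragmentation and every
 * other mapping in the process, so it is an upper bound.
 */
uint64_t max_threads_in_budget(uint64_t budget_bytes,
                               uint64_t stack_bytes,
                               uint64_t guard_bytes) {
    assert(stack_bytes + guard_bytes > 0);
    return budget_bytes / (stack_bytes + guard_bytes);
}
```

With these helpers, `stack_reservation_bytes(1000, MIB)` gives 1,048,576,000 bytes (about 1 GB) and `kernel_overhead_bytes(1000, 30 * KIB)` gives 30,720,000 bytes (about 30 MB), matching the example.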
Maximum thread count is constrained by:
```c
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <time.h>          /* clock_gettime */
#include <unistd.h>        /* pause, sysconf */
#include <sys/resource.h>
#include <sys/sysinfo.h>

/*
 * Measuring Thread Creation Costs
 */
void *empty_thread(void *arg) {
    (void)arg;
    return NULL;
}

void measure_creation_time(void) {
    const int ITERATIONS = 1000;
    pthread_t tids[ITERATIONS];
    struct timespec start, end;
    int created = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);

    for (int i = 0; i < ITERATIONS; i++) {
        int result = pthread_create(&tids[i], NULL, empty_thread, NULL);
        if (result != 0) {
            fprintf(stderr, "pthread_create failed at %d: %s\n",
                    i, strerror(result));
            break;
        }
        created++;
    }

    // Wait for the threads that were actually created
    for (int i = 0; i < created; i++) {
        pthread_join(tids[i], NULL);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    if (created == 0) {
        return;
    }

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9 +
                        (end.tv_nsec - start.tv_nsec);
    double per_thread_us = elapsed_ns / created / 1000.0;

    printf("Created %d threads in %.2f ms\n", created, elapsed_ns / 1e6);
    printf("Average per thread (create + join): %.2f μs\n", per_thread_us);
}

/*
 * Finding maximum thread count
 */
void *wait_forever(void *arg) {
    (void)arg;
    while (1) {
        pause();  // Wait for signal (also a cancellation point)
    }
    return NULL;
}

int find_max_threads(size_t stack_size) {
    pthread_attr_t attr;
    pthread_t *tids = NULL;
    int count = 0;
    int capacity = 1024;

    pthread_attr_init(&attr);
    if (stack_size > 0) {
        int rc = pthread_attr_setstacksize(&attr, stack_size);
        if (rc != 0) {
            fprintf(stderr, "setstacksize failed: %s\n", strerror(rc));
            pthread_attr_destroy(&attr);
            return -1;
        }
        // Disable guard pages to save address space
        pthread_attr_setguardsize(&attr, 0);
    }

    tids = malloc(capacity * sizeof(pthread_t));
    if (!tids) {
        perror("malloc");
        pthread_attr_destroy(&attr);
        return -1;
    }

    printf("Creating threads with %zu byte stacks...\n", stack_size);

    while (1) {
        if (count >= capacity) {
            capacity *= 2;
            pthread_t *new_tids = realloc(tids, capacity * sizeof(pthread_t));
            if (!new_tids) {
                printf("Realloc failed at %d threads\n", count);
                break;
            }
            tids = new_tids;
        }

        int result = pthread_create(&tids[count], &attr, wait_forever, NULL);
        if (result != 0) {
            printf("pthread_create failed at %d: %s\n", count, strerror(result));
            break;
        }
        count++;

        if (count % 1000 == 0) {
            printf("Created %d threads...\n", count);
        }
    }

    printf("Maximum threads created: %d\n", count);

    // Cleanup (will take a while)
    printf("Cleaning up...\n");
    for (int i = 0; i < count; i++) {
        pthread_cancel(tids[i]);
    }
    for (int i = 0; i < count; i++) {
        pthread_join(tids[i], NULL);
    }

    free(tids);
    pthread_attr_destroy(&attr);

    return count;
}

/*
 * Checking system limits
 */
void show_system_limits(void) {
    struct rlimit rlim;

    // Threads per process (actually per-user on Linux)
    getrlimit(RLIMIT_NPROC, &rlim);
    printf("RLIMIT_NPROC: soft=%lu, hard=%lu\n",
           (unsigned long)rlim.rlim_cur, (unsigned long)rlim.rlim_max);

    // Stack size
    getrlimit(RLIMIT_STACK, &rlim);
    printf("RLIMIT_STACK: soft=%lu, hard=%lu\n",
           (unsigned long)rlim.rlim_cur, (unsigned long)rlim.rlim_max);

    // Virtual memory
    getrlimit(RLIMIT_AS, &rlim);
    printf("RLIMIT_AS: soft=%lu, hard=%lu\n",
           (unsigned long)rlim.rlim_cur, (unsigned long)rlim.rlim_max);

    // System-wide limits
    printf("\nSystem info:\n");
    printf("Available RAM: %lu MB\n",
           (unsigned long)(get_avphys_pages() * sysconf(_SC_PAGESIZE) / 1024 / 1024));
    printf("Total RAM: %lu MB\n",
           (unsigned long)(get_phys_pages() * sysconf(_SC_PAGESIZE) / 1024 / 1024));
    printf("CPUs: %d\n", get_nprocs());

    // PID max
    FILE *f = fopen("/proc/sys/kernel/pid_max", "r");
    if (f) {
        int pid_max;
        if (fscanf(f, "%d", &pid_max) == 1) {
            printf("PID max: %d\n", pid_max);
        }
        fclose(f);
    }

    // Threads max
    f = fopen("/proc/sys/kernel/threads-max", "r");
    if (f) {
        int threads_max;
        if (fscanf(f, "%d", &threads_max) == 1) {
            printf("Threads max: %d\n", threads_max);
        }
        fclose(f);
    }
}
```

With 1MB default stacks, you can create at most ~10,000 threads in 10GB of address space. Reducing stack size to 64KB allows ~160,000 threads in the same space, before other limits such as threads-max or RLIMIT_NPROC apply. For applications needing more concurrent tasks, consider thread pools (reuse threads) or M:N threading models (like Go goroutines or Java virtual threads) that multiplex many tasks onto fewer OS threads.
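As an illustration of the thread-pool alternative mentioned above, here is a minimal sketch of a fixed-size pool built on a mutex, two condition variables, and a bounded ring buffer. All names (`struct pool`, `pool_submit`, `run_tasks_on_pool`, the worker count and queue capacity) are hypothetical choices for this example, not an existing library API; production pools usually add task results, error reporting, and dynamic sizing.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define POOL_WORKERS 4
#define QUEUE_CAP    64

typedef void (*task_fn)(void *);
struct task { task_fn fn; void *arg; };

struct pool {
    pthread_t workers[POOL_WORKERS];
    int started;                      /* workers actually created */
    struct task queue[QUEUE_CAP];     /* ring buffer of pending tasks */
    size_t head, count;
    int shutting_down;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

/* Each worker is created once and then loops, pulling tasks off the queue. */
static void *worker_main(void *p) {
    struct pool *pool = p;
    for (;;) {
        pthread_mutex_lock(&pool->lock);
        while (pool->count == 0 && !pool->shutting_down)
            pthread_cond_wait(&pool->not_empty, &pool->lock);
        if (pool->count == 0) {       /* shutting down and queue drained */
            pthread_mutex_unlock(&pool->lock);
            return NULL;
        }
        struct task t = pool->queue[pool->head];
        pool->head = (pool->head + 1) % QUEUE_CAP;
        pool->count--;
        pthread_cond_signal(&pool->not_full);
        pthread_mutex_unlock(&pool->lock);
        t.fn(t.arg);                  /* run the task outside the lock */
    }
}

int pool_init(struct pool *pool) {
    pool->head = pool->count = 0;
    pool->started = 0;
    pool->shutting_down = 0;
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->not_empty, NULL);
    pthread_cond_init(&pool->not_full, NULL);
    for (int i = 0; i < POOL_WORKERS; i++) {
        if (pthread_create(&pool->workers[i], NULL, worker_main, pool) != 0)
            return -1;                /* caller should still call pool_shutdown */
        pool->started++;
    }
    return 0;
}

/* Submitting work is a queue insert, not a thread creation. */
void pool_submit(struct pool *pool, task_fn fn, void *arg) {
    pthread_mutex_lock(&pool->lock);
    assert(!pool->shutting_down);
    while (pool->count == QUEUE_CAP)  /* block while the queue is full */
        pthread_cond_wait(&pool->not_full, &pool->lock);
    size_t tail = (pool->head + pool->count) % QUEUE_CAP;
    pool->queue[tail] = (struct task){ fn, arg };
    pool->count++;
    pthread_cond_signal(&pool->not_empty);
    pthread_mutex_unlock(&pool->lock);
}

/* Lets workers drain the queue, then joins them and releases resources. */
void pool_shutdown(struct pool *pool) {
    pthread_mutex_lock(&pool->lock);
    pool->shutting_down = 1;
    pthread_cond_broadcast(&pool->not_empty);
    pthread_mutex_unlock(&pool->lock);
    for (int i = 0; i < pool->started; i++)
        pthread_join(pool->workers[i], NULL);
    pthread_cond_destroy(&pool->not_full);
    pthread_cond_destroy(&pool->not_empty);
    pthread_mutex_destroy(&pool->lock);
}

static void bump(void *arg) { atomic_fetch_add((atomic_int *)arg, 1); }

/* Runs n trivial tasks on the pool and returns how many completed. */
int run_tasks_on_pool(int n) {
    struct pool pool;
    atomic_int done = 0;
    if (pool_init(&pool) != 0) { pool_shutdown(&pool); return -1; }
    for (int i = 0; i < n; i++)
        pool_submit(&pool, bump, &done);
    pool_shutdown(&pool);
    return atomic_load(&done);
}
```

Here `run_tasks_on_pool(1000)` executes a thousand tasks while creating only four OS threads, so the per-task cost is a lock acquisition and a queue insert rather than a thread creation.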
Thread creation decisions significantly impact application performance, resource consumption, and scalability. Following these best practices ensures efficient concurrent systems.
Creating a thread involves:
- Time: 10-100μs for the system call sequence
- Memory: 64KB-8MB stack + 20-50KB kernel structures
- Limits: System-wide and per-process constraints
For high-performance systems:
Thread creation is not free, but it's also not prohibitively expensive for most applications. The key is matching your threading model to your workload characteristics.
You now understand the complete thread creation process—from system call to scheduler insertion—across POSIX, Windows, and Java environments. You can reason about creation costs, configure stack sizes, and make informed architectural decisions. Next, we'll examine thread joining and termination: how threads coordinate their completion and clean up resources.