Operating Systems: Thread Concepts

Thread Fundamentals

Level: Intermediate

Duration: 60 mins

Topic: Thread Concepts

Thread-Specific Resources

The Private State of Threads

While threads share a remarkable amount of process state, they are not mere clones of each other. Each thread maintains its own private state—resources that belong exclusively to that thread and are not visible to siblings. This private state is what allows threads to execute independently, maintain their own execution context, and operate without interfering with each other's core execution flow.

Understanding thread-specific resources is crucial for:

Writing correct concurrent code that doesn't accidentally share data that should be private
Avoiding stack-related bugs that can corrupt memory
Using thread-local storage effectively for per-thread state
Understanding how the operating system manages thread execution context

What You Will Learn

By the end of this page, you will understand the main categories of thread-private state: the register set, the stack, thread-local storage, and scheduling-related attributes, along with the per-thread signal mask, thread state, and exit status. You'll see how each is managed, what problems can arise, and how to use them effectively.

The Register Set

The register set is the most fundamental thread-private resource. CPU registers are the fastest storage available to a processor—they hold the data currently being operated on and the execution state.

What's in the Register Set:

General-purpose registers — Hold operands for arithmetic, addresses for memory access, function arguments, and temporary values (e.g., RAX, RBX, RCX, RDX, RSI, RDI on x86-64)
Program counter (RIP) — The address of the next instruction to execute
Stack pointer (RSP) — Points to the current top of the stack
Base/frame pointer (RBP) — Points to the current stack frame (optional but common)
Flags register (RFLAGS) — Condition codes (zero, carry, overflow) and processor state
Floating-point registers — For floating-point and SIMD operations (XMM, YMM, ZMM)
Segment registers — FS and GS often used for thread-local storage addressing

x86-64 General-Purpose Registers (Typical Usage, System V AMD64 ABI)
Register	Purpose	Preserved Across Calls?
RAX	Return value, temporary	No (caller-saved)
RBX	Callee-saved general purpose	Yes (callee-saved)
RCX	4th integer argument, counter	No
RDX	3rd integer argument, 2nd return value	No
RSI	2nd integer argument	No
RDI	1st integer argument	No
RBP	Base pointer / frame pointer	Yes
RSP	Stack pointer	Yes (by definition)
R8-R9	5th-6th integer arguments	No
R10-R11	Temporaries	No
R12-R15	Callee-saved general purpose	Yes

Why Registers Must Be Private:

Consider two threads taking turns on the same CPU core:

Thread A: RAX = 5, computing 5 * 3
Thread B: RAX = 100, computing 100 + 7

If they shared registers, one would overwrite the other's computation. The result would be meaningless. Thus, each thread has its own complete set of register values.

Context Switch and Registers:

When the scheduler switches from one thread to another:

Save current thread's registers to its Thread Control Block (TCB) in memory
Load next thread's previously-saved registers from its TCB

This save/restore operation is the core of context switching. The cost of saving and restoring all registers is a significant component of context switch overhead.

register_context.c
C
/* Simplified representation of saved register context */
#include <stdint.h>

struct cpu_context {
    /* General-purpose registers (x86-64) */
    uint64_t rax, rbx, rcx, rdx;
    uint64_t rsi, rdi, rbp, rsp;
    uint64_t r8, r9, r10, r11;
    uint64_t r12, r13, r14, r15;

    /* Instruction pointer */
    uint64_t rip;

    /* Flags register */
    uint64_t rflags;

    /* Segment registers (important for TLS) */
    uint16_t cs, ss, ds, es, fs, gs;
    uint64_t fs_base, gs_base;  /* Base addresses for FS/GS segments */

    /* Extended state (floating point, SIMD) */
    /* This is large: 512 bytes for x87/SSE, 2 KB or more with AVX-512 */
    uint8_t fpu_state[512];
    uint8_t extended_state[];  /* Variable size based on CPU features */
};

/* Minimal thread descriptor for this sketch; a real TCB holds much more */
struct thread {
    struct cpu_context *cpu_context;
};

/* Architecture-specific routines, normally written in assembly.
 * They are only declared here so that the sketch compiles. */
void save_cpu_context(struct cpu_context *ctx);
void restore_cpu_context(struct cpu_context *ctx);

/*
 * Context switch saves/restores this entire structure.
 * Kernels take special care with the extended (FPU/SIMD) state because it is large.
 */
void context_switch(struct thread *old, struct thread *next) {
    /* Save old thread's state */
    save_cpu_context(old->cpu_context);

    /* Critical: restoring RSP switches stacks. */
    /* After that point, we're on the next thread's stack */

    /* Restore next thread's state */
    restore_cpu_context(next->cpu_context);

    /* "Return" - but we return to wherever the next thread was suspended */
}
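
The save/restore idea can also be observed from user space. The sketch below is only an analogue of what a kernel does, not kernel code: it assumes a glibc-style system that provides the `ucontext` API (`getcontext`, `makecontext`, `swapcontext`), and the names `run_context_demo`, `task`, and `record` are made up for illustration. Each `swapcontext` call stores the current registers (including the stack pointer and instruction pointer) in one structure and loads them from another, so execution bounces between two independent stacks.

```c
#include <assert.h>
#include <ucontext.h>

static ucontext_t main_ctx, task_ctx;
static char task_stack[64 * 1024];   /* private stack for the second context */
static int trace[4];                 /* order in which the code fragments ran */
static int trace_len = 0;

static void record(int step) {
    assert(trace_len < 4);
    trace[trace_len++] = step;
}

static void task(void) {
    record(2);                           /* first slice of the task */
    swapcontext(&task_ctx, &main_ctx);   /* save task registers, resume main */
    record(4);                           /* resumed exactly where it stopped */
}                                        /* returning continues at uc_link */

/* Hypothetical demo: returns 1 if execution interleaved as 1, 2, 3, 4. */
int run_context_demo(void) {
    trace_len = 0;
    if (getcontext(&task_ctx) != 0) return 0;
    task_ctx.uc_stack.ss_sp = task_stack;
    task_ctx.uc_stack.ss_size = sizeof task_stack;
    task_ctx.uc_link = &main_ctx;        /* where to go when task() returns */
    makecontext(&task_ctx, task, 0);

    record(1);
    swapcontext(&main_ctx, &task_ctx);   /* save main, start task */
    record(3);
    swapcontext(&main_ctx, &task_ctx);   /* save main, resume task */

    return trace_len == 4 && trace[0] == 1 && trace[1] == 2
        && trace[2] == 3 && trace[3] == 4;
}
```

Calling `run_context_demo()` from `main` should return 1: the task resumes after its own `swapcontext` call because its saved instruction and stack pointers were restored, which is the same mechanism a scheduler relies on.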

Lazy FPU Context Switching

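Saving and restoring floating-point and SIMD state is expensive: 512 bytes for x87/SSE alone, and more than 2 KB with AVX-512. Older kernels reduced this cost with 'lazy' FPU switching: they left the previous thread's FPU state in the registers and saved or restored it only when another thread actually executed a floating-point instruction, which optimized the case where many threads never use floating-point. Current mainstream kernels such as Linux switch FPU state eagerly instead, because compilers now use SIMD registers in ordinary code, the XSAVE family of instructions can skip unused state components, and lazy switching exposed a speculative-execution side channel (LazyFP, CVE-2018-3665).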

The Thread Stack

Every thread has its own stack—a region of memory used for function call management, local variables, and temporary storage. The stack is perhaps the most operationally important thread-private resource.

What Lives on the Stack:

Return addresses — Where to continue after a function returns
Saved frame pointers — To restore the caller's stack frame
Function arguments — (Those beyond what fits in registers)
Local variables — Variables declared inside functions
Temporary storage — Intermediate computation results
Spilled registers — Register values pushed to make room for others

Stack Allocation and Layout:

Each thread's stack is allocated from the process's virtual address space
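The main thread's stack grows on demand up to the RLIMIT_STACK limit (8 MB by default on most Linux distributions)
Spawned threads get a fixed-size stack whose default is platform-dependent and configurable (glibc on Linux uses the RLIMIT_STACK soft limit, commonly 8 MB, and falls back to 2 MB on x86-64 when that limit is unlimited; some other C libraries and operating systems default to much less)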
Stacks grow downward (from high to low addresses on most architectures)
A guard page at the stack limit catches overflows
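
Because these defaults differ between platforms, it is safer to query them than to assume them. The following is a minimal sketch using standard POSIX attribute calls; the helper name `default_stack_settings` is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper: reads the stack size and guard size that a freshly
 * initialized attribute object would give a new thread.
 * Returns 0 on success or a pthread error number. */
int default_stack_settings(size_t *stack_size, size_t *guard_size) {
    assert(stack_size != NULL && guard_size != NULL);

    pthread_attr_t attr;
    int rc = pthread_attr_init(&attr);
    if (rc != 0) return rc;

    rc = pthread_attr_getstacksize(&attr, stack_size);
    if (rc == 0) rc = pthread_attr_getguardsize(&attr, guard_size);

    pthread_attr_destroy(&attr);
    return rc;
}
```

Printing both values on the target system (with `%zu`) shows what a thread created with default attributes will actually get there; the guard size is typically one page.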

Stack Isolation:

Because each thread has its own stack, local variables are inherently thread-safe:

void worker(int id) {
    int local_count = 0;  // Private to this thread's invocation
    char buffer[1024];    // Private
    
    // These can never conflict with another thread's local_count
    // unless you deliberately share their addresses
}

thread_stack_config.c
C
#define _GNU_SOURCE   /* for pthread_getattr_np (GNU extension) */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void *stack_info_thread(void *arg) {
    (void)arg;
    int local_var;  /* A variable on this thread's stack */

    /* Get the thread's stack info */
    pthread_attr_t attr;
    if (pthread_getattr_np(pthread_self(), &attr) != 0) {
        return NULL;
    }

    void *stack_addr;   /* Lowest address of the stack region */
    size_t stack_size;
    pthread_attr_getstack(&attr, &stack_addr, &stack_size);

    char *stack_top = (char *)stack_addr + stack_size;

    printf("Thread stack:\n");
    printf("  Stack base:     %p\n", stack_addr);
    printf("  Stack size:     %zu bytes (%.2f MB)\n",
           stack_size, stack_size / (1024.0 * 1024.0));
    printf("  Stack top:      %p\n", (void *)stack_top);
    printf("  local_var at:   %p\n", (void *)&local_var);
    /* The stack grows downward, so usage is measured from the top */
    printf("  Stack used:     %zu bytes\n",
           (size_t)(stack_top - (char *)&local_var));

    pthread_attr_destroy(&attr);
    return NULL;
}

int main(void) {
    /* Create thread with custom stack size */
    pthread_attr_t attr;
    pthread_attr_init(&attr);

    /* Set stack size to 1 MB */
    size_t custom_size = 1 * 1024 * 1024;
    pthread_attr_setstacksize(&attr, custom_size);

    pthread_t thread;
    if (pthread_create(&thread, &attr, stack_info_thread, NULL) == 0) {
        pthread_join(thread, NULL);
    }

    pthread_attr_destroy(&attr);
    return 0;
}

/* Alternatively, allocate your own stack memory */
void create_thread_with_custom_stack(void) {
    pthread_attr_t attr;
    pthread_attr_init(&attr);

    /* Allocate page-aligned stack memory. Note that no guard page
     * is added automatically for a caller-supplied stack. */
    size_t stack_size = 2 * 1024 * 1024;  /* 2 MB */
    void *stack = NULL;
    if (posix_memalign(&stack, (size_t)sysconf(_SC_PAGESIZE), stack_size) != 0) {
        pthread_attr_destroy(&attr);
        return;
    }

    /* Set the stack address (lowest byte) and size */
    pthread_attr_setstack(&attr, stack, stack_size);

    pthread_t thread;
    if (pthread_create(&thread, &attr, stack_info_thread, NULL) == 0) {
        pthread_join(thread, NULL);
    }

    /* Free the stack only after the thread has exited */
    free(stack);
    pthread_attr_destroy(&attr);
}

Stack-Related Dangers

•Stack overflow — Deep recursion or large local arrays can exceed the stack size. Touching the guard page raises a segmentation fault, which usually kills the whole process, and a large enough frame can jump past the guard page entirely and silently corrupt adjacent memory.
•Returning pointers to local variables — A function returning a pointer to a local variable returns a dangling pointer. The stack space is reused by the next function call.
•Sharing stack addresses across threads — While technically possible, the memory becomes invalid when that thread's function returns.
•Too-small custom stacks — If you configure a small stack and the thread uses more, you get undefined behavior (or a crash if you're lucky).

The Dangling Stack Pointer Trap

Never pass a pointer to a local variable to another thread unless you guarantee the calling function won't return until the other thread is done with it. Stack memory is reused as functions return. What was a valid data structure becomes garbage—or worse, a valid-looking data structure for a different function's frame.
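
One common way to stay clear of this trap is to give the new thread heap-allocated arguments, so their lifetime no longer depends on any stack frame. The sketch below illustrates that pattern; `struct job`, `start_square`, and `finish_square` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct job { int input; int output; };   /* hypothetical payload */

static void *square_job(void *arg) {
    struct job *j = arg;          /* heap memory: outlives the creator's frame */
    j->output = j->input * j->input;
    return j;                     /* hand ownership back via the exit value */
}

/* Starts a worker and returns immediately. The job lives on the heap,
 * so it stays valid after this function's stack frame is gone.
 * Returns 0 on success, -1 on failure. */
int start_square(pthread_t *tid, int input) {
    assert(tid != NULL);
    struct job *j = malloc(sizeof *j);
    if (j == NULL) return -1;
    j->input = input;
    j->output = 0;
    if (pthread_create(tid, NULL, square_job, j) != 0) {
        free(j);
        return -1;
    }
    return 0;
}

/* Joins the worker, extracts the result, and frees the job.
 * Returns -1 if the join fails. */
int finish_square(pthread_t tid) {
    void *ret = NULL;
    if (pthread_join(tid, &ret) != 0 || ret == NULL) return -1;
    struct job *j = ret;
    int result = j->output;
    free(j);
    return result;
}
```

Had `start_square` passed the address of a local `struct job` instead, the worker could read or write that memory after the function returned and its frame was reused.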

Thread-Local Storage (TLS)

Thread-Local Storage (TLS) provides a mechanism for variables that are:

Global in scope — Accessible from anywhere in the code, like global variables
Private per thread — Each thread has its own independent copy

This solves a common problem: you need a variable that persists across function calls (like a global or static), but you don't want threads to share it.

Common TLS Use Cases:

Error codes — errno is thread-local so each thread has its own error state
Random number generator state — Each thread needs independent RNG for reproducibility
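Thread-specific buffers — Such as the per-thread result buffer that some C libraries use for functions like strerror (strtok_r, by contrast, avoids hidden state by taking a caller-supplied save pointer)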
Caching — Per-thread caches avoid lock contention
Context objects — Request handlers storing per-request context

c11_thread_local.c
C
#include <threads.h>  /* C11 threads */
#include <stdio.h>
 
/* C11 standard way: _Thread_local keyword */
_Thread_local int thread_errno = 0;
_Thread_local char thread_name[64] = {0};
 
/* GNU C also accepts __thread keyword */
__thread int gnu_thread_var = 42;
 
void set_thread_identity(const char *name, int id) {
    snprintf(thread_name, sizeof(thread_name), "%s", name);
    thread_errno = id;
}
 
void print_thread_identity(void) {
    /* Each thread sees its OWN values */
    printf("Thread: %s, errno: %d\n", thread_name, thread_errno);
}
 
int thread_func(void *arg) {
    int id = *(int*)arg;
    
    /* Set this thread's local values */
    char name[32];
    snprintf(name, sizeof(name), "Worker-%d", id);
    set_thread_identity(name, id * 100);
    
    /* Print shows this thread's values */
    print_thread_identity();
    
    return 0;
}
 
int main(void) {
    thrd_t t1, t2;
    int id1 = 1, id2 = 2;
    
    set_thread_identity("Main", 0);
    
    thrd_create(&t1, thread_func, &id1);
    thrd_create(&t2, thread_func, &id2);
    
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    
    /* Main thread still has its own values */
    print_thread_identity();  /* Prints: Thread: Main, errno: 0 */
    
    return 0;
}

How TLS Works Under the Hood:

Thread-local storage is typically implemented using a segment register. On x86-64, Linux user space uses FS (32-bit x86 Linux and 64-bit Windows use GS). Each thread has a unique base address loaded into this register, pointing to that thread's TLS block:

Thread 1: FS_BASE = 0x7f1234560000
  → TLS block at that address contains Thread 1's TLS variables

Thread 2: FS_BASE = 0x7f5678900000
  → TLS block at that address contains Thread 2's TLS variables

Accessing a TLS variable is then:

Load the offset of the variable within the TLS block (fixed at link time for variables defined in the main executable)
Add it to the FS base address (set up for each thread when it is created and restored on every context switch)
Access that memory address

For variables in the main executable this is extremely fast: just one or two extra instructions compared to a global variable. Variables defined in dynamically loaded shared libraries may instead need a runtime lookup (such as a call to __tls_get_addr), which is slower.
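
The effect of per-thread base addresses can be checked directly: the same TLS variable name resolves to a different address in each thread. The sketch below assumes a C11 compiler with `_Thread_local` and POSIX threads; `tls_counter` and `tls_copies_are_distinct` are illustrative names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

static _Thread_local int tls_counter = 0;   /* one copy per thread */

static void *report_tls(void *out) {
    tls_counter = 1;                        /* touches only this thread's copy */
    *(uintptr_t *)out = (uintptr_t)&tls_counter;
    return NULL;
}

/* Returns 1 if the worker's copy lives at a different address than the
 * caller's copy and the caller's value was left untouched. */
int tls_copies_are_distinct(void) {
    uintptr_t worker_addr = 0;
    pthread_t t;

    tls_counter = 42;                       /* the caller's own copy */
    if (pthread_create(&t, NULL, report_tls, &worker_addr) != 0) return 0;
    if (pthread_join(t, NULL) != 0) return 0;   /* worker_addr is valid after join */

    assert(worker_addr != 0);
    return worker_addr != (uintptr_t)&tls_counter && tls_counter == 42;
}
```

The worker's write of 1 never reaches the caller's copy, which still holds 42 after the join.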

Choosing Between TLS Approaches

Use _Thread_local / __thread for simple, statically-known TLS variables—it's cleaner and faster. Use pthread_key_create for dynamically allocated per-thread data, especially when you need destructors to clean up resources when threads exit. Libraries that don't know thread identity often use pthread keys.
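
To make the `pthread_key_create` option concrete, here is a minimal sketch of a lazily allocated per-thread buffer with a destructor. It uses only standard POSIX calls; the names `thread_buffer`, `free_buffer`, and `run_key_demo`, the 256-byte size, and the destructor counter are assumptions made for the demonstration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

static pthread_key_t buf_key;
static pthread_once_t buf_once = PTHREAD_ONCE_INIT;
static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static int destructor_calls = 0;   /* demo bookkeeping, guarded by count_lock */

/* Runs automatically at thread exit for each thread whose value is non-NULL. */
static void free_buffer(void *p) {
    free(p);
    pthread_mutex_lock(&count_lock);
    destructor_calls++;
    pthread_mutex_unlock(&count_lock);
}

static void make_key(void) {
    int rc = pthread_key_create(&buf_key, free_buffer);
    assert(rc == 0);
    (void)rc;
}

/* Returns the calling thread's scratch buffer, allocating it on first use. */
char *thread_buffer(void) {
    pthread_once(&buf_once, make_key);          /* create the key exactly once */
    char *buf = pthread_getspecific(buf_key);
    if (buf == NULL) {
        buf = calloc(1, 256);
        if (buf != NULL && pthread_setspecific(buf_key, buf) != 0) {
            free(buf);
            buf = NULL;
        }
    }
    return buf;
}

static void *use_buffer(void *arg) {
    (void)arg;
    char *first = thread_buffer();
    char *second = thread_buffer();             /* same buffer on every call */
    return (void *)(intptr_t)(first != NULL && first == second);
}

/* Runs n workers (n <= 16); returns the number of destructor calls, or -1. */
int run_key_demo(int n) {
    pthread_t tids[16];
    int started = 0, ok = 1;
    assert(n >= 0 && n <= 16);

    pthread_mutex_lock(&count_lock);
    destructor_calls = 0;
    pthread_mutex_unlock(&count_lock);

    for (int i = 0; i < n; i++) {
        if (pthread_create(&tids[started], NULL, use_buffer, NULL) != 0) break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        void *ret = NULL;
        if (pthread_join(tids[i], &ret) != 0 || ret == NULL) ok = 0;
    }
    if (!ok || started != n) return -1;

    pthread_mutex_lock(&count_lock);
    int calls = destructor_calls;
    pthread_mutex_unlock(&count_lock);
    return calls;
}
```

Each worker allocates its buffer once, and the destructor frees it when the worker exits, so `run_key_demo(3)` should report three destructor calls without any explicit cleanup code in the workers.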

TLS Implementation Comparison
Approach	Speed	Flexibility	Destructor Support	Use Case
_Thread_local / __thread	Fastest (compiled-in offsets)	Static only	No	Simple per-thread state
pthread_key_t	Fast (TLS + indirection)	Dynamic	Yes	Complex objects, libraries
Manual thread ID lookup	Slower (hash lookup)	Full control	Manual	When other options don't fit

Thread ID and Scheduling Attributes

Each thread has a unique thread ID and individual scheduling attributes that affect how the scheduler treats it.

Thread Identification:

pthread_t — The POSIX thread identifier (opaque type, use pthread_equal for comparison)
Kernel TID — On Linux, each thread has a unique TID visible via gettid()
The kernel uses TIDs for scheduling, signal delivery, and system call attribution
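
The two kinds of identifier can be compared in a short, Linux-specific sketch. It obtains the kernel TID through `syscall(SYS_gettid)` and compares handles only with `pthread_equal`; the struct and function names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct id_report {
    pthread_t creator;       /* in:  handle of the creating thread */
    int same_as_creator;     /* out: pthread_equal result seen by the worker */
    pid_t tid;               /* out: worker's kernel thread ID */
};

static void *collect_ids(void *p) {
    struct id_report *r = p;
    /* pthread_t is opaque, so compare with pthread_equal rather than == */
    r->same_as_creator = pthread_equal(pthread_self(), r->creator);
    r->tid = (pid_t)syscall(SYS_gettid);   /* Linux-specific */
    return NULL;
}

/* Returns 1 if the worker has a different handle and a different kernel TID
 * than the calling thread. */
int thread_ids_are_distinct(void) {
    struct id_report r = { .creator = pthread_self(),
                           .same_as_creator = 1, .tid = 0 };
    pthread_t t;
    if (pthread_create(&t, NULL, collect_ids, &r) != 0) return 0;
    if (pthread_join(t, NULL) != 0) return 0;

    pid_t my_tid = (pid_t)syscall(SYS_gettid);
    assert(my_tid > 0);
    return !r.same_as_creator && r.tid > 0 && r.tid != my_tid;
}
```

On Linux the main thread's TID equals the process ID returned by `getpid()`, while every other thread gets a TID of its own.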

Per-Thread Scheduling Attributes

•Priority — How urgently the thread should be scheduled relative to others. Higher priority threads are favored.
•Scheduling policy — SCHED_OTHER (default, time-sharing), SCHED_FIFO (real-time), SCHED_RR (real-time round-robin).
•CPU affinity — Which CPUs the thread is allowed to run on. Can be set per-thread.
•Nice value — Influences priority for SCHED_OTHER threads (lower nice = higher priority).
•State — Running, ready, blocked, etc. Managed by the kernel.
•CPU time consumed — Accounting for this thread's CPU usage.

thread_scheduling.c
C
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
 
void *thread_with_custom_scheduling(void *arg) {
    (void)arg;

    /* Get this thread's kernel TID */
    pid_t tid = (pid_t)syscall(SYS_gettid);
    printf("Thread TID: %d\n", (int)tid);

    /* Get current scheduling policy and priority */
    int policy;
    struct sched_param param;
    pthread_getschedparam(pthread_self(), &policy, &param);

    printf("Scheduling policy: %s\n",
           policy == SCHED_OTHER ? "SCHED_OTHER" :
           policy == SCHED_FIFO ? "SCHED_FIFO" :
           policy == SCHED_RR ? "SCHED_RR" : "Unknown");
    printf("Priority: %d\n", param.sched_priority);

    /* Get CPU affinity */
    cpu_set_t cpuset;
    pthread_getaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
    printf("Affinity: ");
    for (int i = 0; i < CPU_SETSIZE; i++) {
        if (CPU_ISSET(i, &cpuset)) {
            printf("CPU%d ", i);
        }
    }
    printf("\n");

    return NULL;
}
 
int main() {
    pthread_t thread;
    pthread_attr_t attr;
    
    pthread_attr_init(&attr);
    
    /* Set custom CPU affinity - restrict to CPUs 0 and 1 */
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(0, &cpuset);
    CPU_SET(1, &cpuset);
    pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset);
    
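    /* Set scheduling policy and priority (real-time policies need privileges). */
    /* Without PTHREAD_EXPLICIT_SCHED the new thread inherits the creator's */
    /* policy, and the policy and priority set on the attribute are ignored. */
    /* pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED); */
    /* pthread_attr_setschedpolicy(&attr, SCHED_FIFO); */
    /* struct sched_param param = { .sched_priority = 50 }; */
    /* pthread_attr_setschedparam(&attr, &param); */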
    
    pthread_create(&thread, &attr, thread_with_custom_scheduling, NULL);
    pthread_join(thread, NULL);
    
    pthread_attr_destroy(&attr);
    return 0;
}

Real-Time Scheduling Requires Privileges

SCHED_FIFO and SCHED_RR policies require root privileges (or CAP_SYS_NICE capability). A real-time thread with high priority can starve other threads and even make the system unresponsive if it doesn't block. Use with extreme caution.

The Signal Mask

Each thread has its own signal mask: the set of signals whose delivery to that thread is currently blocked. This is one of the few truly per-thread aspects of signal handling.

Key Points:

Signal handlers are process-wide — But the mask is per-thread
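Blocked signals stay pending — They are delivered once unblocked; standard signals do not queue, so repeated instances collapse into one, while real-time signals are queued
New threads inherit the creating thread's mask — Set the mask before pthread_create
pthread_sigmask — Modifies the calling thread's mask (POSIX leaves the behavior of sigprocmask unspecified in multithreaded processes)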

thread_signal_mask.c
C
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
 
void *worker_thread(void *arg) {
    int id = *(int*)arg;
    
    /* Get current signal mask for this thread */
    sigset_t current_mask;
    pthread_sigmask(SIG_BLOCK, NULL, &current_mask);
    
    printf("Thread %d: SIGINT %s\n", id,
           sigismember(&current_mask, SIGINT) ? "BLOCKED" : "not blocked");
    printf("Thread %d: SIGTERM %s\n", id,
           sigismember(&current_mask, SIGTERM) ? "BLOCKED" : "not blocked");

    /* Sleep and see if we receive signals */
    for (int i = 0; i < 5; i++) {
        printf("Thread %d: iteration %d\n", id, i);
        sleep(1);
    }
    
    return NULL;
}
 
int main() {
    /* Thread 1: Block SIGINT */
    sigset_t mask1;
    sigemptyset(&mask1);
    sigaddset(&mask1, SIGINT);
    pthread_sigmask(SIG_BLOCK, &mask1, NULL);
    
    int id1 = 1;
    pthread_t t1;
    pthread_create(&t1, NULL, worker_thread, &id1);
    
    /* Thread 2: Different mask - block SIGTERM instead */
    sigset_t mask2;
    sigemptyset(&mask2);
    sigaddset(&mask2, SIGTERM);
    pthread_sigmask(SIG_SETMASK, &mask2, NULL);  /* Replace entire mask */
    
    int id2 = 2;
    pthread_t t2;
    pthread_create(&t2, NULL, worker_thread, &id2);
    
    /* A process-directed SIGINT goes to some thread that does not block it: */
    /* never Thread 1 (blocked), possibly Thread 2 or the main thread. */
    /* With the default disposition it terminates the whole process. */
    
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    
    return 0;
}

The Standard Pattern

Block all signals in main() before creating any threads. Each new thread inherits this mask. Create one dedicated signal-handling thread that sigwaits for signals. This avoids the complexity of async-signal-safety entirely—only the signal thread handles signals, and it does so synchronously.

Thread State and Exit Status

Each thread has its own state (running, ready, blocked, terminated) tracked by the kernel, and when terminated, maintains an exit status until joined.

Thread States (private to each thread):

Running — Currently executing on a CPU
Ready — Runnable but waiting for CPU
Blocked — Waiting for I/O, lock, condition, or other event
Terminated — Execution complete, waiting to be joined

Different threads can be in different states at the same time. One thread blocked on I/O doesn't prevent another from running.

Exit Status:

When a thread terminates (via return or pthread_exit), it produces an exit value. This value is held until another thread calls pthread_join to retrieve it:

thread_exit_status.c
C
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
 
/* Thread returning a simple value */
void *compute_sum(void *arg) {
    int *numbers = (int*)arg;
    int sum = 0;
    for (int i = 0; i < 5; i++) {
        sum += numbers[i];
    }
    
    /* Return value cast to void* */
    return (void*)(long)sum;
}
 
/* Thread returning a heap-allocated structure */
struct result {
    int sum;
    int max;
    int min;
};
 
void *compute_stats(void *arg) {
    int *numbers = (int*)arg;
    
    struct result *res = malloc(sizeof(struct result));
    res->sum = 0;
    res->max = numbers[0];
    res->min = numbers[0];
    
    for (int i = 0; i < 5; i++) {
        res->sum += numbers[i];
        if (numbers[i] > res->max) res->max = numbers[i];
        if (numbers[i] < res->min) res->min = numbers[i];
    }
    
    /* Return pointer to heap-allocated result */
    return res;  /* Caller must free */
}
 
int main() {
    int data[] = {10, 20, 5, 30, 15};
    pthread_t t1, t2;
    
    pthread_create(&t1, NULL, compute_sum, data);
    pthread_create(&t2, NULL, compute_stats, data);
    
    /* Retrieve exit status from t1 */
    void *ret1;
    pthread_join(t1, &ret1);
    printf("Sum (simple return): %ld\n", (long)ret1);
    
    /* Retrieve exit status from t2 */
    void *ret2;
    pthread_join(t2, &ret2);
    struct result *stats = (struct result*)ret2;
    printf("Stats: sum=%d, max=%d, min=%d\n",
           stats->sum, stats->max, stats->min);
    free(stats);  /* Free the heap-allocated result */
    
    return 0;
}

Joinable vs Detached Threads

By default, threads are 'joinable'—their resources persist after termination until joined. If you don't care about the return value, call pthread_detach() or create the thread with PTHREAD_CREATE_DETACHED attribute. Detached threads release resources immediately on termination, preventing resource leaks.
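
As a rough illustration, the sketch below creates a thread with the PTHREAD_CREATE_DETACHED attribute. Since a detached thread cannot be joined, the example uses a POSIX semaphore to learn that the thread ran; that choice, like the names `background_task` and `run_detached_demo`, is an assumption of the example rather than a required technique.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t done;   /* lets the creator learn that the detached thread ran */

static void *background_task(void *arg) {
    (void)arg;
    sem_post(&done);   /* no join is possible, so signal completion instead */
    return NULL;       /* the return value of a detached thread is discarded */
}

/* Hypothetical demo: returns 1 once a detached worker has signalled. */
int run_detached_demo(void) {
    pthread_attr_t attr;
    pthread_t t;
    int state = -1;

    if (sem_init(&done, 0, 0) != 0) return 0;
    if (pthread_attr_init(&attr) != 0) return 0;

    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    pthread_attr_getdetachstate(&attr, &state);
    assert(state == PTHREAD_CREATE_DETACHED);

    int rc = pthread_create(&t, &attr, background_task, NULL);
    pthread_attr_destroy(&attr);
    if (rc != 0) return 0;

    /* No pthread_join here: the thread's resources are reclaimed
     * automatically when it terminates. */
    while (sem_wait(&done) != 0) {
        /* retry if interrupted by a signal */
    }
    return 1;
}
```

An already running joinable thread can be switched to the same behavior with `pthread_detach(thread)`.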

Thread Resources in Context

The following table summarizes how all thread-specific resources fit together within a process:

Complete Thread Resource Summary
Resource	Private/Shared	Synchronization Needed	Notes
Register set	Private	No (hardware enforced)	Saved/restored on context switch
Stack	Private	No (unless addresses shared)	Function calls, local variables
Thread-local storage	Private	No (per-thread copies)	Thread-specific globals
Thread ID	Private	No (immutable)	Unique identifier
Signal mask	Private	No (per-thread)	Controls signal delivery
Scheduling attributes	Private	No (kernel-managed)	Priority, affinity, policy
Thread state	Private	No (kernel-managed)	Running, blocked, etc.
Exit status	Private	Via pthread_join	Available until joined

Summary: Thread-Specific Resources

We've thoroughly examined the resources that each thread keeps private. Let's consolidate the essential insights:

Key Takeaways

•The register set is the core of thread identity — Each thread has its own PC, SP, and registers. Context switching saves/restores this state.
•Each thread has its own stack — Local variables, return addresses, and function parameters live here. Stack memory is inherently thread-safe.
•Thread-local storage provides per-thread globals — Use _Thread_local for simple cases, pthread_key_t for complex objects with destructors.
•Each thread has unique scheduling attributes — Priority, affinity, and policy can be set per-thread for fine-grained control.
•Signal masks are per-thread — Each thread controls which signals it receives. Use the dedicated signal thread pattern.
•Thread state is individually tracked — Threads can be in different states simultaneously—one blocked doesn't block others.
•Join retrieves exit status — A terminated thread's return value is available via pthread_join until joined or detached.

What's Next:

With a complete understanding of what threads own privately and share with siblings, we're ready to explore why threading is valuable. The next page examines the Benefits of Threading—responsiveness, resource sharing, economy, and scalability—and when these benefits outweigh the complexity costs.

Thread Resources Mastered

You now have comprehensive knowledge of thread-private resources—registers, stack, TLS, scheduling attributes, signal masks, and state. This understanding is crucial for writing correct concurrent code and debugging thread-related issues.

4 / 5

Loading learning content...

Operating SystemsThread Concepts

Thread Fundamentals

LevelIntermediate

Duration60 mins

TopicThread Concepts

4 / 5

Thread-Specific Resources

The Private State of Threads

Understanding thread-specific resources is crucial for:

Writing correct concurrent code that doesn't accidentally share data that should be private
Avoiding stack-related bugs that can corrupt memory
Using thread-local storage effectively for per-thread state
Understanding how the operating system manages thread execution context

What You Will Learn

The Register Set

What's in the Register Set:

General-purpose registers — Hold operands for arithmetic, addresses for memory access, function arguments, and temporary values (e.g., RAX, RBX, RCX, RDX, RSI, RDI on x86-64)
Program counter (RIP) — The address of the next instruction to execute
Stack pointer (RSP) — Points to the current top of the stack
Base/frame pointer (RBP) — Points to the current stack frame (optional but common)
Flags register (RFLAGS) — Condition codes (zero, carry, overflow) and processor state
Floating-point registers — For floating-point and SIMD operations (XMM, YMM, ZMM)
Segment registers — FS and GS often used for thread-local storage addressing

x86-64 General-Purpose Registers (Typical Usage)
Register	Purpose	Preserved Across Calls?
RAX	Return value, temporary	No (caller-saved)
RBX	Callee-saved general purpose	Yes (callee-saved)
RCX	4th integer argument, counter	No
RDX	3rd integer argument, 2nd return value	No
RSI	2nd integer argument	No
RDI	1st integer argument	No
RBP	Base pointer / frame pointer	Yes
RSP	Stack pointer	Yes (by definition)
R8-R11	5th-8th arguments, temporaries	No
R12-R15	Callee-saved general purpose	Yes

Why Registers Must Be Private:

Consider two threads executing simultaneously:

Thread A: RAX = 5, computing 5 * 3
Thread B: RAX = 100, computing 100 + 7

If they shared registers, one would overwrite the other's computation. The result would be meaningless. Thus, each thread has its own complete set of register values.

Context Switch and Registers:

When the scheduler switches from one thread to another:

Save current thread's registers to its Thread Control Block (TCB) in memory
Load next thread's previously-saved registers from its TCB

This save/restore operation is the core of context switching. The cost of saving and restoring all registers is a significant component of context switch overhead.

register_context.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
/* Simplified representation of saved register context */
 
struct cpu_context {
    /* General-purpose registers (x86-64) */
    uint64_t rax, rbx, rcx, rdx;
    uint64_t rsi, rdi, rbp, rsp;
    uint64_t r8, r9, r10, r11;
    uint64_t r12, r13, r14, r15;
    
    /* Instruction pointer */
    uint64_t rip;
    
    /* Flags register */
    uint64_t rflags;
    
    /* Segment registers (important for TLS) */
    uint16_t cs, ss, ds, es, fs, gs;
    uint64_t fs_base, gs_base;  /* Base addresses for FS/GS segments */
    
    /* Extended state (floating point, SIMD) */
    /* This is large: 512 bytes for x87/SSE, up to 2KB+ with AVX-512 */
    uint8_t fpu_state[512];
    uint8_t extended_state[];  /* Variable size based on CPU features */
};
 
/*
 * Context switch saves/restores this entire structure.
 * Modern CPUs can save/restore FPU state lazily to reduce overhead.
 */
 
void context_switch(struct thread *old, struct thread *new) {
    /* Save old thread's state */
    save_cpu_context(&old->cpu_context);
    
    /* Critical: Switch stack pointer */
    /* After this point, we're on the new thread's stack */
    
    /* Restore new thread's state */
    restore_cpu_context(&new->cpu_context);
    
    /* "Return" - but we return to wherever the new thread was suspended */
}

Lazy FPU Context Switching

The Thread Stack

What Lives on the Stack:

Return addresses — Where to continue after a function returns
Saved frame pointers — To restore the caller's stack frame
Function arguments — (Those beyond what fits in registers)
Local variables — Variables declared inside functions
Temporary storage — Intermediate computation results
Spilled registers — Register values pushed to make room for others

Converting Mermaid diagram...

Stack Allocation and Layout:

Each thread's stack is allocated from the process's virtual address space
The main thread typically has a larger stack (8 MB default on Linux)
Spawned threads have smaller stacks (often 2 MB default, configurable)
Stacks grow downward (from high to low addresses on most architectures)
A guard page at the stack limit catches overflows

Stack Isolation:

Because each thread has its own stack, local variables are inherently thread-safe:

void worker(int id) {
    int local_count = 0;  // Private to this thread's invocation
    char buffer[1024];    // Private
    
    // These can never conflict with another thread's local_count
    // unless you deliberately share their addresses
}

thread_stack_config.c
C
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
 
void *stack_info_thread(void *arg) {
    int local_var;  /* A variable on this thread's stack */
    
    /* Get the thread's stack info */
    pthread_attr_t attr;
    pthread_getattr_np(pthread_self(), &attr);
    
    void *stack_addr;
    size_t stack_size;
    pthread_attr_getstack(&attr, &stack_addr, &stack_size);
    
    printf("Thread stack:
");
    printf("  Stack base:     %p
", stack_addr);
    printf("  Stack size:     %zu bytes (%.2f MB)
", 
           stack_size, stack_size / (1024.0 * 1024.0));
    printf("  Stack top:      %p
", (char*)stack_addr + stack_size);
    printf("  local_var at:   %p
", &local_var);
    printf("  Stack used:     %zu bytes
", 
           (char*)stack_addr + stack_size - (char*)&local_var);
    
    pthread_attr_destroy(&attr);
    return NULL;
}
 
int main() {
    /* Create thread with custom stack size */
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    
    /* Set stack size to 1 MB */
    size_t custom_size = 1 * 1024 * 1024;
    pthread_attr_setstacksize(&attr, custom_size);
    
    pthread_t thread;
    pthread_create(&thread, &attr, stack_info_thread, NULL);
    pthread_join(thread, NULL);
    
    pthread_attr_destroy(&attr);
    return 0;
}
 
/* Alternatively, allocate your own stack memory */
void create_thread_with_custom_stack(void) {
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    
    /* Allocate aligned stack memory */
    size_t stack_size = 2 * 1024 * 1024;  /* 2 MB */
    void *stack = aligned_alloc(16, stack_size);
    
    /* Set the stack address and size */
    pthread_attr_setstack(&attr, stack, stack_size);
    
    pthread_t thread;
    pthread_create(&thread, &attr, stack_info_thread, NULL);
    
    pthread_join(thread, NULL);
    
    /* Don't forget to free the stack after thread exits */
    free(stack);
    pthread_attr_destroy(&attr);
}

Stack-Related Dangers

•Stack overflow — Deep recursion or large local arrays can exceed stack size. The guard page catches this and causes a segfault, but by then it's too late.
•Returning pointers to local variables — A function returning a pointer to a local variable returns a dangling pointer. The stack space is reused immediately.
•Sharing stack addresses across threads — While technically possible, the memory becomes invalid when that thread's function returns.
•Too-small custom stacks — If you configure a small stack and the thread uses more, you get undefined behavior (or a crash if you're lucky).

The Dangling Stack Pointer Trap
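A common form of this trap is passing the address of a local variable to a new thread and then returning before the thread has read it. The sketch below shows one safer pattern: copy the argument to the heap so that it outlives the creating function's stack frame. The names `start_worker`, `join_worker`, and `worker` are hypothetical and exist only for this illustration.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical worker: takes ownership of a heap-allocated int. */
static void *worker(void *arg) {
    int *idp = arg;
    int id = *idp;   /* safe: the heap copy is still valid */
    free(idp);       /* the worker owns the argument and releases it */
    return (void *)(intptr_t)(id * 2);
}

/* BROKEN variant, shown only as a comment:
 *     int id = 7;
 *     pthread_create(out, NULL, worker, &id);
 *     return 0;    // 'id' dies here; the thread may read reused stack memory
 */

/* Safe variant: the argument lives on the heap, not in this frame. */
int start_worker(int id, pthread_t *out) {
    int *arg = malloc(sizeof *arg);
    if (arg == NULL) {
        return -1;
    }
    *arg = id;
    if (pthread_create(out, NULL, worker, arg) != 0) {
        free(arg);   /* the thread never started, so we still own arg */
        return -1;
    }
    return 0;
}

/* Wait for the worker and unpack the integer it returned. */
int join_worker(pthread_t t) {
    void *ret = NULL;
    pthread_join(t, &ret);
    return (int)(intptr_t)ret;
}
```

Calling `start_worker(21, &t)` followed by `join_worker(t)` yields 42. The alternative is to keep the local variable alive by joining the thread before the creating function returns.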

Thread-Local Storage (TLS)

Thread-Local Storage (TLS) provides a mechanism for variables that are:

Global in scope — Accessible from anywhere in the code, like global variables
Private per thread — Each thread has its own independent copy

This solves a common problem: you need a variable that persists across function calls (like a global or static), but you don't want threads to share it.

Common TLS Use Cases:

Error codes — errno is thread-local so each thread has its own error state
Random number generator state — Each thread needs independent RNG for reproducibility
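Thread-specific buffers — Such as the internal state behind non-reentrant functions like strtok, which some C libraries keep per-thread (strtok_r avoids the problem by taking a caller-supplied save pointer)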
Caching — Per-thread caches avoid lock contention
Context objects — Request handlers storing per-request context

c11_thread_local.c
C
#include <threads.h>  /* C11 threads */
#include <stdio.h>
 
/* C11 standard way: _Thread_local keyword */
_Thread_local int thread_errno = 0;
_Thread_local char thread_name[64] = {0};
 
/* GNU C also accepts __thread keyword */
__thread int gnu_thread_var = 42;
 
void set_thread_identity(const char *name, int id) {
    snprintf(thread_name, sizeof(thread_name), "%s", name);
    thread_errno = id;
}
 
void print_thread_identity(void) {
    /* Each thread sees its OWN values */
    printf("Thread: %s, errno: %d\n", thread_name, thread_errno);
}
 
int thread_func(void *arg) {
    int id = *(int*)arg;
    
    /* Set this thread's local values */
    char name[32];
    snprintf(name, sizeof(name), "Worker-%d", id);
    set_thread_identity(name, id * 100);
    
    /* Print shows this thread's values */
    print_thread_identity();
    
    return 0;
}
 
int main(void) {
    thrd_t t1, t2;
    int id1 = 1, id2 = 2;
    
    set_thread_identity("Main", 0);
    
    thrd_create(&t1, thread_func, &id1);
    thrd_create(&t2, thread_func, &id2);
    
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    
    /* Main thread still has its own values */
    print_thread_identity();  /* Prints: Thread: Main, errno: 0 */
    
    return 0;
}

How TLS Works Under the Hood:

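Thread-local storage is typically implemented using a segment register (FS or GS on x86-64). Linux on x86-64 uses FS, while 64-bit Windows uses GS. Each thread has a unique base address loaded into this register, pointing to that thread's TLS area:

Thread 1: FS_BASE = 0x7f1234560000
  → Thread 1's TLS variables are located relative to that address

Thread 2: FS_BASE = 0x7f5678900000
  → Thread 2's TLS variables are located relative to that address

Accessing a TLS variable is then:

Take the offset of the variable within the TLS block (fixed at link time for variables defined in the main executable)
Add it to the FS base address (unique to each thread and restored whenever the thread is scheduled)
Access that memory address

This is extremely fast: often a single segment-relative instruction, and at most one or two extra instructions compared to a global variable. TLS variables defined in dynamically loaded shared libraries can be slower, because the access may go through a runtime helper such as __tls_get_addr.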
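The effect of the per-thread base address can be observed from C: the same variable name resolves to a different address in each thread, and a write in one thread leaves the other thread's copy untouched. The following is a minimal sketch of that check; `tls_value`, `observe`, and `tls_copies_are_independent` are hypothetical names chosen for the example.

```c
#include <pthread.h>
#include <stdint.h>

static _Thread_local int tls_value = 0;   /* one copy per thread */

struct observation {
    uintptr_t address;   /* where the child thread's copy lives */
};

static void *observe(void *arg) {
    struct observation *obs = arg;
    tls_value = 99;                         /* changes only this thread's copy */
    obs->address = (uintptr_t)&tls_value;   /* recorded while the copy is alive */
    return NULL;
}

/* Returns 1 if the child saw a different address and the caller's copy
   kept its value, 0 otherwise, or -1 if the thread could not be created. */
int tls_copies_are_independent(void) {
    struct observation child = {0};
    pthread_t t;

    tls_value = 7;   /* the calling thread's copy */
    if (pthread_create(&t, NULL, observe, &child) != 0) {
        return -1;
    }
    pthread_join(t, NULL);

    return child.address != (uintptr_t)&tls_value && tls_value == 7;
}
```

The address is stored as an integer because the child's copy is destroyed when the child exits; the caller only compares the number and never dereferences it.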

Choosing Between TLS Approaches

TLS Implementation Comparison
Approach	Speed	Flexibility	Destructor Support	Use Case
_Thread_local / __thread	Fastest (compiled-in offsets)	Static only	No	Simple per-thread state
pthread_key_t	Fast (TLS + indirection)	Dynamic	Yes	Complex objects, libraries
Manual thread ID lookup	Slower (hash lookup)	Full control	Manual	When other options don't fit
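The pthread_key_t row is the one to reach for when each thread needs a dynamically allocated object that must be cleaned up at thread exit. Below is a minimal sketch under stated assumptions: the 256-byte buffer size and the names `get_thread_buffer`, `run_key_demo`, and `destructor_calls` are invented for illustration and are not part of any library.

```c
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t buf_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;
static int destructor_calls;   /* illustration only: counts cleanups */

/* Runs automatically at thread exit for each non-NULL value. */
static void free_buffer(void *p) {
    free(p);
    destructor_calls++;
}

static void make_key(void) {
    pthread_key_create(&buf_key, free_buffer);
}

/* Return the calling thread's buffer, allocating it on first use. */
char *get_thread_buffer(void) {
    pthread_once(&key_once, make_key);   /* create the key exactly once */
    char *buf = pthread_getspecific(buf_key);
    if (buf == NULL) {
        buf = calloc(1, 256);
        if (buf != NULL && pthread_setspecific(buf_key, buf) != 0) {
            free(buf);
            buf = NULL;
        }
    }
    return buf;
}

static void *user(void *arg) {
    (void)arg;
    char *b = get_thread_buffer();
    if (b != NULL) {
        b[0] = 'x';   /* touch this thread's private buffer */
    }
    return NULL;      /* free_buffer runs as the thread exits */
}

/* Run one thread; return the number of destructor calls so far,
   or -1 if the thread could not be created. */
int run_key_demo(void) {
    pthread_t t;
    if (pthread_create(&t, NULL, user, NULL) != 0) {
        return -1;
    }
    pthread_join(t, NULL);   /* join also orders the counter update */
    return destructor_calls;
}
```

Compared with _Thread_local, each access costs a function call, but the key can be created at run time and the destructor releases the buffer without any explicit cleanup code in the thread function.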

Thread ID and Scheduling Attributes

Each thread has a unique thread ID and individual scheduling attributes that affect how the scheduler treats it.

Thread Identification:

pthread_t — The POSIX thread identifier (opaque type, use pthread_equal for comparison)
Kernel TID — On Linux, each thread has a unique TID visible via gettid()
The kernel uses TIDs for scheduling, signal delivery, and system call attribution
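The two identifiers can be compared side by side. The sketch below is Linux-specific and assumes the raw `SYS_gettid` system call; `record_tid` and `check_thread_ids` are hypothetical names. It checks that a new thread's pthread_t differs from the caller's, that the kernel TIDs differ, and that the new thread's TID is not the process ID (on Linux only the initial thread's TID equals the PID).

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Store the calling thread's kernel TID in the pid_t that arg points to. */
static void *record_tid(void *arg) {
    pid_t *out = arg;
    *out = (pid_t)syscall(SYS_gettid);
    return NULL;
}

/* Returns 1 if all three observations hold, 0 if not,
   or -1 if the thread could not be created. */
int check_thread_ids(void) {
    pthread_t t;
    pid_t child_tid = 0;

    if (pthread_create(&t, NULL, record_tid, &child_tid) != 0) {
        return -1;
    }

    /* pthread_t is opaque: compare with pthread_equal, never with ==.
       The comparison happens before the join, while t is still valid. */
    int distinct_handles = !pthread_equal(t, pthread_self());

    pthread_join(t, NULL);

    pid_t my_tid = (pid_t)syscall(SYS_gettid);
    return distinct_handles
        && child_tid != my_tid      /* each thread has its own TID */
        && child_tid != getpid();   /* only the initial thread's TID is the PID */
}
```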

Per-Thread Scheduling Attributes

•Priority — How urgently the thread should be scheduled relative to others. Higher priority threads are favored.
•Scheduling policy — SCHED_OTHER (default, time-sharing), SCHED_FIFO (real-time), SCHED_RR (real-time round-robin).
•CPU affinity — Which CPUs the thread is allowed to run on. Can be set per-thread.
•Nice value — Influences priority for SCHED_OTHER threads (lower nice = higher priority).
•State — Running, ready, blocked, etc. Managed by the kernel.
•CPU time consumed — Accounting for this thread's CPU usage.

thread_scheduling.c
C
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
 
void *thread_with_custom_scheduling(void *arg) {
    /* Get this thread's kernel TID */
    pid_t tid = syscall(SYS_gettid);
    printf("Thread TID: %d\n", (int)tid);
    
    /* Get current scheduling policy and priority */
    int policy;
    struct sched_param param;
    pthread_getschedparam(pthread_self(), &policy, &param);
    
    printf("Scheduling policy: %s\n",
           policy == SCHED_OTHER ? "SCHED_OTHER" :
           policy == SCHED_FIFO ? "SCHED_FIFO" :
           policy == SCHED_RR ? "SCHED_RR" : "Unknown");
    printf("Priority: %d\n", param.sched_priority);
    
    /* Get CPU affinity */
    cpu_set_t cpuset;
    pthread_getaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
    printf("Affinity: ");
    for (int i = 0; i < CPU_SETSIZE; i++) {
        if (CPU_ISSET(i, &cpuset)) {
            printf("CPU%d ", i);
        }
    }
    printf("\n");
    
    return NULL;
}
 
int main() {
    pthread_t thread;
    pthread_attr_t attr;
    
    pthread_attr_init(&attr);
    
    /* Set custom CPU affinity - restrict to CPUs 0 and 1 */
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(0, &cpuset);
    CPU_SET(1, &cpuset);
    pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset);
    
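    /* Set scheduling policy and priority (real-time policies need
       privileges). PTHREAD_EXPLICIT_SCHED is required; without it the
       new thread inherits the creator's policy and these are ignored. */
    /* pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED); */
    /* pthread_attr_setschedpolicy(&attr, SCHED_FIFO); */
    /* struct sched_param param = { .sched_priority = 50 }; */
    /* pthread_attr_setschedparam(&attr, &param); */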
    
    pthread_create(&thread, &attr, thread_with_custom_scheduling, NULL);
    pthread_join(thread, NULL);
    
    pthread_attr_destroy(&attr);
    return 0;
}

Real-Time Scheduling Requires Privileges
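On Linux, switching a thread to SCHED_FIFO or SCHED_RR normally requires root, the CAP_SYS_NICE capability, or a nonzero RLIMIT_RTPRIO limit. Without one of these, the request typically fails with EPERM. The sketch below probes for that permission at run time; `probe_realtime_permission` is a hypothetical helper, and it restores the previous policy so the probe has no lasting effect.

```c
#include <errno.h>
#include <pthread.h>
#include <sched.h>

/* Try to move the calling thread to SCHED_FIFO at the lowest real-time
   priority. Returns 0 if permitted, otherwise the error number from
   pthread_setschedparam (typically EPERM for unprivileged processes). */
int probe_realtime_permission(void) {
    pthread_t self = pthread_self();
    int old_policy;
    struct sched_param old_param;

    int rc = pthread_getschedparam(self, &old_policy, &old_param);
    if (rc != 0) {
        return rc;
    }

    struct sched_param rt = {
        .sched_priority = sched_get_priority_min(SCHED_FIFO)
    };
    rc = pthread_setschedparam(self, SCHED_FIFO, &rt);
    if (rc == 0) {
        /* Permitted: put the original policy back. */
        pthread_setschedparam(self, old_policy, &old_param);
    }
    return rc;
}
```

Note that the pthread functions return the error number directly instead of setting errno, so the result is compared against EPERM without consulting errno.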

The Signal Mask

Each thread has its own signal mask—a set of signals that are blocked (not delivered) to that specific thread. This is one of the few truly per-thread aspects of signal handling.

Key Points:

Signal handlers are process-wide — But the mask is per-thread
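Blocked signals stay pending — They are delivered once unblocked. Standard signals do not queue, so repeated instances of the same blocked signal collapse into one; real-time signals do queue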
New threads inherit the creating thread's mask — Set the mask before pthread_create
pthread_sigmask — Modifies the calling thread's mask (use it instead of sigprocmask, whose behavior is unspecified in a multithreaded process)

thread_signal_mask.c
C
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
 
void *worker_thread(void *arg) {
    int id = *(int*)arg;
    
    /* Get current signal mask for this thread */
    sigset_t current_mask;
    pthread_sigmask(SIG_BLOCK, NULL, &current_mask);
    
    printf("Thread %d: SIGINT %s\n", id,
           sigismember(&current_mask, SIGINT) ? "BLOCKED" : "not blocked");
    printf("Thread %d: SIGTERM %s\n", id,
           sigismember(&current_mask, SIGTERM) ? "BLOCKED" : "not blocked");
    
    /* Sleep and see if we receive signals */
    for (int i = 0; i < 5; i++) {
        printf("Thread %d: iteration %d\n", id, i);
        sleep(1);
    }
    
    return NULL;
}
 
int main() {
    /* Thread 1: Block SIGINT */
    sigset_t mask1;
    sigemptyset(&mask1);
    sigaddset(&mask1, SIGINT);
    pthread_sigmask(SIG_BLOCK, &mask1, NULL);
    
    int id1 = 1;
    pthread_t t1;
    pthread_create(&t1, NULL, worker_thread, &id1);
    
    /* Thread 2: Different mask - block SIGTERM instead */
    sigset_t mask2;
    sigemptyset(&mask2);
    sigaddset(&mask2, SIGTERM);
    pthread_sigmask(SIG_SETMASK, &mask2, NULL);  /* Replace entire mask */
    
    int id2 = 2;
    pthread_t t2;
    pthread_create(&t2, NULL, worker_thread, &id2);
    
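    /* Thread 1 inherited a mask with SIGINT blocked, so it is never
       chosen to handle a process-directed SIGINT. Thread 2 and main
       (which now block only SIGTERM) can receive it. No handler is
       installed here, so SIGINT's default action still terminates
       the whole process. */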
    
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    
    return 0;
}

The Standard Pattern
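A widely used arrangement is the dedicated signal thread: block the signals of interest before creating any threads (so every thread inherits the mask), then let one thread collect them synchronously with sigwait. The sketch below is a minimal version of that idea. The choice of SIGUSR1, the self-directed kill used to trigger the demo, and the names `signal_thread` and `run_signal_thread_demo` are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

static sigset_t handled;   /* signals routed to the dedicated thread */
static int last_signal;    /* written by the signal thread, read after join */

/* The dedicated thread: waits synchronously, no async handler needed. */
static void *signal_thread(void *arg) {
    (void)arg;
    int sig = 0;
    if (sigwait(&handled, &sig) == 0) {
        last_signal = sig;   /* ordinary code runs here, not a handler */
    }
    return NULL;
}

/* Returns the signal number the dedicated thread received, or -1 on error. */
int run_signal_thread_demo(void) {
    sigset_t old;
    pthread_t t;

    sigemptyset(&handled);
    sigaddset(&handled, SIGUSR1);

    /* Block BEFORE creating threads so that every thread inherits the mask. */
    if (pthread_sigmask(SIG_BLOCK, &handled, &old) != 0) {
        return -1;
    }
    if (pthread_create(&t, NULL, signal_thread, NULL) != 0) {
        pthread_sigmask(SIG_SETMASK, &old, NULL);
        return -1;
    }

    /* Demo trigger: send a process-directed signal to ourselves. It stays
       pending (blocked everywhere) until sigwait consumes it. */
    kill(getpid(), SIGUSR1);

    pthread_join(t, NULL);
    pthread_sigmask(SIG_SETMASK, &old, NULL);   /* restore the caller's mask */
    return last_signal;
}
```

Because the signal is consumed by sigwait in normal thread context, the code that reacts to it is not limited to async-signal-safe functions, which is the main attraction of this design.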

Thread State and Exit Status

Each thread has its own state (running, ready, blocked, terminated) tracked by the kernel, and when terminated, maintains an exit status until joined.

Thread States (private to each thread):

Running — Currently executing on a CPU
Ready — Runnable but waiting for CPU
Blocked — Waiting for I/O, lock, condition, or other event
Terminated — Execution complete, waiting to be joined

Different threads can be in different states at the same time. One thread blocked on I/O doesn't prevent another from running.
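As a small illustration of independent states, the sketch below has one thread wait on a condition variable while the calling thread keeps computing and then wakes it. The names `waiter` and `run_blocked_demo` and the summation workload are hypothetical; whether the waiter actually blocks depends on timing, but the result is the same either way.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready_cv = PTHREAD_COND_INITIALIZER;
static bool ready = false;
static int shared_total = 0;

/* Waits until the flag is set; blocks if it is not set yet. */
static void *waiter(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready) {
        pthread_cond_wait(&ready_cv, &lock);   /* Blocked state */
    }
    int seen = shared_total;
    pthread_mutex_unlock(&lock);
    return (void *)(intptr_t)seen;
}

/* Returns the total the waiter observed, or -1 if creation failed. */
int run_blocked_demo(void) {
    pthread_t t;
    if (pthread_create(&t, NULL, waiter, NULL) != 0) {
        return -1;
    }

    /* The calling thread stays runnable regardless of the waiter's state. */
    int total = 0;
    for (int i = 1; i <= 100; i++) {
        total += i;
    }

    pthread_mutex_lock(&lock);
    shared_total = total;
    ready = true;
    pthread_cond_signal(&ready_cv);   /* waiter becomes ready to run again */
    pthread_mutex_unlock(&lock);

    void *ret = NULL;
    pthread_join(t, &ret);
    return (int)(intptr_t)ret;
}
```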

Exit Status:

When a thread terminates (via return or pthread_exit), it produces an exit value. This value is held until another thread calls pthread_join to retrieve it:

thread_exit_status.c
C
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
 
/* Thread returning a simple value */
void *compute_sum(void *arg) {
    int *numbers = (int*)arg;
    int sum = 0;
    for (int i = 0; i < 5; i++) {
        sum += numbers[i];
    }
    
    /* Return value cast to void* */
    return (void*)(long)sum;
}
 
/* Thread returning a heap-allocated structure */
struct result {
    int sum;
    int max;
    int min;
};
 
void *compute_stats(void *arg) {
    int *numbers = (int*)arg;
    
    struct result *res = malloc(sizeof(struct result));
    res->sum = 0;
    res->max = numbers[0];
    res->min = numbers[0];
    
    for (int i = 0; i < 5; i++) {
        res->sum += numbers[i];
        if (numbers[i] > res->max) res->max = numbers[i];
        if (numbers[i] < res->min) res->min = numbers[i];
    }
    
    /* Return pointer to heap-allocated result */
    return res;  /* Caller must free */
}
 
int main() {
    int data[] = {10, 20, 5, 30, 15};
    pthread_t t1, t2;
    
    pthread_create(&t1, NULL, compute_sum, data);
    pthread_create(&t2, NULL, compute_stats, data);
    
    /* Retrieve exit status from t1 */
    void *ret1;
    pthread_join(t1, &ret1);
    printf("Sum (simple return): %ld\n", (long)ret1);
    
    /* Retrieve exit status from t2 */
    void *ret2;
    pthread_join(t2, &ret2);
    struct result *stats = (struct result*)ret2;
    printf("Stats: sum=%d, max=%d, min=%d\n",
           stats->sum, stats->max, stats->min);
    free(stats);  /* Free the heap-allocated result */
    
    return 0;
}

Joinable vs Detached Threads
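POSIX threads are joinable by default: a terminated thread's exit status and bookkeeping are retained until some thread calls pthread_join. A detached thread releases those resources automatically when it exits, and it can no longer be joined, so its return value is lost. The sketch below shows both facts; `default_is_joinable`, `start_detached`, and `background_task` are hypothetical names for the example.

```c
#include <pthread.h>

static void *background_task(void *arg) {
    (void)arg;
    return NULL;   /* nobody can retrieve this value from a detached thread */
}

/* Returns 1 if a freshly initialized attribute object is joinable. */
int default_is_joinable(void) {
    pthread_attr_t attr;
    int state = -1;

    if (pthread_attr_init(&attr) != 0) {
        return -1;
    }
    pthread_attr_getdetachstate(&attr, &state);
    pthread_attr_destroy(&attr);
    return state == PTHREAD_CREATE_JOINABLE;
}

/* Start a thread that cleans up after itself. Returns 0 on success. */
int start_detached(void) {
    pthread_attr_t attr;
    pthread_t t;

    int rc = pthread_attr_init(&attr);
    if (rc != 0) {
        return rc;
    }
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    rc = pthread_create(&t, &attr, background_task, NULL);
    pthread_attr_destroy(&attr);

    /* Do NOT call pthread_join on t: the thread is detached. */
    return rc;
}
```

An already running joinable thread can also be detached later with pthread_detach. Either way, every thread should be joined or detached exactly once, otherwise its exit status and bookkeeping are never reclaimed.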

Thread Resources in Context

The following table summarizes how all thread-specific resources fit together within a process:

Complete Thread Resource Summary
Resource	Private/Shared	Synchronization Needed	Notes
Register set	Private	No (hardware enforced)	Saved/restored on context switch
Stack	Private	No (unless addresses shared)	Function calls, local variables
Thread-local storage	Private	No (per-thread copies)	Thread-specific globals
Thread ID	Private	No (immutable)	Unique identifier
Signal mask	Private	No (per-thread)	Controls signal delivery
Scheduling attributes	Private	No (kernel-managed)	Priority, affinity, policy
Thread state	Private	No (kernel-managed)	Running, blocked, etc.
Exit status	Private	Via pthread_join	Available until joined

Summary: Thread-Specific Resources

We've thoroughly examined the resources that each thread keeps private. Let's consolidate the essential insights:

Key Takeaways

•The register set is the core of thread identity — Each thread has its own PC, SP, and registers. Context switching saves/restores this state.
•Each thread has its own stack — Local variables, return addresses, and function parameters live here. Stack data is safe from other threads only as long as you don't hand out addresses of local variables.
•Thread-local storage provides per-thread globals — Use _Thread_local for simple cases, pthread_key_t for complex objects with destructors.
•Each thread has unique scheduling attributes — Priority, affinity, and policy can be set per-thread for fine-grained control.
•Signal masks are per-thread — Each thread controls which signals it receives. Use the dedicated signal thread pattern.
•Thread state is individually tracked — Threads can be in different states simultaneously—one blocked doesn't block others.
•Join retrieves exit status — A terminated thread's return value is available via pthread_join until joined or detached.

What's Next:

Thread Resources Mastered
