In modern software development, concurrency is not optional—it's essential. But before operating systems provided native thread support, and even today in specialized contexts, user-level thread libraries provide a remarkably elegant solution: concurrency without kernel involvement.
A user-level thread library is a runtime system that exists entirely within the address space of a process, implementing thread creation, scheduling, synchronization, and destruction without ever crossing the boundary into kernel space. This architectural decision—keeping threads invisible to the kernel—creates a fascinating set of tradeoffs that every systems engineer must understand.
By the end of this page, you will understand the complete architecture of user-level thread libraries, how they implement thread abstractions using only user-space mechanisms, why this approach offers significant performance advantages, and the fundamental architectural constraints that govern their design.
A user-level thread (ULT), also known as a lightweight thread or fiber, is a thread of execution managed entirely by a user-space library rather than by the operating system kernel. From the kernel's perspective, a process using user-level threads appears to be a single-threaded process—the kernel schedules the process as a unit, completely unaware of the internal threading happening within.
This creates a layered abstraction model: application code runs on threads the library provides; the library's scheduler and thread control blocks live inside the process's address space; and beneath it all, the kernel sees and schedules a single process.
The fundamental principle:
User-level threads exploit a simple but profound insight: thread management is just data structure manipulation and control flow transfer. Creating a thread is creating a data structure (thread control block) and a stack. Switching between threads is saving registers to one location and loading them from another. Scheduling is selecting which data structure to activate next.
All of these operations can be performed entirely in user space, without involving the kernel at all.
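To see how little kernel support this requires, here is a small self-contained demonstration using the standard POSIX `ucontext` API (still shipped by glibc; this is an existing facility, separate from the library built on this page): two contexts on ordinary heap-allocated stacks hand control back and forth entirely in user space.

```c
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, worker_ctx;

static void worker(void) {
    printf("worker: running on its own heap-allocated stack\n");
    swapcontext(&worker_ctx, &main_ctx);   /* Save worker, resume main */
    printf("worker: resumed mid-function, now finishing\n");
}   /* Returning follows uc_link back to main_ctx */

int main(void) {
    char *stack = malloc(64 * 1024);       /* A plain heap buffer as a stack */

    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = stack;
    worker_ctx.uc_stack.ss_size = 64 * 1024;
    worker_ctx.uc_link = &main_ctx;        /* Where to go when worker returns */
    makecontext(&worker_ctx, worker, 0);

    printf("main: switching to worker\n");
    swapcontext(&main_ctx, &worker_ctx);   /* Save main, start worker */
    printf("main: back, switching again\n");
    swapcontext(&main_ctx, &worker_ctx);   /* Resume worker where it yielded */
    printf("main: done\n");
    free(stack);
    return 0;
}
```

No system call is involved in the switches themselves; `swapcontext` is exactly the register save/restore described above, which is what the hand-written library on this page implements directly.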
User-level threads predate kernel-level thread support. Early Unix systems (and many other operating systems) only provided process-level concurrency. User-level thread libraries like GNU Pth, the original POSIX threads (in some implementations), and M:N threading systems emerged as a way to provide concurrency without waiting for kernel support. Even today, languages like Go, Erlang, and many async runtimes use user-level thread concepts (often called 'green threads' or 'coroutines') for specific performance characteristics.
A user-level thread library is a sophisticated piece of systems software that provides the complete illusion of multi-threading within a single process. Understanding its architecture reveals how this illusion is constructed:
Every user-level thread library consists of several essential subsystems working together:
The TCB is the cornerstone of thread management. It serves the same purpose for user-level threads that the PCB (Process Control Block) serves for processes in the kernel—it is the complete representation of the thread's state when the thread is not running.
A typical TCB structure contains:
```c
/* Thread Control Block for User-Level Thread Library */

#include <stdint.h>
#include <stddef.h>

/* struct list_head and spinlock_t are assumed utility types here:
 * an intrusive doubly linked list node and a lock, in the style of
 * the Linux kernel headers. */

typedef enum {
    THREAD_READY,       /* Ready to run, in the ready queue */
    THREAD_RUNNING,     /* Currently executing on CPU */
    THREAD_BLOCKED,     /* Waiting on synchronization primitive */
    THREAD_TERMINATED   /* Completed execution, awaiting cleanup */
} thread_state_t;

typedef struct thread_context {
    /* CPU register state - saved/restored during context switch.
     * Field order fixes the offsets used by the context_switch
     * assembly below (rsp at 0, rbp at 8, ..., rip at 56). */
    uint64_t rsp;   /* Stack pointer */
    uint64_t rbp;   /* Frame pointer */
    uint64_t rbx;   /* Callee-saved registers */
    uint64_t r12;
    uint64_t r13;
    uint64_t r14;
    uint64_t r15;
    uint64_t rip;   /* Instruction pointer (return address) */

    /* Floating-point/SIMD state (if needed) */
    uint8_t fpu_state[512] __attribute__((aligned(16)));
} thread_context_t;

typedef struct thread_control_block {
    /* Identity */
    uint64_t tid;                    /* Unique thread ID */
    char name[64];                   /* Thread name (debugging) */

    /* Execution state */
    thread_state_t state;            /* Current thread state */
    thread_context_t context;        /* Saved CPU context */

    /* Stack management */
    void *stack_base;                /* Bottom of allocated stack */
    size_t stack_size;               /* Size of stack in bytes */
    void *stack_pointer;             /* Current/saved stack pointer */

    /* Thread function */
    void *(*entry_function)(void*);  /* Initial function to execute */
    void *argument;                  /* Argument passed to function */
    void *return_value;              /* Value returned by thread */

    /* Scheduling metadata */
    int priority;                    /* Thread priority level */
    uint64_t time_slice;             /* Remaining time quantum */
    uint64_t total_runtime;          /* CPU time consumed */

    /* Thread-local storage */
    void **tls_slots;                /* Thread-local data array */
    size_t tls_count;                /* Number of TLS slots */

    /* Synchronization */
    struct thread_control_block *join_target;  /* Thread waiting for us */
    struct list_head blocked_queue_link;       /* Link in blocked queue */

    /* Linking */
    struct list_head ready_queue_link;         /* Link in scheduler queue */
    struct list_head all_threads_link;         /* Link in global thread list */
} tcb_t;

/* Global thread library state */
typedef struct thread_library {
    tcb_t *current_thread;           /* Currently executing thread */
    tcb_t *main_thread;              /* Original main thread */
    struct list_head ready_queue;    /* Queue of runnable threads */
    struct list_head all_threads;    /* List of all threads */
    uint64_t next_tid;               /* Next thread ID to assign */
    spinlock_t library_lock;         /* Protects library state */
} thread_library_t;

static thread_library_t thread_lib;
```

Notice that the TCB keeps related data together: the context save area, stack information, and scheduling metadata are all in one structure. This design improves cache locality during thread operations. When the scheduler examines a thread's state, all the necessary information is likely in the same cache line or adjacent lines. This is one reason user-level thread switches are so fast—they're optimized for the memory hierarchy.
Every thread requires its own stack—a private region of memory for function call frames, local variables, return addresses, and saved registers. In kernel-level threads, the kernel allocates stacks and establishes guard pages. For user-level threads, the library must handle this entirely within user space.
User-level thread libraries typically use one of several approaches for stack management:
| Strategy | Mechanism | Advantages | Disadvantages |
|---|---|---|---|
| Fixed-Size Stacks | Allocate fixed-size memory block (e.g., 1MB) for each thread | Simple, predictable, easy stack overflow detection | Wasteful if threads don't use full stack; limits thread count |
| Segmented Stacks | Allocate small initial stack; grow by linking new segments | Memory efficient; supports many threads | Complex; segment transitions add overhead; fragmentation |
| Contiguous Growth | Start small; reallocate larger when needed by copying | Simple memory model; no segmentation overhead | Copy cost on growth; requires relocation support |
| Stack Pooling | Maintain pool of pre-allocated stacks; reuse across threads | Fast thread creation; reduced fragmentation | Fixed total memory; pool sizing challenges |
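To make the pooling row concrete, here is a minimal sketch of a stack pool layered on the fixed-size allocator shown in the next listing. The pool size and helper names here are illustrative, not part of any standard API.

```c
/* Minimal stack-pool sketch. Reuses the stack_allocation_t type and the
 * allocate_thread_stack()/free_thread_stack() helpers defined below.
 * STACK_POOL_MAX and the pool_* names are illustrative. */
#define STACK_POOL_MAX 64

static stack_allocation_t *stack_pool[STACK_POOL_MAX];
static size_t stack_pool_count = 0;

/* Get a stack: reuse a pooled one if available, else allocate fresh. */
stack_allocation_t* pool_acquire_stack(void) {
    if (stack_pool_count > 0) {
        return stack_pool[--stack_pool_count];  /* O(1) reuse, no mmap */
    }
    return allocate_thread_stack(0);            /* 0 => default size */
}

/* Return a stack to the pool, or release it if the pool is full. */
void pool_release_stack(stack_allocation_t *stack) {
    if (stack_pool_count < STACK_POOL_MAX) {
        stack_pool[stack_pool_count++] = stack;
    } else {
        free_thread_stack(stack);
    }
}
```

Because every pooled stack has the same size, thread creation on the hot path becomes a pointer pop rather than an mmap/mprotect pair, at the cost of holding memory for idle stacks.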
The most common approach for robustness is fixed-size stacks with guard pages. Here's how a sophisticated implementation handles this:
```c
#include <sys/mman.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>    /* getpagesize() */

#define DEFAULT_STACK_SIZE (1024 * 1024)   /* 1 MB per thread */
#define GUARD_PAGE_SIZE    (4096)          /* 4 KB guard page */

/* Cleanup trampoline, provided elsewhere in the library (assembly) */
extern void thread_exit_trampoline(void);

typedef struct stack_allocation {
    void *base_address;    /* Start of allocated region */
    void *stack_bottom;    /* Usable stack starts here */
    void *stack_top;       /* Stack pointer starts near here */
    size_t total_size;     /* Total allocation size */
    size_t usable_size;    /* Actual stack space */
} stack_allocation_t;

/*
 * Allocate a thread stack with guard page protection.
 *
 * Memory layout (addresses increase upward):
 *   +-------------------+ <-- stack_top (stack pointer starts near here)
 *   |                   |
 *   |   Usable Stack    |
 *   |   (PROT_READ |    |
 *   |    PROT_WRITE)    |
 *   |                   |
 *   +-------------------+ <-- stack_bottom (stack grows down toward here)
 *   |    Guard Page     | <-- mprotect(PROT_NONE): SIGSEGV on access
 *   +-------------------+ <-- base_address
 *
 * The guard page provides stack overflow detection without runtime checks.
 * If the thread uses too much stack, it will write to the guard page and
 * receive SIGSEGV - an immediate, deterministic crash rather than silent
 * memory corruption.
 */
stack_allocation_t* allocate_thread_stack(size_t requested_size) {
    size_t page_size = getpagesize();
    size_t usable_size = requested_size ? requested_size : DEFAULT_STACK_SIZE;

    /* Round up to page boundary */
    usable_size = (usable_size + page_size - 1) & ~(page_size - 1);

    /* Total size includes guard page */
    size_t total_size = usable_size + GUARD_PAGE_SIZE;

    /* Allocate using mmap for page-aligned memory */
    void *base = mmap(NULL, total_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        return NULL;    /* Allocation failed */
    }

    /* Install guard page at the bottom (lowest address) */
    if (mprotect(base, GUARD_PAGE_SIZE, PROT_NONE) != 0) {
        munmap(base, total_size);
        return NULL;
    }

    /* Prepare result structure */
    stack_allocation_t *stack = malloc(sizeof(stack_allocation_t));
    if (!stack) {
        munmap(base, total_size);
        return NULL;
    }

    stack->base_address = base;
    stack->stack_bottom = (char*)base + GUARD_PAGE_SIZE;
    stack->stack_top    = (char*)base + total_size;
    stack->total_size   = total_size;
    stack->usable_size  = usable_size;

    return stack;
}

void free_thread_stack(stack_allocation_t *stack) {
    if (stack) {
        munmap(stack->base_address, stack->total_size);
        free(stack);
    }
}

/*
 * Initialize stack for new thread execution.
 * Sets up the initial stack frame so the thread starts executing
 * its entry function when first scheduled.
 */
void* initialize_thread_stack(stack_allocation_t *stack,
                              void *(*entry)(void*), void *arg) {
    /*
     * Stack grows downward. Initial setup:
     *   - Place fake return address (thread_exit trampoline)
     *   - Set up initial stack pointer at natural alignment
     *   - The context switch will restore this pointer and "return"
     *     to the entry function
     *
     * entry/arg themselves are installed into the saved register
     * context by thread_create (shown later on this page).
     */
    (void)entry;
    (void)arg;

    /* Start near top, with ABI-required alignment (16-byte on x86-64) */
    uintptr_t sp = (uintptr_t)stack->stack_top;
    sp = sp & ~0xF;   /* Align to 16 bytes */
    sp -= 8;          /* Space for return address (call convention) */

    /* The return address points to our cleanup trampoline */
    *(void**)sp = (void*)thread_exit_trampoline;

    return (void*)sp;
}
```

Without guard pages, stack overflow corrupts adjacent memory—often another thread's stack or heap data. This creates non-deterministic bugs that manifest far from the actual overflow. Guard pages convert this undefined behavior into an immediate, debuggable crash. Always use guard pages in production thread libraries.
The context switch is the heart of any threading system. It's the mechanism by which one thread stops executing and another resumes. In user-level threads, this happens entirely in user space—no kernel involvement whatsoever.
A thread's execution state consists of all the information needed to resume execution exactly where it left off: the callee-saved CPU registers, the stack pointer and instruction pointer, the contents of its private stack, and any floating-point or SIMD state in use.
The context switch itself is typically implemented in assembly for maximum control and minimal overhead. Here's a complete x86-64 implementation:
```asm
/*
 * context_switch.s - User-Level Thread Context Switch
 *
 * void context_switch(thread_context_t *current, thread_context_t *next);
 *
 * Calling convention (System V AMD64 ABI):
 *   - current context pointer in RDI
 *   - next context pointer in RSI
 *   - Return via RIP loaded from next context
 *
 * This function:
 *   1. Saves current thread's callee-saved registers to *current
 *   2. Saves current stack pointer to *current
 *   3. Loads next thread's stack pointer from *next
 *   4. Restores next thread's callee-saved registers from *next
 *   5. Returns (actually resumes next thread at its saved RIP)
 */

    .text
    .globl context_switch
    .type  context_switch, @function

context_switch:
    /* ========================================
     * PHASE 1: Save current thread's context
     * ======================================== */

    /* Save callee-saved registers to current context */
    movq %rbx, 16(%rdi)      /* Save RBX at offset 16 */
    movq %r12, 24(%rdi)      /* Save R12 at offset 24 */
    movq %r13, 32(%rdi)      /* Save R13 at offset 32 */
    movq %r14, 40(%rdi)      /* Save R14 at offset 40 */
    movq %r15, 48(%rdi)      /* Save R15 at offset 48 */
    movq %rbp, 8(%rdi)       /* Save RBP at offset 8 */

    /* Save return address (where to resume this thread) */
    /* The return address is on the stack, placed by CALL */
    movq (%rsp), %rax
    movq %rax, 56(%rdi)      /* Save RIP at offset 56 */

    /* Save the stack pointer as it will be AFTER this call returns
     * (return address popped), so the resumed thread sees the same
     * RSP a normal return would have produced */
    leaq 8(%rsp), %rax
    movq %rax, 0(%rdi)       /* Save RSP at offset 0 */

    /* ========================================
     * PHASE 2: Load next thread's context
     * ======================================== */

    /* Load stack pointer FIRST - subsequent pushes use new stack */
    movq 0(%rsi), %rsp       /* Load RSP from next context */

    /* Restore callee-saved registers from next context */
    movq 16(%rsi), %rbx      /* Restore RBX */
    movq 24(%rsi), %r12      /* Restore R12 */
    movq 32(%rsi), %r13      /* Restore R13 */
    movq 40(%rsi), %r14      /* Restore R14 */
    movq 48(%rsi), %r15      /* Restore R15 */
    movq 8(%rsi), %rbp       /* Restore RBP */

    /* Push return address and return.
     * This makes us "return" to the next thread's saved location */
    movq 56(%rsi), %rax      /* Load saved RIP */
    pushq %rax               /* Push it as return address */
    ret                      /* "Return" to next thread */

    .size context_switch, .-context_switch

/*
 * thread_start_wrapper - Initial entry point for new threads
 *
 * When a thread is first scheduled, its context is set up to "return"
 * here. This wrapper calls the actual thread function and handles
 * thread termination when the function returns.
 *
 * Expected register setup (from thread_create initialization):
 *   R12: Thread entry function pointer
 *   R13: Thread argument pointer
 */
    .globl thread_start_wrapper
    .type  thread_start_wrapper, @function

thread_start_wrapper:
    /* Move argument to RDI (first parameter register) */
    movq %r13, %rdi

    /* Call the thread's entry function */
    callq *%r12

    /* Thread function returned - clean up.
     * RAX contains the return value */
    movq %rax, %rdi          /* Pass return value to thread_exit */
    callq thread_exit

    /* thread_exit never returns, but just in case... */
    ud2                      /* Undefined instruction trap */

    .size thread_start_wrapper, .-thread_start_wrapper
```

Let's trace through exactly what happens during a context switch from Thread A to Thread B:
1. Thread A calls `yield()` or its time slice expires; the library's scheduler picks Thread B and calls `context_switch(&A->context, &B->context)`.
2. Phase 1 executes on A's stack: A's callee-saved registers, return address, and stack pointer are stored into A's context structure.
3. Phase 2 loads B's saved stack pointer; from that instruction on, all stack operations use B's stack.
4. B's callee-saved registers are restored, B's saved RIP is pushed, and `ret` transfers control to it.
5. Thread B resumes execution at its saved return address, on its own stack, with its registers restored.

Notice that the context switch is entirely symmetric. Thread B was previously in the middle of its own context_switch call when it got switched away. When we restore B and return, B continues executing as if its context_switch just returned normally. This symmetry is what makes cooperative multitasking elegant—every thread sees the same programming model.
User-level thread libraries implement their own scheduler—a function that runs in user space and decides which thread executes next. Unlike kernel schedulers, which are triggered by hardware interrupts and have access to precise timing, user-level schedulers depend on explicit yield points or signal-driven preemption.
A simple round-robin scheduler demonstrates the core concepts:
```c
/*
 * User-Level Thread Scheduler - Round Robin Implementation
 *
 * This scheduler maintains a queue of ready threads and cycles
 * through them in order. Each thread runs until it explicitly
 * yields, blocks, or terminates.
 *
 * Note: this block assumes tcb_t carries next/prev pointers for
 * queue linking (a simpler scheme than the list_head links shown
 * earlier), and that panic() aborts with a message.
 */

/* Ready queue - doubly linked list for O(1) enqueue/dequeue */
static tcb_t *ready_queue_head = NULL;
static tcb_t *ready_queue_tail = NULL;
static tcb_t *current_thread   = NULL;

/* Add thread to end of ready queue */
void enqueue_ready(tcb_t *thread) {
    thread->next = NULL;
    thread->state = THREAD_READY;

    if (ready_queue_tail) {
        ready_queue_tail->next = thread;
        thread->prev = ready_queue_tail;
        ready_queue_tail = thread;
    } else {
        ready_queue_head = ready_queue_tail = thread;
        thread->prev = NULL;
    }
}

/* Remove and return thread from front of ready queue */
tcb_t* dequeue_ready(void) {
    if (!ready_queue_head) {
        return NULL;    /* Queue empty */
    }

    tcb_t *thread = ready_queue_head;
    ready_queue_head = thread->next;

    if (ready_queue_head) {
        ready_queue_head->prev = NULL;
    } else {
        ready_queue_tail = NULL;    /* Queue now empty */
    }

    thread->next = thread->prev = NULL;
    return thread;
}

/*
 * schedule() - Select and switch to next ready thread
 *
 * This is the core scheduling decision. Called when:
 *   - Current thread yields voluntarily
 *   - Current thread blocks on synchronization
 *   - Current thread terminates
 */
void schedule(void) {
    tcb_t *next = dequeue_ready();

    if (!next) {
        /* No ready threads - could idle, spin, or exit */
        if (current_thread && current_thread->state == THREAD_RUNNING) {
            /* Only current thread exists, keep running */
            return;
        }
        /* All threads blocked or terminated - error or finished */
        panic("No runnable threads!");
    }

    tcb_t *prev = current_thread;
    current_thread = next;
    next->state = THREAD_RUNNING;

    if (prev && prev != next) {
        /* Different thread selected - context switch */
        context_switch(&prev->context, &next->context);
        /* When we return here, we've been switched back to */
    }
}

/*
 * yield() - Voluntarily give up CPU to another thread
 *
 * This is the primary mechanism for cooperative multitasking.
 * The current thread remains runnable but lets others execute.
 */
void thread_yield(void) {
    if (current_thread) {
        /* Add current thread back to ready queue */
        enqueue_ready(current_thread);
    }
    schedule();
}

/*
 * block() - Put current thread to sleep
 *
 * Used when a thread must wait for some condition.
 * The thread is NOT added to ready queue.
 */
void thread_block(void) {
    if (current_thread) {
        current_thread->state = THREAD_BLOCKED;
    }
    schedule();   /* Will not pick current thread since not in ready queue */
}

/*
 * unblock() - Wake up a blocked thread
 *
 * Called when the condition a thread was waiting for is satisfied.
 */
void thread_unblock(tcb_t *thread) {
    if (thread && thread->state == THREAD_BLOCKED) {
        enqueue_ready(thread);    /* Back to ready queue */
    }
}
```

User-level thread libraries can implement any scheduling policy since they have complete control over the scheduler:
| Policy | Description | Use Case |
|---|---|---|
| Round Robin | Cycle through threads in order, each gets equal turns | General purpose; fair; simple to implement |
| Priority Queue | Higher priority threads run first; same priority uses FIFO (sketched below) | Real-time-like behavior; background vs foreground work |
| Lottery Scheduling | Random selection weighted by 'tickets'; probabilistically fair | Research; proportional share systems |
| Work Stealing | Idle threads steal from busy threads' local queues | Multi-queue systems; improves load balance |
| Fair Share | Track CPU usage per thread/group; favor under-served | Multi-user fairness; prevent starvation |
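To make one of these policies concrete, here is a minimal sketch of the priority policy from the table, built on the same next/prev-linked ready queue as the round-robin scheduler above. The O(n) scan is for clarity only; a real library would keep per-priority queues or a heap.

```c
/* Select the highest-priority ready thread (sketch). Uses the priority
 * field from the TCB and the ready_queue_head/tail globals above. */
tcb_t* dequeue_highest_priority(void) {
    if (!ready_queue_head) {
        return NULL;
    }

    tcb_t *best = ready_queue_head;
    for (tcb_t *t = ready_queue_head->next; t; t = t->next) {
        if (t->priority > best->priority) {
            best = t;   /* Strictly greater keeps FIFO order within a priority */
        }
    }

    /* Unlink best from the doubly linked ready queue */
    if (best->prev) best->prev->next = best->next;
    else            ready_queue_head = best->next;
    if (best->next) best->next->prev = best->prev;
    else            ready_queue_tail = best->prev;
    best->next = best->prev = NULL;

    return best;
}
```

Swapping this in for `dequeue_ready()` inside `schedule()` changes the policy without touching the context-switch machinery, which is exactly the flexibility the table describes.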
Pure user-level threads are typically cooperative—threads must explicitly yield. Preemption requires handling signals (SIGALRM) to interrupt threads, which reintroduces complexity and overhead. Many production libraries use hybrid approaches: cooperative yields at strategic points with signal-based preemption as a backstop for misbehaving threads.
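Here is a minimal sketch of that signal-based backstop, assuming the `thread_yield()` from the scheduler section above. One caveat baked into real designs: switching contexts from inside a signal handler is not async-signal-safe in general, so production libraries typically just set a flag in the handler and switch at the next safe point. This sketch switches directly for brevity.

```c
#include <signal.h>
#include <string.h>
#include <sys/time.h>

extern void thread_yield(void);   /* From the scheduler above */

static void preempt_handler(int sig) {
    (void)sig;
    thread_yield();   /* Force the running thread to give up the CPU */
}

/* Arrange for SIGALRM every interval_usec microseconds of wall-clock time */
void enable_preemption(long interval_usec) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = preempt_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval timer;
    timer.it_interval.tv_sec  = 0;
    timer.it_interval.tv_usec = interval_usec;
    timer.it_value.tv_sec     = 0;
    timer.it_value.tv_usec    = interval_usec;
    setitimer(ITIMER_REAL, &timer, NULL);
}
```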
The thread library exposes an API that applications use to create, manage, and synchronize threads. This API typically mirrors POSIX threads semantics, providing familiarity for developers while implementing everything in user space.
```c
/*
 * Thread API Implementation - Core Functions
 *
 * Note: this block assumes tcb_t also carries a stack field
 * (stack_allocation_t *stack), a joiner pointer, and an all_next
 * link, and that find_thread_by_id() / remove_from_all_threads()
 * are helpers defined elsewhere in the library.
 */

#include "thread.h"
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

extern void thread_start_wrapper(void);   /* Assembly entry shim (above) */

/* Global state */
static uint64_t next_tid = 1;
static tcb_t *all_threads = NULL;   /* Global thread list */

/*
 * thread_create() - Create a new thread
 *
 * Allocates TCB, stack, initializes context, and adds to ready queue.
 * Returns 0 on success, -1 on failure.
 */
int thread_create(thread_id_t *tid, const thread_attr_t *attr,
                  void *(*start_routine)(void*), void *arg) {
    /* Allocate thread control block */
    tcb_t *tcb = calloc(1, sizeof(tcb_t));
    if (!tcb) {
        return -1;    /* Out of memory */
    }

    /* Assign thread ID */
    tcb->tid = next_tid++;
    if (tid) {
        *tid = tcb->tid;
    }

    /* Determine stack size (use default or attribute) */
    size_t stack_size = attr ? attr->stack_size : DEFAULT_STACK_SIZE;

    /* Allocate stack with guard page */
    tcb->stack = allocate_thread_stack(stack_size);
    if (!tcb->stack) {
        free(tcb);
        return -1;    /* Stack allocation failed */
    }

    /* Save thread function and argument */
    tcb->entry_function = start_routine;
    tcb->argument = arg;
    tcb->state = THREAD_READY;

    /* Initialize context for first scheduling.
     *
     * When this thread is first context-switched to, we want it
     * to start executing thread_start_wrapper, which will call
     * the actual entry function with the argument.
     *
     * We set up the context so that:
     *   - RSP points to initial stack position
     *   - R12 holds the entry function pointer
     *   - R13 holds the argument pointer
     *   - RIP (saved) points to thread_start_wrapper
     */

    /* Initialize stack pointer near top, properly aligned */
    uintptr_t sp = (uintptr_t)tcb->stack->stack_top;
    sp = (sp - 128) & ~0xF;    /* Red zone + alignment */

    /* Clear context, then set up initial values */
    memset(&tcb->context, 0, sizeof(thread_context_t));
    tcb->context.rsp = sp;
    tcb->context.r12 = (uint64_t)start_routine;          /* Entry function */
    tcb->context.r13 = (uint64_t)arg;                    /* Argument */
    tcb->context.rip = (uint64_t)thread_start_wrapper;

    /* Add to global thread list */
    tcb->all_next = all_threads;
    all_threads = tcb;

    /* Add to ready queue - thread is now schedulable */
    enqueue_ready(tcb);

    return 0;
}

/*
 * thread_exit() - Terminate the calling thread
 *
 * Cleans up the thread, stores return value, and wakes any joiners.
 */
void thread_exit(void *retval) {
    tcb_t *self = current_thread;

    /* Store return value for thread_join */
    self->return_value = retval;
    self->state = THREAD_TERMINATED;

    /* Wake any thread waiting in thread_join */
    if (self->joiner) {
        thread_unblock(self->joiner);
    }

    /* Schedule next thread - we will never return here */
    schedule();

    /* Should never reach here */
    __builtin_unreachable();
}

/*
 * thread_join() - Wait for thread termination
 *
 * Blocks until the target thread terminates, then retrieves
 * its return value.
 */
int thread_join(thread_id_t tid, void **retval) {
    /* Find the target thread */
    tcb_t *target = find_thread_by_id(tid);
    if (!target) {
        return -1;    /* Thread not found */
    }

    /* If thread hasn't terminated yet, wait */
    if (target->state != THREAD_TERMINATED) {
        target->joiner = current_thread;    /* Record who's waiting */
        thread_block();                     /* Sleep until target exits */
    }

    /* Thread terminated - get return value */
    if (retval) {
        *retval = target->return_value;
    }

    /* Clean up terminated thread resources */
    remove_from_all_threads(target);
    free_thread_stack(target->stack);
    free(target);

    return 0;
}

/*
 * thread_self() - Get current thread's ID
 */
thread_id_t thread_self(void) {
    return current_thread ? current_thread->tid : 0;
}
```

A well-designed thread API should be familiar (matching POSIX patterns where possible), minimal (exposing only what's necessary), and safe (making incorrect usage difficult). The separation between thread_create(), which returns immediately, and thread_join(), which blocks, allows for flexible concurrent execution patterns.
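To make that create/join separation concrete, here is a small usage sketch against this API. The worker function and thread count are illustrative.

```c
#include "thread.h"
#include <stdio.h>

static void *worker(void *arg) {
    long n = (long)arg;
    printf("worker %ld running\n", n);
    thread_yield();                /* Cooperatively let other workers run */
    return (void *)(n * n);
}

int main(void) {
    thread_id_t tids[4];

    /* All four creates return immediately; the workers run interleaved */
    for (long i = 0; i < 4; i++) {
        thread_create(&tids[i], NULL, worker, (void *)i);
    }

    /* Each join blocks until the corresponding worker has exited */
    for (int i = 0; i < 4; i++) {
        void *result;
        thread_join(tids[i], &result);
        printf("worker %d returned %ld\n", i, (long)result);
    }
    return 0;
}
```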
User-level thread libraries must provide synchronization primitives—mutexes, condition variables, semaphores—implemented entirely within user space. Since only one user-level thread executes at a time (within a process on a single CPU), some synchronization is simpler, but blocking behavior requires integration with the scheduler.
```c
/*
 * User-Level Mutex Implementation
 *
 * Since only one user-level thread runs at a time (no true parallelism
 * within the process), we don't need atomic instructions for basic
 * mutual exclusion. The scheduler provides the synchronization.
 *
 * Note: this block assumes tcb_t carries a wait_next link used for
 * synchronization wait queues.
 */

typedef struct {
    int locked;           /* 0 = unlocked, 1 = locked */
    tcb_t *owner;         /* Thread holding the lock */
    tcb_t *wait_queue;    /* Threads waiting for lock */
} user_mutex_t;

void mutex_init(user_mutex_t *mutex) {
    mutex->locked = 0;
    mutex->owner = NULL;
    mutex->wait_queue = NULL;
}

void mutex_lock(user_mutex_t *mutex) {
    /* No atomics needed - we won't be preempted mid-check
     * in a pure cooperative threading model */
    while (mutex->locked) {
        /* Mutex is held - add ourselves to wait queue and block */
        current_thread->wait_next = mutex->wait_queue;
        mutex->wait_queue = current_thread;
        thread_block();
        /* When we wake up, loop and check again */
    }

    /* Acquire the lock */
    mutex->locked = 1;
    mutex->owner = current_thread;
}

void mutex_unlock(user_mutex_t *mutex) {
    /* Verify we own the lock */
    if (mutex->owner != current_thread) {
        panic("Unlock by non-owner!");
    }

    mutex->locked = 0;
    mutex->owner = NULL;

    /* Wake one waiting thread if any */
    if (mutex->wait_queue) {
        tcb_t *waiter = mutex->wait_queue;
        mutex->wait_queue = waiter->wait_next;
        waiter->wait_next = NULL;
        thread_unblock(waiter);
    }
}

/*
 * User-Level Condition Variable Implementation
 */

typedef struct {
    tcb_t *wait_queue;    /* Threads waiting on condition */
} user_cond_t;

void cond_init(user_cond_t *cond) {
    cond->wait_queue = NULL;
}

void cond_wait(user_cond_t *cond, user_mutex_t *mutex) {
    /* Add to wait queue before releasing mutex.
     * This prevents missed signals */
    current_thread->wait_next = cond->wait_queue;
    cond->wait_queue = current_thread;

    /* Release mutex and block atomically (scheduler perspective) */
    mutex_unlock(mutex);
    thread_block();

    /* Woken up - reacquire mutex before returning */
    mutex_lock(mutex);
}

void cond_signal(user_cond_t *cond) {
    /* Wake one waiter */
    if (cond->wait_queue) {
        tcb_t *waiter = cond->wait_queue;
        cond->wait_queue = waiter->wait_next;
        waiter->wait_next = NULL;
        thread_unblock(waiter);
    }
}

void cond_broadcast(user_cond_t *cond) {
    /* Wake all waiters */
    while (cond->wait_queue) {
        tcb_t *waiter = cond->wait_queue;
        cond->wait_queue = waiter->wait_next;
        waiter->wait_next = NULL;
        thread_unblock(waiter);
    }
}
```

This simple implementation relies on the fact that only one user-level thread executes at a time. On a single-core system with cooperative threading, there's no parallel access to mutex state. If you introduce preemptive user-level threads (via signals) or multiprocessor support, you must add proper atomic operations and memory barriers.
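As a usage sketch, the classic bounded buffer maps directly onto these primitives. The `put`/`get` names and buffer size are illustrative, and `mutex_init`/`cond_init` must be called before first use.

```c
#define BUF_SIZE 8

static int buffer[BUF_SIZE];
static int count = 0;
static user_mutex_t buf_mutex;
static user_cond_t not_full, not_empty;

/* Producer side: wait for space, insert, wake a consumer */
void put(int item) {
    mutex_lock(&buf_mutex);
    while (count == BUF_SIZE) {
        cond_wait(&not_full, &buf_mutex);    /* Wait for space */
    }
    buffer[count++] = item;
    cond_signal(&not_empty);                 /* Wake a consumer */
    mutex_unlock(&buf_mutex);
}

/* Consumer side: wait for data, remove, wake a producer */
int get(void) {
    mutex_lock(&buf_mutex);
    while (count == 0) {
        cond_wait(&not_empty, &buf_mutex);   /* Wait for data */
    }
    int item = buffer[--count];
    cond_signal(&not_full);                  /* Wake a producer */
    mutex_unlock(&buf_mutex);
    return item;
}
```

Note the `while` loops around `cond_wait`: even in a cooperative library, another thread may consume the condition between the signal and the waiter reacquiring the mutex, so re-checking after waking is the correct pattern.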
We have explored the complete architecture of user-level thread libraries—sophisticated systems software that implements threading entirely within user space, invisible to the operating system kernel.
You now understand the core architecture of user-level thread libraries. In the next page, we'll explore why this architecture enables remarkably fast context switching—often 100x faster than kernel-level threads—and the performance characteristics that result.