In modern software development, concurrency is not optional—it's essential. But before operating systems provided native thread support, and even today in specialized contexts, user-level thread libraries provide a remarkably elegant solution: concurrency without kernel involvement.
A user-level thread library is a runtime system that exists entirely within the address space of a process, implementing thread creation, scheduling, synchronization, and destruction without ever crossing the boundary into kernel space. This architectural decision—keeping threads invisible to the kernel—creates a fascinating set of tradeoffs that every systems engineer must understand.
By the end of this page, you will understand the complete architecture of user-level thread libraries, how they implement thread abstractions using only user-space mechanisms, why this approach offers significant performance advantages, and the fundamental architectural constraints that govern their design.
A user-level thread (ULT), also known as a lightweight thread or fiber, is a thread of execution managed entirely by a user-space library rather than by the operating system kernel. From the kernel's perspective, a process using user-level threads appears to be a single-threaded process—the kernel schedules the process as a unit, completely unaware of the internal threading happening within.
This creates a layered abstraction model: application code runs on threads the library provides; the library's scheduler and thread control blocks live inside the process's address space; and beneath it all, the kernel sees and schedules a single process.
The fundamental principle:
User-level threads exploit a simple but profound insight: thread management is just data structure manipulation and control flow transfer. Creating a thread is creating a data structure (thread control block) and a stack. Switching between threads is saving registers to one location and loading them from another. Scheduling is selecting which data structure to activate next.
All of these operations can be performed entirely in user space, without involving the kernel at all.
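To see how little kernel support this requires, here is a small self-contained demonstration using the standard POSIX `ucontext` API (still shipped by glibc; this is an existing facility, separate from the library built on this page): two contexts on ordinary heap-allocated stacks hand control back and forth entirely in user space.

```c
#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, worker_ctx;

static void worker(void) {
    printf("worker: running on its own heap-allocated stack\n");
    swapcontext(&worker_ctx, &main_ctx);   /* Save worker, resume main */
    printf("worker: resumed mid-function, now finishing\n");
}   /* Returning follows uc_link back to main_ctx */

int main(void) {
    char *stack = malloc(64 * 1024);       /* A plain heap buffer as a stack */

    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = stack;
    worker_ctx.uc_stack.ss_size = 64 * 1024;
    worker_ctx.uc_link = &main_ctx;        /* Where to go when worker returns */
    makecontext(&worker_ctx, worker, 0);

    printf("main: switching to worker\n");
    swapcontext(&main_ctx, &worker_ctx);   /* Save main, start worker */
    printf("main: back, switching again\n");
    swapcontext(&main_ctx, &worker_ctx);   /* Resume worker where it yielded */
    printf("main: done\n");
    free(stack);
    return 0;
}
```

No system call is involved in the switches themselves; `swapcontext` is exactly the register save/restore described above, which is what the hand-written library on this page implements directly.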
User-level threads predate kernel-level thread support. Early Unix systems (and many other operating systems) only provided process-level concurrency. User-level thread libraries like GNU Pth, the original POSIX threads (in some implementations), and M:N threading systems emerged as a way to provide concurrency without waiting for kernel support. Even today, languages like Go, Erlang, and many async runtimes use user-level thread concepts (often called 'green threads' or 'coroutines') for specific performance characteristics.
A user-level thread library is a sophisticated piece of systems software that provides the complete illusion of multi-threading within a single process. Understanding its architecture reveals how this illusion is constructed:
Every user-level thread library consists of several essential subsystems working together:
The TCB is the cornerstone of thread management. It serves the same purpose for user-level threads that the PCB (Process Control Block) serves for processes in the kernel—it is the complete representation of the thread's state when the thread is not running.
A typical TCB structure contains:
```c
/* Thread Control Block for User-Level Thread Library */

#include <stdint.h>
#include <stddef.h>

/* struct list_head and spinlock_t are assumed utility types here:
 * an intrusive doubly linked list node and a lock, in the style of
 * the Linux kernel headers. */

typedef enum {
    THREAD_READY,       /* Ready to run, in the ready queue */
    THREAD_RUNNING,     /* Currently executing on CPU */
    THREAD_BLOCKED,     /* Waiting on synchronization primitive */
    THREAD_TERMINATED   /* Completed execution, awaiting cleanup */
} thread_state_t;

typedef struct thread_context {
    /* CPU register state - saved/restored during context switch.
     * Field order fixes the offsets used by the context_switch
     * assembly below (rsp at 0, rbp at 8, ..., rip at 56). */
    uint64_t rsp;   /* Stack pointer */
    uint64_t rbp;   /* Frame pointer */
    uint64_t rbx;   /* Callee-saved registers */
    uint64_t r12;
    uint64_t r13;
    uint64_t r14;
    uint64_t r15;
    uint64_t rip;   /* Instruction pointer (return address) */

    /* Floating-point/SIMD state (if needed) */
    uint8_t fpu_state[512] __attribute__((aligned(16)));
} thread_context_t;

typedef struct thread_control_block {
    /* Identity */
    uint64_t tid;                    /* Unique thread ID */
    char name[64];                   /* Thread name (debugging) */

    /* Execution state */
    thread_state_t state;            /* Current thread state */
    thread_context_t context;        /* Saved CPU context */

    /* Stack management */
    void *stack_base;                /* Bottom of allocated stack */
    size_t stack_size;               /* Size of stack in bytes */
    void *stack_pointer;             /* Current/saved stack pointer */

    /* Thread function */
    void *(*entry_function)(void*);  /* Initial function to execute */
    void *argument;                  /* Argument passed to function */
    void *return_value;              /* Value returned by thread */

    /* Scheduling metadata */
    int priority;                    /* Thread priority level */
    uint64_t time_slice;             /* Remaining time quantum */
    uint64_t total_runtime;          /* CPU time consumed */

    /* Thread-local storage */
    void **tls_slots;                /* Thread-local data array */
    size_t tls_count;                /* Number of TLS slots */

    /* Synchronization */
    struct thread_control_block *join_target;  /* Thread waiting for us */
    struct list_head blocked_queue_link;       /* Link in blocked queue */

    /* Linking */
    struct list_head ready_queue_link;         /* Link in scheduler queue */
    struct list_head all_threads_link;         /* Link in global thread list */
} tcb_t;

/* Global thread library state */
typedef struct thread_library {
    tcb_t *current_thread;           /* Currently executing thread */
    tcb_t *main_thread;              /* Original main thread */
    struct list_head ready_queue;    /* Queue of runnable threads */
    struct list_head all_threads;    /* List of all threads */
    uint64_t next_tid;               /* Next thread ID to assign */
    spinlock_t library_lock;         /* Protects library state */
} thread_library_t;

static thread_library_t thread_lib;
```

Notice that the TCB keeps related data together: the context save area, stack information, and scheduling metadata are all in one structure. This design improves cache locality during thread operations. When the scheduler examines a thread's state, all the necessary information is likely in the same cache line or adjacent lines. This is one reason user-level thread switches are so fast—they're optimized for the memory hierarchy.
Every thread requires its own stack—a private region of memory for function call frames, local variables, return addresses, and saved registers. In kernel-level threads, the kernel allocates stacks and establishes guard pages. For user-level threads, the library must handle this entirely within user space.
User-level thread libraries typically use one of several approaches for stack management:
| Strategy | Mechanism | Advantages | Disadvantages |
|---|---|---|---|
| Fixed-Size Stacks | Allocate fixed-size memory block (e.g., 1MB) for each thread | Simple, predictable, easy stack overflow detection | Wasteful if threads don't use full stack; limits thread count |
| Segmented Stacks | Allocate small initial stack; grow by linking new segments | Memory efficient; supports many threads | Complex; segment transitions add overhead; fragmentation |
| Contiguous Growth | Start small; reallocate larger when needed by copying | Simple memory model; no segmentation overhead | Copy cost on growth; requires relocation support |
| Stack Pooling | Maintain pool of pre-allocated stacks; reuse across threads | Fast thread creation; reduced fragmentation | Fixed total memory; pool sizing challenges |
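To make the pooling row concrete, here is a minimal sketch of a stack pool layered on the fixed-size allocator shown in the next listing. The pool size and helper names here are illustrative, not part of any standard API.

```c
/* Minimal stack-pool sketch. Reuses the stack_allocation_t type and the
 * allocate_thread_stack()/free_thread_stack() helpers defined below.
 * STACK_POOL_MAX and the pool_* names are illustrative. */
#define STACK_POOL_MAX 64

static stack_allocation_t *stack_pool[STACK_POOL_MAX];
static size_t stack_pool_count = 0;

/* Get a stack: reuse a pooled one if available, else allocate fresh. */
stack_allocation_t* pool_acquire_stack(void) {
    if (stack_pool_count > 0) {
        return stack_pool[--stack_pool_count];  /* O(1) reuse, no mmap */
    }
    return allocate_thread_stack(0);            /* 0 => default size */
}

/* Return a stack to the pool, or release it if the pool is full. */
void pool_release_stack(stack_allocation_t *stack) {
    if (stack_pool_count < STACK_POOL_MAX) {
        stack_pool[stack_pool_count++] = stack;
    } else {
        free_thread_stack(stack);
    }
}
```

Because every pooled stack has the same size, thread creation on the hot path becomes a pointer pop rather than an mmap/mprotect pair, at the cost of holding memory for idle stacks.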
The most common approach for robustness is fixed-size stacks with guard pages. Here's how a sophisticated implementation handles this:
```c
#include <sys/mman.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>    /* getpagesize() */

#define DEFAULT_STACK_SIZE (1024 * 1024)   /* 1 MB per thread */
#define GUARD_PAGE_SIZE    (4096)          /* 4 KB guard page */

/* Cleanup trampoline, provided elsewhere in the library (assembly) */
extern void thread_exit_trampoline(void);

typedef struct stack_allocation {
    void *base_address;    /* Start of allocated region */
    void *stack_bottom;    /* Usable stack starts here */
    void *stack_top;       /* Stack pointer starts near here */
    size_t total_size;     /* Total allocation size */
    size_t usable_size;    /* Actual stack space */
} stack_allocation_t;

/*
 * Allocate a thread stack with guard page protection.
 *
 * Memory layout (addresses increase upward):
 *   +-------------------+ <-- stack_top (stack pointer starts near here)
 *   |                   |
 *   |   Usable Stack    |
 *   |   (PROT_READ |    |
 *   |    PROT_WRITE)    |
 *   |                   |
 *   +-------------------+ <-- stack_bottom (stack grows down toward here)
 *   |    Guard Page     | <-- mprotect(PROT_NONE): SIGSEGV on access
 *   +-------------------+ <-- base_address
 *
 * The guard page provides stack overflow detection without runtime checks.
 * If the thread uses too much stack, it will write to the guard page and
 * receive SIGSEGV - an immediate, deterministic crash rather than silent
 * memory corruption.
 */
stack_allocation_t* allocate_thread_stack(size_t requested_size) {
    size_t page_size = getpagesize();
    size_t usable_size = requested_size ? requested_size : DEFAULT_STACK_SIZE;

    /* Round up to page boundary */
    usable_size = (usable_size + page_size - 1) & ~(page_size - 1);

    /* Total size includes guard page */
    size_t total_size = usable_size + GUARD_PAGE_SIZE;

    /* Allocate using mmap for page-aligned memory */
    void *base = mmap(NULL, total_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        return NULL;    /* Allocation failed */
    }

    /* Install guard page at the bottom (lowest address) */
    if (mprotect(base, GUARD_PAGE_SIZE, PROT_NONE) != 0) {
        munmap(base, total_size);
        return NULL;
    }

    /* Prepare result structure */
    stack_allocation_t *stack = malloc(sizeof(stack_allocation_t));
    if (!stack) {
        munmap(base, total_size);
        return NULL;
    }

    stack->base_address = base;
    stack->stack_bottom = (char*)base + GUARD_PAGE_SIZE;
    stack->stack_top    = (char*)base + total_size;
    stack->total_size   = total_size;
    stack->usable_size  = usable_size;

    return stack;
}

void free_thread_stack(stack_allocation_t *stack) {
    if (stack) {
        munmap(stack->base_address, stack->total_size);
        free(stack);
    }
}

/*
 * Initialize stack for new thread execution.
 * Sets up the initial stack frame so the thread starts executing
 * its entry function when first scheduled.
 */
void* initialize_thread_stack(stack_allocation_t *stack,
                              void *(*entry)(void*), void *arg) {
    /*
     * Stack grows downward. Initial setup:
     *   - Place fake return address (thread_exit trampoline)
     *   - Set up initial stack pointer at natural alignment
     *   - The context switch will restore this pointer and "return"
     *     to the entry function
     *
     * entry/arg themselves are installed into the saved register
     * context by thread_create (shown later on this page).
     */
    (void)entry;
    (void)arg;

    /* Start near top, with ABI-required alignment (16-byte on x86-64) */
    uintptr_t sp = (uintptr_t)stack->stack_top;
    sp = sp & ~0xF;   /* Align to 16 bytes */
    sp -= 8;          /* Space for return address (call convention) */

    /* The return address points to our cleanup trampoline */
    *(void**)sp = (void*)thread_exit_trampoline;

    return (void*)sp;
}
```

Without guard pages, stack overflow corrupts adjacent memory—often another thread's stack or heap data. This creates non-deterministic bugs that manifest far from the actual overflow. Guard pages convert this undefined behavior into an immediate, debuggable crash. Always use guard pages in production thread libraries.
The context switch is the heart of any threading system. It's the mechanism by which one thread stops executing and another resumes. In user-level threads, this happens entirely in user space—no kernel involvement whatsoever.
A thread's execution state consists of all the information needed to resume execution exactly where it left off: the callee-saved CPU registers, the stack pointer and instruction pointer, the contents of its private stack, and any floating-point or SIMD state in use.
The context switch itself is typically implemented in assembly for maximum control and minimal overhead. Here's a complete x86-64 implementation:
```asm
/*
 * context_switch.s - User-Level Thread Context Switch
 *
 * void context_switch(thread_context_t *current, thread_context_t *next);
 *
 * Calling convention (System V AMD64 ABI):
 *   - current context pointer in RDI
 *   - next context pointer in RSI
 *   - Return via RIP loaded from next context
 *
 * This function:
 *   1. Saves current thread's callee-saved registers to *current
 *   2. Saves current stack pointer to *current
 *   3. Loads next thread's stack pointer from *next
 *   4. Restores next thread's callee-saved registers from *next
 *   5. Returns (actually resumes next thread at its saved RIP)
 */

    .text
    .globl context_switch
    .type  context_switch, @function

context_switch:
    /* ========================================
     * PHASE 1: Save current thread's context
     * ======================================== */

    /* Save callee-saved registers to current context */
    movq %rbx, 16(%rdi)      /* Save RBX at offset 16 */
    movq %r12, 24(%rdi)      /* Save R12 at offset 24 */
    movq %r13, 32(%rdi)      /* Save R13 at offset 32 */
    movq %r14, 40(%rdi)      /* Save R14 at offset 40 */
    movq %r15, 48(%rdi)      /* Save R15 at offset 48 */
    movq %rbp, 8(%rdi)       /* Save RBP at offset 8 */

    /* Save return address (where to resume this thread) */
    /* The return address is on the stack, placed by CALL */
    movq (%rsp), %rax
    movq %rax, 56(%rdi)      /* Save RIP at offset 56 */

    /* Save the stack pointer as it will be AFTER this call returns
     * (return address popped), so the resumed thread sees the same
     * RSP a normal return would have produced */
    leaq 8(%rsp), %rax
    movq %rax, 0(%rdi)       /* Save RSP at offset 0 */

    /* ========================================
     * PHASE 2: Load next thread's context
     * ======================================== */

    /* Load stack pointer FIRST - subsequent pushes use new stack */
    movq 0(%rsi), %rsp       /* Load RSP from next context */

    /* Restore callee-saved registers from next context */
    movq 16(%rsi), %rbx      /* Restore RBX */
    movq 24(%rsi), %r12      /* Restore R12 */
    movq 32(%rsi), %r13      /* Restore R13 */
    movq 40(%rsi), %r14      /* Restore R14 */
    movq 48(%rsi), %r15      /* Restore R15 */
    movq 8(%rsi), %rbp       /* Restore RBP */

    /* Push return address and return.
     * This makes us "return" to the next thread's saved location */
    movq 56(%rsi), %rax      /* Load saved RIP */
    pushq %rax               /* Push it as return address */
    ret                      /* "Return" to next thread */

    .size context_switch, .-context_switch

/*
 * thread_start_wrapper - Initial entry point for new threads
 *
 * When a thread is first scheduled, its context is set up to "return"
 * here. This wrapper calls the actual thread function and handles
 * thread termination when the function returns.
 *
 * Expected register setup (from thread_create initialization):
 *   R12: Thread entry function pointer
 *   R13: Thread argument pointer
 */
    .globl thread_start_wrapper
    .type  thread_start_wrapper, @function

thread_start_wrapper:
    /* Move argument to RDI (first parameter register) */
    movq %r13, %rdi

    /* Call the thread's entry function */
    callq *%r12

    /* Thread function returned - clean up.
     * RAX contains the return value */
    movq %rax, %rdi          /* Pass return value to thread_exit */
    callq thread_exit

    /* thread_exit never returns, but just in case... */
    ud2                      /* Undefined instruction trap */

    .size thread_start_wrapper, .-thread_start_wrapper
```

Let's trace through exactly what happens during a context switch from Thread A to Thread B:
1. Thread A calls `yield()` or its time slice expires; the library's scheduler picks Thread B and calls `context_switch(&A->context, &B->context)`.
2. Phase 1 executes on A's stack: A's callee-saved registers, return address, and stack pointer are stored into A's context structure.
3. Phase 2 loads B's saved stack pointer; from that instruction on, all stack operations use B's stack.
4. B's callee-saved registers are restored, B's saved RIP is pushed, and `ret` transfers control to it.
5. Thread B resumes execution at its saved return address, on its own stack, with its registers restored.

Notice that the context switch is entirely symmetric. Thread B was previously in the middle of its own context_switch call when it got switched away. When we restore B and return, B continues executing as if its context_switch just returned normally. This symmetry is what makes cooperative multitasking elegant—every thread sees the same programming model.
User-level thread libraries implement their own scheduler—a function that runs in user space and decides which thread executes next. Unlike kernel schedulers, which are triggered by hardware interrupts and have access to precise timing, user-level schedulers depend on explicit yield points or signal-driven preemption.
A simple round-robin scheduler demonstrates the core concepts:
```c
/*
 * User-Level Thread Scheduler - Round Robin Implementation
 *
 * This scheduler maintains a queue of ready threads and cycles
 * through them in order. Each thread runs until it explicitly
 * yields, blocks, or terminates.
 *
 * Note: this block assumes tcb_t carries next/prev pointers for
 * queue linking (a simpler scheme than the list_head links shown
 * earlier), and that panic() aborts with a message.
 */

/* Ready queue - doubly linked list for O(1) enqueue/dequeue */
static tcb_t *ready_queue_head = NULL;
static tcb_t *ready_queue_tail = NULL;
static tcb_t *current_thread   = NULL;

/* Add thread to end of ready queue */
void enqueue_ready(tcb_t *thread) {
    thread->next = NULL;
    thread->state = THREAD_READY;

    if (ready_queue_tail) {
        ready_queue_tail->next = thread;
        thread->prev = ready_queue_tail;
        ready_queue_tail = thread;
    } else {
        ready_queue_head = ready_queue_tail = thread;
        thread->prev = NULL;
    }
}

/* Remove and return thread from front of ready queue */
tcb_t* dequeue_ready(void) {
    if (!ready_queue_head) {
        return NULL;    /* Queue empty */
    }

    tcb_t *thread = ready_queue_head;
    ready_queue_head = thread->next;

    if (ready_queue_head) {
        ready_queue_head->prev = NULL;
    } else {
        ready_queue_tail = NULL;    /* Queue now empty */
    }

    thread->next = thread->prev = NULL;
    return thread;
}

/*
 * schedule() - Select and switch to next ready thread
 *
 * This is the core scheduling decision. Called when:
 *   - Current thread yields voluntarily
 *   - Current thread blocks on synchronization
 *   - Current thread terminates
 */
void schedule(void) {
    tcb_t *next = dequeue_ready();

    if (!next) {
        /* No ready threads - could idle, spin, or exit */
        if (current_thread && current_thread->state == THREAD_RUNNING) {
            /* Only current thread exists, keep running */
            return;
        }
        /* All threads blocked or terminated - error or finished */
        panic("No runnable threads!");
    }

    tcb_t *prev = current_thread;
    current_thread = next;
    next->state = THREAD_RUNNING;

    if (prev && prev != next) {
        /* Different thread selected - context switch */
        context_switch(&prev->context, &next->context);
        /* When we return here, we've been switched back to */
    }
}

/*
 * yield() - Voluntarily give up CPU to another thread
 *
 * This is the primary mechanism for cooperative multitasking.
 * The current thread remains runnable but lets others execute.
 */
void thread_yield(void) {
    if (current_thread) {
        /* Add current thread back to ready queue */
        enqueue_ready(current_thread);
    }
    schedule();
}

/*
 * block() - Put current thread to sleep
 *
 * Used when a thread must wait for some condition.
 * The thread is NOT added to ready queue.
 */
void thread_block(void) {
    if (current_thread) {
        current_thread->state = THREAD_BLOCKED;
    }
    schedule();   /* Will not pick current thread since not in ready queue */
}

/*
 * unblock() - Wake up a blocked thread
 *
 * Called when the condition a thread was waiting for is satisfied.
 */
void thread_unblock(tcb_t *thread) {
    if (thread && thread->state == THREAD_BLOCKED) {
        enqueue_ready(thread);    /* Back to ready queue */
    }
}
```

User-level thread libraries can implement any scheduling policy since they have complete control over the scheduler:
| Policy | Description | Use Case |
|---|---|---|
| Round Robin | Cycle through threads in order, each gets equal turns | General purpose; fair; simple to implement |
| Priority Queue | Higher priority threads run first; same priority uses FIFO (sketched below) | Real-time-like behavior; background vs foreground work |
| Lottery Scheduling | Random selection weighted by 'tickets'; probabilistically fair | Research; proportional share systems |
| Work Stealing | Idle threads steal from busy threads' local queues | Multi-queue systems; improves load balance |
| Fair Share | Track CPU usage per thread/group; favor under-served | Multi-user fairness; prevent starvation |
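To make one of these policies concrete, here is a minimal sketch of the priority policy from the table, built on the same next/prev-linked ready queue as the round-robin scheduler above. The O(n) scan is for clarity only; a real library would keep per-priority queues or a heap.

```c
/* Select the highest-priority ready thread (sketch). Uses the priority
 * field from the TCB and the ready_queue_head/tail globals above. */
tcb_t* dequeue_highest_priority(void) {
    if (!ready_queue_head) {
        return NULL;
    }

    tcb_t *best = ready_queue_head;
    for (tcb_t *t = ready_queue_head->next; t; t = t->next) {
        if (t->priority > best->priority) {
            best = t;   /* Strictly greater keeps FIFO order within a priority */
        }
    }

    /* Unlink best from the doubly linked ready queue */
    if (best->prev) best->prev->next = best->next;
    else            ready_queue_head = best->next;
    if (best->next) best->next->prev = best->prev;
    else            ready_queue_tail = best->prev;
    best->next = best->prev = NULL;

    return best;
}
```

Swapping this in for `dequeue_ready()` inside `schedule()` changes the policy without touching the context-switch machinery, which is exactly the flexibility the table describes.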
Pure user-level threads are typically cooperative—threads must explicitly yield. Preemption requires handling signals (SIGALRM) to interrupt threads, which reintroduces complexity and overhead. Many production libraries use hybrid approaches: cooperative yields at strategic points with signal-based preemption as a backstop for misbehaving threads.
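Here is a minimal sketch of that signal-based backstop, assuming the `thread_yield()` from the scheduler section above. One caveat baked into real designs: switching contexts from inside a signal handler is not async-signal-safe in general, so production libraries typically just set a flag in the handler and switch at the next safe point. This sketch switches directly for brevity.

```c
#include <signal.h>
#include <string.h>
#include <sys/time.h>

extern void thread_yield(void);   /* From the scheduler above */

static void preempt_handler(int sig) {
    (void)sig;
    thread_yield();   /* Force the running thread to give up the CPU */
}

/* Arrange for SIGALRM every interval_usec microseconds of wall-clock time */
void enable_preemption(long interval_usec) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = preempt_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sigaction(SIGALRM, &sa, NULL);

    struct itimerval timer;
    timer.it_interval.tv_sec  = 0;
    timer.it_interval.tv_usec = interval_usec;
    timer.it_value.tv_sec     = 0;
    timer.it_value.tv_usec    = interval_usec;
    setitimer(ITIMER_REAL, &timer, NULL);
}
```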
The thread library exposes an API that applications use to create, manage, and synchronize threads. This API typically mirrors POSIX threads semantics, providing familiarity for developers while implementing everything in user space.
```c
/*
 * Thread API Implementation - Core Functions
 *
 * Note: this block assumes tcb_t also carries a stack field
 * (stack_allocation_t *stack), a joiner pointer, and an all_next
 * link, and that find_thread_by_id() / remove_from_all_threads()
 * are helpers defined elsewhere in the library.
 */

#include "thread.h"
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

extern void thread_start_wrapper(void);   /* Assembly entry shim (above) */

/* Global state */
static uint64_t next_tid = 1;
static tcb_t *all_threads = NULL;   /* Global thread list */

/*
 * thread_create() - Create a new thread
 *
 * Allocates TCB, stack, initializes context, and adds to ready queue.
 * Returns 0 on success, -1 on failure.
 */
int thread_create(thread_id_t *tid, const thread_attr_t *attr,
                  void *(*start_routine)(void*), void *arg) {
    /* Allocate thread control block */
    tcb_t *tcb = calloc(1, sizeof(tcb_t));
    if (!tcb) {
        return -1;    /* Out of memory */
    }

    /* Assign thread ID */
    tcb->tid = next_tid++;
    if (tid) {
        *tid = tcb->tid;
    }

    /* Determine stack size (use default or attribute) */
    size_t stack_size = attr ? attr->stack_size : DEFAULT_STACK_SIZE;

    /* Allocate stack with guard page */
    tcb->stack = allocate_thread_stack(stack_size);
    if (!tcb->stack) {
        free(tcb);
        return -1;    /* Stack allocation failed */
    }

    /* Save thread function and argument */
    tcb->entry_function = start_routine;
    tcb->argument = arg;
    tcb->state = THREAD_READY;

    /* Initialize context for first scheduling.
     *
     * When this thread is first context-switched to, we want it
     * to start executing thread_start_wrapper, which will call
     * the actual entry function with the argument.
     *
     * We set up the context so that:
     *   - RSP points to initial stack position
     *   - R12 holds the entry function pointer
     *   - R13 holds the argument pointer
     *   - RIP (saved) points to thread_start_wrapper
     */

    /* Initialize stack pointer near top, properly aligned */
    uintptr_t sp = (uintptr_t)tcb->stack->stack_top;
    sp = (sp - 128) & ~0xF;    /* Red zone + alignment */

    /* Clear context, then set up initial values */
    memset(&tcb->context, 0, sizeof(thread_context_t));
    tcb->context.rsp = sp;
    tcb->context.r12 = (uint64_t)start_routine;          /* Entry function */
    tcb->context.r13 = (uint64_t)arg;                    /* Argument */
    tcb->context.rip = (uint64_t)thread_start_wrapper;

    /* Add to global thread list */
    tcb->all_next = all_threads;
    all_threads = tcb;

    /* Add to ready queue - thread is now schedulable */
    enqueue_ready(tcb);

    return 0;
}

/*
 * thread_exit() - Terminate the calling thread
 *
 * Cleans up the thread, stores return value, and wakes any joiners.
 */
void thread_exit(void *retval) {
    tcb_t *self = current_thread;

    /* Store return value for thread_join */
    self->return_value = retval;
    self->state = THREAD_TERMINATED;

    /* Wake any thread waiting in thread_join */
    if (self->joiner) {
        thread_unblock(self->joiner);
    }

    /* Schedule next thread - we will never return here */
    schedule();

    /* Should never reach here */
    __builtin_unreachable();
}

/*
 * thread_join() - Wait for thread termination
 *
 * Blocks until the target thread terminates, then retrieves
 * its return value.
 */
int thread_join(thread_id_t tid, void **retval) {
    /* Find the target thread */
    tcb_t *target = find_thread_by_id(tid);
    if (!target) {
        return -1;    /* Thread not found */
    }

    /* If thread hasn't terminated yet, wait */
    if (target->state != THREAD_TERMINATED) {
        target->joiner = current_thread;    /* Record who's waiting */
        thread_block();                     /* Sleep until target exits */
    }

    /* Thread terminated - get return value */
    if (retval) {
        *retval = target->return_value;
    }

    /* Clean up terminated thread resources */
    remove_from_all_threads(target);
    free_thread_stack(target->stack);
    free(target);

    return 0;
}

/*
 * thread_self() - Get current thread's ID
 */
thread_id_t thread_self(void) {
    return current_thread ? current_thread->tid : 0;
}
```

A well-designed thread API should be familiar (matching POSIX patterns where possible), minimal (exposing only what's necessary), and safe (making incorrect usage difficult). The separation between thread_create(), which returns immediately, and thread_join(), which blocks, allows for flexible concurrent execution patterns.
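To make that create/join separation concrete, here is a small usage sketch against this API. The worker function and thread count are illustrative.

```c
#include "thread.h"
#include <stdio.h>

static void *worker(void *arg) {
    long n = (long)arg;
    printf("worker %ld running\n", n);
    thread_yield();                /* Cooperatively let other workers run */
    return (void *)(n * n);
}

int main(void) {
    thread_id_t tids[4];

    /* All four creates return immediately; the workers run interleaved */
    for (long i = 0; i < 4; i++) {
        thread_create(&tids[i], NULL, worker, (void *)i);
    }

    /* Each join blocks until the corresponding worker has exited */
    for (int i = 0; i < 4; i++) {
        void *result;
        thread_join(tids[i], &result);
        printf("worker %d returned %ld\n", i, (long)result);
    }
    return 0;
}
```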
User-level thread libraries must provide synchronization primitives—mutexes, condition variables, semaphores—implemented entirely within user space. Since only one user-level thread executes at a time (within a process on a single CPU), some synchronization is simpler, but blocking behavior requires integration with the scheduler.
```c
/*
 * User-Level Mutex Implementation
 *
 * Since only one user-level thread runs at a time (no true parallelism
 * within the process), we don't need atomic instructions for basic
 * mutual exclusion. The scheduler provides the synchronization.
 *
 * Note: this block assumes tcb_t carries a wait_next link used for
 * synchronization wait queues.
 */

typedef struct {
    int locked;           /* 0 = unlocked, 1 = locked */
    tcb_t *owner;         /* Thread holding the lock */
    tcb_t *wait_queue;    /* Threads waiting for lock */
} user_mutex_t;

void mutex_init(user_mutex_t *mutex) {
    mutex->locked = 0;
    mutex->owner = NULL;
    mutex->wait_queue = NULL;
}

void mutex_lock(user_mutex_t *mutex) {
    /* No atomics needed - we won't be preempted mid-check
     * in a pure cooperative threading model */
    while (mutex->locked) {
        /* Mutex is held - add ourselves to wait queue and block */
        current_thread->wait_next = mutex->wait_queue;
        mutex->wait_queue = current_thread;
        thread_block();
        /* When we wake up, loop and check again */
    }

    /* Acquire the lock */
    mutex->locked = 1;
    mutex->owner = current_thread;
}

void mutex_unlock(user_mutex_t *mutex) {
    /* Verify we own the lock */
    if (mutex->owner != current_thread) {
        panic("Unlock by non-owner!");
    }

    mutex->locked = 0;
    mutex->owner = NULL;

    /* Wake one waiting thread if any */
    if (mutex->wait_queue) {
        tcb_t *waiter = mutex->wait_queue;
        mutex->wait_queue = waiter->wait_next;
        waiter->wait_next = NULL;
        thread_unblock(waiter);
    }
}

/*
 * User-Level Condition Variable Implementation
 */

typedef struct {
    tcb_t *wait_queue;    /* Threads waiting on condition */
} user_cond_t;

void cond_init(user_cond_t *cond) {
    cond->wait_queue = NULL;
}

void cond_wait(user_cond_t *cond, user_mutex_t *mutex) {
    /* Add to wait queue before releasing mutex.
     * This prevents missed signals */
    current_thread->wait_next = cond->wait_queue;
    cond->wait_queue = current_thread;

    /* Release mutex and block atomically (scheduler perspective) */
    mutex_unlock(mutex);
    thread_block();

    /* Woken up - reacquire mutex before returning */
    mutex_lock(mutex);
}

void cond_signal(user_cond_t *cond) {
    /* Wake one waiter */
    if (cond->wait_queue) {
        tcb_t *waiter = cond->wait_queue;
        cond->wait_queue = waiter->wait_next;
        waiter->wait_next = NULL;
        thread_unblock(waiter);
    }
}

void cond_broadcast(user_cond_t *cond) {
    /* Wake all waiters */
    while (cond->wait_queue) {
        tcb_t *waiter = cond->wait_queue;
        cond->wait_queue = waiter->wait_next;
        waiter->wait_next = NULL;
        thread_unblock(waiter);
    }
}
```

This simple implementation relies on the fact that only one user-level thread executes at a time. On a single-core system with cooperative threading, there's no parallel access to mutex state. If you introduce preemptive user-level threads (via signals) or multiprocessor support, you must add proper atomic operations and memory barriers.
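As a usage sketch, the classic bounded buffer maps directly onto these primitives. The `put`/`get` names and buffer size are illustrative, and `mutex_init`/`cond_init` must be called before first use.

```c
#define BUF_SIZE 8

static int buffer[BUF_SIZE];
static int count = 0;
static user_mutex_t buf_mutex;
static user_cond_t not_full, not_empty;

/* Producer side: wait for space, insert, wake a consumer */
void put(int item) {
    mutex_lock(&buf_mutex);
    while (count == BUF_SIZE) {
        cond_wait(&not_full, &buf_mutex);    /* Wait for space */
    }
    buffer[count++] = item;
    cond_signal(&not_empty);                 /* Wake a consumer */
    mutex_unlock(&buf_mutex);
}

/* Consumer side: wait for data, remove, wake a producer */
int get(void) {
    mutex_lock(&buf_mutex);
    while (count == 0) {
        cond_wait(&not_empty, &buf_mutex);   /* Wait for data */
    }
    int item = buffer[--count];
    cond_signal(&not_full);                  /* Wake a producer */
    mutex_unlock(&buf_mutex);
    return item;
}
```

Note the `while` loops around `cond_wait`: even in a cooperative library, another thread may consume the condition between the signal and the waiter reacquiring the mutex, so re-checking after waking is the correct pattern.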
We have explored the complete architecture of user-level thread libraries—sophisticated systems software that implements threading entirely within user space, invisible to the operating system kernel.
You now understand the core architecture of user-level thread libraries. In the next page, we'll explore why this architecture enables remarkably fast context switching—often 100x faster than kernel-level threads—and the performance characteristics that result.