Threads excel at sharing data—but sometimes, threads need private data. Consider global error codes like errno: in a single-threaded program, one global errno suffices. But in a multithreaded program, if two threads simultaneously make failing system calls, their error codes would overwrite each other. We need each thread to have its own errno.
Thread-Local Storage (TLS) solves this problem by providing each thread with its own instance of a variable. The variable has the same name across all threads, but each thread sees and modifies only its own copy. This creates a powerful abstraction: global-like variables that are actually per-thread.
By the end of this page, you will understand: (1) the conceptual model of thread-local storage, (2) why TLS is essential for certain programming patterns, (3) POSIX pthread_key APIs for dynamic TLS, (4) compiler-level __thread and thread_local keywords for static TLS, (5) performance characteristics and implementation details, and (6) practical patterns and best practices for TLS usage.
Before TLS, programmers faced a dilemma when dealing with per-thread state:
Option 1: Pass everything explicitly
Option 2: External mapping
Option 3: Thread-Local Storage
```c
// How errno is typically implemented (conceptual)
// In <errno.h>, you'll often see something like:

// Old single-threaded model (WRONG for threads)
int errno;  // Would be shared - data race!

// Modern TLS model
// Option 1: Compiler-supported TLS
__thread int errno;  // Each thread gets its own errno

// Option 2: Function-based access (POSIX style)
// errno is actually a macro:
#define errno (*__errno_location())

// Where __errno_location returns a pointer to
// the current thread's errno value
int *__errno_location(void) {
    // Returns TLS pointer for current thread
    // Implementation uses architecture-specific TLS access
    return &__thread_local_errno;
}

// This is why code like:
//   if (result < 0) perror("error");
// Works correctly in multithreaded programs

// Each thread's path through the code sees its own errno
void *thread_func(void *arg) {
    int result = some_syscall();
    if (result < 0) {
        // errno here is THIS thread's errno
        // Other threads making concurrent calls don't affect it
        printf("This thread's error: %d (%s)\n", errno, strerror(errno));
    }
    return NULL;
}
```

Stack variables are already per-thread (each thread has its own stack). TLS is for data that needs to: (1) persist across function calls, (2) be accessible without passing parameters, and (3) be global-like in scope but per-thread in instance. If data is only needed within one function, use stack variables. If it crosses function boundaries without explicit passing, consider TLS.
The simplest and most efficient TLS mechanism is static TLS provided by the compiler and linker. Variables declared with the thread_local (C11/C++11) or __thread (GCC extension) storage class are allocated per-thread at thread creation.
Declaration Syntax:
```c
// C11 standard
_Thread_local int my_var;
thread_local int my_var;  // With <threads.h>

// C++11 standard
thread_local int my_var;

// GCC/Clang extension (pre-standard, still widely used)
__thread int my_var;

// Windows
__declspec(thread) int my_var;
```
Key Properties:
```c
#include <pthread.h>
#include <stdio.h>

// Static TLS variable - each thread has its own copy
static __thread int thread_counter = 0;
static __thread const char *thread_name = NULL;

// TLS with complex initialization
static __thread struct {
    int operations;
    double total_time;
    void *context;
} thread_stats = {0, 0.0, NULL};

void increment_counter(void) {
    // No synchronization needed!
    // Each thread accesses its own copy
    thread_counter++;
}

int get_counter(void) {
    return thread_counter;
}

void set_thread_name(const char *name) {
    thread_name = name;
}

void *worker_thread(void *arg) {
    int id = *(int *)arg;
    char name_buf[32];
    snprintf(name_buf, sizeof(name_buf), "Worker-%d", id);

    // Set this thread's name (thread-local)
    set_thread_name(name_buf);

    // Each thread has its own counter starting at 0
    for (int i = 0; i < 1000; i++) {
        increment_counter();
    }

    printf("Thread %s: counter = %d\n", thread_name, get_counter());
    return NULL;
}

int main() {
    pthread_t threads[4];
    int ids[4] = {0, 1, 2, 3};

    for (int i = 0; i < 4; i++) {
        pthread_create(&threads[i], NULL, worker_thread, &ids[i]);
    }

    // Main thread's TLS is separate
    set_thread_name("Main");
    increment_counter();
    printf("Thread %s: counter = %d\n", thread_name, get_counter());

    for (int i = 0; i < 4; i++) {
        pthread_join(threads[i], NULL);
    }

    // Output (order may vary):
    // Thread Main: counter = 1
    // Thread Worker-0: counter = 1000
    // Thread Worker-1: counter = 1000
    // Thread Worker-2: counter = 1000
    // Thread Worker-3: counter = 1000

    return 0;
}
```

| Language/Compiler | Keyword | Notes |
|---|---|---|
| C11 | _Thread_local or thread_local | Standard; requires <threads.h> for macro |
| C++11 | thread_local | Standard; can have constructors/destructors |
| GCC/Clang | __thread | Extension; works in C and C++; requires constant initializers (no constructors or destructors) |
| MSVC | __declspec(thread) | Windows-specific; limited to POD types before VS2015 |
| Rust | No keyword (use std::thread::LocalKey) | Macro-based; thread_local! for declaration |
| Java | ThreadLocal<T> | Library class; not a language keyword |
Static TLS is extremely fast, often just a register load plus an offset. Modern architectures keep a dedicated register pointing at the current thread's TLS block: the FS segment base on x86-64 Linux (GS on x86-64 Windows) and TPIDR_EL0 on ARM64. Accessing a TLS variable is comparable to accessing a global variable, with minimal overhead. This makes static TLS suitable for hot paths and performance-critical code.
While static TLS is efficient, it's not always sufficient. Dynamic TLS via pthread_key_t provides more flexibility:
The pthread_key API:
```c
int pthread_key_create(pthread_key_t *key, void (*destructor)(void *));
int pthread_key_delete(pthread_key_t key);
void *pthread_getspecific(pthread_key_t key);
int pthread_setspecific(pthread_key_t key, const void *value);
```
```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Key for thread-local buffer
static pthread_key_t buffer_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

// Destructor - called automatically when thread exits
void buffer_destructor(void *ptr) {
    printf("Thread exiting: freeing buffer %p\n", ptr);
    free(ptr);
}

// One-time initialization of the key
void make_key(void) {
    pthread_key_create(&buffer_key, buffer_destructor);
}

// Get this thread's buffer, creating if needed
char *get_thread_buffer(void) {
    // Ensure key is created (thread-safe, runs once across all threads)
    pthread_once(&key_once, make_key);

    // Get this thread's value
    char *buffer = pthread_getspecific(buffer_key);

    if (buffer == NULL) {
        // First access in this thread - allocate
        buffer = malloc(1024);
        if (buffer == NULL) return NULL;

        // Associate with this thread; on failure, do not leak the buffer
        if (pthread_setspecific(buffer_key, buffer) != 0) {
            free(buffer);
            return NULL;
        }

        printf("Thread %lu: allocated buffer %p\n",
               (unsigned long)pthread_self(), (void *)buffer);
    }

    return buffer;
}

void *worker(void *arg) {
    int id = *(int *)arg;

    // Get this thread's private buffer
    char *buf = get_thread_buffer();
    if (buf == NULL) return NULL;  // allocation failed

    // Use the buffer - no synchronization needed!
    snprintf(buf, 1024, "Thread %d was here", id);
    printf("Buffer says: %s\n", buf);

    // When thread exits, destructor frees the buffer automatically
    return NULL;
}

int main(void) {
    pthread_t threads[3];
    int ids[3] = {1, 2, 3};

    for (int i = 0; i < 3; i++) {
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }

    for (int i = 0; i < 3; i++) {
        pthread_join(threads[i], NULL);
    }

    // Key can be deleted after all threads are done
    pthread_key_delete(buffer_key);

    return 0;
}
```

POSIX systems limit the number of pthread keys per process (PTHREAD_KEYS_MAX; POSIX guarantees at least 128, and glibc on Linux allows 1024). If your application and its libraries create too many keys, pthread_key_create() fails with EAGAIN. For this reason, prefer static TLS when possible, and consolidate related TLS data into structs pointed to by a single key rather than using multiple keys.
Destructor Semantics:
When a thread exits (via pthread_exit(), return from start function, or cancellation):
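For each key that has a registered destructor and a non-NULL value associated with this thread, the value is reset to NULL and the destructor is then called with the old value as its argument; the order in which keys are processed is unspecified
If destructors themselves set new values (e.g., re-allocating resources), the process repeats up to PTHREAD_DESTRUCTOR_ITERATIONS times
After all iterations, the implementation may stop calling destructors, so any remaining non-NULL values are typically leaked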
This behavior allows for complex cleanup scenarios but requires careful programming to avoid infinite loops or leaks.
Understanding how TLS is implemented helps explain its performance characteristics and limitations.
Static TLS Implementation (ELF/x86-64):
On modern x86-64 Linux systems:
The compiler reserves space for thread-local variables in a special .tdata (initialized) or .tbss (uninitialized) section
At thread creation, the runtime allocates a Thread Control Block (TCB) plus space for all TLS variables
The FS segment register is set to point to this thread's TLS area
Accessing a TLS variable compiles to: mov eax, fs:[offset]—a single instruction!
```c
// What the compiler generates for TLS access
// Source:
__thread int counter;

void increment() {
    counter++;
}

// Compiler output (x86-64, simplified):
//
// increment:
//     mov eax, fs:[counter@tpoff]   ; Load from FS-based TLS
//     add eax, 1
//     mov fs:[counter@tpoff], eax   ; Store back
//     ret
//
// Where @tpoff is the offset from the thread pointer (FS base)

// The fs: prefix accesses memory relative to the FS segment register
// which points to the current thread's TLS block

// Compare to global variable access:
int global_counter;

void increment_global() {
    global_counter++;
}

// Compiler output for global (simplified):
//
// increment_global:
//     mov eax, [rip + global_counter]   ; RIP-relative addressing
//     add eax, 1
//     mov [rip + global_counter], eax
//     ret

// The difference is minimal - just the segment override
// TLS access is nearly as fast as global access
```

| Model | Use Case | Performance |
|---|---|---|
| Local Exec (LE) | Executable's own TLS | Single instruction, no function calls |
| Initial Exec (IE) | Libraries loaded at startup | One load + offset, no function calls |
| General Dynamic (GD) | Libraries loaded via dlopen() | Function call (__tls_get_addr), slowest |
| Local Dynamic (LD) | Multiple TLS vars in same DSO | One function call, then offsets |
When a shared library with TLS variables is loaded via dlopen(), the General Dynamic access model is required. This involves a function call (__tls_get_addr) and is significantly slower. For performance-critical TLS, prefer statically linked or pre-loaded libraries. The compiler flag -ftls-model=initial-exec can force faster access if you know the library will be loaded at startup; a library built this way may fail to load via dlopen() when the static TLS area has no room left for it.
C++11 standardized thread_local with enhanced capabilities compared to C:
C++ Enhancements:
This makes TLS in C++ more powerful but also more complex.
```cpp
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Thread-local with constructor/destructor
class ThreadContext {
public:
    std::string thread_name;
    int operation_count = 0;

    ThreadContext() {
        std::cout << "ThreadContext constructed for thread "
                  << std::this_thread::get_id() << std::endl;
    }

    ~ThreadContext() {
        std::cout << "ThreadContext destroyed for " << thread_name
                  << " (ops: " << operation_count << ")" << std::endl;
    }

    void record_operation() {
        operation_count++;
    }
};

// Thread-local instance - each thread gets its own
thread_local ThreadContext ctx;

void worker(int id) {
    // First access constructs this thread's instance
    ctx.thread_name = "Worker-" + std::to_string(id);

    for (int i = 0; i < 100; i++) {
        ctx.record_operation();
    }

    std::cout << ctx.thread_name << " completed" << std::endl;
    // When thread exits, destructor is called automatically
}

int main() {
    ctx.thread_name = "Main";

    std::thread t1(worker, 1);
    std::thread t2(worker, 2);

    t1.join();
    t2.join();

    ctx.record_operation();

    return 0;
    // Main's ctx is destroyed here
}

// Thread-local with lazy initialization
thread_local std::vector<int> thread_cache = []() {
    std::cout << "Initializing cache for thread "
              << std::this_thread::get_id() << std::endl;
    return std::vector<int>(100, 0);
}();

// Beware: this lambda runs ONCE per thread,
// first time thread_cache is accessed
```

Thread-local objects with non-trivial constructors may have subtle initialization order issues. If one TLS object's constructor accesses another TLS object, the second one is constructed on-demand. This lazy construction can cause surprising behavior. Be especially careful with TLS objects that depend on each other or on global state.
```cpp
#include <memory>

// Logger, Connection, Expensive, and Resources stand in for application types.

// PITFALL: Destruction order with dependencies
thread_local Logger logger;     // Might be destroyed first
thread_local Connection conn;   // Uses logger in destructor?

// thread_local objects are destroyed in reverse order of construction.
// If logger ends up destroyed before conn (for example, the two live in
// different translation units and conn was constructed first),
// conn's destructor crashes

// SOLUTION: Use explicit cleanup, or careful ordering
// (these declarations replace the two above)
thread_local std::unique_ptr<Logger> logger;
thread_local std::unique_ptr<Connection> conn;

void init_thread_resources() {
    logger = std::make_unique<Logger>();
    conn = std::make_unique<Connection>(logger.get());
}

void cleanup_thread_resources() {
    // Explicit order: conn first (uses logger), then logger
    conn.reset();
    logger.reset();
}

// PITFALL: Static thread_local in block scope
void problematic() {
    static thread_local Expensive obj;  // Constructed once per thread
    // But when? First time this function is called in each thread
    // And destroyed when thread exits
}

// Each thread that calls problematic() constructs obj once
// But order relative to other TLS is unpredictable

// BETTER: Explicit initialization pattern
class ThreadResources {
    thread_local static std::unique_ptr<Resources> instance_;

public:
    static Resources& get() {
        if (!instance_) {
            instance_ = std::make_unique<Resources>();
        }
        return *instance_;
    }

    static void cleanup() {
        instance_.reset();
    }
};

thread_local std::unique_ptr<Resources> ThreadResources::instance_;
```

Thread-Local Storage is powerful but requires careful use. Here are proven patterns and practices from production systems:
```c
// LogLevel, ExpensiveResource, TaskContext, and ResourceList stand in
// for application-defined types.

// Pattern 1: Consolidated TLS struct
typedef struct {
    int request_id;
    char user_name[64];
    LogLevel log_level;
    void *transaction;
    struct timespec start_time;
    // Add more as needed
} ThreadContext;

static __thread ThreadContext thread_ctx = {0};

ThreadContext *get_thread_context(void) {
    return &thread_ctx;
}

void reset_thread_context(void) {
    memset(&thread_ctx, 0, sizeof(thread_ctx));
}

// Pattern 2: Lazy initialization with flag
static __thread int tls_initialized = 0;
static __thread ExpensiveResource resource;

ExpensiveResource *get_resource(void) {
    if (!tls_initialized) {
        initialize_resource(&resource);
        tls_initialized = 1;
    }
    return &resource;
}

// Pattern 3: Thread pool-safe TLS
void begin_task(TaskContext *task) {
    // Reset ALL TLS at start of each task
    reset_thread_context();

    // Set up for this task
    ThreadContext *ctx = get_thread_context();
    ctx->request_id = task->request_id;
    // Copy at most size-1 bytes: the reset above zeroed the buffer,
    // so the last byte stays a NUL terminator
    strncpy(ctx->user_name, task->user_name, sizeof(ctx->user_name) - 1);
    clock_gettime(CLOCK_MONOTONIC, &ctx->start_time);
}

void end_task(void) {
    // Log task completion using TLS context
    ThreadContext *ctx = get_thread_context();

    struct timespec end_time;
    clock_gettime(CLOCK_MONOTONIC, &end_time);

    log_task_metrics(ctx->request_id,
                     time_diff_ms(&ctx->start_time, &end_time));

    // Clear sensitive data
    reset_thread_context();
}

// Pattern 4: TLS with cleanup registration
static pthread_key_t cleanup_key;
static pthread_once_t key_once = PTHREAD_ONCE_INIT;

void cleanup_handler(void *ptr) {
    ResourceList *list = ptr;
    while (list) {
        list->cleanup(list->resource);
        ResourceList *next = list->next;
        free(list);
        list = next;
    }
}

void init_cleanup_key(void) {
    pthread_key_create(&cleanup_key, cleanup_handler);
}

// Returns 0 on success, -1 if the resource could not be registered
int register_for_cleanup(void *resource, void (*cleanup)(void *)) {
    pthread_once(&key_once, init_cleanup_key);

    ResourceList *entry = malloc(sizeof(ResourceList));
    if (entry == NULL) return -1;

    entry->resource = resource;
    entry->cleanup = cleanup;
    entry->next = pthread_getspecific(cleanup_key);

    if (pthread_setspecific(cleanup_key, entry) != 0) {
        free(entry);
        return -1;
    }
    return 0;
}
```

Modern memory allocators (tcmalloc, jemalloc, mimalloc) heavily use TLS for per-thread caches. Each thread maintains its own pool of recently freed objects, eliminating lock contention for most allocations. This pattern, TLS caches backed by shared pools, is applicable to any resource where per-thread caching reduces contention.
TLS implementation varies across platforms. Understanding these differences is essential for portable code.
| Platform | Static TLS | Dynamic TLS | Notes |
|---|---|---|---|
| Linux (glibc) | __thread, thread_local | pthread_key_t | Excellent support; FS/GS register based |
| macOS | __thread, thread_local | pthread_key_t | Full support since macOS 10.7 |
| Windows | __declspec(thread), thread_local | TlsAlloc/TlsFree | Different API; __declspec(thread) is unreliable in DLLs loaded with LoadLibrary before Windows Vista |
| FreeBSD | __thread, thread_local | pthread_key_t | POSIX compliant |
| Embedded (bare metal) | Manual implementation | Manual implementation | May need custom TLS library |
| WebAssembly (WASI) | Limited support | Via Emscripten | Still evolving |
```c
// Portable TLS macro header

#ifndef PORTABLE_TLS_H
#define PORTABLE_TLS_H

// Static TLS declaration
#if defined(__cplusplus) && __cplusplus >= 201103L
    // C++11 and later: prefer the standard keyword
    #define THREAD_LOCAL thread_local
#elif defined(_MSC_VER)
    #define THREAD_LOCAL __declspec(thread)
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
    #define THREAD_LOCAL _Thread_local
#elif defined(__GNUC__) || defined(__clang__)
    #define THREAD_LOCAL __thread
#else
    #error "No TLS support detected"
#endif

// Usage:
// static THREAD_LOCAL int my_var;

// Dynamic TLS (POSIX vs Windows)
#ifdef _WIN32
    #include <windows.h>
    typedef DWORD tls_key_t;

    static inline int tls_create(tls_key_t *key, void (*destructor)(void *)) {
        (void)destructor;  // ignored on Windows, see note below
        *key = TlsAlloc();
        // Note: Windows TLS doesn't natively support destructors
        // Would need DllMain or manual tracking
        return (*key == TLS_OUT_OF_INDEXES) ? -1 : 0;
    }

    static inline void *tls_get(tls_key_t key) {
        return TlsGetValue(key);
    }

    static inline int tls_set(tls_key_t key, void *value) {
        return TlsSetValue(key, value) ? 0 : -1;
    }

    static inline int tls_delete(tls_key_t key) {
        return TlsFree(key) ? 0 : -1;
    }
#else
    #include <pthread.h>
    typedef pthread_key_t tls_key_t;

    static inline int tls_create(tls_key_t *key, void (*destructor)(void *)) {
        return pthread_key_create(key, destructor);
    }

    static inline void *tls_get(tls_key_t key) {
        return pthread_getspecific(key);
    }

    static inline int tls_set(tls_key_t key, void *value) {
        return pthread_setspecific(key, value);
    }

    static inline int tls_delete(tls_key_t key) {
        return pthread_key_delete(key);
    }
#endif

#endif // PORTABLE_TLS_H
```

Thread-Local Storage bridges the gap between global convenience and thread safety. Let's consolidate the essential concepts:
You now understand Thread-Local Storage—from conceptual motivation through implementation details to practical patterns. TLS is a critical tool in the multithreaded programmer's arsenal, enabling global-like convenience without sacrificing thread safety. Next, we'll examine thread safety itself: what it means, why it matters, and how to achieve it.