The final piece of our futex journey takes us into the actual Linux kernel implementation. While we've covered concepts and interfaces, seeing how Linux engineers solved the hard problems—and continue to evolve the implementation—provides invaluable insight.
This page examines the kernel source structure, traces through real code paths, explores priority inheritance futexes (a critical feature for real-time systems), and covers the evolution from futex's introduction in 2002 to modern kernels. By the end, you'll understand not just what futex does, but how Linux makes it happen.
By the end of this page, you will understand the Linux kernel's futex source structure, the implementation of priority inheritance futexes, the evolution of futex across kernel versions, and advanced features like robust futexes and futex2.
The futex implementation lives in the kernel/futex/ directory (since the Linux 5.15/5.16 refactor; previously a single kernel/futex.c). The code is organized into focused modules for maintainability.
```text
# Linux kernel futex source structure (as of Linux 6.x)

kernel/futex/
├── core.c      # Core infrastructure: hash table, key computation
├── futex.h     # Internal header: structs, inline functions
├── pi.c        # Priority inheritance futex implementation
├── requeue.c   # FUTEX_REQUEUE and variants
├── syscalls.c  # System call entry points and dispatch
├── waitwake.c  # FUTEX_WAIT and FUTEX_WAKE
└── Makefile    # Build configuration

# Key files and their responsibilities:

## core.c (~1200 lines)
# - futex_hash_bucket: Wait queue hash table
# - get_futex_key(): Compute futex key from userspace address
# - futex_q: Per-waiter queue entry structure
# - hash_futex(): Map key to bucket
# - cmpxchg_futex_value_locked(): Atomic compare-exchange on the user futex word

## waitwake.c (~500 lines)
# - futex_wait(): The FUTEX_WAIT implementation
# - futex_wake(): The FUTEX_WAKE implementation
# - futex_wait_queue_me(): Add to queue and sleep
# - futex_wake_mark(): Mark waiter for waking

## requeue.c (~700 lines)
# - futex_requeue(): Move waiters between futexes
# - futex_wake_op(): Compound wake operation
# - requeue_pi(): Priority inheritance requeue

## pi.c (~800 lines)
# - futex_lock_pi(): Priority inheritance lock acquire
# - futex_unlock_pi(): Priority inheritance unlock
# - fixup_pi_state_owner(): Handle PI state transfers
# - rt_mutex integration: Uses kernel RT mutexes internally
```

Why the Split?
The original kernel/futex.c grew to over 4000 lines and became difficult to maintain. In the Linux 5.15/5.16 timeframe (2021), the futex maintainers refactored it into the current modular structure, giving each file a clear responsibility.
The Linux kernel source is available at kernel.org or on GitHub. To explore futex, start with kernel/futex/core.c and trace the main system call entry point. Use cscope or ctags for navigation. The code is well-commented—kernel developers expect their code to be read.
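For example, the kernel's own build system can generate the indexes for you (a quick sketch, assuming you run it from the top of a checked-out kernel tree):

```bash
# From the top of a kernel source tree (kernel.org or a GitHub mirror).
make tags      # generate a ctags index for the whole tree
make cscope    # generate a cscope database

# Or index just the futex code for a quick look:
ctags -R kernel/futex/
cscope -bq kernel/futex/*.c kernel/futex/*.h
```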
All futex operations enter the kernel through a single system call handler. Let's trace the entry point and dispatch logic.
```c
/*
 * System call entry point for futex
 * Source: kernel/futex/syscalls.c
 *
 * This is called when userspace executes syscall(SYS_futex, ...)
 */
SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
                const struct __kernel_timespec __user *, utime,
                u32 __user *, uaddr2, u32, val3)
{
    int ret, cmd = op & FUTEX_CMD_MASK;
    ktime_t t, *tp = NULL;
    struct timespec64 ts;

    /*
     * STEP 1: Handle timeout conversion
     *
     * For wait operations, convert the userspace timespec to kernel ktime.
     * Different operations interpret the timeout differently:
     *  - FUTEX_WAIT:        relative timeout
     *  - FUTEX_WAIT_BITSET: absolute timeout
     *  - FUTEX_LOCK_PI:     absolute timeout
     */
    if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI ||
                  cmd == FUTEX_WAIT_BITSET ||
                  cmd == FUTEX_WAIT_REQUEUE_PI)) {
        if (get_timespec64(&ts, utime))
            return -EFAULT;
        if (!timespec64_valid(&ts))
            return -EINVAL;

        t = timespec64_to_ktime(ts);
        if (cmd == FUTEX_WAIT)
            t = ktime_add_safe(ktime_get(), t); // Relative -> absolute
        tp = &t;
    }

    /*
     * STEP 2: Dispatch to operation-specific handler
     */
    return do_futex(uaddr, op, val, tp, uaddr2, (unsigned long)utime, val3);
}

/*
 * Main futex dispatch function
 */
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
              u32 __user *uaddr2, u32 val2, u32 val3)
{
    int cmd = op & FUTEX_CMD_MASK;
    unsigned int flags = 0;

    /*
     * STEP 3: Parse flags
     */
    if (!(op & FUTEX_PRIVATE_FLAG))
        flags |= FLAGS_SHARED;

    if (op & FUTEX_CLOCK_REALTIME) {
        flags |= FLAGS_CLOCKRT;
        if (cmd != FUTEX_WAIT_BITSET &&
            cmd != FUTEX_WAIT_REQUEUE_PI &&
            cmd != FUTEX_LOCK_PI)
            return -ENOSYS;
    }

    /*
     * STEP 4: Dispatch based on command
     */
    switch (cmd) {
    case FUTEX_WAIT:
        return futex_wait(uaddr, flags, val, timeout, FUTEX_BITSET_MATCH_ANY);
    case FUTEX_WAIT_BITSET:
        return futex_wait(uaddr, flags, val, timeout, val3);
    case FUTEX_WAKE:
        return futex_wake(uaddr, flags, val, FUTEX_BITSET_MATCH_ANY);
    case FUTEX_WAKE_BITSET:
        return futex_wake(uaddr, flags, val, val3);
    case FUTEX_REQUEUE:
        return futex_requeue(uaddr, flags, uaddr2, val, val2, NULL, 0);
    case FUTEX_CMP_REQUEUE:
        return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 0);
    case FUTEX_WAKE_OP:
        return futex_wake_op(uaddr, flags, uaddr2, val, val2, val3);
    case FUTEX_LOCK_PI:
        return futex_lock_pi(uaddr, flags, timeout, 0);
    case FUTEX_UNLOCK_PI:
        return futex_unlock_pi(uaddr, flags);
    case FUTEX_TRYLOCK_PI:
        return futex_lock_pi(uaddr, flags, NULL, 1);
    case FUTEX_WAIT_REQUEUE_PI:
        return futex_wait_requeue_pi(uaddr, flags, val, timeout, val3, uaddr2);
    case FUTEX_CMP_REQUEUE_PI:
        return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
    }

    return -ENOSYS; // Unknown operation
}
```

The SYSCALL_DEFINE6 macro generates the actual system call handler with proper type checking, tracing hooks, and ABI handling. The '6' means the syscall takes six arguments. The macro is part of Linux's syscall infrastructure, which handles the differences between 32-bit and 64-bit calling conventions.
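For orientation, here is a minimal userspace sketch of how this entry point is reached directly with the raw system call, bypassing glibc's pthread layer. The wrapper names (futex_wait_raw, futex_wake_raw) are ours, not part of any library:

```c
// Minimal sketch: calling the futex syscall directly from userspace.
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

// Sleep until someone wakes us, provided *addr still equals expected.
static long futex_wait_raw(uint32_t *addr, uint32_t expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT | FUTEX_PRIVATE_FLAG,
                   expected, NULL, NULL, 0);
}

// Wake up to nr_waiters threads sleeping on addr.
static long futex_wake_raw(uint32_t *addr, int nr_waiters)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE | FUTEX_PRIVATE_FLAG,
                   nr_waiters, NULL, NULL, 0);
}
```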
Priority inversion is a classic problem in real-time systems: a high-priority task waits for a lock held by a low-priority task, which is preempted by a medium-priority task. The high-priority task is effectively blocked by the medium-priority task—a priority inversion.
Priority Inheritance (PI) solves this: when a high-priority task blocks on a lock, the lock holder temporarily inherits the high priority, preventing preemption by medium-priority tasks.
```c
/*
 * Priority Inheritance Futex Implementation
 * Source: kernel/futex/pi.c
 *
 * PI futexes differ from regular futexes:
 *  1. The futex word contains the TID of the owner (not just 0/1/2)
 *  2. The kernel tracks ownership and adjusts priorities
 *  3. Uses the kernel's RT mutex infrastructure internally
 */

/*
 * PI futex word format (32 bits):
 *
 *  Bits 0-29: Owner's TID (thread ID)
 *  Bit 30:    FUTEX_WAITERS - there are waiters
 *  Bit 31:    FUTEX_OWNER_DIED - owner died while holding
 */
#define FUTEX_TID_MASK    0x3fffffff
#define FUTEX_WAITERS     0x40000000
#define FUTEX_OWNER_DIED  0x80000000

/*
 * PI state structure - tracks the priority inheritance chain
 */
struct futex_pi_state {
    struct list_head list;          // All PI states for this task
    struct rt_mutex_base pi_mutex;  // The RT mutex for PI tracking
    struct task_struct *owner;      // Current owner
    refcount_t refcount;            // Reference count
    union futex_key key;            // Which futex this is for
};

/*
 * Acquiring a PI futex (simplified: retries and error handling elided)
 *
 * Much more complex than a regular futex due to priority tracking.
 */
int futex_lock_pi(u32 __user *uaddr, unsigned int flags, ktime_t *time,
                  int trylock)
{
    struct futex_hash_bucket *hb;
    struct futex_q q = futex_q_init;
    struct rt_mutex_waiter rt_waiter;
    int ret;

    /*
     * STEP 1: Get futex key
     */
    ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &q.key);
    if (ret)
        return ret;

    /*
     * STEP 2: Fast path - try to acquire uncontended
     *
     * If the futex word is 0, try to set it to our TID.
     */
    ret = futex_lock_pi_atomic(uaddr, &q, NULL, current->pid, NULL);
    if (ret == 1)
        return 0;   // Got it! Fast path success.
    if (ret < 0)
        return ret; // Error

    /*
     * STEP 3: Slow path - need to wait with PI
     *
     * ret == 0 means someone else holds it.
     */
    hb = hash_futex(&q.key);
    spin_lock(&hb->lock);

    /*
     * STEP 4: Set up PI state
     *
     * Find or create the futex_pi_state for this futex and attach
     * ourselves to it. (In the real code this happens inside a retry
     * of futex_lock_pi_atomic() under the bucket lock.)
     */
    ret = attach_to_pi_state(uaddr, &q, current);   // simplified
    if (ret)
        goto out_unlock;

    /*
     * STEP 5: Now the interesting part - priority inheritance
     *
     * The RT mutex code will:
     *  1. Add us to the RT mutex wait list
     *  2. Boost the owner's priority to ours if we're higher
     *  3. Potentially chain-boost if the owner is itself blocked
     */
    ret = rt_mutex_wait_proxy_lock(&q.pi_state->pi_mutex, time, &rt_waiter);

    /*
     * STEP 6: Woken up - we now own the lock
     *
     * The RT mutex code transferred ownership to us.
     * Update the userspace futex word to our TID.
     */
    fixup_owner(uaddr, &q, current);

out_unlock:
    spin_unlock(&hb->lock);
    return ret;
}

/*
 * Priority inheritance chain example:
 *
 *  Task A (priority 99, highest) wants a lock held by B
 *  Task B (priority 50) wants a lock held by C
 *  Task C (priority 10, lowest) holds the lock B is waiting for
 *
 * Without PI:
 *  C runs at priority 10, gets preempted by everything
 *  A waits potentially forever
 *
 * With PI:
 *  C inherits A's priority 99 through the chain
 *  C runs at priority 99, completes quickly
 *  B inherits A's priority 99 (the higher of 99 and its own 50)
 *  A finally runs
 */
```

| Aspect | Regular Futex | PI Futex |
|---|---|---|
| Futex word content | 0/1/2 (state) | TID of owner + flags |
| Priority tracking | None | Full chain boosting |
| Kernel state | Hash table only | rt_mutex + pi_state |
| Use case | General synchronization | Real-time systems |
| Performance | Faster | More overhead for PI |
| POSIX equivalent | PTHREAD_PRIO_NONE (default protocol) | PTHREAD_PRIO_INHERIT |
PI futexes are significantly more complex and have higher overhead. Use them only when real-time guarantees are needed. For general-purpose synchronization, regular futexes are faster and sufficient.
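For reference, a userspace program opts into PI futexes through the standard POSIX attribute API; glibc then issues FUTEX_LOCK_PI/FUTEX_UNLOCK_PI when the lock is contended. A minimal sketch:

```c
// Minimal sketch: create a priority-inheritance mutex via POSIX attributes.
#include <pthread.h>

pthread_mutex_t pi_lock;

static int init_pi_mutex(void)
{
    pthread_mutexattr_t attr;
    int err = pthread_mutexattr_init(&attr);
    if (err)
        return err;

    // Request the priority-inheritance protocol for this mutex
    err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (!err)
        err = pthread_mutex_init(&pi_lock, &attr);

    pthread_mutexattr_destroy(&attr);
    return err;
}
```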
What happens if a thread crashes while holding a lock? With naive futexes, other threads wait forever—the lock is never released. Robust futexes solve this: the kernel detects owner death and marks the lock as recoverable.
```c
/*
 * Robust Futex Mechanism
 *
 * The key idea: each thread maintains a list of held robust locks.
 * When the thread exits (normally or via crash), the kernel walks
 * this list and marks each lock as FUTEX_OWNER_DIED.
 */

/*
 * Registering the robust list with the kernel
 *
 * Called early in thread startup (glibc does this automatically
 * for PTHREAD_MUTEX_ROBUST mutexes).
 */
#include <linux/futex.h>

struct robust_list_head {
    struct robust_list *list;            // Head of circular list
    long futex_offset;                   // Offset to futex word in struct
    struct robust_list *list_op_pending; // Currently being acquired/released
};

// System call to register the robust list (invoked via SYS_set_robust_list)
long set_robust_list(struct robust_list_head *head, size_t len);

/*
 * How userspace tracks robust locks
 *
 * Each mutex structure has a 'list' member that links it into
 * the thread's robust list. When acquiring, add to the list.
 * When releasing, remove from the list.
 */
struct robust_mutex {
    struct robust_list list;   // Link in robust list
    uint32_t futex;            // The actual futex word
    // ... other fields
};

// Per-thread list head, registered with set_robust_list() at thread start
static __thread struct robust_list_head __robust_list_head;

void acquire_robust_mutex(struct robust_mutex *m)
{
    // Record as pending (in case we crash during the acquire)
    __robust_list_head.list_op_pending = &m->list;

    // Acquire the lock (PI or regular futex)
    // ... futex operations ...

    // Move from pending to the held list (list helper not shown)
    add_to_robust_list(&m->list);
    __robust_list_head.list_op_pending = NULL;
}

/*
 * Kernel side: what happens on thread exit (simplified)
 * Source: kernel/futex/core.c, called from the exit path
 */
void exit_robust_list(struct task_struct *curr)
{
    struct robust_list_head __user *head;
    struct robust_list __user *entry, *next;
    unsigned int limit = ROBUST_LIST_LIMIT;

    // Get the robust list head for this task
    head = curr->robust_list;
    if (!head)
        return; // No robust list registered

    /*
     * Handle any pending operation first
     *
     * If we crashed during acquire/release, handle that entry specially.
     * (The real code copies these pointers from user memory with
     * get_user(); direct dereference is shown here for brevity.)
     */
    if (head->list_op_pending)
        handle_futex_death((void *)head->list_op_pending + head->futex_offset,
                           curr);

    /*
     * Walk the robust list and mark each lock as OWNER_DIED
     */
    entry = head->list;
    while (entry != (struct robust_list __user *)head && --limit) {
        u32 __user *uaddr = (u32 *)((char *)entry + head->futex_offset);

        handle_futex_death(uaddr, curr);

        // Follow the list
        if (get_user(next, &entry->next))
            break;
        entry = next;
    }
}

/*
 * Marking a lock as owner-died
 */
static void handle_futex_death(u32 __user *uaddr, struct task_struct *curr)
{
    u32 uval, nval;

    // Read the current value
    if (get_user(uval, uaddr))
        return;

    // Only if we actually owned it
    if ((uval & FUTEX_TID_MASK) != curr->pid)
        return;

    // Atomically set FUTEX_OWNER_DIED, keeping the WAITERS bit
    // (the real code re-reads the word and retries if it changed)
    do {
        nval = (uval & FUTEX_WAITERS) | FUTEX_OWNER_DIED;
    } while (cmpxchg_futex_value_locked(uaddr, uval, nval) != uval);

    // Wake any waiters so they can recover
    if (uval & FUTEX_WAITERS)
        futex_wake(uaddr, FLAGS_SHARED, 1, FUTEX_BITSET_MATCH_ANY);
}

/*
 * Userspace recovery
 *
 * When a waiter wakes and sees FUTEX_OWNER_DIED, it can:
 *  1. Take ownership of the lock
 *  2. Run recovery code (check data consistency)
 *  3. Clear FUTEX_OWNER_DIED and FUTEX_WAITERS
 *  4. Continue with the acquired lock
 *
 * pthread_mutex_consistent() performs this recovery.
 */
```

In POSIX, use pthread_mutexattr_setrobust() with PTHREAD_MUTEX_ROBUST. glibc registers the robust list with the kernel automatically. When pthread_mutex_lock() returns EOWNERDEAD, call pthread_mutex_consistent() after restoring state integrity, or pthread_mutex_unlock() if recovery isn't possible.
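Tying this back to POSIX, a minimal sketch of the waiter-side recovery flow might look like this (repair_shared_state() is an assumed application-specific helper, not a library function):

```c
// Minimal sketch: robust mutex creation and EOWNERDEAD recovery.
#include <errno.h>
#include <pthread.h>

extern int repair_shared_state(void);   // assumed helper: nonzero if data was repaired

pthread_mutex_t robust_lock;

void init_robust_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(&robust_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}

int lock_with_recovery(void)
{
    int err = pthread_mutex_lock(&robust_lock);
    if (err == EOWNERDEAD) {
        // The previous owner died while holding the lock. We now hold it,
        // but the protected data may be half-updated.
        if (repair_shared_state())
            return pthread_mutex_consistent(&robust_lock);  // mark usable again
        // Can't repair: unlocking now marks the mutex permanently unusable
        pthread_mutex_unlock(&robust_lock);
        return ENOTRECOVERABLE;
    }
    return err;   // 0 on a normal, successful lock
}
```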
Futex has evolved significantly since its introduction in 2002. Understanding this evolution helps appreciate current behavior and anticipate future changes.
| Kernel Version | Year | Features Added |
|---|---|---|
| 2.5.7 | 2002 | Initial futex: WAIT, WAKE, FD |
| 2.5.40 | 2002 | FUTEX_REQUEUE for efficient condvars |
| 2.6.7 | 2004 | FUTEX_CMP_REQUEUE (safer requeue) |
| 2.6.12 | 2005 | FUTEX_WAKE_OP compound operations |
| 2.6.17 | 2006 | Robust futexes (handle owner death) |
| 2.6.18 | 2006 | Priority inheritance (PI) futexes |
| 2.6.25 | 2008 | FUTEX_WAIT_BITSET, FUTEX_WAKE_BITSET |
| 2.6.22 | 2007 | FUTEX_PRIVATE_FLAG (process-private fast path) |
| 2.6.26 | 2008 | Removed FUTEX_FD (inherently racy) |
| 2.6.31 | 2009 | FUTEX_WAIT_REQUEUE_PI, FUTEX_CMP_REQUEUE_PI |
| 3.14 | 2014 | Better NUMA-aware hash tables |
| 4.2 | 2015 | Futex PI improvements |
| 5.15–5.16 | 2021–2022 | Code refactored into kernel/futex/ modules |
| 5.16 | 2022 | futex_waitv(): wait on multiple futexes (first futex2 piece) |
Key Evolutionary Insights:
From simple to complex: Futex started with just wait/wake. Real-world needs drove addition of requeue (for condvars), PI (for real-time), and robust (for fault tolerance).
Correctness and security lessons: FUTEX_FD was removed because it was inherently racy (wakeups could be missed). Lessons like this continue to shape futex development.
Performance refinements: Private futex optimizations, NUMA-aware hashing, and hash table sizing reflect continuous performance tuning.
Ongoing evolution: the futex2 work addresses remaining limitations; multiple-wait (futex_waitv) is already merged, while variable-size futexes are still being developed.
```c
/*
 * Futex2: the next-generation interface
 *
 * The classic futex syscall has limitations that futex2 aims to address.
 * Status: futex_waitv() was merged in Linux 5.16; the remaining pieces
 * are still evolving.
 */

/*
 * LIMITATION 1: Fixed 32-bit futex word
 *
 * Current:  the futex word is always a uint32_t
 * Problem:  64-bit atomic operations are increasingly common
 *
 * Proposed: variable-size futexes (8, 16, 32, 64-bit)
 */
struct futex_waitv {
    uint64_t val;        // Expected value
    uint64_t uaddr;      // Address of the futex (as a 64-bit integer)
    uint32_t flags;      // FUTEX2_SIZE_U8/U16/U32/U64, FUTEX2_PRIVATE, ...
    uint32_t __reserved;
};

/*
 * LIMITATION 2: Can only wait on one futex at a time
 *
 * Current:  one FUTEX_WAIT per syscall
 * Problem:  polling multiple futexes requires extra threads or epoll hacks
 *
 * Merged (Linux 5.16): wait on multiple futexes simultaneously.
 * There is no glibc wrapper; call it via syscall(SYS_futex_waitv, ...).
 */
long futex_waitv(struct futex_waitv *waiters, unsigned int nr_futexes,
                 unsigned int flags, struct timespec *timeout,
                 clockid_t clockid);

/*
 * Usage example: wait for either of two events.
 * (At the time of the 5.16 merge only 32-bit futexes were accepted;
 * the timeout is absolute, against the clock given in the last argument.)
 */
struct futex_waitv events[2] = {
    { .uaddr = (uintptr_t)&event1, .val = 0,
      .flags = FUTEX2_SIZE_U32 | FUTEX2_PRIVATE },
    { .uaddr = (uintptr_t)&event2, .val = 0,
      .flags = FUTEX2_SIZE_U32 | FUTEX2_PRIVATE },
};

ret = syscall(SYS_futex_waitv, events, 2, 0, &timeout, CLOCK_MONOTONIC);
// Returns the index of the futex that woke us, or -1 with errno set
// (for example ETIMEDOUT)

/*
 * LIMITATION 3: Timeout handling quirks
 *
 * Current:  inconsistent timeout semantics across operations
 * Proposed: unified, explicit timeout handling in the newer interfaces
 */

/*
 * STATUS: futex_waitv() (multi-wait) was merged in Linux 5.16, driven
 * largely by Wine/Proton, which needs to wait on multiple Windows-style
 * synchronization objects at once. The rest of futex2 (variable sizes,
 * NUMA awareness) continues to evolve in later kernels.
 */
```

The original futex syscall will never be removed or incompatibly changed: the Linux kernel's userspace ABI is stable. New features arrive as new operations or new syscalls (like futex_waitv), not as changes to existing behavior. Existing code will continue to work indefinitely.
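Because availability depends on the running kernel, portable code typically probes for the new syscall and falls back to plain FUTEX_WAIT when it is absent. A minimal detection sketch, assuming your libc headers define SYS_futex_waitv:

```c
// Hypothetical sketch: detect futex_waitv() support at runtime.
#define _GNU_SOURCE
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

static bool have_futex_waitv(void)
{
#ifdef SYS_futex_waitv
    // Zero waiters is always invalid: a supporting kernel answers EINVAL,
    // while an older kernel answers ENOSYS.
    long ret = syscall(SYS_futex_waitv, NULL, 0u, 0u, NULL, 0);
    return !(ret == -1 && errno == ENOSYS);
#else
    return false;   // built against headers that predate futex_waitv
#endif
}
```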
Most applications use futex indirectly through glibc's NPTL (Native POSIX Thread Library). Understanding how NPTL uses futex completes our picture.
```c
/*
 * How glibc NPTL uses futex
 * Source: glibc/nptl/*.c (simplified)
 */

/*
 * pthread_mutex_t structure (simplified, glibc 2.34+)
 */
typedef struct {
    int __lock;             // Futex word
    unsigned int __count;   // Recursive lock count
    int __owner;            // Owner TID (for error checking)
    unsigned int __nusers;  // Debugging: number of users
    int __kind;             // Mutex type (normal, recursive, etc.)
    // ... additional fields for robust, PI, etc.
} pthread_mutex_t;

/*
 * pthread_mutex_lock implementation (simplified; real glibc stores and
 * compares the kernel TID rather than pthread_self())
 */
int pthread_mutex_lock(pthread_mutex_t *mutex)
{
    int type = mutex->__kind & PTHREAD_MUTEX_KIND_MASK;

    switch (type) {
    case PTHREAD_MUTEX_NORMAL:
        return lll_lock(&mutex->__lock);    // Direct futex path

    case PTHREAD_MUTEX_RECURSIVE:
        if (mutex->__owner == pthread_self()) {
            mutex->__count++;
            return 0;   // Already hold it, just increment the count
        }
        lll_lock(&mutex->__lock);
        mutex->__owner = pthread_self();
        mutex->__count = 1;
        return 0;

    case PTHREAD_MUTEX_ERRORCHECK:
        if (mutex->__owner == pthread_self())
            return EDEADLK;   // Error: would deadlock
        lll_lock(&mutex->__lock);
        mutex->__owner = pthread_self();
        return 0;

    // ... PI and robust variants
    }
}

/*
 * The low-level lock (lll) functions in NPTL directly use futex
 * operations. Shown with C11-style atomics for clarity; glibc uses
 * its own atomic primitives.
 */

// Lock acquire (fast path, inlined)
static inline int lll_lock(int *futex)
{
    int expected = 0;

    // Fast path: 0 -> 1 (locked, no waiters) without entering the kernel
    if (__glibc_likely(atomic_compare_exchange_strong(futex, &expected, 1)))
        return 0;

    lll_lock_wait(futex);   // Slow path
    return 0;
}

// Slow path: futex wait loop
void lll_lock_wait(int *futex)
{
    // Move to the contended state (2). If the old value was 0, we
    // actually acquired the lock while marking it contended.
    int old = atomic_exchange(futex, 2);

    while (old != 0) {
        // Sleep only while the word is still 2; the kernel re-checks it
        syscall(SYS_futex, futex, FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 2, NULL);
        old = atomic_exchange(futex, 2);
    }
}

/*
 * Condition variables and requeue (classic NPTL algorithm, simplified;
 * field names are illustrative)
 *
 * pthread_cond_broadcast() used FUTEX_CMP_REQUEUE: wake one waiter and
 * requeue the rest onto the mutex, so they don't all stampede into the
 * kernel at once.
 */
int pthread_cond_broadcast(pthread_cond_t *cond)
{
    // Bump the wakeup sequence so current waiters become eligible to wake
    atomic_fetch_add(&cond->__futex, 1);

    // Wake 1 thread, requeue all the others onto the mutex's futex.
    // (The timeout argument slot carries the "number to requeue" value.)
    syscall(SYS_futex, &cond->__futex,
            FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_FLAG,
            1,                        // wake one thread
            INT_MAX,                  // requeue all the others
            &cond->__mutex->__lock,   // ...onto the mutex
            cond->__futex);           // expected value (the CMP check)
    return 0;
}
```

NPTL Design Philosophy:
NPTL was designed around futex from the ground up (unlike the older LinuxThreads that bolted synchronization onto process-based threads). Key design choices:
Zero kernel involvement for thread-local operations: Thread creation, exit, and most synchronization are userspace-only until blocking occurs.
Minimal structure sizes: pthread_mutex_t is small enough to embed anywhere without allocation.
Always use FUTEX_PRIVATE_FLAG: NPTL assumes mutexes are private unless PTHREAD_PROCESS_SHARED is set.
Adaptive mutexes: Spin briefly before futex wait (controlled by PTHREAD_MUTEX_ADAPTIVE_NP).
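As a concrete example of that last point, an adaptive mutex is requested through a GNU-specific mutex type (a minimal sketch; requires _GNU_SOURCE):

```c
// Minimal sketch: request an adaptive mutex (GNU extension), which spins
// briefly in userspace before falling back to FUTEX_WAIT.
#define _GNU_SOURCE
#include <pthread.h>

pthread_mutex_t adaptive_lock;

static int init_adaptive_mutex(void)
{
    pthread_mutexattr_t attr;
    int err = pthread_mutexattr_init(&attr);
    if (err)
        return err;

    // PTHREAD_MUTEX_ADAPTIVE_NP: spin a bounded number of times on SMP
    // before sleeping in the kernel
    err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    if (!err)
        err = pthread_mutex_init(&adaptive_lock, &attr);

    pthread_mutexattr_destroy(&attr);
    return err;
}
```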
While glibc NPTL is the most common pthread implementation on Linux, alternatives exist: musl libc has its own futex-based pthreads, the Go runtime uses futexes for its internal synchronization, and Rust's standard library implements its own futex-based locks on Linux (the popular parking_lot crate takes a similar approach). All share the same futex philosophy: fast path in userspace, kernel only for blocking.
When synchronization goes wrong, futex-level debugging tools become essential. Let's explore techniques for diagnosing futex-related problems.
```bash
#!/bin/bash
# Debugging futex issues

# ============================================
# TECHNIQUE 1: Find threads blocked on futex
# ============================================

# List all threads in your process
ps -eLf | grep my_application

# Check what each thread is doing
cat /proc/<pid>/task/<tid>/syscall
# Output like: 202 0x7f1234 0x80 0x2 ...
# 202 = SYS_futex (x86-64), followed by its arguments

# More readable with strace
strace -f -e futex -p <pid>
# Shows all futex calls in real time

# ============================================
# TECHNIQUE 2: GDB for lock state inspection
# ============================================

# Attach to a running process
gdb -p <pid>

# Find mutex state
(gdb) print *my_mutex
# Shows the __lock field value:
#   0 = unlocked
#   1 = locked, no waiters
#   2 = locked, has waiters

# Find who holds a PI mutex
(gdb) print my_pi_mutex.__owner
# Shows the TID of the owner

# Get a backtrace of all threads
(gdb) thread apply all bt

# Find threads in futex_wait
(gdb) thread apply all bt | grep futex_wait

# ============================================
# TECHNIQUE 3: bpftrace for futex timing
# ============================================

# Find futex waits longer than one second
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_futex /(args->op & 0x7f) == 0/ {
    @start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_futex /@start[tid] && nsecs - @start[tid] > 1000000000/ {
    printf("Thread %d waited %d ms in futex\n",
           tid, (nsecs - @start[tid]) / 1000000);
    @[ustack] = count();
    delete(@start[tid]);
}'

# ============================================
# TECHNIQUE 4: lockdep for deadlock detection
# ============================================

# Enable CONFIG_LOCKDEP in the kernel (debug builds).
# Note: lockdep validates kernel-internal locks (including the rt_mutexes
# behind PI futexes); it does not see userspace lock ordering.

# View lock statistics (if enabled)
cat /proc/lockdep_stats

# Optionally make debug machines panic on soft lockups so hangs are caught
echo 1 > /proc/sys/kernel/softlockup_panic

# ============================================
# TECHNIQUE 5: Valgrind helgrind/drd
# ============================================

# Detect data races and lock misuse
valgrind --tool=helgrind ./my_application

# DRD is similar but uses a different detection strategy
valgrind --tool=drd ./my_application

# Common issues detected:
#  - Lock order inconsistency (potential deadlock)
#  - Unlocking a lock that isn't held
#  - Data race: concurrent access without synchronization
```

| Symptom | Likely Cause | Diagnostic | Fix |
|---|---|---|---|
| Thread stuck in futex_wait | Deadlock or lost wakeup | Check lock order, waker logic | Fix lock ordering or wake call |
| High EAGAIN rate | Contention or ABA | Profile fast/slow path | Reduce contention or fix ABA |
| Futex word corruption | Memory corruption | ASAN, watchpoints | Fix buffer overflow/use-after-free |
| PI deadlock | Priority inversion unresolved | Check rt_mutex chain | Verify PI mutex setup |
| Robust futex not cleaning | Missing set_robust_list | Check startup code | Call set_robust_list early |
The #1 cause of futex-related hangs is deadlock from inconsistent lock ordering. Establish and enforce a global lock order in your application. Document it. Use lockdep or ThreadSanitizer to catch violations during development.
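For example, ThreadSanitizer is enabled with a single compiler flag (shown for gcc/clang; adjust for your build system):

```bash
# Build with ThreadSanitizer to catch data races and lock-order inversions
gcc -g -O1 -fsanitize=thread -pthread -o my_application my_application.c

# Run normally; TSan prints a report for each race or lock-order violation
./my_application
```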
We've completed our deep dive into the Linux kernel's futex implementation. Let's consolidate the key knowledge.
Module Complete:
You have now mastered the futex synchronization primitive from every angle: the kernel source layout and syscall dispatch, priority inheritance and robust futexes, the evolution toward futex2, how glibc NPTL builds pthreads on top of futex, and how to debug futex-level problems.
This knowledge positions you to build, debug, and optimize synchronization at the lowest level on Linux systems.
You now have a complete understanding of the futex primitive—from philosophical motivation through kernel implementation detail. This is the foundation upon which all modern Linux threading is built. Whether you're debugging a production deadlock, implementing a custom synchronization primitive, or simply understanding why your mutexes are fast, you now have the knowledge to succeed.