When a thread calls FUTEX_WAIT, it vanishes from the userspace perspective—blocked, waiting to be woken. But what actually happens inside the kernel during those microseconds to milliseconds of sleep? Understanding the kernel's role in futex is essential for debugging synchronization problems and reasoning about their performance.
This page takes you inside the Linux kernel's futex subsystem. We'll trace execution paths, examine data structures, and understand the careful dance between correctness and performance that makes futex work.
By the end of this page, you will understand the kernel's futex hash table, wait queue management, the critical sections that protect futex state, how the kernel handles contended operations, and the scheduler integration that makes sleeping and waking efficient.
The kernel must efficiently manage wait queues for potentially millions of different futex addresses. Using a per-futex lock or queue would consume enormous memory. Instead, the kernel uses a hash table to multiplex many futexes onto a smaller number of hash buckets.
Hash Table Architecture:
The futex hash table is a global array of hash buckets (per-NUMA-node on NUMA systems). Each bucket contains a spinlock and a list of waiting threads.
```c
/*
 * Kernel futex data structures (simplified from kernel/futex/core.c)
 */

/*
 * The key that uniquely identifies a futex.
 * Different from the userspace address - handles shared memory.
 */
union futex_key {
    struct {
        u64 i_seq;               // Inode sequence number
        unsigned long pgoff;     // Page offset
        unsigned int offset;     // Offset within page
    } shared;                    // For shared memory futexes
    struct {
        union {
            struct mm_struct *mm;  // Process memory descriptor
            u64 __tmp;
        };
        unsigned long address;   // Virtual address
        unsigned int offset;     // Offset within page
    } private;                   // For private futexes
    struct {
        u64 ptr;                 // Raw 64-bit key
        unsigned int offset;
        unsigned int word;
    } both;
};

/*
 * A waiter in the futex wait queue
 */
struct futex_q {
    struct plist_node list;      // Priority-sorted list link
    struct task_struct *task;    // The waiting thread
    spinlock_t *lock_ptr;        // Backpointer to bucket lock
    union futex_key key;         // Which futex we're waiting on
    u32 bitset;                  // Bitset for selective wake
    struct rt_mutex_waiter *rt_waiter;  // For PI futexes
    struct futex_pi_state *pi_state;    // PI state tracking
    // Requeue state
    atomic_t requeue_state;
    bool requeue_pi_key;
};

/*
 * A hash bucket with its wait queue
 */
struct futex_hash_bucket {
    atomic_t waiters;            // Fast check: any waiters?
    spinlock_t lock;             // Protects the wait list
    struct plist_head chain;     // Priority-sorted waiter list
} ____cacheline_aligned;

/*
 * Number of hash buckets (power of 2 for fast modulo)
 * Sized based on expected number of simultaneous futexes
 */
#define futex_hashshift 8
#define futex_hashsize (1 << futex_hashshift)  // 256 buckets

/* The global hash table (simplified - actual is per-NUMA-node) */
static struct futex_hash_bucket futex_queues[futex_hashsize];

/*
 * Hash function: maps futex key to bucket
 */
static struct futex_hash_bucket *hash_futex(union futex_key *key)
{
    u32 hash = jhash2((u32 *)&key->both.word,
                      sizeof(key->both) / sizeof(u32), 0);
    return &futex_queues[hash & (futex_hashsize - 1)];
}
```

Why Hashing?
The hash table design reflects a fundamental trade-off:
| Approach | Memory | Contention | Lookup Time |
|---|---|---|---|
| Per-futex queue | O(F) for F futexes | Zero (isolated) | O(1) |
| Single global queue | O(1) | Extreme | O(N) for N waiters |
| Hash table (K buckets) | O(K) | Low if K >> cpus | O(N/K) average |
With 256 buckets and reasonable workloads, the expected waiters per bucket is small, keeping both lookup time and contention manageable.
The Hash Collision Issue:
Multiple futexes can hash to the same bucket. This means waiters for unrelated futexes share a single bucket spinlock, and a wake operation must scan past waiters that belong to other futexes.
Modern kernels mitigate this with per-NUMA-node hash tables and careful bucket sizing.
Notice that wait lists use plist (priority list), not simple list. This enables priority-aware waking—higher priority threads are woken first. This is essential for real-time applications and preventing priority inversion scenarios.
A critical challenge in futex design is: how does the kernel know two threads are waiting on the same futex? Userspace passes a virtual address, but the same physical memory can be mapped at different virtual addresses in different processes, and the same virtual address can refer to entirely different memory in different address spaces.
The Key Abstraction:
The kernel computes a futex key that uniquely identifies the futex regardless of how it's mapped:
```c
/*
 * Computing the futex key from userspace address
 * (Simplified from kernel/futex/core.c)
 */
static int get_futex_key(u32 __user *uaddr, bool fshared,
                         union futex_key *key)
{
    struct mm_struct *mm = current->mm;
    unsigned long address = (unsigned long)uaddr;
    struct page *page;

    // Always record offset within page
    key->both.offset = address % PAGE_SIZE;

    if (!fshared) {
        /*
         * PRIVATE FUTEX (FUTEX_PRIVATE_FLAG set)
         *
         * Futex is only used within this process.
         * Key is simply (mm, virtual_address).
         * Fast - no page table walk needed for key.
         */
        key->private.mm = mm;
        key->private.address = address;
        return 0;
    }

    /*
     * SHARED FUTEX (might be in shared memory)
     *
     * Need to identify by backing storage, not virtual address.
     * Key is (inode, page_offset) for file-backed memory,
     * or (special marker, physical_page) for anonymous shared.
     */

    // Walk page tables to find the actual page
    page = get_user_page_fast(address);
    if (!page)
        return -EFAULT;  // Not mapped

    if (PageAnon(page)) {
        // Anonymous shared memory (e.g., mmap MAP_SHARED|MAP_ANONYMOUS)
        // Use page pointer as key (unique per physical page)
        key->shared.i_seq = (u64)page_to_pfn(page);
        key->shared.pgoff = 0;
    } else {
        // File-backed memory (e.g., mmap of regular file, shm_open)
        // Use inode + file offset as key
        struct inode *inode = page->mapping->host;
        key->shared.i_seq = get_inode_sequence_number(inode);
        key->shared.pgoff = page->index;  // Page offset in file
    }

    put_page(page);
    return 0;
}

/*
 * Key comparison
 */
static inline int match_futex(union futex_key *key1, union futex_key *key2)
{
    return (key1->both.word == key2->both.word &&
            key1->both.ptr == key2->both.ptr &&
            key1->both.offset == key2->both.offset);
}
```

Shared futex key computation is significantly more expensive—it requires page table walks and may trigger page faults. Always use FUTEX_PRIVATE_FLAG for intra-process synchronization. The kernel cannot infer privacy from usage patterns; you must explicitly request it.
Let's trace through exactly what happens in the kernel when a thread calls FUTEX_WAIT. Understanding this path is essential for debugging and optimization.
```c
/*
 * FUTEX_WAIT kernel implementation (heavily simplified)
 * Actual code: kernel/futex/waitwake.c
 */
static int futex_wait(u32 __user *uaddr, u32 val, ktime_t *time, u32 bitset)
{
    struct futex_hash_bucket *hb;
    struct futex_q q = FUTEX_Q_INIT;
    struct hrtimer_sleeper timeout_timer;
    u32 uval;
    int ret;

    /*
     * STEP 1: Compute the futex key
     */
    ret = get_futex_key(uaddr, FLAGS_SHARED, &q.key);
    if (ret)
        return ret;

    /*
     * STEP 2: Find and lock the hash bucket
     */
    hb = hash_futex(&q.key);
    q.lock_ptr = &hb->lock;
    q.bitset = bitset;

retry:
    /*
     * STEP 3: Acquire the bucket spinlock
     * This serializes against concurrent FUTEX_WAKE
     */
    spin_lock(&hb->lock);

    /*
     * STEP 4: Read the futex value from userspace
     * This is protected by the bucket lock
     */
    ret = get_futex_value_locked(&uval, uaddr);
    if (ret) {
        spin_unlock(&hb->lock);
        // Page fault - might need to fault in the page
        ret = get_user(uval, uaddr);
        if (ret)
            return -EFAULT;
        goto retry;
    }

    /*
     * STEP 5: Compare with expected value
     * This is the ATOMIC check - we hold the bucket lock
     */
    if (uval != val) {
        spin_unlock(&hb->lock);
        return -EAGAIN;  // Value changed - don't sleep
    }

    /*
     * STEP 6: Prepare to sleep - set task state
     */
    set_current_state(TASK_INTERRUPTIBLE);

    /*
     * STEP 7: Queue ourselves on the wait list
     * Still holding bucket lock - wake cannot miss us
     */
    futex_queue(&q, hb);  // Adds to hb->chain

    /*
     * STEP 8: Release bucket lock
     * From this point, FUTEX_WAKE can find and wake us
     */
    spin_unlock(&hb->lock);

    /*
     * STEP 9: Arm the timeout if one was specified
     */
    if (time)
        hrtimer_start(&timeout_timer.timer, *time, HRTIMER_MODE_ABS);

    /*
     * STEP 10: Actually go to sleep
     * schedule() doesn't return until we're woken
     * (the sleeper's .task becomes NULL once the timer has fired)
     */
    if (likely(!plist_node_empty(&q.list))) {
        if (!time || timeout_timer.task)
            schedule();  // <<< SLEEP HERE >>>
    }

    /*
     * STEP 11: We've been woken or timed out
     */
    __set_current_state(TASK_RUNNING);

    /*
     * STEP 12: Cleanup - if we're still on the wait queue, FUTEX_WAKE
     * never found us; remove ourselves and report a timeout.
     */
    ret = 0;
    if (!plist_node_empty(&q.list)) {
        spin_lock(&hb->lock);
        if (!plist_node_empty(&q.list))
            futex_unqueue(&q);
        spin_unlock(&hb->lock);
        ret = -ETIMEDOUT;  // We timed out, not woken
    }
    return ret;
}
```

The Critical Atomicity:
The key to correctness is that steps 4-7 (read value, compare, queue) happen while holding the bucket spinlock. This prevents the lost wakeup race:
```
Without atomicity:           With atomicity:

WAIT:                        WAIT:
1. Read value = LOCKED       1. Lock bucket
                             2. Read value = LOCKED
WAKE:                        3. Queue self
2. Set value = UNLOCKED      4. Unlock bucket
3. Check queue (empty!)
4. No one to wake            WAKE:
                             5. Lock bucket
WAIT (continued):            6. Set value = UNLOCKED
5. Queue self                7. Find waiter, wake
6. Sleep forever...          8. Unlock bucket

                             WAIT wakes correctly!
```
Notice we set TASK_INTERRUPTIBLE before queueing, while still holding the lock. This is a crucial Linux kernel pattern: we signal intent to sleep before it's possible for wake to find us. If wake occurs between queue and schedule(), it will clear our sleep state, and schedule() will return immediately instead of sleeping. This is how Linux prevents lost wakeups at the scheduler level.
The wake path is simpler than wait—it searches the wait queue and wakes threads. But the details matter for understanding performance.
```c
/*
 * FUTEX_WAKE kernel implementation (simplified)
 */
static int futex_wake(u32 __user *uaddr, int nr_wake, u32 bitset)
{
    struct futex_hash_bucket *hb;
    struct futex_q *this, *next;
    union futex_key key;
    int ret;
    DEFINE_WAKE_Q(wake_q);  // List of threads to wake

    /*
     * STEP 1: Compute the futex key
     */
    ret = get_futex_key(uaddr, FLAGS_SHARED, &key);
    if (ret)
        return ret;

    /*
     * STEP 2: Find and lock the hash bucket
     */
    hb = hash_futex(&key);
    spin_lock(&hb->lock);

    /*
     * STEP 3: Scan the wait queue for matching waiters
     */
    plist_for_each_entry_safe(this, next, &hb->chain, list) {
        /*
         * Check if this waiter matches:
         * 1. Same futex key
         * 2. Bitset matches (if using WAKE_BITSET)
         */
        if (match_futex(&this->key, &key)) {
            if (!(this->bitset & bitset))
                continue;  // Bitset doesn't match

            /*
             * STEP 4: Mark this waiter for waking
             * Don't wake while holding spinlock - add to batch
             */
            mark_wake_futex(&wake_q, this);

            if (++ret >= nr_wake)
                break;  // Woken enough
        }
    }

    spin_unlock(&hb->lock);

    /*
     * STEP 5: Actually wake the threads
     * Done outside the spinlock for better latency
     */
    wake_up_q(&wake_q);

    return ret;  // Return count of woken threads
}

/*
 * mark_wake_futex: Prepare a waiter for waking
 */
static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
{
    struct task_struct *p = q->task;

    /*
     * Remove from the wait queue
     */
    plist_del(&q->list, &q->list.plist);

    /*
     * Clear the waiter's lock pointer so it knows it was woken
     * (not timed out or spuriously woken)
     */
    q->lock_ptr = NULL;

    /*
     * Add to wake queue for batch waking
     */
    wake_q_add(wake_q, p);
}

/*
 * wake_up_q: Batch wake all collected threads
 * Called WITHOUT holding any spinlocks
 */
static void wake_up_q(struct wake_q_head *head)
{
    struct task_struct *task;

    while ((task = wake_q_next(head)) != NULL) {
        /*
         * Set task state to TASK_RUNNING.
         * If the task is currently in schedule(), it will return;
         * otherwise, it will be ready to run when the scheduler picks it.
         */
        wake_up_process(task);

        /*
         * Drop reference acquired by wake_q_add
         */
        put_task_struct(task);
    }
}
```

Batch Waking Optimization:
Notice the two-phase approach: collect waiters while holding the lock, then wake them after releasing. This is critical for performance:
The bucket spinlock is held only for the queue scan—wake_up_process() happens without holding the spinlock, so concurrent wakers and waiters are not serialized behind scheduler work.

Wake Queue Priority:
The plist_for_each iteration respects priority order. Higher priority threads (lower niceness value) appear first in the list and are woken first. This naturally provides priority-aware synchronization without explicit priority logic in the wake code.
When waking, the kernel scans all waiters in the hash bucket, not just those for our futex (they're intermixed due to hash collisions). This is why extreme futex usage (thousands of contended futexes) can show bucket contention in profiling. The solution is larger hash tables or futex locality (related futexes on different pages).
Futex is deeply integrated with the Linux scheduler. When a thread sleeps on a futex, it's not just blocked—it's precisely integrated with the kernel's task scheduling machinery.
```c
/*
 * Scheduler integration points for futex
 */

/*
 * Task states relevant to futex:
 *
 * TASK_RUNNING         - Thread can be scheduled to run
 * TASK_INTERRUPTIBLE   - Sleeping, can be woken by signals or futex_wake
 * TASK_UNINTERRUPTIBLE - Sleeping, only woken by specific events
 */

/*
 * When entering futex_wait:
 */
void futex_wait_prepare(struct futex_q *q)
{
    /*
     * Mark ourselves as wanting to sleep.
     * This MUST happen before we're visible in the wait queue.
     */
    set_current_state(TASK_INTERRUPTIBLE);

    /*
     * Memory barrier ensures the state change is visible
     * before we check the futex value
     */
    smp_mb();
}

/*
 * The actual schedule() call:
 */
void futex_wait_sleep(void)
{
    /*
     * schedule() checks current->state:
     *
     * If TASK_INTERRUPTIBLE and no pending signal/wake:
     *   - Remove from run queue
     *   - Pick next task to run
     *   - Context switch away
     *   - <we're now sleeping>
     *
     * If TASK_RUNNING (someone called wake_up_process):
     *   - Return immediately, don't sleep
     *   - This prevents lost wakeups
     */
    schedule();
}

/*
 * When woken by futex_wake or timeout:
 */
void futex_wait_cleanup(struct futex_q *q)
{
    /*
     * Restore to running state
     */
    __set_current_state(TASK_RUNNING);

    /*
     * Check why we woke up:
     * 1. mark_wake_futex set q->lock_ptr = NULL -> woken by FUTEX_WAKE
     * 2. timer fired                            -> timeout
     * 3. signal pending                         -> interrupted
     * 4. q still in list                        -> spurious wakeup
     */
}

/*
 * wake_up_process: The actual waking mechanism (simplified)
 */
int wake_up_process(struct task_struct *p)
{
    unsigned long flags;

    raw_spin_lock_irqsave(&p->pi_lock, flags);

    /*
     * If the task was sleeping (TASK_INTERRUPTIBLE/UNINTERRUPTIBLE):
     * 1. Change state to TASK_RUNNING (atomic, with memory barriers)
     * 2. Add to the scheduler run queue
     * 3. If on the same CPU, possibly preempt the current task
     */
    if (p->state != TASK_RUNNING) {
        p->state = TASK_RUNNING;
        enqueue_task(rq, p, ENQUEUE_WAKEUP);
        check_preempt_curr(rq, p);  // Might preempt current
    }

    raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    return 1;
}
```

The State Machine:
A thread waiting on a futex goes through these states:
| State | Run Queue | Description |
|---|---|---|
| TASK_RUNNING | On queue | Executing or ready to execute |
| TASK_INTERRUPTIBLE | Off queue | Sleeping, wake on futex/signal |
| TASK_RUNNING | On queue | Woken, waiting for CPU time |
| TASK_RUNNING | Current | Actively running after wakeup |
Futex uses TASK_INTERRUPTIBLE (not TASK_UNINTERRUPTIBLE) so threads can be woken by signals. This allows pthread_cancel, timeouts via signals, and debugger attachment to sleeping threads. TASK_UNINTERRUPTIBLE would make the sleep immune to signals—appropriate for short disk I/O waits, but wrong for synchronization waits of unbounded duration.
Futex correctness depends critically on memory ordering. Modern CPUs and compilers reorder memory operations for performance. Without proper barriers, futex operations could appear to happen in wrong order, causing lost wakeups or other bugs.
```c
/*
 * Memory ordering in futex operations
 *
 * There are THREE levels of memory ordering to consider:
 * 1. Compiler reordering (prevented by volatile/atomic)
 * 2. CPU store buffer reordering (prevented by barriers)
 * 3. Cache coherence delays (handled by cache protocols)
 */

/*
 * FUTEX_WAIT ordering requirements:
 *
 * We need to ensure that:
 * 1. Reading the futex value happens AFTER entering the wait queue
 *    (conceptually - in terms of visibility to WAKE)
 * 2. Our TASK_INTERRUPTIBLE state is visible before we could miss a wake
 */
void futex_wait_ordering(void)
{
    /*
     * This sequence must be seen in order:
     */

    // Step 1: Set state to sleeping
    set_current_state(TASK_INTERRUPTIBLE);
    /*
     * Implicit barrier in set_current_state:
     * - Compiler barrier prevents reordering
     * - Memory barrier ensures state is visible to other CPUs
     */

    // Step 2: Lock bucket (spinlock has acquire semantics)
    spin_lock(&bucket->lock);

    // Step 3: Read futex value from userspace
    // Protected by spinlock - ordered after state set
    get_user(val, uaddr);

    // Step 4: Add to wait queue
    list_add(&waiter, &queue);

    // Step 5: Release spinlock (release semantics)
    spin_unlock(&bucket->lock);
    /*
     * Spinlock release ensures our wait queue addition
     * is visible before we actually sleep
     */

    // Step 6: Go to sleep
    schedule();
}

/*
 * FUTEX_WAKE ordering requirements:
 *
 * The waker typically does:
 * 1. Write new value to futex (e.g., set UNLOCKED)
 * 2. Call FUTEX_WAKE
 *
 * We need waiters to see the new value if they re-check after waking
 */
void futex_wake_ordering(void)
{
    /*
     * Userspace writes to the futex word should use an atomic store
     * with release semantics (memory_order_release):
     *
     *   atomic_store(&futex, UNLOCKED, memory_order_release);
     *   futex_wake(&futex, 1);
     */

    // Step 1: Lock bucket (acquire semantics)
    spin_lock(&bucket->lock);
    /*
     * This acquires all previous stores to the futex word
     * from our perspective (even though we don't read it)
     */

    // Step 2: Remove waiters from queue
    list_del(&waiter);

    // Step 3: Release bucket lock
    spin_unlock(&bucket->lock);
    /*
     * Release semantics ensures queue removal is complete
     */

    // Step 4: Wake the thread
    wake_up_process(waiter->task);
    /*
     * wake_up_process has internal barriers to ensure
     * the state change is immediately visible to the woken task
     */
}

/*
 * The full memory ordering timeline for a lock:
 *
 * Unlocker (CPU 1):             Waiter (CPU 2):
 * -----------------             ---------------
 * <execute critical section>
 *                               load futex_word
 *                               compare (sees LOCKED)
 *                               set TASK_INTERRUPTIBLE
 *                               acquire bucket lock
 *                               add to queue
 *                               release bucket lock
 *                               schedule() [sleeps]
 * release fence
 * store UNLOCKED
 * acquire bucket lock
 * find waiter
 * remove from queue
 * release bucket lock
 * wake_up_process
 *                               [wakes in schedule()]
 *                               load futex_word
 *                               [sees UNLOCKED due to release-acquire]
 *                               acquire lock
 */
```

The kernel's barriers only ensure correctness within futex operations. Userspace code must also use proper atomic operations with correct memory ordering. A plain (non-atomic) store followed by FUTEX_WAKE can race due to store buffer delays. Always use an atomic store with memory_order_release before waking.
Let's consolidate all the kernel data structures involved in futex into a comprehensive view.
| Structure | Purpose | Lifetime | Per- |
|---|---|---|---|
| futex_hash_bucket[] | Hash table of wait queues | Kernel lifetime | NUMA node |
| futex_hash_bucket.lock | Protects one bucket's queue | Held briefly | Bucket |
| futex_hash_bucket.chain | Priority-sorted waiter list | Contains waiters | Bucket |
| futex_q | A single waiting thread | Duration of wait | Waiter |
| futex_key (private) | Identifies private futex | Duration of operation | Operation |
| futex_key (shared) | Identifies shared futex | Duration of operation | Operation |
| wake_q_head | Batch of threads to wake | During wake operation | Wake call |
| task_struct.state | Running/sleeping status | Thread lifetime | Thread |
```c
/*
 * Memory layout visualization
 */

/*
 * KERNEL MEMORY (process-independent):
 *
 * ┌─────────────────────────────────────────────────────────┐
 * │ futex_queues[0..255] (per-NUMA-node hash table)         │
 * │ ├── bucket[0]                                           │
 * │ │   ├── lock (spinlock)                                 │
 * │ │   └── chain (plist_head)                              │
 * │ │       ├── futex_q (thread A, futex X)                 │
 * │ │       ├── futex_q (thread B, futex X)                 │
 * │ │       └── futex_q (thread C, futex Y) ← collision     │
 * │ ├── bucket[1]                                           │
 * │ │   └── chain                                           │
 * │ │       └── futex_q (thread D, futex Z)                 │
 * │ └── ... (256 buckets total)                             │
 * └─────────────────────────────────────────────────────────┘
 *
 * USER MEMORY (per-process):
 *
 * ┌─────────────────────────────────────────────────────────┐
 * │ Process A address space                                 │
 * │ ├── 0x7f..100: futex X (value: CONTENDED)               │
 * │ ├── 0x7f..200: futex Y (value: LOCKED)                  │
 * │ └── ...                                                 │
 * └─────────────────────────────────────────────────────────┘
 *
 * ┌─────────────────────────────────────────────────────────┐
 * │ Shared Memory Region (mmap, shm_open)                   │
 * │ └── page 0x1234:                                        │
 * │     └── futex Z (value: UNLOCKED)                       │
 * │         ↑                                               │
 * │         └── Accessible from multiple processes          │
 * │             with different virtual addresses            │
 * └─────────────────────────────────────────────────────────┘
 */
```

The futex subsystem is very memory-efficient. The kernel only allocates memory for threads actively waiting—the hash table structure is fixed-size and small. A million futexes with no contention consume zero kernel memory for futex purposes.
We've taken a deep dive into the kernel internals of futex: the hash table and key abstraction, the wait and wake paths, scheduler integration, and the memory ordering that holds it all together.
What's Next:
Now that we understand both the userspace interface and kernel implementation, the next page examines the performance benefits of futex in detail. We'll quantify the speedup, analyze profiling data, and understand when futex delivers the most value.
You now understand what happens inside the Linux kernel when futex operations are invoked. This knowledge enables you to debug synchronization at the deepest level, understand performance characteristics, and make informed decisions about synchronization design.