When a thread calls FUTEX_WAIT, it vanishes from the userspace perspective—blocked, waiting to be woken. But what actually happens inside the kernel during those microseconds to milliseconds of sleep? Understanding the kernel's role in futex is essential for debugging synchronization problems and reasoning about their performance.
This page takes you inside the Linux kernel's futex subsystem. We'll trace execution paths, examine data structures, and understand the careful dance between correctness and performance that makes futex work.
By the end of this page, you will understand the kernel's futex hash table, wait queue management, the critical sections that protect futex state, how the kernel handles contended operations, and the scheduler integration that makes sleeping and waking efficient.
The kernel must efficiently manage wait queues for potentially millions of different futex addresses. Using a per-futex lock or queue would consume enormous memory. Instead, the kernel uses a hash table to multiplex many futexes onto a smaller number of hash buckets.
Hash Table Architecture:
The futex hash table is a global array of hash buckets (per-NUMA-node on NUMA systems). Each bucket contains a spinlock and a list of waiting threads.
```c
/*
 * Kernel futex data structures (simplified from kernel/futex/core.c)
 */

/*
 * The key that uniquely identifies a futex.
 * Different from the userspace address - handles shared memory.
 */
union futex_key {
    struct {
        u64 i_seq;               // Inode sequence number
        unsigned long pgoff;     // Page offset
        unsigned int offset;     // Offset within page
    } shared;                    // For shared memory futexes
    struct {
        union {
            struct mm_struct *mm;  // Process memory descriptor
            u64 __tmp;
        };
        unsigned long address;   // Virtual address
        unsigned int offset;     // Offset within page
    } private;                   // For private futexes
    struct {
        u64 ptr;                 // Raw 64-bit key
        unsigned int offset;
        unsigned int word;
    } both;
};

/*
 * A waiter in the futex wait queue
 */
struct futex_q {
    struct plist_node list;      // Priority-sorted list link
    struct task_struct *task;    // The waiting thread
    spinlock_t *lock_ptr;        // Backpointer to bucket lock
    union futex_key key;         // Which futex we're waiting on
    u32 bitset;                  // Bitset for selective wake
    struct rt_mutex_waiter *rt_waiter;  // For PI futexes
    struct futex_pi_state *pi_state;    // PI state tracking
    // Requeue state
    atomic_t requeue_state;
    bool requeue_pi_key;
};

/*
 * A hash bucket with its wait queue
 */
struct futex_hash_bucket {
    atomic_t waiters;            // Fast check: any waiters?
    spinlock_t lock;             // Protects the wait list
    struct plist_head chain;     // Priority-sorted waiter list
} ____cacheline_aligned;

/*
 * Number of hash buckets (power of 2 for fast modulo)
 * Sized based on expected number of simultaneous futexes
 */
#define futex_hashshift 8
#define futex_hashsize (1 << futex_hashshift)  // 256 buckets

/* The global hash table (simplified - actual is per-NUMA-node) */
static struct futex_hash_bucket futex_queues[futex_hashsize];

/*
 * Hash function: maps futex key to bucket
 */
static struct futex_hash_bucket *hash_futex(union futex_key *key)
{
    u32 hash = jhash2((u32 *)&key->both.word,
                      sizeof(key->both) / sizeof(u32), 0);
    return &futex_queues[hash & (futex_hashsize - 1)];
}
```

Why Hashing?
The hash table design reflects a fundamental trade-off:
| Approach | Memory | Contention | Lookup Time |
|---|---|---|---|
| Per-futex queue | O(F) for F futexes | Zero (isolated) | O(1) |
| Single global queue | O(1) | Extreme | O(N) for N waiters |
| Hash table (K buckets) | O(K) | Low if K >> cpus | O(N/K) average |
With 256 buckets and reasonable workloads, the expected waiters per bucket is small, keeping both lookup time and contention manageable.
The Hash Collision Issue:
Multiple futexes can hash to the same bucket. This means waiters for unrelated futexes share a single bucket spinlock, and a wake operation must scan past waiters that belong to other futexes.
Modern kernels mitigate this with per-NUMA-node hash tables and careful bucket sizing.
Notice that wait lists use plist (priority list), not simple list. This enables priority-aware waking—higher priority threads are woken first. This is essential for real-time applications and preventing priority inversion scenarios.
A critical challenge in futex design is: how does the kernel know two threads are waiting on the same futex? Userspace passes a virtual address, but the same physical memory can be mapped at different virtual addresses in different processes, and the same virtual address can refer to entirely different memory in different address spaces.
The Key Abstraction:
The kernel computes a futex key that uniquely identifies the futex regardless of how it's mapped:
```c
/*
 * Computing the futex key from userspace address
 * (Simplified from kernel/futex/core.c)
 */
static int get_futex_key(u32 __user *uaddr, bool fshared,
                         union futex_key *key)
{
    struct mm_struct *mm = current->mm;
    unsigned long address = (unsigned long)uaddr;
    struct page *page;

    // Always record offset within page
    key->both.offset = address % PAGE_SIZE;

    if (!fshared) {
        /*
         * PRIVATE FUTEX (FUTEX_PRIVATE_FLAG set)
         *
         * Futex is only used within this process.
         * Key is simply (mm, virtual_address).
         * Fast - no page table walk needed for key.
         */
        key->private.mm = mm;
        key->private.address = address;
        return 0;
    }

    /*
     * SHARED FUTEX (might be in shared memory)
     *
     * Need to identify by backing storage, not virtual address.
     * Key is (inode, page_offset) for file-backed memory,
     * or (special marker, physical_page) for anonymous shared.
     */

    // Walk page tables to find the actual page
    page = get_user_page_fast(address);
    if (!page)
        return -EFAULT;  // Not mapped

    if (PageAnon(page)) {
        // Anonymous shared memory (e.g., mmap MAP_SHARED|MAP_ANONYMOUS)
        // Use page pointer as key (unique per physical page)
        key->shared.i_seq = (u64)page_to_pfn(page);
        key->shared.pgoff = 0;
    } else {
        // File-backed memory (e.g., mmap of regular file, shm_open)
        // Use inode + file offset as key
        struct inode *inode = page->mapping->host;
        key->shared.i_seq = get_inode_sequence_number(inode);
        key->shared.pgoff = page->index;  // Page offset in file
    }

    put_page(page);
    return 0;
}

/*
 * Key comparison
 */
static inline int match_futex(union futex_key *key1, union futex_key *key2)
{
    return (key1->both.word == key2->both.word &&
            key1->both.ptr == key2->both.ptr &&
            key1->both.offset == key2->both.offset);
}
```

Shared futex key computation is significantly more expensive—it requires page table walks and may trigger page faults. Always use FUTEX_PRIVATE_FLAG for intra-process synchronization. The kernel cannot infer privacy from usage patterns; you must explicitly request it.
Let's trace through exactly what happens in the kernel when a thread calls FUTEX_WAIT. Understanding this path is essential for debugging and optimization.
```c
/*
 * FUTEX_WAIT kernel implementation (heavily simplified)
 * Actual code: kernel/futex/waitwake.c
 */
static int futex_wait(u32 __user *uaddr, u32 val, ktime_t *time, u32 bitset)
{
    struct futex_hash_bucket *hb;
    struct futex_q q = FUTEX_Q_INIT;
    struct hrtimer_sleeper timeout_timer;
    u32 uval;
    int ret;

    /*
     * STEP 1: Compute the futex key
     */
    ret = get_futex_key(uaddr, FLAGS_SHARED, &q.key);
    if (ret)
        return ret;

    /*
     * STEP 2: Find and lock the hash bucket
     */
    hb = hash_futex(&q.key);
    q.lock_ptr = &hb->lock;
    q.bitset = bitset;

retry:
    /*
     * STEP 3: Acquire the bucket spinlock
     * This serializes against concurrent FUTEX_WAKE
     */
    spin_lock(&hb->lock);

    /*
     * STEP 4: Read the futex value from userspace
     * This is protected by the bucket lock
     */
    ret = get_futex_value_locked(&uval, uaddr);
    if (ret) {
        spin_unlock(&hb->lock);
        // Page fault - might need to fault in the page
        ret = get_user(uval, uaddr);
        if (ret)
            return -EFAULT;
        goto retry;
    }

    /*
     * STEP 5: Compare with expected value
     * This is the ATOMIC check - we hold the bucket lock
     */
    if (uval != val) {
        spin_unlock(&hb->lock);
        return -EAGAIN;  // Value changed - don't sleep
    }

    /*
     * STEP 6: Prepare to sleep - set task state
     */
    set_current_state(TASK_INTERRUPTIBLE);

    /*
     * STEP 7: Queue ourselves on the wait list
     * Still holding bucket lock - wake cannot miss us
     */
    futex_queue(&q, hb);  // Adds to hb->chain

    /*
     * STEP 8: Release bucket lock
     * From this point, FUTEX_WAKE can find and wake us
     */
    spin_unlock(&hb->lock);

    /*
     * STEP 9: Arm the timeout if one was specified
     */
    if (time)
        hrtimer_start(&timeout_timer.timer, *time, HRTIMER_MODE_ABS);

    /*
     * STEP 10: Actually go to sleep
     * schedule() doesn't return until we're woken
     * (the sleeper's .task becomes NULL once the timer has fired)
     */
    if (likely(!plist_node_empty(&q.list))) {
        if (!time || timeout_timer.task)
            schedule();  // <<< SLEEP HERE >>>
    }

    /*
     * STEP 11: We've been woken or timed out
     */
    __set_current_state(TASK_RUNNING);

    /*
     * STEP 12: Cleanup - if we're still on the wait queue, FUTEX_WAKE
     * never found us; remove ourselves and report a timeout.
     */
    ret = 0;
    if (!plist_node_empty(&q.list)) {
        spin_lock(&hb->lock);
        if (!plist_node_empty(&q.list))
            futex_unqueue(&q);
        spin_unlock(&hb->lock);
        ret = -ETIMEDOUT;  // We timed out, not woken
    }
    return ret;
}
```

The Critical Atomicity:
The key to correctness is that steps 4-7 (read value, compare, queue) happen while holding the bucket spinlock. This prevents the lost wakeup race:
```
Without atomicity:           With atomicity:

WAIT:                        WAIT:
1. Read value = LOCKED       1. Lock bucket
                             2. Read value = LOCKED
WAKE:                        3. Queue self
2. Set value = UNLOCKED      4. Unlock bucket
3. Check queue (empty!)
4. No one to wake            WAKE:
                             5. Lock bucket
WAIT (continued):            6. Set value = UNLOCKED
5. Queue self                7. Find waiter, wake
6. Sleep forever...          8. Unlock bucket

                             WAIT wakes correctly!
```
Notice we set TASK_INTERRUPTIBLE before queueing, while still holding the lock. This is a crucial Linux kernel pattern: we signal intent to sleep before it's possible for wake to find us. If wake occurs between queue and schedule(), it will clear our sleep state, and schedule() will return immediately instead of sleeping. This is how Linux prevents lost wakeups at the scheduler level.
The wake path is simpler than wait—it searches the wait queue and wakes threads. But the details matter for understanding performance.
```c
/*
 * FUTEX_WAKE kernel implementation (simplified)
 */
static int futex_wake(u32 __user *uaddr, int nr_wake, u32 bitset)
{
    struct futex_hash_bucket *hb;
    struct futex_q *this, *next;
    union futex_key key;
    int ret;
    DEFINE_WAKE_Q(wake_q);  // List of threads to wake

    /*
     * STEP 1: Compute the futex key
     */
    ret = get_futex_key(uaddr, FLAGS_SHARED, &key);
    if (ret)
        return ret;

    /*
     * STEP 2: Find and lock the hash bucket
     */
    hb = hash_futex(&key);
    spin_lock(&hb->lock);

    /*
     * STEP 3: Scan the wait queue for matching waiters
     */
    plist_for_each_entry_safe(this, next, &hb->chain, list) {
        /*
         * Check if this waiter matches:
         * 1. Same futex key
         * 2. Bitset matches (if using WAKE_BITSET)
         */
        if (match_futex(&this->key, &key)) {
            if (!(this->bitset & bitset))
                continue;  // Bitset doesn't match

            /*
             * STEP 4: Mark this waiter for waking
             * Don't wake while holding spinlock - add to batch
             */
            mark_wake_futex(&wake_q, this);

            if (++ret >= nr_wake)
                break;  // Woken enough
        }
    }

    spin_unlock(&hb->lock);

    /*
     * STEP 5: Actually wake the threads
     * Done outside the spinlock for better latency
     */
    wake_up_q(&wake_q);

    return ret;  // Return count of woken threads
}

/*
 * mark_wake_futex: Prepare a waiter for waking
 */
static void mark_wake_futex(struct wake_q_head *wake_q, struct futex_q *q)
{
    struct task_struct *p = q->task;

    /*
     * Remove from the wait queue
     */
    plist_del(&q->list, &q->list.plist);

    /*
     * Clear the waiter's lock pointer so it knows it was woken
     * (not timed out or spuriously woken)
     */
    q->lock_ptr = NULL;

    /*
     * Add to wake queue for batch waking
     */
    wake_q_add(wake_q, p);
}

/*
 * wake_up_q: Batch wake all collected threads
 * Called WITHOUT holding any spinlocks
 */
static void wake_up_q(struct wake_q_head *head)
{
    struct task_struct *task;

    while ((task = wake_q_next(head)) != NULL) {
        /*
         * Set task state to TASK_RUNNING.
         * If the task is currently in schedule(), it will return;
         * otherwise, it will be ready to run when the scheduler picks it.
         */
        wake_up_process(task);

        /*
         * Drop reference acquired by wake_q_add
         */
        put_task_struct(task);
    }
}
```

Batch Waking Optimization:
Notice the two-phase approach: collect waiters while holding the lock, then wake them after releasing. This is critical for performance:
The bucket spinlock is held only for the queue scan—wake_up_process() happens without holding the spinlock, so concurrent wakers and waiters are not serialized behind scheduler work.

Wake Queue Priority:
The plist_for_each iteration respects priority order. Higher priority threads (lower niceness value) appear first in the list and are woken first. This naturally provides priority-aware synchronization without explicit priority logic in the wake code.
When waking, the kernel scans all waiters in the hash bucket, not just those for our futex (they're intermixed due to hash collisions). This is why extreme futex usage (thousands of contended futexes) can show bucket contention in profiling. The solution is larger hash tables or futex locality (related futexes on different pages).
Futex is deeply integrated with the Linux scheduler. When a thread sleeps on a futex, it's not just blocked—it's precisely integrated with the kernel's task scheduling machinery.
```c
/*
 * Scheduler integration points for futex
 */

/*
 * Task states relevant to futex:
 *
 * TASK_RUNNING         - Thread can be scheduled to run
 * TASK_INTERRUPTIBLE   - Sleeping, can be woken by signals or futex_wake
 * TASK_UNINTERRUPTIBLE - Sleeping, only woken by specific events
 */

/*
 * When entering futex_wait:
 */
void futex_wait_prepare(struct futex_q *q)
{
    /*
     * Mark ourselves as wanting to sleep.
     * This MUST happen before we're visible in the wait queue.
     */
    set_current_state(TASK_INTERRUPTIBLE);

    /*
     * Memory barrier ensures the state change is visible
     * before we check the futex value
     */
    smp_mb();
}

/*
 * The actual schedule() call:
 */
void futex_wait_sleep(void)
{
    /*
     * schedule() checks current->state:
     *
     * If TASK_INTERRUPTIBLE and no pending signal/wake:
     *   - Remove from run queue
     *   - Pick next task to run
     *   - Context switch away
     *   - <we're now sleeping>
     *
     * If TASK_RUNNING (someone called wake_up_process):
     *   - Return immediately, don't sleep
     *   - This prevents lost wakeups
     */
    schedule();
}

/*
 * When woken by futex_wake or timeout:
 */
void futex_wait_cleanup(struct futex_q *q)
{
    /*
     * Restore to running state
     */
    __set_current_state(TASK_RUNNING);

    /*
     * Check why we woke up:
     * 1. mark_wake_futex set q->lock_ptr = NULL -> woken by FUTEX_WAKE
     * 2. timer fired                            -> timeout
     * 3. signal pending                         -> interrupted
     * 4. q still in list                        -> spurious wakeup
     */
}

/*
 * wake_up_process: The actual waking mechanism (simplified)
 */
int wake_up_process(struct task_struct *p)
{
    unsigned long flags;

    raw_spin_lock_irqsave(&p->pi_lock, flags);

    /*
     * If the task was sleeping (TASK_INTERRUPTIBLE/UNINTERRUPTIBLE):
     * 1. Change state to TASK_RUNNING (atomic, with memory barriers)
     * 2. Add to the scheduler run queue
     * 3. If on the same CPU, possibly preempt the current task
     */
    if (p->state != TASK_RUNNING) {
        p->state = TASK_RUNNING;
        enqueue_task(rq, p, ENQUEUE_WAKEUP);
        check_preempt_curr(rq, p);  // Might preempt current
    }

    raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    return 1;
}
```

The State Machine:
A thread waiting on a futex goes through these states:
| State | Run Queue | Description |
|---|---|---|
| TASK_RUNNING | On queue | Executing or ready to execute |
| TASK_INTERRUPTIBLE | Off queue | Sleeping, wake on futex/signal |
| TASK_RUNNING | On queue | Woken, waiting for CPU time |
| TASK_RUNNING | Current | Actively running after wakeup |
Futex uses TASK_INTERRUPTIBLE (not TASK_UNINTERRUPTIBLE) so threads can be woken by signals. This allows pthread_cancel, timeouts via signals, and debugger attachment to sleeping threads. TASK_UNINTERRUPTIBLE would make the sleep immune to signals—appropriate for short disk I/O waits, but wrong for synchronization waits of unbounded duration.
Futex correctness depends critically on memory ordering. Modern CPUs and compilers reorder memory operations for performance. Without proper barriers, futex operations could appear to happen in wrong order, causing lost wakeups or other bugs.
```c
/*
 * Memory ordering in futex operations
 *
 * There are THREE levels of memory ordering to consider:
 * 1. Compiler reordering (prevented by volatile/atomic)
 * 2. CPU store buffer reordering (prevented by barriers)
 * 3. Cache coherence delays (handled by cache protocols)
 */

/*
 * FUTEX_WAIT ordering requirements:
 *
 * We need to ensure that:
 * 1. Reading the futex value happens AFTER entering the wait queue
 *    (conceptually - in terms of visibility to WAKE)
 * 2. Our TASK_INTERRUPTIBLE state is visible before we could miss a wake
 */
void futex_wait_ordering(void)
{
    /*
     * This sequence must be seen in order:
     */

    // Step 1: Set state to sleeping
    set_current_state(TASK_INTERRUPTIBLE);
    /*
     * Implicit barrier in set_current_state:
     * - Compiler barrier prevents reordering
     * - Memory barrier ensures state is visible to other CPUs
     */

    // Step 2: Lock bucket (spinlock has acquire semantics)
    spin_lock(&bucket->lock);

    // Step 3: Read futex value from userspace
    // Protected by spinlock - ordered after state set
    get_user(val, uaddr);

    // Step 4: Add to wait queue
    list_add(&waiter, &queue);

    // Step 5: Release spinlock (release semantics)
    spin_unlock(&bucket->lock);
    /*
     * Spinlock release ensures our wait queue addition
     * is visible before we actually sleep
     */

    // Step 6: Go to sleep
    schedule();
}

/*
 * FUTEX_WAKE ordering requirements:
 *
 * The waker typically does:
 * 1. Write new value to futex (e.g., set UNLOCKED)
 * 2. Call FUTEX_WAKE
 *
 * We need waiters to see the new value if they re-check after waking
 */
void futex_wake_ordering(void)
{
    /*
     * Userspace writes to the futex word should use an atomic store
     * with release semantics (memory_order_release):
     *
     *   atomic_store(&futex, UNLOCKED, memory_order_release);
     *   futex_wake(&futex, 1);
     */

    // Step 1: Lock bucket (acquire semantics)
    spin_lock(&bucket->lock);
    /*
     * This acquires all previous stores to the futex word
     * from our perspective (even though we don't read it)
     */

    // Step 2: Remove waiters from queue
    list_del(&waiter);

    // Step 3: Release bucket lock
    spin_unlock(&bucket->lock);
    /*
     * Release semantics ensures queue removal is complete
     */

    // Step 4: Wake the thread
    wake_up_process(waiter->task);
    /*
     * wake_up_process has internal barriers to ensure
     * the state change is immediately visible to the woken task
     */
}

/*
 * The full memory ordering timeline for a lock:
 *
 * Unlocker (CPU 1):             Waiter (CPU 2):
 * -----------------             ---------------
 * <execute critical section>
 *                               load futex_word
 *                               compare (sees LOCKED)
 *                               set TASK_INTERRUPTIBLE
 *                               acquire bucket lock
 *                               add to queue
 *                               release bucket lock
 *                               schedule() [sleeps]
 * release fence
 * store UNLOCKED
 * acquire bucket lock
 * find waiter
 * remove from queue
 * release bucket lock
 * wake_up_process
 *                               [wakes in schedule()]
 *                               load futex_word
 *                               [sees UNLOCKED due to release-acquire]
 *                               acquire lock
 */
```

The kernel's barriers only ensure correctness within futex operations. Userspace code must also use proper atomic operations with correct memory ordering. A plain (non-atomic) store followed by FUTEX_WAKE can race due to store buffer delays. Always use an atomic store with memory_order_release before waking.
Let's consolidate all the kernel data structures involved in futex into a comprehensive view.
| Structure | Purpose | Lifetime | Per- |
|---|---|---|---|
| futex_hash_bucket[] | Hash table of wait queues | Kernel lifetime | NUMA node |
| futex_hash_bucket.lock | Protects one bucket's queue | Held briefly | Bucket |
| futex_hash_bucket.chain | Priority-sorted waiter list | Contains waiters | Bucket |
| futex_q | A single waiting thread | Duration of wait | Waiter |
| futex_key (private) | Identifies private futex | Duration of operation | Operation |
| futex_key (shared) | Identifies shared futex | Duration of operation | Operation |
| wake_q_head | Batch of threads to wake | During wake operation | Wake call |
| task_struct.state | Running/sleeping status | Thread lifetime | Thread |
```c
/*
 * Memory layout visualization
 */

/*
 * KERNEL MEMORY (process-independent):
 *
 * ┌─────────────────────────────────────────────────────────┐
 * │ futex_queues[0..255] (per-NUMA-node hash table)         │
 * │ ├── bucket[0]                                           │
 * │ │   ├── lock (spinlock)                                 │
 * │ │   └── chain (plist_head)                              │
 * │ │       ├── futex_q (thread A, futex X)                 │
 * │ │       ├── futex_q (thread B, futex X)                 │
 * │ │       └── futex_q (thread C, futex Y) ← collision     │
 * │ ├── bucket[1]                                           │
 * │ │   └── chain                                           │
 * │ │       └── futex_q (thread D, futex Z)                 │
 * │ └── ... (256 buckets total)                             │
 * └─────────────────────────────────────────────────────────┘
 *
 * USER MEMORY (per-process):
 *
 * ┌─────────────────────────────────────────────────────────┐
 * │ Process A address space                                 │
 * │ ├── 0x7f..100: futex X (value: CONTENDED)               │
 * │ ├── 0x7f..200: futex Y (value: LOCKED)                  │
 * │ └── ...                                                 │
 * └─────────────────────────────────────────────────────────┘
 *
 * ┌─────────────────────────────────────────────────────────┐
 * │ Shared Memory Region (mmap, shm_open)                   │
 * │ └── page 0x1234:                                        │
 * │     └── futex Z (value: UNLOCKED)                       │
 * │         ↑                                               │
 * │         └── Accessible from multiple processes          │
 * │             with different virtual addresses            │
 * └─────────────────────────────────────────────────────────┘
 */
```

The futex subsystem is very memory-efficient. The kernel only allocates memory for threads actively waiting—the hash table structure is fixed-size and small. A million futexes with no contention consume zero kernel memory for futex purposes.
We've taken a deep dive into the kernel internals of futex: the hash table and key abstraction, the wait and wake paths, scheduler integration, and the memory ordering that holds it all together.
What's Next:
Now that we understand both the userspace interface and kernel implementation, the next page examines the performance benefits of futex in detail. We'll quantify the speedup, analyze profiling data, and understand when futex delivers the most value.
You now understand what happens inside the Linux kernel when futex operations are invoked. This knowledge enables you to debug synchronization at the deepest level, understand performance characteristics, and make informed decisions about synchronization design.