We've established that futex is faster than traditional kernel synchronization. But how much faster? Under what conditions? And what are the practical implications for real systems?
This page moves from theory to measurement. We'll examine concrete benchmarks, analyze the cost breakdown of futex operations, compare against alternatives, and develop intuition for when futex provides the greatest advantage. By the end, you'll be able to predict and measure futex performance in your own systems.
Specifically, you will learn the cycle-level costs of futex operations, how contention affects performance, when futex provides maximal speedup, and how to profile and optimize futex-based synchronization in production systems.
To understand futex performance, we must first understand the cost components of any locking operation. Every lock acquire/release has multiple cost contributors:
| Cost Component | Uncontended Futex | Contended Futex | System V Semaphore |
|---|---|---|---|
| Atomic instruction (CAS) | 10-50 cycles | 10-50 cycles | 10-50 cycles |
| L1 cache hit | 4 cycles | 4 cycles | 4 cycles |
| L1 cache miss (cross-core) | 40-80 cycles | 40-80 cycles | 40-80 cycles |
| SYSCALL/SYSRET | 0 cycles (avoided) | ~100-200 cycles | ~100-200 cycles |
| Kernel entry/exit | 0 cycles (avoided) | ~100-300 cycles | ~100-300 cycles |
| Spectre mitigations | 0 cycles (avoided) | ~100-400 cycles | ~100-400 cycles |
| Kernel lock logic | 0 cycles (avoided) | ~50-100 cycles | ~100-500 cycles |
| Scheduler interaction | 0 cycles (avoided) | ~200-1000 cycles | ~200-1000 cycles |
| Total (typical) | 15-80 cycles | 600-2000+ cycles | 700-2500+ cycles |
Key insight: The uncontended futex fast path eliminates all kernel overhead—system calls, kernel entry/exit, and Spectre mitigations. This isn't a small optimization; it's removing 95% of the cost.
The Contention Reality:
Contention fundamentally changes the equation. When a thread must sleep, futex incurs nearly the same kernel overhead as traditional mechanisms—you can't avoid the scheduler when you need to sleep. The futex advantage in the contended case comes from leaner kernel-side lock logic, wake operations that skip the syscall entirely when no one is waiting, and the ability to return to the pure userspace path as soon as contention subsides.
For workloads where 99% of lock operations are uncontended, futex is typically 30-50x faster than kernel-only primitives. For 90% uncontended, expect 10-20x. For 50% contention, expect 1.5-2x. At 100% contention, the difference shrinks to nearly zero—both approaches must syscall every time.
Let's examine concrete benchmark results comparing futex-based mutexes against alternatives. These benchmarks use standard patterns and modern hardware.
```c
/*
 * Benchmark methodology for lock performance comparison
 *
 * Test:     Lock-increment-unlock cycle
 * Hardware: AMD EPYC 7742 (64 cores), Intel Xeon Gold 6248 (20 cores)
 * Kernel:   Linux 5.15 with default mitigations
 * Compiler: GCC 11 with -O2
 */
#include <pthread.h>
#include <semaphore.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>

#define ITERATIONS 10000000
#define THREADS 8

static volatile long shared_counter;  /* the data protected by each lock */

/* Test 1: Futex-based pthread_mutex (NPTL) */
pthread_mutex_t futex_mutex = PTHREAD_MUTEX_INITIALIZER;

void* test_futex_mutex(void* arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&futex_mutex);
        shared_counter++;
        pthread_mutex_unlock(&futex_mutex);
    }
    return NULL;
}

/* Test 2: System V semaphore */
int sysv_sem_id;

void* test_sysv_semaphore(void* arg) {
    struct sembuf op_wait   = {0, -1, 0};
    struct sembuf op_signal = {0,  1, 0};
    for (int i = 0; i < ITERATIONS; i++) {
        semop(sysv_sem_id, &op_wait, 1);    // syscall every time
        shared_counter++;
        semop(sysv_sem_id, &op_signal, 1);  // syscall every time
    }
    return NULL;
}

/* Test 3: Pure spinlock (no sleep) */
volatile int spin_lock = 0;

void* test_spinlock(void* arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        while (__sync_lock_test_and_set(&spin_lock, 1)) {
            while (spin_lock)
                __builtin_ia32_pause();
        }
        shared_counter++;
        __sync_lock_release(&spin_lock);
    }
    return NULL;
}

/* Measurement function */
double measure_lock_throughput(void* (*test_fn)(void*), int thread_count) {
    pthread_t threads[thread_count];
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < thread_count; i++)
        pthread_create(&threads[i], NULL, test_fn, NULL);
    for (int i = 0; i < thread_count; i++)
        pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;
    return (thread_count * ITERATIONS) / elapsed;  // ops/sec
}
```

Results (millions of lock-increment-unlock operations per second):

| Lock Type | 1 Thread | 4 Threads | 8 Threads | 16 Threads |
|---|---|---|---|---|
| pthread_mutex (futex) | 85.2M | 18.3M | 9.1M | 4.8M |
| System V semaphore | 1.8M | 0.9M | 0.5M | 0.3M |
| Pure spinlock | 142.0M | 23.5M | 8.2M | 2.1M |
| Futex vs SysV | 47x | 20x | 18x | 16x |
Analysis of Results:
Single-threaded (1 thread): Maximum futex advantage. Zero contention means 100% fast path. 47x faster than syscall-based primitives.
Light contention (4 threads): Some contention, but still mostly fast path. Futex maintains ~20x advantage.
Heavy contention (8-16 threads): Contention dominates. All threads compete for one lock. Futex advantage shrinks to 16-18x because the slow path is invoked more often.
Pure spinlock note: Faster than futex at low thread counts (no kernel ever), but degrades badly at high contention. Spinlocks also waste CPU when waiting.
These benchmarks use a single hot lock—worst case for contention. Real applications typically have many locks with lower individual contention. In such scenarios, futex advantage is even greater because most operations hit the fast path.
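One reason real applications fare even better is that shared state can usually be partitioned. A common pattern is lock striping: hash each key to one of many locks so unrelated keys rarely meet at the same futex word. A minimal sketch (the stripe count, hash constant, and function names here are illustrative, not from any particular library):

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: 64 striped locks instead of one global lock.
 * Two random keys collide on the same stripe with probability ~1/64,
 * so most acquisitions stay on the uncontended fast path. */
#define NUM_STRIPES 64

static pthread_mutex_t stripes[NUM_STRIPES];
static uint64_t counters[NUM_STRIPES];

void striped_init(void) {
    for (size_t i = 0; i < NUM_STRIPES; i++)
        pthread_mutex_init(&stripes[i], NULL);
}

static inline size_t stripe_for(uint64_t key) {
    /* Fibonacci hashing: multiply by a 64-bit constant, keep top 6 bits. */
    return (size_t)((key * 0x9E3779B97F4A7C15ULL) >> 58);
}

void striped_increment(uint64_t key) {
    size_t s = stripe_for(key);
    pthread_mutex_lock(&stripes[s]);
    counters[s]++;                     /* per-stripe shared state */
    pthread_mutex_unlock(&stripes[s]);
}
```

With one global lock, contention probability grows with thread count; with 64 stripes it is divided by roughly the number of stripes, pushing most operations back onto the ~20-cycle fast path.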
The futex fast path—uncontended lock acquisition—is where the magic happens. Let's examine exactly what executes and why it's so fast.
```c
/*
 * Futex fast path: lock acquisition
 *
 * Source: glibc/nptl/lowlevellock.h (simplified)
 */

/* The lock acquire fast path - typically inlined */
static inline void lll_lock(int *futex) {
    int expected = 0;  // UNLOCKED

    /*
     * Attempt atomic compare-and-swap:
     *   If *futex == 0, set *futex = 1, return true (acquired)
     *   If *futex != 0, return false (contended)
     */
    if (__atomic_compare_exchange_n(futex, &expected, 1,
                                    false,  // strong CAS
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
        return;  // Fast path: acquired! No syscall.
    }

    // Slow path: lock is held, must wait
    lll_lock_wait(futex);
}

/*
 * x86-64 assembly generated for fast path:
 *
 *   xor  %eax, %eax            ; 1 cycle - expected 'unlocked' value (0)
 *   mov  $1, %edx              ; 1 cycle - 'locked' value to store
 *   lock cmpxchg %edx, (%rdi)  ; 15-50 cycles - atomic CAS
 *   jne  .Lslow_path           ; 1 cycle - branch on failure
 *   ret                        ; 1 cycle - return
 *
 * Total: ~20 cycles for uncontended lock
 *
 * The 'lock' prefix adds ~10-40 cycles for cache coherence,
 * but this is unavoidable for any atomic operation.
 */

/* The lock release fast path */
static inline void lll_unlock(int *futex) {
    int prev = __atomic_exchange_n(futex, 0, __ATOMIC_RELEASE);
    if (prev == 1) {
        return;  // Fast path: no waiters, nothing to do
    }
    // Slow path: there were waiters (prev == 2)
    lll_wake(futex);
}

/*
 * x86-64 assembly for unlock fast path:
 *
 *   xor  %eax, %eax    ; 1 cycle - load 'unlocked' (0)
 *   xchg %eax, (%rdi)  ; 15-50 cycles - atomic exchange
 *   cmp  $1, %eax      ; 1 cycle - was it LOCKED (no waiters)?
 *   jne  .Lslow_path   ; 1 cycle - branch if CONTENDED
 *   ret                ; 1 cycle
 *
 * Total: ~20 cycles for uncontended unlock
 */
```

Why So Fast?
The fast path achieves its speed through multiple factors: no kernel entry, a single atomic instruction, a short inlined code path with a well-predicted branch, and lock state that lives in ordinary cacheable userspace memory.
When the futex word is on the same cache line as the protected data, acquiring the lock automatically pulls the data into L1 cache. This means the first read inside the critical section is essentially free—a hidden bonus of userspace lock state.
Contention is the enemy of lock performance. As contention increases, the futex fast path is bypassed more frequently, eroding the performance advantage. Let's analyze this quantitatively.
```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Contention modeling for futex performance
 *
 * Let:
 *   P       = probability of contention (0.0 to 1.0)
 *   T_fast  = fast path time (~20 cycles)
 *   T_slow  = slow path time (~1000 cycles)
 *   T_total = expected time per operation
 *
 * Simple model:
 *   T_total = (1-P) * T_fast + P * T_slow
 */

double expected_lock_time(double contention_prob,
                          double fast_cycles,
                          double slow_cycles) {
    return (1.0 - contention_prob) * fast_cycles +
           contention_prob * slow_cycles;
}

/*
 * Example calculations:
 *
 * P=0.01 (1% contention, typical for well-designed systems):
 *   T_total = 0.99 * 20 + 0.01 * 1000 = 19.8 + 10 = 29.8 cycles
 *   Only 50% overhead vs pure fast path!
 *
 * P=0.10 (10% contention):
 *   T_total = 0.90 * 20 + 0.10 * 1000 = 18 + 100 = 118 cycles
 *   Still 8x faster than pure slow path
 *
 * P=0.50 (50% contention, serious issue):
 *   T_total = 0.50 * 20 + 0.50 * 1000 = 10 + 500 = 510 cycles
 *   Only 2x faster than pure slow path
 *
 * P=1.00 (100% contention, degenerate case):
 *   T_total = 0.00 * 20 + 1.00 * 1000 = 1000 cycles
 *   No benefit from fast path
 */

/*
 * Measuring contention in practice:
 *
 * The kernel tracks futex statistics in some debug builds.
 * For production, you can instrument your code:
 */

typedef struct {
    _Atomic uint64_t fast_path_count;
    _Atomic uint64_t slow_path_count;
} lock_stats_t;

static lock_stats_t stats;

void instrumented_lock(int *futex) {
    int expected = 0;
    if (__atomic_compare_exchange_n(futex, &expected, 1, false,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
        atomic_fetch_add(&stats.fast_path_count, 1);
        return;
    }
    atomic_fetch_add(&stats.slow_path_count, 1);
    // ... slow path ...
}

double get_contention_ratio(void) {
    uint64_t fast = atomic_load(&stats.fast_path_count);
    uint64_t slow = atomic_load(&stats.slow_path_count);
    return (double)slow / (fast + slow);
}
```

| Contention | Expected Cycles | Speedup vs 1000cy Baseline | Assessment |
|---|---|---|---|
| 0% (ideal) | 20 | 50x | Optimal - all fast path |
| 1% (typical) | 30 | 33x | Excellent - minimal impact |
| 5% | 69 | 14x | Very good |
| 10% | 118 | 8.5x | Good |
| 25% | 265 | 3.8x | Moderate contention |
| 50% | 510 | 2x | High contention - redesign needed |
| 100% | 1000 | 1x | Degenerate - no benefit |
If your contention ratio exceeds 10%, you should investigate. Possible solutions: finer-grained locking, lock-free algorithms, reducing critical section length, or eliminating the shared state entirely. Don't just accept high contention as 'normal'.
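As a concrete instance of "eliminating the shared state": a counter guarded by a mutex can often become a single lock-free atomic, removing the futex word entirely. A sketch using C11 atomics (the names here are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

/* BEFORE: lock; counter++; unlock — ~20-cycle fast path, plus the
 * ~1000-cycle slow path whenever the lock is contended.
 * AFTER:  one atomic read-modify-write — no lock word, no slow path,
 * no futex syscall at any contention level. */
static _Atomic uint64_t event_count;

void count_event(void) {
    /* Relaxed ordering suffices for a pure statistics counter; use
     * release/acquire only if other data is published alongside it. */
    atomic_fetch_add_explicit(&event_count, 1, memory_order_relaxed);
}

uint64_t read_event_count(void) {
    return atomic_load_explicit(&event_count, memory_order_relaxed);
}
```

Under heavy contention the atomic still pays for cache line transfers between cores, but it never sleeps, so the 1000-cycle slow path in the model disappears entirely.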
Futex performance is heavily influenced by the memory subsystem. Cache behavior, NUMA topology, and cache line bouncing all affect real-world performance.
```c
#define _GNU_SOURCE
#include <stdint.h>

/*
 * Cache and memory effects on futex performance
 */

/*
 * SCENARIO 1: Lock and data on same cache line
 *
 *   struct {
 *       uint32_t futex;      // offset 0
 *       uint32_t counter;    // offset 4
 *       char padding[56];    // fill cache line
 *   } __attribute__((aligned(64)));
 *
 * When we acquire the lock (CAS on futex), we also pull 'counter'
 * into L1 cache. The subsequent read/write of 'counter' is free!
 *
 * Cost: lock + work = 20cy + 0cy = 20 cycles
 */

/*
 * SCENARIO 2: Lock and data on different cache lines
 *
 *   struct {
 *       uint32_t futex;
 *       char padding[60];    // push to next cache line
 *       uint32_t counter;    // offset 64
 *   };
 *
 * Acquiring the lock loads the futex cache line.
 * Reading counter requires a separate cache line fetch.
 *
 * Cost: lock + cache miss = 20cy + 40cy = 60 cycles
 */

/*
 * SCENARIO 3: Cross-core cache line transfer
 *
 * When CPU A releases a lock and CPU B acquires it, the cache line
 * containing the futex must transfer from A's cache to B's cache.
 *
 * On modern Intel/AMD:
 *   - Same L3 complex: ~40-80 cycles
 *   - Different L3 (multi-socket): ~150-300 cycles
 *   - Different NUMA node: ~200-400 cycles
 *
 * This is the "cache line bouncing" problem.
 */

/*
 * SCENARIO 4: False sharing
 *
 * Two unrelated futexes on the same cache line:
 *
 *   struct {
 *       uint32_t futex_A;    // Thread A's lock
 *       uint32_t futex_B;    // Thread B's lock
 *   };
 *
 * Even though A and B are unrelated, every lock operation by A
 * invalidates B's cache copy, and vice versa.
 *
 * Solution: Pad each futex to its own cache line.
 */

typedef struct {
    uint32_t futex;
    char padding[60];  // Fill out the 64-byte line
} __attribute__((aligned(64))) padded_futex_t;

/*
 * NUMA effects:
 *
 * On NUMA systems, memory access time depends on which node
 * owns the memory. Futexes should be allocated on the same
 * NUMA node as the threads that use them most frequently.
 */

#include <numa.h>  /* libnuma; link with -lnuma */

void allocate_futex_numa_aware(int preferred_node) {
    padded_futex_t *f = numa_alloc_onnode(sizeof(*f), preferred_node);
    f->futex = 0;  // Initialize
}
```

| Scenario | Additional Latency | Mitigation |
|---|---|---|
| L1 hit (same core) | ~4 cycles | Ideal case, no action needed |
| L2 hit (same core) | ~12 cycles | Working set slightly large |
| L3 hit (same socket) | ~40 cycles | Normal for multi-threaded |
| Cross-socket transfer | ~150-300 cycles | Minimize cross-socket locking |
| False sharing | ~80-200 cycles wasted | Pad to cache line boundaries |
| Remote NUMA access | ~200-400 cycles | NUMA-aware allocation |
If a lock is touched by multiple cores, pad it to 64 bytes (cache line size on x86). The wasted memory is trivial compared to the performance cost of false sharing. Some systems use 128-byte alignment to avoid hardware prefetcher issues.
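A compile-time guard keeps the padding honest if the structure later grows. This sketch assumes a 64-byte cache line and uses standard C11 `alignas` rather than the GCC-specific attribute:

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumption: cache line size on x86 */

/* alignas on the first member forces both the struct's alignment and
 * (via compiler-inserted tail padding) its size up to CACHE_LINE, so
 * consecutive array elements never share a cache line. */
typedef struct {
    alignas(CACHE_LINE) uint32_t futex_word;
} padded_lock_t;

/* Fails to compile if a future field pushes the struct past one line. */
_Static_assert(sizeof(padded_lock_t) == CACHE_LINE,
               "padded_lock_t must occupy exactly one cache line");
```

An array of `padded_lock_t` then gives one lock per cache line with no manual padding arithmetic to keep in sync.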
Linux provides several tools for profiling futex behavior. Understanding how to use these tools is essential for diagnosing synchronization performance issues.
```bash
#!/bin/bash
# Futex profiling techniques

# ============================================
# TECHNIQUE 1: perf trace for syscall analysis
# ============================================

# Count futex syscalls by operation type
perf trace -s -e futex ./my_application 2>&1 | grep futex

# Example output:
#   futex(0x7f1234, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN
#   futex(0x7f1234, FUTEX_WAKE_PRIVATE, 1) = 1
#   ...
# Summary:
#   futex: 15234 events, 23.4 ms (wait: 12000, wake: 3234)

# ============================================
# TECHNIQUE 2: bpftrace for detailed analysis
# ============================================

# Histogram of futex wait times (requires root)
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_futex /args->op == 0/ {
    @start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_futex /@start[tid]/ {
    @wait_ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}
END { print(@wait_ns); }' -c "./my_application"

# Output: histogram showing futex wait time distribution

# ============================================
# TECHNIQUE 3: perf for contention analysis
# ============================================

# Record contention events
perf record -e 'syscalls:sys_enter_futex' \
            -e 'syscalls:sys_exit_futex' \
            --call-graph dwarf \
            ./my_application

# Analyze with perf report
perf report --stdio

# ============================================
# TECHNIQUE 4: lock profiling with perf
# ============================================

# Linux lock profiling (if LOCKDEP enabled)
perf lock record ./my_application
perf lock report

# Shows: contended locks, wait times, acquisition count

# ============================================
# TECHNIQUE 5: /proc/sys analysis
# ============================================

# View kernel futex hash table info (if exposed)
cat /proc/sys/kernel/futex_hash_size   # May not exist

# View scheduler latency (affects wake latency)
cat /proc/sys/kernel/sched_latency_ns

# View context switch overhead
time taskset -c 0 ./context_switch_benchmark
```

Key Metrics to Monitor:
| Metric | Good Value | Bad Value | How to Measure |
|---|---|---|---|
| Fast path ratio | >95% | <80% | Instrument lock code |
| Futex syscalls/sec | Application-specific | >1M/sec may indicate issue | perf trace -s |
| Avg wait time | <1ms for most apps | >10ms | bpftrace histogram |
| Wake-to-run latency | <100μs | >1ms | Custom tracing |
| Hash bucket contention | Low | High (many collisions) | Kernel debug |
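The "Custom tracing" entry for wake-to-run latency can be as simple as two timestamps: the waker stamps a monotonic clock just before FUTEX_WAKE, and the woken thread reads the clock when FUTEX_WAIT returns. A Linux-only sketch (variable and function names are illustrative):

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static _Atomic int signal_word;       /* 0 = not signaled, 1 = signaled */
static _Atomic int64_t wake_sent_ns;  /* timestamp written by the waker */

static int64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Waker side: stamp the clock, publish the flag, wake one waiter. */
void stamp_and_wake(void) {
    atomic_store(&wake_sent_ns, now_ns());
    atomic_store(&signal_word, 1);
    syscall(SYS_futex, &signal_word, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}

/* Waiter side: sleep until signaled, then report wake-to-run latency. */
int64_t wait_and_measure_ns(void) {
    while (atomic_load(&signal_word) == 0)
        syscall(SYS_futex, &signal_word, FUTEX_WAIT_PRIVATE, 0, NULL, NULL, 0);
    return now_ns() - atomic_load(&wake_sent_ns);
}
```

In a real measurement the waiter runs on another thread, so the reported latency includes the FUTEX_WAKE call, the scheduler placing the waiter on a runqueue, and the context switch back into userspace.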
For production systems, consider adding lightweight futex instrumentation. Even a simple fast-path/slow-path counter can reveal contention issues before they become critical. The overhead of atomic counters is negligible compared to the insight they provide.
Based on our performance analysis, here are strategies for maximizing futex performance in real applications.
```c
/*
 * Futex optimization examples
 *
 * Requires: <pthread.h>, <linux/futex.h>, <sys/syscall.h>, <unistd.h>
 */

/* BEFORE: Long critical section */
void bad_update(data_t* shared, input_t* input) {
    pthread_mutex_lock(&mutex);
    result_t result = expensive_computation(input);  // Slow!
    shared->data = result;
    pthread_mutex_unlock(&mutex);
}

/* AFTER: Minimal critical section */
void good_update(data_t* shared, input_t* input) {
    result_t result = expensive_computation(input);  // Outside lock
    pthread_mutex_lock(&mutex);
    shared->data = result;  // Only assignment under lock
    pthread_mutex_unlock(&mutex);
}

/* Cache-line padding example */
struct __attribute__((aligned(64))) padded_mutex {
    pthread_mutex_t lock;
    char _pad[64 - sizeof(pthread_mutex_t)];
};

/* Array of locks with false-sharing prevention */
struct padded_mutex locks[NUM_BUCKETS];

/* Adaptive spin before futex wait (simplified) */
static inline void adaptive_lock(int *futex) {
    // Try fast path
    if (__atomic_compare_exchange_n(futex, &(int){0}, 1, false,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
        return;
    }

    // Brief spin before syscall
    for (int i = 0; i < 100; i++) {
        __builtin_ia32_pause();  // CPU hint: spinning
        if (__atomic_compare_exchange_n(futex, &(int){0}, 1, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
            return;  // Got it during spin!
        }
    }

    // Spin failed, now really sleep
    while (1) {
        int val = __atomic_exchange_n(futex, 2, __ATOMIC_ACQUIRE);
        if (val == 0)
            return;  // Got it
        syscall(SYS_futex, futex, FUTEX_WAIT_PRIVATE, 2, NULL, NULL, 0);
    }
}
```

Always profile before optimizing. Lock performance intuition is notoriously unreliable. What seems like an obvious optimization might make things worse due to cache effects, scheduler behavior, or contention patterns. Use perf, bpftrace, and custom instrumentation to guide decisions.
We've now analyzed futex performance from every angle: the cost breakdown, the contention model, cache and NUMA effects, profiling techniques, and optimization strategies.
What's Next:
The final page of this module examines the Linux implementation of futex in detail. We'll look at specific kernel code paths, the evolution of futex over Linux versions, and advanced features like priority inheritance futexes.
You now understand futex performance at a deep level—the cost breakdown, contention effects, cache impacts, profiling methodology, and optimization strategies. This knowledge enables you to design, implement, and tune high-performance synchronization in real systems.