We've established that futex is faster than traditional kernel synchronization. But how much faster? Under what conditions? And what are the practical implications for real systems?
This page moves from theory to measurement. We'll examine concrete benchmarks, analyze the cost breakdown of futex operations, compare against alternatives, and develop intuition for when futex provides the greatest advantage. By the end, you'll be able to predict and measure futex performance in your own systems.
Specifically, you will learn the cycle-level costs of futex operations, how contention affects performance, when futex provides maximal speedup, and how to profile and optimize futex-based synchronization in production systems.
To understand futex performance, we must first understand the cost components of any locking operation. Every lock acquire/release has multiple cost contributors:
| Cost Component | Uncontended Futex | Contended Futex | System V Semaphore |
|---|---|---|---|
| Atomic instruction (CAS) | 10-50 cycles | 10-50 cycles | 10-50 cycles |
| L1 cache hit | 4 cycles | 4 cycles | 4 cycles |
| L1 cache miss (cross-core) | 40-80 cycles | 40-80 cycles | 40-80 cycles |
| SYSCALL/SYSRET | 0 cycles (avoided) | ~100-200 cycles | ~100-200 cycles |
| Kernel entry/exit | 0 cycles (avoided) | ~100-300 cycles | ~100-300 cycles |
| Spectre mitigations | 0 cycles (avoided) | ~100-400 cycles | ~100-400 cycles |
| Kernel lock logic | 0 cycles (avoided) | ~50-100 cycles | ~100-500 cycles |
| Scheduler interaction | 0 cycles (avoided) | ~200-1000 cycles | ~200-1000 cycles |
| Total (typical) | 15-80 cycles | 600-2000+ cycles | 700-2500+ cycles |
Key insight: The uncontended futex fast path eliminates all kernel overhead—system calls, kernel entry/exit, and Spectre mitigations. This isn't a small optimization; it's removing 95% of the cost.
The Contention Reality:
Contention fundamentally changes the equation. When a thread must sleep, futex incurs nearly the same kernel overhead as traditional mechanisms—you can't avoid the scheduler when you need to sleep. The futex advantage in the contended case comes from leaner kernel-side lock logic, wake operations that skip the syscall entirely when no one is waiting, and the ability to return to the pure userspace path as soon as contention subsides.
For workloads where 99% of lock operations are uncontended, futex is typically 30-50x faster than kernel-only primitives. For 90% uncontended, expect 10-20x. For 50% contention, expect 1.5-2x. At 100% contention, the difference shrinks to nearly zero—both approaches must syscall every time.
Let's examine concrete benchmark results comparing futex-based mutexes against alternatives. These benchmarks use standard patterns and modern hardware.
```c
/*
 * Benchmark methodology for lock performance comparison
 *
 * Test:     Lock-increment-unlock cycle
 * Hardware: AMD EPYC 7742 (64 cores), Intel Xeon Gold 6248 (20 cores)
 * Kernel:   Linux 5.15 with default mitigations
 * Compiler: GCC 11 with -O2
 */
#include <pthread.h>
#include <semaphore.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <time.h>

#define ITERATIONS 10000000
#define THREADS 8

static volatile long shared_counter;  /* the data protected by each lock */

/* Test 1: Futex-based pthread_mutex (NPTL) */
pthread_mutex_t futex_mutex = PTHREAD_MUTEX_INITIALIZER;

void* test_futex_mutex(void* arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&futex_mutex);
        shared_counter++;
        pthread_mutex_unlock(&futex_mutex);
    }
    return NULL;
}

/* Test 2: System V semaphore */
int sysv_sem_id;

void* test_sysv_semaphore(void* arg) {
    struct sembuf op_wait   = {0, -1, 0};
    struct sembuf op_signal = {0,  1, 0};
    for (int i = 0; i < ITERATIONS; i++) {
        semop(sysv_sem_id, &op_wait, 1);    // syscall every time
        shared_counter++;
        semop(sysv_sem_id, &op_signal, 1);  // syscall every time
    }
    return NULL;
}

/* Test 3: Pure spinlock (no sleep) */
volatile int spin_lock = 0;

void* test_spinlock(void* arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        while (__sync_lock_test_and_set(&spin_lock, 1)) {
            while (spin_lock)
                __builtin_ia32_pause();
        }
        shared_counter++;
        __sync_lock_release(&spin_lock);
    }
    return NULL;
}

/* Measurement function */
double measure_lock_throughput(void* (*test_fn)(void*), int thread_count) {
    pthread_t threads[thread_count];
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < thread_count; i++)
        pthread_create(&threads[i], NULL, test_fn, NULL);
    for (int i = 0; i < thread_count; i++)
        pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;
    return (thread_count * ITERATIONS) / elapsed;  // ops/sec
}
```

Results (millions of lock-increment-unlock operations per second):

| Lock Type | 1 Thread | 4 Threads | 8 Threads | 16 Threads |
|---|---|---|---|---|
| pthread_mutex (futex) | 85.2M | 18.3M | 9.1M | 4.8M |
| System V semaphore | 1.8M | 0.9M | 0.5M | 0.3M |
| Pure spinlock | 142.0M | 23.5M | 8.2M | 2.1M |
| Futex vs SysV | 47x | 20x | 18x | 16x |
Analysis of Results:
Single-threaded (1 thread): Maximum futex advantage. Zero contention means 100% fast path. 47x faster than syscall-based primitives.
Light contention (4 threads): Some contention, but still mostly fast path. Futex maintains ~20x advantage.
Heavy contention (8-16 threads): Contention dominates. All threads compete for one lock. Futex advantage shrinks to 16-18x because the slow path is invoked more often.
Pure spinlock note: Faster than futex at low thread counts (no kernel ever), but degrades badly at high contention. Spinlocks also waste CPU when waiting.
These benchmarks use a single hot lock—worst case for contention. Real applications typically have many locks with lower individual contention. In such scenarios, futex advantage is even greater because most operations hit the fast path.
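One reason real applications fare even better is that shared state can usually be partitioned. A common pattern is lock striping: hash each key to one of many locks so unrelated keys rarely meet at the same futex word. A minimal sketch (the stripe count, hash constant, and function names here are illustrative, not from any particular library):

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: 64 striped locks instead of one global lock.
 * Two random keys collide on the same stripe with probability ~1/64,
 * so most acquisitions stay on the uncontended fast path. */
#define NUM_STRIPES 64

static pthread_mutex_t stripes[NUM_STRIPES];
static uint64_t counters[NUM_STRIPES];

void striped_init(void) {
    for (size_t i = 0; i < NUM_STRIPES; i++)
        pthread_mutex_init(&stripes[i], NULL);
}

static inline size_t stripe_for(uint64_t key) {
    /* Fibonacci hashing: multiply by a 64-bit constant, keep top 6 bits. */
    return (size_t)((key * 0x9E3779B97F4A7C15ULL) >> 58);
}

void striped_increment(uint64_t key) {
    size_t s = stripe_for(key);
    pthread_mutex_lock(&stripes[s]);
    counters[s]++;                     /* per-stripe shared state */
    pthread_mutex_unlock(&stripes[s]);
}
```

With one global lock, contention probability grows with thread count; with 64 stripes it is divided by roughly the number of stripes, pushing most operations back onto the ~20-cycle fast path.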
The futex fast path—uncontended lock acquisition—is where the magic happens. Let's examine exactly what executes and why it's so fast.
```c
/*
 * Futex fast path: lock acquisition
 *
 * Source: glibc/nptl/lowlevellock.h (simplified)
 */

/* The lock acquire fast path - typically inlined */
static inline void lll_lock(int *futex) {
    int expected = 0;  // UNLOCKED

    /*
     * Attempt atomic compare-and-swap:
     *   If *futex == 0, set *futex = 1, return true (acquired)
     *   If *futex != 0, return false (contended)
     */
    if (__atomic_compare_exchange_n(futex, &expected, 1,
                                    false,  // strong CAS
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
        return;  // Fast path: acquired! No syscall.
    }

    // Slow path: lock is held, must wait
    lll_lock_wait(futex);
}

/*
 * x86-64 assembly generated for fast path:
 *
 *   xor  %eax, %eax            ; 1 cycle - expected 'unlocked' value (0)
 *   mov  $1, %edx              ; 1 cycle - 'locked' value to store
 *   lock cmpxchg %edx, (%rdi)  ; 15-50 cycles - atomic CAS
 *   jne  .Lslow_path           ; 1 cycle - branch on failure
 *   ret                        ; 1 cycle - return
 *
 * Total: ~20 cycles for uncontended lock
 *
 * The 'lock' prefix adds ~10-40 cycles for cache coherence,
 * but this is unavoidable for any atomic operation.
 */

/* The lock release fast path */
static inline void lll_unlock(int *futex) {
    int prev = __atomic_exchange_n(futex, 0, __ATOMIC_RELEASE);
    if (prev == 1) {
        return;  // Fast path: no waiters, nothing to do
    }
    // Slow path: there were waiters (prev == 2)
    lll_wake(futex);
}

/*
 * x86-64 assembly for unlock fast path:
 *
 *   xor  %eax, %eax    ; 1 cycle - load 'unlocked' (0)
 *   xchg %eax, (%rdi)  ; 15-50 cycles - atomic exchange
 *   cmp  $1, %eax      ; 1 cycle - was it LOCKED (no waiters)?
 *   jne  .Lslow_path   ; 1 cycle - branch if CONTENDED
 *   ret                ; 1 cycle
 *
 * Total: ~20 cycles for uncontended unlock
 */
```

Why So Fast?
The fast path achieves its speed through multiple factors: no kernel entry, a single atomic instruction, a short inlined code path with a well-predicted branch, and lock state that lives in ordinary cacheable userspace memory.
When the futex word is on the same cache line as the protected data, acquiring the lock automatically pulls the data into L1 cache. This means the first read inside the critical section is essentially free—a hidden bonus of userspace lock state.
Contention is the enemy of lock performance. As contention increases, the futex fast path is bypassed more frequently, eroding the performance advantage. Let's analyze this quantitatively.
```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Contention modeling for futex performance
 *
 * Let:
 *   P       = probability of contention (0.0 to 1.0)
 *   T_fast  = fast path time (~20 cycles)
 *   T_slow  = slow path time (~1000 cycles)
 *   T_total = expected time per operation
 *
 * Simple model:
 *   T_total = (1-P) * T_fast + P * T_slow
 */

double expected_lock_time(double contention_prob,
                          double fast_cycles,
                          double slow_cycles) {
    return (1.0 - contention_prob) * fast_cycles +
           contention_prob * slow_cycles;
}

/*
 * Example calculations:
 *
 * P=0.01 (1% contention, typical for well-designed systems):
 *   T_total = 0.99 * 20 + 0.01 * 1000 = 19.8 + 10 = 29.8 cycles
 *   Only 50% overhead vs pure fast path!
 *
 * P=0.10 (10% contention):
 *   T_total = 0.90 * 20 + 0.10 * 1000 = 18 + 100 = 118 cycles
 *   Still 8x faster than pure slow path
 *
 * P=0.50 (50% contention, serious issue):
 *   T_total = 0.50 * 20 + 0.50 * 1000 = 10 + 500 = 510 cycles
 *   Only 2x faster than pure slow path
 *
 * P=1.00 (100% contention, degenerate case):
 *   T_total = 0.00 * 20 + 1.00 * 1000 = 1000 cycles
 *   No benefit from fast path
 */

/*
 * Measuring contention in practice:
 *
 * The kernel tracks futex statistics in some debug builds.
 * For production, you can instrument your code:
 */

typedef struct {
    _Atomic uint64_t fast_path_count;
    _Atomic uint64_t slow_path_count;
} lock_stats_t;

static lock_stats_t stats;

void instrumented_lock(int *futex) {
    int expected = 0;
    if (__atomic_compare_exchange_n(futex, &expected, 1, false,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
        atomic_fetch_add(&stats.fast_path_count, 1);
        return;
    }
    atomic_fetch_add(&stats.slow_path_count, 1);
    // ... slow path ...
}

double get_contention_ratio(void) {
    uint64_t fast = atomic_load(&stats.fast_path_count);
    uint64_t slow = atomic_load(&stats.slow_path_count);
    return (double)slow / (fast + slow);
}
```

| Contention | Expected Cycles | Speedup vs 1000cy Baseline | Assessment |
|---|---|---|---|
| 0% (ideal) | 20 | 50x | Optimal - all fast path |
| 1% (typical) | 30 | 33x | Excellent - minimal impact |
| 5% | 69 | 14x | Very good |
| 10% | 118 | 8.5x | Good |
| 25% | 265 | 3.8x | Moderate contention |
| 50% | 510 | 2x | High contention - redesign needed |
| 100% | 1000 | 1x | Degenerate - no benefit |
If your contention ratio exceeds 10%, you should investigate. Possible solutions: finer-grained locking, lock-free algorithms, reducing critical section length, or eliminating the shared state entirely. Don't just accept high contention as 'normal'.
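As a concrete instance of "eliminating the shared state": a counter guarded by a mutex can often become a single lock-free atomic, removing the futex word entirely. A sketch using C11 atomics (the names here are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

/* BEFORE: lock; counter++; unlock — ~20-cycle fast path, plus the
 * ~1000-cycle slow path whenever the lock is contended.
 * AFTER:  one atomic read-modify-write — no lock word, no slow path,
 * no futex syscall at any contention level. */
static _Atomic uint64_t event_count;

void count_event(void) {
    /* Relaxed ordering suffices for a pure statistics counter; use
     * release/acquire only if other data is published alongside it. */
    atomic_fetch_add_explicit(&event_count, 1, memory_order_relaxed);
}

uint64_t read_event_count(void) {
    return atomic_load_explicit(&event_count, memory_order_relaxed);
}
```

Under heavy contention the atomic still pays for cache line transfers between cores, but it never sleeps, so the 1000-cycle slow path in the model disappears entirely.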
Futex performance is heavily influenced by the memory subsystem. Cache behavior, NUMA topology, and cache line bouncing all affect real-world performance.
```c
#define _GNU_SOURCE
#include <stdint.h>

/*
 * Cache and memory effects on futex performance
 */

/*
 * SCENARIO 1: Lock and data on same cache line
 *
 *   struct {
 *       uint32_t futex;      // offset 0
 *       uint32_t counter;    // offset 4
 *       char padding[56];    // fill cache line
 *   } __attribute__((aligned(64)));
 *
 * When we acquire the lock (CAS on futex), we also pull 'counter'
 * into L1 cache. The subsequent read/write of 'counter' is free!
 *
 * Cost: lock + work = 20cy + 0cy = 20 cycles
 */

/*
 * SCENARIO 2: Lock and data on different cache lines
 *
 *   struct {
 *       uint32_t futex;
 *       char padding[60];    // push to next cache line
 *       uint32_t counter;    // offset 64
 *   };
 *
 * Acquiring the lock loads the futex cache line.
 * Reading counter requires a separate cache line fetch.
 *
 * Cost: lock + cache miss = 20cy + 40cy = 60 cycles
 */

/*
 * SCENARIO 3: Cross-core cache line transfer
 *
 * When CPU A releases a lock and CPU B acquires it, the cache line
 * containing the futex must transfer from A's cache to B's cache.
 *
 * On modern Intel/AMD:
 *   - Same L3 complex: ~40-80 cycles
 *   - Different L3 (multi-socket): ~150-300 cycles
 *   - Different NUMA node: ~200-400 cycles
 *
 * This is the "cache line bouncing" problem.
 */

/*
 * SCENARIO 4: False sharing
 *
 * Two unrelated futexes on the same cache line:
 *
 *   struct {
 *       uint32_t futex_A;    // Thread A's lock
 *       uint32_t futex_B;    // Thread B's lock
 *   };
 *
 * Even though A and B are unrelated, every lock operation by A
 * invalidates B's cache copy, and vice versa.
 *
 * Solution: Pad each futex to its own cache line.
 */

typedef struct {
    uint32_t futex;
    char padding[60];  // Fill out the 64-byte line
} __attribute__((aligned(64))) padded_futex_t;

/*
 * NUMA effects:
 *
 * On NUMA systems, memory access time depends on which node
 * owns the memory. Futexes should be allocated on the same
 * NUMA node as the threads that use them most frequently.
 */

#include <numa.h>  /* libnuma; link with -lnuma */

void allocate_futex_numa_aware(int preferred_node) {
    padded_futex_t *f = numa_alloc_onnode(sizeof(*f), preferred_node);
    f->futex = 0;  // Initialize
}
```

| Scenario | Additional Latency | Mitigation |
|---|---|---|
| L1 hit (same core) | ~4 cycles | Ideal case, no action needed |
| L2 hit (same core) | ~12 cycles | Working set slightly large |
| L3 hit (same socket) | ~40 cycles | Normal for multi-threaded |
| Cross-socket transfer | ~150-300 cycles | Minimize cross-socket locking |
| False sharing | ~80-200 cycles wasted | Pad to cache line boundaries |
| Remote NUMA access | ~200-400 cycles | NUMA-aware allocation |
If a lock is touched by multiple cores, pad it to 64 bytes (cache line size on x86). The wasted memory is trivial compared to the performance cost of false sharing. Some systems use 128-byte alignment to avoid hardware prefetcher issues.
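A compile-time guard keeps the padding honest if the structure later grows. This sketch assumes a 64-byte cache line and uses standard C11 `alignas` rather than the GCC-specific attribute:

```c
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumption: cache line size on x86 */

/* alignas on the first member forces both the struct's alignment and
 * (via compiler-inserted tail padding) its size up to CACHE_LINE, so
 * consecutive array elements never share a cache line. */
typedef struct {
    alignas(CACHE_LINE) uint32_t futex_word;
} padded_lock_t;

/* Fails to compile if a future field pushes the struct past one line. */
_Static_assert(sizeof(padded_lock_t) == CACHE_LINE,
               "padded_lock_t must occupy exactly one cache line");
```

An array of `padded_lock_t` then gives one lock per cache line with no manual padding arithmetic to keep in sync.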
Linux provides several tools for profiling futex behavior. Understanding how to use these tools is essential for diagnosing synchronization performance issues.
```bash
#!/bin/bash
# Futex profiling techniques

# ============================================
# TECHNIQUE 1: perf trace for syscall analysis
# ============================================

# Count futex syscalls by operation type
perf trace -s -e futex ./my_application 2>&1 | grep futex

# Example output:
#   futex(0x7f1234, FUTEX_WAIT_PRIVATE, 2, NULL) = -1 EAGAIN
#   futex(0x7f1234, FUTEX_WAKE_PRIVATE, 1) = 1
#   ...
# Summary:
#   futex: 15234 events, 23.4 ms (wait: 12000, wake: 3234)

# ============================================
# TECHNIQUE 2: bpftrace for detailed analysis
# ============================================

# Histogram of futex wait times (requires root)
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_futex /args->op == 0/ {
    @start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_futex /@start[tid]/ {
    @wait_ns = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}
END { print(@wait_ns); }' -c "./my_application"

# Output: histogram showing futex wait time distribution

# ============================================
# TECHNIQUE 3: perf for contention analysis
# ============================================

# Record contention events
perf record -e 'syscalls:sys_enter_futex' \
            -e 'syscalls:sys_exit_futex' \
            --call-graph dwarf \
            ./my_application

# Analyze with perf report
perf report --stdio

# ============================================
# TECHNIQUE 4: lock profiling with perf
# ============================================

# Linux lock profiling (if LOCKDEP enabled)
perf lock record ./my_application
perf lock report

# Shows: contended locks, wait times, acquisition count

# ============================================
# TECHNIQUE 5: /proc/sys analysis
# ============================================

# View kernel futex hash table info (if exposed)
cat /proc/sys/kernel/futex_hash_size   # May not exist

# View scheduler latency (affects wake latency)
cat /proc/sys/kernel/sched_latency_ns

# View context switch overhead
time taskset -c 0 ./context_switch_benchmark
```

Key Metrics to Monitor:
| Metric | Good Value | Bad Value | How to Measure |
|---|---|---|---|
| Fast path ratio | >95% | <80% | Instrument lock code |
| Futex syscalls/sec | Application-specific | >1M/sec may indicate issue | perf trace -s |
| Avg wait time | <1ms for most apps | >10ms | bpftrace histogram |
| Wake-to-run latency | <100μs | >1ms | Custom tracing |
| Hash bucket contention | Low | High (many collisions) | Kernel debug |
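The "Custom tracing" entry for wake-to-run latency can be as simple as two timestamps: the waker stamps a monotonic clock just before FUTEX_WAKE, and the woken thread reads the clock when FUTEX_WAIT returns. A Linux-only sketch (variable and function names are illustrative):

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static _Atomic int signal_word;       /* 0 = not signaled, 1 = signaled */
static _Atomic int64_t wake_sent_ns;  /* timestamp written by the waker */

static int64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Waker side: stamp the clock, publish the flag, wake one waiter. */
void stamp_and_wake(void) {
    atomic_store(&wake_sent_ns, now_ns());
    atomic_store(&signal_word, 1);
    syscall(SYS_futex, &signal_word, FUTEX_WAKE_PRIVATE, 1, NULL, NULL, 0);
}

/* Waiter side: sleep until signaled, then report wake-to-run latency. */
int64_t wait_and_measure_ns(void) {
    while (atomic_load(&signal_word) == 0)
        syscall(SYS_futex, &signal_word, FUTEX_WAIT_PRIVATE, 0, NULL, NULL, 0);
    return now_ns() - atomic_load(&wake_sent_ns);
}
```

In a real measurement the waiter runs on another thread, so the reported latency includes the FUTEX_WAKE call, the scheduler placing the waiter on a runqueue, and the context switch back into userspace.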
For production systems, consider adding lightweight futex instrumentation. Even a simple fast-path/slow-path counter can reveal contention issues before they become critical. The overhead of atomic counters is negligible compared to the insight they provide.
Based on our performance analysis, here are strategies for maximizing futex performance in real applications.
```c
/*
 * Futex optimization examples
 *
 * Requires: <pthread.h>, <linux/futex.h>, <sys/syscall.h>, <unistd.h>
 */

/* BEFORE: Long critical section */
void bad_update(data_t* shared, input_t* input) {
    pthread_mutex_lock(&mutex);
    result_t result = expensive_computation(input);  // Slow!
    shared->data = result;
    pthread_mutex_unlock(&mutex);
}

/* AFTER: Minimal critical section */
void good_update(data_t* shared, input_t* input) {
    result_t result = expensive_computation(input);  // Outside lock
    pthread_mutex_lock(&mutex);
    shared->data = result;  // Only assignment under lock
    pthread_mutex_unlock(&mutex);
}

/* Cache-line padding example */
struct __attribute__((aligned(64))) padded_mutex {
    pthread_mutex_t lock;
    char _pad[64 - sizeof(pthread_mutex_t)];
};

/* Array of locks with false-sharing prevention */
struct padded_mutex locks[NUM_BUCKETS];

/* Adaptive spin before futex wait (simplified) */
static inline void adaptive_lock(int *futex) {
    // Try fast path
    if (__atomic_compare_exchange_n(futex, &(int){0}, 1, false,
                                    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
        return;
    }

    // Brief spin before syscall
    for (int i = 0; i < 100; i++) {
        __builtin_ia32_pause();  // CPU hint: spinning
        if (__atomic_compare_exchange_n(futex, &(int){0}, 1, false,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED)) {
            return;  // Got it during spin!
        }
    }

    // Spin failed, now really sleep
    while (1) {
        int val = __atomic_exchange_n(futex, 2, __ATOMIC_ACQUIRE);
        if (val == 0)
            return;  // Got it
        syscall(SYS_futex, futex, FUTEX_WAIT_PRIVATE, 2, NULL, NULL, 0);
    }
}
```

Always profile before optimizing. Lock performance intuition is notoriously unreliable. What seems like an obvious optimization might make things worse due to cache effects, scheduler behavior, or contention patterns. Use perf, bpftrace, and custom instrumentation to guide decisions.
We've now analyzed futex performance from every angle: the cost breakdown, the contention model, cache and NUMA effects, profiling techniques, and optimization strategies.
What's Next:
The final page of this module examines the Linux implementation of futex in detail. We'll look at specific kernel code paths, the evolution of futex over Linux versions, and advanced features like priority inheritance futexes.
You now understand futex performance at a deep level—the cost breakdown, contention effects, cache impacts, profiling methodology, and optimization strategies. This knowledge enables you to design, implement, and tune high-performance synchronization in real systems.