In the previous page, we explored blocking semaphores—where threads voluntarily sleep when resources are unavailable. But there's another approach: spinlock-based semaphores, where threads actively wait in a tight loop, continuously checking whether the condition has become true.
At first glance, busy-waiting seems wasteful. Why burn CPU cycles spinning when you could sleep? The answer lies in latency. A context switch—removing a thread from the CPU and later restoring it—costs thousands of CPU cycles. If the expected wait time is less than the context switch overhead, spinning is actually more efficient.
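To put rough, illustrative numbers on it: if blocking costs two context switches (one to sleep, one to wake) at about 5,000 cycles each, while the semaphore is typically held for only 200 cycles, a spinner wastes at most a few hundred cycles where a sleeper pays roughly ten thousand. Reverse the ratio and sleeping wins just as clearly.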
Spinlock-based semaphores occupy a critical niche in systems programming: short critical sections, multiprocessor environments, and contexts where sleeping is impossible (interrupt handlers, real-time constraints). This page explores their implementation, from the atomic primitives that make them correct to the contention-management strategies that make them practical.
By the end of this page, you will understand: how spinlocks implement semaphore semantics, the role of atomic test-and-set operations, spin techniques and their tradeoffs, backoff strategies for reducing contention, and when spinlock-based implementations outperform blocking ones.
A spinlock-based semaphore replaces the wait queue and scheduler integration of blocking semaphores with a simpler mechanism: a tight loop that repeatedly tests the semaphore's value until acquisition succeeds.
Core Structure:
The simplest spinlock-based semaphore consists of just two fields: an integer counter holding the number of available resources, and a guard spinlock that makes read-modify-write access to that counter atomic.
Notice what's missing: there's no wait queue, no scheduler integration, no thread state management. The simplicity is the point—at the cost of CPU consumption during waits.
The Mental Model:
Imagine a thread trying to walk through a turnstile (the semaphore). In a blocking implementation, if the turnstile is locked, the thread sits on a bench (wait queue) and waits to be called. In a spinlock implementation, the thread continuously pushes on the turnstile, checking every nanosecond whether it has opened.
Why This Works:
On multiprocessor systems, while one thread is spinning, another thread on a different CPU can be executing the critical section and preparing to release the semaphore. The spinner doesn't need to sleep because useful work is happening elsewhere. When the semaphore becomes available, the spinner detects this within nanoseconds—far faster than a context switch would allow.
// The guard spinlock itself (defined first so the semaphore can embed it):
typedef struct spinlock {
    volatile int locked;           // 0 = unlocked, 1 = locked
} spinlock_t;

// Or for ticket locks (fairer):
typedef struct ticket_spinlock {
    volatile unsigned int head;    // Next ticket to be served
    volatile unsigned int tail;    // Next ticket to be issued
} ticket_spinlock_t;

// Basic spinlock-based semaphore structure
typedef struct spin_semaphore {
    // Semaphore value: positive = available, zero = none available
    // Unlike blocking semaphores, no negative values (no waiter tracking)
    volatile int value;

    // Spinlock for atomic operations on value
    // Ensures read-modify-write on value is atomic
    spinlock_t guard;
} spin_semaphore_t;

// Compare with blocking semaphore structure:
/*
typedef struct blocking_semaphore {
    int value;
    spinlock_t lock;
    wait_queue_head_t wait_list;   // <-- This is missing!
} blocking_semaphore_t;
*/

// Helper: initialize the guard spinlock
void spin_lock_init(spinlock_t *lock) {
    lock->locked = 0;
    // Memory barrier to ensure visibility
    __sync_synchronize();
}

// Initialization
void spin_semaphore_init(spin_semaphore_t *sem, int initial_value) {
    sem->value = initial_value;
    spin_lock_init(&sem->guard);
}

The volatile keyword prevents the compiler from optimizing away repeated reads of the semaphore value. However, volatile alone doesn't ensure atomicity on multiprocessors. Hardware atomic operations (test-and-set, compare-and-swap) or explicit memory barriers are still required for correctness.
The wait operation in a spinlock-based semaphore loops until it can atomically decrement the counter. Unlike blocking semaphores, there's no sleep—just continuous checking.
The Algorithm:
1. Acquire the guard spinlock.
2. If the value is greater than zero, decrement it, release the guard, and return: the semaphore has been acquired.
3. Otherwise, release the guard and go back to step 1.
The key insight: we release the guard lock while spinning. This allows other threads to signal the semaphore. If we held the guard lock while spinning, signalers couldn't increment the counter, creating deadlock.
The Polling Loop:
The simplest implementation is a tight loop:
while (!try_acquire(sem)) { /* spin */ }
But this aggressive spinning has problems: every iteration performs an atomic operation on the guard, generating cache coherence traffic that slows down the very thread trying to release the semaphore; the spinning CPU burns power while doing no useful work; and on hyperthreaded cores, the spinner steals execution resources from its sibling hardware thread.
More sophisticated implementations add delays, backoff strategies, and test-before-test-and-set optimizations.
// Spinlock-based semaphore wait (P/down) operation
// spin_lock()/spin_unlock() acquire and release the guard; see the
// TAS-based implementations later on this page.

static inline void cpu_relax(void);   // defined below

// Version 1: Basic spin loop
void spin_semaphore_wait_basic(spin_semaphore_t *sem) {
    while (1) {
        spin_lock(&sem->guard);
        if (sem->value > 0) {
            // Resource available - acquire it
            sem->value--;
            spin_unlock(&sem->guard);
            return;
        }
        // Resource not available - release lock and try again
        spin_unlock(&sem->guard);
        // This is the "spin" - loop back immediately
    }
}

// Version 2: Test-before-test-and-set (TATAS) optimization
void spin_semaphore_wait_tatas(spin_semaphore_t *sem) {
    while (1) {
        // Read-only test first (no cache line invalidation)
        while (sem->value == 0) {
            // Just spin on read, don't touch the lock
            cpu_relax();  // Hint to CPU we're spinning
        }

        // Value might be positive - try to acquire
        spin_lock(&sem->guard);
        if (sem->value > 0) {
            sem->value--;
            spin_unlock(&sem->guard);
            return;
        }
        // Lost the race - someone else got it
        spin_unlock(&sem->guard);
    }
}

// Version 3: With exponential backoff
#define MIN_BACKOFF 4      // backoff bounds, in cpu_relax() iterations
#define MAX_BACKOFF 1024

void spin_semaphore_wait_backoff(spin_semaphore_t *sem) {
    unsigned int delay = MIN_BACKOFF;

    while (1) {
        spin_lock(&sem->guard);
        if (sem->value > 0) {
            sem->value--;
            spin_unlock(&sem->guard);
            return;
        }
        spin_unlock(&sem->guard);

        // Back off before retrying
        for (unsigned int i = 0; i < delay; i++) {
            cpu_relax();
        }

        // Exponential backoff with cap
        if (delay < MAX_BACKOFF) {
            delay *= 2;
        }
    }
}

// The cpu_relax() function - architecture specific
// x86: PAUSE instruction
// ARM: YIELD instruction
static inline void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause" ::: "memory");
#elif defined(__arm__) || defined(__aarch64__)
    __asm__ volatile("yield" ::: "memory");
#else
    // Fallback: memory barrier alone
    __asm__ volatile("" ::: "memory");
#endif
}

// Try-wait: non-blocking attempt
int spin_semaphore_trywait(spin_semaphore_t *sem) {
    spin_lock(&sem->guard);
    if (sem->value > 0) {
        sem->value--;
        spin_unlock(&sem->guard);
        return 0;          // Success
    }
    spin_unlock(&sem->guard);
    return -EAGAIN;        // Would block (EAGAIN as in <errno.h>)
}

The signal operation in a spinlock-based semaphore is remarkably simple compared to its blocking counterpart. There's no wait queue to scan, no threads to wake, no scheduler to notify. We simply increment the counter atomically.
The Algorithm:
1. Acquire the guard spinlock.
2. Increment the semaphore value.
3. Release the guard spinlock.
No Wake-Up Necessary:
When we increment the counter, any spinning thread will notice on its next iteration. The cache coherence protocol ensures that our write to the counter becomes visible to other CPUs, typically within a short delay. The spinners don't need explicit notification; they're already watching.
Memory Ordering Considerations:
The guard spinlock provides acquire/release semantics: taking the guard acts as an acquire barrier (later memory operations cannot be reordered before it), and releasing it acts as a release barrier (earlier writes, including the counter increment, complete before the unlock becomes visible).
This ensures that the counter increment is fully visible before spinning threads can observe the unlocked guard and re-check the value.
// Spinlock-based semaphore signal (V/up) operation

void spin_semaphore_signal(spin_semaphore_t *sem) {
    spin_lock(&sem->guard);
    sem->value++;
    spin_unlock(&sem->guard);
    // That's it! No wake-up mechanism needed.
    // Spinners will notice on their next check.
}

// Compare with blocking signal (for reference):
/*
void blocking_semaphore_signal(semaphore_t *sem) {
    spin_lock(&sem->lock);
    sem->value++;
    // Must wake a waiter!
    if (!list_empty(&sem->wait_list)) {
        wake_up(&sem->wait_list);   // <-- This is missing!
    }
    spin_unlock(&sem->lock);
}
*/

// Atomic signal without separate lock (optimization)
void spin_semaphore_signal_atomic(spin_semaphore_t *sem) {
    // Use atomic increment if hardware supports it
    __sync_fetch_and_add(&sem->value, 1);
    // Implicit memory barrier ensures visibility
}

// Multiple signals at once (for counting semaphores)
void spin_semaphore_signal_n(spin_semaphore_t *sem, int n) {
    spin_lock(&sem->guard);
    sem->value += n;
    spin_unlock(&sem->guard);
}

// Atomic version of multiple signals
void spin_semaphore_signal_n_atomic(spin_semaphore_t *sem, int n) {
    __sync_fetch_and_add(&sem->value, n);
}

// The simplicity allows several optimizations:

// Inline the signal for minimal overhead
#define SPIN_SEM_SIGNAL(sem) \
    do { \
        __sync_fetch_and_add(&(sem)->value, 1); \
    } while (0)

// Signal from interrupt context (safe - no sleeping)
void spin_semaphore_signal_irq(spin_semaphore_t *sem) {
    unsigned long flags;

    // Disable interrupts to prevent nested acquisition
    local_irq_save(flags);
    spin_lock(&sem->guard);
    sem->value++;
    spin_unlock(&sem->guard);
    local_irq_restore(flags);
}

If the counter is only ever incremented (never read-modify-write with a condition), we can eliminate the guard lock entirely and use a hardware atomic increment. This reduces the signal operation to a single instruction on most architectures, providing maximum throughput.
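For comparison, here is a minimal lock-free sketch of the same semantics using C11 <stdatomic.h> instead of the GCC __sync builtins used above; the type name c11_spin_sem and both function names are illustrative, not part of any standard API. The signal path is a single fetch-add with release ordering, and the wait path is a compare-and-swap loop that takes acquire ordering only when it succeeds.

#include <stdatomic.h>

typedef struct {
    atomic_int value;   // number of available resources
} c11_spin_sem;

// Signal: one atomic increment; release ordering publishes prior writes
static void c11_sem_signal(c11_spin_sem *sem) {
    atomic_fetch_add_explicit(&sem->value, 1, memory_order_release);
}

// Wait: spin until a positive value is observed, then CAS it down by one
static void c11_sem_wait(c11_spin_sem *sem) {
    for (;;) {
        int old = atomic_load_explicit(&sem->value, memory_order_relaxed);
        if (old > 0 &&
            atomic_compare_exchange_weak_explicit(&sem->value, &old, old - 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
            return;   // acquired; the acquire pairs with the signaler's release
        }
        // Value was zero or the CAS lost a race: keep spinning
    }
}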
Spinlock-based semaphores rely on hardware atomic operations to function correctly. These operations ensure that read-modify-write sequences cannot be split by concurrent access from other CPUs.
Test-and-Set (TAS):
The fundamental primitive. Atomically reads a location, sets it to a new value, and returns the old value. If the old value was 0 (unlocked), we acquired the lock. If it was 1 (locked), someone else has it.
Compare-and-Swap (CAS):
More flexible than TAS. Atomically: if location equals expected value, update to new value and return success; otherwise, return failure. This enables lock-free updates.
Fetch-and-Add:
Atomically increments (or decrements) a value and returns the original. Perfect for counting semaphores where we just need to adjust the counter.
Implementation on x86:
Intel/AMD processors provide the LOCK prefix, which locks the memory bus (or cache line, on modern CPUs) during the following instruction. This makes instructions like XCHG, CMPXCHG, and XADD atomic.
Implementation on ARM:
ARM uses Load-Link/Store-Conditional (LL/SC) pairs: an exclusive load (LDREX on 32-bit ARM, LDXR on AArch64) reads a value and marks the location for exclusive access; the matching exclusive store (STREX or STXR) succeeds only if no other CPU has written to that location since the load, and the sequence is retried on failure.
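For illustration, here is a hedged AArch64 sketch of fetch-and-add built from that LL/SC pattern (the comparison table below lists it as an LDXR/ADD/STXR loop); the helper name fetch_and_add_llsc is illustrative and not part of the code elsewhere on this page.

// Fetch-and-add via LL/SC on AArch64 (illustrative sketch)
static inline int fetch_and_add_llsc(volatile int *ptr, int increment) {
    int old, newval, failed;
    __asm__ volatile(
        "1: ldxr  %w0, [%3]\n"        // load-exclusive the current value
        "   add   %w1, %w0, %w4\n"    // compute old + increment
        "   stxr  %w2, %w1, [%3]\n"   // store-exclusive; %w2 == 0 on success
        "   cbnz  %w2, 1b\n"          // another CPU intervened - retry
        : "=&r"(old), "=&r"(newval), "=&r"(failed)
        : "r"(ptr), "r"(increment)
        : "memory");
    return old;                        // original value, like lock xadd on x86
}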
// Atomic primitives for spinlock implementation

// ============================================
// Test-and-Set (TAS)
// ============================================
static inline int test_and_set(volatile int *lock) {
#if defined(__x86_64__) || defined(__i386__)
    int old = 1;
    // XCHG is implicitly locked on x86
    __asm__ volatile(
        "xchg %0, %1"
        : "+r"(old), "+m"(*lock)
        :
        : "memory"
    );
    return old;
#elif defined(__aarch64__)
    int old;
    int tmp;
    __asm__ volatile(
        "1: ldaxr %w0, [%2]\n"         // Load-acquire exclusive
        "   stlxr %w1, %w3, [%2]\n"    // Store-release exclusive
        "   cbnz %w1, 1b\n"            // Retry if store failed
        : "=&r"(old), "=&r"(tmp)
        : "r"(lock), "r"(1)
        : "memory"
    );
    return old;
#else
    return __sync_lock_test_and_set(lock, 1);
#endif
}

// ============================================
// Compare-and-Swap (CAS)
// ============================================
static inline int compare_and_swap(volatile int *ptr, int expected, int new_value) {
#if defined(__x86_64__) || defined(__i386__)
    int old;
    __asm__ volatile(
        "lock cmpxchg %2, %1"
        : "=a"(old), "+m"(*ptr)
        : "r"(new_value), "0"(expected)
        : "memory", "cc"
    );
    return old == expected;
#else
    return __sync_bool_compare_and_swap(ptr, expected, new_value);
#endif
}

// ============================================
// Fetch-and-Add
// ============================================
static inline int fetch_and_add(volatile int *ptr, int increment) {
#if defined(__x86_64__) || defined(__i386__)
    int old = increment;
    __asm__ volatile(
        "lock xadd %0, %1"
        : "+r"(old), "+m"(*ptr)
        :
        : "memory"
    );
    return old;
#else
    return __sync_fetch_and_add(ptr, increment);
#endif
}

// ============================================
// Spinlock using TAS
// ============================================
void spin_lock_tas(spinlock_t *lock) {
    while (test_and_set(&lock->locked)) {
        // Spin until we acquire
        cpu_relax();
    }
}

void spin_unlock(spinlock_t *lock) {
    // Compiler and memory barrier
    __asm__ volatile("" ::: "memory");
    lock->locked = 0;
}

// ============================================
// Spinlock using CAS (allows more complex state)
// ============================================
void spin_lock_cas(spinlock_t *lock) {
    while (!compare_and_swap(&lock->locked, 0, 1)) {
        // CAS failed, lock was not 0
        while (lock->locked) {
            // TATAS pattern
            cpu_relax();
        }
    }
}

// ============================================
// Semaphore wait using CAS directly
// ============================================
void spin_semaphore_wait_cas(spin_semaphore_t *sem) {
    int old, new_val;

    while (1) {
        // Read current value
        old = sem->value;

        if (old > 0) {
            new_val = old - 1;
            // Try to decrement
            if (compare_and_swap(&sem->value, old, new_val)) {
                return;   // Successfully acquired
            }
            // CAS failed, retry
        } else {
            // No resources - spin
            cpu_relax();
        }
    }
}

| Operation | x86 Instruction | ARM Instruction | Use Case |
|---|---|---|---|
| Test-and-Set | XCHG (implicit lock) | LDXR/STXR loop | Simple spinlocks, basic mutex |
| Compare-and-Swap | LOCK CMPXCHG | LDXR/STXR with compare | Lock-free structures, complex state |
| Fetch-and-Add | LOCK XADD | LDXR/ADD/STXR loop | Counting semaphores, statistics |
| Load-Linked/Store-Cond. | N/A (CAS instead) | LDXR/STXR | General-purpose atomic RMW on ARM |
When multiple threads spin on the same semaphore, performance can collapse due to cache coherence traffic. Understanding these effects is crucial for designing scalable spinlock-based synchronization.
Cache Coherence Basics:
Modern CPUs maintain cache coherence using protocols like MESI (Modified, Exclusive, Shared, Invalid). When one CPU writes to a cache line, all other CPUs holding copies must invalidate them. This causes: invalidation messages to every CPU sharing the line, a coherence miss the next time each of those CPUs reads the line, and a burst of interconnect traffic that grows with the number of sharers.
The TAS Spinning Problem:
With naive test-and-set spinning, every attempt executes an atomic operation that writes to the lock. Even if the lock is held, each spinner's TAS generates a cache line invalidation, forcing all other spinners to refetch the line. With N spinners, each attempt can invalidate the line in the other N-1 caches, which immediately refetch it to keep spinning, so coherence traffic grows roughly with the square of the spinner count, and the lock holder itself is slowed down when it finally tries to release.
TATAS Mitigation:
Test-and-Test-and-Set dramatically reduces bus traffic:
while (1) {
    while (lock->locked) cpu_relax();   // Read-only spin
    if (!test_and_set(lock)) return;    // Only then TAS
}
The read-only spin on lock->locked doesn't generate invalidations. CPUs read from their local cache (in Shared state). Only when the lock appears free do they attempt the atomic TAS—and only one will succeed.
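To trace one lock handoff under TATAS: the holder writes 0, which invalidates every spinner's cached copy; each spinner takes a coherence miss, re-reads the line into Shared state, sees it free, and attempts a TAS; exactly one TAS succeeds, and the losers return to read-only spinning on the new holder's value. Traffic per handoff is proportional to the number of spinners, which is far better than naive TAS spinning, where that traffic occurs continuously even while the lock is held.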
False Sharing:
If the semaphore shares a cache line with other data, unrelated modifications can cause invalidations. Solution: align semaphore structures to cache line boundaries (typically 64 bytes).
// Contention-aware spinlock implementations

// Cache line alignment to prevent false sharing
#define CACHE_LINE_SIZE 64

typedef struct __attribute__((aligned(CACHE_LINE_SIZE))) {
    volatile int locked;
    // Padding to fill cache line
    char padding[CACHE_LINE_SIZE - sizeof(int)];
} cache_aligned_spinlock_t;

typedef struct __attribute__((aligned(CACHE_LINE_SIZE))) {
    volatile int value;
    cache_aligned_spinlock_t guard;
    // Ensure entire structure is cache-aligned
    char padding[CACHE_LINE_SIZE - sizeof(int)];
} cache_aligned_spin_semaphore_t;

// ============================================
// TATAS (Test-and-Test-and-Set) Spinlock
// ============================================
// These helpers operate on the raw flag so they work for both plain
// and cache-aligned spinlocks.
void spinlock_acquire_tatas(volatile int *locked) {
    while (1) {
        // Read-only spin (no bus traffic if cache is warm)
        while (*locked) {
            cpu_relax();   // Reduce power, help hyperthreads
        }

        // Lock appears free - try atomic acquisition
        if (!test_and_set(locked)) {
            return;        // Got it!
        }
        // Lost the race - back to read-only spin
    }
}

void spinlock_release(volatile int *locked) {
    // Compiler barrier; the plain store acts as the release on x86 (TSO)
    __asm__ volatile("" ::: "memory");
    *locked = 0;
}

// ============================================
// TATAS with Exponential Backoff
// ============================================
#define MIN_BACKOFF 4
#define MAX_BACKOFF 1024

void spinlock_acquire_tatas_backoff(volatile int *locked) {
    unsigned int delay = MIN_BACKOFF;

    while (1) {
        // Read-only test
        while (*locked) {
            cpu_relax();
        }

        // Attempt acquisition
        if (!test_and_set(locked)) {
            return;
        }

        // Failed - backoff before retrying
        for (unsigned int i = 0; i < delay; i++) {
            cpu_relax();
        }

        // Increase delay (exponential, with randomness to avoid lockstep)
        if (delay < MAX_BACKOFF) {
            delay = delay * 2 + (rand() % delay);
            if (delay > MAX_BACKOFF) delay = MAX_BACKOFF;
        }
    }
}

// ============================================
// Proportional Backoff
// ============================================
// Backoff proportional to observed contention
void spinlock_acquire_proportional_backoff(volatile int *locked) {
    unsigned int attempts = 0;

    while (1) {
        while (*locked) {
            cpu_relax();
            attempts++;    // Count contention
        }

        if (!test_and_set(locked)) {
            return;
        }

        // Backoff based on contention observed
        unsigned int backoff = attempts < 1000 ? attempts : 1000;
        for (unsigned int i = 0; i < backoff; i++) {
            cpu_relax();
        }
    }
}

// ============================================
// Semaphore with contention tracking
// ============================================
typedef struct {
    volatile int value;
    cache_aligned_spinlock_t guard;
    volatile unsigned int contention_count;   // For monitoring
} monitored_spin_semaphore_t;

void monitored_sem_wait(monitored_spin_semaphore_t *sem) {
    unsigned int local_contention = 0;

    while (1) {
        // TATAS on value first (avoid lock if no resources)
        while (sem->value == 0) {
            cpu_relax();
            local_contention++;
        }

        spinlock_acquire_tatas(&sem->guard.locked);
        if (sem->value > 0) {
            sem->value--;
            spinlock_release(&sem->guard.locked);
            // Report contention for tuning
            __sync_fetch_and_add(&sem->contention_count, local_contention);
            return;
        }
        spinlock_release(&sem->guard.locked);
    }
}

Under heavy contention, naive spinlocks can reduce throughput to 1-5% of optimal. A single lock acquisition might take thousands of cycles due to cache line bouncing. Always use at minimum TATAS, and consider backoff strategies for any lock that might be contended.
Basic spinlocks using test-and-set provide no fairness guarantee. A thread that just released the lock might immediately reacquire it, starving other waiters. Ticket locks solve this by ensuring FIFO ordering.
The Ticket Lock Concept:
Imagine a deli counter with a ticket dispenser: each arriving customer pulls the next numbered ticket, a display shows the number currently being served, and you are served exactly when the display reaches your number. Nobody can jump the queue, no matter how insistently they hover at the counter.
Implementation:
A ticket lock has two counters: head, the ticket number currently being served, and tail, the next ticket number to hand out.
To acquire: atomically increment tail to get your ticket, then spin until head equals your ticket. To release: increment head.
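For example, starting from head = tail = 0: thread A takes ticket 0 (tail becomes 1) and enters immediately because head == 0; thread B takes ticket 1 and spins; thread C takes ticket 2 and spins. When A releases, head becomes 1 and only B may proceed; C keeps waiting until head reaches 2.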
Fairness Guarantee:
Ticket locks provide strict FIFO ordering. The thread that arrived first will acquire the lock first. This prevents starvation and provides predictable latency under contention.
Tradeoffs: all waiters spin on the same head counter, so every release still invalidates every waiter's cache line; a waiter cannot easily abandon the queue once it has taken a ticket, which makes trylock awkward; and strict fairness can hurt throughput if the waiter that is next in line has been preempted.
// Ticket lock for fair FIFO ordering

typedef struct {
    volatile unsigned int head;   // Now serving
    volatile unsigned int tail;   // Next ticket to issue
} ticket_lock_t;

void ticket_lock_init(ticket_lock_t *lock) {
    lock->head = 0;
    lock->tail = 0;
}

void ticket_lock_acquire(ticket_lock_t *lock) {
    // Get our ticket number (atomic fetch-and-increment)
    unsigned int my_ticket = fetch_and_add(&lock->tail, 1);

    // Wait until it's our turn
    while (lock->head != my_ticket) {
        cpu_relax();
    }

    // Memory barrier to ensure subsequent operations stay after acquire
    __sync_synchronize();
}

void ticket_lock_release(ticket_lock_t *lock) {
    // Memory barrier to ensure prior operations complete before release
    __sync_synchronize();

    // Increment the serving counter (next waiter's turn)
    lock->head++;
}

// ============================================
// Ticket-based semaphore for fair ordering
// ============================================
typedef struct {
    volatile int value;       // Semaphore count
    ticket_lock_t guard;      // Fair lock for accessing value
} fair_spin_semaphore_t;

void fair_sem_init(fair_spin_semaphore_t *sem, int initial) {
    sem->value = initial;
    ticket_lock_init(&sem->guard);
}

void fair_sem_wait(fair_spin_semaphore_t *sem) {
    while (1) {
        ticket_lock_acquire(&sem->guard);
        if (sem->value > 0) {
            sem->value--;
            ticket_lock_release(&sem->guard);
            return;
        }
        ticket_lock_release(&sem->guard);

        // Small delay before retrying to reduce contention
        for (int i = 0; i < 100; i++) {
            cpu_relax();
        }
    }
}

void fair_sem_signal(fair_spin_semaphore_t *sem) {
    ticket_lock_acquire(&sem->guard);
    sem->value++;
    ticket_lock_release(&sem->guard);
}

// ============================================
// Proportional backoff ticket lock
// ============================================
#define BACKOFF_PER_POSITION 50   // Tune based on critical section length

// Each waiter backs off proportionally to their position in line
void ticket_lock_acquire_proportional(ticket_lock_t *lock) {
    unsigned int my_ticket = fetch_and_add(&lock->tail, 1);

    while (1) {
        unsigned int current = lock->head;
        if (current == my_ticket) {
            __sync_synchronize();
            return;
        }

        // We're (my_ticket - current) positions away
        // Back off proportionally
        unsigned int distance = my_ticket - current;
        unsigned int backoff = distance * BACKOFF_PER_POSITION;
        for (unsigned int i = 0; i < backoff; i++) {
            cpu_relax();
        }
    }
}

Ticket locks provide fairness but have scalability issues: all waiters spin on the same memory location. When the head value changes, every waiter's cache line is invalidated. MCS locks (named after Mellor-Crummey and Scott) solve this by having each waiter spin on its own local variable.
The MCS Lock Concept:
Instead of a shared counter, MCS uses a linked list:
1. Each waiter supplies its own queue node containing a locked flag (initially true = must wait)
2. A newly arrived waiter appends its node to the tail of the queue and spins on its own locked flag
3. The holder clears its successor's locked flag when it releases
4. Ownership thus passes directly from node to node, touching only the next waiter's locked flag (if any)

Scalability Benefit:
Each thread spins on a different cache line (its own node). When the lock is released, only one cache line is invalidated—that of the next waiter. This gives O(1) cache traffic per release, regardless of waiter count.
Tradeoffs: the caller must supply a queue node and keep it alive from acquire to release (typically on the stack, passed to both calls); the release path is more complex, needing a compare-and-swap to handle the no-successor case; and the uncontended fast path carries slightly more overhead than a simple TAS lock.
// MCS Lock - scalable queue-based spinlock

// Per-thread queue node (typically stack-allocated)
typedef struct mcs_node {
    volatile int locked;              // 0 = can proceed, 1 = must wait
    volatile struct mcs_node *next;   // Next waiter in queue
} mcs_node_t;

// The lock itself is just a tail pointer
typedef struct {
    volatile mcs_node_t *tail;
} mcs_lock_t;

void mcs_lock_init(mcs_lock_t *lock) {
    lock->tail = NULL;
}

void mcs_lock_acquire(mcs_lock_t *lock, mcs_node_t *node) {
    node->next = NULL;
    node->locked = 1;   // Assume we must wait

    // Atomically swap our node into the tail
    mcs_node_t *predecessor =
        (mcs_node_t *) __sync_lock_test_and_set(&lock->tail, node);

    if (predecessor != NULL) {
        // Queue was not empty - link ourselves and wait
        predecessor->next = node;

        // Spin on our own flag (local cache line!)
        while (node->locked) {
            cpu_relax();
        }
    }
    // We hold the lock now
    __sync_synchronize();   // Acquire barrier
}

void mcs_lock_release(mcs_lock_t *lock, mcs_node_t *node) {
    __sync_synchronize();   // Release barrier - prior ops complete

    if (node->next == NULL) {
        // No known successor - try to clear tail
        if (__sync_bool_compare_and_swap(&lock->tail, node, NULL)) {
            // Successfully released; nobody was waiting
            return;
        }
        // CAS failed: someone is in the process of queuing
        // Wait for them to link themselves
        while (node->next == NULL) {
            cpu_relax();
        }
    }

    // Wake the next waiter
    node->next->locked = 0;
}

// ============================================
// MCS-based semaphore
// ============================================
typedef struct {
    volatile int value;
    mcs_lock_t guard;
} mcs_semaphore_t;

void mcs_sem_init(mcs_semaphore_t *sem, int initial) {
    sem->value = initial;
    mcs_lock_init(&sem->guard);
}

void mcs_sem_wait(mcs_semaphore_t *sem) {
    mcs_node_t node;   // Stack-allocated!

    while (1) {
        mcs_lock_acquire(&sem->guard, &node);
        if (sem->value > 0) {
            sem->value--;
            mcs_lock_release(&sem->guard, &node);
            return;
        }
        mcs_lock_release(&sem->guard, &node);

        // Backoff before retrying
        cpu_relax();
    }
}

void mcs_sem_signal(mcs_semaphore_t *sem) {
    mcs_node_t node;

    mcs_lock_acquire(&sem->guard, &node);
    sem->value++;
    mcs_lock_release(&sem->guard, &node);
}

The Linux kernel uses a hybrid called qspinlock: it starts as a simple spinlock for low contention (fast path), but transforms into an MCS-like queue under high contention (scalable path). This provides the best of both worlds: minimal overhead when uncontended, and O(1) cache traffic when contended.
Spinlock-based semaphores are not universally applicable. They excel in specific scenarios and fail in others. Matching the synchronization mechanism to the situation is crucial for system performance.
Ideal Scenarios for Spinning:
Short critical sections: If the semaphore is held for just a few hundred cycles, spinning is cheaper than context switching.
Multiprocessor systems: Spinning makes sense when other CPUs can run the critical section. On a uniprocessor, spinning is always wasteful.
Interrupt/exception context: Kernel code in interrupt handlers cannot sleep. Spinlocks are the only option.
Real-time constraints: Context switch latency may be unacceptable. Spinning provides deterministic acquisition time.
High-frequency operations: When a lock is acquired millions of times per second, context switch overhead dominates.
Poor Scenarios for Spinning:
Long critical sections: Holding the semaphore for milliseconds while others spin wastes massive CPU time.
Single-CPU systems: The holder can't make progress while we spin—guaranteed waste.
Energy-constrained environments: Mobile devices need CPUs in low-power states, not spinning.
High contention with many waiters: O(N) spinning CPUs is catastrophic for system throughput.
The Crossover Point:
The decision to spin or block depends on expected hold time versus context switch cost: as a rule of thumb, spin when the expected wait is shorter than the round-trip cost of blocking and waking (roughly two context switches), and block when the wait is longer or unpredictable.
Many systems implement adaptive spinning: spin for a while, then block if the lock isn't released. This captures the benefits of both approaches.
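A minimal sketch of that adaptive idea, assuming a user-space POSIX environment: it reuses the spin_semaphore_trywait() and cpu_relax() helpers shown earlier, and SPIN_LIMIT is a hypothetical tunable that would normally be set near the break-even point described above. A production implementation would typically park the thread on a futex or condition variable rather than just yielding.

#include <sched.h>

#define SPIN_LIMIT 1000   // hypothetical tunable: max busy-wait attempts

void adaptive_sem_wait(spin_semaphore_t *sem) {
    // Phase 1: optimistic spinning, cheap if the wait turns out to be short
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (spin_semaphore_trywait(sem) == 0) {
            return;                // acquired on the fast path
        }
        cpu_relax();
    }

    // Phase 2: the wait is long - stop burning CPU and yield between attempts
    while (spin_semaphore_trywait(sem) != 0) {
        sched_yield();             // let the holder (or other work) run
    }
}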
Never spin on a uniprocessor system! If you're the only CPU, the lock holder cannot run while you spin; you will simply burn the rest of your time slice before the scheduler finally runs the holder. On uniprocessors, always use blocking primitives and let the scheduler run the lock holder.
We've explored the design and implementation of spinlock-based semaphores—from atomic primitives through contention management to sophisticated queue locks. Let's consolidate the key insights: spinning trades CPU cycles for low acquisition latency, and it only pays off when the expected wait is shorter than a context switch; correctness rests on hardware atomics (test-and-set, compare-and-swap, fetch-and-add), not on the volatile keyword; naive spinning collapses under contention, so practical implementations layer on TATAS, backoff, ticket ordering, or MCS-style queues; and the right mechanism depends on critical section length, processor count, and whether sleeping is permitted at all.
What's Next:
With both blocking and spinning implementations understood, we'll examine the kernel implementation of semaphores—how operating systems provide semaphore services, the system call interface, and the integration with process scheduling that makes these abstractions available to user programs.
You now understand how spinlock-based semaphores work—from raw atomic operations through cache-conscious optimizations to sophisticated queue-based locks. This knowledge is essential for kernel development, high-performance computing, and any scenario where control over synchronization overhead is critical.