In the previous page, we explored blocking semaphores—where threads voluntarily sleep when resources are unavailable. But there's another approach: spinlock-based semaphores, where threads actively wait in a tight loop, continuously checking whether the condition has become true.
At first glance, busy-waiting seems wasteful. Why burn CPU cycles spinning when you could sleep? The answer lies in latency. A context switch—removing a thread from the CPU and later restoring it—costs thousands of CPU cycles. If the expected wait time is less than the context switch overhead, spinning is actually more efficient.
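To put rough, illustrative numbers on it: if blocking costs two context switches (one to sleep, one to wake) at about 5,000 cycles each, while the semaphore is typically held for only 200 cycles, a spinner wastes at most a few hundred cycles where a sleeper pays roughly ten thousand. Reverse the ratio and sleeping wins just as clearly.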
Spinlock-based semaphores occupy a critical niche in systems programming: short critical sections, multiprocessor environments, and contexts where sleeping is impossible (interrupt handlers, real-time constraints). This page explores their implementation, from the atomic primitives that make them correct to the contention-management strategies that make them practical.
By the end of this page, you will understand: how spinlocks implement semaphore semantics, the role of atomic test-and-set operations, spin techniques and their tradeoffs, backoff strategies for reducing contention, and when spinlock-based implementations outperform blocking ones.
A spinlock-based semaphore replaces the wait queue and scheduler integration of blocking semaphores with a simpler mechanism: a tight loop that repeatedly tests the semaphore's value until acquisition succeeds.
Core Structure:
The simplest spinlock-based semaphore consists of just two fields: an integer counter holding the number of available resources, and a guard spinlock that makes read-modify-write access to that counter atomic.
Notice what's missing: there's no wait queue, no scheduler integration, no thread state management. The simplicity is the point—at the cost of CPU consumption during waits.
The Mental Model:
Imagine a thread trying to walk through a turnstile (the semaphore). In a blocking implementation, if the turnstile is locked, the thread sits on a bench (wait queue) and waits to be called. In a spinlock implementation, the thread continuously pushes on the turnstile, checking every nanosecond whether it has opened.
Why This Works:
On multiprocessor systems, while one thread is spinning, another thread on a different CPU can be executing the critical section and preparing to release the semaphore. The spinner doesn't need to sleep because useful work is happening elsewhere. When the semaphore becomes available, the spinner detects this within nanoseconds—far faster than a context switch would allow.
// The guard spinlock itself (defined first so the semaphore can embed it):
typedef struct spinlock {
    volatile int locked;           // 0 = unlocked, 1 = locked
} spinlock_t;

// Or for ticket locks (fairer):
typedef struct ticket_spinlock {
    volatile unsigned int head;    // Next ticket to be served
    volatile unsigned int tail;    // Next ticket to be issued
} ticket_spinlock_t;

// Basic spinlock-based semaphore structure
typedef struct spin_semaphore {
    // Semaphore value: positive = available, zero = none available
    // Unlike blocking semaphores, no negative values (no waiter tracking)
    volatile int value;

    // Spinlock for atomic operations on value
    // Ensures read-modify-write on value is atomic
    spinlock_t guard;
} spin_semaphore_t;

// Compare with blocking semaphore structure:
/*
typedef struct blocking_semaphore {
    int value;
    spinlock_t lock;
    wait_queue_head_t wait_list;   // <-- This is missing!
} blocking_semaphore_t;
*/

// Helper: initialize the guard spinlock
void spin_lock_init(spinlock_t *lock) {
    lock->locked = 0;
    // Memory barrier to ensure visibility
    __sync_synchronize();
}

// Initialization
void spin_semaphore_init(spin_semaphore_t *sem, int initial_value) {
    sem->value = initial_value;
    spin_lock_init(&sem->guard);
}

The volatile keyword prevents the compiler from optimizing away repeated reads of the semaphore value. However, volatile alone doesn't ensure atomicity on multiprocessors. Hardware atomic operations (test-and-set, compare-and-swap) or explicit memory barriers are still required for correctness.
The wait operation in a spinlock-based semaphore loops until it can atomically decrement the counter. Unlike blocking semaphores, there's no sleep—just continuous checking.
The Algorithm:
1. Acquire the guard spinlock.
2. If the value is greater than zero, decrement it, release the guard, and return: the semaphore has been acquired.
3. Otherwise, release the guard and go back to step 1.
The key insight: we release the guard lock while spinning. This allows other threads to signal the semaphore. If we held the guard lock while spinning, signalers couldn't increment the counter, creating deadlock.
The Polling Loop:
The simplest implementation is a tight loop:
while (!try_acquire(sem)) { /* spin */ }
But this aggressive spinning has problems: every iteration performs an atomic operation on the guard, generating cache coherence traffic that slows down the very thread trying to release the semaphore; the spinning CPU burns power while doing no useful work; and on hyperthreaded cores, the spinner steals execution resources from its sibling hardware thread.
More sophisticated implementations add delays, backoff strategies, and test-before-test-and-set optimizations.
// Spinlock-based semaphore wait (P/down) operation
// spin_lock()/spin_unlock() acquire and release the guard; see the
// TAS-based implementations later on this page.

static inline void cpu_relax(void);   // defined below

// Version 1: Basic spin loop
void spin_semaphore_wait_basic(spin_semaphore_t *sem) {
    while (1) {
        spin_lock(&sem->guard);
        if (sem->value > 0) {
            // Resource available - acquire it
            sem->value--;
            spin_unlock(&sem->guard);
            return;
        }
        // Resource not available - release lock and try again
        spin_unlock(&sem->guard);
        // This is the "spin" - loop back immediately
    }
}

// Version 2: Test-before-test-and-set (TATAS) optimization
void spin_semaphore_wait_tatas(spin_semaphore_t *sem) {
    while (1) {
        // Read-only test first (no cache line invalidation)
        while (sem->value == 0) {
            // Just spin on read, don't touch the lock
            cpu_relax();  // Hint to CPU we're spinning
        }

        // Value might be positive - try to acquire
        spin_lock(&sem->guard);
        if (sem->value > 0) {
            sem->value--;
            spin_unlock(&sem->guard);
            return;
        }
        // Lost the race - someone else got it
        spin_unlock(&sem->guard);
    }
}

// Version 3: With exponential backoff
#define MIN_BACKOFF 4      // backoff bounds, in cpu_relax() iterations
#define MAX_BACKOFF 1024

void spin_semaphore_wait_backoff(spin_semaphore_t *sem) {
    unsigned int delay = MIN_BACKOFF;

    while (1) {
        spin_lock(&sem->guard);
        if (sem->value > 0) {
            sem->value--;
            spin_unlock(&sem->guard);
            return;
        }
        spin_unlock(&sem->guard);

        // Back off before retrying
        for (unsigned int i = 0; i < delay; i++) {
            cpu_relax();
        }

        // Exponential backoff with cap
        if (delay < MAX_BACKOFF) {
            delay *= 2;
        }
    }
}

// The cpu_relax() function - architecture specific
// x86: PAUSE instruction
// ARM: YIELD instruction
static inline void cpu_relax(void) {
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile("pause" ::: "memory");
#elif defined(__arm__) || defined(__aarch64__)
    __asm__ volatile("yield" ::: "memory");
#else
    // Fallback: memory barrier alone
    __asm__ volatile("" ::: "memory");
#endif
}

// Try-wait: non-blocking attempt
int spin_semaphore_trywait(spin_semaphore_t *sem) {
    spin_lock(&sem->guard);
    if (sem->value > 0) {
        sem->value--;
        spin_unlock(&sem->guard);
        return 0;          // Success
    }
    spin_unlock(&sem->guard);
    return -EAGAIN;        // Would block (EAGAIN as in <errno.h>)
}

The signal operation in a spinlock-based semaphore is remarkably simple compared to its blocking counterpart. There's no wait queue to scan, no threads to wake, no scheduler to notify. We simply increment the counter atomically.
The Algorithm:
1. Acquire the guard spinlock.
2. Increment the semaphore value.
3. Release the guard spinlock.
No Wake-Up Necessary:
When we increment the counter, any spinning thread will notice on its next iteration. The cache coherence protocol ensures that our write to the counter becomes visible to other CPUs, typically within a short delay. The spinners don't need explicit notification; they're already watching.
Memory Ordering Considerations:
The guard spinlock provides acquire/release semantics: taking the guard acts as an acquire barrier (later memory operations cannot be reordered before it), and releasing it acts as a release barrier (earlier writes, including the counter increment, complete before the unlock becomes visible).
This ensures that the counter increment is fully visible before spinning threads can observe the unlocked guard and re-check the value.
// Spinlock-based semaphore signal (V/up) operation

void spin_semaphore_signal(spin_semaphore_t *sem) {
    spin_lock(&sem->guard);
    sem->value++;
    spin_unlock(&sem->guard);
    // That's it! No wake-up mechanism needed.
    // Spinners will notice on their next check.
}

// Compare with blocking signal (for reference):
/*
void blocking_semaphore_signal(semaphore_t *sem) {
    spin_lock(&sem->lock);
    sem->value++;
    // Must wake a waiter!
    if (!list_empty(&sem->wait_list)) {
        wake_up(&sem->wait_list);   // <-- This is missing!
    }
    spin_unlock(&sem->lock);
}
*/

// Atomic signal without separate lock (optimization)
void spin_semaphore_signal_atomic(spin_semaphore_t *sem) {
    // Use atomic increment if hardware supports it
    __sync_fetch_and_add(&sem->value, 1);
    // Implicit memory barrier ensures visibility
}

// Multiple signals at once (for counting semaphores)
void spin_semaphore_signal_n(spin_semaphore_t *sem, int n) {
    spin_lock(&sem->guard);
    sem->value += n;
    spin_unlock(&sem->guard);
}

// Atomic version of multiple signals
void spin_semaphore_signal_n_atomic(spin_semaphore_t *sem, int n) {
    __sync_fetch_and_add(&sem->value, n);
}

// The simplicity allows several optimizations:

// Inline the signal for minimal overhead
#define SPIN_SEM_SIGNAL(sem) \
    do { \
        __sync_fetch_and_add(&(sem)->value, 1); \
    } while (0)

// Signal from interrupt context (safe - no sleeping)
void spin_semaphore_signal_irq(spin_semaphore_t *sem) {
    unsigned long flags;

    // Disable interrupts to prevent nested acquisition
    local_irq_save(flags);
    spin_lock(&sem->guard);
    sem->value++;
    spin_unlock(&sem->guard);
    local_irq_restore(flags);
}

If the counter is only ever incremented (never read-modify-write with a condition), we can eliminate the guard lock entirely and use a hardware atomic increment. This reduces the signal operation to a single instruction on most architectures, providing maximum throughput.
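For comparison, here is a minimal lock-free sketch of the same semantics using C11 <stdatomic.h> instead of the GCC __sync builtins used above; the type name c11_spin_sem and both function names are illustrative, not part of any standard API. The signal path is a single fetch-add with release ordering, and the wait path is a compare-and-swap loop that takes acquire ordering only when it succeeds.

#include <stdatomic.h>

typedef struct {
    atomic_int value;   // number of available resources
} c11_spin_sem;

// Signal: one atomic increment; release ordering publishes prior writes
static void c11_sem_signal(c11_spin_sem *sem) {
    atomic_fetch_add_explicit(&sem->value, 1, memory_order_release);
}

// Wait: spin until a positive value is observed, then CAS it down by one
static void c11_sem_wait(c11_spin_sem *sem) {
    for (;;) {
        int old = atomic_load_explicit(&sem->value, memory_order_relaxed);
        if (old > 0 &&
            atomic_compare_exchange_weak_explicit(&sem->value, &old, old - 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed)) {
            return;   // acquired; the acquire pairs with the signaler's release
        }
        // Value was zero or the CAS lost a race: keep spinning
    }
}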
Spinlock-based semaphores rely on hardware atomic operations to function correctly. These operations ensure that read-modify-write sequences cannot be split by concurrent access from other CPUs.
Test-and-Set (TAS):
The fundamental primitive. Atomically reads a location, sets it to a new value, and returns the old value. If the old value was 0 (unlocked), we acquired the lock. If it was 1 (locked), someone else has it.
Compare-and-Swap (CAS):
More flexible than TAS. Atomically: if location equals expected value, update to new value and return success; otherwise, return failure. This enables lock-free updates.
Fetch-and-Add:
Atomically increments (or decrements) a value and returns the original. Perfect for counting semaphores where we just need to adjust the counter.
Implementation on x86:
Intel/AMD processors provide the LOCK prefix, which locks the memory bus (or cache line, on modern CPUs) during the following instruction. This makes instructions like XCHG, CMPXCHG, and XADD atomic.
Implementation on ARM:
ARM uses Load-Link/Store-Conditional (LL/SC) pairs: an exclusive load (LDREX on 32-bit ARM, LDXR on AArch64) reads a value and marks the location for exclusive access; the matching exclusive store (STREX or STXR) succeeds only if no other CPU has written to that location since the load, and the sequence is retried on failure.
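For illustration, here is a hedged AArch64 sketch of fetch-and-add built from that LL/SC pattern (the comparison table below lists it as an LDXR/ADD/STXR loop); the helper name fetch_and_add_llsc is illustrative and not part of the code elsewhere on this page.

// Fetch-and-add via LL/SC on AArch64 (illustrative sketch)
static inline int fetch_and_add_llsc(volatile int *ptr, int increment) {
    int old, newval, failed;
    __asm__ volatile(
        "1: ldxr  %w0, [%3]\n"        // load-exclusive the current value
        "   add   %w1, %w0, %w4\n"    // compute old + increment
        "   stxr  %w2, %w1, [%3]\n"   // store-exclusive; %w2 == 0 on success
        "   cbnz  %w2, 1b\n"          // another CPU intervened - retry
        : "=&r"(old), "=&r"(newval), "=&r"(failed)
        : "r"(ptr), "r"(increment)
        : "memory");
    return old;                        // original value, like lock xadd on x86
}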
// Atomic primitives for spinlock implementation

// ============================================
// Test-and-Set (TAS)
// ============================================
static inline int test_and_set(volatile int *lock) {
#if defined(__x86_64__) || defined(__i386__)
    int old = 1;
    // XCHG is implicitly locked on x86
    __asm__ volatile(
        "xchg %0, %1"
        : "+r"(old), "+m"(*lock)
        :
        : "memory"
    );
    return old;
#elif defined(__aarch64__)
    int old;
    int tmp;
    __asm__ volatile(
        "1: ldaxr %w0, [%2]\n"         // Load-acquire exclusive
        "   stlxr %w1, %w3, [%2]\n"    // Store-release exclusive
        "   cbnz %w1, 1b\n"            // Retry if store failed
        : "=&r"(old), "=&r"(tmp)
        : "r"(lock), "r"(1)
        : "memory"
    );
    return old;
#else
    return __sync_lock_test_and_set(lock, 1);
#endif
}

// ============================================
// Compare-and-Swap (CAS)
// ============================================
static inline int compare_and_swap(volatile int *ptr, int expected, int new_value) {
#if defined(__x86_64__) || defined(__i386__)
    int old;
    __asm__ volatile(
        "lock cmpxchg %2, %1"
        : "=a"(old), "+m"(*ptr)
        : "r"(new_value), "0"(expected)
        : "memory", "cc"
    );
    return old == expected;
#else
    return __sync_bool_compare_and_swap(ptr, expected, new_value);
#endif
}

// ============================================
// Fetch-and-Add
// ============================================
static inline int fetch_and_add(volatile int *ptr, int increment) {
#if defined(__x86_64__) || defined(__i386__)
    int old = increment;
    __asm__ volatile(
        "lock xadd %0, %1"
        : "+r"(old), "+m"(*ptr)
        :
        : "memory"
    );
    return old;
#else
    return __sync_fetch_and_add(ptr, increment);
#endif
}

// ============================================
// Spinlock using TAS
// ============================================
void spin_lock_tas(spinlock_t *lock) {
    while (test_and_set(&lock->locked)) {
        // Spin until we acquire
        cpu_relax();
    }
}

void spin_unlock(spinlock_t *lock) {
    // Compiler and memory barrier
    __asm__ volatile("" ::: "memory");
    lock->locked = 0;
}

// ============================================
// Spinlock using CAS (allows more complex state)
// ============================================
void spin_lock_cas(spinlock_t *lock) {
    while (!compare_and_swap(&lock->locked, 0, 1)) {
        // CAS failed, lock was not 0
        while (lock->locked) {
            // TATAS pattern
            cpu_relax();
        }
    }
}

// ============================================
// Semaphore wait using CAS directly
// ============================================
void spin_semaphore_wait_cas(spin_semaphore_t *sem) {
    int old, new_val;

    while (1) {
        // Read current value
        old = sem->value;

        if (old > 0) {
            new_val = old - 1;
            // Try to decrement
            if (compare_and_swap(&sem->value, old, new_val)) {
                return;   // Successfully acquired
            }
            // CAS failed, retry
        } else {
            // No resources - spin
            cpu_relax();
        }
    }
}

| Operation | x86 Instruction | ARM Instruction | Use Case |
|---|---|---|---|
| Test-and-Set | XCHG (implicit lock) | LDXR/STXR loop | Simple spinlocks, basic mutex |
| Compare-and-Swap | LOCK CMPXCHG | LDXR/STXR with compare | Lock-free structures, complex state |
| Fetch-and-Add | LOCK XADD | LDXR/ADD/STXR loop | Counting semaphores, statistics |
| Load-Linked/Store-Cond. | N/A (CAS instead) | LDXR/STXR | General-purpose atomic RMW on ARM |
When multiple threads spin on the same semaphore, performance can collapse due to cache coherence traffic. Understanding these effects is crucial for designing scalable spinlock-based synchronization.
Cache Coherence Basics:
Modern CPUs maintain cache coherence using protocols like MESI (Modified, Exclusive, Shared, Invalid). When one CPU writes to a cache line, all other CPUs holding copies must invalidate them. This causes: invalidation messages to every CPU sharing the line, a coherence miss the next time each of those CPUs reads the line, and a burst of interconnect traffic that grows with the number of sharers.
The TAS Spinning Problem:
With naive test-and-set spinning, every attempt executes an atomic operation that writes to the lock. Even if the lock is held, each spinner's TAS generates a cache line invalidation, forcing all other spinners to refetch the line. With N spinners, each attempt can invalidate the line in the other N-1 caches, which immediately refetch it to keep spinning, so coherence traffic grows roughly with the square of the spinner count, and the lock holder itself is slowed down when it finally tries to release.
TATAS Mitigation:
Test-and-Test-and-Set dramatically reduces bus traffic:
while (1) {
    while (lock->locked) cpu_relax();   // Read-only spin
    if (!test_and_set(lock)) return;    // Only then TAS
}
The read-only spin on lock->locked doesn't generate invalidations. CPUs read from their local cache (in Shared state). Only when the lock appears free do they attempt the atomic TAS—and only one will succeed.
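To trace one lock handoff under TATAS: the holder writes 0, which invalidates every spinner's cached copy; each spinner takes a coherence miss, re-reads the line into Shared state, sees it free, and attempts a TAS; exactly one TAS succeeds, and the losers return to read-only spinning on the new holder's value. Traffic per handoff is proportional to the number of spinners, which is far better than naive TAS spinning, where that traffic occurs continuously even while the lock is held.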
False Sharing:
If the semaphore shares a cache line with other data, unrelated modifications can cause invalidations. Solution: align semaphore structures to cache line boundaries (typically 64 bytes).
// Contention-aware spinlock implementations

// Cache line alignment to prevent false sharing
#define CACHE_LINE_SIZE 64

typedef struct __attribute__((aligned(CACHE_LINE_SIZE))) {
    volatile int locked;
    // Padding to fill cache line
    char padding[CACHE_LINE_SIZE - sizeof(int)];
} cache_aligned_spinlock_t;

typedef struct __attribute__((aligned(CACHE_LINE_SIZE))) {
    volatile int value;
    cache_aligned_spinlock_t guard;
    // Ensure entire structure is cache-aligned
    char padding[CACHE_LINE_SIZE - sizeof(int)];
} cache_aligned_spin_semaphore_t;

// ============================================
// TATAS (Test-and-Test-and-Set) Spinlock
// ============================================
// These helpers operate on the raw flag so they work for both plain
// and cache-aligned spinlocks.
void spinlock_acquire_tatas(volatile int *locked) {
    while (1) {
        // Read-only spin (no bus traffic if cache is warm)
        while (*locked) {
            cpu_relax();   // Reduce power, help hyperthreads
        }

        // Lock appears free - try atomic acquisition
        if (!test_and_set(locked)) {
            return;        // Got it!
        }
        // Lost the race - back to read-only spin
    }
}

void spinlock_release(volatile int *locked) {
    // Compiler barrier; the plain store acts as the release on x86 (TSO)
    __asm__ volatile("" ::: "memory");
    *locked = 0;
}

// ============================================
// TATAS with Exponential Backoff
// ============================================
#define MIN_BACKOFF 4
#define MAX_BACKOFF 1024

void spinlock_acquire_tatas_backoff(volatile int *locked) {
    unsigned int delay = MIN_BACKOFF;

    while (1) {
        // Read-only test
        while (*locked) {
            cpu_relax();
        }

        // Attempt acquisition
        if (!test_and_set(locked)) {
            return;
        }

        // Failed - backoff before retrying
        for (unsigned int i = 0; i < delay; i++) {
            cpu_relax();
        }

        // Increase delay (exponential, with randomness to avoid lockstep)
        if (delay < MAX_BACKOFF) {
            delay = delay * 2 + (rand() % delay);
            if (delay > MAX_BACKOFF) delay = MAX_BACKOFF;
        }
    }
}

// ============================================
// Proportional Backoff
// ============================================
// Backoff proportional to observed contention
void spinlock_acquire_proportional_backoff(volatile int *locked) {
    unsigned int attempts = 0;

    while (1) {
        while (*locked) {
            cpu_relax();
            attempts++;    // Count contention
        }

        if (!test_and_set(locked)) {
            return;
        }

        // Backoff based on contention observed
        unsigned int backoff = attempts < 1000 ? attempts : 1000;
        for (unsigned int i = 0; i < backoff; i++) {
            cpu_relax();
        }
    }
}

// ============================================
// Semaphore with contention tracking
// ============================================
typedef struct {
    volatile int value;
    cache_aligned_spinlock_t guard;
    volatile unsigned int contention_count;   // For monitoring
} monitored_spin_semaphore_t;

void monitored_sem_wait(monitored_spin_semaphore_t *sem) {
    unsigned int local_contention = 0;

    while (1) {
        // TATAS on value first (avoid lock if no resources)
        while (sem->value == 0) {
            cpu_relax();
            local_contention++;
        }

        spinlock_acquire_tatas(&sem->guard.locked);
        if (sem->value > 0) {
            sem->value--;
            spinlock_release(&sem->guard.locked);
            // Report contention for tuning
            __sync_fetch_and_add(&sem->contention_count, local_contention);
            return;
        }
        spinlock_release(&sem->guard.locked);
    }
}

Under heavy contention, naive spinlocks can reduce throughput to 1-5% of optimal. A single lock acquisition might take thousands of cycles due to cache line bouncing. Always use at minimum TATAS, and consider backoff strategies for any lock that might be contended.
Basic spinlocks using test-and-set provide no fairness guarantee. A thread that just released the lock might immediately reacquire it, starving other waiters. Ticket locks solve this by ensuring FIFO ordering.
The Ticket Lock Concept:
Imagine a deli counter with a ticket dispenser: each arriving customer pulls the next numbered ticket, a display shows the number currently being served, and you are served exactly when the display reaches your number. Nobody can jump the queue, no matter how insistently they hover at the counter.
Implementation:
A ticket lock has two counters: head, the ticket number currently being served, and tail, the next ticket number to hand out.
To acquire: atomically increment tail to get your ticket, then spin until head equals your ticket. To release: increment head.
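For example, starting from head = tail = 0: thread A takes ticket 0 (tail becomes 1) and enters immediately because head == 0; thread B takes ticket 1 and spins; thread C takes ticket 2 and spins. When A releases, head becomes 1 and only B may proceed; C keeps waiting until head reaches 2.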
Fairness Guarantee:
Ticket locks provide strict FIFO ordering. The thread that arrived first will acquire the lock first. This prevents starvation and provides predictable latency under contention.
Tradeoffs: all waiters spin on the same head counter, so every release still invalidates every waiter's cache line; a waiter cannot easily abandon the queue once it has taken a ticket, which makes trylock awkward; and strict fairness can hurt throughput if the waiter that is next in line has been preempted.
// Ticket lock for fair FIFO ordering

typedef struct {
    volatile unsigned int head;   // Now serving
    volatile unsigned int tail;   // Next ticket to issue
} ticket_lock_t;

void ticket_lock_init(ticket_lock_t *lock) {
    lock->head = 0;
    lock->tail = 0;
}

void ticket_lock_acquire(ticket_lock_t *lock) {
    // Get our ticket number (atomic fetch-and-increment)
    unsigned int my_ticket = fetch_and_add(&lock->tail, 1);

    // Wait until it's our turn
    while (lock->head != my_ticket) {
        cpu_relax();
    }

    // Memory barrier to ensure subsequent operations stay after acquire
    __sync_synchronize();
}

void ticket_lock_release(ticket_lock_t *lock) {
    // Memory barrier to ensure prior operations complete before release
    __sync_synchronize();

    // Increment the serving counter (next waiter's turn)
    lock->head++;
}

// ============================================
// Ticket-based semaphore for fair ordering
// ============================================
typedef struct {
    volatile int value;       // Semaphore count
    ticket_lock_t guard;      // Fair lock for accessing value
} fair_spin_semaphore_t;

void fair_sem_init(fair_spin_semaphore_t *sem, int initial) {
    sem->value = initial;
    ticket_lock_init(&sem->guard);
}

void fair_sem_wait(fair_spin_semaphore_t *sem) {
    while (1) {
        ticket_lock_acquire(&sem->guard);
        if (sem->value > 0) {
            sem->value--;
            ticket_lock_release(&sem->guard);
            return;
        }
        ticket_lock_release(&sem->guard);

        // Small delay before retrying to reduce contention
        for (int i = 0; i < 100; i++) {
            cpu_relax();
        }
    }
}

void fair_sem_signal(fair_spin_semaphore_t *sem) {
    ticket_lock_acquire(&sem->guard);
    sem->value++;
    ticket_lock_release(&sem->guard);
}

// ============================================
// Proportional backoff ticket lock
// ============================================
#define BACKOFF_PER_POSITION 50   // Tune based on critical section length

// Each waiter backs off proportionally to their position in line
void ticket_lock_acquire_proportional(ticket_lock_t *lock) {
    unsigned int my_ticket = fetch_and_add(&lock->tail, 1);

    while (1) {
        unsigned int current = lock->head;
        if (current == my_ticket) {
            __sync_synchronize();
            return;
        }

        // We're (my_ticket - current) positions away
        // Back off proportionally
        unsigned int distance = my_ticket - current;
        unsigned int backoff = distance * BACKOFF_PER_POSITION;
        for (unsigned int i = 0; i < backoff; i++) {
            cpu_relax();
        }
    }
}

Ticket locks provide fairness but have scalability issues: all waiters spin on the same memory location. When the head value changes, every waiter's cache line is invalidated. MCS locks (named after Mellor-Crummey and Scott) solve this by having each waiter spin on its own local variable.
The MCS Lock Concept:
Instead of a shared counter, MCS uses a linked list:
1. Each waiter supplies its own queue node containing a locked flag (initially true = must wait)
2. A newly arrived waiter appends its node to the tail of the queue and spins on its own locked flag
3. The holder clears its successor's locked flag when it releases
4. Ownership thus passes directly from node to node, touching only the next waiter's locked flag (if any)

Scalability Benefit:
Each thread spins on a different cache line (its own node). When the lock is released, only one cache line is invalidated—that of the next waiter. This gives O(1) cache traffic per release, regardless of waiter count.
Tradeoffs: the caller must supply a queue node and keep it alive from acquire to release (typically on the stack, passed to both calls); the release path is more complex, needing a compare-and-swap to handle the no-successor case; and the uncontended fast path carries slightly more overhead than a simple TAS lock.
// MCS Lock - scalable queue-based spinlock

// Per-thread queue node (typically stack-allocated)
typedef struct mcs_node {
    volatile int locked;              // 0 = can proceed, 1 = must wait
    volatile struct mcs_node *next;   // Next waiter in queue
} mcs_node_t;

// The lock itself is just a tail pointer
typedef struct {
    volatile mcs_node_t *tail;
} mcs_lock_t;

void mcs_lock_init(mcs_lock_t *lock) {
    lock->tail = NULL;
}

void mcs_lock_acquire(mcs_lock_t *lock, mcs_node_t *node) {
    node->next = NULL;
    node->locked = 1;   // Assume we must wait

    // Atomically swap our node into the tail
    mcs_node_t *predecessor =
        (mcs_node_t *) __sync_lock_test_and_set(&lock->tail, node);

    if (predecessor != NULL) {
        // Queue was not empty - link ourselves and wait
        predecessor->next = node;

        // Spin on our own flag (local cache line!)
        while (node->locked) {
            cpu_relax();
        }
    }
    // We hold the lock now
    __sync_synchronize();   // Acquire barrier
}

void mcs_lock_release(mcs_lock_t *lock, mcs_node_t *node) {
    __sync_synchronize();   // Release barrier - prior ops complete

    if (node->next == NULL) {
        // No known successor - try to clear tail
        if (__sync_bool_compare_and_swap(&lock->tail, node, NULL)) {
            // Successfully released; nobody was waiting
            return;
        }
        // CAS failed: someone is in the process of queuing
        // Wait for them to link themselves
        while (node->next == NULL) {
            cpu_relax();
        }
    }

    // Wake the next waiter
    node->next->locked = 0;
}

// ============================================
// MCS-based semaphore
// ============================================
typedef struct {
    volatile int value;
    mcs_lock_t guard;
} mcs_semaphore_t;

void mcs_sem_init(mcs_semaphore_t *sem, int initial) {
    sem->value = initial;
    mcs_lock_init(&sem->guard);
}

void mcs_sem_wait(mcs_semaphore_t *sem) {
    mcs_node_t node;   // Stack-allocated!

    while (1) {
        mcs_lock_acquire(&sem->guard, &node);
        if (sem->value > 0) {
            sem->value--;
            mcs_lock_release(&sem->guard, &node);
            return;
        }
        mcs_lock_release(&sem->guard, &node);

        // Backoff before retrying
        cpu_relax();
    }
}

void mcs_sem_signal(mcs_semaphore_t *sem) {
    mcs_node_t node;

    mcs_lock_acquire(&sem->guard, &node);
    sem->value++;
    mcs_lock_release(&sem->guard, &node);
}

The Linux kernel uses a hybrid called qspinlock: it starts as a simple spinlock for low contention (fast path), but transforms into an MCS-like queue under high contention (scalable path). This provides the best of both worlds: minimal overhead when uncontended, and O(1) cache traffic when contended.
Spinlock-based semaphores are not universally applicable. They excel in specific scenarios and fail in others. Matching the synchronization mechanism to the situation is crucial for system performance.
Ideal Scenarios for Spinning:
Short critical sections: If the semaphore is held for just a few hundred cycles, spinning is cheaper than context switching.
Multiprocessor systems: Spinning makes sense when other CPUs can run the critical section. On a uniprocessor, spinning is always wasteful.
Interrupt/exception context: Kernel code in interrupt handlers cannot sleep. Spinlocks are the only option.
Real-time constraints: Context switch latency may be unacceptable. Spinning provides deterministic acquisition time.
High-frequency operations: When a lock is acquired millions of times per second, context switch overhead dominates.
Poor Scenarios for Spinning:
Long critical sections: Holding the semaphore for milliseconds while others spin wastes massive CPU time.
Single-CPU systems: The holder can't make progress while we spin—guaranteed waste.
Energy-constrained environments: Mobile devices need CPUs in low-power states, not spinning.
High contention with many waiters: O(N) spinning CPUs is catastrophic for system throughput.
The Crossover Point:
The decision to spin or block depends on expected hold time versus context switch cost: as a rule of thumb, spin when the expected wait is shorter than the round-trip cost of blocking and waking (roughly two context switches), and block when the wait is longer or unpredictable.
Many systems implement adaptive spinning: spin for a while, then block if the lock isn't released. This captures the benefits of both approaches.
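A minimal sketch of that adaptive idea, assuming a user-space POSIX environment: it reuses the spin_semaphore_trywait() and cpu_relax() helpers shown earlier, and SPIN_LIMIT is a hypothetical tunable that would normally be set near the break-even point described above. A production implementation would typically park the thread on a futex or condition variable rather than just yielding.

#include <sched.h>

#define SPIN_LIMIT 1000   // hypothetical tunable: max busy-wait attempts

void adaptive_sem_wait(spin_semaphore_t *sem) {
    // Phase 1: optimistic spinning, cheap if the wait turns out to be short
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (spin_semaphore_trywait(sem) == 0) {
            return;                // acquired on the fast path
        }
        cpu_relax();
    }

    // Phase 2: the wait is long - stop burning CPU and yield between attempts
    while (spin_semaphore_trywait(sem) != 0) {
        sched_yield();             // let the holder (or other work) run
    }
}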
Never spin on a uniprocessor system! If you're the only CPU, the lock holder cannot run while you spin; you will simply burn the rest of your time slice before the scheduler finally runs the holder. On uniprocessors, always use blocking primitives and let the scheduler run the lock holder.
We've explored the design and implementation of spinlock-based semaphores—from atomic primitives through contention management to sophisticated queue locks. Let's consolidate the key insights: spinning trades CPU cycles for low acquisition latency, and it only pays off when the expected wait is shorter than a context switch; correctness rests on hardware atomics (test-and-set, compare-and-swap, fetch-and-add), not on the volatile keyword; naive spinning collapses under contention, so practical implementations layer on TATAS, backoff, ticket ordering, or MCS-style queues; and the right mechanism depends on critical section length, processor count, and whether sleeping is permitted at all.
What's Next:
With both blocking and spinning implementations understood, we'll examine the kernel implementation of semaphores—how operating systems provide semaphore services, the system call interface, and the integration with process scheduling that makes these abstractions available to user programs.
You now understand how spinlock-based semaphores work—from raw atomic operations through cache-conscious optimizations to sophisticated queue-based locks. This knowledge is essential for kernel development, high-performance computing, and any scenario where control over synchronization overhead is critical.