The spinlocks we've studied so far share a critical flaw: unfairness. When a lock is released, any waiting thread might acquire it—there's no guarantee that the longest-waiting thread will succeed. Under high contention, this leads to starvation, where some threads wait indefinitely while others repeatedly acquire and release the lock.
Ticket locks solve this elegantly by borrowing a real-world concept: the numbered ticket system used at bakeries and deli counters. Take a ticket, wait for your number to be called. Simple, fair, and deterministic.
This page explores ticket locks in depth—their elegant design, implementation details, correctness guarantees, and the foundation they provide for even more sophisticated locking mechanisms.
By the end of this page, you will understand: the fairness problem with traditional spinlocks; how ticket locks guarantee FIFO ordering; detailed implementation with atomic operations; performance characteristics and cache behavior; ticket lock limitations and when to use them; and the evolution toward MCS and other queued locks.
Let's examine why traditional spinlocks are unfair and why this matters.
With a test-and-set spinlock, lock release triggers a race. All waiting threads observe the lock becoming free (via cache invalidation) and attempt to acquire it. The winner is determined by factors such as:

- Which CPU wins the cache-coherence arbitration for the lock's cache line
- NUMA proximity: how close each waiter is to the memory holding the lock
- Timing luck at the instant of release
None of these factors relate to how long each thread has been waiting. The result is random: a thread that just arrived might win over one that's been waiting for milliseconds.
```
# Unfair spinlock behavior under contention

Timeline:
T=0:   Thread A acquires lock
T=1ms: Thread B starts waiting
T=2ms: Thread C starts waiting
T=3ms: Thread D starts waiting
T=4ms: Thread A releases lock

# All threads race for the lock
# Due to cache topology, Thread D's CPU happens to
# get exclusive access first

T=4ms: Thread D acquires lock (waited 1ms)
T=5ms: Thread D releases lock

# Thread B races again... but loses to newcomer Thread E

T=5ms: Thread E arrives and immediately acquires (waited 0ms!)
       Thread B has now waited 4ms
       Thread C has waited 3ms

# Threads B and C can starve indefinitely if new arrivals
# are favored by cache topology
```

Why does this unfairness matter?

1. Bounded Wait Times: Without fairness guarantees, worst-case wait time is unbounded. A thread might wait forever if it's systematically unlucky.
2. Predictability: Real-time and deadline-sensitive systems require predictable latency. Starvation is unpredictable.
3. Progress Guarantees: Formal correctness often requires "bounded waiting"—a thread cannot be bypassed indefinitely. Unfair locks violate this.
4. Cache Behavior: Unfair locks can exhibit NUMA bias (threads on certain nodes are favored), creating implicit performance tiers.
On NUMA systems, threads on the same node as the lock's memory have lower latency access. They consistently win races against remote threads. The result: remote threads starve while local threads dominate. This hidden unfairness can cause severe performance disparities—certain processes run fast while others crawl.
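You can observe this skew directly. The following is a small measurement sketch of our own (not from any particular library): each thread hammers a plain test-and-set lock for a fixed wall-clock window, and we count how often each one gets in. Heavily uneven counts are the unfairness described above; a ticket lock in the same harness produces nearly identical counts.

```c
// Hypothetical harness: count acquisitions per thread under a TAS lock.
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 4

static atomic_flag tas_lock = ATOMIC_FLAG_INIT;
static atomic_bool stop;
static long acquisitions[NTHREADS];

static void *worker(void *arg) {
    long id = (long)arg;
    while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
        while (atomic_flag_test_and_set_explicit(&tas_lock,
                                                 memory_order_acquire)) {
            /* spin */
        }
        acquisitions[id]++;                       /* critical section */
        atomic_flag_clear_explicit(&tas_lock, memory_order_release);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);

    sleep(1);                                     /* contend for one second */
    atomic_store(&stop, true);

    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);
        printf("thread %d acquired the lock %ld times\n", i, acquisitions[i]);
    }
    return 0;
}
```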
Ticket locks enforce First-In-First-Out (FIFO) ordering through a simple mechanism:
Two counters:

- next_ticket: the next ticket number to be dispensed
- now_serving: the ticket number currently being served (the current holder)

Lock acquisition:

1. Atomically fetch-and-increment next_ticket to get your unique ticket number
2. Spin until now_serving equals your ticket

Lock release:

- Increment now_serving to call the next ticket

The key insight: Each thread gets a unique, monotonically increasing ticket number. The order of acquisition is determined at arrival time, not at release time. No racing—just orderly waiting.
```
# Ticket lock operation visualization

Initial state: next_ticket = 0, now_serving = 0  (lock is free)

T=0: Thread A arrives
     A does: my_ticket = fetch_add(next_ticket) → gets 0
     now_serving = 0 = my_ticket, so A enters critical section
     State: next_ticket = 1, now_serving = 0

T=1: Thread B arrives
     B does: my_ticket = fetch_add(next_ticket) → gets 1
     now_serving = 0 ≠ 1, so B spins
     State: next_ticket = 2, now_serving = 0

T=2: Thread C arrives
     C does: my_ticket = fetch_add(next_ticket) → gets 2
     now_serving = 0 ≠ 2, so C spins
     State: next_ticket = 3, now_serving = 0

T=3: Thread A releases
     A does: now_serving++ → becomes 1
     B sees now_serving = 1 = my_ticket, B enters
     C still spins (now_serving = 1 ≠ 2)

T=4: Thread B releases
     B does: now_serving++ → becomes 2
     C sees now_serving = 2 = my_ticket, C enters

# FIFO order guaranteed: A → B → C, exactly as they arrived
```

No thundering herd: When the lock is released, only one thread (the next in line) observes its condition satisfied. Others continue spinning on their own condition.
Guaranteed fairness: FIFO ordering is intrinsic to the design. Cannot be violated regardless of cache topology, NUMA effects, or timing.
Simple implementation: Two atomic counters and two operations (fetch-add, load). No complex data structures.
Deterministic wait time: If N threads are ahead of you in the queue, you will acquire the lock after exactly N releases. Bounded waiting is guaranteed.
Ticket locks were analyzed and popularized by Mellor-Crummey and Scott in their influential 1991 paper 'Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors.' The same paper introduced MCS locks, which we'll touch on later. These algorithms remain foundational to modern lock design.
Let's build a complete ticket lock implementation, examining each design decision.
```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;   // Next ticket to be dispensed
    atomic_uint now_serving;   // Currently held/served ticket
} ticket_lock_t;

// Initialize to zero (lock is free when next_ticket == now_serving)
#define TICKET_LOCK_INIT { ATOMIC_VAR_INIT(0), ATOMIC_VAR_INIT(0) }

void ticket_lock(ticket_lock_t *lock) {
    // Step 1: Get our ticket number (atomic increment)
    // fetch_add returns the OLD value, then increments
    unsigned int my_ticket = atomic_fetch_add_explicit(
        &lock->next_ticket, 1, memory_order_relaxed);

    // Step 2: Wait until it's our turn
    // Spin until now_serving matches our ticket
    while (atomic_load_explicit(
               &lock->now_serving, memory_order_acquire) != my_ticket) {
        // PAUSE for efficient spinning
        __builtin_ia32_pause();
    }

    // Our ticket is now being served - we hold the lock!
    // The acquire ordering ensures we see all prior writes
}

void ticket_unlock(ticket_lock_t *lock) {
    // Call the next ticket: just increment now_serving
    unsigned int current = atomic_load_explicit(
        &lock->now_serving, memory_order_relaxed);
    atomic_store_explicit(
        &lock->now_serving, current + 1, memory_order_release);

    // The release ordering ensures our writes are visible
    // before the next holder proceeds
}
```

Ticket acquisition (fetch_add): Uses relaxed ordering because it only needs atomicity—each arriving thread must receive a unique ticket. Taking a ticket does not yet grant access to shared data, so no ordering is required here; the spin loop's acquire provides it.

Spin loop (load): Uses acquire ordering because the load that finally observes our ticket must make the previous holder's writes visible to us and must prevent the critical section from being reordered before the lock is actually held.

Release (store): Uses release ordering because all writes performed inside the critical section must become visible before the new now_serving value; the next holder's acquire load then synchronizes with this store.
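Here is a minimal usage sketch, assuming the ticket_lock_t type and functions above plus POSIX threads; the counter, function names, and thread count are ours for illustration. Because the lock is FIFO, all four threads make steady progress rather than some finishing far earlier than others.

```c
// Hypothetical usage sketch: protect a shared counter with the ticket lock.
#include <pthread.h>
#include <stdio.h>

static ticket_lock_t counter_lock = TICKET_LOCK_INIT;
static long shared_counter;

static void *increment_many(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        ticket_lock(&counter_lock);    // threads enter in arrival order
        shared_counter++;              // critical section
        ticket_unlock(&counter_lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, increment_many, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    printf("final count: %ld (expected 4000000)\n", shared_counter);
    return 0;
}
```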
```c
// Optimized ticket lock with proportional backoff

#include <stdatomic.h>

typedef struct {
    _Alignas(64) atomic_uint next_ticket;   // Separate cache line
    _Alignas(64) atomic_uint now_serving;   // Separate cache line
} ticket_lock_t;

// Proportional backoff: wait longer if we're further back in the queue
void ticket_lock_optimized(ticket_lock_t *lock) {
    unsigned int my_ticket = atomic_fetch_add_explicit(
        &lock->next_ticket, 1, memory_order_relaxed);

    // Read current serving number
    unsigned int serving = atomic_load_explicit(
        &lock->now_serving, memory_order_relaxed);

    while (serving != my_ticket) {
        // Calculate how many threads are ahead of us
        unsigned int ahead = my_ticket - serving;

        // Proportional backoff: wait proportional to position in queue
        // Avoid checking too frequently when we're far back
        for (unsigned int i = 0; i < ahead * 100; i++) {
            __builtin_ia32_pause();
        }

        // Check again with acquire ordering (needed when we're next)
        if (ahead == 1) {
            serving = atomic_load_explicit(
                &lock->now_serving, memory_order_acquire);
        } else {
            serving = atomic_load_explicit(
                &lock->now_serving, memory_order_relaxed);
        }
    }

    // Ensure acquire semantics even if the exit-observing load above was
    // relaxed (e.g., the first load, or a relaxed load that jumped straight
    // to our ticket)
    atomic_thread_fence(memory_order_acquire);
}

void ticket_unlock_optimized(ticket_lock_t *lock) {
    // Increment with release ordering
    atomic_fetch_add_explicit(
        &lock->now_serving, 1, memory_order_release);
}
```

The optimized version places next_ticket and now_serving on separate cache lines (using _Alignas(64)). This prevents false sharing: when threads increment next_ticket, they don't invalidate the cache line containing now_serving that other threads are spinning on.
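As a quick sanity check on that layout (a sketch of our own, assuming a 64-byte cache line), C11's offsetof and _Static_assert can verify at compile time that the two counters really do land on different cache lines:

```c
#include <stddef.h>   // offsetof

// Fails to compile if the two counters ever end up on the same 64-byte line
_Static_assert(offsetof(ticket_lock_t, now_serving) -
               offsetof(ticket_lock_t, next_ticket) >= 64,
               "next_ticket and now_serving must be on separate cache lines");
```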
Let's formally verify that ticket locks satisfy the three properties required of any correct lock: mutual exclusion, progress, and bounded waiting.
Claim: At most one thread can be in the critical section at any time.
Proof:
1. A thread enters the critical section only when now_serving == my_ticket
2. fetch_add dispenses each ticket exactly once, so every waiting thread holds a distinct my_ticket value
3. Since now_serving has a single value at any instant, at most one thread's condition can be satisfied

QED ∎
Claim: If no thread is in the critical section and threads are waiting, some thread will enter.
Proof:
1. When the holder releases, it increments now_serving
2. After the increment, now_serving holds the ticket of the next eligible thread
3. That thread is spinning and will observe that now_serving matches its ticket
4. It enters the critical section

QED ∎
Claim: Every thread that wants entry will enter within a bounded number of lock acquisitions by other threads.
Proof:
1. Suppose a thread takes ticket N while now_serving = M; exactly N − M threads are ahead of it
2. Each release increments now_serving by exactly one, and no value is skipped
3. After N − M releases, now_serving equals N, and the thread enters

QED ∎
Stronger guarantee: Ticket locks provide FIFO ordering, which is stricter than just bounded waiting. Not only is wait time bounded, but the order is deterministic—first-come, first-served.
What if ticket numbers overflow? With 32-bit unsigned integers, overflow wraps to 0. As long as the number of concurrent waiters is much less than 2^32, the arithmetic still works correctly due to modular arithmetic. The comparison (now_serving == my_ticket) works because both wrap equally. In practice, 32 bits is more than sufficient.
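A tiny sketch of our own illustrating that claim: unsigned wraparound preserves both the equality test and the "how far ahead of me" distance used by the proportional-backoff variant.

```c
#include <assert.h>
#include <stdio.h>

int main(void) {
    // Pretend the counters have almost wrapped: the holder took the last
    // ticket before overflow, and we took the first one after it.
    unsigned int now_serving = 4294967295u;     // 2^32 - 1
    unsigned int my_ticket   = now_serving + 1; // wraps to 0

    // Queue distance still works thanks to modular arithmetic
    unsigned int ahead = my_ticket - now_serving;
    assert(ahead == 1);                          // exactly one thread ahead

    // After one release, now_serving also wraps and equality holds again
    now_serving = now_serving + 1;               // wraps to 0
    assert(now_serving == my_ticket);

    printf("wraparound preserved ordering: ahead=%u\n", ahead);
    return 0;
}
```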
Ticket locks have different performance characteristics than TAS/TTAS spinlocks. Let's analyze them.
Lock acquisition (fetch_add on next_ticket):

- One atomic read-modify-write; concurrent arrivals serialize on the cache line holding next_ticket

Spinning (reading now_serving):

- No coherence traffic while waiting: each waiter reads its locally cached copy of now_serving

Lock release (incrementing now_serving):

- A single write to now_serving, invalidating all spinners' cache lines

| Phase | TAS Lock | TTAS Lock | Ticket Lock |
|---|---|---|---|
| Acquisition attempt | O(1) | O(1) | O(1) |
| Per spin iteration | O(N) coherence msgs | O(1) local read | O(1) local read |
| During hold (total) | O(N²) msgs | O(1) total | O(1) total |
| On release | O(N) msgs + race | O(N) msgs + race | O(N) msgs, no race |
Ticket locks have a subtle scalability issue: all threads spin on the same memory location (now_serving). When the lock is released:

1. The write invalidates every waiter's cached copy of now_serving
2. All waiters re-fetch the line, but only the next ticket holder proceeds

With many waiters, this creates O(N) cache traffic per release. On systems with hundreds of cores, this limits scalability.
This is the motivation for MCS locks: Each thread spins on a separate, thread-local memory location. Lock handoff updates only the next thread's location, achieving O(1) cache traffic per release.
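To make that concrete, here is a minimal sketch of the MCS idea in C11 atomics (type and function names such as mcs_node_t are ours, not the paper's exact pseudocode): each waiter brings its own queue node and spins only on that node's flag, so a release touches exactly one other cache line.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;             // true while this waiter must keep spinning
} mcs_node_t;

typedef struct {
    _Atomic(mcs_node_t *) tail;     // last waiter in the queue (NULL = free)
} mcs_lock_t;

void mcs_lock(mcs_lock_t *lock, mcs_node_t *me) {
    atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
    atomic_store_explicit(&me->locked, true, memory_order_relaxed);

    // Append ourselves to the queue; the previous tail is our predecessor
    mcs_node_t *pred = atomic_exchange_explicit(&lock->tail, me,
                                                memory_order_acq_rel);
    if (pred != NULL) {
        // Link in, then spin on OUR OWN node -- no shared spinning
        atomic_store_explicit(&pred->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire)) {
            __builtin_ia32_pause();
        }
    }
    // pred == NULL means the lock was free; we hold it now
}

void mcs_unlock(mcs_lock_t *lock, mcs_node_t *me) {
    mcs_node_t *next = atomic_load_explicit(&me->next, memory_order_acquire);
    if (next == NULL) {
        // No visible successor: try to mark the lock free
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &lock->tail, &expected, NULL,
                memory_order_release, memory_order_relaxed)) {
            return;  // queue is empty, lock released
        }
        // A successor is enqueueing; wait until it links itself in
        while ((next = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL) {
            __builtin_ia32_pause();
        }
    }
    // Hand off: touch only the successor's node -> O(1) cache traffic
    atomic_store_explicit(&next->locked, false, memory_order_release);
}
```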
```
# Cache behavior comparison during lock handoff

TICKET LOCK (N=8 waiters):
Release: Write to now_serving
  → Invalidate 8 cache lines (one per waiter)
  → All 8 fetch new value
  → Thread 2 sees its ticket, enters critical section
  → Threads 3-8 see wrong ticket, continue spinning

Total cache traffic: 8 invalidations + 8 fetches = 16 operations

MCS LOCK (N=8 waiters):
Each thread spins on its own queue node
Release: Write to next thread's node
  → Invalidate 1 cache line (only next thread's)
  → Thread 2 sees its flag set, enters critical section
  → Threads 3-8 never see traffic (spinning locally)

Total cache traffic: 1 invalidation + 1 fetch = 2 operations

Result: MCS scales O(1), ticket scales O(N) per release
```

On very large systems (64+ cores), highly contended ticket locks show poor scalability due to the O(N) cache traffic. For such scenarios, MCS locks, qspinlocks, or hierarchical locks designed for the NUMA topology are preferred. Ticket locks remain excellent for moderate core counts and moderate contention.
The Linux kernel's spinlock implementation has evolved through several generations, each addressing limitations of the previous.
Phase 1 (early Linux): Simple TAS spinlock. Unfair, poor scalability.
Phase 2 (2.6.25+): Ticket spinlock. Fair, but O(N) cache traffic per release.
Phase 3 (4.2+): Queued spinlock (qspinlock). Combines ticket-lock fairness with MCS-lock scalability.
qspinlock uses a clever hybrid approach:
```c
// Simplified qspinlock concept (actual implementation is more complex)

#include <stdatomic.h>
#include <stdint.h>

// The lock word packs multiple fields:
// - Locked bit: is the lock held?
// - Pending bit: is there exactly one waiter (before forming queue)?
// - Tail: index to MCS queue tail for multiple waiters

typedef struct {
    union {
        atomic_uint val;
        struct {
            uint8_t  locked;    // Lock is held
            uint8_t  pending;   // One thread waiting (fast path)
            uint16_t tail;      // MCS queue tail (slow path)
        };
    };
} qspinlock_t;

void qspin_lock(qspinlock_t *lock) {
    // Fast path: lock is free, no waiters
    // Single CAS: 0 -> locked
    unsigned int expected = 0;
    if (atomic_compare_exchange_strong(&lock->val, &expected, 1)) {
        return;  // Got it immediately!
    }

    // Medium path: one waiter uses the pending bit
    // Set pending, spin on locked, then acquire
    // ...

    // Slow path: multiple waiters form an MCS queue
    // Each spins on a local queue node, not shared memory
    // ...
}
```

This hybrid design gives qspinlock several advantages:

1. Size: qspinlock fits in 4 bytes, same as ticket lock's minimum. MCS lock requires per-CPU queue nodes.
2. Uncontended performance: Single atomic CAS for acquisition when uncontended—as fast as a simple spinlock.
3. Contended scalability: Under high contention, forms an MCS queue where each thread spins on its own cache line. O(1) cache traffic per handoff.
4. NUMA awareness: Queue formation can prefer local waiters, reducing cross-node traffic.
5. Lock stealing prevention: Unlike pure MCS, prevents certain race conditions during PV (paravirtualized) operation.
| Generation | Type | Fairness | Scalability | Size |
|---|---|---|---|---|
| Early (< 2.6.25) | TAS | None | Poor | 4 bytes |
| 2.6.25 - 4.1 | Ticket | FIFO | O(N) traffic | 4 bytes |
| 4.2+ | qspinlock | FIFO | O(1) traffic | 4 bytes |
qspinlock achieves the holy grail of lock design: simple case fast (single CAS), fairness guaranteed (FIFO ordering), scalable under contention (O(1) cache traffic), and compact representation (4 bytes). The complexity is in the code, not the interface—callers just use spin_lock() and spin_unlock().
When should you use ticket locks in your own code?
1. Moderate contention, fairness required: When you need FIFO guarantees but don't expect hundreds of concurrent waiters.
```c
// Good use case: fair access to a limited resource

#include <stdio.h>

// use_resource() is assumed to be defined elsewhere in the application
void use_resource(void);

ticket_lock_t resource_lock = TICKET_LOCK_INIT;

void access_limited_resource(int thread_id) {
    // Fair ordering ensures no thread starves
    ticket_lock(&resource_lock);

    printf("Thread %d accessing resource (waited fair turn)\n", thread_id);
    use_resource();

    ticket_unlock(&resource_lock);
}

// All threads get their turn in arrival order
// No priority inversion (except from scheduler decisions)
```

2. Reader-writer hint: Ticket locks naturally support reader-writer extensions:
```c
// Ticket-based reader-writer lock (simplified concept)

#include <stdatomic.h>

typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
    atomic_uint active_readers;
} ticket_rwlock_t;

void read_lock(ticket_rwlock_t *lock) {
    unsigned int my_ticket = atomic_fetch_add(&lock->next_ticket, 1);

    // Wait for our turn
    while (atomic_load(&lock->now_serving) != my_ticket) {
        __builtin_ia32_pause();
    }

    // Increment reader count
    atomic_fetch_add(&lock->active_readers, 1);

    // Allow the next ticket holder to proceed
    // Writers will wait for active_readers == 0
    atomic_fetch_add(&lock->now_serving, 1);
}

void read_unlock(ticket_rwlock_t *lock) {
    atomic_fetch_sub(&lock->active_readers, 1);
}

void write_lock(ticket_rwlock_t *lock) {
    unsigned int my_ticket = atomic_fetch_add(&lock->next_ticket, 1);

    // Wait for our turn
    while (atomic_load(&lock->now_serving) != my_ticket) {
        __builtin_ia32_pause();
    }

    // Wait for all active readers to finish
    while (atomic_load(&lock->active_readers) > 0) {
        __builtin_ia32_pause();
    }

    // Now we hold exclusive access
    // Don't increment now_serving until write_unlock
}

void write_unlock(ticket_rwlock_t *lock) {
    atomic_fetch_add(&lock->now_serving, 1);
}
```

Most threading libraries don't expose ticket locks directly—they use adaptive mutexes. For kernel development, Linux exposes spinlock semantics that use qspinlock internally. If you need explicit fairness in userspace, consider libraries like 'folly' (Facebook's) or implement ticket locks carefully with proper atomic semantics.
Ticket locks represent a significant step forward in spinlock design, introducing fairness guarantees through an elegant ticket-counter mechanism. Let's consolidate the key insights:

- Traditional TAS/TTAS spinlocks turn every release into a race, which permits starvation and unpredictable wait times.
- A ticket lock uses two counters: a fetch-add on next_ticket assigns each arrival a unique ticket, and now_serving hands the lock off in strict FIFO order.
- Correct memory ordering is cheap: relaxed for taking a ticket, acquire on the spin load, release on the handoff.
- The remaining weakness is that all waiters spin on now_serving, costing O(N) cache traffic per release—the limitation that motivated MCS locks and Linux's qspinlock.
Module Complete:
Congratulations! You've completed Module 1: Spinlocks. You've journeyed from the fundamental concept of busy waiting through spinlock implementation, appropriate use scenarios, spin tuning, and finally to fair ticket locks.
You now have a comprehensive understanding of spinlock-based synchronization—the foundation for all other synchronization primitives in operating systems.
You've mastered spinlocks: busy waiting fundamentals, TAS and TTAS implementations, when to use spinlocks, spin timing and backoff strategies, and fairness-guaranteeing ticket locks. This knowledge prepares you for the next modules: hardware atomic operations (Test-and-Set, Compare-and-Swap) and higher-level primitives like mutexes, semaphores, and condition variables.