When Edsger Dijkstra introduced semaphores in 1965, he solved a profound problem: how can multiple processes coordinate access to shared resources without wasting computational power? The answer lies in blocking—a mechanism that allows threads to voluntarily relinquish the CPU while waiting for a condition to become true.
Blocking semaphores represent a fundamental departure from busy-waiting approaches. Instead of spinning in a tight loop consuming CPU cycles, a blocked thread is removed from the CPU's run queue entirely, allowing other work to proceed. When the awaited resource becomes available, the blocked thread is awakened and resumes execution.
This page explores the intricate machinery behind blocking semaphore implementations—the data structures, algorithms, and system-level integrations that make efficient thread coordination possible.
By the end of this page, you will understand:
- The internal architecture of blocking semaphores
- How wait queues manage suspended threads
- The atomicity requirements for correct blocking
- The integration with the scheduler and memory subsystems
- The tradeoffs between blocking and spinning implementations
A blocking semaphore consists of two fundamental components: an integer counter and a wait queue. Together, these structures enable the semaphore to track resource availability and manage the set of threads waiting for that resource.
The Integer Counter:
The semaphore's counter (traditionally denoted S or value) represents the number of available units of the protected resource. For a binary semaphore (mutex), this value alternates between 0 and 1. For counting semaphores, it ranges from 0 to the maximum resource count.
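To make the counter's role concrete, here is a minimal initialization sketch. It is not taken from any real kernel: semaphore_init is a hypothetical helper, written in terms of the semaphore structure shown in the next listing.

```c
// Hypothetical initializer (sketch only): seed the counter with the number
// of resource units, and start with an empty wait queue.
void semaphore_init(semaphore_t *sem, int initial_count) {
    sem->value = initial_count;            // Units currently available
    spin_lock_init(&sem->lock);            // Protects value and wait queue
    init_waitqueue_head(&sem->wait_list);  // No waiters yet
}

// Usage:
//   semaphore_init(&mutex_sem, 1);   // binary semaphore: value alternates 0/1
//   semaphore_init(&pool_sem, 8);    // counting semaphore: up to 8 concurrent holders
```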
The Wait Queue:
The wait queue is a data structure that holds references to threads blocked on this semaphore. When a thread attempts to decrement the counter below zero, it is enqueued in this structure and suspended. When another thread increments the counter via a signal operation, a thread is dequeued and awakened.
The choice of queue discipline—FIFO, priority-based, or LIFO—has profound implications for fairness and performance.
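As a small illustration, the discipline usually falls out of where a new waiter is linked into the list. The sketch below contrasts tail insertion (FIFO) with head insertion (LIFO) using kernel-style list primitives and the wait-queue types introduced in the next listing; the helper names are illustrative, not an established API.

```c
// Sketch only: queue discipline is determined by the insertion point.
static void enqueue_waiter_fifo(wait_queue_head_t *wq, wait_queue_entry_t *w) {
    list_add_tail(&w->entry, &wq->head);  // Tail insert: oldest waiter is woken first
}

static void enqueue_waiter_lifo(wait_queue_head_t *wq, wait_queue_entry_t *w) {
    list_add(&w->entry, &wq->head);       // Head insert: newest waiter is woken first
}
```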
```c
// Fundamental blocking semaphore structure
typedef struct semaphore {
    // Current value of the semaphore
    //   Positive: resources available
    //   Zero: no resources, but no waiters
    //   Negative (optional): absolute value = waiter count
    int value;

    // Spinlock protecting semaphore state
    // Required for atomicity on multiprocessors
    spinlock_t lock;

    // Queue of waiting threads
    // Discipline determines fairness properties
    struct wait_queue_head wait_list;

    // Debug information (optional)
#ifdef CONFIG_DEBUG_LOCK
    const char *name;
    void *owner;                // For debugging binary semaphores
    unsigned long acquired_at;
#endif
} semaphore_t;

// Wait queue entry - one per blocked thread
typedef struct wait_queue_entry {
    unsigned int flags;             // Wait type flags
    struct task_struct *private;    // The waiting thread/task
    wait_queue_func_t func;         // Wake function callback
    struct list_head entry;         // Links in the wait queue
} wait_queue_entry_t;

// Wait queue head - container for the queue
typedef struct wait_queue_head {
    spinlock_t lock;            // Protects the queue itself
    struct list_head head;      // Head of the linked list
} wait_queue_head_t;
```

Some semaphore implementations allow the counter to go negative, where |value| indicates the number of waiters. Others keep the counter at zero and track waiters separately. The Linux kernel traditionally used non-negative values, while POSIX semantics allow for either approach. The key invariant is: if value < 0, exactly |value| threads are blocked.
The wait operation (historically called P from Dutch proberen, "to try," or down in Unix terminology) is the heart of semaphore blocking semantics. This operation attempts to decrement the semaphore counter and blocks the calling thread if the counter would become negative.
The Fundamental Algorithm:
1. Acquire the semaphore's internal lock.
2. If the counter is positive, decrement it and return immediately (the fast path).
3. Otherwise, add the calling thread to the wait queue, mark it as sleeping, and release the lock while yielding the CPU.
4. When awakened, re-acquire the lock, re-check the condition, and complete the acquisition.
Critical Atomicity Requirement:
The transition from "checking the counter" to "sleeping" must be handled with extreme care. There exists a dangerous window between releasing the lock and actually sleeping. If a signal (V operation) arrives during this window, it could be lost—the signaling thread finds no one in the wait queue because our thread hasn't added itself yet, but our thread then sleeps forever.
Modern implementations solve this by linking the lock release and sleep atomically, typically via scheduler integration.
```c
// Blocking wait operation implementation
void semaphore_wait(semaphore_t *sem) {
    unsigned long flags;
    // Disable preemption and acquire internal lock
    spin_lock_irqsave(&sem->lock, flags);

    // Fast path: resource immediately available
    if (sem->value > 0) {
        sem->value--;
        spin_unlock_irqrestore(&sem->lock, flags);
        return;
    }

    // Slow path: must block
    // Decrement to indicate we're waiting (if using negative convention)
    sem->value--;

    // Prepare current thread for sleeping
    DEFINE_WAIT(wait);  // Allocates wait_queue_entry on stack

    for (;;) {
        // Add ourselves to the wait queue
        prepare_to_wait(&sem->wait_list, &wait, TASK_UNINTERRUPTIBLE);

        // Double-check: maybe signal arrived while we prepared
        if (sem->value >= 0) {
            // Resource became available
            break;
        }

        // Release lock and sleep atomically
        // This is the critical section - schedule() won't return
        // until we're woken up
        spin_unlock_irqrestore(&sem->lock, flags);
        schedule();  // Context switch to another thread
        spin_lock_irqsave(&sem->lock, flags);

        // We've been awakened - loop back to verify condition
    }

    // Clean up our wait queue entry
    finish_wait(&sem->wait_list, &wait);
    spin_unlock_irqrestore(&sem->lock, flags);
}

// Interruptible version - can be interrupted by signals
int semaphore_wait_interruptible(semaphore_t *sem) {
    unsigned long flags;

    spin_lock_irqsave(&sem->lock, flags);

    if (sem->value > 0) {
        sem->value--;
        spin_unlock_irqrestore(&sem->lock, flags);
        return 0;  // Success
    }

    sem->value--;
    DEFINE_WAIT(wait);

    for (;;) {
        prepare_to_wait(&sem->wait_list, &wait, TASK_INTERRUPTIBLE);

        if (sem->value >= 0)
            break;

        // Check for pending signals
        if (signal_pending(current)) {
            // We're being interrupted
            sem->value++;  // We're leaving, restore count
            finish_wait(&sem->wait_list, &wait);
            spin_unlock_irqrestore(&sem->lock, flags);
            return -EINTR;  // Interrupted!
        }

        spin_unlock_irqrestore(&sem->lock, flags);
        schedule();
        spin_lock_irqsave(&sem->lock, flags);
    }

    finish_wait(&sem->wait_list, &wait);
    spin_unlock_irqrestore(&sem->lock, flags);
    return 0;  // Success
}
```

| State | Description | Implications |
|---|---|---|
| TASK_RUNNING | Thread is executing or ready to execute | Consumes CPU time, on run queue |
| TASK_INTERRUPTIBLE | Thread is sleeping but can be woken by signals | Not on run queue, signal-responsive |
| TASK_UNINTERRUPTIBLE | Thread is sleeping and cannot be interrupted | Used for critical waits (disk I/O) |
| TASK_KILLABLE | Like uninterruptible but responds to fatal signals | Modern hybrid for robustness |
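The table mentions a killable variant alongside the interruptible wait shown above. The following is a hedged sketch of what such a wait could look like, assuming the same helpers (prepare_to_wait, finish_wait, schedule) and a counter that never goes negative; it reacts only to fatal signals via fatal_signal_pending().

```c
// Sketch of a "killable" wait: sleeps in TASK_KILLABLE and bails out only
// for fatal (SIGKILL-class) signals. Helper functions are the ones shown
// elsewhere on this page; this is an illustration, not a kernel API.
int semaphore_wait_killable(semaphore_t *sem) {
    unsigned long flags;
    DEFINE_WAIT(wait);

    spin_lock_irqsave(&sem->lock, flags);

    while (sem->value <= 0) {
        prepare_to_wait(&sem->wait_list, &wait, TASK_KILLABLE);

        if (fatal_signal_pending(current)) {   // Only fatal signals interrupt us
            finish_wait(&sem->wait_list, &wait);
            spin_unlock_irqrestore(&sem->lock, flags);
            return -EINTR;
        }

        spin_unlock_irqrestore(&sem->lock, flags);
        schedule();                            // Sleep until woken or killed
        spin_lock_irqsave(&sem->lock, flags);
    }

    sem->value--;                              // Claim the resource
    finish_wait(&sem->wait_list, &wait);
    spin_unlock_irqrestore(&sem->lock, flags);
    return 0;
}
```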
The signal operation (historically V from Dutch verhogen, "to raise," or up in Unix terminology) releases a unit of the protected resource by incrementing the semaphore counter. If threads are waiting, one is awakened to acquire the newly available resource.
The Fundamental Algorithm:
1. Acquire the semaphore's internal lock.
2. Increment the counter: one more unit of the resource is available.
3. If the wait queue is non-empty, dequeue one waiter and wake it.
4. Release the lock; the awakened thread re-checks the condition and proceeds.
The Wake-Up Mechanics:
Waking a thread involves several steps that coordinate with the scheduler:
1. Change the target thread's state from TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE to TASK_RUNNING
2. Remove the thread from the semaphore's wait queue (or let an auto-remove wake function do it)
3. Select a CPU and insert the thread into that CPU's run queue
4. If the woken thread outranks the task currently running there, request a reschedule

Convoys and Thundering Herds:
When multiple threads are waiting and the signal operation wakes only one thread at a time, we achieve orderly handoff of the resource. However, if the implementation wakes all waiters (as some condition variable implementations do), we get the "thundering herd" problem—many threads wake up, but only one can proceed.
```c
// Signal operation implementation
void semaphore_signal(semaphore_t *sem) {
    unsigned long flags;

    // Acquire internal lock
    spin_lock_irqsave(&sem->lock, flags);

    // Increment the counter - one more resource available
    sem->value++;

    // Wake up one waiter if any exist
    // wake_up wakes exactly one thread from the queue
    if (!list_empty(&sem->wait_list.head)) {
        // Wake the first waiter (FIFO discipline)
        wake_up(&sem->wait_list);
    }

    spin_unlock_irqrestore(&sem->lock, flags);
}

// More detailed wake implementation
void wake_up(wait_queue_head_t *wq_head) {
    unsigned long flags;
    wait_queue_entry_t *curr, *next;

    spin_lock_irqsave(&wq_head->lock, flags);

    // Iterate through waiters
    list_for_each_entry_safe(curr, next, &wq_head->head, entry) {
        // Call the wake function for this entry
        // Typically: default_wake_function
        unsigned int ret = curr->func(curr, TASK_NORMAL, 0, NULL);

        // If exclusive wake (most semaphore waiters):
        // Wake only one and stop
        if (curr->flags & WQ_FLAG_EXCLUSIVE) {
            if (ret)   // Successfully woke this task
                break;
        }
    }

    spin_unlock_irqrestore(&wq_head->lock, flags);
}

// The actual task wake function
int default_wake_function(wait_queue_entry_t *wq_entry, unsigned mode,
                          int flags, void *key) {
    struct task_struct *task = wq_entry->private;

    // Try to wake the task
    return try_to_wake_up(task, mode, flags);
}

// Core scheduler wake function (simplified)
int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) {
    unsigned long flags;
    struct rq *rq;
    int cpu, success = 0;

    raw_spin_lock_irqsave(&p->pi_lock, flags);

    // Check if task is in the expected state
    if (!(p->state & state))
        goto out;

    // Set task to running
    p->state = TASK_RUNNING;

    // Select CPU for the task (load balancing considerations)
    cpu = select_task_rq(p, wake_flags);

    // Add task to that CPU's run queue
    rq = cpu_rq(cpu);
    activate_task(rq, p, 0);

    // Check if we need to preempt current task
    check_preempt_curr(rq, p, wake_flags);

    success = 1;
out:
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    return success;
}
```

Semaphore waiters are typically marked as 'exclusive' (WQ_FLAG_EXCLUSIVE), meaning a signal wakes exactly one. This contrasts with non-exclusive waiters (used in some broadcast scenarios) where all waiters are awakened. The exclusive flag prevents the thundering herd problem and ensures O(1) wake-up cost.
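To see these operations from the caller's side, here is an illustrative bounded-buffer pairing of semaphore_wait and semaphore_signal. The buffer routines, item type, and initial counts are placeholders, not part of the implementation above.

```c
// Classic bounded-buffer sketch built on the wait/signal operations above.
// item_t, buffer_insert(), and buffer_remove() are placeholders.
typedef int item_t;

semaphore_t slots;   // counts empty slots, initialized to the buffer size
semaphore_t items;   // counts filled slots, initialized to 0
semaphore_t mutex;   // binary semaphore guarding the buffer, initialized to 1

void producer_put(item_t it) {
    semaphore_wait(&slots);    // may block until a slot frees up
    semaphore_wait(&mutex);
    buffer_insert(it);         // placeholder critical section
    semaphore_signal(&mutex);
    semaphore_signal(&items);  // wakes one blocked consumer, if any
}

item_t consumer_get(void) {
    item_t it;
    semaphore_wait(&items);    // may block until an item is produced
    semaphore_wait(&mutex);
    it = buffer_remove();      // placeholder critical section
    semaphore_signal(&mutex);
    semaphore_signal(&slots);  // wakes one blocked producer, if any
    return it;
}
```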
The wait queue is the critical data structure that manages blocked threads. Its implementation must balance several competing concerns: O(1) insertion and removal, memory efficiency, cache friendliness, and fairness properties.
Data Structure Choices:
Most implementations use a doubly-linked list for the wait queue:
- O(1) insertion at the tail and O(1) removal from any position
- No dynamic allocation: the linkage (struct list_head) is embedded directly in the wait_queue_entry structure

Alternative structures exist for specialized use cases:
- Priority-ordered lists, so the highest-priority waiter is always woken first (common in real-time locking primitives)
- Hashed wait queues shared among many synchronization objects (the approach futexes take), trading per-object memory for hash-bucket contention
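As a sketch of the priority-ordered alternative, the following inserts a new waiter ahead of any lower-priority waiter; the helper name and the priority comparison are simplifying assumptions for illustration, not an existing kernel function.

```c
// Sketch: priority-ordered enqueue as an alternative to FIFO tail insertion.
// Assumes task_struct exposes a numeric prio field (lower value = more urgent,
// following kernel convention).
static void enqueue_waiter_by_priority(wait_queue_head_t *wq,
                                       wait_queue_entry_t *new_entry) {
    wait_queue_entry_t *pos;
    struct task_struct *n = new_entry->private;

    // Walk the queue and insert before the first waiter with lower priority,
    // so the most urgent waiter always sits at the head and is woken first.
    list_for_each_entry(pos, &wq->head, entry) {
        struct task_struct *t = pos->private;
        if (n->prio < t->prio) {
            list_add_tail(&new_entry->entry, &pos->entry);  // insert before pos
            return;
        }
    }

    // Lowest priority seen so far: append at the tail.
    list_add_tail(&new_entry->entry, &wq->head);
}
```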
The Wait Queue Entry Lifecycle:
1. Allocation: usually on the waiter's stack via DEFINE_WAIT(), which sets the private field to current (the calling thread)
2. Enqueue: prepare_to_wait() links the entry into the wait queue and sets the task state
3. Sleep: the thread calls schedule() and is switched off the CPU
4. Wake-up and cleanup: after returning from schedule(), finish_wait() handles any race conditions and unlinks the entry

Memory Location Matters:
Using stack-allocated wait queue entries is elegant—no heap allocation, no memory leaks. But it requires the entry to remain valid while the thread sleeps. Since the thread's stack isn't deallocated during sleep, this works perfectly. The DEFINE_WAIT() macro handles this idiom.
```c
// Wait queue initialization and operations

// Initialize a wait queue head
void init_waitqueue_head(wait_queue_head_t *wq_head) {
    spin_lock_init(&wq_head->lock);
    INIT_LIST_HEAD(&wq_head->head);
}

// Initialize a wait queue entry
void init_waitqueue_entry(wait_queue_entry_t *wq_entry, struct task_struct *task) {
    wq_entry->flags = 0;
    wq_entry->private = task;
    wq_entry->func = default_wake_function;
    INIT_LIST_HEAD(&wq_entry->entry);
}

// Stack-based wait entry allocation
#define DEFINE_WAIT(name)                       \
    wait_queue_entry_t name = {                 \
        .private = current,                     \
        .func = autoremove_wake_function,       \
        .entry = LIST_HEAD_INIT(name.entry),    \
    }

// Prepare to wait - add to queue and set state
void prepare_to_wait(wait_queue_head_t *wq_head,
                     wait_queue_entry_t *wq_entry, int state) {
    unsigned long flags;

    wq_entry->flags &= ~WQ_FLAG_EXCLUSIVE;

    spin_lock_irqsave(&wq_head->lock, flags);

    // Only add if not already in a wait queue
    if (list_empty(&wq_entry->entry))
        __add_wait_queue(wq_head, wq_entry);   // Add to head (some impls)

    set_current_state(state);

    spin_unlock_irqrestore(&wq_head->lock, flags);
}

// Prepare for exclusive wait (semaphore style)
void prepare_to_wait_exclusive(wait_queue_head_t *wq_head,
                               wait_queue_entry_t *wq_entry, int state) {
    unsigned long flags;

    wq_entry->flags |= WQ_FLAG_EXCLUSIVE;  // Mark as exclusive

    spin_lock_irqsave(&wq_head->lock, flags);

    if (list_empty(&wq_entry->entry))
        __add_wait_queue_entry_tail(wq_head, wq_entry);  // FIFO!

    set_current_state(state);

    spin_unlock_irqrestore(&wq_head->lock, flags);
}

// Finish waiting - clean up after being woken
void finish_wait(wait_queue_head_t *wq_head, wait_queue_entry_t *wq_entry) {
    unsigned long flags;

    // Set back to running (might already be)
    __set_current_state(TASK_RUNNING);

    // Only remove if still queued (might have been removed by waker)
    if (!list_empty_careful(&wq_entry->entry)) {
        spin_lock_irqsave(&wq_head->lock, flags);
        list_del_init(&wq_entry->entry);
        spin_unlock_irqrestore(&wq_head->lock, flags);
    }
}

// Auto-remove wake function - removes entry upon waking
int autoremove_wake_function(wait_queue_entry_t *wq_entry, unsigned mode,
                             int sync, void *key) {
    int ret = default_wake_function(wq_entry, mode, sync, key);

    if (ret)
        list_del_init(&wq_entry->entry);  // Remove from queue

    return ret;
}
```

Blocking semaphores require deep integration with the operating system scheduler. When a thread blocks on a semaphore, it must be removed from the CPU's run queue. When awakened, it must be added back. This integration involves several critical coordination points.
The Context Switch Path:
When a thread calls schedule() after failing to acquire a semaphore:
1. The thread's state has already been set to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE
2. The scheduler sees the non-running state and deactivates the task, removing it from the run queue
3. pick_next_task() selects another runnable thread
4. A context switch transfers the CPU to that thread

The blocked thread now exists only in the semaphore's wait queue—invisible to the scheduler until awakened.
The Wake-Up Path:
When another thread signals the semaphore:
1. The signal path dequeues a waiter and invokes its wake function
2. try_to_wake_up() transitions the thread back to TASK_RUNNING
3. The thread is placed on a run queue, possibly on a different CPU
4. If it outranks the task currently running there, a reschedule is requested

Multi-CPU Considerations:
On SMP systems, the woken thread might run on a different CPU than the signaling thread. This introduces cache migration costs but enables parallelism. Smart wake-up heuristics try to balance:
- Cache affinity: the CPU where the thread last ran still holds its working set
- Load: an idle or lightly loaded CPU can start the thread sooner
- Decision cost: searching for the ideal CPU must itself remain cheap
```c
// Scheduler integration for blocking

// The main schedule function (simplified)
void __schedule(bool preempt) {
    struct task_struct *prev, *next;
    struct rq *rq;
    int cpu;

    cpu = smp_processor_id();
    rq = cpu_rq(cpu);
    prev = rq->curr;  // The currently running task

    // Lock the run queue
    raw_spin_lock(&rq->lock);

    // If prev is not RUNNING, it needs to be dequeued
    // This happens when a task called schedule() after blocking
    if (prev->state != TASK_RUNNING && !preempt) {
        // Remove from run queue - task is blocking
        deactivate_task(rq, prev, DEQUEUE_SLEEP);

        // For interruptible sleeps, check for signals
        if (prev->state == TASK_INTERRUPTIBLE && signal_pending(prev)) {
            prev->state = TASK_RUNNING;
            activate_task(rq, prev, ENQUEUE_HEAD);
        }
    }

    // Pick the next task to run
    next = pick_next_task(rq, prev);

    if (prev != next) {
        rq->curr = next;

        // Perform the actual context switch
        // This switches stacks, registers, and address space
        context_switch(rq, prev, next);

        // After switch, we're running as 'next'
        // prev might be running on another CPU by now!
    }

    raw_spin_unlock(&rq->lock);
}

// Wake up a specific task
int wake_up_process(struct task_struct *p) {
    return try_to_wake_up(p, TASK_NORMAL, 0);
}

// Core wake-up logic
int try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags) {
    unsigned long flags;
    struct rq *rq;
    int cpu, success = 0;

    // Grab the task's lock
    raw_spin_lock_irqsave(&p->pi_lock, flags);

    // Verify task is in expected state
    if (!(p->state & state))
        goto out;

    // Transition to running
    p->state = TASK_RUNNING;

    // Select CPU for the task
    // Consider: affinity, load, NUMA topology
    cpu = select_task_rq(p, wake_flags);

    // Get the run queue for that CPU
    rq = cpu_rq(cpu);
    raw_spin_lock(&rq->lock);

    // Add to run queue
    activate_task(rq, p, 0);

    // Check if we should preempt the current task
    if (rq->curr->prio > p->prio) {
        // Woken task has higher priority
        resched_curr(rq);  // Request reschedule
    }

    raw_spin_unlock(&rq->lock);
    success = 1;

out:
    raw_spin_unlock_irqrestore(&p->pi_lock, flags);
    return success;
}

// Select CPU for awakened task (simplified)
int select_task_rq(struct task_struct *p, int wake_flags) {
    // First choice: the CPU where task last ran (cache affinity)
    int prev_cpu = task_cpu(p);

    // If that CPU is idle, use it
    if (idle_cpu(prev_cpu))
        return prev_cpu;

    // Otherwise, find a suitable CPU
    // Consider: same NUMA node, low load, idle state
    return select_idle_sibling(prev_cpu);
}
```

The order of operations is crucial: set state BEFORE checking conditions, and BEFORE calling schedule(). If you set state to INTERRUPTIBLE after checking the semaphore value, a wake-up between the check and state change would be lost. The pattern is: (1) set_current_state(INTERRUPTIBLE); (2) if (!condition) schedule(); (3) __set_current_state(RUNNING).
The most insidious bug in blocking synchronization is the lost wake-up problem. This occurs when a signal arrives precisely as a thread is transitioning to sleep, but before it has actually blocked. The signal finds no one to wake (the thread hasn't slept yet), and the thread then sleeps forever.
The Race Condition Anatomy:
```
Thread A (waiting)                    Thread B (signaling)
─────────────────                     ────────────────────
if (sem.value <= 0) {
                                      sem.signal()
                                      // No waiters found!
    sleep()
    // Sleep forever!
}
```
Thread A checks the condition and finds it must wait. Before it can enter the sleep state, Thread B signals. Thread B scans the wait queue but finds no one—Thread A hasn't added itself yet. Thread B's signal is lost. Thread A then sleeps, unaware that the resource is now available.
The Solution: State-Before-Test Pattern
The solution relies on careful ordering and the semantics of the thread state field:
1. Enqueue yourself and mark yourself as sleeping (set the task state) before testing the condition.
2. Only then release the lock and call schedule().
3. The signaler, taking the same lock, either finds your entry in the wait queue and wakes you, or you observe the updated counter and never sleep.
This works because the state transition is atomic and visible to other CPUs (with proper memory barriers).
```c
// WRONG: Race condition leading to lost wake-up
void semaphore_wait_BROKEN(semaphore_t *sem) {
    spin_lock(&sem->lock);

    while (sem->value <= 0) {
        spin_unlock(&sem->lock);

        // DANGER ZONE: signal could arrive here!
        // We're not in the wait queue yet
        // but we're about to sleep

        schedule();  // May sleep forever if signal was sent above

        spin_lock(&sem->lock);
    }

    sem->value--;
    spin_unlock(&sem->lock);
}

// CORRECT: Atomicity via state-before-test pattern
void semaphore_wait_CORRECT(semaphore_t *sem) {
    DEFINE_WAIT(wait);

    spin_lock(&sem->lock);

    while (sem->value <= 0) {
        // Step 1: Add to wait queue AND set state to sleeping
        // This must happen atomically with respect to the signaler
        prepare_to_wait_exclusive(&sem->wait_list, &wait,
                                  TASK_UNINTERRUPTIBLE);

        // After this point, any signal() will find us and wake us

        spin_unlock(&sem->lock);

        // Step 2: Actually sleep
        // Safe now - signaler will see our state and wake us
        schedule();

        // Step 3: We've been woken - recheck condition
        spin_lock(&sem->lock);
    }

    // Step 4: Condition satisfied, clean up
    finish_wait(&sem->wait_list, &wait);
    sem->value--;

    spin_unlock(&sem->lock);
}

// The prepare_to_wait function (critical for correctness)
void prepare_to_wait_exclusive(wait_queue_head_t *wq_head,
                               wait_queue_entry_t *wq_entry, int state) {
    unsigned long flags;

    wq_entry->flags |= WQ_FLAG_EXCLUSIVE;

    spin_lock_irqsave(&wq_head->lock, flags);

    // Atomically:
    // 1. Add to wait queue
    if (list_empty(&wq_entry->entry))
        __add_wait_queue_entry_tail(wq_head, wq_entry);

    // 2. Set our state to "sleeping"
    // Uses memory barrier to ensure visibility
    set_current_state(state);

    spin_unlock_irqrestore(&wq_head->lock, flags);

    // Now signalers will:
    // - Find us in the wait queue
    // - See our state as INTERRUPTIBLE/UNINTERRUPTIBLE
    // - Call wake_up, which checks our state and wakes us
}

// Memory barriers in set_current_state
#define set_current_state(state_value)                  \
    do {                                                \
        smp_store_mb(current->state, (state_value));    \
    } while (0)

// smp_store_mb: store with full memory barrier
// Ensures the state write is visible to all CPUs
// before any subsequent memory operations
```

The essential ordering rules:
- set_current_state(TASK_INTERRUPTIBLE) before testing the semaphore value
- smp_store_mb ensures state changes are visible to other CPUs
- __set_current_state(TASK_RUNNING) after the condition is satisfied

On modern multi-core processors, memory operations can be reordered for performance. Compilers may rearrange code, and CPUs may execute loads and stores out of order. For blocking semaphores to work correctly, specific memory ordering guarantees must be enforced.
Why Ordering Matters:
Consider the wait operation:
```c
set_current_state(TASK_INTERRUPTIBLE);   // Write A
if (sem.value > 0) { ...                 // Read B
```
If the CPU reorders Read B before Write A, we might check the semaphore value before our state is visible to signaling threads. A wake-up could be lost.
Similarly, in the signal operation:
```c
sem.value++;                        // Write X
if (!list_empty(&sem.wait_list))    // Read Y
    wake_up(...);
```
If Read Y happens before Write X, a waiter might not see the incremented value and incorrectly sleep.
Memory Barrier Types:
- Full barrier (smp_mb): neither loads nor stores may be reordered across it
- Write barrier (smp_wmb): stores before the barrier become visible before stores after it
- Read barrier (smp_rmb): loads before the barrier complete before loads after it
- Acquire/release semantics: one-way barriers attached to lock acquisition and release
Barriers in Semaphore Implementation:
Spinlocks inherently provide acquire/release semantics:
- spin_lock has acquire semantics (operations inside the critical section stay inside)
- spin_unlock has release semantics (operations inside the critical section complete first)

Additionally, set_current_state() includes a full memory barrier to ensure state visibility.
```c
// Memory barrier placement in semaphore operations

// Correct wait with explicit barrier explanation
void semaphore_wait_with_barriers(semaphore_t *sem) {
    DEFINE_WAIT(wait);

    spin_lock(&sem->lock);  // ACQUIRE barrier

    while (sem->value <= 0) {
        // set_current_state includes smp_store_mb
        // which is a store + full memory barrier
        set_current_state(TASK_INTERRUPTIBLE);

        // The barrier in set_current_state ensures:
        // - Our state write is visible BEFORE...
        // - ...the signal path checks our state

        __add_wait_queue(&sem->wait_list, &wait);

        spin_unlock(&sem->lock);  // RELEASE barrier

        // Between unlock and schedule(), our state is visible
        // Any signaler will see us as INTERRUPTIBLE
        schedule();

        spin_lock(&sem->lock);  // ACQUIRE barrier
    }

    __set_current_state(TASK_RUNNING);
    finish_wait(&sem->wait_list, &wait);
    sem->value--;

    spin_unlock(&sem->lock);  // RELEASE barrier
}

// Signal with explicit barrier explanation
void semaphore_signal_with_barriers(semaphore_t *sem) {
    spin_lock(&sem->lock);  // ACQUIRE barrier

    sem->value++;

    // write barrier implicit in wake_up (via try_to_wake_up)
    // ensures value increment is visible before state check
    if (!list_empty(&sem->wait_list.head)) {
        // wake_up includes barriers in try_to_wake_up
        wake_up(&sem->wait_list);
    }

    spin_unlock(&sem->lock);  // RELEASE barrier
}

// Memory barrier macros (architecture-specific)
// x86: LOCK prefix provides similar guarantees
// ARM: DMB (Data Memory Barrier) instruction
// RISC-V: FENCE instruction

// Full memory barrier
#define smp_mb()  __asm__ __volatile__("mfence" ::: "memory")

// Write memory barrier
#define smp_wmb() __asm__ __volatile__("sfence" ::: "memory")

// Read memory barrier
#define smp_rmb() __asm__ __volatile__("lfence" ::: "memory")

// Store with memory barrier (what set_current_state uses)
#define smp_store_mb(var, value)        \
    do {                                \
        WRITE_ONCE(var, value);         \
        smp_mb();                       \
    } while (0)
```

You rarely need explicit memory barriers in semaphore code because spinlocks already provide them. spin_lock() acts as an acquire barrier and spin_unlock() acts as a release barrier. This is why the lock-based implementation 'just works' on most architectures without manual barrier insertion.
Blocking semaphores are not universally superior to spinlocks. Each approach has distinct tradeoffs that make it optimal in specific scenarios. Understanding these tradeoffs is crucial for choosing the right synchronization mechanism.
Blocking Advantages:
- A waiting thread consumes no CPU time, freeing the processor for useful work
- Scales to long waits and heavy contention without wasting cycles
- Power-friendly: idle CPUs can enter low-power states

Blocking Disadvantages:
- Each block/wake pair costs two context switches, typically microseconds of overhead
- Wake-up latency depends on the scheduler, so it is less predictable
- Cannot be used in contexts that are not allowed to sleep, such as interrupt handlers

Spinning Advantages:
- No context switch overhead; acquisition can take only a few cycles when the lock frees quickly
- Usable in interrupt context and other non-sleepable code paths

Spinning Disadvantages:
- Burns CPU cycles (and power) while waiting, and on a single CPU it steals time from the very thread that must release the lock
- Degrades badly under long waits or many contenders due to cache-line bouncing
| Factor | Prefer Blocking | Prefer Spinning |
|---|---|---|
| Expected wait time | Long waits (> context switch time) | Very short waits (microseconds) |
| Critical section length | Any length | Must be very short |
| Number of contenders | Many threads competing | Few threads, short contention |
| CPU count | Single CPU (spinning wastes sole CPU) | Multi-CPU (spinners can run) |
| Interrupt context | Not allowed (can't sleep) | Required (only option) |
| Real-time requirements | Soft real-time (deterministic wake) | Hard real-time (no scheduler delay) |
| Power constraints | Mobile/embedded (save power) | Servers (optimize latency) |
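For contrast with the blocking code above, here is a minimal, generic test-and-set spinlock built on C11 atomics. It is an illustration of the spinning column in the table, not the kernel's actual spinlock implementation.

```c
#include <stdatomic.h>

// Minimal test-and-set spinlock (illustrative, user-space C11).
typedef struct {
    atomic_flag locked;   // initialize with ATOMIC_FLAG_INIT
} my_spinlock_t;

static void my_spin_lock(my_spinlock_t *l) {
    // test_and_set with acquire ordering returns the previous value;
    // loop until we observe "was clear", i.e. we took the lock.
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        // Busy-wait: burn cycles until the holder releases the lock.
        // A production version would add a pause/yield hint here.
    }
}

static void my_spin_unlock(my_spinlock_t *l) {
    // Release ordering: writes inside the critical section become
    // visible before the lock appears free.
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

// Usage:
//   my_spinlock_t lock = { ATOMIC_FLAG_INIT };
//   my_spin_lock(&lock);  /* very short critical section */  my_spin_unlock(&lock);
```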
The Hybrid Approach:
Modern implementations often use a hybrid strategy: spin briefly, then block. The intuition is that if the lock becomes available quickly, spinning avoids context switch overhead. If the wait extends, blocking saves CPU.
The Linux kernel's mutex implementation exemplifies this: it spins while the owner is running (since the owner might release soon), but blocks if the owner is not running (no point spinning when the owner can't make progress).
Adaptive mutexes spin while the lock owner is running on another CPU (optimistic: owner will finish soon), but immediately block if the owner is sleeping or preempted (pessimistic: spinning won't help). This requires tracking which CPU the owner is running on—adding complexity but improving performance in common cases.
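A sketch of such an adaptive acquire path is shown below. owner_is_running(), try_acquire(), and the spin bound are hypothetical placeholders; the blocking fallback reuses the semaphore_wait operation from earlier on this page.

```c
// Sketch of a hybrid (spin-then-block) acquire, under stated assumptions:
// owner_is_running() and try_acquire() are hypothetical helpers, and
// MAX_SPINS is an arbitrary bound for illustration.
#define MAX_SPINS 100

void hybrid_acquire(semaphore_t *sem) {
    int spins = 0;

    // Optimistic phase: the owner is on another CPU and may release soon.
    while (spins++ < MAX_SPINS && owner_is_running(sem)) {
        if (try_acquire(sem))   // hypothetical non-blocking attempt
            return;
        cpu_relax();            // pause hint to the CPU while spinning
    }

    // Pessimistic phase: the owner is sleeping or preempted, or the wait
    // is dragging on, so spinning cannot help. Block instead.
    semaphore_wait(sem);
}
```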
We've dissected the internals of blocking semaphore implementations—the data structures, algorithms, and system integrations that enable efficient thread coordination. Let's consolidate the essential concepts:
- A blocking semaphore pairs an integer counter with a wait queue, both protected by an internal spinlock.
- Wait (P/down) decrements the counter and, when no resources are available, enqueues the caller and puts it to sleep; signal (V/up) increments the counter and wakes one exclusive waiter.
- The state-before-test pattern, enforced with memory barriers, closes the lost wake-up window between checking the condition and sleeping.
- Blocking integrates tightly with the scheduler: sleeping removes a thread from the run queue, and waking re-activates it, possibly on a different CPU.
- Blocking trades context-switch overhead for zero CPU consumption while waiting; hybrid spin-then-block designs combine the strengths of both.
What's Next:
Now that we understand blocking implementations, we'll explore the complementary approach: spinlock-based semaphores. These implementations avoid kernel involvement by busy-waiting, trading CPU efficiency for lower latency in specific scenarios.
The interplay between blocking and spinning forms the foundation of modern synchronization—and understanding both enables informed choices in real system design.
You now understand how blocking semaphores are implemented at the systems level—from wait queue data structures through scheduler integration to memory ordering requirements. This knowledge forms the foundation for understanding all blocking synchronization primitives in modern operating systems.