Imagine a fire alarm going off in a building. Everyone evacuates—hundreds of people rush to the exits simultaneously. But what if the alarm was triggered for a small kitchen fire that only needs one person with an extinguisher? Now you have 500 people clogging the hallways, fighting to get through doors, only to realize that just one person needed to act.

This is the thundering herd problem in operating systems: a broadcast wakes many threads, they all rush to acquire the mutex, but only one (or a few) can actually proceed. The rest waste CPU cycles, cause cache thrashing, and increase latency—all for nothing.

The thundering herd is one of the most common performance pathologies in concurrent systems, and understanding it is essential for building high-performance applications.
By the end of this page, you will understand exactly why the thundering herd problem occurs, how to measure its impact, how to recognize it in your systems, and how to apply mitigations that reduce its severity while maintaining correctness.
The thundering herd problem occurs when:

1. Multiple threads are waiting on a condition variable
2. A broadcast wakes them all simultaneously
3. Only one (or a few) can proceed after reacquiring the mutex
4. The rest go back to sleep after wasted computation

Let's trace through the mechanics step by step:
Stage 1: Waiting State

```
INITIAL STATE

Condition Variable CV
  Wait queue: [T1] → [T2] → [T3] → [T4] → [T5] → [T6]
  All 6 threads sleeping, waiting for work

Mutex M: held by T0 (the thread about to broadcast)

Shared state:
  work_queue.size = 0   (T0 is about to add 1 work item)
```
Stage 2: Broadcast Occurs

```
AFTER BROADCAST (before unlock)

Condition Variable CV
  Wait queue: [EMPTY]

Mutex M wait queue: [T1] → [T2] → [T3] → [T4] → [T5] → [T6]

All 6 threads moved from the CV queue to the mutex queue.
All are now runnable, but blocked on the mutex.

work_queue.size = 1   (only ONE work item!)
```
Stage 3: The Stampede

T0 releases the mutex...

| Time | Event | Result |
|---|---|---|
| t1 | T1 acquires mutex, sees 1 item, dequeues it, releases mutex | T1 PROCEEDS (queue.size = 0) |
| t2 | T2 acquires mutex, checks queue: EMPTY, goes back to wait() | WASTED WORK |
| t3 | T3 acquires mutex, checks queue: EMPTY, goes back to wait() | WASTED WORK |
| t4 | T4 acquires mutex, checks queue: EMPTY, goes back to wait() | WASTED WORK |
| t5 | T5 acquires mutex, checks queue: EMPTY, goes back to wait() | WASTED WORK |
| t6 | T6 acquires mutex, checks queue: EMPTY, goes back to wait() | WASTED WORK |

RESULT: 1 work item processed, 5 threads woke for nothing.
We woke 6 threads to handle 1 work item. 5 threads performed context switches, acquired the mutex, checked the condition, released the mutex, and went back to sleep—all for nothing. This is pure overhead that scales with the number of waiters.
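In code, this scenario typically comes from a worker-pool pattern like the sketch below (identifiers such as `work_queue_t`, `queue_enqueue`, and `process_work` are illustrative assumptions, not from the original text): every submission broadcasts, even though a single item can satisfy only one worker.

```c
// Sketch of the pattern that produces the stampede above: a broadcast is
// issued for every single work item. queue_t helpers and process_work()
// are assumed to exist elsewhere.
void submit_work(work_queue_t *wq, work_t *work) {
    pthread_mutex_lock(&wq->mutex);
    queue_enqueue(wq->queue, work);
    pthread_cond_broadcast(&wq->cond);   // wakes EVERY waiting worker
    pthread_mutex_unlock(&wq->mutex);
}

void *worker_thread(void *arg) {
    work_queue_t *wq = (work_queue_t *)arg;
    while (1) {
        pthread_mutex_lock(&wq->mutex);
        while (queue_empty(wq->queue)) {
            // All N workers sleep here; the broadcast moves all of them
            // to the mutex queue, but only one will find an item.
            pthread_cond_wait(&wq->cond, &wq->mutex);
        }
        work_t *work = queue_dequeue(wq->queue);
        pthread_mutex_unlock(&wq->mutex);
        process_work(work);
    }
}
```

The fixes discussed later on this page change only the submission side; the worker loop itself can stay unchanged.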
The thundering herd imposes multiple types of overhead. Understanding each helps you assess the severity in your specific situation.
Quantitative Analysis:

Let's model the overhead mathematically:

```
Given:
  N    = number of waiting threads
  W    = number of work items added (typically W << N for thundering herd)
  Cs   = context switch time (~5μs)
  Ma   = mutex acquire time (~0.2μs uncontended, 1-10μs contended)
  Cond = condition check time (~0.1μs)

Ideal case (with signal):
  Total overhead = W × Cs + W × Ma + W × Cond
  For W = 1: ~5.3μs

Thundering herd (with broadcast):
  Total overhead = N × Cs + N × Ma(contended) + N × Cond
  For N = 100, W = 1: 100 × 5 + 100 × 5 + 100 × 0.1 = 1010μs ≈ 1ms

  Overhead ratio ≈ N/W = 100× worse
```

For a system handling 1000 events per second, with 100 waiters experiencing a thundering herd on each event, you're adding roughly 1 second of overhead per second of real work. At that point, the system cannot keep up.
| Waiters (N) | Work Items (W) | Wasted Wakeups | Est. Overhead (~5μs per wasted wakeup) | Throughput Impact |
|---|---|---|---|---|
| 10 | 1 | 9 | ~50μs | Minimal (~5%) |
| 50 | 1 | 49 | ~250μs | Noticeable (~20%) |
| 100 | 1 | 99 | ~500μs | Significant (~40%) |
| 500 | 1 | 499 | ~2.5ms | Severe (~80%) |
| 1000 | 1 | 999 | ~5ms | System degradation |
| 10 | 5 | 5 | ~25μs | Minimal |
| 100 | 50 | 50 | ~250μs | Moderate (~20%) |
Thundering herd overhead is O(N) where N is the number of waiters. In highly concurrent systems with many workers, this overhead can dominate actual processing time. A system designed for 64 cores might spend more time managing thundering herds than doing useful work.
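Rather than trusting the model, you can measure the effect directly. The following is a minimal benchmark sketch (not from the original material; `WORKERS`, `ITEMS`, and `use_broadcast` are arbitrary illustrative values): one producer adds one item at a time, and each worker counts wakeups that found no work.

```c
// Minimal sketch: count wasted wakeups when a single producer adds one
// item at a time and either broadcasts or signals. Flip `use_broadcast`
// to compare the two strategies.
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define WORKERS 16
#define ITEMS   100000

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int  pending = 0;      /* work items waiting to be consumed       */
static bool done    = false;
static long wakeups = 0;      /* times a worker returned from cond_wait  */
static long wasted  = 0;      /* wakeups that found no work              */
static const bool use_broadcast = true;

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    for (;;) {
        while (pending == 0 && !done) {
            pthread_cond_wait(&cv, &m);   /* returns with mutex held */
            wakeups++;
            if (pending == 0 && !done)
                wasted++;                 /* woke up for nothing */
        }
        if (pending == 0 && done)
            break;
        pending--;                        /* "process" one item */
    }
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t[WORKERS];
    for (int i = 0; i < WORKERS; i++)
        pthread_create(&t[i], NULL, worker, NULL);

    for (int i = 0; i < ITEMS; i++) {
        pthread_mutex_lock(&m);
        pending++;                                        /* add ONE item */
        if (use_broadcast) pthread_cond_broadcast(&cv);   /* wakes ALL waiters */
        else               pthread_cond_signal(&cv);      /* wakes ONE waiter  */
        pthread_mutex_unlock(&m);
    }

    pthread_mutex_lock(&m);
    done = true;
    pthread_cond_broadcast(&cv);   /* shutdown: every worker must wake */
    pthread_mutex_unlock(&m);
    for (int i = 0; i < WORKERS; i++)
        pthread_join(t[i], NULL);

    printf("wakeups=%ld wasted=%ld (%.1f%%)\n",
           wakeups, wasted, wakeups ? 100.0 * wasted / wakeups : 0.0);
    return 0;
}
```

On a typical multi-core machine, the broadcast variant should report a substantially higher wasted-wakeup percentage than the signal variant; exact numbers depend on the scheduler, core count, and timing.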
The thundering herd problem appears in many real-world systems, often with dramatic effects on performance.
The Classic Example: Multiple Processes Calling accept()

Historically, web servers like Apache used a 'prefork' model: multiple processes all waiting on accept() for the same listening socket.

The Problem:
- 100 Apache child processes call accept() on the same socket
- A client connection arrives
- The kernel wakes ALL 100 processes
- All race to call accept(), but only ONE can get the connection
- 99 processes woke up for nothing

The Impact:
- Under high connection rates, the server spent more time handling wakeups than serving requests
- CPU utilization spiked to 100% even with low actual load
- Known as the 'accept() thundering herd' problem

Historical Solution:
- Linux added SO_REUSEPORT to allow separate listen queues per process (sketched below)
- Added accept() serialization options
- Modern servers use epoll/kqueue with edge-triggered notifications
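As a rough illustration of the SO_REUSEPORT approach (a sketch, not any real server's code; `make_reuseport_listener`, the port number, and `handle()` are illustrative assumptions): each worker process or thread creates its own listening socket bound to the same port, so the kernel distributes connections instead of waking every waiter.

```c
// Sketch: per-worker listening sockets sharing one port via SO_REUSEPORT.
// Error handling is abbreviated.
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int make_reuseport_listener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    // Each worker sets SO_REUSEPORT and binds its own socket to the same
    // address/port; the kernel then load-balances incoming connections
    // across the sockets, so only one worker is woken per connection.
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 128);
    return fd;
}

// Each worker then runs its own accept() loop on its private listener:
//   int listen_fd = make_reuseport_listener(8080);
//   for (;;) { int c = accept(listen_fd, NULL, NULL); handle(c); close(c); }
```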
Identifying thundering herd problems requires understanding the symptoms and using appropriate diagnostic tools.
A telltale symptom: vmstat shows thousands of context switches per second even when useful throughput is low.

Diagnostic Tools and Techniques:
```bash
#!/bin/bash
# Diagnosing thundering herd problems

# 1. Check context switch rate
vmstat 1
# Look at 'cs' column - high values suggest excessive wakeups
# Normal: hundreds per second
# Problem: tens of thousands per second

# 2. Check mutex contention with perf (Linux)
perf record -g -p <pid> -- sleep 10
perf report
# Look for __lll_lock_wait, pthread_cond_wait, futex in hot paths

# 3. Check for specific condition variable issues
# Use eBPF/bpftrace to trace pthread_cond_broadcast calls
sudo bpftrace -e 'uprobe:/lib/x86_64-linux-gnu/libpthread.so.0:pthread_cond_broadcast {
  @broadcasts[comm] = count();
}
interval:s:5 { print(@broadcasts); clear(@broadcasts); }'

# 4. Trace futex operations (underlying syscall for condvars)
sudo perf trace -e futex -p <pid> 2>&1 | head -100
# High FUTEX_WAKE with large 'val' indicates broadcast

# 5. Check cache behavior
perf stat -e cache-references,cache-misses,context-switches -p <pid> sleep 10

# 6. Application-level instrumentation
# Add counters to track:
# - Number of broadcast() calls
# - Number of threads woken per broadcast
# - Number of threads that re-waited immediately
```
```c
// Application-level instrumentation to detect thundering herd

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t cond;

    // Instrumentation
    atomic_int  waiters;           // Current number of waiters
    atomic_long total_broadcasts;  // Total broadcast calls
    atomic_long total_wakeups;     // Total threads woken
    atomic_long wasted_wakeups;    // Threads that re-waited immediately
} instrumented_cv_t;

void cv_wait(instrumented_cv_t *cv) {
    atomic_fetch_add(&cv->waiters, 1);
    pthread_cond_wait(&cv->cond, &cv->mutex);
    atomic_fetch_sub(&cv->waiters, 1);
}

void cv_broadcast(instrumented_cv_t *cv) {
    int current_waiters = atomic_load(&cv->waiters);
    atomic_fetch_add(&cv->total_broadcasts, 1);
    atomic_fetch_add(&cv->total_wakeups, current_waiters);
    pthread_cond_broadcast(&cv->cond);
}

void cv_record_rewait(instrumented_cv_t *cv) {
    // Call this when a thread wakes but finds the condition still false
    atomic_fetch_add(&cv->wasted_wakeups, 1);
}

void cv_print_stats(instrumented_cv_t *cv) {
    long broadcasts = atomic_load(&cv->total_broadcasts);
    long wakeups = atomic_load(&cv->total_wakeups);
    long wasted = atomic_load(&cv->wasted_wakeups);

    printf("Condition Variable Stats:\n");
    printf("  Total broadcasts: %ld\n", broadcasts);
    printf("  Total wakeups: %ld\n", wakeups);
    printf("  Wasted wakeups: %ld (%.1f%%)\n", wasted,
           wakeups > 0 ? 100.0 * wasted / wakeups : 0.0);
    printf("  Avg wakeups per broadcast: %.1f\n",
           broadcasts > 0 ? (double)wakeups / broadcasts : 0.0);

    // Thundering herd indicator
    if (wasted > wakeups / 2) {
        printf("  WARNING: High thundering herd ratio detected!\n");
    }
}

// Usage in a worker thread:
void worker(instrumented_cv_t *cv, queue_t *queue) {
    pthread_mutex_lock(&cv->mutex);
    while (queue_empty(queue)) {
        cv_wait(cv);
        if (queue_empty(queue)) {
            cv_record_rewait(cv);  // Woke but found nothing
        }
    }
    // Process work...
    pthread_mutex_unlock(&cv->mutex);
}
```

There are several strategies to mitigate the thundering herd problem, ranging from simple code changes to architectural redesigns.
Strategy 1: Replace Broadcast with Signal

The most direct solution, when applicable, is simply to use signal instead of broadcast:
```c
// BEFORE: Thundering herd
void submit_work(work_queue_t *wq, work_t *work) {
    pthread_mutex_lock(&wq->mutex);
    queue_enqueue(wq->queue, work);
    pthread_cond_broadcast(&wq->cond);   // Wakes ALL workers
    pthread_mutex_unlock(&wq->mutex);
}

// AFTER: No thundering herd
void submit_work(work_queue_t *wq, work_t *work) {
    pthread_mutex_lock(&wq->mutex);
    queue_enqueue(wq->queue, work);
    pthread_cond_signal(&wq->cond);      // Wakes ONE worker
    pthread_mutex_unlock(&wq->mutex);
}

// For batch submission:
void submit_work_batch(work_queue_t *wq, work_t **items, int count) {
    pthread_mutex_lock(&wq->mutex);
    for (int i = 0; i < count; i++) {
        queue_enqueue(wq->queue, items[i]);
    }
    // Wake up to 'count' threads, not ALL threads.
    // Note: repeated signal is not ideal (extra signals are lost if fewer
    // than 'count' threads are waiting), but it is still far better than
    // broadcast when count << number of waiters.
    for (int i = 0; i < count; i++) {
        pthread_cond_signal(&wq->cond);
    }
    pthread_mutex_unlock(&wq->mutex);
}
```

Strategy 2: Use Semaphores for Counted Resources

When you have N resources and want to wake exactly N threads, semaphores are the natural choice:
```c
#include <semaphore.h>

// Using a semaphore to wake exactly the right number of threads

typedef struct {
    pthread_mutex_t mutex;
    sem_t work_available;   // Semaphore instead of condition variable
    queue_t *queue;
} work_queue_t;

// Submit work - wake exactly one thread
void submit_work(work_queue_t *wq, work_t *work) {
    pthread_mutex_lock(&wq->mutex);
    queue_enqueue(wq->queue, work);
    pthread_mutex_unlock(&wq->mutex);

    sem_post(&wq->work_available);   // Increment semaphore by 1
                                     // Wakes exactly ONE waiting thread
}

// Submit batch - wake exactly 'count' threads
void submit_work_batch(work_queue_t *wq, work_t **items, int count) {
    pthread_mutex_lock(&wq->mutex);
    for (int i = 0; i < count; i++) {
        queue_enqueue(wq->queue, items[i]);
    }
    pthread_mutex_unlock(&wq->mutex);

    // Post the semaphore 'count' times
    for (int i = 0; i < count; i++) {
        sem_post(&wq->work_available);   // Wakes one thread per post
    }
    // Exactly 'count' threads will wake (or fewer if fewer are waiting)
}

// Worker thread
void *worker(void *arg) {
    work_queue_t *wq = (work_queue_t *)arg;

    while (1) {
        // Wait on the semaphore - blocks until work is available
        sem_wait(&wq->work_available);

        // Now we KNOW there's work for us
        pthread_mutex_lock(&wq->mutex);
        work_t *work = queue_dequeue(wq->queue);
        pthread_mutex_unlock(&wq->mutex);

        if (work == NULL) continue;   // Shutdown/spurious

        process_work(work);
    }
}
```

Strategy 3: Leader-Follower Pattern

This architectural pattern avoids the thundering herd by having only one thread actively waiting for events:
```c
// Leader-Follower pattern to avoid thundering herd

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t become_leader;
    int current_leader;     // Thread ID of current leader (or -1 if none)
    bool has_leader;
    queue_t *event_queue;
} leader_follower_t;

void promote_new_leader(leader_follower_t *lf) {
    pthread_mutex_lock(&lf->mutex);
    lf->has_leader = false;
    // Signal wakes exactly ONE follower to become the new leader
    pthread_cond_signal(&lf->become_leader);
    pthread_mutex_unlock(&lf->mutex);
}

void *worker(void *arg) {
    leader_follower_t *lf = (leader_follower_t *)arg;
    int my_id = get_thread_id();

    while (1) {
        pthread_mutex_lock(&lf->mutex);

        // Wait to become leader (followers wait here)
        while (lf->has_leader) {
            pthread_cond_wait(&lf->become_leader, &lf->mutex);
        }

        // I am now the leader
        lf->has_leader = true;
        lf->current_leader = my_id;
        pthread_mutex_unlock(&lf->mutex);

        // As leader, wait for events (only ONE thread waits on events)
        event_t *event = wait_for_event(lf->event_queue);

        // Got an event - promote the next leader before processing
        promote_new_leader(lf);

        // Process the event (other threads continue in parallel)
        process_event(event);

        // After processing, become a follower again
        // (loop to wait for the leader role)
    }
}

// Advantages:
// - Only ONE thread waits for external events (no thundering herd)
// - Other threads can work on processing in parallel
// - Signal is always sufficient (never need broadcast)
// - Fairness: followers take turns becoming leader

// Disadvantages:
// - More complex pattern to implement correctly
// - Leadership handoff adds latency
// - Not suitable for all workloads
```

Operating system kernels have implemented various mechanisms to mitigate the thundering herd at the system level.
Linux EPOLLEXCLUSIVE Flag

Added in Linux 4.5, this flag ensures that only one thread is woken when an event occurs on a file descriptor monitored by multiple epoll instances.
```c
#include <sys/epoll.h>

// Without EPOLLEXCLUSIVE:
//   Multiple epoll instances watching the same FD -> thundering herd
//
// With EPOLLEXCLUSIVE:
//   Only one waiter is woken per event

int setup_exclusive_listener(int listen_fd) {
    // Each worker thread calls this to create its OWN epoll instance
    // and attach the shared listening socket with EPOLLEXCLUSIVE.
    int epfd = epoll_create1(0);

    struct epoll_event ev;
    ev.events = EPOLLIN | EPOLLEXCLUSIVE;   // The magic flag
    ev.data.fd = listen_fd;

    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    // Each thread now waits on its own epfd, but the kernel wakes only
    // one of them per incoming connection instead of all of them.
    return epfd;
}
```

When building network servers, prefer modern kernel features like EPOLLEXCLUSIVE and SO_REUSEPORT over legacy patterns. These provide thundering-herd mitigation at the kernel level, which is more efficient than application-level solutions.
Despite its overhead, thundering herd is sometimes the correct or acceptable choice. Understanding when helps you avoid over-engineering.
Remember: broadcast is always correct, signal sometimes isn't. If you're unsure whether signal is safe, use broadcast and measure the impact. Premature optimization with signal can introduce deadlocks that are much harder to debug than performance problems.
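A classic case where signal is not safe is a bounded buffer whose producers and consumers share one condition variable but wait on different predicates. The sketch below (illustrative names, not from the original text) shows why: a single signal may wake a thread of the wrong kind, which re-checks its predicate and goes back to sleep while the thread that could have made progress stays asleep.

```c
// Sketch: one condition variable shared by two different predicates
// ("not full" and "not empty"). With pthread_cond_signal, a wakeup can
// land on a thread of the wrong class and progress can stall;
// pthread_cond_broadcast keeps this correct.
#include <pthread.h>

#define CAP 8

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  changed;   // shared by producers and consumers
    int buf[CAP];
    int count;
} bounded_buf_t;

void put(bounded_buf_t *b, int v) {
    pthread_mutex_lock(&b->mutex);
    while (b->count == CAP)                   // predicate: "not full"
        pthread_cond_wait(&b->changed, &b->mutex);
    b->buf[b->count++] = v;
    pthread_cond_broadcast(&b->changed);      // signal might wake the wrong class
    pthread_mutex_unlock(&b->mutex);
}

int get(bounded_buf_t *b) {
    pthread_mutex_lock(&b->mutex);
    while (b->count == 0)                     // predicate: "not empty"
        pthread_cond_wait(&b->changed, &b->mutex);
    int v = b->buf[--b->count];               // LIFO for brevity
    pthread_cond_broadcast(&b->changed);      // broadcast keeps this correct
    pthread_mutex_unlock(&b->mutex);
    return v;
}
```

If the waiters can be split onto two condition variables (one for "not full", one for "not empty"), signal becomes safe again and the herd disappears; signal is only safe when every waiter is waiting on the same predicate.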
Next Steps:

Now that you understand the thundering herd problem and how to mitigate it, the next page explores performance considerations in more depth—including how to analyze tradeoffs and make data-driven decisions about broadcast usage.
You now understand the thundering herd problem, its causes, its costs, how to detect it, and how to mitigate it. You can make informed decisions about when to accept the overhead and when to apply optimizations. Next, we'll dive deeper into performance analysis and measurement.