We've established that broadcast is always correct, while signal requires careful analysis. But in production systems, correctness is only the starting point: performance determines whether your system can actually handle the required load.

This page takes a deep dive into the performance characteristics of broadcast: how to measure and analyze its costs, and how to make data-driven decisions about optimization. You'll learn to quantify the tradeoffs so you can make informed choices rather than guessing.
By the end of this page, you will understand how to measure broadcast performance, identify bottlenecks, compute the cost-benefit ratio of optimizations, and apply systematic performance analysis to your concurrent systems.
To reason about performance, we need a model that captures the key costs involved in broadcast operations. Let's build this model from first principles.
Fundamental Cost Components:

```
BROADCAST COST MODEL

Total Cost = C_broadcast + C_wakeups + C_mutex + C_recheck

Where:

  C_broadcast = Cost of the broadcast() call itself
              = Time to iterate the wait queue + kernel involvement
              ≈ O(N), where N = number of waiters

  C_wakeups   = Cost of N context switches (sleeping → runnable)
              = N × T_context_switch
              ≈ N × 2-10 μs (varies by CPU, OS, load)

  C_mutex     = Cost of N threads competing for the mutex
              = Serial acquisition time + contention overhead
              ≈ N × (T_acquire + T_release) × contention_factor

  C_recheck   = Cost of N threads re-checking their condition
              = N × T_predicate_evaluation

Useful Work = W × (T_acquire + T_work + T_release)

Efficiency  = Useful Work / (Useful Work + Total Cost)
```
| Operation | Uncontended | Contended | Notes |
|---|---|---|---|
| Context switch | 2-5 μs | 5-15 μs | Higher under load |
| Mutex lock (uncontended) | 15-30 ns | N/A | Atomic operation |
| Mutex lock (contended) | N/A | 1-20 μs | Involves futex syscall |
| Condition variable wait | 200-500 ns | 1-5 μs | Depends on impl |
| Broadcast call (N waiters) | 100 + 50×N ns | Varies | Linear in N |
| Cache line reload (L3 miss) | 40-80 ns | 100+ ns | Critical for contention |
Under contention, every operation becomes more expensive due to cache coherency traffic, scheduler overhead, and waiting time. A 10× increase in cost under heavy contention is common. Always measure under realistic load conditions.
```c
// Example: Calculating broadcast overhead for a work queue
#include <stdio.h>

int main(void) {
    // Parameters
    int N = 64;            // Number of worker threads waiting
    int W = 1;             // Work items available (typical case)
    double T_cs = 5.0;     // Context switch time (microseconds)
    double T_mutex = 2.0;  // Contended mutex acquire (microseconds)
    double T_check = 0.1;  // Condition check (microseconds)
    double T_work = 100.0; // Actual work time (microseconds)

    // Cost calculation
    double C_wakeups = N * T_cs;     // 64 × 5   = 320 μs
    double C_mutex   = N * T_mutex;  // 64 × 2   = 128 μs
    double C_recheck = N * T_check;  // 64 × 0.1 = 6.4 μs
    double total_overhead = C_wakeups + C_mutex + C_recheck;  // 454.4 μs

    double useful_work = W * T_work; // 1 × 100 = 100 μs

    double efficiency = useful_work / (useful_work + total_overhead);
    // efficiency = 100 / 554.4 = 18%

    printf("Broadcast overhead: %.1f μs\n", total_overhead);
    printf("Useful work: %.1f μs\n", useful_work);
    printf("Efficiency: %.1f%%\n", efficiency * 100);
    printf("Overhead ratio: %.1fx\n", total_overhead / useful_work);
    return 0;
}

// OUTPUT:
// Broadcast overhead: 454.4 μs
// Useful work: 100.0 μs
// Efficiency: 18.0%
// Overhead ratio: 4.5x
//
// This means 82% of time is wasted on the thundering herd!
```

Theoretical models are useful for understanding, but real performance requires measurement. Here's how to measure broadcast behavior in your systems.
Key Metrics to Measure:

1. Broadcast frequency: How often is broadcast() called?
2. Waiter count per broadcast: How many threads are woken each time?
3. Proceed ratio: What fraction of woken threads actually proceed?
4. Mutex contention time: How long do threads wait for the mutex?
5. Overall latency: End-to-end time from event to processing complete
6. Throughput: Events processed per second under load
```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

// High-resolution timing
static inline uint64_t get_nanos(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

// Performance-instrumented condition variable
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t cond;

    // Counters
    atomic_ulong broadcasts;
    atomic_ulong total_waiters_woken;
    atomic_ulong total_proceeded;
    atomic_ulong total_rewaited;

    // Timing accumulators (nanoseconds)
    atomic_ulong total_wait_time;
    atomic_ulong total_mutex_wait_time;

    // Current state
    atomic_int current_waiters;
} perf_cv_t;

void perf_cv_init(perf_cv_t *cv) {
    pthread_mutex_init(&cv->mutex, NULL);
    pthread_cond_init(&cv->cond, NULL);
    atomic_store(&cv->broadcasts, 0);
    atomic_store(&cv->total_waiters_woken, 0);
    atomic_store(&cv->total_proceeded, 0);
    atomic_store(&cv->total_rewaited, 0);
    atomic_store(&cv->total_wait_time, 0);
    atomic_store(&cv->total_mutex_wait_time, 0);
    atomic_store(&cv->current_waiters, 0);
}

// Enhanced wait that measures time (caller must hold cv->mutex)
void perf_cv_wait(perf_cv_t *cv, bool *condition_was_false) {
    uint64_t start = get_nanos();

    atomic_fetch_add(&cv->current_waiters, 1);
    pthread_cond_wait(&cv->cond, &cv->mutex);
    atomic_fetch_sub(&cv->current_waiters, 1);

    uint64_t end = get_nanos();
    atomic_fetch_add(&cv->total_wait_time, end - start);

    *condition_was_false = true;  // Caller sets to false if proceeding
}

// Enhanced broadcast that records stats
void perf_cv_broadcast(perf_cv_t *cv) {
    int waiters = atomic_load(&cv->current_waiters);
    atomic_fetch_add(&cv->broadcasts, 1);
    atomic_fetch_add(&cv->total_waiters_woken, waiters);
    pthread_cond_broadcast(&cv->cond);
}

void perf_cv_record_proceed(perf_cv_t *cv) {
    atomic_fetch_add(&cv->total_proceeded, 1);
}

void perf_cv_record_rewait(perf_cv_t *cv) {
    atomic_fetch_add(&cv->total_rewaited, 1);
}

// Print statistics
void perf_cv_report(perf_cv_t *cv) {
    unsigned long broadcasts = atomic_load(&cv->broadcasts);
    unsigned long woken      = atomic_load(&cv->total_waiters_woken);
    unsigned long proceeded  = atomic_load(&cv->total_proceeded);
    unsigned long rewaited   = atomic_load(&cv->total_rewaited);
    unsigned long wait_time  = atomic_load(&cv->total_wait_time);

    printf("\n=== Condition Variable Performance Report ===\n");
    printf("Total broadcasts:       %lu\n", broadcasts);
    printf("Total threads woken:    %lu\n", woken);
    printf("Threads that proceeded: %lu\n", proceeded);
    printf("Threads that re-waited: %lu\n", rewaited);

    if (broadcasts > 0) {
        printf("\nPer-broadcast metrics:\n");
        printf("  Avg waiters per broadcast: %.2f\n",
               (double)woken / broadcasts);
        printf("  Proceed ratio: %.1f%%\n",
               woken > 0 ? 100.0 * proceeded / woken : 0);
        printf("  Waste ratio:   %.1f%%\n",
               woken > 0 ? 100.0 * rewaited / woken : 0);
    }

    if (woken > 0) {
        printf("\nTiming:\n");
        printf("  Total wait time:     %.3f ms\n", wait_time / 1e6);
        printf("  Avg wait per thread: %.3f ms\n", (wait_time / 1e6) / woken);
    }
    printf("=============================================\n");
}
```

Using perf and eBPF for System-Level Measurement:
```bash
#!/bin/bash
# System-level broadcast performance analysis

# 1. Record futex activity (underlying syscall for condvars)
echo "Recording futex activity for 60 seconds..."
sudo perf record -e 'syscalls:sys_enter_futex' \
                 -e 'syscalls:sys_exit_futex' \
                 -p $(pgrep your_application) -- sleep 60

perf report --stdio

# 2. eBPF script to measure broadcast behavior
cat > broadcast_latency.bt << 'EOF'
#!/usr/bin/env bpftrace

// Track pthread_cond_broadcast latency
uprobe:/lib/x86_64-linux-gnu/libpthread.so.0:pthread_cond_broadcast
{
    @start[tid] = nsecs;
}

uretprobe:/lib/x86_64-linux-gnu/libpthread.so.0:pthread_cond_broadcast
/@start[tid]/
{
    $latency = nsecs - @start[tid];
    @broadcast_latency_us = hist($latency / 1000);
    delete(@start[tid]);
}

// Count broadcasts by stack trace (to identify callers)
uprobe:/lib/x86_64-linux-gnu/libpthread.so.0:pthread_cond_broadcast
{
    @broadcast_callers[ustack(3)] = count();
}

interval:s:30
{
    print(@broadcast_latency_us);
    print(@broadcast_callers);
    clear(@broadcast_latency_us);
    clear(@broadcast_callers);
}
EOF

echo "Running eBPF broadcast analysis..."
sudo bpftrace broadcast_latency.bt

# 3. Check context switch rates
echo -e "\nContext switch monitoring:"
vmstat 1 10 | awk 'NR==1 || NR==2 || NR>2 {print $1, $12, $13, $14, $15, $16}'

# 4. Check cache behavior (cache-line bouncing indicates contention)
echo -e "\nCache statistics:"
sudo perf stat -e cache-references,cache-misses,L1-dcache-load-misses \
    -p $(pgrep your_application) -- sleep 10
```

Let's examine controlled benchmarks comparing signal and broadcast performance across different scenarios.
```c
// Benchmark: Signal vs Broadcast for a work queue

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NUM_WORKERS 64
#define NUM_WORK_ITEMS 100000

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    int work_count;
    int processed;
    int done;
} work_queue_t;

work_queue_t wq;

void *worker_thread(void *arg) {
    (void)arg;
    while (1) {
        pthread_mutex_lock(&wq.mutex);
        while (wq.work_count == 0 && !wq.done) {
            pthread_cond_wait(&wq.cond, &wq.mutex);
        }
        if (wq.done && wq.work_count == 0) {
            pthread_mutex_unlock(&wq.mutex);
            return NULL;
        }
        wq.work_count--;
        wq.processed++;
        pthread_mutex_unlock(&wq.mutex);

        // Simulate a small amount of work
        for (volatile int i = 0; i < 1000; i++);
    }
}

void benchmark_broadcast(void) {
    wq.work_count = 0;
    wq.processed = 0;
    wq.done = 0;

    pthread_t workers[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_create(&workers[i], NULL, worker_thread, NULL);
    }

    // Give workers time to start waiting
    usleep(10000);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // Submit work with BROADCAST
    for (int i = 0; i < NUM_WORK_ITEMS; i++) {
        pthread_mutex_lock(&wq.mutex);
        wq.work_count++;
        pthread_cond_broadcast(&wq.cond);  // BROADCAST
        pthread_mutex_unlock(&wq.mutex);
    }

    // Wait for all work to complete
    while (1) {
        pthread_mutex_lock(&wq.mutex);
        if (wq.processed >= NUM_WORK_ITEMS) {
            wq.done = 1;
            pthread_cond_broadcast(&wq.cond);  // Wake for shutdown
            pthread_mutex_unlock(&wq.mutex);
            break;
        }
        pthread_mutex_unlock(&wq.mutex);
        usleep(1000);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(workers[i], NULL);
    }

    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("BROADCAST: %d items in %.3f s = %.0f items/sec\n",
           NUM_WORK_ITEMS, elapsed, NUM_WORK_ITEMS / elapsed);
}

void benchmark_signal(void) {
    wq.work_count = 0;
    wq.processed = 0;
    wq.done = 0;

    pthread_t workers[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_create(&workers[i], NULL, worker_thread, NULL);
    }

    usleep(10000);

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    // Submit work with SIGNAL
    for (int i = 0; i < NUM_WORK_ITEMS; i++) {
        pthread_mutex_lock(&wq.mutex);
        wq.work_count++;
        pthread_cond_signal(&wq.cond);  // SIGNAL
        pthread_mutex_unlock(&wq.mutex);
    }

    while (1) {
        pthread_mutex_lock(&wq.mutex);
        if (wq.processed >= NUM_WORK_ITEMS) {
            wq.done = 1;
            pthread_cond_broadcast(&wq.cond);
            pthread_mutex_unlock(&wq.mutex);
            break;
        }
        pthread_mutex_unlock(&wq.mutex);
        usleep(1000);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(workers[i], NULL);
    }

    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("SIGNAL: %d items in %.3f s = %.0f items/sec\n",
           NUM_WORK_ITEMS, elapsed, NUM_WORK_ITEMS / elapsed);
}

int main(void) {
    pthread_mutex_init(&wq.mutex, NULL);
    pthread_cond_init(&wq.cond, NULL);

    printf("Workers: %d, Work items: %d\n\n", NUM_WORKERS, NUM_WORK_ITEMS);

    benchmark_signal();
    benchmark_broadcast();
    return 0;
}

// TYPICAL OUTPUT on a 16-core machine:
// Workers: 64, Work items: 100000
//
// SIGNAL:    100000 items in 0.893 s = 112000 items/sec
// BROADCAST: 100000 items in 3.241 s = 30856 items/sec
//
// RESULT: Signal is ~3.6x faster for this workload!
```

| Workers | Signal (items/sec) | Broadcast (items/sec) | Signal Speedup |
|---|---|---|---|
| 4 | 145,000 | 128,000 | 1.1x |
| 8 | 138,000 | 102,000 | 1.4x |
| 16 | 130,000 | 72,000 | 1.8x |
| 32 | 118,000 | 45,000 | 2.6x |
| 64 | 112,000 | 31,000 | 3.6x |
| 128 | 98,000 | 18,000 | 5.4x |
The performance gap between signal and broadcast WIDENS as worker count increases. With 128 workers, signal is 5.4x faster. This is because the thundering herd overhead is O(N) where N is the worker count.
Broadcast can affect latency and throughput differently. Understanding these tradeoffs helps you make the right choice for your workload.
Latency Distribution Example:

Consider a work queue with 32 workers, comparing latency percentiles:

| Percentile | Signal | Broadcast | Difference |
|---|---|---|---|
| P50 (median) | 0.8 ms | 1.2 ms | +50% |
| P90 | 2.1 ms | 4.8 ms | +129% |
| P99 | 5.2 ms | 18.4 ms | +254% |
| P99.9 | 12.8 ms | 45.2 ms | +253% |
| Max | 28.0 ms | 112.0 ms | +300% |

Observations:
- Median is only 50% worse (acceptable for many apps)
- P99 is 3.5x worse (significant for latency-sensitive systems)
- Max is 4x worse (can cause timeout issues)
In distributed systems, tail latency (P99, P99.9) often matters more than median. If your service is called 100 times per user request, the P99 latency determines user experience. Broadcast's impact on tail latency can be a deal-breaker for latency-sensitive systems.
Broadcast has significant effects on CPU cache behavior, which can dominate performance in high-frequency scenarios.
Cache Line Bouncing Problem:

When multiple cores wake up and compete for the same mutex, the cache line containing the mutex bounces between cores:

```
Broadcast wakes threads on CPUs 0, 1, 2, 3, 4, 5, 6, 7

Time | Mutex cache line location       | Operation
-----+---------------------------------+-----------------
t1   | Core 0 L1 cache                 | T0 tries acquire
t2   | Core 1 L1 cache (invalidate C0) | T1 tries acquire
t3   | Core 2 L1 cache (invalidate C1) | T2 tries acquire
t4   | Core 0 L1 cache (invalidate C2) | T0 tries AGAIN
...  | ...                             | ...

Each cache line transfer takes 40-100+ cycles.
With 8 cores all contending, we get 8+ transfers.
Total overhead: 320-800+ cycles just for cache coherency.
```
Memory Bandwidth Consumption:

Beyond latency, cache line bouncing consumes memory bandwidth:
```bash
# Measure cache behavior during signal vs broadcast

# Setup: Run the 64-worker benchmark with signal
./benchmark_signal &
PID=$!

# Measure cache statistics
echo "=== SIGNAL workload cache stats ==="
sudo perf stat -e cache-references,cache-misses,\
L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses -p $PID -- sleep 10

wait $PID

# Repeat with broadcast
./benchmark_broadcast &
PID=$!

echo "=== BROADCAST workload cache stats ==="
sudo perf stat -e cache-references,cache-misses,\
L1-dcache-loads,L1-dcache-load-misses,\
LLC-loads,LLC-load-misses -p $PID -- sleep 10

wait $PID

# Typical results (relative):
# Signal:    L1 miss rate ~2%, LLC miss rate ~0.1%
# Broadcast: L1 miss rate ~8%, LLC miss rate ~0.5%
#
# The broadcast case shows 4x more L1 misses due to
# cache line invalidations from mutex contention
```

On NUMA systems, the effects are even worse. If threads wake on different NUMA nodes, cache coherency traffic must cross the node interconnect, adding 2-3x latency. Consider NUMA-aware thread placement to mitigate this.
With measurement data in hand, you can make informed decisions about whether to optimize and how.
The Decision Process:

```
STEP 1: Measure current performance
        - Waste ratio (rewaited / total woken)
        - Throughput and latency under load
        - CPU utilization vs useful work
          ▼
STEP 2: Calculate overhead
        - If waste ratio < 20%: Probably not a problem
        - If waste ratio > 50%: Likely impacting performance
        - If waste ratio > 80%: Serious thundering herd
          ▼
STEP 3: Assess whether signal is safe
        - Review all waiters: same type? same condition?
        - Can you use separate CVs for different waiter types?
        - Would the complexity be worth the gain?
          ▼
STEP 4: Implement and validate
        - Make the change (signal, separate CVs, semaphore)
        - Test thoroughly for race conditions
        - Re-measure to confirm improvement
          ▼
STEP 5: Monitor in production
        - Watch for new bugs (deadlocks, starvation)
        - Track performance metrics continuously
```
| Scenario | Waste Ratio | Optimization | Expected Speedup | Risk |
|---|---|---|---|---|
| Work queue, 64 workers | 90% | signal() instead of broadcast() | 2-5x | Low |
| Prod/cons, shared CV | 70% | Separate CVs | 2-3x | Low |
| Resource pool, bulk return | 30-50% | Semaphore | 1.5-2x | Low |
| Heterogeneous waiters | Variable | Separate CVs | 2-3x | Medium |
| Complex predicates | Variable | Keep broadcast, optimize predicate | 1.2-1.5x | Low |
| Barrier synchronization | 0% | No change needed | N/A | N/A |
If your waste ratio is above 80%, at least four out of every five wakeups are wasted work. This is almost always worth optimizing. Below 20% waste, the optimization complexity may not be worth it unless you're in an extremely latency-sensitive path.
Beyond the basic mitigations, here are advanced techniques for performance-critical systems.
Batch Multiple Events Before Signaling

Instead of signaling for each event, batch events and signal once:
```c
// Batch events to reduce signal/broadcast overhead
// (queue_t and its enqueue/dequeue/is_empty helpers are assumed
// to be provided elsewhere; process() is the per-item work function)

typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    queue_t *pending;
    int batch_size;
    int current_count;
} batched_queue_t;

// Add an item without immediately signaling
void queue_add_batched(batched_queue_t *q, void *item) {
    pthread_mutex_lock(&q->mutex);
    queue_enqueue(q->pending, item);
    q->current_count++;

    // Only signal when the batch is full
    if (q->current_count >= q->batch_size) {
        pthread_cond_signal(&q->cond);  // Signal once for the entire batch
        q->current_count = 0;
    }
    pthread_mutex_unlock(&q->mutex);
}

// Flush remaining items (call periodically or on shutdown)
void queue_flush(batched_queue_t *q) {
    pthread_mutex_lock(&q->mutex);
    if (q->current_count > 0) {
        pthread_cond_signal(&q->cond);
        q->current_count = 0;
    }
    pthread_mutex_unlock(&q->mutex);
}

// Worker processes multiple items per wakeup
void *worker(void *arg) {
    batched_queue_t *q = (batched_queue_t *)arg;
    while (1) {
        pthread_mutex_lock(&q->mutex);
        while (queue_is_empty(q->pending)) {
            pthread_cond_wait(&q->cond, &q->mutex);
        }
        // Process ALL available items, not just one
        while (!queue_is_empty(q->pending)) {
            void *item = queue_dequeue(q->pending);
            pthread_mutex_unlock(&q->mutex);
            process(item);  // Outside the lock
            pthread_mutex_lock(&q->mutex);
        }
        pthread_mutex_unlock(&q->mutex);
    }
}

// Benefits:
// - Amortizes wakeup cost across multiple items
// - Reduces context switches
// - Better cache utilization (process while data is hot)
```

Next Steps:

The final page explores use cases: real-world scenarios and patterns where broadcast is the right choice, with complete, production-ready implementations you can adapt to your own systems.
You now have the tools to measure, analyze, and optimize broadcast performance in your systems. You can quantify the costs, make data-driven decisions, and apply the right optimization techniques for your workload. Next, we'll see broadcast in action with complete use case implementations.