In 2003, the Northeast Blackout affected 55 million people across the United States and Canada. Investigation revealed that a software bug—involving unsynchronized access to shared state in the alarm monitoring system—prevented operators from seeing critical alerts as the grid began to fail. The bug had existed for years but manifested only under a specific combination of conditions.
This is the nature of data races: they lurk silently in code, working correctly most of the time, until conditions align to produce catastrophic failure. A data race can corrupt data, crash a program, or silently produce wrong results—often without leaving any trace of what went wrong.
Data races are the single most important bug class in concurrent programming, and understanding them deeply is essential for any engineer working with threads.
By the end of this page, you will understand precisely what a data race is, why data races are so dangerous, how to identify them in code, and the fundamental principle for preventing them. You'll develop the ability to spot potential races before they become bugs.
A data race occurs when all of the following conditions are met: two or more threads access the same memory location, at least one of those accesses is a write, and no synchronization orders the accesses.
The formal definition:
A data race is a situation where two memory accesses in different threads target the same location, at least one is a write, and there is no happens-before relationship between them.
Note what is not required for a data race: the accesses do not need to happen at the same instant, the threads do not need to run on different cores, and the program does not need to visibly misbehave—a race is defined by the absence of synchronization, not by an observed failure.
```c
// Example 1: Classic data race - simultaneous read and write
int counter = 0; // Shared

void thread1() {
    counter = counter + 1; // Read, modify, write - not atomic!
}

void thread2() {
    counter = counter + 1; // Same: read, modify, write
}
// Both threads read and write 'counter' without synchronization
// Result after both complete: could be 1 (lost update) or 2

// Example 2: Flag-based race - looks safe but isn't
int data = 0;  // Shared
int ready = 0; // Shared flag

void producer() {
    data = 42;  // Write data
    ready = 1;  // Signal reader
}

void consumer() {
    while (ready == 0)
        ; // Wait for signal
    printf("%d", data); // Read data
}
// Data race on 'ready': writer and reader without synchronization
// Data race on 'data': no happens-before between write and read
// Consumer might see ready=1 but data=0 due to reordering!

// Example 3: Read-read - NOT a data race
int shared = 42;

void reader1() { printf("%d", shared); }
void reader2() { printf("%d", shared); }
// Both only read - no data race (but no useful communication either)
```

These terms are often confused but are different:
Data race: Unsynchronized concurrent access to shared memory (a low-level concept about memory operations).
Race condition: A correctness bug where program behavior depends on timing (a high-level concept about program logic).
You can have race conditions without data races (using locks incorrectly) and data races without race conditions (if the program happens to work despite undefined behavior).
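The first half of that claim can be made concrete in code. The sketch below (names such as `withdraw_racy` and `balance` are our own, not from any particular codebase) is data-race free—every access to `balance` happens under a mutex—yet it still contains a race condition, because the check and the act occur under separate lock acquisitions:

```c
#include <pthread.h>
#include <stdbool.h>

static int balance = 100;
static pthread_mutex_t balance_lock = PTHREAD_MUTEX_INITIALIZER;

// Data-race free: 'balance' is only ever read under the lock.
static int check_balance(void) {
    pthread_mutex_lock(&balance_lock);
    int b = balance;
    pthread_mutex_unlock(&balance_lock);
    return b;
}

// Data-race free: 'balance' is only ever written under the lock.
static void debit(int amount) {
    pthread_mutex_lock(&balance_lock);
    balance -= amount;
    pthread_mutex_unlock(&balance_lock);
}

// Race condition: the lock is released between the check and the debit,
// so two threads can both pass the check and overdraw the account.
static bool withdraw_racy(int amount) {
    if (check_balance() >= amount) { // check
        debit(amount);               // act - state may have changed in between
        return true;
    }
    return false;
}

// Fix: hold the lock across the entire check-then-act sequence.
static bool withdraw_correct(int amount) {
    bool ok = false;
    pthread_mutex_lock(&balance_lock);
    if (balance >= amount) {
        balance -= amount;
        ok = true;
    }
    pthread_mutex_unlock(&balance_lock);
    return ok;
}
```

The fix is not more atomicity at the memory level but a wider critical section: the invariant "the balance never goes negative" must be checked and updated in one indivisible step.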
Data races aren't just bugs—in C and C++, they're undefined behavior. This has profound implications that many developers don't fully appreciate.
Undefined behavior means anything can happen:
The language specification places no constraints on program behavior when a data race exists. The compiler is free to assume data races don't exist and optimize accordingly. This can cause:
Value corruption — Torn reads/writes where a 64-bit value is partially updated may produce values that were never written by any thread.
Optimization breakage — The compiler might cache a value in a register, never re-reading from memory even though another thread changed it.
Infinite loops — A loop that checks a flag might be optimized to check once (since 'no race exists,' the flag can't change).
Time travel — Due to compiler reordering, effects of a data race may appear to happen before their causes.
Speculative execution effects — Side-channel information leakage, as demonstrated by Spectre vulnerabilities.
```c
// Example: Compiler optimization breaks racy code

int done = 0; // Shared flag (non-atomic, non-volatile)

void wait_for_done() {
    while (!done) {
        // Wait...
    }
    // Continue after signal
}

// What the compiler might generate (optimized):
void wait_for_done_optimized() {
    if (!done) {
        while (1) { } // Infinite loop!
    }
    // Compiler hoists 'done' read outside loop
    // Reasoning: "done can't change (no race allowed)"
}

// The compiler's optimization is CORRECT per the standard:
// - Program has a data race → undefined behavior
// - Compiler may assume any behavior for undefined behavior
// - Hoisting the read is a valid transformation

// Fix: Use atomic types with proper memory ordering
#include <stdatomic.h>
atomic_int done_safe = 0;

void wait_for_done_correct() {
    while (!atomic_load(&done_safe)) {
        // Now compiler knows another thread might change done_safe
    }
}
```

Many developers attempt to fix data races with 'volatile'. In C/C++, volatile prevents compiler optimization of the variable but provides NO synchronization or memory ordering. A volatile variable can still be subject to CPU reordering and cache incoherence. Volatile does not fix data races—use atomics or proper synchronization.
The non-determinism problem:
Even when data races don't trigger dramatic effects, they introduce non-determinism:
This makes data race bugs extremely difficult to diagnose. A 'Heisenbug' that disappears when you try to observe it is often a data race.
The most classic data race pattern is the lost update, where two threads try to update the same value and one update is lost.
Consider incrementing a counter:
counter = counter + 1;
This looks like one operation, but it's actually three:
1. Read the current value of counter into a register
2. Add 1 to the value in the register
3. Write the register back to counter

With two threads executing concurrently, the interleaving can cause updates to be lost:
| Time | Thread 1 | Thread 2 | counter value |
|---|---|---|---|
| T1 | Read counter → 0 | – | 0 |
| T2 | – | Read counter → 0 | 0 |
| T3 | Add 1 → 1 | – | 0 |
| T4 | – | Add 1 → 1 | 0 |
| T5 | Write 1 → counter | – | 1 |
| T6 | – | Write 1 → counter | 1 |
Expected result: 2 (both increments applied)
Actual result: 1 (Thread 2's increment overwrote Thread 1's)
The update from Thread 1 was lost because Thread 2 read the old value before Thread 1's write was visible.
This generalizes to any read-modify-write pattern: incrementing a counter, adjusting an account balance, appending to a list, or toggling a flag.
In all these cases, the pattern is the same: read, compute, write—without atomicity, the race exists.
```c
// Demonstration: Lost updates in practice
#include <stdio.h>
#include <pthread.h>

int counter = 0;
#define ITERATIONS 1000000

void* increment(void* arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        counter = counter + 1; // Data race!
    }
    return NULL;
}

int main() {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment, NULL);
    pthread_create(&t2, NULL, increment, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("Expected: %d\n", 2 * ITERATIONS); // 2,000,000
    printf("Actual: %d\n", counter);          // Often < 2,000,000
    return 0;
}

// Typical output (varies on each run):
// Expected: 2000000
// Actual: 1387429
//
// Hundreds of thousands of updates lost to the race!
```

Lost updates are prevented by making the read-modify-write operation atomic. Using atomic_fetch_add (C11), __sync_fetch_and_add (GCC), or InterlockedIncrement (Windows) ensures the entire operation completes without interruption. Alternatively, a mutex can protect the entire critical section.
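To see the atomic fix in action, here is a hedged sketch (the harness name `run_two_threads` is our own) of the same two-thread increment using C11 `atomic_fetch_add`; with it, the result is exactly 2 × ITERATIONS on every run:

```c
#include <stdatomic.h>
#include <pthread.h>

#define ITERATIONS 1000000

atomic_int safe_counter = 0;

void* increment_atomic(void* arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        atomic_fetch_add(&safe_counter, 1); // Atomic read-modify-write: no lost updates
    }
    return NULL;
}

// Spawn two incrementing threads and return the final count.
int run_two_threads(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, increment_atomic, NULL);
    pthread_create(&t2, NULL, increment_atomic, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return atomic_load(&safe_counter); // Always exactly 2 * ITERATIONS
}
```

Each `atomic_fetch_add` performs the read, add, and write as one indivisible hardware operation, so no interleaving can lose an update.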
Torn reads/writes occur when a multi-byte value is partially updated by one thread while another thread reads it, resulting in a chimeric value that was never written.
When tearing can occur:
The C standard does not guarantee that any type is accessed atomically. Whether tearing occurs depends on:
Data size vs. native word size — Writing a 64-bit value on a 32-bit CPU requires two memory operations. Another thread may see half old, half new.
Alignment — Misaligned data may cross cache lines, requiring multiple memory accesses.
Hardware architecture — Even correctly-sized, aligned data may not be atomic on all hardware.
Example: 64-bit value on 32-bit CPU
| Time | Thread 1 (Writer) | Thread 2 (Reader) | Value in Memory |
|---|---|---|---|
| Initial | – | – | 0x0000000000000000 |
| T1 | Write low 32 bits of 0xFFFFFFFFFFFFFFFF | – | 0x00000000FFFFFFFF |
| T2 | – | Read 64-bit value | Reads 0x00000000FFFFFFFF |
| T3 | Write high 32 bits | – | 0xFFFFFFFFFFFFFFFF |
The reader sees 0x00000000FFFFFFFF—a value that Thread 1 never intended to write! This is a torn read.
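The interleaving in the table can be simulated deterministically. This sketch (the type and function names are ours) models a 32-bit CPU by storing the 64-bit value as two explicit 32-bit halves and reading between the two stores:

```c
#include <stdint.h>

// Simulate a 32-bit CPU writing a 64-bit value as two separate stores.
// 'lo' and 'hi' stand in for the two halves of the value in memory.
struct split64 {
    uint32_t lo;
    uint32_t hi;
};

static uint64_t read64(const struct split64* v) {
    return ((uint64_t)v->hi << 32) | v->lo;
}

// Writer performs T1 and T3 from the table; the read happens at T2,
// between the two stores, and observes a value no thread ever wrote.
static uint64_t torn_read_demo(void) {
    struct split64 v = { .lo = 0, .hi = 0 };  // Initial: 0x0000000000000000
    v.lo = 0xFFFFFFFFu;                       // T1: low 32 bits written
    uint64_t torn = read64(&v);               // T2: reader sees half-updated value
    v.hi = 0xFFFFFFFFu;                       // T3: high 32 bits written
    return torn;                              // 0x00000000FFFFFFFF - chimeric
}
```

On real hardware the tear happens only when the reader's timing lands in the window between the two stores, which is why torn reads are so rarely caught in testing.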
Real-world consequences: a torn pointer can send the program to an address no thread ever stored, and a torn timestamp or account balance is simply a wrong number—typically with no clue left behind that tearing was the cause.
What guarantees atomicity?
In practice, most modern 64-bit platforms guarantee atomic access for naturally-aligned 64-bit values. But this is a hardware/ABI guarantee, not a language guarantee. Always use atomic types when multiple threads access the same data.
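C11 lets you both obtain the language-level guarantee and query whether the hardware backs it. A brief sketch (the timestamp names are illustrative) using `_Atomic` and `atomic_is_lock_free`:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdbool.h>

// An _Atomic 64-bit value: the implementation guarantees tear-free access,
// either via native atomic instructions or an internal lock.
static _Atomic int64_t shared_timestamp = 0;

// Ask whether this atomic is lock-free, i.e. backed by hardware
// atomic instructions rather than a hidden lock.
static bool timestamp_is_lock_free(void) {
    return atomic_is_lock_free(&shared_timestamp);
}

static void publish_timestamp(int64_t t) {
    atomic_store(&shared_timestamp, t); // Never tears, on any platform
}

static int64_t read_timestamp(void) {
    return atomic_load(&shared_timestamp); // Always a value some thread wrote
}
```

On typical 64-bit platforms `timestamp_is_lock_free()` returns true, but the code remains correct even where it doesn't—the implementation falls back to locking internally.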
```c
// Example: Torn reads with a struct

struct Coordinate {
    int x;
    int y;
};

struct Coordinate position = {0, 0};

// Thread 1: Updates position atomically (conceptually)
void update_position(int new_x, int new_y) {
    position.x = new_x;
    position.y = new_y; // Not atomic with x write!
}

// Thread 2: Reads position
void read_position() {
    int x = position.x;
    int y = position.y; // May see old x, new y or vice versa
    printf("Position: (%d, %d)\n", x, y);
}

// If Thread 1 writes (10, 20) while position is (0, 0):
// Thread 2 might read:
//   (0, 0)   - saw nothing yet
//   (10, 0)  - saw x update, not y
//   (0, 20)  - saw y update, not x (reordering!)
//   (10, 20) - saw both
//
// The (10, 0) and (0, 20) cases are torn reads of the struct

// Fix: Use a mutex to make the entire update atomic
pthread_mutex_t pos_lock = PTHREAD_MUTEX_INITIALIZER;

void update_position_safe(int new_x, int new_y) {
    pthread_mutex_lock(&pos_lock);
    position.x = new_x;
    position.y = new_y;
    pthread_mutex_unlock(&pos_lock);
}
```

Even if individual fields are atomically accessible, updating multiple fields is never atomic without synchronization. A struct, array, or any compound data structure requires explicit protection (mutex, RCU, or lock-free design) for thread-safe updates.
Data races are notoriously difficult to detect through testing alone because they depend on thread scheduling, which varies non-deterministically. Fortunately, powerful tools exist:
1. ThreadSanitizer (TSan)
Instrumentation-based detection built into Clang and GCC. Tracks all memory accesses and synchronization to detect races at runtime.
Enable with the compiler flag `-fsanitize=thread`.

2. Helgrind (Valgrind tool)
Dynamic race detection using Valgrind's binary instrumentation.
Run with `valgrind --tool=helgrind ./program`.

3. Intel Inspector
Commercial race detection for Intel platforms.
4. Static Analysis
Tools like Coverity, PVS-Studio, and Clang's static analyzer can identify potential races without running code.
```shell
# Compile with ThreadSanitizer
$ clang -fsanitize=thread -g -O1 race_program.c -o race_program

# Run - TSan will report any races detected
$ ./race_program

==================
WARNING: ThreadSanitizer: data race (pid=12345)
  Write of size 4 at 0x... by thread T2:
    #0 increment race_program.c:10 (race_program+0x...)
    #1 thread_start ...

  Previous write of size 4 at 0x... by thread T1:
    #0 increment race_program.c:10 (race_program+0x...)
    #1 thread_start ...

  Location is global 'counter' of size 4 at 0x...

  Thread T2 (tid=..., running) created by main thread at:
    #0 pthread_create ...
    #1 main race_program.c:20 (race_program+0x...)

SUMMARY: ThreadSanitizer: data race race_program.c:10 in increment
==================
```

Integrate ThreadSanitizer into your continuous integration pipeline. Run tests with TSan enabled on every commit. A race that manifests once will be caught before it reaches production. The ~10× slowdown is acceptable for CI, and catching races early is invaluable.
Limitations of race detection: dynamic tools only report races in code paths your tests actually execute, instrumentation slows execution significantly (roughly 10× for TSan) and inflates memory use, and unconventional synchronization (custom lock-free code, signaling through external processes) can produce missed races or false positives.
Despite limitations, race detectors catch the vast majority of races in practice and are essential tools for concurrent programming.
The fundamental principle for preventing data races:
If a memory location is accessed by multiple threads and at least one access is a write, all accesses must be synchronized.
There are several strategies to achieve this:
Strategy 1: Mutual Exclusion (Locks)
Protect all accesses to shared data with a mutex. Only one thread can hold the lock at a time, preventing concurrent access.
Strategy 2: Atomic Operations
Use atomic types and operations (C11 _Atomic, C++ std::atomic) for simple shared state. The hardware guarantees atomicity.
Strategy 3: Message Passing
Instead of sharing memory, have threads communicate by sending messages (channels, queues). Shared state is eliminated.
Strategy 4: Immutability
Data that is never modified after creation is inherently safe for concurrent reads. Many functional programming patterns leverage this.
Strategy 5: Thread Confinement
Design so that each piece of data is only accessed by one thread. No sharing = no races.
```c
#include <stdatomic.h>
#include <pthread.h>

// ===== Strategy 1: Mutex =====
int counter_mutex = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void increment_with_mutex() {
    pthread_mutex_lock(&lock);
    counter_mutex++; // Protected by lock
    pthread_mutex_unlock(&lock);
}

// ===== Strategy 2: Atomics =====
atomic_int counter_atomic = 0;

void increment_with_atomic() {
    atomic_fetch_add(&counter_atomic, 1); // Single atomic operation
}

// ===== Strategy 3: Thread Confinement =====
__thread int thread_local_counter = 0; // Each thread has its own

void increment_local() {
    thread_local_counter++; // No sharing = no race
}

// ===== Strategy 4: Immutable Data =====
struct Config { int timeout; };

// Once created, 'config' is never modified
const struct Config* get_config() {
    static const struct Config config = { .timeout = 30 };
    return &config; // Safe to read from any thread
}
```

Mutexes are the go-to choice for protecting complex shared state. Atomics are ideal for simple counters, flags, and pointers. Message passing works well for producer-consumer patterns. Immutability shines for configuration and reference data. Good concurrent design often combines multiple strategies.
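The message-passing strategy deserves its own sketch. The bounded channel below is a minimal illustration (all names are our own) built from a mutex and two condition variables; threads communicate only through `channel_send` and `channel_recv`, so the sole shared state lives inside the channel, where the lock protects it:

```c
#include <pthread.h>

#define QUEUE_CAP 16

// A tiny blocking queue: threads pass values through it
// instead of touching each other's variables directly.
struct channel {
    int items[QUEUE_CAP];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
};

void channel_init(struct channel* ch) {
    ch->head = ch->tail = ch->count = 0;
    pthread_mutex_init(&ch->lock, NULL);
    pthread_cond_init(&ch->not_empty, NULL);
    pthread_cond_init(&ch->not_full, NULL);
}

void channel_send(struct channel* ch, int value) {
    pthread_mutex_lock(&ch->lock);
    while (ch->count == QUEUE_CAP)
        pthread_cond_wait(&ch->not_full, &ch->lock); // Block until space
    ch->items[ch->tail] = value;
    ch->tail = (ch->tail + 1) % QUEUE_CAP;
    ch->count++;
    pthread_cond_signal(&ch->not_empty);
    pthread_mutex_unlock(&ch->lock);
}

int channel_recv(struct channel* ch) {
    pthread_mutex_lock(&ch->lock);
    while (ch->count == 0)
        pthread_cond_wait(&ch->not_empty, &ch->lock); // Block until data
    int value = ch->items[ch->head];
    ch->head = (ch->head + 1) % QUEUE_CAP;
    ch->count--;
    pthread_cond_signal(&ch->not_full);
    pthread_mutex_unlock(&ch->lock);
    return value;
}
```

A producer thread calls `channel_send` and a consumer calls `channel_recv`; this is the producer-consumer pattern with the synchronization hidden behind the channel API.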
Different programming languages handle data races differently:
C and C++
Data races are undefined behavior. The program may do anything—crash, produce wrong results, appear to work, or format your hard drive (in theory). The language provides atomics and threading primitives, but race prevention is entirely the programmer's responsibility.
Java
Data races produce weak but defined behavior. The Java Memory Model specifies what happens: you may see stale or partly-updated values, but certain cases (like torn long/double writes) are prevented for volatile variables. Race prevention is still your job, but undefined behavior is avoided.
Go
The Go race detector catches races at runtime, and races are considered bugs. Go does not declare them undefined behavior in the C/C++ sense, but a racy program can still corrupt multiword values such as slices and interface headers, so the behavior is unpredictable. Go encourages message passing via channels: "Do not communicate by sharing memory; share memory by communicating."
Rust
Data races are impossible (in safe Rust). The ownership and borrowing system prevents multiple threads from having mutable access to the same data simultaneously. This is checked at compile time—if your code compiles, it's data-race free.
| Language | Data Race Semantics | Prevention Mechanism | Detection |
|---|---|---|---|
| C/C++ | Undefined behavior | Programmer discipline + atomics/mutexes | TSan, static analysis |
| Java | Weak but defined | volatile, synchronized, java.util.concurrent | Race detectors exist |
| Go | Unsafe but not UB | Channels preferred; mutexes available | Built-in race detector |
| Rust | Compile-time prevented | Ownership system prevents at compile time | N/A (can't compile) |
| Python | GIL prevents true races* | Global Interpreter Lock (for CPython) | Limited parallelism |
| JavaScript | Single-threaded** | Event loop; SharedArrayBuffer needs Atomics | N/A for main thread |
*Python's GIL prevents simultaneous bytecode execution but doesn't prevent race conditions in logic.
**JavaScript's main thread is single-threaded, but Web Workers can share memory via SharedArrayBuffer, requiring explicit atomics.
Rust's approach is groundbreaking: the compiler statically ensures data-race freedom. If you access data from multiple threads, you must use synchronization types (Arc<Mutex<T>>, channels, etc.), or the code won't compile. This eliminates an entire class of bugs at zero runtime cost.
Data races have caused numerous real-world failures. Understanding these helps appreciate why race prevention matters.
Therac-25 (1985-1987)
A radiation therapy machine that massively overdosed patients due to software race conditions. When operators typed commands quickly, race conditions between the user interface and hardware control software could set lethal radiation doses. Six patients received massive overdoses; three died. The root cause included missing synchronization between concurrent software routines.
2003 Northeast Blackout
As mentioned earlier, a race condition in the alarm system software prevented operators from seeing cascading failures in the power grid. The race had existed for years but triggered only under specific conditions—55 million people lost power.
Knight Capital (2012)
A trading firm lost $440 million in 45 minutes due to software deployment errors that activated old, disabled code. While not purely a data race, the incident involved unsynchronized state between components, which behaved as if obsolete orders were still valid. Improper handling of shared state contributed to the catastrophic behavior.
Linux Kernel Privilege Escalation (CVE-2016-5195, 'Dirty COW')
A race condition in the Linux kernel's copy-on-write mechanism allowed unprivileged users to gain root access. The race had existed for nearly a decade. Attackers could race two threads to write to memory they shouldn't have access to, exploiting a narrow window in the kernel's memory handling.
The Therac-25 case makes it clear: data races aren't just programming errors—they can kill people. In safety-critical systems, race prevention isn't optional. Even in non-safety-critical systems, races cause data corruption, financial loss, and security vulnerabilities.
Writing race-free concurrent code requires disciplined practices:
1. Document thread-safety invariants
For every shared variable, document: which threads access it, whether any access is a write, and which synchronization mechanism (a specific lock, atomic operations, or confinement to one thread) protects it.
2. Minimize sharing
The safest race is one that can't happen because data isn't shared. Design for thread confinement and message passing where possible.
3. Hold locks for the minimum time
Long critical sections increase contention and the window for mistakes. Enter, do the minimum necessary, exit.
4. Use established patterns
Producer-consumer, readers-writers, and other patterns have well-understood solutions. Use them rather than inventing ad-hoc synchronization.
5. Review concurrent code carefully
Every concurrent code review should ask: Which data is shared between threads? Is every access to that data synchronized? Do all accesses use the same lock? Is any lock held longer than necessary, and could the locking order deadlock?
```c
// Example: Well-documented synchronization

// This shared state is accessed by multiple threads.
// All fields are protected by 'state_lock'.
// The lock MUST be held for any read or write.
struct SharedState {
    int count;          // Protected by state_lock
    char* buffer;       // Protected by state_lock
    size_t buffer_size; // Protected by state_lock
} shared_state;

pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER; // Protects shared_state

// This counter is updated by multiple threads.
// Uses atomic operations; no lock required.
// Memory order: seq_cst for simplicity (could relax if needed).
atomic_int request_count = 0;

// This is thread-local - no synchronization needed.
__thread int thread_id = 0;

// Thread-safe increment with documentation
void increment_count() {
    pthread_mutex_lock(&state_lock); // Acquire lock before accessing shared_state
    shared_state.count++;            // Safe: lock held
    pthread_mutex_unlock(&state_lock);
}
```

If you rigorously follow synchronization discipline—protecting every shared variable, using atomics correctly, running race detectors—your program will be data-race-free. This earns you DRF-SC: you can reason about your program using the simple sequential consistency model, regardless of the underlying hardware.
We've examined data races—the fundamental hazard that makes concurrent programming challenging: the formal definition (conflicting, unsynchronized accesses with no happens-before ordering), why races are undefined behavior in C and C++, the lost-update and torn-read failure modes, detection with tools like ThreadSanitizer, and the prevention strategies of locks, atomics, message passing, immutability, and thread confinement.
What's next:
We've seen that data races occur when concurrent accesses aren't properly synchronized. But what exactly needs synchronization? The next page examines critical regions—the sections of code where race conditions can occur, and the formal requirements for safely executing them.
You now understand data races deeply: what they are, why they're dangerous, how to detect them, and how to prevent them. This knowledge is foundational—every subsequent topic in concurrent programming (critical regions, locks, atomics, lock-free algorithms) is about preventing or managing races correctly.