Thread Safety - Learning Module


What is Thread Safety

The Invisible Menace in Concurrent Systems

Imagine you're a senior engineer at a major e-commerce platform. It's Black Friday, and your checkout service is handling 50,000 requests per second. Everything looks fine in the logs until customer support starts flooding you with complaints: customers are seeing each other's shopping carts, orders are being placed with wrong quantities, and account balances are mysteriously changing.

You check the code. It looks perfectly correct. Every individual function does exactly what it should. The unit tests pass. But in production, under heavy concurrent load, the system exhibits ghostly, unpredictable behavior that vanishes when you try to reproduce it in development.

Welcome to the world of thread safety violations—one of the most insidious categories of bugs in software engineering.

What You Will Learn

By the end of this page, you will understand what thread safety truly means at a foundational level, why it's essential for any concurrent system, and the fundamental concepts that underpin all thread-safe design. You'll develop the mental model needed to reason about concurrent code and recognize potential safety violations before they become production disasters.

Defining Thread Safety Precisely

Thread safety is one of those terms that's often used loosely, leading to confusion. Let's establish a rigorous definition.

Thread Safety Definition: A class, method, or data structure is thread-safe if it guarantees correct behavior when accessed from multiple threads simultaneously, without requiring any additional synchronization or coordination from the caller.

The key phrase here is "guarantees correct behavior"—but what does "correct" mean in a concurrent context? This leads us to a more formal definition known as correctness under concurrent execution.

Brian Goetz's Definition

Brian Goetz, author of Java Concurrency in Practice, offers perhaps the most widely cited definition: "A class is thread-safe if it behaves correctly when accessed from multiple threads, regardless of the scheduling or interleaving of the execution of those threads by the runtime environment, and with no additional synchronization or other coordination on the part of the calling code."

Let's dissect this definition into its essential components:

1. Behaves Correctly

The class upholds its invariants—the properties that must always be true about its state. For example, a Counter class that can go negative when it shouldn't, or a BankAccount that allows the balance to become inconsistent across threads, is not behaving correctly.

2. Under Any Scheduling or Interleaving

This is the critical part. Modern operating systems can interrupt thread execution at any point and switch to another thread. A thread-safe class must work correctly regardless of:

When threads are paused and resumed
The order in which threads execute
Whether threads run truly in parallel (on multiple cores) or are time-sliced on a single core

3. Without Additional Synchronization from the Caller

The thread safety is encapsulated within the class. Callers shouldn't need to wrap every call in a lock or coordinate access between threads—that's the responsibility of the class itself.

thread-safe-vs-unsafe.ts
// NOT THREAD-SAFE: This counter can produce incorrect results under concurrent access
class UnsafeCounter {
    private count: number = 0;
    
    // Problem: read-modify-write is three separate operations
    increment(): void {
        this.count = this.count + 1;  // Thread A reads 5, Thread B reads 5
                                       // Both write 6, losing an increment!
    }
    
    getCount(): number {
        return this.count;
    }
}
 
// THREAD-SAFE (conceptually): Proper synchronization ensures correctness
class ThreadSafeCounter {
    private count: number = 0;
    private lock: Mutex = new Mutex();  // Conceptual mutex
    
    async increment(): Promise<void> {
        await this.lock.acquire();  // Only one thread can hold the lock
        try {
            this.count = this.count + 1;  // Now atomic from callers' perspective
        } finally {
            this.lock.release();
        }
    }
    
    async getCount(): Promise<number> {
        await this.lock.acquire();
        try {
            return this.count;
        } finally {
            this.lock.release();
        }
    }
}

Levels of Thread Safety

Thread safety isn't binary—there's a spectrum of safety guarantees that different classes and functions can provide. Understanding these levels helps you design appropriate solutions for specific concurrency requirements.

Joshua Bloch, in Effective Java, describes five levels of thread safety that have become a widely used classification in the industry:

Levels of Thread Safety (from strongest to weakest)
Level	Description	Example	Caller Responsibility
Immutable	Objects that cannot be modified after construction. Always thread-safe.	String, Integer, LocalDate	None—guaranteed safe
Unconditionally Thread-Safe	Mutable objects with sufficient internal synchronization. Safe for all usage patterns.	ConcurrentHashMap, AtomicInteger	None—guaranteed safe
Conditionally Thread-Safe	Some operations require external synchronization. Safety depends on usage.	Collections.synchronizedList() during iteration	Must synchronize for specific operations
Thread-Compatible	No internal synchronization, but can be used safely with external locks.	ArrayList, HashMap	Must synchronize all access
Thread-Hostile	Cannot be used safely even with external synchronization.	Classes with static mutable state, poorly designed APIs	Redesign required

Let's examine each level in detail:

Immutable (Strongest Guarantee)

Immutable objects are the gold standard of thread safety. Since their state cannot change after construction, there's no possibility of one thread's modifications affecting another thread. Immutable objects can be freely shared across threads with zero synchronization overhead.

Key characteristics:

All fields are final (or effectively final)
No methods modify state
Any contained objects are also immutable or defensively copied
The object is properly constructed (no this references escape during construction)

The Power of Immutability

Immutability is the simplest and most robust path to thread safety. When designing concurrent systems, always ask: "Can this object be immutable?" The performance cost of creating new objects is often far less than the complexity cost of managing mutable shared state.
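
The characteristics listed above can be sketched in Java roughly as follows. The `OrderSnapshot` class, its fields, and the `withItem` method are hypothetical names chosen for illustration; they do not come from any library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical example class: an immutable snapshot of an order.
final class OrderSnapshot {
    private final String orderId;
    private final List<String> items;  // never exposed in mutable form

    OrderSnapshot(String orderId, List<String> items) {
        this.orderId = orderId;
        // Defensive copy: later changes to the caller's list cannot leak in.
        this.items = Collections.unmodifiableList(new ArrayList<>(items));
    }

    String orderId() {
        return orderId;
    }

    List<String> items() {
        return items;  // unmodifiable view of a private copy
    }

    // "Modification" returns a new object; the original is left untouched.
    OrderSnapshot withItem(String item) {
        List<String> copy = new ArrayList<>(items);
        copy.add(item);
        return new OrderSnapshot(orderId, copy);
    }
}
```

Because every field is final and the list is copied on the way in, an `OrderSnapshot` can be handed to any number of threads without locks. The trade-off is one extra allocation per change.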

Unconditionally Thread-Safe

These are mutable objects that encapsulate all necessary synchronization internally. Clients can use them concurrently without any external coordination. This is the level most "thread-safe" classes aim for.

Examples:

ConcurrentHashMap: All operations are thread-safe, including compound operations like putIfAbsent
AtomicLong: Lock-free atomic operations
BlockingQueue: Thread-safe producer-consumer primitives
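
As a rough illustration of what "no external coordination" looks like in practice, the sketch below shares a `ConcurrentHashMap` and `AtomicLong` values between several threads without any caller-side locking. The `HitCounterDemo` class and the `"/checkout"` key are assumptions made up for this example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo: callers share these objects with no external locking.
final class HitCounterDemo {
    static long run(int threads, int hitsPerThread) {
        ConcurrentHashMap<String, AtomicLong> hits = new ConcurrentHashMap<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < hitsPerThread; i++) {
                    // putIfAbsent is atomic: exactly one AtomicLong wins per key.
                    hits.putIfAbsent("/checkout", new AtomicLong());
                    // Safe here only because keys are never removed;
                    // incrementAndGet is an atomic read-modify-write.
                    hits.get("/checkout").incrementAndGet();
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();  // join() also makes the workers' writes visible here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return hits.get("/checkout").get();
    }
}
```

Calling `HitCounterDemo.run(4, 10_000)` should always return 40000, because every step that touches shared state is a single atomic operation provided by the library classes.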

Conditionally Thread-Safe

This level is often misunderstood and leads to bugs. Some operations require external synchronization, but not all. The class documentation should clearly specify which operations are safe and which require coordination.

Classic example: Iterating over Collections.synchronizedList()

conditionally-thread-safe.java
Java
import java.util.Collections;
import java.util.List;
import java.util.ArrayList;
 
public class ConditionalSafetyExample {
    
    // This list is "conditionally thread-safe"
    private List<String> syncList = Collections.synchronizedList(new ArrayList<>());
    
    // SAFE: Individual operations are synchronized
    public void addItem(String item) {
        syncList.add(item);  // Thread-safe - uses internal lock
    }
    
    public int getSize() {
        return syncList.size();  // Thread-safe - uses internal lock
    }
    
    // UNSAFE: Compound operation without synchronization
    public void addIfAbsentBroken(String item) {
        // BUG: Between contains() and add(), another thread could add the same item
        if (!syncList.contains(item)) {
            syncList.add(item);  // Not atomic!
        }
    }
    
    // UNSAFE: Iteration without synchronization
    public void printAllBroken() {
        // BUG: Another thread could modify list during iteration
        for (String item : syncList) {  // ConcurrentModificationException possible!
            System.out.println(item);
        }
    }
    
    // SAFE: Proper synchronization during iteration
    public void printAllCorrect() {
        synchronized (syncList) {  // Must hold lock for entire iteration
            for (String item : syncList) {
                System.out.println(item);
            }
        }
    }
}
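
The listing above shows a broken check-then-add but no repaired version. One possible fix, sketched here under the assumption that the list was created with `Collections.synchronizedList` (which uses the returned list itself as its lock), is to hold that same lock around the whole compound operation. The class name `ConditionalSafetyFix` and the `demo` driver are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ConditionalSafetyFix {
    private final List<String> syncList =
            Collections.synchronizedList(new ArrayList<>());

    // Holding the list's own monitor turns check-then-act into one atomic step.
    boolean addIfAbsent(String item) {
        synchronized (syncList) {
            if (syncList.contains(item)) {
                return false;
            }
            syncList.add(item);
            return true;
        }
    }

    int size() {
        return syncList.size();
    }

    // Hypothetical driver: many threads race to add the same items.
    static int demo(int threads, int distinctItems) {
        ConditionalSafetyFix fix = new ConditionalSafetyFix();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < distinctItems; i++) {
                    fix.addIfAbsent("item-" + i);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return fix.size();  // no duplicates, so this equals distinctItems
    }
}
```

The important detail is that the external lock must be the same object the wrapper locks internally. Synchronizing on an unrelated object would not exclude concurrent `add` calls.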

Thread-Compatible

Most classes in standard libraries fall into this category. They're not internally synchronized, but they can be used safely if the caller provides proper synchronization. This is the default for most classes because:

Synchronization has performance overhead
The caller often knows the best synchronization granularity
Many classes are only used in single-threaded contexts

Examples: ArrayList, HashMap, StringBuilder

Thread-Hostile

These classes cannot be made thread-safe even with external synchronization, often due to:

Static mutable state that can't be synchronized
Methods that have unavoidable race conditions
Broken memory visibility guarantees

Thread-hostile classes should be redesigned or avoided in concurrent contexts.

The Three Pillars of Thread Safety

Thread safety problems arise from three fundamental sources. Understanding these pillars is essential for both diagnosing issues and designing thread-safe code.

The three pillars are:

Atomicity — Operations must complete as indivisible units
Visibility — Changes made by one thread must be visible to other threads
Ordering — Operations must appear to execute in a consistent order

Violating any of these pillars can cause thread safety failures.

Pillar 1: Atomicity

•Definition: An operation is atomic if it appears to happen instantaneously from the perspective of other threads—no intermediate states are observable.
•Problem: Compound operations (read-modify-write, check-then-act) are not atomic even if individual reads and writes are.
•Solution: Use synchronization (locks, mutexes) or atomic primitives (CAS operations) to make compound operations atomic.
•Example: Incrementing a counter (count++) involves read, add, and write—three separate operations that another thread can interleave.
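
To make the lost-update problem concrete, here is a minimal sketch that increments a plain `int` and an `AtomicInteger` side by side from several threads. The `AtomicityDemo` class is a hypothetical illustration; how many updates the plain field loses (if any) varies from run to run.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class AtomicityDemo {
    private int plain = 0;                                     // unsynchronized count++
    private final AtomicInteger atomic = new AtomicInteger();  // atomic increment

    // Returns {plainTotal, atomicTotal} after all threads finish.
    static int[] run(int threads, int perThread) {
        AtomicityDemo d = new AtomicityDemo();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    d.plain++;                   // read, add, write: updates can be lost
                    d.atomic.incrementAndGet();  // one indivisible read-modify-write
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return new int[] { d.plain, d.atomic.get() };
    }
}
```

The atomic total always equals `threads * perThread`. The plain total can never exceed that number and, under contention, is frequently lower.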

Pillar 2: Visibility

•Definition: Changes made by one thread to shared variables must be made visible to other threads that read those variables.
•Problem: Modern CPUs use caches, registers, and write buffers, and compilers may keep a value in a register or hoist a read out of a loop. Without proper synchronization, a change made by one thread may become visible to other threads late, or never.
•Solution: Use volatile variables, synchronization, or memory barriers to ensure visibility.
•Example: Thread A sets running = false, but Thread B's loop on while(running) never terminates because it reads a stale cached value.

Pillar 3: Ordering

•Definition: Operations must appear to execute in an order that's consistent with the program's logic.
•Problem: Compilers and CPUs reorder instructions for optimization. This can break assumptions about what state is visible when.
•Solution: Memory barriers (fences), synchronization primitives, or volatile variables prevent problematic reordering.
•Example: Object initialization writes might be reordered so that a reference is published before fields are set, causing another thread to see a partially-constructed object.
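
The publication problem in the last bullet can be avoided by publishing the reference through a `volatile` field, as in the sketch below. This is one option among several (final fields or locks also work). The `SafePublicationDemo` and `Settings` names, and the values used, are assumptions for illustration only.

```java
final class SafePublicationDemo {
    // Hypothetical mutable settings object; fields are deliberately non-final.
    static final class Settings {
        int timeoutMs;
        String endpoint;
    }

    // A volatile write happens-before any read that observes it, so the field
    // writes made before publishing cannot appear to come after the reference.
    private volatile Settings current;

    void publish(int timeoutMs, String endpoint) {
        Settings s = new Settings();
        s.timeoutMs = timeoutMs;  // 1. initialize fields
        s.endpoint = endpoint;
        current = s;              // 2. publish the reference (volatile write)
    }

    Settings awaitSettings() {
        Settings s;
        while ((s = current) == null) {  // volatile read on every iteration
            Thread.yield();
        }
        return s;
    }

    // Hypothetical driver: a reader thread waits for the published object.
    static String demo() {
        SafePublicationDemo holder = new SafePublicationDemo();
        String[] seen = new String[1];
        Thread reader = new Thread(() -> {
            Settings s = holder.awaitSettings();
            seen[0] = s.endpoint + ":" + s.timeoutMs;
        });
        reader.start();
        holder.publish(500, "payments");
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return seen[0];
    }
}
```

If `current` were not volatile, the reader could in principle observe the reference before the field writes, or spin forever without ever seeing the reference.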

The Memory Model Matters

Each programming language/platform has a memory model that defines exactly what guarantees are provided about visibility and ordering. Java has the Java Memory Model (JMM), C++ has its memory model (since C++11), and so on. Thread-safe code depends on understanding and respecting these memory model guarantees.

visibility-problem.java
Java
public class VisibilityProblem {
    
    // Without 'volatile', this change may NEVER be seen by the reader thread!
    private boolean running = true;   // BUG: Missing volatile
    private int value = 0;
    
    // Writer thread
    public void stop() {
        value = 42;      // This write might not be visible either
        running = false; // This might be cached in register, never written to main memory
    }
    
    // Reader thread - may loop FOREVER even after stop() is called
    public void run() {
        while (running) {
            // JVM is free to optimize this to: if (running) { while(true) {} }
            // because it sees 'running' is never modified in this thread's view
        }
        System.out.println("Value: " + value);  // Might print 0!
    }
}
 
// CORRECT: Using volatile for visibility
// (package-private so that both classes can live in the same source file)
class VisibilityFixed {
    
    private volatile boolean running = true;  // volatile ensures visibility
    private volatile int value = 0;           // visibility guaranteed
    
    public void stop() {
        value = 42;       // Happens-before the write to running
        running = false;  // This write is guaranteed to be visible
    }
    
    public void run() {
        while (running) {
            // Loop will terminate when stop() is called
        }
        System.out.println("Value: " + value);  // Guaranteed to print 42
    }
}

Why Thread Safety is Inherently Difficult

Thread safety isn't just "another concern" to add to your checklist—it's fundamentally more challenging than sequential programming. Understanding why it's difficult helps you approach concurrent code with appropriate rigor.

The core challenge: In sequential programming, you reason about one execution path. In concurrent programming, you must reason about all possible interleavings of multiple execution paths.

Consider two threads executing just two instructions each. The number of possible interleavings is:

interleaving-explosion.txt

Analysis

Thread A: A1, A2
Thread B: B1, B2
 
Possible interleavings (6 total):
1. A1 → A2 → B1 → B2
2. A1 → B1 → A2 → B2
3. A1 → B1 → B2 → A2
4. B1 → A1 → A2 → B2
5. B1 → A1 → B2 → A2
6. B1 → B2 → A1 → A2
 
For n threads with m operations each: (n * m)! / (m!)^n interleavings
 
Example: 3 threads × 10 operations each = 30!/(10!)³ ≈ 5.55 × 10¹² interleavings
 
This is why testing CANNOT prove thread safety—you can only test a tiny fraction 
of possible executions.
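
The counting formula above is easy to check numerically. The following sketch evaluates (n * m)! / (m!)^n with `BigInteger` so that the factorials do not overflow. The `Interleavings` class and its method names are hypothetical helpers written for this page.

```java
import java.math.BigInteger;

final class Interleavings {
    // k! computed exactly with arbitrary-precision integers.
    static BigInteger factorial(int k) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= k; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    // Number of interleavings of `threads` threads with `opsPerThread`
    // operations each: (threads * opsPerThread)! / (opsPerThread!)^threads
    static BigInteger count(int threads, int opsPerThread) {
        return factorial(threads * opsPerThread)
                .divide(factorial(opsPerThread).pow(threads));
    }

    public static void main(String[] args) {
        System.out.println(count(2, 2));  // the two-thread, two-operation case: 6
        System.out.println(count(3, 10));
    }
}
```

Even small inputs grow quickly, which is the practical reason exhaustive testing of interleavings is not an option for real programs.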

This combinatorial explosion of possible states leads to several characteristics that make concurrent programming uniquely challenging:

Challenges of Concurrent Programming

•Non-Determinism: The same code can produce different results on different runs, depending on thread scheduling. A bug might occur once in 10 million executions.
•Heisenbugs: Bugs that disappear or change behavior when you try to observe them. Adding logging, breakpoints, or even print statements can change timing enough to mask the bug.
•Testing Limitations: You can run a concurrent program a million times and never trigger a race condition that will occur in production under different load patterns.
•Local Reasoning Fails: Understanding each function in isolation isn't enough. You must reason about how multiple threads interact across function boundaries.
•Composability Problems: Two individually thread-safe operations may be unsafe when combined. Safe A + Safe B ≠ Safe AB.

The Testing Fallacy

"We tested it extensively and found no concurrency bugs" is NOT evidence of thread safety. Due to the astronomical number of possible interleavings, testing can only prove the presence of bugs, never their absence. Thread safety must be established through careful design and reasoning, not just testing.

Real-world example of non-determinism:

A team deployed a payment processing system that passed all tests—including concurrent stress tests. After 3 months in production, at 2:47 AM during a traffic spike, a customer was charged twice for a single purchase. The conditions that triggered the bug: exactly 47 concurrent requests, a garbage collection pause of 23ms on a specific server, and network latency variance that caused two threads to hit a specific code path within 0.1ms of each other.

No test could have predicted this. The bug required reasoning about the code's invariants under all possible interleavings—which revealed that a compound operation wasn't properly atomic.

Thread Safety Invariants and Contracts

Thread-safe design is fundamentally about preserving invariants in the presence of concurrent access. An invariant is a condition that must always be true about an object's state. Let's formalize this concept.

Class Invariant: A predicate that holds true before and after every public method execution.

For example, a BoundedBuffer class might have these invariants:

count >= 0 && count <= capacity
writeIndex always points to the next empty slot
If count == 0, the buffer is empty
If count == capacity, the buffer is full

In single-threaded code, you only need to ensure invariants hold after each method completes. In concurrent code, you must ensure invariants hold even while multiple methods execute simultaneously.

invariant-violation.java
Java
public class InvariantViolation {
    
    // Invariant: items.length == capacity, 0 <= count <= capacity
    private final Object[] items;
    private final int capacity;
    private int count = 0;
    private int putIndex = 0;
    private int takeIndex = 0;
    
    public InvariantViolation(int capacity) {
        this.capacity = capacity;
        this.items = new Object[capacity];
    }
    
    // BUG: Invariant can be violated between check and modification.
    // (This version busy-waits instead of calling wait(): without holding the
    // object's monitor, wait() and notifyAll() throw IllegalMonitorStateException.)
    public void put(Object item) {
        // WINDOW OF VULNERABILITY BEGINS
        while (count == capacity) {  // <- Thread A checks: count=10, capacity=10
            Thread.yield();          // <- Thread A spins until it sees free space
        }
        // Thread B could decrement count here, but Thread C might also be
        // spinning and about to put: both think there's space!
        
        items[putIndex] = item;       // <- Both threads might write to same slot!
        putIndex = (putIndex + 1) % capacity;
        count++;                      // <- count could exceed capacity!
        // WINDOW OF VULNERABILITY ENDS
    }
    
    // CORRECT: Synchronized to protect invariant
    public synchronized void putSafe(Object item) throws InterruptedException {
        while (count == capacity) {
            wait();  // Releases lock while waiting
        }
        // No other thread can execute putSafe/takeSafe while we hold the lock
        items[putIndex] = item;
        putIndex = (putIndex + 1) % capacity;
        count++;
        notifyAll();
    }
}
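
The listing above shows only the producer side. A fuller sketch with a matching `take` method might look like the following. The `BoundedBufferSketch` class and its `demo` driver are hypothetical, and the design simply guards every field with the object's intrinsic lock.

```java
final class BoundedBufferSketch {
    // Invariant (guarded by this object's monitor): 0 <= count <= items.length
    private final Object[] items;
    private int count = 0;
    private int putIndex = 0;
    private int takeIndex = 0;

    BoundedBufferSketch(int capacity) {
        items = new Object[capacity];
    }

    synchronized void put(Object item) throws InterruptedException {
        while (count == items.length) {
            wait();  // releases the lock while waiting for free space
        }
        items[putIndex] = item;
        putIndex = (putIndex + 1) % items.length;
        count++;
        notifyAll();  // wake any consumers waiting for an element
    }

    synchronized Object take() throws InterruptedException {
        while (count == 0) {
            wait();  // releases the lock while waiting for an element
        }
        Object item = items[takeIndex];
        items[takeIndex] = null;  // drop the reference so it can be collected
        takeIndex = (takeIndex + 1) % items.length;
        count--;
        notifyAll();  // wake any producers waiting for free space
        return item;
    }

    // Hypothetical driver: one producer puts 1..n, the caller sums what it takes.
    static long demo(int n) {
        BoundedBufferSketch buffer = new BoundedBufferSketch(2);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= n; i++) {
                    buffer.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        long sum = 0;
        try {
            for (int i = 0; i < n; i++) {
                sum += (Integer) buffer.take();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return sum;
    }
}
```

Both methods re-check their condition in a `while` loop after waking, because another thread may have changed `count` between the notification and the moment the lock is reacquired.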

Thread Safety Contracts

Beyond invariants, thread-safe classes should have clear contracts that document:

What consistency guarantees are provided?
- Are operations linearizable (appear to happen instantaneously)?
- Are reads guaranteed to see the most recent write?
- What ordering guarantees exist?
What synchronization is required from the caller (if any)?
- Conditional thread safety should be clearly documented.
- What external locks must be held for specific operations?
What are the thread-safety boundaries?
- Which methods are safe to call concurrently?
- What sequences of operations need external synchronization?

Documentation example:

thread-safety-contract.java
Java
import java.util.concurrent.atomic.AtomicLong;

/**
 * A thread-safe counter with atomic increment and decrement operations.
 * 
 * <h2>Thread Safety</h2>
 * This class is <b>unconditionally thread-safe</b>. All individual operations
 * are atomic with respect to concurrent access.
 * 
 * <h3>Atomicity</h3>
 * Each method (increment, decrement, getValue) is atomic. However, compound operations
 * combining multiple methods are NOT atomic without external synchronization.
 * 
 * For example, the following is NOT safe:
 * <pre>
 * // BUG: Not atomic - another thread could modify between calls
 * if (counter.getValue() > 0) {
 *     counter.decrement();
 * }
 * </pre>
 * 
 * <h3>Visibility</h3>
 * All writes are immediately visible to subsequent reads from any thread.
 * 
 * <h3>Ordering</h3>
 * Operations by a single thread appear in program order. Operations across
 * threads are linearizable - they appear to occur in some total order consistent
 * with the real-time ordering of operations.
 * 
 * @ThreadSafe
 */
public class AtomicCounter {
    private final AtomicLong value = new AtomicLong(0);
    
    /** Atomically increments and returns the new value. */
    public long increment() {
        return value.incrementAndGet();
    }
    
    /** Atomically decrements and returns the new value. */
    public long decrement() {
        return value.decrementAndGet();
    }
    
    /** Returns the current value. */
    public long getValue() {
        return value.get();
    }
    
    /**
     * Atomically decrements if value is positive.
     * <p>
     * Use this instead of check-then-decrement for safe conditional decrement.
     * 
     * @return true if decrement was performed, false if value was already 0
     */
    public boolean decrementIfPositive() {
        while (true) {
            long current = value.get();
            if (current <= 0) {
                return false;
            }
            if (value.compareAndSet(current, current - 1)) {
                return true;
            }
            // CAS failed due to concurrent modification - retry
        }
    }
}
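
The hand-written retry loop in `decrementIfPositive` can also be expressed with `AtomicLong.getAndUpdate`, which performs the compare-and-set retry internally. The sketch below is an alternative formulation rather than the page's own method; `PermitPool` and its `demo` driver are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical example: a pool of permits that must never go below zero.
final class PermitPool {
    private final AtomicLong permits;

    PermitPool(long initial) {
        permits = new AtomicLong(initial);
    }

    // Atomically takes one permit if any remain. getAndUpdate returns the value
    // seen before the update, so a positive result means we decremented it.
    // The lambda may run more than once under contention and must be side-effect free.
    boolean tryAcquire() {
        return permits.getAndUpdate(v -> v > 0 ? v - 1 : v) > 0;
    }

    long remaining() {
        return permits.get();
    }

    // Hypothetical driver: returns how many acquisitions succeeded in total.
    static int demo(long initial, int threads, int attemptsPerThread) {
        PermitPool pool = new PermitPool(initial);
        AtomicInteger granted = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < attemptsPerThread; i++) {
                    if (pool.tryAcquire()) {
                        granted.incrementAndGet();
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return granted.get();
    }
}
```

With 100 permits and 400 attempts spread over 8 threads, exactly 100 acquisitions succeed, however the threads interleave.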

Common Misconceptions About Thread Safety

Before moving forward in our exploration of thread safety, let's address and dispel several common misconceptions that lead developers astray.

Misconceptions (Wrong)

•"Reads are always thread-safe"
•"Only writes need synchronization"
•"Single operations are atomic"
•"Making everything synchronized makes it thread-safe"
•"Thread-safe components compose into thread-safe systems"
•"Modern hardware handles this automatically"

Reality (Correct)

•Reads without synchronization may see stale or partially-written data
•Reads must be synchronized to see writes from other threads (visibility)
•Even simple operations like i++ are compound (read + modify + write)
•Too much synchronization can cause deadlocks and terrible performance
•Two thread-safe components can interact unsafely
•CPU/compiler reordering requires explicit memory ordering constraints

Let's examine some of these in detail:

"Single operations are atomic"

This is one of the most dangerous misconceptions. In most languages:

count++ is NOT atomic (read, add, write)
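longValue = 42L may NOT be atomic for a non-volatile long on some 32-bit platforms (it can be split into two 32-bit writes)
reference = newObject may NOT be atomic in some memory models (Java guarantees atomic reference writes, but not visibility or safe publication without synchronization)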
list.add(item) is NOT atomic (resize, copy, increment size)

Only operations explicitly guaranteed as atomic by your language/platform are safe.

The Composition Fallacy

Using thread-safe collections everywhere does NOT make your code thread-safe. A ConcurrentHashMap guarantees individual operations are atomic, but sequences of operations (get, modify, put) are not atomic. Two thread-safe operations do not compose into a single thread-safe operation.

composition-fallacy.java
Java
import java.util.concurrent.ConcurrentHashMap;
 
public class CompositionFallacy {
    
    // ConcurrentHashMap is a thread-safe data structure
    private ConcurrentHashMap<String, Integer> counts = new ConcurrentHashMap<>();
    
    // BUG: Not thread-safe despite using ConcurrentHashMap!
    public void incrementBroken(String key) {
        Integer current = counts.get(key);      // Thread A gets 5
                                                 // Thread B gets 5
        if (current == null) {
            counts.put(key, 1);                  // Both might put 1
        } else {
            counts.put(key, current + 1);        // Both put 6, should be 7!
        }
    }
    
    // CORRECT: Use atomic operation
    public void incrementCorrect(String key) {
        counts.merge(key, 1, Integer::sum);  // Atomic compute operation
        
        // Or use compute:
        // counts.compute(key, (k, v) -> v == null ? 1 : v + 1);
    }
    
    // ALSO CORRECT: Compare-and-swap loop
    public void incrementCAS(String key) {
        while (true) {
            Integer current = counts.get(key);
            Integer next = (current == null) ? 1 : current + 1;
            
            if (current == null) {
                if (counts.putIfAbsent(key, next) == null) return;
            } else {
                if (counts.replace(key, current, next)) return;
            }
            // CAS failed - retry
        }
    }
}

Summary: Foundations of Thread Safety

We've established the foundational concepts of thread safety that will guide all our subsequent discussions. Let's consolidate the key takeaways:

Key Takeaways

•Thread safety is about correctness under concurrent access — guaranteeing that code behaves correctly regardless of how threads are scheduled or interleaved.
•Thread safety has multiple levels — from immutable (strongest) to thread-hostile (weakest). Most classes are thread-compatible, requiring caller synchronization.
•Three pillars underpin thread safety — Atomicity (operations complete as units), Visibility (changes are seen by other threads), and Ordering (operations appear in consistent order).
•Concurrent reasoning is fundamentally harder — the number of possible interleavings grows factorially, making testing insufficient for proving safety.
•Thread safety is about invariant preservation — ensuring class conditions remain true even during concurrent mutation.
•Common misconceptions are dangerous — assuming reads are safe, that simple operations are atomic, or that thread-safe components compose safely.

What's next:

Now that we understand what thread safety means and why it's challenging, we'll explore the root cause of most thread safety problems: shared mutable state. The next page examines why sharing and mutability together create the conditions for concurrency bugs, and how this insight guides our design decisions.

Page Complete

You now have a rigorous understanding of what thread safety means, its levels, the three pillars that underpin it, and why concurrent reasoning is fundamentally challenging. This foundation prepares you to explore the specific problems—shared mutable state and race conditions—and the design patterns that address them.