Union-Find (DSU) - Learning Module


The Union-Find Data Structure

When Connectivity Becomes the Core Problem

Imagine you're building a social network. Users arrive continuously and form friendships. At any moment, someone might ask: "Are Alice and David in the same friend network?" Or consider a computer network where cables connect machines—you need to quickly answer: "Can server A communicate with server B?"

These questions share a common structure. They're not asking about shortest paths, distances, or optimal routes. They're asking something simpler yet profound: Are these two elements connected?

This is the dynamic connectivity problem, one of the most fundamental questions in computer science. It sounds simple, but answering it efficiently while new connections keep forming requires a carefully designed data structure: Union-Find, also known as Disjoint Set Union (DSU). Note that Union-Find handles connections that form; it does not support removing them.

What You Will Learn

By the end of this page, you will understand what Union-Find is, why it exists, what problems it solves, and how it represents disjoint sets internally. You'll see why this data structure is considered one of the most beautiful inventions in computer science—combining simplicity with extraordinary efficiency.

The Dynamic Connectivity Problem

Before we dive into the solution, let's precisely define the problem that Union-Find so elegantly addresses.

The Dynamic Connectivity Problem:

You have n elements (numbered 0 to n-1). Initially, each element is in its own isolated set. You must support two operations:

Union(a, b): Merge the sets containing elements a and b into a single set
Find(a): Determine which set element a belongs to (typically by returning a representative element of that set)

With these two operations, you can answer connectivity queries: "Are a and b in the same set?" by checking if Find(a) == Find(b).

Why 'Disjoint Sets'?

The sets in Union-Find are disjoint—no element can belong to more than one set at any time. When we union two sets, we merge them completely. This property is crucial: at any moment, the collection of sets forms a partition of all elements, where every element belongs to exactly one set.

Why is this problem interesting?

At first glance, you might think: "Just use a graph and run BFS or DFS for each connectivity query." But consider the scale:

A social network with 1 billion users making millions of friend connections daily
Each connectivity query with BFS/DFS: O(V + E) time
With millions of queries per day: every query may traverse a large part of the graph, which is far too slow

We need something dramatically faster—operations that approach constant time even as the network grows to billions of elements. This is exactly what Union-Find delivers.
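
To make the baseline concrete, here is a minimal sketch of the BFS approach described above. The function name `connectedByBfs` and the adjacency-list layout are illustrative assumptions, not part of this module. Every query may touch each vertex and edge, which is where the O(V + E) cost comes from.

```typescript
// Baseline sketch: answer ONE connectivity query by breadth-first search.
// adj[i] lists the neighbors of vertex i (undirected graph).
function connectedByBfs(adj: number[][], source: number, target: number): boolean {
    if (source === target) return true;
    const visited = new Array<boolean>(adj.length).fill(false);
    const queue: number[] = [source];
    visited[source] = true;
    // An index-based queue head avoids the O(n) cost of Array.prototype.shift
    for (let head = 0; head < queue.length; head++) {
        const node = queue[head];
        for (const next of adj[node]) {
            if (next === target) return true;
            if (!visited[next]) {
                visited[next] = true;
                queue.push(next);
            }
        }
    }
    return false;
}

// Small undirected graph with edges 0-1, 1-2, and 3-4
const bfsGraph: number[][] = [[1], [0, 2], [1], [4], [3]];
console.log(connectedByBfs(bfsGraph, 0, 2)); // true
console.log(connectedByBfs(bfsGraph, 0, 4)); // false
```

Nothing is remembered between calls, so each query pays for a full traversal again. Union-Find avoids this by keeping a summary of connectivity up to date as edges arrive.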

Naive Approaches vs. Union-Find
Approach	Union Time	Find/Query Time	Practicality at Scale
Store list of connections, run BFS	O(1)	O(V + E)	Impractical for frequent queries
Maintain full adjacency matrix	O(1)	O(V²) for transitive closure	Memory prohibitive
Rebuild connected components	O(V + E)	O(1) after rebuild	Impractical for frequent unions
Union-Find with optimizations	α(n) ≈ O(1)*	α(n) ≈ O(1)*	Excellent at any scale

* Amortized over a sequence of operations, with path compression and union by rank.

The Magic of α(n)

The α(n) in the table is the inverse Ackermann function. It grows so slowly that α(n) ≤ 4 for any number of elements that could arise in practice. We'll explore this remarkable efficiency in later pages. For now, understand that Union-Find with both optimizations achieves essentially constant amortized time per operation.

What is Union-Find?

Union-Find (also called Disjoint Set Union or DSU) is a data structure that maintains a collection of disjoint (non-overlapping) sets. It provides near-constant-time operations to:

Find which set an element belongs to
Union two sets into one

The brilliance of Union-Find lies in its representation. Rather than explicitly storing set membership lists, it represents each set as a tree. Elements are nodes, and each node points to its parent. The root of each tree serves as the representative (or canonical element) of the set.

Key insight: To find which set an element belongs to, follow parent pointers until you reach the root. Two elements are in the same set if and only if they have the same root.

Initially, 5 elements (0-4), each in its own set:
 
  [0]   [1]   [2]   [3]   [4]
   ↑     ↑     ↑     ↑     ↑
  root  root  root  root  root
 
After Union(0, 1) - element 1 now points to 0:
 
  [0]   [2]   [3]   [4]
   ↑     ↑     ↑     ↑
   1
   
Element 0 is root (representative) of set {0, 1}
Elements 2, 3, 4 are each their own root
 
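After Union(2, 3) and Union(3, 4):
 
  [0]        [2]
   ↑        ↑   ↑
   1        3   4
 
Union(3, 4) first finds the root of 3, which is 2, and attaches 4 to that root.
Now we have two sets: {0, 1} and {2, 3, 4}
 
After Union(0, 2) - now all elements in one set:
 
       [0]
      /   \
    [1]   [2]
         /   \
       [3]   [4]
 
Each node points up to its parent.
All elements connected through root 0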

The representation is minimal and elegant:

All we need is a parent array where parent[i] stores the parent of element i. If an element is a root, it is its own parent: parent[i] = i.

This single array encodes the entire partition structure:

Space complexity: O(n) — just one integer per element
Finding the set: Follow parent pointers to the root
Unioning sets: Make one root point to the other

// The most minimal Union-Find representation
class UnionFindBasic {
    private parent: number[];
    
    constructor(n: number) {
        // Initially, each element is its own parent (its own set)
        this.parent = Array.from({ length: n }, (_, i) => i);
        // parent = [0, 1, 2, 3, 4] means 5 separate sets
    }
    
    // Find the root (representative) of element x
    find(x: number): number {
        // Keep following parent pointers until we reach the root
        // Root is identified by parent[x] === x
        while (this.parent[x] !== x) {
            x = this.parent[x];
        }
        return x;
    }
    
    // Union the sets containing x and y
    union(x: number, y: number): void {
        const rootX = this.find(x);
        const rootY = this.find(y);
        
        // If already in the same set, nothing to do
        if (rootX !== rootY) {
            // Make rootY's parent be rootX
            // (arbitrary choice - we'll optimize this later)
            this.parent[rootY] = rootX;
        }
    }
    
    // Check if x and y are in the same set
    connected(x: number, y: number): boolean {
        return this.find(x) === this.find(y);
    }
}

This Basic Version Has a Problem

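The naive implementation shown above can degenerate into a linked list in the worst case, making Find operations O(n). Because union(x, y) attaches y's root under x's root, imagine the sequence Union(1,0), Union(2,1), Union(3,2), ... Each call hangs the entire existing tree under a brand-new root, creating a chain where finding element 0 requires traversing n-1 pointers. (The sequence Union(0,1), Union(1,2), ... would not do this: every element would attach directly to root 0.) We'll fix this with optimizations in later pages.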
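
The degenerate case can be reproduced in a few lines. This is a sketch that mirrors the basic class's rule of attaching the second root under the first; the helper names (`naiveFind`, `naiveUnion`, `depthOf`) are made up for this illustration. The sequence Union(1,0), Union(2,1), Union(3,2), ... hangs the whole existing tree under a new root each time.

```typescript
// Naive find/union on a plain parent array (no optimizations).
function naiveFind(parent: number[], x: number): number {
    while (parent[x] !== x) {
        x = parent[x];
    }
    return x;
}

function naiveUnion(parent: number[], x: number, y: number): void {
    const rootX = naiveFind(parent, x);
    const rootY = naiveFind(parent, y);
    if (rootX !== rootY) {
        parent[rootY] = rootX; // y's root goes under x's root
    }
}

// Count parent-pointer hops from x up to its root
function depthOf(parent: number[], x: number): number {
    let hops = 0;
    while (parent[x] !== x) {
        x = parent[x];
        hops++;
    }
    return hops;
}

const chainSize = 8;
const chainParent = Array.from({ length: chainSize }, (_, i) => i);
// Union(1,0), Union(2,1), ..., Union(7,6)
for (let i = 1; i < chainSize; i++) {
    naiveUnion(chainParent, i, i - 1);
}
console.log(chainParent.join(" "));   // 1 2 3 4 5 6 7 7
console.log(depthOf(chainParent, 0)); // 7, i.e. n - 1 hops
```

With n elements, the last Find on element 0 walks n - 1 pointers, so a sequence of such queries costs O(n) each.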

The Disjoint Set Abstraction

Let's formalize what Union-Find represents mathematically. This abstraction helps us reason about correctness and understand why certain operations make sense.

Set Partition:

A partition of a set S is a collection of non-empty, pairwise disjoint subsets of S whose union equals S. In mathematical notation:

If P = {S₁, S₂, ..., Sₖ} is a partition of S, then:
- Each Sᵢ ≠ ∅ (non-empty)
- Sᵢ ∩ Sⱼ = ∅ for i ≠ j (disjoint)
- S₁ ∪ S₂ ∪ ... ∪ Sₖ = S (complete coverage)

Union-Find maintains exactly such a partition. Every element belongs to exactly one set, and the collection of all sets covers all elements.
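
The three conditions can be checked mechanically. The sketch below assumes the elements are 0 to n-1 and that a candidate partition is given as an array of blocks; `isPartition` is a name invented for this illustration.

```typescript
// Checks the three partition conditions for candidate blocks over elements 0..n-1.
function isPartition(n: number, blocks: number[][]): boolean {
    const seen = new Array<boolean>(n).fill(false);
    let covered = 0;
    for (const block of blocks) {
        if (block.length === 0) return false; // every block must be non-empty
        for (const x of block) {
            if (!Number.isInteger(x) || x < 0 || x >= n) return false; // not an element of S
            if (seen[x]) return false; // element appears twice: blocks not disjoint
            seen[x] = true;
            covered++;
        }
    }
    return covered === n; // complete coverage of S
}

console.log(isPartition(5, [[0, 1], [2, 3, 4]]));    // true
console.log(isPartition(5, [[0, 1], [1, 2, 3, 4]])); // false (1 is in two blocks)
console.log(isPartition(5, [[0, 1], [2, 3]]));       // false (4 is not covered)
```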

Key Properties of Union-Find

• Partition Invariant: At all times, the structure represents a valid partition of elements. No element is orphaned; no element belongs to multiple sets.
• Monotonic Merging: Unions only merge sets; they never split them. Once two elements are in the same set, they remain connected forever. This is crucial for correctness.
• Representative Consistency: Within a set, all elements have the same representative (root). The representative uniquely identifies the set.
• Transitivity: If a and b are connected, and b and c are connected, then a and c are connected. Union-Find naturally maintains transitive closure.
• Idempotency: Union(a, b) when a and b are already connected is a no-op. Find(a) always returns the same value until a union involving a's set occurs.

Why Trees?

You might wonder why we use trees rather than other representations. Consider the alternatives:

Alternative 1: Explicit set membership lists

Store setId[i] for each element
Find: O(1) — just return setId[i]
Union: O(n) — must update setId for all elements in one set
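
A minimal sketch of this first alternative (often called "quick-find") shows where the O(n) union cost comes from. The class name `QuickFind` is an assumption made for this example.

```typescript
// Quick-find sketch: setId[i] is the label of the set containing element i.
class QuickFind {
    private setId: number[];

    constructor(n: number) {
        this.setId = Array.from({ length: n }, (_, i) => i);
    }

    // O(1): the label is stored directly
    find(x: number): number {
        return this.setId[x];
    }

    // O(n): every element carrying y's label must be relabeled
    union(x: number, y: number): void {
        const idX = this.setId[x];
        const idY = this.setId[y];
        if (idX === idY) return;
        for (let i = 0; i < this.setId.length; i++) {
            if (this.setId[i] === idY) {
                this.setId[i] = idX;
            }
        }
    }
}

const qf = new QuickFind(5);
qf.union(0, 1);
qf.union(3, 4);
qf.union(1, 4);
console.log(qf.find(4) === qf.find(0)); // true
console.log(qf.find(2) === qf.find(0)); // false
```

The full scan in `union` runs even when the set being relabeled is tiny, which is the cost the tree representation avoids.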

Alternative 2: Linked lists with head pointers

Each set is a linked list, with a map from element to head
Find: O(1) if we maintain the map
Union: O(size of the smaller set), because every element of the list being appended needs its head entry updated

The tree representation:

No explicit set membership storage
Find: O(tree height)
Union: O(tree height) — just link two roots

When we add path compression and union by rank (covered in later pages), tree heights become nearly constant, giving us the best of all worlds: near-O(1) for both operations with O(n) space.

The Elegance of Implicit Structure

Union-Find is a masterclass in implicit data structure design. Instead of maintaining complex explicit relationships, we let the tree structure emerge naturally. The parent array is all we need—everything else (set membership, connectivity, partitioning) is derived from following pointers. This simplicity is what enables the remarkable optimizations.

Core Invariants and Properties

Understanding Union-Find's invariants is essential for both implementing it correctly and reasoning about its behavior. These invariants hold after every operation completes.

Fundamental Invariants

• Root Self-Reference: For every root r, parent[r] = r. This is how we identify roots and terminate the Find operation.
• Reachability to Root: From any element x, following parent pointers eventually reaches a root. There are no cycles other than root self-loops.
• Unique Root per Set: Each connected component (set) has exactly one root. This root serves as the canonical representative.
• Union Preserves Connectivity: After Union(a, b), Find(a) = Find(b). Elements united stay united.
• Find Stability: Between unions, Find(x) returns the same value. Find operations do not change logical set membership (though they may restructure the tree).

Derived Properties:

From these invariants, we can derive several useful properties:

Property 1: Connectivity is an equivalence relation

Reflexive: Every element is connected to itself
Symmetric: If a is connected to b, then b is connected to a
Transitive: If a↔b and b↔c, then a↔c

Property 2: The partition only becomes coarser

We start with n sets (finest partition)
Each union reduces the number of sets by at most 1
We end with at least 1 set (coarsest partition)

Property 3: The number of unions is bounded

At most n-1 meaningful unions can occur (each reduces set count by 1)
Starting with n sets, n-1 unions leave exactly 1 set
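
Property 3 can be observed directly by counting how many unions actually merge two sets. The following is a small sketch; `countMerges` and its return shape are invented for this illustration. Redundant unions change nothing, and the number of sets always equals n minus the number of successful merges.

```typescript
// Applies a list of union requests and reports how many actually merged two sets.
function countMerges(
    n: number,
    pairs: Array<[number, number]>
): { merges: number; sets: number } {
    const parent = Array.from({ length: n }, (_, i) => i);
    const find = (x: number): number => {
        while (parent[x] !== x) {
            x = parent[x];
        }
        return x;
    };
    let merges = 0;
    for (const [a, b] of pairs) {
        const rootA = find(a);
        const rootB = find(b);
        if (rootA !== rootB) {
            parent[rootB] = rootA;
            merges++; // only a union of two different sets counts
        }
    }
    return { merges, sets: n - merges };
}

// Six requests on 4 elements, but only 3 (= n - 1) can be meaningful
const tally = countMerges(4, [[0, 1], [1, 0], [2, 3], [0, 3], [1, 2], [3, 3]]);
console.log(tally.merges); // 3
console.log(tally.sets);   // 1
```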

// Helper methods to verify Union-Find invariants
// Useful for debugging and testing
 
class UnionFindWithVerification {
    private parent: number[];
    private n: number;
    
    constructor(n: number) {
        this.n = n;
        this.parent = Array.from({ length: n }, (_, i) => i);
    }
    
    // Verify all invariants hold (expensive, for debugging only)
    verifyInvariants(): boolean {
        // 1. Check that every parent pointer is a valid index.
        //    This runs first so the walks below never leave the array.
        for (let i = 0; i < this.n; i++) {
            const p = this.parent[i];
            if (!Number.isInteger(p) || p < 0 || p >= this.n) {
                console.error(`Invariant violated: parent[${i}] = ${p} is out of range`);
                return false;
            }
        }
        
        // 2. Check reachability (no cycles other than root self-loops).
        //    We walk with a visited set rather than calling find(),
        //    because find() would loop forever on a corrupted structure.
        for (let i = 0; i < this.n; i++) {
            const visited = new Set<number>();
            let current = i;
            while (!visited.has(current)) {
                visited.add(current);
                if (this.parent[current] === current) break; // Reached root
                current = this.parent[current];
            }
            // Should have stopped at a self-referential root, not at a repeated node
            if (this.parent[current] !== current) {
                console.error(`Invariant violated: cycle detected from ${i}`);
                return false;
            }
        }
        
        // 3. Unique root per set and root self-reference follow from checks 1 and 2:
        //    every walk ends at exactly one element r with parent[r] === r
        
        console.log("All invariants satisfied!");
        return true;
    }
    
    private findWithoutCompression(x: number): number {
        while (this.parent[x] !== x) {
            x = this.parent[x];
        }
        return x;
    }
    
    find(x: number): number {
        return this.findWithoutCompression(x);
    }
    
    union(x: number, y: number): void {
        const rootX = this.find(x);
        const rootY = this.find(y);
        if (rootX !== rootY) {
            this.parent[rootY] = rootX;
        }
    }
}

Real-World Applications

Union-Find isn't just a theoretical construct—it's a workhorse data structure that appears in countless real-world applications. Understanding these applications helps you recognize when Union-Find is the right tool for a problem.

Why Union-Find appears everywhere:

Any problem that involves grouping elements, detecting when elements become connected, or maintaining equivalence classes is a potential Union-Find application. The structure's efficiency makes it practical even for massive datasets.

Union-Find Applications Across Domains
Domain	Application	How Union-Find Helps
Graph Algorithms	Kruskal's MST Algorithm	Detect if adding an edge would create a cycle
Graph Algorithms	Connected Components	Track and query component membership dynamically
Network Design	Network Connectivity	Determine if two nodes can communicate
Computational Physics	Percolation Simulation	Model fluid flow through porous materials
Social Networks	Friend Circles	Find and count distinct friend networks
Compilers	Type Unification	Merge type variables during type inference
Gaming	Dynamic Terrain	Track connected regions as terrain changes
Clustering	Single-Linkage Clustering	Build hierarchical clusters bottom-up

Deep Dive: Kruskal's Algorithm

Perhaps the most famous application of Union-Find is in Kruskal's algorithm for finding Minimum Spanning Trees. The algorithm:

Sort all edges by weight
For each edge (u, v) in sorted order:
- If u and v are in different components: add edge to MST, union(u, v)
- If u and v are in the same component: skip (would create cycle)

Union-Find answers the critical question "Are u and v already connected?" in near-constant amortized time, so the initial edge sort dominates the running time even on large graphs.

interface Edge {
    u: number;
    v: number;
    weight: number;
}
 
function kruskalMST(n: number, edges: Edge[]): Edge[] {
    // Sort a copy by weight so the caller's array is not reordered
    const sorted = [...edges].sort((a, b) => a.weight - b.weight);
    
    const uf = new UnionFind(n);  // Assume optimized Union-Find
    const mst: Edge[] = [];
    
    for (const edge of sorted) {
        // Key question: Are u and v already connected?
        // Union-Find answers this in near O(1) time
        if (!uf.connected(edge.u, edge.v)) {
            uf.union(edge.u, edge.v);
            mst.push(edge);
            
            // MST complete when we have n-1 edges
            if (mst.length === n - 1) break;
        }
    }
    
    return mst;
}
 
// Without Union-Find, we'd need O(V + E) per connectivity check
// With Union-Find, the entire algorithm runs in O(E log E) time
// (dominated by sorting, not connectivity checks)
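
The cycle test at the heart of Kruskal's algorithm can also be shown on its own. The sketch below is self-contained and uses a hypothetical helper, `firstCycleEdge`, with an inline find that applies path halving (a simple form of path compression, used here as a stand-in for the optimized structure the snippet above assumes).

```typescript
// Returns the first edge (in input order) whose endpoints are already connected,
// or null if the undirected edge list contains no cycle.
function firstCycleEdge(
    n: number,
    edgeList: Array<[number, number]>
): [number, number] | null {
    const parent = Array.from({ length: n }, (_, i) => i);
    const find = (x: number): number => {
        while (parent[x] !== x) {
            parent[x] = parent[parent[x]]; // path halving: skip to the grandparent
            x = parent[x];
        }
        return x;
    };
    for (const [u, v] of edgeList) {
        const rootU = find(u);
        const rootV = find(v);
        if (rootU === rootV) {
            return [u, v]; // u and v were already connected: this edge closes a cycle
        }
        parent[rootV] = rootU;
    }
    return null;
}

console.log(firstCycleEdge(4, [[0, 1], [1, 2], [2, 0], [2, 3]])); // [2, 0]
console.log(firstCycleEdge(4, [[0, 1], [1, 2], [2, 3]]));         // null
```

Kruskal's algorithm performs the same test on each edge in weight order and simply skips the edges for which it succeeds.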

Deep Dive: Percolation

Percolation is a fascinating application in computational physics. Imagine a porous material represented as an n×n grid. Each cell can be open (permeable) or closed (blocked). We want to know: Can fluid flow from the top to the bottom?

This happens if and only if there's a path of open cells connecting any top-row cell to any bottom-row cell. As cells randomly open, we use Union-Find to efficiently track when percolation occurs.

The trick: Create two virtual nodes—one connected to all top-row cells, one connected to all bottom-row cells. Percolation occurs when these virtual nodes become connected.

The Virtual Node Technique

Adding virtual nodes that connect to multiple real nodes is a powerful Union-Find pattern. Instead of checking connectivity between many pairs, you check connectivity between two virtual nodes. This reduces multiple queries to a single query—a technique applicable to many problems.
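
Here is one way the virtual node technique might look for percolation. This is a sketch under assumptions: the class name `Percolation`, the row-major cell indexing, and the inline union-find with path halving are choices made for this example rather than anything prescribed by this module.

```typescript
// Percolation on an n-by-n grid using two virtual nodes.
// Cell (r, c) maps to index r * n + c; the virtual nodes use the next two indices.
class Percolation {
    private n: number;
    private parent: number[];
    private isOpen: boolean[];
    private top: number;
    private bottom: number;

    constructor(n: number) {
        this.n = n;
        this.top = n * n;        // virtual node joined to every open top-row cell
        this.bottom = n * n + 1; // virtual node joined to every open bottom-row cell
        this.parent = Array.from({ length: n * n + 2 }, (_, i) => i);
        this.isOpen = new Array<boolean>(n * n).fill(false);
    }

    private find(x: number): number {
        while (this.parent[x] !== x) {
            this.parent[x] = this.parent[this.parent[x]]; // path halving
            x = this.parent[x];
        }
        return x;
    }

    private union(a: number, b: number): void {
        const rootA = this.find(a);
        const rootB = this.find(b);
        if (rootA !== rootB) {
            this.parent[rootB] = rootA;
        }
    }

    // Open a cell and connect it to its open neighbors (and virtual nodes)
    openCell(r: number, c: number): void {
        const id = r * this.n + c;
        if (this.isOpen[id]) return;
        this.isOpen[id] = true;
        if (r === 0) this.union(id, this.top);
        if (r === this.n - 1) this.union(id, this.bottom);
        const neighbors: Array<[number, number]> = [
            [r - 1, c], [r + 1, c], [r, c - 1], [r, c + 1],
        ];
        for (const [nr, nc] of neighbors) {
            const inside = nr >= 0 && nr < this.n && nc >= 0 && nc < this.n;
            if (inside && this.isOpen[nr * this.n + nc]) {
                this.union(id, nr * this.n + nc);
            }
        }
    }

    // A single query replaces checking every (top cell, bottom cell) pair
    percolates(): boolean {
        return this.find(this.top) === this.find(this.bottom);
    }
}

const grid = new Percolation(3);
grid.openCell(0, 1);
grid.openCell(1, 1);
console.log(grid.percolates()); // false
grid.openCell(2, 1);
console.log(grid.percolates()); // true
```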

Interface and API Design

A well-designed Union-Find interface is clean, intuitive, and expressive. Let's examine the standard API and some useful extensions.

Core Operations:

Operation	Purpose	Typical Signature
`find(x)`	Find representative of x's set	`find(x: number): number`
`union(x, y)`	Merge sets containing x and y	`union(x: number, y: number): boolean`
`connected(x, y)`	Check if x and y are in same set	`connected(x: number, y: number): boolean`

interface IUnionFind {
    // Core operations
    
    /**
     * Find the representative (root) of the set containing x.
     * Elements in the same set have the same representative.
     */
    find(x: number): number;
    
    /**
     * Merge the sets containing x and y.
     * Returns true if a merge occurred, false if x and y were already connected.
     */
    union(x: number, y: number): boolean;
    
    /**
     * Check if x and y are in the same set.
     * Equivalent to: find(x) === find(y)
     */
    connected(x: number, y: number): boolean;
    
    // Extended operations (optional but useful)
    
    /**
     * Return the size of the set containing x.
     * Requires tracking sizes during unions.
     */
    getSize(x: number): number;
    
    /**
     * Return the total number of disjoint sets.
     * Decreases with each successful union.
     */
    getCount(): number;
    
    /**
     * Get all elements in the same set as x.
     * Note: This is O(n) as we must scan all elements.
     */
    getSetMembers(x: number): number[];
}
 
// Example usage
function demonstrateUnionFind() {
    const uf = new UnionFind(10);  // 10 elements: 0-9
    
    console.log(uf.getCount());    // 10 (each element is its own set)
    
    uf.union(0, 1);
    uf.union(2, 3);
    uf.union(4, 5);
    
    console.log(uf.getCount());    // 7 (three pairs, four singles)
    
    console.log(uf.connected(0, 1));  // true
    console.log(uf.connected(0, 2));  // false
    
    uf.union(1, 3);  // Merges {0,1} with {2,3}
    
    console.log(uf.connected(0, 2));  // true (now in same set)
    console.log(uf.getSize(0));       // 4 (set is {0,1,2,3})
}
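
The snippets on this page assume a `UnionFind` class exists without defining it. As one possible array-based implementation of the operations listed above, here is a sketch that uses union by size (instead of union by rank) together with two-pass path compression. Treat it as an illustration of the interface, not as the definitive version developed in later pages.

```typescript
// Array-based sketch with union by size and path compression.
class UnionFind {
    private parent: number[];
    private size: number[]; // size[r] is meaningful only when r is a root
    private count: number;  // current number of disjoint sets

    constructor(n: number) {
        this.parent = Array.from({ length: n }, (_, i) => i);
        this.size = new Array<number>(n).fill(1);
        this.count = n;
    }

    find(x: number): number {
        // First pass: locate the root
        let root = x;
        while (this.parent[root] !== root) {
            root = this.parent[root];
        }
        // Second pass: point every node on the path directly at the root
        while (this.parent[x] !== root) {
            const next = this.parent[x];
            this.parent[x] = root;
            x = next;
        }
        return root;
    }

    union(x: number, y: number): boolean {
        let rootX = this.find(x);
        let rootY = this.find(y);
        if (rootX === rootY) return false;
        // Attach the smaller tree under the larger one
        if (this.size[rootX] < this.size[rootY]) {
            const tmp = rootX;
            rootX = rootY;
            rootY = tmp;
        }
        this.parent[rootY] = rootX;
        this.size[rootX] += this.size[rootY];
        this.count--;
        return true;
    }

    connected(x: number, y: number): boolean {
        return this.find(x) === this.find(y);
    }

    getSize(x: number): number {
        return this.size[this.find(x)];
    }

    getCount(): number {
        return this.count;
    }

    // O(n) scan: membership is implicit, so every element must be checked
    getSetMembers(x: number): number[] {
        const root = this.find(x);
        const members: number[] = [];
        for (let i = 0; i < this.parent.length; i++) {
            if (this.find(i) === root) members.push(i);
        }
        return members;
    }
}

const demoUf = new UnionFind(10);
demoUf.union(0, 1);
demoUf.union(2, 3);
demoUf.union(4, 5);
demoUf.union(1, 3);
console.log(demoUf.getCount());                  // 6
console.log(demoUf.getSize(0));                  // 4
console.log(demoUf.getSetMembers(0).join(", ")); // 0, 1, 2, 3
```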

Design Considerations:

Return value of union():

Some implementations return void, others return boolean (whether a merge occurred). The boolean version is convenient—it tells you if this was a "new" connection or a redundant one.

Zero-indexed vs One-indexed:

Most implementations use 0-indexed elements (0 to n-1). Be consistent and match your problem's indexing.

Dynamic resizing:

The basic Union-Find has a fixed size set at construction. For dynamic applications, you might need a map-based variant that can grow arbitrarily.

// When elements aren't contiguous integers from 0 to n-1
// Use a Map-based approach for arbitrary keys
 
class DynamicUnionFind<T> {
    private parent: Map<T, T> = new Map();
    private rank: Map<T, number> = new Map();
    private count: number = 0;
    
    // Lazily initialize elements when first accessed
    private ensure(x: T): void {
        if (!this.parent.has(x)) {
            this.parent.set(x, x);  // Self-parent
            this.rank.set(x, 0);
            this.count++;
        }
    }
    
    find(x: T): T {
        this.ensure(x);
        
        // Path compression
        if (this.parent.get(x) !== x) {
            this.parent.set(x, this.find(this.parent.get(x)!));
        }
        return this.parent.get(x)!;
    }
    
    union(x: T, y: T): boolean {
        const rootX = this.find(x);
        const rootY = this.find(y);
        
        if (rootX === rootY) return false;
        
        // Union by rank
        const rankX = this.rank.get(rootX)!;
        const rankY = this.rank.get(rootY)!;
        
        if (rankX < rankY) {
            this.parent.set(rootX, rootY);
        } else if (rankX > rankY) {
            this.parent.set(rootY, rootX);
        } else {
            this.parent.set(rootY, rootX);
            this.rank.set(rootX, rankX + 1);
        }
        
        this.count--;
        return true;
    }
    
    connected(x: T, y: T): boolean {
        return this.find(x) === this.find(y);
    }
}
 
// Usage with strings or any other Map key.
// Note: Map compares object keys by reference, not by structure.
const uf = new DynamicUnionFind<string>();
uf.union("Alice", "Bob");
uf.union("Charlie", "Diana");
console.log(uf.connected("Alice", "Bob"));     // true
console.log(uf.connected("Alice", "Charlie")); // false

Choosing the Right Variant

Use array-based Union-Find when elements are integers in a known range—it's faster and uses less memory. Use Map-based Union-Find when elements are sparse, non-integer, or the range is unknown. The algorithmic complexity is the same, but array operations have lower constants.

Summary and Looking Ahead

We've established the foundation for understanding Union-Find. Let's consolidate what we've learned:

Key Takeaways

• Union-Find solves dynamic connectivity: It efficiently tracks which elements are connected as connections form over time.
• Tree-based representation: Each set is a tree; elements point to parents; roots identify sets. A single parent array is all we need.
• Core operations, Find and Union: Find traverses to the root; Union links two roots. Connected queries reduce to comparing roots.
• Maintains a set partition: The collection of sets is always a valid partition: disjoint, complete, with every element in exactly one set.
• Widely applicable: From Kruskal's MST to network connectivity to percolation, Union-Find appears across domains wherever connectivity matters.
• The naive version has issues: Without optimizations, tree height can become O(n), making operations slow. This motivates the optimizations we'll study next.

What's next:

We've defined what Union-Find is. Now we need to understand how it works efficiently. In the next page, we'll dive deep into the Union and Find operations—examining exactly how they work, tracing through examples, and understanding why the naive approach can degrade to linear time per operation.

Page Complete

You now understand what Union-Find is, why it exists, and how it represents disjoint sets. You've seen the basic implementation and its applications. Next, we'll explore the Union and Find operations in detail, setting the stage for the powerful optimizations that make Union-Find truly remarkable.