The mathematical foundations of relational databases rest on set theory—relations are sets of tuples, and operations on relations have clear set-theoretic counterparts. UNION, INTERSECT, and EXCEPT (also called MINUS or DIFFERENCE) are the direct SQL equivalents of set union, intersection, and difference operations.
These operations are fundamental to combining query results, finding common records across tables, and identifying records that exist in one dataset but not another. Despite their simple conceptual definitions, their physical implementations involve subtle algorithmic choices and significant optimization opportunities.
By the end of this page, you will understand the physical implementation of UNION, INTERSECT, and EXCEPT operations. You'll master both hash-based and sort-based algorithms, understand the distinction between set and bag semantics (ALL variants), and learn how to analyze the costs and optimize these operations in practice.
SQL provides three set operations that combine the results of two query expressions:
UNION (∪): Returns all tuples that appear in either input relation (or both)
INTERSECT (∩): Returns only tuples that appear in both input relations
EXCEPT (−): Returns tuples that appear in the first relation but not the second
Compatibility Requirement:
For any set operation, the two input relations must be union-compatible (also called type-compatible): they must have the same number of columns, and corresponding columns must have compatible data types. The result schema typically inherits column names from the first (left) input.
| Operation | Mathematical Notation | SQL Syntax | Description |
|---|---|---|---|
| Union | R ∪ S | R UNION S | All tuples in R or S (duplicates removed) |
| Union All | R ⊎ S | R UNION ALL S | All tuples from both (duplicates preserved) |
| Intersection | R ∩ S | R INTERSECT S | Tuples in both R and S |
| Intersection All | R ∩bag S | R INTERSECT ALL S | Bag intersection (min multiplicity) |
| Difference | R − S | R EXCEPT S | Tuples in R but not in S |
| Difference All | R −bag S | R EXCEPT ALL S | Bag difference (subtract counts) |
```sql
-- UNION: All customers who placed orders OR left reviews
SELECT customer_id FROM orders
UNION
SELECT customer_id FROM reviews;
-- Result: Each customer_id appears exactly once

-- UNION ALL: Preserves all occurrences
SELECT product_id FROM january_sales
UNION ALL
SELECT product_id FROM february_sales;
-- Result: If product appears 3x in Jan and 2x in Feb, it appears 5x total

-- INTERSECT: Customers who placed orders AND left reviews
SELECT customer_id FROM orders
INTERSECT
SELECT customer_id FROM reviews;
-- Result: Only customer_ids present in both tables

-- EXCEPT: Customers who ordered but never reviewed
SELECT customer_id FROM orders
EXCEPT
SELECT customer_id FROM reviews;
-- Result: customer_ids in orders but not in reviews

-- Complex: Multi-step set operations
(SELECT id FROM premium_users UNION SELECT id FROM vip_users)
EXCEPT
SELECT id FROM banned_users;
-- Result: Premium or VIP users who aren't banned
```
Standard SQL operates on bags (multisets), not sets—duplicates are allowed. UNION, INTERSECT, and EXCEPT without ALL apply set semantics by implicitly eliminating duplicates from the result. The ALL variants preserve bag semantics, treating multiple occurrences as significant. This distinction critically affects both correctness and performance.
UNION is the most common set operation, combining results from multiple queries. Its implementation depends on whether duplicates should be eliminated (UNION) or preserved (UNION ALL).
UNION ALL (Bag Union):
UNION ALL is trivially implemented—simply concatenate the two input streams:
```
UNION_ALL(R, S):
  for each tuple t in R:
    output t
  for each tuple t in S:
    output t
```
This requires no comparison, hashing, or duplicate tracking, making it an extremely efficient operation: a single pass over each input, no extra memory beyond the output buffer, and no CPU spent matching tuples.
UNION ALL maintains full pipelining: tuples flow through without buffering.
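To make this concrete, here is a minimal Python sketch (illustrative only, not any engine's actual code) that models UNION ALL as lazy concatenation of two tuple streams:

```python
from itertools import chain

# UNION ALL as pure stream concatenation: tuples are yielded as they
# arrive, so the operator is fully pipelined and needs no extra memory.
def union_all(r, s):
    yield from chain(r, s)

# A value appearing 3x in r and 2x in s appears 5x in the result.
r = [("p1",), ("p1",), ("p1",), ("p2",)]
s = [("p1",), ("p1",), ("p3",)]
print(list(union_all(r, s)))  # 7 tuples, duplicates preserved
```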
UNION (Set Union):
UNION with duplicate elimination requires more sophisticated implementation. The operation is essentially: DISTINCT(R UNION ALL S).
Hash-Based UNION:
```
HASH_UNION(R, S):
  seen = empty hash set
  for each tuple t in R:
    if t not in seen:
      seen.add(t)
      output t
  for each tuple t in S:
    if t not in seen:
      seen.add(t)
      output t
```
This processes inputs sequentially, outputting each unique tuple on first encounter.
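As an illustration, the same idea can be written as a runnable Python generator (a simplified sketch that assumes tuples fit in memory and ignores hash-table sizing and spilling):

```python
from itertools import chain

# Hash-based UNION with set semantics: remember every tuple seen so far
# and emit each distinct tuple the first time it appears.
def hash_union(r, s):
    seen = set()
    for t in chain(r, s):   # process R first, then S
        if t not in seen:
            seen.add(t)
            yield t

print(list(hash_union([(1,), (1,), (2,)], [(2,), (3,)])))  # [(1,), (2,), (3,)]
```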
Sort-Based UNION:
```
SORT_MERGE_UNION(R, S):
  sorted_R = sort(R)
  sorted_S = sort(S)
  r = next tuple from sorted_R (or NULL)
  s = next tuple from sorted_S (or NULL)
  last_output = NULL
  while r ≠ NULL or s ≠ NULL:
    if s == NULL or (r ≠ NULL and r < s):
      if r ≠ last_output:
        output r
        last_output = r
      r = next from sorted_R
    elif r == NULL or (s ≠ NULL and s < r):
      if s ≠ last_output:
        output s
        last_output = s
      s = next from sorted_S
    else: // r == s
      if r ≠ last_output:
        output r
        last_output = r
      r = next from sorted_R
      s = next from sorted_S
```
After sorting, merge with deduplication produces the union in sorted order.
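The merge-with-deduplication step has an equally compact Python sketch; sorting is done in memory here for brevity, whereas a real engine would use external merge sort:

```python
from heapq import merge
from itertools import groupby

# Sort-based UNION: sort both inputs, merge the sorted streams, and
# collapse runs of equal tuples so each distinct tuple is emitted once.
def sort_merge_union(r, s):
    merged = merge(sorted(r), sorted(s))   # single sorted stream
    for tup, _run in groupby(merged):      # adjacent duplicates form one run
        yield tup

print(list(sort_merge_union([(3,), (1,), (1,)], [(2,), (1,)])))  # [(1,), (2,), (3,)]
```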
When using hash-based UNION, process the smaller relation first. Why? The smaller relation builds a more compact hash table, and if it contains most of the unique values, many tuples from the larger relation will be identified as duplicates quickly (hash hit) without needing hash table inserts.
| Aspect | Hash-Based | Sort-Based |
|---|---|---|
| Time Complexity | O(\|R\| + \|S\|) average | O((\|R\| + \|S\|) log(\|R\| + \|S\|)) |
| Space Complexity | O(distinct tuples) | O(sort buffers) |
| Output Order | Arbitrary | Sorted |
| Early Output | Yes (pipelined) | No (blocking until merge) |
| Memory Overflow | Spill to partitions | External merge sort |
| Best When | Memory sufficient, no order needed | Output must be sorted |
INTERSECT returns only tuples that appear in both input relations. This is conceptually equivalent to a natural join on all columns, but with set or bag semantics.
Hash-Based INTERSECT:
The algorithm is similar to hash join:
```
HASH_INTERSECT(R, S):
  // Phase 1: Build hash table from R
  hash_table = {}
  for each tuple t in R:
    if t in hash_table:
      hash_table[t].count += 1
    else:
      hash_table[t] = {count: 1, outputted: false}

  // Phase 2: Probe with S
  for each tuple t in S:
    if t in hash_table and not hash_table[t].outputted:
      output t
      hash_table[t].outputted = true  // For set semantics
      // For INTERSECT ALL: output and decrement count instead
```
For set semantics, each matching tuple is output exactly once. For INTERSECT ALL, we track counts and output up to min(count_R, count_S) copies.
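For intuition, both variants can be sketched in Python with collections.Counter, whose & operator applies exactly the min-multiplicity rule (an in-memory simplification, not engine code):

```python
from collections import Counter

# INTERSECT vs INTERSECT ALL on small multisets of tuples.
def intersect(r, s):
    return set(r) & set(s)                 # set semantics: each match appears once

def intersect_all(r, s):
    common = Counter(r) & Counter(s)       # per tuple: min(count_R, count_S)
    return list(common.elements())

r = [("A",)] * 3 + [("B",)] * 2 + [("C",)]
s = [("A",)] * 2 + [("B",), ("D",)]
print(sorted(intersect(r, s)))      # [('A',), ('B',)]
print(sorted(intersect_all(r, s)))  # [('A',), ('A',), ('B',)]
```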
```sql
-- INTERSECT (set semantics)
-- R has: (A, A, A, B, B, C)
-- S has: (A, A, B, D)
-- Result: (A, B) -- each appears exactly once

-- INTERSECT ALL (bag semantics)
-- R has: (A, A, A, B, B, C) -- A:3, B:2, C:1
-- S has: (A, A, B, D)       -- A:2, B:1, D:1
-- Result: (A, A, B)         -- A:min(3,2)=2, B:min(2,1)=1

-- Hash table state for INTERSECT ALL:
-- After processing R: {A: 3, B: 2, C: 1}
-- Processing S[A]: output A, decrement → {A: 2, B: 2, C: 1}
-- Processing S[A]: output A, decrement → {A: 1, B: 2, C: 1}
-- Processing S[B]: output B, decrement → {A: 1, B: 1, C: 1}
-- Processing S[D]: not found, skip
-- Final output: (A, A, B)
```
Sort-Based INTERSECT:
```
SORT_MERGE_INTERSECT(R, S):
  sorted_R = sort(R)
  sorted_S = sort(S)
  r = next tuple from sorted_R
  s = next tuple from sorted_S
  while r ≠ NULL and s ≠ NULL:
    if r < s:
      r = next from sorted_R
    elif s < r:
      s = next from sorted_S
    else: // r == s, found match
      output r  // or s, they're equal
      r = next from sorted_R
      s = next from sorted_S
      // For set semantics: skip remaining duplicates of this value
      // For ALL semantics: output min(count_R, count_S) copies
```
Sort-merge intersection is elegant: advance through both sorted inputs, outputting when they match, skipping when they don't.
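A minimal Python version of that loop (set semantics only, with an in-memory sort and up-front deduplication standing in for the duplicate-skipping logic) might look like this:

```python
# Sort-merge INTERSECT, set semantics: advance whichever side is smaller,
# emit on equality. Dedup up front replaces the "skip duplicates" step.
def sort_merge_intersect(r, s):
    R, S = sorted(set(r)), sorted(set(s))
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i] < S[j]:
            i += 1          # R is behind, advance it
        elif S[j] < R[i]:
            j += 1          # S is behind, advance it
        else:
            yield R[i]      # match found, output once
            i += 1
            j += 1

print(list(sort_merge_intersect([3, 1, 1, 2], [2, 2, 4, 1])))  # [1, 2]
```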
Cost Analysis:
Hash-based INTERSECT runs in O(|R| + |S|) expected time, with memory proportional to the build input. Sort-based INTERSECT pays O((|R| + |S|) log(|R| + |S|)) for the sorts plus a linear merge. Hash-based is typically faster unless the inputs are already sorted, in which case only the merge pass is needed.
R INTERSECT S is semantically equivalent to a duplicate-eliminating semi-join R ⋉ S over all columns (with care around NULLs, which INTERSECT treats as equal while join predicates do not). Some optimizers rewrite INTERSECT to a semi-join to gain access to join optimizations like index lookups. Conversely, semi-joins whose join condition covers all columns can be converted to INTERSECT.
EXCEPT (or MINUS in some SQL dialects) returns tuples from the left relation that don't appear in the right relation. Unlike UNION and INTERSECT, EXCEPT is not commutative—order matters.
Hash-Based EXCEPT:
Build a hash table from the right relation, then probe with the left:
```
HASH_EXCEPT(R, S):  // R EXCEPT S
  // Phase 1: Build hash set from S (what to exclude)
  exclude_set = {}
  for each tuple t in S:
    exclude_set.add(t)
    // For EXCEPT ALL, track counts: exclude_set[t].count++

  // Phase 2: Output R tuples not in exclude set
  seen = {}  // For set semantics: avoid outputting duplicates
  for each tuple t in R:
    if t not in exclude_set:
      if t not in seen:  // Set semantics
        output t
        seen.add(t)
    // For EXCEPT ALL: output if count not exhausted, decrement
```
The key insight: we must fully process S before outputting from R, because any tuple in R might be excluded by a later tuple in S.
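As with INTERSECT, both EXCEPT variants have a compact in-memory Python sketch; Counter subtraction drops counts at or below zero, which matches the subtract-counts rule of EXCEPT ALL:

```python
from collections import Counter

# EXCEPT vs EXCEPT ALL: S must be fully consumed before emitting from R.
def except_(r, s):
    exclude = set(s)
    return {t for t in r if t not in exclude}      # set semantics

def except_all(r, s):
    remaining = Counter(r) - Counter(s)            # non-positive counts are dropped
    return list(remaining.elements())

r = [("A",)] * 3 + [("B",)] * 2 + [("C",)]
s = [("A",)] * 2
print(sorted(except_(r, s)))      # [('B',), ('C',)]
print(sorted(except_all(r, s)))   # [('A',), ('B',), ('B',), ('C',)]
```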
```sql
-- EXCEPT (set semantics)
-- R has: (A, A, A, B, B, C)
-- S has: (A, D)
-- Result: (B, C) -- A excluded, D doesn't affect result

-- EXCEPT ALL (bag semantics)
-- R has: (A, A, A, B, B, C) -- A:3, B:2, C:1
-- S has: (A, A)             -- A:2
-- Result: (A, B, B, C)      -- A:3-2=1, B:2, C:1

-- Practical example: Find new customers this month
SELECT customer_id FROM march_customers
EXCEPT
SELECT customer_id FROM feb_customers;
-- Returns customers who appear in March but not February

-- Common mistake: Order matters!
SELECT a FROM R EXCEPT SELECT a FROM S  -- Tuples in R but not S
-- is different from
SELECT a FROM S EXCEPT SELECT a FROM R  -- Tuples in S but not R
```
Sort-Based EXCEPT:
```
SORT_MERGE_EXCEPT(R, S):  // R EXCEPT S
  sorted_R = sort(R)
  sorted_S = sort(S)
  r = next from sorted_R
  s = next from sorted_S
  last_output = NULL
  while r ≠ NULL:
    if s == NULL or r < s:
      // r has no match in S, output it (with dedup)
      if r ≠ last_output:  // Set semantics
        output r
        last_output = r
      r = next from sorted_R
    elif r > s:
      // s is behind, advance s
      s = next from sorted_S
    else: // r == s
      // Match found, r is excluded
      last_output = r  // remember this value so its remaining duplicates in R are also skipped
      r = next from sorted_R
      s = next from sorted_S
```
After sorting, we scan both inputs in parallel. When the left tuple has no match on the right, output it (once, under set semantics); when the tuples match, the left tuple is excluded and both sides advance.
Unlike UNION ALL, EXCEPT cannot be fully pipelined. We must see all of S before outputting from R, because we need to know which R tuples are excluded. This makes EXCEPT a semi-blocking operator. Sort-based EXCEPT is fully blocking (must sort first). Hash-based can stream R after building S's hash table.
When input relations exceed available memory, set operations must spill to disk. The strategies mirror those for external hashing and sorting.
External Hash-Based Set Operations:
The approach uses partitioning:
Partition Phase: Hash both R and S into P partitions using the same hash function
Process Phase: For each partition pair (Rᵢ, Sᵢ), run the in-memory algorithm (union, intersect, or except) on just that pair
Since we use the same hash function for R and S, matching tuples land in corresponding partitions. Each partition pair can be processed independently.
```
EXTERNAL_HASH_UNION(R, S, memory_limit):
  // Estimate partition count
  num_partitions = estimate_needed(|R|, |S|, memory_limit)

  // Phase 1: Partition both inputs
  R_partitions = partition(R, num_partitions)  // Using hash function h1
  S_partitions = partition(S, num_partitions)  // Same hash function h1

  // Phase 2: Process each partition pair
  for i = 0 to num_partitions - 1:
    seen = empty hash set (using hash function h2)

    // Read R partition, add to set
    for each tuple t in R_partitions[i]:
      if t not in seen:
        seen.add(t)
        output t

    // Read S partition, add new tuples
    for each tuple t in S_partitions[i]:
      if t not in seen:
        seen.add(t)
        output t

    clear(seen)
    delete_temp_files(R_partitions[i], S_partitions[i])

EXTERNAL_HASH_INTERSECT(R, S, memory_limit):
  // Similar partitioning...
  for i = 0 to num_partitions - 1:
    // Build hash table from R partition
    R_set = build_counting_hash_set(R_partitions[i])

    // Probe with S partition
    for each tuple t in S_partitions[i]:
      if t in R_set and not R_set[t].outputted:
        output t
        R_set[t].outputted = true
```
I/O Cost for External Set Operations:
Let |R| = N_R pages and |S| = N_S pages:
Partitioning: read and write both inputs once, costing 2(N_R + N_S) page I/Os.
Processing: read each partition back once, costing another N_R + N_S page I/Os, plus writing the output.
Total: 3(N_R + N_S) + O_output page operations
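To put rough numbers on the formula (the page counts here are hypothetical, chosen only for illustration):

```python
# Hypothetical input sizes, not taken from the text.
N_R, N_S = 10_000, 2_000            # pages in R and S
partition_io = 2 * (N_R + N_S)      # read + write both inputs: 24,000
process_io = N_R + N_S              # read each partition back once: 12,000
print(partition_io + process_io)    # 36,000 page I/Os, plus writing the output
```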
External Sort-Based Set Operations:
Alternatively, sort both inputs and merge:
Cost: sorting cost plus a single merge pass: O(N log N) comparisons for the sorts and O(N) for the merge, where N = |R| + |S|. If an input is already sorted (for example, delivered by an index scan), its sort can be skipped entirely.
For external set operations, the hash-based approach is generally preferred when no output order is required, while the sort-based approach wins when the inputs arrive sorted or the result must be sorted, mirroring the in-memory trade-offs summarized earlier.
Database optimizers apply various transformations to improve set operation performance.
1. Push Down Selections:
Apply filters before set operations to reduce input sizes:
```sql
-- Before optimization
(SELECT * FROM orders WHERE year = 2024)
UNION
(SELECT * FROM archived_orders WHERE year = 2024)

-- After: Predicates already pushed down
-- (This is often the user's intent, but optimizer ensures it)
```
2. Push Down Projections:
Project only necessary columns before set operations:
```sql
-- Original:  SELECT DISTINCT a FROM (SELECT * FROM R UNION SELECT * FROM S)
-- Optimized: SELECT a FROM R UNION SELECT a FROM S
-- Benefits:  Narrower tuples, possibly fewer distinct values
```
3. Merge Adjacent Operations:
Multiple set operations can sometimes merge:
```
-- Original
(R UNION S) UNION T

-- Optimized: Single 3-way union
-- Process R, S, T with shared hash table
-- Avoids materializing intermediate result
```
4. Convert to Joins When Beneficial:
INTERSECT and EXCEPT can often be rewritten as joins:
```sql
-- INTERSECT as semi-join
SELECT a, b FROM R INTERSECT SELECT a, b FROM S
-- Equivalent to:
SELECT DISTINCT R.a, R.b FROM R
WHERE EXISTS (SELECT 1 FROM S WHERE S.a = R.a AND S.b = R.b)

-- EXCEPT as anti-join
SELECT a, b FROM R EXCEPT SELECT a, b FROM S
-- Equivalent to:
SELECT DISTINCT R.a, R.b FROM R
WHERE NOT EXISTS (SELECT 1 FROM S WHERE S.a = R.a AND S.b = R.b)
```
Join-based rewrites can exploit indexes that set operations cannot.
```
-- Optimization: Empty set elimination
-- If statistics show S is empty:
R UNION S     → R (with duplicate elimination)
R UNION ALL S → R (no change needed)
R INTERSECT S → ∅ (empty result, skip execution)
R EXCEPT S    → R (with duplicate elimination)

-- Optimization: Self-set operations
R UNION R     → DISTINCT R
R INTERSECT R → DISTINCT R
R EXCEPT R    → ∅ (empty result)

-- Optimization: Commutativity exploitation
-- For UNION and INTERSECT, order by estimated size:
-- Put smaller relation first for hash-based (smaller hash table)
-- Example: 1M rows UNION 10K rows → 10K UNION 1M

-- Optimization: Distribute UNION over JOIN
(R JOIN S) UNION (R JOIN T) → R JOIN (S UNION T)
-- When R is expensive to compute, avoid computing it twice
```
Always consider whether you truly need duplicate elimination. UNION ALL, INTERSECT ALL, and EXCEPT ALL avoid the cost of tracking unique tuples. If downstream operations (like aggregation or another dedup) will handle duplicates anyway, using ALL variants can significantly reduce intermediate processing.
In distributed database systems, set operations require coordinating across nodes. The strategies depend on how data is partitioned across the cluster.
Distributed UNION:
UNION ALL is trivial in distributed settings—no coordination needed:
Distributed UNION ALL:
Each node outputs its local data
Results are concatenated at coordinator
For UNION with deduplication:
Option A: Centralized Dedup. Every node streams its local tuples to the coordinator, which performs the duplicate elimination. Simple, but the coordinator becomes both a network and a CPU bottleneck.
Option B: Distributed Dedup. Tuples are redistributed (shuffled) by a hash of the full tuple so that all copies of a value land on the same node, and each node then deduplicates locally. This is the approach shown in the pseudocode below.
```
DISTRIBUTED_UNION(R_distributed, S_distributed):
  // R and S are already distributed across nodes

  // Step 1: Local UNION ALL (no coordination)
  for each node N in parallel:
    local_union = R_local[N] UNION ALL S_local[N]

  // Step 2: Global deduplication via hash redistribution
  for each node N in parallel:
    for each tuple t in local_union:
      target = hash(t) mod num_nodes
      send t to target

  barrier()  // Wait for shuffle

  // Step 3: Local deduplication
  for each node N in parallel:
    result[N] = HASH_DISTINCT(received_tuples)

  // Combined result across all nodes is the answer

DISTRIBUTED_INTERSECT(R_distributed, S_distributed):
  // Co-partition R and S by full tuple hash
  for each node N in parallel:
    for each tuple t in R_local[N]:
      send t to hash(t) mod num_nodes, tagged as 'R'
    for each tuple t in S_local[N]:
      send t to hash(t) mod num_nodes, tagged as 'S'

  barrier()

  // Each node has matching R and S tuples; do local intersect
  for each node N in parallel:
    local_R = tuples tagged 'R'
    local_S = tuples tagged 'S'
    result[N] = HASH_INTERSECT(local_R, local_S)
```
Optimization: Co-located Set Operations
If R and S are already co-partitioned on the same hash key and that key comprises all columns, no shuffle is needed at all: each node can run the set operation purely on its local partitions, and the network cost drops to zero.
This is why physical data layout choices (like table partitioning) are so important for analytical query performance.
Handling Skew:
Skewed data causes load imbalance, with some nodes processing far more data than others. Mitigations include identifying heavy values from statistics and handling them separately, splitting oversized partitions across several nodes, and deduplicating locally before the shuffle so that hot values are sent only once per node.
In distributed set operations, network transfer typically dominates cost. A good rule of thumb: any optimization that reduces bytes transferred across the network is worth significant CPU overhead. This is why local deduplication before shuffle is almost always beneficial, even though it means doing dedup twice.
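A toy calculation (with made-up row counts, purely to illustrate the rule of thumb) shows why local deduplication before the shuffle pays off:

```python
# Hypothetical cluster: 4 nodes, 1,000,000 tuples of ~40 bytes per node,
# but only ~50,000 distinct values per node due to heavy duplication.
nodes, rows, distinct, row_bytes = 4, 1_000_000, 50_000, 40

naive_shuffle = nodes * rows * row_bytes        # ship every tuple: 160 MB
dedup_first   = nodes * distinct * row_bytes    # dedup locally, then ship: 8 MB

print(f"naive shuffle: {naive_shuffle / 1e6:.0f} MB")
print(f"local dedup first: {dedup_first / 1e6:.0f} MB")
```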
Set operations combine query results using mathematical set theory. Let's consolidate the key concepts:
UNION ALL is simple concatenation: fully pipelined, no comparisons, no extra memory.
UNION, INTERSECT, and EXCEPT apply set semantics and therefore need duplicate handling, implemented with either hashing (linear expected time, arbitrary output order) or sorting (higher cost, but sorted output).
The ALL variants preserve bag semantics by tracking tuple multiplicities: minimum counts for INTERSECT ALL, subtracted counts for EXCEPT ALL.
INTERSECT and EXCEPT can be rewritten as semi-joins and anti-joins, opening up join optimizations such as index lookups.
When inputs exceed memory or span a cluster, partitioning by a hash of the full tuple ensures matching tuples meet in the same partition or on the same node.
What's Next:
Set operations and duplicate elimination rely on sorting as a fundamental primitive. The next page dives deep into sorting algorithms used in database systems, from in-memory quicksort to external merge sort, revealing how this classic algorithm adapts to database-scale data.
You now understand how set operations are physically implemented in database systems. From simple UNION ALL concatenation to complex distributed INTERSECT operations, these techniques enable powerful data combination and comparison queries. Next, we'll explore the sorting algorithms that underpin many of these operations.