The mathematical foundations of relational databases rest on set theory—relations are sets of tuples, and operations on relations have clear set-theoretic counterparts. UNION, INTERSECT, and EXCEPT (also called MINUS or DIFFERENCE) are the direct SQL equivalents of set union, intersection, and difference operations.
These operations are fundamental to combining query results, finding common records across tables, and identifying records that exist in one dataset but not another. Despite their simple conceptual definitions, their physical implementations involve subtle algorithmic choices and significant optimization opportunities.
By the end of this page, you will understand the physical implementation of UNION, INTERSECT, and EXCEPT operations. You'll master both hash-based and sort-based algorithms, understand the distinction between set and bag semantics (ALL variants), and learn how to analyze the costs and optimize these operations in practice.
SQL provides three set operations that combine the results of two query expressions:
UNION (∪): Returns all tuples that appear in either input relation (or both)
INTERSECT (∩): Returns only tuples that appear in both input relations
EXCEPT (−): Returns tuples that appear in the first relation but not the second
Compatibility Requirement:
For any set operation, the two input relations must be union-compatible (also called type-compatible): they must have the same number of columns, and corresponding columns must have compatible data types. The result schema typically inherits column names from the first (left) input.
| Operation | Mathematical Notation | SQL Syntax | Description |
|---|---|---|---|
| Union | R ∪ S | R UNION S | All tuples in R or S (duplicates removed) |
| Union All | R ⊎ S | R UNION ALL S | All tuples from both (duplicates preserved) |
| Intersection | R ∩ S | R INTERSECT S | Tuples in both R and S |
| Intersection All | R ∩bag S | R INTERSECT ALL S | Bag intersection (min multiplicity) |
| Difference | R − S | R EXCEPT S | Tuples in R but not in S |
| Difference All | R −bag S | R EXCEPT ALL S | Bag difference (subtract counts) |
```sql
-- UNION: All customers who placed orders OR left reviews
SELECT customer_id FROM orders
UNION
SELECT customer_id FROM reviews;
-- Result: Each customer_id appears exactly once

-- UNION ALL: Preserves all occurrences
SELECT product_id FROM january_sales
UNION ALL
SELECT product_id FROM february_sales;
-- Result: If product appears 3x in Jan and 2x in Feb, it appears 5x total

-- INTERSECT: Customers who placed orders AND left reviews
SELECT customer_id FROM orders
INTERSECT
SELECT customer_id FROM reviews;
-- Result: Only customer_ids present in both tables

-- EXCEPT: Customers who ordered but never reviewed
SELECT customer_id FROM orders
EXCEPT
SELECT customer_id FROM reviews;
-- Result: customer_ids in orders but not in reviews

-- Complex: Multi-step set operations
(SELECT id FROM premium_users UNION SELECT id FROM vip_users)
EXCEPT
SELECT id FROM banned_users;
-- Result: Premium or VIP users who aren't banned
```
Standard SQL operates on bags (multisets), not sets—duplicates are allowed. UNION, INTERSECT, and EXCEPT without ALL apply set semantics by implicitly eliminating duplicates from the result. The ALL variants preserve bag semantics, treating multiple occurrences as significant. This distinction critically affects both correctness and performance.
UNION is the most common set operation, combining results from multiple queries. Its implementation depends on whether duplicates should be eliminated (UNION) or preserved (UNION ALL).
UNION ALL (Bag Union):
UNION ALL is trivially implemented—simply concatenate the two input streams:
```
UNION_ALL(R, S):
  for each tuple t in R:
    output t
  for each tuple t in S:
    output t
```
This requires no comparison, hashing, or duplicate tracking, making it an extremely efficient operation: a single pass over each input, no extra memory beyond the output buffer, and no CPU spent matching tuples.
UNION ALL maintains full pipelining: tuples flow through without buffering.
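To make this concrete, here is a minimal Python sketch (illustrative only, not any engine's actual code) that models UNION ALL as lazy concatenation of two tuple streams:

```python
from itertools import chain

# UNION ALL as pure stream concatenation: tuples are yielded as they
# arrive, so the operator is fully pipelined and needs no extra memory.
def union_all(r, s):
    yield from chain(r, s)

# A value appearing 3x in r and 2x in s appears 5x in the result.
r = [("p1",), ("p1",), ("p1",), ("p2",)]
s = [("p1",), ("p1",), ("p3",)]
print(list(union_all(r, s)))  # 7 tuples, duplicates preserved
```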
UNION (Set Union):
UNION with duplicate elimination requires more sophisticated implementation. The operation is essentially: DISTINCT(R UNION ALL S).
Hash-Based UNION:
```
HASH_UNION(R, S):
  seen = empty hash set
  for each tuple t in R:
    if t not in seen:
      seen.add(t)
      output t
  for each tuple t in S:
    if t not in seen:
      seen.add(t)
      output t
```
This processes inputs sequentially, outputting each unique tuple on first encounter.
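As an illustration, the same idea can be written as a runnable Python generator (a simplified sketch that assumes tuples fit in memory and ignores hash-table sizing and spilling):

```python
from itertools import chain

# Hash-based UNION with set semantics: remember every tuple seen so far
# and emit each distinct tuple the first time it appears.
def hash_union(r, s):
    seen = set()
    for t in chain(r, s):   # process R first, then S
        if t not in seen:
            seen.add(t)
            yield t

print(list(hash_union([(1,), (1,), (2,)], [(2,), (3,)])))  # [(1,), (2,), (3,)]
```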
Sort-Based UNION:
```
SORT_MERGE_UNION(R, S):
  sorted_R = sort(R)
  sorted_S = sort(S)
  r = next tuple from sorted_R (or NULL)
  s = next tuple from sorted_S (or NULL)
  last_output = NULL
  while r ≠ NULL or s ≠ NULL:
    if s == NULL or (r ≠ NULL and r < s):
      if r ≠ last_output:
        output r
        last_output = r
      r = next from sorted_R
    elif r == NULL or (s ≠ NULL and s < r):
      if s ≠ last_output:
        output s
        last_output = s
      s = next from sorted_S
    else: // r == s
      if r ≠ last_output:
        output r
        last_output = r
      r = next from sorted_R
      s = next from sorted_S
```
After sorting, merge with deduplication produces the union in sorted order.
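The merge-with-deduplication step has an equally compact Python sketch; sorting is done in memory here for brevity, whereas a real engine would use external merge sort:

```python
from heapq import merge
from itertools import groupby

# Sort-based UNION: sort both inputs, merge the sorted streams, and
# collapse runs of equal tuples so each distinct tuple is emitted once.
def sort_merge_union(r, s):
    merged = merge(sorted(r), sorted(s))   # single sorted stream
    for tup, _run in groupby(merged):      # adjacent duplicates form one run
        yield tup

print(list(sort_merge_union([(3,), (1,), (1,)], [(2,), (1,)])))  # [(1,), (2,), (3,)]
```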
When using hash-based UNION, process the smaller relation first. Why? The smaller relation builds a more compact hash table, and if it contains most of the unique values, many tuples from the larger relation will be identified as duplicates quickly (hash hit) without needing hash table inserts.
| Aspect | Hash-Based | Sort-Based |
|---|---|---|
| Time Complexity | O(\|R\| + \|S\|) average | O((\|R\| + \|S\|) log(\|R\| + \|S\|)) |
| Space Complexity | O(distinct tuples) | O(sort buffers) |
| Output Order | Arbitrary | Sorted |
| Early Output | Yes (pipelined) | No (blocking until merge) |
| Memory Overflow | Spill to partitions | External merge sort |
| Best When | Memory sufficient, no order needed | Output must be sorted |
INTERSECT returns only tuples that appear in both input relations. This is conceptually equivalent to a natural join on all columns, but with set or bag semantics.
Hash-Based INTERSECT:
The algorithm is similar to hash join:
```
HASH_INTERSECT(R, S):
  // Phase 1: Build hash table from R
  hash_table = {}
  for each tuple t in R:
    if t in hash_table:
      hash_table[t].count += 1
    else:
      hash_table[t] = {count: 1, outputted: false}

  // Phase 2: Probe with S
  for each tuple t in S:
    if t in hash_table and not hash_table[t].outputted:
      output t
      hash_table[t].outputted = true  // For set semantics
      // For INTERSECT ALL: output and decrement count instead
```
For set semantics, each matching tuple is output exactly once. For INTERSECT ALL, we track counts and output up to min(count_R, count_S) copies.
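For intuition, both variants can be sketched in Python with collections.Counter, whose & operator applies exactly the min-multiplicity rule (an in-memory simplification, not engine code):

```python
from collections import Counter

# INTERSECT vs INTERSECT ALL on small multisets of tuples.
def intersect(r, s):
    return set(r) & set(s)                 # set semantics: each match appears once

def intersect_all(r, s):
    common = Counter(r) & Counter(s)       # per tuple: min(count_R, count_S)
    return list(common.elements())

r = [("A",)] * 3 + [("B",)] * 2 + [("C",)]
s = [("A",)] * 2 + [("B",), ("D",)]
print(sorted(intersect(r, s)))      # [('A',), ('B',)]
print(sorted(intersect_all(r, s)))  # [('A',), ('A',), ('B',)]
```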
```sql
-- INTERSECT (set semantics)
-- R has: (A, A, A, B, B, C)
-- S has: (A, A, B, D)
-- Result: (A, B) -- each appears exactly once

-- INTERSECT ALL (bag semantics)
-- R has: (A, A, A, B, B, C) -- A:3, B:2, C:1
-- S has: (A, A, B, D)       -- A:2, B:1, D:1
-- Result: (A, A, B)         -- A:min(3,2)=2, B:min(2,1)=1

-- Hash table state for INTERSECT ALL:
-- After processing R: {A: 3, B: 2, C: 1}
-- Processing S[A]: output A, decrement → {A: 2, B: 2, C: 1}
-- Processing S[A]: output A, decrement → {A: 1, B: 2, C: 1}
-- Processing S[B]: output B, decrement → {A: 1, B: 1, C: 1}
-- Processing S[D]: not found, skip
-- Final output: (A, A, B)
```
Sort-Based INTERSECT:
```
SORT_MERGE_INTERSECT(R, S):
  sorted_R = sort(R)
  sorted_S = sort(S)
  r = next tuple from sorted_R
  s = next tuple from sorted_S
  while r ≠ NULL and s ≠ NULL:
    if r < s:
      r = next from sorted_R
    elif s < r:
      s = next from sorted_S
    else: // r == s, found match
      output r  // or s, they're equal
      r = next from sorted_R
      s = next from sorted_S
      // For set semantics: skip remaining duplicates of this value
      // For ALL semantics: output min(count_R, count_S) copies
```
Sort-merge intersection is elegant: advance through both sorted inputs, outputting when they match, skipping when they don't.
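A minimal Python version of that loop (set semantics only, with an in-memory sort and up-front deduplication standing in for the duplicate-skipping logic) might look like this:

```python
# Sort-merge INTERSECT, set semantics: advance whichever side is smaller,
# emit on equality. Dedup up front replaces the "skip duplicates" step.
def sort_merge_intersect(r, s):
    R, S = sorted(set(r)), sorted(set(s))
    i = j = 0
    while i < len(R) and j < len(S):
        if R[i] < S[j]:
            i += 1          # R is behind, advance it
        elif S[j] < R[i]:
            j += 1          # S is behind, advance it
        else:
            yield R[i]      # match found, output once
            i += 1
            j += 1

print(list(sort_merge_intersect([3, 1, 1, 2], [2, 2, 4, 1])))  # [1, 2]
```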
Cost Analysis:
Hash-based INTERSECT runs in O(|R| + |S|) expected time, with memory proportional to the build input. Sort-based INTERSECT pays O((|R| + |S|) log(|R| + |S|)) for the sorts plus a linear merge. Hash-based is typically faster unless the inputs are already sorted, in which case only the merge pass is needed.
R INTERSECT S is semantically equivalent to a duplicate-eliminating semi-join R ⋉ S over all columns (with care around NULLs, which INTERSECT treats as equal while join predicates do not). Some optimizers rewrite INTERSECT to a semi-join to gain access to join optimizations like index lookups. Conversely, semi-joins whose join condition covers all columns can be converted to INTERSECT.
EXCEPT (or MINUS in some SQL dialects) returns tuples from the left relation that don't appear in the right relation. Unlike UNION and INTERSECT, EXCEPT is not commutative—order matters.
Hash-Based EXCEPT:
Build a hash table from the right relation, then probe with the left:
```
HASH_EXCEPT(R, S):  // R EXCEPT S
  // Phase 1: Build hash set from S (what to exclude)
  exclude_set = {}
  for each tuple t in S:
    exclude_set.add(t)
    // For EXCEPT ALL, track counts: exclude_set[t].count++

  // Phase 2: Output R tuples not in exclude set
  seen = {}  // For set semantics: avoid outputting duplicates
  for each tuple t in R:
    if t not in exclude_set:
      if t not in seen:  // Set semantics
        output t
        seen.add(t)
    // For EXCEPT ALL: output if count not exhausted, decrement
```
The key insight: we must fully process S before outputting from R, because any tuple in R might be excluded by a later tuple in S.
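As with INTERSECT, both EXCEPT variants have a compact in-memory Python sketch; Counter subtraction drops counts at or below zero, which matches the subtract-counts rule of EXCEPT ALL:

```python
from collections import Counter

# EXCEPT vs EXCEPT ALL: S must be fully consumed before emitting from R.
def except_(r, s):
    exclude = set(s)
    return {t for t in r if t not in exclude}      # set semantics

def except_all(r, s):
    remaining = Counter(r) - Counter(s)            # non-positive counts are dropped
    return list(remaining.elements())

r = [("A",)] * 3 + [("B",)] * 2 + [("C",)]
s = [("A",)] * 2
print(sorted(except_(r, s)))      # [('B',), ('C',)]
print(sorted(except_all(r, s)))   # [('A',), ('B',), ('B',), ('C',)]
```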
```sql
-- EXCEPT (set semantics)
-- R has: (A, A, A, B, B, C)
-- S has: (A, D)
-- Result: (B, C) -- A excluded, D doesn't affect result

-- EXCEPT ALL (bag semantics)
-- R has: (A, A, A, B, B, C) -- A:3, B:2, C:1
-- S has: (A, A)             -- A:2
-- Result: (A, B, B, C)      -- A:3-2=1, B:2, C:1

-- Practical example: Find new customers this month
SELECT customer_id FROM march_customers
EXCEPT
SELECT customer_id FROM feb_customers;
-- Returns customers who appear in March but not February

-- Common mistake: Order matters!
SELECT a FROM R EXCEPT SELECT a FROM S  -- Tuples in R but not S
-- is different from
SELECT a FROM S EXCEPT SELECT a FROM R  -- Tuples in S but not R
```
Sort-Based EXCEPT:
```
SORT_MERGE_EXCEPT(R, S):  // R EXCEPT S
  sorted_R = sort(R)
  sorted_S = sort(S)
  r = next from sorted_R
  s = next from sorted_S
  last_output = NULL
  while r ≠ NULL:
    if s == NULL or r < s:
      // r has no match in S, output it (with dedup)
      if r ≠ last_output:  // Set semantics
        output r
        last_output = r
      r = next from sorted_R
    elif r > s:
      // s is behind, advance s
      s = next from sorted_S
    else: // r == s
      // Match found, r is excluded
      last_output = r  // remember this value so its remaining duplicates in R are also skipped
      r = next from sorted_R
      s = next from sorted_S
```
After sorting, we scan both inputs in parallel. When the left tuple has no match on the right, output it (once, under set semantics); when the tuples match, the left tuple is excluded and both sides advance.
Unlike UNION ALL, EXCEPT cannot be fully pipelined. We must see all of S before outputting from R, because we need to know which R tuples are excluded. This makes EXCEPT a semi-blocking operator. Sort-based EXCEPT is fully blocking (must sort first). Hash-based can stream R after building S's hash table.
When input relations exceed available memory, set operations must spill to disk. The strategies mirror those for external hashing and sorting.
External Hash-Based Set Operations:
The approach uses partitioning:
Partition Phase: Hash both R and S into P partitions using the same hash function
Process Phase: For each partition pair (Rᵢ, Sᵢ), run the in-memory algorithm (union, intersect, or except) on just that pair
Since we use the same hash function for R and S, matching tuples land in corresponding partitions. Each partition pair can be processed independently.
```
EXTERNAL_HASH_UNION(R, S, memory_limit):
  // Estimate partition count
  num_partitions = estimate_needed(|R|, |S|, memory_limit)

  // Phase 1: Partition both inputs
  R_partitions = partition(R, num_partitions)  // Using hash function h1
  S_partitions = partition(S, num_partitions)  // Same hash function h1

  // Phase 2: Process each partition pair
  for i = 0 to num_partitions - 1:
    seen = empty hash set (using hash function h2)

    // Read R partition, add to set
    for each tuple t in R_partitions[i]:
      if t not in seen:
        seen.add(t)
        output t

    // Read S partition, add new tuples
    for each tuple t in S_partitions[i]:
      if t not in seen:
        seen.add(t)
        output t

    clear(seen)
    delete_temp_files(R_partitions[i], S_partitions[i])

EXTERNAL_HASH_INTERSECT(R, S, memory_limit):
  // Similar partitioning...
  for i = 0 to num_partitions - 1:
    // Build hash table from R partition
    R_set = build_counting_hash_set(R_partitions[i])

    // Probe with S partition
    for each tuple t in S_partitions[i]:
      if t in R_set and not R_set[t].outputted:
        output t
        R_set[t].outputted = true
```
I/O Cost for External Set Operations:
Let |R| = N_R pages and |S| = N_S pages:
Partitioning: read and write both inputs once, costing 2(N_R + N_S) page I/Os.
Processing: read each partition back once, costing another N_R + N_S page I/Os, plus writing the output.
Total: 3(N_R + N_S) + O_output page operations
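To put rough numbers on the formula (the page counts here are hypothetical, chosen only for illustration):

```python
# Hypothetical input sizes, not taken from the text.
N_R, N_S = 10_000, 2_000            # pages in R and S
partition_io = 2 * (N_R + N_S)      # read + write both inputs: 24,000
process_io = N_R + N_S              # read each partition back once: 12,000
print(partition_io + process_io)    # 36,000 page I/Os, plus writing the output
```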
External Sort-Based Set Operations:
Alternatively, sort both inputs and merge:
Cost: sorting cost plus a single merge pass: O(N log N) comparisons for the sorts and O(N) for the merge, where N = |R| + |S|. If an input is already sorted (for example, delivered by an index scan), its sort can be skipped entirely.
For external set operations, the hash-based approach is generally preferred when no output order is required, while the sort-based approach wins when the inputs arrive sorted or the result must be sorted, mirroring the in-memory trade-offs summarized earlier.
Database optimizers apply various transformations to improve set operation performance.
1. Push Down Selections:
Apply filters before set operations to reduce input sizes:
```sql
-- Before optimization
(SELECT * FROM orders WHERE year = 2024)
UNION
(SELECT * FROM archived_orders WHERE year = 2024)

-- After: Predicates already pushed down
-- (This is often the user's intent, but optimizer ensures it)
```
2. Push Down Projections:
Project only necessary columns before set operations:
```sql
-- Original:  SELECT DISTINCT a FROM (SELECT * FROM R UNION SELECT * FROM S)
-- Optimized: SELECT a FROM R UNION SELECT a FROM S
-- Benefits:  Narrower tuples, possibly fewer distinct values
```
3. Merge Adjacent Operations:
Multiple set operations can sometimes merge:
```
-- Original
(R UNION S) UNION T

-- Optimized: Single 3-way union
-- Process R, S, T with shared hash table
-- Avoids materializing intermediate result
```
4. Convert to Joins When Beneficial:
INTERSECT and EXCEPT can often be rewritten as joins:
```sql
-- INTERSECT as semi-join
SELECT a, b FROM R INTERSECT SELECT a, b FROM S
-- Equivalent to:
SELECT DISTINCT R.a, R.b FROM R
WHERE EXISTS (SELECT 1 FROM S WHERE S.a = R.a AND S.b = R.b)

-- EXCEPT as anti-join
SELECT a, b FROM R EXCEPT SELECT a, b FROM S
-- Equivalent to:
SELECT DISTINCT R.a, R.b FROM R
WHERE NOT EXISTS (SELECT 1 FROM S WHERE S.a = R.a AND S.b = R.b)
```
Join-based rewrites can exploit indexes that set operations cannot.
```
-- Optimization: Empty set elimination
-- If statistics show S is empty:
R UNION S     → R (with duplicate elimination)
R UNION ALL S → R (no change needed)
R INTERSECT S → ∅ (empty result, skip execution)
R EXCEPT S    → R (with duplicate elimination)

-- Optimization: Self-set operations
R UNION R     → DISTINCT R
R INTERSECT R → DISTINCT R
R EXCEPT R    → ∅ (empty result)

-- Optimization: Commutativity exploitation
-- For UNION and INTERSECT, order by estimated size:
-- Put smaller relation first for hash-based (smaller hash table)
-- Example: 1M rows UNION 10K rows → 10K UNION 1M

-- Optimization: Distribute UNION over JOIN
(R JOIN S) UNION (R JOIN T) → R JOIN (S UNION T)
-- When R is expensive to compute, avoid computing it twice
```
Always consider whether you truly need duplicate elimination. UNION ALL, INTERSECT ALL, and EXCEPT ALL avoid the cost of tracking unique tuples. If downstream operations (like aggregation or another dedup) will handle duplicates anyway, using ALL variants can significantly reduce intermediate processing.
In distributed database systems, set operations require coordinating across nodes. The strategies depend on how data is partitioned across the cluster.
Distributed UNION:
UNION ALL is trivial in distributed settings—no coordination needed:
Distributed UNION ALL:
Each node outputs its local data
Results are concatenated at coordinator
For UNION with deduplication:
Option A: Centralized Dedup. Every node streams its local tuples to the coordinator, which performs the duplicate elimination. Simple, but the coordinator becomes both a network and a CPU bottleneck.
Option B: Distributed Dedup. Tuples are redistributed (shuffled) by a hash of the full tuple so that all copies of a value land on the same node, and each node then deduplicates locally. This is the approach shown in the pseudocode below.
```
DISTRIBUTED_UNION(R_distributed, S_distributed):
  // R and S are already distributed across nodes

  // Step 1: Local UNION ALL (no coordination)
  for each node N in parallel:
    local_union = R_local[N] UNION ALL S_local[N]

  // Step 2: Global deduplication via hash redistribution
  for each node N in parallel:
    for each tuple t in local_union:
      target = hash(t) mod num_nodes
      send t to target

  barrier()  // Wait for shuffle

  // Step 3: Local deduplication
  for each node N in parallel:
    result[N] = HASH_DISTINCT(received_tuples)

  // Combined result across all nodes is the answer

DISTRIBUTED_INTERSECT(R_distributed, S_distributed):
  // Co-partition R and S by full tuple hash
  for each node N in parallel:
    for each tuple t in R_local[N]:
      send t to hash(t) mod num_nodes, tagged as 'R'
    for each tuple t in S_local[N]:
      send t to hash(t) mod num_nodes, tagged as 'S'

  barrier()

  // Each node has matching R and S tuples; do local intersect
  for each node N in parallel:
    local_R = tuples tagged 'R'
    local_S = tuples tagged 'S'
    result[N] = HASH_INTERSECT(local_R, local_S)
```
Optimization: Co-located Set Operations
If R and S are already co-partitioned on the same hash key and that key comprises all columns, no shuffle is needed at all: each node can run the set operation purely on its local partitions, and the network cost drops to zero.
This is why physical data layout choices (like table partitioning) are so important for analytical query performance.
Handling Skew:
Skewed data causes load imbalance, with some nodes processing far more data than others. Mitigations include identifying heavy values from statistics and handling them separately, splitting oversized partitions across several nodes, and deduplicating locally before the shuffle so that hot values are sent only once per node.
In distributed set operations, network transfer typically dominates cost. A good rule of thumb: any optimization that reduces bytes transferred across the network is worth significant CPU overhead. This is why local deduplication before shuffle is almost always beneficial, even though it means doing dedup twice.
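A toy calculation (with made-up row counts, purely to illustrate the rule of thumb) shows why local deduplication before the shuffle pays off:

```python
# Hypothetical cluster: 4 nodes, 1,000,000 tuples of ~40 bytes per node,
# but only ~50,000 distinct values per node due to heavy duplication.
nodes, rows, distinct, row_bytes = 4, 1_000_000, 50_000, 40

naive_shuffle = nodes * rows * row_bytes        # ship every tuple: 160 MB
dedup_first   = nodes * distinct * row_bytes    # dedup locally, then ship: 8 MB

print(f"naive shuffle: {naive_shuffle / 1e6:.0f} MB")
print(f"local dedup first: {dedup_first / 1e6:.0f} MB")
```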
Set operations combine query results using mathematical set theory. Let's consolidate the key concepts:
UNION ALL is simple concatenation: fully pipelined, no comparisons, no extra memory.
UNION, INTERSECT, and EXCEPT apply set semantics and therefore need duplicate handling, implemented with either hashing (linear expected time, arbitrary output order) or sorting (higher cost, but sorted output).
The ALL variants preserve bag semantics by tracking tuple multiplicities: minimum counts for INTERSECT ALL, subtracted counts for EXCEPT ALL.
INTERSECT and EXCEPT can be rewritten as semi-joins and anti-joins, opening up join optimizations such as index lookups.
When inputs exceed memory or span a cluster, partitioning by a hash of the full tuple ensures matching tuples meet in the same partition or on the same node.
What's Next:
Set operations and duplicate elimination rely on sorting as a fundamental primitive. The next page dives deep into sorting algorithms used in database systems, from in-memory quicksort to external merge sort, revealing how this classic algorithm adapts to database-scale data.
You now understand how set operations are physically implemented in database systems. From simple UNION ALL concatenation to complex distributed INTERSECT operations, these techniques enable powerful data combination and comparison queries. Next, we'll explore the sorting algorithms that underpin many of these operations.