Database Management SystemsJoin Operations

Join Operations in Relational Algebra

LevelIntermediate

Duration75 mins

TopicJoin Operations

5 / 5

Semi-Join

When Presence Matters, Not Content

Throughout our exploration of join operations, we've focused on combining tuples from multiple relations—creating result rows that merge attributes from both sides. But there's a fundamentally different question we often need to answer:

"Which tuples from R have at least one matching tuple in S?"

Notice what's different: we don't want S's attributes in our result. We only want R tuples that satisfy the matching condition, preserving R's original structure. This is the domain of the semi-join.

Semi-join (⋉) answers questions like:

"Which customers have placed at least one order?"
"Which products have been reviewed?"
"Which departments have employees?"

In each case, we want the original entity (customer, product, department) without duplicating or widening the output with matched data.

What You Will Learn

By the end of this page, you will understand semi-join's formal definition, how it differs from regular joins, its relationship to EXISTS and IN subqueries in SQL, the complementary anti-join operation, why semi-join is crucial for distributed query processing, and practical implementation strategies.

The Semi-Join Concept

The semi-join is a filtering operation rather than a combining operation. It uses relation S as a filter to select which tuples from R to output, but the output contains only R's attributes.

Intuitive Definition:

R ⋉ S = "Tuples from R that have at least one match in S"

Formal Definition:

R ⋉ S = π_{R-attributes}(R ⋈ S)

The semi-join equals the natural (or specified) join, projected back onto R's attributes only. The join creates the pairings; the projection discards S's contributions.

semi_join_definition.txt

Definition

Semi-Join Formal Definition:
═══════════════════════════════════════════════════════════════════
 
Notation:          R ⋉ S  (left semi-join)
                   R ⋊ S  (right semi-join) — less common
                   The 'half-bowtie' indicates one-sided output
 
Formal:            R ⋉ S = π_R(R ⋈ S)
 
                   Where π_R means project onto all attributes of R
 
Alternative:       R ⋉ S = { r ∈ R | ∃s ∈ S : r and s match on common attributes }
 
Key Properties:
───────────────────────────────────────────────────────────────────
• Result schema:   EXACTLY the schema of R (no S attributes)
• Result tuples:   Subset of R (no duplicates added)
• Cardinality:     |R ⋉ S| ≤ |R|
• No Duplicates:   Each R tuple appears AT MOST ONCE
                   (even if it matches multiple S tuples)
 
Condition Variants:
───────────────────────────────────────────────────────────────────
R ⋉_θ S    = Semi-join with theta condition θ
R ⋉_{A=B} S = Semi-join with explicit equijoin condition
 
SQL Equivalent:    SELECT * FROM R WHERE EXISTS (SELECT 1 FROM S 
                                                 WHERE R.attr = S.attr)
═══════════════════════════════════════════════════════════════════

Critical Observation: No Duplicates

Consider a customer who has placed 10 orders. In a regular join:

CUSTOMER ⋈ ORDER produces 10 rows for this customer

In a semi-join:

CUSTOMER ⋉ ORDER produces exactly 1 row for this customer

The semi-join asks "does a match exist?" not "list all matches." Once existence is established, the question is answered—additional matches are irrelevant.

Semi-Join vs Join + Distinct

You might think: 'Can't I just do JOIN then DISTINCT?' Semantically, yes:

R ⋉ S = DISTINCT(π_R(R ⋈ S))

But this is computationally wasteful—you generate all duplicates then remove them. Semi-join implementations never generate duplicates; they stop checking after finding the first match.

Semi-Join vs Regular Join: The Key Differences

Understanding when to use semi-join versus regular join is crucial for writing efficient, correct queries. Let's examine the differences with a concrete example.

semi_vs_regular_example.txt

Comparison

CUSTOMER(CustID, Name, City)
════════════════════════════════════════
C001    Alice Chen      New York
C002    Bob Patel       Chicago
C003    Carol Wang      Boston
 
ORDER(OrderID, CustID, Amount)
════════════════════════════════════════
O101    C001    250.00
O102    C001    175.50
O103    C001    89.99
O104    C002    445.00
 
Query: "Which customers have placed at least one order?"
 
 
═══════════════════════════════════════════════════════════════════
REGULAR JOIN: CUSTOMER ⋈(CustID=CustID) ORDER
═══════════════════════════════════════════════════════════════════
 
Result (4 rows — Alice appears 3 times!):
CustID | Name        | City     | OrderID | Amount
──────────────────────────────────────────────────────
C001   | Alice Chen  | New York | O101    | 250.00
C001   | Alice Chen  | New York | O102    | 175.50
C001   | Alice Chen  | New York | O103    | 89.99
C002   | Bob Patel   | Chicago  | O104    | 445.00
 
Problems:
• Alice duplicated 3× (one per order)
• Result contains ORDER attributes we didn't ask for
• Larger result set (4 rows vs 2 expected)
 
 
═══════════════════════════════════════════════════════════════════
SEMI-JOIN: CUSTOMER ⋉(CustID=CustID) ORDER
═══════════════════════════════════════════════════════════════════
 
Result (2 rows — exactly the customers with orders):
CustID | Name        | City
─────────────────────────────────
C001   | Alice Chen  | New York
C002   | Bob Patel   | Chicago
 
Benefits:
• Each customer appears exactly once
• Only CUSTOMER attributes (exactly what we asked)
• Minimal result set
• Carol excluded (no orders)

Semi-Join vs Regular Join Comparison
Property	Regular Join (⋈)	Semi-Join (⋉)
Result Schema	R ∪ S attributes	Only R attributes
Duplicates	One row per match	One row per R tuple (max)
Cardinality Bound	Up to \|R\| × \|S\|	Up to \|R\|
S data in output?	Yes	No
Questions answered	"What are all matches?"	"Which R tuples have matches?"
SQL pattern	SELECT * FROM R JOIN S...	SELECT R.* WHERE EXISTS...

Choosing the Right Operation

Use SEMI-JOIN when: • You need entities that satisfy a condition involving another table • You don't need the other table's columns • You want each entity once regardless of match count

Use REGULAR JOIN when: • You need data from both tables combined • Each match represents a meaningful distinct result • You're computing aggregates over matched pairs

Semi-Join in SQL

Standard SQL doesn't have a SEMI JOIN keyword, but semi-join semantics are achieved through EXISTS subqueries or IN expressions. Query optimizers recognize these patterns and implement them efficiently.

The EXISTS Pattern (Most Common):

semi_join_sql.sql
SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
-- ════════════════════════════════════════════════════════════════
-- Pattern 1: EXISTS (Preferred for Semi-Join)
-- ════════════════════════════════════════════════════════════════
 
-- Customers who have placed at least one order
SELECT *
FROM customers c
WHERE EXISTS (
    SELECT 1  -- The 1 is arbitrary; optimizer ignores it
    FROM orders o
    WHERE o.customer_id = c.customer_id
);
 
-- Products that have at least one review
SELECT p.*
FROM products p
WHERE EXISTS (
    SELECT 1
    FROM reviews r
    WHERE r.product_id = p.product_id
);
 
-- Departments with at least one employee
SELECT d.*
FROM departments d
WHERE EXISTS (
    SELECT 1
    FROM employees e
    WHERE e.dept_id = d.dept_id
);
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 2: IN (Alternative, sometimes less efficient)
-- ════════════════════════════════════════════════════════════════
 
-- Same query using IN
SELECT *
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
);
 
-- Warning: IN with NULLs can be tricky
-- If the subquery returns NULL, rows won't match!
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 3: INNER JOIN + DISTINCT (Avoid if possible)
-- ════════════════════════════════════════════════════════════════
 
-- Works but inefficient
SELECT DISTINCT c.*
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;
 
-- This generates duplicates then removes them — wasteful!
-- Optimizer might convert this to semi-join, but don't rely on it
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 4: Correlated Condition in Semi-Join
-- ════════════════════════════════════════════════════════════════
 
-- Customers who ordered products in category 'Electronics'
SELECT c.*
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
    WHERE o.customer_id = c.customer_id
      AND p.category = 'Electronics'
);
 
-- The EXISTS subquery can be arbitrarily complex
-- As long as it references the outer table (c.customer_id)

EXISTS vs IN: Performance

• EXISTS often performs better with correlated subqueries—it stops at first match • IN may work better with small, static lists or when the subquery result is cached • Modern optimizers often transform one to the other, but EXISTS is typically safer • EXISTS handles NULL correctly; IN with NULL subquery results can fail silently

Anti-Join: The Complement of Semi-Join

If semi-join answers "which R tuples have matches?" then anti-join answers the complementary question: "which R tuples have no matches?"

Anti-Join Definition:

R ▷ S = R - (R ⋉ S)

Anti-join returns tuples from R that have no matching tuple in S. It's the set difference between R and R's semi-join with S.

anti_join_definition.txt

Definition

Anti-Join Formal Definition:
═══════════════════════════════════════════════════════════════════
 
Notation:          R ▷ S  (anti-join)
                   Also written: R ⊳ S, R ⟈ S (various notations)
 
Formal:            R ▷ S = R - (R ⋉ S)
                         = R - π_R(R ⋈ S)
 
Alternative:       R ▷ S = { r ∈ R | ¬∃s ∈ S : r and s match }
 
Semantic:          "Tuples from R that have NO match in S"
 
Key Properties:
───────────────────────────────────────────────────────────────────
• Result schema:   EXACTLY the schema of R
• Result tuples:   Subset of R
• Cardinality:     |R ▷ S| ≤ |R|
• Complement:      (R ⋉ S) ∪ (R ▷ S) = R  (partition of R)
• Disjoint:        (R ⋉ S) ∩ (R ▷ S) = ∅
 
SQL Equivalent:    SELECT * FROM R WHERE NOT EXISTS 
                   (SELECT 1 FROM S WHERE R.attr = S.attr)
═══════════════════════════════════════════════════════════════════

Example: Finding Customers Without Orders

anti_join_example.txt

Example

CUSTOMER(CustID, Name, City)
════════════════════════════════════════
C001    Alice Chen      New York
C002    Bob Patel       Chicago
C003    Carol Wang      Boston
C004    David Kim       Seattle
 
ORDER(OrderID, CustID, Amount)
════════════════════════════════════════
O101    C001    250.00
O102    C002    445.00
 
Query: "Which customers have NEVER placed an order?"
 
CUSTOMER ▷(CustID=CustID) ORDER
════════════════════════════════════════════════════════════════
 
Semi-join (customers WITH orders):
  C001 | Alice Chen  | New York
  C002 | Bob Patel   | Chicago
 
Anti-join (customers WITHOUT orders) = CUSTOMER - Semi-join:
  C003 | Carol Wang  | Boston      ← Never ordered
  C004 | David Kim   | Seattle     ← Never ordered
 
Result (2 rows):
CustID | Name        | City
─────────────────────────────────
C003   | Carol Wang  | Boston
C004   | David Kim   | Seattle

Anti-Join in SQL:

anti_join_sql.sql
SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
-- ════════════════════════════════════════════════════════════════
-- Pattern 1: NOT EXISTS (Preferred)
-- ════════════════════════════════════════════════════════════════
 
-- Customers who have NEVER ordered
SELECT *
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
);
 
-- Products with NO reviews
SELECT p.*
FROM products p
WHERE NOT EXISTS (
    SELECT 1
    FROM reviews r
    WHERE r.product_id = p.product_id
);
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 2: NOT IN (Careful with NULLs!)
-- ════════════════════════════════════════════════════════════════
 
-- Same query using NOT IN
SELECT *
FROM customers
WHERE customer_id NOT IN (
    SELECT customer_id FROM orders
);
 
-- WARNING: If orders.customer_id contains ANY NULL,
-- NOT IN returns NO ROWS! This is a common bug.
 
-- Safe version with NULL filtering:
SELECT *
FROM customers
WHERE customer_id NOT IN (
    SELECT customer_id FROM orders WHERE customer_id IS NOT NULL
);
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 3: LEFT JOIN + IS NULL (Common Alternative)
-- ════════════════════════════════════════════════════════════════
 
-- Using outer join to find non-matches
SELECT c.*
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;
 
-- This works because unmatched rows have NULL for order columns
-- Often optimized similarly to NOT EXISTS
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 4: EXCEPT (Set Difference)
-- ════════════════════════════════════════════════════════════════
 
-- Find customers not in the orders table
SELECT customer_id FROM customers
EXCEPT
SELECT customer_id FROM orders;
 
-- Returns only the key; for full rows, use EXISTS pattern

The NOT IN + NULL Trap

NOT IN behaves unexpectedly with NULLs:

• If ANY value in the subquery is NULL, NOT IN returns FALSE for everything! • Example: 1 NOT IN (2, NULL) returns UNKNOWN, not TRUE • This causes queries to silently return zero rows

Always prefer NOT EXISTS for anti-join, or explicitly filter NULLs from IN subqueries.

Semi-Join in Distributed Query Processing

Semi-join has special significance in distributed database systems. When data resides on different nodes, joining across nodes requires data movement. Semi-join provides an elegant optimization called semi-join reduction.

The Distributed Join Problem:

Imagine CUSTOMER lives on Node A and ORDER lives on Node B. To compute CUSTOMER ⋈ ORDER:

Naive approach: Ship all of ORDER to Node A (or vice versa) Problem: Moving large relations is expensive

Semi-join optimization: Only move what's needed

semi_join_distributed.txt

Strategy

Semi-Join Reduction Strategy for Distributed Joins:
═══════════════════════════════════════════════════════════════════
 
Setup:
  Node A has: CUSTOMER(CustID, Name, City)       100,000 rows
  Node B has: ORDER(OrderID, CustID, Amount)     500,000 rows
  Join: CUSTOMER ⋈ ORDER
  
Goal: Minimize data transfer between nodes
 
═══════════════════════════════════════════════════════════════════
Naive Approach: Ship entire ORDER to Node A
═══════════════════════════════════════════════════════════════════
  Transfer: 500,000 rows
  Network cost: Very high
  
═══════════════════════════════════════════════════════════════════
Semi-Join Reduction: 3-Step Process
═══════════════════════════════════════════════════════════════════
 
Step 1: Send distinct join attributes from smaller side to larger
────────────────────────────────────────────────────────────────
  Node A → Node B: Send { DISTINCT CustID from CUSTOMER }
  Transfer: ~100,000 distinct IDs (much smaller than full rows!)
  
Step 2: Filter the larger side using semi-join
────────────────────────────────────────────────────────────────
  Node B: Compute ORDER ⋉ {received CustIDs}
  Result: Only ORDERs that WILL match (discard orders with 
          CustIDs not in CUSTOMER)
  
  If only 80% of CustIDs have orders, we've reduced ORDER
  to relevant rows only
  
Step 3: Send filtered result for final join
────────────────────────────────────────────────────────────────
  Node B → Node A: Send filtered ORDER rows
  Transfer: Much smaller than 500,000!
  Node A: Compute final join locally
 
 
BENEFIT:
═══════════════════════════════════════════════════════════════════
  Instead of shipping 500,000 rows, we ship:
  • 100,000 IDs (Step 1)
  • X filtered rows where X << 500,000 (Step 3)
  
  If there's low join selectivity (many ORDERs don't match),
  the reduction can be dramatic!

When Semi-Join Reduction Helps:

Scenario	Reduction Effectiveness
Many ORDER rows don't match any CUSTOMER	Excellent — filter removes dangling orders
Each CUSTOMER has many ORDERs	Limited — most orders still transfer
Small relation dominates join selectivity	Good — filtering is significant
Very wide rows	Good — sending IDs vs full rows saves bandwidth
Network is the bottleneck	Critical — any reduction is valuable

Bloom Filter Optimization

In practice, distributed systems often use Bloom filters instead of sending exact join keys:

Build a Bloom filter from CUSTOMER.CustID
Send the (compact) filter to Node B
Node B filters ORDER using the filter (with small false positive rate)
Transfer filtered orders

Bloom filters are much smaller than distinct key sets, further reducing network cost.

Implementation Strategies

Semi-join can be implemented more efficiently than regular join by exploiting its "existence only" semantics. Query optimizers implement several strategies:

1. Early Termination:

Once a tuple from R finds ONE match in S, searching can stop for that tuple. This is the key efficiency gain.

semi_join_algorithm.pseudo

Pseudocode

ALGORITHM SemiJoinNestedLoop(R, S, condition)
────────────────────────────────────────────────────────────
result ← ∅
 
FOR EACH tuple r IN R:
    found ← FALSE
    FOR EACH tuple s IN S:
        IF condition(r, s) THEN
            found ← TRUE
            BREAK                    // ← KEY: Stop after first match!
        END IF
    END FOR
    IF found THEN
        result ← result ∪ {r}
    END IF
END FOR
 
RETURN result
 
COMPARISON TO REGULAR JOIN:
────────────────────────────────────────────────────────────
Regular Join: Must find ALL matches (no BREAK)
Semi-Join: Stops at FIRST match (early termination)
 
If R tuple matches 1000 S tuples:
  Regular: Processes all 1000
  Semi: Processes 1 and stops

2. Hash-Based Semi-Join:

Build a hash table on S's join column, then probe for each R tuple. If probe finds ANY match, output R tuple.

hash_semi_join.pseudo

Pseudocode

ALGORITHM HashSemiJoin(R, S, r_attr, s_attr)
────────────────────────────────────────────────────────────
// Build phase: Create hash set from S's join values
joinValues ← ∅ (hash set)
FOR EACH tuple s IN S:
    ADD s[s_attr] TO joinValues
 
// Probe phase: Check R tuples for membership
result ← ∅
FOR EACH tuple r IN R:
    IF r[r_attr] IN joinValues THEN
        result ← result ∪ {r}
    END IF
 
RETURN result
 
KEY INSIGHT:
────────────────────────────────────────────────────────────
We only need S's JOIN COLUMN VALUES, not full S tuples!
The hash set is smaller and lookups are O(1).
 
Space: O(|distinct S join values|) vs O(|S|) for regular join

Semi-Join Optimization Opportunities

•Early termination — Stop searching after first match per R tuple
•Hash set instead of hash table — Store only values, not full S tuples
•Bloom filter pre-filtering — Quick rejection before detailed probe
•Index-only scans — May not need to access S's data pages at all
•Parallel probing — Each R tuple's existence check is independent
•Materialized join values — Precompute and cache distinct S values

Query Plan Recognition

In EXPLAIN output, look for: • 'Semi Join' or 'SemiJoin' node types • 'Exists' operators • 'IN' with subquery that's optimized to semi-join • Hash set build instead of full hash table

If you see 'Join + Distinct' for what should be a semi-join, the optimizer may not have recognized the pattern—consider rewriting with EXISTS.

Use Cases and Query Patterns

Semi-join and anti-join appear in numerous practical query patterns. Recognizing these patterns helps write efficient queries.

semi_anti_patterns.sql
SQL Patterns
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
-- ════════════════════════════════════════════════════════════════
-- Pattern 1: Entity Filtering by Related Data (Semi-Join)
-- ════════════════════════════════════════════════════════════════
 
-- Customers who bought a specific product
SELECT c.*
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    WHERE o.customer_id = c.customer_id
      AND oi.product_id = 'P001'
);
 
-- Users who logged in this month
SELECT u.*
FROM users u
WHERE EXISTS (
    SELECT 1
    FROM login_events e
    WHERE e.user_id = u.user_id
      AND e.login_time >= DATE_TRUNC('month', CURRENT_DATE)
);
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 2: Finding Missing Data (Anti-Join)
-- ════════════════════════════════════════════════════════════════
 
-- Customers without any orders in the last year
SELECT c.*
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
      AND o.order_date >= CURRENT_DATE - INTERVAL '1 year'
);
 
-- Products not in any active promotions
SELECT p.*
FROM products p
WHERE NOT EXISTS (
    SELECT 1
    FROM promotions pr
    WHERE pr.product_id = p.product_id
      AND pr.end_date >= CURRENT_DATE
);
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 3: Set Membership Checks (Semi-Join)
-- ════════════════════════════════════════════════════════════════
 
-- Orders from VIP customers
SELECT o.*
FROM orders o
WHERE EXISTS (
    SELECT 1
    FROM vip_customers v
    WHERE v.customer_id = o.customer_id
);
 
-- Equivalent but often less efficient:
SELECT o.*
FROM orders o
WHERE o.customer_id IN (SELECT customer_id FROM vip_customers);
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 4: Data Quality Checks (Anti-Join)
-- ════════════════════════════════════════════════════════════════
 
-- Orders with invalid customer references
SELECT o.*
FROM orders o
WHERE NOT EXISTS (
    SELECT 1
    FROM customers c
    WHERE c.customer_id = o.customer_id
);
 
-- Employees assigned to non-existent departments
SELECT e.*
FROM employees e
WHERE NOT EXISTS (
    SELECT 1
    FROM departments d
    WHERE d.dept_id = e.dept_id
);
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 5: Conditional Inclusion (Combined)
-- ════════════════════════════════════════════════════════════════
 
-- Active products with stock, regardless of sales history
SELECT p.product_id, p.name,
    CASE 
        WHEN EXISTS (SELECT 1 FROM sales s WHERE s.product_id = p.product_id)
        THEN 'Has Sales'
        ELSE 'No Sales Yet'
    END AS sales_status
FROM products p
WHERE p.is_active = TRUE
  AND p.stock_quantity > 0;

Semi-Join for Deduplication

When you need 'distinct' entities that have related records, semi-join naturally provides deduplication:

• Regular join: might produce duplicates requiring DISTINCT • Semi-join: inherently produces each entity at most once

This makes EXISTS ideal for 'list all X that have at least one Y' queries.

Algebraic Properties of Semi-Join

Understanding semi-join's algebraic properties enables query optimization and transformation.

Key Properties:

Algebraic Properties of Semi-Join
Property	Statement	Notes
Idempotence	R ⋉ R = R	Semi-join with self is identity
Non-Commutativity	R ⋉ S ≠ S ⋉ R (usually)	Result has R's schema, not S's
Distribution over Union	R ⋉ (S ∪ T) = (R ⋉ S) ∪ (R ⋉ T)	Enables parallel processing
Absorption	(R ⋈ S) ⋉ T = R ⋈ (S ⋉ T) (when T's attrs ⊂ S)	Useful for optimization
Partition with Anti-Join	(R ⋉ S) ∪ (R ▷ S) = R	Semi and anti perfectly partition R
Disjoint with Anti-Join	(R ⋉ S) ∩ (R ▷ S) = ∅	Semi and anti are mutually exclusive

Relationship to Other Operations:

R ⋉ S = π_R(R ⋈ S)           // Definition via natural join
R ⋉ S = R - (R ▷ S)          // Via anti-join complement
R ▷ S = R - (R ⋉ S)          // Anti-join via semi-join
R ⋈ S = (R ⋉ S) ⋈ S          // Join can use semi-join as filter

Optimization Transformations:

-- Original (may produce duplicates, less efficient)
SELECT DISTINCT c.* 
FROM customers c 
JOIN orders o ON c.id = o.customer_id;

-- Optimized (semi-join semantics)
SELECT c.* 
FROM customers c 
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

Optimizers recognize the first pattern and may transform it to the second internally.

Query Rewriting for Performance

If you see slow 'SELECT DISTINCT... JOIN...' queries where only one table's columns are selected, rewrite using EXISTS:

SELECT DISTINCT a.* FROM a JOIN b ON... → SELECT a.* FROM a WHERE EXISTS(...)

The EXISTS version avoids generating duplicates entirely.

Summary: Semi-Join and Anti-Join Mastery

We've thoroughly explored semi-join and anti-join—essential filtering operations that answer existence questions without widening results. Let's consolidate the key concepts:

Key Takeaways

•Semi-join filters by match existence — R ⋉ S returns R tuples that have at least one match in S, with R's schema only.
•No duplicates from multiple matches — Unlike regular join, each R tuple appears at most once.
•Anti-join is the complement — R ▷ S returns R tuples with NO match in S.
•SQL uses EXISTS and NOT EXISTS — No SEMI JOIN keyword, but EXISTS patterns implement the semantics.
•Beware NOT IN with NULLs — NOT EXISTS is safer and typically faster for anti-join.
•Early termination is key optimization — Stop searching after first match per tuple.
•Crucial for distributed systems — Semi-join reduction minimizes data transfer across nodes.
•Semi + Anti partition R — Together they cover all of R with no overlap.

Module Complete: Join Operations Mastery

With semi-join and anti-join understood, we've completed our journey through the join operation family:

Theta Join — The general form with arbitrary conditions
Equijoin — Theta join with equality conditions only (most common)
Natural Join — Implicit matching on shared attribute names
Outer Joins — Preserving non-matching tuples with NULL padding
Semi-Join & Anti-Join — Existence filtering without widening

This comprehensive understanding of join operations forms the foundation for advanced query formulation, optimization, and database system design. The ability to select the right join type for each situation—and understand its computational implications—distinguishes skilled database professionals.

Module Complete

You now have comprehensive mastery of join operations in relational algebra—from the foundational theta join through specialized variants including equijoin, natural join, outer joins, semi-join, and anti-join. This knowledge enables you to write efficient, correct queries for any relational data retrieval scenario and understand how database systems execute them.

5 / 5

Loading learning content...

Database Management SystemsJoin Operations

Join Operations in Relational Algebra

LevelIntermediate

Duration75 mins

TopicJoin Operations

5 / 5

Semi-Join

When Presence Matters, Not Content

"Which tuples from R have at least one matching tuple in S?"

Semi-join (⋉) answers questions like:

"Which customers have placed at least one order?"
"Which products have been reviewed?"
"Which departments have employees?"

In each case, we want the original entity (customer, product, department) without duplicating or widening the output with matched data.

What You Will Learn

The Semi-Join Concept

The semi-join is a filtering operation rather than a combining operation. It uses relation S as a filter to select which tuples from R to output, but the output contains only R's attributes.

Intuitive Definition:

R ⋉ S = "Tuples from R that have at least one match in S"

Formal Definition:

R ⋉ S = π_{R-attributes}(R ⋈ S)

The semi-join equals the natural (or specified) join, projected back onto R's attributes only. The join creates the pairings; the projection discards S's contributions.

semi_join_definition.txt

Definition

Semi-Join Formal Definition:
═══════════════════════════════════════════════════════════════════
 
Notation:          R ⋉ S  (left semi-join)
                   R ⋊ S  (right semi-join) — less common
                   The 'half-bowtie' indicates one-sided output
 
Formal:            R ⋉ S = π_R(R ⋈ S)
 
                   Where π_R means project onto all attributes of R
 
Alternative:       R ⋉ S = { r ∈ R | ∃s ∈ S : r and s match on common attributes }
 
Key Properties:
───────────────────────────────────────────────────────────────────
• Result schema:   EXACTLY the schema of R (no S attributes)
• Result tuples:   Subset of R (no duplicates added)
• Cardinality:     |R ⋉ S| ≤ |R|
• No Duplicates:   Each R tuple appears AT MOST ONCE
                   (even if it matches multiple S tuples)
 
Condition Variants:
───────────────────────────────────────────────────────────────────
R ⋉_θ S    = Semi-join with theta condition θ
R ⋉_{A=B} S = Semi-join with explicit equijoin condition
 
SQL Equivalent:    SELECT * FROM R WHERE EXISTS (SELECT 1 FROM S 
                                                 WHERE R.attr = S.attr)
═══════════════════════════════════════════════════════════════════

Critical Observation: No Duplicates

Consider a customer who has placed 10 orders. In a regular join:

CUSTOMER ⋈ ORDER produces 10 rows for this customer

In a semi-join:

CUSTOMER ⋉ ORDER produces exactly 1 row for this customer

The semi-join asks "does a match exist?" not "list all matches." Once existence is established, the question is answered—additional matches are irrelevant.

Semi-Join vs Join + Distinct

You might think: 'Can't I just do JOIN then DISTINCT?' Semantically, yes:

R ⋉ S = DISTINCT(π_R(R ⋈ S))

But this is computationally wasteful—you generate all duplicates then remove them. Semi-join implementations never generate duplicates; they stop checking after finding the first match.

Semi-Join vs Regular Join: The Key Differences

Understanding when to use semi-join versus regular join is crucial for writing efficient, correct queries. Let's examine the differences with a concrete example.

semi_vs_regular_example.txt

Comparison

CUSTOMER(CustID, Name, City)
════════════════════════════════════════
C001    Alice Chen      New York
C002    Bob Patel       Chicago
C003    Carol Wang      Boston
 
ORDER(OrderID, CustID, Amount)
════════════════════════════════════════
O101    C001    250.00
O102    C001    175.50
O103    C001    89.99
O104    C002    445.00
 
Query: "Which customers have placed at least one order?"
 
 
═══════════════════════════════════════════════════════════════════
REGULAR JOIN: CUSTOMER ⋈(CustID=CustID) ORDER
═══════════════════════════════════════════════════════════════════
 
Result (4 rows — Alice appears 3 times!):
CustID | Name        | City     | OrderID | Amount
──────────────────────────────────────────────────────
C001   | Alice Chen  | New York | O101    | 250.00
C001   | Alice Chen  | New York | O102    | 175.50
C001   | Alice Chen  | New York | O103    | 89.99
C002   | Bob Patel   | Chicago  | O104    | 445.00
 
Problems:
• Alice duplicated 3× (one per order)
• Result contains ORDER attributes we didn't ask for
• Larger result set (4 rows vs 2 expected)
 
 
═══════════════════════════════════════════════════════════════════
SEMI-JOIN: CUSTOMER ⋉(CustID=CustID) ORDER
═══════════════════════════════════════════════════════════════════
 
Result (2 rows — exactly the customers with orders):
CustID | Name        | City
─────────────────────────────────
C001   | Alice Chen  | New York
C002   | Bob Patel   | Chicago
 
Benefits:
• Each customer appears exactly once
• Only CUSTOMER attributes (exactly what we asked)
• Minimal result set
• Carol excluded (no orders)

Semi-Join vs Regular Join Comparison
Property	Regular Join (⋈)	Semi-Join (⋉)
Result Schema	R ∪ S attributes	Only R attributes
Duplicates	One row per match	One row per R tuple (max)
Cardinality Bound	Up to \|R\| × \|S\|	Up to \|R\|
S data in output?	Yes	No
Questions answered	"What are all matches?"	"Which R tuples have matches?"
SQL pattern	SELECT * FROM R JOIN S...	SELECT R.* WHERE EXISTS...

Choosing the Right Operation

Use SEMI-JOIN when: • You need entities that satisfy a condition involving another table • You don't need the other table's columns • You want each entity once regardless of match count

Use REGULAR JOIN when: • You need data from both tables combined • Each match represents a meaningful distinct result • You're computing aggregates over matched pairs

Semi-Join in SQL

The EXISTS Pattern (Most Common):

semi_join_sql.sql
SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
-- ════════════════════════════════════════════════════════════════
-- Pattern 1: EXISTS (Preferred for Semi-Join)
-- ════════════════════════════════════════════════════════════════
 
-- Customers who have placed at least one order
SELECT *
FROM customers c
WHERE EXISTS (
    SELECT 1  -- The 1 is arbitrary; optimizer ignores it
    FROM orders o
    WHERE o.customer_id = c.customer_id
);
 
-- Products that have at least one review
SELECT p.*
FROM products p
WHERE EXISTS (
    SELECT 1
    FROM reviews r
    WHERE r.product_id = p.product_id
);
 
-- Departments with at least one employee
SELECT d.*
FROM departments d
WHERE EXISTS (
    SELECT 1
    FROM employees e
    WHERE e.dept_id = d.dept_id
);
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 2: IN (Alternative, sometimes less efficient)
-- ════════════════════════════════════════════════════════════════
 
-- Same query using IN
SELECT *
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
);
 
-- Warning: IN with NULLs can be tricky
-- If the subquery returns NULL, rows won't match!
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 3: INNER JOIN + DISTINCT (Avoid if possible)
-- ════════════════════════════════════════════════════════════════
 
-- Works but inefficient
SELECT DISTINCT c.*
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id;
 
-- This generates duplicates then removes them — wasteful!
-- Optimizer might convert this to semi-join, but don't rely on it
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 4: Correlated Condition in Semi-Join
-- ════════════════════════════════════════════════════════════════
 
-- Customers who ordered products in category 'Electronics'
SELECT c.*
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    JOIN products p ON oi.product_id = p.product_id
    WHERE o.customer_id = c.customer_id
      AND p.category = 'Electronics'
);
 
-- The EXISTS subquery can be arbitrarily complex
-- As long as it references the outer table (c.customer_id)

EXISTS vs IN: Performance

Anti-Join: The Complement of Semi-Join

If semi-join answers "which R tuples have matches?" then anti-join answers the complementary question: "which R tuples have no matches?"

Anti-Join Definition:

R ▷ S = R - (R ⋉ S)

Anti-join returns tuples from R that have no matching tuple in S. It's the set difference between R and R's semi-join with S.

anti_join_definition.txt

Definition

Anti-Join Formal Definition:
═══════════════════════════════════════════════════════════════════
 
Notation:          R ▷ S  (anti-join)
                   Also written: R ⊳ S, R ⟈ S (various notations)
 
Formal:            R ▷ S = R - (R ⋉ S)
                         = R - π_R(R ⋈ S)
 
Alternative:       R ▷ S = { r ∈ R | ¬∃s ∈ S : r and s match }
 
Semantic:          "Tuples from R that have NO match in S"
 
Key Properties:
───────────────────────────────────────────────────────────────────
• Result schema:   EXACTLY the schema of R
• Result tuples:   Subset of R
• Cardinality:     |R ▷ S| ≤ |R|
• Complement:      (R ⋉ S) ∪ (R ▷ S) = R  (partition of R)
• Disjoint:        (R ⋉ S) ∩ (R ▷ S) = ∅
 
SQL Equivalent:    SELECT * FROM R WHERE NOT EXISTS 
                   (SELECT 1 FROM S WHERE R.attr = S.attr)
═══════════════════════════════════════════════════════════════════

Example: Finding Customers Without Orders

anti_join_example.txt

Example

CUSTOMER(CustID, Name, City)
════════════════════════════════════════
C001    Alice Chen      New York
C002    Bob Patel       Chicago
C003    Carol Wang      Boston
C004    David Kim       Seattle
 
ORDER(OrderID, CustID, Amount)
════════════════════════════════════════
O101    C001    250.00
O102    C002    445.00
 
Query: "Which customers have NEVER placed an order?"
 
CUSTOMER ▷(CustID=CustID) ORDER
════════════════════════════════════════════════════════════════
 
Semi-join (customers WITH orders):
  C001 | Alice Chen  | New York
  C002 | Bob Patel   | Chicago
 
Anti-join (customers WITHOUT orders) = CUSTOMER - Semi-join:
  C003 | Carol Wang  | Boston      ← Never ordered
  C004 | David Kim   | Seattle     ← Never ordered
 
Result (2 rows):
CustID | Name        | City
─────────────────────────────────
C003   | Carol Wang  | Boston
C004   | David Kim   | Seattle

Anti-Join in SQL:

anti_join_sql.sql
SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
-- ════════════════════════════════════════════════════════════════
-- Pattern 1: NOT EXISTS (Preferred)
-- ════════════════════════════════════════════════════════════════
 
-- Customers who have NEVER ordered
SELECT *
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
);
 
-- Products with NO reviews
SELECT p.*
FROM products p
WHERE NOT EXISTS (
    SELECT 1
    FROM reviews r
    WHERE r.product_id = p.product_id
);
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 2: NOT IN (Careful with NULLs!)
-- ════════════════════════════════════════════════════════════════
 
-- Same query using NOT IN
SELECT *
FROM customers
WHERE customer_id NOT IN (
    SELECT customer_id FROM orders
);
 
-- WARNING: If orders.customer_id contains ANY NULL,
-- NOT IN returns NO ROWS! This is a common bug.
 
-- Safe version with NULL filtering:
SELECT *
FROM customers
WHERE customer_id NOT IN (
    SELECT customer_id FROM orders WHERE customer_id IS NOT NULL
);
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 3: LEFT JOIN + IS NULL (Common Alternative)
-- ════════════════════════════════════════════════════════════════
 
-- Using outer join to find non-matches
SELECT c.*
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;
 
-- This works because unmatched rows have NULL for order columns
-- Often optimized similarly to NOT EXISTS
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 4: EXCEPT (Set Difference)
-- ════════════════════════════════════════════════════════════════
 
-- Find customers not in the orders table
SELECT customer_id FROM customers
EXCEPT
SELECT customer_id FROM orders;
 
-- Returns only the key; for full rows, use EXISTS pattern

The NOT IN + NULL Trap

NOT IN behaves unexpectedly with NULLs:

• If ANY value in the subquery is NULL, NOT IN returns FALSE for everything! • Example: 1 NOT IN (2, NULL) returns UNKNOWN, not TRUE • This causes queries to silently return zero rows

Always prefer NOT EXISTS for anti-join, or explicitly filter NULLs from IN subqueries.

Semi-Join in Distributed Query Processing

The Distributed Join Problem:

Imagine CUSTOMER lives on Node A and ORDER lives on Node B. To compute CUSTOMER ⋈ ORDER:

Naive approach: Ship all of ORDER to Node A (or vice versa) Problem: Moving large relations is expensive

Semi-join optimization: Only move what's needed

semi_join_distributed.txt

Strategy

Semi-Join Reduction Strategy for Distributed Joins:
═══════════════════════════════════════════════════════════════════
 
Setup:
  Node A has: CUSTOMER(CustID, Name, City)       100,000 rows
  Node B has: ORDER(OrderID, CustID, Amount)     500,000 rows
  Join: CUSTOMER ⋈ ORDER
  
Goal: Minimize data transfer between nodes
 
═══════════════════════════════════════════════════════════════════
Naive Approach: Ship entire ORDER to Node A
═══════════════════════════════════════════════════════════════════
  Transfer: 500,000 rows
  Network cost: Very high
  
═══════════════════════════════════════════════════════════════════
Semi-Join Reduction: 3-Step Process
═══════════════════════════════════════════════════════════════════
 
Step 1: Send distinct join attributes from smaller side to larger
────────────────────────────────────────────────────────────────
  Node A → Node B: Send { DISTINCT CustID from CUSTOMER }
  Transfer: ~100,000 distinct IDs (much smaller than full rows!)
  
Step 2: Filter the larger side using semi-join
────────────────────────────────────────────────────────────────
  Node B: Compute ORDER ⋉ {received CustIDs}
  Result: Only ORDERs that WILL match (discard orders with 
          CustIDs not in CUSTOMER)
  
  If only 80% of CustIDs have orders, we've reduced ORDER
  to relevant rows only
  
Step 3: Send filtered result for final join
────────────────────────────────────────────────────────────────
  Node B → Node A: Send filtered ORDER rows
  Transfer: Much smaller than 500,000!
  Node A: Compute final join locally
 
 
BENEFIT:
═══════════════════════════════════════════════════════════════════
  Instead of shipping 500,000 rows, we ship:
  • 100,000 IDs (Step 1)
  • X filtered rows where X << 500,000 (Step 3)
  
  If there's low join selectivity (many ORDERs don't match),
  the reduction can be dramatic!

When Semi-Join Reduction Helps:

Scenario	Reduction Effectiveness
Many ORDER rows don't match any CUSTOMER	Excellent — filter removes dangling orders
Each CUSTOMER has many ORDERs	Limited — most orders still transfer
Small relation dominates join selectivity	Good — filtering is significant
Very wide rows	Good — sending IDs vs full rows saves bandwidth
Network is the bottleneck	Critical — any reduction is valuable

Bloom Filter Optimization

In practice, distributed systems often use Bloom filters instead of sending exact join keys:

Build a Bloom filter from CUSTOMER.CustID
Send the (compact) filter to Node B
Node B filters ORDER using the filter (with small false positive rate)
Transfer filtered orders

Bloom filters are much smaller than distinct key sets, further reducing network cost.

Implementation Strategies

Semi-join can be implemented more efficiently than regular join by exploiting its "existence only" semantics. Query optimizers implement several strategies:

1. Early Termination:

Once a tuple from R finds ONE match in S, searching can stop for that tuple. This is the key efficiency gain.

semi_join_algorithm.pseudo

Pseudocode

ALGORITHM SemiJoinNestedLoop(R, S, condition)
────────────────────────────────────────────────────────────
result ← ∅
 
FOR EACH tuple r IN R:
    found ← FALSE
    FOR EACH tuple s IN S:
        IF condition(r, s) THEN
            found ← TRUE
            BREAK                    // ← KEY: Stop after first match!
        END IF
    END FOR
    IF found THEN
        result ← result ∪ {r}
    END IF
END FOR
 
RETURN result
 
COMPARISON TO REGULAR JOIN:
────────────────────────────────────────────────────────────
Regular Join: Must find ALL matches (no BREAK)
Semi-Join: Stops at FIRST match (early termination)
 
If R tuple matches 1000 S tuples:
  Regular: Processes all 1000
  Semi: Processes 1 and stops

2. Hash-Based Semi-Join:

Build a hash table on S's join column, then probe for each R tuple. If probe finds ANY match, output R tuple.

hash_semi_join.pseudo

Pseudocode

ALGORITHM HashSemiJoin(R, S, r_attr, s_attr)
────────────────────────────────────────────────────────────
// Build phase: Create hash set from S's join values
joinValues ← ∅ (hash set)
FOR EACH tuple s IN S:
    ADD s[s_attr] TO joinValues
 
// Probe phase: Check R tuples for membership
result ← ∅
FOR EACH tuple r IN R:
    IF r[r_attr] IN joinValues THEN
        result ← result ∪ {r}
    END IF
 
RETURN result
 
KEY INSIGHT:
────────────────────────────────────────────────────────────
We only need S's JOIN COLUMN VALUES, not full S tuples!
The hash set is smaller and lookups are O(1).
 
Space: O(|distinct S join values|) vs O(|S|) for regular join

Semi-Join Optimization Opportunities

•Early termination — Stop searching after first match per R tuple
•Hash set instead of hash table — Store only values, not full S tuples
•Bloom filter pre-filtering — Quick rejection before detailed probe
•Index-only scans — May not need to access S's data pages at all
•Parallel probing — Each R tuple's existence check is independent
•Materialized join values — Precompute and cache distinct S values

Query Plan Recognition

In EXPLAIN output, look for: • 'Semi Join' or 'SemiJoin' node types • 'Exists' operators • 'IN' with subquery that's optimized to semi-join • Hash set build instead of full hash table

If you see 'Join + Distinct' for what should be a semi-join, the optimizer may not have recognized the pattern—consider rewriting with EXISTS.

Use Cases and Query Patterns

Semi-join and anti-join appear in numerous practical query patterns. Recognizing these patterns helps write efficient queries.

semi_anti_patterns.sql
SQL Patterns
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
-- ════════════════════════════════════════════════════════════════
-- Pattern 1: Entity Filtering by Related Data (Semi-Join)
-- ════════════════════════════════════════════════════════════════
 
-- Customers who bought a specific product
SELECT c.*
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    JOIN order_items oi ON o.order_id = oi.order_id
    WHERE o.customer_id = c.customer_id
      AND oi.product_id = 'P001'
);
 
-- Users who logged in this month
SELECT u.*
FROM users u
WHERE EXISTS (
    SELECT 1
    FROM login_events e
    WHERE e.user_id = u.user_id
      AND e.login_time >= DATE_TRUNC('month', CURRENT_DATE)
);
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 2: Finding Missing Data (Anti-Join)
-- ════════════════════════════════════════════════════════════════
 
-- Customers without any orders in the last year
SELECT c.*
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
      AND o.order_date >= CURRENT_DATE - INTERVAL '1 year'
);
 
-- Products not in any active promotions
SELECT p.*
FROM products p
WHERE NOT EXISTS (
    SELECT 1
    FROM promotions pr
    WHERE pr.product_id = p.product_id
      AND pr.end_date >= CURRENT_DATE
);
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 3: Set Membership Checks (Semi-Join)
-- ════════════════════════════════════════════════════════════════
 
-- Orders from VIP customers
SELECT o.*
FROM orders o
WHERE EXISTS (
    SELECT 1
    FROM vip_customers v
    WHERE v.customer_id = o.customer_id
);
 
-- Equivalent but often less efficient:
SELECT o.*
FROM orders o
WHERE o.customer_id IN (SELECT customer_id FROM vip_customers);
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 4: Data Quality Checks (Anti-Join)
-- ════════════════════════════════════════════════════════════════
 
-- Orders with invalid customer references
SELECT o.*
FROM orders o
WHERE NOT EXISTS (
    SELECT 1
    FROM customers c
    WHERE c.customer_id = o.customer_id
);
 
-- Employees assigned to non-existent departments
SELECT e.*
FROM employees e
WHERE NOT EXISTS (
    SELECT 1
    FROM departments d
    WHERE d.dept_id = e.dept_id
);
 
 
-- ════════════════════════════════════════════════════════════════
-- Pattern 5: Conditional Inclusion (Combined)
-- ════════════════════════════════════════════════════════════════
 
-- Active products with stock, regardless of sales history
SELECT p.product_id, p.name,
    CASE 
        WHEN EXISTS (SELECT 1 FROM sales s WHERE s.product_id = p.product_id)
        THEN 'Has Sales'
        ELSE 'No Sales Yet'
    END AS sales_status
FROM products p
WHERE p.is_active = TRUE
  AND p.stock_quantity > 0;

Semi-Join for Deduplication

When you need 'distinct' entities that have related records, semi-join naturally provides deduplication:

• Regular join: might produce duplicates requiring DISTINCT • Semi-join: inherently produces each entity at most once

This makes EXISTS ideal for 'list all X that have at least one Y' queries.

Algebraic Properties of Semi-Join

Understanding semi-join's algebraic properties enables query optimization and transformation.

Key Properties:

Algebraic Properties of Semi-Join
Property	Statement	Notes
Idempotence	R ⋉ R = R	Semi-join with self is identity
Non-Commutativity	R ⋉ S ≠ S ⋉ R (usually)	Result has R's schema, not S's
Distribution over Union	R ⋉ (S ∪ T) = (R ⋉ S) ∪ (R ⋉ T)	Enables parallel processing
Absorption	(R ⋈ S) ⋉ T = R ⋈ (S ⋉ T) (when T's attrs ⊂ S)	Useful for optimization
Partition with Anti-Join	(R ⋉ S) ∪ (R ▷ S) = R	Semi and anti perfectly partition R
Disjoint with Anti-Join	(R ⋉ S) ∩ (R ▷ S) = ∅	Semi and anti are mutually exclusive

Relationship to Other Operations:

R ⋉ S = π_R(R ⋈ S)           // Definition via natural join
R ⋉ S = R - (R ▷ S)          // Via anti-join complement
R ▷ S = R - (R ⋉ S)          // Anti-join via semi-join
R ⋈ S = (R ⋉ S) ⋈ S          // Join can use semi-join as filter

Optimization Transformations:

-- Original (may produce duplicates, less efficient)
SELECT DISTINCT c.* 
FROM customers c 
JOIN orders o ON c.id = o.customer_id;

-- Optimized (semi-join semantics)
SELECT c.* 
FROM customers c 
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id);

Optimizers recognize the first pattern and may transform it to the second internally.

Query Rewriting for Performance

If you see slow 'SELECT DISTINCT... JOIN...' queries where only one table's columns are selected, rewrite using EXISTS:

SELECT DISTINCT a.* FROM a JOIN b ON... → SELECT a.* FROM a WHERE EXISTS(...)

The EXISTS version avoids generating duplicates entirely.

Summary: Semi-Join and Anti-Join Mastery

We've thoroughly explored semi-join and anti-join—essential filtering operations that answer existence questions without widening results. Let's consolidate the key concepts:

Key Takeaways

•Semi-join filters by match existence — R ⋉ S returns R tuples that have at least one match in S, with R's schema only.
•No duplicates from multiple matches — Unlike regular join, each R tuple appears at most once.
•Anti-join is the complement — R ▷ S returns R tuples with NO match in S.
•SQL uses EXISTS and NOT EXISTS — No SEMI JOIN keyword, but EXISTS patterns implement the semantics.
•Beware NOT IN with NULLs — NOT EXISTS is safer and typically faster for anti-join.
•Early termination is key optimization — Stop searching after first match per tuple.
•Crucial for distributed systems — Semi-join reduction minimizes data transfer across nodes.
•Semi + Anti partition R — Together they cover all of R with no overlap.

Module Complete: Join Operations Mastery

With semi-join and anti-join understood, we've completed our journey through the join operation family:

Theta Join — The general form with arbitrary conditions
Equijoin — Theta join with equality conditions only (most common)
Natural Join — Implicit matching on shared attribute names
Outer Joins — Preserving non-matching tuples with NULL padding
Semi-Join & Anti-Join — Existence filtering without widening

Module Complete

5 / 5