Hash Sets vs Hash Maps

Level: Intermediate

Duration: 55 mins

Topic: Hash Sets vs Hash Maps

Hash Set — Just Keys, No Values

The Power of Presence Without Payload

In the world of hash-based data structures, we often focus on the key-value paradigm—a dictionary-like model where every key is associated with some meaningful data. But there exists a simpler, equally powerful abstraction: the hash set. A hash set answers one question, and one question only: Is this element present?

This seemingly simple question—membership testing—forms the backbone of countless algorithms, from duplicate detection to graph traversal, from spell checking to database query optimization. Understanding hash sets deeply means understanding when simplicity is not just sufficient, but optimal.

In this comprehensive exploration, we will dissect the hash set from first principles, examining its mathematical foundations, implementation mechanics, and the subtle but critical distinction between storing keys alone versus key-value pairs.

What You Will Master

By the end of this page, you will understand: the formal definition and mathematical properties of sets, how hash sets implement set semantics with O(1) operations, why storing keys without values is a deliberate design choice (not a limitation), the internal representation and memory implications, and how to recognize problems that demand sets over maps.

The Mathematical Foundation of Sets

Before diving into hash sets as a data structure, we must first understand sets as a mathematical concept. This foundation is essential because a hash set is simply an efficient implementation of the mathematical set abstraction.

What Is a Mathematical Set?

A set is a well-defined collection of distinct objects, considered as an object in its own right. The objects in a set are called elements or members. The defining properties of a set are:

Distinctness: No element appears more than once. The set {1, 2, 2, 3} is identical to {1, 2, 3}.
Unordered nature: The set {a, b, c} is identical to {c, a, b}. Order has no meaning.
Membership is binary: An element is either in the set or not. There is no "how much" or "how many times."

This third property is crucial: a set tracks presence, not quantity or association. This is fundamentally different from a map, which associates keys with values.
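As a quick illustration, the three properties can be observed with Python's built-in set type (used here only as a convenient stand-in for the mathematical abstraction):

```python
# Distinctness: duplicates collapse into a single element.
with_duplicates = {1, 2, 2, 3}
assert with_duplicates == {1, 2, 3}
assert len(with_duplicates) == 3

# Unordered nature: sets with the same members are equal,
# regardless of the order in which the members were written.
letters_a = {"a", "b", "c"}
letters_b = {"c", "a", "b"}
assert letters_a == letters_b

# Binary membership: an element is either present or absent.
has_two = 2 in with_duplicates      # True
has_seven = 7 in with_duplicates    # False
print(has_two, has_seven)
```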

The Five Fundamental Set Operations

•Membership Test (∈): Is element x in set S? Written as x ∈ S. This is the most fundamental operation and the one hash sets optimize for.
•Insertion: Add element x to set S. If x already exists, the set remains unchanged (idempotent operation).
•Deletion: Remove element x from set S. If x doesn't exist, the set remains unchanged.
•Union (∪): Create a new set containing all elements from both S₁ and S₂. S₁ ∪ S₂ = {x : x ∈ S₁ or x ∈ S₂}.
•Intersection (∩): Create a new set containing only elements present in both S₁ and S₂. S₁ ∩ S₂ = {x : x ∈ S₁ and x ∈ S₂}.

Additional Set Operations

Beyond the five fundamental operations, sets support several derived operations:

Difference (S₁ \ S₂): Elements in S₁ but not in S₂
Symmetric Difference (S₁ △ S₂): Elements in exactly one of S₁ or S₂
Subset Test (⊆): Is every element of S₁ also in S₂?
Superset Test (⊇): Does S₁ contain every element of S₂?
Cardinality (|S|): The number of elements in the set
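
The derived operations map onto Python's set operators roughly as follows. This is a small sketch with made-up example values, not a complete tour of the API:

```python
s1 = {1, 2, 3, 4}
s2 = {3, 4, 5}

difference = s1 - s2            # elements in s1 but not in s2
symmetric = s1 ^ s2             # elements in exactly one of the two sets
is_subset = {3, 4} <= s1        # is every element of {3, 4} also in s1?
is_superset = s1 >= {1, 2}      # does s1 contain every element of {1, 2}?
cardinality = len(s1)           # number of elements in s1

print(difference, symmetric, is_subset, is_superset, cardinality)
```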

Why Sets Matter in Computing

Sets appear everywhere in computing because many problems are fundamentally about presence, not association:

Visited tracking in graph traversal: Have we seen this node? (Yes/No)
Duplicate detection: Have we encountered this value before? (Yes/No)
Spell checking dictionary: Is this a valid word? (Yes/No)
Approximate membership (Bloom filters): Might we have seen this value? (Maybe Yes/Definitely No)
Database constraint validation: Is this primary key already taken? (Yes/No)

In all these cases, we don't need to store what is associated with the key—we only need to know whether the key exists.
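
To make the "presence only" idea concrete, here is a minimal sketch of visited tracking in a breadth-first traversal. The graph, the node names, and the function name `reachable` are illustrative assumptions:

```python
from collections import deque


def reachable(graph: dict, start) -> set:
    """Return every node reachable from start using breadth-first search."""
    visited = {start}              # we only record presence, nothing else
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor not in visited:   # O(1) average membership test
                visited.add(neighbor)
                queue.append(neighbor)
    return visited


# Hypothetical adjacency list; "E" is disconnected from the rest.
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B"], "E": []}
print(reachable(graph, "A"))
```

The visited set never needs a value per node: the only question the traversal asks is whether a node has been seen.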

Set Theory as Computer Science Foundation

Set theory, formalized by Georg Cantor in the 1870s, became the foundation of modern mathematics. In computer science, set operations map directly to bitwise operations (union = OR, intersection = AND), database query operations (JOIN, WHERE IN), and type systems (union types, intersection types). Understanding sets mathematically enriches your understanding of computational abstractions.
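
The bitwise correspondence can be sketched by encoding a set of small non-negative integers as a bitmask, where bit i is set when i is a member. The helper names `to_mask` and `from_mask` are assumptions made for this example:

```python
def to_mask(elements) -> int:
    """Encode a collection of small non-negative integers as a bitmask."""
    mask = 0
    for e in elements:
        mask |= 1 << e
    return mask


def from_mask(mask: int) -> set:
    """Decode a bitmask back into the set of positions whose bit is 1."""
    return {i for i in range(mask.bit_length()) if (mask >> i) & 1}


a = to_mask({0, 2, 3})   # 0b01101
b = to_mask({2, 4})      # 0b10100

union = from_mask(a | b)          # bitwise OR acts as set union
intersection = from_mask(a & b)   # bitwise AND acts as set intersection
print(union, intersection)
```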

From Mathematical Sets to Hash Sets

A mathematical set is an abstraction. To use sets in a computer program, we need a concrete implementation. The question becomes: how do we implement set operations efficiently?

Naive Implementations and Their Limitations

Let's consider how we might implement a set without hashing:

Unsorted Array/List:

Membership test: O(n) — must scan entire collection
Insertion: O(n) — must first check for duplicates
Deletion: O(n) — must find element, then shift

Sorted Array:

Membership test: O(log n) — binary search
Insertion: O(n) — binary search + shift elements
Deletion: O(n) — binary search + shift elements

Balanced Binary Search Tree (BST):

Membership test: O(log n)
Insertion: O(log n)
Deletion: O(log n)
Bonus: Elements remain sorted

A balanced BST (like a red-black tree or AVL tree) provides O(log n) for all operations, which is respectable. But for sets where we don't need ordering, we can do better.

Time Complexity Comparison: Set Implementations
Implementation	Membership Test	Insertion	Deletion	Space
Unsorted Array	O(n)	O(n)	O(n)	O(n)
Sorted Array	O(log n)	O(n)	O(n)	O(n)
Balanced BST	O(log n)	O(log n)	O(log n)	O(n)
Hash Set	O(1) average	O(1) average	O(1) average	O(n)
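
For comparison, the sorted-array row of the table can be sketched with Python's bisect module. The class name `SortedArraySet` is hypothetical; the point is that lookup is a binary search while insertion still has to shift elements:

```python
import bisect


class SortedArraySet:
    """A set backed by a sorted list: O(log n) lookup, O(n) insertion."""

    def __init__(self):
        self._items = []

    def contains(self, x) -> bool:
        i = bisect.bisect_left(self._items, x)   # binary search
        return i < len(self._items) and self._items[i] == x

    def add(self, x) -> bool:
        i = bisect.bisect_left(self._items, x)
        if i < len(self._items) and self._items[i] == x:
            return False                         # already present
        self._items.insert(i, x)                 # shifts later elements: O(n)
        return True


s = SortedArraySet()
added = [s.add(5), s.add(1), s.add(5)]
print(added, s.contains(1), s.contains(4))
```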

The Hash Set Breakthrough

A hash set achieves O(1) average-case complexity for membership, insertion, and deletion by using a hash function to compute array indices directly from element values.

The key insight: instead of searching for an element, we compute where it should be. If we want to know whether element x is in the set, we:

1. Compute hash(x) to get an integer
2. Calculate index = hash(x) % tableSize
3. Look at position index in our array
4. Handle collisions if multiple elements map to the same index

This transforms searching from "scan everything" to "go directly to the answer."
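
The index computation can be sketched as follows. A toy polynomial hash (`toy_hash`, an assumption for this example) is used instead of Python's built-in hash(), because string hashes from the built-in are randomized per process and would make the output unrepeatable:

```python
def toy_hash(text: str) -> int:
    """Deterministic toy hash: a base-31 polynomial over character codes."""
    h = 0
    for ch in text:
        h = h * 31 + ord(ch)
    return h


def bucket_index(text: str, table_size: int) -> int:
    """Map an element to an array index in the range [0, table_size)."""
    return toy_hash(text) % table_size


print(toy_hash("ab"))          # 97 * 31 + 98 = 3105
print(bucket_index("ab", 8))   # 3105 % 8 = 1
```

Real hash functions mix bits far more thoroughly, but the final step is the same: reduce an integer to a valid array index.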

What Gets Stored in a Hash Set?

This is the crucial distinction from a hash map:

Hash Map stores: hash(key) → bucket containing [(key₁, value₁), (key₂, value₂), ...]

Hash Set stores: hash(key) → bucket containing [key₁, key₂, ...]

In a hash set, we store only the keys themselves. There is no value associated with each key. Each bucket (or slot, depending on collision strategy) contains just the elements that hashed to that position.

The Conceptual Clarity

A hash set is conceptually simpler than a hash map: it's a container that either contains an element or doesn't. This simplicity isn't a limitation—it's a feature. When your problem is purely about membership, using a set communicates intent more clearly than using a map with dummy values.

Internal Representation Deep Dive

Understanding how hash sets are represented in memory illuminates both their efficiency and their constraints.

The Underlying Array Structure

At its core, a hash set maintains an array (sometimes called a "bucket array" or "hash table"). Each position in this array is a slot that can hold:

In open addressing: A single element, an empty marker, or a "deleted" tombstone
In separate chaining: A linked list (or other collection) of elements that hashed to this position

Let's visualize both approaches:

hash_set_representations.txt

SEPARATE CHAINING HASH SET
─────────────────────────────────
 
Insert elements: {"apple", "banana", "cherry", "apricot", "blueberry"}
 
Assume hash function produces:
  hash("apple")    % 5 = 2
  hash("banana")   % 5 = 1
  hash("cherry")   % 5 = 4
  hash("apricot")  % 5 = 2  ← Collision with "apple"!
  hash("blueberry") % 5 = 1  ← Collision with "banana"!
 
Bucket Array:
┌─────┬───────────────────────────────────────────┐
│  0  │ NULL                                      │
├─────┼───────────────────────────────────────────┤
│  1  │ "banana" → "blueberry" → NULL             │
├─────┼───────────────────────────────────────────┤
│  2  │ "apple" → "apricot" → NULL                │
├─────┼───────────────────────────────────────────┤
│  3  │ NULL                                      │
├─────┼───────────────────────────────────────────┤
│  4  │ "cherry" → NULL                           │
└─────┴───────────────────────────────────────────┘
 
Memory layout: We store ONLY the keys. No values at all.
Each node contains: [key data] [next pointer]
 
 
OPEN ADDRESSING HASH SET (Linear Probing)
──────────────────────────────────────────
 
Insert same elements. On collision, probe next slot.
 
Insert order and probing:
  "apple"    → slot 2 (empty, insert)
  "banana"   → slot 1 (empty, insert)
  "cherry"   → slot 4 (empty, insert)
  "apricot"  → slot 2 (occupied), probe to slot 3 (empty, insert)
  "blueberry"→ slot 1 (occupied), probe to slot 2 (occupied),
               probe to slot 3 (occupied), probe to slot 4 (occupied),
               probe to slot 0 (empty, insert)
 
Slot Array:
┌─────┬──────────────┬───────────────────────────┐
│ Idx │    Value     │   Original Hash Target    │
├─────┼──────────────┼───────────────────────────┤
│  0  │ "blueberry"  │ hash target was 1         │
│  1  │ "banana"     │ hash target was 1 ✓       │
│  2  │ "apple"      │ hash target was 2 ✓       │
│  3  │ "apricot"    │ hash target was 2         │
│  4  │ "cherry"     │ hash target was 4 ✓       │
└─────┴──────────────┴───────────────────────────┘
 
Memory layout: Just keys stored directly in array slots.
Each slot contains: [key data] or [empty marker] or [deleted tombstone]
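
The linear-probing walkthrough above can be reproduced with a short simulation. The home slots in `assumed_home` are the same assumed hash results used in the visualization, not values produced by a real hash function:

```python
def linear_probe_insert(slots: list, element, home: int) -> int:
    """Insert element starting at its home slot; return the slot used."""
    index = home
    for _ in range(len(slots)):
        if slots[index] is None:          # empty slot: store here
            slots[index] = element
            return index
        if slots[index] == element:       # already present: nothing to do
            return index
        index = (index + 1) % len(slots)  # occupied: probe the next slot
    raise RuntimeError("table is full")


# Assumed hash(x) % 5 results, copied from the visualization.
assumed_home = {"apple": 2, "banana": 1, "cherry": 4, "apricot": 2, "blueberry": 1}

slots = [None] * 5
for fruit, home in assumed_home.items():
    linear_probe_insert(slots, fruit, home)

print(slots)
# ['blueberry', 'banana', 'apple', 'apricot', 'cherry']
```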

Memory Efficiency: Set vs Map

Because hash sets don't store values, they use less memory per element than hash maps:

Separate Chaining:

Hash Set node: [key data][next pointer]
Hash Map node: [key data][value data][next pointer]

Open Addressing:

Hash Set slot: [key data] or [state flag]
Hash Map slot: [key data][value data] or [state flag]

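The savings depend on how much space the map spends on values. With large values (e.g., objects, strings) the difference is substantial. Even a map filled with dummy values still reserves a value field in every entry, so in the layouts shown above a hash set is the leaner choice for pure membership.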
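
The per-node difference in a separate-chaining layout can be sketched with two small node classes. This assumes CPython, where sys.getsizeof reports the shallow size of an object; the class names are hypothetical, and the numbers say nothing about the memory use of the built-in set and dict types, which use their own internal layouts:

```python
import sys


class SetNode:
    """Chain node for a hash set: element plus next pointer."""
    __slots__ = ("element", "next")

    def __init__(self, element):
        self.element = element
        self.next = None


class MapNode:
    """Chain node for a hash map: key, value, and next pointer."""
    __slots__ = ("key", "value", "next")

    def __init__(self, key, value):
        self.key = key
        self.value = value
        self.next = None


set_node_size = sys.getsizeof(SetNode("apple"))
map_node_size = sys.getsizeof(MapNode("apple", None))  # dummy value still costs a field
print(set_node_size, map_node_size)
```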

The Null Value Pattern: An Anti-Pattern

Some developers, when they need set semantics but only have a hash map available, use a pattern like:

# Anti-pattern: Using map as set with null values
seen = {}
for item in items:
    seen[item] = None  # or True, or 1, or any dummy value

While this works, it's suboptimal:

Wastes memory: Each entry stores an unnecessary value
Obscures intent: The code doesn't clearly communicate "this is a set"
May prevent optimization: The language runtime can't optimize for set-specific patterns

When you need a set, use a set.
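
For contrast, the same bookkeeping written with a real set might look like this (the `items` list is made-up sample data):

```python
items = ["a", "b", "a", "c", "b"]

# Preferred: the type itself states "this is a collection of unique items".
seen = set()
for item in items:
    seen.add(item)   # adding an existing element is a no-op

print(len(seen))
```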

Language Matters

Some older languages or constrained environments may not provide a built-in set type. In such cases, using a map with dummy values is an acceptable workaround; Go, for example, has no built-in set, and the usual idiom is a map[T]struct{} whose values occupy zero bytes. Most modern languages (Python, Java, JavaScript, C++, Rust, etc.) provide dedicated set types. Always prefer the native set type when available.

Hash Set Operations in Detail

Let's examine each core hash set operation with implementation-level detail, understanding exactly what happens under the hood.

The `add(element)` Operation

Purpose: Insert an element into the set if it doesn't already exist.

Algorithm (Separate Chaining):

1. Compute hashCode(element)
2. Calculate index = hashCode % tableSize
3. Traverse the linked list at buckets[index]
4. If element is found, return (already exists, no-op)
5. If not found, prepend element to the list
6. Increment size counter
7. If load factor exceeds threshold, trigger rehashing

Algorithm (Open Addressing - Linear Probing):

1. Compute hashCode(element)
2. Calculate index = hashCode % tableSize
3. While slots[index] is occupied:
   - If slots[index] == element, return (already exists)
   - Increment index = (index + 1) % tableSize
4. Store element at slots[index]
5. Increment size counter
6. If load factor exceeds threshold, trigger rehashing
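
The open-addressing steps can be sketched as a small class. The name `ProbingSet`, the initial capacity of 8, and the maximum load of 0.5 are assumptions for this example; for brevity it checks the load factor before inserting and does not handle deletion or tombstones:

```python
_EMPTY = object()   # sentinel marking a never-used slot


class ProbingSet:
    """Minimal open-addressing hash set with linear probing (no deletion)."""

    def __init__(self, capacity: int = 8, max_load: float = 0.5):
        self._slots = [_EMPTY] * capacity
        self._size = 0
        self._max_load = max_load

    def add(self, element) -> bool:
        # Grow first so the probe loop is guaranteed to find an empty slot.
        if (self._size + 1) / len(self._slots) > self._max_load:
            self._resize()
        index = hash(element) % len(self._slots)
        while self._slots[index] is not _EMPTY:
            if self._slots[index] == element:
                return False                          # already present
            index = (index + 1) % len(self._slots)    # probe next slot
        self._slots[index] = element
        self._size += 1
        return True

    def _resize(self) -> None:
        old = self._slots
        self._slots = [_EMPTY] * (len(old) * 2)
        self._size = 0
        for item in old:
            if item is not _EMPTY:
                self.add(item)                        # re-insert at new index

    def __len__(self) -> int:
        return self._size


s = ProbingSet()
# 1, 9 and 17 all land on slot 1 in a table of 8, forcing probes.
results = [s.add(x) for x in (1, 9, 17, 1)]
print(results, len(s))
```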

Key insight: The "no duplicates" invariant is enforced by checking for existence before insertion. This is why hash sets require elements to be both hashable and comparable for equality.

hash_set_add.py
class HashSet:
    """
    Hash Set implementation using separate chaining.
    Demonstrates the 'keys only, no values' principle.
    """
    
    def __init__(self, initial_capacity: int = 16, load_factor: float = 0.75):
        self._buckets = [None] * initial_capacity
        self._size = 0
        self._load_factor_threshold = load_factor
    
    def add(self, element) -> bool:
        """
        Add an element to the set.
        
        Returns True if element was added (didn't exist before),
        False if element already existed.
        
        Time Complexity: O(1) average, O(n) worst case
        """
        # Step 1: Check if rehashing is needed BEFORE insertion
        if self._size / len(self._buckets) >= self._load_factor_threshold:
            self._rehash()
        
        # Step 2: Compute the bucket index
        index = hash(element) % len(self._buckets)
        
        # Step 3: Traverse the chain to check for duplicates
        current = self._buckets[index]
        while current is not None:
            if current.element == element:
                # Element already exists - set semantics: no duplicates
                return False
            current = current.next
        
        # Step 4: Element not found - insert at head of chain
        # Notice: We only store the element, no associated value!
        new_node = self._Node(element)
        new_node.next = self._buckets[index]
        self._buckets[index] = new_node
        
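        self._size += 1
        return True
    
    def _rehash(self) -> None:
        """
        Double the bucket array and redistribute every element.
        Minimal sketch: a growth factor of 2 is assumed here.
        """
        old_buckets = self._buckets
        self._buckets = [None] * (len(old_buckets) * 2)
        for head in old_buckets:
            current = head
            while current is not None:
                next_node = current.next
                # Re-link the existing node into its new bucket
                index = hash(current.element) % len(self._buckets)
                current.next = self._buckets[index]
                self._buckets[index] = current
                current = next_node
    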
    class _Node:
        """
        Internal node for separate chaining.
        Note: Only stores 'element', not 'element' + 'value'.
        This is the key difference from a HashMap node.
        """
        __slots__ = ('element', 'next')  # Memory optimization
        
        def __init__(self, element):
            self.element = element
            self.next = None

The `contains(element)` / `has(element)` Operation

Purpose: Test whether an element is present in the set.

Algorithm:

1. Compute hashCode(element)
2. Calculate index = hashCode % tableSize
3. Search for element in buckets[index] (chain or probe sequence)
4. Return true if found, false otherwise

This is the most frequently used operation in most applications. The O(1) average complexity is why hash sets are preferred for membership testing.
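
A membership test over a chained table can be sketched in a few lines. Here each bucket is modeled as a plain Python list rather than a linked list, and the function name `contains` and the sample values are assumptions for the example:

```python
def contains(buckets: list, element) -> bool:
    """Return True if element is stored in the bucket it hashes to."""
    index = hash(element) % len(buckets)
    # Only one bucket is inspected, never the whole table.
    return any(stored == element for stored in buckets[index])


# Build a tiny table of 4 buckets by hand; 3 and 7 share bucket 3.
buckets = [[] for _ in range(4)]
for value in (3, 7, 10):
    buckets[hash(value) % len(buckets)].append(value)

print(contains(buckets, 7), contains(buckets, 11))
```

Looking up 11 inspects bucket 3, finds only 3 and 7 there, and reports absence without touching the other buckets.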

The `remove(element)` Operation

Purpose: Remove an element from the set if it exists.

Algorithm (Separate Chaining):

1. Compute hashCode(element)
2. Calculate index = hashCode % tableSize
3. Traverse the linked list, maintaining a previous pointer
4. If found, unlink the node from the chain
5. Decrement size counter
6. Return true if removed, false if not found

Algorithm (Open Addressing):

1. Find the element using the probing sequence
2. If found, mark the slot with a DELETED tombstone
3. Decrement size counter

The tombstone is necessary because simply emptying the slot would break the probe sequence for other elements.
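
The need for tombstones can be demonstrated with a small sketch. The sentinels `EMPTY` and `DELETED`, the helper names, and the home slot of 2 for both fruits are assumptions for this example:

```python
EMPTY = object()     # slot has never held an element
DELETED = object()   # tombstone: slot held an element that was removed


def find_slot(slots: list, element, home: int):
    """Follow the probe sequence from home; return the slot index or None."""
    index = home
    for _ in range(len(slots)):
        if slots[index] is EMPTY:
            return None                      # a truly empty slot ends the search
        if slots[index] is not DELETED and slots[index] == element:
            return index
        index = (index + 1) % len(slots)     # tombstones are skipped, not stops
    return None


def remove(slots: list, element, home: int) -> bool:
    index = find_slot(slots, element, home)
    if index is None:
        return False
    slots[index] = DELETED                   # mark, do not empty
    return True


# "apple" and "apricot" share home slot 2; "apricot" was probed to slot 3.
slots = [EMPTY, EMPTY, "apple", "apricot", EMPTY]
remove(slots, "apple", home=2)
with_tombstone = find_slot(slots, "apricot", home=2)

# Emptying the slot instead would cut the probe sequence short.
broken = [EMPTY, EMPTY, EMPTY, "apricot", EMPTY]
without_tombstone = find_slot(broken, "apricot", home=2)

print(with_tombstone, without_tombstone)
```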

The Equality Contract

For hash sets to work correctly, elements must satisfy the equality contract: if a.equals(b), then hash(a) == hash(b). Violating this invariant causes elements to become 'lost'—they'll be inserted but never found, because we search in the wrong bucket. This is why custom objects used in hash sets must implement both hashCode() and equals() consistently.
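
In Python, the same contract is expressed through __eq__ and __hash__. A minimal sketch with a hypothetical `Point` class, deriving both methods from the same fields so that equal objects always hash alike:

```python
class Point:
    def __init__(self, x: int, y: int):
        self.x = x
        self.y = y

    def __eq__(self, other) -> bool:
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self) -> int:
        # Built from exactly the fields that __eq__ compares.
        return hash((self.x, self.y))


points = {Point(1, 2), Point(1, 2), Point(3, 4)}
print(len(points), Point(1, 2) in points)
```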

Immutability & Hashability Requirements

A critical aspect of hash sets that catches many developers off guard is the immutability requirement for set elements.

The Problem: Mutable Elements in Sets

Consider this scenario:

# Dangerous: Mutable element in a set
class Person:
    def __init__(self, name):
        self.name = name
    
    def __hash__(self):
        return hash(self.name)
    
    def __eq__(self, other):
        return isinstance(other, Person) and self.name == other.name

person = Person("Alice")
people = {person}  # Add to set

# Later...
person.name = "Bob"  # Mutate the object!

# Now we have a problem:
print(Person("Alice") in people)  # False - right bucket, but the stored object now says "Bob"
print(Person("Bob") in people)    # False - wrong bucket!
print(person in people)           # False in practice - looked up under the new hash, so the object is "lost"

What happened? The person object was hashed based on "Alice" and placed in the corresponding bucket. When we changed the name to "Bob", the hash value changed, but the object's position in the set didn't. Now:

Searching for "Alice" looks in the right bucket but finds an object that says "Bob"
Searching for "Bob" looks in the wrong bucket
The object is effectively lost in the set

The Cardinal Rule

Never mutate an object while it is in a hash set (or used as a hash map key). The hash value computed at insertion time becomes stale, causing the object to become unfindable. Either use immutable objects, or remove the object before modification and re-add it after.
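
The remove, modify, re-add pattern can be sketched like this. The `Person` class mirrors the one in the scenario above, and the helper name `rename` is an assumption for the example:

```python
class Person:
    def __init__(self, name: str):
        self.name = name

    def __hash__(self) -> int:
        return hash(self.name)

    def __eq__(self, other) -> bool:
        return isinstance(other, Person) and self.name == other.name


def rename(people: set, person: Person, new_name: str) -> None:
    """Safely change a field that participates in the hash."""
    people.remove(person)      # remove while the old hash is still valid
    person.name = new_name     # mutate outside the set
    people.add(person)         # re-insert under the new hash


alice = Person("Alice")
people = {alice}
rename(people, alice, "Bob")

print(Person("Bob") in people, Person("Alice") in people, alice in people)
```

An alternative is to make the type immutable in the first place, for example with @dataclass(frozen=True), so that no field can change after construction.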

What Makes a Type Hashable?

Not all types can be used in a hash set. The requirements are:

1. The type must implement a hash function

This function must:

Return an integer (the hash code)
Be deterministic (same input → same output)
Be consistent with equality (equal objects → equal hash codes)

2. The type must implement equality comparison

This is needed because:

Hash collisions occur (different objects, same hash)
We must be able to distinguish colliding elements

3. Ideally, the type should be immutable

Immutability guarantees that the hash value never changes, preventing the "lost object" problem.

Language-Specific Hashability Rules

Python:

Built-in immutable types are hashable: int, float, str, tuple (if contents are hashable), frozenset
Mutable types are NOT hashable by default: list, dict, set
Custom classes are hashable by default (based on object identity), but should override both __hash__ and __eq__ for value-based equality; a class that defines __eq__ without __hash__ becomes unhashable
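
These Python rules can be checked directly by calling hash() and catching the TypeError it raises for unhashable objects. The helper name `is_hashable` is an assumption for this sketch:

```python
def is_hashable(obj) -> bool:
    """Return True if obj can be used as a set element or dict key."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


results = {
    "tuple of ints": is_hashable((1, 2)),
    "list": is_hashable([1, 2]),
    "tuple containing a list": is_hashable((1, [2])),
    "frozenset": is_hashable(frozenset({1, 2})),
}
for label, ok in results.items():
    print(f"{label}: {ok}")
```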

Java:

Primitive wrappers and String are hashable: Integer, Double, String, etc.
Custom classes must override hashCode() and equals() for use in HashSet
Using mutable objects as keys is allowed but dangerous

JavaScript:

Set uses the SameValueZero algorithm (like ===, except that NaN is considered equal to NaN)
Only primitives and object references are directly usable
Object equality is by reference, not by value

Hashability by Language and Type
Type Category	Python	Java	JavaScript
Integers	✓ Hashable	✓ Hashable	✓ Usable in Set
Strings	✓ Hashable	✓ Hashable	✓ Usable in Set
Lists/Arrays	✗ Not hashable	✗ Not recommended	By reference only
Tuples	✓ If contents hashable	N/A (no tuples)	N/A (use arrays)
Custom Objects	Override hash/eq	Override hashCode/equals	By reference only
Frozen/Immutable	frozenset ✓	Immutable wrappers ✓	Object.freeze (still by ref)

Set-Specific Operations: Union, Intersection, Difference

One of the powerful features of sets (that maps don't naturally provide) is set algebra—operations that combine or compare sets in mathematically meaningful ways. These operations are built into most set implementations.

Union: Combining Two Sets

Mathematical definition: A ∪ B = {x : x ∈ A or x ∈ B}

Implementation: Create a new set containing all elements from both input sets.

set1 = {1, 2, 3}
set2 = {3, 4, 5}

# Using operator
result = set1 | set2  # {1, 2, 3, 4, 5}

# Using method
result = set1.union(set2)  # {1, 2, 3, 4, 5}

Time Complexity: O(|A| + |B|) — we must examine every element from both sets.

Intersection: Common Elements

Mathematical definition: A ∩ B = {x : x ∈ A and x ∈ B}

Implementation: Create a new set containing only elements present in both.

set1 = {1, 2, 3, 4}
set2 = {3, 4, 5, 6}

result = set1 & set2  # {3, 4}
result = set1.intersection(set2)  # {3, 4}

Time Complexity: O(min(|A|, |B|)) — iterate over the smaller set, check membership in the larger.
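
The "iterate over the smaller set" strategy can be written out by hand. This is a sketch of the idea, not the built-in implementation:

```python
def intersection(a: set, b: set) -> set:
    """Intersect two sets by scanning the smaller and probing the larger."""
    small, large = (a, b) if len(a) <= len(b) else (b, a)
    # One O(1) average membership test per element of the smaller set.
    return {x for x in small if x in large}


result = intersection({1, 2, 3, 4}, {3, 4, 5, 6})
print(result)
```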

Difference: Elements in One But Not the Other

Mathematical definition: A \ B = {x : x ∈ A and x ∉ B}

set1 = {1, 2, 3, 4}
set2 = {3, 4, 5, 6}

result = set1 - set2  # {1, 2}
result = set1.difference(set2)  # {1, 2}

Time Complexity: O(|A|) — must examine every element in the first set.

Symmetric Difference: Exclusive Elements

Mathematical definition: A △ B = {x : x ∈ A XOR x ∈ B}

set1 = {1, 2, 3}
set2 = {3, 4, 5}

result = set1 ^ set2  # {1, 2, 4, 5}
result = set1.symmetric_difference(set2)  # {1, 2, 4, 5}

Time Complexity: O(|A| + |B|)

Why Maps Don't Have These Operations

Maps (dictionaries) don't naturally support union or intersection because the semantics are ambiguous: if two maps have the same key with different values, which value should the union contain? Sets avoid this problem entirely because there are no values—just presence or absence.
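
The ambiguity shows up as soon as two dictionaries share a key. In this sketch (with made-up configuration data) the merged value depends entirely on which operand is allowed to win, whereas the key sets combine without any such choice:

```python
defaults = {"theme": "light", "lang": "en"}
overrides = {"theme": "dark"}

# Two equally plausible "unions" that disagree on the shared key.
right_wins = {**defaults, **overrides}
left_wins = {**overrides, **defaults}

# The keys alone form ordinary sets, so their intersection is unambiguous.
shared_keys = defaults.keys() & overrides.keys()

print(right_wins["theme"], left_wins["theme"], shared_keys)
```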

set_operations.py
# Practical example: Finding common and exclusive skills
 
# Skills possessed by team members
alice_skills = {"Python", "SQL", "Machine Learning", "Docker"}
bob_skills = {"Java", "SQL", "Kubernetes", "Docker"}
carol_skills = {"Python", "JavaScript", "React", "Docker"}
 
# All skills across the team (Union)
all_skills = alice_skills | bob_skills | carol_skills
print(f"Team's combined skills: {all_skills}")
# {'Python', 'SQL', 'Machine Learning', 'Docker', 'Java', 'Kubernetes', 'JavaScript', 'React'}
 
# Skills everyone has (Intersection)
common_skills = alice_skills & bob_skills & carol_skills
print(f"Skills everyone has: {common_skills}")
# {'Docker'}
 
# Skills only Alice has (Difference)
alice_unique = alice_skills - bob_skills - carol_skills
print(f"Alice's unique skills: {alice_unique}")
# {'Machine Learning'}
 
# Skills that exactly one person has (requires multiple operations)
pairwise_shared = (alice_skills & bob_skills) | (bob_skills & carol_skills) | (alice_skills & carol_skills)
unique_skills = all_skills - pairwise_shared
print(f"Skills held by exactly one person: {unique_skills}")
# {'Machine Learning', 'Java', 'Kubernetes', 'JavaScript', 'React'}
 
# Subset checking
beginner_skills = {"Docker"}
print(f"Is beginner_skills subset of alice_skills? {beginner_skills <= alice_skills}")
# True

When Sets Are the Right Choice

Recognizing when to use a set versus a map is a fundamental skill. Here's a framework for making the right choice.

Use a Hash Set When:

The fundamental question is "Is X present?"

If you only need to know whether something exists—without needing to retrieve associated data—a set is appropriate.

Examples:

Tracking visited nodes in graph traversal
Checking if a word is in a dictionary
Detecting duplicates in a stream
Maintaining a blacklist or whitelist
Implementing set algebra (union, intersection, etc.)

You need to enforce uniqueness without additional data

Sets naturally enforce uniqueness. If you're storing items where duplicates are meaningless, a set documents this constraint.

Memory efficiency matters

When storing millions of elements, the memory saved by not storing values can be significant.

You want clear, self-documenting code

Using if item in my_set is clearer than if item in my_dict when you don't care about the value.

Classic Set Use Cases

•Duplicate Detection: Given a list, find/remove duplicates. Set naturally deduplicates.
•Membership Testing: Is this user in the admin group? Is this IP address blacklisted?
•Graph Algorithms: Track visited vertices in BFS/DFS to prevent cycles.
•Two-Sum-like Problems: Record values seen so far in a single pass, and check whether each new value's complement has already been seen.
•Set Operations: Finding common elements, unique elements, differences between groups.
•Constraint Validation: Ensuring primary keys are unique, usernames aren't taken.
•Bloom Filter Seeds: Run an initial exact-match check against a set before falling back to probabilistic checking.
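
As one worked instance of these use cases, here is a sketch of the single-pass complement search. The function name `has_pair_with_sum` and the sample inputs are assumptions for the example:

```python
def has_pair_with_sum(numbers, target) -> bool:
    """Return True if two elements at different positions add up to target."""
    seen = set()
    for n in numbers:
        if target - n in seen:   # has the complement appeared earlier?
            return True
        seen.add(n)
    return False


print(has_pair_with_sum([3, 8, 11, 5], 13))   # 8 + 5
print(has_pair_with_sum([1, 2, 3], 7))
```

A set suffices here because only a yes/no answer is needed; returning the positions of the pair would require a map from value to index.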

When NOT to Use a Set (Use a Map Instead)

You need to associate data with each key

If you're tracking not just presence but also information about each key, you need a map.

User ID → User profile data
Word → Definition
Node → Distance from source
Cache key → Cached value

You need to count occurrences

If you need to know how many times something appears, use a map where values are counts.

# Counting occurrences - need a map, not a set
word_counts = {}  # word → count
for word in document:
    word_counts[word] = word_counts.get(word, 0) + 1

You need key-value semantics

Configuration settings, environment variables, HTTP headers—these are naturally key-value pairs.

Use a Set

•"Have I seen this before?"
•"Is this in the allowed list?"
•"What's unique in this collection?"
•"What do these groups have in common?"
•"Remove duplicates from this list"

Use a Map

•"What's the value associated with this key?"
•"How many times does this appear?"
•"Store this configuration setting"
•"Cache this computation result"
•"Map user IDs to user names"

Summary and Key Takeaways

We have taken a deep dive into hash sets—their mathematical foundations, implementation mechanics, and practical usage patterns. Let's consolidate the key insights:

Core Concepts Mastered

•Sets are about presence, not association: A set answers "Is X here?" with no additional data. This is fundamentally different from a map's "What is associated with X?"
•Hash sets implement mathematical sets efficiently: Using hash functions, we achieve O(1) average-case membership testing, insertion, and deletion.
•Hash sets store only keys: The internal structure contains no value field, saving memory and clarifying semantics.
•Elements must be hashable and preferably immutable: Mutable elements can become 'lost' when their hash value changes after insertion.
•Sets support powerful algebraic operations: Union, intersection, difference, and subset testing are natural operations that maps don't cleanly support.
•Choose sets for membership, maps for association: The defining question is whether you need to retrieve data, not just test existence.

Page Complete

You now deeply understand hash sets as a distinct data structure optimized for membership testing. In the next page, we'll explore hash maps in equal depth—examining how the addition of values changes the structure's capabilities and use cases.

Hash Map slot: [key data][value data] or [state flag]

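The memory savings depend on the size of the value field. A hash map used as a set still pays for at least one reference per entry to hold the dummy value, and if the values are real objects or strings the gap grows further. In implementations that store keys only, dropping the value field across millions of entries can reduce memory use noticeably.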

The Null Value Pattern: An Anti-Pattern

Some developers, when they need set semantics but only have a hash map available, use a pattern like:

# Anti-pattern: Using map as set with null values
seen = {}
for item in items:
    seen[item] = None  # or True, or 1, or any dummy value

While this works, it's suboptimal:

Wastes memory: Each entry stores an unnecessary value
Obscures intent: The code doesn't clearly communicate "this is a set"
May prevent optimization: The language runtime can't optimize for set-specific patterns

When you need a set, use a set.
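For comparison, the same loop written with Python's built-in set might look like this; `items` is a hypothetical input list.

```python
items = ["a", "b", "a", "c", "b"]  # hypothetical input

seen = set()
for item in items:
    seen.add(item)  # no dummy value needed

# Equivalent one-liner when the items are already in an iterable.
assert seen == set(items)

print(len(seen))  # 3
```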

Language Matters

Hash Set Operations in Detail

Let's examine each core hash set operation with implementation-level detail, understanding exactly what happens under the hood.

The `add(element)` Operation

Purpose: Insert an element into the set if it doesn't already exist.

Algorithm (Separate Chaining):

Compute hashCode(element)
Calculate index = hashCode % tableSize
Traverse the linked list at buckets[index]
If element is found, return (already exists, no-op)
If not found, prepend element to the list
Increment size counter
If load factor exceeds threshold, trigger rehashing

Algorithm (Open Addressing - Linear Probing):

Compute hashCode(element)
Calculate index = hashCode % tableSize
While slots[index] is occupied:
- If slots[index] == element, return (already exists)
- Increment index = (index + 1) % tableSize
Store element at slots[index]
Increment size counter
If load factor exceeds threshold, trigger rehashing
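The linear-probing steps can be sketched as follows. This is a simplified illustration, not a production design: `ProbingSet` is a hypothetical class, the capacity is fixed, and the load-factor rehash is replaced by an error when the table is full. The integers are chosen so that their collision pattern matches the fruit example shown earlier (small non-negative ints hash to themselves in CPython).

```python
class ProbingSet:
    """Fixed-capacity open-addressing set (linear probing, no removal)."""

    _EMPTY = object()  # sentinel marking a never-used slot

    def __init__(self, capacity: int = 5):
        self._slots = [self._EMPTY] * capacity
        self._size = 0

    def add(self, element) -> bool:
        capacity = len(self._slots)
        index = hash(element) % capacity
        for _ in range(capacity):            # visit each slot at most once
            slot = self._slots[index]
            if slot is self._EMPTY:
                self._slots[index] = element
                self._size += 1
                return True
            if slot == element:
                return False                 # already present: no duplicates
            index = (index + 1) % capacity   # collision: probe the next slot
        raise OverflowError("table full; a real set would rehash before this")

    def slots(self) -> list:
        """Return the slot array with None standing in for empty slots."""
        return [None if s is self._EMPTY else s for s in self._slots]


table = ProbingSet(capacity=5)
for value in (2, 1, 4, 7, 6):   # 7 collides with 2, 6 collides with 1
    table.add(value)

print(table.slots())  # [6, 1, 2, 7, 4]
print(table.add(7))   # False
```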

Key insight: The "no duplicates" invariant is enforced by checking for existence before insertion. This is why hash sets require elements to be both hashable and comparable for equality.

hash_set_add.py
class HashSet:
    """
    Hash Set implementation using separate chaining.
    Demonstrates the 'keys only, no values' principle.
    """
    
    def __init__(self, initial_capacity: int = 16, load_factor: float = 0.75):
        self._buckets = [None] * initial_capacity
        self._size = 0
        self._load_factor_threshold = load_factor
    
    def add(self, element) -> bool:
        """
        Add an element to the set.
        
        Returns True if element was added (didn't exist before),
        False if element already existed.
        
        Time Complexity: O(1) average, O(n) worst case
        """
        # Step 1: Check if rehashing is needed BEFORE insertion
        if self._size / len(self._buckets) >= self._load_factor_threshold:
            self._rehash()
        
        # Step 2: Compute the bucket index
        index = hash(element) % len(self._buckets)
        
        # Step 3: Traverse the chain to check for duplicates
        current = self._buckets[index]
        while current is not None:
            if current.element == element:
                # Element already exists - set semantics: no duplicates
                return False
            current = current.next
        
        # Step 4: Element not found - insert at head of chain
        # Notice: We only store the element, no associated value!
        new_node = self._Node(element)
        new_node.next = self._buckets[index]
        self._buckets[index] = new_node
        
        self._size += 1
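        return True
    
    def _rehash(self) -> None:
        """
        Double the bucket array and relink every node into its new bucket.
        Indices change because they depend on len(self._buckets).
        """
        old_buckets = self._buckets
        self._buckets = [None] * (len(old_buckets) * 2)
        for head in old_buckets:
            current = head
            while current is not None:
                next_node = current.next
                index = hash(current.element) % len(self._buckets)
                current.next = self._buckets[index]
                self._buckets[index] = current
                current = next_node
    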
    class _Node:
        """
        Internal node for separate chaining.
        Note: Only stores 'element', not 'element' + 'value'.
        This is the key difference from a HashMap node.
        """
        __slots__ = ('element', 'next')  # Memory optimization
        
        def __init__(self, element):
            self.element = element
            self.next = None

The `contains(element)` / `has(element)` Operation

Purpose: Test whether an element is present in the set.

Algorithm:

Compute hashCode(element)
Calculate index = hashCode % tableSize
Search for element in buckets[index] (chain or probe sequence)
Return true if found, false otherwise

This is the most frequently used operation in most applications. The O(1) average complexity is why hash sets are preferred for membership testing.

The `remove(element)` Operation

Purpose: Remove an element from the set if it exists.

Algorithm (Separate Chaining):

Compute hashCode(element)
Calculate index = hashCode % tableSize
Traverse the linked list, maintaining a previous pointer
If found, unlink the node from the chain
Decrement size counter
Return true if removed, false if not found
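The chained lookup and removal steps can be sketched together. This is a minimal illustration, assuming a fixed bucket count and no rehashing; `ChainedSet` and `_Node` are hypothetical names used only for this example.

```python
class _Node:
    __slots__ = ("element", "next")

    def __init__(self, element, next_node=None):
        self.element = element
        self.next = next_node


class ChainedSet:
    """Separate-chaining set with a fixed bucket count (no rehashing)."""

    def __init__(self, capacity: int = 8):
        self._buckets = [None] * capacity
        self._size = 0

    def _index(self, element) -> int:
        return hash(element) % len(self._buckets)

    def add(self, element) -> bool:
        if self.contains(element):
            return False
        index = self._index(element)
        self._buckets[index] = _Node(element, self._buckets[index])  # prepend
        self._size += 1
        return True

    def contains(self, element) -> bool:
        current = self._buckets[self._index(element)]
        while current is not None:
            if current.element == element:
                return True
            current = current.next
        return False

    def remove(self, element) -> bool:
        index = self._index(element)
        previous, current = None, self._buckets[index]
        while current is not None:
            if current.element == element:
                if previous is None:
                    self._buckets[index] = current.next  # removing the head
                else:
                    previous.next = current.next         # unlink mid-chain
                self._size -= 1
                return True
            previous, current = current, current.next
        return False

    def __len__(self) -> int:
        return self._size


s = ChainedSet(capacity=4)
for value in (1, 5, 9):   # all three land in bucket 1 (value % 4 == 1)
    s.add(value)

print(s.remove(5))                   # True: unlinked from the middle of the chain
print(s.contains(5))                 # False
print(s.contains(1), s.contains(9))  # True True
print(s.remove(5))                   # False: nothing left to remove
```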

Algorithm (Open Addressing):

Find the element using the probing sequence
If found, mark the slot with a DELETED tombstone
Decrement size counter

The tombstone is necessary because simply emptying the slot would break the probe sequence for other elements.
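A short sketch shows why the tombstone matters. It assumes linear probing, a fixed capacity, and no rehashing; `TombstoneSet` is a hypothetical class written only for this illustration.

```python
class TombstoneSet:
    """Fixed-capacity linear-probing set that supports removal."""

    _EMPTY = object()    # slot never used
    _DELETED = object()  # tombstone: slot was used, element since removed

    def __init__(self, capacity: int = 8):
        self._slots = [self._EMPTY] * capacity
        self._size = 0

    def _find(self, element):
        """Return the index of the slot holding element, or None if absent."""
        capacity = len(self._slots)
        index = hash(element) % capacity
        for _ in range(capacity):
            slot = self._slots[index]
            if slot is self._EMPTY:
                return None                  # a truly empty slot ends the probe
            if slot is not self._DELETED and slot == element:
                return index
            index = (index + 1) % capacity   # tombstones are stepped over
        return None

    def contains(self, element) -> bool:
        return self._find(element) is not None

    def add(self, element) -> bool:
        if self.contains(element):
            return False
        capacity = len(self._slots)
        index = hash(element) % capacity
        for _ in range(capacity):
            slot = self._slots[index]
            if slot is self._EMPTY or slot is self._DELETED:
                self._slots[index] = element  # tombstones can be reused
                self._size += 1
                return True
            index = (index + 1) % capacity
        raise OverflowError("table full; a real set would rehash before this")

    def remove(self, element) -> bool:
        index = self._find(element)
        if index is None:
            return False
        self._slots[index] = self._DELETED
        self._size -= 1
        return True


s = TombstoneSet(capacity=8)
for value in (3, 11, 19):   # all hash to slot 3; stored in slots 3, 4, 5
    s.add(value)
s.remove(11)                # slot 4 becomes a tombstone

print(s.contains(19))       # True: the probe walks past the tombstone
print(s.contains(11))       # False
```

Had `remove` reset slot 4 to empty instead, the lookup for 19 would stop at slot 4 and wrongly report the element as missing.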

The Equality Contract

Immutability & Hashability Requirements

A critical aspect of hash sets that catches many developers off guard is the immutability requirement for set elements.

The Problem: Mutable Elements in Sets

Consider this scenario:

# Dangerous: Mutable element in a set
class Person:
    def __init__(self, name):
        self.name = name
    
    def __hash__(self):
        return hash(self.name)
    
    def __eq__(self, other):
        return isinstance(other, Person) and self.name == other.name

person = Person("Alice")
people = {person}  # Add to set

# Later...
person.name = "Bob"  # Mutate the object!

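# Now we have a problem:
print(Person("Alice") in people)  # False - right bucket, but the stored object now compares as "Bob"
print(Person("Bob") in people)    # False - hash("Bob") leads to a different bucket
print(person in people)           # False in CPython - its hash changed, so even the object itself is not found

After the mutation:

Searching for "Alice" looks in the right bucket but finds an object that says "Bob"
Searching for "Bob" looks in the wrong bucket
Searching for the original object recomputes its hash from "Bob", so it also looks in the wrong bucket
The object is effectively lost in the set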
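One way to rule out this failure mode in Python is to make the element immutable. The sketch below assumes a frozen dataclass is acceptable for the `Person` type; it is one option among several, not the only fix.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Person:
    """Immutable value object: __eq__ and __hash__ are generated from name."""
    name: str


person = Person("Alice")
people = {person}

try:
    person.name = "Bob"   # rejected: the instance is frozen
except FrozenInstanceError:
    print("mutation blocked")

print(Person("Alice") in people)  # True
print(person in people)           # True
```

To "rename" a person under this design, remove the old object from the set and add a new one, so the set always sees a consistent hash.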

The Cardinal Rule

What Makes a Type Hashable?

Not all types can be used in a hash set. The requirements are:

1. The type must implement a hash function

This function must:

Return an integer (the hash code)
Be deterministic (same input → same output)
Be consistent with equality (equal objects → equal hash codes)

2. The type must implement equality comparison

This is needed because:

Hash collisions occur (different objects, same hash)
We must be able to distinguish colliding elements

3. Ideally, the type should be immutable

Immutability guarantees that the hash value never changes, preventing the "lost object" problem.

Language-Specific Hashability Rules

Python:

Built-in immutable types are hashable: int, float, str, tuple (if contents are hashable), frozenset
Mutable types are NOT hashable by default: list, dict, set
Custom classes are hashable by default (based on object identity); for value-based equality, override both __eq__ and __hash__ (defining __eq__ alone makes the class unhashable)
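These Python rules are easy to check directly. In the sketch below, `is_hashable`, `Plain`, and `EqOnly` are hypothetical names introduced only for the demonstration.

```python
def is_hashable(value) -> bool:
    """Return True if hash(value) succeeds."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


print(is_hashable(42))              # True
print(is_hashable("text"))          # True
print(is_hashable((1, 2, 3)))       # True
print(is_hashable(frozenset({1})))  # True
print(is_hashable([1, 2, 3]))       # False: list is mutable
print(is_hashable({"k": 1}))        # False: dict is mutable
print(is_hashable((1, [2, 3])))     # False: the tuple contains a list


class Plain:
    pass


class EqOnly:
    def __eq__(self, other):
        return isinstance(other, EqOnly)


print(is_hashable(Plain()))   # True: identity-based hash by default
print(is_hashable(EqOnly()))  # False: defining __eq__ alone sets __hash__ to None
```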

Java:

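Every object inherits hashCode() and equals(); String and the primitive wrappers (Integer, Double, etc.) implement them by value
Custom classes must override both hashCode() and equals() to get value-based equality in a HashSet; otherwise object identity is used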
Using mutable objects as keys is allowed but dangerous

JavaScript:

Set uses the SameValueZero algorithm (like ===, except that NaN is treated as equal to NaN)
Only primitives and object references are directly usable
Object equality is by reference, not by value

Hashability by Language and Type
Type Category	Python	Java	JavaScript
Integers	✓ Hashable	✓ Hashable	✓ Usable in Set
Strings	✓ Hashable	✓ Hashable	✓ Usable in Set
Lists/Arrays	✗ Not hashable	✗ Not recommended	By reference only
Tuples	✓ If contents hashable	N/A (no tuples)	N/A (use arrays)
Custom Objects	Override hash/eq	Override hashCode/equals	By reference only
Frozen/Immutable	frozenset ✓	Immutable wrappers ✓	Object.freeze (still by ref)

Set-Specific Operations: Union, Intersection, Difference

Union: Combining Two Sets

Mathematical definition: A ∪ B = {x : x ∈ A or x ∈ B}

Implementation: Create a new set containing all elements from both input sets.

set1 = {1, 2, 3}
set2 = {3, 4, 5}

# Using operator
result = set1 | set2  # {1, 2, 3, 4, 5}

# Using method
result = set1.union(set2)  # {1, 2, 3, 4, 5}

Time Complexity: O(|A| + |B|) — we must examine every element from both sets.

Intersection: Common Elements

Mathematical definition: A ∩ B = {x : x ∈ A and x ∈ B}

Implementation: Create a new set containing only elements present in both.

set1 = {1, 2, 3, 4}
set2 = {3, 4, 5, 6}

result = set1 & set2  # {3, 4}
result = set1.intersection(set2)  # {3, 4}

Time Complexity: O(min(|A|, |B|)) — iterate over the smaller set, check membership in the larger.
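The "iterate over the smaller set" strategy can be sketched by hand. The `intersection` function below is a hypothetical helper that mirrors the idea; it is not how any particular runtime is guaranteed to implement `&`.

```python
def intersection(a: set, b: set) -> set:
    """Iterate over the smaller set and probe the larger one."""
    smaller, larger = (a, b) if len(a) <= len(b) else (b, a)
    return {x for x in smaller if x in larger}


small = {3, 4, -1}
large = set(range(100_000))

print(intersection(small, large))  # {3, 4}
print(intersection(large, small))  # {3, 4}: argument order does not matter
```

Only three membership tests are performed here, regardless of how large the second set is.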

Difference: Elements in One But Not the Other

Mathematical definition: A \ B = {x : x ∈ A and x ∉ B}

set1 = {1, 2, 3, 4}
set2 = {3, 4, 5, 6}

result = set1 - set2  # {1, 2}
result = set1.difference(set2)  # {1, 2}

Time Complexity: O(|A|) — must examine every element in the first set.

Symmetric Difference: Exclusive Elements

Mathematical definition: A △ B = {x : x ∈ A XOR x ∈ B}

set1 = {1, 2, 3}
set2 = {3, 4, 5}

result = set1 ^ set2  # {1, 2, 4, 5}
result = set1.symmetric_difference(set2)  # {1, 2, 4, 5}

Time Complexity: O(|A| + |B|)
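The stated costs follow from how these operations can be built out of membership tests and insertions. The functions below are hand-written sketches for illustration (hypothetical helpers, checked against Python's built-in operators), not the built-in implementations themselves.

```python
def union(a: set, b: set) -> set:
    result = set(a)   # copy A: O(|A|)
    for x in b:       # insert each element of B: O(|B|)
        result.add(x)
    return result


def difference(a: set, b: set) -> set:
    # One membership test in B per element of A: O(|A|)
    return {x for x in a if x not in b}


def symmetric_difference(a: set, b: set) -> set:
    # Elements in exactly one set: (A \ B) united with (B \ A)
    return union(difference(a, b), difference(b, a))


a, b = {1, 2, 3}, {3, 4, 5}
print(union(a, b) == a | b)                 # True
print(difference(a, b) == a - b)            # True
print(symmetric_difference(a, b) == a ^ b)  # True
```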

Why Maps Don't Have These Operations

set_operations.py
# Practical example: Finding common and exclusive skills
 
# Skills possessed by team members
alice_skills = {"Python", "SQL", "Machine Learning", "Docker"}
bob_skills = {"Java", "SQL", "Kubernetes", "Docker"}
carol_skills = {"Python", "JavaScript", "React", "Docker"}
 
# All skills across the team (Union)
all_skills = alice_skills | bob_skills | carol_skills
print(f"Team's combined skills: {all_skills}")
# {'Python', 'SQL', 'Machine Learning', 'Docker', 'Java', 'Kubernetes', 'JavaScript', 'React'} (print order may vary)
 
# Skills everyone has (Intersection)
common_skills = alice_skills & bob_skills & carol_skills
print(f"Skills everyone has: {common_skills}")
# {'Docker'}
 
# Skills only Alice has (Difference)
alice_unique = alice_skills - bob_skills - carol_skills
print(f"Alice's unique skills: {alice_unique}")
# {'Machine Learning'}
 
# Skills that exactly one person has (requires multiple operations)
pairwise_shared = (alice_skills & bob_skills) | (bob_skills & carol_skills) | (alice_skills & carol_skills)
unique_skills = all_skills - pairwise_shared
print(f"Skills held by exactly one person: {unique_skills}")
# {'Machine Learning', 'Java', 'Kubernetes', 'JavaScript', 'React'}
 
# Subset checking
beginner_skills = {"Docker"}
print(f"Is beginner_skills subset of alice_skills? {beginner_skills <= alice_skills}")
# True

When Sets Are the Right Choice

Recognizing when to use a set versus a map is a fundamental skill. Here's a framework for making the right choice.

Use a Hash Set When:

The fundamental question is "Is X present?"

If you only need to know whether something exists—without needing to retrieve associated data—a set is appropriate.

Examples:

Tracking visited nodes in graph traversal
Checking if a word is in a dictionary
Detecting duplicates in a stream
Maintaining a blacklist or whitelist
Implementing set algebra (union, intersection, etc.)

You need to enforce uniqueness without additional data

Sets naturally enforce uniqueness. If you're storing items where duplicates are meaningless, a set documents this constraint.

Memory efficiency matters

When storing millions of elements, the memory saved by not storing values can be significant.

You want clear, self-documenting code

Using if item in my_set is clearer than if item in my_dict when you don't care about the value.

Classic Set Use Cases

•Duplicate Detection: Given a list, find/remove duplicates. Set naturally deduplicates.
•Membership Testing: Is this user in the admin group? Is this IP address blacklisted?
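•Graph Algorithms: Track visited vertices in BFS/DFS so no vertex is processed twice and cycles cannot cause infinite loops.
•Two-Sum-like Problems: Record the values seen so far in a single pass and check whether each new element's complement is already among them.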
•Set Operations: Finding common elements, unique elements, differences between groups.
•Constraint Validation: Ensuring primary keys are unique, usernames aren't taken.
•Bloom Filter Seeds: Initial exact-match before probabilistic checking.
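Three of these use cases are sketched below with Python's built-in set. The function names and sample data are hypothetical, chosen only to illustrate the patterns.

```python
from collections import deque


def dedupe(items):
    """Duplicate detection: keep the first occurrence of each item, in order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


def has_pair_with_sum(numbers, target) -> bool:
    """Two-sum style check: do two of the values add up to target?"""
    seen = set()
    for n in numbers:
        if target - n in seen:   # complement already encountered
            return True
        seen.add(n)
    return False


def bfs_order(graph, start):
    """Graph traversal: the visited set keeps cycles from looping forever."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return order


graph = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b"]}  # has cycles

print(dedupe([3, 1, 3, 2, 1]))               # [3, 1, 2]
print(has_pair_with_sum([8, 3, 5, 11], 16))  # True (5 + 11)
print(has_pair_with_sum([8, 3, 5, 11], 10))  # False
print(bfs_order(graph, "a"))                 # ['a', 'b', 'c', 'd']
```

In each function the set stores nothing but the elements themselves; no associated value is ever needed.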

When NOT to Use a Set (Use a Map Instead)

You need to associate data with each key

If you're tracking not just presence but also information about each key, you need a map.

User ID → User profile data
Word → Definition
Node → Distance from source
Cache key → Cached value

You need to count occurrences

If you need to know how many times something appears, use a map where values are counts.

# Counting occurrences - need a map, not a set
word_counts = {}  # word → count
for word in document:
    word_counts[word] = word_counts.get(word, 0) + 1

You need key-value semantics

Configuration settings, environment variables, HTTP headers—these are naturally key-value pairs.

Use a Set

•"Have I seen this before?"
•"Is this in the allowed list?"
•"What's unique in this collection?"
•"What do these groups have in common?"
•"Remove duplicates from this list"

Use a Map

•"What's the value associated with this key?"
•"How many times does this appear?"
•"Store this configuration setting"
•"Cache this computation result"
•"Map user IDs to user names"

Summary and Key Takeaways

We have taken a deep dive into hash sets—their mathematical foundations, implementation mechanics, and practical usage patterns. Let's consolidate the key insights:

Core Concepts Mastered

•Sets are about presence, not association: A set answers "Is X here?" with no additional data. This is fundamentally different from a map's "What is associated with X?"
•Hash sets implement mathematical sets efficiently: Using hash functions, we achieve O(1) average-case membership testing, insertion, and deletion.
•Hash sets store only keys: The internal structure contains no value field, saving memory and clarifying semantics.
•Elements must be hashable and preferably immutable: Mutable elements can become 'lost' when their hash value changes after insertion.
•Sets support powerful algebraic operations: Union, intersection, difference, and subset testing are natural operations that maps don't cleanly support.
•Choose sets for membership, maps for association: The defining question is whether you need to retrieve data, not just test existence.
