Duplicate detection is one of the most practical hashing patterns you'll encounter. It's not glamorous, but it's everywhere: cleaning imported records, validating user input, flagging repeated login attempts, and enforcing exactly-once processing in distributed systems.
The core question is simple: Have we seen this before?
Without hash tables, answering this question for n items requires O(n²) time—checking every pair. With hash tables, it's O(n) time. For datasets with billions of records, this isn't just an optimization—it's the difference between possible and impossible.
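To make that contrast concrete, here is a rough sketch (not part of the original lesson) of the quadratic pairwise check that the hash set replaces:

```typescript
// Brute force: compare every pair — O(n²) time, O(1) space
function containsDuplicateBruteForce<T>(elements: T[]): boolean {
  for (let i = 0; i < elements.length; i++) {
    for (let j = i + 1; j < elements.length; j++) {
      if (elements[i] === elements[j]) return true;
    }
  }
  return false;
}

// At n = 1,000,000 this is roughly 5 × 10^11 comparisons,
// versus about 10^6 hash lookups for the set-based version below.
```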
By the end of this page, you will understand duplicate detection in all its forms: finding any duplicate, finding all duplicates, finding the first duplicate, detecting near-duplicates, and handling duplicate detection at scale. You'll learn both the algorithmic patterns and the production considerations.
The simplest duplicate detection asks: Does this collection contain any duplicates?
The hash-based solution uses a set—the fundamental data structure for tracking membership:
Algorithm: scan the collection once, keeping a set of every element seen so far. If the current element is already in the set, report a duplicate; otherwise add it and keep going. If the scan finishes without a hit, there are no duplicates.
This is sometimes called the "seen set" pattern.
```typescript
/**
 * Contains Duplicate: Check if any element appears twice
 * Time: O(n) average, O(n²) worst case (hash collisions)
 * Space: O(n)
 */
function containsDuplicate<T>(elements: T[]): boolean {
  const seen = new Set<T>();

  for (const element of elements) {
    if (seen.has(element)) {
      return true; // Duplicate found!
    }
    seen.add(element);
  }

  return false; // No duplicates
}

// Alternative: One-liner using Set size comparison
// If set size < array length, duplicates must exist
function containsDuplicateOneLiner<T>(elements: T[]): boolean {
  return new Set(elements).size !== elements.length;
}

// Examples
console.log(containsDuplicate([1, 2, 3, 1]));    // true
console.log(containsDuplicate([1, 2, 3, 4]));    // false
console.log(containsDuplicate([]));              // false (empty array)
console.log(containsDuplicate(["a", "b", "a"])); // true (works with strings)
```

In Java and some other languages, Set.add() returns false if the element already exists. This combines the check and the add into a single call—cleaner code and sometimes better performance, since it needs one hash lookup instead of two.
Sometimes we need more than a boolean—we need to identify which elements are duplicates. This requires tracking counts or using multiple sets.
Variant 1: Find All Duplicate Elements
Return a list of all elements that appear more than once:
```typescript
/**
 * Find All Duplicate Elements
 * Returns each duplicate value exactly once
 * Time: O(n), Space: O(n)
 */
function findAllDuplicates<T>(elements: T[]): T[] {
  const seen = new Set<T>();
  const duplicates = new Set<T>();

  for (const element of elements) {
    if (seen.has(element)) {
      duplicates.add(element); // Already seen → it's a duplicate
    } else {
      seen.add(element);
    }
  }

  return [...duplicates];
}

/**
 * Alternative: Using frequency map
 * More flexible if you need counts later
 */
function findAllDuplicatesWithCounts<T>(elements: T[]): Map<T, number> {
  const freq = new Map<T, number>();

  // Count all occurrences
  for (const element of elements) {
    freq.set(element, (freq.get(element) ?? 0) + 1);
  }

  // Filter to only duplicates
  const duplicates = new Map<T, number>();
  for (const [key, count] of freq) {
    if (count > 1) {
      duplicates.set(key, count);
    }
  }

  return duplicates;
}

// Examples
console.log(findAllDuplicates([1, 2, 3, 2, 4, 3, 5]));              // [2, 3]
console.log(findAllDuplicatesWithCounts([1, 2, 3, 2, 4, 3, 3, 5])); // Map { 2 => 2, 3 => 3 }
```

Variant 2: Find Duplicates in Constrained Space
A classic interview problem: Given an array of n integers where each integer is in the range [1, n], find duplicates using O(1) extra space.
The trick: use the array itself as a hash table by negating values at indices.
```typescript
/**
 * Find Duplicates with O(1) Extra Space
 *
 * Constraint: Values are in range [1, n] for array of length n
 *
 * Key insight: Use value as index to mark "seen" by negating
 * If we encounter an already-negative value, it's a duplicate
 *
 * Time: O(n), Space: O(1) (modifies input array in-place)
 */
function findDuplicatesConstantSpace(nums: number[]): number[] {
  const duplicates: number[] = [];

  for (let i = 0; i < nums.length; i++) {
    // Get the index corresponding to this value
    // Use absolute value because we might have negated it
    const index = Math.abs(nums[i]) - 1; // -1 because values are 1-indexed, array is 0-indexed

    if (nums[index] < 0) {
      // Already visited → this value is a duplicate
      duplicates.push(Math.abs(nums[i]));
    } else {
      // Mark as visited by negating
      nums[index] = -nums[index];
    }
  }

  // Optional: restore the array
  for (let i = 0; i < nums.length; i++) {
    nums[i] = Math.abs(nums[i]);
  }

  return duplicates;
}

// Example walkthrough: nums = [4, 3, 2, 7, 8, 2, 3, 1]
// i=0: value=4, index=3, nums[3]=7>0  → negate → [4, 3, 2, -7, 8, 2, 3, 1]
// i=1: value=3, index=2, nums[2]=2>0  → negate → [4, 3, -2, -7, 8, 2, 3, 1]
// i=2: value=2, index=1, nums[1]=3>0  → negate → [4, -3, -2, -7, 8, 2, 3, 1]
// i=3: value=7, index=6, nums[6]=3>0  → negate → [4, -3, -2, -7, 8, 2, -3, 1]
// i=4: value=8, index=7, nums[7]=1>0  → negate → [4, -3, -2, -7, 8, 2, -3, -1]
// i=5: value=2, index=1, nums[1]=-3<0 → duplicate found! → duplicates = [2]
// i=6: value=3, index=2, nums[2]=-2<0 → duplicate found! → duplicates = [2, 3]
// i=7: value=1, index=0, nums[0]=4>0  → negate → [-4, -3, -2, -7, 8, 2, -3, -1]

console.log(findDuplicatesConstantSpace([4, 3, 2, 7, 8, 2, 3, 1])); // [2, 3]
```

The constant-space technique modifies the input array, which may not be acceptable in production code. Use it when:
• Memory is critically constrained
• Input can be modified (or you restore it afterward)
• Values are in the required range [1, n]
For general-purpose code, the O(n) space hash set is usually preferred for clarity.
Sometimes the position of duplicates matters—not just whether they exist.
Find First Duplicate (by second occurrence)
Return the first element that appears for the second time as we scan left to right:
```typescript
/**
 * Find First Duplicate: Return value of first repeated element
 * "First" = earliest second occurrence while scanning left to right
 *
 * Time: O(n), Space: O(n)
 */
function findFirstDuplicate<T>(elements: T[]): T | undefined {
  const seen = new Set<T>();

  for (const element of elements) {
    if (seen.has(element)) {
      return element; // First duplicate found!
    }
    seen.add(element);
  }

  return undefined;
}

/**
 * Find First Duplicate with Index
 * Return the index of the second occurrence
 */
function findFirstDuplicateIndex<T>(elements: T[]): number {
  const seen = new Set<T>();

  for (let i = 0; i < elements.length; i++) {
    if (seen.has(elements[i])) {
      return i; // Index of second occurrence
    }
    seen.add(elements[i]);
  }

  return -1;
}

/**
 * Find First Duplicate with Both Indices
 * Return indices of both occurrences
 */
function findFirstDuplicatePair<T>(elements: T[]): [number, number] | null {
  const firstOccurrence = new Map<T, number>();

  for (let i = 0; i < elements.length; i++) {
    if (firstOccurrence.has(elements[i])) {
      return [firstOccurrence.get(elements[i])!, i];
    }
    firstOccurrence.set(elements[i], i);
  }

  return null;
}

// Examples
console.log(findFirstDuplicate([2, 1, 3, 5, 3, 2]));      // 3 (3 repeats first)
console.log(findFirstDuplicateIndex([2, 1, 3, 5, 3, 2])); // 4 (index of second 3)
console.log(findFirstDuplicatePair([2, 1, 3, 5, 3, 2]));  // [2, 4] (indices of both 3s)
```

Find First Duplicate (by first occurrence of duplicate value)
A different interpretation: return the element whose first occurrence is earliest among all duplicates:
```python
from typing import TypeVar, List, Optional

T = TypeVar('T')


def find_earliest_duplicate(elements: List[T]) -> Optional[T]:
    """
    Find duplicate whose FIRST occurrence is earliest.

    Example: [2, 1, 3, 5, 3, 2]
    - 3's first occurrence: index 2
    - 2's first occurrence: index 0
    - Answer: 2 (even though 3 repeats first)

    Time: O(n), Space: O(n)
    """
    first_occurrence = {}
    duplicates = set()

    for i, element in enumerate(elements):
        if element in first_occurrence:
            duplicates.add(element)
        else:
            first_occurrence[element] = i

    if not duplicates:
        return None

    # Find duplicate with earliest first occurrence
    earliest = min(duplicates, key=lambda x: first_occurrence[x])
    return earliest


# Example
print(find_earliest_duplicate([2, 1, 3, 5, 3, 2]))  # 2 (first duplicate by first occurrence)


def find_kth_duplicate(elements: List[T], k: int) -> Optional[T]:
    """
    Find the k-th distinct duplicate value (1-indexed).

    Example: [1, 2, 3, 2, 1, 4, 3, 5, 1], k=2
    - First duplicate detected: 2 (at index 3)
    - Second duplicate detected: 1 (at index 4)
    - Answer: 1
    """
    seen = set()
    duplicates_found = []

    for element in elements:
        if element in seen and element not in duplicates_found:
            duplicates_found.append(element)
            if len(duplicates_found) == k:
                return element
        seen.add(element)

    return None  # Fewer than k duplicates


# Example
print(find_kth_duplicate([1, 2, 3, 2, 1, 4, 3, 5, 1], 2))  # 1
```

A common variant asks: Are there duplicates within a sliding window of size k?
This is useful for detecting suspicious patterns—like multiple login attempts from the same IP within a short time window.
Classic Problem: Nearby Duplicate
Given an array and integer k, return true if there are two distinct indices i and j such that nums[i] == nums[j] and |i - j| <= k.
```typescript
/**
 * Contains Duplicate II: Duplicates within distance k
 *
 * Approach 1: Sliding Window Set - O(n) time, O(k) space
 * Maintain a set of the last k elements
 */
function containsNearbyDuplicate(nums: number[], k: number): boolean {
  const window = new Set<number>();

  for (let i = 0; i < nums.length; i++) {
    // Check if current element is in the window
    if (window.has(nums[i])) {
      return true;
    }

    // Add current element to window
    window.add(nums[i]);

    // Remove element that's now outside the window
    if (window.size > k) {
      window.delete(nums[i - k]);
    }
  }

  return false;
}

/**
 * Approach 2: Hash Map with Index - O(n) time, O(n) space
 * Store the most recent index of each value
 */
function containsNearbyDuplicateMap(nums: number[], k: number): boolean {
  const lastIndex = new Map<number, number>();

  for (let i = 0; i < nums.length; i++) {
    if (lastIndex.has(nums[i])) {
      const prevIndex = lastIndex.get(nums[i])!;
      if (i - prevIndex <= k) {
        return true;
      }
    }
    lastIndex.set(nums[i], i);
  }

  return false;
}

// Example walkthrough: nums = [1, 2, 3, 1], k = 3
// Sliding window approach:
// i=0: window = {}        → add 1 → {1}
// i=1: window = {1}       → 2 not in window → add 2 → {1, 2}
// i=2: window = {1, 2}    → 3 not in window → add 3 → {1, 2, 3}
// i=3: window = {1, 2, 3} → 1 IS in window  → return true!

console.log(containsNearbyDuplicate([1, 2, 3, 1], 3));       // true
console.log(containsNearbyDuplicate([1, 2, 3, 1, 2, 3], 2)); // false
```

Classic Problem: Nearby Almost Duplicate
A more complex variant: check if there are two distinct indices i and j such that |nums[i] - nums[j]| <= t and |i - j| <= k.
This requires both proximity in position AND proximity in value. The bucket-based solution is elegant:
```typescript
/**
 * Contains Duplicate III: Nearby Almost Duplicate
 * |nums[i] - nums[j]| <= t AND |i - j| <= k
 *
 * Key insight: Use buckets of size (t+1)
 * - Numbers in the same bucket differ by at most t
 * - Numbers in adjacent buckets might differ by at most t
 *
 * Time: O(n), Space: O(k)
 */
function containsNearbyAlmostDuplicate(
  nums: number[],
  k: number,
  t: number
): boolean {
  if (k <= 0 || t < 0) return false;

  const buckets = new Map<number, number>();
  const bucketSize = t + 1; // +1 to avoid division by zero when t=0

  // Helper: Get bucket ID for a number
  // Handle negative numbers correctly
  const getBucketId = (num: number): number => {
    return num >= 0
      ? Math.floor(num / bucketSize)
      : Math.floor((num + 1) / bucketSize) - 1;
  };

  for (let i = 0; i < nums.length; i++) {
    const bucketId = getBucketId(nums[i]);

    // Check same bucket (guaranteed |diff| <= t)
    if (buckets.has(bucketId)) {
      return true;
    }

    // Check adjacent buckets (might have |diff| <= t)
    if (buckets.has(bucketId - 1)) {
      if (Math.abs(nums[i] - buckets.get(bucketId - 1)!) <= t) {
        return true;
      }
    }
    if (buckets.has(bucketId + 1)) {
      if (Math.abs(nums[i] - buckets.get(bucketId + 1)!) <= t) {
        return true;
      }
    }

    // Add current number to its bucket
    buckets.set(bucketId, nums[i]);

    // Remove element outside window
    if (i >= k) {
      const oldBucketId = getBucketId(nums[i - k]);
      buckets.delete(oldBucketId);
    }
  }

  return false;
}

// Example: nums = [1, 5, 9, 1, 5, 9], k = 2, t = 3
// Bucket size = 4 (so 0-3 in bucket 0, 4-7 in bucket 1, etc.)
// i=0: 1 → bucket 0 → buckets = {0: 1}
// i=1: 5 → bucket 1 → check bucket 0: |5-1|=4 > 3 → buckets = {0: 1, 1: 5}
// i=2: 9 → bucket 2 → check bucket 1: |9-5|=4 > 3 → buckets = {0: 1, 1: 5, 2: 9}
//      Remove bucket for nums[0]=1 → buckets = {1: 5, 2: 9}
// i=3: 1 → bucket 0 → no adjacent match → buckets = {0: 1, 1: 5, 2: 9}
// ... continues

console.log(containsNearbyAlmostDuplicate([1, 2, 1, 1], 1, 0)); // true
```

Production duplicate detection involves challenges beyond the basic algorithm.
Challenge 1: Near-Duplicates
'John Smith' and 'JOHN SMITH' are duplicates, but exact matching won't find them. A common solution is to normalize each value into a canonical form before hashing:
```typescript
/**
 * Near-Duplicate Detection with Normalization
 */
function findNearDuplicates(strings: string[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();

  for (const str of strings) {
    // Normalize: lowercase, remove extra whitespace, etc.
    const normalized = str.toLowerCase().trim().replace(/\s+/g, ' ');

    if (!groups.has(normalized)) {
      groups.set(normalized, []);
    }
    groups.get(normalized)!.push(str);
  }

  // Return only groups with duplicates
  return new Map(
    [...groups.entries()].filter(([_, values]) => values.length > 1)
  );
}

// Example
const names = [
  "John Smith",
  "JOHN SMITH",
  "John  Smith", // Extra space
  "Jane Doe",
  "jane doe"
];

console.log(findNearDuplicates(names));
// Map {
//   "john smith" => ["John Smith", "JOHN SMITH", "John  Smith"],
//   "jane doe" => ["Jane Doe", "jane doe"]
// }
```

Challenge 2: Scale and Memory
For billions of records, keeping every value in memory isn't feasible. One common solution is a probabilistic structure such as a Bloom filter, which trades a small false-positive rate for dramatically less memory:
```typescript
/**
 * Conceptual Bloom Filter for Duplicate Detection
 *
 * A Bloom filter uses k hash functions and a bit array
 * - Adding: Set bits at positions hash1(x), hash2(x), ..., hashk(x)
 * - Checking: If ALL bits at hash positions are set → might be present
 *             If ANY bit is 0 → definitely not present
 *
 * Trade-off: Small memory, but false positives possible
 */
class SimpleBloomFilter {
  private bits: boolean[];
  private size: number;

  constructor(size: number = 1000) {
    this.size = size;
    this.bits = new Array(size).fill(false);
  }

  // Simplified: using two hash functions
  private hash1(value: string): number {
    let hash = 0;
    for (const char of value) {
      hash = (hash * 31 + char.charCodeAt(0)) % this.size;
    }
    return hash;
  }

  private hash2(value: string): number {
    let hash = 0;
    for (const char of value) {
      hash = (hash * 37 + char.charCodeAt(0)) % this.size;
    }
    return hash;
  }

  add(value: string): void {
    this.bits[this.hash1(value)] = true;
    this.bits[this.hash2(value)] = true;
  }

  mightContain(value: string): boolean {
    return this.bits[this.hash1(value)] && this.bits[this.hash2(value)];
  }
}

// Usage for deduplication with a space constraint
function deduplicateWithBloom(stream: Iterable<string>): string[] {
  const bloom = new SimpleBloomFilter(10000);
  const result: string[] = [];

  for (const item of stream) {
    if (!bloom.mightContain(item)) {
      // Definitely new
      result.push(item);
      bloom.add(item);
    }
    // If mightContain returns true, it MIGHT be a duplicate (could be a false positive)
    // For guaranteed deduplication, combine with an exact check for "maybe" cases
  }

  return result;
}
```

Challenge 3: Distributed Systems
In distributed systems, duplicate detection becomes the problem of exactly-once delivery.
In distributed systems, the "have we seen this before?" question is hard because of network partitions: different nodes might disagree. Solutions typically trade off:
• Accuracy (false positives/negatives)
• Latency (how long to wait for consensus)
• Memory (how much state to keep)
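As one illustration of how this plays out—an assumed, simplified sketch rather than this lesson's own code—a consumer often deduplicates by message ID (an "idempotency key"), keeping only a bounded window of recently seen IDs and accepting that very old duplicates may slip through:

```typescript
// Hedged sketch with hypothetical names: deduplicating deliveries by message ID.
// The bounded window is the memory/accuracy trade-off described above.
interface Message {
  id: string;      // unique ID assigned by the producer
  payload: string;
}

class IdempotentConsumer {
  private seenIds = new Set<string>();

  constructor(private maxIds: number = 100_000) {}

  /** Returns true if the message was processed, false if it was a duplicate. */
  process(msg: Message, handle: (payload: string) => void): boolean {
    if (this.seenIds.has(msg.id)) {
      return false; // duplicate delivery — skip
    }
    handle(msg.payload);
    this.seenIds.add(msg.id);

    // Evict the oldest ID once the window is full (Sets iterate in insertion order)
    if (this.seenIds.size > this.maxIds) {
      const oldest = this.seenIds.values().next().value as string;
      this.seenIds.delete(oldest);
    }
    return true;
  }
}

// Usage
const consumer = new IdempotentConsumer();
consumer.process({ id: "msg-1", payload: "charge card" }, console.log); // processed
consumer.process({ id: "msg-1", payload: "charge card" }, console.log); // duplicate, skipped
```

Real systems typically back the seen-ID state with shared or persistent storage so it survives restarts and is visible to all nodes.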
Let's examine some classic duplicate-related interview problems that go beyond basic detection.
Problem 1: Single Number
Every element appears twice except one. Find that single element.
This can be solved with a hash set, but there's an elegant O(1) space solution using XOR:
```typescript
/**
 * Single Number: One element appears once, all others appear twice
 *
 * Hash Set Approach: O(n) time, O(n) space
 */
function singleNumberSet(nums: number[]): number {
  const set = new Set<number>();

  for (const num of nums) {
    if (set.has(num)) {
      set.delete(num); // Seen twice = remove
    } else {
      set.add(num);    // First occurrence = add
    }
  }

  return [...set][0]; // Only one element remains
}

/**
 * XOR Approach: O(n) time, O(1) space
 *
 * Key insight: a XOR a = 0 and a XOR 0 = a
 * So duplicates cancel out, leaving the single number
 */
function singleNumberXOR(nums: number[]): number {
  let result = 0;
  for (const num of nums) {
    result ^= num;
  }
  return result;
}

// Example: [4, 1, 2, 1, 2]
// 4 XOR 1 XOR 2 XOR 1 XOR 2
// = (1 XOR 1) XOR (2 XOR 2) XOR 4
// = 0 XOR 0 XOR 4
// = 4

console.log(singleNumberXOR([4, 1, 2, 1, 2])); // 4
```

Problem 2: Find Missing Number
Given an array containing n distinct numbers in range [0, n], find the missing one.
```python
from typing import List


def missing_number_set(nums: List[int]) -> int:
    """
    Hash Set Approach: O(n) time, O(n) space
    """
    all_nums = set(range(len(nums) + 1))
    present = set(nums)
    missing = all_nums - present
    return missing.pop()


def missing_number_xor(nums: List[int]) -> int:
    """
    XOR Approach: O(n) time, O(1) space

    XOR all numbers 0 to n with all numbers in the array.
    Duplicates cancel, leaving the missing number.
    """
    result = len(nums)  # Start with n (since range is [0, n])
    for i, num in enumerate(nums):
        result ^= i ^ num
    return result


def missing_number_math(nums: List[int]) -> int:
    """
    Mathematical Approach: O(n) time, O(1) space

    Sum(0 to n) - Sum(nums) = missing number
    """
    n = len(nums)
    expected_sum = n * (n + 1) // 2
    actual_sum = sum(nums)
    return expected_sum - actual_sum


# Example: [3, 0, 1] → missing 2
print(missing_number_xor([3, 0, 1]))  # 2
```

Problem 3: Find All Numbers Disappeared in an Array
Given an array of integers where 1 ≤ a[i] ≤ n, find all integers in [1, n] that don't appear.
```typescript
/**
 * Find Disappeared Numbers
 *
 * Hash Set Approach: O(n) time, O(n) space
 */
function findDisappearedNumbersSet(nums: number[]): number[] {
  const present = new Set(nums);
  const missing: number[] = [];

  for (let i = 1; i <= nums.length; i++) {
    if (!present.has(i)) {
      missing.push(i);
    }
  }

  return missing;
}

/**
 * In-Place Marking: O(n) time, O(1) space
 *
 * Use negation to mark indices as "seen"
 */
function findDisappearedNumbersInPlace(nums: number[]): number[] {
  // Mark indices corresponding to present values
  for (let i = 0; i < nums.length; i++) {
    const index = Math.abs(nums[i]) - 1;
    if (nums[index] > 0) {
      nums[index] = -nums[index];
    }
  }

  // Indices still positive are missing
  const missing: number[] = [];
  for (let i = 0; i < nums.length; i++) {
    if (nums[i] > 0) {
      missing.push(i + 1);
    }
  }

  // Restore (optional)
  for (let i = 0; i < nums.length; i++) {
    nums[i] = Math.abs(nums[i]);
  }

  return missing;
}

console.log(findDisappearedNumbersInPlace([4, 3, 2, 7, 8, 2, 3, 1])); // [5, 6]
```

Duplicate detection, while conceptually simple, underpins critical systems across computing. Let's consolidate the key insights:
| Problem Variant | Approach | Time | Space |
|---|---|---|---|
| Contains any duplicate | Hash Set | O(n) | O(n) |
| Find all duplicates | Set + Set or Map | O(n) | O(n) |
| Duplicates within k | Sliding Window Set | O(n) | O(min(n,k)) |
| Almost duplicates | Bucketing (bucket size t+1) | O(n) | O(k) |
| In-place (values 1-n) | Index negation | O(n) | O(1) |
| Single non-duplicate | XOR | O(n) | O(1) |
You now have a comprehensive understanding of duplicate detection—from basic membership checking to advanced distance-constrained and probabilistic methods. This pattern appears everywhere from data validation to distributed systems. Next, we'll explore grouping by characteristic.