Your algorithm assumes unique elements—but the input has duplicates. Your string parser assumes alphanumeric characters—but the input contains emojis, null bytes, and Unicode normalization variants. Your sorting algorithm assumes stability doesn't matter—but it does once duplicates carry different associated values.
Data quality edge cases are among the most overlooked categories of bugs. Algorithms are often designed with idealized inputs in mind: unique values, clean ASCII text, well-formed structures. Real-world data is messier.
This page addresses two major categories of data-quality edge cases: duplicate values that break uniqueness assumptions, and special characters—whitespace, Unicode, control characters—that break string-processing assumptions.
By the end of this page, you will understand how duplicates affect algorithm behavior, how to handle them correctly, and how to write robust string-processing code that handles special characters, Unicode, and encoding edge cases without breaking.
Many algorithms implicitly assume unique values. When duplicates appear, these assumptions break in subtle ways:
Why Duplicates Are Problematic:
| Algorithm/Operation | Unique Values Assumption | Duplicate Behavior | Correct Handling |
|---|---|---|---|
| Binary Search | Finds 'the' occurrence | Which occurrence found is undefined | Use first/last occurrence variant |
| QuickSort | Partitions around distinct values | All-same degenerates to O(n²) | Use 3-way partition (Dutch flag) |
| Two Sum | At most one pair sums to target | Multiple pairs possible | Define: return first, any, or all |
| HashMap/Set insert | Key updates or rejects | Value overwritten silently | Check for existence before insert if needed |
| Find Maximum | Returns maximum value | Works correctly | But: which index if multiple maxes? |
| Remove Element | Removes 'the' occurrence | First? Last? All? | Define semantics explicitly |
| Counting Sort | Works regardless | Works correctly | No special handling needed |
| Find Unique Element | Exactly one unique | Might have zero or multiple uniques | Handle count != 1 case |
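The binary-search row of the table can be checked directly against Python's standard library: `bisect_left` and `bisect_right` give well-defined first/last-occurrence behavior without hand-rolling the search. A minimal sketch (the array is illustrative):

```python
import bisect

arr = [1, 2, 2, 2, 3, 5]  # sorted, with duplicate 2s

first = bisect.bisect_left(arr, 2)       # index of the first 2
last = bisect.bisect_right(arr, 2) - 1   # index of the last 2
count = bisect.bisect_right(arr, 2) - bisect.bisect_left(arr, 2)

# Absent target: bisect_left returns an insertion point, so confirm
# the element is actually there before trusting the index
i = bisect.bisect_left(arr, 4)
found = i < len(arr) and arr[i] == 4
```

Note the absent-target case: `bisect_left` never returns -1, so membership must be verified separately.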
```python
from typing import TypeVar
from collections import Counter, defaultdict

T = TypeVar('T')

# ============================================
# SEARCH WITH DUPLICATES
# ============================================

def find_first_occurrence(arr: list[int], target: int) -> int:
    """Find first occurrence of target in sorted array with duplicates."""
    left, right = 0, len(arr) - 1
    result = -1
    while left <= right:
        mid = left + (right - left) // 2
        if arr[mid] == target:
            result = mid  # Record this, but keep searching left
            right = mid - 1
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return result

def find_last_occurrence(arr: list[int], target: int) -> int:
    """Find last occurrence of target in sorted array with duplicates."""
    left, right = 0, len(arr) - 1
    result = -1
    while left <= right:
        mid = left + (right - left) // 2
        if arr[mid] == target:
            result = mid  # Record this, but keep searching right
            left = mid + 1
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return result

def count_occurrences(arr: list[int], target: int) -> int:
    """
    Count occurrences of target in sorted array.
    Uses first and last occurrence to compute the count.
    """
    first = find_first_occurrence(arr, target)
    if first == -1:
        return 0
    last = find_last_occurrence(arr, target)
    return last - first + 1

# ============================================
# REMOVAL WITH DUPLICATES
# ============================================

def remove_first_occurrence(arr: list[T], value: T) -> list[T]:
    """Remove first occurrence of value."""
    for i, item in enumerate(arr):
        if item == value:
            return arr[:i] + arr[i+1:]
    return arr[:]  # Not found, return copy

def remove_last_occurrence(arr: list[T], value: T) -> list[T]:
    """Remove last occurrence of value."""
    for i in range(len(arr) - 1, -1, -1):
        if arr[i] == value:
            return arr[:i] + arr[i+1:]
    return arr[:]

def remove_all_occurrences(arr: list[T], value: T) -> list[T]:
    """Remove all occurrences of value."""
    return [item for item in arr if item != value]

def remove_duplicates_sorted(arr: list[int]) -> list[int]:
    """
    Remove duplicates from sorted array, keeping one of each.
    Compacts in place with a write index, then returns the
    deduplicated prefix. Standard LeetCode problem approach.
    """
    if not arr:
        return []
    write_index = 1
    for read_index in range(1, len(arr)):
        if arr[read_index] != arr[read_index - 1]:
            arr[write_index] = arr[read_index]
            write_index += 1
    return arr[:write_index]

# ============================================
# DUPLICATE DETECTION AND COUNTING
# ============================================

def has_duplicates(arr: list[T]) -> bool:
    """Check if array contains any duplicates."""
    return len(arr) != len(set(arr))

def find_duplicates(arr: list[int]) -> list[int]:
    """
    Find all duplicate values (values appearing more than once).
    Each duplicated value appears once in the result.
    """
    counts = Counter(arr)
    return [val for val, count in counts.items() if count > 1]

def find_all_duplicate_indices(arr: list[T], value: T) -> list[int]:
    """Find all indices where value appears."""
    return [i for i, item in enumerate(arr) if item == value]

# ============================================
# HANDLING DUPLICATES IN SPECIFIC PROBLEMS
# ============================================

def two_sum_with_duplicates(nums: list[int], target: int) -> list[tuple[int, int]]:
    """
    Find ALL pairs of indices that sum to target.
    Handles duplicates correctly.
    """
    # Map value to list of indices
    value_indices = defaultdict(list)
    for i, num in enumerate(nums):
        value_indices[num].append(i)

    result = []
    seen_pairs = set()  # Avoid duplicate pairs
    for i, num in enumerate(nums):
        complement = target - num
        if complement in value_indices:
            for j in value_indices[complement]:
                if i < j:  # Ensure unique pairs with i < j
                    pair = (i, j)
                    if pair not in seen_pairs:
                        seen_pairs.add(pair)
                        result.append(pair)
    return result

def three_way_partition(arr: list[int], pivot: int) -> list[int]:
    """
    Dutch National Flag algorithm: partition into <pivot, =pivot, >pivot.
    Crucial for efficient QuickSort with many duplicates.
    """
    low = 0               # Boundary of < section
    mid = 0               # Current element
    high = len(arr) - 1   # Boundary of > section
    while mid <= high:
        if arr[mid] < pivot:
            arr[low], arr[mid] = arr[mid], arr[low]
            low += 1
            mid += 1
        elif arr[mid] > pivot:
            arr[mid], arr[high] = arr[high], arr[mid]
            high -= 1  # Don't increment mid - swapped element needs checking
        else:  # arr[mid] == pivot
            mid += 1
    return arr
```

When testing duplicate handling, always test the extreme case: all elements identical. This is where QuickSort degrades, binary search behavior is most ambiguous, and many assumptions break. If your algorithm handles [5, 5, 5, 5, 5] correctly, it likely handles all duplicate scenarios.
When designing algorithms that must handle duplicates, consider these design principles:
Principle 1: Define Semantics Explicitly
Before writing code, explicitly define what happens with duplicates: does an operation target the first occurrence, the last, any one, or all of them? Does inserting a duplicate key overwrite, reject, or count? Which of several equal maxima does "find maximum" report?
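As a concrete illustration of why semantics must be pinned down, Python's own built-ins already make different choices about which occurrence an operation targets (the data below is illustrative):

```python
nums = [1, 2, 1, 3, 1]

# list.remove targets only the FIRST occurrence
first_removed = nums.copy()
first_removed.remove(1)                     # leaves later 1s in place

# A comprehension removes ALL occurrences
all_removed = [x for x in nums if x != 1]

# dict insertion silently overwrites on a repeated key,
# so only the LAST index of each value survives
d = {}
for i, x in enumerate(nums):
    d[x] = i
```

Three operations, three different duplicate semantics — none of them wrong, but each must be a deliberate choice.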
Principle 2: Use Appropriate Data Structures
Different structures handle duplicates differently: a list keeps every occurrence in order, a set silently collapses them to one, a dict overwrites the previous value for a repeated key, and a Counter acts as a multiset that remembers multiplicity.
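A short sketch contrasting how the common Python containers treat the same duplicated input (the data is illustrative):

```python
from collections import Counter

data = [3, 1, 3, 2, 3]

as_list = list(data)              # keeps every occurrence, in order
as_set = set(data)                # collapses duplicates: {1, 2, 3}
as_counts = Counter(data)         # multiset: remembers multiplicity
dedup_ordered = list(dict.fromkeys(data))  # dedupe, keep first-seen order
```

The `dict.fromkeys` idiom is worth remembering: it deduplicates while preserving insertion order, which `set()` does not guarantee.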
Principle 3: Consider Stability
For sorting and ordering operations, stability (maintaining relative order of equal elements) may be important for correctness.
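Python's `sorted()` and `list.sort()` are guaranteed stable, which makes stability easy to demonstrate (the names and scores below are illustrative):

```python
records = [("alice", 2), ("bob", 1), ("carol", 2), ("dave", 1)]

# sorted() is guaranteed stable: records with equal keys keep
# their original relative order
by_score = sorted(records, key=lambda r: r[1])
# bob and dave (both score 1) stay in input order, as do alice and carol
```

If you sort by score and then rely on names appearing in their original order within each score group, you are relying on stability — an unstable sort would be free to swap them.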
```python
from collections import defaultdict, Counter

# ============================================
# PATTERN: Group by Key with Duplicates
# ============================================

def group_anagrams(strs: list[str]) -> list[list[str]]:
    """
    Group strings that are anagrams.
    Demonstrates grouping with potential duplicates.

    Edge cases:
    - Duplicate strings: grouped together
    - Empty string: valid, forms its own group
    """
    groups = defaultdict(list)
    for s in strs:
        # Key = sorted characters (anagram signature)
        key = tuple(sorted(s))
        groups[key].append(s)  # Duplicates allowed in list
    return list(groups.values())

# ============================================
# PATTERN: First/Last with Duplicates
# ============================================

def first_unique_character(s: str) -> int:
    """
    Find index of first character that appears only once.

    Edge cases:
    - All duplicates: return -1
    - All unique: return 0
    - Duplicate characters interspersed with unique ones
    """
    char_count = Counter(s)
    for i, char in enumerate(s):
        if char_count[char] == 1:
            return i
    return -1  # All characters are duplicates

def last_unique_character(s: str) -> int:
    """Find index of last character that appears only once."""
    char_count = Counter(s)
    for i in range(len(s) - 1, -1, -1):
        if char_count[s[i]] == 1:
            return i
    return -1

# ============================================
# PATTERN: Frequency-Based Operations
# ============================================

def top_k_frequent(nums: list[int], k: int) -> list[int]:
    """
    Return k most frequent elements.

    Edge case: ties in frequency - which k to return?
    This implementation breaks ties by smaller value first,
    so results are deterministic.
    """
    if k <= 0:
        return []
    counts = Counter(nums)
    # Sort by frequency (descending), then by value for consistency
    sorted_items = sorted(counts.items(), key=lambda x: (-x[1], x[0]))
    return [item[0] for item in sorted_items[:k]]

def find_majority_element(nums: list[int]) -> int | None:
    """
    Find element that appears more than n/2 times.
    Returns None if no majority exists.
    Uses the Boyer-Moore Voting Algorithm.
    """
    if not nums:
        return None

    # Find candidate
    candidate = nums[0]
    count = 1
    for i in range(1, len(nums)):
        if count == 0:
            candidate = nums[i]
            count = 1
        elif nums[i] == candidate:
            count += 1
        else:
            count -= 1

    # Verify candidate is actually a majority
    actual_count = sum(1 for num in nums if num == candidate)
    if actual_count > len(nums) // 2:
        return candidate
    return None

# ============================================
# PATTERN: Set Operations with Duplicates (Multiset)
# ============================================

def intersection_with_duplicates(nums1: list[int], nums2: list[int]) -> list[int]:
    """
    Intersection of two arrays, including duplicate handling.
    If element appears k1 times in nums1 and k2 times in nums2,
    it appears min(k1, k2) times in the result.
    """
    counts1 = Counter(nums1)
    counts2 = Counter(nums2)
    result = []
    for num in counts1:
        if num in counts2:
            # Include min of the two counts
            result.extend([num] * min(counts1[num], counts2[num]))
    return result

def union_with_duplicates(nums1: list[int], nums2: list[int]) -> list[int]:
    """
    Union of two arrays with duplicates.
    Element appears max(count1, count2) times.
    """
    counts1 = Counter(nums1)
    counts2 = Counter(nums2)
    all_keys = set(counts1.keys()) | set(counts2.keys())
    result = []
    for num in all_keys:
        count = max(counts1.get(num, 0), counts2.get(num, 0))
        result.extend([num] * count)
    return result
```

When processing strings, many algorithms assume "normal" characters—letters and digits. But real-world strings contain far more:
Categories of Special Characters:
| Character Type | Examples | Common Bug | Correct Handling |
|---|---|---|---|
| Empty string | "" | Index out of bounds on s[0] | Check length before access |
| Whitespace only | " \t\n" | Treated as valid content | Trim and check if empty |
| Leading/trailing spaces | " hello " | Comparison failures | Trim before comparison |
| Null character | "hel\0lo" | String truncated at null | Use length, not null termination |
| Newlines | "line1\nline2" | Split logic fails | Handle \n, \r, and \r\n |
| Tab characters | "a\tb" | Visual vs actual length differs | Consider tab expansion |
| Unicode letters | "café", "naïve" | Sorting, comparison fails | Use locale-aware functions |
| Emojis | "Hello 👋" | Length calculation wrong | Use grapheme clusters |
| Mixed case | "HeLLo" | Comparison failures | Normalize case before compare |
| Escape sequences | "path\file" | Escaping bugs | Use raw strings or proper escaping |
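The Unicode row above is worth seeing concretely: the same visible word can be two different code-point sequences, and comparisons fail until both sides are normalized. A minimal sketch with the stdlib `unicodedata` module:

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as one precomposed code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

naive_equal = composed == decomposed        # raw code points differ
lengths = (len(composed), len(decomposed))  # same visible word, different lengths

# Normalizing both sides (NFC here) makes the comparison behave
normalized_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
```

Both strings render as "café", yet they are unequal and even have different lengths until normalized — exactly the kind of bug that only surfaces with real-world input.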
```python
import unicodedata
import re

# ============================================
# WHITESPACE HANDLING
# ============================================

def normalize_whitespace(s: str) -> str:
    """
    Normalize all whitespace: trim ends, collapse internal runs.
    Handles spaces, tabs, newlines, multiple spaces.
    """
    if not s:
        return ""
    # Replace all whitespace runs with single space, then trim
    return " ".join(s.split())

def is_blank(s: str) -> bool:
    """Check if string is empty or contains only whitespace."""
    return not s or s.isspace()

def split_lines_universal(text: str) -> list[str]:
    """
    Split text into lines, handling all newline conventions:
    \\n (Unix), \\r\\n (Windows), \\r (old Mac).
    """
    # splitlines() handles all conventions
    return text.splitlines()

# ============================================
# CASE NORMALIZATION
# ============================================

def case_insensitive_compare(s1: str, s2: str) -> bool:
    """
    Case-insensitive comparison.
    Note: casefold() is better than lower() for international text.
    Example: German 'ß' casefolds to 'ss'.
    """
    return s1.casefold() == s2.casefold()

def case_insensitive_find(text: str, pattern: str) -> int:
    """Case-insensitive search."""
    return text.casefold().find(pattern.casefold())

# ============================================
# UNICODE HANDLING
# ============================================

def normalize_unicode(s: str, form: str = 'NFC') -> str:
    """
    Normalize a Unicode string.

    Forms:
    - NFC: Canonical Decomposition, followed by Canonical Composition
    - NFD: Canonical Decomposition
    - NFKC: Compatibility Decomposition, followed by Canonical Composition
    - NFKD: Compatibility Decomposition

    Example: 'café' can be 'cafe\\u0301' (e + combining accent)
    or 'caf\\xe9' (é as a single character). NFC makes them the same.
    """
    return unicodedata.normalize(form, s)

def remove_accents(s: str) -> str:
    """
    Remove accents/diacritics from characters.
    'café' → 'cafe'
    """
    # Decompose to base + combining characters
    decomposed = unicodedata.normalize('NFD', s)
    # Keep only non-combining characters
    return ''.join(
        char for char in decomposed
        if unicodedata.category(char) != 'Mn'  # Mn = Mark, Nonspacing
    )

def string_length_unicode_aware(s: str) -> int:
    """
    Get visual length (grapheme clusters), not code points.
    'café' = 4 graphemes (may be 5 code points with a combining accent).
    'Hello 👋' = 7 graphemes.
    """
    # For precise grapheme counting, use the third-party 'regex' module
    # (pip install regex), which is more Unicode-aware than stdlib 're'
    import regex
    return len(regex.findall(r'\X', s))

# ============================================
# ALPHANUMERIC EXTRACTION
# ============================================

def extract_alphanumeric(s: str) -> str:
    """
    Extract only alphanumeric characters.
    Useful for palindrome checking, normalization.
    """
    return ''.join(char for char in s if char.isalnum())

def is_palindrome_ignoring_non_alphanumeric(s: str) -> bool:
    """
    Check if string is a palindrome, ignoring non-alphanumerics and case.
    "A man, a plan, a canal: Panama" → True
    """
    cleaned = extract_alphanumeric(s).lower()
    return cleaned == cleaned[::-1]

# ============================================
# CONTROL CHARACTER HANDLING
# ============================================

def remove_control_characters(s: str) -> str:
    """
    Remove control characters (except common whitespace).
    Control characters can break display and processing.
    """
    return ''.join(
        char for char in s
        if unicodedata.category(char) not in ('Cc', 'Cf')
        or char in '\t\n\r'  # Keep common whitespace
    )

def escape_for_display(s: str) -> str:
    """Escape control characters for safe display."""
    result = []
    for char in s:
        if char == '\n':
            result.append('\\n')
        elif char == '\r':
            result.append('\\r')
        elif char == '\t':
            result.append('\\t')
        elif ord(char) < 32 or ord(char) == 127:
            result.append(f'\\x{ord(char):02x}')
        else:
            result.append(char)
    return ''.join(result)

# ============================================
# STRING VALIDATION
# ============================================

def is_valid_identifier(s: str) -> bool:
    """
    Check if string is a valid programming identifier:
    starts with letter/underscore, contains only alphanumerics/underscores.
    """
    if not s:
        return False
    if not (s[0].isalpha() or s[0] == '_'):
        return False
    return all(c.isalnum() or c == '_' for c in s)

def sanitize_filename(filename: str) -> str:
    """
    Sanitize a string for use as a filename.
    Removes/replaces characters invalid in common file systems.
    """
    # Characters not allowed in Windows filenames
    invalid_chars = r'[<>:"/\\|?*]'
    # Replace invalid characters with underscore
    sanitized = re.sub(invalid_chars, '_', filename)
    # Remove control characters
    sanitized = remove_control_characters(sanitized)
    # Remove leading/trailing whitespace and dots
    sanitized = sanitized.strip(' .')
    # Ensure non-empty
    return sanitized or 'unnamed'
```

In Unicode, what users perceive as one 'character' might be multiple code points (like 'e' + a combining accent) or multiple code units (like emoji in UTF-16). For the family emoji '👨‍👩‍👧', Python's len() returns 5 (code points), and UTF-16-based languages report 8 code units—not 1. For user-facing string operations, consider counting grapheme clusters.
String algorithms have specific edge cases beyond general character handling. These arise from the nature of pattern matching, substring operations, and string transformations.
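Before writing custom substring code, it helps to know the stdlib's own conventions, since they silently answer some of these edge-case questions: `str.count` is non-overlapping, and an empty pattern "matches" at every gap. A quick sketch:

```python
# str.count is non-overlapping: after a match it resumes past the match
non_overlapping = "aaaa".count("aa")   # misses the match at index 1

# An empty pattern matches at every gap, including both ends
empty_find = "abc".find("")
empty_count = "abc".count("")          # len(s) + 1 gaps
```

If your problem needs overlapping matches, `str.count` is the wrong tool — which is exactly why the functions below define their overlap semantics explicitly.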
```python
from collections import Counter

# ============================================
# SUBSTRING EDGE CASES
# ============================================

def find_all_substrings(text: str, pattern: str) -> list[int]:
    """
    Find all occurrences of pattern in text.

    Edge cases:
    - Empty pattern: every position a match, or invalid? Here: no matches.
    - Pattern longer than text: no matches
    - Overlapping matches: "aaa" in "aaaa" has indices [0, 1]
    """
    if not pattern:
        return []  # Or raise ValueError("Empty pattern")
    if len(pattern) > len(text):
        return []
    indices = []
    start = 0
    while True:
        pos = text.find(pattern, start)
        if pos == -1:
            break
        indices.append(pos)
        start = pos + 1  # Allow overlapping matches
    return indices

def longest_substring_without_repeating(s: str) -> str:
    """
    Find longest substring with all unique characters.

    Edge cases:
    - Empty string: return ""
    - Single character: return that character
    - All same characters: return any single character
    - All unique: return entire string
    """
    if not s:
        return ""
    char_index = {}
    start = 0
    max_len = 0
    max_start = 0
    for end, char in enumerate(s):
        if char in char_index and char_index[char] >= start:
            start = char_index[char] + 1
        char_index[char] = end
        if end - start + 1 > max_len:
            max_len = end - start + 1
            max_start = start
    return s[max_start:max_start + max_len]

def count_substrings(text: str, pattern: str, overlap: bool = True) -> int:
    """
    Count occurrences of pattern in text.

    Edge cases:
    - Empty pattern: returns 0
    - overlap=True: "aa" in "aaa" → 2
    - overlap=False: "aa" in "aaa" → 1
    """
    if not pattern:
        return 0
    count = 0
    start = 0
    while True:
        pos = text.find(pattern, start)
        if pos == -1:
            break
        count += 1
        start = pos + (1 if overlap else len(pattern))
    return count

# ============================================
# ANAGRAM AND PERMUTATION EDGE CASES
# ============================================

def are_anagrams(s1: str, s2: str, case_sensitive: bool = True) -> bool:
    """
    Check if two strings are anagrams.

    Edge cases:
    - Empty strings: two empty strings ARE anagrams
    - Different lengths: cannot be anagrams
    - Spaces: should spaces count? Depends on requirements
    - Case: 'Listen' vs 'Silent' - depends on case_sensitive
    """
    # Length check first (optimization)
    if len(s1) != len(s2):
        return False
    if not case_sensitive:
        s1 = s1.lower()
        s2 = s2.lower()
    return Counter(s1) == Counter(s2)

def find_anagrams_in_text(text: str, pattern: str) -> list[int]:
    """
    Find starting indices of all anagram occurrences of pattern in text.
    Uses a sliding window with character frequency counting.

    Edge cases:
    - Empty pattern: return []
    - Pattern longer than text: return []
    - Single character pattern: every occurrence
    """
    if not pattern or len(pattern) > len(text):
        return []
    pattern_count = Counter(pattern)
    window_count = Counter()
    result = []
    k = len(pattern)
    for i, char in enumerate(text):
        # Add new character to window
        window_count[char] += 1
        # Remove character that fell out of window
        if i >= k:
            left_char = text[i - k]
            window_count[left_char] -= 1
            if window_count[left_char] == 0:
                del window_count[left_char]
        # Check for anagram match
        if i >= k - 1 and window_count == pattern_count:
            result.append(i - k + 1)
    return result

# ============================================
# PALINDROME EDGE CASES
# ============================================

def longest_palindromic_substring(s: str) -> str:
    """
    Find longest palindromic substring.

    Edge cases:
    - Empty string: return ""
    - Single character: return that character
    - All same characters: return entire string
    - No palindrome longer than 1: return first character
    """
    if not s:
        return ""
    if len(s) == 1:
        return s
    start = 0
    max_len = 1

    def expand_around_center(left: int, right: int) -> tuple[int, int]:
        while left >= 0 and right < len(s) and s[left] == s[right]:
            left -= 1
            right += 1
        return left + 1, right - left - 1  # Return start and length

    for i in range(len(s)):
        # Odd length palindromes
        left1, len1 = expand_around_center(i, i)
        if len1 > max_len:
            start = left1
            max_len = len1
        # Even length palindromes
        if i + 1 < len(s):
            left2, len2 = expand_around_center(i, i + 1)
            if len2 > max_len:
                start = left2
                max_len = len2
    return s[start:start + max_len]

def count_palindromic_substrings(s: str) -> int:
    """
    Count total number of palindromic substrings.

    Edge cases:
    - Empty string: 0 palindromes
    - Single character: 1 palindrome (itself)
    - All same: n(n+1)/2 palindromes (every substring)
    """
    if not s:
        return 0
    count = 0
    for i in range(len(s)):
        # Odd length
        left, right = i, i
        while left >= 0 and right < len(s) and s[left] == s[right]:
            count += 1
            left -= 1
            right += 1
        # Even length
        left, right = i, i + 1
        while left >= 0 and right < len(s) and s[left] == s[right]:
            count += 1
            left -= 1
            right += 1
    return count
```

To ensure your algorithms handle duplicates and special characters correctly, use a systematic testing approach.
```python
def generate_duplicate_test_cases(base_values: list) -> list[list]:
    """Generate comprehensive duplicate test cases from base values."""
    test_cases = []
    if not base_values:
        return test_cases

    # All unique
    test_cases.append(list(base_values))

    # All same (first element)
    test_cases.append([base_values[0]] * len(base_values))

    if len(base_values) >= 2:
        # Pair duplicated at start
        case = list(base_values)
        case[1] = case[0]
        test_cases.append(case)

        # Pair duplicated at end
        case = list(base_values)
        case[-1] = case[-2]
        test_cases.append(case)

        # First and last are same
        case = list(base_values)
        case[-1] = case[0]
        test_cases.append(case)

    if len(base_values) >= 3:
        # All same except middle
        mid = len(base_values) // 2
        case = [base_values[0]] * len(base_values)
        case[mid] = base_values[1]
        test_cases.append(case)

    return test_cases

def generate_string_edge_cases() -> list[str]:
    """Generate comprehensive string edge case test inputs."""
    return [
        "",                 # Empty
        " ",                # Single space
        "   ",              # Multiple spaces
        "\t",               # Tab
        "\n",               # Newline
        "\r\n",             # Windows newline
        "a",                # Single char
        "A",                # Single uppercase
        "1",                # Single digit
        "hello",            # Simple lowercase
        "HELLO",            # Simple uppercase
        "Hello World",      # Mixed case with space
        " hello ",          # Leading/trailing spaces
        "hello\nworld",     # Internal newline
        "hello\tworld",     # Internal tab
        "café",             # Accented characters
        "naïve",            # More accents
        "Hello 👋",         # Emoji
        "你好",              # Chinese
        "مرحبا",             # Arabic (RTL)
        "hello123",         # Alphanumeric
        "123hello",         # Number prefix
        "12345",            # Only digits
        "!@#$%",            # Only special chars
        "hello, world!",    # Punctuation
        "a" * 1000,         # Long string, same char
        "".join(chr(i) for i in range(32, 127)),  # All printable ASCII
    ]

def test_string_function(func, test_cases: list[str] | None = None) -> list[dict]:
    """Test a string function against comprehensive edge cases."""
    if test_cases is None:
        test_cases = generate_string_edge_cases()
    results = []
    for test in test_cases:
        try:
            result = func(test)
            results.append({
                'input': repr(test),
                'output': repr(result),
                'error': None,
            })
        except Exception as e:
            results.append({
                'input': repr(test),
                'output': None,
                'error': str(e),
            })
    return results
```

You now understand how to handle duplicates and special characters robustly. These data quality edge cases are among the most common in real-world applications. Next, we'll explore how to systematically generate test cases to verify your algorithms handle all edge cases correctly.