Your algorithm assumes unique elements—but the input has duplicates. Your string parser assumes alphanumeric characters—but the input contains emojis, null bytes, and Unicode normalization variants. Your sorting algorithm assumes stability doesn't matter—but it does once duplicates carry different associated values.
Data quality edge cases are among the most overlooked categories of bugs. Algorithms are often designed with idealized inputs in mind: unique values, clean ASCII text, well-formed structures. Real-world data is messier.
This page addresses two major categories of data-quality edge cases: duplicate values that break uniqueness assumptions, and special characters—whitespace, Unicode, control characters—that break string-processing assumptions.
By the end of this page, you will understand how duplicates affect algorithm behavior, how to handle them correctly, and how to write robust string-processing code that handles special characters, Unicode, and encoding edge cases without breaking.
Many algorithms implicitly assume unique values. When duplicates appear, these assumptions break in subtle ways:
Why Duplicates Are Problematic:
| Algorithm/Operation | Unique Values Assumption | Duplicate Behavior | Correct Handling |
|---|---|---|---|
| Binary Search | Finds 'the' occurrence | Which occurrence found is undefined | Use first/last occurrence variant |
| QuickSort | Partitions around distinct values | All-same degenerates to O(n²) | Use 3-way partition (Dutch flag) |
| Two Sum | At most one pair sums to target | Multiple pairs possible | Define: return first, any, or all |
| HashMap/Set insert | Key updates or rejects | Value overwritten silently | Check for existence before insert if needed |
| Find Maximum | Returns maximum value | Works correctly | But: which index if multiple maxes? |
| Remove Element | Removes 'the' occurrence | First? Last? All? | Define semantics explicitly |
| Counting Sort | Works regardless | Works correctly | No special handling needed |
| Find Unique Element | Exactly one unique | Might have zero or multiple uniques | Handle count != 1 case |
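The binary-search row of the table can be checked directly against Python's standard library: `bisect_left` and `bisect_right` give well-defined first/last-occurrence behavior without hand-rolling the search. A minimal sketch (the array is illustrative):

```python
import bisect

arr = [1, 2, 2, 2, 3, 5]  # sorted, with duplicate 2s

first = bisect.bisect_left(arr, 2)       # index of the first 2
last = bisect.bisect_right(arr, 2) - 1   # index of the last 2
count = bisect.bisect_right(arr, 2) - bisect.bisect_left(arr, 2)

# Absent target: bisect_left returns an insertion point, so confirm
# the element is actually there before trusting the index
i = bisect.bisect_left(arr, 4)
found = i < len(arr) and arr[i] == 4
```

Note the absent-target case: `bisect_left` never returns -1, so membership must be verified separately.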
```python
from typing import TypeVar
from collections import Counter, defaultdict

T = TypeVar('T')

# ============================================
# SEARCH WITH DUPLICATES
# ============================================

def find_first_occurrence(arr: list[int], target: int) -> int:
    """Find first occurrence of target in sorted array with duplicates."""
    left, right = 0, len(arr) - 1
    result = -1
    while left <= right:
        mid = left + (right - left) // 2
        if arr[mid] == target:
            result = mid  # Record this, but keep searching left
            right = mid - 1
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return result

def find_last_occurrence(arr: list[int], target: int) -> int:
    """Find last occurrence of target in sorted array with duplicates."""
    left, right = 0, len(arr) - 1
    result = -1
    while left <= right:
        mid = left + (right - left) // 2
        if arr[mid] == target:
            result = mid  # Record this, but keep searching right
            left = mid + 1
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return result

def count_occurrences(arr: list[int], target: int) -> int:
    """
    Count occurrences of target in sorted array.
    Uses first and last occurrence to compute the count.
    """
    first = find_first_occurrence(arr, target)
    if first == -1:
        return 0
    last = find_last_occurrence(arr, target)
    return last - first + 1

# ============================================
# REMOVAL WITH DUPLICATES
# ============================================

def remove_first_occurrence(arr: list[T], value: T) -> list[T]:
    """Remove first occurrence of value."""
    for i, item in enumerate(arr):
        if item == value:
            return arr[:i] + arr[i+1:]
    return arr[:]  # Not found, return copy

def remove_last_occurrence(arr: list[T], value: T) -> list[T]:
    """Remove last occurrence of value."""
    for i in range(len(arr) - 1, -1, -1):
        if arr[i] == value:
            return arr[:i] + arr[i+1:]
    return arr[:]

def remove_all_occurrences(arr: list[T], value: T) -> list[T]:
    """Remove all occurrences of value."""
    return [item for item in arr if item != value]

def remove_duplicates_sorted(arr: list[int]) -> list[int]:
    """
    Remove duplicates from sorted array, keeping one of each.
    Compacts in place with a write index, then returns the
    deduplicated prefix. Standard LeetCode problem approach.
    """
    if not arr:
        return []
    write_index = 1
    for read_index in range(1, len(arr)):
        if arr[read_index] != arr[read_index - 1]:
            arr[write_index] = arr[read_index]
            write_index += 1
    return arr[:write_index]

# ============================================
# DUPLICATE DETECTION AND COUNTING
# ============================================

def has_duplicates(arr: list[T]) -> bool:
    """Check if array contains any duplicates."""
    return len(arr) != len(set(arr))

def find_duplicates(arr: list[int]) -> list[int]:
    """
    Find all duplicate values (values appearing more than once).
    Each duplicated value appears once in the result.
    """
    counts = Counter(arr)
    return [val for val, count in counts.items() if count > 1]

def find_all_duplicate_indices(arr: list[T], value: T) -> list[int]:
    """Find all indices where value appears."""
    return [i for i, item in enumerate(arr) if item == value]

# ============================================
# HANDLING DUPLICATES IN SPECIFIC PROBLEMS
# ============================================

def two_sum_with_duplicates(nums: list[int], target: int) -> list[tuple[int, int]]:
    """
    Find ALL pairs of indices that sum to target.
    Handles duplicates correctly.
    """
    # Map value to list of indices
    value_indices = defaultdict(list)
    for i, num in enumerate(nums):
        value_indices[num].append(i)

    result = []
    seen_pairs = set()  # Avoid duplicate pairs
    for i, num in enumerate(nums):
        complement = target - num
        if complement in value_indices:
            for j in value_indices[complement]:
                if i < j:  # Ensure unique pairs with i < j
                    pair = (i, j)
                    if pair not in seen_pairs:
                        seen_pairs.add(pair)
                        result.append(pair)
    return result

def three_way_partition(arr: list[int], pivot: int) -> list[int]:
    """
    Dutch National Flag algorithm: partition into <pivot, =pivot, >pivot.
    Crucial for efficient QuickSort with many duplicates.
    """
    low = 0               # Boundary of < section
    mid = 0               # Current element
    high = len(arr) - 1   # Boundary of > section
    while mid <= high:
        if arr[mid] < pivot:
            arr[low], arr[mid] = arr[mid], arr[low]
            low += 1
            mid += 1
        elif arr[mid] > pivot:
            arr[mid], arr[high] = arr[high], arr[mid]
            high -= 1  # Don't increment mid - swapped element needs checking
        else:  # arr[mid] == pivot
            mid += 1
    return arr
```

When testing duplicate handling, always test the extreme case: all elements identical. This is where QuickSort degrades, binary search behavior is most ambiguous, and many assumptions break. If your algorithm handles [5, 5, 5, 5, 5] correctly, it likely handles all duplicate scenarios.
When designing algorithms that must handle duplicates, consider these design principles:
Principle 1: Define Semantics Explicitly
Before writing code, explicitly define what happens with duplicates: does an operation target the first occurrence, the last, any one, or all of them? Does inserting a duplicate key overwrite, reject, or count? Which of several equal maxima does "find maximum" report?
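As a concrete illustration of why semantics must be pinned down, Python's own built-ins already make different choices about which occurrence an operation targets (the data below is illustrative):

```python
nums = [1, 2, 1, 3, 1]

# list.remove targets only the FIRST occurrence
first_removed = nums.copy()
first_removed.remove(1)                     # leaves later 1s in place

# A comprehension removes ALL occurrences
all_removed = [x for x in nums if x != 1]

# dict insertion silently overwrites on a repeated key,
# so only the LAST index of each value survives
d = {}
for i, x in enumerate(nums):
    d[x] = i
```

Three operations, three different duplicate semantics — none of them wrong, but each must be a deliberate choice.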
Principle 2: Use Appropriate Data Structures
Different structures handle duplicates differently: a list keeps every occurrence in order, a set silently collapses them to one, a dict overwrites the previous value for a repeated key, and a Counter acts as a multiset that remembers multiplicity.
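A short sketch contrasting how the common Python containers treat the same duplicated input (the data is illustrative):

```python
from collections import Counter

data = [3, 1, 3, 2, 3]

as_list = list(data)              # keeps every occurrence, in order
as_set = set(data)                # collapses duplicates: {1, 2, 3}
as_counts = Counter(data)         # multiset: remembers multiplicity
dedup_ordered = list(dict.fromkeys(data))  # dedupe, keep first-seen order
```

The `dict.fromkeys` idiom is worth remembering: it deduplicates while preserving insertion order, which `set()` does not guarantee.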
Principle 3: Consider Stability
For sorting and ordering operations, stability (maintaining relative order of equal elements) may be important for correctness.
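Python's `sorted()` and `list.sort()` are guaranteed stable, which makes stability easy to demonstrate (the names and scores below are illustrative):

```python
records = [("alice", 2), ("bob", 1), ("carol", 2), ("dave", 1)]

# sorted() is guaranteed stable: records with equal keys keep
# their original relative order
by_score = sorted(records, key=lambda r: r[1])
# bob and dave (both score 1) stay in input order, as do alice and carol
```

If you sort by score and then rely on names appearing in their original order within each score group, you are relying on stability — an unstable sort would be free to swap them.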
```python
from collections import defaultdict, Counter

# ============================================
# PATTERN: Group by Key with Duplicates
# ============================================

def group_anagrams(strs: list[str]) -> list[list[str]]:
    """
    Group strings that are anagrams.
    Demonstrates grouping with potential duplicates.

    Edge cases:
    - Duplicate strings: grouped together
    - Empty string: valid, forms its own group
    """
    groups = defaultdict(list)
    for s in strs:
        # Key = sorted characters (anagram signature)
        key = tuple(sorted(s))
        groups[key].append(s)  # Duplicates allowed in list
    return list(groups.values())

# ============================================
# PATTERN: First/Last with Duplicates
# ============================================

def first_unique_character(s: str) -> int:
    """
    Find index of first character that appears only once.

    Edge cases:
    - All duplicates: return -1
    - All unique: return 0
    - Duplicate characters interspersed with unique ones
    """
    char_count = Counter(s)
    for i, char in enumerate(s):
        if char_count[char] == 1:
            return i
    return -1  # All characters are duplicates

def last_unique_character(s: str) -> int:
    """Find index of last character that appears only once."""
    char_count = Counter(s)
    for i in range(len(s) - 1, -1, -1):
        if char_count[s[i]] == 1:
            return i
    return -1

# ============================================
# PATTERN: Frequency-Based Operations
# ============================================

def top_k_frequent(nums: list[int], k: int) -> list[int]:
    """
    Return k most frequent elements.

    Edge case: ties in frequency - which k to return?
    This implementation breaks ties by smaller value first,
    so results are deterministic.
    """
    if k <= 0:
        return []
    counts = Counter(nums)
    # Sort by frequency (descending), then by value for consistency
    sorted_items = sorted(counts.items(), key=lambda x: (-x[1], x[0]))
    return [item[0] for item in sorted_items[:k]]

def find_majority_element(nums: list[int]) -> int | None:
    """
    Find element that appears more than n/2 times.
    Returns None if no majority exists.
    Uses the Boyer-Moore Voting Algorithm.
    """
    if not nums:
        return None

    # Find candidate
    candidate = nums[0]
    count = 1
    for i in range(1, len(nums)):
        if count == 0:
            candidate = nums[i]
            count = 1
        elif nums[i] == candidate:
            count += 1
        else:
            count -= 1

    # Verify candidate is actually a majority
    actual_count = sum(1 for num in nums if num == candidate)
    if actual_count > len(nums) // 2:
        return candidate
    return None

# ============================================
# PATTERN: Set Operations with Duplicates (Multiset)
# ============================================

def intersection_with_duplicates(nums1: list[int], nums2: list[int]) -> list[int]:
    """
    Intersection of two arrays, including duplicate handling.
    If element appears k1 times in nums1 and k2 times in nums2,
    it appears min(k1, k2) times in the result.
    """
    counts1 = Counter(nums1)
    counts2 = Counter(nums2)
    result = []
    for num in counts1:
        if num in counts2:
            # Include min of the two counts
            result.extend([num] * min(counts1[num], counts2[num]))
    return result

def union_with_duplicates(nums1: list[int], nums2: list[int]) -> list[int]:
    """
    Union of two arrays with duplicates.
    Element appears max(count1, count2) times.
    """
    counts1 = Counter(nums1)
    counts2 = Counter(nums2)
    all_keys = set(counts1.keys()) | set(counts2.keys())
    result = []
    for num in all_keys:
        count = max(counts1.get(num, 0), counts2.get(num, 0))
        result.extend([num] * count)
    return result
```

When processing strings, many algorithms assume "normal" characters—letters and digits. But real-world strings contain far more:
Categories of Special Characters:
| Character Type | Examples | Common Bug | Correct Handling |
|---|---|---|---|
| Empty string | "" | Index out of bounds on s[0] | Check length before access |
| Whitespace only | " \t\n" | Treated as valid content | Trim and check if empty |
| Leading/trailing spaces | " hello " | Comparison failures | Trim before comparison |
| Null character | "hel\0lo" | String truncated at null | Use length, not null termination |
| Newlines | "line1\nline2" | Split logic fails | Handle \n, \r, and \r\n |
| Tab characters | "a\tb" | Visual vs actual length differs | Consider tab expansion |
| Unicode letters | "café", "naïve" | Sorting, comparison fails | Use locale-aware functions |
| Emojis | "Hello 👋" | Length calculation wrong | Use grapheme clusters |
| Mixed case | "HeLLo" | Comparison failures | Normalize case before compare |
| Escape sequences | "path\file" | Escaping bugs | Use raw strings or proper escaping |
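The Unicode row above is worth seeing concretely: the same visible word can be two different code-point sequences, and comparisons fail until both sides are normalized. A minimal sketch with the stdlib `unicodedata` module:

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as one precomposed code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

naive_equal = composed == decomposed        # raw code points differ
lengths = (len(composed), len(decomposed))  # same visible word, different lengths

# Normalizing both sides (NFC here) makes the comparison behave
normalized_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)
```

Both strings render as "café", yet they are unequal and even have different lengths until normalized — exactly the kind of bug that only surfaces with real-world input.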
```python
import unicodedata
import re

# ============================================
# WHITESPACE HANDLING
# ============================================

def normalize_whitespace(s: str) -> str:
    """
    Normalize all whitespace: trim ends, collapse internal runs.
    Handles spaces, tabs, newlines, multiple spaces.
    """
    if not s:
        return ""
    # Replace all whitespace runs with single space, then trim
    return " ".join(s.split())

def is_blank(s: str) -> bool:
    """Check if string is empty or contains only whitespace."""
    return not s or s.isspace()

def split_lines_universal(text: str) -> list[str]:
    """
    Split text into lines, handling all newline conventions:
    \\n (Unix), \\r\\n (Windows), \\r (old Mac).
    """
    # splitlines() handles all conventions
    return text.splitlines()

# ============================================
# CASE NORMALIZATION
# ============================================

def case_insensitive_compare(s1: str, s2: str) -> bool:
    """
    Case-insensitive comparison.
    Note: casefold() is better than lower() for international text.
    Example: German 'ß' casefolds to 'ss'.
    """
    return s1.casefold() == s2.casefold()

def case_insensitive_find(text: str, pattern: str) -> int:
    """Case-insensitive search."""
    return text.casefold().find(pattern.casefold())

# ============================================
# UNICODE HANDLING
# ============================================

def normalize_unicode(s: str, form: str = 'NFC') -> str:
    """
    Normalize a Unicode string.

    Forms:
    - NFC: Canonical Decomposition, followed by Canonical Composition
    - NFD: Canonical Decomposition
    - NFKC: Compatibility Decomposition, followed by Canonical Composition
    - NFKD: Compatibility Decomposition

    Example: 'café' can be 'cafe\\u0301' (e + combining accent)
    or 'caf\\xe9' (é as a single character). NFC makes them the same.
    """
    return unicodedata.normalize(form, s)

def remove_accents(s: str) -> str:
    """
    Remove accents/diacritics from characters.
    'café' → 'cafe'
    """
    # Decompose to base + combining characters
    decomposed = unicodedata.normalize('NFD', s)
    # Keep only non-combining characters
    return ''.join(
        char for char in decomposed
        if unicodedata.category(char) != 'Mn'  # Mn = Mark, Nonspacing
    )

def string_length_unicode_aware(s: str) -> int:
    """
    Get visual length (grapheme clusters), not code points.
    'café' = 4 graphemes (may be 5 code points with a combining accent).
    'Hello 👋' = 7 graphemes.
    """
    # For precise grapheme counting, use the third-party 'regex' module
    # (pip install regex), which is more Unicode-aware than stdlib 're'
    import regex
    return len(regex.findall(r'\X', s))

# ============================================
# ALPHANUMERIC EXTRACTION
# ============================================

def extract_alphanumeric(s: str) -> str:
    """
    Extract only alphanumeric characters.
    Useful for palindrome checking, normalization.
    """
    return ''.join(char for char in s if char.isalnum())

def is_palindrome_ignoring_non_alphanumeric(s: str) -> bool:
    """
    Check if string is a palindrome, ignoring non-alphanumerics and case.
    "A man, a plan, a canal: Panama" → True
    """
    cleaned = extract_alphanumeric(s).lower()
    return cleaned == cleaned[::-1]

# ============================================
# CONTROL CHARACTER HANDLING
# ============================================

def remove_control_characters(s: str) -> str:
    """
    Remove control characters (except common whitespace).
    Control characters can break display and processing.
    """
    return ''.join(
        char for char in s
        if unicodedata.category(char) not in ('Cc', 'Cf')
        or char in '\t\n\r'  # Keep common whitespace
    )

def escape_for_display(s: str) -> str:
    """Escape control characters for safe display."""
    result = []
    for char in s:
        if char == '\n':
            result.append('\\n')
        elif char == '\r':
            result.append('\\r')
        elif char == '\t':
            result.append('\\t')
        elif ord(char) < 32 or ord(char) == 127:
            result.append(f'\\x{ord(char):02x}')
        else:
            result.append(char)
    return ''.join(result)

# ============================================
# STRING VALIDATION
# ============================================

def is_valid_identifier(s: str) -> bool:
    """
    Check if string is a valid programming identifier:
    starts with letter/underscore, contains only alphanumerics/underscores.
    """
    if not s:
        return False
    if not (s[0].isalpha() or s[0] == '_'):
        return False
    return all(c.isalnum() or c == '_' for c in s)

def sanitize_filename(filename: str) -> str:
    """
    Sanitize a string for use as a filename.
    Removes/replaces characters invalid in common file systems.
    """
    # Characters not allowed in Windows filenames
    invalid_chars = r'[<>:"/\\|?*]'
    # Replace invalid characters with underscore
    sanitized = re.sub(invalid_chars, '_', filename)
    # Remove control characters
    sanitized = remove_control_characters(sanitized)
    # Remove leading/trailing whitespace and dots
    sanitized = sanitized.strip(' .')
    # Ensure non-empty
    return sanitized or 'unnamed'
```

In Unicode, what users perceive as one 'character' might be multiple code points (like 'e' + a combining accent) or multiple code units (like emoji in UTF-16). For the family emoji '👨‍👩‍👧', Python's len() returns 5 (code points), and UTF-16-based languages report 8 code units—not 1. For user-facing string operations, consider counting grapheme clusters.
String algorithms have specific edge cases beyond general character handling. These arise from the nature of pattern matching, substring operations, and string transformations.
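Before writing custom substring code, it helps to know the stdlib's own conventions, since they silently answer some of these edge-case questions: `str.count` is non-overlapping, and an empty pattern "matches" at every gap. A quick sketch:

```python
# str.count is non-overlapping: after a match it resumes past the match
non_overlapping = "aaaa".count("aa")   # misses the match at index 1

# An empty pattern matches at every gap, including both ends
empty_find = "abc".find("")
empty_count = "abc".count("")          # len(s) + 1 gaps
```

If your problem needs overlapping matches, `str.count` is the wrong tool — which is exactly why the functions below define their overlap semantics explicitly.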
```python
from collections import Counter

# ============================================
# SUBSTRING EDGE CASES
# ============================================

def find_all_substrings(text: str, pattern: str) -> list[int]:
    """
    Find all occurrences of pattern in text.

    Edge cases:
    - Empty pattern: every position a match, or invalid? Here: no matches.
    - Pattern longer than text: no matches
    - Overlapping matches: "aaa" in "aaaa" has indices [0, 1]
    """
    if not pattern:
        return []  # Or raise ValueError("Empty pattern")
    if len(pattern) > len(text):
        return []
    indices = []
    start = 0
    while True:
        pos = text.find(pattern, start)
        if pos == -1:
            break
        indices.append(pos)
        start = pos + 1  # Allow overlapping matches
    return indices

def longest_substring_without_repeating(s: str) -> str:
    """
    Find longest substring with all unique characters.

    Edge cases:
    - Empty string: return ""
    - Single character: return that character
    - All same characters: return any single character
    - All unique: return entire string
    """
    if not s:
        return ""
    char_index = {}
    start = 0
    max_len = 0
    max_start = 0
    for end, char in enumerate(s):
        if char in char_index and char_index[char] >= start:
            start = char_index[char] + 1
        char_index[char] = end
        if end - start + 1 > max_len:
            max_len = end - start + 1
            max_start = start
    return s[max_start:max_start + max_len]

def count_substrings(text: str, pattern: str, overlap: bool = True) -> int:
    """
    Count occurrences of pattern in text.

    Edge cases:
    - Empty pattern: returns 0
    - overlap=True: "aa" in "aaa" → 2
    - overlap=False: "aa" in "aaa" → 1
    """
    if not pattern:
        return 0
    count = 0
    start = 0
    while True:
        pos = text.find(pattern, start)
        if pos == -1:
            break
        count += 1
        start = pos + (1 if overlap else len(pattern))
    return count

# ============================================
# ANAGRAM AND PERMUTATION EDGE CASES
# ============================================

def are_anagrams(s1: str, s2: str, case_sensitive: bool = True) -> bool:
    """
    Check if two strings are anagrams.

    Edge cases:
    - Empty strings: two empty strings ARE anagrams
    - Different lengths: cannot be anagrams
    - Spaces: should spaces count? Depends on requirements
    - Case: 'Listen' vs 'Silent' - depends on case_sensitive
    """
    # Length check first (optimization)
    if len(s1) != len(s2):
        return False
    if not case_sensitive:
        s1 = s1.lower()
        s2 = s2.lower()
    return Counter(s1) == Counter(s2)

def find_anagrams_in_text(text: str, pattern: str) -> list[int]:
    """
    Find starting indices of all anagram occurrences of pattern in text.
    Uses a sliding window with character frequency counting.

    Edge cases:
    - Empty pattern: return []
    - Pattern longer than text: return []
    - Single character pattern: every occurrence
    """
    if not pattern or len(pattern) > len(text):
        return []
    pattern_count = Counter(pattern)
    window_count = Counter()
    result = []
    k = len(pattern)
    for i, char in enumerate(text):
        # Add new character to window
        window_count[char] += 1
        # Remove character that fell out of window
        if i >= k:
            left_char = text[i - k]
            window_count[left_char] -= 1
            if window_count[left_char] == 0:
                del window_count[left_char]
        # Check for anagram match
        if i >= k - 1 and window_count == pattern_count:
            result.append(i - k + 1)
    return result

# ============================================
# PALINDROME EDGE CASES
# ============================================

def longest_palindromic_substring(s: str) -> str:
    """
    Find longest palindromic substring.

    Edge cases:
    - Empty string: return ""
    - Single character: return that character
    - All same characters: return entire string
    - No palindrome longer than 1: return first character
    """
    if not s:
        return ""
    if len(s) == 1:
        return s
    start = 0
    max_len = 1

    def expand_around_center(left: int, right: int) -> tuple[int, int]:
        while left >= 0 and right < len(s) and s[left] == s[right]:
            left -= 1
            right += 1
        return left + 1, right - left - 1  # Return start and length

    for i in range(len(s)):
        # Odd length palindromes
        left1, len1 = expand_around_center(i, i)
        if len1 > max_len:
            start = left1
            max_len = len1
        # Even length palindromes
        if i + 1 < len(s):
            left2, len2 = expand_around_center(i, i + 1)
            if len2 > max_len:
                start = left2
                max_len = len2
    return s[start:start + max_len]

def count_palindromic_substrings(s: str) -> int:
    """
    Count total number of palindromic substrings.

    Edge cases:
    - Empty string: 0 palindromes
    - Single character: 1 palindrome (itself)
    - All same: n(n+1)/2 palindromes (every substring)
    """
    if not s:
        return 0
    count = 0
    for i in range(len(s)):
        # Odd length
        left, right = i, i
        while left >= 0 and right < len(s) and s[left] == s[right]:
            count += 1
            left -= 1
            right += 1
        # Even length
        left, right = i, i + 1
        while left >= 0 and right < len(s) and s[left] == s[right]:
            count += 1
            left -= 1
            right += 1
    return count
```

To ensure your algorithms handle duplicates and special characters correctly, use a systematic testing approach.
```python
def generate_duplicate_test_cases(base_values: list) -> list[list]:
    """Generate comprehensive duplicate test cases from base values."""
    test_cases = []
    if not base_values:
        return test_cases

    # All unique
    test_cases.append(list(base_values))

    # All same (first element)
    test_cases.append([base_values[0]] * len(base_values))

    if len(base_values) >= 2:
        # Pair duplicated at start
        case = list(base_values)
        case[1] = case[0]
        test_cases.append(case)

        # Pair duplicated at end
        case = list(base_values)
        case[-1] = case[-2]
        test_cases.append(case)

        # First and last are same
        case = list(base_values)
        case[-1] = case[0]
        test_cases.append(case)

    if len(base_values) >= 3:
        # All same except middle
        mid = len(base_values) // 2
        case = [base_values[0]] * len(base_values)
        case[mid] = base_values[1]
        test_cases.append(case)

    return test_cases

def generate_string_edge_cases() -> list[str]:
    """Generate comprehensive string edge case test inputs."""
    return [
        "",                 # Empty
        " ",                # Single space
        "   ",              # Multiple spaces
        "\t",               # Tab
        "\n",               # Newline
        "\r\n",             # Windows newline
        "a",                # Single char
        "A",                # Single uppercase
        "1",                # Single digit
        "hello",            # Simple lowercase
        "HELLO",            # Simple uppercase
        "Hello World",      # Mixed case with space
        " hello ",          # Leading/trailing spaces
        "hello\nworld",     # Internal newline
        "hello\tworld",     # Internal tab
        "café",             # Accented characters
        "naïve",            # More accents
        "Hello 👋",         # Emoji
        "你好",              # Chinese
        "مرحبا",             # Arabic (RTL)
        "hello123",         # Alphanumeric
        "123hello",         # Number prefix
        "12345",            # Only digits
        "!@#$%",            # Only special chars
        "hello, world!",    # Punctuation
        "a" * 1000,         # Long string, same char
        "".join(chr(i) for i in range(32, 127)),  # All printable ASCII
    ]

def test_string_function(func, test_cases: list[str] | None = None) -> list[dict]:
    """Test a string function against comprehensive edge cases."""
    if test_cases is None:
        test_cases = generate_string_edge_cases()
    results = []
    for test in test_cases:
        try:
            result = func(test)
            results.append({
                'input': repr(test),
                'output': repr(result),
                'error': None,
            })
        except Exception as e:
            results.append({
                'input': repr(test),
                'output': None,
                'error': str(e),
            })
    return results
```

You now understand how to handle duplicates and special characters robustly. These data quality edge cases are among the most common in real-world applications. Next, we'll explore how to systematically generate test cases to verify your algorithms handle all edge cases correctly.