While pattern matching (like Rabin-Karp) is the canonical application of string hashing, the technique's power extends far beyond finding substrings. String hashing provides a universal mechanism for constant-time substring comparison, which unlocks efficient solutions to problems that would otherwise require quadratic or worse complexity.
This page explores the diverse applications of string hashing—from finding longest repeated substrings to detecting palindromes, from plagiarism detection to computing string similarity. You'll see how a single technique adapts to solve surprisingly varied problems.
By the end of this page, you will know how to use hashing for O(1) substring comparison, solve longest repeated substring in O(n log n), detect palindromes efficiently, find the longest common substring between two strings, compute string similarity metrics, and apply hashing to plagiarism detection.
The foundational power of string hashing is the ability to compare any two substrings in O(1) time, after O(n) preprocessing. This simple capability has profound algorithmic implications.
The Standard Approach:
1. Precompute prefix hash array and powers array in O(n)
2. For any query "are s[l1..r1] and s[l2..r2] equal?":
- Compute hash of s[l1..r1] in O(1)
- Compute hash of s[l2..r2] in O(1)
- Compare hash values
Use Cases:
- Answering many substring-equality queries on a fixed string
- Computing the longest common prefix (LCP) of two suffixes via binary search over hash comparisons
- Grouping positions whose k-length substrings are identical (finding duplicates)
```python
from collections import defaultdict


class SubstringComparator:
    """
    Enables O(1) comparison of any two substrings of a string.
    After O(n) preprocessing, any comparison query runs in O(1).
    """

    def __init__(self, s: str, base: int = 31, mod: int = 10**9 + 7):
        self.s = s
        self.n = len(s)
        self.base = base
        self.mod = mod

        # Precompute prefix hashes and powers
        self.prefix = [0] * (self.n + 1)
        self.power = [1] * (self.n + 1)
        for i in range(self.n):
            self.prefix[i + 1] = (self.prefix[i] * base + ord(s[i])) % mod
            self.power[i + 1] = (self.power[i] * base) % mod

    def get_hash(self, left: int, right: int) -> int:
        """Get hash of s[left:right+1] in O(1)."""
        length = right - left + 1
        result = (self.prefix[right + 1]
                  - self.prefix[left] * self.power[length]) % self.mod
        return result if result >= 0 else result + self.mod

    def equals(self, l1: int, r1: int, l2: int, r2: int) -> bool:
        """Compare s[l1:r1+1] with s[l2:r2+1] in O(1)."""
        if r1 - l1 != r2 - l2:
            return False
        return self.get_hash(l1, r1) == self.get_hash(l2, r2)

    def lcp(self, i: int, j: int) -> int:
        """
        Find longest common prefix of suffixes starting at i and j.
        Uses binary search + hash comparison for O(log n) time.
        """
        max_len = min(self.n - i, self.n - j)
        left, right = 0, max_len
        result = 0
        while left <= right:
            mid = (left + right) // 2
            # Check if first mid characters are equal
            if self.equals(i, i + mid - 1, j, j + mid - 1):
                result = mid
                left = mid + 1
            else:
                right = mid - 1
        return result


# Example: Find all pairs of positions with matching k-length substrings
def find_matching_substrings(s: str, k: int) -> list[tuple[int, int]]:
    """
    Find all pairs of starting positions where k-length substrings match.

    Time: O(n) for hashing + O(matches) for output
    """
    cmp = SubstringComparator(s)
    n = len(s)

    # Group positions by hash
    hash_to_positions = defaultdict(list)
    for i in range(n - k + 1):
        h = cmp.get_hash(i, i + k - 1)
        hash_to_positions[h].append(i)

    # Find matching pairs
    pairs = []
    for positions in hash_to_positions.values():
        if len(positions) > 1:
            for i in range(len(positions)):
                for j in range(i + 1, len(positions)):
                    pairs.append((positions[i], positions[j]))
    return pairs


# Demo
s = "abracadabra"
cmp = SubstringComparator(s)

print(f"String: '{s}'")
print(f"\nO(1) substring comparisons:")
print(f"  s[0:4]='abra' vs s[7:11]='abra': {cmp.equals(0, 3, 7, 10)}")  # True
print(f"  s[0:4]='abra' vs s[1:5]='brac': {cmp.equals(0, 3, 1, 4)}")    # False

print(f"\nLongest common prefix of suffixes:")
print(f"  LCP of suffix at 0 and 7: {cmp.lcp(0, 7)}")  # 4 ('abra')
print(f"  LCP of suffix at 0 and 3: {cmp.lcp(0, 3)}")  # 1 ('a')

print(f"\nPositions with matching 4-char substrings:")
print(f"  {find_matching_substrings(s, 4)}")  # [(0, 7)]
```

Problem: Find the longest substring that appears at least twice in a given string.
Example: in "banana", the longest repeated substring is "ana" (starting at positions 1 and 3; overlapping occurrences count).
Naive Approach: compare every pair of starting positions and extend each match character by character, which is O(n³) in the worst case (O(n²) pairs, O(n) per comparison).
Hash-Based Approach: binary search on the answer length, using O(1) hash comparisons to check each candidate length in O(n), for O(n log n) overall.
If a substring of length k repeats, then substrings of length k-1 (prefixes of the repeating substring) also repeat. This monotonicity property enables binary search on the answer length, reducing from O(n) length checks to O(log n) checks.
Algorithm:
1. Binary search for the longest length k where a repeated k-substring exists
2. For each candidate k (during binary search):
a. Compute hash of every k-length substring in O(n)
b. Store hashes in a set
c. If any hash appears twice, return success
3. Track the actual substring for the answer
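The steps above follow the standard "binary search on the answer" pattern, which can be sketched generically. In this sketch, `longest_feasible` and `has_repeat` are illustrative names of ours, and the toy predicate uses direct slice comparison where the hash-based check from step 2 would go:

```python
def longest_feasible(n: int, feasible) -> int:
    """Largest k in [1, n] with feasible(k) True, or 0 if none.

    Relies on monotonicity: if feasible(k), then feasible(k - 1).
    """
    lo, hi, best = 1, n, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            best = mid
            lo = mid + 1   # a length-mid repeat exists; try longer
        else:
            hi = mid - 1   # no repeat this long; try shorter
    return best


# Toy predicate: does any substring of this length repeat in "banana"?
s = "banana"

def has_repeat(k: int) -> bool:
    seen = set()
    for i in range(len(s) - k + 1):
        sub = s[i:i + k]   # direct slice here; hashing would make this O(1)
        if sub in seen:
            return True
        seen.add(sub)
    return False


print(longest_feasible(len(s) - 1, has_repeat))  # 3 ("ana" repeats)
```

With the slice comparison swapped for an O(1) hash lookup, each `feasible` call drops from O(nk) to O(n), giving the O(n log n) total claimed below.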
Time Complexity Analysis: the binary search performs O(log n) iterations, and each feasibility check hashes all candidate substrings in O(n). Total: O(n log n) on average, assuming O(1) hash-set operations.
Space Complexity: O(n) for the hash set
```python
from collections import Counter


class DoubleHash:
    """Double hashing for reliable substring comparison."""

    def __init__(self, s: str):
        self.s = s
        self.n = len(s)

        # First hash
        self.h1 = [0] * (self.n + 1)
        self.p1 = [1] * (self.n + 1)
        B1, M1 = 31, 10**9 + 7

        # Second hash
        self.h2 = [0] * (self.n + 1)
        self.p2 = [1] * (self.n + 1)
        B2, M2 = 37, 10**9 + 9

        for i in range(self.n):
            c = ord(s[i])
            self.h1[i + 1] = (self.h1[i] * B1 + c) % M1
            self.p1[i + 1] = (self.p1[i] * B1) % M1
            self.h2[i + 1] = (self.h2[i] * B2 + c) % M2
            self.p2[i + 1] = (self.p2[i] * B2) % M2

        self.M1, self.M2 = M1, M2

    def get_hash(self, left: int, right: int) -> tuple[int, int]:
        length = right - left + 1
        hash1 = (self.h1[right + 1] - self.h1[left] * self.p1[length]) % self.M1
        if hash1 < 0:
            hash1 += self.M1
        hash2 = (self.h2[right + 1] - self.h2[left] * self.p2[length]) % self.M2
        if hash2 < 0:
            hash2 += self.M2
        return (hash1, hash2)


def longest_repeated_substring(s: str) -> str:
    """
    Find the longest substring that appears at least twice.

    Time: O(n log n) average
    Space: O(n)
    """
    if len(s) <= 1:
        return ""

    n = len(s)
    hasher = DoubleHash(s)

    def has_repeated(length: int) -> int:
        """
        Check if a repeated substring of given length exists.
        Returns starting index if found, -1 otherwise.
        """
        if length == 0:
            return 0
        seen = {}  # hash -> first occurrence index
        for i in range(n - length + 1):
            h = hasher.get_hash(i, i + length - 1)
            if h in seen:
                # Verify to be safe (optional with double hash)
                if s[i:i + length] == s[seen[h]:seen[h] + length]:
                    return i
            else:
                seen[h] = i
        return -1

    # Binary search for longest length
    left, right = 1, n - 1
    result_start = -1
    result_length = 0

    while left <= right:
        mid = (left + right) // 2
        start = has_repeated(mid)
        if start != -1:
            # Found repeated substring of length mid
            result_start = start
            result_length = mid
            left = mid + 1   # Try longer
        else:
            right = mid - 1  # Try shorter

    if result_start == -1:
        return ""
    return s[result_start:result_start + result_length]


def longest_repeated_substring_with_count(s: str, min_occurrences: int = 2) -> str:
    """
    Find longest substring appearing at least min_occurrences times.

    Variant: Sometimes we need more than 2 occurrences.
    """
    if len(s) <= 1:
        return ""

    n = len(s)
    hasher = DoubleHash(s)

    def count_occurrences(length: int) -> tuple[int, int]:
        """Count max occurrences of any substring of given length."""
        if length == 0:
            return (n + 1, 0)  # Empty string at every position

        hash_counts = Counter()
        for i in range(n - length + 1):
            h = hasher.get_hash(i, i + length - 1)
            hash_counts[h] += 1

        if not hash_counts:
            return (0, -1)

        most_common_hash, count = hash_counts.most_common(1)[0]
        # Find starting index of most common
        for i in range(n - length + 1):
            if hasher.get_hash(i, i + length - 1) == most_common_hash:
                return (count, i)
        return (0, -1)

    left, right = 1, n - 1
    result = ""
    while left <= right:
        mid = (left + right) // 2
        count, start = count_occurrences(mid)
        if count >= min_occurrences:
            result = s[start:start + mid]
            left = mid + 1
        else:
            right = mid - 1
    return result


# Examples
print("Longest Repeated Substring Examples:")
print("=" * 50)

test_cases = [
    "banana",
    "abracadabra",
    "mississippi",
    "aaaaaa",
    "abcdef",  # No repeated substring
]

for s in test_cases:
    result = longest_repeated_substring(s)
    print(f"  '{s}' -> '{result}' (length {len(result)})")

print("\nWith minimum occurrence count:")
print(f"  'banana' appearing 3+ times: '{longest_repeated_substring_with_count('banana', 3)}'")
print(f"  'aaaaaa' appearing 3+ times: '{longest_repeated_substring_with_count('aaaaaa', 3)}'")
```

String hashing provides an elegant O(1) palindrome check for any substring, after O(n) preprocessing.
The Key Insight:
A string is a palindrome if and only if it equals its reverse. With hashing:
H(s[l..r]) == H(reverse(s[l..r]))
Implementation Strategy:
Precompute two hash arrays:
- A forward hash array over s read left to right
- A reverse hash array over s read right to left (equivalently, prefix hashes of the reversed string)
Then for substring s[l..r]: compute its forward hash and its reverse-reading hash, each in O(1). The substring is a palindrome if and only if the two values match.
Applications:
- Counting all palindromic substrings
- Finding the longest palindromic substring
- Answering online "is s[l..r] a palindrome?" queries in O(1)
```python
class PalindromeHash:
    """
    Hash-based palindrome detection with O(1) queries.

    Maintains both forward and reverse hash arrays to compare
    any substring with its reverse in constant time.
    """

    def __init__(self, s: str, base: int = 31, mod: int = 10**9 + 7):
        self.s = s
        self.n = len(s)
        self.base = base
        self.mod = mod

        # Forward hash: hash of s[0..i-1]
        self.h_fwd = [0] * (self.n + 1)
        # Reverse hash: hash of s reversed, i.e., s[n-1..0]
        self.h_rev = [0] * (self.n + 1)
        # Powers
        self.power = [1] * (self.n + 1)

        # Build forward hash
        for i in range(self.n):
            self.h_fwd[i + 1] = (self.h_fwd[i] * base + ord(s[i])) % mod
            self.power[i + 1] = (self.power[i] * base) % mod

        # Build reverse hash (hash s from right to left)
        for i in range(self.n - 1, -1, -1):
            self.h_rev[self.n - i] = (
                self.h_rev[self.n - i - 1] * base + ord(s[i])
            ) % mod

    def _forward_hash(self, left: int, right: int) -> int:
        """Hash of s[left:right+1] reading left to right."""
        length = right - left + 1
        result = (self.h_fwd[right + 1]
                  - self.h_fwd[left] * self.power[length]) % self.mod
        return result if result >= 0 else result + self.mod

    def _reverse_hash(self, left: int, right: int) -> int:
        """Hash of s[left:right+1] reading right to left."""
        # s[left..right] reversed is s[right], s[right-1], ..., s[left].
        # In h_rev (prefix hashes of reversed s), that span starts at
        # index n-1-right and ends at index n-1-left.
        length = right - left + 1
        rev_left = self.n - right - 1
        rev_right = self.n - left - 1
        result = (self.h_rev[rev_right + 1]
                  - self.h_rev[rev_left] * self.power[length]) % self.mod
        return result if result >= 0 else result + self.mod

    def is_palindrome(self, left: int, right: int) -> bool:
        """Check if s[left:right+1] is a palindrome in O(1)."""
        return self._forward_hash(left, right) == self._reverse_hash(left, right)

    def count_palindromic_substrings(self) -> int:
        """Count all palindromic substrings (O(n^2) pairs, O(1) each)."""
        count = 0
        for l in range(self.n):
            for r in range(l, self.n):
                if self.is_palindrome(l, r):
                    count += 1
        return count

    def longest_palindromic_substring(self) -> str:
        """
        Find longest palindromic substring.

        Time: O(n log n) - for each center, binary search on the
        palindrome radius with O(1) palindrome checks.
        """
        if self.n == 0:
            return ""

        best_start, best_len = 0, 1

        # Try odd-length palindromes (centered at each position)
        for center in range(self.n):
            # Binary search for longest radius
            left, right = 0, min(center, self.n - center - 1)
            while left <= right:
                mid = (left + right) // 2
                l, r = center - mid, center + mid
                if self.is_palindrome(l, r):
                    length = r - l + 1
                    if length > best_len:
                        best_start, best_len = l, length
                    left = mid + 1
                else:
                    right = mid - 1

        # Try even-length palindromes (centered between i and i+1)
        for gap_after in range(self.n - 1):
            # Gap is between s[gap_after] and s[gap_after+1];
            # radius mid >= 1 covers s[gap_after-mid+1 .. gap_after+mid]
            left, right = 1, min(gap_after + 1, self.n - gap_after - 1)
            while left <= right:
                mid = (left + right) // 2
                l, r = gap_after - mid + 1, gap_after + mid
                if self.is_palindrome(l, r):
                    length = r - l + 1
                    if length > best_len:
                        best_start, best_len = l, length
                    left = mid + 1
                else:
                    right = mid - 1

        return self.s[best_start:best_start + best_len]


# Demonstration
print("Palindrome Detection with Hashing")
print("=" * 50)

test_strings = [
    "racecar",
    "banana",
    "abacaba",
    "babad",
]

for s in test_strings:
    ph = PalindromeHash(s)
    print(f"\nString: '{s}'")
    print(f"  Count of palindromic substrings: {ph.count_palindromic_substrings()}")
    print(f"  Longest palindrome: '{ph.longest_palindromic_substring()}'")

    # Show some palindrome checks
    for l in range(min(3, len(s))):
        for r in range(l, min(l + 5, len(s))):
            sub = s[l:r + 1]
            if ph.is_palindrome(l, r):
                print(f"    s[{l}:{r + 1}]='{sub}' is palindrome")
```

Problem: Given two strings s1 and s2, find the longest substring common to both.
Example: for "ABABC" and "BABCA", the longest common substring is "BABC" (length 4).
Approaches: dynamic programming solves this in O(n × m) time; suffix automata and generalized suffix trees reach O(n + m) but are complex to implement. Hashing with binary search achieves O((n + m) log min(n, m)) average time with far simpler code.
Hash-Based Algorithm:
1. Binary search on the answer length k
2. For each candidate k:
a. Hash all k-length substrings of s1 into a set
b. Hash all k-length substrings of s2, check if any is in the set
c. If match found, k is feasible; try longer
3. Return the longest feasible k
If strings share a common substring of length k, they also share common substrings of all lengths < k (just take prefixes). This monotonicity makes binary search applicable.
```python
def polynomial_hash(s: str, base: int = 31, mod: int = 10**9 + 7) -> list[int]:
    """Compute all prefix hashes of string s."""
    n = len(s)
    h = [0] * (n + 1)
    for i in range(n):
        h[i + 1] = (h[i] * base + ord(s[i])) % mod
    return h


def power_table(n: int, base: int = 31, mod: int = 10**9 + 7) -> list[int]:
    """Compute powers of base."""
    p = [1] * (n + 1)
    for i in range(n):
        p[i + 1] = (p[i] * base) % mod
    return p


def get_hash(h: list[int], p: list[int], left: int, right: int,
             mod: int = 10**9 + 7) -> int:
    """Get hash of substring s[left:right+1]."""
    length = right - left + 1
    result = (h[right + 1] - h[left] * p[length]) % mod
    return result if result >= 0 else result + mod


def longest_common_substring(s1: str, s2: str) -> str:
    """
    Find longest substring common to both s1 and s2.

    Time: O((n + m) log min(n, m)) average
    Space: O(n + m)
    """
    n, m = len(s1), len(s2)
    if n == 0 or m == 0:
        return ""

    # Precompute hashes for both strings
    BASE, MOD = 31, 10**9 + 7
    h1 = polynomial_hash(s1, BASE, MOD)
    h2 = polynomial_hash(s2, BASE, MOD)
    p = power_table(max(n, m), BASE, MOD)

    def has_common_substring(length: int) -> tuple[bool, int, int]:
        """
        Check if s1 and s2 share a common substring of given length.
        Returns (found, start_in_s1, start_in_s2).
        """
        if length == 0:
            return (True, 0, 0)

        # Collect all length-k substrings of s1
        s1_hashes = {}  # hash -> starting index
        for i in range(n - length + 1):
            h = get_hash(h1, p, i, i + length - 1, MOD)
            if h not in s1_hashes:
                s1_hashes[h] = i

        # Check s2's substrings
        for j in range(m - length + 1):
            h = get_hash(h2, p, j, j + length - 1, MOD)
            if h in s1_hashes:
                # Verify actual string match (handle potential collision)
                i = s1_hashes[h]
                if s1[i:i + length] == s2[j:j + length]:
                    return (True, i, j)
        return (False, -1, -1)

    # Binary search for longest length
    left, right = 0, min(n, m)
    result_start = 0
    result_length = 0

    while left <= right:
        mid = (left + right) // 2
        found, i, j = has_common_substring(mid)
        if found:
            result_start = i
            result_length = mid
            left = mid + 1
        else:
            right = mid - 1

    return s1[result_start:result_start + result_length]


def all_common_substrings(s1: str, s2: str, min_length: int = 1) -> list[str]:
    """
    Find ALL common substrings of at least min_length.
    Returns unique substrings common to both strings.
    """
    n, m = len(s1), len(s2)
    BASE, MOD = 31, 10**9 + 7
    h1 = polynomial_hash(s1, BASE, MOD)
    h2 = polynomial_hash(s2, BASE, MOD)
    p = power_table(max(n, m), BASE, MOD)

    common = set()
    # Check each length from min_length to min(n, m)
    for length in range(min_length, min(n, m) + 1):
        s1_hashes = {}
        for i in range(n - length + 1):
            h = get_hash(h1, p, i, i + length - 1, MOD)
            if h not in s1_hashes:
                s1_hashes[h] = i
        for j in range(m - length + 1):
            h = get_hash(h2, p, j, j + length - 1, MOD)
            if h in s1_hashes:
                common.add(s2[j:j + length])

    return sorted(common, key=lambda x: (-len(x), x))


# Examples
print("Longest Common Substring")
print("=" * 50)

test_cases = [
    ("ABABC", "BABCA"),
    ("GeeksforGeeks", "GeeksQuiz"),
    ("abcdxyz", "xyzabcd"),
    ("programming", "algorithms"),
]

for s1, s2 in test_cases:
    result = longest_common_substring(s1, s2)
    print(f"  '{s1}' and '{s2}'")
    print(f"    -> LCS: '{result}' (length {len(result)})")

print("\nAll common substrings (min length 3):")
s1, s2 = "ABABC", "BABCA"
print(f"  '{s1}' and '{s2}': {all_common_substrings(s1, s2, 3)}")
```

Beyond exact matching, hashing enables approximate string comparison and similarity detection at scale.
Shingling (n-gram hashing):
Break a document into overlapping n-character chunks (shingles) and hash each:
"The quick brown" (n=4) → {"The ", "he q", "e qu", " qui", "quic", "uick", ...}
Compare documents by the overlap of their shingle sets, measured by Jaccard similarity: J(A, B) = |A ∩ B| / |A ∪ B|.
MinHash for Similarity Estimation:
For large documents, comparing full shingle sets is expensive. MinHash approximates Jaccard similarity: apply many independent hash functions to the shingle set and record the minimum value under each one. For a random hash function, the probability that two sets produce the same minimum equals their Jaccard similarity, so the fraction of matching positions across the two signatures is an estimate of it.
Applications:
- Plagiarism and near-duplicate detection
- Deduplication of crawled web pages at scale
- Clustering similar documents
```python
import random


def shingle(text: str, k: int = 5) -> set[int]:
    """
    Create a set of hashed k-shingles from text.
    A k-shingle is a contiguous sequence of k characters.
    """
    BASE, MOD = 31, 10**9 + 7
    if len(text) < k:
        return set()

    shingles = set()

    # Compute initial hash (and BASE^(k-1) for the rolling update)
    h = 0
    power = 1
    for i in range(k):
        h = (h * BASE + ord(text[i])) % MOD
        if i < k - 1:
            power = (power * BASE) % MOD
    shingles.add(h)

    # Rolling hash for remaining shingles
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * power) % MOD
        if h < 0:
            h += MOD
        h = (h * BASE + ord(text[i])) % MOD
        shingles.add(h)

    return shingles


def jaccard_similarity(set1: set, set2: set) -> float:
    """Compute Jaccard similarity between two sets."""
    if not set1 and not set2:
        return 1.0
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union if union > 0 else 0.0


def document_similarity(doc1: str, doc2: str, k: int = 5) -> float:
    """
    Compute similarity between two documents using shingling.
    Returns value in [0, 1] where 1 means identical.
    """
    shingles1 = shingle(doc1.lower(), k)
    shingles2 = shingle(doc2.lower(), k)
    return jaccard_similarity(shingles1, shingles2)


class MinHash:
    """
    MinHash for efficient similarity estimation.

    Uses multiple hash functions to create signatures that can
    estimate Jaccard similarity in O(num_hashes) time.
    """

    def __init__(self, num_hashes: int = 100):
        self.num_hashes = num_hashes
        # Generate random hash function parameters
        random.seed(42)
        max_val = 2**31 - 1
        self.a = [random.randint(1, max_val) for _ in range(num_hashes)]
        self.b = [random.randint(0, max_val) for _ in range(num_hashes)]
        self.p = 2147483647  # Large prime (2^31 - 1)

    def _hash(self, x: int, i: int) -> int:
        """Apply i-th hash function to value x."""
        return (self.a[i] * x + self.b[i]) % self.p

    def compute_signature(self, shingles: set[int]) -> list[int]:
        """
        Compute MinHash signature for a set of shingles.
        The signature has length num_hashes.
        """
        if not shingles:
            return [self.p] * self.num_hashes
        return [min(self._hash(s, i) for s in shingles)
                for i in range(self.num_hashes)]

    def estimate_similarity(self, sig1: list[int], sig2: list[int]) -> float:
        """
        Estimate Jaccard similarity from two signatures.
        This is O(num_hashes) instead of O(|shingles|).
        """
        matches = sum(1 for a, b in zip(sig1, sig2) if a == b)
        return matches / self.num_hashes


def plagiarism_check(documents: list[str],
                     threshold: float = 0.5) -> list[tuple[int, int, float]]:
    """
    Find pairs of documents with similarity above threshold.
    Uses MinHash for efficient comparison of many documents.
    """
    k = 5  # Shingle size
    mh = MinHash(num_hashes=100)

    # Compute shingles and signatures for all documents
    signatures = [mh.compute_signature(shingle(doc.lower(), k))
                  for doc in documents]

    # Compare all pairs
    similar_pairs = []
    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            sim = mh.estimate_similarity(signatures[i], signatures[j])
            if sim >= threshold:
                similar_pairs.append((i, j, sim))
    return similar_pairs


# Demonstration
print("Document Similarity with Hashing")
print("=" * 50)

doc1 = "The quick brown fox jumps over the lazy dog"
doc2 = "The quick brown fox leaps over the lazy dog"
doc3 = "A fast auburn fox jumps over a sleepy hound"
doc4 = "Lorem ipsum dolor sit amet consectetur"

print("\nDirect Jaccard similarity (using shingles):")
print(f"  Doc1 vs Doc2: {document_similarity(doc1, doc2):.3f}")  # High
print(f"  Doc1 vs Doc3: {document_similarity(doc1, doc3):.3f}")  # Medium
print(f"  Doc1 vs Doc4: {document_similarity(doc1, doc4):.3f}")  # Low

print("\nMinHash signature-based similarity:")
mh = MinHash(num_hashes=100)
sig1 = mh.compute_signature(shingle(doc1.lower(), 5))
sig2 = mh.compute_signature(shingle(doc2.lower(), 5))
sig3 = mh.compute_signature(shingle(doc3.lower(), 5))
print(f"  Doc1 vs Doc2 (estimated): {mh.estimate_similarity(sig1, sig2):.3f}")
print(f"  Doc1 vs Doc3 (estimated): {mh.estimate_similarity(sig1, sig3):.3f}")

print("\nPlagiarism detection test:")
essays = [
    "This is an original essay about algorithms and data structures.",
    "This is an original essay about algorithms and data organization.",  # Similar
    "Essays cover many topics including math and science.",
    "This is an original essay on algorithms and data structures!",  # Very similar
]
for i, j, sim in plagiarism_check(essays, threshold=0.3):
    print(f"  Documents {i} and {j}: {sim:.3f} similarity (potential plagiarism)")
```

String hashing finds application in many other algorithmic problems: detecting whether one string is a rotation of another, 2D pattern matching with row-and-column hashes, and comparing subtrees for tree isomorphism, among others.
The core technique is always the same: use hashing to convert O(m) string comparisons to O(1) hash comparisons. This often reduces O(n² × m) algorithms to O(n²) or O(n log n × m) to O(n log n). Identify where your algorithm spends time comparing strings, and apply hashing there.
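To make that substitution concrete, here is a minimal before/after sketch (helper names are ours): the same all-pairs scan for matching 4-character substrings, first with O(k) slice comparisons, then with O(1) hash comparisons after O(n) preprocessing:

```python
BASE, MOD = 31, 10**9 + 7


def prefix_hashes(s: str):
    """Prefix hashes and base powers for O(1) substring hashes."""
    h, p = [0], [1]
    for ch in s:
        h.append((h[-1] * BASE + ord(ch)) % MOD)
        p.append((p[-1] * BASE) % MOD)
    return h, p


def sub_hash(h, p, l, r):
    """Hash of s[l:r+1] from precomputed prefix hashes."""
    return (h[r + 1] - h[l] * p[r - l + 1]) % MOD


s, k = "abracadabra", 4

# Without hashing: each comparison below costs O(k)
slow = [(i, j) for i in range(len(s) - k + 1)
        for j in range(i + 1, len(s) - k + 1)
        if s[i:i + k] == s[j:j + k]]

# With hashing: each comparison is O(1) after O(n) preprocessing
h, p = prefix_hashes(s)
fast = [(i, j) for i in range(len(s) - k + 1)
        for j in range(i + 1, len(s) - k + 1)
        if sub_hash(h, p, i, i + k - 1) == sub_hash(h, p, j, j + k - 1)]

print(slow == fast, fast)  # True [(0, 7)]
```

The loop structure is untouched; only the comparison inside it changed, which is exactly where the speedup in the table below comes from.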
Performance Considerations:
| Operation | Without Hashing | With Hashing |
|---|---|---|
| Compare 2 substrings | O(m) | O(1) |
| Find all matches of pattern | O(n × m) | O(n + m) |
| Longest repeated substring | O(n³) | O(n log n) |
| Longest common substring | O(n × m) | O((n+m) log n) |
| Document similarity (n docs) | O(n² × d) | O(n² × k) where k << d |
where m = pattern/substring length, d = document length, n = text/array length, k = signature length.
Module Complete:
You've now mastered string hashing from first principles through advanced applications:
- Polynomial hashing and rolling hashes (Rabin-Karp pattern matching)
- Collision handling with verification and double hashing
- O(1) substring comparison after O(n) preprocessing
- Binary search on the answer for repeated and common substrings
- Palindrome detection with forward and reverse hash arrays
- Similarity detection with shingling and MinHash
This knowledge equips you to tackle a wide range of string problems efficiently and correctly.
You now have a comprehensive understanding of string hashing techniques—from the mathematical foundations to practical applications. These skills are essential for efficient string processing in algorithms, competitive programming, and production systems.