While pattern matching (like Rabin-Karp) is the canonical application of string hashing, the technique's power extends far beyond finding substrings. String hashing provides a universal mechanism for constant-time substring comparison, which unlocks efficient solutions to problems that would otherwise require quadratic or worse complexity.
This page explores the diverse applications of string hashing—from finding longest repeated substrings to detecting palindromes, from plagiarism detection to computing string similarity. You'll see how a single technique adapts to solve surprisingly varied problems.
By the end of this page, you will know how to use hashing for O(1) substring comparison, solve longest repeated substring in O(n log n), detect palindromes efficiently, find the longest common substring between two strings, compute string similarity metrics, and apply hashing to plagiarism detection.
The foundational power of string hashing is the ability to compare any two substrings in O(1) time, after O(n) preprocessing. This simple capability has profound algorithmic implications.
The Standard Approach:
1. Precompute prefix hash array and powers array in O(n)
2. For any query "are s[l1..r1] and s[l2..r2] equal?":
- Compute hash of s[l1..r1] in O(1)
- Compute hash of s[l2..r2] in O(1)
- Compare hash values
Use Cases:
- Answering many substring-equality queries on a fixed string
- Computing the longest common prefix (LCP) of two suffixes via binary search over hash comparisons
- Grouping positions whose k-length substrings are identical (finding duplicates)
```python
from collections import defaultdict


class SubstringComparator:
    """
    Enables O(1) comparison of any two substrings of a string.
    After O(n) preprocessing, any comparison query runs in O(1).
    """

    def __init__(self, s: str, base: int = 31, mod: int = 10**9 + 7):
        self.s = s
        self.n = len(s)
        self.base = base
        self.mod = mod

        # Precompute prefix hashes and powers
        self.prefix = [0] * (self.n + 1)
        self.power = [1] * (self.n + 1)
        for i in range(self.n):
            self.prefix[i + 1] = (self.prefix[i] * base + ord(s[i])) % mod
            self.power[i + 1] = (self.power[i] * base) % mod

    def get_hash(self, left: int, right: int) -> int:
        """Get hash of s[left:right+1] in O(1)."""
        length = right - left + 1
        result = (self.prefix[right + 1]
                  - self.prefix[left] * self.power[length]) % self.mod
        return result if result >= 0 else result + self.mod

    def equals(self, l1: int, r1: int, l2: int, r2: int) -> bool:
        """Compare s[l1:r1+1] with s[l2:r2+1] in O(1)."""
        if r1 - l1 != r2 - l2:
            return False
        return self.get_hash(l1, r1) == self.get_hash(l2, r2)

    def lcp(self, i: int, j: int) -> int:
        """
        Find longest common prefix of suffixes starting at i and j.
        Uses binary search + hash comparison for O(log n) time.
        """
        max_len = min(self.n - i, self.n - j)
        left, right = 0, max_len
        result = 0
        while left <= right:
            mid = (left + right) // 2
            # Check if first mid characters are equal
            if self.equals(i, i + mid - 1, j, j + mid - 1):
                result = mid
                left = mid + 1
            else:
                right = mid - 1
        return result


# Example: Find all pairs of positions with matching k-length substrings
def find_matching_substrings(s: str, k: int) -> list[tuple[int, int]]:
    """
    Find all pairs of starting positions where k-length substrings match.

    Time: O(n) for hashing + O(matches) for output
    """
    cmp = SubstringComparator(s)
    n = len(s)

    # Group positions by hash
    hash_to_positions = defaultdict(list)
    for i in range(n - k + 1):
        h = cmp.get_hash(i, i + k - 1)
        hash_to_positions[h].append(i)

    # Find matching pairs
    pairs = []
    for positions in hash_to_positions.values():
        if len(positions) > 1:
            for i in range(len(positions)):
                for j in range(i + 1, len(positions)):
                    pairs.append((positions[i], positions[j]))
    return pairs


# Demo
s = "abracadabra"
cmp = SubstringComparator(s)

print(f"String: '{s}'")
print(f"\nO(1) substring comparisons:")
print(f"  s[0:4]='abra' vs s[7:11]='abra': {cmp.equals(0, 3, 7, 10)}")  # True
print(f"  s[0:4]='abra' vs s[1:5]='brac': {cmp.equals(0, 3, 1, 4)}")    # False

print(f"\nLongest common prefix of suffixes:")
print(f"  LCP of suffix at 0 and 7: {cmp.lcp(0, 7)}")  # 4 ('abra')
print(f"  LCP of suffix at 0 and 3: {cmp.lcp(0, 3)}")  # 1 ('a')

print(f"\nPositions with matching 4-char substrings:")
print(f"  {find_matching_substrings(s, 4)}")  # [(0, 7)]
```

Problem: Find the longest substring that appears at least twice in a given string.
Example: in "banana", the longest repeated substring is "ana" (starting at positions 1 and 3; overlapping occurrences count).
Naive Approach: compare every pair of starting positions and extend each match character by character, which is O(n³) in the worst case (O(n²) pairs, O(n) per comparison).
Hash-Based Approach: binary search on the answer length, using O(1) hash comparisons to check each candidate length in O(n), for O(n log n) overall.
If a substring of length k repeats, then substrings of length k-1 (prefixes of the repeating substring) also repeat. This monotonicity property enables binary search on the answer length, reducing from O(n) length checks to O(log n) checks.
Algorithm:
1. Binary search for the longest length k where a repeated k-substring exists
2. For each candidate k (during binary search):
a. Compute hash of every k-length substring in O(n)
b. Store hashes in a set
c. If any hash appears twice, return success
3. Track the actual substring for the answer
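The steps above follow the standard "binary search on the answer" pattern, which can be sketched generically. In this sketch, `longest_feasible` and `has_repeat` are illustrative names of ours, and the toy predicate uses direct slice comparison where the hash-based check from step 2 would go:

```python
def longest_feasible(n: int, feasible) -> int:
    """Largest k in [1, n] with feasible(k) True, or 0 if none.

    Relies on monotonicity: if feasible(k), then feasible(k - 1).
    """
    lo, hi, best = 1, n, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            best = mid
            lo = mid + 1   # a length-mid repeat exists; try longer
        else:
            hi = mid - 1   # no repeat this long; try shorter
    return best


# Toy predicate: does any substring of this length repeat in "banana"?
s = "banana"

def has_repeat(k: int) -> bool:
    seen = set()
    for i in range(len(s) - k + 1):
        sub = s[i:i + k]   # direct slice here; hashing would make this O(1)
        if sub in seen:
            return True
        seen.add(sub)
    return False


print(longest_feasible(len(s) - 1, has_repeat))  # 3 ("ana" repeats)
```

With the slice comparison swapped for an O(1) hash lookup, each `feasible` call drops from O(nk) to O(n), giving the O(n log n) total claimed below.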
Time Complexity Analysis: the binary search performs O(log n) iterations, and each feasibility check hashes all candidate substrings in O(n). Total: O(n log n) on average, assuming O(1) hash-set operations.
Space Complexity: O(n) for the hash set
```python
from collections import Counter


class DoubleHash:
    """Double hashing for reliable substring comparison."""

    def __init__(self, s: str):
        self.s = s
        self.n = len(s)

        # First hash
        self.h1 = [0] * (self.n + 1)
        self.p1 = [1] * (self.n + 1)
        B1, M1 = 31, 10**9 + 7

        # Second hash
        self.h2 = [0] * (self.n + 1)
        self.p2 = [1] * (self.n + 1)
        B2, M2 = 37, 10**9 + 9

        for i in range(self.n):
            c = ord(s[i])
            self.h1[i + 1] = (self.h1[i] * B1 + c) % M1
            self.p1[i + 1] = (self.p1[i] * B1) % M1
            self.h2[i + 1] = (self.h2[i] * B2 + c) % M2
            self.p2[i + 1] = (self.p2[i] * B2) % M2

        self.M1, self.M2 = M1, M2

    def get_hash(self, left: int, right: int) -> tuple[int, int]:
        length = right - left + 1
        hash1 = (self.h1[right + 1] - self.h1[left] * self.p1[length]) % self.M1
        if hash1 < 0:
            hash1 += self.M1
        hash2 = (self.h2[right + 1] - self.h2[left] * self.p2[length]) % self.M2
        if hash2 < 0:
            hash2 += self.M2
        return (hash1, hash2)


def longest_repeated_substring(s: str) -> str:
    """
    Find the longest substring that appears at least twice.

    Time: O(n log n) average
    Space: O(n)
    """
    if len(s) <= 1:
        return ""

    n = len(s)
    hasher = DoubleHash(s)

    def has_repeated(length: int) -> int:
        """
        Check if a repeated substring of given length exists.
        Returns starting index if found, -1 otherwise.
        """
        if length == 0:
            return 0
        seen = {}  # hash -> first occurrence index
        for i in range(n - length + 1):
            h = hasher.get_hash(i, i + length - 1)
            if h in seen:
                # Verify to be safe (optional with double hash)
                if s[i:i + length] == s[seen[h]:seen[h] + length]:
                    return i
            else:
                seen[h] = i
        return -1

    # Binary search for longest length
    left, right = 1, n - 1
    result_start = -1
    result_length = 0

    while left <= right:
        mid = (left + right) // 2
        start = has_repeated(mid)
        if start != -1:
            # Found repeated substring of length mid
            result_start = start
            result_length = mid
            left = mid + 1   # Try longer
        else:
            right = mid - 1  # Try shorter

    if result_start == -1:
        return ""
    return s[result_start:result_start + result_length]


def longest_repeated_substring_with_count(s: str, min_occurrences: int = 2) -> str:
    """
    Find longest substring appearing at least min_occurrences times.

    Variant: Sometimes we need more than 2 occurrences.
    """
    if len(s) <= 1:
        return ""

    n = len(s)
    hasher = DoubleHash(s)

    def count_occurrences(length: int) -> tuple[int, int]:
        """Count max occurrences of any substring of given length."""
        if length == 0:
            return (n + 1, 0)  # Empty string at every position

        hash_counts = Counter()
        for i in range(n - length + 1):
            h = hasher.get_hash(i, i + length - 1)
            hash_counts[h] += 1

        if not hash_counts:
            return (0, -1)

        most_common_hash, count = hash_counts.most_common(1)[0]
        # Find starting index of most common
        for i in range(n - length + 1):
            if hasher.get_hash(i, i + length - 1) == most_common_hash:
                return (count, i)
        return (0, -1)

    left, right = 1, n - 1
    result = ""
    while left <= right:
        mid = (left + right) // 2
        count, start = count_occurrences(mid)
        if count >= min_occurrences:
            result = s[start:start + mid]
            left = mid + 1
        else:
            right = mid - 1
    return result


# Examples
print("Longest Repeated Substring Examples:")
print("=" * 50)

test_cases = [
    "banana",
    "abracadabra",
    "mississippi",
    "aaaaaa",
    "abcdef",  # No repeated substring
]

for s in test_cases:
    result = longest_repeated_substring(s)
    print(f"  '{s}' -> '{result}' (length {len(result)})")

print("\nWith minimum occurrence count:")
print(f"  'banana' appearing 3+ times: '{longest_repeated_substring_with_count('banana', 3)}'")
print(f"  'aaaaaa' appearing 3+ times: '{longest_repeated_substring_with_count('aaaaaa', 3)}'")
```

String hashing provides an elegant O(1) palindrome check for any substring, after O(n) preprocessing.
The Key Insight:
A string is a palindrome if and only if it equals its reverse. With hashing:
H(s[l..r]) == H(reverse(s[l..r]))
Implementation Strategy:
Precompute two hash arrays:
- A forward hash array over s read left to right
- A reverse hash array over s read right to left (equivalently, prefix hashes of the reversed string)
Then for substring s[l..r]: compute its forward hash and its reverse-reading hash, each in O(1). The substring is a palindrome if and only if the two values match.
Applications:
- Counting all palindromic substrings
- Finding the longest palindromic substring
- Answering online "is s[l..r] a palindrome?" queries in O(1)
```python
class PalindromeHash:
    """
    Hash-based palindrome detection with O(1) queries.

    Maintains both forward and reverse hash arrays to compare
    any substring with its reverse in constant time.
    """

    def __init__(self, s: str, base: int = 31, mod: int = 10**9 + 7):
        self.s = s
        self.n = len(s)
        self.base = base
        self.mod = mod

        # Forward hash: hash of s[0..i-1]
        self.h_fwd = [0] * (self.n + 1)
        # Reverse hash: hash of s reversed, i.e., s[n-1..0]
        self.h_rev = [0] * (self.n + 1)
        # Powers
        self.power = [1] * (self.n + 1)

        # Build forward hash
        for i in range(self.n):
            self.h_fwd[i + 1] = (self.h_fwd[i] * base + ord(s[i])) % mod
            self.power[i + 1] = (self.power[i] * base) % mod

        # Build reverse hash (hash s from right to left)
        for i in range(self.n - 1, -1, -1):
            self.h_rev[self.n - i] = (
                self.h_rev[self.n - i - 1] * base + ord(s[i])
            ) % mod

    def _forward_hash(self, left: int, right: int) -> int:
        """Hash of s[left:right+1] reading left to right."""
        length = right - left + 1
        result = (self.h_fwd[right + 1]
                  - self.h_fwd[left] * self.power[length]) % self.mod
        return result if result >= 0 else result + self.mod

    def _reverse_hash(self, left: int, right: int) -> int:
        """Hash of s[left:right+1] reading right to left."""
        # s[left..right] reversed is s[right], s[right-1], ..., s[left].
        # In h_rev (prefix hashes of reversed s), that span starts at
        # index n-1-right and ends at index n-1-left.
        length = right - left + 1
        rev_left = self.n - right - 1
        rev_right = self.n - left - 1
        result = (self.h_rev[rev_right + 1]
                  - self.h_rev[rev_left] * self.power[length]) % self.mod
        return result if result >= 0 else result + self.mod

    def is_palindrome(self, left: int, right: int) -> bool:
        """Check if s[left:right+1] is a palindrome in O(1)."""
        return self._forward_hash(left, right) == self._reverse_hash(left, right)

    def count_palindromic_substrings(self) -> int:
        """Count all palindromic substrings (O(n^2) pairs, O(1) each)."""
        count = 0
        for l in range(self.n):
            for r in range(l, self.n):
                if self.is_palindrome(l, r):
                    count += 1
        return count

    def longest_palindromic_substring(self) -> str:
        """
        Find longest palindromic substring.

        Time: O(n log n) - for each center, binary search on the
        palindrome radius with O(1) palindrome checks.
        """
        if self.n == 0:
            return ""

        best_start, best_len = 0, 1

        # Try odd-length palindromes (centered at each position)
        for center in range(self.n):
            # Binary search for longest radius
            left, right = 0, min(center, self.n - center - 1)
            while left <= right:
                mid = (left + right) // 2
                l, r = center - mid, center + mid
                if self.is_palindrome(l, r):
                    length = r - l + 1
                    if length > best_len:
                        best_start, best_len = l, length
                    left = mid + 1
                else:
                    right = mid - 1

        # Try even-length palindromes (centered between i and i+1)
        for gap_after in range(self.n - 1):
            # Gap is between s[gap_after] and s[gap_after+1];
            # radius mid >= 1 covers s[gap_after-mid+1 .. gap_after+mid]
            left, right = 1, min(gap_after + 1, self.n - gap_after - 1)
            while left <= right:
                mid = (left + right) // 2
                l, r = gap_after - mid + 1, gap_after + mid
                if self.is_palindrome(l, r):
                    length = r - l + 1
                    if length > best_len:
                        best_start, best_len = l, length
                    left = mid + 1
                else:
                    right = mid - 1

        return self.s[best_start:best_start + best_len]


# Demonstration
print("Palindrome Detection with Hashing")
print("=" * 50)

test_strings = [
    "racecar",
    "banana",
    "abacaba",
    "babad",
]

for s in test_strings:
    ph = PalindromeHash(s)
    print(f"\nString: '{s}'")
    print(f"  Count of palindromic substrings: {ph.count_palindromic_substrings()}")
    print(f"  Longest palindrome: '{ph.longest_palindromic_substring()}'")

    # Show some palindrome checks
    for l in range(min(3, len(s))):
        for r in range(l, min(l + 5, len(s))):
            sub = s[l:r + 1]
            if ph.is_palindrome(l, r):
                print(f"    s[{l}:{r + 1}]='{sub}' is palindrome")
```

Problem: Given two strings s1 and s2, find the longest substring common to both.
Example: for "ABABC" and "BABCA", the longest common substring is "BABC" (length 4).
Approaches: dynamic programming solves this in O(n × m) time; suffix automata and generalized suffix trees reach O(n + m) but are complex to implement. Hashing with binary search achieves O((n + m) log min(n, m)) average time with far simpler code.
Hash-Based Algorithm:
1. Binary search on the answer length k
2. For each candidate k:
a. Hash all k-length substrings of s1 into a set
b. Hash all k-length substrings of s2, check if any is in the set
c. If match found, k is feasible; try longer
3. Return the longest feasible k
If strings share a common substring of length k, they also share common substrings of all lengths < k (just take prefixes). This monotonicity makes binary search applicable.
```python
def polynomial_hash(s: str, base: int = 31, mod: int = 10**9 + 7) -> list[int]:
    """Compute all prefix hashes of string s."""
    n = len(s)
    h = [0] * (n + 1)
    for i in range(n):
        h[i + 1] = (h[i] * base + ord(s[i])) % mod
    return h


def power_table(n: int, base: int = 31, mod: int = 10**9 + 7) -> list[int]:
    """Compute powers of base."""
    p = [1] * (n + 1)
    for i in range(n):
        p[i + 1] = (p[i] * base) % mod
    return p


def get_hash(h: list[int], p: list[int], left: int, right: int,
             mod: int = 10**9 + 7) -> int:
    """Get hash of substring s[left:right+1]."""
    length = right - left + 1
    result = (h[right + 1] - h[left] * p[length]) % mod
    return result if result >= 0 else result + mod


def longest_common_substring(s1: str, s2: str) -> str:
    """
    Find longest substring common to both s1 and s2.

    Time: O((n + m) log min(n, m)) average
    Space: O(n + m)
    """
    n, m = len(s1), len(s2)
    if n == 0 or m == 0:
        return ""

    # Precompute hashes for both strings
    BASE, MOD = 31, 10**9 + 7
    h1 = polynomial_hash(s1, BASE, MOD)
    h2 = polynomial_hash(s2, BASE, MOD)
    p = power_table(max(n, m), BASE, MOD)

    def has_common_substring(length: int) -> tuple[bool, int, int]:
        """
        Check if s1 and s2 share a common substring of given length.
        Returns (found, start_in_s1, start_in_s2).
        """
        if length == 0:
            return (True, 0, 0)

        # Collect all length-k substrings of s1
        s1_hashes = {}  # hash -> starting index
        for i in range(n - length + 1):
            h = get_hash(h1, p, i, i + length - 1, MOD)
            if h not in s1_hashes:
                s1_hashes[h] = i

        # Check s2's substrings
        for j in range(m - length + 1):
            h = get_hash(h2, p, j, j + length - 1, MOD)
            if h in s1_hashes:
                # Verify actual string match (handle potential collision)
                i = s1_hashes[h]
                if s1[i:i + length] == s2[j:j + length]:
                    return (True, i, j)
        return (False, -1, -1)

    # Binary search for longest length
    left, right = 0, min(n, m)
    result_start = 0
    result_length = 0

    while left <= right:
        mid = (left + right) // 2
        found, i, j = has_common_substring(mid)
        if found:
            result_start = i
            result_length = mid
            left = mid + 1
        else:
            right = mid - 1

    return s1[result_start:result_start + result_length]


def all_common_substrings(s1: str, s2: str, min_length: int = 1) -> list[str]:
    """
    Find ALL common substrings of at least min_length.
    Returns unique substrings common to both strings.
    """
    n, m = len(s1), len(s2)
    BASE, MOD = 31, 10**9 + 7
    h1 = polynomial_hash(s1, BASE, MOD)
    h2 = polynomial_hash(s2, BASE, MOD)
    p = power_table(max(n, m), BASE, MOD)

    common = set()
    # Check each length from min_length to min(n, m)
    for length in range(min_length, min(n, m) + 1):
        s1_hashes = {}
        for i in range(n - length + 1):
            h = get_hash(h1, p, i, i + length - 1, MOD)
            if h not in s1_hashes:
                s1_hashes[h] = i
        for j in range(m - length + 1):
            h = get_hash(h2, p, j, j + length - 1, MOD)
            if h in s1_hashes:
                common.add(s2[j:j + length])

    return sorted(common, key=lambda x: (-len(x), x))


# Examples
print("Longest Common Substring")
print("=" * 50)

test_cases = [
    ("ABABC", "BABCA"),
    ("GeeksforGeeks", "GeeksQuiz"),
    ("abcdxyz", "xyzabcd"),
    ("programming", "algorithms"),
]

for s1, s2 in test_cases:
    result = longest_common_substring(s1, s2)
    print(f"  '{s1}' and '{s2}'")
    print(f"    -> LCS: '{result}' (length {len(result)})")

print("\nAll common substrings (min length 3):")
s1, s2 = "ABABC", "BABCA"
print(f"  '{s1}' and '{s2}': {all_common_substrings(s1, s2, 3)}")
```

Beyond exact matching, hashing enables approximate string comparison and similarity detection at scale.
Shingling (n-gram hashing):
Break a document into overlapping n-character chunks (shingles) and hash each:
"The quick brown" (n=4) → {"The ", "he q", "e qu", " qui", "quic", "uick", ...}
Compare documents by the overlap of their shingle sets, measured by Jaccard similarity: J(A, B) = |A ∩ B| / |A ∪ B|.
MinHash for Similarity Estimation:
For large documents, comparing full shingle sets is expensive. MinHash approximates Jaccard similarity: apply many independent hash functions to the shingle set and record the minimum value under each one. For a random hash function, the probability that two sets produce the same minimum equals their Jaccard similarity, so the fraction of matching positions across the two signatures is an estimate of it.
Applications:
- Plagiarism and near-duplicate detection
- Deduplication of crawled web pages at scale
- Clustering similar documents
```python
import random


def shingle(text: str, k: int = 5) -> set[int]:
    """
    Create a set of hashed k-shingles from text.
    A k-shingle is a contiguous sequence of k characters.
    """
    BASE, MOD = 31, 10**9 + 7
    if len(text) < k:
        return set()

    shingles = set()

    # Compute initial hash (and BASE^(k-1) for the rolling update)
    h = 0
    power = 1
    for i in range(k):
        h = (h * BASE + ord(text[i])) % MOD
        if i < k - 1:
            power = (power * BASE) % MOD
    shingles.add(h)

    # Rolling hash for remaining shingles
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * power) % MOD
        if h < 0:
            h += MOD
        h = (h * BASE + ord(text[i])) % MOD
        shingles.add(h)

    return shingles


def jaccard_similarity(set1: set, set2: set) -> float:
    """Compute Jaccard similarity between two sets."""
    if not set1 and not set2:
        return 1.0
    intersection = len(set1 & set2)
    union = len(set1 | set2)
    return intersection / union if union > 0 else 0.0


def document_similarity(doc1: str, doc2: str, k: int = 5) -> float:
    """
    Compute similarity between two documents using shingling.
    Returns value in [0, 1] where 1 means identical.
    """
    shingles1 = shingle(doc1.lower(), k)
    shingles2 = shingle(doc2.lower(), k)
    return jaccard_similarity(shingles1, shingles2)


class MinHash:
    """
    MinHash for efficient similarity estimation.

    Uses multiple hash functions to create signatures that can
    estimate Jaccard similarity in O(num_hashes) time.
    """

    def __init__(self, num_hashes: int = 100):
        self.num_hashes = num_hashes
        # Generate random hash function parameters
        random.seed(42)
        max_val = 2**31 - 1
        self.a = [random.randint(1, max_val) for _ in range(num_hashes)]
        self.b = [random.randint(0, max_val) for _ in range(num_hashes)]
        self.p = 2147483647  # Large prime (2^31 - 1)

    def _hash(self, x: int, i: int) -> int:
        """Apply i-th hash function to value x."""
        return (self.a[i] * x + self.b[i]) % self.p

    def compute_signature(self, shingles: set[int]) -> list[int]:
        """
        Compute MinHash signature for a set of shingles.
        The signature has length num_hashes.
        """
        if not shingles:
            return [self.p] * self.num_hashes
        return [min(self._hash(s, i) for s in shingles)
                for i in range(self.num_hashes)]

    def estimate_similarity(self, sig1: list[int], sig2: list[int]) -> float:
        """
        Estimate Jaccard similarity from two signatures.
        This is O(num_hashes) instead of O(|shingles|).
        """
        matches = sum(1 for a, b in zip(sig1, sig2) if a == b)
        return matches / self.num_hashes


def plagiarism_check(documents: list[str],
                     threshold: float = 0.5) -> list[tuple[int, int, float]]:
    """
    Find pairs of documents with similarity above threshold.
    Uses MinHash for efficient comparison of many documents.
    """
    k = 5  # Shingle size
    mh = MinHash(num_hashes=100)

    # Compute shingles and signatures for all documents
    signatures = [mh.compute_signature(shingle(doc.lower(), k))
                  for doc in documents]

    # Compare all pairs
    similar_pairs = []
    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            sim = mh.estimate_similarity(signatures[i], signatures[j])
            if sim >= threshold:
                similar_pairs.append((i, j, sim))
    return similar_pairs


# Demonstration
print("Document Similarity with Hashing")
print("=" * 50)

doc1 = "The quick brown fox jumps over the lazy dog"
doc2 = "The quick brown fox leaps over the lazy dog"
doc3 = "A fast auburn fox jumps over a sleepy hound"
doc4 = "Lorem ipsum dolor sit amet consectetur"

print("\nDirect Jaccard similarity (using shingles):")
print(f"  Doc1 vs Doc2: {document_similarity(doc1, doc2):.3f}")  # High
print(f"  Doc1 vs Doc3: {document_similarity(doc1, doc3):.3f}")  # Medium
print(f"  Doc1 vs Doc4: {document_similarity(doc1, doc4):.3f}")  # Low

print("\nMinHash signature-based similarity:")
mh = MinHash(num_hashes=100)
sig1 = mh.compute_signature(shingle(doc1.lower(), 5))
sig2 = mh.compute_signature(shingle(doc2.lower(), 5))
sig3 = mh.compute_signature(shingle(doc3.lower(), 5))
print(f"  Doc1 vs Doc2 (estimated): {mh.estimate_similarity(sig1, sig2):.3f}")
print(f"  Doc1 vs Doc3 (estimated): {mh.estimate_similarity(sig1, sig3):.3f}")

print("\nPlagiarism detection test:")
essays = [
    "This is an original essay about algorithms and data structures.",
    "This is an original essay about algorithms and data organization.",  # Similar
    "Essays cover many topics including math and science.",
    "This is an original essay on algorithms and data structures!",  # Very similar
]
for i, j, sim in plagiarism_check(essays, threshold=0.3):
    print(f"  Documents {i} and {j}: {sim:.3f} similarity (potential plagiarism)")
```

String hashing finds application in many other algorithmic problems: detecting whether one string is a rotation of another, 2D pattern matching with row-and-column hashes, and comparing subtrees for tree isomorphism, among others.
The core technique is always the same: use hashing to convert O(m) string comparisons to O(1) hash comparisons. This often reduces O(n² × m) algorithms to O(n²) or O(n log n × m) to O(n log n). Identify where your algorithm spends time comparing strings, and apply hashing there.
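To make that substitution concrete, here is a minimal before/after sketch (helper names are ours): the same all-pairs scan for matching 4-character substrings, first with O(k) slice comparisons, then with O(1) hash comparisons after O(n) preprocessing:

```python
BASE, MOD = 31, 10**9 + 7


def prefix_hashes(s: str):
    """Prefix hashes and base powers for O(1) substring hashes."""
    h, p = [0], [1]
    for ch in s:
        h.append((h[-1] * BASE + ord(ch)) % MOD)
        p.append((p[-1] * BASE) % MOD)
    return h, p


def sub_hash(h, p, l, r):
    """Hash of s[l:r+1] from precomputed prefix hashes."""
    return (h[r + 1] - h[l] * p[r - l + 1]) % MOD


s, k = "abracadabra", 4

# Without hashing: each comparison below costs O(k)
slow = [(i, j) for i in range(len(s) - k + 1)
        for j in range(i + 1, len(s) - k + 1)
        if s[i:i + k] == s[j:j + k]]

# With hashing: each comparison is O(1) after O(n) preprocessing
h, p = prefix_hashes(s)
fast = [(i, j) for i in range(len(s) - k + 1)
        for j in range(i + 1, len(s) - k + 1)
        if sub_hash(h, p, i, i + k - 1) == sub_hash(h, p, j, j + k - 1)]

print(slow == fast, fast)  # True [(0, 7)]
```

The loop structure is untouched; only the comparison inside it changed, which is exactly where the speedup in the table below comes from.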
Performance Considerations:
| Operation | Without Hashing | With Hashing |
|---|---|---|
| Compare 2 substrings | O(m) | O(1) |
| Find all matches of pattern | O(n × m) | O(n + m) |
| Longest repeated substring | O(n³) | O(n log n) |
| Longest common substring | O(n × m) | O((n+m) log n) |
| Document similarity (n docs) | O(n² × d) | O(n² × k) where k << d |
where m = pattern/substring length, d = document length, n = text/array length, k = signature length.
Module Complete:
You've now mastered string hashing from first principles through advanced applications:
- Polynomial hashing and rolling hashes (Rabin-Karp pattern matching)
- Collision handling with verification and double hashing
- O(1) substring comparison after O(n) preprocessing
- Binary search on the answer for repeated and common substrings
- Palindrome detection with forward and reverse hash arrays
- Similarity detection with shingling and MinHash
This knowledge equips you to tackle a wide range of string problems efficiently and correctly.
You now have a comprehensive understanding of string hashing techniques—from the mathematical foundations to practical applications. These skills are essential for efficient string processing in algorithms, competitive programming, and production systems.