At its heart, an anagram is a deceptively simple concept: two strings are anagrams if one can be rearranged to form the other. The word "listen" rearranges to "silent"; "astronomer" rearranges to "moon starer." Yet this simple definition underlies a surprisingly deep class of algorithmic problems that appear constantly in interview questions, text processing systems, and computational linguistics.
Anagram detection isn't merely an academic exercise. It's the foundation for spell checkers that suggest corrections, plagiarism detection systems that identify reworded content, cryptographic puzzles that rely on character permutations, and search engines that need to match queries against documents regardless of word variations. Understanding how to efficiently detect and group anagrams gives you a powerful pattern that transfers to many other character-frequency problems.
By the end of this page, you will master multiple techniques for anagram detection—from naive comparison to optimized frequency counting—and understand how to efficiently group large collections of strings by their anagram equivalence classes. You'll develop intuition for when each approach is appropriate and how to optimize for different constraints.
Before diving into algorithms, let's establish precise definitions. The term "anagram" comes from the Greek ana- (back, anew) and -gramma (letter), literally meaning "letters rearranged."
Definition (Anagram): Two strings s and t are anagrams if and only if t is a permutation of s. Equivalently, s and t are anagrams if they have exactly the same character frequencies.
This definition immediately suggests the key insight: anagrams are characterized by their character histogram, not by their character order. The words "cat" and "act" share the histogram {a: 1, c: 1, t: 1}, making them anagrams.
Problem Variants:
The anagram problem appears in several different forms, each with its own optimal approach:

- Pairwise detection: given two strings, decide whether they are anagrams of each other.
- Anagram grouping: given a collection of strings, partition it into anagram equivalence classes.
- Anagram substring search: given a text and a pattern, find every position where some permutation of the pattern occurs.
Key Observation:
Every solution to anagram problems ultimately relies on one of two fundamental approaches:
Canonical Form Comparison: Transform each string into a canonical form (usually sorted characters) such that anagrams map to the same canonical form.
Frequency Counting: Compute character frequencies directly and compare the frequency vectors (or hashes thereof).
The choice between these approaches depends on the specific constraints—string length, alphabet size, number of comparisons, and whether you need a hashable key.
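To make the contrast concrete before we optimize either one, here is a minimal sketch (the helper names are ours, chosen for illustration) showing both approaches agreeing on a pair of anagrams:

```python
from collections import Counter

def canonical_form(s: str) -> str:
    # Canonical-form approach: sorting maps every anagram of s
    # to the same string
    return "".join(sorted(s))

def frequency_histogram(s: str) -> Counter:
    # Frequency-counting approach: an order-independent character histogram
    return Counter(s)

print(canonical_form("listen") == canonical_form("silent"))            # True
print(frequency_histogram("listen") == frequency_histogram("silent"))  # True
```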
Throughout this page, we'll assume lowercase English letters (a-z) unless otherwise specified. For Unicode or extended character sets, the principles remain the same, but implementation details change—particularly around frequency array sizes and hash computation.
The most intuitive approach to anagram detection leverages the canonical form principle: if we sort both strings, anagrams will produce identical sorted results.
Algorithm:
1. Sort s to produce sorted_s
2. Sort t to produce sorted_t
3. Return true if sorted_s == sorted_t, otherwise false

Why It Works:
Sorting is a function that maps permutations to a single canonical representative. All anagrams of "cat" (i.e., "cat", "act", "tca", "tac", "atc", "cta") sort to "act". This property makes sorted strings perfect hash keys for anagram grouping.
```python
def are_anagrams_sorted(s: str, t: str) -> bool:
    """
    Determine if two strings are anagrams using sorting.

    Time Complexity: O(n log n) where n = max(len(s), len(t))
    Space Complexity: O(n) for the sorted string copies

    Args:
        s: First string
        t: Second string

    Returns:
        True if s and t are anagrams, False otherwise
    """
    # Quick length check - anagrams must have same length
    if len(s) != len(t):
        return False

    # Sort both strings and compare
    # In Python, sorted() returns a list, so we compare lists
    return sorted(s) == sorted(t)


def group_anagrams_sorted(words: list[str]) -> list[list[str]]:
    """
    Group words by anagram equivalence class using sorting as key.

    Time Complexity: O(n * k log k) where n = number of words,
                     k = max word length
    Space Complexity: O(n * k) for storing all words and their sorted forms

    Args:
        words: List of strings to group

    Returns:
        List of groups, where each group contains anagrams
    """
    from collections import defaultdict

    # Use sorted tuple as the hash key
    anagram_groups = defaultdict(list)

    for word in words:
        # Create canonical form by sorting characters
        # Using tuple because lists aren't hashable
        canonical = tuple(sorted(word))
        anagram_groups[canonical].append(word)

    return list(anagram_groups.values())


# Example usage
if __name__ == "__main__":
    # Pairwise detection
    print(are_anagrams_sorted("listen", "silent"))  # True
    print(are_anagrams_sorted("hello", "world"))    # False

    # Grouping
    words = ["eat", "tea", "tan", "ate", "nat", "bat"]
    groups = group_anagrams_sorted(words)
    print(groups)
    # [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]
```

Complexity Analysis:
Time Complexity: O(n log n) for pairwise detection, where n is the string length. For grouping m strings of average length k, the total time is O(m · k log k).
Space Complexity: O(n) for the sorted copies. Note that Python's sorted() creates a new list, and JavaScript's sort() modifies in place (though split() already creates a copy).
Advantages:

- Simple to implement, easy to debug, and hard to get wrong
- Works for any alphabet or character set without modification
- The sorted string doubles as a hashable key for grouping

Disadvantages:

- O(n log n) time, asymptotically slower than frequency counting for long strings
- Allocates new copies of both strings for sorting
The sorting approach is ideal when you need a simple, debuggable implementation and the strings are short (< 1000 characters). It's also preferred when working with large or variable alphabets where frequency arrays would be unwieldy. For interview settings, this solution is often acceptable and demonstrates clear thinking.
The frequency counting approach directly leverages the definition of anagrams: two strings are anagrams if and only if they have identical character frequencies. By computing and comparing frequency vectors in O(n) time, we avoid the O(n log n) sorting overhead.
The Key Insight:
For a fixed alphabet of size σ (e.g., σ = 26 for lowercase English), we can represent any string as a frequency vector of length σ. Comparing two such vectors takes O(σ) time, which is O(1) for constant alphabet size. This reduces the overall complexity from O(n log n) to O(n).
Single-Array Optimization:
Instead of creating two frequency arrays and comparing them, we can use a single array: increment for characters in s, decrement for characters in t. If all counts are zero at the end, the strings are anagrams.
```python
def are_anagrams_frequency(s: str, t: str) -> bool:
    """
    Determine if two strings are anagrams using frequency counting.

    Time Complexity: O(n) where n = len(s) = len(t)
    Space Complexity: O(σ) = O(26) = O(1) for fixed alphabet

    This approach uses a single frequency array: increment for s,
    decrement for t. If all values are zero at the end, the strings
    are anagrams.
    """
    if len(s) != len(t):
        return False

    # Single frequency array for lowercase English letters
    # Could use Counter, but array is faster for fixed alphabet
    freq = [0] * 26

    for i in range(len(s)):
        freq[ord(s[i]) - ord('a')] += 1
        freq[ord(t[i]) - ord('a')] -= 1

    # Check if all frequencies are zero
    return all(count == 0 for count in freq)


def are_anagrams_counter(s: str, t: str) -> bool:
    """
    Alternative using Python's Counter for arbitrary character sets.

    Time Complexity: O(n)
    Space Complexity: O(σ) where σ is the number of unique characters

    This is more Pythonic and handles Unicode/extended character
    sets naturally.
    """
    from collections import Counter

    if len(s) != len(t):
        return False

    return Counter(s) == Counter(t)


def group_anagrams_frequency(words: list[str]) -> list[list[str]]:
    """
    Group anagrams using frequency tuple as key.

    Time Complexity: O(n * k) where n = number of words,
                     k = max word length
    Space Complexity: O(n * σ) for frequency tuples, typically O(n * 26)

    This is faster than sorting when k is large.
    """
    from collections import defaultdict

    anagram_groups = defaultdict(list)

    for word in words:
        # Build frequency tuple - this is hashable and serves as key
        freq = [0] * 26
        for char in word:
            freq[ord(char) - ord('a')] += 1

        # Tuple is required because lists aren't hashable
        key = tuple(freq)
        anagram_groups[key].append(word)

    return list(anagram_groups.values())


# Demonstration of the technique
if __name__ == "__main__":
    # Basic tests
    print(are_anagrams_frequency("anagram", "nagaram"))  # True
    print(are_anagrams_frequency("rat", "car"))          # False

    # Counter approach for Unicode
    print(are_anagrams_counter("こんにちは", "はちにんこ"))  # True (Japanese)

    # Grouping benchmark
    import time
    words = ["eat", "tea", "tan", "ate", "nat", "bat"] * 10000
    start = time.time()
    result = group_anagrams_frequency(words)
    print(f"Frequency approach: {time.time() - start:.3f}s")
```

Complexity Analysis:
Time Complexity: O(n) for pairwise detection. For grouping m strings of average length k, the total time is O(m · k), which is asymptotically better than the O(m · k log k) sorting approach.
Space Complexity: O(σ) per comparison, where σ is alphabet size. For grouping, we need O(m · σ) for the keys, but σ is typically constant (26 for English).
Why Frequency Tuples Work as Hash Keys:
When grouping anagrams, we need a hashable canonical representation. The sorted-string approach creates strings like "act", "act", "abt" as keys. The frequency approach creates tuples like (1,0,1,...,1,...,0). Both approaches guarantee that anagrams map to identical keys.
The frequency tuple approach is faster for long strings because building a frequency array is O(n) while sorting is O(n log n). However, frequency tuples are longer (26 integers vs. n characters), so there's a crossover point where sorting becomes competitive for short strings.
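If you want to locate that crossover empirically on your own data, a rough timing sketch using the standard timeit module (the helper names and test words are ours) looks like this:

```python
import timeit

def sorted_key(word: str) -> tuple:
    # Canonical form via sorting: O(k log k)
    return tuple(sorted(word))

def freq_key(word: str) -> tuple:
    # Canonical form via counting: O(k + 26)
    freq = [0] * 26
    for ch in word:
        freq[ord(ch) - ord('a')] += 1
    return tuple(freq)

# Time both key-building strategies across a range of string lengths
for word in ["cat", "anagram", "abcdefghij" * 10, "abcdefghij" * 100]:
    t_sort = timeit.timeit(lambda: sorted_key(word), number=10_000)
    t_freq = timeit.timeit(lambda: freq_key(word), number=10_000)
    print(f"len={len(word):5d}  sorted={t_sort:.4f}s  freq={t_freq:.4f}s")
```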
| Approach | Time (Detection) | Time (Grouping) | Space | Best For |
|---|---|---|---|---|
| Sorting | O(n log n) | O(m·k log k) | O(n) | Short strings, large alphabet |
| Frequency Array | O(n) | O(m·k) | O(σ) | Long strings, fixed alphabet |
| Frequency Hash | O(n) | O(m·k) | O(σ) | Unicode, arbitrary character sets |
A more sophisticated variant asks: given a text t and a pattern p, find all starting indices in t where an anagram of p begins. This is essentially searching for any permutation of p within t.
Naive Approach:
For each window of length |p| in t, check if the window is an anagram of p. This takes O((n - m) · m) time for text length n and pattern length m—too slow for large inputs.
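For reference, this baseline can be sketched in a few lines (the function name is ours; Counter performs the per-window frequency check):

```python
from collections import Counter

def find_all_anagrams_naive(text: str, pattern: str) -> list[int]:
    """Check every window from scratch: O((n - m) * m)."""
    n, m = len(text), len(pattern)
    pattern_count = Counter(pattern)
    # Each window is recounted independently; this repeated work is
    # exactly what the sliding-window version below eliminates.
    return [i for i in range(n - m + 1)
            if Counter(text[i:i + m]) == pattern_count]
```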
Sliding Window with Frequency Counts:
We can solve this in O(n) time using a sliding window. The key insight is that moving the window by one position only changes two characters: one leaves, one enters. We can update our frequency counts incrementally rather than recomputing from scratch.
The Match Count Technique:
Instead of comparing full frequency arrays on each slide, we track how many of the 26 alphabet positions have matching frequencies, including positions where both counts are zero. When this count reaches 26, every frequency agrees and we have an anagram match. (A common variant tracks only the characters that actually appear in the pattern.)
```python
def find_all_anagrams(text: str, pattern: str) -> list[int]:
    """
    Find all starting positions where anagrams of pattern appear in text.

    Uses sliding window with incremental frequency updates and
    match counting.

    Time Complexity: O(n) where n = len(text)
    Space Complexity: O(σ) = O(26) = O(1)

    Args:
        text: The string to search within
        pattern: The pattern whose anagrams we're looking for

    Returns:
        List of starting indices where anagram matches occur
    """
    n, m = len(text), len(pattern)
    result = []

    if m > n:
        return result

    # Frequency arrays for pattern and current window
    pattern_freq = [0] * 26
    window_freq = [0] * 26

    # Build pattern frequency
    for char in pattern:
        pattern_freq[ord(char) - ord('a')] += 1

    # Track how many characters have matching frequencies
    # When matches == 26, we have an anagram
    matches = 0

    # Initialize match count for empty window
    for i in range(26):
        if pattern_freq[i] == window_freq[i]:
            matches += 1

    # Sliding window
    for i in range(n):
        # Add character entering the window (right side)
        idx = ord(text[i]) - ord('a')
        window_freq[idx] += 1

        # Update matches for this character
        if window_freq[idx] == pattern_freq[idx]:
            matches += 1
        elif window_freq[idx] == pattern_freq[idx] + 1:
            # We went from matching to not matching
            matches -= 1

        # Remove character leaving the window (left side)
        if i >= m:
            left_idx = ord(text[i - m]) - ord('a')
            window_freq[left_idx] -= 1

            if window_freq[left_idx] == pattern_freq[left_idx]:
                matches += 1
            elif window_freq[left_idx] == pattern_freq[left_idx] - 1:
                matches -= 1

        # Check if we have a complete anagram
        if matches == 26 and i >= m - 1:
            result.append(i - m + 1)

    return result


# More intuitive version using Counter (slightly slower but clearer)
def find_all_anagrams_counter(text: str, pattern: str) -> list[int]:
    """
    Alternative implementation using Counter for clarity.

    Still O(n) but with higher constant factor due to Counter operations.
    """
    from collections import Counter

    n, m = len(text), len(pattern)
    result = []

    if m > n:
        return result

    pattern_count = Counter(pattern)
    window_count = Counter(text[:m])

    if window_count == pattern_count:
        result.append(0)

    for i in range(m, n):
        # Add new character
        window_count[text[i]] += 1

        # Remove old character
        left_char = text[i - m]
        window_count[left_char] -= 1
        if window_count[left_char] == 0:
            del window_count[left_char]

        # Check for match
        if window_count == pattern_count:
            result.append(i - m + 1)

    return result


if __name__ == "__main__":
    text = "cbaebabacd"
    pattern = "abc"
    print(f"Pattern '{pattern}' anagrams in '{text}':")
    print(find_all_anagrams(text, pattern))  # [0, 6]
    # At index 0: "cba" is an anagram of "abc"
    # At index 6: "bac" is an anagram of "abc"

    print(find_all_anagrams("abab", "ab"))  # [0, 1, 2]
```

The match count technique is subtle. We initialize matches by counting how many of the 26 positions already agree between the pattern's frequencies and the empty window; every letter absent from the pattern matches immediately, so matches starts at 26 minus the number of distinct pattern characters. As we slide the window, adding or removing a character can either create a match (increment), break a match (decrement), or leave matches unchanged. When matches reaches 26, all frequencies agree.
Why This Is O(n):
Each character is processed exactly twice—once when entering the window (right edge) and once when leaving (left edge). Each update to the frequency array and match count takes O(1) time. The final comparison (matches == 26) is also O(1). Thus, the overall time complexity is O(n), a significant improvement over the naive O(n · m) approach.
Space Optimization:
We use only two arrays of size 26, regardless of input size. This O(1) space usage (for constant alphabet) makes the algorithm suitable for streaming scenarios where memory is constrained.
For very large-scale grouping (millions of strings), the overhead of storing full frequency tuples as dictionary keys becomes significant. We can compress the canonical representation into a single number using arithmetic hashing, trading perfect accuracy for speed.
Prime Factorization Approach:
A clever technique assigns each letter a distinct prime number, then computes the product of primes for each character in the string. Since prime factorization is unique, anagrams produce identical products.
a → 2, b → 3, c → 5, d → 7, e → 11, ...
"cat" → 5 × 2 × 71 = 710 (c=5, a=2, t=71)
"act" → 2 × 5 × 71 = 710 (same product!)
Caution: Products grow exponentially with string length, causing overflow for long strings. Use modular arithmetic or big integers, accepting collision risk with modular reduction.
```python
def group_anagrams_prime_hash(words: list[str]) -> list[list[str]]:
    """
    Group anagrams using prime number product as hash key.

    Each letter maps to a prime; the product uniquely identifies the
    character multiset (like prime factorization).

    Warning: Products can overflow for long strings. In Python,
    integers have arbitrary precision, but this may be slow. For
    production use, consider modular arithmetic with collision handling.

    Time Complexity: O(n * k) where n = number of words, k = max word length
    Space Complexity: O(n) for storing groups
    """
    from collections import defaultdict

    # First 26 primes for a-z
    PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
              43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]

    anagram_groups = defaultdict(list)

    for word in words:
        # Compute product of primes
        product = 1
        for char in word:
            product *= PRIMES[ord(char) - ord('a')]
        anagram_groups[product].append(word)

    return list(anagram_groups.values())


def group_anagrams_rolling_hash(words: list[str]) -> list[list[str]]:
    """
    Group anagrams using an additive hash (with collision risk).

    A positional polynomial hash, sum(char_value * base^index) mod prime,
    is order-dependent, so for anagrams we use a simpler commutative
    version:

        hash = sum(prime_for_char) mod large_prime

    This is faster but has collision risk (different anagram groups
    might hash to the same value).

    Time Complexity: O(n * k)
    Space Complexity: O(n)
    """
    from collections import defaultdict

    # Using primes additively (simpler, some collision risk)
    PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
              43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
    MOD = 10**9 + 7

    anagram_groups = defaultdict(list)

    for word in words:
        # Additive hash (commutative, so order-independent)
        hash_val = sum(PRIMES[ord(c) - ord('a')] for c in word) % MOD
        anagram_groups[hash_val].append(word)

    return list(anagram_groups.values())


# Count-based string key (efficient for fixed alphabet)
def group_anagrams_count_key(words: list[str]) -> list[list[str]]:
    """
    A practical middle ground: encode frequency as a string key.

    For each letter a-z, append its count to the key. This avoids
    tuple comparison overhead while being collision-free.

    Example: "aab" -> "2#1#0#...#0" (counts for a..z joined by '#')

    Time Complexity: O(n * k)
    Space Complexity: O(n * 26) for keys
    """
    from collections import defaultdict

    anagram_groups = defaultdict(list)

    for word in words:
        # Build frequency string key
        freq = [0] * 26
        for char in word:
            freq[ord(char) - ord('a')] += 1

        # Encode as string: "2#1#0#0#..." using a delimiter
        key = '#'.join(map(str, freq))
        anagram_groups[key].append(word)

    return list(anagram_groups.values())


if __name__ == "__main__":
    words = ["eat", "tea", "tan", "ate", "nat", "bat"]
    print("Prime product:", group_anagrams_prime_hash(words))
    print("Rolling hash:", group_anagrams_rolling_hash(words))
    print("Count key:", group_anagrams_count_key(words))
```

Choosing the Right Approach:
| Method | Collision Risk | Speed | Memory | Use Case |
|---|---|---|---|---|
| Sorted String | None | O(k log k) | O(k) | Small strings, prototyping |
| Frequency Tuple | None | O(k + 26) | O(26) | Production, fixed alphabet |
| Prime Product | None* | O(k) | O(1) | Short strings only |
| Rolling Hash | Possible | O(k) | O(1) | Approximate matching |
| Count Key String | None | O(k + 26) | O(26) | Hash key compatibility |
*Prime product has no collisions for unique prime mappings, but integer overflow becomes a concern for long strings.
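To see how quickly the product escapes machine-word range, consider the worst case where every character maps to the largest prime (z → 101). This tiny check (ours, for illustration) shows a 100-character word already needs hundreds of bits:

```python
# Worst case: a 100-character string of 'z', each contributing the prime 101
product = 101 ** 100
print(product.bit_length())  # 666 -- far past 64-bit integer range
```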
For most production use cases with English text, the frequency array approach with tuple keys (Python) or count-based string keys (JavaScript) provides the best balance of correctness, speed, and simplicity. Reserve hash-based approaches for specialized cases where you can tolerate collisions or need extreme performance.
Anagram detection and grouping appear in numerous practical systems:
1. Spell Checking and Correction
When a user misspells a word, the spell checker needs candidate corrections. Anagram-based indexing enables quick lookup: if the user typed "hte" (a transposition of "the", so the character frequencies are identical), we can look up words that share those frequencies to suggest "the". Combined with edit distance, this powers suggestion engines.
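A minimal sketch of that indexing idea, assuming a toy in-memory word list (real spell checkers layer edit distance and language models on top):

```python
from collections import defaultdict

# Toy dictionary -- a real system would load a full lexicon
DICTIONARY = ["the", "hat", "eat", "tea", "ate", "net"]

# Index words by their canonical (sorted-character) form
anagram_index = defaultdict(list)
for word in DICTIONARY:
    anagram_index["".join(sorted(word))].append(word)

def correction_candidates(typo: str) -> list[str]:
    """Words whose character multiset matches the typo exactly."""
    return anagram_index.get("".join(sorted(typo)), [])

print(correction_candidates("hte"))  # ['the'] -- transposition caught
print(correction_candidates("aet"))  # ['eat', 'tea', 'ate']
```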
2. Plagiarism Detection
Plagiarism systems must detect text that has been shuffled or paraphrased. By computing character/word frequency signatures, these systems can identify suspiciously similar passages even when word order has changed. Anagram detection is part of this signature-based matching.
3. Cryptography and Puzzles
Anagram solvers are essential tools in cryptography (especially for puzzle-based encryption) and word games. High-performance anagram groupers enable dictionary attacks on ciphers that preserve character frequencies.
4. Genomic Sequence Analysis
In bioinformatics, k-mer analysis often involves grouping DNA sequences by nucleotide content. This is essentially anagram grouping over a four-letter alphabet (A, C, G, T). Efficient frequency-based grouping enables genome-scale analysis.
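The same frequency-tuple grouping, specialized to the four-letter alphabet, might look like this (the k-mers are invented for illustration):

```python
from collections import defaultdict

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def composition_key(kmer: str) -> tuple:
    """Nucleotide composition: a length-4 frequency tuple over A, C, G, T."""
    freq = [0] * 4
    for base in kmer:
        freq[BASES[base]] += 1
    return tuple(freq)

kmers = ["ACGT", "TGCA", "AACG", "GCAA", "TTTT"]
groups = defaultdict(list)
for kmer in kmers:
    groups[composition_key(kmer)].append(kmer)

print(list(groups.values()))
# [['ACGT', 'TGCA'], ['AACG', 'GCAA'], ['TTTT']]
```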
5. Search Query Normalization
Search engines may group queries that are anagrams (or near-anagrams) for analytics and caching purposes. If users search for "New York" and "York New", these might be recognized as equivalent intent.
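At the word level the canonical-form trick is identical, just applied to tokens instead of characters; a simplified sketch (real query normalization also handles stop words, stemming, and punctuation):

```python
def query_key(query: str) -> tuple:
    """Order-independent key: lowercase words, sorted."""
    return tuple(sorted(query.lower().split()))

print(query_key("New York") == query_key("York New"))    # True
print(query_key("New York") == query_key("New Jersey"))  # False
```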
| Application | Technique Used | Scale | Key Optimization |
|---|---|---|---|
| Word game solvers | Sorted key grouping | 100K words | Pre-indexed dictionary |
| Spell checkers | Frequency + edit distance | Millions of lookups/sec | Trie with frequency metadata |
| DNA k-mer analysis | Frequency tuple hashing | Billions of sequences | Parallel/distributed grouping |
| Plagiarism detection | Sliding window + hashing | Millions of documents | MinHash/locality-sensitive hashing |
Anagram detection exemplifies a powerful pattern in string algorithms: canonical representation. By reducing strings to a form that is order-independent (sorted characters or frequency vectors), we transform a permutation problem into a simple equality check.
You now have a comprehensive understanding of anagram detection and grouping. These techniques—canonical forms, frequency counting, sliding windows—are fundamental patterns that you'll apply to many other string problems. Next, we'll explore string compression and run-length encoding.