Data Structures & AlgorithmsBasic String Operations

Basic String Operations & Their Cost Model

LevelBeginner

Duration60 mins

TopicBasic String Operations

5 / 6

String Comparison

Determining Order and Equality

String comparison is one of the most frequently performed operations in computing. Every time you:

Check if a username matches
Sort a list of names
Search for a word in a dictionary
Validate user input
Look up a key in a hash table

...you're comparing strings. Yet comparison is more nuanced than it first appears. Is "Apple" equal to "apple"? Which comes first: "10" or "9"? Is "café" the same as "cafe"?

Understanding comparison deeply—equality vs ordering, case sensitivity, locale effects, and performance characteristics—is essential for writing correct, efficient string-handling code.

Learning Objectives

By the end of this page, you will understand: how string equality works at the character level, lexicographical ordering and its implications, case-sensitive vs case-insensitive comparison, the performance cost model for comparison, and common pitfalls in string comparison.

String Equality: What Does It Mean?

Two strings are equal if and only if they have the same length and exactly the same character at each position.

The equality algorithm:

Compare lengths. If different, strings are not equal (immediate return).
Compare characters one by one from position 0 to length-1.
If any character differs, strings are not equal.
If all characters match, strings are equal.

String Equality Algorithm
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
def strings_equal(s1: str, s2: str) -> bool:
    """
    Check if two strings are exactly equal.
    
    Time: O(n) where n = min(len(s1), len(s2)), O(1) if lengths differ
    Space: O(1)
    """
    # Quick length check - O(1)
    if len(s1) != len(s2):
        return False
    
    # Character-by-character comparison - O(n) worst case
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            return False  # Early termination on first mismatch
    
    return True
 
 
# In practice, use the == operator
"hello" == "hello"   # True
"hello" == "Hello"   # False (case matters)
"hello" == "hello "  # False (length differs)
"" == ""             # True (both empty)

Key properties of string equality:

Reflexive: s == s is always true
Symmetric: if s1 == s2, then s2 == s1
Transitive: if s1 == s2 and s2 == s3, then s1 == s3
Character-sensitive: "A" ≠ "a", "é" ≠ "e"
Length-sensitive: "cat" ≠ "cats"

Why the length check matters:

The length check is O(1) and eliminates many comparisons immediately. If you're comparing a million strings, most pairs will differ in length. This optimization is crucial for performance.

Equality vs Identity

Some languages distinguish between equality (same value) and identity (same object in memory). In Python, == tests equality while 'is' tests identity. For strings, equality is almost always what you want. Identity can give surprising results due to string interning.

Lexicographical Ordering: Which String Comes First?

Beyond equality, we often need to order strings—for sorting, binary search, or finding the minimum/maximum. Strings are ordered lexicographically, which is a generalization of alphabetical ordering to all characters.

The lexicographical ordering algorithm:

To compare strings s1 and s2:

Compare characters at position 0. If different, the string with the smaller character value is "less than" the other.
If characters are equal, move to position 1 and repeat.
Continue until a difference is found or one string ends.
If one string is a prefix of the other, the shorter string is "less than".

Lexicographical Comparison
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
def compare_strings(s1: str, s2: str) -> int:
    """
    Compare two strings lexicographically.
    
    Returns:
        -1 if s1 < s2
         0 if s1 == s2
         1 if s1 > s2
    
    Time: O(min(len(s1), len(s2))) worst case
    """
    min_len = min(len(s1), len(s2))
    
    for i in range(min_len):
        if s1[i] < s2[i]:
            return -1
        elif s1[i] > s2[i]:
            return 1
    
    # All compared characters are equal
    # Shorter string is "less than"
    if len(s1) < len(s2):
        return -1
    elif len(s1) > len(s2):
        return 1
    else:
        return 0
 
 
# Examples using Python's built-in comparison
"apple" < "banana"   # True (a < b)
"apple" < "apply"    # True (e < y at position 4)
"app" < "apple"      # True (prefix is smaller)
"Apple" < "apple"    # True (A=65 < a=97 in ASCII)

Lexicographical Comparison Examples
Comparison	Result	Reason
"abc" vs "abd"	abc < abd	c < d at position 2
"abc" vs "abc"	abc == abc	All characters match
"ab" vs "abc"	ab < abc	ab is a prefix of abc
"abc" vs "ab"	abc > ab	abc contains ab
"10" vs "9"	"10" < "9"	Character '1' (49) < '9' (57)
"A" vs "a"	A < a	ASCII: 'A'=65, 'a'=97

Numeric Strings Don't Sort Numerically!

Lexicographical ordering compares character-by-character. "10" < "9" because '1' comes before '9' in character code. To sort strings as numbers, you must convert to integers first. This is a common source of bugs when sorting user input or file names.

Character Codes and Comparison

Character comparison is based on numeric character codes. In ASCII (and extended to Unicode), each character has a numeric value, and these values determine ordering.

Key ASCII ranges:

Digits '0'-'9': codes 48-57
Uppercase 'A'-'Z': codes 65-90
Lowercase 'a'-'z': codes 97-122

Important ordering relationships:

Digits < Uppercase < Lowercase
'0' < '9' < 'A' < 'Z' < 'a' < 'z'

Character Codes
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# Getting character codes
ord('A')    # 65
ord('Z')    # 90
ord('a')    # 97
ord('z')    # 122
ord('0')    # 48
ord('9')    # 57
ord(' ')    # 32 (space)
 
# Converting back
chr(65)     # 'A'
chr(97)     # 'a'
 
# Why uppercase < lowercase
'A' < 'a'   # True (65 < 97)
'Z' < 'a'   # True (90 < 97)
 
# Mixed-case sorting consequences
words = ["apple", "Banana", "cherry", "Apricot"]
sorted(words)  # ['Apricot', 'Banana', 'apple', 'cherry']
# All uppercase comes before all lowercase!
 
# This is rarely what humans expect
# Use case-insensitive sorting for "natural" order

Unicode considerations:

Unicode extends far beyond ASCII. Different scripts, accented characters, and emoji all have code points that affect ordering:

Accented characters: 'é' (U+00E9, 233) comes after all basic ASCII
Non-Latin scripts: Greek, Cyrillic, CJK characters have their own code point ranges
Emoji: Very high code points (U+1F600+)

For locale-aware sorting (where 'é' sorts near 'e'), you need locale-specific comparison functions.

Know Your Use Case

For technical comparisons (exact match, hash keys), standard comparison is correct and fast. For human-facing sorting (names, titles), consider locale-aware comparison. Don't overcomplicate when you don't need to, but don't assume ASCII ordering is 'natural' for users.

Case-Sensitive vs Case-Insensitive Comparison

By default, string comparison is case-sensitive—'A' and 'a' are different characters with different codes. But many applications need case-insensitive comparison:

Username matching (allow "JohnDoe" or "johndoe")
Search functionality (find "Python" when searching for "python")
Human-readable sorting (group by letter regardless of case)

Naive approach: Convert to same case

Case-Insensitive Comparison
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Naive: Convert both to lowercase (or uppercase)
def equals_ignore_case(s1: str, s2: str) -> bool:
    """Case-insensitive equality check."""
    return s1.lower() == s2.lower()
 
equals_ignore_case("Hello", "hello")   # True
equals_ignore_case("PYTHON", "Python") # True
 
# Case-insensitive sorting
words = ["apple", "Banana", "cherry", "Apricot"]
sorted(words, key=str.lower)  # ['apple', 'Apricot', 'Banana', 'cherry']
 
# Case-insensitive search
def contains_ignore_case(haystack: str, needle: str) -> bool:
    return needle.lower() in haystack.lower()
 
contains_ignore_case("Hello World", "WORLD")  # True

The hidden complexity: Unicode case folding

For ASCII (A-Z, a-z), case conversion is trivial: add or subtract 32. But Unicode has many more complexities:

Some characters have no case (digits, symbols)
Some characters have multiple case mappings
The German 'ß' lowercases to "ss"—one character becomes two!
Turkish has 'I' → 'ı' (dotless i), not 'i'

For correct internationalized code, use proper Unicode-aware case folding.

Unicode Case Complexity
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Simple case: ASCII letters
"ABC".lower()       # "abc"
"abc".upper()       # "ABC"
 
# German sharp S (ß)
"ß".upper()         # "SS" (one char becomes two!)
"SS".lower()        # "ss" (two chars stay two)
"ß".lower()         # "ß" (already lowercase)
 
# casefold() is more aggressive for comparison
"ß".casefold()      # "ss" (better for matching)
"SS".casefold()     # "ss"
"ß".casefold() == "SS".casefold()  # True
 
# Recommendation: Use casefold() for case-insensitive comparison
def equals_unicode_safe(s1: str, s2: str) -> bool:
    return s1.casefold() == s2.casefold()

Case Conversion Has a Cost

Converting to lowercase creates a new string, which is O(n) time and space. If comparing many strings repeatedly, consider normalizing once and storing the lowercase version, rather than converting on every comparison.

Comparison Performance: The Cost Model

Understanding the performance of string comparison is crucial for writing efficient algorithms:

Equality comparison (s1 == s2):

Equality Comparison Performance
Case	Time Complexity	Reason
Different lengths	O(1)	Length check fails immediately
Equal strings	O(n)	Must check all n characters
Differ at position k	O(k)	Early termination at first mismatch
Average case (random strings)	O(1)	Most random strings differ early

Why is the average case O(1)?

For random strings, the probability of matching at position 0 is 1/alphabet_size. For a 26-letter alphabet:

P(match at pos 0) ≈ 1/26
P(match at pos 0 and 1) ≈ 1/676
P(match at first k positions) ≈ (1/26)^k

The expected number of comparisons is a geometric series that converges to a constant. For English text, the "average" comparison examines only a few characters before finding a difference.

Ordering comparison (s1 < s2):

Same cost model as equality—we examine characters until we find a difference or exhaust one string.

Comparison in Sorting
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# Sorting n strings of average length L
# Sorting algorithm: O(n log n) comparisons
# Each comparison: O(L) worst case, but often O(1) average
 
def sort_strings(strings: list[str]) -> list[str]:
    """
    Sort a list of strings.
    
    Time complexity analysis:
    - Number of comparisons: O(n log n)
    - Each comparison: O(L) worst, O(1) average
    - Total: O(n log n × L) worst case
             O(n log n) average case for random strings
    """
    return sorted(strings)
 
 
# Performance tip for repeated comparisons:
# If comparing the same strings many times (e.g., in a set),
# caching the hash can speed things up
 
# Sets use hash for O(1) average lookup, not O(n) comparison
words_set = set(["apple", "banana", "cherry"])
"apple" in words_set  # O(1) average, uses hash first

Early Termination Is Your Friend

String comparison benefits hugely from early termination. Different lengths are caught in O(1). Different first characters are caught in O(1). Only strings that are very similar require examining many characters. Design your data to take advantage of this: putting distinguishing information early in strings can speed up comparisons.

Common Comparison Functions and Methods

Languages provide various functions and methods for string comparison. Here's a reference for common operations:

Comparison Methods Reference
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
s1, s2 = "Hello", "hello"
 
# Equality
s1 == s2                    # False (case-sensitive)
s1 != s2                    # True
 
# Ordering
s1 < s2                     # True (H=72 < h=104)
s1 <= s2                    # True
s1 > s2                     # False
s1 >= s2                    # False
 
# Case-insensitive equality
s1.lower() == s2.lower()    # True
s1.casefold() == s2.casefold()  # True (better for Unicode)
 
# Prefix/suffix checking
s1.startswith("Hel")        # True
s1.endswith("lo")           # True
 
# Containment
"ell" in s1                 # True
 
# Identity (same object, not usually what you want)
s1 is s2                    # Could be True or False (interning)

Java: Never Use == for Strings

In Java, == compares object references, not string contents. Two strings with the same characters may be different objects and compare as unequal with ==. Always use .equals() for string comparison. This is one of the most common Java bugs.

Locale-Aware Comparison

For user-facing sorting and comparison, locale-aware comparison follows the sorting rules of a specific language or culture. This is different from simple character-code comparison.

Examples where locale matters:

Swedish: 'ä' comes after 'z' (not near 'a')
German: 'ß' might sort with 'ss'
Spanish: 'ñ' comes after 'n'
French: Accented characters sort near their base forms

Using locale-aware comparison:

Locale-Aware Comparison
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import locale
 
# Set locale for comparison
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
 
# Use locale.strcoll for locale-aware comparison
words = ['apple', 'Apricot', 'banana', 'élan']
 
# Standard sort (ASCII order)
sorted(words)
# ['Apricot', 'apple', 'banana', 'élan']
 
# Locale-aware sort
sorted(words, key=locale.strxfrm)
# ['apple', 'Apricot', 'banana', 'élan']
# (depends on locale settings)
 
 
# For more control, use the icu library
# pip install PyICU
from icu import Collator, Locale
 
collator = Collator.createInstance(Locale('sv_SE'))  # Swedish
# Now 'ä' will sort after 'z' as in Swedish

When to Use Locale-Aware Comparison

Use locale-aware comparison when displaying sorted data to users or when sorting should follow cultural expectations. For technical operations (hash tables, binary search in code), standard comparison is faster and more predictable. Know your use case.

Common Pitfalls and Bugs

String comparison seems simple, but it hides many common bugs. Here are the pitfalls to avoid:

String Comparison Pitfalls

•Numeric string sorting: "10" < "9" because '1' < '9'. Always convert to numbers for numeric sorting.
•Case sensitivity surprise: "Apple" ≠ "apple". Decide upfront whether comparison should be case-sensitive.
•Whitespace differences: "hello" ≠ "hello " ≠ " hello". Trim strings if whitespace shouldn't matter.
•Unicode normalization: 'é' (single character) ≠ 'e' + '´' (combining accent). Normalize before comparing.
•Reference vs value (Java): Using == instead of .equals() compares memory addresses, not contents.
•Null/undefined handling: Comparing with null can throw or give unexpected results. Always check first.
•Encoding mismatches: Same text in different encodings may have different byte representations. Decode to strings first.

Common Pitfalls with Fixes
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
# Pitfall 1: Numeric string sorting
versions = ["1.0", "10.0", "2.0", "9.0"]
sorted(versions)  # ['1.0', '10.0', '2.0', '9.0'] - WRONG for versions
# Fix: Parse version numbers properly
from packaging import version  # or custom parsing
 
# Pitfall 2: Whitespace
"hello" == "hello "  # False
# Fix: Strip before comparison if needed
s1.strip() == s2.strip()
 
# Pitfall 3: Unicode normalization
import unicodedata
 
s1 = "é"           # Single character (precomposed)
s2 = "e\u0301"    # e + combining acute accent (decomposed)
s1 == s2           # False (different character sequences!)
 
# Fix: Normalize both
unicodedata.normalize('NFC', s1) == unicodedata.normalize('NFC', s2)  # True
 
# Pitfall 4: None handling
def safe_compare(s1, s2):
    if s1 is None or s2 is None:
        return s1 is s2  # Both None = equal, one None = not equal
    return s1 == s2

Defensive Comparison

When writing comparison code, ask: What are the edge cases? Empty strings? None/null? Leading/trailing whitespace? Mixed case? Non-ASCII characters? Test with these cases explicitly, and document your assumptions.

Summary: String Comparison Cost Model

Let's consolidate the cost model for string comparison:

String Comparison: Complete Cost Model
Operation	Time Complexity	Space Complexity	Notes
Equality (==)	O(n) worst, O(1) average	O(1)	Length check enables early exit
Ordering (<, >)	O(n) worst, O(1) average	O(1)	Exits at first difference
Case-insensitive	O(n)	O(n)	Requires lowercase copy
Locale-aware	O(n log n) or more	O(n)	Complex collation rules
startswith/endswith	O(k)	O(1)	k = length of pattern

Key Takeaways

•Equality is O(n) worst case, O(1) average: Length check and early termination help.
•Lexicographical ordering uses character codes: 'A' < 'a' because of ASCII values.
•Case-insensitive comparison requires conversion: This has time and space cost.
•Numeric strings don't sort numerically: "10" < "9" lexicographically.
•Use proper equality operators: In Java, always use .equals(), never ==.
•Locale matters for user-facing text: Different languages have different sorting rules.

Module Complete:

You have now mastered the five fundamental string operations:

Character Access: O(1) retrieval of individual characters
Traversal: O(n) systematic iteration over all characters
Concatenation: O(m+n) for single join, beware O(n²) traps in loops
Substring Extraction: O(k) for extracting k characters
Comparison: O(n) worst, O(1) average, with early termination

These operations form the foundation for all string algorithms. Understanding their individual costs lets you analyze and optimize string-processing code with confidence.

Module Complete

You now understand string comparison comprehensively—from equality semantics to lexicographical ordering, from case sensitivity to locale awareness, from performance optimization to common pitfalls. Combined with the previous pages, you have a complete foundation in basic string operations and their cost models.

5 / 6

Loading learning content...

Data Structures & AlgorithmsBasic String Operations

Basic String Operations & Their Cost Model

LevelBeginner

Duration60 mins

TopicBasic String Operations

5 / 6

String Comparison

Determining Order and Equality

String comparison is one of the most frequently performed operations in computing. Every time you:

Check if a username matches
Sort a list of names
Search for a word in a dictionary
Validate user input
Look up a key in a hash table

...you're comparing strings. Yet comparison is more nuanced than it first appears. Is "Apple" equal to "apple"? Which comes first: "10" or "9"? Is "café" the same as "cafe"?

Understanding comparison deeply—equality vs ordering, case sensitivity, locale effects, and performance characteristics—is essential for writing correct, efficient string-handling code.

Learning Objectives

String Equality: What Does It Mean?

Two strings are equal if and only if they have the same length and exactly the same character at each position.

The equality algorithm:

Compare lengths. If different, strings are not equal (immediate return).
Compare characters one by one from position 0 to length-1.
If any character differs, strings are not equal.
If all characters match, strings are equal.

String Equality Algorithm
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
def strings_equal(s1: str, s2: str) -> bool:
    """
    Check if two strings are exactly equal.
    
    Time: O(n) where n = min(len(s1), len(s2)), O(1) if lengths differ
    Space: O(1)
    """
    # Quick length check - O(1)
    if len(s1) != len(s2):
        return False
    
    # Character-by-character comparison - O(n) worst case
    for i in range(len(s1)):
        if s1[i] != s2[i]:
            return False  # Early termination on first mismatch
    
    return True
 
 
# In practice, use the == operator
"hello" == "hello"   # True
"hello" == "Hello"   # False (case matters)
"hello" == "hello "  # False (length differs)
"" == ""             # True (both empty)

Key properties of string equality:

Reflexive: s == s is always true
Symmetric: if s1 == s2, then s2 == s1
Transitive: if s1 == s2 and s2 == s3, then s1 == s3
Character-sensitive: "A" ≠ "a", "é" ≠ "e"
Length-sensitive: "cat" ≠ "cats"

Why the length check matters:

The length check is O(1) and eliminates many comparisons immediately. If you're comparing a million strings, most pairs will differ in length. This optimization is crucial for performance.

Equality vs Identity

Lexicographical Ordering: Which String Comes First?

The lexicographical ordering algorithm:

To compare strings s1 and s2:

Compare characters at position 0. If different, the string with the smaller character value is "less than" the other.
If characters are equal, move to position 1 and repeat.
Continue until a difference is found or one string ends.
If one string is a prefix of the other, the shorter string is "less than".

Lexicographical Comparison
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
def compare_strings(s1: str, s2: str) -> int:
    """
    Compare two strings lexicographically.
    
    Returns:
        -1 if s1 < s2
         0 if s1 == s2
         1 if s1 > s2
    
    Time: O(min(len(s1), len(s2))) worst case
    """
    min_len = min(len(s1), len(s2))
    
    for i in range(min_len):
        if s1[i] < s2[i]:
            return -1
        elif s1[i] > s2[i]:
            return 1
    
    # All compared characters are equal
    # Shorter string is "less than"
    if len(s1) < len(s2):
        return -1
    elif len(s1) > len(s2):
        return 1
    else:
        return 0
 
 
# Examples using Python's built-in comparison
"apple" < "banana"   # True (a < b)
"apple" < "apply"    # True (e < y at position 4)
"app" < "apple"      # True (prefix is smaller)
"Apple" < "apple"    # True (A=65 < a=97 in ASCII)

Lexicographical Comparison Examples
Comparison	Result	Reason
"abc" vs "abd"	abc < abd	c < d at position 2
"abc" vs "abc"	abc == abc	All characters match
"ab" vs "abc"	ab < abc	ab is a prefix of abc
"abc" vs "ab"	abc > ab	abc contains ab
"10" vs "9"	"10" < "9"	Character '1' (49) < '9' (57)
"A" vs "a"	A < a	ASCII: 'A'=65, 'a'=97

Numeric Strings Don't Sort Numerically!

Character Codes and Comparison

Character comparison is based on numeric character codes. In ASCII (and extended to Unicode), each character has a numeric value, and these values determine ordering.

Key ASCII ranges:

Digits '0'-'9': codes 48-57
Uppercase 'A'-'Z': codes 65-90
Lowercase 'a'-'z': codes 97-122

Important ordering relationships:

Digits < Uppercase < Lowercase
'0' < '9' < 'A' < 'Z' < 'a' < 'z'

Character Codes
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# Getting character codes
ord('A')    # 65
ord('Z')    # 90
ord('a')    # 97
ord('z')    # 122
ord('0')    # 48
ord('9')    # 57
ord(' ')    # 32 (space)
 
# Converting back
chr(65)     # 'A'
chr(97)     # 'a'
 
# Why uppercase < lowercase
'A' < 'a'   # True (65 < 97)
'Z' < 'a'   # True (90 < 97)
 
# Mixed-case sorting consequences
words = ["apple", "Banana", "cherry", "Apricot"]
sorted(words)  # ['Apricot', 'Banana', 'apple', 'cherry']
# All uppercase comes before all lowercase!
 
# This is rarely what humans expect
# Use case-insensitive sorting for "natural" order

Unicode considerations:

Unicode extends far beyond ASCII. Different scripts, accented characters, and emoji all have code points that affect ordering:

Accented characters: 'é' (U+00E9, 233) comes after all basic ASCII
Non-Latin scripts: Greek, Cyrillic, CJK characters have their own code point ranges
Emoji: Very high code points (U+1F600+)

For locale-aware sorting (where 'é' sorts near 'e'), you need locale-specific comparison functions.

Know Your Use Case

Case-Sensitive vs Case-Insensitive Comparison

By default, string comparison is case-sensitive—'A' and 'a' are different characters with different codes. But many applications need case-insensitive comparison:

Username matching (allow "JohnDoe" or "johndoe")
Search functionality (find "Python" when searching for "python")
Human-readable sorting (group by letter regardless of case)

Naive approach: Convert to same case

Case-Insensitive Comparison
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Naive: Convert both to lowercase (or uppercase)
def equals_ignore_case(s1: str, s2: str) -> bool:
    """Case-insensitive equality check."""
    return s1.lower() == s2.lower()
 
equals_ignore_case("Hello", "hello")   # True
equals_ignore_case("PYTHON", "Python") # True
 
# Case-insensitive sorting
words = ["apple", "Banana", "cherry", "Apricot"]
sorted(words, key=str.lower)  # ['apple', 'Apricot', 'Banana', 'cherry']
 
# Case-insensitive search
def contains_ignore_case(haystack: str, needle: str) -> bool:
    return needle.lower() in haystack.lower()
 
contains_ignore_case("Hello World", "WORLD")  # True

The hidden complexity: Unicode case folding

For ASCII (A-Z, a-z), case conversion is trivial: add or subtract 32. But Unicode has many more complexities:

Some characters have no case (digits, symbols)
Some characters have multiple case mappings
The German 'ß' lowercases to "ss"—one character becomes two!
Turkish has 'I' → 'ı' (dotless i), not 'i'

For correct internationalized code, use proper Unicode-aware case folding.

Unicode Case Complexity
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Simple case: ASCII letters
"ABC".lower()       # "abc"
"abc".upper()       # "ABC"
 
# German sharp S (ß)
"ß".upper()         # "SS" (one char becomes two!)
"SS".lower()        # "ss" (two chars stay two)
"ß".lower()         # "ß" (already lowercase)
 
# casefold() is more aggressive for comparison
"ß".casefold()      # "ss" (better for matching)
"SS".casefold()     # "ss"
"ß".casefold() == "SS".casefold()  # True
 
# Recommendation: Use casefold() for case-insensitive comparison
def equals_unicode_safe(s1: str, s2: str) -> bool:
    return s1.casefold() == s2.casefold()

Case Conversion Has a Cost

Comparison Performance: The Cost Model

Understanding the performance of string comparison is crucial for writing efficient algorithms:

Equality comparison (s1 == s2):

Equality Comparison Performance
Case	Time Complexity	Reason
Different lengths	O(1)	Length check fails immediately
Equal strings	O(n)	Must check all n characters
Differ at position k	O(k)	Early termination at first mismatch
Average case (random strings)	O(1)	Most random strings differ early

Why is the average case O(1)?

For random strings, the probability of matching at position 0 is 1/alphabet_size. For a 26-letter alphabet:

P(match at pos 0) ≈ 1/26
P(match at pos 0 and 1) ≈ 1/676
P(match at first k positions) ≈ (1/26)^k

The expected number of comparisons is a geometric series that converges to a constant. For English text, the "average" comparison examines only a few characters before finding a difference.

Ordering comparison (s1 < s2):

Same cost model as equality—we examine characters until we find a difference or exhaust one string.

Comparison in Sorting
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
# Sorting n strings of average length L
# Sorting algorithm: O(n log n) comparisons
# Each comparison: O(L) worst case, but often O(1) average
 
def sort_strings(strings: list[str]) -> list[str]:
    """
    Sort a list of strings.
    
    Time complexity analysis:
    - Number of comparisons: O(n log n)
    - Each comparison: O(L) worst, O(1) average
    - Total: O(n log n × L) worst case
             O(n log n) average case for random strings
    """
    return sorted(strings)
 
 
# Performance tip for repeated comparisons:
# If comparing the same strings many times (e.g., in a set),
# caching the hash can speed things up
 
# Sets use hash for O(1) average lookup, not O(n) comparison
words_set = set(["apple", "banana", "cherry"])
"apple" in words_set  # O(1) average, uses hash first

Early Termination Is Your Friend

Common Comparison Functions and Methods

Languages provide various functions and methods for string comparison. Here's a reference for common operations:

Comparison Methods Reference
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
s1, s2 = "Hello", "hello"
 
# Equality
s1 == s2                    # False (case-sensitive)
s1 != s2                    # True
 
# Ordering
s1 < s2                     # True (H=72 < h=104)
s1 <= s2                    # True
s1 > s2                     # False
s1 >= s2                    # False
 
# Case-insensitive equality
s1.lower() == s2.lower()    # True
s1.casefold() == s2.casefold()  # True (better for Unicode)
 
# Prefix/suffix checking
s1.startswith("Hel")        # True
s1.endswith("lo")           # True
 
# Containment
"ell" in s1                 # True
 
# Identity (same object, not usually what you want)
s1 is s2                    # Could be True or False (interning)

Java: Never Use == for Strings

Locale-Aware Comparison

For user-facing sorting and comparison, locale-aware comparison follows the sorting rules of a specific language or culture. This is different from simple character-code comparison.

Examples where locale matters:

Swedish: 'ä' comes after 'z' (not near 'a')
German: 'ß' might sort with 'ss'
Spanish: 'ñ' comes after 'n'
French: Accented characters sort near their base forms

Using locale-aware comparison:

Locale-Aware Comparison
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import locale
 
# Set locale for comparison
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
 
# Use locale.strcoll for locale-aware comparison
words = ['apple', 'Apricot', 'banana', 'élan']
 
# Standard sort (ASCII order)
sorted(words)
# ['Apricot', 'apple', 'banana', 'élan']
 
# Locale-aware sort
sorted(words, key=locale.strxfrm)
# ['apple', 'Apricot', 'banana', 'élan']
# (depends on locale settings)
 
 
# For more control, use the icu library
# pip install PyICU
from icu import Collator, Locale
 
collator = Collator.createInstance(Locale('sv_SE'))  # Swedish
# Now 'ä' will sort after 'z' as in Swedish

When to Use Locale-Aware Comparison

Common Pitfalls and Bugs

String comparison seems simple, but it hides many common bugs. Here are the pitfalls to avoid:

String Comparison Pitfalls

•Numeric string sorting: "10" < "9" because '1' < '9'. Always convert to numbers for numeric sorting.
•Case sensitivity surprise: "Apple" ≠ "apple". Decide upfront whether comparison should be case-sensitive.
•Whitespace differences: "hello" ≠ "hello " ≠ " hello". Trim strings if whitespace shouldn't matter.
•Unicode normalization: 'é' (single character) ≠ 'e' + '´' (combining accent). Normalize before comparing.
•Reference vs value (Java): Using == instead of .equals() compares memory addresses, not contents.
•Null/undefined handling: Comparing with null can throw or give unexpected results. Always check first.
•Encoding mismatches: Same text in different encodings may have different byte representations. Decode to strings first.

Common Pitfalls with Fixes
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
# Pitfall 1: Numeric string sorting
versions = ["1.0", "10.0", "2.0", "9.0"]
sorted(versions)  # ['1.0', '10.0', '2.0', '9.0'] - WRONG for versions
# Fix: Parse version numbers properly
from packaging import version  # or custom parsing
 
# Pitfall 2: Whitespace
"hello" == "hello "  # False
# Fix: Strip before comparison if needed
s1.strip() == s2.strip()
 
# Pitfall 3: Unicode normalization
import unicodedata
 
s1 = "é"           # Single character (precomposed)
s2 = "e\u0301"    # e + combining acute accent (decomposed)
s1 == s2           # False (different character sequences!)
 
# Fix: Normalize both
unicodedata.normalize('NFC', s1) == unicodedata.normalize('NFC', s2)  # True
 
# Pitfall 4: None handling
def safe_compare(s1, s2):
    if s1 is None or s2 is None:
        return s1 is s2  # Both None = equal, one None = not equal
    return s1 == s2

Defensive Comparison

Summary: String Comparison Cost Model

Let's consolidate the cost model for string comparison:

String Comparison: Complete Cost Model
Operation	Time Complexity	Space Complexity	Notes
Equality (==)	O(n) worst, O(1) average	O(1)	Length check enables early exit
Ordering (<, >)	O(n) worst, O(1) average	O(1)	Exits at first difference
Case-insensitive	O(n)	O(n)	Requires lowercase copy
Locale-aware	O(n log n) or more	O(n)	Complex collation rules
startswith/endswith	O(k)	O(1)	k = length of pattern

Key Takeaways

•Equality is O(n) worst case, O(1) average: Length check and early termination help.
•Lexicographical ordering uses character codes: 'A' < 'a' because of ASCII values.
•Case-insensitive comparison requires conversion: This has time and space cost.
•Numeric strings don't sort numerically: "10" < "9" lexicographically.
•Use proper equality operators: In Java, always use .equals(), never ==.
•Locale matters for user-facing text: Different languages have different sorting rules.

Module Complete:

You have now mastered the five fundamental string operations:

Character Access: O(1) retrieval of individual characters
Traversal: O(n) systematic iteration over all characters
Concatenation: O(m+n) for single join, beware O(n²) traps in loops
Substring Extraction: O(k) for extracting k characters
Comparison: O(n) worst, O(1) average, with early termination

These operations form the foundation for all string algorithms. Understanding their individual costs lets you analyze and optimize string-processing code with confidence.

Module Complete

5 / 6