String Comparison - Learning Module

Loading content...

0/276

Lexicographical Ordering

How Do You Sort Words?

Open any dictionary, and you'll find words arranged in a specific order. "Apple" comes before "Banana," which comes before "Cherry." This ordering seems obvious—we learned it in childhood. But have you ever stopped to ask: How exactly does a computer determine that "apple" should come before "banana"?

The answer lies in lexicographical ordering—a precise, character-by-character comparison algorithm that forms the foundation of string sorting, searching, and comparison throughout computer science. Understanding this ordering deeply is essential because it:

Determines how strings are sorted in every programming language
Controls how binary search works on sorted string collections
Influences the behavior of virtually every string comparison operation
Contains subtle corner cases that cause bugs when misunderstood

What You Will Learn

By the end of this page, you will understand lexicographical ordering from first principles—not as a vague concept of 'alphabetical order' but as a rigorous, step-by-step algorithm. You'll see how character codes drive comparison, how strings of different lengths are handled, and why this simple concept has profound implications for algorithms and data structures.

Defining Lexicographical Order

Let's establish a precise definition before exploring implications:

Definition:

Lexicographical order (also called lexical order, dictionary order, or alphabetical order) is a method of comparing sequences (including strings) by examining their elements one at a time, from left to right, until a difference is found or one sequence is exhausted.

The name comes from 'lexicon'—a dictionary—because this is how dictionaries order words.

The Algorithm:

To compare two strings A and B lexicographically:

Compare characters at position 0: If A[0] < B[0], then A < B. If A[0] > B[0], then A > B. If A[0] = B[0], continue.
Compare characters at position 1: Apply the same logic. Continue for each position.
If one string is exhausted: The shorter string is considered 'less than' the longer string if all compared characters were equal.
If both strings are identical: They are equal.

Critical insight: The comparison is determined by the first differing character. All subsequent characters are irrelevant once a difference is found.

Lexicographical Comparison Examples
String A	String B	First Difference	Result	Reason
apple	banana	Position 0: 'a' vs 'b'	A < B	'a' (97) < 'b' (98)
apple	application	Position 5: end vs 'c'	A < B	A is prefix of B, shorter comes first
apple	apple	No difference	A = B	Identical strings
Apple	apple	Position 0: 'A' vs 'a'	A < B	'A' (65) < 'a' (97) in ASCII
123	45	Position 0: '1' vs '4'	A < B	'1' (49) < '4' (52)
zoo	aardvark	Position 0: 'z' vs 'a'	A > B	'z' (122) > 'a' (97)

Numeric Strings: A Common Trap

Lexicographical ordering compares strings character by character, NOT as numeric values. This means "9" > "10" because '9' (57) > '1' (49). The comparison stops at the first character! This is a frequent source of bugs when sorting numbers stored as strings.

The Role of Character Codes

Lexicographical comparison fundamentally relies on the numeric value assigned to each character by the underlying character encoding (typically Unicode, with ASCII as a subset).

Key principle: When comparing two characters, we compare their code points. The character with the lower code point is considered 'less than' the other.

ASCII Code Point Ranges (critical knowledge):

Range	Characters	Decimal Codes
Digits	'0' - '9'	48 - 57
Uppercase	'A' - 'Z'	65 - 90
Lowercase	'a' - 'z'	97 - 122

This ordering has profound implications:

All digits come before all letters: '9' (57) < 'A' (65)
All uppercase letters come before all lowercase letters: 'Z' (90) < 'a' (97)
Within letters, order is alphabetical: 'A' < 'B' < ... < 'Z' and 'a' < 'b' < ... < 'z'
Space and punctuation have their own codes: Space is 32, which comes before digits, which come before letters

Surprising Comparisons:

"Zoo"     < "apple"   (Z=90 < a=97)
"ZEBRA"   < "ant"     (Z=90 < a=97)
"123"     < "ABC"     (1=49 < A=65)
" hello"  < "hello"   (space=32 < h=104)
"Hello!" < "Hello?"  (!='33 < ?=63)

These results may surprise you if you expect 'intuitive' alphabetical ordering.

Expected Comparisons:

"apple"   < "banana"  (a=97 < b=98)
"car"     < "card"    (prefix rule)
"dog"     < "dogs"    (prefix rule)
"abc"     < "abd"     (c=99 < d=100)
"aaa"     < "aab"     (a=97 < b=98)

These match intuitive dictionary order because they involve only lowercase letters.

Memorize the Key Ranges

For string manipulation and algorithm design, commit these to memory: digits (48-57), uppercase (65-90), lowercase (97-122). The gap between uppercase and lowercase is exactly 32, which is why char.toLowerCase() often works by adding 32 to the ASCII value (for A-Z).

The Comparison Algorithm in Detail

Let's formalize the lexicographical comparison algorithm with pseudocode that reveals exactly what happens at each step:

function lexicographicalCompare(A, B):
    i = 0
    
    // Step 1: Compare character by character
    while i < length(A) AND i < length(B):
        if A[i] < B[i]:
            return -1        // A is less than B
        else if A[i] > B[i]:
            return +1        // A is greater than B
        else:
            i = i + 1        // Characters equal, continue
    
    // Step 2: All compared characters were equal
    // Now check lengths
    if length(A) < length(B):
        return -1            // A is a prefix of B, A is less
    else if length(A) > length(B):
        return +1            // B is a prefix of A, B is less
    else:
        return 0             // Identical strings

Time Complexity: O(min(|A|, |B|)) — We compare at most as many characters as the shorter string has.

Space Complexity: O(1) — Only a single index variable is needed.

Trace Example: Comparing "application" and "apple"

Step	i	A[i]	B[i]	Comparison	Action
1	0	'a'	'a'	97 = 97	Continue
2	1	'p'	'p'	112 = 112	Continue
3	2	'p'	'p'	112 = 112	Continue
4	3	'l'	'l'	108 = 108	Continue
5	4	'i'	'e'	105 > 101	Return +1

Result: "application" > "apple" because 'i' > 'e' at position 4.

Notice that we never looked at the remaining characters in either string. The comparison was decided at the first point of difference.

Trace Example: Comparing "app" and "apple"

Step	i	A[i]	B[i]	Comparison	Action
1	0	'a'	'a'	97 = 97	Continue
2	1	'p'	'p'	112 = 112	Continue
3	2	'p'	'p'	112 = 112	Continue
4	3	-	'l'	A exhausted	-

After the loop: length(A) = 3 < length(B) = 5

Result: "app" < "apple" because "app" is a proper prefix of "apple".

The prefix rule states: if one string is a prefix of another (all characters match up to the shorter string's length), the shorter string is considered less than the longer one.

The Prefix Rule Explained

The prefix rule isn't arbitrary—it's mathematically consistent. We're essentially comparing strings as if the shorter one had infinite 'null' characters appended, where null is less than any actual character. Since "app" + (nothing) comes before "app" + "le", "app" < "apple".

Why This Ordering Matters for Algorithms

Lexicographical order isn't just a convenience—it's a total ordering with properties that make it essential for fundamental algorithms:

1. Total Ordering Properties:

A total ordering satisfies three properties that enable sorting and searching:

Antisymmetry: If A ≤ B and B ≤ A, then A = B
Transitivity: If A ≤ B and B ≤ C, then A ≤ C
Connexity: For any A and B, either A ≤ B or B ≤ A (or both if equal)

Lexicographical order satisfies all three, making it a valid comparison function for any algorithm that requires ordered data.

2. Binary Search:

Binary search requires a sorted collection. When strings are sorted lexicographically, we can efficiently find any string (or determine its absence) in O(log n) time:

Given sorted list: ["apple", "banana", "cherry", "date", "elderberry"]

Search for "cherry":
- Middle: "cherry" — Found!

Search for "coconut":
- Middle: "cherry" — "coconut" > "cherry", search right
- Middle: "date" — "coconut" < "date", search left  
- No elements left — Not found, would insert between "cherry" and "date"

3. Balanced Search Trees (BST, Red-Black, AVL):

Tree-based data structures like TreeMap and TreeSet maintain elements in sorted order. For strings, this means lexicographical order:

                    "mango"
                   /       \
            "cherry"        "papaya"
            /     \              \
      "apple"   "grape"      "strawberry"

In-order traversal yields: apple, cherry, grape, mango, papaya, strawberry

4. Range Queries:

Lexicographical order enables powerful range queries:

"Find all words starting with 'pr'" → Find range ["pr", "ps")
"Find all words between 'cat' and 'dog'" → Find range ["cat", "dog"]

These operations are O(log n + k) where k is the number of results.

5. Sorting Algorithms:

Every sorting algorithm that works on strings (quicksort, mergesort, heapsort) uses lexicographical comparison as its comparator. Understanding the comparison is understanding the sort.

Algorithms That Rely on Lexicographical Order

•Binary Search — O(log n) lookup in sorted string arrays
•Merge Sort — Stable O(n log n) string sorting
•Quick Sort — Average O(n log n) string sorting
•Heap Sort / Priority Queues — Maintain min/max string efficiently
•Tree-based Maps/Sets — Ordered string collections (TreeMap, TreeSet)
•Trie Operations — Dictionary ordering naturally emerges from trie structure
•Suffix Arrays — Sorted array of all suffixes for pattern matching
•Lexicographic BFS/DFS — Traverse in dictionary order

A Universal Comparator

Lexicographical comparison is the default string comparator in virtually every programming language. Whenever you call sort() on a string array without specifying a custom comparator, you're using lexicographical order. Understanding it deeply means understanding how strings behave across the entire language ecosystem.

Corner Cases and Common Gotchas

Lexicographical ordering, while simple in principle, has edge cases that frequently cause bugs:

1. Empty String Comparisons:

The empty string "" is lexicographically less than every non-empty string:

"" < "a" (empty is prefix of any string)
"" < " " (even a single space)
"" = "" (equal to itself)

2. Strings with Spaces:

Space (code 32) is a valid character that compares less than all letters and digits:

" apple" < "apple" (space < 'a')
"a b" < "ab" (space < 'b' at position 1)
"hello world" < "helloworld" (space < 'w')

3. Mixed Case Strings:

Uppercase letters have lower ASCII values than lowercase:

"Zebra" < "apple" ('Z'=90 < 'a'=97)
"ABC" < "abc" ('A'=65 < 'a'=97)
"aBC" < "abc" ('B'=66 < 'b'=98 at position 1)

4. Numeric Strings:

This is the most common source of bugs:

"9" > "10" ('9'=57 > '1'=49)
"100" < "20" ('1'=49 < '2'=50)
"1" < "10" < "100" < "2" (prefix rules and first-char comparison)

If you need numerical ordering, you must either:

Convert to numbers before comparing
Pad with leading zeros for fixed-width comparison
Use a custom comparator

5. Unicode and Extended Characters:

Beyond ASCII, characters from different scripts have their own code points:

"café" — The 'é' (code 233) > 'e' (101)
"naïve" — The 'ï' (code 239) > 'i' (105)
Greek, Cyrillic, Chinese characters all have code points above ASCII

For proper multilingual sorting, you need locale-aware comparison (covered later), not pure lexicographical comparison.

Bug-Prone Patterns

Avoid these mistakes:

• Sorting numeric strings lexicographically • Assuming case-insensitive comparison • Ignoring leading/trailing whitespace • Treating 'ä' as equivalent to 'a' • Assuming all letters compare similarly

Safe Patterns

Do these instead:

• Convert to numbers for numeric comparison • Explicitly normalize case before comparing • Trim strings before comparison if needed • Use locale-aware comparison for text • Document comparison semantics clearly

The Version Number Problem

Version strings like "1.9" vs "1.10" are notoriously problematic. Lexicographically, "1.10" < "1.9" because '1' < '9' at the third position. This has caused real bugs in software update systems, package managers, and deployment tools. Always use semantic version comparison libraries for version strings.

Lexicographical Order in Practice

Understanding lexicographical order unlocks elegant solutions to many string problems:

Problem 1: Finding the Lexicographically Smallest String

Given a list of strings, find the one that comes first in dictionary order:

Input: ["banana", "apple", "cherry"]
Output: "apple"

Solution: Simply use min() with default string comparison—it uses lexicographical order.

Problem 2: Next Lexicographical Permutation

Given a string, find the next string in lexicographical order among all permutations:

Input: "abc"
Output: "acb"

Full sequence: abc → acb → bac → bca → cab → cba

This is a classic algorithm problem with applications in combinatorics and optimization.

Problem 3: Longest Common Prefix

When strings are sorted lexicographically, the longest common prefix of the entire array equals the longest common prefix of just the first and last strings:

Sorted: ["flower", "flow", "flight"]
Wait—sorted: ["flight", "flow", "flower"]

LCP of "flight" and "flower" = "fl"
This is also the LCP of all strings in the array!

Why? Because in sorted order, all strings with the same prefix are adjacent. The first and last strings have the minimum overlap.

Problem 4: String Comparison in Sorting

When you call Arrays.sort(stringArray) or list.sort() on strings, the runtime uses lexicographical comparison. Understanding this explains:

Why ["10", "2", "1"] sorts to ["1", "10", "2"]
Why ["Banana", "apple"] sorts to ["Banana", "apple"] (B < a)
Why sorting is stable for equal strings

Problem 5: Binary Search for String Insertion

To insert a string into a sorted list while maintaining order:

Sorted list: ["apple", "cherry", "date"]
Insert: "banana"

Binary search: "banana" > "apple", "banana" < "cherry"
Insert at index 1: ["apple", "banana", "cherry", "date"]

Problem 6: Prefix-Based Queries

Given strings sorted lexicographically, all strings starting with a prefix form a contiguous range:

Sorted: ["app", "apple", "application", "apply", "banana", "band"]

Find all words starting with "app":
- Binary search for "app" → index 0
- Binary search for "apq" (first string > any "app..." prefix) → index 4
- Answer: indices [0, 4) = ["app", "apple", "application", "apply"]

Real-World Applications

•Autocomplete systems — Prefix search in sorted word lists returns suggestions in dictionary order
•Database indexing — B-trees and indexes use lexicographical ordering for string columns
•File systems — Directory listings are often sorted lexicographically
•DNS resolution — Domain name comparisons follow lexicographical rules
•Log file analysis — Sorting log entries by timestamp strings (when formatted correctly)
•Search engines — Inverted indexes store terms in lexicographical order for efficient lookup

Comparison Cost Analysis

String comparison is not a constant-time operation. Understanding its cost is crucial for algorithm analysis:

Basic Comparison Cost:

Comparing two strings of lengths m and n:

Best case: O(1) — First characters differ
Average case: O(min(m, n)) — Must scan until difference or end
Worst case: O(min(m, n)) — Strings are equal or one is prefix of other

Impact on Sorting:

For n strings of average length L:

Each comparison is O(L)
Sorting makes O(n log n) comparisons
Total: O(n × L × log n)

For large strings, this L factor can dominate!

Impact on Binary Search:

Searching in n strings of average length L:

Each comparison is O(L)
Binary search makes O(log n) comparisons
Total: O(L × log n)

Again, string length matters.

String Operation Complexity (n strings, average length L)
Operation	Comparisons	Per Comparison	Total
Sort (general)	O(n log n)	O(L)	O(n × L × log n)
Binary Search	O(log n)	O(L)	O(L × log n)
Linear Search	O(n)	O(L)	O(n × L)
Find Min	O(n)	O(L)	O(n × L)
Merge Two Sorted Lists	O(n + m)	O(L)	O((n+m) × L)

Optimization Insight

For very long strings with common prefixes, comparison can be expensive. Techniques like hashing (compare hash codes first, then verify), tries (avoid redundant prefix comparisons), or normalized forms (precompute comparison keys) can dramatically improve performance in string-heavy algorithms.

Summary: Lexicographical Ordering Mastered

We've developed a rigorous understanding of lexicographical ordering—the foundation of string comparison in computer science. Let's consolidate the essential concepts:

Key Takeaways

•Character-by-character comparison — Strings are compared left-to-right until a difference is found or one string is exhausted.
•Character codes determine order — ASCII/Unicode code points define 'less than' and 'greater than' for characters.
•Prefix rule — A shorter string that matches a longer string's prefix is considered less than the longer string.
•First difference decides — Once a differing character is found, subsequent characters don't matter.
•Case matters — Uppercase letters have lower code points than lowercase letters in ASCII.
•Numeric strings sort incorrectly — '9' > '10' lexicographically; use numeric comparison for numbers.
•Comparison is O(L) — String comparison time depends on string length, affecting algorithm complexity.
•Foundation for algorithms — Sorting, searching, trees, and tries all rely on lexicographical order.

What's next:

Lexicographical comparison treats uppercase and lowercase letters as completely different characters. But often we want 'Apple' and 'apple' to compare equal. The next page explores case sensitivity—how it affects string comparison, when it matters, and strategies for case-insensitive operations.

Page Complete

You now understand lexicographical ordering with precision—not as vague 'alphabetical order' but as a concrete algorithm with defined behavior for every case. This foundation prepares you to understand the nuances of case sensitivity, locale awareness, and string equality in the coming pages.