Loading content...
Open any dictionary, and you'll find words arranged in a specific order. "Apple" comes before "Banana," which comes before "Cherry." This ordering seems obvious—we learned it in childhood. But have you ever stopped to ask: How exactly does a computer determine that "apple" should come before "banana"?
The answer lies in lexicographical ordering—a precise, character-by-character comparison algorithm that forms the foundation of string sorting, searching, and comparison throughout computer science. Understanding this ordering deeply is essential because it:
By the end of this page, you will understand lexicographical ordering from first principles—not as a vague concept of 'alphabetical order' but as a rigorous, step-by-step algorithm. You'll see how character codes drive comparison, how strings of different lengths are handled, and why this simple concept has profound implications for algorithms and data structures.
Let's establish a precise definition before exploring implications:
Definition:
Lexicographical order (also called lexical order, dictionary order, or alphabetical order) is a method of comparing sequences (including strings) by examining their elements one at a time, from left to right, until a difference is found or one sequence is exhausted.
The name comes from 'lexicon'—a dictionary—because this is how dictionaries order words.
The Algorithm:
To compare two strings A and B lexicographically:
Compare characters at position 0: If A[0] < B[0], then A < B. If A[0] > B[0], then A > B. If A[0] = B[0], continue.
Compare characters at position 1: Apply the same logic. Continue for each position.
If one string is exhausted: The shorter string is considered 'less than' the longer string if all compared characters were equal.
If both strings are identical: They are equal.
Critical insight: The comparison is determined by the first differing character. All subsequent characters are irrelevant once a difference is found.
| String A | String B | First Difference | Result | Reason |
|---|---|---|---|---|
| apple | banana | Position 0: 'a' vs 'b' | A < B | 'a' (97) < 'b' (98) |
| apple | application | Position 5: end vs 'c' | A < B | A is prefix of B, shorter comes first |
| apple | apple | No difference | A = B | Identical strings |
| Apple | apple | Position 0: 'A' vs 'a' | A < B | 'A' (65) < 'a' (97) in ASCII |
| 123 | 45 | Position 0: '1' vs '4' | A < B | '1' (49) < '4' (52) |
| zoo | aardvark | Position 0: 'z' vs 'a' | A > B | 'z' (122) > 'a' (97) |
Lexicographical ordering compares strings character by character, NOT as numeric values. This means "9" > "10" because '9' (57) > '1' (49). The comparison stops at the first character! This is a frequent source of bugs when sorting numbers stored as strings.
Lexicographical comparison fundamentally relies on the numeric value assigned to each character by the underlying character encoding (typically Unicode, with ASCII as a subset).
Key principle: When comparing two characters, we compare their code points. The character with the lower code point is considered 'less than' the other.
ASCII Code Point Ranges (critical knowledge):
| Range | Characters | Decimal Codes |
|---|---|---|
| Digits | '0' - '9' | 48 - 57 |
| Uppercase | 'A' - 'Z' | 65 - 90 |
| Lowercase | 'a' - 'z' | 97 - 122 |
This ordering has profound implications:
All digits come before all letters: '9' (57) < 'A' (65)
All uppercase letters come before all lowercase letters: 'Z' (90) < 'a' (97)
Within letters, order is alphabetical: 'A' < 'B' < ... < 'Z' and 'a' < 'b' < ... < 'z'
Space and punctuation have their own codes: Space is 32, which comes before digits, which come before letters
Surprising Comparisons:
"Zoo" < "apple" (Z=90 < a=97)
"ZEBRA" < "ant" (Z=90 < a=97)
"123" < "ABC" (1=49 < A=65)
" hello" < "hello" (space=32 < h=104)
"Hello!" < "Hello?" (!='33 < ?=63)
These results may surprise you if you expect 'intuitive' alphabetical ordering.
Expected Comparisons:
"apple" < "banana" (a=97 < b=98)
"car" < "card" (prefix rule)
"dog" < "dogs" (prefix rule)
"abc" < "abd" (c=99 < d=100)
"aaa" < "aab" (a=97 < b=98)
These match intuitive dictionary order because they involve only lowercase letters.
For string manipulation and algorithm design, commit these to memory: digits (48-57), uppercase (65-90), lowercase (97-122). The gap between uppercase and lowercase is exactly 32, which is why char.toLowerCase() often works by adding 32 to the ASCII value (for A-Z).
Let's formalize the lexicographical comparison algorithm with pseudocode that reveals exactly what happens at each step:
function lexicographicalCompare(A, B):
i = 0
// Step 1: Compare character by character
while i < length(A) AND i < length(B):
if A[i] < B[i]:
return -1 // A is less than B
else if A[i] > B[i]:
return +1 // A is greater than B
else:
i = i + 1 // Characters equal, continue
// Step 2: All compared characters were equal
// Now check lengths
if length(A) < length(B):
return -1 // A is a prefix of B, A is less
else if length(A) > length(B):
return +1 // B is a prefix of A, B is less
else:
return 0 // Identical strings
Time Complexity: O(min(|A|, |B|)) — We compare at most as many characters as the shorter string has.
Space Complexity: O(1) — Only a single index variable is needed.
Trace Example: Comparing "application" and "apple"
| Step | i | A[i] | B[i] | Comparison | Action |
|---|---|---|---|---|---|
| 1 | 0 | 'a' | 'a' | 97 = 97 | Continue |
| 2 | 1 | 'p' | 'p' | 112 = 112 | Continue |
| 3 | 2 | 'p' | 'p' | 112 = 112 | Continue |
| 4 | 3 | 'l' | 'l' | 108 = 108 | Continue |
| 5 | 4 | 'i' | 'e' | 105 > 101 | Return +1 |
Result: "application" > "apple" because 'i' > 'e' at position 4.
Notice that we never looked at the remaining characters in either string. The comparison was decided at the first point of difference.
Trace Example: Comparing "app" and "apple"
| Step | i | A[i] | B[i] | Comparison | Action |
|---|---|---|---|---|---|
| 1 | 0 | 'a' | 'a' | 97 = 97 | Continue |
| 2 | 1 | 'p' | 'p' | 112 = 112 | Continue |
| 3 | 2 | 'p' | 'p' | 112 = 112 | Continue |
| 4 | 3 | - | 'l' | A exhausted | - |
After the loop: length(A) = 3 < length(B) = 5
Result: "app" < "apple" because "app" is a proper prefix of "apple".
The prefix rule states: if one string is a prefix of another (all characters match up to the shorter string's length), the shorter string is considered less than the longer one.
The prefix rule isn't arbitrary—it's mathematically consistent. We're essentially comparing strings as if the shorter one had infinite 'null' characters appended, where null is less than any actual character. Since "app" + (nothing) comes before "app" + "le", "app" < "apple".
Lexicographical order isn't just a convenience—it's a total ordering with properties that make it essential for fundamental algorithms:
1. Total Ordering Properties:
A total ordering satisfies three properties that enable sorting and searching:
Lexicographical order satisfies all three, making it a valid comparison function for any algorithm that requires ordered data.
2. Binary Search:
Binary search requires a sorted collection. When strings are sorted lexicographically, we can efficiently find any string (or determine its absence) in O(log n) time:
Given sorted list: ["apple", "banana", "cherry", "date", "elderberry"]
Search for "cherry":
- Middle: "cherry" — Found!
Search for "coconut":
- Middle: "cherry" — "coconut" > "cherry", search right
- Middle: "date" — "coconut" < "date", search left
- No elements left — Not found, would insert between "cherry" and "date"
3. Balanced Search Trees (BST, Red-Black, AVL):
Tree-based data structures like TreeMap and TreeSet maintain elements in sorted order. For strings, this means lexicographical order:
"mango"
/ \
"cherry" "papaya"
/ \ \
"apple" "grape" "strawberry"
In-order traversal yields: apple, cherry, grape, mango, papaya, strawberry
4. Range Queries:
Lexicographical order enables powerful range queries:
These operations are O(log n + k) where k is the number of results.
5. Sorting Algorithms:
Every sorting algorithm that works on strings (quicksort, mergesort, heapsort) uses lexicographical comparison as its comparator. Understanding the comparison is understanding the sort.
Lexicographical comparison is the default string comparator in virtually every programming language. Whenever you call sort() on a string array without specifying a custom comparator, you're using lexicographical order. Understanding it deeply means understanding how strings behave across the entire language ecosystem.
Lexicographical ordering, while simple in principle, has edge cases that frequently cause bugs:
1. Empty String Comparisons:
The empty string "" is lexicographically less than every non-empty string:
"" < "a" (empty is prefix of any string)"" < " " (even a single space)"" = "" (equal to itself)2. Strings with Spaces:
Space (code 32) is a valid character that compares less than all letters and digits:
" apple" < "apple" (space < 'a')"a b" < "ab" (space < 'b' at position 1)"hello world" < "helloworld" (space < 'w')3. Mixed Case Strings:
Uppercase letters have lower ASCII values than lowercase:
"Zebra" < "apple" ('Z'=90 < 'a'=97)"ABC" < "abc" ('A'=65 < 'a'=97)"aBC" < "abc" ('B'=66 < 'b'=98 at position 1)4. Numeric Strings:
This is the most common source of bugs:
"9" > "10" ('9'=57 > '1'=49)"100" < "20" ('1'=49 < '2'=50)"1" < "10" < "100" < "2" (prefix rules and first-char comparison)If you need numerical ordering, you must either:
5. Unicode and Extended Characters:
Beyond ASCII, characters from different scripts have their own code points:
"café" — The 'é' (code 233) > 'e' (101)"naïve" — The 'ï' (code 239) > 'i' (105)For proper multilingual sorting, you need locale-aware comparison (covered later), not pure lexicographical comparison.
Avoid these mistakes:
• Sorting numeric strings lexicographically • Assuming case-insensitive comparison • Ignoring leading/trailing whitespace • Treating 'ä' as equivalent to 'a' • Assuming all letters compare similarly
Do these instead:
• Convert to numbers for numeric comparison • Explicitly normalize case before comparing • Trim strings before comparison if needed • Use locale-aware comparison for text • Document comparison semantics clearly
Version strings like "1.9" vs "1.10" are notoriously problematic. Lexicographically, "1.10" < "1.9" because '1' < '9' at the third position. This has caused real bugs in software update systems, package managers, and deployment tools. Always use semantic version comparison libraries for version strings.
Understanding lexicographical order unlocks elegant solutions to many string problems:
Problem 1: Finding the Lexicographically Smallest String
Given a list of strings, find the one that comes first in dictionary order:
Input: ["banana", "apple", "cherry"]
Output: "apple"
Solution: Simply use min() with default string comparison—it uses lexicographical order.
Problem 2: Next Lexicographical Permutation
Given a string, find the next string in lexicographical order among all permutations:
Input: "abc"
Output: "acb"
Full sequence: abc → acb → bac → bca → cab → cba
This is a classic algorithm problem with applications in combinatorics and optimization.
Problem 3: Longest Common Prefix
When strings are sorted lexicographically, the longest common prefix of the entire array equals the longest common prefix of just the first and last strings:
Sorted: ["flower", "flow", "flight"]
Wait—sorted: ["flight", "flow", "flower"]
LCP of "flight" and "flower" = "fl"
This is also the LCP of all strings in the array!
Why? Because in sorted order, all strings with the same prefix are adjacent. The first and last strings have the minimum overlap.
Problem 4: String Comparison in Sorting
When you call Arrays.sort(stringArray) or list.sort() on strings, the runtime uses lexicographical comparison. Understanding this explains:
["10", "2", "1"] sorts to ["1", "10", "2"]["Banana", "apple"] sorts to ["Banana", "apple"] (B < a)Problem 5: Binary Search for String Insertion
To insert a string into a sorted list while maintaining order:
Sorted list: ["apple", "cherry", "date"]
Insert: "banana"
Binary search: "banana" > "apple", "banana" < "cherry"
Insert at index 1: ["apple", "banana", "cherry", "date"]
Problem 6: Prefix-Based Queries
Given strings sorted lexicographically, all strings starting with a prefix form a contiguous range:
Sorted: ["app", "apple", "application", "apply", "banana", "band"]
Find all words starting with "app":
- Binary search for "app" → index 0
- Binary search for "apq" (first string > any "app..." prefix) → index 4
- Answer: indices [0, 4) = ["app", "apple", "application", "apply"]
String comparison is not a constant-time operation. Understanding its cost is crucial for algorithm analysis:
Basic Comparison Cost:
Comparing two strings of lengths m and n:
Impact on Sorting:
For n strings of average length L:
For large strings, this L factor can dominate!
Impact on Binary Search:
Searching in n strings of average length L:
Again, string length matters.
| Operation | Comparisons | Per Comparison | Total |
|---|---|---|---|
| Sort (general) | O(n log n) | O(L) | O(n × L × log n) |
| Binary Search | O(log n) | O(L) | O(L × log n) |
| Linear Search | O(n) | O(L) | O(n × L) |
| Find Min | O(n) | O(L) | O(n × L) |
| Merge Two Sorted Lists | O(n + m) | O(L) | O((n+m) × L) |
For very long strings with common prefixes, comparison can be expensive. Techniques like hashing (compare hash codes first, then verify), tries (avoid redundant prefix comparisons), or normalized forms (precompute comparison keys) can dramatically improve performance in string-heavy algorithms.
We've developed a rigorous understanding of lexicographical ordering—the foundation of string comparison in computer science. Let's consolidate the essential concepts:
What's next:
Lexicographical comparison treats uppercase and lowercase letters as completely different characters. But often we want 'Apple' and 'apple' to compare equal. The next page explores case sensitivity—how it affects string comparison, when it matters, and strategies for case-insensitive operations.
You now understand lexicographical ordering with precision—not as vague 'alphabetical order' but as a concrete algorithm with defined behavior for every case. This foundation prepares you to understand the nuances of case sensitivity, locale awareness, and string equality in the coming pages.