Consider sorting these German words: ["Öl", "Apfel", "Öffnung", "Zebra"]
In German locale: ["Apfel", "Öffnung", "Öl", "Zebra"]
In Swedish locale: ["Apfel", "Zebra", "Öffnung", "Öl"]
Using pure binary (code point) comparison: ["Apfel", "Zebra", "Öffnung", "Öl"]
Three different sorting approaches produce two different results. Which is correct?
The answer is: it depends on context. This is the domain of locale-aware versus binary comparison—a distinction that determines whether your software feels native to users across cultures or produces results that seem arbitrarily wrong.
By the end of this page, you will understand the fundamental difference between binary and locale-aware string comparison—what each approach does, when each is appropriate, and the substantial complexity hidden beneath seemingly simple sorting and comparison operations. You'll be equipped to make informed decisions about comparison semantics in your applications.
Definition:
Binary comparison (also called ordinal comparison or code point comparison) compares strings by the raw numeric values that encode each character: Unicode code points, UTF-16 code units, or bytes, depending on the language and its string representation.
How it works:
This is pure lexicographical comparison on numeric values:
"Apfel" vs "Öl"
Position 0: 'A' (65) vs 'Ö' (214)
65 < 214, so "Apfel" < "Öl"
Binary sorted: ["Apfel", "Zebra", "Öffnung", "Öl"]
// Z (90) < Ö (214), so Z comes first
| Character | Unicode Code Point | Binary Sorted Position |
|---|---|---|
| A | U+0041 (65) | Early (uppercase letters: 65-90) |
| Z | U+005A (90) | End of uppercase range |
| a | U+0061 (97) | After uppercase, before extended |
| z | U+007A (122) | End of lowercase ASCII |
| ä | U+00E4 (228) | Extended Latin, after ASCII |
| ö | U+00F6 (246) | Extended Latin, after ä |
| Ö | U+00D6 (214) | Uppercase extended, after z |
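The position-by-position walk above can be sketched in Python, whose built-in string ordering already compares code points. The helper name `binary_compare` is an illustrative assumption, not a standard library function; it only spells out what `<` and `sorted()` do implicitly.

```python
def binary_compare(left: str, right: str) -> int:
    """Return -1, 0, or 1 by comparing code points position by position."""
    for l_char, r_char in zip(left, right):
        if ord(l_char) != ord(r_char):
            return -1 if ord(l_char) < ord(r_char) else 1
    # All shared positions are equal, so the shorter string sorts first.
    return (len(left) > len(right)) - (len(left) < len(right))


words = ["Öl", "Apfel", "Öffnung", "Zebra"]

print(ord("A"), ord("Z"), ord("\u00d6"))   # 65 90 214 (U+00D6 is Ö)
print(binary_compare("Apfel", "Öl"))       # -1, because 65 < 214
print(sorted(words))                       # ['Apfel', 'Zebra', 'Öffnung', 'Öl']
```

Note that "Zebra" lands before both Ö words purely because 90 < 214; no linguistic rule is involved.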
Characteristics of Binary Comparison:
Advantages:
Disadvantages:
Binary comparison is ideal for:
• Comparing identifiers, variable names, and keys
• Sorting technical data (URLs, file paths, IDs)
• Hash table keys and dictionary lookups
• Ensuring reproducible sorting in tests
• Any scenario where cultural expectation doesn't matter
Definition:
Locale-aware comparison (also called collation or cultural comparison) sorts strings according to the conventions of a specific language and culture. It treats strings as text meant for human consumption, not as sequences of bytes.
How it works (conceptually):
The result matches human expectations for that culture:
German locale sorting:
"Öl" is treated like "Ol" at primary level
→ Sorts after "O" words, before "P" words
→ ["Apfel", "Öffnung", "Öl", "Zebra"]
Locale Variations in Action:
German (de_DE):
Swedish (sv_SE):
Spanish (es_ES):
Estonian (et_EE):
Japanese:
| Words | Binary Order | German Order | Swedish Order |
|---|---|---|---|
| [ärger, Apfel, zoo] | [Apfel, zoo, ärger] | [Apfel, ärger, zoo] | [Apfel, zoo, ärger] |
| [Öl, Apfel, Zebra] | [Apfel, Zebra, Öl] | [Apfel, Öl, Zebra] | [Apfel, Zebra, Öl] |
| [äpple, apple, Zoo] | [Zoo, apple, äpple] | [apple, äpple, Zoo] | [apple, Zoo, äpple] |
There is no 'correct' universal human sorting order. Every locale-aware sort is culturally specific. An application serving users in both Germany and Sweden may need to support multiple sort orders—or explicitly choose a single convention and document it.
The Unicode Collation Algorithm (UCA) is the standard specification for sorting strings in a linguistically and culturally correct manner. It's the foundation of locale-aware sorting in most modern systems.
Core Concepts:
1. Collation Elements:
Each character (or sequence of characters) maps to a collation element—a tuple of numeric weights at different levels.
Example collation elements (simplified):
'a' → [0A0A, 0020, 0002] // Primary, Secondary, Tertiary
'A' → [0A0A, 0020, 0008] // Same primary/secondary, different tertiary
'á' → [0A0A, 0024, 0002] // Same primary, different secondary (accent)
'ä' → [0A0A, 0028, 0002] // Same primary, different secondary (diaeresis)
2. Multi-Level Comparison:
Strings are compared level by level:
This allows 'a' and 'A' and 'á' to be near each other (same primary) while still having a defined order.
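To make the level-by-level idea concrete, here is a toy sketch in Python. It is not the UCA: real weights come from a collation element table, whereas this hypothetical `toy_collation_key` derives them from Unicode decomposition (base letter as primary, combining marks as secondary, case as tertiary). It assumes Latin text only.

```python
import unicodedata


def toy_collation_key(text: str):
    """Build a three-level key: (base letters, accents, case)."""
    primary = []    # base letters with case and accents ignored
    secondary = []  # combining marks attached to each base letter
    tertiary = []   # 0 for lowercase, 1 for uppercase
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Attach the accent to the preceding base letter, if any.
            if secondary:
                secondary[-1] += (ord(ch),)
            continue
        primary.append(ch.casefold())
        secondary.append(())
        tertiary.append(1 if ch.isupper() else 0)
    # Tuples compare left to right, so the secondary level is consulted only
    # when the whole primary level ties, and the tertiary only after that.
    return (tuple(primary), tuple(secondary), tuple(tertiary))


print(sorted(["ä", "A", "á", "a"], key=toy_collation_key))   # ['a', 'A', 'á', 'ä']
print(sorted(["Om", "Öl", "Ol"], key=toy_collation_key))     # ['Ol', 'Öl', 'Om']
print(sorted(["Öl", "Apfel", "Öffnung", "Zebra"], key=toy_collation_key))
# ['Apfel', 'Öffnung', 'Öl', 'Zebra']
```

The second example shows why levels matter: the accent on "Öl" only breaks the tie with "Ol"; it never outranks the primary difference between "l" and "m".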
3. Collation Tailoring:
The UCA defines a Default Unicode Collation Element Table (DUCET)—a default sorting order for all Unicode characters. Specific locales then tailor this default:
DUCET default: ä sorts with a (same primary weight, distinguished only at the secondary level)
German tailoring: standard (dictionary) order keeps the DUCET behavior; phonebook order treats ä as 'ae'
Swedish tailoring: ä is a separate letter that sorts after z (differs from DUCET)
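One way to picture tailoring is as swapping the alphabet that assigns primary weights. The sketch below is a deliberately simplified assumption (primary level only, Latin letters only); `primary_key`, `BASE_ALPHABET`, and `SWEDISH_ALPHABET` are hypothetical names, and real tailorings are expressed as rules over collation elements.

```python
import unicodedata

BASE_ALPHABET = "abcdefghijklmnopqrstuvwxyz"
# Swedish-style tailoring: å, ä, ö become separate letters after z.
SWEDISH_ALPHABET = BASE_ALPHABET + "\u00e5\u00e4\u00f6"


def primary_key(text: str, alphabet: str) -> list[int]:
    """Map each character to its position in the given alphabet."""
    key = []
    for ch in unicodedata.normalize("NFC", text.casefold()):
        if ch not in alphabet:
            # Not a letter of this alphabet: fall back to its base letter.
            ch = unicodedata.normalize("NFD", ch)[0]
        # Anything still unknown sorts after all letters.
        key.append(alphabet.index(ch) if ch in alphabet else len(alphabet))
    return key


words = ["Öl", "Apfel", "Öffnung", "Zebra"]

print(sorted(words, key=lambda w: primary_key(w, BASE_ALPHABET)))
# ['Apfel', 'Öffnung', 'Öl', 'Zebra']
print(sorted(words, key=lambda w: primary_key(w, SWEDISH_ALPHABET)))
# ['Apfel', 'Zebra', 'Öffnung', 'Öl']
```

The same function produces both orderings; only the alphabet passed in changes, which is the essence of tailoring a shared default.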
4. Normalization:
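Collation typically normalizes strings first. Characters like 'é' can be represented either as a single precomposed code point (U+00E9) or as the base letter 'e' followed by a combining acute accent (U+0301).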
Both representations should collate identically, so normalization is essential.
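A short Python check illustrates the problem using the standard `unicodedata` module. The helper name `normalized` is an assumption for this sketch; NFD is used here, though any single normalization form applied consistently would make the two spellings agree.

```python
import unicodedata

precomposed = "\u00e9"   # é as one code point (U+00E9)
decomposed = "e\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)            # False: different code points
print(len(precomposed), len(decomposed))    # 1 2


def normalized(text: str) -> str:
    """Bring text to one canonical form before comparing or building keys."""
    return unicodedata.normalize("NFD", text)


print(normalized(precomposed) == normalized(decomposed))   # True
```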
5. Variable Weighting:
Punctuation and spaces can be handled differently:
This determines whether "can't" sorts near "cant" or is separated by the apostrophe.
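The effect on "can't" versus "cant" can be imitated with two toy key functions in Python. This only loosely mirrors the UCA options (which the standard calls non-ignorable and shifted); the function names are assumptions, and plain code points stand in for real collation weights.

```python
def key_punctuation_counts(text: str) -> str:
    """Punctuation and spaces take part in the comparison like any character."""
    return text


def key_punctuation_ignored(text: str):
    """Compare letters and digits first; use the full text only to break ties."""
    letters_only = "".join(ch for ch in text if ch.isalnum())
    return (letters_only, text)


words = ["cant", "can't", "canal", "can opener"]

print(sorted(words, key=key_punctuation_counts))
# ['can opener', "can't", 'canal', 'cant']
print(sorted(words, key=key_punctuation_ignored))
# ['canal', 'can opener', "can't", 'cant']
```

With punctuation ignored, "can't" and "cant" become neighbours; with punctuation counted, "canal" separates them.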
The International Components for Unicode (ICU) library provides the most comprehensive implementation of UCA. It's available for Java as ICU4J, backs the Intl API in the major JavaScript engines, and is used by many databases and operating systems. When you call localeCompare() in JavaScript, you're likely invoking ICU under the hood.
The choice between binary and locale-aware comparison depends on what the strings represent and who will see the results:
Use Binary Comparison When:
Use Locale-Aware Comparison When:
| Question | If Yes → Binary | If Yes → Locale |
|---|---|---|
| Is this a technical identifier? | ✓ | |
| Will users see the ordering? | | ✓ |
| Must sorting be reproducible across systems? | ✓ | |
| Is correct cultural ordering expected? | | ✓ |
| Is performance critical with millions of comparisons? | ✓ | |
| Is this for hashtable keys or exact matching? | ✓ | |
| Is this for a specific cultural audience? | | ✓ |
If the data is meant for human eyes, use locale-aware comparison. If it's meant for machine processing, use binary comparison. When both apply (e.g., usernames shown to users but also used as keys), consider normalizing to a canonical form for storage while using locale-aware display ordering.
Here's how to implement each comparison approach in common programming environments:
Binary Comparison:
// JavaScript - the < and > operators compare UTF-16 code units (binary)
"apple" < "Banana" // false: 'a' (97) > 'B' (66), so "apple" > "Banana"
// Note: localeCompare() is always locale-aware, so it is not a binary comparison
# Python - binary (code point) comparison is the default
sorted(["apple", "Banana"]) # ['Banana', 'apple']
// Java - binary comparison
"apple".compareTo("Banana") // positive: compares UTF-16 code unit values
Locale-Aware Comparison:
// JavaScript - locale-aware
["apple", "Banana"].sort((a, b) => a.localeCompare(b))
// Result: ['apple', 'Banana'] - case is only a tie-breaker, so 'a' sorts before 'B'
// With specific locale
["Öl", "Ol"].sort((a, b) => a.localeCompare(b, 'de'))
// German locale: Ö sorts with O
Python with Locale:
import locale
# Set locale
locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
# Sort with locale awareness
sorted(["Öl", "Apfel", "Zebra"], key=locale.strxfrm)
# Result depends on German collation rules
# Or use ICU directly
import icu # PyICU library
collator = icu.Collator.createInstance(icu.Locale('de_DE'))
words = ["Öl", "Apfel", "Zebra"]
sorted(words, key=collator.getSortKey)
Database Queries:
-- PostgreSQL with specific collation
SELECT * FROM products ORDER BY name COLLATE "de_DE";
-- MySQL
SELECT * FROM products ORDER BY name COLLATE utf8mb4_german2_ci;
-- SQLite (limited collation support)
SELECT * FROM products ORDER BY name COLLATE NOCASE; -- Case-insensitive only
Key Insight: Most standard library sort functions use binary comparison by default. Locale-aware sorting typically requires explicit configuration.
Locale-aware sorting often depends on system configuration. A server without German locale packages installed may not sort German text correctly. Docker containers, especially minimal images, often lack locale data. Always verify locale support in your deployment environment.
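A deployment check along these lines can catch missing locale data early. This is a minimal sketch using Python's standard `locale` module; the function name `collation_locale_available` is an assumption, and because `setlocale` changes process-wide state, the check is best run once at startup rather than from multiple threads.

```python
import locale


def collation_locale_available(name: str) -> bool:
    """Return True if the C library accepts `name` for collation."""
    previous = locale.setlocale(locale.LC_COLLATE)   # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return True
    except locale.Error:
        return False
    finally:
        # Always restore the original setting, whether or not the probe worked.
        locale.setlocale(locale.LC_COLLATE, previous)


for candidate in ("C", "de_DE.UTF-8", "sv_SE.UTF-8"):
    print(candidate, collation_locale_available(candidate))
```

The output depends on which locales are installed on the machine, which is exactly the variability the check is meant to expose.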
Binary and locale comparison have significantly different performance characteristics:
Binary Comparison Performance:
Locale-Aware Comparison Performance:
Typical Performance Difference:
Locale comparison is often 3-10x slower than binary comparison, depending on:
| Operation | Binary | Locale (Simple) | Locale (Complex) |
|---|---|---|---|
| Single comparison | ~50 nanoseconds | ~150 nanoseconds | ~500+ nanoseconds |
| Sort 10,000 strings | ~5 milliseconds | ~15 milliseconds | ~50+ milliseconds |
| Sort 1,000,000 strings | ~500 milliseconds | ~1.5 seconds | ~5+ seconds |
Optimization Strategies:
1. Precompute Sort Keys:
For repeated sorting, compute collation keys once and sort by keys:
# Within a single sort, key= already calls strxfrm only once per item
sorted(words, key=locale.strxfrm)
# To sort the same data repeatedly, compute the keys once and reuse them
keys = {word: locale.strxfrm(word) for word in words}
sorted(words, key=lambda w: keys[w]) # Use precomputed keys
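The saving from sort keys can be observed by counting calls. In this Python sketch, `str.casefold` is only a cheap stand-in for an expensive collation step, and the counters and function names are assumptions made for illustration.

```python
from functools import cmp_to_key

key_calls = 0
cmp_calls = 0


def expensive_key(text: str) -> str:
    """Stand-in for computing a collation key (called once per item)."""
    global key_calls
    key_calls += 1
    return text.casefold()


def expensive_compare(a: str, b: str) -> int:
    """Stand-in for a pairwise collation comparison (called once per pair)."""
    global cmp_calls
    cmp_calls += 1
    ka, kb = a.casefold(), b.casefold()
    return (ka > kb) - (ka < kb)


words = ["pear", "Apple", "fig", "Banana", "cherry", "Date", "elderberry", "Grape"]

by_key = sorted(words, key=expensive_key)
by_cmp = sorted(words, key=cmp_to_key(expensive_compare))

print(by_key == by_cmp)   # True: both approaches give the same order
print(key_calls)          # 8: one key per item
print(cmp_calls)          # at least 7, and it grows roughly as n log n
```

Key-based sorting does the expensive work n times, while a comparator repeats it for every pair the sort examines, so the gap widens as the list grows.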
2. Two-Phase Sorting:
For large datasets where most items have clear order:
3. Lazy Evaluation:
For user interfaces showing sorted lists:
4. Hybrid Approaches:
Store both raw strings and collation keys:
For user-facing features with hundreds of items, locale comparison overhead is negligible. For batch processing millions of items, consider whether locale comparison is truly necessary. Often, internal processing can use binary comparison, with locale comparison only at the final display layer.
Both comparison approaches have traps for the unwary:
Pitfall 1: Mixing Comparison Types
Using binary comparison for one operation and locale comparison for another on the same data:
Data stored sorted with binary comparison
→ Binary search works
User interface sorts with locale comparison
→ Display order differs from storage order
→ Pagination and scrolling break!
Solution: Always use consistent comparison within a data pipeline.
Pitfall 2: Locale Environment Variability
Development machine: Locale data for all languages installed
Production server: Minimal Docker image, no locale data
Result: Code works locally, breaks in production
Solution: Explicitly install locale data in deployment, or use ICU, which ships its own collation data instead of relying on the operating system's locales.
Pitfall 3: Default Locale Surprises
System default locale varies between machines:
Developer in US: en_US sort order
Server in Germany: de_DE sort order
Test server in Japan: ja_JP sort order
All produce different results for the same data!
Solution: Always specify locale explicitly; never rely on system default.
Pitfall 4: Binary Search on Locale-Sorted Data
Data sorted with locale comparison
Binary search using binary comparison
Binary search assumes the data is ordered by the same comparison it uses to search
But locale and binary comparison can disagree about which of two strings comes first!
Result: Binary search fails or misses elements
Solution: Sort and search must use the same comparison.
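The failure is easy to reproduce with Python's `bisect` module. Here `str.casefold` is a stand-in for a locale-aware sort key (an assumption made to keep the sketch independent of installed locales), and the function names are hypothetical.

```python
import bisect


def sort_key(text: str) -> str:
    """Stand-in for a locale-aware collation key."""
    return text.casefold()


data = sorted(["cherry", "Banana", "apple"], key=sort_key)   # ['apple', 'Banana', 'cherry']
keys = [sort_key(item) for item in data]                     # parallel list of keys


def find_mismatched(target: str) -> bool:
    """Searches with binary comparison although data is ordered by sort_key."""
    i = bisect.bisect_left(data, target)
    return i < len(data) and data[i] == target


def find_consistent(target: str) -> bool:
    """Searches with the same key that was used for sorting."""
    i = bisect.bisect_left(keys, sort_key(target))
    return i < len(keys) and data[i] == target


print(find_mismatched("Banana"))   # False: the element exists but is missed
print(find_consistent("Banana"))   # True
```

The mismatched search goes wrong at the first probe: in code point order "apple" is greater than "Banana", so the search looks to the left of an element that actually sits to the right.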
Pitfall 5: Hash Tables with Locale Keys
// Stored in precomposed form (NFC): é is U+00E9
map["caf\u00E9"] = value
// Looked up in decomposed form (NFD): e followed by U+0301
map["cafe\u0301"] // Looks identical on screen, but different underlying bytes!
Result: Key not found
Solution: Use Unicode normalization before hashing. Use exact binary comparison for hash keys.
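In Python, the fix amounts to normalizing every key on the way in and on the way out. This is a minimal sketch; `canonical_key` is a hypothetical helper, and NFC is chosen here only as one consistent form.

```python
import unicodedata


def canonical_key(text: str) -> str:
    """Normalize a string before using it as a dictionary key."""
    return unicodedata.normalize("NFC", text)


stored = "caf\u00e9"       # precomposed é
looked_up = "cafe\u0301"   # e + combining acute accent; renders identically

raw = {stored: "menu"}
print(looked_up in raw)    # False: hashing sees different code points

safe = {canonical_key(stored): "menu"}
print(safe.get(canonical_key(looked_up)))   # menu
```

Once keys are normalized, exact binary comparison is the right tool for the lookup itself.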
Pitfall 6: Ignoring Locale in User-Facing Data
App displays: ["Zoo", "apple", "Öl"]
User expects: ["apple", "Öl", "Zoo"] (German user)
Result: App feels 'broken' or 'foreign'
Solution: Detect or allow user to select locale for display sorting.
We've explored the fundamental distinction between binary and locale-aware string comparison—two approaches that can produce very different results from the same input. Let's consolidate the essential knowledge:
What's next:
Comparing strings for ordering is one challenge. Comparing strings for equality is another—with its own subtleties. The next page explores equality vs identity—when two strings are 'equal' versus 'identical', how different programming languages handle this distinction, and why it matters for correctness and performance.
You now understand the two fundamental approaches to string comparison—binary and locale-aware—and can make informed decisions about which to use. You've seen how the same strings produce different orderings depending on the approach, and you're equipped to avoid the common pitfalls that cause subtle bugs in internationalized applications.