String Comparison, Ordering & Equality

Level: Intermediate

Duration: 60 mins

Topic: Strings

Locale vs Binary Comparison

Why Does 'ä' Sort Differently in Different Countries?

Consider sorting these German words: ["Öl", "Apfel", "Öffnung", "Zebra"]

In German locale: ["Apfel", "Öffnung", "Öl", "Zebra"]

Ö is treated like O, sorting after A but before Z

In Swedish locale: ["Apfel", "Zebra", "Öffnung", "Öl"]

Ö is a distinct letter that comes after Z!

Using pure binary (code point) comparison: ["Apfel", "Zebra", "Öffnung", "Öl"]

Ö (code point 214) comes after Z (code point 90) because 214 > 90

Three different sorting approaches produce two different results. Which is correct?

The answer is: it depends on context. This is the domain of locale-aware versus binary comparison—a distinction that determines whether your software feels native to users across cultures or produces results that seem arbitrarily wrong.

What You Will Learn

By the end of this page, you will understand the fundamental difference between binary and locale-aware string comparison—what each approach does, when each is appropriate, and the substantial complexity hidden beneath seemingly simple sorting and comparison operations. You'll be equipped to make informed decisions about comparison semantics in your applications.

Binary Comparison: The Machine's Perspective

Definition:

Binary comparison (also called ordinal comparison or code point comparison) compares strings by their raw numeric values: the Unicode code points (or, depending on the language and encoding, the UTF-16 code units or encoded bytes) that represent each character.

How it works:

Compare strings character by character from left to right
At each position, compare the numeric code point of each character
The first differing code point determines the result
If one string is a prefix of the other, the shorter string is 'less than'

This is pure lexicographical comparison on numeric values:

"Apfel" vs "Öl"

Position 0: 'A' (65) vs 'Ö' (214)
65 < 214, so "Apfel" < "Öl"

Binary sorted: ["Apfel", "Zebra", "Öffnung", "Öl"]
// Z (90) < Ö (214), so Z comes first
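
The steps above can be sketched directly in Python. This is a minimal illustration rather than how any runtime implements it: `binary_compare` is a hypothetical helper name, and Python's built-in string comparison already behaves this way.

```python
import functools

def binary_compare(a: str, b: str) -> int:
    """Return -1, 0, or 1 by comparing code points left to right."""
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            # The first differing code point decides the result.
            return -1 if ord(ch_a) < ord(ch_b) else 1
    # One string is a prefix of the other (or they are equal): shorter is "less".
    return (len(a) > len(b)) - (len(a) < len(b))

# "\u00d6" is the precomposed character Ö (code point 214).
words = ["\u00d6l", "Apfel", "\u00d6ffnung", "Zebra"]
binary_sorted = sorted(words, key=functools.cmp_to_key(binary_compare))
print(binary_sorted)  # same order as plain sorted(words)
```

Because the built-in comparison gives the same answer, the explicit comparator is only useful for seeing the mechanics.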

Binary Comparison: Character Code Points
Character	Unicode Code Point	Binary Sorted Position
A	U+0041 (65)	Early (uppercase letters: 65-90)
Z	U+005A (90)	End of uppercase range
a	U+0061 (97)	After uppercase, before extended
z	U+007A (122)	End of lowercase ASCII
ä	U+00E4 (228)	Extended Latin, after ASCII
ö	U+00F6 (246)	Extended Latin, after ä
Ö	U+00D6 (214)	Uppercase extended, after z

Characteristics of Binary Comparison:

Advantages:

Deterministic: Same input always produces same output, regardless of system settings
Fast: Direct numeric comparison, no table lookups required
Simple: Easy to understand, implement, and verify
Consistent across systems: No dependence on installed locales or libraries
Stable for identifiers: Perfect for comparing code, keys, and technical strings

Disadvantages:

Culturally incorrect: Produces ordering that violates human expectations in many languages
Case quirks: All uppercase letters sort before all lowercase ('Z' < 'a')
Accent placement: Characters like 'é', 'ñ', 'ü' sort after 'z', not with their base letters
Inappropriate for user-facing sorting: Users expect culturally appropriate order
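
The case and accent quirks are easy to reproduce with Python's default sort, which compares code points. A short illustration with made-up sample words:

```python
# "\u00e9" is the precomposed character é (code point 233).
words = ["zebra", "apple", "Zoo", "\u00e9clair"]
binary_sorted = sorted(words)
print(binary_sorted)
# 'Zoo' comes first because 'Z' (90) < 'a' (97);
# the word starting with é comes last because 'é' (233) > 'z' (122).
```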

When to Use Binary Comparison

Binary comparison is ideal for:

•Comparing identifiers, variable names, and keys
•Sorting technical data (URLs, file paths, IDs)
•Hash table keys and dictionary lookups
•Ensuring reproducible sorting in tests
•Any scenario where cultural expectation doesn't matter

Locale-Aware Comparison: The Human Perspective

Definition:

Locale-aware comparison (also called collation or cultural comparison) sorts strings according to the conventions of a specific language and culture. It treats strings as text meant for human consumption, not as sequences of bytes.

How it works (conceptually):

Transform each character to a collation key based on the locale
Collation keys encode sorting weight at multiple levels:
- Primary: Base letter (a, b, c...)
- Secondary: Accents and diacritics (a vs á vs ä)
- Tertiary: Case (a vs A)
- Quaternary: Punctuation and special characters
Compare collation keys to determine order

The result matches human expectations for that culture:

German locale sorting:
"Öl" is treated like "Ol" at primary level
→ Sorts after "O" words, before "P" words
→ ["Apfel", "Öffnung", "Öl", "Zebra"]
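
To make the idea of a multi-level key concrete, here is a toy approximation in Python. It is a sketch under strong assumptions, not a real collator: `toy_collation_key` is a hypothetical name, it uses Unicode decomposition from the standard `unicodedata` module to separate base letters from accents, and it ignores every locale-specific rule. It happens to reproduce the German ordering for the sample words.

```python
import unicodedata

def toy_collation_key(text: str):
    """Build (primary, secondary, tertiary) levels from a decomposed string."""
    decomposed = unicodedata.normalize("NFD", text)
    primary, secondary, tertiary = [], [], []
    for ch in decomposed:
        if unicodedata.combining(ch):
            # Attach the accent to the base letter that precedes it.
            if secondary:
                secondary[-1] += ch
            continue
        primary.append(ch.lower())     # base letter, case removed
        secondary.append("")           # accents for this letter (none yet)
        tertiary.append(ch.isupper())  # case: lowercase before uppercase
    return (tuple(primary), tuple(secondary), tuple(tertiary))

# "\u00d6" is the precomposed character Ö.
words = ["\u00d6l", "Apfel", "\u00d6ffnung", "Zebra"]
german_like = sorted(words, key=toy_collation_key)
print(german_like)
```

Real collators replace the ad hoc levels here with weights looked up in locale-specific tables, which is what allows Swedish and German to disagree about the same letters.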

Locale Variations in Action:

German (de_DE):

Treats ä, ö, ü as variants of a, o, u
ß is equivalent to 'ss' for sorting
"Männer" sorts near "Manner", not after 'z'

Swedish (sv_SE):

Treats ä, ö as separate letters that come after z
Alphabet ends: ...x, y, z, å, ä, ö
"Öl" sorts after "Zebra"

Spanish (es_ES):

Historically treated "ch" as a single letter after "c"
"ñ" is a separate letter after "n"
Modern Spanish often uses standard Latin ordering

Estonian (et_EE):

"z" comes in the middle of the alphabet, not at the end!
Alphabet: ...s, z, t, u...x, y

Japanese:

Multiple sort orders exist (hiragana, katakana, kanji readings)
Collation is vastly more complex than Latin alphabets

Same Words, Different Locale Orders
Words	Binary Order	German Order	Swedish Order
[ärger, Apfel, zoo]	[Apfel, zoo, ärger]	[Apfel, ärger, zoo]	[Apfel, zoo, ärger]
[Öl, Apfel, Zebra]	[Apfel, Zebra, Öl]	[Apfel, Öl, Zebra]	[Apfel, Zebra, Öl]
[äpple, apple, Zoo]	[Zoo, apple, äpple]	[apple, äpple, Zoo]	[apple, Zoo, äpple]

Locale Is Not Universal

There is no 'correct' universal human sorting order. Every locale-aware sort is culturally specific. An application serving users in both Germany and Sweden may need to support multiple sort orders—or explicitly choose a single convention and document it.

The Unicode Collation Algorithm (UCA)

The Unicode Collation Algorithm (UCA) is the standard specification for sorting strings in a linguistically and culturally correct manner. It's the foundation of locale-aware sorting in most modern systems.

Core Concepts:

1. Collation Elements:

Each character (or sequence of characters) maps to a collation element—a tuple of numeric weights at different levels.

Example collation elements (simplified):
'a' → [0A0A, 0020, 0002]  // Primary, Secondary, Tertiary
'A' → [0A0A, 0020, 0008]  // Same primary/secondary, different tertiary
'á' → [0A0A, 0024, 0002]  // Same primary, different secondary (accent)
'ä' → [0A0A, 0028, 0002]  // Same primary, different secondary (diaeresis)

2. Multi-Level Comparison:

Strings are compared level by level:

First, compare all primary weights
If equal at primary level, compare secondary weights
If still equal, compare tertiary weights
And so on for additional levels

This allows 'a' and 'A' and 'á' to be near each other (same primary) while still having a defined order.
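
The level-by-level rule can be sketched with the simplified weights shown earlier. This is an illustration only: the entry for 'b' is a made-up value that merely needs a larger primary weight, and `multilevel_key` is a hypothetical helper. It shows that a primary difference later in the string outranks an accent difference earlier in the string.

```python
# Simplified collation elements: (primary, secondary, tertiary).
ELEMENTS = {
    "a": (0x0A0A, 0x0020, 0x0002),
    "A": (0x0A0A, 0x0020, 0x0008),
    "\u00e1": (0x0A0A, 0x0024, 0x0002),  # á
    "\u00e4": (0x0A0A, 0x0028, 0x0002),  # ä
    "b": (0x0A1B, 0x0020, 0x0002),       # hypothetical weights for the demo
}

def multilevel_key(text: str):
    elements = [ELEMENTS[ch] for ch in text]
    primary = tuple(e[0] for e in elements)
    secondary = tuple(e[1] for e in elements)
    tertiary = tuple(e[2] for e in elements)
    # Tuples compare left to right, so every primary weight is
    # examined before any secondary weight is considered.
    return (primary, secondary, tertiary)

single = sorted(["b", "\u00e4", "A", "\u00e1", "a"], key=multilevel_key)
pair = sorted(["ab", "\u00e1a"], key=multilevel_key)
print(single)  # a, A, á, ä, b
print(pair)    # the string starting with á sorts first: its second letter has the smaller primary
```

Under binary comparison the pair comes out the other way round, because 'a' (97) is less than 'á' (225) at the first position.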

3. Collation Tailoring:

The UCA defines a Default Unicode Collation Element Table (DUCET)—a default sorting order for all Unicode characters. Specific locales then tailor this default:

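DUCET default: ä shares its primary weight with a, so it sorts alongside a and differs only at the secondary level

German tailoring: the standard (dictionary) order keeps the DUCET behavior; the phonebook variant sorts ä as if it were 'ae'
Swedish tailoring: ä becomes a distinct letter that sorts after z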

4. Normalization:

Collation typically normalizes strings first. Characters like 'é' can be represented as:

Precomposed: single code point U+00E9
Decomposed: 'e' (U+0065) + combining acute (U+0301)

Both representations should collate identically, so normalization is essential.
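
A quick way to see the two representations, using Python's standard `unicodedata` module. This only demonstrates normalization itself, not a collator:

```python
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # 'e' followed by a combining acute accent

# The strings render identically but are different code point sequences.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2

# After normalizing both to the same form, they compare equal.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == decomposed
print(same_after_nfc, same_after_nfd)     # True True
```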

5. Variable Weighting:

Punctuation and spaces can be handled differently:

Non-ignorable: Spaces and punctuation affect sort order
Blanked/Shifted: Spaces and punctuation are ignored at primary level

This determines whether "can't" sorts near "cant" or is separated by the apostrophe.
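
The effect of the two options can be approximated with a toy key. This sketch uses plain code points in place of real collation weights, and `shifted_key` is a hypothetical helper that drops punctuation and whitespace for the main comparison and uses the original string only to break ties.

```python
import string

def shifted_key(text: str):
    """Ignore punctuation and spaces first; fall back to the raw string on ties."""
    letters_only = "".join(
        ch for ch in text if ch not in string.punctuation and not ch.isspace()
    )
    return (letters_only, text)

words = ["cant", "can't", "canal", "canto"]

# Non-ignorable (approximated by code points): the apostrophe takes part in
# the comparison and sorts before letters, pulling "can't" ahead of "canal".
non_ignorable = sorted(words)

# Shifted (approximated): the apostrophe is skipped, so "can't" lands next to "cant".
shifted = sorted(words, key=shifted_key)

print(non_ignorable)
print(shifted)
```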

UCA Key Features

•Default ordering (DUCET) — Sensible default for all Unicode characters
•Locale tailoring — Override defaults for specific cultural requirements
•Multi-level comparison — Primary (base), secondary (accent), tertiary (case)
•Normalization — Handles equivalent representations consistently
•Configurable handling of spaces/punctuation — Ignorable or significant
•Contractions — Multi-character sequences that sort as units (e.g., 'ch' in Spanish)
•Expansions — Single characters that expand for sorting (e.g., 'ß' → 'ss')

ICU: The Reference Implementation

The International Components for Unicode (ICU) library provides the most comprehensive implementation of UCA. It's used by major JavaScript engines (for the Intl API), many databases, and operating systems, and is available for Java as ICU4J. When you call localeCompare() in JavaScript, you're likely invoking ICU under the hood.

Choosing Between Binary and Locale Comparison

The choice between binary and locale-aware comparison depends on what the strings represent and who will see the results:

Use Binary Comparison When:

Comparing technical identifiers: Variable names, constants, API keys, UUIDs
Hash table keys and lookups: Need exact matching
Sorting for machine consumption: Indexing, deduplication, technical ordering
Reproducibility is required: Tests, deterministic algorithms
Performance is critical: Binary comparison is faster
Cross-system consistency: No dependency on locale installation

Use Locale-Aware Comparison When:

Displaying sorted data to users: Name lists, product catalogs, search results
User input comparison: Usernames, search queries
Natural language text: Documents, content, titles
Internationalized applications: Serving users in multiple locales
Matching cultural expectations: Phone books, dictionaries

Binary Use Cases

•Database primary keys
•File system paths (often)
•Programming language identifiers
•Configuration keys
•URL components
•JSON object keys
•Log file analysis
•Protocol messages

Locale Use Cases

•Contact list sorting
•Search result ranking
•Product catalog displays
•User interface labels
•Natural language processing
•Educational applications
•Publishing and typography
•E-commerce listings

Comparison Approach Decision Matrix
Question	If Yes → Binary	If Yes → Locale
Is this a technical identifier?	✓
Will users see the ordering?		✓
Must sorting be reproducible across systems?	✓
Is correct cultural ordering expected?		✓
Is performance critical with millions of comparisons?	✓
Is this for hashtable keys or exact matching?	✓
Is this for a specific cultural audience?		✓

When in Doubt

If the data is meant for human eyes, use locale-aware comparison. If it's meant for machine processing, use binary comparison. When both apply (e.g., usernames shown to users but also used as keys), consider normalizing to a canonical form for storage while using locale-aware display ordering.

Practical Implementation Patterns

Here's how to implement each comparison approach in common programming environments:

Binary Comparison:

// JavaScript - relational operators and the default sort() compare UTF-16 code units
"apple" < "Banana"          // false: 'a' (97) > 'B' (66), so "apple" > "Banana"
["apple", "Banana"].sort()  // ['Banana', 'apple']
// Note: localeCompare() is locale-aware, so it is not a binary comparison

# Python - binary (code point) comparison is the default
sorted(["apple", "Banana"])  # ['Banana', 'apple']

// Java - compareTo() compares UTF-16 code unit values
"apple".compareTo("Banana")  // positive, because 'a' (97) > 'B' (66)

Locale-Aware Comparison:

// JavaScript - locale-aware (uses the runtime's default locale)
["apple", "Banana"].sort((a, b) => a.localeCompare(b))
// Result: ['apple', 'Banana'] - base letters are compared before case, so 'a' sorts before 'B'

// With specific locale
["Öl", "Ol"].sort((a, b) => a.localeCompare(b, 'de'))
// German locale: Ö sorts with O; the accent only breaks the tie, giving ['Ol', 'Öl']

Python with Locale:

import locale

# Set locale (raises locale.Error if this locale is not installed on the system)
locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')

# Sort with locale awareness
words = ["Öl", "Apfel", "Zebra"]
sorted(words, key=locale.strxfrm)
# Result depends on the German collation rules provided by the operating system

# Or use ICU directly
import icu  # PyICU library (third-party)
collator = icu.Collator.createInstance(icu.Locale('de_DE'))
sorted(words, key=collator.getSortKey)

Database Queries:

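-- PostgreSQL with specific collation (available names depend on the OS and build,
-- for example "de_DE", "de_DE.utf8", or the ICU collation "de-x-icu")
SELECT * FROM products ORDER BY name COLLATE "de_DE";

-- MySQL (german2 follows phone-book rules, where ä sorts as 'ae')
SELECT * FROM products ORDER BY name COLLATE utf8mb4_german2_ci;

-- SQLite (limited collation support)
SELECT * FROM products ORDER BY name COLLATE NOCASE;  -- Case-insensitive for ASCII letters only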

Key Insight: Most standard library sort functions use binary comparison by default. Locale-aware sorting typically requires explicit configuration.

Locale Environment Dependencies

Locale-aware sorting often depends on system configuration. A server without German locale packages installed may not sort German text correctly. Docker containers, especially minimal images, often lack locale data. Always verify locale support in your deployment environment.
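
A deployment check along these lines can catch a missing locale early. This is a minimal sketch using `locale.setlocale` and `locale.Error` from the Python standard library; `locale_available` is a hypothetical helper name, and locale naming (for example `de_DE.UTF-8`) varies by platform.

```python
import locale

def locale_available(name: str) -> bool:
    """Return True if the C library accepts `name` for collation."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        return True
    except locale.Error:
        return False
    finally:
        # setlocale is process-global, so always restore the previous value.
        locale.setlocale(locale.LC_COLLATE, saved)

print(locale_available("C"))            # the "C" locale is always present
print(locale_available("de_DE.UTF-8"))  # depends on the installed locale data
```

A `True` result only means the C library accepted the name; it is still worth sorting a few known non-ASCII samples at startup or in a smoke test to confirm the ordering is what you expect.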

Performance Considerations

Binary and locale comparison have significantly different performance characteristics:

Binary Comparison Performance:

Time complexity: O(min(n, m)) where n and m are string lengths
Work per character: ~1 operation (direct numeric comparison)
Memory: O(1) — no additional memory needed
Optimization: Reduces to simple memory-comparison loops that compilers and CPUs handle very efficiently

Locale-Aware Comparison Performance:

Time complexity: Still linear in string length, but with a much larger constant factor k from collation work, i.e. O(n × k)
Work per character: Multiple operations (table lookup, weight extraction, multi-level comparison)
Memory: May allocate collation keys (O(n) per string)
Optimization: Harder to optimize because of table lookups and multi-level logic

Typical Performance Difference:

Locale comparison is often 3-10x slower than binary comparison, depending on:

Locale complexity
String length
Implementation quality

Performance Comparison (Approximate)
Operation	Binary	Locale (Simple)	Locale (Complex)
Single comparison	~50 nanoseconds	~150 nanoseconds	~500+ nanoseconds
Sort 10,000 strings	~5 milliseconds	~15 milliseconds	~50+ milliseconds
Sort 1,000,000 strings	~500 milliseconds	~1.5 seconds	~5+ seconds
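
Figures like these vary widely by platform and library, so it is worth measuring in your own environment. The following is a rough harness, assuming random ASCII test data and using `locale.strxfrm` under whatever `LC_COLLATE` the process currently has; `time_sort` is a hypothetical helper name.

```python
import locale
import random
import string
import timeit

def time_sort(words, key=None, repeat=3):
    """Best-of-`repeat` wall time in seconds for sorting `words` with `key`."""
    return min(timeit.repeat(lambda: sorted(words, key=key), number=1, repeat=repeat))

rng = random.Random(0)  # fixed seed so runs use the same data
words = ["".join(rng.choices(string.ascii_letters, k=12)) for _ in range(10_000)]

binary_seconds = time_sort(words)
locale_seconds = time_sort(words, key=locale.strxfrm)
print(f"binary: {binary_seconds:.4f}s  strxfrm: {locale_seconds:.4f}s")
```

Unless the program has called `locale.setlocale` with a real locale first, `strxfrm` runs under the default "C" collation and the ratio will understate the cost of true locale-aware sorting.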

Optimization Strategies:

1. Precompute Sort Keys:

For repeated sorting, compute collation keys once and sort by keys:

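import functools
import locale

# Comparator-based: locale.strcoll runs on every comparison (O(n log n) calls)
sorted(words, key=functools.cmp_to_key(locale.strcoll))

# Key-based: locale.strxfrm runs once per string within a single sort
sorted(words, key=locale.strxfrm)

# Precompute keys once and reuse them across multiple sorts
keys = {word: locale.strxfrm(word) for word in words}
sorted(words, key=keys.__getitem__)  # Use precomputed keys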

2. Two-Phase Sorting:

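For large datasets where most items have clear order:

First pass: binary sort (fast)
Second pass: locale sort only within ambiguous groups

Caveat: this is only correct if the first pass never separates strings that the locale order would interleave. Plain binary order often does (for example, "Zoo" sorts before "apple" in binary but after it in most locale orders), so the grouping must be validated against the target collation.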

3. Lazy Evaluation:

For user interfaces showing sorted lists:

Sort only visible items initially
Sort remaining items on scroll/request

4. Hybrid Approaches:

Store both raw strings and collation keys:

Use collation keys for sorting and range queries
Use raw strings for display and exact matching

When Performance Matters

For user-facing features with hundreds of items, locale comparison overhead is negligible. For batch processing millions of items, consider whether locale comparison is truly necessary. Often, internal processing can use binary comparison, with locale comparison only at the final display layer.

Common Pitfalls and Gotchas

Both comparison approaches have traps for the unwary:

Pitfall 1: Mixing Comparison Types

Using binary comparison for one operation and locale comparison for another on the same data:

Data stored sorted with binary comparison
→ Binary search works

User interface sorts with locale comparison  
→ Display order differs from storage order
→ Pagination and scrolling break!

Solution: Always use consistent comparison within a data pipeline.

Pitfall 2: Locale Environment Variability

Development machine: Locale data for all languages installed
Production server: Minimal Docker image, no locale data

Result: Code works locally, breaks in production

Solution: Explicitly install locale data in deployment, or use a library such as ICU that ships its own collation data instead of relying on the operating system's locale packages.

Pitfall 3: Default Locale Surprises

System default locale varies between machines:

Developer in US: en_US sort order
Server in Germany: de_DE sort order
Test server in Japan: ja_JP sort order

All produce different results for the same data!

Solution: Always specify locale explicitly; never rely on system default.

Pitfall 4: Binary Search on Locale-Sorted Data

Data sorted with locale comparison
Binary search using binary comparison

Binary search assumes the data is sorted under the same ordering it uses for its own comparisons
Locale and binary comparison can disagree on which of two strings comes first, so that assumption is violated

Result: Binary search fails or misses elements

Solution: Sort and search must use the same comparison.
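
The failure is easy to reproduce with Python's `bisect` module. In this sketch, `str.casefold` stands in for a real collation key (an assumption made only to keep the demo self-contained); the same reasoning applies to any locale-aware key.

```python
import bisect

# Stand-in for a real collation key function (assumption for the demo).
collation_key = str.casefold

data = sorted(["cherry", "apple", "Banana"], key=collation_key)
# data is now ['apple', 'Banana', 'cherry']

# Wrong: bisect compares raw strings (binary order), but data is not sorted that way.
wrong_index = bisect.bisect_left(data, "Banana")
found_wrong = wrong_index < len(data) and data[wrong_index] == "Banana"

# Right: search over the same keys that defined the sort order.
keys = [collation_key(word) for word in data]
right_index = bisect.bisect_left(keys, collation_key("Banana"))
found_right = right_index < len(keys) and data[right_index] == "Banana"

print(found_wrong, found_right)  # False True
```

Keeping the key list alongside the data means the search uses exactly the ordering the sort used, which is the invariant binary search depends on.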

Pitfall 5: Hash Tables with Locale Keys

// Stored with the precomposed form: 'é' is U+00E9
map["caf\u00E9"] = value

// Looked up with the decomposed form: 'e' + combining acute U+0301
map["cafe\u0301"]  // Looks identical on screen, but different underlying bytes!

Result: Key not found

Solution: Use Unicode normalization before hashing. Use exact binary comparison for hash keys.
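
One way to apply this in Python is to normalize every key on the way in and on the way out. A minimal sketch, where `nfc` is a hypothetical helper and NFC is chosen arbitrarily as the canonical form (any single form works as long as it is used consistently):

```python
import unicodedata

def nfc(text: str) -> str:
    """Normalize to NFC so equivalent spellings hash identically."""
    return unicodedata.normalize("NFC", text)

table = {}
table[nfc("caf\u00e9")] = "found"        # stored from the precomposed spelling

raw_lookup_hits = "cafe\u0301" in table  # decomposed spelling, not normalized
normalized_lookup = table.get(nfc("cafe\u0301"))

print(raw_lookup_hits)    # False: different code points, different hash
print(normalized_lookup)  # found
```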

Pitfall 6: Ignoring Locale in User-Facing Data

App displays: ["Zoo", "apple", "Öl"]
User expects: ["apple", "Öl", "Zoo"] (German user)

Result: App feels 'broken' or 'foreign'

Solution: Detect or allow user to select locale for display sorting.

Key Gotcha Summary

•Never mix comparison types in the same data pipeline
•Never rely on system default locale — always specify explicitly
•Always install locale data in production environments
•Use consistent comparison for sort and search
•Normalize before hashing — use binary for hash table keys
•Test with non-ASCII data — ASCII-only tests miss locale bugs

Summary: Binary vs Locale Comparison

We've explored the fundamental distinction between binary and locale-aware string comparison—two approaches that can produce very different results from the same input. Let's consolidate the essential knowledge:

Key Takeaways

•Binary comparison uses raw code point values — fast, deterministic, and consistent across systems.
•Locale comparison uses cultural sorting rules — correct for human expectations but slower and system-dependent.
•Same data, different results — 'ä' may sort with 'a' or after 'z' depending on the approach and locale.
•UCA provides standardization — The Unicode Collation Algorithm defines how locale-aware sorting should work.
•Use binary for technical data — identifiers, keys, and machine-consumed ordering.
•Use locale for user-facing data — names, titles, and any content shown to humans.
•Consistency is paramount — Never mix comparison approaches in the same data pipeline.
•Performance differs significantly — Locale comparison is typically 3-10x slower than binary.

What's next:

Comparing strings for ordering is one challenge. Comparing strings for equality is another—with its own subtleties. The next page explores equality vs identity—when two strings are 'equal' versus 'identical', how different programming languages handle this distinction, and why it matters for correctness and performance.

Page Complete

You now understand the two fundamental approaches to string comparison—binary and locale-aware—and can make informed decisions about which to use. You've seen how the same strings produce different orderings depending on the approach, and you're equipped to avoid the common pitfalls that cause subtle bugs in internationalized applications.
