String Comparison, Ordering & Equality

Level: Intermediate

Duration: 60 mins

Topic: Strings

Equality vs Identity

When Is 'Hello' Not the Same as 'Hello'?

Consider this puzzle that has confused countless developers:

String a = "Hello";
String b = "Hello";
String c = new String("Hello");

a == b      // true (in Java)
a == c      // false (in Java!)
a.equals(c) // true

How can two strings that are clearly identical—both containing the characters H-e-l-l-o—not be == to each other?

The answer unveils one of the most fundamental distinctions in programming: equality vs identity.

Equality: Do two strings have the same content? Do they represent the same value?
Identity: Are two references pointing to the same object in memory? Are they literally the same thing?

This distinction is crucial for correctness (using == when you should use equals() is a classic bug) and for performance (identity comparison is faster than equality comparison).

What You Will Learn

By the end of this page, you will understand the deep difference between equality and identity for strings—why this distinction exists, how different programming languages handle it, the performance implications, and the role of string interning in bridging these concepts. You'll be equipped to avoid one of the most common bug patterns in string programming.

Defining Equality and Identity

Let's establish precise definitions:

Identity (Reference Equality):

Two references have the same identity if they point to the exact same object in memory. They are not just similar—they are literally the same thing, occupying the same memory address.

Variable a ──────────┐
                      ├──► [Memory: 0x1A2B3C] "Hello"
Variable b ──────────┘

a and b have the same identity (they point to the same object)

Equality (Value Equality):

Two strings are equal if they contain the same sequence of characters in the same order. They may or may not be the same object in memory.

Variable a ────► [Memory: 0x1A2B3C] "Hello"
Variable c ────► [Memory: 0x4D5E6F] "Hello"  (different object!)

a and c are equal (same content) but do not have the same identity

Equality vs Identity: Key Distinctions
Aspect	Identity	Equality
What is compared	Memory addresses	String contents
Must be same object?	Yes	No
Time complexity	O(1)	O(n) worst case
Relationship	Identity implies equality	Equality does not imply identity
Operator (Java)	==	.equals()
Operator (JavaScript)	N/A for primitive strings*	===
Operator (Python)	is	==
Operator (C#)	ReferenceEquals()	.Equals() or ==

*JavaScript note: Strings are primitives in JavaScript, so === compares them by value, and the language offers no reference comparison for them. Only String wrapper objects created with new String() are compared by reference.

The Fundamental Relationship:

If two references have the same identity (same object), they are necessarily equal (same content). A single object cannot contain different values simultaneously.

But if two references are equal (same content), they may or may not have the same identity. There could be two separate objects in memory that happen to contain identical data.

Identity    →  Equality    (always true)
Equality   ⇏  Identity    (not always true)

This asymmetry is the source of many bugs.
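The asymmetry can be observed directly. The following is a minimal Python sketch, assuming CPython, where a string assembled at run time is usually a separate object (the language does not promise this). The helper name describe is hypothetical and exists only for this illustration.

```python
def describe(x, y):
    """Return (identical, equal) for two references."""
    return (x is y, x == y)


a = "Hello"
b = a                        # second reference to the same object
c = "".join(["Hel", "lo"])   # assembled at run time; usually a separate object in CPython

identical_ab, equal_ab = describe(a, b)
identical_ac, equal_ac = describe(a, c)

# Identity implies equality: whenever the first flag is True, the second is too.
print(identical_ab, equal_ab)
print(identical_ac, equal_ac)
```

The pair for a and b is always (True, True). For a and c, equality always holds, while the identity flag depends on the implementation, which is exactly why identity must not be used to test for equal content.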

The Analogy

Think of identical twins. They may look exactly the same (equal) but they are not the same person (identity). If you send a letter to one twin, it won't automatically reach the other. Similarly, modifying one object (if it were mutable) wouldn't affect an equal-but-not-identical object.

How Different Languages Handle This

Different programming languages have different approaches to equality and identity:

Java:

String a = "Hello";
String b = "Hello";
String c = new String("Hello");

// Identity check
a == b       // true (string pool optimization)
a == c       // false (c is a new object)

// Equality check
a.equals(b)  // true
a.equals(c)  // true

// Interning to force identity
c = c.intern();
a == c       // true (now c points to pooled string)

Key point: In Java, == checks identity, .equals() checks equality. Using == for string comparison is almost always a bug.

Python:

a = "Hello"
b = "Hello"
c = "".join(["H", "e", "l", "l", "o"])  # Creates new string

# Identity check
a is b       # True in CPython (literals are interned); not guaranteed by the language
a is c       # May be True or False depending on implementation

# Equality check
a == b       # True
a == c       # True (always)

# Best practice: Always use == for string comparison

JavaScript:

let a = "Hello";
let b = "Hello";
let c = "Hel" + "lo";  // Built by concatenation

// JavaScript strings are primitives, not objects
a === b   // true (value equality for primitives)
a === c   // true (value equality)

// JavaScript doesn't expose identity semantics for strings
// All comparisons are effectively value comparisons

C#:

string a = "Hello";
string b = "Hello";
string c = new string(new char[] {'H','e','l','l','o'});

// Overloaded == performs equality check!
a == b   // true (equality, not identity)
a == c   // true (equality)

// Explicit identity check
Object.ReferenceEquals(a, b)  // true (interning)
Object.ReferenceEquals(a, c)  // false

// Equality
a.Equals(b)  // true
a.Equals(c)  // true

Key point: C# overloads == for strings to perform equality, not identity—making it safer than Java's approach.

Language Comparison Summary
Language	Equality Operator	Identity Operator	Safety
Java	.equals()	==	Easy to use == by mistake
Python	==	is	Generally safe, 'is' rarely needed
JavaScript	===	N/A	Primitives use value semantics
C#	== or .Equals()	ReferenceEquals()	Safe, == is overloaded
C++	== (std::string)	&a == &b (address comparison)	Safe for std::string; == on char* compares pointers
Go	==	N/A for strings	Strings have value semantics
Rust	==	ptr::eq()	Clear semantics

The Java Trap

In Java, using == to compare strings is one of the most common bugs:

if (userInput == "expected")  // BUG! May fail even if content matches
if (userInput.equals("expected"))  // CORRECT

This bug is especially insidious because == sometimes works (when strings are interned) but fails unpredictably with runtime-created strings.

String Interning: Bridging Equality and Identity

String interning is an optimization technique where the runtime maintains a pool of unique string values. When a string is interned, the runtime checks if an equal string already exists in the pool:

If yes: return a reference to the existing string
If no: add this string to the pool and return its reference

The result: All interned strings with equal content share the same identity.

How Interning Works:

Pool: { }

Step 1: Intern "Hello"
  - Pool empty, add "Hello"
  - Pool: { 0x100: "Hello" }
  - Return reference: 0x100

Step 2: Intern "Hello" again
  - Found "Hello" in pool at 0x100
  - Pool: { 0x100: "Hello" }  (unchanged)
  - Return existing reference: 0x100

Step 3: Intern "World"
  - Not in pool, add "World"
  - Pool: { 0x100: "Hello", 0x200: "World" }
  - Return reference: 0x200

Once both strings have been interned, they can be compared by identity (fast) instead of equality (slower).
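The three steps above can be sketched as a toy pool. This is an illustration of the idea under simple assumptions (a plain dictionary as the pool, no eviction), not how any particular runtime implements interning; the class name InternPool is hypothetical.

```python
class InternPool:
    """Toy intern pool: maps each distinct string value to one canonical object."""

    def __init__(self):
        self._pool = {}

    def intern(self, s):
        # setdefault returns the stored object if an equal key already exists,
        # otherwise it stores s and returns it.
        return self._pool.setdefault(s, s)

    def __len__(self):
        return len(self._pool)


pool = InternPool()
first = pool.intern("".join(["Hel", "lo"]))   # Step 1: pool empty, "Hello" added
second = pool.intern("".join(["He", "llo"]))  # Step 2: equal string found, existing object returned
world = pool.intern("World")                  # Step 3: new value added

print(first is second)  # True: both references come from the pool
print(len(pool))        # 2
```

Because the pool always hands back the first object it stored for a given value, any two references obtained from it with equal content are also identical.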

Automatic Interning:

Many runtimes automatically intern string literals (strings that appear directly in source code). Java and C# guarantee this; CPython does it for many literals as an implementation detail. In Java:

// Java: Literals are automatically interned
String a = "Hello";  // Goes to string pool
String b = "Hello";  // Reuses same pool entry
a == b  // true (same object from pool)

// Runtime-created strings are NOT automatically interned
String c = new String("Hello");  // New object, not in pool
a == c  // false (different objects)

String d = scanner.nextLine();  // Input "Hello" from user
a == d  // false (d not interned)

Manual Interning:

You can explicitly intern strings to force pooling:

// Java
String c = new String("Hello").intern();
a == c  // true (now c references pooled string)

// Python
import sys
c = sys.intern("Hello")  # Force interning

Why Not Intern Everything?

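In some runtimes, interned strings are effectively never garbage collected (in .NET, intern pool memory is unlikely to be released before the runtime exits); modern JVMs and CPython can reclaim interned strings once nothing references them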
The lookup to check if a string exists has cost
For strings used once and discarded, interning wastes memory
Pool can grow unboundedly if overused

When to Intern

•Strings compared many times
•Small set of repeated values (enum-like)
•Dictionary/hash table keys
•Identifiers and symbols
•Configuration keys

When Not to Intern

•Unique strings (user IDs, UUIDs)
•Large strings (interning has overhead)
•Strings used only briefly
•Dynamically generated content
•High-cardinality data

Interning Is an Optimization, Not a Correctness Mechanism

Never rely on interning for correctness. Always compare strings with the equality method (.equals(), ==, whatever your language uses for value comparison). Interning should only be a performance optimization—and only when profiling shows it matters.

Performance Implications

The choice between identity and equality comparison has significant performance implications:

Identity Comparison (O(1)):

Comparing memory addresses is a single CPU operation:

a == b   →   Compare 0x1A2B3C with 0x1A2B3C   →   true

Cost: 1 operation regardless of string length

A 10-character string and a 10-million-character string take the same time for identity comparison.

Equality Comparison (O(n)):

Comparing string content requires examining each character:

a.equals(b)   →   Compare 'H' with 'H'   →   equal, continue
               →   Compare 'e' with 'e'   →   equal, continue
               →   Compare 'l' with 'l'   →   equal, continue
               →   Compare 'l' with 'l'   →   equal, continue
               →   Compare 'o' with 'o'   →   equal, done
               →   true

Cost: n operations for strings of length n

Best case: Different strings with different first characters → O(1)

Worst case: Equal strings or strings differing only at the end → O(n)
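To make the best and worst cases concrete, here is a small Python sketch that counts how many character pairs are examined. The function name equals_with_count is hypothetical, and the up-front length check is an assumption about how a typical implementation is structured.

```python
def equals_with_count(a, b):
    """Compare two strings character by character.

    Returns (equal, comparisons), where comparisons is the number of
    character pairs examined before the answer was known.
    """
    if len(a) != len(b):          # assumption: length is checked first, O(1)
        return (False, 0)
    comparisons = 0
    for ch_a, ch_b in zip(a, b):
        comparisons += 1
        if ch_a != ch_b:
            return (False, comparisons)
    return (True, comparisons)


print(equals_with_count("Hello", "World"))  # (False, 1): best case
print(equals_with_count("Hello", "Hellp"))  # (False, 5): differ only at the end
print(equals_with_count("Hello", "Hello"))  # (True, 5): equal strings
```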

Performance: Identity vs Equality
Scenario	Identity (==)	Equality (.equals())
Same object	O(1) - true	O(1) - true (identity shortcut)
Different objects, same content	O(1) - false	O(n) - true
Different objects, different content	O(1) - false	O(1) to O(n) - false
Strings of length 1M, equal	~1 nanosecond	Roughly microseconds to a millisecond (hardware-dependent)
Early exit possible	Not needed	Yes - identity match, length mismatch, or first differing character

Optimizations in Equality Comparison:

Most equality implementations include optimizations:

// Typical equals() implementation structure (simplified sketch, not the JDK source)
public boolean equals(Object other) {
    if (this == other) return true;        // Identity shortcut: O(1)
    if (other == null) return false;       // Null check: O(1)
    if (!(other instanceof String)) return false;

    String s = (String) other;
    if (this.length() != s.length()) return false;  // Length check: O(1)

    // Character comparison: O(n)
    for (int i = 0; i < this.length(); i++) {
        if (this.charAt(i) != s.charAt(i)) return false;
    }
    return true;
}

Why the identity check first?

Because interned strings (from literals and manual interning) are extremely common. The O(1) identity check catches these cases immediately.

Why the length check?

Because strings of different lengths cannot be equal. Checking length is O(1) (stored as metadata) and eliminates many comparisons immediately.

Hash Code Caching

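Because strings are immutable, their hash codes can be cached. Java's String class computes the hash code lazily on the first hashCode() call and stores it in the object, so later calls are O(1) instead of O(n). A lookup still needs an equality check against the matching bucket entry, but combined with interning (where that check usually succeeds on the identity shortcut), this enables very fast string-keyed hash tables.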
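The caching idea can be sketched in Python with a small wrapper class. Everything here is illustrative: CachedHashText and its hash_computations counter are hypothetical names, and the 31-based polynomial only resembles the formula Java documents for String.hashCode() (it is masked to 32 bits and works on code points rather than UTF-16 units).

```python
class CachedHashText:
    """Immutable text wrapper that computes its hash once and reuses it."""

    def __init__(self, value):
        self._value = value
        self._hash = None            # not computed yet
        self.hash_computations = 0   # instrumentation for this sketch only

    def __hash__(self):
        if self._hash is None:
            self.hash_computations += 1
            h = 0
            for ch in self._value:                   # O(n) polynomial hash
                h = (31 * h + ord(ch)) & 0xFFFFFFFF
            self._hash = h
        return self._hash

    def __eq__(self, other):
        return isinstance(other, CachedHashText) and self._value == other._value


key = CachedHashText("Hello")
hash(key)   # first call walks the characters
hash(key)   # second call returns the stored value
print(key.hash_computations)  # 1
```

The cache is only safe because the wrapped value never changes; a mutable string would need to invalidate the stored hash on every modification.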

Hash Tables and String Keys

Hash tables (HashMap, Dict, Map) rely heavily on the equality/identity distinction:

How Hash Table Lookup Works:

map.get("Hello")

Step 1: Compute hash code → hashCode("Hello") → 42 (example)
Step 2: Find bucket → bucket[42 % bucketCount]
Step 3: Search bucket for matching key
        For each entry in bucket:
            If entry.key.equals("Hello")  → found!

The Critical Question: How do we check if entry.key matches "Hello"?

Using Equality (Standard Approach):

entry.key.equals("Hello")

This is always correct, but each comparison costs O(n) in the worst case.

Using Identity (After Interning):

entry.key == "Hello"  // Only safe if both are interned

This is O(1) per comparison, but it is only correct if every stored key and every lookup key is interned.
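A common middle ground is to try identity first and fall back to equality, which keeps the lookup correct while still benefiting from shared objects. The following is a minimal Python sketch of the lookup steps under that assumption; TinyHashMap is a hypothetical teaching class with a fixed bucket count and no resizing.

```python
class TinyHashMap:
    """Minimal chained hash map illustrating the lookup steps."""

    def __init__(self, bucket_count=8):
        self._buckets = [[] for _ in range(bucket_count)]

    def _bucket_for(self, key):
        # Steps 1 and 2: compute the hash, then pick the bucket.
        return self._buckets[hash(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket_for(key)
        for i, (k, _) in enumerate(bucket):
            if k is key or k == key:
                bucket[i] = (k, value)   # overwrite the existing entry
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # Step 3: search the bucket; identity check first, then equality.
        for k, v in self._bucket_for(key):
            if k is key or k == key:
                return v
        return default


m = TinyHashMap()
m.put("Hello", 1)
lookup = "".join(["Hel", "lo"])   # equal content, possibly a different object
print(m.get(lookup))              # 1
print(m.get("missing"))           # None
```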

Real-World Optimization Pattern:

// Intern keys at insertion time
map.put(key.intern(), value);

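// Intern lookup keys too. With a standard HashMap, equals() still runs,
// but its identity shortcut returns immediately for interned keys.
// An identity-based map (such as IdentityHashMap) requires this step.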
map.get(lookupKey.intern());

Python's Approach:

In CPython, dictionaries benefit from string interning as an implementation detail:

# CPython interns identifier-like string constants at compile time
d = {"status": "active"}  # "status" is interned

# Lookup checks identity before falling back to ==
d["status"]  # Same interned object, so the identity check succeeds

Trade-off Analysis:

Approach	Insertion Cost	Lookup Cost	Memory
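No interning	O(n) hash	O(n) per matching bucket entry (worst case)	Normal
Intern on insert	O(n) hash + intern	O(1) if the lookup key is the same object, otherwise O(n)	Higher
Full interning	O(n) hash + intern	O(1) per bucket entry, after an O(n) intern of the lookup key	Highest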

Specialized String Maps

For performance-critical applications, specialized data structures exist:

•Trie: Implicit deduplication of prefixes
•Symbol tables: Intern-based in compilers and interpreters
•Weak-keyed maps (for example, java.util.WeakHashMap): Balance interning with GC
•Perfect hashing: For fixed key sets

Unicode Normalization and Equality

Unicode introduces another dimension to string equality: the same visual character can be represented by more than one sequence of code points.

The Problem:

The character 'é' can be represented two ways:

Precomposed: A single code point U+00E9 (é)
Decomposed: 'e' (U+0065) + combining acute accent (U+0301)

Both render identically but have different byte sequences:

Precomposed:  [C3 A9]        (2 bytes UTF-8)
Decomposed:   [65 CC 81]     (3 bytes UTF-8)
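These byte sequences can be checked with a short Python snippet (the variable names are only for illustration):

```python
precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

print(precomposed.encode("utf-8").hex())  # c3a9
print(decomposed.encode("utf-8").hex())   # 65cc81
```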

Standard equality comparison sees these as different:

a = "é"              # Precomposed (1 code point)
b = "e\u0301"        # Decomposed (2 code points)

a == b               # False!
len(a)               # 1
len(b)               # 2

# But they look identical when printed!
print(a)             # é
print(b)             # é

The Solution: Normalization

Unicode defines four normalization forms:

Form	Name	Description
NFC	Composed	Combines characters where possible
NFD	Decomposed	Separates base characters and combining marks
NFKC	Compatibility Composed	NFC + compatibility equivalents
NFKD	Compatibility Decomposed	NFD + compatibility equivalents
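A short Python sketch of how the forms differ, using the standard unicodedata module. The ligature U+FB01 is chosen here as an example of a compatibility character; it is not from the original text.

```python
import unicodedata

decomposed = "e\u0301"   # e + combining acute accent
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI, a compatibility character

nfc = unicodedata.normalize("NFC", decomposed)         # composes to U+00E9
nfd = unicodedata.normalize("NFD", "\u00e9")           # splits into e + U+0301
nfkc = unicodedata.normalize("NFKC", ligature)         # replaces the ligature with "fi"
nfc_ligature = unicodedata.normalize("NFC", ligature)  # canonical form keeps the ligature

print(len(nfc), len(nfd))        # 1 2
print(nfkc)                      # fi
print(nfc_ligature == ligature)  # True
```

The canonical forms (NFC, NFD) only change how a character is composed, while the compatibility forms (NFKC, NFKD) may replace a character with a different one, which is why they are usually reserved for search and matching rather than storage.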

Normalized Comparison:

import unicodedata

a = "é"           # Precomposed
b = "e\u0301"     # Decomposed

# Normalize both to same form before comparing
unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)  # True!

Best Practices:

Normalize at input boundaries: When receiving text, normalize to a consistent form (usually NFC)
Store normalized text: Database columns should contain normalized strings
Hash normalized strings: Hash codes should be computed on normalized forms
Document your normalization policy: Team members should know what form to expect

Normalization Gotchas

•Normalization can change string length
•Not every sequence has a precomposed form, so NFC output can still contain combining marks
•NFKC/NFKD can change meaning (℡ → TEL)
•OS-level APIs may return non-normalized strings
•Copy-paste from different sources may introduce mixed normalization

When Normalization Matters

•Usernames and identifiers: Two visually identical names should be the same account
•Search functionality: Query for 'café' should find 'café' regardless of encoding
•Hash table keys: Same-looking keys must have same hash
•Text comparison: Detecting duplicates in content
•Security contexts: Reducing spoofing through equivalent encodings (normalization alone does not stop cross-script look-alikes, such as Cyrillic 'а' versus Latin 'a')

Common Bugs and Anti-Patterns

Confusion between equality and identity causes some of the most common string bugs:

Bug 1: Java == for String Comparison

// BAD: Works for literals, fails for runtime strings
if (userInput == "admin") {  // userInput is NOT interned!
    grantAdminAccess();
}

// GOOD: Always use equals()
if ("admin".equals(userInput)) {
    grantAdminAccess();
}

// Why "admin".equals(userInput) not userInput.equals("admin")?
// Avoids NullPointerException if userInput is null

Bug 2: Python 'is' Instead of '=='

# BAD: Works sometimes due to interning, fails unpredictably
if user_role is "admin":  # Don't use 'is' for string comparison!
    grant_access()

# GOOD: Always use ==
if user_role == "admin":
    grant_access()

Note: Python 3.8 and later flag this at compile time with a SyntaxWarning about using "is" with a literal.

Bug 3: Assuming String Immutability Means Identity

String a = "Hello";
String b = a;

a = a + "!";  // Creates NEW string, doesn't modify existing

a == b  // false: a now points to "Hello!", b still points to "Hello"

Concatenation does not modify the original string. It creates a new string and rebinds a to it, while b still refers to the original "Hello" object.

Bug 4: Inconsistent Normalization in Hash Keys

# User 1 enters café (precomposed)
user_data["café"] = {...}

# User 2 searches for café (decomposed from different input method)
user_data["cafe\u0301"]  # KeyError! Different encoding

# Solution: Normalize keys
import unicodedata
def normalize_key(k):
    return unicodedata.normalize('NFC', k)

user_data[normalize_key("café")] = {...}
user_data[normalize_key("cafe\u0301")]  # Works!

Bug 5: Relying on Interning for Correctness

// BAD: Depends on interning behavior
if (config.getValue() == "enabled") {...}  // May fail!

// FRAGILE: Correct only because both sides end up interned; pays for a pool
// lookup and throws NullPointerException if getValue() returns null
config.getValue().intern() == "enabled"  // Works but wasteful

// GOOD: Use proper equality
"enabled".equals(config.getValue())  // Always correct

Anti-Patterns

Avoid these:

•Using == for string equality in Java
•Using is for string comparison in Python
•Assuming equal strings share identity
•Relying on interning for correctness
•Mixing normalized and non-normalized strings

Best Practices

Do these instead:

•Use .equals() in Java, == in Python
•Put literals first: "x".equals(var)
•Normalize strings at system boundaries
•Use interning only as optimization
•Document comparison semantics

Summary: Equality vs Identity Mastered

We've explored the fundamental distinction between equality and identity—two concepts that appear similar but have profound differences in semantics and performance. Let's consolidate the essential knowledge:

Key Takeaways

•Identity checks if references point to the same object — Fast (O(1)) but only true for the same object in memory.
•Equality checks if strings have the same content — Slower (O(n)) but correct for comparing values.
•Identity implies equality, but not vice versa — Equal strings may exist as separate objects.
•String interning pools strings for identity-based optimization — All interned strings with equal content share identity.
•Language syntax varies — Java uses == for identity, .equals() for equality; Python uses is vs ==; JavaScript simplifies with value semantics.
•Using identity when equality is needed is a classic bug — Java's == on strings is one of the most common mistakes.
•Unicode normalization affects equality — Visually identical strings may have different encodings.
•Performance trade-offs are real — Identity is O(1), equality is O(n), but equality is almost always what you want for correctness.

Module Complete:

With this page, we've completed our exploration of String Comparison, Ordering & Equality. You now understand:

Lexicographical ordering: How strings are compared character by character
Case sensitivity: Why 'A' ≠ 'a' and how to handle it correctly
Locale vs binary comparison: When cultural order matters and when it doesn't
Equality vs identity: The fundamental distinction that causes countless bugs

These concepts form the foundation for avoiding logical bugs in string problems—the stated outcome of this module.

Module Complete

You now have a comprehensive understanding of string comparison at all levels—from the byte level to the cultural level. You can make informed decisions about comparison approaches, avoid common pitfalls, and reason precisely about string operations. This knowledge will serve you in debugging, optimization, and designing correct algorithms for string problems.

4 / 4

Loading learning content...

Data Structures & AlgorithmsStrings

String Comparison, Ordering & Equality

LevelIntermediate

Duration60 mins

TopicStrings

4 / 4

Equality vs Identity

When Is 'Hello' Not the Same as 'Hello'?

Consider this puzzle that has confused countless developers:

String a = "Hello";
String b = "Hello";
String c = new String("Hello");

a == b      // true (in Java)
a == c      // false (in Java!)
a.equals(c) // true

How can two strings that are clearly identical—both containing the characters H-e-l-l-o—not be == to each other?

The answer unveils one of the most fundamental distinctions in programming: equality vs identity.

Equality: Do two strings have the same content? Do they represent the same value?
Identity: Are two references pointing to the same object in memory? Are they literally the same thing?

This distinction is crucial for correctness (using == when you should use equals() is a classic bug) and for performance (identity comparison is faster than equality comparison).

What You Will Learn

Defining Equality and Identity

Let's establish precise definitions:

Identity (Reference Equality):

Two references have the same identity if they point to the exact same object in memory. They are not just similar—they are literally the same thing, occupying the same memory address.

Variable a ──────────┐
                      ├──► [Memory: 0x1A2B3C] "Hello"
Variable b ──────────┘

a and b have the same identity (they point to the same object)

Equality (Value Equality):

Two strings are equal if they contain the same sequence of characters in the same order. They may or may not be the same object in memory.

Variable a ────► [Memory: 0x1A2B3C] "Hello"
Variable c ────► [Memory: 0x4D5E6F] "Hello"  (different object!)

a and c are equal (same content) but do not have the same identity

Equality vs Identity: Key Distinctions
Aspect	Identity	Equality
What is compared	Memory addresses	String contents
Must be same object?	Yes	No
Time complexity	O(1)	O(n) worst case
Relationship	Identity implies equality	Equality does not imply identity
Operator (Java)	==	.equals()
Operator (JavaScript)	===	===*
Operator (Python)	is	==
Operator (C#)	ReferenceEquals()	.Equals() or ==

*JavaScript note: In JavaScript, === performs value equality for primitives (including strings), not reference equality. This is a simplification in the language.

The Fundamental Relationship:

If two references have the same identity (same object), they are necessarily equal (same content). A single object cannot contain different values simultaneously.

But if two references are equal (same content), they may or may not have the same identity. There could be two separate objects in memory that happen to contain identical data.

Identity    →  Equality    (always true)
Equality   ⇏  Identity    (not always true)

This asymmetry is the source of many bugs.

The Analogy

How Different Languages Handle This

Different programming languages have different approaches to equality and identity:

Java:

String a = "Hello";
String b = "Hello";
String c = new String("Hello");

// Identity check
a == b       // true (string pool optimization)
a == c       // false (c is a new object)

// Equality check
a.equals(b)  // true
a.equals(c)  // true

// Interning to force identity
c = c.intern();
a == c       // true (now c points to pooled string)

Key point: In Java, == checks identity, .equals() checks equality. Using == for string comparison is almost always a bug.

Python:

a = "Hello"
b = "Hello"
c = "".join(["H", "e", "l", "l", "o"])  # Creates new string

# Identity check
a is b       # True (interning optimization)
a is c       # May be True or False depending on implementation

# Equality check
a == b       # True
a == c       # True (always)

# Best practice: Always use == for string comparison

JavaScript:

let a = "Hello";
let b = "Hello";
let c = "Hel" + "lo";  // Runtime concatenation

// JavaScript strings are primitives, not objects
a === b   // true (value equality for primitives)
a === c   // true (value equality)

// JavaScript doesn't expose identity semantics for strings
// All comparisons are effectively value comparisons

C#:

string a = "Hello";
string b = "Hello";
string c = new string(new char[] {'H','e','l','l','o'});

// Overloaded == performs equality check!
a == b   // true (equality, not identity)
a == c   // true (equality)

// Explicit identity check
Object.ReferenceEquals(a, b)  // true (interning)
Object.ReferenceEquals(a, c)  // false

// Equality
a.Equals(b)  // true
a.Equals(c)  // true

Key point: C# overloads == for strings to perform equality, not identity—making it safer than Java's approach.

Language Comparison Summary
Language	Equality Operator	Identity Operator	Safety
Java	.equals()	==	Easy to use == by mistake
Python	==	is	Generally safe, 'is' rarely needed
JavaScript	===	N/A	Primitives use value semantics
C#	== or .Equals()	ReferenceEquals()	Safe, == is overloaded
C++	== (if overloaded)	pointer ==	Depends on implementation
Go	==	N/A for strings	Strings have value semantics
Rust	==	ptr::eq()	Clear semantics

The Java Trap

In Java, using == to compare strings is one of the most common bugs:

if (userInput == "expected")  // BUG! May fail even if content matches
if (userInput.equals("expected"))  // CORRECT

This bug is especially insidious because == sometimes works (when strings are interned) but fails unpredictably with runtime-created strings.

String Interning: Bridging Equality and Identity

If yes: return a reference to the existing string
If no: add this string to the pool and return its reference

The result: All interned strings with equal content share the same identity.

How Interning Works:

Pool: { }

Step 1: Intern "Hello"
  - Pool empty, add "Hello"
  - Pool: { 0x100: "Hello" }
  - Return reference: 0x100

Step 2: Intern "Hello" again
  - Found "Hello" in pool at 0x100
  - Pool: { 0x100: "Hello" }  (unchanged)
  - Return existing reference: 0x100

Step 3: Intern "World"
  - Not in pool, add "World"
  - Pool: { 0x100: "Hello", 0x200: "World" }
  - Return reference: 0x200

After interning, strings can be compared by identity (fast) instead of equality (slower).

Automatic Interning:

Most languages automatically intern string literals (strings that appear directly in source code):

// Java: Literals are automatically interned
String a = "Hello";  // Goes to string pool
String b = "Hello";  // Reuses same pool entry
a == b  // true (same object from pool)

// Runtime-created strings are NOT automatically interned
String c = new String("Hello");  // New object, not in pool
a == c  // false (different objects)

String d = scanner.nextLine();  // Input "Hello" from user
a == d  // false (d not interned)

Manual Interning:

You can explicitly intern strings to force pooling:

// Java
String c = new String("Hello").intern();
a == c  // true (now c references pooled string)

// Python
import sys
c = sys.intern("Hello")  # Force interning

Why Not Intern Everything?

Interned strings can never be garbage collected (in many implementations)
The lookup to check if a string exists has cost
For strings used once and discarded, interning wastes memory
Pool can grow unboundedly if overused

When to Intern

•Strings compared many times
•Small set of repeated values (enum-like)
•Dictionary/hash table keys
•Identifiers and symbols
•Configuration keys

When Not to Intern

•Unique strings (user IDs, UUIDs)
•Large strings (interning has overhead)
•Strings used only briefly
•Dynamically generated content
•High-cardinality data

Interning Is an Optimization, Not a Correctness Mechanism

Performance Implications

The choice between identity and equality comparison has significant performance implications:

Identity Comparison (O(1)):

Comparing memory addresses is a single CPU operation:

a == b   →   Compare 0x1A2B3C with 0x1A2B3C   →   true

Cost: 1 operation regardless of string length

A 10-character string and a 10-million-character string take the same time for identity comparison.

Equality Comparison (O(n)):

Comparing string content requires examining each character:

a.equals(b)   →   Compare 'H' with 'H'   →   equal, continue
               →   Compare 'e' with 'e'   →   equal, continue
               →   Compare 'l' with 'l'   →   equal, continue
               →   Compare 'l' with 'l'   →   equal, continue
               →   Compare 'o' with 'o'   →   equal, done
               →   true

Cost: n operations for strings of length n

Best case: Different strings with different first characters → O(1)

Worst case: Equal strings or strings differing only at the end → O(n)

Performance: Identity vs Equality
Scenario	Identity (==)	Equality (.equals())
Same object	O(1) - true	O(n) - true
Different objects, same content	O(1) - false	O(n) - true
Different objects, different content	O(1) - false	O(1) to O(n) - false
Strings of length 1M, equal	~1 nanosecond	~1 millisecond
Helpful early-out	No	Yes - different lengths

Optimizations in Equality Comparison:

Most equality implementations include optimizations:

// Typical equals() implementation structure
public boolean equals(Object other) {
    if (this == other) return true;        // Identity shortcut: O(1)
    if (other == null) return false;       // Null check: O(1)
    if (!(other instanceof String)) return false;
    
    String s = (String) other;
    if (this.length != s.length) return false;  // Length check: O(1)
    
    // Character comparison: O(n)
    for (int i = 0; i < length; i++) {
        if (this.charAt(i) != s.charAt(i)) return false;
    }
    return true;
}

Why the identity check first?

Because interned strings (from literals and manual interning) are extremely common. The O(1) identity check catches these cases immediately.

Why the length check?

Because strings of different lengths cannot be equal. Checking length is O(1) (stored as metadata) and eliminates many comparisons immediately.

Hash Code Caching

Hash Tables and String Keys

Hash tables (HashMap, Dict, Map) rely heavily on the equality/identity distinction:

How Hash Table Lookup Works:

map.get("Hello")

Step 1: Compute hash code → hashCode("Hello") → 42 (example)
Step 2: Find bucket → bucket[42 % bucketCount]
Step 3: Search bucket for matching key
        For each entry in bucket:
            If entry.key.equals("Hello")  → found!

The Critical Question: How do we check if entry.key matches "Hello"?

Using Equality (Standard Approach):

entry.key.equals("Hello")

This works correctly but costs O(n) per comparison.

Using Identity (After Interning):

entry.key == "Hello"  // Only safe if both are interned

This is O(1) per comparison—but only safe if all keys are interned.

Real-World Optimization Pattern:

// Intern keys at insertion time
map.put(key.intern(), value);

// Intern lookup keys to enable identity comparison
// (Only if map implementation uses identity)
map.get(lookupKey.intern());

Python's Approach:

Python dictionaries automatically benefit from string interning:

# Python interns small strings automatically
d = {"status": "active"}  # "status" is interned

# Lookups can use fast identity comparison
d["status"]  # The literal "status" is interned, identity match!

Trade-off Analysis:

Approach	Insertion Cost	Lookup Cost	Memory
No interning	O(n) hash	O(n) per bucket entry	Normal
Intern on insert	O(n) hash + intern	O(1) per bucket entry	Higher
Full interning	O(n) hash + intern	O(1) everywhere	Highest

Specialized String Maps

For performance-critical applications, specialized data structures exist:

Unicode Normalization and Equality

Unicode introduces another dimension to string equality: the same visual character can have multiple encodings.

The Problem:

The character 'é' can be represented two ways:

Precomposed: A single code point U+00E9 (é)
Decomposed: 'e' (U+0065) + combining acute accent (U+0301)

Both render identically but have different byte sequences:

Precomposed:  [C3 A9]        (2 bytes UTF-8)
Decomposed:   [65 CC 81]     (3 bytes UTF-8)

Standard equality comparison sees these as different:

a = "é"              # Precomposed (1 code point)
b = "e\u0301"        # Decomposed (2 code points)

a == b               # False!
len(a)               # 1
len(b)               # 2

# But they look identical when printed!
print(a)             # é
print(b)             # é

The Solution: Normalization

Unicode defines four normalization forms:

Form	Name	Description
NFC	Composed	Combines characters where possible
NFD	Decomposed	Separates base characters and combining marks
NFKC	Compatibility Composed	NFC + compatibility equivalents
NFKD	Compatibility Decomposed	NFD + compatibility equivalents
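To see how the four forms differ, the sketch below normalizes two sample characters and prints the resulting code points. The samples (é and the "ﬁ" ligature, U+FB01) and the `code_points` helper are chosen for illustration only.

```python
import unicodedata

samples = {"e-acute": "\u00e9", "fi-ligature": "\ufb01"}


def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for label, text in samples.items():
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        normalized = unicodedata.normalize(form, text)
        print(f"{label:12} {form:5} {code_points(normalized)}")
```

The canonical forms (NFC, NFD) leave the ligature untouched, while the compatibility forms (NFKC, NFKD) replace it with the two letters "f" and "i". That is why compatibility normalization can change what the text looks like and should be applied deliberately.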

Normalized Comparison:

import unicodedata

a = "é"           # Precomposed
b = "e\u0301"     # Decomposed

# Normalize both to same form before comparing
unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)  # True!

Best Practices:

Normalize at input boundaries: When receiving text, normalize to a consistent form (usually NFC)
Store normalized text: Database columns should contain normalized strings
Hash normalized strings: Hash codes should be computed on normalized forms
Document your normalization policy: Team members should know what form to expect

Normalization Gotchas

When Normalization Matters

•Usernames and identifiers — Two visually identical names should be the same account
•Search functionality — Query for 'café' should find 'café' regardless of encoding
•Hash table keys — Same-looking keys must have same hash
•Text comparison — Detecting duplicates in content
•Security contexts — Reducing spoofing with look-alike characters (normalization alone does not catch cross-script homographs such as Cyrillic 'а' versus Latin 'a')
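For identifiers such as usernames, one possible policy is to apply compatibility normalization and case folding before comparing. The `canonical_username` helper below is a hypothetical, simplified sketch of that idea, not a complete identifier policy.

```python
import unicodedata


def canonical_username(name: str) -> str:
    """Hypothetical helper: NFKC-normalize, case-fold, then normalize again."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    # Case folding can occasionally produce un-normalized text, so normalize once more.
    return unicodedata.normalize("NFKC", folded)


# Precomposed vs decomposed, and different case: treated as the same name.
print(canonical_username("Caf\u00e9") == canonical_username("CAFE\u0301"))  # True

# Cyrillic 'а' (U+0430) is a different letter, so it stays distinct from Latin 'a'.
print(canonical_username("\u0430dmin") == canonical_username("admin"))  # False
```

The second comparison shows the limit of this approach: look-alike letters from another script survive normalization, so catching them needs a separate confusable-character check.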

Common Bugs and Anti-Patterns

Confusion between equality and identity causes some of the most common string bugs:

Bug 1: Java == for String Comparison

// BAD: Works for literals, fails for runtime strings
if (userInput == "admin") {  // userInput is NOT interned!
    grantAdminAccess();
}

// GOOD: Always use equals()
if ("admin".equals(userInput)) {
    grantAdminAccess();
}

// Why "admin".equals(userInput) not userInput.equals("admin")?
// Avoids NullPointerException if userInput is null

Bug 2: Python 'is' Instead of '=='

# BAD: Works sometimes due to interning, fails unpredictably
if user_role is "admin":  # Don't use 'is' for string comparison!
    grant_access()

# GOOD: Always use ==
if user_role == "admin":
    grant_access()

Note: Since Python 3.8, the compiler emits a SyntaxWarning when is is used with a literal such as "admin".

Bug 3: Assuming String Immutability Means Identity

String a = "Hello";
String b = a;

a = a + "!";  // Creates NEW string, doesn't modify existing

a == b  // false: a now points to "Hello!", b still points to "Hello"

Concatenation never modifies the original string. It creates a new object, and the assignment rebinds a to it while b still refers to the old one.
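The same behavior can be observed in Python, where `is` plays the role of Java's `==` on references. A minimal sketch:

```python
a = "Hello"
b = a          # b refers to the same object as a
print(a is b)  # True

a = a + "!"    # builds a new string and rebinds a; b is untouched
print(a)       # Hello!
print(b)       # Hello
print(a is b)  # False
```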

Bug 4: Inconsistent Normalization in Hash Keys

# User 1 enters café (precomposed)
user_data["café"] = {...}

# User 2 searches for café (decomposed from different input method)
user_data["cafe\u0301"]  # KeyError! Different encoding

# Solution: Normalize keys
import unicodedata
def normalize_key(k):
    return unicodedata.normalize('NFC', k)

user_data[normalize_key("café")] = {...}
user_data[normalize_key("cafe\u0301")]  # Works!
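Calling a helper at every access is easy to forget. One way to make the policy harder to bypass is a mapping that normalizes keys itself. The `NFCDict` class below is a hypothetical sketch built on the standard library's `collections.UserDict`.

```python
import unicodedata
from collections import UserDict


class NFCDict(UserDict):
    """Hypothetical dict wrapper that NFC-normalizes string keys on every access."""

    @staticmethod
    def _norm(key):
        # Only strings are normalized; other key types pass through unchanged.
        return unicodedata.normalize("NFC", key) if isinstance(key, str) else key

    def __setitem__(self, key, value):
        self.data[self._norm(key)] = value

    def __getitem__(self, key):
        return self.data[self._norm(key)]

    def __delitem__(self, key):
        del self.data[self._norm(key)]

    def __contains__(self, key):
        return self._norm(key) in self.data


user_data = NFCDict()
user_data["caf\u00e9"] = {"visits": 1}   # precomposed key
print(user_data["cafe\u0301"])           # {'visits': 1}, found via the decomposed key
print(len(user_data))                    # 1
```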

Bug 5: Relying on Interning for Correctness

// BAD: Depends on interning behavior
if (config.getValue() == "enabled") {...}  // May fail!

// BAD: Correct only because literals are interned; obscure, and throws if the value is null
config.getValue().intern() == "enabled"  // Works but wasteful and fragile

// GOOD: Use proper equality
"enabled".equals(config.getValue())  // Always correct

Anti-Patterns

Avoid these:

Best Practices

Do these instead:

Summary: Equality vs Identity Mastered

Key Takeaways

•Identity checks if references point to the same object — Fast (O(1)) but only true for the same object in memory.
•Equality checks if strings have the same content — Slower (O(n)) but correct for comparing values.
•Identity implies equality, but not vice versa — Equal strings may exist as separate objects.
•String interning pools strings for identity-based optimization — All interned strings with equal content share identity.
•Language syntax varies — Java uses == for identity, .equals() for equality; Python uses is vs ==; JavaScript simplifies with value semantics.
•Using identity when equality is needed is a classic bug — Java's == on strings is one of the most common mistakes.
•Unicode normalization affects equality — Visually identical strings may have different encodings.
•Performance trade-offs are real — Identity is O(1), equality is O(n), but equality is almost always what you want for correctness.

Module Complete:

With this page, we've completed our exploration of String Comparison, Ordering & Equality. You now understand:

Lexicographical ordering: How strings are compared character by character
Case sensitivity: Why 'A' ≠ 'a' and how to handle it correctly
Locale vs binary comparison: When cultural order matters and when it doesn't
Equality vs identity: The fundamental distinction that causes countless bugs

These concepts form the foundation for avoiding logical bugs in string problems—the stated outcome of this module.
