Consider this puzzle that has confused countless developers:
String a = "Hello";
String b = "Hello";
String c = new String("Hello");
a == b // true (in Java)
a == c // false (in Java!)
a.equals(c) // true
How can two strings that are clearly identical—both containing the characters H-e-l-l-o—not be == to each other?
The answer unveils one of the most fundamental distinctions in programming: equality vs identity.
This distinction is crucial for correctness (using == when you should use equals() is a classic bug) and for performance (identity comparison is faster than equality comparison).
By the end of this page, you will understand the deep difference between equality and identity for strings—why this distinction exists, how different programming languages handle it, the performance implications, and the role of string interning in bridging these concepts. You'll be equipped to avoid one of the most common bug patterns in string programming.
Let's establish precise definitions:
Identity (Reference Equality):
Two references have the same identity if they point to the exact same object in memory. They are not just similar—they are literally the same thing, occupying the same memory address.
Variable a ──────────┐
                     ├──► [Memory: 0x1A2B3C] "Hello"
Variable b ──────────┘
a and b have the same identity (they point to the same object)
Equality (Value Equality):
Two strings are equal if they contain the same sequence of characters in the same order. They may or may not be the same object in memory.
Variable a ────► [Memory: 0x1A2B3C] "Hello"
Variable c ────► [Memory: 0x4D5E6F] "Hello" (different object!)
a and c are equal (same content) but do not have the same identity
| Aspect | Identity | Equality |
|---|---|---|
| What is compared | Memory addresses | String contents |
| Must be same object? | Yes | No |
| Time complexity | O(1) | O(n) worst case |
| Relationship | Identity implies equality | Equality does not imply identity |
| Operator (Java) | == | .equals() |
| Operator (JavaScript) | N/A (strings are primitives) | ===* |
| Operator (Python) | is | == |
| Operator (C#) | ReferenceEquals() | .Equals() or == |
*JavaScript note: JavaScript strings are primitives, so === compares their values, and the language exposes no identity check for primitive strings. (String wrapper objects created with new String() are objects and are compared by reference.)
The Fundamental Relationship:
If two references have the same identity (same object), they are necessarily equal (same content). A single object cannot contain different values simultaneously.
But if two references are equal (same content), they may or may not have the same identity. There could be two separate objects in memory that happen to contain identical data.
Identity → Equality (always true)
Equality ⇏ Identity (not always true)
This asymmetry is the source of many bugs.
Think of identical twins. They may look exactly the same (equal) but they are not the same person (identity). If you send a letter to one twin, it won't automatically reach the other. Similarly, modifying one object (if it were mutable) wouldn't affect an equal-but-not-identical object.
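The asymmetry can be observed directly. The following is a minimal Python sketch, where `describe` is a hypothetical helper introduced only for this illustration. It assumes nothing about whether the runtime happens to share the runtime-built string, so only the guaranteed facts are relied on: a second reference to the same object is both identical and equal, and a separately built string with the same characters is equal.

```python
def describe(x: str, y: str) -> dict:
    """Report identity and equality for two string references."""
    return {"identical": x is y, "equal": x == y}


a = "Hello"
b = a                          # second reference to the very same object
c = "".join(["Hel", "lo"])     # built at runtime; same characters as a

same_object = describe(a, b)   # identical, therefore also equal
runtime_copy = describe(a, c)  # equal; identity depends on the implementation

print(same_object)
print(runtime_copy)
```

Whatever `runtime_copy["identical"]` turns out to be on a given interpreter, `runtime_copy["equal"]` is always true, which is why equality is the comparison to rely on.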
Different programming languages have different approaches to equality and identity:
Java:
String a = "Hello";
String b = "Hello";
String c = new String("Hello");
// Identity check
a == b // true (string pool optimization)
a == c // false (c is a new object)
// Equality check
a.equals(b) // true
a.equals(c) // true
// Interning to force identity
c = c.intern();
a == c // true (now c points to pooled string)
Key point: In Java, == checks identity, .equals() checks equality. Using == for string comparison is almost always a bug.
Python:
a = "Hello"
b = "Hello"
c = "".join(["H", "e", "l", "l", "o"]) # Creates new string
# Identity check
a is b # True in CPython (literals are interned; an implementation detail)
a is c # May be True or False depending on implementation
# Equality check
a == b # True
a == c # True (always)
# Best practice: Always use == for string comparison
JavaScript:
let a = "Hello";
let b = "Hello";
let c = "Hel" + "lo"; // Runtime concatenation
// JavaScript strings are primitives, not objects
a === b // true (value equality for primitives)
a === c // true (value equality)
// JavaScript doesn't expose identity semantics for strings
// All comparisons are effectively value comparisons
C#:
string a = "Hello";
string b = "Hello";
string c = new string(new char[] {'H','e','l','l','o'});
// Overloaded == performs equality check!
a == b // true (equality, not identity)
a == c // true (equality)
// Explicit identity check
Object.ReferenceEquals(a, b) // true (interning)
Object.ReferenceEquals(a, c) // false
// Equality
a.Equals(b) // true
a.Equals(c) // true
Key point: C# overloads == for strings to perform equality, not identity—making it safer than Java's approach.
| Language | Equality Operator | Identity Operator | Safety |
|---|---|---|---|
| Java | .equals() | == | Easy to use == by mistake |
| Python | == | is | Generally safe, 'is' rarely needed |
| JavaScript | === | N/A | Primitives use value semantics |
| C# | == or .Equals() | ReferenceEquals() | Safe, == is overloaded |
| C++ | == (std::string compares contents) | pointer == (char* and addresses) | Safe for std::string; == on char* compares pointers |
| Go | == | N/A for strings | Strings have value semantics |
| Rust | == | ptr::eq() | Clear semantics |
In Java, using == to compare strings is one of the most common bugs:
if (userInput == "expected") // BUG! May fail even if content matches
if (userInput.equals("expected")) // CORRECT
This bug is especially insidious because == sometimes works (when strings are interned) but fails unpredictably with runtime-created strings.
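String interning is an optimization technique where the runtime maintains a pool of unique string values. When a string is interned, the runtime checks if an equal string already exists in the pool:
• If it does, the reference to the existing pooled string is returned
• If it does not, the string is added to the pool and its reference is returned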
The result: All interned strings with equal content share the same identity.
How Interning Works:
Pool: { }
Step 1: Intern "Hello"
- Pool empty, add "Hello"
- Pool: { 0x100: "Hello" }
- Return reference: 0x100
Step 2: Intern "Hello" again
- Found "Hello" in pool at 0x100
- Pool: { 0x100: "Hello" } (unchanged)
- Return existing reference: 0x100
Step 3: Intern "World"
- Not in pool, add "World"
- Pool: { 0x100: "Hello", 0x200: "World" }
- Return reference: 0x200
After interning, strings can be compared by identity (fast) instead of equality (slower).
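The three steps above can be sketched as a small pool. This is an illustrative Python model, not how any particular runtime implements its string pool; `InternPool` is a hypothetical name, and a plain dictionary stands in for the runtime's internal table.

```python
class InternPool:
    """Toy intern pool: maps each distinct string value to one canonical object."""

    def __init__(self) -> None:
        self._pool = {}

    def intern(self, s: str) -> str:
        # setdefault returns the stored string if an equal key already exists;
        # otherwise it stores s under itself and returns s.
        return self._pool.setdefault(s, s)

    def __len__(self) -> int:
        return len(self._pool)


pool = InternPool()
first = pool.intern("".join(["Hel", "lo"]))   # step 1: not in pool, added
second = pool.intern("".join(["Hell", "o"]))  # step 2: equal entry found, reused
third = pool.intern("World")                  # step 3: new entry added

print(first is second, len(pool))
```

After interning, `first` and `second` refer to the same object even though they were built separately, so an identity check between pooled strings gives the same answer as an equality check.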
Automatic Interning:
Many runtimes, including the JVM and .NET, automatically intern string literals (strings that appear directly in source code):
// Java: Literals are automatically interned
String a = "Hello"; // Goes to string pool
String b = "Hello"; // Reuses same pool entry
a == b // true (same object from pool)
// Runtime-created strings are NOT automatically interned
String c = new String("Hello"); // New object, not in pool
a == c // false (different objects)
String d = scanner.nextLine(); // Input "Hello" from user
a == d // false (d not interned)
Manual Interning:
You can explicitly intern strings to force pooling:
// Java
String c = new String("Hello").intern();
a == c // true (now c references pooled string)
// Python
import sys
c = sys.intern("Hello") # Force interning
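For Python, a slightly fuller sketch of manual interning follows. `sys.intern` is the real standard-library call; `canonical` and the sample key are hypothetical names used only here. The strings are built at runtime so the example does not depend on literal interning.

```python
import sys


def canonical(s: str) -> str:
    """Return the interpreter's interned object for this string value."""
    return sys.intern(s)


x = canonical("".join(["use", "r_id"]))
y = canonical("".join(["user", "_id"]))

interned_match = x is y   # both names now refer to the single interned object
print(interned_match, x == y)
```

Even here, `x == y` remains the comparison that expresses intent; the identity match is a side effect of interning both values.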
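Why Not Intern Everything?
Interning has costs of its own: every intern call must hash the string and look it up in the pool, which is O(n) work, and pooled strings occupy memory for as long as the pool holds them.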
Never rely on interning for correctness. Always compare strings with the equality method (.equals(), ==, whatever your language uses for value comparison). Interning should only be a performance optimization—and only when profiling shows it matters.
The choice between identity and equality comparison has significant performance implications:
Identity Comparison (O(1)):
Comparing memory addresses is a single CPU operation:
a == b → Compare 0x1A2B3C with 0x1A2B3C → true
Cost: 1 operation regardless of string length
A 10-character string and a 10-million-character string take the same time for identity comparison.
Equality Comparison (O(n)):
Comparing string content requires examining each character:
a.equals(b) → Compare 'H' with 'H' → equal, continue
            → Compare 'e' with 'e' → equal, continue
            → Compare 'l' with 'l' → equal, continue
            → Compare 'l' with 'l' → equal, continue
            → Compare 'o' with 'o' → equal, done
            → true
Cost: n operations for strings of length n
Best case: Different strings with different first characters → O(1)
Worst case: Equal strings or strings differing only at the end → O(n)
| Scenario | Identity (==) | Equality (.equals()) |
|---|---|---|
| Same object | O(1) - true | O(1) - true (identity shortcut) |
| Different objects, same content | O(1) - false | O(n) - true |
| Different objects, different content | O(1) - false | O(1) to O(n) - false |
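| Strings of length 1M, equal, different objects | ~1 nanosecond (returns false) | Orders of magnitude slower; roughly microseconds to a millisecond depending on hardware |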
| Helpful early-out | No | Yes - different lengths |
Optimizations in Equality Comparison:
Most equality implementations include optimizations:
// Typical equals() implementation structure
public boolean equals(Object other) {
    if (this == other) return true; // Identity shortcut: O(1)
    if (other == null) return false; // Null check: O(1)
    if (!(other instanceof String)) return false;
    String s = (String) other;
    if (this.length() != s.length()) return false; // Length check: O(1)
    // Character comparison: O(n)
    for (int i = 0; i < this.length(); i++) {
        if (this.charAt(i) != s.charAt(i)) return false;
    }
    return true;
}
Why the identity check first?
Because interned strings (from literals and manual interning) are extremely common. The O(1) identity check catches these cases immediately.
Why the length check?
Because strings of different lengths cannot be equal. Checking length is O(1) (stored as metadata) and eliminates many comparisons immediately.
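To make the effect of these shortcuts visible, here is a hedged Python sketch that mirrors the same structure and counts character comparisons. `string_equals` is a hypothetical function written for this illustration; real string equality in Python is the built-in `==`, which is implemented natively.

```python
def string_equals(a: str, b: str):
    """Return (equal?, number of character comparisons performed)."""
    if a is b:                 # identity shortcut: no characters examined
        return True, 0
    if len(a) != len(b):       # length check: no characters examined
        return False, 0
    comparisons = 0
    for ch_a, ch_b in zip(a, b):
        comparisons += 1
        if ch_a != ch_b:       # early exit at the first mismatch
            return False, comparisons
    return True, comparisons


big = "x" * 1000
same_object = string_equals(big, big)              # identity shortcut
different_length = string_equals("Hello", "Hello!")  # length shortcut
early_mismatch = string_equals("Hello", "Jello")   # differs at first character
late_mismatch = string_equals("Hello", "Hellp")    # differs at last character

print(same_object, different_length, early_mismatch, late_mismatch)
```

Only the last case pays the full O(n) cost; the first three return after constant work.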
Because strings are immutable, their hash codes can be cached. Java's String class caches the hash code after the first computation, so repeated hashCode() calls on the same string are O(1) after the first one. Combined with interning, this enables very fast string-based hash tables.
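The caching idea can be sketched in Python as follows. `CachedHashKey` is a hypothetical wrapper, and the 31-based polynomial is written in the style of Java's String hash purely for illustration; it is an assumption of this sketch, not a description of how Python hashes strings.

```python
class CachedHashKey:
    """Immutable string wrapper that computes its hash once and reuses it."""

    __slots__ = ("_value", "_hash", "hash_computations")

    def __init__(self, value: str) -> None:
        self._value = value
        self._hash = None            # not computed yet
        self.hash_computations = 0   # instrumentation for this demo only

    def _polynomial_hash(self) -> int:
        h = 0
        for ch in self._value:       # O(n) pass over the characters
            h = (31 * h + ord(ch)) & 0xFFFFFFFF
        return h

    def __hash__(self) -> int:
        if self._hash is None:       # only the first call does the O(n) work
            self.hash_computations += 1
            self._hash = self._polynomial_hash()
        return self._hash

    def __eq__(self, other: object) -> bool:
        if self is other:
            return True
        return isinstance(other, CachedHashKey) and self._value == other._value


key = CachedHashKey("Hello")
table = {key: "greeting"}
for _ in range(3):
    table[key]                       # each lookup calls hash(key)

print(key.hash_computations)
```

Caching is safe only because the wrapped value never changes; a mutable key would need to invalidate the cached hash on every modification.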
Hash tables (HashMap, Dict, Map) rely heavily on the equality/identity distinction:
How Hash Table Lookup Works:
map.get("Hello")
Step 1: Compute hash code → hashCode("Hello") → 42 (example)
Step 2: Find bucket → bucket[42 % bucketCount]
Step 3: Search bucket for matching key
For each entry in bucket:
If entry.key.equals("Hello") → found!
The Critical Question: How do we check if entry.key matches "Hello"?
Using Equality (Standard Approach):
entry.key.equals("Hello")
This works correctly but costs O(n) per comparison.
Using Identity (After Interning):
entry.key == "Hello" // Only safe if both are interned
This is O(1) per comparison—but only safe if all keys are interned.
Real-World Optimization Pattern:
// Intern keys at insertion time
map.put(key.intern(), value);
// Intern lookup keys to enable identity comparison
// (Only if map implementation uses identity)
map.get(lookupKey.intern());
Python's Approach:
In CPython, dictionary lookups check key identity before falling back to ==, so interned keys take the fast path:
# CPython typically interns identifier-like string literals (an implementation detail)
d = {"status": "active"} # "status" is interned
# The lookup key is the same interned object, so the identity check succeeds
d["status"]
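The lookup steps described earlier, combined with an identity-first key check, can be modelled with a toy map. This is a sketch under stated assumptions: `ToyHashMap` and its counters are hypothetical, the bucket count is fixed, and no resizing is attempted. It is meant to show where identity and equality checks happen, not to replace `dict`.

```python
import sys


class ToyHashMap:
    """Minimal separate-chaining map that tries identity before equality."""

    def __init__(self, bucket_count: int = 8) -> None:
        self._buckets = [[] for _ in range(bucket_count)]
        self.identity_hits = 0      # matches resolved by `is`
        self.equality_checks = 0    # fallbacks to `==`

    def _bucket(self, key: str) -> list:
        return self._buckets[hash(key) % len(self._buckets)]

    def _matches(self, stored: str, key: str) -> bool:
        if stored is key:           # O(1) identity check
            self.identity_hits += 1
            return True
        self.equality_checks += 1   # O(n) content comparison
        return stored == key

    def put(self, key: str, value) -> None:
        bucket = self._bucket(key)
        for i, (stored, _) in enumerate(bucket):
            if self._matches(stored, key):
                bucket[i] = (stored, value)
                return
        bucket.append((key, value))

    def get(self, key: str, default=None):
        for stored, value in self._bucket(key):
            if self._matches(stored, key):
                return value
        return default


m = ToyHashMap()
k = sys.intern("status")
m.put(k, "active")

via_same_object = m.get(k)                         # resolved by identity
via_equal_copy = m.get("".join(["sta", "tus"]))    # correct even if not identical
print(via_same_object, via_equal_copy, m.identity_hits, m.equality_checks)
```

Because the equality fallback is always present, the map stays correct for non-interned keys; interning only changes how often the fast path is taken.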
Trade-off Analysis:
| Approach | Insertion Cost | Lookup Cost | Memory |
|---|---|---|---|
| No interning | O(n) hash | O(n) per bucket entry | Normal |
| Intern on insert | O(n) hash + intern | O(1) per bucket entry | Higher |
| Full interning | O(n) hash + intern | O(1) everywhere | Highest |
For performance-critical applications, specialized data structures exist:
• Trie: Implicit deduplication of prefixes
• Symbol tables: Intern-based in compilers and interpreters
• Weak-reference intern pools: Balance interning with garbage collection
• Perfect hashing: For fixed key sets
Unicode introduces another dimension to string equality: the same visual character can have multiple encodings.
The Problem:
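The character 'é' can be represented two ways:
• Precomposed: a single code point, U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
• Decomposed: two code points, U+0065 (e) followed by U+0301 (COMBINING ACUTE ACCENT)
Both render identically but have different byte sequences: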
Precomposed: [C3 A9] (2 bytes UTF-8)
Decomposed: [65 CC 81] (3 bytes UTF-8)
Standard equality comparison sees these as different:
a = "é" # Precomposed (1 code point)
b = "e\u0301" # Decomposed (2 code points)
a == b # False!
len(a) # 1
len(b) # 2
# But they look identical when printed!
print(a) # é
print(b) # é
The Solution: Normalization
Unicode defines four normalization forms:
| Form | Name | Description |
|---|---|---|
| NFC | Composed | Combines characters where possible |
| NFD | Decomposed | Separates base characters and combining marks |
| NFKC | Compatibility Composed | NFC + compatibility equivalents |
| NFKD | Compatibility Decomposed | NFD + compatibility equivalents |
Normalized Comparison:
import unicodedata
a = "é" # Precomposed
b = "e\u0301" # Decomposed
# Normalize both to same form before comparing
unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b) # True!
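A reusable form of this comparison might look like the sketch below. `nfc_equal` is a hypothetical helper name; `unicodedata.normalize` is the standard-library function already used above. Escape sequences are used instead of literal accented characters so the example does not depend on how the source file itself was normalized.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00e9"     # single code point
decomposed = "e\u0301"     # base letter plus combining acute accent

raw_equal = precomposed == decomposed                  # code-point comparison
normalized_equal = nfc_equal(precomposed, decomposed)  # canonical comparison

# Normalization can change length, and compatibility forms can change content.
nfd_length = len(unicodedata.normalize("NFD", precomposed))
telephone = unicodedata.normalize("NFKC", "\u2121")    # TELEPHONE SIGN

print(raw_equal, normalized_equal, nfd_length, telephone)
```

NFC is used here because it preserves canonical equivalence only; the compatibility forms (NFKC/NFKD) fold together characters that some applications need to keep distinct.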
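Best Practices:
• Normalize strings at system boundaries (input, storage, hash keys)
• Apply the same normalization form to both sides before comparing
Pitfalls to watch for:
• Normalization can change string length
• Some combining sequences have no precomposed form, so NFC output can still contain combining marks
• NFKC/NFKD can change meaning (℡ → TEL)
• OS-level APIs may return non-normalized strings
• Copy-paste from different sources may introduce mixed normalization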
Confusion between equality and identity causes some of the most common string bugs:
Bug 1: Java == for String Comparison
// BAD: Works for literals, fails for runtime strings
if (userInput == "admin") { // userInput is NOT interned!
grantAdminAccess();
}
// GOOD: Always use equals()
if ("admin".equals(userInput)) {
grantAdminAccess();
}
// Why "admin".equals(userInput) not userInput.equals("admin")?
// Avoids NullPointerException if userInput is null
Bug 2: Python 'is' Instead of '=='
# BAD: Works sometimes due to interning, fails unpredictably
if user_role is "admin": # Don't use 'is' for string comparison!
    grant_access()
# GOOD: Always use ==
if user_role == "admin":
    grant_access()
Note: Since Python 3.8, CPython even warns about this with a SyntaxWarning ("is" with a literal).
Bug 3: Assuming String Immutability Means Identity
String a = "Hello";
String b = a;
a = a + "!"; // Creates NEW string, doesn't modify existing
a == b // false: a now points to "Hello!", b still points to "Hello"
Concatenation doesn't modify the original string; it creates a new one. The assignment then rebinds a to the new string, while b keeps referring to the original.
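The same behaviour can be checked in Python, whose strings are also immutable. This is a small sketch; `original` is an extra name introduced here only to keep a handle on the first object.

```python
a = "Hello"
b = a             # b refers to the same object as a
original = a      # extra handle on the original object, for the checks below

a = a + "!"       # builds a new string and rebinds a to it

print(a)          # Hello!
print(b)          # Hello
print(b is original, a is b)
```

The variable `b` never changes because nothing ever modified the object it refers to; only the name `a` was pointed somewhere else.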
Bug 4: Inconsistent Normalization in Hash Keys
# User 1 enters café (precomposed)
user_data["café"] = {...}
# User 2 searches for café (decomposed from different input method)
user_data["cafe\u0301"] # KeyError! Different encoding
# Solution: Normalize keys
import unicodedata
def normalize_key(k):
    return unicodedata.normalize('NFC', k)
user_data[normalize_key("café")] = {...}
user_data[normalize_key("cafe\u0301")] # Works!
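Calling a helper at every access is easy to forget, so one option is to move the normalization into the container. The following is a minimal sketch, assuming NFC is the chosen form; `NormalizedKeyDict` is a hypothetical class, and it covers only item assignment, item lookup, `in`, and `get`. Other entry points such as the constructor arguments and `update()` bypass these overrides in a `dict` subclass and would need the same treatment.

```python
import unicodedata


class NormalizedKeyDict(dict):
    """dict that NFC-normalizes string keys on the access paths overridden here."""

    @staticmethod
    def _norm(key):
        return unicodedata.normalize("NFC", key) if isinstance(key, str) else key

    def __setitem__(self, key, value):
        super().__setitem__(self._norm(key), value)

    def __getitem__(self, key):
        return super().__getitem__(self._norm(key))

    def __contains__(self, key):
        return super().__contains__(self._norm(key))

    def get(self, key, default=None):
        return super().get(self._norm(key), default)


user_data = NormalizedKeyDict()
user_data["caf\u00e9"] = {"visits": 1}      # precomposed input
found = user_data["cafe\u0301"]             # decomposed lookup hits the same entry
user_data["cafe\u0301"] = {"visits": 2}     # overwrites rather than duplicating

print(found, len(user_data))
```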
Bug 5: Relying on Interning for Correctness
// BAD: Depends on interning behavior
if (config.getValue() == "enabled") {...} // May fail!
// FRAGILE: correct only because literals are interned; adds a pool lookup and throws if the value is null
config.getValue().intern() == "enabled" // Works but wasteful
// GOOD: Use proper equality
"enabled".equals(config.getValue()) // Always correct
Avoid these:
• Using == for string equality in Java
• Using is for string comparison in Python
• Assuming identical strings have identity
• Relying on interning for correctness
• Mixing normalized and non-normalized strings
Do these instead:
• Use .equals() in Java, == in Python
• Put literals first: "x".equals(var)
• Normalize strings at system boundaries
• Use interning only as optimization
• Document comparison semantics
We've explored the fundamental distinction between equality and identity—two concepts that appear similar but have profound differences in semantics and performance. Let's consolidate the essential knowledge:
• Java uses == for identity and .equals() for equality; Python uses is vs ==; JavaScript simplifies with value semantics.
• Using == on strings in Java is one of the most common mistakes.
Module Complete:
With this page, we've completed our exploration of String Comparison, Ordering & Equality. You now understand:
• 'A' ≠ 'a' and how to handle it correctly
These concepts form the foundation for avoiding logical bugs in string problems, the stated outcome of this module.
You now have a comprehensive understanding of string comparison at all levels—from the byte level to the cultural level. You can make informed decisions about comparison approaches, avoid common pitfalls, and reason precisely about string operations. This knowledge will serve you in debugging, optimization, and designing correct algorithms for string problems.