How Strings Are Represented (Conceptual View)

Level: Beginner

Duration: 45 mins

Physical Storage Intuition

From Abstract Sequence to Physical Reality

In the previous page, we defined a string as an ordered sequence of characters—a clean, mathematical abstraction. But abstractions don't compute. Every string you've ever worked with exists somewhere physically: in RAM chips, on disk platters, or traversing network cables as electrical signals.

This page bridges the gap between logical model and physical reality. We'll develop intuition for how strings are stored without diving into data structure terminology. Think of this as understanding how the numbered boxes from our mental model translate into actual locations in a computer's memory.

This understanding is crucial. Many string behaviors that seem mysterious—why concatenation can be slow, why indexing is fast, why strings have size limits—become obvious once you understand the physical layer.

What You Will Learn

By the end of this page, you'll understand how the 'boxes' in our logical model correspond to memory locations, why characters are stored contiguously, how computers keep track of string length, and the fundamental relationship between a string's logical structure and its physical footprint.

Computer Memory: A Giant Row of Numbered Boxes

Computer memory can be understood as a vast, linear sequence of storage locations—think of it as billions of tiny boxes arranged in a single row, each with a unique number called an address.

Memory:
┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│  ...   │  1000  │  1001  │  1002  │  1003  │  1004  │  1005  │  1006  │  ...   │
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
           ↑                                                        ↑
         (address)                                              (address)

Each box (memory location) has two critical properties:

A unique address: An integer that identifies this location. Like a street address, it tells you exactly where to find this particular box.
Storage capacity: Each box can hold a fixed amount of data—typically one byte (8 bits). This box size is determined by the hardware.

What Is a Byte?

A byte is a unit of information consisting of 8 bits (binary digits). It can represent 256 different values (2⁸ = 256), which is enough for basic characters, small numbers, or part of a larger value. For now, think of a byte as the fundamental 'box size' in memory.

Key insight: Memory is linear and addressable.

This means:

Every location has a numeric address you can compute
You can jump directly to any address (this is called random access)
Locations next to each other have consecutive addresses

This linear, addressable structure is why the 'row of boxes' mental model is so powerful—it directly mirrors how computers actually work at the hardware level.
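As a rough sketch of this model, a Python bytearray can stand in for a small region of memory. The name memory and the size 2048 are illustrative assumptions, and list-style indexes play the role of addresses:

```python
# Model a small region of memory as a row of one-byte boxes.
# Assumption: indexes into the bytearray stand in for addresses.
memory = bytearray(2048)      # 2048 boxes, each initially holding 0

memory[1000] = 72             # write one byte into the box at "address" 1000
memory[1001] = 69             # the neighbouring box has the next address

# Random access: jump straight to an address without visiting the others.
value = memory[1000]
print(value)                  # 72
```

Each box holds a value from 0 to 255, which matches the one-byte capacity described above.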

Storing a String: Characters in Consecutive Boxes

When a string is stored in memory, its characters are placed in consecutive memory locations—one after another, with no gaps.

Let's trace how the string "HELLO" would be stored:

Logical View (what we think about):
Position:   0      1      2      3      4
          ┌──────┬──────┬──────┬──────┬──────┐
          │  H   │  E   │  L   │  L   │  O   │
          └──────┴──────┴──────┴──────┴──────┘

Physical View (what actually happens in memory):
Address:   1000   1001   1002   1003   1004
          ┌──────┬──────┬──────┬──────┬──────┐
          │  H   │  E   │  L   │  L   │  O   │
          └──────┴──────┴──────┴──────┴──────┘
              ↑
          Start address (base address)

Notice the beautiful correspondence:

Position 0 (logical) → Address 1000 (physical)
Position 1 (logical) → Address 1001 (physical)
Position k (logical) → Address 1000 + k (physical)

This is no coincidence—it's by design. The consecutive storage means we can compute any character's location with simple arithmetic.

The Power of Contiguous Storage

Because characters are stored consecutively, finding the k-th character is instantaneous: just calculate base_address + k and look there. This 'random access' capability is why string indexing is fast—you never need to scan through preceding characters.
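The base-plus-offset idea can be sketched in a few lines, again assuming a bytearray plays the role of memory. The names BASE and char_at are hypothetical helpers for illustration:

```python
# Simulated memory with "HELLO" stored contiguously from address 1000.
memory = bytearray(2048)
BASE = 1000
text = b"HELLO"
memory[BASE:BASE + len(text)] = text   # consecutive boxes, no gaps

def char_at(memory, base, k):
    """Return the character at logical position k using base + k."""
    return chr(memory[base + k])

print(char_at(memory, BASE, 0))   # H
print(char_at(memory, BASE, 2))   # L
```

Note that char_at never loops: the cost is the same for position 2 as it would be for position 2,000.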

Why contiguous? Why not scatter characters around memory?

Contiguous storage has profound advantages:

Instant access by position: Calculate address = base + k. Jump directly there. Done.
Efficient traversal: To visit all characters, just walk through consecutive addresses. Hardware is optimized for this pattern—it's called spatial locality.
Simpler implementation: One number (the base address) plus one number (the length) fully describes where the string lives.
Cache efficiency: Modern CPUs load chunks of memory at once. Contiguous data means loading one chunk gets multiple useful characters.

The cost? If you need to insert a character in the middle, you might need to move everything after it to make room. But for most string operations, contiguous storage wins.
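To make the insertion cost concrete, here is a small sketch that inserts a byte into a buffer with spare room and counts how many existing bytes had to move. The function name insert_with_shift is an assumption for illustration:

```python
def insert_with_shift(buf, used, pos, byte):
    """Insert `byte` at `pos` in a buffer whose first `used` bytes are in use.

    Assumes the buffer has at least one spare byte after the used part.
    Returns how many existing bytes were moved to make room.
    """
    moved = 0
    for i in range(used, pos, -1):   # shift right, starting from the end
        buf[i] = buf[i - 1]
        moved += 1
    buf[pos] = byte
    return moved

buf = bytearray(b"HELO\x00\x00")     # 4 bytes in use, 2 spare
moved = insert_with_shift(buf, 4, 2, ord("L"))
print(bytes(buf[:5]), moved)         # b'HELLO' 2
```

Inserting near the front of a long string moves almost every character, which is where the O(n) cost of mid-string insertion comes from.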

How Characters Become Numbers

Memory stores numbers, not pictures of letters. So how does the character 'H' get stored in a memory box?

The answer is character encoding: a standardized mapping between characters and numbers.

The most foundational encoding is ASCII (American Standard Code for Information Interchange), which assigns each of its 128 characters a number from 0 to 127:

Character	ASCII Value	Binary
'A'	65	01000001
'B'	66	01000010
'H'	72	01001000
'a'	97	01100001
'0' (digit)	48	00110000
' ' (space)	32	00100000
'\n' (newline)	10	00001010

When you store "HELLO", the computer actually stores the sequence of numbers: 72, 69, 76, 76, 79.

What actually happens in memory:

"HELLO" stored starting at address 1000:

Address:  1000   1001   1002   1003   1004
          ┌──────┬──────┬──────┬──────┬──────┐
Logical:  │  H   │  E   │  L   │  L   │  O   │
          ├──────┼──────┼──────┼──────┼──────┤
Actual:   │  72  │  69  │  76  │  76  │  79  │
          └──────┴──────┴──────┴──────┴──────┘

The characters 'H', 'E', 'L', 'L', 'O' are a convenient fiction for humans. What the memory actually contains are numbers that, by convention, represent those characters.
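You can check this mapping yourself. The sketch below uses Python's built-in ord and str.encode; the variable names are illustrative:

```python
# Characters to numbers: ord() gives the numeric code of one character.
codes = [ord(ch) for ch in "HELLO"]
print(codes)                      # [72, 69, 76, 76, 79]

# Encoding the whole string to ASCII yields the same numbers as raw bytes.
raw = "HELLO".encode("ascii")
print(list(raw))                  # [72, 69, 76, 76, 79]

# Numbers back to characters: decoding reverses the mapping.
restored = bytes(codes).decode("ascii")
print(restored)                   # HELLO
```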

This encoding is usually invisible to you as a programmer: the language converts between characters and numbers automatically. But understanding it explains several phenomena:

Why the same text can be stored as different bytes, and the same bytes can display as different text (different encodings)
Why comparing characters works (you're really comparing numbers)
Why case conversion is simple ('a' = 'A' + 32 in ASCII)
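Two of these points can be shown directly. The helper to_upper_ascii below is a hypothetical, simplified function that handles only the 26 ASCII lowercase letters:

```python
# Comparing characters compares their numeric codes.
print("A" < "B")                 # True, because 65 < 66
print(ord("a") - ord("A"))       # 32

def to_upper_ascii(ch):
    """Uppercase a single ASCII letter by subtracting 32 from its code."""
    if "a" <= ch <= "z":
        return chr(ord(ch) - 32)
    return ch                    # anything else is left unchanged

print(to_upper_ascii("h"))       # H
```

Real-world case conversion for non-ASCII text is more involved; this shortcut only holds inside the ASCII range.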

Beyond ASCII: Unicode

ASCII only covers 128 characters—fine for English, but hopeless for 中文, العربية, or 😀. Unicode extends this to over 140,000 characters. We'll discuss Unicode in detail later; for now, understand that each character has a numeric code, and the encoding determines how that code maps to bytes.
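As a brief illustration of "one code, several possible byte layouts", the sketch below encodes the single character 中 (written as an escape so the example does not depend on file encoding) in three standard encodings:

```python
ch = "\u4e2d"                        # the character 中
code_point = ord(ch)
print(hex(code_point))               # 0x4e2d

# The same numeric code maps to different bytes under different encodings.
as_utf8 = ch.encode("utf-8")         # 3 bytes
as_utf16 = ch.encode("utf-16-le")    # 2 bytes
as_utf32 = ch.encode("utf-32-le")    # 4 bytes
print(len(as_utf8), len(as_utf16), len(as_utf32))   # 3 2 4
```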

Tracking Length: How the Computer Knows Where Strings End

When you store "HELLO" in memory, how does the computer know the string is 5 characters long and not 50? How does it know when to stop reading?

This is the boundary problem: in a vast sea of memory locations, how do we mark where one string ends and the next begins?

Historically, two approaches emerged:

Approach 1: Sentinel Termination (Null Terminator)

Place a special marker (typically the number 0, called the null character or \0) after the last character:

"HELLO" with null terminator:
Address:  1000   1001   1002   1003   1004   1005
          ┌──────┬──────┬──────┬──────┬──────┬──────┐
          │  H   │  E   │  L   │  L   │  O   │  \0  │
          │ (72) │ (69) │ (76) │ (76) │ (79) │ (0)  │
          └──────┴──────┴──────┴──────┴──────┴──────┘
                                               ↑
                                         Null terminator
                                         (string ends here)

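To find the length, scan forward until you hit 0. This approach is used by C, and by C-style strings (character arrays and string literals) in C++; C++'s std::string also keeps an explicit length.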
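The scan can be sketched as follows, with a bytearray standing in for memory. The function name c_strlen is an assumption, chosen to echo C's strlen:

```python
def c_strlen(memory, base):
    """Count bytes from `base` until the first 0 byte (the terminator)."""
    length = 0
    while memory[base + length] != 0:
        length += 1
    return length

memory = bytearray(2048)          # all zeros to begin with
memory[1000:1005] = b"HELLO"      # address 1005 still holds 0: the terminator

print(c_strlen(memory, 1000))     # 5
```

The loop body runs once per character, so the work grows with the length of the string.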

Approach 2: Explicit Length Storage

Store the length as a number alongside (typically before) the character data:

"HELLO" with explicit length:
Address:  996-999     1000   1001   1002   1003   1004
          ┌─────────┬──────┬──────┬──────┬──────┬──────┐
          │  5      │  H   │  E   │  L   │  L   │  O   │
          │ (length)│ (72) │ (69) │ (76) │ (76) │ (79) │
          └─────────┴──────┴──────┴──────┴──────┴──────┘
               ↑
         Length stored
         before characters

Now finding the length is instant—just read that stored number. This approach is used by most modern languages (Java, Python, JavaScript, etc.).
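A minimal sketch of the length-prefixed layout from the diagram, assuming a 4-byte little-endian length stored at addresses 996-999. The helper names are hypothetical; struct is Python's standard module for packing numbers into bytes:

```python
import struct

memory = bytearray(2048)
BASE = 1000
text = b"HELLO"

# Store the length in the 4 bytes just before the characters.
struct.pack_into("<I", memory, BASE - 4, len(text))
memory[BASE:BASE + len(text)] = text

def stored_length(memory, base):
    """Read the length field: one fixed-size read, no scanning."""
    (length,) = struct.unpack_from("<I", memory, base - 4)
    return length

def read_string(memory, base):
    """Use the stored length to know exactly where the string ends."""
    return bytes(memory[base:base + stored_length(memory, base)])

print(stored_length(memory, BASE))   # 5
print(read_string(memory, BASE))     # b'HELLO'
```

Because the end is known from the length field, the character data could even contain a 0 byte without being cut short.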

Null Terminator Trade-offs

• Finding length requires scanning (slow for long strings)
• String cannot contain the null character (it would look like the end)
• No upfront length limit: the string can be any size
• Uses one extra byte for the terminator
• Safer against length-field corruption

Explicit Length Trade-offs

• Finding length is instant (just read the number)
• String can contain any character, including null
• Maximum length limited by how big the length field is
• Typically uses 4-8 bytes for length storage
• Safer against unterminated-string bugs

Why Length Storage Matters

This isn't academic trivia. If you ask 'what's the length of this string?' and the answer requires scanning all characters vs. reading one stored number, that's the difference between O(n) and O(1) time. For a million-character string, that difference is massive.

The Base Address: Finding Where Strings Live

When we work with a string in code, we don't actually carry around all its characters—that would be impractical. Instead, we work with a reference to the string: essentially the memory address where the string begins.

This starting address is called the base address or start pointer.

String variable str contains address 1000:

    str
    ┌─────────┐
    │  1000   │  ← This is what the variable holds: an address
    └─────────┘
         │
         ▼
Memory: ┌──────┬──────┬──────┬──────┬──────┐
        │  H   │  E   │  L   │  L   │  O   │
        └──────┴──────┴──────┴──────┴──────┘
        1000   1001   1002   1003   1004

The variable str doesn't contain the letters 'H', 'E', 'L', 'L', 'O'. It contains the number 1000—a pointer to where those letters are stored.

Why addresses instead of actual characters?

Efficiency: Copying an address (one number) is much faster than copying every character.
Sharing: Multiple variables can reference the same string data without duplicating it.
Flexible sizing: Variables have fixed size (enough to hold an address), even for strings of wildly different lengths.
Indirection enables operations: To get character at position k, compute address + k. Without the address, you couldn't locate the string.
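The "sharing" point is easy to observe in Python, where assignment copies a reference rather than the characters. This is a sketch of the idea, not a statement about how any particular runtime lays out memory:

```python
a = "HELLO"
b = a                    # copies the reference, not the five characters

# `is` checks whether two names refer to the very same object.
same_object = a is b
print(same_object)       # True
```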

The equation that powers string access:

address_of_character_at_position_k = base_address + (k × character_size)

For ASCII strings where each character is 1 byte:

address("HELLO"[2]) = 1000 + (2 × 1) = 1002

Look at address 1002, find 'L'. Done. This is why indexing is constant time (O(1))—no matter how long the string, one multiplication and one addition give you the exact location.
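The formula translates directly into code. The function name char_address is an assumption; the second half applies the same arithmetic to a fixed-width 4-byte encoding (UTF-32) to show where character_size matters:

```python
def char_address(base_address, k, character_size=1):
    """base_address + (k * character_size), exactly as in the formula."""
    return base_address + k * character_size

print(char_address(1000, 2))        # 1002 (1-byte characters)
print(char_address(1000, 2, 4))     # 1008 (4-byte characters)

# Same idea on real bytes: every character in UTF-32 occupies 4 bytes.
data = "HELLO".encode("utf-32-le")
offset = char_address(0, 2, 4)
third = data[offset:offset + 4].decode("utf-32-le")
print(third)                        # L
```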

This Is Why Zero-Based Indexing Makes Sense

Remember zero-based indexing? The first character is at position 0, not position 1. Now it makes perfect sense: address_of_first_char = base + 0 × size = base. The index is the offset from the start. No +1 or -1 corrections needed.

Memory Layout Variations Across Systems

While the core idea—characters stored consecutively in memory—is universal, the exact layout varies across languages and systems. Understanding these variations explains why string behavior differs across environments.

C-style strings:

┌───┬───┬───┬───┬───┬───┐
│ H │ E │ L │ L │ O │\0 │
└───┴───┴───┴───┴───┴───┘
  Characters + null terminator
  Length: computed by scanning

Pascal-style strings:

┌───────┬───┬───┬───┬───┬───┐
│   5   │ H │ E │ L │ L │ O │
└───────┴───┴───┴───┴───┴───┘
  Length prefix + characters
  (length in first byte, limited to 255)

Modern language strings (Java, Python, etc.):

┌───────────────────────────────┐
│  String Object Header         │  (metadata: type info, etc.)
├───────────────────────────────┤
│  Length: 5                    │  (explicit length field)
├───────────────────────────────┤
│  Hash: <cached hash code>     │  (optimization for hash-based lookups)
├───────────────────────────────┤
│  Data: H E L L O              │  (actual characters)
└───────────────────────────────┘

String Memory Models Compared
Aspect	C-style	Pascal-style	Modern Objects
Length access	O(n) - scan	O(1) - read prefix	O(1) - read field
Max length	Unbounded	255 (1-byte prefix)	~2 billion with a signed 4-byte length (e.g. Java)
Can contain null?	No	Yes	Yes
Memory overhead	1 byte (terminator)	1 byte (length)	Tens of bytes (header and fields; varies by runtime)
Typical use	C, C-style strings in C++	Historical (Pascal)	Java, Python, JS

Why These Differences Matter

When calling C libraries from higher-level languages (Python, Java), string conversion must handle these differences. The string typically has to be copied or re-encoded into a C-compatible buffer that ends with a null terminator. This hidden conversion can impact performance.
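Python's standard ctypes module makes this conversion visible. The sketch below only builds the C-compatible buffer and does not call any C function; the variable names are illustrative:

```python
import ctypes

py_text = "HELLO"
encoded = py_text.encode("ascii")                  # str -> raw bytes (one copy)
c_buffer = ctypes.create_string_buffer(encoded)    # copies again, appends \0

size_in_bytes = ctypes.sizeof(c_buffer)
raw_bytes = c_buffer.raw

print(size_in_bytes)    # 6: five characters plus the terminator
print(raw_bytes)        # b'HELLO\x00'
```

The extra byte and the copies are the "hidden conversion" cost: small for one short string, noticeable when repeated for many large ones.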

Implications for String Operations

Now that we understand physical storage, we can predict the cost of operations. This is the payoff of understanding the physical layer—no need to memorize; you can derive it.

Accessing a character at position k:

Compute base + k × character_size. Jump directly there. Read one value.

Cost: O(1) — constant time, regardless of string length.

Finding the length:

If length is stored: Read the stored number. O(1)
If null-terminated: Scan forward counting until you hit null. O(n)

Concatenating two strings:

You can't just 'attach' one string to another: the memory right after the first string may already be in use by something else. In general you must:

1. Allocate new memory large enough for both strings
2. Copy all characters from the first string
3. Copy all characters from the second string

Cost: O(n + m) where n and m are the lengths of the two strings.
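The three steps can be written out by hand. This is a sketch of what a runtime does conceptually, not how Python's own + operator is implemented; the function name concat is an assumption:

```python
def concat(a: bytes, b: bytes) -> bytes:
    """Concatenate by allocating a new block and copying both inputs."""
    result = bytearray(len(a) + len(b))   # 1. allocate room for both
    result[:len(a)] = a                   # 2. copy the first string
    result[len(a):] = b                   # 3. copy the second string
    return bytes(result)

print(concat(b"HELLO", b" WORLD"))        # b'HELLO WORLD'
```

Every byte of both inputs is written once, which is where the n + m in the cost comes from.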

Operation Costs Derived from Physical Storage
Operation	Physical Action	Time Complexity
Access char at position k	Compute address, read memory	O(1)
Get length (if stored)	Read length field	O(1)
Get length (null-terminated)	Scan until null	O(n)
Concatenate strings of length n and m	Allocate + copy both	O(n + m)
Compare two strings	Compare char by char until diff	O(min(n, m)) worst case
Create substring from i to j	Allocate + copy j-i+1 chars	O(j - i)
Search for substring	Scan for match	O(n × m) naive
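The last row, the naive substring search, can be sketched as follows. The function name naive_find is hypothetical; it tries every starting position and compares up to m characters at each, giving the O(n × m) bound:

```python
def naive_find(text, pattern):
    """Return the first index where pattern occurs in text, or -1."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):            # each candidate start position
        if text[start:start + m] == pattern:  # compares up to m characters
            return start
    return -1

print(naive_find("HELLO", "LL"))   # 2
print(naive_find("HELLO", "XY"))   # -1
```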

Concatenation in Loops: A Common Pitfall

Building a string by repeatedly concatenating in a loop takes O(n²) total work in general, not O(n). Each concatenation creates a new string and copies all previous characters. For 1000 single-character appends, you copy 1 + 2 + 3 + ... + 1000 = 500,500 characters total. Use a builder pattern instead.
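The arithmetic can be checked with a small cost model, followed by one common builder-style alternative in Python (collect the pieces, then join once). The function names are assumptions, and the model counts copies in the abstract; some runtimes optimize particular cases, so real timings may differ:

```python
def copies_with_repeated_concat(n):
    """Model: each append creates a new string and copies every character."""
    total = 0
    length = 0
    for _ in range(n):
        length += 1        # the new string is one character longer
        total += length    # and all of its characters are written
    return total

print(copies_with_repeated_concat(1000))   # 500500

def build_with_join(parts):
    """Builder-style: gather the pieces, then copy each character once."""
    return "".join(parts)

result = build_with_join(["a"] * 1000)
print(len(result))                         # 1000
```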

String Creation and Destruction

Strings don't just exist—they must be created (allocated in memory) and eventually destroyed (memory freed for reuse). This lifecycle explains many runtime behaviors.

Creation (Allocation):

When you create a string, the runtime must:

1. Calculate how many bytes are needed (characters + overhead)
2. Find a contiguous block of memory that large
3. Copy the character data into that block
4. Return the base address

For a string literal like "hello", this often happens at compile time—the string is embedded directly in the program's binary. But dynamically created strings (concatenation results, user input) are allocated at runtime.

Destruction (Deallocation):

When a string is no longer needed, its memory should be returned to the pool for reuse. How this happens depends on the language:

Manual memory management (C): The programmer explicitly frees memory. Forget to free → memory leak. Free too early → use-after-free bug.
Garbage collection (Java, Python, Go): The runtime automatically detects unreachable strings and reclaims their memory. Safer, but with some performance overhead.

Why This Matters

• Memory is finite: Every string occupies real memory. A million-character string takes (at least) a million bytes, plus overhead.
• Allocation takes time: Finding and preparing memory isn't free. Creating many small strings can be slower than one large string.
• Fragmentation: Allocating and freeing many different-sized strings can leave memory fragmented, with lots of small gaps that can't hold large strings.
• String pooling/interning: Some languages reuse memory for identical string literals. The strings "hello" and "hello" might share the same memory.

String Interning

Many languages 'intern' string literals: they store only one copy and have all references point to it. This saves memory and makes equality checking faster (compare addresses instead of characters). But it is only safe for immutable strings: if one reference could modify the shared copy, every other reference would see the change.
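In Python, interning can be requested explicitly with the standard sys.intern function. The sketch below builds one string at runtime and shows that, once interned, it is the same object as an equal interned string:

```python
import sys

# Build "hello" at runtime so it is not simply the same literal.
built = "".join(["hel", "lo"])

a = sys.intern(built)
b = sys.intern("hello")

# After interning, equal strings are represented by one shared object,
# so an identity (address) check is enough to establish equality.
print(a == b)    # True
print(a is b)    # True
```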

Variable-Width Characters: A Complexity Worth Knowing

Our discussion assumed each character occupies the same amount of space (one byte). This is true for ASCII, but modern strings often use variable-width encoding like UTF-8.

In UTF-8:

ASCII characters (U+0000 to U+007F): 1 byte
Extended Latin, Greek, Cyrillic (U+0080 to U+07FF): 2 bytes
Common CJK characters, most symbols (U+0800 to U+FFFF): 3 bytes
Rare characters, emoji (U+10000 to U+10FFFF): 4 bytes

Implications for our mental model:

With fixed-width characters:

Logical position 3 → Physical offset 3 × 1 = 3 bytes

With variable-width characters:

Logical position 3 → ??? (depends on how wide preceding characters are)

The simple address = base + k formula breaks down. To find the k-th character, you must scan from the start, counting characters (not bytes) until you reach k.
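The scan can be sketched by reading the first byte of each character, which tells you how many bytes that character occupies. The function name utf8_offset_of is hypothetical, and the sketch assumes valid UTF-8 input and an in-range k:

```python
def utf8_offset_of(data: bytes, k: int) -> int:
    """Return the byte offset where the k-th character (0-based) starts."""
    offset = 0
    for _ in range(k):              # walk over the k preceding characters
        first = data[offset]
        if first < 0x80:
            width = 1               # ASCII range
        elif first < 0xE0:
            width = 2
        elif first < 0xF0:
            width = 3
        else:
            width = 4
        offset += width
    return offset

data = "h\u00e9llo".encode("utf-8")      # "héllo": é takes 2 bytes
print(utf8_offset_of(data, 2))           # 3, not 2
print(chr(data[utf8_offset_of(data, 2)]))  # l
```

The loop runs k times, so reaching position k costs O(k) instead of a single address computation.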

Fixed-Width vs. Variable-Width Encoding
Aspect	Fixed-Width (ASCII)	Variable-Width (UTF-8)
Character size	Always 1 byte	1-4 bytes
Access by position	O(1) - calculate offset	O(n) - must scan
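Memory efficiency	1 byte per character (a fixed-width Unicode encoding such as UTF-32 needs 4)	1 byte for ASCII characters, 2-4 for others
Compatibility	128 characters, enough for basic English only	Supports all languages + emoji
Length(string) means...	byte count = character count	byte count ≥ character count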

A Common Source of Bugs

Assuming 'string length' and 'byte count' are the same causes bugs in UTF-8 strings. The string "hello" is 5 bytes AND 5 characters. But "héllo" is 6 bytes and 5 characters (é takes 2 bytes). And "👋hello" is 9 bytes but 6 characters (👋 takes 4 bytes).
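These counts can be verified directly. In Python, len on a str counts code points, while len on the encoded bytes counts bytes; the non-ASCII characters are written as escapes so the example does not depend on file encoding:

```python
samples = ["hello", "h\u00e9llo", "\U0001F44Bhello"]   # hello, héllo, 👋hello

counts = [(len(s), len(s.encode("utf-8"))) for s in samples]
for chars, nbytes in counts:
    print(chars, "characters,", nbytes, "bytes")
# 5 characters, 5 bytes
# 5 characters, 6 bytes
# 6 characters, 9 bytes
```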

Summary: From Abstract to Physical

We've journeyed from the abstract sequence of our logical model to the concrete reality of bytes in memory. Let's consolidate the key insights:

Key Takeaways

• Memory is linear and addressable: Like a row of numbered boxes, each with an address.
• Strings are stored contiguously: Characters occupy consecutive memory locations with no gaps.
• Characters are numbers: Encodings like ASCII map characters to numeric values that fit in bytes.
• Length tracking varies: Null terminators (scan to find length) vs. explicit storage (instant length).
• Base address + offset = instant access: The formula that makes indexing O(1).
• Concatenation requires copying: You can't just 'attach' strings; new memory must be allocated.
• Physical layout affects performance: Understanding storage lets you predict (not memorize) operation costs.
• Variable-width encoding changes the rules: UTF-8 breaks simple offset calculation, making position access O(n).

Next up:

We've covered the logical view (ordered sequence) and physical view (contiguous memory). The final piece of the representation puzzle is understanding length, indexing, and mutability—the practical concepts that determine how you work with strings in code.

Page Complete

You now understand how abstract strings map to physical memory. This knowledge explains why some operations are fast (indexing) and others slow (concatenation in a loop). More importantly, you can now derive these costs from first principles.
