Loading learning content...
When you type your name, send a message, or search for something online, you're working with strings. But what exactly is a string? Most developers immediately think of strings as just 'text'—a sequence of letters displayed on a screen. While this intuition captures part of the truth, it misses the profound mathematical structure that makes strings one of the most important and ubiquitous data structures in all of computer science.
In this page, we'll develop a rigorous conceptual model of strings that transcends any particular programming language. By understanding strings as ordered sequences of characters, you'll gain insight into why certain operations are possible, why others are expensive, and how to reason about string-based problems systematically.
By the end of this page, you will understand the mathematical definition of a string, the concept of ordering and position, the empty string as a boundary case, and how this logical view enables us to think about string operations before ever writing code.
Before we can reason precisely about strings, we need a formal definition that captures their essential nature. This definition must be independent of any programming language, memory model, or hardware architecture.
Definition: A string is a finite, ordered sequence of symbols drawn from a fixed set called an alphabet.
Let's unpack each component of this definition:
Finite: A string has a definite, countable length. It begins somewhere and ends somewhere. There's no such thing as an infinite string in practical computing.
Ordered: The position of each symbol matters. The string "abc" is fundamentally different from "cab" or "bac", even though they contain the same symbols.
Sequence: Symbols appear one after another in a line. Unlike a set (which has no order) or a bag (which allows duplicates but has no order), a sequence preserves both duplication and position.
Symbols from an alphabet: Every symbol in a string comes from a predefined collection. In computing, this is typically the set of characters a system can represent—ASCII characters, Unicode code points, or some subset thereof.
This mathematical formalism isn't academic pedantry. It's the foundation that allows us to prove properties about string algorithms, understand their complexity, and transfer knowledge across languages. When you know what a string is, you can reason about what operations should be possible.
Example:
Consider the string "hello". Under our formal definition:
This same formal structure applies whether you're dealing with "hello" in Python, "hello" in Java, or "hello" in Haskell. The underlying abstraction is universal.
The property of order is what fundamentally distinguishes sequences (and hence strings) from other collections. This ordering has profound implications for both the meaning of strings and the operations we can perform on them.
Order creates identity:
In a set, {a, b, c} and {c, b, a} are identical—sets don't care about order. But the strings "abc" and "cba" are completely different entities. They may contain the same characters, but they represent different sequences of symbols.
This matters because text has inherent linear structure. The word "stressed" is meaningfully different from "desserts" (its reverse), even though they're anagrams. The order of characters carries semantic information.
| Property | Set | Sequence (String) |
|---|---|---|
| Order matters | No | Yes |
| Duplicates allowed | No | Yes |
| {a, a, b} vs {a, b} | Identical | Different ("aab" ≠ "ab") |
| {a, b, c} vs {c, b, a} | Identical | Different ("abc" ≠ "cba") |
| Position concept | Not applicable | Each element has a position |
| Access pattern | Membership test | Index-based retrieval |
Order enables position-based operations:
Because strings are ordered, every character occupies a specific position (also called an index). This unlocks a rich set of operations:
Without the concept of order, none of these operations would be meaningful. You couldn't ask "what's the third character?" in a set—there is no 'third' in a set.
When you think about a string, visualize it as a row of boxes, each containing one character and labeled with a position number. This mental image—sometimes called the "array mental model"—is extraordinarily useful even before you learn about arrays as a data structure.
The most powerful mental model for understanding strings is to visualize them as a linear sequence of boxes, where each box contains exactly one character and is labeled with a position number.
Consider the string "ALGORITHM":
Position: 0 1 2 3 4 5 6 7 8
┌────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ A │ L │ G │ O │ R │ I │ T │ H │ M │
└────┴────┴────┴────┴────┴────┴────┴────┴────┘
This visualization immediately reveals several important properties:
Why zero-based indexing?
Zero-based indexing might seem unnatural initially—we don't count "zeroth" in everyday life. But it has deep mathematical and computational advantages:
Offset interpretation: Position k represents how far you've moved from the start. Position 0 = no movement (you're at the start). Position 5 = you've moved past 5 characters.
Simpler arithmetic: With zero-based indexing, the formula for finding the character at position k is simply: address = start + k × size. No need to subtract 1.
Consistent with memory addressing: Computer memory naturally works this way—the address of an element is its offset from a base address.
Half-open intervals: Ranges like [0, n) naturally describe a sequence of length n without any "+1" or "-1" corrections.
The most common bug category in string manipulation is the off-by-one error. It stems from confusion about whether indices start at 0 or 1, whether end indices are inclusive or exclusive, and what happens at boundaries. Master the box model now to avoid these bugs later.
One of the most important—and often overlooked—concepts in string theory is the empty string: a string containing zero characters.
Definition: The empty string, denoted as "" (two quotation marks with nothing between them), or sometimes as ε (epsilon) in formal language theory, is the unique string of length 0.
The empty string isn't merely an edge case to handle grudgingly—it's a fundamental mathematical object with important properties:
Identity for concatenation:
Just as 0 is the identity for addition (n + 0 = n) and 1 is the identity for multiplication (n × 1 = n), the empty string is the identity for string concatenation:
"hello" + "" = "hello""" + "world" = "world""" + "" = ""Concatenating the empty string doesn't change anything.
length("") = 0. This is the only string with this property.One of the most common bugs in string handling is confusing the empty string with null/nil/None. The empty string "" is a legitimate string with zero length. Null typically indicates the absence of any string value. Many runtime errors occur when code expects a string but receives null, or treats null as equivalent to "".
Why the empty string matters in practice:
Input validation: User input might be empty. If your code assumes at least one character exists, it will fail.
Split operations: Splitting a string can produce empty strings. For example, splitting "a,,b" by comma gives ["a", "", "b"].
Recursive algorithms: Many string algorithms have the empty string as their base case. It's what remains when you've consumed all characters.
Concatenation building: When building a string piece by piece, you typically start with the empty string and append to it.
Professional engineers always consider the empty string when designing string-handling code. It's not an afterthought—it's part of the design.
A recurring source of confusion is the relationship between strings and characters. Let's clarify this distinction precisely.
A character is a single, atomic symbol from an alphabet. It cannot be broken down further (at the logical level). Examples: 'A', '7', '$', '😀'.
A string is a sequence of zero or more characters. It's a composite entity that may contain any number of characters.
Critically: A single-character string is not the same as a character.
The string "A" (a sequence containing one character) and the character 'A' (the character itself) are conceptually different:
"A" has type string and length 1'A' has type character and the concept of length doesn't applyMany programming languages enforce this distinction syntactically (using different quote marks), while others blur it for convenience.
| Aspect | Character | String |
|---|---|---|
| Definition | Single atomic symbol | Sequence of symbols |
| Length | Not applicable (atomic) | 0, 1, 2, ... n characters |
| Example | 'A', 'b', '3', '!' | "hello", "A", "" |
| Can be empty? | No—a character is a unit | Yes—the empty string "" |
| Indexing | Not applicable | Yes—access character at position k |
| Typical syntax | Single quotes: 'x' | Double quotes: "x" |
| Relationship | Building block | Collection of building blocks |
The containment relationship:
Characters compose strings. A string is made of characters in the same way that a word is made of letters. You can:
This relationship is fundamental to understanding string operations. When we 'traverse' a string, we're visiting each character in sequence. When we 'compare' strings, we're comparing characters position by position.
Think of characters as individual tiles in a word game (like Scrabble). Each tile has a single letter. A string is like a word on the board—a sequence of tiles placed in order. The tiles (characters) are the building blocks; the word (string) is the structured collection.
Because strings are ordered sequences, we can meaningfully extract portions of them. Two related but distinct concepts are substrings and subsequences.
Substring:
A substring is a contiguous portion of a string. Every character in a substring appears consecutively in the original string.
For the string "ALGORITHM":
"ALGO" is a substring (positions 0-3)"RITH" is a substring (positions 4-7)"ALGORITHM" is a substring of itself (the whole string)"" (empty string) is a substring (trivially)"ARM" is NOT a substring (characters aren't contiguous)Subsequence:
A subsequence is obtained by deleting some (possibly zero) characters from a string without changing the order of remaining characters. Characters need not be contiguous.
For the string "ALGORITHM":
"ARM" is a subsequence (A at 0, R at 4, M at 8)"AGT" is a subsequence (A at 0, G at 2, T at 6)"ALGORITHM" is a subsequence of itself"ANY" is NOT a subsequence (Y doesn't appear in the string)"HMT" is NOT a subsequence (this ordering doesn't match the original)Many problems distinguish between substrings and subsequences. Finding the longest common substring (contiguous match) is fundamentally different from finding the longest common subsequence (matching characters that may be spread out). The algorithmic approaches, complexities, and applications differ significantly.
Two strings are equal if and only if they contain exactly the same characters in exactly the same positions.
Formal definition: Strings A and B are equal iff:
length(A) = length(B), ANDA[i] = B[i]This definition has several important implications:
Examples of string equality:
| String A | String B | Equal? | Reason |
|---|---|---|---|
| "hello" | "hello" | Yes | Identical characters at all positions |
| "hello" | "Hello" | Depends | Case-sensitive: No. Case-insensitive: Yes |
| "hello" | "hello " | No | Different lengths (trailing space) |
| "hello" | "helo" | No | Different lengths |
| "" | "" | Yes | Both are the empty string |
| "hello" | "olleh" | No | Same characters, different positions |
Strings that look identical may not be equal due to invisible characters: trailing spaces, different Unicode representations of the 'same' character (e.g., é as a single code point vs. e + combining accent), or different line ending conventions (\r\n vs \n). Always consider what equality truly means for your domain.
Lexicographical (Dictionary) Ordering:
Beyond equality, we often want to compare which of two strings comes 'first' in a sorted order. This is called lexicographical ordering—essentially, dictionary order.
The algorithm:
Examples:
"apple" < "banana" (because 'a' < 'b' at position 0)"apple" < "application" (all match up to "appl", but "apple" is shorter)"Zebra" < "apple" in ASCII (because 'Z' = 90 < 'a' = 97)This ordering is fundamental to sorting strings and searching in sorted collections.
Our logical view of strings—as ordered sequences of characters—immediately suggests a set of natural operations. These operations are meaningful purely because of the mathematical structure we've defined, independent of any implementation.
Core operations on strings:
| Operation | Description | Example |
|---|---|---|
| Length | Count the number of characters | length("hello") = 5 |
| Access | Retrieve character at position k | "hello"[1] = 'e' |
| Concatenation | Join two strings end-to-end | "hel" + "lo" = "hello" |
| Substring | Extract contiguous portion | substring("hello", 1, 3) = "ell" |
| Search | Find position of a substring | find("hello", "ll") = 2 |
| Comparison | Determine equality or ordering | "abc" < "abd" |
| Traversal | Visit each character in order | for c in "hello": print(c) |
| Reverse | Produce characters in opposite order | reverse("hello") = "olleh" |
Why define operations abstractly?
At this point, we haven't discussed how these operations are implemented—we've only described what they do. This separation is intentional and powerful:
Reason about correctness independent of implementation: We can prove that reverse(reverse(s)) = s purely from the logical definition.
Understand operations before code: You know what concatenation means without seeing any code.
Recognize trade-offs later: When we discuss implementation, you'll understand why some operations are fast and others slow.
This is the essence of abstract thinking in computer science: understanding the what before the how.
Each operation has a cost. Accessing a character might be fast; searching for a substring might be slow. The logical view tells us what operations are possible, but the physical implementation determines their efficiency. We'll explore this in subsequent pages.
Let's consolidate what we've learned about the logical view of strings:
Looking ahead:
We've established what a string is at the logical level—an abstract mathematical object. But strings must actually exist somewhere in a computer's memory. In the next page, we'll develop intuition about physical storage: how the abstract concept of an ordered sequence translates into actual bytes in memory, and what this means for the cost of operations.
You now have a precise mental model of strings as ordered sequences. This logical foundation will support everything that follows—whether you're implementing string algorithms, analyzing their complexity, or debugging string-related bugs. Next, we explore how this abstract structure manifests in physical memory.