Loading learning content...
Every piece of text you've ever typed, read, or searched for in a computer—every username, password, email, tweet, URL, log message, and error notification—is represented internally as a string. Yet despite being the most frequently used data type in most software systems, strings are often treated as 'just text' without deeper understanding.
This is a critical oversight. Strings are not mere collections of letters—they are structured data with specific properties, operations, and performance characteristics that every serious programmer must understand deeply. The way you think about strings directly impacts the efficiency, correctness, and scalability of nearly every program you write.
By the end of this page, you will understand exactly what a string is at a fundamental level—not as an abstract programming convenience, but as a precisely defined sequence of characters with specific structure and properties. This foundation is essential for everything that follows in your journey through string algorithms and text processing.
Before we can fully understand strings, we need to establish where they come from. In Chapter 3, we explored characters as primitive data types—individual symbols that represent letters, digits, punctuation, or other textual elements. A character is atomic: it cannot be broken down into smaller textual units.
The fundamental insight:
A string is what happens when we need to represent more than one character in a meaningful, ordered way. Consider the difference:
| Concept | Example | What It Represents | Nature |
|---|---|---|---|
| Character | 'H' | A single symbol—the letter H | Primitive (atomic) |
| Character | '5' | A single symbol—the digit 5 | Primitive (atomic) |
| Character | '!' | A single symbol—an exclamation mark | Primitive (atomic) |
| String | "Hello" | A sequence of 5 characters: 'H', 'e', 'l', 'l', 'o' | Composite (structured) |
| String | "42" | A sequence of 2 characters: '4', '2' | Composite (structured) |
| String | "Hi!" | A sequence of 3 characters: 'H', 'i', '!' | Composite (structured) |
The critical observation:
Notice how the string "42" consists of the characters '4' and '2'—it is not the number forty-two. Similarly, the string "Hello" is not simply five separate characters floating in space—it is a single entity that maintains those five characters in a specific order.
This distinction is profound: a string imposes structure on characters. It defines which characters belong together and in what order they appear. This structure is what transforms a collection of primitive values into something greater—a data structure.
Strings exemplify a core principle in computer science: complex data structures are built by composing simpler elements according to specific rules. Just as molecules are atoms arranged in specific configurations, strings are characters arranged in specific sequences. The 'arrangement rules' are what make it a data structure.
Let us now state precisely what a string is:
Definition:
A string is a finite, ordered sequence of zero or more characters drawn from a defined character set (alphabet).
Let's unpack each part of this definition, because every word carries significance:
Formal notation:
In theoretical computer science, strings are often represented mathematically. Given an alphabet Σ (sigma), a string over Σ is any finite sequence of symbols from Σ:
This formal perspective becomes essential when studying automata theory, regular expressions, and formal languages—but for now, the key insight is that strings are mathematically well-defined objects, not vague programming conveniences.
The precise definition of a string may seem pedantic, but it prevents countless bugs and misunderstandings. Knowing that strings are ordered sequences explains why "dog" ≠ "god". Knowing they're finite explains why you can always determine a string's length. Knowing they can be empty explains why empty string checks are necessary in code.
The fact that strings are sequences is their most defining characteristic. This isn't merely about putting characters in a row—it establishes a precise mathematical relationship between each character and its position.
Indexing: Every character has an address
In a string of length n, each character occupies exactly one position, typically numbered from 0 to n-1 (zero-based indexing) or from 1 to n (one-based indexing). This position is called the character's index.
Consider the string "HELLO":
| Index | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Character | H | E | L | L | O |
This indexing system enables several fundamental operations:
1. Direct Access (Random Access)
Given an index i, we can retrieve the character at position i directly. For "HELLO", accessing index 2 yields 'L'. This is typically an O(1) operation—we can jump directly to any position without examining the characters before it.
2. Traversal
We can systematically visit every character in the string by iterating through indices from 0 to length-1. This is fundamental to virtually all string algorithms.
3. Substring Extraction
We can extract a contiguous portion of the string by specifying start and end indices. The substring of "HELLO" from index 1 to 3 is "ELL".
4. Position-Based Comparison
Two strings are equal if and only if they have the same length AND the character at each position in one string equals the character at the corresponding position in the other.
The strings "ABC" and "CBA" contain identical characters but are completely different strings because the same characters occupy different positions. When comparing, searching, or manipulating strings, you must always consider position, not just content.
The implications of ordering:
Because strings are ordered sequences, several properties emerge:
Every string has a first character (at index 0) and a last character (at index length-1), except the empty string which has neither.
Substrings preserve order. A substring maintains the relative ordering of characters from the original string.
Concatenation respects order. When you concatenate "Hello" + "World", the result is "HelloWorld"—not "WorldHello" or some shuffled combination. The first string's characters precede the second string's characters.
Comparison is positional. Lexicographic (dictionary) ordering compares strings position by position, left to right, making position critical for sorting and searching.
Every string has a length—the count of characters it contains. This seems trivially simple, but length is one of the most important properties of a string, influencing nearly every operation and algorithm.
Formal definition:
The length of a string S, often written as |S| or len(S), is the number of characters in the sequence.
Examples:
| String | Length | Explanation |
|---|---|---|
| "Hello" | 5 | Five characters: H, e, l, l, o |
| "Hi!" | 3 | Three characters: H, i, ! |
| "" | 0 | The empty string contains no characters |
| " " | 1 | A single space character |
| "A B" | 3 | Three characters: A, space, B |
| "\n" | 1 | A single newline character |
| "Line1\nLine2" | 11 | Eleven characters including the newline |
Critical observations about length:
1. Spaces and special characters count
A common beginner mistake is forgetting that spaces, tabs, newlines, and other 'invisible' characters contribute to length. The string "Hello World" has length 11, not 10—the space between Hello and World is a character.
2. The empty string is unique
The empty string "" is the only string with length 0. It's a valid string—not a null or undefined value. Every string operation must account for this edge case.
3. Length determines valid indices
For a string of length n, valid indices in zero-based systems are 0 through n-1. Accessing index n or beyond is an error (off-by-one errors are among the most common bugs in string processing).
4. Length affects algorithmic complexity
Most string algorithms have complexity that depends on length. A linear search through a string is O(n) where n is the string's length. Concatenating two strings typically takes O(n + m) time where n and m are their lengths. Understanding length is essential for analyzing string algorithm performance.
When analyzing any string problem, start by considering length. Ask: How does the solution scale with string length? What happens when length is 0? 1? Very large? Boundary conditions around length reveal many bugs and algorithmic issues.
The empty string deserves special attention because it is frequently a source of confusion and bugs, yet it is a perfectly valid and important string value.
Definition:
The empty string is a string containing zero characters. It has length 0 and is typically represented as "" (two quotation marks with nothing between them).
Key properties of the empty string:
A critical distinction: The empty string "" is a valid string that happens to contain no characters. Null (or undefined/None in various languages) represents the absence of any string value. An empty string has length 0; null has no length because there is no string to measure.
Always handle the empty string as a valid input. Functions that process strings should explicitly consider what happens when given "". A robust substring search returns a valid result for an empty pattern. A well-designed API documents empty string behavior.
Why does the empty string matter algorithmically?
The empty string is the base case for many string algorithms. Consider:
Recursion on strings: When processing a string recursively (removing one character at a time), the empty string is typically the base case where recursion stops.
Pattern matching: An empty pattern matches everywhere—there are n+1 positions where an empty string could be said to 'occur' within a string of length n.
String building: When constructing a string iteratively (e.g., building up a result), you typically start with the empty string and progressively append.
Invariants: Many algorithms maintain invariants like 'the result so far is a valid prefix'. The empty string is always a valid prefix.
Understanding the empty string is not pedantry—it's essential for writing correct string algorithms.
We've established that strings are sequences of characters, but what exactly constitutes a 'character' in a string? This question has evolved significantly as computing has become global.
The historical view: ASCII characters
In the early days of computing, a character was typically an ASCII symbol—one of 128 possible values representing English letters (uppercase and lowercase), digits 0-9, punctuation marks, and control characters. Each ASCII character fit in 7 bits, and strings were simple arrays of bytes.
The modern reality: Unicode characters
Today, strings typically contain Unicode characters—symbols drawn from a vast repertoire that includes:
Unicode defines over 149,000 characters as of recent versions, with room for over a million.
In Unicode, what users perceive as a single 'character' (called a grapheme) may actually be composed of multiple code points. For example, 'é' can be represented as a single code point OR as 'e' followed by a combining acute accent. Some emoji require multiple code points. This complexity is beyond our current scope, but be aware that 'character' is more nuanced than it appears.
For our purposes:
When we discuss strings in this chapter, we'll generally treat each position in a string as holding one character, abstracting away the encoding complexities. This abstraction is valid for understanding string algorithms conceptually—the same algorithmic principles apply whether characters are ASCII, Unicode code points, or arbitrary symbols.
The key insight is that strings are parameterized by their character type. You can have:
In each case, the fundamental definition remains: a string is a finite, ordered sequence of characters from some alphabet. The algorithms we develop work regardless of what that alphabet contains.
| Domain | Alphabet | Example String | String Length |
|---|---|---|---|
| English text | {a-z, A-Z, 0-9, punctuation, space, ...} | "Hello, World!" | 13 |
| Binary data | {0, 1} | "10110011" | 8 |
| DNA sequences | {A, T, G, C} | "ATCGATCG" | 8 |
| Hexadecimal | {0-9, A-F} | "3F7A" | 4 |
| Multilingual | Unicode | "Hello 世界 🌍" | 10+ |
Now we can understand why strings are considered structured data rather than primitive values. A string imposes structure in several ways:
1. Logical Structure: The Sequence
A string maintains its characters in a strict linear sequence. This structure enables positional operations (accessing by index), order-dependent operations (comparison, concatenation), and pattern operations (searching, matching).
2. Semantic Structure: Meaning Through Order
The order of characters in a string typically conveys meaning. "stop" and "pots" contain identical characters but have entirely different meanings. The structure of the sequence is the information.
3. Operational Structure: Defined Behaviors
Strings come with a standard set of operations that respect their sequential nature:
The contrast with primitives:
A single character is atomic—it cannot be decomposed, has no internal structure, and supports only basic operations (comparison, conversion). A string, despite being made of characters, transcends them:
The moment you combine characters into a sequence, you create something qualitatively different—a data structure with rich operational semantics. This is why strings are classified as non-primitive despite containing primitive characters.
Think of a string as a 'word made visible'—not just letters, but the specific arrangement of letters that creates meaning. The word 'tar' becomes 'rat' with rearrangement, 'art' with another. Same letters, different structures, different meanings. The structure is what transforms raw characters into information.
We have built a precise understanding of what a string is at its most fundamental level. Let's consolidate the key concepts:
What's next:
Now that we understand what strings are, the next page explores why strings are classified as non-primitive data structures. We'll see how strings differ fundamentally from primitive types like integers, characters, and booleans—and why this classification matters for how we think about and work with text data.
You now have a rigorous understanding of what defines a string: a finite, ordered sequence of characters. This foundation—understanding strings as structured sequences rather than vague 'text values'—will inform every string algorithm and text processing technique you encounter. Next, we'll explore why this structured nature makes strings fundamentally non-primitive.