Substring extraction is the operation of obtaining a contiguous portion of a string: a subset of its characters starting at one position and ending at another. This operation is fundamental to parsing, searching, validation, and text processing of all kinds.
Whether you call it slicing, substring, or substr, the core concept is the same: given a string and a range, produce a new string containing just that range. But the details—inclusive vs exclusive bounds, copy vs view semantics, encoding considerations—can make the difference between correct and buggy code.
By the end of this page, you will understand: how substring extraction works conceptually and physically, the difference between copy and view semantics, inclusive vs exclusive range conventions, efficient substring patterns, and the time/space cost model.
A substring is a contiguous sequence of characters within a string. Substring extraction creates a new string containing only those characters.
Definition: Given a string S and indices i and j, the substring S[i:j] contains the characters from position i up to (but not including) position j.
Example:
String S = "Hello World"
S[0:5] = "Hello" (characters at indices 0, 1, 2, 3, 4)
S[6:11] = "World" (characters at indices 6, 7, 8, 9, 10)
S[3:8] = "lo Wo" (characters at indices 3, 4, 5, 6, 7)
| Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Char | H | e | l | l | o | (space) | W | o | r | l | d |
Key properties of substring extraction:

- The result is contiguous: every character between the chosen positions appears, in order.
- With the half-open convention S[i:j], the length of the result is j - i.
- The result is itself a string, so it supports all the usual string operations.

Special cases:

- S[i:i] is the empty string "" (a zero-length range).
- S[0:len(S)] yields the entire string.
- When i is 0 the result is a prefix of S; when j is len(S) it is a suffix.
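As a quick sanity check, here is a minimal Python sketch (assuming the half-open convention described above) that verifies the example slices, the length property, and the special cases:

```python
S = "Hello World"

# The example slices from the table above
assert S[0:5] == "Hello"   # indices 0..4
assert S[6:11] == "World"  # indices 6..10
assert S[3:8] == "lo Wo"   # indices 3..7

# Length of a half-open slice is end - start
assert len(S[3:8]) == 8 - 3

# Special cases
assert S[3:3] == ""         # empty range
assert S[0:len(S)] == S     # the whole string
assert S[:5] + S[5:] == S   # split and rejoin at any point
```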
Different languages use different terminology: Python uses 'slice', Java uses 'substring', JavaScript has both 'substring' and 'slice' (with slightly different behaviors). The core concept is the same; pay attention to the specific API semantics in your language.
One of the most confusing aspects of substring extraction is understanding range conventions. Different languages and functions use different rules for whether the end index is included or excluded.
Half-open intervals [start, end):
Most modern languages use half-open intervals (also called "exclusive end"): the character at start is included, the character at end is not, and the length of the result is end - start.
This is the convention in Python, Java, JavaScript, Go, Rust, and most modern languages.
s = "0123456789" # [start, end) - end is EXCLUDEDs[0:5] # "01234" - indices 0, 1, 2, 3, 4 (length = 5 - 0 = 5)s[3:7] # "3456" - indices 3, 4, 5, 6 (length = 7 - 3 = 4)s[5:10] # "56789" - indices 5, 6, 7, 8, 9 (length = 10 - 5 = 5) # Empty substring when start == ends[3:3] # "" - no characters # Convenient properties:# s[0:k] gives first k characters# s[k:] gives everything after first k characters# s[0:k] + s[k:] == s (split and rejoin at any point)Why half-open intervals?
The exclusive end convention offers several mathematical advantages:

- The length of the result is simply end - start, with no +1 correction.
- Adjacent ranges compose cleanly: [a, b) followed by [b, c) covers [a, c) with no overlap and no gap.
- Splitting at any index k is lossless: s[0:k] + s[k:] == s.
- Empty ranges are easy to express: [i, i) is empty without needing a special case.
Some APIs use inclusive end (particularly older APIs or domain-specific functions). Always check documentation! The difference between exclusive and inclusive end is the most common source of off-by-one errors in substring operations.
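If you ever need to work against an inclusive-end specification (say, a format that describes ranges as "from i to j inclusive"), a small adapter makes the conversion explicit. This is only a sketch; the helper name is illustrative:

```python
def substring_inclusive(s: str, start: int, end_inclusive: int) -> str:
    """Extract s[start..end_inclusive], both bounds included.

    Converts to Python's exclusive-end slicing by adding 1 to the end
    index, which is exactly where off-by-one bugs usually creep in.
    """
    return s[start:end_inclusive + 1]


assert substring_inclusive("Hello World", 0, 4) == "Hello"   # indices 0..4
assert substring_inclusive("Hello World", 6, 10) == "World"  # indices 6..10
```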
Certain substring extraction patterns appear repeatedly in programming. Mastering these patterns makes string manipulation fluent:
Pattern 1: Get first N characters (prefix)
s = "Hello World" # First N charactersfirst_5 = s[:5] # "Hello" (0:5 is implied)first_1 = s[:1] # "H"first_0 = s[:0] # "" (empty) # Practical use: truncating stringsdef truncate(s: str, max_length: int) -> str: """Truncate string to max_length characters.""" return s[:max_length] # Example: truncate("Hello World", 5) → "Hello"Pattern 2: Get last N characters (suffix)
s = "Hello World" # Last N characters using negative indexlast_5 = s[-5:] # "World"last_1 = s[-1:] # "d"last_3 = s[-3:] # "rld" # Alternative: calculate start positionn = 5last_n = s[len(s) - n:] # "World" # Practical use: checking file extensionsdef get_extension(filename: str) -> str: """Get last 4 characters (e.g., '.txt').""" return filename[-4:] if len(filename) >= 4 else filenamePattern 3: Remove first/last N characters
s = "Hello World" # Remove first N characters (get suffix)without_first_6 = s[6:] # "World" # Remove last N characters (get prefix)without_last_6 = s[:-6] # "Hello" # Remove from both endstrimmed = s[1:-1] # "ello Worl" (remove first and last) # Practical use: removing quotesquoted = '"Hello"'unquoted = quoted[1:-1] # 'Hello' # Practical use: removing prefix if presentdef remove_prefix(s: str, prefix: str) -> str: if s.startswith(prefix): return s[len(prefix):] return sPattern 4: Split at position
s = "Hello World" # Split at position kk = 5left = s[:k] # "Hello"right = s[k:] # " World" # Verify: left + right == sassert left + right == s # Practical use: parsing fixed-width fieldsrecord = "John 30NYC"# ^^^^^^^^ ^^^^^^# name(10) age(2)rest name = record[:10].strip() # "John"age = record[10:12] # "30"city = record[12:] # "NYC"Python's negative indices (s[-1] is last character, s[-2] is second-to-last) work in slicing too. s[-3:] gets last 3 characters, s[:-3] gets everything except last 3. This makes many patterns more readable, but be aware: not all languages support this.
When you extract a substring, does the language create a copy of the characters, or does it create a view (reference) into the original string? This distinction has significant performance and memory implications.
Copy semantics (most common):
The substring operation allocates new memory and copies the characters. The result is completely independent of the original.
```
Original string S = "Hello World" (stored in memory block A)

Substring operation: T = S[6:11]

Result with COPY semantics:
- New memory block B is allocated (5 characters)
- Characters "World" are copied from A to B
- T points to memory block B
- S and T are completely independent

Memory layout:
Block A: [H][e][l][l][o][ ][W][o][r][l][d]   ← S points here
Block B: [W][o][r][l][d]                     ← T points here

If S is garbage collected, T is unaffected.
Time:  O(k) where k = length of substring
Space: O(k) for the new string
```

View semantics (some languages/cases):
The substring operation returns a reference to a portion of the original string. No characters are copied.
```
Original string S = "Hello World" (stored in memory block A)

Substring operation: T = view of S[6:11]

Result with VIEW semantics:
- No new memory is allocated
- T is a reference to block A, with offset=6 and length=5
- S and T share the same underlying memory

Memory layout:
Block A: [H][e][l][l][o][ ][W][o][r][l][d]
          ↑ S points here (offset 0)
                            ↑ T points here (offset 6, length 5)

If S is still in use, the entire block A stays in memory.
Time:  O(1) - just create the reference
Space: O(1) - no character copying

DANGER: If S is referenced only by T, the ENTIRE original string
stays in memory, not just the substring. This can cause memory leaks!
```

Which languages use which semantics?
| Language | Default Semantics | Notes |
|---|---|---|
| Python | Copy | Slicing always creates new string objects |
| JavaScript | Copy | All string operations return new strings |
| Java (pre-7u6) | View | Shared char[] could cause memory leaks |
| Java (7u6+) | Copy | Changed to copy to prevent memory issues |
| Go | View | Slicing a string shares the underlying bytes; no characters are copied |
| Rust | View (&str) | String slices are borrowed references |
Early Java versions used view semantics for substrings. A famous problem: read a 1MB string, extract a 10-character substring, discard the original. The 1MB char array stayed in memory because the tiny substring referenced it! Java 7 update 6 changed to copy semantics to fix this. This is a lesson in the hidden costs of optimization.
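To make the copy/view distinction concrete without leaving Python: slicing a str always copies, while a memoryview over a bytes object behaves like a view onto the same buffer. A minimal sketch (the variable names are just for illustration):

```python
data = b"Hello World"

# Copy semantics: slicing bytes (or str) allocates a new object
copied = data[6:11]           # b"World", a new 5-byte object
assert copied == b"World"

# View semantics: a memoryview slice references the original buffer
view = memoryview(data)[6:11]
assert view.tobytes() == b"World"
assert view.obj is data       # the view keeps the whole buffer alive

# Converting the view back to bytes forces a copy when you need one
independent = bytes(view)
assert independent == b"World"
```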
Understanding the cost of substring extraction is essential for writing efficient code:
Time complexity: O(k) with copy semantics, where k is the length of the extracted substring (every character must be copied); O(1) with view semantics (only a reference is created).

Space complexity: O(k) with copy semantics for the new string; O(1) with view semantics, although the entire original string remains reachable.
Important insight:
With copy semantics, extracting a substring from a very long string is still efficient if the substring is short:
```python
# Copy semantics (Python): Cost depends on SUBSTRING length, not source
import time

huge_string = "x" * 10_000_000  # 10 million character string

# This is O(k) where k = 5, regardless of huge_string length
start = time.time()
tiny = huge_string[:5]  # "xxxxx"
print(f"Extract 5 chars: {time.time() - start:.6f}s")   # ~0.000001s

# This is O(k) where k = 1_000_000
start = time.time()
million = huge_string[:1_000_000]
print(f"Extract 1M chars: {time.time() - start:.6f}s")  # ~0.01s

# The source string length doesn't affect performance
# Only the extracted length matters
```

When does substring extraction become expensive?

Extraction becomes a performance concern when the extracted pieces are long, when you extract many substrings in a loop (n extractions of length k cost O(n × k) time and space), or when extraction sits inside a hot path where the repeated allocations add up.
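A classic example of the expensive case is repeatedly shaving characters off the front of a string, which copies the remaining tail on every iteration. The sketch below contrasts that quadratic pattern with an index-based loop; the function names are just for illustration:

```python
def consume_by_slicing(s: str) -> int:
    """Process one character at a time by re-slicing the string.

    Each s[1:] copies the remaining tail, so the total work is
    roughly n + (n-1) + ... + 1 = O(n^2) characters copied.
    """
    count = 0
    while s:
        count += 1
        s = s[1:]  # copies len(s) - 1 characters every time
    return count


def consume_by_index(s: str) -> int:
    """Process one character at a time by advancing an index: O(n), no copies."""
    count = 0
    i = 0
    while i < len(s):
        count += 1
        i += 1
    return count


assert consume_by_slicing("hello") == consume_by_index("hello") == 5
```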
Optimization: Avoid extraction when possible
Sometimes you can work with indices instead of creating substrings:
```python
def compare_substrings_naive(s1: str, i1: int, s2: str, i2: int, length: int) -> bool:
    """
    Compare substrings by extracting them first.

    Time: O(length) for extraction + O(length) for comparison = O(length)
    Space: O(length) for two temporary substrings
    """
    sub1 = s1[i1:i1+length]  # Creates new string
    sub2 = s2[i2:i2+length]  # Creates new string
    return sub1 == sub2


def compare_substrings_efficient(s1: str, i1: int, s2: str, i2: int, length: int) -> bool:
    """
    Compare substrings by comparing characters directly.

    Time: O(length) for comparison
    Space: O(1) - no substring allocation
    """
    for k in range(length):
        if s1[i1 + k] != s2[i2 + k]:
            return False
    return True

# Both are O(length) time, but the second uses O(1) space
# and avoids memory allocation overhead
```

In performance-critical code, consider whether you can avoid creating substrings entirely. If you just need to compare, search within, or iterate over a portion, working with index ranges is often faster and uses less memory.
Substring extraction has several edge cases that can cause bugs or errors. Understanding these cases is essential for robust code:
Edge case 1: Out-of-bounds indices
Different languages handle invalid indices differently:
s = "hello" # length 5, valid indices 0-4 # Python SILENTLY CLAMPS out-of-bounds slice indicess[0:100] # "hello" (end clamped to 5)s[-100:3] # "hel" (start clamped to 0)s[10:20] # "" (both beyond end, empty result)s[3:1] # "" (start > end, empty result) # This is convenient but can hide bugs!# You might get "" when you expected data # Note: Direct indexing DOES raise error# s[10] # IndexError: string index out of rangeEdge case 2: Empty string source
s = "" # empty string # Any extraction from empty string gives empty strings[0:0] # ""s[0:1] # "" (bounds clamped, nothing to extract)s[:5] # "" # This is usually fine, but watch for assumptions:# If you expect s[:1] to give you the first character,# you'll get "" instead of raising an error.Edge case 3: Single character extractions
s = "hello" # Character access vs single-char slicechar = s[0] # 'h' (a character/string of length 1)slice = s[0:1] # 'h' (explicitly a string of length 1) # In Python, these are equivalent (both return str)type(s[0]) # <class 'str'>type(s[0:1]) # <class 'str'>s[0] == s[0:1] # True # But semantically:# - s[0] means "the character at position 0"# - s[0:1] means "the substring from 0 to 1 (exclusive)"Python and JavaScript 'helpfully' return empty strings for invalid ranges. This can hide bugs where you expected data but got nothing. Java throws exceptions, forcing you to handle errors explicitly. Know your language's behavior and write defensive code accordingly.
Substring extraction is a core operation in many algorithms. Understanding when to use it (and when not to) is key to writing efficient code.
Example 1: Finding all substrings of length k
```python
def get_all_substrings_of_length_k(s: str, k: int) -> list[str]:
    """
    Extract all contiguous substrings of length k.

    For s = "abcde" and k = 3:
    Returns: ["abc", "bcd", "cde"]

    Time: O((n-k+1) * k) = O(n*k) - each extraction is O(k)
    Space: O((n-k+1) * k) for storing all substrings
    """
    result = []
    for i in range(len(s) - k + 1):
        result.append(s[i:i+k])
    return result


def count_unique_substrings_of_length_k(s: str, k: int) -> int:
    """
    Count unique substrings of length k using a set.

    Time: O(n*k) - extracting substrings
    Space: O(n*k) worst case for the set
    """
    seen = set()
    for i in range(len(s) - k + 1):
        seen.add(s[i:i+k])
    return len(seen)

# Example: count_unique_substrings_of_length_k("abcabc", 3)
# Substrings: "abc", "bca", "cab", "abc"
# Unique: {"abc", "bca", "cab"} → 3
```

Example 2: Checking if string contains a substring (naive approach)
```python
def contains_substring_naive(text: str, pattern: str) -> bool:
    """
    Check if text contains pattern using substring extraction.

    Time: O((n-m+1) * m) where n = len(text), m = len(pattern)
    This is O(n*m) in the worst case.

    Note: Python's 'in' operator is much faster (uses optimized algorithms)
    """
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        if text[i:i+m] == pattern:  # O(m) extraction + O(m) comparison
            return True
    return False


# Better: use the built-in 'in' operator
def contains_substring_better(text: str, pattern: str) -> bool:
    return pattern in text  # Uses Boyer-Moore or similar internally

# Understanding the naive approach helps appreciate
# why advanced algorithms (KMP, Rabin-Karp) exist
```

Example 3: Longest common prefix of two strings
```python
def longest_common_prefix(s1: str, s2: str) -> str:
    """
    Find the longest common prefix of two strings.

    Approach: Find the common length, then extract once
    Time: O(min(n, m)) for comparison + O(k) for extraction
    Space: O(k) where k = length of common prefix
    """
    min_len = min(len(s1), len(s2))
    common_length = 0
    for i in range(min_len):
        if s1[i] == s2[i]:
            common_length += 1
        else:
            break
    return s1[:common_length]  # Extract once at the end

# Example:
# longest_common_prefix("flower", "flow") → "flow"
# longest_common_prefix("abc", "xyz") → ""
```

In algorithms that search through strings, prefer to find the boundaries first (using index comparisons), then extract the substring once at the end. This avoids creating many temporary substring objects during the search phase.
Checking whether a string starts with or ends with a specific sequence is so common that most languages provide dedicated methods. These are optimized alternatives to manual substring extraction and comparison.
Prefix checking (starts with):
s = "Hello World" # Built-in method (preferred)s.startswith("Hello") # Trues.startswith("World") # Falses.startswith("") # True (empty is a prefix of everything) # Manual approach (slower, less readable)prefix = "Hello"s[:len(prefix)] == prefix # True # Multiple prefixess.startswith(("Hello", "Goodbye")) # True (tuple of options) # Practical use: URL scheme detectionurl = "https://example.com"is_secure = url.startswith("https://")Suffix checking (ends with):
s = "document.pdf" # Built-in method (preferred)s.endswith(".pdf") # Trues.endswith(".txt") # Falses.endswith("") # True # Multiple suffixess.endswith((".pdf", ".doc", ".txt")) # True # Practical use: file type detectiondef is_image(filename: str) -> bool: return filename.lower().endswith((".jpg", ".jpeg", ".png", ".gif"))Always use startswith()/endswith() when available. They're more readable than manual slicing, less error-prone (handle edge cases correctly), and often optimized. The manual approaches exist if you need custom behavior, but built-ins should be your default.
Let's consolidate the cost model for substring extraction:
| Operation | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Extract substring (copy) | O(k) | O(k) | k = length of extracted substring |
| Extract substring (view) | O(1) | O(1)* | *Original string memory is retained |
| startswith/endswith | O(m) | O(1) | m = length of prefix/suffix checked |
| Multiple extractions (n of length k) | O(n × k) | O(n × k) | Builds n new strings |
| Compare substrings via indices | O(k) | O(1) | No substring allocation |
What's next:
With substring extraction mastered, we're ready for the final operation in this module: string comparison. Comparing strings involves not just equality checks but also ordering, case sensitivity, and locale considerations—essential knowledge for searching, sorting, and validating text.
You now understand substring extraction comprehensively—from range conventions to copy/view semantics, from common patterns to performance optimization. This operation is central to text processing, and you're now equipped to use it correctly and efficiently.