Loading content...
Every time you send a photo, stream a video, or download a file, compression algorithms are silently at work—reducing data to its essential form, squeezing out redundancy, and making the impossible possible. A 4K video that would take hours to download uncompressed arrives in minutes. A database backup that would fill your disk fits comfortably.
String compression is the entry point to this world. At its simplest, it asks: can we represent this sequence of characters using fewer characters? The answer reveals deep insights about redundancy, patterns, and the fundamental limits of information representation.
In this page, we'll master Run-Length Encoding (RLE), the simplest and most intuitive compression scheme, then explore its generalizations and the broader compression landscape. You'll understand not just how to implement compression, but when it helps and why it sometimes backfires.
By the end of this page, you will implement multiple string compression algorithms (including the classic LeetCode 443 in-place variant), understand the mathematical conditions for compression gain, and recognize the connection between RLE and more sophisticated compression schemes like LZ77 and Huffman coding.
Definition (String Compression): Given a string s, produce a shorter representation c from which s can be perfectly reconstructed. The compression ratio is |c| / |s|, with values less than 1 indicating successful compression.
The Redundancy Principle:
Compression works by exploiting redundancy—repeated patterns, predictable sequences, or statistical regularities. A string like "AAAAAABBBBCC" is highly redundant; each 'A' after the first adds no new information. The string "a7#Qx!mZ" contains little redundancy and resists compression.
Lossless vs. Lossy:
Lossless compression preserves all information. Decompression yields the exact original string. This is essential for text, code, and data where every bit matters.
Lossy compression sacrifices some information for better compression ratios. This is acceptable for images, audio, and video where human perception can't detect minor losses.
In this page, we focus exclusively on lossless string compression.
Why Start with RLE?
Run-Length Encoding is the "Hello World" of compression algorithms. It's simple enough to implement correctly in an interview setting, yet profound enough to illustrate core compression concepts. Many interview problems (including LeetCode 443 "String Compression") are direct applications of RLE.
Moreover, RLE appears as a building block in more sophisticated schemes. The Burrows-Wheeler Transform, for example, rearranges strings specifically to create long runs that RLE can exploit.
Definition (Run-Length Encoding): RLE replaces each maximal run of consecutive identical characters with a pair: the character and the run length.
Example:
Original: AAABBBCCCCCD
Encoded: A3B3C5D1 (or A3B3C5D if single chars omit count)
Formal Specification:
Given string s = s[0]s[1]...s[n-1], partition s into maximal runs r_1, r_2, ..., r_k where each run r_i consists of one or more consecutive occurrences of the same character. The RLE encoding is:
RLE(s) = c_1 len(r_1) c_2 len(r_2) ... c_k len(r_k)
where c_i is the character in run r_i.
Variant: Omit Count of 1:
Some implementations omit the count when it equals 1:
The optimized variant prevents compression from making strings longer when runs are short.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117
def run_length_encode(s: str) -> str: """ Encode a string using Run-Length Encoding. Format: Each run becomes "character + count" Example: "AAABBC" -> "A3B2C1" Time Complexity: O(n) Space Complexity: O(n) for the result string Args: s: The string to encode Returns: RLE-encoded string """ if not s: return "" result = [] count = 1 for i in range(1, len(s)): if s[i] == s[i - 1]: count += 1 else: result.append(s[i - 1]) result.append(str(count)) count = 1 # Don't forget the last run result.append(s[-1]) result.append(str(count)) return ''.join(result) def run_length_encode_optimized(s: str) -> str: """ RLE with optimization: omit count of 1 to prevent expansion. Example: "ABC" -> "ABC" (not "A1B1C1") "AABC" -> "A2BC" (not "A2B1C1") This ensures the encoded string is never longer than the original. """ if not s: return "" result = [] count = 1 for i in range(1, len(s)): if s[i] == s[i - 1]: count += 1 else: result.append(s[i - 1]) if count > 1: result.append(str(count)) count = 1 # Last run result.append(s[-1]) if count > 1: result.append(str(count)) return ''.join(result) def run_length_decode(encoded: str) -> str: """ Decode an RLE-encoded string back to original. Assumes format: single character followed by one or more digits. Example: "A10B2C1" -> "AAAAAAAAAABBC" Time Complexity: O(output_length) Space Complexity: O(output_length) """ if not encoded: return "" result = [] i = 0 while i < len(encoded): char = encoded[i] i += 1 # Extract the count (may be multiple digits) count_str = [] while i < len(encoded) and encoded[i].isdigit(): count_str.append(encoded[i]) i += 1 count = int(''.join(count_str)) if count_str else 1 result.append(char * count) return ''.join(result) # Demonstrationif __name__ == "__main__": test_cases = [ "AAABBBCCCC", "ABCDEF", "AAAAAAAAAAABBBBBBB", # Good compression "", "A", "AA", ] for s in test_cases: encoded = run_length_encode(s) decoded = run_length_decode(encoded) ratio = len(encoded) / len(s) if s else 0 print(f"'{s}' -> '{encoded}' -> '{decoded}' (ratio: {ratio:.2f})")Note that RLE can make strings longer. The string "ABCDEF" (length 6) becomes "A1B1C1D1E1F1" (length 12) under naive RLE—a 2x expansion! This is why the optimized variant (omitting counts of 1) is important for practical use, and why understanding when compression helps is crucial.
The classic interview problem "String Compression" (LeetCode 443) adds a constraint: compress the string in-place with O(1) extra space. This variant tests your ability to carefully manage read and write pointers without overwriting unprocessed data.
Problem Statement:
Given an array chars, compress it in-place. The length after compression should be stored in the first k characters, and you should return k. Each group of consecutive repeating characters is replaced by the character followed by the group length (if length > 1).
Key Insight: Why In-Place Works
The compressed representation is never longer than a single character per run, and often shorter. Since we process runs from left to right, the write pointer is always at or behind the read pointer. This means we never overwrite unprocessed data.
Counting Multi-Digit Numbers:
A run of 100 'A's compresses to "A100"—four characters. We must handle counts that span multiple digits. The trick is to convert the count to characters by writing digits in reverse order, then reversing the digit sequence.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110
def compress(chars: list[str]) -> int: """ Compress the character array in-place (LeetCode 443). Rules: - Replace consecutive repeating characters with character + count - Count is omitted if it equals 1 - Multi-digit counts are written digit by digit ("100" -> '1', '0', '0') - Modify chars in-place and return the new length Time Complexity: O(n) Space Complexity: O(1) - truly in-place! Args: chars: The character array to compress Returns: The length of the compressed array """ if not chars: return 0 n = len(chars) write = 0 # Position to write next compressed character read = 0 # Position to read/process while read < n: char = chars[read] count = 0 # Count consecutive occurrences of current character while read < n and chars[read] == char: read += 1 count += 1 # Write the character chars[write] = char write += 1 # Write the count if > 1 if count > 1: # Convert count to string and write each digit count_str = str(count) for digit in count_str: chars[write] = digit write += 1 return write def compress_manual_digits(chars: list[str]) -> int: """ Alternative implementation that computes digits without str() conversion. This approach writes digits in reverse order, then reverses them. Demonstrates a more "algorithmic" solution without string conversion. """ if not chars: return 0 n = len(chars) write = 0 read = 0 while read < n: char = chars[read] count = 0 while read < n and chars[read] == char: read += 1 count += 1 # Write character chars[write] = char write += 1 # Write count digits if > 1 if count > 1: # Write digits in reverse order digit_start = write while count > 0: chars[write] = chr(ord('0') + count % 10) write += 1 count //= 10 # Reverse the digits to get correct order left, right = digit_start, write - 1 while left < right: chars[left], chars[right] = chars[right], chars[left] left += 1 right -= 1 return write # Test the compressionif __name__ == "__main__": test_cases = [ ["a", "a", "b", "b", "c", "c", "c"], # Expected: ["a","2","b","2","c","3"] ["a"], # Expected: ["a"] ["a", "b", "b", "b", "b", "b", "b", "b", # Expected: ["a","b","1","2"] (count 12) "b", "b", "b", "b", "b"], list("AAAAAAAAAAAABBBBBBBBBBBBCCCCCCCC"), # Mixed ] for chars in test_cases: original = ''.join(chars) length = compress(chars) compressed = ''.join(chars[:length]) print(f"'{original}' -> '{compressed}' (length {length})")read scans ahead to count runs, write trails behind to write compressed output.Understanding when RLE achieves compression versus expansion is critical for practical applications. The math is straightforward once you see it.
Break-Even Analysis:
For a run of length k, the encoded representation has length:
Compression occurs when this is less than k:
The Rule of 3:
With the optimized variant (omit count of 1), RLE never expands. But it only compresses runs of length ≥ 3. For random data with short runs, RLE provides almost no compression.
| Run Length | Original Size | RLE Size | Ratio | Verdict |
|---|---|---|---|---|
| 1 | 1 | 1* | 1.00 | No change |
| 2 | 2 | 2 | 1.00 | No change |
| 3 | 3 | 2 | 0.67 | Compression |
| 9 | 9 | 2 | 0.22 | Good compression |
| 10 | 10 | 3 | 0.30 | Good compression |
| 99 | 99 | 3 | 0.03 | Excellent |
| 100 | 100 | 4 | 0.04 | Excellent |
| 1000 | 1000 | 5 | 0.005 | Extreme |
Ideal Data for RLE:
RLE shines when data has long homogeneous runs:
Poor Data for RLE:
RLE fails (provides no compression) when:
Key Insight:
The average run length of a string determines RLE effectiveness. If average run length > 3, RLE helps. If < 2, RLE may expand (without the omit-1 optimization) or provide no benefit.
Before applying RLE, compute the average run length. For a string of length n with r runs, average run length = n/r. If this is significantly greater than 3, RLE will provide good compression. This quick estimate helps you decide whether RLE is worth applying.
RLE has numerous variants tailored to specific use cases. Understanding these variations deepens your compression intuition.
1. PackBits (Run-Length + Literal Encoding)
Used in TIFF images and early Macintosh graphics. PackBits distinguishes between "runs" (repeated bytes) and "literal" sequences (non-repeated bytes). This prevents expansion on non-repetitive data.
Run: Length (negative), Byte to repeat
Literal: Length (positive), Actual bytes
2. Byte-Pair Encoding (BPE)
A generalization where we don't just compress runs, but any frequently occurring pair of bytes. Repeatedly replace the most common byte pair with a new symbol until no compression is possible. Used in modern NLP tokenizers (GPT, BERT).
3. Burrows-Wheeler + RLE
The Burrows-Wheeler Transform (BWT) rearranges a string to cluster repeated characters together. Applying RLE to the transformed string often achieves much better compression than RLE alone. This combination powers bzip2.
4. Zero-Run Length Encoding
A special case for sparse data where we only encode runs of zeros. Common in image compression (JPEG) and sparse matrix storage.
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128
def count_and_say(n: int) -> str: """ The "Count and Say" sequence (LeetCode 38). This is a classic RLE-adjacent problem: 1 → "1" 2 → "11" (one 1) 3 → "21" (two 1s) 4 → "1211" (one 2, one 1) 5 → "111221" (one 1, one 2, two 1s) Each term is the RLE description of the previous term. Time Complexity: O(n * L) where L is the length of the nth term Space Complexity: O(L) """ if n == 1: return "1" prev = "1" for _ in range(2, n + 1): result = [] i = 0 while i < len(prev): char = prev[i] count = 0 while i < len(prev) and prev[i] == char: count += 1 i += 1 result.append(str(count)) result.append(char) prev = ''.join(result) return prev def decode_string(s: str) -> str: """ Decode nested RLE-like encoding (LeetCode 394). Input: "3[a2[c]]" → "accaccacc" This is RLE with nesting and explicit brackets. Uses a stack to handle nested encodings. Time Complexity: O(output_length) Space Complexity: O(nesting_depth) """ stack = [] current_num = 0 current_str = "" for char in s: if char.isdigit(): current_num = current_num * 10 + int(char) elif char == '[': # Push current state onto stack and reset stack.append((current_str, current_num)) current_str = "" current_num = 0 elif char == ']': # Pop from stack and build the repeated string prev_str, num = stack.pop() current_str = prev_str + current_str * num else: current_str += char return current_str def encode_rle_with_escape(s: str) -> str: """ RLE that can handle strings containing digits. Uses an escape character (backslash) to distinguish literal digits from count digits. Example: "AAA123" → "A3\1\2\3" or similar scheme This makes RLE applicable to arbitrary binary/text data. """ if not s: return "" result = [] count = 1 for i in range(1, len(s)): if s[i] == s[i - 1] and count < 99: # Cap count for simplicity count += 1 else: # Encode character (escape if digit or backslash) char = s[i - 1] if char.isdigit() or char == '\\': result.append('\\') result.append(char) if count > 1: result.append(str(count)) count = 1 # Last character char = s[-1] if char.isdigit() or char == '\\': result.append('\\') result.append(char) if count > 1: result.append(str(count)) return ''.join(result) # Demonstrationsif __name__ == "__main__": # Count and Say print("Count and Say sequence:") for i in range(1, 8): print(f" {i}: {count_and_say(i)}") # Decode String print("\nDecode String:") test_cases = ["3[a]2[bc]", "3[a2[c]]", "2[abc]3[cd]ef"] for s in test_cases: print(f" '{s}' -> '{decode_string(s)}'")Connection to LeetCode Problems:
Several popular interview problems are essentially RLE variations:
| Problem | Core Idea |
|---|---|
| 38. Count and Say | Generate RLE of previous term |
| 394. Decode String | Stack-based nested RLE decoding |
| 443. String Compression | In-place RLE encoding |
| 271. Encode and Decode Strings | Length-prefix encoding (RLE-adjacent) |
| 900. RLE Iterator | Iterator over RLE-encoded sequence |
RLE is just the beginning. Understanding where it fits in the compression landscape helps you appreciate its role and limitations.
The Lempel-Ziv Family (LZ77, LZ78, LZW):
Instead of encoding runs of a single character, LZ algorithms find repeated substrings and replace later occurrences with references to earlier ones.
Original: ABABABAB
LZ77: AB (len=2, dist=0) (len=6, back=2)
"AB" literally, then repeat 6 chars from 2 positions back
This is far more powerful than RLE for text, which has many repeated words but few repeated characters. LZ77 underlies gzip, zlib, PNG, and ZIP.
Huffman Coding:
Assign variable-length codes to symbols based on frequency. Common symbols get short codes; rare symbols get long codes. This is optimal for known frequency distributions. Many compression schemes use LZ for pattern matching, then Huffman for encoding the output.
Arithmetic Coding:
Represent the entire message as a single number in [0, 1). More efficient than Huffman for very skewed distributions. Used in modern codecs (H.264, JPEG 2000).
| Technique | Best For | Compression Ratio | Speed | Complexity |
|---|---|---|---|---|
| RLE | Data with long runs | High if runs exist | Very fast | O(n) |
| LZ77/DEFLATE | General text/data | Good | Medium | O(n log n) |
| Huffman | Known frequencies | Optimal for model | Fast | O(n log σ) |
| BWT + MTF + RLE | Text with structure | Very good | Slow | O(n log n) |
| Arithmetic | Highly skewed data | Near-optimal | Slow | Complex |
Entropy: The Theoretical Limit
Shannon's entropy gives the theoretical minimum bits per symbol for lossless compression:
H(X) = -Σ p(x) log₂ p(x)
For example, if a string has 50% 'A' and 50% 'B', entropy is 1 bit per character. No lossless compression can do better than 1 bit per character on average.
RLE can beat entropy only when we consider the run structure, not just symbol frequencies. A string of 1000 'A's has zero entropy in the symbol sense, but RLE exploits the run structure for excellent compression.
Key Takeaway:
RLE is a preprocessing step in many compression pipelines. By itself, it's limited to run-heavy data. Combined with transforms (BWT) or dictionary methods (LZ), it becomes part of a powerful compression system.
String compression is a vast field, but RLE provides an accessible entry point that teaches fundamental concepts applicable to all compression techniques.
You now understand string compression fundamentals through the lens of Run-Length Encoding. You can implement RLE, analyze when it helps, and recognize its place in the broader compression landscape. Next, we'll tackle the Minimum Window Substring problem—a different but equally important string pattern.