Common Patterns - Learning Module

Loading content...

0/276

String Compression and Run-Length Encoding

The Art of Saying Less

Every time you send a photo, stream a video, or download a file, compression algorithms are silently at work—reducing data to its essential form, squeezing out redundancy, and making the impossible possible. A 4K video that would take hours to download uncompressed arrives in minutes. A database backup that would fill your disk fits comfortably.

String compression is the entry point to this world. At its simplest, it asks: can we represent this sequence of characters using fewer characters? The answer reveals deep insights about redundancy, patterns, and the fundamental limits of information representation.

In this page, we'll master Run-Length Encoding (RLE), the simplest and most intuitive compression scheme, then explore its generalizations and the broader compression landscape. You'll understand not just how to implement compression, but when it helps and why it sometimes backfires.

What You Will Learn

By the end of this page, you will implement multiple string compression algorithms (including the classic LeetCode 443 in-place variant), understand the mathematical conditions for compression gain, and recognize the connection between RLE and more sophisticated compression schemes like LZ77 and Huffman coding.

The Compression Problem

Definition (String Compression): Given a string s, produce a shorter representation c from which s can be perfectly reconstructed. The compression ratio is |c| / |s|, with values less than 1 indicating successful compression.

The Redundancy Principle:

Compression works by exploiting redundancy—repeated patterns, predictable sequences, or statistical regularities. A string like "AAAAAABBBBCC" is highly redundant; each 'A' after the first adds no new information. The string "a7#Qx!mZ" contains little redundancy and resists compression.

Lossless vs. Lossy:

Lossless compression preserves all information. Decompression yields the exact original string. This is essential for text, code, and data where every bit matters.
Lossy compression sacrifices some information for better compression ratios. This is acceptable for images, audio, and video where human perception can't detect minor losses.

In this page, we focus exclusively on lossless string compression.

Common String Compression Techniques

•Run-Length Encoding (RLE): Replace consecutive runs of the same character with the character and count. Simple but powerful for data with long runs.
•Dictionary Coding (LZ family): Build a dictionary of seen patterns and replace occurrences with dictionary references. Powers gzip, PNG, and ZIP.
•Statistical Coding (Huffman, Arithmetic): Assign shorter codes to more frequent symbols. Optimal for known probability distributions.
•Burrows-Wheeler Transform: Permute the string to cluster repeated characters, then apply RLE or move-to-front encoding. Powers bzip2.
•Grammar-Based Compression: Represent the string as a context-free grammar with production rules. Efficient for highly structured data.

Why Start with RLE?

Run-Length Encoding is the "Hello World" of compression algorithms. It's simple enough to implement correctly in an interview setting, yet profound enough to illustrate core compression concepts. Many interview problems (including LeetCode 443 "String Compression") are direct applications of RLE.

Moreover, RLE appears as a building block in more sophisticated schemes. The Burrows-Wheeler Transform, for example, rearranges strings specifically to create long runs that RLE can exploit.

Run-Length Encoding Fundamentals

Definition (Run-Length Encoding): RLE replaces each maximal run of consecutive identical characters with a pair: the character and the run length.

Example:

Original:   AAABBBCCCCCD
Encoded:    A3B3C5D1  (or A3B3C5D if single chars omit count)

Formal Specification:

Given string s = s[0]s[1]...s[n-1], partition s into maximal runs r_1, r_2, ..., r_k where each run r_i consists of one or more consecutive occurrences of the same character. The RLE encoding is:

RLE(s) = c_1 len(r_1) c_2 len(r_2) ... c_k len(r_k)

where c_i is the character in run r_i.

Variant: Omit Count of 1:

Some implementations omit the count when it equals 1:

Standard: "AB" → "A1B1" (length 4)
Optimized: "AB" → "AB" (length 2, no expansion)

The optimized variant prevents compression from making strings longer when runs are short.

run_length_encoding.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
def run_length_encode(s: str) -> str:
    """
    Encode a string using Run-Length Encoding.
    
    Format: Each run becomes "character + count"
    Example: "AAABBC" -> "A3B2C1"
    
    Time Complexity: O(n)
    Space Complexity: O(n) for the result string
    
    Args:
        s: The string to encode
    
    Returns:
        RLE-encoded string
    """
    if not s:
        return ""
    
    result = []
    count = 1
    
    for i in range(1, len(s)):
        if s[i] == s[i - 1]:
            count += 1
        else:
            result.append(s[i - 1])
            result.append(str(count))
            count = 1
    
    # Don't forget the last run
    result.append(s[-1])
    result.append(str(count))
    
    return ''.join(result)
 
 
def run_length_encode_optimized(s: str) -> str:
    """
    RLE with optimization: omit count of 1 to prevent expansion.
    
    Example: "ABC" -> "ABC" (not "A1B1C1")
             "AABC" -> "A2BC" (not "A2B1C1")
    
    This ensures the encoded string is never longer than the original.
    """
    if not s:
        return ""
    
    result = []
    count = 1
    
    for i in range(1, len(s)):
        if s[i] == s[i - 1]:
            count += 1
        else:
            result.append(s[i - 1])
            if count > 1:
                result.append(str(count))
            count = 1
    
    # Last run
    result.append(s[-1])
    if count > 1:
        result.append(str(count))
    
    return ''.join(result)
 
 
def run_length_decode(encoded: str) -> str:
    """
    Decode an RLE-encoded string back to original.
    
    Assumes format: single character followed by one or more digits.
    Example: "A10B2C1" -> "AAAAAAAAAABBC"
    
    Time Complexity: O(output_length)
    Space Complexity: O(output_length)
    """
    if not encoded:
        return ""
    
    result = []
    i = 0
    
    while i < len(encoded):
        char = encoded[i]
        i += 1
        
        # Extract the count (may be multiple digits)
        count_str = []
        while i < len(encoded) and encoded[i].isdigit():
            count_str.append(encoded[i])
            i += 1
        
        count = int(''.join(count_str)) if count_str else 1
        result.append(char * count)
    
    return ''.join(result)
 
 
# Demonstration
if __name__ == "__main__":
    test_cases = [
        "AAABBBCCCC",
        "ABCDEF",
        "AAAAAAAAAAABBBBBBB",  # Good compression
        "",
        "A",
        "AA",
    ]
    
    for s in test_cases:
        encoded = run_length_encode(s)
        decoded = run_length_decode(encoded)
        ratio = len(encoded) / len(s) if s else 0
        print(f"'{s}' -> '{encoded}' -> '{decoded}' (ratio: {ratio:.2f})")

Compression Can Expand

Note that RLE can make strings longer. The string "ABCDEF" (length 6) becomes "A1B1C1D1E1F1" (length 12) under naive RLE—a 2x expansion! This is why the optimized variant (omitting counts of 1) is important for practical use, and why understanding when compression helps is crucial.

In-Place String Compression (LeetCode 443)

The classic interview problem "String Compression" (LeetCode 443) adds a constraint: compress the string in-place with O(1) extra space. This variant tests your ability to carefully manage read and write pointers without overwriting unprocessed data.

Problem Statement:

Given an array chars, compress it in-place. The length after compression should be stored in the first k characters, and you should return k. Each group of consecutive repeating characters is replaced by the character followed by the group length (if length > 1).

Key Insight: Why In-Place Works

The compressed representation is never longer than a single character per run, and often shorter. Since we process runs from left to right, the write pointer is always at or behind the read pointer. This means we never overwrite unprocessed data.

Counting Multi-Digit Numbers:

A run of 100 'A's compresses to "A100"—four characters. We must handle counts that span multiple digits. The trick is to convert the count to characters by writing digits in reverse order, then reversing the digit sequence.

in_place_compression.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
def compress(chars: list[str]) -> int:
    """
    Compress the character array in-place (LeetCode 443).
    
    Rules:
    - Replace consecutive repeating characters with character + count
    - Count is omitted if it equals 1
    - Multi-digit counts are written digit by digit ("100" -> '1', '0', '0')
    - Modify chars in-place and return the new length
    
    Time Complexity: O(n)
    Space Complexity: O(1) - truly in-place!
    
    Args:
        chars: The character array to compress
    
    Returns:
        The length of the compressed array
    """
    if not chars:
        return 0
    
    n = len(chars)
    write = 0  # Position to write next compressed character
    read = 0   # Position to read/process
    
    while read < n:
        char = chars[read]
        count = 0
        
        # Count consecutive occurrences of current character
        while read < n and chars[read] == char:
            read += 1
            count += 1
        
        # Write the character
        chars[write] = char
        write += 1
        
        # Write the count if > 1
        if count > 1:
            # Convert count to string and write each digit
            count_str = str(count)
            for digit in count_str:
                chars[write] = digit
                write += 1
    
    return write
 
 
def compress_manual_digits(chars: list[str]) -> int:
    """
    Alternative implementation that computes digits without str() conversion.
    
    This approach writes digits in reverse order, then reverses them.
    Demonstrates a more "algorithmic" solution without string conversion.
    """
    if not chars:
        return 0
    
    n = len(chars)
    write = 0
    read = 0
    
    while read < n:
        char = chars[read]
        count = 0
        
        while read < n and chars[read] == char:
            read += 1
            count += 1
        
        # Write character
        chars[write] = char
        write += 1
        
        # Write count digits if > 1
        if count > 1:
            # Write digits in reverse order
            digit_start = write
            while count > 0:
                chars[write] = chr(ord('0') + count % 10)
                write += 1
                count //= 10
            
            # Reverse the digits to get correct order
            left, right = digit_start, write - 1
            while left < right:
                chars[left], chars[right] = chars[right], chars[left]
                left += 1
                right -= 1
    
    return write
 
 
# Test the compression
if __name__ == "__main__":
    test_cases = [
        ["a", "a", "b", "b", "c", "c", "c"],      # Expected: ["a","2","b","2","c","3"]
        ["a"],                                     # Expected: ["a"]
        ["a", "b", "b", "b", "b", "b", "b", "b",  # Expected: ["a","b","1","2"] (count 12)
         "b", "b", "b", "b", "b"],
        list("AAAAAAAAAAAABBBBBBBBBBBBCCCCCCCC"),   # Mixed
    ]
    
    for chars in test_cases:
        original = ''.join(chars)
        length = compress(chars)
        compressed = ''.join(chars[:length])
        print(f"'{original}' -> '{compressed}' (length {length})")

Why Write Never Overtakes Read

•For a run of length 1, we write 1 character (the char itself). Read advances 1. Write advances 1. Equal.
•For a run of length 2-9, we write 2 characters (char + digit). Read advances 2-9. Write advances 2. Write ≤ Read.
•For a run of length 10-99, we write 3 characters. Read advances 10-99. Write ≤ Read.
•In general, for run length k, we write 1 + ⌈log₁₀(k)⌉ ≤ k characters.

Key Implementation Details

•Two pointers: read scans ahead to count runs, write trails behind to write compressed output.
•We must handle counts > 9 (multi-digit numbers).
•The character array is modified in-place; no new array is created.
•Return value is the new logical length, not the array itself.

When Does RLE Actually Compress?

Understanding when RLE achieves compression versus expansion is critical for practical applications. The math is straightforward once you see it.

Break-Even Analysis:

For a run of length k, the encoded representation has length:

1 (the character) + ⌈log₁₀(k)⌉ (digits of k)

Compression occurs when this is less than k:

Run length 1: "A" → "A1" or "A" (no improvement, possible expansion)
Run length 2: "AA" → "A2" (2 → 2, no improvement)
Run length 3: "AAA" → "A3" (3 → 2, compression!)
Run length 10: "A×10" → "A10" (10 → 3, great compression)
Run length 100: "A×100" → "A100" (100 → 4, excellent compression)

The Rule of 3:

With the optimized variant (omit count of 1), RLE never expands. But it only compresses runs of length ≥ 3. For random data with short runs, RLE provides almost no compression.

RLE Compression Ratio by Run Length
Run Length	Original Size	RLE Size	Ratio	Verdict
1	1	1*	1.00	No change
2	2	2	1.00	No change
3	3	2	0.67	Compression
9	9	2	0.22	Good compression
10	10	3	0.30	Good compression
99	99	3	0.03	Excellent
100	100	4	0.04	Excellent
1000	1000	5	0.005	Extreme

Ideal Data for RLE:

RLE shines when data has long homogeneous runs:

Bitmap images: Large areas of solid color (icons, simple graphics)
Sparse data: Arrays with many zeros or repeated values
Fax transmissions: Black-and-white documents with long runs of white space
Run-length encoded pixels: Early video game graphics

Poor Data for RLE:

RLE fails (provides no compression) when:

Characters alternate frequently: "ABABAB" → "A1B1A1B1A1B1" (expansion with naive RLE)
Random data with uniform distribution
Text in natural language (few character repetitions)
Already-compressed data

Key Insight:

The average run length of a string determines RLE effectiveness. If average run length > 3, RLE helps. If < 2, RLE may expand (without the omit-1 optimization) or provide no benefit.

Estimating Compression Potential

Before applying RLE, compute the average run length. For a string of length n with r runs, average run length = n/r. If this is significantly greater than 3, RLE will provide good compression. This quick estimate helps you decide whether RLE is worth applying.

Variations and Extensions

RLE has numerous variants tailored to specific use cases. Understanding these variations deepens your compression intuition.

1. PackBits (Run-Length + Literal Encoding)

Used in TIFF images and early Macintosh graphics. PackBits distinguishes between "runs" (repeated bytes) and "literal" sequences (non-repeated bytes). This prevents expansion on non-repetitive data.

Run:     Length (negative), Byte to repeat
Literal: Length (positive), Actual bytes

2. Byte-Pair Encoding (BPE)

A generalization where we don't just compress runs, but any frequently occurring pair of bytes. Repeatedly replace the most common byte pair with a new symbol until no compression is possible. Used in modern NLP tokenizers (GPT, BERT).

3. Burrows-Wheeler + RLE

The Burrows-Wheeler Transform (BWT) rearranges a string to cluster repeated characters together. Applying RLE to the transformed string often achieves much better compression than RLE alone. This combination powers bzip2.

4. Zero-Run Length Encoding

A special case for sparse data where we only encode runs of zeros. Common in image compression (JPEG) and sparse matrix storage.

rle_variations.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
def count_and_say(n: int) -> str:
    """
    The "Count and Say" sequence (LeetCode 38).
    
    This is a classic RLE-adjacent problem:
    1 → "1"
    2 → "11" (one 1) 
    3 → "21" (two 1s)
    4 → "1211" (one 2, one 1)
    5 → "111221" (one 1, one 2, two 1s)
    
    Each term is the RLE description of the previous term.
    
    Time Complexity: O(n * L) where L is the length of the nth term
    Space Complexity: O(L)
    """
    if n == 1:
        return "1"
    
    prev = "1"
    
    for _ in range(2, n + 1):
        result = []
        i = 0
        
        while i < len(prev):
            char = prev[i]
            count = 0
            
            while i < len(prev) and prev[i] == char:
                count += 1
                i += 1
            
            result.append(str(count))
            result.append(char)
        
        prev = ''.join(result)
    
    return prev
 
 
def decode_string(s: str) -> str:
    """
    Decode nested RLE-like encoding (LeetCode 394).
    
    Input: "3[a2[c]]" → "accaccacc"
    
    This is RLE with nesting and explicit brackets.
    Uses a stack to handle nested encodings.
    
    Time Complexity: O(output_length)
    Space Complexity: O(nesting_depth)
    """
    stack = []
    current_num = 0
    current_str = ""
    
    for char in s:
        if char.isdigit():
            current_num = current_num * 10 + int(char)
        elif char == '[':
            # Push current state onto stack and reset
            stack.append((current_str, current_num))
            current_str = ""
            current_num = 0
        elif char == ']':
            # Pop from stack and build the repeated string
            prev_str, num = stack.pop()
            current_str = prev_str + current_str * num
        else:
            current_str += char
    
    return current_str
 
 
def encode_rle_with_escape(s: str) -> str:
    """
    RLE that can handle strings containing digits.
    
    Uses an escape character (backslash) to distinguish
    literal digits from count digits.
    
    Example: "AAA123" → "A3\1\2\3" or similar scheme
    
    This makes RLE applicable to arbitrary binary/text data.
    """
    if not s:
        return ""
    
    result = []
    count = 1
    
    for i in range(1, len(s)):
        if s[i] == s[i - 1] and count < 99:  # Cap count for simplicity
            count += 1
        else:
            # Encode character (escape if digit or backslash)
            char = s[i - 1]
            if char.isdigit() or char == '\\':
                result.append('\\')
            result.append(char)
            if count > 1:
                result.append(str(count))
            count = 1
    
    # Last character
    char = s[-1]
    if char.isdigit() or char == '\\':
        result.append('\\')
    result.append(char)
    if count > 1:
        result.append(str(count))
    
    return ''.join(result)
 
 
# Demonstrations
if __name__ == "__main__":
    # Count and Say
    print("Count and Say sequence:")
    for i in range(1, 8):
        print(f"  {i}: {count_and_say(i)}")
    
    # Decode String
    print("\nDecode String:")
    test_cases = ["3[a]2[bc]", "3[a2[c]]", "2[abc]3[cd]ef"]
    for s in test_cases:
        print(f"  '{s}' -> '{decode_string(s)}'")

Connection to LeetCode Problems:

Several popular interview problems are essentially RLE variations:

Problem	Core Idea
38. Count and Say	Generate RLE of previous term
394. Decode String	Stack-based nested RLE decoding
443. String Compression	In-place RLE encoding
271. Encode and Decode Strings	Length-prefix encoding (RLE-adjacent)
900. RLE Iterator	Iterator over RLE-encoded sequence

Beyond RLE: The Compression Landscape

RLE is just the beginning. Understanding where it fits in the compression landscape helps you appreciate its role and limitations.

The Lempel-Ziv Family (LZ77, LZ78, LZW):

Instead of encoding runs of a single character, LZ algorithms find repeated substrings and replace later occurrences with references to earlier ones.

Original: ABABABAB
LZ77:     AB (len=2, dist=0) (len=6, back=2)
          "AB" literally, then repeat 6 chars from 2 positions back

This is far more powerful than RLE for text, which has many repeated words but few repeated characters. LZ77 underlies gzip, zlib, PNG, and ZIP.

Huffman Coding:

Assign variable-length codes to symbols based on frequency. Common symbols get short codes; rare symbols get long codes. This is optimal for known frequency distributions. Many compression schemes use LZ for pattern matching, then Huffman for encoding the output.

Arithmetic Coding:

Represent the entire message as a single number in [0, 1). More efficient than Huffman for very skewed distributions. Used in modern codecs (H.264, JPEG 2000).

Compression Techniques Comparison
Technique	Best For	Compression Ratio	Speed	Complexity
RLE	Data with long runs	High if runs exist	Very fast	O(n)
LZ77/DEFLATE	General text/data	Good	Medium	O(n log n)
Huffman	Known frequencies	Optimal for model	Fast	O(n log σ)
BWT + MTF + RLE	Text with structure	Very good	Slow	O(n log n)
Arithmetic	Highly skewed data	Near-optimal	Slow	Complex

Entropy: The Theoretical Limit

Shannon's entropy gives the theoretical minimum bits per symbol for lossless compression:

H(X) = -Σ p(x) log₂ p(x)

For example, if a string has 50% 'A' and 50% 'B', entropy is 1 bit per character. No lossless compression can do better than 1 bit per character on average.

RLE can beat entropy only when we consider the run structure, not just symbol frequencies. A string of 1000 'A's has zero entropy in the symbol sense, but RLE exploits the run structure for excellent compression.

Key Takeaway:

RLE is a preprocessing step in many compression pipelines. By itself, it's limited to run-heavy data. Combined with transforms (BWT) or dictionary methods (LZ), it becomes part of a powerful compression system.

Summary and Key Insights

String compression is a vast field, but RLE provides an accessible entry point that teaches fundamental concepts applicable to all compression techniques.

Key Takeaways

•Compression exploits redundancy — RLE specifically exploits repeated consecutive characters (runs).
•RLE achieves O(n) time and works in-place — The classic interview problem tests careful pointer management without extra space.
•Compression can expand — Without the "omit count of 1" optimization, RLE can double string length on worst-case input.
•Break-even at run length 3 — Runs must be at least 3 characters long for RLE to provide compression benefit.
•RLE is a building block — It combines with BWT, MTF, and other transforms in sophisticated compression pipelines.
•Know the interview variants — Count and Say (38), String Compression (443), and Decode String (394) are all RLE-related classics.

Page Complete

You now understand string compression fundamentals through the lens of Run-Length Encoding. You can implement RLE, analyze when it helps, and recognize its place in the broader compression landscape. Next, we'll tackle the Minimum Window Substring problem—a different but equally important string pattern.