Population count—commonly abbreviated as popcount, and also known as bit count, sideways sum, or Hamming weight—answers a deceptively simple question: how many bits are set to 1 in a binary number?
This fundamental operation appears throughout computer science: error detection codes, cryptographic algorithms, database bitmap indexes, genomic analysis, chess engines, and countless optimization scenarios. Modern CPUs even include dedicated hardware instructions for popcount because it's so universally useful.
Yet despite its simplicity, popcount admits surprisingly sophisticated solutions. From naive bit-by-bit counting to parallel SIMD techniques, the evolution of popcount algorithms illustrates how understanding binary arithmetic enables dramatic performance improvements.
By the end of this page, you will understand multiple approaches to counting set bits—from the intuitive naive method to sophisticated parallel algorithms. You'll learn when each approach is appropriate and how modern hardware optimizes this operation.
Formal Definition:
Given an unsigned integer n, compute the number of bits set to 1 in its binary representation. This count is called the population count or Hamming weight.
Examples:
| Number | Binary | Popcount |
|---|---|---|
| 0 | 00000000 | 0 |
| 1 | 00000001 | 1 |
| 7 | 00000111 | 3 |
| 12 | 00001100 | 2 |
| 15 | 00001111 | 4 |
| 255 | 11111111 | 8 |
| 1023 | 1111111111 | 10 |
Why "Population Count"?
The name comes from thinking of each bit position as a "citizen" that is either present (1) or absent (0). The population count is literally counting how many citizens are present.
Why "Hamming Weight"?
Named after Richard Hamming, this interpretation views the count as the "weight" or "distance" of the number from zero in terms of differing bit positions. The Hamming distance between two numbers equals the popcount of their XOR.
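This relationship is easy to verify in code. A small sketch (the helper name `hamming_distance` is ours):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where a and b differ."""
    # XOR sets exactly the bits where a and b disagree,
    # so its popcount is the Hamming distance.
    return bin(a ^ b).count('1')

# 0b1011 and 0b1110 differ in bits 0 and 2
print(hamming_distance(0b1011, 0b1110))  # 2
```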
The most intuitive approach examines each bit position individually and counts the 1s. This is the natural first solution anyone would devise.
Algorithm:
Two common implementations:
```python
def popcount_naive_shift(n: int) -> int:
    """
    Count set bits by shifting right and checking the LSB.

    Time Complexity: O(k) where k is the bit width (e.g., 32 or 64)
    Space Complexity: O(1)

    Examines every bit position up to the highest set bit.
    """
    count = 0
    while n:
        count += n & 1  # Check if LSB is set
        n >>= 1         # Shift right to process next bit
    return count


def popcount_naive_mask(n: int, bits: int = 32) -> int:
    """
    Count set bits using a moving mask.

    Always iterates exactly 'bits' times, even if n is small.
    Useful when you need constant-time behavior regardless of value.
    """
    count = 0
    mask = 1
    for _ in range(bits):
        if n & mask:
            count += 1
        mask <<= 1
    return count


# Trace for n = 13 (binary: 1101):
# Iteration 1: n=1101, n&1=1, count=1, n becomes 0110
# Iteration 2: n=0110, n&1=0, count=1, n becomes 0011
# Iteration 3: n=0011, n&1=1, count=2, n becomes 0001
# Iteration 4: n=0001, n&1=1, count=3, n becomes 0000
# Loop ends, return 3 ✓

# Test examples
print(popcount_naive_shift(0))    # 0
print(popcount_naive_shift(1))    # 1
print(popcount_naive_shift(7))    # 3 (binary: 111)
print(popcount_naive_shift(13))   # 3 (binary: 1101)
print(popcount_naive_shift(255))  # 8 (binary: 11111111)
```

In languages with signed integers, be careful with right shifts. Arithmetic right shift (`>>`) preserves the sign bit, potentially creating an infinite loop. Use unsigned right shift (`>>>` in Java/JavaScript) or work with unsigned types. Python's arbitrary-precision integers avoid this issue but require explicit handling.
Analysis:
| Aspect | Evaluation |
|---|---|
| Time Complexity | O(k) where k = bit width (typically 32 or 64) |
| Space Complexity | O(1) |
| Simplicity | Excellent—immediately understandable |
| Performance | Poor—always examines all bit positions |
The naive approach has a significant inefficiency: it examines bit positions one at a time regardless of how many are set. For a 64-bit word with only 2 bits set, the mask version still performs 64 iterations, and the shift version must walk all the way up to the highest set bit.
This motivates our search for algorithms whose runtime depends on the number of set bits rather than the word size.
A classic optimization strategy in computing is precomputation: do expensive work once, store results, and look them up later. For popcount, we can precompute the bit count for all possible byte values (0–255) and combine lookups.
The Insight:
Build a table where `table[i]` = popcount of byte value `i`, then decompose the integer into bytes and sum the lookups.
Advantage: Each lookup is O(1), and we do at most 4-8 lookups per integer.
Trade-off: Uses 256 bytes of memory for the table.
```python
# Precompute popcount for all byte values
# Table is built once at module load time
POPCOUNT_TABLE_8BIT = [bin(i).count('1') for i in range(256)]


# Alternative: Build the table using the naive method
def build_popcount_table():
    """Explicitly build the lookup table."""
    table = [0] * 256
    for i in range(256):
        # Count bits in this byte value
        n = i
        count = 0
        while n:
            count += n & 1
            n >>= 1
        table[i] = count
    return table


def popcount_lookup_table(n: int) -> int:
    """
    Count set bits using a byte-level lookup table.

    Time Complexity: O(k/8) = O(1) for fixed-width integers
    Space Complexity: O(256) = O(1) for the table

    For a 32-bit integer: at most 4 table lookups
    For a 64-bit integer: at most 8 table lookups
    """
    count = 0
    # Process one byte at a time
    while n:
        count += POPCOUNT_TABLE_8BIT[n & 0xFF]  # Look up lowest byte
        n >>= 8                                 # Shift to next byte
    return count


def popcount_lookup_32bit(n: int) -> int:
    """
    Optimized version for 32-bit integers—unrolled loop.

    Always exactly 4 lookups, potentially faster due to no branching.
    """
    return (POPCOUNT_TABLE_8BIT[n & 0xFF] +
            POPCOUNT_TABLE_8BIT[(n >> 8) & 0xFF] +
            POPCOUNT_TABLE_8BIT[(n >> 16) & 0xFF] +
            POPCOUNT_TABLE_8BIT[(n >> 24) & 0xFF])


# Example: popcount of 0x12345678
# Byte 0: 0x78 = 01111000 → 4 bits
# Byte 1: 0x56 = 01010110 → 4 bits
# Byte 2: 0x34 = 00110100 → 3 bits
# Byte 3: 0x12 = 00010010 → 2 bits
# Total: 4 + 4 + 3 + 2 = 13 bits

print(popcount_lookup_table(0x12345678))  # 13
```

You can use larger tables (16-bit: 65536 entries) for fewer lookups but more memory, or smaller tables (4-bit: 16 entries) for less memory but more lookups. The 8-bit table is the sweet spot for most architectures—it fits in L1 cache while keeping the lookup count low.
Performance Considerations:
When to Use Lookup Tables:
| Scenario | Recommendation |
|---|---|
| Single popcount, cold cache | Naive might be faster |
| Many popcounts in tight loop | Lookup table excels |
| Memory-constrained embedded | Avoid the table |
| Modern CPU with POPCNT | Use hardware instruction |
The most elegant algorithmic approach to popcount is parallel bit counting, also known as SWAR (SIMD Within A Register). This technique uses clever arithmetic to count bits in parallel within a single register—no loops, no tables.
The Core Idea:
Instead of counting bits one at a time, we count them in progressively larger groups: first 2-bit pairs, then 4-bit nibbles, then bytes, and so on up to the full word.
This is a classic divide-and-conquer approach, running in O(log₂(word_size)) steps.
Visualizing the Algorithm:
For a simple 8-bit example with n = 0b11010111:
Original: 1 1 0 1 0 1 1 1 (8 bits)
Step 1 - pairs: (10) (01) (01) (10) → counts: 2, 1, 1, 2
Step 2 - quads: (0011) (0011) → counts: 3, 3
Step 3 - byte: (00000110) → count: 6
The number 0b11010111 has 6 bits set. We computed this in 3 parallel steps!
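The three steps above can be sketched directly for 8-bit values before scaling up to 32 bits. A minimal sketch (the helper name `popcount_swar_8bit` is ours):

```python
def popcount_swar_8bit(n: int) -> int:
    """Three-step SWAR popcount for an 8-bit value."""
    n &= 0xFF
    n = (n & 0x55) + ((n >> 1) & 0x55)  # Step 1: counts per 2-bit pair
    n = (n & 0x33) + ((n >> 2) & 0x33)  # Step 2: counts per 4-bit nibble
    n = (n & 0x0F) + ((n >> 4) & 0x0F)  # Step 3: total count in the byte
    return n

print(popcount_swar_8bit(0b11010111))  # 6
```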
```python
def popcount_swar_32bit(n: int) -> int:
    """
    Parallel bit counting using SWAR (SIMD Within A Register).

    A classic parallel bit-counting technique, refined by various authors.

    Time Complexity: O(log k) = O(1) for fixed-width integers
    Space Complexity: O(1)

    For 32-bit: exactly 5 mask-shift-add steps after the initial setup.
    """
    # Ensure we're working with 32-bit unsigned semantics
    n = n & 0xFFFFFFFF

    # Step 1: Count bits in each pair (2-bit groups)
    # For each pair AB, compute A+B (which is 0, 1, or 2)
    # Mask 0x55555555 = 01010101... selects bits at positions 0, 2, 4, ...
    n = (n & 0x55555555) + ((n >> 1) & 0x55555555)

    # Step 2: Count bits in each nibble (4-bit groups)
    # For each nibble, add the two 2-bit counts from step 1
    # Mask 0x33333333 = 00110011... selects alternating pairs
    n = (n & 0x33333333) + ((n >> 2) & 0x33333333)

    # Step 3: Count bits in each byte (8-bit groups)
    # For each byte, add the two 4-bit counts
    # Each count fits in 4 bits (max value 8)
    n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F)

    # Step 4: Count bits in each 16-bit half
    n = (n & 0x00FF00FF) + ((n >> 8) & 0x00FF00FF)

    # Step 5: Final sum of the two 16-bit halves
    n = (n & 0x0000FFFF) + ((n >> 16) & 0x0000FFFF)

    return n


def popcount_swar_optimized(n: int) -> int:
    """
    Optimized SWAR popcount using multiplication for the final sum.

    After the byte step, each byte contains its bit count (0-8).
    Multiplying by 0x01010101 sums all bytes into the high byte.
    """
    n = n & 0xFFFFFFFF
    # Steps 1-3 same as before
    n = n - ((n >> 1) & 0x55555555)  # Slight optimization for step 1
    n = (n & 0x33333333) + ((n >> 2) & 0x33333333)
    n = (n + (n >> 4)) & 0x0F0F0F0F
    # Multiplication trick: sum all four byte counts into the high byte.
    # Python ints don't wrap at 32 bits, so mask before shifting.
    return ((n * 0x01010101) & 0xFFFFFFFF) >> 24


# Trace example: n = 0b11111111000000001111111100000000 (popcount = 16)
# After step 1: each pair contains its 2-bit count
# After step 2: each nibble contains its 4-bit count
# After step 3: each byte contains its 8-bit count
# Multiply and shift: sum of all byte counts

print(popcount_swar_32bit(0b11010111))      # 6
print(popcount_swar_optimized(0xFF00FF00))  # 16
```

The hexadecimal masks aren't magic—they're carefully constructed bit patterns:

- `0x55555555` = 01010101... (selects every other bit)
- `0x33333333` = 00110011... (selects every other pair)
- `0x0F0F0F0F` = 00001111... (selects every other nibble)
- `0x01010101` = 00000001... (a 1 in each byte position, for the multiplication)
Why This Works:
Step 1 Insight: For a 2-bit value AB, the count is A + B. We can compute this by masking off the low bit (B), shifting and masking to isolate the high bit (A), and adding the two.
But we're doing this in parallel for 16 pairs simultaneously! The mask ensures we don't mix values between pairs.
The Multiplication Trick:
After step 3, each byte contains the count of bits originally in that byte (0-8). Multiplying by 0x01010101 effectively computes:
byte0 + byte1 + byte2 + byte3
by shifting and adding in one operation. The result appears in the highest byte (bits 24-31), which we extract by shifting right 24 bits.
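The byte-summing multiplication can be seen in isolation. In this sketch the byte values are illustrative (they match the 0x12345678 example from the lookup-table section), and the 32-bit mask is needed because Python integers do not wrap:

```python
# Suppose each byte holds a per-byte bit count: 2, 3, 4, 4 (low to high)
n = 0x04040302

# Multiplying by 0x01010101 adds shifted copies of n, so the high
# byte (bits 24-31) accumulates byte0 + byte1 + byte2 + byte3.
total = ((n * 0x01010101) & 0xFFFFFFFF) >> 24
print(total)  # 13 = 2 + 3 + 4 + 4
```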
Performance:
The SWAR algorithm is fast on modern CPUs: no branches, no memory accesses, and only a handful of arithmetic operations (roughly 10-20 cycles on typical hardware).
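The same pattern extends to 64-bit words with wider masks. A sketch under the same approach (the helper name `popcount_swar_64bit` is ours):

```python
def popcount_swar_64bit(n: int) -> int:
    """SWAR popcount for a 64-bit unsigned value."""
    n &= 0xFFFFFFFFFFFFFFFF
    n = n - ((n >> 1) & 0x5555555555555555)                          # pair counts
    n = (n & 0x3333333333333333) + ((n >> 2) & 0x3333333333333333)   # nibble counts
    n = (n + (n >> 4)) & 0x0F0F0F0F0F0F0F0F                          # byte counts
    # Sum all eight byte counts into the top byte, then extract it
    return ((n * 0x0101010101010101) & 0xFFFFFFFFFFFFFFFF) >> 56

print(popcount_swar_64bit(2**64 - 1))  # 64
```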
The first step of SWAR deserves special attention because it uses a counterintuitive optimization: subtraction instead of masking and adding.
The Naive Version:
n = (n & 0x55555555) + ((n >> 1) & 0x55555555)
This masks the odd bits, masks the even bits (shifted), and adds them.
The Optimized Version:
n = n - ((n >> 1) & 0x55555555)
This achieves the same result with fewer operations! Let's prove it works.
Algebraic Proof:
Consider a 2-bit value with bits A (position 1) and B (position 0). Its value is 2A + B.
We want to compute A + B (the bit count).
Naive:
- `(n & 01) = B` (the low bit)
- `((n >> 1) & 01) = A` (the high bit shifted down)
- Sum: `A + B`

Optimized:
- `n = 2A + B`
- `(n >> 1) & 01 = A` (just the high bit shifted)
- `n - A = 2A + B - A = A + B` ✓

Both compute A + B, but subtraction saves one mask operation!
| AB (binary) | n (decimal) | (n>>1)&01 | n - ((n>>1)&01) | Expected A+B |
|---|---|---|---|---|
| 00 | 0 | 0 | 0 - 0 = 0 | 0+0 = 0 ✓ |
| 01 | 1 | 0 | 1 - 0 = 1 | 0+1 = 1 ✓ |
| 10 | 2 | 1 | 2 - 1 = 1 | 1+0 = 1 ✓ |
| 11 | 3 | 1 | 3 - 1 = 2 | 1+1 = 2 ✓ |
The subtraction trick saves one AND operation per call. In high-performance code executing billions of popcounts, this translates to measurable speedups. It also demonstrates how understanding the mathematical invariants enables non-obvious optimizations.
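A quick exhaustive check over all byte values confirms that the two step-1 forms agree:

```python
# Each 2-bit field is independent and the per-pair result is never
# negative, so checking every 8-bit pattern covers all cases.
for n in range(256):
    naive = (n & 0x55) + ((n >> 1) & 0x55)
    optimized = n - ((n >> 1) & 0x55)
    assert naive == optimized
print("both step-1 forms agree on all 256 byte values")
```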
Let's rigorously analyze the complexity of each popcount approach. For a word of k bits with b bits set:
| Algorithm | Time | Space | Best Case | Worst Case |
|---|---|---|---|---|
| Naive (shift) | O(k) | O(1) | O(k) always | O(k) always |
| Naive (early exit) | O(b) | O(1) | O(1) if b=0 | O(k) if all bits set |
| Brian Kernighan | O(b) | O(1) | O(1) if b=0 | O(k) if all bits set |
| Lookup Table (8-bit) | O(k/8) | O(256) | O(1) fixed | O(1) fixed |
| SWAR | O(log k) | O(1) | O(1) fixed | O(1) fixed |
| Hardware POPCNT | O(1) | O(1) | O(1) | O(1) |
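The O(b) row refers to Brian Kernighan's method, which the next page covers in depth. For reference while reading the table, a minimal sketch:

```python
def popcount_kernighan(n: int) -> int:
    """Clear the lowest set bit until none remain: O(b) iterations."""
    count = 0
    while n:
        n &= n - 1  # n & (n-1) drops the lowest set bit
        count += 1
    return count

print(popcount_kernighan(0b10010000))  # 2
```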
Practical Performance Considerations:
Naive O(k) vs Brian Kernighan O(b): Kernighan's method wins on sparse words (b much smaller than k); on dense words both approach k iterations.
SWAR O(log k): for fixed 32- or 64-bit words this is a constant five or six steps, with no branches and no memory accesses.
Lookup table O(k/8): at most 4 lookups for 32-bit and 8 for 64-bit values; real performance depends on the table staying resident in L1 cache.
Hardware POPCNT O(1): a single dedicated instruction, typically 1-3 cycles.
Big-O hides constants that matter in practice. The hardware POPCNT instruction is O(1) with a very small constant (1-3 CPU cycles). SWAR is also O(1) for fixed-width integers but with a larger constant (~10-20 cycles). For most purposes, prefer hardware POPCNT when available, SWAR otherwise.
Which popcount algorithm should you use? Here's a practical decision framework:
Prefer your language's built-in: Python's int.bit_count(), Java's Integer.bitCount(), and C++20's std::popcount all use optimal implementations for your platform.
```python
# RECOMMENDED: Use the built-in (Python 3.10+)
def popcount_best(n: int) -> int:
    """Uses the optimal implementation for your platform."""
    return n.bit_count()


# For older Python, use bin().count()
def popcount_portable(n: int) -> int:
    """Works on all Python versions, reasonably fast."""
    return bin(n).count('1')


# Arbitrary-precision integers work too (Python handles this!)
big_number = 2**1000 - 1  # 1000-bit number, all 1s
print(big_number.bit_count())  # 1000
```

We've explored the population count problem from multiple angles, from naive iteration to sophisticated parallel algorithms: simple loops are easy to verify, lookup tables trade memory for speed, SWAR exploits word-level parallelism, and hardware support makes popcount effectively free.
What's Next:
In the next page, we'll explore Brian Kernighan's algorithm—an elegant technique that counts set bits in O(b) time where b is the number of set bits. This algorithm directly uses the n & (n-1) pattern from Page 1, beautifully connecting our bit manipulation toolkit.
You now understand the major approaches to counting set bits. From naive iteration to parallel SWAR, each technique offers different trade-offs. Next, we'll see how Brian Kernighan's algorithm provides an elegant O(b) solution that bridges our knowledge of n & (n-1) with popcount.