Population count—commonly abbreviated as popcount, and also known as bit count, sideways sum, or Hamming weight—answers a deceptively simple question: how many bits are set to 1 in a binary number?
This fundamental operation appears throughout computer science: error detection codes, cryptographic algorithms, database bitmap indexes, genomic analysis, chess engines, and countless optimization scenarios. Modern CPUs even include dedicated hardware instructions for popcount because it's so universally useful.
Yet despite its simplicity, popcount admits surprisingly sophisticated solutions. From naive bit-by-bit counting to parallel SIMD techniques, the evolution of popcount algorithms illustrates how understanding binary arithmetic enables dramatic performance improvements.
By the end of this page, you will understand multiple approaches to counting set bits—from the intuitive naive method to sophisticated parallel algorithms. You'll learn when each approach is appropriate and how modern hardware optimizes this operation.
Formal Definition:
Given an unsigned integer n, compute the number of bits set to 1 in its binary representation. This count is called the population count or Hamming weight.
Examples:
| Number | Binary | Popcount |
|---|---|---|
| 0 | 00000000 | 0 |
| 1 | 00000001 | 1 |
| 7 | 00000111 | 3 |
| 12 | 00001100 | 2 |
| 15 | 00001111 | 4 |
| 255 | 11111111 | 8 |
| 1023 | 1111111111 | 10 |
Why "Population Count"?
The name comes from thinking of each bit position as a "citizen" that is either present (1) or absent (0). The population count is literally counting how many citizens are present.
Why "Hamming Weight"?
Named after Richard Hamming, this interpretation views the count as the "weight" or "distance" of the number from zero in terms of differing bit positions. The Hamming distance between two numbers equals the popcount of their XOR.
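This relationship is easy to verify in code. A small sketch (the helper name `hamming_distance` is ours):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where a and b differ."""
    # XOR sets exactly the bits where a and b disagree,
    # so its popcount is the Hamming distance.
    return bin(a ^ b).count('1')

# 0b1011 and 0b1110 differ in bits 0 and 2
print(hamming_distance(0b1011, 0b1110))  # 2
```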
The most intuitive approach examines each bit position individually and counts the 1s. This is the natural first solution anyone would devise.
Algorithm:
Two common implementations:
```python
def popcount_naive_shift(n: int) -> int:
    """
    Count set bits by shifting right and checking the LSB.

    Time Complexity: O(k) where k is the bit width (e.g., 32 or 64)
    Space Complexity: O(1)

    Examines every bit position up to the highest set bit.
    """
    count = 0
    while n:
        count += n & 1  # Check if LSB is set
        n >>= 1         # Shift right to process next bit
    return count


def popcount_naive_mask(n: int, bits: int = 32) -> int:
    """
    Count set bits using a moving mask.

    Always iterates exactly 'bits' times, even if n is small.
    Useful when you need constant-time behavior regardless of value.
    """
    count = 0
    mask = 1
    for _ in range(bits):
        if n & mask:
            count += 1
        mask <<= 1
    return count


# Trace for n = 13 (binary: 1101):
# Iteration 1: n=1101, n&1=1, count=1, n becomes 0110
# Iteration 2: n=0110, n&1=0, count=1, n becomes 0011
# Iteration 3: n=0011, n&1=1, count=2, n becomes 0001
# Iteration 4: n=0001, n&1=1, count=3, n becomes 0000
# Loop ends, return 3 ✓

# Test examples
print(popcount_naive_shift(0))    # 0
print(popcount_naive_shift(1))    # 1
print(popcount_naive_shift(7))    # 3 (binary: 111)
print(popcount_naive_shift(13))   # 3 (binary: 1101)
print(popcount_naive_shift(255))  # 8 (binary: 11111111)
```

In languages with signed integers, be careful with right shifts. Arithmetic right shift (`>>`) preserves the sign bit, potentially creating an infinite loop. Use unsigned right shift (`>>>` in Java/JavaScript) or work with unsigned types. Python's arbitrary-precision integers avoid this issue but require explicit handling.
Analysis:
| Aspect | Evaluation |
|---|---|
| Time Complexity | O(k) where k = bit width (typically 32 or 64) |
| Space Complexity | O(1) |
| Simplicity | Excellent—immediately understandable |
| Performance | Poor—always examines all bit positions |
The naive approach has a significant inefficiency: it examines bit positions one at a time regardless of how many are set. For a 64-bit word with only 2 bits set, the mask version still performs 64 iterations, and the shift version must walk all the way up to the highest set bit.
This motivates our search for algorithms whose runtime depends on the number of set bits rather than the word size.
A classic optimization strategy in computing is precomputation: do expensive work once, store results, and look them up later. For popcount, we can precompute the bit count for all possible byte values (0–255) and combine lookups.
The Insight:
Build a table where `table[i]` = popcount of byte value `i`, then decompose the integer into bytes and sum the lookups.
Advantage: Each lookup is O(1), and we do at most 4-8 lookups per integer.
Trade-off: Uses 256 bytes of memory for the table.
```python
# Precompute popcount for all byte values
# Table is built once at module load time
POPCOUNT_TABLE_8BIT = [bin(i).count('1') for i in range(256)]


# Alternative: Build the table using the naive method
def build_popcount_table():
    """Explicitly build the lookup table."""
    table = [0] * 256
    for i in range(256):
        # Count bits in this byte value
        n = i
        count = 0
        while n:
            count += n & 1
            n >>= 1
        table[i] = count
    return table


def popcount_lookup_table(n: int) -> int:
    """
    Count set bits using a byte-level lookup table.

    Time Complexity: O(k/8) = O(1) for fixed-width integers
    Space Complexity: O(256) = O(1) for the table

    For a 32-bit integer: at most 4 table lookups
    For a 64-bit integer: at most 8 table lookups
    """
    count = 0
    # Process one byte at a time
    while n:
        count += POPCOUNT_TABLE_8BIT[n & 0xFF]  # Look up lowest byte
        n >>= 8                                 # Shift to next byte
    return count


def popcount_lookup_32bit(n: int) -> int:
    """
    Optimized version for 32-bit integers—unrolled loop.

    Always exactly 4 lookups, potentially faster due to no branching.
    """
    return (POPCOUNT_TABLE_8BIT[n & 0xFF] +
            POPCOUNT_TABLE_8BIT[(n >> 8) & 0xFF] +
            POPCOUNT_TABLE_8BIT[(n >> 16) & 0xFF] +
            POPCOUNT_TABLE_8BIT[(n >> 24) & 0xFF])


# Example: popcount of 0x12345678
# Byte 0: 0x78 = 01111000 → 4 bits
# Byte 1: 0x56 = 01010110 → 4 bits
# Byte 2: 0x34 = 00110100 → 3 bits
# Byte 3: 0x12 = 00010010 → 2 bits
# Total: 4 + 4 + 3 + 2 = 13 bits

print(popcount_lookup_table(0x12345678))  # 13
```

You can use larger tables (16-bit: 65536 entries) for fewer lookups but more memory, or smaller tables (4-bit: 16 entries) for less memory but more lookups. The 8-bit table is the sweet spot for most architectures—it fits in L1 cache while keeping the lookup count low.
Performance Considerations:
When to Use Lookup Tables:
| Scenario | Recommendation |
|---|---|
| Single popcount, cold cache | Naive might be faster |
| Many popcounts in tight loop | Lookup table excels |
| Memory-constrained embedded | Avoid the table |
| Modern CPU with POPCNT | Use hardware instruction |
The most elegant algorithmic approach to popcount is parallel bit counting, also known as SWAR (SIMD Within A Register). This technique uses clever arithmetic to count bits in parallel within a single register—no loops, no tables.
The Core Idea:
Instead of counting bits one at a time, we count them in progressively larger groups: first 2-bit pairs, then 4-bit nibbles, then bytes, and so on up to the full word.
This is a classic divide-and-conquer approach, running in O(log₂(word_size)) steps.
Visualizing the Algorithm:
For a simple 8-bit example with n = 0b11010111:
Original: 1 1 0 1 0 1 1 1 (8 bits)
Step 1 - pairs: (10) (01) (01) (10) → counts: 2, 1, 1, 2
Step 2 - quads: (0011) (0011) → counts: 3, 3
Step 3 - byte: (00000110) → count: 6
The number 0b11010111 has 6 bits set. We computed this in 3 parallel steps!
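The three steps above can be sketched directly for 8-bit values before scaling up to 32 bits. A minimal sketch (the helper name `popcount_swar_8bit` is ours):

```python
def popcount_swar_8bit(n: int) -> int:
    """Three-step SWAR popcount for an 8-bit value."""
    n &= 0xFF
    n = (n & 0x55) + ((n >> 1) & 0x55)  # Step 1: counts per 2-bit pair
    n = (n & 0x33) + ((n >> 2) & 0x33)  # Step 2: counts per 4-bit nibble
    n = (n & 0x0F) + ((n >> 4) & 0x0F)  # Step 3: total count in the byte
    return n

print(popcount_swar_8bit(0b11010111))  # 6
```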
```python
def popcount_swar_32bit(n: int) -> int:
    """
    Parallel bit counting using SWAR (SIMD Within A Register).

    A classic parallel bit-counting technique, refined by various authors.

    Time Complexity: O(log k) = O(1) for fixed-width integers
    Space Complexity: O(1)

    For 32-bit: exactly 5 mask-shift-add steps after the initial setup.
    """
    # Ensure we're working with 32-bit unsigned semantics
    n = n & 0xFFFFFFFF

    # Step 1: Count bits in each pair (2-bit groups)
    # For each pair AB, compute A+B (which is 0, 1, or 2)
    # Mask 0x55555555 = 01010101... selects bits at positions 0, 2, 4, ...
    n = (n & 0x55555555) + ((n >> 1) & 0x55555555)

    # Step 2: Count bits in each nibble (4-bit groups)
    # For each nibble, add the two 2-bit counts from step 1
    # Mask 0x33333333 = 00110011... selects alternating pairs
    n = (n & 0x33333333) + ((n >> 2) & 0x33333333)

    # Step 3: Count bits in each byte (8-bit groups)
    # For each byte, add the two 4-bit counts
    # Each count fits in 4 bits (max value 8)
    n = (n & 0x0F0F0F0F) + ((n >> 4) & 0x0F0F0F0F)

    # Step 4: Count bits in each 16-bit half
    n = (n & 0x00FF00FF) + ((n >> 8) & 0x00FF00FF)

    # Step 5: Final sum of the two 16-bit halves
    n = (n & 0x0000FFFF) + ((n >> 16) & 0x0000FFFF)

    return n


def popcount_swar_optimized(n: int) -> int:
    """
    Optimized SWAR popcount using multiplication for the final sum.

    After the byte step, each byte contains its bit count (0-8).
    Multiplying by 0x01010101 sums all bytes into the high byte.
    """
    n = n & 0xFFFFFFFF
    # Steps 1-3 same as before
    n = n - ((n >> 1) & 0x55555555)  # Slight optimization for step 1
    n = (n & 0x33333333) + ((n >> 2) & 0x33333333)
    n = (n + (n >> 4)) & 0x0F0F0F0F
    # Multiplication trick: sum all four byte counts into the high byte.
    # Python ints don't wrap at 32 bits, so mask before shifting.
    return ((n * 0x01010101) & 0xFFFFFFFF) >> 24


# Trace example: n = 0b11111111000000001111111100000000 (popcount = 16)
# After step 1: each pair contains its 2-bit count
# After step 2: each nibble contains its 4-bit count
# After step 3: each byte contains its 8-bit count
# Multiply and shift: sum of all byte counts

print(popcount_swar_32bit(0b11010111))      # 6
print(popcount_swar_optimized(0xFF00FF00))  # 16
```

The hexadecimal masks aren't magic—they're carefully constructed bit patterns:

- `0x55555555` = 01010101... (selects every other bit)
- `0x33333333` = 00110011... (selects every other pair)
- `0x0F0F0F0F` = 00001111... (selects every other nibble)
- `0x01010101` = 00000001... (a 1 in each byte position, for the multiplication)
Why This Works:
Step 1 Insight: For a 2-bit value AB, the count is A + B. We can compute this by masking off the low bit (B), shifting and masking to isolate the high bit (A), and adding the two.
But we're doing this in parallel for 16 pairs simultaneously! The mask ensures we don't mix values between pairs.
The Multiplication Trick:
After step 3, each byte contains the count of bits originally in that byte (0-8). Multiplying by 0x01010101 effectively computes:
byte0 + byte1 + byte2 + byte3
by shifting and adding in one operation. The result appears in the highest byte (bits 24-31), which we extract by shifting right 24 bits.
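The byte-summing multiplication can be seen in isolation. In this sketch the byte values are illustrative (they match the 0x12345678 example from the lookup-table section), and the 32-bit mask is needed because Python integers do not wrap:

```python
# Suppose each byte holds a per-byte bit count: 2, 3, 4, 4 (low to high)
n = 0x04040302

# Multiplying by 0x01010101 adds shifted copies of n, so the high
# byte (bits 24-31) accumulates byte0 + byte1 + byte2 + byte3.
total = ((n * 0x01010101) & 0xFFFFFFFF) >> 24
print(total)  # 13 = 2 + 3 + 4 + 4
```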
Performance:
The SWAR algorithm is fast on modern CPUs: no branches, no memory accesses, and only a handful of arithmetic operations (roughly 10-20 cycles on typical hardware).
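The same pattern extends to 64-bit words with wider masks. A sketch under the same approach (the helper name `popcount_swar_64bit` is ours):

```python
def popcount_swar_64bit(n: int) -> int:
    """SWAR popcount for a 64-bit unsigned value."""
    n &= 0xFFFFFFFFFFFFFFFF
    n = n - ((n >> 1) & 0x5555555555555555)                          # pair counts
    n = (n & 0x3333333333333333) + ((n >> 2) & 0x3333333333333333)   # nibble counts
    n = (n + (n >> 4)) & 0x0F0F0F0F0F0F0F0F                          # byte counts
    # Sum all eight byte counts into the top byte, then extract it
    return ((n * 0x0101010101010101) & 0xFFFFFFFFFFFFFFFF) >> 56

print(popcount_swar_64bit(2**64 - 1))  # 64
```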
The first step of SWAR deserves special attention because it uses a counterintuitive optimization: subtraction instead of masking and adding.
The Naive Version:
n = (n & 0x55555555) + ((n >> 1) & 0x55555555)
This masks the odd bits, masks the even bits (shifted), and adds them.
The Optimized Version:
n = n - ((n >> 1) & 0x55555555)
This achieves the same result with fewer operations! Let's prove it works.
Algebraic Proof:
Consider a 2-bit value with bits A (position 1) and B (position 0). Its value is 2A + B.
We want to compute A + B (the bit count).
Naive:
- `(n & 01) = B` (the low bit)
- `((n >> 1) & 01) = A` (the high bit shifted down)
- Sum: `A + B`

Optimized:
- `n = 2A + B`
- `(n >> 1) & 01 = A` (just the high bit shifted)
- `n - A = 2A + B - A = A + B` ✓

Both compute A + B, but subtraction saves one mask operation!
| AB (binary) | n (decimal) | (n>>1)&01 | n - ((n>>1)&01) | Expected A+B |
|---|---|---|---|---|
| 00 | 0 | 0 | 0 - 0 = 0 | 0+0 = 0 ✓ |
| 01 | 1 | 0 | 1 - 0 = 1 | 0+1 = 1 ✓ |
| 10 | 2 | 1 | 2 - 1 = 1 | 1+0 = 1 ✓ |
| 11 | 3 | 1 | 3 - 1 = 2 | 1+1 = 2 ✓ |
The subtraction trick saves one AND operation per call. In high-performance code executing billions of popcounts, this translates to measurable speedups. It also demonstrates how understanding the mathematical invariants enables non-obvious optimizations.
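A quick exhaustive check over all byte values confirms that the two step-1 forms agree:

```python
# Each 2-bit field is independent and the per-pair result is never
# negative, so checking every 8-bit pattern covers all cases.
for n in range(256):
    naive = (n & 0x55) + ((n >> 1) & 0x55)
    optimized = n - ((n >> 1) & 0x55)
    assert naive == optimized
print("both step-1 forms agree on all 256 byte values")
```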
Let's rigorously analyze the complexity of each popcount approach. For a word of k bits with b bits set:
| Algorithm | Time | Space | Best Case | Worst Case |
|---|---|---|---|---|
| Naive (shift) | O(k) | O(1) | O(k) always | O(k) always |
| Naive (early exit) | O(b) | O(1) | O(1) if b=0 | O(k) if all bits set |
| Brian Kernighan | O(b) | O(1) | O(1) if b=0 | O(k) if all bits set |
| Lookup Table (8-bit) | O(k/8) | O(256) | O(1) fixed | O(1) fixed |
| SWAR | O(log k) | O(1) | O(1) fixed | O(1) fixed |
| Hardware POPCNT | O(1) | O(1) | O(1) | O(1) |
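The O(b) row refers to Brian Kernighan's method, which the next page covers in depth. For reference while reading the table, a minimal sketch:

```python
def popcount_kernighan(n: int) -> int:
    """Clear the lowest set bit until none remain: O(b) iterations."""
    count = 0
    while n:
        n &= n - 1  # n & (n-1) drops the lowest set bit
        count += 1
    return count

print(popcount_kernighan(0b10010000))  # 2
```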
Practical Performance Considerations:
Naive O(k) vs Brian Kernighan O(b): Kernighan's method wins on sparse words (b much smaller than k); on dense words both approach k iterations.
SWAR O(log k): for fixed 32- or 64-bit words this is a constant five or six steps, with no branches and no memory accesses.
Lookup table O(k/8): at most 4 lookups for 32-bit and 8 for 64-bit values; real performance depends on the table staying resident in L1 cache.
Hardware POPCNT O(1): a single dedicated instruction, typically 1-3 cycles.
Big-O hides constants that matter in practice. The hardware POPCNT instruction is O(1) with a very small constant (1-3 CPU cycles). SWAR is also O(1) for fixed-width integers but with a larger constant (~10-20 cycles). For most purposes, prefer hardware POPCNT when available, SWAR otherwise.
Which popcount algorithm should you use? Here's a practical decision framework:
Prefer your language's built-in: Python's int.bit_count(), Java's Integer.bitCount(), and C++20's std::popcount all use optimal implementations for your platform.
```python
# RECOMMENDED: Use the built-in (Python 3.10+)
def popcount_best(n: int) -> int:
    """Uses the optimal implementation for your platform."""
    return n.bit_count()


# For older Python, use bin().count()
def popcount_portable(n: int) -> int:
    """Works on all Python versions, reasonably fast."""
    return bin(n).count('1')


# Arbitrary-precision integers work too (Python handles this!)
big_number = 2**1000 - 1  # 1000-bit number, all 1s
print(big_number.bit_count())  # 1000
```

We've explored the population count problem from multiple angles, from naive iteration to sophisticated parallel algorithms: simple loops are easy to verify, lookup tables trade memory for speed, SWAR exploits word-level parallelism, and hardware support makes popcount effectively free.
What's Next:
In the next page, we'll explore Brian Kernighan's algorithm—an elegant technique that counts set bits in O(b) time where b is the number of set bits. This algorithm directly uses the n & (n-1) pattern from Page 1, beautifully connecting our bit manipulation toolkit.
You now understand the major approaches to counting set bits. From naive iteration to parallel SWAR, each technique offers different trade-offs. Next, we'll see how Brian Kernighan's algorithm provides an elegant O(b) solution that bridges our knowledge of n & (n-1) with popcount.