Powers of two appear everywhere in computing: page sizes (4096 bytes), cache lines (64 bytes), memory alignment requirements, network buffer sizes, disk sector sizes (512 or 4096 bytes), and of course, memory allocation block sizes. This is no coincidence: it reflects a deep structural property of binary computing that yields enormous efficiency gains.
In memory allocation, constraining block sizes to powers of two—1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, ...—is the foundation of the buddy system and numerous other allocators. This page explores why this constraint is so powerful, how it enables efficient operations, and what trade-offs it introduces.
By the end of this page, you will understand the mathematical properties of powers of two that make them ideal for allocation, how hardware exploits these properties, efficient algorithms for power-of-two calculations, and the fragmentation costs of this approach.
Powers of two have unique mathematical properties in binary representation that make them extraordinarily efficient for computer operations.
Property 1: Single-Bit Representation
In binary, a power of two has exactly one bit set:
2^0 = 1 = 0000 0001
2^1 = 2 = 0000 0010
2^2 = 4 = 0000 0100
2^3 = 8 = 0000 1000
2^4 = 16 = 0001 0000
2^5 = 32 = 0010 0000
2^6 = 64 = 0100 0000
2^7 = 128 = 1000 0000
...
2^k = single bit at position k
This enables a fast test for "is this number a power of two?":
bool is_power_of_two(unsigned n) {
return n > 0 && (n & (n - 1)) == 0;
}
How it works: If n is a power of two, subtracting 1 flips all bits below the single set bit. ANDing with n yields 0.
Property 2: Division and Multiplication by Shift
Dividing or multiplying by 2^k is equivalent to bit shifting:
n × 2^k = n << k (left shift by k)
n ÷ 2^k = n >> k (right shift by k, for unsigned/logical shift)
Performance Impact:
By using powers of two, allocation size calculations that would require expensive division become trivial bit shifts.
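As an illustrative sketch (the 64-byte block size and shift constant are assumptions, not from the text), an allocator can convert between byte counts and block counts using only shifts, with no division instruction:

```c
#include <stddef.h>

#define BLOCK_SHIFT 6                          /* illustrative: 64-byte blocks */
#define BLOCK_SIZE  ((size_t)1 << BLOCK_SHIFT)

/* Blocks needed to hold `bytes`, rounded up -- a shift, not a divide. */
static size_t blocks_needed(size_t bytes) {
    return (bytes + BLOCK_SIZE - 1) >> BLOCK_SHIFT;
}

/* Byte offset of block index `i` within a region -- a shift, not a multiply. */
static size_t block_offset(size_t i) {
    return i << BLOCK_SHIFT;
}
```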
Property 3: Modulo by Mask
The remainder when dividing by 2^k is obtained by masking:
n mod 2^k = n & (2^k - 1)
Example:
n = 137 = 1000 1001
2^4 = 16 = 0001 0000
2^4 - 1 = 15 = 0000 1111 (the mask)
137 mod 16 = 137 & 15
= 1000 1001 & 0000 1111
= 0000 1001
= 9
Verification: 137 = 8 × 16 + 9 ✓
This enables O(1) offset calculations within blocks of power-of-two size.
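For instance, a minimal sketch of page-offset arithmetic built on this masking trick (the 4096-byte page size is an illustrative choice):

```c
#include <stdint.h>

#define PAGE_SIZE ((uintptr_t)4096)   /* 2^12, illustrative */

/* Offset of an address within its page: one AND with the mask. */
static uintptr_t page_offset(uintptr_t addr) {
    return addr & (PAGE_SIZE - 1);
}

/* Start of the page containing addr: clear the low 12 bits. */
static uintptr_t page_base(uintptr_t addr) {
    return addr & ~(PAGE_SIZE - 1);
}
```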
| Operation | General Case | Power-of-Two | Speedup |
|---|---|---|---|
| Multiply by k | MUL instruction (3-5 cycles) | Left shift (1 cycle) | 3-5x |
| Divide by k | DIV instruction (10-40 cycles) | Right shift (1 cycle) | 10-40x |
| Modulo k | DIV + extract remainder | AND with mask (1 cycle) | 10-40x |
| Is multiple of k? | Divide, check remainder | AND with mask, check zero | 10-40x |
| Align to k | Division, multiply | AND with complement mask | 20-80x |
Property 4: Address Alignment
An address A is aligned to 2^k bytes if and only if its lower k bits are zero:
A is aligned to 2^k ⟺ (A & (2^k - 1)) == 0
Aligning an address:
To round up address A to the next 2^k boundary:
uintptr_t align_up(uintptr_t addr, size_t alignment) {
// alignment must be power of two
size_t mask = alignment - 1;
return (addr + mask) & ~mask;
}
Example:
addr = 1000, alignment = 256 = 2^8
mask = 255
align_up(1000, 256):
= (1000 + 255) & ~255
= 1255 & 0xFFFFFF00
= 1024
1024 is the next 256-byte boundary after 1000 ✓
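A companion align_down uses the same masking idea; this minimal sketch pairs it with the align_up shown above:

```c
#include <stdint.h>
#include <stddef.h>

/* Round addr up to the next boundary; alignment must be a power of two. */
static uintptr_t align_up(uintptr_t addr, size_t alignment) {
    uintptr_t mask = (uintptr_t)alignment - 1;
    return (addr + mask) & ~mask;
}

/* Round addr down to the previous boundary: just clear the low bits. */
static uintptr_t align_down(uintptr_t addr, size_t alignment) {
    return addr & ~((uintptr_t)alignment - 1);
}
```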
Aligned memory access is often required by hardware (SIMD instructions, DMA transfers) and usually more efficient. Modern CPUs may penalize unaligned access with extra cycles or even raise exceptions. Power-of-two sizes naturally produce aligned blocks when allocated from aligned base addresses.
In power-of-two allocation, every request is rounded up to the next power of two. This rounding operation is fundamental and must be extremely efficient.
The Problem:
Given a size S, find the smallest power of two P such that P ≥ S.
Examples:
S = 1 → P = 1
S = 2 → P = 2
S = 3 → P = 4
S = 5 → P = 8
S = 100 → P = 128
S = 1000 → P = 1024
Approach 1: Iterative
size_t round_up_pow2_iterative(size_t s) {
if (s == 0) return 1;
size_t p = 1;
while (p < s) {
p <<= 1;
}
return p;
}
Time complexity: O(log s) - up to 64 iterations for 64-bit values. Unacceptable for hot paths.
// Efficient power-of-two rounding algorithms

#include <stdint.h>

// Approach 2: Bit manipulation (constant time, no loops)
// Uses the fact that for n > 0:
// - Fill all bits below the highest set bit
// - Add 1 to get the next power of two
uint64_t round_up_pow2_bitwise(uint64_t s) {
    if (s == 0) return 1;
    if (s == 1) return 1;
    s--;              // Handle exact powers of two
    s |= s >> 1;      // Fill 1 bit below highest
    s |= s >> 2;      // Fill 2 bits below highest
    s |= s >> 4;      // Fill 4 bits below highest
    s |= s >> 8;      // Fill 8 bits below highest
    s |= s >> 16;     // Fill 16 bits below highest
    s |= s >> 32;     // Fill 32 bits below highest (for 64-bit)
    s++;              // Add 1 to get next power of two
    return s;
}

// Approach 3: Using compiler builtins (fastest)
// Most compilers have builtins that use hardware instructions
uint64_t round_up_pow2_builtin(uint64_t s) {
    if (s <= 1) return 1;
    // __builtin_clzll counts leading zeros
    // 64 - clz - 1 = position of highest bit (0-indexed)
    int highest_bit = 63 - __builtin_clzll(s - 1);
    return 1ULL << (highest_bit + 1);
}

// Approach 4: Using the x86 BSR (Bit Scan Reverse) instruction
// Inline assembly for maximum efficiency
uint64_t round_up_pow2_bsr(uint64_t s) {
    if (s <= 1) return 1;
    uint64_t result;
    // BSR finds the position of the most significant set bit
    asm("bsrq %1, %0" : "=r"(result) : "r"(s - 1));
    return 1ULL << (result + 1);
}

// Log base 2 (floor) - useful for computing order
int log2_floor(uint64_t s) {
    if (s == 0) return -1;  // Undefined
    return 63 - __builtin_clzll(s);
}

// Log base 2 (ceiling) - for rounding up
int log2_ceil(uint64_t s) {
    if (s <= 1) return 0;
    return 64 - __builtin_clzll(s - 1);
}

Understanding the Bit-Fill Algorithm:
The bit manipulation approach (Approach 2) works by "smearing" the highest set bit downward until all lower bits are set, then adding 1.
Example: s = 100
Step by step:
s = 100 = 0110 0100
s-- = 99 = 0110 0011
s |= s >> 1 → 99 | 49 = 0110 0011 | 0011 0001 = 0111 0011
s |= s >> 2 → 115 | 28 = 0111 0011 | 0001 1100 = 0111 1111
s |= s >> 4 → 127 | 7 = 0111 1111 | 0000 0111 = 0111 1111
(further shifts don't change anything)
s++ = 128 = 1000 0000 ✓
The algorithm fills all bits below the highest, creating 2^n - 1,
then adds 1 to get 2^n.
Why Subtract 1 First?
The initial s-- ensures that exact powers of two round to themselves:
Without s--:
s = 64 = 0100 0000
After fills: 0111 1111
After ++: 128 ✗ (should be 64)
With s--:
s = 64, s-- = 63 = 0011 1111
After fills: 0011 1111
After ++: 64 ✓
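The bit-fill rounding can be verified with a compact, self-contained version (reproduced here so the checks stand alone):

```c
#include <stdint.h>

/* Bit-fill rounding: smear the highest set bit of s-1 downward so all
 * lower bits are 1, then add 1. Exact powers of two map to themselves. */
static uint64_t round_up_pow2(uint64_t s) {
    if (s <= 1) return 1;
    s--;
    s |= s >> 1;  s |= s >> 2;  s |= s >> 4;
    s |= s >> 8;  s |= s >> 16; s |= s >> 32;
    return s + 1;
}
```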
| Approach | Cycles (approx) | Portability | Code Complexity |
|---|---|---|---|
| Iterative | 5-64 (varies with input) | Excellent | Low |
| Bit manipulation | ~12 fixed | Excellent | Medium |
| Compiler builtin | ~3-5 fixed | Good (GCC/Clang) | Low |
| Inline assembly (BSR) | ~2-3 fixed | x86 only | High |
Use compiler builtins (__builtin_clzll for GCC/Clang, _BitScanReverse64 for MSVC) when available. They compile to optimal hardware instructions (LZCNT or BSR) and are both fast and readable. Fall back to the bit manipulation approach for maximum portability.
In the buddy system, the order of a block determines its size: a block of order k has size 2^k. Selecting the minimum and maximum orders involves trade-offs.
Minimum Block Size (MIN_ORDER):
The smallest allocatable block affects per-block metadata overhead, the number of free lists to maintain, and internal fragmentation for tiny requests.
Common MIN_ORDER choices:
MIN_ORDER = 4: 2^4 = 16 bytes
- Fits a free block header (prev + next pointers)
- Good for small allocation workloads
- More management overhead
MIN_ORDER = 5: 2^5 = 32 bytes
- Comfortable for headers with metadata
- Balance between overhead and utility
- Common userspace choice
MIN_ORDER = 12: 2^12 = 4096 bytes = 1 page
- Used by Linux kernel page allocator
- Page is the natural unit for kernel memory
- Sub-page allocations handled by slab
Maximum Block Size (MAX_ORDER):
The largest block bounds the biggest contiguous allocation the allocator can satisfy and sets the depth of the free-list array.
Common MAX_ORDER choices:
MAX_ORDER = 20: 2^20 = 1 MB
- Reasonable for userspace allocators
- Larger requests use mmap() directly
MAX_ORDER = 22: 2^22 = 4 MB
- Matches the Linux page allocator's largest block (order 10 in units of 4 KB pages)
- 1024 pages maximum contiguous allocation
MAX_ORDER = 30: 2^30 = 1 GB
- For systems needing huge contiguous allocations
- Rare in practice
Order Calculation:
Given a request of size S, the order k is:
k = max(MIN_ORDER, ceil(log2(S)))
Implementation using builtins:
int size_to_order(size_t size) {
if (size == 0) return MIN_ORDER;
// Ceiling of log2
int order = (size == 1) ? 0 : 64 - __builtin_clzll(size - 1);
// Clamp to valid range
if (order < MIN_ORDER) order = MIN_ORDER;
if (order > MAX_ORDER) return -1; // Too large
return order;
}
Order to Size:
size_t order_to_size(int order) {
return (size_t)1 << order;
}
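Putting size_to_order and order_to_size together, here is a self-contained sketch; the MIN_ORDER/MAX_ORDER values are the illustrative choices from above, and __builtin_clzll assumes GCC or Clang:

```c
#include <stdint.h>
#include <stddef.h>

#define MIN_ORDER 5    /* 32-byte minimum block, as discussed above */
#define MAX_ORDER 20   /* 1 MB maximum block */

/* Request size -> buddy order (clamped), using a GCC/Clang builtin. */
static int size_to_order(size_t size) {
    if (size <= 1) return MIN_ORDER;
    int order = 64 - __builtin_clzll((uint64_t)(size - 1));  /* ceil(log2) */
    if (order < MIN_ORDER) order = MIN_ORDER;
    if (order > MAX_ORDER) return -1;   /* too large for the buddy system */
    return order;
}

static size_t order_to_size(int order) {
    return (size_t)1 << order;
}
```

A 100-byte request maps to order 7, so the allocator hands out a 128-byte block.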
| Request Size | Calculated Order | Allocated Size | Internal Fragmentation |
|---|---|---|---|
| 1-16 bytes | 4 | 16 bytes | 0-15 bytes (0-94%) |
| 17-32 bytes | 5 | 32 bytes | 0-15 bytes (0-47%) |
| 33-64 bytes | 6 | 64 bytes | 0-31 bytes (0-48%) |
| 65-128 bytes | 7 | 128 bytes | 0-63 bytes (0-49%) |
| 129-256 bytes | 8 | 256 bytes | 0-127 bytes (0-50%) |
| 257-512 bytes | 9 | 512 bytes | 0-255 bytes (0-50%) |
| 513-1024 bytes | 10 | 1024 bytes | 0-511 bytes (0-50%) |
| 1 MB + 1 byte | -1 | N/A | Request too large |
Notice that worst-case internal fragmentation approaches 50% as request sizes increase. A request of (2^k + 1) bytes requires a block of 2^(k+1) bytes, wasting nearly half. This is the fundamental trade-off of power-of-two sizing—efficiency in management at the cost of potential space waste.
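The worst case can be checked directly: a request of 2^k + 1 bytes is granted 2^(k+1) bytes, wasting 2^k - 1 of them. A tiny sketch:

```c
#include <stdint.h>

/* Waste for the pathological request of (2^k + 1) bytes: the allocator
 * must grant 2^(k+1) bytes, leaving 2^k - 1 bytes unused. */
static uint64_t worst_case_waste(int k) {
    uint64_t request = (1ULL << k) + 1;
    uint64_t granted = 1ULL << (k + 1);
    return granted - request;   /* equals (1ULL << k) - 1 */
}
```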
Modern CPUs provide specialized hardware support for power-of-two operations, recognizing their importance in systems software.
Leading Zero Count (LZCNT / CLZ):
Counts the number of leading zero bits in a value:
LZCNT(0x0080) = 56 (for 64-bit)
LZCNT(0x8000000000000000) = 0
Use: 63 - LZCNT(x) = floor(log2(x))
x86 Instructions:
BSR (Bit Scan Reverse): Finds position of highest set bit
LZCNT (Leading Zero Count): Counts leading zeros (AMD/Intel)
BSF (Bit Scan Forward): Finds position of lowest set bit
TZCNT (Trailing Zero Count): Counts trailing zeros
ARM Instructions:
CLZ (Count Leading Zeros): Direct leading zero count
// Using hardware instructions for power-of-two operations

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

// x86-64 using LZCNT (requires LZCNT support, or BSR fallback)
static inline int clz64(uint64_t x) {
    if (x == 0) return 64;
#if defined(__LZCNT__)
    return __builtin_clzll(x);
#else
    // BSR-based fallback (note: undefined for x=0)
    uint64_t result;
    asm("bsrq %1, %0" : "=r"(result) : "r"(x));
    return 63 - result;
#endif
}

// Fast log2 floor using hardware
static inline int floor_log2(uint64_t x) {
    return 63 - clz64(x);
}

// Fast log2 ceiling using hardware
static inline int ceil_log2(uint64_t x) {
    if (x <= 1) return 0;
    return 64 - clz64(x - 1);
}

// Check if power of two (single AND operation)
static inline bool is_pow2(uint64_t x) {
    return x && !(x & (x - 1));
}

// Alignment check: the low bits of an aligned address are zero
static inline bool is_aligned(void* ptr, size_t alignment) {
    // alignment must be power of two
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

// Find the order of a block given its address and the base address
// Assumes buddy system with power-of-two blocks
static inline int order_from_address(uintptr_t addr, uintptr_t base, int max_order) {
    uintptr_t offset = addr - base;
    if (offset == 0) return max_order;
    // The order is determined by the position of the lowest set bit
    // in the offset from base
    return __builtin_ctzll(offset);  // count trailing zeros
}

Population Count (POPCNT):
Counts the number of set bits. Although not directly for power-of-two operations, useful for related calculations:
// Check if exactly one bit set (power of two)
bool is_pow2_popcnt(uint64_t x) {
return __builtin_popcountll(x) == 1;
}
Bit Manipulation Extensions:
Modern CPUs have extended bit manipulation capabilities:
Intel BMI1/BMI2:
BLSR - Reset lowest set bit: x & (x-1)
BLSI - Extract lowest set bit: x & (-x)
BLSMSK - Get mask up to lowest set bit
PDEP/PEXT - Parallel bit deposit/extract
These instructions let allocators clear, extract, and mask free-list bits in a single operation, accelerating bitmap scans and size-class lookups.
Compiler Integration:
Compilers recognize patterns and emit optimal instructions:
// This code:
if ((n & (n - 1)) == 0 && n != 0) { ... }
// May compile to:
// BLSR reg, n ; reg = n & (n-1)
// TEST n, n ; check n != 0
// CMOVZ ...
// JZ target
// Or simply:
// POPCNT reg, n
// CMP reg, 1
// JE target
Modern compilers with optimization enabled (-O2 or higher) automatically recognize power-of-two patterns and emit optimal instructions. Writing clear, idiomatic code (like using __builtin_clz) often produces better results than hand-crafted assembly. Profile before optimizing!
Pure power-of-two sizing has significant internal fragmentation. Production allocators often use hybrid approaches that combine power-of-two benefits with finer granularity.
Approach 1: Multiple Size Classes per Order
jemalloc and similar allocators use size classes that aren't strictly powers of two:
jemalloc size classes (small) example:
8, 16, 32, 48, 64, 80, 96, 112, 128,
160, 192, 224, 256,
320, 384, 448, 512,
...
Pattern: within each power-of-two range, add intermediate sizes
Benefits: worst-case internal fragmentation falls from about 50% to roughly 20-25% with four classes per power-of-two range.
Costs: more free lists to track, and the intermediate sizes cannot be split and coalesced with the simple buddy address computation.
// Size class approaches used by production allocators

// Approach 1: Power-of-two with quartiles
// Each power-of-two range has up to 4 size classes
static const size_t SIZE_CLASSES[] = {
    // Base 16 (order 4): 16 only (too small to subdivide)
    16,
    // Base 32 (order 5): 32 only
    32,
    // Base 64 (order 6): 48, 64
    48, 64,
    // Base 128 (order 7): 80, 96, 112, 128
    80, 96, 112, 128,
    // Base 256 (order 8): 160, 192, 224, 256
    160, 192, 224, 256,
    // Base 512 (order 9): 320, 384, 448, 512
    320, 384, 448, 512,
    // ... and so on
};

#define NUM_SIZE_CLASSES (sizeof(SIZE_CLASSES) / sizeof(SIZE_CLASSES[0]))

// Map request size to size class index
int size_to_class(size_t size) {
    if (size <= 16) return 0;
    if (size <= 32) return 1;
    if (size <= 48) return 2;
    if (size <= 64) return 3;
    // For larger sizes, compute based on the pattern.
    // This is O(1) using bit operations.
    int order = 64 - __builtin_clzll(size - 1);   // ceil(log2(size))
    int base_index = /* lookup table for order */ order_to_base_index[order];
    // Which quartile within this order?
    size_t base = 1UL << (order - 1);  // Previous power of two
    size_t step = base / 4;            // Quartile size
    int quartile = (size - base - 1) / step;
    return base_index + quartile;
}

// Approach 2: jemalloc-style groups
// Larger size classes are spaced further apart
typedef struct {
    size_t size;
    int lg_delta;  // log2 of spacing at this level
    int grp;       // group index
} SizeClass;

// jemalloc uses: within each group, size classes are
// base, base+delta, base+2*delta, base+3*delta
// where delta doubles with each group

Approach 2: Slab Allocation for Common Sizes
Combine buddy allocation for large blocks with specialized handling for common sizes:
Architectural Layers:
┌─────────────────────────────────────────┐
│ User Request (arbitrary size) │
└─────────────────┬───────────────────────┘
│
┌─────────────┴─────────────┐
│ │
▼ ▼
┌─────────┐ ┌──────────────┐
│ Slab │ │ Buddy System │
│ Allocator│ │ (large blocks)│
│ (small) │ │ │
└────┬────┘ └──────┬────────┘
│ │
▼ ▼
┌───────────────────────────────────────┐
│ Physical Pages (from OS) │
└───────────────────────────────────────┘
Slab handles: 8, 16, 32, 64, 96, 128, 192, 256, ...
(with zero internal fragmentation for exact sizes)
Buddy handles: 4KB, 8KB, 16KB, 32KB, ...
(for large allocations that don't match slabs)
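A hypothetical front end for such a hybrid might route requests like this; the size threshold, layer names, and helpers are illustrative assumptions, not a real API:

```c
#include <stddef.h>

#define SLAB_MAX  2048u   /* illustrative: largest size the slab layer serves */
#define PAGE_SIZE 4096u

typedef enum { ROUTE_SLAB, ROUTE_BUDDY } route_t;

/* Decide which layer serves a request: small sizes go to the slab
 * caches, anything larger goes to the buddy allocator in page units. */
static route_t route_request(size_t size) {
    return (size <= SLAB_MAX) ? ROUTE_SLAB : ROUTE_BUDDY;
}

/* Whole pages needed for a large request; the buddy layer then rounds
 * the page count up to a power of two internally. */
static size_t pages_needed(size_t size) {
    return (size + PAGE_SIZE - 1) / PAGE_SIZE;
}
```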
Linux Kernel Approach:
Buddy Allocator
├── Provides pages to SLAB/SLUB/SLOB
└── Direct allocation for multi-page requests
SLAB Allocator (kmalloc)
├── Caches for each power-of-two size
├── Object caching for kernel structures
└── Per-CPU caching for performance
Power-of-two allocation exploits deep properties of binary arithmetic to achieve efficient memory management. While it introduces internal fragmentation, the operational efficiencies—O(1) arithmetic, hardware acceleration, and simple coalescing—often outweigh the space cost.
What's Next:
The final page of this module examines Buddy System Fragmentation—a detailed analysis of both internal fragmentation (from power-of-two rounding) and the specific patterns of external fragmentation that can still occur in buddy systems. Understanding these characteristics enables informed decisions about when buddy allocation is appropriate.
You now understand the mathematical and practical foundations of power-of-two allocation—from binary properties through hardware support to hybrid sizing strategies. This knowledge enables you to implement efficient allocators and make informed decisions about size class design.