Powers of two appear everywhere in computing: page sizes (4096 bytes), cache lines (64 bytes), memory alignment requirements, network buffer sizes, disk sector sizes (512 or 4096 bytes), and of course, memory allocation block sizes. This is no coincidence: it reflects a deep structural property of binary computing that yields enormous efficiency gains.
In memory allocation, constraining block sizes to powers of two—1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, ...—is the foundation of the buddy system and numerous other allocators. This page explores why this constraint is so powerful, how it enables efficient operations, and what trade-offs it introduces.
By the end of this page, you will understand the mathematical properties of powers of two that make them ideal for allocation, how hardware exploits these properties, efficient algorithms for power-of-two calculations, and the fragmentation costs of this approach.
Powers of two have unique mathematical properties in binary representation that make them extraordinarily efficient for computer operations.
Property 1: Single-Bit Representation
In binary, a power of two has exactly one bit set:
2^0 = 1 = 0000 0001
2^1 = 2 = 0000 0010
2^2 = 4 = 0000 0100
2^3 = 8 = 0000 1000
2^4 = 16 = 0001 0000
2^5 = 32 = 0010 0000
2^6 = 64 = 0100 0000
2^7 = 128 = 1000 0000
...
2^k = single bit at position k
This enables a fast test for "is this number a power of two?":
bool is_power_of_two(unsigned n) {
return n > 0 && (n & (n - 1)) == 0;
}
How it works: If n is a power of two, subtracting 1 flips all bits below the single set bit. ANDing with n yields 0.
Property 2: Division and Multiplication by Shift
Dividing or multiplying by 2^k is equivalent to bit shifting:
n × 2^k = n << k (left shift by k)
n ÷ 2^k = n >> k (right shift by k, for unsigned/logical shift)
Performance Impact:
By using powers of two, allocation size calculations that would require expensive division become trivial bit shifts.
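As an illustrative sketch (the 64-byte block size and shift constant are assumptions, not from the text), an allocator can convert between byte counts and block counts using only shifts, with no division instruction:

```c
#include <stddef.h>

#define BLOCK_SHIFT 6                          /* illustrative: 64-byte blocks */
#define BLOCK_SIZE  ((size_t)1 << BLOCK_SHIFT)

/* Blocks needed to hold `bytes`, rounded up -- a shift, not a divide. */
static size_t blocks_needed(size_t bytes) {
    return (bytes + BLOCK_SIZE - 1) >> BLOCK_SHIFT;
}

/* Byte offset of block index `i` within a region -- a shift, not a multiply. */
static size_t block_offset(size_t i) {
    return i << BLOCK_SHIFT;
}
```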
Property 3: Modulo by Mask
The remainder when dividing by 2^k is obtained by masking:
n mod 2^k = n & (2^k - 1)
Example:
n = 137 = 1000 1001
2^4 = 16 = 0001 0000
2^4 - 1 = 15 = 0000 1111 (the mask)
137 mod 16 = 137 & 15
= 1000 1001 & 0000 1111
= 0000 1001
= 9
Verification: 137 = 8 × 16 + 9 ✓
This enables O(1) offset calculations within blocks of power-of-two size.
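For instance, a minimal sketch of page-offset arithmetic built on this masking trick (the 4096-byte page size is an illustrative choice):

```c
#include <stdint.h>

#define PAGE_SIZE ((uintptr_t)4096)   /* 2^12, illustrative */

/* Offset of an address within its page: one AND with the mask. */
static uintptr_t page_offset(uintptr_t addr) {
    return addr & (PAGE_SIZE - 1);
}

/* Start of the page containing addr: clear the low 12 bits. */
static uintptr_t page_base(uintptr_t addr) {
    return addr & ~(PAGE_SIZE - 1);
}
```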
| Operation | General Case | Power-of-Two | Speedup |
|---|---|---|---|
| Multiply by k | MUL instruction (3-5 cycles) | Left shift (1 cycle) | 3-5x |
| Divide by k | DIV instruction (10-40 cycles) | Right shift (1 cycle) | 10-40x |
| Modulo k | DIV + extract remainder | AND with mask (1 cycle) | 10-40x |
| Is multiple of k? | Divide, check remainder | AND with mask, check zero | 10-40x |
| Align to k | Division, multiply | AND with complement mask | 20-80x |
Property 4: Address Alignment
An address A is aligned to 2^k bytes if and only if its lower k bits are zero:
A is aligned to 2^k ⟺ (A & (2^k - 1)) == 0
Aligning an address:
To round up address A to the next 2^k boundary:
uintptr_t align_up(uintptr_t addr, size_t alignment) {
// alignment must be power of two
size_t mask = alignment - 1;
return (addr + mask) & ~mask;
}
Example:
addr = 1000, alignment = 256 = 2^8
mask = 255
align_up(1000, 256):
= (1000 + 255) & ~255
= 1255 & 0xFFFFFF00
= 1024
1024 is the next 256-byte boundary after 1000 ✓
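A companion align_down uses the same masking idea; this minimal sketch pairs it with the align_up shown above:

```c
#include <stdint.h>
#include <stddef.h>

/* Round addr up to the next boundary; alignment must be a power of two. */
static uintptr_t align_up(uintptr_t addr, size_t alignment) {
    uintptr_t mask = (uintptr_t)alignment - 1;
    return (addr + mask) & ~mask;
}

/* Round addr down to the previous boundary: just clear the low bits. */
static uintptr_t align_down(uintptr_t addr, size_t alignment) {
    return addr & ~((uintptr_t)alignment - 1);
}
```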
Aligned memory access is often required by hardware (SIMD instructions, DMA transfers) and usually more efficient. Modern CPUs may penalize unaligned access with extra cycles or even raise exceptions. Power-of-two sizes naturally produce aligned blocks when allocated from aligned base addresses.
In power-of-two allocation, every request is rounded up to the next power of two. This rounding operation is fundamental and must be extremely efficient.
The Problem:
Given a size S, find the smallest power of two P such that P ≥ S.
Examples:
S = 1 → P = 1
S = 2 → P = 2
S = 3 → P = 4
S = 5 → P = 8
S = 100 → P = 128
S = 1000 → P = 1024
Approach 1: Iterative
size_t round_up_pow2_iterative(size_t s) {
if (s == 0) return 1;
size_t p = 1;
while (p < s) {
p <<= 1;
}
return p;
}
Time complexity: O(log s) - up to 64 iterations for 64-bit values. Unacceptable for hot paths.
// Efficient power-of-two rounding algorithms

#include <stdint.h>

// Approach 2: Bit manipulation (constant time, no loops)
// Uses the fact that for n > 0:
// - Fill all bits below the highest set bit
// - Add 1 to get the next power of two
uint64_t round_up_pow2_bitwise(uint64_t s) {
    if (s == 0) return 1;
    if (s == 1) return 1;
    s--;              // Handle exact powers of two
    s |= s >> 1;      // Fill 1 bit below highest
    s |= s >> 2;      // Fill 2 bits below highest
    s |= s >> 4;      // Fill 4 bits below highest
    s |= s >> 8;      // Fill 8 bits below highest
    s |= s >> 16;     // Fill 16 bits below highest
    s |= s >> 32;     // Fill 32 bits below highest (for 64-bit)
    s++;              // Add 1 to get next power of two
    return s;
}

// Approach 3: Using compiler builtins (fastest)
// Most compilers have builtins that use hardware instructions
uint64_t round_up_pow2_builtin(uint64_t s) {
    if (s <= 1) return 1;
    // __builtin_clzll counts leading zeros
    // 64 - clz - 1 = position of highest bit (0-indexed)
    int highest_bit = 63 - __builtin_clzll(s - 1);
    return 1ULL << (highest_bit + 1);
}

// Approach 4: Using the x86 BSR (Bit Scan Reverse) instruction
// Inline assembly for maximum efficiency
uint64_t round_up_pow2_bsr(uint64_t s) {
    if (s <= 1) return 1;
    uint64_t result;
    // BSR finds the position of the most significant set bit
    asm("bsrq %1, %0" : "=r"(result) : "r"(s - 1));
    return 1ULL << (result + 1);
}

// Log base 2 (floor) - useful for computing order
int log2_floor(uint64_t s) {
    if (s == 0) return -1;  // Undefined
    return 63 - __builtin_clzll(s);
}

// Log base 2 (ceiling) - for rounding up
int log2_ceil(uint64_t s) {
    if (s <= 1) return 0;
    return 64 - __builtin_clzll(s - 1);
}

Understanding the Bit-Fill Algorithm:
The bit manipulation approach (Approach 2) works by "smearing" the highest set bit downward until all lower bits are set, then adding 1.
Example: s = 100
Step by step:
s = 100 = 0110 0100
s-- = 99 = 0110 0011
s |= s >> 1 → 99 | 49 = 0110 0011 | 0011 0001 = 0111 0011
s |= s >> 2 → 115 | 28 = 0111 0011 | 0001 1100 = 0111 1111
s |= s >> 4 → 127 | 7 = 0111 1111 | 0000 0111 = 0111 1111
(further shifts don't change anything)
s++ = 128 = 1000 0000 ✓
The algorithm fills all bits below the highest, creating 2^n - 1,
then adds 1 to get 2^n.
Why Subtract 1 First?
The initial s-- ensures that exact powers of two round to themselves:
Without s--:
s = 64 = 0100 0000
After fills: 0111 1111
After ++: 128 ✗ (should be 64)
With s--:
s = 64, s-- = 63 = 0011 1111
After fills: 0011 1111
After ++: 64 ✓
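The bit-fill rounding can be verified with a compact, self-contained version (reproduced here so the checks stand alone):

```c
#include <stdint.h>

/* Bit-fill rounding: smear the highest set bit of s-1 downward so all
 * lower bits are 1, then add 1. Exact powers of two map to themselves. */
static uint64_t round_up_pow2(uint64_t s) {
    if (s <= 1) return 1;
    s--;
    s |= s >> 1;  s |= s >> 2;  s |= s >> 4;
    s |= s >> 8;  s |= s >> 16; s |= s >> 32;
    return s + 1;
}
```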
| Approach | Cycles (approx) | Portability | Code Complexity |
|---|---|---|---|
| Iterative | 5-64 (varies with input) | Excellent | Low |
| Bit manipulation | ~12 fixed | Excellent | Medium |
| Compiler builtin | ~3-5 fixed | Good (GCC/Clang) | Low |
| Inline assembly (BSR) | ~2-3 fixed | x86 only | High |
Use compiler builtins (__builtin_clzll for GCC/Clang, _BitScanReverse64 for MSVC) when available. They compile to optimal hardware instructions (LZCNT or BSR) and are both fast and readable. Fall back to the bit manipulation approach for maximum portability.
In the buddy system, the order of a block determines its size: a block of order k has size 2^k. Selecting the minimum and maximum orders involves trade-offs.
Minimum Block Size (MIN_ORDER):
The smallest allocatable block affects per-block metadata overhead, the number of free lists to maintain, and internal fragmentation for tiny requests.
Common MIN_ORDER choices:
MIN_ORDER = 4: 2^4 = 16 bytes
- Fits a free block header (prev + next pointers)
- Good for small allocation workloads
- More management overhead
MIN_ORDER = 5: 2^5 = 32 bytes
- Comfortable for headers with metadata
- Balance between overhead and utility
- Common userspace choice
MIN_ORDER = 12: 2^12 = 4096 bytes = 1 page
- Used by Linux kernel page allocator
- Page is the natural unit for kernel memory
- Sub-page allocations handled by slab
Maximum Block Size (MAX_ORDER):
The largest block bounds the biggest contiguous allocation the allocator can satisfy and sets the depth of the free-list array.
Common MAX_ORDER choices:
MAX_ORDER = 20: 2^20 = 1 MB
- Reasonable for userspace allocators
- Larger requests use mmap() directly
MAX_ORDER = 22: 2^22 = 4 MB
- Matches the Linux page allocator's largest block (order 10 in units of 4 KB pages)
- 1024 pages maximum contiguous allocation
MAX_ORDER = 30: 2^30 = 1 GB
- For systems needing huge contiguous allocations
- Rare in practice
Order Calculation:
Given a request of size S, the order k is:
k = max(MIN_ORDER, ceil(log2(S)))
Implementation using builtins:
int size_to_order(size_t size) {
if (size == 0) return MIN_ORDER;
// Ceiling of log2
int order = (size == 1) ? 0 : 64 - __builtin_clzll(size - 1);
// Clamp to valid range
if (order < MIN_ORDER) order = MIN_ORDER;
if (order > MAX_ORDER) return -1; // Too large
return order;
}
Order to Size:
size_t order_to_size(int order) {
return (size_t)1 << order;
}
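Putting size_to_order and order_to_size together, here is a self-contained sketch; the MIN_ORDER/MAX_ORDER values are the illustrative choices from above, and __builtin_clzll assumes GCC or Clang:

```c
#include <stdint.h>
#include <stddef.h>

#define MIN_ORDER 5    /* 32-byte minimum block, as discussed above */
#define MAX_ORDER 20   /* 1 MB maximum block */

/* Request size -> buddy order (clamped), using a GCC/Clang builtin. */
static int size_to_order(size_t size) {
    if (size <= 1) return MIN_ORDER;
    int order = 64 - __builtin_clzll((uint64_t)(size - 1));  /* ceil(log2) */
    if (order < MIN_ORDER) order = MIN_ORDER;
    if (order > MAX_ORDER) return -1;   /* too large for the buddy system */
    return order;
}

static size_t order_to_size(int order) {
    return (size_t)1 << order;
}
```

A 100-byte request maps to order 7, so the allocator hands out a 128-byte block.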
| Request Size | Calculated Order | Allocated Size | Internal Fragmentation |
|---|---|---|---|
| 1-16 bytes | 4 | 16 bytes | 0-15 bytes (0-94%) |
| 17-32 bytes | 5 | 32 bytes | 0-15 bytes (0-47%) |
| 33-64 bytes | 6 | 64 bytes | 0-31 bytes (0-48%) |
| 65-128 bytes | 7 | 128 bytes | 0-63 bytes (0-49%) |
| 129-256 bytes | 8 | 256 bytes | 0-127 bytes (0-50%) |
| 257-512 bytes | 9 | 512 bytes | 0-255 bytes (0-50%) |
| 513-1024 bytes | 10 | 1024 bytes | 0-511 bytes (0-50%) |
| 1 MB + 1 byte | -1 | N/A | Request too large |
Notice that worst-case internal fragmentation approaches 50% as request sizes increase. A request of (2^k + 1) bytes requires a block of 2^(k+1) bytes, wasting nearly half. This is the fundamental trade-off of power-of-two sizing—efficiency in management at the cost of potential space waste.
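The worst case can be checked directly: a request of 2^k + 1 bytes is granted 2^(k+1) bytes, wasting 2^k - 1 of them. A tiny sketch:

```c
#include <stdint.h>

/* Waste for the pathological request of (2^k + 1) bytes: the allocator
 * must grant 2^(k+1) bytes, leaving 2^k - 1 bytes unused. */
static uint64_t worst_case_waste(int k) {
    uint64_t request = (1ULL << k) + 1;
    uint64_t granted = 1ULL << (k + 1);
    return granted - request;   /* equals (1ULL << k) - 1 */
}
```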
Modern CPUs provide specialized hardware support for power-of-two operations, recognizing their importance in systems software.
Leading Zero Count (LZCNT / CLZ):
Counts the number of leading zero bits in a value:
LZCNT(0x0080) = 56 (for 64-bit)
LZCNT(0x8000000000000000) = 0
Use: 63 - LZCNT(x) = floor(log2(x))
x86 Instructions:
BSR (Bit Scan Reverse): Finds position of highest set bit
LZCNT (Leading Zero Count): Counts leading zeros (AMD/Intel)
BSF (Bit Scan Forward): Finds position of lowest set bit
TZCNT (Trailing Zero Count): Counts trailing zeros
ARM Instructions:
CLZ (Count Leading Zeros): Direct leading zero count
// Using hardware instructions for power-of-two operations

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

// x86-64 using LZCNT (requires LZCNT support, or BSR fallback)
static inline int clz64(uint64_t x) {
    if (x == 0) return 64;
#if defined(__LZCNT__)
    return __builtin_clzll(x);
#else
    // BSR-based fallback (note: undefined for x=0)
    uint64_t result;
    asm("bsrq %1, %0" : "=r"(result) : "r"(x));
    return 63 - result;
#endif
}

// Fast log2 floor using hardware
static inline int floor_log2(uint64_t x) {
    return 63 - clz64(x);
}

// Fast log2 ceiling using hardware
static inline int ceil_log2(uint64_t x) {
    if (x <= 1) return 0;
    return 64 - clz64(x - 1);
}

// Check if power of two (single AND operation)
static inline bool is_pow2(uint64_t x) {
    return x && !(x & (x - 1));
}

// Alignment check: the low bits of an aligned address are zero
static inline bool is_aligned(void* ptr, size_t alignment) {
    // alignment must be power of two
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

// Find the order of a block given its address and the base address
// Assumes buddy system with power-of-two blocks
static inline int order_from_address(uintptr_t addr, uintptr_t base, int max_order) {
    uintptr_t offset = addr - base;
    if (offset == 0) return max_order;
    // The order is determined by the position of the lowest set bit
    // in the offset from base
    return __builtin_ctzll(offset);  // count trailing zeros
}

Population Count (POPCNT):
Counts the number of set bits. Although not directly for power-of-two operations, useful for related calculations:
// Check if exactly one bit set (power of two)
bool is_pow2_popcnt(uint64_t x) {
return __builtin_popcountll(x) == 1;
}
Bit Manipulation Extensions:
Modern CPUs have extended bit manipulation capabilities:
Intel BMI1/BMI2:
BLSR - Reset lowest set bit: x & (x-1)
BLSI - Extract lowest set bit: x & (-x)
BLSMSK - Get mask up to lowest set bit
PDEP/PEXT - Parallel bit deposit/extract
These instructions let allocators clear, extract, and mask free-list bits in a single operation, accelerating bitmap scans and size-class lookups.
Compiler Integration:
Compilers recognize patterns and emit optimal instructions:
// This code:
if ((n & (n - 1)) == 0 && n != 0) { ... }
// May compile to:
// BLSR reg, n ; reg = n & (n-1)
// TEST n, n ; check n != 0
// CMOVZ ...
// JZ target
// Or simply:
// POPCNT reg, n
// CMP reg, 1
// JE target
Modern compilers with optimization enabled (-O2 or higher) automatically recognize power-of-two patterns and emit optimal instructions. Writing clear, idiomatic code (like using __builtin_clz) often produces better results than hand-crafted assembly. Profile before optimizing!
Pure power-of-two sizing has significant internal fragmentation. Production allocators often use hybrid approaches that combine power-of-two benefits with finer granularity.
Approach 1: Multiple Size Classes per Order
jemalloc and similar allocators use size classes that aren't strictly powers of two:
jemalloc size classes (small) example:
8, 16, 32, 48, 64, 80, 96, 112, 128,
160, 192, 224, 256,
320, 384, 448, 512,
...
Pattern: within each power-of-two range, add intermediate sizes
Benefits: worst-case internal fragmentation falls from about 50% to roughly 20-25% with four classes per power-of-two range.
Costs: more free lists to track, and the intermediate sizes cannot be split and coalesced with the simple buddy address computation.
// Size class approaches used by production allocators

// Approach 1: Power-of-two with quartiles
// Each power-of-two range has up to 4 size classes
static const size_t SIZE_CLASSES[] = {
    // Base 16 (order 4): 16 only (too small to subdivide)
    16,
    // Base 32 (order 5): 32 only
    32,
    // Base 64 (order 6): 48, 64
    48, 64,
    // Base 128 (order 7): 80, 96, 112, 128
    80, 96, 112, 128,
    // Base 256 (order 8): 160, 192, 224, 256
    160, 192, 224, 256,
    // Base 512 (order 9): 320, 384, 448, 512
    320, 384, 448, 512,
    // ... and so on
};

#define NUM_SIZE_CLASSES (sizeof(SIZE_CLASSES) / sizeof(SIZE_CLASSES[0]))

// Map request size to size class index
int size_to_class(size_t size) {
    if (size <= 16) return 0;
    if (size <= 32) return 1;
    if (size <= 48) return 2;
    if (size <= 64) return 3;
    // For larger sizes, compute based on the pattern.
    // This is O(1) using bit operations.
    int order = 64 - __builtin_clzll(size - 1);   // ceil(log2(size))
    int base_index = /* lookup table for order */ order_to_base_index[order];
    // Which quartile within this order?
    size_t base = 1UL << (order - 1);  // Previous power of two
    size_t step = base / 4;            // Quartile size
    int quartile = (size - base - 1) / step;
    return base_index + quartile;
}

// Approach 2: jemalloc-style groups
// Larger size classes are spaced further apart
typedef struct {
    size_t size;
    int lg_delta;  // log2 of spacing at this level
    int grp;       // group index
} SizeClass;

// jemalloc uses: within each group, size classes are
// base, base+delta, base+2*delta, base+3*delta
// where delta doubles with each group

Approach 2: Slab Allocation for Common Sizes
Combine buddy allocation for large blocks with specialized handling for common sizes:
Architectural Layers:
┌─────────────────────────────────────────┐
│ User Request (arbitrary size) │
└─────────────────┬───────────────────────┘
│
┌─────────────┴─────────────┐
│ │
▼ ▼
┌─────────┐ ┌──────────────┐
│ Slab │ │ Buddy System │
│ Allocator│ │ (large blocks)│
│ (small) │ │ │
└────┬────┘ └──────┬────────┘
│ │
▼ ▼
┌───────────────────────────────────────┐
│ Physical Pages (from OS) │
└───────────────────────────────────────┘
Slab handles: 8, 16, 32, 64, 96, 128, 192, 256, ...
(with zero internal fragmentation for exact sizes)
Buddy handles: 4KB, 8KB, 16KB, 32KB, ...
(for large allocations that don't match slabs)
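A hypothetical front end for such a hybrid might route requests like this; the size threshold, layer names, and helpers are illustrative assumptions, not a real API:

```c
#include <stddef.h>

#define SLAB_MAX  2048u   /* illustrative: largest size the slab layer serves */
#define PAGE_SIZE 4096u

typedef enum { ROUTE_SLAB, ROUTE_BUDDY } route_t;

/* Decide which layer serves a request: small sizes go to the slab
 * caches, anything larger goes to the buddy allocator in page units. */
static route_t route_request(size_t size) {
    return (size <= SLAB_MAX) ? ROUTE_SLAB : ROUTE_BUDDY;
}

/* Whole pages needed for a large request; the buddy layer then rounds
 * the page count up to a power of two internally. */
static size_t pages_needed(size_t size) {
    return (size + PAGE_SIZE - 1) / PAGE_SIZE;
}
```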
Linux Kernel Approach:
Buddy Allocator
├── Provides pages to SLAB/SLUB/SLOB
└── Direct allocation for multi-page requests
SLAB Allocator (kmalloc)
├── Caches for each power-of-two size
├── Object caching for kernel structures
└── Per-CPU caching for performance
Power-of-two allocation exploits deep properties of binary arithmetic to achieve efficient memory management. While it introduces internal fragmentation, the operational efficiencies—O(1) arithmetic, hardware acceleration, and simple coalescing—often outweigh the space cost.
What's Next:
The final page of this module examines Buddy System Fragmentation—a detailed analysis of both internal fragmentation (from power-of-two rounding) and the specific patterns of external fragmentation that can still occur in buddy systems. Understanding these characteristics enables informed decisions about when buddy allocation is appropriate.
You now understand the mathematical and practical foundations of power-of-two allocation—from binary properties through hardware support to hybrid sizing strategies. This knowledge enables you to implement efficient allocators and make informed decisions about size class design.