Not all memory is created equal. In a modern computer system, physical memory isn't a uniform resource—different regions have different capabilities, constraints, and use cases. Some memory can only be addressed by legacy DMA controllers. Some memory is directly mappable by the kernel, while other regions require special tricks to access. Some memory belongs to specific NUMA nodes with varying access latencies.
Linux organizes physical memory into zones—logical groupings that partition memory based on these hardware constraints and usage patterns. Understanding zones is essential for kernel development, driver writing, and diagnosing memory allocation issues.
This page provides an expert-level examination of Linux memory zones—their purpose, organization, allocation strategies, and the watermark-based reclamation system that keeps the system running smoothly under memory pressure.
By the end of this page, you will understand: (1) why zones exist and what constraints they address, (2) the principal zone types (DMA, DMA32, Normal, HighMem, Movable), (3) how the buddy allocator manages memory within zones, (4) zone watermarks and memory reclamation, (5) zone fallback and allocation strategies, and (6) NUMA nodes and their relationship to zones.
Memory zones address fundamental hardware and software constraints that have evolved over decades of PC architecture. Understanding these constraints illuminates why the kernel's memory management is designed the way it is.
Historical DMA Limitations
Early IBM PC-compatible systems used an Intel 8237 DMA controller that could only address the first 16 MB of memory. Even as systems gained more RAM, legacy ISA devices retained this limitation. Memory above 16 MB couldn't be used for DMA transfers to these devices.
32-bit Address Space Limitations
On 32-bit systems with the default 3 GB/1 GB user/kernel split, the kernel can permanently map only the first 896 MB of physical memory. Memory above that limit ("high memory") can't be permanently mapped—it requires temporary mappings when accessed.
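For illustration, here is a minimal kernel-side sketch (assuming a 32-bit highmem configuration; on 64-bit builds kmap_local_page() simply resolves to the page's permanent mapping) of what such a temporary mapping looks like:

```c
/* Sketch: temporarily map a (possibly highmem) page into kernel address space. */
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/string.h>

static void zero_one_highmem_page(void)
{
        struct page *page = alloc_page(GFP_HIGHUSER); /* may come from ZONE_HIGHMEM */
        void *addr;

        if (!page)
                return;

        addr = kmap_local_page(page);   /* create a short-lived mapping */
        memset(addr, 0, PAGE_SIZE);
        kunmap_local(addr);             /* tear the mapping down promptly */

        __free_page(page);
}
```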
NUMA Architecture
Modern multi-processor systems often have Non-Uniform Memory Access (NUMA) architectures where memory access latency depends on which CPU is accessing which memory bank. Grouping memory by node enables locality-aware allocation.
Memory Hotplug
Virtualized and large systems may add or remove memory dynamically. Zones help organize memory for safe hotplug operations.
Zones don't represent different speeds or qualities of memory—they represent what that memory can be used for. Memory in ZONE_DMA isn't faster or slower than ZONE_NORMAL; it's simply addressable by more (legacy) devices. The zone system ensures that memory requests are satisfied from appropriate regions.
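As a kernel-side sketch (a hypothetical driver context, not a complete module), the zone a request may draw from is selected by GFP zone modifiers rather than by any speed attribute:

```c
/* Sketch: the same allocation call, steered to different zones by GFP modifiers. */
#include <linux/gfp.h>

static void zone_modifier_examples(void)
{
        /* No zone modifier: ZONE_NORMAL preferred, lower zones only as fallback */
        struct page *normal = alloc_pages(GFP_KERNEL, 0);

        /* GFP_DMA: must come from ZONE_DMA (below 16 MB) for legacy ISA-style DMA */
        struct page *dma    = alloc_pages(GFP_KERNEL | GFP_DMA, 0);

        /* GFP_DMA32: must be addressable with 32 bits (below 4 GB) */
        struct page *dma32  = alloc_pages(GFP_KERNEL | GFP_DMA32, 0);

        if (normal)
                __free_pages(normal, 0);
        if (dma)
                __free_pages(dma, 0);
        if (dma32)
                __free_pages(dma32, 0);
}
```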
Linux defines several zone types, though not all are present on every system. The kernel configures zones based on architecture and physical memory layout.
| Zone | 32-bit x86 | 64-bit x86_64 | Purpose |
|---|---|---|---|
| ZONE_DMA | 0-16 MB | 0-16 MB | ISA DMA devices |
| ZONE_DMA32 | N/A | 16 MB - 4 GB | 32-bit DMA devices |
| ZONE_NORMAL | 16 MB - 896 MB | 4 GB and above | General kernel use |
| ZONE_HIGHMEM | 896 MB - end of RAM | N/A | Indirectly mapped memory |
| ZONE_MOVABLE | Configurable | Configurable | Memory hotplug |
```c
/* Zone type enumeration (include/linux/mmzone.h) */
enum zone_type {
#ifdef CONFIG_ZONE_DMA
        ZONE_DMA,       /* 0-16 MB - ISA DMA limit */
#endif
#ifdef CONFIG_ZONE_DMA32
        ZONE_DMA32,     /* 0-4 GB - 32-bit DMA limit */
#endif
        ZONE_NORMAL,    /* Directly mapped memory */
#ifdef CONFIG_HIGHMEM
        ZONE_HIGHMEM,   /* Indirectly mapped (32-bit only) */
#endif
        ZONE_MOVABLE,   /* Movable pages for memory hotplug */
#ifdef CONFIG_ZONE_DEVICE
        ZONE_DEVICE,    /* Device memory */
#endif
        __MAX_NR_ZONES  /* Sentinel */
};

/* Zone structure (simplified) */
struct zone {
        /* Zone name and type */
        const char *name;

        /* Memory statistics */
        unsigned long managed_pages;    /* Pages managed by this zone */
        unsigned long spanned_pages;    /* Total pages in zone span */
        unsigned long present_pages;    /* Physical pages in zone */

        /* Watermarks for memory pressure */
        unsigned long _watermark[NR_WMARK];
        unsigned long watermark_boost;

        /* Free page lists (buddy allocator) */
        struct free_area free_area[MAX_ORDER];

        /* Page reclaim state */
        unsigned long flags;

        /* Zone lock */
        spinlock_t lock;

        /* Per-CPU page caches */
        struct per_cpu_pages __percpu *per_cpu_pageset;

        /* Zone start address */
        unsigned long zone_start_pfn;

        /* NUMA node this zone belongs to */
        struct pglist_data *zone_pgdat;

        /* ... many more fields ... */
};

/* Watermark levels */
enum zone_watermarks {
        WMARK_MIN,      /* Minimum pages - below this, allocation may fail */
        WMARK_LOW,      /* Low watermark - kswapd wakes up */
        WMARK_HIGH,     /* High watermark - kswapd goes to sleep */
        NR_WMARK
};
```

Linux exposes detailed zone information through /proc and /sys interfaces. Understanding how to read this information is essential for debugging memory issues.
```bash
# === /proc/zoneinfo - Detailed zone information ===
cat /proc/zoneinfo

# Example output (truncated):
# Node 0, zone      DMA
#   per-node stats
#       nr_inactive_anon 0
#       nr_active_anon   0
#       nr_inactive_file 134
#       nr_active_file   201
#       ...
#   pages free     3968
#         min      15
#         low      18
#         high     22
#         spanned  4095
#         present  3998
#         managed  3968
#         protection: (0, 1941, 15826, 15826, 15826)
#   ...
#
# Node 0, zone    DMA32
#   pages free     448234
#         min      1964
#         low      2455
#         high     2946
#         spanned  1044480
#         present  782288
#         managed  496862
#         protection: (0, 0, 13884, 13884, 13884)
#   ...
#
# Node 0, zone   Normal
#   pages free     2485632
#         min      14043
#         low      17553
#         high     21064
#         spanned  3670016
#         present  3670016
#         managed  3554398
#         protection: (0, 0, 0, 0, 0)

# === Understanding the numbers ===
# spanned:  Total pages in the zone's address range (including holes)
# present:  Physical pages actually present
# managed:  Pages available for allocation (after kernel reserves)
# free:     Currently free pages
# min/low/high: Watermark levels (see next section)
# protection:   Lowmem reserve for fallback protection

# === /proc/buddyinfo - Buddy allocator free lists ===
cat /proc/buddyinfo
# Node 0, zone      DMA      1      1      0      1      2 ...
# Node 0, zone    DMA32  12043   5234   1234    512    128 ...
# Node 0, zone   Normal  98765  45678  12345   4567   1234 ...
#
# Columns represent order 0, 1, 2, 3, 4, ... (page counts at each order)
# Order 0 = single 4KB pages
# Order 1 = 8KB (2 contiguous pages)
# Order N = 2^N contiguous pages

# === Quick zone summary ===
cat /proc/pagetypeinfo | head -30

# === View zone statistics ===
cat /proc/vmstat | grep -E '^(nr_|pgscan|pgsteal|pageout)'

# === Per-NUMA node memory info ===
numastat
# Or more detailed:
cat /sys/devices/system/node/node0/meminfo

# === Zone-specific sysfs entries ===
ls /sys/devices/system/node/node0/
# cpu0 cpu1 ... meminfo numastat vmstat distance

# View watermarks for adjustments
cat /proc/sys/vm/min_free_kbytes
cat /proc/sys/vm/watermark_scale_factor
```

The /proc/buddyinfo output shows how fragmented memory is. In a healthy system, you should see substantial counts at higher orders (large contiguous regions available). If high-order numbers are zero or very low while low orders have many entries, memory is fragmented—large allocations may fail even with plenty of total free memory.
Within each zone, physical memory is managed by the buddy allocator—a power-of-two allocator that efficiently handles allocation and deallocation of contiguous page ranges.
The Buddy System Concept:
The buddy allocator maintains free lists for different allocation sizes, where each size is a power of two: order 0 is a single 4 KB page, order 1 is 2 contiguous pages (8 KB), order 2 is 4 pages (16 KB), and so on up to order 10 (1024 pages, 4 MB with 4 KB pages).
Allocation: to satisfy a request of order N, the allocator takes a block from the order-N free list. If that list is empty, it takes a block from the next non-empty higher order, splits it in half, returns one half (the "buddy") to the lower-order free list, and repeats until a block of the requested size remains.
Deallocation: when a block is freed, the allocator checks whether its buddy is also free. If so, the two are merged into a block of the next higher order, and the check repeats upward, so contiguous free memory coalesces back into large blocks. The buddy's address is found by simple XOR arithmetic, as the small userspace sketch below illustrates.
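This tiny, self-contained userspace program is an illustration only; it mirrors the kernel's __find_buddy_pfn() rule to show how buddy addresses are computed:

```c
#include <stdio.h>

/* Mirrors the kernel's __find_buddy_pfn(): buddy of a 2^order block at a given PFN */
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
        return pfn ^ (1UL << order);
}

int main(void)
{
        /* Order-2 block (4 pages) starting at PFN 8: its buddy starts at PFN 12 */
        printf("buddy of PFN 8 at order 2: %lu\n", find_buddy_pfn(8, 2));

        /* Once merged into an order-3 block at PFN 8, its buddy is at PFN 0 */
        printf("buddy of PFN 8 at order 3: %lu\n", find_buddy_pfn(8, 3));
        return 0;
}
```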
```c
/* Buddy allocator data structures (include/linux/mmzone.h) */

#define MAX_ORDER 11    /* Order 0-10, so max allocation is 2^10 = 1024 pages */

/* Free area for each order */
struct free_area {
        struct list_head free_list[MIGRATE_TYPES];      /* Per-migratetype lists */
        unsigned long nr_free;                          /* Free blocks at this order */
};

/* Zone contains MAX_ORDER free_areas */
struct zone {
        /* ... */
        struct free_area free_area[MAX_ORDER];
        /* ... */
};

/* === Low-level page allocation functions === */

/* Allocate 2^order contiguous pages from a zone */
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
                           nodemask_t *nodemask);

/* Simplified allocation path */
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order);

/* Allocate single page */
struct page *alloc_page(gfp_t gfp_mask);

/* Get virtual address directly */
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
unsigned long get_zeroed_page(gfp_t gfp_mask);

/* Free pages */
void __free_pages(struct page *page, unsigned int order);
void free_pages(unsigned long addr, unsigned int order);
void free_page(unsigned long addr);

/* === Buddy finding logic === */

/*
 * Buddies are determined by XOR:
 * If a block starts at page frame number (PFN) P and has order O,
 * its buddy starts at PFN (P XOR (1 << O))
 *
 * Example: Order-2 block at PFN 0: buddy is at PFN 4 (0 XOR 4)
 *          Order-2 block at PFN 4: buddy is at PFN 0 (4 XOR 4)
 */
static inline unsigned long
__find_buddy_pfn(unsigned long page_pfn, unsigned int order)
{
        return page_pfn ^ (1 << order);
}

/* Check if two pages are buddies at given order */
static inline bool page_is_buddy(struct page *page, struct page *buddy,
                                 unsigned int order)
{
        /* Must be same zone */
        if (page_zone(page) != page_zone(buddy))
                return false;

        /* Buddy must be free at this order */
        if (!PageBuddy(buddy) || buddy_order(buddy) != order)
                return false;

        return true;
}
```

Modern Linux extends the buddy allocator with migration types (MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE). Pages of each type are grouped together. This helps prevent fragmentation: movable pages can be compacted, while unmovable kernel structures don't fragment movable regions.
Each zone maintains three watermarks that define thresholds for memory pressure response. These watermarks trigger different behaviors to ensure the system can always satisfy memory requests.
The Three Watermarks:
WMARK_HIGH (High watermark)
The zone has comfortable headroom. Once background reclaim pushes free pages above this level, kswapd stops reclaiming and goes back to sleep.
WMARK_LOW (Low watermark)
Free pages have dropped low enough that kswapd is woken to reclaim in the background. Allocations still succeed without blocking the caller.
WMARK_MIN (Minimum watermark)
Memory pressure is critical. Ordinary allocations must now perform direct reclaim (the allocating task reclaims pages itself and stalls); only high-priority and atomic allocations are allowed to dip into the reserve below this mark.
The buffer between low and high is the "safe zone" where kswapd works to maintain free memory without impacting applications.
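For a rough sense of scale, here is a small, self-contained calculation that mirrors the kernel formulas shown below; the zone size and min_free_kbytes values are illustrative assumptions, and the real kernel splits min_free_kbytes proportionally across zones:

```c
#include <stdio.h>

int main(void)
{
        /* Hypothetical zone of ~15 GB in 4 KB pages (illustrative) */
        unsigned long managed_pages = 4000000UL;
        unsigned long min_free_kbytes = 67584;   /* assumed autotuned value */
        unsigned long watermark_scale_factor = 10;  /* default */

        /* Simplification: assume this zone receives the whole min_free_kbytes share */
        unsigned long wmark_min  = min_free_kbytes / 4;  /* kB -> 4 KB pages */
        unsigned long wmark_low  = wmark_min +
                managed_pages * watermark_scale_factor / 10000;
        unsigned long wmark_high = wmark_min +
                2 * managed_pages * watermark_scale_factor / 10000;

        printf("min=%lu low=%lu high=%lu (pages)\n", wmark_min, wmark_low, wmark_high);
        return 0;
}
```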
```c
/* Watermark calculation (mm/page_alloc.c) */

/*
 * min_free_kbytes: Tunable that sets WMARK_MIN
 * Default is calculated based on total memory (typically 0.1% - 5%)
 */
int min_free_kbytes = 1024;     /* Can be tuned via sysctl */

/*
 * watermark_scale_factor: Multiplier for low and high watermarks
 * Default: 10 (meaning low/high are 10/10000 = 0.1% above min)
 */
int watermark_scale_factor = 10;

/* Watermark calculation (simplified) */
void __setup_per_zone_wmarks(void)
{
        unsigned long min_pages = min_free_kbytes >> (PAGE_SHIFT - 10);
        unsigned long lowmem_pages = 0;

        /* First pass: total managed pages, used to split min_free_kbytes per zone */
        for_each_zone(zone)
                lowmem_pages += zone->managed_pages;

        for_each_zone(zone) {
                unsigned long tmp, low, high;

                /* min = proportional share of min_free_kbytes */
                tmp = (min_pages * zone->managed_pages) / lowmem_pages;
                zone->_watermark[WMARK_MIN] = tmp;

                /* low = min + (scale_factor * managed_pages / 10000) */
                low = tmp + (zone->managed_pages * watermark_scale_factor / 10000);
                zone->_watermark[WMARK_LOW] = low;

                /* high = min + 2 * (scale_factor * managed_pages / 10000) */
                high = tmp + 2 * (zone->managed_pages * watermark_scale_factor / 10000);
                zone->_watermark[WMARK_HIGH] = high;
        }
}

/* Check if allocation is allowed given watermarks */
static inline bool zone_watermark_ok(struct zone *z, unsigned int order,
                                     unsigned long mark, int highest_zoneidx,
                                     unsigned int alloc_flags)
{
        unsigned long free_pages = zone_page_state(z, NR_FREE_PAGES);

        /* Each order we can use consumes 2^order pages */
        free_pages -= (1 << order) - 1;

        /* Check against watermark + lowmem reserve */
        if (free_pages <= mark + z->lowmem_reserve[highest_zoneidx])
                return false;

        return true;
}

/* Tuning watermarks */
/*
 * # View current watermarks
 * cat /proc/zoneinfo | grep -E 'min|low|high|free'
 *
 * # Increase minimum free memory (reduces allocation failures)
 * echo 65536 > /proc/sys/vm/min_free_kbytes
 *
 * # Increase watermark gap (more aggressive background reclaim)
 * echo 100 > /proc/sys/vm/watermark_scale_factor
 */
```

Setting min_free_kbytes too low risks allocation failures during bursts. Setting it too high wastes memory that could be used for page cache. For systems with heavy network I/O (where bursts of packet allocations occur), higher values (e.g., 128 MB or more) may be needed. Monitor /proc/vmstat for 'allocstall' events indicating direct reclaim.
When a requested zone is exhausted, the allocator falls back to other zones, but this must be done carefully to prevent lower zones from being depleted by requests that could use higher zones.
Zone Fallback Order:
For GFP_KERNEL allocations (preferring ZONE_NORMAL), the fallback order on x86_64 is: ZONE_NORMAL, then ZONE_DMA32, then ZONE_DMA.
For GFP_DMA allocations: only ZONE_DMA is eligible; there is no fallback upward, so if ZONE_DMA is exhausted the allocation fails.
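In kernel terms, the GFP zone modifiers determine the highest zone a request may use, and the zonelist walk then falls back downward from there. A minimal sketch using the real gfp_zone() helper (the specific return values assume a typical x86_64 configuration):

```c
/* Sketch: gfp_zone() maps GFP flags to the highest permitted zone. */
#include <linux/gfp.h>

static void show_preferred_zones(void)
{
        enum zone_type z;

        z = gfp_zone(GFP_KERNEL);             /* ZONE_NORMAL on typical configs */
        z = gfp_zone(GFP_KERNEL | GFP_DMA32); /* capped at ZONE_DMA32 */
        z = gfp_zone(GFP_KERNEL | GFP_DMA);   /* capped at ZONE_DMA */
        (void)z;
}
```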
The Problem with Unrestricted Fallback:
Imagine ZONE_NORMAL becomes full. Without protection, GFP_KERNEL requests would spill into ZONE_DMA32 and then ZONE_DMA and could consume them entirely. A driver that later needs a buffer below 16 MB would find nothing left, even though the system still has plenty of free memory that those earlier requests could have used instead.
Lowmem Reserve to the Rescue:
Each zone maintains a lowmem reserve—a per-zone buffer that prevents higher-zone requests from completely depleting lower zones. When checking watermarks, the reserve is added to ensure lower zones retain emergency capacity.
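To make the arithmetic concrete, this small standalone sketch reproduces the reserve calculation using the ZONE_NORMAL "managed" count from the /proc/zoneinfo sample earlier and the default ratio of 256; the result matches the third protection entry shown for ZONE_DMA32 in that sample:

```c
#include <stdio.h>

int main(void)
{
        /* Values taken from the /proc/zoneinfo sample above */
        unsigned long normal_managed = 3554398; /* ZONE_NORMAL managed pages */
        unsigned long ratio = 256;              /* lowmem_reserve_ratio entry */

        /* Pages ZONE_DMA32 holds back from Normal-preferring allocations
         * that fall back into it: higher_zone_pages / ratio */
        printf("DMA32 protection against Normal fallback: %lu pages\n",
               normal_managed / ratio);
        return 0;
}
```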
```c
/* Zone fallback list (per-node) */
struct pglist_data {
        /* Zones in this node */
        struct zone node_zones[MAX_NR_ZONES];

        /* Zonelist for fallback order */
        struct zonelist node_zonelists[MAX_ZONELISTS];

        /* ... */
};

/* Zonelist structure - ordered list of zones for allocation */
struct zonelist {
        struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
};

struct zoneref {
        struct zone *zone;      /* Pointer to zone */
        int zone_idx;           /* Index of zone type */
};

/* Lowmem reserve ratio */
/*
 * lowmem_reserve_ratio controls how much of each lower zone is
 * protected from fallback allocations.
 *
 * Default: 256, meaning 1/256 of the higher zone's size is reserved
 * in each lower zone.
 */
int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
#ifdef CONFIG_ZONE_DMA
        256,    /* DMA zone reserve ratio */
#endif
#ifdef CONFIG_ZONE_DMA32
        256,    /* DMA32 zone reserve ratio */
#endif
        32,     /* Normal zone reserve ratio */
#ifdef CONFIG_HIGHMEM
        0,      /* HighMem doesn't protect lower zones */
#endif
};

/* Calculate lowmem reserves (simplified) */
static void calculate_lowmem_reserve(struct zone *zone)
{
        /*
         * For each higher zone type, calculate reserve:
         * reserve[higher_zone] = higher_zone_pages / ratio
         */
        for_each_zone(higher_zone) {
                if (zone_idx(higher_zone) <= zone_idx(zone))
                        continue;

                zone->lowmem_reserve[zone_idx(higher_zone)] =
                        higher_zone->managed_pages /
                        sysctl_lowmem_reserve_ratio[zone_idx(zone)];
        }
}

/* View and adjust lowmem reserve */
/*
 * View current protection values (in /proc/zoneinfo):
 *   protection: (0, 1941, 15826, 15826, 15826)
 *
 * Meaning for ZONE_DMA:
 *   DMA can't fall back to itself, so 0
 *   1941 pages protected from DMA32 allocations
 *   15826 pages protected from Normal allocations
 *   etc.
 *
 * Adjust ratios:
 *   cat /proc/sys/vm/lowmem_reserve_ratio
 *   # 256 256 32 0 0
 *
 *   echo "256 256 64 0 0" > /proc/sys/vm/lowmem_reserve_ratio
 */
```

In /proc/zoneinfo, the 'protection: (a, b, c, d, e)' line shows how many pages are reserved in this zone to protect against allocations from each zone type. Larger values mean more protection but less memory available for fallback. Adjust lowmem_reserve_ratio if you see DMA allocation failures.
On NUMA systems, memory is distributed across multiple nodes, each typically associated with one or more CPUs. Linux organizes zones on a per-node basis to enable locality-aware allocation.
pg_data_t (pglist_data): The Per-Node Structure
Each NUMA node is represented by a pg_data_t structure containing: the node's zones (node_zones), the zonelists that define fallback order, the node ID and the page-frame range it spans, the per-node LRU lists used by reclaim, and the node's own kswapd and kcompactd threads.
Allocation Strategy: by default, the kernel allocates from the zones of the node local to the requesting CPU; if those zones are below their watermarks, it falls back to other nodes in order of increasing NUMA distance.
The kernel builds zonelists that encode this preference order for each node.
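A minimal kernel-side sketch (assumed to run in a module or driver context) of node-aware allocation; the choice of node 1 here is purely illustrative:

```c
/* Sketch: allocate a page on the local node, and one pinned to node 1 if online. */
#include <linux/gfp.h>
#include <linux/topology.h>
#include <linux/nodemask.h>

static struct page *grab_local_and_remote(void)
{
        int nid = numa_node_id();       /* node of the current CPU */
        struct page *local, *remote = NULL;

        /* Prefer the local node; may fall back along the node's zonelist */
        local = alloc_pages_node(nid, GFP_KERNEL, 0);

        /* __GFP_THISNODE: fail rather than fall back to another node */
        if (node_online(1))
                remote = alloc_pages_node(1, GFP_KERNEL | __GFP_THISNODE, 0);

        if (remote)
                __free_pages(remote, 0);
        return local;
}
```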
```c
/* Per-node data structure (include/linux/mmzone.h) */
typedef struct pglist_data {
        /* All zones for this node */
        struct zone node_zones[MAX_NR_ZONES];

        /* Zonelists with fallback order */
        struct zonelist node_zonelists[MAX_ZONELISTS];

        /* Number of zones in this node */
        int nr_zones;

        /* Node identification */
        int node_id;

        /* Starting page frame number */
        unsigned long node_start_pfn;

        /* Total pages in node */
        unsigned long node_present_pages;       /* Physical pages */
        unsigned long node_spanned_pages;       /* Including holes */

        /* Page reclaim state */
        struct lruvec lruvec;                   /* LRU lists for this node */
        unsigned long flags;                    /* Node state flags */

        /* Page reclaim control */
        wait_queue_head_t kswapd_wait;
        struct task_struct *kswapd;             /* kswapd for this node */
        int kswapd_order;

        /* Compaction state */
        struct task_struct *kcompactd;

        /* ... */
} pg_data_t;

/* Global array of per-node structures */
extern struct pglist_data *node_data[];

/* Get local node ID */
static inline int numa_node_id(void)
{
        return raw_cpu_read(current_node);
}

/* Get structure for a node */
#define NODE_DATA(nid)  (node_data[nid])

/* Iterate over all nodes */
#define for_each_online_node(nid) \
        for (nid = 0; nid < MAX_NUMNODES; nid++) \
                if (node_online(nid))

/* NUMA allocation flags */
/*
 * GFP_THISNODE: Allocate only from current node (fail if not available)
 * __GFP_THISNODE: Hint to prefer current node
 *
 * Default behavior: try local, fall back to remote
 */

/* View NUMA topology and memory */
/*
 * $ numactl --hardware
 * available: 2 nodes (0-1)
 * node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
 * node 0 size: 32768 MB
 * node 0 free: 24576 MB
 * node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
 * node 1 size: 32768 MB
 * node 1 free: 25600 MB
 * node distances:
 * node   0   1
 *   0:  10  21
 *   1:  21  10
 *
 * $ cat /sys/devices/system/node/node0/meminfo
 * Node 0 MemTotal:  33554432 kB
 * Node 0 MemFree:   25165824 kB
 * Node 0 MemUsed:    8388608 kB
 * ...
 */
```

| Policy | Description | Use Case |
|---|---|---|
| Local allocation | Prefer current node, fall back to others | Default for most allocations |
| Strict local (THISNODE) | Only current node, fail if unavailable | Latency-critical allocations |
| Interleave | Round-robin across nodes | Large shared allocations |
| Bind | Only specified nodes | Application-level NUMA control |
| Preferred | Prefer specified node, allow fallback | Soft affinity |
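From userspace, these policies are applied with numactl or the mbind()/set_mempolicy() system calls. The following is a rough sketch (it assumes libnuma's numaif.h is installed and the program is linked with -lnuma, and that nodes 0 and 1 exist) that interleaves a large anonymous mapping across two nodes:

```c
/* Sketch: interleave a 64 MB anonymous mapping across NUMA nodes 0 and 1. */
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
        size_t len = 64UL << 20;        /* 64 MB */
        unsigned long nodes = 0x3;      /* bitmask: nodes 0 and 1 */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;

        /* Pages are then allocated round-robin across the given nodes on first touch */
        if (mbind(buf, len, MPOL_INTERLEAVE, &nodes, sizeof(nodes) * 8, 0) != 0)
                perror("mbind");
        return 0;
}
```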
Over time, memory becomes fragmented—plenty of free pages exist, but they're scattered and can't satisfy high-order allocations. Memory compaction addresses this by migrating movable pages to consolidate free space.
How Compaction Works: within a zone, a migration scanner walks upward from the bottom collecting in-use movable pages, while a free scanner walks downward from the top collecting free pages. Movable pages are migrated into the free pages near the top of the zone, so free space coalesces into large contiguous regions at the bottom. The scanners stop when they meet or when a sufficiently large free block has been created.
When Compaction Runs: directly, when a high-order allocation fails its watermark check (direct compaction); in the background, via each node's kcompactd thread; proactively, governed by vm.compaction_proactiveness; and on demand, by writing to /proc/sys/vm/compact_memory.
Migration Types:
Not all pages can be moved, so the buddy allocator tags page blocks by mobility: MIGRATE_MOVABLE pages (anonymous and page-cache pages) can be migrated by copying them and updating the mappings that reference them; MIGRATE_RECLAIMABLE pages (for example, reclaimable slab caches) cannot be moved but can be freed under pressure; MIGRATE_UNMOVABLE pages (core kernel allocations) are pinned in place, which is why grouping them apart from movable pages matters. A short kernel-side sketch of the kind of allocation that triggers compaction follows below.
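As a kernel-side illustration (a sketch, not a recommendation), a high-order GFP_KERNEL allocation is what ends up invoking direct compaction when the buddy free lists hold no block large enough:

```c
/* Sketch: an order-9 (2 MB with 4 KB pages) allocation. GFP_KERNEL permits
 * direct reclaim, so a fragmented zone may trigger direct compaction here.
 * __GFP_NORETRY bounds how hard the allocator (and compaction) will try. */
#include <linux/gfp.h>

static struct page *try_alloc_2mb(void)
{
        return alloc_pages(GFP_KERNEL | __GFP_NORETRY | __GFP_COMP, 9);
}
```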
```bash
# === Compaction Statistics ===
cat /proc/vmstat | grep compact
# compact_migrate_scanned 1234567
# compact_free_scanned    2345678
# compact_stall           123     # Direct compaction stalls
# compact_fail            45      # Compaction failures
# compact_success         78      # Successful compactions
# compact_isolated        5678    # Pages isolated for migration
# compact_daemon_migrate_scanned 98765
# compact_daemon_free_scanned    87654

# === Trigger Manual Compaction ===
# For specific node:
echo 1 > /sys/devices/system/node/node0/compact

# For all zones:
echo 1 > /proc/sys/vm/compact_memory

# === Compaction Tuning ===

# Proactive compaction (reduces fragmentation before problems occur)
echo 20 > /proc/sys/vm/compaction_proactiveness
# Range 0-100: 0=disabled, 100=most aggressive

# Extfrag threshold (when to trigger compaction)
cat /proc/sys/vm/extfrag_threshold
# Lower values: compact more often for high-order allocations

# === Viewing Fragmentation ===
cat /proc/buddyinfo
# Shows free blocks at each order
# Healthy: Substantial counts at high orders (8, 9, 10)
# Fragmented: Only low orders have counts

# Detailed fragmentation info:
cat /proc/pagetypeinfo
# Shows fragmentation by migration type

# Extfrag index (0=not fragmented, 1=very fragmented)
cat /sys/kernel/debug/extfrag/extfrag_index
```

Compaction is why transparent huge pages (THP) can work—without compaction, systems would quickly become too fragmented for 2 MB allocations. However, compaction has overhead and can cause latency spikes. For systems requiring predictable latency, consider preallocating huge pages at boot rather than relying on THP and compaction.
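As a userspace illustration of the THP path mentioned above (a sketch: MADV_HUGEPAGE is only a hint, and whether 2 MB mappings actually materialize depends on the THP mode and on compaction finding contiguous memory):

```c
/* Sketch: hint the kernel to back an anonymous region with transparent huge pages. */
#include <sys/mman.h>
#include <string.h>

int main(void)
{
        size_t len = 32UL << 20;        /* 32 MB, a multiple of the 2 MB THP size */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;

        madvise(buf, len, MADV_HUGEPAGE);       /* request THP for this range */
        memset(buf, 0, len);                    /* touch pages so they are allocated */
        return 0;
}
```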
Memory zones are a foundational concept in Linux memory management, bridging hardware constraints with kernel allocation needs. To consolidate the key concepts:
- Zones partition each node's physical memory by what it can be used for (DMA, DMA32, Normal, HighMem, Movable), not by speed.
- Within each zone, the buddy allocator hands out power-of-two blocks of pages, splitting on allocation and merging buddies on free.
- Watermarks drive reclaim: kswapd wakes below the low mark and sleeps above the high mark, while dropping below the min mark forces direct reclaim.
- Zonelists define fallback order, and lowmem reserves keep scarce low zones from being drained by requests that could have used higher zones.
- On NUMA systems, each node has its own zones, zonelists, and kswapd, and allocations prefer the local node.
- Compaction migrates movable pages to rebuild large contiguous free regions, making high-order allocations and THP viable.
What's Next:
When memory pressure becomes severe and reclamation cannot keep up with demand, the system faces a critical decision: which process should be terminated to free memory? The next page explores the OOM Killer—Linux's mechanism for surviving out-of-memory conditions.
You now have an expert understanding of Linux memory zones—their purpose, organization, allocation strategies, and the mechanisms that keep the system stable under memory pressure. This knowledge is essential for kernel development, system tuning, and understanding allocation failures.