Not all memory is created equal. In a modern computer system, physical memory isn't a uniform resource—different regions have different capabilities, constraints, and use cases. Some memory can only be addressed by legacy DMA controllers. Some memory is directly mappable by the kernel, while other regions require special tricks to access. Some memory belongs to specific NUMA nodes with varying access latencies.
Linux organizes physical memory into zones—logical groupings that partition memory based on these hardware constraints and usage patterns. Understanding zones is essential for kernel development, driver writing, and diagnosing memory allocation issues.
This page provides an expert-level examination of Linux memory zones—their purpose, organization, allocation strategies, and the watermark-based reclamation system that keeps the system running smoothly under memory pressure.
By the end of this page, you will understand: (1) why zones exist and what constraints they address, (2) the principal zone types (DMA, DMA32, Normal, HighMem, Movable), (3) how the buddy allocator manages memory within zones, (4) zone watermarks and memory reclamation, (5) zone fallback and allocation strategies, and (6) NUMA nodes and their relationship to zones.
Memory zones address fundamental hardware and software constraints that have evolved over decades of PC architecture. Understanding these constraints illuminates why the kernel's memory management is designed the way it is.
Historical DMA Limitations
Early IBM PC-compatible systems used an Intel 8237 DMA controller that could only address the first 16 MB of memory. Even as systems gained more RAM, legacy ISA devices retained this limitation. Memory above 16 MB couldn't be used for DMA transfers to these devices.
32-bit Address Space Limitations
On 32-bit systems with the default 3 GB/1 GB user/kernel split, the kernel can permanently map only the first 896 MB of physical memory. Memory above that limit ("high memory") can't be permanently mapped—it requires temporary mappings when accessed.
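For illustration, here is a minimal kernel-side sketch (assuming a 32-bit highmem configuration; on 64-bit builds kmap_local_page() simply resolves to the page's permanent mapping) of what such a temporary mapping looks like:

```c
/* Sketch: temporarily map a (possibly highmem) page into kernel address space. */
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/string.h>

static void zero_one_highmem_page(void)
{
        struct page *page = alloc_page(GFP_HIGHUSER); /* may come from ZONE_HIGHMEM */
        void *addr;

        if (!page)
                return;

        addr = kmap_local_page(page);   /* create a short-lived mapping */
        memset(addr, 0, PAGE_SIZE);
        kunmap_local(addr);             /* tear the mapping down promptly */

        __free_page(page);
}
```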
NUMA Architecture
Modern multi-processor systems often have Non-Uniform Memory Access (NUMA) architectures where memory access latency depends on which CPU is accessing which memory bank. Grouping memory by node enables locality-aware allocation.
Memory Hotplug
Virtualized and large systems may add or remove memory dynamically. Zones help organize memory for safe hotplug operations.
Zones don't represent different speeds or qualities of memory—they represent what that memory can be used for. Memory in ZONE_DMA isn't faster or slower than ZONE_NORMAL; it's simply addressable by more (legacy) devices. The zone system ensures that memory requests are satisfied from appropriate regions.
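As a kernel-side sketch (a hypothetical driver context, not a complete module), the zone a request may draw from is selected by GFP zone modifiers rather than by any speed attribute:

```c
/* Sketch: the same allocation call, steered to different zones by GFP modifiers. */
#include <linux/gfp.h>

static void zone_modifier_examples(void)
{
        /* No zone modifier: ZONE_NORMAL preferred, lower zones only as fallback */
        struct page *normal = alloc_pages(GFP_KERNEL, 0);

        /* GFP_DMA: must come from ZONE_DMA (below 16 MB) for legacy ISA-style DMA */
        struct page *dma    = alloc_pages(GFP_KERNEL | GFP_DMA, 0);

        /* GFP_DMA32: must be addressable with 32 bits (below 4 GB) */
        struct page *dma32  = alloc_pages(GFP_KERNEL | GFP_DMA32, 0);

        if (normal)
                __free_pages(normal, 0);
        if (dma)
                __free_pages(dma, 0);
        if (dma32)
                __free_pages(dma32, 0);
}
```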
Linux defines several zone types, though not all are present on every system. The kernel configures zones based on architecture and physical memory layout.
| Zone | 32-bit x86 | 64-bit x86_64 | Purpose |
|---|---|---|---|
| ZONE_DMA | 0-16 MB | 0-16 MB | ISA DMA devices |
| ZONE_DMA32 | N/A | 16 MB - 4 GB | 32-bit DMA devices |
| ZONE_NORMAL | 16 MB - 896 MB | 4 GB and above | General kernel use |
| ZONE_HIGHMEM | 896 MB - end of RAM | N/A | Indirectly mapped memory |
| ZONE_MOVABLE | Configurable | Configurable | Memory hotplug |
```c
/* Zone type enumeration (include/linux/mmzone.h) */
enum zone_type {
#ifdef CONFIG_ZONE_DMA
        ZONE_DMA,       /* 0-16 MB - ISA DMA limit */
#endif
#ifdef CONFIG_ZONE_DMA32
        ZONE_DMA32,     /* 0-4 GB - 32-bit DMA limit */
#endif
        ZONE_NORMAL,    /* Directly mapped memory */
#ifdef CONFIG_HIGHMEM
        ZONE_HIGHMEM,   /* Indirectly mapped (32-bit only) */
#endif
        ZONE_MOVABLE,   /* Movable pages for memory hotplug */
#ifdef CONFIG_ZONE_DEVICE
        ZONE_DEVICE,    /* Device memory */
#endif
        __MAX_NR_ZONES  /* Sentinel */
};

/* Zone structure (simplified) */
struct zone {
        /* Zone name and type */
        const char *name;

        /* Memory statistics */
        unsigned long managed_pages;    /* Pages managed by this zone */
        unsigned long spanned_pages;    /* Total pages in zone span */
        unsigned long present_pages;    /* Physical pages in zone */

        /* Watermarks for memory pressure */
        unsigned long _watermark[NR_WMARK];
        unsigned long watermark_boost;

        /* Free page lists (buddy allocator) */
        struct free_area free_area[MAX_ORDER];

        /* Page reclaim state */
        unsigned long flags;

        /* Zone lock */
        spinlock_t lock;

        /* Per-CPU page caches */
        struct per_cpu_pages __percpu *per_cpu_pageset;

        /* Zone start address */
        unsigned long zone_start_pfn;

        /* NUMA node this zone belongs to */
        struct pglist_data *zone_pgdat;

        /* ... many more fields ... */
};

/* Watermark levels */
enum zone_watermarks {
        WMARK_MIN,      /* Minimum pages - below this, allocation may fail */
        WMARK_LOW,      /* Low watermark - kswapd wakes up */
        WMARK_HIGH,     /* High watermark - kswapd goes to sleep */
        NR_WMARK
};
```

Linux exposes detailed zone information through /proc and /sys interfaces. Understanding how to read this information is essential for debugging memory issues.
```bash
# === /proc/zoneinfo - Detailed zone information ===
cat /proc/zoneinfo

# Example output (truncated):
# Node 0, zone      DMA
#   per-node stats
#       nr_inactive_anon 0
#       nr_active_anon   0
#       nr_inactive_file 134
#       nr_active_file   201
#       ...
#   pages free     3968
#         min      15
#         low      18
#         high     22
#         spanned  4095
#         present  3998
#         managed  3968
#         protection: (0, 1941, 15826, 15826, 15826)
#   ...
#
# Node 0, zone    DMA32
#   pages free     448234
#         min      1964
#         low      2455
#         high     2946
#         spanned  1044480
#         present  782288
#         managed  496862
#         protection: (0, 0, 13884, 13884, 13884)
#   ...
#
# Node 0, zone   Normal
#   pages free     2485632
#         min      14043
#         low      17553
#         high     21064
#         spanned  3670016
#         present  3670016
#         managed  3554398
#         protection: (0, 0, 0, 0, 0)

# === Understanding the numbers ===
# spanned:  Total pages in the zone's address range (including holes)
# present:  Physical pages actually present
# managed:  Pages available for allocation (after kernel reserves)
# free:     Currently free pages
# min/low/high: Watermark levels (see next section)
# protection:   Lowmem reserve for fallback protection

# === /proc/buddyinfo - Buddy allocator free lists ===
cat /proc/buddyinfo
# Node 0, zone      DMA      1      1      0      1      2 ...
# Node 0, zone    DMA32  12043   5234   1234    512    128 ...
# Node 0, zone   Normal  98765  45678  12345   4567   1234 ...
#
# Columns represent order 0, 1, 2, 3, 4, ... (page counts at each order)
# Order 0 = single 4KB pages
# Order 1 = 8KB (2 contiguous pages)
# Order N = 2^N contiguous pages

# === Quick zone summary ===
cat /proc/pagetypeinfo | head -30

# === View zone statistics ===
cat /proc/vmstat | grep -E '^(nr_|pgscan|pgsteal|pageout)'

# === Per-NUMA node memory info ===
numastat
# Or more detailed:
cat /sys/devices/system/node/node0/meminfo

# === Zone-specific sysfs entries ===
ls /sys/devices/system/node/node0/
# cpu0 cpu1 ... meminfo numastat vmstat distance

# View watermarks for adjustments
cat /proc/sys/vm/min_free_kbytes
cat /proc/sys/vm/watermark_scale_factor
```

The /proc/buddyinfo output shows how fragmented memory is. In a healthy system, you should see substantial counts at higher orders (large contiguous regions available). If high-order numbers are zero or very low while low orders have many entries, memory is fragmented—large allocations may fail even with plenty of total free memory.
Within each zone, physical memory is managed by the buddy allocator—a power-of-two allocator that efficiently handles allocation and deallocation of contiguous page ranges.
The Buddy System Concept:
The buddy allocator maintains free lists for different allocation sizes, where each size is a power of two: order 0 is a single 4 KB page, order 1 is 2 contiguous pages (8 KB), order 2 is 4 pages (16 KB), and so on up to order 10 (1024 pages, 4 MB with 4 KB pages).
Allocation: to satisfy a request of order N, the allocator takes a block from the order-N free list. If that list is empty, it takes a block from the next non-empty higher order, splits it in half, returns one half (the "buddy") to the lower-order free list, and repeats until a block of the requested size remains.
Deallocation: when a block is freed, the allocator checks whether its buddy is also free. If so, the two are merged into a block of the next higher order, and the check repeats upward, so contiguous free memory coalesces back into large blocks. The buddy's address is found by simple XOR arithmetic, as the small userspace sketch below illustrates.
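This tiny, self-contained userspace program is an illustration only; it mirrors the kernel's __find_buddy_pfn() rule to show how buddy addresses are computed:

```c
#include <stdio.h>

/* Mirrors the kernel's __find_buddy_pfn(): buddy of a 2^order block at a given PFN */
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
        return pfn ^ (1UL << order);
}

int main(void)
{
        /* Order-2 block (4 pages) starting at PFN 8: its buddy starts at PFN 12 */
        printf("buddy of PFN 8 at order 2: %lu\n", find_buddy_pfn(8, 2));

        /* Once merged into an order-3 block at PFN 8, its buddy is at PFN 0 */
        printf("buddy of PFN 8 at order 3: %lu\n", find_buddy_pfn(8, 3));
        return 0;
}
```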
```c
/* Buddy allocator data structures (include/linux/mmzone.h) */

#define MAX_ORDER 11    /* Order 0-10, so max allocation is 2^10 = 1024 pages */

/* Free area for each order */
struct free_area {
        struct list_head free_list[MIGRATE_TYPES];      /* Per-migratetype lists */
        unsigned long nr_free;                          /* Free blocks at this order */
};

/* Zone contains MAX_ORDER free_areas */
struct zone {
        /* ... */
        struct free_area free_area[MAX_ORDER];
        /* ... */
};

/* === Low-level page allocation functions === */

/* Allocate 2^order contiguous pages from a zone */
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
                           nodemask_t *nodemask);

/* Simplified allocation path */
struct page *alloc_pages(gfp_t gfp_mask, unsigned int order);

/* Allocate single page */
struct page *alloc_page(gfp_t gfp_mask);

/* Get virtual address directly */
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
unsigned long get_zeroed_page(gfp_t gfp_mask);

/* Free pages */
void __free_pages(struct page *page, unsigned int order);
void free_pages(unsigned long addr, unsigned int order);
void free_page(unsigned long addr);

/* === Buddy finding logic === */

/*
 * Buddies are determined by XOR:
 * If a block starts at page frame number (PFN) P and has order O,
 * its buddy starts at PFN (P XOR (1 << O))
 *
 * Example: Order-2 block at PFN 0: buddy is at PFN 4 (0 XOR 4)
 *          Order-2 block at PFN 4: buddy is at PFN 0 (4 XOR 4)
 */
static inline unsigned long
__find_buddy_pfn(unsigned long page_pfn, unsigned int order)
{
        return page_pfn ^ (1 << order);
}

/* Check if two pages are buddies at given order */
static inline bool page_is_buddy(struct page *page, struct page *buddy,
                                 unsigned int order)
{
        /* Must be same zone */
        if (page_zone(page) != page_zone(buddy))
                return false;

        /* Buddy must be free at this order */
        if (!PageBuddy(buddy) || buddy_order(buddy) != order)
                return false;

        return true;
}
```

Modern Linux extends the buddy allocator with migration types (MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE). Pages of each type are grouped together. This helps prevent fragmentation: movable pages can be compacted, while unmovable kernel structures don't fragment movable regions.
Each zone maintains three watermarks that define thresholds for memory pressure response. These watermarks trigger different behaviors to ensure the system can always satisfy memory requests.
The Three Watermarks:
WMARK_HIGH (High watermark)
The zone has comfortable headroom. Once background reclaim pushes free pages above this level, kswapd stops reclaiming and goes back to sleep.
WMARK_LOW (Low watermark)
Free pages have dropped low enough that kswapd is woken to reclaim in the background. Allocations still succeed without blocking the caller.
WMARK_MIN (Minimum watermark)
Memory pressure is critical. Ordinary allocations must now perform direct reclaim (the allocating task reclaims pages itself and stalls); only high-priority and atomic allocations are allowed to dip into the reserve below this mark.
The buffer between low and high is the "safe zone" where kswapd works to maintain free memory without impacting applications.
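For a rough sense of scale, here is a small, self-contained calculation that mirrors the kernel formulas shown below; the zone size and min_free_kbytes values are illustrative assumptions, and the real kernel splits min_free_kbytes proportionally across zones:

```c
#include <stdio.h>

int main(void)
{
        /* Hypothetical zone of ~15 GB in 4 KB pages (illustrative) */
        unsigned long managed_pages = 4000000UL;
        unsigned long min_free_kbytes = 67584;   /* assumed autotuned value */
        unsigned long watermark_scale_factor = 10;  /* default */

        /* Simplification: assume this zone receives the whole min_free_kbytes share */
        unsigned long wmark_min  = min_free_kbytes / 4;  /* kB -> 4 KB pages */
        unsigned long wmark_low  = wmark_min +
                managed_pages * watermark_scale_factor / 10000;
        unsigned long wmark_high = wmark_min +
                2 * managed_pages * watermark_scale_factor / 10000;

        printf("min=%lu low=%lu high=%lu (pages)\n", wmark_min, wmark_low, wmark_high);
        return 0;
}
```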
```c
/* Watermark calculation (mm/page_alloc.c) */

/*
 * min_free_kbytes: Tunable that sets WMARK_MIN
 * Default is calculated based on total memory (typically 0.1% - 5%)
 */
int min_free_kbytes = 1024;     /* Can be tuned via sysctl */

/*
 * watermark_scale_factor: Multiplier for low and high watermarks
 * Default: 10 (meaning low/high are 10/10000 = 0.1% above min)
 */
int watermark_scale_factor = 10;

/* Watermark calculation (simplified) */
void __setup_per_zone_wmarks(void)
{
        unsigned long min_pages = min_free_kbytes >> (PAGE_SHIFT - 10);
        unsigned long lowmem_pages = 0;

        /* First pass: total managed pages, used to split min_free_kbytes per zone */
        for_each_zone(zone)
                lowmem_pages += zone->managed_pages;

        for_each_zone(zone) {
                unsigned long tmp, low, high;

                /* min = proportional share of min_free_kbytes */
                tmp = (min_pages * zone->managed_pages) / lowmem_pages;
                zone->_watermark[WMARK_MIN] = tmp;

                /* low = min + (scale_factor * managed_pages / 10000) */
                low = tmp + (zone->managed_pages * watermark_scale_factor / 10000);
                zone->_watermark[WMARK_LOW] = low;

                /* high = min + 2 * (scale_factor * managed_pages / 10000) */
                high = tmp + 2 * (zone->managed_pages * watermark_scale_factor / 10000);
                zone->_watermark[WMARK_HIGH] = high;
        }
}

/* Check if allocation is allowed given watermarks */
static inline bool zone_watermark_ok(struct zone *z, unsigned int order,
                                     unsigned long mark, int highest_zoneidx,
                                     unsigned int alloc_flags)
{
        unsigned long free_pages = zone_page_state(z, NR_FREE_PAGES);

        /* Each order we can use consumes 2^order pages */
        free_pages -= (1 << order) - 1;

        /* Check against watermark + lowmem reserve */
        if (free_pages <= mark + z->lowmem_reserve[highest_zoneidx])
                return false;

        return true;
}

/* Tuning watermarks */
/*
 * # View current watermarks
 * cat /proc/zoneinfo | grep -E 'min|low|high|free'
 *
 * # Increase minimum free memory (reduces allocation failures)
 * echo 65536 > /proc/sys/vm/min_free_kbytes
 *
 * # Increase watermark gap (more aggressive background reclaim)
 * echo 100 > /proc/sys/vm/watermark_scale_factor
 */
```

Setting min_free_kbytes too low risks allocation failures during bursts. Setting it too high wastes memory that could be used for page cache. For systems with heavy network I/O (where bursts of packet allocations occur), higher values (e.g., 128 MB or more) may be needed. Monitor /proc/vmstat for 'allocstall' events indicating direct reclaim.
When a requested zone is exhausted, the allocator falls back to other zones, but this must be done carefully to prevent lower zones from being depleted by requests that could use higher zones.
Zone Fallback Order:
For GFP_KERNEL allocations (preferring ZONE_NORMAL), the fallback order on x86_64 is: ZONE_NORMAL, then ZONE_DMA32, then ZONE_DMA.
For GFP_DMA allocations: only ZONE_DMA is eligible; there is no fallback upward, so if ZONE_DMA is exhausted the allocation fails.
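In kernel terms, the GFP zone modifiers determine the highest zone a request may use, and the zonelist walk then falls back downward from there. A minimal sketch using the real gfp_zone() helper (the specific return values assume a typical x86_64 configuration):

```c
/* Sketch: gfp_zone() maps GFP flags to the highest permitted zone. */
#include <linux/gfp.h>

static void show_preferred_zones(void)
{
        enum zone_type z;

        z = gfp_zone(GFP_KERNEL);             /* ZONE_NORMAL on typical configs */
        z = gfp_zone(GFP_KERNEL | GFP_DMA32); /* capped at ZONE_DMA32 */
        z = gfp_zone(GFP_KERNEL | GFP_DMA);   /* capped at ZONE_DMA */
        (void)z;
}
```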
The Problem with Unrestricted Fallback:
Imagine ZONE_NORMAL becomes full. Without protection, GFP_KERNEL requests would spill into ZONE_DMA32 and then ZONE_DMA and could consume them entirely. A driver that later needs a buffer below 16 MB would find nothing left, even though the system still has plenty of free memory that those earlier requests could have used instead.
Lowmem Reserve to the Rescue:
Each zone maintains a lowmem reserve—a per-zone buffer that prevents higher-zone requests from completely depleting lower zones. When checking watermarks, the reserve is added to ensure lower zones retain emergency capacity.
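To make the arithmetic concrete, this small standalone sketch reproduces the reserve calculation using the ZONE_NORMAL "managed" count from the /proc/zoneinfo sample earlier and the default ratio of 256; the result matches the third protection entry shown for ZONE_DMA32 in that sample:

```c
#include <stdio.h>

int main(void)
{
        /* Values taken from the /proc/zoneinfo sample above */
        unsigned long normal_managed = 3554398; /* ZONE_NORMAL managed pages */
        unsigned long ratio = 256;              /* lowmem_reserve_ratio entry */

        /* Pages ZONE_DMA32 holds back from Normal-preferring allocations
         * that fall back into it: higher_zone_pages / ratio */
        printf("DMA32 protection against Normal fallback: %lu pages\n",
               normal_managed / ratio);
        return 0;
}
```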
```c
/* Zone fallback list (per-node) */
struct pglist_data {
        /* Zones in this node */
        struct zone node_zones[MAX_NR_ZONES];

        /* Zonelist for fallback order */
        struct zonelist node_zonelists[MAX_ZONELISTS];

        /* ... */
};

/* Zonelist structure - ordered list of zones for allocation */
struct zonelist {
        struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
};

struct zoneref {
        struct zone *zone;      /* Pointer to zone */
        int zone_idx;           /* Index of zone type */
};

/* Lowmem reserve ratio */
/*
 * lowmem_reserve_ratio controls how much of each lower zone is
 * protected from fallback allocations.
 *
 * Default: 256, meaning 1/256 of the higher zone's size is reserved
 * in each lower zone.
 */
int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
#ifdef CONFIG_ZONE_DMA
        256,    /* DMA zone reserve ratio */
#endif
#ifdef CONFIG_ZONE_DMA32
        256,    /* DMA32 zone reserve ratio */
#endif
        32,     /* Normal zone reserve ratio */
#ifdef CONFIG_HIGHMEM
        0,      /* HighMem doesn't protect lower zones */
#endif
};

/* Calculate lowmem reserves (simplified) */
static void calculate_lowmem_reserve(struct zone *zone)
{
        /*
         * For each higher zone type, calculate reserve:
         * reserve[higher_zone] = higher_zone_pages / ratio
         */
        for_each_zone(higher_zone) {
                if (zone_idx(higher_zone) <= zone_idx(zone))
                        continue;

                zone->lowmem_reserve[zone_idx(higher_zone)] =
                        higher_zone->managed_pages /
                        sysctl_lowmem_reserve_ratio[zone_idx(zone)];
        }
}

/* View and adjust lowmem reserve */
/*
 * View current protection values (in /proc/zoneinfo):
 *   protection: (0, 1941, 15826, 15826, 15826)
 *
 * Meaning for ZONE_DMA:
 *   DMA can't fall back to itself, so 0
 *   1941 pages protected from DMA32 allocations
 *   15826 pages protected from Normal allocations
 *   etc.
 *
 * Adjust ratios:
 *   cat /proc/sys/vm/lowmem_reserve_ratio
 *   # 256 256 32 0 0
 *
 *   echo "256 256 64 0 0" > /proc/sys/vm/lowmem_reserve_ratio
 */
```

In /proc/zoneinfo, the 'protection: (a, b, c, d, e)' line shows how many pages are reserved in this zone to protect against allocations from each zone type. Larger values mean more protection but less memory available for fallback. Adjust lowmem_reserve_ratio if you see DMA allocation failures.
On NUMA systems, memory is distributed across multiple nodes, each typically associated with one or more CPUs. Linux organizes zones on a per-node basis to enable locality-aware allocation.
pg_data_t (pglist_data): The Per-Node Structure
Each NUMA node is represented by a pg_data_t structure containing: the node's zones (node_zones), the zonelists that define fallback order, the node ID and the page-frame range it spans, the per-node LRU lists used by reclaim, and the node's own kswapd and kcompactd threads.
Allocation Strategy: by default, the kernel allocates from the zones of the node local to the requesting CPU; if those zones are below their watermarks, it falls back to other nodes in order of increasing NUMA distance.
The kernel builds zonelists that encode this preference order for each node.
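A minimal kernel-side sketch (assumed to run in a module or driver context) of node-aware allocation; the choice of node 1 here is purely illustrative:

```c
/* Sketch: allocate a page on the local node, and one pinned to node 1 if online. */
#include <linux/gfp.h>
#include <linux/topology.h>
#include <linux/nodemask.h>

static struct page *grab_local_and_remote(void)
{
        int nid = numa_node_id();       /* node of the current CPU */
        struct page *local, *remote = NULL;

        /* Prefer the local node; may fall back along the node's zonelist */
        local = alloc_pages_node(nid, GFP_KERNEL, 0);

        /* __GFP_THISNODE: fail rather than fall back to another node */
        if (node_online(1))
                remote = alloc_pages_node(1, GFP_KERNEL | __GFP_THISNODE, 0);

        if (remote)
                __free_pages(remote, 0);
        return local;
}
```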
```c
/* Per-node data structure (include/linux/mmzone.h) */
typedef struct pglist_data {
        /* All zones for this node */
        struct zone node_zones[MAX_NR_ZONES];

        /* Zonelists with fallback order */
        struct zonelist node_zonelists[MAX_ZONELISTS];

        /* Number of zones in this node */
        int nr_zones;

        /* Node identification */
        int node_id;

        /* Starting page frame number */
        unsigned long node_start_pfn;

        /* Total pages in node */
        unsigned long node_present_pages;       /* Physical pages */
        unsigned long node_spanned_pages;       /* Including holes */

        /* Page reclaim state */
        struct lruvec lruvec;                   /* LRU lists for this node */
        unsigned long flags;                    /* Node state flags */

        /* Page reclaim control */
        wait_queue_head_t kswapd_wait;
        struct task_struct *kswapd;             /* kswapd for this node */
        int kswapd_order;

        /* Compaction state */
        struct task_struct *kcompactd;

        /* ... */
} pg_data_t;

/* Global array of per-node structures */
extern struct pglist_data *node_data[];

/* Get local node ID */
static inline int numa_node_id(void)
{
        return raw_cpu_read(current_node);
}

/* Get structure for a node */
#define NODE_DATA(nid)  (node_data[nid])

/* Iterate over all nodes */
#define for_each_online_node(nid) \
        for (nid = 0; nid < MAX_NUMNODES; nid++) \
                if (node_online(nid))

/* NUMA allocation flags */
/*
 * GFP_THISNODE: Allocate only from current node (fail if not available)
 * __GFP_THISNODE: Hint to prefer current node
 *
 * Default behavior: try local, fall back to remote
 */

/* View NUMA topology and memory */
/*
 * $ numactl --hardware
 * available: 2 nodes (0-1)
 * node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
 * node 0 size: 32768 MB
 * node 0 free: 24576 MB
 * node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
 * node 1 size: 32768 MB
 * node 1 free: 25600 MB
 * node distances:
 * node   0   1
 *   0:  10  21
 *   1:  21  10
 *
 * $ cat /sys/devices/system/node/node0/meminfo
 * Node 0 MemTotal:  33554432 kB
 * Node 0 MemFree:   25165824 kB
 * Node 0 MemUsed:    8388608 kB
 * ...
 */
```

| Policy | Description | Use Case |
|---|---|---|
| Local allocation | Prefer current node, fall back to others | Default for most allocations |
| Strict local (THISNODE) | Only current node, fail if unavailable | Latency-critical allocations |
| Interleave | Round-robin across nodes | Large shared allocations |
| Bind | Only specified nodes | Application-level NUMA control |
| Preferred | Prefer specified node, allow fallback | Soft affinity |
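From userspace, these policies are applied with numactl or the mbind()/set_mempolicy() system calls. The following is a rough sketch (it assumes libnuma's numaif.h is installed and the program is linked with -lnuma, and that nodes 0 and 1 exist) that interleaves a large anonymous mapping across two nodes:

```c
/* Sketch: interleave a 64 MB anonymous mapping across NUMA nodes 0 and 1. */
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
        size_t len = 64UL << 20;        /* 64 MB */
        unsigned long nodes = 0x3;      /* bitmask: nodes 0 and 1 */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;

        /* Pages are then allocated round-robin across the given nodes on first touch */
        if (mbind(buf, len, MPOL_INTERLEAVE, &nodes, sizeof(nodes) * 8, 0) != 0)
                perror("mbind");
        return 0;
}
```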
Over time, memory becomes fragmented—plenty of free pages exist, but they're scattered and can't satisfy high-order allocations. Memory compaction addresses this by migrating movable pages to consolidate free space.
How Compaction Works: within a zone, a migration scanner walks upward from the bottom collecting in-use movable pages, while a free scanner walks downward from the top collecting free pages. Movable pages are migrated into the free pages near the top of the zone, so free space coalesces into large contiguous regions at the bottom. The scanners stop when they meet or when a sufficiently large free block has been created.
When Compaction Runs: directly, when a high-order allocation fails its watermark check (direct compaction); in the background, via each node's kcompactd thread; proactively, governed by vm.compaction_proactiveness; and on demand, by writing to /proc/sys/vm/compact_memory.
Migration Types:
Not all pages can be moved, so the buddy allocator tags page blocks by mobility: MIGRATE_MOVABLE pages (anonymous and page-cache pages) can be migrated by copying them and updating the mappings that reference them; MIGRATE_RECLAIMABLE pages (for example, reclaimable slab caches) cannot be moved but can be freed under pressure; MIGRATE_UNMOVABLE pages (core kernel allocations) are pinned in place, which is why grouping them apart from movable pages matters. A short kernel-side sketch of the kind of allocation that triggers compaction follows below.
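As a kernel-side illustration (a sketch, not a recommendation), a high-order GFP_KERNEL allocation is what ends up invoking direct compaction when the buddy free lists hold no block large enough:

```c
/* Sketch: an order-9 (2 MB with 4 KB pages) allocation. GFP_KERNEL permits
 * direct reclaim, so a fragmented zone may trigger direct compaction here.
 * __GFP_NORETRY bounds how hard the allocator (and compaction) will try. */
#include <linux/gfp.h>

static struct page *try_alloc_2mb(void)
{
        return alloc_pages(GFP_KERNEL | __GFP_NORETRY | __GFP_COMP, 9);
}
```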
```bash
# === Compaction Statistics ===
cat /proc/vmstat | grep compact
# compact_migrate_scanned 1234567
# compact_free_scanned    2345678
# compact_stall           123     # Direct compaction stalls
# compact_fail            45      # Compaction failures
# compact_success         78      # Successful compactions
# compact_isolated        5678    # Pages isolated for migration
# compact_daemon_migrate_scanned 98765
# compact_daemon_free_scanned    87654

# === Trigger Manual Compaction ===
# For specific node:
echo 1 > /sys/devices/system/node/node0/compact

# For all zones:
echo 1 > /proc/sys/vm/compact_memory

# === Compaction Tuning ===

# Proactive compaction (reduces fragmentation before problems occur)
echo 20 > /proc/sys/vm/compaction_proactiveness
# Range 0-100: 0=disabled, 100=most aggressive

# Extfrag threshold (when to trigger compaction)
cat /proc/sys/vm/extfrag_threshold
# Lower values: compact more often for high-order allocations

# === Viewing Fragmentation ===
cat /proc/buddyinfo
# Shows free blocks at each order
# Healthy: Substantial counts at high orders (8, 9, 10)
# Fragmented: Only low orders have counts

# Detailed fragmentation info:
cat /proc/pagetypeinfo
# Shows fragmentation by migration type

# Extfrag index (0=not fragmented, 1=very fragmented)
cat /sys/kernel/debug/extfrag/extfrag_index
```

Compaction is why transparent huge pages (THP) can work—without compaction, systems would quickly become too fragmented for 2 MB allocations. However, compaction has overhead and can cause latency spikes. For systems requiring predictable latency, consider preallocating huge pages at boot rather than relying on THP and compaction.
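As a userspace illustration of the THP path mentioned above (a sketch: MADV_HUGEPAGE is only a hint, and whether 2 MB mappings actually materialize depends on the THP mode and on compaction finding contiguous memory):

```c
/* Sketch: hint the kernel to back an anonymous region with transparent huge pages. */
#include <sys/mman.h>
#include <string.h>

int main(void)
{
        size_t len = 32UL << 20;        /* 32 MB, a multiple of the 2 MB THP size */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;

        madvise(buf, len, MADV_HUGEPAGE);       /* request THP for this range */
        memset(buf, 0, len);                    /* touch pages so they are allocated */
        return 0;
}
```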
Memory zones are a foundational concept in Linux memory management, bridging hardware constraints with kernel allocation needs. To consolidate the key concepts:
- Zones partition each node's physical memory by what it can be used for (DMA, DMA32, Normal, HighMem, Movable), not by speed.
- Within each zone, the buddy allocator hands out power-of-two blocks of pages, splitting on allocation and merging buddies on free.
- Watermarks drive reclaim: kswapd wakes below the low mark and sleeps above the high mark, while dropping below the min mark forces direct reclaim.
- Zonelists define fallback order, and lowmem reserves keep scarce low zones from being drained by requests that could have used higher zones.
- On NUMA systems, each node has its own zones, zonelists, and kswapd, and allocations prefer the local node.
- Compaction migrates movable pages to rebuild large contiguous free regions, making high-order allocations and THP viable.
What's Next:
When memory pressure becomes severe and reclamation cannot keep up with demand, the system faces a critical decision: which process should be terminated to free memory? The next page explores the OOM Killer—Linux's mechanism for surviving out-of-memory conditions.
You now have an expert understanding of Linux memory zones—their purpose, organization, allocation strategies, and the mechanisms that keep the system stable under memory pressure. This knowledge is essential for kernel development, system tuning, and understanding allocation failures.