The Linux kernel constantly creates and destroys data structures—task_struct for processes, inode for files, sk_buff for network packets, dentry for directory entries. These allocations happen millions of times per second on busy systems. A naive malloc/free approach would be catastrophically slow.
Enter the slab allocator—a sophisticated object caching system designed by Jeff Bonwick for Solaris and adapted for Linux. The slab allocator recognizes that most kernel allocations are for objects of known, fixed sizes. Rather than allocating and freeing individual objects from a general-purpose heap, it maintains caches of pre-initialized objects that can be quickly dispensed and returned.
This page examines the slab allocator's design, its three Linux implementations (SLAB, SLUB, SLOB), and the techniques that make it extraordinarily efficient.
By the end of this page, you will understand: (1) why traditional allocators are insufficient for kernel needs, (2) the slab cache architecture and object lifecycle, (3) differences between SLAB, SLUB, and SLOB implementations, (4) how to create and use kernel caches, (5) per-CPU caching and NUMA awareness, and (6) debugging and monitoring slab allocator behavior.
To appreciate the slab allocator, we must first understand why general-purpose allocators like malloc are unsuitable for kernel workloads.
The Challenges:
1. Internal Fragmentation
General-purpose allocators maintain free lists of various sizes. When you request 64 bytes and only a 128-byte block is available, half the block is wasted. Over time, this internal fragmentation consumes significant memory.
2. Allocation Overhead
Each allocation requires searching free lists, potentially splitting blocks, and maintaining metadata. For small, frequent allocations, this overhead dominates the actual object size.
3. Cache Line Pollution
General allocators don't consider CPU cache behavior. Objects of the same type may be scattered across memory, causing poor cache utilization and more L1/L2 misses.
4. Initialization Cost
Many kernel objects require complex initialization (zeroing, setting up pointers, allocating sub-objects). If an object is freed and immediately reallocated, reinitializing it is wasteful—the previous state might be largely reusable.
5. Locking Contention
A single global allocator lock becomes a severe bottleneck on multi-core systems. Every allocation and free operation contends for this lock.
Most kernel allocations are for a small number of well-known object types: inodes, dentries, task_structs, socket buffers, etc. If we maintain separate caches for each object type, we can eliminate fragmentation (objects are always the right size), reduce initialization overhead (objects can retain partial state), and optimize cache behavior (similar objects stored together).
| Aspect | Traditional Allocator | Slab Allocator |
|---|---|---|
| Fragmentation | Variable block sizes cause waste | Fixed-size objects eliminate internal fragmentation |
| Allocation Speed | Free list search, potential splitting | O(1) from pre-allocated cache |
| Initialization | Full re-initialization on each allocation | Constructor/destructor can preserve state |
| Cache Behavior | Objects scattered randomly | Objects grouped by type, cache-aligned |
| Locking | Global lock contention | Per-CPU and per-cache structures |
| Metadata | Per-allocation headers | Per-slab metadata, minimal per-object |
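To make the contrast concrete, here is a minimal sketch of the two call sites side by side (struct foo and foo_cache are hypothetical names used only for illustration):

```c
#include <linux/slab.h>
#include <linux/list.h>

struct foo {
	int id;
	struct list_head link;
};

static struct kmem_cache *foo_cache;	/* created once with kmem_cache_create() */

static struct foo *foo_get_dedicated(void)
{
	/* Dedicated cache: exact object size, stored alongside its peers,
	 * optionally pre-constructed. */
	return kmem_cache_alloc(foo_cache, GFP_KERNEL);
}

static struct foo *foo_get_generic(void)
{
	/* General-purpose path: the request is rounded up to the nearest
	 * kmalloc size class and served from an anonymous cache. */
	return kmalloc(sizeof(struct foo), GFP_KERNEL);
}
```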
The slab allocator organizes memory in a three-level hierarchy: caches, slabs, and objects.
Cache (kmem_cache)
A cache is a pool of objects of a single type. Each cache is created for a specific data structure—one cache for inodes, another for dentries, another for task_structs. The cache maintains the object size and alignment, an optional constructor, per-CPU freelists for fast allocation, and per-NUMA-node lists of the slabs that back it.
Slab
A slab is a contiguous memory region (typically one or more pages) containing multiple objects. Slabs are categorized as full (every object allocated), partial (a mix of allocated and free objects), or empty (all objects free, making the slab a candidate for return to the page allocator).
Object
The unit of allocation. Objects within a slab are pre-sized, properly aligned, and optionally pre-initialized. Getting a free object is simply returning a pointer from the free list.
```c
/* Core slab allocator structures (simplified) */

/* A cache manages objects of one type */
struct kmem_cache {
	/* Per-CPU data for fast path allocation */
	struct kmem_cache_cpu __percpu *cpu_slab;

	/* Object properties */
	unsigned int object_size;	/* Size of each object */
	unsigned int size;		/* Size with padding/alignment */
	unsigned int align;		/* Alignment requirement */
	unsigned int offset;		/* Free pointer offset within object */

	/* Slab properties */
	unsigned int oo;		/* Objects per slab (encoded) */
	unsigned int min;		/* Minimum partial slabs to keep */
	gfp_t allocflags;		/* Allocation flags */

	/* Constructor/destructor */
	void (*ctor)(void *);		/* Object constructor */

	/* Slab lists */
	struct kmem_cache_node *node[MAX_NUMNODES];	/* Per-NUMA-node lists */

	/* Identification */
	const char *name;		/* Cache name (e.g., "task_struct") */
	struct list_head list;		/* All caches linked list */

	/* Debug and statistics */
	int refcount;			/* Reference count */
	unsigned long flags;		/* SLAB_* flags */
	/* ... more fields ... */
};

/* Per-NUMA-node slab lists */
struct kmem_cache_node {
	spinlock_t list_lock;		/* Protects partial list */
	unsigned long nr_partial;	/* Number of partial slabs */
	struct list_head partial;	/* Partial slab list */

	/* Full list only if debugging enabled */
#ifdef CONFIG_SLUB_DEBUG
	unsigned long nr_slabs;
	struct list_head full;
#endif
};

/* Per-CPU slab data (SLUB) */
struct kmem_cache_cpu {
	void **freelist;		/* Pointer to next free object */
	unsigned long tid;		/* Transaction ID for lock-free ops */
	struct page *page;		/* Current "active" slab page */
	struct page *partial;		/* Per-CPU partial slab list */
};

/* Slab page metadata (stored in struct page) */
/* In Linux, slab metadata is stored in the page struct itself */
struct page {
	/* ... other fields ... */

	/* When used as slab page: */
	struct kmem_cache *slab_cache;	/* Owning cache */
	void *freelist;			/* First free object in slab */
	unsigned int inuse;		/* Objects in use count */
	unsigned int objects;		/* Total objects in slab */
	/* ... */
};
```

Understanding the allocation and free paths is crucial for appreciating slab allocator efficiency.
Allocation Path (Fast Path):
The ideal allocation path—serving the vast majority of requests—requires no locking: the allocator reads the head of the current CPU's freelist, advances the freelist to the next free object with a single compare-and-exchange, and returns the pointer.
This entire operation is essentially a pointer manipulation—extremely fast.
Allocation Path (Slow Path):
When the per-CPU cache is empty, the allocator falls back through progressively more expensive steps: it first tries the per-CPU partial slab list, then takes the per-node list lock and pulls a slab from that node's partial list, and finally asks the page allocator for a brand-new slab.
```c
/* Allocation: kmem_cache_alloc() (SLUB implementation, simplified) */

void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
{
	void *object;

	/* Fast path: try per-CPU freelist */
	object = __slab_alloc_fast(s);
	if (likely(object))
		return object;

	/* Slow path: need to refill or allocate new slab */
	return __slab_alloc_slow(s, gfpflags);
}

/* Fast path - lockless, CPU-local */
static inline void *__slab_alloc_fast(struct kmem_cache *s)
{
	struct kmem_cache_cpu *c;
	void *object;
	unsigned long tid;

	/* Disable preemption to ensure we stay on same CPU */
	c = this_cpu_ptr(s->cpu_slab);

	/* Try to allocate from current freelist atomically */
	do {
		tid = c->tid;
		object = c->freelist;
		if (unlikely(!object))
			return NULL;	/* Need slow path */
	} while (!this_cpu_cmpxchg_double(
			s->cpu_slab->freelist, s->cpu_slab->tid,
			object, tid,
			get_freepointer(s, object), next_tid(tid)));

	return object;
}

/* Slow path - may need locks, page allocation */
static void *__slab_alloc_slow(struct kmem_cache *s, gfp_t gfpflags)
{
	struct page *page;
	void *object;

	/* Try to refill from per-CPU partial list */
	page = this_cpu_read(s->cpu_slab->partial);
	if (page) {
		/* Activate this partial slab */
		object = acquire_slab(s, page);
		if (object)
			return object;
	}

	/* Try per-node partial list (requires lock) */
	page = get_partial_node(s, get_node(s, numa_node_id()));
	if (page) {
		object = get_freelist(s, page);
		return object;
	}

	/* Must allocate new slab */
	page = allocate_slab(s, gfpflags);
	if (!page)
		return NULL;	/* Out of memory */

	/* Initialize new slab's freelist */
	page->freelist = setup_objects(s, page);

	/* Get first object */
	object = page->freelist;
	page->freelist = get_freepointer(s, object);

	return object;
}

/* Freeing: kmem_cache_free() */
void kmem_cache_free(struct kmem_cache *s, void *object)
{
	struct kmem_cache_cpu *c;
	struct page *page;

	/* Find the slab page containing this object */
	page = virt_to_head_page(object);

	/* Fast path: if object belongs to current CPU's slab */
	c = this_cpu_ptr(s->cpu_slab);
	if (likely(page == c->page)) {
		/* Just push onto freelist - lockless */
		set_freepointer(s, object, c->freelist);
		c->freelist = object;
		return;
	}

	/* Slow path: object from different CPU or node */
	__slab_free_slow(s, page, object);
}

/* Slow path for cross-CPU or cross-node free */
static void __slab_free_slow(struct kmem_cache *s, struct page *page,
			     void *object)
{
	/* This requires locks and may move slabs between lists */
	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
	bool was_full = (page->freelist == NULL);	/* no free objects before this free */

	spin_lock(&n->list_lock);

	/* Add object to page's freelist */
	set_freepointer(s, object, page->freelist);
	page->freelist = object;
	page->inuse--;

	if (page->inuse == 0) {
		/* Slab now empty - consider freeing */
		if (n->nr_partial >= s->min) {	/* min_partial in mainline SLUB */
			/* We have enough partials, free this slab */
			discard_slab(s, page);
		}
	} else if (was_full) {
		/* Slab was full, now partial - add to partial list */
		add_partial(n, page);
	}

	spin_unlock(&n->list_lock);
}
```

When an object is free, the slab allocator stores the pointer to the next free object within the object's own memory. This eliminates the need for separate free list structures—the objects themselves form the list. The 'offset' field in kmem_cache specifies where within the object this pointer is stored.
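A minimal sketch of how that embedded pointer is read and written (simplified; real SLUB additionally obfuscates the stored value when CONFIG_SLAB_FREELIST_HARDENED is enabled):

```c
/* Read the "next free object" pointer stored at s->offset inside a free object. */
static inline void *get_freepointer(struct kmem_cache *s, void *object)
{
	return *(void **)((char *)object + s->offset);
}

/* Store the "next free object" pointer inside an object being freed. */
static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
{
	*(void **)((char *)object + s->offset) = fp;
}
```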
Linux has evolved three slab allocator implementations, each optimized for different scenarios. Understanding their differences is important for kernel configuration and performance tuning.
| Feature | SLAB | SLUB (Default) | SLOB |
|---|---|---|---|
| Status | Removed (2024) | Active (default) | Removed (2023) |
| Complexity | High | Medium | Low |
| Memory Overhead | High | Low | Minimal |
| Per-CPU Caching | Yes (array-based) | Yes (freelist) | No |
| Slab Metadata | Separate struct | In struct page | Minimal |
| NUMA Awareness | Yes | Yes | No |
| Debug Support | CONFIG option | Runtime option | Limited |
| Best For | Many caches | General use | Tiny systems |
Unless you're maintaining legacy systems or working on extremely resource-constrained embedded devices, SLUB is the implementation you'll encounter. Modern kernel development and debugging tools are built around SLUB, and it offers the best balance of performance, memory efficiency, and debuggability.
Kernel developers create dedicated caches for frequently-allocated data structures. Here's how to create, use, and destroy slab caches:
```c
/* Example: Creating and using a slab cache in a kernel module */

#include <linux/slab.h>
#include <linux/module.h>

/* Our example data structure */
struct my_object {
	unsigned long id;
	char name[64];
	struct list_head list;
	spinlock_t lock;
	/* ... */
};

/* Global cache pointer */
static struct kmem_cache *my_cache;

/* Constructor - called when new objects are created */
static void my_object_ctor(void *ptr)
{
	struct my_object *obj = ptr;

	/* Initialize fields that can be reused */
	spin_lock_init(&obj->lock);
	INIT_LIST_HEAD(&obj->list);
	/* Note: id and name set by caller, not constructor */
}

/* Module initialization */
static int __init my_module_init(void)
{
	/* Create the cache */
	my_cache = kmem_cache_create(
		"my_object_cache",		/* Cache name (shows in /proc/slabinfo) */
		sizeof(struct my_object),	/* Object size */
		0,				/* Alignment (0 = default) */
		SLAB_HWCACHE_ALIGN |		/* Align to cache line */
		SLAB_PANIC,			/* Panic if creation fails */
		my_object_ctor			/* Constructor (or NULL) */
	);
	if (!my_cache)
		return -ENOMEM;

	pr_info("my_object_cache created\n");
	return 0;
}

/* Allocate an object */
struct my_object *my_object_alloc(gfp_t gfp_flags)
{
	struct my_object *obj;
	static unsigned long next_id = 1;	/* not SMP-safe; for illustration only */

	/* kmem_cache_alloc returns pre-initialized object */
	obj = kmem_cache_alloc(my_cache, gfp_flags);
	if (!obj)
		return NULL;

	/* Complete initialization */
	obj->id = next_id++;
	obj->name[0] = '\0';

	return obj;
}

/* Alternative: Allocate and zero (ignores constructor) */
struct my_object *my_object_alloc_zero(gfp_t gfp_flags)
{
	return kmem_cache_zalloc(my_cache, gfp_flags);
}

/* Free an object */
void my_object_free(struct my_object *obj)
{
	if (!obj)
		return;

	/* Clean up any resources held by object */
	/* Note: constructor-initialized fields can be left as-is */

	kmem_cache_free(my_cache, obj);
}

/* Module cleanup */
static void __exit my_module_exit(void)
{
	/* Destroy the cache - all objects must be freed first! */
	if (my_cache) {
		kmem_cache_destroy(my_cache);
		pr_info("my_object_cache destroyed\n");
	}
}

module_init(my_module_init);
module_exit(my_module_exit);

/* === Common Slab API Functions === */

/* Create cache with full control */
struct kmem_cache *kmem_cache_create(
	const char *name,		/* Cache name for debugging */
	unsigned int size,		/* Object size */
	unsigned int align,		/* Minimum alignment */
	slab_flags_t flags,		/* SLAB_* flags */
	void (*ctor)(void *)		/* Constructor */
);

/* Create cache whose objects contain a region that may be copied to/from user space */
struct kmem_cache *kmem_cache_create_usercopy(
	const char *name,
	unsigned int size,
	unsigned int align,
	slab_flags_t flags,
	unsigned int useroffset,	/* Offset of user-copyable region */
	unsigned int usersize,		/* Size of user-copyable region */
	void (*ctor)(void *)
);

/* Allocate from cache */
void *kmem_cache_alloc(struct kmem_cache *cache, gfp_t flags);
void *kmem_cache_zalloc(struct kmem_cache *cache, gfp_t flags);

/* Free to cache */
void kmem_cache_free(struct kmem_cache *cache, void *object);

/* Bulk operations (more efficient than loop) */
int kmem_cache_alloc_bulk(struct kmem_cache *cache, gfp_t flags,
			  size_t size, void **p);
void kmem_cache_free_bulk(struct kmem_cache *cache, size_t size, void **p);

/* Destroy cache */
void kmem_cache_destroy(struct kmem_cache *cache);
```

While dedicated caches are preferred for known object types, the kernel also needs general-purpose allocation for variable-sized buffers. The kmalloc family of functions provides this, built on top of size-specific slab caches.
How kmalloc Works:
The kernel maintains a set of anonymous caches for power-of-two sizes (32, 64, 128, 256, ... 4096, 8192, etc.), plus a few in-between sizes such as 96 and 192 bytes. When you call kmalloc(100, GFP_KERNEL), the allocator rounds the request up to the nearest size class (128 bytes in this case), allocates an object from the kmalloc-128 cache, and returns the pointer.
This gives you the speed benefits of slab allocation with malloc-like flexibility.
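A minimal sketch of the rounding idea, assuming only power-of-two classes (the real kernel uses kmalloc_index(), which also knows about the 96- and 192-byte classes):

```c
#include <stddef.h>

/*
 * Standalone illustration (not kernel code): round a request up to the
 * next power-of-two size class, the way kmalloc(100) lands in kmalloc-128.
 */
static size_t kmalloc_size_class(size_t size)
{
	size_t class = 8;	/* smallest class; the minimum varies by config */

	while (class < size)
		class <<= 1;	/* 100 -> 128, 300 -> 512, ... */
	return class;
}
```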
```c
/* kmalloc family of functions */

#include <linux/slab.h>

/* Basic allocation */
void *ptr = kmalloc(size, GFP_KERNEL);

/* Zero-initialized allocation */
void *zptr = kzalloc(size, GFP_KERNEL);

/* Array allocation (checks for overflow) */
void *arr = kmalloc_array(n, element_size, GFP_KERNEL);
void *zarr = kcalloc(n, element_size, GFP_KERNEL);	/* zeroed */

/* String duplication */
char *str = kstrdup("hello", GFP_KERNEL);
char *nstr = kstrndup(src, max_len, GFP_KERNEL);

/* Memory duplication */
void *copy = kmemdup(src, size, GFP_KERNEL);

/* Resize allocation */
void *new = krealloc(ptr, new_size, GFP_KERNEL);

/* Free */
kfree(ptr);

/* Size query - what size did we actually get? */
size_t actual = ksize(ptr);

/* === GFP (Get Free Pages) Flags === */

GFP_KERNEL	/* Normal kernel allocation - may sleep */
GFP_ATOMIC	/* Cannot sleep - use in interrupt context */
GFP_NOWAIT	/* Don't wait for memory to become available */
GFP_NOFS	/* Don't call into filesystem (avoid recursion) */
GFP_NOIO	/* Don't call into I/O layer */
GFP_DMA		/* Require DMA-capable memory */
GFP_DMA32	/* Require 32-bit DMA-capable memory */
GFP_HIGHUSER	/* User-space allocation from high memory */

/* === Viewing kmalloc Caches === */
/*
 * $ cat /proc/slabinfo | grep kmalloc
 * kmalloc-8192      1024    1024   8192    4  8 ...
 * kmalloc-4096      2048    2048   4096    8  8 ...
 * kmalloc-2048      4096    4096   2048   16  8 ...
 * kmalloc-1024      8192    8192   1024   32  8 ...
 * kmalloc-512      16384   16384    512   32  4 ...
 * kmalloc-256      32768   32768    256   32  2 ...
 * kmalloc-192      21333   21333    192   21  1 ...
 * kmalloc-128      65536   65536    128   32  1 ...
 * kmalloc-96       42666   42666     96   42  1 ...
 * kmalloc-64       65536   65536     64   64  1 ...
 * kmalloc-32      131072  131072     32  128  1 ...
 */

/* For allocations beyond the kmalloc maximum, use vmalloc or kvmalloc */
void *big = vmalloc(large_size);	/* Virtually contiguous */
void *any = kvmalloc(size, GFP_KERNEL);	/* Try kmalloc, fall back to vmalloc */
```

kmalloc has size limits. The maximum varies by architecture but is typically 4-8 MB. For larger allocations, use vmalloc(), which provides virtually contiguous (but physically scattered) memory, or kvmalloc(), which tries kmalloc first and falls back to vmalloc. Note that vmalloc memory cannot be used for DMA.
On multi-core and NUMA systems, the slab allocator employs several optimizations to minimize contention and maximize memory locality.
Per-CPU Freelists:
Each CPU maintains its own freelist for each cache. This eliminates lock contention for the common case: an allocation pops the head of the local CPU's freelist, and freeing a locally owned object pushes it straight back, with no locks taken and no cache lines bouncing between CPUs.
Objects tend to stay on the CPU that allocated them, improving cache locality.
NUMA Awareness:
On NUMA (Non-Uniform Memory Access) systems, memory access latency depends on which node the memory is attached to. The slab allocator is NUMA-aware: each cache keeps a kmem_cache_node with its own partial list for every node, new slabs are allocated from the requesting CPU's local node by default, and callers can target a specific node explicitly.
```c
/* NUMA-aware slab operations */

#include <linux/slab.h>
#include <linux/numa.h>

/* Allocate from specific NUMA node */
void *kmem_cache_alloc_node(struct kmem_cache *cache, gfp_t flags, int node);
void *kmalloc_node(size_t size, gfp_t flags, int node);

/* Example: Allocate on node where current task prefers to run */
int preferred_node = numa_node_id();
void *obj = kmem_cache_alloc_node(cache, GFP_KERNEL, preferred_node);

/* Example: Allocate on same node as another object */
int obj_node = page_to_nid(virt_to_page(existing_obj));
void *new_obj = kmalloc_node(size, GFP_KERNEL, obj_node);

/* View NUMA slab distribution */
/*
 * $ cat /sys/kernel/slab/kmalloc-256/numa_stat
 * N0=12345 N1=10234 N2=11234 N3=9876
 */

/* Slab spreads across nodes by default for some caches */
/* SLAB_MEM_SPREAD flag enables intentional distribution */
cache = kmem_cache_create("spread_cache", sizeof(struct my_obj),
			  0, SLAB_MEM_SPREAD, NULL);

/* === Per-CPU Statistics === */
/*
 * $ cat /sys/kernel/slab/task_struct/cpu_slabs
 * 0 : slabs=3 objects=24 partial=1
 * 1 : slabs=2 objects=16 partial=1
 * ...
 */
```

When writing drivers that allocate memory heavily on specific CPUs (e.g., network drivers with receive queues per CPU), using kmalloc_node() or kmem_cache_alloc_node() with the appropriate NUMA node can significantly reduce memory latency. Profile with tools like 'numastat' to identify cross-node memory access patterns.
The slab allocator provides rich debugging and monitoring capabilities through /proc, /sys, and kernel options.
```bash
# === /proc/slabinfo - Traditional slab statistics ===
cat /proc/slabinfo
# slabinfo - version: 2.1
# # name            <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
# task_struct              1234       1280      1728          18            8
# inode_cache             45678      46080       608          26            4
# dentry                  89012      89856       192          21            1

# Parse with slabtop for real-time view
sudo slabtop
# Active / Total Objects (% used) : 1234567 / 1345678 (91.7%)
# Active / Total Slabs (% used)   : 12345 / 13456 (91.7%)
# Active / Total Caches (% used)  : 150 / 200 (75.0%)
# Active / Total Size (% used)    : 512.00M / 560.00M (91.4%)

# === /sys/kernel/slab/ - Detailed per-cache information ===
ls /sys/kernel/slab/task_struct/
# aliases align alloc_calls cache_dma cpu_partial cpu_slabs
# ctor destroy_by_rcu free_calls hwcache_align min_partial
# object_size objects objects_partial objs_per_slab order
# partial poison red_zone sanity_checks shrink slabs
# slabs_cpu store_user total_objects trace validate

# View object size
cat /sys/kernel/slab/task_struct/object_size
# 1728

# View allocation call sites
cat /sys/kernel/slab/task_struct/alloc_calls
# 1234 copy_process+0x1a3/0x1d80 ...

# View free call sites
cat /sys/kernel/slab/task_struct/free_calls
# 1100 free_task+0x85/0xa0 ...

# === SLUB Debug Options ===
# Boot with: slub_debug=PUZ (see below)
# P = Poisoning: fill with pattern on free
# U = User tracking: store allocation/free call sites
# Z = Red zoning: detect buffer overflows
# F = Sanity checks on free
# T = Trace allocations

# Enable debugging for specific cache
echo 1 > /sys/kernel/slab/my_cache/poison
echo 1 > /sys/kernel/slab/my_cache/red_zone
echo 1 > /sys/kernel/slab/my_cache/trace

# Force validation of a cache (slow!)
echo 1 > /sys/kernel/slab/my_cache/validate

# === Common Issues and Detection ===

# Use-after-free detection (with poisoning):
# Object filled with 0x6b on free
# Access pattern 0x6b6b6b6b indicates use-after-free

# Buffer overflow detection (with red zones):
# Guard bytes placed around object
# Corruption of guards indicates buffer overflow

# Memory leak detection:
# Compare alloc_calls vs free_calls
# Growing object count without matching frees

# === Memory Pressure and Shrinking ===
# Force slab shrinking (release empty slabs)
for f in /sys/kernel/slab/*/shrink; do echo 1 > "$f"; done

# Or use sysctl
sysctl vm.drop_caches=2   # Free reclaimable slab objects
```

| Option | Description | Overhead |
|---|---|---|
| P (Poison) | Fill freed objects with 0x6b pattern | Low |
| U (User tracking) | Record allocation/free call sites | Medium |
| Z (Red zone) | Add guard bytes around objects | Medium |
| F (Sanity) | Validate objects on free | Low |
| T (Trace) | Trace all allocations to kernel log | High |
| A (Failslab) | Random allocation failures for testing | Low |
SLUB debugging options can significantly impact performance. Poisoning and red zones are suitable for development systems. User tracking and tracing should be used sparingly. Never enable all options on production systems unless diagnosing a specific issue. Boot with 'slub_debug=-' to disable all debugging.
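Debug features can also be enabled for a single cache at creation time rather than kernel-wide; a minimal sketch using the corresponding SLAB_* flags (the cache and structure names are hypothetical, and CONFIG_SLUB_DEBUG must be built in):

```c
#include <linux/slab.h>
#include <linux/init.h>

struct my_debug_obj {
	int state;
	char payload[120];
};

static struct kmem_cache *my_debug_cache;

static int __init my_debug_cache_init(void)
{
	/* Poison freed objects, add red zones around each object, and
	 * record allocation/free call sites for this cache only. */
	my_debug_cache = kmem_cache_create("my_debug_cache",
					   sizeof(struct my_debug_obj), 0,
					   SLAB_POISON | SLAB_RED_ZONE |
					   SLAB_STORE_USER,
					   NULL);
	return my_debug_cache ? 0 : -ENOMEM;
}
```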
The slab allocator is one of the most carefully engineered subsystems in the Linux kernel. Let's consolidate what we've learned:

- Kernel workloads allocate the same fixed-size objects constantly, so per-type caches beat a general-purpose heap on fragmentation, speed, and cache locality.
- A cache is backed by slabs (one or more pages) whose objects may be full, partial, or empty, with the free list threaded through the free objects themselves.
- The SLUB fast path is a lockless per-CPU freelist operation; the slow path falls back to per-CPU and per-node partial lists, then the page allocator.
- SLUB is the implementation in current kernels, and kmalloc is built on size-class caches on top of it.
- /proc/slabinfo, slabtop, /sys/kernel/slab/, and slub_debug provide monitoring and debugging.
What's Next:
The slab allocator works with pages obtained from the page allocator, but not all memory is created equal. The next page explores Linux memory zones—how the kernel partitions physical memory into regions with different characteristics and constraints.
You now have a comprehensive understanding of the Linux slab allocator—its architecture, implementations, APIs, and optimization techniques. This knowledge is essential for kernel development, driver writing, and performance analysis.