When the Linux kernel faces memory pressure, it traditionally has one escape route: swap—writing anonymous pages to disk and freeing their memory for reuse. This works, but at tremendous cost. Even modern NVMe SSDs introduce latencies measured in microseconds, while HDDs impose latencies in milliseconds.
zswap changes this calculus by inserting a compressed cache layer between the reclaim path and the swap device. Instead of immediately writing pages to disk, zswap compresses them and keeps them in a pool of RAM, falling back to the swap device only when a page compresses poorly or the pool is full.
The result: under memory pressure, systems with zswap maintain dramatically better responsiveness than those relying purely on swap.
By the end of this page, you will master zswap internals—the frontend interception mechanism, backend pool management, writeback to swap, configuration tuning, and production deployment strategies. You'll understand when zswap helps, when it hurts, and how to optimize it for specific workloads.
zswap is implemented as a frontswap backend—a Linux kernel mechanism that allows interception of swap operations. When a page is about to be written to swap, zswap gets first opportunity to handle it.
Core Components:
| Component | Purpose | Implementation |
|---|---|---|
| Frontend | Intercepts swap-out requests | Frontswap ops registration |
| Compressor | Compresses/decompresses pages | Crypto API (lz4, lzo, zstd, etc.) |
| zpool | Stores compressed pages | zbud, z3fold, or zsmalloc |
| Same-filled check | Optimizes zero/same-filled pages | Stores only the repeated fill value |
| Writeback | Evicts cold pages to actual swap | kthread-based background worker |
| Entry Tree | Maps (swap type, offset) to compressed entries | Red-black tree |
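To make the Entry Tree component concrete, here is a simplified sketch of the index structures, with fields condensed from the kernel's struct zswap_entry and struct zswap_tree (the real structures carry a few extra members, such as LRU linkage used by writeback):

struct zswap_entry {
    struct rb_node rbnode;      /* node in the per-swap-area red-black tree */
    pgoff_t offset;             /* swap offset used as the lookup key */
    int refcount;               /* pins the entry while a load is in flight */
    unsigned int length;        /* compressed length; 0 marks a same-filled page */
    struct zswap_pool *pool;    /* pool holding the compressed data */
    union {
        unsigned long handle;   /* zpool handle when length > 0 */
        unsigned long value;    /* repeated fill value when length == 0 */
    };
};

struct zswap_tree {
    struct rb_root rbroot;      /* entries for one swap area, keyed by offset */
    spinlock_t lock;            /* protects lookup, insert, and erase */
};

Each swap area gets its own tree, so lookups are keyed only by the page's offset within that area.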
The Data Flow:
When zswap intercepts a page:
1. Check same-filled: Is the page entirely filled with the same byte value?
2. Compress page: Apply the configured algorithm
3. Check compression ratio: Did compression achieve sufficient savings?
4. Allocate zpool space: Get space in the compressed pool
5. Store and index: Copy compressed data, create index entry, mark success
6. Free original page: The uncompressed page frame is now available for reuse
A significant percentage of pages (often 10-30%) are 'same-filled'—entirely filled with zeros or another repeated byte. zswap detects these without compression, storing only the fill value. A 4KB zero page becomes a few bytes of metadata. This optimization alone can dramatically increase effective memory.
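The check itself is a cheap linear scan. Here is a userspace sketch of the same idea; the kernel compares machine words rather than bytes, and the function name here is illustrative:

#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096

/* Return true (and report the fill word) if the page is one repeated value. */
static bool page_same_filled(const void *page, unsigned long *value)
{
    const unsigned long *word = page;
    size_t nwords = PAGE_SIZE / sizeof(unsigned long);

    for (size_t i = 1; i < nwords; i++) {
        if (word[i] != word[0])
            return false;
    }
    *value = word[0];   /* only this word needs to be stored */
    return true;
}

On a hit, zswap records just the fill word in the entry's metadata and skips compression and zpool allocation entirely.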
zswap hooks into the kernel's swap path via the frontswap API—a clean abstraction that allows backend implementations to intercept swap operations without modifying the core VM code.
Frontswap Operations:
struct frontswap_ops {
void (*init)(unsigned type); /* Swap area initialized */
int (*store)(unsigned type, /* Store a page */
pgoff_t offset,
struct page *page);
int (*load)(unsigned type, /* Load (decompress) a page */
pgoff_t offset,
struct page *page);
void (*invalidate_page)(unsigned type, /* Page no longer needed */
pgoff_t offset);
void (*invalidate_area)(unsigned type); /* Swap area deactivated */
};
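For context, this is roughly how zswap wires its callbacks into that interface at module initialization. This is a sketch: the handler names follow the pattern used in the listings on this page, and the exact signature of the registration call has varied across kernel versions.

static struct frontswap_ops zswap_frontswap_ops = {
    .init            = zswap_frontswap_init,
    .store           = zswap_frontswap_store,
    .load            = zswap_frontswap_load,
    .invalidate_page = zswap_frontswap_invalidate_page,
    .invalidate_area = zswap_frontswap_invalidate_area,
};

/* Called from zswap's init code once the pool and compressor are ready. */
frontswap_register_ops(&zswap_frontswap_ops);

From that point on, every swap-out and swap-in passes through these callbacks before any block I/O is issued.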
Store Operation (Compression Path):
When the kernel calls swap_writepage() to write a page to swap, frontswap intercepts and calls zswap's store operation:
/* Simplified zswap store path (Linux kernel) */
static int zswap_frontswap_store(unsigned type, pgoff_t offset,
                                 struct page *page)
{
    struct zswap_tree *tree = zswap_trees[type];
    struct zswap_entry *entry, *dupentry;
    struct crypto_acomp_ctx *acomp_ctx;
    struct scatterlist input, output;
    int ret, dlen = PAGE_SIZE;
    unsigned long handle;
    char *buf;
    u8 *src;

    /* Check if zswap is enabled and pool is available */
    if (!zswap_enabled || !tree)
        return -ENODEV;

    /* Allocate entry metadata */
    entry = zswap_entry_cache_alloc(GFP_KERNEL);
    if (!entry)
        return -ENOMEM;

    /* Check for same-filled pages first (optimization) */
    src = kmap_atomic(page);
    if (zswap_same_filled_pages_enabled &&
        zswap_is_page_same_filled(src, &entry->value)) {
        kunmap_atomic(src);
        entry->length = 0;    /* Marker for same-filled */
        goto insert;
    }

    /* Get compression context for this CPU */
    acomp_ctx = raw_cpu_ptr(zswap_comp->acomp_ctx);
    mutex_lock(&acomp_ctx->mutex);

    /* Setup source scatter-gather */
    sg_init_table(&input, 1);
    sg_set_page(&input, page, PAGE_SIZE, 0);

    /* Compress into temporary buffer */
    buf = acomp_ctx->dstmem;
    sg_init_one(&output, buf, PAGE_SIZE);
    acomp_request_set_params(acomp_ctx->req, &input, &output,
                             PAGE_SIZE, dlen);
    ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req),
                          &acomp_ctx->wait);
    dlen = acomp_ctx->req->dlen;
    kunmap_atomic(src);

    /* Check if compression was worthwhile */
    if (ret || dlen >= PAGE_SIZE) {
        mutex_unlock(&acomp_ctx->mutex);
        ret = -EINVAL;    /* Reject: compression didn't help enough */
        goto freepage;
    }

    /* Allocate space in zpool */
    ret = zpool_malloc(zswap_pool->zpool,
                       dlen + sizeof(struct zswap_header), &handle);
    if (ret) {
        mutex_unlock(&acomp_ctx->mutex);
        ret = -ENOMEM;
        goto freepage;
    }

    /* Copy compressed data to zpool */
    char *dst = zpool_map_handle(zswap_pool->zpool, handle, ZPOOL_MM_WO);
    memcpy(dst, buf, dlen);
    zpool_unmap_handle(zswap_pool->zpool, handle);
    mutex_unlock(&acomp_ctx->mutex);

    /* Setup entry */
    entry->handle = handle;
    entry->length = dlen;

insert:
    entry->offset = offset;
    entry->refcount = 1;
    entry->pool = zswap_pool;

    /* Insert into tree, replacing any duplicate for this offset */
    spin_lock(&tree->lock);
    dupentry = zswap_rb_search(&tree->rbroot, offset);
    if (dupentry) {
        zswap_rb_erase(&tree->rbroot, dupentry);
        zswap_entry_put(tree, dupentry);
    }
    zswap_rb_insert(&tree->rbroot, entry);
    spin_unlock(&tree->lock);

    /* Update statistics */
    atomic_inc(&zswap_stored_pages);
    zswap_pool_total_size = zpool_get_total_size(zswap_pool->zpool);

    return 0;    /* Success: page is now in zswap */

freepage:
    zswap_entry_cache_free(entry);
    return ret;    /* Failure: page should go to regular swap */
}

When zswap_frontswap_store() returns 0, the page is successfully stored in the compressed cache and won't be written to disk. When it returns non-zero (rejection or failure), the kernel falls through to normal swap I/O. This graceful fallback ensures reliability.
When a process accesses a page that was compressed into zswap, the kernel must decompress and return the original data. This happens through the frontswap load operation.
The Load Path:
The load path begins in swap_readpage(): frontswap gives zswap a chance to satisfy the fault from the compressed pool before any disk I/O is issued.
/* Simplified zswap load path */
static int zswap_frontswap_load(unsigned type, pgoff_t offset,
                                struct page *page)
{
    struct zswap_tree *tree = zswap_trees[type];
    struct zswap_entry *entry;
    struct crypto_acomp_ctx *acomp_ctx;
    u8 *src, *dst;
    unsigned int dlen;
    int ret;

    /* Lookup entry in tree */
    spin_lock(&tree->lock);
    entry = zswap_rb_search(&tree->rbroot, offset);
    if (!entry) {
        spin_unlock(&tree->lock);
        return -ENOENT;    /* Not in zswap, try regular swap */
    }
    zswap_entry_get(entry);    /* Take reference */
    spin_unlock(&tree->lock);

    /* Handle same-filled pages */
    if (entry->length == 0) {
        dst = kmap_atomic(page);
        zswap_fill_page(dst, entry->value);
        kunmap_atomic(dst);
        goto stats;
    }

    /* Map compressed data from zpool */
    src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);

    /* Get decompression context */
    acomp_ctx = raw_cpu_ptr(zswap_comp->acomp_ctx);
    mutex_lock(&acomp_ctx->mutex);

    /* Setup decompression */
    struct scatterlist input, output;
    sg_init_one(&input, src, entry->length);
    sg_init_table(&output, 1);
    sg_set_page(&output, page, PAGE_SIZE, 0);
    acomp_request_set_params(acomp_ctx->req, &input, &output,
                             entry->length, PAGE_SIZE);

    /* Decompress */
    ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
                          &acomp_ctx->wait);
    dlen = acomp_ctx->req->dlen;
    mutex_unlock(&acomp_ctx->mutex);
    zpool_unmap_handle(entry->pool->zpool, entry->handle);

    /* Verify decompression succeeded */
    if (ret || dlen != PAGE_SIZE) {
        zswap_entry_put(tree, entry);
        return -EIO;
    }

stats:
    atomic_dec(&zswap_stored_pages);
    zswap_entry_put(tree, entry);
    return 0;    /* Success: page decompressed into target */
}

/* Invalidate removes the entry entirely */
static void zswap_frontswap_invalidate_page(unsigned type, pgoff_t offset)
{
    struct zswap_tree *tree = zswap_trees[type];
    struct zswap_entry *entry;

    spin_lock(&tree->lock);
    entry = zswap_rb_search(&tree->rbroot, offset);
    if (entry) {
        zswap_rb_erase(&tree->rbroot, entry);
        zswap_entry_put(tree, entry);
    }
    spin_unlock(&tree->lock);
}

Critical Performance Considerations:
| Operation | Typical Latency | Key Factors |
|---|---|---|
| Tree lookup | ~100 ns | Tree depth, cache locality |
| zpool map | ~50 ns | Pool type, memory access |
| Decompression | 300-1000 ns | Algorithm, data size |
| Page mapping | ~100 ns | TLB state, NUMA effects |
| Total load | 500-1500 ns | Sum of above |
Compare to: an NVMe SSD 4K read at roughly 10-100 µs, or an HDD read at several milliseconds once seek time is included.
zswap therefore provides roughly a 10-10,000x latency improvement over disk-based swap for pages that hit the compressed cache.
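The wide range follows directly from the arithmetic. A small self-contained calculation using the latencies above and the device figures just quoted (illustrative values, not measurements):

#include <stdio.h>

int main(void)
{
    /* Illustrative latencies in nanoseconds. */
    double zswap_load = 1000.0;      /* ~1 us: lookup + decompress */
    double nvme_read  = 100000.0;    /* ~100 us: NVMe 4K random read */
    double hdd_read   = 10000000.0;  /* ~10 ms: HDD seek + read */

    printf("speedup vs NVMe: %.0fx\n", nvme_read / zswap_load);
    printf("speedup vs HDD:  %.0fx\n", hdd_read / zswap_load);

    /* Effective refault latency when 90% of refaults hit the zswap pool
     * and the remaining 10% fall through to NVMe swap. */
    double hit = 0.9;
    double effective = hit * zswap_load + (1.0 - hit) * nvme_read;
    printf("effective latency at 90%% hit rate: %.1f us\n",
           effective / 1000.0);
    return 0;
}

Even at a modest hit rate, the average cost of a refault drops by roughly an order of magnitude.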
zswap entries are invalidated on load—once decompressed, the page returns to regular memory and the compressed copy is freed. This differs from some caching schemes that retain copies. The rationale: if the page is faulted in, it's likely to be used soon; keeping both copies wastes memory.
The zpool is the storage backend where compressed page data lives. zswap uses the kernel's zpool abstraction, which can be backed by different allocation strategies:
zbud (buddy-based): packs up to two compressed pages ("buddies") into each physical page; simple, predictable, very low overhead.
z3fold: packs up to three compressed pages per physical page; better density for a modest increase in complexity.
zsmalloc: a size-class allocator (also used by zram) that packs objects of many sizes tightly; highest density, highest complexity.
| Feature | zbud | z3fold | zsmalloc |
|---|---|---|---|
| Pages per physical page | 2 | 3 | Variable |
| Fragmentation | Low | Medium | Minimal |
| CPU overhead | Very low | Low | Medium |
| Memory efficiency | ~75% | ~85% | ~95% |
| Complexity | Simple | Moderate | Complex |
| Compaction support | No | Limited | Yes |
| Best for | Low overhead | Balanced | Max savings |
zbud Internal Structure:
┌─────────────────────────────────┐
│ Physical Page │
├───────────────┬─────────────────┤
│ Entry 1 │ Entry 2 │
│ (< 2KB) │ (< 2KB) │
│ │ │
│ Compressed │ Compressed │
│ Data │ Data │
├───────────────┴─────────────────┤
│ Metadata Header │
└─────────────────────────────────┘
zbud divides each 4KB page into two "buddies" that can each hold a compressed page up to ~2KB. If a compressed page exceeds 2KB, it gets the whole physical page, wasting the other half.
z3fold Internal Structure:
┌─────────────────────────────────┐
│ Physical Page │
├──────────┬──────────┬───────────┤
│ Entry 1 │ Entry 2 │ Entry 3 │
│ (≤1.3KB)│ (≤1.3KB) │ (≤1.3KB) │
│ │ │ │
│ Data 1 │ Data 2 │ Data 3 │
├──────────┴──────────┴───────────┤
│ Metadata + Padding │
└─────────────────────────────────┘
z3fold allows up to 3 entries, improving density when pages compress to < 1.3KB.
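A quick way to compare the two folding allocators is to ask how many compressed pages fit into one 4 KB physical page at a given compressed size. The thresholds below are simplified from the diagrams above; real allocators reserve some space for headers, so the actual cutoffs are slightly lower:

#include <stdio.h>

#define PAGE_SIZE 4096

int main(void)
{
    /* Compressed sizes to test, in bytes (roughly 2:1, 3:1, 4:1 ratios). */
    int sizes[] = { 2000, 1300, 1024 };

    for (int i = 0; i < 3; i++) {
        int s = sizes[i];
        int zbud   = (s <= PAGE_SIZE / 2) ? 2 : 1;    /* two ~2 KB buddies */
        int z3fold = (s <= PAGE_SIZE / 3) ? 3 :
                     (s <= PAGE_SIZE / 2) ? 2 : 1;    /* up to three slots */
        printf("%4d bytes -> zbud: %d per page, z3fold: %d per page\n",
               s, zbud, z3fold);
    }
    return 0;
}

zsmalloc is omitted here because it packs objects into size classes that can span pages, so its density depends on the class layout rather than a simple per-page cutoff.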
The default (z3fold) is a good starting point. Switch to zbud if CPU overhead is critical, or zsmalloc if memory savings are paramount and CPU is plentiful. Monitor pool statistics to validate your choice for specific workloads.
zswap's pool has a configurable maximum size, expressed as a percentage of total RAM. When the pool fills, zswap can either reject new stores (those pages go straight to the swap device) or write existing entries back to disk to free room for new ones.
Writeback Mechanism:
zswap implements a background writeback worker (zswap_shrink_worker) that proactively evicts cold pages from the compressed pool to the swap device when pool usage approaches or exceeds the configured maximum (max_pool_percent), so that incoming stores do not have to be rejected outright:
/* Simplified zswap writeback mechanism */

/* Writeback worker - runs when pool is getting full */
static void zswap_shrink_worker(struct work_struct *work)
{
    struct zswap_pool *pool = container_of(work, struct zswap_pool,
                                           shrink_work);
    unsigned long pool_size, target_size;
    int ret;

    /* Calculate target: we want to free 10% of pool */
    pool_size = zpool_get_total_size(pool->zpool);
    target_size = pool_size - (pool_size / 10);

    while (zpool_get_total_size(pool->zpool) > target_size) {
        /* Select oldest entry via LRU */
        struct zswap_entry *entry = get_lru_entry(pool);
        if (!entry)
            break;

        /* Write entry to actual swap device */
        ret = zswap_writeback_entry(pool, entry);
        if (ret) {
            /* Writeback failed, return entry to pool */
            put_entry_back(pool, entry);
            break;
        }

        /* Free pool space */
        zpool_free(pool->zpool, entry->handle);
        zswap_entry_cache_free(entry);
        atomic_dec(&zswap_stored_pages);
        atomic_inc(&zswap_written_back_pages);

        /* Yield to prevent monopolizing CPU */
        cond_resched();
    }
}

/* Write a compressed entry to actual swap */
static int zswap_writeback_entry(struct zswap_pool *pool,
                                 struct zswap_entry *entry)
{
    struct page *page;
    swp_entry_t swpentry;
    struct bio *bio;
    u8 *src, *dst;
    int ret;

    /* Allocate temporary page */
    page = alloc_page(GFP_NOIO);
    if (!page)
        return -ENOMEM;

    /* For same-filled pages, just fill */
    if (entry->length == 0) {
        dst = kmap_atomic(page);
        zswap_fill_page(dst, entry->value);
        kunmap_atomic(dst);
    } else {
        /* Decompress into temporary page */
        src = zpool_map_handle(pool->zpool, entry->handle, ZPOOL_MM_RO);
        dst = kmap_atomic(page);
        ret = decompress(entry->algo, src, entry->length, dst, PAGE_SIZE);
        kunmap_atomic(dst);
        zpool_unmap_handle(pool->zpool, entry->handle);
        if (ret != PAGE_SIZE) {
            __free_page(page);
            return -EIO;
        }
    }

    /* Construct swap entry */
    swpentry = entry_to_swp_entry(entry);

    /* Write to swap device */
    bio = bio_alloc(GFP_NOIO, 1);
    bio_set_dev(bio, get_swap_bdev(swpentry));
    bio->bi_iter.bi_sector = map_swap_page(swpentry);
    bio_add_page(bio, page, PAGE_SIZE, 0);
    bio->bi_opf = REQ_OP_WRITE;
    submit_bio_wait(bio);
    bio_put(bio);

    __free_page(page);
    return 0;
}

Pool Sizing Considerations:
| Pool Size | Behavior | Trade-off |
|---|---|---|
| Small (5-10%) | Frequent writeback, low memory use | Higher I/O, less benefit |
| Medium (20-30%) | Balanced operation | Good starting point |
| Large (50%+) | Rare writeback, maximum caching | May starve applications |
Recommendation: Start with 20% (max_pool_percent=20) and adjust based on:
- Workload compressibility: watch the effective compression ratio reported in the monitoring section later on this page
- Pool pressure: rising pool_limit_hit and written_back_pages counts suggest the pool is too small
- Application headroom: RAM pinned in the compressed pool is RAM your applications cannot use
A quick sizing sanity check follows below.
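The sizing sanity check, as a small calculation. The host size is hypothetical and the compression ratio is an assumption in line with the LZ4/zstd figures quoted later on this page:

#include <stdio.h>

int main(void)
{
    double ram_gib  = 64.0;   /* hypothetical host RAM */
    double pool_pct = 20.0;   /* max_pool_percent */
    double ratio    = 3.0;    /* assumed compression ratio */

    double pool_gib = ram_gib * pool_pct / 100.0;
    printf("pool cap:          %.1f GiB of RAM\n", pool_gib);
    printf("uncompressed data: ~%.1f GiB held in the pool\n", pool_gib * ratio);
    printf("net gain:          ~%.1f GiB of effective extra memory\n",
           pool_gib * (ratio - 1.0));
    return 0;
}

At 20% of a 64 GiB host and a 3:1 ratio, roughly 12.8 GiB of RAM holds about 38 GiB of would-be-swapped data, a net gain of around 25 GiB of effective memory.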
Writeback involves decompression (CPU cost) followed by disk I/O. If the page is accessed again after writeback, it must be read from disk: the worst outcome. Good LRU ordering is critical: evict pages unlikely to be accessed soon.
zswap is highly configurable through sysfs parameters. Understanding these parameters is essential for optimal deployment:
Enable/Disable (enabled):
echo Y > /sys/module/zswap/parameters/enabled # Enable
echo N > /sys/module/zswap/parameters/enabled # Disable
Compression Algorithm (compressor):
# Available algorithms (depends on kernel config)
cat /proc/crypto | grep -E 'name.*lz[o4]|zstd'
# Set compressor
echo lz4 > /sys/module/zswap/parameters/compressor
Pool Allocator (zpool):
echo z3fold > /sys/module/zswap/parameters/zpool
echo zbud > /sys/module/zswap/parameters/zpool
echo zsmalloc > /sys/module/zswap/parameters/zpool
Maximum Pool Size (max_pool_percent):
echo 25 > /sys/module/zswap/parameters/max_pool_percent
Same-Filled Pages (same_filled_pages_enabled):
echo Y > /sys/module/zswap/parameters/same_filled_pages_enabled
| Parameter | Default | Range | Description |
|---|---|---|---|
| enabled | N | Y/N | Master enable switch |
| compressor | lzo-rle | lz4, lzo, zstd, etc. | Compression algorithm |
| zpool | z3fold | zbud, z3fold, zsmalloc | Pool allocator |
| max_pool_percent | 20 | 1-100 | Max pool as % of RAM |
| accept_threshold_percent | 90 | 1-100 | Pool fill % below which stores resume after hitting the limit |
| same_filled_pages_enabled | Y | Y/N | Same-filled page optimization |
| non_same_filled_pages_enabled | Y | Y/N | Enable regular compression |
#!/bin/bash
# zswap configuration script for production servers

# Enable zswap
echo Y > /sys/module/zswap/parameters/enabled

# Use LZ4 for best speed (or zstd for best ratio)
echo lz4 > /sys/module/zswap/parameters/compressor

# Use z3fold for good balance
echo z3fold > /sys/module/zswap/parameters/zpool

# Set pool to 25% of RAM
echo 25 > /sys/module/zswap/parameters/max_pool_percent

# Enable same-filled page optimization
echo Y > /sys/module/zswap/parameters/same_filled_pages_enabled

# Verify configuration
echo "=== zswap Configuration ==="
for param in enabled compressor zpool max_pool_percent; do
    val=$(cat /sys/module/zswap/parameters/$param)
    echo "$param: $val"
done

# For persistent configuration, add to kernel command line:
# zswap.enabled=1 zswap.compressor=lz4 zswap.zpool=z3fold zswap.max_pool_percent=25

LZ4: Fastest speed, moderate compression (~2.5:1). Best for latency-sensitive workloads.
LZO: Good balance of speed and compression. Legacy default.
Zstd: Best compression (~3-4:1) but slower. Good when CPU is plentiful.
Choose based on your CPU/memory tradeoff preferences.
Effective zswap operation requires monitoring key metrics to ensure the system is behaving as expected.
Primary Statistics Location:
/sys/kernel/debug/zswap/* # Detailed statistics (debugfs)
Key Metrics to Monitor:
#!/bin/bash
# zswap monitoring script

DEBUGFS="/sys/kernel/debug/zswap"

# Check if debugfs is mounted and zswap is active
if [ ! -d "$DEBUGFS" ]; then
    echo "zswap debugfs not available"
    exit 1
fi

echo "=== zswap Statistics ==="
echo ""

# Core metrics
stored=$(cat $DEBUGFS/stored_pages 2>/dev/null || echo 0)
same_filled=$(cat $DEBUGFS/same_filled_pages 2>/dev/null || echo 0)
pool_size=$(cat $DEBUGFS/pool_total_size 2>/dev/null || echo 0)
written_back=$(cat $DEBUGFS/written_back_pages 2>/dev/null || echo 0)

# Calculate effective size
page_size=4096
stored_bytes=$((stored * page_size))
ratio="N/A"
if [ "$pool_size" -gt 0 ]; then
    ratio=$(echo "scale=2; $stored_bytes / $pool_size" | bc)
fi

echo "Stored pages: $stored ($(numfmt --to=iec $stored_bytes))"
echo "Same-filled pages: $same_filled"
echo "Pool size: $(numfmt --to=iec $pool_size)"
echo "Effective ratio: ${ratio}:1"
echo "Written back: $written_back"
echo ""

# Rejection stats
echo "=== Rejection Statistics ==="
echo "Compress poor: $(cat $DEBUGFS/reject_compress_poor 2>/dev/null || echo 0)"
echo "Alloc fail: $(cat $DEBUGFS/reject_alloc_fail 2>/dev/null || echo 0)"
echo "Pool limit hit: $(cat $DEBUGFS/pool_limit_hit 2>/dev/null || echo 0)"
echo ""

# Configuration
echo "=== Configuration ==="
for param in enabled compressor zpool max_pool_percent; do
    val=$(cat /sys/module/zswap/parameters/$param 2>/dev/null || echo "N/A")
    printf "%-20s %s\n" "$param:" "$val"
done

High reject_compress_poor? Workload has incompressible data (encrypted, media); consider disabling zswap for these systems.
High pool_limit_hit? Pool is too small; increase max_pool_percent.
High written_back_pages? Pool is cycling; consider a larger pool or a faster compressor.
Low stored_pages? Check that zswap is enabled and swap is configured.
Deploying zswap in production requires careful consideration of workload characteristics and system constraints. Here are battle-tested recommendations:
Track pool_total_size against stored_pages (the monitoring script above computes the effective ratio) to confirm compression is paying off. Properly configured zswap typically reduces swap I/O by 50-90%, improving system responsiveness dramatically under memory pressure. Desktop users see fewer 'freezes' during heavy multitasking. Servers maintain lower latency during memory spikes. The main cost is modest CPU usage during compression.
We've explored zswap in depth: its architecture and frontswap interception mechanism, the compression (store) and decompression (load) paths, zpool backends, writeback to the swap device, configuration parameters, and monitoring for production deployment.
What's Next:
The next page explores zram—a complementary technology that creates a compressed block device in RAM. While zswap intercepts the swap path to existing swap devices, zram creates an entirely new compressed swap device. Understanding both enables optimal memory compression strategies.
You now have deep knowledge of zswap internals—interception mechanisms, compression paths, pool management, and production deployment. This prepares you to effectively deploy, monitor, and troubleshoot zswap in production Linux systems.