Throughout this module, we've seen how buffering mediates between devices and processes. But we've accepted an implicit cost: data is copied at each stage. Serving a file over the network, for example, typically involves:

1. DMA copy from disk into the kernel page cache
2. Copy from the page cache into the user-space buffer (read())
3. Copy from the user-space buffer into the kernel socket buffer (write())
4. DMA copy from the socket buffer to the NIC
For a simple 'serve file over network' operation, data is copied four times. Each copy consumes CPU cycles, memory bandwidth, and cache space. At 10 Gbps networking speeds, the CPU spends more time copying than doing actual work.
Zero-copy techniques eliminate unnecessary copies, allowing data to flow from source to destination with minimal CPU involvement. This isn't optimization at the margins—it's the difference between systems that scale and systems that bottleneck.
By the end of this page, you will understand the true cost of memory copies, master the suite of zero-copy mechanisms (sendfile, splice, mmap, RDMA), analyze when each technique applies, and implement zero-copy patterns for maximum I/O performance.
To appreciate zero-copy's value, we must quantify the cost of traditional copying. This cost manifests in multiple dimensions:
CPU Cost:
Memory copying is deceptively expensive. memcpy() for aligned data achieves roughly 10-20 GB/s on modern CPUs—impressive, but consider the implications:
| Data Size | Copy Time (at ~10 GB/s) | Copy Time at 1M ops/sec | CPU Core Usage |
|---|---|---|---|
| 1 KB | ~100 ns | 100 ms/sec | 10% |
| 4 KB (page) | ~400 ns | 400 ms/sec | 40% |
| 64 KB | ~6.4 μs | 6.4 sec/sec | 640% (infeasible) |
| 1 MB | ~100 μs | 100 sec/sec | 10,000% (impossible) |
The table reveals that serving 1 million 4KB pages per second consumes 40% of a CPU core just for copying—before any actual processing.
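You can reproduce these figures on your own hardware with a small probe. The sketch below is illustrative and not part of the module's reference code: the buffer size, iteration count, and the hypothetical helper are my choices, and it simply times repeated memcpy() calls over a buffer larger than the last-level cache, so treat the output as a rough estimate rather than a rigorous benchmark.

```c
/* Rough memcpy() bandwidth probe - a minimal sketch, not a rigorous benchmark */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE   (64UL * 1024 * 1024)   /* 64 MB: larger than a typical LLC */
#define ITERATIONS 20

int main(void) {
    char *src = malloc(BUF_SIZE);
    char *dst = malloc(BUF_SIZE);
    if (!src || !dst) return 1;
    memset(src, 'A', BUF_SIZE);           /* Touch pages so they are resident */

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++)
        memcpy(dst, src, BUF_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec) +
                  (end.tv_nsec - start.tv_nsec) / 1e9;
    double gb   = (double)BUF_SIZE * ITERATIONS / 1e9;

    /* Print a byte of dst so the compiler cannot discard the copies */
    printf("memcpy bandwidth: ~%.1f GB/s (check byte: %c)\n",
           gb / secs, dst[BUF_SIZE - 1]);

    free(src); free(dst);
    return 0;
}
```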
Memory Bandwidth Cost:
Each copy reads from source and writes to destination. A single 4KB copy consumes 8KB of memory bandwidth. For a server handling 10 Gbps of network traffic (approximately 1.25 GB/s of data), the traditional 4-copy path consumes:
$$\text{Bandwidth} = 1.25 \text{ GB/s} \times 4 \text{ copies} \times 2 \text{ (read + write)} = 10 \text{ GB/s}$$
Modern DDR4 memory provides around 25 GB/s bandwidth per channel. A single network flow could consume 40% of available memory bandwidth just for copies!
Cache Pollution:
Perhaps the most insidious cost is cache pollution. When data is copied through the CPU, it occupies cache lines:

- The source buffer is pulled into the L1/L2/L3 caches on read
- The destination buffer is allocated into cache on write
- Both evict cache lines belonging to the application's actual working set
For streaming data that's never accessed again (like a web server forwarding files), cache pollution is pure overhead.
Systems that 'work fine in testing' often hit a wall in production because testing doesn't replicate memory bandwidth saturation or cache thrashing. Zero-copy isn't just about peak throughput—it's about maintaining consistent performance under load.
Let's compare the data path for a common operation: serving a file over the network.
Traditional Path (read() + write()):
```c
/* Traditional file sending - 4 copies, 4 context switches */
#include <unistd.h>

void send_file_traditional(int file_fd, int socket_fd, size_t size) {
    char buffer[65536];
    ssize_t bytes;

    while (size > 0) {
        size_t chunk = size < sizeof(buffer) ? size : sizeof(buffer);

        /* Copy 1: read() - Kernel page cache → User buffer */
        /* Context switch: User → Kernel → User */
        bytes = read(file_fd, buffer, chunk);
        if (bytes <= 0) break;

        /* Copy 2: write() - User buffer → Kernel socket buffer */
        /* Context switch: User → Kernel → User */
        write(socket_fd, buffer, bytes);

        /* Kernel internally:
         * Copy 3: Socket buffer → NIC TX buffer (or DMA scatter-gather)
         *
         * Total context switches: 4 (2 syscalls × 2 transitions each)
         * Total copies: 3-4 depending on NIC capabilities
         */
        size -= bytes;
    }
}
```

Zero-Copy Path (sendfile()):
The data never transits through user space. Kernel manipulates buffer references, not data:
```c
/* Zero-copy file sending with sendfile() */
#include <sys/sendfile.h>

void send_file_zerocopy(int file_fd, int socket_fd, size_t size) {
    off_t offset = 0;
    ssize_t sent;

    while (size > 0) {
        /*
         * sendfile(): Transfer data from file to socket without
         * ever copying to user space
         *
         * Kernel operations:
         * 1. Page cache contains file data (or reads it from disk)
         * 2. Socket buffer gets a reference to the page cache pages
         * 3. NIC DMAs directly from the page cache
         *
         * With modern NICs supporting scatter-gather DMA:
         * - Zero CPU copies
         * - One syscall (two user↔kernel transitions) per chunk
         * - Data flows: Disk → Page Cache → NIC (via DMA)
         */
        sent = sendfile(socket_fd, file_fd, &offset, size);
        if (sent <= 0) break;
        size -= sent;
    }
}
```

| Metric | read()+write() | sendfile() |
|---|---|---|
| CPU copies | 2-3 | 0-1 |
| Context switches | 4 | 2 |
| Syscalls | 2 per chunk | 1 per chunk |
| User space involvement | Data passes through | None |
| Cache utilization | Polluted | Preserved |
| CPU usage (100 MB/s) | ~20% | ~2% |
Zero-copy is most beneficial when data passes through without modification. If your application needs to encrypt, compress, or transform data, it must access the data—eliminating zero-copy benefits. For proxies, file servers, and streaming services that forward data unchanged, zero-copy is transformative.
The sendfile() system call is the workhorse of zero-copy networking on Linux. It transfers data directly from a file descriptor to a socket, bypassing user space entirely.
Signature:
```c
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * sendfile - Transfer data between file descriptors
 *
 * @out_fd: Destination file descriptor (a socket on older kernels;
 *          any fd since Linux 2.6.33)
 * @in_fd:  Source file descriptor (must support mmap / have a page cache)
 * @offset: Pointer to file offset; updated as data is sent
 * @count:  Number of bytes to transfer
 *
 * Returns: Number of bytes written (may be less than count),
 *          -1 on error
 */
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

/* Example: High-performance static file server */
void serve_static_file(int client_socket, const char *path) {
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;
    fstat(fd, &st);

    /* Send HTTP headers (small, via ordinary write) */
    dprintf(client_socket,
            "HTTP/1.1 200 OK\r\n"
            "Content-Length: %lld\r\n"
            "Content-Type: application/octet-stream\r\n\r\n",
            (long long)st.st_size);

    /* Send file body with zero-copy */
    off_t offset = 0;
    size_t remaining = st.st_size;
    while (remaining > 0) {
        ssize_t sent = sendfile(client_socket, fd, &offset, remaining);
        if (sent <= 0) break;
        remaining -= sent;
    }

    close(fd);
}
```

How sendfile() Works Internally:
1. File data in page cache: If the file isn't already cached, the kernel reads it from disk into the page cache.
2. Socket buffer setup: Instead of copying data, the socket buffer descriptor points to the page cache pages.
3. NIC scatter-gather: Modern NICs support scatter-gather DMA, reading data from non-contiguous memory locations. The NIC reads directly from the page cache pages.
4. Reference counting: Page cache pages are reference-counted. They won't be reclaimed until both the page cache and the socket are done with them.
sendfile() only works with source file descriptors that have a page cache (regular files, block devices). You cannot sendfile() from a socket to another socket, or from a pipe. For those use cases, splice() is needed. Also, before Linux 2.6.33 the destination had to be a socket; newer kernels accept any file descriptor as the output.
The splice() system call is a more general zero-copy mechanism that moves data between file descriptors without copying through user space. Unlike sendfile(), splice() requires at least one of the descriptors to be a pipe.
splice() API:
```c
#define _GNU_SOURCE        /* splice() and tee() are Linux-specific */
#include <fcntl.h>
#include <unistd.h>

/*
 * splice - Move data between file descriptors without user-space copy
 *
 * @fd_in:   Source file descriptor
 * @off_in:  Offset in source (NULL for pipes/sockets)
 * @fd_out:  Destination file descriptor
 * @off_out: Offset in destination (NULL for pipes/sockets)
 * @len:     Maximum bytes to transfer
 * @flags:   SPLICE_F_MOVE, SPLICE_F_NONBLOCK, SPLICE_F_MORE, etc.
 *
 * At least one of fd_in or fd_out must be a pipe!
 */
ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out,
               size_t len, unsigned int flags);

/*
 * tee - Duplicate pipe contents without consuming them
 * Creates a copy of the data for multiple consumers
 */
ssize_t tee(int fd_in, int fd_out, size_t len, unsigned int flags);

/* Example: Zero-copy proxy (socket → pipe → socket) */
void zerocopy_proxy(int client_sock, int server_sock) {
    int pipefd[2];
    pipe(pipefd);

    /*
     * Use splice() to move data through a pipe:
     * Client → Pipe → Server
     */
    for (;;) {
        ssize_t n;

        /* Splice from client socket into pipe */
        n = splice(client_sock, NULL, pipefd[1], NULL, 65536,
                   SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n <= 0) break;

        /* Splice from pipe to server socket */
        while (n > 0) {
            ssize_t sent = splice(pipefd[0], NULL, server_sock, NULL, n,
                                  SPLICE_F_MOVE | SPLICE_F_MORE);
            if (sent <= 0) break;
            n -= sent;
        }
    }

    close(pipefd[0]);
    close(pipefd[1]);
}
```

The Pipe as Buffer:
The pipe in splice() serves as a kernel-managed buffer. This might seem wasteful—why add a pipe if we're trying to avoid buffers? The key insight is that the pipe's buffer holds page references, not data copies. When splicing:

- splice() into the pipe attaches references to the source's pages onto the pipe's internal ring
- splice() out of the pipe hands those same page references to the destination
- The payload bytes themselves are never touched by the CPU
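One practical tuning knob follows from this: the default pipe capacity (64 KB on most Linux systems) caps how much each splice() call can move. The sketch below is an illustration, not part of the original module; the helper name and sizes are my own, and it uses Linux's F_SETPIPE_SZ fcntl to request a larger pipe buffer for a splice loop.

```c
/* Enlarge a pipe used as a splice() buffer - minimal sketch (Linux-specific) */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int make_splice_pipe(int pipefd[2], size_t want_bytes) {
    if (pipe(pipefd) < 0)
        return -1;

    /*
     * F_SETPIPE_SZ asks the kernel for a larger pipe buffer and returns
     * the capacity actually granted (rounded up to a page multiple).
     * Unprivileged processes are limited by /proc/sys/fs/pipe-max-size,
     * so treat failure as non-fatal and keep the default capacity.
     */
    int actual = fcntl(pipefd[1], F_SETPIPE_SZ, (int)want_bytes);
    if (actual < 0)
        perror("F_SETPIPE_SZ (keeping default pipe size)");
    else
        fprintf(stderr, "pipe capacity: %d bytes\n", actual);

    return 0;
}
```

With a 1 MB pipe, each iteration of the proxy loop above can move up to 1 MB per splice() call, reducing syscall overhead for high-throughput streams.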
The tee() System Call:
tee() duplicates pipe data without consuming it—useful for broadcasting:
```c
/* tee() for broadcasting: send the same data to multiple destinations */
void broadcast_to_clients(int source_fd, int *client_sockets, int num_clients) {
    int main_pipe[2], dup_pipe[2];
    pipe(main_pipe);
    pipe(dup_pipe);

    /* Read from source into the main pipe (only once) */
    ssize_t n = splice(source_fd, NULL, main_pipe[1], NULL, 65536,
                       SPLICE_F_MOVE);
    if (n <= 0)
        goto out;

    /* For every client but the last: duplicate, then send the duplicate */
    for (int i = 0; i < num_clients - 1; i++) {
        /*
         * tee() copies page references from the main pipe into the
         * duplicate pipe without consuming the main pipe's contents.
         * (tee() requires two *different* pipes.)
         */
        ssize_t duplicated = tee(main_pipe[0], dup_pipe[1], n, 0);
        if (duplicated <= 0)
            break;

        /* splice() the duplicate out to this client */
        splice(dup_pipe[0], NULL, client_sockets[i], NULL,
               duplicated, SPLICE_F_MOVE);
    }

    /* The last client consumes the original data directly */
    splice(main_pipe[0], NULL, client_sockets[num_clients - 1], NULL,
           n, SPLICE_F_MOVE);

out:
    close(main_pipe[0]); close(main_pipe[1]);
    close(dup_pipe[0]);  close(dup_pipe[1]);
}
```

Use sendfile() when copying from file to socket—it's simpler and slightly faster. Use splice() when moving data between sockets, between files and non-socket destinations, or when you need tee() for duplication. splice() is the building block; sendfile() is the specialized optimization.
Memory mapping (mmap()) enables zero-copy by mapping file contents directly into a process's address space. Instead of copying data via read()/write(), the process accesses file data as if it were regular memory.
How mmap() Achieves Zero-Copy:
```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Zero-copy file access with mmap() */
void process_file_mmap(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;

    struct stat st;
    fstat(fd, &st);

    /*
     * Map the file into the address space
     * - No data copied yet!
     * - Kernel sets up page table entries pointing to "nothing"
     * - Actual data loaded on demand via page faults
     */
    char *data = mmap(NULL, st.st_size,
                      PROT_READ,      /* Read-only access */
                      MAP_PRIVATE,    /* Private, copy-on-write mapping */
                      fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return;
    }

    /* Access data directly - page faults load pages as needed */
    size_t count = 0;
    for (size_t i = 0; i < (size_t)st.st_size; i++) {
        /*
         * First access to a page triggers:
         * 1. Page fault (minor fault if the data is in the page cache)
         * 2. Kernel maps the physical page into the process address space
         * 3. No copy - the process sees the page cache directly
         */
        if (data[i] == '\n') count++;
    }

    printf("Lines: %zu\n", count);

    munmap(data, st.st_size);
    close(fd);
}
```

mmap() Advantages:

- No explicit read()/write() copies: the process reads page cache pages in place
- No per-access system calls once the mapping is established
- Pages are loaded lazily and shared with other processes mapping the same file
- A natural fit for random access patterns and persistent data structures
mmap() Limitations and Gotchas:

- Page faults and TLB pressure add per-page overhead that buffered read() avoids
- Truncating the file while it is mapped can deliver SIGBUS on access
- Setup and teardown (mmap/munmap, page table updates) are costly for small or short-lived files
- Dirty pages are written back at the kernel's discretion unless you call msync()
For simple sequential file reading, read() with a properly-sized buffer often beats mmap(). The kernel's read-ahead is highly optimized for sequential access. mmap() shines for random access patterns, memory-mapped databases, and inter-process communication.
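If you do need to stream through a mapping sequentially, the kernel can at least be given a hint. The sketch below is an illustrative pattern rather than module reference code (the function name is mine): it uses posix_madvise() with POSIX_MADV_SEQUENTIAL so the kernel reads ahead more aggressively and reclaims pages already passed.

```c
/* Hinting a sequential scan over an mmap()ed file - minimal sketch */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

void scan_mapped_file_sequential(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;

    struct stat st;
    fstat(fd, &st);

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { close(fd); return; }

    /*
     * POSIX_MADV_SEQUENTIAL tells the kernel we will touch pages in order,
     * so it can read ahead aggressively and drop pages behind the cursor,
     * narrowing the gap with buffered read() for sequential scans.
     */
    posix_madvise(data, st.st_size, POSIX_MADV_SEQUENTIAL);

    volatile char sink = 0;
    for (size_t i = 0; i < (size_t)st.st_size; i += 4096)
        sink ^= data[i];            /* Touch one byte per page */
    (void)sink;

    munmap(data, st.st_size);
    close(fd);
}
```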
Copy-on-Write (CoW) is a deferred-copy technique: instead of duplicating data immediately, the system shares the original copy and only creates a separate copy if either party modifies it. This is 'zero-copy' in the sense that no copy occurs unless absolutely necessary.
CoW in fork():
The most famous CoW application is fork(). Creating a child process would be prohibitively expensive if all parent memory were copied:
```c
/* Copy-on-Write enables cheap fork() */
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

/* Application-specific helpers, assumed to be defined elsewhere */
void initialize_data(char *data);
void read_only_operations(char *data);

int main(void) {
    /*
     * Allocate a large memory region
     * Assume this is 1 GB of application data
     */
    char *data = malloc(1024UL * 1024 * 1024);
    initialize_data(data);

    /*
     * fork() creates a child process
     * With Copy-on-Write:
     * - No memory is actually copied
     * - Both processes share the same physical pages
     * - Pages are marked read-only in both address spaces
     * - fork() completes in microseconds regardless of memory size
     */
    pid_t pid = fork();

    if (pid == 0) {
        /* Child process */

        /*
         * If the child only reads the data:
         * - No pages are copied
         * - The child exits still sharing the parent's memory
         */
        read_only_operations(data);

        /*
         * If the child writes to a page:
         * - Page fault triggered (write to a read-only page)
         * - Kernel allocates a new physical page
         * - Content copied from the original page
         * - Child's page table updated to point to the new page
         * - Write proceeds
         * - Only ONE page copied, not all memory!
         */
        data[1000] = 'X';   /* This page, and only this page, is copied */

        exit(0);
    }

    /* Parent continues with the original pages */
    wait(NULL);
    return 0;
}
```

CoW in File Systems:
Modern file systems (Btrfs, ZFS, APFS) use CoW for data and metadata. When modifying a file block:

1. The new version of the block is written to a fresh location on disk
2. Metadata is updated to point at the new block
3. The old block is freed only once nothing (no snapshot or clone) still references it

This enables:

- Cheap snapshots and clones that share all unmodified blocks
- Crash consistency without a separate journal, since the old tree stays valid until the new one is committed
- Features such as data checksumming and incremental send/receive
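User space can tap into file-system CoW directly through reflink cloning. The sketch below is an illustration under the assumption that the underlying file system supports reflinks (Btrfs and XFS do; ext4 does not); the helper name is mine. It uses Linux's FICLONE ioctl to make the destination file share the source's blocks, copying nothing until either file is later modified.

```c
/* Reflink (CoW) file clone via Linux's FICLONE ioctl - minimal sketch */
#include <sys/ioctl.h>
#include <linux/fs.h>     /* FICLONE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int reflink_clone(const char *src_path, const char *dst_path) {
    int src = open(src_path, O_RDONLY);
    int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) { perror("open"); return -1; }

    /*
     * FICLONE asks the file system to make dst share src's data blocks.
     * No data is read or written now; blocks are copied lazily, and only
     * those later modified in either file.
     * Fails with EOPNOTSUPP on file systems without reflink support,
     * or EXDEV when the two files live on different file systems.
     */
    int ret = ioctl(dst, FICLONE, src);
    if (ret < 0)
        perror("FICLONE (file system may not support reflinks)");

    close(src);
    close(dst);
    return ret;
}
```

This is the same mechanism `cp --reflink=always` relies on: a "copy" of a multi-gigabyte file completes almost instantly because only metadata changes.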
CoW isn't free. Write amplification occurs when small changes trigger page/block copies. Fragmentation increases as files are modified. For write-heavy workloads, CoW overhead can exceed the benefits. Database systems often disable CoW for data files (e.g., 'nocow' attribute on Btrfs).
The ultimate zero-copy technique bypasses the kernel entirely. Remote Direct Memory Access (RDMA) and kernel bypass networking allow applications to send and receive data directly to/from network hardware without any kernel involvement after initial setup.
RDMA Concepts:
```c
/* Simplified RDMA send example (ibverbs API) */
#include <infiniband/verbs.h>
#include <stdint.h>

/*
 * After connection setup (omitted for brevity):
 * - qp: Queue Pair handle
 * - cq: Completion Queue associated with the QP's send queue
 * - mr: Memory Region (registered buffer)
 */
void rdma_send_zerocopy(struct ibv_qp *qp, struct ibv_cq *cq,
                        struct ibv_mr *mr, void *data, size_t len) {
    struct ibv_sge sge = {
        .addr   = (uintptr_t)data,   /* User buffer address */
        .length = (uint32_t)len,
        .lkey   = mr->lkey           /* Local memory key */
    };

    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED
    };
    struct ibv_send_wr *bad_wr;

    /*
     * Post send work request
     * - Directly programs the NIC hardware
     * - No system call, no kernel involvement
     * - NIC DMAs from the user buffer to the network
     * - True zero-copy: data goes straight from user memory to the wire
     */
    ibv_post_send(qp, &wr, &bad_wr);

    /* Poll the completion queue for send completion */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) {
        /* Spin waiting for completion */
    }

    /* Data has been transmitted without any CPU copies */
}
```

Kernel Bypass Networking (DPDK, SPDK):
For extreme performance, frameworks like DPDK (Data Plane Development Kit) remove the kernel from the networking data path entirely:
| Aspect | Traditional (Linux stack) | DPDK |
|---|---|---|
| Packet path | NIC → Driver → TCP/IP stack → Socket → App | NIC → User-space driver → App |
| Copies | 2-3 | 0 |
| Context switches | 2+ per packet | 0 |
| Latency | ~10-50 μs | ~1-5 μs |
| Packets/sec (10GbE) | ~1-2 million | ~14.8 million (line rate) |
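For a flavor of the user-space data path, here is a heavily simplified DPDK receive loop. It assumes EAL, port, and RX queue initialization (rte_eth_dev_configure(), rte_eth_rx_queue_setup(), rte_eth_dev_start()) have already been done, which in a real application is many more lines; the hypothetical poll_port_forever() helper is mine, and the snippet is a sketch of the polling model, not a complete program.

```c
/* Simplified DPDK polling receive loop - illustrative sketch only */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

void poll_port_forever(uint16_t port_id, uint16_t queue_id) {
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /*
         * rte_eth_rx_burst() pulls up to BURST_SIZE packet buffers that
         * the NIC has already DMAed into user-space memory.
         * No system call, no interrupt, no copy - just a poll of the
         * NIC's RX descriptor ring.
         */
        const uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id,
                                                bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* Payload is reachable via rte_pktmbuf_mtod(bufs[i], ...) */
            /* ... application-specific processing would go here ... */
            rte_pktmbuf_free(bufs[i]);   /* Return the buffer to its pool */
        }
    }
}
```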
Bypassing the kernel means losing kernel services: no firewall, no connection tracking, no standard sockets API, no protection between applications. DPDK applications must implement their own TCP stack or use raw packets. It's for specialized high-performance applications (routers, load balancers, trading systems), not general-purpose use.
Zero-copy techniques represent the pinnacle of I/O optimization, eliminating the CPU memory copies that limit throughput and waste resources. From sendfile() to RDMA, each technique trades complexity for performance. Let's consolidate the key insights:
| Use Case | Best Technique | Reason |
|---|---|---|
| Static file server | sendfile() | Direct file-to-socket, minimal kernel overhead |
| TCP proxy/load balancer | splice() | Socket-to-socket via pipe, no user-space |
| Database file access | mmap() | Random access, shared caching |
| Large buffer passing | CoW (fork/clone) | Avoids copying until write |
| High-frequency trading | RDMA/DPDK | Sub-microsecond latency required |
| Generic application | Evaluate need | Zero-copy complexity vs. benefit |
Module Complete:
You've now mastered the complete buffering hierarchy—from single and double buffers through circular buffers, buffer management, and finally zero-copy techniques. This knowledge enables you to design and analyze I/O systems at every level, from understanding why a simple memcpy causes performance problems to architecting systems that achieve line-rate throughput.
You now understand the full spectrum of zero-copy techniques: sendfile(), splice(), tee(), mmap(), copy-on-write, and kernel-bypass approaches like RDMA. Combined with your knowledge of buffering fundamentals, you can design I/O systems that minimize overhead and maximize throughput.