Throughout this module, we've seen how buffering mediates between devices and processes. But we've accepted an implicit cost: data is copied at each stage. Serving a file over the network, for example, typically involves:

1. DMA copy from disk into the kernel page cache
2. Copy from the page cache into the user-space buffer (read())
3. Copy from the user-space buffer into the kernel socket buffer (write())
4. DMA copy from the socket buffer to the NIC
For a simple 'serve file over network' operation, data is copied four times. Each copy consumes CPU cycles, memory bandwidth, and cache space. At 10 Gbps networking speeds, the CPU spends more time copying than doing actual work.
Zero-copy techniques eliminate unnecessary copies, allowing data to flow from source to destination with minimal CPU involvement. This isn't optimization at the margins—it's the difference between systems that scale and systems that bottleneck.
By the end of this page, you will understand the true cost of memory copies, master the suite of zero-copy mechanisms (sendfile, splice, mmap, RDMA), analyze when each technique applies, and implement zero-copy patterns for maximum I/O performance.
To appreciate zero-copy's value, we must quantify the cost of traditional copying. This cost manifests in multiple dimensions:
CPU Cost:
Memory copying is deceptively expensive. memcpy() for aligned data achieves roughly 10-20 GB/s on modern CPUs—impressive, but consider the implications:
| Data Size | Copy Time (at ~10 GB/s) | Copy Time at 1M ops/sec | CPU Core Usage |
|---|---|---|---|
| 1 KB | ~100 ns | 100 ms/sec | 10% |
| 4 KB (page) | ~400 ns | 400 ms/sec | 40% |
| 64 KB | ~6.4 μs | 6.4 sec/sec | 640% (infeasible) |
| 1 MB | ~100 μs | 100 sec/sec | 10,000% (impossible) |
The table reveals that serving 1 million 4KB pages per second consumes 40% of a CPU core just for copying—before any actual processing.
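You can reproduce these figures on your own hardware with a small probe. The sketch below is illustrative and not part of the module's reference code: the buffer size, iteration count, and the hypothetical helper are my choices, and it simply times repeated memcpy() calls over a buffer larger than the last-level cache, so treat the output as a rough estimate rather than a rigorous benchmark.

```c
/* Rough memcpy() bandwidth probe - a minimal sketch, not a rigorous benchmark */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUF_SIZE   (64UL * 1024 * 1024)   /* 64 MB: larger than a typical LLC */
#define ITERATIONS 20

int main(void) {
    char *src = malloc(BUF_SIZE);
    char *dst = malloc(BUF_SIZE);
    if (!src || !dst) return 1;
    memset(src, 'A', BUF_SIZE);           /* Touch pages so they are resident */

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++)
        memcpy(dst, src, BUF_SIZE);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double secs = (end.tv_sec - start.tv_sec) +
                  (end.tv_nsec - start.tv_nsec) / 1e9;
    double gb   = (double)BUF_SIZE * ITERATIONS / 1e9;

    /* Print a byte of dst so the compiler cannot discard the copies */
    printf("memcpy bandwidth: ~%.1f GB/s (check byte: %c)\n",
           gb / secs, dst[BUF_SIZE - 1]);

    free(src); free(dst);
    return 0;
}
```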
Memory Bandwidth Cost:
Each copy reads from source and writes to destination. A single 4KB copy consumes 8KB of memory bandwidth. For a server handling 10 Gbps of network traffic (approximately 1.25 GB/s of data), the traditional 4-copy path consumes:
$$\text{Bandwidth} = 1.25 \text{ GB/s} \times 4 \text{ copies} \times 2 \text{ (read + write)} = 10 \text{ GB/s}$$
Modern DDR4 memory provides around 25 GB/s bandwidth per channel. A single network flow could consume 40% of available memory bandwidth just for copies!
Cache Pollution:
Perhaps the most insidious cost is cache pollution. When data is copied through the CPU, it occupies cache lines:

- The source buffer is pulled into the L1/L2/L3 caches on read
- The destination buffer is allocated into cache on write
- Both evict cache lines belonging to the application's actual working set
For streaming data that's never accessed again (like a web server forwarding files), cache pollution is pure overhead.
Systems that 'work fine in testing' often hit a wall in production because testing doesn't replicate memory bandwidth saturation or cache thrashing. Zero-copy isn't just about peak throughput—it's about maintaining consistent performance under load.
Let's compare the data path for a common operation: serving a file over the network.
Traditional Path (read() + write()):
```c
/* Traditional file sending - 4 copies, 4 context switches */
#include <unistd.h>

void send_file_traditional(int file_fd, int socket_fd, size_t size) {
    char buffer[65536];
    ssize_t bytes;

    while (size > 0) {
        size_t chunk = size < sizeof(buffer) ? size : sizeof(buffer);

        /* Copy 1: read() - Kernel page cache → User buffer */
        /* Context switch: User → Kernel → User */
        bytes = read(file_fd, buffer, chunk);
        if (bytes <= 0) break;

        /* Copy 2: write() - User buffer → Kernel socket buffer */
        /* Context switch: User → Kernel → User */
        write(socket_fd, buffer, bytes);

        /* Kernel internally:
         * Copy 3: Socket buffer → NIC TX buffer (or DMA scatter-gather)
         *
         * Total context switches: 4 (2 syscalls × 2 transitions each)
         * Total copies: 3-4 depending on NIC capabilities
         */
        size -= bytes;
    }
}
```

Zero-Copy Path (sendfile()):
The data never transits through user space. Kernel manipulates buffer references, not data:
```c
/* Zero-copy file sending with sendfile() */
#include <sys/sendfile.h>

void send_file_zerocopy(int file_fd, int socket_fd, size_t size) {
    off_t offset = 0;
    ssize_t sent;

    while (size > 0) {
        /*
         * sendfile(): Transfer data from file to socket without
         * ever copying to user space
         *
         * Kernel operations:
         * 1. Page cache contains file data (or reads it from disk)
         * 2. Socket buffer gets a reference to the page cache pages
         * 3. NIC DMAs directly from the page cache
         *
         * With modern NICs supporting scatter-gather DMA:
         * - Zero CPU copies
         * - One syscall (two user↔kernel transitions) per chunk
         * - Data flows: Disk → Page Cache → NIC (via DMA)
         */
        sent = sendfile(socket_fd, file_fd, &offset, size);
        if (sent <= 0) break;
        size -= sent;
    }
}
```

| Metric | read()+write() | sendfile() |
|---|---|---|
| CPU copies | 2-3 | 0-1 |
| Context switches | 4 | 2 |
| Syscalls | 2 per chunk | 1 per chunk |
| User space involvement | Data passes through | None |
| Cache utilization | Polluted | Preserved |
| CPU usage (100 MB/s) | ~20% | ~2% |
Zero-copy is most beneficial when data passes through without modification. If your application needs to encrypt, compress, or transform data, it must access the data—eliminating zero-copy benefits. For proxies, file servers, and streaming services that forward data unchanged, zero-copy is transformative.
The sendfile() system call is the workhorse of zero-copy networking on Linux. It transfers data directly from a file descriptor to a socket, bypassing user space entirely.
Signature:
```c
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * sendfile - Transfer data between file descriptors
 *
 * @out_fd: Destination file descriptor (a socket on older kernels;
 *          any fd since Linux 2.6.33)
 * @in_fd:  Source file descriptor (must support mmap / have a page cache)
 * @offset: Pointer to file offset; updated as data is sent
 * @count:  Number of bytes to transfer
 *
 * Returns: Number of bytes written (may be less than count),
 *          -1 on error
 */
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

/* Example: High-performance static file server */
void serve_static_file(int client_socket, const char *path) {
    struct stat st;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;
    fstat(fd, &st);

    /* Send HTTP headers (small, via ordinary write) */
    dprintf(client_socket,
            "HTTP/1.1 200 OK\r\n"
            "Content-Length: %lld\r\n"
            "Content-Type: application/octet-stream\r\n\r\n",
            (long long)st.st_size);

    /* Send file body with zero-copy */
    off_t offset = 0;
    size_t remaining = st.st_size;
    while (remaining > 0) {
        ssize_t sent = sendfile(client_socket, fd, &offset, remaining);
        if (sent <= 0) break;
        remaining -= sent;
    }

    close(fd);
}
```

How sendfile() Works Internally:
1. File data in page cache: If the file isn't already cached, the kernel reads it from disk into the page cache.
2. Socket buffer setup: Instead of copying data, the socket buffer descriptor points to the page cache pages.
3. NIC scatter-gather: Modern NICs support scatter-gather DMA, reading data from non-contiguous memory locations. The NIC reads directly from the page cache pages.
4. Reference counting: Page cache pages are reference-counted. They won't be reclaimed until both the page cache and the socket are done with them.
sendfile() only works with source file descriptors that have a page cache (regular files, block devices). You cannot sendfile() from a socket to another socket, or from a pipe. For those use cases, splice() is needed. Also, before Linux 2.6.33 the destination had to be a socket; newer kernels accept any file descriptor as the output.
The splice() system call is a more general zero-copy mechanism that moves data between file descriptors without copying through user space. Unlike sendfile(), splice() requires at least one of the descriptors to be a pipe.
splice() API:
```c
#define _GNU_SOURCE        /* splice() and tee() are Linux-specific */
#include <fcntl.h>
#include <unistd.h>

/*
 * splice - Move data between file descriptors without user-space copy
 *
 * @fd_in:   Source file descriptor
 * @off_in:  Offset in source (NULL for pipes/sockets)
 * @fd_out:  Destination file descriptor
 * @off_out: Offset in destination (NULL for pipes/sockets)
 * @len:     Maximum bytes to transfer
 * @flags:   SPLICE_F_MOVE, SPLICE_F_NONBLOCK, SPLICE_F_MORE, etc.
 *
 * At least one of fd_in or fd_out must be a pipe!
 */
ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out,
               size_t len, unsigned int flags);

/*
 * tee - Duplicate pipe contents without consuming them
 * Creates a copy of the data for multiple consumers
 */
ssize_t tee(int fd_in, int fd_out, size_t len, unsigned int flags);

/* Example: Zero-copy proxy (socket → pipe → socket) */
void zerocopy_proxy(int client_sock, int server_sock) {
    int pipefd[2];
    pipe(pipefd);

    /*
     * Use splice() to move data through a pipe:
     * Client → Pipe → Server
     */
    for (;;) {
        ssize_t n;

        /* Splice from client socket into pipe */
        n = splice(client_sock, NULL, pipefd[1], NULL, 65536,
                   SPLICE_F_MOVE | SPLICE_F_MORE);
        if (n <= 0) break;

        /* Splice from pipe to server socket */
        while (n > 0) {
            ssize_t sent = splice(pipefd[0], NULL, server_sock, NULL, n,
                                  SPLICE_F_MOVE | SPLICE_F_MORE);
            if (sent <= 0) break;
            n -= sent;
        }
    }

    close(pipefd[0]);
    close(pipefd[1]);
}
```

The Pipe as Buffer:
The pipe in splice() serves as a kernel-managed buffer. This might seem wasteful—why add a pipe if we're trying to avoid buffers? The key insight is that the pipe's buffer holds page references, not data copies. When splicing:

- splice() into the pipe attaches references to the source's pages onto the pipe's internal ring
- splice() out of the pipe hands those same page references to the destination
- The payload bytes themselves are never touched by the CPU
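One practical tuning knob follows from this: the default pipe capacity (64 KB on most Linux systems) caps how much each splice() call can move. The sketch below is an illustration, not part of the original module; the helper name and sizes are my own, and it uses Linux's F_SETPIPE_SZ fcntl to request a larger pipe buffer for a splice loop.

```c
/* Enlarge a pipe used as a splice() buffer - minimal sketch (Linux-specific) */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int make_splice_pipe(int pipefd[2], size_t want_bytes) {
    if (pipe(pipefd) < 0)
        return -1;

    /*
     * F_SETPIPE_SZ asks the kernel for a larger pipe buffer and returns
     * the capacity actually granted (rounded up to a page multiple).
     * Unprivileged processes are limited by /proc/sys/fs/pipe-max-size,
     * so treat failure as non-fatal and keep the default capacity.
     */
    int actual = fcntl(pipefd[1], F_SETPIPE_SZ, (int)want_bytes);
    if (actual < 0)
        perror("F_SETPIPE_SZ (keeping default pipe size)");
    else
        fprintf(stderr, "pipe capacity: %d bytes\n", actual);

    return 0;
}
```

With a 1 MB pipe, each iteration of the proxy loop above can move up to 1 MB per splice() call, reducing syscall overhead for high-throughput streams.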
The tee() System Call:
tee() duplicates pipe data without consuming it—useful for broadcasting:
```c
/* tee() for broadcasting: send the same data to multiple destinations */
void broadcast_to_clients(int source_fd, int *client_sockets, int num_clients) {
    int main_pipe[2], dup_pipe[2];
    pipe(main_pipe);
    pipe(dup_pipe);

    /* Read from source into the main pipe (only once) */
    ssize_t n = splice(source_fd, NULL, main_pipe[1], NULL, 65536,
                       SPLICE_F_MOVE);
    if (n <= 0)
        goto out;

    /* For every client but the last: duplicate, then send the duplicate */
    for (int i = 0; i < num_clients - 1; i++) {
        /*
         * tee() copies page references from the main pipe into the
         * duplicate pipe without consuming the main pipe's contents.
         * (tee() requires two *different* pipes.)
         */
        ssize_t duplicated = tee(main_pipe[0], dup_pipe[1], n, 0);
        if (duplicated <= 0)
            break;

        /* splice() the duplicate out to this client */
        splice(dup_pipe[0], NULL, client_sockets[i], NULL,
               duplicated, SPLICE_F_MOVE);
    }

    /* The last client consumes the original data directly */
    splice(main_pipe[0], NULL, client_sockets[num_clients - 1], NULL,
           n, SPLICE_F_MOVE);

out:
    close(main_pipe[0]); close(main_pipe[1]);
    close(dup_pipe[0]);  close(dup_pipe[1]);
}
```

Use sendfile() when copying from file to socket—it's simpler and slightly faster. Use splice() when moving data between sockets, between files and non-socket destinations, or when you need tee() for duplication. splice() is the building block; sendfile() is the specialized optimization.
Memory mapping (mmap()) enables zero-copy by mapping file contents directly into a process's address space. Instead of copying data via read()/write(), the process accesses file data as if it were regular memory.
How mmap() Achieves Zero-Copy:
```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Zero-copy file access with mmap() */
void process_file_mmap(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;

    struct stat st;
    fstat(fd, &st);

    /*
     * Map the file into the address space
     * - No data copied yet!
     * - Kernel sets up page table entries pointing to "nothing"
     * - Actual data loaded on demand via page faults
     */
    char *data = mmap(NULL, st.st_size,
                      PROT_READ,      /* Read-only access */
                      MAP_PRIVATE,    /* Private, copy-on-write mapping */
                      fd, 0);
    if (data == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return;
    }

    /* Access data directly - page faults load pages as needed */
    size_t count = 0;
    for (size_t i = 0; i < (size_t)st.st_size; i++) {
        /*
         * First access to a page triggers:
         * 1. Page fault (minor fault if the data is in the page cache)
         * 2. Kernel maps the physical page into the process address space
         * 3. No copy - the process sees the page cache directly
         */
        if (data[i] == '\n') count++;
    }

    printf("Lines: %zu\n", count);

    munmap(data, st.st_size);
    close(fd);
}
```

mmap() Advantages:

- No explicit read()/write() copies: the process reads page cache pages in place
- No per-access system calls once the mapping is established
- Pages are loaded lazily and shared with other processes mapping the same file
- A natural fit for random access patterns and persistent data structures
mmap() Limitations and Gotchas:

- Page faults and TLB pressure add per-page overhead that buffered read() avoids
- Truncating the file while it is mapped can deliver SIGBUS on access
- Setup and teardown (mmap/munmap, page table updates) are costly for small or short-lived files
- Dirty pages are written back at the kernel's discretion unless you call msync()
For simple sequential file reading, read() with a properly-sized buffer often beats mmap(). The kernel's read-ahead is highly optimized for sequential access. mmap() shines for random access patterns, memory-mapped databases, and inter-process communication.
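If you do need to stream through a mapping sequentially, the kernel can at least be given a hint. The sketch below is an illustrative pattern rather than module reference code (the function name is mine): it uses posix_madvise() with POSIX_MADV_SEQUENTIAL so the kernel reads ahead more aggressively and reclaims pages already passed.

```c
/* Hinting a sequential scan over an mmap()ed file - minimal sketch */
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

void scan_mapped_file_sequential(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;

    struct stat st;
    fstat(fd, &st);

    char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { close(fd); return; }

    /*
     * POSIX_MADV_SEQUENTIAL tells the kernel we will touch pages in order,
     * so it can read ahead aggressively and drop pages behind the cursor,
     * narrowing the gap with buffered read() for sequential scans.
     */
    posix_madvise(data, st.st_size, POSIX_MADV_SEQUENTIAL);

    volatile char sink = 0;
    for (size_t i = 0; i < (size_t)st.st_size; i += 4096)
        sink ^= data[i];            /* Touch one byte per page */
    (void)sink;

    munmap(data, st.st_size);
    close(fd);
}
```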
Copy-on-Write (CoW) is a deferred-copy technique: instead of duplicating data immediately, the system shares the original copy and only creates a separate copy if either party modifies it. This is 'zero-copy' in the sense that no copy occurs unless absolutely necessary.
CoW in fork():
The most famous CoW application is fork(). Creating a child process would be prohibitively expensive if all parent memory were copied:
```c
/* Copy-on-Write enables cheap fork() */
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

/* Application-specific helpers, assumed to be defined elsewhere */
void initialize_data(char *data);
void read_only_operations(char *data);

int main(void) {
    /*
     * Allocate a large memory region
     * Assume this is 1 GB of application data
     */
    char *data = malloc(1024UL * 1024 * 1024);
    initialize_data(data);

    /*
     * fork() creates a child process
     * With Copy-on-Write:
     * - No memory is actually copied
     * - Both processes share the same physical pages
     * - Pages are marked read-only in both address spaces
     * - fork() completes in microseconds regardless of memory size
     */
    pid_t pid = fork();

    if (pid == 0) {
        /* Child process */

        /*
         * If the child only reads the data:
         * - No pages are copied
         * - The child exits still sharing the parent's memory
         */
        read_only_operations(data);

        /*
         * If the child writes to a page:
         * - Page fault triggered (write to a read-only page)
         * - Kernel allocates a new physical page
         * - Content copied from the original page
         * - Child's page table updated to point to the new page
         * - Write proceeds
         * - Only ONE page copied, not all memory!
         */
        data[1000] = 'X';   /* This page, and only this page, is copied */

        exit(0);
    }

    /* Parent continues with the original pages */
    wait(NULL);
    return 0;
}
```

CoW in File Systems:
Modern file systems (Btrfs, ZFS, APFS) use CoW for data and metadata. When modifying a file block:

1. The new version of the block is written to a fresh location on disk
2. Metadata is updated to point at the new block
3. The old block is freed only once nothing (no snapshot or clone) still references it

This enables:

- Cheap snapshots and clones that share all unmodified blocks
- Crash consistency without a separate journal, since the old tree stays valid until the new one is committed
- Features such as data checksumming and incremental send/receive
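User space can tap into file-system CoW directly through reflink cloning. The sketch below is an illustration under the assumption that the underlying file system supports reflinks (Btrfs and XFS do; ext4 does not); the helper name is mine. It uses Linux's FICLONE ioctl to make the destination file share the source's blocks, copying nothing until either file is later modified.

```c
/* Reflink (CoW) file clone via Linux's FICLONE ioctl - minimal sketch */
#include <sys/ioctl.h>
#include <linux/fs.h>     /* FICLONE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int reflink_clone(const char *src_path, const char *dst_path) {
    int src = open(src_path, O_RDONLY);
    int dst = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (src < 0 || dst < 0) { perror("open"); return -1; }

    /*
     * FICLONE asks the file system to make dst share src's data blocks.
     * No data is read or written now; blocks are copied lazily, and only
     * those later modified in either file.
     * Fails with EOPNOTSUPP on file systems without reflink support,
     * or EXDEV when the two files live on different file systems.
     */
    int ret = ioctl(dst, FICLONE, src);
    if (ret < 0)
        perror("FICLONE (file system may not support reflinks)");

    close(src);
    close(dst);
    return ret;
}
```

This is the same mechanism `cp --reflink=always` relies on: a "copy" of a multi-gigabyte file completes almost instantly because only metadata changes.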
CoW isn't free. Write amplification occurs when small changes trigger page/block copies. Fragmentation increases as files are modified. For write-heavy workloads, CoW overhead can exceed the benefits. Database systems often disable CoW for data files (e.g., 'nocow' attribute on Btrfs).
The ultimate zero-copy technique bypasses the kernel entirely. Remote Direct Memory Access (RDMA) and kernel bypass networking allow applications to send and receive data directly to/from network hardware without any kernel involvement after initial setup.
RDMA Concepts:
```c
/* Simplified RDMA send example (ibverbs API) */
#include <infiniband/verbs.h>
#include <stdint.h>

/*
 * After connection setup (omitted for brevity):
 * - qp: Queue Pair handle
 * - cq: Completion Queue associated with the QP's send queue
 * - mr: Memory Region (registered buffer)
 */
void rdma_send_zerocopy(struct ibv_qp *qp, struct ibv_cq *cq,
                        struct ibv_mr *mr, void *data, size_t len) {
    struct ibv_sge sge = {
        .addr   = (uintptr_t)data,   /* User buffer address */
        .length = (uint32_t)len,
        .lkey   = mr->lkey           /* Local memory key */
    };

    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED
    };
    struct ibv_send_wr *bad_wr;

    /*
     * Post send work request
     * - Directly programs the NIC hardware
     * - No system call, no kernel involvement
     * - NIC DMAs from the user buffer to the network
     * - True zero-copy: data goes straight from user memory to the wire
     */
    ibv_post_send(qp, &wr, &bad_wr);

    /* Poll the completion queue for send completion */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) {
        /* Spin waiting for completion */
    }

    /* Data has been transmitted without any CPU copies */
}
```

Kernel Bypass Networking (DPDK, SPDK):
For extreme performance, frameworks like DPDK (Data Plane Development Kit) remove the kernel from the networking data path entirely:
| Aspect | Traditional (Linux stack) | DPDK |
|---|---|---|
| Packet path | NIC → Driver → TCP/IP stack → Socket → App | NIC → User-space driver → App |
| Copies | 2-3 | 0 |
| Context switches | 2+ per packet | 0 |
| Latency | ~10-50 μs | ~1-5 μs |
| Packets/sec (10GbE) | ~1-2 million | ~14.8 million (line rate) |
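For a flavor of the user-space data path, here is a heavily simplified DPDK receive loop. It assumes EAL, port, and RX queue initialization (rte_eth_dev_configure(), rte_eth_rx_queue_setup(), rte_eth_dev_start()) have already been done, which in a real application is many more lines; the hypothetical poll_port_forever() helper is mine, and the snippet is a sketch of the polling model, not a complete program.

```c
/* Simplified DPDK polling receive loop - illustrative sketch only */
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

void poll_port_forever(uint16_t port_id, uint16_t queue_id) {
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /*
         * rte_eth_rx_burst() pulls up to BURST_SIZE packet buffers that
         * the NIC has already DMAed into user-space memory.
         * No system call, no interrupt, no copy - just a poll of the
         * NIC's RX descriptor ring.
         */
        const uint16_t nb_rx = rte_eth_rx_burst(port_id, queue_id,
                                                bufs, BURST_SIZE);

        for (uint16_t i = 0; i < nb_rx; i++) {
            /* Payload is reachable via rte_pktmbuf_mtod(bufs[i], ...) */
            /* ... application-specific processing would go here ... */
            rte_pktmbuf_free(bufs[i]);   /* Return the buffer to its pool */
        }
    }
}
```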
Bypassing the kernel means losing kernel services: no firewall, no connection tracking, no standard sockets API, no protection between applications. DPDK applications must implement their own TCP stack or use raw packets. It's for specialized high-performance applications (routers, load balancers, trading systems), not general-purpose use.
Zero-copy techniques represent the pinnacle of I/O optimization, eliminating the CPU memory copies that limit throughput and waste resources. From sendfile() to RDMA, each technique trades complexity for performance. Let's consolidate the key insights:
| Use Case | Best Technique | Reason |
|---|---|---|
| Static file server | sendfile() | Direct file-to-socket, minimal kernel overhead |
| TCP proxy/load balancer | splice() | Socket-to-socket via pipe, no user-space |
| Database file access | mmap() | Random access, shared caching |
| Large buffer passing | CoW (fork/clone) | Avoids copying until write |
| High-frequency trading | RDMA/DPDK | Sub-microsecond latency required |
| Generic application | Evaluate need | Zero-copy complexity vs. benefit |
Module Complete:
You've now mastered the complete buffering hierarchy—from single and double buffers through circular buffers, buffer management, and finally zero-copy techniques. This knowledge enables you to design and analyze I/O systems at every level, from understanding why a simple memcpy causes performance problems to architecting systems that achieve line-rate throughput.
You now understand the full spectrum of zero-copy techniques: sendfile(), splice(), tee(), mmap(), copy-on-write, and kernel-bypass approaches like RDMA. Combined with your knowledge of buffering fundamentals, you can design I/O systems that minimize overhead and maximize throughput.