Double buffering works elegantly when producer and consumer operate at roughly matched rates. But the real world is rarely so accommodating. Network packets arrive in bursts. Disk I/O latencies vary by orders of magnitude. User input comes in unpredictable clusters. Video frames have wildly different compression ratios.
When a producer temporarily outpaces the consumer, double buffering fails—there's no buffer available, and data must be dropped or the producer must stall. Conversely, when the consumer races ahead, it quickly exhausts available data and blocks.
Circular buffering (also called ring buffering) extends the double buffering concept from two buffers to N buffers arranged in a logical ring. This provides a reservoir of slack that absorbs rate mismatches, smoothing out bursts and enabling more robust streaming.
By the end of this page, you will master the structure and mechanics of circular buffers, understand the mathematics of head and tail pointer manipulation, implement lock-free single-producer single-consumer rings, analyze capacity requirements for burst absorption, and recognize circular buffer applications across operating systems.
A circular buffer is a fixed-size array that wraps around—when you reach the end, you continue from the beginning. Two pointers track the current state: the head, where the producer will write next, and the tail, where the consumer will read next.
The filled region is the data between tail and head (wrapping if necessary). The empty region is the space between head and tail.
The Wrapping Arithmetic:
The circular nature is achieved through modulo arithmetic. For a buffer of size N:
- Advance head after a write: head = (head + 1) % N
- Advance tail after a read: tail = (tail + 1) % N
- Data available: (head - tail + N) % N, or equivalently (head >= tail) ? (head - tail) : (N - tail + head)
- Free space: N - 1 - data_available (we keep one slot empty to distinguish full from empty)

Why Keep One Slot Empty?
A critical design decision: we never fill the last slot. This distinguishes a full buffer from an empty one: the buffer is empty when head == tail, and full when (head + 1) % N == tail.
Without this rule, head == tail would be ambiguous—is the buffer completely full or completely empty? Alternative designs use a separate count variable, but the one-slot-empty approach enables elegant lock-free implementations.
When the buffer size N is a power of two, modulo operations become simple bitwise AND: index & (N-1). This is significantly faster than division-based modulo. Most production ring buffers use power-of-two sizes for this reason. For example, size 4096 means index & 0xFFF replaces index % 4096.
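To make the arithmetic concrete, here is a minimal sketch of the index helpers under the one-slot-empty convention, assuming a power-of-two capacity (the N, MASK, and rb_* names are illustrative, not part of any particular API):

```c
#include <stddef.h>
#include <stdbool.h>

#define N    4096u        /* Capacity: must be a power of two */
#define MASK (N - 1u)     /* index & MASK is equivalent to index % N */

/* Advance an index by one slot, wrapping at the end of the array */
static inline size_t rb_advance(size_t idx) { return (idx + 1u) & MASK; }

/* Empty: head has not moved past tail */
static inline bool rb_empty(size_t head, size_t tail) { return head == tail; }

/* Full: advancing head would collide with tail (one slot kept empty) */
static inline bool rb_full(size_t head, size_t tail) { return rb_advance(head) == tail; }

/* Number of filled slots */
static inline size_t rb_count(size_t head, size_t tail) { return (head - tail) & MASK; }

/* Number of slots still writable (capacity minus the reserved slot) */
static inline size_t rb_space(size_t head, size_t tail) { return (N - 1u) - rb_count(head, tail); }
```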
Let's examine a production-quality circular buffer implementation with careful attention to memory layout and cache efficiency.
Basic Ring Buffer Structure:
```c
/* Lock-free Single-Producer Single-Consumer Ring Buffer */
#include <stdatomic.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdbool.h>

#define CACHE_LINE_SIZE 64

/*
 * Ring buffer structure with cache-line padding.
 * Producer and consumer indices are on separate cache lines
 * to prevent false sharing in multi-threaded scenarios.
 */
struct ring_buffer {
    /* Producer side - cache line 1 */
    alignas(CACHE_LINE_SIZE) atomic_size_t head;
    size_t cached_tail;   /* Local cache of tail for producer */
    char _pad1[CACHE_LINE_SIZE - 2 * sizeof(size_t)];

    /* Consumer side - cache line 2 */
    alignas(CACHE_LINE_SIZE) atomic_size_t tail;
    size_t cached_head;   /* Local cache of head for consumer */
    char _pad2[CACHE_LINE_SIZE - 2 * sizeof(size_t)];

    /* Shared configuration - cache line 3 */
    alignas(CACHE_LINE_SIZE) size_t mask;  /* N - 1 for power-of-two N */
    size_t element_size;  /* Size of each element */
    void *buffer;         /* The actual data storage */
};

/* Initialize a ring buffer with given capacity (must be power of 2) */
int ring_buffer_init(struct ring_buffer *rb, size_t capacity, size_t elem_size);

/* Producer: try to enqueue an element, returns false if full */
bool ring_buffer_enqueue(struct ring_buffer *rb, const void *elem);

/* Consumer: try to dequeue an element, returns false if empty */
bool ring_buffer_dequeue(struct ring_buffer *rb, void *elem);

/* Query current fill level */
size_t ring_buffer_count(const struct ring_buffer *rb);

/* Check if empty/full */
bool ring_buffer_is_empty(const struct ring_buffer *rb);
bool ring_buffer_is_full(const struct ring_buffer *rb);
```

Cache Line Considerations:
Notice the careful padding and alignment. This prevents false sharing—a performance killer in multi-threaded code where producer and consumer run on different CPU cores.
False sharing occurs when two variables on the same cache line are modified by different cores. Each modification invalidates the entire cache line on the other core, causing expensive cache coherency traffic.
By placing the producer's head on one cache line and the consumer's tail on another, each can modify its index without disturbing the other's cache.
| Cache Line | Contents | Access Pattern |
|---|---|---|
| Line 1 (0x00-0x3F) | head, cached_tail, padding | Producer modifies frequently |
| Line 2 (0x40-0x7F) | tail, cached_head, padding | Consumer modifies frequently |
| Line 3 (0x80-0xBF) | mask, element_size, buffer ptr | Read-only after init |
| Line 4+ (buffer) | Actual data elements | Both access, but different regions |
The cached_tail (producer) and cached_head (consumer) are local caches of the other party's index. This reduces expensive atomic reads: the producer only re-reads the actual tail atomically when its cached value suggests the buffer might be full. This optimization can improve throughput by 2-3x in high-frequency scenarios.
The single-producer, single-consumer (SPSC) ring buffer can be implemented entirely lock-free using atomic operations. This eliminates synchronization overhead and makes the implementation suitable for real-time systems and interrupt handlers.
Lock-Free SPSC Implementation:
#include "ring_buffer.h"#include <string.h>#include <stdlib.h> int ring_buffer_init(struct ring_buffer *rb, size_t capacity, size_t elem_size) { /* Capacity must be power of 2 */ if (capacity == 0 || (capacity & (capacity - 1)) != 0) return -EINVAL; rb->buffer = aligned_alloc(CACHE_LINE_SIZE, capacity * elem_size); if (!rb->buffer) return -ENOMEM; atomic_init(&rb->head, 0); atomic_init(&rb->tail, 0); rb->cached_head = 0; rb->cached_tail = 0; rb->mask = capacity - 1; /* For fast modulo */ rb->element_size = elem_size; return 0;} bool ring_buffer_enqueue(struct ring_buffer *rb, const void *elem) { size_t head = atomic_load_explicit(&rb->head, memory_order_relaxed); size_t next_head = (head + 1) & rb->mask; /* Check if buffer is full using cached tail */ if (next_head == rb->cached_tail) { /* Cache might be stale - read actual tail */ rb->cached_tail = atomic_load_explicit(&rb->tail, memory_order_acquire); if (next_head == rb->cached_tail) return false; /* Buffer truly full */ } /* Copy element into buffer */ char *slot = (char *)rb->buffer + (head * rb->element_size); memcpy(slot, elem, rb->element_size); /* Publish the new head with release semantics */ /* This ensures the data write is visible before head update */ atomic_store_explicit(&rb->head, next_head, memory_order_release); return true;} bool ring_buffer_dequeue(struct ring_buffer *rb, void *elem) { size_t tail = atomic_load_explicit(&rb->tail, memory_order_relaxed); /* Check if buffer is empty using cached head */ if (tail == rb->cached_head) { /* Cache might be stale - read actual head */ rb->cached_head = atomic_load_explicit(&rb->head, memory_order_acquire); if (tail == rb->cached_head) return false; /* Buffer truly empty */ } /* Copy element from buffer */ char *slot = (char *)rb->buffer + (tail * rb->element_size); memcpy(elem, slot, rb->element_size); /* Publish the new tail with release semantics */ size_t next_tail = (tail + 1) & rb->mask; atomic_store_explicit(&rb->tail, next_tail, memory_order_release); return true;} size_t ring_buffer_count(const struct ring_buffer *rb) { size_t head = atomic_load_explicit(&rb->head, memory_order_acquire); size_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire); return (head - tail) & rb->mask;}Memory Ordering Explained:
The code uses C11 atomics with specific memory orderings:
- memory_order_relaxed: No synchronization or ordering constraints. Used for reading our own index.
- memory_order_acquire: Ensures all memory writes made before the corresponding release are visible. Used when reading the other party's index.
- memory_order_release: Ensures all prior memory writes are visible to threads that acquire the same atomic. Used when publishing index updates.
This creates a release-acquire relationship: when the consumer sees a new head value, all data written before that head update is guaranteed visible.
The code is lock-free because neither thread ever blocks waiting for the other. If the buffer is full, enqueue returns false immediately. If empty, dequeue returns false. There are no mutexes, spinlocks, or sleep operations. Progress is always possible for at least one thread. This makes the implementation suitable for interrupt handlers and real-time contexts.
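To make the SPSC contract concrete, here is a minimal usage sketch built on the API declared above. The pthread test harness (item count, spin-waiting on full/empty) is illustrative only, assuming the implementation above is compiled alongside it against the ring_buffer.h header:

```c
#include <pthread.h>
#include <stdio.h>
#include "ring_buffer.h"   /* The SPSC API declared above */

#define ITEMS 1000

static struct ring_buffer rb;

static void *producer(void *arg) {
    (void)arg;
    for (int i = 0; i < ITEMS; i++) {
        /* Buffer full: spin; a real system would back off, sleep, or drop */
        while (!ring_buffer_enqueue(&rb, &i))
            ;
    }
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    int value, received = 0;
    long long sum = 0;
    while (received < ITEMS) {
        if (ring_buffer_dequeue(&rb, &value)) {   /* Buffer empty: just retry */
            sum += value;
            received++;
        }
    }
    printf("received %d items, sum %lld\n", received, sum);
    return NULL;
}

int main(void) {
    if (ring_buffer_init(&rb, 1024, sizeof(int)) != 0)
        return 1;

    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

One producer thread and one consumer thread are the entire contract: adding a second producer or consumer would require the MPMC techniques discussed next.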
The SPSC ring buffer is elegant but limited to one producer and one consumer. Many scenarios—like network packet processing with multiple worker threads—require Multi-Producer Multi-Consumer (MPMC) ring buffers.
MPMC rings are significantly more complex because multiple threads may attempt to claim the same slot simultaneously.
Common MPMC Approaches:
| Strategy | Mechanism | Pros | Cons |
|---|---|---|---|
| Mutex-based | Global lock for all operations | Simple, correct | Poor scalability, contention |
| CAS-loop | Compare-and-swap on indices | Lock-free, scalable | Complex, ABA problem risk |
| Per-slot state | State machine per buffer slot | True lock-free | Memory overhead, complex |
| Ticket-based | Sequence numbers per slot | Wait-free producers | Complex consumer side |
CAS-Loop Implementation (Simplified):
The most common MPMC approach uses compare-and-swap loops to atomically claim slots:
```c
/* MPMC Ring Buffer using CAS loops */
#include <stdatomic.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdbool.h>
#include <string.h>

#define CACHE_LINE_SIZE 64

struct mpmc_ring {
    alignas(CACHE_LINE_SIZE) atomic_size_t head;
    alignas(CACHE_LINE_SIZE) atomic_size_t tail;
    alignas(CACHE_LINE_SIZE) size_t mask;
    size_t elem_size;

    /* Per-slot sequence numbers for ABA prevention */
    atomic_size_t *sequences;
    char *buffer;
};

bool mpmc_enqueue(struct mpmc_ring *rb, const void *elem) {
    size_t head, next_head, seq;

    for (;;) {
        head = atomic_load_explicit(&rb->head, memory_order_relaxed);
        size_t idx = head & rb->mask;
        seq = atomic_load_explicit(&rb->sequences[idx], memory_order_acquire);
        ptrdiff_t diff = (ptrdiff_t)(seq - head);

        if (diff == 0) {
            /* Slot is available for writing */
            next_head = head + 1;
            if (atomic_compare_exchange_weak_explicit(
                    &rb->head, &head, next_head,
                    memory_order_relaxed, memory_order_relaxed)) {
                /* Successfully claimed the slot */
                break;
            }
            /* CAS failed - another producer got it, retry */
        } else if (diff < 0) {
            /* Buffer is full */
            return false;
        }
        /* diff > 0: slot not ready yet (rare), spin and retry */
    }

    /* Write data to claimed slot */
    size_t idx = head & rb->mask;
    memcpy(rb->buffer + (idx * rb->elem_size), elem, rb->elem_size);

    /* Mark slot as ready for consumption */
    atomic_store_explicit(&rb->sequences[idx], head + 1, memory_order_release);
    return true;
}

bool mpmc_dequeue(struct mpmc_ring *rb, void *elem) {
    size_t tail, next_tail, seq;

    for (;;) {
        tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
        size_t idx = tail & rb->mask;
        seq = atomic_load_explicit(&rb->sequences[idx], memory_order_acquire);
        ptrdiff_t diff = (ptrdiff_t)(seq - (tail + 1));

        if (diff == 0) {
            /* Slot is ready for reading */
            next_tail = tail + 1;
            if (atomic_compare_exchange_weak_explicit(
                    &rb->tail, &tail, next_tail,
                    memory_order_relaxed, memory_order_relaxed)) {
                break;
            }
        } else if (diff < 0) {
            /* Buffer is empty */
            return false;
        }
    }

    /* Read data from slot */
    size_t idx = tail & rb->mask;
    memcpy(elem, rb->buffer + (idx * rb->elem_size), rb->elem_size);

    /* Mark slot as available for producers */
    atomic_store_explicit(&rb->sequences[idx], tail + rb->mask + 1, memory_order_release);
    return true;
}
```

The sequence numbers prevent the ABA problem: a situation where a slot appears unchanged (same index) but has actually been modified and recycled. Without sequence numbers, a producer might claim a slot that a consumer is still reading. The sequence acts as a version number, ensuring each cycle of the buffer creates distinct states.
The primary advantage of circular buffers over double buffers is burst absorption: the ability to temporarily accommodate producer rates exceeding consumer rates.
Bursty Traffic Model:
Real-world I/O rarely arrives at constant rates. Consider network packets: they arrive at some long-run average rate λ, but cluster into short bursts well above that average, while the consumer drains the buffer at rate μ.
For the system to be stable long-term, we need μ > λ (consume faster than average arrival). But during bursts, packets arrive faster than consumption. The buffer must absorb this excess.
Required Buffer Size:
To absorb a burst of B items arriving over a given burst duration while the consumer drains at rate μ:
$$\text{Buffer Size} \geq B - \mu \times \text{Burst Duration}$$
Simplified, if the burst arrives nearly instantaneously:
$$\text{Buffer Size} \geq B$$
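As an illustrative calculation with assumed numbers: if a burst of B = 1000 packets arrives over 1 ms while the consumer drains μ = 400 packets per millisecond, then

$$\text{Buffer Size} \geq 1000 - 400 \times 1 = 600 \text{ slots}$$

so a 1024-slot ring (the next power of two) would absorb the burst with room to spare.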
Queueing Theory Perspective:
For Poisson arrivals (M/M/1 queue model), the probability of the buffer being full relates to:
$$P(full) = \rho^N$$
where ρ = λ/μ (utilization) and N is buffer size.
To achieve less than 0.1% packet loss with ρ = 0.9:
$$N \geq \frac{\log(0.001)}{\log(0.9)} \approx 65.6, \text{ so } N = 66 \text{ slots}$$
| Buffer Size N | P(full) at ρ = 0.8 | P(full) at ρ = 0.9 | P(full) at ρ = 0.95 | P(full) at ρ = 0.99 |
|---|---|---|---|---|
| 8 | 16.8% | 43.0% | 66.3% | 92.3% |
| 16 | 2.8% | 18.5% | 44.0% | 85.1% |
| 32 | 0.08% | 3.4% | 19.4% | 72.5% |
| 64 | ~0% | 0.12% | 3.8% | 52.5% |
| 128 | ~0% | ~0% | 0.14% | 27.6% |
| 256 | ~0% | ~0% | ~0% | 7.6% |
For low-loss requirements, the buffer should hold at least 2-4× the expected burst size. At very high utilization (ρ > 0.95), exponentially more buffering is needed. In practice, power-of-two sizes between 256 and 4096 entries are common in network and audio drivers.
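The table values can be reproduced with a few lines of arithmetic. Here is a minimal sketch using the P(full) = ρ^N formula above; the sample parameters are illustrative:

```c
#include <stdio.h>
#include <math.h>

/* Smallest N such that rho^N <= target loss probability */
static unsigned min_buffer_size(double rho, double target_loss) {
    return (unsigned)ceil(log(target_loss) / log(rho));
}

int main(void) {
    double rho = 0.9;          /* 90% utilization */
    double target = 0.001;     /* 0.1% acceptable loss */

    printf("P(full) with 64 slots at rho=%.2f: %.3f%%\n", rho, pow(rho, 64) * 100.0);
    printf("Slots needed for <= %.1f%% loss: %u\n", target * 100.0, min_buffer_size(rho, target));
    return 0;
}
```

Compiled with -lm, this prints roughly 0.118% for 64 slots and 66 required slots, matching the table and the calculation above.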
Circular buffers are ubiquitous in systems software. Let's examine several critical applications:
1. Network Packet Buffers (sk_buff ring in Linux):
```c
/* Linux-style network receive ring (simplified) */
struct napi_ring {
    struct sk_buff **ring;       /* Array of socket buffer pointers */
    unsigned int size;           /* Power of 2 */
    unsigned int mask;           /* size - 1 */

    /* Hardware writes here */
    unsigned int head;           /* Next slot for NIC to fill */

    /* Driver reads here */
    unsigned int tail;           /* Next slot for driver to process */

    /* DMA descriptors for hardware */
    struct rx_descriptor *hw_ring;
    dma_addr_t hw_ring_dma;
    void __iomem *hw_tail_register;  /* MMIO register: tells the NIC how far we've consumed */
};

/* Called by NAPI poll to process received packets */
int napi_poll_ring(struct napi_ring *ring, int budget) {
    int processed = 0;

    /* Process up to 'budget' packets (NAPI budget limit) */
    while (processed < budget && ring->tail != ring->head) {
        unsigned int idx = ring->tail & ring->mask;
        struct sk_buff *skb = ring->ring[idx];

        /* Process packet through network stack */
        netif_receive_skb(skb);

        /* Allocate new buffer for this slot */
        ring->ring[idx] = alloc_skb(RX_BUFFER_SIZE, GFP_ATOMIC);
        if (ring->ring[idx]) {
            /* Update hardware descriptor with new buffer DMA address */
            ring->hw_ring[idx].buffer_addr =
                dma_map_single(..., ring->ring[idx]->data, RX_BUFFER_SIZE, ...);
        }

        ring->tail++;
        processed++;
    }

    /* Tell hardware how many buffers we've freed up */
    writel(ring->tail, ring->hw_tail_register);
    return processed;
}
```

2. Audio Playback Buffers:
Audio requires extremely consistent timing. Circular buffers absorb jitter between the application writing audio samples and the sound card playing them:
```c
/* Audio playback ring buffer */
struct audio_ring {
    int16_t *samples;          /* Audio sample buffer */
    size_t frame_count;        /* Total frames in buffer */
    size_t frame_size;         /* Bytes per frame (e.g., 4 for stereo 16-bit) */

    atomic_size_t write_pos;   /* Application write position (free-running frame counter) */
    atomic_size_t read_pos;    /* Hardware read position (free-running frame counter) */

    /* Timing statistics */
    unsigned int underruns;    /* Times hardware caught up to writer */
    unsigned int overruns;     /* Times writer caught up to hardware */
};

/* Application writes audio data */
ssize_t audio_ring_write(struct audio_ring *ring, const int16_t *data, size_t frames) {
    size_t write = atomic_load(&ring->write_pos);
    size_t read = atomic_load(&ring->read_pos);
    size_t available = ring->frame_count - ((write - read) % ring->frame_count);

    if (available <= 1) {           /* Only the reserved empty frame remains */
        ring->overruns++;
        return -EAGAIN;             /* Buffer full */
    }

    size_t to_write = min(frames, available - 1);  /* Keep one frame empty */
    size_t idx = write % ring->frame_count;
    size_t first_chunk = min(to_write, ring->frame_count - idx);

    /* Copy, handling wrap-around */
    memcpy(ring->samples + (idx * ring->frame_size / sizeof(int16_t)),
           data, first_chunk * ring->frame_size);
    if (to_write > first_chunk) {
        memcpy(ring->samples,
               data + (first_chunk * ring->frame_size / sizeof(int16_t)),
               (to_write - first_chunk) * ring->frame_size);
    }

    atomic_store(&ring->write_pos, write + to_write);
    return to_write;
}

/* Hardware interrupt: DMA completed, advance read position */
void audio_dma_complete_irq(struct audio_ring *ring, size_t frames_played) {
    size_t read = atomic_load(&ring->read_pos);
    size_t write = atomic_load(&ring->write_pos);

    if ((write - read) < frames_played) {
        /* Underrun: hardware played silence or garbage */
        ring->underruns++;
    }

    atomic_store(&ring->read_pos, read + frames_played);
}
```

Larger audio buffers reduce underrun risk but increase latency (the delay between the application producing audio and the listener hearing it). Latency-sensitive applications (DAWs, games) use smaller buffers (128-512 frames) for low latency, accepting higher CPU overhead. Consumer playback applications use larger buffers (2048+ frames) for reliability.
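The latency trade-off follows directly from the buffer size: at sample rate f_s, a buffer of F frames adds roughly F / f_s seconds of delay. With the illustrative sizes above at 48 kHz:

$$\text{latency} \approx \frac{F}{f_s}, \qquad \frac{512}{48000} \approx 10.7 \text{ ms}, \qquad \frac{2048}{48000} \approx 42.7 \text{ ms}$$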
Operating system kernels rely heavily on ring buffers for various subsystems. Let's examine key examples:
1. Linux Kernel Ring Buffer (dmesg):
The kernel message log is stored in a ring buffer, accessible via dmesg. As new messages arrive, old ones are overwritten:
```c
/* Simplified printk ring buffer (Linux kernel) */
struct printk_ringbuffer {
    atomic_long_t head;   /* Next write sequence number */
    atomic_long_t tail;   /* Oldest valid sequence number */

    struct {
        char *data;
        unsigned long size;
    } text_data_ring;

    struct prb_desc_ring {
        struct prb_desc *descs;   /* Message descriptors */
        unsigned long count;      /* Power of 2 */
    } desc_ring;
};

/*
 * Each message has a descriptor with:
 * - Sequence number (monotonically increasing)
 * - Offset into text_data_ring
 * - Length
 * - Timestamp
 * - Log level
 */

int printk_emit(int level, const char *fmt, ...) {
    /* ... format message ... */

    /* Allocate descriptor slot */
    unsigned long seq = atomic_long_fetch_add(&rb->head, 1);
    unsigned long idx = seq % rb->desc_ring.count;

    /* If wrapping, advance tail to discard oldest */
    if (seq >= rb->desc_ring.count) {
        unsigned long old_tail = seq - rb->desc_ring.count;
        atomic_long_cmpxchg(&rb->tail, old_tail, old_tail + 1);
    }

    /* ... copy message to text ring, fill descriptor ... */

    /* Wake up any readers (syslog daemon, dmesg) */
    wake_up_interruptible(&log_wait);

    return len;
}
```

2. io_uring: Modern Linux Async I/O:
io_uring uses two ring buffers—one for submissions (process → kernel), one for completions (kernel → process)—enabling highly efficient asynchronous I/O with minimal syscall overhead:
```c
/* io_uring ring structures (simplified) */

/* Submission Queue Entry - process writes, kernel reads */
struct io_uring_sqe {
    __u8  opcode;     /* Operation: read, write, etc. */
    __u8  flags;
    __u16 ioprio;
    __s32 fd;         /* File descriptor */
    __u64 off;        /* Offset */
    __u64 addr;       /* Buffer address */
    __u32 len;        /* Length */
    __u64 user_data;  /* Returned in CQE for correlation */
    /* ... more fields ... */
};

/* Completion Queue Entry - kernel writes, process reads */
struct io_uring_cqe {
    __u64 user_data;  /* Copied from SQE for correlation */
    __s32 res;        /* Result (bytes transferred or -errno) */
    __u32 flags;
};

/* Ring buffer headers (shared memory between kernel and process) */
struct io_sq_ring {
    atomic_t head;        /* Kernel reads from SQ[head] */
    atomic_t tail;        /* Process writes to SQ[tail] */
    __u32 ring_mask;      /* Mask for index wrapping */
    __u32 ring_entries;   /* Number of entries */
    __u32 array[];        /* Indirection: indexes into sqes[] */
};

struct io_cq_ring {
    atomic_t head;        /* Process reads from CQ[head] */
    atomic_t tail;        /* Kernel writes to CQ[tail] */
    __u32 ring_mask;
    __u32 ring_entries;
    struct io_uring_cqe cqes[];  /* Actual completion entries */
};
```

io_uring achieves remarkable performance by eliminating most syscalls. The rings are memory-mapped into user space, so the application writes SQEs directly without syscalls. The kernel polls the submission ring (or is notified via a single syscall covering many operations). This batching and shared memory approach can achieve over 1 million IOPS on modern hardware.
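As a usage-level illustration, here is a minimal sketch using the liburing helper library rather than the raw ring structures; error handling is omitted and the file path is arbitrary. This is an assumed, typical liburing setup, not code from the kernel sources above:

```c
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);          /* SQ/CQ rings mapped into the process */

    int fd = open("/etc/hostname", O_RDONLY);  /* Arbitrary example file */
    char buf[256];

    /* Fill an SQE in the shared submission ring - no syscall yet */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    sqe->user_data = 42;                       /* Echoed back in the matching CQE */

    io_uring_submit(&ring);                    /* One syscall covers all queued SQEs */

    /* Wait for the kernel to post a completion into the CQ ring */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("user_data=%llu res=%d\n", (unsigned long long)cqe->user_data, cqe->res);
    io_uring_cqe_seen(&ring, cqe);             /* Advance the CQ head past this entry */

    io_uring_queue_exit(&ring);
    return 0;
}
```

Linked with -luring, the only system calls on the I/O path are the submit and (if needed) the completion wait; the SQE and CQE themselves travel through the shared rings.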
Circular buffers extend the double buffering concept to handle real-world complexities: bursty traffic, variable processing rates, and multi-threaded access. Let's consolidate the key insights:

- A fixed-size array plus head and tail indices, with modulo (or power-of-two mask) arithmetic, gives constant-time enqueue and dequeue with no allocation on the hot path.
- Keeping one slot empty cleanly distinguishes a full buffer from an empty one without a separate count.
- SPSC rings can be made fully lock-free with acquire/release atomics, cache-line separation of producer and consumer state, and cached copies of the other side's index.
- MPMC rings need stronger machinery, typically CAS loops with per-slot sequence numbers to prevent the ABA problem.
- Capacity should be sized from burst size and utilization; as ρ approaches 1, achieving a given loss probability requires exponentially more slots.
What's Next:
We've covered the mechanics of individual buffers and buffer pools. But who allocates these buffers? How are they sized, recycled, and protected? The next page explores buffer management: the policies and mechanisms that govern buffer lifecycle in operating systems.
You now understand circular buffer mechanics, lock-free SPSC implementation, MPMC challenges, burst absorption mathematics, and real-world applications from network drivers to io_uring. These concepts are fundamental to high-performance systems programming.