A DMA transfer might seem instantaneous from a programmer's perspective—you program registers, start the transfer, and eventually receive a completion interrupt. But between setup and completion lies an intricate choreography of hardware signals, bus arbitration, address generation, and data movement.
Understanding this process reveals why DMA achieves such remarkable efficiency and helps diagnose the subtle bugs that emerge when timing assumptions are violated.
This section dissects the DMA transfer process phase by phase, from the moment the CPU writes the 'start' register to the final completion interrupt. We'll examine bus arbitration, handshaking protocols, data paths, and the synchronization mechanisms that keep everything coherent.
By the end of this page, you will understand every phase of DMA transfer execution: bus arbitration and handshaking protocols, how data flows through the system during DMA, and the synchronization and completion mechanisms that keep hardware and software coherent. This knowledge is essential for debugging DMA-related issues and optimizing transfer performance.
A complete DMA transfer progresses through distinct phases, each involving specific hardware actions and timing requirements:
| Phase | Actor | Actions | Duration |
|---|---|---|---|
| 1. Setup | CPU | Program registers, configure transfer parameters | ~100-500 CPU cycles |
| 2. Initiation | CPU/Device | Trigger transfer start (doorbell or device request) | ~10-50 cycles |
| 3. Arbitration | DMA Controller | Request bus access, wait for grant | ~10-100 cycles (variable) |
| 4. Transfer | DMA Controller | Execute memory read/write transactions | Depends on data size |
| 5. Completion | DMA Controller | Signal done, update status, generate interrupt | ~10-100 cycles |
| 6. Processing | CPU | Handle interrupt, process data, free resources | ~1000+ cycles (software) |
Before any data can move, the DMA controller must gain control of the system bus. This requires a handshake protocol between the device, DMA controller, and bus arbiter.
The classic handshake involves four critical signals: DRQ (the device's data request), BREQ (the DMA controller's bus request), BGNT (the arbiter's bus grant), and DACK (the DMA controller's acknowledgment back to the device).
Let's trace through the exact signal sequence for a DMA read (device → memory):
```
DMA Read Timing (Device → Memory)
=================================

Time →        T0    T1    T2    T3    T4    T5    T6    T7    T8
              │     │     │     │     │     │     │     │     │
CPU accessing ╔═════╗                                   ╔═════════
memory        ║BUSY ║     (CPU idle/cache)              ║RESUMED

Device DRQ    ──────╔═════════════════════════════╗─────────────
(Data Ready)        ║          ASSERTED           ║

DMA BREQ      ──────────╔═════════════════════════════╗─────────
(Bus Request)           ║          ASSERTED           ║

Bus BGNT      ────────────────╔═══════════════════╗─────────────
(Bus Grant)                   ║      GRANTED      ║

DMA DACK      ────────────────╔═══════════════════╗─────────────
(DMA Ack)                     ║      ACTIVE       ║

Memory Write  ────────────────────╔═══════════════╗─────────────
                                  ║ DATA WRITTEN  ║

IRQ           ────────────────────────────────────╔═════════════
(Completion)                                      ║  INTERRUPT

Timeline:
T0-T1: CPU using bus for its own memory access
T1:    Device asserts DRQ - "I have data ready"
T2:    DMA controller sees DRQ, asserts BREQ - "I need the bus"
T3:    Arbiter waits for CPU to finish current transaction
T4:    Arbiter grants bus to DMA (BGNT asserted)
T4:    DMA controller asserts DACK - "Transferring for device"
T5-T7: DMA controller reads from device, writes to memory
       (Multiple bus cycles for multi-word transfers)
T7:    Transfer complete, DMA releases BREQ, DACK
T7:    Arbiter sees BREQ low, deasserts BGNT
T8:    DMA generates IRQ, device deasserts DRQ
       CPU returns to normal bus access
```

While the concepts remain identical, modern buses like PCIe replace explicit signal wires with message-based protocols. A PCIe device sends a 'memory read request' transaction packet instead of asserting DRQ. The root complex acts as arbiter through credit-based flow control. The timing constraints are encoded in protocol timeouts rather than electrical specifications.
When multiple bus masters (CPU, DMA controller, multiple DMA channels, other devices) need bus access simultaneously, the arbiter decides who wins. Understanding arbitration is crucial because it directly impacts DMA latency and throughput.
Modern systems typically use weighted round-robin or similar hybrid schemes:
Example WRR Configuration:
Priority 0 (Highest): CPU - weight 4, DMA Channel 0 - weight 2
Priority 1 (Medium): DMA Channels 1-3 - weight 1 each
Priority 2 (Lowest): Debug/diagnostic access - weight 1
Sequence: CPU→CPU→CPU→CPU→DMA0→DMA0→DMA1→DMA2→DMA3→(repeat)
```
// Simplified Round-Robin Arbiter (Hardware Description)
// This illustrates how bus arbitration might be implemented

module round_robin_arbiter #(
    parameter NUM_MASTERS = 4
)(
    input  wire                   clk,
    input  wire                   rst_n,
    input  wire [NUM_MASTERS-1:0] request,  // Bus requests from masters
    output reg  [NUM_MASTERS-1:0] grant,    // Grant to one master
    input  wire                   bus_free  // Current transaction complete
);

    reg [$clog2(NUM_MASTERS)-1:0] last_granted;
    reg [$clog2(NUM_MASTERS)-1:0] next_master;
    integer i;

    // Find next requesting master (round-robin from last granted)
    always @(*) begin
        next_master = last_granted;
        for (i = 0; i < NUM_MASTERS; i = i + 1) begin
            // Start looking from (last_granted + 1) modulo NUM_MASTERS
            automatic integer check = (last_granted + 1 + i) % NUM_MASTERS;
            if (request[check]) begin
                next_master = check;
                break;
            end
        end
    end

    // Grant logic
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            grant        <= 0;
            last_granted <= 0;
        end else if (bus_free && |request) begin
            // Bus is available and someone wants it
            grant        <= (1 << next_master);
            last_granted <= next_master;
        end else if (!request[last_granted]) begin
            // Current master released bus, clear grant
            grant <= 0;
        end
    end

endmodule

// The key insight: arbitration happens in a single clock cycle
// (much faster than actual data transfers), so arbitration overhead
// is minimal compared to transfer time for reasonable-sized transfers.
```

In heavily-loaded systems, arbitration delay can dominate DMA latency. If 4 DMA channels and the CPU are all actively requesting, each master waits for 4 other transactions on average. For NVMe SSDs with sub-10μs latency targets, bus arbitration design is critical. This is why PCIe uses point-to-point links rather than shared buses—no arbitration means no arbitration delay.
Once the DMA controller has bus access, it must physically move data. The path data takes depends on the system architecture, but understanding the general flow helps optimize performance.
```
Device-to-Memory DMA Transfer (Read) - Step by Step
===================================================

Step 1: DMA controller asserts address on address bus
        Address = Current destination (memory) address from register

Step 2: DMA controller asserts DACK to device
        Device sees DACK active → places data on data bus

Step 3: DMA controller asserts write control signal
        Memory controller sees: address + data + write_enable

Step 4: Memory controller writes data to RAM
        For modern DDR: involves activate, write, precharge commands

Step 5: Memory controller signals completion
        (ACK signal or bus protocol completion)

Step 6: DMA controller updates internal state
        - Increment destination address by transfer width
        - Decrement byte count
        - Check if transfer complete

Step 7: If more data: return to Step 1
        If complete: release bus, generate interrupt

Data Path Visualization:

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│   I/O    │    │   DMA    │    │  Memory  │    │   Main   │
│  Device  │───▶│Controller│───▶│Controller│───▶│  Memory  │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
      │               │               │               │
      │     Data      │    Address    │      DDR      │
      └───────────────┴───────────────┴───────────────┘
                         System Bus

Note: In modern systems (PCIe), there's no separate DMA controller chip.
The device itself issues memory write transactions directly through PCIe.
```

The reverse direction has a subtle complexity: the DMA controller must read from memory before it can write to the device.
In modern PCIe systems, the data path is fundamentally different but conceptually similar:
PCIe Memory Read (Device reading from system memory): the device transmits a Memory Read Request TLP; the root complex fetches the data from RAM and returns it in one or more Completion-with-Data TLPs.
PCIe Memory Write (Device writing to system memory): the device transmits a Memory Write TLP carrying the data. Writes are posted—no completion packet comes back—which makes them cheaper than reads.
Key difference: PCIe is packet-based. There's no 'bus' in the traditional sense—just point-to-point serial links carrying transaction packets.
The DMA controller must generate correct addresses for every transfer unit. This involves several modes and configurations that affect how addresses change during transfer.
| Mode | Behavior | Use Case |
|---|---|---|
| Increment | Address increases by transfer width after each unit | Contiguous memory buffer filling/emptying |
| Decrement | Address decreases by transfer width after each unit | Reverse-order transfers, rare |
| Fixed | Address stays constant throughout transfer | Device registers (single location accessed repeatedly) |
| Stride | Address increases by custom step (not just transfer width) | Multi-dimensional arrays, interleaved data |
```c
// DMA Address Generation Logic (Conceptual)

#include <stdint.h>

enum addr_mode { MODE_FIXED, MODE_INCREMENT, MODE_DECREMENT, MODE_STRIDE };

struct dma_channel_state {
    uint64_t source_addr;
    uint64_t dest_addr;
    uint32_t bytes_remaining;
    uint32_t transfer_width;     // 1, 2, 4, or 8 bytes
    // Address modes
    enum addr_mode source_mode;
    enum addr_mode dest_mode;
    // For stride mode
    uint32_t source_stride;
    uint32_t dest_stride;
};

// Called after each unit transferred
void update_addresses(struct dma_channel_state *ch) {
    int64_t src_step = 0, dst_step = 0;  // signed, so decrement works

    // Calculate source address change
    switch (ch->source_mode) {
    case MODE_FIXED:     src_step = 0;                              break;
    case MODE_INCREMENT: src_step = ch->transfer_width;             break;
    case MODE_DECREMENT: src_step = -(int64_t)ch->transfer_width;   break;
    case MODE_STRIDE:    src_step = ch->source_stride;              break;
    }

    // Calculate destination address change
    switch (ch->dest_mode) {
    case MODE_FIXED:     dst_step = 0;                              break;
    case MODE_INCREMENT: dst_step = ch->transfer_width;             break;
    case MODE_DECREMENT: dst_step = -(int64_t)ch->transfer_width;   break;
    case MODE_STRIDE:    dst_step = ch->dest_stride;                break;
    }

    // Update state
    ch->source_addr     += src_step;
    ch->dest_addr       += dst_step;
    ch->bytes_remaining -= ch->transfer_width;
}

// Example: Reading from a device's FIFO to memory buffer
// Source: Fixed (device FIFO at 0x10000000)
// Dest:   Increment (buffer starting at 0x80000000)
// Width:  4 bytes
//
// Transfer sequence:
//   Cycle 1: Read 0x10000000 → Write 0x80000000
//   Cycle 2: Read 0x10000000 → Write 0x80000004 (dest incremented)
//   Cycle 3: Read 0x10000000 → Write 0x80000008
//   ...and so on

// Example: 2D DMA for image processing
// Source stride mode with width=320, height=240
// After each row (320 bytes), advance to next row (stride=1024)
// This handles images with row padding in memory
```

The DMA controller maintains a count of bytes (or transfer units) remaining. This counter serves multiple purposes: it determines when the transfer is complete, lets software query how far a transfer has progressed, and drives the terminal-count event that triggers the completion interrupt.
Multimedia hardware often uses 2D DMA with separate row length, row stride, and row count. This efficiently handles video frames where useful pixels (e.g., 1920 pixels × 3 bytes = 5760 bytes per row) are padded to power-of-2 alignment (8192 bytes per row in memory). 2D DMA avoids copying pixel data to strip padding.
Individual byte transfers are inefficient—each requires bus overhead. Burst transfers amortize this overhead by sending multiple data units in a single bus transaction.
A burst transaction works like this: the bus master sends a single address phase, then consecutive data units move in back-to-back cycles to or from sequential addresses, with the transaction acknowledged once at the end rather than per word.
The efficiency gain is dramatic:
| Transaction Type | Address Phases | Data Phases | Overhead Ratio |
|---|---|---|---|
| Single transfer | 1 | 1 | 50% |
| 4-word burst | 1 | 4 | 20% |
| 8-word burst | 1 | 8 | 11% |
| 16-word burst | 1 | 16 | 6% |
| 64-word burst | 1 | 64 | 1.5% |
```
Burst vs Non-Burst Transfer Comparison
======================================

Non-Burst (Single Transfer Mode):
---------------------------------
Cycle 1: [ADDR 0x1000] [WAIT] [DATA0] [ACK]
Cycle 2: [ADDR 0x1004] [WAIT] [DATA1] [ACK]
Cycle 3: [ADDR 0x1008] [WAIT] [DATA2] [ACK]
Cycle 4: [ADDR 0x100C] [WAIT] [DATA3] [ACK]

Total cycles: 4 address + 4 wait + 4 data = 12 cycles for 4 words

Burst Transfer (4-word burst):
------------------------------
Cycle 1: [ADDR 0x1000] [WAIT]
Cycle 2: [DATA0] [DATA1] [DATA2] [DATA3] [ACK]

Total cycles: 1 address + 1 wait + 4 data = 6 cycles for 4 words
Efficiency gain: 2x

Real-World Example: DDR4 Memory
-------------------------------
DDR4 operates in burst mode internally:
- Minimum burst length: 8 (BL8 mode)
- Each access returns 8 sequential 64-bit words = 64 bytes
- This matches typical cache line size

A single DDR4 read request returns a full cache line (64 bytes)
at once, making burst natural for cache line refills and DMA.

PCIe Burst (Max Payload/Read Request):
--------------------------------------
- Configurable: 128, 256, 512, 1024, 2048, or 4096 bytes
- Set in PCIe configuration space
- Larger = better throughput, but more latency per transaction
- Typical sweet spot: 256-512 bytes for general workloads
```

Choosing optimal burst size involves tradeoffs:
Larger bursts: amortize address and arbitration overhead over more data, maximizing throughput—but they hold the bus longer, increasing worst-case latency for the CPU and other masters.

Smaller bursts: release the bus sooner, keeping latency low for everyone else—but pay the per-transaction overhead more often, reducing throughput.

Optimal choice depends on workload: bulk transfers (storage, networking) favor large bursts, while latency-sensitive traffic (audio, real-time control) favors small ones.
When a DMA transfer finishes, the system must properly synchronize between hardware completion and software consumption of the data, and several mechanisms cooperate to guarantee this.
Critical concern: When the DMA controller generates a completion interrupt, is the data actually visible to the CPU?
Due to memory hierarchy effects (caches, write buffers, reordering), the answer isn't automatically 'yes.' Systems must ensure completion ordering:
```c
// DMA Completion Ordering Problem and Solutions

// PROBLEM: This code may see stale data!
void bad_dma_handler(struct dma_device *dma) {
    uint32_t status = readl(dma->regs + STATUS);

    if (status & DMA_COMPLETE) {
        // BUG: Data might not be visible yet!
        // Interrupt was generated, but data may still be
        // in flight or in memory controller buffers
        process_data(dma->buffer);  // May see old data!
    }
}

// SOLUTION 1: Read from the transferred region (forces completion)
void good_dma_handler_v1(struct dma_device *dma) {
    uint32_t status = readl(dma->regs + STATUS);

    if (status & DMA_COMPLETE) {
        // Force completion by reading from target memory
        // This ensures prior DMA writes are visible
        rmb();  // Read barrier
        volatile uint32_t dummy = *(volatile uint32_t *)dma->buffer;
        (void)dummy;

        // Now safe to use data
        process_data(dma->buffer);
    }
}

// SOLUTION 2: Use DMA sync API (Linux)
void good_dma_handler_v2(struct dma_device *dma) {
    uint32_t status = readl(dma->regs + STATUS);

    if (status & DMA_COMPLETE) {
        // Linux DMA API provides proper synchronization
        // This handles cache invalidation and memory barriers
        dma_sync_single_for_cpu(dma->dev, dma->dma_addr,
                                dma->size, DMA_FROM_DEVICE);

        // Now safe to use data
        process_data(dma->buffer);
    }
}

// SOLUTION 3: Cache-coherent DMA (hardware solution)
// On systems with cache-coherent DMA, no software action needed.
// The hardware interconnect ensures DMA writes are visible to CPU
// caches before the interrupt is generated.
// Examples: Modern AMD/Intel systems with proper IOMMU configuration

// WHY THIS MATTERS:
// 1. Write buffers: DMA controller write may be in PCIe/memory buffers
// 2. Cache coherency: Must invalidate CPU cache of target region
// 3. Reordering: Memory controller may reorder writes
// 4. Interrupt delivery: IRQ may arrive before all data visible
//
// Failure to sync properly causes intermittent data corruption
// that's extremely difficult to debug (happens rarely, non-reproducible)
```

Modern high-performance devices (NVMe, high-speed NICs) use completion queues rather than simple interrupts:
Advantages: a single interrupt (or poll) can harvest many completions at once; interrupt rates can be coalesced under load; each queue entry carries per-request status and metadata; and latency-critical paths can poll the queue without taking interrupts at all.
Failure to properly synchronize after DMA completion is one of the most insidious bugs in device driver development. The system works 99.9% of the time, then randomly produces corrupted data. Since it's timing-dependent, it may never reproduce on developer machines but appear in production under load. Always use proper synchronization APIs.
We've traced the complete DMA transfer process from initiation to completion. This detailed understanding enables proper driver development and performance optimization.
What's Next:
The next section examines cycle stealing—a specific DMA mode that interleaves DMA and CPU bus access, balancing throughput with CPU responsiveness.
You now understand the complete DMA transfer lifecycle at a hardware level. This knowledge enables you to debug mysterious DMA issues, optimize transfer performance, and write drivers that correctly synchronize with hardware. The concepts translate directly to modern PCIe devices despite their packet-based architecture.