A DMA transfer might seem instantaneous from a programmer's perspective—you program registers, start the transfer, and eventually receive a completion interrupt. But between setup and completion lies an intricate choreography of hardware signals, bus arbitration, address generation, and data movement.
Understanding this process reveals why DMA achieves such remarkable efficiency and helps diagnose the subtle bugs that emerge when timing assumptions are violated.
This section dissects the DMA transfer process phase by phase, from the moment the CPU writes the 'start' register to the final completion interrupt. We'll examine bus arbitration, handshaking protocols, data paths, and the synchronization mechanisms that keep everything coherent.
By the end of this page, you will understand every phase of DMA transfer execution: bus arbitration and handshaking protocols, how data flows through the system during DMA, and the synchronization and completion mechanisms that keep hardware and software coherent. This knowledge is essential for debugging DMA-related issues and optimizing transfer performance.
A complete DMA transfer progresses through distinct phases, each involving specific hardware actions and timing requirements:
| Phase | Actor | Actions | Duration |
|---|---|---|---|
| 1. Setup | CPU | Program registers, configure transfer parameters | ~100-500 CPU cycles |
| 2. Initiation | CPU/Device | Trigger transfer start (doorbell or device request) | ~10-50 cycles |
| 3. Arbitration | DMA Controller | Request bus access, wait for grant | ~10-100 cycles (variable) |
| 4. Transfer | DMA Controller | Execute memory read/write transactions | Depends on data size |
| 5. Completion | DMA Controller | Signal done, update status, generate interrupt | ~10-100 cycles |
| 6. Processing | CPU | Handle interrupt, process data, free resources | ~1000+ cycles (software) |
Before any data can move, the DMA controller must gain control of the system bus. This requires a handshake protocol between the device, DMA controller, and bus arbiter.
The classic handshake involves four critical signals: DRQ (the device's data request), BREQ (the DMA controller's bus request), BGNT (the arbiter's bus grant), and DACK (the DMA controller's acknowledgment back to the device).
Let's trace through the exact signal sequence for a DMA read (device → memory):
```
DMA Read Timing (Device → Memory)
=================================

Time →        T0    T1    T2    T3    T4    T5    T6    T7    T8
              │     │     │     │     │     │     │     │     │
CPU accessing ╔═════╗                                   ╔═════════
memory        ║BUSY ║     (CPU idle/cache)              ║RESUMED

Device DRQ    ──────╔═════════════════════════════╗─────────────
(Data Ready)        ║          ASSERTED           ║

DMA BREQ      ──────────╔═════════════════════════════╗─────────
(Bus Request)           ║          ASSERTED           ║

Bus BGNT      ────────────────╔═══════════════════╗─────────────
(Bus Grant)                   ║      GRANTED      ║

DMA DACK      ────────────────╔═══════════════════╗─────────────
(DMA Ack)                     ║      ACTIVE       ║

Memory Write  ────────────────────╔═══════════════╗─────────────
                                  ║ DATA WRITTEN  ║

IRQ           ────────────────────────────────────╔═════════════
(Completion)                                      ║  INTERRUPT

Timeline:
T0-T1: CPU using bus for its own memory access
T1:    Device asserts DRQ - "I have data ready"
T2:    DMA controller sees DRQ, asserts BREQ - "I need the bus"
T3:    Arbiter waits for CPU to finish current transaction
T4:    Arbiter grants bus to DMA (BGNT asserted)
T4:    DMA controller asserts DACK - "Transferring for device"
T5-T7: DMA controller reads from device, writes to memory
       (Multiple bus cycles for multi-word transfers)
T7:    Transfer complete, DMA releases BREQ, DACK
T7:    Arbiter sees BREQ low, deasserts BGNT
T8:    DMA generates IRQ, device deasserts DRQ
       CPU returns to normal bus access
```

While the concepts remain identical, modern buses like PCIe replace explicit signal wires with message-based protocols. A PCIe device sends a 'memory read request' transaction packet instead of asserting DRQ. The root complex acts as arbiter through credit-based flow control. The timing constraints are encoded in protocol timeouts rather than electrical specifications.
When multiple bus masters (CPU, DMA controller, multiple DMA channels, other devices) need bus access simultaneously, the arbiter decides who wins. Understanding arbitration is crucial because it directly impacts DMA latency and throughput.
Modern systems typically use weighted round-robin or similar hybrid schemes:
Example WRR Configuration:
Priority 0 (Highest): CPU - weight 4, DMA Channel 0 - weight 2
Priority 1 (Medium): DMA Channels 1-3 - weight 1 each
Priority 2 (Lowest): Debug/diagnostic access - weight 1
Sequence: CPU→CPU→CPU→CPU→DMA0→DMA0→DMA1→DMA2→DMA3→(repeat)
```
// Simplified Round-Robin Arbiter (Hardware Description)
// This illustrates how bus arbitration might be implemented

module round_robin_arbiter #(
    parameter NUM_MASTERS = 4
)(
    input  wire                   clk,
    input  wire                   rst_n,
    input  wire [NUM_MASTERS-1:0] request,  // Bus requests from masters
    output reg  [NUM_MASTERS-1:0] grant,    // Grant to one master
    input  wire                   bus_free  // Current transaction complete
);

    reg [$clog2(NUM_MASTERS)-1:0] last_granted;
    reg [$clog2(NUM_MASTERS)-1:0] next_master;
    integer i;

    // Find next requesting master (round-robin from last granted)
    always @(*) begin
        next_master = last_granted;
        for (i = 0; i < NUM_MASTERS; i = i + 1) begin
            // Start looking from (last_granted + 1) modulo NUM_MASTERS
            automatic integer check = (last_granted + 1 + i) % NUM_MASTERS;
            if (request[check]) begin
                next_master = check;
                break;
            end
        end
    end

    // Grant logic
    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            grant        <= 0;
            last_granted <= 0;
        end else if (bus_free && |request) begin
            // Bus is available and someone wants it
            grant        <= (1 << next_master);
            last_granted <= next_master;
        end else if (!request[last_granted]) begin
            // Current master released bus, clear grant
            grant <= 0;
        end
    end

endmodule

// The key insight: arbitration happens in a single clock cycle
// (much faster than actual data transfers), so arbitration overhead
// is minimal compared to transfer time for reasonable-sized transfers.
```

In heavily-loaded systems, arbitration delay can dominate DMA latency. If 4 DMA channels and the CPU are all actively requesting, each master waits for 4 other transactions on average. For NVMe SSDs with sub-10μs latency targets, bus arbitration design is critical. This is why PCIe uses point-to-point links rather than shared buses—no arbitration means no arbitration delay.
Once the DMA controller has bus access, it must physically move data. The path data takes depends on the system architecture, but understanding the general flow helps optimize performance.
```
Device-to-Memory DMA Transfer (Read) - Step by Step
===================================================

Step 1: DMA controller asserts address on address bus
        Address = Current destination (memory) address from register

Step 2: DMA controller asserts DACK to device
        Device sees DACK active → places data on data bus

Step 3: DMA controller asserts write control signal
        Memory controller sees: address + data + write_enable

Step 4: Memory controller writes data to RAM
        For modern DDR: involves activate, write, precharge commands

Step 5: Memory controller signals completion
        (ACK signal or bus protocol completion)

Step 6: DMA controller updates internal state
        - Increment destination address by transfer width
        - Decrement byte count
        - Check if transfer complete

Step 7: If more data: return to Step 1
        If complete: release bus, generate interrupt

Data Path Visualization:

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│   I/O    │    │   DMA    │    │  Memory  │    │   Main   │
│  Device  │───▶│Controller│───▶│Controller│───▶│  Memory  │
└──────────┘    └──────────┘    └──────────┘    └──────────┘
      │               │               │               │
      │     Data      │    Address    │      DDR      │
      └───────────────┴───────────────┴───────────────┘
                         System Bus

Note: In modern systems (PCIe), there's no separate DMA controller chip.
The device itself issues memory write transactions directly through PCIe.
```

The reverse direction has a subtle complexity: the DMA controller must read from memory before it can write to the device.
In modern PCIe systems, the data path is fundamentally different but conceptually similar:
PCIe Memory Read (Device reading from system memory): the device transmits a Memory Read Request TLP; the root complex fetches the data from RAM and returns it in one or more Completion-with-Data TLPs.
PCIe Memory Write (Device writing to system memory): the device transmits a Memory Write TLP carrying the data. Writes are posted—no completion packet comes back—which makes them cheaper than reads.
Key difference: PCIe is packet-based. There's no 'bus' in the traditional sense—just point-to-point serial links carrying transaction packets.
The DMA controller must generate correct addresses for every transfer unit. This involves several modes and configurations that affect how addresses change during transfer.
| Mode | Behavior | Use Case |
|---|---|---|
| Increment | Address increases by transfer width after each unit | Contiguous memory buffer filling/emptying |
| Decrement | Address decreases by transfer width after each unit | Reverse-order transfers, rare |
| Fixed | Address stays constant throughout transfer | Device registers (single location accessed repeatedly) |
| Stride | Address increases by custom step (not just transfer width) | Multi-dimensional arrays, interleaved data |
```c
// DMA Address Generation Logic (Conceptual)

#include <stdint.h>

enum addr_mode { MODE_FIXED, MODE_INCREMENT, MODE_DECREMENT, MODE_STRIDE };

struct dma_channel_state {
    uint64_t source_addr;
    uint64_t dest_addr;
    uint32_t bytes_remaining;
    uint32_t transfer_width;     // 1, 2, 4, or 8 bytes
    // Address modes
    enum addr_mode source_mode;
    enum addr_mode dest_mode;
    // For stride mode
    uint32_t source_stride;
    uint32_t dest_stride;
};

// Called after each unit transferred
void update_addresses(struct dma_channel_state *ch) {
    int64_t src_step = 0, dst_step = 0;  // signed, so decrement works

    // Calculate source address change
    switch (ch->source_mode) {
    case MODE_FIXED:     src_step = 0;                              break;
    case MODE_INCREMENT: src_step = ch->transfer_width;             break;
    case MODE_DECREMENT: src_step = -(int64_t)ch->transfer_width;   break;
    case MODE_STRIDE:    src_step = ch->source_stride;              break;
    }

    // Calculate destination address change
    switch (ch->dest_mode) {
    case MODE_FIXED:     dst_step = 0;                              break;
    case MODE_INCREMENT: dst_step = ch->transfer_width;             break;
    case MODE_DECREMENT: dst_step = -(int64_t)ch->transfer_width;   break;
    case MODE_STRIDE:    dst_step = ch->dest_stride;                break;
    }

    // Update state
    ch->source_addr     += src_step;
    ch->dest_addr       += dst_step;
    ch->bytes_remaining -= ch->transfer_width;
}

// Example: Reading from a device's FIFO to memory buffer
// Source: Fixed (device FIFO at 0x10000000)
// Dest:   Increment (buffer starting at 0x80000000)
// Width:  4 bytes
//
// Transfer sequence:
//   Cycle 1: Read 0x10000000 → Write 0x80000000
//   Cycle 2: Read 0x10000000 → Write 0x80000004 (dest incremented)
//   Cycle 3: Read 0x10000000 → Write 0x80000008
//   ...and so on

// Example: 2D DMA for image processing
// Source stride mode with width=320, height=240
// After each row (320 bytes), advance to next row (stride=1024)
// This handles images with row padding in memory
```

The DMA controller maintains a count of bytes (or transfer units) remaining. This counter serves multiple purposes: it determines when the transfer is complete, lets software query how far a transfer has progressed, and drives the terminal-count event that triggers the completion interrupt.
Multimedia hardware often uses 2D DMA with separate row length, row stride, and row count. This efficiently handles video frames where useful pixels (e.g., 1920 pixels × 3 bytes = 5760 bytes per row) are padded to power-of-2 alignment (8192 bytes per row in memory). 2D DMA avoids copying pixel data to strip padding.
Individual byte transfers are inefficient—each requires bus overhead. Burst transfers amortize this overhead by sending multiple data units in a single bus transaction.
A burst transaction works like this: the bus master sends a single address phase, then consecutive data units move in back-to-back cycles to or from sequential addresses, with the transaction acknowledged once at the end rather than per word.
The efficiency gain is dramatic:
| Transaction Type | Address Phases | Data Phases | Overhead Ratio |
|---|---|---|---|
| Single transfer | 1 | 1 | 50% |
| 4-word burst | 1 | 4 | 20% |
| 8-word burst | 1 | 8 | 11% |
| 16-word burst | 1 | 16 | 6% |
| 64-word burst | 1 | 64 | 1.5% |
```
Burst vs Non-Burst Transfer Comparison
======================================

Non-Burst (Single Transfer Mode):
---------------------------------
Cycle 1: [ADDR 0x1000] [WAIT] [DATA0] [ACK]
Cycle 2: [ADDR 0x1004] [WAIT] [DATA1] [ACK]
Cycle 3: [ADDR 0x1008] [WAIT] [DATA2] [ACK]
Cycle 4: [ADDR 0x100C] [WAIT] [DATA3] [ACK]

Total cycles: 4 address + 4 wait + 4 data = 12 cycles for 4 words

Burst Transfer (4-word burst):
------------------------------
Cycle 1: [ADDR 0x1000] [WAIT]
Cycle 2: [DATA0] [DATA1] [DATA2] [DATA3] [ACK]

Total cycles: 1 address + 1 wait + 4 data = 6 cycles for 4 words
Efficiency gain: 2x

Real-World Example: DDR4 Memory
-------------------------------
DDR4 operates in burst mode internally:
- Minimum burst length: 8 (BL8 mode)
- Each access returns 8 sequential 64-bit words = 64 bytes
- This matches typical cache line size

A single DDR4 read request returns a full cache line (64 bytes)
at once, making burst natural for cache line refills and DMA.

PCIe Burst (Max Payload/Read Request):
--------------------------------------
- Configurable: 128, 256, 512, 1024, 2048, or 4096 bytes
- Set in PCIe configuration space
- Larger = better throughput, but more latency per transaction
- Typical sweet spot: 256-512 bytes for general workloads
```

Choosing optimal burst size involves tradeoffs:
Larger bursts: amortize address and arbitration overhead over more data, maximizing throughput—but they hold the bus longer, increasing worst-case latency for the CPU and other masters.

Smaller bursts: release the bus sooner, keeping latency low for everyone else—but pay the per-transaction overhead more often, reducing throughput.

Optimal choice depends on workload: bulk transfers (storage, networking) favor large bursts, while latency-sensitive traffic (audio, real-time control) favors small ones.
When a DMA transfer finishes, the system must properly synchronize between hardware completion and software consumption of the data, and several mechanisms cooperate to guarantee this.
Critical concern: When the DMA controller generates a completion interrupt, is the data actually visible to the CPU?
Due to memory hierarchy effects (caches, write buffers, reordering), the answer isn't automatically 'yes.' Systems must ensure completion ordering:
```c
// DMA Completion Ordering Problem and Solutions

// PROBLEM: This code may see stale data!
void bad_dma_handler(struct dma_device *dma) {
    uint32_t status = readl(dma->regs + STATUS);

    if (status & DMA_COMPLETE) {
        // BUG: Data might not be visible yet!
        // Interrupt was generated, but data may still be
        // in flight or in memory controller buffers
        process_data(dma->buffer);  // May see old data!
    }
}

// SOLUTION 1: Read from the transferred region (forces completion)
void good_dma_handler_v1(struct dma_device *dma) {
    uint32_t status = readl(dma->regs + STATUS);

    if (status & DMA_COMPLETE) {
        // Force completion by reading from target memory
        // This ensures prior DMA writes are visible
        rmb();  // Read barrier
        volatile uint32_t dummy = *(volatile uint32_t *)dma->buffer;
        (void)dummy;

        // Now safe to use data
        process_data(dma->buffer);
    }
}

// SOLUTION 2: Use DMA sync API (Linux)
void good_dma_handler_v2(struct dma_device *dma) {
    uint32_t status = readl(dma->regs + STATUS);

    if (status & DMA_COMPLETE) {
        // Linux DMA API provides proper synchronization
        // This handles cache invalidation and memory barriers
        dma_sync_single_for_cpu(dma->dev, dma->dma_addr,
                                dma->size, DMA_FROM_DEVICE);

        // Now safe to use data
        process_data(dma->buffer);
    }
}

// SOLUTION 3: Cache-coherent DMA (hardware solution)
// On systems with cache-coherent DMA, no software action needed.
// The hardware interconnect ensures DMA writes are visible to CPU
// caches before the interrupt is generated.
// Examples: Modern AMD/Intel systems with proper IOMMU configuration

// WHY THIS MATTERS:
// 1. Write buffers: DMA controller write may be in PCIe/memory buffers
// 2. Cache coherency: Must invalidate CPU cache of target region
// 3. Reordering: Memory controller may reorder writes
// 4. Interrupt delivery: IRQ may arrive before all data visible
//
// Failure to sync properly causes intermittent data corruption
// that's extremely difficult to debug (happens rarely, non-reproducible)
```

Modern high-performance devices (NVMe, high-speed NICs) use completion queues rather than simple interrupts:
Advantages: a single interrupt (or poll) can harvest many completions at once; interrupt rates can be coalesced under load; each queue entry carries per-request status and metadata; and latency-critical paths can poll the queue without taking interrupts at all.
Failure to properly synchronize after DMA completion is one of the most insidious bugs in device driver development. The system works 99.9% of the time, then randomly produces corrupted data. Since it's timing-dependent, it may never reproduce on developer machines but appear in production under load. Always use proper synchronization APIs.
We've traced the complete DMA transfer process from initiation to completion. This detailed understanding enables proper driver development and performance optimization.
What's Next:
The next section examines cycle stealing—a specific DMA mode that interleaves DMA and CPU bus access, balancing throughput with CPU responsiveness.
You now understand the complete DMA transfer lifecycle at a hardware level. This knowledge enables you to debug mysterious DMA issues, optimize transfer performance, and write drivers that correctly synchronize with hardware. The concepts translate directly to modern PCIe devices despite their packet-based architecture.