In traditional DMA, a central controller mediates between devices and memory—devices request transfers, and the controller executes them. But what if we cut out the middleman entirely? What if each device could directly read from and write to system memory on its own authority?
This is bus mastering: peripheral devices that are full citizens of the memory bus, capable of initiating memory transactions independently.
Every modern high-performance device—your NVMe SSD, your graphics card, your network adapter—is a bus master. Understanding bus mastering is essential for anyone writing device drivers, designing hardware, or optimizing I/O performance in contemporary systems.
By the end of this page, you will be able to:
- Explain what distinguishes bus mastering from traditional DMA
- Describe the architecture of bus-mastering devices and how they integrate with modern interconnects
- Understand how operating systems enable and manage bus masters
- Appreciate the security implications of bus mastering and the protections built around it
The evolution from traditional DMA to bus mastering represents a fundamental architectural shift—from third-party DMA to first-party DMA.
In third-party DMA, transfers involve three entities: the CPU (which programs the transfer), the DMA controller (which executes it), and the peripheral device (which supplies or consumes the data).
The DMA controller is a 'third party'—an intermediary that reads from one location and writes to another. The actual device doesn't touch the bus directly.
In first-party DMA, the device itself becomes a bus master—it contains its own DMA engine and directly accesses memory:
There is no third party—the device handles its own memory transactions.
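The distinction can be caricatured in a few lines of C. This is a sketch with toy types of my own invention, not any real driver interface: in the third-party model a standalone controller moves the data; in the first-party model the device owns the copy engine.

```c
#include <stdint.h>
#include <string.h>

/* Third-party DMA: a standalone controller copies between two
 * addresses on behalf of a device that never touches the bus. */
struct dma_controller { int busy; };

static void controller_transfer(struct dma_controller *c,
                                uint8_t *dst, const uint8_t *src, size_t n)
{
    c->busy = 1;
    memcpy(dst, src, n);   /* the controller drives the bus */
    c->busy = 0;
}

/* First-party DMA: the device embeds its own engine and writes
 * its data into host memory directly -- no intermediary. */
struct bus_master_dev { uint8_t internal_buf[16]; };

static void device_dma_write(struct bus_master_dev *d,
                             uint8_t *host_mem, size_t n)
{
    memcpy(host_mem, d->internal_buf, n);  /* the device drives the bus */
}
```

In the first model the device must ask someone else to move its data; in the second, the `memcpy` lives inside the device itself, which is the whole point of bus mastering.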
A bus-mastering device contains all the hardware necessary to independently access system memory. Let's examine what makes a device a bus master:
```verilog
// Simplified Bus Master Device Architecture (Conceptual Verilog)
module bus_master_device (
    // Bus interface signals (simplified)
    input  wire        clk,
    input  wire        rst_n,

    // Memory bus interface
    output reg         bus_request,   // Request bus mastership
    input  wire        bus_grant,     // Bus access granted
    output reg  [63:0] bus_addr,      // 64-bit address
    output reg  [63:0] bus_wdata,     // Write data
    input  wire [63:0] bus_rdata,     // Read data
    output reg         bus_read,      // Read strobe
    output reg         bus_write,     // Write strobe
    input  wire        bus_ready,     // Transaction complete

    // DMA configuration registers (CPU access)
    input  wire [63:0] dma_src_addr,  // Source address
    input  wire [63:0] dma_dst_addr,  // Destination address
    input  wire [31:0] dma_length,    // Transfer length
    input  wire        dma_start,     // Start transfer
    output wire        dma_done,      // Transfer complete
    output wire        dma_error,     // Error occurred

    // Device-specific interface
    input  wire [63:0] device_data,   // Data from device core
    input  wire        device_data_valid,
    output reg         device_data_read
);

    // DMA state machine
    localparam IDLE       = 4'd0;
    localparam REQ_BUS    = 4'd1;
    localparam WAIT_GRANT = 4'd2;
    localparam SETUP_ADDR = 4'd3;
    localparam DO_WRITE   = 4'd4;
    localparam WAIT_READY = 4'd5;
    localparam CHECK_DONE = 4'd6;
    localparam DONE       = 4'd7;

    reg [3:0]  state;
    reg [63:0] current_addr;
    reg [31:0] bytes_remaining;
    reg [63:0] data_buffer;

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            state       <= IDLE;
            bus_request <= 0;
            bus_read    <= 0;
            bus_write   <= 0;
        end else begin
            case (state)
                IDLE: begin
                    if (dma_start) begin
                        current_addr    <= dma_dst_addr;
                        bytes_remaining <= dma_length;
                        state           <= REQ_BUS;
                    end
                end
                REQ_BUS: begin
                    // Wait for data from device
                    if (device_data_valid) begin
                        data_buffer      <= device_data;
                        device_data_read <= 1;
                        bus_request      <= 1;  // Request bus access
                        state            <= WAIT_GRANT;
                    end
                end
                WAIT_GRANT: begin
                    device_data_read <= 0;
                    if (bus_grant)
                        state <= SETUP_ADDR;
                end
                SETUP_ADDR: begin
                    bus_addr  <= current_addr;
                    bus_wdata <= data_buffer;
                    bus_write <= 1;
                    state     <= WAIT_READY;
                end
                WAIT_READY: begin
                    if (bus_ready) begin
                        bus_write       <= 0;
                        bus_request     <= 0;  // Release bus
                        current_addr    <= current_addr + 8;
                        bytes_remaining <= bytes_remaining - 8;
                        state           <= CHECK_DONE;
                    end
                end
                CHECK_DONE: begin
                    if (bytes_remaining == 0)
                        state <= DONE;
                    else
                        state <= REQ_BUS;  // Next transfer
                end
                DONE: begin
                    // Generate interrupt, wait for reset
                    state <= IDLE;
                end
            endcase
        end
    end

    assign dma_done  = (state == DONE);
    assign dma_error = 0;  // Simplified

endmodule
```

Production bus masters include sophisticated features not shown here: multi-queue support, descriptor prefetching, scatter-gather handling, multiple outstanding transactions, MSI-X interrupt generation, error recovery, and protocol-specific optimizations. A modern NVMe controller's DMA engine is a significant piece of silicon.
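Among those features, scatter-gather is the most fundamental: instead of one contiguous buffer, the engine walks a chain of descriptors, each pointing at a separate chunk of memory. A minimal user-space simulation of that walk (the descriptor layout and names are hypothetical, not any real device's format):

```c
#include <stdint.h>
#include <string.h>

/* One entry in a scatter-gather descriptor list: a chunk of the
 * transfer plus a link to the next descriptor (NULL terminates). */
struct sg_desc {
    uint8_t        *addr;   /* buffer for this chunk */
    uint32_t        len;    /* bytes in this chunk   */
    struct sg_desc *next;   /* next descriptor, or NULL */
};

/* Simulated DMA engine: walk the descriptor chain, copying each
 * chunk from a contiguous device-side source. Returns total bytes
 * moved, which is what a real engine would report on completion. */
static uint32_t sg_dma_run(const uint8_t *src, struct sg_desc *head)
{
    uint32_t done = 0;
    for (struct sg_desc *d = head; d != NULL; d = d->next) {
        memcpy(d->addr, src + done, d->len);
        done += d->len;
    }
    return done;
}
```

A real engine fetches these descriptors from host memory via its own DMA reads (descriptor prefetching), but the control flow is the same: follow the chain until it ends.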
PCI Express (PCIe) is the dominant interconnect for bus-mastering devices in modern systems. Understanding how PCIe implements bus mastering is essential knowledge.
PCIe is inherently designed for bus mastering: it is a packet-based, point-to-point interconnect in which any endpoint can initiate memory read and write transactions—there is no central DMA controller to go through.
In PCIe, every endpoint device is potentially a bus master—it just needs to be enabled.
| Transaction | Direction | Purpose | Completion Required |
|---|---|---|---|
| Memory Read | Device → Root Complex | Device reads from system RAM | Yes (data returned) |
| Memory Write | Device → Root Complex | Device writes to system RAM | No (posted) |
| Message | Device → Anywhere | Signaling (interrupts, etc.) | No (posted) |
| Config Read/Write | Root Complex → Device | Access device configuration | Yes (for reads) |
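Enablement comes down to bit 2 of the device's PCI Command Register—the Bus Master Enable (BME) bit defined by the PCI specification. A small sketch of decoding and toggling it (the bit values are from the spec; the helper names are mine):

```c
#include <stdint.h>

#define PCI_CMD_IO      0x0001  /* bit 0: I/O space enable    */
#define PCI_CMD_MEMORY  0x0002  /* bit 1: memory space enable */
#define PCI_CMD_MASTER  0x0004  /* bit 2: Bus Master Enable   */

/* Returns nonzero if the device is allowed to initiate memory
 * transactions (i.e. act as a bus master). */
static int bus_master_enabled(uint16_t pci_command)
{
    return (pci_command & PCI_CMD_MASTER) != 0;
}

/* Set or clear BME without disturbing the other command bits,
 * mirroring the read-modify-write a driver performs. */
static uint16_t set_bus_master(uint16_t pci_command, int enable)
{
    return enable ? (uint16_t)(pci_command | PCI_CMD_MASTER)
                  : (uint16_t)(pci_command & ~PCI_CMD_MASTER);
}
```

While BME is clear, the root complex simply refuses the device's memory read/write requests—this is why forgetting `pci_set_master()` in a driver produces a device that appears configured but never transfers data.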
```c
// Enabling and Using PCIe Bus Master (Linux Driver)
#include <linux/pci.h>
#include <linux/dma-mapping.h>

struct my_pcie_device {
    struct pci_dev *pdev;
    void __iomem *regs;
    dma_addr_t ring_dma;
    void *ring_va;
};

// Enable bus mastering during device probe
int my_device_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    struct my_pcie_device *dev;
    int ret;

    dev = kzalloc(sizeof(*dev), GFP_KERNEL);
    if (!dev)
        return -ENOMEM;
    dev->pdev = pdev;

    // Step 1: Enable the device
    ret = pci_enable_device(pdev);
    if (ret) {
        dev_err(&pdev->dev, "Failed to enable PCI device");
        goto err_free;
    }

    // Step 2: Enable bus mastering - CRITICAL!
    // This sets the Bus Master Enable bit in PCI Command Register
    pci_set_master(pdev);

    // Step 3: Set DMA mask (what addresses device can access)
    ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
    if (ret) {
        // Fall back to 32-bit DMA
        ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
        if (ret) {
            dev_err(&pdev->dev, "No suitable DMA mask");
            goto err_disable;
        }
    }

    // Step 4: Request MMIO regions
    ret = pci_request_regions(pdev, "my_driver");
    if (ret) {
        dev_err(&pdev->dev, "Failed to request regions");
        goto err_disable;
    }

    // Step 5: Map BAR0 (registers)
    dev->regs = pci_iomap(pdev, 0, 0);
    if (!dev->regs) {
        ret = -ENOMEM;
        goto err_release;
    }

    // Step 6: Allocate DMA memory for descriptor ring
    dev->ring_va = dma_alloc_coherent(&pdev->dev,
                                      4096,           // Size
                                      &dev->ring_dma, // Gets DMA address
                                      GFP_KERNEL);
    if (!dev->ring_va) {
        ret = -ENOMEM;
        goto err_unmap;
    }

    // Step 7: Tell device where descriptors are (device will DMA here)
    // Device now has permission to bus master!
    writel(lower_32_bits(dev->ring_dma), dev->regs + RING_BASE_LO);
    writel(upper_32_bits(dev->ring_dma), dev->regs + RING_BASE_HI);

    pci_set_drvdata(pdev, dev);
    return 0;

err_unmap:
    pci_iounmap(pdev, dev->regs);
err_release:
    pci_release_regions(pdev);
err_disable:
    pci_disable_device(pdev);
err_free:
    kfree(dev);
    return ret;
}

// The key function: pci_set_master()
// Sets bit 2 (Bus Master Enable) in PCI Command Register
// Without this, device's DMA attempts will be ignored/error
void my_pci_set_master(struct pci_dev *dev)
{
    u16 cmd;

    pci_read_config_word(dev, PCI_COMMAND, &cmd);
    if (!(cmd & PCI_COMMAND_MASTER)) {
        dev_dbg(&dev->dev, "Enabling bus mastering");
        cmd |= PCI_COMMAND_MASTER;
        pci_write_config_word(dev, PCI_COMMAND, cmd);
    }
    // Also update latency timer
    pci_set_latency_timer(dev, 64);
}

// Security note: pci_set_master() gives the device FULL access
// to system memory (subject to IOMMU). Never call for untrusted devices!
```

Modern high-performance devices don't just bus master—they do so with multiple independent queues, enabling massive parallelism and CPU core-local I/O.
Consider modern NVMe SSDs, which can support up to 65,535 I/O queue pairs, each up to 65,536 commands deep.
Why so many queues? So that every CPU core can own a private submission/completion queue pair—cores issue and reap I/O without locks or cache-line contention between them.
With multi-queue bus mastering, the device's DMA engine might be simultaneously fetching a command from core 0's submission queue, reading write data for core 1, depositing read data into core 2's buffer, posting a completion entry for core 3, and signaling an MSI-X interrupt for core 4.
All five operations happen in parallel, each targeting different memory regions.
Multi-queue devices use MSI-X (Message Signaled Interrupts Extended) to deliver interrupts per-queue. Each queue has its own interrupt vector, targeting the CPU that owns that queue. No shared interrupt lines, no interrupt routing—just direct core notification. This eliminates interrupt affinity issues that plagued traditional devices.
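The per-queue, per-core wiring can be sketched in a few lines. This is a toy model with hypothetical names, not a real driver's data structures; the round-robin sharing policy is one common choice when queues are scarcer than cores:

```c
#define MAX_QUEUES 8

/* Per-queue state: which MSI-X vector it signals and which CPU
 * that vector is affinitized to. */
struct hw_queue {
    int msix_vector;
    int irq_cpu;   /* CPU that handles this queue's completions */
};

/* Give each queue its own vector and point that vector at the core
 * owning the queue -- so completions land where work was submitted.
 * If there are more CPUs than queues, cores share round-robin. */
static void setup_queues(struct hw_queue *q, int nr_queues, int nr_cpus)
{
    for (int i = 0; i < nr_queues; i++) {
        q[i].msix_vector = i;
        q[i].irq_cpu = i % nr_cpus;
    }
}

/* Which queue a given CPU should submit on. */
static int cpu_to_queue(int cpu, int nr_queues)
{
    return cpu % nr_queues;
}
```

The invariant worth noticing: a core submits on "its" queue and is interrupted on itself for the completion, so the whole round trip stays core-local.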
Bus mastering creates a significant security challenge: any bus-mastering device can read or write any physical memory address. Without protection, a compromised or malicious device could read encryption keys and passwords out of RAM, modify running kernel code, or corrupt arbitrary data—bypassing every software security boundary.
The IOMMU (I/O Memory Management Unit) provides essential protection.
The IOMMU sits between bus masters and memory, translating and validating device addresses:
Without IOMMU:
Device → Physical Address → Memory Controller → RAM
(Device directly specifies physical address - dangerous!)
With IOMMU:
Device → Device Virtual Address → IOMMU → Physical Address → RAM
(IOMMU translates and validates - protected!)
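The translate-and-validate step can be modeled as a table walk. Here is a toy single-level sketch (real IOMMUs use multi-level page tables and hardware TLBs; all names and the 4 KB-page simplification are mine):

```c
#include <stdint.h>

#define IOMMU_FAULT UINT64_MAX   /* sentinel: access rejected */
#define PERM_R      1
#define PERM_W      2

/* One IOMMU mapping: a single page of device-visible address
 * space (IOVA) backed by a physical page, with permissions. */
struct iommu_entry {
    uint64_t iova;   /* page-aligned device virtual address */
    uint64_t phys;   /* page-aligned physical address       */
    int      perm;   /* PERM_R | PERM_W                     */
};

/* Translate a device access. Anything outside the device's
 * mappings -- or with the wrong permission -- faults instead of
 * reaching RAM, which is the whole point of the IOMMU. */
static uint64_t iommu_translate(const struct iommu_entry *tbl, int n,
                                uint64_t iova, int perm)
{
    for (int i = 0; i < n; i++) {
        if (iova >= tbl[i].iova && iova < tbl[i].iova + 4096 &&
            (tbl[i].perm & perm) == perm)
            return tbl[i].phys + (iova - tbl[i].iova);
    }
    return IOMMU_FAULT;
}
```

The two failure cases—unmapped IOVA and wrong permission—correspond exactly to the IOMMU faults a driver sees when a device DMAs somewhere it shouldn't.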
Key IOMMU features:
- Address translation: devices see IOVAs; the IOMMU maps them to physical pages
- Access control: each mapping carries read/write permissions
- Isolation: each device (or device group) gets its own address space and cannot reach other mappings
- Fault reporting: accesses outside the mappings are blocked and reported to the OS
```c
// IOMMU-Protected DMA Buffer Setup (Linux)

// The DMA API automatically uses IOMMU when available
int setup_protected_dma(struct device *dev)
{
    void *buffer;
    dma_addr_t dma_handle;

    // Allocate a buffer the device can DMA to
    // If IOMMU is present, dma_handle will be an IOVA (I/O Virtual Address)
    // The IOMMU will translate this to actual physical address
    buffer = dma_alloc_coherent(dev, 4096, &dma_handle, GFP_KERNEL);

    // The device can ONLY access this specific buffer
    // Any attempt to access other memory will cause IOMMU fault

    // dma_handle: Address to give to device (IOVA)
    // buffer:     Virtual address for CPU access
    return 0;
}

// Manual IOMMU mapping example
int map_user_buffer_for_dma(struct device *dev, unsigned long user_addr,
                            size_t size)
{
    struct page **pages;
    dma_addr_t *dma_addrs;
    int nr_pages, i;

    nr_pages = (size + PAGE_SIZE - 1) / PAGE_SIZE;
    pages = kmalloc(nr_pages * sizeof(*pages), GFP_KERNEL);
    dma_addrs = kmalloc(nr_pages * sizeof(*dma_addrs), GFP_KERNEL);

    // Pin user pages in memory
    // (Pages can't be swapped while DMA is in progress)
    int pinned = get_user_pages_fast(user_addr, nr_pages, 0, pages);
    if (pinned != nr_pages)
        goto err;

    // Create DMA mappings (IOMMU translates these)
    for (i = 0; i < nr_pages; i++) {
        dma_addrs[i] = dma_map_page(dev, pages[i], 0, PAGE_SIZE,
                                    DMA_BIDIRECTIONAL);
        if (dma_mapping_error(dev, dma_addrs[i]))
            goto err_unmap;
    }

    // Now dma_addrs[] contains IOVAs that:
    // 1. Device can use for DMA
    // 2. IOMMU will translate to correct physical pages
    // 3. Device cannot access any OTHER memory
    return 0;

err_unmap:
    for (int j = 0; j < i; j++)
        dma_unmap_page(dev, dma_addrs[j], PAGE_SIZE, DMA_BIDIRECTIONAL);
err:
    kfree(pages);
    kfree(dma_addrs);
    return -EIO;
}

// IOMMU fault handler (simplified)
// Called when device tries to access unmapped memory
int iommu_fault_handler(struct iommu_domain *domain, struct device *dev,
                        unsigned long iova, int flags, void *token)
{
    dev_err(dev, "IOMMU fault! Device tried to access IOVA 0x%lx", iova);

    // Options:
    // 1. Log and continue (dangerous, allows bypass attacks)
    // 2. Reset the device
    // 3. Kill the process using the device
    return -EACCES;  // Deny access
}
```

Without IOMMU, attacks via FireWire, Thunderbolt, and other bus-mastering interfaces have demonstrated full system compromise. An attacker plugging in a malicious device can read encryption keys from memory, modify running code, and bypass all software security. IOMMU must be enabled and properly configured for secure systems.
Advanced bus mastering enables an even more powerful capability: peer-to-peer (P2P) DMA where devices transfer data directly to each other, bypassing system memory entirely.
Traditionally, data flowing between two devices must pass through system RAM:
Traditional: SSD → System RAM → GPU
- SSD DMA writes to RAM
- GPU DMA reads from RAM
- Two memory copies, uses RAM bandwidth
Peer-to-Peer: SSD → GPU
- SSD DMA writes directly to GPU memory
- Zero RAM bandwidth used
- Lower latency, higher throughput
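The RAM-bandwidth accounting behind this diagram is simple: every byte bounced through system memory crosses the memory bus twice (one DMA write in, one DMA read out), while a true peer-to-peer transfer touches RAM not at all. A one-function sketch of that accounting:

```c
/* System-RAM bandwidth (same units as `rate`) consumed while moving
 * `rate` worth of payload between two devices. The bounce through
 * RAM costs one write plus one read; peer-to-peer costs nothing. */
static double ram_bw_used(double rate, int peer_to_peer)
{
    return peer_to_peer ? 0.0 : 2.0 * rate;
}
```

So feeding a GPU from a 7 GB/s NVMe SSD the traditional way eats 14 GB/s of memory bandwidth that the CPUs could otherwise use.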
| Source | Destination | Application | Benefit |
|---|---|---|---|
| NVMe SSD | GPU (VRAM) | AI training data loading | 3x faster data pipeline |
| Network Card | GPU (VRAM) | RDMA AI cluster communication | Low-latency inference |
| FPGA | GPU | Accelerated preprocessing | No CPU involvement |
| GPU | NVMe SSD | Model checkpoint saving | Fast save/restore |
| NIC | NIC | Network forwarding | Kernel bypass forwarding |
NVIDIA GPUDirect Storage: Enables direct NVMe SSD → GPU transfers
GPUDirect RDMA: Enables direct Network → GPU transfers
AMD Smart Access Storage: AMD's equivalent technology
```c
// Peer-to-Peer DMA Example (Conceptual)
// NVMe SSD directly to GPU memory

#include <linux/pci-p2pdma.h>

int setup_p2p_transfer(struct device *nvme_dev, struct device *gpu_dev,
                       size_t size)
{
    struct pci_dev *nvme_pdev = to_pci_dev(nvme_dev);
    struct pci_dev *gpu_pdev = to_pci_dev(gpu_dev);
    dma_addr_t p2p_addr;
    void *p2p_buffer;
    int ret;

    // Step 1: Check if P2P is possible between these devices
    // Requires: Same root complex, PCIe switch support, ACS config
    ret = pci_p2pdma_distance_many(nvme_pdev, &gpu_pdev, 1);
    if (ret < 0) {
        dev_err(nvme_dev, "P2P not supported between devices");
        return ret;
    }

    // Step 2: Allocate buffer in P2P-capable memory
    // This memory is accessible by both devices directly
    p2p_buffer = pci_p2pmem_alloc_sgl(gpu_pdev, nvme_dev, size);
    if (!p2p_buffer) {
        return -ENOMEM;
    }

    // Step 3: Get address for NVMe to use as DMA destination
    // This address points to GPU BAR (not system RAM!)
    p2p_addr = pci_p2pdma_map_sg(gpu_dev, sgl, nents, DMA_FROM_DEVICE);

    // Step 4: Program NVMe to DMA to this address
    // NVMe thinks it's writing to memory, but it's GPU memory!
    nvme_submit_read_command(nvme_handle, lba, num_blocks,
                             p2p_addr);  // This is GPU memory address

    // Data flows: NVMe -> PCIe Switch -> GPU
    // System RAM is not touched!
    return 0;
}

// Prerequisites for P2P DMA:
// 1. Devices under same PCIe root complex (usually same CPU)
// 2. PCIe switch (if any) supports routing between endpoints
// 3. Access Control Services (ACS) configured correctly
// 4. Both devices support P2P in their drivers
// 5. IOMMU configured to allow P2P (or correctly map BAR space)
//
// This is why P2P often "just works" when devices are in the
// same system but fails across CPU sockets or through PLX switches.
```

Getting P2P DMA to work requires careful hardware and software configuration. PCIe ACS (Access Control Services), IOMMU settings, BIOS options, and device placement all affect whether P2P is possible. In production, it's often easier to use explicit P2P-aware APIs (like cuFile in CUDA) than to configure generic P2P.
Bus-mastering devices achieve I/O performance that would be unthinkable with CPU-mediated transfers. Let's examine what modern bus masters can do:
| Device Type | Bandwidth | IOPS | Latency |
|---|---|---|---|
| PCIe 4.0 x4 NVMe SSD | ~7 GB/s | 1,000,000 | ~10 µs |
| PCIe 5.0 x4 NVMe SSD | ~14 GB/s | 2,000,000+ | ~10 µs |
| 100G Ethernet NIC | ~12.5 GB/s | 150M pps | ~1 µs RDMA |
| PCIe 4.0 x16 GPU | ~32 GB/s (bidirectional) | N/A | ~5 µs |
| PCIe 5.0 x16 GPU | ~64 GB/s (bidirectional) | N/A | ~5 µs |
| CXL Memory Expander | ~64 GB/s | N/A | ~100-200 ns |
Reaching these numbers requires attention to several factors:
Queue Depth: Modern devices achieve peak performance only with deep queues. An NVMe SSD at queue depth 1 might deliver 10,000 IOPS; at queue depth 256, it delivers 1,000,000 IOPS. The device needs many commands in flight to hide latency and maximize internal parallelism.
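This is Little's law applied to an I/O queue: sustained IOPS equals commands in flight divided by per-command latency (the 10,000 IOPS figure at queue depth 1 implies roughly 100 µs per command). A sketch of the relationship:

```c
/* Little's law: throughput = concurrency / latency.
 * Latency given in microseconds for convenience. */
static double iops(double queue_depth, double latency_us)
{
    return queue_depth * 1e6 / latency_us;
}
```

The formula also shows why deep queues matter more as devices get faster: to saturate a device at fixed latency, queue depth must grow linearly with its IOPS capability.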
I/O Size: Small I/Os are inefficient. A 4KB read might achieve 3 GB/s effective throughput; a 256KB read achieves 7 GB/s on the same SSD. Larger I/Os amortize per-command overhead.
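The amortization can be made concrete as effective_bandwidth = size / (overhead + size / line_rate). In the sketch below, the 0.8 µs per-command overhead is an assumed figure, chosen only because it roughly reproduces the 4 KB and 256 KB numbers above for a 7 GB/s link:

```c
/* Effective throughput (bytes/sec) for an I/O of `size` bytes on a
 * link of `line_rate` bytes/sec with a fixed per-command cost of
 * `overhead_sec` seconds. Small I/Os are dominated by the fixed cost. */
static double effective_bw(double size, double line_rate, double overhead_sec)
{
    return size / (overhead_sec + size / line_rate);
}
```

At 4 KB the fixed cost exceeds the wire time, so less than half the line rate survives; at 256 KB the wire time dominates and throughput approaches the link's limit.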
CPU/Memory Locality: Submitting commands and handling completions on the same CPU that owns the queue avoids cross-CPU synchronization. Allocating buffers from local NUMA node memory minimizes memory access latency.
Interrupt Coalescing: Handling an interrupt costs roughly 1,000 cycles. At 1M IOPS with one interrupt per completion, that's a billion cycles per second—a large share of an entire CPU core. Coalescing completions, or polling outright for high-IOPS workloads, is essential.
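The arithmetic, with coalescing folded in (the 1,000-cycle cost and the coalescing factor are the assumptions here): 1M IOPS times 1,000 cycles is a billion cycles every second unless completions are batched.

```c
/* CPU cycles per second spent in interrupt handling, given an I/O
 * rate, a per-interrupt cost, and a coalescing factor (completions
 * batched per interrupt). */
static double irq_cycles_per_sec(double iops, double cycles_per_irq,
                                 double coalesce)
{
    return iops * cycles_per_irq / coalesce;
}
```

Batching 32 completions per interrupt cuts the overhead 32-fold, which is why NVMe and modern NICs expose coalescing knobs, and why io_uring and DPDK-style polling skip interrupts entirely at the high end.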
PCIe 6.0 doubles the data rate again to 64 GT/s using PAM4 signaling, and PCIe 7.0 targets 128 GT/s. Bus-mastering hardware continues to evolve, with devices targeting 100+ GB/s and tens of millions of IOPS. The CPU's role is increasingly just orchestration—data barely touches the CPU.
Bus mastering represents the pinnacle of DMA evolution—devices as full peers on the memory interconnect, capable of high-performance, independent memory access. Let's consolidate the key insights:
- Bus mastering is first-party DMA: the device carries its own DMA engine and initiates memory transactions itself.
- On PCIe, every endpoint is a potential bus master; the OS grants the capability via the Bus Master Enable bit.
- Multi-queue designs with MSI-X interrupts give each CPU core a private, lock-free path to the device.
- The IOMMU makes bus mastering safe by translating and validating every device-initiated access.
- Peer-to-peer DMA extends bus mastering to device-to-device transfers that bypass system RAM entirely.
Module Complete:
You've now completed the comprehensive exploration of Direct Memory Access. From the basic concept through controller architecture, transfer processes, cycle stealing, and finally bus mastering—you understand DMA at the level needed for device driver development, system architecture, and performance optimization.
You now understand DMA from first principles through advanced bus mastering. You can explain why DMA exists, how controllers work, what distinguishes different transfer modes, and how modern PCIe devices achieve their remarkable performance. This foundation enables you to write efficient device drivers, debug I/O performance issues, and design systems that maximize throughput while maintaining security through proper IOMMU configuration.