We've traced the evolution of I/O techniques: from Programmed I/O where the CPU moves every byte, to Interrupt-Driven I/O where the CPU is notified when devices need attention but still performs the data transfer. Both approaches share a fundamental limitation: the CPU is on the critical path for every byte transferred.
What if devices could access system memory directly? What if, instead of the CPU laboriously reading from a device and writing to memory, the device could simply write to memory itself—freeing the CPU entirely?
This is Direct Memory Access (DMA)—the technique that enables modern systems to transfer gigabytes per second while the CPU attends to computation. DMA is the reason your NVMe SSD can sustain 7 GB/s, your GPU can render at 4K 120fps, and your network card can process millions of packets per second—all without bottlenecking on CPU involvement.
By the end of this page, you will understand:

- The DMA architectural model and how it differs from PIO
- Bus mastering and the role of DMA controllers
- Scatter-gather and descriptor-based DMA
- Memory coherency challenges and solutions
- DMA security concerns and IOMMU protection
- Practical DMA programming patterns in device drivers
- Modern DMA in PCIe, NVMe, and high-speed networking
Direct Memory Access (DMA) is a data transfer mechanism where a device controller transfers data between I/O devices and main memory without direct CPU involvement. The CPU initiates the transfer by providing parameters (memory address, transfer length, direction), then the DMA engine executes the transfer autonomously.
Key Characteristics:

- The CPU only sets up the transfer (address, length, direction) and handles completion; it is off the data path.
- The device (or a DMA engine acting on its behalf) moves data directly between itself and main memory.
- Completion is signaled asynchronously, typically via an interrupt.
- DMA engines work with physical (bus) addresses, not the CPU's virtual addresses.
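To make this setup-then-transfer model concrete, here is a minimal sketch against a hypothetical memory-mapped DMA engine. The register layout (`DMA_REG_*`) and the `mmio_write32`/`mmio_read32` helpers are invented for illustration, not taken from any real device.

```c
#include <stdint.h>

/* Hypothetical register layout of a simple DMA engine (illustrative only) */
#define DMA_REG_ADDR_LO  0x00   /* Low 32 bits of buffer physical address */
#define DMA_REG_ADDR_HI  0x04   /* High 32 bits */
#define DMA_REG_LENGTH   0x08   /* Transfer length in bytes */
#define DMA_REG_CONTROL  0x0C   /* Bit 0: start, bit 1: direction, bit 2: IRQ enable */
#define DMA_REG_STATUS   0x10   /* Bit 0: done, bit 1: error */

/* MMIO helpers assumed to exist elsewhere in the driver */
extern void     mmio_write32(volatile void *addr, uint32_t val);
extern uint32_t mmio_read32(volatile void *addr);

/*
 * Start a DMA transfer: the CPU only provides parameters.
 * 'phys' must be a physical (bus) address the device can reach.
 */
static void dma_start(volatile uint8_t *regs, uint64_t phys,
                      uint32_t len, int to_device)
{
    mmio_write32(regs + DMA_REG_ADDR_LO, (uint32_t)phys);
    mmio_write32(regs + DMA_REG_ADDR_HI, (uint32_t)(phys >> 32));
    mmio_write32(regs + DMA_REG_LENGTH, len);

    uint32_t ctrl = (1u << 0) | (1u << 2);   /* start + interrupt enable */
    if (to_device)
        ctrl |= (1u << 1);                   /* direction: memory -> device */
    mmio_write32(regs + DMA_REG_CONTROL, ctrl);

    /* From here on, the device moves the data; the CPU is free until the
     * completion interrupt fires (or DMA_REG_STATUS bit 0 is polled). */
}
```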
DMA vs PIO Comparison:
Programmed I/O (PIO):

┌──────┐    data     ┌─────┐    data     ┌────────┐
│Device│◄───────────►│ CPU │◄───────────►│ Memory │
└──────┘             └─────┘             └────────┘
         (CPU fetches/stores each byte)

Direct Memory Access (DMA):

┌─────┐   setup only   ┌──────┐     data     ┌────────┐
│ CPU │───────────────►│Device│◄────────────►│ Memory │
└─────┘  (addresses,   └──────┘    (direct   └────────┘
   ▲      length, start)   │        transfer)
   │                       │
   └───────────────────────┘
     (interrupt on complete)

In PIO, data flows: Device → CPU registers → Memory (or the reverse).
In DMA, data flows: Device ↔ Memory (the CPU is not on the data path).
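The contrast also shows up in driver code. The sketch below assumes a hypothetical device with a PIO data port (read via an assumed `pio_read16` helper) and reuses the `dma_start` routine sketched earlier; in the PIO case the CPU executes one I/O read per word, while in the DMA case it only hands over an address and a length.

```c
#include <stdint.h>
#include <stddef.h>

extern uint16_t pio_read16(uint16_t port);   /* e.g. wraps the x86 'in' instruction */
extern void dma_start(volatile uint8_t *regs, uint64_t phys,
                      uint32_t len, int to_device);   /* from the sketch above */

/* PIO: the CPU performs one I/O read per word - it IS the data path */
void pio_read_block(uint16_t data_port, uint16_t *dst, size_t words)
{
    for (size_t i = 0; i < words; i++)
        dst[i] = pio_read16(data_port);
}

/* DMA: the CPU supplies address + length, then is free until completion */
void dma_read_block(volatile uint8_t *regs, uint64_t dst_phys, uint32_t bytes)
{
    dma_start(regs, dst_phys, bytes, 0 /* device-to-memory */);
}
```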
| Metric | Programmed I/O | DMA |
|---|---|---|
| Max throughput (NVMe SSD) | ~100 MB/s (CPU limited) | 7000+ MB/s |
| CPU utilization during transfer | 100% | ~0% (setup + completion only) |
| Transfer initiation overhead | Minimal | Higher (descriptor setup) |
| Latency for tiny transfers | Lower | Higher (DMA setup overhead) |
| Hardware complexity | Simple device registers | DMA engine + bus mastering |
| Memory access pattern | Sequential only | Scatter-gather supported |
DMA has setup overhead—configuring descriptors, synchronizing caches, and handling the completion interrupt. For very small transfers (under ~100 bytes), this overhead can exceed the cost of simple PIO. This is why configuration register access uses PIO while bulk data transfers use DMA.
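Drivers often encode this tradeoff as a simple size threshold. The sketch below is a hypothetical heuristic; the 128-byte cutoff and the `transfer_pio`/`transfer_dma` names are illustrative, not from a real driver.

```c
#include <stddef.h>

#define DMA_THRESHOLD_BYTES 128   /* illustrative cutoff; real drivers tune this */

extern void transfer_pio(void *buf, size_t len);   /* CPU-driven register copies */
extern void transfer_dma(void *buf, size_t len);   /* descriptor setup + interrupt */

void device_transfer(void *buf, size_t len)
{
    if (len < DMA_THRESHOLD_BYTES) {
        /* Descriptor setup, cache sync, and the completion interrupt
         * would cost more than simply copying the bytes. */
        transfer_pio(buf, len);
    } else {
        transfer_dma(buf, len);
    }
}
```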
DMA can be implemented in two broad architectural patterns:
1. Third-Party DMA (Legacy ISA Model):
A centralized DMA controller on the system board handles transfers for multiple devices. The original PC used the Intel 8237 DMA controller, which provided 4 channels (later 7 usable channels with two cascaded controllers).
Workflow:
1. The CPU programs the DMA controller with the channel, memory address, transfer count, and direction.
2. The device asserts a DMA request (DREQ) when it has data ready or needs data.
3. The controller takes over the bus, performs the transfer word by word, then releases the bus.
4. When the count is exhausted, the controller raises an interrupt to signal completion.
This design is obsolete because:
- The shared controller was a bottleneck: one transfer at a time across all devices.
- The 8237 ran at only a few MHz, addressed only the first 16 MB of memory, and could not cross 64 KB boundaries.
- Devices had to compete for a handful of fixed channels.
- Modern buses (PCI/PCIe) made it far more effective for each device to master the bus itself.
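For historical context, this is roughly how a driver programmed the 8237 for a device-to-memory transfer on channel 2 (the classic floppy channel). The port numbers follow the standard PC layout, and `port_write8` is an assumed port-I/O helper; treat this as a hedged sketch, not a complete floppy driver.

```c
#include <stdint.h>

extern void port_write8(uint16_t port, uint8_t val);   /* x86 'out' instruction wrapper */

/*
 * Program 8237 channel 2 for a device-to-memory transfer.
 * 'phys' must be below 16 MB and must not cross a 64 KB boundary -
 * two of the limitations that made this design obsolete.
 */
void isa_dma_setup_read(uint32_t phys, uint16_t length)
{
    uint16_t count = length - 1;               /* 8237 counts are length - 1 */

    port_write8(0x0A, 0x06);                   /* mask channel 2 while programming */
    port_write8(0x0C, 0xFF);                   /* reset the byte flip-flop */

    port_write8(0x0B, 0x46);                   /* single mode, write-to-memory, channel 2 */

    port_write8(0x04, phys & 0xFF);            /* address bits 0-7 */
    port_write8(0x04, (phys >> 8) & 0xFF);     /* address bits 8-15 */
    port_write8(0x81, (phys >> 16) & 0xFF);    /* page register: address bits 16-23 */

    port_write8(0x05, count & 0xFF);           /* count low byte */
    port_write8(0x05, (count >> 8) & 0xFF);    /* count high byte */

    port_write8(0x0A, 0x02);                   /* unmask channel 2 - device may now DREQ */
}
```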
2. First-Party DMA (Bus Mastering):
Modern devices contain their own DMA engines and become bus masters—they can initiate memory transactions directly without a centralized controller.
Workflow:
1. The driver builds descriptors (addresses, lengths, flags) in memory and writes a few device registers to point at them.
2. The device, acting as bus master, fetches the descriptors and transfers data directly to/from host memory.
3. The device writes completion status back to memory and raises an interrupt (typically MSI/MSI-X).
Advantages of Bus Mastering:
- Every device has its own DMA engine, so transfers from different devices proceed in parallel.
- Transfers run at the full speed of the device and its bus link, not at a shared controller's pace.
- Devices can implement schemes tuned to their workload: descriptor rings, scatter-gather, multiple queues.
- There is no central channel allocation or arbitration bottleneck.
```c
/*
 * DMA Descriptor Structures
 *
 * Modern DMA uses descriptor rings - the device fetches transfer
 * descriptions from memory, executes them, and updates completion status.
 */

#include <stdbool.h>
#include <stdint.h>

/*
 * Simple DMA Descriptor (generic pattern)
 *
 * Each descriptor describes one contiguous memory region
 * for a DMA transfer.
 */
struct dma_descriptor {
    uint64_t buffer_addr;    /* Physical address of data buffer */
    uint32_t buffer_length;  /* Length in bytes */
    uint32_t control;        /* Flags: interrupt-on-complete, etc. */
} __attribute__((packed));

/* Control word flags */
#define DMA_CTRL_INTR (1 << 0)   /* Generate interrupt on completion */
#define DMA_CTRL_LINK (1 << 1)   /* Another descriptor follows */
#define DMA_CTRL_LAST (1 << 2)   /* This is the last descriptor */

/*
 * NVMe-style Submission Queue Entry (Simplified)
 *
 * NVMe uses command queues rather than simple descriptors.
 * Each entry is a 64-byte command that can reference multiple
 * Physical Region Pages.
 */
struct nvme_sqe {
    uint8_t  opcode;       /* Command opcode (read, write, etc.) */
    uint8_t  flags;        /* Fused command, PRP vs SGL, etc. */
    uint16_t command_id;   /* Unique ID for completion matching */
    uint32_t nsid;         /* Namespace ID */
    uint64_t reserved;
    uint64_t metadata;     /* Metadata buffer address */
    uint64_t prp1;         /* Physical Region Page 1 (data buffer) */
    uint64_t prp2;         /* PRP 2 or PRP list pointer */
    uint32_t cdw10;        /* Starting LBA (low 32 bits for read/write) */
    uint32_t cdw11;        /* Starting LBA (high 32 bits) */
    uint32_t cdw12;        /* Number of blocks (0-based) and flags */
    uint32_t cdw13;        /* Command-specific */
    uint32_t cdw14;        /* Command-specific */
    uint32_t cdw15;        /* Command-specific */
} __attribute__((packed));

/*
 * Scatter-Gather List Entry (Intel Style)
 *
 * Describes one segment of a potentially non-contiguous buffer.
 * Multiple SGEs form a list, allowing DMA to access scattered pages.
 */
struct sg_entry {
    uint64_t address;  /* Physical address of segment */
    uint32_t length;   /* Segment length in bytes */
    uint32_t flags;    /* End-of-list, interrupt, etc. */
} __attribute__((packed));

/* SG flags */
#define SG_FLAG_FINAL (1U << 31)  /* Last entry in list */

/*
 * DMA Ring Buffer (Descriptor Ring)
 *
 * Many devices use circular descriptor rings for efficient
 * command submission and completion.
 */
#define RING_SIZE 256

struct dma_ring {
    struct dma_descriptor descriptors[RING_SIZE];  /* In DMA-able memory */
    volatile uint32_t head;  /* Next to submit */
    volatile uint32_t tail;  /* Next to complete */
    uint32_t ring_size;

    /* Shadow data for driver use (not touched by hardware) */
    void *buffers[RING_SIZE];          /* Virtual addresses of buffers */
    void *completion_data[RING_SIZE];  /* Per-descriptor completion context */
};

/*
 * Ring buffer management
 */
static inline bool ring_full(struct dma_ring *ring) {
    return ((ring->head + 1) % ring->ring_size) == ring->tail;
}

static inline bool ring_empty(struct dma_ring *ring) {
    return ring->head == ring->tail;
}

static inline uint32_t ring_next(struct dma_ring *ring, uint32_t idx) {
    return (idx + 1) % ring->ring_size;
}
```

The device's DMA engine reads descriptors from main memory. This means descriptors must be in DMA-addressable memory (below 4 GB for devices limited to 32-bit DMA addressing, or remapped through an IOMMU). Descriptors must also use physical addresses—the device has no knowledge of virtual memory.
Real-world data buffers are often not contiguous in physical memory. When a user allocates a 64KB buffer, the OS might fulfill that request with 16 separate 4KB pages scattered throughout physical memory. Scatter-Gather DMA (SG-DMA) addresses this by allowing a single logical transfer to span multiple non-contiguous physical memory regions.
Without Scatter-Gather:
The driver must either copy the data into a physically contiguous bounce buffer (an extra copy on every transfer) or split the request into one DMA operation per physical page (more setup overhead and more interrupts).
With Scatter-Gather:
The driver builds a list of (physical address, length) entries describing each fragment, and the device walks the list in a single DMA operation: no extra copies and no per-page commands.
Scatter-Gather List Visualization:
Virtual Buffer (64KB, contiguous in virtual address space):
┌──────────────────────────────────────────────────────────┐
│ 0x00000000 - 0x0000FFFF (user sees contiguous buffer) │
└──────────────────────────────────────────────────────────┘
Physical Pages (scattered in RAM):
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ Page at│ │ Page at│ │ Page at│ │ Page at│ ... (16 pages)
│0x12000 │ │0x47000 │ │0x89000 │ │0x23000 │
└────────┘ └────────┘ └────────┘ └────────┘
Scatter-Gather List:
┌─────────────────────────────────────────────────────────┐
│ Entry 0: addr=0x12000, len=4096 │
│ Entry 1: addr=0x47000, len=4096 │
│ Entry 2: addr=0x89000, len=4096 │
│ Entry 3: addr=0x23000, len=4096 │
│ ... │
│ Entry 15: addr=0x56000, len=4096, flags=FINAL │
└─────────────────────────────────────────────────────────┘
```c
/*
 * Scatter-Gather DMA Implementation
 *
 * Demonstrates building and using scatter-gather lists for
 * DMA transfers with non-contiguous physical memory.
 */

#include <stdint.h>
#include <stddef.h>

/* Simulated kernel functions */
extern uint64_t virt_to_phys(void *virt);
extern void *dma_alloc_coherent(size_t size, uint64_t *dma_handle);
extern void dma_free_coherent(void *virt, size_t size, uint64_t dma_handle);

/* MMIO helpers; device register offsets (DMA_SG_ADDR, DMA_SG_COUNT,
 * DMA_CONTROL, DMA_COMMAND) and the DMA_* flag values are assumed to
 * come from the device's header. */
extern void mmio_write32(void *addr, uint32_t val);
extern void mmio_write64(void *addr, uint64_t val);

/* Page size definitions */
#define PAGE_SIZE 4096
#define PAGE_MASK (~(PAGE_SIZE - 1))

#define MAX_SG_ENTRIES 128

/* Scatter-Gather Entry */
struct sg_entry {
    uint64_t dma_address;   /* Physical/DMA address */
    uint32_t length;        /* Length of this segment */
    uint32_t offset;        /* Offset within page (for alignment) */
    struct sg_entry *next;  /* Link to next entry (optional) */
};

/* Scatter-Gather Table */
struct sg_table {
    struct sg_entry *entries;  /* Array of entries */
    int nents;                 /* Number of entries in use */
    int max_ents;              /* Maximum entries allocated */
};

/* Hardware-defined SG entry format that the device actually parses */
struct hw_sg_entry {
    uint64_t address;
    uint32_t length;
    uint32_t flags;
};

#define SG_FLAG_FINAL (1U << 31)  /* Last entry in list */

/*
 * Map a user buffer to a scatter-gather list
 *
 * Takes a virtually contiguous buffer that may be physically
 * scattered across multiple pages, and creates a list of
 * physical address/length pairs for DMA.
 */
int sg_map_buffer(struct sg_table *sgt, void *buffer, size_t length) {
    uint8_t *ptr = (uint8_t *)buffer;
    size_t remaining = length;
    int entry_idx = 0;

    while (remaining > 0 && entry_idx < sgt->max_ents) {
        struct sg_entry *entry = &sgt->entries[entry_idx];

        /* Calculate offset within current page */
        uint32_t page_offset = (uintptr_t)ptr & (PAGE_SIZE - 1);

        /* Calculate how much we can transfer from this page */
        uint32_t this_seg_len = PAGE_SIZE - page_offset;
        if (this_seg_len > remaining) {
            this_seg_len = remaining;
        }

        /* Translate virtual to physical/DMA address */
        entry->dma_address = virt_to_phys(ptr);
        entry->length = this_seg_len;
        entry->offset = page_offset;

        ptr += this_seg_len;
        remaining -= this_seg_len;
        entry_idx++;
    }

    if (remaining > 0) {
        return -1;  /* Buffer too large for SG table */
    }

    sgt->nents = entry_idx;
    return 0;
}

/*
 * Program device with scatter-gather list
 *
 * The device's DMA engine will fetch each entry and transfer
 * data to/from the specified physical addresses.
 */
void program_sg_dma(void *device_base, struct sg_table *sgt, int direction) {
    /* First, allocate DMA-able memory for the SG list itself.
     * The device needs to read the SG entries from memory. */
    size_t sg_list_size = sgt->nents * sizeof(struct hw_sg_entry);
    uint64_t sg_dma_addr;
    struct hw_sg_entry *hw_sg = dma_alloc_coherent(sg_list_size, &sg_dma_addr);

    /* Copy our SG entries to the hardware format */
    for (int i = 0; i < sgt->nents; i++) {
        hw_sg[i].address = sgt->entries[i].dma_address;
        hw_sg[i].length = sgt->entries[i].length;
        hw_sg[i].flags = (i == sgt->nents - 1) ? SG_FLAG_FINAL : 0;
    }

    /* Program device registers with SG list location */
    mmio_write64(device_base + DMA_SG_ADDR, sg_dma_addr);
    mmio_write32(device_base + DMA_SG_COUNT, sgt->nents);
    mmio_write32(device_base + DMA_CONTROL,
                 (direction == DMA_TO_DEVICE) ? DMA_CTRL_WRITE : DMA_CTRL_READ);

    /* Start the DMA (device fetches SG list and begins transfer) */
    mmio_write32(device_base + DMA_COMMAND, DMA_CMD_START);
}

/*
 * Coalesce adjacent SG entries
 *
 * If virtual pages happen to be physically contiguous,
 * we can merge SG entries for better DMA efficiency.
 */
void sg_coalesce(struct sg_table *sgt) {
    if (sgt->nents <= 1) return;

    int write_idx = 0;
    for (int i = 1; i < sgt->nents; i++) {
        struct sg_entry *prev = &sgt->entries[write_idx];
        struct sg_entry *curr = &sgt->entries[i];

        /* Check if this entry is contiguous with previous */
        if (prev->dma_address + prev->length == curr->dma_address) {
            /* Merge: extend previous entry */
            prev->length += curr->length;
        } else {
            /* Not contiguous: start new entry */
            write_idx++;
            if (write_idx != i) {
                sgt->entries[write_idx] = *curr;
            }
        }
    }
    sgt->nents = write_idx + 1;  /* Update count after coalescing */
}
```

With an IOMMU (Intel VT-d, AMD-Vi), devices access memory through their own page tables. The IOMMU can make scattered physical pages appear contiguous to the device, potentially eliminating the need for scatter-gather lists entirely. However, setting up IOMMU mappings has its own overhead, so scatter-gather remains important for performance.
DMA introduces a fundamental challenge: cache coherency. When a device writes directly to memory, it bypasses CPU caches. If the CPU has cached data from those addresses, it sees stale data. Conversely, if the CPU has dirty cached data that hasn't been written back, the device reads stale data from memory.
The Coherency Problem:
Scenario: Device writes to memory via DMA
BEFORE DMA:
Memory[0x1000] = 0x00 (old value in RAM)
Cache[0x1000] = 0x00 (CPU cached this value)
DMA TRANSFER:
Device writes 0xFF to memory address 0x1000
Memory[0x1000] = 0xFF (new value in RAM)
Cache[0x1000] = 0x00 (cache still has old value!)
CPU READS:
CPU reads from 0x1000 → cache hit → returns 0x00 (WRONG!)
```c
/*
 * DMA Cache Management
 *
 * Proper cache handling is CRITICAL for DMA correctness.
 * Incorrect cache management causes subtle, intermittent data corruption.
 */

#include <stdint.h>
#include <stddef.h>

/*
 * DMA direction flags
 */
#define DMA_TO_DEVICE     1  /* CPU → Memory → Device (device reads) */
#define DMA_FROM_DEVICE   2  /* Device → Memory → CPU (device writes) */
#define DMA_BIDIRECTIONAL 3  /* Both directions */

/* Cache line size (architecture dependent) */
#define CACHE_LINE_SIZE 64

/* Cache operation primitives (architecture-specific) */
extern void cache_flush_range(void *start, size_t len);             /* Writeback to memory */
extern void cache_invalidate_range(void *start, size_t len);        /* Discard cached data */
extern void cache_flush_invalidate_range(void *start, size_t len);  /* Both */

/* Simulated kernel helpers */
extern void *alloc_pages_dma(size_t size);
extern uint64_t virt_to_phys(void *virt);
extern void *phys_to_virt(uint64_t phys);

/*
 * Prepare buffer for DMA transfer (BEFORE starting DMA)
 *
 * Called after CPU has written data that device will read,
 * or before device will write data that CPU will read.
 */
void dma_sync_for_device(void *buffer, size_t length, int direction) {
    switch (direction) {
    case DMA_TO_DEVICE:
        /*
         * CPU has written data, device will read.
         * Ensure all CPU writes are visible in RAM.
         *
         * Action: Flush (writeback) CPU caches to memory.
         * This ensures the device sees the latest data.
         */
        cache_flush_range(buffer, length);
        break;

    case DMA_FROM_DEVICE:
        /*
         * Device will write data, CPU will read.
         * We need to invalidate the cache so CPU reads from RAM.
         *
         * Action: Invalidate CPU caches.
         * Note: We do this BEFORE DMA so that any dirty cache lines
         * are handled before device writes (on non-coherent systems).
         */
        cache_invalidate_range(buffer, length);
        break;

    case DMA_BIDIRECTIONAL:
        /* Must handle both cases */
        cache_flush_invalidate_range(buffer, length);
        break;
    }

    /* Memory barrier to ensure cache operations complete before DMA starts */
    __asm__ volatile ("mfence" ::: "memory");
}

/*
 * Synchronize buffer after DMA transfer (AFTER DMA completes)
 *
 * Called after device has written data that CPU will read,
 * or after device has read data (less critical).
 */
void dma_sync_for_cpu(void *buffer, size_t length, int direction) {
    /* Memory barrier to ensure DMA writes are visible */
    __asm__ volatile ("lfence" ::: "memory");

    switch (direction) {
    case DMA_TO_DEVICE:
        /*
         * Device has read the data.
         * No cache action strictly required, but invalidating
         * can prevent confusion if buffer is reused.
         */
        /* Optional: cache_invalidate_range(buffer, length); */
        break;

    case DMA_FROM_DEVICE:
        /*
         * Device has written data, CPU needs to read.
         * CRITICAL: Ensure CPU reads from memory, not stale cache.
         *
         * Action: Invalidate CPU caches.
         */
        cache_invalidate_range(buffer, length);
        break;

    case DMA_BIDIRECTIONAL:
        cache_invalidate_range(buffer, length);
        break;
    }
}

/*
 * Allocate DMA-coherent memory
 *
 * Some systems provide memory that is automatically coherent
 * between CPU and DMA - no explicit cache management needed.
 * This is often slower for CPU access but simpler to use.
 */
void *dma_alloc_coherent(size_t size, uint64_t *dma_handle) {
    /* Implementation allocates uncacheable memory or
     * uses special coherent region with hardware snooping.
     *
     * On x86: Memory is coherent by default.
     * On ARM: May use uncacheable mapping or CMA with CMO.
     */
    void *virt = alloc_pages_dma(size);
    *dma_handle = virt_to_phys(virt);
    return virt;
}

/*
 * Streaming DMA mapping
 *
 * For one-shot DMA operations, streaming mappings are more
 * efficient than coherent allocations.
 */
uint64_t dma_map_single(void *buffer, size_t size, int direction) {
    /* Perform cache management for this mapping */
    dma_sync_for_device(buffer, size, direction);

    /* Return physical/DMA address */
    return virt_to_phys(buffer);
}

void dma_unmap_single(uint64_t dma_addr, size_t size, int direction) {
    void *buffer = phys_to_virt(dma_addr);

    /* Perform cache management for unmapping */
    dma_sync_for_cpu(buffer, size, direction);
}
```

x86 systems implement hardware cache coherency for DMA (through snooping). This means explicit cache management is often unnecessary on x86—but you must still use proper memory barriers, map device memory with the correct caching attributes, and allocate buffers from DMA-accessible memory zones. Portable code should always include proper synchronization calls.
DMA provides devices with direct access to system memory—a powerful capability that is also a severe security risk. Without protection, a malicious or compromised device can:

- Read arbitrary physical memory, including encryption keys, passwords, and kernel data structures.
- Overwrite kernel code or data to escalate privileges or install persistent malware.
- Bypass OS-level protections entirely, since the CPU's MMU never sees device accesses.
This is especially concerning with:

- Externally exposed DMA-capable ports (Thunderbolt, USB4), where an attacker can simply plug in a hostile device.
- Hot-pluggable PCIe devices and compromised or malicious device firmware.
- Virtualized environments, where a device assigned to one guest could otherwise reach another guest's memory.
The IOMMU (I/O Memory Management Unit):
The IOMMU sits between devices and the memory controller, translating device DMA addresses through page tables—exactly as the CPU's MMU translates virtual addresses.
Without IOMMU:

┌──────────┐    physical    ┌──────────────┐
│  Device  │───────────────►│    Memory    │
└──────────┘    address     │  Controller  │
                            └──────────────┘
(Device can access ANY physical address)

With IOMMU:

┌──────────┐    device     ┌───────────┐    physical    ┌──────────────┐
│  Device  │──────────────►│   IOMMU   │───────────────►│    Memory    │
└──────────┘    address    │(translate)│    address     │  Controller  │
                           └───────────┘                └──────────────┘
(Device can only access mapped addresses, faults on invalid access)
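Conceptually, the IOMMU performs a page-table lookup on every device access. The sketch below models that check with a flat, single-level table for clarity; real IOMMUs such as VT-d or the ARM SMMU use multi-level tables and per-device domains, and every name here is illustrative.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define IOMMU_PAGE_SIZE   4096ULL
#define IOMMU_PTE_PRESENT (1ULL << 0)
#define IOMMU_PTE_WRITE   (1ULL << 1)

/* One translation domain: a flat array of PTEs indexed by I/O virtual page */
struct iommu_domain {
    uint64_t *pte;        /* pte[i] = physical page address | permission bits */
    size_t    num_pages;  /* size of the I/O virtual address space, in pages */
};

/*
 * Conceptual model of what the IOMMU does on every DMA request:
 * translate the device-visible address, or fault if no valid mapping exists.
 * Returns true and fills *phys_out on success, false on an IOMMU fault.
 */
bool iommu_translate(const struct iommu_domain *dom, uint64_t iova,
                     bool is_write, uint64_t *phys_out)
{
    uint64_t page = iova / IOMMU_PAGE_SIZE;
    if (page >= dom->num_pages)
        return false;                      /* address outside the domain */

    uint64_t pte = dom->pte[page];
    if (!(pte & IOMMU_PTE_PRESENT))
        return false;                      /* no mapping: DMA fault reported to the OS */
    if (is_write && !(pte & IOMMU_PTE_WRITE))
        return false;                      /* permission violation */

    *phys_out = (pte & ~(IOMMU_PAGE_SIZE - 1)) + (iova % IOMMU_PAGE_SIZE);
    return true;
}
```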
| Platform | IOMMU Name | Features |
|---|---|---|
| Intel | VT-d | DMA remapping, interrupt remapping, Scalable Mode (5-level paging) |
| AMD | AMD-Vi (IOMMU) | DMA/interrupt remapping, Guest Translation, v2 paging |
| ARM | SMMU | Stage 1 & 2 translation, Substream IDs, SVA |
| Apple | Custom IOMMU | Integrated into SoC, hardware security coprocessor managed |
Many systems ship with IOMMU disabled for 'compatibility' or 'performance.' This is a significant security vulnerability. Enable VT-d/AMD-Vi in BIOS settings. On Linux, verify with 'dmesg | grep -i iommu' and check for active protection. Without IOMMU, Thunderbolt/USB4 devices have unrestricted memory access.
Modern high-performance devices have evolved sophisticated DMA architectures that go far beyond simple block transfers. Let's examine how NVMe and high-speed NICs utilize DMA.
NVMe DMA Architecture:
NVMe SSDs use a queue-based architecture where both commands and completions are transferred via DMA:
NVMe Command Flow:
1. Host writes command to Submission Queue (host memory)
2. Host writes doorbell register (MMIO) to notify device
3. Device DMA-reads command from Submission Queue
4. Device processes command (read/write flash)
5. Device DMA-transfers data (to/from host memory)
6. Device DMA-writes completion to Completion Queue
7. Device signals interrupt (MSI-X)
8. Host reads completion from Completion Queue
9. Host writes doorbell to acknowledge completion
Notice: The CPU touches the data path only at setup (writing the command) and completion (reading the status). All actual data movement is pure DMA.
```c
/*
 * NVMe DMA Flow (Simplified)
 *
 * Demonstrates the queue-based DMA architecture of modern storage.
 * struct nvme_sqe is the 64-byte submission entry shown earlier;
 * MMIO helpers, register offsets, and alloc_dma_buffer() are assumed
 * to be provided elsewhere in the driver.
 */

#include <stdint.h>
#include <stddef.h>

/* NVMe Completion Queue Entry (simplified) */
struct nvme_cqe {
    uint32_t result;      /* Command-specific result */
    uint32_t reserved;
    uint16_t sq_head;     /* Submission queue head pointer */
    uint16_t sq_id;       /* Submission queue ID */
    uint16_t command_id;  /* ID of the completed command */
    uint16_t status;      /* Status field; bit 0 is the phase tag */
} __attribute__((packed));

/* NVMe Queue structure (simplified) */
struct nvme_queue {
    /* Submission queue - commands written by host, read by device */
    struct nvme_sqe *sq;  /* Virtual address (for CPU access) */
    uint64_t sq_dma;      /* Physical/DMA address (for device) */
    uint16_t sq_head;     /* Device has read up to here */
    uint16_t sq_tail;     /* Host has written up to here */

    /* Completion queue - completions written by device, read by host */
    struct nvme_cqe *cq;
    uint64_t cq_dma;
    uint16_t cq_head;     /* Host has read up to here */

    volatile uint16_t *cq_doorbell;  /* MMIO doorbell register */
    volatile uint16_t *sq_doorbell;

    uint16_t depth;  /* Queue depth */
    uint8_t phase;   /* Completion phase bit */
};

/* Submit a read command */
int nvme_submit_read(struct nvme_queue *q, uint64_t lba, uint32_t num_blocks,
                     void *buffer, uint64_t buffer_dma) {
    /* Check queue has space */
    uint16_t next_tail = (q->sq_tail + 1) % q->depth;
    if (next_tail == q->sq_head) {
        return -1;  /* Queue full */
    }

    /* Build the command in the submission queue */
    struct nvme_sqe *cmd = &q->sq[q->sq_tail];
    cmd->opcode = 0x02;            /* Read command */
    cmd->flags = 0;
    cmd->command_id = q->sq_tail;  /* Use slot as ID */
    cmd->nsid = 1;                 /* Namespace 1 */
    cmd->prp1 = buffer_dma;        /* Physical address of data buffer */
    cmd->prp2 = 0;                 /* For multi-page, would be PRP list */
    cmd->cdw10 = lba & 0xFFFFFFFF; /* Starting LBA (low 32 bits) */
    cmd->cdw11 = lba >> 32;        /* Starting LBA (high 32 bits) */
    cmd->cdw12 = num_blocks - 1;   /* Number of blocks (0-based) */

    /* Ensure command is visible in memory before ringing doorbell */
    __asm__ volatile ("sfence" ::: "memory");

    /* Advance tail */
    q->sq_tail = next_tail;

    /* Ring the doorbell - MMIO write tells device to fetch commands */
    *q->sq_doorbell = q->sq_tail;

    return 0;
}

/* Poll for completion */
struct nvme_cqe *nvme_poll_completion(struct nvme_queue *q) {
    struct nvme_cqe *cqe = &q->cq[q->cq_head];

    /* Check phase bit to see if this entry is new.
     * The phase bit flips each time the queue wraps around. */
    if ((cqe->status & 0x01) != q->phase) {
        return NULL;  /* No new completion */
    }

    /* Completion is ready.
     * Note: the device DMA'd this completion into our CQ in host memory. */

    /* Advance head, flip phase if wrapped */
    q->cq_head++;
    if (q->cq_head >= q->depth) {
        q->cq_head = 0;
        q->phase = !q->phase;
    }

    /* Update completion queue doorbell */
    *q->cq_doorbell = q->cq_head;

    return cqe;
}

/*
 * Network RX Ring Example (e.g., Intel NIC)
 *
 * Similar concept: ring of descriptors, device fills them via DMA.
 */
struct rx_descriptor {
    uint64_t buffer_addr;  /* Physical address of RX buffer */
    uint64_t header_addr;  /* Header buffer (optional) */
};

struct rx_completion {
    uint16_t length;    /* Packet length */
    uint16_t vlan;      /* VLAN tag */
    uint32_t rss_hash;  /* RSS hash for steering */
    uint32_t flags;     /* Checksum status, etc. */
    uint16_t status;    /* DD (descriptor done) bit */
};

void setup_rx_ring(void *device_base, struct rx_descriptor *ring,
                   uint64_t ring_dma, int ring_size) {
    /* Tell device where the RX descriptor ring is located */
    mmio_write64(device_base + RX_DESC_BASE, ring_dma);
    mmio_write32(device_base + RX_DESC_LEN,
                 ring_size * sizeof(struct rx_descriptor));

    /* Fill ring with buffer addresses */
    for (int i = 0; i < ring_size; i++) {
        void *buffer = alloc_dma_buffer(2048);
        ring[i].buffer_addr = virt_to_phys(buffer);
    }

    /* Set head = 0, tail = ring_size - 1 (all descriptors available) */
    mmio_write32(device_base + RX_HEAD, 0);
    mmio_write32(device_base + RX_TAIL, ring_size - 1);

    /* Enable RX (device will DMA packets into our buffers) */
    mmio_write32(device_base + RX_CTRL, RX_ENABLE);
}
```

High-performance applications use zero-copy DMA where network packets or storage blocks are transferred directly to/from user-space buffers without kernel copies. Technologies like io_uring, DPDK, and RDMA leverage this for extreme performance—millions of IOPS or 100+ Gbps networking from a single machine.
DMA bugs are among the hardest to debug because they involve hardware timing, asynchronous operations, and often manifest as data corruption or crashes far from the actual bug.
Common DMA Bugs:
- Missing or misordered cache synchronization, producing stale or corrupted data.
- Passing a virtual address to the device instead of a physical/DMA address.
- Unmapping or freeing a buffer while the device is still using it (use-after-free).
- Using the wrong transfer direction flag, so the wrong cache operation is performed.
- Device overruns, where the hardware writes more than the buffer length.
Debugging Techniques:
IOMMU Fault Logging: Enable IOMMU faults to catch invalid DMA addresses. Linux: dmesg | grep -i iommu
DMA Debugging API (Linux): Building with CONFIG_DMA_API_DEBUG checks for common mistakes, such as mappings that are never unmapped, unmapping with a mismatched size or direction, and syncing a buffer that was never mapped.
Memory Patterns: Fill buffers with known patterns before DMA; verify after. Corruption is immediately visible (a sketch of this technique follows the list).
Hardware Debug: Some devices have debug registers showing DMA engine state, last addresses accessed, error codes.
Trace Points: Linux has DMA API trace points for tracking every map/unmap/sync operation.
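The pattern technique from the list above can be as simple as the following sketch; POISON_BYTE and the helper names are illustrative, not taken from any real debugging framework.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define POISON_BYTE 0xA5   /* arbitrary but recognizable pattern */

/* Fill the DMA buffer with a known pattern before handing it to the device */
void dma_debug_poison(void *buf, size_t len)
{
    memset(buf, POISON_BYTE, len);
}

/*
 * After the device reports completion of an 'expected'-byte transfer,
 * check that the transfer actually happened and that nothing past the
 * end of the transfer was touched.
 */
int dma_debug_check(const void *buf, size_t expected, size_t buf_len)
{
    const uint8_t *p = buf;
    size_t still_poison = 0;

    /* Count untouched bytes inside the expected region */
    for (size_t i = 0; i < expected; i++)
        if (p[i] == POISON_BYTE)
            still_poison++;

    if (still_poison == expected)
        fprintf(stderr, "DMA debug: buffer untouched - transfer may not have run\n");

    /* Anything past the expected length must still be poison */
    for (size_t i = expected; i < buf_len; i++) {
        if (p[i] != POISON_BYTE) {
            fprintf(stderr, "DMA debug: overrun at offset %zu\n", i);
            return -1;
        }
    }
    return 0;
}
```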
DMA operations are inherently asynchronous. A common deadly pattern: unmap a buffer immediately after starting DMA, assuming it will complete instantly. The device continues accessing the (now potentially reused) memory. Always ensure the device has finished with a buffer before modifying or freeing it.
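A minimal sketch of the safe ordering follows, reusing the dma_map_single/dma_unmap_single helpers from the cache-management listing; device_start_dma and device_dma_done are assumed, device-specific stand-ins.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Assumed driver helpers - names are illustrative, not from a real API */
extern uint64_t dma_map_single(void *buf, size_t len, int direction);
extern void     dma_unmap_single(uint64_t dma_addr, size_t len, int direction);
extern void     device_start_dma(uint64_t dma_addr, size_t len);
extern bool     device_dma_done(void);   /* polls a status register or ISR-set flag */

#define DMA_FROM_DEVICE 2

void safe_dma_read(void *buf, size_t len)
{
    uint64_t dma_addr = dma_map_single(buf, len, DMA_FROM_DEVICE);

    device_start_dma(dma_addr, len);

    /* WRONG: calling dma_unmap_single() here would release the buffer
     * while the device may still be writing into it. */

    while (!device_dma_done())
        ;   /* wait for completion (real code would sleep on the interrupt) */

    /* Only now has the device finished with the memory */
    dma_unmap_single(dma_addr, len, DMA_FROM_DEVICE);
}
```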
Direct Memory Access represents the pinnacle of I/O efficiency, enabling modern systems to achieve performance levels impossible with CPU-mediated transfers. DMA completes our journey through I/O architecture, from the simplest port operations to the most sophisticated memory-to-device data paths.
Module Summary: I/O Architecture Complete
Across five pages, we've built a comprehensive understanding of how CPUs communicate with peripheral devices:
I/O Ports: The original addressing mechanism—separate address space, IN/OUT instructions, still used for legacy devices
Memory-Mapped I/O: The modern approach—devices appear as memory addresses, accessible with standard load/store instructions
Programmed I/O: The simplest transfer method—the CPU explicitly moves every byte, typically in polling loops
Interrupt-Driven I/O: Devices notify CPU of events—CPU is freed from polling but still moves data
Direct Memory Access: The ultimate efficiency—devices transfer data autonomously, CPU involvement minimal
This progression represents the evolution of I/O from the earliest computers to today's high-performance systems. Understanding all five techniques is essential for OS development, device driver programming, performance optimization, and security engineering.
You now possess comprehensive knowledge of I/O architecture—from fundamental port access through advanced DMA mechanisms. This foundation enables you to understand how operating systems interact with hardware, write efficient device drivers, diagnose I/O performance issues, and appreciate the sophisticated engineering that makes modern computing possible.