When NVMe was designed, a fundamental architectural decision was made: rather than creating a new physical interface or layering over existing storage interconnects, NVMe would speak directly to PCIe—the high-speed serial bus that connects everything from graphics cards to network adapters in modern computers.
This wasn't merely a convenience choice. PCIe offered exactly what a next-generation storage protocol needed: bandwidth that scales with lane count, low latency, direct attachment to the CPU's memory complex, message-signaled interrupts, and first-class DMA.
Understanding PCIe is essential for understanding NVMe. The protocol's register layout, interrupt mechanisms, and data transfer primitives are all shaped by—and optimized for—PCIe's capabilities.
By the end of this page, you will understand how NVMe devices present themselves to the host system via PCIe, including configuration space, Base Address Registers (BARs), memory-mapped I/O, MSI/MSI-X interrupts, and DMA. You'll see how these PCIe primitives enable NVMe's high-performance command processing.
PCIe Architecture Overview
PCI Express (PCIe) is a high-speed serial interconnect standard that replaced legacy PCI and AGP buses. Unlike parallel PCI's shared bus, PCIe uses point-to-point links in a switched fabric topology:
┌─────────────┐
│ CPU │
│ (Root │
│ Complex) │
└─────┬───────┘
│
┌─────┴───────┐
│ PCIe Root │
│ Port │
└─────┬───────┘
┌─────────────┼─────────────┐
│ │ │
┌─────┴─────┐ ┌─────┴─────┐ ┌─────┴─────┐
│ PCIe │ │ PCIe │ │ PCIe │
│ Switch │ │ Endpoint │ │ Endpoint │
│ │ │ (NVMe) │ │ (GPU) │
└─────┬─────┘ └───────────┘ └───────────┘
│
┌───────┴───────┐
│ │
┌───┴───┐ ┌────┴───┐
│ NVMe │ │ Other │
│ SSD │ │ Device │
└───────┘ └────────┘
Key PCIe Concepts
1. Lanes: A lane consists of two differential signal pairs (one transmit, one receive), forming a full-duplex serial link. PCIe devices use x1, x2, x4, x8, or x16 lane configurations. Consumer NVMe SSDs typically use x4 lanes; enterprise/datacenter NVMe may use x8.
2. Generations: Each PCIe generation doubles the per-lane bandwidth:
| Generation | Per Lane | x4 Link | x8 Link |
|---|---|---|---|
| PCIe 3.0 | ~1 GB/s | ~4 GB/s | ~8 GB/s |
| PCIe 4.0 | ~2 GB/s | ~8 GB/s | ~16 GB/s |
| PCIe 5.0 | ~4 GB/s | ~16 GB/s | ~32 GB/s |
| PCIe 6.0 | ~8 GB/s | ~32 GB/s | ~64 GB/s |
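The doubling pattern in the table can be expressed as a simple rule of thumb: roughly 1 GB/s per lane at Gen 3, halving for earlier generations and doubling for later ones. A small sketch (illustrative helper names, approximate figures only):

```c
#include <assert.h>
#include <stdint.h>

/* Approximate usable per-lane bandwidth in MB/s for a PCIe generation.
 * Gen 3 delivers roughly 1000 MB/s per lane after encoding overhead;
 * each later generation doubles it, each earlier one halves it. */
static uint64_t pcie_lane_mbps(int gen)
{
    if (gen >= 3)
        return 1000ULL << (gen - 3);
    return 1000ULL >> (3 - gen);
}

/* Aggregate link bandwidth scales linearly with lane count */
static uint64_t pcie_link_mbps(int gen, int lanes)
{
    return pcie_lane_mbps(gen) * lanes;
}
```

These are ballpark throughput figures, not exact line rates; real links lose a few percent to encoding and protocol overhead.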
3. Transaction Layer Packets (TLPs): PCIe communicates via TLPs—packets that carry memory read/write requests, completions, messages, and configuration transactions. NVMe's efficiency derives partly from batching data into minimal TLPs.
4. Root Complex (RC): The CPU-integrated PCIe controller that originates/terminates transactions. The RC contains root ports, each connecting to a downstream device or switch.
| Transaction Type | Direction | NVMe Usage |
|---|---|---|
| Memory Read | RC → Device | Host reads controller registers (e.g., CAP, CSTS) |
| Memory Write | RC → Device | Doorbell writes, register configuration |
| Memory Read | Device → RC | DMA: Controller reads commands from submission queue |
| Memory Write | Device → RC | DMA: Controller writes data and completion entries |
| MSI/MSI-X | Device → RC | Interrupt signaling via memory-write format |
| Configuration | RC → Device | Initial device enumeration and setup |
Unlike legacy PCI's bus cycles, PCIe transactions are packetized with headers, payloads, and CRC. This enables credit-based flow control, out-of-order completion, and quality-of-service—features that NVMe leverages for deterministic performance.
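The packetization overhead is easy to quantify. Assuming a ballpark 24 bytes of header, framing, and CRC per TLP (the exact figure varies by configuration), a transfer is split into Max_Payload_Size chunks, and payload efficiency follows directly:

```c
#include <assert.h>
#include <stddef.h>

/* Rough TLP accounting: how many memory-write TLPs a transfer needs,
 * and what fraction of the wire traffic is payload. The 24-byte
 * per-TLP overhead is an approximation for illustration. */
#define TLP_OVERHEAD_BYTES 24

static size_t tlp_count(size_t transfer, size_t max_payload)
{
    return (transfer + max_payload - 1) / max_payload;  /* ceiling divide */
}

/* Payload efficiency in whole percent (integer math) */
static size_t tlp_efficiency_pct(size_t transfer, size_t max_payload)
{
    size_t n = tlp_count(transfer, max_payload);
    return transfer * 100 / (transfer + n * TLP_OVERHEAD_BYTES);
}
```

For a 4KB read with a 256-byte Max_Payload_Size, the data arrives in 16 TLPs; doubling the payload size halves the packet count and trims the overhead, which is one reason larger payload settings help NVMe throughput.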
An NVMe device appears to the system as a standard PCIe endpoint function with specific requirements defined in the NVMe specification. Understanding this structure is essential for driver development and system debugging.
PCIe Configuration Space
Every PCIe device exposes a configuration space—a standardized data structure accessible through special configuration transactions. The first 256 bytes form the legacy PCI Configuration Header; PCIe extends the space to 4KB, with offsets above 0xFF reached through the memory-mapped Enhanced Configuration Access Mechanism (ECAM).
┌──────────────────────────────────────────────────────────────────┐
│ PCIe Configuration Space (4KB) │
├──────────────────────────────────────────────────────────────────┤
│ Offset 0x00-0x3F: PCI Configuration Header (Type 0) │
│ ├─ 0x00-0x01: Vendor ID (e.g., Samsung: 0x144D) │
│ ├─ 0x02-0x03: Device ID │
│ ├─ 0x04-0x05: Command Register (memory enable, bus master) │
│ ├─ 0x06-0x07: Status Register │
│ ├─ 0x08: Revision ID │
│ ├─ 0x09-0x0B: Class Code = 01:08:02 (NVMe storage) │
│ ├─ 0x10-0x27: Base Address Registers (BAR0-BAR5) │
│ ├─ 0x34: Capabilities Pointer │
│ └─ 0x3C-0x3F: Interrupt Line/Pin, Min Grant, Max Latency │
├──────────────────────────────────────────────────────────────────┤
│ Offset 0x40+: PCI Capability Structures (linked list) │
│ ├─ Power Management (mandatory) │
│ ├─ MSI or MSI-X (mandatory for NVMe) │
│ ├─ PCIe Extended Capability (mandatory) │
│ └─ Optional: AER, ARI, LTR, L1 Substate, etc. │
├──────────────────────────────────────────────────────────────────┤
│ Offset 0x100+: PCIe Extended Configuration Space (3.75KB) │
│ ├─ Advanced Error Reporting (AER) │
│ ├─ SR-IOV (optional, for virtualization) │
│ └─ Other extended capabilities │
└──────────────────────────────────────────────────────────────────┘
NVMe Class Code
NVMe devices are identified by their PCI Class Code: base class 0x01 (mass storage controller), subclass 0x08 (non-volatile memory controller), and programming interface 0x02 (NVM Express).
The complete class code 0x010802 uniquely identifies an NVMe device. Operating systems use this to load the appropriate NVMe driver.
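The class-code check can be sketched in plain C (illustrative helper names), splitting the 24-bit value into its three byte fields:

```c
#include <assert.h>
#include <stdint.h>

/* Split a 24-bit PCI class code into its three byte fields. */
struct pci_class {
    uint8_t base;     /* 0x01 = mass storage controller */
    uint8_t sub;      /* 0x08 = non-volatile memory controller */
    uint8_t prog_if;  /* 0x02 = NVM Express */
};

static struct pci_class pci_class_decode(uint32_t code)
{
    struct pci_class c = {
        .base    = (code >> 16) & 0xFF,
        .sub     = (code >> 8) & 0xFF,
        .prog_if = code & 0xFF,
    };
    return c;
}

static int is_nvme_class(uint32_t code)
{
    struct pci_class c = pci_class_decode(code);
    return c.base == 0x01 && c.sub == 0x08 && c.prog_if == 0x02;
}
```

For contrast, an AHCI SATA controller reports 0x010601, which the same check rejects.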
Base Address Registers (BARs)
BARs define memory or I/O spaces that the device exposes to the host. NVMe requires:
BAR0 (+ BAR1 for 64-bit): Memory-mapped register space (minimum 16KB)
BAR2-5: Optional, implementation-specific (often unused)
NVMe mandates BAR0 be a 64-bit memory BAR, meaning BAR0 and BAR1 together form a single 64-bit address. This allows the register space to reside anywhere in the 64-bit physical address space.
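Reassembling the 64-bit base address from the BAR pair is a matter of masking the flag bits in the low register (bit 0 = I/O space, bits 2:1 = type, where 10b means 64-bit, bit 3 = prefetchable) and shifting in the high register. A sketch with illustrative helper names:

```c
#include <assert.h>
#include <stdint.h>

#define BAR_TYPE_64BIT  0x4u      /* bits 2:1 == 10b */
#define BAR_ADDR_MASK   (~0xFULL) /* clear flag bits */

/* True if the low BAR declares a 64-bit memory BAR */
static int bar_is_64bit(uint32_t bar_lo)
{
    return (bar_lo & 0x6u) == BAR_TYPE_64BIT;
}

/* Combine BAR0 (low) and BAR1 (high) into the physical base address */
static uint64_t bar_base_addr(uint32_t bar_lo, uint32_t bar_hi)
{
    return (((uint64_t)bar_hi << 32) | bar_lo) & BAR_ADDR_MASK;
}
```

In a real driver the kernel does this for you (pci_resource_start() below); the arithmetic is shown here only to make the register layout concrete.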
// Reading NVMe device configuration space
int nvme_pcie_init(struct pci_dev *pdev)
{
    uint16_t vendor_id, device_id;
    uint8_t class_code[3];
    resource_size_t bar0_start, bar0_len;
    void __iomem *regs;

    // Read identification
    pci_read_config_word(pdev, PCI_VENDOR_ID, &vendor_id);
    pci_read_config_word(pdev, PCI_DEVICE_ID, &device_id);

    // Verify NVMe class code (base 0x01, subclass 0x08, prog-if 0x02)
    pci_read_config_byte(pdev, PCI_CLASS_PROG, &class_code[0]);
    pci_read_config_byte(pdev, PCI_CLASS_DEVICE, &class_code[1]);
    pci_read_config_byte(pdev, PCI_CLASS_DEVICE + 1, &class_code[2]);
    if (class_code[0] != 0x02 || class_code[1] != 0x08 ||
        class_code[2] != 0x01) {
        pr_err("Not an NVMe device: %02x:%02x:%02x",
               class_code[2], class_code[1], class_code[0]);
        return -ENODEV;
    }

    // Enable device (memory space + bus mastering)
    if (pci_enable_device_mem(pdev) < 0)
        return -EIO;
    pci_set_master(pdev);  // Enable bus mastering for DMA

    // Reserve and map BAR0 (NVMe registers)
    bar0_start = pci_resource_start(pdev, 0);
    bar0_len = pci_resource_len(pdev, 0);
    if (!request_mem_region(bar0_start, bar0_len, "nvme")) {
        pci_disable_device(pdev);
        return -EBUSY;
    }

    regs = ioremap(bar0_start, bar0_len);
    if (!regs) {
        release_mem_region(bar0_start, bar0_len);
        pci_disable_device(pdev);
        return -ENOMEM;
    }

    pr_info("NVMe device %04x:%04x at %llx (len=%llu)",
            vendor_id, device_id,
            (unsigned long long)bar0_start,
            (unsigned long long)bar0_len);
    return 0;
}

Calling pci_set_master() enables bus mastering, allowing the NVMe controller to initiate DMA transfers. Without this, the controller cannot read commands from submission queues or write completions/data to host memory. Forgetting this is a common driver bug.
NVMe defines a precise memory-mapped register layout within BAR0. This register space enables efficient command submission and controller configuration without legacy port I/O.
Controller Registers (Offset 0x0000 - 0x0FFF)
The first 4KB contains controller capability, configuration, and status registers:
| Offset | Size | Register | Description |
|---|---|---|---|
| 0x0000 | 8B | CAP (Capabilities) | Controller capabilities: max queue size, doorbell stride, timeout |
| 0x0008 | 4B | VS (Version) | NVMe specification version (major.minor.tertiary) |
| 0x000C | 4B | INTMS (Int. Mask Set) | Interrupt mask set (legacy INTx only) |
| 0x0010 | 4B | INTMC (Int. Mask Clear) | Interrupt mask clear |
| 0x0014 | 4B | CC (Controller Config) | Enable, command set, page size, queue entry sizes |
| 0x001C | 4B | CSTS (Controller Status) | Ready, fatal, shutdown status |
| 0x0020 | 4B | NSSR (Subsystem Reset) | NVM subsystem reset |
| 0x0024 | 4B | AQA (Admin Queue Attr.) | Admin queue sizes |
| 0x0028 | 8B | ASQ (Admin SQ Base) | Physical address of admin submission queue |
| 0x0030 | 8B | ACQ (Admin CQ Base) | Physical address of admin completion queue |
| 0x0038 | 4B | CMBLOC | Controller Memory Buffer location (optional) |
| 0x003C | 4B | CMBSZ | Controller Memory Buffer size |
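The register table above can be mirrored as a C struct, with offsetof() confirming that each field lands at its spec'd offset (a layout sketch only; a real driver accesses these registers through the ioremap'd BAR0 mapping, not a struct over raw memory):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Controller register block, first 64 bytes of BAR0 */
struct nvme_bar {
    uint64_t cap;     /* 0x00: Controller Capabilities */
    uint32_t vs;      /* 0x08: Version */
    uint32_t intms;   /* 0x0C: Interrupt Mask Set */
    uint32_t intmc;   /* 0x10: Interrupt Mask Clear */
    uint32_t cc;      /* 0x14: Controller Configuration */
    uint32_t rsvd;    /* 0x18: reserved */
    uint32_t csts;    /* 0x1C: Controller Status */
    uint32_t nssr;    /* 0x20: NVM Subsystem Reset */
    uint32_t aqa;     /* 0x24: Admin Queue Attributes */
    uint64_t asq;     /* 0x28: Admin Submission Queue Base */
    uint64_t acq;     /* 0x30: Admin Completion Queue Base */
    uint32_t cmbloc;  /* 0x38: Controller Memory Buffer Location */
    uint32_t cmbsz;   /* 0x3C: Controller Memory Buffer Size */
};
```

Because every field is naturally aligned, the compiler inserts no padding and the offsets match the table exactly.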
Doorbell Registers (Offset 0x1000+)
Following the controller registers are doorbell registers—one pair per queue (submission doorbell and completion doorbell):
Offset 0x1000 + (2 × y × doorbell_stride)
= Submission Queue y Tail Doorbell (SQyTDBL)
Offset 0x1000 + ((2 × y + 1) × doorbell_stride)
= Completion Queue y Head Doorbell (CQyHDBL)
The doorbell stride (DSTRD), read from bits 35:32 of the CAP register, specifies the spacing: stride in bytes = 4 << DSTRD, so DSTRD = 0 gives the minimum 4-byte spacing.
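A user-space sketch of the stride and offset arithmetic (illustrative helper names; real drivers add these offsets to the ioremap'd BAR0 base):

```c
#include <assert.h>
#include <stdint.h>

/* CAP.DSTRD (bits 35:32) encodes the doorbell stride as a power of
 * two: stride in bytes = 4 << DSTRD. */
static uint32_t nvme_db_stride(uint64_t cap)
{
    return 4u << ((cap >> 32) & 0xF);
}

/* Submission queue y tail doorbell offset within BAR0 */
static uint32_t nvme_sq_db_offset(uint16_t qid, uint32_t stride)
{
    return 0x1000 + (2u * qid) * stride;
}

/* Completion queue y head doorbell offset within BAR0 */
static uint32_t nvme_cq_db_offset(uint16_t qid, uint32_t stride)
{
    return 0x1000 + (2u * qid + 1) * stride;
}
```

With the minimum stride of 4 bytes, queue 0's doorbells sit at 0x1000 and 0x1004, queue 1's at 0x1008 and 0x100C, and so on.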
Why Doorbell Stride Matters
Different implementations may require different spacing for optimal performance: a larger stride can place each doorbell in its own cache line, avoiding false sharing when many CPUs ring different doorbells concurrently.
// Calculating doorbell register addresses
// Note: stride here is in bytes (4 << DSTRD)
#define NVME_CAP_DSTRD(cap)   (((cap) >> 32) & 0xF)
#define NVME_DOORBELL_BASE    0x1000

static inline void __iomem *nvme_sq_doorbell(void __iomem *regs,
                                             uint16_t qid, uint32_t stride)
{
    // Submission queue doorbell = base + 2*qid*stride
    return regs + NVME_DOORBELL_BASE + (2 * qid * stride);
}

static inline void __iomem *nvme_cq_doorbell(void __iomem *regs,
                                             uint16_t qid, uint32_t stride)
{
    // Completion queue doorbell = base + (2*qid + 1)*stride
    return regs + NVME_DOORBELL_BASE + ((2 * qid + 1) * stride);
}

// Ring the submission queue doorbell
static inline void nvme_submit_cmd(struct nvme_queue *q)
{
    // Increment tail (producer pointer), wrapping at queue depth
    if (++q->sq_tail == q->queue_depth)
        q->sq_tail = 0;
    // Single 32-bit write notifies controller
    // Use writel for proper memory ordering and uncached write
    writel(q->sq_tail, q->sq_doorbell);
}

// Ring the completion queue doorbell
static inline void nvme_update_cq_head(struct nvme_queue *q)
{
    // Single write releases processed completions
    writel(q->cq_head, q->cq_doorbell);
}

The doorbell mechanism is NVMe's secret to low latency. Submitting a command requires only updating the queue entry in host memory and writing a single 32-bit value to the doorbell. No handshakes, no registers to poll, no locks. The controller asynchronously processes the queue.
NVMe mandates support for MSI-X (Message Signaled Interrupts - Extended), the modern interrupt mechanism that replaces legacy edge/level-triggered interrupt lines. MSI-X is essential for NVMe's scalability and performance.
Evolution of Interrupt Mechanisms
Legacy INTx: Shared interrupt lines (INTA#, INTB#, etc.)
MSI (Message Signaled Interrupts): Interrupts as memory writes
MSI-X (Message Signaled Interrupts - Extended): Scalable MSI
| Feature | Legacy INTx | MSI | MSI-X |
|---|---|---|---|
| Max Vectors | 4 (shared) | 32 | 2,048 |
| Sharing | Required | None | None |
| Per-Vector Mask | No | No | Yes |
| CPU Targeting | IOAPIC only | Limited | Full flexibility |
| NVMe Use | Fallback only | Rarely used | Primary method |
MSI-X Configuration for NVMe
NVMe devices expose an MSI-X table in BAR0 (or a separate BAR). Each table entry contains a 64-bit message address, 32-bit message data, and a vector control word with a per-vector mask bit.
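Each 16-byte table entry can be modeled as a struct (layout per the PCI specification; the device raises an interrupt by writing msg_data to msg_addr):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One MSI-X table entry: 16 bytes */
struct msix_entry {
    uint32_t msg_addr_lo;  /* 0x0: message address, low 32 bits */
    uint32_t msg_addr_hi;  /* 0x4: message address, high 32 bits */
    uint32_t msg_data;     /* 0x8: payload written to raise the IRQ */
    uint32_t vector_ctrl;  /* 0xC: bit 0 = per-vector mask */
};

/* Byte offset of a vector's entry within the MSI-X table */
static size_t msix_entry_offset(unsigned vector)
{
    return vector * sizeof(struct msix_entry);
}
```

The fixed 16-byte entry size is what lets a device scale to the 2,048-vector maximum without any negotiation: the table is just an array.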
The typical NVMe MSI-X configuration: one vector for the admin queue plus one vector per I/O queue pair, with each I/O vector affinitized to the CPU that owns the queue.
This enables true per-CPU interrupt handling:
// Assigning MSI-X vectors to I/O queues
int nvme_setup_io_queues(struct nvme_ctrl *ctrl) {
int nr_io_queues = min(num_online_cpus(), ctrl->max_io_queues);
int nr_vectors = nr_io_queues + 1; // +1 for admin queue
// Request MSI-X vectors
int allocated = pci_alloc_irq_vectors(
ctrl->pdev, 1, nr_vectors, PCI_IRQ_MSIX | PCI_IRQ_AFFINITY);
    if (allocated < 0)
        return allocated; // MSI-X unavailable
    if (allocated < nr_vectors) {
        // Fewer vectors than requested: scale back the I/O queue count
        nr_io_queues = allocated - 1;
    }
for (int i = 0; i < nr_io_queues; i++) {
// Create queue pair and assign vector
// IRQ affinity ensures interrupt targets the CPU owning the queue
nvme_create_queue_pair(ctrl, i + 1,
pci_irq_vector(ctrl->pdev, i + 1));
}
return nr_io_queues;
}
Interrupt Coalescing
High-IOPS workloads can generate millions of interrupts per second, overwhelming CPU interrupt handling capacity. NVMe provides interrupt coalescing features:
// Configure interrupt coalescing
struct nvme_feat_irq_coalesce {
uint8_t thr; // Aggregation threshold (0-255)
uint8_t time; // Aggregation time (100μs units, 0-255)
};
// Set Features command with Feature ID = 0x08 (Interrupt Coalescing)
cdw11 = (time << 8) | thr;
nvme_set_features(ctrl, NVME_FEAT_IRQ_COALESCE, cdw11);
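The CDW11 packing above (bits 7:0 = threshold, bits 15:8 = time, per Feature ID 0x08) can be checked with a small host-side helper (illustrative names):

```c
#include <assert.h>
#include <stdint.h>

/* Pack CDW11 for Set Features 0x08 (Interrupt Coalescing):
 * bits 7:0 = aggregation threshold (completion entries),
 * bits 15:8 = aggregation time in 100 microsecond units. */
static uint32_t irq_coalesce_cdw11(uint8_t thr, uint8_t time)
{
    return ((uint32_t)time << 8) | thr;
}

/* Unpack the two fields back out of CDW11 */
static uint8_t irq_coalesce_thr(uint32_t cdw11)  { return cdw11 & 0xFF; }
static uint8_t irq_coalesce_time(uint32_t cdw11) { return (cdw11 >> 8) & 0xFF; }
```

For example, a threshold of 8 entries with a 200μs window (time = 2) packs to 0x0208.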
The optimal coalescing settings depend on workload: latency-sensitive workloads want a low threshold and time (or coalescing disabled entirely), while throughput-oriented workloads tolerate higher values in exchange for far fewer interrupts.
Without interrupt coalescing, a high-performance NVMe SSD can generate >1M interrupts/second at peak IOPS. This can monopolize CPU cores for interrupt processing alone. Always configure appropriate coalescing for production workloads. Many drivers dynamically adjust settings based on observed IOPS.
Direct Memory Access (DMA) is the mechanism by which NVMe controllers read commands and write data to/from host memory without CPU intervention. Understanding DMA is critical for driver development and performance optimization.
PCIe DMA Fundamentals
In PCIe, any device with bus mastering capability can initiate memory read/write transactions to system memory. The NVMe controller uses DMA to fetch submission queue entries, read and write data buffers, fetch PRP lists and SGL segments, and write completion queue entries.
DMA Address Types
NVMe drivers must navigate multiple address spaces:
| Address Type | Description | Used For |
|---|---|---|
| Virtual Address | CPU's view of memory (process address space) | Driver allocates buffers |
| Physical Address | Hardware memory address | Programmed into NVMe commands |
| Bus Address | Address as seen by the device (may differ from physical) | What controller uses for DMA |
| IOVA (I/O Virtual Address) | Virtualized addresses via IOMMU | Secure DMA in virtualized environments |
IOMMU and DMA Remapping
Modern systems include an IOMMU (Input/Output Memory Management Unit)—Intel VT-d, AMD-Vi, or ARM SMMU. The IOMMU translates device-initiated DMA addresses, providing isolation (a device can reach only pages explicitly mapped for it), support for devices with limited addressing reaching high memory, and contiguous I/O virtual ranges over scattered physical pages.
NVMe DMA Descriptors: PRPs and SGLs
NVMe commands specify data locations via Physical Region Pages (PRPs) or Scatter-Gather Lists (SGLs):
PRPs (Physical Region Pages): a fixed-granularity descriptor scheme. Each PRP entry points to one memory page; only PRP1 may carry a byte offset, and every subsequent entry must be page-aligned.
Small Transfer (≤2 pages):
┌─────────────────┐
│ Command │
│ PRP1 ─────────────────► [ Data Page 1 ]
│ PRP2 ─────────────────► [ Data Page 2 ]
└─────────────────┘
Large Transfer (>2 pages):
┌─────────────────┐
│ Command │
│ PRP1 ─────────────────► [ Data Page 1 ]
│ PRP2 ──┐ [ Data Page 2 ]
└─────────│───────┘ [ Data Page 3 ]
│ ↑
└──► PRP List ────────┘
[PRP Entry 1]──► Page 2
[PRP Entry 2]──► Page 3
[PRP Entry N]──► Page N+1
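The accounting behind the two cases above reduces to counting pages: PRP1 covers the first (possibly unaligned) page, and every remaining page needs one more entry. A user-space sketch (illustrative helper names, 4KB pages assumed):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SZ 4096u

/* Total PRP entries a transfer needs: PRP1 covers the bytes from the
 * starting offset to the end of the first page; each remaining page
 * needs one entry (PRP2 directly, or a PRP list slot). */
static size_t prp_entries_needed(size_t offset_in_page, size_t len)
{
    size_t first = PAGE_SZ - offset_in_page;  /* bytes PRP1 can cover */
    if (len <= first)
        return 1;                             /* PRP1 only */
    return 1 + (len - first + PAGE_SZ - 1) / PAGE_SZ;
}

/* PRP2 can point directly at a page only for two-entry transfers;
 * beyond that it must point at a PRP list. */
static int prp_list_needed(size_t offset_in_page, size_t len)
{
    return prp_entries_needed(offset_in_page, len) > 2;
}
```

Note how an unaligned start inflates the count: a 4KB transfer starting 512 bytes into a page spans two pages and therefore needs two entries.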
// Setting up PRPs for an NVMe command
int nvme_setup_prps(struct nvme_command *cmd, void *buffer, size_t len)
{
    dma_addr_t dma_addr, prp_dma;
    size_t offset, first_page_len;

    // Map buffer for DMA
    dma_addr = dma_map_single(dev, buffer, len, DMA_BIDIRECTIONAL);
    if (dma_mapping_error(dev, dma_addr))
        return -ENOMEM;

    offset = dma_addr & (PAGE_SIZE - 1);
    first_page_len = PAGE_SIZE - offset;

    // PRP1 always points to the first page (byte offset allowed)
    cmd->dptr.prp.prp1 = cpu_to_le64(dma_addr);

    if (len <= first_page_len) {
        // Entire transfer fits in first page
        cmd->dptr.prp.prp2 = 0;
    } else if (len <= first_page_len + PAGE_SIZE) {
        // Two pages needed: PRP2 points directly to second page
        cmd->dptr.prp.prp2 = cpu_to_le64(dma_addr + first_page_len);
    } else {
        // More than two pages: PRP2 points to a PRP list
        size_t nprps = DIV_ROUND_UP(len - first_page_len, PAGE_SIZE);
        uint64_t *prp_list = dma_pool_alloc(prp_pool, GFP_KERNEL, &prp_dma);
        if (!prp_list)
            return -ENOMEM;
        for (int i = 0; i < nprps; i++) {
            // Each list entry is a page-aligned physical address
            prp_list[i] = cpu_to_le64(dma_addr + first_page_len +
                                      i * PAGE_SIZE);
        }
        cmd->dptr.prp.prp2 = cpu_to_le64(prp_dma);
    }
    return 0;
}

SGLs (Scatter-Gather Lists)
SGLs offer more flexibility than PRPs: each SGL descriptor carries an address and an arbitrary byte length, so buffers need not be split at page boundaries.
SGLs are primarily used in NVMe over Fabrics (where they are mandatory) and in some enterprise controllers handling large or irregular transfers.
PRPs remain more common for local NVMe SSDs due to their simplicity and lower overhead.
If a user buffer isn't DMA-capable (e.g., not physically contiguous, above 4GB on 32-bit DMA devices), the kernel must allocate a 'bounce buffer' and copy data. This wastes bandwidth and CPU cycles. Modern NVMe with IOMMU eliminates most bounce buffers by allowing any physical page to be mapped.
NVMe devices inherit PCIe power management capabilities, enabling significant power savings in mobile and data center environments. Understanding these mechanisms is important for driver development and system optimization.
PCIe Link Power States
PCIe defines Active State Power Management (ASPM) with progressive power states:
| State | Description | Exit Latency | Power Savings |
|---|---|---|---|
| L0 | Active state, link fully operational | N/A (active) | None (full power) |
| L0s | Low-power standby, fast exit | <1 μs | Low |
| L1 | Link electrical idle | 2-4 μs | Moderate |
| L1.1 | Substate: PLL off | ~32 μs | Good |
| L1.2 | Substate: common mode voltage off | ~32 μs | Maximum link savings |
| L2 | Aux power only (device may lose state) | Varies | Near-zero link power |
NVMe Power States
Beyond PCIe link states, NVMe defines its own device power states, PS0 through PS(NPSS), trading performance for power at the controller and media level.
The Identify Controller data structure reports: NPSS (the number of supported power states) plus a power state descriptor per state, giving maximum power, entry latency, exit latency, and relative performance.
Autonomous Power State Transition (APST)
NVMe supports host-configured autonomous power state transitions. The controller automatically enters lower power states after idle periods:
// Configuring APST
struct nvme_apst_entry {
uint32_t idle_time_ms; // Idle time before transition
uint32_t idle_transition_ps; // Target power state
// ...exit latency tolerance fields
};
// Set Features: Autonomous Power State Transition
// Feature ID = 0x0C
// Host provides table of (idle_time → power_state) entries
// Controller performs transitions automatically
APST provides efficiency without constant host polling of device activity.
// Power management coordination between NVMe and PCIe
int nvme_configure_power(struct nvme_ctrl *ctrl)
{
    struct pci_dev *pdev = ctrl->pdev;
    struct nvme_id_ctrl *id;
    uint16_t lnkctl;

    // Query NVMe power states
    nvme_identify_controller(ctrl, &id);
    int num_ps = id->npss + 1;

    for (int ps = 0; ps < num_ps; ps++) {
        uint32_t entry_lat = le32_to_cpu(id->psd[ps].entry_lat);
        uint32_t exit_lat = le32_to_cpu(id->psd[ps].exit_lat);
        uint16_t idle_power = le16_to_cpu(id->psd[ps].idle_power);
        pr_info("PS%d: entry=%uμs exit=%uμs idle_pwr=%umW",
                ps, entry_lat, exit_lat, idle_power);
    }

    // Enable ASPM L1 substates if supported
    if (pcie_capability_read_word(pdev, PCI_EXP_LNKCTL, &lnkctl) == 0) {
        lnkctl |= PCI_EXP_LNKCTL_ASPM_L1;
        pcie_capability_write_word(pdev, PCI_EXP_LNKCTL, lnkctl);
    }

    // Configure APST for gradual power reduction
    struct nvme_apst_entry apst_table[] = {
        { .idle_time_ms = 500,  .idle_transition_ps = 1 },
        { .idle_time_ms = 1500, .idle_transition_ps = 2 },
        { .idle_time_ms = 5000, .idle_transition_ps = 3 },
    };
    nvme_set_features(ctrl, NVME_FEAT_AUTO_PST, 1 /* enable */,
                      apst_table, sizeof(apst_table));
    return 0;
}

Aggressive power management increases response latency. A device in L1.2 + PS3 may take >50ms to return to full operation. Enterprise SSDs often disable low-power states for consistent latency. Consumer/laptop SSDs balance battery life against occasional latency spikes.
PCIe provides sophisticated error detection and reporting through Advanced Error Reporting (AER). NVMe drivers must properly handle PCIe-level errors to maintain data integrity and system stability.
Error Classification
PCIe errors are classified into three categories:
Correctable Errors: Hardware automatically corrects (e.g., ECC correction, retry success)
Uncorrectable Non-Fatal: Error contained, device may continue
Uncorrectable Fatal: Device requires reset
// PCIe AER error handling for NVMe
static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
                                            pci_channel_state_t state)
{
    struct nvme_ctrl *ctrl = pci_get_drvdata(pdev);

    switch (state) {
    case pci_channel_io_normal:
        /* Correctable error, controller still functioning */
        dev_info(&pdev->dev, "PCIe correctable error detected");
        return PCI_ERS_RESULT_CAN_RECOVER;

    case pci_channel_io_frozen:
        /* Non-fatal error, transactions blocked */
        dev_warn(&pdev->dev, "PCIe channel frozen, stopping queues");
        nvme_stop_queues(ctrl);
        /* Prepare for slot reset */
        return PCI_ERS_RESULT_NEED_RESET;

    case pci_channel_io_perm_failure:
        /* Fatal error, device unrecoverable */
        dev_err(&pdev->dev, "PCIe permanent failure");
        nvme_remove_ctrl(ctrl);
        return PCI_ERS_RESULT_DISCONNECT;
    }
    return PCI_ERS_RESULT_NONE;
}

static pci_ers_result_t nvme_slot_reset(struct pci_dev *pdev)
{
    struct nvme_ctrl *ctrl = pci_get_drvdata(pdev);

    dev_info(&pdev->dev, "PCIe slot reset in progress");

    /* Re-enable the device after PCIe reset */
    if (pci_enable_device_mem(pdev) < 0)
        return PCI_ERS_RESULT_DISCONNECT;
    pci_set_master(pdev);
    pci_restore_state(pdev);

    /* Controller requires full re-initialization */
    if (nvme_reset_controller(ctrl) < 0)
        return PCI_ERS_RESULT_DISCONNECT;

    return PCI_ERS_RESULT_RECOVERED;
}

static void nvme_error_resume(struct pci_dev *pdev)
{
    struct nvme_ctrl *ctrl = pci_get_drvdata(pdev);

    dev_info(&pdev->dev, "Resuming after PCIe error recovery");
    nvme_start_queues(ctrl);
}

static const struct pci_error_handlers nvme_err_handlers = {
    .error_detected = nvme_error_detected,
    .slot_reset     = nvme_slot_reset,
    .resume         = nvme_error_resume,
};

PCIe link errors don't always require NVMe controller reset. The PCIe hardware layer may recover transparently. However, extended link down or fatal errors require full NVMe reinitialization. The Linux nvme driver tracks controller state to determine appropriate recovery actions.
We've explored the tight integration between NVMe and PCIe—the high-speed interconnect that enables NVMe's exceptional performance. Let's consolidate the key insights: the 0x010802 class code identifies the device during enumeration; BAR0 exposes the memory-mapped register and doorbell space; MSI-X delivers per-queue, per-CPU interrupts; DMA with PRPs or SGLs moves commands and data without CPU copies; and PCIe power and error mechanisms govern behavior in production.
What's Next
With the PCIe interface foundation established, the next page explores NVMe's command queue architecture in depth. We'll examine submission and completion queue mechanics, the doorbell protocol, and how queue design enables NVMe's scalability to hundreds of thousands of IOPS.
Understanding command queues is essential for anyone implementing NVMe drivers, debugging storage performance, or designing NVMe-aware applications.
You now understand how NVMe leverages PCIe for high-performance storage access: configuration space layout, memory-mapped registers, MSI-X interrupts, DMA mechanisms, and power/error management. This knowledge is essential for driver development and system-level debugging.