At the heart of NVMe's performance lies its command queue architecture—a fundamentally different approach to storage I/O than legacy interfaces provided. While AHCI/SATA imposed a single queue of 32 commands as the entire interface between host and device, NVMe scales to 65,535 I/O queue pairs, each capable of holding 65,536 commands.
This isn't merely a larger number; it's a paradigm shift. NVMe queues are designed for lockless per-core operation, deep parallelism that matches the internal parallelism of modern flash, and minimal per-command protocol overhead.
Understanding command queues is essential for anyone implementing NVMe drivers, optimizing storage workloads, or designing NVMe-aware systems. This page provides the complete picture—from queue creation to completion processing.
By the end of this page, you will understand NVMe queue architecture at implementation depth: submission and completion queue mechanics, the doorbell protocol, queue pair creation and deletion, arbitration mechanisms, and queue sizing strategies. You'll be equipped to implement or debug NVMe drivers.
NVMe's queue architecture employs a classic producer-consumer pattern with a crucial twist: the host and controller each produce and consume from different queues.
The Queue Pair Model
HOST CONTROLLER
┌─────────────────┐ ┌─────────────────┐
│ │ │ │
│ Submission │ ───(DMA Read)───► │ Command │
│ Queue │ │ Processor │
│ │ │ │
│ [Producer] │ │ [Consumer] │
└─────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐
│ │ │ │
│ Completion │ ◄──(DMA Write)─── │ Completion │
│ Queue │ │ Generator │
│ │ │ │
│ [Consumer] │ │ [Producer] │
└─────────────────┘ └─────────────────┘
Submission Queue (SQ): a circular buffer in host memory where the driver places fixed-size 64-byte command entries for the controller to fetch.
Completion Queue (CQ): a circular buffer in host memory where the controller writes 16-byte completion entries describing the outcome of finished commands.
Queue Indexing
NVMe uses circular buffer semantics with head and tail pointers:
Submission Queue: the host advances the tail as it enqueues commands; the controller advances the head as it fetches them.
Completion Queue: the controller advances the tail as it posts completions; the host advances the head as it consumes them.
The difference between head and tail indicates queue occupancy:
tail - head (mod queue_size) = number of entries in use

| Queue Type | Pointer | Maintained By | Communicated Via |
|---|---|---|---|
| Submission Queue | Tail | Host (driver) | Doorbell register write |
| Submission Queue | Head | Controller | Returned in completion entry |
| Completion Queue | Head | Host (driver) | Doorbell register write |
| Completion Queue | Tail | Controller | Implicit (phase bit + written entries) |
The completion queue doesn't need an explicit tail pointer because the phase bit provides the same information. When the controller writes a completion, it sets the phase bit. The host distinguishes new completions from old ones by checking if the phase matches the expected value. This elegant design eliminates a memory write (tail update) on every completion.
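The occupancy formula and the phase-bit check both reduce to a few lines of C. A minimal sketch, assuming tail and head always stay in the range [0, queue_size) and that the phase bit occupies bit 0 of the 16-bit completion status word; the helper names are illustrative, not from the specification:

#include <stdbool.h>
#include <stdint.h>

// Entries currently in use: (tail - head) mod queue_size.
// Example: depth 8, head 6, tail 2 -> (2 + 8 - 6) % 8 = 4 entries in use.
static inline uint16_t queue_entries_in_use(uint16_t tail, uint16_t head,
                                            uint16_t depth)
{
    return (uint16_t)((tail + depth - head) % depth);
}

// A completion entry is new if its phase bit matches the phase the host
// currently expects; the expected phase flips each time the host wraps
// from the last CQ slot back to slot 0.
static inline bool cqe_is_new(uint16_t status, uint8_t expected_phase)
{
    return (status & 1) == expected_phase;
}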
Doorbells are the signaling mechanism that coordinates host and controller activities. A doorbell is simply a memory-mapped register that, when written, triggers controller action.
Submission Queue Tail Doorbell (SQyTDBL)
When the host writes to the submission queue tail doorbell, the controller learns the new tail value and fetches the newly queued commands between its current head and that tail.
Completion Queue Head Doorbell (CQyHDBL)
When the host writes to the completion queue head doorbell, it reports how far it has consumed the completion queue, allowing the controller to reuse those CQ slots.
Doorbell Register Layout
Doorbell registers occupy BAR0 starting at offset 0x1000:
Offset 0x1000: Admin Submission Queue 0 Tail Doorbell
Offset 0x1000 + stride: Admin Completion Queue 0 Head Doorbell
Offset 0x1000 + 2*stride: I/O Submission Queue 1 Tail Doorbell
Offset 0x1000 + 3*stride: I/O Completion Queue 1 Head Doorbell
...
Offset 0x1000 + 2*N*stride: I/O Submission Queue N Tail Doorbell
Offset 0x1000 + (2*N+1)*stride: I/O Completion Queue N Head Doorbell
where stride = 4 << CAP.DSTRD (typically 4 bytes)
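The stride formula above translates directly into address arithmetic. A short sketch, assuming a 64-bit CAP value read from offset 0x00 (the helper names are illustrative):

#include <stdint.h>

#define NVME_DOORBELL_BASE 0x1000u

// CAP.DSTRD occupies bits [35:32] of the 64-bit CAP register
static inline uint32_t doorbell_stride(uint64_t cap)
{
    uint32_t dstrd = (uint32_t)((cap >> 32) & 0xF);
    return 4u << dstrd;                       // stride in bytes, typically 4
}

// Byte offsets of the doorbells for queue pair 'qid' within BAR0
static inline uint32_t sq_tail_doorbell_offset(uint16_t qid, uint32_t stride)
{
    return NVME_DOORBELL_BASE + (2u * qid) * stride;
}

static inline uint32_t cq_head_doorbell_offset(uint16_t qid, uint32_t stride)
{
    return NVME_DOORBELL_BASE + (2u * qid + 1u) * stride;
}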
// Complete doorbell protocol implementation
struct nvme_queue {
    void __iomem *sq_doorbell;            // SQ tail doorbell address
    void __iomem *cq_doorbell;            // CQ head doorbell address
    volatile struct nvme_command *sq;     // Submission queue base
    volatile struct nvme_cqe *cq;         // Completion queue base
    uint16_t sq_tail;                     // Host-maintained SQ tail
    uint16_t sq_head;                     // Last known SQ head (from completions)
    uint16_t cq_head;                     // Host-maintained CQ head
    uint16_t queue_depth;                 // Number of entries in each queue
    uint8_t  cq_phase;                    // Expected phase bit
    uint8_t  sqes;                        // Log2(SQ entry size) = 6 (64 bytes)
    uint8_t  cqes;                        // Log2(CQ entry size) = 4 (16 bytes)
    spinlock_t sq_lock;                   // Protect SQ tail updates (if shared)
};

// Submit a command to the submission queue
int nvme_submit_cmd(struct nvme_queue *q, struct nvme_command *cmd)
{
    unsigned long flags;

    spin_lock_irqsave(&q->sq_lock, flags);

    // Check if queue is full:
    // the queue is full when (tail + 1) mod depth == head
    uint16_t next_tail = (q->sq_tail + 1) % q->queue_depth;
    if (next_tail == q->sq_head) {
        spin_unlock_irqrestore(&q->sq_lock, flags);
        return -ENOSPC;                   // Queue full
    }

    // Copy command to queue entry
    memcpy((void *)&q->sq[q->sq_tail], cmd, sizeof(*cmd));

    // Memory barrier: ensure command visible before doorbell
    wmb();

    // Update and ring doorbell
    q->sq_tail = next_tail;
    writel(q->sq_tail, q->sq_doorbell);

    spin_unlock_irqrestore(&q->sq_lock, flags);
    return 0;
}

// Process completions from the completion queue
int nvme_process_completions(struct nvme_queue *q, int budget)
{
    int processed = 0;
    volatile struct nvme_cqe *cqe;
    uint16_t status;

    while (processed < budget) {
        cqe = &q->cq[q->cq_head];

        // Read status with phase bit
        status = READ_ONCE(cqe->status);

        // Check phase bit (bit 0) against expected phase
        if ((status & 1) != q->cq_phase)
            break;                        // No more valid completions

        // Process this completion
        nvme_handle_completion(q, cqe);

        // Update SQ head from completion
        q->sq_head = le16_to_cpu(cqe->sq_head);

        // Advance CQ head with phase wrap
        if (++q->cq_head == q->queue_depth) {
            q->cq_head = 0;
            q->cq_phase ^= 1;             // Flip expected phase
        }

        processed++;
    }

    // Ring CQ doorbell to release processed entries
    if (processed > 0)
        writel(q->cq_head, q->cq_doorbell);

    return processed;
}

The wmb() (write memory barrier) before the doorbell write is essential. Modern CPUs reorder memory operations for performance. Without the barrier, the doorbell write might reach memory before the command is visible, causing the controller to read stale or partial command data.
NVMe distinguishes between two types of queue pairs, each serving distinct purposes.
Admin Queue (Queue ID 0)
The Admin Queue is mandatory—every NVMe controller must have exactly one Admin Queue pair. It's created during controller initialization through the ASQ/ACQ registers, not through commands.
Admin Queue characteristics: it is always queue pair 0, its size is limited to 4,096 entries (configured via the AQA register), and it carries only admin commands, never I/O.
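Because the Admin Queue is configured through registers rather than commands, its setup looks different from the I/O queue creation shown later. A sketch of that register sequence, using the AQA (0x24), ASQ (0x28), and ACQ (0x30) offsets from the specification; the writel/writeq-style MMIO helpers and the nvme_ctrl structure are assumptions of this example:

// AQA: ACQS in bits [27:16], ASQS in bits [11:0], both 0-based
void nvme_setup_admin_queue(struct nvme_ctrl *ctrl,
                            dma_addr_t asq_dma, dma_addr_t acq_dma,
                            uint16_t depth)
{
    uint32_t aqa = ((uint32_t)(depth - 1) << 16) | (depth - 1);  // same size for ASQ and ACQ

    writel(aqa, ctrl->regs + 0x24);       // Admin Queue Attributes
    writeq(asq_dma, ctrl->regs + 0x28);   // Admin SQ base (physical, page aligned)
    writeq(acq_dma, ctrl->regs + 0x30);   // Admin CQ base (physical, page aligned)

    // Only after these are programmed does the driver set CC.EN = 1 and
    // poll CSTS.RDY (offset 0x1C, bit 0) until the controller reports ready
}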
Admin Queue commands include:
| Opcode | Command | Purpose |
|---|---|---|
| 0x00 | Delete I/O SQ | Remove submission queue |
| 0x01 | Create I/O SQ | Allocate submission queue |
| 0x04 | Delete I/O CQ | Remove completion queue |
| 0x05 | Create I/O CQ | Allocate completion queue |
| 0x06 | Identify | Query device/namespace properties |
| 0x09 | Set Features | Configure controller behavior |
| 0x0A | Get Features | Query controller configuration |
| 0x0C | Async Event Request | Register for event notifications |
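As an example of one command from this table, here is a sketch of Identify Controller (opcode 0x06 with CNS = 1 in CDW10, returning a 4,096-byte data structure), reusing the command-building style shown elsewhere on this page; the nvme_admin_identify constant and the submission helper are assumed to exist as in the other examples:

// Identify Controller: CNS = 1, data returned via PRP1
int nvme_identify_controller(struct nvme_ctrl *ctrl, dma_addr_t buf_dma)
{
    struct nvme_command cmd = {};

    cmd.opcode = nvme_admin_identify;            // 0x06
    cmd.nsid = 0;                                // Not namespace-specific for CNS = 1
    cmd.dptr.prp.prp1 = cpu_to_le64(buf_dma);    // 4 KB identify buffer
    cmd.cdw10 = cpu_to_le32(1);                  // CNS = 1: Identify Controller

    return nvme_submit_admin_cmd(ctrl, &cmd);
}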
I/O Queues (Queue ID 1+)
I/O Queues handle actual data operations—reads, writes, flushes, and deallocates. They're created dynamically by the host using Admin Queue commands.
I/O Queue characteristics: queue IDs 1 through 65,535, created and deleted at runtime via admin commands, sized up to CAP.MQES + 1 entries, carrying only I/O commands (Read, Write, Flush, Dataset Management, and so on), with each Submission Queue bound to a Completion Queue at creation time.
Queue Pair Relationship
A key NVMe flexibility: multiple Submission Queues can share a Completion Queue. This enables interrupt consolidation (one vector and one CQ servicing several SQs) and reduces memory and interrupt-vector usage when queue counts are large.
However, typical high-performance configurations use 1:1 SQ:CQ pairing for per-CPU interrupt affinity, lock-free queue access, and better cache locality.
// Creating I/O Queue Pairs via Admin Commands

// Step 1: Create Completion Queue first (SQ requires existing CQ)
int nvme_create_cq(struct nvme_ctrl *ctrl, uint16_t qid, uint16_t depth,
                   dma_addr_t cq_dma, uint16_t vector)
{
    struct nvme_command cmd = {};

    cmd.opcode = nvme_admin_create_cq;            // 0x05
    cmd.nsid = 0;                                 // Admin commands don't target namespaces
    cmd.dptr.prp.prp1 = cpu_to_le64(cq_dma);      // CQ physical address

    // cdw10: Queue Size (0-based) | Queue ID
    cmd.cdw10 = cpu_to_le32(((depth - 1) << 16) | qid);

    // cdw11: Interrupt Vector | Interrupt Enable | Physically Contiguous
    cmd.cdw11 = cpu_to_le32((vector << 16) | CQ_IRQ_ENABLED | CQ_PHYS_CONTIGUOUS);

    return nvme_submit_admin_cmd(ctrl, &cmd);
}

// Step 2: Create Submission Queue linked to the CQ
int nvme_create_sq(struct nvme_ctrl *ctrl, uint16_t qid, uint16_t depth,
                   dma_addr_t sq_dma, uint16_t cqid, enum nvme_queue_prio prio)
{
    struct nvme_command cmd = {};

    cmd.opcode = nvme_admin_create_sq;            // 0x01
    cmd.nsid = 0;
    cmd.dptr.prp.prp1 = cpu_to_le64(sq_dma);      // SQ physical address

    // cdw10: Queue Size (0-based) | Queue ID
    cmd.cdw10 = cpu_to_le32(((depth - 1) << 16) | qid);

    // cdw11: CQ ID | Queue Priority | Physically Contiguous
    cmd.cdw11 = cpu_to_le32((cqid << 16) | (prio << 1) | SQ_PHYS_CONTIGUOUS);

    return nvme_submit_admin_cmd(ctrl, &cmd);
}

// Complete queue pair creation
int nvme_alloc_queue_pair(struct nvme_ctrl *ctrl, int qid, int vector)
{
    struct nvme_queue *q;
    int depth = ctrl->io_queue_depth;
    dma_addr_t sq_dma, cq_dma;

    q = kzalloc(sizeof(*q), GFP_KERNEL);

    // Allocate DMA-coherent memory for queues
    q->sq = dma_alloc_coherent(ctrl->dev, depth * sizeof(struct nvme_command),
                               &sq_dma, GFP_KERNEL);
    q->cq = dma_alloc_coherent(ctrl->dev, depth * sizeof(struct nvme_cqe),
                               &cq_dma, GFP_KERNEL);

    // Initialize queue state
    q->queue_depth = depth;
    q->sq_tail = q->sq_head = 0;
    q->cq_head = 0;
    q->cq_phase = 1;                              // Controller writes with phase=1 initially

    // Calculate doorbell addresses
    uint32_t stride = ctrl->doorbell_stride;
    q->sq_doorbell = ctrl->regs + 0x1000 + (2 * qid * stride);
    q->cq_doorbell = ctrl->regs + 0x1000 + ((2 * qid + 1) * stride);

    // Create CQ first (SQ references CQ by ID)
    nvme_create_cq(ctrl, qid, depth, cq_dma, vector);

    // Create SQ linked to this CQ
    nvme_create_sq(ctrl, qid, depth, sq_dma, qid, NVME_QP_MEDIUM);

    ctrl->queues[qid] = q;
    return 0;
}

Queues must be deleted in reverse order: Submission Queues first, then Completion Queues. You cannot delete a CQ while an SQ references it. The controller will reject the Delete CQ command with Invalid Queue Deletion status.
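To make the N:1 relationship concrete, a brief usage sketch of the helpers above: two Submission Queues reference the same Completion Queue simply by naming the same CQ ID at creation (the DMA addresses and depth are assumed to have been allocated as in nvme_alloc_queue_pair):

// One CQ (ID 1) backed by a single interrupt vector...
nvme_create_cq(ctrl, 1, depth, cq_dma, vector);

// ...shared by two SQs (IDs 1 and 2) that both point at CQ ID 1
nvme_create_sq(ctrl, 1, depth, sq1_dma, 1, NVME_QP_MEDIUM);
nvme_create_sq(ctrl, 2, depth, sq2_dma, 1, NVME_QP_MEDIUM);

Submission and Completion Queue IDs are separate ID spaces, so pairing SQ 2 with CQ 1 is perfectly legal.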
When multiple queues contain pending commands, the controller must decide which queue to service. NVMe defines arbitration mechanisms to control this scheduling.
Arbitration Mechanisms
The controller supports one or more arbitration mechanisms, selected via the Arbitration Mechanism Selected (AMS) field of the CC (Controller Configuration) register:
1. Round Robin
The simplest mechanism: all Submission Queues are treated equally, and the controller services them in rotation, fetching up to an Arbitration Burst worth of commands from each queue per turn.
2. Weighted Round Robin with Urgent Priority Class
More sophisticated scheduling with priority classes:
| Priority Level | Class | Behavior | Use Case |
|---|---|---|---|
| Highest | Urgent | Serviced before all other classes | Latency-critical operations |
| High | High | Higher configurable WRR weight (e.g., 3) | Important application I/O |
| Medium | Medium | Middle configurable WRR weight (e.g., 2) | Normal operations (default) |
| Low | Low | Lowest configurable WRR weight (e.g., 1) | Background tasks, scrubbing |
The Weighted Round Robin operates as follows: after the Admin Queue (which always has the highest strict priority), Urgent-class queues are serviced first in round-robin order among themselves; the High, Medium, and Low classes then share the remaining command slots in proportion to their configured weights, with the controller fetching at most one burst from any queue before moving on.
The burst size is configurable via the Arbitration Burst setting (2^AB commands, where AB ∈ [0,7]; a value of 7 means no burst limit).
3. Vendor Specific
Some controllers implement proprietary arbitration schemes optimized for specific workloads or hardware characteristics.
// Configuring arbitration mechanism and burst size

// Arbitration settings in CC register
#define NVME_CC_AMS_SHIFT 11     // Arbitration Mechanism Selected
#define NVME_CC_AMS_RR    0      // Round Robin
#define NVME_CC_AMS_WRRU  1      // Weighted Round Robin with Urgent

// Set arbitration during controller configuration
void nvme_configure_arbitration(struct nvme_ctrl *ctrl,
                                enum nvme_arb_mech mechanism, int burst_exp)
{
    uint32_t cc = readl(ctrl->regs + NVME_REG_CC);

    // Clear and set arbitration mechanism
    cc &= ~(7 << NVME_CC_AMS_SHIFT);
    cc |= (mechanism << NVME_CC_AMS_SHIFT);
    writel(cc, ctrl->regs + NVME_REG_CC);

    // Set arbitration burst via Set Features
    // Feature ID 0x01: Arbitration
    //   CDW11[2:0]   = AB  (Arbitration Burst)
    //   CDW11[15:8]  = LPW (Low Priority Weight)
    //   CDW11[23:16] = MPW (Medium Priority Weight)
    //   CDW11[31:24] = HPW (High Priority Weight)
    // Weights are 0-based: programmed value + 1 commands per arbitration round
    uint32_t arb_feature = (burst_exp & 0x7) |   // AB
                           (0 << 8)  |           // LPW: effective weight 1
                           (1 << 16) |           // MPW: effective weight 2
                           (2 << 24);            // HPW: effective weight 3

    nvme_set_features(ctrl, NVME_FEAT_ARBITRATION, arb_feature, NULL, 0);
}

// Creating queues with priority assignment
// (depth, sq_dma[], num_fg_queues, and total_queues are assumed to be
//  set up elsewhere; this sketch only shows the priority parameter)
int nvme_create_priority_queues(struct nvme_ctrl *ctrl)
{
    // Create an Urgent priority queue for database commit logs
    nvme_create_sq(ctrl, 1, depth, sq_dma[0], 1, NVME_QPRIO_URGENT);

    // Create High priority queues for foreground I/O
    for (int i = 2; i <= num_fg_queues; i++) {
        nvme_create_sq(ctrl, i, depth, sq_dma[i-1], i, NVME_QPRIO_HIGH);
    }

    // Create Low priority queues for background tasks
    for (int i = num_fg_queues + 1; i <= total_queues; i++) {
        nvme_create_sq(ctrl, i, depth, sq_dma[i-1], i, NVME_QPRIO_LOW);
    }

    return 0;
}

NVMe 1.x priority is per-queue, not per-command. All commands in a queue share the same priority. For mixed-priority workloads, create separate queues for each priority level. NVMe 2.0 introduces I/O Priority (I/O Priorities feature) for finer-grained control.
Optimal queue configuration balances resource usage against performance. Both queue depth (entries per queue) and queue count (number of queue pairs) require careful consideration.
Determining Maximum Capabilities
The controller advertises its limits in the CAP register and Identify data:
// From CAP register (offset 0x00)
max_queue_entries = (CAP & 0xFFFF) + 1;   // MQES field (0-based)
cqr = (CAP >> 16) & 0x1;                  // CQR: Contiguous Queues Required

// The maximum I/O queue count is not in Identify Controller; it is
// negotiated with Set Features, Feature ID 0x07 (Number of Queues)
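A sketch of that negotiation, assuming a hypothetical nvme_set_features_result() helper that behaves like the nvme_set_features() used earlier but also returns the completion's Dword 0:

#define NVME_FEAT_NUM_QUEUES 0x07

// Request 'want' I/O SQs and CQs; the controller replies with the counts
// actually allocated. Both CDW11 and Dword 0 fields are 0-based.
int nvme_negotiate_queue_count(struct nvme_ctrl *ctrl, uint16_t want,
                               uint16_t *sqs_granted, uint16_t *cqs_granted)
{
    uint32_t dw11 = ((uint32_t)(want - 1) << 16) | (want - 1);  // NCQR | NSQR
    uint32_t result;
    int ret;

    ret = nvme_set_features_result(ctrl, NVME_FEAT_NUM_QUEUES, dw11, &result);
    if (ret)
        return ret;

    *sqs_granted = (result & 0xFFFF) + 1;           // NSQA
    *cqs_granted = ((result >> 16) & 0xFFFF) + 1;   // NCQA
    return 0;
}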
Queue Depth Considerations
| Factor | Shallow Queues (<128) | Deep Queues (1024+) |
|---|---|---|
| Memory Usage | Lower (~10 KB per 128-entry pair) | Higher (80 KB at 1,024 entries, ~5 MB at 65,536) |
| Queue Full Events | More frequent stalls | Rare, sustained throughput |
| Latency Under Load | Bounded queuing delay, but stalls when full | Higher queuing delay at saturation, throughput sustained |
| Flash Utilization | May leave channels idle | Full parallelism exploited |
| Command Tracking | Less driver bookkeeping | More context memory needed |
Practical Queue Depth Guidelines
By Little's Law, the queue depth should be at least:
min_depth = target_iops × average_latency_seconds
Example: 100,000 IOPS × 0.00010s latency = 10 outstanding
(Add margin: 4× = 40 entries minimum)
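The same rule of thumb as a tiny helper (the 4x margin mirrors the example above; the function name is purely illustrative):

#include <stdint.h>

// Little's Law: outstanding commands = IOPS x latency; add a 4x margin
// so bursts do not immediately hit queue-full.
static inline uint32_t min_queue_depth(uint32_t target_iops, double avg_latency_sec)
{
    double outstanding = (double)target_iops * avg_latency_sec;
    uint32_t depth = (uint32_t)(outstanding * 4.0 + 0.5);
    return depth < 2 ? 2 : depth;   // e.g. 100,000 IOPS x 100 us -> 40
}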
Queue Count Strategies
Per-CPU Queues: The most common approach
// Standard per-CPU queue allocation
int create_per_cpu_queues(struct nvme_ctrl *ctrl) {
int max_queues = min3(num_online_cpus(),
ctrl->max_io_queues,
ctrl->max_msix_vectors - 1);
for (int i = 0; i < max_queues; i++) {
int cpu = cpumask_nth(i, cpu_online_mask);
int node = cpu_to_node(cpu);
// Allocate queue memory from local NUMA node
alloc_queue_on_node(ctrl, i + 1, node);
}
return max_queues;
}
Shared Queues: in resource-constrained environments (limited MSI-X vectors, many vCPUs, embedded controllers), several CPUs share a smaller set of queue pairs, trading some lock contention for lower memory and interrupt-vector usage.
Polling Queues
Some workloads benefit from dedicated polling queues—queues without interrupt vectors that the CPU continuously polls:
// Creating a polling queue: the Create I/O CQ command is issued with the
// Interrupt Enable (IEN) bit cleared in CDW11, so no interrupt is ever generated
nvme_create_cq(ctrl, qid, depth, cq_dma, 0 /* vector ignored when IEN = 0 */);
// Polling loop
while (running) {
int found = nvme_poll_queue(queue, 16); // Check 16 entries
if (!found)
cpu_relax(); // Yield CPU resources briefly
}
Modern kernels (Linux 5.0+) support mixed polling/interrupt modes with io_uring and blk-mq poll queues.
A typical server with 64 cores and an enterprise NVMe SSD might run with 64 I/O queue pairs (plus admin), each with 1024 entries. Total queue memory: 64 × (64KB SQ + 16KB CQ) ≈ 5 MB—modest compared to benefits gained.
Let's trace a complete read operation through the NVMe command queue system, from application request to completion.
Step 1: Application Issues Read
Application → System Call → Block Layer → nvme-blk Driver
The application calls read() or pread(), and the block layer dispatches the request to the NVMe driver.
Step 2: Driver Prepares Command
The driver allocates a command tag and builds the NVMe Read command:
// Prepare read command
struct nvme_command cmd = {
.opcode = nvme_cmd_read, // 0x02
.nsid = cpu_to_le32(nsid), // Target namespace
.cdw10 = cpu_to_le32(slba & 0xFFFFFFFF), // Starting LBA low
.cdw11 = cpu_to_le32(slba >> 32), // Starting LBA high
.cdw12 = cpu_to_le32(nlb - 1), // Number of blocks (0-based)
};
cmd.command_id = tag; // For completion matching
// Setup PRP for data buffer
nvme_setup_prps(&cmd, buffer_dma, nlb * block_size);
Step 3: Command Submission
┌────────────────────────────────────────────────┐
│ Host Memory - Submission Queue │
├────────────────────────────────────────────────┤
│ [0] completed [1] completed [2] NEW_CMD ← │
│ tail=3 │
└────────────────────────────────────────────────┘
│
▼ doorbell write
┌────────────────────────────────────────────────┐
│ NVMe Controller │
│ Doorbell received: tail=3, fetching... │
└────────────────────────────────────────────────┘
The driver: copies the 64-byte command into SQ slot 2, advances its local tail to 3, and writes 3 to the SQ tail doorbell.
Step 4: Controller Processing
The controller: reads the new tail value from the doorbell, DMA-fetches the command from slot 2, decodes the Read, translates the LBA range, reads the data from flash, and DMA-writes it into the host buffer described by the PRP entries.
Step 5: Completion
┌────────────────────────────────────────────────┐
│ NVMe Controller │
│ Read complete, writing completion entry │
└────────────────────────────────────────────────┘
│
▼ DMA write
┌────────────────────────────────────────────────┐
│ Host Memory - Completion Queue │
├────────────────────────────────────────────────┤
│ [0] old (P=0) [1] NEW (P=1) ← cq_head=1 │
│ ^ │
│ command_id=tag │
│ sq_head=3 │
│ status=SUCCESS │
└────────────────────────────────────────────────┘
│
▼ MSI-X interrupt
┌────────────────────────────────────────────────┐
│ CPU Interrupt Handler │
└────────────────────────────────────────────────┘
Step 6: Completion Processing
The interrupt handler or polling loop: checks the phase bit, looks up the outstanding request by command_id, records the status and new sq_head, advances cq_head (flipping the expected phase on wrap), rings the CQ head doorbell, and wakes the waiting thread.
// Complete read operation flow
int nvme_submit_read(struct nvme_ns *ns, void *buffer, sector_t lba,
                     unsigned int sectors)
{
    struct nvme_ctrl *ctrl = ns->ctrl;
    struct nvme_queue *q = get_queue_for_cpu(ctrl);
    struct nvme_request *req;
    int tag;

    // Allocate request tracking structure
    tag = ida_simple_get(&q->tag_ida, 0, q->queue_depth, GFP_KERNEL);
    req = &q->requests[tag];
    req->buffer = buffer;
    init_completion(&req->done);

    // Build and submit command
    struct nvme_command cmd = {};
    cmd.opcode = nvme_cmd_read;
    cmd.command_id = tag;
    cmd.nsid = cpu_to_le32(ns->nsid);
    cmd.cdw10 = cpu_to_le32(lba);
    cmd.cdw11 = cpu_to_le32(lba >> 32);
    cmd.cdw12 = cpu_to_le32(sectors - 1);
    nvme_setup_prps(&cmd, req->buffer_dma, sectors * 512);

    nvme_submit_cmd(q, &cmd);

    // Wait for completion
    wait_for_completion(&req->done);

    int status = req->status;
    ida_simple_remove(&q->tag_ida, tag);
    return status;
}

// Called from interrupt handler for each completion
void nvme_handle_completion(struct nvme_queue *q, volatile struct nvme_cqe *cqe)
{
    uint16_t tag = cqe->command_id;
    uint16_t status = le16_to_cpu(cqe->status) >> 1;  // Remove phase bit
    struct nvme_request *req = &q->requests[tag];

    req->status = status;

    // Notify waiting thread
    complete(&req->done);
}

Using the command_id as a direct index into a request array enables O(1) completion lookup—no hash tables or searches needed. This is why command_ids are typically constrained to [0, queue_depth) rather than arbitrary values.
Robust queue error handling is essential for production NVMe systems. Errors can occur at multiple levels—command failures, queue failures, and controller-level failures.
Command-Level Errors
The completion entry's status field indicates success or failure:
struct nvme_cqe {
// ...
uint16_t status; // Bits [15:1] = status, Bit [0] = phase
};
// Status Code Type (SCT) in bits [11:9]
#define NVME_SCT_GENERIC 0x0 // Generic command status
#define NVME_SCT_CMD_SPEC 0x1 // Command-specific status
#define NVME_SCT_MEDIA 0x2 // Media and data errors
#define NVME_SCT_PATH 0x3 // Path-related status (NVMe-oF)
#define NVME_SCT_VENDOR 0x7 // Vendor-specific
// Status Code (SC) in bits [8:1]
// Example Generic status codes:
#define NVME_SC_SUCCESS 0x00
#define NVME_SC_INVALID_OPCODE 0x01
#define NVME_SC_INVALID_FIELD 0x02
#define NVME_SC_DATA_XFER_ERROR 0x04
#define NVME_SC_ABORTED_POWER_LOSS 0x05
#define NVME_SC_INTERNAL 0x06
#define NVME_SC_ABORT_REQ 0x07
#define NVME_SC_SQ_DELETED 0x08
Common Error Responses:
| Error | Cause | Retry? | Action |
|---|---|---|---|
| Invalid Opcode | Unsupported command | No | Feature not available |
| Invalid Field | Bad parameter | No | Fix command structure |
| Data Transfer Error | DMA failure | Yes | Retry with backoff |
| Namespace Not Ready | NS being formatted | Yes | Wait, then retry |
| Media Error | Uncorrectable ECC | No | Report to filesystem |
| Abort Requested | Explicit abort | Maybe | Depends on context |
| Compare Failure | Data mismatch | No | Report to application |
| Write Fault | Flash write failed | Yes | Controller may relocate |
// Comprehensive error handling
int nvme_complete_request(struct nvme_queue *q, volatile struct nvme_cqe *cqe)
{
    uint16_t raw = le16_to_cpu(cqe->status);
    uint16_t status = (raw >> 1) & 0x7FF;      // SCT + SC, phase bit stripped
    uint8_t sct = (status >> 8) & 0x7;         // Status Code Type
    uint8_t sc = status & 0xFF;                // Status Code
    bool dnr = raw & (1 << 15);                // Do Not Retry: bit 15 of the raw field

    struct nvme_request *req = nvme_find_request(q, cqe->command_id);

    if (sc == NVME_SC_SUCCESS) {
        req->error = 0;
        goto complete;
    }

    // Classify error and decide on retry
    switch (sct) {
    case NVME_SCT_GENERIC:
        switch (sc) {
        case NVME_SC_INVALID_OPCODE:
        case NVME_SC_INVALID_FIELD:
            // Configuration errors - don't retry
            req->error = -EINVAL;
            break;

        case NVME_SC_DATA_XFER_ERROR:
        case NVME_SC_INTERNAL:
            // Transient errors - retry if allowed
            if (!dnr && req->retries < MAX_RETRIES) {
                req->retries++;
                nvme_resubmit_request(q, req);
                return 0;                      // Don't complete yet
            }
            req->error = -EIO;
            break;

        case NVME_SC_NS_NOT_READY:
            // Namespace busy - delay and retry
            // (In a real driver the delay must be deferred, e.g. via a delayed
            //  work item; sleeping in completion/interrupt context is not allowed)
            msleep(100);
            if (req->retries++ < MAX_RETRIES) {
                nvme_resubmit_request(q, req);
                return 0;
            }
            req->error = -EBUSY;
            break;

        default:
            req->error = -EIO;
        }
        break;

    case NVME_SCT_MEDIA:
        // Media errors - typically unrecoverable for reads
        switch (sc) {
        case NVME_SC_UNWRITTEN_BLOCK:
            req->error = -ENODATA;
            break;
        case NVME_SC_ECC_ERROR:
            req->error = -EBADMSG;
            break;
        default:
            req->error = -EIO;
        }
        break;

    default:
        req->error = -EIO;
    }

complete:
    complete(&req->done);
    return 1;                                  // Completed
}

// Queue-level error: submission queue overflow
int nvme_submit_cmd_safe(struct nvme_queue *q, struct nvme_command *cmd)
{
    int retries = 0;
    int ret;

    while ((ret = nvme_submit_cmd(q, cmd)) == -ENOSPC) {
        // Queue full - process completions to free slots
        nvme_process_completions(q, 16);

        if (++retries > 100) {
            // Queue persistently full - may indicate stuck controller
            dev_err(q->dev, "Queue %d persistently full\n", q->qid);
            return -EBUSY;
        }
        cpu_relax();
    }

    return ret;
}

When DNR is set in the completion status, the controller explicitly forbids retry—the error is deterministic. Ignoring DNR and retrying wastes time and resources. Common DNR scenarios: invalid parameters, namespace deleted, compare failures.
We've explored NVMe's command queue architecture in depth: paired submission and completion queues, the doorbell and phase-bit protocol, Admin versus I/O queues, arbitration, sizing trade-offs, and error handling. Together these form the foundation of NVMe's performance.
What's Next
With queue mechanics mastered, the next page explores NVMe's performance characteristics in depth. We'll examine why NVMe achieves millions of IOPS, analyze latency distributions, explore bandwidth considerations, and understand the real-world performance advantages NVMe provides over legacy storage interfaces.
This knowledge is essential for capacity planning, performance tuning, and understanding when NVMe is (and isn't) the right solution.
You now understand NVMe command queues at implementation depth: queue pair architecture, the doorbell protocol, queue creation/deletion, arbitration mechanisms, sizing strategies, and error handling. This knowledge enables you to implement, debug, or optimize NVMe storage systems.