At the heart of NVMe's performance lies its command queue architecture—a fundamentally different approach to storage I/O than legacy interfaces provided. While AHCI/SATA imposed a single queue of 32 commands as the entire interface between host and device, NVMe scales to 65,535 I/O queue pairs, each capable of holding 65,536 commands.
This isn't merely a larger number; it's a paradigm shift. NVMe queues are designed for lockless per-core operation, deep parallelism that matches the internal parallelism of modern flash, and minimal per-command protocol overhead.
Understanding command queues is essential for anyone implementing NVMe drivers, optimizing storage workloads, or designing NVMe-aware systems. This page provides the complete picture—from queue creation to completion processing.
By the end of this page, you will understand NVMe queue architecture at implementation depth: submission and completion queue mechanics, the doorbell protocol, queue pair creation and deletion, arbitration mechanisms, and queue sizing strategies. You'll be equipped to implement or debug NVMe drivers.
NVMe's queue architecture employs a classic producer-consumer pattern with a crucial twist: the host and controller each produce and consume from different queues.
The Queue Pair Model
HOST CONTROLLER
┌─────────────────┐ ┌─────────────────┐
│ │ │ │
│ Submission │ ───(DMA Read)───► │ Command │
│ Queue │ │ Processor │
│ │ │ │
│ [Producer] │ │ [Consumer] │
└─────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐
│ │ │ │
│ Completion │ ◄──(DMA Write)─── │ Completion │
│ Queue │ │ Generator │
│ │ │ │
│ [Consumer] │ │ [Producer] │
└─────────────────┘ └─────────────────┘
Submission Queue (SQ): a circular buffer in host memory where the driver places fixed-size 64-byte command entries for the controller to fetch.
Completion Queue (CQ): a circular buffer in host memory where the controller writes 16-byte completion entries describing the outcome of finished commands.
Queue Indexing
NVMe uses circular buffer semantics with head and tail pointers:
Submission Queue: the host advances the tail as it enqueues commands; the controller advances the head as it fetches them.
Completion Queue: the controller advances the tail as it posts completions; the host advances the head as it consumes them.
The difference between head and tail indicates queue occupancy:
tail - head (mod queue_size) = number of entries in use

| Queue Type | Pointer | Maintained By | Communicated Via |
|---|---|---|---|
| Submission Queue | Tail | Host (driver) | Doorbell register write |
| Submission Queue | Head | Controller | Returned in completion entry |
| Completion Queue | Head | Host (driver) | Doorbell register write |
| Completion Queue | Tail | Controller | Implicit (phase bit + written entries) |
The completion queue doesn't need an explicit tail pointer because the phase bit provides the same information. When the controller writes a completion, it sets the phase bit. The host distinguishes new completions from old ones by checking if the phase matches the expected value. This elegant design eliminates a memory write (tail update) on every completion.
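The occupancy formula and the phase-bit check both reduce to a few lines of C. A minimal sketch, assuming tail and head always stay in the range [0, queue_size) and that the phase bit occupies bit 0 of the 16-bit completion status word; the helper names are illustrative, not from the specification:

#include <stdbool.h>
#include <stdint.h>

// Entries currently in use: (tail - head) mod queue_size.
// Example: depth 8, head 6, tail 2 -> (2 + 8 - 6) % 8 = 4 entries in use.
static inline uint16_t queue_entries_in_use(uint16_t tail, uint16_t head,
                                            uint16_t depth)
{
    return (uint16_t)((tail + depth - head) % depth);
}

// A completion entry is new if its phase bit matches the phase the host
// currently expects; the expected phase flips each time the host wraps
// from the last CQ slot back to slot 0.
static inline bool cqe_is_new(uint16_t status, uint8_t expected_phase)
{
    return (status & 1) == expected_phase;
}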
Doorbells are the signaling mechanism that coordinates host and controller activities. A doorbell is simply a memory-mapped register that, when written, triggers controller action.
Submission Queue Tail Doorbell (SQyTDBL)
When the host writes to the submission queue tail doorbell, the controller learns the new tail value and fetches the newly queued commands between its current head and that tail.
Completion Queue Head Doorbell (CQyHDBL)
When the host writes to the completion queue head doorbell, it reports how far it has consumed the completion queue, allowing the controller to reuse those CQ slots.
Doorbell Register Layout
Doorbell registers occupy BAR0 starting at offset 0x1000:
Offset 0x1000: Admin Submission Queue 0 Tail Doorbell
Offset 0x1000 + stride: Admin Completion Queue 0 Head Doorbell
Offset 0x1000 + 2*stride: I/O Submission Queue 1 Tail Doorbell
Offset 0x1000 + 3*stride: I/O Completion Queue 1 Head Doorbell
...
Offset 0x1000 + 2*N*stride: I/O Submission Queue N Tail Doorbell
Offset 0x1000 + (2*N+1)*stride: I/O Completion Queue N Head Doorbell
where stride = 4 << CAP.DSTRD (typically 4 bytes)
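The stride formula above translates directly into address arithmetic. A short sketch, assuming a 64-bit CAP value read from offset 0x00 (the helper names are illustrative):

#include <stdint.h>

#define NVME_DOORBELL_BASE 0x1000u

// CAP.DSTRD occupies bits [35:32] of the 64-bit CAP register
static inline uint32_t doorbell_stride(uint64_t cap)
{
    uint32_t dstrd = (uint32_t)((cap >> 32) & 0xF);
    return 4u << dstrd;                       // stride in bytes, typically 4
}

// Byte offsets of the doorbells for queue pair 'qid' within BAR0
static inline uint32_t sq_tail_doorbell_offset(uint16_t qid, uint32_t stride)
{
    return NVME_DOORBELL_BASE + (2u * qid) * stride;
}

static inline uint32_t cq_head_doorbell_offset(uint16_t qid, uint32_t stride)
{
    return NVME_DOORBELL_BASE + (2u * qid + 1u) * stride;
}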
// Complete doorbell protocol implementation
struct nvme_queue {
    void __iomem *sq_doorbell;            // SQ tail doorbell address
    void __iomem *cq_doorbell;            // CQ head doorbell address
    volatile struct nvme_command *sq;     // Submission queue base
    volatile struct nvme_cqe *cq;         // Completion queue base
    uint16_t sq_tail;                     // Host-maintained SQ tail
    uint16_t sq_head;                     // Last known SQ head (from completions)
    uint16_t cq_head;                     // Host-maintained CQ head
    uint16_t queue_depth;                 // Number of entries in each queue
    uint8_t  cq_phase;                    // Expected phase bit
    uint8_t  sqes;                        // Log2(SQ entry size) = 6 (64 bytes)
    uint8_t  cqes;                        // Log2(CQ entry size) = 4 (16 bytes)
    spinlock_t sq_lock;                   // Protect SQ tail updates (if shared)
};

// Submit a command to the submission queue
int nvme_submit_cmd(struct nvme_queue *q, struct nvme_command *cmd)
{
    unsigned long flags;

    spin_lock_irqsave(&q->sq_lock, flags);

    // Check if queue is full:
    // the queue is full when (tail + 1) mod depth == head
    uint16_t next_tail = (q->sq_tail + 1) % q->queue_depth;
    if (next_tail == q->sq_head) {
        spin_unlock_irqrestore(&q->sq_lock, flags);
        return -ENOSPC;                   // Queue full
    }

    // Copy command to queue entry
    memcpy((void *)&q->sq[q->sq_tail], cmd, sizeof(*cmd));

    // Memory barrier: ensure command visible before doorbell
    wmb();

    // Update and ring doorbell
    q->sq_tail = next_tail;
    writel(q->sq_tail, q->sq_doorbell);

    spin_unlock_irqrestore(&q->sq_lock, flags);
    return 0;
}

// Process completions from the completion queue
int nvme_process_completions(struct nvme_queue *q, int budget)
{
    int processed = 0;
    volatile struct nvme_cqe *cqe;
    uint16_t status;

    while (processed < budget) {
        cqe = &q->cq[q->cq_head];

        // Read status with phase bit
        status = READ_ONCE(cqe->status);

        // Check phase bit (bit 0) against expected phase
        if ((status & 1) != q->cq_phase)
            break;                        // No more valid completions

        // Process this completion
        nvme_handle_completion(q, cqe);

        // Update SQ head from completion
        q->sq_head = le16_to_cpu(cqe->sq_head);

        // Advance CQ head with phase wrap
        if (++q->cq_head == q->queue_depth) {
            q->cq_head = 0;
            q->cq_phase ^= 1;             // Flip expected phase
        }

        processed++;
    }

    // Ring CQ doorbell to release processed entries
    if (processed > 0)
        writel(q->cq_head, q->cq_doorbell);

    return processed;
}

The wmb() (write memory barrier) before the doorbell write is essential. Modern CPUs reorder memory operations for performance. Without the barrier, the doorbell write might reach memory before the command is visible, causing the controller to read stale or partial command data.
NVMe distinguishes between two types of queue pairs, each serving distinct purposes.
Admin Queue (Queue ID 0)
The Admin Queue is mandatory—every NVMe controller must have exactly one Admin Queue pair. It's created during controller initialization through the ASQ/ACQ registers, not through commands.
Admin Queue characteristics: it is always queue pair 0, its size is limited to 4,096 entries (configured via the AQA register), and it carries only admin commands, never I/O.
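Because the Admin Queue is configured through registers rather than commands, its setup looks different from the I/O queue creation shown later. A sketch of that register sequence, using the AQA (0x24), ASQ (0x28), and ACQ (0x30) offsets from the specification; the writel/writeq-style MMIO helpers and the nvme_ctrl structure are assumptions of this example:

// AQA: ACQS in bits [27:16], ASQS in bits [11:0], both 0-based
void nvme_setup_admin_queue(struct nvme_ctrl *ctrl,
                            dma_addr_t asq_dma, dma_addr_t acq_dma,
                            uint16_t depth)
{
    uint32_t aqa = ((uint32_t)(depth - 1) << 16) | (depth - 1);  // same size for ASQ and ACQ

    writel(aqa, ctrl->regs + 0x24);       // Admin Queue Attributes
    writeq(asq_dma, ctrl->regs + 0x28);   // Admin SQ base (physical, page aligned)
    writeq(acq_dma, ctrl->regs + 0x30);   // Admin CQ base (physical, page aligned)

    // Only after these are programmed does the driver set CC.EN = 1 and
    // poll CSTS.RDY (offset 0x1C, bit 0) until the controller reports ready
}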
Admin Queue commands include:
| Opcode | Command | Purpose |
|---|---|---|
| 0x00 | Delete I/O SQ | Remove submission queue |
| 0x01 | Create I/O SQ | Allocate submission queue |
| 0x04 | Delete I/O CQ | Remove completion queue |
| 0x05 | Create I/O CQ | Allocate completion queue |
| 0x06 | Identify | Query device/namespace properties |
| 0x09 | Set Features | Configure controller behavior |
| 0x0A | Get Features | Query controller configuration |
| 0x0C | Async Event Request | Register for event notifications |
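As an example of one command from this table, here is a sketch of Identify Controller (opcode 0x06 with CNS = 1 in CDW10, returning a 4,096-byte data structure), reusing the command-building style shown elsewhere on this page; the nvme_admin_identify constant and the submission helper are assumed to exist as in the other examples:

// Identify Controller: CNS = 1, data returned via PRP1
int nvme_identify_controller(struct nvme_ctrl *ctrl, dma_addr_t buf_dma)
{
    struct nvme_command cmd = {};

    cmd.opcode = nvme_admin_identify;            // 0x06
    cmd.nsid = 0;                                // Not namespace-specific for CNS = 1
    cmd.dptr.prp.prp1 = cpu_to_le64(buf_dma);    // 4 KB identify buffer
    cmd.cdw10 = cpu_to_le32(1);                  // CNS = 1: Identify Controller

    return nvme_submit_admin_cmd(ctrl, &cmd);
}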
I/O Queues (Queue ID 1+)
I/O Queues handle actual data operations—reads, writes, flushes, and deallocates. They're created dynamically by the host using Admin Queue commands.
I/O Queue characteristics: queue IDs 1 through 65,535, created and deleted at runtime via admin commands, sized up to CAP.MQES + 1 entries, carrying only I/O commands (Read, Write, Flush, Dataset Management, and so on), with each Submission Queue bound to a Completion Queue at creation time.
Queue Pair Relationship
A key NVMe flexibility: multiple Submission Queues can share a Completion Queue. This enables interrupt consolidation (one vector and one CQ servicing several SQs) and reduces memory and interrupt-vector usage when queue counts are large.
However, typical high-performance configurations use 1:1 SQ:CQ pairing for per-CPU interrupt affinity, lock-free queue access, and better cache locality.
// Creating I/O Queue Pairs via Admin Commands

// Step 1: Create Completion Queue first (SQ requires existing CQ)
int nvme_create_cq(struct nvme_ctrl *ctrl, uint16_t qid, uint16_t depth,
                   dma_addr_t cq_dma, uint16_t vector)
{
    struct nvme_command cmd = {};

    cmd.opcode = nvme_admin_create_cq;            // 0x05
    cmd.nsid = 0;                                 // Admin commands don't target namespaces
    cmd.dptr.prp.prp1 = cpu_to_le64(cq_dma);      // CQ physical address

    // cdw10: Queue Size (0-based) | Queue ID
    cmd.cdw10 = cpu_to_le32(((depth - 1) << 16) | qid);

    // cdw11: Interrupt Vector | Interrupt Enable | Physically Contiguous
    cmd.cdw11 = cpu_to_le32((vector << 16) | CQ_IRQ_ENABLED | CQ_PHYS_CONTIGUOUS);

    return nvme_submit_admin_cmd(ctrl, &cmd);
}

// Step 2: Create Submission Queue linked to the CQ
int nvme_create_sq(struct nvme_ctrl *ctrl, uint16_t qid, uint16_t depth,
                   dma_addr_t sq_dma, uint16_t cqid, enum nvme_queue_prio prio)
{
    struct nvme_command cmd = {};

    cmd.opcode = nvme_admin_create_sq;            // 0x01
    cmd.nsid = 0;
    cmd.dptr.prp.prp1 = cpu_to_le64(sq_dma);      // SQ physical address

    // cdw10: Queue Size (0-based) | Queue ID
    cmd.cdw10 = cpu_to_le32(((depth - 1) << 16) | qid);

    // cdw11: CQ ID | Queue Priority | Physically Contiguous
    cmd.cdw11 = cpu_to_le32((cqid << 16) | (prio << 1) | SQ_PHYS_CONTIGUOUS);

    return nvme_submit_admin_cmd(ctrl, &cmd);
}

// Complete queue pair creation
int nvme_alloc_queue_pair(struct nvme_ctrl *ctrl, int qid, int vector)
{
    struct nvme_queue *q;
    int depth = ctrl->io_queue_depth;
    dma_addr_t sq_dma, cq_dma;

    q = kzalloc(sizeof(*q), GFP_KERNEL);

    // Allocate DMA-coherent memory for queues
    q->sq = dma_alloc_coherent(ctrl->dev, depth * sizeof(struct nvme_command),
                               &sq_dma, GFP_KERNEL);
    q->cq = dma_alloc_coherent(ctrl->dev, depth * sizeof(struct nvme_cqe),
                               &cq_dma, GFP_KERNEL);

    // Initialize queue state
    q->queue_depth = depth;
    q->sq_tail = q->sq_head = 0;
    q->cq_head = 0;
    q->cq_phase = 1;                              // Controller writes with phase=1 initially

    // Calculate doorbell addresses
    uint32_t stride = ctrl->doorbell_stride;
    q->sq_doorbell = ctrl->regs + 0x1000 + (2 * qid * stride);
    q->cq_doorbell = ctrl->regs + 0x1000 + ((2 * qid + 1) * stride);

    // Create CQ first (SQ references CQ by ID)
    nvme_create_cq(ctrl, qid, depth, cq_dma, vector);

    // Create SQ linked to this CQ
    nvme_create_sq(ctrl, qid, depth, sq_dma, qid, NVME_QP_MEDIUM);

    ctrl->queues[qid] = q;
    return 0;
}

Queues must be deleted in reverse order: Submission Queues first, then Completion Queues. You cannot delete a CQ while an SQ references it. The controller will reject the Delete CQ command with Invalid Queue Deletion status.
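To make the N:1 relationship concrete, a brief usage sketch of the helpers above: two Submission Queues reference the same Completion Queue simply by naming the same CQ ID at creation (the DMA addresses and depth are assumed to have been allocated as in nvme_alloc_queue_pair):

// One CQ (ID 1) backed by a single interrupt vector...
nvme_create_cq(ctrl, 1, depth, cq_dma, vector);

// ...shared by two SQs (IDs 1 and 2) that both point at CQ ID 1
nvme_create_sq(ctrl, 1, depth, sq1_dma, 1, NVME_QP_MEDIUM);
nvme_create_sq(ctrl, 2, depth, sq2_dma, 1, NVME_QP_MEDIUM);

Submission and Completion Queue IDs are separate ID spaces, so pairing SQ 2 with CQ 1 is perfectly legal.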
When multiple queues contain pending commands, the controller must decide which queue to service. NVMe defines arbitration mechanisms to control this scheduling.
Arbitration Mechanisms
The controller supports one or more arbitration mechanisms, selected via the Arbitration Mechanism Selected (AMS) field of the CC (Controller Configuration) register:
1. Round Robin
The simplest mechanism: all Submission Queues are treated equally, and the controller services them in rotation, fetching up to an Arbitration Burst worth of commands from each queue per turn.
2. Weighted Round Robin with Urgent Priority Class
More sophisticated scheduling with priority classes:
| Priority Level | Class | Behavior | Use Case |
|---|---|---|---|
| Highest | Urgent | Serviced before all other classes | Latency-critical operations |
| High | High | Higher configurable WRR weight (e.g., 3) | Important application I/O |
| Medium | Medium | Middle configurable WRR weight (e.g., 2) | Normal operations (default) |
| Low | Low | Lowest configurable WRR weight (e.g., 1) | Background tasks, scrubbing |
The Weighted Round Robin operates as follows: after the Admin Queue (which always has the highest strict priority), Urgent-class queues are serviced first in round-robin order among themselves; the High, Medium, and Low classes then share the remaining command slots in proportion to their configured weights, with the controller fetching at most one burst from any queue before moving on.
The burst size is configurable via the Arbitration Burst setting (2^AB commands, where AB ∈ [0,7]; a value of 7 means no burst limit).
3. Vendor Specific
Some controllers implement proprietary arbitration schemes optimized for specific workloads or hardware characteristics.
// Configuring arbitration mechanism and burst size

// Arbitration settings in CC register
#define NVME_CC_AMS_SHIFT 11     // Arbitration Mechanism Selected
#define NVME_CC_AMS_RR    0      // Round Robin
#define NVME_CC_AMS_WRRU  1      // Weighted Round Robin with Urgent

// Set arbitration during controller configuration
void nvme_configure_arbitration(struct nvme_ctrl *ctrl,
                                enum nvme_arb_mech mechanism, int burst_exp)
{
    uint32_t cc = readl(ctrl->regs + NVME_REG_CC);

    // Clear and set arbitration mechanism
    cc &= ~(7 << NVME_CC_AMS_SHIFT);
    cc |= (mechanism << NVME_CC_AMS_SHIFT);
    writel(cc, ctrl->regs + NVME_REG_CC);

    // Set arbitration burst via Set Features
    // Feature ID 0x01: Arbitration
    //   CDW11[2:0]   = AB  (Arbitration Burst)
    //   CDW11[15:8]  = LPW (Low Priority Weight)
    //   CDW11[23:16] = MPW (Medium Priority Weight)
    //   CDW11[31:24] = HPW (High Priority Weight)
    // Weights are 0-based: programmed value + 1 commands per arbitration round
    uint32_t arb_feature = (burst_exp & 0x7) |   // AB
                           (0 << 8)  |           // LPW: effective weight 1
                           (1 << 16) |           // MPW: effective weight 2
                           (2 << 24);            // HPW: effective weight 3

    nvme_set_features(ctrl, NVME_FEAT_ARBITRATION, arb_feature, NULL, 0);
}

// Creating queues with priority assignment
// (depth, sq_dma[], num_fg_queues, and total_queues are assumed to be
//  set up elsewhere; this sketch only shows the priority parameter)
int nvme_create_priority_queues(struct nvme_ctrl *ctrl)
{
    // Create an Urgent priority queue for database commit logs
    nvme_create_sq(ctrl, 1, depth, sq_dma[0], 1, NVME_QPRIO_URGENT);

    // Create High priority queues for foreground I/O
    for (int i = 2; i <= num_fg_queues; i++) {
        nvme_create_sq(ctrl, i, depth, sq_dma[i-1], i, NVME_QPRIO_HIGH);
    }

    // Create Low priority queues for background tasks
    for (int i = num_fg_queues + 1; i <= total_queues; i++) {
        nvme_create_sq(ctrl, i, depth, sq_dma[i-1], i, NVME_QPRIO_LOW);
    }

    return 0;
}

NVMe 1.x priority is per-queue, not per-command. All commands in a queue share the same priority. For mixed-priority workloads, create separate queues for each priority level. NVMe 2.0 introduces I/O Priority (I/O Priorities feature) for finer-grained control.
Optimal queue configuration balances resource usage against performance. Both queue depth (entries per queue) and queue count (number of queue pairs) require careful consideration.
Determining Maximum Capabilities
The controller advertises its limits in the CAP register and Identify data:
// From CAP register (offset 0x00)
max_queue_entries = (CAP & 0xFFFF) + 1;   // MQES field (0-based)
cqr = (CAP >> 16) & 0x1;                  // CQR: Contiguous Queues Required

// The maximum I/O queue count is not in Identify Controller; it is
// negotiated with Set Features, Feature ID 0x07 (Number of Queues)
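A sketch of that negotiation, assuming a hypothetical nvme_set_features_result() helper that behaves like the nvme_set_features() used earlier but also returns the completion's Dword 0:

#define NVME_FEAT_NUM_QUEUES 0x07

// Request 'want' I/O SQs and CQs; the controller replies with the counts
// actually allocated. Both CDW11 and Dword 0 fields are 0-based.
int nvme_negotiate_queue_count(struct nvme_ctrl *ctrl, uint16_t want,
                               uint16_t *sqs_granted, uint16_t *cqs_granted)
{
    uint32_t dw11 = ((uint32_t)(want - 1) << 16) | (want - 1);  // NCQR | NSQR
    uint32_t result;
    int ret;

    ret = nvme_set_features_result(ctrl, NVME_FEAT_NUM_QUEUES, dw11, &result);
    if (ret)
        return ret;

    *sqs_granted = (result & 0xFFFF) + 1;           // NSQA
    *cqs_granted = ((result >> 16) & 0xFFFF) + 1;   // NCQA
    return 0;
}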
Queue Depth Considerations
| Factor | Shallow Queues (<128) | Deep Queues (1024+) |
|---|---|---|
| Memory Usage | Lower (~10 KB per 128-entry pair) | Higher (80 KB at 1,024 entries, ~5 MB at 65,536) |
| Queue Full Events | More frequent stalls | Rare, sustained throughput |
| Latency Under Load | Bounded queuing delay, but stalls when full | Higher queuing delay at saturation, throughput sustained |
| Flash Utilization | May leave channels idle | Full parallelism exploited |
| Command Tracking | Less driver bookkeeping | More context memory needed |
Practical Queue Depth Guidelines
By Little's Law, the queue depth should be at least:
min_depth = target_iops × average_latency_seconds
Example: 100,000 IOPS × 0.00010s latency = 10 outstanding
(Add margin: 4× = 40 entries minimum)
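The same rule of thumb as a tiny helper (the 4x margin mirrors the example above; the function name is purely illustrative):

#include <stdint.h>

// Little's Law: outstanding commands = IOPS x latency; add a 4x margin
// so bursts do not immediately hit queue-full.
static inline uint32_t min_queue_depth(uint32_t target_iops, double avg_latency_sec)
{
    double outstanding = (double)target_iops * avg_latency_sec;
    uint32_t depth = (uint32_t)(outstanding * 4.0 + 0.5);
    return depth < 2 ? 2 : depth;   // e.g. 100,000 IOPS x 100 us -> 40
}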
Queue Count Strategies
Per-CPU Queues: The most common approach
// Standard per-CPU queue allocation
int create_per_cpu_queues(struct nvme_ctrl *ctrl) {
int max_queues = min3(num_online_cpus(),
ctrl->max_io_queues,
ctrl->max_msix_vectors - 1);
for (int i = 0; i < max_queues; i++) {
int cpu = cpumask_nth(i, cpu_online_mask);
int node = cpu_to_node(cpu);
// Allocate queue memory from local NUMA node
alloc_queue_on_node(ctrl, i + 1, node);
}
return max_queues;
}
Shared Queues: in resource-constrained environments (limited MSI-X vectors, many vCPUs, embedded controllers), several CPUs share a smaller set of queue pairs, trading some lock contention for lower memory and interrupt-vector usage.
Polling Queues
Some workloads benefit from dedicated polling queues—queues without interrupt vectors that the CPU continuously polls:
// Creating a polling queue: the Create I/O CQ command is issued with the
// Interrupt Enable (IEN) bit cleared in CDW11, so no interrupt is ever generated
nvme_create_cq(ctrl, qid, depth, cq_dma, 0 /* vector ignored when IEN = 0 */);
// Polling loop
while (running) {
int found = nvme_poll_queue(queue, 16); // Check 16 entries
if (!found)
cpu_relax(); // Yield CPU resources briefly
}
Modern kernels (Linux 5.0+) support mixed polling/interrupt modes with io_uring and blk-mq poll queues.
A typical server with 64 cores and an enterprise NVMe SSD might run with 64 I/O queue pairs (plus admin), each with 1024 entries. Total queue memory: 64 × (64KB SQ + 16KB CQ) ≈ 5 MB—modest compared to benefits gained.
Let's trace a complete read operation through the NVMe command queue system, from application request to completion.
Step 1: Application Issues Read
Application → System Call → Block Layer → nvme-blk Driver
The application calls read() or pread(), and the block layer dispatches the request to the NVMe driver.
Step 2: Driver Prepares Command
The driver allocates a command tag and builds the NVMe Read command:
// Prepare read command
struct nvme_command cmd = {
.opcode = nvme_cmd_read, // 0x02
.nsid = cpu_to_le32(nsid), // Target namespace
.cdw10 = cpu_to_le32(slba & 0xFFFFFFFF), // Starting LBA low
.cdw11 = cpu_to_le32(slba >> 32), // Starting LBA high
.cdw12 = cpu_to_le32(nlb - 1), // Number of blocks (0-based)
};
cmd.command_id = tag; // For completion matching
// Setup PRP for data buffer
nvme_setup_prps(&cmd, buffer_dma, nlb * block_size);
Step 3: Command Submission
┌────────────────────────────────────────────────┐
│ Host Memory - Submission Queue │
├────────────────────────────────────────────────┤
│ [0] completed [1] completed [2] NEW_CMD ← │
│ tail=3 │
└────────────────────────────────────────────────┘
│
▼ doorbell write
┌────────────────────────────────────────────────┐
│ NVMe Controller │
│ Doorbell received: tail=3, fetching... │
└────────────────────────────────────────────────┘
The driver: copies the 64-byte command into SQ slot 2, advances its local tail to 3, and writes 3 to the SQ tail doorbell.
Step 4: Controller Processing
The controller: reads the new tail value from the doorbell, DMA-fetches the command from slot 2, decodes the Read, translates the LBA range, reads the data from flash, and DMA-writes it into the host buffer described by the PRP entries.
Step 5: Completion
┌────────────────────────────────────────────────┐
│ NVMe Controller │
│ Read complete, writing completion entry │
└────────────────────────────────────────────────┘
│
▼ DMA write
┌────────────────────────────────────────────────┐
│ Host Memory - Completion Queue │
├────────────────────────────────────────────────┤
│ [0] old (P=0) [1] NEW (P=1) ← cq_head=1 │
│ ^ │
│ command_id=tag │
│ sq_head=3 │
│ status=SUCCESS │
└────────────────────────────────────────────────┘
│
▼ MSI-X interrupt
┌────────────────────────────────────────────────┐
│ CPU Interrupt Handler │
└────────────────────────────────────────────────┘
Step 6: Completion Processing
The interrupt handler or polling loop: checks the phase bit, looks up the outstanding request by command_id, records the status and new sq_head, advances cq_head (flipping the expected phase on wrap), rings the CQ head doorbell, and wakes the waiting thread.
// Complete read operation flow
int nvme_submit_read(struct nvme_ns *ns, void *buffer, sector_t lba,
                     unsigned int sectors)
{
    struct nvme_ctrl *ctrl = ns->ctrl;
    struct nvme_queue *q = get_queue_for_cpu(ctrl);
    struct nvme_request *req;
    int tag;

    // Allocate request tracking structure
    tag = ida_simple_get(&q->tag_ida, 0, q->queue_depth, GFP_KERNEL);
    req = &q->requests[tag];
    req->buffer = buffer;
    init_completion(&req->done);

    // Build and submit command
    struct nvme_command cmd = {};
    cmd.opcode = nvme_cmd_read;
    cmd.command_id = tag;
    cmd.nsid = cpu_to_le32(ns->nsid);
    cmd.cdw10 = cpu_to_le32(lba);
    cmd.cdw11 = cpu_to_le32(lba >> 32);
    cmd.cdw12 = cpu_to_le32(sectors - 1);
    nvme_setup_prps(&cmd, req->buffer_dma, sectors * 512);

    nvme_submit_cmd(q, &cmd);

    // Wait for completion
    wait_for_completion(&req->done);

    int status = req->status;
    ida_simple_remove(&q->tag_ida, tag);
    return status;
}

// Called from interrupt handler for each completion
void nvme_handle_completion(struct nvme_queue *q, volatile struct nvme_cqe *cqe)
{
    uint16_t tag = cqe->command_id;
    uint16_t status = le16_to_cpu(cqe->status) >> 1;  // Remove phase bit
    struct nvme_request *req = &q->requests[tag];

    req->status = status;

    // Notify waiting thread
    complete(&req->done);
}

Using the command_id as a direct index into a request array enables O(1) completion lookup—no hash tables or searches needed. This is why command_ids are typically constrained to [0, queue_depth) rather than arbitrary values.
Robust queue error handling is essential for production NVMe systems. Errors can occur at multiple levels—command failures, queue failures, and controller-level failures.
Command-Level Errors
The completion entry's status field indicates success or failure:
struct nvme_cqe {
// ...
uint16_t status; // Bits [15:1] = status, Bit [0] = phase
};
// Status Code Type (SCT) in bits [11:9]
#define NVME_SCT_GENERIC 0x0 // Generic command status
#define NVME_SCT_CMD_SPEC 0x1 // Command-specific status
#define NVME_SCT_MEDIA 0x2 // Media and data errors
#define NVME_SCT_PATH 0x3 // Path-related status (NVMe-oF)
#define NVME_SCT_VENDOR 0x7 // Vendor-specific
// Status Code (SC) in bits [8:1]
// Example Generic status codes:
#define NVME_SC_SUCCESS 0x00
#define NVME_SC_INVALID_OPCODE 0x01
#define NVME_SC_INVALID_FIELD 0x02
#define NVME_SC_DATA_XFER_ERROR 0x04
#define NVME_SC_ABORTED_POWER_LOSS 0x05
#define NVME_SC_INTERNAL 0x06
#define NVME_SC_ABORT_REQ 0x07
#define NVME_SC_SQ_DELETED 0x08
Common Error Responses:
| Error | Cause | Retry? | Action |
|---|---|---|---|
| Invalid Opcode | Unsupported command | No | Feature not available |
| Invalid Field | Bad parameter | No | Fix command structure |
| Data Transfer Error | DMA failure | Yes | Retry with backoff |
| Namespace Not Ready | NS being formatted | Yes | Wait, then retry |
| Media Error | Uncorrectable ECC | No | Report to filesystem |
| Abort Requested | Explicit abort | Maybe | Depends on context |
| Compare Failure | Data mismatch | No | Report to application |
| Write Fault | Flash write failed | Yes | Controller may relocate |
// Comprehensive error handling
int nvme_complete_request(struct nvme_queue *q, volatile struct nvme_cqe *cqe)
{
    uint16_t raw = le16_to_cpu(cqe->status);
    uint16_t status = (raw >> 1) & 0x7FF;      // SCT + SC, phase bit stripped
    uint8_t sct = (status >> 8) & 0x7;         // Status Code Type
    uint8_t sc = status & 0xFF;                // Status Code
    bool dnr = raw & (1 << 15);                // Do Not Retry: bit 15 of the raw field

    struct nvme_request *req = nvme_find_request(q, cqe->command_id);

    if (sc == NVME_SC_SUCCESS) {
        req->error = 0;
        goto complete;
    }

    // Classify error and decide on retry
    switch (sct) {
    case NVME_SCT_GENERIC:
        switch (sc) {
        case NVME_SC_INVALID_OPCODE:
        case NVME_SC_INVALID_FIELD:
            // Configuration errors - don't retry
            req->error = -EINVAL;
            break;

        case NVME_SC_DATA_XFER_ERROR:
        case NVME_SC_INTERNAL:
            // Transient errors - retry if allowed
            if (!dnr && req->retries < MAX_RETRIES) {
                req->retries++;
                nvme_resubmit_request(q, req);
                return 0;                      // Don't complete yet
            }
            req->error = -EIO;
            break;

        case NVME_SC_NS_NOT_READY:
            // Namespace busy - delay and retry
            // (In a real driver the delay must be deferred, e.g. via a delayed
            //  work item; sleeping in completion/interrupt context is not allowed)
            msleep(100);
            if (req->retries++ < MAX_RETRIES) {
                nvme_resubmit_request(q, req);
                return 0;
            }
            req->error = -EBUSY;
            break;

        default:
            req->error = -EIO;
        }
        break;

    case NVME_SCT_MEDIA:
        // Media errors - typically unrecoverable for reads
        switch (sc) {
        case NVME_SC_UNWRITTEN_BLOCK:
            req->error = -ENODATA;
            break;
        case NVME_SC_ECC_ERROR:
            req->error = -EBADMSG;
            break;
        default:
            req->error = -EIO;
        }
        break;

    default:
        req->error = -EIO;
    }

complete:
    complete(&req->done);
    return 1;                                  // Completed
}

// Queue-level error: submission queue overflow
int nvme_submit_cmd_safe(struct nvme_queue *q, struct nvme_command *cmd)
{
    int retries = 0;
    int ret;

    while ((ret = nvme_submit_cmd(q, cmd)) == -ENOSPC) {
        // Queue full - process completions to free slots
        nvme_process_completions(q, 16);

        if (++retries > 100) {
            // Queue persistently full - may indicate stuck controller
            dev_err(q->dev, "Queue %d persistently full\n", q->qid);
            return -EBUSY;
        }
        cpu_relax();
    }

    return ret;
}

When DNR is set in the completion status, the controller explicitly forbids retry—the error is deterministic. Ignoring DNR and retrying wastes time and resources. Common DNR scenarios: invalid parameters, namespace deleted, compare failures.
We've explored NVMe's command queue architecture in depth: paired submission and completion queues, the doorbell and phase-bit protocol, Admin versus I/O queues, arbitration, sizing trade-offs, and error handling. Together these form the foundation of NVMe's performance.
What's Next
With queue mechanics mastered, the next page explores NVMe's performance characteristics in depth. We'll examine why NVMe achieves millions of IOPS, analyze latency distributions, explore bandwidth considerations, and understand the real-world performance advantages NVMe provides over legacy storage interfaces.
This knowledge is essential for capacity planning, performance tuning, and understanding when NVMe is (and isn't) the right solution.
You now understand NVMe command queues at implementation depth: queue pair architecture, the doorbell protocol, queue creation/deletion, arbitration mechanisms, sizing strategies, and error handling. This knowledge enables you to implement, debug, or optimize NVMe storage systems.