Imagine a highway with two types of traffic: regular commuters (CPU memory accesses) and delivery trucks (DMA transfers). Block mode DMA is like closing the highway to all commuters while a convoy of trucks passes—efficient for the trucks, but disruptive for everyone else.
Cycle stealing offers an alternative: trucks merge into regular traffic, taking one lane for one moment at a time.
In cycle stealing mode, the DMA controller 'steals' individual bus cycles from the CPU without fully blocking it. The CPU is briefly stalled while DMA moves one unit of data, then immediately resumes. This interleaving minimizes CPU disruption while still enabling DMA progress.
When is this tradeoff worthwhile? When should cycle stealing be preferred over block transfers? This section answers both questions.
By the end of this page, you will understand the mechanics of cycle stealing at the hardware level, master the tradeoffs between cycle stealing and block transfer modes, learn to calculate performance impact on both DMA and CPU, and recognize scenarios where cycle stealing is the optimal choice.
Cycle stealing is a DMA transfer mode where the controller acquires the bus for exactly one transfer unit, completes that transfer, then releases the bus. The CPU can execute until it needs bus access, at which point it may find the DMA controller has 'stolen' the next bus cycle.
Unlike block mode where DMA holds the bus for an entire transfer, cycle stealing creates a fine-grained interleaving:
```text
Cycle Stealing vs Block Mode - Timeline Comparison
==================================================

Block Mode (DMA transfers 8 words):
-----------------------------------
Time:  1    2    3    4    5    6    7    8    9    10   11   12
       │    │    │    │    │    │    │    │    │    │    │    │
CPU:  ████ ▒▒▒▒ ▒▒▒▒ ▒▒▒▒ ▒▒▒▒ ▒▒▒▒ ▒▒▒▒ ▒▒▒▒ ▒▒▒▒ ████ ████ ████
      RUN  ──────────────BLOCKED────────────────── RUN  RUN  RUN
                 (waiting for DMA to finish)

DMA:  ▒▒▒▒ ████ ████ ████ ████ ████ ████ ████ ████ ▒▒▒▒ ▒▒▒▒ ▒▒▒▒
      REQ  D0   D1   D2   D3   D4   D5   D6   D7   IDLE IDLE IDLE

Result: CPU completely blocked for 8 cycles, then runs uninterrupted

Cycle Stealing Mode (same 8 words):
-----------------------------------
Time:  1    2    3    4    5    6    7    8    9    10   11   12
       │    │    │    │    │    │    │    │    │    │    │    │
CPU:  ████ ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒
      RUN  STL  RUN  STL  RUN  STL  RUN  STL  RUN  STL  RUN  STL

DMA:  ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒ ████
      IDLE D0   WAIT D1   WAIT D2   WAIT D3   WAIT D4   WAIT D5

Result: CPU runs 50% of cycles, DMA takes 50%, interleaved perfectly

Legend:
████   Active (executing/transferring)
▒▒▒▒   Blocked/Waiting/Idle
STL    Cycle "stolen" by DMA
D0-D7  Data words transferred
```

Cycle stealing requires specific hardware support: bus request/grant arbitration lines, and a DMA controller that releases the bus after every transfer unit.
The term 'stealing' reflects the CPU's perspective: cycles that 'should' belong to the CPU are taken by the DMA controller. In practice, it's cooperative—both share the bus according to arbitration rules. But from an old-school programmer's viewpoint, the DMA controller was an interloper that occasionally stole away the processor's rightful bus access.
Here's a crucial insight: modern CPUs don't use the memory bus on every cycle. Processor caches, out-of-order execution, and speculative fetch create many cycles where the CPU has no pending bus transactions.
During these idle bus cycles, DMA transfers can proceed without affecting CPU performance at all.
| Workload | Cache Hit Rate | CPU Bus Utilization | Free Cycles for DMA |
|---|---|---|---|
| Integer compute (no memory) | ~100% | ~5% | ~95% |
| Cached data processing | ~90% | ~15% | ~85% |
| Database queries (warm cache) | ~70% | ~30% | ~70% |
| Large array streaming | ~30% | ~60% | ~40% |
| Uncached random access | ~5% | ~85% | ~15% |
When CPU cache hit rates are high, cycle stealing DMA can achieve near-block-mode throughput with minimal CPU impact:
Scenario: CPU with 90% cache hit rate, DMA transferring continuously
Without DMA:
- CPU uses bus 10% of cycles
- Bus idle 90% of cycles
With Cycle Stealing DMA:
- CPU uses bus 10% of cycles (unchanged!)
- DMA uses bus ~90% of cycles (using otherwise-idle capacity)
- Total bus utilization: ~100%
- CPU slowdown: ~0% (DMA only takes idle cycles)
Key insight: Cycle stealing is most effective when CPU bus utilization is low. It becomes problematic only when CPU and DMA are both bus-hungry.
On workloads with good cache locality, cycle stealing DMA essentially gets free bus access during CPU cache hits. The CPU never knows its cycles were 'stolen' because it wasn't using them anyway. This makes cycle stealing attractive for background transfers on general-purpose systems.
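The headroom argument above can be sketched in a few lines. This is an illustrative model, not any real controller's API: it assumes the CPU has strict priority and the DMA engine takes only leftover cycles, so the visible CPU slowdown is zero until the bus saturates.

```c
#include <assert.h>

// All fractions are percentages of total bus cycles.
typedef struct {
    int cpu_pct;       // bus cycles the CPU actually gets
    int dma_pct;       // bus cycles the DMA engine gets
    int slowdown_pct;  // visible CPU slowdown
} bus_share;

// Hypothetical "steal only idle cycles" policy: the CPU is never
// displaced, and DMA absorbs whatever capacity remains.
static bus_share idle_stealing(int cpu_demand_pct, int dma_demand_pct) {
    bus_share s;
    s.cpu_pct = cpu_demand_pct;           // CPU has priority
    int idle = 100 - cpu_demand_pct;      // cycles left over
    s.dma_pct = dma_demand_pct < idle ? dma_demand_pct : idle;
    s.slowdown_pct = 0;                   // DMA never displaces the CPU
    return s;
}
```

With the scenario's numbers (CPU at 10% bus utilization, DMA wanting 90%), the model reproduces the result above: DMA gets its full 90% and the CPU sees no slowdown.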
Both modes have distinct characteristics that make them suitable for different scenarios. Understanding the tradeoffs enables proper mode selection:
Let's calculate actual throughput and latency for a 1KB transfer with different modes:
Assumptions: 10 ns bus cycle, 4-byte bus width (1KB = 256 words), 2-cycle arbitration, 1-cycle address phase, 1-cycle bus release.
```text
1KB (256-word) Transfer Analysis
=================================

Block Mode:
-----------
Arbitration (once):    2 cycles
Address phase (once):  1 cycle
Data transfer:         256 cycles
Release bus:           1 cycle
─────────────────────────────────
Total:                 260 cycles
Time:                  260 × 10ns = 2.6 µs
Throughput:            1024 bytes / 2.6 µs = 394 MB/s
CPU blocked:           100% for 2.6 µs (worst-case latency)

Cycle Stealing (1 word at a time):
----------------------------------
Per word:
  Arbitration:    2 cycles
  Address phase:  1 cycle
  Data transfer:  1 cycle
  Release bus:    1 cycle
  Per-word total: 5 cycles

For 256 words:
  Total:      256 × 5 = 1280 cycles
  Time:       1280 × 10ns = 12.8 µs
  Throughput: 1024 bytes / 12.8 µs = 80 MB/s

CPU blocked per steal:        5 cycles (50ns)
CPU available between steals: Variable (when DMA pauses for next request)

Cycle Stealing with Transparent Stealing (during cache hits):
-------------------------------------------------------------
Assuming CPU cache hit rate of 80%:
  Visible stalls to CPU:  20% of 1280 cycles = 256 cycles
  Effective CPU slowdown: 256 cycles / (total CPU work time)

  If CPU work would have taken 5000 cycles without DMA:
    With DMA: 5000 + 256 = 5256 cycles
    Slowdown: 5.1% (nearly invisible!)

Trade-off Summary (1KB transfer):
---------------------------------
                       Block Mode    Cycle Stealing
DMA completion:        2.6 µs        12.8 µs
Throughput:            394 MB/s      80 MB/s
Worst-case CPU stall:  2.6 µs        50 ns
Average CPU impact:    2.6 µs block  5% slowdown*

*Assuming 80% cache hit rate
```

In this example, cycle stealing achieves only 20% of block mode throughput. For bulk transfers where absolute speed matters (disk I/O, network packets), block mode is clearly superior. Cycle stealing's advantage is limiting worst-case CPU latency, not maximizing DMA speed.
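The arithmetic above can be reproduced in a few lines of C. The constants mirror the stated assumptions (10 ns bus cycle, 2-cycle arbitration, 1-cycle address phase, 1-cycle release); the function names are ours, not from any DMA API.

```c
#include <assert.h>

#define CYCLE_NS    10  // assumed 10 ns bus cycle (100 MHz bus)
#define ARB_CYCLES   2  // arbitration
#define ADDR_CYCLES  1  // address phase
#define REL_CYCLES   1  // bus release

// Block mode: pay the per-transfer overhead once for the whole block.
static long block_mode_cycles(long words) {
    return ARB_CYCLES + ADDR_CYCLES + words + REL_CYCLES;
}

// Single-word cycle stealing: pay the full overhead on every word.
static long cycle_steal_cycles(long words) {
    return words * (ARB_CYCLES + ADDR_CYCLES + 1 + REL_CYCLES);
}

// Throughput in MB/s (bytes per nanosecond is GB/s; scale by 1000).
static double throughput_mb_s(long bytes, long cycles) {
    double ns = (double)cycles * CYCLE_NS;
    return bytes / ns * 1000.0;
}
```

For the 256-word transfer this yields 260 vs. 1280 cycles, i.e. roughly 394 MB/s for block mode against 80 MB/s for single-word stealing, matching the table above.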
Cycle stealing finds its primary use case in real-time systems where worst-case timing is more important than average throughput.
In real-time systems, missing a deadline is a failure—regardless of average performance. Consider a medical device monitoring a patient's heartbeat:
With Block Mode DMA:
- Block transfer time (128 bytes at 100 MHz, 4-byte width): ~33 cycles = 330 ns
- But if the DMA transfer starts just as a critical interrupt arrives:
  - Interrupt latency += 330 ns (blocked waiting for DMA)
  - Plus interrupt overhead
  - Risk: may exceed the 500 µs budget in the worst case

With Cycle Stealing:
- Worst-case single interrupt latency: 50 ns (one DMA steal)
- DMA still progresses (more slowly), but critical timing is met
- The predictable 50 ns worst-case latency is GUARANTEED
In general-purpose computing, we optimize for average case—a 1% chance of 1ms delay is acceptable. In hard real-time systems, we must guarantee worst-case—even a 0.0001% chance of deadline miss is unacceptable. Cycle stealing's value is its guaranteed bounded latency, not its average performance.
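The worst-case bound is simple enough to state as code. This is a sketch under the earlier assumptions (10 ns bus cycle, 5-cycle steal including arbitration); `worst_case_dma_stall_ns` is a hypothetical helper, not a real API.

```c
#include <assert.h>

#define CYCLE_NS 10  // assumed 10 ns bus cycle

// Worst-case extra interrupt latency (ns) contributed by an in-flight
// DMA transfer. In block mode the CPU may have to wait out the entire
// block; in cycle stealing it waits for at most one steal.
static long worst_case_dma_stall_ns(int block_mode, long block_cycles,
                                    long steal_cycles) {
    if (block_mode)
        return block_cycles * CYCLE_NS;  // whole block in the worst case
    return steal_cycles * CYCLE_NS;      // one steal, then the CPU runs
}
```

For the earlier 1KB example (260-cycle block, 5-cycle steal) this gives 2.6 µs vs. 50 ns: the hard real-time guarantee is precisely this bound, independent of transfer size.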
Implementing cycle stealing properly requires attention to several hardware and software details.
The definitions below (`dma_addr_t`, `struct dma_regs`, the `DMA_*` flag constants, and the global `dma_controller`) are assumed to come from a hypothetical platform's DMA header; the mode bits follow the style of classic controllers such as the Intel 8237.

```c
// Configuring DMA for Cycle Stealing Mode

// Mode register bits (typical DMA controller)
#define DMA_MODE_DEMAND   0x00  // Transfer on device request only
#define DMA_MODE_SINGLE   0x40  // Cycle stealing: one unit per bus grant
#define DMA_MODE_BLOCK    0x80  // Block: hold bus for entire transfer
#define DMA_MODE_CASCADE  0xC0  // Cascade: for chaining controllers

// Configure channel for cycle stealing
void setup_cycle_stealing_dma(int channel, dma_addr_t src,
                              dma_addr_t dst, size_t size) {
    struct dma_regs *regs = &dma_controller.channel[channel];

    // Disable channel while configuring
    regs->control = 0;

    // Set source and destination
    regs->source_addr = src;
    regs->dest_addr = dst;
    regs->transfer_count = size;

    // Configure for cycle stealing (single transfer mode)
    regs->mode = DMA_MODE_SINGLE |       // Cycle stealing mode
                 DMA_DIR_READ |          // Device to memory
                 DMA_AUTOINIT_DISABLE |  // Don't restart when done
                 DMA_ADDR_INCREMENT;     // Increment destination

    // Set priority (lower = less aggressive stealing)
    regs->priority = DMA_PRIORITY_LOW;   // Be gentle with the CPU

    // Enable channel
    regs->control = DMA_CTRL_ENABLE | DMA_CTRL_IRQ_ON_COMPLETE;
}

// Contrast: Block mode configuration
void setup_block_dma(int channel, dma_addr_t src,
                     dma_addr_t dst, size_t size) {
    struct dma_regs *regs = &dma_controller.channel[channel];

    regs->control = 0;
    regs->source_addr = src;
    regs->dest_addr = dst;
    regs->transfer_count = size;

    // Configure for block mode
    regs->mode = DMA_MODE_BLOCK |        // Block mode: hold bus
                 DMA_DIR_READ |
                 DMA_AUTOINIT_DISABLE |
                 DMA_ADDR_INCREMENT;

    regs->priority = DMA_PRIORITY_HIGH;  // Complete quickly

    regs->control = DMA_CTRL_ENABLE | DMA_CTRL_IRQ_ON_COMPLETE;
}

// Hybrid approach: Burst stealing
// Transfer several words per bus grant, but still release the bus
// between bursts
void setup_burst_stealing_dma(int channel, dma_addr_t src,
                              dma_addr_t dst, size_t size,
                              int burst_size) {  // 4, 8, or 16 words
    struct dma_regs *regs = &dma_controller.channel[channel];

    regs->control = 0;
    regs->source_addr = src;
    regs->dest_addr = dst;
    regs->transfer_count = size;

    // Cycle stealing with burst
    regs->mode = DMA_MODE_SINGLE | DMA_DIR_READ;
    regs->burst_length = burst_size;  // Words per steal

    regs->control = DMA_CTRL_ENABLE | DMA_CTRL_IRQ_ON_COMPLETE;
}
```

Cycle stealing can be tuned by adjusting how aggressively the DMA controller requests bus cycles:
Aggressive Stealing: request the bus on every available cycle, maximizing DMA throughput at the cost of more frequent CPU stalls.
Polite Stealing: request the bus only when the CPU has no pending access, minimizing CPU impact at the cost of DMA progress.
Throttled Stealing: cap the steal rate (e.g., at most one steal every N cycles) to bound DMA's share of bus bandwidth.
Cycle stealing is fundamentally about bandwidth sharing between CPU and DMA. Understanding bandwidth allocation helps size systems appropriately.
```text
Memory Bandwidth Sharing Analysis
=================================

System Parameters:
- Memory bandwidth:       10 GB/s (DDR4 single channel)
- CPU memory demand:      3 GB/s (30% utilization)
- DMA device requirement: 500 MB/s

Question: Will cycle stealing DMA impact CPU performance?

Analysis:
---------
Available bandwidth:  10,000 MB/s
CPU demand:            3,000 MB/s
DMA demand:              500 MB/s
────────────────────────────────────
Total demand:          3,500 MB/s (35% utilization)

Since 3,500 < 10,000, both can be satisfied simultaneously!

With cycle stealing:
- System alternates CPU and DMA accesses
- Both complete at full speed (bandwidth not saturated)
- CPU: 3 GB/s achieved ✓
- DMA: 500 MB/s achieved ✓

Oversubscribed Example:
-----------------------
CPU demand: 8 GB/s (80% utilization)
DMA demand: 4 GB/s

Total demand: 12,000 MB/s (120% of capacity)

With cycle stealing:
- Bandwidth shared proportionally (if fair arbiter)
- CPU gets: 8/12 × 10 GB/s = 6.67 GB/s (17% shortfall)
- DMA gets: 4/12 × 10 GB/s = 3.33 GB/s (17% shortfall)

Both are slowed by the same proportion.

Priority-Based Sharing:
-----------------------
If CPU has priority (DMA steals only when CPU doesn't need bus):
- CPU gets: 8 GB/s (full demand)
- DMA gets: 10 - 8 = 2 GB/s (50% of demand)

If DMA has priority:
- DMA gets: 4 GB/s (full demand)
- CPU gets: 10 - 4 = 6 GB/s (25% shortfall)
```

Advanced systems implement bandwidth reservation to guarantee specific allocations:
Time-Division Multiplexing: divide bus time into fixed slots and reserve recurring slots for DMA, guaranteeing its bandwidth regardless of CPU load.
Credit-Based Throttling: grant each requester a budget of access credits per time window; a requester that exhausts its credits must wait for the next window.
These mechanisms transform cycle stealing from 'opportunistic' to 'guaranteed' bandwidth allocation.
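The two sharing policies from the bandwidth analysis above can be written out directly. These are sketches of the arithmetic, not real arbiter interfaces; all values are in MB/s.

```c
#include <assert.h>

// Fair arbiter: when total demand exceeds capacity, every requester
// is scaled down proportionally (demand × capacity / total_demand).
static double fair_share(double demand, double total_demand,
                         double capacity) {
    if (total_demand <= capacity) return demand;  // no contention
    return demand * capacity / total_demand;      // proportional cut
}

// Strict priority: a lower-priority requester gets only the bandwidth
// left after the higher-priority requester is fully served.
static double leftover_share(double demand, double higher_prio_demand,
                             double capacity) {
    double left = capacity - higher_prio_demand;
    if (left < 0) left = 0;
    return demand < left ? demand : left;
}
```

Plugging in the oversubscribed example (CPU 8 GB/s, DMA 4 GB/s, 10 GB/s capacity) reproduces the numbers above: a fair arbiter gives 6.67/3.33 GB/s, while CPU priority leaves DMA only 2 GB/s.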
Modern systems (PCIe, AMBA AXI) use sophisticated QoS mechanisms that generalize cycle stealing concepts. Virtual channels, traffic classes, and credit-based flow control provide fine-grained bandwidth management far beyond simple cycle stealing. But the core principle—sharing bandwidth while limiting worst-case latency—remains the same.
Pure single-word cycle stealing is rarely used in modern systems. Instead, several variants offer improved characteristics:
Transparent DMA
Concept: Only steal cycles when the CPU isn't using the bus. If the CPU needs the bus, it gets priority.
Implementation:
Limitations:
Interleaved DMA
Concept: Guarantee a specific interleave ratio, e.g., 1 DMA cycle for every 3 CPU cycles.
Implementation:
Used in: Systems requiring predictable behavior for both CPU and DMA (real-time, embedded)
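A fixed 1:N interleave can be enforced by a trivially simple slot arbiter. The function below is a hypothetical illustration: out of every N+1 bus slots, exactly one goes to DMA, pinning its share of the bus regardless of load on either side.

```c
#include <assert.h>

// Returns nonzero if this bus slot is reserved for DMA under a fixed
// 1:N interleave (N CPU slots for every DMA slot). With N = 3, DMA
// gets exactly 25% of the bus: slots 0, 4, 8, ...
static int slot_goes_to_dma(long slot, int cpu_slots_per_dma_slot) {
    return slot % (cpu_slots_per_dma_slot + 1) == 0;
}
```

Because the schedule is purely positional, both sides get a hard guarantee: DMA throughput is fixed, and the CPU is never stalled for more than one slot in a row.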
| Variant | Throughput | Worst-Case Latency | Predictability |
|---|---|---|---|
| Pure Single-Word | Low | Excellent (1 cycle) | High |
| Burst (8-word) | Medium | Good (8 cycles) | High |
| Transparent | Variable | Excellent (0 visible) | Low |
| Interleaved (1:3) | Fixed (25%) | Good (bounded) | Very High |
| Block Mode | High | Poor (entire block) | High |
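One rough way to apply this table is to bound the worst-case CPU stall first, then maximize throughput within that bound. The heuristic below is our own sketch; the thresholds (1 cycle, 8 cycles) are invented for illustration and would be tuned per system.

```c
#include <assert.h>
#include <string.h>

// Pick a DMA mode given the longest CPU stall the system can tolerate
// (in bus cycles) and whether DMA throughput is a priority.
// Illustrative heuristic only, distilled from the comparison table.
static const char *pick_dma_mode(long max_stall_cycles,
                                 int need_high_throughput) {
    if (max_stall_cycles <= 1)
        return "single-word cycle stealing";  // tightest latency bound
    if (max_stall_cycles <= 8)
        return need_high_throughput ? "burst stealing"
                                    : "single-word cycle stealing";
    return need_high_throughput ? "block mode" : "burst stealing";
}
```

A hard real-time system with a one-cycle stall budget lands on single-word stealing no matter what; a bulk-transfer system with a loose budget lands on block mode.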
Cycle stealing represents a fundamental tradeoff in DMA design: sacrificing throughput for bounded latency.
What's Next:
The final section examines bus mastering—the most advanced form of DMA where peripheral devices become full bus masters with independent access to the system memory space, as seen in modern PCIe devices.
You now understand cycle stealing at both conceptual and implementation levels. You know when to prefer it over block mode (real-time systems, background transfers), how to configure it (mode registers, priority settings), and its variants (burst, transparent, interleaved). This knowledge enables proper DMA mode selection for any system's requirements.