Imagine a highway with two types of traffic: regular commuters (CPU memory accesses) and delivery trucks (DMA transfers). Block mode DMA is like closing the highway to all commuters while a convoy of trucks passes—efficient for the trucks, but disruptive for everyone else.
Cycle stealing offers an alternative: trucks merge into regular traffic, taking one lane for one moment at a time.
In cycle stealing mode, the DMA controller 'steals' individual bus cycles from the CPU without fully blocking it. The CPU is briefly stalled while DMA moves one unit of data, then immediately resumes. This interleaving minimizes CPU disruption while still enabling DMA progress.
When is this tradeoff worthwhile? When should cycle stealing be preferred over block transfers? This section answers both questions.
By the end of this page, you will understand the mechanics of cycle stealing at the hardware level, master the tradeoffs between cycle stealing and block transfer modes, learn to calculate performance impact on both DMA and CPU, and recognize scenarios where cycle stealing is the optimal choice.
Cycle stealing is a DMA transfer mode where the controller acquires the bus for exactly one transfer unit, completes that transfer, then releases the bus. The CPU can execute until it needs bus access, at which point it may find the DMA controller has 'stolen' the next bus cycle.
Unlike block mode where DMA holds the bus for an entire transfer, cycle stealing creates a fine-grained interleaving:
```text
Cycle Stealing vs Block Mode - Timeline Comparison
==================================================

Block Mode (DMA transfers 8 words):
-----------------------------------
Time:  1    2    3    4    5    6    7    8    9    10   11   12
       │    │    │    │    │    │    │    │    │    │    │    │
CPU:  ████ ▒▒▒▒ ▒▒▒▒ ▒▒▒▒ ▒▒▒▒ ▒▒▒▒ ▒▒▒▒ ▒▒▒▒ ▒▒▒▒ ████ ████ ████
      RUN  ──────────────BLOCKED────────────────── RUN  RUN  RUN
                 (waiting for DMA to finish)

DMA:  ▒▒▒▒ ████ ████ ████ ████ ████ ████ ████ ████ ▒▒▒▒ ▒▒▒▒ ▒▒▒▒
      REQ  D0   D1   D2   D3   D4   D5   D6   D7   IDLE IDLE IDLE

Result: CPU completely blocked for 8 cycles, then runs uninterrupted

Cycle Stealing Mode (same 8 words):
-----------------------------------
Time:  1    2    3    4    5    6    7    8    9    10   11   12
       │    │    │    │    │    │    │    │    │    │    │    │
CPU:  ████ ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒
      RUN  STL  RUN  STL  RUN  STL  RUN  STL  RUN  STL  RUN  STL

DMA:  ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒ ████ ▒▒▒▒ ████
      IDLE D0   WAIT D1   WAIT D2   WAIT D3   WAIT D4   WAIT D5

Result: CPU runs 50% of cycles, DMA takes 50%, interleaved perfectly

Legend:
████   Active (executing/transferring)
▒▒▒▒   Blocked/Waiting/Idle
STL    Cycle "stolen" by DMA
D0-D7  Data words transferred
```

Cycle stealing requires specific hardware support: bus request/grant arbitration lines, and a DMA controller that releases the bus after every transfer unit.
The term 'stealing' reflects the CPU's perspective: cycles that 'should' belong to the CPU are taken by the DMA controller. In practice, it's cooperative—both share the bus according to arbitration rules. But from an old-school programmer's viewpoint, the DMA controller was an interloper that occasionally stole away the processor's rightful bus access.
Here's a crucial insight: modern CPUs don't use the memory bus on every cycle. Processor caches, out-of-order execution, and speculative fetch create many cycles where the CPU has no pending bus transactions.
During these idle bus cycles, DMA transfers can proceed without affecting CPU performance at all.
| Workload | Cache Hit Rate | CPU Bus Utilization | Free Cycles for DMA |
|---|---|---|---|
| Integer compute (no memory) | ~100% | ~5% | ~95% |
| Cached data processing | ~90% | ~15% | ~85% |
| Database queries (warm cache) | ~70% | ~30% | ~70% |
| Large array streaming | ~30% | ~60% | ~40% |
| Uncached random access | ~5% | ~85% | ~15% |
When CPU cache hit rates are high, cycle stealing DMA can achieve near-block-mode throughput with minimal CPU impact:
Scenario: CPU with 90% cache hit rate, DMA transferring continuously
Without DMA:
- CPU uses bus 10% of cycles
- Bus idle 90% of cycles
With Cycle Stealing DMA:
- CPU uses bus 10% of cycles (unchanged!)
- DMA uses bus ~90% of cycles (using otherwise-idle capacity)
- Total bus utilization: ~100%
- CPU slowdown: ~0% (DMA only takes idle cycles)
Key insight: Cycle stealing is most effective when CPU bus utilization is low. It becomes problematic only when CPU and DMA are both bus-hungry.
On workloads with good cache locality, cycle stealing DMA essentially gets free bus access during CPU cache hits. The CPU never knows its cycles were 'stolen' because it wasn't using them anyway. This makes cycle stealing attractive for background transfers on general-purpose systems.
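The headroom argument above can be sketched in a few lines. This is an illustrative model, not any real controller's API: it assumes the CPU has strict priority and the DMA engine takes only leftover cycles, so the visible CPU slowdown is zero until the bus saturates.

```c
#include <assert.h>

// All fractions are percentages of total bus cycles.
typedef struct {
    int cpu_pct;       // bus cycles the CPU actually gets
    int dma_pct;       // bus cycles the DMA engine gets
    int slowdown_pct;  // visible CPU slowdown
} bus_share;

// Hypothetical "steal only idle cycles" policy: the CPU is never
// displaced, and DMA absorbs whatever capacity remains.
static bus_share idle_stealing(int cpu_demand_pct, int dma_demand_pct) {
    bus_share s;
    s.cpu_pct = cpu_demand_pct;           // CPU has priority
    int idle = 100 - cpu_demand_pct;      // cycles left over
    s.dma_pct = dma_demand_pct < idle ? dma_demand_pct : idle;
    s.slowdown_pct = 0;                   // DMA never displaces the CPU
    return s;
}
```

With the scenario's numbers (CPU at 10% bus utilization, DMA wanting 90%), the model reproduces the result above: DMA gets its full 90% and the CPU sees no slowdown.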
Both modes have distinct characteristics that make them suitable for different scenarios. Understanding the tradeoffs enables proper mode selection:
Let's calculate actual throughput and latency for a 1KB transfer with different modes:
Assumptions: 10 ns bus cycle, 4-byte bus width (1KB = 256 words), 2-cycle arbitration, 1-cycle address phase, 1-cycle bus release.
```text
1KB (256-word) Transfer Analysis
=================================

Block Mode:
-----------
Arbitration (once):    2 cycles
Address phase (once):  1 cycle
Data transfer:         256 cycles
Release bus:           1 cycle
─────────────────────────────────
Total:                 260 cycles
Time:                  260 × 10ns = 2.6 µs
Throughput:            1024 bytes / 2.6 µs = 394 MB/s
CPU blocked:           100% for 2.6 µs (worst-case latency)

Cycle Stealing (1 word at a time):
----------------------------------
Per word:
  Arbitration:    2 cycles
  Address phase:  1 cycle
  Data transfer:  1 cycle
  Release bus:    1 cycle
  Per-word total: 5 cycles

For 256 words:
  Total:      256 × 5 = 1280 cycles
  Time:       1280 × 10ns = 12.8 µs
  Throughput: 1024 bytes / 12.8 µs = 80 MB/s

CPU blocked per steal:        5 cycles (50ns)
CPU available between steals: Variable (when DMA pauses for next request)

Cycle Stealing with Transparent Stealing (during cache hits):
-------------------------------------------------------------
Assuming CPU cache hit rate of 80%:
  Visible stalls to CPU:  20% of 1280 cycles = 256 cycles
  Effective CPU slowdown: 256 cycles / (total CPU work time)

  If CPU work would have taken 5000 cycles without DMA:
    With DMA: 5000 + 256 = 5256 cycles
    Slowdown: 5.1% (nearly invisible!)

Trade-off Summary (1KB transfer):
---------------------------------
                       Block Mode    Cycle Stealing
DMA completion:        2.6 µs        12.8 µs
Throughput:            394 MB/s      80 MB/s
Worst-case CPU stall:  2.6 µs        50 ns
Average CPU impact:    2.6 µs block  5% slowdown*

*Assuming 80% cache hit rate
```

In this example, cycle stealing achieves only 20% of block mode throughput. For bulk transfers where absolute speed matters (disk I/O, network packets), block mode is clearly superior. Cycle stealing's advantage is limiting worst-case CPU latency, not maximizing DMA speed.
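The arithmetic above can be reproduced in a few lines of C. The constants mirror the stated assumptions (10 ns bus cycle, 2-cycle arbitration, 1-cycle address phase, 1-cycle release); the function names are ours, not from any DMA API.

```c
#include <assert.h>

#define CYCLE_NS    10  // assumed 10 ns bus cycle (100 MHz bus)
#define ARB_CYCLES   2  // arbitration
#define ADDR_CYCLES  1  // address phase
#define REL_CYCLES   1  // bus release

// Block mode: pay the per-transfer overhead once for the whole block.
static long block_mode_cycles(long words) {
    return ARB_CYCLES + ADDR_CYCLES + words + REL_CYCLES;
}

// Single-word cycle stealing: pay the full overhead on every word.
static long cycle_steal_cycles(long words) {
    return words * (ARB_CYCLES + ADDR_CYCLES + 1 + REL_CYCLES);
}

// Throughput in MB/s (bytes per nanosecond is GB/s; scale by 1000).
static double throughput_mb_s(long bytes, long cycles) {
    double ns = (double)cycles * CYCLE_NS;
    return bytes / ns * 1000.0;
}
```

For the 256-word transfer this yields 260 vs. 1280 cycles, i.e. roughly 394 MB/s for block mode against 80 MB/s for single-word stealing, matching the table above.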
Cycle stealing finds its primary use case in real-time systems where worst-case timing is more important than average throughput.
In real-time systems, missing a deadline is a failure—regardless of average performance. Consider a medical device monitoring a patient's heartbeat:
With Block Mode DMA:
- Block transfer time (128 bytes at 100 MHz, 4-byte width): ~33 cycles = 330 ns
- But if the DMA transfer starts just as a critical interrupt arrives:
  - Interrupt latency += 330 ns (blocked waiting for DMA)
  - Plus interrupt overhead
  - Risk: may exceed the 500 µs budget in the worst case

With Cycle Stealing:
- Worst-case single interrupt latency: 50 ns (one DMA steal)
- DMA still progresses (more slowly), but critical timing is met
- The predictable 50 ns worst-case latency is GUARANTEED
In general-purpose computing, we optimize for average case—a 1% chance of 1ms delay is acceptable. In hard real-time systems, we must guarantee worst-case—even a 0.0001% chance of deadline miss is unacceptable. Cycle stealing's value is its guaranteed bounded latency, not its average performance.
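The worst-case bound is simple enough to state as code. This is a sketch under the earlier assumptions (10 ns bus cycle, 5-cycle steal including arbitration); `worst_case_dma_stall_ns` is a hypothetical helper, not a real API.

```c
#include <assert.h>

#define CYCLE_NS 10  // assumed 10 ns bus cycle

// Worst-case extra interrupt latency (ns) contributed by an in-flight
// DMA transfer. In block mode the CPU may have to wait out the entire
// block; in cycle stealing it waits for at most one steal.
static long worst_case_dma_stall_ns(int block_mode, long block_cycles,
                                    long steal_cycles) {
    if (block_mode)
        return block_cycles * CYCLE_NS;  // whole block in the worst case
    return steal_cycles * CYCLE_NS;      // one steal, then the CPU runs
}
```

For the earlier 1KB example (260-cycle block, 5-cycle steal) this gives 2.6 µs vs. 50 ns: the hard real-time guarantee is precisely this bound, independent of transfer size.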
Implementing cycle stealing properly requires attention to several hardware and software details.
The definitions below (`dma_addr_t`, `struct dma_regs`, the `DMA_*` flag constants, and the global `dma_controller`) are assumed to come from a hypothetical platform's DMA header; the mode bits follow the style of classic controllers such as the Intel 8237.

```c
// Configuring DMA for Cycle Stealing Mode

// Mode register bits (typical DMA controller)
#define DMA_MODE_DEMAND   0x00  // Transfer on device request only
#define DMA_MODE_SINGLE   0x40  // Cycle stealing: one unit per bus grant
#define DMA_MODE_BLOCK    0x80  // Block: hold bus for entire transfer
#define DMA_MODE_CASCADE  0xC0  // Cascade: for chaining controllers

// Configure channel for cycle stealing
void setup_cycle_stealing_dma(int channel, dma_addr_t src,
                              dma_addr_t dst, size_t size) {
    struct dma_regs *regs = &dma_controller.channel[channel];

    // Disable channel while configuring
    regs->control = 0;

    // Set source and destination
    regs->source_addr = src;
    regs->dest_addr = dst;
    regs->transfer_count = size;

    // Configure for cycle stealing (single transfer mode)
    regs->mode = DMA_MODE_SINGLE |       // Cycle stealing mode
                 DMA_DIR_READ |          // Device to memory
                 DMA_AUTOINIT_DISABLE |  // Don't restart when done
                 DMA_ADDR_INCREMENT;     // Increment destination

    // Set priority (lower = less aggressive stealing)
    regs->priority = DMA_PRIORITY_LOW;   // Be gentle with the CPU

    // Enable channel
    regs->control = DMA_CTRL_ENABLE | DMA_CTRL_IRQ_ON_COMPLETE;
}

// Contrast: Block mode configuration
void setup_block_dma(int channel, dma_addr_t src,
                     dma_addr_t dst, size_t size) {
    struct dma_regs *regs = &dma_controller.channel[channel];

    regs->control = 0;
    regs->source_addr = src;
    regs->dest_addr = dst;
    regs->transfer_count = size;

    // Configure for block mode
    regs->mode = DMA_MODE_BLOCK |        // Block mode: hold bus
                 DMA_DIR_READ |
                 DMA_AUTOINIT_DISABLE |
                 DMA_ADDR_INCREMENT;

    regs->priority = DMA_PRIORITY_HIGH;  // Complete quickly

    regs->control = DMA_CTRL_ENABLE | DMA_CTRL_IRQ_ON_COMPLETE;
}

// Hybrid approach: Burst stealing
// Transfer several words per bus grant, but still release the bus
// between bursts
void setup_burst_stealing_dma(int channel, dma_addr_t src,
                              dma_addr_t dst, size_t size,
                              int burst_size) {  // 4, 8, or 16 words
    struct dma_regs *regs = &dma_controller.channel[channel];

    regs->control = 0;
    regs->source_addr = src;
    regs->dest_addr = dst;
    regs->transfer_count = size;

    // Cycle stealing with burst
    regs->mode = DMA_MODE_SINGLE | DMA_DIR_READ;
    regs->burst_length = burst_size;  // Words per steal

    regs->control = DMA_CTRL_ENABLE | DMA_CTRL_IRQ_ON_COMPLETE;
}
```

Cycle stealing can be tuned by adjusting how aggressively the DMA controller requests bus cycles:
Aggressive Stealing: request the bus on every available cycle, maximizing DMA throughput at the cost of more frequent CPU stalls.
Polite Stealing: request the bus only when the CPU has no pending access, minimizing CPU impact at the cost of DMA progress.
Throttled Stealing: cap the steal rate (e.g., at most one steal every N cycles) to bound DMA's share of bus bandwidth.
Cycle stealing is fundamentally about bandwidth sharing between CPU and DMA. Understanding bandwidth allocation helps size systems appropriately.
```text
Memory Bandwidth Sharing Analysis
=================================

System Parameters:
- Memory bandwidth:       10 GB/s (DDR4 single channel)
- CPU memory demand:      3 GB/s (30% utilization)
- DMA device requirement: 500 MB/s

Question: Will cycle stealing DMA impact CPU performance?

Analysis:
---------
Available bandwidth:  10,000 MB/s
CPU demand:            3,000 MB/s
DMA demand:              500 MB/s
────────────────────────────────────
Total demand:          3,500 MB/s (35% utilization)

Since 3,500 < 10,000, both can be satisfied simultaneously!

With cycle stealing:
- System alternates CPU and DMA accesses
- Both complete at full speed (bandwidth not saturated)
- CPU: 3 GB/s achieved ✓
- DMA: 500 MB/s achieved ✓

Oversubscribed Example:
-----------------------
CPU demand: 8 GB/s (80% utilization)
DMA demand: 4 GB/s

Total demand: 12,000 MB/s (120% of capacity)

With cycle stealing:
- Bandwidth shared proportionally (if fair arbiter)
- CPU gets: 8/12 × 10 GB/s = 6.67 GB/s (17% shortfall)
- DMA gets: 4/12 × 10 GB/s = 3.33 GB/s (17% shortfall)

Both are slowed by the same proportion.

Priority-Based Sharing:
-----------------------
If CPU has priority (DMA steals only when CPU doesn't need bus):
- CPU gets: 8 GB/s (full demand)
- DMA gets: 10 - 8 = 2 GB/s (50% of demand)

If DMA has priority:
- DMA gets: 4 GB/s (full demand)
- CPU gets: 10 - 4 = 6 GB/s (25% shortfall)
```

Advanced systems implement bandwidth reservation to guarantee specific allocations:
Time-Division Multiplexing: divide bus time into fixed slots and reserve recurring slots for DMA, guaranteeing its bandwidth regardless of CPU load.
Credit-Based Throttling: grant each requester a budget of access credits per time window; a requester that exhausts its credits must wait for the next window.
These mechanisms transform cycle stealing from 'opportunistic' to 'guaranteed' bandwidth allocation.
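The two sharing policies from the bandwidth analysis above can be written out directly. These are sketches of the arithmetic, not real arbiter interfaces; all values are in MB/s.

```c
#include <assert.h>

// Fair arbiter: when total demand exceeds capacity, every requester
// is scaled down proportionally (demand × capacity / total_demand).
static double fair_share(double demand, double total_demand,
                         double capacity) {
    if (total_demand <= capacity) return demand;  // no contention
    return demand * capacity / total_demand;      // proportional cut
}

// Strict priority: a lower-priority requester gets only the bandwidth
// left after the higher-priority requester is fully served.
static double leftover_share(double demand, double higher_prio_demand,
                             double capacity) {
    double left = capacity - higher_prio_demand;
    if (left < 0) left = 0;
    return demand < left ? demand : left;
}
```

Plugging in the oversubscribed example (CPU 8 GB/s, DMA 4 GB/s, 10 GB/s capacity) reproduces the numbers above: a fair arbiter gives 6.67/3.33 GB/s, while CPU priority leaves DMA only 2 GB/s.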
Modern systems (PCIe, AMBA AXI) use sophisticated QoS mechanisms that generalize cycle stealing concepts. Virtual channels, traffic classes, and credit-based flow control provide fine-grained bandwidth management far beyond simple cycle stealing. But the core principle—sharing bandwidth while limiting worst-case latency—remains the same.
Pure single-word cycle stealing is rarely used in modern systems. Instead, several variants offer improved characteristics:
Transparent DMA
Concept: Only steal cycles when the CPU isn't using the bus. If the CPU needs the bus, it gets priority.
Implementation:
Limitations:
Interleaved DMA
Concept: Guarantee a specific interleave ratio, e.g., 1 DMA cycle for every 3 CPU cycles.
Implementation:
Used in: Systems requiring predictable behavior for both CPU and DMA (real-time, embedded)
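A fixed 1:N interleave can be enforced by a trivially simple slot arbiter. The function below is a hypothetical illustration: out of every N+1 bus slots, exactly one goes to DMA, pinning its share of the bus regardless of load on either side.

```c
#include <assert.h>

// Returns nonzero if this bus slot is reserved for DMA under a fixed
// 1:N interleave (N CPU slots for every DMA slot). With N = 3, DMA
// gets exactly 25% of the bus: slots 0, 4, 8, ...
static int slot_goes_to_dma(long slot, int cpu_slots_per_dma_slot) {
    return slot % (cpu_slots_per_dma_slot + 1) == 0;
}
```

Because the schedule is purely positional, both sides get a hard guarantee: DMA throughput is fixed, and the CPU is never stalled for more than one slot in a row.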
| Variant | Throughput | Worst-Case Latency | Predictability |
|---|---|---|---|
| Pure Single-Word | Low | Excellent (1 cycle) | High |
| Burst (8-word) | Medium | Good (8 cycles) | High |
| Transparent | Variable | Excellent (0 visible) | Low |
| Interleaved (1:3) | Fixed (25%) | Good (bounded) | Very High |
| Block Mode | High | Poor (entire block) | High |
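One rough way to apply this table is to bound the worst-case CPU stall first, then maximize throughput within that bound. The heuristic below is our own sketch; the thresholds (1 cycle, 8 cycles) are invented for illustration and would be tuned per system.

```c
#include <assert.h>
#include <string.h>

// Pick a DMA mode given the longest CPU stall the system can tolerate
// (in bus cycles) and whether DMA throughput is a priority.
// Illustrative heuristic only, distilled from the comparison table.
static const char *pick_dma_mode(long max_stall_cycles,
                                 int need_high_throughput) {
    if (max_stall_cycles <= 1)
        return "single-word cycle stealing";  // tightest latency bound
    if (max_stall_cycles <= 8)
        return need_high_throughput ? "burst stealing"
                                    : "single-word cycle stealing";
    return need_high_throughput ? "block mode" : "burst stealing";
}
```

A hard real-time system with a one-cycle stall budget lands on single-word stealing no matter what; a bulk-transfer system with a loose budget lands on block mode.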
Cycle stealing represents a fundamental tradeoff in DMA design: sacrificing throughput for bounded latency.
What's Next:
The final section examines bus mastering—the most advanced form of DMA where peripheral devices become full bus masters with independent access to the system memory space, as seen in modern PCIe devices.
You now understand cycle stealing at both conceptual and implementation levels. You know when to prefer it over block mode (real-time systems, background transfers), how to configure it (mode registers, priority settings), and its variants (burst, transparent, interleaved). This knowledge enables proper DMA mode selection for any system's requirements.