Consider the challenge of transferring a file from disk to memory. Somewhere, somehow, every single byte must move from the storage device to RAM. The most straightforward approach? Have the CPU read each byte from the device and write it to memory—one byte at a time, in a tight loop, until the transfer is complete.
This technique is called Programmed I/O (PIO), and it represents the most fundamental form of data transfer between the processor and peripheral devices. The CPU 'programs' each I/O operation explicitly, executing instructions to move every unit of data.
While modern systems rely primarily on more sophisticated techniques like DMA (Direct Memory Access), Programmed I/O remains relevant as a fallback mechanism, for initialization sequences, and as the conceptual foundation upon which we understand more advanced approaches.
By the end of this page, you will understand the mechanics of Programmed I/O, including polling and busy-waiting, the CPU performance implications, when PIO is appropriate versus when it's problematic, practical implementation patterns, and how PIO relates to interrupt-driven I/O and DMA.
Programmed I/O is a data transfer technique in which the CPU is directly responsible for every data movement operation between main memory and I/O devices. The processor executes explicit instructions to:

- send commands and parameters to the device's control registers,
- poll the device's status register until it reports readiness,
- move each unit of data (a byte or word at a time) through a CPU register between the device's data register and memory, and
- check for completion and errors.

Characterizing PIO: the CPU sits in the data path for every transfer. No dedicated hardware moves data on its behalf, so throughput is bounded by how fast the processor can execute the transfer loop.
The PIO Operation Cycle:
A typical PIO read operation follows this pattern:
┌──────────────────────────────────────────────────────────┐
│                      PIO Read Cycle                      │
├──────────────────────────────────────────────────────────┤
│ 1. CPU → Device: Send read command + parameters          │
│ 2. Device: Prepares data (may take many cycles)          │
│ 3. CPU: Poll status register (busy wait loop)            │
│ 4. Device → CPU: Data available (status bit set)         │
│ 5. CPU: Read data from device data register              │
│ 6. CPU: Store data to memory                             │
│ 7. Repeat steps 3-6 for remaining data                   │
└──────────────────────────────────────────────────────────┘
A PIO write operation is similar but reversed:
┌──────────────────────────────────────────────────────────┐
│                     PIO Write Cycle                      │
├──────────────────────────────────────────────────────────┤
│ 1. CPU: Load data from memory                            │
│ 2. CPU: Poll device status (wait for 'ready to receive') │
│ 3. CPU → Device: Write data to device data register      │
│ 4. Device: Processes/stores the data                     │
│ 5. Repeat steps 1-4 for remaining data                   │
│ 6. CPU: Check for completion/errors                      │
└──────────────────────────────────────────────────────────┘
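As a rough sketch of this write cycle in code (the register addresses and status bit below are hypothetical, chosen only to mirror the numbered steps):

#include <stdint.h>

/* Hypothetical memory-mapped device registers (addresses are illustrative only) */
#define DEV_STATUS (*(volatile uint8_t *)0x10000000)
#define DEV_DATA   (*(volatile uint8_t *)0x10000004)
#define DEV_READY  0x01   /* "ready to receive" status bit */

/* PIO write: the CPU itself performs every step of the cycle */
void pio_write(const uint8_t *buf, int len) {
    for (int i = 0; i < len; i++) {             /* Step 5: repeat per byte     */
        while (!(DEV_STATUS & DEV_READY)) { }   /* Step 2: poll until ready    */
        DEV_DATA = buf[i];                      /* Steps 1 and 3: load + write */
    }
    /* Step 6: a real driver would check a completion/error bit here */
}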
Don't confuse Programmed I/O with port-mapped or memory-mapped I/O. PIO describes the data transfer methodology (CPU moves every byte), while port-mapped and memory-mapped I/O describe the addressing mechanism (how registers are accessed). You can have PIO using either port-mapped or memory-mapped register access.
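To make the distinction concrete, here is a sketch of the same polling-based status check done both ways. The memory-mapped address is hypothetical; the port number matches the COM1 example used later on this page:

#include <stdint.h>

/* Port-mapped: a separate I/O address space reached via the IN/OUT instructions */
static inline uint8_t inb(uint16_t port) {
    uint8_t v;
    __asm__ volatile ("inb %1, %0" : "=a"(v) : "Nd"(port));
    return v;
}
#define UART_PORT_STATUS 0x3FD                               /* COM1 Line Status Register */

/* Memory-mapped: the same kind of register exposed at an ordinary memory address */
#define UART_MMIO_STATUS (*(volatile uint8_t *)0x10000005)   /* Illustrative address */

#define READY_BIT 0x01

/* Both functions are PIO: the CPU performs every access and busy-waits either way */
void wait_ready_port_mapped(void)   { while (!(inb(UART_PORT_STATUS) & READY_BIT)) { } }
void wait_ready_memory_mapped(void) { while (!(UART_MMIO_STATUS & READY_BIT)) { } }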
The defining characteristic of PIO is polling—the CPU repeatedly checking a device's status register until the device signals readiness. This technique is also called busy waiting or spinning, and the loop that implements it is known as a polling loop.
Anatomy of a Polling Loop:
A polling loop typically looks like this:
while (!(inb(STATUS_PORT) & READY_BIT)) {
    /* Do nothing - just keep checking */
}

/* Device is now ready, proceed with data transfer */
data = inb(DATA_PORT);
This simple pattern hides a significant cost: during the entire waiting period, the CPU is fully occupied executing the loop—it cannot perform any other useful work.
/*
 * Serial Port Polling - Classic PIO Example
 *
 * This code demonstrates polling-based PIO for UART communication.
 * The CPU explicitly checks status and moves every byte.
 */

#include <stdint.h>

/* Port I/O functions (x86 specific) */
static inline void outb(uint16_t port, uint8_t val) {
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

static inline uint8_t inb(uint16_t port) {
    uint8_t ret;
    __asm__ volatile ("inb %1, %0" : "=a"(ret) : "Nd"(port));
    return ret;
}

/* COM1 port addresses */
#define COM1_DATA   0x3F8   /* Data register (R/W) */
#define COM1_STATUS 0x3FD   /* Line Status Register */

/* Line Status Register bits */
#define LSR_DATA_READY 0x01 /* Data available to read */
#define LSR_EMPTY_XMIT 0x20 /* Transmitter holding register empty */

/*
 * Receive a single character using polling.
 *
 * The CPU spins in a loop until data arrives.
 * This is the essence of PIO - the CPU is fully occupied waiting.
 *
 * Timing analysis:
 *  - At 115200 baud, one character takes ~87 microseconds
 *  - A 3GHz CPU could execute ~260,000 instructions in that time
 *  - All of that capacity is wasted on checking the status bit
 */
char serial_receive_polling(void) {
    /* Poll until data is available */
    while ((inb(COM1_STATUS) & LSR_DATA_READY) == 0) {
        /* Busy wait - CPU is doing "nothing productive" */
        /* Each iteration:
         *  - IN instruction: ~100-300 cycles (I/O is slow!)
         *  - Compare and branch: ~1-5 cycles
         * Even at ~200 cycles/iteration, we're burning CPU */
    }

    /* Data is ready, read it */
    return inb(COM1_DATA);
}

/*
 * Transmit a single character using polling.
 */
void serial_transmit_polling(char c) {
    /* Poll until transmitter is ready */
    while ((inb(COM1_STATUS) & LSR_EMPTY_XMIT) == 0) {
        /* Busy wait for transmit buffer to empty */
    }

    /* Transmitter is ready, send the byte */
    outb(COM1_DATA, c);
}

/*
 * Transmit a null-terminated string using polling.
 *
 * This demonstrates the cumulative cost: for each character,
 * we may wait thousands of CPU cycles. For a 100-character
 * string at 9600 baud, we might wait 100+ milliseconds total.
 */
void serial_print_polling(const char *str) {
    while (*str) {
        serial_transmit_polling(*str++);
    }
}

/*
 * Receive exactly 'count' bytes into buffer using polling.
 *
 * For 1024 bytes at 115200 baud, this takes ~89 milliseconds
 * of CPU time, during which the CPU does essentially nothing
 * but poll and copy bytes.
 */
void serial_receive_block_polling(char *buffer, int count) {
    for (int i = 0; i < count; i++) {
        buffer[i] = serial_receive_polling();
    }
}

Cost Analysis of Polling:
Let's quantify the CPU waste in a polling scenario:
Scenario: Reading 1 KB over a serial port at 115200 baud.
With one start and one stop bit, each byte occupies 10 bit times, so 1024 bytes take roughly 10,240 / 115,200 ≈ 89 milliseconds. On a 3 GHz CPU, 89 milliseconds corresponds to about 267 million clock cycles spent in the polling loop.
During those 267 million cycles, the CPU could have executed hundreds of millions of instructions of useful work: running other processes, servicing network traffic, or simply sleeping to save power. Instead, it was stuck in a tight loop, checking the same status bit over and over.
Polling is simplest to implement but most expensive in CPU resources. The faster your CPU, the more cycles you waste waiting for slow devices. A modern CPU running at 3 GHz waiting for a 115200 baud serial port wastes over 99.99% of its capacity in the polling loop.
Block devices like hard disks and CD-ROMs historically relied heavily on PIO mode transfers. The ATA (AT Attachment) / IDE (Integrated Drive Electronics) interface supported multiple PIO modes with increasing transfer rates:
ATA PIO Modes:
| PIO Mode | Maximum Transfer Rate | Cycle Time | Notes |
|---|---|---|---|
| PIO Mode 0 | 3.3 MB/s | 600 ns | Original ATA standard |
| PIO Mode 1 | 5.2 MB/s | 383 ns | Common in 1990s |
| PIO Mode 2 | 8.3 MB/s | 240 ns | Common in 1990s |
| PIO Mode 3 | 11.1 MB/s | 180 ns | Enhanced IDE |
| PIO Mode 4 | 16.7 MB/s | 120 ns | Maximum ATA PIO |
Even the fastest PIO Mode 4 at 16.7 MB/s meant the CPU was fully occupied during transfers. Reading a 100 MB file would consume approximately 6 seconds of dedicated CPU time. This is why DMA modes (UDMA, etc.) were developed—to free the CPU from the drudgery of manual data movement.
/*
 * ATA PIO Sector Read
 *
 * This demonstrates PIO-mode disk sector reading.
 * The CPU issues a command, polls for completion, then reads
 * 512 bytes (one sector) one word at a time.
 */

#include <stdint.h>

/* Primary ATA I/O Ports */
#define ATA_DATA      0x1F0  /* Data register (R/W) */
#define ATA_ERROR     0x1F1  /* Error register (R) / Features (W) */
#define ATA_SEC_COUNT 0x1F2  /* Sector count */
#define ATA_SEC_NUM   0x1F3  /* Sector number */
#define ATA_CYL_LOW   0x1F4  /* Cylinder low */
#define ATA_CYL_HIGH  0x1F5  /* Cylinder high */
#define ATA_HEAD      0x1F6  /* Drive/head register */
#define ATA_STATUS    0x1F7  /* Status (R) / Command (W) */

/* Status register bits */
#define ATA_SR_BSY  0x80  /* Busy */
#define ATA_SR_DRDY 0x40  /* Drive ready */
#define ATA_SR_DRQ  0x08  /* Data request (ready for data transfer) */
#define ATA_SR_ERR  0x01  /* Error occurred */

/* ATA Commands */
#define ATA_CMD_READ_SECTORS 0x20  /* Read sectors with retry */

static inline void outb(uint16_t port, uint8_t val) {
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

static inline uint8_t inb(uint16_t port) {
    uint8_t ret;
    __asm__ volatile ("inb %1, %0" : "=a"(ret) : "Nd"(port));
    return ret;
}

static inline uint16_t inw(uint16_t port) {
    uint16_t ret;
    __asm__ volatile ("inw %1, %0" : "=a"(ret) : "Nd"(port));
    return ret;
}

/*
 * Wait for drive to become ready (clear BSY, set DRDY)
 *
 * This is pure polling - the CPU does nothing but check status.
 */
static int ata_wait_ready(void) {
    int timeout = 100000;
    while (--timeout) {
        uint8_t status = inb(ATA_STATUS);
        if (status & ATA_SR_ERR) {
            return -1;  /* Error occurred */
        }
        if (!(status & ATA_SR_BSY) && (status & ATA_SR_DRDY)) {
            return 0;   /* Drive is ready */
        }
        /* Optionally insert a tiny delay to reduce bus contention */
        /* __asm__ volatile ("pause":::); */
    }
    return -2;  /* Timeout */
}

/*
 * Wait for DRQ (data request) - indicates data is available
 */
static int ata_wait_drq(void) {
    int timeout = 100000;
    while (--timeout) {
        uint8_t status = inb(ATA_STATUS);
        if (status & ATA_SR_ERR) {
            return -1;
        }
        if (status & ATA_SR_DRQ) {
            return 0;   /* Data is ready to transfer */
        }
    }
    return -2;  /* Timeout */
}

/*
 * Read one sector (512 bytes) using PIO
 *
 * This function demonstrates the full PIO workflow:
 *   1. Wait for drive ready
 *   2. Set up command parameters (LBA address, sector count)
 *   3. Issue command
 *   4. Wait for data ready (polling)
 *   5. Read data word-by-word
 *
 * The CPU is busy throughout steps 4 and 5.
 */
int ata_read_sector_pio(uint32_t lba, uint8_t *buffer) {
    /* Wait for any previous operation to complete */
    if (ata_wait_ready() < 0) {
        return -1;
    }

    /* Select drive 0, use LBA addressing */
    outb(ATA_HEAD, 0xE0 | ((lba >> 24) & 0x0F));   /* LBA bits 24-27 + flags */

    /* Set up the transfer parameters */
    outb(ATA_SEC_COUNT, 1);                        /* Read 1 sector */
    outb(ATA_SEC_NUM, lba & 0xFF);                 /* LBA bits 0-7 */
    outb(ATA_CYL_LOW, (lba >> 8) & 0xFF);          /* LBA bits 8-15 */
    outb(ATA_CYL_HIGH, (lba >> 16) & 0xFF);        /* LBA bits 16-23 */

    /* Issue the READ SECTORS command */
    outb(ATA_STATUS, ATA_CMD_READ_SECTORS);

    /* ============================================ */
    /* This is where PIO gets expensive!            */
    /* We now wait (polling) for the drive to       */
    /* prepare the data, then read it word-by-word  */
    /* ============================================ */

    /* Wait for data to be available */
    if (ata_wait_drq() < 0) {
        return -2;
    }

    /* Read 512 bytes (256 words) from the data register */
    /* Each inw() reads 16 bits from the data port */
    uint16_t *buf16 = (uint16_t *)buffer;
    for (int i = 0; i < 256; i++) {
        buf16[i] = inw(ATA_DATA);
    }

    return 0;  /* Success */
}

/*
 * Read multiple sectors using PIO
 *
 * For reading N sectors, we pay the polling cost N times.
 * At typical disk latencies, this becomes very expensive for large reads.
 */
int ata_read_sectors_pio(uint32_t lba, uint8_t sector_count, uint8_t *buffer) {
    for (int s = 0; s < sector_count; s++) {
        int result = ata_read_sector_pio(lba + s, buffer + (s * 512));
        if (result < 0) {
            return result;
        }
    }
    return 0;
}

The x86 architecture provides REP INSW (repeat input string word), which can transfer 256 words much faster than 256 individual INW instructions. With ECX set to 256, a single 'rep insw' transfers an entire sector with minimal instruction overhead. However, the CPU is still fully occupied during the transfer—it just completes faster.
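As a rough illustration of that optimization, a string-transfer wrapper might look like the sketch below. It builds on the definitions in the example above and assumes x86 with GCC-style inline assembly; insw_rep is a name chosen here, not a standard API:

#include <stdint.h>

static inline void insw_rep(uint16_t port, void *addr, uint32_t count) {
    /* "rep insw" reads 'count' 16-bit words from 'port' into memory at 'addr'.
     * The destination pointer (EDI) and count (ECX) are updated by the
     * instruction itself, hence the "+D" and "+c" read-write constraints. */
    __asm__ volatile ("rep insw"
                      : "+D"(addr), "+c"(count)
                      : "d"(port)
                      : "memory");
}

/* Example: the 256-iteration loop in ata_read_sector_pio() could become
 *   insw_rep(ATA_DATA, buffer, 256);
 */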
Despite its CPU overhead, PIO has legitimate use cases where its characteristics become advantages:
1. Simplicity of Implementation:
PIO requires minimal hardware complexity. A device only needs a status register the CPU can poll, a data register the CPU can read or write, and usually a command register through which the CPU directs its operation.
This simplicity translates to cheaper device hardware, smaller and easier-to-verify driver code, and fewer failure modes to reason about.
2. Deterministic Timing:
PIO provides predictable, measurable timing because every transfer is an explicit instruction with a bounded latency, and there is no DMA engine, descriptor queue, or bus-mastering arbitration sitting between the CPU and the device.
This makes PIO valuable for real-time and safety-critical code, early boot and firmware bring-up, and bit-banged protocols where software must control the timing of every transition; a small sketch of that last case follows.
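A minimal sketch of a bit-banged output, assuming a hypothetical memory-mapped GPIO register (the address and layout are illustrative, not a real device):

#include <stdint.h>

/* Hypothetical GPIO output register; the address is made up for illustration */
#define GPIO_OUT (*(volatile uint32_t *)0x40020014)

/*
 * Bit-bang one byte, most significant bit first.
 * Every bit is one explicit store by the CPU, so the timing of the whole
 * transfer is determined entirely by the instruction stream and the delay
 * routine - there is no DMA engine or arbitration to introduce jitter.
 */
void bitbang_byte(uint8_t byte, void (*delay_one_bit)(void)) {
    for (int bit = 7; bit >= 0; bit--) {
        GPIO_OUT = (byte >> bit) & 1;   /* Drive the pin high or low      */
        delay_one_bit();                /* Hold the level for one bit time */
    }
}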
3. No Memory Coherency Concerns:
With DMA, you must carefully manage cache coherency—flushing caches before DMA reads from memory, invalidating caches after DMA writes to memory. PIO has no such concerns because data goes directly through CPU registers.
When Linux panics, it switches to 'poll' mode for console output. Interrupt handlers might be broken, DMA might be corrupted, but the CPU can always poll the serial port and send characters one at a time. This reliability in failure scenarios makes PIO invaluable for diagnostics.
The disadvantages of PIO are substantial and grow more severe as systems become faster and devices become larger:
1. CPU Monopolization:
During PIO transfers, the CPU is completely occupied. On a single-processor system, nothing else runs. Even on multiprocessor systems, one entire core is consumed moving data.
Impact calculation: copying 1 GB at PIO Mode 4's 16.7 MB/s ties up a core for roughly 60 seconds; at 3 GHz, that is on the order of 180 billion cycles spent doing nothing but moving data and waiting for the device.
2. The Device Speed Gap:
The fundamental problem with PIO is the speed mismatch between modern devices and the technique's inherent limitations: each uncached device-register access costs hundreds of CPU cycles, so word-at-a-time PIO tops out at tens of megabytes per second regardless of processor speed, while a modern NVMe SSD can sustain several gigabytes per second.
Even with today's fastest CPUs, PIO cannot approach modern storage speeds. The technique simply doesn't scale.
PIO is particularly devastating for battery life. CPUs in polling loops run at full speed, burning maximum power. Modern systems use DMA specifically so the CPU can enter low-power sleep states while transfers complete autonomously. A laptop doing heavy PIO would see dramatically reduced battery life.
When PIO is unavoidable, several techniques can mitigate its performance impact:
1. Hybrid Polling:
Instead of a single tight polling loop, use a hybrid approach: poll aggressively for a short, bounded number of iterations to catch devices that respond quickly, then back off by yielding the CPU, sleeping, or switching to interrupt-driven completion if the device is taking longer (see poll_hybrid() in the code below).
This captures the low-latency benefit of polling for quick operations while avoiding long CPU stalls for slow ones.
/*
 * Polling Optimization Techniques
 *
 * Various approaches to make polling less wasteful.
 */

#include <stdint.h>

/* Assume these are defined elsewhere */
extern uint8_t inb(uint16_t port);
extern void yield(void);
extern unsigned long jiffies;

#define STATUS_PORT 0x3FD
#define READY_BIT   0x01

/*
 * Technique 1: PAUSE Instruction (x86)
 *
 * The PAUSE instruction hints to the CPU that this is a spin loop.
 * Benefits:
 *  - Reduces power consumption in the loop
 *  - Avoids memory ordering violations
 *  - Improves SMT (hyperthreading) performance by yielding resources
 */
void poll_with_pause(void) {
    while ((inb(STATUS_PORT) & READY_BIT) == 0) {
        __asm__ volatile ("pause" ::: "memory");
    }
}

/*
 * Technique 2: Exponential Backoff
 *
 * Poll frequently at first (capturing fast responses quickly),
 * then slow down if the device is taking a while.
 */
void poll_exponential_backoff(void) {
    int delay = 1;
    const int max_delay = 1000;

    while ((inb(STATUS_PORT) & READY_BIT) == 0) {
        /* Wait for 'delay' pause cycles */
        for (int i = 0; i < delay; i++) {
            __asm__ volatile ("pause");
        }
        /* Increase delay exponentially, up to a maximum */
        if (delay < max_delay) {
            delay *= 2;
        }
    }
}

/*
 * Technique 3: Bounded Polling with Timeout
 *
 * Never poll forever - always have a timeout.
 * After timeout, return error or switch strategies.
 */
int poll_with_timeout(unsigned long timeout_jiffies) {
    unsigned long deadline = jiffies + timeout_jiffies;

    while ((inb(STATUS_PORT) & READY_BIT) == 0) {
        if (jiffies >= deadline) {
            return -1;  /* Timeout - consider interrupt mode or error */
        }
        __asm__ volatile ("pause");
    }
    return 0;  /* Success */
}

/*
 * Technique 4: Yielding Poll
 *
 * Give up the CPU after each check, allowing other work to run.
 * Increases latency but dramatically improves system throughput.
 */
void poll_with_yield(void) {
    while ((inb(STATUS_PORT) & READY_BIT) == 0) {
        yield();  /* Let the scheduler run other tasks */
    }
}

/*
 * Technique 5: Hybrid Polling
 *
 * Fast poll initially (for low latency on quick operations),
 * then switch to yielding mode for slow operations.
 *
 * This is used by modern NICs in their NAPI polling mode.
 */
#define FAST_POLL_LIMIT 1000

int poll_hybrid(void) {
    /* Phase 1: Fast polling for quick responses */
    for (int i = 0; i < FAST_POLL_LIMIT; i++) {
        if (inb(STATUS_PORT) & READY_BIT) {
            return 0;  /* Got response quickly */
        }
        __asm__ volatile ("pause");
    }

    /* Phase 2: Slow polling with yields */
    while ((inb(STATUS_PORT) & READY_BIT) == 0) {
        yield();
    }
    return 0;
}

/*
 * Technique 6: Busy Polling with Budget
 *
 * Use a "time budget" - poll only until time quota is exhausted,
 * then return and let caller decide what to do.
 *
 * Good for real-time systems with latency constraints.
 */
int poll_budgeted(int max_polls) {
    for (int i = 0; i < max_polls; i++) {
        if (inb(STATUS_PORT) & READY_BIT) {
            return i;  /* Return number of polls needed */
        }
        __asm__ volatile ("pause");
    }
    return -1;  /* Budget exhausted, not ready */
}

Modern Linux networking uses NAPI (New API), which combines interrupts and polling. When traffic is low, packets trigger interrupts. Under high load, the driver switches to polling mode, processing many packets per polling cycle. This adaptive approach minimizes both latency (when idle) and interrupt overhead (under load).
While DMA has largely replaced PIO for bulk data transfers, PIO remains present in modern systems for specific purposes:
1. Device Configuration:
Even devices with sophisticated DMA engines require PIO for their control registers. You can't use DMA to program the DMA engine! Configuration and status registers are almost always accessed via PIO (either port-mapped or memory-mapped).
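A sketch of this idea, using an invented DMA controller register layout (real devices define their own registers; this only shows that the setup itself is plain programmed I/O):

#include <stdint.h>

/* Hypothetical DMA controller, memory-mapped at an illustrative address */
struct dma_regs {
    volatile uint32_t src_addr;   /* Physical source address      */
    volatile uint32_t dst_addr;   /* Physical destination address */
    volatile uint32_t length;     /* Transfer length in bytes     */
    volatile uint32_t control;    /* Bit 0 = start transfer       */
};

#define DMA ((struct dma_regs *)0xF0000000)   /* Illustrative base address */

void dma_start_copy(uint32_t src, uint32_t dst, uint32_t len) {
    /* Programming the DMA engine is done register-by-register by the CPU:
     * this configuration step is itself programmed I/O. */
    DMA->src_addr = src;
    DMA->dst_addr = dst;
    DMA->length   = len;
    DMA->control  = 1;   /* Kick off the transfer */

    /* From here on, the data moves without the CPU touching each byte;
     * only the setup above used PIO. */
}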
2. Virtualization and Emulation:
Virtual machines and emulators often use PIO for device emulation: every guest access to an I/O port or device register traps into the hypervisor, which can emulate the device's behavior with simple, synchronous code.
QEMU and VirtualBox use PIO heavily for simple devices such as emulated serial ports, PS/2 keyboard controllers, and legacy IDE disk interfaces.
Trusted Platform Modules (TPMs) are another example: their memory-mapped registers (typically at physical address 0xFED40000) are driven with PIO. The access pattern involves writing a command, polling a status register for completion, then reading results. This is classic PIO behavior, used for security-sensitive operations that benefit from deterministic, simple code paths; a generic sketch of the pattern follows.
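A generic sketch of that command/poll/response pattern is shown below. The base address, register offsets, and status bit are invented for illustration and do not correspond to the real TPM interface:

#include <stdint.h>

/* Hypothetical command-style device; the addresses are illustrative only */
#define DEV_BASE     0xE0000000u
#define REG_CMD      (*(volatile uint32_t *)(DEV_BASE + 0x00))
#define REG_STATUS   (*(volatile uint32_t *)(DEV_BASE + 0x04))
#define REG_RESPONSE (*(volatile uint32_t *)(DEV_BASE + 0x08))
#define STATUS_DONE  0x01u

/* Issue a command, poll for completion, read the result - all via PIO */
int issue_command_pio(uint32_t command, uint32_t *response) {
    int timeout = 1000000;

    REG_CMD = command;                      /* 1. CPU writes the command       */
    while (!(REG_STATUS & STATUS_DONE)) {   /* 2. CPU polls for completion     */
        if (--timeout == 0) {
            return -1;                      /*    Bounded wait, never forever  */
        }
    }
    *response = REG_RESPONSE;               /* 3. CPU reads back the result    */
    return 0;
}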
Programmed I/O represents the most fundamental—and most expensive—method of CPU-device data transfer. Understanding its mechanics illuminates why more sophisticated techniques were developed.
What's Next:
The core problem with PIO is that the CPU must wait for slow devices. What if, instead of polling, the device could notify the CPU when it's ready? This insight leads to Interrupt-Driven I/O—the subject of our next page.
Interrupt-driven I/O allows the CPU to perform other work while waiting for devices, fundamentally changing the efficiency equation. We'll explore how interrupts work, their costs and benefits, and how modern systems combine interrupts with polling for optimal performance.
You now understand Programmed I/O comprehensively—from polling mechanics through performance analysis to modern relevance. This foundation is essential for appreciating interrupt-driven I/O and DMA, which solve PIO's fundamental efficiency problems while building on its conceptual simplicity.