A computer's components—CPU, memory, I/O devices—are only as useful as their ability to communicate. Just as a city's productivity depends on its transportation infrastructure, a computer's performance is fundamentally constrained by its bus architecture.
In von Neumann's original conception, a shared bus connected all components. This simple design was elegant but created limitations that become increasingly severe as systems scale. Modern computers have evolved sophisticated hierarchical interconnects, but understanding why—and understanding the tradeoffs involved—requires starting from first principles.
The bus architecture isn't just a hardware concern. It directly impacts:
- OS design: device enumeration, interrupt routing, and DMA management all follow the bus protocol
- Application performance: memory bandwidth, I/O latency, and NUMA placement are bus-level effects
- Scalability: how many CPUs and devices a system can support is limited by its interconnect
This page will give you deep insight into the communication backbone that makes computation possible.
By the end of this page, you will understand: (1) Bus components and their functions, (2) Bus protocols and timing, (3) The problems with shared buses at scale, (4) Evolution from shared buses to point-to-point interconnects, (5) Modern bus hierarchies (PCIe, memory channels), and (6) How bus design affects OS behavior.
A bus is a shared communication channel that connects two or more devices. In the computing context, a bus is a collection of parallel signal lines—physical wires or traces on a circuit board—that carry data, addresses, and control signals.
What Makes Something a Bus?
The defining characteristic is sharing: multiple devices connect to the same set of wires. This is both the advantage (simple design, easy to add devices) and the limitation (only one communication can occur at a time).
Bus Terminology
| Term | Definition | Implication |
|---|---|---|
| Bus Width | Number of data lines (bits transferred per cycle) | 32-bit bus moves 4 bytes/cycle; 256-bit moves 32 bytes/cycle |
| Bus Speed/Frequency | Clock rate of the bus | 100 MHz, 1 GHz, etc. Higher = more transfers/second |
| Bus Bandwidth | Maximum data transfer rate (width × frequency) | 64-bit bus at 100 MHz = 800 MB/s theoretical max |
| Bus Transaction | Complete read or write operation | Request → (wait) → Response |
| Bus Master | Device currently controlling the bus | CPU, DMA controller can be bus masters |
| Bus Slave | Device responding to the bus master | Memory, I/O devices are typically slaves |
| Bus Arbitration | Process of deciding which master gets control | Required when multiple masters compete |
| Bus Latency | Time from request to first data returned | Measured in cycles or nanoseconds |
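The bandwidth arithmetic in the table can be checked with a few lines of Python (a sketch; "MB" here means 10^6 bytes, matching the table's convention):

```python
def bus_bandwidth_mb_s(width_bits: int, freq_mhz: float) -> float:
    """Theoretical peak bandwidth: (width in bytes) x (transfers per second)."""
    bytes_per_cycle = width_bits // 8
    transfers_per_sec = freq_mhz * 1_000_000
    return bytes_per_cycle * transfers_per_sec / 1_000_000  # MB/s

# The table's example: a 64-bit bus at 100 MHz
print(bus_bandwidth_mb_s(64, 100))  # → 800.0
```

Real buses never sustain this peak: arbitration, turnaround cycles, and refresh all eat into it.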
The Three Logical Buses (In Detail)
While we introduced the concept earlier, let's examine each bus more precisely:
- Data bus: carries the actual values being transferred; its width determines bytes moved per cycle
- Address bus: carries the location being read or written; its width determines the addressable range (32 lines can address 4 GB)
- Control bus: carries command and timing signals (read/write, ready, interrupt request, clock)
The three-bus model is LOGICAL, not necessarily physical. Modern systems multiplex these functions over fewer physical wires using time-division or packet-based protocols. PCIe, for example, sends addresses, data, and commands as packets over serial lanes—but logically still performs the same functions.
For devices to communicate reliably on a shared bus, they must follow precise timing rules—a bus protocol. There are two fundamental timing approaches:
Synchronous Buses
All operations are coordinated by a central clock signal. Every device samples signals at defined clock edges.
Advantages:
- Simple control logic: no handshaking, every device samples at the clock edge
- High throughput when all devices can keep up with the clock
Disadvantages:
- The clock must run slow enough for the slowest device on the bus
- Clock skew limits how physically long the bus can be
Asynchronous Buses
No central clock. Communication uses handshaking signals: the master indicates readiness, the slave acknowledges.
Advantages:
- Accommodates devices of widely different speeds on the same bus
- No clock skew problem, so the bus can be physically longer
Disadvantages:
- Handshaking overhead on every transfer reduces throughput
- More complex control logic in every device
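The request/acknowledge handshake can be modeled in a few lines of Python (a functional sketch, not cycle-accurate hardware):

```python
class Slave:
    """Responds to a request by latching data onto the 'bus' and acknowledging."""
    def __init__(self, memory):
        self.memory = memory
        self.ack = False
        self.data = None

    def observe(self, request, address):
        if request and not self.ack:
            self.data = self.memory[address]   # put data on the bus
            self.ack = True                    # assert Acknowledge
        elif not request:
            self.ack = False                   # deassert once Request drops

def async_read(slave, address):
    request = True                   # 1. master asserts Request, drives Address
    slave.observe(request, address)  # 2-3. slave processes, asserts Acknowledge
    assert slave.ack                 # 4. master waits for Acknowledge...
    data = slave.data                #    ...then samples Data
    request = False                  # 5. master deasserts Request
    slave.observe(request, address)  # 6. slave deasserts Acknowledge
    return data

print(hex(async_read(Slave({0x10: 0xAB}), 0x10)))  # → 0xab
```

Notice that no clock appears anywhere: each side advances only when it sees the other side's signal change.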
SYNCHRONOUS BUS READ (simplified):

```
Clock    ─┬──┬──┬──┬──┬──┬──┬──┬─
Address  ═╪═══════════════════╪══════
          │←─ Valid Address ─→│
R/W#     ─┴───────────────────┴──────   (Read = Low)
Data     ═════════════════╪══════════
                          │← Valid Data (from memory)
Cycle      1  2  3  4  5  6  7          (data valid on cycle 7)
```

ASYNCHRONOUS BUS READ (handshake):

```
Request      ───┐                           ┌───
                └───────────────────────────┘
Address      ═══╪══════ Valid Address ══════╪═══
Acknowledge  ─────────────────┐ (slave processing)
                              └─────────────────
Data         ═════════════════╪══ Valid Data ═══
```

The handshake sequence:
1. Master asserts Request and puts the Address on the bus
2. Slave sees Request, processes it, and puts the Data on the bus
3. Slave asserts Acknowledge
4. Master sees Acknowledge and reads the Data
5. Master deasserts Request
6. Slave deasserts Acknowledge

Bus Transaction Types
Beyond simple read/write, modern buses support sophisticated transactions:
| Transaction Type | Description | Use Case |
|---|---|---|
| Single Read | Read one word from one address | CPU fetching a value |
| Single Write | Write one word to one address | Storing a variable |
| Burst Read | Read multiple consecutive words | Cache line fill |
| Burst Write | Write multiple consecutive words | Cache line writeback |
| Read-Modify-Write | Atomic read, modify, write | Lock implementation |
| Split Transaction | Request and response are separate | Pipelined buses |
| Posted Write | Write without waiting for acknowledgment | High-throughput writes |
Burst transfers are critical for performance. Instead of incurring the addressing overhead for each word, a burst specifies a starting address and count. The slave returns consecutive words automatically. This is how cache lines (typically 64 bytes) are filled: one address, 8 transfers of 8 bytes each.
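The saving is easy to quantify. A sketch assuming a 4-cycle address/setup phase and one data word transferred per cycle (both numbers are illustrative assumptions):

```python
def fill_cycles(line_bytes, word_bytes, addr_cycles, burst=False):
    """Cycles to fill a cache line, given addr_cycles of address/setup
    overhead per transaction and one data word moved per cycle."""
    words = line_bytes // word_bytes
    if burst:
        return addr_cycles + words       # one address phase, then back-to-back data
    return words * (addr_cycles + 1)     # address overhead repeated for every word

# 64-byte cache line, 8-byte words, assumed 4-cycle address phase
print(fill_cycles(64, 8, 4))               # → 40  (eight single reads)
print(fill_cycles(64, 8, 4, burst=True))   # → 12  (one burst)
```

The longer the address phase relative to the data phase, the bigger the win from bursting.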
Posted writes allow the CPU to continue without waiting for write acknowledgment. This improves performance but complicates error handling—what if the write fails after the CPU has moved on?
In high-performance systems, a bus master can issue multiple requests before any responses return. The slave(s) respond as data becomes available, potentially out of order. This requires tagging requests so responses can be matched. Modern interconnects like PCIe use this extensively for maximizing bandwidth.
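Tag-based request matching can be sketched in Python (a simplified model; real interconnects bound the number of outstanding tags):

```python
class SplitTransactionMaster:
    """Tags each outstanding request so out-of-order responses can be matched."""
    def __init__(self):
        self.next_tag = 0
        self.outstanding = {}                 # tag -> address awaiting a response

    def issue(self, address):
        tag = self.next_tag
        self.next_tag += 1
        self.outstanding[tag] = address       # remember what this tag asked for
        return tag                            # tag travels with the request packet

    def complete(self, tag, data):
        address = self.outstanding.pop(tag)   # match response back to its request
        return address, data

m = SplitTransactionMaster()
t0 = m.issue(0x1000)
t1 = m.issue(0x2000)
# Responses may return out of order:
print(m.complete(t1, "B"))  # → (8192, 'B')   i.e. address 0x2000
print(m.complete(t0, "A"))  # → (4096, 'A')   i.e. address 0x1000
```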
When multiple devices can be bus masters (CPU, DMA controllers, other processors), someone must decide who gets access when both want it simultaneously. This is bus arbitration, and it's a microcosm of the resource scheduling problems that pervade OS design.
Why Arbitration Matters:
- Correctness: two masters driving the bus simultaneously corrupts every signal on it
- Fairness: every master must eventually get access, or devices starve
- Priority: latency-sensitive devices need access sooner than bulk-transfer devices
Common Arbitration Schemes:
- Daisy chaining — a grant signal passes from device to device; simple, but priority is fixed by position and the chain is a single point of failure
- Centralized parallel arbitration — a dedicated arbiter sees all request lines and applies a priority or fairness policy
- Distributed arbitration — devices resolve contention among themselves, for example by comparing priority codes on shared lines
- Round-robin — priority rotates after each grant so no master starves
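Round-robin arbitration can be sketched in a few lines of Python (a behavioral model, not hardware):

```python
def round_robin_arbiter(requests, last_granted):
    """Grant the first requesting master after the previously granted one.
    requests: one bool per master. Returns the granted index, or None."""
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_granted + offset) % n
        if requests[candidate]:
            return candidate
    return None                      # no master is requesting

# Masters 0 and 2 both request the bus:
print(round_robin_arbiter([True, False, True], last_granted=0))  # → 2
print(round_robin_arbiter([True, False, True], last_granted=2))  # → 0
```

Because the search starts just past the last winner, a continuously requesting master cannot lock out the others.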
Arbitration and DMA
DMA (Direct Memory Access) creates interesting arbitration scenarios. A DMA controller transfers data between memory and I/O without CPU intervention—but it needs bus access to do so.
Approaches:
Cycle Stealing — DMA steals individual bus cycles from the CPU. Minimal latency impact but many arbitration events.
Burst Mode / Block Transfer — DMA takes over the bus for an entire block transfer. More efficient but CPU is blocked longer.
Transparent DMA — DMA operates only when the CPU isn't using the bus (e.g., during instruction decode phases where no memory access occurs). Ideal but hard to implement.
The OS must configure DMA to balance device throughput against CPU latency. A disk DMA transferring in burst mode can block the CPU for microseconds—an eternity at GHz clock speeds.
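A back-of-the-envelope sketch of that claim, assuming an 800 MB/s bus and a 3 GHz CPU stalled for the whole transfer (both numbers are illustrative):

```python
def dma_block_time_us(block_bytes, bus_mb_s):
    """How long a burst-mode DMA transfer monopolizes the bus (microseconds),
    assuming it runs at the bus's full theoretical bandwidth."""
    return block_bytes / (bus_mb_s * 1e6) * 1e6

# Illustrative: a 4 KB disk block over an 800 MB/s bus
us = dma_block_time_us(4096, 800)
cycles = us * 1e-6 * 3e9   # cycles a fully stalled 3 GHz CPU would lose
print(f"{us:.2f} us, ~{cycles:.0f} CPU cycles")  # → 5.12 us, ~15360 CPU cycles
```

Fifteen thousand lost cycles per 4 KB block is why cycle stealing exists despite its extra arbitration overhead.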
If a high-priority device (CPU) is waiting for a resource held by a low-priority device (slow DMA), and medium-priority devices keep winning arbitration, the high-priority device is effectively demoted. This is bus-level priority inversion, analogous to the famous Mars Pathfinder bug. Careful priority assignment is essential.
The original von Neumann architecture assumed a single shared bus connecting CPU, memory, and I/O. This design has a fundamental limitation: only one transfer can occur at a time.
As components became faster, this bottleneck became critical:
The Problem:
Imagine a system with:
- A 3 GHz CPU demanding roughly 10 GB/s from memory
- A GPU plus disk and network devices together demanding several more GB/s
- A single shared bus that can carry only about 3 GB/s
The bus becomes the limiting factor. The CPU starves waiting for memory. I/O starves waiting for the CPU to finish. Performance collapses.
Quantifying the Bottleneck
| Era | CPU Speed | Memory Demand | I/O Demand | Single Bus Can Provide | Gap |
|---|---|---|---|---|---|
| 1970s | 1 MHz | ~1 MB/s | ~0.1 MB/s | ~2 MB/s | OK |
| 1990s | 100 MHz | ~100 MB/s | ~50 MB/s | ~100 MB/s | Tight |
| 2000s | 3 GHz | ~10 GB/s | ~5 GB/s | ~3 GB/s | CRISIS |
| 2020s | 5 GHz (multi-core) | ~200 GB/s | ~100 GB/s | Shared bus impossible | HIERARCHY REQUIRED |
Symptoms of Bus Saturation:
Memory stalls: CPU spends cycles waiting for memory, visible as low IPC (instructions per cycle) despite no cache misses. The bus simply can't feed data fast enough.
I/O latency spikes: Even with DMA, I/O transfers take longer when competing with memory traffic.
Scaling walls: Adding a second CPU to the shared bus doesn't double performance—both CPUs compete for the same bandwidth and often get less than 1.5× together.
Cache traffic dominance: In multi-core systems, cache coherence traffic (keeping caches synchronized) can dominate, leaving little bandwidth for actual data.
The Inevitable Solution: Hierarchy
Just as cities solve traffic with a hierarchy of roads (highways, arterial roads, local streets), computers moved to hierarchical interconnects. The shared bus became multiple specialized buses, each optimized for its traffic type:
- A memory bus between the CPU and DRAM, optimized for bandwidth
- Point-to-point links between CPUs, optimized for low-latency coherence traffic
- An I/O interconnect (PCIe) for peripherals, optimized for flexibility and device count
This hierarchy is fundamental to modern computer architecture.
For years, Intel x86 systems used a Front-Side Bus (FSB) connecting the CPU to a 'Northbridge' chip, which in turn connected to memory and a 'Southbridge' for I/O. This evolved into direct CPU-memory connections (like AMD's HyperTransport and Intel's QPI/UPI) with integrated memory controllers. The Northbridge's functions moved into the CPU itself.
Today's computers don't have a single bus—they have a hierarchy of interconnects, each optimized for different requirements:
The Modern PC Architecture (circa 2020s)
Key Interconnect Technologies:
1. Point-to-Point CPU Links
Modern CPUs use dedicated high-speed links for critical connections:
- Memory controllers integrated on the CPU die, connecting directly to DRAM channels
- Inter-CPU links for multi-socket systems (Intel UPI, AMD Infinity Fabric)
- A chipset link (such as Intel's DMI) carrying traffic for slower peripherals
2. PCIe (Peripheral Component Interconnect Express)
The dominant I/O interconnect. Despite the name, PCIe is NOT a bus—it's a network of point-to-point serial links.
| PCIe Version | Per-Lane Bandwidth (each direction) | x16 Slot Bandwidth |
|---|---|---|
| PCIe 3.0 | 1 GB/s | 16 GB/s |
| PCIe 4.0 | 2 GB/s | 32 GB/s |
| PCIe 5.0 | 4 GB/s | 64 GB/s |
| PCIe 6.0 | 8 GB/s | 128 GB/s |
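The table's per-lane figures come from the line rate times the encoding efficiency. For PCIe 3.0 (8 GT/s line rate, 128b/130b encoding) the arithmetic checks out in Python:

```python
def pcie3_lane_gb_s():
    """PCIe 3.0 per-lane bandwidth: 8 GT/s line rate, 128b/130b encoding."""
    line_rate_gt_s = 8.0
    encoding_efficiency = 128 / 130   # 2 overhead bits per 130-bit block
    return line_rate_gt_s * encoding_efficiency / 8   # bits -> bytes

per_lane = pcie3_lane_gb_s()
print(f"per lane: {per_lane:.3f} GB/s, x16: {per_lane * 16:.1f} GB/s")
# → per lane: 0.985 GB/s, x16: 15.8 GB/s
```

That is why the table rounds PCIe 3.0 to "1 GB/s per lane": the exact figure is just under it.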
PCIe uses a packet-based protocol with sophisticated features:
- Split transactions with tags, allowing many requests in flight at once
- Credit-based flow control, so a fast sender cannot overrun a slow receiver
- CRC error detection with automatic retransmission of corrupted packets
- Message-signaled interrupts (MSI/MSI-X), delivered as ordinary memory writes
3. Memory Channels
CPUs have multiple independent memory channels, each providing full bandwidth to a subset of memory:
- Each channel has its own data and command wires to its DIMMs
- A dual-channel DDR4-3200 system provides 2 × 25.6 GB/s = 51.2 GB/s peak
- The memory controller interleaves addresses across channels so sequential accesses use all channels in parallel
The original PCI was a shared bus—all devices shared 133 MB/s. When GPUs alone needed gigabytes per second, this was hopeless. PCIe replaced the shared bus with a switch fabric where each device gets dedicated lanes. The 'Express' name reflects this fundamental architecture change from shared to point-to-point.
The bus architecture fundamentally shapes how operating systems are designed and how they manage devices:
Device Enumeration and Configuration
Modern buses (PCIe, USB) are self-describing. Devices contain configuration spaces that describe their capabilities. The OS must:
- Enumerate the bus to discover attached devices
- Read each device's configuration space to learn its identity and resource needs
- Assign non-conflicting address ranges (BARs) and interrupts
- Load and bind the matching driver
This is why you can plug in a new device and the OS detects it—the bus protocol supports discovery.
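Discovery boils down to probing well-known configuration addresses. A minimal Python sketch of a PCI-style scan of one bus, where `read_vendor_id` is a hypothetical stand-in for a real config-space read:

```python
INVALID_VENDOR = 0xFFFF   # PCI convention: all-ones means "nothing here"

def enumerate_bus(read_vendor_id, bus=0):
    """Probe every device/function slot on one bus and collect the
    (bus, device, function) triples that answered with a valid Vendor ID."""
    found = []
    for device in range(32):          # PCI allows 32 devices per bus
        for function in range(8):     # and 8 functions per device
            if read_vendor_id(bus, device, function) != INVALID_VENDOR:
                found.append((bus, device, function))
    return found

# Mock config space: an Intel device at slot 3, an NVIDIA device at slot 5
mock = {(0, 3, 0): 0x8086, (0, 5, 0): 0x10DE}
print(enumerate_bus(lambda b, d, f: mock.get((b, d, f), INVALID_VENDOR)))
# → [(0, 3, 0), (0, 5, 0)]
```

A real OS then recurses through bridges to reach downstream buses, which this sketch omits.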
PCIe Device Enumeration (Simplified):

At boot:
1. Root Complex scans Bus 0
2. For each device number (0-31), for each function number (0-7):
   - Read the Vendor ID at configuration offset 0x00
   - If Vendor ID = 0xFFFF → no device, skip
   - If valid → device present
3. If the device is a bridge:
   - Assign secondary bus numbers
   - Recursively enumerate downstream buses

Device Configuration Space (first 64 bytes):

| Offset | Field | Purpose |
|---|---|---|
| 0x00 | Vendor ID | Who made this device |
| 0x02 | Device ID | What device model |
| 0x04 | Command | Enable device features |
| 0x06 | Status | Error flags, capabilities |
| 0x08 | Class Code | Device type (graphics, NIC) |
| 0x0C | Header Type | Layout of rest of header |
| 0x10-0x24 | BARs (Base Address Registers) | Memory/I/O regions the device uses |
| 0x3C | Interrupt Line | Which IRQ to use |

The OS reads this, assigns BARs to non-conflicting addresses, programs the device, and loads the matching driver.

Memory-Mapped I/O and the OS
When a device's registers are memory-mapped, the OS must:
- Map the device's register region into the kernel's address space
- Mark those pages uncacheable, so every access actually reaches the device
- Use memory barriers where the order of register accesses matters
- Prevent user processes from mapping device registers directly
Interrupt Routing
Bus architecture determines how device interrupts reach the CPU. The OS must understand:
- How legacy interrupt lines (INTx) are routed through interrupt controllers
- Message-signaled interrupts (MSI/MSI-X), delivered as memory writes over the bus
- Which CPU or CPUs each device's interrupts should target
The OS interrupt subsystem must configure interrupt routing, balance interrupts across CPUs, and dispatch to appropriate handlers.
DMA is powerful but dangerous—a malicious or buggy device could read/write arbitrary memory, bypassing OS protections. The IOMMU (I/O Memory Management Unit) provides address translation and access control for DMA. The OS configures IOMMU page tables so each device can only access its designated memory regions. This is essential for security (virtualization, Thunderbolt) and fault isolation.
Bus architecture directly impacts application and system performance in ways that are often invisible but significant:
Latency Matters More Than Bandwidth (Sometimes)
For small transfers, the time to set up the transfer (latency) dominates. For large transfers, available bandwidth dominates.
| Transfer Size | PCIe RTT (~1μs) | NVMe SSD (~100μs) | Limiting Factor |
|---|---|---|---|
| 4 bytes | 1μs | 100μs | Latency |
| 4 KB | 1μs + 0.5μs | 100μs + 0.5μs | Latency |
| 1 MB | 1μs + 100μs | 100μs + 100μs | Balanced |
| 1 GB | 1μs + 100ms | 100μs + 100ms | Bandwidth |
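The table's pattern follows from a simple model: total time = fixed latency + size / bandwidth. A sketch with assumed numbers (1 μs round trip, 10 GB/s link, both illustrative):

```python
def transfer_time_us(size_bytes, latency_us, bandwidth_gb_s):
    """Total transfer time = fixed setup latency + size / bandwidth."""
    return latency_us + size_bytes / (bandwidth_gb_s * 1e9) * 1e6

# Assumed PCIe-like link: 1 us round trip, 10 GB/s
for size in (4, 4096, 1 << 20):
    print(size, round(transfer_time_us(size, 1.0, 10.0), 3))
# → 4 1.0
# → 4096 1.41
# → 1048576 105.858
```

For small transfers the latency term dominates completely; the crossover to bandwidth-bound happens once size / bandwidth exceeds the setup latency.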
NUMA and Memory Locality
In multi-socket systems, memory is physically attached to specific CPUs. Accessing "local" memory is faster than "remote" memory accessed via inter-CPU links:
Local memory access: ~80-100 ns
Remote memory access: ~150-300 ns (1.5-3× slower!)
The OS scheduler should ideally place processes near their data. Memory allocators should prefer local memory. This is a direct consequence of the bus hierarchy.
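A quick sketch of why placement matters, using the latencies quoted above (80 ns local, 150 ns remote, both illustrative):

```python
def effective_latency_ns(local_ns, remote_ns, local_fraction):
    """Average memory latency given the fraction of accesses that hit local DRAM."""
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns

# Perfect NUMA-aware placement vs. a NUMA-oblivious 50/50 spread
print(effective_latency_ns(80, 150, 1.0))   # → 80.0
print(effective_latency_ns(80, 150, 0.5))   # → 115.0
```

A NUMA-oblivious allocator thus pays roughly a 40% latency penalty on every memory access in this model, which is exactly what NUMA-aware scheduling and allocation try to avoid.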
Example 2-Socket NUMA System:

```
┌─────────────────────────────┐      ┌─────────────────────────────┐
│ Socket 0                    │      │ Socket 1                    │
│ CPU Cores 0-11              │      │ CPU Cores 12-23             │
│ L1/L2/L3 Cache              │      │ L1/L2/L3 Cache              │
│ Memory Controller 0         │      │ Memory Controller 1         │
│ 128 GB DDR4 (NUMA Node 0)   │      │ 128 GB DDR4 (NUMA Node 1)   │
└──────────────┬──────────────┘      └──────────────┬──────────────┘
               │     Inter-Socket Link (UPI/IF)     │
               └────────────────────────────────────┘
```

Access latencies:
- Core 0 → Node 0 memory: 80 ns (LOCAL)
- Core 0 → Node 1 memory: 150 ns (REMOTE, must cross UPI)
- Core 12 → Node 1 memory: 80 ns (LOCAL)
- Core 12 → Node 0 memory: 150 ns (REMOTE)

OS commands:

```
numactl --cpunodebind=0 --membind=0 ./program   # Pin to node 0
lscpu                                           # Show NUMA topology
```

PCIe Slot Placement Matters
On a typical consumer motherboard, not all PCIe slots are equal:
- Some slots connect directly to the CPU's PCIe lanes; others route through the chipset and share its single uplink to the CPU with every other chipset device
- A slot that is physically x16 may be wired with only x4 or x8 electrical lanes
- Lane allocation can change depending on which other slots and M.2 sockets are populated
Putting a high-performance GPU or NVMe SSD in a chipset-connected slot can dramatically reduce performance.
Batching and Coalescing
Given the overhead of bus transactions, both OS and hardware try to batch operations:
- Interrupt coalescing: a NIC raises one interrupt for many received packets
- Write combining: the CPU merges adjacent small writes into one bus transaction
- Scatter-gather DMA: one DMA setup moves many non-contiguous buffers
- I/O schedulers merge adjacent disk requests before issuing them
Understanding that these optimizations exist—and why—helps debug performance anomalies.
If I/O performance is mysteriously slow, check: (1) Is the device in the right slot? (2) Is the link operating at expected speed? (lspci -vv on Linux shows negotiated link speed). (3) Is interrupt affinity set appropriately? (4) Is NUMA placing data near the processing CPU?
We've explored the bus architecture that connects every component in a computer. Let's consolidate the key concepts:
- A bus is a shared set of lines carrying data, addresses, and control signals; bandwidth is width × frequency
- Bus protocols are synchronous (clock-driven) or asynchronous (handshake-driven); bursts, split transactions, and posted writes raise throughput
- Multiple bus masters require arbitration, a small-scale version of the resource scheduling problems the OS faces everywhere
- Shared buses stopped scaling, so modern systems use a hierarchy of specialized interconnects
- PCIe is a point-to-point, packet-based switch fabric, not a shared bus
- Bus design surfaces in the OS as device enumeration, interrupt routing, DMA and IOMMU management, and NUMA-aware scheduling
What's Next:
We've seen how components communicate. But communication is only part of the picture—the CPU must systematically fetch, understand, and execute instructions. The next page examines the Instruction Cycle: fetch, decode, execute. This is the heartbeat of computation, the repetitive process that breathes life into stored programs. Understanding it completes our picture of the von Neumann architecture.
You now understand bus architecture—the communication backbone connecting CPU, memory, and I/O. From simple shared buses to modern hierarchical point-to-point interconnects, you can see how hardware evolution shaped OS design and continues to influence application performance.