Every program you've ever run—from a simple "Hello, World" to a complex 3D game engine—executes through the same fundamental process, repeated billions of times per second: fetch an instruction, decode it, execute it, repeat.
This is the instruction cycle (also called the fetch-decode-execute cycle or machine cycle), and it is the heartbeat of every von Neumann computer. Understanding this cycle in depth is essential for anyone who wants to write operating systems, optimize performance-critical code, or debug programs at the machine level.
This page will trace through the instruction cycle step by step, examine what happens at each stage, and explore the sophisticated optimizations that modern CPUs use to execute instructions faster.
By the end of this page, you will understand: (1) The detailed steps of the instruction cycle, (2) How instructions are encoded and decoded, (3) The role of each CPU component in execution, (4) How pipelining overlaps multiple instructions, (5) What happens during interrupts and exceptions, and (6) Why understanding this cycle matters for OS development.
At its core, the instruction cycle is conceptually simple. The CPU is a state machine that endlessly repeats a fixed sequence of operations:
The Canonical Three Steps:
1. FETCH: read the next instruction from memory at the address held in the program counter (PC).
2. DECODE: interpret the instruction bits to determine which operation to perform and which operands it uses.
3. EXECUTE: perform the operation and write back any result.
The Cycle in Action: A Concrete Example
Let's trace through a specific instruction: ADD R1, R2, R3 (R1 = R2 + R3)
FETCH: The CPU sends the PC value to memory, loads the 4-byte instruction word into the instruction register (IR), and advances the PC to the next instruction.
DECODE: The opcode field identifies the operation (ADD); the register fields select R2 and R3 as sources and R1 as the destination.
EXECUTE: The ALU adds the values read from R2 and R3, and the result is written back to R1.
This entire sequence happens in nanoseconds—a 3 GHz CPU completes ~3 billion such cycles per second.
Our example assumes fixed 4-byte instructions. Many ISAs (like x86) have variable-length instructions—anywhere from 1 to 15 bytes. The decoder must determine instruction length during decode, which complicates the process but enables more compact code.
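The cycle above can be sketched as a toy simulator. Everything here is illustrative: the 4-byte encoding (opcode in the top byte, three register indices below it) is invented for this example and matches no real ISA.

```python
# Toy fetch-decode-execute loop. The encoding is invented for illustration:
# byte 3 = opcode, byte 2 = rd, byte 1 = rs1, byte 0 = rs2.
OP_HALT, OP_ADD = 0x00, 0x01

def run(memory, regs):
    pc = 0
    while True:
        # FETCH: read the 4-byte word at PC, advance PC
        ir = memory[pc // 4]
        pc += 4
        # DECODE: split the word into opcode and register fields
        opcode = (ir >> 24) & 0xFF
        rd, rs1, rs2 = (ir >> 16) & 0xFF, (ir >> 8) & 0xFF, ir & 0xFF
        # EXECUTE: perform the operation (or stop)
        if opcode == OP_ADD:
            regs[rd] = regs[rs1] + regs[rs2]
        elif opcode == OP_HALT:
            return regs

regs = [0, 0, 42, 10]          # R2 = 42, R3 = 10
program = [0x01010203,         # ADD R1, R2, R3
           0x00000000]         # HALT
run(program, regs)
print(regs[1])  # 52
```

A real CPU does this in hardware, of course, but the control flow is exactly this loop.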
Before the CPU can decode an instruction, we must understand how instructions are encoded—how abstract operations like "add" or "load" are represented as bit patterns.
Components of an Instruction:
Every machine instruction contains an opcode (which operation to perform) and zero or more operand specifiers (registers, immediate values, or memory addresses).
Instruction Formats
Different ISAs (Instruction Set Architectures) use different formats:
RISC-V Instruction Formats (32-bit fixed-length):

R-type (Register-Register operations: add, sub, and, or):

```
┌─────────┬───────┬───────┬────────┬────────┬────────┐
│ funct7  │  rs2  │  rs1  │ funct3 │   rd   │ opcode │
│ 7 bits  │5 bits │5 bits │ 3 bits │ 5 bits │ 7 bits │
└─────────┴───────┴───────┴────────┴────────┴────────┘
 [31:25]  [24:20] [19:15] [14:12]  [11:7]   [6:0]

Example: ADD x1, x2, x3 → 0x003100B3
  opcode = 0110011 (R-type arithmetic)
  rd     = 00001   (x1)
  funct3 = 000     (ADD)
  rs1    = 00010   (x2)
  rs2    = 00011   (x3)
  funct7 = 0000000
```

I-type (Immediate operations: addi, load):

```
┌────────────┬───────┬────────┬────────┬────────┐
│  imm[11:0] │  rs1  │ funct3 │   rd   │ opcode │
│  12 bits   │5 bits │ 3 bits │ 5 bits │ 7 bits │
└────────────┴───────┴────────┴────────┴────────┘
```

S-type (Store operations):

```
┌─────────┬───────┬───────┬────────┬──────────┬────────┐
│imm[11:5]│  rs2  │  rs1  │ funct3 │ imm[4:0] │ opcode │
└─────────┴───────┴───────┴────────┴──────────┴────────┘
```

B-type (Branch operations), U-type (Upper immediate), and J-type (Jump) follow the same pattern, with their immediate bits rearranged to keep the register fields in fixed positions and simplify the hardware.

Addressing Modes
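The R-type layout can be verified with a few lines of Python; the shift amounts below come directly from the bit ranges in the diagram.

```python
def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode):
    """Pack RISC-V R-type fields into a 32-bit instruction word."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) \
         | (funct3 << 12) | (rd << 7) | opcode

# ADD x1, x2, x3: funct7=0, rs2=3 (x3), rs1=2 (x2), funct3=0, rd=1 (x1)
word = encode_r_type(0, 3, 2, 0b000, 1, 0b0110011)
print(hex(word))  # 0x3100b3
```

This reproduces the 0x003100B3 encoding worked out by hand above.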
The operand specifier doesn't always mean the same thing. The addressing mode determines how to interpret it:
| Mode | Notation | Meaning | Example |
|---|---|---|---|
| Immediate | #100 | The operand IS the value | ADD R1, #100 (R1 += 100) |
| Register | R3 | Value is in the register | ADD R1, R3 (R1 += R3) |
| Direct | [1000] | Value is at memory address | LOAD R1, [1000] (R1 = mem[1000]) |
| Indirect | [R3] | Register contains address | LOAD R1, [R3] (R1 = mem[R3]) |
| Indexed | [R3 + 100] | Base + offset | LOAD R1, [R3+100] |
| Scaled | [R3 + R4*4] | Base + scale*index | For array access |
The addressing mode affects decode complexity. RISC architectures keep it simple (only a few modes); CISC architectures (x86) support many modes, making decode elaborate.
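The table above can be expressed as a small operand resolver. This is a sketch: the mode names and the dict-based `regs` and `mem` are illustrative stand-ins for the register file and memory hardware.

```python
def resolve_operand(mode, spec, regs, mem):
    """Return the operand value for each addressing mode in the table."""
    if mode == "immediate":      # the operand IS the value
        return spec
    if mode == "register":       # value held in a register
        return regs[spec]
    if mode == "direct":         # spec is a memory address
        return mem[spec]
    if mode == "indirect":       # register holds the address
        return mem[regs[spec]]
    if mode == "indexed":        # base register + constant offset
        base, offset = spec
        return mem[regs[base] + offset]
    if mode == "scaled":         # base + index*scale (array access)
        base, index, scale = spec
        return mem[regs[base] + regs[index] * scale]
    raise ValueError(mode)

regs = {3: 1000, 4: 2}
mem = {1000: 7, 1100: 8, 1008: 9}
print(resolve_operand("immediate", 100, regs, mem))     # 100
print(resolve_operand("indirect", 3, regs, mem))        # 7  (mem[1000])
print(resolve_operand("indexed", (3, 100), regs, mem))  # 8  (mem[1100])
print(resolve_operand("scaled", (3, 4, 4), regs, mem))  # 9  (mem[1008])
```

Note how the number of memory accesses varies by mode; this is exactly the decode and interrupt-state complexity the next paragraph discusses.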
When the OS performs a context switch, it must save/restore not just register values but also any instruction-internal state. Complex addressing modes that require multiple memory accesses may be partially complete when an interrupt occurs. The architecture defines precisely when an instruction can be interrupted and what state must be preserved.
The fetch phase seems simple—just read the instruction from memory. In practice, substantial complexity exists:
The Fetch Process:
1. The PC value is placed on the address bus (via the memory address register).
2. Memory (normally the L1 instruction cache) returns the instruction bytes.
3. The instruction is latched into the instruction register (IR).
4. The PC is advanced to point at the next instruction.
The Role of the Instruction Cache (I-Cache)
Modern CPUs have dedicated instruction caches, separate from data caches. This separation (Modified Harvard architecture at cache level) allows instruction fetch and data access to happen simultaneously.
Typical I-cache characteristics: commonly 32-64 KB per core, 64-byte lines, 4-8-way set associativity, with access pipelined over a few cycles.
Instruction Fetch Timing (simplified 4-stage example):

```
Cycle │ Action                                │ Hardware State
══════╪═══════════════════════════════════════╪═══════════════════════════
  1   │ PC value sent to I-cache              │ PC = 0x1000
      │ Cache tag comparison begins           │
──────┼───────────────────────────────────────┼───────────────────────────
  2   │ Tag match/miss determined             │
      │ If hit: data routing begins           │ Cache line found!
──────┼───────────────────────────────────────┼───────────────────────────
  3   │ Instruction bytes extracted           │ Line: [0x1000-0x103F]
      │ PC auto-increment calculated          │ Extract bytes at 0x1000
──────┼───────────────────────────────────────┼───────────────────────────
  4   │ IR loaded with instruction            │ IR = 0x02123000
      │ PC updated to 0x1004                  │ PC = 0x1004
      │ Decode can begin                      │
══════╧═══════════════════════════════════════╧═══════════════════════════

On Cache Miss (L1 miss, L2 hit):
- Add 10-20 cycles waiting for L2
- I-cache line is filled
- Then proceed as above

On Cache Miss (L1, L2, L3 miss - DRAM fetch):
- Add 100-300 cycles!
- This is why cache behavior matters so much
```

Branch Prediction During Fetch
A critical complication: what happens when the fetched instruction is a branch?
If we wait until we know whether the branch is taken, we waste cycles. Instead, modern CPUs use branch prediction to guess the next fetch address before the branch resolves: a branch target buffer (BTB) remembers where previously seen branches went, and a direction predictor guesses taken versus not taken.
Branch prediction accuracy exceeds 95% on typical code, but mispredictions cost 10-20 cycles.
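A classic predictor building block is the 2-bit saturating counter, which needs two consecutive wrong outcomes to flip its prediction. A minimal sketch (one counter standing in for a whole prediction table):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 taken."""
    def __init__(self):
        self.state = 1  # start weakly not-taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at the ends: two wrong guesses needed to flip prediction
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
correct = 0
for taken in [True] * 9 + [False]:   # loop branch: taken 9 times, then exit
    if p.predict() == taken:
        correct += 1
    p.update(taken)
print(correct)  # 8 of 10: one miss warming up, one at the loop exit
```

The saturating behavior is what makes loop branches cheap: a single loop exit does not flip the predictor for the next run of the loop.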
The famous Spectre and Meltdown vulnerabilities exploit speculative execution. Even though mispredicted instructions are 'rolled back,' they leave traces in the cache. Attackers can measure cache timing to infer secrets. This is a direct consequence of aggressive fetch optimization interacting with security requirements—a lesson that OS developers must internalize.
The decode phase translates the raw instruction bits into control signals that orchestrate execution. This is where the CPU "understands" what to do.
Decode Responsibilities: identify the opcode; extract register specifiers and immediate values (sign-extending as needed); determine the addressing mode; and generate the control signals that steer the datapath during execute.
RISC vs CISC Decode Complexity
Fixed-length, regular RISC formats can be decoded in a single stage with simple hardware; variable-length CISC formats must first find instruction boundaries, making decode a multi-stage affair.
The x86 Decode Challenge
Modern x86 CPUs face a formidable decode challenge: hundreds of instruction variants, complex addressing modes, and variable length. The solution? Micro-operations (μops).
x86 CPUs internally decode complex CISC instructions into simpler RISC-like micro-operations:
```
ADD [memory], register  →  μop1: LOAD  temp, [memory]
                           μop2: ADD   temp, register
                           μop3: STORE [memory], temp
```
This decouples the complex ISA from execution—the back-end of the CPU is essentially a RISC machine executing μops.
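The cracking step can be pictured as a table-driven expansion. The tuple-based instruction representation and μop names below are purely illustrative; real decoders emit hardware-internal formats.

```python
def crack(instr):
    """Expand a memory-destination ADD into load/add/store μops (illustrative)."""
    op, dst, src = instr
    if op == "ADD" and dst.startswith("["):        # ADD [mem], reg
        return [("LOAD", "tmp", dst),              # read memory operand
                ("ADD", "tmp", src),               # do the arithmetic
                ("STORE", dst, "tmp")]             # write result back
    return [instr]                                 # register-only ops pass through

uops = crack(("ADD", "[0x2000]", "eax"))
print(len(uops))  # 3
```

A register-to-register add would pass through as a single μop, which is why simple x86 code often maps 1:1 onto μops.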
Pre-Decode and Micro-op Cache
To accelerate decode: pre-decode bits stored alongside the I-cache mark instruction boundaries, and a micro-op cache holds already-decoded μops so that hot loops bypass the decoders entirely.
Intel's μop cache ("Decoded Stream Buffer") can hold ~2000 μops, covering most hot loops. This effectively transforms x86 into RISC internally.
Apple's M-series chips (ARM) can have higher IPC than x86 partly because simplified decode frees up die area and power for more execution units. The x86 decode complexity is a tax paid on every instruction. However, x86's dense encoding means smaller code, better I-cache utilization—it's a tradeoff.
Execute is where the actual work happens. Depending on the instruction type, different resources are used:
Instruction Categories and Execution:
| Category | Examples | Execution Resources | Latency |
|---|---|---|---|
| ALU (Integer) | ADD, SUB, AND, OR | Integer ALU | 1 cycle |
| Shift/Rotate | SHL, ROL | Shifter unit | 1 cycle |
| Multiply | MUL, IMUL | Multiplier | 3-5 cycles |
| Divide | DIV, IDIV | Divider | 20-100 cycles |
| Load | LOAD, MOV from mem | Load port, Data cache | 4-5 cycles (cache hit) |
| Store | STORE, MOV to mem | Store port, Store buffer | 1 cycle (to buffer) |
| Branch | JMP, BEQ, CALL | Branch unit | 1 cycle (predicted) |
| FP Arithmetic | FADD, FMUL | FP unit | 3-6 cycles |
| Vector/SIMD | AVX, SSE | Vector unit | 4-6 cycles |
Execution for Different Instruction Types:
```
ARITHMETIC INSTRUCTION: ADD R1, R2, R3
─────────────────────────────────────────────────────────────────
1. Operand Read:    R2 value (42) → ALU input A
                    R3 value (10) → ALU input B
2. ALU Operation:   ALU computes 42 + 10 = 52
                    Zero flag     = 0 (result not zero)
                    Negative flag = 0 (result positive)
                    Carry flag    = 0 (no carry out)
                    Overflow flag = 0 (no signed overflow)
3. Write-back:      52 → R1

LOAD INSTRUCTION: LOAD R4, [R5 + 0x100]
─────────────────────────────────────────────────────────────────
1. Address Calc:    R5 value (0x2000) + 0x100 = 0x2100
2. D-Cache Check:   Look up address 0x2100 in data cache
3a. Cache Hit:      Data (say, 0xDEADBEEF) available in ~4 cycles
3b. Cache Miss:     Wait for L2 (10+ cycles) or DRAM (100+ cycles)
4. Write-back:      0xDEADBEEF → R4

BRANCH INSTRUCTION: BEQ target (Branch if Equal)
─────────────────────────────────────────────────────────────────
1. Condition Check: Zero flag == 1?
2a. Not Taken:      PC already points to next instruction (done)
2b. Taken:          PC := target address
3. Prediction:      Compare with what was predicted during fetch
4a. Correct:        Continue normally
4b. Misprediction:  Flush pipeline, restart from correct address
                    (10-20 cycle penalty)

STORE INSTRUCTION: STORE R6, [R7]
─────────────────────────────────────────────────────────────────
1. Address Calc:    Address = R7 value
2. Store Buffer:    Write {address, R6 value} to store buffer
3. Execute Done:    Instruction appears complete to the CPU
4. Background:      Store buffer eventually writes to cache/memory
                    (Store can complete before data reaches cache -
                     "posted write")
```

Write-Back and Commit
The final sub-phase of execute is writing results: the destination register is updated, condition flags are set, and stores are posted to the store buffer.
Instruction Commit and Exceptions
Modern out-of-order CPUs can have many instructions "in flight" simultaneously. Results are committed (made architecturally visible) only when an instruction is known to be on the correct execution path and all earlier instructions have themselves committed, in strict program order.
If an exception occurs (divide by zero, page fault, illegal instruction), instructions after the faulting one must be discarded. This is where the OS takes over—we'll discuss this more in the exceptions section.
Even though a store instruction 'completes' quickly to the store buffer, other CPUs may not see that store immediately. Memory ordering models (TSO, relaxed memory models) define when stores become visible. This is critical for multi-threaded programming and OS synchronization primitives.
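The store-buffer behavior described above can be sketched as a FIFO: the issuing core's own loads see buffered stores immediately (store-to-load forwarding), while memory, and thus other cores, see them only after the buffer drains. A toy model, not a real coherence implementation:

```python
from collections import deque

class StoreBuffer:
    """Toy store buffer: this core's loads see buffered stores at once;
    memory (and other cores) see them only after drain()."""
    def __init__(self, memory):
        self.memory = memory
        self.pending = deque()   # (address, value), oldest first

    def store(self, addr, value):
        self.pending.append((addr, value))   # "completes" immediately (posted)

    def load(self, addr):
        # Store-to-load forwarding: newest matching buffered store wins
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain(self):
        while self.pending:
            a, v = self.pending.popleft()
            self.memory[a] = v               # now globally visible

mem = {}
sb = StoreBuffer(mem)
sb.store(0x100, 42)
print(sb.load(0x100))     # 42: this core sees its own store
print(mem.get(0x100, 0))  # 0: memory not yet updated
sb.drain()
print(mem[0x100])         # 42: now visible to everyone
```

The window between `store()` and `drain()` is exactly where memory-ordering models differ, and why synchronization primitives need fences.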
If each instruction goes through F-D-E sequentially, and each phase takes a cycle, every instruction takes 3 cycles. But different phases use different hardware—why not overlap them?
The Pipelining Insight:
While instruction N is executing, instruction N+1 can decode, and instruction N+2 can fetch:
```
Cycle:    1    2    3    4    5    6    7
─────────────────────────────────────────
Inst 1:  [F]  [D]  [E]
Inst 2:       [F]  [D]  [E]
Inst 3:            [F]  [D]  [E]
Inst 4:                 [F]  [D]  [E]
─────────────────────────────────────────
                    ↑
          3 instructions in flight

After the initial fill, one instruction
completes per cycle!
```
Throughput vs Latency: each individual instruction still takes 3 cycles from fetch to completion (latency is unchanged), but once the pipeline is full, one instruction completes every cycle (throughput is tripled).
This is the magic of pipelining—the same hardware, used more efficiently.
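The benefit is easy to quantify: with k stages, n instructions, and no stalls, a pipeline needs k + (n − 1) cycles instead of k·n.

```python
def pipelined_cycles(n, stages, stalls=0):
    """Cycles for n instructions on an ideal pipeline: fill once, then 1/cycle."""
    return stages + (n - 1) + stalls

print(pipelined_cycles(4, 3))                  # 6: matches the diagram above
print(pipelined_cycles(1000, 3))               # 1002 cycles, vs 3000 unpipelined
print(3 * 1000 / pipelined_cycles(1000, 3))    # ≈3x speedup
```

As n grows, the speedup approaches the stage count, which is why deeper pipelines looked attractive until hazards and clock limits intervened.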
Modern Pipeline Depths
| Processor | Pipeline Stages | Notes |
|---|---|---|
| Classic MIPS | 5 | Textbook example |
| Intel Pentium | 5 | Simple |
| Intel Pentium 4 | 31 | Deep pipeline for high clocks |
| Intel Skylake | 14-19 | Modern balanced design |
| Apple M1 | ~12-14 | Efficiency-focused |
Pipeline Hazards:
Pipelining isn't free. Several hazards can stall the pipeline:
1. Data Hazards (RAW - Read After Write)
```
ADD R1, R2, R3   ; R1 written in cycle 5
SUB R4, R1, R5   ; R1 needed in cycle 3 - not ready!
```
Solution: Forwarding (bypass result directly) or stall.
2. Control Hazards (Branches)
```
BEQ target       ; Branch outcome known in cycle 3
ADD R1, R2, R3   ; Already fetched - is it correct?
```
Solution: Branch prediction (guess) + flush on misprediction.
3. Structural Hazards (Resource Conflicts)
```
LOAD R1, [mem]   ; Uses memory port
STORE R2, [mem]  ; Also needs memory port
```
Solution: Duplicate resources or stall.
Every pipeline stall represents wasted cycles. On a deep pipeline, a 10-20 cycle misprediction penalty multiplied across frequent branches can consume a double-digit percentage of potential throughput. This is why modern CPUs invest enormous transistor budgets in branch predictors, out-of-order execution, and other hazard mitigation.
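The cost can be estimated with a simple CPI model. The branch frequency, accuracy, and penalty below are assumed typical values, not measurements:

```python
def cpi_with_mispredicts(base_cpi, branch_freq, accuracy, penalty):
    """Average cycles per instruction including misprediction penalties."""
    return base_cpi + branch_freq * (1 - accuracy) * penalty

# Assume: 1 branch per 5 instructions, 95% accuracy, 15-cycle penalty
cpi = cpi_with_mispredicts(1.0, 0.2, 0.95, 15)
print(round(cpi, 4))  # 1.15 -> roughly 13% of cycles lost to mispredictions
```

Pushing accuracy from 95% to 99% under the same assumptions cuts the penalty term by a factor of five, which is why predictor improvements keep paying off.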
The instruction cycle isn't always allowed to proceed uninterrupted. Interrupts and exceptions can forcibly redirect the CPU:
Types of Interruptions:
Hardware Interrupts: asynchronous external signals from devices (timer, keyboard, disk, network), which can arrive at any point in the instruction stream
Exceptions (Traps): synchronous events caused by instruction execution itself (page fault, divide by zero, illegal instruction, system call)
Faults vs Traps: after a fault (e.g., a page fault), the saved PC points at the faulting instruction so it can be re-executed once the condition is fixed; after a trap (e.g., a system call or breakpoint), the saved PC points at the next instruction so execution resumes past it.
The Interrupt Response Sequence:
Interrupt Response (Simplified):

```
1. INTERRUPT RECOGNITION
   ───────────────────────────────────────────────
   - CPU checks for pending interrupts between instructions
   - If interrupt signal is asserted AND interrupts are enabled:
     → Proceed to interrupt handling
   - If interrupts disabled: remain pending until enabled

2. STATE SAVING (Hardware performs automatically)
   ───────────────────────────────────────────────
   - Push PC onto stack (return address)
   - Push processor status register (flags, interrupt mask)
   - Switch to kernel/supervisor mode if not already
   - Disable further interrupts (prevent nesting, typically)

3. HANDLER DISPATCH
   ───────────────────────────────────────────────
   - Determine interrupt type/number
   - Look up handler address in Interrupt Vector Table (IVT)
     IVT[interrupt_number] → handler_entry_point
   - Load PC with handler address
   - Begin fetching handler instructions

4. HANDLER EXECUTION (OS code runs)
   ───────────────────────────────────────────────
   - Save additional registers (handler's responsibility)
   - Identify the interrupt source
   - Service the interrupt (process the event)
   - Restore saved registers

5. RETURN FROM INTERRUPT
   ───────────────────────────────────────────────
   - Execute IRET/RTI instruction
   - Hardware restores status register from stack
   - Hardware restores PC from stack
   - Resume interrupted instruction stream
```

Precise vs Imprecise Exceptions
In a pipelined, out-of-order CPU, defining "where" an exception occurred is complex:
Precise Exception: When an exception is taken, all earlier instructions have completed, none later have visible effects. The saved PC points exactly at the faulting instruction. This is essential for OS correctness (page fault handlers need to know which load failed).
Imprecise Exception: some earlier instructions may not have completed, and some later ones may have. The state is ambiguous, which makes recovery by the OS difficult or impossible.
Modern CPUs guarantee precise exceptions despite out-of-order execution by using a Reorder Buffer (ROB) that commits instructions in program order.
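The ROB's role can be sketched as follows: instructions may finish executing in any order, but results commit strictly in program order, and a faulting instruction stops commit so everything younger can be discarded. A toy model:

```python
class ReorderBuffer:
    """Toy ROB: execute out of order, commit strictly in program order."""
    def __init__(self, n):
        self.done = [False] * n      # has slot i finished executing?
        self.faulted = [False] * n
        self.head = 0                # oldest uncommitted instruction

    def finish(self, i, faulted=False):
        self.done[i] = True
        self.faulted[i] = faulted

    def commit(self):
        """Commit finished instructions from the head; stop at a fault."""
        committed = []
        while self.head < len(self.done) and self.done[self.head]:
            if self.faulted[self.head]:
                # Precise exception: everything younger is discarded
                return committed, self.head
            committed.append(self.head)
            self.head += 1
        return committed, None

rob = ReorderBuffer(4)
for i in (2, 0, 3, 1):               # instructions complete out of order
    rob.finish(i, faulted=(i == 1))  # instruction 1 takes a page fault
committed, fault_at = rob.commit()
print(committed, fault_at)  # [0] 1 -> saved PC points at instruction 1
```

Even though instructions 2 and 3 finished before instruction 1, they never become architecturally visible, which is exactly what a precise page-fault handler requires.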
Every interrupt causes a context switch overhead: saving register state, invalidating some pipeline state, and jumping to the handler. A timer interrupt every 1ms on a 3 GHz CPU means interrupting every 3 million instructions. If interrupt handling took 1000 cycles, that's 0.03% overhead—acceptable. But interrupts arriving every microsecond (e.g., high-speed networking) must be handled with extreme care.
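The overhead arithmetic in that paragraph is easy to reproduce:

```python
def interrupt_overhead(clock_hz, interval_s, handler_cycles):
    """Fraction of CPU cycles spent in interrupt handling."""
    cycles_between = clock_hz * interval_s
    return handler_cycles / cycles_between

# Timer every 1 ms on a 3 GHz CPU, 1000-cycle handler:
print(f"{interrupt_overhead(3e9, 1e-3, 1000):.2%}")  # 0.03%
# Interrupts every microsecond (e.g., high-speed networking):
print(f"{interrupt_overhead(3e9, 1e-6, 1000):.2%}")  # 33.33%
```

The three-orders-of-magnitude jump is why high-rate devices use techniques like interrupt coalescing and polling instead of taking an interrupt per event.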
We've traced through the instruction cycle in depth—the core mechanism that brings stored programs to life. Let's consolidate: the CPU endlessly fetches, decodes, and executes; the ISA's encoding determines decode complexity; pipelining overlaps the phases to raise throughput; and interrupts and exceptions give the OS control at well-defined points.
What's Next:
We've now covered most of the von Neumann architecture: stored programs, CPU-memory-I/O organization, bus architecture, and the instruction cycle. But no discussion of von Neumann architecture is complete without addressing its fundamental limitation—the von Neumann Bottleneck. The final page in this module explores this critical constraint and the techniques modern systems use to mitigate it.
You now understand the instruction cycle in detail—the fundamental pulse that animates every computer. From instruction encoding through fetch, decode, execute, and interrupt handling, you can trace the journey of a program through the CPU. This knowledge is essential for understanding OS operations like context switching, interrupt handling, and exception management.