Every program you've ever run—from a simple "Hello, World" to a complex 3D game engine—executes through the same fundamental process, repeated billions of times per second: fetch an instruction, decode it, execute it, repeat.
This is the instruction cycle (also called the fetch-decode-execute cycle or machine cycle), and it is the heartbeat of every von Neumann computer. Understanding this cycle in depth is essential for anyone who wants to write operating systems, optimize performance-critical code, or debug programs at the machine level.
This page will trace through the instruction cycle step by step, examine what happens at each stage, and explore the sophisticated optimizations that modern CPUs use to execute instructions faster.
By the end of this page, you will understand: (1) The detailed steps of the instruction cycle, (2) How instructions are encoded and decoded, (3) The role of each CPU component in execution, (4) How pipelining overlaps multiple instructions, (5) What happens during interrupts and exceptions, and (6) Why understanding this cycle matters for OS development.
At its core, the instruction cycle is conceptually simple. The CPU is a state machine that endlessly repeats a fixed sequence of operations:
The Canonical Three Steps:
1. FETCH: read the next instruction from memory at the address held in the program counter (PC).
2. DECODE: interpret the instruction bits to determine which operation to perform and which operands it uses.
3. EXECUTE: perform the operation and write back any result.
The Cycle in Action: A Concrete Example
Let's trace through a specific instruction: ADD R1, R2, R3 (R1 = R2 + R3)
FETCH: The CPU sends the PC value to memory, loads the 4-byte instruction word into the instruction register (IR), and advances the PC to the next instruction.
DECODE: The opcode field identifies the operation (ADD); the register fields select R2 and R3 as sources and R1 as the destination.
EXECUTE: The ALU adds the values read from R2 and R3, and the result is written back to R1.
This entire sequence happens in nanoseconds—a 3 GHz CPU completes ~3 billion such cycles per second.
Our example assumes fixed 4-byte instructions. Many ISAs (like x86) have variable-length instructions—anywhere from 1 to 15 bytes. The decoder must determine instruction length during decode, which complicates the process but enables more compact code.
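The cycle above can be sketched as a toy simulator. Everything here is illustrative: the 4-byte encoding (opcode in the top byte, three register indices below it) is invented for this example and matches no real ISA.

```python
# Toy fetch-decode-execute loop. The encoding is invented for illustration:
# byte 3 = opcode, byte 2 = rd, byte 1 = rs1, byte 0 = rs2.
OP_HALT, OP_ADD = 0x00, 0x01

def run(memory, regs):
    pc = 0
    while True:
        # FETCH: read the 4-byte word at PC, advance PC
        ir = memory[pc // 4]
        pc += 4
        # DECODE: split the word into opcode and register fields
        opcode = (ir >> 24) & 0xFF
        rd, rs1, rs2 = (ir >> 16) & 0xFF, (ir >> 8) & 0xFF, ir & 0xFF
        # EXECUTE: perform the operation (or stop)
        if opcode == OP_ADD:
            regs[rd] = regs[rs1] + regs[rs2]
        elif opcode == OP_HALT:
            return regs

regs = [0, 0, 42, 10]          # R2 = 42, R3 = 10
program = [0x01010203,         # ADD R1, R2, R3
           0x00000000]         # HALT
run(program, regs)
print(regs[1])  # 52
```

A real CPU does this in hardware, of course, but the control flow is exactly this loop.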
Before the CPU can decode an instruction, we must understand how instructions are encoded—how abstract operations like "add" or "load" are represented as bit patterns.
Components of an Instruction:
Every machine instruction contains an opcode (which operation to perform) and zero or more operand specifiers (registers, immediate values, or memory addresses).
Instruction Formats
Different ISAs (Instruction Set Architectures) use different formats:
RISC-V Instruction Formats (32-bit fixed-length):

R-type (Register-Register operations: add, sub, and, or):

```
┌─────────┬───────┬───────┬────────┬────────┬────────┐
│ funct7  │  rs2  │  rs1  │ funct3 │   rd   │ opcode │
│ 7 bits  │5 bits │5 bits │ 3 bits │ 5 bits │ 7 bits │
└─────────┴───────┴───────┴────────┴────────┴────────┘
 [31:25]  [24:20] [19:15] [14:12]  [11:7]   [6:0]

Example: ADD x1, x2, x3 → 0x003100B3
  opcode = 0110011 (R-type arithmetic)
  rd     = 00001   (x1)
  funct3 = 000     (ADD)
  rs1    = 00010   (x2)
  rs2    = 00011   (x3)
  funct7 = 0000000
```

I-type (Immediate operations: addi, load):

```
┌────────────┬───────┬────────┬────────┬────────┐
│  imm[11:0] │  rs1  │ funct3 │   rd   │ opcode │
│  12 bits   │5 bits │ 3 bits │ 5 bits │ 7 bits │
└────────────┴───────┴────────┴────────┴────────┘
```

S-type (Store operations):

```
┌─────────┬───────┬───────┬────────┬──────────┬────────┐
│imm[11:5]│  rs2  │  rs1  │ funct3 │ imm[4:0] │ opcode │
└─────────┴───────┴───────┴────────┴──────────┴────────┘
```

B-type (Branch operations), U-type (Upper immediate), and J-type (Jump) follow the same pattern, with their immediate bits rearranged to keep the register fields in fixed positions and simplify the hardware.

Addressing Modes
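The R-type layout can be verified with a few lines of Python; the shift amounts below come directly from the bit ranges in the diagram.

```python
def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode):
    """Pack RISC-V R-type fields into a 32-bit instruction word."""
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) \
         | (funct3 << 12) | (rd << 7) | opcode

# ADD x1, x2, x3: funct7=0, rs2=3 (x3), rs1=2 (x2), funct3=0, rd=1 (x1)
word = encode_r_type(0, 3, 2, 0b000, 1, 0b0110011)
print(hex(word))  # 0x3100b3
```

This reproduces the 0x003100B3 encoding worked out by hand above.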
The operand specifier doesn't always mean the same thing. The addressing mode determines how to interpret it:
| Mode | Notation | Meaning | Example |
|---|---|---|---|
| Immediate | #100 | The operand IS the value | ADD R1, #100 (R1 += 100) |
| Register | R3 | Value is in the register | ADD R1, R3 (R1 += R3) |
| Direct | [1000] | Value is at memory address | LOAD R1, [1000] (R1 = mem[1000]) |
| Indirect | [R3] | Register contains address | LOAD R1, [R3] (R1 = mem[R3]) |
| Indexed | [R3 + 100] | Base + offset | LOAD R1, [R3+100] |
| Scaled | [R3 + R4*4] | Base + scale*index | For array access |
The addressing mode affects decode complexity. RISC architectures keep it simple (only a few modes); CISC architectures (x86) support many modes, making decode elaborate.
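The table above can be expressed as a small operand resolver. This is a sketch: the mode names and the dict-based `regs` and `mem` are illustrative stand-ins for the register file and memory hardware.

```python
def resolve_operand(mode, spec, regs, mem):
    """Return the operand value for each addressing mode in the table."""
    if mode == "immediate":      # the operand IS the value
        return spec
    if mode == "register":       # value held in a register
        return regs[spec]
    if mode == "direct":         # spec is a memory address
        return mem[spec]
    if mode == "indirect":       # register holds the address
        return mem[regs[spec]]
    if mode == "indexed":        # base register + constant offset
        base, offset = spec
        return mem[regs[base] + offset]
    if mode == "scaled":         # base + index*scale (array access)
        base, index, scale = spec
        return mem[regs[base] + regs[index] * scale]
    raise ValueError(mode)

regs = {3: 1000, 4: 2}
mem = {1000: 7, 1100: 8, 1008: 9}
print(resolve_operand("immediate", 100, regs, mem))     # 100
print(resolve_operand("indirect", 3, regs, mem))        # 7  (mem[1000])
print(resolve_operand("indexed", (3, 100), regs, mem))  # 8  (mem[1100])
print(resolve_operand("scaled", (3, 4, 4), regs, mem))  # 9  (mem[1008])
```

Note how the number of memory accesses varies by mode; this is exactly the decode and interrupt-state complexity the next paragraph discusses.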
When the OS performs a context switch, it must save/restore not just register values but also any instruction-internal state. Complex addressing modes that require multiple memory accesses may be partially complete when an interrupt occurs. The architecture defines precisely when an instruction can be interrupted and what state must be preserved.
The fetch phase seems simple—just read the instruction from memory. In practice, substantial complexity exists:
The Fetch Process:
1. The PC value is placed on the address bus (via the memory address register).
2. Memory (normally the L1 instruction cache) returns the instruction bytes.
3. The instruction is latched into the instruction register (IR).
4. The PC is advanced to point at the next instruction.
The Role of the Instruction Cache (I-Cache)
Modern CPUs have dedicated instruction caches, separate from data caches. This separation (Modified Harvard architecture at cache level) allows instruction fetch and data access to happen simultaneously.
Typical I-cache characteristics: commonly 32-64 KB per core, 64-byte lines, 4-8-way set associativity, with access pipelined over a few cycles.
Instruction Fetch Timing (simplified 4-stage example):

```
Cycle │ Action                                │ Hardware State
══════╪═══════════════════════════════════════╪═══════════════════════════
  1   │ PC value sent to I-cache              │ PC = 0x1000
      │ Cache tag comparison begins           │
──────┼───────────────────────────────────────┼───────────────────────────
  2   │ Tag match/miss determined             │
      │ If hit: data routing begins           │ Cache line found!
──────┼───────────────────────────────────────┼───────────────────────────
  3   │ Instruction bytes extracted           │ Line: [0x1000-0x103F]
      │ PC auto-increment calculated          │ Extract bytes at 0x1000
──────┼───────────────────────────────────────┼───────────────────────────
  4   │ IR loaded with instruction            │ IR = 0x02123000
      │ PC updated to 0x1004                  │ PC = 0x1004
      │ Decode can begin                      │
══════╧═══════════════════════════════════════╧═══════════════════════════

On Cache Miss (L1 miss, L2 hit):
- Add 10-20 cycles waiting for L2
- I-cache line is filled
- Then proceed as above

On Cache Miss (L1, L2, L3 miss - DRAM fetch):
- Add 100-300 cycles!
- This is why cache behavior matters so much
```

Branch Prediction During Fetch
A critical complication: what happens when the fetched instruction is a branch?
If we wait until we know whether the branch is taken, we waste cycles. Instead, modern CPUs use branch prediction to guess the next fetch address before the branch resolves: a branch target buffer (BTB) remembers where previously seen branches went, and a direction predictor guesses taken versus not taken.
Branch prediction accuracy exceeds 95% on typical code, but mispredictions cost 10-20 cycles.
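A classic predictor building block is the 2-bit saturating counter, which needs two consecutive wrong outcomes to flip its prediction. A minimal sketch (one counter standing in for a whole prediction table):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 taken."""
    def __init__(self):
        self.state = 1  # start weakly not-taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturate at the ends: two wrong guesses needed to flip prediction
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
correct = 0
for taken in [True] * 9 + [False]:   # loop branch: taken 9 times, then exit
    if p.predict() == taken:
        correct += 1
    p.update(taken)
print(correct)  # 8 of 10: one miss warming up, one at the loop exit
```

The saturating behavior is what makes loop branches cheap: a single loop exit does not flip the predictor for the next run of the loop.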
The famous Spectre and Meltdown vulnerabilities exploit speculative execution. Even though mispredicted instructions are 'rolled back,' they leave traces in the cache. Attackers can measure cache timing to infer secrets. This is a direct consequence of aggressive fetch optimization interacting with security requirements—a lesson that OS developers must internalize.
The decode phase translates the raw instruction bits into control signals that orchestrate execution. This is where the CPU "understands" what to do.
Decode Responsibilities: identify the opcode; extract register specifiers and immediate values (sign-extending as needed); determine the addressing mode; and generate the control signals that steer the datapath during execute.
RISC vs CISC Decode Complexity
Fixed-length, regular RISC formats can be decoded in a single stage with simple hardware; variable-length CISC formats must first find instruction boundaries, making decode a multi-stage affair.
The x86 Decode Challenge
Modern x86 CPUs face a formidable decode challenge: hundreds of instruction variants, complex addressing modes, and variable length. The solution? Micro-operations (μops).
x86 CPUs internally decode complex CISC instructions into simpler RISC-like micro-operations:
```
ADD [memory], register  →  μop1: LOAD  temp, [memory]
                           μop2: ADD   temp, register
                           μop3: STORE [memory], temp
```
This decouples the complex ISA from execution—the back-end of the CPU is essentially a RISC machine executing μops.
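The cracking step can be pictured as a table-driven expansion. The tuple-based instruction representation and μop names below are purely illustrative; real decoders emit hardware-internal formats.

```python
def crack(instr):
    """Expand a memory-destination ADD into load/add/store μops (illustrative)."""
    op, dst, src = instr
    if op == "ADD" and dst.startswith("["):        # ADD [mem], reg
        return [("LOAD", "tmp", dst),              # read memory operand
                ("ADD", "tmp", src),               # do the arithmetic
                ("STORE", dst, "tmp")]             # write result back
    return [instr]                                 # register-only ops pass through

uops = crack(("ADD", "[0x2000]", "eax"))
print(len(uops))  # 3
```

A register-to-register add would pass through as a single μop, which is why simple x86 code often maps 1:1 onto μops.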
Pre-Decode and Micro-op Cache
To accelerate decode: pre-decode bits stored alongside the I-cache mark instruction boundaries, and a micro-op cache holds already-decoded μops so that hot loops bypass the decoders entirely.
Intel's μop cache ("Decoded Stream Buffer") can hold ~2000 μops, covering most hot loops. This effectively transforms x86 into RISC internally.
Apple's M-series chips (ARM) can have higher IPC than x86 partly because simplified decode frees up die area and power for more execution units. The x86 decode complexity is a tax paid on every instruction. However, x86's dense encoding means smaller code, better I-cache utilization—it's a tradeoff.
Execute is where the actual work happens. Depending on the instruction type, different resources are used:
Instruction Categories and Execution:
| Category | Examples | Execution Resources | Latency |
|---|---|---|---|
| ALU (Integer) | ADD, SUB, AND, OR | Integer ALU | 1 cycle |
| Shift/Rotate | SHL, ROL | Shifter unit | 1 cycle |
| Multiply | MUL, IMUL | Multiplier | 3-5 cycles |
| Divide | DIV, IDIV | Divider | 20-100 cycles |
| Load | LOAD, MOV from mem | Load port, Data cache | 4-5 cycles (cache hit) |
| Store | STORE, MOV to mem | Store port, Store buffer | 1 cycle (to buffer) |
| Branch | JMP, BEQ, CALL | Branch unit | 1 cycle (predicted) |
| FP Arithmetic | FADD, FMUL | FP unit | 3-6 cycles |
| Vector/SIMD | AVX, SSE | Vector unit | 4-6 cycles |
Execution for Different Instruction Types:
```
ARITHMETIC INSTRUCTION: ADD R1, R2, R3
─────────────────────────────────────────────────────────────────
1. Operand Read:    R2 value (42) → ALU input A
                    R3 value (10) → ALU input B
2. ALU Operation:   ALU computes 42 + 10 = 52
                    Zero flag     = 0 (result not zero)
                    Negative flag = 0 (result positive)
                    Carry flag    = 0 (no carry out)
                    Overflow flag = 0 (no signed overflow)
3. Write-back:      52 → R1

LOAD INSTRUCTION: LOAD R4, [R5 + 0x100]
─────────────────────────────────────────────────────────────────
1. Address Calc:    R5 value (0x2000) + 0x100 = 0x2100
2. D-Cache Check:   Look up address 0x2100 in data cache
3a. Cache Hit:      Data (say, 0xDEADBEEF) available in ~4 cycles
3b. Cache Miss:     Wait for L2 (10+ cycles) or DRAM (100+ cycles)
4. Write-back:      0xDEADBEEF → R4

BRANCH INSTRUCTION: BEQ target (Branch if Equal)
─────────────────────────────────────────────────────────────────
1. Condition Check: Zero flag == 1?
2a. Not Taken:      PC already points to next instruction (done)
2b. Taken:          PC := target address
3. Prediction:      Compare with what was predicted during fetch
4a. Correct:        Continue normally
4b. Misprediction:  Flush pipeline, restart from correct address
                    (10-20 cycle penalty)

STORE INSTRUCTION: STORE R6, [R7]
─────────────────────────────────────────────────────────────────
1. Address Calc:    Address = R7 value
2. Store Buffer:    Write {address, R6 value} to store buffer
3. Execute Done:    Instruction appears complete to the CPU
4. Background:      Store buffer eventually writes to cache/memory
                    (Store can complete before data reaches cache -
                     "posted write")
```

Write-Back and Commit
The final sub-phase of execute is writing results: the destination register is updated, condition flags are set, and stores are posted to the store buffer.
Instruction Commit and Exceptions
Modern out-of-order CPUs can have many instructions "in flight" simultaneously. Results are committed (made architecturally visible) only when an instruction is known to be on the correct execution path and all earlier instructions have themselves committed, in strict program order.
If an exception occurs (divide by zero, page fault, illegal instruction), instructions after the faulting one must be discarded. This is where the OS takes over—we'll discuss this more in the exceptions section.
Even though a store instruction 'completes' quickly to the store buffer, other CPUs may not see that store immediately. Memory ordering models (TSO, relaxed memory models) define when stores become visible. This is critical for multi-threaded programming and OS synchronization primitives.
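The store-buffer behavior described above can be sketched as a FIFO: the issuing core's own loads see buffered stores immediately (store-to-load forwarding), while memory, and thus other cores, see them only after the buffer drains. A toy model, not a real coherence implementation:

```python
from collections import deque

class StoreBuffer:
    """Toy store buffer: this core's loads see buffered stores at once;
    memory (and other cores) see them only after drain()."""
    def __init__(self, memory):
        self.memory = memory
        self.pending = deque()   # (address, value), oldest first

    def store(self, addr, value):
        self.pending.append((addr, value))   # "completes" immediately (posted)

    def load(self, addr):
        # Store-to-load forwarding: newest matching buffered store wins
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain(self):
        while self.pending:
            a, v = self.pending.popleft()
            self.memory[a] = v               # now globally visible

mem = {}
sb = StoreBuffer(mem)
sb.store(0x100, 42)
print(sb.load(0x100))     # 42: this core sees its own store
print(mem.get(0x100, 0))  # 0: memory not yet updated
sb.drain()
print(mem[0x100])         # 42: now visible to everyone
```

The window between `store()` and `drain()` is exactly where memory-ordering models differ, and why synchronization primitives need fences.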
If each instruction goes through F-D-E sequentially, and each phase takes a cycle, every instruction takes 3 cycles. But different phases use different hardware—why not overlap them?
The Pipelining Insight:
While instruction N is executing, instruction N+1 can decode, and instruction N+2 can fetch:
```
Cycle:    1    2    3    4    5    6    7
─────────────────────────────────────────
Inst 1:  [F]  [D]  [E]
Inst 2:       [F]  [D]  [E]
Inst 3:            [F]  [D]  [E]
Inst 4:                 [F]  [D]  [E]
─────────────────────────────────────────
                    ↑
          3 instructions in flight

After the initial fill, one instruction
completes per cycle!
```
Throughput vs Latency: each individual instruction still takes 3 cycles from fetch to completion (latency is unchanged), but once the pipeline is full, one instruction completes every cycle (throughput is tripled).
This is the magic of pipelining—the same hardware, used more efficiently.
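The benefit is easy to quantify: with k stages, n instructions, and no stalls, a pipeline needs k + (n − 1) cycles instead of k·n.

```python
def pipelined_cycles(n, stages, stalls=0):
    """Cycles for n instructions on an ideal pipeline: fill once, then 1/cycle."""
    return stages + (n - 1) + stalls

print(pipelined_cycles(4, 3))                  # 6: matches the diagram above
print(pipelined_cycles(1000, 3))               # 1002 cycles, vs 3000 unpipelined
print(3 * 1000 / pipelined_cycles(1000, 3))    # ≈3x speedup
```

As n grows, the speedup approaches the stage count, which is why deeper pipelines looked attractive until hazards and clock limits intervened.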
Modern Pipeline Depths
| Processor | Pipeline Stages | Notes |
|---|---|---|
| Classic MIPS | 5 | Textbook example |
| Intel Pentium | 5 | Simple |
| Intel Pentium 4 | 31 | Deep pipeline for high clocks |
| Intel Skylake | 14-19 | Modern balanced design |
| Apple M1 | ~12-14 | Efficiency-focused |
Pipeline Hazards:
Pipelining isn't free. Several hazards can stall the pipeline:
1. Data Hazards (RAW - Read After Write)
```
ADD R1, R2, R3   ; R1 written in cycle 5
SUB R4, R1, R5   ; R1 needed in cycle 3 - not ready!
```
Solution: Forwarding (bypass result directly) or stall.
2. Control Hazards (Branches)
```
BEQ target       ; Branch outcome known in cycle 3
ADD R1, R2, R3   ; Already fetched - is it correct?
```
Solution: Branch prediction (guess) + flush on misprediction.
3. Structural Hazards (Resource Conflicts)
```
LOAD R1, [mem]   ; Uses memory port
STORE R2, [mem]  ; Also needs memory port
```
Solution: Duplicate resources or stall.
Every pipeline stall represents wasted cycles. On a deep pipeline, a 10-20 cycle misprediction penalty multiplied across frequent branches can consume a double-digit percentage of potential throughput. This is why modern CPUs invest enormous transistor budgets in branch predictors, out-of-order execution, and other hazard mitigation.
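The cost can be estimated with a simple CPI model. The branch frequency, accuracy, and penalty below are assumed typical values, not measurements:

```python
def cpi_with_mispredicts(base_cpi, branch_freq, accuracy, penalty):
    """Average cycles per instruction including misprediction penalties."""
    return base_cpi + branch_freq * (1 - accuracy) * penalty

# Assume: 1 branch per 5 instructions, 95% accuracy, 15-cycle penalty
cpi = cpi_with_mispredicts(1.0, 0.2, 0.95, 15)
print(round(cpi, 4))  # 1.15 -> roughly 13% of cycles lost to mispredictions
```

Pushing accuracy from 95% to 99% under the same assumptions cuts the penalty term by a factor of five, which is why predictor improvements keep paying off.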
The instruction cycle isn't always allowed to proceed uninterrupted. Interrupts and exceptions can forcibly redirect the CPU:
Types of Interruptions:
Hardware Interrupts: asynchronous external signals from devices (timer, keyboard, disk, network), which can arrive at any point in the instruction stream
Exceptions (Traps): synchronous events caused by instruction execution itself (page fault, divide by zero, illegal instruction, system call)
Faults vs Traps: after a fault (e.g., a page fault), the saved PC points at the faulting instruction so it can be re-executed once the condition is fixed; after a trap (e.g., a system call or breakpoint), the saved PC points at the next instruction so execution resumes past it.
The Interrupt Response Sequence:
Interrupt Response (Simplified):

```
1. INTERRUPT RECOGNITION
   ───────────────────────────────────────────────
   - CPU checks for pending interrupts between instructions
   - If interrupt signal is asserted AND interrupts are enabled:
     → Proceed to interrupt handling
   - If interrupts disabled: remain pending until enabled

2. STATE SAVING (Hardware performs automatically)
   ───────────────────────────────────────────────
   - Push PC onto stack (return address)
   - Push processor status register (flags, interrupt mask)
   - Switch to kernel/supervisor mode if not already
   - Disable further interrupts (prevent nesting, typically)

3. HANDLER DISPATCH
   ───────────────────────────────────────────────
   - Determine interrupt type/number
   - Look up handler address in Interrupt Vector Table (IVT)
     IVT[interrupt_number] → handler_entry_point
   - Load PC with handler address
   - Begin fetching handler instructions

4. HANDLER EXECUTION (OS code runs)
   ───────────────────────────────────────────────
   - Save additional registers (handler's responsibility)
   - Identify the interrupt source
   - Service the interrupt (process the event)
   - Restore saved registers

5. RETURN FROM INTERRUPT
   ───────────────────────────────────────────────
   - Execute IRET/RTI instruction
   - Hardware restores status register from stack
   - Hardware restores PC from stack
   - Resume interrupted instruction stream
```

Precise vs Imprecise Exceptions
In a pipelined, out-of-order CPU, defining "where" an exception occurred is complex:
Precise Exception: When an exception is taken, all earlier instructions have completed, none later have visible effects. The saved PC points exactly at the faulting instruction. This is essential for OS correctness (page fault handlers need to know which load failed).
Imprecise Exception: some earlier instructions may not have completed, and some later ones may have. The state is ambiguous, which makes recovery by the OS difficult or impossible.
Modern CPUs guarantee precise exceptions despite out-of-order execution by using a Reorder Buffer (ROB) that commits instructions in program order.
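The ROB's role can be sketched as follows: instructions may finish executing in any order, but results commit strictly in program order, and a faulting instruction stops commit so everything younger can be discarded. A toy model:

```python
class ReorderBuffer:
    """Toy ROB: execute out of order, commit strictly in program order."""
    def __init__(self, n):
        self.done = [False] * n      # has slot i finished executing?
        self.faulted = [False] * n
        self.head = 0                # oldest uncommitted instruction

    def finish(self, i, faulted=False):
        self.done[i] = True
        self.faulted[i] = faulted

    def commit(self):
        """Commit finished instructions from the head; stop at a fault."""
        committed = []
        while self.head < len(self.done) and self.done[self.head]:
            if self.faulted[self.head]:
                # Precise exception: everything younger is discarded
                return committed, self.head
            committed.append(self.head)
            self.head += 1
        return committed, None

rob = ReorderBuffer(4)
for i in (2, 0, 3, 1):               # instructions complete out of order
    rob.finish(i, faulted=(i == 1))  # instruction 1 takes a page fault
committed, fault_at = rob.commit()
print(committed, fault_at)  # [0] 1 -> saved PC points at instruction 1
```

Even though instructions 2 and 3 finished before instruction 1, they never become architecturally visible, which is exactly what a precise page-fault handler requires.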
Every interrupt causes a context switch overhead: saving register state, invalidating some pipeline state, and jumping to the handler. A timer interrupt every 1ms on a 3 GHz CPU means interrupting every 3 million instructions. If interrupt handling took 1000 cycles, that's 0.03% overhead—acceptable. But interrupts arriving every microsecond (e.g., high-speed networking) must be handled with extreme care.
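The overhead arithmetic in that paragraph is easy to reproduce:

```python
def interrupt_overhead(clock_hz, interval_s, handler_cycles):
    """Fraction of CPU cycles spent in interrupt handling."""
    cycles_between = clock_hz * interval_s
    return handler_cycles / cycles_between

# Timer every 1 ms on a 3 GHz CPU, 1000-cycle handler:
print(f"{interrupt_overhead(3e9, 1e-3, 1000):.2%}")  # 0.03%
# Interrupts every microsecond (e.g., high-speed networking):
print(f"{interrupt_overhead(3e9, 1e-6, 1000):.2%}")  # 33.33%
```

The three-orders-of-magnitude jump is why high-rate devices use techniques like interrupt coalescing and polling instead of taking an interrupt per event.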
We've traced through the instruction cycle in depth—the core mechanism that brings stored programs to life. Let's consolidate: the CPU endlessly fetches, decodes, and executes; the ISA's encoding determines decode complexity; pipelining overlaps the phases to raise throughput; and interrupts and exceptions give the OS control at well-defined points.
What's Next:
We've now covered most of the von Neumann architecture: stored programs, CPU-memory-I/O organization, bus architecture, and the instruction cycle. But no discussion of von Neumann architecture is complete without addressing its fundamental limitation—the von Neumann Bottleneck. The final page in this module explores this critical constraint and the techniques modern systems use to mitigate it.
You now understand the instruction cycle in detail—the fundamental pulse that animates every computer. From instruction encoding through fetch, decode, execute, and interrupt handling, you can trace the journey of a program through the CPU. This knowledge is essential for understanding OS operations like context switching, interrupt handling, and exception management.