Every line of code you write exists in a form that CPUs cannot directly understand. Processors speak the language of binary instructions—sequences of zeros and ones representing operations on registers and memory. The compilation process is the elaborate transformation that bridges human-readable source code and machine-executable binaries.
This transformation is not a simple translation. It involves multiple sophisticated stages, each performing specific analyses and transformations that progressively convert high-level abstractions into low-level machine code. Understanding this process is fundamental to operating systems because it directly determines how programs are organized in memory, how they reference code and data, and how the operating system ultimately loads and executes them.
By the end of this page, you will understand the complete compilation pipeline—preprocessing, compilation proper, assembly, and linking. You'll grasp how each stage contributes to the final executable, why each stage exists, and how this knowledge impacts memory management and linking.
Before examining each stage in detail, let's establish the complete picture. When you invoke a compiler like gcc or clang on a C source file, it doesn't directly produce an executable. Instead, the compilation system executes four distinct phases:
This modular design is intentional. Each phase has a well-defined input and output, enabling tools to be developed independently and allowing developers to inspect intermediate results for debugging or optimization.
| Phase | Input | Output | Tool | Purpose |
|---|---|---|---|---|
| Preprocessing | Source code (.c, .cpp) | Preprocessed source (.i) | cpp (C Preprocessor) | Macro expansion, includes, conditional compilation |
| Compilation | Preprocessed source (.i) | Assembly (.s) | cc1 (Compiler proper) | Syntax analysis, optimization, code generation |
| Assembly | Assembly (.s) | Object code (.o) | as (Assembler) | Convert mnemonics to machine code |
| Linking | Object files (.o) | Executable | ld (Linker) | Resolve symbols, combine into executable |
This pipeline enables separate compilation—each source file can be compiled independently into an object file. Only at the linking stage are all pieces combined. This dramatically improves build times for large projects, as unchanged files don't need recompilation.
The preprocessing stage operates purely at the textual level. The preprocessor (historically cpp for C) knows nothing about the language's syntax or semantics—it performs pattern-matching and text substitution based on preprocessor directives.
Key operations performed during preprocessing:

- File inclusion — #include directives are replaced by the full text of the named file
- Macro expansion — object-like and function-like macros are substituted as text
- Conditional compilation — #ifdef/#ifndef/#if decide which code survives
- Comment removal and line markers for later diagnostics
```c
// example.c - Before preprocessing
#include <stdio.h>

#define MAX_BUFFER 1024
#define SQUARE(x) ((x) * (x))

#ifdef DEBUG
    #define LOG(msg) printf("[DEBUG] %s", msg)
#else
    #define LOG(msg) ((void)0)
#endif

int main() {
    char buffer[MAX_BUFFER];
    int result = SQUARE(5);
    LOG("Computation complete");
    return 0;
}
```

After preprocessing (use the -E flag to see the output), the file transforms dramatically. The #include <stdio.h> expands to thousands of lines from the stdio header and its transitive includes, and all macros are expanded inline:
```c
// ... thousands of lines from stdio.h ...
// ... type definitions, function declarations ...

int main() {
    char buffer[1024];         // MAX_BUFFER replaced
    int result = ((5) * (5));  // SQUARE(5) expanded
    ((void)0);                 // LOG expanded (DEBUG not defined)
    return 0;
}
```

Use gcc -E source.c -o source.i to see the preprocessed file—invaluable for debugging macro-related issues. The output can be enormous due to header expansion: a simple 10-line program might produce 50,000+ lines after preprocessing.
Header guards and pragma once:
Because the preprocessor performs literal text inclusion, the same header might be included multiple times (through different paths), causing redefinition errors. Header guards prevent this:
```c
#ifndef MYHEADER_H   // If not already defined
#define MYHEADER_H   // Define the guard macro

// Header contents here
typedef struct {
    int x;
    int y;
} Point;

void process_point(Point* p);

#endif // MYHEADER_H
```

Modern compilers also support #pragma once as a simpler alternative, though it's not part of the C standard. The preprocessor's textual nature has important implications:
- No type checking — the preprocessor runs before the compiler sees any types
- Side effects in arguments are dangerous — SQUARE(x++) would evaluate x++ twice
- Unparenthesized macros change meaning — operator precedence can rebind parts of the expanded expression

Understanding these limitations is crucial for writing robust code and debugging preprocessor-related issues.
The compilation proper stage is where the real linguistic transformation occurs. The compiler takes preprocessed source code and produces assembly language—a human-readable representation of machine instructions. This stage itself involves multiple sophisticated sub-phases:
Frontend phases:
1. Lexical Analysis — Breaking the character stream into tokens
2. Parsing — Building the Abstract Syntax Tree (AST)
3. Semantic Analysis — Type checking and meaning validation

Middle-end phases:
4. Intermediate Representation (IR) — Platform-independent code form
5. Optimization — Improving code efficiency

Backend phases:
6. Code Generation — Producing assembly for target architecture
7. Target-specific Optimization — Architecture-specific improvements
The lexer (or scanner) reads the character stream and groups characters into tokens—the fundamental units of the language. Each token has a type and value:
```c
// Source code
int count = 42;

// Token stream produced by lexer:
// TOKEN_KEYWORD_INT     : "int"
// TOKEN_IDENTIFIER      : "count"
// TOKEN_ASSIGN          : "="
// TOKEN_INTEGER_LITERAL : "42"
// TOKEN_SEMICOLON       : ";"
```

The parser consumes the token stream and constructs an Abstract Syntax Tree (AST)—a hierarchical representation of the program's grammatical structure. The AST captures the essential structure while discarding syntactic noise like parentheses and semicolons.
```c
// Expression: a + b * c

// AST structure (+ has lower precedence than *)
//      (+)
//     /   \
//   (a)   (*)
//        /   \
//      (b)   (c)

// This correctly represents that b*c is computed first
```

Semantic analysis adds meaning to the syntax tree. This phase performs:

- Type checking — verifying operations are applied to compatible types
- Symbol resolution — connecting each identifier to its declaration and scope
- Validity checks — implicit conversions, lvalue rules, function-call arity
After semantic analysis, the compiler has a complete understanding of the program's meaning and can detect errors such as type mismatches, references to undeclared identifiers, and calls with the wrong number or types of arguments.
Errors detected in the frontend (lexer, parser, semantic analyzer) are compile-time errors that prevent code generation. These include syntax errors, type mismatches, and uses of undeclared identifiers (distinct from the linker's "undefined reference" errors, which arise later). Understanding this distinction helps in debugging: if the compiler rejects your code, the issue is in these early phases.
Modern compilers don't translate directly from AST to assembly. Instead, they use an Intermediate Representation (IR)—a platform-independent code format that serves as the target for optimization. Popular IRs include:

- LLVM IR — used by Clang, Rust, Swift, and the rest of the LLVM ecosystem
- GIMPLE and RTL — GCC's high-level and low-level intermediate forms
The IR provides a uniform representation that allows optimizations to be written once and applied regardless of source language or target architecture.
```llvm
; Original C code:
;   int square(int x) { return x * x; }

define i32 @square(i32 %x) {
entry:
  %mul = mul nsw i32 %x, %x
  ret i32 %mul
}

; LLVM IR uses typed registers (%x, %mul)
; Instructions are in SSA (Static Single Assignment) form
; Each register is assigned exactly once
```

The optimizer transforms the IR to produce faster and/or smaller code. Optimizations operate at different levels:
Local optimizations (within a basic block):

- Constant folding — evaluating 3 + 4 at compile time to 7
- Strength reduction — replacing x * 2 with x << 1

Global optimizations (across basic blocks):

- Common subexpression elimination — computing a + b once if used multiple times
- Dead code elimination — removing code whose result is never used
- Loop-invariant code motion — hoisting computations out of loops

Interprocedural optimizations (across functions):

- Inlining — replacing a call with the body of the called function
- Interprocedural constant propagation — pushing known argument values into callees
Before optimization:

```c
int example(int n) {
    int x = 3 + 4;          // Constant expression
    int y = x * 2;          // Multiply by power of 2
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int base = 10;      // Loop invariant
        sum += base * i;
    }
    int unused = 42;        // Dead code
    return sum + y;
}
```

After optimization (conceptually):

```c
int example(int n) {
    // x = 7 (constant folded)
    // y = 14 (7 * 2, constant folded)
    int sum = 0;
    // base = 10 hoisted outside loop
    for (int i = 0; i < n; i++) {
        sum += 10 * i;  // Or: sum += (i << 1) + (i << 3)
    }
    // unused eliminated
    return sum + 14;
}
```

Compilers provide optimization flags: -O0 (no optimization, fast compiles), -O1 (basic optimizations), -O2 (most optimizations), -O3 (aggressive, including vectorization), -Os (optimize for size). Higher levels improve performance but increase compile time and may make debugging harder.
The code generator transforms optimized IR into assembly language for the target architecture. This phase must handle:

- Instruction selection — mapping IR operations onto the target's instruction set
- Register allocation — assigning values to physical registers
- Instruction scheduling — ordering instructions to suit the target's pipeline
- Calling conventions — placing arguments and return values as the ABI requires
The output is assembly language—a human-readable representation of machine instructions.
```asm
; C source: int add(int a, int b) { return a + b; }
; Compiled with gcc -S -O2 for x86-64

        .text
        .globl  add
        .type   add, @function
add:
        .cfi_startproc
        leal    (%rdi,%rsi), %eax   ; Compute a + b using LEA
        ret                         ; Return (result in %eax)
        .cfi_endproc
        .size   add, .-add

; Note: Arguments passed in %edi (a) and %esi (b) per System V ABI
; Result returned in %eax
; LEA (Load Effective Address) performs addition without affecting flags
```

One of the most challenging aspects of code generation is register allocation. The IR uses unlimited virtual registers, but physical CPUs have limited registers (x86-64 has 16 general-purpose registers). The allocator must:

- Map the unlimited virtual registers onto the small physical set
- Track which values are live at each program point (liveness analysis)
- Spill values to memory when more values are live than registers available
Poor register allocation forces values to be stored in memory ("spilled"), dramatically slowing execution, since a round trip to main memory can be on the order of 100x slower than a register access.
| Register | 64-bit | 32-bit | 16-bit | 8-bit low | Purpose/Convention |
|---|---|---|---|---|---|
| RAX | %rax | %eax | %ax | %al | Return value, accumulator |
| RBX | %rbx | %ebx | %bx | %bl | Callee-saved, base pointer |
| RCX | %rcx | %ecx | %cx | %cl | 4th argument, counter |
| RDX | %rdx | %edx | %dx | %dl | 3rd argument, I/O |
| RSI | %rsi | %esi | %si | %sil | 2nd argument, source index |
| RDI | %rdi | %edi | %di | %dil | 1st argument, dest index |
| RSP | %rsp | %esp | %sp | %spl | Stack pointer (reserved) |
| RBP | %rbp | %ebp | %bp | %bpl | Base pointer, callee-saved |
The Application Binary Interface (ABI) defines calling conventions, register usage, and stack layout. On Linux x86-64, the System V ABI specifies that the first six integer arguments go in RDI, RSI, RDX, RCX, R8, R9. Understanding the ABI is essential for debugging, writing assembly, or interfacing with libraries.
The assembler (as on Unix systems) translates assembly language into machine code, producing an object file. This transformation is relatively straightforward—mostly a one-to-one mapping of mnemonics to binary instruction encodings—but involves several important operations: encoding each instruction into its binary form, building the symbol table, and emitting relocation entries for addresses that cannot be resolved yet.
```asm
; Assembly instruction
mov $42, %eax        ; Move immediate 42 into EAX

; Binary encoding (little-endian x86-64):
;   B8 2A 00 00 00
;   ^^ ^^^^^^^^^^^
;   |  42 as a 32-bit little-endian immediate (0x0000002A)
;   Opcode for "mov imm32, eax"

; The assembler performs this translation for every instruction
; and produces the final object file with:
;   - .text section: machine code
;   - .data section: initialized data
;   - .bss section:  uninitialized data
;   - .symtab:       symbol table
;   - .rela.*:       relocation entries
```

The assembler creates a symbol table recording every label, function, and global variable defined or referenced in the file. Each entry contains:

- Name — the symbol's identifier
- Value — its address (or offset within its section)
- Size and type — function vs. data object, and how many bytes it occupies
- Binding — local to this file or globally visible
- Section — which section (.text, .data, .bss) holds its definition
Symbols can be defined (have an address in this file) or undefined (reference external symbols to be resolved during linking).
```
$ nm example.o
0000000000000000 T main          ; Defined in .text, global
                 U printf        ; Undefined, external reference
0000000000000000 D global_var    ; Defined in .data, global
0000000000000004 d local_static  ; Defined in .data, local
0000000000000000 B uninitialized ; Defined in .bss, global

# Symbol types:
# T/t = text (code), D/d = data, B/b = BSS
# Uppercase = global, lowercase = local
# U = undefined (external reference)
```

When the assembler encounters a reference to a symbol whose address isn't yet known (such as an external function, or a global in the same file whose final location isn't settled), it generates a relocation entry. This entry tells the linker:

- Where — the offset of the bytes that need patching
- What — which symbol's final address to use
- How — the relocation type, i.e. the formula for computing the value (absolute, PC-relative, PLT-relative, ...)
Relocation allows object files to be compiled independently without knowing the final memory layout.
Object files (.o) contain machine code but cannot be executed directly. They have unresolved symbols (external references) and contain addresses that assume the code starts at address 0. The linker must combine object files, resolve symbols, and perform relocations to create an executable.
Let's trace a simple program through the entire compilation process to solidify understanding. We'll use GCC with flags to observe each stage.
```c
#include <stdio.h>

#define GREETING "Hello, World!\n"

int main() {
    printf("%s", GREETING);
    return 0;
}
```
```shell
# Step 1: Preprocessing only
gcc -E hello.c -o hello.i
# hello.i contains expanded stdio.h and the GREETING macro

# Step 2: Compilation to assembly
gcc -S hello.i -o hello.s
# hello.s contains x86-64 assembly

# Step 3: Assembly to object file
gcc -c hello.s -o hello.o
# hello.o is an ELF object file

# Step 4: Linking to executable
gcc hello.o -o hello
# hello is the final executable

# Or all at once:
gcc hello.c -o hello
```
```asm
        .file   "hello.c"
        .section .rodata
.LC0:
        .string "Hello, World!"      ; GREETING expanded (puts adds the newline)
        .text
        .globl  main
        .type   main, @function
main:
        pushq   %rbp                 ; Save frame pointer
        movq    %rsp, %rbp           ; Set up stack frame
        leaq    .LC0(%rip), %rdi     ; Load address of string
        call    puts@PLT             ; Call puts (printf optimized away)
        movl    $0, %eax             ; Return 0
        popq    %rbp                 ; Restore frame pointer
        ret
        .size   main, .-main
```

Notice that GCC replaced the printf call with puts. This is a common optimization: when a format string reduces to a plain string ending in a newline, puts is simpler and faster, so the compiler recognizes the pattern and substitutes the more efficient function.
We can use tools to inspect the generated object file:
```
$ objdump -d hello.o

hello.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <main>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 8d 3d 00 00 00 00    lea    0x0(%rip),%rdi   # Address TBD
   b:   e8 00 00 00 00          call   10 <main+0x10>   # Needs relocation
  10:   b8 00 00 00 00          mov    $0x0,%eax
  15:   5d                      pop    %rbp
  16:   c3                      ret

$ readelf -r hello.o

Relocation section '.rela.text' at offset 0x1c0 contains 2 entries:
  Offset          Type            Symbol
000000000007  R_X86_64_PC32    .rodata-0x4   ; Address of string
00000000000c  R_X86_64_PLT32   puts-0x4      ; Call to puts
```

The compilation process is a sophisticated pipeline that transforms human-readable source code into machine-executable object files. Each stage has a precise purpose:

- Preprocessing expands macros, includes, and conditionals — pure text transformation
- Compilation translates preprocessed source into optimized assembly
- Assembly converts assembly mnemonics into relocatable machine code
- Linking resolves symbols and combines object files into an executable
What's next:
Now that we understand how individual source files are compiled into object files, the next page explores object files in detail—their structure, formats (ELF, COFF, Mach-O), sections, and the metadata they contain that enables linking and loading.
You now understand the complete compilation process—from preprocessing through assembly. This knowledge is foundational for understanding how the operating system manages programs in memory and how linking resolves the relationships between separately-compiled modules.