If the page table is a dictionary mapping virtual to physical addresses, then the Page Table Entry (PTE) is the individual entry in that dictionary. Every translation, every protection check, every decision about memory caching—all of it comes down to the bits packed into this small but powerful data structure.
A PTE is typically just 4 or 8 bytes, yet these few bytes control whether a memory access succeeds or faults, whether data is cached or not, whether a page can be written or only read, and whether user-mode code can access it at all. Understanding PTEs is understanding the interface between hardware and the operating system's memory management.
By the end of this page, you will understand every field in a page table entry, how hardware interprets each bit, how the operating system manipulates these entries to implement memory management, and the architectural variations across different CPU families.
At its core, a Page Table Entry answers two fundamental questions:

- **Where is the page?** The Physical Frame Number identifies which physical frame backs this virtual page.
- **What is allowed?** Protection bits define which accesses (read, write, execute, user vs. kernel) are permitted.
Beyond these, the PTE also provides:

- Status tracking: the Accessed and Dirty bits record whether the page has been read or written.
- Caching control: the PWT, PCD, and PAT bits select a memory type for the page.
- OS metadata: "available" bits that hardware ignores, free for the operating system's own bookkeeping.
The PTE as a Packed Bit Field:
CPU architects face a fundamental constraint: PTEs should be small (to minimize memory overhead) but comprehensive (to support all necessary features). The solution is aggressive bit packing—every bit serves a specific hardware or software purpose.
| Architecture | PTE Size | Address Bits Used | Key Features |
|---|---|---|---|
| x86 (32-bit) | 4 bytes | 20 bits (PFN) | PAE mode available for 36-bit physical |
| x86-64 | 8 bytes | 40+ bits (PFN) | 52-bit physical max, NX bit |
| ARM (AArch64) | 8 bytes | 48 bits addressable | Multiple page sizes, hierarchical attributes |
| RISC-V Sv39 | 8 bytes | 39-bit virtual | Clean, orthogonal design |
| RISC-V Sv48 | 8 bytes | 48-bit virtual | Extended address space |
The Physical Frame Number (PFN):
The PFN is the core translation data—it tells the MMU which physical frame contains the data for this virtual page. The number of bits required depends on:

- The maximum physical address the architecture supports.
- The page size, which determines how many low-order address bits are consumed by the page offset.
For a 4KB page size:

- The low 12 bits of any physical address are the page offset (2^12 = 4096).
- The PFN is simply the physical address shifted right by 12.
- Supporting a 52-bit physical address space therefore requires 52 - 12 = 40 PFN bits.
Modern 64-bit PTEs allocate 40+ bits for the PFN, supporting terabytes of physical memory. The remaining bits are used for flags and metadata.
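To make the arithmetic concrete, here is a minimal sketch of splitting and recombining addresses. The constants assume 4KB pages and the 40-bit PFN field of the x86-64 layout described next; the PTE value and offset are hypothetical:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                       /* 4KB pages: 12 offset bits */
#define PFN_MASK   0x000FFFFFFFFFF000ULL    /* bits 12-51 of an x86-64 PTE */

int main(void) {
    uint64_t pte = 0x12345A067ULL;          /* hypothetical PTE value */

    uint64_t pfn    = (pte & PFN_MASK) >> PAGE_SHIFT; /* physical frame number */
    uint64_t offset = 0x2A4;                /* low 12 bits of the virtual address */
    uint64_t phys   = (pfn << PAGE_SHIFT) | offset;   /* final physical address */

    printf("PFN = 0x%llx, physical address = 0x%llx\n",
           (unsigned long long)pfn, (unsigned long long)phys);
    return 0;
}
```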
Let's examine the x86-64 page table entry in complete detail. This is the most common architecture in servers, desktops, and laptops, making its PTE format essential knowledge for systems programmers.
x86-64 PTE Layout (for 4KB pages):
| Bit(s) | Name | Description |
|---|---|---|
| 0 | P (Present) | Page is present in physical memory |
| 1 | R/W | Read/Write: 0=read-only, 1=read-write |
| 2 | U/S | User/Supervisor: 0=kernel-only, 1=user-accessible |
| 3 | PWT | Page Write-Through (cache policy) |
| 4 | PCD | Page Cache Disable |
| 5 | A (Accessed) | Page has been accessed (read or write) |
| 6 | D (Dirty) | Page has been written |
| 7 | PAT/PS | Page Attribute Table index / Page Size |
| 8 | G (Global) | Global page (not flushed on CR3 change) |
| 9-11 | AVL | Available for OS use |
| 12-51 | PFN | Physical Frame Number (40 bits) |
| 52-58 | Available | Reserved / Available for OS |
| 59-62 | PK[0:3] | Protection Key (4 bits) |
| 63 | NX/XD | No Execute / Execute Disable |

Memory Layout:

```
┌────┬───────────────────┬────────────────────────┬────────────────┐
│ NX │ Avail / Prot Key  │ PFN (40 bits)          │ Control Bits   │
│ 63 │ 62 ──────────── 52│ 51 ────────────────── 12│ 11 ────────── 0│
└────┴───────────────────┴────────────────────────┴────────────────┘
```

Bit-by-Bit Analysis:
Bit 0 - Present (P): The most critical bit. When P=0, the MMU will trigger a page fault on any access. The remaining 63 bits can store anything the OS wants—commonly the swap location or a marker indicating the page was never allocated.
Bit 1 - Read/Write (R/W): Controls write permission. When R/W=0, any write attempt triggers a protection fault. This enables copy-on-write, read-only data sections, and code integrity protection.
Bit 2 - User/Supervisor (U/S): Controls privilege level access. When U/S=0, only kernel-mode (ring 0) code can access the page. User-mode (ring 3) accesses trigger a fault. This is fundamental to kernel memory protection.
Bit 3 - Page Write-Through (PWT): Cache write policy. When PWT=1, writes go through to memory immediately. Used for memory-mapped I/O where device registers must see writes immediately.
Bit 4 - Page Cache Disable (PCD): Disables caching entirely. Essential for I/O memory where caching could return stale device data. Also used for memory-mapped files where coherency with disk matters.
Modern x86 uses the Page Attribute Table (PAT) in combination with PWT and PCD to provide 8 different memory types: Write-Back (WB), Write-Through (WT), Write-Combining (WC), Uncacheable (UC), and variations. This is crucial for graphics memory, NVRAM, and high-performance I/O.
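To see how the three bits combine: the PAT index is simply PAT<<2 | PCD<<1 | PWT, which selects one of eight entries in the IA32_PAT MSR. A minimal sketch; the memory-type table below shows the processor's power-on default programming (operating systems typically reprogram some entries, e.g., to get Write-Combining):

```c
#include <stdint.h>

/* x86-64 PTE cache-control bit positions (4KB pages) */
#define PTE_PWT (1ULL << 3)
#define PTE_PCD (1ULL << 4)
#define PTE_PAT (1ULL << 7)

/* Power-on default IA32_PAT programming: index -> memory type */
static const char *pat_default[8] = {
    "WB",   /* 0: Write-Back                */
    "WT",   /* 1: Write-Through             */
    "UC-",  /* 2: Uncacheable, overridable  */
    "UC",   /* 3: Uncacheable               */
    "WB", "WT", "UC-", "UC"  /* 4-7: mirror 0-3 by default */
};

/* Compute which PAT entry a given PTE selects */
static unsigned pat_index(uint64_t pte)
{
    return ((pte & PTE_PAT) ? 4 : 0) |
           ((pte & PTE_PCD) ? 2 : 0) |
           ((pte & PTE_PWT) ? 1 : 0);
}
```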
The Accessed (A) and Dirty (D) bits are remarkable because they're set automatically by hardware, enabling the operating system to track memory usage patterns without software overhead on every access.
Accessed Bit (A):

- Set to 1 by hardware whenever the page is read or written.
- Never cleared by hardware; the OS clears it periodically to sample which pages are in active use.
- Drives page replacement policies such as the clock algorithm shown below.
Dirty Bit (D):

- Set to 1 by hardware on the first write to the page.
- Tells the OS whether an evicted page must be written back to swap or can simply be discarded.
- Cleared by the OS once the page has been written back (cleaned).
```python
# Simplified Clock Algorithm using the Accessed bit
def clock_page_replacement():
    """
    Second-chance page replacement using the hardware-set A bit.
    The OS never needs to trap individual memory accesses --
    hardware sets the A bit automatically.
    """
    global clock_hand                  # persists across calls; points to the candidate
    while True:
        pte = page_table[clock_hand]
        if not pte.accessed:
            # Not accessed since the last sweep -- a good victim
            victim = clock_hand
            clock_hand = (clock_hand + 1) % num_frames
            return victim
        else:
            # Was accessed -- give it another chance
            pte.accessed = False       # clear it (an OS action)
            clock_hand = (clock_hand + 1) % num_frames
            # Continue searching

def evict_page(frame_number):
    """Evict a page, writing it to disk only if it is dirty."""
    global disk_writes
    pte = find_pte_for_frame(frame_number)
    if pte.dirty:
        # Page was modified -- must write it back
        swap_out(frame_number, pte.swap_location)
        disk_writes += 1
    else:
        # Page is clean -- simply discard it; no disk I/O needed!
        pass
    pte.present = False
    invalidate_tlb_entry(pte.vpn)
```

Why Hardware Support Matters:
Consider the alternative: without hardware-managed A and D bits, the OS would need to:

- Mark every page not-present (or read-only) so each access traps into the kernel,
- Record the access or write in software inside the fault handler, and
- Restore the mapping and resume the faulting instruction.
With billions of memory accesses per second, this software overhead would be catastrophic. Hardware A/D bits reduce this to:

- A single hardware-performed update to the PTE on the first access (A) or first write (D), with no trap and no software involvement at all.
The dirty bit is a huge performance win during page eviction. If a page was only read (not written), it doesn't need to be written back to disk—the swap copy is still valid. For read-mostly workloads, this can eliminate 80-90% of disk writes during memory pressure.
The protection bits (R/W, U/S, NX) work together to implement memory protection policies. Understanding their combinations is essential for security-conscious systems programming.
Protection Matrix:
| U/S | R/W | NX | Effective Protection | Typical Use |
|---|---|---|---|---|
| 0 | 0 | 0 | Kernel read + execute | Kernel code (rare without NX) |
| 0 | 0 | 1 | Kernel read-only | Kernel rodata, initrd |
| 0 | 1 | 0 | Kernel read-write + execute | Kernel writable code (dangerous) |
| 0 | 1 | 1 | Kernel read-write | Kernel heap, stack, data |
| 1 | 0 | 0 | User read + execute | User code (.text section) |
| 1 | 0 | 1 | User read-only | User rodata, shared library mappings |
| 1 | 1 | 0 | User read-write + execute | JIT code, trampolines |
| 1 | 1 | 1 | User read-write | User heap, stack, mmap regions |
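The matrix above can be expressed directly as the check the MMU conceptually performs on each access. A minimal sketch; the `is_user`, `is_write`, and `is_fetch` flags describing the access are hypothetical names for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_P  (1ULL << 0)
#define PTE_RW (1ULL << 1)
#define PTE_US (1ULL << 2)
#define PTE_NX (1ULL << 63)

/* Conceptual access check: returns true if the access is allowed.
   Real hardware also honors CR0.WP, SMEP/SMAP, protection keys, etc. */
static bool pte_allows(uint64_t pte, bool is_user, bool is_write, bool is_fetch)
{
    if (!(pte & PTE_P))              return false;  /* not present: page fault    */
    if (is_user && !(pte & PTE_US))  return false;  /* user touching kernel page  */
    if (is_write && !(pte & PTE_RW)) return false;  /* write to read-only page    */
    if (is_fetch && (pte & PTE_NX))  return false;  /* execute from NX page       */
    return true;
}
```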
The NX (No-Execute) Bit:
The NX bit (called XD on Intel, NX on AMD) is a critical security feature added to x86-64. It allows marking pages as non-executable, preventing code injection attacks.
Without NX: An attacker could:

1. Find a buffer overflow in the target program.
2. Inject machine code (shellcode) into the overflowed stack or heap buffer.
3. Overwrite a return address to point at the injected code.
4. Let the function return, so the CPU executes the attacker's code.
With NX: Stack and heap pages are marked NX=1. Any attempt to execute code from these regions triggers a hardware fault. The attack fails at step 4.
Modern Security Configurations:
The W^X (Write XOR Execute) security policy states that no page should be both writable and executable simultaneously. This prevents attackers from modifying code in memory. Modern OSes enforce this strictly, with exceptions only for JIT compilers that carefully manage transitions between W and X states.
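A JIT compiler that honors W^X never holds a page writable and executable at once: it emits code while the page is non-executable, then flips permissions before running it. A minimal user-space sketch using the POSIX mmap/mprotect calls (error handling elided; a fixed 4KB region is assumed):

```c
#include <string.h>
#include <sys/mman.h>

typedef int (*jit_fn)(void);

int run_jitted(const unsigned char *code, size_t len)
{
    /* 1. Allocate RW (not X) memory and emit the code into it */
    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(buf, code, len);

    /* 2. Flip to RX: the page is never W and X at the same time */
    mprotect(buf, 4096, PROT_READ | PROT_EXEC);

    /* 3. Execute, then release */
    int result = ((jit_fn)buf)();
    munmap(buf, 4096);
    return result;
}
```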
Some pages should remain cached in the TLB across context switches—especially kernel pages that are the same across all processes. The Global (G) bit and PCID feature optimize for this case.
The Global Bit (G):
Normally, when the OS switches processes by changing CR3, the entire TLB is flushed. This is correct—different processes have different mappings for the same virtual addresses.
However, kernel pages are mapped identically in every process. Flushing these is wasteful. The G bit marks pages as global:

- TLB entries for G=1 pages survive CR3 writes.
- They are evicted only by an explicit invalidation (e.g., invlpg) or by toggling CR4.PGE, which flushes all global entries.
Kernel code and data pages are typically marked global, remaining cached across context switches.
```c
// Context switch with TLB considerations
void context_switch(struct task *prev, struct task *next)
{
    // Save prev's state...

    // Switch page tables.
    // This flushes all non-global TLB entries.
    write_cr3(next->mm->pgd_physical);

    // Global pages (kernel mappings) remain in the TLB.
    // Non-global pages (user mappings) are flushed.

    // With PCID, we can tag entries by process ID:
    //   write_cr3(next->mm->pgd_physical | (next->pcid << 0));
    // Different PCID = different namespace; entries can coexist.
}

void map_kernel_page(pte_t *pte, phys_addr_t frame)
{
    // Kernel pages are marked global -- they survive context switches
    *pte = frame | PTE_PRESENT | PTE_GLOBAL | PTE_NX;
}

void map_user_page(pte_t *pte, phys_addr_t frame, int writable)
{
    *pte = frame | PTE_PRESENT | PTE_USER;
    if (writable)
        *pte |= PTE_RW;
    // NOT marked global -- flushed on context switch
}
```

Process-Context Identifiers (PCID):
PCID is a more sophisticated solution. Instead of binary global/not-global, each TLB entry is tagged with a 12-bit process identifier:

- The OS assigns each address space a PCID and supplies it in the low 12 bits of CR3.
- TLB entries are tagged with the PCID that was current when they were loaded, and a lookup matches only entries with the current tag.
- Switching address spaces therefore no longer has to flush anything: entries from different processes coexist under different tags.
PCID Benefits:

- Context switches avoid the wholesale TLB flush.
- A process scheduled again shortly after being preempted finds its translations still cached, avoiding a burst of TLB misses after every switch.
PCID Considerations:

- Only 4096 identifiers exist, so the OS must recycle them across a potentially larger set of processes.
- TLB shootdowns become more complex: stale translations for a page may exist under several PCIDs and must all be invalidated.
After Meltdown (2018), Kernel Page Table Isolation (KPTI) changed this model. To prevent speculative access to kernel memory, kernel pages are no longer mapped in user-mode page tables. The G bit is less useful now, as user and kernel use separate page tables. PCID helps amortize the cost of this separation.
The x86-64 PTE dedicates several bits (positions 9-11 and 52-58) to operating system use. Hardware ignores these bits—they're a scratchpad for the OS to store per-page metadata without allocating additional structures.
Common Uses for Available Bits:

- Soft-dirty tracking: recording writes since a checkpoint (e.g., across fork), independently of the hardware D bit.
- Marking special pages, such as the shared zero page or device-mapped memory.
- Per-page locks, reference hints, or debugging markers.
```c
/* Linux kernel PTE manipulation
   (simplified from arch/x86/include/asm/pgtable_types.h) */

/* Hardware-defined bits */
#define _PAGE_PRESENT   (1UL << 0)
#define _PAGE_RW        (1UL << 1)
#define _PAGE_USER      (1UL << 2)
#define _PAGE_PWT       (1UL << 3)
#define _PAGE_PCD       (1UL << 4)
#define _PAGE_ACCESSED  (1UL << 5)
#define _PAGE_DIRTY     (1UL << 6)
#define _PAGE_PSE       (1UL << 7)   /* 2MB/1GB page */
#define _PAGE_GLOBAL    (1UL << 8)
#define _PAGE_NX        (1UL << 63)

/* OS-defined bits (using "available" positions) */
#define _PAGE_SOFT_DIRTY (1UL << 9)  /* Track dirty across fork */
#define _PAGE_DEVMAP     (1UL << 58) /* Device-mapped memory */
#define _PAGE_SPECIAL    (1UL << 57) /* Special zero-page, etc. */

/* Bits used when the page is NOT present (P=0) */
#define _PAGE_SWP_SOFT_DIRTY (1UL << 1)
#define _PAGE_SWP_EXCLUSIVE  (1UL << 2)
/* Remaining bits encode the swap file and offset */

/* Checking page state */
static inline bool pte_present(pte_t pte)
{
    return pte_val(pte) & _PAGE_PRESENT;
}

static inline bool pte_write(pte_t pte)
{
    return pte_val(pte) & _PAGE_RW;
}

static inline bool pte_dirty(pte_t pte)
{
    return pte_val(pte) & _PAGE_DIRTY;
}
```

When Present=0: The Swap Entry Format
When a page is not present (P=0), the PTE is repurposed entirely. The MMU will fault on any access, so the remaining bits can encode:

- Which swap device or swap file holds the page.
- The page's offset within that swap area.
- Special markers: never allocated, zero-fill on demand, or file-backed but not yet read in.
This dual-use of the PTE is elegant: no additional data structure is needed to track swapped pages. The page table itself serves as the swap directory.
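Because hardware ignores everything except P when P=0, the OS is free to define its own layout for the other 63 bits. Here is a hypothetical encoding as an illustration (this is not Linux's actual swp_entry_t layout): bit 0 stays 0, bits 1-6 hold a swap device index, and bits 7-62 hold the page offset within that device.

```c
#include <stdint.h>

/* Hypothetical swap-entry layout for a non-present PTE:
   bit 0     : 0 (Present clear -- hardware faults on any access)
   bits 1-6  : swap device index (up to 64 devices)
   bits 7-62 : page-sized offset within the swap device */
static uint64_t make_swap_pte(unsigned dev, uint64_t offset)
{
    return ((uint64_t)dev << 1) | (offset << 7);   /* bit 0 stays 0 */
}

static unsigned swap_dev(uint64_t pte)    { return (pte >> 1) & 0x3F; }
static uint64_t swap_offset(uint64_t pte) { return pte >> 7; }
```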
Linux abstracts PTE manipulation behind architecture-specific functions (pte_present, pte_write, etc.). This allows the same VMM code to work across x86, ARM, RISC-V, etc., even though each has different bit positions and semantics. Understanding x86 PTEs helps you read the code, but always use the accessor functions in production kernel code.
In multiprocessor systems, multiple CPUs may access the same PTE simultaneously—one doing a table walk while another updates the entry. This creates subtle concurrency challenges that operating systems must handle carefully.
The Problem:

- One CPU's page walker may be reading a PTE (and caching it in its TLB) at the same instant another CPU modifies that entry.
- Hardware A/D updates write to the same word the OS is reading or rewriting.
- A physical frame must not be freed and reused while any CPU still holds a stale TLB entry pointing to it.
Hardware Guarantees:
On x86-64:

- Aligned 8-byte PTE loads and stores are atomic, so a page walker never observes a torn (half-updated) entry.
- Hardware sets the A and D bits using a locked read-modify-write on the PTE.
However, this only ensures atomic reads/writes. Higher-level invariants require OS cooperation.
```c
/* Common patterns for safe PTE updates */

/* Pattern 1: Atomic single-word update */
void update_pte_protection(struct vm_area_struct *vma, unsigned long address,
                           pte_t *pte, unsigned long new_prot)
{
    pte_t old, new;
    do {
        old = *pte;
        new = pte_modify(old, new_prot);
    } while (cmpxchg(pte, old, new) != old);

    /* Must flush the TLB on every CPU that might have cached this entry */
    flush_tlb_page(vma, address);
}

/* Pattern 2: Clear-before-modify for unmapping */
void unmap_page_safe(pte_t *pte, unsigned long addr)
{
    pte_t old;

    /* First, clear the present bit to stop new accesses */
    old = ptep_clear(pte);

    /* Issue a TLB flush on all CPUs */
    flush_tlb_page_all_cpus(addr);

    /* Now it is safe to free the physical frame */
    if (pte_present(old)) {
        struct page *page = pte_page(old);
        put_page(page);
    }
}

/* Pattern 3: Update with TLB flush batching */
void update_page_range(struct vm_area_struct *vma, unsigned long new_flags,
                       unsigned long start, unsigned long end)
{
    unsigned long addr;
    pte_t *pte;
    struct mmu_gather tlb;

    tlb_gather_mmu(&tlb, vma->vm_mm);
    for (addr = start; addr < end; addr += PAGE_SIZE) {
        pte = pte_offset(addr);

        /* Modify the PTE */
        ptep_modify(pte, new_flags);

        /* Record it for a batched flush */
        tlb_flush_pte_range(&tlb, addr, PAGE_SIZE);
    }

    /* A single IPI to all CPUs with batched invalidation */
    tlb_finish_mmu(&tlb);
}
```

TLB Shootdown:
The most complex concurrency issue is TLB shootdown. When CPU0 modifies a PTE that might be cached in CPU1's TLB:

1. CPU0 updates the PTE in memory.
2. CPU0 sends an inter-processor interrupt (IPI) to every CPU that might cache the translation.
3. Each target CPU invalidates the stale entry (e.g., with invlpg) and acknowledges.
4. Only after all acknowledgments arrive can CPU0 safely reuse or free the frame.
This is expensive—IPIs take thousands of cycles. Operating systems batch multiple TLB flushes together when possible.
Hardware automatically sets Accessed and Dirty bits during memory access. This is a write to the PTE that races with OS updates. On x86, the CPU uses an atomic read-modify-write to set these bits. The OS must be aware that a PTE can be modified 'behind its back' and use appropriate atomic operations when checking or clearing these bits.
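When the OS samples and clears the Accessed bit (as the clock algorithm does), it must use an atomic read-modify-write so that a concurrent hardware update to the same PTE word is not lost. A minimal sketch using the GCC/Clang atomic builtins; a real kernel would use its own helpers such as ptep_test_and_clear_young:

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_ACCESSED (1ULL << 5)

/* Atomically test and clear the A bit; returns whether it was set.
   The atomic AND ensures a concurrent hardware D-bit update to the
   same PTE word is not overwritten with a stale value. */
static bool test_and_clear_accessed(uint64_t *pte)
{
    uint64_t old = __atomic_fetch_and(pte, ~PTE_ACCESSED, __ATOMIC_SEQ_CST);
    return (old & PTE_ACCESSED) != 0;
}
```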
While we've focused on x86-64, other architectures have different PTE designs. Understanding these variations helps when working on portable operating systems or specialized hardware.
ARM AArch64:
ARM's PTE format is more orthogonal and flexible:
| Feature | ARM Approach | x86 Approach |
|---|---|---|
| Execute permission | Separate XN bit per privilege level (UXN, PXN) | Single NX bit |
| Access permissions | AP[2:1] 2-bit field encoding | R/W + U/S bits |
| Memory type | AttrIndx[2:0] indexes MAIR register | PAT + PWT + PCD combination |
| Shareability | Explicit SH[1:0] field (Non/Inner/Outer) | Implicit from memory type |
| Access flag | AF bit (optional hardware update) | A bit (always HW updated) |
| Dirty tracking | DBM extension or software managed | D bit (always HW updated) |
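To give a feel for the ARM encoding, here is a sketch that decodes a few fields of an AArch64 stage-1 page descriptor. Field positions follow the ARMv8-A VMSA (AttrIndx at bits 4:2, AP[2:1] at bits 7:6, AF at bit 10, PXN/UXN at bits 53/54); treat it as an illustration rather than a complete decoder:

```c
#include <stdbool.h>
#include <stdint.h>

/* Selected AArch64 stage-1 page-descriptor fields (4KB granule) */
static bool     aa64_valid(uint64_t d)      { return d & 1; }
static unsigned aa64_attr_index(uint64_t d) { return (d >> 2) & 0x7; }      /* MAIR index  */
static bool     aa64_el0_access(uint64_t d) { return d & (1ULL << 6); }     /* AP[1]       */
static bool     aa64_read_only(uint64_t d)  { return d & (1ULL << 7); }     /* AP[2]       */
static bool     aa64_accessed(uint64_t d)   { return d & (1ULL << 10); }    /* AF          */
static bool     aa64_pxn(uint64_t d)        { return d & (1ULL << 53); }    /* no privileged exec */
static bool     aa64_uxn(uint64_t d)        { return d & (1ULL << 54); }    /* no user exec       */
```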
RISC-V Sv39/Sv48:
RISC-V takes a clean-slate approach with a simple, well-documented PTE format:
```
RISC-V PTE (64 bits):

┌───────────────┬─────────────────┬─────┬──┬──┬──┬──┬──┬──┬──┬──┐
│ Reserved (10) │    PPN (44)     │ RSW │ D│ A│ G│ U│ X│ W│ R│ V│
│ 63 ──────── 54│ 53 ────────── 10│ 9─8 │ 7│ 6│ 5│ 4│ 3│ 2│ 1│ 0│
└───────────────┴─────────────────┴─────┴──┴──┴──┴──┴──┴──┴──┴──┘
V = Valid (like Present)
R = Readable
W = Writable
X = Executable
U = User accessible
G = Global
A = Accessed
D = Dirty
RSW = Reserved for Supervisor (OS use)
PPN = Physical Page Number
```
RISC-V's R, W, and X permission bits are independent, so nearly any combination can be expressed directly (the privileged specification reserves only write-without-read). This is more flexible than x86's model, where a present page is always readable and is executable unless NX is set.
RISC-V's PTE format exemplifies its design philosophy: simple, orthogonal, and well-specified. The separate R/W/X bits avoid the complex permission inheritance rules of x86 (where kernel can always access user pages, etc.). This makes the ISA easier to implement and reason about, at the cost of some compatibility with existing security assumptions.
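The flat encoding makes a decoder almost trivial. A minimal sketch following the Sv39/Sv48 layout diagrammed above:

```c
#include <stdbool.h>
#include <stdint.h>

/* RISC-V Sv39/Sv48 PTE fields (bit positions from the diagram above) */
#define PTE_V (1ULL << 0)
#define PTE_R (1ULL << 1)
#define PTE_W (1ULL << 2)
#define PTE_X (1ULL << 3)
#define PTE_U (1ULL << 4)

static bool     pte_valid(uint64_t pte) { return pte & PTE_V; }
static uint64_t pte_ppn(uint64_t pte)   { return (pte >> 10) & ((1ULL << 44) - 1); }

/* With V=1, R=W=X=0 means "pointer to the next-level table";
   any other R/W/X setting marks a leaf with those permissions. */
static bool pte_is_leaf(uint64_t pte)
{
    return (pte & (PTE_R | PTE_W | PTE_X)) != 0;
}
```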
The Page Table Entry is a marvel of engineering efficiency—packing translation, protection, status tracking, caching control, and OS metadata into just 8 bytes. Let's consolidate the key insights:

- Translation: the PFN maps a virtual page to a physical frame; everything else is metadata about that mapping.
- Protection: R/W, U/S, and NX combine to enforce read/write/execute and user/kernel policies such as W^X.
- Status: hardware sets the A and D bits automatically, giving the OS nearly free usage tracking for replacement and write-back decisions.
- Caching: PWT, PCD, and PAT select per-page memory types, essential for memory-mapped I/O.
- Dual use: when P=0, hardware ignores the rest of the entry, so the OS reuses it as a swap descriptor; no separate structure is needed.
- Portability: x86-64, ARM, and RISC-V encode these ideas differently, which is why kernels hide PTE access behind architecture-specific helpers.
What's Next:
With a deep understanding of PTE structure, we're ready to explore specific bits in more detail. The next page focuses on the Valid/Invalid bit—examining how this single bit enables demand paging, lazy allocation, and the fundamental page fault mechanism that underlies virtual memory.
You now understand the anatomy of page table entries—from the physical frame number to protection bits to status tracking. This knowledge is essential for understanding how operating systems implement virtual memory, security isolation, and memory optimization policies.