If the page table is a dictionary mapping virtual to physical addresses, then the Page Table Entry (PTE) is the individual entry in that dictionary. Every translation, every protection check, every decision about memory caching—all of it comes down to the bits packed into this small but powerful data structure.
A PTE is typically just 4 or 8 bytes, yet these few bytes control whether a memory access succeeds or faults, whether data is cached or not, whether a page can be written or only read, and whether user-mode code can access it at all. Understanding PTEs is understanding the interface between hardware and the operating system's memory management.
By the end of this page, you will understand every field in a page table entry, how hardware interprets each bit, how the operating system manipulates these entries to implement memory management, and the architectural variations across different CPU families.
At its core, a Page Table Entry answers two fundamental questions:

- **Where is the page?** The Physical Frame Number identifies which physical frame backs this virtual page.
- **What is allowed?** Protection bits define which accesses (read, write, execute, user vs. kernel) are permitted.
Beyond these, the PTE also provides:

- Status tracking: the Accessed and Dirty bits record whether the page has been read or written.
- Caching control: the PWT, PCD, and PAT bits select a memory type for the page.
- OS metadata: "available" bits that hardware ignores, free for the operating system's own bookkeeping.
The PTE as a Packed Bit Field:
CPU architects face a fundamental constraint: PTEs should be small (to minimize memory overhead) but comprehensive (to support all necessary features). The solution is aggressive bit packing—every bit serves a specific hardware or software purpose.
| Architecture | PTE Size | Address Bits Used | Key Features |
|---|---|---|---|
| x86 (32-bit) | 4 bytes | 20 bits (PFN) | PAE mode available for 36-bit physical |
| x86-64 | 8 bytes | 40+ bits (PFN) | 52-bit physical max, NX bit |
| ARM (AArch64) | 8 bytes | 48 bits addressable | Multiple page sizes, hierarchical attributes |
| RISC-V Sv39 | 8 bytes | 39-bit virtual | Clean, orthogonal design |
| RISC-V Sv48 | 8 bytes | 48-bit virtual | Extended address space |
The Physical Frame Number (PFN):
The PFN is the core translation data—it tells the MMU which physical frame contains the data for this virtual page. The number of bits required depends on:

- The maximum physical address the architecture supports.
- The page size, which determines how many low-order address bits are consumed by the page offset.
For a 4KB page size:

- The low 12 bits of any physical address are the page offset (2^12 = 4096).
- The PFN is simply the physical address shifted right by 12.
- Supporting a 52-bit physical address space therefore requires 52 - 12 = 40 PFN bits.
Modern 64-bit PTEs allocate 40+ bits for the PFN, supporting terabytes of physical memory. The remaining bits are used for flags and metadata.
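To make the arithmetic concrete, here is a minimal sketch of splitting and recombining addresses. The constants assume 4KB pages and the 40-bit PFN field of the x86-64 layout described next; the PTE value and offset are hypothetical:

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                       /* 4KB pages: 12 offset bits */
#define PFN_MASK   0x000FFFFFFFFFF000ULL    /* bits 12-51 of an x86-64 PTE */

int main(void) {
    uint64_t pte = 0x12345A067ULL;          /* hypothetical PTE value */

    uint64_t pfn    = (pte & PFN_MASK) >> PAGE_SHIFT; /* physical frame number */
    uint64_t offset = 0x2A4;                /* low 12 bits of the virtual address */
    uint64_t phys   = (pfn << PAGE_SHIFT) | offset;   /* final physical address */

    printf("PFN = 0x%llx, physical address = 0x%llx\n",
           (unsigned long long)pfn, (unsigned long long)phys);
    return 0;
}
```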
Let's examine the x86-64 page table entry in complete detail. This is the most common architecture in servers, desktops, and laptops, making its PTE format essential knowledge for systems programmers.
x86-64 PTE Layout (for 4KB pages):
| Bit(s) | Name | Description |
|---|---|---|
| 0 | P (Present) | Page is present in physical memory |
| 1 | R/W | Read/Write: 0=read-only, 1=read-write |
| 2 | U/S | User/Supervisor: 0=kernel-only, 1=user-accessible |
| 3 | PWT | Page Write-Through (cache policy) |
| 4 | PCD | Page Cache Disable |
| 5 | A (Accessed) | Page has been accessed (read or write) |
| 6 | D (Dirty) | Page has been written |
| 7 | PAT/PS | Page Attribute Table index / Page Size |
| 8 | G (Global) | Global page (not flushed on CR3 change) |
| 9-11 | AVL | Available for OS use |
| 12-51 | PFN | Physical Frame Number (40 bits) |
| 52-58 | Available | Reserved / Available for OS |
| 59-62 | PK[0:3] | Protection Key (4 bits) |
| 63 | NX/XD | No Execute / Execute Disable |

Memory Layout:

```
┌────┬───────────────────┬────────────────────────┬────────────────┐
│ NX │ Avail / Prot Key  │ PFN (40 bits)          │ Control Bits   │
│ 63 │ 62 ──────────── 52│ 51 ────────────────── 12│ 11 ────────── 0│
└────┴───────────────────┴────────────────────────┴────────────────┘
```

Bit-by-Bit Analysis:
Bit 0 - Present (P): The most critical bit. When P=0, the MMU will trigger a page fault on any access. The remaining 63 bits can store anything the OS wants—commonly the swap location or a marker indicating the page was never allocated.
Bit 1 - Read/Write (R/W): Controls write permission. When R/W=0, any write attempt triggers a protection fault. This enables copy-on-write, read-only data sections, and code integrity protection.
Bit 2 - User/Supervisor (U/S): Controls privilege level access. When U/S=0, only kernel-mode (ring 0) code can access the page. User-mode (ring 3) accesses trigger a fault. This is fundamental to kernel memory protection.
Bit 3 - Page Write-Through (PWT): Cache write policy. When PWT=1, writes go through to memory immediately. Used for memory-mapped I/O where device registers must see writes immediately.
Bit 4 - Page Cache Disable (PCD): Disables caching entirely. Essential for I/O memory where caching could return stale device data. Also used for memory-mapped files where coherency with disk matters.
Modern x86 uses the Page Attribute Table (PAT) in combination with PWT and PCD to provide 8 different memory types: Write-Back (WB), Write-Through (WT), Write-Combining (WC), Uncacheable (UC), and variations. This is crucial for graphics memory, NVRAM, and high-performance I/O.
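To see how the three bits combine: the PAT index is simply PAT<<2 | PCD<<1 | PWT, which selects one of eight entries in the IA32_PAT MSR. A minimal sketch; the memory-type table below shows the processor's power-on default programming (operating systems typically reprogram some entries, e.g., to get Write-Combining):

```c
#include <stdint.h>

/* x86-64 PTE cache-control bit positions (4KB pages) */
#define PTE_PWT (1ULL << 3)
#define PTE_PCD (1ULL << 4)
#define PTE_PAT (1ULL << 7)

/* Power-on default IA32_PAT programming: index -> memory type */
static const char *pat_default[8] = {
    "WB",   /* 0: Write-Back                */
    "WT",   /* 1: Write-Through             */
    "UC-",  /* 2: Uncacheable, overridable  */
    "UC",   /* 3: Uncacheable               */
    "WB", "WT", "UC-", "UC"  /* 4-7: mirror 0-3 by default */
};

/* Compute which PAT entry a given PTE selects */
static unsigned pat_index(uint64_t pte)
{
    return ((pte & PTE_PAT) ? 4 : 0) |
           ((pte & PTE_PCD) ? 2 : 0) |
           ((pte & PTE_PWT) ? 1 : 0);
}
```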
The Accessed (A) and Dirty (D) bits are remarkable because they're set automatically by hardware, enabling the operating system to track memory usage patterns without software overhead on every access.
Accessed Bit (A):

- Set to 1 by hardware whenever the page is read or written.
- Never cleared by hardware; the OS clears it periodically to sample which pages are in active use.
- Drives page replacement policies such as the clock algorithm shown below.
Dirty Bit (D):

- Set to 1 by hardware on the first write to the page.
- Tells the OS whether an evicted page must be written back to swap or can simply be discarded.
- Cleared by the OS once the page has been written back (cleaned).
```python
# Simplified Clock Algorithm using the Accessed bit
def clock_page_replacement():
    """
    Second-chance page replacement using the hardware-set A bit.
    The OS never needs to trap individual memory accesses --
    hardware sets the A bit automatically.
    """
    global clock_hand                  # persists across calls; points to the candidate
    while True:
        pte = page_table[clock_hand]
        if not pte.accessed:
            # Not accessed since the last sweep -- a good victim
            victim = clock_hand
            clock_hand = (clock_hand + 1) % num_frames
            return victim
        else:
            # Was accessed -- give it another chance
            pte.accessed = False       # clear it (an OS action)
            clock_hand = (clock_hand + 1) % num_frames
            # Continue searching

def evict_page(frame_number):
    """Evict a page, writing it to disk only if it is dirty."""
    global disk_writes
    pte = find_pte_for_frame(frame_number)
    if pte.dirty:
        # Page was modified -- must write it back
        swap_out(frame_number, pte.swap_location)
        disk_writes += 1
    else:
        # Page is clean -- simply discard it; no disk I/O needed!
        pass
    pte.present = False
    invalidate_tlb_entry(pte.vpn)
```

Why Hardware Support Matters:
Consider the alternative: without hardware-managed A and D bits, the OS would need to:

- Mark every page not-present (or read-only) so each access traps into the kernel,
- Record the access or write in software inside the fault handler, and
- Restore the mapping and resume the faulting instruction.
With billions of memory accesses per second, this software overhead would be catastrophic. Hardware A/D bits reduce this to:

- A single hardware-performed update to the PTE on the first access (A) or first write (D), with no trap and no software involvement at all.
The dirty bit is a huge performance win during page eviction. If a page was only read (not written), it doesn't need to be written back to disk—the swap copy is still valid. For read-mostly workloads, this can eliminate 80-90% of disk writes during memory pressure.
The protection bits (R/W, U/S, NX) work together to implement memory protection policies. Understanding their combinations is essential for security-conscious systems programming.
Protection Matrix:
| U/S | R/W | NX | Effective Protection | Typical Use |
|---|---|---|---|---|
| 0 | 0 | 0 | Kernel read + execute | Kernel code (rare without NX) |
| 0 | 0 | 1 | Kernel read-only | Kernel rodata, initrd |
| 0 | 1 | 0 | Kernel read-write + execute | Kernel writable code (dangerous) |
| 0 | 1 | 1 | Kernel read-write | Kernel heap, stack, data |
| 1 | 0 | 0 | User read + execute | User code (.text section) |
| 1 | 0 | 1 | User read-only | User rodata, shared library mappings |
| 1 | 1 | 0 | User read-write + execute | JIT code, trampolines |
| 1 | 1 | 1 | User read-write | User heap, stack, mmap regions |
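The matrix above can be expressed directly as the check the MMU conceptually performs on each access. A minimal sketch; the `is_user`, `is_write`, and `is_fetch` flags describing the access are hypothetical names for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_P  (1ULL << 0)
#define PTE_RW (1ULL << 1)
#define PTE_US (1ULL << 2)
#define PTE_NX (1ULL << 63)

/* Conceptual access check: returns true if the access is allowed.
   Real hardware also honors CR0.WP, SMEP/SMAP, protection keys, etc. */
static bool pte_allows(uint64_t pte, bool is_user, bool is_write, bool is_fetch)
{
    if (!(pte & PTE_P))              return false;  /* not present: page fault    */
    if (is_user && !(pte & PTE_US))  return false;  /* user touching kernel page  */
    if (is_write && !(pte & PTE_RW)) return false;  /* write to read-only page    */
    if (is_fetch && (pte & PTE_NX))  return false;  /* execute from NX page       */
    return true;
}
```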
The NX (No-Execute) Bit:
The NX bit (called XD on Intel, NX on AMD) is a critical security feature added to x86-64. It allows marking pages as non-executable, preventing code injection attacks.
Without NX: An attacker could:

1. Find a buffer overflow in the target program.
2. Inject machine code (shellcode) into the overflowed stack or heap buffer.
3. Overwrite a return address to point at the injected code.
4. Let the function return, so the CPU executes the attacker's code.
With NX: Stack and heap pages are marked NX=1. Any attempt to execute code from these regions triggers a hardware fault. The attack fails at step 4.
Modern Security Configurations:
The W^X (Write XOR Execute) security policy states that no page should be both writable and executable simultaneously. This prevents attackers from modifying code in memory. Modern OSes enforce this strictly, with exceptions only for JIT compilers that carefully manage transitions between W and X states.
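A JIT compiler that honors W^X never holds a page writable and executable at once: it emits code while the page is non-executable, then flips permissions before running it. A minimal user-space sketch using the POSIX mmap/mprotect calls (error handling elided; a fixed 4KB region is assumed):

```c
#include <string.h>
#include <sys/mman.h>

typedef int (*jit_fn)(void);

int run_jitted(const unsigned char *code, size_t len)
{
    /* 1. Allocate RW (not X) memory and emit the code into it */
    void *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(buf, code, len);

    /* 2. Flip to RX: the page is never W and X at the same time */
    mprotect(buf, 4096, PROT_READ | PROT_EXEC);

    /* 3. Execute, then release */
    int result = ((jit_fn)buf)();
    munmap(buf, 4096);
    return result;
}
```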
Some pages should remain cached in the TLB across context switches—especially kernel pages that are the same across all processes. The Global (G) bit and PCID feature optimize for this case.
The Global Bit (G):
Normally, when the OS switches processes by changing CR3, the entire TLB is flushed. This is correct—different processes have different mappings for the same virtual addresses.
However, kernel pages are mapped identically in every process. Flushing these is wasteful. The G bit marks pages as global:

- TLB entries for G=1 pages survive CR3 writes.
- They are evicted only by an explicit invalidation (e.g., invlpg) or by toggling CR4.PGE, which flushes all global entries.
Kernel code and data pages are typically marked global, remaining cached across context switches.
```c
// Context switch with TLB considerations
void context_switch(struct task *prev, struct task *next)
{
    // Save prev's state...

    // Switch page tables.
    // This flushes all non-global TLB entries.
    write_cr3(next->mm->pgd_physical);

    // Global pages (kernel mappings) remain in the TLB.
    // Non-global pages (user mappings) are flushed.

    // With PCID, we can tag entries by process ID:
    //   write_cr3(next->mm->pgd_physical | (next->pcid << 0));
    // Different PCID = different namespace; entries can coexist.
}

void map_kernel_page(pte_t *pte, phys_addr_t frame)
{
    // Kernel pages are marked global -- they survive context switches
    *pte = frame | PTE_PRESENT | PTE_GLOBAL | PTE_NX;
}

void map_user_page(pte_t *pte, phys_addr_t frame, int writable)
{
    *pte = frame | PTE_PRESENT | PTE_USER;
    if (writable)
        *pte |= PTE_RW;
    // NOT marked global -- flushed on context switch
}
```

Process-Context Identifiers (PCID):
PCID is a more sophisticated solution. Instead of binary global/not-global, each TLB entry is tagged with a 12-bit process identifier:

- The OS assigns each address space a PCID and supplies it in the low 12 bits of CR3.
- TLB entries are tagged with the PCID that was current when they were loaded, and a lookup matches only entries with the current tag.
- Switching address spaces therefore no longer has to flush anything: entries from different processes coexist under different tags.
PCID Benefits:

- Context switches avoid the wholesale TLB flush.
- A process scheduled again shortly after being preempted finds its translations still cached, avoiding a burst of TLB misses after every switch.
PCID Considerations:

- Only 4096 identifiers exist, so the OS must recycle them across a potentially larger set of processes.
- TLB shootdowns become more complex: stale translations for a page may exist under several PCIDs and must all be invalidated.
After Meltdown (2018), Kernel Page Table Isolation (KPTI) changed this model. To prevent speculative access to kernel memory, kernel pages are no longer mapped in user-mode page tables. The G bit is less useful now, as user and kernel use separate page tables. PCID helps amortize the cost of this separation.
The x86-64 PTE dedicates several bits (positions 9-11 and 52-58) to operating system use. Hardware ignores these bits—they're a scratchpad for the OS to store per-page metadata without allocating additional structures.
Common Uses for Available Bits:

- Soft-dirty tracking: recording writes since a checkpoint (e.g., across fork), independently of the hardware D bit.
- Marking special pages, such as the shared zero page or device-mapped memory.
- Per-page locks, reference hints, or debugging markers.
```c
/* Linux kernel PTE manipulation
   (simplified from arch/x86/include/asm/pgtable_types.h) */

/* Hardware-defined bits */
#define _PAGE_PRESENT   (1UL << 0)
#define _PAGE_RW        (1UL << 1)
#define _PAGE_USER      (1UL << 2)
#define _PAGE_PWT       (1UL << 3)
#define _PAGE_PCD       (1UL << 4)
#define _PAGE_ACCESSED  (1UL << 5)
#define _PAGE_DIRTY     (1UL << 6)
#define _PAGE_PSE       (1UL << 7)   /* 2MB/1GB page */
#define _PAGE_GLOBAL    (1UL << 8)
#define _PAGE_NX        (1UL << 63)

/* OS-defined bits (using "available" positions) */
#define _PAGE_SOFT_DIRTY (1UL << 9)  /* Track dirty across fork */
#define _PAGE_DEVMAP     (1UL << 58) /* Device-mapped memory */
#define _PAGE_SPECIAL    (1UL << 57) /* Special zero-page, etc. */

/* Bits used when the page is NOT present (P=0) */
#define _PAGE_SWP_SOFT_DIRTY (1UL << 1)
#define _PAGE_SWP_EXCLUSIVE  (1UL << 2)
/* Remaining bits encode the swap file and offset */

/* Checking page state */
static inline bool pte_present(pte_t pte)
{
    return pte_val(pte) & _PAGE_PRESENT;
}

static inline bool pte_write(pte_t pte)
{
    return pte_val(pte) & _PAGE_RW;
}

static inline bool pte_dirty(pte_t pte)
{
    return pte_val(pte) & _PAGE_DIRTY;
}
```

When Present=0: The Swap Entry Format
When a page is not present (P=0), the PTE is repurposed entirely. The MMU will fault on any access, so the remaining bits can encode:

- Which swap device or swap file holds the page.
- The page's offset within that swap area.
- Special markers: never allocated, zero-fill on demand, or file-backed but not yet read in.
This dual-use of the PTE is elegant: no additional data structure is needed to track swapped pages. The page table itself serves as the swap directory.
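Because hardware ignores everything except P when P=0, the OS is free to define its own layout for the other 63 bits. Here is a hypothetical encoding as an illustration (this is not Linux's actual swp_entry_t layout): bit 0 stays 0, bits 1-6 hold a swap device index, and bits 7-62 hold the page offset within that device.

```c
#include <stdint.h>

/* Hypothetical swap-entry layout for a non-present PTE:
   bit 0     : 0 (Present clear -- hardware faults on any access)
   bits 1-6  : swap device index (up to 64 devices)
   bits 7-62 : page-sized offset within the swap device */
static uint64_t make_swap_pte(unsigned dev, uint64_t offset)
{
    return ((uint64_t)dev << 1) | (offset << 7);   /* bit 0 stays 0 */
}

static unsigned swap_dev(uint64_t pte)    { return (pte >> 1) & 0x3F; }
static uint64_t swap_offset(uint64_t pte) { return pte >> 7; }
```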
Linux abstracts PTE manipulation behind architecture-specific functions (pte_present, pte_write, etc.). This allows the same VMM code to work across x86, ARM, RISC-V, etc., even though each has different bit positions and semantics. Understanding x86 PTEs helps you read the code, but always use the accessor functions in production kernel code.
In multiprocessor systems, multiple CPUs may access the same PTE simultaneously—one doing a table walk while another updates the entry. This creates subtle concurrency challenges that operating systems must handle carefully.
The Problem:

- One CPU's page walker may be reading a PTE (and caching it in its TLB) at the same instant another CPU modifies that entry.
- Hardware A/D updates write to the same word the OS is reading or rewriting.
- A physical frame must not be freed and reused while any CPU still holds a stale TLB entry pointing to it.
Hardware Guarantees:
On x86-64:

- Aligned 8-byte PTE loads and stores are atomic, so a page walker never observes a torn (half-updated) entry.
- Hardware sets the A and D bits using a locked read-modify-write on the PTE.
However, this only ensures atomic reads/writes. Higher-level invariants require OS cooperation.
```c
/* Common patterns for safe PTE updates */

/* Pattern 1: Atomic single-word update */
void update_pte_protection(struct vm_area_struct *vma, unsigned long address,
                           pte_t *pte, unsigned long new_prot)
{
    pte_t old, new;
    do {
        old = *pte;
        new = pte_modify(old, new_prot);
    } while (cmpxchg(pte, old, new) != old);

    /* Must flush the TLB on every CPU that might have cached this entry */
    flush_tlb_page(vma, address);
}

/* Pattern 2: Clear-before-modify for unmapping */
void unmap_page_safe(pte_t *pte, unsigned long addr)
{
    pte_t old;

    /* First, clear the present bit to stop new accesses */
    old = ptep_clear(pte);

    /* Issue a TLB flush on all CPUs */
    flush_tlb_page_all_cpus(addr);

    /* Now it is safe to free the physical frame */
    if (pte_present(old)) {
        struct page *page = pte_page(old);
        put_page(page);
    }
}

/* Pattern 3: Update with TLB flush batching */
void update_page_range(struct vm_area_struct *vma, unsigned long new_flags,
                       unsigned long start, unsigned long end)
{
    unsigned long addr;
    pte_t *pte;
    struct mmu_gather tlb;

    tlb_gather_mmu(&tlb, vma->vm_mm);
    for (addr = start; addr < end; addr += PAGE_SIZE) {
        pte = pte_offset(addr);

        /* Modify the PTE */
        ptep_modify(pte, new_flags);

        /* Record it for a batched flush */
        tlb_flush_pte_range(&tlb, addr, PAGE_SIZE);
    }

    /* A single IPI to all CPUs with batched invalidation */
    tlb_finish_mmu(&tlb);
}
```

TLB Shootdown:
The most complex concurrency issue is TLB shootdown. When CPU0 modifies a PTE that might be cached in CPU1's TLB:

1. CPU0 updates the PTE in memory.
2. CPU0 sends an inter-processor interrupt (IPI) to every CPU that might cache the translation.
3. Each target CPU invalidates the stale entry (e.g., with invlpg) and acknowledges.
4. Only after all acknowledgments arrive can CPU0 safely reuse or free the frame.
This is expensive—IPIs take thousands of cycles. Operating systems batch multiple TLB flushes together when possible.
Hardware automatically sets Accessed and Dirty bits during memory access. This is a write to the PTE that races with OS updates. On x86, the CPU uses an atomic read-modify-write to set these bits. The OS must be aware that a PTE can be modified 'behind its back' and use appropriate atomic operations when checking or clearing these bits.
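When the OS samples and clears the Accessed bit (as the clock algorithm does), it must use an atomic read-modify-write so that a concurrent hardware update to the same PTE word is not lost. A minimal sketch using the GCC/Clang atomic builtins; a real kernel would use its own helpers such as ptep_test_and_clear_young:

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_ACCESSED (1ULL << 5)

/* Atomically test and clear the A bit; returns whether it was set.
   The atomic AND ensures a concurrent hardware D-bit update to the
   same PTE word is not overwritten with a stale value. */
static bool test_and_clear_accessed(uint64_t *pte)
{
    uint64_t old = __atomic_fetch_and(pte, ~PTE_ACCESSED, __ATOMIC_SEQ_CST);
    return (old & PTE_ACCESSED) != 0;
}
```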
While we've focused on x86-64, other architectures have different PTE designs. Understanding these variations helps when working on portable operating systems or specialized hardware.
ARM AArch64:
ARM's PTE format is more orthogonal and flexible:
| Feature | ARM Approach | x86 Approach |
|---|---|---|
| Execute permission | Separate XN bit per privilege level (UXN, PXN) | Single NX bit |
| Access permissions | AP[2:1] 2-bit field encoding | R/W + U/S bits |
| Memory type | AttrIndx[2:0] indexes MAIR register | PAT + PWT + PCD combination |
| Shareability | Explicit SH[1:0] field (Non/Inner/Outer) | Implicit from memory type |
| Access flag | AF bit (optional hardware update) | A bit (always HW updated) |
| Dirty tracking | DBM extension or software managed | D bit (always HW updated) |
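To give a feel for the ARM encoding, here is a sketch that decodes a few fields of an AArch64 stage-1 page descriptor. Field positions follow the ARMv8-A VMSA (AttrIndx at bits 4:2, AP[2:1] at bits 7:6, AF at bit 10, PXN/UXN at bits 53/54); treat it as an illustration rather than a complete decoder:

```c
#include <stdbool.h>
#include <stdint.h>

/* Selected AArch64 stage-1 page-descriptor fields (4KB granule) */
static bool     aa64_valid(uint64_t d)      { return d & 1; }
static unsigned aa64_attr_index(uint64_t d) { return (d >> 2) & 0x7; }      /* MAIR index  */
static bool     aa64_el0_access(uint64_t d) { return d & (1ULL << 6); }     /* AP[1]       */
static bool     aa64_read_only(uint64_t d)  { return d & (1ULL << 7); }     /* AP[2]       */
static bool     aa64_accessed(uint64_t d)   { return d & (1ULL << 10); }    /* AF          */
static bool     aa64_pxn(uint64_t d)        { return d & (1ULL << 53); }    /* no privileged exec */
static bool     aa64_uxn(uint64_t d)        { return d & (1ULL << 54); }    /* no user exec       */
```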
RISC-V Sv39/Sv48:
RISC-V takes a clean-slate approach with a simple, well-documented PTE format:
```
RISC-V PTE (64 bits):

┌───────────────┬─────────────────┬─────┬──┬──┬──┬──┬──┬──┬──┬──┐
│ Reserved (10) │    PPN (44)     │ RSW │ D│ A│ G│ U│ X│ W│ R│ V│
│ 63 ──────── 54│ 53 ────────── 10│ 9─8 │ 7│ 6│ 5│ 4│ 3│ 2│ 1│ 0│
└───────────────┴─────────────────┴─────┴──┴──┴──┴──┴──┴──┴──┴──┘
V = Valid (like Present)
R = Readable
W = Writable
X = Executable
U = User accessible
G = Global
A = Accessed
D = Dirty
RSW = Reserved for Supervisor (OS use)
PPN = Physical Page Number
```
RISC-V's R, W, and X permission bits are independent, so nearly any combination can be expressed directly (the privileged specification reserves only write-without-read). This is more flexible than x86's model, where a present page is always readable and is executable unless NX is set.
RISC-V's PTE format exemplifies its design philosophy: simple, orthogonal, and well-specified. The separate R/W/X bits avoid the complex permission inheritance rules of x86 (where kernel can always access user pages, etc.). This makes the ISA easier to implement and reason about, at the cost of some compatibility with existing security assumptions.
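The flat encoding makes a decoder almost trivial. A minimal sketch following the Sv39/Sv48 layout diagrammed above:

```c
#include <stdbool.h>
#include <stdint.h>

/* RISC-V Sv39/Sv48 PTE fields (bit positions from the diagram above) */
#define PTE_V (1ULL << 0)
#define PTE_R (1ULL << 1)
#define PTE_W (1ULL << 2)
#define PTE_X (1ULL << 3)
#define PTE_U (1ULL << 4)

static bool     pte_valid(uint64_t pte) { return pte & PTE_V; }
static uint64_t pte_ppn(uint64_t pte)   { return (pte >> 10) & ((1ULL << 44) - 1); }

/* With V=1, R=W=X=0 means "pointer to the next-level table";
   any other R/W/X setting marks a leaf with those permissions. */
static bool pte_is_leaf(uint64_t pte)
{
    return (pte & (PTE_R | PTE_W | PTE_X)) != 0;
}
```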
The Page Table Entry is a marvel of engineering efficiency—packing translation, protection, status tracking, caching control, and OS metadata into just 8 bytes. Let's consolidate the key insights:

- Translation: the PFN maps a virtual page to a physical frame; everything else is metadata about that mapping.
- Protection: R/W, U/S, and NX combine to enforce read/write/execute and user/kernel policies such as W^X.
- Status: hardware sets the A and D bits automatically, giving the OS nearly free usage tracking for replacement and write-back decisions.
- Caching: PWT, PCD, and PAT select per-page memory types, essential for memory-mapped I/O.
- Dual use: when P=0, hardware ignores the rest of the entry, so the OS reuses it as a swap descriptor; no separate structure is needed.
- Portability: x86-64, ARM, and RISC-V encode these ideas differently, which is why kernels hide PTE access behind architecture-specific helpers.
What's Next:
With a deep understanding of PTE structure, we're ready to explore specific bits in more detail. The next page focuses on the Valid/Invalid bit—examining how this single bit enables demand paging, lazy allocation, and the fundamental page fault mechanism that underlies virtual memory.
You now understand the anatomy of page table entries—from the physical frame number to protection bits to status tracking. This knowledge is essential for understanding how operating systems implement virtual memory, security isolation, and memory optimization policies.