Once the valid bit confirms a page is present, the CPU must still answer: Is this access allowed? Can user-mode code read this kernel page? Can a process write to read-only code? Can the stack be executed as code?
The protection bits in each Page Table Entry form an access control matrix that the hardware checks on every single memory access—billions of times per second. These bits are the front line of security, enforcing process isolation, preventing code injection, and protecting the kernel from user-mode attacks.
Understanding protection bits is essential for systems programmers, security researchers, and anyone who wants to comprehend how modern operating systems maintain integrity despite running untrusted code.
By the end of this page, you will understand each protection bit in detail, how they combine to form access policies, architectural variations across platforms, and the security vulnerabilities that arise from protection misconfigurations.
Memory protection implements an access control matrix in hardware. For each memory page, the OS defines which operations are permitted by whom. Two dimensions govern every check:

Access Types: read, write, and execute.

Privilege Levels: user mode and supervisor (kernel) mode.
The page table entry contains bits that encode this matrix. The MMU checks these bits against the current CPU mode and access type, faulting if the access is denied.
| U/S | R/W | NX | User Can | Kernel Can | Common Use |
|---|---|---|---|---|---|
| 0 | 0 | 1 | Nothing | Read only | Kernel read-only data |
| 0 | 1 | 1 | Nothing | Read+Write | Kernel data, stack |
| 0 | 0 | 0 | Nothing | Read+Execute | Kernel code |
| 0 | 1 | 0 | Nothing | R+W+X | Rare (security risk) |
| 1 | 0 | 1 | Read only | Read only | User rodata, const |
| 1 | 1 | 1 | Read+Write | Read+Write | User heap, stack |
| 1 | 0 | 0 | Read+Execute | Read+Execute | User code (.text) |
| 1 | 1 | 0 | R+W+X | R+W+X | JIT code (dangerous) |
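To make the table concrete, here is a minimal sketch of the check the MMU performs on each access. The struct and function names are illustrative, and it deliberately ignores CR0.WP, SMAP/SMEP, and protection keys, all covered later:

```c
#include <stdbool.h>

/* Illustrative names - not a real hardware interface */
struct pte_bits { bool user; bool write; bool nx; };  /* U/S, R/W, NX */

bool access_allowed(struct pte_bits pte,
                    bool cpu_in_user_mode,
                    bool is_write, bool is_ifetch)
{
    if (cpu_in_user_mode && !pte.user) return false;  /* U/S=0: supervisor only */
    if (is_write && !pte.write)        return false;  /* R/W=0: writes fault */
    if (is_ifetch && pte.nx)           return false;  /* NX=1: fetches fault */
    return true;  /* otherwise allowed, subject to the valid bit */
}
```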
Important Nuances:
Kernel can access user pages: On most architectures, kernel mode can access any user-accessible page (though SMAP changes this on x86)
Read implies execute (historically): Before NX/XD bits, read permission implied execute permission—data could be executed
Hierarchy matters: In multi-level page tables, permission bits at each level are combined (typically AND logic)
TLB caches permissions: Protection violations can be detected from TLB entries without walking page tables
The R/W, U/S, and NX bits give just three binary dimensions (eight combinations), but real systems need more. x86 added protection keys (4 bits selecting from 16 permission sets), ARM has domain bits, and various architectures support memory tagging. The basic model keeps expanding to meet modern security needs.
The Read/Write (R/W) bit controls whether a page can be modified. Despite its name, it doesn't control read access—a page with R/W=0 can still be read, just not written.
Semantics on x86:

- R/W=1: the page is writable
- R/W=0: the page is read-only; any write raises a protection fault
- User-mode writes are always checked; supervisor-mode writes are checked only when CR0.WP=1 (the normal configuration, discussed below)

Use Cases for Read-Only Pages:

- Program code (.text) and constant data (.rodata)
- Copy-on-write pages shared after fork()
- Kernel code and rodata after boot (example below)
- Shared library text mapped into many processes
```c
/* Example: Kernel read-only data protection (Linux) */

/* Mark kernel rodata section as read-only */
void mark_rodata_ro(void)
{
    unsigned long start = (unsigned long)__start_rodata;
    unsigned long end   = (unsigned long)__end_rodata;

    /* Change permissions on all pages in this range */
    set_memory_ro(start, (end - start) >> PAGE_SHIFT);

    printk(KERN_INFO "Kernel rodata now read-only\n");
}

/* Attempt to write to rodata - immediate protection fault! */
void __init test_rodata_protection(void)
{
    const char *test = "This is in rodata";
    char *ptr = (char *)test;  /* Cast away const - but HW will catch us */

    /* This will trigger protection fault: */
    /* *ptr = 'X'; */

    /*
     * CPU Exception: #PF (page fault)
     * Error code: 0x3 (protection violation, write, kernel)
     * Handler: do_page_fault() -> SIGBUS or oops
     */
}

/* Copy-on-write using R/W bit */
pte_t make_pte_readonly(pte_t pte)
{
    return pte_wrprotect(pte);  /* Clear R/W bit */
}

pte_t make_pte_writable(pte_t pte)
{
    return pte_mkwrite(pte);    /* Set R/W bit */
}
```

Write Protection Fault Handling:
When a write occurs to a read-only page, the CPU generates a protection fault (distinct from a page-not-present fault). The fault handler examines:

- The error code: was the page present, was the access a write, and did it come from user mode?
- The faulting virtual address (reported in CR2 on x86)
- The VMA covering that address: does the process's mapping actually permit the access?
The ability to distinguish 'not present' from 'present but protected' is crucial for implementing COW efficiently.
Linux marks kernel code and rodata as read-only once boot completes (mark_rodata_ro). This prevents many kernel exploits from modifying critical kernel data. The CR0.WP flag must be set for this to apply to kernel mode—some old exploits disabled WP to write anywhere.
The User/Supervisor (U/S) bit controls privilege-level access—whether user-mode code can access a page at all.
Semantics on x86:

- U/S=0: supervisor only; any user-mode access (read, write, or instruction fetch) raises a protection fault
- U/S=1: user accessible, still subject to the R/W and NX bits
This is the primary defense between user space and kernel space. Every kernel page has U/S=0, making it inaccessible to user code even if the user knows the virtual address.
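As a quick illustration, a user process that dereferences a kernel address faults immediately. A minimal sketch, where the address is hypothetical, chosen to land in a typical x86-64 higher-half kernel range:

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical kernel-half address; any U/S=0 (or unmapped) page behaves the same */
    volatile char *kaddr = (volatile char *)0xffffffff81000000UL;
    char c = *kaddr;   /* user-mode access to a supervisor page -> SIGSEGV */
    printf("%c\n", c); /* never reached */
    return 0;
}
```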
User vs Supervisor Page Access:

```
Virtual Address Space:
┌─────────────────────────────┐ 0xFFFFFFFFFFFFFFFF
│                             │
│  Kernel Space               │  U/S = 0 (Supervisor only)
│  (Kernel code, data,        │
│   per-process kernel        │  User access   → #PF (protection fault)
│   stack)                    │  Kernel access → OK
│                             │
├─────────────────────────────┤ 0xFFFF800000000000 (typical split)
│                             │
│  User Space                 │  U/S = 1 (User accessible)
│  (Code, heap, stack,        │
│   shared libraries,         │  User access   → OK (subject to R/W, NX)
│   mmap regions)             │  Kernel access → OK (subject to SMAP)
│                             │
└─────────────────────────────┘ 0x0000000000000000

Note: Kernel is mapped in every process's address space at the same
virtual addresses, but U/S=0 prevents user access.
```

Why Kernel is Mapped in User Page Tables:

- System calls and interrupts can enter the kernel without switching page tables, so no TLB flush is needed on every kernel entry
- The kernel can directly dereference user pointers (e.g., in copy_from_user) while handling a system call
- The U/S bit, not separate address spaces, provides the isolation (KPTI, covered under mitigations below, gives this convenience up for stronger isolation)
SMAP and SMEP: Kernel Self-Restriction:
Modern CPUs provide additional protection against kernel accessing user pages:
SMEP (Supervisor Mode Execution Prevention): Prevents kernel from executing user-mode code. Blocks attacks that redirect kernel execution to user-controlled code.
SMAP (Supervisor Mode Access Prevention): Prevents kernel from reading/writing user-mode pages. Kernel must use special instructions (STAC/CLAC) to temporarily enable user access. Blocks attacks that trick kernel into dereferencing user-controlled pointers.
| U/S | Mode | SMAP/SMEP | Access Result |
|---|---|---|---|
| 0 | User | N/A | Protection Fault |
| 0 | Kernel | N/A | Allowed |
| 1 | User | N/A | Allowed (per R/W, NX) |
| 1 | Kernel Execute | SMEP enabled | Protection Fault |
| 1 | Kernel Read/Write | SMAP enabled, AC=0 | Protection Fault |
| 1 | Kernel Read/Write | SMAP enabled, AC=1 | Allowed |
| 1 | Kernel | SMAP/SMEP disabled | Allowed |
SMEP and SMAP are defense-in-depth measures. Even if an attacker can control kernel execution through a bug, they can't simply jump to user-space shellcode (SMEP) or trick the kernel into reading malicious user-space data structures (SMAP). These reduce the exploitability of kernel vulnerabilities.
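To illustrate the SMAP discipline, here is a hedged sketch of how a kernel routine might bracket an intentional user access with STAC/CLAC. The function name is illustrative; a real copy_from_user validates the range and recovers from faults instead of using a plain memcpy:

```c
#include <string.h>

/* EFLAGS.AC gates SMAP: AC=1 temporarily permits supervisor access to U/S=1 pages */
static inline void stac(void) { __asm__ volatile("stac" ::: "memory"); }
static inline void clac(void) { __asm__ volatile("clac" ::: "memory"); }

/* Illustrative only - sketches the STAC/CLAC bracketing pattern */
unsigned long sketch_copy_from_user(void *dst, const void *user_src, unsigned long n)
{
    stac();                    /* open the SMAP window */
    memcpy(dst, user_src, n);  /* the only place user memory is touched */
    clac();                    /* close it again immediately */
    return 0;
}
```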
The No-Execute (NX) bit (Intel calls it XD for Execute Disable) is perhaps the most important security addition to page table entries. It allows marking pages as non-executable, preventing code injection attacks.
Historical Context:
Before NX (pre-2004 on x86):

- Any readable page was implicitly executable; the PTE had no way to say "data only"
- Stack and heap contents could be executed directly, making classic shellcode injection straightforward

With NX:

- Execute permission is set per page, independent of read permission
- An instruction fetch from a page with NX=1 raises a protection fault, as the example below traces
```c
/* The classic buffer overflow attack - blocked by NX */

// Vulnerable function:
void vulnerable(char *input) {
    char buffer[64];
    strcpy(buffer, input);  // Buffer overflow!
}

/*
 * Attack WITHOUT NX protection:
 *
 * 1. Attacker sends: [shellcode][padding][return addr]
 * 2. strcpy overflows buffer, overwrites return address
 * 3. Return address now points to buffer (on stack)
 * 4. Function returns, CPU starts executing shellcode
 * 5. Attacker wins!
 *
 * Memory layout:                Stack grows ↓
 *   ┌────────────────────┐
 *   │ return address     │ ← Overwritten to point to buffer
 *   ├────────────────────┤
 *   │ saved frame ptr    │ ← Overwritten with junk
 *   ├────────────────────┤
 *   │ buffer[63]         │
 *   │   ...              │ ← Shellcode written here
 *   │ buffer[0]          │
 *   └────────────────────┘
 */

/*
 * Attack WITH NX protection:
 *
 * 1. Same overflow, return address → buffer
 * 2. Function returns, CPU tries to fetch from stack
 * 3. PTE for stack page has NX=1
 * 4. CPU raises #PF (protection fault)
 * 5. Process killed with SIGSEGV
 * 6. Attack BLOCKED!
 */

/* Proper memory layout with NX */
void setup_process_memory(void) {
    // Code section: R-X (read, execute, no write)
    mprotect(code_start, code_len, PROT_READ | PROT_EXEC);

    // Data section: RW- (read, write, no execute)
    mprotect(data_start, data_len, PROT_READ | PROT_WRITE);

    // Stack: RW- (read, write, no execute)
    mprotect(stack_start, stack_len, PROT_READ | PROT_WRITE);

    // Heap: RW- (read, write, no execute)
    mprotect(heap_start, heap_len, PROT_READ | PROT_WRITE);
}
```

The W^X Principle:
W^X (Write XOR Execute) is a security principle: no memory region should be both writable and executable simultaneously.
This is enforced by combining the R/W and NX bits:

- Writable pages (R/W=1) are marked NX=1, so data can never be fetched as instructions
- Executable pages (NX=0) are marked R/W=0, so code can never be silently patched

Exceptions to W^X:

- JIT compilers (JavaScript engines, JVMs) that generate machine code at runtime
- Dynamic linkers applying text relocations (rare on modern systems)
- Instrumentation and tracing tools that patch code in place
These must carefully manage permissions, switching between W and X as needed.
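For example, a JIT that honors W^X can map its code buffer writable first, then flip it to executable before running it. A minimal user-space sketch, with an illustrative function name:

```c
#include <string.h>
#include <sys/mman.h>

/* Returns an executable copy of 'code', or NULL on failure */
void *emit_jit_code(const unsigned char *code, size_t len)
{
    size_t size = 4096;  /* assume the fragment fits in one page */
    unsigned char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    memcpy(buf, code, len);  /* write phase: page is RW-, never executable */

    /* Flip to R-X: drop write before allowing execution, so the page
       is never writable and executable at the same time */
    if (mprotect(buf, size, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, size);
        return NULL;
    }
    return buf;
}
```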
NX doesn't stop all attacks. Attackers developed ROP: instead of injecting code, they chain together existing code 'gadgets' (small instruction sequences ending in RET). Each gadget does a small operation; chained together, they achieve arbitrary computation. Defenses like ASLR and CFI help mitigate ROP.
In multi-level page tables, each level has its own protection bits. This creates an interesting question: what happens when upper levels have different permissions than lower levels?
x86-64 Behavior:
The effective permission is the most restrictive combination across all levels:
Mathematically, for the user and write permissions: Effective = Level4 AND Level3 AND Level2 AND Level1. NX works in the opposite sense: if any level sets NX=1, the page is non-executable, so executability also ANDs across levels. (A code sketch follows the implications below.)
Multi-Level Permission Resolution (x86-64):

| PML4 Entry (Level 4) | PDPT Entry (Level 3) | PD Entry (Level 2) | PT Entry (Level 1) | Effective |
|---|---|---|---|---|
| U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | User, R/W, X |
| U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | U/S=1, R/W=0, NX=1 | User, R, NX |
| U/S=1, R/W=1, NX=0 | U/S=0, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | Kernel\*, R/W, X |
| U/S=0, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | Kernel\*\*, R/W, X |

\* Even though the PT entry says U/S=1, the Level 3 restriction wins.
\*\* Even though all lower entries say U/S=1, the PML4 restriction wins.

Common pattern:

- Map the entire kernel range via a single PML4 entry with U/S=0
- Individual kernel pages then don't need U/S=0 in their PTEs
- Map the entire user range via PML4 entries with U/S=1
- Individual user pages set U/S=1 in their PTEs (consistent at every level)

Practical Implications:
Bulk Permission Setting: To make a large region kernel-only, set U/S=0 in the upper-level entry. Individual pages don't need separate protection—they inherit the restriction.
Avoid Mixed Mappings: A single 2MB huge page (or 1GB gigantic page) shares one set of permission bits. All 4KB sub-regions must have the same permission. This limits flexibility with huge pages.
Sharing Upper-Level Tables: If two processes share a PDPT (for shared library mappings), they must have the same permissions for that region. Protection can differ only at levels where page tables diverge.
TLB Caches Effective Permissions: The TLB stores the final computed permissions. Hardware checks are fast—no need to re-walk levels on each access.
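The combining rule itself is tiny. Here is a sketch, with illustrative types, that mirrors what the hardware computes during a page walk and then caches in the TLB:

```c
#include <stdbool.h>

struct perm { bool user; bool write; bool exec; };  /* from U/S, R/W, !NX */

/* Effective permission across a multi-level walk: most restrictive wins.
   'exec' is per-level executability, so any level with NX=1 zeroes it. */
struct perm effective_perm(const struct perm lvl[], int n)
{
    struct perm eff = { true, true, true };
    for (int i = 0; i < n; i++) {
        eff.user  = eff.user  && lvl[i].user;   /* U/S: AND */
        eff.write = eff.write && lvl[i].write;  /* R/W: AND */
        eff.exec  = eff.exec  && lvl[i].exec;   /* NX anywhere -> no exec */
    }
    return eff;
}
```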
ARM allows finer control: the execute-never (XN) bit has separate UXN (user) and PXN (privileged) variants. This allows pages to be executable for kernel but not user, or vice versa—something x86 can't express directly. This is useful for page-table pages and kernel trampoline code.
Intel Memory Protection Keys (MPK/PKU) extend the protection model with 4 additional bits in each PTE, allowing up to 16 different protection domains within a single process. This enables fine-grained access control without changing page tables.
How Protection Keys Work:

- Each PTE carries a 4-bit key, assigning its page to one of 16 protection domains
- The per-thread PKRU register holds two bits per key: access-disable (AD) and write-disable (WD)
- On every user-mode access, hardware checks the PKRU bits for the page's key in addition to the normal R/W check
- User code can rewrite PKRU directly with the WRPKRU instruction; no system call or page table change (and thus no TLB flush) is required
```c
/* Intel Memory Protection Keys Example */

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

/* Allocate a protection key */
int pkey_alloc(unsigned int flags, unsigned int access_rights) {
    return syscall(SYS_pkey_alloc, flags, access_rights);
}

/* Associate memory with a protection key */
int pkey_mprotect(void *addr, size_t len, int prot, int pkey) {
    return syscall(SYS_pkey_mprotect, addr, len, prot, pkey);
}

/* Read/write the PKRU register (user-mode!) */
static inline unsigned int rdpkru(void) {
    unsigned int eax, edx;
    __asm__ volatile(".byte 0x0f, 0x01, 0xee"
                     : "=a"(eax), "=d"(edx) : "c"(0));
    return eax;
}

static inline void wrpkru(unsigned int pkru) {
    __asm__ volatile(".byte 0x0f, 0x01, 0xef"
                     :: "a"(pkru), "c"(0), "d"(0));
}

/* Disable access to pages with key 'pkey' */
void disable_pkey_access(int pkey) {
    unsigned int pkru = rdpkru();
    pkru |= (1 << (pkey * 2));      /* Set access-disable bit */
    pkru |= (1 << (pkey * 2 + 1));  /* Set write-disable bit */
    wrpkru(pkru);
}

/* Re-enable access to pages with key 'pkey' */
void enable_pkey_access(int pkey) {
    unsigned int pkru = rdpkru();
    pkru &= ~(3 << (pkey * 2));     /* Clear both bits */
    wrpkru(pkru);
}

/* Example: Protecting sensitive data */
int main() {
    /* Allocate memory and a protection key */
    void *secret = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int pkey = pkey_alloc(0, 0);

    /* Associate memory with the key */
    pkey_mprotect(secret, 4096, PROT_READ | PROT_WRITE, pkey);

    /* Store secret data */
    strcpy(secret, "Super secret password");

    /* Disable access - no system call needed! */
    disable_pkey_access(pkey);

    /* Any access to 'secret' now faults immediately */
    /* printf("%s\n", (char *)secret); // Would cause SIGSEGV! */

    /* Re-enable when needed */
    enable_pkey_access(pkey);
    printf("%s\n", (char *)secret); /* Now works */

    return 0;
}
```

Use Cases for Protection Keys:

- Guarding in-memory secrets (keys, passwords) so stray pointer bugs fault instead of leaking them, as in the example above
- Isolating less-trusted components (parsers, plugins) within a single process
- Cheaply toggling write access to JIT code buffers without mprotect calls
Performance Advantage:
Changing PKRU is ~20-30 cycles. Changing page table entries requires TLB flushes, typically thousands of cycles. For frequently-switched protection (like enabling/disabling JIT write access), this is a huge win.
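To see the page-table side of that cost, here is a rough user-space sketch that toggles a page's writability with mprotect; a PKRU toggle would replace both calls with two register writes and no TLB traffic. Absolute numbers vary widely by machine:

```c
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
    void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;

    struct timespec t0, t1;
    enum { N = 100000 };

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        /* Each call updates the PTE and invalidates TLB entries */
        mprotect(page, 4096, PROT_READ);
        mprotect(page, 4096, PROT_READ | PROT_WRITE);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("mprotect toggle: %.0f ns per round trip\n", ns / N);
    return 0;
}
```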
Protection keys only restrict user-mode access. Kernel can still access any page. Also, WRPKRU is a user-mode instruction—malicious code in the same process can modify PKRU. PKU is for fault isolation and defense-in-depth, not for protecting against in-process attackers who achieve code execution.
When an access violates the protection bits, the CPU generates a protection fault (formally still #PF on x86, but with a different error code). Understanding fault handling is crucial for both OS implementation and security.

x86 Page Fault Error Code:

The error code pushed onto the stack contains:
| Bit | Name | Meaning when set |
|---|---|---|
| 0 | P | Fault was on present page (protection, not absent) |
| 1 | W/R | Fault was a write (vs read) |
| 2 | U/S | Fault in user mode (vs kernel) |
| 3 | RSVD | Reserved bit violation |
| 4 | I/D | Fault was instruction fetch (vs data) |
| 5 | PK | Protection key violation |
| 6 | SS | Shadow stack violation (CET) |
```c
/* Simplified protection fault handling (Linux-like) */

void do_page_fault(struct pt_regs *regs, unsigned long error_code,
                   unsigned long fault_address)
{
    struct vm_area_struct *vma;
    int fault_flags = 0;

    /* Look up the VMA covering the faulting address */
    vma = find_vma(current->mm, fault_address);
    if (!vma)
        goto bad_area;

    /* Was it a protection fault (P=1) or not-present (P=0)? */
    if (error_code & X86_PF_PROT) {
        /* Protection violation on a present page */

        if (error_code & X86_PF_WRITE) {
            /* Write to read-only page - check if the VMA allows write */
            if (!(vma->vm_flags & VM_WRITE))
                goto bad_area;  /* VMA is read-only, SIGSEGV */

            /* VMA allows write - might be COW */
            fault_flags |= FAULT_FLAG_WRITE;
            return handle_mm_fault(vma, fault_address, fault_flags);
            /* handle_mm_fault will do COW if needed */
        }

        if (error_code & X86_PF_INSTR) {
            /* Attempt to execute NX page */
            goto bad_area;  /* Always SIGSEGV - can't fix this */
        }

        if (error_code & X86_PF_USER) {
            /* User access to supervisor page */
            goto bad_area;  /* Always SIGSEGV */
        }

        if (error_code & X86_PF_PK) {
            /* Protection key violation */
            goto bad_area;  /* Send SIGSEGV with SEGV_PKUERR */
        }
    } else {
        /* Not present - normal demand paging fault */
        return handle_mm_fault(vma, fault_address, fault_flags);
    }

bad_area:
    if (error_code & X86_PF_USER) {
        /* User-mode fault - send signal */
        struct siginfo info = {
            .si_signo = SIGSEGV,
            .si_code  = (error_code & X86_PF_PK) ? SEGV_PKUERR : SEGV_ACCERR,
            .si_addr  = (void *)fault_address,
        };
        force_sig_info(SIGSEGV, &info, current);
    } else {
        /* Kernel-mode fault - oops! */
        kernel_oops("BUG: kernel protection fault", regs);
    }
}
```

Key Distinctions:
Present vs Protection Fault:

- P=0 in the error code: the page was not present; usually a normal demand-paging or swap fault that the handler fixes silently
- P=1: the page was present but the access violated its protection bits; either a deliberate trap (like COW) or a genuine bug or attack

Legitimate vs Illegitimate Protection Faults:

- Legitimate: a write to a COW page whose VMA permits writing; the handler copies the page, sets R/W=1, and resumes the instruction
- Illegitimate: executing an NX page, user access to a supervisor page, or a write the VMA does not permit; the handler delivers SIGSEGV

User vs Kernel:

- A user-mode fault that cannot be fixed up results in a signal (typically SIGSEGV) to the process
- An unexpected kernel-mode protection fault indicates a kernel bug; Linux reports an oops and may kill the offending task
When debugging a SIGSEGV, check the fault address and error code. 'dmesg' on Linux shows fault details. si_code in the signal tells you: SEGV_MAPERR (no mapping), SEGV_ACCERR (permission denied), SEGV_PKUERR (protection key). These distinguish 'bad pointer' from 'valid pointer, wrong permissions'.
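A small demonstration, assuming Linux, that writes to the process's own read-only text page to trigger SEGV_ACCERR (printf in a signal handler is not strictly async-signal-safe, but fine for a demo):

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    const char *why = (info->si_code == SEGV_MAPERR) ? "no mapping"
                    : (info->si_code == SEGV_ACCERR) ? "permission denied"
                    : "other";
    printf("SIGSEGV at %p: %s\n", info->si_addr, why);
    _exit(1);
}

int main(void)
{
    struct sigaction sa = { .sa_sigaction = on_segv, .sa_flags = SA_SIGINFO };
    sigaction(SIGSEGV, &sa, NULL);

    char *text = (char *)(void *)main; /* .text is mapped R-X */
    *text = 0;                         /* write to read-only page -> SEGV_ACCERR */
    return 0;
}
```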
Protection bits are fundamental to system security, but they're not infallible. Modern attacks and mitigations reveal both the power and limitations of page-level protection.
Attack Vectors:

- Code injection: writing shellcode into writable memory and jumping to it (blocked by NX and W^X)
- ROP/JOP: chaining existing executable gadgets instead of injecting code, bypassing NX
- ret2usr: redirecting kernel control flow into attacker-controlled user pages (blocked by SMEP)
- Kernel dereference of user pointers: tricking the kernel into reading attacker-crafted user data (blocked by SMAP)
- Write-protect bypass: clearing CR0.WP so kernel-mode code can write nominally read-only pages, as older kernel exploits did
Modern Mitigations:
ASLR (Address Space Layout Randomization): Randomize where code/data is loaded. Even if attacker knows address, it's unpredictable at runtime.
KPTI (Kernel Page Table Isolation): Use separate page tables for user/kernel. Kernel pages aren't even in user page tables, not just marked U/S=0.
KASLR (Kernel ASLR): Randomize kernel location. Even with KPTI bypass, attackers don't know where kernel is.
CFI (Control-Flow Integrity): Validate indirect jumps go to expected targets. Mitigates ROP/JOP.
Shadow Stacks (CET): Separate stack for return addresses. Can't overwrite with buffer overflow.
MTE (Memory Tagging): ARM feature that tags pointers and memory. Mismatch causes fault. Catches use-after-free, overflow.
No single protection mechanism is sufficient. Modern secure systems layer multiple defenses: NX prevents code injection, ASLR prevents gadget finding, CFI prevents control-flow hijacking, and KPTI prevents kernel information leaks. Each adds cost; the security/performance tradeoff is continuously evolving.
Protection bits form the hardware-enforced access control layer for virtual memory. Let's consolidate the key insights:

- The R/W, U/S, and NX bits in each PTE encode write, privilege, and execute permissions, checked by the MMU on every access
- Permissions combine across page table levels; the most restrictive setting wins, and the TLB caches the result
- Protection faults (P=1) are distinct from not-present faults (P=0), which is what makes COW and precise SIGSEGV reporting possible
- SMEP/SMAP, protection keys, and the W^X policy layer further restrictions on top of the basic model
What's Next:
We've now covered PTE structure, the valid bit, and protection bits. The final piece is where page tables themselves live: how the hardware finds them, how they're laid out in memory, and how the kernel/user address space split is arranged.
You now understand how protection bits enforce memory access control: the R/W, U/S, and NX bits that protect processes from each other and the kernel from user code. This knowledge is essential for system security, for understanding how exploits work, and for OS implementation.