Once the valid bit confirms a page is present, the CPU must still answer: Is this access allowed? Can user-mode code read this kernel page? Can a process write to read-only code? Can the stack be executed as code?
The protection bits in each Page Table Entry form an access control matrix that the hardware checks on every single memory access—billions of times per second. These bits are the front line of security, enforcing process isolation, preventing code injection, and protecting the kernel from user-mode attacks.
Understanding protection bits is essential for systems programmers, security researchers, and anyone who wants to comprehend how modern operating systems maintain integrity despite running untrusted code.
By the end of this page, you will understand each protection bit in detail, how they combine to form access policies, architectural variations across platforms, and the security vulnerabilities that arise from protection misconfigurations.
Memory protection implements an access control matrix in hardware. For each memory page, the OS defines which operations are permitted by whom. Two dimensions govern every check:

Access Types: read, write, and execute.

Privilege Levels: user mode and supervisor (kernel) mode.
The page table entry contains bits that encode this matrix. The MMU checks these bits against the current CPU mode and access type, faulting if the access is denied.
| U/S | R/W | NX | User Can | Kernel Can | Common Use |
|---|---|---|---|---|---|
| 0 | 0 | 1 | Nothing | Read only | Kernel read-only data |
| 0 | 1 | 1 | Nothing | Read+Write | Kernel data, stack |
| 0 | 0 | 0 | Nothing | Read+Execute | Kernel code |
| 0 | 1 | 0 | Nothing | R+W+X | Rare (security risk) |
| 1 | 0 | 1 | Read only | Read only | User rodata, const |
| 1 | 1 | 1 | Read+Write | Read+Write | User heap, stack |
| 1 | 0 | 0 | Read+Execute | Read+Execute | User code (.text) |
| 1 | 1 | 0 | R+W+X | R+W+X | JIT code (dangerous) |
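To make the table concrete, here is a minimal sketch of the check the MMU performs on each access. The struct and function names are illustrative, and it deliberately ignores CR0.WP, SMAP/SMEP, and protection keys, all covered later:

```c
#include <stdbool.h>

/* Illustrative names - not a real hardware interface */
struct pte_bits { bool user; bool write; bool nx; };  /* U/S, R/W, NX */

bool access_allowed(struct pte_bits pte,
                    bool cpu_in_user_mode,
                    bool is_write, bool is_ifetch)
{
    if (cpu_in_user_mode && !pte.user) return false;  /* U/S=0: supervisor only */
    if (is_write && !pte.write)        return false;  /* R/W=0: writes fault */
    if (is_ifetch && pte.nx)           return false;  /* NX=1: fetches fault */
    return true;  /* otherwise allowed, subject to the valid bit */
}
```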
Important Nuances:
Kernel can access user pages: On most architectures, kernel mode can access any user-accessible page (though SMAP changes this on x86)
Read implies execute (historically): Before NX/XD bits, read permission implied execute permission—data could be executed
Hierarchy matters: In multi-level page tables, permission bits at each level are combined (typically AND logic)
TLB caches permissions: Protection violations can be detected from TLB entries without walking page tables
The R/W, U/S, and NX bits give just three binary dimensions (eight combinations), but real systems need more. x86 added protection keys (4 bits selecting from 16 permission sets), ARM has domain bits, and various architectures support memory tagging. The basic model keeps expanding to meet modern security needs.
The Read/Write (R/W) bit controls whether a page can be modified. Despite its name, it doesn't control read access—a page with R/W=0 can still be read, just not written.
Semantics on x86:

- R/W=1: the page is writable
- R/W=0: the page is read-only; any write raises a protection fault
- User-mode writes are always checked; supervisor-mode writes are checked only when CR0.WP=1 (the normal configuration, discussed below)

Use Cases for Read-Only Pages:

- Program code (.text) and constant data (.rodata)
- Copy-on-write pages shared after fork()
- Kernel code and rodata after boot (example below)
- Shared library text mapped into many processes
```c
/* Example: Kernel read-only data protection (Linux) */

/* Mark kernel rodata section as read-only */
void mark_rodata_ro(void)
{
    unsigned long start = (unsigned long)__start_rodata;
    unsigned long end   = (unsigned long)__end_rodata;

    /* Change permissions on all pages in this range */
    set_memory_ro(start, (end - start) >> PAGE_SHIFT);

    printk(KERN_INFO "Kernel rodata now read-only\n");
}

/* Attempt to write to rodata - immediate protection fault! */
void __init test_rodata_protection(void)
{
    const char *test = "This is in rodata";
    char *ptr = (char *)test;  /* Cast away const - but HW will catch us */

    /* This will trigger protection fault: */
    /* *ptr = 'X'; */

    /*
     * CPU Exception: #PF (page fault)
     * Error code: 0x3 (protection violation, write, kernel)
     * Handler: do_page_fault() -> SIGBUS or oops
     */
}

/* Copy-on-write using R/W bit */
pte_t make_pte_readonly(pte_t pte)
{
    return pte_wrprotect(pte);  /* Clear R/W bit */
}

pte_t make_pte_writable(pte_t pte)
{
    return pte_mkwrite(pte);    /* Set R/W bit */
}
```

Write Protection Fault Handling:
When a write occurs to a read-only page, the CPU generates a protection fault (distinct from a page-not-present fault). The fault handler examines:

- The error code: was the page present, was the access a write, and did it come from user mode?
- The faulting virtual address (reported in CR2 on x86)
- The VMA covering that address: does the process's mapping actually permit the access?
The ability to distinguish 'not present' from 'present but protected' is crucial for implementing COW efficiently.
Linux marks kernel code and rodata as read-only once boot completes (mark_rodata_ro). This prevents many kernel exploits from modifying critical kernel data. The CR0.WP flag must be set for this to apply to kernel mode—some old exploits disabled WP to write anywhere.
The User/Supervisor (U/S) bit controls privilege-level access—whether user-mode code can access a page at all.
Semantics on x86:

- U/S=0: supervisor only; any user-mode access (read, write, or instruction fetch) raises a protection fault
- U/S=1: user accessible, still subject to the R/W and NX bits
This is the primary defense between user space and kernel space. Every kernel page has U/S=0, making it inaccessible to user code even if the user knows the virtual address.
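As a quick illustration, a user process that dereferences a kernel address faults immediately. A minimal sketch, where the address is hypothetical, chosen to land in a typical x86-64 higher-half kernel range:

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical kernel-half address; any U/S=0 (or unmapped) page behaves the same */
    volatile char *kaddr = (volatile char *)0xffffffff81000000UL;
    char c = *kaddr;   /* user-mode access to a supervisor page -> SIGSEGV */
    printf("%c\n", c); /* never reached */
    return 0;
}
```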
User vs Supervisor Page Access:

```
Virtual Address Space:
┌─────────────────────────────┐ 0xFFFFFFFFFFFFFFFF
│                             │
│  Kernel Space               │  U/S = 0 (Supervisor only)
│  (Kernel code, data,        │
│   per-process kernel        │  User access   → #PF (protection fault)
│   stack)                    │  Kernel access → OK
│                             │
├─────────────────────────────┤ 0xFFFF800000000000 (typical split)
│                             │
│  User Space                 │  U/S = 1 (User accessible)
│  (Code, heap, stack,        │
│   shared libraries,         │  User access   → OK (subject to R/W, NX)
│   mmap regions)             │  Kernel access → OK (subject to SMAP)
│                             │
└─────────────────────────────┘ 0x0000000000000000

Note: Kernel is mapped in every process's address space at the same
virtual addresses, but U/S=0 prevents user access.
```

Why Kernel is Mapped in User Page Tables:

- System calls and interrupts can enter the kernel without switching page tables, so no TLB flush is needed on every kernel entry
- The kernel can directly dereference user pointers (e.g., in copy_from_user) while handling a system call
- The U/S bit, not separate address spaces, provides the isolation (KPTI, covered under mitigations below, gives this convenience up for stronger isolation)
SMAP and SMEP: Kernel Self-Restriction:
Modern CPUs provide additional protection against kernel accessing user pages:
SMEP (Supervisor Mode Execution Prevention): Prevents kernel from executing user-mode code. Blocks attacks that redirect kernel execution to user-controlled code.
SMAP (Supervisor Mode Access Prevention): Prevents kernel from reading/writing user-mode pages. Kernel must use special instructions (STAC/CLAC) to temporarily enable user access. Blocks attacks that trick kernel into dereferencing user-controlled pointers.
| U/S | Mode | SMAP/SMEP | Access Result |
|---|---|---|---|
| 0 | User | N/A | Protection Fault |
| 0 | Kernel | N/A | Allowed |
| 1 | User | N/A | Allowed (per R/W, NX) |
| 1 | Kernel Execute | SMEP enabled | Protection Fault |
| 1 | Kernel Read/Write | SMAP enabled, AC=0 | Protection Fault |
| 1 | Kernel Read/Write | SMAP enabled, AC=1 | Allowed |
| 1 | Kernel | SMAP/SMEP disabled | Allowed |
SMEP and SMAP are defense-in-depth measures. Even if an attacker can control kernel execution through a bug, they can't simply jump to user-space shellcode (SMEP) or trick the kernel into reading malicious user-space data structures (SMAP). These reduce the exploitability of kernel vulnerabilities.
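To illustrate the SMAP discipline, here is a hedged sketch of how a kernel routine might bracket an intentional user access with STAC/CLAC. The function name is illustrative; a real copy_from_user validates the range and recovers from faults instead of using a plain memcpy:

```c
#include <string.h>

/* EFLAGS.AC gates SMAP: AC=1 temporarily permits supervisor access to U/S=1 pages */
static inline void stac(void) { __asm__ volatile("stac" ::: "memory"); }
static inline void clac(void) { __asm__ volatile("clac" ::: "memory"); }

/* Illustrative only - sketches the STAC/CLAC bracketing pattern */
unsigned long sketch_copy_from_user(void *dst, const void *user_src, unsigned long n)
{
    stac();                    /* open the SMAP window */
    memcpy(dst, user_src, n);  /* the only place user memory is touched */
    clac();                    /* close it again immediately */
    return 0;
}
```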
The No-Execute (NX) bit (Intel calls it XD for Execute Disable) is perhaps the most important security addition to page table entries. It allows marking pages as non-executable, preventing code injection attacks.
Historical Context:
Before NX (pre-2004 on x86):

- Any readable page was implicitly executable; the PTE had no way to say "data only"
- Stack and heap contents could be executed directly, making classic shellcode injection straightforward

With NX:

- Execute permission is set per page, independent of read permission
- An instruction fetch from a page with NX=1 raises a protection fault, as the example below traces
```c
/* The classic buffer overflow attack - blocked by NX */

// Vulnerable function:
void vulnerable(char *input) {
    char buffer[64];
    strcpy(buffer, input);  // Buffer overflow!
}

/*
 * Attack WITHOUT NX protection:
 *
 * 1. Attacker sends: [shellcode][padding][return addr]
 * 2. strcpy overflows buffer, overwrites return address
 * 3. Return address now points to buffer (on stack)
 * 4. Function returns, CPU starts executing shellcode
 * 5. Attacker wins!
 *
 * Memory layout:                Stack grows ↓
 *   ┌────────────────────┐
 *   │ return address     │ ← Overwritten to point to buffer
 *   ├────────────────────┤
 *   │ saved frame ptr    │ ← Overwritten with junk
 *   ├────────────────────┤
 *   │ buffer[63]         │
 *   │   ...              │ ← Shellcode written here
 *   │ buffer[0]          │
 *   └────────────────────┘
 */

/*
 * Attack WITH NX protection:
 *
 * 1. Same overflow, return address → buffer
 * 2. Function returns, CPU tries to fetch from stack
 * 3. PTE for stack page has NX=1
 * 4. CPU raises #PF (protection fault)
 * 5. Process killed with SIGSEGV
 * 6. Attack BLOCKED!
 */

/* Proper memory layout with NX */
void setup_process_memory(void) {
    // Code section: R-X (read, execute, no write)
    mprotect(code_start, code_len, PROT_READ | PROT_EXEC);

    // Data section: RW- (read, write, no execute)
    mprotect(data_start, data_len, PROT_READ | PROT_WRITE);

    // Stack: RW- (read, write, no execute)
    mprotect(stack_start, stack_len, PROT_READ | PROT_WRITE);

    // Heap: RW- (read, write, no execute)
    mprotect(heap_start, heap_len, PROT_READ | PROT_WRITE);
}
```

The W^X Principle:
W^X (Write XOR Execute) is a security principle: no memory region should be both writable and executable simultaneously.
This is enforced by combining the R/W and NX bits:

- Writable pages (R/W=1) are marked NX=1, so data can never be fetched as instructions
- Executable pages (NX=0) are marked R/W=0, so code can never be silently patched

Exceptions to W^X:

- JIT compilers (JavaScript engines, JVMs) that generate machine code at runtime
- Dynamic linkers applying text relocations (rare on modern systems)
- Instrumentation and tracing tools that patch code in place
These must carefully manage permissions, switching between W and X as needed.
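For example, a JIT that honors W^X can map its code buffer writable first, then flip it to executable before running it. A minimal user-space sketch, with an illustrative function name:

```c
#include <string.h>
#include <sys/mman.h>

/* Returns an executable copy of 'code', or NULL on failure */
void *emit_jit_code(const unsigned char *code, size_t len)
{
    size_t size = 4096;  /* assume the fragment fits in one page */
    unsigned char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    memcpy(buf, code, len);  /* write phase: page is RW-, never executable */

    /* Flip to R-X: drop write before allowing execution, so the page
       is never writable and executable at the same time */
    if (mprotect(buf, size, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, size);
        return NULL;
    }
    return buf;
}
```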
NX doesn't stop all attacks. Attackers developed ROP: instead of injecting code, they chain together existing code 'gadgets' (small instruction sequences ending in RET). Each gadget does a small operation; chained together, they achieve arbitrary computation. Defenses like ASLR and CFI help mitigate ROP.
In multi-level page tables, each level has its own protection bits. This creates an interesting question: what happens when upper levels have different permissions than lower levels?
x86-64 Behavior:
The effective permission is the most restrictive combination across all levels:
Mathematically, for the user and write permissions: Effective = Level4 AND Level3 AND Level2 AND Level1. NX works in the opposite sense: if any level sets NX=1, the page is non-executable, so executability also ANDs across levels. (A code sketch follows the implications below.)
Multi-Level Permission Resolution (x86-64):

| PML4 Entry (Level 4) | PDPT Entry (Level 3) | PD Entry (Level 2) | PT Entry (Level 1) | Effective |
|---|---|---|---|---|
| U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | User, R/W, X |
| U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | U/S=1, R/W=0, NX=1 | User, R, NX |
| U/S=1, R/W=1, NX=0 | U/S=0, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | Kernel\*, R/W, X |
| U/S=0, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | U/S=1, R/W=1, NX=0 | Kernel\*\*, R/W, X |

\* Even though the PT entry says U/S=1, the Level 3 restriction wins.
\*\* Even though all lower entries say U/S=1, the PML4 restriction wins.

Common pattern:

- Map the entire kernel range via a single PML4 entry with U/S=0
- Individual kernel pages then don't need U/S=0 in their PTEs
- Map the entire user range via PML4 entries with U/S=1
- Individual user pages set U/S=1 in their PTEs (consistent at every level)

Practical Implications:
Bulk Permission Setting: To make a large region kernel-only, set U/S=0 in the upper-level entry. Individual pages don't need separate protection—they inherit the restriction.
Avoid Mixed Mappings: A single 2MB huge page (or 1GB gigantic page) shares one set of permission bits. All 4KB sub-regions must have the same permission. This limits flexibility with huge pages.
Sharing Upper-Level Tables: If two processes share a PDPT (for shared library mappings), they must have the same permissions for that region. Protection can differ only at levels where page tables diverge.
TLB Caches Effective Permissions: The TLB stores the final computed permissions. Hardware checks are fast—no need to re-walk levels on each access.
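The combining rule itself is tiny. Here is a sketch, with illustrative types, that mirrors what the hardware computes during a page walk and then caches in the TLB:

```c
#include <stdbool.h>

struct perm { bool user; bool write; bool exec; };  /* from U/S, R/W, !NX */

/* Effective permission across a multi-level walk: most restrictive wins.
   'exec' is per-level executability, so any level with NX=1 zeroes it. */
struct perm effective_perm(const struct perm lvl[], int n)
{
    struct perm eff = { true, true, true };
    for (int i = 0; i < n; i++) {
        eff.user  = eff.user  && lvl[i].user;   /* U/S: AND */
        eff.write = eff.write && lvl[i].write;  /* R/W: AND */
        eff.exec  = eff.exec  && lvl[i].exec;   /* NX anywhere -> no exec */
    }
    return eff;
}
```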
ARM allows finer control: the execute-never (XN) bit has separate UXN (user) and PXN (privileged) variants. This allows pages to be executable for kernel but not user, or vice versa—something x86 can't express directly. This is useful for page-table pages and kernel trampoline code.
Intel Memory Protection Keys (MPK/PKU) extend the protection model with 4 additional bits in each PTE, allowing up to 16 different protection domains within a single process. This enables fine-grained access control without changing page tables.
How Protection Keys Work:

- Each PTE carries a 4-bit key, assigning its page to one of 16 protection domains
- The per-thread PKRU register holds two bits per key: access-disable (AD) and write-disable (WD)
- On every user-mode access, hardware checks the PKRU bits for the page's key in addition to the normal R/W check
- User code can rewrite PKRU directly with the WRPKRU instruction; no system call or page table change (and thus no TLB flush) is required
```c
/* Intel Memory Protection Keys Example */

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

/* Allocate a protection key */
int pkey_alloc(unsigned int flags, unsigned int access_rights) {
    return syscall(SYS_pkey_alloc, flags, access_rights);
}

/* Associate memory with a protection key */
int pkey_mprotect(void *addr, size_t len, int prot, int pkey) {
    return syscall(SYS_pkey_mprotect, addr, len, prot, pkey);
}

/* Read/write the PKRU register (user-mode!) */
static inline unsigned int rdpkru(void) {
    unsigned int eax, edx;
    __asm__ volatile(".byte 0x0f, 0x01, 0xee"
                     : "=a"(eax), "=d"(edx) : "c"(0));
    return eax;
}

static inline void wrpkru(unsigned int pkru) {
    __asm__ volatile(".byte 0x0f, 0x01, 0xef"
                     :: "a"(pkru), "c"(0), "d"(0));
}

/* Disable access to pages with key 'pkey' */
void disable_pkey_access(int pkey) {
    unsigned int pkru = rdpkru();
    pkru |= (1 << (pkey * 2));      /* Set access-disable bit */
    pkru |= (1 << (pkey * 2 + 1));  /* Set write-disable bit */
    wrpkru(pkru);
}

/* Re-enable access to pages with key 'pkey' */
void enable_pkey_access(int pkey) {
    unsigned int pkru = rdpkru();
    pkru &= ~(3 << (pkey * 2));     /* Clear both bits */
    wrpkru(pkru);
}

/* Example: Protecting sensitive data */
int main() {
    /* Allocate memory and a protection key */
    void *secret = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int pkey = pkey_alloc(0, 0);

    /* Associate memory with the key */
    pkey_mprotect(secret, 4096, PROT_READ | PROT_WRITE, pkey);

    /* Store secret data */
    strcpy(secret, "Super secret password");

    /* Disable access - no system call needed! */
    disable_pkey_access(pkey);

    /* Any access to 'secret' now faults immediately */
    /* printf("%s\n", (char *)secret); // Would cause SIGSEGV! */

    /* Re-enable when needed */
    enable_pkey_access(pkey);
    printf("%s\n", (char *)secret); /* Now works */

    return 0;
}
```

Use Cases for Protection Keys:

- Guarding in-memory secrets (keys, passwords) so stray pointer bugs fault instead of leaking them, as in the example above
- Isolating less-trusted components (parsers, plugins) within a single process
- Cheaply toggling write access to JIT code buffers without mprotect calls
Performance Advantage:
Changing PKRU is ~20-30 cycles. Changing page table entries requires TLB flushes, typically thousands of cycles. For frequently-switched protection (like enabling/disabling JIT write access), this is a huge win.
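To see the page-table side of that cost, here is a rough user-space sketch that toggles a page's writability with mprotect; a PKRU toggle would replace both calls with two register writes and no TLB traffic. Absolute numbers vary widely by machine:

```c
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
    void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;

    struct timespec t0, t1;
    enum { N = 100000 };

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        /* Each call updates the PTE and invalidates TLB entries */
        mprotect(page, 4096, PROT_READ);
        mprotect(page, 4096, PROT_READ | PROT_WRITE);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("mprotect toggle: %.0f ns per round trip\n", ns / N);
    return 0;
}
```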
Protection keys only restrict user-mode access. Kernel can still access any page. Also, WRPKRU is a user-mode instruction—malicious code in the same process can modify PKRU. PKU is for fault isolation and defense-in-depth, not for protecting against in-process attackers who achieve code execution.
When an access violates the protection bits, the CPU generates a protection fault (formally still #PF on x86, but with a different error code). Understanding fault handling is crucial for both OS implementation and security.

x86 Page Fault Error Code:

The error code pushed onto the stack contains:
| Bit | Name | Meaning when set |
|---|---|---|
| 0 | P | Fault was on present page (protection, not absent) |
| 1 | W/R | Fault was a write (vs read) |
| 2 | U/S | Fault in user mode (vs kernel) |
| 3 | RSVD | Reserved bit violation |
| 4 | I/D | Fault was instruction fetch (vs data) |
| 5 | PK | Protection key violation |
| 6 | SS | Shadow stack violation (CET) |
```c
/* Simplified protection fault handling (Linux-like) */

void do_page_fault(struct pt_regs *regs, unsigned long error_code,
                   unsigned long fault_address)
{
    struct vm_area_struct *vma;
    int fault_flags = 0;

    /* Look up the VMA covering the faulting address */
    vma = find_vma(current->mm, fault_address);
    if (!vma)
        goto bad_area;

    /* Was it a protection fault (P=1) or not-present (P=0)? */
    if (error_code & X86_PF_PROT) {
        /* Protection violation on a present page */

        if (error_code & X86_PF_WRITE) {
            /* Write to read-only page - check if the VMA allows write */
            if (!(vma->vm_flags & VM_WRITE))
                goto bad_area;  /* VMA is read-only, SIGSEGV */

            /* VMA allows write - might be COW */
            fault_flags |= FAULT_FLAG_WRITE;
            return handle_mm_fault(vma, fault_address, fault_flags);
            /* handle_mm_fault will do COW if needed */
        }

        if (error_code & X86_PF_INSTR) {
            /* Attempt to execute NX page */
            goto bad_area;  /* Always SIGSEGV - can't fix this */
        }

        if (error_code & X86_PF_USER) {
            /* User access to supervisor page */
            goto bad_area;  /* Always SIGSEGV */
        }

        if (error_code & X86_PF_PK) {
            /* Protection key violation */
            goto bad_area;  /* Send SIGSEGV with SEGV_PKUERR */
        }
    } else {
        /* Not present - normal demand paging fault */
        return handle_mm_fault(vma, fault_address, fault_flags);
    }

bad_area:
    if (error_code & X86_PF_USER) {
        /* User-mode fault - send signal */
        struct siginfo info = {
            .si_signo = SIGSEGV,
            .si_code  = (error_code & X86_PF_PK) ? SEGV_PKUERR : SEGV_ACCERR,
            .si_addr  = (void *)fault_address,
        };
        force_sig_info(SIGSEGV, &info, current);
    } else {
        /* Kernel-mode fault - oops! */
        kernel_oops("BUG: kernel protection fault", regs);
    }
}
```

Key Distinctions:
Present vs Protection Fault:

- P=0 in the error code: the page was not present; usually a normal demand-paging or swap fault that the handler fixes silently
- P=1: the page was present but the access violated its protection bits; either a deliberate trap (like COW) or a genuine bug or attack

Legitimate vs Illegitimate Protection Faults:

- Legitimate: a write to a COW page whose VMA permits writing; the handler copies the page, sets R/W=1, and resumes the instruction
- Illegitimate: executing an NX page, user access to a supervisor page, or a write the VMA does not permit; the handler delivers SIGSEGV

User vs Kernel:

- A user-mode fault that cannot be fixed up results in a signal (typically SIGSEGV) to the process
- An unexpected kernel-mode protection fault indicates a kernel bug; Linux reports an oops and may kill the offending task
When debugging a SIGSEGV, check the fault address and error code. 'dmesg' on Linux shows fault details. si_code in the signal tells you: SEGV_MAPERR (no mapping), SEGV_ACCERR (permission denied), SEGV_PKUERR (protection key). These distinguish 'bad pointer' from 'valid pointer, wrong permissions'.
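A small demonstration, assuming Linux, that writes to the process's own read-only text page to trigger SEGV_ACCERR (printf in a signal handler is not strictly async-signal-safe, but fine for a demo):

```c
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    const char *why = (info->si_code == SEGV_MAPERR) ? "no mapping"
                    : (info->si_code == SEGV_ACCERR) ? "permission denied"
                    : "other";
    printf("SIGSEGV at %p: %s\n", info->si_addr, why);
    _exit(1);
}

int main(void)
{
    struct sigaction sa = { .sa_sigaction = on_segv, .sa_flags = SA_SIGINFO };
    sigaction(SIGSEGV, &sa, NULL);

    char *text = (char *)(void *)main; /* .text is mapped R-X */
    *text = 0;                         /* write to read-only page -> SEGV_ACCERR */
    return 0;
}
```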
Protection bits are fundamental to system security, but they're not infallible. Modern attacks and mitigations reveal both the power and limitations of page-level protection.
Attack Vectors:

- Code injection: writing shellcode into writable memory and jumping to it (blocked by NX and W^X)
- ROP/JOP: chaining existing executable gadgets instead of injecting code, bypassing NX
- ret2usr: redirecting kernel control flow into attacker-controlled user pages (blocked by SMEP)
- Kernel dereference of user pointers: tricking the kernel into reading attacker-crafted user data (blocked by SMAP)
- Write-protect bypass: clearing CR0.WP so kernel-mode code can write nominally read-only pages, as older kernel exploits did
Modern Mitigations:
ASLR (Address Space Layout Randomization): Randomize where code/data is loaded. Even if attacker knows address, it's unpredictable at runtime.
KPTI (Kernel Page Table Isolation): Use separate page tables for user/kernel. Kernel pages aren't even in user page tables, not just marked U/S=0.
KASLR (Kernel ASLR): Randomize kernel location. Even with KPTI bypass, attackers don't know where kernel is.
CFI (Control-Flow Integrity): Validate indirect jumps go to expected targets. Mitigates ROP/JOP.
Shadow Stacks (CET): Separate stack for return addresses. Can't overwrite with buffer overflow.
MTE (Memory Tagging): ARM feature that tags pointers and memory. Mismatch causes fault. Catches use-after-free, overflow.
No single protection mechanism is sufficient. Modern secure systems layer multiple defenses: NX prevents code injection, ASLR prevents gadget finding, CFI prevents control-flow hijacking, and KPTI prevents kernel information leaks. Each adds cost; the security/performance tradeoff is continuously evolving.
Protection bits form the hardware-enforced access control layer for virtual memory. Let's consolidate the key insights:

- The R/W, U/S, and NX bits in each PTE encode write, privilege, and execute permissions, checked by the MMU on every access
- Permissions combine across page table levels; the most restrictive setting wins, and the TLB caches the result
- Protection faults (P=1) are distinct from not-present faults (P=0), which is what makes COW and precise SIGSEGV reporting possible
- SMEP/SMAP, protection keys, and the W^X policy layer further restrictions on top of the basic model
What's Next:
We've now covered PTE structure, the valid bit, and protection bits. The final piece is where page tables themselves live: how the hardware finds them, how they're laid out in memory, and how the kernel/user address space split is arranged.
You now understand how protection bits enforce memory access control: the R/W, U/S, and NX bits that protect processes from each other and the kernel from user code. This knowledge is essential for system security, for understanding how exploits work, and for OS implementation.