Protection Domains - Learning Module


Domain Implementation

From Theory to Reality

We've established the theoretical foundations: protection domains define what code can do; domain switching enables controlled privilege transitions; protection rings provide hierarchical hardware enforcement; least privilege guides policy decisions.

But how do real operating systems actually implement these concepts? Where is the access matrix stored? How does the kernel track which domain a process belongs to? What happens when a process requests access to a resource?

This page bridges theory and practice, examining the concrete data structures, algorithms, and design patterns that transform protection domain concepts into running systems. We'll see how the elegance of theoretical models meets the messy reality of performance constraints, legacy compatibility, and engineering trade-offs.

What You Will Learn

By the end of this page, you will understand how operating systems implement protection domains through process credentials, file permissions, capability systems, and mandatory access controls. You'll see the kernel data structures, the access checking algorithms, and how different approaches trade off flexibility, performance, and security.

Implementing the Access Matrix

Recall that the access matrix is a conceptual model with rows (domains/subjects) and columns (objects). Real systems cannot store this matrix directly—it would be enormous and mostly empty. Instead, they use two complementary approaches:

Approach 1: Access Control Lists (ACLs)

Store the matrix by columns. Each object has a list of (subject, rights) pairs:

File /etc/passwd:
  ACL: [(root, RW), (shadow, R), (everyone, R)]

File /etc/shadow:
  ACL: [(root, RW), (shadow, R)]

Approach 2: Capability Lists

Store the matrix by rows. Each subject holds a list of (object, rights) tokens:

Process 1234:
  Capabilities: [(/etc/passwd, R), (/home/user/file, RW), (socket:80, RW)]

ACLs vs. Capabilities Comparison
Aspect	Access Control Lists	Capabilities
Storage location	With the object	With the subject
Question answered	Who can access this object?	What can this subject access?
Revocation	Easy (modify object's ACL)	Hard (must find all copies)
Delegation	Hard (need admin to modify ACL)	Easy (pass capability token)
Audit: object access	Easy (list is right there)	Hard (must scan all subjects)
Audit: subject rights	Hard (must scan all objects)	Easy (list is right there)
Example systems	Unix permissions, NTFS ACLs	Capsicum, seL4, KeyKOS
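The two storage strategies can be sketched in a few lines of C. This is only an illustration of "store by column" versus "store by row", not the layout any kernel uses; the type names (acl_entry, cap_entry), the rights bits, and the sample data are assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { RIGHT_R = 1, RIGHT_W = 2 };   /* hypothetical rights bits */

/* Column storage: one entry of an object's ACL. */
struct acl_entry { const char *subject; unsigned rights; };

/* Row storage: one entry of a subject's capability list. */
struct cap_entry { const char *object; unsigned rights; };

/* ACL check: search the object's list for the subject. */
static int acl_allows(const struct acl_entry *acl, size_t n,
                      const char *subject, unsigned want)
{
    assert(acl != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (strcmp(acl[i].subject, subject) == 0)
            return (acl[i].rights & want) == want;
    return 0;   /* no entry: an empty matrix cell means no access */
}

/* Capability check: search the subject's list for the object. */
static int caps_allow(const struct cap_entry *caps, size_t n,
                      const char *object, unsigned want)
{
    assert(caps != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (strcmp(caps[i].object, object) == 0)
            return (caps[i].rights & want) == want;
    return 0;
}

/* Sample data mirroring the listings above. */
static const struct acl_entry shadow_acl[] = {
    { "root", RIGHT_R | RIGHT_W }, { "shadow", RIGHT_R },
};
static const struct cap_entry proc_1234_caps[] = {
    { "/etc/passwd", RIGHT_R }, { "/home/user/file", RIGHT_R | RIGHT_W },
};
```

Both functions answer the same question about one matrix cell; what differs is which list has to be found first, and that is what drives the revocation and audit trade-offs in the table.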

Hybrid Approaches:

Most real systems combine aspects of both:

Unix: Primarily ACL-based (permissions attached to files), but file descriptors act as capabilities (once opened, the descriptor grants access regardless of later permission changes)
Windows: NTFS uses rich ACLs, but access tokens (held by processes) are capability-like
Linux capabilities: Kernel capabilities are per-process (capability-style), while file permissions are ACL-style
macOS (securityd/XPC): A central security daemon tracks entitlements (ACL-style policy), while processes hold capability-like references
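The capability-like behaviour of Unix file descriptors is easy to observe from user space. The sketch below assumes a POSIX system with a writable /tmp; the function name fd_survives_chmod and the temp-file template are made up for the example. It opens a file, strips every permission bit, and then reads through the descriptor that was obtained earlier.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if a descriptor opened before chmod(path, 0) can still read the
 * file, 0 if the read fails, and -1 if the setup itself fails.
 */
static int fd_survives_chmod(void)
{
    char path[] = "/tmp/fdcap-XXXXXX";   /* hypothetical temp-file template */
    char buf[5] = {0};
    int result = -1;

    int fd = mkstemp(path);              /* creates the file and opens it read/write */
    if (fd < 0)
        return -1;
    if (write(fd, "data", 4) != 4)
        goto out;

    /* Remove every permission bit. A fresh open() by a non-root user would now fail. */
    if (chmod(path, 0) != 0)
        goto out;

    /* The permission check happened at open() time, so the descriptor still works. */
    if (lseek(fd, 0, SEEK_SET) != 0)
        goto out;
    result = (read(fd, buf, 4) == 4 && buf[0] == 'd') ? 1 : 0;

out:
    close(fd);
    unlink(path);
    return result;
}
```

This is also why revocation is harder on the capability side: changing the file's mode does not reach descriptors that are already held.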

Unix Domain Implementation

Unix pioneered a simple yet effective domain implementation that has influenced all subsequent systems. Let's examine the core data structures:

Process Credentials (struct cred in Linux):

Every process has a credentials structure defining its domain:

struct cred {
    atomic_t    usage;          // Reference count
    
    // User identity
    kuid_t      uid;            // Real user ID
    kgid_t      gid;            // Real group ID
    kuid_t      suid;           // Saved user ID  
    kgid_t      sgid;           // Saved group ID
    kuid_t      euid;           // Effective user ID (THIS DEFINES THE DOMAIN)
    kgid_t      egid;           // Effective group ID
    kuid_t      fsuid;          // Filesystem user ID
    kgid_t      fsgid;          // Filesystem group ID
    
    // Supplementary groups
    struct group_info *group_info;
    
    // Capabilities (since Linux 2.2)
    kernel_cap_t cap_inheritable; // Inherited across exec
    kernel_cap_t cap_permitted;   // Maximum capabilities
    kernel_cap_t cap_effective;   // Currently active capabilities
    kernel_cap_t cap_bset;        // Capability bounding set
    kernel_cap_t cap_ambient;     // Ambient capabilities
    
    // Security module labels (SELinux, AppArmor)
    void        *security;
    
    // Namespace pointers
    struct user_namespace *user_ns;
    // ...
};

The Domain is the Effective UID:

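In basic Unix, the effective UID is the primary domain identifier. Access checks compare it against file ownership. (Linux performs filesystem checks with fsuid, which normally tracks euid; the simplified check here uses euid.)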

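// Simplified file access check.
// Only the first matching class (owner, then group, then other) is consulted:
// an owner whose owner bits deny access is NOT rescued by group or other bits.
// mask is a combination of the rwx bits: read (4), write (2), execute (1).
int may_access(struct inode *inode, int mask) {
    const struct cred *cred = current_cred();
    unsigned int bits;

    if (uid_eq(cred->euid, inode->i_uid))
        bits = inode->i_mode >> 6;      // Owner permission bits
    else if (in_group_p(inode->i_gid))
        bits = inode->i_mode >> 3;      // Group permission bits
    else
        bits = inode->i_mode;           // Other permission bits

    // Every requested bit must be present in the selected triad
    if ((bits & 7 & mask) == mask)
        return 0;  // Permitted

    // Check for capability override
    // (the real kernel applies extra rules here, e.g. for execute permission)
    if (capable(CAP_DAC_OVERRIDE))
        return 0;  // Root-like capability grants access

    return -EACCES;  // Denied
}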

Key Implementation Points:

Credentials are copy-on-write: Changing UID creates a new cred structure
Current cred accessed via current_cred(): Per-task lookup into task_struct
Atomic transitions: UID changes are atomic to prevent races
Immutable once set: Active creds aren't modified, only replaced

Credentials Are RCU-Protected

Linux uses RCU (Read-Copy-Update) to protect credential access. Readers don't need locks; writers create new cred structures and atomically publish them. This enables lock-free access checks on the fast path.
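The copy-and-replace discipline can be imitated in user space with C11 atomics. This is a sketch under simplifying assumptions, not the kernel's code: mini_cred is a hypothetical two-field stand-in for struct cred, there is a single writer, and the caller frees the old structure by hand where the kernel would wait for an RCU grace period.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-in for struct cred. */
struct mini_cred {
    unsigned uid;
    unsigned euid;
};

/* Published pointer: readers load it, the writer swaps in a fresh copy. */
static _Atomic(struct mini_cred *) task_cred;

/* Install the first credential set. Returns 0 on success, -1 on failure. */
static int cred_init(unsigned uid)
{
    struct mini_cred *c = malloc(sizeof(*c));
    if (!c)
        return -1;
    c->uid = uid;
    c->euid = uid;
    atomic_store_explicit(&task_cred, c, memory_order_release);
    return 0;
}

/* Reader side: a single atomic load, no lock taken. */
static const struct mini_cred *read_cred(void)
{
    return atomic_load_explicit(&task_cred, memory_order_acquire);
}

/*
 * Writer side: copy the current credentials, change the copy, publish it.
 * The old structure is never modified; it is handed back through old_out so
 * the caller can free it once no reader can still be using it.
 */
static int commit_euid(unsigned new_euid, struct mini_cred **old_out)
{
    struct mini_cred *old = atomic_load_explicit(&task_cred, memory_order_acquire);
    struct mini_cred *next = malloc(sizeof(*next));

    assert(old != NULL);          /* cred_init() must have run first */
    if (!next)
        return -1;
    *next = *old;                 /* copy */
    next->euid = new_euid;        /* update the private copy */
    atomic_store_explicit(&task_cred, next, memory_order_release);  /* publish */
    *old_out = old;
    return 0;
}
```

A reader that loaded the pointer before the swap keeps seeing a complete, consistent old credential set, which is the property that makes lock-free access checks safe.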

File Permission Implementation

Files store their access control information (the "ACL column") in the inode:

The inode Structure (simplified):

struct inode {
    umode_t         i_mode;     // File type and permission bits
    kuid_t          i_uid;      // Owner user ID
    kgid_t          i_gid;      // Owner group ID
    
    // Extended ACL (if present)
    struct posix_acl *i_acl;    // Access ACL
    struct posix_acl *i_default_acl;  // Default ACL for directories
    
    // Extended attributes (for security labels)
    // Accessed via xattr interface
    
    // ... many other fields
};

Permission Bit Layout:

i_mode (16 bits):
┌────────┬─────────┬────────┬────────┬────────┐
│ Type   │ Special │ Owner  │ Group  │ Other  │
│ (4b)   │ (3b)    │ (3b)   │ (3b)   │ (3b)   │
└────────┴─────────┴────────┴────────┴────────┘

Type: Regular file, directory, symlink, device, etc.
Special bits: setuid, setgid, sticky
Permission triads: rwx for owner, group, other

Example: 0100755 (regular file, rwxr-xr-x)
         010 0 7 5 5
         │   │ │ │ └── Other: r-x (5)
         │   │ │ └──── Group: r-x (5)
         │   │ └────── Owner: rwx (7)
         │   └──────── Special bits: none (0)
         └──────────── Type: regular file (0100000)

Extended ACLs (POSIX ACLs):

The basic 9-bit model is too limited for complex sharing. POSIX ACLs extend this:

# View extended ACL
$ getfacl myfile
# file: myfile
# owner: alice
# group: staff
user::rwx             # Owner permissions
user:bob:r-x          # Specific user (additional ACL entry)
group::r--            # Owning group
group:admin:rwx       # Specific group (additional ACL entry)
mask::r-x             # Maximum permissions for named entries
other::---

# Set extended ACL
$ setfacl -m u:bob:rx myfile
$ setfacl -m g:admin:rwx myfile

ACL Evaluation Algorithm:

If process EUID = file owner UID → use owner entry (never masked)
Else if EUID matches any named user entry → use that entry (masked)
Else if any of the process's groups (EGID or supplementary) matches the owning group or a named group entry → grant access if at least one matching entry, after masking, contains the requested permissions; otherwise deny
Else → use other entry

acl_check.c
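// Simplified POSIX ACL permission check
// (acl_mask() is a stand-in helper that returns the ACL_MASK entry's bits)
int posix_acl_permission(struct inode *inode, int want) {
    struct posix_acl *acl = inode->i_acl;
    const struct cred *cred = current_cred();
    bool group_matched = false;
    int perm = 0;

    if (!acl)
        return generic_permission(inode, want);

    // Walk ACL entries (stored in order: owner, named users,
    // owning group, named groups, mask, other)
    for (int i = 0; i < acl->a_count; i++) {
        struct posix_acl_entry *entry = &acl->a_entries[i];

        switch (entry->e_tag) {
        case ACL_USER_OBJ:
            if (uid_eq(cred->euid, inode->i_uid)) {
                perm = entry->e_perm;
                goto check;          // Owner entry is never masked
            }
            break;

        case ACL_USER:
            if (uid_eq(cred->euid, entry->e_uid)) {
                perm = entry->e_perm;
                goto apply_mask;
            }
            break;

        case ACL_GROUP_OBJ:
        case ACL_GROUP: {
            kgid_t gid = (entry->e_tag == ACL_GROUP_OBJ)
                             ? inode->i_gid : entry->e_gid;

            // Any matching group entry that grants the request wins
            if (in_group_p(gid)) {
                group_matched = true;
                if ((entry->e_perm & want) == want) {
                    perm = entry->e_perm;
                    goto apply_mask;
                }
            }
            break;
        }

        case ACL_OTHER:
            if (group_matched)
                return -EACCES;      // A group matched, but none granted access
            perm = entry->e_perm;
            goto check;              // Other entry is never masked
        }
    }
    return -EIO;                     // Malformed ACL: no ACL_OTHER entry

apply_mask:
    // Named users and all group entries are limited by the mask entry
    perm &= acl_mask(acl);

check:
    if ((perm & want) == want)
        return 0;
    return -EACCES;
}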

Linux Capabilities Implementation

Linux capabilities divide root's monolithic privilege into ~40 distinct capabilities. Each capability enables specific privileged operations:

Capability Sets:

Each process has five capability sets (64-bit bitmasks):

Set	Purpose	Rules
Permitted (P)	Maximum capabilities process can use	Bounds effective and inheritable
Effective (E)	Currently active capabilities for access checks	Must be subset of permitted
Inheritable (I)	Preserved across exec (if file also has it)	For capability inheritance
Bounding (B)	Upper limit on what can be gained	Reduced by dropping, never raised
Ambient (A)	Automatically granted to non-capability-aware programs	For legacy compatibility

Capability Check in Kernel:

// Check if current process has a capability (simplified)
bool capable(int cap) {
    // Checked against the initial user namespace, so "root" inside a
    // container's user namespace does not pass this test
    return ns_capable(&init_user_ns, cap);
}

bool ns_capable(struct user_namespace *ns, int cap) {
    const struct cred *cred = current_cred();

    if (!cap_valid(cap))
        return false;   // (the real kernel treats this as a bug)

    // The default check tests cred->cap_effective relative to ns
    return security_capable(cred, ns, cap, CAP_OPT_NONE) == 0;
}

// LSM hook allows SELinux/AppArmor to deny even if cap is present
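From user space, the same bitmask view is visible in the CapEff line of /proc/<pid>/status. The sketch below parses such a line and tests individual bits. The function names are made up for the example; the capability numbers are the ones defined in <linux/capability.h>.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>

/* Capability numbers as defined in <linux/capability.h>. */
enum { CAP_NET_BIND_SERVICE_NR = 10, CAP_NET_ADMIN_NR = 12, CAP_NET_RAW_NR = 13 };

/* Parse a "CapEff:\t<hex digits>" line as found in /proc/<pid>/status. */
static bool parse_cap_eff(const char *line, uint64_t *set)
{
    return sscanf(line, "CapEff: %" SCNx64, set) == 1;
}

/* The core of a capability check: is bit `cap` set in the mask? */
static bool cap_is_raised(uint64_t set, int cap)
{
    assert(cap >= 0 && cap < 64);
    return (set >> cap) & 1u;
}
```

A mask of 0000000000003400, for instance, has bits 10, 12, and 13 set: CAP_NET_BIND_SERVICE, CAP_NET_ADMIN, and CAP_NET_RAW.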

File Capabilities:

Executables can have capability sets that affect processes running them:

# Set file capabilities
setcap 'cap_net_bind_service=ep' /usr/bin/myserver

# View file capabilities
getcap /usr/bin/myserver
/usr/bin/myserver cap_net_bind_service=ep

# Components: capability=sets
# e = effective (capability is raised on exec)
# p = permitted (capability is in permitted set after exec)
# i = inheritable (combined with process inheritable)

Capability Transformation on Exec:

When a process calls execve(), capabilities are transformed:

Let P = process's pre-exec caps, P' = post-exec caps
Let F = file's caps on the executable

P'(ambient)     = (file is privileged) ? 0 : P(ambient)

P'(permitted)   = (P(inheritable) & F(inheritable)) | 
                  (F(permitted) & P(bounding)) | P'(ambient)
                  
P'(effective)   = F(effective) ? P'(permitted) : P'(ambient)

P'(inheritable) = P(inheritable)

P'(bounding)    = P(bounding)

A file is privileged if it has file capabilities attached or its setuid/setgid bit is set.

This complex formula prevents capability escalation while enabling legitimate inheritance.
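The transformation is plain bit arithmetic, so it can be written down directly. The sketch below follows the rules documented in capabilities(7), where the ambient set is computed first and feeds into the permitted and effective sets; it ignores the special handling of UID 0. The struct and function names are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct proc_caps {
    uint64_t permitted, effective, inheritable, bounding, ambient;
};

struct file_caps {
    uint64_t permitted, inheritable;
    bool effective;              /* the file "effective" flag is a single bit */
};

/*
 * Compute the capability sets after execve().
 * privileged_file: the file has capabilities attached or is setuid/setgid.
 */
static struct proc_caps caps_after_exec(struct proc_caps p, struct file_caps f,
                                        bool privileged_file)
{
    struct proc_caps n;

    n.ambient     = privileged_file ? 0 : p.ambient;
    n.permitted   = (p.inheritable & f.inheritable) |
                    (f.permitted & p.bounding) | n.ambient;
    n.effective   = f.effective ? n.permitted : n.ambient;
    n.inheritable = p.inheritable;   /* unchanged */
    n.bounding    = p.bounding;      /* unchanged */
    return n;
}
```

With an empty process and a file marked cap_net_bind_service=ep, the result has bit 10 in both the permitted and effective sets, unless the bounding set has dropped that bit, in which case the file's grant is masked out.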

Common Linux Capabilities
Capability	Allows	Typical Use
CAP_CHOWN	Change file owner	Archival utilities
CAP_DAC_OVERRIDE	Bypass read/write/execute checks	Backup programs
CAP_KILL	Send signals to any process	Process managers
CAP_NET_ADMIN	Network configuration	Network utilities
CAP_NET_BIND_SERVICE	Bind to ports < 1024	Web servers
CAP_NET_RAW	Use raw sockets	ping, tcpdump
CAP_SETUID	Set UID arbitrarily	Login utilities
CAP_SYS_ADMIN	Various admin ops (too broad!)	Everything needing root
CAP_SYS_PTRACE	Trace/debug any process	Debuggers, strace
CAP_SYS_TIME	Set system clock	NTP daemon

Pure Capability-Based Systems

Linux capabilities, despite the name, are privilege bits layered on top of the traditional Unix permission model. Some systems instead use capabilities, in the object-capability sense, as the primary access control mechanism. In these pure capability systems, access rights are held as unforgeable tokens.

Properties of True Capabilities:

Unforgeable: Cannot be created from nothing; only granted by someone who has them
Transferable: Can be passed to other processes
Specific: Each capability grants rights to one specific object
Fine-grained: Different capabilities for different rights (read vs write)
No ambient authority: Processes have no rights except through capabilities they hold

Capsicum (FreeBSD/Userspace Capability Mode):

Capsicum brings capability semantics to Unix file descriptors:

#include <sys/capsicum.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Open file before entering capability mode
    int fd = open("/etc/passwd", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    
    // Limit what can be done with this descriptor
    cap_rights_t rights;
    cap_rights_init(&rights, CAP_READ, CAP_SEEK);
    if (cap_rights_limit(fd, &rights) < 0) {
        perror("cap_rights_limit");
        return 1;
    }
    // fd can now only read and seek, not write or ioctl
    
    // Enter capability mode - irrevocable
    if (cap_enter() < 0) {
        perror("cap_enter");
        return 1;
    }
    
    // Now in sandbox:
    // - Cannot open new files (no global namespace access)
    // - Can only use pre-opened file descriptors
    // - Each fd limited to its granted rights
    
    // This works (we have fd with CAP_READ):
    char buf[100];
    if (read(fd, buf, sizeof(buf)) < 0) {
        perror("read");
        return 1;
    }
    
    // This fails (cannot open new resources):
    // open("/etc/shadow", O_RDONLY);  // Returns -1, errno = ECAPMODE
    
    return 0;
}

seL4: Formal Verification of Capabilities:

seL4 is a microkernel with a formally verified capability system:

┌─────────────────────────────────────────────────┐
│                     seL4                         │
│                                                 │
│  ┌───────────────────────────────────────────┐  │
│  │            Capability Space               │  │
│  │  [CNode] ─────┬───── CSlot 0: EndpointCap │  │
│  │               ├───── CSlot 1: PageCap     │  │
│  │               ├───── CSlot 2: TCBCap      │  │
│  │               └───── CSlot 3: (empty)     │  │
│  └───────────────────────────────────────────┘  │
│                                                 │
│  Capabilities are the ONLY way to access       │
│  kernel objects. No global namespace exists.    │
│                                                 │
│  Mathematical proof that information flow       │
│  respects capability boundaries.                │
└─────────────────────────────────────────────────┘

Operations on seL4 Capabilities:

Invoke: Use the capability (send to endpoint, map page, etc.)
Copy: Duplicate capability to another CSlot
Mint: Create reduced capability (fewer rights)
Delete: Remove capability from CSlot
Revoke: Remove all derived capabilities
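The way these operations interact can be modelled with a small table of slots that remembers which capability each one was derived from. This is a toy model under stated assumptions, not seL4's API or data structures: the slot layout, the rights bits, and every function name are invented for the sketch, and re-attaching children to the parent on delete is simply the choice made here.

```c
#include <assert.h>
#include <stdbool.h>

#define NSLOTS 8
enum { R_READ = 1, R_WRITE = 2 };     /* hypothetical rights bits */

struct cslot {
    bool used;
    int object;        /* id of the object the capability names */
    unsigned rights;
    int parent;        /* slot this capability was derived from, -1 if original */
};

static struct cslot cspace[NSLOTS];

static bool slot_ok(int s) { return s >= 0 && s < NSLOTS; }

/* Place an original capability (stands in for object creation). */
static int cap_create(int dst, int object, unsigned rights)
{
    if (!slot_ok(dst) || cspace[dst].used)
        return -1;
    cspace[dst] = (struct cslot){ true, object, rights, -1 };
    return 0;
}

/* Mint: derive a capability whose rights are a subset of the source's.
 * Copy is the special case where keep covers all rights. */
static int cap_mint(int src, int dst, unsigned keep)
{
    if (!slot_ok(src) || !slot_ok(dst) || !cspace[src].used || cspace[dst].used)
        return -1;
    cspace[dst] = (struct cslot){ true, cspace[src].object,
                                  cspace[src].rights & keep, src };
    return 0;
}

/* Delete: empty one slot; its children are re-attached to its parent. */
static void cap_delete(int slot)
{
    if (!slot_ok(slot) || !cspace[slot].used)
        return;
    for (int i = 0; i < NSLOTS; i++)
        if (cspace[i].used && cspace[i].parent == slot)
            cspace[i].parent = cspace[slot].parent;
    cspace[slot].used = false;
}

/* Revoke: remove everything derived from slot, keeping slot itself. */
static void cap_revoke(int slot)
{
    if (!slot_ok(slot) || !cspace[slot].used)
        return;
    for (int i = 0; i < NSLOTS; i++) {
        if (cspace[i].used && cspace[i].parent == slot) {
            cap_revoke(i);               /* remove grandchildren first */
            cspace[i].used = false;
        }
    }
}
```

Because minting can only AND rights away, a derived capability never holds more than its source, and tracking the derivation is what makes revocation tractable in a capability system.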

Capabilities Solve Confused Deputy

The "confused deputy" problem occurs when a privileged program is tricked into misusing its authority. Capabilities prevent this: the deputy only has capabilities explicitly passed to it for the current task, not ambient authority to access anything it could theoretically access.

Mandatory Access Control Implementation

Discretionary Access Control (DAC) allows resource owners to control access. Mandatory Access Control (MAC) enforces a system-wide policy that resource owners cannot override; even processes running as root are bound by it unless the policy itself exempts them.

SELinux Implementation:

SELinux implements Type Enforcement (TE), a form of MAC:

Security Contexts: Every process and object has a security context:

user:role:type:level

Examples:
system_u:system_r:httpd_t:s0        # Apache process
system_u:object_r:httpd_config_t:s0 # Apache config file
unconfined_u:unconfined_r:unconfined_t:s0  # Unconfined login shell (few restrictions)

Policy Rules:

# Allow httpd_t to read httpd_config_t files
allow httpd_t httpd_config_t:file { read open getattr };

# Allow httpd_t to bind to http ports
allow httpd_t http_port_t:tcp_socket { name_bind };

# Deny everything not explicitly allowed

Implementation in Kernel:

// Every task has a security blob (simplified)
struct task_security_struct {
    u32 osid;           // SID prior to the last exec
    u32 sid;            // Current SID
    u32 exec_sid;       // SID to apply at the next exec
    u32 create_sid;     // SID for newly created filesystem objects
    // ...
};

// Every inode has security blob
struct inode_security_struct {
    u32 sid;            // Security ID
    u32 sclass;         // Object class (file, dir, socket...)
    // ...
};

LSM (Linux Security Modules) Framework:

SELinux, AppArmor, and other MAC systems plug into the kernel via LSM:

// LSM hooks throughout kernel (simplified: the real hook is called from
// the VFS open path, and vfs_open() has a different signature)
int vfs_open(struct file *file) {
    int error = security_file_open(file);  // LSM hook
    if (error)
        return error;
    // ... actual open logic
}

// Hook implementation (SELinux example, simplified)
int selinux_file_open(struct file *file) {
    struct inode *inode = file_inode(file);
    struct inode_security_struct *isec = inode_security(inode);
    u32 sid = current_sid();           // Process security ID
    u32 av;                            // Access vector requested
    
    // Determine required permissions from file mode
    av = file_to_av(file);
    
    // Check policy: is (sid, isec->sid, class, av) allowed?
    // (the real avc_has_perm() also takes an audit-data argument)
    return avc_has_perm(sid, isec->sid, isec->sclass, av);
}

Access Vector Cache (AVC):

SELinux uses a cache to avoid constant policy lookups:

AVC Entry: (source_sid, target_sid, class) → allowed_vector, denied_vector

On access:
1. Compute cache key from (current_sid, target_sid, class)
2. Look up in hash table
3. If hit: check allowed bits → permit or deny
4. If miss: consult full policy → cache result → permit or deny
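The four steps above can be sketched as a small direct-mapped cache in front of a policy function. Everything here is an assumption made for illustration: the SID and class numbers, the permission bits, the hash, and the single hard-coded rule (modelled on the httpd_t example) are not SELinux's real values or data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { PERM_READ = 1, PERM_WRITE = 2, PERM_OPEN = 4, PERM_GETATTR = 8 };
enum { SID_HTTPD = 1, SID_HTTPD_CONFIG = 2, CLASS_FILE = 1 };

#define AVC_SLOTS 64

struct avc_entry {
    bool valid;
    uint32_t ssid, tsid;
    uint16_t tclass;
    uint32_t allowed;            /* cached allowed vector */
};

static struct avc_entry avc[AVC_SLOTS];
static unsigned avc_misses;      /* counts full policy lookups */

/* Stand-in for the full policy: one allow rule, everything else denied. */
static uint32_t policy_lookup(uint32_t ssid, uint32_t tsid, uint16_t tclass)
{
    if (ssid == SID_HTTPD && tsid == SID_HTTPD_CONFIG && tclass == CLASS_FILE)
        return PERM_READ | PERM_OPEN | PERM_GETATTR;
    return 0;
}

static unsigned avc_hash(uint32_t ssid, uint32_t tsid, uint16_t tclass)
{
    return (ssid ^ (tsid << 2) ^ ((uint32_t)tclass << 4)) % AVC_SLOTS;
}

/* Returns 0 if every requested permission is allowed, -1 otherwise. */
static int avc_check(uint32_t ssid, uint32_t tsid, uint16_t tclass,
                     uint32_t requested)
{
    struct avc_entry *e = &avc[avc_hash(ssid, tsid, tclass)];

    if (!e->valid || e->ssid != ssid || e->tsid != tsid || e->tclass != tclass) {
        /* Miss: consult the policy and overwrite whatever was in the slot. */
        avc_misses++;
        e->valid = true;
        e->ssid = ssid;
        e->tsid = tsid;
        e->tclass = tclass;
        e->allowed = policy_lookup(ssid, tsid, tclass);
    }
    return ((e->allowed & requested) == requested) ? 0 : -1;
}
```

Note that the cache stores the whole allowed vector for a (source, target, class) triple, so a later request for a different permission on the same triple is still a hit, and denials are cached as well as grants.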

MAC Systems Comparison
System	Policy Model	Configuration	Pros/Cons
SELinux	Type Enforcement + MLS	Policy files, compiled	Most flexible; complex
AppArmor	Path-based profiles	Profile files	Simpler; path-based has issues
Smack	Label-based	Extended attributes	Simple labels; less granular
TOMOYO	Path-based learning	Auto-generated policy	Learning mode; path-based

Namespace-Based Domain Isolation

Modern container systems use Linux namespaces to implement domain isolation. Each namespace type provides isolation for a specific resource:

The Namespace Implementation:

// Each task has namespace pointers
struct task_struct {
    // ...
    struct nsproxy *nsproxy;  // Namespace proxy
    // ...
};

struct nsproxy {
    struct uts_namespace  *uts_ns;    // Hostname/domain
    struct ipc_namespace  *ipc_ns;    // SysV IPC
    struct mnt_namespace  *mnt_ns;    // Mount points
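    struct pid_namespace  *pid_ns_for_children;  // PID namespace for new children
    struct net            *net_ns;    // Network stack
    struct time_namespace *time_ns;   // Monotonic/boot clocks
    struct cgroup_namespace *cgroup_ns; // Cgroup root
};
// The user namespace is not part of nsproxy; it is reached through
// the task's credentials (cred->user_ns)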

User Namespaces—The Game Changer:

User namespaces allow UID mapping, enabling unprivileged container creation:

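# Create a user namespace in which the caller's own UID appears as root (0)
unshare --user --map-root-user bash

# Inside: appear to be root
$ id
uid=0(root) gid=0(root)

# But actually mapped to the unprivileged host UID (1000 in this example)
$ cat /proc/self/uid_map
         0       1000          1
# Means: Namespace UID 0 → Host UID 1000, range of 1 UID

# Container runtimes usually install a whole range instead
# (via newuidmap and /etc/subuid), for example:
#          0     100000      65536
# Means: Container UID 0 → Host UID 100000, range of 65536 UIDs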
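Each uid_map line is an extent (first UID inside, first UID outside, length), and translation is a range lookup. The sketch below models that lookup for a range mapping of the form `0 100000 65536`; the struct and function names are made up for the example, and UID_UNMAPPED stands in for the (uid_t)-1 value that marks an unmapped ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One line of /proc/<pid>/uid_map. */
struct uid_extent {
    uint32_t inside;    /* first UID as seen inside the namespace */
    uint32_t outside;   /* first UID as seen from the parent namespace */
    uint32_t count;     /* number of consecutive UIDs mapped */
};

#define UID_UNMAPPED UINT32_MAX

/* Translate a namespace UID to the UID the host sees. */
static uint32_t map_uid_to_host(const struct uid_extent *map, size_t n,
                                uint32_t uid)
{
    for (size_t i = 0; i < n; i++)
        if (uid >= map[i].inside && uid - map[i].inside < map[i].count)
            return map[i].outside + (uid - map[i].inside);
    return UID_UNMAPPED;
}

/* Translate a host UID to the UID seen inside the namespace. */
static uint32_t map_uid_from_host(const struct uid_extent *map, size_t n,
                                  uint32_t uid)
{
    for (size_t i = 0; i < n; i++)
        if (uid >= map[i].outside && uid - map[i].outside < map[i].count)
            return map[i].inside + (uid - map[i].outside);
    return UID_UNMAPPED;
}
```

With the extent {0, 100000, 65536}, namespace UID 0 becomes host UID 100000 and namespace UID 1000 becomes 101000, while host UID 1000 has no counterpart inside the namespace at all.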

cgroups for Resource Domains:

Control groups limit what resources a domain can consume:

# View cgroup structure (v2)
$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope

# Create cgroup for a service
# (the memory, cpu, and io controllers must be enabled in the parent's cgroup.subtree_control)
mkdir /sys/fs/cgroup/myservice

# Set resource limits
echo "100M" > /sys/fs/cgroup/myservice/memory.max
echo "50000 100000" > /sys/fs/cgroup/myservice/cpu.max  # 50% CPU
echo "8:0 rbps=10485760" > /sys/fs/cgroup/myservice/io.max  # 10 MiB/s read limit on device 8:0

# Move process to cgroup 
echo $$ > /sys/fs/cgroup/myservice/cgroup.procs
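The cpu.max value is a pair "quota period" in microseconds, or "max period" for no limit, so the CPU share it grants is quota divided by period. A small parser makes the arithmetic concrete; the function name and the negative sentinel return values are choices made for this sketch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Convert the contents of a cgroup v2 cpu.max file into a CPU fraction.
 * Returns quota / period (0.5 means half of one CPU),
 * -1.0 when the quota is "max" (no limit), and -2.0 on a parse error.
 */
static double cpu_max_fraction(const char *text)
{
    long quota, period;

    if (sscanf(text, "max %ld", &period) == 1 && period > 0)
        return -1.0;
    if (sscanf(text, "%ld %ld", &quota, &period) == 2 && quota > 0 && period > 0)
        return (double)quota / (double)period;
    return -2.0;
}
```

For "50000 100000" the result is 0.5, matching the 50% comment above; "200000 100000" gives 2.0, meaning up to two full CPUs.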

Container Domain = Namespaces + cgroups + Seccomp + Capabilities:

┌────────────────────────────────────────────────────────┐
│                    Container                            │
│                                                        │
│  ┌────────────────────────────────────────────────┐   │
│  │                   Namespaces                    │   │
│  │  • Isolated PID tree (init = PID 1)            │   │
│  │  • Private mount tree (overlayfs root)         │   │
│  │  • Own network stack (veth pair to bridge)     │   │
│  │  • Mapped UIDs (container 0 ≠ host 0)          │   │
│  └────────────────────────────────────────────────┘   │
│                                                        │
│  ┌────────────────────────────────────────────────┐   │
│  │                   cgroups                       │   │
│  │  • Memory limit (OOM if exceeded)              │   │
│  │  • CPU quota (throttled if exceeded)           │   │
│  │  • I/O bandwidth limit                         │   │
│  │  • Process count limit                         │   │
│  └────────────────────────────────────────────────┘   │
│                                                        │
│  ┌────────────────────────────────────────────────┐   │
│  │              Security Restrictions              │   │
│  │  • Dropped capabilities (no CAP_SYS_ADMIN)     │   │
│  │  • Seccomp filter (blocked syscalls)           │   │
│  │  • Read-only root filesystem                   │   │
│  │  • SELinux/AppArmor profile                    │   │
│  └────────────────────────────────────────────────┘   │
└────────────────────────────────────────────────────────┘

Performance Considerations

Protection domain enforcement has performance implications. Every resource access requires permission checking. Design decisions balance security and performance:

Fast Path Optimizations:

1. Credential Caching:

// Accessing current credentials is very fast
static inline const struct cred *current_cred(void) {
    return rcu_dereference_protected(
        current->cred,
        1  // No locking needed for own cred
    );
}
// No locks, just dereference from current task_struct

2. Access Vector Cache (SELinux):

Policy decisions cached in hash table
Hit rate is typically very high in steady state
Misses occur mainly the first time a (source, target, class) combination is seen, or after a policy reload

3. Capability Checks:

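// Capability check is a bitfield test
// (array layout shown here is from older kernels; newer kernels keep the
// whole set in a single 64-bit value, and the test is the same idea)
static inline bool cap_raised(kernel_cap_t c, int flag) {
    return c.cap[__cap_idx(flag)] & __cap_bit(flag);
}
// Single array lookup + bitwise AND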

Illustrative Performance Overhead (actual figures vary with hardware, kernel version, policy, and workload):

Operation	Without MAC	With SELinux	Overhead
`open()` syscall	1.0x baseline	1.1-1.2x	10-20%
`stat()` syscall	1.0x baseline	1.05x	5%
Process fork	1.0x baseline	1.15x	15%
Network connect	1.0x baseline	1.1x	10%
Overall system	1.0x baseline	1.02-1.07x	2-7%

Factors Affecting Performance:

Policy complexity: More rules = more lookup time
Cache effectiveness: First-time accesses are slow
User/kernel transitions: With KPTI, each transition switches page tables, which can cost TLB entries on CPUs without PCID support
Audit logging: Writing audit records has I/O cost
Namespace depth: Deep namespace hierarchies add lookup layers

Security Worth the Cost

Modern implementations keep overhead low (typically 2-7% for MAC). The security benefits far outweigh this cost for most workloads. Where performance is critical (HPC, real-time), systems may run with reduced protection, accepting the trade-off explicitly.

Summary: Domain Implementation

We've explored how protection domains move from theoretical abstraction to concrete implementation. Let's consolidate the key insights:

Key Takeaways

• The access matrix is implemented via ACLs or capabilities: ACLs store rights with objects; capabilities store rights with subjects. Each has different trade-offs for revocation and delegation.
• Unix domains are defined by effective UID and credentials: The struct cred contains all domain-defining information including UIDs, groups, and capability sets.
• File permissions implement per-object ACLs: Basic mode bits plus POSIX ACLs control file access; the inode stores this information.
• Linux capabilities split root privilege: ~40 capabilities replace all-or-nothing root; file capabilities let an executable carry specific privileges without being setuid root.
• Pure capability systems differ fundamentally: Systems like seL4 use capabilities as the only access mechanism, eliminating ambient authority; Capsicum brings the same model to sandboxed Unix processes.
• MAC enforces system-wide policy: SELinux uses Type Enforcement with security contexts; LSM hooks provide the kernel integration point.
• Namespaces and cgroups create isolated domains: Container isolation combines namespace virtualization with resource limits for comprehensive domain separation.
• Performance overhead is manageable: Credential caching, AVC, and fast capability checks keep typical overhead at 2-7%.

Module Complete:

You now have a comprehensive understanding of protection domains: what they are (domain concept), how processes transition between them (domain switching), the hardware mechanism for enforcement (protection rings), the guiding principle for their use (least privilege), and how real systems implement them (domain implementation).

This knowledge is foundational for understanding operating system security, container architectures, privilege escalation vulnerabilities, and secure system design.
