We've established the theoretical foundations: protection domains define what code can do; domain switching enables controlled privilege transitions; protection rings provide hierarchical hardware enforcement; least privilege guides policy decisions.
But how do real operating systems actually implement these concepts? Where is the access matrix stored? How does the kernel track which domain a process belongs to? What happens when a process requests access to a resource?
This page bridges theory and practice, examining the concrete data structures, algorithms, and design patterns that transform protection domain concepts into running systems. We'll see how the elegance of theoretical models meets the messy reality of performance constraints, legacy compatibility, and engineering trade-offs.
By the end of this page, you will understand how operating systems implement protection domains through process credentials, file permissions, capability systems, and mandatory access controls. You'll see the kernel data structures, the access checking algorithms, and how different approaches trade off flexibility, performance, and security.
Recall that the access matrix is a conceptual model with rows (domains/subjects) and columns (objects). Real systems cannot store this matrix directly—it would be enormous and mostly empty. Instead, they use two complementary approaches:
Approach 1: Access Control Lists (ACLs)
Store the matrix by columns. Each object has a list of (subject, rights) pairs:
File /etc/passwd:
ACL: [(root, RW), (shadow, R), (everyone, R)]
File /etc/shadow:
ACL: [(root, RW), (shadow, R)]
Approach 2: Capability Lists
Store the matrix by rows. Each subject holds a list of (object, rights) tokens:
Process 1234:
Capabilities: [(/etc/passwd, R), (/home/user/file, RW), (socket:80, RW)]
| Aspect | Access Control Lists | Capabilities |
|---|---|---|
| Storage location | With the object | With the subject |
| Question answered | Who can access this object? | What can this subject access? |
| Revocation | Easy (modify object's ACL) | Hard (must find all copies) |
| Delegation | Hard (need admin to modify ACL) | Easy (pass capability token) |
| Audit: object access | Easy (list is right there) | Hard (must scan all subjects) |
| Audit: subject rights | Hard (must scan all objects) | Easy (list is right there) |
| Example systems | Unix permissions, NTFS ACLs | Capsicum, seL4, KeyKOS |
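To make the column/row distinction concrete, here is a minimal userspace sketch that stores only the non-empty cells of the matrix and answers both questions from the table. The names `cell`, `acl_lookup`, and `cap_list` are hypothetical and the right bits are illustrative; a real system would index the cells by object (ACL) or by subject (capability list) rather than scanning one flat array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { R = 4, W = 2 };  /* Illustrative right bits */

/* One non-empty cell of the access matrix. */
struct cell {
    const char *subject;
    const char *object;
    int rights;
};

/* Sparse storage: empty cells are simply absent (data from the examples above). */
static const struct cell matrix[] = {
    { "root",     "/etc/passwd", R | W },
    { "shadow",   "/etc/passwd", R },
    { "everyone", "/etc/passwd", R },
    { "root",     "/etc/shadow", R | W },
    { "shadow",   "/etc/shadow", R },
};
static const size_t ncells = sizeof matrix / sizeof matrix[0];

/* ACL view (one column): what rights does `subject` have on `object`? */
int acl_lookup(const char *object, const char *subject) {
    assert(object != NULL && subject != NULL);
    for (size_t i = 0; i < ncells; i++)
        if (strcmp(matrix[i].object, object) == 0 &&
            strcmp(matrix[i].subject, subject) == 0)
            return matrix[i].rights;
    return 0;  /* Absent cell means no rights */
}

/* Capability view (one row): which objects can `subject` reach?
 * Writes at most `max` object names to `out` and returns the count. */
size_t cap_list(const char *subject, const char *out[], size_t max) {
    size_t n = 0;
    assert(subject != NULL && out != NULL);
    for (size_t i = 0; i < ncells && n < max; i++)
        if (strcmp(matrix[i].subject, subject) == 0)
            out[n++] = matrix[i].object;
    return n;
}
```

Calling `acl_lookup("/etc/shadow", "everyone")` returns 0 because that cell is absent, which is the sparse-matrix equivalent of an empty entry.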
Hybrid Approaches:
Most real systems combine aspects of both:
Unix: Primarily ACL-based (permissions attached to files), but file descriptors act as capabilities (once opened, the descriptor grants access regardless of later permission changes)
Windows: NTFS uses rich ACLs, but access tokens (held by processes) are capability-like
Linux capabilities: Kernel capabilities are per-process (capability-style), while file permissions are ACL-style
macOS (securityd/XPC): A central security daemon tracks entitlements (ACL-style policy), while processes hold capability-like references
Unix pioneered a simple yet effective domain implementation that has influenced all subsequent systems. Let's examine the core data structures:
Process Credentials (struct cred in Linux):
Every process has a credentials structure defining its domain:
struct cred {
atomic_t usage; // Reference count
// User identity
kuid_t uid; // Real user ID
kgid_t gid; // Real group ID
kuid_t suid; // Saved user ID
kgid_t sgid; // Saved group ID
kuid_t euid; // Effective user ID (THIS DEFINES THE DOMAIN)
kgid_t egid; // Effective group ID
kuid_t fsuid; // Filesystem user ID
kgid_t fsgid; // Filesystem group ID
// Supplementary groups
struct group_info *group_info;
// Capabilities (since Linux 2.2)
kernel_cap_t cap_inheritable; // Inherited across exec
kernel_cap_t cap_permitted; // Maximum capabilities
kernel_cap_t cap_effective; // Currently active capabilities
kernel_cap_t cap_bset; // Capability bounding set
kernel_cap_t cap_ambient; // Ambient capabilities
// Security module labels (SELinux, AppArmor)
void *security;
// Namespace pointers
struct user_namespace *user_ns;
// ...
};
The Domain is the Effective UID:
In basic Unix, the effective UID is the primary domain identifier. Access checks compare euid against file ownership:
// Simplified file access check (classic Unix semantics)
int may_access(struct inode *inode, int mask) {
    const struct cred *cred = current_cred();
    unsigned int mode = inode->i_mode;

    // Select exactly one permission class: owner, else group, else other.
    // A matching owner never falls through to the group or other bits.
    // (Linux proper compares fsuid here, which normally mirrors euid.)
    if (uid_eq(cred->euid, inode->i_uid))
        mode >>= 6;                 // Use owner permission bits
    else if (in_group_p(inode->i_gid))
        mode >>= 3;                 // Use group permission bits
    // Otherwise the low three bits are already the "other" class

    // Every requested bit must be present in the selected class
    if ((mode & mask & 7) == mask)
        return 0;                   // Permitted

    // Check for capability override
    if (capable(CAP_DAC_OVERRIDE))
        return 0;                   // Root-like capability grants access
    return -EACCES;                 // Denied
}
Key Implementation Points:
current_cred(): Per-task lookup into task_struct
Linux uses RCU (Read-Copy-Update) to protect credential access. Readers don't need locks; writers create new cred structures and atomically publish them. This enables lock-free access checks on the fast path.
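The copy-then-publish half of that pattern can be sketched in userspace with C11 atomics. This is not the kernel's RCU: `cred_sketch`, `read_cred`, and `update_euid` are hypothetical names, a single writer is assumed, and the grace-period machinery that decides when the old structure may be freed is left out entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-in for struct cred. */
struct cred_sketch {
    unsigned euid;
    unsigned egid;
};

/* The currently published credentials (NULL until first update). */
static _Atomic(struct cred_sketch *) published;

/* Reader side: one atomic load, no lock taken. */
const struct cred_sketch *read_cred(void) {
    return atomic_load_explicit(&published, memory_order_acquire);
}

/* Writer side: never modify the live structure. Copy it, change the copy,
 * then publish the copy with a single atomic store. Returns the previous
 * structure, which real RCU would free only after all readers are done. */
struct cred_sketch *update_euid(unsigned new_euid) {
    struct cred_sketch *old =
        atomic_load_explicit(&published, memory_order_acquire);
    struct cred_sketch *fresh = malloc(sizeof *fresh);
    assert(fresh != NULL);
    if (old)
        *fresh = *old;      /* Start from the current values */
    else
        fresh->egid = 0;    /* First publication: pick a default */
    fresh->euid = new_euid;
    atomic_store_explicit(&published, fresh, memory_order_release);
    return old;
}
```

A reader that loaded the old pointer just before the store keeps seeing a complete, consistent structure, which is why the writer must not free it immediately.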
Files store their access control information (the "ACL column") in the inode:
The inode Structure (simplified):
struct inode {
umode_t i_mode; // File type and permission bits
kuid_t i_uid; // Owner user ID
kgid_t i_gid; // Owner group ID
// Extended ACL (if present)
struct posix_acl *i_acl; // Access ACL
struct posix_acl *i_default_acl; // Default ACL for directories
// Extended attributes (for security labels)
// Accessed via xattr interface
// ... many other fields
};
Permission Bit Layout:
i_mode (16 bits):
┌────────┬────────┬────────┬────────┬────────┐
│ Type   │ Setuid │ Owner  │ Group  │ Other  │
│ (4b)   │ (3b)   │ (3b)   │ (3b)   │ (3b)   │
└────────┴────────┴────────┴────────┴────────┘
Type: Regular file, directory, symlink, device, etc.
Setuid bits: setuid, setgid, sticky
Permission triads: rwx for owner, group, other
Example: 0100755 (regular file, rwxr-xr-x)
0100755
│   ││└── Other: r-x (5)
│   │└─── Group: r-x (5)
│   └──── Owner: rwx (7)
└──────── Regular file (0100000)
Extended ACLs (POSIX ACLs):
The basic 9-bit model is too limited for complex sharing. POSIX ACLs extend this:
# View extended ACL
$ getfacl myfile
# file: myfile
# owner: alice
# group: staff
user::rwx # Owner permissions
user:bob:r-x # Specific user (additional ACL entry)
group::r-- # Owning group
group:admin:rwx # Specific group (additional ACL entry)
mask::r-x # Maximum permissions for named entries and the owning group
other::---
# Set extended ACL
$ setfacl -m u:bob:rx myfile
$ setfacl -m g:admin:rwx myfile
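The effect of the mask entry in the listing above can be worked through with a one-line rule: named users and all group entries are ANDed with the mask, while the owner and other entries are not. A minimal sketch, with hypothetical names (`acl_tag_sketch`, `effective_perm`) and permissions encoded as r=4, w=2, x=1:

```c
#include <assert.h>

/* Hypothetical tags mirroring the POSIX ACL entry kinds. */
enum acl_tag_sketch { TAG_USER_OBJ, TAG_USER, TAG_GROUP_OBJ, TAG_GROUP, TAG_OTHER };

/* Effective permissions of one ACL entry under a given mask entry. */
unsigned effective_perm(enum acl_tag_sketch tag, unsigned perm, unsigned mask) {
    assert(perm <= 7 && mask <= 7);
    if (tag == TAG_USER_OBJ || tag == TAG_OTHER)
        return perm;        /* Owner and other entries bypass the mask */
    return perm & mask;     /* Named users and group entries are capped */
}
```

With the listing's `mask::r-x` (5), the entry `group:admin:rwx` (7) yields effective permissions r-x (5): the write bit is silently removed.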
ACL Evaluation Algorithm:
// Simplified POSIX ACL permission check
int posix_acl_permission(struct inode *inode, int mask) {
    struct posix_acl *acl = i_acl(inode);
    const struct cred *cred = current_cred();
    int perm = 0;

    if (!acl)
        return generic_permission(inode, mask);

    // Walk ACL entries
    for (int i = 0; i < acl->a_count; i++) {
        struct posix_acl_entry *entry = &acl->a_entries[i];
        switch (entry->e_tag) {
        case ACL_USER_OBJ:
            if (uid_eq(cred->euid, inode->i_uid)) {
                perm = entry->e_perm;
                goto check;
            }
            break;
        case ACL_USER:
            if (uid_eq(cred->euid, entry->e_uid)) {
                perm = entry->e_perm;
                goto apply_mask;
            }
            break;
        case ACL_GROUP_OBJ:
        case ACL_GROUP:
            // Check against group memberships...
            // Complex logic for group matching
            break;
        case ACL_OTHER:
            perm = entry->e_perm;
            goto check;
        }
    }

apply_mask:
    // Apply mask entry
    perm &= acl_mask(acl);
check:
    if ((perm & mask) == mask)
        return 0;
    return -EACCES;
}
Linux capabilities divide root's monolithic privilege into ~40 distinct capabilities. Each capability enables specific privileged operations:
Capability Sets:
Each process has five capability sets (64-bit bitmasks):
| Set | Purpose | Rules |
|---|---|---|
| Permitted (P) | Maximum capabilities process can use | Bounds effective and inheritable |
| Effective (E) | Currently active capabilities for access checks | Must be subset of permitted |
| Inheritable (I) | Preserved across exec (if file also has it) | For capability inheritance |
| Bounding (B) | Upper limit on what can be gained | Reduced by dropping, never raised |
| Ambient (A) | Automatically granted to non-capability-aware programs | For legacy compatibility |
Capability Check in Kernel:
// Check if current process has a capability
bool capable(int cap) {
return ns_capable(current_cred()->user_ns, cap);
}
bool ns_capable(struct user_namespace *ns, int cap) {
// Must be in init user namespace to have real root caps
if (cap_valid(cap) && in_proper_ns(ns)) {
const struct cred *cred = current_cred();
if (security_capable(cred, ns, cap, CAP_OPT_NONE) == 0)
return true;
}
return false;
}
// LSM hook allows SELinux/AppArmor to deny even if cap is present
File Capabilities:
Executables can have capability sets that affect processes running them:
# Set file capabilities
setcap 'cap_net_bind_service=ep' /usr/bin/myserver
# View file capabilities
getcap /usr/bin/myserver
/usr/bin/myserver cap_net_bind_service=ep
# Components: capability=sets
# e = effective (capability is raised on exec)
# p = permitted (capability is in permitted set after exec)
# i = inheritable (combined with process inheritable)
Capability Transformation on Exec:
When a process calls execve(), capabilities are transformed:
Let P = process's pre-exec caps, P' = post-exec caps
Let F = file's caps on the executable
P'(ambient) = (file is privileged) ? 0 : P(ambient)
P'(permitted) = (P(inheritable) & F(inheritable)) |
(F(permitted) & P(bounding)) | P'(ambient)
P'(effective) = F(effective) ? P'(permitted) : P'(ambient)
P'(inheritable) = P(inheritable)
A file is "privileged" here if it has file capabilities or its setuid/setgid bit is set.
This complex formula prevents capability escalation while enabling legitimate inheritance.
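The transformation is easier to follow as straight-line code over bitmasks. The sketch below follows the rules documented in capabilities(7) for a non-root process, where the post-exec ambient set is computed first and then feeds the permitted and effective sets. The struct and function names are hypothetical, and the special handling of UID 0 and setuid-root binaries is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical containers for the five process sets and the file sets. */
struct proc_caps {
    uint64_t permitted, effective, inheritable, bounding, ambient;
};
struct file_caps {
    uint64_t permitted, inheritable;
    bool effective;             /* The file "effective" flag is a single bit */
};

/* file_is_privileged: the file has capabilities or a setuid/setgid bit. */
struct proc_caps caps_after_exec(struct proc_caps p, struct file_caps f,
                                 bool file_is_privileged) {
    struct proc_caps n;

    /* Kernel invariant: ambient is a subset of permitted AND inheritable. */
    assert((p.ambient & ~(p.permitted & p.inheritable)) == 0);

    n.ambient     = file_is_privileged ? 0 : p.ambient;
    n.permitted   = (p.inheritable & f.inheritable) |
                    (f.permitted & p.bounding) |
                    n.ambient;
    n.effective   = f.effective ? n.permitted : n.ambient;
    n.inheritable = p.inheritable;
    n.bounding    = p.bounding;  /* Unchanged by exec */
    return n;
}
```

For a process with empty sets executing a binary marked `cap_net_bind_service=ep`, only the `F(permitted) & P(bounding)` term contributes, so the new process ends up with exactly that one capability permitted and effective.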
| Capability | Allows | Typical Use |
|---|---|---|
| CAP_CHOWN | Change file owner | Archival utilities |
| CAP_DAC_OVERRIDE | Bypass read/write/execute checks | Backup programs |
| CAP_KILL | Send signals to any process | Process managers |
| CAP_NET_ADMIN | Network configuration | Network utilities |
| CAP_NET_BIND_SERVICE | Bind to ports < 1024 | Web servers |
| CAP_NET_RAW | Use raw sockets | ping, tcpdump |
| CAP_SETUID | Set UID arbitrarily | Login utilities |
| CAP_SYS_ADMIN | Various admin ops (too broad!) | Everything needing root |
| CAP_SYS_PTRACE | Trace/debug any process | Debuggers, strace |
| CAP_SYS_TIME | Set system clock | NTP daemon |
While Linux capabilities are an overlay on traditional Unix ACLs, some systems use capabilities as the primary access control mechanism. In pure capability systems, access rights are held as unforgeable tokens.
Properties of True Capabilities:
Capsicum (FreeBSD/Userspace Capability Mode):
Capsicum brings capability semantics to Unix file descriptors:
#include <sys/capsicum.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Open file before entering capability mode
    int fd = open("/etc/passwd", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    // Limit what can be done with this descriptor
    cap_rights_t rights;
    cap_rights_init(&rights, CAP_READ, CAP_SEEK);
    if (cap_rights_limit(fd, &rights) < 0) {
        perror("cap_rights_limit");
        return 1;
    }
    // fd can now only read and seek, not write or ioctl
    // Enter capability mode - irrevocable
    if (cap_enter() < 0) {
        perror("cap_enter");
        return 1;
    }
    // Now in sandbox:
    // - Cannot open new files (no global namespace access)
    // - Can only use pre-opened file descriptors
    // - Each fd limited to its granted rights
    // This works (we have fd with CAP_READ):
    char buf[100];
    if (read(fd, buf, sizeof(buf)) < 0)
        perror("read");
    // This fails (cannot open new resources):
    // open("/etc/shadow", O_RDONLY); // Returns -1, ECAPMODE
    return 0;
}
seL4: Formal Verification of Capabilities:
seL4 is a microkernel with a formally verified capability system:
┌─────────────────────────────────────────────────┐
│ seL4 │
│ │
│ ┌───────────────────────────────────────────┐ │
│ │ Capability Space │ │
│ │ [CNode] ─────┬───── CSlot 0: EndpointCap │ │
│ │ ├───── CSlot 1: PageCap │ │
│ │ ├───── CSlot 2: TCBCap │ │
│ │ └───── CSlot 3: (empty) │ │
│ └───────────────────────────────────────────┘ │
│ │
│ Capabilities are the ONLY way to access │
│ kernel objects. No global namespace exists. │
│ │
│ Mathematical proof that information flow │
│ respects capability boundaries. │
└─────────────────────────────────────────────────┘
Operations on seL4 Capabilities:
The "confused deputy" problem occurs when a privileged program is tricked into misusing its authority. Capabilities prevent this: the deputy only has capabilities explicitly passed to it for the current task, not ambient authority to access anything it could theoretically access.
Discretionary Access Control (DAC) allows resource owners to control access. Mandatory Access Control (MAC) enforces a system-wide policy that resource owners cannot override; even a process running as root is bound by it unless the policy itself is changed.
SELinux Implementation:
SELinux implements Type Enforcement (TE), a form of MAC:
Security Contexts: Every process and object has a security context:
user:role:type:level
Examples:
system_u:system_r:httpd_t:s0 # Apache process
system_u:object_r:httpd_config_t:s0 # Apache config file
unconfined_u:unconfined_r:unconfined_t:s0 # Unconfined login shell (few restrictions)
Policy Rules:
# Allow httpd_t to read httpd_config_t files
allow httpd_t httpd_config_t:file { read open getattr };
# Allow httpd_t to bind to http ports
allow httpd_t http_port_t:tcp_socket { name_bind };
# Deny everything not explicitly allowed
Implementation in Kernel:
// Every task has security blob
struct task_security_struct {
u32 osid; // Original SID from exec
u32 sid; // Current SID
u32 exec_sid; // SID on exec
u32 create_sid; // SID for created objects
};
// Every inode has security blob
struct inode_security_struct {
u32 sid; // Security ID
u32 sclass; // Object class (file, dir, socket...)
// ...
};
LSM (Linux Security Modules) Framework:
SELinux, AppArmor, and other MAC systems plug into the kernel via LSM:
// LSM hooks throughout kernel
int vfs_open(struct file *file) {
int error = security_file_open(file); // LSM hook
if (error)
return error;
// ... actual open logic
}
// Hook implementation (SELinux example)
int selinux_file_open(struct file *file) {
struct inode *inode = file_inode(file);
u32 sid = current_sid(); // Process security ID
u32 isid = inode_security(inode); // Inode security ID
u32 av; // Access vector requested
// Determine required permissions from file mode
av = file_to_av(file);
// Check policy: is (sid, isid, class, av) allowed?
return avc_has_perm(sid, isid, inode_security_class(inode), av);
}
Access Vector Cache (AVC):
SELinux uses a cache to avoid constant policy lookups:
AVC Entry: (source_sid, target_sid, class) → allowed_vector, denied_vector
On access:
1. Compute cache key from (current_sid, target_sid, class)
2. Look up in hash table
3. If hit: check allowed bits → permit or deny
4. If miss: consult full policy → cache result → permit or deny
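The four steps above can be sketched as a small direct-mapped cache. This is an illustration of the lookup flow only, not the SELinux implementation: `policy_compute` is a hypothetical stand-in for the full policy search, the hash is arbitrary, and the real AVC also handles eviction, auditing, and concurrent access.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define AVC_SLOTS 64

struct avc_entry {
    bool     valid;
    uint32_t ssid, tsid;    /* Source and target security IDs */
    uint16_t tclass;        /* Object class */
    uint32_t allowed;       /* Bit vector of permitted operations */
};

static struct avc_entry cache[AVC_SLOTS];
static unsigned policy_lookups;     /* Counts slow-path policy searches */

/* Hypothetical policy: only (1, 2, class 1) is granted bits 0x3. */
static uint32_t policy_compute(uint32_t ssid, uint32_t tsid, uint16_t tclass) {
    policy_lookups++;
    return (ssid == 1 && tsid == 2 && tclass == 1) ? 0x3 : 0;
}

/* Step 1: derive the cache slot from the key. */
static unsigned slot_of(uint32_t ssid, uint32_t tsid, uint16_t tclass) {
    return ((ssid * 31u + tsid) * 31u + tclass) % AVC_SLOTS;
}

/* Returns 0 if every requested bit is allowed, -EACCES otherwise. */
int avc_check(uint32_t ssid, uint32_t tsid, uint16_t tclass, uint32_t requested) {
    assert(requested != 0);  /* An empty request is a caller bug in this sketch */
    struct avc_entry *e = &cache[slot_of(ssid, tsid, tclass)];

    /* Steps 2 and 4: on a miss, consult the policy and cache the answer. */
    if (!e->valid || e->ssid != ssid || e->tsid != tsid || e->tclass != tclass) {
        e->valid = true;
        e->ssid = ssid;
        e->tsid = tsid;
        e->tclass = tclass;
        e->allowed = policy_compute(ssid, tsid, tclass);
    }
    /* Step 3: permit only if no requested bit falls outside the allowed set. */
    return (requested & ~e->allowed) == 0 ? 0 : -EACCES;
}
```

Repeated checks for the same (source, target, class) triple touch only the cache entry, so the expensive policy search runs once per triple rather than once per access.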
| System | Policy Model | Configuration | Pros/Cons |
|---|---|---|---|
| SELinux | Type Enforcement + MLS | Policy files, compiled | Most flexible; complex |
| AppArmor | Path-based profiles | Profile files | Simpler; path-based has issues |
| Smack | Label-based | Extended attributes | Simple labels; less granular |
| TOMOYO | Path-based learning | Auto-generated policy | Learning mode; path-based |
Modern container systems use Linux namespaces to implement domain isolation. Each namespace type provides isolation for a specific resource:
The Namespace Implementation:
// Each task has namespace pointers
struct task_struct {
// ...
struct nsproxy *nsproxy; // Namespace proxy
// ...
};
struct nsproxy {
struct uts_namespace *uts_ns; // Hostname/domain
struct ipc_namespace *ipc_ns; // SysV IPC
struct mnt_namespace *mnt_ns; // Mount points
struct pid_namespace *pid_ns; // Process IDs
struct net *net_ns; // Network stack
struct time_namespace *time_ns; // Monotonic/boot clocks
struct cgroup_namespace *cgroup_ns; // Cgroup root
};
User Namespaces—The Game Changer:
User namespaces allow UID mapping, enabling unprivileged container creation:
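# Create a user namespace that maps the caller's own UID to root (0) inside it
unshare --user --map-root-user bash
# Inside: appear to be root
$ id
uid=0(root) gid=0(root)
# But actually mapped to the unprivileged host UID (1000 in this example)
$ cat /proc/self/uid_map
0 1000 1
# Means: Namespace UID 0 → Host UID 1000, range of 1 UID
# Container runtimes usually map a whole range instead, for example:
# 0 100000 65536
# Means: Container UID 0 → Host UID 100000, range of 65536 UIDs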
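Each `uid_map` line is a triple (first UID inside the namespace, first UID outside, length of the range), and translation is a bounds check plus an offset. A minimal sketch for a single map line, with hypothetical names `uid_extent` and `map_uid_to_host`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One line of /proc/<pid>/uid_map. */
struct uid_extent {
    uint32_t inside_first;   /* First UID as seen inside the namespace */
    uint32_t outside_first;  /* Corresponding first UID outside it */
    uint32_t count;          /* Number of consecutive UIDs mapped */
};

/* Translate a namespace UID to the outer UID, or return -1 if unmapped. */
int64_t map_uid_to_host(const struct uid_extent *e, uint32_t ns_uid) {
    assert(e != NULL);
    if (ns_uid < e->inside_first || ns_uid - e->inside_first >= e->count)
        return -1;
    return (int64_t)e->outside_first + (ns_uid - e->inside_first);
}
```

With the extent `0 100000 65536`, namespace UID 0 translates to host UID 100000 and namespace UID 1000 to 101000, while 65536 falls outside the range and is unmapped.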
cgroups for Resource Domains:
Control groups limit what resources a domain can consume:
# View cgroup structure (v2)
$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-1.scope
# Create cgroup for a service
mkdir /sys/fs/cgroup/myservice
# Set resource limits
echo "100M" > /sys/fs/cgroup/myservice/memory.max
echo "50000 100000" > /sys/fs/cgroup/myservice/cpu.max # 50% CPU
echo "8:0 rbps=10485760" > /sys/fs/cgroup/myservice/io.max # Read limit: 10 MiB/s on device 8:0
# Move process to cgroup
echo $$ > /sys/fs/cgroup/myservice/cgroup.procs
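The two numbers written to `cpu.max` are a quota and a period in microseconds, and the limit is their ratio. The helper below is a hypothetical sketch that parses one such line and reports the limit in units of whole CPUs; it returns a negative value both for `max` (no limit) and for malformed input.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse a cpu.max line such as "50000 100000" or "max 100000".
 * Returns quota/period (0.5 means half of one CPU), or -1.0 when there
 * is no limit or the line cannot be parsed. */
double cpu_max_to_cpus(const char *line) {
    long quota, period;
    assert(line != NULL);
    if (strncmp(line, "max", 3) == 0)
        return -1.0;                        /* Unlimited */
    if (sscanf(line, "%ld %ld", &quota, &period) != 2 || period <= 0)
        return -1.0;                        /* Malformed */
    return (double)quota / (double)period;
}
```

For the value used above, 50000 / 100000 = 0.5, which is the 50% noted in the comment; a quota of 200000 with the same period would allow two full CPUs.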
Container Domain = Namespaces + cgroups + Seccomp + Capabilities:
┌────────────────────────────────────────────────────────┐
│ Container │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Namespaces │ │
│ │ • Isolated PID tree (init = PID 1) │ │
│ │ • Private mount tree (overlayfs root) │ │
│ │ • Own network stack (veth pair to bridge) │ │
│ │ • Mapped UIDs (container 0 ≠ host 0) │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ cgroups │ │
│ │ • Memory limit (OOM if exceeded) │ │
│ │ • CPU quota (throttled if exceeded) │ │
│ │ • I/O bandwidth limit │ │
│ │ • Process count limit │ │
│ └────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────┐ │
│ │ Security Restrictions │ │
│ │ • Dropped capabilities (no CAP_SYS_ADMIN) │ │
│ │ • Seccomp filter (blocked syscalls) │ │
│ │ • Read-only root filesystem │ │
│ │ • SELinux/AppArmor profile │ │
│ └────────────────────────────────────────────────┘ │
└────────────────────────────────────────────────────────┘
Protection domain enforcement has performance implications. Every resource access requires permission checking. Design decisions balance security and performance:
Fast Path Optimizations:
1. Credential Caching:
// Accessing current credentials is very fast
static inline const struct cred *current_cred(void) {
return rcu_dereference_protected(
current->cred,
1 // No locking needed for own cred
);
}
// No locks, just dereference from current task_struct
2. Access Vector Cache (SELinux):
3. Capability Checks:
// Capability check is a bitfield test
static inline bool cap_raised(kernel_cap_t c, int flag) {
return c.cap[__cap_idx(flag)] & __cap_bit(flag);
}
// Single array lookup + bitwise AND
Performance Overhead Measurements:
| Operation | Without MAC | With SELinux | Overhead |
|---|---|---|---|
| open() syscall | 1.0x baseline | 1.1-1.2x | 10-20% |
| stat() syscall | 1.0x baseline | 1.05x | 5% |
| Process fork | 1.0x baseline | 1.15x | 15% |
| Network connect | 1.0x baseline | 1.1x | 10% |
| Overall system | 1.0x baseline | 1.02-1.07x | 2-7% |
Factors Affecting Performance:
Modern implementations keep overhead low (typically 2-7% for MAC). The security benefits far outweigh this cost for most workloads. Where performance is critical (HPC, real-time), systems may run with reduced protection, accepting the trade-off explicitly.
We've explored how protection domains move from theoretical abstraction to concrete implementation. Let's consolidate the key insights:
struct cred contains all domain-defining information including UIDs, groups, and capability sets.
Module Complete:
You now have a comprehensive understanding of protection domains: what they are (domain concept), how processes transition between them (domain switching), the hardware mechanism for enforcement (protection rings), the guiding principle for their use (least privilege), and how real systems implement them (domain implementation).
This knowledge is foundational for understanding operating system security, container architectures, privilege escalation vulnerabilities, and secure system design.
Congratulations! You've mastered protection domains—the fundamental abstraction for OS security. From the theoretical access matrix to SELinux policies and container namespaces, you now understand how operating systems implement secure resource isolation.