A hospital's electronic health record system goes down during a medical emergency. An e-commerce platform crashes on Black Friday. A banking system becomes inaccessible when customers need to make critical payments. In each case, the data remains confidential, the integrity is intact—but the system is useless because it's unavailable.
Availability is the third pillar of the CIA triad, ensuring that systems, applications, and data are accessible to authorized users when needed. It is the property that makes secure systems actually useful: a system with perfect confidentiality and integrity that nobody can access serves no purpose.
By the end of this page, you will understand availability as a protection goal, recognize the diverse threats to system availability, understand how operating systems implement availability mechanisms, appreciate the tension between availability and other security properties, and know how to measure and design for high availability.
Availability refers to the assurance that systems and data are accessible when authorized users need them. Unlike confidentiality and integrity, which focus on preventing unwanted information events, availability focuses on ensuring wanted events occur—authorized access succeeds.
Formal Definition:
Availability can be expressed mathematically. If we define Uptime as the time during which the system is operational and serving requests, and Downtime as the time during which it is not, then:

Availability = Uptime / (Uptime + Downtime)
This is often expressed as a percentage. 'Five nines' (99.999%) availability means no more than 5.26 minutes of downtime per year—an extremely demanding target.
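To see where figures like "5.26 minutes per year" come from, a small shell calculation converts an availability target into an annual downtime budget. This is a minimal sketch using awk; the target percentage is supplied as an argument.

```bash
#!/bin/bash
# Convert an availability target (percent) into allowed downtime per year.
# Usage: ./downtime.sh 99.999
TARGET=${1:-99.999}

awk -v a="$TARGET" 'BEGIN {
    minutes_per_year = 365.25 * 24 * 60                  # ~525,960 minutes
    downtime = (100 - a) / 100 * minutes_per_year
    printf "%.4f%% availability allows %.2f minutes (%.0f seconds) of downtime per year\n",
           a, downtime, downtime * 60
}'
```

For 99.999 this prints roughly 5.26 minutes, matching the table below.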
Dimensions of Availability:
Availability encompasses more than binary up/down status. A system might be 'up' but so slow as to be effectively unavailable. A system might be accessible but with degraded functionality. True availability requires that the system provides expected service levels.
| Availability % | Designation | Downtime/Year | Downtime/Month | Typical Use Case |
|---|---|---|---|---|
| 99% | Two nines | 3.65 days | 7.2 hours | Internal tools, non-critical systems |
| 99.9% | Three nines | 8.76 hours | 43.2 minutes | Standard business applications |
| 99.99% | Four nines | 52.6 minutes | 4.32 minutes | E-commerce, financial services |
| 99.999% | Five nines | 5.26 minutes | 25.9 seconds | Healthcare, emergency services |
| 99.9999% | Six nines | 31.5 seconds | 2.59 seconds | Mission-critical infrastructure |
Each additional 'nine' of availability typically requires exponentially more investment. Moving from three nines to four nines might double your infrastructure costs. Moving from four to five nines might require complete architectural redesign and orders of magnitude more spending. Organizations must balance availability requirements against cost constraints.
Availability threats are diverse, ranging from malicious attacks to innocent misconfigurations to acts of nature. Understanding these threats is essential for designing resilient systems.
Denial of Service (DoS) Attacks:
DoS attacks deliberately overwhelm systems to make them unavailable. Distributed DoS (DDoS) attacks amplify this by using many compromised machines (botnets) to generate attack traffic. Volumetric attacks flood bandwidth; protocol attacks exploit protocol weaknesses; application-layer attacks target specific vulnerable functionality.
Resource Exhaustion:
Even without malicious intent, systems can become unavailable when resources are exhausted. Memory leaks gradually consume RAM. Disk fills with logs. Thread pools are exhausted. File descriptors run out. These failures often develop slowly, then occur suddenly.
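When the cause is creeping exhaustion rather than an attack, the fix starts with visibility. A few standard commands cover the usual suspects; the service name below is a placeholder assumption.

```bash
#!/bin/bash
# Quick checks for common resource-exhaustion culprits.

# Disk space: a full /var (logs) is a classic cause of sudden outages
df -h /var

# System-wide file descriptors: allocated vs. maximum
cat /proc/sys/fs/file-nr

# Open descriptors for one service (a leak if this grows without bound)
PID=$(pgrep -o myservice)    # "myservice" is a placeholder
ls "/proc/$PID/fd" | wc -l

# Memory headroom, including swap usage
free -m

# Processes with the most threads (thread-pool exhaustion, fork leaks)
ps -eo pid,comm,nlwp --sort=-nlwp | head
```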
The most dangerous availability threats are cascading failures. A single component failure increases load on remaining components, causing them to fail, which increases load further. A classic example: one server fails, remaining servers get more traffic, they slow down, timeouts cause retries, retries add more load, more servers fail. Designing to prevent cascading failures is crucial for high availability.
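One practical client-side countermeasure is to retry with exponential backoff and jitter instead of hammering a struggling service. The sketch below is illustrative; the URL, timeouts, and attempt counts are assumptions.

```bash
#!/bin/bash
# Retry with exponential backoff and jitter to avoid retry storms.
URL="https://api.internal.example/health"   # placeholder endpoint
MAX_ATTEMPTS=5
DELAY=1

for attempt in $(seq 1 "$MAX_ATTEMPTS"); do
    if curl -fsS --max-time 2 "$URL" > /dev/null; then
        echo "attempt $attempt succeeded"
        exit 0
    fi
    if [ "$attempt" -eq "$MAX_ATTEMPTS" ]; then
        echo "giving up after $MAX_ATTEMPTS attempts (fail fast rather than add load)"
        exit 1
    fi
    # Exponential backoff with jitter so clients do not retry in lockstep
    sleep $(( DELAY + RANDOM % DELAY ))
    DELAY=$(( DELAY * 2 ))
done
```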
Operating systems implement numerous mechanisms to maintain availability in the face of failures and attacks. These mechanisms span resource management, fault isolation, and recovery capabilities.
Process Isolation:
By isolating processes in separate address spaces, the OS prevents one process from crashing another. A bug in an application terminates that application, not the entire system. This fundamental isolation is the foundation of OS reliability.
Resource Limits:
OS controls prevent any single process from consuming all system resources. Limits on memory usage (ulimit, cgroups), file descriptors, CPU time, and other resources ensure that runaway processes cannot starve the rest of the system.
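For a single process or user session, the simplest interface is the shell's ulimit builtin; the values below are illustrative, and the table and cgroups example that follow show the broader toolkit.

```bash
#!/bin/bash
# Per-process resource limits via the shell's ulimit builtin (example values).
ulimit -n 4096       # max open file descriptors for this shell and its children
ulimit -u 2048       # max processes/threads for this user
ulimit -v 2097152    # max virtual memory, in KB (~2 GB)

# Show all current soft limits
ulimit -a

# Persistent per-user limits usually live in /etc/security/limits.conf, e.g.:
# appuser  hard  nofile  8192
# appuser  hard  nproc   4096
```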
| Mechanism | What It Protects | How It Works |
|---|---|---|
| Process Isolation | System from faulty applications | Separate address spaces, privilege boundaries |
| Resource Limits (ulimit, cgroups) | System from resource exhaustion | Per-process and per-group caps on resources |
| OOM Killer | System from memory exhaustion | Terminates processes when memory critically low |
| Watchdog Timers | System from hangs and deadlocks | Hardware/software timers reset system if not fed |
| Automatic Restart (systemd) | Services from crashes | Restart failed services with rate limiting |
| Journaling File Systems | File system from corruption | Transaction logging enables crash recovery |
| Memory Protection | System from memory corruption | Hardware-enforced access controls on memory |
| Rate Limiting | Services from overload | Kernel or application level request throttling |
```bash
#!/bin/bash
# Using cgroups v2 for resource isolation and availability

# Create a cgroup for a critical service
sudo mkdir -p /sys/fs/cgroup/critical-service

# Limit memory to 1GB (protects system from memory exhaustion)
echo "1G" | sudo tee /sys/fs/cgroup/critical-service/memory.max

# Set a high-memory threshold so reclaim pressure kicks in before the hard limit
echo "900M" | sudo tee /sys/fs/cgroup/critical-service/memory.high

# Cap CPU usage at 20% (20ms of every 100ms period) to prevent monopolization
echo "20000 100000" | sudo tee /sys/fs/cgroup/critical-service/cpu.max

# Give this cgroup high weight for CPU scheduling priority
echo "200" | sudo tee /sys/fs/cgroup/critical-service/cpu.weight

# Limit I/O bandwidth to prevent disk monopolization
# Format: "major:minor rbps=<bytes/s> wbps=<bytes/s> riops=<ops/s> wiops=<ops/s>"
echo "8:0 rbps=100000000 wbps=50000000" | \
    sudo tee /sys/fs/cgroup/critical-service/io.max

# Add the current shell's process to the cgroup
echo $$ | sudo tee /sys/fs/cgroup/critical-service/cgroup.procs

# With systemd, use service unit options instead:
# [Service]
# MemoryMax=1G
# MemoryHigh=900M
# CPUQuota=20%
# CPUWeight=200
# IOReadBandwidthMax=/dev/sda 100M

# Check resource usage
cat /sys/fs/cgroup/critical-service/memory.current
cat /sys/fs/cgroup/critical-service/cpu.stat
cat /sys/fs/cgroup/critical-service/io.stat
```

The OOM Killer:
When Linux runs out of memory and cannot reclaim enough through normal means, the Out-of-Memory (OOM) killer activates. It selects a process to terminate, freeing its memory for the rest of the system. While brutal, this mechanism prevents complete system hang from memory exhaustion. Processes can influence their OOM score to protect critical services.
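Concretely, this influence is exercised through /proc/&lt;pid&gt;/oom_score_adj (or the systemd OOMScoreAdjust= option). The sketch below uses placeholder service names.

```bash
#!/bin/bash
# Protect a critical process from the OOM killer (-1000 = never kill, +1000 = kill first).
PID=$(pgrep -o critical-service)     # placeholder service name
echo -500 | sudo tee /proc/$PID/oom_score_adj

# Mark an expendable batch job as the preferred victim instead
echo 500 | sudo tee /proc/$(pgrep -o batch-job)/oom_score_adj

# Equivalent per-service setting in a systemd unit:
# [Service]
# OOMScoreAdjust=-500
```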
No single mechanism guarantees availability. Effective systems combine process isolation, resource limits, monitoring, automatic restart, and architectural redundancy. When one layer fails to prevent a problem, another layer catches it. This defense in depth is as important for availability as for security.
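A concrete way to stack several of these layers is a systemd service unit. The sketch below is illustrative only: the service name, binary path, and limit values are assumptions, and the watchdog line requires the service to send sd_notify keep-alives.

```bash
#!/bin/bash
# Write an illustrative unit combining restart policy, resource limits, and a watchdog.
sudo tee /etc/systemd/system/critical-service.service <<'EOF'
[Unit]
Description=Critical service with layered availability protections
# Rate-limit restarts: at most 5 attempts within 5 minutes
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
ExecStart=/usr/local/bin/critical-service
# Automatic restart after crashes, with a pause between attempts
Restart=on-failure
RestartSec=5
# Resource limits, enforced through cgroups
MemoryMax=1G
CPUQuota=50%
TasksMax=256
# Restart the service if it stops sending sd_notify watchdog pings
WatchdogSec=30

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now critical-service
```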
Denial of Service attacks are among the most challenging availability threats because they exploit the asymmetry between attack cost and defense cost. An attacker with a botnet can generate traffic that costs them nothing but overwhelms expensive server infrastructure.
Operating System Level Defenses:
OS kernels implement multiple DoS mitigations. SYN cookies defend against SYN flood attacks without maintaining state for half-open connections. Connection rate limiting restricts new connections per source IP. Netfilter/iptables can drop packets based on patterns. BPF programs enable sophisticated packet filtering at wire speed.
```bash
#!/bin/bash
# Kernel-level DoS mitigations

# Enable SYN cookies (defends against SYN floods)
echo 1 > /proc/sys/net/ipv4/tcp_syncookies

# Retry SYN-ACKs fewer times so half-open connections expire sooner
echo 2 > /proc/sys/net/ipv4/tcp_synack_retries

# Increase SYN backlog
echo 2048 > /proc/sys/net/ipv4/tcp_max_syn_backlog

# Enable reverse path filtering (prevent IP spoofing)
echo 1 > /proc/sys/net/ipv4/conf/all/rp_filter

# Ignore ICMP broadcast (prevent Smurf attacks)
echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts

# Reduce TIME_WAIT sockets
echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse

# iptables rate limiting example
# Limit new SSH connections to 4 per minute per source IP
iptables -A INPUT -p tcp --dport 22 -m state --state NEW \
    -m recent --set --name SSH
iptables -A INPUT -p tcp --dport 22 -m state --state NEW \
    -m recent --update --seconds 60 --hitcount 4 --name SSH \
    -j DROP

# Connection limit for HTTP (reject sources with more than 50 concurrent connections)
iptables -A INPUT -p tcp --dport 80 -m connlimit \
    --connlimit-above 50 -j REJECT

# Per-IP packet rate limit
iptables -A INPUT -m hashlimit \
    --hashlimit-above 100/sec \
    --hashlimit-burst 200 \
    --hashlimit-mode srcip \
    --hashlimit-name http_rate \
    -p tcp --dport 80 -j DROP
```

Operating system defenses protect against moderate attacks but cannot defend against massive DDoS. When attack traffic exceeds your network bandwidth, even perfect kernel-level filtering is useless—packets flood your uplink before reaching your servers. Large-scale DDoS mitigation requires upstream filtering: ISP-level, CDN-based, or specialized DDoS mitigation services.
Achieving high availability requires architectural decisions that extend beyond individual server configuration. Redundancy, load balancing, and failover mechanisms ensure that component failures don't cause service outages.
Redundancy:
The fundamental principle of high availability is redundancy—having multiple instances of every critical component so that failure of one doesn't cause system failure. Redundancy applies at every layer: multiple servers, multiple network paths, multiple data centers, multiple power sources.
Failover:
When a component fails, traffic must automatically redirect to healthy components. Failover can be active-passive (standby takes over when primary fails) or active-active (all instances serve traffic, surviving instances absorb failed instance's load).
| Pattern | Description | Complexity | Recovery Time |
|---|---|---|---|
| Active-Passive | Standby takes over on primary failure | Low | Seconds to minutes |
| Active-Active | All nodes serve traffic, survivors absorb failed load | Medium | Milliseconds to seconds |
| N+1 Redundancy | One spare for every N active components | Low | Depends on failover mechanism |
| N+M Redundancy | M spares for N active components | Medium | Depends on failover mechanism |
| Geographic Distribution | Components spread across data centers/regions | High | Depends on replication lag |
| Hot Standby | Standby receives all updates, ready to serve immediately | Medium | Seconds |
| Warm Standby | Standby periodically synced, needs brief preparation | Low | Minutes |
| Cold Standby | Standby maintained offline, requires full restore | Lowest | Hours |
Load Balancing:
Load balancers distribute traffic across multiple backend servers, enabling horizontal scaling and automatic failover. When a backend server fails health checks, the load balancer stops sending it traffic. From the client's perspective, the service remains available.
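At the software level, a reverse proxy such as nginx illustrates the pattern. The upstream addresses, ports, and thresholds below are assumptions; open-source nginx performs passive health checks, marking a backend as down after repeated failures.

```bash
#!/bin/bash
# Illustrative nginx load-balancing configuration with failover between backends.
sudo tee /etc/nginx/conf.d/app-upstream.conf <<'EOF'
upstream app_backend {
    # Stop sending traffic to a backend after 3 failures within 30 seconds
    server 10.0.0.11:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.12:8080 max_fails=3 fail_timeout=30s;
    # Only used if the others are unavailable
    server 10.0.0.13:8080 backup;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_backend;
        # Retry the next backend on connection errors, timeouts, or 502/503
        proxy_next_upstream error timeout http_502 http_503;
        proxy_connect_timeout 2s;
    }
}
EOF

sudo nginx -t && sudo systemctl reload nginx
```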
Health Checking:
Health checks are the nervous system of high availability. Load balancers, orchestration platforms, and monitoring systems continuously probe components to detect failures. Health checks can be shallow (is the process responding at all?) or deep (are its dependencies, such as databases and caches, actually reachable and working?), and they commonly take the form of liveness probes (should the process be restarted?) and readiness probes (should it receive traffic?), as the monitoring example later on this page shows.
```bash
# Setting up high availability with Pacemaker/Corosync
# This creates an active-passive cluster with automatic failover

# Install on both nodes
apt install pacemaker corosync pcs resource-agents

# Initialize cluster authentication
pcs cluster auth node1 node2 -u hacluster -p password

# Create the cluster
pcs cluster setup --name ha-cluster node1 node2

# Start cluster services
pcs cluster start --all
pcs cluster enable --all

# Configure a virtual IP that floats between nodes
pcs resource create VirtualIP ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 \
    cidr_netmask=24 \
    op monitor interval=30s

# Configure a service that depends on the virtual IP
pcs resource create WebServer ocf:heartbeat:apache \
    configfile=/etc/apache2/apache2.conf \
    statusurl="http://localhost/server-status" \
    op monitor interval=30s

# Ensure WebServer runs where VirtualIP runs
pcs constraint colocation add WebServer with VirtualIP INFINITY

# Ensure VirtualIP starts before WebServer
pcs constraint order VirtualIP then WebServer

# Configure STONITH (Shoot The Other Node In The Head)
# Critical for preventing split-brain scenarios
pcs stonith create fence_node1 fence_ipmilan \
    ipaddr=192.168.1.11 \
    login=admin passwd=secret \
    pcmk_host_list="node1"

# Check cluster status
pcs status
# Cluster name: ha-cluster
# Node List:
#   * Online: [ node1 node2 ]
#
# Full List of Resources:
#   * VirtualIP (ocf:heartbeat:IPaddr2): Started node1
#   * WebServer (ocf:heartbeat:apache): Started node1
```

In high-availability clusters, split-brain occurs when nodes lose communication and each believes it should take over. This can cause data corruption if both write to shared storage. Quorum (requiring a majority of nodes to agree) and STONITH (fencing nodes that cannot communicate) prevent split-brain by ensuring only one partition can be active.
Availability often exists in tension with confidentiality and integrity. Security measures that protect data can reduce availability, and availability measures can create security vulnerabilities. Understanding these tradeoffs is essential for balanced system design.
Availability vs. Confidentiality:
Strong authentication protects confidentiality but can reduce availability. If authentication systems fail, legitimate users are locked out. Complex password policies cause forgotten credentials. Multi-factor authentication adds failure points. Emergency access provisions (break-glass procedures) restore availability but create confidentiality risks.
Availability vs. Integrity:
Integrity checks take time and can cause availability bottlenecks. Cryptographic verification of every read operation adds latency. Transactional consistency (ACID) requires coordination that limits throughput. Systems that prioritize availability (like eventually consistent databases) sacrifice strict integrity guarantees.
| Security Measure | Availability Impact | Resolution Approach |
|---|---|---|
| Strong authentication | Single point of failure if auth system down | Redundant auth systems, cached credentials, break-glass |
| Encryption at rest | Key management failures cause inaccessibility | Key escrow, multiple key custodians, HSM redundancy |
| Integrity verification | Verification latency reduces throughput | Parallel verification, caching, risk-based checks |
| Audit logging | Log system failure can block operations | Asynchronous logging, local buffering, fail-open policy |
| Input validation | Aggressive filtering may reject legitimate requests | Tuned signatures, allowlisting, feedback mechanisms |
| Access controls | Complex policies slow authorization decisions | Policy caching, simplified rules, delegation |
The CAP Theorem:
For distributed systems, the CAP theorem formalizes fundamental tradeoffs. A distributed system cannot simultaneously guarantee Consistency (all nodes see same data), Availability (every request receives a response), and Partition tolerance (system continues despite network failures). Since partitions are inevitable, systems must choose between consistency and availability during partitions.
Fail-Open vs. Fail-Closed:
Security systems must decide how to fail. Fail-closed (deny access when security system fails) prioritizes confidentiality but sacrifices availability. Fail-open (allow access when security system fails) maintains availability but creates security risk. The correct choice depends on the specific application's requirements.
There's no universal right answer to availability vs. security tradeoffs. A hospital system should fail-open for patient records during emergencies—patient lives outweigh privacy risk. A nuclear facility should fail-closed for access controls—safety outweighs availability. Design decisions must reflect the specific context and consequences.
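The fail-open versus fail-closed decision can also be made explicit in code. The sketch below is purely illustrative: auth-check is a hypothetical command assumed to exit 0 to allow, 1 to deny, and anything else (or a timeout) when the authorization service itself is unavailable.

```bash
#!/bin/bash
# Illustrative fail-open vs. fail-closed handling around a hypothetical auth-check command.
FAIL_MODE="closed"   # "open" prioritizes availability, "closed" prioritizes confidentiality
user="$1"
resource="$2"

timeout 2 auth-check "$user" "$resource"
rc=$?

if [ "$rc" -eq 0 ]; then
    decision="allow"                  # authorization service answered: allow
elif [ "$rc" -eq 1 ]; then
    decision="deny"                   # authorization service answered: deny
else
    # Authorization service unreachable or timed out: the failure policy decides
    if [ "$FAIL_MODE" = "open" ]; then
        decision="allow"
    else
        decision="deny"
    fi
fi

echo "access decision for $user on $resource: $decision"
```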
You cannot improve what you cannot measure. Availability monitoring is essential for understanding current state, detecting problems early, and driving improvement initiatives.
Service Level Concepts:
Availability targets are formalized through Service Level documents:

- SLI (Service Level Indicator): a measured quantity, such as the fraction of successful requests or request latency.
- SLO (Service Level Objective): the internal target set for an SLI, for example "99.9% of requests succeed over a rolling 30 days."
- SLA (Service Level Agreement): the contractual commitment made to customers, usually with financial penalties when it is missed.
Key Metrics:
Beyond simple uptime percentage, meaningful availability measurement considers:

- MTBF (Mean Time Between Failures): how long the system typically runs before a failure.
- MTTR (Mean Time To Repair): how long recovery takes once a failure occurs; availability can also be derived from these two figures, as the sketch below shows.
- Error rate: the fraction of requests that fail even while the system is nominally "up."
- Latency: a system that responds too slowly is effectively unavailable, so latency percentiles matter alongside uptime.
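As a quick illustration of the MTBF/MTTR relationship (the figures below are assumptions):

```bash
#!/bin/bash
# Availability derived from failure statistics: MTBF / (MTBF + MTTR).
MTBF_HOURS=720     # mean time between failures (about 30 days)
MTTR_HOURS=2       # mean time to repair

awk -v mtbf="$MTBF_HOURS" -v mttr="$MTTR_HOURS" 'BEGIN {
    printf "Availability = %.4f%%\n", mtbf / (mtbf + mttr) * 100
}'
# Prints: Availability = 99.7230%
```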
```python
from flask import Flask, jsonify
import psycopg2
import redis
import time

app = Flask(__name__)

class HealthChecker:
    """
    Comprehensive health checking for availability monitoring.
    Returns detailed status for each dependency.
    """

    def check_database(self):
        """Deep health check for database connectivity."""
        start = time.time()
        try:
            conn = psycopg2.connect(
                host="db-host",
                dbname="app",
                connect_timeout=5
            )
            # Execute an actual query to verify more than just the TCP connection
            cur = conn.cursor()
            cur.execute("SELECT 1")
            cur.fetchone()
            conn.close()
            return {
                "status": "healthy",
                "latency_ms": (time.time() - start) * 1000
            }
        except Exception as e:
            return {
                "status": "unhealthy",
                "error": str(e),
                "latency_ms": (time.time() - start) * 1000
            }

    def check_cache(self):
        """Health check for Redis cache."""
        start = time.time()
        try:
            r = redis.Redis(host='redis-host', socket_timeout=5)
            r.ping()
            return {
                "status": "healthy",
                "latency_ms": (time.time() - start) * 1000
            }
        except Exception as e:
            return {
                "status": "unhealthy",
                "error": str(e)
            }

    def full_check(self):
        """Aggregate health of all dependencies."""
        checks = {
            "database": self.check_database(),
            "cache": self.check_cache(),
        }
        # Overall status: healthy only if all dependencies are healthy
        overall_healthy = all(
            c["status"] == "healthy" for c in checks.values()
        )
        return {
            "status": "healthy" if overall_healthy else "unhealthy",
            "checks": checks,
            "timestamp": time.time()
        }

health = HealthChecker()

@app.route('/health/live')
def liveness():
    """
    Liveness probe: Is the process running?
    Used by Kubernetes to decide whether to restart the container.
    """
    return jsonify({"status": "alive"}), 200

@app.route('/health/ready')
def readiness():
    """
    Readiness probe: Can the service handle requests?
    Used by load balancers to include/exclude from rotation.
    """
    result = health.full_check()
    status_code = 200 if result["status"] == "healthy" else 503
    return jsonify(result), status_code

@app.route('/health/detailed')
def detailed_health():
    """Detailed health for monitoring dashboards."""
    return jsonify(health.full_check())
```

Error budgets create explicit tradeoffs between reliability and feature velocity. If a service is within its error budget, teams can take risks with new deployments. If the error budget is exhausted, teams must focus on reliability improvements. This approach, popularized by Google's SRE practice, turns availability from an abstract goal into concrete engineering decisions.
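A small calculation makes the error-budget idea concrete; the SLO, period, and observed downtime below are assumptions.

```bash
#!/bin/bash
# How much of the quarterly error budget remains under a 99.9% SLO?
SLO=99.9
PERIOD_MINUTES=$((90 * 24 * 60))     # a 90-day quarter
USED_DOWNTIME_MIN=95                 # observed downtime so far

awk -v slo="$SLO" -v period="$PERIOD_MINUTES" -v used="$USED_DOWNTIME_MIN" 'BEGIN {
    budget = (100 - slo) / 100 * period
    remaining = budget - used
    printf "Budget: %.1f min, used: %.1f min, remaining: %.1f min (%.0f%% left)\n",
           budget, used, remaining, remaining / budget * 100
}'
```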
Availability ensures that systems and data are accessible when authorized users need them. Operating systems implement availability through resource management, fault isolation, and recovery mechanisms, while architectures achieve high availability through redundancy, load balancing, and failover. These protections must be balanced against confidentiality and integrity, and verified continuously through monitoring, service level objectives, and error budgets.
What's Next:
We've now completed the CIA triad—the three fundamental protection goals. However, protection requires knowing who is making requests and what they're allowed to do. In the next page, we'll explore Authentication—the process of verifying identity that enables us to apply confidentiality, integrity, and availability controls to the right people.
You now understand availability as a protection goal, the threats that endanger it, and the mechanisms operating systems and architectures use to maintain it. This knowledge enables you to design systems that remain accessible despite failures and attacks. Next, we'll examine authentication—how systems verify who is making requests.