A hospital's electronic health record system goes down during a medical emergency. An e-commerce platform crashes on Black Friday. A banking system becomes inaccessible when customers need to make critical payments. In each case, the data remains confidential, the integrity is intact—but the system is useless because it's unavailable.
Availability is the third pillar of the CIA triad, ensuring that systems, applications, and data are accessible to authorized users when needed. It is the property that makes secure systems actually useful: a system with perfect confidentiality and integrity that nobody can access serves no purpose.
By the end of this page, you will understand availability as a protection goal, recognize the diverse threats to system availability, understand how operating systems implement availability mechanisms, appreciate the tension between availability and other security properties, and know how to measure and design for high availability.
Availability refers to the assurance that systems and data are accessible when authorized users need them. Unlike confidentiality and integrity, which focus on preventing unwanted information events, availability focuses on ensuring wanted events occur—authorized access succeeds.
Formal Definition:
Availability can be expressed mathematically. If we define Uptime as the time during which the system is operational and serving requests, and Downtime as the time during which it is not, then:

Availability = Uptime / (Uptime + Downtime)
This is often expressed as a percentage. 'Five nines' (99.999%) availability means no more than 5.26 minutes of downtime per year—an extremely demanding target.
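To see where figures like "5.26 minutes per year" come from, a small shell calculation converts an availability target into an annual downtime budget. This is a minimal sketch using awk; the target percentage is supplied as an argument.

```bash
#!/bin/bash
# Convert an availability target (percent) into allowed downtime per year.
# Usage: ./downtime.sh 99.999
TARGET=${1:-99.999}

awk -v a="$TARGET" 'BEGIN {
    minutes_per_year = 365.25 * 24 * 60                  # ~525,960 minutes
    downtime = (100 - a) / 100 * minutes_per_year
    printf "%.4f%% availability allows %.2f minutes (%.0f seconds) of downtime per year\n",
           a, downtime, downtime * 60
}'
```

For 99.999 this prints roughly 5.26 minutes, matching the table below.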
Dimensions of Availability:
Availability encompasses more than binary up/down status. A system might be 'up' but so slow as to be effectively unavailable. A system might be accessible but with degraded functionality. True availability requires that the system provides expected service levels.
| Availability % | Designation | Downtime/Year | Downtime/Month | Typical Use Case |
|---|---|---|---|---|
| 99% | Two nines | 3.65 days | 7.2 hours | Internal tools, non-critical systems |
| 99.9% | Three nines | 8.76 hours | 43.2 minutes | Standard business applications |
| 99.99% | Four nines | 52.6 minutes | 4.32 minutes | E-commerce, financial services |
| 99.999% | Five nines | 5.26 minutes | 25.9 seconds | Healthcare, emergency services |
| 99.9999% | Six nines | 31.5 seconds | 2.59 seconds | Mission-critical infrastructure |
Each additional 'nine' of availability typically requires exponentially more investment. Moving from three nines to four nines might double your infrastructure costs. Moving from four to five nines might require complete architectural redesign and orders of magnitude more spending. Organizations must balance availability requirements against cost constraints.
Availability threats are diverse, ranging from malicious attacks to innocent misconfigurations to acts of nature. Understanding these threats is essential for designing resilient systems.
Denial of Service (DoS) Attacks:
DoS attacks deliberately overwhelm systems to make them unavailable. Distributed DoS (DDoS) attacks amplify this by using many compromised machines (botnets) to generate attack traffic. Volumetric attacks flood bandwidth; protocol attacks exploit protocol weaknesses; application-layer attacks target specific vulnerable functionality.
Resource Exhaustion:
Even without malicious intent, systems can become unavailable when resources are exhausted. Memory leaks gradually consume RAM. Disk fills with logs. Thread pools are exhausted. File descriptors run out. These failures often develop slowly, then occur suddenly.
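When the cause is creeping exhaustion rather than an attack, the fix starts with visibility. A few standard commands cover the usual suspects; the service name below is a placeholder assumption.

```bash
#!/bin/bash
# Quick checks for common resource-exhaustion culprits.

# Disk space: a full /var (logs) is a classic cause of sudden outages
df -h /var

# System-wide file descriptors: allocated vs. maximum
cat /proc/sys/fs/file-nr

# Open descriptors for one service (a leak if this grows without bound)
PID=$(pgrep -o myservice)    # "myservice" is a placeholder
ls "/proc/$PID/fd" | wc -l

# Memory headroom, including swap usage
free -m

# Processes with the most threads (thread-pool exhaustion, fork leaks)
ps -eo pid,comm,nlwp --sort=-nlwp | head
```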
The most dangerous availability threats are cascading failures. A single component failure increases load on remaining components, causing them to fail, which increases load further. A classic example: one server fails, remaining servers get more traffic, they slow down, timeouts cause retries, retries add more load, more servers fail. Designing to prevent cascading failures is crucial for high availability.
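One practical client-side countermeasure is to retry with exponential backoff and jitter instead of hammering a struggling service. The sketch below is illustrative; the URL, timeouts, and attempt counts are assumptions.

```bash
#!/bin/bash
# Retry with exponential backoff and jitter to avoid retry storms.
URL="https://api.internal.example/health"   # placeholder endpoint
MAX_ATTEMPTS=5
DELAY=1

for attempt in $(seq 1 "$MAX_ATTEMPTS"); do
    if curl -fsS --max-time 2 "$URL" > /dev/null; then
        echo "attempt $attempt succeeded"
        exit 0
    fi
    if [ "$attempt" -eq "$MAX_ATTEMPTS" ]; then
        echo "giving up after $MAX_ATTEMPTS attempts (fail fast rather than add load)"
        exit 1
    fi
    # Exponential backoff with jitter so clients do not retry in lockstep
    sleep $(( DELAY + RANDOM % DELAY ))
    DELAY=$(( DELAY * 2 ))
done
```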
Operating systems implement numerous mechanisms to maintain availability in the face of failures and attacks. These mechanisms span resource management, fault isolation, and recovery capabilities.
Process Isolation:
By isolating processes in separate address spaces, the OS prevents one process from crashing another. A bug in an application terminates that application, not the entire system. This fundamental isolation is the foundation of OS reliability.
Resource Limits:
OS controls prevent any single process from consuming all system resources. Limits on memory usage (ulimit, cgroups), file descriptors, CPU time, and other resources ensure that runaway processes cannot starve the rest of the system.
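For a single process or user session, the simplest interface is the shell's ulimit builtin; the values below are illustrative, and the table and cgroups example that follow show the broader toolkit.

```bash
#!/bin/bash
# Per-process resource limits via the shell's ulimit builtin (example values).
ulimit -n 4096       # max open file descriptors for this shell and its children
ulimit -u 2048       # max processes/threads for this user
ulimit -v 2097152    # max virtual memory, in KB (~2 GB)

# Show all current soft limits
ulimit -a

# Persistent per-user limits usually live in /etc/security/limits.conf, e.g.:
# appuser  hard  nofile  8192
# appuser  hard  nproc   4096
```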
| Mechanism | What It Protects | How It Works |
|---|---|---|
| Process Isolation | System from faulty applications | Separate address spaces, privilege boundaries |
| Resource Limits (ulimit, cgroups) | System from resource exhaustion | Per-process and per-group caps on resources |
| OOM Killer | System from memory exhaustion | Terminates processes when memory critically low |
| Watchdog Timers | System from hangs and deadlocks | Hardware/software timers reset system if not fed |
| Automatic Restart (systemd) | Services from crashes | Restart failed services with rate limiting |
| Journaling File Systems | File system from corruption | Transaction logging enables crash recovery |
| Memory Protection | System from memory corruption | Hardware-enforced access controls on memory |
| Rate Limiting | Services from overload | Kernel or application level request throttling |
```bash
#!/bin/bash
# Using cgroups v2 for resource isolation and availability

# Create a cgroup for a critical service
sudo mkdir -p /sys/fs/cgroup/critical-service

# Limit memory to 1GB (protects system from memory exhaustion)
echo "1G" | sudo tee /sys/fs/cgroup/critical-service/memory.max

# Set a high-memory threshold so reclaim pressure kicks in before the hard limit
echo "900M" | sudo tee /sys/fs/cgroup/critical-service/memory.high

# Cap CPU usage at 20% (20ms of every 100ms period) to prevent monopolization
echo "20000 100000" | sudo tee /sys/fs/cgroup/critical-service/cpu.max

# Give this cgroup high weight for CPU scheduling priority
echo "200" | sudo tee /sys/fs/cgroup/critical-service/cpu.weight

# Limit I/O bandwidth to prevent disk monopolization
# Format: "major:minor rbps=<bytes/s> wbps=<bytes/s> riops=<ops/s> wiops=<ops/s>"
echo "8:0 rbps=100000000 wbps=50000000" | \
    sudo tee /sys/fs/cgroup/critical-service/io.max

# Add the current shell's process to the cgroup
echo $$ | sudo tee /sys/fs/cgroup/critical-service/cgroup.procs

# With systemd, use service unit options instead:
# [Service]
# MemoryMax=1G
# MemoryHigh=900M
# CPUQuota=20%
# CPUWeight=200
# IOReadBandwidthMax=/dev/sda 100M

# Check resource usage
cat /sys/fs/cgroup/critical-service/memory.current
cat /sys/fs/cgroup/critical-service/cpu.stat
cat /sys/fs/cgroup/critical-service/io.stat
```

The OOM Killer:
When Linux runs out of memory and cannot reclaim enough through normal means, the Out-of-Memory (OOM) killer activates. It selects a process to terminate, freeing its memory for the rest of the system. While brutal, this mechanism prevents complete system hang from memory exhaustion. Processes can influence their OOM score to protect critical services.
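Concretely, this influence is exercised through /proc/&lt;pid&gt;/oom_score_adj (or the systemd OOMScoreAdjust= option). The sketch below uses placeholder service names.

```bash
#!/bin/bash
# Protect a critical process from the OOM killer (-1000 = never kill, +1000 = kill first).
PID=$(pgrep -o critical-service)     # placeholder service name
echo -500 | sudo tee /proc/$PID/oom_score_adj

# Mark an expendable batch job as the preferred victim instead
echo 500 | sudo tee /proc/$(pgrep -o batch-job)/oom_score_adj

# Equivalent per-service setting in a systemd unit:
# [Service]
# OOMScoreAdjust=-500
```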
No single mechanism guarantees availability. Effective systems combine process isolation, resource limits, monitoring, automatic restart, and architectural redundancy. When one layer fails to prevent a problem, another layer catches it. This defense in depth is as important for availability as for security.
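A concrete way to stack several of these layers is a systemd service unit. The sketch below is illustrative only: the service name, binary path, and limit values are assumptions, and the watchdog line requires the service to send sd_notify keep-alives.

```bash
#!/bin/bash
# Write an illustrative unit combining restart policy, resource limits, and a watchdog.
sudo tee /etc/systemd/system/critical-service.service <<'EOF'
[Unit]
Description=Critical service with layered availability protections
# Rate-limit restarts: at most 5 attempts within 5 minutes
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
ExecStart=/usr/local/bin/critical-service
# Automatic restart after crashes, with a pause between attempts
Restart=on-failure
RestartSec=5
# Resource limits, enforced through cgroups
MemoryMax=1G
CPUQuota=50%
TasksMax=256
# Restart the service if it stops sending sd_notify watchdog pings
WatchdogSec=30

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now critical-service
```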
Denial of Service attacks are among the most challenging availability threats because they exploit the asymmetry between attack cost and defense cost. An attacker with a botnet can generate traffic that costs them nothing but overwhelms expensive server infrastructure.
Operating System Level Defenses:
OS kernels implement multiple DoS mitigations. SYN cookies defend against SYN flood attacks without maintaining state for half-open connections. Connection rate limiting restricts new connections per source IP. Netfilter/iptables can drop packets based on patterns. BPF programs enable sophisticated packet filtering at wire speed.
```bash
#!/bin/bash
# Kernel-level DoS mitigations

# Enable SYN cookies (defends against SYN floods)
echo 1 > /proc/sys/net/ipv4/tcp_syncookies

# Retry SYN-ACKs fewer times so half-open connections expire sooner
echo 2 > /proc/sys/net/ipv4/tcp_synack_retries

# Increase SYN backlog
echo 2048 > /proc/sys/net/ipv4/tcp_max_syn_backlog

# Enable reverse path filtering (prevent IP spoofing)
echo 1 > /proc/sys/net/ipv4/conf/all/rp_filter

# Ignore ICMP broadcast (prevent Smurf attacks)
echo 1 > /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts

# Reduce TIME_WAIT sockets
echo 30 > /proc/sys/net/ipv4/tcp_fin_timeout
echo 1 > /proc/sys/net/ipv4/tcp_tw_reuse

# iptables rate limiting example
# Limit new SSH connections to 4 per minute per source IP
iptables -A INPUT -p tcp --dport 22 -m state --state NEW \
    -m recent --set --name SSH
iptables -A INPUT -p tcp --dport 22 -m state --state NEW \
    -m recent --update --seconds 60 --hitcount 4 --name SSH \
    -j DROP

# Connection limit for HTTP (reject sources with more than 50 concurrent connections)
iptables -A INPUT -p tcp --dport 80 -m connlimit \
    --connlimit-above 50 -j REJECT

# Per-IP packet rate limit
iptables -A INPUT -m hashlimit \
    --hashlimit-above 100/sec \
    --hashlimit-burst 200 \
    --hashlimit-mode srcip \
    --hashlimit-name http_rate \
    -p tcp --dport 80 -j DROP
```

Operating system defenses protect against moderate attacks but cannot defend against massive DDoS. When attack traffic exceeds your network bandwidth, even perfect kernel-level filtering is useless—packets flood your uplink before reaching your servers. Large-scale DDoS mitigation requires upstream filtering: ISP-level, CDN-based, or specialized DDoS mitigation services.
Achieving high availability requires architectural decisions that extend beyond individual server configuration. Redundancy, load balancing, and failover mechanisms ensure that component failures don't cause service outages.
Redundancy:
The fundamental principle of high availability is redundancy—having multiple instances of every critical component so that failure of one doesn't cause system failure. Redundancy applies at every layer: multiple servers, multiple network paths, multiple data centers, multiple power sources.
Failover:
When a component fails, traffic must automatically redirect to healthy components. Failover can be active-passive (standby takes over when primary fails) or active-active (all instances serve traffic, surviving instances absorb failed instance's load).
| Pattern | Description | Complexity | Recovery Time |
|---|---|---|---|
| Active-Passive | Standby takes over on primary failure | Low | Seconds to minutes |
| Active-Active | All nodes serve traffic, survivors absorb failed load | Medium | Milliseconds to seconds |
| N+1 Redundancy | One spare for every N active components | Low | Depends on failover mechanism |
| N+M Redundancy | M spares for N active components | Medium | Depends on failover mechanism |
| Geographic Distribution | Components spread across data centers/regions | High | Depends on replication lag |
| Hot Standby | Standby receives all updates, ready to serve immediately | Medium | Seconds |
| Warm Standby | Standby periodically synced, needs brief preparation | Low | Minutes |
| Cold Standby | Standby maintained offline, requires full restore | Lowest | Hours |
Load Balancing:
Load balancers distribute traffic across multiple backend servers, enabling horizontal scaling and automatic failover. When a backend server fails health checks, the load balancer stops sending it traffic. From the client's perspective, the service remains available.
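At the software level, a reverse proxy such as nginx illustrates the pattern. The upstream addresses, ports, and thresholds below are assumptions; open-source nginx performs passive health checks, marking a backend as down after repeated failures.

```bash
#!/bin/bash
# Illustrative nginx load-balancing configuration with failover between backends.
sudo tee /etc/nginx/conf.d/app-upstream.conf <<'EOF'
upstream app_backend {
    # Stop sending traffic to a backend after 3 failures within 30 seconds
    server 10.0.0.11:8080 max_fails=3 fail_timeout=30s;
    server 10.0.0.12:8080 max_fails=3 fail_timeout=30s;
    # Only used if the others are unavailable
    server 10.0.0.13:8080 backup;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_backend;
        # Retry the next backend on connection errors, timeouts, or 502/503
        proxy_next_upstream error timeout http_502 http_503;
        proxy_connect_timeout 2s;
    }
}
EOF

sudo nginx -t && sudo systemctl reload nginx
```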
Health Checking:
Health checks are the nervous system of high availability. Load balancers, orchestration platforms, and monitoring systems continuously probe components to detect failures. Health checks can be shallow (is the process responding at all?) or deep (are its dependencies, such as databases and caches, actually reachable and working?), and they commonly take the form of liveness probes (should the process be restarted?) and readiness probes (should it receive traffic?), as the monitoring example later on this page shows.
```bash
# Setting up high availability with Pacemaker/Corosync
# This creates an active-passive cluster with automatic failover

# Install on both nodes
apt install pacemaker corosync pcs resource-agents

# Initialize cluster authentication
pcs cluster auth node1 node2 -u hacluster -p password

# Create the cluster
pcs cluster setup --name ha-cluster node1 node2

# Start cluster services
pcs cluster start --all
pcs cluster enable --all

# Configure a virtual IP that floats between nodes
pcs resource create VirtualIP ocf:heartbeat:IPaddr2 \
    ip=192.168.1.100 \
    cidr_netmask=24 \
    op monitor interval=30s

# Configure a service that depends on the virtual IP
pcs resource create WebServer ocf:heartbeat:apache \
    configfile=/etc/apache2/apache2.conf \
    statusurl="http://localhost/server-status" \
    op monitor interval=30s

# Ensure WebServer runs where VirtualIP runs
pcs constraint colocation add WebServer with VirtualIP INFINITY

# Ensure VirtualIP starts before WebServer
pcs constraint order VirtualIP then WebServer

# Configure STONITH (Shoot The Other Node In The Head)
# Critical for preventing split-brain scenarios
pcs stonith create fence_node1 fence_ipmilan \
    ipaddr=192.168.1.11 \
    login=admin passwd=secret \
    pcmk_host_list="node1"

# Check cluster status
pcs status
# Cluster name: ha-cluster
# Node List:
#   * Online: [ node1 node2 ]
#
# Full List of Resources:
#   * VirtualIP (ocf:heartbeat:IPaddr2): Started node1
#   * WebServer (ocf:heartbeat:apache): Started node1
```

In high-availability clusters, split-brain occurs when nodes lose communication and each believes it should take over. This can cause data corruption if both write to shared storage. Quorum (requiring a majority of nodes to agree) and STONITH (fencing nodes that cannot communicate) prevent split-brain by ensuring only one partition can be active.
Availability often exists in tension with confidentiality and integrity. Security measures that protect data can reduce availability, and availability measures can create security vulnerabilities. Understanding these tradeoffs is essential for balanced system design.
Availability vs. Confidentiality:
Strong authentication protects confidentiality but can reduce availability. If authentication systems fail, legitimate users are locked out. Complex password policies cause forgotten credentials. Multi-factor authentication adds failure points. Emergency access provisions (break-glass procedures) restore availability but create confidentiality risks.
Availability vs. Integrity:
Integrity checks take time and can cause availability bottlenecks. Cryptographic verification of every read operation adds latency. Transactional consistency (ACID) requires coordination that limits throughput. Systems that prioritize availability (like eventually consistent databases) sacrifice strict integrity guarantees.
| Security Measure | Availability Impact | Resolution Approach |
|---|---|---|
| Strong authentication | Single point of failure if auth system down | Redundant auth systems, cached credentials, break-glass |
| Encryption at rest | Key management failures cause inaccessibility | Key escrow, multiple key custodians, HSM redundancy |
| Integrity verification | Verification latency reduces throughput | Parallel verification, caching, risk-based checks |
| Audit logging | Log system failure can block operations | Asynchronous logging, local buffering, fail-open policy |
| Input validation | Aggressive filtering may reject legitimate requests | Tuned signatures, allowlisting, feedback mechanisms |
| Access controls | Complex policies slow authorization decisions | Policy caching, simplified rules, delegation |
The CAP Theorem:
For distributed systems, the CAP theorem formalizes fundamental tradeoffs. A distributed system cannot simultaneously guarantee Consistency (all nodes see same data), Availability (every request receives a response), and Partition tolerance (system continues despite network failures). Since partitions are inevitable, systems must choose between consistency and availability during partitions.
Fail-Open vs. Fail-Closed:
Security systems must decide how to fail. Fail-closed (deny access when security system fails) prioritizes confidentiality but sacrifices availability. Fail-open (allow access when security system fails) maintains availability but creates security risk. The correct choice depends on the specific application's requirements.
There's no universal right answer to availability vs. security tradeoffs. A hospital system should fail-open for patient records during emergencies—patient lives outweigh privacy risk. A nuclear facility should fail-closed for access controls—safety outweighs availability. Design decisions must reflect the specific context and consequences.
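The fail-open versus fail-closed decision can also be made explicit in code. The sketch below is purely illustrative: auth-check is a hypothetical command assumed to exit 0 to allow, 1 to deny, and anything else (or a timeout) when the authorization service itself is unavailable.

```bash
#!/bin/bash
# Illustrative fail-open vs. fail-closed handling around a hypothetical auth-check command.
FAIL_MODE="closed"   # "open" prioritizes availability, "closed" prioritizes confidentiality
user="$1"
resource="$2"

timeout 2 auth-check "$user" "$resource"
rc=$?

if [ "$rc" -eq 0 ]; then
    decision="allow"                  # authorization service answered: allow
elif [ "$rc" -eq 1 ]; then
    decision="deny"                   # authorization service answered: deny
else
    # Authorization service unreachable or timed out: the failure policy decides
    if [ "$FAIL_MODE" = "open" ]; then
        decision="allow"
    else
        decision="deny"
    fi
fi

echo "access decision for $user on $resource: $decision"
```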
You cannot improve what you cannot measure. Availability monitoring is essential for understanding current state, detecting problems early, and driving improvement initiatives.
Service Level Concepts:
Availability targets are formalized through Service Level documents:

- SLI (Service Level Indicator): a measured quantity, such as the fraction of successful requests or request latency.
- SLO (Service Level Objective): the internal target set for an SLI, for example "99.9% of requests succeed over a rolling 30 days."
- SLA (Service Level Agreement): the contractual commitment made to customers, usually with financial penalties when it is missed.
Key Metrics:
Beyond simple uptime percentage, meaningful availability measurement considers:

- MTBF (Mean Time Between Failures): how long the system typically runs before a failure.
- MTTR (Mean Time To Repair): how long recovery takes once a failure occurs; availability can also be derived from these two figures, as the sketch below shows.
- Error rate: the fraction of requests that fail even while the system is nominally "up."
- Latency: a system that responds too slowly is effectively unavailable, so latency percentiles matter alongside uptime.
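As a quick illustration of the MTBF/MTTR relationship (the figures below are assumptions):

```bash
#!/bin/bash
# Availability derived from failure statistics: MTBF / (MTBF + MTTR).
MTBF_HOURS=720     # mean time between failures (about 30 days)
MTTR_HOURS=2       # mean time to repair

awk -v mtbf="$MTBF_HOURS" -v mttr="$MTTR_HOURS" 'BEGIN {
    printf "Availability = %.4f%%\n", mtbf / (mtbf + mttr) * 100
}'
# Prints: Availability = 99.7230%
```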
```python
from flask import Flask, jsonify
import psycopg2
import redis
import time

app = Flask(__name__)

class HealthChecker:
    """
    Comprehensive health checking for availability monitoring.
    Returns detailed status for each dependency.
    """

    def check_database(self):
        """Deep health check for database connectivity."""
        start = time.time()
        try:
            conn = psycopg2.connect(
                host="db-host",
                dbname="app",
                connect_timeout=5
            )
            # Execute an actual query to verify more than just the TCP connection
            cur = conn.cursor()
            cur.execute("SELECT 1")
            cur.fetchone()
            conn.close()
            return {
                "status": "healthy",
                "latency_ms": (time.time() - start) * 1000
            }
        except Exception as e:
            return {
                "status": "unhealthy",
                "error": str(e),
                "latency_ms": (time.time() - start) * 1000
            }

    def check_cache(self):
        """Health check for Redis cache."""
        start = time.time()
        try:
            r = redis.Redis(host='redis-host', socket_timeout=5)
            r.ping()
            return {
                "status": "healthy",
                "latency_ms": (time.time() - start) * 1000
            }
        except Exception as e:
            return {
                "status": "unhealthy",
                "error": str(e)
            }

    def full_check(self):
        """Aggregate health of all dependencies."""
        checks = {
            "database": self.check_database(),
            "cache": self.check_cache(),
        }
        # Overall status: healthy only if all dependencies are healthy
        overall_healthy = all(
            c["status"] == "healthy" for c in checks.values()
        )
        return {
            "status": "healthy" if overall_healthy else "unhealthy",
            "checks": checks,
            "timestamp": time.time()
        }

health = HealthChecker()

@app.route('/health/live')
def liveness():
    """
    Liveness probe: Is the process running?
    Used by Kubernetes to decide whether to restart the container.
    """
    return jsonify({"status": "alive"}), 200

@app.route('/health/ready')
def readiness():
    """
    Readiness probe: Can the service handle requests?
    Used by load balancers to include/exclude from rotation.
    """
    result = health.full_check()
    status_code = 200 if result["status"] == "healthy" else 503
    return jsonify(result), status_code

@app.route('/health/detailed')
def detailed_health():
    """Detailed health for monitoring dashboards."""
    return jsonify(health.full_check())
```

Error budgets create explicit tradeoffs between reliability and feature velocity. If a service is within its error budget, teams can take risks with new deployments. If the error budget is exhausted, teams must focus on reliability improvements. This approach, popularized by Google's SRE practice, turns availability from an abstract goal into concrete engineering decisions.
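A small calculation makes the error-budget idea concrete; the SLO, period, and observed downtime below are assumptions.

```bash
#!/bin/bash
# How much of the quarterly error budget remains under a 99.9% SLO?
SLO=99.9
PERIOD_MINUTES=$((90 * 24 * 60))     # a 90-day quarter
USED_DOWNTIME_MIN=95                 # observed downtime so far

awk -v slo="$SLO" -v period="$PERIOD_MINUTES" -v used="$USED_DOWNTIME_MIN" 'BEGIN {
    budget = (100 - slo) / 100 * period
    remaining = budget - used
    printf "Budget: %.1f min, used: %.1f min, remaining: %.1f min (%.0f%% left)\n",
           budget, used, remaining, remaining / budget * 100
}'
```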
Availability ensures that systems and data are accessible when authorized users need them. Operating systems implement availability through resource management, fault isolation, and recovery mechanisms, while architectures achieve high availability through redundancy, load balancing, and failover. These protections must be balanced against confidentiality and integrity, and verified continuously through monitoring, service level objectives, and error budgets.
What's Next:
We've now completed the CIA triad—the three fundamental protection goals. However, protection requires knowing who is making requests and what they're allowed to do. In the next page, we'll explore Authentication—the process of verifying identity that enables us to apply confidentiality, integrity, and availability controls to the right people.
You now understand availability as a protection goal, the threats that endanger it, and the mechanisms operating systems and architectures use to maintain it. This knowledge enables you to design systems that remain accessible despite failures and attacks. Next, we'll examine authentication—how systems verify who is making requests.