Loading learning content...
Consider a seemingly impossible task: take any file—from a one-byte text file to a multi-terabyte database—and reduce it to a fixed-size "fingerprint" of exactly 256 bits. This fingerprint must have remarkable properties:
This seemingly magical construct is the cryptographic hash function, one of the most versatile primitives in computer security. Hash functions are everywhere: they verify file integrity, store passwords securely, power blockchain consensus, authenticate messages, generate pseudorandom numbers, and enable digital signatures. Understanding hash functions is essential for any systems engineer.
By the end of this page, you will understand the essential properties of cryptographic hash functions, know the internal structure of modern algorithms like SHA-256 and SHA-3, recognize the difference between hash functions and MACs, understand how hash functions enable password storage and integrity verification, and see how operating systems rely on hashing throughout their security architecture.
A cryptographic hash function H takes an input of arbitrary length and produces a fixed-length output called the hash, digest, or fingerprint. For security applications, H must satisfy three critical properties:
1. Preimage Resistance (One-Way Property)
Given a hash value h, it must be computationally infeasible to find any input m such that H(m) = h. You cannot reverse the hash to recover the input.
Security goal: Given only a hash, an attacker cannot find the original data or any data that produces that hash.
2. Second Preimage Resistance (Weak Collision Resistance)
Given an input m₁ and its hash H(m₁), it must be computationally infeasible to find a different input m₂ ≠ m₁ such that H(m₂) = H(m₁).
Security goal: An attacker cannot find a substitute input that produces the same hash as a known input.
3. Collision Resistance (Strong Collision Resistance)
It must be computationally infeasible to find any two distinct inputs m₁ ≠ m₂ such that H(m₁) = H(m₂).
Security goal: An attacker cannot find any pair of inputs with the same hash, even if they choose both inputs.
Collision resistance is harder to achieve than preimage resistance due to the Birthday Paradox. For a hash with n-bit output, finding a collision requires approximately 2^(n/2) operations on average, while finding a preimage requires 2^n operations. This is why SHA-256 (256 bits) provides ~128 bits of collision resistance but 256 bits of preimage resistance. This also explains why 128-bit hashes (like MD5) were deprecated—their collision resistance is only ~64 bits.
| Property | Attacker's Goal | Attack Complexity | Broken If... |
|---|---|---|---|
| Preimage resistance | Find m given H(m) | 2^n operations | Attacker can forge data matching a known hash |
| Second preimage resistance | Find m₂ given m₁ where H(m₁)=H(m₂) | 2^n operations | Attacker can substitute signed/verified documents |
| Collision resistance | Find any m₁≠m₂ where H(m₁)=H(m₂) | 2^(n/2) operations | Attacker can create pairs (one benign, one malicious) with same hash |
Additional Desirable Properties:
Avalanche Effect: A small change in the input produces a completely different hash. Changing a single bit should flip approximately half the output bits. This prevents attackers from learning about the input by observing patterns in the output.
Determinism: The same input always produces the same output. This is essential for verification—you must be able to recompute the hash to check it.
Efficiency: Hash computation should be fast. Modern hashes process data at gigabytes per second.
Length Extension Immunity: Some hash constructions (like Merkle-Damgård) allow attackers to compute H(m || extension) given only H(m), without knowing m. Modern hashes (SHA-3, BLAKE3) are designed to resist this.
The history of cryptographic hash functions is one of progressive strengthening as attacks improved. Understanding this evolution clarifies current best practices.
MD5 (Message Digest 5) — 1992
Designed by Ron Rivest, MD5 produces 128-bit hashes. It was widely used for checksum verification and password storage. However:
SHA-1 (Secure Hash Algorithm 1) — 1995
160-bit output, designed by NSA. Widely deployed in SSL/TLS, Git, and code signing.
SHA-2 Family — 2001
The current workhorse, designed by NSA. Multiple variants:
Status: SECURE — Current standard for most applications
| Algorithm | Output Size | Block Size | Security | Status |
|---|---|---|---|---|
| MD5 | 128 bits | 512 bits | Broken (collisions) | Never use |
| SHA-1 | 160 bits | 512 bits | Broken (collisions) | Deprecated |
| SHA-256 | 256 bits | 512 bits | ~128-bit collision | Standard |
| SHA-512 | 512 bits | 1024 bits | ~256-bit collision | High security |
| SHA-3-256 | 256 bits | 1088 bits | ~128-bit collision | Alternative standard |
| BLAKE3 | 256+ bits | 64 bytes | ~128-bit collision | Modern, fast |
SHA-3 (Keccak) — 2015
After SHA-1 weaknesses emerged, NIST held a competition (2007-2012) for SHA-3. Keccak won, designed by the same team behind Rijndael/AES. SHA-3 uses a completely different internal structure (sponge construction vs. Merkle-Damgård), providing algorithm diversity.
BLAKE3 — 2020
The fastest production hash function, derived from ChaCha20. Features:
BLAKE3 is increasingly adopted as a SHA-256 alternative where performance matters.
For new systems: SHA-256 for compatibility, SHA-3 for diversity, BLAKE3 for performance. For passwords: NEVER use plain hashes—use specialized password hashing functions (Argon2, bcrypt, scrypt). For HMACs: SHA-256 or SHA-3. For file integrity: SHA-256, BLAKE3, or SHA-512/256. If you see MD5 or SHA-1 in security-critical code, it needs updating.
SHA-256 uses the Merkle-Damgård construction, a framework for building hash functions from compression functions. Understanding this structure reveals both the elegance and the limitations of traditional hash design.
The Merkle-Damgård Structure:
Padding: The message is padded to a multiple of the block size (512 bits for SHA-256). Padding includes the original message length to prevent length extension attacks on the padding itself.
Initialization: An initial hash value (IV) of eight 32-bit words is set to specific constants (derived from the fractional parts of the square roots of the first eight primes).
Compression: The message is processed in 512-bit blocks. Each block goes through a compression function that updates the hash state.
Finalization: After all blocks are processed, the final state IS the hash.
The SHA-256 Compression Function:
Each 512-bit block is expanded into 64 32-bit words using a message schedule. The compression function then performs 64 rounds of mixing, each round using:
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849
// SHA-256 High-Level Structure function SHA256(message): // Step 1: Pad message to multiple of 512 bits // Append '1' bit, then '0' bits, then 64-bit length padded = PAD(message) // Step 2: Initialize hash state (8 × 32-bit words) H[0..7] = [ 0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19 ] // Constants from √2, √3, √5, √7, √11, √13, √17, √19 // Step 3: Process each 512-bit block for each 512-bit block M in padded: // Create message schedule (64 × 32-bit words) W[0..15] = M split into 16 words for t = 16 to 63: W[t] = σ₁(W[t-2]) + W[t-7] + σ₀(W[t-15]) + W[t-16] // Initialize working variables a, b, c, d, e, f, g, h = H[0..7] // 64 rounds of compression for t = 0 to 63: T1 = h + Σ₁(e) + Ch(e,f,g) + K[t] + W[t] T2 = Σ₀(a) + Maj(a,b,c) h = g g = f f = e e = d + T1 d = c c = b b = a a = T1 + T2 // Update hash state H[0..7] += [a, b, c, d, e, f, g, h] // Step 4: Concatenate final state return H[0] || H[1] || ... || H[7] // Boolean functionsCh(x, y, z) = (x AND y) XOR (NOT x AND z) // "Choice"Maj(x, y, z) = (x AND y) XOR (x AND z) XOR (y AND z) // "Majority" // Mixing functions (rotations and shifts)Σ₀(x) = ROTR(x, 2) XOR ROTR(x, 13) XOR ROTR(x, 22)Σ₁(x) = ROTR(x, 6) XOR ROTR(x, 11) XOR ROTR(x, 25)Merkle-Damgård hashes (MD5, SHA-1, SHA-256) are vulnerable to length extension attacks. Given H(secret || message) and the message length, an attacker can compute H(secret || message || padding || extension) WITHOUT knowing the secret. This is devastating for naive MACs like H(secret || message). Solution: Use HMAC or switch to SHA-3/BLAKE3 which are immune.
Plain hash functions verify integrity (the data hasn't changed) but not authenticity (the data came from a trusted source). Anyone can compute SHA-256 of any data. To prove that data originated from someone who knows a secret key, we need Message Authentication Codes (MACs).
MAC Definition:
A MAC is a keyed hash function: MAC(K, M) produces a tag T. To verify, the recipient recomputes MAC(K, M) and checks if it matches T. Only parties knowing K can produce valid MACs.
HMAC (Hash-based MAC):
The standard construction for building a MAC from a hash function, defined in RFC 2104:
HMAC(K, M) = H((K ⊕ opad) || H((K ⊕ ipad) || M))
Where:
HMAC is provably secure if the underlying hash function is a pseudorandom function. It's immune to length extension attacks.
Why Not Just H(K || M)?
As discussed, length extension attacks allow computing H(K || M || extension) without knowing K. This completely breaks authentication. HMAC's nested structure prevents this.
123456789101112131415161718192021222324252627282930313233343536
// HMAC Construction and Usage function HMAC(key, message, hash_function): block_size = hash_function.block_size // 64 bytes for SHA-256 // Normalize key length if length(key) > block_size: key = hash_function(key) // Hash long keys if length(key) < block_size: key = key || zeros(block_size - length(key)) // Pad short keys // Compute inner and outer padded keys o_key_pad = key XOR repeat(0x5c, block_size) // Outer padding i_key_pad = key XOR repeat(0x36, block_size) // Inner padding // Double hashing prevents length extension inner_hash = hash_function(i_key_pad || message) outer_hash = hash_function(o_key_pad || inner_hash) return outer_hash // Usage example: Authenticating an API requestshared_secret = "my_secret_key_256_bits"request_body = '{"action": "transfer", "amount": 1000}' // Generate authentication tagmac_tag = HMAC(shared_secret, request_body, SHA256) // Send: request_body + mac_tag // Recipient verifies by recomputingexpected_tag = HMAC(shared_secret, received_body, SHA256)if constant_time_compare(expected_tag, received_tag): accept_request()else: reject_request() // Tampered or wrong keyAuthenticated Encryption with Associated Data (AEAD) modes like AES-GCM and ChaCha20-Poly1305 combine encryption AND authentication in one primitive, using Poly1305 or GHASH as the MAC. This is the modern approach—encrypt-then-MAC is error-prone, while AEAD handles both correctly by design.
Password storage is a critical application of hash functions, but it requires specialized approaches. Standard cryptographic hashes like SHA-256 are wrong for password hashing, despite being used (incorrectly) in many systems.
The Problem with Fast Hashes:
SHA-256 is designed to be fast—billions of hashes per second on modern GPUs. This is a feature for file integrity, but a catastrophe for passwords:
No amount of password complexity defeats hardware designed to maximize hash rate (cryptocurrency mining ASICs compute trillions of hashes per second).
The Solution: Slow by Design
Password hashing functions are deliberately computationally expensive, using:
| Function | Year | Cost Factor | Strengths | Status |
|---|---|---|---|---|
| bcrypt | 1999 | CPU iterations | Battle-tested, widely available | Good, but limited memory |
| scrypt | 2009 | CPU + memory | Memory-hard, defeats GPUs | Good for high-security |
| Argon2i | 2015 | CPU + memory + threads | Side-channel resistant | Recommended for servers |
| Argon2id | 2015 | CPU + memory + threads | Balanced hybrid | Default recommendation |
| PBKDF2 | 2000 | CPU iterations | NIST approved, ubiquitous | Acceptable, prefer Argon2 |
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748
// Proper Password Hashing (using Argon2id) function HASH_PASSWORD(password): // Generate random salt (16-32 bytes) salt = generate_random_bytes(16) // Argon2id parameters (tune for ~0.5-1 second on target hardware) memory_cost = 65536 // 64 MB RAM time_cost = 3 // 3 iterations parallelism = 4 // 4 threads hash_length = 32 // 256-bit output // Compute memory-hard hash hash = argon2id( password, salt, memory_cost, time_cost, parallelism, hash_length ) // Store salt + parameters + hash // Format: $argon2id$v=19$m=65536,t=3,p=4$salt_base64$hash_base64 return encode_for_storage(salt, memory_cost, time_cost, parallelism, hash) function VERIFY_PASSWORD(password, stored_hash): // Parse stored hash to extract parameters (salt, memory_cost, time_cost, parallelism, expected_hash) = parse_stored_hash(stored_hash) // Recompute hash with same parameters computed_hash = argon2id( password, salt, memory_cost, time_cost, parallelism, length(expected_hash) ) // Constant-time comparison (prevent timing attacks) return constant_time_compare(computed_hash, expected_hash) // NEVER DO THIS:// bad_hash = SHA256(password) // Too fast!// bad_hash = SHA256(password + salt) // Still too fast!// bad_hash = MD5(password) // Broken AND too fast!NEVER use SHA-256, SHA-1, or MD5 for passwords—they're designed to be fast. NEVER hash without a salt—enables rainbow table attacks. NEVER use a global salt—must be unique per password. NEVER store passwords in plaintext or reversible encryption. ALWAYS use timing-safe comparison to prevent timing attacks. When in doubt, use Argon2id with recommended parameters.
Operating systems rely on hash functions for integrity, authentication, and performance optimization throughout their architecture.
1234567891011121314151617181920212223242526272829
# Linux Hash Function Examples # View password hash format in shadow filesudo cat /etc/shadow | grep myuser# Format: $algorithm$salt$hash# $6$ = SHA-512, $y$ = yescrypt # Compute file hashsha256sum /bin/ls# Outputs: <64-char hex hash> /bin/ls # Verify file against known hashecho "expected_hash_here /bin/ls" | sha256sum -c# Outputs: /bin/ls: OK # HMAC computation with opensslecho -n "message" | openssl dgst -sha256 -hmac "secret_key" # View dm-verity hash tree (Android/Chrome OS)veritysetup dump /dev/sda1 # Check package signatures (Debian/Ubuntu)apt-get download package_namesha256sum package_name*.deb# Compare with Packages file signed by archive key # Btrfs checksum statussudo btrfs scrub start /mnt/datasudo btrfs scrub status /mnt/dataNot all OS hashing uses cryptographic functions. Hash tables use fast non-cryptographic hashes (SipHash for DoS resistance, xxHash for speed). File system checksums may use CRC32 (fast, error detection) or xxHash rather than SHA-256. Cryptographic hashes are reserved for security applications where collision resistance matters. Choose the right hash for the job.
Understanding hash function attacks illuminates why certain algorithms are deprecated and informs secure usage.
| Algorithm | Collision Attack | Preimage Attack | Practical Impact |
|---|---|---|---|
| MD5 | Broken (seconds) | Theoretical weaknesses | Can create colliding PDFs, certificates |
| SHA-1 | Broken (2017, $110K compute) | Secure | SHAttered: malicious PDFs with same hash |
| SHA-256 | Secure (no known attack) | Secure | Standard for security applications |
| SHA-3 | Secure (no known attack) | Secure | Immune to length extension |
| BLAKE3 | Secure (no known attack) | Secure | Modern, high performance |
Use SHA-256 or SHA-3 for security applications—never MD5 or SHA-1. Use HMAC for authentication, not plain H(key || message). Use Argon2id for passwords with unique salts. Use constant-time comparison for all security-critical comparisons. Keep up with cryptographic recommendations as attacks improve.
We've explored cryptographic hash functions—the versatile primitives that create fixed-size fingerprints of arbitrary data, enabling integrity verification, authentication, and secure password storage.
Looking Ahead:
Hash functions alone verify integrity but don't prove origin. Asymmetric encryption provides confidentiality but doesn't prove that a specific party created the data. The next page explores digital signatures—the combination of hash functions and asymmetric cryptography that enables non-repudiable proof of authorship, forming the foundation for code signing, secure boot, and the certificate authority system.
You now understand cryptographic hash functions—their properties, algorithms, constructions, and applications. This knowledge is essential for understanding digital signatures, password security, integrity verification, and the hash-based security mechanisms throughout operating systems.