Operating SystemsSecurity Fundamentals

Cryptography Basics

LevelIntermediate

Duration90 mins

TopicSecurity Fundamentals

3 / 5

Hash Functions

Digital Fingerprints

Consider a seemingly impossible task: take any file—from a one-byte text file to a multi-terabyte database—and reduce it to a fixed-size "fingerprint" of exactly 256 bits. This fingerprint must have remarkable properties:

Uniqueness: Different files should produce different fingerprints (with overwhelming probability)
Sensitivity: Change a single bit anywhere in the input, and the fingerprint changes completely
Irreversibility: Given only the fingerprint, reconstructing the original file must be computationally impossible
Speed: Computing the fingerprint should be fast, even for massive files

This seemingly magical construct is the cryptographic hash function, one of the most versatile primitives in computer security. Hash functions are everywhere: they verify file integrity, store passwords securely, power blockchain consensus, authenticate messages, generate pseudorandom numbers, and enable digital signatures. Understanding hash functions is essential for any systems engineer.

What You Will Learn

By the end of this page, you will understand the essential properties of cryptographic hash functions, know the internal structure of modern algorithms like SHA-256 and SHA-3, recognize the difference between hash functions and MACs, understand how hash functions enable password storage and integrity verification, and see how operating systems rely on hashing throughout their security architecture.

Fundamental Properties

A cryptographic hash function H takes an input of arbitrary length and produces a fixed-length output called the hash, digest, or fingerprint. For security applications, H must satisfy three critical properties:

1. Preimage Resistance (One-Way Property)

Given a hash value h, it must be computationally infeasible to find any input m such that H(m) = h. You cannot reverse the hash to recover the input.

Security goal: Given only a hash, an attacker cannot find the original data or any data that produces that hash.

2. Second Preimage Resistance (Weak Collision Resistance)

Given an input m₁ and its hash H(m₁), it must be computationally infeasible to find a different input m₂ ≠ m₁ such that H(m₂) = H(m₁).

Security goal: An attacker cannot find a substitute input that produces the same hash as a known input.

3. Collision Resistance (Strong Collision Resistance)

It must be computationally infeasible to find any two distinct inputs m₁ ≠ m₂ such that H(m₁) = H(m₂).

Security goal: An attacker cannot find any pair of inputs with the same hash, even if they choose both inputs.

Birthday Paradox Implications

Collision resistance is harder to achieve than preimage resistance due to the Birthday Paradox. For a hash with n-bit output, finding a collision requires approximately 2^(n/2) operations on average, while finding a preimage requires 2^n operations. This is why SHA-256 (256 bits) provides ~128 bits of collision resistance but 256 bits of preimage resistance. This also explains why 128-bit hashes (like MD5) were deprecated—their collision resistance is only ~64 bits.

Hash Function Security Properties
Property	Attacker's Goal	Attack Complexity	Broken If...
Preimage resistance	Find m given H(m)	2^n operations	Attacker can forge data matching a known hash
Second preimage resistance	Find m₂ given m₁ where H(m₁)=H(m₂)	2^n operations	Attacker can substitute signed/verified documents
Collision resistance	Find any m₁≠m₂ where H(m₁)=H(m₂)	2^(n/2) operations	Attacker can create pairs (one benign, one malicious) with same hash

Additional Desirable Properties:

Avalanche Effect: A small change in the input produces a completely different hash. Changing a single bit should flip approximately half the output bits. This prevents attackers from learning about the input by observing patterns in the output.

Determinism: The same input always produces the same output. This is essential for verification—you must be able to recompute the hash to check it.

Efficiency: Hash computation should be fast. Modern hashes process data at gigabytes per second.

Length Extension Immunity: Some hash constructions (like Merkle-Damgård) allow attackers to compute H(m || extension) given only H(m), without knowing m. Modern hashes (SHA-3, BLAKE3) are designed to resist this.

Major Hash Algorithms

The history of cryptographic hash functions is one of progressive strengthening as attacks improved. Understanding this evolution clarifies current best practices.

MD5 (Message Digest 5) — 1992

Designed by Ron Rivest, MD5 produces 128-bit hashes. It was widely used for checksum verification and password storage. However:

2004: Collision found in under an hour
2008: Practical attack demonstrated (rogue CA certificate)
Status: BROKEN — Never use for security purposes

SHA-1 (Secure Hash Algorithm 1) — 1995

160-bit output, designed by NSA. Widely deployed in SSL/TLS, Git, and code signing.

2005: Theoretical collision attack published
2017: Practical collision demonstrated (SHAttered attack, Google/CWI)
Status: DEPRECATED — Avoid for new systems, migrate existing uses

SHA-2 Family — 2001

The current workhorse, designed by NSA. Multiple variants:

SHA-224: 224-bit output (truncated SHA-256)
SHA-256: 256-bit output, most commonly used
SHA-384: 384-bit output (truncated SHA-512)
SHA-512: 512-bit output, faster on 64-bit processors
SHA-512/256: 256-bit output from SHA-512 structure

Status: SECURE — Current standard for most applications

Cryptographic Hash Function Comparison
Algorithm	Output Size	Block Size	Security	Status
MD5	128 bits	512 bits	Broken (collisions)	Never use
SHA-1	160 bits	512 bits	Broken (collisions)	Deprecated
SHA-256	256 bits	512 bits	~128-bit collision	Standard
SHA-512	512 bits	1024 bits	~256-bit collision	High security
SHA-3-256	256 bits	1088 bits	~128-bit collision	Alternative standard
BLAKE3	256+ bits	64 bytes	~128-bit collision	Modern, fast

SHA-3 (Keccak) — 2015

After SHA-1 weaknesses emerged, NIST held a competition (2007-2012) for SHA-3. Keccak won, designed by the same team behind Rijndael/AES. SHA-3 uses a completely different internal structure (sponge construction vs. Merkle-Damgård), providing algorithm diversity.

Immune to length extension attacks
Variable output length (SHAKE128, SHAKE256)
Elegant mathematical design
Slightly slower than SHA-2 in software, excellent in hardware

BLAKE3 — 2020

The fastest production hash function, derived from ChaCha20. Features:

Parallelizable tree structure (scales with cores)
Streaming API (can hash incrementally)
Faster than SHA-256 even without parallelism
Built-in KDF and MAC modes

BLAKE3 is increasingly adopted as a SHA-256 alternative where performance matters.

Choosing a Hash Function Today

For new systems: SHA-256 for compatibility, SHA-3 for diversity, BLAKE3 for performance. For passwords: NEVER use plain hashes—use specialized password hashing functions (Argon2, bcrypt, scrypt). For HMACs: SHA-256 or SHA-3. For file integrity: SHA-256, BLAKE3, or SHA-512/256. If you see MD5 or SHA-1 in security-critical code, it needs updating.

Inside SHA-256: The Merkle-Damgård Construction

SHA-256 uses the Merkle-Damgård construction, a framework for building hash functions from compression functions. Understanding this structure reveals both the elegance and the limitations of traditional hash design.

The Merkle-Damgård Structure:

Padding: The message is padded to a multiple of the block size (512 bits for SHA-256). Padding includes the original message length to prevent length extension attacks on the padding itself.
Initialization: An initial hash value (IV) of eight 32-bit words is set to specific constants (derived from the fractional parts of the square roots of the first eight primes).
Compression: The message is processed in 512-bit blocks. Each block goes through a compression function that updates the hash state.
Finalization: After all blocks are processed, the final state IS the hash.

The SHA-256 Compression Function:

Each 512-bit block is expanded into 64 32-bit words using a message schedule. The compression function then performs 64 rounds of mixing, each round using:

Boolean functions (Ch, Maj) for non-linearity
Bitwise rotations and shifts for diffusion
Addition modulo 2³² for mixing
Round constants (derived from prime cube roots)

sha256_structure.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
// SHA-256 High-Level Structure
 
function SHA256(message):
    // Step 1: Pad message to multiple of 512 bits
    // Append '1' bit, then '0' bits, then 64-bit length
    padded = PAD(message)
    
    // Step 2: Initialize hash state (8 × 32-bit words)
    H[0..7] = [
        0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
        0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
    ]  // Constants from √2, √3, √5, √7, √11, √13, √17, √19
    
    // Step 3: Process each 512-bit block
    for each 512-bit block M in padded:
        // Create message schedule (64 × 32-bit words)
        W[0..15] = M split into 16 words
        for t = 16 to 63:
            W[t] = σ₁(W[t-2]) + W[t-7] + σ₀(W[t-15]) + W[t-16]
        
        // Initialize working variables
        a, b, c, d, e, f, g, h = H[0..7]
        
        // 64 rounds of compression
        for t = 0 to 63:
            T1 = h + Σ₁(e) + Ch(e,f,g) + K[t] + W[t]
            T2 = Σ₀(a) + Maj(a,b,c)
            h = g
            g = f
            f = e
            e = d + T1
            d = c
            c = b
            b = a
            a = T1 + T2
        
        // Update hash state
        H[0..7] += [a, b, c, d, e, f, g, h]
    
    // Step 4: Concatenate final state
    return H[0] || H[1] || ... || H[7]
 
// Boolean functions
Ch(x, y, z) = (x AND y) XOR (NOT x AND z)  // "Choice"
Maj(x, y, z) = (x AND y) XOR (x AND z) XOR (y AND z)  // "Majority"
 
// Mixing functions (rotations and shifts)
Σ₀(x) = ROTR(x, 2) XOR ROTR(x, 13) XOR ROTR(x, 22)
Σ₁(x) = ROTR(x, 6) XOR ROTR(x, 11) XOR ROTR(x, 25)

Length Extension Vulnerability

Merkle-Damgård hashes (MD5, SHA-1, SHA-256) are vulnerable to length extension attacks. Given H(secret || message) and the message length, an attacker can compute H(secret || message || padding || extension) WITHOUT knowing the secret. This is devastating for naive MACs like H(secret || message). Solution: Use HMAC or switch to SHA-3/BLAKE3 which are immune.

Message Authentication Codes (MACs)

Plain hash functions verify integrity (the data hasn't changed) but not authenticity (the data came from a trusted source). Anyone can compute SHA-256 of any data. To prove that data originated from someone who knows a secret key, we need Message Authentication Codes (MACs).

MAC Definition:

A MAC is a keyed hash function: MAC(K, M) produces a tag T. To verify, the recipient recomputes MAC(K, M) and checks if it matches T. Only parties knowing K can produce valid MACs.

HMAC (Hash-based MAC):

The standard construction for building a MAC from a hash function, defined in RFC 2104:

HMAC(K, M) = H((K ⊕ opad) || H((K ⊕ ipad) || M))

Where:

K is the key (padded to block size if needed)
opad is 0x5c repeated to block size
ipad is 0x36 repeated to block size

HMAC is provably secure if the underlying hash function is a pseudorandom function. It's immune to length extension attacks.

Why Not Just H(K || M)?

As discussed, length extension attacks allow computing H(K || M || extension) without knowing K. This completely breaks authentication. HMAC's nested structure prevents this.

hmac_usage.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
// HMAC Construction and Usage
 
function HMAC(key, message, hash_function):
    block_size = hash_function.block_size  // 64 bytes for SHA-256
    
    // Normalize key length
    if length(key) > block_size:
        key = hash_function(key)  // Hash long keys
    if length(key) < block_size:
        key = key || zeros(block_size - length(key))  // Pad short keys
    
    // Compute inner and outer padded keys
    o_key_pad = key XOR repeat(0x5c, block_size)  // Outer padding
    i_key_pad = key XOR repeat(0x36, block_size)  // Inner padding
    
    // Double hashing prevents length extension
    inner_hash = hash_function(i_key_pad || message)
    outer_hash = hash_function(o_key_pad || inner_hash)
    
    return outer_hash
 
// Usage example: Authenticating an API request
shared_secret = "my_secret_key_256_bits"
request_body = '{"action": "transfer", "amount": 1000}'
 
// Generate authentication tag
mac_tag = HMAC(shared_secret, request_body, SHA256)
 
// Send: request_body + mac_tag
 
// Recipient verifies by recomputing
expected_tag = HMAC(shared_secret, received_body, SHA256)
if constant_time_compare(expected_tag, received_tag):
    accept_request()
else:
    reject_request()  // Tampered or wrong key

HMAC Provides

•Authentication: Only key holders can generate valid MACs
•Integrity: Any modification invalidates the MAC
•Length extension immunity: Cannot extend authenticated messages

HMAC Does NOT Provide

•Confidentiality: Message is not encrypted
•Non-repudiation: Any key holder can generate the MAC
•Replay protection: Same message produces same MAC

Modern Alternative: AEAD

Authenticated Encryption with Associated Data (AEAD) modes like AES-GCM and ChaCha20-Poly1305 combine encryption AND authentication in one primitive, using Poly1305 or GHASH as the MAC. This is the modern approach—encrypt-then-MAC is error-prone, while AEAD handles both correctly by design.

Password Hashing

Password storage is a critical application of hash functions, but it requires specialized approaches. Standard cryptographic hashes like SHA-256 are wrong for password hashing, despite being used (incorrectly) in many systems.

The Problem with Fast Hashes:

SHA-256 is designed to be fast—billions of hashes per second on modern GPUs. This is a feature for file integrity, but a catastrophe for passwords:

Average password entropy: ~30-40 bits
GPU SHA-256 throughput: ~10 billion hashes/second
Time to brute-force 40-bit password: 2^40 / 10^10 ≈ 100 seconds

No amount of password complexity defeats hardware designed to maximize hash rate (cryptocurrency mining ASICs compute trillions of hashes per second).

The Solution: Slow by Design

Password hashing functions are deliberately computationally expensive, using:

CPU cost: Many iterations (thousands or millions of hash operations)
Memory cost: Large amounts of RAM required (defeats GPU parallelism)
Salt: Random value stored with hash, preventing rainbow tables

Password Hashing Functions
Function	Year	Cost Factor	Strengths	Status
bcrypt	1999	CPU iterations	Battle-tested, widely available	Good, but limited memory
scrypt	2009	CPU + memory	Memory-hard, defeats GPUs	Good for high-security
Argon2i	2015	CPU + memory + threads	Side-channel resistant	Recommended for servers
Argon2id	2015	CPU + memory + threads	Balanced hybrid	Default recommendation
PBKDF2	2000	CPU iterations	NIST approved, ubiquitous	Acceptable, prefer Argon2

password_hashing.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
// Proper Password Hashing (using Argon2id)
 
function HASH_PASSWORD(password):
    // Generate random salt (16-32 bytes)
    salt = generate_random_bytes(16)
    
    // Argon2id parameters (tune for ~0.5-1 second on target hardware)
    memory_cost = 65536    // 64 MB RAM
    time_cost = 3          // 3 iterations
    parallelism = 4        // 4 threads
    hash_length = 32       // 256-bit output
    
    // Compute memory-hard hash
    hash = argon2id(
        password,
        salt,
        memory_cost,
        time_cost,
        parallelism,
        hash_length
    )
    
    // Store salt + parameters + hash
    // Format: $argon2id$v=19$m=65536,t=3,p=4$salt_base64$hash_base64
    return encode_for_storage(salt, memory_cost, time_cost, parallelism, hash)
 
function VERIFY_PASSWORD(password, stored_hash):
    // Parse stored hash to extract parameters
    (salt, memory_cost, time_cost, parallelism, expected_hash) = 
        parse_stored_hash(stored_hash)
    
    // Recompute hash with same parameters
    computed_hash = argon2id(
        password,
        salt,
        memory_cost,
        time_cost,
        parallelism,
        length(expected_hash)
    )
    
    // Constant-time comparison (prevent timing attacks)
    return constant_time_compare(computed_hash, expected_hash)
 
// NEVER DO THIS:
// bad_hash = SHA256(password)  // Too fast!
// bad_hash = SHA256(password + salt)  // Still too fast!
// bad_hash = MD5(password)  // Broken AND too fast!

Common Password Hashing Mistakes

NEVER use SHA-256, SHA-1, or MD5 for passwords—they're designed to be fast. NEVER hash without a salt—enables rainbow table attacks. NEVER use a global salt—must be unique per password. NEVER store passwords in plaintext or reversible encryption. ALWAYS use timing-safe comparison to prevent timing attacks. When in doubt, use Argon2id with recommended parameters.

Hash Functions in Operating Systems

Operating systems rely on hash functions for integrity, authentication, and performance optimization throughout their architecture.

Key OS Applications

•Password Storage — /etc/shadow (Linux), SAM database (Windows) store password hashes, not plaintext. Modern systems use bcrypt or yescrypt with per-user salts.
•File System Integrity — Btrfs, ZFS use checksums (CRC32, xxHash, SHA-256) to detect data corruption. dm-verity hashes entire partitions for read-only integrity (Android, Chrome OS).
•Secure Boot and Code Signing — Boot components and drivers are hashed, and the hash is signed. The OS verifies the hash before execution, ensuring only authorized code runs.
•Package Management — Package managers verify downloaded packages against signed hashes. This detects tampering and corrupted downloads before installation.
•Memory Deduplication — Kernel Same-page Merging (KSM) uses hashes to identify duplicate memory pages, merging them to save RAM in virtualized environments.
•Network Protocol Authentication — IPsec, TLS use HMAC for packet authentication. The kernel's networking stack computes and verifies HMACs for encrypted connections.
•Randomness Extraction — The kernel's random number generator uses hash functions (SHA-256, BLAKE2) to extract entropy from hardware sources and mix the entropy pool.

linux_hash_examples.sh
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
# Linux Hash Function Examples
 
# View password hash format in shadow file
sudo cat /etc/shadow | grep myuser
# Format: $algorithm$salt$hash
# $6$ = SHA-512, $y$ = yescrypt
 
# Compute file hash
sha256sum /bin/ls
# Outputs: <64-char hex hash>  /bin/ls
 
# Verify file against known hash
echo "expected_hash_here  /bin/ls" | sha256sum -c
# Outputs: /bin/ls: OK
 
# HMAC computation with openssl
echo -n "message" | openssl dgst -sha256 -hmac "secret_key"
 
# View dm-verity hash tree (Android/Chrome OS)
veritysetup dump /dev/sda1
 
# Check package signatures (Debian/Ubuntu)
apt-get download package_name
sha256sum package_name*.deb
# Compare with Packages file signed by archive key
 
# Btrfs checksum status
sudo btrfs scrub start /mnt/data
sudo btrfs scrub status /mnt/data

Performance vs. Cryptographic Hashes

Not all OS hashing uses cryptographic functions. Hash tables use fast non-cryptographic hashes (SipHash for DoS resistance, xxHash for speed). File system checksums may use CRC32 (fast, error detection) or xxHash rather than SHA-256. Cryptographic hashes are reserved for security applications where collision resistance matters. Choose the right hash for the job.

Attacks and Defenses

Understanding hash function attacks illuminates why certain algorithms are deprecated and informs secure usage.

Attack Types

•Collision Attacks — Finding two inputs with the same hash. MD5 and SHA-1 are broken this way. Impact: Attackers can create malicious files with the same hash as legitimate ones, defeating signatures that only sign hashes.
•Preimage Attacks — Finding an input that produces a specific hash. No practical attacks against modern hashes, but reduced-round attacks exist. Impact: Would allow forging data matching a known hash.
•Length Extension Attacks — Computing H(message || extension) given only H(message). Affects MD5, SHA-1, SHA-256. Impact: Breaks naive MAC constructions like H(key || message).
•Rainbow Table Attacks — Precomputed hash → password tables. Defeated by salting (each password gets a unique random salt). Impact: Unsalted password databases can be cracked instantly against known wordlists.
•Timing Attacks — Measuring comparison time to learn hash bytes. If comparison stops at first mismatch, time reveals how many bytes matched. Defense: Constant-time comparison functions.

Hash Function Attack Status
Algorithm	Collision Attack	Preimage Attack	Practical Impact
MD5	Broken (seconds)	Theoretical weaknesses	Can create colliding PDFs, certificates
SHA-1	Broken (2017, $110K compute)	Secure	SHAttered: malicious PDFs with same hash
SHA-256	Secure (no known attack)	Secure	Standard for security applications
SHA-3	Secure (no known attack)	Secure	Immune to length extension
BLAKE3	Secure (no known attack)	Secure	Modern, high performance

Defense Best Practices

Use SHA-256 or SHA-3 for security applications—never MD5 or SHA-1. Use HMAC for authentication, not plain H(key || message). Use Argon2id for passwords with unique salts. Use constant-time comparison for all security-critical comparisons. Keep up with cryptographic recommendations as attacks improve.

Summary: Hash Functions

We've explored cryptographic hash functions—the versatile primitives that create fixed-size fingerprints of arbitrary data, enabling integrity verification, authentication, and secure password storage.

Key Takeaways

•Hash functions produce fixed-size outputs from arbitrary inputs — SHA-256 always produces 256 bits regardless of input size.
•Three security properties: preimage resistance, second preimage resistance, collision resistance — Collision resistance is hardest (birthday bound: 2^(n/2)).
•MD5 and SHA-1 are broken — Collisions can be computed practically. Use SHA-256, SHA-3, or BLAKE3.
•HMAC provides keyed authentication — Use HMAC(K, M) not H(K || M) to avoid length extension attacks.
•Password hashing requires slow functions — Fast hashes (SHA-256) are wrong; use Argon2id, bcrypt, or scrypt with unique salts.
•Operating systems use hashes extensively — Password storage, file integrity, secure boot, package verification, memory deduplication.
•Match the hash to the use case — Cryptographic hashes for security, fast hashes (xxHash) for performance, password hashes for credentials.

Looking Ahead:

Hash functions alone verify integrity but don't prove origin. Asymmetric encryption provides confidentiality but doesn't prove that a specific party created the data. The next page explores digital signatures—the combination of hash functions and asymmetric cryptography that enables non-repudiable proof of authorship, forming the foundation for code signing, secure boot, and the certificate authority system.

Page Complete

You now understand cryptographic hash functions—their properties, algorithms, constructions, and applications. This knowledge is essential for understanding digital signatures, password security, integrity verification, and the hash-based security mechanisms throughout operating systems.

3 / 5

Loading learning content...

Operating SystemsSecurity Fundamentals

Cryptography Basics

LevelIntermediate

Duration90 mins

TopicSecurity Fundamentals

3 / 5

Hash Functions

Digital Fingerprints

Uniqueness: Different files should produce different fingerprints (with overwhelming probability)
Sensitivity: Change a single bit anywhere in the input, and the fingerprint changes completely
Irreversibility: Given only the fingerprint, reconstructing the original file must be computationally impossible
Speed: Computing the fingerprint should be fast, even for massive files

What You Will Learn

Fundamental Properties

1. Preimage Resistance (One-Way Property)

Given a hash value h, it must be computationally infeasible to find any input m such that H(m) = h. You cannot reverse the hash to recover the input.

Security goal: Given only a hash, an attacker cannot find the original data or any data that produces that hash.

2. Second Preimage Resistance (Weak Collision Resistance)

Given an input m₁ and its hash H(m₁), it must be computationally infeasible to find a different input m₂ ≠ m₁ such that H(m₂) = H(m₁).

Security goal: An attacker cannot find a substitute input that produces the same hash as a known input.

3. Collision Resistance (Strong Collision Resistance)

It must be computationally infeasible to find any two distinct inputs m₁ ≠ m₂ such that H(m₁) = H(m₂).

Security goal: An attacker cannot find any pair of inputs with the same hash, even if they choose both inputs.

Birthday Paradox Implications

Hash Function Security Properties
Property	Attacker's Goal	Attack Complexity	Broken If...
Preimage resistance	Find m given H(m)	2^n operations	Attacker can forge data matching a known hash
Second preimage resistance	Find m₂ given m₁ where H(m₁)=H(m₂)	2^n operations	Attacker can substitute signed/verified documents
Collision resistance	Find any m₁≠m₂ where H(m₁)=H(m₂)	2^(n/2) operations	Attacker can create pairs (one benign, one malicious) with same hash

Additional Desirable Properties:

Determinism: The same input always produces the same output. This is essential for verification—you must be able to recompute the hash to check it.

Efficiency: Hash computation should be fast. Modern hashes process data at gigabytes per second.

Major Hash Algorithms

The history of cryptographic hash functions is one of progressive strengthening as attacks improved. Understanding this evolution clarifies current best practices.

MD5 (Message Digest 5) — 1992

Designed by Ron Rivest, MD5 produces 128-bit hashes. It was widely used for checksum verification and password storage. However:

2004: Collision found in under an hour
2008: Practical attack demonstrated (rogue CA certificate)
Status: BROKEN — Never use for security purposes

SHA-1 (Secure Hash Algorithm 1) — 1995

160-bit output, designed by NSA. Widely deployed in SSL/TLS, Git, and code signing.

2005: Theoretical collision attack published
2017: Practical collision demonstrated (SHAttered attack, Google/CWI)
Status: DEPRECATED — Avoid for new systems, migrate existing uses

SHA-2 Family — 2001

The current workhorse, designed by NSA. Multiple variants:

SHA-224: 224-bit output (truncated SHA-256)
SHA-256: 256-bit output, most commonly used
SHA-384: 384-bit output (truncated SHA-512)
SHA-512: 512-bit output, faster on 64-bit processors
SHA-512/256: 256-bit output from SHA-512 structure

Status: SECURE — Current standard for most applications

Cryptographic Hash Function Comparison
Algorithm	Output Size	Block Size	Security	Status
MD5	128 bits	512 bits	Broken (collisions)	Never use
SHA-1	160 bits	512 bits	Broken (collisions)	Deprecated
SHA-256	256 bits	512 bits	~128-bit collision	Standard
SHA-512	512 bits	1024 bits	~256-bit collision	High security
SHA-3-256	256 bits	1088 bits	~128-bit collision	Alternative standard
BLAKE3	256+ bits	64 bytes	~128-bit collision	Modern, fast

SHA-3 (Keccak) — 2015

Immune to length extension attacks
Variable output length (SHAKE128, SHAKE256)
Elegant mathematical design
Slightly slower than SHA-2 in software, excellent in hardware

BLAKE3 — 2020

The fastest production hash function, derived from ChaCha20. Features:

Parallelizable tree structure (scales with cores)
Streaming API (can hash incrementally)
Faster than SHA-256 even without parallelism
Built-in KDF and MAC modes

BLAKE3 is increasingly adopted as a SHA-256 alternative where performance matters.

Choosing a Hash Function Today

Inside SHA-256: The Merkle-Damgård Construction

The Merkle-Damgård Structure:

Padding: The message is padded to a multiple of the block size (512 bits for SHA-256). Padding includes the original message length to prevent length extension attacks on the padding itself.
Initialization: An initial hash value (IV) of eight 32-bit words is set to specific constants (derived from the fractional parts of the square roots of the first eight primes).
Compression: The message is processed in 512-bit blocks. Each block goes through a compression function that updates the hash state.
Finalization: After all blocks are processed, the final state IS the hash.

The SHA-256 Compression Function:

Each 512-bit block is expanded into 64 32-bit words using a message schedule. The compression function then performs 64 rounds of mixing, each round using:

Boolean functions (Ch, Maj) for non-linearity
Bitwise rotations and shifts for diffusion
Addition modulo 2³² for mixing
Round constants (derived from prime cube roots)

sha256_structure.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
// SHA-256 High-Level Structure
 
function SHA256(message):
    // Step 1: Pad message to multiple of 512 bits
    // Append '1' bit, then '0' bits, then 64-bit length
    padded = PAD(message)
    
    // Step 2: Initialize hash state (8 × 32-bit words)
    H[0..7] = [
        0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
        0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
    ]  // Constants from √2, √3, √5, √7, √11, √13, √17, √19
    
    // Step 3: Process each 512-bit block
    for each 512-bit block M in padded:
        // Create message schedule (64 × 32-bit words)
        W[0..15] = M split into 16 words
        for t = 16 to 63:
            W[t] = σ₁(W[t-2]) + W[t-7] + σ₀(W[t-15]) + W[t-16]
        
        // Initialize working variables
        a, b, c, d, e, f, g, h = H[0..7]
        
        // 64 rounds of compression
        for t = 0 to 63:
            T1 = h + Σ₁(e) + Ch(e,f,g) + K[t] + W[t]
            T2 = Σ₀(a) + Maj(a,b,c)
            h = g
            g = f
            f = e
            e = d + T1
            d = c
            c = b
            b = a
            a = T1 + T2
        
        // Update hash state
        H[0..7] += [a, b, c, d, e, f, g, h]
    
    // Step 4: Concatenate final state
    return H[0] || H[1] || ... || H[7]
 
// Boolean functions
Ch(x, y, z) = (x AND y) XOR (NOT x AND z)  // "Choice"
Maj(x, y, z) = (x AND y) XOR (x AND z) XOR (y AND z)  // "Majority"
 
// Mixing functions (rotations and shifts)
Σ₀(x) = ROTR(x, 2) XOR ROTR(x, 13) XOR ROTR(x, 22)
Σ₁(x) = ROTR(x, 6) XOR ROTR(x, 11) XOR ROTR(x, 25)

Length Extension Vulnerability

Message Authentication Codes (MACs)

MAC Definition:

A MAC is a keyed hash function: MAC(K, M) produces a tag T. To verify, the recipient recomputes MAC(K, M) and checks if it matches T. Only parties knowing K can produce valid MACs.

HMAC (Hash-based MAC):

The standard construction for building a MAC from a hash function, defined in RFC 2104:

HMAC(K, M) = H((K ⊕ opad) || H((K ⊕ ipad) || M))

Where:

K is the key (padded to block size if needed)
opad is 0x5c repeated to block size
ipad is 0x36 repeated to block size

HMAC is provably secure if the underlying hash function is a pseudorandom function. It's immune to length extension attacks.

Why Not Just H(K || M)?

As discussed, length extension attacks allow computing H(K || M || extension) without knowing K. This completely breaks authentication. HMAC's nested structure prevents this.

hmac_usage.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
// HMAC Construction and Usage
 
function HMAC(key, message, hash_function):
    block_size = hash_function.block_size  // 64 bytes for SHA-256
    
    // Normalize key length
    if length(key) > block_size:
        key = hash_function(key)  // Hash long keys
    if length(key) < block_size:
        key = key || zeros(block_size - length(key))  // Pad short keys
    
    // Compute inner and outer padded keys
    o_key_pad = key XOR repeat(0x5c, block_size)  // Outer padding
    i_key_pad = key XOR repeat(0x36, block_size)  // Inner padding
    
    // Double hashing prevents length extension
    inner_hash = hash_function(i_key_pad || message)
    outer_hash = hash_function(o_key_pad || inner_hash)
    
    return outer_hash
 
// Usage example: Authenticating an API request
shared_secret = "my_secret_key_256_bits"
request_body = '{"action": "transfer", "amount": 1000}'
 
// Generate authentication tag
mac_tag = HMAC(shared_secret, request_body, SHA256)
 
// Send: request_body + mac_tag
 
// Recipient verifies by recomputing
expected_tag = HMAC(shared_secret, received_body, SHA256)
if constant_time_compare(expected_tag, received_tag):
    accept_request()
else:
    reject_request()  // Tampered or wrong key

HMAC Provides

•Authentication: Only key holders can generate valid MACs
•Integrity: Any modification invalidates the MAC
•Length extension immunity: Cannot extend authenticated messages

HMAC Does NOT Provide

•Confidentiality: Message is not encrypted
•Non-repudiation: Any key holder can generate the MAC
•Replay protection: Same message produces same MAC

Modern Alternative: AEAD

Password Hashing

The Problem with Fast Hashes:

SHA-256 is designed to be fast—billions of hashes per second on modern GPUs. This is a feature for file integrity, but a catastrophe for passwords:

Average password entropy: ~30-40 bits
GPU SHA-256 throughput: ~10 billion hashes/second
Time to brute-force 40-bit password: 2^40 / 10^10 ≈ 100 seconds

No amount of password complexity defeats hardware designed to maximize hash rate (cryptocurrency mining ASICs compute trillions of hashes per second).

The Solution: Slow by Design

Password hashing functions are deliberately computationally expensive, using:

CPU cost: Many iterations (thousands or millions of hash operations)
Memory cost: Large amounts of RAM required (defeats GPU parallelism)
Salt: Random value stored with hash, preventing rainbow tables

Password Hashing Functions
Function	Year	Cost Factor	Strengths	Status
bcrypt	1999	CPU iterations	Battle-tested, widely available	Good, but limited memory
scrypt	2009	CPU + memory	Memory-hard, defeats GPUs	Good for high-security
Argon2i	2015	CPU + memory + threads	Side-channel resistant	Recommended for servers
Argon2id	2015	CPU + memory + threads	Balanced hybrid	Default recommendation
PBKDF2	2000	CPU iterations	NIST approved, ubiquitous	Acceptable, prefer Argon2

password_hashing.pseudo
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
// Proper Password Hashing (using Argon2id)
 
function HASH_PASSWORD(password):
    // Generate random salt (16-32 bytes)
    salt = generate_random_bytes(16)
    
    // Argon2id parameters (tune for ~0.5-1 second on target hardware)
    memory_cost = 65536    // 64 MB RAM
    time_cost = 3          // 3 iterations
    parallelism = 4        // 4 threads
    hash_length = 32       // 256-bit output
    
    // Compute memory-hard hash
    hash = argon2id(
        password,
        salt,
        memory_cost,
        time_cost,
        parallelism,
        hash_length
    )
    
    // Store salt + parameters + hash
    // Format: $argon2id$v=19$m=65536,t=3,p=4$salt_base64$hash_base64
    return encode_for_storage(salt, memory_cost, time_cost, parallelism, hash)
 
function VERIFY_PASSWORD(password, stored_hash):
    // Parse stored hash to extract parameters
    (salt, memory_cost, time_cost, parallelism, expected_hash) = 
        parse_stored_hash(stored_hash)
    
    // Recompute hash with same parameters
    computed_hash = argon2id(
        password,
        salt,
        memory_cost,
        time_cost,
        parallelism,
        length(expected_hash)
    )
    
    // Constant-time comparison (prevent timing attacks)
    return constant_time_compare(computed_hash, expected_hash)
 
// NEVER DO THIS:
// bad_hash = SHA256(password)  // Too fast!
// bad_hash = SHA256(password + salt)  // Still too fast!
// bad_hash = MD5(password)  // Broken AND too fast!

Common Password Hashing Mistakes

Hash Functions in Operating Systems

Operating systems rely on hash functions for integrity, authentication, and performance optimization throughout their architecture.

Key OS Applications

•Password Storage — /etc/shadow (Linux), SAM database (Windows) store password hashes, not plaintext. Modern systems use bcrypt or yescrypt with per-user salts.
•File System Integrity — Btrfs, ZFS use checksums (CRC32, xxHash, SHA-256) to detect data corruption. dm-verity hashes entire partitions for read-only integrity (Android, Chrome OS).
•Secure Boot and Code Signing — Boot components and drivers are hashed, and the hash is signed. The OS verifies the hash before execution, ensuring only authorized code runs.
•Package Management — Package managers verify downloaded packages against signed hashes. This detects tampering and corrupted downloads before installation.
•Memory Deduplication — Kernel Same-page Merging (KSM) uses hashes to identify duplicate memory pages, merging them to save RAM in virtualized environments.
•Network Protocol Authentication — IPsec, TLS use HMAC for packet authentication. The kernel's networking stack computes and verifies HMACs for encrypted connections.
•Randomness Extraction — The kernel's random number generator uses hash functions (SHA-256, BLAKE2) to extract entropy from hardware sources and mix the entropy pool.

linux_hash_examples.sh
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
# Linux Hash Function Examples
 
# View password hash format in shadow file
sudo cat /etc/shadow | grep myuser
# Format: $algorithm$salt$hash
# $6$ = SHA-512, $y$ = yescrypt
 
# Compute file hash
sha256sum /bin/ls
# Outputs: <64-char hex hash>  /bin/ls
 
# Verify file against known hash
echo "expected_hash_here  /bin/ls" | sha256sum -c
# Outputs: /bin/ls: OK
 
# HMAC computation with openssl
echo -n "message" | openssl dgst -sha256 -hmac "secret_key"
 
# View dm-verity hash tree (Android/Chrome OS)
veritysetup dump /dev/sda1
 
# Check package signatures (Debian/Ubuntu)
apt-get download package_name
sha256sum package_name*.deb
# Compare with Packages file signed by archive key
 
# Btrfs checksum status
sudo btrfs scrub start /mnt/data
sudo btrfs scrub status /mnt/data

Performance vs. Cryptographic Hashes

Attacks and Defenses

Understanding hash function attacks illuminates why certain algorithms are deprecated and informs secure usage.

Attack Types

•Collision Attacks — Finding two inputs with the same hash. MD5 and SHA-1 are broken this way. Impact: Attackers can create malicious files with the same hash as legitimate ones, defeating signatures that only sign hashes.
•Preimage Attacks — Finding an input that produces a specific hash. No practical attacks against modern hashes, but reduced-round attacks exist. Impact: Would allow forging data matching a known hash.
•Length Extension Attacks — Computing H(message || extension) given only H(message). Affects MD5, SHA-1, SHA-256. Impact: Breaks naive MAC constructions like H(key || message).
•Rainbow Table Attacks — Precomputed hash → password tables. Defeated by salting (each password gets a unique random salt). Impact: Unsalted password databases can be cracked instantly against known wordlists.
•Timing Attacks — Measuring comparison time to learn hash bytes. If comparison stops at first mismatch, time reveals how many bytes matched. Defense: Constant-time comparison functions.

Hash Function Attack Status
Algorithm	Collision Attack	Preimage Attack	Practical Impact
MD5	Broken (seconds)	Theoretical weaknesses	Can create colliding PDFs, certificates
SHA-1	Broken (2017, $110K compute)	Secure	SHAttered: malicious PDFs with same hash
SHA-256	Secure (no known attack)	Secure	Standard for security applications
SHA-3	Secure (no known attack)	Secure	Immune to length extension
BLAKE3	Secure (no known attack)	Secure	Modern, high performance

Defense Best Practices

Summary: Hash Functions

Key Takeaways

•Hash functions produce fixed-size outputs from arbitrary inputs — SHA-256 always produces 256 bits regardless of input size.
•Three security properties: preimage resistance, second preimage resistance, collision resistance — Collision resistance is hardest (birthday bound: 2^(n/2)).
•MD5 and SHA-1 are broken — Collisions can be computed practically. Use SHA-256, SHA-3, or BLAKE3.
•HMAC provides keyed authentication — Use HMAC(K, M) not H(K || M) to avoid length extension attacks.
•Password hashing requires slow functions — Fast hashes (SHA-256) are wrong; use Argon2id, bcrypt, or scrypt with unique salts.
•Operating systems use hashes extensively — Password storage, file integrity, secure boot, package verification, memory deduplication.
•Match the hash to the use case — Cryptographic hashes for security, fast hashes (xxHash) for performance, password hashes for credentials.

Looking Ahead:

Page Complete

3 / 5