"Encryption is easy. Key management is hard."
This axiom, repeated by cryptographers for decades, captures a fundamental truth: the strength of any encryption system ultimately depends on how well you protect the keys. A 256-bit AES key provides practically unbreakable encryption—but if that key is stored in plaintext on the same server as the encrypted data, the encryption is worthless.
The key management problem has three dimensions:
Confidentiality — Keys must be accessible only to authorized entities. If an attacker gets the key, encryption provides no protection.
Availability — Keys must be available when needed to decrypt data. If you lose the key, you lose the data—permanently.
Integrity — Keys must not be modified or corrupted. A corrupted key produces incorrect decryption, potentially causing data loss or subtle corruption.
Balancing these three requirements—often in tension with each other—is the central challenge of key management.
By the end of this page, you will understand key hierarchy design, key lifecycle management from generation to destruction, cloud Key Management Services (KMS), Hardware Security Modules (HSMs), key rotation strategies, key escrow and recovery, and how to design key management architectures for different security requirements.
Modern cryptographic systems use hierarchies of keys rather than single keys for several important reasons:
The standard three-tier key hierarchy:
| Level | Key Type | Purpose | Storage | Rotation Frequency |
|---|---|---|---|---|
| 1 (Root) | Master Key / Root Key | Protects all lower-level keys | HSM or cloud KMS (never exported) | Rarely (annually or on compromise) |
| 2 (KEK) | Key Encryption Key | Encrypts Data Encryption Keys | KMS or encrypted at rest | Periodically (quarterly) |
| 3 (DEK) | Data Encryption Key | Encrypts actual data | Encrypted by KEK, stored with data | Frequently (monthly or per-data-set) |
Additional key types in enterprise environments:
The pattern of encrypting a DEK with a KEK, then storing the encrypted DEK alongside the encrypted data, is called 'envelope encryption.' The encrypted DEK is the 'envelope' that wraps the data key. This is the standard pattern used by AWS KMS, Google Cloud KMS, Azure Key Vault, and most encryption systems. It enables local encryption operations (fast) while protecting keys with centralized key management (secure).
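To make the three tiers concrete, here is a minimal local sketch of the wrapping chain using Node's built-in `crypto` module. The helper names `wrap` and `unwrap` are illustrative, and the root key is held in memory only so the example runs; in a real deployment it would stay inside an HSM or KMS and never be exported.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

interface Wrapped {
  iv: Buffer;
  ciphertext: Buffer;
  authTag: Buffer;
}

// Encrypt `plaintext` under `key` with AES-256-GCM (used for keys and data alike).
function wrap(key: Buffer, plaintext: Buffer): Wrapped {
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, ciphertext, authTag: cipher.getAuthTag() };
}

// Reverse of wrap(); throws if the key is wrong or the ciphertext was modified.
function unwrap(key: Buffer, wrapped: Wrapped): Buffer {
  const decipher = createDecipheriv("aes-256-gcm", key, wrapped.iv);
  decipher.setAuthTag(wrapped.authTag);
  return Buffer.concat([decipher.update(wrapped.ciphertext), decipher.final()]);
}

// Level 1: root key (assumption: in memory here only for illustration).
const rootKey = randomBytes(32);

// Level 2: the KEK is stored only in wrapped form, protected by the root key.
const kek = randomBytes(32);
const wrappedKek = wrap(rootKey, kek);

// Level 3: the DEK encrypts the data and is stored, wrapped by the KEK, next to it.
const dek = randomBytes(32);
const storedRecord = {
  wrappedDek: wrap(kek, dek),
  data: wrap(dek, Buffer.from("account=42;balance=100")),
};

// Read path: unwrap downward through the hierarchy, then decrypt the data.
const recoveredKek = unwrap(rootKey, wrappedKek);
const recoveredDek = unwrap(recoveredKek, storedRecord.wrappedDek);
const recovered = unwrap(recoveredDek, storedRecord.data).toString("utf8");
```

Only the root key needs the strongest protection; everything below it can sit in ordinary storage in wrapped form.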
Encryption keys have a lifecycle from creation to destruction. Each phase has specific requirements and risks that must be managed carefully.
The key lifecycle phases:
| Phase | Activities | Key Risks | Controls |
|---|---|---|---|
| Generation | Create key with proper randomness | Weak random number generation | Use HSM/KMS; verify entropy sources |
| Activation | Deploy key for use | Premature use before approval | Formal activation workflow; key ceremony |
| Active Use | Encrypt/decrypt operations | Unauthorized access; overuse | Access logging; usage monitoring |
| Suspension | Temporarily disable key | Suspended key still accessible | Clear suspension states; audit access |
| Rotation | Replace with new key; re-encrypt data | Old key exposed during transition | Overlap period; gradual migration |
| Deactivation | Stop using for encryption; decrypt only | Continued encryption with old key | Enforce encrypt-disable in KMS |
| Archival | Long-term storage for decrypt-only | Archive compromise; key discovery | Strong access controls; offline storage |
| Destruction | Cryptographically erase key | Incomplete destruction; recovery | Verified destruction; secure wipe |
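One way to enforce these phases in code is a small state machine that records which operations each state permits. The sketch below is an assumption about how the table might be modelled, not a standard: the state names, the transition set, and the choice to treat rotation as an event (activating a new key and deactivating the old one) rather than a state are all illustrative.

```typescript
type KeyState =
  | "generated"
  | "active"
  | "suspended"
  | "deactivated"
  | "archived"
  | "destroyed";

type KeyOperation = "encrypt" | "decrypt";

// Which states a key may move to from each state (illustrative policy).
const allowedTransitions: Record<KeyState, KeyState[]> = {
  generated: ["active", "destroyed"],
  active: ["suspended", "deactivated"],
  suspended: ["active", "deactivated"],
  deactivated: ["archived", "destroyed"],
  archived: ["destroyed"],
  destroyed: [],
};

// Which cryptographic operations each state permits.
const allowedOperations: Record<KeyState, KeyOperation[]> = {
  generated: [],
  active: ["encrypt", "decrypt"],
  suspended: [],
  deactivated: ["decrypt"], // decrypt-only: no new data under this key
  archived: ["decrypt"],
  destroyed: [],
};

// Returns the new state, or throws if the lifecycle does not allow the move.
function transition(from: KeyState, to: KeyState): KeyState {
  if (allowedTransitions[from].indexOf(to) === -1) {
    throw new Error(`Illegal key state transition: ${from} -> ${to}`);
  }
  return to;
}

// True if a key in `state` may be used for `operation`.
function canUse(state: KeyState, operation: KeyOperation): boolean {
  return allowedOperations[state].indexOf(operation) !== -1;
}
```

Checking `canUse` before every cryptographic call is what turns "enforce encrypt-disable" from a policy statement into something the system actually guarantees.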
Critical considerations for each phase:
Generation: Key generation must use cryptographically secure random number generators (CSPRNGs). Never use rand() or other weak random sources. HSMs and cloud KMS handle this correctly; if generating keys yourself, use /dev/urandom, crypto.randomBytes(), or equivalent.
Activation: For critical keys (master keys, signing keys), use formal key ceremonies with multiple participants, witnessed generation, and documented chain of custody.
Rotation: Plan for rotation from the beginning. Supporting multiple active key versions simultaneously (for gradual migration) is much easier than retrofitting later.
Destruction: When a key is destroyed, verify it's gone. Cloud KMS typically has a 'scheduled deletion' period (e.g., 7-30 days) to prevent accidental destruction. Understand your provider's destruction guarantees.
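For the generation guidance above, a minimal sketch of producing a 256-bit key from the operating system's CSPRNG might look like the following. The `GeneratedKey` shape and the function name are hypothetical; only `randomBytes` and `randomUUID` are real Node.js APIs.

```typescript
import { randomBytes, randomUUID } from "node:crypto";

// Hypothetical record describing a freshly generated key.
interface GeneratedKey {
  keyId: string;    // opaque identifier, safe to log
  material: Buffer; // 32 bytes of key material, never log this
  createdAt: Date;  // starting point for tracking the key's age
}

// Generate an AES-256 key using the OS-backed CSPRNG.
// Math.random() must never be used here: it is not cryptographically secure.
function generateDataKey(): GeneratedKey {
  return {
    keyId: randomUUID(),
    material: randomBytes(32),
    createdAt: new Date(),
  };
}
```

Recording `createdAt` at generation time is what later makes rotation schedules enforceable.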
NIST SP 800-57 recommends limiting the 'cryptoperiod' (active usage lifetime) of encryption keys based on their type. For symmetric data encryption keys, the originator usage period should be 1-2 years maximum. After this, the key should be deactivated: used only to decrypt existing data, never to encrypt new data. Plan your key rotation strategy to comply with these recommendations.
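A cryptoperiod check can be as simple as comparing a key's age against a configured limit. In this sketch the 730-day default reflects the two-year upper bound mentioned above and is an assumption to adjust to your own policy; the function and type names are illustrative.

```typescript
type PermittedUse = "encrypt-and-decrypt" | "decrypt-only";

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Decide what a key may be used for, given when it was activated.
// originatorUsageDays is a policy setting (assumed default: 730 days).
function permittedUse(
  activatedAt: Date,
  now: Date,
  originatorUsageDays: number = 730
): PermittedUse {
  const ageInDays = (now.getTime() - activatedAt.getTime()) / MS_PER_DAY;
  return ageInDays <= originatorUsageDays ? "encrypt-and-decrypt" : "decrypt-only";
}
```

A scheduled job that runs this check over key metadata can flag keys that are overdue for rotation before an auditor does.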
Cloud Key Management Services provide centralized, managed key storage and cryptographic operations. They offer significant advantages over managing keys yourself:
Major cloud KMS offerings:
| Feature | AWS KMS | Google Cloud KMS | Azure Key Vault |
|---|---|---|---|
| Key Storage | FIPS 140-2 Level 2 (standard) or Level 3 (CloudHSM) | FIPS 140-2 Level 1 (software) or Level 3 (Cloud HSM) | FIPS 140-2 Level 2 (standard) or HSM |
| Key Types | Symmetric (AES-256), Asymmetric (RSA, ECC) | Symmetric (AES-256), Asymmetric (RSA, EC), MAC keys | RSA, EC, symmetric keys |
| Automatic Rotation | Annual for symmetric CMKs | Configurable period | Manual or automated |
| Multi-Region | Multi-region keys (replicated) | Global or regional keys | Geo-replication options |
| Pricing Model | Per key + per request | Per key version + per operation | Per key + per 10K operations |
| Custom Key Store | External key store (XKS) | EKM for key holders | Managed HSM (dedicated) |
```hcl
// === Terraform: Create KMS key with policy ===

// Looks up the current account ID used in the policy ARNs below.
data "aws_caller_identity" "current" {}

resource "aws_kms_key" "data_encryption" {
  description             = "CMK for application data encryption"
  deletion_window_in_days = 30
  enable_key_rotation     = true

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "Enable IAM User Permissions"
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"
        }
        Action   = "kms:*"
        Resource = "*"
      },
      {
        Sid    = "Allow application to encrypt/decrypt"
        Effect = "Allow"
        Principal = {
          // aws_iam_role.application is assumed to be defined elsewhere
          AWS = aws_iam_role.application.arn
        }
        Action = [
          "kms:Encrypt",
          "kms:Decrypt",
          "kms:GenerateDataKey",
          "kms:GenerateDataKeyWithoutPlaintext",
          "kms:DescribeKey"
        ]
        Resource = "*"
      },
      {
        Sid    = "Allow security team admin access"
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/security-admin"
        }
        Action = [
          "kms:Create*",
          "kms:Describe*",
          "kms:Enable*",
          "kms:List*",
          "kms:Put*",
          "kms:Update*",
          "kms:Revoke*",
          "kms:Disable*",
          "kms:Get*",
          "kms:ScheduleKeyDeletion",
          "kms:CancelKeyDeletion"
        ]
        Resource = "*"
      }
    ]
  })
}
```

```typescript
// === TypeScript SDK: Envelope encryption ===
import { KMSClient, GenerateDataKeyCommand, DecryptCommand } from "@aws-sdk/client-kms";
import { createCipheriv, createDecipheriv, randomBytes } from "crypto";

const kms = new KMSClient({ region: "us-east-1" });
const KEY_ID = "arn:aws:kms:us-east-1:123456789:key/abc123";

interface EncryptedData {
  encryptedDataKey: Buffer; // DEK encrypted by KMS
  iv: Buffer;
  ciphertext: Buffer;
  authTag: Buffer;
}

async function encryptWithEnvelope(plaintext: Buffer): Promise<EncryptedData> {
  // Generate a data key - KMS returns both plaintext and encrypted versions
  const { Plaintext, CiphertextBlob } = await kms.send(
    new GenerateDataKeyCommand({
      KeyId: KEY_ID,
      KeySpec: "AES_256",
    })
  );

  if (!Plaintext || !CiphertextBlob) throw new Error("Failed to generate data key");

  // Use the plaintext key to encrypt data locally
  const iv = randomBytes(12);
  const cipher = createCipheriv("aes-256-gcm", Plaintext, iv);
  const ciphertext = Buffer.concat([
    cipher.update(plaintext),
    cipher.final()
  ]);
  const authTag = cipher.getAuthTag();

  // Zero out the plaintext key in memory
  Plaintext.fill(0);

  return {
    encryptedDataKey: Buffer.from(CiphertextBlob), // Store this!
    iv,
    ciphertext,
    authTag
  };
}

async function decryptWithEnvelope(encrypted: EncryptedData): Promise<Buffer> {
  // Decrypt the data key using KMS
  const { Plaintext } = await kms.send(
    new DecryptCommand({
      CiphertextBlob: encrypted.encryptedDataKey,
    })
  );

  if (!Plaintext) throw new Error("Failed to decrypt data key");

  // Use the plaintext key to decrypt data locally
  const decipher = createDecipheriv("aes-256-gcm", Plaintext, encrypted.iv);
  decipher.setAuthTag(encrypted.authTag);
  const plaintext = Buffer.concat([
    decipher.update(encrypted.ciphertext),
    decipher.final()
  ]);

  // Zero out the plaintext key
  Plaintext.fill(0);

  return plaintext;
}
```

Cloud KMS services have request rate limits (e.g., AWS KMS allows 5,500-30,000 cryptographic requests per second, a quota shared across an account within a region, depending on region and key type). For high-throughput encryption, use envelope encryption: generate a DEK once with KMS, cache it locally, and use it for many encryption operations. Regenerate it periodically (e.g., hourly) to limit how much data any single DEK protects.
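The caching approach described above can be sketched as a small wrapper that reuses a data key until it is too old or has been used too many times. Everything here is an assumption for illustration: `DataKeyCache`, the `DataKeyGenerator` callback (which would call KMS `GenerateDataKey` in practice), and the injected clock are hypothetical names, not part of any SDK.

```typescript
// A data key as returned by a KMS: plaintext for local use, encrypted for storage.
interface DataKey {
  plaintext: Buffer;
  encrypted: Buffer;
}

// Hypothetical callback that fetches a fresh data key (e.g., from a KMS).
type DataKeyGenerator = () => Promise<DataKey>;

class DataKeyCache {
  private current: DataKey | null = null;
  private expiresAt = 0;
  private uses = 0;

  constructor(
    private readonly generate: DataKeyGenerator,
    private readonly ttlMs: number,   // maximum age of a cached key
    private readonly maxUses: number, // maximum encryptions per key
    private readonly now: () => number = Date.now // injectable clock for tests
  ) {}

  // Return the cached key, fetching a new one when the old one is stale.
  async get(): Promise<DataKey> {
    const stale =
      this.current === null ||
      this.now() >= this.expiresAt ||
      this.uses >= this.maxUses;

    if (stale) {
      this.current = await this.generate();
      this.expiresAt = this.now() + this.ttlMs;
      this.uses = 0;
    }

    this.uses += 1;
    return this.current as DataKey;
  }
}
```

With a one-hour TTL, a service encrypting thousands of records per second makes roughly one KMS call per hour instead of one per record. The trade-off is that the plaintext DEK lives in process memory for the whole TTL, so keep the window as short as throughput allows.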
Hardware Security Modules (HSMs) are dedicated hardware devices designed to generate, store, and manage cryptographic keys securely. They provide the highest level of key protection, with physical and logical safeguards against key extraction.
What makes HSMs special:
| Option | Description | Use Case | Cost Level |
|---|---|---|---|
| On-premises HSM | Physical hardware you own and manage | Highest security requirements; regulatory mandates | $$$$ |
| Cloud HSM | Dedicated HSM in cloud provider's DC (AWS CloudHSM, GCP Cloud HSM) | High security without physical management | $$$ |
| Managed KMS (HSM-backed) | Shared HSM infrastructure behind cloud KMS | Standard enterprise encryption | $$ |
| Virtual HSM | Software HSM for development/testing | Non-production environments | $ |
FIPS 140-2/3 Security Levels:
FIPS 140 is the US government standard for cryptographic modules. Understanding the levels helps you choose appropriate protection:
When you need an HSM vs. cloud KMS:
HSMs are critical infrastructure—if the HSM is unavailable, all cryptographic operations stop. Plan for high availability: use clustered HSMs, maintain hot standbys, and ensure adequate capacity for peak load. For cloud HSM, understand the availability SLA (typically 99.99% for clustered deployments). Test failover procedures regularly.
Key rotation is the practice of replacing encryption keys periodically. It limits the impact of key compromise (only data encrypted under the compromised key is exposed, not everything the system has ever stored) and satisfies compliance requirements for key lifecycle management.
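A minimal way to picture rotation is a keyring that tags every ciphertext with the key version that produced it. The `Keyring` class below is a local, in-memory sketch with hypothetical names; a real system would keep the key material in a KMS rather than in a `Map`.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

interface Sealed {
  keyVersion: number; // which key version encrypted this payload
  iv: Buffer;
  ciphertext: Buffer;
  authTag: Buffer;
}

class Keyring {
  private readonly keys = new Map<number, Buffer>();
  private currentVersion = 0;

  // Add a new key version and make it the one used for encryption.
  rotate(): number {
    this.currentVersion += 1;
    this.keys.set(this.currentVersion, randomBytes(32));
    return this.currentVersion;
  }

  // New data is always encrypted under the current version.
  encrypt(plaintext: Buffer): Sealed {
    const key = this.keys.get(this.currentVersion);
    if (!key) throw new Error("No active key: call rotate() first");
    const iv = randomBytes(12);
    const cipher = createCipheriv("aes-256-gcm", key, iv);
    const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
    return { keyVersion: this.currentVersion, iv, ciphertext, authTag: cipher.getAuthTag() };
  }

  // Old data stays readable as long as its key version is retained.
  decrypt(sealed: Sealed): Buffer {
    const key = this.keys.get(sealed.keyVersion);
    if (!key) throw new Error(`Key version ${sealed.keyVersion} is unavailable`);
    const decipher = createDecipheriv("aes-256-gcm", key, sealed.iv);
    decipher.setAuthTag(sealed.authTag);
    return Buffer.concat([decipher.update(sealed.ciphertext), decipher.final()]);
  }

  // Re-encryption moves a payload onto the current version.
  reencrypt(sealed: Sealed): Sealed {
    return this.encrypt(this.decrypt(sealed));
  }

  // Retire an old version once nothing is encrypted under it any more.
  retire(version: number): void {
    if (version === this.currentVersion) throw new Error("Cannot retire the current key");
    this.keys.delete(version);
  }
}
```

Calling `rotate()` alone leaves existing ciphertexts on their old versions; only `reencrypt()` followed by `retire()` actually removes the dependency on an old key.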
Types of key rotation:
Understanding the difference between key rotation and re-encryption:
Key rotation (what most cloud KMS does automatically):
Re-encryption (true rotation):
```typescript
import {
  KMSClient,
  GenerateDataKeyCommand,
  DecryptCommand,
  ScheduleKeyDeletionCommand
} from "@aws-sdk/client-kms";

interface EncryptedRecord {
  id: string;
  keyId: string;        // Which key version encrypted this record
  keyVersion: number;   // Version number for tracking
  encryptedData: Buffer;
  encryptedDek: Buffer; // DEK encrypted by KMS key
}

// Minimal database interface assumed by this example (hypothetical)
interface Database {
  query(sql: string, params: unknown[]): Promise<EncryptedRecord[]>;
  queryOne(sql: string, params: unknown[]): Promise<{ count: number | string }>;
}

class KeyRotationManager {
  private kms: KMSClient;
  private currentKeyId: string;
  private currentVersion: number;

  constructor(kms: KMSClient, keyId: string, version: number) {
    this.kms = kms;
    this.currentKeyId = keyId;
    this.currentVersion = version;
  }

  /**
   * Gradual re-encryption: process records in batches
   * to avoid service disruption
   */
  async reencryptBatch(
    records: EncryptedRecord[],
    batchSize: number = 100
  ): Promise<void> {
    for (let i = 0; i < records.length; i += batchSize) {
      const batch = records.slice(i, i + batchSize);

      await Promise.all(
        batch.map(async (record) => {
          // Skip if already on current key version
          if (record.keyVersion >= this.currentVersion) {
            return;
          }

          try {
            // Decrypt with old key
            const plaintext = await this.decrypt(record);

            // Re-encrypt with new key
            const newRecord = await this.encrypt(plaintext);

            // Update in database (atomic operation)
            await this.updateRecord(record.id, newRecord);

            console.log(`Re-encrypted record ${record.id}`);
          } catch (error) {
            console.error(`Failed to re-encrypt ${record.id}:`, error);
            // Log for manual remediation; don't fail entire batch
          }
        })
      );

      // Pause between batches to reduce load
      await new Promise(resolve => setTimeout(resolve, 100));
    }
  }

  /**
   * Background job for continuous re-encryption
   */
  async runReencryptionJob(db: Database): Promise<void> {
    // Find records encrypted with old key versions
    const oldRecords = await db.query(`
      SELECT * FROM encrypted_data
      WHERE key_version < $1
      ORDER BY created_at ASC
      LIMIT 1000
    `, [this.currentVersion]);

    if (oldRecords.length === 0) {
      console.log("All records on current key version");
      return;
    }

    console.log(`Re-encrypting ${oldRecords.length} records`);
    await this.reencryptBatch(oldRecords);

    // Check if all old data is migrated
    const remaining = await db.queryOne(`
      SELECT COUNT(*) FROM encrypted_data
      WHERE key_version < $1
    `, [this.currentVersion]);

    // Some drivers return COUNT(*) as a string, so normalize before comparing
    if (Number(remaining.count) === 0) {
      console.log("All data migrated to new key version");
      // Safe to disable old key version
      await this.disableOldKeyVersion();
    }
  }

  private async disableOldKeyVersion(): Promise<void> {
    // Schedule deletion of old key (with 30-day safety period)
    const oldKeyId = this.getOldKeyId();

    await this.kms.send(new ScheduleKeyDeletionCommand({
      KeyId: oldKeyId,
      PendingWindowInDays: 30,
    }));

    console.log(`Scheduled deletion of old key ${oldKeyId}`);
  }

  // ... encrypt, decrypt, updateRecord, getOldKeyId implementations
}
```

High-churn data (session tokens, temporary caches): Auto-rotation with no re-encryption, since the data naturally expires. Critical persistent data (financial records, health data): Full re-encryption on a schedule (quarterly/annually). Archival data: May use separate long-lived keys if re-encryption is infeasible. Match your rotation strategy to your data lifecycle.
Key escrow is the practice of storing copies of encryption keys with a trusted third party or in a secure backup location. While it creates an additional attack surface, it's essential for disaster recovery and business continuity.
The escrow dilemma:
Strategies to balance security and recovery:
```typescript
import { createHash } from 'crypto';
import { split, combine } from 'shamirs-secret-sharing';

/**
 * Shamir's Secret Sharing implementation for key escrow.
 * Split a key into N shares where M are required to reconstruct.
 */

interface EscrowConfig {
  totalShares: number;    // N: total number of shares to create
  requiredShares: number; // M: minimum shares needed to reconstruct
}

interface EscrowResult {
  shares: Buffer[];
  shareAssignments: ShareAssignment[];
}

interface ShareAssignment {
  shareIndex: number;
  custodian: string;
  storageLocation: string;
  verificationHash: string; // To verify share integrity without revealing it
}

// Assumption: a SHA-256 digest is used as the share fingerprint
function hashShare(share: Buffer): string {
  return createHash('sha256').update(share).digest('hex');
}

async function escrowKey(
  key: Buffer,
  config: EscrowConfig,
  custodians: string[]
): Promise<EscrowResult> {
  if (custodians.length !== config.totalShares) {
    throw new Error("Custodian count must match total shares");
  }

  // Split the key using Shamir's Secret Sharing
  const shares = split(key, {
    shares: config.totalShares,
    threshold: config.requiredShares,
  });

  // Create assignment records (for tracking, not storage)
  const shareAssignments: ShareAssignment[] = shares.map((share, index) => ({
    shareIndex: index + 1,
    custodian: custodians[index],
    storageLocation: `secure-vault-${index + 1}`,
    verificationHash: hashShare(share), // Can verify without combining
  }));

  // Each share goes to its custodian via secure channel
  // In practice, use hardware tokens, secure ceremonies, etc.

  return { shares, shareAssignments };
}

async function recoverKey(
  shares: Buffer[],
  requiredShares: number
): Promise<Buffer> {
  if (shares.length < requiredShares) {
    throw new Error(`Need at least ${requiredShares} shares, got ${shares.length}`);
  }

  // Reconstruct the original key
  const key = combine(shares.slice(0, requiredShares));

  return key;
}

// Example: 3-of-5 escrow for master key
// (generateMasterKey and storeEscrowMetadata are assumed to be defined elsewhere)
async function setupMasterKeyEscrow() {
  const masterKey = await generateMasterKey(); // From HSM

  const result = await escrowKey(masterKey, {
    totalShares: 5,
    requiredShares: 3,
  }, [
    "CEO",
    "CFO",
    "CTO",
    "General Counsel",
    "CISO",
  ]);

  // Distribute shares securely
  // CEO gets Share 1, stored in CEO's personal vault
  // CFO gets Share 2, stored in finance department safe
  // etc.

  // Store assignments (not shares!) for tracking
  await storeEscrowMetadata({
    keyId: "master-key-v1",
    createdAt: new Date(),
    expiresAt: null,
    assignments: result.shareAssignments,
    threshold: 3,
    totalShares: 5,
  });

  console.log("Master key escrowed with 3-of-5 threshold");
}
```

An escrow system that's never been tested is not a recovery system; it's hope. Conduct regular recovery drills: actually reconstruct keys from shares, verify they work, then securely destroy the test reconstruction. Document the procedure. Time how long recovery takes. An untested recovery procedure will fail when you need it most.
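To show what a threshold split does under the hood, and to give a recovery drill something self-contained to exercise, here is a from-scratch teaching sketch of Shamir's scheme over GF(2^8). It is not the `shamirs-secret-sharing` library's algorithm or share format, it is not constant-time, and it has not been audited; the names `splitSecret` and `combineShares` are illustrative. Use a vetted library for real keys.

```typescript
import { randomBytes } from "node:crypto";

interface Share {
  x: number; // share index, 1..255
  y: Buffer; // one polynomial evaluation per secret byte
}

// Multiply in GF(2^8) with the AES reduction polynomial x^8 + x^4 + x^3 + x + 1.
function gfMul(a: number, b: number): number {
  let product = 0;
  while (b > 0) {
    if (b & 1) product ^= a;
    a <<= 1;
    if (a & 0x100) a ^= 0x11b;
    b >>= 1;
  }
  return product;
}

// Multiplicative inverse: a^254, since a^255 = 1 for every non-zero a.
function gfInv(a: number): number {
  if (a === 0) throw new Error("Zero has no inverse");
  let result = 1;
  for (let i = 0; i < 254; i++) result = gfMul(result, a);
  return result;
}

// Split `secret` so that any `threshold` of `totalShares` shares can rebuild it.
function splitSecret(secret: Buffer, totalShares: number, threshold: number): Share[] {
  if (threshold < 2 || threshold > totalShares || totalShares > 255) {
    throw new Error("Require 2 <= threshold <= totalShares <= 255");
  }
  const shares: Share[] = [];
  for (let x = 1; x <= totalShares; x++) {
    shares.push({ x, y: Buffer.alloc(secret.length) });
  }
  for (let i = 0; i < secret.length; i++) {
    // Random polynomial of degree threshold-1 whose constant term is the secret byte.
    const random = randomBytes(threshold - 1);
    const coefficients: number[] = [secret[i]];
    for (let k = 0; k < random.length; k++) coefficients.push(random[k]);
    for (const share of shares) {
      // Horner's rule: evaluate the polynomial at x = share.x.
      let acc = 0;
      for (let k = coefficients.length - 1; k >= 0; k--) {
        acc = gfMul(acc, share.x) ^ coefficients[k];
      }
      share.y[i] = acc;
    }
  }
  return shares;
}

// Rebuild the secret by Lagrange interpolation at x = 0.
function combineShares(shares: Share[]): Buffer {
  if (shares.length === 0) throw new Error("No shares supplied");
  // In characteristic 2, the basis value is the product of x_j / (x_i XOR x_j).
  const bases = shares.map((si) => {
    let basis = 1;
    for (const sj of shares) {
      if (sj.x !== si.x) basis = gfMul(basis, gfMul(sj.x, gfInv(si.x ^ sj.x)));
    }
    return basis;
  });
  const secret = Buffer.alloc(shares[0].y.length);
  for (let i = 0; i < secret.length; i++) {
    let value = 0;
    shares.forEach((share, k) => {
      value ^= gfMul(share.y[i], bases[k]);
    });
    secret[i] = value;
  }
  return secret;
}
```

A drill along these lines would split a throwaway key, have custodians return any three shares, confirm `combineShares` reproduces the original bytes, and then discard the reconstruction. Supplying fewer shares than the threshold yields unrelated bytes rather than an error, which is why the drill must compare the result against a known value.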
Learning what NOT to do is as important as learning best practices. These anti-patterns are unfortunately common and often lead to security breaches or data loss:
Real-world breach examples from these anti-patterns:
Uber (2016) — AWS keys in GitHub repo exposed data of 57 million users and drivers.
Facebook (2019) — Hundreds of millions of passwords stored in plaintext in internal logs.
Equifax (2017) — Encryption keys stored adjacent to encrypted data; when servers were compromised, attackers got both.
Capital One (2019) — Misconfigured WAF + overly permissive IAM role allowed attacker to use instance credentials to access encrypted data.
Multiple organizations — Backup tapes shipped to third-party storage without encryption; tapes lost or stolen in transit.
If your encryption key can be accessed by the same principals (people, processes, or pathways) that can access the encrypted data, your encryption provides limited value. The security of encryption depends on the key being accessible to a SMALLER set of principals than the encrypted data itself.
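This rule can be checked mechanically during an access review. The sketch below assumes you can list, as plain strings, which principals can read the encrypted data and which can use the key; the function name and the principal names in the usage note are hypothetical.

```typescript
interface AccessReview {
  canReadBoth: string[];             // principals who can decrypt the data
  ciphertextOnly: string[];          // principals who see only ciphertext
  encryptionAddsSeparation: boolean; // false if everyone with data also has the key
}

// Compare who can use the key against who can reach the encrypted data.
function reviewAccess(keyPrincipals: string[], dataPrincipals: string[]): AccessReview {
  const canReadBoth = dataPrincipals.filter((p) => keyPrincipals.indexOf(p) !== -1);
  const ciphertextOnly = dataPrincipals.filter((p) => keyPrincipals.indexOf(p) === -1);
  return {
    canReadBoth,
    ciphertextOnly,
    encryptionAddsSeparation: ciphertextOnly.length > 0,
  };
}
```

For example, `reviewAccess(["app-role"], ["app-role", "backup-operator", "dba"])` reports that the backup operator and DBA see only ciphertext, so encryption is doing real work. If both lists were identical, `encryptionAddsSeparation` would be `false`.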
We've covered the critical discipline of key management. Let's consolidate the key insights:
What's next:
With key management foundations established, we'll now address the practical concern every engineer asks: "What about performance?" The final page of this module explores the performance implications of encryption at rest—measuring overhead, optimization strategies, and making informed tradeoffs between security and system performance.
You now understand the fundamental principles of cryptographic key management—from key hierarchies and lifecycle management to cloud KMS, HSMs, rotation strategies, and escrow. You can design key management architectures appropriate for different security requirements and avoid the anti-patterns that lead to breaches. Next, we'll examine the performance considerations for encryption at rest.