TLS and Encryption in Transit

Level: Advanced

Duration: 90 mins

Topic: TLS and Encryption in Transit

Certificate Management

The Invisible Expiration Bomb

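On September 30, 2021, a large number of older devices and clients suddenly failed to connect to sites secured by Let's Encrypt certificates. The cause? A single root certificate expired: DST Root CA X3, the IdenTrust root that had cross-signed Let's Encrypt's chain. Devices with outdated certificate stores did not trust the newer ISRG Root X1 and couldn't validate the certificates they were served, causing widespread outages.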

This incident illustrates a critical reality: certificates are time bombs with expiration dates, and managing them at scale is one of the most operationally challenging aspects of running secure infrastructure.

A forgotten certificate renewal can bring down your entire platform. Manual renewal processes don't scale. Revocation is poorly understood and often doesn't work as expected. And as organizations adopt TLS everywhere (internal services, mTLS, IoT devices), certificate inventory explodes from dozens to thousands or millions.

This page covers the complete certificate lifecycle: issuance, renewal, distribution, revocation, and monitoring. You'll learn how to build certificate infrastructure that's reliable, automated, and doesn't wake anyone up at 3 AM.

What You Will Learn

By the end of this page, you will understand the certificate lifecycle, automated certificate management with ACME and cert-manager, private CA infrastructure for internal services, certificate revocation mechanisms and their limitations, and monitoring strategies to prevent certificate-related outages.

The Certificate Lifecycle

Every X.509 certificate follows a lifecycle from creation to expiration (or revocation). Understanding this lifecycle is essential for proper management.

Certificate Lifecycle Stages

•Key Generation: Private/public key pair created. Private key MUST be protected; if compromised, the certificate becomes a liability. Generate on secure systems; never share private keys.
•Certificate Signing Request (CSR): Contains the public key and identity information (domain names, organization). Signed by the private key to prove possession. Sent to CA.
•Validation: CA verifies the requester controls the domain(s) in the CSR. Methods: DNS record (TXT), HTTP challenge file, or email verification. For EV: additional legal entity verification.
•Issuance: CA signs the certificate, creating a chain to their root CA. Certificate has validity period (notBefore, notAfter). Typically 90 days (Let's Encrypt) to 1 year (commercial CAs).
•Deployment: Certificate and private key deployed to servers. Must include intermediate certificates. Validate chain completeness before going live.
•Renewal: Before expiration, request a new certificate. Can reuse existing CSR or generate new. Automated renewal prevents outages.
•Revocation: If private key is compromised or certificate issued in error, revoke immediately. Revocation status published via CRL or OCSP.
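
The validity window set at issuance drives every later stage of the lifecycle. As a minimal sketch, assuming the certificate fields arrive in the dictionary form returned by Python's `ssl.SSLSocket.getpeercert()` (the sample dates below are made up for illustration), the lifetime and the days remaining can be computed like this:

```python
import ssl

SECONDS_PER_DAY = 86400


def validity_days(cert: dict) -> float:
    """Total lifetime of the certificate in days (notAfter - notBefore)."""
    start = ssl.cert_time_to_seconds(cert["notBefore"])
    end = ssl.cert_time_to_seconds(cert["notAfter"])
    return (end - start) / SECONDS_PER_DAY


def days_remaining(cert: dict, now: float) -> float:
    """Days until notAfter; negative once the certificate has expired."""
    return (ssl.cert_time_to_seconds(cert["notAfter"]) - now) / SECONDS_PER_DAY


# Hypothetical certificate fields, in getpeercert() format.
sample = {
    "notBefore": "Jan  1 00:00:00 2024 GMT",
    "notAfter": "Mar 31 00:00:00 2024 GMT",
}
issued = ssl.cert_time_to_seconds(sample["notBefore"])
print("lifetime (days):", validity_days(sample))
print("remaining on day 60:", days_remaining(sample, now=issued + 60 * SECONDS_PER_DAY))
```

Passing `now` explicitly keeps the functions easy to test; in production you would pass `time.time()`.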

The 90-Day Standard

Let's Encrypt issues certificates valid for 90 days, recommending renewal at 60 days. This short validity forces automation—a feature, not a bug. Manual management of 90-day certificates is impossible at scale. Many organizations adopt this practice even with commercial CAs, as shorter validity limits exposure from compromised keys.
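
The "renew at 60 of 90 days" guidance amounts to renewing once two thirds of the lifetime has elapsed. The sketch below shows that arithmetic; the function name and the two-thirds default are illustrative assumptions, not part of any ACME client's API:

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before: datetime, not_after: datetime,
                 elapsed_num: int = 2, elapsed_den: int = 3) -> datetime:
    """Moment at which elapsed_num/elapsed_den of the lifetime has passed."""
    lifetime = not_after - not_before
    # Integer arithmetic on timedelta avoids float rounding.
    return not_before + (lifetime * elapsed_num) // elapsed_den


issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)
print("renew on:", renewal_time(issued, expires).date())
```

The same function applies to shorter lifetimes: a 30-day internal certificate would come up for renewal on day 20.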

Automated Certificate Management with ACME

ACME (Automatic Certificate Management Environment) is the protocol that powers Let's Encrypt and has revolutionized certificate management. It enables fully automated issuance and renewal without human intervention.

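How ACME works: the client registers an account key with the CA, places an order for the domain names it wants, proves control of each name by completing a challenge, submits a CSR to finalize the order, and then downloads the issued certificate.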

ACME Challenge Types:

Challenge	How It Works	Best For	Limitations
HTTP-01	Place file at `/.well-known/acme-challenge/`	Web servers accessible on port 80	Requires inbound HTTP access
DNS-01	Create TXT record `_acme-challenge.domain`	Wildcard certs, internal services	Requires DNS API access
TLS-ALPN-01	Respond on TLS port with special cert	When only port 443 is open	Limited client support
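
To make the challenge table concrete, here is a sketch of what the client actually publishes, following the key-authorization rules in RFC 8555 and the JWK thumbprint rules in RFC 7638. The helper names and the placeholder key values are assumptions for illustration; a real client derives the JWK from its ACME account key.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, as used throughout ACME."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """SHA-256 over the required JWK members, sorted, with no whitespace."""
    required = {"RSA": ("e", "kty", "n"), "EC": ("crv", "kty", "x", "y")}[jwk["kty"]]
    canonical = json.dumps({k: jwk[k] for k in required},
                           sort_keys=True, separators=(",", ":"))
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, jwk: dict) -> str:
    """Challenge token joined to the account key thumbprint."""
    return f"{token}.{jwk_thumbprint(jwk)}"


def http01_path(token: str) -> str:
    """URL path where the HTTP-01 key authorization must be served."""
    return f"/.well-known/acme-challenge/{token}"


def dns01_txt_value(token: str, jwk: dict) -> str:
    """TXT record value for _acme-challenge.<domain> in DNS-01."""
    digest = hashlib.sha256(key_authorization(token, jwk).encode("ascii")).digest()
    return b64url(digest)


# Placeholder account key and token (not real key material).
account_jwk = {"kty": "EC", "crv": "P-256", "x": "example-x", "y": "example-y"}
token = "example-token"
print(http01_path(token), "->", key_authorization(token, account_jwk))
print("_acme-challenge TXT ->", dns01_txt_value(token, account_jwk))
```

Both challenges prove the same thing: whoever controls the account key also controls the web server or the DNS zone for the name.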

certbot-examples
# Certbot: Most popular ACME client
# https://certbot.eff.org/
 
# HTTP-01 Challenge (webserver must be accessible)
certbot certonly \
    --webroot \
    --webroot-path /var/www/html \
    --domain example.com \
    --domain www.example.com \
    --email admin@example.com \
    --agree-tos \
    --non-interactive
 
# DNS-01 Challenge with Cloudflare plugin (for wildcards)
certbot certonly \
    --dns-cloudflare \
    --dns-cloudflare-credentials ~/.secrets/cloudflare.ini \
    --domain "*.example.com" \
    --domain example.com \
    --email admin@example.com \
    --agree-tos
 
# Automatic renewal (add to cron or systemd timer)
certbot renew --quiet
 
# Test renewal without actually renewing
certbot renew --dry-run
 
# Deploy hook: reload services only after a certificate was actually renewed
certbot renew --deploy-hook "systemctl reload nginx"

Rate Limits Matter

Let's Encrypt enforces rate limits: 50 certificates per registered domain per week, 5 duplicate certificates per week, and 5 failed validations per hour. In CI/CD, test against the staging environment, which issues untrusted certificates and has much higher limits. Hit the production endpoint only for actual deployments. For high-scale deployments, consolidate hostnames into fewer certificates using SANs or wildcards.
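
Consolidating hostnames with SANs can be planned ahead of time. The sketch below groups an inventory into per-certificate name lists; the function name is hypothetical, and the default cap of 100 names per certificate is an assumption based on Let's Encrypt's documented limit, so treat it as a parameter to verify against your CA.

```python
from typing import Iterable, List


def plan_certificates(hostnames: Iterable[str],
                      max_names_per_cert: int = 100) -> List[List[str]]:
    """Group unique hostnames into SAN lists, one list per certificate."""
    unique = sorted({name.lower() for name in hostnames})
    return [unique[i:i + max_names_per_cert]
            for i in range(0, len(unique), max_names_per_cert)]


# Hypothetical inventory: 250 tenant subdomains under one registered domain.
tenants = [f"tenant{n}.example.com" for n in range(250)]
plan = plan_certificates(tenants)
print(len(tenants), "hostnames ->", len(plan), "certificates")
```

Issuing one certificate per hostname here would consume 250 issuances against the weekly limit; the grouped plan needs far fewer.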

cert-manager in Kubernetes

cert-manager is the de facto standard for certificate management in Kubernetes. It automates issuance, renewal, and secret management for TLS certificates.

Key Concepts:

•Issuer/ClusterIssuer: Defines where certificates come from (Let's Encrypt, Vault, internal CA, self-signed)
•Certificate: Declares desired certificate properties; cert-manager creates it and stores it in a Secret
•CertificateRequest: Lower-level resource for custom integrations
•Order/Challenge: ACME-specific resources tracking the validation process

cert-manager-config
# ClusterIssuer for Let's Encrypt (production)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # Production ACME server
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@example.com
    
    # Secret to store ACME account private key
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    
    # Solvers are matched per DNS name: HTTP-01 for specific hosts, DNS-01 for wildcards
    solvers:
      # HTTP-01 for specific domains
      - http01:
          ingress:
            class: nginx
        selector:
          dnsNames:
            - "api.example.com"
      
      # DNS-01 for wildcards (using Cloudflare)
      - dns01:
          cloudflare:
            email: dns-admin@example.com
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
        selector:
          dnsNames:
            - "*.example.com"
---
# Staging issuer for testing (untrusted certificates, much higher rate limits)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: platform-team@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - http01:
          ingress:
            class: nginx

cert-manager Best Practices

Use ClusterIssuers for organization-wide settings and namespace-scoped Issuers for team-specific CAs. For 90-day certificates, set renewBefore to about 30 days so there is time to recover from failed renewals; scale it down proportionally for shorter-lived certificates. Use staging issuers for development. Monitor cert-manager logs and Certificate status, and alert on certificate expiration.

Private CA for Internal Services

For internal services (microservices mTLS, internal APIs, databases), using public CAs like Let's Encrypt is often impractical:

•Internal hostnames (service.namespace.svc.cluster.local) aren't publicly resolvable, so a public CA cannot validate them
•High certificate volume would hit rate limits
•Faster issuance and shorter validity are possible with an internal CA
•An internal CA gives you control over certificate policies and extensions

Private CA Options:

Private CA Solutions
Solution	Type	Best For	Complexity
HashiCorp Vault PKI	Self-hosted secrets manager	Dynamic secrets, mTLS, multi-cloud	Medium-High
cert-manager CA Issuer	Kubernetes-native	K8s-only environments, simple setups	Low
AWS Private CA	Managed service	AWS-centric, compliance requirements	Low-Medium
CFSSL	Open-source CLI/API	Custom PKI, air-gapped environments	Medium
Step-CA	Modern open-source CA	ACME + internal, developer-friendly	Low-Medium
Service Mesh (Istio/Linkerd)	Integrated with mesh	Automatic mTLS for all mesh services	Medium

private-ca-setup
# Create Root CA certificate using cert-manager
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-root-ca
  namespace: cert-manager
spec:
  isCA: true
  commonName: Internal Root CA
  secretName: internal-root-ca-secret
  privateKey:
    algorithm: ECDSA
    size: 384
  duration: 87600h   # 10 years
  renewBefore: 8760h # 1 year before expiry
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
---
# Self-signed issuer (bootstrap only)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
---
# Internal CA issuer using root certificate
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca-issuer
spec:
  ca:
    secretName: internal-root-ca-secret
---
# Now services can request certificates from internal CA
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-service-cert
  namespace: production
spec:
  secretName: my-service-tls
  issuerRef:
    name: internal-ca-issuer
    kind: ClusterIssuer
  commonName: my-service.production.svc.cluster.local
  dnsNames:
    - my-service
    - my-service.production
    - my-service.production.svc
    - my-service.production.svc.cluster.local
  duration: 720h     # 30 days (short for internal)
  renewBefore: 168h  # Renew 7 days before expiry

Distribute Root CA Trust

Services validating internal certificates must trust your internal CA root. Distribute the root CA certificate to all clients: mount as ConfigMap in Kubernetes, include in container base images, push via configuration management. Without trust distribution, internal TLS validation fails.
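
On the client side, trust distribution ends with the TLS library loading your root. As a hedged sketch using Python's standard `ssl` module (the bundle path in the comment is a hypothetical mount point), a client can add the internal CA on top of the system trust store:

```python
import ssl
from typing import Optional


def client_context(internal_ca_bundle: Optional[str] = None) -> ssl.SSLContext:
    """TLS client context that verifies peers and checks hostnames.

    If a PEM bundle path is given, its CA certificates are trusted in
    addition to the system defaults.
    """
    ctx = ssl.create_default_context()
    if internal_ca_bundle is not None:
        ctx.load_verify_locations(cafile=internal_ca_bundle)
    return ctx


# With no bundle, only the system trust store is used.
ctx = client_context()
print(ctx.verify_mode, ctx.check_hostname)

# Hypothetical usage once the root is mounted into the container:
# ctx = client_context("/etc/internal-ca/ca.crt")
```

For services that should talk only to internal peers, a stricter variant would start from `ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)` and load only the internal root, so that publicly trusted certificates are rejected.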

Certificate Revocation

When a private key is compromised, the certificate must be revoked—marked as untrustworthy before its expiration date. Certificate revocation is critically important but operationally challenging.

Revocation Mechanisms:

Revocation Methods

•Certificate Revocation Lists (CRL): CA publishes a signed list of revoked certificate serial numbers. Clients download and cache. Simple but doesn't scale—CRLs can grow enormous for large CAs.
•Online Certificate Status Protocol (OCSP): Real-time query to CA's OCSP responder. Lighter than CRLs but adds latency to every connection and has privacy implications (CA sees all connections).
•OCSP Stapling: Server fetches OCSP response and 'staples' it to TLS handshake. Reduces latency and improves privacy. Requires server support.
•Short-Lived Certificates: Issue certificates with very short validity (hours/days). Compromised certificates expire quickly. Requires robust automated renewal. Used by some internal PKIs.
•OCSP Must-Staple: Certificate extension requiring stapled OCSP response. If server doesn't staple, connection fails. Enforces revocation checking but fragile if OCSP responder is down.

The hard truth about revocation:

Revocation checking is widely broken in practice:

•Soft-fail by default: Most browsers treat failed revocation checks as 'certificate is okay' to avoid breaking sites when OCSP is unreachable. This means revocation only works when convenient.
•CRLs are often stale: Caching means revocations don't propagate instantly. Days may pass before clients see a revoked certificate.
•Mobile clients often skip checks: To save battery and data, mobile apps frequently disable revocation checking entirely.
•OCSP availability: If the CA's OCSP responder is down, connections either fail (bad UX) or proceed without checking (bad security).
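
The soft-fail problem is easiest to see as a decision table. The sketch below is a simplified model, not any browser's actual logic; the enum and function names are made up for illustration.

```python
from enum import Enum


class OcspResult(Enum):
    GOOD = "good"
    REVOKED = "revoked"
    UNAVAILABLE = "unavailable"  # responder timed out or was blocked


def accept_connection(result: OcspResult, hard_fail: bool) -> bool:
    """Decide whether to proceed with the TLS connection."""
    if result is OcspResult.GOOD:
        return True
    if result is OcspResult.REVOKED:
        return False
    # No answer: soft-fail proceeds anyway, hard-fail refuses.
    return not hard_fail


for policy in (False, True):
    print("hard_fail =", policy, "->",
          accept_connection(OcspResult.UNAVAILABLE, hard_fail=policy))
```

An attacker holding a stolen key who can also block traffic to the OCSP responder lands in the `UNAVAILABLE` branch, which is why soft-fail offers little protection in exactly the case revocation is meant for.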

Revocation Method Comparison
Method	Latency	Freshness	Privacy	Reliability
CRL Download	High (large files)	Hours-old	Good (no CA visibility)	Good (cacheable)
OCSP Query	Medium (per-request)	Real-time	Poor (CA sees all)	Depends on responder
OCSP Stapling	None (bundled)	Minutes-old	Good (server fetches)	Depends on server impl
Short-Lived Certs	None	Built-in (via expiry)	Excellent	Depends on renewal infra

Prefer Short Validity Over Revocation

Given the fragility of revocation, reducing certificate validity is often more effective. A 24-hour certificate that can't be revoked is arguably more secure than a 1-year certificate with OCSP—because the 24-hour cert expires before most revocations would propagate anyway. This is why Google advocates for automation and short validity.
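
The comparison above comes down to a worst-case exposure window. In the sketch below, the 72-hour propagation delay is an assumed figure for illustration only; real propagation depends on CRL and OCSP cache lifetimes and on whether clients check at all.

```python
from typing import Optional


def exposure_hours(remaining_validity_hours: float,
                   revocation_effective_hours: Optional[float] = None) -> float:
    """Worst-case hours a stolen key remains usable.

    revocation_effective_hours is None when clients never learn of the
    revocation (for example, soft-fail with a blocked responder).
    """
    if revocation_effective_hours is None:
        return remaining_validity_hours
    return min(remaining_validity_hours, revocation_effective_hours)


ONE_YEAR_HOURS = 365 * 24
print("24h cert, no revocation:      ", exposure_hours(24))
print("1y cert, revocation in 72h:   ", exposure_hours(ONE_YEAR_HOURS, 72))
print("1y cert, revocation never seen:", exposure_hours(ONE_YEAR_HOURS))
```

The short-lived certificate bounds the damage without depending on any client behavior, while the long-lived one is only as safe as the least diligent client.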

Certificate Monitoring and Alerting

Certificate expiration is a leading cause of production incidents. Effective monitoring is not optional—it's essential for reliability.

Certificate Monitoring Checklist

•Inventory all certificates: Know every certificate in your infrastructure—public endpoints, internal services, databases, client certs, code signing, etc. Automated discovery tools help.
•Monitor expiration dates: Alert at 30, 14, 7, 3, and 1 day before expiry. Escalate urgency as deadline approaches. Page on-call at 24 hours if not renewed.
•Validate certificate chains: Ensure intermediates are present and valid. A complete chain today might break if an intermediate expires.
•Monitor TLS handshake success: Prometheus metrics on handshake failures, invalid certificates presented, and protocol errors.
•Track Certificate Transparency logs: Monitor CT for unauthorized certificates issued for your domains. Tools: crt.sh, Facebook CT Monitor, Hardenize.
•Test from external vantage points: Internal tests may pass while external clients fail. Use external monitoring services (Pingdom, Datadog, etc.).
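
The escalating thresholds in the checklist can be expressed as a small classification function. The threshold list comes from the checklist; the mapping of thresholds to severity names is an assumption you would adapt to your own alerting policy.

```python
from typing import Optional

THRESHOLDS_DAYS = (30, 14, 7, 3, 1)


def crossed_threshold(days_left: float) -> Optional[int]:
    """Smallest threshold (in days) that has been crossed, or None."""
    crossed = [t for t in THRESHOLDS_DAYS if days_left <= t]
    return min(crossed) if crossed else None


def severity(days_left: float) -> str:
    """Map days until expiry to an alert severity."""
    if days_left <= 0:
        return "expired"
    threshold = crossed_threshold(days_left)
    if threshold is None:
        return "ok"
    if threshold == 1:
        return "page"  # within 24 hours: wake someone up
    return "critical" if threshold <= 7 else "warning"


for days in (45, 20, 5, 0.5, -1):
    print(days, "->", severity(days))
```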

certificate-monitoring
# Prometheus alerting rules for certificate expiration
groups:
  - name: certificate-alerts
    rules:
      # Alert when certificate expires in less than 30 days
      - alert: CertificateExpiringThirtyDays
        # Keep the value in seconds so humanizeDuration renders it correctly
        expr: |
          probe_ssl_earliest_cert_expiry - time() < 30 * 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate expiring in < 30 days"
          description: "Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
 
      # Alert when certificate expires in less than 7 days
      - alert: CertificateExpiringSevenDays
        # Keep the value in seconds so humanizeDuration renders it correctly
        expr: |
          probe_ssl_earliest_cert_expiry - time() < 7 * 86400
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Certificate expiring in < 7 days"
          description: "URGENT: Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
 
      # Alert when certificate has already expired
      - alert: CertificateExpired
        # Positive value = seconds elapsed since expiry
        expr: |
          time() - probe_ssl_earliest_cert_expiry > 0
        for: 5m
        labels:
          severity: critical
          page: true
        annotations:
          summary: "Certificate has EXPIRED"
          description: "Certificate for {{ $labels.instance }} expired {{ $value | humanizeDuration }} ago"
 
      # Alert on TLS handshake failures
      - alert: TLSHandshakeFailure
        expr: |
          probe_success{job="tls-probe"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TLS probe failing"
          description: "Cannot establish TLS connection to {{ $labels.instance }}"

Commercial Monitoring Tools

Services such as Hardenize, Qualys SSL Labs, and Datadog provide certificate monitoring-as-a-service with external vantage points, CT log monitoring, and sophisticated alerting. For organizations without dedicated security tooling, these services provide immediate visibility with minimal setup.

Certificate Storage and Distribution

Certificates aren't secrets (they're public), but private keys are highly sensitive. How you store and distribute certificates and keys impacts both security and operational reliability.

Secure Storage Practices

•Secrets managers — HashiCorp Vault, AWS Secrets Manager, Azure Key Vault for centralized, audited storage
•Kubernetes Secrets — Native K8s storage with RBAC. Enable etcd encryption at rest.
•HSMs — Hardware Security Modules for highest-value keys (CA roots, payment systems)
•File permissions — Key files should be 600 (owner read/write only) if stored on disk
•Encryption at rest — Encrypt key files with KEK stored separately
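
The file-permission rule is easy to enforce at startup. This sketch assumes a POSIX system and checks that no group or other permission bits are set on a key file; the path in the comment is hypothetical.

```python
import os
import stat


def key_file_is_private(path: str) -> bool:
    """True if only the owner has any access to the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


# Hypothetical usage: refuse to start if the key is readable by others.
# if not key_file_is_private("/etc/tls/private/server.key"):
#     raise SystemExit("private key permissions are too open")
```

Mode 600 passes this check, as does the stricter read-only 400; mode 644 fails it.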

Anti-Patterns to Avoid

•Git — Never commit private keys to version control (even encrypted; key rotation is hard)
•Environment variables — Visible in process listings, logged by many tools
•Shared keys — Different services should have different key pairs
•Long-lived keys — Rotate keys at least annually; shorter for high-value systems
•Same key for all environments — Production keys should never exist in dev/test

Distribution Patterns:

1. Pull-based (secrets manager):

Service starts → Authenticates to Vault → Retrieves cert/key → Loads into TLS stack

Pros: Centralized, audited, supports rotation. Cons: Adds dependency, requires service identity.

2. Push-based (cert-manager, deployment):

cert-manager issues cert → Writes to K8s Secret → Mounted into Pod filesystem

Pros: Matches K8s patterns, automatic. Cons: Keys at rest in etcd, requires Secret encryption.

3. Sidecar injection:

Service Mesh injects sidecar → Sidecar handles TLS → App uses plaintext locally

Pros: Zero application changes, automatic mTLS. Cons: Service mesh complexity, sidecar overhead.

Hot Reloading

Configure services to detect certificate updates and reload without restart. NGINX: 'nginx -s reload'. HAProxy: socket commands. Many Go/Java apps need explicit implementation. This enables zero-downtime certificate rotation. Without hot reloading, you need rolling restarts on every renewal.
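
For applications that need an explicit implementation, one common approach is to poll the certificate files and reload when they change. The class below is a minimal sketch with hypothetical names; the `load` callback is where you would call something like `SSLContext.load_cert_chain`, which is generally expected to affect new handshakes only.

```python
import os
from typing import Callable, Tuple


class CertReloader:
    """Calls `load(cert_path, key_path)` whenever either file changes."""

    def __init__(self, cert_path: str, key_path: str,
                 load: Callable[[str, str], None]) -> None:
        self.paths: Tuple[str, str] = (cert_path, key_path)
        self.load = load
        self._seen = self._stamp()

    def _stamp(self) -> Tuple[int, ...]:
        # Modification times in nanoseconds for both files.
        return tuple(os.stat(p).st_mtime_ns for p in self.paths)

    def poll(self) -> bool:
        """Reload if the files changed since the last check."""
        current = self._stamp()
        if current == self._seen:
            return False
        self.load(*self.paths)
        self._seen = current
        return True


# Hypothetical usage with an ssl.SSLContext named ctx:
# reloader = CertReloader("tls.crt", "tls.key", ctx.load_cert_chain)
# call reloader.poll() from a timer, for example once a minute
```

Polling is simple and portable; file-system notifications (inotify and similar) react faster but add platform-specific code.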

Summary: Certificate Management

We've covered the operational challenge of certificate management at scale—from issuance to expiration, and all the automation needed to keep systems running. Let's consolidate the key takeaways:

Key Takeaways

•Certificates expire—plan for it — Every certificate has a countdown to outage. Manual renewal doesn't scale. Automation is not optional.
•ACME enables fully automated public certificates — Let's Encrypt and the ACME protocol have democratized TLS. Use certbot, Traefik, or cert-manager for hands-off renewal.
•cert-manager is the Kubernetes standard — Define Issuers and Certificates; cert-manager handles the rest. Integrate with both public and private CAs.
•Private CAs are essential for internal mTLS — Vault PKI, cert-manager CA, or service mesh PKI enable automated internal certificate issuance without public CA limitations.
•Revocation is broken in practice — CRLs are stale, OCSP is unreliable. Use short-lived certificates and OCSP stapling to mitigate. Don't rely on revocation as primary defense.
•Monitor everything — Expiration alerts, handshake failures, CT logs, and external probing. Certificate outages are preventable with proper visibility.
•Protect private keys — Use secrets managers, HSMs, or at minimum proper file permissions. Never commit to git. Rotate keys with certificates.

What's next:

With certificate management covered, the final page in this module examines mTLS for Services—mutual TLS authentication where both client and server present certificates. We'll explore how to implement zero-trust service-to-service authentication using mTLS in microservices architectures.

Page Complete

You now understand the certificate lifecycle, automated management with ACME, private CA infrastructure, revocation challenges, and monitoring requirements. These operational practices ensure TLS remains a foundation of security rather than a source of incidents. Next, we'll explore mutual TLS for service authentication.

4 / 5

Loading learning content...

System Design (HLD)TLS and Encryption in Transit

TLS and Encryption in Transit

LevelAdvanced

Duration90 mins

TopicTLS and Encryption in Transit

4 / 5

Certificate Management

The Invisible Expiration Bomb

What You Will Learn

The Certificate Lifecycle

Every X.509 certificate follows a lifecycle from creation to expiration (or revocation). Understanding this lifecycle is essential for proper management.

Converting Mermaid diagram...

Certificate Lifecycle Stages

•Key Generation: Private/public key pair created. Private key MUST be protected; if compromised, the certificate becomes a liability. Generate on secure systems; never share private keys.
•Certificate Signing Request (CSR): Contains the public key and identity information (domain names, organization). Signed by the private key to prove possession. Sent to CA.
•Validation: CA verifies the requester controls the domain(s) in the CSR. Methods: DNS record (TXT), HTTP challenge file, or email verification. For EV: additional legal entity verification.
•Issuance: CA signs the certificate, creating a chain to their root CA. Certificate has validity period (notBefore, notAfter). Typically 90 days (Let's Encrypt) to 1 year (commercial CAs).
•Deployment: Certificate and private key deployed to servers. Must include intermediate certificates. Validate chain completeness before going live.
•Renewal: Before expiration, request a new certificate. Can reuse existing CSR or generate new. Automated renewal prevents outages.
•Revocation: If private key is compromised or certificate issued in error, revoke immediately. Revocation status published via CRL or OCSP.

The 90-Day Standard

Automated Certificate Management with ACME

How ACME works:

Converting Mermaid diagram...

ACME Challenge Types:

Challenge	How It Works	Best For	Limitations
HTTP-01	Place file at `/.well-known/acme-challenge/`	Web servers accessible on port 80	Requires inbound HTTP access
DNS-01	Create TXT record `_acme-challenge.domain`	Wildcard certs, internal services	Requires DNS API access
TLS-ALPN-01	Respond on TLS port with special cert	When only port 443 is open	Limited client support

certbot-examples
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# Certbot: Most popular ACME client
# https://certbot.eff.org/
 
# HTTP-01 Challenge (webserver must be accessible)
certbot certonly \
    --webroot \
    --webroot-path /var/www/html \
    --domain example.com \
    --domain www.example.com \
    --email admin@example.com \
    --agree-tos \
    --non-interactive
 
# DNS-01 Challenge with Cloudflare plugin (for wildcards)
certbot certonly \
    --dns-cloudflare \
    --dns-cloudflare-credentials ~/.secrets/cloudflare.ini \
    --domain "*.example.com" \
    --domain example.com \
    --email admin@example.com \
    --agree-tos
 
# Automatic renewal (add to cron or systemd timer)
certbot renew --quiet
 
# Test renewal without actually renewing
certbot renew --dry-run
 
# Post-renewal hook to reload services
certbot renew --post-hook "systemctl reload nginx"

Rate Limits Matter

cert-manager in Kubernetes

cert-manager is the de facto standard for certificate management in Kubernetes. It automates issuance, renewal, and secret management for TLS certificates.

Key Concepts:

Issuer/ClusterIssuer: Defines where certificates come from (Let's Encrypt, Vault, internal CA, self-signed)
Certificate: Declares desired certificate properties; cert-manager creates it and stores in a Secret
CertificateRequest: Lower-level resource for custom integrations
Order/Challenge: ACME-specific resources tracking the validation process

cert-manager-config
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
# ClusterIssuer for Let's Encrypt (production)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # Production ACME server
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@example.com
    
    # Secret to store ACME account private key
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    
    # Enable DNS-01 challenge for wildcards
    solvers:
      # HTTP-01 for specific domains
      - http01:
          ingress:
            class: nginx
        selector:
          dnsNames:
            - "api.example.com"
      
      # DNS-01 for wildcards (using Cloudflare)
      - dns01:
          cloudflare:
            email: dns-admin@example.com
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
        selector:
          dnsNames:
            - "*.example.com"
---
# Staging issuer for testing (no rate limits)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: platform-team@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - http01:
          ingress:
            class: nginx

cert-manager Best Practices

Private CA for Internal Services

For internal services (microservices mTLS, internal APIs, databases), using public CAs like Let's Encrypt is often impractical:

Internal hostnames (service.namespace.svc.cluster.local) aren't publicly resolvable
High certificate volume would hit rate limits
Faster issuance and shorter validity are possible with internal CA
Control over certificate policies and extensions

Private CA Options:

Private CA Solutions
Solution	Type	Best For	Complexity
HashiCorp Vault PKI	Self-hosted secrets manager	Dynamic secrets, mTLS, multi-cloud	Medium-High
cert-manager CA Issuer	Kubernetes-native	K8s-only environments, simple setups	Low
AWS Private CA	Managed service	AWS-centric, compliance requirements	Low-Medium
CFSSL	Open-source CLI/API	Custom PKI, air-gapped environments	Medium
Step-CA	Modern open-source CA	ACME + internal, developer-friendly	Low-Medium
Service Mesh (Istio/Linkerd)	Integrated with mesh	Automatic mTLS for all mesh services	Medium

private-ca-setup
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
# Create Root CA certificate using cert-manager
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-root-ca
  namespace: cert-manager
spec:
  isCA: true
  commonName: Internal Root CA
  secretName: internal-root-ca-secret
  privateKey:
    algorithm: ECDSA
    size: 384
  duration: 87600h   # 10 years
  renewBefore: 8760h # 1 year before expiry
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
---
# Self-signed issuer (bootstrap only)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
---
# Internal CA issuer using root certificate
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca-issuer
spec:
  ca:
    secretName: internal-root-ca-secret
---
# Now services can request certificates from internal CA
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-service-cert
  namespace: production
spec:
  secretName: my-service-tls
  issuerRef:
    name: internal-ca-issuer
    kind: ClusterIssuer
  commonName: my-service.production.svc.cluster.local
  dnsNames:
    - my-service
    - my-service.production
    - my-service.production.svc
    - my-service.production.svc.cluster.local
  duration: 720h     # 30 days (short for internal)
  renewBefore: 168h  # Renew 7 days before expiry

Distribute Root CA Trust

Certificate Revocation

When a private key is compromised, the certificate must be revoked—marked as untrustworthy before its expiration date. Certificate revocation is critically important but operationally challenging.

Revocation Mechanisms:

Revocation Methods

•Certificate Revocation Lists (CRL): CA publishes a signed list of revoked certificate serial numbers. Clients download and cache. Simple but doesn't scale—CRLs can grow enormous for large CAs.
•Online Certificate Status Protocol (OCSP): Real-time query to CA's OCSP responder. Lighter than CRLs but adds latency to every connection and has privacy implications (CA sees all connections).
•OCSP Stapling: Server fetches OCSP response and 'staples' it to TLS handshake. Reduces latency and improves privacy. Requires server support.
•Short-Lived Certificates: Issue certificates with very short validity (hours/days). Compromised certificates expire quickly. Requires robust automated renewal. Used by some internal PKIs.
•OCSP Must-Staple: Certificate extension requiring stapled OCSP response. If server doesn't staple, connection fails. Enforces revocation checking but fragile if OCSP responder is down.

The hard truth about revocation:

Revocation checking is widely broken in practice:

Soft-fail by default: Most browsers treat failed revocation checks as 'certificate is okay' to avoid breaking sites when OCSP is unreachable. This means revocation only works when convenient.
CRLs are often stale: Caching means revocations don't propagate instantly. Days may pass before clients see a revoked certificate.
Mobile clients often skip checks: To save battery and data, mobile apps frequently disable revocation checking entirely.
OCSP availability: If the CA's OCSP responder is down, connections either fail (bad UX) or proceed without checking (bad security).

Revocation Method Comparison
Method	Latency	Freshness	Privacy	Reliability
CRL Download	High (large files)	Hours-old	Good (no CA visibility)	Good (cacheable)
OCSP Query	Medium (per-request)	Real-time	Poor (CA sees all)	Depends on responder
OCSP Stapling	None (bundled)	Minutes-old	Good (server fetches)	Depends on server impl
Short-Lived Certs	None	Built-in (via expiry)	Excellent	Depends on renewal infra

Prefer Short Validity Over Revocation

Certificate Monitoring and Alerting

Certificate expiration is a leading cause of production incidents. Effective monitoring is not optional—it's essential for reliability.

Certificate Monitoring Checklist

•Inventory all certificates: Know every certificate in your infrastructure—public endpoints, internal services, databases, client certs, code signing, etc. Automated discovery tools help.
•Monitor expiration dates: Alert at 30, 14, 7, 3, and 1 day before expiry. Escalate urgency as deadline approaches. Page on-call at 24 hours if not renewed.
•Validate certificate chains: Ensure intermediates are present and valid. A complete chain today might break if an intermediate expires.
•Monitor TLS handshake success: Prometheus metrics on handshake failures, invalid certificates presented, and protocol errors.
•Track Certificate Transparency logs: Monitor CT for unauthorized certificates issued for your domains. Tools: crt.sh, Facebook CT Monitor, Hardenize.
•Test from external vantage points: Internal tests may pass while external clients fail. Use external monitoring services (Pingdom, Datadog, etc.).

certificate-monitoring
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
# Prometheus alerting rules for certificate expiration
groups:
  - name: certificate-alerts
    rules:
      # Alert when certificate expires in less than 30 days
      - alert: CertificateExpiringThirtyDays
        expr: |
          (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate expiring in < 30 days"
          # $value is already in days here, so format it as a number rather than as a duration in seconds
          description: 'Certificate for {{ $labels.instance }} expires in {{ $value | printf "%.1f" }} days'
 
      # Alert when certificate expires in less than 7 days
      - alert: CertificateExpiringSevenDays
        expr: |
          (probe_ssl_earliest_cert_expiry - time()) / 86400 < 7
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Certificate expiring in < 7 days"
          # $value is already in days here, so format it as a number rather than as a duration in seconds
          description: 'URGENT: Certificate for {{ $labels.instance }} expires in {{ $value | printf "%.1f" }} days'
 
      # Alert when certificate has already expired
      - alert: CertificateExpired
        # Subtract in this order so $value is the positive number of seconds since expiry
        expr: |
          time() - probe_ssl_earliest_cert_expiry > 0
        for: 5m
        labels:
          severity: critical
          page: "true"
        annotations:
          summary: "Certificate has EXPIRED"
          description: "Certificate for {{ $labels.instance }} expired {{ $value | humanizeDuration }} ago"
 
      # Alert on TLS handshake failures
      - alert: TLSHandshakeFailure
        expr: |
          probe_success{job="tls-probe"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TLS probe failing"
          description: "Cannot establish TLS connection to {{ $labels.instance }}"

Commercial Monitoring Tools

Certificate Storage and Distribution

Certificates aren't secrets (they're public), but private keys are highly sensitive. How you store and distribute certificates and keys impacts both security and operational reliability.

Secure Storage Practices

•Secrets managers — HashiCorp Vault, AWS Secrets Manager, Azure Key Vault for centralized, audited storage
•Kubernetes Secrets — Native K8s storage with RBAC. Enable etcd encryption at rest.
•HSMs — Hardware Security Modules for highest-value keys (CA roots, payment systems)
•File permissions — Key files should be 600 (owner read/write only) if stored on disk
•Encryption at rest — Encrypt key files with KEK stored separately
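
The file-permission rule above is easy to audit automatically. The following is a minimal sketch for POSIX systems, using only the standard library. The function name `key_file_problems` is hypothetical, and a real audit would likely also check file ownership and parent directory permissions.

```python
import os
import stat


def key_file_problems(path: str) -> list[str]:
    """Return findings for a private key file; an empty list means it looks fine."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    problems = []
    # Any group or other permission bit violates the 600 rule.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        problems.append(
            f"{path}: mode {mode:03o} grants group/other access; expected 600"
        )
    return problems
```

Running this over every key path in the certificate inventory, for example from a CI job or a host agent, turns the rule into a check that fails loudly instead of a convention that drifts.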

Anti-Patterns to Avoid

•Git — Never commit private keys to version control (even encrypted; key rotation is hard)
•Environment variables — Visible in process listings, logged by many tools
•Shared keys — Different services should have different key pairs
•Long-lived keys — Rotate keys at least annually; shorter for high-value systems
•Same key for all environments — Production keys should never exist in dev/test

Distribution Patterns:

1. Pull-based (secrets manager):

Service starts → Authenticates to Vault → Retrieves cert/key → Loads into TLS stack

Pros: Centralized, audited, supports rotation. Cons: Adds dependency, requires service identity.
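
The pull-based flow can be sketched as follows. This is an illustration under stated assumptions: `fetch` stands in for whatever authenticated call the secrets manager client provides (it is not a real Vault API), and the helper names `write_private` and `pull_credentials` are invented here. The sketch covers the step between retrieval and loading: writing the material to disk atomically with owner-only permissions, so that a rotation never leaves a half-written key file.

```python
import os
import tempfile
from typing import Callable, Tuple

# Hypothetical fetcher: returns (certificate_pem, private_key_pem) as bytes.
Fetcher = Callable[[], Tuple[bytes, bytes]]


def write_private(path: str, data: bytes) -> None:
    """Atomically write data to path; the file ends up with mode 0600."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the temp file readable and writable only by the owner.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # Rename within the same directory, so readers see old or new, never partial.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


def pull_credentials(fetch: Fetcher, cert_path: str, key_path: str) -> None:
    """Retrieve the cert/key pair and store both files on disk."""
    cert_pem, key_pem = fetch()
    write_private(key_path, key_pem)
    write_private(cert_path, cert_pem)
```

After `pull_credentials` returns, a Python service could load the pair into its TLS stack with `ssl.SSLContext.load_cert_chain(cert_path, key_path)`. Because the key is written before the certificate, there is a brief window in which the new key is paired with the old certificate. A production implementation might instead write both into a fresh directory and swap a symlink.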

2. Push-based (cert-manager, deployment):

cert-manager issues cert → Writes to K8s Secret → Mounted into Pod filesystem

Pros: Matches K8s patterns, automatic. Cons: Keys at rest in etcd, requires Secret encryption.

3. Sidecar injection:

Service Mesh injects sidecar → Sidecar handles TLS → App uses plaintext locally

Pros: Zero application changes, automatic mTLS. Cons: Service mesh complexity, sidecar overhead.

Hot Reloading

Summary: Certificate Management

We've covered the operational challenge of certificate management at scale—from issuance to expiration, and all the automation needed to keep systems running. Let's consolidate the key takeaways:

Key Takeaways

•Certificates expire—plan for it — Every certificate has a countdown to outage. Manual renewal doesn't scale. Automation is not optional.
•ACME enables fully automated public certificates — Let's Encrypt and the ACME protocol have democratized TLS. Use certbot, Traefik, or cert-manager for hands-off renewal.
•cert-manager is the Kubernetes standard — Define Issuers and Certificates; cert-manager handles the rest. Integrate with both public and private CAs.
•Private CAs are essential for internal mTLS — Vault PKI, cert-manager CA, or service mesh PKI enable automated internal certificate issuance without public CA limitations.
•Revocation is broken in practice — CRLs are stale, OCSP is unreliable. Use short-lived certificates and OCSP stapling to mitigate. Don't rely on revocation as primary defense.
•Monitor everything — Expiration alerts, handshake failures, CT logs, and external probing. Certificate outages are preventable with proper visibility.
•Protect private keys — Use secrets managers, HSMs, or at minimum proper file permissions. Never commit to git. Rotate keys with certificates.

What's next:
