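On September 30, 2021, a large number of older devices and clients suddenly began failing to establish TLS connections. The cause? A single root certificate expired: DST Root CA X3, the IdenTrust root that had cross-signed Let's Encrypt's certificates. Devices with outdated certificate stores trusted only the expired root and could not validate chains anchored in its replacement, ISRG Root X1, causing widespread outages.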
This incident illustrates a critical reality: certificates are time bombs with expiration dates, and managing them at scale is one of the most operationally challenging aspects of running secure infrastructure.
A forgotten certificate renewal can bring down your entire platform. Manual renewal processes don't scale. Revocation is poorly understood and often doesn't work as expected. And as organizations adopt TLS everywhere (internal services, mTLS, IoT devices), certificate inventory explodes from dozens to thousands or millions.
This page covers the complete certificate lifecycle: issuance, renewal, distribution, revocation, and monitoring. You'll learn how to build certificate infrastructure that's reliable, automated, and doesn't wake anyone up at 3 AM.
By the end of this page, you will understand the certificate lifecycle, automated certificate management with ACME and cert-manager, private CA infrastructure for internal services, certificate revocation mechanisms and their limitations, and monitoring strategies to prevent certificate-related outages.
Every X.509 certificate follows a lifecycle from creation to expiration (or revocation). Understanding this lifecycle is essential for proper management.
Let's Encrypt issues certificates valid for 90 days, recommending renewal at 60 days. This short validity forces automation—a feature, not a bug. Manual management of 90-day certificates is impossible at scale. Many organizations adopt this practice even with commercial CAs, as shorter validity limits exposure from compromised keys.
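The 90-day and 60-day figures above reduce to simple date arithmetic. The following is a minimal sketch, assuming renewal is expressed as a "renew before expiry" margin; the function name `renewal_time` and the example dates are illustrative, not part of any ACME client.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before: datetime, not_after: datetime,
                 renew_before: timedelta = timedelta(days=30)) -> datetime:
    """Return the moment renewal should start: `renew_before` ahead of expiry.

    The 30-day default mirrors "renew a 90-day certificate at day 60".
    """
    lifetime = not_after - not_before
    if lifetime <= timedelta(0):
        raise ValueError("not_after must be later than not_before")
    if not timedelta(0) < renew_before < lifetime:
        raise ValueError("renew_before must be positive and shorter than the lifetime")
    return not_after - renew_before


# Hypothetical certificate issued on 2024-01-01 with a 90-day lifetime.
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)
renew_at = renewal_time(issued, expires)
print(renew_at.date())  # day 60 of the lifetime
```

Leaving a third of the lifetime as margin means a failed renewal can be retried for weeks before anything expires.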
ACME (Automatic Certificate Management Environment) is the protocol that powers Let's Encrypt and has revolutionized certificate management. It enables fully automated issuance and renewal without human intervention.
How ACME works:
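As a rough outline based on RFC 8555, an ACME order starts as `pending`, becomes `ready` once every authorization (challenge) has been validated, moves to `processing` when the client finalizes the order with a CSR, and ends as `valid` when the certificate is issued. Any error makes it `invalid`. The sketch below models only those transitions; the event names and helper functions are hypothetical labels, not protocol fields.

```python
# Order states and the events that move between them, following the
# order status description in RFC 8555 (section 7.1.6).
# Event names are illustrative labels chosen for this sketch.
TRANSITIONS = {
    ("pending", "authorizations_valid"): "ready",
    ("ready", "finalize_with_csr"): "processing",
    ("processing", "certificate_issued"): "valid",
}


def advance(state: str, event: str) -> str:
    """Return the next order state; any error makes the order invalid."""
    if event == "error":
        return "invalid"
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} is not allowed in state {state!r}") from None


def run_order(events):
    """Replay a sequence of events from a new order and return the state history."""
    state = "pending"
    history = [state]
    for event in events:
        state = advance(state, event)
        history.append(state)
    return history


print(run_order(["authorizations_valid", "finalize_with_csr", "certificate_issued"]))
```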
ACME Challenge Types:
| Challenge | How It Works | Best For | Limitations |
|---|---|---|---|
| HTTP-01 | Place file at /.well-known/acme-challenge/ | Web servers accessible on port 80 | Requires inbound HTTP access |
| DNS-01 | Create TXT record _acme-challenge.domain | Wildcard certs, internal services | Requires DNS API access |
| TLS-ALPN-01 | Respond on TLS port with special cert | When only port 443 is open | Limited client support |
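What the client actually publishes for HTTP-01 and DNS-01 is derived from the challenge token and the ACME account key. The sketch below follows the key authorization definition in RFC 8555 and the JWK thumbprint in RFC 7638, using only the standard library. The JWK coordinates and the token are placeholder values, not a real key, and the helper names are illustrative.

```python
import base64
import hashlib
import json


def b64url(data: bytes) -> str:
    """Base64url without padding, the encoding ACME uses for binary values."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def jwk_thumbprint(jwk: dict) -> str:
    """RFC 7638 thumbprint: SHA-256 over the required members, sorted, no whitespace."""
    # Only RSA and EC keys are handled in this sketch; other types raise KeyError.
    required = {"RSA": ("e", "kty", "n"), "EC": ("crv", "kty", "x", "y")}[jwk["kty"]]
    canonical = json.dumps({name: jwk[name] for name in required},
                           sort_keys=True, separators=(",", ":"))
    return b64url(hashlib.sha256(canonical.encode("utf-8")).digest())


def key_authorization(token: str, jwk: dict) -> str:
    """Body of the file served at /.well-known/acme-challenge/<token> for HTTP-01."""
    return f"{token}.{jwk_thumbprint(jwk)}"


def dns01_txt_value(token: str, jwk: dict) -> str:
    """Value of the _acme-challenge TXT record for DNS-01."""
    digest = hashlib.sha256(key_authorization(token, jwk).encode("ascii")).digest()
    return b64url(digest)


# Placeholder account key and token, for illustration only.
EXAMPLE_JWK = {"kty": "EC", "crv": "P-256", "x": "placeholder-x", "y": "placeholder-y"}
EXAMPLE_TOKEN = "example-token"

print(key_authorization(EXAMPLE_TOKEN, EXAMPLE_JWK))
print(dns01_txt_value(EXAMPLE_TOKEN, EXAMPLE_JWK))
```

Because both values depend on the account key, only the holder of that key can produce a response the CA will accept for a given token.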
```bash
# Certbot: Most popular ACME client
# https://certbot.eff.org/

# HTTP-01 Challenge (webserver must be accessible)
certbot certonly \
  --webroot \
  --webroot-path /var/www/html \
  --domain example.com \
  --domain www.example.com \
  --email admin@example.com \
  --agree-tos \
  --non-interactive

# DNS-01 Challenge with Cloudflare plugin (for wildcards)
certbot certonly \
  --dns-cloudflare \
  --dns-cloudflare-credentials ~/.secrets/cloudflare.ini \
  --domain "*.example.com" \
  --domain example.com \
  --email admin@example.com \
  --agree-tos

# Automatic renewal (add to cron or systemd timer)
certbot renew --quiet

# Test renewal without actually renewing
certbot renew --dry-run

# Post-renewal hook to reload services
certbot renew --post-hook "systemctl reload nginx"
```

Let's Encrypt has rate limits: 50 certificates per registered domain per week, 5 duplicate certificates per week, 5 failed validations per hour. In CI/CD, use the staging environment (untrusted test certificates, much higher limits) for testing. Only hit the production Let's Encrypt endpoint for actual deployments. Plan certificate consolidation using SANs or wildcards for high-scale deployments.
cert-manager is the de facto standard for certificate management in Kubernetes. It automates issuance, renewal, and secret management for TLS certificates.
Key Concepts:
```yaml
# ClusterIssuer for Let's Encrypt (production)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # Production ACME server
    server: https://acme-v02.api.letsencrypt.org/directory
    email: platform-team@example.com
    # Secret to store ACME account private key
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    # Enable DNS-01 challenge for wildcards
    solvers:
      # HTTP-01 for specific domains
      - http01:
          ingress:
            class: nginx
        selector:
          dnsNames:
            - "api.example.com"
      # DNS-01 for wildcards (using Cloudflare)
      - dns01:
          cloudflare:
            email: dns-admin@example.com
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
        selector:
          dnsNames:
            - "*.example.com"
---
# Staging issuer for testing (untrusted certs, much higher rate limits)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: platform-team@example.com
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - http01:
          ingress:
            class: nginx
```

Use ClusterIssuers for organization-wide settings. Create namespace-scoped Issuers for team-specific CAs. For 90-day certificates, set renewBefore to at least 30 days to allow time for failure recovery. Use staging issuers for development. Monitor cert-manager logs and Certificate status. Implement alerting on certificate expiration.
For internal services (microservices mTLS, internal APIs, databases), using public CAs like Let's Encrypt is often impractical:
Private CA Options:
| Solution | Type | Best For | Complexity |
|---|---|---|---|
| HashiCorp Vault PKI | Self-hosted secrets manager | Dynamic secrets, mTLS, multi-cloud | Medium-High |
| cert-manager CA Issuer | Kubernetes-native | K8s-only environments, simple setups | Low |
| AWS Private CA | Managed service | AWS-centric, compliance requirements | Low-Medium |
| CFSSL | Open-source CLI/API | Custom PKI, air-gapped environments | Medium |
| Step-CA | Modern open-source CA | ACME + internal, developer-friendly | Low-Medium |
| Service Mesh (Istio/Linkerd) | Integrated with mesh | Automatic mTLS for all mesh services | Medium |
```yaml
# Create Root CA certificate using cert-manager
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-root-ca
  namespace: cert-manager
spec:
  isCA: true
  commonName: Internal Root CA
  secretName: internal-root-ca-secret
  privateKey:
    algorithm: ECDSA
    size: 384
  duration: 87600h # 10 years
  renewBefore: 8760h # 1 year before expiry
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
---
# Self-signed issuer (bootstrap only)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
---
# Internal CA issuer using root certificate
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: internal-ca-issuer
spec:
  ca:
    secretName: internal-root-ca-secret
---
# Now services can request certificates from internal CA
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-service-cert
  namespace: production
spec:
  secretName: my-service-tls
  issuerRef:
    name: internal-ca-issuer
    kind: ClusterIssuer
  commonName: my-service.production.svc.cluster.local
  dnsNames:
    - my-service
    - my-service.production
    - my-service.production.svc
    - my-service.production.svc.cluster.local
  duration: 720h # 30 days (short for internal)
  renewBefore: 168h # Renew 7 days before expiry
```

Services validating internal certificates must trust your internal CA root. Distribute the root CA certificate to all clients: mount as ConfigMap in Kubernetes, include in container base images, push via configuration management. Without trust distribution, internal TLS validation fails.
When a private key is compromised, the certificate must be revoked—marked as untrustworthy before its expiration date. Certificate revocation is critically important but operationally challenging.
Revocation Mechanisms:
The hard truth about revocation:
Revocation checking is widely broken in practice:
Soft-fail by default: Most browsers treat failed revocation checks as 'certificate is okay' to avoid breaking sites when OCSP is unreachable. This means revocation only works when convenient.
CRLs are often stale: Caching means revocations don't propagate instantly. Days may pass before clients see a revoked certificate.
Mobile clients often skip checks: To save battery and data, mobile apps frequently disable revocation checking entirely.
OCSP availability: If the CA's OCSP responder is down, connections either fail (bad UX) or proceed without checking (bad security).
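The difference between soft-fail and hard-fail comes down to how a client treats an inconclusive revocation check. The following is a minimal sketch of that decision, assuming the check can return only three outcomes; the enum and function names are illustrative and do not correspond to any browser or TLS library API.

```python
from enum import Enum


class Revocation(Enum):
    GOOD = "good"
    REVOKED = "revoked"
    UNKNOWN = "unknown"  # responder unreachable, timed out, or data too stale to use


def accept_connection(status: Revocation, hard_fail: bool) -> bool:
    """Decide whether to proceed with a connection after a revocation check."""
    if status is Revocation.REVOKED:
        return False              # a definite answer is always honored
    if status is Revocation.UNKNOWN:
        return not hard_fail      # soft-fail proceeds, hard-fail refuses
    return True


# An attacker who can block the OCSP request turns REVOKED into UNKNOWN,
# which a soft-fail client accepts.
print(accept_connection(Revocation.UNKNOWN, hard_fail=False))
print(accept_connection(Revocation.UNKNOWN, hard_fail=True))
```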
| Method | Latency | Freshness | Privacy | Reliability |
|---|---|---|---|---|
| CRL Download | High (large files) | Hours-old | Good (no CA visibility) | Good (cacheable) |
| OCSP Query | Medium (per-request) | Near real-time (responses may be cached) | Poor (CA sees all) | Depends on responder |
| OCSP Stapling | None (bundled) | Minutes to days old (cached response) | Good (server fetches) | Depends on server impl |
| Short-Lived Certs | None | Built-in (via expiry) | Excellent | Depends on renewal infra |
Given the fragility of revocation, reducing certificate validity is often more effective. A 24-hour certificate that can't be revoked is arguably more secure than a 1-year certificate with OCSP—because the 24-hour cert expires before most revocations would propagate anyway. This is why Google advocates for automation and short validity.
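The comparison can be made concrete with a little arithmetic. In the sketch below, the exposure window after a key compromise is the remaining validity, capped by the time revocation takes to reach clients when they actually check it. The 300-day and 4-day figures are illustrative assumptions, not measurements.

```python
from datetime import timedelta
from typing import Optional


def exposure_window(remaining_validity: timedelta,
                    revocation_propagation: Optional[timedelta] = None) -> timedelta:
    """How long a stolen key stays usable.

    `revocation_propagation` is None when clients do not check revocation
    (or soft-fail while the attacker blocks the check).
    """
    if revocation_propagation is None:
        return remaining_validity
    return min(remaining_validity, revocation_propagation)


# Illustrative numbers: a 1-year certificate compromised with 300 days left,
# and revocation data that takes 4 days to reach clients.
long_lived_checked = exposure_window(timedelta(days=300), timedelta(days=4))
long_lived_unchecked = exposure_window(timedelta(days=300))
short_lived = exposure_window(timedelta(hours=24), timedelta(days=4))

print(long_lived_checked, long_lived_unchecked, short_lived)
```

Under these assumptions the short-lived certificate bounds exposure at 24 hours whether or not any client checks revocation, while the long-lived one depends entirely on revocation working.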
Certificate expiration is a leading cause of production incidents. Effective monitoring is not optional—it's essential for reliability.
```yaml
# Prometheus alerting rules for certificate expiration
# Expressions stay in seconds so that humanizeDuration formats $value correctly.
groups:
  - name: certificate-alerts
    rules:
      # Alert when certificate expires in less than 30 days
      - alert: CertificateExpiringThirtyDays
        expr: |
          probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate expiring in < 30 days"
          description: "Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"

      # Alert when certificate expires in less than 7 days
      - alert: CertificateExpiringSevenDays
        expr: |
          probe_ssl_earliest_cert_expiry - time() < 86400 * 7
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Certificate expiring in < 7 days"
          description: "URGENT: Certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"

      # Alert when certificate has already expired
      # ($value is the positive number of seconds since expiry)
      - alert: CertificateExpired
        expr: |
          time() - probe_ssl_earliest_cert_expiry > 0
        for: 5m
        labels:
          severity: critical
          page: true
        annotations:
          summary: "Certificate has EXPIRED"
          description: "Certificate for {{ $labels.instance }} expired {{ $value | humanizeDuration }} ago"

      # Alert on TLS handshake failures
      - alert: TLSHandshakeFailure
        expr: |
          probe_success{job="tls-probe"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TLS probe failing"
          description: "Cannot establish TLS connection to {{ $labels.instance }}"
```

Tools like Hardenize, Qualys SSL Labs, and Datadog provide certificate monitoring-as-a-service with external vantage points, CT log monitoring, and sophisticated alerting. For organizations without dedicated security tooling, these services provide immediate visibility with minimal setup.
Certificates aren't secrets (they're public), but private keys are highly sensitive. How you store and distribute certificates and keys impacts both security and operational reliability.
Distribution Patterns:
1. Pull-based (secrets manager):
Service starts → Authenticates to Vault → Retrieves cert/key → Loads into TLS stack
Pros: Centralized, audited, supports rotation. Cons: Adds dependency, requires service identity.
2. Push-based (cert-manager, deployment):
cert-manager issues cert → Writes to K8s Secret → Mounted into Pod filesystem
Pros: Matches K8s patterns, automatic. Cons: Keys at rest in etcd, requires Secret encryption.
3. Sidecar injection:
Service Mesh injects sidecar → Sidecar handles TLS → App uses plaintext locally
Pros: Zero application changes, automatic mTLS. Cons: Service mesh complexity, sidecar overhead.
Configure services to detect certificate updates and reload without restart. NGINX: 'nginx -s reload'. HAProxy: socket commands. Many Go/Java apps need explicit implementation. This enables zero-downtime certificate rotation. Without hot reloading, you need rolling restarts on every renewal.
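For applications that need an explicit implementation, one simple approach is to compare file modification times and rebuild the TLS configuration only when the certificate or key changes. The sketch below assumes the files are replaced on disk by whatever renews them. `ReloadingCert` and `load_server_context` are hypothetical names for illustration, and the paths in the usage comment are placeholders.

```python
import os
import ssl
from typing import Callable, Generic, Optional, Tuple, TypeVar

T = TypeVar("T")


class ReloadingCert(Generic[T]):
    """Cache a value built from a cert/key pair and rebuild it when either file changes."""

    def __init__(self, cert_path: str, key_path: str,
                 loader: Callable[[str, str], T]) -> None:
        self.cert_path = cert_path
        self.key_path = key_path
        self._loader = loader
        self._stamp: Optional[Tuple[int, int]] = None
        self._value: Optional[T] = None

    def _current_stamp(self) -> Tuple[int, int]:
        # Modification times in nanoseconds; os.stat follows symlinks.
        return (os.stat(self.cert_path).st_mtime_ns,
                os.stat(self.key_path).st_mtime_ns)

    def get(self) -> T:
        stamp = self._current_stamp()
        if stamp != self._stamp:
            # If the loader raises, the previous value and stamp are kept,
            # so the next call retries the load.
            self._value = self._loader(self.cert_path, self.key_path)
            self._stamp = stamp
        return self._value


def load_server_context(cert_path: str, key_path: str) -> ssl.SSLContext:
    """Build a fresh server-side TLS context from the files on disk."""
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    context.load_cert_chain(cert_path, key_path)
    return context


# Usage (placeholder paths): call get() when accepting each new connection.
#   certs = ReloadingCert("/etc/tls/tls.crt", "/etc/tls/tls.key", load_server_context)
#   context = certs.get()
```

Existing connections keep the context they were created with, so only new connections pick up the renewed certificate.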
We've covered the operational challenge of certificate management at scale—from issuance to expiration, and all the automation needed to keep systems running. Let's consolidate the key takeaways:
What's next:
With certificate management covered, the final page in this module examines mTLS for Services—mutual TLS authentication where both client and server present certificates. We'll explore how to implement zero-trust service-to-service authentication using mTLS in microservices architectures.
You now understand the certificate lifecycle, automated management with ACME, private CA infrastructure, revocation challenges, and monitoring requirements. These operational practices ensure TLS remains a foundation of security rather than a source of incidents. Next, we'll explore mutual TLS for service authentication.