If metrics are the vital signs of your system—heart rate, temperature, blood pressure—then logs are the patient's medical history. They tell the story of what happened, when, why, and in what context. Metrics tell you something is wrong; logs tell you what went wrong.
Consider this scenario: Your latency dashboard shows a sudden spike. Requests that normally complete in 50ms are now taking 2 seconds. The metric alerts you to the problem, but it doesn't explain the cause. Is it a database query? A third-party API timeout? A lock contention issue? A bad deployment?
Logs answer these questions. They capture the individual events—the specific requests, errors, and state changes—that together tell the complete story of what your system is doing.
In this page, we'll examine logs as the second pillar of observability: what they are, how to structure them for maximum utility, how to aggregate and query them at scale, and the design patterns that make logging effective in production systems.
By the end of this page, you will understand logs not as an afterthought but as a first-class observability signal. You'll learn the distinction between unstructured and structured logging, master log level semantics, understand aggregation architectures, and internalize the practices that make logs useful during incident response.
At their core, logs are timestamped records of discrete events that occur within a system. Each log entry (or log line, log event, log message) represents something that happened at a specific point in time: a request arriving, a query completing, an error being thrown, a configuration being reloaded.
Logs differ fundamentally from metrics in their granularity and nature:
| Aspect | Metrics | Logs |
|---|---|---|
| Nature | Numerical measurements | Event records (text/structured) |
| Granularity | Aggregated (sampled at intervals) | Individual events |
| Cardinality | Bounded by labels | Unbounded (any string) |
| Query pattern | Mathematical operations, aggregations | Search, filtering, pattern matching |
| Storage cost | Lower (numerical, compressed) | Higher (text, verbose) |
| Use case | Alerting, dashboards, trends | Debugging, audit trails, forensics |
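To make the contrast concrete, here is a minimal sketch (not from the original text; it assumes `prom-client` and `pino`, and the names like `httpRequestsTotal` and `recordRequest` are ours) of the same "request completed" event recorded both ways: the metric only moves an aggregate counter with bounded labels, while the log captures the individual event with its full, high-cardinality context.

```typescript
// Sketch: the same event as a metric (aggregate) and as a log (individual record)
import client from 'prom-client';
import pino from 'pino';

const logger = pino();

// Metric: a bounded set of label values, aggregated over time
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

function recordRequest(method: string, route: string, status: number, userId: string, durationMs: number) {
  // Metric side: only the counter increments; userId and durationMs are not recorded here
  httpRequestsTotal.inc({ method, route, status: String(status) });

  // Log side: one record per event, carrying unbounded context for later search and filtering
  logger.info({ method, route, status, user_id: userId, duration_ms: durationMs, msg: 'Request completed' });
}
```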
The anatomy of a log entry:
A well-formed log entry contains several components:
```text
# Unstructured log (traditional format)
2024-01-08 10:23:45.123 INFO [checkout-service] User 12345 placed order ORD-789 for $299.99

# Semi-structured log (key-value pairs)
2024-01-08T10:23:45.123Z level=INFO service=checkout user_id=12345 order_id=ORD-789 amount=299.99 msg="Order placed successfully"

# Structured log (JSON)
{
  "timestamp": "2024-01-08T10:23:45.123Z",
  "level": "INFO",
  "service": "checkout-service",
  "host": "checkout-pod-abc123",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "Order placed successfully",
  "user_id": "12345",
  "order_id": "ORD-789",
  "amount": 299.99,
  "currency": "USD",
  "payment_method": "credit_card",
  "duration_ms": 127
}
```

Always use structured logging (JSON or another machine-parseable format). Unstructured text logs are easy to write but nearly impossible to query effectively at scale: "find all errors for user 12345" becomes a regex nightmare with unstructured logs, but a simple filter with structured logs.
Log levels are not arbitrary classifications—they carry specific semantic meaning that guides both what to log and how urgently to investigate. Misusing log levels creates noise (everything is ERROR) or blindness (critical issues are INFO).
Here's the standard hierarchy and when to use each:

| Level | When to use |
|---|---|
| FATAL | The application cannot continue and is about to exit |
| ERROR | An operation failed unexpectedly and needs investigation |
| WARN | Something unexpected or degraded happened, but the system recovered |
| INFO | Normal operations worth recording (startup, requests, significant state changes) |
| DEBUG | Detailed diagnostic information, typically disabled in production |
| TRACE | Extremely fine-grained detail, enabled only for targeted debugging |
A common mistake is logging routine failures as ERROR. If a 'user not found' condition in an API is logged as ERROR, you'll see thousands of 'errors' that are actually normal behavior. Reserve ERROR for true failures—things that indicate bugs or system problems, not expected outcomes. Use appropriate levels to maintain signal clarity.
```typescript
// Correct log level usage examples

// FATAL: Application cannot continue
logger.fatal({
  err: dbConnectionError,
  msg: "Cannot connect to database after 10 retries, shutting down"
});
process.exit(1);

// ERROR: Operation failed unexpectedly
try {
  await processPayment(order);
} catch (err) {
  logger.error({
    err,
    orderId: order.id,
    userId: order.userId,
    msg: "Payment processing failed"
  });
  throw err;
}

// WARN: Recovered from issue, but noteworthy
const result = await fetchWithRetry(url, { retries: 3 });
if (result.attemptCount > 1) {
  logger.warn({
    url,
    attempts: result.attemptCount,
    msg: "Succeeded after retries"
  });
}

// INFO: Normal operation worth recording
logger.info({
  userId: user.id,
  action: "login",
  method: "oauth",
  provider: "google",
  msg: "User authenticated successfully"
});

// DEBUG: Detailed diagnostic info (usually disabled in prod)
logger.debug({
  query: sql,
  params,
  duration_ms: queryDuration,
  rows_returned: result.length,
  msg: "Database query executed"
});

// NOT an error - expected business logic outcome
const user = await findUser(userId);
if (!user) {
  // WRONG: logger.error("User not found");
  // CORRECT:
  logger.debug({ userId, msg: "User lookup returned no results" });
  return res.status(404).json({ error: "User not found" });
}
```

Structured logging—emitting logs as machine-parseable data structures (typically JSON)—fundamentally changes what you can do with your logs. Let's understand why it matters and how to implement it effectively.
Why structured logging wins:

- Queryable fields: filtering on `user_id` or `order_id` is an exact match against a field, not a regex against free text.
- Machine-friendly: aggregation pipelines, alerting rules, and dashboards can consume fields directly.
- Consistent context: the same field names appear across services, which makes cross-service correlation possible.
- Evolvable: adding a new field doesn't break downstream consumers the way changing a text format does.
Essential fields for structured logs:
Every log event should include a core set of fields that provide context without requiring the message to be parsed:
| Field | Purpose | Example |
|---|---|---|
| timestamp | When the event occurred (ISO 8601) | 2024-01-08T10:23:45.123Z |
| level | Severity classification | INFO, ERROR, WARN |
| message | Human-readable event description | Order placed successfully |
| service | Name of the emitting service | checkout-service |
| host | Hostname or pod name | checkout-pod-abc123 |
| trace_id | Distributed trace identifier | 4bf92f3577b34da6a3ce929d0e0e4736 |
| span_id | Span within the trace | 00f067aa0ba902b7 |
| environment | Deployment environment | production, staging |
| version | Application version/build | v1.2.3-abc123 |
| request_id | Unique request identifier | req_xyz789 |
```typescript
// TypeScript example: Setting up structured logging with Pino

import pino from 'pino';

// Create a base logger with standard fields
const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  // Format options for structured output
  formatters: {
    level: (label) => ({ level: label }),
    bindings: (bindings) => ({
      host: bindings.hostname,
      pid: bindings.pid,
    }),
  },
  // Add standard context to every log
  base: {
    service: 'checkout-service',
    version: process.env.APP_VERSION || 'unknown',
    environment: process.env.NODE_ENV || 'development',
  },
  // ISO timestamp format
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Create child loggers with request-specific context
function createRequestLogger(req) {
  return logger.child({
    request_id: req.id,
    trace_id: req.headers['x-trace-id'],
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  });
}

// Usage in request handler
app.use((req, res, next) => {
  req.log = createRequestLogger(req);
  const start = Date.now();

  res.on('finish', () => {
    req.log.info({
      status: res.statusCode,
      duration_ms: Date.now() - start,
      msg: 'Request completed'
    });
  });

  next();
});

// In application code - all logs automatically include request context
async function processOrder(req, order) {
  req.log.info({
    order_id: order.id,
    amount: order.total,
    items: order.items.length,
    msg: 'Processing order'
  });

  // ... processing logic

  req.log.info({ order_id: order.id, msg: 'Order completed successfully' });
}
```

Use child loggers to propagate context without repeating fields. Create a child logger at request entry with request_id, user_id, and trace_id, then pass it through your code. Every log automatically includes this context without explicit repetition.
In distributed systems, logs are born across hundreds or thousands of containers, VMs, and serverless functions. Making sense of them requires centralized log aggregation: collecting, processing, storing, and querying logs from all sources in one place.
The standard log aggregation pipeline:

1. Collection: lightweight agents (Fluent Bit, Fluentd, Promtail) tail container log files or capture stdout on every host.
2. Processing: logs are parsed, enriched with metadata (Kubernetes labels, cluster, environment), filtered, and redacted.
3. Storage and indexing: events land in a backend such as Elasticsearch or Loki, indexed for retrieval.
4. Query and visualization: engineers search, filter, and build dashboards in tools like Kibana or Grafana.
Popular log aggregation stacks:
| Stack | Components | Best For | Trade-offs |
|---|---|---|---|
| ELK/Elastic | Elasticsearch, Logstash, Kibana | Full-text search, complex queries | Resource-intensive, operational complexity |
| EFK | Elasticsearch, Fluent Bit/Fluentd, Kibana | Kubernetes-native collection | Similar to ELK, lighter collection |
| Grafana Loki | Loki, Promtail, Grafana | Label-based indexing, cost-efficient | Less full-text capability, simpler queries |
| AWS CloudWatch | CloudWatch Logs, Insights | AWS-native, serverless-friendly | AWS lock-in, query limitations |
| Datadog/Splunk | Managed SaaS platforms | Ease of operation, features | Cost at scale, vendor lock-in |
Grafana Loki takes a different approach than Elasticsearch. Instead of indexing every word in every log, Loki indexes only the labels (like Prometheus). Log content is stored as compressed chunks. This dramatically reduces storage and operational cost, at the expense of less powerful full-text search. For most operational uses, label-based filtering (service, pod, level) is sufficient.
```ini
# Fluent Bit configuration for Kubernetes log collection
# This collects container logs and enriches them with K8s metadata

[SERVICE]
    Flush             1
    Log_Level         info
    Daemon            off
    Parsers_File      parsers.conf

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Parser            docker
    Tag               kube.*
    Refresh_Interval  5
    Mem_Buf_Limit     50MB
    Skip_Long_Lines   On

[FILTER]
    Name                 kubernetes
    Match                kube.*
    Kube_URL             https://kubernetes.default.svc:443
    Kube_CA_File         /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File      /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log            On
    Merge_Log_Key        log_processed
    K8S-Logging.Parser   On
    K8S-Logging.Exclude  On

[FILTER]
    Name    modify
    Match   kube.*
    # Add cluster identifier
    Add     cluster production-us-east
    # Remove unnecessary fields to reduce size
    Remove  $.kubernetes.pod_id
    Remove  $.kubernetes.docker_id

[OUTPUT]
    Name                   loki
    Match                  kube.*
    Host                   loki.monitoring.svc
    Port                   3100
    Labels                 job=kubernetes, cluster=${cluster}, namespace=${kubernetes['namespace_name']}, pod=${kubernetes['pod_name']}, container=${kubernetes['container_name']}
    Line_Format            json
    Auto_Kubernetes_Labels off
```

Logs are expensive. Unlike metrics (small numerical values), logs contain rich textual data—often kilobytes per event. At scale, log storage and processing costs can dwarf the cost of the application infrastructure itself.
The cost equation:
A mid-sized deployment might generate 100 GB of logs per day. With 30-day retention, at $0.50/GB/month for storage and $0.10/GB for ingestion, that's roughly $1,500/month for storage plus $300/month for ingestion—about $1,800/month before query costs, and the number multiplies with longer retention, replication, and additional environments. Companies routinely find logging is 20-40% of their cloud bill.
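As a sanity check, here is that arithmetic as a small script; the volume, retention period, and prices are illustrative assumptions, not vendor quotes.

```typescript
// Back-of-the-envelope log cost estimate (illustrative numbers only)
const gbPerDay = 100;                // daily log volume
const retentionDays = 30;            // how long logs stay in hot storage
const storagePricePerGbMonth = 0.5;  // $/GB-month (assumed)
const ingestPricePerGb = 0.1;        // $/GB ingested (assumed)

const storedGb = gbPerDay * retentionDays;                  // ~3,000 GB resident at steady state
const storagePerMonth = storedGb * storagePricePerGbMonth;  // ~$1,500
const ingestPerMonth = gbPerDay * 30 * ingestPricePerGb;    // ~$300

console.log(
  `storage: $${storagePerMonth}/mo, ingestion: $${ingestPerMonth}/mo, ` +
  `total: $${storagePerMonth + ingestPerMonth}/mo (before query costs)`
);
```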
Cost control strategies: tier your retention (table below), sample high-volume events (example further down), drop known-noisy or low-value logs at the collector, and keep DEBUG/TRACE disabled in production.
| Tier | Retention | Storage | Access | Use Case |
|---|---|---|---|---|
| Hot | 7 days | Elasticsearch/Loki SSD | Interactive queries | Active debugging, incident response |
| Warm | 30 days | Elasticsearch/Loki HDD | Slower queries | Recent investigation, trend analysis |
| Cold | 1+ years | S3/Blob storage | Batch retrieval | Compliance, audit, forensics |
| Archive | 7+ years | Glacier/Archive tier | Rare retrieval | Legal/regulatory requirements |
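How these tiers are implemented depends on the backend. As one hedged example, an Elasticsearch-based stack can express hot-to-warm-to-delete transitions as an index lifecycle management (ILM) policy; the sketch below assumes rollover-managed log indices, and the policy name, thresholds, and endpoint URL are our own placeholders. Cold and archive tiers would typically live in object storage (S3/Glacier) outside the cluster.

```typescript
// Sketch: registering an Elasticsearch ILM policy for tiered log retention
// (policy name, thresholds, and cluster URL are assumptions; auth is omitted)
const policy = {
  policy: {
    phases: {
      hot: {
        actions: {
          // Roll to a new index daily or at 50 GB, whichever comes first
          rollover: { max_age: '1d', max_size: '50gb' },
        },
      },
      warm: {
        min_age: '7d',
        actions: {
          forcemerge: { max_num_segments: 1 },  // compact for cheaper, slower queries
          set_priority: { priority: 50 },       // recover after hot indices
        },
      },
      delete: {
        min_age: '30d',
        actions: { delete: {} },  // older data survives only in object-storage exports
      },
    },
  },
};

await fetch('https://elasticsearch.internal:9200/_ilm/policy/logs-retention', {
  method: 'PUT',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(policy),
});
```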
```typescript
// Implementing log sampling for high-volume events

import pino from 'pino';
import crypto from 'crypto';

// Create logger with sampling configuration
const logger = pino({
  level: 'info',
});

// Sampling configuration
const SAMPLE_RATES = {
  'request.success': 0.1,  // Log 10% of successful requests
  'request.error': 1.0,    // Log 100% of errors
  'health.check': 0.01,    // Log 1% of health checks
  'cache.hit': 0.05,       // Log 5% of cache hits
  'cache.miss': 0.2,       // Log 20% of cache misses
};

// Deterministic sampling based on request ID
// This ensures the same request is always sampled the same way
// across all services (important for distributed tracing)
function shouldSample(eventType: string, requestId: string): boolean {
  const rate = SAMPLE_RATES[eventType] ?? 1.0;
  if (rate >= 1.0) return true;
  if (rate <= 0) return false;

  // Hash the request ID to get a consistent random value
  const hash = crypto.createHash('md5').update(requestId).digest();
  const value = hash.readUInt32BE(0) / 0xFFFFFFFF;
  return value < rate;
}

// Wrapper for sampled logging
function logSampled(level: string, eventType: string, data: any) {
  const requestId = data.request_id || crypto.randomUUID();

  if (shouldSample(eventType, requestId)) {
    logger[level]({
      ...data,
      event_type: eventType,
      sampled: true,
      sample_rate: SAMPLE_RATES[eventType] ?? 1.0,
    });
  }
}

// Usage
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const eventType = res.statusCode >= 400 ? 'request.error' : 'request.success';
    logSampled('info', eventType, {
      request_id: req.id,
      method: req.method,
      path: req.path,
      status: res.statusCode,
      duration_ms: Date.now() - start,
      msg: 'Request completed',
    });
  });
  next();
});
```

In a monolith, following a request's path through logs is straightforward—everything is in one process. In microservices, a single user action might touch 10 different services, each generating independent logs. Without correlation, you're looking at 10 disconnected log files with no way to connect them.
Correlation IDs solve this problem.
A correlation ID (also called request ID or trace ID) is a unique identifier generated at the system's entry point (API gateway, load balancer, or first service) and propagated through every subsequent service call. Every log from every service includes this ID, enabling you to query 'show me all logs for request abc123' and see the complete picture.
```typescript
// Correlation ID propagation across services

// ========== API Gateway / Entry Point ==========
app.use((req, res, next) => {
  // Generate or accept correlation ID from incoming request
  const correlationId =
    req.headers['x-correlation-id'] ||
    req.headers['x-request-id'] ||
    crypto.randomUUID();

  // Make it available throughout this request
  req.correlationId = correlationId;

  // Include in response for client correlation
  res.setHeader('x-correlation-id', correlationId);

  // Create logger with correlation context
  req.log = logger.child({
    correlation_id: correlationId,
    trace_id: req.headers['traceparent']?.split('-')[1] || correlationId,
  });

  next();
});

// ========== When calling downstream services ==========
async function callPaymentService(order, correlationId) {
  // W3C traceparent requires a 32-character lowercase hex trace ID,
  // so strip the dashes from the UUID before embedding it
  const traceId = correlationId.replace(/-/g, '');
  const spanId = crypto.randomBytes(8).toString('hex');

  const response = await fetch('http://payment-service/process', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'x-correlation-id': correlationId,            // Propagate!
      'traceparent': `00-${traceId}-${spanId}-01`,  // W3C trace context
    },
    body: JSON.stringify(order),
  });

  return response.json();
}

// ========== Downstream service receives and continues ==========
// (In the payment service)
app.use((req, res, next) => {
  // Extract correlation ID from upstream
  const correlationId = req.headers['x-correlation-id'];

  req.log = logger.child({
    correlation_id: correlationId,
    service: 'payment-service',
    upstream: req.headers['x-caller-service'],
  });

  req.log.info({ msg: 'Request received' });
  next();
});
```

Context propagation patterns:
The W3C Trace Context specification defines a standard way to propagate trace information via HTTP headers ('traceparent' and 'tracestate'). Using this standard ensures compatibility with OpenTelemetry and other distributed tracing systems. A single trace ID flows through logs and traces, connecting all observability signals.
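For reference, a traceparent header carries four dash-separated fields: a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and trace flags. The small parsing sketch below is ours (the helper name and interface are not from any library) and shows how a service can lift the trace ID into its logger.

```typescript
// Parse a W3C traceparent header, e.g.
// "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
interface TraceContext {
  version: string;
  traceId: string;   // 32 lowercase hex chars, shared by the whole request
  parentId: string;  // 16 lowercase hex chars, the caller's span
  sampled: boolean;  // lowest bit of the trace-flags field
}

function parseTraceparent(header: string | undefined): TraceContext | null {
  if (!header) return null;
  const match = header.match(/^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/);
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 0x01 };
}

// Usage sketch: attach the parsed trace ID to the request logger so logs and traces share an ID
// const ctx = parseTraceparent(req.headers['traceparent'] as string | undefined);
// req.log = logger.child({ trace_id: ctx?.traceId, parent_span_id: ctx?.parentId });
```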
Querying correlated logs:
With proper correlation, debugging becomes dramatically easier:
```text
# Elasticsearch/Kibana query - All logs for a specific request
correlation_id: "abc-123-def-456"

# Filter by service within that request
correlation_id: "abc-123-def-456" AND service: "payment-service"

# Loki LogQL - All logs for a request across all services
{cluster="production"} |= "abc-123-def-456"

# With JSON parsing in Loki
{cluster="production"} | json | correlation_id = "abc-123-def-456"

# Show only errors for this request
{cluster="production"} | json | correlation_id = "abc-123-def-456" | level = "ERROR"

# Timeline reconstruction - seeing the flow
{cluster="production"} | json | correlation_id = "abc-123-def-456" | line_format "{{.timestamp}} [{{.service}}] {{.message}}"
```

Logs are a double-edged sword. They provide invaluable insight but can also become a liability. Logs often contain sensitive information—and that information can persist far longer than you intend.
What accidentally ends up in logs:

- Credentials: passwords, API keys, bearer tokens, session cookies
- Authorization headers and JWTs captured by full request/response logging
- Personal data (PII): email addresses, names, phone numbers, SSNs
- Payment data: credit card numbers in order payloads or error dumps
- Internal details useful to attackers: stack traces, connection strings, internal hostnames
Numerous breaches have been traced to logs. In one famous case, a company logged full HTTP requests including Authorization headers. Those logs were stored in an insufficiently protected log aggregation system. Attackers accessed the logs and extracted tokens to impersonate users. The fix was easy; the damage was done.
Log security best practices:
```typescript
// Implementing log redaction

import pino from 'pino';

const REDACT_PATTERNS = [
  // Credit card numbers
  { pattern: /\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/g, replacement: '[REDACTED_CC]' },
  // Email addresses
  { pattern: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/g, replacement: '[REDACTED_EMAIL]' },
  // AWS keys
  { pattern: /AKIA[0-9A-Z]{16}/g, replacement: '[REDACTED_AWS_KEY]' },
  // JWT tokens
  { pattern: /eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*/g, replacement: '[REDACTED_JWT]' },
  // Authorization headers
  { pattern: /Bearer [a-zA-Z0-9._-]+/g, replacement: 'Bearer [REDACTED]' },
];

const SENSITIVE_FIELDS = ['password', 'secret', 'token', 'apiKey', 'authorization', 'creditCard', 'ssn'];

function redactSensitiveData(obj: any): any {
  if (typeof obj === 'string') {
    let result = obj;
    for (const { pattern, replacement } of REDACT_PATTERNS) {
      result = result.replace(pattern, replacement);
    }
    return result;
  }

  if (Array.isArray(obj)) {
    return obj.map(redactSensitiveData);
  }

  if (obj && typeof obj === 'object') {
    const result: any = {};
    for (const [key, value] of Object.entries(obj)) {
      // Completely redact known sensitive fields
      if (SENSITIVE_FIELDS.some(f => key.toLowerCase().includes(f.toLowerCase()))) {
        result[key] = '[REDACTED]';
      } else {
        result[key] = redactSensitiveData(value);
      }
    }
    return result;
  }

  return obj;
}

// Pino redaction hook
const logger = pino({
  level: 'info',
  hooks: {
    logMethod(inputArgs, method) {
      if (inputArgs.length >= 1) {
        inputArgs[0] = redactSensitiveData(inputArgs[0]);
      }
      return method.apply(this, inputArgs);
    },
  },
  // Built-in redaction for known paths
  redact: {
    paths: ['req.headers.authorization', 'req.headers.cookie', '*.password', '*.secret'],
    censor: '[REDACTED]',
  },
});
```

We've explored logs comprehensively—from their fundamental nature to practical implementation patterns. Let's consolidate the key takeaways:

- Emit structured (JSON) logs with a consistent set of core fields: timestamp, level, service, trace and correlation IDs.
- Use log levels semantically; reserve ERROR for genuine failures so the signal stays clean.
- Centralize logs through an aggregation pipeline, and control cost with sampling and tiered retention.
- Propagate a correlation ID (ideally the W3C trace ID) through every service so a single request can be reconstructed end to end.
- Treat logs as sensitive data: redact secrets and PII before they leave the process, and restrict access to the log store.
What's next:
Metrics tell you what is happening—quantitative measurements over time. Logs tell you what happened—the events and context behind the numbers. But when a request flows through 10 services, neither metrics nor logs alone show you the complete path.
Next, we'll explore Traces—the third pillar of observability. Distributed tracing captures the full journey of a request through your system, connecting the dots between services and revealing where time is spent.
You now understand logs as the narrative companion to metrics. Structured logging with proper correlation and security practices enables effective debugging and incident response. Next, we'll see how traces complete the observability picture by showing request paths through distributed systems.