When systems fail—and they inevitably will—the quality of your error logs determines how quickly you can understand, diagnose, and resolve the issue. A well-logged error tells a complete story: what happened, where, when, and in what context. A poorly logged error is a dead end that forces engineers into speculation and guesswork.
Error logging is not the same as error handling. Handling is about recovery and user experience; logging is about observability and debugging. While users see carefully crafted messages hiding technical details, your logs must capture those very details with precision and completeness.
This page examines error logging from first principles: what information to capture, how to structure it, when to log at which levels, how to handle sensitive data, and how to make logs actionable for the engineers who will investigate failures at 3 AM.
By the end of this page, you will understand how to design error logging that provides complete diagnostic context, categorize errors by severity, implement structured logging for machine parsing, handle sensitive data safely, correlate errors across distributed systems, and create logs that accelerate incident response rather than hinder it.
Before discussing techniques, we must understand who reads error logs and what they need. Error logs serve multiple audiences with different requirements:
1. On-Call Engineers During Incidents

When production breaks, on-call engineers are your primary log consumers. They need to quickly understand:

- What failed, and in which service and deployment
- How many users are affected, and whether the failure is spreading
- Whether the error is new or a known, recurring issue
- Where to look next: correlation IDs, traces, and related log entries

2. Developers Debugging Issues

After the immediate incident, developers investigate root causes. They need:

- Full stack traces and nested cause chains
- The (sanitized) input that triggered the failure
- Enough surrounding context to reproduce the issue

3. Security and Compliance Teams

Security personnel review logs for:

- Authentication and authorization failures
- Anomalous access patterns
- Audit trails that must not leak credentials or PII

4. Automated Monitoring Systems

Machine consumers of logs need:

- Consistent, machine-parseable structure (typically JSON)
- Stable field names and severity levels that map cleanly to alerting and aggregation rules
Poor error logging extends incident duration. Each missing piece of information requires another round of investigation, deployment of enhanced logging, and waiting for the issue to recur. A single well-logged error can resolve an incident in minutes; a poorly logged error can extend it to hours or days.
A complete error log entry captures multiple dimensions of context. Think of it as answering the journalist's questions: Who, What, When, Where, Why, and How.
Essential Fields for Every Error Log
| Category | Fields | Purpose |
|---|---|---|
| Identity | correlationId, requestId, traceId, spanId | Link related log entries across time and services |
| Timing | timestamp, duration, operationStartTime | Establish sequence and identify timeouts |
| Location | serviceName, serviceVersion, hostName, environment | Identify which deployment experienced the failure |
| Operation | operationName, httpMethod, httpPath, functionName | What action was being performed |
| User Context | userId (hashed), sessionId (hashed), userRole, tenantId | Who was affected (safely anonymized) |
| Error Details | errorType, errorCode, errorMessage, stackTrace | The actual failure information |
| Input Context | inputParameters (sanitized), requestBody (partial) | What triggered the operation |
| System State | memoryUsage, cpuLoad, connectionPoolStatus | Resource conditions at failure time |
```typescript
/**
 * Comprehensive error logging with complete context.
 * Demonstrates structured logging that provides all information
 * needed for debugging without additional log correlation.
 */
interface ErrorLogEntry {
  // ============================================
  // IDENTITY: Correlation across time and services
  // ============================================
  correlationId: string;  // Unique ID for entire request flow
  traceId: string;        // Distributed tracing ID
  spanId: string;         // Current operation span
  parentSpanId?: string;  // Parent span for nested operations

  // ============================================
  // TIMING: When and how long
  // ============================================
  timestamp: string;      // ISO 8601 format
  operationDurationMs: number;
  operationStartTime: string;

  // ============================================
  // LOCATION: Where in the system
  // ============================================
  service: {
    name: string;
    version: string;
    environment: 'production' | 'staging' | 'development';
    instance: string;     // Pod/container/machine ID
    region?: string;
  };

  // ============================================
  // OPERATION: What was being done
  // ============================================
  operation: {
    name: string;         // 'CreateOrder', 'ProcessPayment'
    type: 'http' | 'grpc' | 'async' | 'scheduled' | 'internal';
    httpMethod?: string;
    httpPath?: string;
    httpStatusCode?: number;
    queueName?: string;
  };

  // ============================================
  // USER CONTEXT: Who was affected (safely)
  // ============================================
  user?: {
    idHash: string;       // Hashed user ID for privacy
    role: string;
    tenantId?: string;
    sessionIdHash?: string;
  };

  // ============================================
  // ERROR DETAILS: The actual failure
  // ============================================
  error: {
    type: string;         // Exception class name
    code: string;         // Application error code
    message: string;      // Error message
    category: 'validation' | 'authentication' | 'authorization' |
              'business_rule' | 'external_service' | 'database' |
              'infrastructure' | 'unknown';
    isRetryable: boolean;
    stackTrace?: string;  // Full stack trace
    causedBy?: {          // Nested cause chain
      type: string;
      message: string;
      stackTrace?: string;
    };
  };

  // ============================================
  // INPUT CONTEXT: What triggered this
  // ============================================
  input?: {
    // Sanitized/partial input for debugging
    // NEVER include passwords, tokens, or PII
    sanitizedPayload?: Record<string, unknown>;
    queryParameters?: Record<string, string>;
    relevantHeaders?: Record<string, string>;
  };

  // ============================================
  // SYSTEM STATE: Resource conditions
  // ============================================
  systemState?: {
    memoryUsageMb: number;
    memoryLimitMb: number;
    cpuPercent: number;
    activeConnections: number;
    pendingRequests: number;
  };

  // ============================================
  // METADATA: Classification and routing
  // ============================================
  level: 'error' | 'warn' | 'fatal';
  tags: string[];                         // For filtering: ['payments', 'critical-path']
  alertTier?: 'p1' | 'p2' | 'p3' | 'p4';  // Alert priority
}

/**
 * Error logger that constructs complete, structured log entries.
 */
class StructuredErrorLogger {
  constructor(
    private readonly serviceName: string,
    private readonly serviceVersion: string,
    private readonly environment: string,
    private readonly logTarget: LogTarget
  ) {}

  /**
   * Log an error with complete context.
   */
  error(
    error: Error,
    operation: OperationContext,
    additionalContext?: Partial<ErrorLogEntry>
  ): void {
    const entry = this.buildLogEntry('error', error, operation, additionalContext);
    this.logTarget.write(entry);
  }

  /**
   * Log a warning for recoverable issues.
   */
  warn(
    error: Error,
    operation: OperationContext,
    additionalContext?: Partial<ErrorLogEntry>
  ): void {
    const entry = this.buildLogEntry('warn', error, operation, additionalContext);
    this.logTarget.write(entry);
  }

  /**
   * Log a fatal error requiring immediate attention.
   */
  fatal(
    error: Error,
    operation: OperationContext,
    additionalContext?: Partial<ErrorLogEntry>
  ): void {
    const entry = this.buildLogEntry('fatal', error, operation, additionalContext);
    entry.alertTier = 'p1'; // Fatal errors always page on-call
    this.logTarget.write(entry);
  }

  private buildLogEntry(
    level: 'error' | 'warn' | 'fatal',
    error: Error,
    operation: OperationContext,
    additionalContext?: Partial<ErrorLogEntry>
  ): ErrorLogEntry {
    const now = new Date();

    return {
      // Identity
      correlationId: operation.correlationId,
      traceId: operation.traceId,
      spanId: operation.spanId,
      parentSpanId: operation.parentSpanId,

      // Timing
      timestamp: now.toISOString(),
      operationDurationMs: now.getTime() - operation.startTime.getTime(),
      operationStartTime: operation.startTime.toISOString(),

      // Location
      service: {
        name: this.serviceName,
        version: this.serviceVersion,
        environment: this.environment as any,
        instance: process.env.HOSTNAME || 'unknown',
        region: process.env.AWS_REGION
      },

      // Operation
      operation: {
        name: operation.name,
        type: operation.type,
        httpMethod: operation.httpMethod,
        httpPath: operation.httpPath,
        httpStatusCode: operation.httpStatusCode
      },

      // Error
      error: {
        type: error.constructor.name,
        code: this.extractErrorCode(error),
        message: error.message,
        category: this.categorizeError(error),
        isRetryable: this.isRetryable(error),
        stackTrace: error.stack,
        causedBy: this.extractCause(error)
      },

      // Metadata
      level,
      tags: this.generateTags(operation, error),
      alertTier: this.determineAlertTier(level, error),

      // Merge additional context
      ...additionalContext
    };
  }

  private extractErrorCode(error: Error): string {
    if ('errorCode' in error) return (error as any).errorCode;
    if ('code' in error) return String((error as any).code);
    return 'UNKNOWN';
  }

  private categorizeError(error: Error): ErrorLogEntry['error']['category'] {
    // Categorize based on error type hierarchy
    if (error instanceof ValidationException) return 'validation';
    if (error instanceof AuthenticationException) return 'authentication';
    if (error instanceof AuthorizationException) return 'authorization';
    if (error instanceof DomainException) return 'business_rule';
    if (error instanceof ExternalServiceException) return 'external_service';
    if (error instanceof DatabaseException) return 'database';
    if (error instanceof InfrastructureException) return 'infrastructure';
    return 'unknown';
  }

  private isRetryable(error: Error): boolean {
    if ('isRetryable' in error) return Boolean((error as any).isRetryable);
    // Default heuristics
    if (error instanceof DatabaseConnectionException) return true;
    if (error instanceof TimeoutException) return true;
    if (error instanceof ValidationException) return false;
    return false;
  }

  private extractCause(error: Error): { type: string; message: string; stackTrace?: string } | undefined {
    if ('cause' in error && error.cause instanceof Error) {
      return {
        type: error.cause.constructor.name,
        message: error.cause.message,
        stackTrace: error.cause.stack
      };
    }
    return undefined;
  }

  private generateTags(operation: OperationContext, error: Error): string[] {
    const tags: string[] = [];
    if (operation.isCriticalPath) tags.push('critical-path');
    if (operation.businessDomain) tags.push(operation.businessDomain);
    if (error instanceof PaymentException) tags.push('payments');
    return tags;
  }

  private determineAlertTier(level: string, error: Error): 'p1' | 'p2' | 'p3' | 'p4' | undefined {
    if (level === 'fatal') return 'p1';
    if (error instanceof DatabaseConnectionException) return 'p2';
    if (error instanceof ExternalServiceException) return 'p2';
    if (error instanceof BusinessCriticalException) return 'p2';
    return undefined; // Let alerting rules decide
  }
}
```

Correct log level assignment is crucial for effective alerting and triage. The difference between WARN and ERROR, or ERROR and FATAL, determines whether an on-call engineer is paged at 3 AM or sleeps through the night.
The Standard Log Levels
While terminology varies between frameworks, the concepts are universal:
| Level | When to Use | Alerting Implication | Examples |
|---|---|---|---|
| FATAL / CRITICAL | System cannot continue operating; requires immediate human intervention | Page on-call immediately (P1) | Database cluster unreachable, certificate expired, data corruption detected |
| ERROR | Operation failed; user request could not be completed; unexpected condition | Alert within minutes (P2) | Payment processing failed, required external service unavailable, unhandled exception |
| WARN | Something unexpected but handled; degraded operation; potential future problem | Aggregate and alert if threshold exceeded | Retry succeeded after transient failure, deprecated API used, cache miss fallback to database |
| INFO | Not an error—significant business events for audit | No alert; for dashboards and audit | User logged in, order placed, configuration reloaded |
| DEBUG | Not an error—detailed diagnostic information for development | Disabled in production, or heavily sampled | Function entry/exit, variable values, detailed flow |
```typescript
/**
 * Guidelines for choosing the correct log level for errors.
 * Apply these rules consistently across your codebase.
 */

/**
 * FATAL: System-level failures that prevent operation.
 * The process or service cannot function and will likely crash or restart.
 */
function examplesOfFatalErrors() {
  // Database connection pool exhausted and cannot recover
  logger.fatal(new Error('Connection pool exhausted after max retries'), {
    operation: { name: 'PoolInit', type: 'internal' }
  });

  // Configuration is invalid and service cannot start
  logger.fatal(new Error('Invalid configuration: required key "database_url" missing'), {
    operation: { name: 'ConfigLoad', type: 'internal' }
  });

  // Critical security component failed
  logger.fatal(new Error('Encryption key rotation failed - service must halt'), {
    operation: { name: 'KeyRotation', type: 'internal' }
  });
}

/**
 * ERROR: Request-level failures that prevented completing the user's action.
 * The service is still operational, but this specific request failed.
 */
function examplesOfErrors() {
  // User's request failed due to external service
  logger.error(new ExternalServiceException('PaymentGateway', 503), {
    operation: { name: 'ProcessPayment', type: 'http', httpPath: '/api/payments' }
  });

  // Unexpected exception caught at boundary
  logger.error(new Error('Unhandled exception in order processing'), {
    operation: { name: 'CreateOrder', type: 'http', httpPath: '/api/orders' }
  });

  // Business rule violation with impact
  logger.error(new InsufficientFundsException('acc-123', 100, 50), {
    operation: { name: 'Transfer', type: 'http', httpPath: '/api/transfers' }
  });
}

/**
 * WARN: Issues that were handled but indicate problems worth knowing about.
 * The request succeeded (possibly with degradation), but something was off.
 */
function examplesOfWarnings() {
  // Transient failure recovered
  logger.warn(new Error('Redis connection failed, succeeded on retry 2'), {
    operation: { name: 'CacheRead', type: 'internal' },
    retryAttempts: 2
  });

  // Fallback activated
  logger.warn(new Error('Primary config service unavailable, using cached config'), {
    operation: { name: 'ConfigRefresh', type: 'internal' },
    fallback: 'cachedConfig'
  });

  // Deprecated usage detected
  logger.warn(new Error('Deprecated API endpoint called: /v1/users'), {
    operation: { name: 'ListUsers', type: 'http', httpPath: '/v1/users' },
    recommendation: 'Migrate to /v2/users'
  });

  // Resource pressure detected
  logger.warn(new Error('Connection pool at 90% capacity'), {
    operation: { name: 'PoolMonitor', type: 'internal' },
    currentConnections: 45,
    maxConnections: 50
  });
}

/**
 * Decision tree for log level selection.
 */
function determineLogLevel(error: Error, context: {
  wasRequestCompleted: boolean;
  wasRecoverySuccessful: boolean;
  isSystemStillOperational: boolean;
  affectsMultipleUsers: boolean;
}): 'fatal' | 'error' | 'warn' {
  // FATAL: System cannot operate
  if (!context.isSystemStillOperational) {
    return 'fatal';
  }

  // ERROR: Request failed, user not served
  if (!context.wasRequestCompleted) {
    return 'error';
  }

  // WARN: Issue occurred but was handled
  if (!context.wasRecoverySuccessful || context.affectsMultipleUsers) {
    // If affecting many users, consider upgrading to error
    return 'warn';
  }

  return 'warn';
}
```

When in doubt, many developers default to ERROR level. This breeds alert fatigue: on-call engineers become desensitized to ERROR alerts because most don't require action. Reserve ERROR for genuinely failed operations, and use WARN for handled anomalies. An unusually high ERROR rate indicates either real problems or incorrect level assignment—both warrant investigation.
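As a rough sketch of that investigation trigger, the monitor below flags when the ERROR rate in a sliding window climbs well above an expected baseline. The class name, window size, and threshold are illustrative assumptions, not part of any standard library:

```typescript
/**
 * Illustrative sliding-window ERROR-rate monitor (hypothetical; values
 * and the 3x-baseline threshold are assumptions for this example).
 */
class ErrorRateMonitor {
  private timestamps: number[] = [];

  constructor(
    private readonly windowMs = 60_000,      // 1-minute window (assumed)
    private readonly baselinePerMinute = 10  // expected ERROR volume (assumed)
  ) {}

  /** Record one ERROR-level log event. */
  record(now = Date.now()): void {
    this.timestamps.push(now);
    // Drop events that have fallen out of the window
    while (this.timestamps.length > 0 && this.timestamps[0] < now - this.windowMs) {
      this.timestamps.shift();
    }
  }

  /** True when the ERROR rate is far above baseline and warrants investigation. */
  isAnomalous(): boolean {
    return this.timestamps.length > this.baselinePerMinute * 3;
  }
}
```

In practice this check usually lives in your alerting system, for example as a threshold rule over aggregated log counts, rather than in application code.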
Logs are often stored for extended periods, replicated to multiple systems, and accessed by various teams. Sensitive data in logs creates security and compliance risks. You must balance debugging needs against privacy and security requirements.
Categories of Sensitive Data
```typescript
import { createHmac } from 'crypto';

/**
 * Comprehensive sensitive data sanitization for logging.
 * Applies multiple strategies to ensure sensitive data never reaches logs.
 */

/**
 * Fields that should be completely excluded from logs.
 * Lookups use lowercased keys, so the sets are normalized to lowercase.
 */
const FORBIDDEN_FIELDS = new Set([
  'password', 'newPassword', 'confirmPassword', 'oldPassword',
  'creditCard', 'creditCardNumber', 'cardNumber', 'cvv', 'cvc',
  'ssn', 'socialSecurityNumber', 'taxId',
  'apiKey', 'apiSecret', 'secretKey', 'privateKey',
  'accessToken', 'refreshToken', 'bearerToken', 'authToken',
  'pin', 'mfaCode', 'otpCode', 'verificationCode',
  'encryptionKey', 'decryptionKey', 'signingKey'
].map(field => field.toLowerCase()));

/**
 * Fields that should be masked (partial visibility for debugging).
 */
const MASKED_FIELDS = new Set([
  'email', 'emailAddress', 'phone', 'phoneNumber', 'mobile',
  'accountNumber', 'iban'
].map(field => field.toLowerCase()));

/**
 * Fields that should be hashed (pseudonymized but correlatable).
 */
const HASHED_FIELDS = new Set([
  'userId', 'customerId', 'sessionId', 'deviceId'
].map(field => field.toLowerCase()));

class LogSanitizer {
  private hashKey: string;

  constructor(hashKey: string) {
    this.hashKey = hashKey;
  }

  /**
   * Sanitize an object for safe logging.
   * Recursively processes nested objects and arrays.
   */
  sanitize(data: unknown, depth = 0): unknown {
    // Prevent infinite recursion
    if (depth > 10) return '[MAX_DEPTH]';

    if (data === null || data === undefined) {
      return data;
    }

    if (typeof data === 'string') {
      return this.sanitizeString(data);
    }

    if (typeof data !== 'object') {
      return data;
    }

    if (Array.isArray(data)) {
      return data.map(item => this.sanitize(item, depth + 1));
    }

    // Object: process each field
    const sanitized: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(data)) {
      const lowerKey = key.toLowerCase();

      // Completely remove forbidden fields
      if (this.isForbidden(lowerKey)) {
        sanitized[key] = '[REDACTED]';
        continue;
      }

      // Mask sensitive fields
      if (this.shouldMask(lowerKey)) {
        sanitized[key] = this.mask(value);
        continue;
      }

      // Hash pseudonymous identifiers
      if (this.shouldHash(lowerKey)) {
        sanitized[key] = this.hash(value);
        continue;
      }

      // Recursively sanitize nested objects
      sanitized[key] = this.sanitize(value, depth + 1);
    }

    return sanitized;
  }

  private isForbidden(key: string): boolean {
    return FORBIDDEN_FIELDS.has(key) ||
      key.includes('password') ||
      key.includes('secret') ||
      key.includes('token') ||
      (key.includes('key') && key.includes('api'));
  }

  private shouldMask(key: string): boolean {
    return MASKED_FIELDS.has(key) ||
      key.includes('email') ||
      key.includes('phone');
  }

  private shouldHash(key: string): boolean {
    return HASHED_FIELDS.has(key);
  }

  private mask(value: unknown): string {
    if (typeof value !== 'string') return '[MASKED]';
    if (value.length <= 4) return '****';
    // Show first and last characters for identification
    const visibleChars = Math.min(3, Math.floor(value.length / 4));
    return value.slice(0, visibleChars) + '***' + value.slice(-visibleChars);
  }

  private hash(value: unknown): string {
    if (value === null || value === undefined) return '[NULL]';
    const stringValue = String(value);
    // Create consistent hash for correlation
    const hash = createHmac('sha256', this.hashKey)
      .update(stringValue)
      .digest('hex')
      .slice(0, 16);
    return `hash:${hash}`;
  }

  private sanitizeString(value: string): string {
    // Detect and mask embedded sensitive patterns
    return value
      // Credit card patterns
      .replace(/\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/g, '****-****-****-****')
      // SSN patterns
      .replace(/\b\d{3}-\d{2}-\d{4}\b/g, '***-**-****')
      // JWT tokens
      .replace(/eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*/g, '[JWT_TOKEN]')
      // Bearer tokens in headers
      .replace(/Bearer [a-zA-Z0-9_-]+/gi, 'Bearer [REDACTED]');
  }
}

// Usage in error logging
const sanitizer = new LogSanitizer(process.env.LOG_HASH_KEY!);

function logErrorWithContext(error: Error, requestData: any) {
  logger.error({
    error: {
      type: error.name,
      message: error.message,
      stack: error.stack
    },
    // Sanitize request data before logging
    request: sanitizer.sanitize(requestData),
    timestamp: new Date().toISOString()
  });
}

// Example input with sensitive data
const requestData = {
  userId: 'usr_123456789',
  email: 'john.doe@example.com',
  order: {
    items: ['item1', 'item2'],
    payment: {
      cardNumber: '4111111111111111',
      cvv: '123',
      expiryMonth: '12',
      expiryYear: '2025'
    }
  },
  meta: {
    apiKey: 'sk_live_abc123',
    sessionToken: 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...'
  }
};

// Output after sanitization:
// {
//   userId: 'hash:a1b2c3d4e5f6g7h8',
//   email: 'joh***com',
//   order: {
//     items: ['item1', 'item2'],
//     payment: {
//       cardNumber: '[REDACTED]',
//       cvv: '[REDACTED]',
//       expiryMonth: '12',
//       expiryYear: '2025'
//     }
//   },
//   meta: {
//     apiKey: '[REDACTED]',
//     sessionToken: '[REDACTED]'
//   }
// }
```

Don't rely solely on field-name detection: also scan string values for patterns (credit cards, JWTs, and so on), and treat your log aggregator's built-in redaction features as an additional layer. Sensitive data that slips past application-level sanitization may still be caught by infrastructure-level rules.
In distributed systems, a single user request may traverse multiple services, each logging independently. Without correlation, connecting these logs to understand a failure's full context becomes nearly impossible.
The Correlation ID Pattern
A correlation ID (or request ID) is a unique identifier that flows through every service handling a request. When any service logs an error, it includes this ID, allowing all related log entries to be retrieved together.
```typescript
/**
 * Correlation ID management for distributed error logging.
 * Integrates with OpenTelemetry for distributed tracing.
 */
import { trace } from '@opentelemetry/api';
import { AsyncLocalStorage } from 'async_hooks';

/**
 * Request context that flows through the entire request lifecycle.
 */
interface RequestContext {
  correlationId: string;   // High-level request ID
  traceId: string;         // OpenTelemetry trace ID
  spanId: string;          // Current span ID
  parentSpanId?: string;   // Parent span for nested ops
  originService?: string;  // Which service initiated
  userId?: string;         // For user-specific debugging (hashed)
}

/**
 * AsyncLocalStorage provides context that follows async execution.
 */
const requestContextStorage = new AsyncLocalStorage<RequestContext>();

/**
 * Middleware that establishes request context from incoming headers.
 */
function correlationMiddleware(req: Request, res: Response, next: NextFunction) {
  // Extract or generate correlation ID
  const correlationId =
    req.headers['x-correlation-id'] as string ||
    req.headers['x-request-id'] as string ||
    generateCorrelationId();

  // Get OpenTelemetry trace context
  const span = trace.getActiveSpan();
  const spanContext = span?.spanContext();

  const requestContext: RequestContext = {
    correlationId,
    traceId: spanContext?.traceId || correlationId,
    spanId: spanContext?.spanId || generateSpanId(),
    parentSpanId: req.headers['x-parent-span-id'] as string,
    originService: req.headers['x-origin-service'] as string
  };

  // Set response header so clients can reference it for support
  res.setHeader('X-Correlation-Id', correlationId);

  // Run entire request within this context
  requestContextStorage.run(requestContext, () => {
    next();
  });
}

/**
 * Get current request context from anywhere in the call stack.
 */
function getCurrentContext(): RequestContext | undefined {
  return requestContextStorage.getStore();
}

/**
 * HTTP client that propagates context to downstream services.
 */
class CorrelatedHttpClient {
  async request(url: string, options: RequestInit = {}): Promise<Response> {
    const ctx = getCurrentContext();
    const headers = new Headers(options.headers);

    if (ctx) {
      // Propagate correlation context
      headers.set('X-Correlation-Id', ctx.correlationId);
      headers.set('X-Parent-Span-Id', ctx.spanId);
      headers.set('X-Origin-Service', process.env.SERVICE_NAME || 'unknown');

      // OpenTelemetry trace context (W3C Trace Context format)
      headers.set('traceparent', `00-${ctx.traceId}-${ctx.spanId}-01`);
    }

    return fetch(url, { ...options, headers });
  }
}

/**
 * Error logger that automatically includes correlation context.
 */
class CorrelatedErrorLogger {
  error(error: Error, operation: string, additionalData?: Record<string, unknown>) {
    const ctx = getCurrentContext();

    const logEntry = {
      level: 'error',
      timestamp: new Date().toISOString(),

      // Correlation fields for tracing
      correlationId: ctx?.correlationId || 'no-context',
      traceId: ctx?.traceId,
      spanId: ctx?.spanId,
      parentSpanId: ctx?.parentSpanId,

      // Error details
      error: {
        type: error.name,
        message: error.message,
        stack: error.stack
      },

      // Operation context
      operation,
      service: process.env.SERVICE_NAME,

      // Additional data
      ...additionalData
    };

    console.log(JSON.stringify(logEntry));
  }
}

// Example: Tracing an error across services

// Service A (API Gateway) receives request, logs error with correlation
// Log entry includes: correlationId: "req-abc123", spanId: "span-1"

// Service A calls Service B, which fails
// Service B logs error with same correlationId, parentSpanId: "span-1"

// Query in log aggregator:
//   correlationId:"req-abc123"
// Returns all logs from both services, showing full failure context
```

Modern distributed systems use OpenTelemetry for standardized tracing. Error logs should include the traceId and spanId from OpenTelemetry, enabling correlation between logs, traces, and metrics. Tools like Jaeger, Zipkin, or cloud-native tracing services can then visualize the full request flow and pinpoint where failures occurred.
Structured logging means outputting logs as machine-parseable data (typically JSON) rather than human-readable text strings. This enables powerful querying, aggregation, and alerting—essential for operating systems at scale.
Why Structure Matters
Compare these two log entries for the same error:
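For illustration, the contrast looks something like this (both entries are invented for this example):

```text
# Unstructured: readable by humans, but machines must regex-parse it
ERROR 2024-01-15 10:23:45 payment-service - Payment failed for user 12345: gateway timeout after 30000ms

# Structured: every dimension is an independently queryable field
{"timestamp":"2024-01-15T10:23:45.123Z","level":"error","service":"payment-service",
 "message":"Payment failed","error":{"type":"GatewayTimeoutError","code":"GATEWAY_TIMEOUT"},
 "userId":"hash:a1b2c3d4","durationMs":30000,"correlationId":"req-abc123"}
```

With the structured form, a log aggregator can filter by `error.code`, aggregate by `service`, or join on `correlationId` without fragile text parsing. The implementation below produces entries in this general shape.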
```typescript
/**
 * Structured logging implementation with consistent schema.
 * Uses structured JSON for all log output.
 */
interface LogSchema {
  // Required fields
  timestamp: string;
  level: 'debug' | 'info' | 'warn' | 'error' | 'fatal';
  message: string;
  service: string;

  // Correlation
  correlationId?: string;
  traceId?: string;
  spanId?: string;

  // Context
  operation?: string;
  component?: string;

  // Error-specific (when level is error/fatal)
  error?: {
    type: string;
    code: string;
    message: string;
    stack?: string;
    cause?: object;
  };

  // Flexible additional fields
  [key: string]: unknown;
}

/**
 * Logger implementation enforcing schema compliance.
 */
class StructuredLogger {
  private serviceName: string;
  private sanitizer: LogSanitizer;

  constructor(serviceName: string, sanitizer: LogSanitizer) {
    this.serviceName = serviceName;
    this.sanitizer = sanitizer;
  }

  private log(level: LogSchema['level'], message: string, data?: Record<string, unknown>) {
    const ctx = getCurrentContext();

    const entry: LogSchema = {
      timestamp: new Date().toISOString(),
      level,
      message,
      service: this.serviceName,
      correlationId: ctx?.correlationId,
      traceId: ctx?.traceId,
      spanId: ctx?.spanId
    };

    // Merge additional data, sanitized
    if (data) {
      const sanitized = this.sanitizer.sanitize(data) as Record<string, unknown>;
      Object.assign(entry, sanitized);
    }

    // Output as single-line JSON
    this.output(entry);
  }

  error(message: string, error: Error, data?: Record<string, unknown>) {
    this.log('error', message, {
      error: {
        type: error.name,
        code: (error as any).errorCode || 'UNKNOWN',
        message: error.message,
        stack: error.stack,
        cause: (error as any).cause
          ? {
              type: ((error as any).cause as Error).name,
              message: ((error as any).cause as Error).message
            }
          : undefined
      },
      ...data
    });
  }

  warn(message: string, data?: Record<string, unknown>) {
    this.log('warn', message, data);
  }

  info(message: string, data?: Record<string, unknown>) {
    this.log('info', message, data);
  }

  private output(entry: LogSchema) {
    // Single-line JSON for log aggregators
    console.log(JSON.stringify(entry));
  }
}

/**
 * Best practices for structured logging fields.
 */
const loggingGuidelines = {
  // DO: Use consistent field names across all services
  goodFieldNames: [
    'userId',     // Not 'user_id' or 'userID' or 'uid'
    'orderId',    // Not 'order_id' or 'orderID'
    'errorType',  // Not 'error_type' or 'exception_type'
    'durationMs'  // Not 'duration' or 'time_taken'
  ],

  // DO: Use standardized value formats
  goodValueFormats: {
    timestamps: 'ISO 8601: 2024-01-15T10:23:45.123Z',
    durations: 'Milliseconds as integers: 234',
    booleans: 'Actual booleans, not strings: true',
    nulls: 'Explicit null, not empty strings'
  },

  // DON'T: Mix structured and unstructured
  badPractices: [
    'Including formatted strings: "Error: XYZ at module ABC"',
    'Nested message strings that duplicate info',
    'Inconsistent field presence across log entries'
  ]
};

// Example: Query capabilities enabled by structured logs

// Find all payment errors in the last hour:
//   level:error AND error.type:PaymentException AND timestamp:>now-1h

// Count errors by type:
//   level:error | stats count() by error.type

// Find errors for specific user:
//   level:error AND userId:hash:a1b2c3d4

// Trace request flow:
//   correlationId:req-abc123 | sort timestamp asc
```

Beyond application-level logging, you must consider the infrastructure that collects, stores, and queries logs. The best log entries are useless if the infrastructure can't handle them properly.
Key Infrastructure Considerations
```typescript
import os from 'os';

/**
 * Resilient logging configuration that ensures error logs
 * reach their destination even under adverse conditions.
 */

/** Minimal counter interface (e.g. backed by a Prometheus client). */
interface MetricCounter {
  inc(labels: Record<string, string>): void;
}

/**
 * Log buffering to handle temporary aggregator unavailability.
 */
class ResilientLogger {
  private buffer: LogEntry[] = [];
  private readonly maxBufferSize = 10000;
  private readonly flushIntervalMs = 1000;
  private isAggregatorHealthy = true;

  constructor(
    private aggregator: LogAggregator,
    private droppedLogCounter: MetricCounter
  ) {
    setInterval(() => this.flushBuffer(), this.flushIntervalMs);
  }

  log(entry: LogEntry) {
    if (this.isAggregatorHealthy) {
      // Try direct send
      this.aggregator.send(entry).catch(() => {
        this.isAggregatorHealthy = false;
        this.buffer.push(entry);
        this.scheduleHealthCheck();
      });
    } else {
      // Buffer during outage
      if (this.buffer.length < this.maxBufferSize) {
        this.buffer.push(entry);
      } else {
        // CRITICAL: Never drop error logs
        // Write to local stderr as fallback
        if (entry.level === 'error' || entry.level === 'fatal') {
          console.error(JSON.stringify(entry));
        }
        // Increment dropped log counter for monitoring
        this.droppedLogCounter.inc({ level: entry.level });
      }
    }
  }

  private async flushBuffer() {
    if (this.buffer.length === 0 || !this.isAggregatorHealthy) return;

    const batch = this.buffer.splice(0, 1000);
    try {
      await this.aggregator.sendBatch(batch);
    } catch (error) {
      // Put back in buffer
      this.buffer.unshift(...batch);
      this.isAggregatorHealthy = false;
    }
  }

  private scheduleHealthCheck() {
    setTimeout(async () => {
      try {
        await this.aggregator.healthCheck();
        this.isAggregatorHealthy = true;
        await this.flushBuffer();
      } catch {
        this.scheduleHealthCheck(); // Retry
      }
    }, 5000);
  }
}

/**
 * Sampling configuration for volume control.
 * Errors are never sampled; only verbose logs.
 */
interface SamplingConfig {
  // Always log these levels (no sampling)
  alwaysLog: Array<'error' | 'warn' | 'fatal'>;
  // Sample rate for other levels (0-1)
  sampleRate: Record<string, number>;
}

const productionSamplingConfig: SamplingConfig = {
  alwaysLog: ['error', 'warn', 'fatal'],
  sampleRate: {
    'info': 1.0,  // Log all info in production (usually low volume)
    'debug': 0.01 // Only 1% of debug logs
  }
};

function shouldLog(level: string, config: SamplingConfig): boolean {
  if (config.alwaysLog.includes(level as any)) return true;
  const rate = config.sampleRate[level] ?? 1.0;
  return Math.random() < rate;
}

/**
 * Log enrichment at infrastructure level.
 * Add context that application code shouldn't need to know.
 */
function enrichLogEntry(entry: LogEntry): LogEntry {
  return {
    ...entry,
    // Infrastructure-level enrichment
    _meta: {
      hostname: os.hostname(),
      pid: process.pid,
      kubernetesNamespace: process.env.K8S_NAMESPACE,
      kubernetesPod: process.env.K8S_POD_NAME,
      deploymentVersion: process.env.DEPLOYMENT_VERSION,
      cloudRegion: process.env.CLOUD_REGION,
      instanceId: process.env.INSTANCE_ID
    }
  };
}
```

Error logging is the bridge between a failure occurring and an engineer understanding and resolving it. Quality logs dramatically reduce time-to-resolution and make incidents manageable rather than chaotic.
What's Next:
With logging mastered, the final page explores error recovery strategies—how to design systems that not only log failures but actively recover from them, maintaining availability and data integrity even when things go wrong.
You now understand how to design error logging that serves operational needs. You can structure logs for machine parsing, handle sensitive data safely, correlate across distributed systems, and build infrastructure that ensures error logs always reach their destination. Apply these patterns to transform incident response from guesswork to methodical diagnosis.