When systems fail—and they inevitably will—the quality of your error logs determines how quickly you can understand, diagnose, and resolve the issue. A well-logged error tells a complete story: what happened, where, when, and in what context. A poorly logged error is a dead end that forces engineers into speculation and guesswork.
Error logging is not the same as error handling. Handling is about recovery and user experience; logging is about observability and debugging. While users see carefully crafted messages hiding technical details, your logs must capture those very details with precision and completeness.
This page examines error logging from first principles: what information to capture, how to structure it, when to log at which levels, how to handle sensitive data, and how to make logs actionable for the engineers who will investigate failures at 3 AM.
By the end of this page, you will understand how to design error logging that provides complete diagnostic context, categorize errors by severity, implement structured logging for machine parsing, handle sensitive data safely, correlate errors across distributed systems, and create logs that accelerate incident response rather than hinder it.
Before discussing techniques, we must understand who reads error logs and what they need. Error logs serve multiple audiences with different requirements:
1. On-Call Engineers During Incidents

When production breaks, on-call engineers are your primary log consumers. They need to quickly understand:

- What failed, and in which service and deployment
- How many users are affected, and whether the failure is spreading
- Whether the error is new or a known, recurring issue
- Where to look next: correlation IDs, traces, and related log entries

2. Developers Debugging Issues

After the immediate incident, developers investigate root causes. They need:

- Full stack traces and nested cause chains
- The (sanitized) input that triggered the failure
- Enough surrounding context to reproduce the issue

3. Security and Compliance Teams

Security personnel review logs for:

- Authentication and authorization failures
- Anomalous access patterns
- Audit trails that must not leak credentials or PII

4. Automated Monitoring Systems

Machine consumers of logs need:

- Consistent, machine-parseable structure (typically JSON)
- Stable field names and severity levels that map cleanly to alerting and aggregation rules
Poor error logging extends incident duration. Each missing piece of information requires another round of investigation, deployment of enhanced logging, and waiting for the issue to recur. A single well-logged error can resolve an incident in minutes; a poorly logged error can extend it to hours or days.
A complete error log entry captures multiple dimensions of context. Think of it as answering the journalist's questions: Who, What, When, Where, Why, and How.
Essential Fields for Every Error Log
| Category | Fields | Purpose |
|---|---|---|
| Identity | correlationId, requestId, traceId, spanId | Link related log entries across time and services |
| Timing | timestamp, duration, operationStartTime | Establish sequence and identify timeouts |
| Location | serviceName, serviceVersion, hostName, environment | Identify which deployment experienced the failure |
| Operation | operationName, httpMethod, httpPath, functionName | What action was being performed |
| User Context | userId (hashed), sessionId (hashed), userRole, tenantId | Who was affected (safely anonymized) |
| Error Details | errorType, errorCode, errorMessage, stackTrace | The actual failure information |
| Input Context | inputParameters (sanitized), requestBody (partial) | What triggered the operation |
| System State | memoryUsage, cpuLoad, connectionPoolStatus | Resource conditions at failure time |
```typescript
/**
 * Comprehensive error logging with complete context.
 * Demonstrates structured logging that provides all information
 * needed for debugging without additional log correlation.
 */
interface ErrorLogEntry {
  // ============================================
  // IDENTITY: Correlation across time and services
  // ============================================
  correlationId: string;  // Unique ID for entire request flow
  traceId: string;        // Distributed tracing ID
  spanId: string;         // Current operation span
  parentSpanId?: string;  // Parent span for nested operations

  // ============================================
  // TIMING: When and how long
  // ============================================
  timestamp: string;      // ISO 8601 format
  operationDurationMs: number;
  operationStartTime: string;

  // ============================================
  // LOCATION: Where in the system
  // ============================================
  service: {
    name: string;
    version: string;
    environment: 'production' | 'staging' | 'development';
    instance: string;     // Pod/container/machine ID
    region?: string;
  };

  // ============================================
  // OPERATION: What was being done
  // ============================================
  operation: {
    name: string;         // 'CreateOrder', 'ProcessPayment'
    type: 'http' | 'grpc' | 'async' | 'scheduled' | 'internal';
    httpMethod?: string;
    httpPath?: string;
    httpStatusCode?: number;
    queueName?: string;
  };

  // ============================================
  // USER CONTEXT: Who was affected (safely)
  // ============================================
  user?: {
    idHash: string;       // Hashed user ID for privacy
    role: string;
    tenantId?: string;
    sessionIdHash?: string;
  };

  // ============================================
  // ERROR DETAILS: The actual failure
  // ============================================
  error: {
    type: string;         // Exception class name
    code: string;         // Application error code
    message: string;      // Error message
    category: 'validation' | 'authentication' | 'authorization' |
              'business_rule' | 'external_service' | 'database' |
              'infrastructure' | 'unknown';
    isRetryable: boolean;
    stackTrace?: string;  // Full stack trace
    causedBy?: {          // Nested cause chain
      type: string;
      message: string;
      stackTrace?: string;
    };
  };

  // ============================================
  // INPUT CONTEXT: What triggered this
  // ============================================
  input?: {
    // Sanitized/partial input for debugging
    // NEVER include passwords, tokens, or PII
    sanitizedPayload?: Record<string, unknown>;
    queryParameters?: Record<string, string>;
    relevantHeaders?: Record<string, string>;
  };

  // ============================================
  // SYSTEM STATE: Resource conditions
  // ============================================
  systemState?: {
    memoryUsageMb: number;
    memoryLimitMb: number;
    cpuPercent: number;
    activeConnections: number;
    pendingRequests: number;
  };

  // ============================================
  // METADATA: Classification and routing
  // ============================================
  level: 'error' | 'warn' | 'fatal';
  tags: string[];                         // For filtering: ['payments', 'critical-path']
  alertTier?: 'p1' | 'p2' | 'p3' | 'p4';  // Alert priority
}

/**
 * Error logger that constructs complete, structured log entries.
 */
class StructuredErrorLogger {
  constructor(
    private readonly serviceName: string,
    private readonly serviceVersion: string,
    private readonly environment: string,
    private readonly logTarget: LogTarget
  ) {}

  /**
   * Log an error with complete context.
   */
  error(
    error: Error,
    operation: OperationContext,
    additionalContext?: Partial<ErrorLogEntry>
  ): void {
    const entry = this.buildLogEntry('error', error, operation, additionalContext);
    this.logTarget.write(entry);
  }

  /**
   * Log a warning for recoverable issues.
   */
  warn(
    error: Error,
    operation: OperationContext,
    additionalContext?: Partial<ErrorLogEntry>
  ): void {
    const entry = this.buildLogEntry('warn', error, operation, additionalContext);
    this.logTarget.write(entry);
  }

  /**
   * Log a fatal error requiring immediate attention.
   */
  fatal(
    error: Error,
    operation: OperationContext,
    additionalContext?: Partial<ErrorLogEntry>
  ): void {
    const entry = this.buildLogEntry('fatal', error, operation, additionalContext);
    entry.alertTier = 'p1'; // Fatal errors always page on-call
    this.logTarget.write(entry);
  }

  private buildLogEntry(
    level: 'error' | 'warn' | 'fatal',
    error: Error,
    operation: OperationContext,
    additionalContext?: Partial<ErrorLogEntry>
  ): ErrorLogEntry {
    const now = new Date();

    return {
      // Identity
      correlationId: operation.correlationId,
      traceId: operation.traceId,
      spanId: operation.spanId,
      parentSpanId: operation.parentSpanId,

      // Timing
      timestamp: now.toISOString(),
      operationDurationMs: now.getTime() - operation.startTime.getTime(),
      operationStartTime: operation.startTime.toISOString(),

      // Location
      service: {
        name: this.serviceName,
        version: this.serviceVersion,
        environment: this.environment as any,
        instance: process.env.HOSTNAME || 'unknown',
        region: process.env.AWS_REGION
      },

      // Operation
      operation: {
        name: operation.name,
        type: operation.type,
        httpMethod: operation.httpMethod,
        httpPath: operation.httpPath,
        httpStatusCode: operation.httpStatusCode
      },

      // Error
      error: {
        type: error.constructor.name,
        code: this.extractErrorCode(error),
        message: error.message,
        category: this.categorizeError(error),
        isRetryable: this.isRetryable(error),
        stackTrace: error.stack,
        causedBy: this.extractCause(error)
      },

      // Metadata
      level,
      tags: this.generateTags(operation, error),
      alertTier: this.determineAlertTier(level, error),

      // Merge additional context
      ...additionalContext
    };
  }

  private extractErrorCode(error: Error): string {
    if ('errorCode' in error) return (error as any).errorCode;
    if ('code' in error) return String((error as any).code);
    return 'UNKNOWN';
  }

  private categorizeError(error: Error): ErrorLogEntry['error']['category'] {
    // Categorize based on error type hierarchy
    if (error instanceof ValidationException) return 'validation';
    if (error instanceof AuthenticationException) return 'authentication';
    if (error instanceof AuthorizationException) return 'authorization';
    if (error instanceof DomainException) return 'business_rule';
    if (error instanceof ExternalServiceException) return 'external_service';
    if (error instanceof DatabaseException) return 'database';
    if (error instanceof InfrastructureException) return 'infrastructure';
    return 'unknown';
  }

  private isRetryable(error: Error): boolean {
    if ('isRetryable' in error) return Boolean((error as any).isRetryable);
    // Default heuristics
    if (error instanceof DatabaseConnectionException) return true;
    if (error instanceof TimeoutException) return true;
    if (error instanceof ValidationException) return false;
    return false;
  }

  private extractCause(error: Error): { type: string; message: string; stackTrace?: string } | undefined {
    if ('cause' in error && error.cause instanceof Error) {
      return {
        type: error.cause.constructor.name,
        message: error.cause.message,
        stackTrace: error.cause.stack
      };
    }
    return undefined;
  }

  private generateTags(operation: OperationContext, error: Error): string[] {
    const tags: string[] = [];
    if (operation.isCriticalPath) tags.push('critical-path');
    if (operation.businessDomain) tags.push(operation.businessDomain);
    if (error instanceof PaymentException) tags.push('payments');
    return tags;
  }

  private determineAlertTier(level: string, error: Error): 'p1' | 'p2' | 'p3' | 'p4' | undefined {
    if (level === 'fatal') return 'p1';
    if (error instanceof DatabaseConnectionException) return 'p2';
    if (error instanceof ExternalServiceException) return 'p2';
    if (error instanceof BusinessCriticalException) return 'p2';
    return undefined; // Let alerting rules decide
  }
}
```

Correct log level assignment is crucial for effective alerting and triage. The difference between WARN and ERROR, or ERROR and FATAL, determines whether an on-call engineer is paged at 3 AM or sleeps through the night.
The Standard Log Levels
While terminology varies between frameworks, the concepts are universal:
| Level | When to Use | Alerting Implication | Examples |
|---|---|---|---|
| FATAL / CRITICAL | System cannot continue operating; requires immediate human intervention | Page on-call immediately (P1) | Database cluster unreachable, certificate expired, data corruption detected |
| ERROR | Operation failed; user request could not be completed; unexpected condition | Alert within minutes (P2) | Payment processing failed, required external service unavailable, unhandled exception |
| WARN | Something unexpected but handled; degraded operation; potential future problem | Aggregate and alert if threshold exceeded | Retry succeeded after transient failure, deprecated API used, cache miss fallback to database |
| INFO | Not an error—significant business events for audit | No alert; for dashboards and audit | User logged in, order placed, configuration reloaded |
| DEBUG | Not an error—detailed diagnostic information for development | Disabled in production, or heavily sampled | Function entry/exit, variable values, detailed flow |
```typescript
/**
 * Guidelines for choosing the correct log level for errors.
 * Apply these rules consistently across your codebase.
 */

/**
 * FATAL: System-level failures that prevent operation.
 * The process or service cannot function and will likely crash or restart.
 */
function examplesOfFatalErrors() {
  // Database connection pool exhausted and cannot recover
  logger.fatal(new Error('Connection pool exhausted after max retries'), {
    operation: { name: 'PoolInit', type: 'internal' }
  });

  // Configuration is invalid and service cannot start
  logger.fatal(new Error('Invalid configuration: required key "database_url" missing'), {
    operation: { name: 'ConfigLoad', type: 'internal' }
  });

  // Critical security component failed
  logger.fatal(new Error('Encryption key rotation failed - service must halt'), {
    operation: { name: 'KeyRotation', type: 'internal' }
  });
}

/**
 * ERROR: Request-level failures that prevented completing the user's action.
 * The service is still operational, but this specific request failed.
 */
function examplesOfErrors() {
  // User's request failed due to external service
  logger.error(new ExternalServiceException('PaymentGateway', 503), {
    operation: { name: 'ProcessPayment', type: 'http', httpPath: '/api/payments' }
  });

  // Unexpected exception caught at boundary
  logger.error(new Error('Unhandled exception in order processing'), {
    operation: { name: 'CreateOrder', type: 'http', httpPath: '/api/orders' }
  });

  // Business rule violation with impact
  logger.error(new InsufficientFundsException('acc-123', 100, 50), {
    operation: { name: 'Transfer', type: 'http', httpPath: '/api/transfers' }
  });
}

/**
 * WARN: Issues that were handled but indicate problems worth knowing about.
 * The request succeeded (possibly with degradation), but something was off.
 */
function examplesOfWarnings() {
  // Transient failure recovered
  logger.warn(new Error('Redis connection failed, succeeded on retry 2'), {
    operation: { name: 'CacheRead', type: 'internal' },
    retryAttempts: 2
  });

  // Fallback activated
  logger.warn(new Error('Primary config service unavailable, using cached config'), {
    operation: { name: 'ConfigRefresh', type: 'internal' },
    fallback: 'cachedConfig'
  });

  // Deprecated usage detected
  logger.warn(new Error('Deprecated API endpoint called: /v1/users'), {
    operation: { name: 'ListUsers', type: 'http', httpPath: '/v1/users' },
    recommendation: 'Migrate to /v2/users'
  });

  // Resource pressure detected
  logger.warn(new Error('Connection pool at 90% capacity'), {
    operation: { name: 'PoolMonitor', type: 'internal' },
    currentConnections: 45,
    maxConnections: 50
  });
}

/**
 * Decision tree for log level selection.
 */
function determineLogLevel(error: Error, context: {
  wasRequestCompleted: boolean;
  wasRecoverySuccessful: boolean;
  isSystemStillOperational: boolean;
  affectsMultipleUsers: boolean;
}): 'fatal' | 'error' | 'warn' {
  // FATAL: System cannot operate
  if (!context.isSystemStillOperational) {
    return 'fatal';
  }

  // ERROR: Request failed, user not served
  if (!context.wasRequestCompleted) {
    return 'error';
  }

  // WARN: Issue occurred but was handled
  if (!context.wasRecoverySuccessful || context.affectsMultipleUsers) {
    // If affecting many users, consider upgrading to error
    return 'warn';
  }

  return 'warn';
}
```

When in doubt, many developers default to ERROR level. This breeds alert fatigue: on-call engineers become desensitized to ERROR alerts because most don't require action. Reserve ERROR for genuinely failed operations, and use WARN for handled anomalies. An unusually high ERROR rate indicates either real problems or incorrect level assignment—both warrant investigation.
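As a rough sketch of that investigation trigger, the monitor below flags when the ERROR rate in a sliding window climbs well above an expected baseline. The class name, window size, and threshold are illustrative assumptions, not part of any standard library:

```typescript
/**
 * Illustrative sliding-window ERROR-rate monitor (hypothetical; values
 * and the 3x-baseline threshold are assumptions for this example).
 */
class ErrorRateMonitor {
  private timestamps: number[] = [];

  constructor(
    private readonly windowMs = 60_000,      // 1-minute window (assumed)
    private readonly baselinePerMinute = 10  // expected ERROR volume (assumed)
  ) {}

  /** Record one ERROR-level log event. */
  record(now = Date.now()): void {
    this.timestamps.push(now);
    // Drop events that have fallen out of the window
    while (this.timestamps.length > 0 && this.timestamps[0] < now - this.windowMs) {
      this.timestamps.shift();
    }
  }

  /** True when the ERROR rate is far above baseline and warrants investigation. */
  isAnomalous(): boolean {
    return this.timestamps.length > this.baselinePerMinute * 3;
  }
}
```

In practice this check usually lives in your alerting system, for example as a threshold rule over aggregated log counts, rather than in application code.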
Logs are often stored for extended periods, replicated to multiple systems, and accessed by various teams. Sensitive data in logs creates security and compliance risks. You must balance debugging needs against privacy and security requirements.
Categories of Sensitive Data
```typescript
import { createHmac } from 'crypto';

/**
 * Comprehensive sensitive data sanitization for logging.
 * Applies multiple strategies to ensure sensitive data never reaches logs.
 */

/**
 * Fields that should be completely excluded from logs.
 * Lookups use lowercased keys, so the sets are normalized to lowercase.
 */
const FORBIDDEN_FIELDS = new Set([
  'password', 'newPassword', 'confirmPassword', 'oldPassword',
  'creditCard', 'creditCardNumber', 'cardNumber', 'cvv', 'cvc',
  'ssn', 'socialSecurityNumber', 'taxId',
  'apiKey', 'apiSecret', 'secretKey', 'privateKey',
  'accessToken', 'refreshToken', 'bearerToken', 'authToken',
  'pin', 'mfaCode', 'otpCode', 'verificationCode',
  'encryptionKey', 'decryptionKey', 'signingKey'
].map(field => field.toLowerCase()));

/**
 * Fields that should be masked (partial visibility for debugging).
 */
const MASKED_FIELDS = new Set([
  'email', 'emailAddress', 'phone', 'phoneNumber', 'mobile',
  'accountNumber', 'iban'
].map(field => field.toLowerCase()));

/**
 * Fields that should be hashed (pseudonymized but correlatable).
 */
const HASHED_FIELDS = new Set([
  'userId', 'customerId', 'sessionId', 'deviceId'
].map(field => field.toLowerCase()));

class LogSanitizer {
  private hashKey: string;

  constructor(hashKey: string) {
    this.hashKey = hashKey;
  }

  /**
   * Sanitize an object for safe logging.
   * Recursively processes nested objects and arrays.
   */
  sanitize(data: unknown, depth = 0): unknown {
    // Prevent infinite recursion
    if (depth > 10) return '[MAX_DEPTH]';

    if (data === null || data === undefined) {
      return data;
    }

    if (typeof data === 'string') {
      return this.sanitizeString(data);
    }

    if (typeof data !== 'object') {
      return data;
    }

    if (Array.isArray(data)) {
      return data.map(item => this.sanitize(item, depth + 1));
    }

    // Object: process each field
    const sanitized: Record<string, unknown> = {};
    for (const [key, value] of Object.entries(data)) {
      const lowerKey = key.toLowerCase();

      // Completely remove forbidden fields
      if (this.isForbidden(lowerKey)) {
        sanitized[key] = '[REDACTED]';
        continue;
      }

      // Mask sensitive fields
      if (this.shouldMask(lowerKey)) {
        sanitized[key] = this.mask(value);
        continue;
      }

      // Hash pseudonymous identifiers
      if (this.shouldHash(lowerKey)) {
        sanitized[key] = this.hash(value);
        continue;
      }

      // Recursively sanitize nested objects
      sanitized[key] = this.sanitize(value, depth + 1);
    }

    return sanitized;
  }

  private isForbidden(key: string): boolean {
    return FORBIDDEN_FIELDS.has(key) ||
      key.includes('password') ||
      key.includes('secret') ||
      key.includes('token') ||
      (key.includes('key') && key.includes('api'));
  }

  private shouldMask(key: string): boolean {
    return MASKED_FIELDS.has(key) ||
      key.includes('email') ||
      key.includes('phone');
  }

  private shouldHash(key: string): boolean {
    return HASHED_FIELDS.has(key);
  }

  private mask(value: unknown): string {
    if (typeof value !== 'string') return '[MASKED]';
    if (value.length <= 4) return '****';
    // Show first and last characters for identification
    const visibleChars = Math.min(3, Math.floor(value.length / 4));
    return value.slice(0, visibleChars) + '***' + value.slice(-visibleChars);
  }

  private hash(value: unknown): string {
    if (value === null || value === undefined) return '[NULL]';
    const stringValue = String(value);
    // Create consistent hash for correlation
    const hash = createHmac('sha256', this.hashKey)
      .update(stringValue)
      .digest('hex')
      .slice(0, 16);
    return `hash:${hash}`;
  }

  private sanitizeString(value: string): string {
    // Detect and mask embedded sensitive patterns
    return value
      // Credit card patterns
      .replace(/\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/g, '****-****-****-****')
      // SSN patterns
      .replace(/\b\d{3}-\d{2}-\d{4}\b/g, '***-**-****')
      // JWT tokens
      .replace(/eyJ[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*\.[a-zA-Z0-9_-]*/g, '[JWT_TOKEN]')
      // Bearer tokens in headers
      .replace(/Bearer [a-zA-Z0-9_-]+/gi, 'Bearer [REDACTED]');
  }
}

// Usage in error logging
const sanitizer = new LogSanitizer(process.env.LOG_HASH_KEY!);

function logErrorWithContext(error: Error, requestData: any) {
  logger.error({
    error: {
      type: error.name,
      message: error.message,
      stack: error.stack
    },
    // Sanitize request data before logging
    request: sanitizer.sanitize(requestData),
    timestamp: new Date().toISOString()
  });
}

// Example input with sensitive data
const requestData = {
  userId: 'usr_123456789',
  email: 'john.doe@example.com',
  order: {
    items: ['item1', 'item2'],
    payment: {
      cardNumber: '4111111111111111',
      cvv: '123',
      expiryMonth: '12',
      expiryYear: '2025'
    }
  },
  meta: {
    apiKey: 'sk_live_abc123',
    sessionToken: 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...'
  }
};

// Output after sanitization:
// {
//   userId: 'hash:a1b2c3d4e5f6g7h8',
//   email: 'joh***com',
//   order: {
//     items: ['item1', 'item2'],
//     payment: {
//       cardNumber: '[REDACTED]',
//       cvv: '[REDACTED]',
//       expiryMonth: '12',
//       expiryYear: '2025'
//     }
//   },
//   meta: {
//     apiKey: '[REDACTED]',
//     sessionToken: '[REDACTED]'
//   }
// }
```

Don't rely solely on field-name detection: also scan string values for patterns (credit cards, JWTs, and so on), and treat your log aggregator's built-in redaction features as an additional layer. Sensitive data that slips past application-level sanitization may still be caught by infrastructure-level rules.
In distributed systems, a single user request may traverse multiple services, each logging independently. Without correlation, connecting these logs to understand a failure's full context becomes nearly impossible.
The Correlation ID Pattern
A correlation ID (or request ID) is a unique identifier that flows through every service handling a request. When any service logs an error, it includes this ID, allowing all related log entries to be retrieved together.
```typescript
/**
 * Correlation ID management for distributed error logging.
 * Integrates with OpenTelemetry for distributed tracing.
 */
import { trace } from '@opentelemetry/api';
import { AsyncLocalStorage } from 'async_hooks';

/**
 * Request context that flows through the entire request lifecycle.
 */
interface RequestContext {
  correlationId: string;   // High-level request ID
  traceId: string;         // OpenTelemetry trace ID
  spanId: string;          // Current span ID
  parentSpanId?: string;   // Parent span for nested ops
  originService?: string;  // Which service initiated
  userId?: string;         // For user-specific debugging (hashed)
}

/**
 * AsyncLocalStorage provides context that follows async execution.
 */
const requestContextStorage = new AsyncLocalStorage<RequestContext>();

/**
 * Middleware that establishes request context from incoming headers.
 */
function correlationMiddleware(req: Request, res: Response, next: NextFunction) {
  // Extract or generate correlation ID
  const correlationId =
    req.headers['x-correlation-id'] as string ||
    req.headers['x-request-id'] as string ||
    generateCorrelationId();

  // Get OpenTelemetry trace context
  const span = trace.getActiveSpan();
  const spanContext = span?.spanContext();

  const requestContext: RequestContext = {
    correlationId,
    traceId: spanContext?.traceId || correlationId,
    spanId: spanContext?.spanId || generateSpanId(),
    parentSpanId: req.headers['x-parent-span-id'] as string,
    originService: req.headers['x-origin-service'] as string
  };

  // Set response header so clients can reference it for support
  res.setHeader('X-Correlation-Id', correlationId);

  // Run entire request within this context
  requestContextStorage.run(requestContext, () => {
    next();
  });
}

/**
 * Get current request context from anywhere in the call stack.
 */
function getCurrentContext(): RequestContext | undefined {
  return requestContextStorage.getStore();
}

/**
 * HTTP client that propagates context to downstream services.
 */
class CorrelatedHttpClient {
  async request(url: string, options: RequestInit = {}): Promise<Response> {
    const ctx = getCurrentContext();
    const headers = new Headers(options.headers);

    if (ctx) {
      // Propagate correlation context
      headers.set('X-Correlation-Id', ctx.correlationId);
      headers.set('X-Parent-Span-Id', ctx.spanId);
      headers.set('X-Origin-Service', process.env.SERVICE_NAME || 'unknown');

      // OpenTelemetry trace context (W3C Trace Context format)
      headers.set('traceparent', `00-${ctx.traceId}-${ctx.spanId}-01`);
    }

    return fetch(url, { ...options, headers });
  }
}

/**
 * Error logger that automatically includes correlation context.
 */
class CorrelatedErrorLogger {
  error(error: Error, operation: string, additionalData?: Record<string, unknown>) {
    const ctx = getCurrentContext();

    const logEntry = {
      level: 'error',
      timestamp: new Date().toISOString(),

      // Correlation fields for tracing
      correlationId: ctx?.correlationId || 'no-context',
      traceId: ctx?.traceId,
      spanId: ctx?.spanId,
      parentSpanId: ctx?.parentSpanId,

      // Error details
      error: {
        type: error.name,
        message: error.message,
        stack: error.stack
      },

      // Operation context
      operation,
      service: process.env.SERVICE_NAME,

      // Additional data
      ...additionalData
    };

    console.log(JSON.stringify(logEntry));
  }
}

// Example: Tracing an error across services

// Service A (API Gateway) receives request, logs error with correlation
// Log entry includes: correlationId: "req-abc123", spanId: "span-1"

// Service A calls Service B, which fails
// Service B logs error with same correlationId, parentSpanId: "span-1"

// Query in log aggregator:
//   correlationId:"req-abc123"
// Returns all logs from both services, showing full failure context
```

Modern distributed systems use OpenTelemetry for standardized tracing. Error logs should include the traceId and spanId from OpenTelemetry, enabling correlation between logs, traces, and metrics. Tools like Jaeger, Zipkin, or cloud-native tracing services can then visualize the full request flow and pinpoint where failures occurred.
Structured logging means outputting logs as machine-parseable data (typically JSON) rather than human-readable text strings. This enables powerful querying, aggregation, and alerting—essential for operating systems at scale.
Why Structure Matters
Compare these two log entries for the same error:
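For illustration, the contrast looks something like this (both entries are invented for this example):

```text
# Unstructured: readable by humans, but machines must regex-parse it
ERROR 2024-01-15 10:23:45 payment-service - Payment failed for user 12345: gateway timeout after 30000ms

# Structured: every dimension is an independently queryable field
{"timestamp":"2024-01-15T10:23:45.123Z","level":"error","service":"payment-service",
 "message":"Payment failed","error":{"type":"GatewayTimeoutError","code":"GATEWAY_TIMEOUT"},
 "userId":"hash:a1b2c3d4","durationMs":30000,"correlationId":"req-abc123"}
```

With the structured form, a log aggregator can filter by `error.code`, aggregate by `service`, or join on `correlationId` without fragile text parsing. The implementation below produces entries in this general shape.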
```typescript
/**
 * Structured logging implementation with consistent schema.
 * Uses structured JSON for all log output.
 */
interface LogSchema {
  // Required fields
  timestamp: string;
  level: 'debug' | 'info' | 'warn' | 'error' | 'fatal';
  message: string;
  service: string;

  // Correlation
  correlationId?: string;
  traceId?: string;
  spanId?: string;

  // Context
  operation?: string;
  component?: string;

  // Error-specific (when level is error/fatal)
  error?: {
    type: string;
    code: string;
    message: string;
    stack?: string;
    cause?: object;
  };

  // Flexible additional fields
  [key: string]: unknown;
}

/**
 * Logger implementation enforcing schema compliance.
 */
class StructuredLogger {
  private serviceName: string;
  private sanitizer: LogSanitizer;

  constructor(serviceName: string, sanitizer: LogSanitizer) {
    this.serviceName = serviceName;
    this.sanitizer = sanitizer;
  }

  private log(level: LogSchema['level'], message: string, data?: Record<string, unknown>) {
    const ctx = getCurrentContext();

    const entry: LogSchema = {
      timestamp: new Date().toISOString(),
      level,
      message,
      service: this.serviceName,
      correlationId: ctx?.correlationId,
      traceId: ctx?.traceId,
      spanId: ctx?.spanId
    };

    // Merge additional data, sanitized
    if (data) {
      const sanitized = this.sanitizer.sanitize(data) as Record<string, unknown>;
      Object.assign(entry, sanitized);
    }

    // Output as single-line JSON
    this.output(entry);
  }

  error(message: string, error: Error, data?: Record<string, unknown>) {
    this.log('error', message, {
      error: {
        type: error.name,
        code: (error as any).errorCode || 'UNKNOWN',
        message: error.message,
        stack: error.stack,
        cause: (error as any).cause
          ? {
              type: ((error as any).cause as Error).name,
              message: ((error as any).cause as Error).message
            }
          : undefined
      },
      ...data
    });
  }

  warn(message: string, data?: Record<string, unknown>) {
    this.log('warn', message, data);
  }

  info(message: string, data?: Record<string, unknown>) {
    this.log('info', message, data);
  }

  private output(entry: LogSchema) {
    // Single-line JSON for log aggregators
    console.log(JSON.stringify(entry));
  }
}

/**
 * Best practices for structured logging fields.
 */
const loggingGuidelines = {
  // DO: Use consistent field names across all services
  goodFieldNames: [
    'userId',     // Not 'user_id' or 'userID' or 'uid'
    'orderId',    // Not 'order_id' or 'orderID'
    'errorType',  // Not 'error_type' or 'exception_type'
    'durationMs'  // Not 'duration' or 'time_taken'
  ],

  // DO: Use standardized value formats
  goodValueFormats: {
    timestamps: 'ISO 8601: 2024-01-15T10:23:45.123Z',
    durations: 'Milliseconds as integers: 234',
    booleans: 'Actual booleans, not strings: true',
    nulls: 'Explicit null, not empty strings'
  },

  // DON'T: Mix structured and unstructured
  badPractices: [
    'Including formatted strings: "Error: XYZ at module ABC"',
    'Nested message strings that duplicate info',
    'Inconsistent field presence across log entries'
  ]
};

// Example: Query capabilities enabled by structured logs

// Find all payment errors in the last hour:
//   level:error AND error.type:PaymentException AND timestamp:>now-1h

// Count errors by type:
//   level:error | stats count() by error.type

// Find errors for specific user:
//   level:error AND userId:hash:a1b2c3d4

// Trace request flow:
//   correlationId:req-abc123 | sort timestamp asc
```

Beyond application-level logging, you must consider the infrastructure that collects, stores, and queries logs. The best log entries are useless if the infrastructure can't handle them properly.
Key Infrastructure Considerations
```typescript
import os from 'os';

/**
 * Resilient logging configuration that ensures error logs
 * reach their destination even under adverse conditions.
 */

/** Minimal counter interface (e.g. backed by a Prometheus client). */
interface MetricCounter {
  inc(labels: Record<string, string>): void;
}

/**
 * Log buffering to handle temporary aggregator unavailability.
 */
class ResilientLogger {
  private buffer: LogEntry[] = [];
  private readonly maxBufferSize = 10000;
  private readonly flushIntervalMs = 1000;
  private isAggregatorHealthy = true;

  constructor(
    private aggregator: LogAggregator,
    private droppedLogCounter: MetricCounter
  ) {
    setInterval(() => this.flushBuffer(), this.flushIntervalMs);
  }

  log(entry: LogEntry) {
    if (this.isAggregatorHealthy) {
      // Try direct send
      this.aggregator.send(entry).catch(() => {
        this.isAggregatorHealthy = false;
        this.buffer.push(entry);
        this.scheduleHealthCheck();
      });
    } else {
      // Buffer during outage
      if (this.buffer.length < this.maxBufferSize) {
        this.buffer.push(entry);
      } else {
        // CRITICAL: Never drop error logs
        // Write to local stderr as fallback
        if (entry.level === 'error' || entry.level === 'fatal') {
          console.error(JSON.stringify(entry));
        }
        // Increment dropped log counter for monitoring
        this.droppedLogCounter.inc({ level: entry.level });
      }
    }
  }

  private async flushBuffer() {
    if (this.buffer.length === 0 || !this.isAggregatorHealthy) return;

    const batch = this.buffer.splice(0, 1000);
    try {
      await this.aggregator.sendBatch(batch);
    } catch (error) {
      // Put back in buffer
      this.buffer.unshift(...batch);
      this.isAggregatorHealthy = false;
    }
  }

  private scheduleHealthCheck() {
    setTimeout(async () => {
      try {
        await this.aggregator.healthCheck();
        this.isAggregatorHealthy = true;
        await this.flushBuffer();
      } catch {
        this.scheduleHealthCheck(); // Retry
      }
    }, 5000);
  }
}

/**
 * Sampling configuration for volume control.
 * Errors are never sampled; only verbose logs.
 */
interface SamplingConfig {
  // Always log these levels (no sampling)
  alwaysLog: Array<'error' | 'warn' | 'fatal'>;
  // Sample rate for other levels (0-1)
  sampleRate: Record<string, number>;
}

const productionSamplingConfig: SamplingConfig = {
  alwaysLog: ['error', 'warn', 'fatal'],
  sampleRate: {
    'info': 1.0,  // Log all info in production (usually low volume)
    'debug': 0.01 // Only 1% of debug logs
  }
};

function shouldLog(level: string, config: SamplingConfig): boolean {
  if (config.alwaysLog.includes(level as any)) return true;
  const rate = config.sampleRate[level] ?? 1.0;
  return Math.random() < rate;
}

/**
 * Log enrichment at infrastructure level.
 * Add context that application code shouldn't need to know.
 */
function enrichLogEntry(entry: LogEntry): LogEntry {
  return {
    ...entry,
    // Infrastructure-level enrichment
    _meta: {
      hostname: os.hostname(),
      pid: process.pid,
      kubernetesNamespace: process.env.K8S_NAMESPACE,
      kubernetesPod: process.env.K8S_POD_NAME,
      deploymentVersion: process.env.DEPLOYMENT_VERSION,
      cloudRegion: process.env.CLOUD_REGION,
      instanceId: process.env.INSTANCE_ID
    }
  };
}
```

Error logging is the bridge between a failure occurring and an engineer understanding and resolving it. Quality logs dramatically reduce time-to-resolution and make incidents manageable rather than chaotic.
What's Next:
With logging mastered, the final page explores error recovery strategies—how to design systems that not only log failures but actively recover from them, maintaining availability and data integrity even when things go wrong.
You now understand how to design error logging that serves operational needs. You can structure logs for machine parsing, handle sensitive data safely, correlate across distributed systems, and build infrastructure that ensures error logs always reach their destination. Apply these patterns to transform incident response from guesswork to methodical diagnosis.