Errors vs Exceptions: Foundations of Robust Error Handling

Level: Intermediate

Duration: 60 mins

Topic: Errors vs Exceptions

Page 1 of 4

What is an Error

The Inevitable Reality of Failure

Every software system ever built will fail. Network connections drop. Databases become unavailable. Users provide malformed input. Files get corrupted. Memory runs out. Hardware degrades. These aren't exceptional circumstances—they're the normal operating conditions of any real-world system.

Yet the way we handle these failures fundamentally shapes the quality, reliability, and maintainability of our software. A well-designed error handling strategy transforms chaotic failures into predictable, debuggable, and recoverable conditions. A poorly designed one turns minor hiccups into cascading disasters that bring entire systems to their knees.

Before we can design robust error handling, we must first develop a precise understanding of what an error actually is. This seemingly simple question reveals surprising depth and nuance that separates junior developers from senior engineers.

What You Will Learn

By the end of this page, you will understand the precise definition of an error in software systems, distinguish errors from other failure modes, classify errors by their nature and recoverability, and recognize how proper error understanding shapes system design decisions.

Defining an Error: Beyond the Obvious

An error in software represents a deviation from expected or correct behavior. But this simple definition hides considerable complexity. To understand errors properly, we examine them from two perspectives: the mathematical and the operational.

The Mathematical Perspective:

From a formal standpoint, every software function can be viewed as a mapping from an input domain to an output range. An error occurs when:

•An input falls outside the valid domain (precondition violation)
•The function cannot produce a valid output (postcondition violation)
•The function violates an invariant during execution
•The environment fails to provide expected resources or guarantees

This perspective helps us understand that errors aren't random chaos—they're violations of well-defined contracts that our systems establish.

error-contract-example
/**
 * Function contract (implicit or explicit):
 * - Precondition: divisor !== 0
 * - Postcondition: result * divisor === dividend (within floating-point precision)
 * - Invariant: function is pure (no side effects)
 */
function divide(dividend: number, divisor: number): number {
    // Precondition violation → Error condition
    if (divisor === 0) {
        // This is an ERROR: the input violates the function's domain
        throw new Error("Division by zero: divisor must be non-zero");
    }
    
    // Normal execution path
    return dividend / divisor;
}
 
/**
 * The contract between caller and function:
 * - Caller promises: divisor !== 0
 * - Function promises: valid result
 * 
 * An error is a breach of this contract from either party.
 */

The Operational Perspective:

From an operational standpoint, an error is any condition that prevents a system from fulfilling its intended purpose. This includes:

•User-level errors: Wrong password, invalid email format, insufficient permissions
•System-level errors: Database connection failure, out of memory, disk full
•Network-level errors: Timeout, connection refused, DNS resolution failure
•Hardware-level errors: Disk failure, memory corruption, CPU overheating
•Logic errors: Bugs in code that produce incorrect results

The key insight is that errors exist at every layer of the system, and designing robust software means anticipating and handling failures at each layer appropriately.

Error Classification by System Layer
Layer	Example Errors	Typical Detection	Handling Strategy
User Interface	Invalid form input, missing required fields	Client-side validation	Show user-friendly message, guide correction
Application Logic	Business rule violation, state inconsistency	Validation checks in code	Return error result, prevent invalid state
Service Layer	Malformed request, authentication failure	Input validation, middleware	HTTP error codes, structured error responses
Data Access	Query failure, constraint violation	Database driver exceptions	Retry, fallback, or escalate
Infrastructure	Network timeout, resource exhaustion	Timeouts, health checks	Circuit breakers, graceful degradation
Hardware	Disk failure, memory corruption	OS signals, checksums	Alerting, failover, data recovery

The Anatomy of an Error

Every error, regardless of its source, shares certain structural properties. Understanding these properties helps us design consistent, informative, and actionable error representations.

Essential Error Components:

Core Properties of Well-Defined Errors

•Error Identity: A unique identifier or code that distinguishes this error type from others. This enables programmatic handling and consistent documentation.
•Error Message: A human-readable description explaining what went wrong. This should be clear, actionable, and appropriate for the intended audience.
•Error Context: Relevant information about the circumstances when the error occurred—input values, system state, timestamp, request ID.
•Error Source: Where the error originated—which component, function, or layer detected the problem.
•Error Cause: The underlying reason for the error, potentially including a chain of causation for errors triggered by other errors.
•Error Severity: How serious is this error? Informational, warning, error, critical, or fatal?
•Recovery Hints: Suggestions for how to handle or recover from this error, if recovery is possible.

structured-error-example
/**
 * A well-structured error contains all the information needed
 * for diagnosis, handling, and recovery.
 */
interface StructuredError {
    // Unique identifier for programmatic handling
    code: string;  // e.g., "USER_NOT_FOUND", "DATABASE_TIMEOUT"
    
    // Human-readable explanation
    message: string;
    
    // Severity level for triage
    severity: 'info' | 'warning' | 'error' | 'critical' | 'fatal';
    
    // When did this happen?
    timestamp: Date;
    
    // Where in the system?
    source: {
        component: string;  // e.g., "UserService"
        operation: string;  // e.g., "findById"
        layer: string;      // e.g., "service", "repository", "controller"
    };
    
    // What was the context?
    context: Record<string, unknown>;  // e.g., { userId: "abc123", attemptCount: 3 }
    
    // What caused this?
    cause?: Error | StructuredError;
    
    // How might we recover?
    recovery?: {
        isRetryable: boolean;
        suggestedAction?: string;
        retryAfterMs?: number;
    };
}
 
// Example instantiation
const error: StructuredError = {
    code: "DATABASE_CONNECTION_TIMEOUT",
    message: "Failed to connect to database within 5000ms",
    severity: "error",
    timestamp: new Date(),
    source: {
        component: "UserRepository",
        operation: "findUserById",
        layer: "repository"
    },
    context: {
        userId: "user_12345",
        databaseHost: "db-primary.example.com",
        timeoutMs: 5000,
        attemptNumber: 3
    },
    cause: new Error("ETIMEDOUT: Connection timed out"),
    recovery: {
        isRetryable: true,
        suggestedAction: "Retry with exponential backoff or use read replica",
        retryAfterMs: 1000
    }
};

Error Design Principle

The richness of your error structure directly impacts your ability to diagnose, handle, and learn from failures. Errors should be designed as carefully as your success paths. A well-designed error tells a story: what happened, why it happened, and what can be done about it.
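
The interface above describes error data, but it is not something you can throw. One possible way to carry the same components on a throwable value is to subclass Error, sketched below. The names `AppError`, `AppErrorOptions`, `Severity`, and the field `underlying` are assumptions made for this illustration and do not come from any library.

```typescript
// Hypothetical names throughout: AppError, AppErrorOptions, Severity, and
// `underlying` are illustrative, not part of any library.
type Severity = 'info' | 'warning' | 'error' | 'critical' | 'fatal';

interface AppErrorOptions {
    severity?: Severity;
    context?: Record<string, unknown>;
    underlying?: unknown;   // the lower-level error that triggered this one
    isRetryable?: boolean;
}

class AppError extends Error {
    readonly code: string;
    readonly severity: Severity;
    readonly context: Record<string, unknown>;
    readonly underlying: unknown;
    readonly isRetryable: boolean;
    readonly timestamp: Date;

    constructor(code: string, message: string, options: AppErrorOptions = {}) {
        super(message);
        // Restore the prototype chain so `instanceof AppError` also works
        // when compiling to older targets such as ES5.
        Object.setPrototypeOf(this, AppError.prototype);
        this.name = 'AppError';
        this.code = code;
        this.severity = options.severity ?? 'error';
        this.context = options.context ?? {};
        this.underlying = options.underlying;
        this.isRetryable = options.isRetryable ?? false;
        this.timestamp = new Date();
    }
}

// Usage: the same database timeout as above, now as a throwable value.
const timeout = new AppError(
    'DATABASE_CONNECTION_TIMEOUT',
    'Failed to connect to database within 5000ms',
    {
        context: { timeoutMs: 5000, attemptNumber: 3 },
        underlying: new Error('ETIMEDOUT: Connection timed out'),
        isRetryable: true,
    },
);
```

Defaulting `severity` to 'error' and `isRetryable` to false is a choice of this sketch: an error that nobody classified is treated as serious and not safe to retry.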

Error Classifications: A Taxonomy

Not all errors are created equal. Understanding the different types of errors helps us choose appropriate handling strategies. Let's examine the major classification axes:

By Recoverability:

This is perhaps the most critical classification for error handling design. Can the system recover from this error, and if so, how?

Error Recoverability Classification
Category	Definition	Examples	Handling Strategy
Recoverable	System can automatically resolve and continue	Transient network glitch, temporary resource contention	Automatic retry with backoff, fallback to cache
User-Recoverable	User action can resolve the issue	Invalid input, missing authentication	Clear error message, guide user to fix
Operator-Recoverable	Manual intervention by operations team required	Configuration error, certificate expired	Alert operations, graceful degradation
Unrecoverable	System cannot continue; failure is permanent	Corrupted data, critical dependency permanently down	Fail fast, preserve state, alert immediately
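
The table can be turned into a small dispatch step: classify the error, then pick the handling strategy for its category. The sketch below is one way to do that. `Recoverability`, `Action`, `classify`, `nextAction`, and the error codes in the lookup table are all hypothetical names chosen for this example.

```typescript
// Hypothetical names: Recoverability, Action, classify, and nextAction are
// illustrative only, as are the error codes in the lookup table.
type Recoverability =
    | 'recoverable'
    | 'user-recoverable'
    | 'operator-recoverable'
    | 'unrecoverable';

type Action = 'retry-with-backoff' | 'show-guidance' | 'alert-operations' | 'fail-fast';

const categoryByCode = new Map<string, Recoverability>([
    ['NETWORK_GLITCH', 'recoverable'],
    ['INVALID_INPUT', 'user-recoverable'],
    ['CERTIFICATE_EXPIRED', 'operator-recoverable'],
    ['DATA_CORRUPTED', 'unrecoverable'],
]);

function classify(code: string): Recoverability {
    // Assumption of this sketch: unknown codes are treated as unrecoverable
    // so that they fail fast and get investigated.
    return categoryByCode.get(code) ?? 'unrecoverable';
}

function nextAction(category: Recoverability): Action {
    switch (category) {
        case 'recoverable':
            return 'retry-with-backoff';
        case 'user-recoverable':
            return 'show-guidance';
        case 'operator-recoverable':
            return 'alert-operations';
        case 'unrecoverable':
            return 'fail-fast';
        default: {
            // Compile-time check that every category is handled.
            const unreachable: never = category;
            throw new Error(`Unhandled category: ${String(unreachable)}`);
        }
    }
}
```

The `never` assignment in the default branch means that adding a fifth category without a matching case becomes a compile error rather than a silent gap.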

By Predictability:

Some errors are expected parts of normal operation, while others indicate genuine problems.

Expected Errors

•User not found (normal for lookups)
•Validation failure on user input
•File doesn't exist at specified path
•Cache miss
•Record already exists (duplicate)
•Permission denied (access control working)

Unexpected Errors

•Database connection permanently lost
•Out of memory condition
•Null pointer / reference exception
•Index out of bounds
•Disk full during write
•Malformed response from dependency

Design Insight

Expected errors are part of your system's API contract. They should be documented, have clear codes, and callers should be prepared to handle them. Unexpected errors often indicate bugs or environmental problems that require investigation.

By Blame (Source of Responsibility):

Understanding who is responsible for an error guides how we communicate and handle it.

Error Blame Attribution
Blame Category	Description	HTTP Analogy	Response Strategy
Client Error	Caller provided invalid input or made an invalid request	4xx codes	Return details to help caller fix their request
Server Error	System failed to process a valid request	5xx codes	Log internally, return generic message to client
Dependency Error	An external system/service the system depends on failed	502, 504	Evaluate retry/fallback, protect caller from internal details
Environment Error	Infrastructure or runtime environment failed	503	Alert operations, implement graceful degradation
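
As a rough illustration of the HTTP analogy, the sketch below maps a blame category to a status code and decides how much detail the caller gets. The names `Blame`, `HttpErrorBody`, and `toHttpError` are hypothetical, and the specific codes chosen (400, 500, 502, 503) are one representative per row; a real service would pick more specific ones, such as 504 for a dependency timeout.

```typescript
// Hypothetical names: Blame, HttpErrorBody, and toHttpError are illustrative.
type Blame = 'client' | 'server' | 'dependency' | 'environment';

interface HttpErrorBody {
    status: number;
    message: string;
}

function toHttpError(blame: Blame, detail: string): HttpErrorBody {
    switch (blame) {
        case 'client':
            // The caller can act on the detail, so return it.
            return { status: 400, message: detail };
        case 'server':
            // Keep the detail for internal logs; the client gets a generic message.
            return { status: 500, message: 'Internal server error' };
        case 'dependency':
            return { status: 502, message: 'An upstream service failed' };
        case 'environment':
            return { status: 503, message: 'Service temporarily unavailable' };
        default: {
            const unreachable: never = blame;
            throw new Error(`Unhandled blame category: ${String(unreachable)}`);
        }
    }
}

const clientResponse = toHttpError('client', 'Field "email" is not a valid address');
const serverResponse = toHttpError('server', 'NullReference in OrderService.total()');
```

Only the client branch passes `detail` through, which matches the response strategies in the table: details help a caller fix a bad request, but they leak internals when the fault lies on the server side.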

By Temporal Nature:

How long does this error condition persist? This directly impacts retry strategies.

Temporal Error Classification

•Transient: Temporary condition that will resolve on its own. Example: Network hiccup, brief resource contention. Strategy: Retry with exponential backoff.
•Intermittent: Happens randomly and unpredictably. Example: Race condition, memory pressure. Strategy: Retry, but investigate root cause.
•Semi-Permanent: Persists until external action is taken. Example: Expired credentials, filled disk. Strategy: Alert operators, graceful degradation.
•Permanent: Will never resolve without a code change or data repair. Example: Logic bug, corrupted data. Strategy: Fail fast, preserve diagnostic information.
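
The retry strategy for transient and intermittent errors can be sketched as follows. This is a minimal illustration, not a production retry policy: `Temporal`, `shouldRetry`, `backoffDelayMs`, and `retry` are hypothetical names, the default delays are arbitrary, and the caller is assumed to supply a `classify` function that decides which temporal category an error belongs to.

```typescript
// Hypothetical names and default values throughout.
type Temporal = 'transient' | 'intermittent' | 'semi-permanent' | 'permanent';

function shouldRetry(kind: Temporal): boolean {
    // Only conditions that may clear on their own are worth retrying.
    return kind === 'transient' || kind === 'intermittent';
}

function backoffDelayMs(attempt: number, baseMs = 100, capMs = 5000): number {
    // attempt is 0-based: 100, 200, 400, ... doubling up to the cap.
    return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

async function retry<T>(
    operation: () => Promise<T>,
    classify: (error: unknown) => Temporal,
    maxAttempts = 4,
): Promise<T> {
    for (let attempt = 0; ; attempt++) {
        try {
            return await operation();
        } catch (error) {
            const isLastAttempt = attempt >= maxAttempts - 1;
            if (isLastAttempt || !shouldRetry(classify(error))) {
                // Semi-permanent and permanent errors are rethrown immediately.
                throw error;
            }
            const delay = backoffDelayMs(attempt);
            await new Promise<void>((resolve) => setTimeout(resolve, delay));
        }
    }
}
```

A production version would usually add random jitter to the delay so that many clients recovering at once do not retry in lockstep.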

Errors vs. Other Failure Concepts

The term "error" is often conflated with related but distinct concepts. Precise terminology enables precise thinking and design. Let's carefully distinguish errors from their conceptual neighbors.

Error vs. Related Concepts
Concept	Definition	Relationship to Error	Example
Fault	The underlying defect or root cause	A fault causes an error	Buffer overflow vulnerability in code
Error	Incorrect system state resulting from a fault	Error is the manifestation of a fault	Memory corruption from the buffer overflow
Failure	Observable deviation from specified behavior	Failure is the consequence of an error	Application crash or wrong output
Bug	A fault in the software code	Bugs are a type of fault	Off-by-one error in loop condition
Defect	Flaw in specification, design, or code	Defects may or may not cause errors	Missing input validation in design
Exception	A mechanism to signal errors	Exceptions carry error information	Java's NullPointerException

The Fault → Error → Failure Chain:

Understanding this causal chain is crucial for designing robust systems:

1. Fault exists in the system (latent or dormant)
2. Fault is activated by specific conditions (triggered)
3. Error manifests as incorrect system state
4. Error propagates if not detected and handled
5. Failure occurs when error reaches system boundary and impacts users

Robust error handling aims to break this chain at the earliest possible point.

fault-error-failure-chain
TypeScript
// FAULT: The underlying defect
// This function has a fault - it doesn't handle null input
function processUserData(data: UserData): ProcessedData {
    // FAULT: No null check exists here
    return {
        fullName: data.firstName + " " + data.lastName,  // Will fail if data is null
        email: data.email.toLowerCase()
    };
}
 
// ERROR: Incorrect system state when fault is activated
// When called with null, the system enters an erroneous state
try {
    const result = processUserData(null as any);  // FAULT ACTIVATED
    // Never reached: reading data.firstName on null throws a TypeError,
    // so execution is interrupted and `result` is never assigned
} catch (e) {
    // ERROR is now manifested as an exception
    console.error("Error detected:", e);
}
 
// FAILURE: Observable deviation at system boundary
// If the error isn't caught and handled, it causes a failure
async function handleRequest(req: Request): Promise<Response> {
    const userData = await fetchUserData(req.userId);  // May return null
    const processed = processUserData(userData);       // ERROR if null
    
    // FAILURE: If error propagates, the HTTP request fails
    // User sees 500 error, transaction is rolled back, etc.
    return new Response(JSON.stringify(processed));
}
 
// DEFENSIVE DESIGN: Break the chain early
function processUserDataSafely(data: UserData | null): ProcessedData | null {
    // Break the chain at ERROR stage - detect the condition before it causes failure
    if (data === null) {
        // Error detected, prevented from becoming failure
        return null;  // Or throw a well-defined exception
    }
    
    return {
        fullName: data.firstName + " " + data.lastName,
        email: data.email.toLowerCase()
    };
}

The Propagation Problem

Unhandled errors don't disappear—they propagate. An error in a low-level component can cascade upward, causing failures in components that have no bugs of their own. This is why error handling at every layer matters, and why understanding the fault-error-failure chain is essential for system reliability.
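
One way to keep propagation diagnosable is for each layer to wrap the error it receives, adding its own context while preserving the original as the cause. The sketch below uses plain objects for brevity. `LayerError`, `wrap`, `rootCause`, and `describeChain` are hypothetical names, and the layer names and messages are made up for the example.

```typescript
// Hypothetical names: LayerError, wrap, rootCause, and describeChain.
interface LayerError {
    layer: string;
    message: string;
    cause?: LayerError;
}

function wrap(layer: string, message: string, cause?: LayerError): LayerError {
    return cause ? { layer, message, cause } : { layer, message };
}

function rootCause(error: LayerError): LayerError {
    let current = error;
    while (current.cause) {
        current = current.cause;
    }
    return current;
}

function describeChain(error: LayerError): string {
    const parts: string[] = [];
    for (let e: LayerError | undefined = error; e; e = e.cause) {
        parts.push(`[${e.layer}] ${e.message}`);
    }
    return parts.join(' <- ');
}

// A timeout in the infrastructure surfaces two layers up.
const driverError = wrap('infrastructure', 'ETIMEDOUT: Connection timed out');
const repositoryError = wrap('repository', 'findUserById failed', driverError);
const serviceError = wrap('service', 'Could not load user profile', repositoryError);

console.log(describeChain(serviceError));
// [service] Could not load user profile <- [repository] findUserById failed <- [infrastructure] ETIMEDOUT: Connection timed out
```

The service layer has no bug of its own here, yet it is the one that fails visibly. The preserved chain is what lets an investigator trace the failure back to the infrastructure timeout.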

The Philosophy of Error States

A critical insight in error handling design is that many errors, especially expected ones, are alternative valid outcomes of an operation rather than system malfunctions. This perspective shift has profound implications for how we design our systems.

The Traditional (Flawed) View:

Many developers think of errors as exceptional interruptions to the "normal" happy path. This leads to error handling as an afterthought—something bolted on after the main logic is complete.

The Mature View:

Seasoned engineers recognize that error cases are equally valid outcomes of any operation. Finding no user is as valid a result as finding one. A failed network request is as real an outcome as a successful one. This view leads to designs where error paths are as carefully crafted as success paths.

Immature Error Thinking

•"Errors are exceptional cases"
•Focus on happy path first, add error handling later
•Catch-all handlers that swallow details
•Generic error messages everywhere
•Errors break the normal flow
•Error handling is a tax on the main logic

Mature Error Thinking

•"Errors are valid outcomes"
•Design error paths alongside success paths
•Specific handlers for specific error types
•Contextual, actionable error information
•Errors are part of normal control flow
•Error handling enables main logic

error-as-valid-outcome
TypeScript
// IMMATURE: Error as exceptional interruption
// This design treats "not found" as an exception to normal flow
function getUserImmature(id: string): User {
    const user = database.query(`SELECT * FROM users WHERE id = ?`, id);
    if (!user) {
        throw new Error("User not found");  // Exception = something went wrong
    }
    return user;
}
 
// Usage forces try-catch everywhere
try {
    const user = getUserImmature("abc123");
    // ... do something with user
} catch (error) {
    // Is this "not found" or a database error? Hard to tell
    if (error instanceof Error && error.message === "User not found") {
        // Handle not found
    } else {
        // Handle other errors
    }
}
 
 
// MATURE: "Not found" is a valid outcome, not an exception
type UserResult = 
    | { success: true; user: User }
    | { success: false; error: 'NOT_FOUND' }
    | { success: false; error: 'DATABASE_ERROR'; details: string };
 
function getUserMature(id: string): UserResult {
    try {
        const user = database.query(`SELECT * FROM users WHERE id = ?`, id);
        
        if (!user) {
            // "Not found" is a valid, expected outcome
            return { success: false, error: 'NOT_FOUND' };
        }
        
        return { success: true, user };
        
    } catch (dbError) {
        // Database errors are actual problems
        return { 
            success: false, 
            error: 'DATABASE_ERROR', 
            details: dbError instanceof Error ? dbError.message : String(dbError)
        };
    }
}
 
// Usage is explicit about all outcomes
const result = getUserMature("abc123");
 
switch (result.success) {
    case true:
        // Handle found user
        console.log(`Found: ${result.user.name}`);
        break;
        
    case false:
        switch (result.error) {
            case 'NOT_FOUND':
                // Handle expected "not found" case
                console.log("User does not exist");
                break;
            case 'DATABASE_ERROR':
                // Handle actual system problem
                console.error("Database problem:", result.details);
                break;
        }
}

Design Principle

When a function can fail in expected ways, the type system should reflect this. A function that finds a user either returns a User, returns nothing (not found), or returns an error (database problem). Each of these is a valid outcome that callers must handle explicitly. This is the foundation of robust, self-documenting APIs.
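
The same idea can be generalized beyond user lookups with a reusable result type. The sketch below is one possible shape, not an established API: `Result`, `ParseError`, `parsePort`, and `explain` are hypothetical names, and port parsing is used only because it is small and self-contained.

```typescript
// Hypothetical names: Result, ParseError, parsePort, and explain.
type Result<T, E> =
    | { ok: true; value: T }
    | { ok: false; error: E };

type ParseError = 'EMPTY' | 'NOT_A_NUMBER' | 'OUT_OF_RANGE';

function parsePort(input: string): Result<number, ParseError> {
    const trimmed = input.trim();
    if (trimmed === '') {
        return { ok: false, error: 'EMPTY' };
    }
    if (!/^\d+$/.test(trimmed)) {
        return { ok: false, error: 'NOT_A_NUMBER' };
    }
    const port = Number(trimmed);
    if (port < 1 || port > 65535) {
        return { ok: false, error: 'OUT_OF_RANGE' };
    }
    return { ok: true, value: port };
}

function explain(result: Result<number, ParseError>): string {
    if (result.ok) {
        return `port ${result.value}`;
    }
    const error = result.error;
    switch (error) {
        case 'EMPTY':
            return 'no port given';
        case 'NOT_A_NUMBER':
            return 'port must contain digits only';
        case 'OUT_OF_RANGE':
            return 'port must be between 1 and 65535';
        default: {
            // Compile-time check that every expected error is handled.
            const unreachable: never = error;
            throw new Error(`Unhandled error: ${String(unreachable)}`);
        }
    }
}
```

Because every expected failure appears in the return type, a caller cannot read `value` without first checking `ok`, and adding a new `ParseError` member forces `explain` to be updated before the code compiles.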

Error Information for Different Audiences

A crucial aspect of error design is recognizing that different stakeholders need different information about the same error. What helps a developer debug is often confusing or dangerous for an end user. What an operator needs for triage is different from what automated systems need for retry decisions.

The Four Audiences of Error Information:

Error Information by Audience
Audience	Needs	Should See	Should NOT See
End Users	What happened, how to fix it	Friendly message, actionable guidance	Stack traces, internal codes, system details
Developers	Root cause, reproduction steps	Stack traces, context, error chains	Secrets, PII in logs
Operators	Impact scope, urgency, runbook steps	Error codes, affected systems, metrics	Code-level details (unless debugging)
Automated Systems	Error category, retryability, retry timing	Structured codes, machine-readable metadata	Human-readable narratives

multi-audience-error
TypeScript
/**
 * A well-designed error provides different views for different audiences.
 */
interface MultiAudienceError {
    // For automated systems: structured, machine-readable
    machine: {
        code: string;           // "PAYMENT_INSUFFICIENT_FUNDS"
        category: string;       // "payment_error"
        isRetryable: boolean;
        retryAfterMs?: number;
    };
    
    // For end users: friendly, actionable
    user: {
        title: string;          // "Payment Declined"
        message: string;        // "Your card was declined. Please try a different payment method."
        actions: string[];      // ["Try another card", "Contact your bank"]
    };
    
    // For operators: context, impact, runbook
    operator: {
        severity: 'low' | 'medium' | 'high' | 'critical';
        affectedUsers: number;
        affectedSystems: string[];
        runbookUrl?: string;
    };
    
    // For developers: full technical details
    developer: {
        message: string;
        stackTrace: string;
        context: Record<string, unknown>;
        cause?: Error;
        timestamp: Date;
        requestId: string;
    };
}
 
// Example: Payment failure error
const paymentError: MultiAudienceError = {
    machine: {
        code: "PAYMENT_INSUFFICIENT_FUNDS",
        category: "payment_error",
        isRetryable: false,  // User needs to change payment method
    },
    user: {
        title: "Payment Declined",
        message: "Your card was declined due to insufficient funds. Please try a different payment method or contact your bank.",
        actions: ["Use a different card", "Try PayPal", "Contact support"]
    },
    operator: {
        severity: "low",  // Expected business error, not system problem
        affectedUsers: 1,
        affectedSystems: ["payment-service"],
    },
    developer: {
        message: "Payment processor returned decline code: insufficient_funds",
        stackTrace: "...",
        context: {
            userId: "user_123",
            orderId: "order_456",
            amount: 150.00,
            currency: "USD",
            paymentProcessor: "stripe",
            declineCode: "insufficient_funds"
        },
        timestamp: new Date(),
        requestId: "req_abc123"
    }
};

Security Consideration

Never expose developer-level error information to end users. Stack traces, database queries, internal paths, and system architecture details can be exploited by attackers. Always sanitize errors before presenting them at system boundaries.
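
Sanitizing at the boundary is often easiest with an allow-list: build the outgoing response from a fixed set of safe fields instead of deleting dangerous ones from the internal error. The sketch below assumes a hypothetical `InternalError` shape with a `publicCode` that has been reviewed as safe to expose; `ClientErrorResponse` and `toClientResponse` are likewise illustrative names, as is the fallback message.

```typescript
// Hypothetical names: InternalError, ClientErrorResponse, toClientResponse.
interface InternalError {
    publicCode: string;        // assumed safe to expose to clients
    userMessage?: string;      // written for end users, if available
    developerMessage: string;  // may mention queries, hosts, file paths
    stackTrace?: string;
    requestId: string;
}

interface ClientErrorResponse {
    code: string;
    message: string;
    requestId: string;
}

function toClientResponse(error: InternalError): ClientErrorResponse {
    // Allow-list: only fields copied here can ever reach the client.
    return {
        code: error.publicCode,
        message: error.userMessage ?? 'Something went wrong. Please try again later.',
        requestId: error.requestId,
    };
}

const internal: InternalError = {
    publicCode: 'USER_LOOKUP_FAILED',
    developerMessage: 'SELECT * FROM users WHERE id = ? timed out on db-primary.example.com',
    stackTrace: 'Error: timeout\n    at UserRepository.findUserById',
    requestId: 'req_abc123',
};

const safeResponse = toClientResponse(internal);
```

Returning the request ID is a choice of this sketch: it gives support staff a way to find the full internal record without showing any of it to the user.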

Summary: Understanding Errors

We've established a comprehensive foundation for understanding errors in software systems. This understanding is essential before we can discuss how to handle errors effectively.

Key Takeaways:

Core Concepts to Remember

•An error is a contract violation — Errors occur when inputs, outputs, or invariants don't meet expectations.
•Errors have structure — Well-designed errors include identity, message, context, source, cause, severity, and recovery hints.
•Classification guides handling — Errors can be classified by recoverability, predictability, blame, and temporal nature.
•Understand the causal chain — Faults cause errors, errors cause failures. Break the chain early.
•Errors are valid outcomes — Treat expected failures as first-class results, not exceptional interruptions.
•Different audiences, different information — Design errors to serve users, developers, operators, and machines appropriately.

What's Next:

With a solid understanding of what errors are, we're ready to explore exceptions—the mechanism many languages provide to signal and handle errors. The next page examines what exceptions are, how they differ from errors as a concept, and the design patterns that make exception use effective.

Page Complete

You now have a rigorous understanding of errors in software systems. This conceptual foundation will serve you well as we explore exceptions, error handling strategies, and the design philosophies that shape robust, maintainable systems.

1 / 4

Loading learning content...

System Design (LLD)Errors vs Exceptions

Errors vs Exceptions: Foundations of Robust Error Handling

LevelIntermediate

Duration60 mins

TopicErrors vs Exceptions

1 / 4

What is an Error

The Inevitable Reality of Failure

What You Will Learn

Defining an Error: Beyond the Obvious

The Mathematical Perspective:

From a formal standpoint, every software function can be viewed as a mapping from an input domain to an output range. An error occurs when:

An input falls outside the valid domain (precondition violation)
The function cannot produce a valid output (postcondition violation)
The function violates an invariant during execution
The environment fails to provide expected resources or guarantees

This perspective helps us understand that errors aren't random chaos—they're violations of well-defined contracts that our systems establish.

error-contract-example
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
/**
 * Function contract (implicit or explicit):
 * - Precondition: divisor !== 0
 * - Postcondition: result * divisor === dividend (within floating-point precision)
 * - Invariant: function is pure (no side effects)
 */
function divide(dividend: number, divisor: number): number {
    // Precondition violation → Error condition
    if (divisor === 0) {
        // This is an ERROR: the input violates the function's domain
        throw new Error("Division by zero: divisor must be non-zero");
    }
    
    // Normal execution path
    return dividend / divisor;
}
 
/**
 * The caller contracts:
 * - Caller promises: divisor !== 0
 * - Function promises: valid result
 * 
 * An error is a breach of this contract from either party.
 */

The Operational Perspective:

From an operational standpoint, an error is any condition that prevents a system from fulfilling its intended purpose. This includes:

User-level errors: Wrong password, invalid email format, insufficient permissions
System-level errors: Database connection failure, out of memory, disk full
Network-level errors: Timeout, connection refused, DNS resolution failure
Hardware-level errors: Disk failure, memory corruption, CPU overheating
Logic errors: Bugs in code that produce incorrect results

The key insight is that errors exist at every layer of the system, and designing robust software means anticipating and handling failures at each layer appropriately.

Error Classification by System Layer
Layer	Example Errors	Typical Detection	Handling Strategy
User Interface	Invalid form input, missing required fields	Client-side validation	Show user-friendly message, guide correction
Application Logic	Business rule violation, state inconsistency	Validation checks in code	Return error result, prevent invalid state
Service Layer	Malformed request, authentication failure	Input validation, middleware	HTTP error codes, structured error responses
Data Access	Query failure, constraint violation	Database driver exceptions	Retry, fallback, or escalate
Infrastructure	Network timeout, resource exhaustion	Timeouts, health checks	Circuit breakers, graceful degradation
Hardware	Disk failure, memory corruption	OS signals, checksums	Alerting, failover, data recovery

The Anatomy of an Error

Every error, regardless of its source, shares certain structural properties. Understanding these properties helps us design consistent, informative, and actionable error representations.

Essential Error Components:

Core Properties of Well-Defined Errors

•Error Identity: A unique identifier or code that distinguishes this error type from others. This enables programmatic handling and consistent documentation.
•Error Message: A human-readable description explaining what went wrong. This should be clear, actionable, and appropriate for the intended audience.
•Error Context: Relevant information about the circumstances when the error occurred—input values, system state, timestamp, request ID.
•Error Source: Where the error originated—which component, function, or layer detected the problem.
•Error Cause: The underlying reason for the error, potentially including a chain of causation for errors triggered by other errors.
•Error Severity: How serious is this error? Informational, warning, error, critical, or fatal?
•Recovery Hints: Suggestions for how to handle or recover from this error, if recovery is possible.

structured-error-example
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
/**
 * A well-structured error contains all the information needed
 * for diagnosis, handling, and recovery.
 */
interface StructuredError {
    // Unique identifier for programmatic handling
    code: string;  // e.g., "USER_NOT_FOUND", "DATABASE_TIMEOUT"
    
    // Human-readable explanation
    message: string;
    
    // Severity level for triage
    severity: 'info' | 'warning' | 'error' | 'critical' | 'fatal';
    
    // When did this happen?
    timestamp: Date;
    
    // Where in the system?
    source: {
        component: string;  // e.g., "UserService"
        operation: string;  // e.g., "findById"
        layer: string;      // e.g., "service", "repository", "controller"
    };
    
    // What was the context?
    context: Record<string, unknown>;  // e.g., { userId: "abc123", attemptCount: 3 }
    
    // What caused this?
    cause?: Error | StructuredError;
    
    // How might we recover?
    recovery?: {
        isRetryable: boolean;
        suggestedAction?: string;
        retryAfterMs?: number;
    };
}
 
// Example instantiation
const error: StructuredError = {
    code: "DATABASE_CONNECTION_TIMEOUT",
    message: "Failed to connect to database within 5000ms",
    severity: "error",
    timestamp: new Date(),
    source: {
        component: "UserRepository",
        operation: "findUserById",
        layer: "repository"
    },
    context: {
        userId: "user_12345",
        databaseHost: "db-primary.example.com",
        timeoutMs: 5000,
        attemptNumber: 3
    },
    cause: new Error("ETIMEDOUT: Connection timed out"),
    recovery: {
        isRetryable: true,
        suggestedAction: "Retry with exponential backoff or use read replica",
        retryAfterMs: 1000
    }
};

Error Design Principle

Error Classifications: A Taxonomy

Not all errors are created equal. Understanding the different types of errors helps us choose appropriate handling strategies. Let's examine the major classification axes:

By Recoverability:

This is perhaps the most critical classification for error handling design. Can the system recover from this error, and if so, how?

Error Recoverability Classification
Category	Definition	Examples	Handling Strategy
Recoverable	System can automatically resolve and continue	Transient network glitch, temporary resource contention	Automatic retry with backoff, fallback to cache
User-Recoverable	User action can resolve the issue	Invalid input, missing authentication	Clear error message, guide user to fix
Operator-Recoverable	Manual intervention by operations team required	Configuration error, certificate expired	Alert operations, graceful degradation
Unrecoverable	System cannot continue; failure is permanent	Corrupted data, critical dependency permanently down	Fail fast, preserve state, alert immediately

By Predictability:

Some errors are expected parts of normal operation, while others indicate genuine problems.

Expected Errors

•User not found (normal for lookups)
•Validation failure on user input
•File doesn't exist at specified path
•Cache miss
•Record already exists (duplicate)
•Permission denied (access control working)

Unexpected Errors

•Database connection permanently lost
•Out of memory condition
•Null pointer / reference exception
•Index out of bounds
•Disk full during write
•Malformed response from dependency

Design Insight

By Blame (Source of Responsibility):

Understanding who is responsible for an error guides how we communicate and handle it.

Error Blame Attribution
Blame Category	Description	HTTP Analogy	Response Strategy
Client Error	Caller provided invalid input or made an invalid request	4xx codes	Return details to help caller fix their request
Server Error	System failed to process a valid request	5xx codes	Log internally, return generic message to client
Dependency Error	An external system/service the system depends on failed	502, 504	Evaluate retry/fallback, protect caller from internal details
Environment Error	Infrastructure or runtime environment failed	503	Alert operations, implement graceful degradation
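
One way to apply the blame table is a small lookup that decides both the status code and whether details may be returned to the caller. The sketch below is illustrative: the names `Blame`, `policyByBlame`, and `toHttpBody` are assumptions, and the choice of 400 and 502 as representative codes for their classes is arbitrary.

```typescript
// Hypothetical names; the categories mirror the blame table above.
type Blame = "client" | "server" | "dependency" | "environment";

interface HttpErrorPolicy {
    status: number;          // Representative status code for the category
    exposeDetails: boolean;  // Whether the caller may see the specific detail
}

const policyByBlame: Record<Blame, HttpErrorPolicy> = {
    client: { status: 400, exposeDetails: true },       // Help the caller fix the request
    server: { status: 500, exposeDetails: false },      // Log internally, stay generic
    dependency: { status: 502, exposeDetails: false },  // Hide internal topology
    environment: { status: 503, exposeDetails: false },
};

// Builds the response body, withholding detail unless the caller is at fault.
function toHttpBody(blame: Blame, detail: string): { status: number; message: string } {
    const policy = policyByBlame[blame];
    return {
        status: policy.status,
        message: policy.exposeDetails ? detail : "Something went wrong on our side.",
    };
}
```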

By Temporal Nature:

How long does this error condition persist? This directly impacts retry strategies.

Temporal Error Classification

•Transient: Temporary condition that will resolve on its own. Example: Network hiccup, brief resource contention. Strategy: Retry with exponential backoff.
•Intermittent: Happens randomly and unpredictably. Example: Race condition, memory pressure. Strategy: Retry, but investigate root cause.
•Semi-Permanent: Persists until external action is taken. Example: Expired credentials, filled disk. Strategy: Alert operators, graceful degradation.
•Permanent: Will never resolve without code changes. Example: Logic bug, corrupted data. Strategy: Fail fast, preserve diagnostic information.
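
The retry guidance in the list above can be sketched as a pure function that computes a backoff schedule. The names and default values (100 ms base, 5000 ms cap) are assumptions chosen for illustration. Production implementations commonly add random jitter, which is omitted here to keep the output deterministic.

```typescript
// Hypothetical names; the four classes mirror the list above.
type TemporalClass = "transient" | "intermittent" | "semi-permanent" | "permanent";

// Only conditions that may clear on their own are worth retrying automatically.
function shouldRetry(cls: TemporalClass): boolean {
    return cls === "transient" || cls === "intermittent";
}

// Exponential backoff: baseMs, 2 * baseMs, 4 * baseMs, ... capped at capMs.
// `attempt` is zero-based. The defaults are illustrative, not recommendations.
function backoffDelayMs(attempt: number, baseMs: number = 100, capMs: number = 5000): number {
    return Math.min(capMs, baseMs * 2 ** attempt);
}

// Returns the delay before each retry, or an empty list when retrying is pointless.
function retrySchedule(cls: TemporalClass, maxRetries: number): number[] {
    if (!shouldRetry(cls)) {
        return [];
    }
    const delays: number[] = [];
    for (let attempt = 0; attempt < maxRetries; attempt++) {
        delays.push(backoffDelayMs(attempt));
    }
    return delays;
}

console.log(retrySchedule("transient", 4)); // delays of 100, 200, 400, 800 ms
console.log(retrySchedule("permanent", 4)); // empty: fail fast instead
```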

Errors vs. Other Failure Concepts

The term "error" is often conflated with related but distinct concepts. Precise terminology enables precise thinking and design. Let's carefully distinguish errors from their conceptual neighbors.

Error vs. Related Concepts
Concept	Definition	Relationship to Error	Example
Fault	The underlying defect or root cause	A fault causes an error	Buffer overflow vulnerability in code
Error	Incorrect system state resulting from a fault	Error is the manifestation of a fault	Memory corruption from the buffer overflow
Failure	Observable deviation from specified behavior	Failure is the consequence of an error	Application crash or wrong output
Bug	A fault in the software code	Bugs are a type of fault	Off-by-one error in loop condition
Defect	Flaw in specification, design, or code	Defects may or may not cause errors	Missing input validation in design
Exception	A mechanism to signal errors	Exceptions carry error information	Java's NullPointerException

The Fault → Error → Failure Chain:

Understanding this causal chain is crucial for designing robust systems:

1. Fault exists in the system (latent or dormant)
2. Fault is activated by specific conditions (triggered)
3. Error manifests as incorrect system state
4. Error propagates if not detected and handled
5. Failure occurs when error reaches system boundary and impacts users

Robust error handling aims to break this chain at the earliest possible point.

fault-error-failure-chain
TypeScript
// FAULT: The underlying defect
// This function has a fault - it doesn't handle null input
function processUserData(data: UserData): ProcessedData {
    // FAULT: No null check exists here
    return {
        fullName: data.firstName + " " + data.lastName,  // Will fail if data is null
        email: data.email.toLowerCase()
    };
}
 
// ERROR: Incorrect system state when fault is activated
// When called with null, the system enters an erroneous state
try {
    const result = processUserData(null as any);  // FAULT ACTIVATED
    // ERROR: Reading data.firstName on null throws a TypeError, so execution
    // is interrupted here and result is never assigned
} catch (e) {
    // ERROR is now manifested as an exception
    console.error("Error detected:", e);
}
 
// FAILURE: Observable deviation at system boundary
// If the error isn't caught and handled, it causes a failure
async function handleRequest(req: Request): Promise<Response> {
    const userData = await fetchUserData(req.userId);  // May return null
    const processed = processUserData(userData);       // ERROR if null
    
    // FAILURE: If error propagates, the HTTP request fails
    // User sees 500 error, transaction is rolled back, etc.
    return new Response(JSON.stringify(processed));
}
 
// DEFENSIVE DESIGN: Break the chain early
function processUserDataSafely(data: UserData | null): ProcessedData | null {
    // Break the chain at ERROR stage - detect the condition before it causes failure
    if (data === null) {
        // Error detected, prevented from becoming failure
        return null;  // Or throw a well-defined exception
    }
    
    return {
        fullName: data.firstName + " " + data.lastName,
        email: data.email.toLowerCase()
    };
}

The Propagation Problem

The Philosophy of Error States

The Traditional (Flawed) View:

Many developers think of errors as exceptional interruptions to the "normal" happy path. This leads to error handling as an afterthought—something bolted on after the main logic is complete.

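The Mature View:

Errors are valid outcomes of an operation, designed alongside the success path rather than bolted on afterwards. The contrast looks like this: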

Immature Error Thinking

•"Errors are exceptional cases"
•Focus on happy path first, add error handling later
•Catch-all handlers that swallow details
•Generic error messages everywhere
•Errors break the normal flow
•Error handling taxes main logic

Mature Error Thinking

•"Errors are valid outcomes"
•Design error paths alongside success paths
•Specific handlers for specific error types
•Contextual, actionable error information
•Errors are part of normal control flow
•Error handling enables main logic

error-as-valid-outcome
TypeScript
// IMMATURE: Error as exceptional interruption
// This design treats "not found" as an exception to normal flow
function getUserImmature(id: string): User {
    const user = database.query(`SELECT * FROM users WHERE id = ?`, id);
    if (!user) {
        throw new Error("User not found");  // Exception = something went wrong
    }
    return user;
}
 
// Usage forces try-catch everywhere
try {
    const user = getUserImmature("abc123");
    // ... do something with user
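} catch (error) {
    // Is this "not found" or a database error? Hard to tell: the only signal
    // is the message string, and the caught value must be narrowed to read it
    if (error instanceof Error && error.message === "User not found") {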
        // Handle not found
    } else {
        // Handle other errors
    }
}
 
 
// MATURE: "Not found" is a valid outcome, not an exception
type UserResult = 
    | { success: true; user: User }
    | { success: false; error: 'NOT_FOUND' }
    | { success: false; error: 'DATABASE_ERROR'; details: string };
 
function getUserMature(id: string): UserResult {
    try {
        const user = database.query(`SELECT * FROM users WHERE id = ?`, id);
        
        if (!user) {
            // "Not found" is a valid, expected outcome
            return { success: false, error: 'NOT_FOUND' };
        }
        
        return { success: true, user };
        
    } catch (dbError) {
        // Database errors are actual problems
        return { 
            success: false, 
            error: 'DATABASE_ERROR', 
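            // A caught value is not guaranteed to be an Error; narrow before reading .message
            details: dbError instanceof Error ? dbError.message : String(dbError)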
        };
    }
}
 
// Usage is explicit about all outcomes
const result = getUserMature("abc123");
 
switch (result.success) {
    case true:
        // Handle found user
        console.log(`Found: ${result.user.name}`);
        break;
        
    case false:
        switch (result.error) {
            case 'NOT_FOUND':
                // Handle expected "not found" case
                console.log("User does not exist");
                break;
            case 'DATABASE_ERROR':
                // Handle actual system problem
                console.error("Database problem:", result.details);
                break;
        }
}
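
The nested switch above handles every outcome today, but nothing stops a new error code from being added later and silently ignored. A common TypeScript idiom for guarding against that is an exhaustiveness check through the `never` type. The sketch below uses hypothetical names (`LookupError`, `assertNever`, `describeLookupError`) and is one way to do it, not the only one.

```typescript
// Hypothetical error codes, matching the two failure cases used above.
type LookupError = "NOT_FOUND" | "DATABASE_ERROR";

// If every case is handled, `value` has type never and this call type-checks.
// If a new code is added to LookupError without a case, compilation fails.
function assertNever(value: never): never {
    throw new Error(`Unhandled case: ${String(value)}`);
}

function describeLookupError(error: LookupError): string {
    switch (error) {
        case "NOT_FOUND":
            return "User does not exist";
        case "DATABASE_ERROR":
            return "Database problem";
        default:
            return assertNever(error);
    }
}
```

The runtime throw in `assertNever` is a backstop for values that bypass the type system, such as unvalidated JSON.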

Design Principle

Error Information for Different Audiences

The Four Audiences of Error Information:

Error Information by Audience
Audience	Needs	Should See	Should NOT See
End Users	What happened, how to fix it	Friendly message, actionable guidance	Stack traces, internal codes, system details
Developers	Root cause, reproduction steps	Stack traces, context, error chains	Secrets, PII in logs
Operators	Impact scope, urgency, runbook steps	Error codes, affected systems, metrics	Code-level details (unless debugging)
Automated Systems	Error category, retryability, retry timing	Structured codes, machine-readable metadata	Human-readable narratives

multi-audience-error
TypeScript
/**
 * A well-designed error provides different views for different audiences.
 */
interface MultiAudienceError {
    // For automated systems: structured, machine-readable
    machine: {
        code: string;           // "PAYMENT_INSUFFICIENT_FUNDS"
        category: string;       // "payment_error"
        isRetryable: boolean;
        retryAfterMs?: number;
    };
    
    // For end users: friendly, actionable
    user: {
        title: string;          // "Payment Declined"
        message: string;        // "Your card was declined. Please try a different payment method."
        actions: string[];      // ["Try another card", "Contact your bank"]
    };
    
    // For operators: context, impact, runbook
    operator: {
        severity: 'low' | 'medium' | 'high' | 'critical';
        affectedUsers: number;
        affectedSystems: string[];
        runbookUrl?: string;
    };
    
    // For developers: full technical details
    developer: {
        message: string;
        stackTrace: string;
        context: Record<string, unknown>;
        cause?: Error;
        timestamp: Date;
        requestId: string;
    };
}
 
// Example: Payment failure error
const paymentError: MultiAudienceError = {
    machine: {
        code: "PAYMENT_INSUFFICIENT_FUNDS",
        category: "payment_error",
        isRetryable: false,  // User needs to change payment method
    },
    user: {
        title: "Payment Declined",
        message: "Your card was declined due to insufficient funds. Please try a different payment method or contact your bank.",
        actions: ["Use a different card", "Try PayPal", "Contact support"]
    },
    operator: {
        severity: "low",  // Expected business error, not system problem
        affectedUsers: 1,
        affectedSystems: ["payment-service"],
    },
    developer: {
        message: "Payment processor returned decline code: insufficient_funds",
        stackTrace: "...",
        context: {
            userId: "user_123",
            orderId: "order_456",
            amount: 150.00,
            currency: "USD",
            paymentProcessor: "stripe",
            declineCode: "insufficient_funds"
        },
        timestamp: new Date(),
        requestId: "req_abc123"
    }
};

Security Consideration
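
Per the audience table above, end users should not see stack traces or internal system details. One way to enforce that is to build the client response from an explicit allowlist of fields rather than serializing the whole error. The sketch below uses a trimmed-down, hypothetical `AudienceError` shape and a hypothetical `toClientBody` helper; it is an illustration under those assumptions, not a complete solution.

```typescript
// A reduced version of the multi-audience shape, redefined here so the
// example is self-contained. All names are illustrative.
interface AudienceError {
    machine: { code: string; isRetryable: boolean; retryAfterMs?: number };
    user: { title: string; message: string; actions: string[] };
    developer: { message: string; stackTrace: string; context: Record<string, unknown> };
}

interface ClientErrorBody {
    code: string;
    title: string;
    message: string;
    actions: string[];
    isRetryable: boolean;
    retryAfterMs?: number;
}

// Copies only allowlisted fields; nothing under `developer` is ever read.
function toClientBody(error: AudienceError): ClientErrorBody {
    const body: ClientErrorBody = {
        code: error.machine.code,
        title: error.user.title,
        message: error.user.message,
        actions: [...error.user.actions],
        isRetryable: error.machine.isRetryable,
    };
    if (error.machine.retryAfterMs !== undefined) {
        body.retryAfterMs = error.machine.retryAfterMs;
    }
    return body;
}
```

Building the body field by field means a property added to the developer view later cannot leak by accident, which would not hold if the function deleted known-sensitive fields from a copy instead.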

Summary: Understanding Errors

We've established a comprehensive foundation for understanding errors in software systems. This understanding is essential before we can discuss how to handle errors effectively.

Key Takeaways:

Core Concepts to Remember

•An error is a contract violation — Errors occur when inputs, outputs, or invariants don't meet expectations.
•Errors have structure — Well-designed errors include identity, message, context, source, cause, severity, and recovery hints.
•Classification guides handling — Errors can be classified by recoverability, predictability, blame, and temporal nature.
•Understand the causal chain — Faults cause errors, errors cause failures. Break the chain early.
•Errors are valid outcomes — Treat expected failures as first-class results, not exceptional interruptions.
•Different audiences, different information — Design errors to serve users, developers, operators, and machines appropriately.
