Every logging framework provides severity levels: DEBUG, INFO, WARN, ERROR, and others. These levels seem self-explanatory—until you observe how they're used in real production systems. Logs flooded with ERROR entries that aren't actually errors. Critical warnings buried under thousands of irrelevant INFO messages. DEBUG logging enabled in production, generating terabytes of noise daily.
Log levels are not about categorization—they're about communication. Each level is a promise to the operations team about what the log represents and how urgently they need to respond. Misusing levels breaks this contract, leading to alert fatigue, missed incidents, and wasted debugging time.
This page establishes rigorous criteria for log level selection. By the end, you'll understand not just what each level means, but why proper leveling is essential for operational excellence.
By the end of this page, you will master log level semantics across different frameworks, understand the decision framework for choosing levels, recognize and avoid common leveling mistakes, and implement organizational standards that maintain signal quality at scale.
Log levels form a severity hierarchy. Higher severity logs are always emitted; lower severity logs are filtered based on configuration. Understanding this hierarchy is foundational to proper log level usage.
The Standard Hierarchy (from most to least severe):
| Level | Severity | Typical Configuration | Description |
|---|---|---|---|
| FATAL/CRITICAL | Highest | Always enabled | Application cannot continue; immediate human intervention required |
| ERROR | High | Always enabled | Operation failed; likely requires investigation |
| WARN | Medium | Always enabled | Something unexpected; may indicate impending failure |
| INFO | Low | Production enabled | Normal operation milestones; useful for understanding flow |
| DEBUG | Lower | Dev/staging only | Detailed diagnostic information for developers |
| TRACE | Lowest | Rarely enabled | Extremely detailed execution paths; usually off |
The Filtering Principle
When you configure a log level threshold (e.g., LOG_LEVEL=INFO), logs at that level and above are emitted:
- LOG_LEVEL=ERROR: only ERROR and FATAL logs
- LOG_LEVEL=WARN: WARN, ERROR, and FATAL
- LOG_LEVEL=INFO: INFO, WARN, ERROR, and FATAL
- LOG_LEVEL=DEBUG: DEBUG, INFO, WARN, ERROR, and FATAL
- LOG_LEVEL=TRACE: everything

This filtering mechanism is why level selection matters. If you log routine operations as ERROR, you can't filter them out without losing real errors. If you log critical issues as INFO, they'll be lost in the noise.
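The threshold behavior can be seen directly with Python's standard `logging` module; a minimal sketch:

```python
import logging

# Configure the threshold: WARNING and above are emitted
logging.basicConfig(level=logging.WARNING, format="%(levelname)s %(message)s")
log = logging.getLogger("demo")

log.debug("cache probe result")        # filtered out: below WARNING
log.info("order 42 completed")         # filtered out: below WARNING
log.warning("connection pool at 85%")  # emitted
log.error("db query failed")           # emitted

print(log.isEnabledFor(logging.INFO))   # False under a WARNING threshold
print(log.isEnabledFor(logging.ERROR))  # True
```

Raising or lowering the `level` argument is the only change needed to widen or narrow what reaches your log sink.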
Different frameworks use slightly different terminology: CRITICAL vs FATAL, WARNING vs WARN, VERBOSE vs TRACE. The semantics are consistent; only the names differ. Some frameworks add custom levels (NOTICE in syslog), but the core hierarchy remains.
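For instance, Python's standard library stops at DEBUG and has no TRACE level, but one can be registered. A sketch; the numeric value 5 is a common community convention, not part of the stdlib:

```python
import logging

# DEBUG is 10 in the stdlib; register TRACE just below it.
# The value 5 is conventional, not standardized.
TRACE = 5
logging.addLevelName(TRACE, "TRACE")

logger = logging.getLogger("demo")
logger.setLevel(TRACE)
logger.log(TRACE, "entering parse loop")

print(logging.getLevelName(TRACE))  # TRACE
```

The hierarchy is unchanged: TRACE sits below DEBUG, so any threshold of DEBUG or higher filters it out.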
Definition: The application cannot continue and will terminate or is in an unrecoverable state. Human intervention is required immediately.
Mental Model: FATAL means "wake someone up at 3 AM." It represents complete service failure or corruption that affects all users.
Examples of legitimate FATAL logs:
```python
import structlog
import sys

logger = structlog.get_logger()

def initialize_database():
    try:
        connection = create_database_connection()
        verify_schema_version(connection)
        return connection
    except SchemaVersionMismatch as e:
        # This is FATAL: application cannot safely operate with wrong schema
        logger.critical(
            "database_schema_incompatible",
            expected_version=e.expected,
            actual_version=e.actual,
            remediation="Run database migrations before starting application"
        )
        sys.exit(1)  # FATAL logs typically precede shutdown
    except ConnectionError as e:
        # This might be FATAL at startup, but could be ERROR during runtime
        if is_during_startup():
            logger.critical(
                "database_connection_failed_at_startup",
                error=str(e),
                remediation="Verify database is running and credentials are correct"
            )
            sys.exit(1)
        else:
            # During runtime, retry logic might recover
            logger.error("database_connection_lost", error=str(e))
            raise
```

If your service logs FATAL more than a few times per year across your entire fleet, you're misusing the level. FATAL means service death. Healthy services don't die frequently. If they do, you have bigger problems than logging.
Definition: An operation failed and could not complete its intended purpose. The failure is likely unexpected and usually indicates a bug, environmental issue, or dependency failure.
Mental Model: ERROR means "something broke that shouldn't have." It represents failure worthy of investigation, even if the system continues operating overall.
The Key Question: Ask yourself: "If I saw this in a log, would I want to investigate?" If yes, ERROR is appropriate. If the answer is "this is expected sometimes," consider WARN or INFO.
```java
public class OrderService {
    private static final Logger logger = LoggerFactory.getLogger(OrderService.class);

    public Order createOrder(OrderRequest request) {
        try {
            // Validation failures are NOT errors - they're expected input issues
            ValidationResult validation = validator.validate(request);
            if (!validation.isValid()) {
                logger.info("order_validation_failed",  // INFO, not ERROR
                        kv("user_id", request.getUserId()),
                        kv("violations", validation.getViolations()));
                throw new ValidationException(validation);
            }

            // Database failure IS an error - unexpected infrastructure issue
            Order order = orderRepository.save(buildOrder(request));

            // Payment failure requires nuanced handling
            PaymentResult payment = paymentService.charge(order);
            if (!payment.isSuccess()) {
                if (payment.getReason() == PaymentFailureReason.INSUFFICIENT_FUNDS) {
                    // Expected business outcome - not an error
                    logger.info("payment_declined_insufficient_funds",
                            kv("order_id", order.getId()),
                            kv("amount", order.getTotal()));
                } else if (payment.getReason() == PaymentFailureReason.NETWORK_ERROR) {
                    // Infrastructure failure - IS an error
                    logger.error("payment_service_network_failure",
                            kv("order_id", order.getId()),
                            kv("retry_count", payment.getRetryCount()),
                            kv("last_error", payment.getErrorMessage()));
                }
                throw new PaymentFailedException(payment);
            }

            return order;
        } catch (DatabaseException e) {
            // Always log as ERROR, then re-throw or handle
            logger.error("order_creation_database_failure",
                    kv("user_id", request.getUserId()),
                    kv("error_type", e.getClass().getSimpleName()),
                    kv("error_message", e.getMessage()));
            throw new OrderCreationException("Database failure", e);
        }
    }
}
```

| Scenario | Level | Rationale |
|---|---|---|
| User submitted invalid form data | INFO | Expected input variation; validation working correctly |
| Database query returned empty result | INFO/DEBUG | Empty results are valid; not a failure |
| Database connection timed out | ERROR | Infrastructure failure; unexpected |
| Payment declined: insufficient funds | INFO | Expected business scenario; not an error |
| Payment declined: network unreachable | ERROR | Infrastructure failure; requires investigation |
| User authentication failed (wrong password) | INFO | Normal security behavior; audit trail |
| User authentication failed (corrupted token) | ERROR | Unexpected failure; possible security issue |
| Rate limit triggered | WARN | Expected protection mechanism; may need attention |
| Null pointer exception caught | ERROR | Programming bug; needs fix |
When teams log expected scenarios as ERROR, they create "error inflation." Dashboards show thousands of errors, making real problems invisible. If your service has a 1% error log rate and that's "normal," you've diluted the signal to uselessness. Reclassify expected failures as WARN or INFO.
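To see whether error inflation is happening, measure the ERROR share of your emitted logs. A minimal in-process sketch, for illustration only; real services derive this from their log pipeline's metrics, and the `LevelCounterHandler` name is invented here:

```python
import logging
from collections import Counter

class LevelCounterHandler(logging.Handler):
    """Count emitted records per level so the ERROR share can be inspected.

    Illustrative only: production systems compute this ratio from
    aggregated log metrics, not an in-process handler.
    """

    def __init__(self):
        super().__init__(level=logging.NOTSET)  # accept every level
        self.counts = Counter()

    def emit(self, record):
        self.counts[record.levelname] += 1

    def error_rate(self):
        total = sum(self.counts.values())
        return self.counts["ERROR"] / total if total else 0.0


logger = logging.getLogger("audit")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo's output out of the root logger
counter = LevelCounterHandler()
logger.addHandler(counter)

for _ in range(99):
    logger.info("routine event")
logger.error("real failure")

print(counter.error_rate())  # 0.01
```

A sustained rate like this during healthy operation is the signal to reclassify: the "errors" being counted are mostly expected outcomes.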
Definition: Something unexpected occurred, but the operation completed (possibly with degradation). The situation might become a problem if it persists or worsens.
Mental Model: WARN is a heads-up. It says "this isn't broken yet, but keep an eye on it." Warnings are for conditions that merit attention but don't require immediate action.
The WARN Criteria:
```go
package main

import (
	"context"
	"time"

	"go.uber.org/zap"
)

type CacheClient struct {
	logger *zap.Logger
	cache  Cache
	db     Database
}

func (c *CacheClient) GetUser(ctx context.Context, userID string) (*User, error) {
	start := time.Now()

	// Try cache first
	user, err := c.cache.Get(ctx, userID)
	if err != nil {
		// Cache miss or failure - this is WARN, not ERROR
		// The operation can still succeed via database
		c.logger.Warn("cache_lookup_failed",
			zap.String("user_id", userID),
			zap.Error(err),
			zap.String("fallback", "database"))

		// Fallback to database
		user, err = c.db.GetUser(ctx, userID)
		if err != nil {
			// NOW it's an error - both paths failed
			c.logger.Error("user_lookup_failed",
				zap.String("user_id", userID),
				zap.Error(err))
			return nil, err
		}
	}

	// Check if operation was slow
	duration := time.Since(start)
	if duration > 500*time.Millisecond {
		// Slow but successful - WARN
		c.logger.Warn("user_lookup_slow",
			zap.String("user_id", userID),
			zap.Duration("duration", duration),
			zap.Duration("threshold", 500*time.Millisecond))
	}

	return user, nil
}

func (c *CacheClient) monitorResources() {
	poolStats := c.db.PoolStats()

	// Resource warnings based on thresholds
	if poolStats.UsedPercent > 80 {
		c.logger.Warn("db_connection_pool_high",
			zap.Float64("used_percent", poolStats.UsedPercent),
			zap.Int("active_connections", poolStats.Active),
			zap.Int("max_connections", poolStats.Max))
	}

	if poolStats.WaitCount > 0 {
		c.logger.Warn("db_connection_pool_contention",
			zap.Int64("waiting_requests", poolStats.WaitCount),
			zap.Duration("wait_duration", poolStats.WaitDuration))
	}
}
```

Individual WARN logs shouldn't trigger alerts, but WARN aggregations should. One cache failure is a WARN; 1,000 cache failures in 5 minutes is an ERROR-level alert. Set up rate-based alerts on WARN patterns, not on individual occurrences.
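That aggregation idea can be sketched as a sliding-window counter. This is a hypothetical illustration: the `WarnRateMonitor` name and thresholds are invented for this example, and real deployments usually express the rule in the alerting layer (for example, a rate query over log counts) rather than in application code:

```python
import time
from collections import deque

class WarnRateMonitor:
    """Escalate when WARN events exceed a threshold within a sliding window."""

    def __init__(self, threshold=1000, window_seconds=300):
        self.threshold = threshold
        self.window = window_seconds
        self.events = deque()  # timestamps of recent WARN events

    def record_warn(self, now=None):
        """Record one WARN event; return True when the rate merits an alert."""
        now = time.monotonic() if now is None else now
        self.events.append(now)
        # Evict events older than the window
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold


monitor = WarnRateMonitor(threshold=3, window_seconds=300)
print(monitor.record_warn(now=0))     # False: 1 event in window
print(monitor.record_warn(now=10))    # False: 2 events in window
print(monitor.record_warn(now=20))    # True: 3 events within 300s
print(monitor.record_warn(now=1000))  # False: older events aged out
```

The same pattern underlies most alerting rules: individual events are noise; their rate is the signal.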
Definition: Normal operational events that help understand system behavior. INFO logs answer the question: "What is the system doing right now?"
Mental Model: INFO is for the operations dashboard. It tracks significant state changes, successful completions of important operations, and key business events—all during normal, healthy operation.
The INFO Principle:
If your system is running perfectly and you look at INFO logs, you should see a clear narrative of what the system is accomplishing. Too many INFO logs obscure this narrative; too few leave operators blind.
| Category | Examples | Why INFO |
|---|---|---|
| Lifecycle Events | Service started, shutdown initiated, configuration reloaded | Fundamental visibility into service state |
| Important Transactions | Order completed, user registered, payment processed | Business visibility; audit requirements |
| State Transitions | Circuit breaker opened/closed, leader election complete | Understanding system behavior over time |
| Periodic Health | Health check passed, scheduled job completed | Operational heartbeat |
| Configuration Applied | Feature flag evaluated, A/B experiment assigned | Debugging behavior differences |
| Security Events (normal) | User logged in, password changed, MFA configured | Audit trail for compliance |
```typescript
import { logger } from './logger';

class OrderProcessor {
  async processOrder(order: Order): Promise<ProcessedOrder> {
    const startTime = Date.now();

    // ✅ Good INFO: Significant business event starting
    logger.info({
      event: 'order_processing_started',
      order_id: order.id,
      customer_id: order.customerId,
      items_count: order.items.length,
      total_cents: order.totalCents,
    });

    // ❌ Bad INFO: Too granular, should be DEBUG
    // logger.info({ event: 'validating_order_items' });
    // logger.info({ event: 'checking_inventory_item_1' });
    // logger.info({ event: 'checking_inventory_item_2' });

    await this.validateOrder(order);
    await this.reserveInventory(order);
    const payment = await this.processPayment(order);

    // ✅ Good INFO: Significant outcome
    logger.info({
      event: 'order_processing_completed',
      order_id: order.id,
      payment_id: payment.id,
      processing_time_ms: Date.now() - startTime,
    });

    return { order, payment };
  }

  // ✅ Good INFO: State transition
  onCircuitBreakerStateChange(service: string, oldState: string, newState: string): void {
    logger.info({
      event: 'circuit_breaker_state_changed',
      service,
      old_state: oldState,
      new_state: newState,
      reason: 'Failure threshold exceeded',
    });
  }

  // ✅ Good INFO: Lifecycle event
  async onApplicationStart(): Promise<void> {
    logger.info({
      event: 'application_started',
      version: process.env.APP_VERSION,
      environment: process.env.NODE_ENV,
      port: process.env.PORT,
      node_version: process.version,
    });
  }
}
```

Common INFO misuses:

- INFO: Processing item 1 of 1000, logged for every item. This is DEBUG at best, TRACE in reality.
- INFO: user object is {large JSON}. This is DEBUG and creates storage bloat.
- INFO: Fetched configuration. Unless the configuration fetch was notable, this adds noise.

A useful heuristic: if you sampled 1% of your INFO logs, would you still understand what the system was doing? If critical context would be lost, you're under-logging INFO. If you'd still have the full picture, you might be over-logging.
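The 1%-sampling thought experiment can be made concrete with a logging filter. A sketch in Python's stdlib; the `InfoSamplingFilter` name and rate are illustrative, not a library feature:

```python
import logging
import random

class InfoSamplingFilter(logging.Filter):
    """Pass every WARNING-and-above record; keep only a sampled
    fraction of INFO and below. Illustrative sketch."""

    def __init__(self, rate=0.01):
        super().__init__()
        self.rate = rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        return random.random() < self.rate


handler = logging.StreamHandler()
handler.addFilter(InfoSamplingFilter(rate=0.01))
logger = logging.getLogger("sampled")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

If a service's narrative survives this filter, its INFO volume is probably too high; if the narrative falls apart, each INFO log is pulling its weight.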
DEBUG Definition: Detailed information useful for diagnosing problems. Answers: "What exact steps did the code take?"
TRACE Definition: Extremely verbose information for deep debugging. Answers: "What was the exact state at every moment?"
The Production Reality:
DEBUG and TRACE logs are almost always disabled in production. They exist for development, staging, and targeted production debugging (enabling temporarily for specific requests or instances).
| Level | Use When | Content Examples |
|---|---|---|
| DEBUG | Developer needs to understand code path | Function entry/exit, decision branches taken, intermediate results |
| DEBUG | Troubleshooting specific failure | Parameter values, cache hit/miss, query details |
| DEBUG | Testing integration behavior | Request/response payloads, transformed data |
| TRACE | Debugging framework internals | Low-level library operations, serialization steps |
| TRACE | Performance micro-analysis | Every loop iteration, every method call |
| TRACE | Reproducing heisenbug | Complete state capture at micro-intervals |
```python
import logging
import structlog

logger = structlog.get_logger()

def calculate_shipping_cost(order):
    logger.debug("shipping_calculation_started",
        order_id=order.id,
        destination_country=order.shipping_address.country,
        total_weight_kg=order.total_weight)

    # TRACE: Would show each step of the calculation
    # logger.trace("fetching_base_rates", carrier="fedex")

    base_rate = get_base_rate(order.shipping_address)
    logger.debug("base_rate_fetched",
        rate_cents=base_rate,
        zone=base_rate.zone,
        carrier=base_rate.carrier)

    # TRACE: Would show each discount rule evaluated
    # for rule in discount_rules:
    #     logger.trace("evaluating_discount_rule", rule_id=rule.id, applies=rule.applies(order))

    discounts = calculate_discounts(order, base_rate)
    logger.debug("discounts_applied",
        original_rate_cents=base_rate,
        discount_amount=sum(d.amount for d in discounts),
        discount_count=len(discounts))

    final_rate = apply_discounts(base_rate, discounts)
    logger.debug("shipping_calculation_complete",
        order_id=order.id,
        final_rate_cents=final_rate,
        calculation_time_ms=elapsed_time())

    return final_rate


# Conditional expensive logging
def process_complex_request(request):
    # Only compute expensive debug info if DEBUG is enabled
    if logger.isEnabledFor(logging.DEBUG):
        # Expensive serialization only happens if log will be emitted
        logger.debug("request_details",
            full_request=request.to_detailed_dict(),
            computed_hash=expensive_hash_computation(request))
```

Two performance considerations:

- Guard expensive log construction with logger.isDebugEnabled()-style checks: JSON serialization, stack trace generation, and reflection add latency even when the logs are discarded.
- Prefer lazy evaluation where the framework supports it, e.g. log.debug('state', () => computeState()); the supplier is only invoked if the level is enabled.

Enabling DEBUG in production should be rare and targeted. Turn it on for specific instances or requests using dynamic log level controls (covered in advanced topics). Blanket DEBUG in production can generate terabytes of logs per day, overwhelming storage and creating performance issues.
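The cost argument is worth making concrete. In Python's stdlib, %-style arguments defer message formatting but not argument evaluation, so an explicit isEnabledFor guard is still needed for expensive values. A minimal sketch:

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("demo")

calls = {"expensive": 0}

def expensive_summary(payload):
    # Stands in for heavy serialization or reflection
    calls["expensive"] += 1
    return sorted(payload)

payload = {"b": 2, "a": 1}

# Deferred %-formatting: the message is only rendered if DEBUG is
# enabled, but the arguments themselves are evaluated eagerly.
logger.debug("payload received: %s", payload)

# Guarded logging: the expensive computation is skipped entirely
# when the DEBUG level is disabled.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("payload details: %s", expensive_summary(payload))

print(calls["expensive"])  # 0 because the guard prevented the call
```

With the WARNING threshold active, the guard short-circuits and the expensive function never runs.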
Log level selection seems simple but is consistently misapplied in ways that undermine observability. Here are the most frequent mistakes and their corrections:
- ERROR: User not found for ID xyz — if this is a valid lookup scenario, it's not an error.
- ERROR: Payment declined — expected business outcome; the system worked correctly.
- ERROR: Rate limit exceeded — this is the rate limiter working as designed.
- WARN: Using default configuration — if defaults are valid and common, this is INFO at most.
- WARN: Cache miss, fetching from database — cache misses are normal; the system is designed for this.
- WARN: Request took 200ms — if 200 ms is within SLO, this isn't a warning.
- INFO: Starting to process user 12345, logged for every request — this is DEBUG content.

When in doubt, walk through these questions:

| Question | Yes → Level | No → Consider |
|---|---|---|
| Is the application about to crash/shutdown? | FATAL | Lower level |
| Did an operation fail unexpectedly? | ERROR | WARN or INFO |
| Did something concerning happen that worked out? | WARN | INFO or DEBUG |
| Is this a significant business/operational event? | INFO | DEBUG |
| Is this only useful when actively debugging? | DEBUG | TRACE or don't log |
| Is this micro-level execution detail? | TRACE | Don't log at all |
For ERROR: Would I create a PagerDuty alert for a spike in these logs? If not, it's probably not ERROR. For WARN: Would I want a non-urgent notification if these doubled? If not, it's probably INFO. For INFO: Would an operator glancing at logs find this useful? If not, it's DEBUG.
Log level consistency requires organizational standards. Individual developers have different intuitions about severity; without shared guidelines, log level meaning varies across services, making dashboards and alerts unreliable.
Establishing a Logging Standard:
```markdown
# Acme Corp Logging Standards v2.1

## Log Levels

### ERROR
Use ERROR when an operation **failed** and **could not complete its
intended purpose** due to an **unexpected** condition.

**ERROR Examples:**
- Database query threw exception
- Required external service unreachable after retries
- Data corruption detected
- Unhandled exception in request processing

**NOT ERROR:**
- User provided invalid input (→ INFO)
- Payment declined by bank (→ INFO)
- Rate limit triggered (→ WARN)
- Optional feature unavailable (→ WARN)

### WARN
Use WARN when something **concerning happened** but the operation
**completed** (possibly degraded).

**WARN Examples:**
- Circuit breaker opened
- Fallback to secondary data source
- Request latency exceeded 2x p99
- Resource pool utilization > 80%

## Common Scenario Guidance

| Scenario | Level | Rationale |
|----------|-------|-----------|
| HTTP 4xx response to client | INFO | Client error, server working correctly |
| HTTP 5xx response to client | ERROR | Server failure |
| Retry succeeded | WARN | Issue occurred; worth monitoring |
| All retries exhausted | ERROR | Operation failed |
```

Logging standards should evolve based on operational experience. When an incident reveals inadequate logging or when alert fatigue becomes problematic, update the standards. The guide should be the authoritative reference that teams actually consult.
Log levels are not mere categorization—they're a contract between developers and operators about the significance and urgency of system events.
Key Takeaways:

- Log levels form a severity hierarchy; the configured threshold determines which levels are emitted.
- ERROR is for unexpected failures worth investigating; expected business outcomes (declined payments, invalid input) belong at INFO.
- WARN flags degraded-but-successful operations; alert on WARN rates, not individual occurrences.
- INFO should tell a clear narrative of healthy operation; DEBUG and TRACE stay off in production except for targeted debugging.
- Consistent leveling requires a written organizational standard, not individual intuition.
What's next:
Now that you understand what to log and at what level, the next page covers where logs go at scale. Log aggregation systems (ELK Stack, OpenSearch, Loki) collect logs from thousands of services into a searchable, queryable store. We'll explore architectures, trade-offs, and operational considerations for production-grade log aggregation.
You now understand log level semantics and have a framework for choosing appropriate levels. You can identify and correct common leveling mistakes and contribute to organizational logging standards. Next, we'll explore how logs are collected, stored, and queried at scale.