Every logging framework provides severity levels: DEBUG, INFO, WARN, ERROR, and others. These levels seem self-explanatory—until you observe how they're used in real production systems. Logs flooded with ERROR entries that aren't actually errors. Critical warnings buried under thousands of irrelevant INFO messages. DEBUG logging enabled in production, generating terabytes of noise daily.
Log levels are not about categorization—they're about communication. Each level is a promise to the operations team about what the log represents and how urgently they need to respond. Misusing levels breaks this contract, leading to alert fatigue, missed incidents, and wasted debugging time.
This page establishes rigorous criteria for log level selection. By the end, you'll understand not just what each level means, but why proper leveling is essential for operational excellence.
By the end of this page, you will master log level semantics across different frameworks, understand the decision framework for choosing levels, recognize and avoid common leveling mistakes, and implement organizational standards that maintain signal quality at scale.
Log levels form a severity hierarchy. Higher severity logs are always emitted; lower severity logs are filtered based on configuration. Understanding this hierarchy is foundational to proper log level usage.
The Standard Hierarchy (from most to least severe):
| Level | Severity | Typical Configuration | Description |
|---|---|---|---|
| FATAL/CRITICAL | Highest | Always enabled | Application cannot continue; immediate human intervention required |
| ERROR | High | Always enabled | Operation failed; likely requires investigation |
| WARN | Medium | Always enabled | Something unexpected; may indicate impending failure |
| INFO | Low | Production enabled | Normal operation milestones; useful for understanding flow |
| DEBUG | Lower | Dev/staging only | Detailed diagnostic information for developers |
| TRACE | Lowest | Rarely enabled | Extremely detailed execution paths; usually off |
The Filtering Principle
When you configure a log level threshold (e.g., LOG_LEVEL=INFO), logs at that level and above are emitted:
- LOG_LEVEL=ERROR: only ERROR and FATAL logs
- LOG_LEVEL=WARN: WARN, ERROR, and FATAL
- LOG_LEVEL=INFO: INFO, WARN, ERROR, and FATAL
- LOG_LEVEL=DEBUG: DEBUG, INFO, WARN, ERROR, and FATAL
- LOG_LEVEL=TRACE: everything

This filtering mechanism is why level selection matters. If you log routine operations as ERROR, you can't filter them out without losing real errors. If you log critical issues as INFO, they'll be lost in the noise.
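The threshold behavior can be seen directly with Python's standard `logging` module; a minimal sketch:

```python
import logging

# Configure the threshold: WARNING and above are emitted
logging.basicConfig(level=logging.WARNING, format="%(levelname)s %(message)s")
log = logging.getLogger("demo")

log.debug("cache probe result")        # filtered out: below WARNING
log.info("order 42 completed")         # filtered out: below WARNING
log.warning("connection pool at 85%")  # emitted
log.error("db query failed")           # emitted

print(log.isEnabledFor(logging.INFO))   # False under a WARNING threshold
print(log.isEnabledFor(logging.ERROR))  # True
```

Raising or lowering the `level` argument is the only change needed to widen or narrow what reaches your log sink.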
Different frameworks use slightly different terminology: CRITICAL vs FATAL, WARNING vs WARN, VERBOSE vs TRACE. The semantics are consistent; only the names differ. Some frameworks add custom levels (NOTICE in syslog), but the core hierarchy remains.
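For instance, Python's standard library stops at DEBUG and has no TRACE level, but one can be registered. A sketch; the numeric value 5 is a common community convention, not part of the stdlib:

```python
import logging

# DEBUG is 10 in the stdlib; register TRACE just below it.
# The value 5 is conventional, not standardized.
TRACE = 5
logging.addLevelName(TRACE, "TRACE")

logger = logging.getLogger("demo")
logger.setLevel(TRACE)
logger.log(TRACE, "entering parse loop")

print(logging.getLevelName(TRACE))  # TRACE
```

The hierarchy is unchanged: TRACE sits below DEBUG, so any threshold of DEBUG or higher filters it out.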
Definition: The application cannot continue and will terminate or is in an unrecoverable state. Human intervention is required immediately.
Mental Model: FATAL means "wake someone up at 3 AM." It represents complete service failure or corruption that affects all users.
Examples of legitimate FATAL logs:
```python
import structlog
import sys

logger = structlog.get_logger()

def initialize_database():
    try:
        connection = create_database_connection()
        verify_schema_version(connection)
        return connection
    except SchemaVersionMismatch as e:
        # This is FATAL: application cannot safely operate with wrong schema
        logger.critical(
            "database_schema_incompatible",
            expected_version=e.expected,
            actual_version=e.actual,
            remediation="Run database migrations before starting application"
        )
        sys.exit(1)  # FATAL logs typically precede shutdown
    except ConnectionError as e:
        # This might be FATAL at startup, but could be ERROR during runtime
        if is_during_startup():
            logger.critical(
                "database_connection_failed_at_startup",
                error=str(e),
                remediation="Verify database is running and credentials are correct"
            )
            sys.exit(1)
        else:
            # During runtime, retry logic might recover
            logger.error("database_connection_lost", error=str(e))
            raise
```

If your service logs FATAL more than a few times per year across your entire fleet, you're misusing the level. FATAL means service death. Healthy services don't die frequently. If they do, you have bigger problems than logging.
Definition: An operation failed and could not complete its intended purpose. The failure is likely unexpected and usually indicates a bug, environmental issue, or dependency failure.
Mental Model: ERROR means "something broke that shouldn't have." It represents failure worthy of investigation, even if the system continues operating overall.
The Key Question: Ask yourself: "If I saw this in a log, would I want to investigate?" If yes, ERROR is appropriate. If the answer is "this is expected sometimes," consider WARN or INFO.
```java
public class OrderService {
    private static final Logger logger = LoggerFactory.getLogger(OrderService.class);

    public Order createOrder(OrderRequest request) {
        try {
            // Validation failures are NOT errors - they're expected input issues
            ValidationResult validation = validator.validate(request);
            if (!validation.isValid()) {
                logger.info("order_validation_failed",  // INFO, not ERROR
                        kv("user_id", request.getUserId()),
                        kv("violations", validation.getViolations()));
                throw new ValidationException(validation);
            }

            // Database failure IS an error - unexpected infrastructure issue
            Order order = orderRepository.save(buildOrder(request));

            // Payment failure requires nuanced handling
            PaymentResult payment = paymentService.charge(order);
            if (!payment.isSuccess()) {
                if (payment.getReason() == PaymentFailureReason.INSUFFICIENT_FUNDS) {
                    // Expected business outcome - not an error
                    logger.info("payment_declined_insufficient_funds",
                            kv("order_id", order.getId()),
                            kv("amount", order.getTotal()));
                } else if (payment.getReason() == PaymentFailureReason.NETWORK_ERROR) {
                    // Infrastructure failure - IS an error
                    logger.error("payment_service_network_failure",
                            kv("order_id", order.getId()),
                            kv("retry_count", payment.getRetryCount()),
                            kv("last_error", payment.getErrorMessage()));
                }
                throw new PaymentFailedException(payment);
            }

            return order;
        } catch (DatabaseException e) {
            // Always log as ERROR, then re-throw or handle
            logger.error("order_creation_database_failure",
                    kv("user_id", request.getUserId()),
                    kv("error_type", e.getClass().getSimpleName()),
                    kv("error_message", e.getMessage()));
            throw new OrderCreationException("Database failure", e);
        }
    }
}
```

| Scenario | Level | Rationale |
|---|---|---|
| User submitted invalid form data | INFO | Expected input variation; validation working correctly |
| Database query returned empty result | INFO/DEBUG | Empty results are valid; not a failure |
| Database connection timed out | ERROR | Infrastructure failure; unexpected |
| Payment declined: insufficient funds | INFO | Expected business scenario; not an error |
| Payment declined: network unreachable | ERROR | Infrastructure failure; requires investigation |
| User authentication failed (wrong password) | INFO | Normal security behavior; audit trail |
| User authentication failed (corrupted token) | ERROR | Unexpected failure; possible security issue |
| Rate limit triggered | WARN | Expected protection mechanism; may need attention |
| Null pointer exception caught | ERROR | Programming bug; needs fix |
When teams log expected scenarios as ERROR, they create "error inflation." Dashboards show thousands of errors, making real problems invisible. If your service has a 1% error log rate and that's "normal," you've diluted the signal to uselessness. Reclassify expected failures as WARN or INFO.
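To see whether error inflation is happening, measure the ERROR share of your emitted logs. A minimal in-process sketch, for illustration only; real services derive this from their log pipeline's metrics, and the `LevelCounterHandler` name is invented here:

```python
import logging
from collections import Counter

class LevelCounterHandler(logging.Handler):
    """Count emitted records per level so the ERROR share can be inspected.

    Illustrative only: production systems compute this ratio from
    aggregated log metrics, not an in-process handler.
    """

    def __init__(self):
        super().__init__(level=logging.NOTSET)  # accept every level
        self.counts = Counter()

    def emit(self, record):
        self.counts[record.levelname] += 1

    def error_rate(self):
        total = sum(self.counts.values())
        return self.counts["ERROR"] / total if total else 0.0


logger = logging.getLogger("audit")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo's output out of the root logger
counter = LevelCounterHandler()
logger.addHandler(counter)

for _ in range(99):
    logger.info("routine event")
logger.error("real failure")

print(counter.error_rate())  # 0.01
```

A sustained rate like this during healthy operation is the signal to reclassify: the "errors" being counted are mostly expected outcomes.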
Definition: Something unexpected occurred, but the operation completed (possibly with degradation). The situation might become a problem if it persists or worsens.
Mental Model: WARN is a heads-up. It says "this isn't broken yet, but keep an eye on it." Warnings are for conditions that merit attention but don't require immediate action.
The WARN Criteria:
```go
package main

import (
	"context"
	"time"

	"go.uber.org/zap"
)

type CacheClient struct {
	logger *zap.Logger
	cache  Cache
	db     Database
}

func (c *CacheClient) GetUser(ctx context.Context, userID string) (*User, error) {
	start := time.Now()

	// Try cache first
	user, err := c.cache.Get(ctx, userID)
	if err != nil {
		// Cache miss or failure - this is WARN, not ERROR
		// The operation can still succeed via database
		c.logger.Warn("cache_lookup_failed",
			zap.String("user_id", userID),
			zap.Error(err),
			zap.String("fallback", "database"))

		// Fallback to database
		user, err = c.db.GetUser(ctx, userID)
		if err != nil {
			// NOW it's an error - both paths failed
			c.logger.Error("user_lookup_failed",
				zap.String("user_id", userID),
				zap.Error(err))
			return nil, err
		}
	}

	// Check if operation was slow
	duration := time.Since(start)
	if duration > 500*time.Millisecond {
		// Slow but successful - WARN
		c.logger.Warn("user_lookup_slow",
			zap.String("user_id", userID),
			zap.Duration("duration", duration),
			zap.Duration("threshold", 500*time.Millisecond))
	}

	return user, nil
}

func (c *CacheClient) monitorResources() {
	poolStats := c.db.PoolStats()

	// Resource warnings based on thresholds
	if poolStats.UsedPercent > 80 {
		c.logger.Warn("db_connection_pool_high",
			zap.Float64("used_percent", poolStats.UsedPercent),
			zap.Int("active_connections", poolStats.Active),
			zap.Int("max_connections", poolStats.Max))
	}

	if poolStats.WaitCount > 0 {
		c.logger.Warn("db_connection_pool_contention",
			zap.Int64("waiting_requests", poolStats.WaitCount),
			zap.Duration("wait_duration", poolStats.WaitDuration))
	}
}
```

Individual WARN logs shouldn't trigger alerts, but WARN aggregations should. One cache failure is a WARN; 1,000 cache failures in 5 minutes is an ERROR-level alert. Set up rate-based alerts on WARN patterns, not on individual occurrences.
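That aggregation idea can be sketched as a sliding-window counter. This is a hypothetical illustration: the `WarnRateMonitor` name and thresholds are invented for this example, and real deployments usually express the rule in the alerting layer (for example, a rate query over log counts) rather than in application code:

```python
import time
from collections import deque

class WarnRateMonitor:
    """Escalate when WARN events exceed a threshold within a sliding window."""

    def __init__(self, threshold=1000, window_seconds=300):
        self.threshold = threshold
        self.window = window_seconds
        self.events = deque()  # timestamps of recent WARN events

    def record_warn(self, now=None):
        """Record one WARN event; return True when the rate merits an alert."""
        now = time.monotonic() if now is None else now
        self.events.append(now)
        # Evict events older than the window
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) >= self.threshold


monitor = WarnRateMonitor(threshold=3, window_seconds=300)
print(monitor.record_warn(now=0))     # False: 1 event in window
print(monitor.record_warn(now=10))    # False: 2 events in window
print(monitor.record_warn(now=20))    # True: 3 events within 300s
print(monitor.record_warn(now=1000))  # False: older events aged out
```

The same pattern underlies most alerting rules: individual events are noise; their rate is the signal.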
Definition: Normal operational events that help understand system behavior. INFO logs answer the question: "What is the system doing right now?"
Mental Model: INFO is for the operations dashboard. It tracks significant state changes, successful completions of important operations, and key business events—all during normal, healthy operation.
The INFO Principle:
If your system is running perfectly and you look at INFO logs, you should see a clear narrative of what the system is accomplishing. Too many INFO logs obscure this narrative; too few leave operators blind.
| Category | Examples | Why INFO |
|---|---|---|
| Lifecycle Events | Service started, shutdown initiated, configuration reloaded | Fundamental visibility into service state |
| Important Transactions | Order completed, user registered, payment processed | Business visibility; audit requirements |
| State Transitions | Circuit breaker opened/closed, leader election complete | Understanding system behavior over time |
| Periodic Health | Health check passed, scheduled job completed | Operational heartbeat |
| Configuration Applied | Feature flag evaluated, A/B experiment assigned | Debugging behavior differences |
| Security Events (normal) | User logged in, password changed, MFA configured | Audit trail for compliance |
```typescript
import { logger } from './logger';

class OrderProcessor {
  async processOrder(order: Order): Promise<ProcessedOrder> {
    const startTime = Date.now();

    // ✅ Good INFO: Significant business event starting
    logger.info({
      event: 'order_processing_started',
      order_id: order.id,
      customer_id: order.customerId,
      items_count: order.items.length,
      total_cents: order.totalCents,
    });

    // ❌ Bad INFO: Too granular, should be DEBUG
    // logger.info({ event: 'validating_order_items' });
    // logger.info({ event: 'checking_inventory_item_1' });
    // logger.info({ event: 'checking_inventory_item_2' });

    await this.validateOrder(order);
    await this.reserveInventory(order);
    const payment = await this.processPayment(order);

    // ✅ Good INFO: Significant outcome
    logger.info({
      event: 'order_processing_completed',
      order_id: order.id,
      payment_id: payment.id,
      processing_time_ms: Date.now() - startTime,
    });

    return { order, payment };
  }

  // ✅ Good INFO: State transition
  onCircuitBreakerStateChange(service: string, oldState: string, newState: string): void {
    logger.info({
      event: 'circuit_breaker_state_changed',
      service,
      old_state: oldState,
      new_state: newState,
      reason: 'Failure threshold exceeded',
    });
  }

  // ✅ Good INFO: Lifecycle event
  async onApplicationStart(): Promise<void> {
    logger.info({
      event: 'application_started',
      version: process.env.APP_VERSION,
      environment: process.env.NODE_ENV,
      port: process.env.PORT,
      node_version: process.version,
    });
  }
}
```

Common INFO misuses:

- INFO: Processing item 1 of 1000, logged for every item. This is DEBUG at best, TRACE in reality.
- INFO: user object is {large JSON}. This is DEBUG and creates storage bloat.
- INFO: Fetched configuration. Unless the configuration fetch was notable, this adds noise.

A useful heuristic: if you sampled 1% of your INFO logs, would you still understand what the system was doing? If critical context would be lost, you're under-logging INFO. If you'd still have the full picture, you might be over-logging.
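The 1%-sampling thought experiment can be made concrete with a logging filter. A sketch in Python's stdlib; the `InfoSamplingFilter` name and rate are illustrative, not a library feature:

```python
import logging
import random

class InfoSamplingFilter(logging.Filter):
    """Pass every WARNING-and-above record; keep only a sampled
    fraction of INFO and below. Illustrative sketch."""

    def __init__(self, rate=0.01):
        super().__init__()
        self.rate = rate

    def filter(self, record):
        if record.levelno >= logging.WARNING:
            return True  # never drop warnings or errors
        return random.random() < self.rate


handler = logging.StreamHandler()
handler.addFilter(InfoSamplingFilter(rate=0.01))
logger = logging.getLogger("sampled")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

If a service's narrative survives this filter, its INFO volume is probably too high; if the narrative falls apart, each INFO log is pulling its weight.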
DEBUG Definition: Detailed information useful for diagnosing problems. Answers: "What exact steps did the code take?"
TRACE Definition: Extremely verbose information for deep debugging. Answers: "What was the exact state at every moment?"
The Production Reality:
DEBUG and TRACE logs are almost always disabled in production. They exist for development, staging, and targeted production debugging (enabling temporarily for specific requests or instances).
| Level | Use When | Content Examples |
|---|---|---|
| DEBUG | Developer needs to understand code path | Function entry/exit, decision branches taken, intermediate results |
| DEBUG | Troubleshooting specific failure | Parameter values, cache hit/miss, query details |
| DEBUG | Testing integration behavior | Request/response payloads, transformed data |
| TRACE | Debugging framework internals | Low-level library operations, serialization steps |
| TRACE | Performance micro-analysis | Every loop iteration, every method call |
| TRACE | Reproducing heisenbug | Complete state capture at micro-intervals |
```python
import logging
import structlog

logger = structlog.get_logger()

def calculate_shipping_cost(order):
    logger.debug("shipping_calculation_started",
        order_id=order.id,
        destination_country=order.shipping_address.country,
        total_weight_kg=order.total_weight)

    # TRACE: Would show each step of the calculation
    # logger.trace("fetching_base_rates", carrier="fedex")

    base_rate = get_base_rate(order.shipping_address)
    logger.debug("base_rate_fetched",
        rate_cents=base_rate,
        zone=base_rate.zone,
        carrier=base_rate.carrier)

    # TRACE: Would show each discount rule evaluated
    # for rule in discount_rules:
    #     logger.trace("evaluating_discount_rule", rule_id=rule.id, applies=rule.applies(order))

    discounts = calculate_discounts(order, base_rate)
    logger.debug("discounts_applied",
        original_rate_cents=base_rate,
        discount_amount=sum(d.amount for d in discounts),
        discount_count=len(discounts))

    final_rate = apply_discounts(base_rate, discounts)
    logger.debug("shipping_calculation_complete",
        order_id=order.id,
        final_rate_cents=final_rate,
        calculation_time_ms=elapsed_time())

    return final_rate


# Conditional expensive logging
def process_complex_request(request):
    # Only compute expensive debug info if DEBUG is enabled
    if logger.isEnabledFor(logging.DEBUG):
        # Expensive serialization only happens if log will be emitted
        logger.debug("request_details",
            full_request=request.to_detailed_dict(),
            computed_hash=expensive_hash_computation(request))
```

Two performance considerations:

- Guard expensive log construction with logger.isDebugEnabled()-style checks: JSON serialization, stack trace generation, and reflection add latency even when the logs are discarded.
- Prefer lazy evaluation where the framework supports it, e.g. log.debug('state', () => computeState()); the supplier is only invoked if the level is enabled.

Enabling DEBUG in production should be rare and targeted. Turn it on for specific instances or requests using dynamic log level controls (covered in advanced topics). Blanket DEBUG in production can generate terabytes of logs per day, overwhelming storage and creating performance issues.
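The cost argument is worth making concrete. In Python's stdlib, %-style arguments defer message formatting but not argument evaluation, so an explicit isEnabledFor guard is still needed for expensive values. A minimal sketch:

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("demo")

calls = {"expensive": 0}

def expensive_summary(payload):
    # Stands in for heavy serialization or reflection
    calls["expensive"] += 1
    return sorted(payload)

payload = {"b": 2, "a": 1}

# Deferred %-formatting: the message is only rendered if DEBUG is
# enabled, but the arguments themselves are evaluated eagerly.
logger.debug("payload received: %s", payload)

# Guarded logging: the expensive computation is skipped entirely
# when the DEBUG level is disabled.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("payload details: %s", expensive_summary(payload))

print(calls["expensive"])  # 0 because the guard prevented the call
```

With the WARNING threshold active, the guard short-circuits and the expensive function never runs.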
Log level selection seems simple but is consistently misapplied in ways that undermine observability. Here are the most frequent mistakes and their corrections:
- ERROR: User not found for ID xyz — if this is a valid lookup scenario, it's not an error.
- ERROR: Payment declined — expected business outcome; the system worked correctly.
- ERROR: Rate limit exceeded — this is the rate limiter working as designed.
- WARN: Using default configuration — if defaults are valid and common, this is INFO at most.
- WARN: Cache miss, fetching from database — cache misses are normal; the system is designed for this.
- WARN: Request took 200ms — if 200 ms is within SLO, this isn't a warning.
- INFO: Starting to process user 12345, logged for every request — this is DEBUG content.

When in doubt, walk through these questions:

| Question | Yes → Level | No → Consider |
|---|---|---|
| Is the application about to crash/shutdown? | FATAL | Lower level |
| Did an operation fail unexpectedly? | ERROR | WARN or INFO |
| Did something concerning happen that worked out? | WARN | INFO or DEBUG |
| Is this a significant business/operational event? | INFO | DEBUG |
| Is this only useful when actively debugging? | DEBUG | TRACE or don't log |
| Is this micro-level execution detail? | TRACE | Don't log at all |
For ERROR: Would I create a PagerDuty alert for a spike in these logs? If not, it's probably not ERROR. For WARN: Would I want a non-urgent notification if these doubled? If not, it's probably INFO. For INFO: Would an operator glancing at logs find this useful? If not, it's DEBUG.
Log level consistency requires organizational standards. Individual developers have different intuitions about severity; without shared guidelines, log level meaning varies across services, making dashboards and alerts unreliable.
Establishing a Logging Standard:
```markdown
# Acme Corp Logging Standards v2.1

## Log Levels

### ERROR
Use ERROR when an operation **failed** and **could not complete its
intended purpose** due to an **unexpected** condition.

**ERROR Examples:**
- Database query threw exception
- Required external service unreachable after retries
- Data corruption detected
- Unhandled exception in request processing

**NOT ERROR:**
- User provided invalid input (→ INFO)
- Payment declined by bank (→ INFO)
- Rate limit triggered (→ WARN)
- Optional feature unavailable (→ WARN)

### WARN
Use WARN when something **concerning happened** but the operation
**completed** (possibly degraded).

**WARN Examples:**
- Circuit breaker opened
- Fallback to secondary data source
- Request latency exceeded 2x p99
- Resource pool utilization > 80%

## Common Scenario Guidance

| Scenario | Level | Rationale |
|----------|-------|-----------|
| HTTP 4xx response to client | INFO | Client error, server working correctly |
| HTTP 5xx response to client | ERROR | Server failure |
| Retry succeeded | WARN | Issue occurred; worth monitoring |
| All retries exhausted | ERROR | Operation failed |
```

Logging standards should evolve based on operational experience. When an incident reveals inadequate logging or when alert fatigue becomes problematic, update the standards. The guide should be the authoritative reference that teams actually consult.
Log levels are not mere categorization—they're a contract between developers and operators about the significance and urgency of system events.
Key Takeaways:

- Log levels form a severity hierarchy; the configured threshold determines which levels are emitted.
- ERROR is for unexpected failures worth investigating; expected business outcomes (declined payments, invalid input) belong at INFO.
- WARN flags degraded-but-successful operations; alert on WARN rates, not individual occurrences.
- INFO should tell a clear narrative of healthy operation; DEBUG and TRACE stay off in production except for targeted debugging.
- Consistent leveling requires a written organizational standard, not individual intuition.
What's next:
Now that you understand what to log and at what level, the next page covers where logs go at scale. Log aggregation systems (ELK Stack, OpenSearch, Loki) collect logs from thousands of services into a searchable, queryable store. We'll explore architectures, trade-offs, and operational considerations for production-grade log aggregation.
You now understand log level semantics and have a framework for choosing appropriate levels. You can identify and correct common leveling mistakes and contribute to organizational logging standards. Next, we'll explore how logs are collected, stored, and queried at scale.