A single production web server generates approximately 50 GB of logs per day. A Kubernetes cluster with 100 nodes produces over 1 TB daily. A major cloud provider like AWS manages petabytes of log data every hour. In the modern enterprise, logs are no longer simple text files—they are a Big Data problem that requires sophisticated infrastructure to collect, transport, store, and query.
Consider the log management challenge at Netflix: their infrastructure generates over 500 billion log events per day, representing more than 1 petabyte of raw data. Finding a single authentication failure among those 500 billion events requires infrastructure as sophisticated as the systems generating the logs themselves.
This page explores the systems and practices that make log management possible—from collection agents to centralized storage, from real-time streaming to long-term archival. You'll learn how to design log infrastructure that scales with your systems while remaining accessible for security investigation.
By the end of this page, you will understand: (1) Log collection architectures and agent selection, (2) Log transportation protocols and reliability guarantees, (3) Centralized log storage systems and indexing strategies, (4) Log retention policies and compliance requirements, (5) Log aggregation and normalization techniques, and (6) Scalability patterns for enterprise log management.
Modern log collection has evolved from simple file tailing to sophisticated distributed systems. Understanding the architectural patterns helps you design appropriate solutions for your scale.
There are three primary patterns for log collection:
| Pattern | Description | Pros | Cons |
|---|---|---|---|
| Push (Agent-based) | Agents on each host forward logs | Near real-time, reliable delivery | Agent overhead, management complexity |
| Pull (Scraping) | Central system reads from hosts | No agent installation required | Higher latency, relies on host availability |
| Sidecar | Co-located collector per application | Application isolation, container-native | Resource overhead per pod/container |
The most common pattern uses lightweight agents running on each host. These agents read logs from files, journald, or direct streams and forward them to central collectors.
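To make the agent pattern concrete, here is a deliberately minimal Python sketch of what agents like Fluent Bit or Filebeat do at their core: tail a file from a persisted offset, forward any new lines, and only advance the offset after a successful send. The file paths and the `forward()` stand-in are hypothetical; real agents add parsing, buffering, backpressure handling, and multi-source support.

```python
import json
import os

STATE_FILE = "/var/tmp/tail-offset.state"   # hypothetical offset store
LOG_FILE = "/var/log/app/app.log"           # hypothetical log source


def read_offset() -> int:
    """Return the last byte offset we forwarded, or 0 on first run."""
    try:
        with open(STATE_FILE) as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0


def save_offset(offset: int) -> None:
    """Persist the offset so a restart neither re-sends nor skips lines."""
    with open(STATE_FILE, "w") as f:
        f.write(str(offset))


def forward(lines: list[str]) -> None:
    """Stand-in for shipping to a collector (HTTP, syslog, Kafka, ...)."""
    for line in lines:
        print(json.dumps({"message": line.rstrip("\n")}))


def collect_once() -> None:
    """One polling pass: read anything appended since the saved offset."""
    offset = read_offset()
    if os.path.getsize(LOG_FILE) < offset:
        offset = 0                      # file was rotated/truncated: start over
    with open(LOG_FILE, "r", errors="replace") as f:
        f.seek(offset)
        new_lines = f.readlines()
        new_offset = f.tell()
    if new_lines:
        forward(new_lines)
        save_offset(new_offset)         # only advance after a successful forward


if __name__ == "__main__":
    collect_once()
```

Persisting the read position is what the `DB` options in the Fluent Bit configuration below do for the same reason: without it, a restarted agent either replays old lines or silently loses new ones.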
The choice of agent significantly impacts resource usage, reliability, and capabilities:
| Agent | Memory | Features | Best For |
|---|---|---|---|
| Fluent Bit | ~1-5 MB | Lightweight, cloud-native, plugins | Containers, Kubernetes, edge |
| Fluentd | ~40-100 MB | Ruby plugins, flexible routing | Complex transformations, legacy systems |
| Filebeat | ~20-50 MB | ELK integration, modules | Elastic Stack environments |
| Vector | ~10-30 MB | Rust, observability pipelines | High-performance, unified observability |
| Promtail | ~10-30 MB | Loki-native, label extraction | Grafana Loki environments |
| rsyslog | ~5-15 MB | RFC-compliant, enterprise proven | Traditional syslog, compliance |
| syslog-ng | ~10-20 MB | Advanced parsing, routing | Complex syslog environments |
Fluent Bit is the leading lightweight agent for cloud-native environments:
```conf
# /etc/fluent-bit/fluent-bit.conf
# Production configuration for security log collection

[SERVICE]
    # Daemon mode
    Daemon            Off
    Flush             1
    Log_Level         info
    # Parser configuration
    Parsers_File      parsers.conf
    # Enable metrics endpoint for monitoring
    HTTP_Server       On
    HTTP_Listen       0.0.0.0
    HTTP_Port         2020
    # Buffer configuration for reliability
    storage.path      /var/log/fluent-bit-buffer/
    storage.sync      normal
    storage.checksum  off
    storage.backlog.mem_limit 50M

# ================================================
# INPUTS: What logs to collect
# ================================================

# System logs via journald
[INPUT]
    Name              systemd
    Tag               system.*
    Systemd_Filter    _SYSTEMD_UNIT=sshd.service
    Systemd_Filter    _SYSTEMD_UNIT=auditd.service
    Systemd_Filter    _SYSTEMD_UNIT=sudo.service
    Read_From_Tail    On
    DB                /var/log/fluent-bit-systemd.db

# Audit logs from auditd
[INPUT]
    Name              tail
    Tag               audit.*
    Path              /var/log/audit/audit.log
    Parser            audit
    DB                /var/log/fluent-bit-audit.db
    Mem_Buf_Limit     10MB
    Refresh_Interval  5

# Kernel messages
[INPUT]
    Name              kmsg
    Tag               kernel

# Application logs (JSON format)
[INPUT]
    Name              tail
    Tag               app.*
    Path              /var/log/app/*.json
    Parser            json
    DB                /var/log/fluent-bit-app.db
    Mem_Buf_Limit     10MB

# ================================================
# FILTERS: Enrich and transform
# ================================================

# Add hostname and timestamp
[FILTER]
    Name              record_modifier
    Match             *
    Record            hostname ${HOSTNAME}
    Record            environment production
    Record            cluster main-cluster

# Parse nested fields
[FILTER]
    Name              parser
    Match             app.*
    Key_Name          log
    Parser            json
    Reserve_Data      True

# Add Kubernetes metadata (if running in K8s)
[FILTER]
    Name              kubernetes
    Match             kube.*
    Kube_URL          https://kubernetes.default.svc:443
    Kube_CA_File      /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File   /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log         On

# ================================================
# OUTPUTS: Where to send logs
# ================================================

# Primary: Elasticsearch for security logs
[OUTPUT]
    Name              es
    Match             audit.* system.*
    Host              elasticsearch.logging.svc
    Port              9200
    Index             security-logs
    Type              _doc
    Logstash_Format   On
    Logstash_Prefix   security
    Time_Key          @timestamp
    Include_Tag_Key   On
    Retry_Limit       5
    # TLS configuration
    tls               On
    tls.verify        On
    tls.ca_file       /etc/fluent-bit/tls/ca.crt
    tls.crt_file      /etc/fluent-bit/tls/tls.crt
    tls.key_file      /etc/fluent-bit/tls/tls.key
    # Authentication
    HTTP_User         ${ES_USER}
    HTTP_Passwd       ${ES_PASSWORD}

# Secondary: Forward to remote syslog for compliance
[OUTPUT]
    Name              syslog
    Match             audit.*
    Host              syslog-archive.corp.internal
    Port              6514
    Mode              tls
    Syslog_Format     rfc5424
    Syslog_Hostname_key hostname
    Syslog_Message_key  message
    tls               On
    tls.verify        On
    tls.ca_file       /etc/fluent-bit/tls/syslog-ca.crt

# Local backup (filesystem buffer)
[OUTPUT]
    Name              file
    Match             *
    Path              /var/log/fluent-bit-backup/
    Format            out_file
```

Never configure log agents without persistent buffering. Network outages and collector restarts will occur. Without buffering, logs generated during these windows are lost forever. Even 10-50 MB of buffer can save hours of critical security logs during an outage.
Getting logs from thousands of sources to central storage reliably is a significant engineering challenge. The choice of transport protocol affects delivery guarantees, performance, and security.
| Protocol | Delivery Guarantee | Performance | Security | Use Case |
|---|---|---|---|---|
| UDP Syslog (RFC 5426) | None (fire-and-forget) | Very High | None by default | High-volume, loss-tolerant |
| TCP Syslog (RFC 6587) | Ordered delivery | High | TLS optional | Reliable syslog |
| RELP | At-least-once | Medium | TLS optional | Guaranteed syslog delivery |
| HTTP/HTTPS | At-least-once | Medium | TLS built-in | REST APIs, cloud services |
| Kafka | At-least-once/Exactly-once | Very High | TLS + SASL | High-scale streaming |
| gRPC | At-least-once | Very High | TLS built-in | Modern observability |
Understanding delivery semantics is critical for security logging. With at-most-once (fire-and-forget) transport, an event dropped during a network blip is gone forever; with at-least-once delivery, every event is eventually stored, but a lost acknowledgment can produce duplicates that must be deduplicated downstream; exactly-once semantics avoid both problems at the cost of throughput and complexity. For audit and security data, at-least-once is the minimum acceptable guarantee.
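The practical difference between these guarantees is easiest to see in code. The sketch below, with hypothetical paths and a stubbed network send, shows the heart of at-least-once delivery: persist the event before attempting to send it, and delete it only after an explicit acknowledgment. The cost is that a lost acknowledgment triggers a re-send, which is why at-least-once pipelines deduplicate downstream, typically on a unique event ID.

```python
import json
import os
import uuid

SPOOL_DIR = "/var/spool/logship"   # hypothetical on-disk spool


def spool(event: dict) -> str:
    """Write the event to disk *before* attempting delivery (at-least-once)."""
    os.makedirs(SPOOL_DIR, exist_ok=True)
    path = os.path.join(SPOOL_DIR, f"{uuid.uuid4()}.json")
    with open(path, "w") as f:
        json.dump(event, f)
    return path


def try_send(event: dict) -> bool:
    """Stand-in for the real network send; returns True only on an explicit ack."""
    return False   # pretend the collector is unreachable right now


def deliver(event: dict) -> None:
    path = spool(event)
    if try_send(event):
        os.remove(path)   # ack received: safe to forget the spooled copy
    # Otherwise the file stays in the spool and a retry loop re-sends it later.
    # If the ack (not the event) was lost, the retry produces a duplicate,
    # which downstream storage should deduplicate on a unique event ID.


deliver({"event_id": str(uuid.uuid4()), "event_type": "authentication"})
```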
Apache Kafka has become the standard for high-volume log transportation due to its durability, scalability, and decoupling of producers from consumers:
Kafka's key benefits for logging are durability (replicated, disk-backed topics keep buffering when downstream consumers are slow or offline), horizontal scalability through partitioned topics, and replay: because events are retained for a configurable window, a new SIEM or analytics consumer can be backfilled from the same stream without touching the agents. The script below creates topics tuned for security logs:
```bash
#!/bin/bash
# Create optimized Kafka topics for security logs

# Topic for security/audit logs
# - 30 day retention (720 hours)
# - 12 partitions for parallelism
# - Replication factor 3 for durability
kafka-topics.sh --create \
  --bootstrap-server kafka.internal:9092 \
  --topic security-logs \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=2592000000 \
  --config cleanup.policy=delete \
  --config min.insync.replicas=2 \
  --config compression.type=lz4 \
  --config segment.bytes=1073741824

# Topic for archival (longer retention, larger segments)
kafka-topics.sh --create \
  --bootstrap-server kafka.internal:9092 \
  --topic security-logs-archive \
  --partitions 6 \
  --replication-factor 3 \
  --config retention.ms=7776000000 \
  --config cleanup.policy=delete \
  --config segment.bytes=5368709120

# Verify topic configuration
kafka-topics.sh --describe \
  --bootstrap-server kafka.internal:9092 \
  --topic security-logs
```

Kafka must be secured for log transport. Enable TLS (SSL) for encryption in transit, SASL for authentication, and ACLs for authorization. Unsecured Kafka allows attackers to read sensitive logs or inject false events—completely undermining security monitoring.
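As a companion to the topic setup above, the following sketch shows a producer configured for reliable, authenticated delivery. It assumes the kafka-python client, a SASL/SCRAM-over-TLS listener on port 9093, and credentials supplied via environment variables; the mechanism, port, and certificate paths are placeholders to adapt to your cluster.

```python
import json
import os

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9093",        # TLS listener (assumed port)
    # Delivery guarantees: wait for all in-sync replicas, retry on failure
    acks="all",
    retries=5,
    linger_ms=50,                                   # small batching window
    compression_type="lz4",
    # Encryption and authentication (SASL/SCRAM over TLS)
    security_protocol="SASL_SSL",
    ssl_cafile="/etc/kafka/tls/ca.crt",
    sasl_mechanism="SCRAM-SHA-512",
    sasl_plain_username=os.environ["KAFKA_USER"],
    sasl_plain_password=os.environ["KAFKA_PASSWORD"],
    value_serializer=lambda e: json.dumps(e).encode("utf-8"),
)

event = {
    "event_type": "authentication",
    "event_outcome": "failure",
    "user": "jsmith",
    "source_ip": "192.168.1.50",
}
producer.send("security-logs", value=event)
producer.flush()   # block until the broker acknowledges outstanding records
```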
Centralized log storage must handle massive ingest rates, provide fast search capabilities, and scale economically. The choice of storage system depends on query patterns, retention requirements, and budget.
| System | Architecture | Query Speed | Cost per TB | Best For |
|---|---|---|---|---|
| Elasticsearch | Distributed inverted index | Fast full-text | High ($$$) | Ad-hoc search, SIEM integration |
| OpenSearch | Elasticsearch fork | Fast full-text | High ($$$) | AWS-native, open governance |
| Grafana Loki | Index labels only, chunks in object store | Fast by labels, slow full-text | Low ($) | Cloud-native, label-based queries |
| ClickHouse | Columnar OLAP | Fast aggregations | Medium ($$) | Analytics, metrics, structured logs |
| Splunk | Proprietary indexed data store | Very fast | Very High ($$$$) | Enterprise SIEM, compliance |
| S3/GCS/Azure Blob | Object storage + query engines | Slow (requires scan) | Very Low ($) | Long-term archive, compliance |
Log access patterns follow a predictable curve: recent logs are queried frequently, older logs rarely. This enables significant cost optimization through storage tiering, as in the following Elasticsearch index lifecycle management (ILM) policy, which moves indices from hot to warm, cold, and frozen tiers before deletion:
{ "policy": { "phases": { "hot": { "min_age": "0ms", "actions": { "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }, "set_priority": { "priority": 100 } } }, "warm": { "min_age": "7d", "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 }, "allocate": { "require": { "data": "warm" }, "number_of_replicas": 1 }, "set_priority": { "priority": 50 } } }, "cold": { "min_age": "30d", "actions": { "allocate": { "require": { "data": "cold" }, "number_of_replicas": 0 }, "set_priority": { "priority": 0 }, "searchable_snapshot": { "snapshot_repository": "log-archive-s3" } } }, "frozen": { "min_age": "90d", "actions": { "searchable_snapshot": { "snapshot_repository": "log-archive-glacier" } } }, "delete": { "min_age": "365d", "actions": { "wait_for_snapshot": { "policy": "compliance-snapshot-policy" }, "delete": {} } } } }}Loki takes a different approach—it only indexes metadata labels, not log content. This dramatically reduces storage and compute costs:
```yaml
# Loki configuration for security log storage

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

ingester:
  wal:
    enabled: true
    dir: /loki/wal
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 30m
  max_chunk_age: 1h
  chunk_target_size: 1572864
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
    cache_ttl: 24h
  aws:
    s3: s3://us-east-1/loki-logs-bucket
    bucketnames: loki-logs-bucket
    region: us-east-1
    access_key_id: ${AWS_ACCESS_KEY_ID}
    secret_access_key: ${AWS_SECRET_ACCESS_KEY}

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  delete_request_store: s3

limits_config:
  retention_period: 744h          # 31 days in hot storage
  ingestion_rate_mb: 50
  ingestion_burst_size_mb: 100
  per_stream_rate_limit: 10MB
  max_query_lookback: 0           # No limit
  # Label cardinality limits (critical for performance)
  max_label_name_length: 1024
  max_label_value_length: 2048
  max_label_names_per_series: 30
```

Loki is extremely fast when querying by labels (e.g., 'show me all logs from host=db-prod-01 with level=error'). However, grep-like searches through log content require scanning chunks, which is slower than Elasticsearch's full-text index. Design your labels carefully to optimize common query patterns.
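To illustrate the label-first query model, the sketch below runs a LogQL range query against Loki's HTTP API (`/loki/api/v1/query_range`) using the requests library; the Loki URL and time window are placeholders. Because the selector uses only indexed labels, Loki resolves it from the index and fetches just the matching chunks.

```python
import time

import requests  # pip install requests

LOKI_URL = "http://loki.logging.svc:3100"   # hypothetical in-cluster address


def query_errors(host: str, minutes: int = 60) -> list[str]:
    """Fetch recent error lines for one host using a label selector."""
    end_ns = int(time.time() * 1e9)
    start_ns = end_ns - minutes * 60 * 10**9
    resp = requests.get(
        f"{LOKI_URL}/loki/api/v1/query_range",
        params={
            "query": f'{{host="{host}", level="error"}}',
            "start": str(start_ns),
            "end": str(end_ns),
            "limit": "1000",
        },
        timeout=30,
    )
    resp.raise_for_status()
    lines = []
    # Each stream carries its label set plus (timestamp, line) pairs
    for stream in resp.json()["data"]["result"]:
        for _ts, line in stream["values"]:
            lines.append(line)
    return lines


if __name__ == "__main__":
    for line in query_errors("db-prod-01"):
        print(line)
```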
Log retention is where security requirements, compliance mandates, storage costs, and legal obligations intersect. Getting it wrong can result in regulatory fines, failed audits, or inability to investigate incidents.
| Framework | Log Types | Retention Period | Key Requirements |
|---|---|---|---|
| PCI DSS 4.0 | All audit trails, auth logs | 1 year (3 months immediately accessible) | Daily review, integrity protection |
| HIPAA | Access to PHI, security events | 6 years | Access tracking, audit controls |
| SOX | Financial system access, changes | 7 years | Tamper-evident, non-repudiation |
| GDPR | Personal data processing | As long as necessary (minimize) | Right to erasure complicates logs |
| GLBA | Customer financial data access | 5 years | Access controls, disposal procedures |
| FISMA/FedRAMP | All security-relevant events | 90 days online, 1 year archive | Real-time alerting, monthly reviews |
| SOC 2 | Security events, access logs | 1 year recommended | Integrity, availability, confidentiality |
Effective retention policies balance multiple concerns: how far back investigators need to search, which compliance mandates apply to each log type, what online and archival storage costs, and whether legal holds can suspend deletion. The declarative policy below shows one way to encode these trade-offs:
```yaml
# Log retention policy configuration
# This example uses a declarative format that could be implemented
# by various log management systems

metadata:
  policy_name: security-log-retention
  version: "2.0"
  last_reviewed: "2024-01-15"
  next_review: "2024-07-15"
  owner: security-team@company.com
  approved_by: ciso@company.com

# Global defaults
defaults:
  online_retention_days: 90
  archive_retention_years: 7
  storage_tier_progression:
    hot_to_warm_days: 7
    warm_to_cold_days: 30
    cold_to_archive_days: 90
  integrity_protection: required
  encryption: required

# Per-log-type policies (override defaults)
log_types:
  # Authentication and access logs
  authentication:
    description: "Login attempts, SSO events, MFA events"
    online_retention_days: 180
    archive_retention_years: 7
    compliance_frameworks:
      - PCI-DSS
      - SOX
      - HIPAA
    alerting_required: true

  # Authorization and permission events
  authorization:
    description: "Access decisions, permission changes"
    online_retention_days: 180
    archive_retention_years: 7

  # Privileged operations
  privileged_access:
    description: "sudo, admin actions, elevated permissions"
    online_retention_days: 365
    archive_retention_years: 7
    review_frequency: weekly

  # Application logs (non-security)
  application:
    description: "General application logs"
    online_retention_days: 30
    archive_retention_years: 1

  # Network traffic logs
  network:
    description: "Firewall logs, flow data"
    online_retention_days: 90
    archive_retention_years: 3

  # Personal data processing (GDPR)
  personal_data:
    description: "Logs containing PII processing"
    online_retention_days: 90
    archive_retention_years: 3
    special_handling:
      - erasure_requests_honored
      - anonymization_on_archive
      - access_restricted

# Legal hold configuration
legal_hold:
  enabled: true
  notification_recipients:
    - legal@company.com
    - ciso@company.com
  hold_prevents_deletion: true
  hold_duration_override: unlimited

# Deletion procedures
deletion:
  method: cryptographic_erasure
  verification: required
  certificate_generated: true
  audit_log_retained: true    # Log of what was deleted, kept longer
```

GDPR gives individuals the right to erasure of personal data, but logs may contain personal data (usernames, IPs, actions). The standard approach is to ensure logs are retained only as long as necessary for legitimate purposes (security, compliance), clearly document retention justification, and anonymize or delete when retention expires. Never promise to instantly delete specific log entries on request—it's technically complex and may violate other retention requirements.
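One practical way to implement the `anonymization_on_archive` handling from the policy above is keyed pseudonymization: replace direct identifiers with an HMAC so analysts can still correlate events belonging to the same user or IP without seeing the original value, and destroy the key when the data must become truly anonymous. A minimal sketch follows; the field names and key handling are assumptions, and in practice the key would live in a KMS.

```python
import hashlib
import hmac

PSEUDONYM_KEY = b"rotate-and-store-me-in-a-kms"   # placeholder secret


def pseudonymize(value: str) -> str:
    """Deterministic, keyed replacement: same input -> same token, but not
    reversible without the key. Destroying the key completes anonymization."""
    return hmac.new(PSEUDONYM_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def anonymize_on_archive(event: dict) -> dict:
    """Pseudonymize identifier fields before writing to the archive tier."""
    sanitized = dict(event)
    for field in ("user", "source_ip"):            # assumed identifier fields
        if field in sanitized:
            sanitized[field] = pseudonymize(str(sanitized[field]))
    return sanitized


# Example
print(anonymize_on_archive({"user": "jsmith", "source_ip": "192.168.1.50",
                            "event_type": "authentication"}))
```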
Raw logs from different sources come in vastly different formats. A Windows Event Log looks nothing like a Linux audit log, which looks nothing like a cloud API audit trail. Normalization transforms diverse log formats into a common schema, enabling cross-source correlation and consistent queries.
Consider the same event—a failed login—from different sources:
```text
# Linux auth.log
Jan 15 14:32:47 db-prod-01 sshd[12345]: Failed password for jsmith from 192.168.1.50 port 22 ssh2

# Windows Security Event (XML)
<Event>
  <EventID>4625</EventID>
  <TimeCreated SystemTime="2024-01-15T14:32:47.123Z"/>
  <Computer>DB-PROD-01</Computer>
  <EventData>
    <Data Name="TargetUserName">jsmith</Data>
    <Data Name="IpAddress">192.168.1.50</Data>
    <Data Name="LogonType">10</Data>
    <Data Name="FailureReason">%%2313</Data>
  </EventData>
</Event>

# AWS CloudTrail (JSON)
{
  "eventTime": "2024-01-15T14:32:47Z",
  "eventSource": "signin.amazonaws.com",
  "eventName": "ConsoleLogin",
  "userIdentity": {"userName": "jsmith"},
  "sourceIPAddress": "192.168.1.50",
  "responseElements": {"ConsoleLogin": "Failure"}
}

# Normalized output (common schema)
{
  "timestamp": "2024-01-15T14:32:47.123456Z",
  "event_type": "authentication",
  "event_outcome": "failure",
  "source_host": "db-prod-01",
  "source_ip": "192.168.1.50",
  "user": "jsmith",
  "authentication_method": "password",
  "service": "ssh"
}
```

Several community schemas standardize these normalized field names:

| Standard | Origin | Key Features | Adoption |
|---|---|---|---|
| ECS (Elastic Common Schema) | Elastic | Nested JSON, comprehensive field set | High (ELK users) |
| OCSF (Open Cybersecurity Schema) | AWS, Splunk, others | Security-focused, event categories | Growing (cloud/security) |
| OSSEM (Open Source Security Events Metadata) | Community | Detection-focused, ATT&CK mapping | Medium (security researchers) |
| CEF (Common Event Format) | ArcSight/HP | Key-value pairs, legacy standard | Legacy (older SIEMs) |
| LEEF (Log Event Extended Format) | IBM QRadar | Tab-delimited, QRadar-native | Medium (IBM customers) |
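Before looking at a full pipeline, here is a compact Python sketch of the same normalization idea: parse the raw sshd line from the earlier example and emit ECS-style field names. The regular expression and field mapping are illustrative, not a complete parser.

```python
import re

SSHD_FAILED = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ [\d:]+) (?P<host>\S+) sshd\[\d+\]: "
    r"Failed password for (?P<user>\S+) from (?P<ip>[\d.]+) port (?P<port>\d+)"
)


def normalize_sshd(line: str) -> dict | None:
    """Map a raw auth.log line to a subset of ECS-style fields."""
    m = SSHD_FAILED.match(line)
    if not m:
        return None
    return {
        "event": {"category": "authentication", "outcome": "failure"},
        "host": {"name": m.group("host")},
        "user": {"name": m.group("user")},
        "source": {"address": m.group("ip"), "port": int(m.group("port"))},
        "message": line,
    }


raw = ("Jan 15 14:32:47 db-prod-01 sshd[12345]: "
       "Failed password for jsmith from 192.168.1.50 port 22 ssh2")
print(normalize_sshd(raw))
```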
Logstash is a powerful tool for parsing and normalizing diverse log formats:
```conf
# Logstash pipeline for log normalization

input {
  # From Kafka topics
  kafka {
    bootstrap_servers => "kafka:9092"
    topics            => ["raw-logs"]
    group_id          => "logstash-normalizer"
    codec             => json
  }
}

filter {
  # ================================================
  # LINUX AUTH LOG PARSING
  # ================================================
  if [log_source] == "linux-auth" {
    grok {
      match => {
        "message" => [
          # Failed password
          "%{SYSLOGTIMESTAMP:syslog_timestamp} %{HOSTNAME:source_host} sshd\[%{NUMBER:pid}\]: Failed password for %{USERNAME:user} from %{IP:source_ip} port %{NUMBER:source_port}",
          # Accepted password
          "%{SYSLOGTIMESTAMP:syslog_timestamp} %{HOSTNAME:source_host} sshd\[%{NUMBER:pid}\]: Accepted password for %{USERNAME:user} from %{IP:source_ip} port %{NUMBER:source_port}",
          # Invalid user
          "%{SYSLOGTIMESTAMP:syslog_timestamp} %{HOSTNAME:source_host} sshd\[%{NUMBER:pid}\]: Invalid user %{USERNAME:user} from %{IP:source_ip}"
        ]
      }
    }

    # Normalize to common schema
    mutate {
      add_field => {
        "[event][category]" => "authentication"
        "[event][type]"     => "start"
        "[source][address]" => "%{source_ip}"
        "[user][name]"      => "%{user}"
        "[host][name]"      => "%{source_host}"
      }
    }

    # Determine outcome
    if "Failed" in [message] or "Invalid" in [message] {
      mutate { add_field => { "[event][outcome]" => "failure" } }
    } else {
      mutate { add_field => { "[event][outcome]" => "success" } }
    }

    # Parse timestamp
    date {
      match  => ["syslog_timestamp", "MMM d HH:mm:ss", "MMM dd HH:mm:ss"]
      target => "@timestamp"
    }
  }

  # ================================================
  # WINDOWS SECURITY EVENT PARSING
  # ================================================
  if [log_source] == "windows-security" {
    xml {
      source => "message"
      target => "winlog"
    }

    # Map Windows Event ID to event type
    translate {
      field       => "[winlog][EventID]"
      destination => "[event][action]"
      dictionary  => {
        "4624" => "logon_success"
        "4625" => "logon_failure"
        "4648" => "explicit_credential_logon"
        "4672" => "special_privileges_assigned"
        "4688" => "process_created"
        "4720" => "user_account_created"
      }
    }

    # Normalize fields
    mutate {
      rename => {
        "[winlog][Computer]"                  => "[host][name]"
        "[winlog][EventData][TargetUserName]" => "[user][name]"
        "[winlog][EventData][IpAddress]"      => "[source][address]"
      }
      add_field => { "[event][category]" => "authentication" }
    }

    if [winlog][EventID] == "4625" {
      mutate { add_field => { "[event][outcome]" => "failure" } }
    } else if [winlog][EventID] == "4624" {
      mutate { add_field => { "[event][outcome]" => "success" } }
    }
  }

  # ================================================
  # ENRICH ALL EVENTS
  # ================================================

  # GeoIP enrichment for source IPs
  if [source][address] {
    geoip {
      source => "[source][address]"
      target => "[source][geo]"
    }
  }

  # Add processing metadata
  mutate {
    add_field => {
      "[ecs][version]"    => "8.0"
      "[event][ingested]" => "%{@timestamp}"
    }
  }

  # Remove temporary fields
  mutate {
    remove_field => ["syslog_timestamp", "pid", "message", "winlog"]
  }
}

output {
  elasticsearch {
    hosts    => ["elasticsearch:9200"]
    index    => "security-logs-%{+YYYY.MM.dd}"
    user     => "logstash"
    password => "${ES_PASSWORD}"
  }
}
```

Normalizing logs at ingestion time (rather than at query time) is almost always the right approach. It's computationally cheaper to parse once during ingestion than repeatedly during every query. It also ensures consistent field names and types, making queries reliable and correlation possible.
Enterprise log management systems must handle millions of events per second while remaining responsive for queries. The following patterns enable this scale.
Separate high-fanout collection from heavy processing: lightweight agents on each host do nothing but ship raw events to a small aggregation tier (dedicated collectors or a Kafka buffer), which in turn feeds parsing, enrichment, and storage writers. Each tier can then scale independently, and a slow storage backend never backs up onto production hosts. A minimal sketch of the aggregation step follows below.
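Here is a minimal sketch of the aggregation tier's core job, under the assumption that hosts emit individual events while the processing tier prefers fewer, larger batches: buffer events in memory and flush on either a size or an age threshold. Real aggregators (Fluentd, Vector, or Kafka consumers) add persistence and backpressure on top of this loop.

```python
import time
from typing import Any, Callable


class BatchingAggregator:
    """Collects events from many producers and flushes them in batches."""

    def __init__(self, flush_fn: Callable[[list[dict]], None],
                 max_batch: int = 500, max_age_s: float = 2.0):
        self.flush_fn = flush_fn          # e.g. bulk-write to storage or Kafka
        self.max_batch = max_batch
        self.max_age_s = max_age_s
        self.buffer: list[dict] = []
        self.oldest = time.monotonic()

    def submit(self, event: dict[str, Any]) -> None:
        """Accept one event; flush when the batch is big enough or old enough."""
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.oldest >= self.max_age_s):
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []


# Example: print batch sizes instead of writing to a real backend
agg = BatchingAggregator(flush_fn=lambda batch: print(f"flushing {len(batch)} events"),
                         max_batch=3)
for i in range(7):
    agg.submit({"seq": i})
agg.flush()   # flush the trailing partial batch
```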
For extremely high-volume logs that don't require 100% capture, statistical sampling reduces volume while preserving visibility:
"""Log sampling strategies for high-volume environments.These techniques reduce volume while maintaining security visibility.""" import randomimport hashlibfrom typing import Dict, Any class LogSampler: """ Implements various sampling strategies for high-volume logs. """ def __init__(self, default_rate: float = 1.0): """ Initialize sampler with default rate (1.0 = 100% / no sampling). """ self.default_rate = default_rate self.priority_override_rules = {} def add_priority_rule(self, field: str, value: str, rate: float): """ Certain events should never be sampled (security-critical). Rate of 1.0 means always keep. """ self.priority_override_rules[(field, value)] = rate def should_sample(self, event: Dict[str, Any]) -> bool: """ Determine if an event should be included (not dropped). Returns True to keep, False to drop. """ # Priority events are never dropped for (field, value), rate in self.priority_override_rules.items(): if event.get(field) == value: return random.random() < rate # Apply default sampling rate return random.random() < self.default_rate def consistent_hash_sample(self, event: Dict[str, Any], key_field: str, rate: float) -> bool: """ Use consistent hashing so the same key always gets same decision. Important for keeping all events from a single session/user. """ key_value = event.get(key_field, "") hash_val = int(hashlib.sha256(str(key_value).encode()).hexdigest()[:8], 16) threshold = int(rate * 0xFFFFFFFF) return hash_val < threshold # Example usagesampler = LogSampler(default_rate=0.1) # 10% default sampling # Security events: NEVER samplesampler.add_priority_rule("event.category", "authentication", 1.0)sampler.add_priority_rule("event.outcome", "failure", 1.0)sampler.add_priority_rule("event.category", "intrusion_detection", 1.0)sampler.add_priority_rule("event.severity", "critical", 1.0)sampler.add_priority_rule("event.severity", "high", 1.0) # Debug logs: Sample aggressivelysampler.add_priority_rule("log.level", "debug", 0.01) # 1% only # Normal web traffic: Moderate samplingsampler.add_priority_rule("event.category", "web", 0.05) # 5% def process_log(event: Dict[str, Any]): """Process a log event with sampling.""" if sampler.should_sample(event): # Forward to storage send_to_storage(event) else: # Update sampling statistics update_sampling_metrics(event)In large organizations or SaaS platforms, isolate tenants for security, performance, and compliance:
Sampling should ONLY apply to high-volume, low-value events (debug logs, health checks, routine reads). Security-relevant events—authentication, authorization failures, privilege escalation, policy changes—must always be captured at 100%. A sampled-out attack is effectively undetected.
Effective log management is the backbone of security operations—without it, audit trails are useless because they can't be found, queried, or retained appropriately. To consolidate the key concepts: lightweight agents with persistent buffering collect logs at the source; reliable transports such as Kafka provide at-least-once delivery at scale; tiered storage balances query speed against cost; retention policies reconcile compliance mandates, investigation needs, and budget; normalization to a common schema makes cross-source correlation possible; and sampling, tiering, and tenant isolation keep the pipeline manageable as volume grows.
What's Next:
With logs properly collected, transported, stored, and retained, we now turn to intrusion detection—the systems and techniques that analyze log data to identify attacks in progress and security breaches.
You now understand the architecture and best practices for enterprise log management. This infrastructure forms the foundation for all security detection, investigation, and compliance activities.