A single production web server generates approximately 50 GB of logs per day. A Kubernetes cluster with 100 nodes produces over 1 TB daily. A major cloud provider like AWS manages petabytes of log data every hour. In the modern enterprise, logs are no longer simple text files—they are a Big Data problem that requires sophisticated infrastructure to collect, transport, store, and query.
Consider the log management challenge at Netflix: their infrastructure generates over 500 billion log events per day, representing more than 1 petabyte of raw data. Finding a single authentication failure among those 500 billion events requires infrastructure as sophisticated as the systems generating the logs themselves.
This page explores the systems and practices that make log management possible—from collection agents to centralized storage, from real-time streaming to long-term archival. You'll learn how to design log infrastructure that scales with your systems while remaining accessible for security investigation.
By the end of this page, you will understand: (1) Log collection architectures and agent selection, (2) Log transportation protocols and reliability guarantees, (3) Centralized log storage systems and indexing strategies, (4) Log retention policies and compliance requirements, (5) Log aggregation and normalization techniques, and (6) Scalability patterns for enterprise log management.
Modern log collection has evolved from simple file tailing to sophisticated distributed systems. Understanding the architectural patterns helps you design appropriate solutions for your scale.
There are three primary patterns for log collection:
| Pattern | Description | Pros | Cons |
|---|---|---|---|
| Push (Agent-based) | Agents on each host forward logs | Near real-time, reliable delivery | Agent overhead, management complexity |
| Pull (Scraping) | Central system reads from hosts | No agent installation required | Higher latency, relies on host availability |
| Sidecar | Co-located collector per application | Application isolation, container-native | Resource overhead per pod/container |
The most common pattern uses lightweight agents running on each host. These agents read logs from files, journald, or direct streams and forward them to central collectors.
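To make the agent pattern concrete, here is a deliberately minimal Python sketch of what agents like Fluent Bit or Filebeat do at their core: tail a file from a persisted offset, forward any new lines, and only advance the offset after a successful send. The file paths and the `forward()` stand-in are hypothetical; real agents add parsing, buffering, backpressure handling, and multi-source support.

```python
import json
import os

STATE_FILE = "/var/tmp/tail-offset.state"   # hypothetical offset store
LOG_FILE = "/var/log/app/app.log"           # hypothetical log source


def read_offset() -> int:
    """Return the last byte offset we forwarded, or 0 on first run."""
    try:
        with open(STATE_FILE) as f:
            return int(f.read().strip() or 0)
    except FileNotFoundError:
        return 0


def save_offset(offset: int) -> None:
    """Persist the offset so a restart neither re-sends nor skips lines."""
    with open(STATE_FILE, "w") as f:
        f.write(str(offset))


def forward(lines: list[str]) -> None:
    """Stand-in for shipping to a collector (HTTP, syslog, Kafka, ...)."""
    for line in lines:
        print(json.dumps({"message": line.rstrip("\n")}))


def collect_once() -> None:
    """One polling pass: read anything appended since the saved offset."""
    offset = read_offset()
    if os.path.getsize(LOG_FILE) < offset:
        offset = 0                      # file was rotated/truncated: start over
    with open(LOG_FILE, "r", errors="replace") as f:
        f.seek(offset)
        new_lines = f.readlines()
        new_offset = f.tell()
    if new_lines:
        forward(new_lines)
        save_offset(new_offset)         # only advance after a successful forward


if __name__ == "__main__":
    collect_once()
```

Persisting the read position is what the `DB` options in the Fluent Bit configuration below do for the same reason: without it, a restarted agent either replays old lines or silently loses new ones.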
The choice of agent significantly impacts resource usage, reliability, and capabilities:
| Agent | Memory | Features | Best For |
|---|---|---|---|
| Fluent Bit | ~1-5 MB | Lightweight, cloud-native, plugins | Containers, Kubernetes, edge |
| Fluentd | ~40-100 MB | Ruby plugins, flexible routing | Complex transformations, legacy systems |
| Filebeat | ~20-50 MB | ELK integration, modules | Elastic Stack environments |
| Vector | ~10-30 MB | Rust, observability pipelines | High-performance, unified observability |
| Promtail | ~10-30 MB | Loki-native, label extraction | Grafana Loki environments |
| rsyslog | ~5-15 MB | RFC-compliant, enterprise proven | Traditional syslog, compliance |
| syslog-ng | ~10-20 MB | Advanced parsing, routing | Complex syslog environments |
Fluent Bit is the leading lightweight agent for cloud-native environments:
```conf
# /etc/fluent-bit/fluent-bit.conf
# Production configuration for security log collection

[SERVICE]
    # Daemon mode
    Daemon            Off
    Flush             1
    Log_Level         info
    # Parser configuration
    Parsers_File      parsers.conf
    # Enable metrics endpoint for monitoring
    HTTP_Server       On
    HTTP_Listen       0.0.0.0
    HTTP_Port         2020
    # Buffer configuration for reliability
    storage.path      /var/log/fluent-bit-buffer/
    storage.sync      normal
    storage.checksum  off
    storage.backlog.mem_limit 50M

# ================================================
# INPUTS: What logs to collect
# ================================================

# System logs via journald
[INPUT]
    Name              systemd
    Tag               system.*
    Systemd_Filter    _SYSTEMD_UNIT=sshd.service
    Systemd_Filter    _SYSTEMD_UNIT=auditd.service
    Systemd_Filter    _SYSTEMD_UNIT=sudo.service
    Read_From_Tail    On
    DB                /var/log/fluent-bit-systemd.db

# Audit logs from auditd
[INPUT]
    Name              tail
    Tag               audit.*
    Path              /var/log/audit/audit.log
    Parser            audit
    DB                /var/log/fluent-bit-audit.db
    Mem_Buf_Limit     10MB
    Refresh_Interval  5

# Kernel messages
[INPUT]
    Name              kmsg
    Tag               kernel

# Application logs (JSON format)
[INPUT]
    Name              tail
    Tag               app.*
    Path              /var/log/app/*.json
    Parser            json
    DB                /var/log/fluent-bit-app.db
    Mem_Buf_Limit     10MB

# ================================================
# FILTERS: Enrich and transform
# ================================================

# Add hostname and timestamp
[FILTER]
    Name              record_modifier
    Match             *
    Record            hostname ${HOSTNAME}
    Record            environment production
    Record            cluster main-cluster

# Parse nested fields
[FILTER]
    Name              parser
    Match             app.*
    Key_Name          log
    Parser            json
    Reserve_Data      True

# Add Kubernetes metadata (if running in K8s)
[FILTER]
    Name              kubernetes
    Match             kube.*
    Kube_URL          https://kubernetes.default.svc:443
    Kube_CA_File      /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    Kube_Token_File   /var/run/secrets/kubernetes.io/serviceaccount/token
    Merge_Log         On

# ================================================
# OUTPUTS: Where to send logs
# ================================================

# Primary: Elasticsearch for security logs
[OUTPUT]
    Name              es
    Match             audit.* system.*
    Host              elasticsearch.logging.svc
    Port              9200
    Index             security-logs
    Type              _doc
    Logstash_Format   On
    Logstash_Prefix   security
    Time_Key          @timestamp
    Include_Tag_Key   On
    Retry_Limit       5
    # TLS configuration
    tls               On
    tls.verify        On
    tls.ca_file       /etc/fluent-bit/tls/ca.crt
    tls.crt_file      /etc/fluent-bit/tls/tls.crt
    tls.key_file      /etc/fluent-bit/tls/tls.key
    # Authentication
    HTTP_User         ${ES_USER}
    HTTP_Passwd       ${ES_PASSWORD}

# Secondary: Forward to remote syslog for compliance
[OUTPUT]
    Name              syslog
    Match             audit.*
    Host              syslog-archive.corp.internal
    Port              6514
    Mode              tls
    Syslog_Format     rfc5424
    Syslog_Hostname_key hostname
    Syslog_Message_key  message
    tls               On
    tls.verify        On
    tls.ca_file       /etc/fluent-bit/tls/syslog-ca.crt

# Local backup (filesystem buffer)
[OUTPUT]
    Name              file
    Match             *
    Path              /var/log/fluent-bit-backup/
    Format            out_file
```

Never configure log agents without persistent buffering. Network outages and collector restarts will occur. Without buffering, logs generated during these windows are lost forever. Even 10-50 MB of buffer can save hours of critical security logs during an outage.
Getting logs from thousands of sources to central storage reliably is a significant engineering challenge. The choice of transport protocol affects delivery guarantees, performance, and security.
| Protocol | Delivery Guarantee | Performance | Security | Use Case |
|---|---|---|---|---|
| UDP Syslog (RFC 5426) | None (fire-and-forget) | Very High | None by default | High-volume, loss-tolerant |
| TCP Syslog (RFC 6587) | Ordered delivery | High | TLS optional | Reliable syslog |
| RELP | At-least-once | Medium | TLS optional | Guaranteed syslog delivery |
| HTTP/HTTPS | At-least-once | Medium | TLS built-in | REST APIs, cloud services |
| Kafka | At-least-once/Exactly-once | Very High | TLS + SASL | High-scale streaming |
| gRPC | At-least-once | Very High | TLS built-in | Modern observability |
Understanding delivery semantics is critical for security logging. With at-most-once (fire-and-forget) transport, an event dropped during a network blip is gone forever; with at-least-once delivery, every event is eventually stored, but a lost acknowledgment can produce duplicates that must be deduplicated downstream; exactly-once semantics avoid both problems at the cost of throughput and complexity. For audit and security data, at-least-once is the minimum acceptable guarantee.
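The practical difference between these guarantees is easiest to see in code. The sketch below, with hypothetical paths and a stubbed network send, shows the heart of at-least-once delivery: persist the event before attempting to send it, and delete it only after an explicit acknowledgment. The cost is that a lost acknowledgment triggers a re-send, which is why at-least-once pipelines deduplicate downstream, typically on a unique event ID.

```python
import json
import os
import uuid

SPOOL_DIR = "/var/spool/logship"   # hypothetical on-disk spool


def spool(event: dict) -> str:
    """Write the event to disk *before* attempting delivery (at-least-once)."""
    os.makedirs(SPOOL_DIR, exist_ok=True)
    path = os.path.join(SPOOL_DIR, f"{uuid.uuid4()}.json")
    with open(path, "w") as f:
        json.dump(event, f)
    return path


def try_send(event: dict) -> bool:
    """Stand-in for the real network send; returns True only on an explicit ack."""
    return False   # pretend the collector is unreachable right now


def deliver(event: dict) -> None:
    path = spool(event)
    if try_send(event):
        os.remove(path)   # ack received: safe to forget the spooled copy
    # Otherwise the file stays in the spool and a retry loop re-sends it later.
    # If the ack (not the event) was lost, the retry produces a duplicate,
    # which downstream storage should deduplicate on a unique event ID.


deliver({"event_id": str(uuid.uuid4()), "event_type": "authentication"})
```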
Apache Kafka has become the standard for high-volume log transportation due to its durability, scalability, and decoupling of producers from consumers:
Kafka's key benefits for logging are durability (replicated, disk-backed topics keep buffering when downstream consumers are slow or offline), horizontal scalability through partitioned topics, and replay: because events are retained for a configurable window, a new SIEM or analytics consumer can be backfilled from the same stream without touching the agents. The script below creates topics tuned for security logs:
```bash
#!/bin/bash
# Create optimized Kafka topics for security logs

# Topic for security/audit logs
# - 30 day retention (720 hours)
# - 12 partitions for parallelism
# - Replication factor 3 for durability
kafka-topics.sh --create \
  --bootstrap-server kafka.internal:9092 \
  --topic security-logs \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=2592000000 \
  --config cleanup.policy=delete \
  --config min.insync.replicas=2 \
  --config compression.type=lz4 \
  --config segment.bytes=1073741824

# Topic for archival (longer retention, larger segments)
kafka-topics.sh --create \
  --bootstrap-server kafka.internal:9092 \
  --topic security-logs-archive \
  --partitions 6 \
  --replication-factor 3 \
  --config retention.ms=7776000000 \
  --config cleanup.policy=delete \
  --config segment.bytes=5368709120

# Verify topic configuration
kafka-topics.sh --describe \
  --bootstrap-server kafka.internal:9092 \
  --topic security-logs
```

Kafka must be secured for log transport. Enable TLS (SSL) for encryption in transit, SASL for authentication, and ACLs for authorization. Unsecured Kafka allows attackers to read sensitive logs or inject false events—completely undermining security monitoring.
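As a companion to the topic setup above, the following sketch shows a producer configured for reliable, authenticated delivery. It assumes the kafka-python client, a SASL/SCRAM-over-TLS listener on port 9093, and credentials supplied via environment variables; the mechanism, port, and certificate paths are placeholders to adapt to your cluster.

```python
import json
import os

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9093",        # TLS listener (assumed port)
    # Delivery guarantees: wait for all in-sync replicas, retry on failure
    acks="all",
    retries=5,
    linger_ms=50,                                   # small batching window
    compression_type="lz4",
    # Encryption and authentication (SASL/SCRAM over TLS)
    security_protocol="SASL_SSL",
    ssl_cafile="/etc/kafka/tls/ca.crt",
    sasl_mechanism="SCRAM-SHA-512",
    sasl_plain_username=os.environ["KAFKA_USER"],
    sasl_plain_password=os.environ["KAFKA_PASSWORD"],
    value_serializer=lambda e: json.dumps(e).encode("utf-8"),
)

event = {
    "event_type": "authentication",
    "event_outcome": "failure",
    "user": "jsmith",
    "source_ip": "192.168.1.50",
}
producer.send("security-logs", value=event)
producer.flush()   # block until the broker acknowledges outstanding records
```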
Centralized log storage must handle massive ingest rates, provide fast search capabilities, and scale economically. The choice of storage system depends on query patterns, retention requirements, and budget.
| System | Architecture | Query Speed | Cost per TB | Best For |
|---|---|---|---|---|
| Elasticsearch | Distributed inverted index | Fast full-text | High ($$$) | Ad-hoc search, SIEM integration |
| OpenSearch | Elasticsearch fork | Fast full-text | High ($$$) | AWS-native, open governance |
| Grafana Loki | Index labels only, chunks in object store | Fast by labels, slow full-text | Low ($) | Cloud-native, label-based queries |
| ClickHouse | Columnar OLAP | Fast aggregations | Medium ($$) | Analytics, metrics, structured logs |
| Splunk | Proprietary indexed data store | Very fast | Very High ($$$$) | Enterprise SIEM, compliance |
| S3/GCS/Azure Blob | Object storage + query engines | Slow (requires scan) | Very Low ($) | Long-term archive, compliance |
Log access patterns follow a predictable curve: recent logs are queried frequently, older logs rarely. This enables significant cost optimization through storage tiering, as in the following Elasticsearch index lifecycle management (ILM) policy, which moves indices from hot to warm, cold, and frozen tiers before deletion:
{ "policy": { "phases": { "hot": { "min_age": "0ms", "actions": { "rollover": { "max_age": "1d", "max_primary_shard_size": "50gb" }, "set_priority": { "priority": 100 } } }, "warm": { "min_age": "7d", "actions": { "shrink": { "number_of_shards": 1 }, "forcemerge": { "max_num_segments": 1 }, "allocate": { "require": { "data": "warm" }, "number_of_replicas": 1 }, "set_priority": { "priority": 50 } } }, "cold": { "min_age": "30d", "actions": { "allocate": { "require": { "data": "cold" }, "number_of_replicas": 0 }, "set_priority": { "priority": 0 }, "searchable_snapshot": { "snapshot_repository": "log-archive-s3" } } }, "frozen": { "min_age": "90d", "actions": { "searchable_snapshot": { "snapshot_repository": "log-archive-glacier" } } }, "delete": { "min_age": "365d", "actions": { "wait_for_snapshot": { "policy": "compliance-snapshot-policy" }, "delete": {} } } } }}Loki takes a different approach—it only indexes metadata labels, not log content. This dramatically reduces storage and compute costs:
```yaml
# Loki configuration for security log storage

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

ingester:
  wal:
    enabled: true
    dir: /loki/wal
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  chunk_idle_period: 30m
  max_chunk_age: 1h
  chunk_target_size: 1572864
  chunk_retain_period: 30s

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
    cache_ttl: 24h
  aws:
    s3: s3://us-east-1/loki-logs-bucket
    bucketnames: loki-logs-bucket
    region: us-east-1
    access_key_id: ${AWS_ACCESS_KEY_ID}
    secret_access_key: ${AWS_SECRET_ACCESS_KEY}

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  delete_request_store: s3

limits_config:
  retention_period: 744h          # 31 days in hot storage
  ingestion_rate_mb: 50
  ingestion_burst_size_mb: 100
  per_stream_rate_limit: 10MB
  max_query_lookback: 0           # No limit
  # Label cardinality limits (critical for performance)
  max_label_name_length: 1024
  max_label_value_length: 2048
  max_label_names_per_series: 30
```

Loki is extremely fast when querying by labels (e.g., 'show me all logs from host=db-prod-01 with level=error'). However, grep-like searches through log content require scanning chunks, which is slower than Elasticsearch's full-text index. Design your labels carefully to optimize common query patterns.
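To illustrate the label-first query model, the sketch below runs a LogQL range query against Loki's HTTP API (`/loki/api/v1/query_range`) using the requests library; the Loki URL and time window are placeholders. Because the selector uses only indexed labels, Loki resolves it from the index and fetches just the matching chunks.

```python
import time

import requests  # pip install requests

LOKI_URL = "http://loki.logging.svc:3100"   # hypothetical in-cluster address


def query_errors(host: str, minutes: int = 60) -> list[str]:
    """Fetch recent error lines for one host using a label selector."""
    end_ns = int(time.time() * 1e9)
    start_ns = end_ns - minutes * 60 * 10**9
    resp = requests.get(
        f"{LOKI_URL}/loki/api/v1/query_range",
        params={
            "query": f'{{host="{host}", level="error"}}',
            "start": str(start_ns),
            "end": str(end_ns),
            "limit": "1000",
        },
        timeout=30,
    )
    resp.raise_for_status()
    lines = []
    # Each stream carries its label set plus (timestamp, line) pairs
    for stream in resp.json()["data"]["result"]:
        for _ts, line in stream["values"]:
            lines.append(line)
    return lines


if __name__ == "__main__":
    for line in query_errors("db-prod-01"):
        print(line)
```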
Log retention is where security requirements, compliance mandates, storage costs, and legal obligations intersect. Getting it wrong can result in regulatory fines, failed audits, or inability to investigate incidents.
| Framework | Log Types | Retention Period | Key Requirements |
|---|---|---|---|
| PCI DSS 4.0 | All audit trails, auth logs | 1 year (3 months immediately accessible) | Daily review, integrity protection |
| HIPAA | Access to PHI, security events | 6 years | Access tracking, audit controls |
| SOX | Financial system access, changes | 7 years | Tamper-evident, non-repudiation |
| GDPR | Personal data processing | As long as necessary (minimize) | Right to erasure complicates logs |
| GLBA | Customer financial data access | 5 years | Access controls, disposal procedures |
| FISMA/FedRAMP | All security-relevant events | 90 days online, 1 year archive | Real-time alerting, monthly reviews |
| SOC 2 | Security events, access logs | 1 year recommended | Integrity, availability, confidentiality |
Effective retention policies balance multiple concerns: how far back investigators need to search, which compliance mandates apply to each log type, what online and archival storage costs, and whether legal holds can suspend deletion. The declarative policy below shows one way to encode these trade-offs:
```yaml
# Log retention policy configuration
# This example uses a declarative format that could be implemented
# by various log management systems

metadata:
  policy_name: security-log-retention
  version: "2.0"
  last_reviewed: "2024-01-15"
  next_review: "2024-07-15"
  owner: security-team@company.com
  approved_by: ciso@company.com

# Global defaults
defaults:
  online_retention_days: 90
  archive_retention_years: 7
  storage_tier_progression:
    hot_to_warm_days: 7
    warm_to_cold_days: 30
    cold_to_archive_days: 90
  integrity_protection: required
  encryption: required

# Per-log-type policies (override defaults)
log_types:
  # Authentication and access logs
  authentication:
    description: "Login attempts, SSO events, MFA events"
    online_retention_days: 180
    archive_retention_years: 7
    compliance_frameworks:
      - PCI-DSS
      - SOX
      - HIPAA
    alerting_required: true

  # Authorization and permission events
  authorization:
    description: "Access decisions, permission changes"
    online_retention_days: 180
    archive_retention_years: 7

  # Privileged operations
  privileged_access:
    description: "sudo, admin actions, elevated permissions"
    online_retention_days: 365
    archive_retention_years: 7
    review_frequency: weekly

  # Application logs (non-security)
  application:
    description: "General application logs"
    online_retention_days: 30
    archive_retention_years: 1

  # Network traffic logs
  network:
    description: "Firewall logs, flow data"
    online_retention_days: 90
    archive_retention_years: 3

  # Personal data processing (GDPR)
  personal_data:
    description: "Logs containing PII processing"
    online_retention_days: 90
    archive_retention_years: 3
    special_handling:
      - erasure_requests_honored
      - anonymization_on_archive
      - access_restricted

# Legal hold configuration
legal_hold:
  enabled: true
  notification_recipients:
    - legal@company.com
    - ciso@company.com
  hold_prevents_deletion: true
  hold_duration_override: unlimited

# Deletion procedures
deletion:
  method: cryptographic_erasure
  verification: required
  certificate_generated: true
  audit_log_retained: true    # Log of what was deleted, kept longer
```

GDPR gives individuals the right to erasure of personal data, but logs may contain personal data (usernames, IPs, actions). The standard approach is to ensure logs are retained only as long as necessary for legitimate purposes (security, compliance), clearly document retention justification, and anonymize or delete when retention expires. Never promise to instantly delete specific log entries on request—it's technically complex and may violate other retention requirements.
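One practical way to implement the `anonymization_on_archive` handling from the policy above is keyed pseudonymization: replace direct identifiers with an HMAC so analysts can still correlate events belonging to the same user or IP without seeing the original value, and destroy the key when the data must become truly anonymous. A minimal sketch follows; the field names and key handling are assumptions, and in practice the key would live in a KMS.

```python
import hashlib
import hmac

PSEUDONYM_KEY = b"rotate-and-store-me-in-a-kms"   # placeholder secret


def pseudonymize(value: str) -> str:
    """Deterministic, keyed replacement: same input -> same token, but not
    reversible without the key. Destroying the key completes anonymization."""
    return hmac.new(PSEUDONYM_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def anonymize_on_archive(event: dict) -> dict:
    """Pseudonymize identifier fields before writing to the archive tier."""
    sanitized = dict(event)
    for field in ("user", "source_ip"):            # assumed identifier fields
        if field in sanitized:
            sanitized[field] = pseudonymize(str(sanitized[field]))
    return sanitized


# Example
print(anonymize_on_archive({"user": "jsmith", "source_ip": "192.168.1.50",
                            "event_type": "authentication"}))
```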
Raw logs from different sources come in vastly different formats. A Windows Event Log looks nothing like a Linux audit log, which looks nothing like a cloud API audit trail. Normalization transforms diverse log formats into a common schema, enabling cross-source correlation and consistent queries.
Consider the same event—a failed login—from different sources:
```text
# Linux auth.log
Jan 15 14:32:47 db-prod-01 sshd[12345]: Failed password for jsmith from 192.168.1.50 port 22 ssh2

# Windows Security Event (XML)
<Event>
  <EventID>4625</EventID>
  <TimeCreated SystemTime="2024-01-15T14:32:47.123Z"/>
  <Computer>DB-PROD-01</Computer>
  <EventData>
    <Data Name="TargetUserName">jsmith</Data>
    <Data Name="IpAddress">192.168.1.50</Data>
    <Data Name="LogonType">10</Data>
    <Data Name="FailureReason">%%2313</Data>
  </EventData>
</Event>

# AWS CloudTrail (JSON)
{
  "eventTime": "2024-01-15T14:32:47Z",
  "eventSource": "signin.amazonaws.com",
  "eventName": "ConsoleLogin",
  "userIdentity": {"userName": "jsmith"},
  "sourceIPAddress": "192.168.1.50",
  "responseElements": {"ConsoleLogin": "Failure"}
}

# Normalized output (common schema)
{
  "timestamp": "2024-01-15T14:32:47.123456Z",
  "event_type": "authentication",
  "event_outcome": "failure",
  "source_host": "db-prod-01",
  "source_ip": "192.168.1.50",
  "user": "jsmith",
  "authentication_method": "password",
  "service": "ssh"
}
```

Several community schemas standardize these normalized field names:

| Standard | Origin | Key Features | Adoption |
|---|---|---|---|
| ECS (Elastic Common Schema) | Elastic | Nested JSON, comprehensive field set | High (ELK users) |
| OCSF (Open Cybersecurity Schema) | AWS, Splunk, others | Security-focused, event categories | Growing (cloud/security) |
| OSSEM (Open Source Security Events Metadata) | Community | Detection-focused, ATT&CK mapping | Medium (security researchers) |
| CEF (Common Event Format) | ArcSight/HP | Key-value pairs, legacy standard | Legacy (older SIEMs) |
| LEEF (Log Event Extended Format) | IBM QRadar | Tab-delimited, QRadar-native | Medium (IBM customers) |
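Before looking at a full pipeline, here is a compact Python sketch of the same normalization idea: parse the raw sshd line from the earlier example and emit ECS-style field names. The regular expression and field mapping are illustrative, not a complete parser.

```python
import re

SSHD_FAILED = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ [\d:]+) (?P<host>\S+) sshd\[\d+\]: "
    r"Failed password for (?P<user>\S+) from (?P<ip>[\d.]+) port (?P<port>\d+)"
)


def normalize_sshd(line: str) -> dict | None:
    """Map a raw auth.log line to a subset of ECS-style fields."""
    m = SSHD_FAILED.match(line)
    if not m:
        return None
    return {
        "event": {"category": "authentication", "outcome": "failure"},
        "host": {"name": m.group("host")},
        "user": {"name": m.group("user")},
        "source": {"address": m.group("ip"), "port": int(m.group("port"))},
        "message": line,
    }


raw = ("Jan 15 14:32:47 db-prod-01 sshd[12345]: "
       "Failed password for jsmith from 192.168.1.50 port 22 ssh2")
print(normalize_sshd(raw))
```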
Logstash is a powerful tool for parsing and normalizing diverse log formats:
```conf
# Logstash pipeline for log normalization

input {
  # From Kafka topics
  kafka {
    bootstrap_servers => "kafka:9092"
    topics            => ["raw-logs"]
    group_id          => "logstash-normalizer"
    codec             => json
  }
}

filter {
  # ================================================
  # LINUX AUTH LOG PARSING
  # ================================================
  if [log_source] == "linux-auth" {
    grok {
      match => {
        "message" => [
          # Failed password
          "%{SYSLOGTIMESTAMP:syslog_timestamp} %{HOSTNAME:source_host} sshd\[%{NUMBER:pid}\]: Failed password for %{USERNAME:user} from %{IP:source_ip} port %{NUMBER:source_port}",
          # Accepted password
          "%{SYSLOGTIMESTAMP:syslog_timestamp} %{HOSTNAME:source_host} sshd\[%{NUMBER:pid}\]: Accepted password for %{USERNAME:user} from %{IP:source_ip} port %{NUMBER:source_port}",
          # Invalid user
          "%{SYSLOGTIMESTAMP:syslog_timestamp} %{HOSTNAME:source_host} sshd\[%{NUMBER:pid}\]: Invalid user %{USERNAME:user} from %{IP:source_ip}"
        ]
      }
    }

    # Normalize to common schema
    mutate {
      add_field => {
        "[event][category]" => "authentication"
        "[event][type]"     => "start"
        "[source][address]" => "%{source_ip}"
        "[user][name]"      => "%{user}"
        "[host][name]"      => "%{source_host}"
      }
    }

    # Determine outcome
    if "Failed" in [message] or "Invalid" in [message] {
      mutate { add_field => { "[event][outcome]" => "failure" } }
    } else {
      mutate { add_field => { "[event][outcome]" => "success" } }
    }

    # Parse timestamp
    date {
      match  => ["syslog_timestamp", "MMM d HH:mm:ss", "MMM dd HH:mm:ss"]
      target => "@timestamp"
    }
  }

  # ================================================
  # WINDOWS SECURITY EVENT PARSING
  # ================================================
  if [log_source] == "windows-security" {
    xml {
      source => "message"
      target => "winlog"
    }

    # Map Windows Event ID to event type
    translate {
      field       => "[winlog][EventID]"
      destination => "[event][action]"
      dictionary  => {
        "4624" => "logon_success"
        "4625" => "logon_failure"
        "4648" => "explicit_credential_logon"
        "4672" => "special_privileges_assigned"
        "4688" => "process_created"
        "4720" => "user_account_created"
      }
    }

    # Normalize fields
    mutate {
      rename => {
        "[winlog][Computer]"                  => "[host][name]"
        "[winlog][EventData][TargetUserName]" => "[user][name]"
        "[winlog][EventData][IpAddress]"      => "[source][address]"
      }
      add_field => { "[event][category]" => "authentication" }
    }

    if [winlog][EventID] == "4625" {
      mutate { add_field => { "[event][outcome]" => "failure" } }
    } else if [winlog][EventID] == "4624" {
      mutate { add_field => { "[event][outcome]" => "success" } }
    }
  }

  # ================================================
  # ENRICH ALL EVENTS
  # ================================================

  # GeoIP enrichment for source IPs
  if [source][address] {
    geoip {
      source => "[source][address]"
      target => "[source][geo]"
    }
  }

  # Add processing metadata
  mutate {
    add_field => {
      "[ecs][version]"    => "8.0"
      "[event][ingested]" => "%{@timestamp}"
    }
  }

  # Remove temporary fields
  mutate {
    remove_field => ["syslog_timestamp", "pid", "message", "winlog"]
  }
}

output {
  elasticsearch {
    hosts    => ["elasticsearch:9200"]
    index    => "security-logs-%{+YYYY.MM.dd}"
    user     => "logstash"
    password => "${ES_PASSWORD}"
  }
}
```

Normalizing logs at ingestion time (rather than at query time) is almost always the right approach. It's computationally cheaper to parse once during ingestion than repeatedly during every query. It also ensures consistent field names and types, making queries reliable and correlation possible.
Enterprise log management systems must handle millions of events per second while remaining responsive for queries. The following patterns enable this scale.
Separate high-fanout collection from heavy processing: lightweight agents on each host do nothing but ship raw events to a small aggregation tier (dedicated collectors or a Kafka buffer), which in turn feeds parsing, enrichment, and storage writers. Each tier can then scale independently, and a slow storage backend never backs up onto production hosts. A minimal sketch of the aggregation step follows below.
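Here is a minimal sketch of the aggregation tier's core job, under the assumption that hosts emit individual events while the processing tier prefers fewer, larger batches: buffer events in memory and flush on either a size or an age threshold. Real aggregators (Fluentd, Vector, or Kafka consumers) add persistence and backpressure on top of this loop.

```python
import time
from typing import Any, Callable


class BatchingAggregator:
    """Collects events from many producers and flushes them in batches."""

    def __init__(self, flush_fn: Callable[[list[dict]], None],
                 max_batch: int = 500, max_age_s: float = 2.0):
        self.flush_fn = flush_fn          # e.g. bulk-write to storage or Kafka
        self.max_batch = max_batch
        self.max_age_s = max_age_s
        self.buffer: list[dict] = []
        self.oldest = time.monotonic()

    def submit(self, event: dict[str, Any]) -> None:
        """Accept one event; flush when the batch is big enough or old enough."""
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_batch
                or time.monotonic() - self.oldest >= self.max_age_s):
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []


# Example: print batch sizes instead of writing to a real backend
agg = BatchingAggregator(flush_fn=lambda batch: print(f"flushing {len(batch)} events"),
                         max_batch=3)
for i in range(7):
    agg.submit({"seq": i})
agg.flush()   # flush the trailing partial batch
```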
For extremely high-volume logs that don't require 100% capture, statistical sampling reduces volume while preserving visibility:
"""Log sampling strategies for high-volume environments.These techniques reduce volume while maintaining security visibility.""" import randomimport hashlibfrom typing import Dict, Any class LogSampler: """ Implements various sampling strategies for high-volume logs. """ def __init__(self, default_rate: float = 1.0): """ Initialize sampler with default rate (1.0 = 100% / no sampling). """ self.default_rate = default_rate self.priority_override_rules = {} def add_priority_rule(self, field: str, value: str, rate: float): """ Certain events should never be sampled (security-critical). Rate of 1.0 means always keep. """ self.priority_override_rules[(field, value)] = rate def should_sample(self, event: Dict[str, Any]) -> bool: """ Determine if an event should be included (not dropped). Returns True to keep, False to drop. """ # Priority events are never dropped for (field, value), rate in self.priority_override_rules.items(): if event.get(field) == value: return random.random() < rate # Apply default sampling rate return random.random() < self.default_rate def consistent_hash_sample(self, event: Dict[str, Any], key_field: str, rate: float) -> bool: """ Use consistent hashing so the same key always gets same decision. Important for keeping all events from a single session/user. """ key_value = event.get(key_field, "") hash_val = int(hashlib.sha256(str(key_value).encode()).hexdigest()[:8], 16) threshold = int(rate * 0xFFFFFFFF) return hash_val < threshold # Example usagesampler = LogSampler(default_rate=0.1) # 10% default sampling # Security events: NEVER samplesampler.add_priority_rule("event.category", "authentication", 1.0)sampler.add_priority_rule("event.outcome", "failure", 1.0)sampler.add_priority_rule("event.category", "intrusion_detection", 1.0)sampler.add_priority_rule("event.severity", "critical", 1.0)sampler.add_priority_rule("event.severity", "high", 1.0) # Debug logs: Sample aggressivelysampler.add_priority_rule("log.level", "debug", 0.01) # 1% only # Normal web traffic: Moderate samplingsampler.add_priority_rule("event.category", "web", 0.05) # 5% def process_log(event: Dict[str, Any]): """Process a log event with sampling.""" if sampler.should_sample(event): # Forward to storage send_to_storage(event) else: # Update sampling statistics update_sampling_metrics(event)In large organizations or SaaS platforms, isolate tenants for security, performance, and compliance:
Sampling should ONLY apply to high-volume, low-value events (debug logs, health checks, routine reads). Security-relevant events—authentication, authorization failures, privilege escalation, policy changes—must always be captured at 100%. A sampled-out attack is effectively undetected.
Effective log management is the backbone of security operations—without it, audit trails are useless because they can't be found, queried, or retained appropriately. To consolidate the key concepts: lightweight agents with persistent buffering collect logs at the source; reliable transports such as Kafka provide at-least-once delivery at scale; tiered storage balances query speed against cost; retention policies reconcile compliance mandates, investigation needs, and budget; normalization to a common schema makes cross-source correlation possible; and sampling, tiering, and tenant isolation keep the pipeline manageable as volume grows.
What's Next:
With logs properly collected, transported, stored, and retained, we now turn to intrusion detection—the systems and techniques that analyze log data to identify attacks in progress and security breaches.
You now understand the architecture and best practices for enterprise log management. This infrastructure forms the foundation for all security detection, investigation, and compliance activities.