The availability and nature of labels fundamentally shape the design of anomaly detection systems. Unlike standard classification, where labeled data is assumed, anomaly detection operates across a spectrum of supervision ranging from fully labeled to entirely unlabeled scenarios.
This distinction is not merely academic: it dictates which algorithms are applicable, how models must be evaluated, and how the resulting system is deployed and maintained.
Understanding this spectrum is essential for making sound architectural decisions in real-world anomaly detection systems.
Anomaly detection faces a fundamental label scarcity problem that distinguishes it from standard classification. Anomalies are rare by definition (often < 1% of data), expensive to label (requiring domain expertise), and may represent novel attack types never seen before. This scarcity drives the need for unsupervised and semi-supervised approaches.
Anomaly detection methods can be placed along a continuum based on the amount and type of supervision available during training.
The Five-Level Taxonomy:
Level 1: Fully Supervised
Level 2: Semi-Supervised (Normal-Only)
Level 3: Semi-Supervised (Partial Labels)
Level 4: Weakly Supervised
Level 5: Fully Unsupervised
| Level | Training Labels | Key Methods | Typical Performance |
|---|---|---|---|
| Fully Supervised | Normal + Anomaly | Random Forest, XGBoost, Neural Networks | Highest (if labels are representative) |
| Semi-Supervised (Normal-Only) | Normal only | One-Class SVM, Autoencoders, IF | High for seen anomaly types |
| Semi-Supervised (Partial) | Mostly Normal + Few Anomalies | PU Learning, Calibrated Unsupervised | Moderate to High |
| Weakly Supervised | Aggregate/Noisy Labels | MIL, Label Propagation | Moderate |
| Fully Unsupervised | None | Clustering, Density, Distance Methods | Variable (depends on assumptions) |
When both normal and anomalous instances are labeled, anomaly detection becomes a binary classification problem. However, this is not ordinary classification—extreme class imbalance introduces unique challenges.
Mathematical Formulation:
Given training data $\{(x_i, y_i)\}_{i=1}^n$ where $y_i \in \{0, 1\}$ (0 = normal, 1 = anomaly), we seek a classifier:
$$f: \mathcal{X} \rightarrow [0, 1]$$
that outputs the probability $P(y = 1 | x)$.
The Imbalance Challenge:
Typical class distributions in anomaly detection are severely skewed, with anomalies often making up well under 1% of the data.
This extreme imbalance biases standard classifiers toward the majority class, makes accuracy a misleading metric, and leaves very few anomaly examples to learn from.
Techniques for Handling Imbalance:
1. Resampling Methods
Oversampling the minority class: duplicate anomaly examples or synthesize new ones (e.g., SMOTE).
Undersampling the majority class: discard or subsample normal instances, trading data volume for balance.
Combination approaches: apply both, such as oversampling anomalies and then subsampling or cleaning the majority class.
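A minimal NumPy sketch of the two simplest resampling strategies (random duplication and random subsampling; SMOTE-style synthesis is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data: 990 normal (y=0), 10 anomalies (y=1).
X = rng.normal(size=(1000, 4))
y = np.r_[np.zeros(990, dtype=int), np.ones(10, dtype=int)]

def random_oversample(X, y, rng):
    """Duplicate minority-class rows until classes are balanced."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.r_[majority, minority, extra]
    return X[idx], y[idx]

def random_undersample(X, y, rng):
    """Subsample majority-class rows down to the minority count."""
    minority = np.flatnonzero(y == 1)
    majority = rng.choice(np.flatnonzero(y == 0), size=len(minority), replace=False)
    idx = np.r_[majority, minority]
    return X[idx], y[idx]

X_over, y_over = random_oversample(X, y, rng)
X_under, y_under = random_undersample(X, y, rng)
```

Oversampling preserves all normal data at the cost of repeated anomaly rows; undersampling keeps the set small but discards most normal examples.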
2. Cost-Sensitive Learning
Modify the loss function to penalize anomaly misclassification more heavily:
$$L = \sum_i C_{y_i} \cdot \ell(f(x_i), y_i)$$
where $C_1 \gg C_0$ (cost of missing anomaly >> cost of false alarm)
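The weighted loss above maps directly onto scikit-learn's `class_weight` parameter. A hedged sketch on synthetic data (the 50:1 cost ratio and the data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# 990 normal points around the origin, 10 anomalies shifted away.
X = np.r_[rng.normal(0, 1, size=(990, 2)), rng.normal(4, 1, size=(10, 2))]
y = np.r_[np.zeros(990, dtype=int), np.ones(10, dtype=int)]

# Missing an anomaly costs 50x more than a false alarm (C_1 >> C_0).
clf = LogisticRegression(class_weight={0: 1.0, 1: 50.0}).fit(X, y)
recall = clf.predict(X[y == 1]).mean()  # fraction of anomalies caught
```

`class_weight="balanced"` is a common alternative that sets the weights inversely proportional to class frequencies.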
3. Ensemble Methods
Build multiple classifiers on balanced subsets of the data (e.g., balanced bagging or EasyEnsemble), then aggregate their predictions.
Supervised anomaly detection is appropriate when: (1) you have sufficient labeled anomalies of all types you want to detect, (2) the anomaly distribution is stable over time, and (3) novel anomaly types are not a primary concern. These conditions are more common in domains like medical diagnosis (known disease types) than in security (evolving attack types).
Algorithm Selection for Supervised Anomaly Detection:
Gradient Boosting (XGBoost, LightGBM, CatBoost)
Random Forest with Class Weighting
Neural Networks with Focal Loss $$FL(p) = -\alpha (1-p)^\gamma \log(p)$$
Support Vector Machines with Cost-Sensitive Kernels
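The focal loss listed above can be sketched in a few lines of NumPy; here `p` is taken to be the predicted probability of the true class, and the `alpha`/`gamma` defaults are commonly used values rather than choices prescribed by this text:

```python
import numpy as np

def focal_loss(p, alpha=0.25, gamma=2.0, eps=1e-12):
    """FL(p) = -alpha * (1 - p)^gamma * log(p),
    where p is the predicted probability of the *true* class."""
    p = np.clip(p, eps, 1.0)
    return -alpha * (1.0 - p) ** gamma * np.log(p)

# A confident correct prediction is down-weighted almost to zero,
# while a confident mistake keeps a large loss.
easy = focal_loss(np.array([0.95]))
hard = focal_loss(np.array([0.05]))
```

The `(1 - p)^gamma` factor is what focuses training on hard, misclassified examples instead of the abundant easy negatives.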
Calibration Requirement:
Even with imbalance handling, classifier probabilities require calibration; Platt (sigmoid) scaling and isotonic regression are the standard techniques.
Well-calibrated probabilities enable proper threshold selection based on cost/benefit analysis.
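As one possible sketch, scikit-learn's `CalibratedClassifierCV` can wrap an imbalance-aware classifier (the data and model choices here are illustrative):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = np.r_[rng.normal(0, 1, size=(950, 3)), rng.normal(3, 1, size=(50, 3))]
y = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]

base = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                              random_state=0)
# Sigmoid (Platt) scaling fitted on internal cross-validation folds.
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3).fit(X, y)
proba = calibrated.predict_proba(X)[:, 1]  # calibrated P(y = 1 | x)
```

With calibrated probabilities, the alert threshold can be set directly from the cost/benefit ratio rather than tuned by trial and error.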
Unsupervised anomaly detection operates without any labels, relying on the fundamental assumption that anomalies differ in measurable ways from the bulk of the data.
Core Assumptions:
Unsupervised methods rest on one or more of these assumptions:
1. The Rarity Assumption Anomalies are rare compared to normal instances: $$|\{x : x \text{ is anomaly}\}| \ll |\{x : x \text{ is normal}\}|$$
Implication: Methods that characterize the majority capture "normal" behavior.
2. The Difference Assumption Anomalies have feature values or patterns that distinguish them from normal instances.
3. The Clustering Assumption Normal instances cluster together; anomalies are isolated: $$x_{normal} \in \text{cluster}, \quad x_{anomaly} \notin \text{any cluster}$$
Implication: Anomalies fail to belong to any natural grouping.
4. The Compressibility Assumption Normal data has regular structure that allows compression; anomalies are incompressible: $$\text{reconstruct}(x_{normal}) \approx x_{normal}$$ $$\text{reconstruct}(x_{anomaly}) \neq x_{anomaly}$$
Implication: High reconstruction error indicates anomaly (autoencoder principle).
Major Algorithm Families:
1. Distance-Based Methods
Anomalies are far from their neighbors:
$$s_{kNN}(x) = \frac{1}{k} \sum_{i=1}^{k} d(x, nn_i(x))$$
Examples: k-NN distance, Global Outlier Score
Strengths: Intuitive, only one main parameter (k) Weaknesses: O(n²) complexity, sensitive to the choice of distance metric
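A brute-force NumPy sketch of the $s_{kNN}$ score above (illustrative data; note the O(n²) distance matrix):

```python
import numpy as np

def knn_score(X, k=5):
    """s_kNN(x): mean distance to the k nearest neighbors (excluding self).
    Brute force: builds the full n x n distance matrix."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)        # a point is not its own neighbor
    knn = np.sort(D, axis=1)[:, :k]    # k smallest distances per row
    return knn.mean(axis=1)

rng = np.random.default_rng(3)
# 200 normal points plus one obvious outlier at (8, 8).
X = np.r_[rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]]
scores = knn_score(X, k=5)
```

For larger datasets a KD-tree or ball-tree neighbor search replaces the quadratic distance matrix.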
2. Density-Based Methods
Anomalies reside in low-density regions:
$$s_{LOF}(x) = \frac{\text{avg local density of neighbors}}{\text{local density of } x}$$
Examples: Local Outlier Factor (LOF), LOCI, LoOP
Strengths: Handles varying density clusters Weaknesses: O(n²) complexity, parameter sensitivity (k for neighborhood)
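A minimal sketch using scikit-learn's `LocalOutlierFactor` (the planted outlier and neighborhood size are illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(4)
X = np.r_[rng.normal(0, 1, size=(200, 2)), [[7.0, 7.0]]]

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # -1 for outliers, +1 for inliers
scores = -lof.negative_outlier_factor_  # ~1 for inliers, >> 1 for outliers
```

Note the sign convention: scikit-learn stores the *negative* outlier factor, so flipping the sign recovers the LOF ratio in the formula above.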
3. Tree-Based Methods
Anomalies are isolated quickly by random partitioning:
$$s_{IF}(x) = 2^{-\frac{\mathbb{E}[h(x)]}{c(n)}}$$
where $h(x)$ is path length in isolation tree, $c(n)$ is normalization factor.
Examples: Isolation Forest, Extended Isolation Forest
Strengths: O(n log n) complexity, few parameters, handles high dimensions Weaknesses: Axis-parallel cuts can miss angled patterns
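A minimal Isolation Forest sketch with scikit-learn; `score_samples` returns negated scores, so we flip the sign to match the convention that higher means more anomalous:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(5)
# 500 normal points plus 5 planted anomalies far from the bulk.
X = np.r_[rng.normal(0, 1, size=(500, 4)), rng.normal(6, 1, size=(5, 4))]

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
scores = -iso.score_samples(X)      # higher = shorter isolation path = more anomalous
top5 = np.argsort(scores)[-5:]      # indices of the 5 most anomalous points
```

The subsample size ψ defaults to 256 in scikit-learn (`max_samples="auto"`), which is what keeps training O(n log n).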
4. Reconstruction-Based Methods
Anomalies have high reconstruction error:
$$s_{AE}(x) = \|x - D(E(x))\|^2$$
Examples: Autoencoders, VAEs, PCA reconstruction error
Strengths: Learns complex manifold structures, deep learning scalability Weaknesses: Requires architecture tuning, can over-regularize (reconstruct anomalies too)
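As an illustrative stand-in for a trained autoencoder, PCA provides a linear encoder/decoder pair, and the same reconstruction-error score applies:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
# Normal data lies near a 2-D plane embedded in 5-D space.
Z = rng.normal(size=(300, 2))
W = rng.normal(size=(2, 5))
X_normal = Z @ W + 0.05 * rng.normal(size=(300, 5))
x_anom = rng.normal(size=(1, 5)) * 3.0           # off-manifold point
X = np.r_[X_normal, x_anom]

pca = PCA(n_components=2).fit(X_normal)          # "encoder/decoder" pair
X_hat = pca.inverse_transform(pca.transform(X))  # D(E(x))
errors = ((X - X_hat) ** 2).sum(axis=1)          # s_AE(x)
```

A nonlinear autoencoder generalizes this to curved manifolds, at the cost of the architecture tuning noted above.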
5. Clustering-Based Methods
Anomalies fail to belong to clusters:
$$s_{cluster}(x) = \min_c d(x, \mu_c)$$
Examples: DBSCAN noise points, k-means distance to centroid
Strengths: Intuitive interpretation Weaknesses: Sensitive to clustering hyperparameters
| Algorithm | Core Principle | Time Complexity | Space Complexity | Key Hyperparameter |
|---|---|---|---|---|
| k-NN Distance | Average distance to neighbors | O(n² d) | O(n d) | k (neighbors) |
| LOF | Ratio of local densities | O(n² d) | O(n k) | k (neighborhood size) |
| Isolation Forest | Isolation path length | O(n log n) | O(t n) | t (trees), ψ (subsample) |
| One-Class SVM | Maximum margin boundary | O(n²) to O(n³) | O(n_{SV} d) | ν, kernel params |
| Autoencoder | Reconstruction error | O(n × epochs) | O(weights) | architecture, λ |
| DBSCAN | Core/border/noise partition | O(n log n) | O(n) | ε, minPts |
A critical issue in unsupervised anomaly detection is training set contamination: if anomalies are present in training data (which they typically are, since we have no labels), the model may learn to treat them as normal. Robust methods assume some contamination and aim to identify the majority structure despite it. Isolation Forest and robust autoencoders handle moderate contamination well.
Semi-supervised approaches occupy the middle ground, leveraging limited label information along with the structure of unlabeled data. Two major paradigms dominate this space.
Paradigm 1: Normal-Only Training (Novelty Detection)
The most common semi-supervised setting: training data contains only normal instances (or is assumed to be predominantly normal). The goal is to learn a model of normality, then detect departures from it.
Mathematical Formulation:
Given $D_{train} = \{x_1, \ldots, x_n\}$ assumed normal, learn:
$$f(x) = P(x \text{ is normal})$$
or equivalently, a score function where higher values indicate anomaly:
$$s(x) = \text{degree of deviation from normality}$$
Key Algorithms:
One-Class SVM
Find a hyperplane (or hypersphere) in kernel space that separates data from the origin with maximum margin:
$$\min_{w, \rho, \xi} \frac{1}{2}\|w\|^2 + \frac{1}{\nu n} \sum_i \xi_i - \rho$$ $$\text{s.t. } \langle w, \phi(x_i) \rangle \geq \rho - \xi_i, \quad \xi_i \geq 0$$
Points on the wrong side of the boundary are anomalies.
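A minimal One-Class SVM sketch with scikit-learn (training data assumed normal; the planted test outlier is illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)
X_train = rng.normal(0, 1, size=(500, 2))          # assumed all normal
X_test = np.r_[rng.normal(0, 1, size=(50, 2)), [[6.0, 6.0]]]

# nu upper-bounds the fraction of training points treated as outliers.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)
pred = ocsvm.predict(X_test)                       # +1 normal, -1 anomaly
```

The ν parameter plays the same role as the $\frac{1}{\nu n}$ term in the objective above: smaller ν yields a tighter boundary around the training data.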
Deep SVDD (Support Vector Data Description)
Learn a neural network that maps data to a hypersphere-containing representation:
$$\min_{W} \frac{1}{n} \sum_i \|\phi(x_i; W) - c\|^2 + \lambda \|W\|^2$$
where $c$ is the center of the learned sphere. Points far from center after training are anomalies.
Autoencoders
Train encoder-decoder on normal data:
$$\min_{E, D} \sum_i \|x_i - D(E(x_i))\|^2$$
The network learns to reconstruct normal patterns. High reconstruction error indicates anomaly.
Paradigm 2: Positive-Unlabeled (PU) Learning
When we have a few labeled anomalies and many unlabeled instances, we are in the positive-unlabeled (PU) setting.
Key Insight: The unlabeled set contains both classes in unknown proportions.
PU Learning Formulation:
Let $\pi = P(Y = 1)$ be the (unknown) prevalence of anomalies. Rewriting the classification risk enables learning from positive and unlabeled examples alone:
$$R(f) = \pi R_P(f) + (1 - \pi) R_N(f)$$
where $R_P$ is risk on positives, $R_N$ is risk on negatives.
Since we don't have labeled negatives: $$R_N(f) = \frac{R_U(f) - \pi R_P(f)}{1 - \pi}$$
This allows unbiased estimation of classification risk using only P and U examples.
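A small numerical check of this rewriting under 0-1 loss, on synthetic data where the true labels and $\pi$ are known (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
pi = 0.1                                  # true anomaly prevalence P(Y = 1)
n = 20000

# Ground-truth labels are hidden in real PU learning; we keep them here
# only to verify the estimate at the end.
y = (rng.random(n) < pi).astype(int)
x = rng.normal(loc=2.0 * y)               # anomalies shifted right
f = (x > 1.0).astype(int)                 # a fixed candidate classifier

# Risk pieces under 0-1 loss, scoring points against the *negative* label:
R_U = (f != 0).mean()                     # whole sample stands in for U
R_P = (f[y == 1] != 0).mean()             # labeled positives
R_N_est = (R_U - pi * R_P) / (1 - pi)     # the PU rewriting above

R_N_true = (f[y == 0] != 0).mean()        # unavailable in practice
```

With enough data the estimate matches the true negative-class risk closely, which is exactly what makes PU risk estimation usable without labeled negatives.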
Practical PU learning approaches include two-step methods (identify reliable negatives among the unlabeled data, then train a standard classifier) and direct risk-estimation methods based on the rewriting above.
Paradigm 3: Active Learning
Iteratively select instances for labeling to maximize information gain, for example by querying the instances the current model is least certain about or the highest-scoring unreviewed alerts.
This approach efficiently allocates labeling budget to maximize performance improvement.
Selecting the appropriate supervision level requires careful analysis of your problem context, data availability, and operational constraints.
Decision Framework:
START: What labels do you have?
├── Both normal AND anomaly labels (sufficient quantity)?
│ ├── Are anomaly types stable over time?
│ │ ├── YES → SUPERVISED APPROACH
│ │ │ └── Binary classification with imbalance handling
│ │ └── NO → HYBRID APPROACH
│ │ └── Supervised for known types + Unsupervised for novelty
│
├── Only normal labels (clean normal data available)?
│ └── SEMI-SUPERVISED (Normal-Only)
│ └── One-Class SVM, Deep SVDD, Autoencoders
│
├── Few anomaly labels + many unlabeled?
│ └── SEMI-SUPERVISED (PU Learning)
│ └── PU learning, active learning
│
└── No labels at all?
└── UNSUPERVISED APPROACH
└── Isolation Forest, LOF, Clustering
Key Considerations:
Label Quality vs Quantity
Anomaly Stability
Deployment Environment
Interpretability Requirements
| Scenario | Best Approach | Key Algorithms | Expected Performance |
|---|---|---|---|
| Abundant labels, stable anomalies | Supervised | XGBoost, Random Forest | Highest detection rate |
| Clean normal data only | Semi-supervised | One-Class SVM, Deep SVDD | Good for all anomaly types |
| Few labeled anomalies | PU Learning | PU classifiers, active learning | Improves with more labels |
| No labels, assume rarity | Unsupervised | Isolation Forest, LOF | Depends on assumptions |
| Evolving attack types | Hybrid | Supervised + Novelty detection | Robust to evolution |
Real-world systems rarely operate in pure supervision paradigms. Hybrid architectures combine multiple approaches to achieve robust detection across both known and novel anomalies.
Architecture Pattern: Known + Novel Detection
┌────────────────────────────────────────────────────────┐
│ Incoming Data │
└────────────────────────────────────────────────────────┘
│
┌───────────────┼───────────────┐
│ │
▼ ▼
┌──────────────────┐ ┌──────────────────┐
│ SUPERVISED │ │ UNSUPERVISED │
│ DETECTOR │ │ DETECTOR │
│ │ │ │
│ - Trained on │ │ - No labels │
│ labeled │ │ - Catches novel │
│ anomalies │ │ anomaly types │
│ - High precision │ │ - May have more │
│ on known types │ │ false positives│
└──────────────────┘ └──────────────────┘
│ │
│ ┌──────────────────┐ │
└────► SCORE FUSION ◄───────┘
│ │
│ - Weighted sum │
│ - Max-of-scores │
│ - Meta-learner │
└──────────────────┘
│
▼
┌──────────────────┐
│ NOVEL ANOMALY │
│ FEEDBACK LOOP │
│ │
│ - Expert review │
│ - Add to labeled │
│ training set │
│ - Retrain │
└──────────────────┘
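The SCORE FUSION stage in the diagram can be sketched in NumPy; the min-max normalization and the 0.6 weight are illustrative choices, not prescribed:

```python
import numpy as np

def minmax(s):
    """Rescale scores to [0, 1] so detectors are comparable before fusion."""
    s = np.asarray(s, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def fuse(sup_scores, unsup_scores, w_sup=0.6, mode="weighted"):
    """Combine supervised and unsupervised anomaly scores."""
    a, b = minmax(sup_scores), minmax(unsup_scores)
    if mode == "weighted":
        return w_sup * a + (1.0 - w_sup) * b
    return np.maximum(a, b)     # max-of-scores: either detector can fire

sup = np.array([0.1, 0.9, 0.2, 0.1])     # high on a *known* anomaly type
unsup = np.array([0.2, 0.1, 0.1, 0.95])  # high on a *novel* pattern
fused = fuse(sup, unsup, mode="max")
```

Max-of-scores favors recall (either detector alone can raise an alert); the weighted sum and a trained meta-learner trade some recall for fewer false positives.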
Key Hybrid Strategies:
1. Parallel Ensemble Run supervised and unsupervised detectors in parallel and fuse their scores (weighted sum, max-of-scores, or a meta-learner).
2. Cascade Architecture Use an efficient unsupervised filter followed by an expensive supervised classifier: the cheap stage discards obviously normal traffic so the costly model scores only the remaining candidates.
3. Continuous Learning Loop Integrate human feedback to improve both components: analyst-confirmed detections become labeled training data for the supervised detector.
Hybrid architectures naturally address concept drift—the evolution of normal and anomalous patterns over time. The unsupervised component adapts to distribution shifts (since it learns from recent data), while the supervised component maintains memory of known attack patterns. The feedback loop allows the supervised component to incorporate newly discovered anomaly types.
Case Study: Enterprise Security Operations Center
A large enterprise deploys hybrid anomaly detection for security monitoring.
Architecture:
Supervised Stream:
Unsupervised Stream:
Fusion Logic:
Feedback Loop:
Results:
The hybrid approach catches novel attacks missed by supervised while controlling false positives better than unsupervised alone.
Evaluation methodology must adapt to the supervision level, as different approaches have fundamentally different assumptions and failure modes.
Supervised Evaluation:
Standard classification metrics apply, with imbalance awareness: prefer PR-AUC and precision/recall at the operating threshold over accuracy or raw ROC-AUC.
Validation strategy: use temporal or stratified splits so the rare anomaly class is represented in every evaluation fold.
Unsupervised Evaluation Challenge:
Without labels, evaluation is fundamentally limited. Options:
External Evaluation (when test labels available): compute precision@k and PR-AUC against the held-out labels.
Internal Evaluation (no labels): assess the stability of anomaly rankings across subsamples and hyperparameter settings.
Semi-supervised Evaluation:
For normal-only training: measure detection rate by anomaly type on a holdout set containing both classes, and check for contamination in the nominally normal training data.
Cross-Validation Pitfalls:
| Supervision | Primary Metrics | Validation Strategy | Key Pitfall |
|---|---|---|---|
| Supervised | PR-AUC, F1 | Temporal/stratified split | Overfitting to specific anomaly types |
| Semi-supervised | Detection rate by type | Holdout with both classes | Contamination in normal training |
| Unsupervised | External: PR@k; Internal: stability | Ranking comparison | No absolute performance measure |
| Hybrid | Component + combined metrics | End-to-end temporal evaluation | Attribution of performance gains |
This comprehensive exploration of supervision levels equips you with the framework to select appropriate detection strategies based on your data and operational context.
Path Forward:
With the supervision spectrum understood, we now tackle one of the most challenging aspects of anomaly detection: evaluation. The next page explores the unique challenges of evaluating anomaly detectors, from the extreme class imbalance that renders accuracy useless to the label scarcity that makes traditional validation infeasible.
You have mastered the spectrum of supervision in anomaly detection. You can now diagnose which paradigm fits your problem, select appropriate algorithms, design hybrid architectures for robust detection, and apply correct evaluation methodologies for each approach.