While autoencoders remain popular for anomaly detection, the deep learning toolkit offers many more powerful approaches. This page explores advanced neural network methods that address limitations of reconstruction-based approaches and push the boundaries of what's possible in anomaly detection.
We'll examine Deep SVDD and its variants, GAN-based detection (AnoGAN, GANomaly), self-supervised and pretrained representations, and hybrid architectures that pair deep features with classical detectors, closing with guidance on choosing among them.
These methods address scenarios where autoencoders struggle: complex high-dimensional data, multimodal distributions, and cases requiring fine-grained anomaly localization.
By the end of this page, you will understand: (1) Deep SVDD and its neural one-class formulation, (2) How GANs can be repurposed for anomaly detection, (3) Self-supervised pretext tasks that create anomaly-sensitive representations, (4) Hybrid architectures combining deep and classical methods, (5) Attention mechanisms for interpretable anomaly detection, and (6) How to choose among these methods for different scenarios.
Deep Support Vector Data Description (Deep SVDD) combines the geometric intuition of SVDD with the representation learning power of deep neural networks. Instead of working in a fixed kernel space, Deep SVDD learns a neural network mapping that places normal data close to a center while pushing anomalies away.
Core Idea:
Learn a neural network φ(x; W) such that normal data maps close to a center c in the output space:
$$\min_W \frac{1}{n} \sum_{i=1}^{n} ||\phi(x_i; W) - c||^2 + \frac{\lambda}{2} \sum_{l} ||W^l||_F^2$$
The network learns representations where normal data clusters tightly around c, while anomalies—having different characteristics—map further away.
Key Components: a neural feature map φ(·; W), a fixed center c in the output space, and weight decay on the network weights; the soft-boundary variant adds a learned radius R.
The Collapse Problem: a naive implementation can collapse into the trivial solution φ(x) = c for all inputs ("hypersphere collapse"), achieving zero loss while being useless for detection.
Preventions:
• Remove bias terms from the network layers
• Avoid bounded activations (tanh, sigmoid): saturated units can emulate bias terms and enable collapse
• Add auxiliary regularization tasks
• Initialize c as the mean of pretrained representations, away from the origin
• Use the soft-boundary variant with a learned radius
Accordingly, the Deep SVDD paper recommends a bias-free network with unbounded activations such as LeakyReLU.
Variants of Deep SVDD:
1. Soft-Boundary Deep SVDD:
Instead of a fixed objective, learn both the center and a data-enclosing hypersphere:
$$\min_{W, R} R^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \max(0, ||\phi(x_i) - c||^2 - R^2) + \lambda ||W||^2$$
This mirrors classic soft-margin SVDD but with learned representations.
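A minimal sketch of how this variant could be implemented, assuming `z` is a batch of embeddings from a network like the one in the full example below; the radius update uses the common quantile heuristic (set R to the (1 - nu)-quantile of distances to the center every few epochs) rather than an exact line search, and the weight-decay term is left to the optimizer:

```python
import torch

def soft_boundary_loss(z, center, R, nu):
    """Soft-boundary Deep SVDD loss for a batch of embeddings z:
    R^2 + (1 / (nu * n)) * sum(max(0, ||z - c||^2 - R^2))."""
    dist_sq = torch.sum((z - center) ** 2, dim=1)
    penalty = torch.clamp(dist_sq - R ** 2, min=0)
    return R ** 2 + (1.0 / (nu * z.size(0))) * torch.sum(penalty)

def update_radius(z, center, nu):
    """Heuristic radius update: the (1 - nu)-quantile of distances to the center."""
    with torch.no_grad():
        dist = torch.sqrt(torch.sum((z - center) ** 2, dim=1))
        return torch.quantile(dist, 1 - nu)
```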
2. Deep SAD (Semi-Supervised):
When a few labeled anomalies are available, incorporate them:
$$\min_W \frac{1}{n+m} \sum_{i=1}^{n} ||\phi(x_i; W) - c||^2 + \frac{\eta}{n+m} \sum_{j=1}^{m} \left( ||\phi(\tilde{x}_j; W) - c||^2 \right)^{\tilde{y}_j} + \frac{\lambda}{2} \sum_{l} ||W^l||_F^2$$
where the first sum runs over the n unlabeled samples and the second over the m labeled samples (x̃ⱼ, ỹⱼ). Labeled normals (ỹⱼ = +1) are pulled toward the center as usual, while labeled anomalies (ỹⱼ = -1) have their distance term inverted, so minimizing the objective pushes them away from the center; η balances the two terms.
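A minimal sketch of the resulting loss term (without the weight-decay part), assuming `z_unlabeled` and `z_labeled` are embeddings from the same network and `y_labeled` contains +1 for labeled normal and -1 for labeled anomalous samples; the names are illustrative:

```python
import torch

def deep_sad_loss(z_unlabeled, z_labeled, y_labeled, center, eta=1.0, eps=1e-6):
    """Deep SAD objective: unlabeled samples are pulled toward the center;
    labeled anomalies (y = -1) get an inverted distance term, pushing them away."""
    n = z_unlabeled.size(0) + z_labeled.size(0)
    dist_u = torch.sum((z_unlabeled - center) ** 2, dim=1)
    dist_l = torch.sum((z_labeled - center) ** 2, dim=1)
    # (dist^2)^y: the distance itself for y = +1, its reciprocal for y = -1
    labeled_term = torch.where(y_labeled == 1, dist_l, 1.0 / (dist_l + eps))
    return dist_u.sum() / n + eta * labeled_term.sum() / n
```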
3. Contrastive Deep SVDD:
Use contrastive learning to create more discriminative representations before applying the one-class objective.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np


class DeepSVDDNetwork(nn.Module):
    """
    Feature extraction network for Deep SVDD.

    Following the original paper:
    - No bias terms to prevent trivial solutions
    - LeakyReLU activations
    - Batch normalization (without affine parameters) for stable training
    """
    def __init__(self, input_dim, hidden_dims=[128, 64], output_dim=32):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim, bias=False))
            layers.append(nn.BatchNorm1d(hidden_dim, affine=False))
            layers.append(nn.LeakyReLU(0.1))
            prev_dim = hidden_dim
        # Final layer to output space
        layers.append(nn.Linear(prev_dim, output_dim, bias=False))
        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)


class DeepSVDD:
    """
    Deep Support Vector Data Description for anomaly detection.

    Learns a neural network that maps normal data close to a center c,
    while anomalies map further away.
    """
    def __init__(
        self,
        input_dim,
        hidden_dims=[128, 64],
        output_dim=32,
        nu=0.1,  # For soft-boundary variant / threshold quantile
        device='cuda' if torch.cuda.is_available() else 'cpu'
    ):
        self.device = device
        self.nu = nu
        self.hidden_dims = hidden_dims
        self.output_dim = output_dim

        # Feature extraction network
        self.net = DeepSVDDNetwork(input_dim, hidden_dims, output_dim).to(device)

        # Center will be initialized during training
        self.center = None
        self.R = None  # Radius for the soft-boundary variant

    def init_center(self, dataloader, eps=0.1):
        """
        Initialize center c as the mean of network outputs on training data.

        Components close to zero are shifted by a small epsilon to avoid a
        center at the origin (which could encourage collapse).
        """
        self.net.eval()
        outputs = []
        with torch.no_grad():
            for batch in dataloader:
                x = batch[0].to(self.device)
                z = self.net(x)
                outputs.append(z)
        outputs = torch.cat(outputs, dim=0)
        center = outputs.mean(dim=0)

        # Avoid center components being (close to) zero
        center[(torch.abs(center) < eps) & (center >= 0)] = eps
        center[(torch.abs(center) < eps) & (center < 0)] = -eps

        self.center = center.detach()
        return self.center

    def fit(self, X_train, epochs=100, batch_size=64, lr=1e-3,
            weight_decay=1e-6, pretrain_epochs=50, pretrain_ae=True):
        """
        Train Deep SVDD.

        Optionally pretrain as an autoencoder to initialize network weights.
        """
        from torch.utils.data import DataLoader, TensorDataset

        dataset = TensorDataset(torch.FloatTensor(X_train))
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        # Optional: pretrain as autoencoder
        if pretrain_ae and pretrain_epochs > 0:
            print("Pretraining as autoencoder...")
            self._pretrain_autoencoder(dataloader, pretrain_epochs, lr)

        # Initialize center from the pretrained/initial network
        print("Initializing center...")
        self.init_center(dataloader)

        # Main Deep SVDD training
        optimizer = torch.optim.Adam(
            self.net.parameters(), lr=lr, weight_decay=weight_decay
        )
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

        print("Training Deep SVDD...")
        self.net.train()
        for epoch in range(epochs):
            total_loss = 0
            n_batches = 0
            for batch in dataloader:
                x = batch[0].to(self.device)

                # Forward pass
                z = self.net(x)

                # One-class loss: squared distance to center
                dist_sq = torch.sum((z - self.center) ** 2, dim=1)
                loss = torch.mean(dist_sq)

                # Backward pass
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total_loss += loss.item()
                n_batches += 1

            scheduler.step()

            if epoch % 20 == 0:
                avg_loss = total_loss / n_batches
                print(f"Epoch {epoch}: loss={avg_loss:.6f}")

        return self

    def _pretrain_autoencoder(self, dataloader, epochs, lr):
        """Pretrain the network as an autoencoder for better initialization."""
        input_dim = next(iter(dataloader))[0].shape[1]

        # Build a decoder that mirrors the encoder's layer sizes
        dims = [self.output_dim] + list(reversed(self.hidden_dims))
        layers = []
        for in_dim, out_dim in zip(dims[:-1], dims[1:]):
            layers += [
                nn.Linear(in_dim, out_dim, bias=False),
                nn.BatchNorm1d(out_dim, affine=False),
                nn.LeakyReLU(0.1),
            ]
        layers.append(nn.Linear(dims[-1], input_dim, bias=False))
        decoder = nn.Sequential(*layers).to(self.device)

        params = list(self.net.parameters()) + list(decoder.parameters())
        optimizer = torch.optim.Adam(params, lr=lr)

        for epoch in range(epochs):
            for batch in dataloader:
                x = batch[0].to(self.device)
                z = self.net(x)
                x_recon = decoder(z)
                loss = F.mse_loss(x_recon, x)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

    def decision_function(self, X):
        """
        Compute anomaly scores (squared distance to center).
        Higher = more anomalous.
        """
        self.net.eval()
        X_tensor = torch.FloatTensor(X).to(self.device)
        with torch.no_grad():
            z = self.net(X_tensor)
            dist_sq = torch.sum((z - self.center) ** 2, dim=1)
        return dist_sq.cpu().numpy()

    def predict(self, X, threshold=None):
        """Predict normal (1) or anomaly (-1)."""
        scores = self.decision_function(X)
        if threshold is None:
            # Quantile-based threshold: flag the top nu fraction as anomalies
            threshold = np.percentile(scores, (1 - self.nu) * 100)
        return np.where(scores <= threshold, 1, -1)


# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_moons
    from sklearn.metrics import roc_auc_score, classification_report

    # Training data (normal)
    X_normal, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
    X_normal = X_normal.astype(np.float32)

    # Test data
    X_test_normal, _ = make_moons(n_samples=100, noise=0.05, random_state=43)
    X_anomalies = np.random.uniform(-2, 3, size=(50, 2)).astype(np.float32)
    X_test = np.vstack([X_test_normal, X_anomalies])
    y_test = np.array([1] * 100 + [-1] * 50)

    # Train Deep SVDD
    model = DeepSVDD(input_dim=2, hidden_dims=[64, 32], output_dim=16, nu=0.1)
    model.fit(X_normal, epochs=100, pretrain_epochs=30)

    # Evaluate
    scores = model.decision_function(X_test)
    predictions = model.predict(X_test)

    print("\nClassification Report:")
    print(classification_report(y_test, predictions, target_names=['Anomaly', 'Normal']))
    print(f"AUROC: {roc_auc_score(y_test == -1, scores):.4f}")
```

Generative Adversarial Networks offer a fundamentally different paradigm for anomaly detection. Instead of learning to reconstruct data, GANs learn to generate it. The insight: if a GAN trained on normal data can't generate something similar to a test sample, that sample is likely anomalous.
GAN Architecture Recap:
$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$$
Anomaly Detection Approaches with GANs:
| Method | Anomaly Score | Key Idea |
|---|---|---|
| AnoGAN | Reconstruction + Discriminator | Find z that best reconstructs input; anomalies can't be reconstructed |
| f-AnoGAN | Encoder + Reconstruction | Train encoder to map images to latent space directly (faster) |
| GANomaly | Latent + Reconstruction | Encoder-decoder-encoder architecture; measure latent consistency |
| Discriminator-only | D(x) score | Discriminator outputs anomaly probability directly |
| EGBAD | BiGAN framework | Bidirectional mapping between data and latent space |
AnoGAN: The Foundation
AnoGAN (Anomaly GAN) was the first major work on GAN-based anomaly detection. The core idea: train a GAN on normal data only; at test time, search the latent space for the code z* whose generated sample G(z*) best matches the input, and score the input by how poorly even this best match fits (a weighted sum of the reconstruction residual and a discriminator feature residual).
The Intuition:
A well-trained GAN's generator can only produce samples in the normal data distribution. If x is normal, there exists a z that generates something close to x. If x is anomalous, no z can generate it well.
Limitation: AnoGAN requires expensive iterative optimization at test time (finding z* via gradient descent). This makes inference slow.
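Because the full code example below implements GANomaly rather than AnoGAN, here is a minimal sketch of AnoGAN-style test-time scoring. It assumes a trained `generator` and a `discriminator` that returns `(validity, features)` as in the GANomaly code below, CPU tensors, and a `lam` weight on the discriminator feature residual:

```python
import torch

def anogan_score(x, generator, discriminator, latent_dim,
                 n_steps=500, lr=1e-2, lam=0.1):
    """AnoGAN-style scoring: find the z whose generation best matches x,
    then score x by how poor that best match is."""
    # Freeze the GAN; only the latent code z is optimized
    for p in list(generator.parameters()) + list(discriminator.parameters()):
        p.requires_grad_(False)
    z = torch.randn(x.size(0), latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)
    _, feat_real = discriminator(x)
    for _ in range(n_steps):
        optimizer.zero_grad()
        x_gen = generator(z)
        _, feat_gen = discriminator(x_gen)
        residual = torch.abs(x - x_gen).sum()               # reconstruction residual
        feat_loss = torch.abs(feat_real - feat_gen).sum()   # discriminator feature residual
        loss = (1 - lam) * residual + lam * feat_loss
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        x_gen = generator(z)
        _, feat_gen = discriminator(x_gen)
        per_sample = ((1 - lam) * torch.abs(x - x_gen).sum(dim=1)
                      + lam * torch.abs(feat_real - feat_gen).sum(dim=1))
    return per_sample.numpy()
```

The hundreds of gradient steps per test batch are exactly the inference cost that f-AnoGAN and GANomaly remove by training an encoder.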
```python
import torch
import torch.nn as nn
import numpy as np


class Encoder(nn.Module):
    """Encoder G_E: maps input to latent space."""
    def __init__(self, input_dim, latent_dim, hidden_dims=[128, 64]):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for h_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, h_dim),
                nn.BatchNorm1d(h_dim),
                nn.LeakyReLU(0.2)
            ])
            prev_dim = h_dim
        layers.append(nn.Linear(prev_dim, latent_dim))
        self.encoder = nn.Sequential(*layers)

    def forward(self, x):
        return self.encoder(x)


class Decoder(nn.Module):
    """Decoder G_D: maps latent code to reconstructed input."""
    def __init__(self, latent_dim, output_dim, hidden_dims=[64, 128]):
        super().__init__()
        layers = []
        prev_dim = latent_dim
        for h_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, h_dim),
                nn.BatchNorm1d(h_dim),
                nn.LeakyReLU(0.2)
            ])
            prev_dim = h_dim
        layers.append(nn.Linear(prev_dim, output_dim))
        # Bound outputs to [-1, 1] (as in the original image-based GANomaly;
        # standardized tabular features can exceed this range)
        layers.append(nn.Tanh())
        self.decoder = nn.Sequential(*layers)

    def forward(self, z):
        return self.decoder(z)


class Discriminator(nn.Module):
    """Discriminator that also exposes intermediate features (for feature matching)."""
    def __init__(self, input_dim, hidden_dims=[128, 64]):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(input_dim, hidden_dims[0]),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden_dims[0], hidden_dims[1]),
            nn.LeakyReLU(0.2),
        )
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dims[1], 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        features = self.features(x)
        validity = self.classifier(features)
        return validity, features


class GANomaly:
    """
    GANomaly for anomaly detection.

    Architecture: Encoder1 -> Decoder -> Encoder2
    Anomaly score: difference between Encoder1(x) and Encoder2(G(Encoder1(x)))

    The idea: for normal data, the two latent codes should be consistent.
    For anomalies, Encoder2 produces different codes than Encoder1.
    """
    def __init__(
        self,
        input_dim,
        latent_dim=32,
        hidden_dims=[128, 64],
        device='cuda' if torch.cuda.is_available() else 'cpu'
    ):
        self.device = device
        self.latent_dim = latent_dim

        # Generator: Encoder1 -> Decoder
        self.encoder1 = Encoder(input_dim, latent_dim, hidden_dims).to(device)
        self.decoder = Decoder(latent_dim, input_dim, list(reversed(hidden_dims))).to(device)

        # Second encoder: maps the reconstructed input back to latent space
        self.encoder2 = Encoder(input_dim, latent_dim, hidden_dims).to(device)

        # Discriminator
        self.discriminator = Discriminator(input_dim, hidden_dims).to(device)

        self.threshold = None

    def generator_forward(self, x):
        """Full generator pass: x -> z1 -> x_hat -> z2"""
        z1 = self.encoder1(x)
        x_hat = self.decoder(z1)
        z2 = self.encoder2(x_hat)
        return x_hat, z1, z2

    def fit(self, X_train, epochs=100, batch_size=64, lr=2e-4,
            w_adv=1.0, w_con=50.0, w_enc=1.0):
        """
        Train GANomaly.

        Loss components:
        - Adversarial: feature-matching loss against the discriminator
        - Contextual: reconstruction ||x - x_hat||
        - Encoder: latent consistency ||z1 - z2||
        """
        from torch.utils.data import DataLoader, TensorDataset

        # Standardize features (zero mean, unit variance)
        X_mean, X_std = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
        X_normalized = (X_train - X_mean) / X_std
        self.X_mean, self.X_std = torch.FloatTensor(X_mean), torch.FloatTensor(X_std)

        dataset = TensorDataset(torch.FloatTensor(X_normalized))
        dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

        # Optimizers
        gen_params = (list(self.encoder1.parameters())
                      + list(self.decoder.parameters())
                      + list(self.encoder2.parameters()))
        opt_g = torch.optim.Adam(gen_params, lr=lr, betas=(0.5, 0.999))
        opt_d = torch.optim.Adam(self.discriminator.parameters(), lr=lr, betas=(0.5, 0.999))

        criterion_bce = nn.BCELoss()
        criterion_l1 = nn.L1Loss()
        criterion_l2 = nn.MSELoss()

        print("Training GANomaly...")
        for epoch in range(epochs):
            total_g_loss = 0
            total_d_loss = 0
            n_batches = 0

            for batch in dataloader:
                x_real = batch[0].to(self.device)
                batch_size_curr = x_real.size(0)

                # Labels
                real_label = torch.ones(batch_size_curr, 1, device=self.device)
                fake_label = torch.zeros(batch_size_curr, 1, device=self.device)

                # ---------------------
                #  Train Discriminator
                # ---------------------
                opt_d.zero_grad()

                # Real samples
                pred_real, _ = self.discriminator(x_real)
                loss_d_real = criterion_bce(pred_real, real_label)

                # Fake samples
                x_fake, _, _ = self.generator_forward(x_real)
                pred_fake, _ = self.discriminator(x_fake.detach())
                loss_d_fake = criterion_bce(pred_fake, fake_label)

                loss_d = (loss_d_real + loss_d_fake) / 2
                loss_d.backward()
                opt_d.step()

                # ---------------------
                #  Train Generator
                # ---------------------
                opt_g.zero_grad()

                x_hat, z1, z2 = self.generator_forward(x_real)

                # Adversarial loss (feature matching against the discriminator)
                _, feat_fake = self.discriminator(x_hat)
                _, feat_real = self.discriminator(x_real)
                loss_adv = criterion_l2(feat_fake, feat_real)

                # Contextual loss (reconstruction)
                loss_con = criterion_l1(x_hat, x_real)

                # Encoder loss (latent consistency)
                loss_enc = criterion_l2(z2, z1)

                # Total generator loss
                loss_g = w_adv * loss_adv + w_con * loss_con + w_enc * loss_enc
                loss_g.backward()
                opt_g.step()

                total_g_loss += loss_g.item()
                total_d_loss += loss_d.item()
                n_batches += 1

            if epoch % 20 == 0:
                print(f"Epoch {epoch}: G_loss={total_g_loss/n_batches:.4f}, "
                      f"D_loss={total_d_loss/n_batches:.4f}")

        # Calibrate threshold on the training data
        self.calibrate_threshold(X_train)
        return self

    def anomaly_score(self, x):
        """
        Compute anomaly score based on latent consistency.
        Higher = more anomalous.
        """
        x_norm = (x - self.X_mean.numpy()) / self.X_std.numpy()
        x_tensor = torch.FloatTensor(x_norm).to(self.device)

        self.encoder1.eval()
        self.decoder.eval()
        self.encoder2.eval()

        with torch.no_grad():
            _, z1, z2 = self.generator_forward(x_tensor)
            # Anomaly score: ||z1 - z2||^2
            score = torch.sum((z1 - z2) ** 2, dim=1)

        return score.cpu().numpy()

    def calibrate_threshold(self, X, percentile=95):
        """Set the threshold from training-data scores."""
        scores = self.anomaly_score(X)
        self.threshold = np.percentile(scores, percentile)
        print(f"Threshold: {self.threshold:.4f}")

    def predict(self, X):
        """Predict normal (1) or anomaly (-1)."""
        scores = self.anomaly_score(X)
        return np.where(scores <= self.threshold, 1, -1)

    def decision_function(self, X):
        """Return anomaly scores."""
        return self.anomaly_score(X)


# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_moons
    from sklearn.metrics import roc_auc_score, classification_report

    # Normal training data
    X_normal, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
    X_normal = X_normal.astype(np.float32)

    # Test data
    X_test_normal, _ = make_moons(n_samples=100, noise=0.05, random_state=43)
    X_anomalies = np.random.uniform(-2, 3, size=(50, 2)).astype(np.float32)
    X_test = np.vstack([X_test_normal, X_anomalies])
    y_test = np.array([1] * 100 + [-1] * 50)

    # Train GANomaly
    model = GANomaly(input_dim=2, latent_dim=8, hidden_dims=[32, 16])
    model.fit(X_normal, epochs=100)

    # Evaluate
    scores = model.decision_function(X_test)
    predictions = model.predict(X_test)

    print("\nClassification Report:")
    print(classification_report(y_test, predictions, target_names=['Anomaly', 'Normal']))
    print(f"AUROC: {roc_auc_score(y_test == -1, scores):.4f}")
```

Self-supervised learning creates powerful representations by solving pretext tasks, artificial tasks that don't require labels but encourage learning useful features. For anomaly detection, the key insight is that models trained on pretext tasks for normal data will fail on anomalies.
Why Self-Supervised for Anomaly Detection? It needs no labels, matching the usual anomaly-detection setting, and the pretext task forces the model to internalize the structure of normal data: anomalies either fail the task or land in sparse regions of the learned representation space.
Common Pretext Tasks:
| Pretext Task | How It Works | Anomaly Signal |
|---|---|---|
| Rotation Prediction | Predict which rotation (0°, 90°, 180°, 270°) was applied | Anomalies have inconsistent rotation patterns |
| Jigsaw Puzzle | Predict correct arrangement of shuffled patches | Anomalous patterns break expected spatial relationships |
| Contrastive Learning | Distinguish between augmented views of same vs different images | Anomalies don't cluster with normal data in embedding space |
| Masked Prediction | Predict masked portions of input | Anomalies don't follow learned completion patterns |
| Transformation Classification | Identify which transformation was applied | Model uncertain about transformations of anomalies |
Contrastive Learning for Anomaly Detection:
Contrastive methods like SimCLR learn representations by pulling together augmented views of the same image while pushing apart different images:
$$\mathcal{L} = -\log \frac{\exp(sim(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(sim(z_i, z_k)/\tau)}$$
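A minimal sketch of this loss (often called NT-Xent), assuming `z1` and `z2` are the embeddings of two augmented views of the same batch; it is illustrative rather than a drop-in SimCLR implementation:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    """Contrastive (NT-Xent) loss for two augmented views of the same batch.
    Positive pairs are (i, i + N); all other samples act as negatives."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # 2N x d, unit-norm rows
    sim = z @ z.t() / tau                                # cosine similarity / temperature
    sim.fill_diagonal_(float('-inf'))                    # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```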
For anomaly detection, train the encoder on normal data only, then score test samples by how far their embeddings fall from the normal data in the learned space.
Distribution-Based Approach:
Fit a simple density model (GMM, KDE) in the learned representation space. Density estimation is more tractable there than in the raw input space: the representation is lower-dimensional and groups semantically similar samples, so even simple models capture the normal distribution well.
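A minimal sketch of this distribution-based scoring, assuming a hypothetical `embed` function that maps raw inputs to the learned representation space:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_density_scorer(Z_train, n_components=5):
    """Fit a simple GMM in representation space; return a scoring function
    where higher score = more anomalous (negative log-likelihood)."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='full',
                          random_state=0).fit(Z_train)
    return lambda Z: -gmm.score_samples(Z)

# Usage sketch (embed is assumed to map raw inputs to representations):
# score = fit_density_scorer(embed(X_train_normal))
# anomaly_scores = score(embed(X_test))
```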
Modern foundation models (CLIP, DINO, MAE) provide powerful pretrained representations that can be directly used for anomaly detection:
This approach often outperforms training from scratch, especially with limited data. The pretrained representations have already learned rich semantic structure that distinguishes normal from anomalous patterns.
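As a sketch of this pretrained-features approach (assuming a recent torchvision with the `weights=` API, and using an ImageNet-pretrained ResNet-18 as a stand-in for a stronger foundation-model backbone), one could score images by their distance to normal training images in feature space:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights
from sklearn.neighbors import NearestNeighbors

# Pretrained backbone with the classification head removed
weights = ResNet18_Weights.DEFAULT
backbone = resnet18(weights=weights)
backbone.fc = nn.Identity()
backbone.eval()
preprocess = weights.transforms()

@torch.no_grad()
def embed(images):
    """images: list of PIL images -> (N, 512) feature matrix."""
    batch = torch.stack([preprocess(img) for img in images])
    return backbone(batch).numpy()

def fit_knn_scorer(normal_images, k=5):
    """Score = mean distance to the k nearest normal images in feature space."""
    nn_index = NearestNeighbors(n_neighbors=k).fit(embed(normal_images))
    def score(images):
        dists, _ = nn_index.kneighbors(embed(images))
        return dists.mean(axis=1)  # higher = more anomalous
    return score
```

Swapping in CLIP or DINO embeddings would only change the `embed` step; the nearest-neighbor scoring stays the same.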
Hybrid methods combine deep neural network feature extraction with classical anomaly detection algorithms. This leverages the representation learning power of deep networks while benefiting from the simplicity and interpretability of classical methods.
The Hybrid Paradigm:
Input x → Deep Feature Extractor → Features φ(x) → Classical Detector → Score
Examples: a pretrained CNN backbone feeding a k-NN or Mahalanobis scorer, or an autoencoder's latent codes feeding a One-Class SVM.
Advantages of Hybrid Approach: the deep extractor handles raw, high-dimensional inputs, while the classical detector is cheap to fit, easy to recalibrate without retraining the network, and simpler to interpret and tune.
Popular Hybrid Combinations:
1. PatchCore (Industrial Anomaly Detection): stores a coreset of pretrained CNN patch features from normal images and scores test patches by their distance to the nearest stored features, giving image-level scores and pixel-level localization.
2. Deep Features + Mahalanobis Distance: fit a Gaussian (mean and covariance) to deep features of normal data and score by Mahalanobis distance, as implemented in the code below.
3. Autoencoder Latent Space + One-Class SVM: train an autoencoder on normal data, then fit a One-Class SVM on its latent codes, also implemented below.
```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import OneClassSVM
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import LocalOutlierFactor
from scipy.spatial.distance import mahalanobis
from sklearn.covariance import EmpiricalCovariance


class HybridAnomalyDetector:
    """
    Hybrid anomaly detection: deep feature extraction + classical detection.

    Supports multiple classical backends:
    - One-Class SVM
    - Gaussian Mixture Model
    - Mahalanobis distance
    - Local Outlier Factor
    """
    def __init__(
        self,
        feature_extractor,
        detector_type='ocsvm',
        device='cuda' if torch.cuda.is_available() else 'cpu',
        **detector_kwargs
    ):
        """
        Parameters
        ----------
        feature_extractor : nn.Module
            Neural network that maps input to feature space
        detector_type : str
            'ocsvm', 'gmm', 'mahalanobis', or 'lof'
        detector_kwargs : dict
            Parameters for the classical detector
        """
        self.feature_extractor = feature_extractor.to(device)
        self.device = device
        self.detector_type = detector_type
        self.detector_kwargs = detector_kwargs

        self.detector = None
        self.threshold = None

        # For Mahalanobis
        self.mean = None
        self.cov_inv = None

    def extract_features(self, X):
        """Extract features using the neural network."""
        self.feature_extractor.eval()
        if isinstance(X, np.ndarray):
            X = torch.FloatTensor(X)
        X = X.to(self.device)
        with torch.no_grad():
            features = self.feature_extractor(X)
        return features.cpu().numpy()

    def fit(self, X):
        """
        Fit the hybrid detector:
        1. Extract features using the neural network
        2. Fit a classical detector on the features
        """
        print("Extracting features...")
        features = self.extract_features(X)

        print(f"Fitting {self.detector_type} on {features.shape[1]}-dim features...")

        if self.detector_type == 'ocsvm':
            self.detector = OneClassSVM(
                kernel='rbf',
                nu=self.detector_kwargs.get('nu', 0.1),
                gamma=self.detector_kwargs.get('gamma', 'scale')
            )
            self.detector.fit(features)

        elif self.detector_type == 'gmm':
            self.detector = GaussianMixture(
                n_components=self.detector_kwargs.get('n_components', 5),
                covariance_type='full',
                random_state=42
            )
            self.detector.fit(features)

        elif self.detector_type == 'mahalanobis':
            # Fit a Gaussian to the features (empirical mean and covariance)
            cov_estimator = EmpiricalCovariance().fit(features)
            self.mean = cov_estimator.location_
            self.cov_inv = np.linalg.pinv(cov_estimator.covariance_)

        elif self.detector_type == 'lof':
            self.detector = LocalOutlierFactor(
                n_neighbors=self.detector_kwargs.get('n_neighbors', 20),
                contamination=self.detector_kwargs.get('contamination', 0.1),
                novelty=True
            )
            self.detector.fit(features)

        # Calibrate threshold on the training scores
        scores = self._score_features(features)
        self.threshold = np.percentile(scores, 95)
        print(f"Threshold set at: {self.threshold:.4f}")

        return self

    def _score_features(self, features):
        """Score features using the classical detector (higher = more anomalous)."""
        if self.detector_type == 'ocsvm':
            # Negate so higher = more anomalous
            return -self.detector.decision_function(features)

        elif self.detector_type == 'gmm':
            # Negative log-likelihood
            return -self.detector.score_samples(features)

        elif self.detector_type == 'mahalanobis':
            # Mahalanobis distance to the mean
            return np.array([
                mahalanobis(f, self.mean, self.cov_inv)
                for f in features
            ])

        elif self.detector_type == 'lof':
            return -self.detector.decision_function(features)

    def decision_function(self, X):
        """Compute anomaly scores (higher = more anomalous)."""
        features = self.extract_features(X)
        return self._score_features(features)

    def predict(self, X):
        """Predict normal (1) or anomaly (-1)."""
        scores = self.decision_function(X)
        return np.where(scores <= self.threshold, 1, -1)


class PretrainedEncoder(nn.Module):
    """Example: pretrained autoencoder encoder as feature extractor."""
    def __init__(self, input_dim, hidden_dims=[128, 64], output_dim=32):
        super().__init__()
        layers = []
        prev_dim = input_dim
        for h_dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, h_dim),
                nn.ReLU(),
                nn.BatchNorm1d(h_dim)
            ])
            prev_dim = h_dim
        layers.append(nn.Linear(prev_dim, output_dim))
        self.encoder = nn.Sequential(*layers)

    def forward(self, x):
        return self.encoder(x)


# Example: pretrain the encoder, then use it with a classical detector
def pretrain_encoder(X_train, input_dim, latent_dim=32, epochs=50):
    """Pretrain an autoencoder and return just the encoder."""
    # Full autoencoder
    encoder = PretrainedEncoder(input_dim, [128, 64], latent_dim)
    decoder = nn.Sequential(
        nn.Linear(latent_dim, 64),
        nn.ReLU(),
        nn.Linear(64, 128),
        nn.ReLU(),
        nn.Linear(128, input_dim)
    )
    model = nn.Sequential(encoder, decoder)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    X_tensor = torch.FloatTensor(X_train)
    for epoch in range(epochs):
        model.train()
        x_recon = model(X_tensor)
        loss = nn.functional.mse_loss(x_recon, X_tensor)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if epoch % 10 == 0:
            print(f"Pretrain Epoch {epoch}: loss={loss.item():.6f}")

    return encoder


if __name__ == "__main__":
    from sklearn.datasets import make_moons
    from sklearn.metrics import roc_auc_score

    # Training data
    X_normal, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
    X_normal = X_normal.astype(np.float32)

    # Test data
    X_test_normal, _ = make_moons(n_samples=100, noise=0.05, random_state=43)
    X_anomalies = np.random.uniform(-2, 3, size=(50, 2)).astype(np.float32)
    X_test = np.vstack([X_test_normal, X_anomalies])
    y_test = np.array([1] * 100 + [-1] * 50)

    # Pretrain the feature extractor
    print("Pretraining encoder...")
    encoder = pretrain_encoder(X_normal, input_dim=2, latent_dim=16, epochs=50)

    # Compare different classical backends
    for detector_type in ['ocsvm', 'gmm', 'mahalanobis']:
        print(f"\n{'=' * 50}")
        print(f"Testing: Deep Features + {detector_type.upper()}")
        print('=' * 50)

        detector = HybridAnomalyDetector(
            feature_extractor=encoder,
            detector_type=detector_type,
            nu=0.1
        )
        detector.fit(X_normal)

        scores = detector.decision_function(X_test)
        predictions = detector.predict(X_test)
        print(f"AUROC: {roc_auc_score(y_test == -1, scores):.4f}")
```

With many neural network approaches available, selecting the right one for your problem requires considering data characteristics, computational constraints, and interpretability requirements.
Decision Framework:
| Scenario | Recommended Approach | Rationale |
|---|---|---|
| Tabular data, moderate dimensionality | Deep SVDD or Dense Autoencoder | Simple architectures work well; interpretable latent space |
| Image data, pretrained models available | Hybrid: Pretrained features + Classical | Leverage rich pretrained representations |
| Image data, need localization | Autoencoder with pixel-wise error | Reconstruction error highlights anomalous regions |
| Sequence data (time series, logs) | LSTM Autoencoder or Transformer | Capture temporal dependencies |
| Limited data | Self-supervised pretrained + Simple detector | Foundation models reduce data requirements |
| Need probabilistic scores | VAE or Deep GMM | Principled uncertainty quantification |
| Fast inference required | Trained encoder + kNN or GMM | Feature extraction once; fast classical lookup |
| Some labeled anomalies available | Semi-supervised Deep SAD | Incorporates limited label information |
Start simple, add complexity as needed: begin with a pretrained-feature + classical-detector hybrid or a basic autoencoder as a baseline, move to Deep SVDD when you need a trained one-class representation, and reach for GAN-based methods only when the simpler approaches underperform.
Always validate on held-out normal data and, if possible, some known anomalies.
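For example, if a handful of known anomalies are available, candidate detectors can be compared on a held-out split in a few lines (a sketch; `detectors` is assumed to be a dict of fitted models exposing `decision_function` with higher scores meaning more anomalous, as in the examples above):

```python
from sklearn.metrics import roc_auc_score

def compare_detectors(detectors, X_val, y_val):
    """y_val: 1 for normal, -1 for anomaly."""
    for name, det in detectors.items():
        scores = det.decision_function(X_val)
        print(f"{name}: AUROC = {roc_auc_score(y_val == -1, scores):.3f}")
```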
We've explored the rich landscape of neural network approaches for anomaly detection, from principled one-class objectives to generative models and hybrid architectures.
What's Next:
In the final page of this module, we explore Threshold Selection—the often-overlooked but critical step of converting continuous anomaly scores into actionable decisions. We'll cover statistical methods, business-aligned thresholds, and dynamic adaptation strategies.
You now have a comprehensive understanding of neural network approaches for anomaly detection beyond autoencoders. You can implement Deep SVDD, GAN-based methods, leverage self-supervised representations, and design hybrid architectures. These tools prepare you to tackle complex, high-dimensional anomaly detection challenges across diverse domains.