In traditional supervised learning, we have access to labeled examples from multiple classes, and our goal is to learn a decision boundary that separates them. But what happens when we only have examples from one class—typically the 'normal' class—and our task is to identify everything that doesn't belong?
This is the fundamental challenge addressed by One-Class SVM (OC-SVM), one of the most principled and widely-used algorithms for unsupervised anomaly detection. Originally proposed by Schölkopf et al. in 2001, One-Class SVM adapts the powerful machinery of Support Vector Machines to the single-class setting, creating a decision boundary that encapsulates normal data while rejecting outliers.
The elegance of One-Class SVM lies in its ability to leverage kernel methods, enabling it to learn complex, nonlinear boundaries in high-dimensional feature spaces—all while maintaining the geometric interpretability and theoretical guarantees that make SVMs so compelling.
By the end of this page, you will understand: (1) The fundamental geometric intuition behind One-Class SVM, (2) The mathematical formulation as a maximum-margin problem, (3) How the ν-parameter controls the trade-off between false positives and false negatives, (4) Kernel selection strategies for nonlinear anomaly boundaries, (5) The dual formulation and its connections to density estimation, and (6) Practical implementation considerations and hyperparameter tuning.
The core geometric idea behind One-Class SVM is beautifully simple: find a hyperplane that separates the training data from the origin with maximum margin, after mapping the data to a high-dimensional feature space.
But why the origin? The origin serves as a convenient representative of 'everywhere else'—the vast expanse of the input space where normal data should not lie. By pushing the data away from the origin while maximizing the margin, we create a decision boundary that characterizes the normal data region.
The Feature Space Perspective:
In the original input space, the data may not be separable from anything—it's just a cloud of points. But when we map the data to a high-dimensional feature space via a feature map φ(x) (defined implicitly by a kernel), something remarkable happens:
The decision function becomes: f(x) = sign(w·φ(x) - ρ), where points with positive values are classified as normal, and points with negative values are classified as anomalies.
You might wonder why we don't simply fit a minimum bounding sphere around the data. While conceptually simpler, a hyperplane-based approach offers more flexibility: (1) In feature space, a hyperplane can describe arbitrarily complex boundaries in the original space, (2) The maximum-margin principle provides regularization and better generalization, (3) The connection to kernel methods enables efficient computation even in infinite-dimensional feature spaces. That said, the sphere-based approach (SVDD) is closely related and covered in the next page.
Visualizing the Geometry:
Consider a simple 2D example where normal data forms a cluster. In the original space, we want to draw a 'boundary' around this cluster. One-Class SVM achieves this by:
• mapping each point into the feature space defined by the kernel,
• finding the hyperplane that separates the mapped points from the origin with maximum margin, and
• reading the result back in the original space, where the preimage of that hyperplane traces a nonlinear contour around the cluster.
With an RBF kernel, the resulting boundary in the original space is typically smooth and naturally adapts to the shape of the data distribution. The boundary contracts around dense regions and may have multiple components if the data has multiple modes.
| Concept | Mathematical Representation | Intuition |
|---|---|---|
| Feature mapping | φ: X → H (Hilbert space) | Lifts data to a space where separation is possible |
| Separating hyperplane | w·φ(x) = ρ | The boundary between normal and anomalous regions |
| Normal region | {x : w·φ(x) ≥ ρ} | Points on the 'data side' of the hyperplane |
| Anomaly region | {x : w·φ(x) < ρ} | Points on the 'origin side' of the hyperplane |
| Margin | ρ / ||w|| | Distance from hyperplane to origin; larger = more conservative |
| Decision function | f(x) = sign(w·φ(x) - ρ) | +1 for normal, -1 for anomaly |
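To tie the table to code, here is a minimal sketch (assuming scikit-learn's OneClassSVM with a linear kernel; the toy data and variable names are illustrative) showing that the library's predictions are exactly the sign of w·x − ρ, with w read from coef_ and ρ = −intercept_:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Toy 2D cluster shifted away from the origin so a linear separation makes sense
rng = np.random.default_rng(0)
X = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(200, 2))

model = OneClassSVM(kernel='linear', nu=0.1).fit(X)

w = model.coef_.ravel()       # hyperplane normal vector w (exposed for the linear kernel)
rho = -model.intercept_[0]    # offset rho (scikit-learn stores intercept_ = -rho)

# Decision function f(x) = w·x - rho; its sign is the normal/anomaly label
X_query = np.array([[3.0, 3.0], [0.0, 0.0]])   # cluster center vs. the origin
f_values = X_query @ w - rho
print(np.sign(f_values))        # manual decision: typically [+1, -1]
print(model.predict(X_query))   # matches the library's prediction
```

With a nonlinear kernel there is no explicit coef_; the same decision function is instead expressed through the support vector expansion that appears in the dual formulation below.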
The One-Class SVM optimization problem balances two competing objectives: maximizing the margin (the separation between the data and the origin) and allowing some training points to be on the 'wrong' side (to handle outliers in the training set and prevent overfitting).
Primal Formulation:
Given n training examples {x₁, x₂, ..., xₙ} assumed to be drawn from the 'normal' distribution, we solve:
$$\min_{w, \rho, \xi} \frac{1}{2}||w||^2 - \rho + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i$$
Subject to:

$$w \cdot \phi(x_i) \geq \rho - \xi_i, \qquad \xi_i \geq 0, \qquad i = 1, \dots, n$$
Let's dissect each component:
• ½||w||² — the regularization term; minimizing ||w|| maximizes the margin ρ/||w||.
• −ρ — pushes the offset ρ as large as possible, moving the hyperplane away from the origin.
• ξᵢ — slack variables that allow individual training points to fall on the origin side of the hyperplane.
• 1/(νn) — the weight on the slack penalty; smaller ν penalizes violations more heavily, which is what gives ν the interpretation described next.
The parameter ν is not just a regularization knob—it has a beautiful interpretation:
• ν is an upper bound on the fraction of training points that become outliers (support vectors with ξ > 0)
• ν is a lower bound on the fraction of support vectors (points exactly on or beyond the margin)
This means if you set ν = 0.1, at most 10% of your training data will be classified as anomalous, and at least 10% will be support vectors. This provides an intuitive handle on the expected false positive rate.
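As a quick empirical check of the ν-property (a sketch assuming scikit-learn's OneClassSVM; the blob data and the fixed gamma=0.5 are illustrative), we can fit models with different ν and compare the observed fractions against the bounds:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.datasets import make_blobs

# A single Gaussian cluster of "normal" training points
X, _ = make_blobs(n_samples=1000, centers=[[0, 0]], cluster_std=1.0, random_state=0)

for nu in (0.05, 0.1, 0.3):
    model = OneClassSVM(kernel='rbf', nu=nu, gamma=0.5).fit(X)
    outlier_frac = np.mean(model.predict(X) == -1)   # should be <= nu (approximately)
    sv_frac = len(model.support_) / len(X)           # should be >= nu (approximately)
    print(f"nu={nu:.2f}  training outlier fraction={outlier_frac:.3f}  "
          f"support vector fraction={sv_frac:.3f}")
```

The bounds hold approximately on finite samples and become tight as the sample size grows.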
Dual Formulation:
Using Lagrange multipliers αᵢ ≥ 0 for the margin constraints and solving the KKT conditions, we obtain the dual problem:
$$\min_{\alpha} \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)$$
Subject to:

$$0 \leq \alpha_i \leq \frac{1}{\nu n}, \qquad \sum_{i=1}^{n} \alpha_i = 1$$
where k(xᵢ, xⱼ) = φ(xᵢ)·φ(xⱼ) is the kernel function.
Key Insights from the Dual:
• The data enter only through kernel evaluations k(xᵢ, xⱼ), so φ(x) never needs to be computed explicitly (the kernel trick).
• The box constraint 0 ≤ αᵢ ≤ 1/(νn) makes the solution sparse: most αᵢ are zero, and only the support vectors (points on or outside the boundary) contribute.
• The decision function is a weighted sum of kernels centered on the support vectors—the connection to density estimation explored later on this page.
```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import classification_report, confusion_matrix

def one_class_svm_decision_function(X_train, X_test, kernel='rbf', nu=0.1, gamma='scale'):
    """
    Train One-Class SVM and compute decision function.

    The decision function f(x) = Σᵢ αᵢ k(xᵢ, x) - ρ
    Returns positive values for normal points, negative for anomalies.

    Parameters:
    -----------
    X_train : array-like of shape (n_samples, n_features)
        Training data (assumed to be from the normal class)
    X_test : array-like of shape (n_test_samples, n_features)
        Test data to classify
    kernel : str, default='rbf'
        Kernel type: 'linear', 'rbf', 'poly', 'sigmoid'
    nu : float, default=0.1
        Upper bound on fraction of training errors (anomalies in training)
        and lower bound on fraction of support vectors
    gamma : str or float, default='scale'
        Kernel coefficient for 'rbf', 'poly', 'sigmoid'

    Returns:
    --------
    model : OneClassSVM
        Trained model
    predictions : array of shape (n_test_samples,)
        +1 for normal (inlier), -1 for anomaly (outlier)
    decision_scores : array of shape (n_test_samples,)
        Signed distance to the separating hyperplane
    """
    # Initialize and train the model
    model = OneClassSVM(
        kernel=kernel,
        nu=nu,
        gamma=gamma,
        tol=1e-4,          # Convergence tolerance
        shrinking=True,    # Use shrinking heuristic for efficiency
        cache_size=500,    # Cache size in MB for kernel calculations
        verbose=False
    )
    model.fit(X_train)

    # Get predictions and decision scores
    predictions = model.predict(X_test)                  # +1 (normal) or -1 (anomaly)
    decision_scores = model.decision_function(X_test)    # Signed distance to boundary

    # Analyze support vectors
    n_support = model.support_vectors_.shape[0]
    sv_ratio = n_support / X_train.shape[0]
    print(f"Support vectors: {n_support} / {X_train.shape[0]} = {sv_ratio:.2%}")
    print(f"ν = {nu} → Expected ratio ≥ {nu:.2%}")
    print(f"Offset (ρ): {model.offset_[0]:.4f}")

    return model, predictions, decision_scores

def analyze_ocsvm_boundary(model, X_grid):
    """
    Analyze the decision boundary by computing decision function on a grid.

    The decision boundary is where decision_function(x) = 0
    """
    # Decision function: positive inside (normal), negative outside (anomaly)
    decision_values = model.decision_function(X_grid)

    # Points on the boundary
    boundary_mask = np.abs(decision_values) < 0.01

    return decision_values, boundary_mask

# Example usage with synthetic data
if __name__ == "__main__":
    from sklearn.datasets import make_blobs

    # Generate normal training data (single cluster)
    X_normal, _ = make_blobs(
        n_samples=300, centers=[[0, 0]], cluster_std=0.5, random_state=42
    )

    # Generate test data: normal points + anomalies
    X_test_normal, _ = make_blobs(
        n_samples=50, centers=[[0, 0]], cluster_std=0.5, random_state=43
    )
    X_anomalies = np.random.uniform(low=-4, high=4, size=(20, 2))
    X_test = np.vstack([X_test_normal, X_anomalies])
    y_test = np.array([1] * 50 + [-1] * 20)  # Ground truth

    # Train and evaluate
    print("=" * 50)
    print("One-Class SVM for Anomaly Detection")
    print("=" * 50)

    for nu in [0.05, 0.1, 0.2]:
        print(f"\n--- ν = {nu} ---")
        model, predictions, scores = one_class_svm_decision_function(
            X_normal, X_test, nu=nu
        )

        print("\nClassification Report:")
        print(classification_report(
            y_test, predictions, target_names=['Anomaly (-1)', 'Normal (+1)']
        ))
```

The choice of kernel function is crucial for One-Class SVM performance. The kernel implicitly defines the feature space where the hyperplane separates normal from anomalous points. Different kernels create different types of decision boundaries.
Radial Basis Function (RBF) Kernel — The Default Choice:
$$k(x, x') = \exp\left(-\gamma ||x - x'||^2\right)$$
The RBF kernel is the most common choice for One-Class SVM for several reasons:
The γ parameter controls the 'width' of the kernel:
• Large γ → narrow kernels: the boundary hugs the training points tightly and can fragment into many small regions (risk of overfitting).
• Small γ → wide kernels: the boundary becomes smooth and inclusive, approaching a single large blob (risk of underfitting).
| Kernel | Formula | Decision Boundary Shape | Best Use Cases |
|---|---|---|---|
| Linear | k(x, x') = x·x' | Hyperplane (half-space) | High-dimensional sparse data; when linear separation suffices |
| RBF (Gaussian) | exp(-γ||x-x'||²) | Smooth, closed contours | General-purpose; compact data clusters; unknown boundary shape |
| Polynomial | (γx·x' + r)^d | Polynomial curves | When polynomial relationships exist; image feature spaces |
| Sigmoid | tanh(γx·x' + r) | Similar to neural networks | When neural network-like behavior desired; less common |
When using RBF kernel, γ and ν interact in complex ways:
• Very large γ with small ν: The boundary tightly wraps each training point individually, leading to severe overfitting and inability to generalize to new normal points.
• Very small γ with large ν: The boundary becomes too smooth and large, failing to exclude anomalies.
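A small sketch of this interaction (assuming scikit-learn's OneClassSVM on illustrative blob data): sweeping γ at a fixed ν while watching the support vector fraction and the training anomaly rate makes the overfitting and underfitting regimes visible.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=[[0, 0]], cluster_std=1.0, random_state=0)

for gamma in (0.01, 1.0, 100.0):
    model = OneClassSVM(kernel='rbf', nu=0.1, gamma=gamma).fit(X)
    sv_frac = len(model.support_) / len(X)       # grows toward 1 as gamma explodes
    flagged = np.mean(model.predict(X) == -1)    # fraction of training points outside
    print(f"gamma={gamma:>6}: support vector fraction={sv_frac:.2f}, "
          f"training anomaly rate={flagged:.2f}")

# Typically: very large gamma -> nearly every point becomes a support vector
# (the boundary wraps points individually); very small gamma -> one smooth blob.
```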
Recommended approach: First fix ν based on expected contamination rate, then tune γ using cross-validation or by monitoring the fraction of training data classified as normal.
Kernel Parameter Selection Strategies:
1. Scale-based Heuristics (for RBF):
• sklearn's 'scale' setting: γ = 1 / (n_features · Var(X)).
• The median heuristic: γ = 1 / (2 · median pairwise squared distance), a common default in the kernel-methods literature.
These heuristics ensure the kernel 'sees' meaningful variation in the data; both (plus mean-based and Silverman variants) are computed in compute_gamma_heuristics in the code below.
2. Grid Search with Stability:
Since we lack labeled anomalies for validation, we use proxy metrics:
• the fraction of training points flagged as anomalous (it should land close to the chosen ν),
• the support vector ratio (a very high ratio suggests overfitting), and
• prediction stability under bootstrap resampling (unstable predictions indicate a poorly chosen γ).
These are exactly the quantities tracked by the grid search code below.
3. Domain Knowledge:
If you know the expected scale of normal variation, set γ accordingly: with a characteristic length scale ℓ (the distance over which normal points should still look similar), a natural choice is γ ≈ 1/(2ℓ²), so that points within ℓ of each other retain high kernel similarity.
```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.model_selection import ParameterGrid
from scipy.spatial.distance import pdist

def compute_gamma_heuristics(X):
    """
    Compute different heuristic values for γ (RBF kernel width).

    Returns:
    --------
    dict with gamma values from different heuristics
    """
    n_samples, n_features = X.shape

    # Variance-based (sklearn 'scale')
    gamma_scale = 1.0 / (n_features * X.var())

    # Median heuristic (popular in kernel methods literature)
    pairwise_dists_sq = pdist(X, 'sqeuclidean')
    gamma_median = 1.0 / (2 * np.median(pairwise_dists_sq))

    # Mean-based
    gamma_mean = 1.0 / (2 * np.mean(pairwise_dists_sq))

    # Silverman's rule of thumb (adapted from KDE)
    std_avg = np.mean(np.std(X, axis=0))
    h_silverman = (
        (4 / (n_features + 2)) ** (1 / (n_features + 4))
        * n_samples ** (-1 / (n_features + 4))
        * std_avg
    )
    gamma_silverman = 1.0 / (2 * h_silverman ** 2)

    return {
        'scale': gamma_scale,
        'median': gamma_median,
        'mean': gamma_mean,
        'silverman': gamma_silverman
    }

def evaluate_ocsvm_stability(X, gamma, nu, n_bootstrap=20, sample_ratio=0.8):
    """
    Evaluate stability of One-Class SVM predictions under bootstrap resampling.

    High stability = predictions are consistent = well-chosen parameters
    Low stability  = predictions vary = overfitting or underfitting
    """
    n_samples = X.shape[0]
    sample_size = int(n_samples * sample_ratio)

    predictions = []
    for _ in range(n_bootstrap):
        # Bootstrap sample
        indices = np.random.choice(n_samples, size=sample_size, replace=True)
        X_boot = X[indices]

        # Train model
        model = OneClassSVM(kernel='rbf', gamma=gamma, nu=nu)
        model.fit(X_boot)

        # Predict on full dataset
        pred = model.predict(X)
        predictions.append(pred)

    predictions = np.array(predictions)  # Shape: (n_bootstrap, n_samples)

    # Stability = fraction of samples with consistent predictions
    # For each sample, compute fraction of bootstraps that agree with majority
    majority_vote = np.sign(np.sum(predictions, axis=0))
    agreement = np.mean(predictions == majority_vote, axis=0)
    stability = np.mean(agreement)

    return stability, majority_vote

def grid_search_ocsvm(X, nu, gamma_range, stability_threshold=0.9):
    """
    Grid search for optimal gamma based on stability and training metrics.
    """
    results = []

    for gamma in gamma_range:
        # Train model
        model = OneClassSVM(kernel='rbf', gamma=gamma, nu=nu)
        model.fit(X)

        # Compute metrics
        predictions = model.predict(X)
        train_anomaly_rate = np.mean(predictions == -1)
        n_support_vectors = model.support_vectors_.shape[0]
        sv_ratio = n_support_vectors / X.shape[0]

        # Compute stability
        stability, _ = evaluate_ocsvm_stability(X, gamma, nu, n_bootstrap=10)

        results.append({
            'gamma': gamma,
            'train_anomaly_rate': train_anomaly_rate,
            'sv_ratio': sv_ratio,
            'stability': stability,
            'target_deviation': abs(train_anomaly_rate - nu)
        })

        print(f"γ={gamma:.4f}: anomaly_rate={train_anomaly_rate:.3f}, "
              f"SV_ratio={sv_ratio:.3f}, stability={stability:.3f}")

    # Select best: minimize deviation from target ν while maintaining stability
    stable_results = [r for r in results if r['stability'] >= stability_threshold]
    if stable_results:
        best = min(stable_results, key=lambda r: r['target_deviation'])
    else:
        best = max(results, key=lambda r: r['stability'])

    print(f"\nBest γ = {best['gamma']:.4f}")
    return best['gamma'], results

# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_moons

    # Generate crescent-shaped normal data
    X, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

    print("Gamma Heuristics:")
    heuristics = compute_gamma_heuristics(X)
    for name, value in heuristics.items():
        print(f"  {name}: γ = {value:.4f}")

    print("\nGrid Search for Optimal γ (ν=0.1):")
    gamma_range = np.logspace(-2, 2, 15)
    best_gamma, results = grid_search_ocsvm(X, nu=0.1, gamma_range=gamma_range)
```

One-Class SVM has a deep connection to density estimation that provides additional intuition for its behavior and helps explain when it works well or poorly.
The Density Level Set Perspective:
Under certain conditions, the One-Class SVM decision boundary converges to a density level set of the underlying data distribution. Specifically, as the sample size increases and with appropriate kernel bandwidth:
$$\{x : f(x) \geq \tau\} \approx \{x : p(x) \geq \tau'\}$$
where f(x) is the OC-SVM decision function, p(x) is the true data density, and τ, τ' are related thresholds.
This means One-Class SVM is implicitly estimating regions of high probability density—exactly what we want for anomaly detection, since anomalies are low-density points.
The Support Vector Expansion:
The decision function can be written as:
$$f(x) = \sum_{i \in SV} \alpha_i k(x_i, x) - \rho$$
where SV is the set of support vector indices. This is remarkably similar to a Parzen window (kernel density) estimate:
$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} k_h(x_i, x)$$
The key differences:
• Weights: the Parzen estimate gives every point the same weight 1/n, while One-Class SVM learns non-uniform weights αᵢ that are zero for most points, so only the support vectors contribute (a sparse expansion).
• Threshold: KDE requires choosing a density threshold separately, whereas One-Class SVM learns the offset ρ jointly with the weights through the margin optimization.
• Bandwidth role: in both cases the kernel width (γ or h) controls how smooth the estimated normal region is.
This connection explains why RBF kernel One-Class SVM tends to perform well when the normal data has a smooth, unimodal distribution—exactly the setting where KDE excels.
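To make the expansion concrete, the sketch below (assuming scikit-learn; the fixed gamma=0.5 and the bandwidth conversion are illustrative choices) rebuilds a fitted model's decision function by hand from its support vectors and dual coefficients, and sets up the analogous Parzen/KDE score on the same data:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import KernelDensity
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=[[0, 0]], cluster_std=1.0, random_state=0)
gamma = 0.5  # fixed kernel width so the manual expansion is easy to reproduce

model = OneClassSVM(kernel='rbf', nu=0.1, gamma=gamma).fit(X)

# Support vector expansion: f(x) = sum_{i in SV} alpha_i k(x_i, x) - rho.
# scikit-learn stores the alpha_i in dual_coef_ and -rho in intercept_.
K = rbf_kernel(X, model.support_vectors_, gamma=gamma)
f_manual = K @ model.dual_coef_.ravel() + model.intercept_

print("expansion matches decision_function:",
      np.allclose(f_manual, model.decision_function(X)))
print("support vectors:", model.support_vectors_.shape[0], "of", len(X))

# Parzen / KDE analogue: every training point gets the same weight 1/n
kde = KernelDensity(kernel='gaussian', bandwidth=1.0 / np.sqrt(2 * gamma)).fit(X)
log_density = kde.score_samples(X)  # low values = low-density (anomaly-like) points
```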
Use One-Class SVM when:
• You need fast predictions on new data (sparse support vector representation)
• You expect some anomalies in the training set (ν handles contamination)
• You want a hard decision boundary, not probability estimates
• The data is high-dimensional but lies on a lower-dimensional manifold
Prefer KDE or GMM when:
• You need probabilistic anomaly scores, not just yes/no
• The data has clear multimodal structure
• You have domain knowledge about the distribution form
• Sample size is small and sparsity benefits are minimal
Deploying One-Class SVM effectively requires attention to preprocessing, scaling, and computational efficiency. Here we cover the practical aspects that distinguish a working implementation from a textbook algorithm.
1. Feature Scaling is Critical:
The RBF kernel computes ||x - x'||², which is heavily influenced by feature scales. Features with larger magnitudes dominate distance calculations, effectively ignoring smaller-scale features.
Recommended approach: Standardize all features to zero mean and unit variance: $$x_{scaled} = \frac{x - \mu}{\sigma}$$
Alternatively, for non-Gaussian features, use robust scaling (median and IQR) or min-max scaling to [0, 1].
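A minimal sketch of this preprocessing step (assuming scikit-learn's Pipeline; the synthetic features and parameter values are illustrative), bundling the scaler with the detector so new data is transformed consistently:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Two features on wildly different scales: unscaled, the second feature
# would dominate the RBF distance computation.
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 1000, 500)])

detector = make_pipeline(
    StandardScaler(),   # swap in RobustScaler() or MinMaxScaler() as discussed above
    OneClassSVM(kernel='rbf', nu=0.1, gamma='scale'),
)
detector.fit(X_train)

X_new = np.array([[0.0, 0.0], [10.0, 0.0]])   # second point is far off in feature 1
print(detector.predict(X_new))                # +1 = normal, -1 = anomaly
```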
2. Handling Contaminated Training Data:
Real-world 'normal' training data often contains some anomalies. One-Class SVM handles this through the ν parameter, but additional preprocessing helps:
```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
import warnings

class RobustOneClassSVM:
    """
    Production-ready One-Class SVM with best practices built in.

    Features:
    - Automatic feature scaling (robust to outliers)
    - Optional training data cleaning with Isolation Forest
    - Multiple kernel support with sensible defaults
    - Calibrated anomaly scores
    """

    def __init__(
        self,
        nu=0.1,
        kernel='rbf',
        gamma='scale',
        preprocessing='robust',
        clean_training=True,
        contamination_estimate=0.05,
        verbose=False
    ):
        """
        Parameters:
        -----------
        nu : float, default=0.1
            Upper bound on training error fraction
        kernel : str, default='rbf'
            Kernel type ('rbf', 'linear', 'poly', 'sigmoid')
        gamma : str or float, default='scale'
            Kernel coefficient
        preprocessing : str, default='robust'
            'robust' (median/IQR), 'standard' (mean/std), or None
        clean_training : bool, default=True
            Whether to pre-filter training data for outliers
        contamination_estimate : float, default=0.05
            Expected fraction of anomalies in training data
        """
        self.nu = nu
        self.kernel = kernel
        self.gamma = gamma
        self.preprocessing = preprocessing
        self.clean_training = clean_training
        self.contamination_estimate = contamination_estimate
        self.verbose = verbose

        self.scaler_ = None
        self.model_ = None
        self.training_mask_ = None
        self.decision_offset_ = None

    def fit(self, X, y=None):
        """
        Fit the One-Class SVM model.

        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            Training data (assumed mostly normal)
        y : ignored
        """
        X = np.asarray(X)
        n_samples_original = X.shape[0]

        # Step 1: Preprocessing (scaling)
        if self.preprocessing == 'robust':
            self.scaler_ = RobustScaler()
        elif self.preprocessing == 'standard':
            self.scaler_ = StandardScaler()
        else:
            self.scaler_ = None

        if self.scaler_ is not None:
            X_scaled = self.scaler_.fit_transform(X)
        else:
            X_scaled = X.copy()

        # Step 2: Training data cleaning (optional)
        if self.clean_training:
            if self.verbose:
                print(f"Cleaning training data (contamination={self.contamination_estimate})...")
            iso_forest = IsolationForest(
                contamination=self.contamination_estimate,
                random_state=42,
                n_jobs=-1
            )
            inlier_mask = iso_forest.fit_predict(X_scaled) == 1
            X_clean = X_scaled[inlier_mask]
            self.training_mask_ = inlier_mask

            if self.verbose:
                n_removed = n_samples_original - X_clean.shape[0]
                print(f"Removed {n_removed} potential outliers from training")
        else:
            X_clean = X_scaled
            self.training_mask_ = np.ones(n_samples_original, dtype=bool)

        # Step 3: Train One-Class SVM
        self.model_ = OneClassSVM(
            kernel=self.kernel,
            nu=self.nu,
            gamma=self.gamma,
            tol=1e-4,
            shrinking=True,
            cache_size=500
        )
        self.model_.fit(X_clean)

        # Step 4: Calibrate decision offset for interpretable scores
        train_scores = self.model_.decision_function(X_clean)
        self.decision_offset_ = np.median(train_scores)

        if self.verbose:
            print(f"Training complete:")
            print(f"  Support vectors: {self.model_.support_vectors_.shape[0]}")
            print(f"  Decision offset: {self.decision_offset_:.4f}")

        return self

    def predict(self, X):
        """
        Predict if samples are normal (+1) or anomalies (-1).
        """
        X = np.asarray(X)
        X_scaled = self.scaler_.transform(X) if self.scaler_ else X
        return self.model_.predict(X_scaled)

    def decision_function(self, X):
        """
        Compute signed distance to the decision boundary.

        Positive = normal (inside boundary)
        Negative = anomaly (outside boundary)
        """
        X = np.asarray(X)
        X_scaled = self.scaler_.transform(X) if self.scaler_ else X
        return self.model_.decision_function(X_scaled)

    def anomaly_score(self, X):
        """
        Compute calibrated anomaly scores.

        Returns values in [0, 1] where:
        - 0   = definitely normal
        - 0.5 = on the boundary
        - 1   = definitely anomalous

        Uses sigmoid calibration centered on the training median.
        """
        raw_scores = self.decision_function(X)

        # Calibrate: sigmoid centered on training median
        # Points below median get scores > 0.5, above get < 0.5
        calibrated = 1.0 / (1.0 + np.exp(raw_scores - self.decision_offset_))
        return calibrated

    def fit_predict(self, X, y=None):
        """Fit the model and predict on the same data."""
        self.fit(X)
        return self.predict(X)

# Example usage with evaluation
if __name__ == "__main__":
    from sklearn.datasets import make_blobs
    from sklearn.metrics import roc_auc_score, average_precision_score

    # Generate data
    X_normal, _ = make_blobs(n_samples=500, centers=[[0, 0]], cluster_std=1.0, random_state=42)
    X_test_normal, _ = make_blobs(n_samples=100, centers=[[0, 0]], cluster_std=1.0, random_state=43)
    X_anomalies = np.random.uniform(-5, 5, size=(30, 2))

    X_test = np.vstack([X_test_normal, X_anomalies])
    y_test = np.array([0] * 100 + [1] * 30)  # 0 = normal, 1 = anomaly

    # Train and evaluate
    model = RobustOneClassSVM(nu=0.1, verbose=True)
    model.fit(X_normal)

    anomaly_scores = model.anomaly_score(X_test)
    predictions = model.predict(X_test)

    # Compute metrics
    auroc = roc_auc_score(y_test, anomaly_scores)
    auprc = average_precision_score(y_test, anomaly_scores)

    print(f"\nEvaluation Results:")
    print(f"  AUROC: {auroc:.4f}")
    print(f"  AUPRC: {auprc:.4f}")
    print(f"  Detection rate: {np.mean(predictions[100:] == -1):.2%}")
    print(f"  False positive rate: {np.mean(predictions[:100] == -1):.2%}")
```

One-Class SVM training is O(n²) to O(n³) in the number of training samples due to kernel matrix computation and quadratic programming. For large datasets (n > 10,000):
• Use SGD-based approximations (sklearn's SGDOneClassSVM)
• Consider random feature approximations (Nyström, Random Fourier Features)
• Use mini-batch training with model updates
• Pre-cluster data and train separate models per cluster
Prediction is O(n_sv × d) where n_sv is the number of support vectors, making it fast for sparse solutions.
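As a rough sketch of the first two bullets above (assuming scikit-learn's SGDOneClassSVM and Nystroem kernel approximation; n_components, gamma, and the synthetic data are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.pipeline import make_pipeline

# A larger "normal" training set where the exact kernel solver becomes slow
X_train, _ = make_blobs(n_samples=50_000, centers=[[0, 0]], cluster_std=1.0,
                        random_state=0)

# Approximate the RBF feature map with a low-rank Nystroem expansion, then fit
# a linear one-class SVM by stochastic gradient descent in that feature space.
approx_ocsvm = make_pipeline(
    Nystroem(kernel='rbf', gamma=0.5, n_components=200, random_state=0),
    SGDOneClassSVM(nu=0.1, random_state=0),
)
approx_ocsvm.fit(X_train)

X_new = np.array([[0.0, 0.0], [6.0, 6.0]])
print(approx_ocsvm.predict(X_new))   # +1 = normal, -1 = anomaly
```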
We've explored One-Class SVM from geometric intuition through mathematical formulation to practical implementation. Let's consolidate the essential insights:
• Geometrically, One-Class SVM separates the data from the origin in feature space with maximum margin; the preimage of that hyperplane is the anomaly boundary in the input space.
• The ν parameter upper-bounds the fraction of training points treated as outliers and lower-bounds the fraction of support vectors, giving direct control over the expected false positive rate.
• With the RBF kernel, γ governs how tightly the boundary wraps the data and should be tuned jointly with ν using scale heuristics, stability, and the training anomaly rate.
• The support vector expansion links One-Class SVM to kernel density estimation: it behaves like a sparse, weighted density level-set estimator.
• In practice, feature scaling, handling of contaminated training data, and awareness of the O(n²)-O(n³) training cost matter as much as the model itself.
What's Next:
In the next page, we examine Support Vector Data Description (SVDD)—a closely related method that fits a minimum-volume hypersphere around the normal data instead of separating from the origin. SVDD offers complementary insights and is sometimes preferred when a 'closed' boundary interpretation is more natural.
You now understand One-Class SVM at both the conceptual and implementation level. You can formulate the optimization problem, select appropriate kernels and hyperparameters, interpret the decision boundary geometrically, and build production-ready anomaly detectors. Next, we'll explore SVDD as an alternative geometric formulation.