In traditional supervised learning, we have access to labeled examples from multiple classes, and our goal is to learn a decision boundary that separates them. But what happens when we only have examples from one class—typically the 'normal' class—and our task is to identify everything that doesn't belong?
This is the fundamental challenge addressed by One-Class SVM (OC-SVM), one of the most principled and widely-used algorithms for unsupervised anomaly detection. Originally proposed by Schölkopf et al. in 2001, One-Class SVM adapts the powerful machinery of Support Vector Machines to the single-class setting, creating a decision boundary that encapsulates normal data while rejecting outliers.
The elegance of One-Class SVM lies in its ability to leverage kernel methods, enabling it to learn complex, nonlinear boundaries in high-dimensional feature spaces—all while maintaining the geometric interpretability and theoretical guarantees that make SVMs so compelling.
By the end of this page, you will understand: (1) The fundamental geometric intuition behind One-Class SVM, (2) The mathematical formulation as a maximum-margin problem, (3) How the ν-parameter controls the trade-off between false positives and false negatives, (4) Kernel selection strategies for nonlinear anomaly boundaries, (5) The dual formulation and its connections to density estimation, and (6) Practical implementation considerations and hyperparameter tuning.
The core geometric idea behind One-Class SVM is beautifully simple: find a hyperplane that separates the training data from the origin with maximum margin, after mapping the data to a high-dimensional feature space.
But why the origin? The origin serves as a convenient representative of 'everywhere else'—the vast expanse of the input space where normal data should not lie. By pushing the data away from the origin while maximizing the margin, we create a decision boundary that characterizes the normal data region.
The Feature Space Perspective:
In the original input space, the data may not be separable from anything—it's just a cloud of points. But when we map the data to a high-dimensional feature space via a feature map φ(x) (defined implicitly by a kernel), something remarkable happens:
The decision function becomes: f(x) = sign(w·φ(x) - ρ), where points with positive values are classified as normal, and points with negative values are classified as anomalies.
You might wonder why we don't simply fit a minimum bounding sphere around the data. While conceptually simpler, a hyperplane-based approach offers more flexibility: (1) In feature space, a hyperplane can describe arbitrarily complex boundaries in the original space, (2) The maximum-margin principle provides regularization and better generalization, (3) The connection to kernel methods enables efficient computation even in infinite-dimensional feature spaces. That said, the sphere-based approach (SVDD) is closely related and covered in the next page.
Visualizing the Geometry:
Consider a simple 2D example where normal data forms a cluster. In the original space, we want to draw a 'boundary' around this cluster. One-Class SVM achieves this by:
• mapping each point into the feature space defined by the kernel,
• finding the hyperplane that separates the mapped points from the origin with maximum margin, and
• reading the result back in the original space, where the preimage of that hyperplane traces a nonlinear contour around the cluster.
With an RBF kernel, the resulting boundary in the original space is typically smooth and naturally adapts to the shape of the data distribution. The boundary contracts around dense regions and may have multiple components if the data has multiple modes.
| Concept | Mathematical Representation | Intuition |
|---|---|---|
| Feature mapping | φ: X → H (Hilbert space) | Lifts data to a space where separation is possible |
| Separating hyperplane | w·φ(x) = ρ | The boundary between normal and anomalous regions |
| Normal region | {x : w·φ(x) ≥ ρ} | Points on the 'data side' of the hyperplane |
| Anomaly region | {x : w·φ(x) < ρ} | Points on the 'origin side' of the hyperplane |
| Margin | ρ / ||w|| | Distance from hyperplane to origin; larger = more conservative |
| Decision function | f(x) = sign(w·φ(x) - ρ) | +1 for normal, -1 for anomaly |
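To tie the table to code, here is a minimal sketch (assuming scikit-learn's OneClassSVM with a linear kernel; the toy data and variable names are illustrative) showing that the library's predictions are exactly the sign of w·x − ρ, with w read from coef_ and ρ = −intercept_:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Toy 2D cluster shifted away from the origin so a linear separation makes sense
rng = np.random.default_rng(0)
X = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(200, 2))

model = OneClassSVM(kernel='linear', nu=0.1).fit(X)

w = model.coef_.ravel()       # hyperplane normal vector w (exposed for the linear kernel)
rho = -model.intercept_[0]    # offset rho (scikit-learn stores intercept_ = -rho)

# Decision function f(x) = w·x - rho; its sign is the normal/anomaly label
X_query = np.array([[3.0, 3.0], [0.0, 0.0]])   # cluster center vs. the origin
f_values = X_query @ w - rho
print(np.sign(f_values))        # manual decision: typically [+1, -1]
print(model.predict(X_query))   # matches the library's prediction
```

With a nonlinear kernel there is no explicit coef_; the same decision function is instead expressed through the support vector expansion that appears in the dual formulation below.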
The One-Class SVM optimization problem balances two competing objectives: maximizing the margin (the separation between the data and the origin) and allowing some training points to be on the 'wrong' side (to handle outliers in the training set and prevent overfitting).
Primal Formulation:
Given n training examples {x₁, x₂, ..., xₙ} assumed to be drawn from the 'normal' distribution, we solve:
$$\min_{w, \rho, \xi} \frac{1}{2}||w||^2 - \rho + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i$$
Subject to:

$$w \cdot \phi(x_i) \geq \rho - \xi_i, \qquad \xi_i \geq 0, \qquad i = 1, \dots, n$$
Let's dissect each component:
• ½||w||² — the regularization term; minimizing ||w|| maximizes the margin ρ/||w||.
• −ρ — pushes the offset ρ as large as possible, moving the hyperplane away from the origin.
• ξᵢ — slack variables that allow individual training points to fall on the origin side of the hyperplane.
• 1/(νn) — the weight on the slack penalty; smaller ν penalizes violations more heavily, which is what gives ν the interpretation described next.
The parameter ν is not just a regularization knob—it has a beautiful interpretation:
• ν is an upper bound on the fraction of training points that become outliers (support vectors with ξ > 0)
• ν is a lower bound on the fraction of support vectors (points exactly on or beyond the margin)
This means if you set ν = 0.1, at most 10% of your training data will be classified as anomalous, and at least 10% will be support vectors. This provides an intuitive handle on the expected false positive rate.
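As a quick empirical check of the ν-property (a sketch assuming scikit-learn's OneClassSVM; the blob data and the fixed gamma=0.5 are illustrative), we can fit models with different ν and compare the observed fractions against the bounds:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.datasets import make_blobs

# A single Gaussian cluster of "normal" training points
X, _ = make_blobs(n_samples=1000, centers=[[0, 0]], cluster_std=1.0, random_state=0)

for nu in (0.05, 0.1, 0.3):
    model = OneClassSVM(kernel='rbf', nu=nu, gamma=0.5).fit(X)
    outlier_frac = np.mean(model.predict(X) == -1)   # should be <= nu (approximately)
    sv_frac = len(model.support_) / len(X)           # should be >= nu (approximately)
    print(f"nu={nu:.2f}  training outlier fraction={outlier_frac:.3f}  "
          f"support vector fraction={sv_frac:.3f}")
```

The bounds hold approximately on finite samples and become tight as the sample size grows.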
Dual Formulation:
Using Lagrange multipliers αᵢ ≥ 0 for the margin constraints and solving the KKT conditions, we obtain the dual problem:
$$\min_{\alpha} \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j k(x_i, x_j)$$
Subject to:

$$0 \leq \alpha_i \leq \frac{1}{\nu n}, \qquad \sum_{i=1}^{n} \alpha_i = 1$$
where k(xᵢ, xⱼ) = φ(xᵢ)·φ(xⱼ) is the kernel function.
Key Insights from the Dual:
• The data enter only through kernel evaluations k(xᵢ, xⱼ), so φ(x) never needs to be computed explicitly (the kernel trick).
• The box constraint 0 ≤ αᵢ ≤ 1/(νn) makes the solution sparse: most αᵢ are zero, and only the support vectors (points on or outside the boundary) contribute.
• The decision function is a weighted sum of kernels centered on the support vectors—the connection to density estimation explored later on this page.
```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics import classification_report, confusion_matrix

def one_class_svm_decision_function(X_train, X_test, kernel='rbf', nu=0.1, gamma='scale'):
    """
    Train One-Class SVM and compute decision function.

    The decision function f(x) = Σᵢ αᵢ k(xᵢ, x) - ρ
    Returns positive values for normal points, negative for anomalies.

    Parameters:
    -----------
    X_train : array-like of shape (n_samples, n_features)
        Training data (assumed to be from the normal class)
    X_test : array-like of shape (n_test_samples, n_features)
        Test data to classify
    kernel : str, default='rbf'
        Kernel type: 'linear', 'rbf', 'poly', 'sigmoid'
    nu : float, default=0.1
        Upper bound on fraction of training errors (anomalies in training)
        and lower bound on fraction of support vectors
    gamma : str or float, default='scale'
        Kernel coefficient for 'rbf', 'poly', 'sigmoid'

    Returns:
    --------
    model : OneClassSVM
        Trained model
    predictions : array of shape (n_test_samples,)
        +1 for normal (inlier), -1 for anomaly (outlier)
    decision_scores : array of shape (n_test_samples,)
        Signed distance to the separating hyperplane
    """
    # Initialize and train the model
    model = OneClassSVM(
        kernel=kernel,
        nu=nu,
        gamma=gamma,
        tol=1e-4,          # Convergence tolerance
        shrinking=True,    # Use shrinking heuristic for efficiency
        cache_size=500,    # Cache size in MB for kernel calculations
        verbose=False
    )
    model.fit(X_train)

    # Get predictions and decision scores
    predictions = model.predict(X_test)                  # +1 (normal) or -1 (anomaly)
    decision_scores = model.decision_function(X_test)    # Signed distance to boundary

    # Analyze support vectors
    n_support = model.support_vectors_.shape[0]
    sv_ratio = n_support / X_train.shape[0]
    print(f"Support vectors: {n_support} / {X_train.shape[0]} = {sv_ratio:.2%}")
    print(f"ν = {nu} → Expected ratio ≥ {nu:.2%}")
    print(f"Offset (ρ): {model.offset_[0]:.4f}")

    return model, predictions, decision_scores

def analyze_ocsvm_boundary(model, X_grid):
    """
    Analyze the decision boundary by computing decision function on a grid.

    The decision boundary is where decision_function(x) = 0
    """
    # Decision function: positive inside (normal), negative outside (anomaly)
    decision_values = model.decision_function(X_grid)

    # Points on the boundary
    boundary_mask = np.abs(decision_values) < 0.01

    return decision_values, boundary_mask

# Example usage with synthetic data
if __name__ == "__main__":
    from sklearn.datasets import make_blobs

    # Generate normal training data (single cluster)
    X_normal, _ = make_blobs(
        n_samples=300, centers=[[0, 0]], cluster_std=0.5, random_state=42
    )

    # Generate test data: normal points + anomalies
    X_test_normal, _ = make_blobs(
        n_samples=50, centers=[[0, 0]], cluster_std=0.5, random_state=43
    )
    X_anomalies = np.random.uniform(low=-4, high=4, size=(20, 2))
    X_test = np.vstack([X_test_normal, X_anomalies])
    y_test = np.array([1] * 50 + [-1] * 20)  # Ground truth

    # Train and evaluate
    print("=" * 50)
    print("One-Class SVM for Anomaly Detection")
    print("=" * 50)

    for nu in [0.05, 0.1, 0.2]:
        print(f"\n--- ν = {nu} ---")
        model, predictions, scores = one_class_svm_decision_function(
            X_normal, X_test, nu=nu
        )

        print("\nClassification Report:")
        print(classification_report(
            y_test, predictions, target_names=['Anomaly (-1)', 'Normal (+1)']
        ))
```

The choice of kernel function is crucial for One-Class SVM performance. The kernel implicitly defines the feature space where the hyperplane separates normal from anomalous points. Different kernels create different types of decision boundaries.
Radial Basis Function (RBF) Kernel — The Default Choice:
$$k(x, x') = \exp\left(-\gamma ||x - x'||^2\right)$$
The RBF kernel is the most common choice for One-Class SVM for several reasons:
The γ parameter controls the 'width' of the kernel:
• Large γ → narrow kernels: the boundary hugs the training points tightly and can fragment into many small regions (risk of overfitting).
• Small γ → wide kernels: the boundary becomes smooth and inclusive, approaching a single large blob (risk of underfitting).
| Kernel | Formula | Decision Boundary Shape | Best Use Cases |
|---|---|---|---|
| Linear | k(x, x') = x·x' | Hyperplane (half-space) | High-dimensional sparse data; when linear separation suffices |
| RBF (Gaussian) | exp(-γ||x-x'||²) | Smooth, closed contours | General-purpose; compact data clusters; unknown boundary shape |
| Polynomial | (γx·x' + r)^d | Polynomial curves | When polynomial relationships exist; image feature spaces |
| Sigmoid | tanh(γx·x' + r) | Similar to neural networks | When neural network-like behavior desired; less common |
When using RBF kernel, γ and ν interact in complex ways:
• Very large γ with small ν: The boundary tightly wraps each training point individually, leading to severe overfitting and inability to generalize to new normal points.
• Very small γ with large ν: The boundary becomes too smooth and large, failing to exclude anomalies.
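A small sketch of this interaction (assuming scikit-learn's OneClassSVM on illustrative blob data): sweeping γ at a fixed ν while watching the support vector fraction and the training anomaly rate makes the overfitting and underfitting regimes visible.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=[[0, 0]], cluster_std=1.0, random_state=0)

for gamma in (0.01, 1.0, 100.0):
    model = OneClassSVM(kernel='rbf', nu=0.1, gamma=gamma).fit(X)
    sv_frac = len(model.support_) / len(X)       # grows toward 1 as gamma explodes
    flagged = np.mean(model.predict(X) == -1)    # fraction of training points outside
    print(f"gamma={gamma:>6}: support vector fraction={sv_frac:.2f}, "
          f"training anomaly rate={flagged:.2f}")

# Typically: very large gamma -> nearly every point becomes a support vector
# (the boundary wraps points individually); very small gamma -> one smooth blob.
```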
Recommended approach: First fix ν based on expected contamination rate, then tune γ using cross-validation or by monitoring the fraction of training data classified as normal.
Kernel Parameter Selection Strategies:
1. Scale-based Heuristics (for RBF):
• sklearn's 'scale' setting: γ = 1 / (n_features · Var(X)).
• The median heuristic: γ = 1 / (2 · median pairwise squared distance), a common default in the kernel-methods literature.
These heuristics ensure the kernel 'sees' meaningful variation in the data; both (plus mean-based and Silverman variants) are computed in compute_gamma_heuristics in the code below.
2. Grid Search with Stability:
Since we lack labeled anomalies for validation, we use proxy metrics:
• the fraction of training points flagged as anomalous (it should land close to the chosen ν),
• the support vector ratio (a very high ratio suggests overfitting), and
• prediction stability under bootstrap resampling (unstable predictions indicate a poorly chosen γ).
These are exactly the quantities tracked by the grid search code below.
3. Domain Knowledge:
If you know the expected scale of normal variation, set γ accordingly: with a characteristic length scale ℓ (the distance over which normal points should still look similar), a natural choice is γ ≈ 1/(2ℓ²), so that points within ℓ of each other retain high kernel similarity.
```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.model_selection import ParameterGrid
from scipy.spatial.distance import pdist

def compute_gamma_heuristics(X):
    """
    Compute different heuristic values for γ (RBF kernel width).

    Returns:
    --------
    dict with gamma values from different heuristics
    """
    n_samples, n_features = X.shape

    # Variance-based (sklearn 'scale')
    gamma_scale = 1.0 / (n_features * X.var())

    # Median heuristic (popular in kernel methods literature)
    pairwise_dists_sq = pdist(X, 'sqeuclidean')
    gamma_median = 1.0 / (2 * np.median(pairwise_dists_sq))

    # Mean-based
    gamma_mean = 1.0 / (2 * np.mean(pairwise_dists_sq))

    # Silverman's rule of thumb (adapted from KDE)
    std_avg = np.mean(np.std(X, axis=0))
    h_silverman = (
        (4 / (n_features + 2)) ** (1 / (n_features + 4))
        * n_samples ** (-1 / (n_features + 4))
        * std_avg
    )
    gamma_silverman = 1.0 / (2 * h_silverman ** 2)

    return {
        'scale': gamma_scale,
        'median': gamma_median,
        'mean': gamma_mean,
        'silverman': gamma_silverman
    }

def evaluate_ocsvm_stability(X, gamma, nu, n_bootstrap=20, sample_ratio=0.8):
    """
    Evaluate stability of One-Class SVM predictions under bootstrap resampling.

    High stability = predictions are consistent = well-chosen parameters
    Low stability  = predictions vary = overfitting or underfitting
    """
    n_samples = X.shape[0]
    sample_size = int(n_samples * sample_ratio)

    predictions = []
    for _ in range(n_bootstrap):
        # Bootstrap sample
        indices = np.random.choice(n_samples, size=sample_size, replace=True)
        X_boot = X[indices]

        # Train model
        model = OneClassSVM(kernel='rbf', gamma=gamma, nu=nu)
        model.fit(X_boot)

        # Predict on full dataset
        pred = model.predict(X)
        predictions.append(pred)

    predictions = np.array(predictions)  # Shape: (n_bootstrap, n_samples)

    # Stability = fraction of samples with consistent predictions
    # For each sample, compute fraction of bootstraps that agree with majority
    majority_vote = np.sign(np.sum(predictions, axis=0))
    agreement = np.mean(predictions == majority_vote, axis=0)
    stability = np.mean(agreement)

    return stability, majority_vote

def grid_search_ocsvm(X, nu, gamma_range, stability_threshold=0.9):
    """
    Grid search for optimal gamma based on stability and training metrics.
    """
    results = []

    for gamma in gamma_range:
        # Train model
        model = OneClassSVM(kernel='rbf', gamma=gamma, nu=nu)
        model.fit(X)

        # Compute metrics
        predictions = model.predict(X)
        train_anomaly_rate = np.mean(predictions == -1)
        n_support_vectors = model.support_vectors_.shape[0]
        sv_ratio = n_support_vectors / X.shape[0]

        # Compute stability
        stability, _ = evaluate_ocsvm_stability(X, gamma, nu, n_bootstrap=10)

        results.append({
            'gamma': gamma,
            'train_anomaly_rate': train_anomaly_rate,
            'sv_ratio': sv_ratio,
            'stability': stability,
            'target_deviation': abs(train_anomaly_rate - nu)
        })

        print(f"γ={gamma:.4f}: anomaly_rate={train_anomaly_rate:.3f}, "
              f"SV_ratio={sv_ratio:.3f}, stability={stability:.3f}")

    # Select best: minimize deviation from target ν while maintaining stability
    stable_results = [r for r in results if r['stability'] >= stability_threshold]
    if stable_results:
        best = min(stable_results, key=lambda r: r['target_deviation'])
    else:
        best = max(results, key=lambda r: r['stability'])

    print(f"\nBest γ = {best['gamma']:.4f}")
    return best['gamma'], results

# Example usage
if __name__ == "__main__":
    from sklearn.datasets import make_moons

    # Generate crescent-shaped normal data
    X, _ = make_moons(n_samples=300, noise=0.1, random_state=42)

    print("Gamma Heuristics:")
    heuristics = compute_gamma_heuristics(X)
    for name, value in heuristics.items():
        print(f"  {name}: γ = {value:.4f}")

    print("\nGrid Search for Optimal γ (ν=0.1):")
    gamma_range = np.logspace(-2, 2, 15)
    best_gamma, results = grid_search_ocsvm(X, nu=0.1, gamma_range=gamma_range)
```

One-Class SVM has a deep connection to density estimation that provides additional intuition for its behavior and helps explain when it works well or poorly.
The Density Level Set Perspective:
Under certain conditions, the One-Class SVM decision boundary converges to a density level set of the underlying data distribution. Specifically, as the sample size increases and with appropriate kernel bandwidth:
$$\{x : f(x) \geq \tau\} \approx \{x : p(x) \geq \tau'\}$$
where f(x) is the OC-SVM decision function, p(x) is the true data density, and τ, τ' are related thresholds.
This means One-Class SVM is implicitly estimating regions of high probability density—exactly what we want for anomaly detection, since anomalies are low-density points.
The Support Vector Expansion:
The decision function can be written as:
$$f(x) = \sum_{i \in SV} \alpha_i k(x_i, x) - \rho$$
where SV is the set of support vector indices. This is remarkably similar to a Parzen window (kernel density) estimate:
$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} k_h(x_i, x)$$
The key differences:
• Weights: the Parzen estimate gives every point the same weight 1/n, while One-Class SVM learns non-uniform weights αᵢ that are zero for most points, so only the support vectors contribute (a sparse expansion).
• Threshold: KDE requires choosing a density threshold separately, whereas One-Class SVM learns the offset ρ jointly with the weights through the margin optimization.
• Bandwidth role: in both cases the kernel width (γ or h) controls how smooth the estimated normal region is.
This connection explains why RBF kernel One-Class SVM tends to perform well when the normal data has a smooth, unimodal distribution—exactly the setting where KDE excels.
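To make the expansion concrete, the sketch below (assuming scikit-learn; the fixed gamma=0.5 and the bandwidth conversion are illustrative choices) rebuilds a fitted model's decision function by hand from its support vectors and dual coefficients, and sets up the analogous Parzen/KDE score on the same data:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.neighbors import KernelDensity
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=[[0, 0]], cluster_std=1.0, random_state=0)
gamma = 0.5  # fixed kernel width so the manual expansion is easy to reproduce

model = OneClassSVM(kernel='rbf', nu=0.1, gamma=gamma).fit(X)

# Support vector expansion: f(x) = sum_{i in SV} alpha_i k(x_i, x) - rho.
# scikit-learn stores the alpha_i in dual_coef_ and -rho in intercept_.
K = rbf_kernel(X, model.support_vectors_, gamma=gamma)
f_manual = K @ model.dual_coef_.ravel() + model.intercept_

print("expansion matches decision_function:",
      np.allclose(f_manual, model.decision_function(X)))
print("support vectors:", model.support_vectors_.shape[0], "of", len(X))

# Parzen / KDE analogue: every training point gets the same weight 1/n
kde = KernelDensity(kernel='gaussian', bandwidth=1.0 / np.sqrt(2 * gamma)).fit(X)
log_density = kde.score_samples(X)  # low values = low-density (anomaly-like) points
```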
Use One-Class SVM when:
• You need fast predictions on new data (sparse support vector representation)
• You expect some anomalies in the training set (ν handles contamination)
• You want a hard decision boundary, not probability estimates
• The data is high-dimensional but lies on a lower-dimensional manifold
Prefer KDE or GMM when:
• You need probabilistic anomaly scores, not just yes/no
• The data has clear multimodal structure
• You have domain knowledge about the distribution form
• Sample size is small and sparsity benefits are minimal
Deploying One-Class SVM effectively requires attention to preprocessing, scaling, and computational efficiency. Here we cover the practical aspects that distinguish a working implementation from a textbook algorithm.
1. Feature Scaling is Critical:
The RBF kernel computes ||x - x'||², which is heavily influenced by feature scales. Features with larger magnitudes dominate distance calculations, effectively ignoring smaller-scale features.
Recommended approach: Standardize all features to zero mean and unit variance: $$x_{scaled} = \frac{x - \mu}{\sigma}$$
Alternatively, for non-Gaussian features, use robust scaling (median and IQR) or min-max scaling to [0, 1].
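A minimal sketch of this preprocessing step (assuming scikit-learn's Pipeline; the synthetic features and parameter values are illustrative), bundling the scaler with the detector so new data is transformed consistently:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Two features on wildly different scales: unscaled, the second feature
# would dominate the RBF distance computation.
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 1000, 500)])

detector = make_pipeline(
    StandardScaler(),   # swap in RobustScaler() or MinMaxScaler() as discussed above
    OneClassSVM(kernel='rbf', nu=0.1, gamma='scale'),
)
detector.fit(X_train)

X_new = np.array([[0.0, 0.0], [10.0, 0.0]])   # second point is far off in feature 1
print(detector.predict(X_new))                # +1 = normal, -1 = anomaly
```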
2. Handling Contaminated Training Data:
Real-world 'normal' training data often contains some anomalies. One-Class SVM handles this through the ν parameter, but additional preprocessing helps:
```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import Pipeline
import warnings

class RobustOneClassSVM:
    """
    Production-ready One-Class SVM with best practices built in.

    Features:
    - Automatic feature scaling (robust to outliers)
    - Optional training data cleaning with Isolation Forest
    - Multiple kernel support with sensible defaults
    - Calibrated anomaly scores
    """

    def __init__(
        self,
        nu=0.1,
        kernel='rbf',
        gamma='scale',
        preprocessing='robust',
        clean_training=True,
        contamination_estimate=0.05,
        verbose=False
    ):
        """
        Parameters:
        -----------
        nu : float, default=0.1
            Upper bound on training error fraction
        kernel : str, default='rbf'
            Kernel type ('rbf', 'linear', 'poly', 'sigmoid')
        gamma : str or float, default='scale'
            Kernel coefficient
        preprocessing : str, default='robust'
            'robust' (median/IQR), 'standard' (mean/std), or None
        clean_training : bool, default=True
            Whether to pre-filter training data for outliers
        contamination_estimate : float, default=0.05
            Expected fraction of anomalies in training data
        """
        self.nu = nu
        self.kernel = kernel
        self.gamma = gamma
        self.preprocessing = preprocessing
        self.clean_training = clean_training
        self.contamination_estimate = contamination_estimate
        self.verbose = verbose

        self.scaler_ = None
        self.model_ = None
        self.training_mask_ = None
        self.decision_offset_ = None

    def fit(self, X, y=None):
        """
        Fit the One-Class SVM model.

        Parameters:
        -----------
        X : array-like of shape (n_samples, n_features)
            Training data (assumed mostly normal)
        y : ignored
        """
        X = np.asarray(X)
        n_samples_original = X.shape[0]

        # Step 1: Preprocessing (scaling)
        if self.preprocessing == 'robust':
            self.scaler_ = RobustScaler()
        elif self.preprocessing == 'standard':
            self.scaler_ = StandardScaler()
        else:
            self.scaler_ = None

        if self.scaler_ is not None:
            X_scaled = self.scaler_.fit_transform(X)
        else:
            X_scaled = X.copy()

        # Step 2: Training data cleaning (optional)
        if self.clean_training:
            if self.verbose:
                print(f"Cleaning training data (contamination={self.contamination_estimate})...")
            iso_forest = IsolationForest(
                contamination=self.contamination_estimate,
                random_state=42,
                n_jobs=-1
            )
            inlier_mask = iso_forest.fit_predict(X_scaled) == 1
            X_clean = X_scaled[inlier_mask]
            self.training_mask_ = inlier_mask

            if self.verbose:
                n_removed = n_samples_original - X_clean.shape[0]
                print(f"Removed {n_removed} potential outliers from training")
        else:
            X_clean = X_scaled
            self.training_mask_ = np.ones(n_samples_original, dtype=bool)

        # Step 3: Train One-Class SVM
        self.model_ = OneClassSVM(
            kernel=self.kernel,
            nu=self.nu,
            gamma=self.gamma,
            tol=1e-4,
            shrinking=True,
            cache_size=500
        )
        self.model_.fit(X_clean)

        # Step 4: Calibrate decision offset for interpretable scores
        train_scores = self.model_.decision_function(X_clean)
        self.decision_offset_ = np.median(train_scores)

        if self.verbose:
            print(f"Training complete:")
            print(f"  Support vectors: {self.model_.support_vectors_.shape[0]}")
            print(f"  Decision offset: {self.decision_offset_:.4f}")

        return self

    def predict(self, X):
        """
        Predict if samples are normal (+1) or anomalies (-1).
        """
        X = np.asarray(X)
        X_scaled = self.scaler_.transform(X) if self.scaler_ else X
        return self.model_.predict(X_scaled)

    def decision_function(self, X):
        """
        Compute signed distance to the decision boundary.

        Positive = normal (inside boundary)
        Negative = anomaly (outside boundary)
        """
        X = np.asarray(X)
        X_scaled = self.scaler_.transform(X) if self.scaler_ else X
        return self.model_.decision_function(X_scaled)

    def anomaly_score(self, X):
        """
        Compute calibrated anomaly scores.

        Returns values in [0, 1] where:
        - 0   = definitely normal
        - 0.5 = on the boundary
        - 1   = definitely anomalous

        Uses sigmoid calibration centered on the training median.
        """
        raw_scores = self.decision_function(X)

        # Calibrate: sigmoid centered on training median
        # Points below median get scores > 0.5, above get < 0.5
        calibrated = 1.0 / (1.0 + np.exp(raw_scores - self.decision_offset_))
        return calibrated

    def fit_predict(self, X, y=None):
        """Fit the model and predict on the same data."""
        self.fit(X)
        return self.predict(X)

# Example usage with evaluation
if __name__ == "__main__":
    from sklearn.datasets import make_blobs
    from sklearn.metrics import roc_auc_score, average_precision_score

    # Generate data
    X_normal, _ = make_blobs(n_samples=500, centers=[[0, 0]], cluster_std=1.0, random_state=42)
    X_test_normal, _ = make_blobs(n_samples=100, centers=[[0, 0]], cluster_std=1.0, random_state=43)
    X_anomalies = np.random.uniform(-5, 5, size=(30, 2))

    X_test = np.vstack([X_test_normal, X_anomalies])
    y_test = np.array([0] * 100 + [1] * 30)  # 0 = normal, 1 = anomaly

    # Train and evaluate
    model = RobustOneClassSVM(nu=0.1, verbose=True)
    model.fit(X_normal)

    anomaly_scores = model.anomaly_score(X_test)
    predictions = model.predict(X_test)

    # Compute metrics
    auroc = roc_auc_score(y_test, anomaly_scores)
    auprc = average_precision_score(y_test, anomaly_scores)

    print(f"\nEvaluation Results:")
    print(f"  AUROC: {auroc:.4f}")
    print(f"  AUPRC: {auprc:.4f}")
    print(f"  Detection rate: {np.mean(predictions[100:] == -1):.2%}")
    print(f"  False positive rate: {np.mean(predictions[:100] == -1):.2%}")
```

One-Class SVM training is O(n²) to O(n³) in the number of training samples due to kernel matrix computation and quadratic programming. For large datasets (n > 10,000):
• Use SGD-based approximations (sklearn's SGDOneClassSVM)
• Consider random feature approximations (Nyström, Random Fourier Features)
• Use mini-batch training with model updates
• Pre-cluster data and train separate models per cluster
Prediction is O(n_sv × d) where n_sv is the number of support vectors, making it fast for sparse solutions.
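As a rough sketch of the first two bullets above (assuming scikit-learn's SGDOneClassSVM and Nystroem kernel approximation; n_components, gamma, and the synthetic data are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.pipeline import make_pipeline

# A larger "normal" training set where the exact kernel solver becomes slow
X_train, _ = make_blobs(n_samples=50_000, centers=[[0, 0]], cluster_std=1.0,
                        random_state=0)

# Approximate the RBF feature map with a low-rank Nystroem expansion, then fit
# a linear one-class SVM by stochastic gradient descent in that feature space.
approx_ocsvm = make_pipeline(
    Nystroem(kernel='rbf', gamma=0.5, n_components=200, random_state=0),
    SGDOneClassSVM(nu=0.1, random_state=0),
)
approx_ocsvm.fit(X_train)

X_new = np.array([[0.0, 0.0], [6.0, 6.0]])
print(approx_ocsvm.predict(X_new))   # +1 = normal, -1 = anomaly
```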
We've explored One-Class SVM from geometric intuition through mathematical formulation to practical implementation. Let's consolidate the essential insights:
• Geometrically, One-Class SVM separates the data from the origin in feature space with maximum margin; the preimage of that hyperplane is the anomaly boundary in the input space.
• The ν parameter upper-bounds the fraction of training points treated as outliers and lower-bounds the fraction of support vectors, giving direct control over the expected false positive rate.
• With the RBF kernel, γ governs how tightly the boundary wraps the data and should be tuned jointly with ν using scale heuristics, stability, and the training anomaly rate.
• The support vector expansion links One-Class SVM to kernel density estimation: it behaves like a sparse, weighted density level-set estimator.
• In practice, feature scaling, handling of contaminated training data, and awareness of the O(n²)-O(n³) training cost matter as much as the model itself.
What's Next:
In the next page, we examine Support Vector Data Description (SVDD)—a closely related method that fits a minimum-volume hypersphere around the normal data instead of separating from the origin. SVDD offers complementary insights and is sometimes preferred when a 'closed' boundary interpretation is more natural.
You now understand One-Class SVM at both the conceptual and implementation level. You can formulate the optimization problem, select appropriate kernels and hyperparameters, interpret the decision boundary geometrically, and build production-ready anomaly detectors. Next, we'll explore SVDD as an alternative geometric formulation.