We've established that anomalies have shorter path lengths in isolation trees. But raw path lengths are not directly useful for anomaly detection: they grow with the subsample size, they have no fixed scale, and a given length carries no meaning on its own.
To create a universal, interpretable anomaly score, we need a principled transformation that (1) maps path lengths into a bounded range, (2) makes scores comparable across sample sizes, and (3) provides a natural reference point separating "anomalous" from "normal".
This page explores the elegant mathematical framework that accomplishes all three goals.
By the end of this page, you will: (1) Master the mathematical derivation of the Isolation Forest scoring function, (2) Understand the BST-based normalization and its theoretical justification, (3) Know how to interpret and calibrate anomaly scores for your specific application, (4) Learn strategies for threshold selection and operating point optimization.
The Isolation Forest scoring function transforms average path lengths into anomaly scores. Let's build it step by step.
Step 1: Average Path Length
For a point $x$ and an ensemble of $t$ isolation trees, compute the average path length:
$$E[h(x)] = \frac{1}{t} \sum_{i=1}^{t} h_i(x)$$
where $h_i(x)$ is the path length for point $x$ in tree $i$.
Step 2: Normalization Constant
To make path lengths comparable across different sample sizes, we normalize by $c(n)$, the average path length of an unsuccessful search in a Binary Search Tree with $n$ nodes:
$$c(n) = \begin{cases} 0 & \text{if } n \le 1 \\ 1 & \text{if } n = 2 \\ 2H(n-1) - \frac{2(n-1)}{n} & \text{if } n > 2 \end{cases}$$
where $H(i)$ is the $i$-th harmonic number, well approximated by $H(i) \approx \ln(i) + \gamma$ (γ ≈ 0.5772 is the Euler-Mascheroni constant).
Step 3: The Score
The anomaly score is:
$$s(x, \psi) = 2^{-\frac{E[h(x)]}{c(\psi)}}$$
where $\psi$ is the subsample size used for tree construction.
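As a quick sanity check, we can evaluate the score for a few path lengths at the default ψ = 256, using the $c(n)$ formula above (a minimal sketch; the helper names are illustrative):

```python
import math

def c(n: int) -> float:
    """Average unsuccessful-search path length in a BST with n nodes."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    gamma = 0.5772156649  # Euler-Mascheroni constant
    harmonic = math.log(n - 1) + gamma  # H(n-1) ~ ln(n-1) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def score(avg_path_length: float, psi: int) -> float:
    """Anomaly score s(x, psi) = 2^(-E[h(x)] / c(psi))."""
    return 2.0 ** (-avg_path_length / c(psi))

psi = 256
print(f"c({psi}) = {c(psi):.2f}")              # ~10.24
print(f"s(h=4)  = {score(4, psi):.3f}")        # short path -> score above 0.5
print(f"s(h=c)  = {score(c(psi), psi):.3f}")   # exactly 0.5 by construction
print(f"s(h=20) = {score(20, psi):.3f}")       # long path -> score below 0.5
```

Note how a path length equal to $c(\psi)$ lands exactly on the 0.5 reference point, since $2^{-1} = 0.5$.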
| Property | Value/Range | Interpretation |
|---|---|---|
| Domain (input) | $E[h(x)] \in [0, \infty)$ | Average path length can be any non-negative value |
| Range (output) | $s(x, \psi) \in (0, 1)$ | Scores are bounded between 0 and 1 |
| Score = 1 | When $E[h(x)] = 0$ | Point isolated immediately (theoretical limit) |
| Score = 0.5 | When $E[h(x)] = c(\psi)$ | Average isolation time = BST average |
| Score → 0 | When $E[h(x)] \to \infty$ | Very hard to isolate → definitely normal |
| Monotonicity | Strictly decreasing | Shorter paths → higher scores → more anomalous |
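The boundary behavior summarized in the table can be verified numerically. A small sketch (reimplementing the score formula for the check):

```python
import numpy as np

gamma = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Normalization constant c(n)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2 * (np.log(n - 1) + gamma) - 2 * (n - 1) / n

def s(h, psi=256):
    """Anomaly score for average path length h."""
    return 2.0 ** (-h / c(psi))

assert s(0) == 1.0                    # immediate isolation -> score 1
assert abs(s(c(256)) - 0.5) < 1e-12   # path length = c(psi) -> score 0.5
assert s(1e6) < 1e-6                  # very long path -> score approaches 0
h = np.linspace(0, 30, 100)
assert np.all(np.diff(s(h)) < 0)      # strictly decreasing in path length
print("All score properties verified.")
```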
```python
import numpy as np


def harmonic_number(i: int) -> float:
    """
    Compute the i-th harmonic number using the logarithmic approximation.

    H(i) ≈ ln(i) + γ, where γ ≈ 0.5772 is the Euler-Mascheroni constant.
    """
    if i <= 0:
        return 0.0
    euler_mascheroni = 0.5772156649
    return np.log(i) + euler_mascheroni


def average_path_length(n: int) -> float:
    """
    Compute c(n), the average path length of an unsuccessful search in a BST.

    For n items, this represents the expected depth of a search that
    terminates at a leaf (i.e., the key is not found).

    c(n) = 2H(n-1) - 2(n-1)/n

    This formula comes from the analysis of random BSTs:
    - H(n-1) is the (n-1)th harmonic number
    - The factor of 2 accounts for both internal path length contributions
    - The correction term adjusts for the specific case of unsuccessful search
    """
    if n <= 1:
        return 0.0  # No comparisons needed
    elif n == 2:
        return 1.0  # Exactly one comparison
    else:
        h_n_minus_1 = harmonic_number(n - 1)
        return 2.0 * h_n_minus_1 - (2.0 * (n - 1) / n)


def anomaly_score(path_lengths: np.ndarray, subsample_size: int) -> np.ndarray:
    """
    Convert average path lengths to anomaly scores.

    The formula s(x, ψ) = 2^(-E[h(x)] / c(ψ)) is motivated by:
    1. Exponential decay ensures short paths map to high scores
    2. Base 2 connects to the binary tree structure
    3. Division by c(ψ) normalizes for sample size

    Args:
        path_lengths: Average path lengths for each point
        subsample_size: The ψ parameter (typically 256)

    Returns:
        Anomaly scores in (0, 1), higher = more anomalous
    """
    c_psi = average_path_length(subsample_size)
    if c_psi == 0:
        return np.zeros_like(path_lengths)

    # The core scoring formula
    normalized_lengths = path_lengths / c_psi
    return np.power(2.0, -normalized_lengths)


def analyze_score_function(subsample_size: int = 256) -> None:
    """Analyze the scoring function's behavior for a given ψ."""
    c_psi = average_path_length(subsample_size)
    print(f"For subsample size ψ = {subsample_size}:")
    print(f"  c(ψ) = {c_psi:.4f}")
    print(f"  This means a path length of {c_psi:.2f} maps to score = 0.5")
    print()

    # Show scores for various path lengths
    path_lengths = np.array(
        [1, 2, 3, 4, c_psi / 2, c_psi, c_psi * 1.5, c_psi * 2, c_psi * 3]
    )
    scores = anomaly_score(path_lengths, subsample_size)

    print("Path Length | Normalized | Score  | Interpretation")
    print("-" * 65)
    for pl, s in zip(path_lengths, scores):
        norm = pl / c_psi
        if s > 0.6:
            interp = "ANOMALY (high confidence)"
        elif s > 0.5:
            interp = "Likely anomaly"
        elif s < 0.4:
            interp = "Normal"
        else:
            interp = "Ambiguous"
        print(f"{pl:>11.2f} | {norm:>10.3f} | {s:.4f} | {interp}")


analyze_score_function(256)
```

The normalization constant $c(n)$ is derived from the analysis of Binary Search Trees. Understanding this connection reveals why it's the perfect normalization for isolation path lengths.
Why BST?
The isolation tree construction process is analogous to building a BST: both recursively split the data with binary decisions, and both produce tree structures in which the depth/path length reflects how "distinguishable" an item is.
The Average Unsuccessful Search Path:
In BST analysis, we distinguish between successful searches, which terminate at an internal node holding the key, and unsuccessful searches, which terminate at an external (null) link.
For isolation, we're interested in the unsuccessful search—we want to know how deep we go before the point is 'alone' (isolated), analogous to reaching a null link in BST.
Mathematical Derivation of $c(n)$:
For a random BST constructed by inserting $n$ distinct keys in random order:
Internal path length $I(n)$: the total of the path lengths to all internal nodes, with expectation $$E[I(n)] \approx 2n\,H_n - 2n$$
External path length $E(n)$: the total of the path lengths to all external (null) nodes, related by $$E(n) = I(n) + 2n$$ (each internal node contributes 2 to the external path length)
Number of external nodes: $n + 1$ (one more than the number of internal nodes in any binary tree)
Average external path length (unsuccessful search): $$c(n) = \frac{E(n)}{n+1} = \frac{E[I(n)] + 2n}{n+1}$$ which, after substituting the expectation and simplifying with the harmonic-number approximation (dropping lower-order terms), yields the formula used in practice: $$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
This formula gives us the expected depth at which a random search terminates, which perfectly normalizes the isolation path length.
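Since the closed form uses the approximation $H(i) \approx \ln(i) + \gamma$, it is worth checking how far this drifts from the exact harmonic sum at typical subsample sizes. A quick sketch (the helper names are illustrative, not from the original):

```python
import math

def c_exact(n):
    """c(n) using the exact harmonic number H(n-1) = sum_{k=1}^{n-1} 1/k."""
    if n <= 2:
        return float(n - 1) if n >= 1 else 0.0
    h = sum(1.0 / k for k in range(1, n))
    return 2 * h - 2 * (n - 1) / n

def c_approx(n):
    """c(n) using H(i) ~ ln(i) + gamma (the form used in practice)."""
    if n <= 2:
        return float(n - 1) if n >= 1 else 0.0
    gamma = 0.5772156649
    return 2 * (math.log(n - 1) + gamma) - 2 * (n - 1) / n

# Compare the exact and approximate normalization constants
for n in (32, 256, 4096):
    print(f"n={n:>5}: exact={c_exact(n):.4f}  approx={c_approx(n):.4f}")
```

The two agree to within a few thousandths for any realistic ψ, so the approximation is harmless in practice.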
| n (samples) | c(n) | log₂(n) | c(n)/log₂(n) | Notes |
|---|---|---|---|---|
| 32 | 6.09 | 5.00 | 1.22 | Small batch |
| 64 | 7.47 | 6.00 | 1.25 | |
| 128 | 8.86 | 7.00 | 1.27 | |
| 256 | 10.24 | 8.00 | 1.28 | Default ψ |
| 512 | 11.63 | 9.00 | 1.29 | |
| 1024 | 13.02 | 10.00 | 1.30 | |
| 4096 | 15.79 | 12.00 | 1.32 | Large batch |
Note that c(n) grows logarithmically with n, roughly as 1.2-1.3 × log₂(n) over this range. Each doubling of the sample size adds only about 2 ln 2 ≈ 1.39 to the normalization constant, which keeps score comparisons relatively stable across different sample sizes.
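The constant-increment behavior under doubling follows directly from the logarithm: $c(2n) - c(n) \to 2\ln 2 \approx 1.386$. A short sketch confirming this numerically:

```python
import math

gamma = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """c(n) = 2H(n-1) - 2(n-1)/n with H(i) ~ ln(i) + gamma."""
    return 2 * (math.log(n - 1) + gamma) - 2 * (n - 1) / n

# Each doubling of n adds roughly 2*ln(2) ~ 1.386 to c(n)
for n in (64, 128, 256, 512, 1024):
    print(f"c({2 * n}) - c({n}) = {c(2 * n) - c(n):.3f}")
```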
Understanding how anomaly scores are distributed helps in threshold selection and result interpretation.
For a Pure Normal Dataset:
When all points are drawn from a single, well-behaved distribution (e.g., multivariate Gaussian), the score distribution typically concentrates near or just below 0.5, with a modest spread and no pronounced right tail.
For a Dataset with Anomalies:
The bulk of points still scores near 0.5 or below, but a right tail appears: anomalies receive distinctly higher scores, ideally separated from the main mass by a visible gap.
```python
import numpy as np
from sklearn.ensemble import IsolationForest


def analyze_score_distribution(X_normal, X_anomaly=None, contamination=0.01):
    """
    Analyze the distribution of anomaly scores.

    Args:
        X_normal: Normal data points
        X_anomaly: Anomaly points (optional)
        contamination: Expected anomaly fraction

    Returns:
        Dictionary with score statistics
    """
    # Combine data
    if X_anomaly is not None:
        X = np.vstack([X_normal, X_anomaly])
        labels = np.array([0] * len(X_normal) + [1] * len(X_anomaly))
    else:
        X = X_normal
        labels = np.zeros(len(X_normal))

    # Fit Isolation Forest
    clf = IsolationForest(
        n_estimators=100,
        max_samples=256,
        contamination=contamination,
        random_state=42,
    )
    clf.fit(X)

    # Get scores (negate sklearn's convention: higher = more anomalous)
    scores = -clf.score_samples(X)

    # Compute statistics
    results = {
        'overall': {
            'mean': scores.mean(),
            'std': scores.std(),
            'min': scores.min(),
            'max': scores.max(),
            'median': np.median(scores),
            'q25': np.percentile(scores, 25),
            'q75': np.percentile(scores, 75),
        }
    }

    if X_anomaly is not None:
        normal_scores = scores[labels == 0]
        anomaly_scores = scores[labels == 1]
        results['normal'] = {
            'mean': normal_scores.mean(),
            'std': normal_scores.std(),
            'max': normal_scores.max(),
        }
        results['anomaly'] = {
            'mean': anomaly_scores.mean(),
            'std': anomaly_scores.std(),
            'min': anomaly_scores.min(),
        }
        results['separation'] = anomaly_scores.min() - normal_scores.max()

    return results


# Example: Analyze score distributions
np.random.seed(42)

# Normal data: 2D Gaussian
X_normal = np.random.randn(1000, 2)

# Case 1: No anomalies
print("=== Case 1: Pure Normal Data ===")
results = analyze_score_distribution(X_normal)
print(f"Mean score: {results['overall']['mean']:.4f}")
print(f"Std score:  {results['overall']['std']:.4f}")
print(f"Max score:  {results['overall']['max']:.4f}")
print()

# Case 2: With clear anomalies
X_anomaly = 5 + 0.5 * np.random.randn(10, 2)  # Offset cluster
print("=== Case 2: With Anomalies ===")
results = analyze_score_distribution(X_normal, X_anomaly)
print(f"Normal mean:  {results['normal']['mean']:.4f}")
print(f"Normal max:   {results['normal']['max']:.4f}")
print(f"Anomaly mean: {results['anomaly']['mean']:.4f}")
print(f"Anomaly min:  {results['anomaly']['min']:.4f}")
print(f"Separation:   {results['separation']:.4f}")
```

Always visualize your score distribution before selecting a threshold. A histogram with labeled normal/anomaly points (if known) reveals: (1) whether there's a clear separation, (2) what threshold would give the desired precision/recall tradeoff, and (3) whether anomalies are truly isolated or borderline.
The anomaly score is continuous, but most applications require a binary decision: is this point an anomaly or not? Threshold selection determines this boundary.
Strategy 1: Contamination-Based Threshold
If you know (or estimate) the expected proportion of anomalies in your data:
A threshold is set so that the top contamination × n points are classified as anomalies. This is what scikit-learn's `contamination` parameter does: it does not affect the scores at all, only the decision threshold.
Strategy 2: Statistical Threshold
Treat scores above a chosen percentile of the training-score distribution as anomalies, e.g., the 95th or 99th percentile depending on how conservative you want to be.
Strategy 3: Score-Based Rules
Use fixed score thresholds anchored to the 0.5 reference point: for example, flag s > 0.6 as anomalous, treat s < 0.4 as normal, and regard the band around 0.5 as ambiguous.
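A fixed score threshold can always be translated back into a path-length requirement by inverting the scoring formula: $h = -c(\psi)\log_2(s)$. A small sketch (the helper name is illustrative):

```python
import math

gamma = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Normalization constant c(n) = 2H(n-1) - 2(n-1)/n."""
    return 2 * (math.log(n - 1) + gamma) - 2 * (n - 1) / n

def path_length_for_score(s, psi=256):
    """Invert s = 2^(-h/c(psi)) to recover the implied average path length."""
    return -c(psi) * math.log2(s)

# With psi = 256, each score threshold corresponds to a path-length cutoff
for s in (0.5, 0.55, 0.6, 0.65, 0.7):
    print(f"score > {s:.2f}  <=>  avg path length < {path_length_for_score(s):.2f}")
```

This makes fixed thresholds interpretable in the tree domain: a higher score cutoff simply demands that points isolate in correspondingly fewer splits.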
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Contamination-based | Guarantees fixed anomaly rate; intuitive | Requires knowledge of true anomaly rate; may miss anomalies or flag normals | When you know the expected anomaly frequency |
| Statistical percentile | No labeled data needed; controls false positive rate | Assumes percentile = anomaly rate; sensitive to data distribution | Exploratory analysis; unknown anomaly rate |
| Fixed score threshold | Simple; interpretable; consistent across datasets | May not generalize across very different datasets | Well-understood domains with consistent anomaly types |
| Precision@K | Find exactly K most anomalous points | Ignores score magnitudes; may include normal points | Resource-limited review (e.g., manual investigation) |
```python
import numpy as np
from sklearn.ensemble import IsolationForest


class IsolationForestWithThresholding:
    """
    Isolation Forest with flexible threshold selection strategies.
    """

    def __init__(self, n_estimators=100, max_samples=256, random_state=None):
        self.clf = IsolationForest(
            n_estimators=n_estimators,
            max_samples=max_samples,
            contamination='auto',  # We'll handle the threshold ourselves
            random_state=random_state,
        )
        self.scores_ = None

    def fit(self, X):
        self.clf.fit(X)
        # Store training scores for threshold calculation
        self.scores_ = -self.clf.score_samples(X)
        return self

    def score_samples(self, X):
        return -self.clf.score_samples(X)

    def predict_contamination(self, X, contamination=0.01):
        """
        Predict using a contamination-based threshold.
        The top 'contamination' fraction are labeled as anomalies.
        """
        scores = self.score_samples(X)
        threshold = np.percentile(self.scores_, 100 * (1 - contamination))
        return (scores > threshold).astype(int)

    def predict_percentile(self, X, percentile=99):
        """
        Predict using a statistical percentile threshold.
        Points above the given percentile of training scores are anomalies.
        """
        scores = self.score_samples(X)
        threshold = np.percentile(self.scores_, percentile)
        return (scores > threshold).astype(int)

    def predict_score(self, X, threshold=0.5):
        """
        Predict using a fixed score threshold.
        Points with score > threshold are anomalies.
        """
        scores = self.score_samples(X)
        return (scores > threshold).astype(int)

    def predict_topk(self, X, k=10):
        """Return indices of the top-k most anomalous points."""
        scores = self.score_samples(X)
        return np.argsort(scores)[-k:][::-1]


# Example usage
np.random.seed(42)
X_train = np.random.randn(1000, 5)
X_test = np.vstack([
    np.random.randn(100, 5),    # Normal
    3 + np.random.randn(5, 5),  # Anomalies
])
true_labels = np.array([0] * 100 + [1] * 5)

# Fit model
model = IsolationForestWithThresholding(random_state=42)
model.fit(X_train)

# Compare strategies
print("Strategy Comparison on Test Set:")
print("-" * 50)

for name, predictions in [
    ("Contamination 5%", model.predict_contamination(X_test, 0.05)),
    ("Percentile 95", model.predict_percentile(X_test, 95)),
    ("Score > 0.55", model.predict_score(X_test, 0.55)),
]:
    tp = ((predictions == 1) & (true_labels == 1)).sum()
    fp = ((predictions == 1) & (true_labels == 0)).sum()
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / true_labels.sum()
    print(f"{name}: Precision={precision:.2f}, Recall={recall:.2f}")
```

Optimal threshold selection typically requires some labeled examples. Without labels, you're guessing. In practice, start with a conservative threshold (high score, e.g., 0.65), manually review flagged points, and iteratively refine based on the false positive / false negative tradeoff.
Isolation Forest scores are not probabilities—a score of 0.7 doesn't mean '70% chance of being an anomaly.' However, in many applications, calibrated probability estimates are essential for decision-making.
Why Calibration Matters:
Downstream decisions such as alert prioritization, expected-cost tradeoffs, and combining multiple detectors often assume probability-like inputs, and an uncalibrated score of 0.7 provides no such guarantee.
Approaches to Calibration:
The two standard options are isotonic regression (a non-parametric, monotone mapping from scores to probabilities) and Platt scaling (a logistic regression fit on the scores), both trained on a small labeled calibration set.
```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression


class CalibratedIsolationForest:
    """
    Isolation Forest with probability calibration.

    Converts raw anomaly scores to calibrated probabilities
    using labeled calibration data.
    """

    def __init__(self, calibration_method='isotonic', **iforest_params):
        self.iforest = IsolationForest(**iforest_params)
        self.calibration_method = calibration_method
        self.calibrator = None

    def fit(self, X_train, X_cal=None, y_cal=None):
        """
        Fit the Isolation Forest and optionally calibrate.

        Args:
            X_train: Training data for the Isolation Forest
            X_cal: Calibration data with labels
            y_cal: Calibration labels (1=anomaly, 0=normal)
        """
        # Fit base Isolation Forest
        self.iforest.fit(X_train)

        # If calibration data is provided, fit the calibrator
        if X_cal is not None and y_cal is not None:
            scores_cal = self.score_samples(X_cal)
            if self.calibration_method == 'isotonic':
                self.calibrator = IsotonicRegression(
                    y_min=0, y_max=1, out_of_bounds='clip'
                )
                self.calibrator.fit(scores_cal, y_cal)
            elif self.calibration_method == 'platt':
                # Platt scaling using logistic regression
                self.calibrator = LogisticRegression()
                self.calibrator.fit(scores_cal.reshape(-1, 1), y_cal)
        return self

    def score_samples(self, X):
        """Get raw anomaly scores (not calibrated)."""
        return -self.iforest.score_samples(X)

    def predict_proba(self, X):
        """
        Get the calibrated probability of being an anomaly.

        Returns probability estimates if the calibrator is fitted,
        otherwise returns raw scores with a warning.
        """
        scores = self.score_samples(X)
        if self.calibrator is None:
            print("Warning: No calibration data provided. Returning raw scores.")
            return scores
        if self.calibration_method == 'isotonic':
            return self.calibrator.predict(scores)
        elif self.calibration_method == 'platt':
            return self.calibrator.predict_proba(scores.reshape(-1, 1))[:, 1]

    def predict(self, X, threshold=0.5):
        """Predict using calibrated probabilities."""
        probs = self.predict_proba(X)
        return (probs > threshold).astype(int)


# Example: Calibrated anomaly detection
np.random.seed(42)

# Training data (unlabeled)
X_train = np.random.randn(1000, 3)

# Calibration data (small labeled sample)
X_cal_normal = np.random.randn(50, 3)
X_cal_anomaly = 4 + np.random.randn(10, 3)
X_cal = np.vstack([X_cal_normal, X_cal_anomaly])
y_cal = np.array([0] * 50 + [1] * 10)

# Test data
X_test = np.vstack([np.random.randn(20, 3), 4 + np.random.randn(5, 3)])
y_test = np.array([0] * 20 + [1] * 5)

# Fit the calibrated model
model = CalibratedIsolationForest(
    calibration_method='isotonic',
    n_estimators=100,
    random_state=42,
)
model.fit(X_train, X_cal, y_cal)

# Get calibrated probabilities
probs = model.predict_proba(X_test)
print("Calibrated Probabilities:")
print(f"Normal points: min={probs[:20].min():.3f}, max={probs[:20].max():.3f}")
print(f"Anomalies:     min={probs[20:].min():.3f}, max={probs[20:].max():.3f}")
```

Effective calibration requires labeled examples spanning the score range, including both true anomalies and normal points with varying scores. A small calibration set (50-200 points) can significantly improve probability estimates, but the labels must be reliable.
Path lengths from individual trees are noisy due to random split selection. Averaging across the ensemble stabilizes scores, but how many trees are needed? How variable are the scores?
Variance Analysis:
For a single tree, the path length $h_i(x)$ is a random variable whose distribution depends on the randomly chosen split attributes and split values, the particular subsample the tree was grown on, and how the data is distributed around $x$.
For the average across $t$ trees: $$\bar{h}(x) = \frac{1}{t} \sum_{i=1}^{t} h_i(x)$$
By the Central Limit Theorem (approximately, as trees aren't fully independent): $$\text{Var}[\bar{h}(x)] \approx \frac{\text{Var}[h_1(x)]}{t}$$
Variance decreases as $1/t$, so doubling the number of trees halves the variance.
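Equivalently, the standard deviation of the averaged score falls as $1/\sqrt{t}$. A one-liner sketch of the relative standard deviation against a 10-tree baseline:

```python
import math

# Standard deviation of the ensemble average falls as 1/sqrt(t).
# Relative std dev compared to a 10-tree baseline:
baseline = 10
for t in (10, 50, 100, 200, 500):
    print(f"t={t:>3}: relative std ~ {math.sqrt(baseline / t):.2f}")
```

These ratios match the empirical pattern in the table: going from 10 to 100 trees cuts score noise to roughly a third.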
| n_estimators | Relative Std Dev | Computation Time | Recommendation |
|---|---|---|---|
| 10 | 1.0 (baseline) | Very fast | Too few—unstable rankings |
| 50 | ~0.45 | Fast | Minimum for stable detection |
| 100 | ~0.32 | Moderate | Good default balance |
| 200 | ~0.22 | Slow | Better stability if needed |
| 500 | ~0.14 | Very slow | Diminishing returns |
```python
import numpy as np
from sklearn.ensemble import IsolationForest


def analyze_score_stability(X, n_estimators_values, n_trials=10):
    """
    Analyze how score stability improves with more trees.

    For each n_estimators value, fits multiple models with different
    random seeds and measures the variability of scores for each point.
    """
    results = []
    for n_trees in n_estimators_values:
        scores_trials = []
        for trial in range(n_trials):
            clf = IsolationForest(
                n_estimators=n_trees,
                max_samples=256,
                random_state=trial * 100,  # Different seed each trial
            )
            clf.fit(X)
            scores = -clf.score_samples(X)
            scores_trials.append(scores)

        # Stack: (n_trials, n_points)
        scores_matrix = np.vstack(scores_trials)

        # Compute the std across trials for each point, then average
        per_point_std = scores_matrix.std(axis=0)
        results.append({
            'n_trees': n_trees,
            'avg_score_std': per_point_std.mean(),
            'max_score_std': per_point_std.max(),
        })
    return results


# Analyze stability
np.random.seed(42)
X = np.random.randn(500, 5)

tree_counts = [10, 25, 50, 100, 200]
stability = analyze_score_stability(X, tree_counts, n_trials=10)

print("Ensemble Size vs Score Stability:")
print("-" * 50)
print("n_trees | Avg Std Dev | Max Std Dev | Relative")
print("-" * 50)

baseline = stability[0]['avg_score_std']
for r in stability:
    relative = r['avg_score_std'] / baseline
    print(f"{r['n_trees']:>7} | {r['avg_score_std']:>11.4f} | "
          f"{r['max_score_std']:>11.4f} | {relative:.2f}x")
```

Use n_estimators=100 as a default. Only increase it if: (1) rankings of borderline points keep changing between runs, (2) you need reproducible scores without fixing random_state, or (3) computational cost is not a concern. For initial exploration, even 50 trees give reasonable results.
Practitioners frequently encounter issues with Isolation Forest scoring. Understanding these pitfalls prevents misinterpretation and improves results.
The most common pitfall is scikit-learn's sign convention: `score_samples` returns negated scores (more negative means more anomalous), so the intuitive score is recovered with `scores = -clf.score_samples(X)`.
```python
import numpy as np
from sklearn.ensemble import IsolationForest

# PITFALL DEMONSTRATION 1: sklearn's sign convention
np.random.seed(42)
X = np.vstack([np.random.randn(100, 2), [[5, 5]]])  # 100 normal + 1 anomaly

clf = IsolationForest(n_estimators=100, random_state=42)
clf.fit(X)

# sklearn returns NEGATIVE scores (more negative = more anomalous)
sklearn_scores = clf.score_samples(X[-1:])
print(f"sklearn score for anomaly: {sklearn_scores[0]:.3f}")  # Negative!

# Correct interpretation: negate for intuitive scores
intuitive_scores = -sklearn_scores
print(f"Intuitive score (negated): {intuitive_scores[0]:.3f}")  # Positive!

# PITFALL DEMONSTRATION 2: the contamination misconception
clf_10pct = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
clf_1pct = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)

clf_10pct.fit(X)
clf_1pct.fit(X)

# Scores are THE SAME regardless of contamination
scores_10pct = clf_10pct.score_samples(X)
scores_1pct = clf_1pct.score_samples(X)
print(f"Scores identical? {np.allclose(scores_10pct, scores_1pct)}")  # True!

# Predictions differ because the threshold differs
pred_10pct = clf_10pct.predict(X)
pred_1pct = clf_1pct.predict(X)
print(f"Anomalies (10% contamination): {(pred_10pct == -1).sum()}")
print(f"Anomalies (1% contamination):  {(pred_1pct == -1).sum()}")
```

Scores can vary between runs due to random sampling and splitting. For reproducible results: (1) always set random_state, (2) use enough trees (100+), and (3) if critical decisions depend on specific scores, average across multiple random seeds.
We've thoroughly explored how Isolation Forest converts path lengths into meaningful, actionable anomaly scores. The scoring mechanism is elegant: normalize by BST average path length, apply exponential transformation, average across the ensemble.
What's Next:
The standard Isolation Forest uses axis-parallel splits, which can miss anomalies that lie along diagonal or curved patterns. The next page introduces Extended Isolation Forest, which addresses this limitation by using random hyperplanes rather than axis-parallel cuts, enabling detection of a broader class of anomalies.
You now have a comprehensive understanding of Isolation Forest scoring—from the mathematical foundations of the score function to practical threshold selection and calibration strategies. These skills are essential for deploying Isolation Forest effectively in production systems.