We've established that anomalies have shorter path lengths in isolation trees. But raw path lengths are not directly useful for anomaly detection: they grow with the subsample size, they have no fixed scale, and a given length carries no meaning on its own.
To create a universal, interpretable anomaly score, we need a principled transformation that (1) maps path lengths into a bounded range, (2) makes scores comparable across sample sizes, and (3) provides a natural reference point separating "anomalous" from "normal".
This page explores the elegant mathematical framework that accomplishes all three goals.
By the end of this page, you will: (1) Master the mathematical derivation of the Isolation Forest scoring function, (2) Understand the BST-based normalization and its theoretical justification, (3) Know how to interpret and calibrate anomaly scores for your specific application, (4) Learn strategies for threshold selection and operating point optimization.
The Isolation Forest scoring function transforms average path lengths into anomaly scores. Let's build it step by step.
Step 1: Average Path Length
For a point $x$ and an ensemble of $t$ isolation trees, compute the average path length:
$$E[h(x)] = \frac{1}{t} \sum_{i=1}^{t} h_i(x)$$
where $h_i(x)$ is the path length for point $x$ in tree $i$.
Step 2: Normalization Constant
To make path lengths comparable across different sample sizes, we normalize by $c(n)$, the average path length of an unsuccessful search in a Binary Search Tree with $n$ nodes:
$$c(n) = \begin{cases} 0 & \text{if } n \le 1 \\ 1 & \text{if } n = 2 \\ 2H(n-1) - \frac{2(n-1)}{n} & \text{if } n > 2 \end{cases}$$
where $H(i)$ is the $i$-th harmonic number, well approximated by $H(i) \approx \ln(i) + \gamma$ (γ ≈ 0.5772 is the Euler-Mascheroni constant).
Step 3: The Score
The anomaly score is:
$$s(x, \psi) = 2^{-\frac{E[h(x)]}{c(\psi)}}$$
where $\psi$ is the subsample size used for tree construction.
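As a quick sanity check, we can evaluate the score for a few path lengths at the default ψ = 256, using the $c(n)$ formula above (a minimal sketch; the helper names are illustrative):

```python
import math

def c(n: int) -> float:
    """Average unsuccessful-search path length in a BST with n nodes."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    gamma = 0.5772156649  # Euler-Mascheroni constant
    harmonic = math.log(n - 1) + gamma  # H(n-1) ~ ln(n-1) + gamma
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def score(avg_path_length: float, psi: int) -> float:
    """Anomaly score s(x, psi) = 2^(-E[h(x)] / c(psi))."""
    return 2.0 ** (-avg_path_length / c(psi))

psi = 256
print(f"c({psi}) = {c(psi):.2f}")              # ~10.24
print(f"s(h=4)  = {score(4, psi):.3f}")        # short path -> score above 0.5
print(f"s(h=c)  = {score(c(psi), psi):.3f}")   # exactly 0.5 by construction
print(f"s(h=20) = {score(20, psi):.3f}")       # long path -> score below 0.5
```

Note how a path length equal to $c(\psi)$ lands exactly on the 0.5 reference point, since $2^{-1} = 0.5$.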
| Property | Value/Range | Interpretation |
|---|---|---|
| Domain (input) | $E[h(x)] \in [0, \infty)$ | Average path length can be any non-negative value |
| Range (output) | $s(x, \psi) \in (0, 1)$ | Scores are bounded between 0 and 1 |
| Score = 1 | When $E[h(x)] = 0$ | Point isolated immediately (theoretical limit) |
| Score = 0.5 | When $E[h(x)] = c(\psi)$ | Average isolation time = BST average |
| Score → 0 | When $E[h(x)] \to \infty$ | Very hard to isolate → definitely normal |
| Monotonicity | Strictly decreasing | Shorter paths → higher scores → more anomalous |
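The boundary behavior summarized in the table can be verified numerically. A small sketch (reimplementing the score formula for the check):

```python
import numpy as np

gamma = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Normalization constant c(n)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2 * (np.log(n - 1) + gamma) - 2 * (n - 1) / n

def s(h, psi=256):
    """Anomaly score for average path length h."""
    return 2.0 ** (-h / c(psi))

assert s(0) == 1.0                    # immediate isolation -> score 1
assert abs(s(c(256)) - 0.5) < 1e-12   # path length = c(psi) -> score 0.5
assert s(1e6) < 1e-6                  # very long path -> score approaches 0
h = np.linspace(0, 30, 100)
assert np.all(np.diff(s(h)) < 0)      # strictly decreasing in path length
print("All score properties verified.")
```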
```python
import numpy as np


def harmonic_number(i: int) -> float:
    """
    Compute the i-th harmonic number using the logarithmic approximation.

    H(i) ≈ ln(i) + γ, where γ ≈ 0.5772 is the Euler-Mascheroni constant.
    """
    if i <= 0:
        return 0.0
    euler_mascheroni = 0.5772156649
    return np.log(i) + euler_mascheroni


def average_path_length(n: int) -> float:
    """
    Compute c(n), the average path length of an unsuccessful search in a BST.

    For n items, this represents the expected depth of a search that
    terminates at a leaf (i.e., the key is not found).

    c(n) = 2H(n-1) - 2(n-1)/n

    This formula comes from the analysis of random BSTs:
    - H(n-1) is the (n-1)th harmonic number
    - The factor of 2 accounts for both internal path length contributions
    - The correction term adjusts for the specific case of unsuccessful search
    """
    if n <= 1:
        return 0.0  # No comparisons needed
    elif n == 2:
        return 1.0  # Exactly one comparison
    else:
        h_n_minus_1 = harmonic_number(n - 1)
        return 2.0 * h_n_minus_1 - (2.0 * (n - 1) / n)


def anomaly_score(path_lengths: np.ndarray, subsample_size: int) -> np.ndarray:
    """
    Convert average path lengths to anomaly scores.

    The formula s(x, ψ) = 2^(-E[h(x)] / c(ψ)) is motivated by:
    1. Exponential decay ensures short paths map to high scores
    2. Base 2 connects to the binary tree structure
    3. Division by c(ψ) normalizes for sample size

    Args:
        path_lengths: Average path lengths for each point
        subsample_size: The ψ parameter (typically 256)

    Returns:
        Anomaly scores in (0, 1), higher = more anomalous
    """
    c_psi = average_path_length(subsample_size)
    if c_psi == 0:
        return np.zeros_like(path_lengths)

    # The core scoring formula
    normalized_lengths = path_lengths / c_psi
    return np.power(2.0, -normalized_lengths)


def analyze_score_function(subsample_size: int = 256) -> None:
    """Analyze the scoring function's behavior for a given ψ."""
    c_psi = average_path_length(subsample_size)
    print(f"For subsample size ψ = {subsample_size}:")
    print(f"  c(ψ) = {c_psi:.4f}")
    print(f"  This means a path length of {c_psi:.2f} maps to score = 0.5")
    print()

    # Show scores for various path lengths
    path_lengths = np.array(
        [1, 2, 3, 4, c_psi / 2, c_psi, c_psi * 1.5, c_psi * 2, c_psi * 3]
    )
    scores = anomaly_score(path_lengths, subsample_size)

    print("Path Length | Normalized | Score  | Interpretation")
    print("-" * 65)
    for pl, s in zip(path_lengths, scores):
        norm = pl / c_psi
        if s > 0.6:
            interp = "ANOMALY (high confidence)"
        elif s > 0.5:
            interp = "Likely anomaly"
        elif s < 0.4:
            interp = "Normal"
        else:
            interp = "Ambiguous"
        print(f"{pl:>11.2f} | {norm:>10.3f} | {s:.4f} | {interp}")


analyze_score_function(256)
```

The normalization constant $c(n)$ is derived from the analysis of Binary Search Trees. Understanding this connection reveals why it's the perfect normalization for isolation path lengths.
Why BST?
The isolation tree construction process is analogous to building a BST: both recursively split the data with binary decisions, and both produce tree structures in which the depth/path length reflects how "distinguishable" an item is.
The Average Unsuccessful Search Path:
In BST analysis, we distinguish between successful searches, which terminate at an internal node holding the key, and unsuccessful searches, which terminate at an external (null) link.
For isolation, we're interested in the unsuccessful search—we want to know how deep we go before the point is 'alone' (isolated), analogous to reaching a null link in BST.
Mathematical Derivation of $c(n)$:
For a random BST constructed by inserting $n$ distinct keys in random order:
Internal path length $I(n)$: the total of the path lengths to all internal nodes, with expectation $$E[I(n)] \approx 2n\,H_n - 2n$$
External path length $E(n)$: the total of the path lengths to all external (null) nodes, related by $$E(n) = I(n) + 2n$$ (each internal node contributes 2 to the external path length)
Number of external nodes: $n + 1$ (one more than the number of internal nodes in any binary tree)
Average external path length (unsuccessful search): $$c(n) = \frac{E(n)}{n+1} = \frac{E[I(n)] + 2n}{n+1}$$ which, after substituting the expectation and simplifying with the harmonic-number approximation (dropping lower-order terms), yields the formula used in practice: $$c(n) = 2H(n-1) - \frac{2(n-1)}{n}$$
This formula gives us the expected depth at which a random search terminates, which perfectly normalizes the isolation path length.
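Since the closed form uses the approximation $H(i) \approx \ln(i) + \gamma$, it is worth checking how far this drifts from the exact harmonic sum at typical subsample sizes. A quick sketch (the helper names are illustrative, not from the original):

```python
import math

def c_exact(n):
    """c(n) using the exact harmonic number H(n-1) = sum_{k=1}^{n-1} 1/k."""
    if n <= 2:
        return float(n - 1) if n >= 1 else 0.0
    h = sum(1.0 / k for k in range(1, n))
    return 2 * h - 2 * (n - 1) / n

def c_approx(n):
    """c(n) using H(i) ~ ln(i) + gamma (the form used in practice)."""
    if n <= 2:
        return float(n - 1) if n >= 1 else 0.0
    gamma = 0.5772156649
    return 2 * (math.log(n - 1) + gamma) - 2 * (n - 1) / n

# Compare the exact and approximate normalization constants
for n in (32, 256, 4096):
    print(f"n={n:>5}: exact={c_exact(n):.4f}  approx={c_approx(n):.4f}")
```

The two agree to within a few thousandths for any realistic ψ, so the approximation is harmless in practice.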
| n (samples) | c(n) | log₂(n) | c(n)/log₂(n) | Notes |
|---|---|---|---|---|
| 32 | 6.09 | 5.00 | 1.22 | Small batch |
| 64 | 7.47 | 6.00 | 1.25 | |
| 128 | 8.86 | 7.00 | 1.27 | |
| 256 | 10.24 | 8.00 | 1.28 | Default ψ |
| 512 | 11.63 | 9.00 | 1.29 | |
| 1024 | 13.02 | 10.00 | 1.30 | |
| 4096 | 15.79 | 12.00 | 1.32 | Large batch |
Note that c(n) grows logarithmically with n, roughly as 1.2-1.3 × log₂(n) over this range. Each doubling of the sample size adds only about 2 ln 2 ≈ 1.39 to the normalization constant, which keeps score comparisons relatively stable across different sample sizes.
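The constant-increment behavior under doubling follows directly from the logarithm: $c(2n) - c(n) \to 2\ln 2 \approx 1.386$. A short sketch confirming this numerically:

```python
import math

gamma = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """c(n) = 2H(n-1) - 2(n-1)/n with H(i) ~ ln(i) + gamma."""
    return 2 * (math.log(n - 1) + gamma) - 2 * (n - 1) / n

# Each doubling of n adds roughly 2*ln(2) ~ 1.386 to c(n)
for n in (64, 128, 256, 512, 1024):
    print(f"c({2 * n}) - c({n}) = {c(2 * n) - c(n):.3f}")
```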
Understanding how anomaly scores are distributed helps in threshold selection and result interpretation.
For a Pure Normal Dataset:
When all points are drawn from a single, well-behaved distribution (e.g., multivariate Gaussian), the score distribution typically concentrates near or just below 0.5, with a modest spread and no pronounced right tail.
For a Dataset with Anomalies:
The bulk of points still scores near 0.5 or below, but a right tail appears: anomalies receive distinctly higher scores, ideally separated from the main mass by a visible gap.
```python
import numpy as np
from sklearn.ensemble import IsolationForest


def analyze_score_distribution(X_normal, X_anomaly=None, contamination=0.01):
    """
    Analyze the distribution of anomaly scores.

    Args:
        X_normal: Normal data points
        X_anomaly: Anomaly points (optional)
        contamination: Expected anomaly fraction

    Returns:
        Dictionary with score statistics
    """
    # Combine data
    if X_anomaly is not None:
        X = np.vstack([X_normal, X_anomaly])
        labels = np.array([0] * len(X_normal) + [1] * len(X_anomaly))
    else:
        X = X_normal
        labels = np.zeros(len(X_normal))

    # Fit Isolation Forest
    clf = IsolationForest(
        n_estimators=100,
        max_samples=256,
        contamination=contamination,
        random_state=42,
    )
    clf.fit(X)

    # Get scores (negate sklearn's convention: higher = more anomalous)
    scores = -clf.score_samples(X)

    # Compute statistics
    results = {
        'overall': {
            'mean': scores.mean(),
            'std': scores.std(),
            'min': scores.min(),
            'max': scores.max(),
            'median': np.median(scores),
            'q25': np.percentile(scores, 25),
            'q75': np.percentile(scores, 75),
        }
    }

    if X_anomaly is not None:
        normal_scores = scores[labels == 0]
        anomaly_scores = scores[labels == 1]
        results['normal'] = {
            'mean': normal_scores.mean(),
            'std': normal_scores.std(),
            'max': normal_scores.max(),
        }
        results['anomaly'] = {
            'mean': anomaly_scores.mean(),
            'std': anomaly_scores.std(),
            'min': anomaly_scores.min(),
        }
        results['separation'] = anomaly_scores.min() - normal_scores.max()

    return results


# Example: Analyze score distributions
np.random.seed(42)

# Normal data: 2D Gaussian
X_normal = np.random.randn(1000, 2)

# Case 1: No anomalies
print("=== Case 1: Pure Normal Data ===")
results = analyze_score_distribution(X_normal)
print(f"Mean score: {results['overall']['mean']:.4f}")
print(f"Std score:  {results['overall']['std']:.4f}")
print(f"Max score:  {results['overall']['max']:.4f}")
print()

# Case 2: With clear anomalies
X_anomaly = 5 + 0.5 * np.random.randn(10, 2)  # Offset cluster
print("=== Case 2: With Anomalies ===")
results = analyze_score_distribution(X_normal, X_anomaly)
print(f"Normal mean:  {results['normal']['mean']:.4f}")
print(f"Normal max:   {results['normal']['max']:.4f}")
print(f"Anomaly mean: {results['anomaly']['mean']:.4f}")
print(f"Anomaly min:  {results['anomaly']['min']:.4f}")
print(f"Separation:   {results['separation']:.4f}")
```

Always visualize your score distribution before selecting a threshold. A histogram with labeled normal/anomaly points (if known) reveals: (1) whether there's a clear separation, (2) what threshold would give the desired precision/recall tradeoff, and (3) whether anomalies are truly isolated or borderline.
The anomaly score is continuous, but most applications require a binary decision: is this point an anomaly or not? Threshold selection determines this boundary.
Strategy 1: Contamination-Based Threshold
If you know (or estimate) the expected proportion of anomalies in your data:
A threshold is set so that the top contamination × n points are classified as anomalies. This is what scikit-learn's `contamination` parameter does: it does not affect the scores at all, only the decision threshold.
Strategy 2: Statistical Threshold
Treat scores above a chosen percentile of the training-score distribution as anomalies, e.g., the 95th or 99th percentile depending on how conservative you want to be.
Strategy 3: Score-Based Rules
Use fixed score thresholds anchored to the 0.5 reference point: for example, flag s > 0.6 as anomalous, treat s < 0.4 as normal, and regard the band around 0.5 as ambiguous.
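A fixed score threshold can always be translated back into a path-length requirement by inverting the scoring formula: $h = -c(\psi)\log_2(s)$. A small sketch (the helper name is illustrative):

```python
import math

gamma = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Normalization constant c(n) = 2H(n-1) - 2(n-1)/n."""
    return 2 * (math.log(n - 1) + gamma) - 2 * (n - 1) / n

def path_length_for_score(s, psi=256):
    """Invert s = 2^(-h/c(psi)) to recover the implied average path length."""
    return -c(psi) * math.log2(s)

# With psi = 256, each score threshold corresponds to a path-length cutoff
for s in (0.5, 0.55, 0.6, 0.65, 0.7):
    print(f"score > {s:.2f}  <=>  avg path length < {path_length_for_score(s):.2f}")
```

This makes fixed thresholds interpretable in the tree domain: a higher score cutoff simply demands that points isolate in correspondingly fewer splits.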
| Strategy | Pros | Cons | Best For |
|---|---|---|---|
| Contamination-based | Guarantees fixed anomaly rate; intuitive | Requires knowledge of true anomaly rate; may miss anomalies or flag normals | When you know the expected anomaly frequency |
| Statistical percentile | No labeled data needed; controls false positive rate | Assumes percentile = anomaly rate; sensitive to data distribution | Exploratory analysis; unknown anomaly rate |
| Fixed score threshold | Simple; interpretable; consistent across datasets | May not generalize across very different datasets | Well-understood domains with consistent anomaly types |
| Precision@K | Find exactly K most anomalous points | Ignores score magnitudes; may include normal points | Resource-limited review (e.g., manual investigation) |
```python
import numpy as np
from sklearn.ensemble import IsolationForest


class IsolationForestWithThresholding:
    """
    Isolation Forest with flexible threshold selection strategies.
    """

    def __init__(self, n_estimators=100, max_samples=256, random_state=None):
        self.clf = IsolationForest(
            n_estimators=n_estimators,
            max_samples=max_samples,
            contamination='auto',  # We'll handle the threshold ourselves
            random_state=random_state,
        )
        self.scores_ = None

    def fit(self, X):
        self.clf.fit(X)
        # Store training scores for threshold calculation
        self.scores_ = -self.clf.score_samples(X)
        return self

    def score_samples(self, X):
        return -self.clf.score_samples(X)

    def predict_contamination(self, X, contamination=0.01):
        """
        Predict using a contamination-based threshold.
        The top 'contamination' fraction are labeled as anomalies.
        """
        scores = self.score_samples(X)
        threshold = np.percentile(self.scores_, 100 * (1 - contamination))
        return (scores > threshold).astype(int)

    def predict_percentile(self, X, percentile=99):
        """
        Predict using a statistical percentile threshold.
        Points above the given percentile of training scores are anomalies.
        """
        scores = self.score_samples(X)
        threshold = np.percentile(self.scores_, percentile)
        return (scores > threshold).astype(int)

    def predict_score(self, X, threshold=0.5):
        """
        Predict using a fixed score threshold.
        Points with score > threshold are anomalies.
        """
        scores = self.score_samples(X)
        return (scores > threshold).astype(int)

    def predict_topk(self, X, k=10):
        """Return indices of the top-k most anomalous points."""
        scores = self.score_samples(X)
        return np.argsort(scores)[-k:][::-1]


# Example usage
np.random.seed(42)
X_train = np.random.randn(1000, 5)
X_test = np.vstack([
    np.random.randn(100, 5),    # Normal
    3 + np.random.randn(5, 5),  # Anomalies
])
true_labels = np.array([0] * 100 + [1] * 5)

# Fit model
model = IsolationForestWithThresholding(random_state=42)
model.fit(X_train)

# Compare strategies
print("Strategy Comparison on Test Set:")
print("-" * 50)

for name, predictions in [
    ("Contamination 5%", model.predict_contamination(X_test, 0.05)),
    ("Percentile 95", model.predict_percentile(X_test, 95)),
    ("Score > 0.55", model.predict_score(X_test, 0.55)),
]:
    tp = ((predictions == 1) & (true_labels == 1)).sum()
    fp = ((predictions == 1) & (true_labels == 0)).sum()
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / true_labels.sum()
    print(f"{name}: Precision={precision:.2f}, Recall={recall:.2f}")
```

Optimal threshold selection typically requires some labeled examples. Without labels, you're guessing. In practice, start with a conservative threshold (high score, e.g., 0.65), manually review flagged points, and iteratively refine based on the false positive / false negative tradeoff.
Isolation Forest scores are not probabilities—a score of 0.7 doesn't mean '70% chance of being an anomaly.' However, in many applications, calibrated probability estimates are essential for decision-making.
Why Calibration Matters:
Downstream decisions such as alert prioritization, expected-cost tradeoffs, and combining multiple detectors often assume probability-like inputs, and an uncalibrated score of 0.7 provides no such guarantee.
Approaches to Calibration:
The two standard options are isotonic regression (a non-parametric, monotone mapping from scores to probabilities) and Platt scaling (a logistic regression fit on the scores), both trained on a small labeled calibration set.
```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression


class CalibratedIsolationForest:
    """
    Isolation Forest with probability calibration.

    Converts raw anomaly scores to calibrated probabilities
    using labeled calibration data.
    """

    def __init__(self, calibration_method='isotonic', **iforest_params):
        self.iforest = IsolationForest(**iforest_params)
        self.calibration_method = calibration_method
        self.calibrator = None

    def fit(self, X_train, X_cal=None, y_cal=None):
        """
        Fit the Isolation Forest and optionally calibrate.

        Args:
            X_train: Training data for the Isolation Forest
            X_cal: Calibration data with labels
            y_cal: Calibration labels (1=anomaly, 0=normal)
        """
        # Fit base Isolation Forest
        self.iforest.fit(X_train)

        # If calibration data is provided, fit the calibrator
        if X_cal is not None and y_cal is not None:
            scores_cal = self.score_samples(X_cal)
            if self.calibration_method == 'isotonic':
                self.calibrator = IsotonicRegression(
                    y_min=0, y_max=1, out_of_bounds='clip'
                )
                self.calibrator.fit(scores_cal, y_cal)
            elif self.calibration_method == 'platt':
                # Platt scaling using logistic regression
                self.calibrator = LogisticRegression()
                self.calibrator.fit(scores_cal.reshape(-1, 1), y_cal)
        return self

    def score_samples(self, X):
        """Get raw anomaly scores (not calibrated)."""
        return -self.iforest.score_samples(X)

    def predict_proba(self, X):
        """
        Get the calibrated probability of being an anomaly.

        Returns probability estimates if the calibrator is fitted,
        otherwise returns raw scores with a warning.
        """
        scores = self.score_samples(X)
        if self.calibrator is None:
            print("Warning: No calibration data provided. Returning raw scores.")
            return scores
        if self.calibration_method == 'isotonic':
            return self.calibrator.predict(scores)
        elif self.calibration_method == 'platt':
            return self.calibrator.predict_proba(scores.reshape(-1, 1))[:, 1]

    def predict(self, X, threshold=0.5):
        """Predict using calibrated probabilities."""
        probs = self.predict_proba(X)
        return (probs > threshold).astype(int)


# Example: Calibrated anomaly detection
np.random.seed(42)

# Training data (unlabeled)
X_train = np.random.randn(1000, 3)

# Calibration data (small labeled sample)
X_cal_normal = np.random.randn(50, 3)
X_cal_anomaly = 4 + np.random.randn(10, 3)
X_cal = np.vstack([X_cal_normal, X_cal_anomaly])
y_cal = np.array([0] * 50 + [1] * 10)

# Test data
X_test = np.vstack([np.random.randn(20, 3), 4 + np.random.randn(5, 3)])
y_test = np.array([0] * 20 + [1] * 5)

# Fit the calibrated model
model = CalibratedIsolationForest(
    calibration_method='isotonic',
    n_estimators=100,
    random_state=42,
)
model.fit(X_train, X_cal, y_cal)

# Get calibrated probabilities
probs = model.predict_proba(X_test)
print("Calibrated Probabilities:")
print(f"Normal points: min={probs[:20].min():.3f}, max={probs[:20].max():.3f}")
print(f"Anomalies:     min={probs[20:].min():.3f}, max={probs[20:].max():.3f}")
```

Effective calibration requires labeled examples spanning the score range, including both true anomalies and normal points with varying scores. A small calibration set (50-200 points) can significantly improve probability estimates, but the labels must be reliable.
Path lengths from individual trees are noisy due to random split selection. Averaging across the ensemble stabilizes scores, but how many trees are needed? How variable are the scores?
Variance Analysis:
For a single tree, the path length $h_i(x)$ is a random variable whose distribution depends on the randomly chosen split attributes and split values, the particular subsample the tree was grown on, and how the data is distributed around $x$.
For the average across $t$ trees: $$\bar{h}(x) = \frac{1}{t} \sum_{i=1}^{t} h_i(x)$$
By the Central Limit Theorem (approximately, as trees aren't fully independent): $$\text{Var}[\bar{h}(x)] \approx \frac{\text{Var}[h_1(x)]}{t}$$
Variance decreases as $1/t$, so doubling the number of trees halves the variance.
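Equivalently, the standard deviation of the averaged score falls as $1/\sqrt{t}$. A one-liner sketch of the relative standard deviation against a 10-tree baseline:

```python
import math

# Standard deviation of the ensemble average falls as 1/sqrt(t).
# Relative std dev compared to a 10-tree baseline:
baseline = 10
for t in (10, 50, 100, 200, 500):
    print(f"t={t:>3}: relative std ~ {math.sqrt(baseline / t):.2f}")
```

These ratios match the empirical pattern in the table: going from 10 to 100 trees cuts score noise to roughly a third.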
| n_estimators | Relative Std Dev | Computation Time | Recommendation |
|---|---|---|---|
| 10 | 1.0 (baseline) | Very fast | Too few—unstable rankings |
| 50 | ~0.45 | Fast | Minimum for stable detection |
| 100 | ~0.32 | Moderate | Good default balance |
| 200 | ~0.22 | Slow | Better stability if needed |
| 500 | ~0.14 | Very slow | Diminishing returns |
```python
import numpy as np
from sklearn.ensemble import IsolationForest


def analyze_score_stability(X, n_estimators_values, n_trials=10):
    """
    Analyze how score stability improves with more trees.

    For each n_estimators value, fits multiple models with different
    random seeds and measures the variability of scores for each point.
    """
    results = []
    for n_trees in n_estimators_values:
        scores_trials = []
        for trial in range(n_trials):
            clf = IsolationForest(
                n_estimators=n_trees,
                max_samples=256,
                random_state=trial * 100,  # Different seed each trial
            )
            clf.fit(X)
            scores = -clf.score_samples(X)
            scores_trials.append(scores)

        # Stack: (n_trials, n_points)
        scores_matrix = np.vstack(scores_trials)

        # Compute the std across trials for each point, then average
        per_point_std = scores_matrix.std(axis=0)
        results.append({
            'n_trees': n_trees,
            'avg_score_std': per_point_std.mean(),
            'max_score_std': per_point_std.max(),
        })
    return results


# Analyze stability
np.random.seed(42)
X = np.random.randn(500, 5)

tree_counts = [10, 25, 50, 100, 200]
stability = analyze_score_stability(X, tree_counts, n_trials=10)

print("Ensemble Size vs Score Stability:")
print("-" * 50)
print("n_trees | Avg Std Dev | Max Std Dev | Relative")
print("-" * 50)

baseline = stability[0]['avg_score_std']
for r in stability:
    relative = r['avg_score_std'] / baseline
    print(f"{r['n_trees']:>7} | {r['avg_score_std']:>11.4f} | "
          f"{r['max_score_std']:>11.4f} | {relative:.2f}x")
```

Use n_estimators=100 as a default. Only increase it if: (1) rankings of borderline points keep changing between runs, (2) you need reproducible scores without fixing random_state, or (3) computational cost is not a concern. For initial exploration, even 50 trees give reasonable results.
Practitioners frequently encounter issues with Isolation Forest scoring. Understanding these pitfalls prevents misinterpretation and improves results.
The most common pitfall is scikit-learn's sign convention: `score_samples` returns negated scores (more negative means more anomalous), so the intuitive score is recovered with `scores = -clf.score_samples(X)`.
```python
import numpy as np
from sklearn.ensemble import IsolationForest

# PITFALL DEMONSTRATION 1: sklearn's sign convention
np.random.seed(42)
X = np.vstack([np.random.randn(100, 2), [[5, 5]]])  # 100 normal + 1 anomaly

clf = IsolationForest(n_estimators=100, random_state=42)
clf.fit(X)

# sklearn returns NEGATIVE scores (more negative = more anomalous)
sklearn_scores = clf.score_samples(X[-1:])
print(f"sklearn score for anomaly: {sklearn_scores[0]:.3f}")  # Negative!

# Correct interpretation: negate for intuitive scores
intuitive_scores = -sklearn_scores
print(f"Intuitive score (negated): {intuitive_scores[0]:.3f}")  # Positive!

# PITFALL DEMONSTRATION 2: the contamination misconception
clf_10pct = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
clf_1pct = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)

clf_10pct.fit(X)
clf_1pct.fit(X)

# Scores are THE SAME regardless of contamination
scores_10pct = clf_10pct.score_samples(X)
scores_1pct = clf_1pct.score_samples(X)
print(f"Scores identical? {np.allclose(scores_10pct, scores_1pct)}")  # True!

# Predictions differ because the threshold differs
pred_10pct = clf_10pct.predict(X)
pred_1pct = clf_1pct.predict(X)
print(f"Anomalies (10% contamination): {(pred_10pct == -1).sum()}")
print(f"Anomalies (1% contamination):  {(pred_1pct == -1).sum()}")
```

Scores can vary between runs due to random sampling and splitting. For reproducible results: (1) always set random_state, (2) use enough trees (100+), and (3) if critical decisions depend on specific scores, average across multiple random seeds.
We've thoroughly explored how Isolation Forest converts path lengths into meaningful, actionable anomaly scores. The scoring mechanism is elegant: normalize by BST average path length, apply exponential transformation, average across the ensemble.
What's Next:
The standard Isolation Forest uses axis-parallel splits, which can miss anomalies that lie along diagonal or curved patterns. The next page introduces Extended Isolation Forest, which addresses this limitation by using random hyperplanes rather than axis-parallel cuts, enabling detection of a broader class of anomalies.
You now have a comprehensive understanding of Isolation Forest scoring—from the mathematical foundations of the score function to practical threshold selection and calibration strategies. These skills are essential for deploying Isolation Forest effectively in production systems.