Isolation Forest has remarkably few hyperparameters compared to many ML algorithms, and it often works well with default settings. However, understanding how each parameter affects detection performance is crucial for achieving optimal results in production systems.
This page provides a practitioner's guide to Isolation Forest parameter selection. We cover not just what to tune, but when to deviate from defaults, how to diagnose parameter-related issues, and why certain settings work better in specific scenarios.
The goal is to equip you with the intuition and methodology to quickly configure Isolation Forest for any anomaly detection task.
By the end of this page, you will: (1) Master the role and impact of each Isolation Forest hyperparameter, (2) Know the recommended starting values and when to adjust them, (3) Understand systematic approaches for parameter tuning with and without labeled data, (4) Be able to diagnose and fix common parameter-related issues in production.
Isolation Forest has four primary hyperparameters, plus one operational parameter (contamination) that affects thresholding but not the core algorithm.
Primary Hyperparameters:
| Parameter | sklearn Name | Default | Description |
|---|---|---|---|
| Number of trees | n_estimators | 100 | Size of the forest ensemble |
| Subsample size | max_samples | 'auto' (256) | Points sampled for each tree |
| Number of features | max_features | 1.0 (all) | Features sampled for each tree |
| Bootstrap | bootstrap | False | Sample with replacement |
| Contamination | contamination | 'auto' | Expected anomaly proportion (for threshold only) |
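To make the table concrete, here is a minimal instantiation spelling out those defaults explicitly (the variable name `clf` is just illustrative; the values match sklearn's documented defaults):

```python
from sklearn.ensemble import IsolationForest

# Spelling out the defaults from the table above
clf = IsolationForest(
    n_estimators=100,       # number of trees in the forest
    max_samples='auto',     # min(256, n_samples) points per tree
    max_features=1.0,       # each tree may use all features
    bootstrap=False,        # subsample without replacement
    contamination='auto',   # affects the predict() threshold only
    random_state=42,        # fix for reproducible scores
)
```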
Impact Hierarchy:
Not all parameters are equally important. In order of typical impact on detection quality:
max_samples: the largest effect on what the model can detect (governs swamping and masking behavior)
n_estimators: mainly affects score stability, with diminishing returns beyond the default
max_features and bootstrap: rarely matter in practice
contamination: sits outside the ranking entirely, since it only moves the predict() threshold and never changes the scores
This hierarchy guides where to focus tuning effort: start with max_samples if detection quality is poor, then adjust n_estimators for stability.
Isolation Forest is remarkably robust to hyperparameter choices. The defaults work well in most cases. Only tune when you have evidence of a problem—premature optimization wastes effort and can introduce subtle issues.
The number of trees in the forest affects score stability and, to a lesser extent, detection quality.
What It Controls:
Each tree provides a noisy estimate of path length. Averaging across $t$ trees reduces variance:
$$\text{Var}[\bar{h}(x)] \approx \frac{\text{Var}[h(x)]}{t}$$
More trees → more stable (reproducible) scores → more consistent anomaly rankings.
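For a concrete sense of scale, the standard deviation of the averaged path length falls as $1/\sqrt{t}$:

$$\frac{\sigma_{\bar{h}}(t=400)}{\sigma_{\bar{h}}(t=100)} = \sqrt{\frac{100}{400}} = \frac{1}{2}$$

so quadrupling the forest from 100 to 400 trees halves score noise while quadrupling the compute cost; diminishing returns set in quickly.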
Default Value: 100
The default of 100 trees is a well-tested compromise:
| n_estimators | Score Stability | Training Time | Prediction Time | When to Use |
|---|---|---|---|---|
| 25-50 | Moderate | Very Fast | Very Fast | Quick prototyping; embedded systems |
| 100 (default) | Good | Fast | Fast | General purpose; most applications |
| 200-300 | Very Good | Moderate | Moderate | When score consistency is critical |
| 500+ | Excellent | Slow | Slow | Research; when rankings must be stable |
```python
import numpy as np
from sklearn.ensemble import IsolationForest
import time

def analyze_n_estimators_impact(X_train, X_test, n_estimators_range, n_trials=10):
    """
    Analyze how n_estimators affects score stability and computation time.

    For each n_estimators value, fits multiple models and measures:
    - Score standard deviation across runs
    - Training time
    - Prediction time
    """
    results = []
    for n_trees in n_estimators_range:
        # Measure timing
        start = time.time()
        clf = IsolationForest(n_estimators=n_trees, random_state=42)
        clf.fit(X_train)
        train_time = time.time() - start

        start = time.time()
        _ = clf.score_samples(X_test)
        predict_time = time.time() - start

        # Measure stability across multiple random seeds
        scores_list = []
        for trial in range(n_trials):
            clf = IsolationForest(n_estimators=n_trees, random_state=trial*100)
            clf.fit(X_train)
            scores = -clf.score_samples(X_test)
            scores_list.append(scores)

        scores_array = np.vstack(scores_list)
        mean_std = scores_array.std(axis=0).mean()

        results.append({
            'n_trees': n_trees,
            'train_time': train_time,
            'predict_time': predict_time,
            'score_std': mean_std,
        })

    return results

# Example usage
np.random.seed(42)
X_train = np.random.randn(1000, 10)
X_test = np.random.randn(200, 10)

n_est_values = [25, 50, 100, 200, 500]
results = analyze_n_estimators_impact(X_train, X_test, n_est_values)

print("n_trees | Train(s) | Predict(s) | Score Std | Relative Std")
print("-" * 65)
baseline_std = results[2]['score_std']  # 100 trees as baseline
for r in results:
    rel_std = r['score_std'] / baseline_std
    print(f"{r['n_trees']:>7} | {r['train_time']:>8.3f} | {r['predict_time']:>10.4f} | "
          f"{r['score_std']:>9.4f} | {rel_std:>11.2f}x")
```

Increase beyond 100 if: (1) rankings of borderline cases change noticeably when the model is retrained with a different random seed, (2) you're using the scores themselves for downstream decisions (not just ranking), or (3) you need near-reproducible results without fixing the random seed. Decrease to 50 or less only under extreme computational constraints.
The subsample size (ψ) is the most impactful parameter for detection quality. It controls the tradeoff between anomaly visibility and statistical stability.
What It Controls:
Each tree is built from a random subsample of ψ points drawn from the training data. Smaller subsamples keep anomalies sparse within every tree, so they isolate quickly and are less prone to swamping and masking; larger subsamples give each tree a fuller picture of normal structure but let anomalies blend into dense regions.
Default Value: 'auto' (min(256, n_samples))
The default of 256 is backed by empirical research showing it provides a good balance for most datasets.
| max_samples | Swamping | Masking | Stability | Best For |
|---|---|---|---|---|
| 64-128 | Excellent | Excellent | Moderate | Dense data; many local anomalies |
| 256 (default) | Good | Good | Good | General purpose |
| 512-1024 | Moderate | Moderate | Very Good | Multiple normal clusters; sparse anomalies |
| 2048+ | Poor | Poor | Excellent | Very structured data; well-separated anomalies |
| 'auto' or n | Varies | Varies | Full | Let sklearn decide based on dataset size |
Tuning Strategy for max_samples:
Start with default (256): Works well in most cases
If you see swamping: Decrease to 128 or 64
If you see masking: Decrease to 128 or 64
If you see instability: Increase to 512-1024
If you have multiple normal clusters: Increase subsample size
```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

def tune_max_samples(X, y_true, max_samples_range, n_trials=5):
    """
    Tune max_samples using labeled data (if available).

    In practice, you often don't have labels. Use this when you
    do have a validation set with known anomalies.

    Args:
        X: Features
        y_true: True labels (1=anomaly, 0=normal)
        max_samples_range: List of max_samples values to try
        n_trials: Number of random seeds for stability measure

    Returns:
        Results for each max_samples value
    """
    results = []
    for max_samples in max_samples_range:
        aucs = []
        for trial in range(n_trials):
            clf = IsolationForest(
                n_estimators=100,
                max_samples=max_samples,
                random_state=trial * 100
            )
            clf.fit(X)
            scores = -clf.score_samples(X)
            auc = roc_auc_score(y_true, scores)
            aucs.append(auc)

        results.append({
            'max_samples': max_samples,
            'mean_auc': np.mean(aucs),
            'std_auc': np.std(aucs),
            'min_auc': np.min(aucs),
        })

    return results


def diagnose_swamping(scores_anomaly, scores_normal):
    """
    Check if swamping might be occurring.

    Swamping symptom: Anomalies near normal data score lower than
    they should, overlapping with normal score distribution.
    """
    # If many anomaly scores are below normal median, likely swamping
    normal_median = np.median(scores_normal)
    below_median = (scores_anomaly < normal_median).mean()

    if below_median > 0.3:
        print(f"⚠️ Potential swamping: {below_median:.0%} of anomalies score below normal median")
        print("   Try reducing max_samples to 128 or 64")
    else:
        print(f"✓ Swamping unlikely: Only {below_median:.0%} of anomalies below normal median")


# Example: Diagnose swamping
np.random.seed(42)

# Create data with borderline anomalies (hard to detect)
X_normal = np.random.randn(500, 2)
X_anomaly = 2.5 + 0.3 * np.random.randn(20, 2)  # Slightly outside cluster
X = np.vstack([X_normal, X_anomaly])
y = np.array([0]*500 + [1]*20)

# With large max_samples (swamping more likely)
clf_large = IsolationForest(n_estimators=100, max_samples=512, random_state=42)
clf_large.fit(X)
scores_large = -clf_large.score_samples(X)

# With small max_samples (swamping mitigated)
clf_small = IsolationForest(n_estimators=100, max_samples=64, random_state=42)
clf_small.fit(X)
scores_small = -clf_small.score_samples(X)

print("=== Large max_samples (512) ===")
diagnose_swamping(scores_large[y==1], scores_large[y==0])

print("\n=== Small max_samples (64) ===")
diagnose_swamping(scores_small[y==1], scores_small[y==0])
```

Setting max_samples=n (using all data) defeats the purpose of subsampling. Anomaly detection can actually DEGRADE with more data due to swamping effects. If you're getting poor results with large max_samples, try 256 or smaller, not larger.
The contamination parameter is often misunderstood. It does NOT affect anomaly scores—only the threshold used by the predict() method.
What It Controls:
When you call clf.predict(X), the model returns -1 (anomaly) or +1 (normal). The contamination parameter determines how many points are labeled as anomalies:
$$\text{threshold} = \text{percentile}(\text{scores}, 100 \times (1 - \text{contamination}))$$
With contamination=0.1, the top 10% of points by score are labeled as anomalies.
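As a quick sanity check of this relationship, the sketch below compares sklearn's fitted offset_ attribute against the corresponding percentile of training scores. Note that score_samples() returns raw scores where lower means more anomalous, so the percentile flips to 100 × contamination; treat the exact internal behavior as an assumption to verify against your sklearn version.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

np.random.seed(42)
X = np.random.randn(500, 2)

contamination = 0.1
clf = IsolationForest(contamination=contamination, random_state=42).fit(X)

# sklearn labels a point anomalous when score_samples(x) < offset_;
# with an explicit contamination, offset_ sits at that percentile of training scores
raw_scores = clf.score_samples(X)
expected_offset = np.percentile(raw_scores, 100 * contamination)
print(np.isclose(clf.offset_, expected_offset))   # expected: True
print((clf.predict(X) == -1).mean())              # expected: roughly 0.1
```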
What It Does NOT Control:
The anomaly scores, the trees that are built, and the resulting ranking of points are identical regardless of the contamination setting; only the cutoff applied by predict() changes.
| contamination | Threshold Behavior | Use Case |
|---|---|---|
| 'auto' | Threshold taken from the original IF paper's score offset | When you don't know the true contamination |
| 0.01 | Top 1% labeled anomalous | Rare anomalies; low false positive tolerance |
| 0.05 | Top 5% labeled anomalous | Moderate anomaly frequency |
| 0.10 | Top 10% labeled anomalous | Higher anomaly frequency |
| 0.20+ | Top 20%+ labeled anomalous | Very high contamination; aggressive flagging |
Best Practice: Use Scores, Not Predictions
For most applications, use score_samples() instead of predict(): continuous scores preserve the full ranking, let you change the threshold without retraining, and make it easy to prioritize investigation by severity.
The contamination parameter is mainly useful for: quick prototyping where the convenience of predict() outweighs precise threshold control, and pipelines where the expected anomaly rate is genuinely known in advance.
```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Example: Demonstrating contamination behavior
np.random.seed(42)
X = np.random.randn(100, 2)

# Contamination DOES NOT affect scores
clf_01 = IsolationForest(contamination=0.01, random_state=42)
clf_10 = IsolationForest(contamination=0.10, random_state=42)
clf_50 = IsolationForest(contamination=0.50, random_state=42)

clf_01.fit(X)
clf_10.fit(X)
clf_50.fit(X)

scores_01 = clf_01.score_samples(X)
scores_10 = clf_10.score_samples(X)
scores_50 = clf_50.score_samples(X)

# Scores are IDENTICAL regardless of contamination
print("Scores identical across contamination settings?")
print(f"  0.01 vs 0.10: {np.allclose(scores_01, scores_10)}")  # True
print(f"  0.10 vs 0.50: {np.allclose(scores_10, scores_50)}")  # True

# But predictions differ (different thresholds)
pred_01 = clf_01.predict(X)
pred_10 = clf_10.predict(X)
pred_50 = clf_50.predict(X)

print("\nNumber of anomalies detected:")
print(f"  contamination=0.01: {(pred_01 == -1).sum()}")  # ~1
print(f"  contamination=0.10: {(pred_10 == -1).sum()}")  # ~10
print(f"  contamination=0.50: {(pred_50 == -1).sum()}")  # ~50

# BEST PRACTICE: Use scores and set your own threshold
def detect_anomalies_custom(clf, X, score_threshold=None, top_k=None, top_pct=None):
    """
    Flexible anomaly detection using scores.

    Choose ONE of:
    - score_threshold: Flag if score > threshold
    - top_k: Flag top k points by score
    - top_pct: Flag top pct% of points
    """
    scores = -clf.score_samples(X)  # Negate for intuitive ordering

    if score_threshold is not None:
        return scores > score_threshold
    elif top_k is not None:
        threshold = np.sort(scores)[-top_k]
        return scores >= threshold
    elif top_pct is not None:
        threshold = np.percentile(scores, 100 - top_pct)
        return scores >= threshold
    else:
        raise ValueError("Specify one of: score_threshold, top_k, top_pct")


# Examples
clf = IsolationForest(random_state=42)
clf.fit(X)

print("\nCustom thresholding:")
print(f"Score > 0.55: {detect_anomalies_custom(clf, X, score_threshold=0.55).sum()} anomalies")
print(f"Top 5 points: {detect_anomalies_custom(clf, X, top_k=5).sum()} anomalies")
print(f"Top 3%: {detect_anomalies_custom(clf, X, top_pct=3).sum()} anomalies")
```

With contamination='auto', sklearn uses the offset from the original IF paper to determine the threshold. This typically results in a more conservative threshold (fewer predicted anomalies) than setting a specific contamination value.
The max_features parameter controls how many features each tree is trained on (features are sampled per tree, not per split). Unlike Random Forest, where this is a key hyperparameter, in Isolation Forest it is rarely important.
What It Controls:
With max_features below 1.0, each tree is fit on a random subset of the features (a float is interpreted as a fraction of the total feature count), so splits inside that tree can only use those features.
When to Tune:
Very high-dimensional data (1000+ features): Reducing max_features can speed up training and sometimes improve detection by reducing noise dimensions
When some features are known to be irrelevant: Limiting features increases the chance of splitting on relevant dimensions
Almost never: For most practical applications, leave at default
| max_features | Effect | Use Case |
|---|---|---|
| 1.0 (all) | All features can be chosen | Default; general purpose |
| 0.8-0.9 | Slight feature subsampling | Very high-dimensional; noise reduction |
| 0.5 | Half the features | Extreme dimensions; known noisy features |
| sqrt(d) | √d features | Common in Random Forest; rarely needed for IF |
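None of the earlier code examples touch max_features, so here is a small sketch of what per-tree feature subsampling looks like on a wide dataset (the synthetic data and the 0.8 value are illustrative assumptions, not a recommendation):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical wide dataset with many uninformative dimensions
np.random.seed(0)
X = np.random.randn(1000, 1200)

# Each tree is trained on a random 80% subset of the features
clf = IsolationForest(
    n_estimators=100,
    max_samples=256,
    max_features=0.8,
    random_state=42,
)
clf.fit(X)
scores = -clf.score_samples(X)
print(f"Scored {len(scores)} points using per-tree feature subsampling")
```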
Why It's Less Important for IF:
In Random Forest, max_features creates decorrelated trees by forcing each split to use different feature subsets. This is crucial for ensemble diversity.
In Isolation Forest, diversity comes from:
Random subsampling of the data for each tree
Random feature choice at every split
Random split thresholds within each feature's range
Adding feature subsampling provides marginal additional diversity at the cost of potentially missing important anomaly signals in excluded features.
For very high-dimensional data, consider feature selection/reduction BEFORE applying Isolation Forest, rather than relying on max_features. This gives you more control and allows domain-specific selection. Alternatively, use Extended Isolation Forest which handles high-dimensional correlations better.
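As a rough illustration of that advice (the choice of StandardScaler plus PCA and the component count are our assumptions, not something the text prescribes), dimensionality reduction ahead of Isolation Forest might look like:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

# Hypothetical wide dataset: 2,000 raw features
np.random.seed(0)
X = np.random.randn(1000, 2000)

# Reduce dimensionality first, then isolate in the compact space
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=50, random_state=0).fit_transform(X_scaled)

clf = IsolationForest(n_estimators=100, max_samples=256, random_state=42)
clf.fit(X_reduced)
scores = -clf.score_samples(X_reduced)   # higher = more anomalous
print(f"Scored {len(scores)} points in a 50-dimensional reduced space")
```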
In many anomaly detection scenarios, you don't have labeled anomalies to validate against. How do you tune parameters without ground truth?
Strategy 1: Score Distribution Analysis
Examine the distribution of anomaly scores:
```python
import numpy as np
from sklearn.ensemble import IsolationForest
from scipy import stats

def score_distribution_quality(scores):
    """
    Heuristic metrics for score distribution quality.

    Good anomaly detection should produce a distribution with:
    - Right skew (anomalies in the tail)
    - Clear separation between main mass and tail
    """
    # Skewness: positive = right tail (anomalies)
    skewness = stats.skew(scores)

    # Kurtosis: higher = more extreme outliers
    kurtosis = stats.kurtosis(scores)

    # Gap between 90th and 99th percentile
    # Larger gap suggests clearer anomaly separation
    p90 = np.percentile(scores, 90)
    p99 = np.percentile(scores, 99)
    tail_gap = p99 - p90

    # Coefficient of variation of top 10%
    top_10pct = scores[scores >= p90]
    if len(top_10pct) > 1:
        top_spread = top_10pct.std() / top_10pct.mean() if top_10pct.mean() > 0 else 0
    else:
        top_spread = 0

    return {
        'skewness': skewness,
        'kurtosis': kurtosis,
        'tail_gap': tail_gap,
        'top_spread': top_spread,
    }


def unsupervised_param_selection(X, max_samples_range, n_estimators=100):
    """
    Select max_samples without labels using distribution heuristics.

    This is a heuristic approach - not guaranteed to find optimal,
    but often works reasonably well.
    """
    results = []
    for max_samples in max_samples_range:
        clf = IsolationForest(
            n_estimators=n_estimators,
            max_samples=max_samples,
            random_state=42
        )
        clf.fit(X)
        scores = -clf.score_samples(X)

        quality = score_distribution_quality(scores)
        quality['max_samples'] = max_samples
        results.append(quality)

    return results


# Example: Unsupervised parameter selection
np.random.seed(42)
X = np.vstack([
    np.random.randn(500, 3),           # Main cluster
    3 + 0.2*np.random.randn(10, 3),    # Anomalies (unknown to us)
])

max_samples_options = [64, 128, 256, 512]
results = unsupervised_param_selection(X, max_samples_options)

print("max_samples | Skewness | Kurtosis | Tail Gap | Top Spread")
print("-" * 60)
for r in results:
    print(f"{r['max_samples']:>11} | {r['skewness']:>8.3f} | {r['kurtosis']:>8.3f} | "
          f"{r['tail_gap']:>8.4f} | {r['top_spread']:>10.4f}")

# Prefer settings with:
# - Higher positive skewness
# - Higher kurtosis
# - Larger tail gap
```

Strategy 2: Stability Analysis
Check how consistent scores are across different random seeds. Unstable scores suggest the model isn't capturing true structure:
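A minimal sketch of such a stability check follows; the helper name score_stability and the candidate values are ours, and the heuristic simply prefers settings with a lower mean per-point score standard deviation across seeds.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def score_stability(X, n_seeds=5, **iforest_kwargs):
    """Mean per-point standard deviation of scores across random seeds (lower = more stable)."""
    all_scores = []
    for seed in range(n_seeds):
        clf = IsolationForest(random_state=seed, **iforest_kwargs)
        clf.fit(X)
        all_scores.append(-clf.score_samples(X))
    return np.vstack(all_scores).std(axis=0).mean()

np.random.seed(42)
X = np.vstack([np.random.randn(500, 3), 3 + 0.2 * np.random.randn(10, 3)])

for max_samples in (64, 256, 512):
    instability = score_stability(X, n_estimators=100, max_samples=max_samples)
    print(f"max_samples={max_samples:>4}: mean score std across seeds = {instability:.4f}")
```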
Strategy 3: Domain Expert Review
Fit the model with a few candidate settings, show the top-ranked points from each to a domain expert, and keep the configuration whose top results the expert most consistently confirms as genuinely unusual. This 'human-in-the-loop' approach is often the most reliable when you have domain expertise available.
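One way to operationalize this, as a sketch rather than a prescribed workflow (top_candidates_for_review and n_review are hypothetical names), is to pull the highest-scoring points for each candidate configuration and hand them to the reviewer:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def top_candidates_for_review(clf, X, n_review=20):
    """Return indices and scores of the highest-scoring points for expert review."""
    scores = -clf.score_samples(X)            # higher = more anomalous
    order = np.argsort(scores)[::-1][:n_review]
    return order, scores[order]

# Compare two candidate settings by reviewing their top hits
np.random.seed(0)
X = np.random.randn(500, 4)
for max_samples in (64, 256):
    clf = IsolationForest(max_samples=max_samples, random_state=42).fit(X)
    idx, s = top_candidates_for_review(clf, X, n_review=10)
    print(f"max_samples={max_samples}: review rows {idx.tolist()}")
```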
All unsupervised tuning methods are heuristics. Without true labels, you cannot guarantee optimal performance. When the stakes are high, invest in labeling a small validation set and use proper evaluation metrics.
Deploying Isolation Forest in production requires consideration beyond just parameter tuning. Here are key guidelines for robust production systems.
```python
from dataclasses import dataclass
from typing import Optional
import numpy as np
from sklearn.ensemble import IsolationForest
import json

@dataclass
class IsolationForestConfig:
    """
    Configuration class for production Isolation Forest.

    Stores all settings for reproducibility and documentation.
    """
    # Core hyperparameters
    n_estimators: int = 100
    max_samples: int = 256
    max_features: float = 1.0
    bootstrap: bool = False
    random_state: int = 42

    # Threshold settings
    threshold_method: str = 'percentile'  # 'percentile', 'score', 'contamination'
    threshold_value: float = 97.0         # Interpretation depends on method

    # Metadata
    version: str = '1.0.0'
    training_data_hash: Optional[str] = None

    def to_dict(self):
        return {k: v for k, v in self.__dict__.items()}

    def to_json(self):
        return json.dumps(self.to_dict(), indent=2)

    @classmethod
    def from_json(cls, json_str):
        return cls(**json.loads(json_str))


class ProductionIsolationForest:
    """
    Production wrapper for Isolation Forest with best practices.
    """

    def __init__(self, config: IsolationForestConfig):
        self.config = config
        self.model = IsolationForest(
            n_estimators=config.n_estimators,
            max_samples=config.max_samples,
            max_features=config.max_features,
            bootstrap=config.bootstrap,
            random_state=config.random_state,
            contamination='auto',  # We handle thresholding ourselves
        )
        self.training_score_percentiles_ = None

    def fit(self, X):
        """Fit model and store training score statistics."""
        self.model.fit(X)

        # Store training score percentiles for threshold setting
        training_scores = -self.model.score_samples(X)
        self.training_score_percentiles_ = np.percentile(
            training_scores, [50, 75, 90, 95, 97, 99, 99.5, 99.9]
        )
        return self

    def get_threshold(self):
        """Compute threshold based on config."""
        if self.config.threshold_method == 'percentile':
            # Map percentile to stored training threshold
            pct_idx = [50, 75, 90, 95, 97, 99, 99.5, 99.9].index(
                min([50, 75, 90, 95, 97, 99, 99.5, 99.9],
                    key=lambda x: abs(x - self.config.threshold_value))
            )
            return self.training_score_percentiles_[pct_idx]
        elif self.config.threshold_method == 'score':
            return self.config.threshold_value
        else:
            raise ValueError(f"Unknown method: {self.config.threshold_method}")

    def score(self, X):
        """Get anomaly scores (higher = more anomalous)."""
        return -self.model.score_samples(X)

    def predict(self, X):
        """Predict using config threshold."""
        scores = self.score(X)
        threshold = self.get_threshold()
        return (scores > threshold).astype(int)

    def predict_with_metadata(self, X):
        """Return predictions with full metadata for logging."""
        scores = self.score(X)
        threshold = self.get_threshold()
        predictions = (scores > threshold).astype(int)

        return {
            'scores': scores,
            'predictions': predictions,
            'threshold_used': threshold,
            'model_version': self.config.version,
            'n_anomalies': predictions.sum(),
        }


# Example usage
config = IsolationForestConfig(
    n_estimators=100,
    max_samples=256,
    threshold_method='percentile',
    threshold_value=99.0,  # Top 1%
    version='1.0.0'
)

print("Production config:")
print(config.to_json())
```

Common issues: (1) Not fixing random_state leads to non-reproducible predictions, (2) Using contamination-based thresholds that drift with data changes, (3) Not monitoring for data/score distribution drift, (4) Retraining on data with many false negatives (missed anomalies). Address these proactively.
We've covered the complete landscape of Isolation Forest parameter selection—from understanding each hyperparameter's role to systematic tuning strategies for both supervised and unsupervised settings.
| Parameter | Default | When to Change |
|---|---|---|
| n_estimators | 100 | Score stability issues → increase to 200+ |
| max_samples | 'auto' (256) | Swamping/masking → decrease; Multiple clusters → increase |
| contamination | 'auto' | Use predict() → set based on expected rate; otherwise ignore |
| max_features | 1.0 | Very high dimensions (1000+) → try 0.8 |
| random_state | None | Always set for production! |
Module Complete:
You've now mastered Isolation Forest—from the foundational isolation principle, through the algorithm mechanics of random partitioning and path length scoring, to the advanced Extended Isolation Forest variant, and finally to practical parameter selection for production deployment.
Isolation Forest is one of the most practical and widely-used anomaly detection algorithms. With the knowledge from this module, you can confidently apply it to real-world problems, tune it for optimal performance, and deploy it robustly in production systems.
Congratulations! You've completed the Isolation Forest module. You now understand: the isolation principle as a paradigm shift in anomaly detection, how random partitioning operationalizes this principle, the complete scoring framework and its interpretation, Extended IF for correlated data, and production-ready parameter tuning strategies. Apply this knowledge to detect anomalies effectively in your own projects!