Every anomaly detection model produces continuous scores that indicate how 'anomalous' each data point is. But in practice, stakeholders need decisions: Is this an anomaly or not? Should we alert? Should we investigate? This translation from continuous scores to discrete decisions requires setting a threshold.
Threshold selection is often treated as an afterthought, but it's arguably the most critical step in deploying anomaly detection. A perfect scoring function is useless with a poor threshold. Conversely, a decent scoring function with a well-calibrated threshold can be highly effective.
The challenge is profound: unlike supervised classification, we typically lack labeled anomalies to optimize against. We must rely on statistical reasoning, domain knowledge, and operational feedback to set and maintain effective thresholds.
By the end of this page, you will understand: (1) Why threshold selection is challenging in anomaly detection, (2) Statistical methods for automatic threshold determination, (3) Business-aligned thresholds based on costs and constraints, (4) Extreme Value Theory for principled tail modeling, (5) Dynamic and adaptive threshold strategies, and (6) Multi-threshold approaches for graduated alerting.
Setting an anomaly threshold is fundamentally different from setting a classification threshold in supervised learning. In supervised classification, labeled examples of both classes let you tune the threshold directly against validation metrics such as precision and recall.
In anomaly detection, we typically have only (mostly) normal training data, few or no labeled anomalies, and little reliable knowledge of the anomaly base rate, so the threshold must be chosen indirectly.
The Core Trade-offs:
| Threshold Level | False Positive Rate | Detection Rate | Operational Impact |
|---|---|---|---|
| Too Low | High | High (catches all anomalies) | Alert fatigue; team ignores alerts; real anomalies lost in noise |
| Too High | Low | Low (misses anomalies) | False security; critical issues go undetected; costly failures |
| Just Right | Acceptable | Acceptable | Actionable alerts; sustainable operations; balanced costs |
What Makes a Good Threshold?
A good threshold depends entirely on the application context:
Cost asymmetry: What's the relative cost of false positives vs. false negatives?
Operational capacity: How many alerts can the team realistically investigate?
Base rate: How rare are anomalies in your data?
Anomaly severity: Are all anomalies equally important?
When anomalies are rare, even highly accurate detectors produce mostly false positives.
Example: Anomaly rate = 0.1%, Detector has 99% TPR and 1% FPR.
For 100,000 samples:
• True anomalies: 100 → 99 detected (TP)
• Normal samples: 99,900 → 999 false alarms (FP)
• Precision = 99/(99+999) ≈ 9%
About 91% of alerts are false positives, even though the detector is 99% accurate on each class.
This is why threshold selection is so critical—and why you can't ignore the base rate.
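A quick way to sanity-check this arithmetic is a few lines of Python; the helper below is just for illustration:

```python
def alert_precision(base_rate: float, tpr: float, fpr: float) -> float:
    """Precision of alerts given the anomaly base rate and the detector's TPR/FPR."""
    tp = base_rate * tpr            # fraction of all samples that are correctly flagged anomalies
    fp = (1 - base_rate) * fpr      # fraction of all samples that are false alarms
    return tp / (tp + fp)

# Numbers from the example above: 0.1% anomalies, 99% TPR, 1% FPR
print(f"{alert_precision(0.001, 0.99, 0.01):.1%}")  # ~9.0%
```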
Statistical methods provide principled, automatic threshold selection without requiring labeled anomalies. They work by modeling the distribution of scores on training (normal) data and setting thresholds based on statistical significance.
1. Percentile-Based Thresholds:
The simplest approach: set threshold at the p-th percentile of training scores.
$$\tau = \text{Percentile}_p(\{s_1, s_2, \ldots, s_n\})$$
2. Gaussian Assumption (μ + kσ):
If scores are approximately Gaussian, use:
$$\tau = \mu + k \cdot \sigma$$
where μ and σ are mean and standard deviation of training scores.
Pros: smooth and well understood. Cons: scores are often non-Gaussian (heavy-tailed, skewed).
```python
import numpy as np
from scipy import stats
from scipy.stats import genextreme, genpareto
from typing import Tuple, Optional


class StatisticalThresholdSelector:
    """
    Statistical methods for anomaly threshold selection.
    Provides multiple approaches that don't require labeled anomalies.
    """

    def __init__(self, scores: np.ndarray):
        """
        Initialize with training/normal scores.

        Parameters:
        -----------
        scores : array-like
            Anomaly scores from training/normal data
            Higher scores = more anomalous
        """
        self.scores = np.asarray(scores)
        self.n = len(scores)

    def percentile_threshold(self, percentile: float = 95) -> float:
        """
        Set threshold at given percentile of training scores.
        Controls training false positive rate directly.
        """
        threshold = np.percentile(self.scores, percentile)
        expected_fpr = (100 - percentile) / 100
        print(f"Percentile threshold ({percentile}th): {threshold:.4f}")
        print(f"Expected training FP rate: {expected_fpr:.2%}")
        return threshold

    def gaussian_threshold(self, k: float = 3.0) -> float:
        """
        Set threshold as mean + k * std (Gaussian assumption).

        Parameters:
        -----------
        k : float
            Number of standard deviations (2, 3, or 4 common)
        """
        mu = np.mean(self.scores)
        sigma = np.std(self.scores)
        threshold = mu + k * sigma

        # Theoretical FP rate under Gaussian
        expected_fpr = 1 - stats.norm.cdf(k)

        print(f"Gaussian threshold (μ + {k}σ): {threshold:.4f}")
        print(f"μ = {mu:.4f}, σ = {sigma:.4f}")
        print(f"Expected FP rate (if Gaussian): {expected_fpr:.4%}")
        return threshold

    def robust_threshold(self, k: float = 3.0) -> float:
        """
        Robust threshold using median and MAD.
        More robust to outliers in training data.

        MAD = Median Absolute Deviation
        """
        median = np.median(self.scores)
        mad = np.median(np.abs(self.scores - median))

        # Scale MAD to be consistent with std for Gaussian
        # MAD * 1.4826 ≈ std for Gaussian
        robust_std = mad * 1.4826
        threshold = median + k * robust_std

        print(f"Robust threshold (median + {k}*MAD_scaled): {threshold:.4f}")
        print(f"median = {median:.4f}, MAD = {mad:.4f}")
        return threshold

    def iqr_threshold(self, multiplier: float = 1.5) -> float:
        """
        IQR-based threshold (Tukey's method).

        threshold = Q3 + multiplier * IQR
        Standard Tukey: multiplier = 1.5 (outlier)
        Extreme: multiplier = 3.0 (far outlier)
        """
        q1, q3 = np.percentile(self.scores, [25, 75])
        iqr = q3 - q1
        threshold = q3 + multiplier * iqr

        print(f"IQR threshold (Q3 + {multiplier}*IQR): {threshold:.4f}")
        print(f"Q1 = {q1:.4f}, Q3 = {q3:.4f}, IQR = {iqr:.4f}")
        return threshold

    def contamination_threshold(self, contamination: float = 0.05) -> float:
        """
        Set threshold assuming known contamination in training data.
        If we believe training data has ~5% anomalies,
        set threshold to exclude top 5%.
        """
        threshold = np.percentile(self.scores, (1 - contamination) * 100)
        print(f"Contamination-based threshold ({contamination:.1%}): {threshold:.4f}")
        return threshold

    def elbow_threshold(self) -> Tuple[float, int]:
        """
        Find threshold using elbow method on sorted scores.
        Looks for the 'knee' where scores start increasing rapidly.
        """
        sorted_scores = np.sort(self.scores)
        n = len(sorted_scores)

        # Simple elbow detection: maximum curvature point
        # Use second derivative approximation
        if n < 10:
            # Too few points, fall back to percentile
            return self.percentile_threshold(95), int(0.95 * n)

        # Normalize to [0, 1] for both axes
        x = np.arange(n) / (n - 1)
        y = (sorted_scores - sorted_scores.min()) / (sorted_scores.max() - sorted_scores.min() + 1e-10)

        # Find point with maximum distance from line connecting first and last points
        # This is the elbow
        line_vec = np.array([1, y[-1] - y[0]])
        line_vec = line_vec / np.linalg.norm(line_vec)

        point_vecs = np.column_stack([x, y - y[0]])
        cross = np.abs(point_vecs[:, 0] * line_vec[1] - point_vecs[:, 1] * line_vec[0])

        elbow_idx = np.argmax(cross)
        threshold = sorted_scores[elbow_idx]

        print(f"Elbow threshold: {threshold:.4f}")
        print(f"Elbow at index {elbow_idx} ({elbow_idx/n:.1%} of data)")
        return threshold, elbow_idx

    def compare_methods(self) -> dict:
        """Compare all threshold methods."""
        print("="*60)
        print("Threshold Method Comparison")
        print("="*60)

        methods = {
            'percentile_95': self.percentile_threshold(95),
            'percentile_99': self.percentile_threshold(99),
            'gaussian_2sigma': self.gaussian_threshold(2),
            'gaussian_3sigma': self.gaussian_threshold(3),
            'robust_3sigma': self.robust_threshold(3),
            'iqr_1.5': self.iqr_threshold(1.5),
            'iqr_3.0': self.iqr_threshold(3.0),
            'elbow': self.elbow_threshold()[0],
        }

        print("\n" + "="*60)
        print("Summary:")
        for name, thresh in methods.items():
            fpr = np.mean(self.scores > thresh)
            print(f"  {name:20s}: {thresh:8.4f} (training FPR: {fpr:.2%})")

        return methods


# Example usage
if __name__ == "__main__":
    # Simulate anomaly scores (mostly normal, some outliers)
    np.random.seed(42)
    normal_scores = np.random.gamma(2, 0.5, 950)    # Normal data
    outlier_scores = np.random.gamma(5, 1.0, 50)    # Outliers in training
    scores = np.concatenate([normal_scores, outlier_scores])

    selector = StatisticalThresholdSelector(scores)
    thresholds = selector.compare_methods()
```

Standard statistical methods often fail for anomaly detection because anomaly scores are heavy-tailed—extreme values are more common than a Gaussian would predict. Extreme Value Theory (EVT) provides tools specifically designed for modeling distribution tails.
Why EVT for Anomaly Detection?
Classical thresholds (percentiles, μ + kσ) describe the bulk of the score distribution, but anomaly decisions are made in the tail, exactly where those approximations are weakest. EVT models the tail directly.
The Peaks Over Threshold (POT) Method:
POT focuses on exceedances above a high threshold u:
$$Y = X - u \;\big|\; X > u$$
Under mild conditions, Y follows a Generalized Pareto Distribution (GPD):
$$F(y) = 1 - \left(1 + \frac{\xi y}{\sigma}\right)^{-1/\xi}$$
where ξ is the shape parameter and σ > 0 is the scale parameter; for ξ = 0 the GPD reduces to the exponential distribution $F(y) = 1 - e^{-y/\sigma}$.
The shape parameter ξ tells you about tail behavior:
• ξ > 0 (Fréchet-type): heavy tail; extreme values likely. Common for anomaly scores.
• ξ = 0 (Gumbel-type): exponential tail; Gaussian-like decay.
• ξ < 0 (Weibull-type): bounded tail; finite maximum value.
Most anomaly score distributions have ξ > 0, meaning Gaussian assumptions underestimate the probability of extreme values.
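To see how much a Gaussian can understate tail risk, here is a small illustrative comparison, using a Student-t as a stand-in for a heavy-tailed score distribution (the specific cutoff and degrees of freedom are assumptions):

```python
from scipy import stats

# Probability of a score exceeding the same high cutoff (5 standardized units)
# under a Gaussian model vs. a heavy-tailed Student-t(3) model of the scores.
cutoff = 5.0
p_gaussian = stats.norm.sf(cutoff)       # ~2.9e-07
p_heavy = stats.t.sf(cutoff, df=3)       # ~7.7e-03: orders of magnitude larger

print(f"Gaussian tail probability:     {p_gaussian:.2e}")
print(f"Heavy-tailed tail probability: {p_heavy:.2e}")
```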
EVT-Based Threshold Selection:
Return Level:
The return level for probability p is:
$$z_p = u + \frac{\sigma}{\xi}\left[\left(\frac{n \cdot p}{N_u}\right)^{-\xi} - 1\right]$$
where n is the total number of observations and N_u is the number of observations exceeding u.
This gives the threshold such that only fraction p of future normal observations exceed it.
```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize
from scipy.stats import genpareto
from typing import Tuple, Optional


class EVTThresholdSelector:
    """
    Extreme Value Theory-based threshold selection.
    Uses Peaks Over Threshold (POT) method with Generalized Pareto Distribution.
    """

    def __init__(self, scores: np.ndarray):
        """
        Initialize with training/normal scores.
        """
        self.scores = np.asarray(scores)
        self.n = len(scores)

        # GPD parameters (fitted later)
        self.xi = None        # Shape
        self.sigma = None     # Scale
        self.u = None         # Threshold for POT
        self.n_exceed = None  # Number of exceedances

    def fit_gpd(self, preliminary_threshold_percentile: float = 90) -> Tuple[float, float, float]:
        """
        Fit Generalized Pareto Distribution to tail exceedances.

        Parameters:
        -----------
        preliminary_threshold_percentile : float
            Percentile for preliminary threshold (determines where 'tail' starts)

        Returns:
        --------
        xi : float - Shape parameter
        sigma : float - Scale parameter
        u : float - Threshold used for POT
        """
        # Preliminary threshold
        self.u = np.percentile(self.scores, preliminary_threshold_percentile)

        # Extract exceedances
        exceedances = self.scores[self.scores > self.u] - self.u
        self.n_exceed = len(exceedances)

        if self.n_exceed < 10:
            print(f"Warning: Only {self.n_exceed} exceedances. Results may be unreliable.")

        # Fit GPD using MLE
        # scipy's genpareto shape parameter c corresponds directly to the EVT shape xi
        # (unlike genextreme, which uses c = -xi)
        try:
            c, loc, scale = genpareto.fit(exceedances, floc=0)
            self.xi = c
            self.sigma = scale
        except Exception as e:
            print(f"GPD fit failed: {e}. Falling back to moment estimator.")
            # Method-of-moments fallback: for the GPD, mean^2/var = 1 - 2*xi
            mean_exc = np.mean(exceedances)
            var_exc = np.var(exceedances)
            self.xi = 0.5 * (1 - mean_exc**2 / var_exc)
            self.sigma = 0.5 * mean_exc * (mean_exc**2 / var_exc + 1)

        print(f"GPD Fit Results:")
        print(f"  Preliminary threshold (u): {self.u:.4f}")
        print(f"  Exceedances: {self.n_exceed} ({self.n_exceed/self.n:.1%} of data)")
        print(f"  Shape (ξ): {self.xi:.4f}")
        print(f"  Scale (σ): {self.sigma:.4f}")

        if self.xi > 0:
            print(f"  Tail type: Heavy (Fréchet)")
        elif self.xi < 0:
            print(f"  Tail type: Bounded (Weibull)")
        else:
            print(f"  Tail type: Exponential (Gumbel)")

        return self.xi, self.sigma, self.u

    def return_level(self, false_positive_rate: float) -> float:
        """
        Compute threshold for given false positive rate.

        Parameters:
        -----------
        false_positive_rate : float
            Desired probability that normal sample exceeds threshold

        Returns:
        --------
        threshold : float
            Threshold value such that P(score > threshold) = false_positive_rate
        """
        if self.xi is None:
            raise ValueError("Must call fit_gpd() first")

        # Probability of exceeding u
        p_exceed_u = self.n_exceed / self.n

        # We want P(X > z) = fpr for normal data
        # P(X > z) = P(X > u) * P(X > z | X > u)
        #          = p_exceed_u * (1 - F_GPD(z - u))
        #
        # Setting this equal to fpr and solving for z:
        # GPD quantile: For P(Y > y) = q, y = sigma/xi * ((q)^(-xi) - 1) for xi != 0

        if false_positive_rate >= p_exceed_u:
            # Desired FPR is higher than exceedance rate, threshold is below u
            # Fall back to percentile
            return np.percentile(self.scores, (1 - false_positive_rate) * 100)

        # Conditional probability of exceeding threshold given exceeding u
        q = false_positive_rate / p_exceed_u

        if abs(self.xi) < 1e-6:
            # Exponential case (xi ≈ 0)
            exceedance = -self.sigma * np.log(q)
        else:
            # General GPD case
            exceedance = self.sigma / self.xi * (q**(-self.xi) - 1)

        threshold = self.u + exceedance
        return threshold

    def compute_threshold(self, false_positive_rate: float = 0.01,
                          preliminary_percentile: float = 90) -> float:
        """
        One-step method to compute EVT-based threshold.
        """
        self.fit_gpd(preliminary_percentile)
        threshold = self.return_level(false_positive_rate)

        # Validate
        empirical_fpr = np.mean(self.scores > threshold)

        print(f"\nEVT Threshold for {false_positive_rate:.2%} FPR: {threshold:.4f}")
        print(f"Empirical training FPR: {empirical_fpr:.2%}")

        return threshold

    def stability_analysis(self, fpr: float = 0.01,
                           percentile_range: Tuple[float, float] = (85, 95)) -> dict:
        """
        Analyze sensitivity to preliminary threshold choice.
        A stable GPD fit should give similar thresholds across different u values.
        """
        results = []

        for pct in np.linspace(percentile_range[0], percentile_range[1], 11):
            self.fit_gpd(pct)
            threshold = self.return_level(fpr)
            results.append({
                'percentile': pct,
                'u': self.u,
                'xi': self.xi,
                'sigma': self.sigma,
                'threshold': threshold
            })

        thresholds = [r['threshold'] for r in results]

        print(f"\nStability Analysis:")
        print(f"Threshold range: [{min(thresholds):.4f}, {max(thresholds):.4f}]")
        print(f"Threshold std: {np.std(thresholds):.4f}")
        print(f"Recommended threshold: {np.median(thresholds):.4f} (median)")

        return results


# Example
if __name__ == "__main__":
    # Simulate heavy-tailed anomaly scores
    np.random.seed(42)

    # Pareto-distributed scores (heavy tail)
    normal_scores = np.random.pareto(a=3, size=1000) + 1

    selector = EVTThresholdSelector(normal_scores)

    print("="*60)
    print("EVT-Based Threshold Selection")
    print("="*60)
    threshold = selector.compute_threshold(false_positive_rate=0.01)

    print("\n" + "="*60)
    print("Stability Analysis")
    print("="*60)
    selector.stability_analysis(fpr=0.01)
```

Statistical methods optimize for mathematical properties, but real systems operate under business constraints. Business-aligned thresholds translate operational requirements into threshold values.
Key Business Considerations: the factors introduced earlier (cost asymmetry, operational capacity, base rate, and anomaly severity) now become explicit inputs to the threshold.
Alert Budget Approach:
Set threshold to match investigation capacity:
$$\text{Daily Alerts} = \text{Daily Volume} \times \text{FPR} + \text{Expected Anomalies}$$
Solving for FPR: $$\text{FPR} = \frac{\text{Alert Budget} - \text{Expected Anomalies}}{\text{Daily Volume}}$$
Then use statistical methods to find threshold achieving this FPR.
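A minimal sketch of this calculation, assuming training scores and an illustrative event volume and alert budget (all numbers below are assumptions, not recommendations):

```python
import numpy as np

def alert_budget_threshold(scores, daily_volume, alert_budget, expected_anomalies=0.0):
    """Convert an alert budget into a target FPR, then into a score threshold."""
    target_fpr = max(alert_budget - expected_anomalies, 0.0) / daily_volume
    threshold = np.percentile(scores, 100 * (1 - target_fpr))
    return threshold, target_fpr

# Illustrative numbers: 200,000 events/day, capacity for 40 alerts, ~5 real anomalies expected
rng = np.random.default_rng(0)
training_scores = rng.gamma(2.0, 0.5, 100_000)   # stand-in for scores on normal data
tau, fpr = alert_budget_threshold(training_scores, 200_000, 40, expected_anomalies=5)
print(f"Target FPR: {fpr:.4%}  ->  threshold: {tau:.3f}")
```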
When You Have Some Labeled Anomalies:
If you have even a small set of labeled anomalies (from past investigations), you can estimate the score distribution for anomalies and optimize more precisely:
Even 50-100 labeled anomalies can significantly improve threshold calibration.
F1-Optimal Threshold:
If you have labeled data and want balanced precision/recall:
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Scan thresholds and select τ that maximizes F1 on a validation set.
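A minimal sketch of that scan, assuming you have validation scores and binary labels (the names here are placeholders):

```python
import numpy as np
from typing import Tuple

def f1_optimal_threshold(scores: np.ndarray, labels: np.ndarray) -> Tuple[float, float]:
    """Scan candidate thresholds and return the one maximizing F1 on validation data."""
    best_tau, best_f1 = float("inf"), 0.0
    for tau in np.unique(scores):
        pred = scores > tau
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau, best_f1
```

If scikit-learn is available, `precision_recall_curve` performs the same sweep more efficiently.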
In production, use investigation outcomes, with each alert confirmed as a true or false positive, to continuously refine the threshold.
This creates a virtuous cycle: better thresholds → fewer false positives → more trust in system → better investigation quality → better labels → even better thresholds.
Static thresholds assume data distributions don't change. In reality, systems evolve: user behavior shifts, code changes, external factors vary. Dynamic thresholds adapt to these changes, maintaining consistent detection behavior despite distribution shifts.
Why Thresholds Drift:
Adaptation Strategies:
| Strategy | How It Works | Best For |
|---|---|---|
| Rolling Window | Compute threshold from last N hours/days of data | Gradual drift; stable patterns |
| Exponential Smoothing | τ_t = α × τ_new + (1-α) × τ_{t-1} | Smooth adaptation; noisy data |
| Time-of-Day Profiles | Different thresholds for different hours | Strong seasonal patterns |
| Feedback-Based | Adjust based on FP/FN rates from labels | When investigation labels available |
| Control Charts | Detect when threshold needs update | Detecting sudden shifts |
```python
import numpy as np
from collections import deque
from datetime import datetime, timedelta
from typing import Optional, Callable


class DynamicThreshold:
    """
    Dynamic threshold that adapts to changing data distributions.
    Maintains a sliding window of recent scores and updates
    threshold based on configurable statistics.
    """

    def __init__(
        self,
        window_size: int = 1000,
        percentile: float = 95,
        smoothing_factor: float = 0.1,
        min_samples: int = 100
    ):
        """
        Parameters:
        -----------
        window_size : int
            Number of recent scores to consider
        percentile : float
            Percentile threshold within window
        smoothing_factor : float (0-1)
            Alpha for exponential smoothing (higher = faster adaptation)
        min_samples : int
            Minimum samples before updating threshold
        """
        self.window_size = window_size
        self.percentile = percentile
        self.alpha = smoothing_factor
        self.min_samples = min_samples

        self.scores = deque(maxlen=window_size)
        self.threshold = None
        self.threshold_history = []

    def update(self, new_score: float) -> float:
        """
        Add new score and return current threshold.
        Threshold is updated after new score is added.
        """
        self.scores.append(new_score)

        if len(self.scores) < self.min_samples:
            # Not enough data yet, return conservative threshold
            if self.threshold is None:
                self.threshold = float('inf')
            return self.threshold

        # Compute window threshold
        window_threshold = np.percentile(list(self.scores), self.percentile)

        if self.threshold is None or self.threshold == float('inf'):
            # First update
            self.threshold = window_threshold
        else:
            # Exponential smoothing
            self.threshold = self.alpha * window_threshold + (1 - self.alpha) * self.threshold

        self.threshold_history.append(self.threshold)
        return self.threshold

    def batch_update(self, new_scores: np.ndarray) -> float:
        """Update with batch of new scores."""
        for score in new_scores:
            self.update(score)
        return self.threshold

    def is_anomaly(self, score: float) -> bool:
        """Check if score exceeds current threshold."""
        return score > self.threshold


class TimeAwareThreshold:
    """
    Threshold that varies by time of day.
    Maintains separate thresholds for different time periods.
    """

    def __init__(
        self,
        n_periods: int = 24,  # Hourly by default
        window_size: int = 1000,
        percentile: float = 95
    ):
        self.n_periods = n_periods
        self.thresholds = {i: DynamicThreshold(window_size, percentile)
                           for i in range(n_periods)}

    def _get_period(self, timestamp: datetime) -> int:
        """Map timestamp to period index."""
        # For hourly: period = hour
        # Could customize for other granularities
        return timestamp.hour % self.n_periods

    def update(self, score: float, timestamp: datetime) -> float:
        """Update threshold for the given time period."""
        period = self._get_period(timestamp)
        return self.thresholds[period].update(score)

    def get_threshold(self, timestamp: datetime) -> float:
        """Get threshold for the given time period."""
        period = self._get_period(timestamp)
        return self.thresholds[period].threshold

    def is_anomaly(self, score: float, timestamp: datetime) -> bool:
        """Check if score exceeds threshold for the time period."""
        return score > self.get_threshold(timestamp)


class FeedbackDrivenThreshold:
    """
    Threshold that adapts based on investigation feedback.
    If false positives are too high, threshold increases.
    If false negatives are confirmed, threshold decreases.
    """

    def __init__(
        self,
        initial_threshold: float,
        target_precision: float = 0.5,  # Target 50% precision
        learning_rate: float = 0.05,
        feedback_window: int = 100
    ):
        self.threshold = initial_threshold
        self.target_precision = target_precision
        self.learning_rate = learning_rate

        self.feedback_buffer = deque(maxlen=feedback_window)
        self.threshold_history = [initial_threshold]

    def record_feedback(self, score: float, was_true_positive: bool):
        """
        Record investigation outcome.

        Parameters:
        -----------
        score : float
            The anomaly score that triggered alert
        was_true_positive : bool
            True if investigation confirmed anomaly
        """
        self.feedback_buffer.append({
            'score': score,
            'tp': was_true_positive,
            'threshold': self.threshold
        })

        # Update threshold based on recent precision
        if len(self.feedback_buffer) >= 10:
            self._update_threshold()

    def _update_threshold(self):
        """Adjust threshold based on observed precision."""
        recent = list(self.feedback_buffer)

        # Precision from recent alerts
        true_positives = sum(1 for f in recent if f['tp'])
        precision = true_positives / len(recent)

        # Adjust threshold
        if precision < self.target_precision:
            # Too many false positives, raise threshold
            adjustment = self.learning_rate * (self.target_precision - precision)
            self.threshold *= (1 + adjustment)
            print(f"Precision {precision:.2%} < target. Raising threshold to {self.threshold:.4f}")
        elif precision > self.target_precision + 0.1:
            # Possibly missing anomalies, lower threshold
            adjustment = self.learning_rate * (precision - self.target_precision)
            self.threshold *= (1 - adjustment * 0.5)  # More conservative lowering
            print(f"Precision {precision:.2%} > target. Lowering threshold to {self.threshold:.4f}")

        self.threshold_history.append(self.threshold)

    def is_anomaly(self, score: float) -> bool:
        return score > self.threshold


class ControlChartThreshold:
    """
    Use control chart (EWMA) to detect when threshold needs recalibration.
    Monitors the score distribution and alerts when it shifts significantly.
    """

    def __init__(
        self,
        base_threshold: float,
        lambda_param: float = 0.1,  # EWMA smoothing
        control_limit_sigma: float = 3.0
    ):
        self.base_threshold = base_threshold
        self.lambda_param = lambda_param
        self.sigma = control_limit_sigma

        self.ewma = None
        self.ewma_variance = None
        self.process_mean = None
        self.process_std = None
        self.requires_recalibration = False

    def initialize(self, initial_scores: np.ndarray):
        """Initialize control chart from baseline period."""
        self.process_mean = np.mean(initial_scores)
        self.process_std = np.std(initial_scores)
        self.ewma = self.process_mean
        self.ewma_variance = (self.lambda_param / (2 - self.lambda_param)) * self.process_std**2

    def update(self, score: float) -> dict:
        """
        Update EWMA and check if distribution has shifted.
        Returns status dict with recalibration recommendation.
        """
        if self.ewma is None:
            raise ValueError("Must call initialize() first")

        # Update EWMA
        self.ewma = self.lambda_param * score + (1 - self.lambda_param) * self.ewma

        # Control limits
        control_limit = self.sigma * np.sqrt(self.ewma_variance)
        ucl = self.process_mean + control_limit
        lcl = self.process_mean - control_limit

        # Check for out-of-control
        if self.ewma > ucl or self.ewma < lcl:
            self.requires_recalibration = True

        return {
            'ewma': self.ewma,
            'ucl': ucl,
            'lcl': lcl,
            'in_control': lcl <= self.ewma <= ucl,
            'recalibration_recommended': self.requires_recalibration
        }


# Example usage
if __name__ == "__main__":
    np.random.seed(42)

    # Simulate evolving scores
    # Initial period
    initial_scores = np.random.normal(1.0, 0.3, 500)
    # Later period with drift
    drifted_scores = np.random.normal(1.5, 0.4, 500)
    all_scores = np.concatenate([initial_scores, drifted_scores])

    # Dynamic threshold
    print("="*60)
    print("Dynamic Threshold Demo")
    print("="*60)

    dynamic = DynamicThreshold(window_size=200, percentile=95, smoothing_factor=0.1)

    thresholds = []
    for i, score in enumerate(all_scores):
        thresh = dynamic.update(score)
        thresholds.append(thresh)

        if i % 250 == 0 and i > 0:
            print(f"Step {i}: threshold = {thresh:.4f}")

    print(f"\nInitial threshold: {thresholds[200]:.4f}")
    print(f"Final threshold: {thresholds[-1]:.4f}")
    print(f"Threshold adapted to drift: {(thresholds[-1] - thresholds[200])/thresholds[200]*100:.1f}% change")
```

Real systems benefit from multiple thresholds that trigger different responses. Instead of a binary alert/no-alert decision, graduated approaches reduce alert fatigue while maintaining sensitivity.
Multi-Threshold Framework:
Score > τ_critical → Immediate alert, page on-call, block transaction
Score > τ_high → High-priority alert, investigate within 1 hour
Score > τ_medium → Standard alert, investigate today
Score > τ_low → Log for analysis, batch review
Score ≤ τ_low → Normal, no action
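A minimal sketch of such a graduated policy; the tier values and actions below are illustrative assumptions, not prescribed settings:

```python
from typing import List, Tuple

def triage(score: float, tiers: List[Tuple[float, str]]) -> str:
    """Map an anomaly score to a graduated action; tiers must be sorted most-severe first."""
    for tau, action in tiers:
        if score > tau:
            return action
    return "normal: no action"

# Illustrative tier thresholds; in practice each tau is calibrated (e.g., by target FPR)
tiers = [
    (0.95, "critical: page on-call / block transaction"),
    (0.85, "high: investigate within 1 hour"),
    (0.70, "medium: investigate today"),
    (0.50, "low: log for batch review"),
]
print(triage(0.91, tiers))  # -> high: investigate within 1 hour
```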
Benefits: critical issues get immediate attention, lower-severity signals are still captured for later review, and total alert volume stays within the team's capacity.
Ensemble Thresholds:
When using multiple models, combine their scores before thresholding rather than thresholding each model separately; rank-normalizing the scores so they are comparable and then averaging them is one common choice, as sketched below.
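A minimal sketch of rank-averaging, with assumed array shapes (rows = models, columns = samples) and illustrative data:

```python
import numpy as np

def rank_normalize(scores: np.ndarray) -> np.ndarray:
    """Map scores to [0, 1] by rank so scores from different models are comparable."""
    ranks = scores.argsort().argsort()
    return ranks / (len(scores) - 1)

def ensemble_scores(score_matrix: np.ndarray) -> np.ndarray:
    """Average rank-normalized scores across models (rows = models, columns = samples)."""
    return np.mean([rank_normalize(row) for row in score_matrix], axis=0)

rng = np.random.default_rng(0)
score_matrix = rng.gamma(2.0, 0.5, size=(3, 1000))   # three models scoring the same 1,000 samples
combined = ensemble_scores(score_matrix)
tau = np.percentile(combined, 99)                    # single threshold on the combined score
print(f"Ensemble threshold (99th percentile): {tau:.3f}")
```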
Contextual Thresholds:
Vary thresholds based on context, not just time: for example, per customer segment, device type, or service tier, so that behavior that is routine in one context does not mask genuine anomalies in another (see the sketch below).
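A minimal sketch of per-context thresholds; the context keys and the 100-sample warm-up are assumptions:

```python
from collections import defaultdict
import numpy as np

class ContextualThreshold:
    """Separate percentile threshold per context key (e.g., customer segment or device type)."""

    def __init__(self, percentile: float = 99, min_samples: int = 100):
        self.percentile = percentile
        self.min_samples = min_samples
        self.history = defaultdict(list)

    def update(self, context: str, score: float) -> None:
        self.history[context].append(score)

    def threshold(self, context: str) -> float:
        scores = self.history.get(context, [])
        if len(scores) < self.min_samples:
            return float("inf")          # too little data for this context: stay conservative
        return float(np.percentile(scores, self.percentile))

    def is_anomaly(self, context: str, score: float) -> bool:
        return score > self.threshold(context)
```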
We've comprehensively explored threshold selection—the critical bridge between anomaly scores and actionable decisions. From statistical foundations to operational deployment, threshold selection deserves as much attention as model design.
Module Complete:
You've now completed the study of One-Class Methods for Anomaly Detection. You understand the mathematical foundations of one-class modeling, kernel-based and deep learning approaches to anomaly scoring, and the threshold selection strategies covered on this page.
These tools form a complete toolkit for building robust, production-ready anomaly detection systems across diverse domains and data types.
Congratulations! You've mastered One-Class Methods for anomaly detection, from mathematical foundations through practical deployment. You can now design, implement, and operationalize anomaly detection systems using kernel methods, deep learning, and principled threshold selection. These skills position you to tackle real-world anomaly detection challenges in fraud, security, quality control, and beyond.