Consider two settings: Manhattan's dense urban core and a sparse rural area. A house one mile from its nearest neighbor in Manhattan is extremely isolated—a clear anomaly. The same one-mile distance in a rural area is perfectly normal.
This illustrates the fundamental flaw of pure distance-based methods: absolute distance is meaningless without context. A point with high mean distance might be an anomaly in a dense region or completely normal in a sparse region. Distance alone cannot distinguish these cases.
The Local Outlier Factor (LOF), introduced by Breunig et al. in 2000, revolutionized anomaly detection by introducing a deceptively simple but profound idea: compare each point's local density to the densities of its neighbors. Points that are significantly less dense than their surroundings are anomalous—regardless of absolute distance scales.
This density-relative approach enables detection of anomalies in datasets with clusters of varying densities, where pure distance-based methods systematically fail. LOF has become one of the most influential algorithms in anomaly detection, spawning numerous variants and remaining a go-to method two decades after its introduction.
By the end of this page, you will: (1) Understand the mathematical formulation of LOF and its components, (2) Master the geometric intuition behind local reachability density, (3) Analyze the properties that make LOF effective for local anomaly detection, (4) Implement LOF from scratch and understand scikit-learn's implementation, (5) Recognize LOF's limitations and when to choose alternative methods.
LOF is built from a hierarchy of carefully designed concepts. Understanding each component is essential for mastering the algorithm.
Component 1: k-Distance
The k-distance of a point $p$, denoted $d_k(p)$, is the distance from $p$ to its k-th nearest neighbor:
$$d_k(p) = d(p, o) \text{ where } o \text{ is the k-th nearest neighbor of } p$$
This defines a "neighborhood radius" for each point. The k-distance is larger for isolated points and smaller for points in dense regions.
Component 2: k-Distance Neighborhood
The k-distance neighborhood of $p$, denoted $N_k(p)$, is the set of points within distance $d_k(p)$ from $p$:
$$N_k(p) = \{q \in \mathcal{D} \setminus \{p\} : d(p, q) \leq d_k(p)\}$$
Important subtlety: $|N_k(p)| \geq k$ due to possible ties. If multiple points are exactly at the k-distance, all are included. This ensures the neighborhood always contains at least k points.
In discrete data or when using rounded coordinates, ties are common. If 5 points are equidistant at the k-th position, the neighborhood includes all 5. This causes |N_k(p)| to vary across points, which affects density calculations. The algorithm handles this naturally, but implementations must not assume exactly k neighbors.
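The tie behavior can be made concrete with a minimal NumPy sketch. The helper `k_distance_neighborhood` below is hypothetical (not part of any library) and exists only to show that the neighborhood may contain more than k points:

```python
import numpy as np

def k_distance_neighborhood(X, p_idx, k):
    """Return (k-distance, neighborhood indices) for point p, including ties.

    Hypothetical helper for illustration: the neighborhood can contain
    MORE than k points when several neighbors sit exactly at the k-distance.
    """
    dists = np.linalg.norm(X - X[p_idx], axis=1)
    dists[p_idx] = np.inf                 # exclude the point itself
    k_dist = np.sort(dists)[k - 1]        # distance to the k-th nearest neighbor
    neighborhood = np.where(dists <= k_dist)[0]
    return k_dist, neighborhood

# Four points equidistant from the origin: with k=2, all four are tied
X = np.array([[0.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
k_dist, nbrs = k_distance_neighborhood(X, 0, k=2)
print(k_dist, len(nbrs))  # k-distance is 1.0, |N_2(p)| = 4 > k
```

Here $|N_2(p)| = 4$ even though $k = 2$, which is exactly why the LRD and LOF formulas divide by $|N_k(p)|$ rather than by k.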
Component 3: Reachability Distance
The reachability distance of point $p$ with respect to point $o$ is:
$$\text{reach-dist}_k(p, o) = \max\{d_k(o), d(p, o)\}$$
This is the key innovation in LOF. The reachability distance is at least $d_k(o)$, even if $p$ is closer. This "smoothing" effect has critical implications:
Why This Matters:
The reachability distance prevents statistical fluctuations from overly affecting density estimates. Without this smoothing, minor variations in the positions of very close neighbors would cause large swings in density estimates. The reachability distance creates stability by imposing a minimum neighborhood scale based on the reference point's own neighborhood size.
Geometric Interpretation:
Think of each point $o$ as having a "zone of influence" defined by its k-distance. Any point $p$ within this zone is treated as if it were at the boundary. This reflects the uncertainty in localizing very close neighbors—we can't meaningfully distinguish distances smaller than the local scale.
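A tiny sketch of the definition makes the "zone of influence" tangible (the numbers are assumed for illustration):

```python
def reach_dist(d_po, k_dist_o):
    """reach-dist_k(p, o) = max(d_k(o), d(p, o)), per the definition above."""
    return max(k_dist_o, d_po)

# Suppose o's k-distance is 2.0. A very close p (actual distance 0.3)
# is pushed out to the zone boundary...
print(reach_dist(0.3, 2.0))  # 2.0 (smoothed)
# ...while a farther p keeps its actual distance.
print(reach_dist(3.5, 2.0))  # 3.5 (unchanged)
```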
Component 4: Local Reachability Density (LRD)
The local reachability density of point $p$ is the inverse of the average reachability distance from $p$ to its k-nearest neighbors:
$$\text{lrd}_k(p) = \left(\frac{\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)}{|N_k(p)|}\right)^{-1}$$
Equivalently:
$$\text{lrd}_k(p) = \frac{|N_k(p)|}{\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)}$$
Interpretation:
LRD acts as a local density estimate, but using reachability distances rather than raw distances. This provides the stability benefits discussed above.
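A worked numeric example, with assumed toy reachability distances, shows that LRD is just the reciprocal of the mean reachability distance:

```python
import numpy as np

# Assumed toy values: p has k=3 neighbors with reachability
# distances 1.0, 1.5, and 2.0.
reach = np.array([1.0, 1.5, 2.0])

# lrd(p) = |N_k(p)| / sum of reachability distances
lrd = len(reach) / reach.sum()
print(lrd)  # 3 / 4.5 ≈ 0.667
```

Smaller reachability distances (a tighter neighborhood) yield a larger LRD, matching the intuition of density.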
Key Properties of LRD:
| Component | Symbol | Formula | Interpretation |
|---|---|---|---|
| k-distance | $d_k(p)$ | Distance to k-th NN | Neighborhood radius |
| k-neighborhood | $N_k(p)$ | $\{q : d(p,q) \leq d_k(p)\}$ | Points within radius |
| Reachability dist. | $\text{reach-dist}_k(p,o)$ | $\max\{d_k(o), d(p,o)\}$ | Smoothed distance |
| Local reach. density | $\text{lrd}_k(p)$ | $|N_k(p)| / \sum \text{reach-dist}$ | Local density estimate |
| Local Outlier Factor | $\text{LOF}_k(p)$ | $\frac{\sum \text{lrd}(o)}{|N_k(p)| \cdot \text{lrd}(p)}$ | Relative outlierness |
Definition: Local Outlier Factor
The Local Outlier Factor of point $p$ is the ratio of the average LRD of $p$'s neighbors to $p$'s own LRD:
$$\text{LOF}_k(p) = \frac{\sum_{o \in N_k(p)} \text{lrd}_k(o)}{|N_k(p)| \cdot \text{lrd}_k(p)} = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\text{lrd}_k(o)}{\text{lrd}_k(p)}$$
This is the average ratio of neighbors' densities to the point's own density.
Interpretation:
- LOF ≈ 1: the point's density matches its neighbors'—an inlier.
- LOF > 1: the point is sparser than its neighbors; values well above 1 indicate strong anomalies.
- LOF < 1: the point is denser than its neighbors (e.g., deep in a cluster core).
The Brilliance of LOF:
LOF is inherently relative. A point in a sparse region (low absolute LRD) can still have LOF ≈ 1 if its neighbors are equally sparse. This is exactly what we want: anomalies are points that are sparse relative to their local context, not points that are simply far from everything.
For points deep inside a uniform-density cluster, LOF ≈ 1 regardless of cluster density. Points in a loose cluster have LOF ≈ 1. Points in a tight cluster have LOF ≈ 1. Only points at the boundary between density regions or truly isolated points have LOF significantly different from 1. This is the key insight that makes LOF work on multi-scale data.
Complete LOF Algorithm:
Algorithm: LOF(D, k)
Input: Dataset D, number of neighbors k
Output: LOF score for each point
1. For each point p ∈ D:
a. Find N_k(p) = k-nearest neighbors of p
b. Compute d_k(p) = distance to k-th neighbor
2. For each point p ∈ D:
For each neighbor o ∈ N_k(p):
Compute reach-dist_k(p, o) = max{d_k(o), d(p, o)}
3. For each point p ∈ D:
Compute lrd_k(p) = |N_k(p)| / Σ_{o ∈ N_k(p)} reach-dist_k(p, o)
4. For each point p ∈ D:
Compute LOF_k(p) = (1/|N_k(p)|) × Σ_{o ∈ N_k(p)} [lrd_k(o) / lrd_k(p)]
5. Return LOF scores
Computational Complexity: the dominant cost is the k-nearest-neighbor search. A naive all-pairs search requires $O(n^2)$ distance computations; spatial index structures (k-d trees, ball trees) reduce the queries to roughly $O(n \log n)$ in low dimensions. Steps 2–4 cost $O(nk)$ time, and storing the neighbor lists requires $O(nk)$ memory.
```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from typing import Tuple

def compute_lof(X: np.ndarray, k: int = 20) -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute Local Outlier Factor from scratch.

    Parameters:
    -----------
    X : np.ndarray of shape (n_samples, n_features)
        Input data
    k : int
        Number of neighbors (MinPts parameter)

    Returns:
    --------
    lof_scores : np.ndarray of shape (n_samples,)
        LOF score for each point
    lrd_values : np.ndarray of shape (n_samples,)
        Local reachability density for each point
    """
    n_samples = X.shape[0]

    # Step 1: Find k-nearest neighbors for each point.
    # We use k+1 and exclude self.
    nn = NearestNeighbors(n_neighbors=k + 1, algorithm='auto')
    nn.fit(X)
    distances, indices = nn.kneighbors(X)

    # Exclude self (first neighbor)
    distances = distances[:, 1:]
    indices = indices[:, 1:]

    # k-distance for each point (distance to k-th neighbor)
    k_distances = distances[:, -1]

    # Step 2: Compute reachability distances.
    # reach_distances[i, j] = reachability distance from point i to its j-th neighbor
    reach_distances = np.zeros_like(distances)
    for i in range(n_samples):
        for j in range(k):
            neighbor_idx = indices[i, j]
            # reach-dist(p, o) = max(k-distance(o), actual_distance(p, o))
            reach_distances[i, j] = max(
                k_distances[neighbor_idx],  # k-distance of neighbor
                distances[i, j]             # actual distance
            )

    # Step 3: Compute Local Reachability Density (LRD).
    # lrd(p) = k / sum of reachability distances to k neighbors
    lrd_values = np.zeros(n_samples)
    for i in range(n_samples):
        sum_reach_dist = np.sum(reach_distances[i, :])
        if sum_reach_dist > 0:
            lrd_values[i] = k / sum_reach_dist
        else:
            # All neighbors at the same position (edge case)
            lrd_values[i] = np.inf

    # Step 4: Compute LOF.
    # LOF(p) = average of (lrd of neighbor / lrd of p)
    lof_scores = np.zeros(n_samples)
    for i in range(n_samples):
        if lrd_values[i] == 0 or np.isinf(lrd_values[i]):
            lof_scores[i] = 1.0  # Default for degenerate cases
            continue
        neighbor_lrd_sum = 0.0
        for j in range(k):
            neighbor_idx = indices[i, j]
            neighbor_lrd_sum += lrd_values[neighbor_idx]
        avg_neighbor_lrd = neighbor_lrd_sum / k
        lof_scores[i] = avg_neighbor_lrd / lrd_values[i]

    return lof_scores, lrd_values

def identify_anomalies(X: np.ndarray, k: int = 20,
                       threshold: float = 1.5) -> np.ndarray:
    """
    Identify anomalies using LOF.

    Parameters:
    -----------
    X : np.ndarray
        Input data
    k : int
        Number of neighbors
    threshold : float
        LOF threshold (points with LOF > threshold are anomalies)

    Returns:
    --------
    labels : np.ndarray
        1 for anomaly, 0 for normal
    """
    lof_scores, _ = compute_lof(X, k)
    return (lof_scores > threshold).astype(int)
```

The mathematics of LOF can obscure its elegant geometric intuition. Let's build visual understanding through carefully constructed examples.
Scenario 1: Uniform Cluster
Imagine a tight cluster of 100 points uniformly distributed in a circle. For any interior point $p$:
- $p$'s reachability distances to its neighbors match the cluster's typical spacing;
- its neighbors' LRDs are essentially equal to $p$'s own LRD;
- so the density ratios average to 1 and LOF ≈ 1.

For boundary points, there's a slight increase in LOF (neighbors are partially "inward," toward denser regions), but typically LOF < 1.5.
Scenario 2: Two Clusters of Different Densities
Consider two clusters:
- Cluster A: dense, with typical neighbor spacing around 0.1;
- Cluster B: sparse, with typical neighbor spacing around 0.5.

For points deep inside each cluster, LRD reflects the local spacing, and each point's neighbors share that same spacing—so the density ratios stay near 1 and LOF ≈ 1 in both clusters.

The absolute densities differ by roughly 25x (in 2D, $(0.5/0.1)^2 = 25$), but LOF doesn't care—internal consistency is what matters!
This is LOF's killer feature. Pure distance-based methods would flag all of Cluster B as anomalous (high distances). LOF correctly identifies only the truly isolated points—those sparse relative to their surroundings, not just sparse in absolute terms.
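This behavior is easy to verify empirically. The sketch below (synthetic data, assumed parameters) builds two Gaussian clusters whose scales differ by 10x and checks that interior LOF scores stay near 1 in both:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
tight = rng.normal(loc=0.0, scale=0.1, size=(200, 2))    # dense cluster
loose = rng.normal(loc=10.0, scale=1.0, size=(200, 2))   # sparse cluster
X = np.vstack([tight, loose])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)
scores = -lof.negative_outlier_factor_  # higher = more anomalous

# Typical (median) LOF inside each cluster stays close to 1,
# despite the large difference in absolute density.
print(np.median(scores[:200]), np.median(scores[200:]))
```

A pure distance-based score would rank essentially all of the loose cluster above the tight one; LOF does not.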
Scenario 3: Point Between Clusters
Now place a point $p$ exactly between Clusters A and B: its k nearest neighbors lie at the edges of the two clusters, its reachability distances span the gap, and its LRD is therefore much lower than its neighbors'. LOF comes out well above 1, and the point is correctly flagged.
Scenario 4: Point Far from Everything
Place a point $p$ very far from all clusters: all of its k nearest neighbors come from the nearest cluster, $p$'s LRD is very low relative to theirs, and LOF is clearly above 1—though, counterintuitively, not always extreme.
This reveals a subtlety: LOF may give lower scores to globally isolated points than to locally isolated points. A point wedged between two tight clusters may have higher LOF than a point floating in empty space whose neighbors are also somewhat boundary points.
Scenario 5: Micro-cluster of Anomalies
Consider 3 anomalous points clustered together: if $k \leq 2$, the anomalies serve as each other's neighbors, their LRDs mutually agree, and LOF ≈ 1—the micro-cluster masks itself. Once $k$ exceeds the micro-cluster size, the neighborhoods reach into the normal data and LOF rises sharply. Choosing $k$ larger than the largest plausible anomaly cluster avoids this masking.
| Scenario | Point Location | LRD Relationship | Expected LOF |
|---|---|---|---|
| Uniform cluster | Interior | LRD(p) ≈ LRD(neighbors) | ≈ 1.0 |
| Uniform cluster | Boundary | LRD(p) < LRD(neighbors) | 1.0 - 1.3 |
| Dense cluster | Interior | LRD(p) high, neighbors high | ≈ 1.0 |
| Sparse cluster | Interior | LRD(p) low, neighbors low | ≈ 1.0 |
| Between clusters | Gap | LRD(p) << LRD(neighbors) | >> 1.0 |
| Isolated point | Far from all | LRD(p) very low | > 1.0 (variable) |
| Anomaly micro-cluster | Among anomalies | Mutual support possible | Depends on k |
LOF has been extensively analyzed. Understanding its theoretical properties helps practitioners know what to expect.
Property 1: Tightness Bound for Uniform Clusters
For points deep inside a cluster with uniform density, LOF converges to 1. Formally, for a point $p$ with all k neighbors inside a region of constant density:
$$\text{LOF}_k(p) \to 1 \text{ as the neighborhood becomes uniform}$$
Property 2: Lower Bound
LOF is bounded below: $$\text{LOF}_k(p) \geq \frac{1}{\max_{o \in N_k(p)} |N_k(o)|} \cdot \frac{\min_{o \in N_k(p)} \text{reach-dist}_k(\cdot, o)}{\text{reach-dist}_k(p, \cdot)}$$
While complex, this guarantees LOF > 0 for all finite datasets.
Property 3: Asymptotic Behavior
As $n \to \infty$ with $k = o(n)$: $$\text{LOF}_k(p) \xrightarrow{p} \frac{f(\text{neighbors})}{f(p)}$$
where $f$ is the true underlying density. LOF consistently estimates the density ratio in the limit.
Property 4: Sensitivity to k
The LOF score can change substantially with k:
- Small k: estimates are sensitive to local noise, and micro-clusters of anomalies can mask each other by serving as one another's neighbors.
- Large k: the estimate smooths over fine structure; small legitimate clusters may be flagged wholesale, and genuinely local anomalies can be missed.
The original LOF paper recommends examining LOF across a range of k values and considering points that are anomalous for many k values as more reliable detections.
Property 5: Not a True Probability
Despite being a ratio, LOF is not a probability and has no natural scale:
- scores are unbounded above and not confined to [0, 1];
- thresholds such as 1.5 or 2.0 are dataset-dependent heuristics, not calibrated probabilities;
- scores are not directly comparable across datasets or across different k values.
Property 6: Boundary Effects
Points at cluster boundaries systematically have LOF slightly above 1, because:
- their neighborhoods extend inward, toward the denser cluster interior;
- their neighbors' LRDs therefore slightly exceed their own;
- the averaged density ratio lands just above 1.

This inflation is mild (typically LOF < 1.5) but should be kept in mind when setting thresholds.
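The boundary effect can be checked empirically. This sketch (synthetic uniform disk, assumed parameters) compares mean LOF near the rim versus deep inside:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Sample 1000 points uniformly over the unit disk
rng = np.random.default_rng(2)
r = np.sqrt(rng.uniform(size=1000))       # sqrt for uniform area density
theta = rng.uniform(0, 2 * np.pi, 1000)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit_predict(X)
scores = -lof.negative_outlier_factor_

radius = np.linalg.norm(X, axis=1)
# Mean LOF near the boundary modestly exceeds the interior mean
print(scores[radius > 0.9].mean(), scores[radius < 0.5].mean())
```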
LOF's sensitivity to k is both a feature and a bug. There is no universally optimal k. Best practice: compute LOF for k ∈ {10, 15, 20, 30, 50}, and either (1) take the maximum LOF across k values, (2) average LOF scores, or (3) count how many k values flag each point. Points anomalous across many k values are more reliably anomalous.
```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from typing import List, Tuple

def multi_k_lof(X: np.ndarray,
                k_values: List[int] = [10, 15, 20, 30, 50],
                aggregation: str = 'max') -> Tuple[np.ndarray, np.ndarray]:
    """
    Compute LOF across multiple k values for robustness.

    Parameters:
    -----------
    X : np.ndarray
        Input data
    k_values : list of int
        k values to try
    aggregation : str
        How to combine scores: 'max', 'mean', 'vote'

    Returns:
    --------
    final_scores : np.ndarray
        Aggregated LOF scores
    all_scores : np.ndarray
        Matrix of LOF scores for each k (n_samples x len(k_values))
    """
    n_samples = X.shape[0]
    all_scores = np.zeros((n_samples, len(k_values)))

    for i, k in enumerate(k_values):
        if k >= n_samples:
            k = n_samples - 1
        lof = LocalOutlierFactor(
            n_neighbors=k,
            algorithm='auto',
            metric='euclidean',
            contamination='auto',
            novelty=False
        )
        lof.fit(X)
        # LOF returns negative scores (more negative = more anomalous).
        # Convert to positive (higher = more anomalous).
        all_scores[:, i] = -lof.negative_outlier_factor_

    # Aggregate across k values
    if aggregation == 'max':
        # Maximum LOF across k values
        final_scores = np.max(all_scores, axis=1)
    elif aggregation == 'mean':
        # Average LOF across k values
        final_scores = np.mean(all_scores, axis=1)
    elif aggregation == 'vote':
        # Count how often each point exceeds the threshold
        threshold = 1.5
        votes = np.sum(all_scores > threshold, axis=1)
        final_scores = votes / len(k_values)
    else:
        raise ValueError(f"Unknown aggregation: {aggregation}")

    return final_scores, all_scores

def lof_with_confidence(X: np.ndarray,
                        k_values: List[int] = [10, 15, 20, 30, 50],
                        threshold: float = 1.5) -> Tuple[np.ndarray, np.ndarray]:
    """
    LOF detection with confidence estimates based on k-stability.

    Returns:
    --------
    labels : np.ndarray
        1 for anomaly, 0 for normal
    confidence : np.ndarray
        Confidence in label (0 to 1)
    """
    _, all_scores = multi_k_lof(X, k_values, aggregation='max')

    # Count threshold crossings
    above_threshold = all_scores > threshold
    vote_fraction = np.mean(above_threshold, axis=1)

    # Labels based on majority voting
    labels = (vote_fraction >= 0.5).astype(int)

    # Confidence: how decisively does the majority vote?
    # 1.0 = all k values agree, 0.0 = split decision
    confidence = np.abs(vote_fraction - 0.5) * 2

    return labels, confidence
```

Scikit-learn provides a robust, optimized implementation of LOF. Understanding its nuances is essential for correct usage.
Key Parameters:
- `n_neighbors`: the k parameter (default: 20)
- `algorithm`: NN algorithm—'auto', 'ball_tree', 'kd_tree', 'brute'
- `metric`: distance metric (default: 'minkowski' with p=2, i.e., Euclidean)
- `contamination`: expected proportion of anomalies—affects the labeling threshold
- `novelty`: if True, can score new unseen data; if False, only training data

The Novelty Parameter:
This is the most misunderstood parameter:
novelty=False (default): Transductive mode. LOF is computed for training data only. fit_predict() returns labels. New data cannot be scored—calling predict() raises an error.
novelty=True: Inductive mode. Training data is assumed clean (no anomalies). fit() learns the normal pattern; predict() and score_samples() can be called on new data.
Warning: With novelty=True, if your training data contains anomalies, they contaminate the "normal" model!
The 'contamination' parameter does NOT affect LOF scores—only the threshold for labeling. LOF scores are computed purely from the data geometry. Contamination sets the percentile threshold: contamination=0.1 means the 10% of points with highest LOF are labeled anomalies. If your true contamination differs, labels will be wrong even if scores are correct.
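This score/label separation can be demonstrated directly. In the sketch below (synthetic data, assumed parameters), two contamination settings produce identical raw scores but different numbers of flagged points:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])  # one obvious outlier

lof_a = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_b = LocalOutlierFactor(n_neighbors=20, contamination=0.25)
labels_a = lof_a.fit_predict(X)
labels_b = lof_b.fit_predict(X)

# The raw LOF scores are identical for any contamination setting...
print(np.allclose(lof_a.negative_outlier_factor_,
                  lof_b.negative_outlier_factor_))
# ...only the number of points labeled -1 changes with the threshold.
print((labels_a == -1).sum(), (labels_b == -1).sum())
```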
```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler

# ============================================================
# Example 1: Basic LOF for outlier detection in training data
# ============================================================

# Generate synthetic data
np.random.seed(42)
n_normal = 200
n_anomalies = 20

# Normal data: two clusters
cluster1 = np.random.randn(n_normal // 2, 2) * 0.5 + np.array([0, 0])
cluster2 = np.random.randn(n_normal // 2, 2) * 0.5 + np.array([4, 4])
normal_data = np.vstack([cluster1, cluster2])

# Anomalies: scattered
anomalies = np.random.uniform(-2, 6, (n_anomalies, 2))

X = np.vstack([normal_data, anomalies])
y_true = np.array([0] * n_normal + [1] * n_anomalies)

# Fit LOF (transductive mode - novelty=False is default)
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1,  # Expected 10% anomalies
    algorithm='auto',
    metric='euclidean'
)

# fit_predict() is the main method for novelty=False
y_pred = lof.fit_predict(X)
# Returns -1 for anomalies, 1 for inliers (sklearn convention)
y_pred_binary = (y_pred == -1).astype(int)  # Convert to 0/1

# Access the LOF scores (negated for sklearn convention)
lof_scores = -lof.negative_outlier_factor_  # Higher = more anomalous

print(f"Detected {np.sum(y_pred_binary)} anomalies")
print(f"Score range: {lof_scores.min():.3f} to {lof_scores.max():.3f}")

# ============================================================
# Example 2: LOF for novelty detection (scoring new data)
# ============================================================

# Train on clean data only
X_train = normal_data  # Assume we know these are normal

# Fit LOF in novelty mode
lof_novelty = LocalOutlierFactor(
    n_neighbors=20,
    contamination='auto',  # 'auto' sets offset_ = -1.5 for the decision
    novelty=True           # Enable scoring of new data
)
lof_novelty.fit(X_train)

# Now we can score new data
X_test = np.vstack([
    np.random.randn(10, 2) * 0.5 + np.array([0, 0]),  # Normal
    np.array([[10, 10], [-5, -5], [2, -3]])           # Anomalies
])

# Get predictions (-1 = anomaly, 1 = normal)
test_predictions = lof_novelty.predict(X_test)

# Get raw scores (more negative = more anomalous)
test_scores = lof_novelty.score_samples(X_test)

# Get positive LOF values
test_lof = -lof_novelty.score_samples(X_test)

print(f"Novelty mode test predictions: {test_predictions}")
print(f"Test LOF scores: {test_lof}")

# ============================================================
# Example 3: Proper preprocessing and evaluation
# ============================================================

from sklearn.metrics import precision_recall_curve, average_precision_score

# Always normalize features for distance-based methods
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Compute LOF with multiple contamination settings to find the best
results = {}
for cont in [0.05, 0.1, 0.15, 0.2]:
    lof = LocalOutlierFactor(n_neighbors=20, contamination=cont)
    y_pred = lof.fit_predict(X_scaled)
    y_pred_binary = (y_pred == -1).astype(int)

    # Use raw scores for ranking (contamination only affects labels)
    scores = -lof.negative_outlier_factor_
    ap = average_precision_score(y_true, scores)

    results[cont] = {
        'predictions': y_pred_binary,
        'scores': scores,
        'average_precision': ap
    }
    print(f"Contamination {cont}: AP = {ap:.3f}")

# Best practice: use scores for ranking, not predictions
best_scores = results[0.1]['scores']  # Any contamination gives the same scores
precision, recall, thresholds = precision_recall_curve(y_true, best_scores)
print(f"Overall Average Precision: {average_precision_score(y_true, best_scores):.3f}")
```

LOF is powerful but not universal. Understanding its strengths and limitations enables informed method selection.
Core Strengths:
- Detects local anomalies that global, distance-based methods systematically miss.
- Scores are relative and roughly scale-free: LOF ≈ 1 means "normal" regardless of absolute density.
- Makes no distributional assumptions; works with any distance metric.
- Well-understood theory and two decades of practical validation.
When to Choose LOF:
✅ Moderate dimensionality (d < 20)
✅ Moderate dataset size (n < 50K)
✅ Known or expected density variation
✅ Need for interpretable scores
✅ Unsupervised setting (no labels)
✅ Batch processing acceptable
When to Consider Alternatives:
❌ High dimensionality → Isolation Forest
❌ Very large data → Approximate methods
❌ Streaming → Online algorithms
❌ Global outliers primary → Distance-based methods
❌ Very fast inference needed → Simpler methods
❌ Anomalies may cluster → Isolation Forest
LOF is the right first choice when you suspect density variation in your data and dataset size is manageable. Its computational cost is worth paying when you need reliable local anomaly detection. For large datasets or high dimensions, start with Isolation Forest and use LOF as a secondary check on flagged points.
We've provided an exhaustive treatment of the Local Outlier Factor algorithm—from mathematical foundations through practical implementation. LOF represents a paradigm shift from absolute to relative anomaly detection.
What's Next:
The LOCI algorithm (Local Correlation Integral) extends LOF's ideas with automatic k selection and statistical significance testing. Unlike LOF's fixed k, LOCI adapts neighborhood size based on local structure, providing parameter-free operation. LOCI is the subject of our next page.
You now possess comprehensive knowledge of the Local Outlier Factor—one of the most influential anomaly detection algorithms ever developed. This understanding prepares you for advanced density-based methods, LOF variants, and the challenge of scaling local anomaly detection to large datasets.