Distance Based Methods - Learning Module

Loading content...

0/278

LOCI Algorithm

Beyond Fixed Neighborhoods: Adaptive Local Outlier Detection

LOF requires a critical parameter: k, the number of neighbors. Choose k too small, and you're sensitive to noise. Choose k too large, and you miss local anomalies. Worse, the optimal k varies across different regions of the same dataset—a dense cluster might need k = 50 while a sparse region needs k = 5.

The Local Correlation Integral (LOCI) algorithm, introduced by Papadimitriou et al. in 2003, offers an elegant solution: automatic adaptation of neighborhood size based on local data structure. Rather than using a fixed k, LOCI examines outlier scores across a range of radii, determining for each point whether it deviates significantly from its neighbors at any scale.

LOCI also introduces a principled statistical framework for declaring anomalies. Instead of arbitrary thresholds like "LOF > 1.5", LOCI computes Multi-granularity Deviation Factor (MDEF) and its normalized form, using standard deviation-based cutoffs that have statistical interpretation.

The result is a more robust, parameter-free (or nearly so) approach to local outlier detection that automatically adjusts to local data characteristics.

Learning Objectives

By the end of this page, you will: (1) Understand LOCI's multi-granularity approach to neighborhood definition, (2) Master the MDEF and normalized MDEF scoring mechanisms, (3) Analyze the statistical basis for LOCI's anomaly thresholds, (4) Compare LOCI's adaptive approach to LOF's fixed-k method, and (5) Implement LOCI and understand when to prefer it over LOF.

Core Concepts: Counting Neighborhoods

LOCI replaces LOF's fixed-k neighborhood with radius-based neighborhoods, enabling examination of outlierness at multiple scales.

Definition: r-Neighborhood

The r-neighborhood of point $p$ is the set of all points within distance $r$:

$$N(p, r) = {q \in \mathcal{D} : d(p, q) \leq r}$$

The size of this neighborhood is the counting neighborhood:

$$n(p, r) = |N(p, r)|$$

Unlike LOF's k-neighborhood, the r-neighborhood size varies: dense regions have large $n(p, r)$, sparse regions have small $n(p, r)$.

Definition: Sampling Neighborhood

LOCI distinguishes between:

Counting neighborhood $N(p, r)$: Points used for the count
Sampling neighborhood $N(p, \alpha r)$: Points whose counts we average

The parameter $\alpha$ (typically 0.5) defines the ratio. For each point in the sampling neighborhood $N(p, \alpha r)$, we compute their counting neighborhood size $n(q, r)$.

The α Parameter Intuition

Why use αr < r for the sampling neighborhood? Points in N(p, αr) are 'closer' neighbors whose density we compare against. The counting radius r captures the local density scale. Using α = 0.5 means we compare p's density to that of points within half the counting radius—these are the truly local neighbors that define 'typical' density.

The Key Insight:

For a normal point $p$ in a region of roughly uniform density:

$p$'s counting neighborhood size $n(p, r)$ should be similar to
The average counting neighborhood size of $p$'s sampling neighbors

For an anomaly:

$p$'s counting neighborhood is sparse (low $n(p, r)$)
But $p$'s sampling neighbors are denser (high average $n(q, r)$)
This discrepancy signals anomalous behavior

Mathematical Framework:

Define the average counting neighborhood of $p$'s sampling neighbors:

$$\hat{n}(p, r, \alpha) = \frac{1}{n(p, \alpha r)} \sum_{q \in N(p, \alpha r)} n(q, r)$$

This is the expected neighborhood size for points similar to $p$ (within distance $\alpha r$).

For normal points: $n(p, r) \approx \hat{n}(p, r, \alpha)$ For anomalies: $n(p, r) << \hat{n}(p, r, \alpha)$

LOCI vs LOF: Neighborhood Definitions
Concept	LOF	LOCI
Neighborhood type	k-nearest neighbors	Radius-based
Size parameter	Fixed k	Radius r (variable)
Neighborhood size	Exactly k	Varies with local density
Comparison basis	lrd ratio	Count ratio (MDEF)
Multi-scale	Manual (try multiple k)	Automatic (scan radii)
Statistical significance	Heuristic threshold	Standard deviation-based

MDEF: Multi-Granularity Deviation Factor

LOCI's core innovation is the Multi-granularity Deviation Factor (MDEF), which quantifies how much a point's local density deviates from its neighbors' densities.

Definition: MDEF

$$\text{MDEF}(p, r, \alpha) = 1 - \frac{n(p, r)}{\hat{n}(p, r, \alpha)}$$

Alternatively written as:

$$\text{MDEF}(p, r, \alpha) = \frac{\hat{n}(p, r, \alpha) - n(p, r)}{\hat{n}(p, r, \alpha)}$$

Interpretation:

MDEF ≈ 0: Point $p$'s neighborhood size matches its neighbors' → normal
MDEF > 0: Point $p$ has fewer neighbors than expected → potentially anomalous
MDEF < 0: Point $p$ has more neighbors than expected → local density peak
MDEF → 1: Point $p$ is extremely sparse relative to neighbors → strong anomaly

Connection to LOF:

LOF computes the ratio of neighbors' densities to the point's density. MDEF computes the normalized difference of neighborhood sizes. These are related:

$$\text{MDEF} \approx 1 - \frac{1}{\text{LOF}}$$

High LOF corresponds to MDEF approaching 1; LOF ≈ 1 corresponds to MDEF ≈ 0.

The Problem with Raw MDEF:

Raw MDEF values are not directly comparable across different radii or different datasets. A MDEF of 0.3 might be highly significant in a low-variance region but completely normal in a high-variance region.

Solution: Normalized MDEF

To standardize MDEF, we compute the standard deviation of neighborhood sizes among sampling neighbors:

$$\sigma_{\hat{n}}(p, r, \alpha) = \sqrt{\frac{1}{n(p, \alpha r)} \sum_{q \in N(p, \alpha r)} \left(n(q, r) - \hat{n}(p, r, \alpha)\right)^2}$$

The normalized MDEF is:

$$\sigma\text{-MDEF}(p, r, \alpha) = \frac{\hat{n}(p, r, \alpha) - n(p, r)}{\sigma_{\hat{n}}(p, r, \alpha)}$$

This measures how many standard deviations $p$'s neighborhood count deviates from the local mean.

Statistical Interpretation:

Under the assumption that neighborhood sizes follow an approximately normal distribution:

$\sigma\text{-MDEF} > 3$: Point deviates by more than 3 standard deviations → statistically significant anomaly (p < 0.003)
$\sigma\text{-MDEF} > 2$: Moderate evidence of anomaly
$\sigma\text{-MDEF} < 1$: Within normal variation

The 3σ Rule in LOCI

LOCI's default threshold (σ-MDEF > 3) has clear statistical meaning: points more than 3 standard deviations from the local mean. Under normality, fewer than 0.3% of points would exceed this by chance. This provides principled, calibrated anomaly detection without arbitrary threshold tuning.

σ-MDEF Interpretation Guide
σ-MDEF Range	Statistical Meaning	Action
< 1	Within 1 std dev of mean	Normal
1 - 2	Mildly unusual	Monitor, not anomaly
2 - 3	Moderately unusual	Potential anomaly, investigate
3	More than 3 std dev	Strong anomaly signal
4	Extreme deviation	High-confidence anomaly

Multi-Granularity: Scanning Across Radii

The "multi-granularity" in LOCI refers to examining MDEF across a range of radii, not just a single fixed neighborhood size. This provides automatic adaptation to local structure.

The Multi-Scale Approach:

For each point $p$, LOCI computes MDEF for radii $r$ ranging from 0 to $r_{max}$:

$${\text{MDEF}(p, r, \alpha) : r \in [0, r_{max}]}$$

A point is flagged as anomalous if $\sigma\text{-MDEF}(p, r, \alpha) > k_{\sigma}$ (typically 3) for any radius $r$.

Why Multi-Granularity Matters:

Consider these scenarios:

Scenario 1: Dense local, sparse global A point in a tiny dense cluster within a large sparse region.

At small r: Many neighbors, MDEF low (matches local dense cluster)
At large r: Few neighbors relative to the larger region, MDEF may be high

Scenario 2: Sparse local, dense global An isolated point near the edge of a dense cluster.

At small r: Few neighbors, MDEF high (anomalous at this scale)
At large r: Many neighbors (cluster members), MDEF lower

Scenario 3: Anomalous at one scale only A bridge point between two clusters.

At the scale of the gap: Clearly anomalous
At larger scales: May appear normal

Multi-granularity catches all these cases by not committing to a single scale.

Computational Efficiency in LOCI

Naively computing MDEF for all radii is O(n³). The original LOCI paper introduces the 'aLOCI' (approximate LOCI) algorithm using quad-trees/octrees to achieve O(n²) average case. The key insight: neighborhood counts change only at discrete radii (when another point enters/exits), allowing event-driven computation.

The LOCI Curve:

For visualization and analysis, LOCI generates an "LOCI curve" for each point:

x-axis: radius $r$ (or equivalently, expected neighborhood count)
y-axis: $\sigma\text{-MDEF}(p, r, \alpha)$

The curve shows how anomalous the point appears at each scale.

Curve Interpretation:

Flat near zero: Point is consistently normal at all scales
Peak at specific r: Anomalous at that particular scale
Rising trend: Increasingly anomalous at larger scales (sparse region)
Falling trend: Less anomalous at larger scales (integrates with cluster)

Aggregating Across Scales:

The simplest aggregation: take the maximum $\sigma\text{-MDEF}$ across all radii:

$$\text{LOCI-score}(p) = \max_r \sigma\text{-MDEF}(p, r, \alpha)$$

Alternatives include:

Average $\sigma\text{-MDEF}$ across radii
Fraction of radii where $\sigma\text{-MDEF} > 3$
Area under the LOCI curve (AUC)

Discrete Approximation:

In practice, we sample radii at discrete intervals:

$$r_i = r_{min} + i \cdot \Delta r, \quad i = 0, 1, ..., \frac{r_{max} - r_{min}}{\Delta r}$$

Choosing these parameters:

$r_{min}$: Should be small enough to capture local structure. Common choice: median distance to nearest neighbor.
$r_{max}$: Should be large enough to include the global scale. Common choice: maximum pairwise distance or a large percentile.
$\Delta r$: Step size trading off accuracy vs. computation. Common: 20-50 radius values total.

An alternative approach (data-driven radii): Use the actual distances in the dataset as radii. For each point, the relevant radii are exactly the distances to other points. This avoids arbitrary discretization but increases computation.

The Complete LOCI Algorithm

Let us present the complete LOCI algorithm with all its components.

Algorithm: LOCI

Input: Dataset D, α (default 0.5), k_σ (default 3), radii R
Output: LOCI scores and anomaly labels

1. Preprocessing:
   Compute all pairwise distances: dist[i][j] = d(p_i, p_j)
   (Or use spatial index for range queries)

2. For each point p in D:
   Initialize max_sigma_mdef = 0
   
   For each radius r in R:
      a. Count n(p, r) = |{q : d(p,q) ≤ r}|
      b. Find sampling neighborhood: S = {q : d(p,q) ≤ αr}
      c. If |S| < nmin (typically 20): continue (skip this radius)
      
      d. For each q in S:
         Count n(q, r) = |{o : d(q,o) ≤ r}|
      
      e. Compute n_hat = mean of n(q, r) for q in S
      f. Compute σ_n = std of n(q, r) for q in S
      
      g. If σ_n > 0:
         σ-MDEF(p, r) = (n_hat - n(p, r)) / σ_n
      Else:
         σ-MDEF(p, r) = 0  (all neighbors have same count)
      
      h. Update: max_sigma_mdef = max(max_sigma_mdef, σ-MDEF)
   
   LOCI_score[p] = max_sigma_mdef

3. Label anomalies:
   For each point p:
      anomaly[p] = (LOCI_score[p] > k_σ)

4. Return LOCI_score, anomaly

loci_implementation.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
import numpy as np
from scipy.spatial.distance import cdist
from typing import Tuple, List
from dataclasses import dataclass
 
@dataclass
class LOCIResult:
    """Results from LOCI algorithm."""
    scores: np.ndarray           # Maximum σ-MDEF for each point
    labels: np.ndarray           # 1 = anomaly, 0 = normal
    mdef_curves: List[np.ndarray]  # MDEF at each radius for each point
 
 
def compute_loci(X: np.ndarray,
                 alpha: float = 0.5,
                 k_sigma: float = 3.0,
                 n_radii: int = 30,
                 min_neighbors: int = 20) -> LOCIResult:
    """
    Compute LOCI (Local Correlation Integral) anomaly scores.
    
    Parameters:
    -----------
    X : np.ndarray of shape (n_samples, n_features)
        Input data
    alpha : float
        Sampling neighborhood ratio (default 0.5)
    k_sigma : float
        Number of standard deviations for anomaly threshold (default 3)
    n_radii : int
        Number of radii to sample
    min_neighbors : int
        Minimum sampling neighborhood size
    
    Returns:
    --------
    LOCIResult containing scores, labels, and MDEF curves
    """
    n_samples = X.shape[0]
    
    # Compute all pairwise distances
    dist_matrix = cdist(X, X, metric='euclidean')
    
    # Determine radii to scan
    max_dist = np.max(dist_matrix)
    min_dist = np.min(dist_matrix[dist_matrix > 0])  # Smallest non-zero
    radii = np.linspace(min_dist, max_dist * 0.8, n_radii)
    
    # Initialize results
    max_sigma_mdef = np.zeros(n_samples)
    mdef_curves = [np.zeros(n_radii) for _ in range(n_samples)]
    
    for r_idx, r in enumerate(radii):
        sampling_r = alpha * r
        
        for p in range(n_samples):
            # Counting neighborhood of p
            n_p = np.sum(dist_matrix[p, :] <= r)
            
            # Sampling neighborhood of p
            sampling_mask = dist_matrix[p, :] <= sampling_r
            sampling_neighbors = np.where(sampling_mask)[0]
            
            if len(sampling_neighbors) < min_neighbors:
                mdef_curves[p][r_idx] = 0
                continue
            
            # Count neighbors for each point in sampling neighborhood
            neighbor_counts = []
            for q in sampling_neighbors:
                n_q = np.sum(dist_matrix[q, :] <= r)
                neighbor_counts.append(n_q)
            
            neighbor_counts = np.array(neighbor_counts)
            
            # Compute statistics
            n_hat = np.mean(neighbor_counts)
            sigma_n = np.std(neighbor_counts)
            
            # Compute σ-MDEF
            if sigma_n > 0:
                sigma_mdef = (n_hat - n_p) / sigma_n
            else:
                sigma_mdef = 0
            
            mdef_curves[p][r_idx] = sigma_mdef
            max_sigma_mdef[p] = max(max_sigma_mdef[p], sigma_mdef)
    
    # Determine anomaly labels
    labels = (max_sigma_mdef > k_sigma).astype(int)
    
    return LOCIResult(
        scores=max_sigma_mdef,
        labels=labels,
        mdef_curves=mdef_curves
    )
 
 
def plot_loci_curves(result: LOCIResult,
                     point_indices: List[int],
                     radii: np.ndarray,
                     k_sigma: float = 3.0):
    """
    Visualize LOCI curves for selected points.
    """
    import matplotlib.pyplot as plt
    
    fig, ax = plt.subplots(figsize=(10, 6))
    
    for idx in point_indices:
        label = f"Point {idx} (score={result.scores[idx]:.2f})"
        is_anomaly = result.labels[idx]
        style = '--' if is_anomaly else '-'
        ax.plot(radii, result.mdef_curves[idx], style, label=label, alpha=0.7)
    
    # Add threshold line
    ax.axhline(y=k_sigma, color='red', linestyle=':', label=f'Threshold (k_σ={k_sigma})')
    
    ax.set_xlabel('Radius r')
    ax.set_ylabel('σ-MDEF')
    ax.set_title('LOCI Curves: σ-MDEF vs Radius')
    ax.legend()
    ax.grid(True, alpha=0.3)
    
    plt.tight_layout()
    return fig

Computational Complexity:

Naive implementation: O(n² × |R|) where |R| is the number of radii
With precomputed distances: O(n² + n × |R| × average_neighborhood_size)
With spatial indexing (aLOCI): O(n² / log n) average case using quad-trees
Space: O(n²) for distance matrix, O(n × |R|) for MDEF curves

The aLOCI Optimization:

The approximate LOCI (aLOCI) uses quad-trees to approximate neighborhood counts:

Build a quad-tree (2D) or octree (3D) over the data
For each cell, precompute point counts
Range queries sum cell counts, with partial cells estimated
This gives O(n log n) query time at the cost of approximation

For high-dimensional data, kd-trees or ball trees can be used similarly.

LOCI vs LOF: A Detailed Comparison

LOCI and LOF share the same philosophical goal—detecting local outliers—but differ substantially in approach. Understanding these differences informs method selection.

Fundamental Differences:

Aspect	LOF	LOCI
Neighborhood	k nearest neighbors	All points within radius r
Parameter	Fixed k	Range of radii (mostly automatic)
Scoring	Density ratio	Normalized count deviation
Threshold	Heuristic (e.g., 1.5)	Statistical (3σ rule)
Multi-scale	Manual (run with multiple k)	Automatic (scan radii)
Interpretability	Density ratio	Standard deviations from mean

When to Choose LOCI

•Parameter-free requirement: LOCI needs only α and k_σ which have principled defaults
•Multi-scale anomalies: Points anomalous at specific scales are automatically detected
•Statistical interpretation: The 3σ threshold has clear meaning
•Diagnostic output: LOCI curves show why a point is anomalous
•Varying cluster sizes: LOCI adapts to different natural scales

When to Choose LOF

•Computational efficiency: LOF is faster (O(n log n) vs O(n²) for LOCI)
•Well-known k: When domain knowledge suggests specific k values
•Library availability: LOF is in scikit-learn; LOCI requires custom code
•Streaming data: LOF can be updated incrementally more easily
•Simple interpretation: LOF's density ratio is intuitive

Empirical Comparison:

Academic benchmarks have compared LOF and LOCI extensively:

Detection quality: Generally similar for well-tuned LOF vs LOCI. LOCI may slightly outperform on multi-scale datasets.
Robustness: LOCI is more robust to parameter settings. LOF with wrong k can fail completely; LOCI with suboptimal α degrades gracefully.
Computation time: LOF is typically 2-5x faster than naive LOCI. aLOCI closes the gap but adds implementation complexity.
Interpretability: LOCI's curves provide richer diagnostic information. LOF gives just a single score.

The Practical Reality:

Despite LOCI's theoretical advantages, LOF is far more commonly used in practice because:

It's available in popular libraries
It's faster
Multi-k LOF approaches provide similar robustness
The simpler formulation is easier to explain to stakeholders

LOCI is valuable when:

You need parameter-free operation
You want visual diagnostics (LOCI curves)
Your data has complex multi-scale structure

The Pragmatic Choice

Start with LOF (easier, faster, well-supported). If results are unsatisfactory or parameter tuning is problematic, investigate LOCI. The additional diagnostic information from LOCI curves can help you understand why certain points are flagged, which is valuable for debugging and stakeholder communication.

Practical Considerations and Extensions

Deploying LOCI in practice requires attention to several implementation details and potential extensions.

Parameter Guidelines:

The α parameter (sampling ratio):

Default: 0.5 (sampling neighborhood is half the counting radius)
Lower α (0.3): Tighter local comparison, more sensitive to local anomalies
Higher α (0.7): Broader comparison, smoother scores
Robustness: Results typically stable for α ∈ [0.3, 0.7]

The k_σ threshold:

Default: 3.0 (3 standard deviations)
Higher k_σ (3.5, 4.0): Fewer false positives, may miss subtle anomalies
Lower k_σ (2.5, 2.0): More detections, higher false positive rate
Statistical guidance: Use 3.0 unless you have specific precision/recall requirements

The minimum neighborhood size (n_min):

Purpose: Avoid undefined/unstable statistics for very small neighborhoods
Typical values: 10-30
Guideline: At least 10 points to compute meaningful mean and std
Impact: Larger n_min may miss very local anomalies

Scalability Strategies:

For Large Datasets (n > 10,000):

Sampling: Compute LOCI on a random sample, then score remaining points using learned thresholds.
Approximate methods (aLOCI): Use spatial trees for range count approximations.
Parallel computation: The outer loop over points is embarrassingly parallel.
Radius pruning: Skip radii where few points have non-trivial MDEF (require minimum variance).

For High Dimensions (d > 10):

Dimensionality reduction: Apply PCA or UMAP before LOCI.
Alternative metrics: Consider Manhattan distance or learned metrics.
Hybrid approaches: Use fast global methods (Isolation Forest) for initial filtering, then LOCI on suspicious points.

Extensions and Variants:

Weighted LOCI: Weight points by importance or recency.
Temporal LOCI: Extend to time series with time-aware distances.
Streaming LOCI: Maintain approximate neighborhood counts for online detection.
Ensemble LOCI: Combine with other detectors for robustness.

scalable_loci.py
Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
import numpy as np
from sklearn.neighbors import BallTree
from typing import Tuple
import multiprocessing as mp
from functools import partial
 
def fast_loci(X: np.ndarray,
              alpha: float = 0.5,
              k_sigma: float = 3.0,
              n_radii: int = 20,
              min_neighbors: int = 20,
              n_jobs: int = -1) -> Tuple[np.ndarray, np.ndarray]:
    """
    Scalable LOCI using BallTree for efficient range queries.
    
    Parameters:
    -----------
    X : np.ndarray
        Input data
    alpha : float
        Sampling neighborhood ratio
    k_sigma : float
        Threshold in standard deviations
    n_radii : int
        Number of radii to sample
    min_neighbors : int
        Minimum sampling neighborhood size
    n_jobs : int
        Number of parallel jobs (-1 for all cores)
    
    Returns:
    --------
    scores : np.ndarray
        Maximum σ-MDEF for each point
    labels : np.ndarray
        Anomaly labels (1 = anomaly)
    """
    n_samples = X.shape[0]
    
    # Build BallTree for efficient range queries
    tree = BallTree(X, metric='euclidean')
    
    # Determine radii from data distribution
    # Use k-nearest neighbor distances to set scale
    k_for_radius = min(50, n_samples - 1)
    dists, _ = tree.query(X, k=k_for_radius + 1)
    median_knn_dist = np.median(dists[:, -1])
    max_radius = np.percentile(dists[:, -1], 95) * 2
    
    radii = np.linspace(median_knn_dist * 0.5, max_radius, n_radii)
    
    def process_point(p_idx: int) -> float:
        """Compute max σ-MDEF for a single point."""
        max_sigma_mdef = 0.0
        
        for r in radii:
            # Count points within r of p
            n_p = tree.query_radius([X[p_idx]], r=r, count_only=True)[0]
            
            # Get sampling neighbors (within α·r)
            sampling_r = alpha * r
            sampling_idx = tree.query_radius([X[p_idx]], r=sampling_r)[0]
            
            if len(sampling_idx) < min_neighbors:
                continue
            
            # Count neighbors for each sampling neighbor
            neighbor_counts = tree.query_radius(
                X[sampling_idx], r=r, count_only=True
            )
            
            # Compute statistics
            n_hat = np.mean(neighbor_counts)
            sigma_n = np.std(neighbor_counts)
            
            if sigma_n > 0:
                sigma_mdef = (n_hat - n_p) / sigma_n
                max_sigma_mdef = max(max_sigma_mdef, sigma_mdef)
        
        return max_sigma_mdef
    
    # Parallel processing
    if n_jobs == -1:
        n_jobs = mp.cpu_count()
    
    if n_jobs > 1:
        with mp.Pool(n_jobs) as pool:
            scores = np.array(pool.map(process_point, range(n_samples)))
    else:
        scores = np.array([process_point(i) for i in range(n_samples)])
    
    labels = (scores > k_sigma).astype(int)
    
    return scores, labels
 
 
def hybrid_loci(X: np.ndarray,
                prefilter_contamination: float = 0.2,
                final_k_sigma: float = 3.0) -> np.ndarray:
    """
    Hybrid approach: fast pre-filter + detailed LOCI on candidates.
    
    Uses Isolation Forest to identify suspicious points, then
    applies full LOCI analysis only to those candidates.
    """
    from sklearn.ensemble import IsolationForest
    
    n_samples = X.shape[0]
    
    # Step 1: Fast pre-filter with Isolation Forest
    iso_forest = IsolationForest(
        contamination=prefilter_contamination,
        random_state=42,
        n_jobs=-1
    )
    prefilter_labels = iso_forest.fit_predict(X)
    candidates = np.where(prefilter_labels == -1)[0]
    
    print(f"Pre-filter identified {len(candidates)} candidates "
          f"({100*len(candidates)/n_samples:.1f}%)")
    
    # Step 2: Detailed LOCI on candidates
    final_labels = np.zeros(n_samples, dtype=int)
    
    if len(candidates) > 0:
        candidate_scores, candidate_labels = fast_loci(
            X[candidates],
            k_sigma=final_k_sigma
        )
        final_labels[candidates] = candidate_labels
    
    return final_labels

Summary: LOCI Algorithm

We've provided a comprehensive treatment of the LOCI algorithm, from its multi-granularity philosophy through efficient implementation strategies.

Key Takeaways

•LOCI uses radius-based neighborhoods: Unlike LOF's fixed k, LOCI examines all points within radius r, with r varying across scales.
•MDEF measures deviation: The Multi-granularity Deviation Factor compares a point's neighborhood size to its neighbors' average.
•σ-MDEF provides statistical significance: Normalizing by standard deviation gives interpretable thresholds (3σ = statistically significant).
•Multi-granularity is automatic: Scanning across radii detects anomalies at their natural scale without parameter tuning.
•LOCI curves enable diagnostics: Visualizing σ-MDEF vs radius reveals why and at what scale points are anomalous.
•Computational cost is higher: O(n²) complexity limits scalability; approximate methods and parallelization are essential for large data.

What's Next:

We've covered local outlier detection methods extensively. The next page addresses a fundamental challenge that affects all distance-based methods: the curse of dimensionality. As dimensionality increases, distances concentrate, nearest neighbor distinctions become meaningless, and detection performance degrades. Understanding and mitigating this phenomenon is essential for applying these methods to real-world high-dimensional data.

Page Complete

You now possess comprehensive knowledge of the LOCI algorithm—its theoretical foundations, statistical framework, and practical implementation. Whether you choose LOCI for its automatic multi-scale analysis or LOF for its simplicity, you now understand the full landscape of local outlier detection methods.