LOF requires a critical parameter: k, the number of neighbors. Choose k too small, and you're sensitive to noise. Choose k too large, and you miss local anomalies. Worse, the optimal k varies across different regions of the same dataset—a dense cluster might need k = 50 while a sparse region needs k = 5.
The Local Correlation Integral (LOCI) algorithm, introduced by Papadimitriou et al. in 2003, offers an elegant solution: automatic adaptation of neighborhood size based on local data structure. Rather than using a fixed k, LOCI examines outlier scores across a range of radii, determining for each point whether it deviates significantly from its neighbors at any scale.
LOCI also introduces a principled statistical framework for declaring anomalies. Instead of arbitrary thresholds like "LOF > 1.5", LOCI computes Multi-granularity Deviation Factor (MDEF) and its normalized form, using standard deviation-based cutoffs that have statistical interpretation.
The result is a more robust, parameter-free (or nearly so) approach to local outlier detection that automatically adjusts to local data characteristics.
By the end of this page, you will: (1) Understand LOCI's multi-granularity approach to neighborhood definition, (2) Master the MDEF and normalized MDEF scoring mechanisms, (3) Analyze the statistical basis for LOCI's anomaly thresholds, (4) Compare LOCI's adaptive approach to LOF's fixed-k method, and (5) Implement LOCI and understand when to prefer it over LOF.
LOCI replaces LOF's fixed-k neighborhood with radius-based neighborhoods, enabling examination of outlierness at multiple scales.
Definition: r-Neighborhood
The r-neighborhood of point $p$ is the set of all points within distance $r$:
$$N(p, r) = \{q \in \mathcal{D} : d(p, q) \leq r\}$$
The size of this neighborhood is the counting neighborhood:
$$n(p, r) = |N(p, r)|$$
Unlike LOF's k-neighborhood, the r-neighborhood size varies: dense regions have large $n(p, r)$, sparse regions have small $n(p, r)$.
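A minimal sketch of these definitions (the toy data and radius are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy dataset: a tight cluster near the origin plus one distant point
D = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])

dist = cdist(D, D)   # pairwise Euclidean distances
r = 0.5

# n(p, r) = |N(p, r)|: number of points within distance r (self included)
n = (dist <= r).sum(axis=1)
print(n)  # cluster points each count 4 neighbors; the isolated point counts 1
```

Note how the neighborhood size already signals isolation: the cluster points all count 4 neighbors, while the distant point counts only itself.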
Definition: Sampling Neighborhood
LOCI distinguishes between two neighborhoods: the counting neighborhood $N(p, r)$ of radius $r$, whose size $n(p, r)$ measures $p$'s local density, and the sampling neighborhood $N(p, \alpha r)$ of radius $\alpha r$, which supplies the comparison points.
The parameter $\alpha$ (typically 0.5) defines the ratio between the two radii. For each point $q$ in the sampling neighborhood $N(p, \alpha r)$, we compute its counting neighborhood size $n(q, r)$.
Why use αr < r for the sampling neighborhood? Points in N(p, αr) are 'closer' neighbors whose density we compare against. The counting radius r captures the local density scale. Using α = 0.5 means we compare p's density to that of points within half the counting radius—these are the truly local neighbors that define 'typical' density.
The Key Insight:
For a normal point $p$ in a region of roughly uniform density, $n(p, r)$ closely matches the neighborhood counts $n(q, r)$ of its sampling neighbors.
For an anomaly, $n(p, r)$ is much smaller than the typical $n(q, r)$ of its sampling neighbors: the point lies in a sparser region than the points surrounding it.
Mathematical Framework:
Define the average counting neighborhood of $p$'s sampling neighbors:
$$\hat{n}(p, r, \alpha) = \frac{1}{n(p, \alpha r)} \sum_{q \in N(p, \alpha r)} n(q, r)$$
This is the expected neighborhood size for points similar to $p$ (within distance $\alpha r$).
For normal points: $n(p, r) \approx \hat{n}(p, r, \alpha)$. For anomalies: $n(p, r) \ll \hat{n}(p, r, \alpha)$.
| Concept | LOF | LOCI |
|---|---|---|
| Neighborhood type | k-nearest neighbors | Radius-based |
| Size parameter | Fixed k | Radius r (variable) |
| Neighborhood size | Exactly k | Varies with local density |
| Comparison basis | lrd ratio | Count ratio (MDEF) |
| Multi-scale | Manual (try multiple k) | Automatic (scan radii) |
| Statistical significance | Heuristic threshold | Standard deviation-based |
LOCI's core innovation is the Multi-granularity Deviation Factor (MDEF), which quantifies how much a point's local density deviates from its neighbors' densities.
Definition: MDEF
$$\text{MDEF}(p, r, \alpha) = 1 - \frac{n(p, r)}{\hat{n}(p, r, \alpha)}$$
Alternatively written as:
$$\text{MDEF}(p, r, \alpha) = \frac{\hat{n}(p, r, \alpha) - n(p, r)}{\hat{n}(p, r, \alpha)}$$
Interpretation: MDEF ≈ 0 means $p$'s neighborhood count matches its sampling neighbors' counts (normal); MDEF approaching 1 means $p$ has far fewer neighbors than its peers (outlier); MDEF < 0 means $p$ is denser than its surroundings (e.g., near a cluster center).
Connection to LOF:
LOF computes the ratio of neighbors' densities to the point's density. MDEF computes the normalized difference of neighborhood sizes. These are related:
$$\text{MDEF} \approx 1 - \frac{1}{\text{LOF}}$$
High LOF corresponds to MDEF approaching 1; LOF ≈ 1 corresponds to MDEF ≈ 0.
The Problem with Raw MDEF:
Raw MDEF values are not directly comparable across different radii or different datasets. An MDEF of 0.3 might be highly significant in a low-variance region but completely normal in a high-variance region.
Solution: Normalized MDEF
To standardize MDEF, we compute the standard deviation of neighborhood sizes among sampling neighbors:
$$\sigma_{\hat{n}}(p, r, \alpha) = \sqrt{\frac{1}{n(p, \alpha r)} \sum_{q \in N(p, \alpha r)} \left(n(q, r) - \hat{n}(p, r, \alpha)\right)^2}$$
The normalized MDEF is:
$$\sigma\text{-MDEF}(p, r, \alpha) = \frac{\hat{n}(p, r, \alpha) - n(p, r)}{\sigma_{\hat{n}}(p, r, \alpha)}$$
This measures how many standard deviations $p$'s neighborhood count deviates from the local mean.
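To make the scoring concrete, here is a small deterministic sketch (the grid data, radius $r$, and $\alpha$ are illustrative; the minimum-neighborhood check is omitted for brevity):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Deterministic toy data: a dense 11 x 11 grid plus one fringe point
xs = np.arange(0, 2.01, 0.2)
grid = np.array([(x, y) for x in xs for y in xs])
X = np.vstack([grid, [[2.5, 1.0]]])
p = len(X) - 1                      # index of the fringe point

dist = cdist(X, X)
r, alpha = 1.5, 0.5

def n_count(i, radius):
    """n(i, radius): neighborhood count, self included."""
    return int((dist[i] <= radius).sum())

# Sampling neighborhood N(p, alpha*r) and its members' counting neighborhoods
S = np.where(dist[p] <= alpha * r)[0]
counts = np.array([n_count(q, r) for q in S])

n_p = n_count(p, r)
n_hat = counts.mean()               # average neighbor count among peers
sigma_n = counts.std()

mdef = 1 - n_p / n_hat
sigma_mdef = (n_hat - n_p) / sigma_n
print(f"n(p,r)={n_p}, n_hat={n_hat:.1f}, MDEF={mdef:.2f}, sigma-MDEF={sigma_mdef:.2f}")
```

The fringe point has markedly fewer counting neighbors than the grid points inside its sampling neighborhood, so both MDEF and σ-MDEF come out clearly positive.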
Statistical Interpretation:
Under the assumption that neighborhood sizes among sampling neighbors follow an approximately normal distribution, σ-MDEF behaves like a z-score.
LOCI's default threshold (σ-MDEF > 3) has clear statistical meaning: points more than 3 standard deviations from the local mean. Under normality, fewer than 0.3% of points would exceed this by chance. This provides principled, calibrated anomaly detection without arbitrary threshold tuning.
| σ-MDEF Range | Statistical Meaning | Action |
|---|---|---|
| < 1 | Within 1 std dev of mean | Normal |
| 1 - 2 | Mildly unusual | Monitor, not anomaly |
| 2 - 3 | Moderately unusual | Potential anomaly, investigate |
| > 3 | More than 3 std dev | Strong anomaly signal |
| > 4 | Extreme deviation | High-confidence anomaly |
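The "fewer than 0.3%" figure quoted above is just a normal tail probability, easy to verify:

```python
from scipy.stats import norm

# One-sided tail: P(Z > 3) under a standard normal
p_one_sided = norm.sf(3)        # ~0.00135
# Two-sided tail: P(|Z| > 3), the classic "3-sigma" bound
p_two_sided = 2 * norm.sf(3)    # ~0.0027

print(f"P(Z > 3) = {p_one_sided:.5f}, P(|Z| > 3) = {p_two_sided:.5f}")
```

Since σ-MDEF > 3 flags only unusually sparse points, the relevant one-sided rate is even smaller: under normality, fewer than 0.14% of points would be flagged by chance.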
The "multi-granularity" in LOCI refers to examining MDEF across a range of radii, not just a single fixed neighborhood size. This provides automatic adaptation to local structure.
The Multi-Scale Approach:
For each point $p$, LOCI computes MDEF for radii $r$ ranging from 0 to $r_{max}$:
$$\{\text{MDEF}(p, r, \alpha) : r \in [0, r_{max}]\}$$
A point is flagged as anomalous if $\sigma\text{-MDEF}(p, r, \alpha) > k_{\sigma}$ (typically 3) for any radius $r$.
Why Multi-Granularity Matters:
Consider these scenarios:
Scenario 1: Dense local, sparse global A point in a tiny dense cluster within a large sparse region.
Scenario 2: Sparse local, dense global An isolated point near the edge of a dense cluster.
Scenario 3: Anomalous at one scale only A bridge point between two clusters.
Multi-granularity catches all these cases by not committing to a single scale.
Naively computing MDEF for all radii is O(n³). The original LOCI paper introduces the 'aLOCI' (approximate LOCI) algorithm using quad-trees/octrees to achieve O(n²) average case. The key insight: neighborhood counts change only at discrete radii (when another point enters/exits), allowing event-driven computation.
The LOCI Curve:
For visualization and analysis, LOCI generates a "LOCI curve" for each point: a plot of σ-MDEF against the radius $r$.
The curve shows how anomalous the point appears at each scale.
Curve Interpretation: a curve that stays near 0 indicates a point that looks normal at every scale; a peak above the threshold reveals the radius at which the point is anomalous; a broad elevated region indicates outlierness across many scales.
Aggregating Across Scales:
The simplest aggregation: take the maximum $\sigma\text{-MDEF}$ across all radii:
$$\text{LOCI-score}(p) = \max_r \sigma\text{-MDEF}(p, r, \alpha)$$
Alternatives include averaging σ-MDEF over the radii, or counting the number of radii at which the threshold is exceeded.
Discrete Approximation:
In practice, we sample radii at discrete intervals:
$$r_i = r_{min} + i \cdot \Delta r, \quad i = 0, 1, ..., \frac{r_{max} - r_{min}}{\Delta r}$$
Choosing these parameters: $r_{min}$ is typically set near the smallest non-zero pairwise distance, $r_{max}$ near the dataset's diameter (or a high percentile of pairwise distances), and $\Delta r$ fine enough that neighborhood counts change by only a few points between consecutive radii.
An alternative approach (data-driven radii): Use the actual distances in the dataset as radii. For each point, the relevant radii are exactly the distances to other points. This avoids arbitrary discretization but increases computation.
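Both radius-selection strategies can be sketched in a few lines (toy data; the `n_radii` value and the 0.8 scaling are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
X = rng.normal(size=(40, 2))

pairwise = pdist(X)   # condensed vector of all pairwise distances

# Strategy 1: uniform grid of radii between r_min and r_max
r_min, r_max, n_radii = pairwise.min(), 0.8 * pairwise.max(), 30
uniform_radii = np.linspace(r_min, r_max, n_radii)

# Strategy 2: data-driven radii -- neighborhood counts only change at
# actual pairwise distances, so those are the only radii that matter
data_radii = np.unique(pairwise)

print(len(uniform_radii), len(data_radii))  # 30 vs up to n*(n-1)/2 = 780
```

The data-driven grid is exact but grows quadratically with the dataset size, which is the computational cost the text mentions.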
Let us present the complete LOCI algorithm with all its components.
Algorithm: LOCI
Input: Dataset D, α (default 0.5), k_σ (default 3), radii R
Output: LOCI scores and anomaly labels
1. Preprocessing:
Compute all pairwise distances: dist[i][j] = d(p_i, p_j)
(Or use spatial index for range queries)
2. For each point p in D:
Initialize max_sigma_mdef = 0
For each radius r in R:
a. Count n(p, r) = |{q : d(p,q) ≤ r}|
b. Find sampling neighborhood: S = {q : d(p,q) ≤ αr}
c. If |S| < n_min (typically 20): continue (skip this radius)
d. For each q in S:
Count n(q, r) = |{o : d(q,o) ≤ r}|
e. Compute n_hat = mean of n(q, r) for q in S
f. Compute σ_n = std of n(q, r) for q in S
g. If σ_n > 0:
σ-MDEF(p, r) = (n_hat - n(p, r)) / σ_n
Else:
σ-MDEF(p, r) = 0 (all neighbors have same count)
h. Update: max_sigma_mdef = max(max_sigma_mdef, σ-MDEF)
LOCI_score[p] = max_sigma_mdef
3. Label anomalies:
For each point p:
anomaly[p] = (LOCI_score[p] > k_σ)
4. Return LOCI_score, anomaly
```python
import numpy as np
from scipy.spatial.distance import cdist
from typing import List
from dataclasses import dataclass


@dataclass
class LOCIResult:
    """Results from LOCI algorithm."""
    scores: np.ndarray               # Maximum σ-MDEF for each point
    labels: np.ndarray               # 1 = anomaly, 0 = normal
    mdef_curves: List[np.ndarray]    # σ-MDEF at each radius for each point


def compute_loci(X: np.ndarray,
                 alpha: float = 0.5,
                 k_sigma: float = 3.0,
                 n_radii: int = 30,
                 min_neighbors: int = 20) -> LOCIResult:
    """
    Compute LOCI (Local Correlation Integral) anomaly scores.

    Parameters
    ----------
    X : np.ndarray of shape (n_samples, n_features)
        Input data
    alpha : float
        Sampling neighborhood ratio (default 0.5)
    k_sigma : float
        Number of standard deviations for anomaly threshold (default 3)
    n_radii : int
        Number of radii to sample
    min_neighbors : int
        Minimum sampling neighborhood size

    Returns
    -------
    LOCIResult containing scores, labels, and MDEF curves
    """
    n_samples = X.shape[0]

    # Compute all pairwise distances
    dist_matrix = cdist(X, X, metric='euclidean')

    # Determine radii to scan
    max_dist = np.max(dist_matrix)
    min_dist = np.min(dist_matrix[dist_matrix > 0])  # smallest non-zero
    radii = np.linspace(min_dist, max_dist * 0.8, n_radii)

    # Initialize results
    max_sigma_mdef = np.zeros(n_samples)
    mdef_curves = [np.zeros(n_radii) for _ in range(n_samples)]

    for r_idx, r in enumerate(radii):
        sampling_r = alpha * r

        for p in range(n_samples):
            # Counting neighborhood of p
            n_p = np.sum(dist_matrix[p, :] <= r)

            # Sampling neighborhood of p
            sampling_mask = dist_matrix[p, :] <= sampling_r
            sampling_neighbors = np.where(sampling_mask)[0]

            if len(sampling_neighbors) < min_neighbors:
                mdef_curves[p][r_idx] = 0
                continue

            # Count neighbors for each point in the sampling neighborhood
            neighbor_counts = np.array(
                [np.sum(dist_matrix[q, :] <= r) for q in sampling_neighbors]
            )

            # Compute statistics
            n_hat = np.mean(neighbor_counts)
            sigma_n = np.std(neighbor_counts)

            # Compute σ-MDEF
            if sigma_n > 0:
                sigma_mdef = (n_hat - n_p) / sigma_n
            else:
                sigma_mdef = 0  # all neighbors have the same count

            mdef_curves[p][r_idx] = sigma_mdef
            max_sigma_mdef[p] = max(max_sigma_mdef[p], sigma_mdef)

    # Determine anomaly labels
    labels = (max_sigma_mdef > k_sigma).astype(int)

    return LOCIResult(
        scores=max_sigma_mdef,
        labels=labels,
        mdef_curves=mdef_curves,
    )


def plot_loci_curves(result: LOCIResult,
                     point_indices: List[int],
                     radii: np.ndarray,
                     k_sigma: float = 3.0):
    """Visualize LOCI curves for selected points."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(10, 6))

    for idx in point_indices:
        label = f"Point {idx} (score={result.scores[idx]:.2f})"
        is_anomaly = result.labels[idx]
        style = '--' if is_anomaly else '-'
        ax.plot(radii, result.mdef_curves[idx], style, label=label, alpha=0.7)

    # Add threshold line
    ax.axhline(y=k_sigma, color='red', linestyle=':',
               label=f'Threshold (k_σ={k_sigma})')

    ax.set_xlabel('Radius r')
    ax.set_ylabel('σ-MDEF')
    ax.set_title('LOCI Curves: σ-MDEF vs Radius')
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    return fig
```

Computational Complexity: this implementation requires O(n²) space for the distance matrix and roughly O(n² · |R|) time, since each of the |R| radii triggers neighbor counts for every point and its sampling neighbors.
The aLOCI Optimization:
The approximate LOCI (aLOCI) uses quad-trees to approximate neighborhood counts: instead of exact range queries, a point's neighbor count at radius $r$ is estimated from the occupancy of the tree cell whose side length matches $r$.
For high-dimensional data, kd-trees or ball trees can be used similarly.
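A minimal sketch of that core idea, approximating neighborhood counts from the occupancy of grid cells rather than exact range queries. This is a deliberate simplification of the paper's multi-level quad-tree: a single grid level, with an illustrative cell size, so the approximation is coarse.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 2))

cell = 1.0                                 # grid cell side ~ target radius
keys = [tuple(k) for k in np.floor(X / cell).astype(int)]
occupancy = Counter(keys)                  # number of points per grid cell

# Approximate n(p, r ~ cell): the occupancy of p's cell stands in for an
# exact range count, as aLOCI's quad-tree cell counts do
approx_counts = np.array([occupancy[k] for k in keys])

# Exact counts for comparison (brute force -- small data only)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
exact_counts = (d2 <= cell ** 2).sum(1)

print(f"mean approx={approx_counts.mean():.1f}, "
      f"mean exact={exact_counts.mean():.1f}")
```

A single level systematically under- or over-counts depending on where a point falls inside its cell; aLOCI compensates by maintaining counts at multiple grid resolutions and shifted grids.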
LOCI and LOF share the same philosophical goal—detecting local outliers—but differ substantially in approach. Understanding these differences informs method selection.
Fundamental Differences:
| Aspect | LOF | LOCI |
|---|---|---|
| Neighborhood | k nearest neighbors | All points within radius r |
| Parameter | Fixed k | Range of radii (mostly automatic) |
| Scoring | Density ratio | Normalized count deviation |
| Threshold | Heuristic (e.g., 1.5) | Statistical (3σ rule) |
| Multi-scale | Manual (run with multiple k) | Automatic (scan radii) |
| Interpretability | Density ratio | Standard deviations from mean |
Empirical Comparison:
Academic benchmarks have compared LOF and LOCI extensively:
Detection quality: Generally similar for well-tuned LOF vs LOCI. LOCI may slightly outperform on multi-scale datasets.
Robustness: LOCI is more robust to parameter settings. LOF with wrong k can fail completely; LOCI with suboptimal α degrades gracefully.
Computation time: LOF is typically 2-5x faster than naive LOCI. aLOCI closes the gap but adds implementation complexity.
Interpretability: LOCI's curves provide richer diagnostic information. LOF gives just a single score.
The Practical Reality:
Despite LOCI's theoretical advantages, LOF is far more commonly used in practice: it ships in mainstream libraries such as scikit-learn, it is faster on typical workloads, and its single-score output is simpler to integrate into pipelines.
LOCI is valuable when the right k for LOF is unclear, when the data contains structure at multiple scales, or when the per-point diagnostic curves justify the extra computation.
Start with LOF (easier, faster, well-supported). If results are unsatisfactory or parameter tuning is problematic, investigate LOCI. The additional diagnostic information from LOCI curves can help you understand why certain points are flagged, which is valuable for debugging and stakeholder communication.
Deploying LOCI in practice requires attention to several implementation details and potential extensions.
Parameter Guidelines:
The α parameter (sampling ratio): the default of 0.5 works well in most settings; smaller values give more local comparisons but shrink the sampling neighborhood.
The k_σ threshold: 3 is the statistically motivated default; lowering it toward 2 increases recall at the cost of more false positives.
The minimum neighborhood size (n_min): roughly 20 sampling neighbors are needed for the mean and standard deviation estimates to be stable; radii yielding fewer are skipped.
Scalability Strategies:
For Large Datasets (n > 10,000):
Sampling: Compute LOCI on a random sample, then score remaining points using learned thresholds.
Approximate methods (aLOCI): Use spatial trees for range count approximations.
Parallel computation: The outer loop over points is embarrassingly parallel.
Radius pruning: Skip radii where few points have non-trivial MDEF (require minimum variance).
For High Dimensions (d > 10):
Dimensionality reduction: Apply PCA or UMAP before LOCI.
Alternative metrics: Consider Manhattan distance or learned metrics.
Hybrid approaches: Use fast global methods (Isolation Forest) for initial filtering, then LOCI on suspicious points.
Extensions and Variants: the aLOCI approximation described earlier is the most prominent; the implementation below combines several of the scalability strategies just discussed.
```python
import numpy as np
from sklearn.neighbors import BallTree
from typing import Tuple
from concurrent.futures import ThreadPoolExecutor
import multiprocessing as mp


def fast_loci(X: np.ndarray,
              alpha: float = 0.5,
              k_sigma: float = 3.0,
              n_radii: int = 20,
              min_neighbors: int = 20,
              n_jobs: int = -1) -> Tuple[np.ndarray, np.ndarray]:
    """
    Scalable LOCI using BallTree for efficient range queries.

    Parameters
    ----------
    X : np.ndarray
        Input data
    alpha : float
        Sampling neighborhood ratio
    k_sigma : float
        Threshold in standard deviations
    n_radii : int
        Number of radii to sample
    min_neighbors : int
        Minimum sampling neighborhood size
    n_jobs : int
        Number of parallel jobs (-1 for all cores)

    Returns
    -------
    scores : np.ndarray
        Maximum σ-MDEF for each point
    labels : np.ndarray
        Anomaly labels (1 = anomaly)
    """
    n_samples = X.shape[0]

    # Build BallTree for efficient range queries
    tree = BallTree(X, metric='euclidean')

    # Determine radii from the data distribution:
    # use k-nearest neighbor distances to set the scale
    k_for_radius = min(50, n_samples - 1)
    dists, _ = tree.query(X, k=k_for_radius + 1)
    median_knn_dist = np.median(dists[:, -1])
    max_radius = np.percentile(dists[:, -1], 95) * 2
    radii = np.linspace(median_knn_dist * 0.5, max_radius, n_radii)

    def process_point(p_idx: int) -> float:
        """Compute max σ-MDEF for a single point."""
        max_sigma_mdef = 0.0
        for r in radii:
            # Count points within r of p
            n_p = tree.query_radius([X[p_idx]], r=r, count_only=True)[0]

            # Get sampling neighbors (within α·r)
            sampling_r = alpha * r
            sampling_idx = tree.query_radius([X[p_idx]], r=sampling_r)[0]

            if len(sampling_idx) < min_neighbors:
                continue

            # Count neighbors for each sampling neighbor
            neighbor_counts = tree.query_radius(
                X[sampling_idx], r=r, count_only=True
            )

            # Compute statistics
            n_hat = np.mean(neighbor_counts)
            sigma_n = np.std(neighbor_counts)

            if sigma_n > 0:
                sigma_mdef = (n_hat - n_p) / sigma_n
                max_sigma_mdef = max(max_sigma_mdef, sigma_mdef)

        return max_sigma_mdef

    # Parallel processing. Note: process_point is a closure, which cannot
    # be pickled for multiprocessing.Pool, so we map with threads (or fall
    # back to a serial loop).
    if n_jobs == -1:
        n_jobs = mp.cpu_count()
    if n_jobs > 1:
        with ThreadPoolExecutor(max_workers=n_jobs) as executor:
            scores = np.array(list(executor.map(process_point,
                                                range(n_samples))))
    else:
        scores = np.array([process_point(i) for i in range(n_samples)])

    labels = (scores > k_sigma).astype(int)
    return scores, labels


def hybrid_loci(X: np.ndarray,
                prefilter_contamination: float = 0.2,
                final_k_sigma: float = 3.0) -> np.ndarray:
    """
    Hybrid approach: fast pre-filter + detailed LOCI on candidates.

    Uses Isolation Forest to identify suspicious points, then applies
    full LOCI analysis only to those candidates.
    """
    from sklearn.ensemble import IsolationForest

    n_samples = X.shape[0]

    # Step 1: Fast pre-filter with Isolation Forest
    iso_forest = IsolationForest(
        contamination=prefilter_contamination,
        random_state=42,
        n_jobs=-1,
    )
    prefilter_labels = iso_forest.fit_predict(X)
    candidates = np.where(prefilter_labels == -1)[0]

    print(f"Pre-filter identified {len(candidates)} candidates "
          f"({100 * len(candidates) / n_samples:.1f}%)")

    # Step 2: Detailed LOCI on candidates
    final_labels = np.zeros(n_samples, dtype=int)
    if len(candidates) > 0:
        candidate_scores, candidate_labels = fast_loci(
            X[candidates], k_sigma=final_k_sigma
        )
        final_labels[candidates] = candidate_labels

    return final_labels
```

We've provided a comprehensive treatment of the LOCI algorithm, from its multi-granularity philosophy through efficient implementation strategies.
What's Next:
We've covered local outlier detection methods extensively. The next page addresses a fundamental challenge that affects all distance-based methods: the curse of dimensionality. As dimensionality increases, distances concentrate, nearest neighbor distinctions become meaningless, and detection performance degrades. Understanding and mitigating this phenomenon is essential for applying these methods to real-world high-dimensional data.
You now possess comprehensive knowledge of the LOCI algorithm—its theoretical foundations, statistical framework, and practical implementation. Whether you choose LOCI for its automatic multi-scale analysis or LOF for its simplicity, you now understand the full landscape of local outlier detection methods.